# Corpus Ingestion Operations Guide

**Version:** 1.0
**Sprint:** SPRINT_20260105_001_002_BINDEX
**Status:** Implementation Complete - Operational Execution Pending

## Overview

This guide describes how to run corpus ingestion operations, which populate the function behavior corpus with fingerprints from known library functions.

## Prerequisites

- StellaOps.BinaryIndex.Corpus library built and deployed
- PostgreSQL database with corpus schema (see `docs/db/schemas/corpus.sql`)
- Network access to package mirrors (or local package cache)
- Sufficient disk space (~100GB for full corpus)
- Required tools:
  - .NET 10 runtime
  - HTTP client access to package repositories

## Implementation Status

**CORP-015, CORP-016, CORP-017: Implementation COMPLETE**

All corpus connector implementations are complete and build successfully:

- ✓ GlibcCorpusConnector (GNU C Library)
- ✓ OpenSslCorpusConnector (OpenSSL)
- ✓ ZlibCorpusConnector (zlib)
- ✓ CurlCorpusConnector (libcurl)

**Status:** Code implementation is done. These tasks require **operational execution** to download and ingest real package data.

## Running Corpus Ingestion

### 1. Configure Package Sources

Set up access to package mirrors in your configuration:

```yaml
# config/corpus-ingestion.yaml
packageSources:
  debian:
    mirrorUrl: "http://deb.debian.org/debian"
    distributions: ["bullseye", "bookworm"]
    components: ["main"]
  ubuntu:
    mirrorUrl: "http://archive.ubuntu.com/ubuntu"
    distributions: ["focal", "jammy"]
  alpine:
    mirrorUrl: "https://dl-cdn.alpinelinux.org/alpine"
    versions: ["v3.18", "v3.19"]
```

### 2. Environment Variables

```bash
# Database connection
export STELLAOPS_CORPUS_DB="Host=localhost;Database=stellaops;Username=corpus_user;Password=..."

# Package cache directory (optional)
export STELLAOPS_PACKAGE_CACHE="/var/cache/stellaops/packages"

# Concurrent workers
export STELLAOPS_INGESTION_WORKERS=4
```

### 3. Execute Ingestion (CLI)

```bash
# Ingest specific library version
stellaops corpus ingest --library glibc --version 2.31 --architectures x86_64,aarch64

# Ingest version range
stellaops corpus ingest --library openssl --version-range "1.1.0..1.1.1" --architectures x86_64

# Ingest from local binary
stellaops corpus ingest-binary --library glibc --version 2.31 --arch x86_64 --path /usr/lib/x86_64-linux-gnu/libc.so.6

# Full ingestion job (all configured libraries)
stellaops corpus ingest-full --config config/corpus-ingestion.yaml
```

### 4. Execute Ingestion (Programmatic)

```csharp
using Microsoft.Extensions.DependencyInjection; // GetRequiredService<T>() extension
using StellaOps.BinaryIndex.Corpus;
using StellaOps.BinaryIndex.Corpus.Connectors;

// Setup
var serviceProvider = ...; // Configure DI

// The type arguments below are assumptions: the service interface name
// ICorpusIngestionService is a guess, and GlibcCorpusConnector is taken from
// the connector list above. Adjust both to match your DI registrations.
var ingestionService = serviceProvider.GetRequiredService<ICorpusIngestionService>();
var glibcConnector = serviceProvider.GetRequiredService<GlibcCorpusConnector>();

// Fetch available versions (ct is the caller's CancellationToken)
var versions = await glibcConnector.GetAvailableVersionsAsync(ct);

// Ingest the first five versions for each target architecture
foreach (var version in versions.Take(5))
{
    foreach (var arch in new[] { "x86_64", "aarch64" })
    {
        try
        {
            var binary = await glibcConnector.FetchBinaryAsync(version, arch, abi: "gnu", ct);

            var metadata = new LibraryMetadata(
                Name: "glibc",
                Version: version,
                Architecture: arch,
                Abi: "gnu",
                Compiler: "gcc",
                OptimizationLevel: "O2"
            );

            using var stream = File.OpenRead(binary.Path);
            var result = await ingestionService.IngestLibraryAsync(metadata, stream, ct: ct);

            Console.WriteLine($"Ingested {result.FunctionsIndexed} functions from glibc {version} {arch}");
        }
        catch (Exception ex)
        {
            // Log and continue so one failed build does not abort the whole run
            Console.WriteLine($"Failed to ingest glibc {version} {arch}: {ex.Message}");
        }
    }
}
```

## Ingestion Workflow

```
1. Package Discovery
   └─> Query package mirror for available versions

2. Package Download
   └─> Fetch .deb/.apk/.rpm package
   └─> Extract binary files

3. Binary Analysis
   └─> Disassemble with B2R2
   └─> Lift to IR (semantic fingerprints)
   └─> Extract functions, imports, exports

4. Fingerprint Generation
   └─> Instruction-level fingerprints
   └─> Semantic graph fingerprints
   └─> API call sequence fingerprints
   └─> Combined fingerprints

5. Database Storage
   └─> Insert library/version records
   └─> Insert build variant records
   └─> Insert function records
   └─> Insert fingerprint records

6. Clustering (post-ingestion)
   └─> Group similar functions across versions
   └─> Compute centroids
```

## Expected Corpus Coverage

### Phase 2a (Priority Libraries)

| Library | Versions | Architectures | Est. Functions | Status |
|---------|----------|---------------|----------------|--------|
| glibc | 2.17, 2.28, 2.31, 2.35, 2.38 | x64, arm64, armv7 | ~15,000 | Ready to ingest |
| OpenSSL | 1.0.2, 1.1.0, 1.1.1, 3.0, 3.1 | x64, arm64 | ~8,000 | Ready to ingest |
| zlib | 1.2.8, 1.2.11, 1.2.13, 1.3 | x64, arm64 | ~200 | Ready to ingest |
| libcurl | 7.50-7.88 (select) | x64, arm64 | ~2,000 | Ready to ingest |
| SQLite | 3.30-3.44 (select) | x64, arm64 | ~1,500 | Ready to ingest |

**Total Phase 2a:** ~26,700 unique functions, ~80,000 fingerprints (with variants)

## Monitoring Ingestion

```bash
# Check ingestion job status
stellaops corpus jobs list

# View statistics
stellaops corpus stats

# Query specific library coverage
stellaops corpus query --library glibc --show-versions
```

## Performance Considerations

- **Parallel ingestion:** Use multiple workers for concurrent processing
- **Disk I/O:** A local package cache significantly speeds up repeated ingestion
- **Database:** Ensure PostgreSQL has adequate memory for bulk inserts
- **Network:** Mirror selection impacts download speed

## Troubleshooting

### Package Download Failures

```
Error: Failed to download package from mirror
Solution: Check mirror availability, try alternative mirror
```

### Fingerprint Generation Failures

```
Error: Failed to generate semantic fingerprint for function X
Solution: Check B2R2 support for architecture, verify binary format
```

### Database Connection Issues

```
Error: Could not connect to corpus database
Solution: Verify STELLAOPS_CORPUS_DB connection string, check PostgreSQL is running
```

## Next Steps

After successful ingestion:

1. Run clustering: `stellaops corpus cluster --library glibc`
2. Update CVE associations: `stellaops corpus update-cves`
3. Validate query performance: `stellaops corpus benchmark-query`
4. Export statistics: `stellaops corpus export-stats --output corpus-stats.json`

## Related Documentation

- Database Schema: `docs/db/schemas/corpus.sql`
- Architecture: `docs/modules/binary-index/corpus-management.md`
- Sprint: `docs/implplan/SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`
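
## Appendix: Pre-flight Environment Check

Because a full ingestion run is long and the database troubleshooting entry above points back at `STELLAOPS_CORPUS_DB`, it may be worth validating the three environment variables from the "Environment Variables" step before starting. The following is a minimal sketch under stated assumptions: the function name `preflight_check` is hypothetical and not part of the `stellaops` CLI, the worker count is treated as optional only because this guide does not say otherwise, and the checks cover only what this guide states (a required connection string and an optional cache directory).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with StellaOps: validates the environment
# variables described in this guide before an ingestion run is started.

preflight_check() {
    local status=0

    # The database connection string is needed by every ingestion run.
    if [ -z "${STELLAOPS_CORPUS_DB:-}" ]; then
        echo "error: STELLAOPS_CORPUS_DB is not set" >&2
        status=1
    fi

    # The package cache is optional; when set it must be a writable directory.
    if [ -n "${STELLAOPS_PACKAGE_CACHE:-}" ]; then
        if [ ! -d "$STELLAOPS_PACKAGE_CACHE" ] || [ ! -w "$STELLAOPS_PACKAGE_CACHE" ]; then
            echo "error: STELLAOPS_PACKAGE_CACHE is not a writable directory: $STELLAOPS_PACKAGE_CACHE" >&2
            status=1
        fi
    fi

    # When a worker count is given it must be a positive integer
    # (assumption: leaving it unset is acceptable).
    if [ -n "${STELLAOPS_INGESTION_WORKERS:-}" ]; then
        case "$STELLAOPS_INGESTION_WORKERS" in
            *[!0-9]*|0*)
                echo "error: STELLAOPS_INGESTION_WORKERS must be a positive integer" >&2
                status=1
                ;;
        esac
    fi

    return "$status"
}
```

One way to use it is to source the file and gate the run on the result, for example `preflight_check && stellaops corpus ingest-full --config config/corpus-ingestion.yaml`. The function reports every problem it finds before returning non-zero, so a single pass shows all misconfigured variables at once.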