Corpus Ingestion Operations Guide
Version: 1.0
Sprint: SPRINT_20260105_001_002_BINDEX
Status: Implementation Complete - Operational Execution Pending
Overview
This guide describes how to execute corpus ingestion operations to populate the function behavior corpus with fingerprints from known library functions.
Prerequisites
- StellaOps.BinaryIndex.Corpus library built and deployed
- PostgreSQL database with corpus schema (see docs/db/schemas/corpus.sql)
- Network access to package mirrors (or local package cache)
- Sufficient disk space (~100GB for full corpus)
- Required tools:
- .NET 10 runtime
- HTTP client access to package repositories
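The prerequisites above can be checked before a run with a short preflight script. This is a minimal sketch, not part of the StellaOps tooling: `has_free_space`, `has_command`, and `preflight` are hypothetical helper names, the 100 GiB threshold approximates the ~100GB figure above, and the check assumes the `stellaops` CLI is on `PATH`.

```shell
# Sketch: check local prerequisites before starting an ingestion run.
# has_free_space, has_command and preflight are hypothetical helper names.

# Returns 0 if the filesystem holding directory $1 has at least $2 GiB available.
has_free_space() {
  local avail_kib
  avail_kib=$(df -Pk "$1" | awk 'NR==2 {print $4}') || return 1
  [ -n "$avail_kib" ] && [ "$avail_kib" -ge $(( $2 * 1024 * 1024 )) ]
}

# Returns 0 if the named command is on PATH.
has_command() {
  command -v "$1" >/dev/null 2>&1
}

# Checks the .NET 10 runtime, the stellaops CLI, and free disk space under $1.
preflight() {
  local dir="${1:-.}" status=0
  if ! has_command dotnet; then
    echo "missing: dotnet" >&2; status=1
  elif ! dotnet --list-runtimes | grep -q '^Microsoft\.NETCore\.App 10\.'; then
    echo "missing: .NET 10 runtime" >&2; status=1
  fi
  has_command stellaops || { echo "missing: stellaops CLI" >&2; status=1; }
  has_free_space "$dir" 100 || { echo "less than 100 GiB free under $dir" >&2; status=1; }
  return "$status"
}
```

Calling `preflight /var/cache/stellaops/packages` reports every missing prerequisite on stderr and returns non-zero if any check fails.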
Implementation Status
CORP-015, CORP-016, CORP-017: Implementation COMPLETE
All corpus connector implementations are complete and build successfully:
- ✓ GlibcCorpusConnector (GNU C Library)
- ✓ OpenSslCorpusConnector (OpenSSL)
- ✓ ZlibCorpusConnector (zlib)
- ✓ CurlCorpusConnector (libcurl)
Status: The connector code is complete. The remaining work for these tasks is operational: downloading real package data and ingesting it into the corpus.
Running Corpus Ingestion
1. Configure Package Sources
Set up access to package mirrors in your configuration:
# config/corpus-ingestion.yaml
packageSources:
  debian:
    mirrorUrl: "http://deb.debian.org/debian"
    distributions: ["bullseye", "bookworm"]
    components: ["main"]
  ubuntu:
    mirrorUrl: "http://archive.ubuntu.com/ubuntu"
    distributions: ["focal", "jammy"]
  alpine:
    mirrorUrl: "https://dl-cdn.alpinelinux.org/alpine"
    versions: ["v3.18", "v3.19"]
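Before starting a long job it can help to confirm that each configured mirror responds. The sketch below probes one well-known index file per mirror family: `dists/<release>/Release` for Debian and Ubuntu, and `APKINDEX.tar.gz` for Alpine. The helper names are hypothetical, the mirror list is hard-coded from the example configuration rather than read from the YAML file, and `curl` is assumed to be installed.

```shell
# Sketch: build a well-known index URL for each mirror family and probe it.
# probe_url_for, mirror_reachable and check_mirrors are hypothetical helper names.

# Prints a URL that should exist on a healthy mirror.
#   $1 = family (debian|ubuntu|alpine), $2 = mirror base URL, $3 = release
probe_url_for() {
  case "$1" in
    debian|ubuntu) printf '%s/dists/%s/Release\n' "$2" "$3" ;;
    alpine)        printf '%s/%s/main/x86_64/APKINDEX.tar.gz\n' "$2" "$3" ;;
    *)             echo "unknown mirror family: $1" >&2; return 1 ;;
  esac
}

# Returns 0 if an HTTP HEAD request to the URL succeeds within 15 seconds.
mirror_reachable() {
  curl --fail --silent --head --location --max-time 15 "$1" >/dev/null
}

# Probes one release per mirror from the example configuration.
check_mirrors() {
  local failed=0 family base release url
  while read -r family base release; do
    url=$(probe_url_for "$family" "$base" "$release") || { failed=1; continue; }
    if mirror_reachable "$url"; then
      echo "ok    $url"
    else
      echo "FAIL  $url"; failed=1
    fi
  done <<'EOF'
debian http://deb.debian.org/debian bookworm
ubuntu http://archive.ubuntu.com/ubuntu jammy
alpine https://dl-cdn.alpinelinux.org/alpine v3.19
EOF
  return "$failed"
}
```

`check_mirrors` returns non-zero if any probe fails, so it can gate a scheduled ingestion job.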
2. Environment Variables
# Database connection
export STELLAOPS_CORPUS_DB="Host=localhost;Database=stellaops;Username=corpus_user;Password=..."
# Package cache directory (optional)
export STELLAOPS_PACKAGE_CACHE="/var/cache/stellaops/packages"
# Concurrent workers
export STELLAOPS_INGESTION_WORKERS=4
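A small validation step can catch a missing or malformed variable before any packages are downloaded. This is a sketch under assumptions: `validate_ingestion_env` is a hypothetical helper, and the rules it applies (a required connection string, a positive integer worker count, a writable cache directory when one is configured) are inferred from the variable descriptions above rather than taken from the CLI.

```shell
# Sketch: fail fast if the ingestion environment is incomplete.
# validate_ingestion_env is a hypothetical helper, not part of the stellaops CLI.
validate_ingestion_env() {
  local status=0
  if [ -z "${STELLAOPS_CORPUS_DB:-}" ]; then
    echo "STELLAOPS_CORPUS_DB is not set" >&2; status=1
  fi
  # When set, the worker count must be a positive integer without leading zeros.
  # An unset variable is accepted and left to the CLI's own default.
  case "${STELLAOPS_INGESTION_WORKERS-1}" in
    ''|*[!0-9]*|0*)
      echo "STELLAOPS_INGESTION_WORKERS must be a positive integer" >&2; status=1 ;;
  esac
  # The cache directory is optional, but must be a writable directory when configured.
  if [ -n "${STELLAOPS_PACKAGE_CACHE:-}" ]; then
    if [ ! -d "$STELLAOPS_PACKAGE_CACHE" ] || [ ! -w "$STELLAOPS_PACKAGE_CACHE" ]; then
      echo "STELLAOPS_PACKAGE_CACHE is not a writable directory" >&2; status=1
    fi
  fi
  return "$status"
}
```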
3. Execute Ingestion (CLI)
# Ingest specific library version
stellaops corpus ingest --library glibc --version 2.31 --architectures x86_64,aarch64
# Ingest version range
stellaops corpus ingest --library openssl --version-range "1.1.0..1.1.1" --architectures x86_64
# Ingest from local binary
stellaops corpus ingest-binary --library glibc --version 2.31 --arch x86_64 --path /usr/lib/x86_64-linux-gnu/libc.so.6
# Full ingestion job (all configured libraries)
stellaops corpus ingest-full --config config/corpus-ingestion.yaml
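When only a handful of specific versions are needed, the single-version command can be wrapped in a loop. The sketch below uses only the `--library`, `--version`, and `--architectures` flags shown above; `ingest_versions` and the `DRY_RUN` variable are hypothetical additions for illustration.

```shell
# Sketch: run `stellaops corpus ingest` once per version, continuing past failures.
# ingest_versions and DRY_RUN are hypothetical; the CLI flags are those shown above.
#   $1 = library, $2 = comma-separated architectures, remaining args = versions
ingest_versions() {
  local library="$1" archs="$2" version failed=0
  shift 2
  for version in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Print the command instead of running it.
      echo "stellaops corpus ingest --library $library --version $version --architectures $archs"
    elif ! stellaops corpus ingest --library "$library" --version "$version" --architectures "$archs"; then
      echo "ingest failed: $library $version" >&2
      failed=$((failed + 1))
    fi
  done
  # Succeed only if every version was ingested.
  [ "$failed" -eq 0 ]
}
```

For example, `DRY_RUN=1 ingest_versions glibc x86_64,aarch64 2.31 2.35` prints the two commands it would run without executing them.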
4. Execute Ingestion (Programmatic)
using Microsoft.Extensions.DependencyInjection; // GetRequiredService<T>()
using StellaOps.BinaryIndex.Corpus;
using StellaOps.BinaryIndex.Corpus.Connectors;

// Setup
// ct is a CancellationToken supplied by the caller.
var serviceProvider = ...; // Configure DI
var ingestionService = serviceProvider.GetRequiredService<ICorpusIngestionService>();
var glibcConnector = serviceProvider.GetRequiredService<GlibcCorpusConnector>();

// Fetch available versions
var versions = await glibcConnector.GetAvailableVersionsAsync(ct);

// Ingest the first five available versions for each architecture
foreach (var version in versions.Take(5))
{
    foreach (var arch in new[] { "x86_64", "aarch64" })
    {
        try
        {
            var binary = await glibcConnector.FetchBinaryAsync(version, arch, abi: "gnu", ct);
            var metadata = new LibraryMetadata(
                Name: "glibc",
                Version: version,
                Architecture: arch,
                Abi: "gnu",
                Compiler: "gcc",
                OptimizationLevel: "O2"
            );

            using var stream = File.OpenRead(binary.Path);
            var result = await ingestionService.IngestLibraryAsync(metadata, stream, ct: ct);
            Console.WriteLine($"Ingested {result.FunctionsIndexed} functions from glibc {version} {arch}");
        }
        // Let cancellation propagate; log any other failure and continue with the next build.
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            Console.WriteLine($"Failed to ingest glibc {version} {arch}: {ex.Message}");
        }
    }
}
Ingestion Workflow
1. Package Discovery
└─> Query package mirror for available versions
2. Package Download
└─> Fetch .deb/.apk/.rpm package
└─> Extract binary files
3. Binary Analysis
└─> Disassemble with B2R2
└─> Lift to IR (semantic fingerprints)
└─> Extract functions, imports, exports
4. Fingerprint Generation
└─> Instruction-level fingerprints
└─> Semantic graph fingerprints
└─> API call sequence fingerprints
└─> Combined fingerprints
5. Database Storage
└─> Insert library/version records
└─> Insert build variant records
└─> Insert function records
└─> Insert fingerprint records
6. Clustering (post-ingestion)
└─> Group similar functions across versions
└─> Compute centroids
Expected Corpus Coverage
Phase 2a (Priority Libraries)
| Library | Versions | Architectures | Est. Functions | Status |
|---|---|---|---|---|
| glibc | 2.17, 2.28, 2.31, 2.35, 2.38 | x64, arm64, armv7 | ~15,000 | Ready to ingest |
| OpenSSL | 1.0.2, 1.1.0, 1.1.1, 3.0, 3.1 | x64, arm64 | ~8,000 | Ready to ingest |
| zlib | 1.2.8, 1.2.11, 1.2.13, 1.3 | x64, arm64 | ~200 | Ready to ingest |
| libcurl | 7.50-7.88 (select) | x64, arm64 | ~2,000 | Ready to ingest |
| SQLite | 3.30-3.44 (select) | x64, arm64 | ~1,500 | Ready to ingest |
Total Phase 2a: ~26,700 unique functions, ~80,000 fingerprints (with variants)
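The totals can be checked with shell arithmetic. The figures are the estimates from the table; the ratio is only what those two totals imply.

```shell
# Sum the "Est. Functions" column of the Phase 2a table.
total=$((15000 + 8000 + 200 + 2000 + 1500))
echo "unique functions: $total"

# Implied fingerprints per function, rounded to the nearest integer.
per_function=$(( (80000 + total / 2) / total ))
echo "fingerprints per function: $per_function"
```

The sum is 26,700, matching the stated total, and 80,000 fingerprints works out to roughly three per function.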
Monitoring Ingestion
# Check ingestion job status
stellaops corpus jobs list
# View statistics
stellaops corpus stats
# Query specific library coverage
stellaops corpus query --library glibc --show-versions
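For a long-running job, the status commands above can be re-run on a fixed interval. This sketch makes no assumption about their output format and simply prints it with a timestamp; `watch_ingestion` is a hypothetical helper name.

```shell
# Sketch: print job status and corpus statistics at a fixed interval.
# watch_ingestion is a hypothetical helper; it only re-runs the commands shown above.
#   $1 = seconds between rounds (default 60), $2 = number of rounds (default 10)
watch_ingestion() {
  local interval="${1:-60}" rounds="${2:-10}" i=0
  while [ "$i" -lt "$rounds" ]; do
    echo "== $(date -u +%Y-%m-%dT%H:%M:%SZ) =="
    stellaops corpus jobs list || echo "jobs list failed" >&2
    stellaops corpus stats     || echo "stats failed" >&2
    i=$((i + 1))
    # Sleep between rounds, but not after the last one.
    if [ "$i" -lt "$rounds" ]; then
      sleep "$interval"
    fi
  done
}
```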
Performance Considerations
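- Parallel ingestion: Set STELLAOPS_INGESTION_WORKERS to process multiple packages concurrently
- Disk I/O: A local package cache (STELLAOPS_PACKAGE_CACHE) avoids re-downloading packages and significantly speeds up repeated ingestion
- Database: Ensure PostgreSQL has adequate memory for bulk inserts
- Network: Download speed depends on the mirror selected; prefer one close to the ingestion host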
Troubleshooting
Package Download Failures
Error: Failed to download package from mirror
Solution: Check that the mirror is available; if it is not, point mirrorUrl at an alternative mirror
Fingerprint Generation Failures
Error: Failed to generate semantic fingerprint for function X
Solution: Check B2R2 support for architecture, verify binary format
Database Connection Issues
Error: Could not connect to corpus database
Solution: Verify STELLAOPS_CORPUS_DB connection string, check PostgreSQL is running
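To separate a bad connection string from a stopped server, the host and port can be pulled out of STELLAOPS_CORPUS_DB and passed to PostgreSQL's `pg_isready` client tool. This is a rough sketch: `conn_field` and `check_corpus_db` are hypothetical helpers, key matching is case-sensitive, and values containing `;` are not handled.

```shell
# Sketch: extract fields from a "Key=Value;Key=Value" connection string.
# conn_field and check_corpus_db are hypothetical helper names.

# Prints the value for key $2 in connection string $1; returns 1 if the key is absent.
conn_field() {
  local conn="$1" key="$2" pair
  local IFS=';'
  for pair in $conn; do
    if [ "${pair%%=*}" = "$key" ]; then
      printf '%s\n' "${pair#*=}"
      return 0
    fi
  done
  return 1
}

# Asks the server named in STELLAOPS_CORPUS_DB whether it accepts connections.
check_corpus_db() {
  local host port
  host=$(conn_field "${STELLAOPS_CORPUS_DB:-}" Host) || {
    echo "no Host field in STELLAOPS_CORPUS_DB" >&2; return 1
  }
  # Fall back to the default PostgreSQL port when none is given.
  port=$(conn_field "${STELLAOPS_CORPUS_DB:-}" Port) || port=5432
  pg_isready -h "$host" -p "$port"
}
```

If `check_corpus_db` succeeds but ingestion still cannot connect, the problem is more likely in the credentials or database name than in server availability.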
Next Steps
After successful ingestion:
- Run clustering: stellaops corpus cluster --library glibc
- Update CVE associations: stellaops corpus update-cves
- Validate query performance: stellaops corpus benchmark-query
- Export statistics: stellaops corpus export-stats --output corpus-stats.json
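These post-ingestion steps can be chained so that a failure stops the sequence. The sketch below uses only the commands listed above; `post_ingest` is a hypothetical wrapper, and running the steps in this order is an assumption based on the order of the list.

```shell
# Sketch: run the post-ingestion steps in order, stopping at the first failure.
# post_ingest is a hypothetical wrapper around the commands listed above.
#   $1 = library to cluster, $2 = statistics output file (default corpus-stats.json)
post_ingest() {
  local library="$1" output="${2:-corpus-stats.json}"
  stellaops corpus cluster --library "$library" &&
    stellaops corpus update-cves &&
    stellaops corpus benchmark-query &&
    stellaops corpus export-stats --output "$output"
}
```

For example, `post_ingest glibc` clusters glibc, then runs the remaining three steps only if each preceding one succeeds.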
Related Documentation
- Database Schema: docs/db/schemas/corpus.sql
- Architecture: docs/modules/binary-index/corpus-management.md
- Sprint: docs/implplan/SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md