save progress
232
docs/modules/binary-index/corpus-ingestion-operations.md
Normal file
@@ -0,0 +1,232 @@
# Corpus Ingestion Operations Guide

**Version:** 1.0
**Sprint:** SPRINT_20260105_001_002_BINDEX
**Status:** Implementation Complete - Operational Execution Pending

## Overview

This guide describes how to execute corpus ingestion operations to populate the function behavior corpus with fingerprints from known library functions.

## Prerequisites

- StellaOps.BinaryIndex.Corpus library built and deployed
- PostgreSQL database with corpus schema (see `docs/db/schemas/corpus.sql`)
- Network access to package mirrors (or local package cache)
- Sufficient disk space (~100GB for full corpus)
- Required tools:
  - .NET 10 runtime
  - HTTP client access to package repositories
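
The disk-space and tooling prerequisites can be checked before a long run. The following is a minimal sketch, not part of StellaOps: the function names are hypothetical, and the 100 GiB figure in the usage note simply mirrors the estimate in the list above.

```bash
#!/usr/bin/env bash
# Pre-flight sketch: check free disk space and tool availability.
# All function names here are illustrative, not part of the StellaOps CLI.

# Print the free space, in whole GiB, of the filesystem that holds directory $1.
free_gib() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed when directory $1 has at least $2 GiB free; otherwise warn on stderr.
has_free_space() {
  local dir=$1 need_gib=$2 have_gib
  have_gib=$(free_gib "$dir")
  if (( have_gib < need_gib )); then
    echo "only ${have_gib} GiB free in ${dir}, need ${need_gib} GiB" >&2
    return 1
  fi
}

# Succeed when command $1 is on PATH.
has_tool() {
  command -v "$1" >/dev/null 2>&1
}
```

A typical call would be `has_free_space /var/cache/stellaops/packages 100 && has_tool dotnet`. To confirm which runtime versions are installed, `dotnet --list-runtimes` prints them.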

## Implementation Status

**CORP-015, CORP-016, CORP-017: Implementation COMPLETE**

All corpus connector implementations are complete and build successfully:
- ✓ GlibcCorpusConnector (GNU C Library)
- ✓ OpenSslCorpusConnector (OpenSSL)
- ✓ ZlibCorpusConnector (zlib)
- ✓ CurlCorpusConnector (libcurl)

**Status:** Code implementation is done. These tasks require **operational execution** to download and ingest real package data.

## Running Corpus Ingestion

### 1. Configure Package Sources

Set up access to package mirrors in your configuration:

```yaml
# config/corpus-ingestion.yaml
packageSources:
  debian:
    mirrorUrl: "http://deb.debian.org/debian"
    distributions: ["bullseye", "bookworm"]
    components: ["main"]

  ubuntu:
    mirrorUrl: "http://archive.ubuntu.com/ubuntu"
    distributions: ["focal", "jammy"]

  alpine:
    mirrorUrl: "https://dl-cdn.alpinelinux.org/alpine"
    versions: ["v3.18", "v3.19"]
```
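
Before starting a job, it can help to confirm that each configured mirror actually serves the listed distributions. The sketch below builds the standard index URLs (`dists/<distribution>/Release` for Debian-style mirrors, `APKINDEX.tar.gz` for Alpine) and probes them with `curl`. The function names are hypothetical, and the `main` repository and `x86_64` defaults for Alpine are assumptions, since the configuration above does not specify them.

```bash
#!/usr/bin/env bash
# Sketch: build one index URL per configured source, then probe it.
# Function names are illustrative only.

# Debian-style mirrors (Debian, Ubuntu) publish dists/<distribution>/Release.
apt_release_url() {
  local mirror=$1 dist=$2
  printf '%s/dists/%s/Release\n' "${mirror%/}" "$dist"
}

# Alpine mirrors publish <version>/<repo>/<arch>/APKINDEX.tar.gz.
apk_index_url() {
  local mirror=$1 version=$2 repo=${3:-main} arch=${4:-x86_64}
  printf '%s/%s/%s/%s/APKINDEX.tar.gz\n' "${mirror%/}" "$version" "$repo" "$arch"
}

# Succeed when a HEAD request to URL $1 completes without an HTTP error
# within 10 seconds, following redirects.
probe() {
  curl --silent --fail --head --location --max-time 10 --output /dev/null "$1"
}
```

For example, `probe "$(apt_release_url http://deb.debian.org/debian bookworm)" || echo "debian/bookworm unreachable"` reports a mirror that cannot serve that distribution.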

### 2. Environment Variables

```bash
# Database connection
export STELLAOPS_CORPUS_DB="Host=localhost;Database=stellaops;Username=corpus_user;Password=..."

# Package cache directory (optional)
export STELLAOPS_PACKAGE_CACHE="/var/cache/stellaops/packages"

# Concurrent workers
export STELLAOPS_INGESTION_WORKERS=4
```
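
A small validation step can catch a missing or malformed variable before a long-running job starts. This is a sketch under assumptions: the guide does not say whether the CLI validates these values itself or which of them have defaults, so the check treats only the database connection as required. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: validate the ingestion environment before starting a job.
# Assumption: only STELLAOPS_CORPUS_DB is mandatory; the other two are
# checked only when they are set.

# Succeed when $1 is a positive integer.
is_positive_int() {
  [[ $1 =~ ^[1-9][0-9]*$ ]]
}

# Print one line per problem and return non-zero if any was found.
check_ingestion_env() {
  local problems=0
  if [[ -z ${STELLAOPS_CORPUS_DB:-} ]]; then
    echo "STELLAOPS_CORPUS_DB is not set"
    problems=1
  fi
  if [[ -n ${STELLAOPS_INGESTION_WORKERS:-} ]] &&
     ! is_positive_int "$STELLAOPS_INGESTION_WORKERS"; then
    echo "STELLAOPS_INGESTION_WORKERS must be a positive integer"
    problems=1
  fi
  if [[ -n ${STELLAOPS_PACKAGE_CACHE:-} && ! -d ${STELLAOPS_PACKAGE_CACHE} ]]; then
    echo "STELLAOPS_PACKAGE_CACHE does not exist: ${STELLAOPS_PACKAGE_CACHE}"
    problems=1
  fi
  return "$problems"
}
```

Running `check_ingestion_env || exit 1` at the top of a wrapper script stops the run early with a readable message.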

### 3. Execute Ingestion (CLI)

```bash
# Ingest specific library version
stellaops corpus ingest --library glibc --version 2.31 --architectures x86_64,aarch64

# Ingest version range
stellaops corpus ingest --library openssl --version-range "1.1.0..1.1.1" --architectures x86_64

# Ingest from local binary
stellaops corpus ingest-binary --library glibc --version 2.31 --arch x86_64 --path /usr/lib/x86_64-linux-gnu/libc.so.6

# Full ingestion job (all configured libraries)
stellaops corpus ingest-full --config config/corpus-ingestion.yaml
```
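
When only a handful of versions is needed, the per-version command can be wrapped in a loop. The sketch below uses only the `--library`, `--version`, and `--architectures` flags shown above. The wrapper function and the `DRY_RUN` switch are hypothetical, and the sketch assumes the CLI exits non-zero on failure, which this guide does not state.

```bash
#!/usr/bin/env bash
# Sketch: run the per-version ingest command over a list of versions.
# Set DRY_RUN=1 to print the commands instead of executing them.
# Assumption: `stellaops corpus ingest` returns a non-zero status on failure.

ingest_versions() {
  local library=$1 archs=$2 version failed=0
  shift 2
  for version in "$@"; do
    local cmd=(stellaops corpus ingest --library "$library"
               --version "$version" --architectures "$archs")
    if [[ ${DRY_RUN:-0} == 1 ]]; then
      echo "${cmd[*]}"
    elif ! "${cmd[@]}"; then
      # Keep going so one bad version does not stop the batch.
      echo "ingest failed: ${library} ${version}" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

For example, `DRY_RUN=1 ingest_versions glibc x86_64,aarch64 2.31 2.35 2.38` prints the three commands that would run, which is a cheap way to review a batch before committing to the downloads.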

### 4. Execute Ingestion (Programmatic)

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading;
using Microsoft.Extensions.DependencyInjection;
using StellaOps.BinaryIndex.Corpus;
using StellaOps.BinaryIndex.Corpus.Connectors;

// Setup
var serviceProvider = ...; // Configure DI
var ingestionService = serviceProvider.GetRequiredService<ICorpusIngestionService>();
var glibcConnector = serviceProvider.GetRequiredService<GlibcCorpusConnector>();
CancellationToken ct = default; // Replace with the host's cancellation token

// Fetch available versions
var versions = await glibcConnector.GetAvailableVersionsAsync(ct);

// Ingest the first five versions for each architecture
foreach (var version in versions.Take(5))
{
    foreach (var arch in new[] { "x86_64", "aarch64" })
    {
        try
        {
            var binary = await glibcConnector.FetchBinaryAsync(version, arch, abi: "gnu", ct);

            var metadata = new LibraryMetadata(
                Name: "glibc",
                Version: version,
                Architecture: arch,
                Abi: "gnu",
                Compiler: "gcc",
                OptimizationLevel: "O2"
            );

            await using var stream = File.OpenRead(binary.Path);
            var result = await ingestionService.IngestLibraryAsync(metadata, stream, ct: ct);

            Console.WriteLine($"Ingested {result.FunctionsIndexed} functions from glibc {version} {arch}");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Log and continue so one failed variant does not stop the batch;
            // cancellation is left to propagate.
            Console.Error.WriteLine($"Failed to ingest glibc {version} {arch}: {ex.Message}");
        }
    }
}
```

## Ingestion Workflow

```
1. Package Discovery
   └─> Query package mirror for available versions

2. Package Download
   └─> Fetch .deb/.apk/.rpm package
   └─> Extract binary files

3. Binary Analysis
   └─> Disassemble with B2R2
   └─> Lift to IR (semantic fingerprints)
   └─> Extract functions, imports, exports

4. Fingerprint Generation
   └─> Instruction-level fingerprints
   └─> Semantic graph fingerprints
   └─> API call sequence fingerprints
   └─> Combined fingerprints

5. Database Storage
   └─> Insert library/version records
   └─> Insert build variant records
   └─> Insert function records
   └─> Insert fingerprint records

6. Clustering (post-ingestion)
   └─> Group similar functions across versions
   └─> Compute centroids
```
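
When a download succeeds but analysis fails, it is sometimes useful to reproduce the extraction part of step 2 by hand and inspect the binary. The sketch below unpacks a package with the standard tool for its format (`dpkg-deb`, `tar`, or `rpm2cpio` with `cpio`). It illustrates manual inspection only and is not how the connectors extract packages; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of step 2 for manual inspection: unpack a downloaded package by type.
# This is not the connectors' implementation.

# Print the package format for file $1, based on its extension.
package_format() {
  case $1 in
    *.deb) echo deb ;;
    *.apk) echo apk ;;
    *.rpm) echo rpm ;;
    *)     echo unknown ;;
  esac
}

# Extract package $1 into directory $2 using the standard tool for its format.
extract_package() {
  local pkg=$1 dest=$2
  mkdir -p "$dest" || return 1
  case $(package_format "$pkg") in
    deb) dpkg-deb -x "$pkg" "$dest" ;;
    apk) tar -xzf "$pkg" -C "$dest" ;;   # Alpine packages are gzip-compressed tar archives
    rpm) rpm2cpio "$pkg" | (cd "$dest" && cpio -idm --quiet) ;;
    *)   echo "unsupported package type: $pkg" >&2; return 1 ;;
  esac
}
```

After `extract_package ./package.deb ./unpacked`, the shared objects can be located with `find ./unpacked -name '*.so*'` and passed to `stellaops corpus ingest-binary` as shown in the CLI section.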

## Expected Corpus Coverage

### Phase 2a (Priority Libraries)

| Library | Versions | Architectures | Est. Functions | Status |
|---------|----------|---------------|----------------|--------|
| glibc | 2.17, 2.28, 2.31, 2.35, 2.38 | x64, arm64, armv7 | ~15,000 | Ready to ingest |
| OpenSSL | 1.0.2, 1.1.0, 1.1.1, 3.0, 3.1 | x64, arm64 | ~8,000 | Ready to ingest |
| zlib | 1.2.8, 1.2.11, 1.2.13, 1.3 | x64, arm64 | ~200 | Ready to ingest |
| libcurl | 7.50-7.88 (select) | x64, arm64 | ~2,000 | Ready to ingest |
| SQLite | 3.30-3.44 (select) | x64, arm64 | ~1,500 | Ready to ingest |

**Total Phase 2a:** ~26,700 unique functions, ~80,000 fingerprints (with variants)
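
The total can be re-derived from the per-library estimates in the table, which is handy when a row changes. The short sketch below sums the five estimates and computes the fingerprints-per-function ratio implied by the ~80,000 figure. The ratio is only an arithmetic consequence of the two published numbers, not a measured value.

```bash
#!/usr/bin/env bash
# Re-derive the Phase 2a total from the per-library estimates in the table.
estimates=(15000 8000 200 2000 1500)   # glibc, OpenSSL, zlib, libcurl, SQLite

total=0
for n in "${estimates[@]}"; do
  total=$(( total + n ))
done
echo "estimated functions: ${total}"   # prints "estimated functions: 26700"

# Fingerprints per function implied by the ~80,000 figure.
ratio=$(awk -v f=80000 -v t="$total" 'BEGIN { printf "%.1f", f / t }')
echo "fingerprints per function: ${ratio}"   # prints "fingerprints per function: 3.0"
```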

## Monitoring Ingestion

```bash
# Check ingestion job status
stellaops corpus jobs list

# View statistics
stellaops corpus stats

# Query specific library coverage
stellaops corpus query --library glibc --show-versions
```

## Performance Considerations

- **Parallel ingestion:** Set `STELLAOPS_INGESTION_WORKERS` above 1 to process packages concurrently
- **Disk I/O:** A local package cache (`STELLAOPS_PACKAGE_CACHE`) avoids re-downloading packages and significantly speeds up repeated ingestion
- **Database:** Ensure PostgreSQL has adequate memory for bulk inserts
- **Network:** Mirror selection impacts download speed; prefer a mirror close to the ingestion host
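
One simple way to pick a worker count is to start from the CPU count and bound it by a cap, so the database and the mirror are not overwhelmed. The sketch below is a heuristic only: the helper name is hypothetical, the cap of 8 in the usage note is an arbitrary starting point, and the right value depends on the database and network capacity noted above.

```bash
#!/usr/bin/env bash
# Sketch: derive a worker count from the CPU count, bounded by a cap.
# This is a starting heuristic, not a measured recommendation.

# Print min(cores, cap), but never less than 1.
pick_workers() {
  local cores=$1 cap=$2
  local n=$(( cores < cap ? cores : cap ))
  if (( n < 1 )); then
    n=1
  fi
  echo "$n"
}
```

On a Linux host with GNU coreutils, `export STELLAOPS_INGESTION_WORKERS=$(pick_workers "$(nproc)" 8)` applies it.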

## Troubleshooting

### Package Download Failures

```
Error: Failed to download package from mirror
Solution: Check mirror availability, try alternative mirror
```

### Fingerprint Generation Failures

```
Error: Failed to generate semantic fingerprint for function X
Solution: Check B2R2 support for architecture, verify binary format
```

### Database Connection Issues

```
Error: Could not connect to corpus database
Solution: Verify STELLAOPS_CORPUS_DB connection string, check PostgreSQL is running
```
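
To separate a bad connection string from a server that is down, the host and database name can be pulled out of `STELLAOPS_CORPUS_DB` and passed to PostgreSQL's `pg_isready`. The sketch below is a simplification: the helper names are hypothetical, the key match is case-sensitive, and values that themselves contain `;` or `=` are not handled.

```bash
#!/usr/bin/env bash
# Sketch: pull one field out of a "Key=Value;Key=Value" connection string.
# Simplified parsing: case-sensitive keys, no quoted or escaped values.

# Print the value of key $2 in connection string $1 (empty if the key is absent).
conn_field() {
  local conn=$1 key=$2 pair
  local -a pairs
  IFS=';' read -r -a pairs <<< "$conn"
  for pair in "${pairs[@]}"; do
    if [[ ${pair%%=*} == "$key" ]]; then
      echo "${pair#*=}"
      return 0
    fi
  done
}

# Ask the server named in connection string $1 whether it accepts connections.
db_ready() {
  pg_isready -h "$(conn_field "$1" Host)" -d "$(conn_field "$1" Database)"
}
```

Calling `db_ready "$STELLAOPS_CORPUS_DB"` then reports whether the server accepts connections. If it does and ingestion still fails, the credentials or the corpus schema are the more likely cause.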

## Next Steps

After successful ingestion:

1. Run clustering: `stellaops corpus cluster --library glibc`
2. Update CVE associations: `stellaops corpus update-cves`
3. Validate query performance: `stellaops corpus benchmark-query`
4. Export statistics: `stellaops corpus export-stats --output corpus-stats.json`

## Related Documentation

- Database Schema: `docs/db/schemas/corpus.sql`
- Architecture: `docs/modules/binary-index/corpus-management.md`
- Sprint: `docs/implplan/SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`