save progress

StellaOps Bot
2026-01-06 09:42:02 +02:00
parent 94d68bee8b
commit 37e11918e0
443 changed files with 85863 additions and 897 deletions


@@ -0,0 +1,232 @@
# Corpus Ingestion Operations Guide
**Version:** 1.0
**Sprint:** SPRINT_20260105_001_002_BINDEX
**Status:** Implementation Complete - Operational Execution Pending
## Overview
This guide describes how to run corpus ingestion, which populates the function behavior corpus with fingerprints of known library functions.
## Prerequisites
- StellaOps.BinaryIndex.Corpus library built and deployed
- PostgreSQL database with corpus schema (see `docs/db/schemas/corpus.sql`)
- Network access to package mirrors (or local package cache)
- Sufficient disk space (~100 GB for full corpus)
- Required tools:
  - .NET 10 runtime
  - HTTP client access to package repositories
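
The prerequisites above can be checked before starting a long run. The following is a minimal sketch, assuming a POSIX `df` and the `dotnet` CLI on `PATH`; the function names are hypothetical helpers, not part of StellaOps, and the 100 GB default simply reuses the disk estimate listed above.

```bash
# Pre-flight sketch; all function names here are illustrative only.

# Succeeds when <available_kb> (KiB) covers <required_gb> gigabytes.
has_enough_space() {
  local available_kb=$1 required_gb=$2
  [ "$available_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}

# Prints the free space, in KiB, of the filesystem that holds <dir>.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Reports every unmet prerequisite and fails if any was found.
preflight() {
  local dir=${1:-.} required_gb=${2:-100} status=0
  if ! command -v dotnet >/dev/null 2>&1; then
    echo "missing: dotnet not found on PATH"
    status=1
  fi
  if ! has_enough_space "$(free_kb "$dir")" "$required_gb"; then
    echo "missing: less than ${required_gb} GB free under $dir"
    status=1
  fi
  return "$status"
}
```

Calling `preflight /var/cache/stellaops/packages 100` would check the cache location used later in this guide. Database and network checks are left out of this sketch.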
## Implementation Status
**CORP-015, CORP-016, CORP-017: Implementation COMPLETE**
All corpus connector implementations are complete and build successfully:
- ✓ GlibcCorpusConnector (GNU C Library)
- ✓ OpenSslCorpusConnector (OpenSSL)
- ✓ ZlibCorpusConnector (zlib)
- ✓ CurlCorpusConnector (libcurl)
**Status:** Code implementation is done. These tasks require **operational execution** to download and ingest real package data.
## Running Corpus Ingestion
### 1. Configure Package Sources
Set up access to package mirrors in your configuration:
```yaml
# config/corpus-ingestion.yaml
packageSources:
  debian:
    mirrorUrl: "http://deb.debian.org/debian"
    distributions: ["bullseye", "bookworm"]
    components: ["main"]
  ubuntu:
    mirrorUrl: "http://archive.ubuntu.com/ubuntu"
    distributions: ["focal", "jammy"]
  alpine:
    mirrorUrl: "https://dl-cdn.alpinelinux.org/alpine"
    versions: ["v3.18", "v3.19"]
```
### 2. Environment Variables
```bash
# Database connection
export STELLAOPS_CORPUS_DB="Host=localhost;Database=stellaops;Username=corpus_user;Password=..."
# Package cache directory (optional)
export STELLAOPS_PACKAGE_CACHE="/var/cache/stellaops/packages"
# Concurrent workers
export STELLAOPS_INGESTION_WORKERS=4
```
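A quick sanity check of these variables can catch typos before a job starts. This is a sketch only: `validate_env` is a hypothetical helper, and it assumes that an unset `STELLAOPS_INGESTION_WORKERS` is acceptable because the tool falls back to a default, which this guide does not confirm.

```bash
# Validation sketch; validate_env is not a StellaOps command.
validate_env() {
  local errors=0 workers=${STELLAOPS_INGESTION_WORKERS:-}

  # The connection string is required for any ingestion run.
  if [ -z "${STELLAOPS_CORPUS_DB:-}" ]; then
    echo "STELLAOPS_CORPUS_DB is not set" >&2
    errors=$(( errors + 1 ))
  fi

  # The worker count, when set, must be a positive integer.
  case "$workers" in
    '') ;;  # unset: assumed to fall back to a built-in default
    *[!0-9]*|0)
      echo "STELLAOPS_INGESTION_WORKERS must be a positive integer" >&2
      errors=$(( errors + 1 ))
      ;;
  esac

  [ "$errors" -eq 0 ]
}
```

Running `validate_env || exit 1` at the top of a wrapper script stops the run before any packages are downloaded.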
### 3. Execute Ingestion (CLI)
```bash
# Ingest specific library version
stellaops corpus ingest --library glibc --version 2.31 --architectures x86_64,aarch64
# Ingest version range
stellaops corpus ingest --library openssl --version-range "1.1.0..1.1.1" --architectures x86_64
# Ingest from local binary
stellaops corpus ingest-binary --library glibc --version 2.31 --arch x86_64 --path /usr/lib/x86_64-linux-gnu/libc.so.6
# Full ingestion job (all configured libraries)
stellaops corpus ingest-full --config config/corpus-ingestion.yaml
```
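When several discrete versions are needed rather than a range, the single-version command can be wrapped in a loop. The sketch below uses only the `stellaops corpus ingest` flags shown above; `ingest_versions` and the `DRY_RUN` switch are hypothetical, and the failure handling assumes the CLI returns a non-zero exit code on error.

```bash
# Batch sketch: one ingest call per version, continuing past failures.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
ingest_versions() {
  local library=$1 architectures=$2 version failed=0
  shift 2
  for version in "$@"; do
    local cmd=(stellaops corpus ingest --library "$library"
               --version "$version" --architectures "$architectures")
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "${cmd[*]}"
    elif ! "${cmd[@]}"; then
      echo "ingestion failed for $library $version" >&2
      failed=$(( failed + 1 ))
    fi
  done
  # Fail overall if any single version failed.
  [ "$failed" -eq 0 ]
}

ingest_versions glibc x86_64,aarch64 2.31 2.35 2.38
```

Setting `DRY_RUN=0` runs the commands for real. Continuing after a failed version keeps one bad package from blocking the rest of the batch.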
### 4. Execute Ingestion (Programmatic)
```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.Extensions.DependencyInjection;
using StellaOps.BinaryIndex.Corpus;
using StellaOps.BinaryIndex.Corpus.Connectors;

// Setup
var serviceProvider = ...; // Configure DI
var ingestionService = serviceProvider.GetRequiredService<ICorpusIngestionService>();
var glibcConnector = serviceProvider.GetRequiredService<GlibcCorpusConnector>();

// ct is the caller's CancellationToken.
// Fetch available versions
var versions = await glibcConnector.GetAvailableVersionsAsync(ct);

// Ingest the first five available versions for each architecture
foreach (var version in versions.Take(5))
{
    foreach (var arch in new[] { "x86_64", "aarch64" })
    {
        try
        {
            var binary = await glibcConnector.FetchBinaryAsync(version, arch, abi: "gnu", ct);

            var metadata = new LibraryMetadata(
                Name: "glibc",
                Version: version,
                Architecture: arch,
                Abi: "gnu",
                Compiler: "gcc",
                OptimizationLevel: "O2"
            );

            using var stream = File.OpenRead(binary.Path);
            var result = await ingestionService.IngestLibraryAsync(metadata, stream, ct: ct);
            Console.WriteLine($"Ingested {result.FunctionsIndexed} functions from glibc {version} {arch}");
        }
        catch (OperationCanceledException)
        {
            throw; // Propagate cancellation instead of reporting it as an ingestion failure
        }
        catch (Exception ex)
        {
            // Log and continue so one bad version/architecture does not stop the batch
            Console.WriteLine($"Failed to ingest glibc {version} {arch}: {ex.Message}");
        }
    }
}
```
## Ingestion Workflow
```
1. Package Discovery
   └─> Query package mirror for available versions

2. Package Download
   └─> Fetch .deb/.apk/.rpm package
   └─> Extract binary files

3. Binary Analysis
   └─> Disassemble with B2R2
   └─> Lift to IR (semantic fingerprints)
   └─> Extract functions, imports, exports

4. Fingerprint Generation
   └─> Instruction-level fingerprints
   └─> Semantic graph fingerprints
   └─> API call sequence fingerprints
   └─> Combined fingerprints

5. Database Storage
   └─> Insert library/version records
   └─> Insert build variant records
   └─> Insert function records
   └─> Insert fingerprint records

6. Clustering (post-ingestion)
   └─> Group similar functions across versions
   └─> Compute centroids
```
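For step 2, it can help to unpack a package by hand when checking what a connector should have found. The sketch below is an assumption about how extraction could be done manually with standard tools (`dpkg-deb`, `tar`, `rpm2cpio`, `cpio`); it does not describe how the connectors are implemented, and both function names are hypothetical.

```bash
# Manual extraction sketch for workflow step 2.

# Prints the package format derived from the file extension.
package_format() {
  case "$1" in
    *.deb) echo deb ;;
    *.apk) echo apk ;;
    *.rpm) echo rpm ;;
    *)     return 1 ;;
  esac
}

# Unpacks <package> into <dest> with the tool that matches its format.
extract_package() {
  local package=$1 dest=$2 format
  format=$(package_format "$package") || {
    echo "unsupported package: $package" >&2
    return 1
  }
  mkdir -p "$dest"
  case "$format" in
    deb) dpkg-deb -x "$package" "$dest" ;;
    # Alpine packages are gzip-compressed tar streams; GNU tar may warn
    # about APK-specific header keywords while extracting.
    apk) tar -xzf "$package" -C "$dest" ;;
    rpm) rpm2cpio "$package" | (cd "$dest" && cpio -idm --quiet) ;;
  esac
}
```

After extraction, `find "$dest" -type f -name '*.so*'` lists the shared objects that would be candidates for the `ingest-binary` command.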
## Expected Corpus Coverage
### Phase 2a (Priority Libraries)
| Library | Versions | Architectures | Est. Functions | Status |
|---------|----------|---------------|----------------|--------|
| glibc | 2.17, 2.28, 2.31, 2.35, 2.38 | x64, arm64, armv7 | ~15,000 | Ready to ingest |
| OpenSSL | 1.0.2, 1.1.0, 1.1.1, 3.0, 3.1 | x64, arm64 | ~8,000 | Ready to ingest |
| zlib | 1.2.8, 1.2.11, 1.2.13, 1.3 | x64, arm64 | ~200 | Ready to ingest |
| libcurl | 7.50-7.88 (select) | x64, arm64 | ~2,000 | Ready to ingest |
| SQLite | 3.30-3.44 (select) | x64, arm64 | ~1,500 | Ready to ingest |

**Total Phase 2a:** ~26,700 unique functions, ~80,000 fingerprints (with variants)
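
The Phase 2a total can be reproduced from the per-library estimates in the table. The short calculation below also derives the average number of fingerprints per unique function implied by the two stated totals; that ratio is an inference from the figures, not a documented property of the corpus.

```bash
# Per-library function estimates from the Phase 2a table:
# glibc, OpenSSL, zlib, libcurl, SQLite
estimates=(15000 8000 200 2000 1500)

total=0
for n in "${estimates[@]}"; do
  total=$(( total + n ))
done
echo "estimated unique functions: $total"

# Average fingerprints per function implied by the ~80,000 figure.
ratio=$(awk -v f=80000 -v t="$total" 'BEGIN { printf "%.1f", f / t }')
echo "implied fingerprints per function: $ratio"
```

If a library is added to or dropped from the phase, updating the `estimates` array gives the new total.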
## Monitoring Ingestion
```bash
# Check ingestion job status
stellaops corpus jobs list
# View statistics
stellaops corpus stats
# Query specific library coverage
stellaops corpus query --library glibc --show-versions
```
## Performance Considerations
- **Parallel ingestion:** Raise `STELLAOPS_INGESTION_WORKERS` to process packages concurrently
- **Disk I/O:** Local package cache significantly speeds up repeated ingestion
- **Database:** Ensure PostgreSQL has adequate memory for bulk inserts
- **Network:** Mirror selection impacts download speed
## Troubleshooting
### Package Download Failures
```
Error: Failed to download package from mirror
Solution: Check mirror availability, try alternative mirror
```
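Mirror availability can be checked without running an ingestion job. The sketch below assumes a Debian-style repository layout, where each distribution publishes `dists/<distribution>/Release`; it therefore applies to the Debian and Ubuntu mirrors but not to Alpine. The function names are hypothetical.

```bash
# Mirror check sketch for Debian-style repositories.

# Builds the Release-file URL for a mirror and distribution.
release_url() {
  local mirror=${1%/} dist=$2
  echo "$mirror/dists/$dist/Release"
}

# Succeeds when the mirror answers the Release request within 10 seconds.
mirror_reachable() {
  curl --silent --fail --head --location --max-time 10 \
       --output /dev/null "$(release_url "$1" "$2")"
}

# Example:
#   for dist in bullseye bookworm; do
#     mirror_reachable "http://deb.debian.org/debian" "$dist" \
#       || echo "unreachable: $dist"
#   done
```

If the check fails for one mirror but succeeds for another, pointing `mirrorUrl` at the working mirror is the quickest fix.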
### Fingerprint Generation Failures
```
Error: Failed to generate semantic fingerprint for function X
Solution: Check B2R2 support for architecture, verify binary format
```
### Database Connection Issues
```
Error: Could not connect to corpus database
Solution: Verify STELLAOPS_CORPUS_DB connection string, check PostgreSQL is running
```
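To separate a bad connection string from a stopped server, the host and port can be pulled out of `STELLAOPS_CORPUS_DB` and probed with PostgreSQL's `pg_isready`. This is a sketch under a few assumptions: keys are matched exactly as written (`Host`, `Port`), values contain no quoted semicolons, and the port defaults to 5432 when absent. Both function names are hypothetical.

```bash
# Connection check sketch; not a StellaOps command.

# Prints the value of <key> from a "Key=Value;Key=Value" connection string.
conn_field() {
  local conn=$1 key=$2 pair
  local -a pairs
  IFS=';' read -r -a pairs <<< "$conn"
  for pair in "${pairs[@]}"; do
    if [ "${pair%%=*}" = "$key" ]; then
      echo "${pair#*=}"
      return 0
    fi
  done
  return 1
}

# Probes the server named in STELLAOPS_CORPUS_DB.
check_corpus_db() {
  local host port
  host=$(conn_field "${STELLAOPS_CORPUS_DB:-}" Host) || {
    echo "no Host entry in STELLAOPS_CORPUS_DB" >&2
    return 1
  }
  port=$(conn_field "${STELLAOPS_CORPUS_DB:-}" Port) || port=5432
  pg_isready -h "$host" -p "$port"
}
```

`pg_isready` only reports whether the server accepts connections. A wrong username or password would still need to be tested with an actual login.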
## Next Steps
After successful ingestion:
1. Run clustering: `stellaops corpus cluster --library glibc`
2. Update CVE associations: `stellaops corpus update-cves`
3. Validate query performance: `stellaops corpus benchmark-query`
4. Export statistics: `stellaops corpus export-stats --output corpus-stats.json`
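
The four steps above can be chained so that a failure stops the sequence. The sketch below uses only the commands listed in this section; `run_step`, `run_post_ingestion`, and the `DRY_RUN` switch are hypothetical, and stopping on failure assumes each command exits non-zero when it fails.

```bash
# Post-ingestion sketch: run the documented steps in order, stopping at
# the first failure. DRY_RUN=1 (the default here) only prints each step.
run_step() {
  echo "==> $*"
  [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

run_post_ingestion() {
  local library=${1:-glibc}
  run_step stellaops corpus cluster --library "$library" &&
  run_step stellaops corpus update-cves &&
  run_step stellaops corpus benchmark-query &&
  run_step stellaops corpus export-stats --output corpus-stats.json
}
```

Running `DRY_RUN=0 run_post_ingestion openssl` would execute the steps for OpenSSL instead of the default glibc.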
## Related Documentation
- Database Schema: `docs/db/schemas/corpus.sql`
- Architecture: `docs/modules/binary-index/corpus-management.md`
- Sprint: `docs/implplan/SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`