Corpus Ingestion Operations Guide

Version: 1.0
Sprint: SPRINT_20260105_001_002_BINDEX
Status: Implementation Complete - Operational Execution Pending

Overview

This guide describes how to execute corpus ingestion operations to populate the function behavior corpus with fingerprints from known library functions.

Prerequisites

  • StellaOps.BinaryIndex.Corpus library built and deployed
  • PostgreSQL database with corpus schema (see docs/db/schemas/corpus.sql)
  • Network access to package mirrors (or local package cache)
  • Sufficient disk space (~100GB for full corpus)
  • Required tools:
    • .NET 10 runtime
    • HTTP client access to package repositories
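
The prerequisites above can be verified before starting a long run. The following is a minimal sketch, not part of the StellaOps tooling: `require_cmd` and `require_free_gb` are hypothetical helper names, and the 100 GiB threshold in the usage note simply mirrors the disk estimate listed above.

```shell
# Hypothetical preflight helpers; not part of the StellaOps CLI.

# Succeeds if the named command is on PATH.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "missing required command: $1" >&2
        return 1
    }
}

# Succeeds if the filesystem holding directory $1 has at least $2 GiB free.
require_free_gb() {
    local dir="$1" min_gb="$2" avail_kb avail_gb
    # df -Pk prints POSIX-format output in 1 KiB blocks; column 4 is "Available".
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] || return 1
    avail_gb=$((avail_kb / 1024 / 1024))
    if [ "$avail_gb" -lt "$min_gb" ]; then
        echo "only ${avail_gb} GiB free under $dir, need ${min_gb} GiB" >&2
        return 1
    fi
}
```

A typical call would be `require_cmd dotnet && require_free_gb /var/cache/stellaops/packages 100`, run against whichever directory will hold the package cache.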

Implementation Status

CORP-015, CORP-016, CORP-017: Implementation COMPLETE

All corpus connector implementations are complete and build successfully:

  • ✓ GlibcCorpusConnector (GNU C Library)
  • ✓ OpenSslCorpusConnector (OpenSSL)
  • ✓ ZlibCorpusConnector (zlib)
  • ✓ CurlCorpusConnector (libcurl)

Status: The connector code is complete. What remains is operational: running the ingestion jobs that download and ingest real package data.

Running Corpus Ingestion

1. Configure Package Sources

Set up access to package mirrors in your configuration:

# config/corpus-ingestion.yaml
packageSources:
  debian:
    mirrorUrl: "http://deb.debian.org/debian"
    distributions: ["bullseye", "bookworm"]
    components: ["main"]

  ubuntu:
    mirrorUrl: "http://archive.ubuntu.com/ubuntu"
    distributions: ["focal", "jammy"]

  alpine:
    mirrorUrl: "https://dl-cdn.alpinelinux.org/alpine"
    versions: ["v3.18", "v3.19"]
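
Before running ingestion it can help to confirm that each configured mirror actually serves a package index. The sketch below builds index URLs from the values above using the standard Debian/Ubuntu and Alpine repository layouts. Whether the StellaOps connectors query these exact files is an assumption, and `debian_index_url` and `alpine_index_url` are hypothetical helper names.

```shell
# Standard Debian/Ubuntu layout: dists/<dist>/<component>/binary-<arch>/Packages.gz
# Debian architecture names apply here (amd64, arm64), not x86_64/aarch64.
debian_index_url() {
    local mirror="$1" dist="$2" component="$3" arch="$4"
    printf '%s/dists/%s/%s/binary-%s/Packages.gz\n' "$mirror" "$dist" "$component" "$arch"
}

# Standard Alpine layout: <version>/<repository>/<arch>/APKINDEX.tar.gz
alpine_index_url() {
    local mirror="$1" version="$2" repo="$3" arch="$4"
    printf '%s/%s/%s/%s/APKINDEX.tar.gz\n' "$mirror" "$version" "$repo" "$arch"
}
```

For example, `curl --head --fail "$(debian_index_url http://deb.debian.org/debian bookworm main amd64)"` exits non-zero if the index is not reachable.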

2. Environment Variables

# Database connection
export STELLAOPS_CORPUS_DB="Host=localhost;Database=stellaops;Username=corpus_user;Password=..."

# Package cache directory (optional)
export STELLAOPS_PACKAGE_CACHE="/var/cache/stellaops/packages"

# Concurrent workers
export STELLAOPS_INGESTION_WORKERS=4
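
A short check of these variables catches configuration mistakes before any download starts. This is a sketch under stated assumptions: `check_ingestion_env` is a hypothetical helper, and since this guide does not say what the tooling does when a variable is unset, the function only validates values that are present.

```shell
# Hypothetical pre-run check for the environment variables shown above.
check_ingestion_env() {
    # The connection string is required.
    if [ -z "${STELLAOPS_CORPUS_DB:-}" ]; then
        echo "STELLAOPS_CORPUS_DB is not set" >&2
        return 1
    fi
    # The cache directory is optional; if set, create it and confirm it is writable.
    if [ -n "${STELLAOPS_PACKAGE_CACHE:-}" ]; then
        mkdir -p "$STELLAOPS_PACKAGE_CACHE" || return 1
        if [ ! -w "$STELLAOPS_PACKAGE_CACHE" ]; then
            echo "STELLAOPS_PACKAGE_CACHE is not writable: $STELLAOPS_PACKAGE_CACHE" >&2
            return 1
        fi
    fi
    # If a worker count is given, it must be a positive integer.
    case "${STELLAOPS_INGESTION_WORKERS:-1}" in
        *[!0-9]* | 0)
            echo "STELLAOPS_INGESTION_WORKERS must be a positive integer" >&2
            return 1
            ;;
    esac
}
```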

3. Execute Ingestion (CLI)

# Ingest specific library version
stellaops corpus ingest --library glibc --version 2.31 --architectures x86_64,aarch64

# Ingest version range
stellaops corpus ingest --library openssl --version-range "1.1.0..1.1.1" --architectures x86_64

# Ingest from local binary
stellaops corpus ingest-binary --library glibc --version 2.31 --arch x86_64 --path /usr/lib/x86_64-linux-gnu/libc.so.6

# Full ingestion job (all configured libraries)
stellaops corpus ingest-full --config config/corpus-ingestion.yaml
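
When several discrete versions are needed rather than a contiguous range, the single-version command can be wrapped in a loop. The sketch below uses only the flags shown above. `ingest_versions` and the `STELLAOPS_BIN` override are hypothetical, and the function assumes the CLI exits non-zero on failure, which this guide does not confirm.

```shell
# Ingest several versions of one library, continuing past individual failures.
# STELLAOPS_BIN is a hypothetical override so the loop can be dry-run with `echo`.
ingest_versions() {
    local library="$1" architectures="$2"
    shift 2
    local version failed=0
    for version in "$@"; do
        if ! "${STELLAOPS_BIN:-stellaops}" corpus ingest \
                --library "$library" --version "$version" \
                --architectures "$architectures"; then
            echo "ingestion failed: $library $version" >&2
            failed=$((failed + 1))
        fi
    done
    # Non-zero exit if any version failed, so callers and CI can react.
    [ "$failed" -eq 0 ]
}
```

Running `STELLAOPS_BIN=echo ingest_versions glibc x86_64,aarch64 2.31 2.35` prints the two commands that would be executed without contacting any mirror.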

4. Execute Ingestion (Programmatic)

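using Microsoft.Extensions.DependencyInjection; // GetRequiredService<T>()
using StellaOps.BinaryIndex.Corpus;
using StellaOps.BinaryIndex.Corpus.Connectors;

// Setup
var serviceProvider = ...; // Configure DI
var ct = CancellationToken.None; // Replace with a real token to support cancellation
var ingestionService = serviceProvider.GetRequiredService<ICorpusIngestionService>();
var glibcConnector = serviceProvider.GetRequiredService<GlibcCorpusConnector>();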

// Fetch available versions
var versions = await glibcConnector.GetAvailableVersionsAsync(ct);

// Ingest the first five available versions for each architecture
foreach (var version in versions.Take(5))
{
    foreach (var arch in new[] { "x86_64", "aarch64" })
    {
        try
        {
            var binary = await glibcConnector.FetchBinaryAsync(version, arch, abi: "gnu", ct);

            var metadata = new LibraryMetadata(
                Name: "glibc",
                Version: version,
                Architecture: arch,
                Abi: "gnu",
                Compiler: "gcc",
                OptimizationLevel: "O2"
            );

            using var stream = File.OpenRead(binary.Path);
            var result = await ingestionService.IngestLibraryAsync(metadata, stream, ct: ct);

            Console.WriteLine($"Ingested {result.FunctionsIndexed} functions from glibc {version} {arch}");
        }
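        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Log and continue with the next version/architecture; cancellation still propagates
            Console.Error.WriteLine($"Failed to ingest glibc {version} {arch}: {ex.Message}");
        }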
    }
}

Ingestion Workflow

1. Package Discovery
   └─> Query package mirror for available versions

2. Package Download
   └─> Fetch .deb/.apk/.rpm package
   └─> Extract binary files

3. Binary Analysis
   └─> Disassemble with B2R2
   └─> Lift to IR (semantic fingerprints)
   └─> Extract functions, imports, exports

4. Fingerprint Generation
   └─> Instruction-level fingerprints
   └─> Semantic graph fingerprints
   └─> API call sequence fingerprints
   └─> Combined fingerprints

5. Database Storage
   └─> Insert library/version records
   └─> Insert build variant records
   └─> Insert function records
   └─> Insert fingerprint records

6. Clustering (post-ingestion)
   └─> Group similar functions across versions
   └─> Compute centroids
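
Step 2 ends with "Extract binary files": once a package is unpacked, only ELF objects are candidates for disassembly. The following sketch shows one way to pick them out of an extracted directory by checking the ELF magic bytes. How the connectors actually select files is not described in this guide, so `is_elf` and `list_elf_files` are hypothetical illustrations rather than the implementation.

```shell
# True if the file starts with the ELF magic bytes 0x7f 'E' 'L' 'F'.
is_elf() {
    [ -f "$1" ] || return 1
    [ "$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')" = "7f454c46" ]
}

# Print every regular ELF file under an extracted package directory.
# Symlinks are skipped by -type f, so each binary is listed once.
list_elf_files() {
    find "$1" -type f | while IFS= read -r path; do
        if is_elf "$path"; then
            printf '%s\n' "$path"
        fi
    done
}
```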

Expected Corpus Coverage

Phase 2a (Priority Libraries)

| Library | Versions | Architectures | Est. Functions | Status |
|---------|----------|---------------|----------------|--------|
| glibc | 2.17, 2.28, 2.31, 2.35, 2.38 | x64, arm64, armv7 | ~15,000 | Ready to ingest |
| OpenSSL | 1.0.2, 1.1.0, 1.1.1, 3.0, 3.1 | x64, arm64 | ~8,000 | Ready to ingest |
| zlib | 1.2.8, 1.2.11, 1.2.13, 1.3 | x64, arm64 | ~200 | Ready to ingest |
| libcurl | 7.50-7.88 (select) | x64, arm64 | ~2,000 | Ready to ingest |
| SQLite | 3.30-3.44 (select) | x64, arm64 | ~1,500 | Ready to ingest |

Total Phase 2a: ~26,700 unique functions, ~80,000 fingerprints (with variants)
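
The totals can be cross-checked against the per-library estimates. The sketch below sums the five estimates and derives the average number of fingerprint records per function implied by the ~80,000 figure. The helper names are hypothetical, and the inputs are the rounded estimates from this section, not measured values.

```shell
# Sum the per-library function estimates passed as arguments.
sum_estimates() {
    local total=0 n
    for n in "$@"; do
        total=$((total + n))
    done
    echo "$total"
}

# Average fingerprints per function, to one decimal place.
fingerprints_per_function() {
    awk -v f="$1" -v n="$2" 'BEGIN { printf "%.1f\n", f / n }'
}
```

`sum_estimates 15000 8000 200 2000 1500` prints `26700`, matching the stated total, and `fingerprints_per_function 80000 26700` prints `3.0`, so the estimate corresponds to roughly three fingerprint records per function on average.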

Monitoring Ingestion

# Check ingestion job status
stellaops corpus jobs list

# View statistics
stellaops corpus stats

# Query specific library coverage
stellaops corpus query --library glibc --show-versions
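
For long-running jobs, the status command can be polled at a fixed interval. This is a minimal sketch: `poll_jobs` and the `STELLAOPS_BIN` override are hypothetical, and because the output format of `corpus jobs list` is not documented here, the loop only timestamps and prints it rather than parsing it.

```shell
# Print a UTC timestamp and the job list every $1 seconds, $2 times in total.
poll_jobs() {
    local interval="$1" rounds="$2" i=0
    while [ "$i" -lt "$rounds" ]; do
        date -u '+%Y-%m-%dT%H:%M:%SZ'
        "${STELLAOPS_BIN:-stellaops}" corpus jobs list || return 1
        i=$((i + 1))
        # Sleep between rounds, but not after the last one.
        if [ "$i" -lt "$rounds" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `poll_jobs 60 30 | tee ingestion-status.log` records one snapshot per minute for half an hour.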

Performance Considerations

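  • Parallel ingestion: Set STELLAOPS_INGESTION_WORKERS above 1 to run ingestion work concurrently
  • Disk I/O: Point STELLAOPS_PACKAGE_CACHE at a local directory so repeated ingestion reuses downloaded packages instead of fetching them again
  • Database: Ensure PostgreSQL has adequate memory for bulk inserts
  • Network: Download speed depends on the mirror; prefer one that is close to the ingestion host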
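
The benefit of a local package cache comes from skipping the download when a package is already on disk. The sketch below illustrates that pattern only: the cache layout (one file per URL, named by its `cksum` value) is an assumption and not the layout StellaOps uses, `cache_path_for` and `fetch_cached` are hypothetical names, and the fallback directory is the example path from the environment section.

```shell
# Hypothetical cache layout: <cache dir>/<cksum of URL>-<file name>.
cache_path_for() {
    local url="$1" key
    key=$(printf '%s' "$url" | cksum | cut -d ' ' -f 1)
    printf '%s/%s-%s\n' "${STELLAOPS_PACKAGE_CACHE:-/var/cache/stellaops/packages}" "$key" "${url##*/}"
}

# Print the local path for a package, downloading it only on a cache miss.
fetch_cached() {
    local url="$1" path
    path=$(cache_path_for "$url")
    if [ ! -s "$path" ]; then
        mkdir -p "${path%/*}" || return 1
        # Download to a temporary name so an interrupted transfer is never treated as a hit.
        curl --fail --silent --show-error --location --output "$path.part" "$url" || return 1
        mv "$path.part" "$path" || return 1
    fi
    printf '%s\n' "$path"
}
```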

Troubleshooting

Package Download Failures

Error: Failed to download package from mirror
Solution: Check that the mirror is reachable; if it is not, point mirrorUrl at an alternative mirror
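
Choosing an alternative mirror can be scripted by probing candidates in order. This is a sketch, not StellaOps functionality: `probe_mirror`, `first_reachable_mirror`, and the `MIRROR_PROBE` hook are hypothetical names, and the default probe assumes `curl` is installed.

```shell
# Default probe: an HTTP HEAD request that fails on HTTP errors or after 10 seconds.
probe_mirror() {
    curl --silent --head --fail --location --max-time 10 --output /dev/null "$1"
}

# Print the first URL from the argument list that the probe accepts.
# MIRROR_PROBE is a hypothetical hook for substituting another probe command.
first_reachable_mirror() {
    local url
    for url in "$@"; do
        if "${MIRROR_PROBE:-probe_mirror}" "$url"; then
            printf '%s\n' "$url"
            return 0
        fi
    done
    echo "no reachable mirror among: $*" >&2
    return 1
}
```

Called as `first_reachable_mirror http://deb.debian.org/debian https://mirror.example.org/debian` (the second URL is a placeholder), it prints the URL to use as `mirrorUrl`.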

Fingerprint Generation Failures

Error: Failed to generate semantic fingerprint for function X
Solution: Confirm that B2R2 supports the target architecture, and verify that the file is a valid binary in a supported format

Database Connection Issues

Error: Could not connect to corpus database
Solution: Verify STELLAOPS_CORPUS_DB connection string, check PostgreSQL is running
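
Both checks can be combined by reading the host and port out of the connection string and asking PostgreSQL whether it accepts connections. The sketch below assumes the `Key=Value;Key=Value` format shown in the environment section, exact key spelling (`Host`, `Port`), and that the PostgreSQL client tool `pg_isready` is installed. `conn_field` and `check_corpus_db` are hypothetical helpers.

```shell
# Extract one field (e.g. Host) from a "Key=Value;Key=Value" connection string.
# Values that themselves contain ';' are not handled by this simple split.
conn_field() {
    local key="$1" conn="$2"
    printf '%s\n' "$conn" | tr ';' '\n' |
        awk -v k="$key" 'index($0, k "=") == 1 { print substr($0, length(k) + 2); exit }'
}

# Check that the server named in STELLAOPS_CORPUS_DB accepts connections.
check_corpus_db() {
    local host port
    host=$(conn_field Host "${STELLAOPS_CORPUS_DB:-}")
    port=$(conn_field Port "${STELLAOPS_CORPUS_DB:-}")
    # Fall back to PostgreSQL's usual defaults when a field is absent.
    pg_isready -h "${host:-localhost}" -p "${port:-5432}"
}
```

`pg_isready` only reports whether the server is accepting connections; it does not validate the username or password in the connection string.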

Next Steps

After successful ingestion:

  1. Run clustering: stellaops corpus cluster --library glibc
  2. Update CVE associations: stellaops corpus update-cves
  3. Validate query performance: stellaops corpus benchmark-query
  4. Export statistics: stellaops corpus export-stats --output corpus-stats.json
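
The four steps can be chained so that a failure stops the sequence. This sketch uses only the commands listed above, in the order given. `run_post_ingestion` and the `STELLAOPS_BIN` override are hypothetical, and the function assumes each command exits non-zero on failure.

```shell
# Run the post-ingestion steps in order, stopping at the first failure.
# $1 is the library to cluster; $2 is the statistics output file (optional).
run_post_ingestion() {
    local library="$1" stats_file="${2:-corpus-stats.json}"
    local cli="${STELLAOPS_BIN:-stellaops}"
    "$cli" corpus cluster --library "$library" &&
        "$cli" corpus update-cves &&
        "$cli" corpus benchmark-query &&
        "$cli" corpus export-stats --output "$stats_file"
}
```

`STELLAOPS_BIN=echo run_post_ingestion glibc` prints the four commands without executing them, which is a quick way to review the sequence.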

Related Documentation

  • Database Schema: docs/db/schemas/corpus.sql
  • Architecture: docs/modules/binary-index/corpus-management.md
  • Sprint: docs/implplan/SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md