# Ground-Truth Corpus Architecture

> **Ownership:** BinaryIndex Guild
> **Status:** DRAFT
> **Version:** 1.0.0
> **Related:** [BinaryIndex Architecture](architecture.md), [Corpus Management](corpus-management.md), [Concelier AOC](../concelier/guides/aggregation-only-contract.md)

---

## 1. Overview

The **Ground-Truth Corpus** system provides a validated function-matching oracle for binary diff accuracy measurement. It uses the same plugin-based ingestion pattern as Concelier (advisories) and Excititor (VEX), applying **Aggregation-Only Contract (AOC)** principles to ensure immutable, deterministic, and replayable data.

### 1.1 Problem Statement

Function matching and binary diffing require ground-truth data to measure accuracy:

1. **No oracle for validation** - How do we know a function match is correct?
2. **Symbols stripped in production** - Debug info unavailable at scan time
3. **Compiler/optimization variance** - Same source produces different binaries
4. **Backport detection gaps** - Need pre/post pairs to validate patch detection

### 1.2 Solution: Distro Symbol Corpus

Leverage mainstream Linux distro artifacts as ground-truth:

| Source | What It Provides | Use Case |
|--------|------------------|----------|
| **Debian `.buildinfo`** | Exact build env records, often clearsigned | Reproducible oracle, build env metadata |
| **Fedora Koji + debuginfod** | Machine-queryable debuginfo with IMA verification | Symbol recovery for stripped binaries |
| **Ubuntu ddebs** | Debug symbol packages | Symbol-grounded truth for function names |
| **Alpine SecDB** | Precise CVE-to-backport mappings | Pre/post pair curation |

### 1.3 Module Scope

**In Scope:**
- Symbol recovery connectors (debuginfod, ddebs, .buildinfo)
- Ground-truth observations (immutable, append-only)
- Pre/post security pair curation
- Validation harness for function-matching accuracy
- Deterministic manifests for replayability

**Out of Scope:**
- Function matching algorithms (see [semantic-diffing.md](semantic-diffing.md))
- Fingerprint generation (see [corpus-management.md](corpus-management.md))
- Policy decisions (provided by Policy Engine)

---

## 2. Architecture

### 2.1 System Context

```
┌──────────────────────────────────────────────────────────────────────────┐
│                         External Symbol Sources                          │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐         │
│  │     Fedora      │   │     Ubuntu      │   │     Debian      │         │
│  │   debuginfod    │   │      ddebs      │   │   .buildinfo    │         │
│  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘         │
│           │                     │                     │                  │
│  ┌────────┴────────┐   ┌────────┴────────┐   ┌────────┴────────┐         │
│  │  Alpine SecDB   │   │   reproduce.    │   │    Upstream     │         │
│  │                 │   │   debian.net    │   │    tarballs     │         │
│  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘         │
└───────────│─────────────────────│─────────────────────│──────────────────┘
            │                     │                     │
            v                     v                     v
┌──────────────────────────────────────────────────────────────────────────┐
│                        Ground-Truth Corpus Module                        │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                       Symbol Source Connectors                       │ │
│ │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │ │
│ │  │ Debuginfod   │  │ Ddeb         │  │ Buildinfo    │                │ │
│ │  │ Connector    │  │ Connector    │  │ Connector    │                │ │
│ │  └──────────────┘  └──────────────┘  └──────────────┘                │ │
│ │  ┌──────────────┐  ┌──────────────┐                                  │ │
│ │  │ SecDB        │  │ Upstream     │                                  │ │
│ │  │ Connector    │  │ Connector    │                                  │ │
│ │  └──────────────┘  └──────────────┘                                  │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                    │                                     │
│                                    v                                     │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                        AOC Write Guard Layer                         │ │
│ │  ┌────────────────────────────────────────────────────────────────┐  │ │
│ │  │ • No derived scores at ingest                                  │  │ │
│ │  │ • Immutable observations + supersedes chain                    │  │ │
│ │  │ • Mandatory provenance (source URL, hash, signature)           │  │ │
│ │  │ • Idempotent upserts (keyed by content hash)                   │  │ │
│ │  │ • Deterministic canonical JSON                                 │  │ │
│ │  └────────────────────────────────────────────────────────────────┘  │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                    │                                     │
│                                    v                                     │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                      Storage Layer (PostgreSQL)                      │ │
│ │                                                                      │ │
│ │  groundtruth.symbol_sources      - Registered symbol providers       │ │
│ │  groundtruth.raw_documents       - Immutable raw payloads            │ │
│ │  groundtruth.symbol_observations - Normalized symbol records         │ │
│ │  groundtruth.security_pairs      - Pre/post CVE binary pairs         │ │
│ │  groundtruth.validation_runs     - Benchmark execution records       │ │
│ │  groundtruth.match_results       - Function match outcomes           │ │
│ │  groundtruth.source_state        - Cursor/sync state per source      │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                    │                                     │
│                                    v                                     │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                          Validation Harness                          │ │
│ │  ┌────────────────────────────────────────────────────────────────┐  │ │
│ │  │ IValidationHarness                                             │  │ │
│ │  │   - RunValidationAsync(pairs, matcherConfig)                   │  │ │
│ │  │   - GetMetricsAsync(runId) -> MatchRate, FP/FN, Unmatched      │  │ │
│ │  │   - ExportReportAsync(runId, format) -> Markdown/HTML          │  │ │
│ │  └────────────────────────────────────────────────────────────────┘  │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
```

### 2.2 Component Breakdown

#### 2.2.1 Symbol Source Connectors

Plugin-based connectors following the Concelier `IFeedConnector` pattern:

```csharp
public interface ISymbolSourceConnector
{
    string SourceId { get; }
    string[] SupportedDistros { get; }

    // Three-phase pipeline (matches Concelier pattern)
    Task FetchAsync(IServiceProvider sp, CancellationToken ct);  // Download raw docs
    Task ParseAsync(IServiceProvider sp, CancellationToken ct);  // Normalize to DTOs
    Task MapAsync(IServiceProvider sp, CancellationToken ct);    // Build observations
}
```
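As a rough illustration of the three-phase pipeline, the sketch below implements the interface above entirely in memory. `InMemoryConnector`, `RawDoc`, `SymbolDto`, `EmptyServices`, the `example-inmemory` source ID, the `example.invalid` URL, and the `debugId:symbol` payload format are all hypothetical and exist only for this example. A real connector would download and persist documents and pass its output through the AOC write guard.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Interface as documented above.
public interface ISymbolSourceConnector
{
    string SourceId { get; }
    string[] SupportedDistros { get; }
    Task FetchAsync(IServiceProvider sp, CancellationToken ct);
    Task ParseAsync(IServiceProvider sp, CancellationToken ct);
    Task MapAsync(IServiceProvider sp, CancellationToken ct);
}

// Hypothetical intermediate shapes; not part of the documented schema.
public sealed record RawDoc(string Uri, string Body);
public sealed record SymbolDto(string DebugId, string Name);

// Hypothetical connector: each phase consumes only the previous phase's output.
public sealed class InMemoryConnector : ISymbolSourceConnector
{
    private readonly List<RawDoc> _raw = new();
    private readonly List<SymbolDto> _dtos = new();

    public List<string> Observations { get; } = new();
    public string SourceId => "example-inmemory";
    public string[] SupportedDistros => new[] { "fedora" };

    // Phase 1: a real connector would download here; the sketch stores a canned payload.
    public Task FetchAsync(IServiceProvider sp, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        _raw.Add(new RawDoc("https://example.invalid/buildid/abc123/debuginfo",
                            "abc123:SSL_read;abc123:SSL_CTX_new"));
        return Task.CompletedTask;
    }

    // Phase 2: normalize "debugId:symbol" entries (an invented format) into DTOs.
    public Task ParseAsync(IServiceProvider sp, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        foreach (var doc in _raw)
        {
            foreach (var entry in doc.Body.Split(';', StringSplitOptions.RemoveEmptyEntries))
            {
                var parts = entry.Split(':', 2);
                _dtos.Add(new SymbolDto(parts[0], parts[1]));
            }
        }
        _raw.Clear();
        return Task.CompletedTask;
    }

    // Phase 3: one observation per debug ID; ordinal sorting keeps the output deterministic.
    // The revision is fixed at 1 here, which is an assumption of the sketch.
    public Task MapAsync(IServiceProvider sp, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        foreach (var group in _dtos.GroupBy(d => d.DebugId).OrderBy(g => g.Key, StringComparer.Ordinal))
        {
            var names = group.Select(d => d.Name).OrderBy(n => n, StringComparer.Ordinal);
            var joined = string.Join(", ", names);
            Observations.Add($"groundtruth:{SourceId}:{group.Key}:1 [{joined}]");
        }
        _dtos.Clear();
        return Task.CompletedTask;
    }
}

// Minimal service provider; the sketch resolves no services.
public sealed class EmptyServices : IServiceProvider
{
    public object? GetService(Type serviceType) => null;
}

public static class Program
{
    public static async Task Main()
    {
        var connector = new InMemoryConnector();
        var services = new EmptyServices();
        await connector.FetchAsync(services, CancellationToken.None);
        await connector.ParseAsync(services, CancellationToken.None);
        await connector.MapAsync(services, CancellationToken.None);
        foreach (var observation in connector.Observations)
            Console.WriteLine(observation);
    }
}
```

Keeping the phases separate means a failed parse or map step can be retried from stored raw documents without fetching again, which is presumably why the pattern is shared with Concelier.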
**Implementations:**

| Connector | Source | Data Retrieved |
|-----------|--------|----------------|
| `DebuginfodConnector` | Fedora/RHEL debuginfod | ELF debuginfo, source files |
| `DdebConnector` | Ubuntu ddebs repos | .ddeb packages with DWARF |
| `BuildinfoConnector` | Debian .buildinfo | Build env, checksums, signatures |
| `SecDbConnector` | Alpine SecDB | CVE-to-fix mappings |
| `UpstreamConnector` | GitHub/tarballs | Upstream release sources |

#### 2.2.2 AOC Write Guard

Enforces aggregation-only invariants (mirrors `IAdvisoryObservationWriteGuard`):

```csharp
public interface ISymbolObservationWriteGuard
{
    WriteDisposition ValidateWrite(
        SymbolObservation candidate,
        string? existingContentHash);
}

public enum WriteDisposition
{
    Proceed,        // Insert new observation
    SkipIdentical,  // Idempotent re-insert, no-op
    RejectMutation  // Reject (append-only violation)
}
```

**Invariants Enforced:**

| Invariant | What It Enforces |
|-----------|------------------|
| No derived scores | Reject `confidence`, `accuracy`, `match_score` at ingest |
| Immutable observations | No in-place updates; new revisions use `supersedes` |
| Mandatory provenance | Require `source_url`, `fetched_at`, `content_hash`, `signature_state` |
| Idempotent upserts | Key by `(source_id, debug_id, content_hash)` |
| Deterministic canonical JSON | Sorted JSON keys, UTC ISO-8601, stable hashes |

#### 2.2.3 Security Pair Curation

Manages pre/post CVE binary pairs for validation:

```csharp
public interface ISecurityPairService
{
    // Curate a pre/post pair for a CVE
    Task CreatePairAsync(
        string cveId,
        BinaryReference vulnerableBinary,
        BinaryReference patchedBinary,
        PairMetadata metadata,
        CancellationToken ct);

    // Get pairs for validation
    Task<ImmutableArray<SecurityPair>> GetPairsAsync(
        SecurityPairQuery query,
        CancellationToken ct);
}

public sealed record SecurityPair(
    string PairId,
    string CveId,
    BinaryReference VulnerableBinary,
    BinaryReference PatchedBinary,
    string[] AffectedFunctions,  // Symbol names of vulnerable functions
    string[] ChangedFunctions,   // Symbol names of patched functions
    DiffMetadata Diff,           // Upstream patch info
    ProvenanceInfo Provenance);
```

#### 2.2.4 Validation Harness

Runs function-matching validation with metrics:

```csharp
public interface IValidationHarness
{
    // Execute validation run
    Task RunAsync(
        ValidationConfig config,
        CancellationToken ct);

    // Get metrics for a run
    Task<ValidationMetrics> GetMetricsAsync(
        Guid runId,
        CancellationToken ct);

    // Export report
    Task ExportReportAsync(
        Guid runId,
        ReportFormat format,
        CancellationToken ct);
}

public sealed record ValidationMetrics(
    int TotalFunctions,
    int CorrectMatches,
    int FalsePositives,
    int FalseNegatives,
    int Unmatched,
    decimal MatchRate,
    decimal Precision,
    decimal Recall,
    ImmutableArray<MismatchBucket> MismatchBuckets);

public sealed record MismatchBucket(
    string Cause,  // inlining, lto, optimization, pic_thunk
    int Count,
    ImmutableArray<string> Examples);
```

---

## 3. Database Schema

### 3.1 Symbol Sources

```sql
CREATE TABLE groundtruth.symbol_sources (
    source_id TEXT PRIMARY KEY,
    display_name TEXT NOT NULL,
    connector_type TEXT NOT NULL,  -- debuginfod, ddeb, buildinfo, secdb
    base_url TEXT NOT NULL,
    enabled BOOLEAN DEFAULT TRUE,
    config_json JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
```

### 3.2 Raw Documents (Immutable)

```sql
CREATE TABLE groundtruth.raw_documents (
    digest TEXT PRIMARY KEY,  -- sha256:{hex}
    source_id TEXT NOT NULL REFERENCES groundtruth.symbol_sources(source_id),
    document_uri TEXT NOT NULL,
    fetched_at TIMESTAMPTZ NOT NULL,
    recorded_at TIMESTAMPTZ DEFAULT NOW(),
    content_type TEXT NOT NULL,
    content_size_bytes INT,
    etag TEXT,
    signature_state TEXT,  -- verified, unverified, failed
    payload_json JSONB,
    UNIQUE (source_id, document_uri, etag)
);

CREATE INDEX idx_raw_documents_source_fetched
    ON groundtruth.raw_documents(source_id, fetched_at DESC);
```

### 3.3 Symbol Observations (Immutable)

```sql
CREATE TABLE groundtruth.symbol_observations (
    observation_id TEXT PRIMARY KEY,  -- groundtruth:{source}:{debug_id}:{revision}
    source_id TEXT NOT NULL,
    debug_id TEXT NOT NULL,  -- ELF build-id, PE GUID, Mach-O UUID
    code_id TEXT,            -- GNU build-id or PE checksum

    -- Binary metadata
    binary_name TEXT NOT NULL,
    binary_path TEXT,
    architecture TEXT NOT NULL,  -- x86_64, aarch64, armv7

    -- Package provenance
    distro TEXT,  -- debian, ubuntu, fedora, alpine
    distro_version TEXT,
    package_name TEXT,
    package_version TEXT,

    -- Symbols
    symbols_json JSONB NOT NULL,  -- Array of {name, address, size, type}
    symbol_count INT NOT NULL,

    -- Build metadata (from .buildinfo or debuginfo)
    compiler TEXT,
    compiler_version TEXT,
    optimization_level TEXT,
    build_flags_json JSONB,

    -- Provenance
    document_digest TEXT REFERENCES groundtruth.raw_documents(digest),
    content_hash TEXT NOT NULL,
    supersedes_id TEXT REFERENCES groundtruth.symbol_observations(observation_id),
    created_at TIMESTAMPTZ DEFAULT NOW(),

    UNIQUE (source_id, debug_id, content_hash)
);

CREATE INDEX idx_symbol_observations_debug_id
    ON groundtruth.symbol_observations(debug_id);
CREATE INDEX idx_symbol_observations_package
    ON groundtruth.symbol_observations(distro, package_name, package_version);
```

### 3.4 Security Pairs

```sql
CREATE TABLE groundtruth.security_pairs (
    pair_id TEXT PRIMARY KEY,
    cve_id TEXT NOT NULL,

    -- Vulnerable binary
    vuln_observation_id TEXT NOT NULL
        REFERENCES groundtruth.symbol_observations(observation_id),
    vuln_debug_id TEXT NOT NULL,

    -- Patched binary
    patch_observation_id TEXT NOT NULL
        REFERENCES groundtruth.symbol_observations(observation_id),
    patch_debug_id TEXT NOT NULL,

    -- Affected function mapping
    affected_functions_json JSONB NOT NULL,  -- [{name, vuln_addr, patch_addr}]
    changed_functions_json JSONB NOT NULL,

    -- Upstream diff reference
    upstream_commit TEXT,
    upstream_patch_url TEXT,

    -- Metadata
    distro TEXT NOT NULL,
    package_name TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    created_by TEXT
);

CREATE INDEX idx_security_pairs_cve
    ON groundtruth.security_pairs(cve_id);
CREATE INDEX idx_security_pairs_package
    ON groundtruth.security_pairs(distro, package_name);
```

### 3.5 Validation Runs

```sql
CREATE TABLE groundtruth.validation_runs (
    run_id UUID PRIMARY KEY,
    config_json JSONB NOT NULL,  -- Matcher config, thresholds
    started_at TIMESTAMPTZ NOT NULL,
    completed_at TIMESTAMPTZ,
    status TEXT NOT NULL,  -- running, completed, failed

    -- Aggregate metrics
    total_functions INT,
    correct_matches INT,
    false_positives INT,
    false_negatives INT,
    unmatched INT,
    match_rate DECIMAL(5,4),
    precision DECIMAL(5,4),
    recall DECIMAL(5,4),

    -- Environment
    matcher_version TEXT NOT NULL,
    corpus_snapshot_id TEXT,
    created_by TEXT
);

CREATE TABLE groundtruth.match_results (
    result_id UUID PRIMARY KEY,
    run_id UUID NOT NULL REFERENCES groundtruth.validation_runs(run_id),

    -- Ground truth
    pair_id TEXT NOT NULL REFERENCES groundtruth.security_pairs(pair_id),
    function_name TEXT NOT NULL,
    expected_match BOOLEAN NOT NULL,

    -- Actual result
    actual_match BOOLEAN,
    match_score DECIMAL(5,4),
    matched_function TEXT,

    -- Classification
    outcome TEXT NOT NULL,  -- true_positive, false_positive, false_negative, unmatched
    mismatch_cause TEXT,    -- inlining, lto, optimization, pic_thunk, etc.

    -- Debug info
    debug_json JSONB
);

CREATE INDEX idx_match_results_run
    ON groundtruth.match_results(run_id);
CREATE INDEX idx_match_results_outcome
    ON groundtruth.match_results(run_id, outcome);
```

### 3.6 Source State (Cursor Tracking)

```sql
CREATE TABLE groundtruth.source_state (
    source_id TEXT PRIMARY KEY REFERENCES groundtruth.symbol_sources(source_id),
    enabled BOOLEAN DEFAULT TRUE,
    cursor_json JSONB,  -- last_modified, last_id, pending_docs
    last_success_at TIMESTAMPTZ,
    last_error TEXT,
    backoff_until TIMESTAMPTZ
);
```

---

## 4. Connector Specifications

### 4.1 Debuginfod Connector (Fedora/RHEL)

**Data Source:** `https://debuginfod.fedoraproject.org`

**Fetch Flow:**
1. Query debuginfod for build-id: `GET /buildid/{build_id}/debuginfo`
2. Retrieve DWARF sections (.debug_info, .debug_line)
3. Parse symbols using libdw
4. Store observation with IMA signature verification

**Configuration:**
```yaml
debuginfod:
  base_url: "https://debuginfod.fedoraproject.org"
  timeout_seconds: 30
  verify_ima: true
  cache_dir: "/var/cache/stellaops/debuginfod"
```

### 4.2 Ddeb Connector (Ubuntu)

**Data Source:** `http://ddebs.ubuntu.com`

**Fetch Flow:**
1. Query Packages index for `-dbgsym` packages
2. Download `.ddeb` archive
3. Extract DWARF from `/usr/lib/debug/.build-id/`
4. Parse symbols, map to corresponding binary package

**Configuration:**
```yaml
ddeb:
  mirror_url: "http://ddebs.ubuntu.com"
  distributions: ["focal", "jammy", "noble"]
  components: ["main", "universe"]
  cache_dir: "/var/cache/stellaops/ddebs"
```

### 4.3 Buildinfo Connector (Debian)

**Data Source:** `https://buildinfos.debian.net`

**Fetch Flow:**
1. Query buildinfo index for package
2. Download `.buildinfo` file (often clearsigned)
3. Parse build environment (compiler, flags, checksums)
4. Cross-reference with snapshot.debian.org for exact binary

**Configuration:**
```yaml
buildinfo:
  index_url: "https://buildinfos.debian.net"
  snapshot_url: "https://snapshot.debian.org"
  reproducible_url: "https://reproduce.debian.net"
  verify_signature: true
```

### 4.4 SecDB Connector (Alpine)

**Data Source:** `https://github.com/alpinelinux/alpine-secdb`

**Fetch Flow:**
1. Clone/pull secdb repository
2. Parse YAML files per branch (e.g. v3.18, v3.19, edge)
3. Map CVE to fixed/unfixed package versions
4. Cross-reference with aports for patch info

**Configuration:**
```yaml
secdb:
  repo_url: "https://github.com/alpinelinux/alpine-secdb"
  branches: ["v3.18", "v3.19", "v3.20", "edge"]
  aports_url: "https://gitlab.alpinelinux.org/alpine/aports"
```

---

## 5. Validation Pipeline

### 5.1 Harness Workflow

```
1. Assemble
   └─> Given package + CVE, fetch: binaries, debuginfo, .buildinfo, upstream tarball

2. Recover Symbols
   └─> Resolve build-id → symbols via debuginfod/ddebs
   └─> Fallback: Debian rebuild from .buildinfo

3. Lift Functions
   └─> Batch-lift .text functions → IR
   └─> Cache per build-id

4. Fingerprint
   └─> Emit deterministic + fuzzy signatures
   └─> Store as JSON lines

5. Match
   └─> Pre→post function matching
   └─> Write row per function with scores

6. Score
   └─> Compute metrics (match rate, FP/FN, precision, recall)
   └─> Bucket mismatches by cause

7. Report
   └─> Markdown/HTML with tables + diffs
   └─> Attach env hashes and artifact URLs
```

### 5.2 Metrics Tracked

| Metric | Description |
|--------|-------------|
| `match_rate` | Correct matches / total functions |
| `precision` | True positives / (true positives + false positives) |
| `recall` | True positives / (true positives + false negatives) |
| `unmatched_rate` | Unmatched / total functions |

### 5.3 Mismatch Buckets

| Cause | Description | Mitigation |
|-------|-------------|------------|
| `inlining` | Function inlined, no direct match | Inline expansion in fingerprint |
| `lto` | Link-time optimization changed structure | Cross-module fingerprints |
| `optimization` | Different -O level | Semantic fingerprints |
| `pic_thunk` | Position-independent code stubs | Filter PIC thunks |
| `versioned_symbol` | GLIBC symbol versioning | Version-aware matching |
| `renamed` | Symbol renamed (macro, alias) | Alias resolution |

---

## 6. Evidence Objects

### 6.1 Ground-Truth Attestation Predicate

```json
{
  "predicateType": "https://stella-ops.org/predicates/groundtruth/v1",
  "predicate": {
    "observationId": "groundtruth:debuginfod:abc123def456:1",
    "debugId": "abc123def456789...",
    "binaryIdentity": {
      "name": "libssl.so.3",
      "sha256": "sha256:...",
      "architecture": "x86_64"
    },
    "symbolSource": {
      "sourceId": "debuginfod-fedora",
      "fetchedAt": "2026-01-19T10:00:00Z",
      "documentUri": "https://debuginfod.fedoraproject.org/buildid/abc123/debuginfo",
      "signatureState": "verified"
    },
    "symbols": [
      {"name": "SSL_CTX_new", "address": "0x1234", "size": 256},
      {"name": "SSL_read", "address": "0x5678", "size": 512}
    ],
    "buildMetadata": {
      "compiler": "gcc",
      "compilerVersion": "12.2.0",
      "optimizationLevel": "O2",
      "buildFlags": ["-fstack-protector-strong", "-D_FORTIFY_SOURCE=2"]
    }
  }
}
```

### 6.2 Validation Run Attestation

```json
{
  "predicateType": "https://stella-ops.org/predicates/validation-run/v1",
  "predicate": {
    "runId": "550e8400-e29b-41d4-a716-446655440000",
    "config": {
      "matcherVersion": "binaryindex-semantic-diffing:1.2.0",
      "thresholds": {
        "minSimilarity": 0.85,
        "semanticWeight": 0.35,
        "instructionWeight": 0.25
      }
    },
    "corpus": {
      "snapshotId": "corpus:2026-01-19",
      "functionCount": 30000,
      "libraryCount": 5
    },
    "metrics": {
      "totalFunctions": 1500,
      "correctMatches": 1380,
      "falsePositives": 15,
      "falseNegatives": 45,
      "unmatched": 60,
      "matchRate": 0.92,
      "precision": 0.989,
      "recall": 0.968
    },
    "mismatchBuckets": [
      {"cause": "inlining", "count": 25},
      {"cause": "lto", "count": 12},
      {"cause": "optimization", "count": 8}
    ],
    "executedAt": "2026-01-19T10:30:00Z"
  }
}
```

---

## 7. CLI Commands

```bash
# Symbol source management
stella groundtruth sources list
stella groundtruth sources enable debuginfod-fedora
stella groundtruth sources sync --source debuginfod-fedora

# Symbol observation queries
stella groundtruth symbols lookup --debug-id abc123
stella groundtruth symbols search --package openssl --distro debian

# Security pair management
stella groundtruth pairs create \
  --cve CVE-2024-1234 \
  --vuln-pkg openssl=3.0.10-1 \
  --patch-pkg openssl=3.0.11-1

stella groundtruth pairs list --cve CVE-2024-1234

# Validation harness
stella groundtruth validate run \
  --pairs "openssl:CVE-2024-*" \
  --matcher semantic-diffing \
  --output validation-report.md

stella groundtruth validate metrics --run-id abc123
stella groundtruth validate export --run-id abc123 --format html
```

---

## 8. Doctor Checks

The ground-truth corpus integrates with Doctor for availability checks:

```csharp
// stellaops.doctor.binaryanalysis plugin
public sealed class BinaryAnalysisDoctorPlugin : IDoctorPlugin
{
    public string Name => "stellaops.doctor.binaryanalysis";

    public IEnumerable GetChecks()
    {
        yield return new DebuginfodAvailabilityCheck();
        yield return new DdebRepoEnabledCheck();
        yield return new BuildinfoCacheCheck();
        yield return new SymbolRecoveryFallbackCheck();
    }
}
```

| Check | Description | Remediation |
|-------|-------------|-------------|
| `debuginfod_urls_configured` | Verify that the `DEBUGINFOD_URLS` environment variable is set | Set `DEBUGINFOD_URLS` |
| `ddeb_repos_enabled` | Check Ubuntu ddeb sources | Enable ddebs repo |
| `buildinfo_cache_accessible` | Validate that buildinfos.debian.net is reachable | Check network/firewall |
| `symbol_recovery_fallback` | Ensure fallback path works | Configure local cache |

---

## 9. Air-Gap Support

For offline/air-gapped deployments, symbol data is exported as a signed bundle and imported on the isolated side.

### 9.1 Symbol Bundle Format

```
symbol-bundle-2026-01-19/
├── manifest.json           # Bundle metadata + checksums
├── sources/
│   ├── debuginfod/
│   │   └── *.debuginfo     # Pre-fetched debuginfo
│   ├── ddebs/
│   │   └── *.ddeb          # Pre-fetched ddebs
│   └── buildinfo/
│       └── *.buildinfo     # Pre-fetched buildinfo
├── observations/
│   └── *.ndjson            # Pre-computed observations
└── DSSE.envelope           # Signed attestation
```

### 9.2 Offline Sync

```bash
# Export bundle for air-gap transfer
stella groundtruth bundle export \
  --packages openssl,zlib,glibc \
  --distros debian,fedora \
  --output symbol-bundle.tar.gz

# Import bundle in air-gapped environment
stella groundtruth bundle import \
  --input symbol-bundle.tar.gz \
  --verify-signature
```

---

## 10. Related Documentation

- [BinaryIndex Architecture](architecture.md)
- [Semantic Diffing](semantic-diffing.md)
- [Corpus Management](corpus-management.md)
- [Concelier AOC](../concelier/guides/aggregation-only-contract.md)
- [Excititor Architecture](../excititor/architecture.md)