git.stella-ops.org/docs/modules/binary-index/ground-truth-corpus.md
2026-01-20 00:45:38 +02:00

Ground-Truth Corpus Architecture

Ownership: BinaryIndex Guild
Status: DRAFT
Version: 1.0.0
Related: BinaryIndex Architecture, Corpus Management, Concelier AOC


1. Overview

The Ground-Truth Corpus system provides a validated function-matching oracle for measuring binary diff accuracy. It uses the same plugin-based ingestion pattern as Concelier (advisories) and Excititor (VEX) and applies Aggregation-Only Contract (AOC) principles so that ingested data stays immutable, deterministic, and replayable.

1.1 Problem Statement

Measuring the accuracy of function matching and binary diffing requires ground-truth data. Four gaps make that data hard to come by:

  1. No oracle for validation - How do we know a function match is correct?
  2. Symbols stripped in production - Debug info unavailable at scan time
  3. Compiler/optimization variance - Same source produces different binaries
  4. Backport detection gaps - Need pre/post pairs to validate patch detection

1.2 Solution: Distro Symbol Corpus

Leverage mainstream Linux distro artifacts as ground-truth:

| Source | What It Provides | Use Case |
|--------|------------------|----------|
| Debian .buildinfo | Exact build env records, often clearsigned | Reproducible oracle, build env metadata |
| Fedora Koji + debuginfod | Machine-queryable debuginfo with IMA verification | Symbol recovery for stripped binaries |
| Ubuntu ddebs | Debug symbol packages | Symbol-grounded truth for function names |
| Alpine SecDB | Precise CVE-to-backport mappings | Pre/post pair curation |

1.3 Module Scope

In Scope:

  • Symbol recovery connectors (debuginfod, ddebs, .buildinfo)
  • Ground-truth observations (immutable, append-only)
  • Pre/post security pair curation
  • Validation harness for function-matching accuracy
  • Deterministic manifests for replayability

Out of Scope:


2. Architecture

2.1 System Context

┌──────────────────────────────────────────────────────────────────────────┐
│                         External Symbol Sources                           │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐           │
│  │ Fedora          │  │ Ubuntu          │  │ Debian          │           │
│  │ debuginfod      │  │ ddebs           │  │ .buildinfo      │           │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘           │
│           │                    │                    │                    │
│  ┌────────┴────────┐  ┌────────┴────────┐  ┌────────┴────────┐           │
│  │ Alpine SecDB    │  │ reproduce.      │  │ Upstream        │           │
│  │                 │  │ debian.net      │  │ tarballs        │           │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘           │
└───────────│────────────────────│────────────────────│────────────────────┘
            │                    │                    │
            v                    v                    v
┌──────────────────────────────────────────────────────────────────────────┐
│                   Ground-Truth Corpus Module                              │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                  Symbol Source Connectors                           │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │ │
│  │  │ Debuginfod   │  │ Ddeb         │  │ Buildinfo    │              │ │
│  │  │ Connector    │  │ Connector    │  │ Connector    │              │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘              │ │
│  │  ┌──────────────┐  ┌──────────────┐                                │ │
│  │  │ SecDB        │  │ Upstream     │                                │ │
│  │  │ Connector    │  │ Connector    │                                │ │
│  │  └──────────────┘  └──────────────┘                                │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
│                                │                                          │
│                                v                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                  AOC Write Guard Layer                              │ │
│  │  ┌──────────────────────────────────────────────────────────────┐  │ │
│  │  │  • No derived scores at ingest                               │  │ │
│  │  │  • Immutable observations + supersedes chain                 │  │ │
│  │  │  • Mandatory provenance (source URL, hash, signature)        │  │ │
│  │  │  • Idempotent upserts (keyed by content hash)                │  │ │
│  │  │  • Deterministic canonical JSON                              │  │ │
│  │  └──────────────────────────────────────────────────────────────┘  │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
│                                │                                          │
│                                v                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                  Storage Layer (PostgreSQL)                         │ │
│  │                                                                     │ │
│  │  groundtruth.symbol_sources     - Registered symbol providers      │ │
│  │  groundtruth.raw_documents      - Immutable raw payloads           │ │
│  │  groundtruth.symbol_observations- Normalized symbol records        │ │
│  │  groundtruth.security_pairs     - Pre/post CVE binary pairs        │ │
│  │  groundtruth.validation_runs    - Benchmark execution records      │ │
│  │  groundtruth.match_results      - Function match outcomes          │ │
│  │  groundtruth.source_state       - Cursor/sync state per source     │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
│                                │                                          │
│                                v                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                  Validation Harness                                 │ │
│  │  ┌──────────────────────────────────────────────────────────────┐  │ │
│  │  │  IValidationHarness                                          │  │ │
│  │  │  - RunAsync(config)                                          │  │ │
│  │  │  - GetMetricsAsync(runId) -> MatchRate, FP/FN, Unmatched     │  │ │
│  │  │  - ExportReportAsync(runId, format) -> Markdown/HTML         │  │ │
│  │  └──────────────────────────────────────────────────────────────┘  │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘

2.2 Component Breakdown

2.2.1 Symbol Source Connectors

Plugin-based connectors following the Concelier IFeedConnector pattern:

public interface ISymbolSourceConnector
{
    string SourceId { get; }
    string[] SupportedDistros { get; }

    // Three-phase pipeline (matches Concelier pattern)
    Task FetchAsync(IServiceProvider sp, CancellationToken ct);    // Download raw docs
    Task ParseAsync(IServiceProvider sp, CancellationToken ct);     // Normalize to DTOs
    Task MapAsync(IServiceProvider sp, CancellationToken ct);       // Build observations
}

Implementations:

| Connector | Source | Data Retrieved |
|-----------|--------|----------------|
| DebuginfodConnector | Fedora/RHEL debuginfod | ELF debuginfo, source files |
| DdebConnector | Ubuntu ddebs repos | .ddeb packages with DWARF |
| BuildinfoConnector | Debian .buildinfo | Build env, checksums, signatures |
| SecDbConnector | Alpine SecDB | CVE-to-fix mappings |
| UpstreamConnector | GitHub/tarballs | Upstream release sources |

2.2.2 AOC Write Guard

Enforces aggregation-only invariants (mirrors IAdvisoryObservationWriteGuard):

public interface ISymbolObservationWriteGuard
{
    WriteDisposition ValidateWrite(
        SymbolObservation candidate,
        string? existingContentHash);
}

public enum WriteDisposition
{
    Proceed,           // Insert new observation
    SkipIdentical,     // Idempotent re-insert, no-op
    RejectMutation     // Reject (append-only violation)
}
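
The three dispositions follow from comparing the candidate's content hash with whatever is already stored. A minimal Python sketch of that decision (the module itself is C#; `validate_write` is a hypothetical name, and the mapping assumes `existingContentHash` is the hash stored under the same observation ID, or null when no row exists):

```python
from enum import Enum
from typing import Optional


class WriteDisposition(Enum):
    PROCEED = "proceed"                  # insert new observation
    SKIP_IDENTICAL = "skip_identical"    # idempotent re-insert, no-op
    REJECT_MUTATION = "reject_mutation"  # append-only violation


def validate_write(candidate_hash: str, existing_hash: Optional[str]) -> WriteDisposition:
    """Assumed mapping: no stored row -> proceed; same hash -> skip; otherwise reject."""
    if existing_hash is None:
        return WriteDisposition.PROCEED
    if existing_hash == candidate_hash:
        return WriteDisposition.SKIP_IDENTICAL
    return WriteDisposition.REJECT_MUTATION
```

Under this reading a changed payload never overwrites the stored row; it has to arrive as a new revision that points back through `supersedes`.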

Invariants Enforced:

| Invariant | Enforcement |
|-----------|-------------|
| No derived scores | Reject confidence, accuracy, match_score at ingest |
| Immutable observations | No in-place updates; new revisions use supersedes |
| Mandatory provenance | Require source_url, fetched_at, content_hash, signature_state |
| Idempotent upserts | Key by (source_id, debug_id, content_hash) |
| Deterministic canonical JSON | Sorted JSON keys, UTC ISO-8601, stable hashes |
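
The derived-score check and the canonical-JSON rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the guard's actual code: the helper names are hypothetical, the `sha256:` prefix on the content hash is borrowed from the `raw_documents.digest` format, and the real C# implementation may canonicalize differently.

```python
import hashlib
import json
from datetime import datetime, timezone

# Field names taken from the "No derived scores" invariant.
FORBIDDEN_FIELDS = {"confidence", "accuracy", "match_score"}


def canonical_json(obj) -> str:
    # Sorted keys and fixed separators give one byte sequence per logical value.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def content_hash(observation: dict) -> str:
    derived = FORBIDDEN_FIELDS.intersection(observation)
    if derived:
        raise ValueError(f"derived fields not allowed at ingest: {sorted(derived)}")
    digest = hashlib.sha256(canonical_json(observation).encode("utf-8")).hexdigest()
    return f"sha256:{digest}"  # prefix is an assumption


def utc_iso8601(ts: datetime) -> str:
    # Expects a timezone-aware datetime; normalizes to UTC with a "Z" suffix.
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Because key order no longer affects the serialized bytes, two connectors that emit the same fields in a different order produce the same hash, which is what makes the `(source_id, debug_id, content_hash)` key idempotent.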

2.2.3 Security Pair Curation

Manages pre/post CVE binary pairs for validation:

public interface ISecurityPairService
{
    // Curate a pre/post pair for a CVE
    Task<SecurityPair> CreatePairAsync(
        string cveId,
        BinaryReference vulnerableBinary,
        BinaryReference patchedBinary,
        PairMetadata metadata,
        CancellationToken ct);

    // Get pairs for validation
    Task<ImmutableArray<SecurityPair>> GetPairsAsync(
        SecurityPairQuery query,
        CancellationToken ct);
}

public sealed record SecurityPair(
    string PairId,
    string CveId,
    BinaryReference VulnerableBinary,
    BinaryReference PatchedBinary,
    string[] AffectedFunctions,      // Symbol names of vulnerable functions
    string[] ChangedFunctions,       // Symbol names of patched functions
    DiffMetadata Diff,               // Upstream patch info
    ProvenanceInfo Provenance);

2.2.4 Validation Harness

Runs function-matching validation with metrics:

public interface IValidationHarness
{
    // Execute validation run
    Task<ValidationRun> RunAsync(
        ValidationConfig config,
        CancellationToken ct);

    // Get metrics for a run
    Task<ValidationMetrics> GetMetricsAsync(
        Guid runId,
        CancellationToken ct);

    // Export report
    Task<Stream> ExportReportAsync(
        Guid runId,
        ReportFormat format,
        CancellationToken ct);
}

public sealed record ValidationMetrics(
    int TotalFunctions,
    int CorrectMatches,
    int FalsePositives,
    int FalseNegatives,
    int Unmatched,
    decimal MatchRate,
    decimal Precision,
    decimal Recall,
    ImmutableArray<MismatchBucket> MismatchBuckets);

public sealed record MismatchBucket(
    string Cause,                    // inlining, lto, optimization, pic_thunk
    int Count,
    ImmutableArray<FunctionRef> Examples);

3. Database Schema

3.1 Symbol Sources

CREATE TABLE groundtruth.symbol_sources (
    source_id TEXT PRIMARY KEY,
    display_name TEXT NOT NULL,
    connector_type TEXT NOT NULL,    -- debuginfod, ddeb, buildinfo, secdb
    base_url TEXT NOT NULL,
    enabled BOOLEAN DEFAULT TRUE,
    config_json JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

3.2 Raw Documents (Immutable)

CREATE TABLE groundtruth.raw_documents (
    digest TEXT PRIMARY KEY,         -- sha256:{hex}
    source_id TEXT NOT NULL REFERENCES groundtruth.symbol_sources(source_id),
    document_uri TEXT NOT NULL,
    fetched_at TIMESTAMPTZ NOT NULL,
    recorded_at TIMESTAMPTZ DEFAULT NOW(),
    content_type TEXT NOT NULL,
    content_size_bytes INT,
    etag TEXT,
    signature_state TEXT,            -- verified, unverified, failed
    payload_json JSONB,
    UNIQUE (source_id, document_uri, etag)
);

CREATE INDEX idx_raw_documents_source_fetched
    ON groundtruth.raw_documents(source_id, fetched_at DESC);

3.3 Symbol Observations (Immutable)

CREATE TABLE groundtruth.symbol_observations (
    observation_id TEXT PRIMARY KEY,  -- groundtruth:{source}:{debug_id}:{revision}
    source_id TEXT NOT NULL,
    debug_id TEXT NOT NULL,           -- ELF build-id, PE GUID, Mach-O UUID
    code_id TEXT,                     -- GNU build-id or PE checksum

    -- Binary metadata
    binary_name TEXT NOT NULL,
    binary_path TEXT,
    architecture TEXT NOT NULL,       -- x86_64, aarch64, armv7

    -- Package provenance
    distro TEXT,                      -- debian, ubuntu, fedora, alpine
    distro_version TEXT,
    package_name TEXT,
    package_version TEXT,

    -- Symbols
    symbols_json JSONB NOT NULL,      -- Array of {name, address, size, type}
    symbol_count INT NOT NULL,

    -- Build metadata (from .buildinfo or debuginfo)
    compiler TEXT,
    compiler_version TEXT,
    optimization_level TEXT,
    build_flags_json JSONB,

    -- Provenance
    document_digest TEXT REFERENCES groundtruth.raw_documents(digest),
    content_hash TEXT NOT NULL,
    supersedes_id TEXT REFERENCES groundtruth.symbol_observations(observation_id),

    created_at TIMESTAMPTZ DEFAULT NOW(),

    UNIQUE (source_id, debug_id, content_hash)
);

CREATE INDEX idx_symbol_observations_debug_id
    ON groundtruth.symbol_observations(debug_id);
CREATE INDEX idx_symbol_observations_package
    ON groundtruth.symbol_observations(distro, package_name, package_version);
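
The `observation_id` format and the `supersedes_id` column together describe an append-only revision chain. A Python sketch of how a new revision might be derived, assuming `{revision}` is an integer that increases by one each time (the schema comment does not say how it is assigned); `Observation` and `revise` are hypothetical names:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    source_id: str
    debug_id: str
    revision: int
    content_hash: str
    supersedes_id: Optional[str] = None

    @property
    def observation_id(self) -> str:
        # Format from the schema comment: groundtruth:{source}:{debug_id}:{revision}
        return f"groundtruth:{self.source_id}:{self.debug_id}:{self.revision}"


def revise(previous: Observation, new_content_hash: str) -> Observation:
    """Append a new revision instead of updating the stored row in place."""
    if new_content_hash == previous.content_hash:
        return previous  # identical content: nothing to append
    return Observation(
        source_id=previous.source_id,
        debug_id=previous.debug_id,
        revision=previous.revision + 1,  # assumed numbering scheme
        content_hash=new_content_hash,
        supersedes_id=previous.observation_id,
    )
```

Following `supersedes_id` from the newest row back to the first one reconstructs the full history of a build-id without any row ever being rewritten.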

3.4 Security Pairs

CREATE TABLE groundtruth.security_pairs (
    pair_id TEXT PRIMARY KEY,
    cve_id TEXT NOT NULL,

    -- Vulnerable binary
    vuln_observation_id TEXT NOT NULL
        REFERENCES groundtruth.symbol_observations(observation_id),
    vuln_debug_id TEXT NOT NULL,

    -- Patched binary
    patch_observation_id TEXT NOT NULL
        REFERENCES groundtruth.symbol_observations(observation_id),
    patch_debug_id TEXT NOT NULL,

    -- Affected function mapping
    affected_functions_json JSONB NOT NULL,  -- [{name, vuln_addr, patch_addr}]
    changed_functions_json JSONB NOT NULL,

    -- Upstream diff reference
    upstream_commit TEXT,
    upstream_patch_url TEXT,

    -- Metadata
    distro TEXT NOT NULL,
    package_name TEXT NOT NULL,

    created_at TIMESTAMPTZ DEFAULT NOW(),
    created_by TEXT
);

CREATE INDEX idx_security_pairs_cve
    ON groundtruth.security_pairs(cve_id);
CREATE INDEX idx_security_pairs_package
    ON groundtruth.security_pairs(distro, package_name);

3.5 Validation Runs

CREATE TABLE groundtruth.validation_runs (
    run_id UUID PRIMARY KEY,
    config_json JSONB NOT NULL,       -- Matcher config, thresholds
    started_at TIMESTAMPTZ NOT NULL,
    completed_at TIMESTAMPTZ,
    status TEXT NOT NULL,             -- running, completed, failed

    -- Aggregate metrics
    total_functions INT,
    correct_matches INT,
    false_positives INT,
    false_negatives INT,
    unmatched INT,
    match_rate DECIMAL(5,4),
    precision DECIMAL(5,4),
    recall DECIMAL(5,4),

    -- Environment
    matcher_version TEXT NOT NULL,
    corpus_snapshot_id TEXT,

    created_by TEXT
);

CREATE TABLE groundtruth.match_results (
    result_id UUID PRIMARY KEY,
    run_id UUID NOT NULL REFERENCES groundtruth.validation_runs(run_id),

    -- Ground truth
    pair_id TEXT NOT NULL REFERENCES groundtruth.security_pairs(pair_id),
    function_name TEXT NOT NULL,
    expected_match BOOLEAN NOT NULL,

    -- Actual result
    actual_match BOOLEAN,
    match_score DECIMAL(5,4),
    matched_function TEXT,

    -- Classification
    outcome TEXT NOT NULL,            -- true_positive, false_positive, false_negative, unmatched
    mismatch_cause TEXT,              -- inlining, lto, optimization, pic_thunk, etc.

    -- Debug info
    debug_json JSONB
);

CREATE INDEX idx_match_results_run
    ON groundtruth.match_results(run_id);
CREATE INDEX idx_match_results_outcome
    ON groundtruth.match_results(run_id, outcome);

3.6 Source State (Cursor Tracking)

CREATE TABLE groundtruth.source_state (
    source_id TEXT PRIMARY KEY REFERENCES groundtruth.symbol_sources(source_id),
    enabled BOOLEAN DEFAULT TRUE,
    cursor_json JSONB,                -- last_modified, last_id, pending_docs
    last_success_at TIMESTAMPTZ,
    last_error TEXT,
    backoff_until TIMESTAMPTZ
);

4. Connector Specifications

4.1 Debuginfod Connector (Fedora/RHEL)

Data Source: https://debuginfod.fedoraproject.org

Fetch Flow:

  1. Query debuginfod for build-id: GET /buildid/{build_id}/debuginfo
  2. Retrieve DWARF sections (.debug_info, .debug_line)
  3. Parse symbols using libdw
  4. Store observation with IMA signature verification

Configuration:

debuginfod:
  base_url: "https://debuginfod.fedoraproject.org"
  timeout_seconds: 30
  verify_ima: true
  cache_dir: "/var/cache/stellaops/debuginfod"
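
The first fetch step can be sketched in Python using the `base_url` and `timeout_seconds` settings above. The `/buildid/{build_id}/debuginfo` path is the one listed in the fetch flow; the function names are illustrative, and IMA verification and caching are left out:

```python
import re
import urllib.request


def debuginfo_url(base_url: str, build_id: str) -> str:
    """Build the debuginfod query URL for a hex build-id."""
    build_id = build_id.lower()
    if not re.fullmatch(r"[0-9a-f]+", build_id):
        raise ValueError(f"build-id must be hexadecimal: {build_id!r}")
    return f"{base_url.rstrip('/')}/buildid/{build_id}/debuginfo"


def fetch_debuginfo(base_url: str, build_id: str, timeout_seconds: int = 30) -> bytes:
    # Performs a network request; a server without this build-id answers with an HTTP error.
    url = debuginfo_url(base_url, build_id)
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return response.read()
```

Validating the build-id before it is interpolated keeps malformed identifiers from reaching the server as arbitrary path segments.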

4.2 Ddeb Connector (Ubuntu)

Data Source: http://ddebs.ubuntu.com

Fetch Flow:

  1. Query Packages index for -dbgsym packages
  2. Download .ddeb archive
  3. Extract DWARF from /usr/lib/debug/.build-id/
  4. Parse symbols, map to corresponding binary package

Configuration:

ddeb:
  mirror_url: "http://ddebs.ubuntu.com"
  distributions: ["focal", "jammy", "noble"]
  components: ["main", "universe"]
  cache_dir: "/var/cache/stellaops/ddebs"
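
Step 1 amounts to filtering a Debian-style `Packages` index for `-dbgsym` entries. A Python sketch over an already-downloaded index follows; the sample stanza values are made up for illustration, and the parser deliberately handles only single-line fields:

```python
from typing import Dict, Iterator, List

# Illustrative index content, not copied from a real mirror.
SAMPLE_INDEX = """\
Package: libssl3-dbgsym
Version: 3.0.2-0ubuntu1
Architecture: amd64
Filename: pool/main/o/openssl/libssl3-dbgsym_3.0.2-0ubuntu1_amd64.ddeb

Package: not-a-debug-package
Version: 1.0
"""


def parse_stanzas(index_text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per blank-line-separated stanza."""
    for block in index_text.strip().split("\n\n"):
        fields: Dict[str, str] = {}
        for line in block.splitlines():
            # Continuation lines start with whitespace; this sketch skips them.
            if line[:1].isspace() or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key] = value.strip()
        if fields:
            yield fields


def dbgsym_packages(index_text: str) -> List[Dict[str, str]]:
    return [s for s in parse_stanzas(index_text) if s.get("Package", "").endswith("-dbgsym")]
```

The `Filename` field of each surviving stanza is the path, relative to the mirror root, that step 2 would download.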

4.3 Buildinfo Connector (Debian)

Data Source: https://buildinfos.debian.net

Fetch Flow:

  1. Query buildinfo index for package
  2. Download .buildinfo file (often clearsigned)
  3. Parse build environment (compiler, flags, checksums)
  4. Cross-reference with snapshot.debian.org for exact binary

Configuration:

buildinfo:
  index_url: "https://buildinfos.debian.net"
  snapshot_url: "https://snapshot.debian.org"
  reproducible_url: "https://reproduce.debian.net"
  verify_signature: true
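
Step 3 can be sketched in Python as follows. The sketch assumes the clearsign armor has already been verified and stripped, and it reads two fields of the `.buildinfo` control format: `Checksums-Sha256` and `Installed-Build-Depends`. The sample content and helper names are illustrative:

```python
import re
from typing import Dict, List, NamedTuple


class Artifact(NamedTuple):
    sha256: str
    size: int
    filename: str


# Illustrative content; the hash is a placeholder, not a real digest.
SAMPLE_BUILDINFO = (
    "Format: 1.0\n"
    "Source: openssl\n"
    "Version: 3.0.11-1\n"
    "Checksums-Sha256:\n"
    f" {'ab' * 32} 2048 libssl3_3.0.11-1_amd64.deb\n"
    "Installed-Build-Depends:\n"
    " gcc-12 (= 12.2.0-14),\n"
    " libc6-dev (= 2.36-9)\n"
)


def parse_buildinfo(text: str) -> Dict[str, str]:
    """Parse 'Key: value' fields; indented lines continue the previous field."""
    fields: Dict[str, str] = {}
    key = None
    for line in text.splitlines():
        if line[:1].isspace() and key is not None:
            fields[key] += "\n" + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields


def checksums(fields: Dict[str, str]) -> List[Artifact]:
    rows = [r.split() for r in fields.get("Checksums-Sha256", "").splitlines() if r.strip()]
    return [Artifact(digest, int(size), name) for digest, size, name in rows]


def installed_build_depends(fields: Dict[str, str]) -> Dict[str, str]:
    """Map each build dependency to its pinned version."""
    deps: Dict[str, str] = {}
    for entry in fields.get("Installed-Build-Depends", "").replace("\n", " ").split(","):
        match = re.match(r"\s*(\S+)\s*\(=\s*([^)]+)\)", entry)
        if match:
            deps[match.group(1)] = match.group(2).strip()
    return deps
```

The checksum rows identify the exact binaries to request from the snapshot service in step 4, and the pinned dependency versions are where compiler metadata such as the GCC version can be read from.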

4.4 SecDB Connector (Alpine)

Data Source: https://github.com/alpinelinux/alpine-secdb

Fetch Flow:

  1. Clone/pull secdb repository
  2. Parse YAML files per branch (v3.18, v3.19, edge)
  3. Map CVE to fixed/unfixed package versions
  4. Cross-reference with aports for patch info

Configuration:

secdb:
  repo_url: "https://github.com/alpinelinux/alpine-secdb"
  branches: ["v3.18", "v3.19", "v3.20", "edge"]
  aports_url: "https://gitlab.alpinelinux.org/alpine/aports"
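
Step 3 can be sketched in Python over one already-parsed secdb file. The `packages` / `pkg` / `secfixes` layout is assumed from the published secdb files and should be checked against the repository; the sample values are illustrative and `cve_fix_index` is a hypothetical name:

```python
from typing import Dict, List, Tuple

# Illustrative document, shaped like one parsed secdb YAML file (assumed layout).
SAMPLE_SECDB = {
    "distroversion": "v3.19",
    "reponame": "main",
    "packages": [
        {"pkg": {"name": "openssl", "secfixes": {"3.0.11-r0": ["CVE-2024-1234"]}}},
        {"pkg": {"name": "zlib", "secfixes": {}}},
    ],
}


def cve_fix_index(secdb: dict) -> Dict[str, List[Tuple[str, str]]]:
    """Map each CVE ID to the (package, fixed version) pairs that list it."""
    index: Dict[str, List[Tuple[str, str]]] = {}
    for entry in secdb.get("packages", []):
        pkg = entry["pkg"]
        for fixed_version, cves in pkg.get("secfixes", {}).items():
            for cve in cves:
                index.setdefault(cve, []).append((pkg["name"], fixed_version))
    return index
```

Each `(package, fixed version)` pair marks the boundary between a vulnerable and a patched build, which is the starting point for curating a pre/post pair.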

5. Validation Pipeline

5.1 Harness Workflow

1. Assemble
   └─> Given package + CVE, fetch: binaries, debuginfo, .buildinfo, upstream tarball

2. Recover Symbols
   └─> Resolve build-id → symbols via debuginfod/ddebs
   └─> Fallback: Debian rebuild from .buildinfo

3. Lift Functions
   └─> Batch-lift .text functions → IR
   └─> Cache per build-id

4. Fingerprint
   └─> Emit deterministic + fuzzy signatures
   └─> Store as JSON lines

5. Match
   └─> Pre→post function matching
   └─> Write row per function with scores

6. Score
   └─> Compute metrics (match rate, FP/FN, precision, recall)
   └─> Bucket mismatches by cause

7. Report
   └─> Markdown/HTML with tables + diffs
   └─> Attach env hashes and artifact URLs

5.2 Metrics Tracked

| Metric | Description |
|--------|-------------|
| match_rate | Correct matches / total functions |
| precision | True positives / (true positives + false positives) |
| recall | True positives / (true positives + false negatives) |
| unmatched_rate | Unmatched / total functions |
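
The four formulas can be written out as a short Python sketch. It assumes that "true positives" equals the `correct_matches` count and rounds to four decimal places to fit the `DECIMAL(5,4)` columns; `compute_metrics` is a hypothetical helper name:

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Dict


def _ratio(numerator: int, denominator: int) -> Decimal:
    if denominator == 0:
        return Decimal("0.0000")  # avoid division by zero on an empty run
    value = Decimal(numerator) / Decimal(denominator)
    return value.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


def compute_metrics(
    total_functions: int,
    correct_matches: int,
    false_positives: int,
    false_negatives: int,
    unmatched: int,
) -> Dict[str, Decimal]:
    return {
        "match_rate": _ratio(correct_matches, total_functions),
        "precision": _ratio(correct_matches, correct_matches + false_positives),
        "recall": _ratio(correct_matches, correct_matches + false_negatives),
        "unmatched_rate": _ratio(unmatched, total_functions),
    }
```

With the counts from the example attestation in section 6.2 (1500 functions, 1380 correct, 15 false positives, 45 false negatives, 60 unmatched) this yields 0.9200, 0.9892, 0.9684, and 0.0400.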

5.3 Mismatch Buckets

| Cause | Description | Mitigation |
|-------|-------------|------------|
| inlining | Function inlined, no direct match | Inline expansion in fingerprint |
| lto | Link-time optimization changed structure | Cross-module fingerprints |
| optimization | Different -O level | Semantic fingerprints |
| pic_thunk | Position-independent code stubs | Filter PIC thunks |
| versioned_symbol | GLIBC symbol versioning | Version-aware matching |
| renamed | Symbol renamed (macro, alias) | Alias resolution |
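
Bucketing is a count of `mismatch_cause` over the non-matching rows. A Python sketch, assuming rows shaped like `groundtruth.match_results`; the `unclassified` fallback label and the function name are hypothetical:

```python
from collections import Counter
from typing import Iterable, List, Mapping, Optional, Tuple


def bucket_mismatches(results: Iterable[Mapping[str, Optional[str]]]) -> List[Tuple[str, int]]:
    """Count mismatch causes across every row that is not a true positive."""
    counts = Counter(
        row.get("mismatch_cause") or "unclassified"
        for row in results
        if row["outcome"] != "true_positive"
    )
    # Sort by count (descending), then by cause name, so report order is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Sorting on a fixed key rather than relying on insertion order keeps repeated runs over the same corpus byte-identical in the exported report.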

6. Evidence Objects

6.1 Ground-Truth Attestation Predicate

{
  "predicateType": "https://stella-ops.org/predicates/groundtruth/v1",
  "predicate": {
    "observationId": "groundtruth:debuginfod:abc123def456:1",
    "debugId": "abc123def456789...",
    "binaryIdentity": {
      "name": "libssl.so.3",
      "sha256": "sha256:...",
      "architecture": "x86_64"
    },
    "symbolSource": {
      "sourceId": "debuginfod-fedora",
      "fetchedAt": "2026-01-19T10:00:00Z",
      "documentUri": "https://debuginfod.fedoraproject.org/buildid/abc123/debuginfo",
      "signatureState": "verified"
    },
    "symbols": [
      {"name": "SSL_CTX_new", "address": "0x1234", "size": 256},
      {"name": "SSL_read", "address": "0x5678", "size": 512}
    ],
    "buildMetadata": {
      "compiler": "gcc",
      "compilerVersion": "12.2.0",
      "optimizationLevel": "O2",
      "buildFlags": ["-fstack-protector-strong", "-D_FORTIFY_SOURCE=2"]
    }
  }
}

6.2 Validation Run Attestation

{
  "predicateType": "https://stella-ops.org/predicates/validation-run/v1",
  "predicate": {
    "runId": "550e8400-e29b-41d4-a716-446655440000",
    "config": {
      "matcherVersion": "binaryindex-semantic-diffing:1.2.0",
      "thresholds": {
        "minSimilarity": 0.85,
        "semanticWeight": 0.35,
        "instructionWeight": 0.25
      }
    },
    "corpus": {
      "snapshotId": "corpus:2026-01-19",
      "functionCount": 30000,
      "libraryCount": 5
    },
    "metrics": {
      "totalFunctions": 1500,
      "correctMatches": 1380,
      "falsePositives": 15,
      "falseNegatives": 45,
      "unmatched": 60,
      "matchRate": 0.92,
      "precision": 0.989,
      "recall": 0.968
    },
    "mismatchBuckets": [
      {"cause": "inlining", "count": 25},
      {"cause": "lto", "count": 12},
      {"cause": "optimization", "count": 8}
    ],
    "executedAt": "2026-01-19T10:30:00Z"
  }
}

7. CLI Commands

# Symbol source management
stella groundtruth sources list
stella groundtruth sources enable debuginfod-fedora
stella groundtruth sources sync --source debuginfod-fedora

# Symbol observation queries
stella groundtruth symbols lookup --debug-id abc123
stella groundtruth symbols search --package openssl --distro debian

# Security pair management
stella groundtruth pairs create \
    --cve CVE-2024-1234 \
    --vuln-pkg openssl=3.0.10-1 \
    --patch-pkg openssl=3.0.11-1

stella groundtruth pairs list --cve CVE-2024-1234

# Validation harness
stella groundtruth validate run \
    --pairs "openssl:CVE-2024-*" \
    --matcher semantic-diffing \
    --output validation-report.md

stella groundtruth validate metrics --run-id 550e8400-e29b-41d4-a716-446655440000
stella groundtruth validate export --run-id 550e8400-e29b-41d4-a716-446655440000 --format html

8. Doctor Checks

The ground-truth corpus integrates with Doctor for availability checks:

// stellaops.doctor.binaryanalysis plugin
public sealed class BinaryAnalysisDoctorPlugin : IDoctorPlugin
{
    public string Name => "stellaops.doctor.binaryanalysis";

    public IEnumerable<IDoctorCheck> GetChecks()
    {
        yield return new DebuginfodAvailabilityCheck();
        yield return new DdebRepoEnabledCheck();
        yield return new BuildinfoCacheCheck();
        yield return new SymbolRecoveryFallbackCheck();
    }
}

| Check | Description | Remediation |
|-------|-------------|-------------|
| debuginfod_urls_configured | Verify DEBUGINFOD_URLS env | Set env variable |
| ddeb_repos_enabled | Check Ubuntu ddeb sources | Enable ddebs repo |
| buildinfo_cache_accessible | Validate buildinfos.debian.net | Check network/firewall |
| symbol_recovery_fallback | Ensure fallback path works | Configure local cache |

9. Air-Gap Support

Offline and air-gapped deployments receive symbols as pre-built, signed bundles instead of querying the upstream sources directly.

9.1 Symbol Bundle Format

symbol-bundle-2026-01-19/
├── manifest.json           # Bundle metadata + checksums
├── sources/
│   ├── debuginfod/
│   │   └── *.debuginfo     # Pre-fetched debuginfo
│   ├── ddebs/
│   │   └── *.ddeb          # Pre-fetched ddebs
│   └── buildinfo/
│       └── *.buildinfo     # Pre-fetched buildinfo
├── observations/
│   └── *.ndjson            # Pre-computed observations
└── DSSE.envelope           # Signed attestation
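
The layout implies an integrity check on import: every file is compared against the checksum recorded in `manifest.json`. A Python sketch of that check follows. The manifest schema used here (a `files` object mapping relative paths to `sha256:{hex}` digests) is an assumption, since the format is not specified above, and DSSE envelope verification is not covered:

```python
import hashlib
import json
from pathlib import Path
from typing import List


def sha256_file(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large debuginfo files are not loaded whole.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def verify_bundle(bundle_dir: Path) -> List[str]:
    """Return the relative paths whose file is missing or whose checksum differs."""
    manifest = json.loads((bundle_dir / "manifest.json").read_text(encoding="utf-8"))
    failures: List[str] = []
    for rel_path, expected in sorted(manifest["files"].items()):  # assumed schema
        target = bundle_dir / rel_path
        if not target.is_file() or sha256_file(target) != expected:
            failures.append(rel_path)
    return failures
```

An empty result means every listed file matched; any returned path would be a reason to reject the bundle before its observations are ingested.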

9.2 Offline Sync

# Export bundle for air-gap transfer
stella groundtruth bundle export \
    --packages openssl,zlib,glibc \
    --distros debian,fedora \
    --output symbol-bundle.tar.gz

# Import bundle in air-gapped environment
stella groundtruth bundle import \
    --input symbol-bundle.tar.gz \
    --verify-signature