Ground-Truth Corpus Architecture
Ownership: BinaryIndex Guild
Status: DRAFT
Version: 1.0.0
Related: BinaryIndex Architecture, Corpus Management, Concelier AOC
1. Overview
The Ground-Truth Corpus system provides a validated function-matching oracle for binary diff accuracy measurement. It uses the same plugin-based ingestion pattern as Concelier (advisories) and Excititor (VEX), applying Aggregation-Only Contract (AOC) principles to ensure immutable, deterministic, and replayable data.
1.1 Problem Statement
Function matching and binary diffing require ground-truth data to measure accuracy:
- No oracle for validation - How do we know a function match is correct?
- Symbols stripped in production - Debug info unavailable at scan time
- Compiler/optimization variance - Same source produces different binaries
- Backport detection gaps - Need pre/post pairs to validate patch detection
1.2 Solution: Distro Symbol Corpus
Leverage mainstream Linux distro artifacts as ground-truth:
| Source | What It Provides | Use Case |
|---|---|---|
| Debian `.buildinfo` | Exact build env records, often clearsigned | Reproducible oracle, build env metadata |
| Fedora Koji + debuginfod | Machine-queryable debuginfo with IMA verification | Symbol recovery for stripped binaries |
| Ubuntu ddebs | Debug symbol packages | Symbol-grounded truth for function names |
| Alpine SecDB | Precise CVE-to-backport mappings | Pre/post pair curation |
1.3 Module Scope
In Scope:
- Symbol recovery connectors (debuginfod, ddebs, .buildinfo)
- Ground-truth observations (immutable, append-only)
- Pre/post security pair curation
- Validation harness for function-matching accuracy
- Deterministic manifests for replayability
Out of Scope:
- Function matching algorithms (see semantic-diffing.md)
- Fingerprint generation (see corpus-management.md)
- Policy decisions (provided by Policy Engine)
2. Architecture
2.1 System Context
```
┌──────────────────────────────────────────────────────────────────────────┐
│                         External Symbol Sources                          │
│   ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐        │
│   │     Fedora      │   │     Ubuntu      │   │     Debian      │        │
│   │   debuginfod    │   │     ddebs       │   │   .buildinfo    │        │
│   └────────┬────────┘   └────────┬────────┘   └────────┬────────┘        │
│            │                     │                     │                 │
│   ┌────────┴────────┐   ┌────────┴────────┐   ┌────────┴────────┐        │
│   │  Alpine SecDB   │   │   reproduce.    │   │    Upstream     │        │
│   │                 │   │   debian.net    │   │    tarballs     │        │
│   └────────┬────────┘   └────────┬────────┘   └────────┬────────┘        │
└────────────│─────────────────────│─────────────────────│─────────────────┘
             │                     │                     │
             v                     v                     v
┌──────────────────────────────────────────────────────────────────────────┐
│                        Ground-Truth Corpus Module                        │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                       Symbol Source Connectors                       │ │
│ │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │ │
│ │  │  Debuginfod  │  │     Ddeb     │  │  Buildinfo   │                │ │
│ │  │  Connector   │  │  Connector   │  │  Connector   │                │ │
│ │  └──────────────┘  └──────────────┘  └──────────────┘                │ │
│ │  ┌──────────────┐  ┌──────────────┐                                  │ │
│ │  │    SecDB     │  │   Upstream   │                                  │ │
│ │  │  Connector   │  │  Connector   │                                  │ │
│ │  └──────────────┘  └──────────────┘                                  │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                    │                                     │
│                                    v                                     │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                        AOC Write Guard Layer                         │ │
│ │  ┌────────────────────────────────────────────────────────────────┐  │ │
│ │  │ • No derived scores at ingest                                  │  │ │
│ │  │ • Immutable observations + supersedes chain                    │  │ │
│ │  │ • Mandatory provenance (source URL, hash, signature)           │  │ │
│ │  │ • Idempotent upserts (keyed by content hash)                   │  │ │
│ │  │ • Deterministic canonical JSON                                 │  │ │
│ │  └────────────────────────────────────────────────────────────────┘  │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                    │                                     │
│                                    v                                     │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                      Storage Layer (PostgreSQL)                      │ │
│ │                                                                      │ │
│ │  groundtruth.symbol_sources      - Registered symbol providers       │ │
│ │  groundtruth.raw_documents       - Immutable raw payloads            │ │
│ │  groundtruth.symbol_observations - Normalized symbol records         │ │
│ │  groundtruth.security_pairs      - Pre/post CVE binary pairs         │ │
│ │  groundtruth.validation_runs     - Benchmark execution records       │ │
│ │  groundtruth.match_results       - Function match outcomes           │ │
│ │  groundtruth.source_state        - Cursor/sync state per source      │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                    │                                     │
│                                    v                                     │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                          Validation Harness                          │ │
│ │  ┌────────────────────────────────────────────────────────────────┐  │ │
│ │  │ IValidationHarness                                             │  │ │
│ │  │   - RunAsync(config)                                           │  │ │
│ │  │   - GetMetricsAsync(runId) -> MatchRate, FP/FN, Unmatched      │  │ │
│ │  │   - ExportReportAsync(runId, format) -> Markdown/HTML          │  │ │
│ │  └────────────────────────────────────────────────────────────────┘  │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
```
2.2 Component Breakdown
2.2.1 Symbol Source Connectors
Plugin-based connectors following the Concelier IFeedConnector pattern:
```csharp
public interface ISymbolSourceConnector
{
    string SourceId { get; }
    string[] SupportedDistros { get; }

    // Three-phase pipeline (matches Concelier pattern)
    Task FetchAsync(IServiceProvider sp, CancellationToken ct);  // Download raw docs
    Task ParseAsync(IServiceProvider sp, CancellationToken ct);  // Normalize to DTOs
    Task MapAsync(IServiceProvider sp, CancellationToken ct);    // Build observations
}
```
Implementations:
| Connector | Source | Data Retrieved |
|---|---|---|
| `DebuginfodConnector` | Fedora/RHEL debuginfod | ELF debuginfo, source files |
| `DdebConnector` | Ubuntu ddebs repos | .ddeb packages with DWARF |
| `BuildinfoConnector` | Debian .buildinfo | Build env, checksums, signatures |
| `SecDbConnector` | Alpine SecDB | CVE-to-fix mappings |
| `UpstreamConnector` | GitHub/tarballs | Upstream release sources |
2.2.2 AOC Write Guard
Enforces aggregation-only invariants (mirrors IAdvisoryObservationWriteGuard):
```csharp
public interface ISymbolObservationWriteGuard
{
    WriteDisposition ValidateWrite(
        SymbolObservation candidate,
        string? existingContentHash);
}

public enum WriteDisposition
{
    Proceed,         // Insert new observation
    SkipIdentical,   // Idempotent re-insert, no-op
    RejectMutation   // Reject (append-only violation)
}
```
Invariants Enforced:
| Invariant | What It Enforces |
|---|---|
| No derived scores | Reject confidence, accuracy, match_score at ingest |
| Immutable observations | No in-place updates; new revisions use supersedes |
| Mandatory provenance | Require source_url, fetched_at, content_hash, signature_state |
| Idempotent upserts | Key by (source_id, debug_id, content_hash) |
| Deterministic canonical | Sorted JSON keys, UTC ISO-8601, stable hashes |
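
The disposition logic implied by the table can be sketched as follows. This is a minimal illustration, not the module's implementation: the `SymbolObservation` record here is a hypothetical two-field stand-in, and it assumes `existingContentHash` is the hash already stored under the same observation identity (or `null` when nothing is stored).

```csharp
#nullable enable
using System;

public enum WriteDisposition { Proceed, SkipIdentical, RejectMutation }

// Hypothetical minimal shape; the real SymbolObservation carries more fields.
public sealed record SymbolObservation(string ObservationId, string ContentHash);

public interface ISymbolObservationWriteGuard
{
    WriteDisposition ValidateWrite(SymbolObservation candidate, string? existingContentHash);
}

public sealed class SymbolObservationWriteGuard : ISymbolObservationWriteGuard
{
    public WriteDisposition ValidateWrite(SymbolObservation candidate, string? existingContentHash)
    {
        ArgumentNullException.ThrowIfNull(candidate);

        // Nothing stored under this identity yet: a plain insert.
        if (existingContentHash is null)
            return WriteDisposition.Proceed;

        // Same content as before: re-ingestion is a no-op.
        if (string.Equals(existingContentHash, candidate.ContentHash, StringComparison.Ordinal))
            return WriteDisposition.SkipIdentical;

        // Different content under the same identity would be an in-place update.
        return WriteDisposition.RejectMutation;
    }
}
```

Under this reading, a changed payload is never written over the old row; the caller would instead create a new revision that points at its predecessor through `supersedes`.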
2.2.3 Security Pair Curation
Manages pre/post CVE binary pairs for validation:
```csharp
public interface ISecurityPairService
{
    // Curate a pre/post pair for a CVE
    Task<SecurityPair> CreatePairAsync(
        string cveId,
        BinaryReference vulnerableBinary,
        BinaryReference patchedBinary,
        PairMetadata metadata,
        CancellationToken ct);

    // Get pairs for validation
    Task<ImmutableArray<SecurityPair>> GetPairsAsync(
        SecurityPairQuery query,
        CancellationToken ct);
}

public sealed record SecurityPair(
    string PairId,
    string CveId,
    BinaryReference VulnerableBinary,
    BinaryReference PatchedBinary,
    string[] AffectedFunctions,  // Symbol names of vulnerable functions
    string[] ChangedFunctions,   // Symbol names of patched functions
    DiffMetadata Diff,           // Upstream patch info
    ProvenanceInfo Provenance);
```
2.2.4 Validation Harness
Runs function-matching validation with metrics:
```csharp
public interface IValidationHarness
{
    // Execute validation run
    Task<ValidationRun> RunAsync(
        ValidationConfig config,
        CancellationToken ct);

    // Get metrics for a run
    Task<ValidationMetrics> GetMetricsAsync(
        Guid runId,
        CancellationToken ct);

    // Export report
    Task<Stream> ExportReportAsync(
        Guid runId,
        ReportFormat format,
        CancellationToken ct);
}

public sealed record ValidationMetrics(
    int TotalFunctions,
    int CorrectMatches,
    int FalsePositives,
    int FalseNegatives,
    int Unmatched,
    decimal MatchRate,
    decimal Precision,
    decimal Recall,
    ImmutableArray<MismatchBucket> MismatchBuckets);

public sealed record MismatchBucket(
    string Cause,  // inlining, lto, optimization, pic_thunk
    int Count,
    ImmutableArray<FunctionRef> Examples);
```
3. Database Schema
3.1 Symbol Sources
```sql
CREATE TABLE groundtruth.symbol_sources (
    source_id       TEXT PRIMARY KEY,
    display_name    TEXT NOT NULL,
    connector_type  TEXT NOT NULL,  -- debuginfod, ddeb, buildinfo, secdb
    base_url        TEXT NOT NULL,
    enabled         BOOLEAN DEFAULT TRUE,
    config_json     JSONB,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW()
);
```
3.2 Raw Documents (Immutable)
```sql
CREATE TABLE groundtruth.raw_documents (
    digest              TEXT PRIMARY KEY,  -- sha256:{hex}
    source_id           TEXT NOT NULL REFERENCES groundtruth.symbol_sources(source_id),
    document_uri        TEXT NOT NULL,
    fetched_at          TIMESTAMPTZ NOT NULL,
    recorded_at         TIMESTAMPTZ DEFAULT NOW(),
    content_type        TEXT NOT NULL,
    content_size_bytes  INT,
    etag                TEXT,
    signature_state     TEXT,  -- verified, unverified, failed
    payload_json        JSONB,
    UNIQUE (source_id, document_uri, etag)
);

CREATE INDEX idx_raw_documents_source_fetched
    ON groundtruth.raw_documents(source_id, fetched_at DESC);
```
3.3 Symbol Observations (Immutable)
```sql
CREATE TABLE groundtruth.symbol_observations (
    observation_id      TEXT PRIMARY KEY,  -- groundtruth:{source}:{debug_id}:{revision}
    source_id           TEXT NOT NULL,
    debug_id            TEXT NOT NULL,     -- ELF build-id, PE GUID, Mach-O UUID
    code_id             TEXT,              -- GNU build-id or PE checksum

    -- Binary metadata
    binary_name         TEXT NOT NULL,
    binary_path         TEXT,
    architecture        TEXT NOT NULL,     -- x86_64, aarch64, armv7

    -- Package provenance
    distro              TEXT,              -- debian, ubuntu, fedora, alpine
    distro_version      TEXT,
    package_name        TEXT,
    package_version     TEXT,

    -- Symbols
    symbols_json        JSONB NOT NULL,    -- Array of {name, address, size, type}
    symbol_count        INT NOT NULL,

    -- Build metadata (from .buildinfo or debuginfo)
    compiler            TEXT,
    compiler_version    TEXT,
    optimization_level  TEXT,
    build_flags_json    JSONB,

    -- Provenance
    document_digest     TEXT REFERENCES groundtruth.raw_documents(digest),
    content_hash        TEXT NOT NULL,
    supersedes_id       TEXT REFERENCES groundtruth.symbol_observations(observation_id),
    created_at          TIMESTAMPTZ DEFAULT NOW(),

    UNIQUE (source_id, debug_id, content_hash)
);

CREATE INDEX idx_symbol_observations_debug_id
    ON groundtruth.symbol_observations(debug_id);
CREATE INDEX idx_symbol_observations_package
    ON groundtruth.symbol_observations(distro, package_name, package_version);
```
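One way the `content_hash` column could be derived so that it stays stable across re-ingestion is to hash a canonical JSON form of the observation. The sketch below is an assumption, not the module's actual canonicalizer: the `CanonicalHash` class and its methods are hypothetical names, keys are sorted ordinally, array order is preserved, and the `sha256:{hex}` prefix follows the digest convention used for `raw_documents`.

```csharp
#nullable enable
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json.Nodes;

public static class CanonicalHash
{
    // Rebuilds the tree with object keys in ordinal order; arrays keep their order.
    public static JsonNode? Canonicalize(JsonNode? node)
    {
        if (node is JsonObject obj)
        {
            var sorted = new JsonObject();
            foreach (var pair in obj.OrderBy(p => p.Key, StringComparer.Ordinal))
                sorted[pair.Key] = Canonicalize(pair.Value);
            return sorted;
        }
        if (node is JsonArray arr)
        {
            var copy = new JsonArray();
            foreach (var item in arr)
                copy.Add(Canonicalize(item));
            return copy;
        }
        // Scalars: re-parse to get a detached copy that can join the new tree.
        return node is null ? null : JsonNode.Parse(node.ToJsonString());
    }

    // Hashes the compact canonical form and returns "sha256:{lowercase hex}".
    public static string ContentHash(string json)
    {
        var canonical = Canonicalize(JsonNode.Parse(json))?.ToJsonString() ?? "null";
        var digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return "sha256:" + Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

Two payloads that differ only in key order or whitespace hash to the same value, so a repeated fetch of unchanged data hits the `UNIQUE (source_id, debug_id, content_hash)` constraint instead of creating a new row.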
3.4 Security Pairs
```sql
CREATE TABLE groundtruth.security_pairs (
    pair_id                  TEXT PRIMARY KEY,
    cve_id                   TEXT NOT NULL,

    -- Vulnerable binary
    vuln_observation_id      TEXT NOT NULL
        REFERENCES groundtruth.symbol_observations(observation_id),
    vuln_debug_id            TEXT NOT NULL,

    -- Patched binary
    patch_observation_id     TEXT NOT NULL
        REFERENCES groundtruth.symbol_observations(observation_id),
    patch_debug_id           TEXT NOT NULL,

    -- Affected function mapping
    affected_functions_json  JSONB NOT NULL,  -- [{name, vuln_addr, patch_addr}]
    changed_functions_json   JSONB NOT NULL,

    -- Upstream diff reference
    upstream_commit          TEXT,
    upstream_patch_url       TEXT,

    -- Metadata
    distro                   TEXT NOT NULL,
    package_name             TEXT NOT NULL,
    created_at               TIMESTAMPTZ DEFAULT NOW(),
    created_by               TEXT
);

CREATE INDEX idx_security_pairs_cve
    ON groundtruth.security_pairs(cve_id);
CREATE INDEX idx_security_pairs_package
    ON groundtruth.security_pairs(distro, package_name);
```
3.5 Validation Runs
```sql
CREATE TABLE groundtruth.validation_runs (
    run_id              UUID PRIMARY KEY,
    config_json         JSONB NOT NULL,  -- Matcher config, thresholds
    started_at          TIMESTAMPTZ NOT NULL,
    completed_at        TIMESTAMPTZ,
    status              TEXT NOT NULL,   -- running, completed, failed

    -- Aggregate metrics
    total_functions     INT,
    correct_matches     INT,
    false_positives     INT,
    false_negatives     INT,
    unmatched           INT,
    match_rate          DECIMAL(5,4),
    precision           DECIMAL(5,4),
    recall              DECIMAL(5,4),

    -- Environment
    matcher_version     TEXT NOT NULL,
    corpus_snapshot_id  TEXT,
    created_by          TEXT
);

CREATE TABLE groundtruth.match_results (
    result_id         UUID PRIMARY KEY,
    run_id            UUID NOT NULL REFERENCES groundtruth.validation_runs(run_id),

    -- Ground truth
    pair_id           TEXT NOT NULL REFERENCES groundtruth.security_pairs(pair_id),
    function_name     TEXT NOT NULL,
    expected_match    BOOLEAN NOT NULL,

    -- Actual result
    actual_match      BOOLEAN,
    match_score       DECIMAL(5,4),
    matched_function  TEXT,

    -- Classification
    outcome           TEXT NOT NULL,  -- true_positive, false_positive, false_negative, unmatched
    mismatch_cause    TEXT,           -- inlining, lto, optimization, pic_thunk, etc.

    -- Debug info
    debug_json        JSONB
);

CREATE INDEX idx_match_results_run
    ON groundtruth.match_results(run_id);
CREATE INDEX idx_match_results_outcome
    ON groundtruth.match_results(run_id, outcome);
```
3.6 Source State (Cursor Tracking)
```sql
CREATE TABLE groundtruth.source_state (
    source_id        TEXT PRIMARY KEY REFERENCES groundtruth.symbol_sources(source_id),
    enabled          BOOLEAN DEFAULT TRUE,
    cursor_json      JSONB,  -- last_modified, last_id, pending_docs
    last_success_at  TIMESTAMPTZ,
    last_error       TEXT,
    backoff_until    TIMESTAMPTZ
);
```
4. Connector Specifications
4.1 Debuginfod Connector (Fedora/RHEL)
Data Source: https://debuginfod.fedoraproject.org
Fetch Flow:
- Query debuginfod for build-id: `GET /buildid/{build_id}/debuginfo`
- Retrieve DWARF sections (`.debug_info`, `.debug_line`)
- Parse symbols using libdw
- Store observation with IMA signature verification
Configuration:
```yaml
debuginfod:
  base_url: "https://debuginfod.fedoraproject.org"
  timeout_seconds: 30
  verify_ima: true
  cache_dir: "/var/cache/stellaops/debuginfod"
```
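The first fetch step can be sketched as a small helper that builds the `/buildid/{build_id}/debuginfo` URL from the configured `base_url` and downloads the payload. The `DebuginfodUrls` class and its method names are hypothetical; the sketch assumes the caller applies `timeout_seconds` on the `HttpClient`, and it leaves IMA verification and caching out entirely.

```csharp
#nullable enable
using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class DebuginfodUrls
{
    // Builds {base_url}/buildid/{build_id}/debuginfo; rejects non-hex ids
    // so a malformed value cannot alter the request path.
    public static Uri DebuginfoUri(Uri baseUrl, string buildId)
    {
        if (string.IsNullOrEmpty(buildId) || !buildId.All(Uri.IsHexDigit))
            throw new ArgumentException("build-id must be a non-empty hex string", nameof(buildId));

        var root = baseUrl.AbsoluteUri.TrimEnd('/');
        return new Uri($"{root}/buildid/{buildId.ToLowerInvariant()}/debuginfo");
    }

    // Returns the debuginfo bytes, or null when the server does not know the build-id.
    public static async Task<byte[]?> TryFetchAsync(
        HttpClient http, Uri baseUrl, string buildId, CancellationToken ct)
    {
        using var response = await http.GetAsync(DebuginfoUri(baseUrl, buildId), ct);
        if (response.StatusCode == HttpStatusCode.NotFound)
            return null;

        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsByteArrayAsync(ct);
    }
}
```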
4.2 Ddeb Connector (Ubuntu)
Data Source: http://ddebs.ubuntu.com
Fetch Flow:
- Query Packages index for `-dbgsym` packages
- Download `.ddeb` archive
- Extract DWARF from `/usr/lib/debug/.build-id/`
- Parse symbols, map to corresponding binary package
Configuration:
```yaml
ddeb:
  mirror_url: "http://ddebs.ubuntu.com"
  distributions: ["focal", "jammy", "noble"]
  components: ["main", "universe"]
  cache_dir: "/var/cache/stellaops/ddebs"
```
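The last two steps of the fetch flow rely on two naming conventions: detached debug files live under `/usr/lib/debug/.build-id/` in a directory named after the first two hex digits of the build-id, and a `-dbgsym` package maps back to its binary package by dropping the suffix. A minimal sketch, with `BuildIdPaths` and its members as hypothetical names:

```csharp
#nullable enable
using System;
using System.Linq;

public static class BuildIdPaths
{
    private const string DbgsymSuffix = "-dbgsym";

    // "abcdef0123" -> "/usr/lib/debug/.build-id/ab/cdef0123.debug"
    public static string DebugFilePath(string buildId)
    {
        if (buildId is null || buildId.Length < 3 || !buildId.All(Uri.IsHexDigit))
            throw new ArgumentException("build-id must be a hex string of at least 3 digits", nameof(buildId));

        var id = buildId.ToLowerInvariant();
        return $"/usr/lib/debug/.build-id/{id[..2]}/{id[2..]}.debug";
    }

    // "openssl-dbgsym" -> "openssl"; returns null for names without the suffix.
    public static string? BinaryPackageFor(string dbgsymPackage) =>
        dbgsymPackage.EndsWith(DbgsymSuffix, StringComparison.Ordinal)
            ? dbgsymPackage[..^DbgsymSuffix.Length]
            : null;
}
```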
4.3 Buildinfo Connector (Debian)
Data Source: https://buildinfos.debian.net
Fetch Flow:
- Query buildinfo index for package
- Download `.buildinfo` file (often clearsigned)
- Parse build environment (compiler, flags, checksums)
- Cross-reference with snapshot.debian.org for exact binary
Configuration:
```yaml
buildinfo:
  index_url: "https://buildinfos.debian.net"
  snapshot_url: "https://snapshot.debian.org"
  reproducible_url: "https://reproduce.debian.net"
  verify_signature: true
```
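A `.buildinfo` file uses Debian's control-file (deb822) syntax: `Field: value` lines, with continuation lines that start with a space or tab. The parse step can be sketched as below. `BuildinfoParser` is a hypothetical name, and the sketch assumes the clearsign armor has already been verified and stripped, so only the control paragraph is passed in.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;

public static class BuildinfoParser
{
    // Parses one deb822 paragraph into field name -> value.
    // Multi-line values (for example Checksums-Sha256) are joined with '\n'.
    public static Dictionary<string, string> ParseFields(string text)
    {
        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string? current = null;

        using var reader = new StringReader(text);
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0)
                break; // a blank line ends the paragraph

            if (line[0] == ' ' || line[0] == '\t')
            {
                if (current is null)
                    continue; // continuation without a field; ignore

                var existing = fields[current];
                fields[current] = existing.Length == 0
                    ? line.Trim()
                    : existing + "\n" + line.Trim();
                continue;
            }

            var colon = line.IndexOf(':');
            if (colon <= 0)
                continue; // not a field line; ignore

            current = line[..colon];
            fields[current] = line[(colon + 1)..].Trim();
        }
        return fields;
    }
}
```

Fields such as `Source`, `Version`, `Build-Architecture`, `Checksums-Sha256`, and `Environment` can then be read from the dictionary and mapped onto the build-metadata columns of an observation.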
4.4 SecDB Connector (Alpine)
Data Source: https://github.com/alpinelinux/alpine-secdb
Fetch Flow:
- Clone/pull secdb repository
- Parse YAML files per branch (e.g., v3.18, v3.19, edge)
- Map CVE to fixed/unfixed package versions
- Cross-reference with aports for patch info
Configuration:
```yaml
secdb:
  repo_url: "https://github.com/alpinelinux/alpine-secdb"
  branches: ["v3.18", "v3.19", "v3.20", "edge"]
  aports_url: "https://gitlab.alpinelinux.org/alpine/aports"
```
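The CVE mapping step can be sketched as an inversion of a package's `secfixes` map, which lists the identifiers fixed in each package version. The sketch assumes the YAML has already been deserialized into a dictionary (YAML parsing itself is out of scope here); `SecfixIndex` is a hypothetical name, and entries that do not start with `CVE-` are skipped as an illustrative simplification.

```csharp
using System;
using System.Collections.Generic;

public static class SecfixIndex
{
    // Input:  fixed package version -> identifiers fixed in that version.
    // Output: CVE id -> versions that list it as fixed, in input order.
    public static SortedDictionary<string, List<string>> FixedVersionsByCve(
        IReadOnlyDictionary<string, IReadOnlyList<string>> secfixes)
    {
        var index = new SortedDictionary<string, List<string>>(StringComparer.Ordinal);
        foreach (var (version, identifiers) in secfixes)
        {
            foreach (var id in identifiers)
            {
                if (!id.StartsWith("CVE-", StringComparison.Ordinal))
                    continue; // assumption: only CVE identifiers are indexed

                if (!index.TryGetValue(id, out var versions))
                    index[id] = versions = new List<string>();
                versions.Add(version);
            }
        }
        return index;
    }
}
```

A pair curator could use the resulting index to pick the fixed version for a CVE and the release immediately before it as the vulnerable side of a pre/post pair.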
5. Validation Pipeline
5.1 Harness Workflow
```
1. Assemble
   └─> Given package + CVE, fetch: binaries, debuginfo, .buildinfo, upstream tarball

2. Recover Symbols
   └─> Resolve build-id → symbols via debuginfod/ddebs
   └─> Fallback: Debian rebuild from .buildinfo

3. Lift Functions
   └─> Batch-lift .text functions → IR
   └─> Cache per build-id

4. Fingerprint
   └─> Emit deterministic + fuzzy signatures
   └─> Store as JSON lines

5. Match
   └─> Pre→post function matching
   └─> Write row per function with scores

6. Score
   └─> Compute metrics (match rate, FP/FN, precision, recall)
   └─> Bucket mismatches by cause

7. Report
   └─> Markdown/HTML with tables + diffs
   └─> Attach env hashes and artifact URLs
```
5.2 Metrics Tracked
| Metric | Description |
|---|---|
| `match_rate` | Correct matches / total functions |
| `precision` | True positives / (true positives + false positives) |
| `recall` | True positives / (true positives + false negatives) |
| `unmatched_rate` | Unmatched / total functions |
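
The formulas in the table can be sketched as a small calculator. `MetricSummary` and `MetricsCalculator` are hypothetical names; the sketch assumes correct matches are counted as true positives, rounds to four decimal places to fit the `DECIMAL(5,4)` columns, and returns 0 for an empty denominator.

```csharp
using System;

public sealed record MetricSummary(
    decimal MatchRate,
    decimal Precision,
    decimal Recall,
    decimal UnmatchedRate);

public static class MetricsCalculator
{
    public static MetricSummary Compute(
        int totalFunctions,
        int correctMatches,
        int falsePositives,
        int falseNegatives,
        int unmatched)
    {
        return new MetricSummary(
            MatchRate: Ratio(correctMatches, totalFunctions),
            Precision: Ratio(correctMatches, correctMatches + falsePositives),
            Recall: Ratio(correctMatches, correctMatches + falseNegatives),
            UnmatchedRate: Ratio(unmatched, totalFunctions));
    }

    // Four decimal places; an empty denominator yields 0 rather than an exception.
    private static decimal Ratio(int numerator, int denominator) =>
        denominator == 0 ? 0m : Math.Round((decimal)numerator / denominator, 4);
}
```

With 1500 total functions, 1380 correct matches, 15 false positives, 45 false negatives, and 60 unmatched, this gives a match rate of 0.92, precision of 0.9892, recall of 0.9684, and an unmatched rate of 0.04.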
5.3 Mismatch Buckets
| Cause | Description | Mitigation |
|---|---|---|
| `inlining` | Function inlined, no direct match | Inline expansion in fingerprint |
| `lto` | Link-time optimization changed structure | Cross-module fingerprints |
| `optimization` | Different -O level | Semantic fingerprints |
| `pic_thunk` | Position-independent code stubs | Filter PIC thunks |
| `versioned_symbol` | GLIBC symbol versioning | Version-aware matching |
| `renamed` | Symbol renamed (macro, alias) | Alias resolution |
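
Bucketing can be sketched as a group-by over per-function results. The row and summary records below are hypothetical shapes that mirror the `function_name`, `outcome`, and `mismatch_cause` columns of `match_results`; the `unclassified` label for rows without a cause is an assumption of this sketch.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape mirroring groundtruth.match_results.
public sealed record MatchResultRow(string FunctionName, string Outcome, string? MismatchCause);

public sealed record BucketSummary(string Cause, int Count, IReadOnlyList<string> Examples);

public static class MismatchBucketing
{
    // Groups every non-true-positive row by cause, largest bucket first.
    public static IReadOnlyList<BucketSummary> Bucket(
        IEnumerable<MatchResultRow> rows, int maxExamples = 3)
    {
        return rows
            .Where(r => r.Outcome != "true_positive")
            .GroupBy(r => r.MismatchCause ?? "unclassified", StringComparer.Ordinal)
            .Select(g => new BucketSummary(
                g.Key,
                g.Count(),
                g.Select(r => r.FunctionName)
                 .OrderBy(n => n, StringComparer.Ordinal)
                 .Take(maxExamples)
                 .ToList()))
            .OrderByDescending(b => b.Count)
            .ThenBy(b => b.Cause, StringComparer.Ordinal)
            .ToList();
    }
}
```

Sorting the examples and breaking count ties by cause name keeps the output deterministic, which matters if the buckets end up in a signed report.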
6. Evidence Objects
6.1 Ground-Truth Attestation Predicate
```json
{
  "predicateType": "https://stella-ops.org/predicates/groundtruth/v1",
  "predicate": {
    "observationId": "groundtruth:debuginfod:abc123def456:1",
    "debugId": "abc123def456789...",
    "binaryIdentity": {
      "name": "libssl.so.3",
      "sha256": "sha256:...",
      "architecture": "x86_64"
    },
    "symbolSource": {
      "sourceId": "debuginfod-fedora",
      "fetchedAt": "2026-01-19T10:00:00Z",
      "documentUri": "https://debuginfod.fedoraproject.org/buildid/abc123/debuginfo",
      "signatureState": "verified"
    },
    "symbols": [
      {"name": "SSL_CTX_new", "address": "0x1234", "size": 256},
      {"name": "SSL_read", "address": "0x5678", "size": 512}
    ],
    "buildMetadata": {
      "compiler": "gcc",
      "compilerVersion": "12.2.0",
      "optimizationLevel": "O2",
      "buildFlags": ["-fstack-protector-strong", "-D_FORTIFY_SOURCE=2"]
    }
  }
}
```
6.2 Validation Run Attestation
```json
{
  "predicateType": "https://stella-ops.org/predicates/validation-run/v1",
  "predicate": {
    "runId": "550e8400-e29b-41d4-a716-446655440000",
    "config": {
      "matcherVersion": "binaryindex-semantic-diffing:1.2.0",
      "thresholds": {
        "minSimilarity": 0.85,
        "semanticWeight": 0.35,
        "instructionWeight": 0.25
      }
    },
    "corpus": {
      "snapshotId": "corpus:2026-01-19",
      "functionCount": 30000,
      "libraryCount": 5
    },
    "metrics": {
      "totalFunctions": 1500,
      "correctMatches": 1380,
      "falsePositives": 15,
      "falseNegatives": 45,
      "unmatched": 60,
      "matchRate": 0.92,
      "precision": 0.989,
      "recall": 0.968
    },
    "mismatchBuckets": [
      {"cause": "inlining", "count": 25},
      {"cause": "lto", "count": 12},
      {"cause": "optimization", "count": 8}
    ],
    "executedAt": "2026-01-19T10:30:00Z"
  }
}
```
7. CLI Commands
```bash
# Symbol source management
stella groundtruth sources list
stella groundtruth sources enable debuginfod-fedora
stella groundtruth sources sync --source debuginfod-fedora

# Symbol observation queries
stella groundtruth symbols lookup --debug-id abc123
stella groundtruth symbols search --package openssl --distro debian

# Security pair management
stella groundtruth pairs create \
  --cve CVE-2024-1234 \
  --vuln-pkg openssl=3.0.10-1 \
  --patch-pkg openssl=3.0.11-1
stella groundtruth pairs list --cve CVE-2024-1234

# Validation harness
stella groundtruth validate run \
  --pairs "openssl:CVE-2024-*" \
  --matcher semantic-diffing \
  --output validation-report.md
stella groundtruth validate metrics --run-id abc123
stella groundtruth validate export --run-id abc123 --format html
```
8. Doctor Checks
The ground-truth corpus integrates with Doctor for availability checks:
```csharp
// stellaops.doctor.binaryanalysis plugin
public sealed class BinaryAnalysisDoctorPlugin : IDoctorPlugin
{
    public string Name => "stellaops.doctor.binaryanalysis";

    public IEnumerable<IDoctorCheck> GetChecks()
    {
        yield return new DebuginfodAvailabilityCheck();
        yield return new DdebRepoEnabledCheck();
        yield return new BuildinfoCacheCheck();
        yield return new SymbolRecoveryFallbackCheck();
    }
}
```
| Check | Description | Remediation |
|---|---|---|
| `debuginfod_urls_configured` | Verify `DEBUGINFOD_URLS` env | Set env variable |
| `ddeb_repos_enabled` | Check Ubuntu ddeb sources | Enable ddebs repo |
| `buildinfo_cache_accessible` | Validate buildinfos.debian.net | Check network/firewall |
| `symbol_recovery_fallback` | Ensure fallback path works | Configure local cache |
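
The `debuginfod_urls_configured` check could be built on a helper like the one below. `DEBUGINFOD_URLS` is a space-separated list of server URLs; `DebuginfodEnv` and its members are hypothetical names, and the sketch does not assume anything about the `IDoctorCheck` interface, whose shape is not shown in this document.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class DebuginfodEnv
{
    // Splits a DEBUGINFOD_URLS value and keeps only absolute http(s) URLs.
    public static IReadOnlyList<Uri> ParseUrls(string? value)
    {
        var result = new List<Uri>();
        foreach (var token in (value ?? string.Empty).Split(' ', StringSplitOptions.RemoveEmptyEntries))
        {
            if (Uri.TryCreate(token, UriKind.Absolute, out var uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            {
                result.Add(uri);
            }
        }
        return result;
    }

    // True when the environment names at least one usable server.
    public static bool IsConfigured() =>
        ParseUrls(Environment.GetEnvironmentVariable("DEBUGINFOD_URLS")).Count > 0;
}
```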
9. Air-Gap Support
For offline/air-gapped deployments:
9.1 Symbol Bundle Format
```
symbol-bundle-2026-01-19/
├── manifest.json          # Bundle metadata + checksums
├── sources/
│   ├── debuginfod/
│   │   └── *.debuginfo    # Pre-fetched debuginfo
│   ├── ddebs/
│   │   └── *.ddeb         # Pre-fetched ddebs
│   └── buildinfo/
│       └── *.buildinfo    # Pre-fetched buildinfo
├── observations/
│   └── *.ndjson           # Pre-computed observations
└── DSSE.envelope          # Signed attestation
```
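Checking a bundle against the checksums in `manifest.json` can be sketched as below. The manifest schema is not specified in this document, so the sketch assumes it has already been read into a map of relative path to SHA-256 hex digest; `BundleVerifier` is a hypothetical name, and verification of `DSSE.envelope` is out of scope.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class BundleVerifier
{
    // Returns the relative paths that are missing or whose SHA-256 differs
    // from the expected digest. An empty result means the bundle matches.
    public static IReadOnlyList<string> FindMismatches(
        string bundleRoot,
        IReadOnlyDictionary<string, string> expectedSha256)
    {
        var mismatches = new List<string>();
        using var sha = SHA256.Create();

        foreach (var (relativePath, expected) in expectedSha256.OrderBy(p => p.Key, StringComparer.Ordinal))
        {
            var fullPath = Path.Combine(bundleRoot, relativePath);
            if (!File.Exists(fullPath))
            {
                mismatches.Add(relativePath);
                continue;
            }

            using var stream = File.OpenRead(fullPath);
            var actual = Convert.ToHexString(sha.ComputeHash(stream));
            if (!string.Equals(actual, expected, StringComparison.OrdinalIgnoreCase))
                mismatches.Add(relativePath);
        }
        return mismatches;
    }
}
```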
9.2 Offline Sync
```bash
# Export bundle for air-gap transfer
stella groundtruth bundle export \
  --packages openssl,zlib,glibc \
  --distros debian,fedora \
  --output symbol-bundle.tar.gz

# Import bundle in air-gapped environment
stella groundtruth bundle import \
  --input symbol-bundle.tar.gz \
  --verify-signature
```