# BinaryIndex Module Architecture

> **Ownership:** Scanner Guild + Concelier Guild
> **Status:** DRAFT
> **Version:** 1.0.0
> **Related:** [High-Level Architecture](../../ARCHITECTURE_OVERVIEW.md), [Scanner Architecture](../scanner/architecture.md), [Concelier Architecture](../concelier/architecture.md)

---

## 1. Overview

The **BinaryIndex** module provides a vulnerable binaries database that enables detection of vulnerable code at the binary level, independent of package metadata. This addresses a critical gap in vulnerability scanning: package version strings can lie (backports, custom builds, stripped metadata), but **binary identity doesn't lie**.

### 1.1 Problem Statement

Traditional vulnerability scanners rely on package version matching, which fails in several scenarios:

1. **Backported patches** - Distros backport security fixes without changing upstream version
2. **Custom/vendored builds** - Binaries compiled from source without package metadata
3. **Stripped binaries** - Debug info and version strings removed
4. **Static linking** - Vulnerable library code embedded in final binary
5. **Container base images** - Distroless or scratch images with no package DB

### 1.2 Solution: Binary-First Vulnerability Detection

BinaryIndex provides three tiers of binary identification:

| Tier | Method | Precision | Coverage |
|------|--------|-----------|----------|
| A | Package/version range matching | Medium | High |
| B | Build-ID/hash catalog (exact binary identity) | High | Medium |
| C | Function fingerprints (CFG/basic-block hashes) | Very High | Targeted |

### 1.3 Module Scope

**In Scope:**
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
- Binary-to-advisory mapping database
- Fingerprint storage and matching engine
- Fix index for patch-aware backport handling
- Integration with Scanner.Worker for binary lookup

**Out of Scope:**
- Binary disassembly/analysis (provided by Scanner.Analyzers.Native)
- Runtime binary tracing (provided by Zastava)
- SBOM generation (provided by Scanner)

---

## 2. Architecture

### 2.1 System Context

```
┌──────────────────────────────────────────────────────────────────────────┐
│                             External Systems                             │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐         │
│  │  Distro Repos   │   │  Debug Symbol   │   │ Upstream Source │         │
│  │  (Debian, RPM,  │   │  Servers        │   │ (GitHub, etc.)  │         │
│  │   Alpine)       │   │  (debuginfod)   │   │                 │         │
│  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘         │
└───────────│─────────────────────│─────────────────────│──────────────────┘
            │                     │                     │
            v                     v                     v
┌──────────────────────────────────────────────────────────────────────────┐
│                            BinaryIndex Module                            │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                        Corpus Ingestion Layer                        │ │
│ │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │ │
│ │  │ DebianCorpus │  │ RpmCorpus    │  │ AlpineCorpus │                │ │
│ │  │ Connector    │  │ Connector    │  │ Connector    │                │ │
│ │  └──────────────┘  └──────────────┘  └──────────────┘                │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                     │                                    │
│                                     v                                    │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                           Processing Layer                           │ │
│ │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │ │
│ │  │ BinaryFeature│  │ FixIndex     │  │ Fingerprint  │                │ │
│ │  │ Extractor    │  │ Builder      │  │ Generator    │                │ │
│ │  └──────────────┘  └──────────────┘  └──────────────┘                │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                     │                                    │
│                                     v                                    │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                            Storage Layer                             │ │
│ │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │ │
│ │  │ PostgreSQL   │  │ RustFS       │  │ Valkey       │                │ │
│ │  │ (binaries    │  │ (fingerprint │  │ (lookup      │                │ │
│ │  │ schema)      │  │ blobs)       │  │ cache)       │                │ │
│ │  └──────────────┘  └──────────────┘  └──────────────┘                │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                     │                                    │
│                                     v                                    │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                             Query Layer                              │ │
│ │   ┌──────────────────────────────────────────────────────────────┐   │ │
│ │   │ IBinaryVulnerabilityService                                  │   │ │
│ │   │   - LookupByBuildIdAsync(buildId)                            │   │ │
│ │   │   - LookupByFingerprintAsync(fingerprint)                    │   │ │
│ │   │   - LookupBatchAsync(identities)                             │   │ │
│ │   │   - GetFixStatusAsync(distro, release, sourcePkg, cve)       │   │ │
│ │   └──────────────────────────────────────────────────────────────┘   │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
                                      │
                                      v
┌──────────────────────────────────────────────────────────────────────────┐
│                            Consuming Modules                             │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐         │
│  │ Scanner.Worker  │   │ Policy Engine   │   │ Findings Ledger │         │
│  │ (binary lookup  │   │ (evidence in    │   │ (match records) │         │
│  │  during scan)   │   │  proof chain)   │   │                 │         │
│  └─────────────────┘   └─────────────────┘   └─────────────────┘         │
└──────────────────────────────────────────────────────────────────────────┘
```

### 2.2 Component Breakdown

#### 2.2.1 Corpus Connectors

Plugin-based connectors that ingest binaries from distribution repositories.

```csharp
public interface IBinaryCorpusConnector
{
    string ConnectorId { get; }
    string[] SupportedDistros { get; }

    Task FetchSnapshotAsync(CorpusQuery query, CancellationToken ct);
    Task> ExtractBinariesAsync(PackageReference pkg, CancellationToken ct);
}
```

**Implementations:**
- `DebianBinaryCorpusConnector` - Debian/Ubuntu packages + debuginfo
- `RpmBinaryCorpusConnector` - RHEL/Fedora/CentOS + SRPM
- `AlpineBinaryCorpusConnector` - Alpine APK + APKBUILD

#### 2.2.2 Binary Feature Extractor

Extracts identity and features from binaries. Reuses existing Scanner.Analyzers.Native capabilities.

```csharp
public interface IBinaryFeatureExtractor
{
    Task ExtractIdentityAsync(Stream binaryStream, CancellationToken ct);
    Task ExtractFeaturesAsync(Stream binaryStream, ExtractorOptions opts, CancellationToken ct);
}

public sealed record BinaryIdentity(
    string Format,              // elf, pe, macho
    string? BuildId,            // ELF GNU Build-ID
    string? PeCodeViewGuid,     // PE CodeView GUID + Age
    string? MachoUuid,          // Mach-O LC_UUID
    string FileSha256,
    string TextSectionSha256);

public sealed record BinaryFeatures(
    BinaryIdentity Identity,
    string[] DynamicDeps,       // DT_NEEDED
    string[] ExportedSymbols,
    string[] ImportedSymbols,
    BinaryHardening Hardening);
```

#### 2.2.3 Fix Index Builder

Builds the patch-aware CVE fix index from distro sources.

```csharp
public interface IFixIndexBuilder
{
    Task BuildIndexAsync(DistroRelease distro, CancellationToken ct);
    Task GetFixRecordAsync(string distro, string release, string sourcePkg, string cveId, CancellationToken ct);
}

public sealed record FixRecord(
    string Distro,
    string Release,
    string SourcePkg,
    string CveId,
    FixState State,             // fixed, vulnerable, not_affected, wontfix, unknown
    string? FixedVersion,       // Distro version string
    FixMethod Method,           // security_feed, changelog, patch_header
    decimal Confidence,         // 0.00-1.00
    FixEvidence Evidence);

public enum FixState { Fixed, Vulnerable, NotAffected, Wontfix, Unknown }
public enum FixMethod { SecurityFeed, Changelog, PatchHeader, UpstreamPatchMatch }
```

#### 2.2.4 Fingerprint Generator

Generates function-level fingerprints for vulnerable code detection.

```csharp
public interface IVulnFingerprintGenerator
{
    Task> GenerateAsync(
        string cveId,
        BinaryPair vulnAndFixed,    // Reference builds
        FingerprintOptions opts,
        CancellationToken ct);
}

public sealed record VulnFingerprint(
    string CveId,
    string Component,           // e.g., openssl
    string Architecture,        // x86-64, aarch64
    FingerprintType Type,       // basic_block, cfg, combined
    string FingerprintId,       // e.g., "bb-abc123..."
    byte[] FingerprintHash,     // 16-32 bytes
    string? FunctionHint,       // Function name if known
    decimal Confidence,
    FingerprintEvidence Evidence);

public enum FingerprintType { BasicBlock, ControlFlowGraph, StringReferences, Combined }
```

#### 2.2.5 Semantic Analysis Library

> **Library:** `StellaOps.BinaryIndex.Semantic`
> **Sprint:** 20260105_001_001_BINDEX - Semantic Diffing Phase 1

The Semantic Analysis Library extends fingerprint generation with IR-level semantic matching, enabling detection of semantically equivalent code despite compiler optimizations, instruction reordering, and register allocation differences.

**Key Insight:** Traditional instruction-level fingerprinting loses accuracy on optimized binaries by ~15-20%. Semantic analysis lifts to B2R2's Intermediate Representation (LowUIR), extracts key-semantics graphs, and uses graph hashing for similarity computation.

##### 2.2.5.1 Architecture

```
Binary Input
    │
    v
B2R2 Disassembly → Raw Instructions
    │
    v
IR Lifting Service → LowUIR Statements
    │
    v
Semantic Graph Extractor → Key-Semantics Graph (KSG)
    │
    v
Graph Fingerprinting → Semantic Fingerprint
    │
    v
Semantic Matcher → Similarity Score + Deltas
```

##### 2.2.5.2 Core Components

**IR Lifting Service** (`IIrLiftingService`)

Lifts disassembled instructions to B2R2 LowUIR:

```csharp
public interface IIrLiftingService
{
    Task LiftToIrAsync(
        IReadOnlyList instructions,
        string functionName,
        LiftOptions? options = null,
        CancellationToken ct = default);
}

public sealed record LiftedFunction(
    string Name,
    ImmutableArray Statements,
    ImmutableArray BasicBlocks);
```

**Semantic Graph Extractor** (`ISemanticGraphExtractor`)

Extracts key-semantics graphs capturing data dependencies, control flow, and memory operations:

```csharp
public interface ISemanticGraphExtractor
{
    Task ExtractGraphAsync(
        LiftedFunction function,
        GraphExtractionOptions? options = null,
        CancellationToken ct = default);
}

public sealed record KeySemanticsGraph(
    string FunctionName,
    ImmutableArray Nodes,
    ImmutableArray Edges,
    GraphProperties Properties);

public enum SemanticNodeType { Compute, Load, Store, Branch, Call, Return, Phi }
public enum SemanticEdgeType { DataDependency, ControlDependency, MemoryDependency }
```

**Semantic Fingerprint Generator** (`ISemanticFingerprintGenerator`)

Generates semantic fingerprints using Weisfeiler-Lehman graph hashing:

```csharp
public interface ISemanticFingerprintGenerator
{
    Task GenerateAsync(
        KeySemanticsGraph graph,
        SemanticFingerprintOptions? options = null,
        CancellationToken ct = default);
}

public sealed record SemanticFingerprint(
    string FunctionName,
    string GraphHashHex,        // WL graph hash (SHA-256)
    string OperationHashHex,    // Normalized operation sequence hash
    string DataFlowHashHex,     // Data dependency pattern hash
    int NodeCount,
    int EdgeCount,
    int CyclomaticComplexity,
    ImmutableArray ApiCalls,
    SemanticFingerprintAlgorithm Algorithm);
```

**Semantic Matcher** (`ISemanticMatcher`)

Computes semantic similarity with weighted components:

```csharp
public interface ISemanticMatcher
{
    Task MatchAsync(
        SemanticFingerprint a,
        SemanticFingerprint b,
        MatchOptions? options = null,
        CancellationToken ct = default);

    Task MatchWithDeltasAsync(
        SemanticFingerprint a,
        SemanticFingerprint b,
        MatchOptions? options = null,
        CancellationToken ct = default);
}

public sealed record SemanticMatchResult(
    decimal Similarity,         // 0.00-1.00
    decimal GraphSimilarity,
    decimal OperationSimilarity,
    decimal DataFlowSimilarity,
    decimal ApiCallSimilarity,
    MatchConfidence Confidence);
```

##### 2.2.5.3 Algorithm Details

**Weisfeiler-Lehman Graph Hashing:**
- 3 iterations of label propagation
- SHA-256 for final hash computation
- Deterministic node ordering via canonical sort

**Similarity Weights (Default):**

| Component | Weight |
|-----------|--------|
| Graph Hash | 0.35 |
| Operation Hash | 0.25 |
| Data Flow Hash | 0.25 |
| API Calls | 0.15 |

##### 2.2.5.4 Integration Points

The semantic library integrates with existing BinaryIndex components:

**DeltaSignatureGenerator Extension:**

```csharp
// Optional semantic services via constructor injection
services.AddDeltaSignaturesWithSemantic();

// Extended SymbolSignature with semantic properties
public sealed record SymbolSignature
{
    // ... existing properties ...
    public string? SemanticHashHex { get; init; }
    public ImmutableArray SemanticApiCalls { get; init; }
}
```

**PatchDiffEngine Extension:**

```csharp
// SemanticWeight in HashWeights
public decimal SemanticWeight { get; init; } = 0.2m;

// FunctionFingerprint extended with semantic fingerprint
public SemanticFingerprint? SemanticFingerprint { get; init; }
```

##### 2.2.5.5 Test Coverage

| Category | Tests | Coverage |
|----------|-------|----------|
| Unit Tests (IR lifting, graph extraction, hashing) | 53 | Core algorithms |
| Integration Tests (full pipeline) | 9 | End-to-end flow |
| Golden Corpus (compiler variations) | 11 | Register allocation, optimization, compiler variants |
| Benchmarks (accuracy, performance) | 7 | Baseline metrics |

##### 2.2.5.6 Current Baselines

> **Note:** Baselines reflect foundational implementation; accuracy improves as semantic features mature.
| Metric | Baseline | Target |
|--------|----------|--------|
| Similarity (register allocation variants) | ≥0.55 | ≥0.85 |
| Overall accuracy | ≥40% | ≥70% |
| False positive rate | <10% | <5% |
| P95 fingerprint latency | <100ms | <50ms |

##### 2.2.5.7 B2R2 LowUIR Adapter

The `B2R2LowUirLiftingService` implements `IIrLiftingService` using B2R2's native lifting capabilities. This provides cross-platform IR representation for semantic analysis.

**Key Components:**

```csharp
public sealed class B2R2LowUirLiftingService : IIrLiftingService
{
    // Lifts to B2R2 LowUIR and maps to Stella IR model
    public Task LiftToIrAsync(
        IReadOnlyList instructions,
        string functionName,
        LiftOptions? options = null,
        CancellationToken ct = default);
}
```

**Supported ISAs:**
- Intel (x86-32, x86-64)
- ARM (ARMv7, ARMv8/ARM64)
- MIPS (32/64)
- RISC-V (64)
- PowerPC, SPARC, SH4, AVR, EVM

**IR Statement Mapping:**

| B2R2 LowUIR | Stella IR Kind |
|-------------|----------------|
| Put | IrStatementKind.Store |
| Store | IrStatementKind.Store |
| Get | IrStatementKind.Load |
| Load | IrStatementKind.Load |
| BinOp | IrStatementKind.BinaryOp |
| UnOp | IrStatementKind.UnaryOp |
| Jmp | IrStatementKind.Jump |
| CJmp | IrStatementKind.ConditionalJump |
| InterJmp | IrStatementKind.IndirectJump |
| Call | IrStatementKind.Call |
| SideEffect | IrStatementKind.SideEffect |

**Determinism Guarantees:**
- Statements ordered by block address (ascending)
- Blocks sorted by entry address (ascending)
- Consistent IR IDs across identical inputs
- InvariantCulture used for all string formatting

##### 2.2.5.8 B2R2 Lifter Pool

The `B2R2LifterPool` provides bounded pooling and warm preload for B2R2 lifting units to reduce per-call allocation overhead.
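The bounded pooling and warm preload described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `B2R2LifterPool`: `FakeLifter`, `PooledLifter`, `BoundedLifterPool`, and `Warm` are hypothetical names, and a placeholder class stands in for the real B2R2 lifting unit.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Usage: warm one ISA at startup, then lease a lifter for the duration of a using block.
var pool = new BoundedLifterPool(maxPerIsa: 2, acquireTimeout: TimeSpan.FromMilliseconds(100));
pool.Warm(new[] { "intel-64" });

using (var lease = pool.Acquire("intel-64"))
{
    Console.WriteLine($"Leased lifter for {lease.Lifter.Isa}");
} // Dispose returns the lifter to the pool.

// Hypothetical stand-in for a B2R2 lifting unit.
public sealed class FakeLifter
{
    public FakeLifter(string isa) => Isa = isa;
    public string Isa { get; }
}

// Lease object: hands the lifter back to the pool exactly once when disposed.
public sealed class PooledLifter : IDisposable
{
    private readonly BoundedLifterPool _pool;
    private bool _returned;

    internal PooledLifter(BoundedLifterPool pool, FakeLifter lifter)
    {
        _pool = pool;
        Lifter = lifter;
    }

    public FakeLifter Lifter { get; }

    public void Dispose()
    {
        if (_returned) return;
        _returned = true;
        _pool.Return(Lifter);
    }
}

public sealed class BoundedLifterPool
{
    // Per-ISA state: a semaphore caps concurrent leases, a bag holds idle lifters.
    private sealed class IsaSlot
    {
        public IsaSlot(int max) => Permits = new SemaphoreSlim(max, max);
        public SemaphoreSlim Permits { get; }
        public ConcurrentBag<FakeLifter> Idle { get; } = new();
    }

    private readonly ConcurrentDictionary<string, IsaSlot> _slots = new();
    private readonly int _maxPerIsa;
    private readonly TimeSpan _acquireTimeout;

    public BoundedLifterPool(int maxPerIsa, TimeSpan acquireTimeout)
    {
        _maxPerIsa = maxPerIsa;
        _acquireTimeout = acquireTimeout;
    }

    // Warm preload: intended to run once at startup, before any lease is taken.
    public void Warm(IEnumerable<string> isas)
    {
        foreach (var isa in isas)
        {
            var slot = _slots.GetOrAdd(isa, _ => new IsaSlot(_maxPerIsa));
            while (slot.Idle.Count < _maxPerIsa) slot.Idle.Add(new FakeLifter(isa));
        }
    }

    // Blocks until a permit is free or the timeout elapses.
    public PooledLifter Acquire(string isa)
    {
        var slot = _slots.GetOrAdd(isa, _ => new IsaSlot(_maxPerIsa));
        if (!slot.Permits.Wait(_acquireTimeout))
            throw new TimeoutException($"No lifter for '{isa}' became available in time.");

        var lifter = slot.Idle.TryTake(out var idle) ? idle : new FakeLifter(isa);
        return new PooledLifter(this, lifter);
    }

    internal void Return(FakeLifter lifter)
    {
        var slot = _slots[lifter.Isa];
        slot.Idle.Add(lifter);
        slot.Permits.Release();
    }
}
```

In this sketch the semaphore, rather than the idle bag, enforces the per-ISA bound, so an unwarmed ISA still works: lifters are created lazily on first acquire and reused after they are returned.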
**Configuration (`B2R2LifterPoolOptions`):**

| Option | Default | Description |
|--------|---------|-------------|
| `MaxPoolSizePerIsa` | 4 | Maximum pooled lifters per ISA |
| `EnableWarmPreload` | true | Preload lifters at startup |
| `WarmPreloadIsas` | ["intel-64", "intel-32", "armv8-64", "armv7-32"] | ISAs to warm |
| `AcquireTimeout` | 5s | Timeout for acquiring a lifter |

**Pool Statistics:**
- `TotalPooledLifters`: Lifters currently in pool
- `TotalActiveLifters`: Lifters currently in use
- `IsWarm`: Whether pool has been warmed
- `IsaStats`: Per-ISA pool and active counts

**Usage:**

```csharp
using var lifter = _lifterPool.Acquire(isa);
var stmts = lifter.LiftingUnit.LiftInstruction(address);
// Lifter automatically returned to pool on dispose
```

##### 2.2.5.9 Function IR Cache

The `FunctionIrCacheService` provides Valkey-backed caching for computed semantic fingerprints to avoid redundant IR lifting and graph hashing.

**Cache Key Structure:**

```
(isa, b2r2_version, normalization_recipe, canonical_ir_hash)
```

**Configuration (`FunctionIrCacheOptions`):**

| Option | Default | Description |
|--------|---------|-------------|
| `KeyPrefix` | "stellaops:binidx:funccache:" | Valkey key prefix |
| `CacheTtl` | 4h | TTL for cached entries |
| `MaxTtl` | 24h | Maximum TTL |
| `Enabled` | true | Whether caching is enabled |
| `B2R2Version` | "0.9.1" | B2R2 version for cache key |
| `NormalizationRecipeVersion` | "v1" | Recipe version for cache key |

**Cache Entry (`CachedFunctionFingerprint`):**
- `FunctionAddress`, `FunctionName`
- `SemanticFingerprint`: The computed fingerprint
- `IrStatementCount`, `BasicBlockCount`
- `ComputedAtUtc`: ISO-8601 timestamp
- `B2R2Version`, `NormalizationRecipe`

**Invalidation Rules:**
- Cache entries expire after `CacheTtl` (default 4h)
- Changing B2R2 version or normalization recipe results in cache misses
- Manual invalidation via `RemoveAsync()`

**Statistics:**
- Hits, Misses, Evictions
- Hit Rate
- Enabled status

##### 2.2.5.10 Ops Endpoints

BinaryIndex exposes operational endpoints for health, benchmarking, cache monitoring, and configuration visibility.

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/v1/ops/binaryindex/health` | GET | Health status with lifter warmness, cache availability |
| `/api/v1/ops/binaryindex/bench/run` | POST | Run benchmark, return latency stats |
| `/api/v1/ops/binaryindex/cache` | GET | Function IR cache hit/miss statistics |
| `/api/v1/ops/binaryindex/config` | GET | Effective configuration (secrets redacted) |

**Health Response:**

```json
{
  "status": "healthy",
  "timestamp": "2026-01-14T12:00:00Z",
  "lifterStatus": "warm",
  "lifterWarm": true,
  "lifterPoolStats": {
    "intel-64": 4,
    "armv8-64": 2
  },
  "cacheStatus": "enabled",
  "cacheEnabled": true
}
```

**Determinism Constraints:**
- All timestamps in ISO-8601 UTC format
- ASCII-only output
- Deterministic JSON key ordering
- Secrets/credentials redacted from config endpoint

#### 2.2.6 Binary Vulnerability Service

Main query interface for consumers.

```csharp
public interface IBinaryVulnerabilityService
{
    /// <summary>
    /// Look up vulnerabilities by Build-ID or equivalent binary identity.
    /// </summary>
    Task> LookupByIdentityAsync(
        BinaryIdentity identity,
        LookupOptions? opts = null,
        CancellationToken ct = default);

    /// <summary>
    /// Look up vulnerabilities by function fingerprint.
    /// </summary>
    Task> LookupByFingerprintAsync(
        CodeFingerprint fingerprint,
        decimal minSimilarity = 0.95m,
        CancellationToken ct = default);

    /// <summary>
    /// Batch lookup for scan performance.
    /// </summary>
    Task>> LookupBatchAsync(
        IEnumerable identities,
        LookupOptions? opts = null,
        CancellationToken ct = default);

    /// <summary>
    /// Get distro-specific fix status (patch-aware).
    /// </summary>
    Task GetFixStatusAsync(
        string distro,
        string release,
        string sourcePkg,
        string cveId,
        CancellationToken ct = default);
}

public sealed record BinaryVulnMatch(
    string CveId,
    string VulnerablePurl,
    MatchMethod Method,         // buildid_catalog, fingerprint_match, range_match
    decimal Confidence,
    MatchEvidence Evidence);

public enum MatchMethod { BuildIdCatalog, FingerprintMatch, RangeMatch }
```

---

## 3. Data Model

### 3.1 PostgreSQL Schema (`binaries`)

The `binaries` schema stores binary identity, fingerprint, and match data.

```sql
CREATE SCHEMA IF NOT EXISTS binaries;
CREATE SCHEMA IF NOT EXISTS binaries_app;

-- RLS helper
CREATE OR REPLACE FUNCTION binaries_app.require_current_tenant()
RETURNS TEXT
LANGUAGE plpgsql STABLE SECURITY DEFINER
AS $$
DECLARE
    v_tenant TEXT;
BEGIN
    v_tenant := current_setting('app.tenant_id', true);
    IF v_tenant IS NULL OR v_tenant = '' THEN
        RAISE EXCEPTION 'app.tenant_id session variable not set';
    END IF;
    RETURN v_tenant;
END;
$$;
```

#### 3.1.1 Core Tables

See `docs/db/schemas/binaries_schema_specification.md` for complete DDL.

**Key Tables:**

| Table | Purpose |
|-------|---------|
| `binaries.binary_identity` | Known binary identities (Build-ID, hashes) |
| `binaries.binary_package_map` | Binary → package mapping per snapshot |
| `binaries.vulnerable_buildids` | Build-IDs known to be vulnerable |
| `binaries.vulnerable_fingerprints` | Function fingerprints for CVEs |
| `binaries.cve_fix_index` | Patch-aware fix status per distro |
| `binaries.fingerprint_matches` | Match results (findings evidence) |
| `binaries.corpus_snapshots` | Corpus ingestion tracking |

### 3.2 RustFS Layout

```
rustfs://stellaops/binaryindex/
  fingerprints///.bin
  corpus////manifest.json
  corpus////packages/.metadata.json
  evidence/.dsse.json
```

---

## 4. Integration Points

### 4.1 Scanner.Worker Integration

During container scanning, Scanner.Worker queries BinaryIndex for each extracted binary:

```mermaid
sequenceDiagram
    participant SW as Scanner.Worker
    participant BI as BinaryIndex
    participant PG as PostgreSQL
    participant FL as Findings Ledger

    SW->>SW: Extract binary from layer
    SW->>SW: Compute BinaryIdentity
    SW->>BI: LookupByIdentityAsync(identity)
    BI->>PG: Query binaries.vulnerable_buildids
    PG-->>BI: Matches
    BI->>PG: Query binaries.cve_fix_index (if distro known)
    PG-->>BI: Fix status
    BI-->>SW: BinaryVulnMatch[]
    SW->>FL: RecordFinding(match, evidence)
```

### 4.2 Concelier Integration

BinaryIndex subscribes to Concelier's advisory updates:

```mermaid
sequenceDiagram
    participant CO as Concelier
    participant BI as BinaryIndex
    participant PG as PostgreSQL

    CO->>CO: Ingest new advisory
    CO->>BI: advisory.created event
    BI->>BI: Check if affected packages in corpus
    BI->>PG: Update binaries.binary_vuln_assertion
    BI->>BI: Queue fingerprint generation (if high-impact)
```

### 4.3 Policy Integration

Binary matches are recorded as proof segments:

```json
{
  "segment_type": "binary_fingerprint_evidence",
  "payload": {
    "binary_identity": {
      "format": "elf",
      "build_id": "abc123...",
      "file_sha256": "def456..."
    },
    "matches": [
      {
        "cve_id": "CVE-2024-1234",
        "method": "buildid_catalog",
        "confidence": 0.98,
        "vulnerable_purl": "pkg:deb/debian/libssl3@1.1.1n-0+deb11u3"
      }
    ]
  }
}
```

---

## 5. MVP Roadmap

### MVP 1: Known-Build Binary Catalog (Sprint 6000.0001)

**Goal:** Query "is this Build-ID vulnerable?" with distro-level precision.

**Deliverables:**
- `binaries` PostgreSQL schema
- Build-ID to package mapping tables
- Basic CVE lookup by binary identity
- Debian/Ubuntu corpus connector

### MVP 2: Patch-Aware Backport Handling (Sprint 6000.0002)

**Goal:** Handle "version says vulnerable but distro backported the fix."
**Deliverables:** - Fix index builder (changelog + patch header parsing) - Distro-specific version comparison - RPM corpus connector - Scanner.Worker integration ### MVP 3: Binary Fingerprint Factory (Sprint 6000.0003) **Goal:** Detect vulnerable code independent of package metadata. **Deliverables:** - Fingerprint storage and matching - Reference build generation pipeline - Fingerprint validation corpus - High-impact CVE coverage (OpenSSL, glibc, zlib, curl) ### MVP 4: Full Scanner Integration (Sprint 6000.0004) **Goal:** Binary evidence in production scans. **Deliverables:** - Scanner.Worker binary lookup integration - Findings Ledger binary match records - Proof segment attestations - CLI binary match inspection --- ## 5b. Fix Evidence Chain The **Fix Evidence Chain** provides auditable proof of why a CVE is marked as fixed (or not) for a specific distro/package combination. This is critical for patch-aware backport handling where package versions can be misleading. ### 5b.1 Evidence Sources | Source | Confidence | Description | |--------|------------|-------------| | **Security Feed (OVAL)** | 0.95-0.99 | Authoritative feed from distro (Debian Security Tracker, Red Hat OVAL) | | **Patch Header (DEP-3)** | 0.87-0.95 | CVE reference in Debian/Ubuntu patch metadata | | **Changelog** | 0.75-0.85 | CVE mention in debian/changelog or RPM %changelog | | **Upstream Patch Match** | 0.90 | Binary diff matches known upstream fix | ### 5b.2 Evidence Storage Evidence is stored in two PostgreSQL tables: ```sql -- Fix index: one row per (distro, release, source_pkg, cve_id) CREATE TABLE binaries.cve_fix_index ( id UUID PRIMARY KEY, tenant_id TEXT NOT NULL, distro TEXT NOT NULL, -- debian, ubuntu, alpine, rhel release TEXT NOT NULL, -- bookworm, jammy, v3.19 source_pkg TEXT NOT NULL, cve_id TEXT NOT NULL, state TEXT NOT NULL, -- fixed, vulnerable, not_affected, wontfix, unknown fixed_version TEXT, method TEXT NOT NULL, -- security_feed, changelog, patch_header, upstream_match 
confidence DECIMAL(3,2) NOT NULL, evidence_id UUID REFERENCES binaries.fix_evidence(id), snapshot_id UUID, indexed_at TIMESTAMPTZ NOT NULL DEFAULT now(), UNIQUE (tenant_id, distro, release, source_pkg, cve_id) ); -- Evidence blobs: audit trail CREATE TABLE binaries.fix_evidence ( id UUID PRIMARY KEY, tenant_id TEXT NOT NULL, evidence_type TEXT NOT NULL, -- changelog, patch_header, security_feed source_file TEXT, -- Path to source file (changelog, patch) source_sha256 TEXT, -- Hash of source file excerpt TEXT, -- Relevant snippet (max 1KB) metadata JSONB NOT NULL, -- Structured metadata snapshot_id UUID, created_at TIMESTAMPTZ NOT NULL DEFAULT now() ); ``` ### 5b.3 Evidence Types **ChangelogEvidence:** ```json { "evidence_type": "changelog", "source_file": "debian/changelog", "excerpt": "* Fix CVE-2024-0727: PKCS12 decoding crash", "metadata": { "version": "3.0.11-1~deb12u2", "line_number": 5 } } ``` **PatchHeaderEvidence:** ```json { "evidence_type": "patch_header", "source_file": "debian/patches/CVE-2024-0727.patch", "excerpt": "CVE: CVE-2024-0727\nOrigin: upstream, https://github.com/openssl/commit/abc123", "metadata": { "patch_sha256": "abc123def456..." 
} } ``` **SecurityFeedEvidence:** ```json { "evidence_type": "security_feed", "metadata": { "feed_id": "debian-security-tracker", "entry_id": "DSA-5678-1", "published_at": "2024-01-15T10:00:00Z" } } ``` ### 5b.4 Confidence Resolution When multiple evidence sources exist for the same CVE, the system keeps the **highest confidence** entry: ```csharp ON CONFLICT (tenant_id, distro, release, source_pkg, cve_id) DO UPDATE SET confidence = GREATEST(existing.confidence, new.confidence), method = CASE WHEN existing.confidence < new.confidence THEN new.method ELSE existing.method END, evidence_id = CASE WHEN existing.confidence < new.confidence THEN new.evidence_id ELSE existing.evidence_id END ``` ### 5b.5 Parsers The following parsers extract CVE fix information: | Parser | Distros | Input | Confidence | |--------|---------|-------|------------| | `DebianChangelogParser` | Debian, Ubuntu | debian/changelog | 0.80 | | `PatchHeaderParser` | Debian, Ubuntu | debian/patches/*.patch (DEP-3) | 0.87 | | `AlpineSecfixesParser` | Alpine | APKBUILD secfixes block | 0.95 | | `RpmChangelogParser` | RHEL, Fedora, CentOS | RPM spec %changelog | 0.75 | ### 5b.6 Query Flow ```mermaid sequenceDiagram participant SW as Scanner.Worker participant BVS as BinaryVulnerabilityService participant FIR as FixIndexRepository participant PG as PostgreSQL SW->>BVS: GetFixStatusAsync(debian, bookworm, openssl, CVE-2024-0727) BVS->>FIR: GetFixStatusAsync(...) FIR->>PG: SELECT FROM cve_fix_index WHERE ... PG-->>FIR: FixIndexEntry (state=fixed, confidence=0.87) FIR-->>BVS: FixStatusResult BVS-->>SW: {state: Fixed, confidence: 0.87, method: PatchHeader} ``` --- ## 6. Security Considerations ### 6.1 Trust Boundaries 1. **Corpus Ingestion** - Packages are untrusted; extraction runs in sandboxed workers 2. **Fingerprint Generation** - Reference builds compiled in isolated environments 3. 
**Query API** - Tenant-isolated via RLS; no cross-tenant data leakage ### 6.2 Signing & Provenance - All corpus snapshots are signed (DSSE) - Fingerprint sets are versioned and signed - Every match result references evidence digests ### 6.3 Sandbox Requirements Binary extraction and fingerprint generation MUST run with: - Seccomp profile restricting syscalls - Read-only root filesystem - No network access during analysis - Memory/CPU limits --- ## 7. Observability ### 7.1 Metrics | Metric | Type | Labels | |--------|------|--------| | `binaryindex_lookup_total` | Counter | method, result | | `binaryindex_lookup_latency_ms` | Histogram | method | | `binaryindex_corpus_packages_total` | Gauge | distro, release | | `binaryindex_fingerprints_indexed` | Gauge | algorithm, component | | `binaryindex_match_confidence` | Histogram | method | ### 7.2 Traces - `binaryindex.lookup` - Full lookup span - `binaryindex.corpus.ingest` - Corpus ingestion - `binaryindex.fingerprint.generate` - Fingerprint generation ### 7.3 Ops Endpoints > **Sprint:** SPRINT_20260112_007_BINIDX_binaryindex_user_config BinaryIndex exposes read-only ops endpoints for health, bench, cache, and effective configuration: | Endpoint | Method | Response Schema | Description | |----------|--------|-----------------|-------------| | `/api/v1/ops/binaryindex/health` | GET | `BinaryIndexOpsHealthResponse` | Health status, lifter warmness per ISA, cache availability | | `/api/v1/ops/binaryindex/bench/run` | POST | `BinaryIndexBenchResponse` | Run latency benchmark, return min/max/mean/p50/p95/p99 stats | | `/api/v1/ops/binaryindex/cache` | GET | `BinaryIndexFunctionCacheStats` | Function cache hit/miss/eviction statistics | | `/api/v1/ops/binaryindex/config` | GET | `BinaryIndexEffectiveConfig` | Effective configuration with secrets redacted | #### 7.3.1 Response Schemas **BinaryIndexOpsHealthResponse:** ```json { "status": "healthy", "timestamp": "2026-01-16T12:00:00Z", "components": { "lifterPool": { "status": 
"healthy", "message": null }, "functionCache": { "status": "healthy", "message": null }, "persistence": { "status": "healthy", "message": null } }, "lifterWarmness": { "intel-64": { "isa": "intel-64", "warm": true, "poolSize": 4, "acquireTimeMs": 12 }, "armv8-64": { "isa": "armv8-64", "warm": true, "poolSize": 2, "acquireTimeMs": 8 } } } ``` **BinaryIndexBenchResponse:** ```json { "timestamp": "2026-01-16T12:00:00Z", "sampleSize": 100, "latencySummary": { "minMs": 5.2, "maxMs": 142.8, "meanMs": 28.4, "p50Ms": 22.1, "p95Ms": 78.3, "p99Ms": 121.5 }, "operations": [ { "operation": "lifterAcquire", "samples": 100, "meanMs": 12.4 }, { "operation": "irNormalization", "samples": 100, "meanMs": 8.7 }, { "operation": "cacheLookup", "samples": 100, "meanMs": 1.2 } ] } ``` **BinaryIndexFunctionCacheStats:** ```json { "enabled": true, "backend": "valkey", "hits": 15234, "misses": 892, "evictions": 45, "hitRate": 0.944, "keyPrefix": "stellaops:binidx:funccache:", "cacheTtlSeconds": 14400, "estimatedEntries": 12500, "estimatedMemoryBytes": 52428800 } ``` **BinaryIndexEffectiveConfig:** ```json { "b2r2Pool": { "maxPoolSizePerIsa": 4, "warmPreload": ["intel-64", "armv8-64"], "acquireTimeoutMs": 5000, "enableMetrics": true }, "semanticLifting": { "b2r2Version": "1.5.0", "normalizationRecipeVersion": "2024.1", "maxInstructionsPerFunction": 10000, "maxFunctionsPerBinary": 5000, "functionLiftTimeoutMs": 30000, "enableDeduplication": true }, "functionCache": { "connectionString": "********", "keyPrefix": "stellaops:binidx:funccache:", "cacheTtlSeconds": 14400, "maxTtlSeconds": 86400, "earlyExpiryPercent": 0.1, "maxEntrySizeBytes": 1048576 }, "persistence": { "schema": "binaries", "minPoolSize": 5, "maxPoolSize": 20, "commandTimeoutSeconds": 30, "retryOnFailure": true, "batchSize": 100 }, "backendVersions": { "b2r2": "1.5.0", "valkey": "7.2.0", "postgres": "15.4" } } ``` #### 7.3.2 Rate Limiting The `/bench/run` endpoint is rate-limited to prevent load spikes: - Default: 5 requests per 
minute per tenant - Configurable via `BinaryIndex:Ops:BenchRateLimitPerMinute` #### 7.3.3 Secret Redaction The config endpoint automatically redacts sensitive keys: | Redacted Keys | Pattern | |---------------|---------| | `connectionString` | Replaced with `********` | | `password` | Replaced with `********` | | `secret*` | Any key starting with "secret" | | `apiKey` | Replaced with `********` | | `token` | Replaced with `********` | Redaction is applied recursively to nested objects. --- ## 8. Configuration > **Sprint:** SPRINT_20260112_007_BINIDX_binaryindex_user_config ### 8.1 Configuration Sections All configuration is under the `BinaryIndex` section in `appsettings.yaml` or environment variables with `BINARYINDEX__` prefix. #### 8.1.1 B2R2 Lifter Pool (`BinaryIndex:B2R2Pool`) | Key | Type | Default | Description | |-----|------|---------|-------------| | `MaxPoolSizePerIsa` | int | 4 | Maximum lifter instances per ISA | | `WarmPreload` | string[] | ["intel-64", "armv8-64"] | ISAs to warm on startup | | `AcquireTimeoutMs` | int | 5000 | Timeout for lifter acquisition | | `EnableMetrics` | bool | true | Emit Prometheus metrics for pool | ```yaml BinaryIndex: B2R2Pool: MaxPoolSizePerIsa: 4 WarmPreload: - intel-64 - armv8-64 AcquireTimeoutMs: 5000 EnableMetrics: true ``` #### 8.1.2 Semantic Lifting (`BinaryIndex:SemanticLifting`) | Key | Type | Default | Description | |-----|------|---------|-------------| | `B2R2Version` | string | "1.5.0" | B2R2 disassembler version | | `NormalizationRecipeVersion` | string | "2024.1" | IR normalization recipe version | | `MaxInstructionsPerFunction` | int | 10000 | Max instructions to lift per function | | `MaxFunctionsPerBinary` | int | 5000 | Max functions to process per binary | | `FunctionLiftTimeoutMs` | int | 30000 | Timeout for lifting single function | | `EnableDeduplication` | bool | true | Deduplicate IR before fingerprinting | ```yaml BinaryIndex: SemanticLifting: MaxInstructionsPerFunction: 10000 
    MaxFunctionsPerBinary: 5000
    FunctionLiftTimeoutMs: 30000
    EnableDeduplication: true
```

#### 8.1.3 Function Cache (`BinaryIndex:FunctionCache`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `ConnectionString` | string | (none) | Valkey connection string (secret) |
| `KeyPrefix` | string | "stellaops:binidx:funccache:" | Cache key prefix |
| `CacheTtlSeconds` | int | 14400 | Default cache TTL (4 hours) |
| `MaxTtlSeconds` | int | 86400 | Maximum TTL (24 hours) |
| `EarlyExpiryPercent` | decimal | 0.1 | Early expiry jitter (10%) |
| `MaxEntrySizeBytes` | int | 1048576 | Max entry size (1 MB) |

```yaml
BinaryIndex:
  FunctionCache:
    ConnectionString: ${VALKEY_CONNECTION}  # from env
    KeyPrefix: "stellaops:binidx:funccache:"
    CacheTtlSeconds: 14400
    MaxEntrySizeBytes: 1048576
```

#### 8.1.4 Persistence (`Postgres:BinaryIndex`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `Schema` | string | "binaries" | PostgreSQL schema name |
| `MinPoolSize` | int | 5 | Minimum connection pool size |
| `MaxPoolSize` | int | 20 | Maximum connection pool size |
| `CommandTimeoutSeconds` | int | 30 | Command execution timeout |
| `RetryOnFailure` | bool | true | Retry transient failures |
| `BatchSize` | int | 100 | Batch insert size |

```yaml
Postgres:
  BinaryIndex:
    Schema: binaries
    MinPoolSize: 5
    MaxPoolSize: 20
    CommandTimeoutSeconds: 30
    RetryOnFailure: true
    BatchSize: 100
```

#### 8.1.5 Ops Configuration (`BinaryIndex:Ops`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `EnableHealthEndpoint` | bool | true | Enable `/health` endpoint |
| `EnableBenchEndpoint` | bool | true | Enable `/bench/run` endpoint |
| `BenchRateLimitPerMinute` | int | 5 | Rate limit for bench endpoint |
| `RedactedKeys` | string[] | See 7.3.3 | Keys to redact in config output |

### 8.2 Legacy Configuration

```yaml
# binaryindex.yaml (corpus configuration)
binaryindex:
  enabled: true
  corpus:
    connectors:
      - type: debian
        enabled: true
        mirror: http://deb.debian.org/debian
        releases: [bookworm, bullseye]
        architectures: [amd64, arm64]
      - type: ubuntu
        enabled: true
        mirror: http://archive.ubuntu.com/ubuntu
        releases: [jammy, noble]
  fingerprinting:
    enabled: true
    algorithms: [basic_block, cfg]
    target_components:
      - openssl
      - glibc
      - zlib
      - curl
      - sqlite
    min_function_size: 16  # bytes
    max_functions_per_binary: 10000
  lookup:
    cache_ttl: 3600
    batch_size: 100
    timeout_ms: 5000
  storage:
    postgres_schema: binaries
    rustfs_bucket: stellaops/binaryindex
```

---

## 9. Testing Strategy

### 9.1 Unit Tests

- Identity extraction (Build-ID, hashes)
- Fingerprint generation determinism
- Fix index parsing (changelog, patch headers)

### 9.2 Integration Tests

- PostgreSQL schema validation
- Full corpus ingestion flow
- Scanner.Worker lookup integration

### 9.3 Regression Tests

- Known CVE detection (golden corpus)
- Backport handling (Debian libssl example)
- False positive rate validation

---

## 10. Golden Corpus for Patch Provenance

> **Sprint:** SPRINT_20260121_034/035/036 - Golden Corpus Implementation

The BinaryIndex module supports a **golden corpus** of patch-paired artifacts that enables offline SBOM reproducibility and binary-level patch provenance verification.

### 10.1 Corpus Purpose

The golden corpus provides:

- **Auditor-ready evidence bundles** for air-gapped customers
- **Regression testing** for binary matching accuracy
- **Proof of patch status** independent of package metadata

### 10.2 Corpus Sources

| Source | Type | Purpose |
|--------|------|---------|
| Debian Security Tracker / DSAs | Advisory | Primary advisory linkage |
| Debian Snapshot | Binary archive | Pre/post patch binary pairs |
| Ubuntu Security Notices | Advisory | Ubuntu-specific advisories |
| Alpine secdb | Advisory | Alpine YAML advisories |
| OSV dump | Unified schema | Cross-reference and commit ranges |

#### 10.2.1 Symbol Source Connectors

> **Sprint:** SPRINT_20260121_035_BinaryIndex_golden_corpus_connectors_cli

The corpus ingestion layer uses pluggable connectors to retrieve symbols and metadata from upstream sources:

| Connector ID | Implementation | Protocol | Data Retrieved |
|--------------|----------------|----------|----------------|
| `debuginfod-fedora` | `DebuginfodConnector` | debuginfod HTTP | ELF debug symbols by Build-ID |
| `debuginfod-ubuntu` | `DebuginfodConnector` | debuginfod HTTP | ELF debug symbols by Build-ID |
| `ddeb-ubuntu` | `DdebConnector` | APT/HTTP | `.ddeb` debug packages |
| `buildinfo-debian` | `BuildinfoConnector` | HTTP | `.buildinfo` reproducibility records |
| `secdb-alpine` | `AlpineSecDbConnector` | Git/HTTP | `secfixes` YAML from APKBUILD |

**Connector Interface:**

```csharp
public interface ISymbolSourceConnector
{
    string ConnectorId { get; }
    string DisplayName { get; }
    string[] SupportedDistros { get; }

    Task GetStatusAsync(CancellationToken ct);
    Task SyncAsync(SyncOptions options, CancellationToken ct);
    Task LookupByBuildIdAsync(string buildId, CancellationToken ct);
    Task> SearchAsync(SymbolSearchQuery query, CancellationToken ct);
}
```

**Debuginfod Connector:**

The `DebuginfodConnector` implements the [debuginfod protocol](https://sourceware.org/elfutils/Debuginfod.html) for retrieving debug
symbols:

- Endpoint: `GET /buildid/<BUILDID>/debuginfo`
- Supports federated queries across multiple debuginfod servers
- Caches retrieved symbols in RustFS blob storage
- Rate-limited to respect upstream server policies

**Ubuntu ddeb Connector:**

The `DdebConnector` retrieves Ubuntu debug symbol packages (`.ddeb`):

- Sources: `ddebs.ubuntu.com` mirror
- Indexes: Reads `Packages.xz` for package metadata
- Extraction: Unpacks `.ddeb` AR archives to extract DWARF symbols
- Mapping: Links debug symbols to binary packages via Build-ID

**Debian Buildinfo Connector:**

The `BuildinfoConnector` retrieves Debian buildinfo files for reproducibility verification:

- Source: `buildinfos.debian.net` and snapshot archives
- Purpose: Provides build environment metadata for reproducible builds
- Fields extracted: `Build-Date`, `Build-Architecture`, `Checksums-Sha256`
- Integration: Cross-references with binary packages for provenance

**Alpine SecDB Connector:**

The `AlpineSecDbConnector` parses Alpine's security database:

- Source: `secfixes` blocks in APKBUILD files
- Repository: `alpine/aports` Git repository
- Format: YAML blocks mapping CVEs to fixed versions
- Example:

  ```yaml
  secfixes:
    3.0.11-r0:
      - CVE-2024-0727
      - CVE-2024-0728
  ```

**OSV Dump Parser:**

The `OsvDumpParser` processes Google OSV database dumps for advisory cross-correlation:

- Source: `osv.dev` bulk exports (JSON)
- Purpose: CVE → commit range extraction for patch identification
- Cross-reference: Correlates OSV entries with distribution advisories
- Inconsistency detection: Identifies discrepancies between OSV and distro advisories

```csharp
public interface IOsvDumpParser
{
    IAsyncEnumerable ParseDumpAsync(Stream osvDumpStream, CancellationToken ct);
    OsvCveIndex BuildCveIndex(IEnumerable entries);
    IEnumerable CrossReferenceWithExternal(
        OsvCveIndex osvIndex,
        IEnumerable externalAdvisories);
    IEnumerable DetectInconsistencies(
        IEnumerable correlations);
}
```

**CLI Access:**

All connectors are manageable via the `stella
groundtruth sources` CLI commands:

```bash
# List all connectors
stella groundtruth sources list

# Sync specific connector
stella groundtruth sources sync --source buildinfo-debian --full

# Enable/disable connectors
stella groundtruth sources enable ddeb-ubuntu
stella groundtruth sources disable debuginfod-fedora
```

See [Ground-Truth CLI Guide](../cli/guides/ground-truth-cli.md) for complete CLI documentation.

### 10.3 Key Performance Indicators

| KPI | Target | Description |
|-----|--------|-------------|
| Per-function match rate | >= 90% | Functions matched in post-patch binary |
| False-negative patch detection | <= 5% | Patched functions incorrectly classified as unpatched |
| SBOM canonical-hash stability | 3/3 | Determinism across independent runs |
| Binary reconstruction equivalence | Trend | Rebuilt binary matches original |
| End-to-end verify time (p95, cold) | Trend | Offline verification performance |

### 10.4 Validation Harness

The validation harness (`IValidationHarness`) orchestrates end-to-end verification:

```
Binary Pair (pre/post) → Symbol Recovery → IR Lifting → Fingerprinting → Matching → Metrics
```

### 10.5 Evidence Bundle Format

Evidence bundles follow OCI/ORAS conventions:

```
--bundle.oci.tar
├── manifest.json    # OCI manifest
└── blobs/
    ├── sha256:      # Canonical SBOM
    ├── sha256:      # Pre-fix binary
    ├── sha256:      # Post-fix binary
    ├── sha256:      # DSSE delta-sig predicate
    └── sha256:      # RFC 3161 timestamp
```

### 10.6 Two-Tier Bundle Design and Large Blob References

> **Sprint:** SPRINT_20260122_040_Platform_oci_delta_attestation_pipeline (040-04)

Evidence bundles support two export modes to balance transfer speed with auditability:

| Mode | Export Flag | Contents | Use Case |
|------|------------|----------|----------|
| **Light** | (default) | Manifest + attestation envelopes + metadata | Quick transfer, metadata-only audit |
| **Full** | `--full` | Light + embedded binary blobs in `blobs/` | Air-gap replay, full provenance verification |

#### 10.6.1 `largeBlobs[]` Field

The `DeltaSigPredicate` includes a `largeBlobs` array referencing binary artifacts that may be too large to embed in attestation payloads:

```json
{
  "schemaVersion": "1.0.0",
  "subject": [...],
  "delta": [...],
  "largeBlobs": [
    {
      "kind": "binary-patch",
      "digest": "sha256:a1b2c3...",
      "mediaType": "application/octet-stream",
      "sizeBytes": 1048576
    },
    {
      "kind": "sbom-fragment",
      "digest": "sha256:d4e5f6...",
      "mediaType": "application/spdx+json",
      "sizeBytes": 32768
    }
  ],
  "sbomDigest": "sha256:789abc..."
}
```

**Field Definitions:**

| Field | Type | Description |
|-------|------|-------------|
| `largeBlobs[].kind` | string | Blob category: `binary-patch`, `sbom-fragment`, `debug-symbols`, etc. |
| `largeBlobs[].digest` | string | Content-addressable digest (`sha256:`, `sha384:`, `sha512:`) |
| `largeBlobs[].mediaType` | string | IANA media type of the blob |
| `largeBlobs[].sizeBytes` | long | Blob size in bytes |
| `sbomDigest` | string | Digest of the canonical SBOM associated with this delta |

#### 10.6.2 Blob Fetch Strategy

During `stella bundle verify --replay`, blobs are resolved in priority order:

1. **Embedded** (full bundles): Read from `blobs/` in the bundle directory
2. **Local source** (`--blob-source /path/`): Read from the specified local directory
3. **Registry** (`--blob-source https://...`): HTTP GET from an OCI registry (blocked in `--offline` mode)

#### 10.6.3 Digest Verification

Fetched blobs are verified against their declared digest, using the hash algorithm named by the digest prefix:

```
sha256: → SHA-256
sha384: → SHA-384
sha512: → SHA-512
```

A mismatch fails the blob replay verification step.

### 10.7 Related Documentation

- [Golden Corpus KPIs](../../benchmarks/golden-corpus-kpis.md)
- [Golden Corpus Seed List](../../benchmarks/golden-corpus-seed-list.md)
- [Ground-Truth Corpus Specification](../../benchmarks/ground-truth-corpus.md)

---

## 11. References

- Advisory: `docs/product/advisories/21-Dec-2025 - Mapping Evidence Within Compiled Binaries.md`
- Scanner Native Analysis: `src/Scanner/StellaOps.Scanner.Analyzers.Native/`
- Existing Fingerprinting: `src/Scanner/__Libraries/StellaOps.Scanner.EntryTrace/Binary/`
- Build-ID Index: `src/Scanner/StellaOps.Scanner.Analyzers.Native/Index/`
- **Semantic Diffing Sprint:** `docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
- **Semantic Library:** `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/`
- **Semantic Tests:** `src/BinaryIndex/__Tests/StellaOps.BinaryIndex.Semantic.Tests/`
- **Golden Corpus Sprints:** `docs/implplan/SPRINT_20260121_034_BinaryIndex_golden_corpus_foundation.md`

---

*Document Version: 1.2.0*
*Last Updated: 2026-01-21*