# BinaryIndex Module Architecture

> **Ownership:** Scanner Guild + Concelier Guild
> **Status:** DRAFT
> **Version:** 1.0.0
> **Related:** [High-Level Architecture](../../ARCHITECTURE_OVERVIEW.md), [Scanner Architecture](../scanner/architecture.md), [Concelier Architecture](../concelier/architecture.md), [Hybrid Diff Stack](./hybrid-diff-stack.md)

---

## 1. Overview

The **BinaryIndex** module provides a vulnerable binaries database that enables detection of vulnerable code at the binary level, independent of package metadata. This addresses a critical gap in vulnerability scanning: package version strings can lie (backports, custom builds, stripped metadata), but **binary identity doesn't lie**.

### 1.1 Problem Statement

Traditional vulnerability scanners rely on package version matching, which fails in several scenarios:

1. **Backported patches** - Distros backport security fixes without changing upstream version
2. **Custom/vendored builds** - Binaries compiled from source without package metadata
3. **Stripped binaries** - Debug info and version strings removed
4. **Static linking** - Vulnerable library code embedded in final binary
5. **Container base images** - Distroless or scratch images with no package DB

### 1.2 Solution: Binary-First Vulnerability Detection

BinaryIndex provides three tiers of binary identification:

| Tier | Method | Precision | Coverage |
|------|--------|-----------|----------|
| A | Package/version range matching | Medium | High |
| B | Build-ID/hash catalog (exact binary identity) | High | Medium |
| C | Function fingerprints (CFG/basic-block hashes) | Very High | Targeted |

Tier B catalog lookups are performed over three identity dimensions:

- `BuildId + BuildIdType`
- `BinaryKey`
- `FileSha256`

#### 1.2.1 Tier C Fingerprint Runtime Contract

Tier C function fingerprint matching is implemented as deterministic offline-safe byte-window analysis:

- `FingerprintExtractor` derives basic-block hashes, CFG hash, string-reference hashes, constants, and call-targets from bounded binary byte windows (not seed-only synthetic placeholders).
- `SignatureMatcher` applies configurable weighted matching across basic-block, CFG, string-reference, and constant signals.
- Golden CVE fixtures include required high-impact package coverage for `openssl`, `glibc`, `zlib`, and `curl`, and verification checks enforce this coverage during feature rechecks.

### 1.3 Module Scope

**In Scope:**

- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
- Binary-to-advisory mapping database
- Fingerprint storage and matching engine
- Fix index for patch-aware backport handling
- Integration with Scanner.Worker for binary lookup

**Out of Scope:**

- Binary disassembly/analysis (provided by Scanner.Analyzers.Native)
- Runtime binary tracing (provided by Zastava)
- SBOM generation (provided by Scanner)

---

## 2. Architecture

### 2.1 System Context

```
┌──────────────────────────────────────────────────────────────────────────┐
│                             External Systems                             │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐         │
│  │ Distro Repos    │   │ Debug Symbol    │   │ Upstream Source │         │
│  │ (Debian, RPM,   │   │ Servers         │   │ (GitHub, etc.)  │         │
│  │  Alpine)        │   │ (debuginfod)    │   │                 │         │
│  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘         │
└───────────│─────────────────────│─────────────────────│──────────────────┘
            │                     │                     │
            v                     v                     v
┌──────────────────────────────────────────────────────────────────────────┐
│                            BinaryIndex Module                            │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                        Corpus Ingestion Layer                        │ │
│ │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │ │
│ │  │ DebianCorpus │  │ RpmCorpus    │  │ AlpineCorpus │                │ │
│ │  │ Connector    │  │ Connector    │  │ Connector    │                │ │
│ │  └──────────────┘  └──────────────┘  └──────────────┘                │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                     │                                    │
│                                     v                                    │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                           Processing Layer                           │ │
│ │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │ │
│ │  │ BinaryFeature│  │ FixIndex     │  │ Fingerprint  │                │ │
│ │  │ Extractor    │  │ Builder      │  │ Generator    │                │ │
│ │  └──────────────┘  └──────────────┘  └──────────────┘                │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                     │                                    │
│                                     v                                    │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                            Storage Layer                             │ │
│ │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │ │
│ │  │ PostgreSQL   │  │ RustFS       │  │ Valkey       │                │ │
│ │  │ (binaries    │  │ (fingerprint │  │ (lookup      │                │ │
│ │  │  schema)     │  │  blobs)      │  │  cache)      │                │ │
│ │  └──────────────┘  └──────────────┘  └──────────────┘                │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                     │                                    │
│                                     v                                    │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                             Query Layer                              │ │
│ │  ┌──────────────────────────────────────────────────────────────┐    │ │
│ │  │ IBinaryVulnerabilityService                                  │    │ │
│ │  │ - LookupByBuildIdAsync(buildId)                              │    │ │
│ │  │ - LookupByFingerprintAsync(fingerprint)                      │    │ │
│ │  │ - LookupBatchAsync(identities)                               │    │ │
│ │  │ - GetFixStatusAsync(distro, release, sourcePkg, cve)         │    │ │
│ │  └──────────────────────────────────────────────────────────────┘    │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
                                      │
                                      v
┌──────────────────────────────────────────────────────────────────────────┐
│                            Consuming Modules                             │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐         │
│  │ Scanner.Worker  │   │ Policy Engine   │   │ Findings Ledger │         │
│  │ (binary lookup  │   │ (evidence in    │   │ (match records) │         │
│  │  during scan)   │   │  proof chain)   │   │                 │         │
│  └─────────────────┘   └─────────────────┘   └─────────────────┘         │
└──────────────────────────────────────────────────────────────────────────┘
```

### 2.2 Component Breakdown

#### 2.2.1 Corpus Connectors

Plugin-based connectors that ingest binaries from distribution repositories.

```csharp
public interface IBinaryCorpusConnector
{
    string ConnectorId { get; }
    string[] SupportedDistros { get; }

    Task FetchSnapshotAsync(CorpusQuery query, CancellationToken ct);
    Task> ExtractBinariesAsync(PackageReference pkg, CancellationToken ct);
}
```

**Implementations:**

- `DebianBinaryCorpusConnector` - Debian/Ubuntu packages + debuginfo
- `RpmBinaryCorpusConnector` - RHEL/Fedora/CentOS + SRPM
- `AlpineBinaryCorpusConnector` - Alpine APK + APKBUILD
- `DebianMirrorPackageSource` - Debian package index/payload mirror source
- `AlpineMirrorPackageSource` - Alpine package index/payload mirror source with cache fallback
- `RpmMirrorPackageSource` - RPM primary metadata/package mirror source with cache fallback

#### 2.2.2 Binary Feature Extractor

Extracts identity and features from binaries.
Reuses existing Scanner.Analyzers.Native capabilities.

```csharp
public interface IBinaryFeatureExtractor
{
    Task ExtractIdentityAsync(Stream binaryStream, CancellationToken ct);
    Task ExtractFeaturesAsync(Stream binaryStream, ExtractorOptions opts, CancellationToken ct);
}

public sealed record BinaryIdentity(
    string Format,              // elf, pe, macho
    string? BuildId,            // ELF GNU Build-ID
    string? PeCodeViewGuid,     // PE CodeView GUID + Age
    string? MachoUuid,          // Mach-O LC_UUID
    string FileSha256,
    string TextSectionSha256);

public sealed record BinaryFeatures(
    BinaryIdentity Identity,
    string[] DynamicDeps,       // DT_NEEDED
    string[] ExportedSymbols,
    string[] ImportedSymbols,
    BinaryHardening Hardening);
```

#### 2.2.3 Fix Index Builder

Builds the patch-aware CVE fix index from distro sources.

```csharp
public interface IFixIndexBuilder
{
    Task BuildIndexAsync(DistroRelease distro, CancellationToken ct);
    Task GetFixRecordAsync(string distro, string release, string sourcePkg, string cveId, CancellationToken ct);
}

public sealed record FixRecord(
    string Distro,
    string Release,
    string SourcePkg,
    string CveId,
    FixState State,             // fixed, vulnerable, not_affected, wontfix, unknown
    string? FixedVersion,       // Distro version string
    FixMethod Method,           // security_feed, changelog, patch_header
    decimal Confidence,         // 0.00-1.00
    FixEvidence Evidence);

public enum FixState { Fixed, Vulnerable, NotAffected, Wontfix, Unknown }
public enum FixMethod { SecurityFeed, Changelog, PatchHeader, UpstreamPatchMatch }
```

#### 2.2.4 Fingerprint Generator

Generates function-level fingerprints for vulnerable code detection.
```csharp
public interface IVulnFingerprintGenerator
{
    Task> GenerateAsync(
        string cveId,
        BinaryPair vulnAndFixed,    // Reference builds
        FingerprintOptions opts,
        CancellationToken ct);
}

public sealed record VulnFingerprint(
    string CveId,
    string Component,           // e.g., openssl
    string Architecture,        // x86-64, aarch64
    FingerprintType Type,       // basic_block, cfg, combined
    string FingerprintId,       // e.g., "bb-abc123..."
    byte[] FingerprintHash,     // 16-32 bytes
    string? FunctionHint,       // Function name if known
    decimal Confidence,
    FingerprintEvidence Evidence);

public enum FingerprintType { BasicBlock, ControlFlowGraph, StringReferences, Combined }
```

#### 2.2.5 Semantic Analysis Library

> **Library:** `StellaOps.BinaryIndex.Semantic`
> **Sprint:** 20260105_001_001_BINDEX - Semantic Diffing Phase 1

The Semantic Analysis Library extends fingerprint generation with IR-level semantic matching, enabling detection of semantically equivalent code despite compiler optimizations, instruction reordering, and register allocation differences.

**Key Insight:** Traditional instruction-level fingerprinting loses roughly 15-20% accuracy on optimized binaries. Semantic analysis lifts to B2R2's Intermediate Representation (LowUIR), extracts key-semantics graphs, and uses graph hashing for similarity computation.

##### 2.2.5.1 Architecture

```
Binary Input
     │
     v
B2R2 Disassembly → Raw Instructions
     │
     v
IR Lifting Service → LowUIR Statements
     │
     v
Semantic Graph Extractor → Key-Semantics Graph (KSG)
     │
     v
Graph Fingerprinting → Semantic Fingerprint
     │
     v
Semantic Matcher → Similarity Score + Deltas
```

##### 2.2.5.2 Core Components

**IR Lifting Service** (`IIrLiftingService`)

Lifts disassembled instructions to B2R2 LowUIR:

```csharp
public interface IIrLiftingService
{
    Task LiftToIrAsync(
        IReadOnlyList instructions,
        string functionName,
        LiftOptions? options = null,
        CancellationToken ct = default);
}

public sealed record LiftedFunction(
    string Name,
    ImmutableArray Statements,
    ImmutableArray BasicBlocks);
```

**Semantic Graph Extractor** (`ISemanticGraphExtractor`)

Extracts key-semantics graphs capturing data dependencies, control flow, and memory operations:

```csharp
public interface ISemanticGraphExtractor
{
    Task ExtractGraphAsync(
        LiftedFunction function,
        GraphExtractionOptions? options = null,
        CancellationToken ct = default);
}

public sealed record KeySemanticsGraph(
    string FunctionName,
    ImmutableArray Nodes,
    ImmutableArray Edges,
    GraphProperties Properties);

public enum SemanticNodeType { Compute, Load, Store, Branch, Call, Return, Phi }
public enum SemanticEdgeType { DataDependency, ControlDependency, MemoryDependency }
```

**Semantic Fingerprint Generator** (`ISemanticFingerprintGenerator`)

Generates semantic fingerprints using Weisfeiler-Lehman graph hashing:

```csharp
public interface ISemanticFingerprintGenerator
{
    Task GenerateAsync(
        KeySemanticsGraph graph,
        SemanticFingerprintOptions? options = null,
        CancellationToken ct = default);
}

public sealed record SemanticFingerprint(
    string FunctionName,
    string GraphHashHex,        // WL graph hash (SHA-256)
    string OperationHashHex,    // Normalized operation sequence hash
    string DataFlowHashHex,     // Data dependency pattern hash
    int NodeCount,
    int EdgeCount,
    int CyclomaticComplexity,
    ImmutableArray ApiCalls,
    SemanticFingerprintAlgorithm Algorithm);
```

**Semantic Matcher** (`ISemanticMatcher`)

Computes semantic similarity with weighted components:

```csharp
public interface ISemanticMatcher
{
    Task MatchAsync(
        SemanticFingerprint a,
        SemanticFingerprint b,
        MatchOptions? options = null,
        CancellationToken ct = default);

    Task MatchWithDeltasAsync(
        SemanticFingerprint a,
        SemanticFingerprint b,
        MatchOptions? options = null,
        CancellationToken ct = default);
}

public sealed record SemanticMatchResult(
    decimal Similarity,         // 0.00-1.00
    decimal GraphSimilarity,
    decimal OperationSimilarity,
    decimal DataFlowSimilarity,
    decimal ApiCallSimilarity,
    MatchConfidence Confidence);
```

##### 2.2.5.3 Algorithm Details

**Weisfeiler-Lehman Graph Hashing:**

- 3 iterations of label propagation
- SHA-256 for final hash computation
- Deterministic node ordering via canonical sort

**Similarity Weights (Default):**

| Component | Weight |
|-----------|--------|
| Graph Hash | 0.35 |
| Operation Hash | 0.25 |
| Data Flow Hash | 0.25 |
| API Calls | 0.15 |

##### 2.2.5.4 Integration Points

The semantic library integrates with existing BinaryIndex components:

**DeltaSignatureGenerator Extension:**

```csharp
// Optional semantic services via constructor injection
services.AddDeltaSignaturesWithSemantic();

// Extended SymbolSignature with semantic properties
public sealed record SymbolSignature
{
    // ... existing properties ...
    public string? SemanticHashHex { get; init; }
    public ImmutableArray SemanticApiCalls { get; init; }
}
```

**PatchDiffEngine Extension:**

```csharp
// SemanticWeight in HashWeights
public decimal SemanticWeight { get; init; } = 0.2m;

// FunctionFingerprint extended with semantic fingerprint
public SemanticFingerprint? SemanticFingerprint { get; init; }
```

**Patch Diff Result Storage Contract:**

- `IDiffResultStore` persists `PatchDiffResult` records with deterministic, content-addressed IDs in the in-memory implementation.
- Re-storing an identical result yields the same ID (idempotent storage), while content changes produce a different ID.
- Rename detection and result-store persistence behaviors are covered by `StellaOps.BinaryIndex.Diff.Tests` unit suites.
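
The idempotency property in the storage contract above follows directly from deriving the ID from the content itself. The following is a minimal sketch of that idea, not the actual `IDiffResultStore` implementation: `DiffResultSketch`, its fields, `ContentAddressedIdSketch`, the newline-joined canonical form, and the `sha256:` prefix are all hypothetical names and choices made for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical, trimmed-down stand-in for PatchDiffResult (field names are assumptions).
public sealed record DiffResultSketch(
    string OldBinarySha256,
    string NewBinarySha256,
    string[] ChangedFunctions);

public static class ContentAddressedIdSketch
{
    // Derives the ID purely from content, so identical results map to the same ID.
    public static string Compute(DiffResultSketch result)
    {
        // Canonicalize: copy and sort function names ordinally so input order
        // does not influence the ID (an assumption about what "identical" means).
        var functions = (string[])result.ChangedFunctions.Clone();
        Array.Sort(functions, StringComparer.Ordinal);

        var canonical = string.Join(
            "\n",
            result.OldBinarySha256,
            result.NewBinarySha256,
            string.Join(",", functions));

        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return "sha256:" + Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

A store keyed by this ID is idempotent by construction: writing the same result twice targets the same key, while any change to the hashed fields yields a new key.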
##### 2.2.5.5 Test Coverage

| Category | Tests | Coverage |
|----------|-------|----------|
| Unit Tests (IR lifting, graph extraction, hashing) | 53 | Core algorithms |
| Integration Tests (full pipeline) | 9 | End-to-end flow |
| Golden Corpus (compiler variations) | 11 | Register allocation, optimization, compiler variants |
| Benchmarks (accuracy, performance) | 7 | Baseline metrics |

##### 2.2.5.6 Current Baselines

> **Note:** Baselines reflect foundational implementation; accuracy improves as semantic features mature.

| Metric | Baseline | Target |
|--------|----------|--------|
| Similarity (register allocation variants) | ≥0.55 | ≥0.85 |
| Overall accuracy | ≥40% | ≥70% |
| False positive rate | <10% | <5% |
| P95 fingerprint latency | <100ms | <50ms |

##### 2.2.5.7 B2R2 LowUIR Adapter

The `B2R2LowUirLiftingService` implements `IIrLiftingService` using B2R2's native lifting capabilities. This provides cross-platform IR representation for semantic analysis.

**Key Components:**

```csharp
public sealed class B2R2LowUirLiftingService : IIrLiftingService
{
    // Lifts to B2R2 LowUIR and maps to Stella IR model
    public Task LiftToIrAsync(
        IReadOnlyList instructions,
        string functionName,
        LiftOptions? options = null,
        CancellationToken ct = default);
}
```

**Supported ISAs:**

- Intel (x86-32, x86-64)
- ARM (ARMv7, ARMv8/ARM64)
- MIPS (32/64)
- RISC-V (64)
- PowerPC, SPARC, SH4, AVR, EVM

**IR Statement Mapping:**

| B2R2 LowUIR | Stella IR Kind |
|-------------|----------------|
| Put | IrStatementKind.Store |
| Store | IrStatementKind.Store |
| Get | IrStatementKind.Load |
| Load | IrStatementKind.Load |
| BinOp | IrStatementKind.BinaryOp |
| UnOp | IrStatementKind.UnaryOp |
| Jmp | IrStatementKind.Jump |
| CJmp | IrStatementKind.ConditionalJump |
| InterJmp | IrStatementKind.IndirectJump |
| Call | IrStatementKind.Call |
| SideEffect | IrStatementKind.SideEffect |

**Determinism Guarantees:**

- Statements ordered by block address (ascending)
- Blocks sorted by entry address (ascending)
- Consistent IR IDs across identical inputs
- InvariantCulture used for all string formatting

##### 2.2.5.8 B2R2 Lifter Pool

The `B2R2LifterPool` provides bounded pooling and warm preload for B2R2 lifting units to reduce per-call allocation overhead.

**Configuration (`B2R2LifterPoolOptions`):**

| Option | Default | Description |
|--------|---------|-------------|
| `MaxPoolSizePerIsa` | 4 | Maximum pooled lifters per ISA |
| `EnableWarmPreload` | true | Preload lifters at startup |
| `WarmPreloadIsas` | ["intel-64", "intel-32", "armv8-64", "armv7-32"] | ISAs to warm |
| `AcquireTimeout` | 5s | Timeout for acquiring a lifter |

**Pool Statistics:**

- `TotalPooledLifters`: Lifters currently in pool
- `TotalActiveLifters`: Lifters currently in use
- `IsWarm`: Whether pool has been warmed
- `IsaStats`: Per-ISA pool and active counts

**Usage:**

```csharp
using var lifter = _lifterPool.Acquire(isa);
var stmts = lifter.LiftingUnit.LiftInstruction(address);
// Lifter automatically returned to pool on dispose
```

##### 2.2.5.9 Function IR Cache

The `FunctionIrCacheService` provides Valkey-backed caching for computed semantic fingerprints to avoid redundant IR lifting and graph hashing.
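
To illustrate why a B2R2 or normalization-recipe upgrade naturally results in cache misses, here is a minimal sketch of composing a key from the four components this section lists (ISA, B2R2 version, normalization recipe, canonical IR hash). The `FunctionCacheKeySketch` class, the `:` separator, and the lower-casing of the ISA and hash are assumptions for illustration; the real `FunctionIrCacheService` key layout may differ.

```csharp
using System;

public static class FunctionCacheKeySketch
{
    // Builds "<prefix><isa>:<b2r2Version>:<recipeVersion>:<irHash>".
    // The layout is hypothetical; only the four components come from the document.
    public static string Build(
        string keyPrefix,
        string isa,
        string b2r2Version,
        string recipeVersion,
        string canonicalIrHashHex)
    {
        foreach (var part in new[] { isa, b2r2Version, recipeVersion, canonicalIrHashHex })
        {
            if (string.IsNullOrWhiteSpace(part))
            {
                throw new ArgumentException("Cache key components must be non-empty.");
            }
        }

        // Lower-case the ISA and hash so the same function always maps to one key.
        return $"{keyPrefix}{isa.ToLowerInvariant()}:{b2r2Version}:{recipeVersion}:{canonicalIrHashHex.ToLowerInvariant()}";
    }
}
```

Because both version strings are part of the key, entries written under an older B2R2 version or recipe are never read again and simply age out via TTL.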
**Cache Key Structure:**

```
(isa, b2r2_version, normalization_recipe, canonical_ir_hash)
```

**Configuration (`FunctionIrCacheOptions`):**

| Option | Default | Description |
|--------|---------|-------------|
| `KeyPrefix` | "stellaops:binidx:funccache:" | Valkey key prefix |
| `CacheTtl` | 4h | TTL for cached entries |
| `MaxTtl` | 24h | Maximum TTL |
| `Enabled` | true | Whether caching is enabled |
| `B2R2Version` | "0.9.1" | B2R2 version for cache key |
| `NormalizationRecipeVersion` | "v1" | Recipe version for cache key |

**Cache Entry (`CachedFunctionFingerprint`):**

- `FunctionAddress`, `FunctionName`
- `SemanticFingerprint`: The computed fingerprint
- `IrStatementCount`, `BasicBlockCount`
- `ComputedAtUtc`: ISO-8601 timestamp
- `B2R2Version`, `NormalizationRecipe`

**Invalidation Rules:**

- Cache entries expire after `CacheTtl` (default 4h)
- Changing B2R2 version or normalization recipe results in cache misses
- Manual invalidation via `RemoveAsync()`

**Statistics:**

- Hits, Misses, Evictions
- Hit Rate
- Enabled status

##### 2.2.5.10 Ops Endpoints

BinaryIndex exposes operational endpoints for health, benchmarking, cache monitoring, and configuration visibility.

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/v1/ops/binaryindex/health` | GET | Health status with lifter warmness, cache availability |
| `/api/v1/ops/binaryindex/bench/run` | POST | Run benchmark, return latency stats |
| `/api/v1/ops/binaryindex/cache` | GET | Function IR cache hit/miss statistics |
| `/api/v1/ops/binaryindex/config` | GET | Effective configuration (secrets redacted) |

For local/offline startup, `StellaOps.BinaryIndex.WebService` registers `InMemoryBinaryVulnerabilityService` as a deterministic fallback for `IBinaryVulnerabilityService` when persistence-backed providers are not wired.
It also composes `InMemoryDeltaSignatureRepository` as a deterministic fallback for `IDeltaSignatureRepository` so `PatchCoverageController` endpoints remain callable when PostgreSQL-backed signature repositories are unavailable.

**Health Response:**

```json
{
  "status": "healthy",
  "timestamp": "2026-01-14T12:00:00Z",
  "lifterStatus": "warm",
  "lifterWarm": true,
  "lifterPoolStats": {
    "intel-64": 4,
    "armv8-64": 2
  },
  "cacheStatus": "enabled",
  "cacheEnabled": true
}
```

**Determinism Constraints:**

- All timestamps in ISO-8601 UTC format
- ASCII-only output
- Deterministic JSON key ordering
- Secrets/credentials redacted from config endpoint

#### 2.2.6 Binary Vulnerability Service

Main query interface for consumers.

```csharp
public interface IBinaryVulnerabilityService
{
    ///
    /// Look up vulnerabilities by Build-ID or equivalent binary identity.
    ///
    Task> LookupByIdentityAsync(
        BinaryIdentity identity,
        LookupOptions? opts = null,
        CancellationToken ct = default);

    ///
    /// Look up vulnerabilities by function fingerprint.
    ///
    Task> LookupByFingerprintAsync(
        CodeFingerprint fingerprint,
        decimal minSimilarity = 0.95m,
        CancellationToken ct = default);

    ///
    /// Batch lookup for scan performance.
    ///
    Task>> LookupBatchAsync(
        IEnumerable identities,
        LookupOptions? opts = null,
        CancellationToken ct = default);

    ///
    /// Get distro-specific fix status (patch-aware).
    ///
    Task GetFixStatusAsync(
        string distro,
        string release,
        string sourcePkg,
        string cveId,
        CancellationToken ct = default);
}

public sealed record BinaryVulnMatch(
    string CveId,
    string VulnerablePurl,
    MatchMethod Method,         // buildid_catalog, fingerprint_match, range_match
    decimal Confidence,
    MatchEvidence Evidence);

public enum MatchMethod { BuildIdCatalog, FingerprintMatch, RangeMatch }
```

---

## 3. Data Model

### 3.1 PostgreSQL Schema (`binaries`)

The `binaries` schema stores binary identity, fingerprint, and match data.
```sql
CREATE SCHEMA IF NOT EXISTS binaries;
CREATE SCHEMA IF NOT EXISTS binaries_app;

-- RLS helper
CREATE OR REPLACE FUNCTION binaries_app.require_current_tenant()
RETURNS TEXT
LANGUAGE plpgsql STABLE SECURITY DEFINER
AS $$
DECLARE
    v_tenant TEXT;
BEGIN
    v_tenant := current_setting('app.tenant_id', true);
    IF v_tenant IS NULL OR v_tenant = '' THEN
        RAISE EXCEPTION 'app.tenant_id session variable not set';
    END IF;
    RETURN v_tenant;
END;
$$;
```

#### 3.1.1 Core Tables

See `docs/db/schemas/binaries_schema_specification.md` for complete DDL.

**Key Tables:**

| Table | Purpose |
|-------|---------|
| `binaries.binary_identity` | Known binary identities (Build-ID, hashes) |
| `binaries.binary_package_map` | Binary → package mapping per snapshot |
| `binaries.vulnerable_buildids` | Build-IDs known to be vulnerable |
| `binaries.vulnerable_fingerprints` | Function fingerprints for CVEs |
| `binaries.cve_fix_index` | Patch-aware fix status per distro |
| `binaries.fingerprint_matches` | Match results (findings evidence) |
| `binaries.corpus_snapshots` | Corpus ingestion tracking |

### 3.2 RustFS Layout

```
rustfs://stellaops/binaryindex/
  fingerprints///.bin
  corpus////manifest.json
  corpus////packages/.metadata.json
  evidence/.dsse.json
```

---

## 4. Integration Points

### 4.1 Scanner.Worker Integration

During container scanning, Scanner.Worker queries BinaryIndex for each extracted binary:

```mermaid
sequenceDiagram
    participant SW as Scanner.Worker
    participant BI as BinaryIndex
    participant PG as PostgreSQL
    participant FL as Findings Ledger

    SW->>SW: Extract binary from layer
    SW->>SW: Compute BinaryIdentity
    SW->>BI: LookupByIdentityAsync(identity)
    BI->>PG: Query binaries.vulnerable_buildids
    PG-->>BI: Matches
    BI->>PG: Query binaries.cve_fix_index (if distro known)
    PG-->>BI: Fix status
    BI-->>SW: BinaryVulnMatch[]
    SW->>FL: RecordFinding(match, evidence)
```

### 4.2 Concelier Integration

BinaryIndex subscribes to Concelier's advisory updates:

```mermaid
sequenceDiagram
    participant CO as Concelier
    participant BI as BinaryIndex
    participant PG as PostgreSQL

    CO->>CO: Ingest new advisory
    CO->>BI: advisory.created event
    BI->>BI: Check if affected packages in corpus
    BI->>PG: Update binaries.binary_vuln_assertion
    BI->>BI: Queue fingerprint generation (if high-impact)
```

### 4.3 Policy Integration

Binary matches are recorded as proof segments:

```json
{
  "segment_type": "binary_fingerprint_evidence",
  "payload": {
    "binary_identity": {
      "format": "elf",
      "build_id": "abc123...",
      "file_sha256": "def456..."
    },
    "matches": [
      {
        "cve_id": "CVE-2024-1234",
        "method": "buildid_catalog",
        "confidence": 0.98,
        "vulnerable_purl": "pkg:deb/debian/libssl3@1.1.1n-0+deb11u3"
      }
    ]
  }
}
```

---

## 5. MVP Roadmap

### MVP 1: Known-Build Binary Catalog (Sprint 6000.0001)

**Goal:** Query "is this Build-ID vulnerable?" with distro-level precision.

**Deliverables:**

- `binaries` PostgreSQL schema
- Build-ID to package mapping tables
- Basic CVE lookup by binary identity
- Debian/Ubuntu corpus connector

### MVP 2: Patch-Aware Backport Handling (Sprint 6000.0002)

**Goal:** Handle "version says vulnerable but distro backported the fix."

**Deliverables:**

- Fix index builder (changelog + patch header parsing)
- Distro-specific version comparison
- RPM corpus connector
- Scanner.Worker integration

### MVP 3: Binary Fingerprint Factory (Sprint 6000.0003)

**Goal:** Detect vulnerable code independent of package metadata.

**Deliverables:**

- Fingerprint storage and matching
- Reference build generation pipeline
- Fingerprint validation corpus
- High-impact CVE coverage (OpenSSL, glibc, zlib, curl)

### MVP 4: Full Scanner Integration (Sprint 6000.0004)

**Goal:** Binary evidence in production scans.

**Deliverables:**

- Scanner.Worker binary lookup integration
- Findings Ledger binary match records
- Proof segment attestations
- CLI binary match inspection

---

## 5b. Fix Evidence Chain

The **Fix Evidence Chain** provides auditable proof of why a CVE is marked as fixed (or not) for a specific distro/package combination. This is critical for patch-aware backport handling where package versions can be misleading.

### 5b.1 Evidence Sources

| Source | Confidence | Description |
|--------|------------|-------------|
| **Security Feed (OVAL)** | 0.95-0.99 | Authoritative feed from distro (Debian Security Tracker, Red Hat OVAL) |
| **Patch Header (DEP-3)** | 0.87-0.95 | CVE reference in Debian/Ubuntu patch metadata |
| **Changelog** | 0.75-0.85 | CVE mention in debian/changelog or RPM %changelog |
| **Upstream Patch Match** | 0.90 | Binary diff matches known upstream fix |

### 5b.2 Evidence Storage

Evidence is stored in two PostgreSQL tables:

```sql
-- Evidence blobs: audit trail
-- (created first because cve_fix_index.evidence_id references it)
CREATE TABLE binaries.fix_evidence (
    id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    evidence_type TEXT NOT NULL,    -- changelog, patch_header, security_feed
    source_file TEXT,               -- Path to source file (changelog, patch)
    source_sha256 TEXT,             -- Hash of source file
    excerpt TEXT,                   -- Relevant snippet (max 1KB)
    metadata JSONB NOT NULL,        -- Structured metadata
    snapshot_id UUID,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Fix index: one row per (distro, release, source_pkg, cve_id)
CREATE TABLE binaries.cve_fix_index (
    id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    distro TEXT NOT NULL,           -- debian, ubuntu, alpine, rhel
    release TEXT NOT NULL,          -- bookworm, jammy, v3.19
    source_pkg TEXT NOT NULL,
    cve_id TEXT NOT NULL,
    state TEXT NOT NULL,            -- fixed, vulnerable, not_affected, wontfix, unknown
    fixed_version TEXT,
    method TEXT NOT NULL,           -- security_feed, changelog, patch_header, upstream_match
    confidence DECIMAL(3,2) NOT NULL,
    evidence_id UUID REFERENCES binaries.fix_evidence(id),
    snapshot_id UUID,
    indexed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (tenant_id, distro, release, source_pkg, cve_id)
);
```

### 5b.3 Evidence Types

**ChangelogEvidence:**

```json
{
  "evidence_type": "changelog",
  "source_file": "debian/changelog",
  "excerpt": "* Fix CVE-2024-0727: PKCS12 decoding crash",
  "metadata": {
    "version": "3.0.11-1~deb12u2",
    "line_number": 5
  }
}
```

**PatchHeaderEvidence:**

```json
{
  "evidence_type": "patch_header",
  "source_file": "debian/patches/CVE-2024-0727.patch",
  "excerpt": "CVE: CVE-2024-0727\nOrigin: upstream, https://github.com/openssl/commit/abc123",
  "metadata": {
    "patch_sha256": "abc123def456..."
  }
}
```

**SecurityFeedEvidence:**

```json
{
  "evidence_type": "security_feed",
  "metadata": {
    "feed_id": "debian-security-tracker",
    "entry_id": "DSA-5678-1",
    "published_at": "2024-01-15T10:00:00Z"
  }
}
```

### 5b.4 Confidence Resolution

When multiple evidence sources exist for the same CVE, the system keeps the **highest confidence** entry:

```sql
-- Upsert clause on binaries.cve_fix_index.
-- cve_fix_index.* is the stored row; EXCLUDED.* is the incoming row.
ON CONFLICT (tenant_id, distro, release, source_pkg, cve_id)
DO UPDATE SET
    confidence = GREATEST(cve_fix_index.confidence, EXCLUDED.confidence),
    method = CASE
        WHEN cve_fix_index.confidence < EXCLUDED.confidence THEN EXCLUDED.method
        ELSE cve_fix_index.method
    END,
    evidence_id = CASE
        WHEN cve_fix_index.confidence < EXCLUDED.confidence THEN EXCLUDED.evidence_id
        ELSE cve_fix_index.evidence_id
    END
```

### 5b.5 Parsers

The following parsers extract CVE fix information:

| Parser | Distros | Input | Confidence |
|--------|---------|-------|------------|
| `DebianChangelogParser` | Debian, Ubuntu | debian/changelog | 0.80 |
| `PatchHeaderParser` | Debian, Ubuntu | debian/patches/*.patch (DEP-3) | 0.87 |
| `AlpineSecfixesParser` | Alpine | APKBUILD secfixes block | 0.95 |
| `RpmChangelogParser` | RHEL, Fedora, CentOS | RPM spec %changelog | 0.75 |

### 5b.6 Query Flow

```mermaid
sequenceDiagram
    participant SW as Scanner.Worker
    participant BVS as BinaryVulnerabilityService
    participant FIR as FixIndexRepository
    participant PG as PostgreSQL

    SW->>BVS: GetFixStatusAsync(debian, bookworm, openssl, CVE-2024-0727)
    BVS->>FIR: GetFixStatusAsync(...)
    FIR->>PG: SELECT FROM cve_fix_index WHERE ...
    PG-->>FIR: FixIndexEntry (state=fixed, confidence=0.87)
    FIR-->>BVS: FixStatusResult
    BVS-->>SW: {state: Fixed, confidence: 0.87, method: PatchHeader}
```

---

## 6. Security Considerations

### 6.1 Trust Boundaries

1. **Corpus Ingestion** - Packages are untrusted; extraction runs in sandboxed workers
2. **Fingerprint Generation** - Reference builds compiled in isolated environments
3. **Query API** - Tenant-isolated via RLS; no cross-tenant data leakage

### 6.2 Signing & Provenance

- All corpus snapshots are signed (DSSE)
- Fingerprint sets are versioned and signed
- Every match result references evidence digests

### 6.3 Sandbox Requirements

Binary extraction and fingerprint generation MUST run with:

- Seccomp profile restricting syscalls
- Read-only root filesystem
- No network access during analysis
- Memory/CPU limits

---

## 7. Observability

### 7.1 Metrics

| Metric | Type | Labels |
|--------|------|--------|
| `binaryindex_lookup_total` | Counter | method, result |
| `binaryindex_lookup_latency_ms` | Histogram | method |
| `binaryindex_corpus_packages_total` | Gauge | distro, release |
| `binaryindex_fingerprints_indexed` | Gauge | algorithm, component |
| `binaryindex_match_confidence` | Histogram | method |

### 7.2 Traces

- `binaryindex.lookup` - Full lookup span
- `binaryindex.corpus.ingest` - Corpus ingestion
- `binaryindex.fingerprint.generate` - Fingerprint generation

### 7.3 Ops Endpoints

> **Sprint:** SPRINT_20260112_007_BINIDX_binaryindex_user_config

BinaryIndex exposes read-only ops endpoints for health, bench, cache, and effective configuration:

| Endpoint | Method | Response Schema | Description |
|----------|--------|-----------------|-------------|
| `/api/v1/ops/binaryindex/health` | GET | `BinaryIndexOpsHealthResponse` | Health status, lifter warmness per ISA, cache availability |
| `/api/v1/ops/binaryindex/bench/run` | POST | `BinaryIndexBenchResponse` | Run latency benchmark, return min/max/mean/p50/p95/p99 stats |
| `/api/v1/ops/binaryindex/cache` | GET | `BinaryIndexFunctionCacheStats` | Function cache hit/miss/eviction statistics |
| `/api/v1/ops/binaryindex/config` | GET | `BinaryIndexEffectiveConfig` | Effective configuration with secrets redacted |

Startup safety: the WebService composes a deterministic fallback `IBinaryVulnerabilityService` implementation to keep API host boot valid in local/offline environments
where persistence dependencies are intentionally absent. The same startup safety contract applies to patch-coverage API routes through a deterministic `IDeltaSignatureRepository` fallback implementation.

#### 7.3.1 Response Schemas

**BinaryIndexOpsHealthResponse:**

```json
{
  "status": "healthy",
  "timestamp": "2026-01-16T12:00:00Z",
  "components": {
    "lifterPool": { "status": "healthy", "message": null },
    "functionCache": { "status": "healthy", "message": null },
    "persistence": { "status": "healthy", "message": null }
  },
  "lifterWarmness": {
    "intel-64": { "isa": "intel-64", "warm": true, "poolSize": 4, "acquireTimeMs": 12 },
    "armv8-64": { "isa": "armv8-64", "warm": true, "poolSize": 2, "acquireTimeMs": 8 }
  }
}
```

**BinaryIndexBenchResponse:**

```json
{
  "timestamp": "2026-01-16T12:00:00Z",
  "sampleSize": 100,
  "latencySummary": {
    "minMs": 5.2,
    "maxMs": 142.8,
    "meanMs": 28.4,
    "p50Ms": 22.1,
    "p95Ms": 78.3,
    "p99Ms": 121.5
  },
  "operations": [
    { "operation": "lifterAcquire", "samples": 100, "meanMs": 12.4 },
    { "operation": "irNormalization", "samples": 100, "meanMs": 8.7 },
    { "operation": "cacheLookup", "samples": 100, "meanMs": 1.2 }
  ]
}
```

**BinaryIndexFunctionCacheStats:**

```json
{
  "enabled": true,
  "backend": "valkey",
  "hits": 15234,
  "misses": 892,
  "evictions": 45,
  "hitRate": 0.944,
  "keyPrefix": "stellaops:binidx:funccache:",
  "cacheTtlSeconds": 14400,
  "estimatedEntries": 12500,
  "estimatedMemoryBytes": 52428800
}
```

**BinaryIndexEffectiveConfig:**

```json
{
  "b2r2Pool": {
    "maxPoolSizePerIsa": 4,
    "warmPreload": ["intel-64", "armv8-64"],
    "acquireTimeoutMs": 5000,
    "enableMetrics": true
  },
  "semanticLifting": {
    "b2r2Version": "1.5.0",
    "normalizationRecipeVersion": "2024.1",
    "maxInstructionsPerFunction": 10000,
    "maxFunctionsPerBinary": 5000,
    "functionLiftTimeoutMs": 30000,
    "enableDeduplication": true
  },
  "functionCache": {
    "connectionString": "********",
    "keyPrefix": "stellaops:binidx:funccache:",
    "cacheTtlSeconds": 14400,
    "maxTtlSeconds": 86400,
    "earlyExpiryPercent": 0.1,
    "maxEntrySizeBytes": 1048576
  },
  "persistence": {
    "schema": "binaries",
    "minPoolSize": 5,
    "maxPoolSize": 20,
    "commandTimeoutSeconds": 30,
    "retryOnFailure": true,
    "batchSize": 100
  },
  "backendVersions": {
    "b2r2": "1.5.0",
    "valkey": "7.2.0",
    "postgres": "15.4"
  }
}
```

#### 7.3.2 Rate Limiting

The `/bench/run` endpoint is rate-limited to prevent load spikes:

- Default: 5 requests per minute per tenant
- Configurable via `BinaryIndex:Ops:BenchRateLimitPerMinute`

#### 7.3.3 Secret Redaction

The config endpoint automatically redacts sensitive keys:

| Key pattern | Handling |
|-------------|----------|
| `connectionString` | Replaced with `********` |
| `password` | Replaced with `********` |
| `secret*` | Any key starting with "secret" is replaced with `********` |
| `apiKey` | Replaced with `********` |
| `token` | Replaced with `********` |

Redaction is applied recursively to nested objects.

---

## 8. Configuration

> **Sprint:** SPRINT_20260112_007_BINIDX_binaryindex_user_config

### 8.1 Configuration Sections

All configuration is under the `BinaryIndex` section in `appsettings.yaml` or environment variables with `BINARYINDEX__` prefix.
#### 8.1.1 B2R2 Lifter Pool (`BinaryIndex:B2R2Pool`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `MaxPoolSizePerIsa` | int | 4 | Maximum lifter instances per ISA |
| `WarmPreload` | string[] | ["intel-64", "armv8-64"] | ISAs to warm on startup |
| `AcquireTimeoutMs` | int | 5000 | Timeout for lifter acquisition |
| `EnableMetrics` | bool | true | Emit Prometheus metrics for pool |

```yaml
BinaryIndex:
  B2R2Pool:
    MaxPoolSizePerIsa: 4
    WarmPreload:
      - intel-64
      - armv8-64
    AcquireTimeoutMs: 5000
    EnableMetrics: true
```

#### 8.1.2 Semantic Lifting (`BinaryIndex:SemanticLifting`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `B2R2Version` | string | "1.5.0" | B2R2 disassembler version |
| `NormalizationRecipeVersion` | string | "2024.1" | IR normalization recipe version |
| `MaxInstructionsPerFunction` | int | 10000 | Max instructions to lift per function |
| `MaxFunctionsPerBinary` | int | 5000 | Max functions to process per binary |
| `FunctionLiftTimeoutMs` | int | 30000 | Timeout for lifting single function |
| `EnableDeduplication` | bool | true | Deduplicate IR before fingerprinting |

```yaml
BinaryIndex:
  SemanticLifting:
    MaxInstructionsPerFunction: 10000
    MaxFunctionsPerBinary: 5000
    FunctionLiftTimeoutMs: 30000
    EnableDeduplication: true
```

#### 8.1.3 Function Cache (`BinaryIndex:FunctionCache`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `ConnectionString` | string | (none) | Valkey connection string (secret) |
| `KeyPrefix` | string | "stellaops:binidx:funccache:" | Cache key prefix |
| `CacheTtlSeconds` | int | 14400 | Default cache TTL (4 hours) |
| `MaxTtlSeconds` | int | 86400 | Maximum TTL (24 hours) |
| `EarlyExpiryPercent` | decimal | 0.1 | Early expiry jitter (10%) |
| `MaxEntrySizeBytes` | int | 1048576 | Max entry size (1 MB) |

```yaml
BinaryIndex:
  FunctionCache:
    ConnectionString: ${VALKEY_CONNECTION}  # from env
    KeyPrefix: "stellaops:binidx:funccache:"
    CacheTtlSeconds: 14400
    MaxEntrySizeBytes: 1048576
```

#### 8.1.4 Persistence (`Postgres:BinaryIndex`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `Schema` | string | "binaries" | PostgreSQL schema name |
| `MinPoolSize` | int | 5 | Minimum connection pool size |
| `MaxPoolSize` | int | 20 | Maximum connection pool size |
| `CommandTimeoutSeconds` | int | 30 | Command execution timeout |
| `RetryOnFailure` | bool | true | Retry transient failures |
| `BatchSize` | int | 100 | Batch insert size |

```yaml
Postgres:
  BinaryIndex:
    Schema: binaries
    MinPoolSize: 5
    MaxPoolSize: 20
    CommandTimeoutSeconds: 30
    RetryOnFailure: true
    BatchSize: 100
```

#### 8.1.5 Ops Configuration (`BinaryIndex:Ops`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `EnableHealthEndpoint` | bool | true | Enable /health endpoint |
| `EnableBenchEndpoint` | bool | true | Enable /bench/run endpoint |
| `BenchRateLimitPerMinute` | int | 5 | Rate limit for bench endpoint |
| `RedactedKeys` | string[] | See 7.3.3 | Keys to redact in config output |

### 8.2 Legacy Configuration

```yaml
# binaryindex.yaml (corpus configuration)
binaryindex:
  enabled: true

  corpus:
    connectors:
      - type: debian
        enabled: true
        mirror: http://deb.debian.org/debian
        releases: [bookworm, bullseye]
        architectures: [amd64, arm64]
      - type: ubuntu
        enabled: true
        mirror: http://archive.ubuntu.com/ubuntu
        releases: [jammy, noble]

  fingerprinting:
    enabled: true
    algorithms: [basic_block, cfg]
    target_components:
      - openssl
      - glibc
      - zlib
      - curl
      - sqlite
    min_function_size: 16  # bytes
    max_functions_per_binary: 10000

  lookup:
    cache_ttl: 3600
    batch_size: 100
    timeout_ms: 5000

  storage:
    postgres_schema: binaries
    rustfs_bucket: stellaops/binaryindex
```

---

## 9. Testing Strategy

### 9.1 Unit Tests

- Identity extraction (Build-ID, hashes)
- Fingerprint generation determinism
- Fix index parsing (changelog, patch headers)

### 9.2 Integration Tests

- PostgreSQL schema validation
- Full corpus ingestion flow
- Scanner.Worker lookup integration

### 9.3 Regression Tests

- Known CVE detection (golden corpus)
- Backport handling (Debian libssl example)
- False positive rate validation

---

## 10. Golden Corpus for Patch Provenance

> **Sprint:** SPRINT_20260121_034/035/036 - Golden Corpus Implementation

The BinaryIndex module supports a **golden corpus** of patch-paired artifacts that enables offline SBOM reproducibility and binary-level patch provenance verification.

### 10.1 Corpus Purpose

The golden corpus provides:

- **Auditor-ready evidence bundles** for air-gapped customers
- **Regression testing** for binary matching accuracy
- **Proof of patch status** independent of package metadata

### 10.2 Corpus Sources

| Source | Type | Purpose |
|--------|------|---------|
| Debian Security Tracker / DSAs | Advisory | Primary advisory linkage |
| Debian Snapshot | Binary archive | Pre/post patch binary pairs |
| Ubuntu Security Notices | Advisory | Ubuntu-specific advisories |
| Alpine secdb | Advisory | Alpine YAML advisories |
| OSV dump | Unified schema | Cross-reference and commit ranges |

### 10.2.1 Symbol Source Connectors

> **Sprint:** SPRINT_20260121_035_BinaryIndex_golden_corpus_connectors_cli

The corpus ingestion layer uses pluggable connectors to retrieve symbols and metadata from upstream sources:

| Connector ID | Implementation | Protocol | Data Retrieved |
|--------------|----------------|----------|----------------|
| `debuginfod-fedora` | `DebuginfodConnector` | debuginfod HTTP | ELF debug symbols by Build-ID |
| `debuginfod-ubuntu` | `DebuginfodConnector` | debuginfod HTTP | ELF debug symbols by Build-ID |
| `ddeb-ubuntu` | `DdebConnector` | APT/HTTP | `.ddeb` debug packages |
| `buildinfo-debian` | `BuildinfoConnector` | HTTP | `.buildinfo` reproducibility records |
| `secdb-alpine` | `AlpineSecDbConnector` | Git/HTTP | `secfixes` YAML from APKBUILD |

**Connector Interface:**

```csharp
public interface ISymbolSourceConnector
{
    string ConnectorId { get; }
    string DisplayName { get; }
    string[] SupportedDistros { get; }

    Task GetStatusAsync(CancellationToken ct);
    Task SyncAsync(SyncOptions options, CancellationToken ct);
    Task LookupByBuildIdAsync(string buildId, CancellationToken ct);
    Task> SearchAsync(SymbolSearchQuery query, CancellationToken ct);
}
```

**Debuginfod Connector:**

The `DebuginfodConnector` implements the [debuginfod protocol](https://sourceware.org/elfutils/Debuginfod.html) for retrieving debug symbols:

- Endpoint: `GET /buildid//debuginfo`
- Supports federated queries across multiple debuginfod servers
- Caches retrieved symbols in RustFS blob storage
- Rate-limited to respect upstream server policies

**Ubuntu ddeb Connector:**

The `DdebConnector` retrieves Ubuntu debug symbol packages (`.ddeb`):

- Sources: `ddebs.ubuntu.com` mirror
- Indexes: Reads `Packages.xz` for package metadata
- Extraction: Unpacks `.ddeb` AR archives to extract DWARF symbols
- Mapping: Links debug symbols to binary packages via Build-ID

**Debian Buildinfo Connector:**

The `BuildinfoConnector` retrieves Debian buildinfo files for reproducibility verification:

- Source: `buildinfos.debian.net` and snapshot archives
- Purpose: Provides build environment metadata for reproducible builds
- Fields extracted: `Build-Date`, `Build-Architecture`, `Checksums-Sha256`
- Integration: Cross-references with binary packages for provenance

**Alpine SecDB Connector:**

The `AlpineSecDbConnector` parses Alpine's security database:

- Source: `secfixes` blocks in APKBUILD files
- Repository: `alpine/aports` Git repository
- Format: YAML blocks mapping CVEs to fixed versions
- Example:
  ```yaml
  secfixes:
    3.0.11-r0:
      - CVE-2024-0727
      - CVE-2024-0728
  ```

**OSV Dump Parser:**

The `OsvDumpParser`
processes Google OSV database dumps for advisory cross-correlation:

- Source: `osv.dev` bulk exports (JSON)
- Purpose: CVE → commit range extraction for patch identification
- Cross-reference: Correlates OSV entries with distribution advisories
- Inconsistency detection: Identifies discrepancies between OSV and distro advisories

```csharp
public interface IOsvDumpParser
{
    IAsyncEnumerable ParseDumpAsync(Stream osvDumpStream, CancellationToken ct);
    OsvCveIndex BuildCveIndex(IEnumerable entries);
    IEnumerable CrossReferenceWithExternal(
        OsvCveIndex osvIndex,
        IEnumerable externalAdvisories);
    IEnumerable DetectInconsistencies(
        IEnumerable correlations);
}
```

**CLI Access:**

All connectors are manageable via the `stella groundtruth sources` CLI commands:

```bash
# List all connectors
stella groundtruth sources list

# Sync specific connector
stella groundtruth sources sync --source buildinfo-debian --full

# Enable/disable connectors
stella groundtruth sources enable ddeb-ubuntu
stella groundtruth sources disable debuginfod-fedora
```

See [Ground-Truth CLI Guide](../cli/guides/ground-truth-cli.md) for complete CLI documentation.

### 10.3 Key Performance Indicators

| KPI | Target | Description |
|-----|--------|-------------|
| Per-function match rate | >= 90% | Functions matched in post-patch binary |
| False-negative patch detection | <= 5% | Patched functions incorrectly classified |
| SBOM canonical-hash stability | 3/3 | Determinism across independent runs |
| Binary reconstruction equivalence | Trend | Rebuilt binary matches original |
| End-to-end verify time (p95, cold) | Trend | Offline verification performance |

### 10.4 Validation Harness

The validation harness (`IValidationHarness`) orchestrates end-to-end verification:

```
Binary Pair (pre/post) → Symbol Recovery → IR Lifting → Fingerprinting → Matching → Metrics
```

### 10.5 Evidence Bundle Format

Evidence bundles follow OCI/ORAS conventions:

```
--bundle.oci.tar
├── manifest.json        # OCI manifest
└── blobs/
    ├── sha256:          # Canonical SBOM
    ├── sha256:          # Pre-fix binary
    ├── sha256:          # Post-fix binary
    ├── sha256:          # DSSE delta-sig predicate
    └── sha256:          # RFC 3161 timestamp
```

### 10.6 Two-Tier Bundle Design and Large Blob References

> **Sprint:** SPRINT_20260122_040_Platform_oci_delta_attestation_pipeline (040-04)

Evidence bundles support two export modes to balance transfer speed with auditability:

| Mode | Export Flag | Contents | Use Case |
|------|------------|----------|----------|
| **Light** | (default) | Manifest + attestation envelopes + metadata | Quick transfer, metadata-only audit |
| **Full** | `--full` | Light + embedded binary blobs in `blobs/` | Air-gap replay, full provenance verification |

#### 10.6.1 `largeBlobs[]` Field

The `DeltaSigPredicate` includes a `largeBlobs` array referencing binary artifacts that may be too large to embed in attestation payloads:

```json
{
  "schemaVersion": "1.0.0",
  "subject": [...],
  "delta": [...],
  "largeBlobs": [
    {
      "kind": "binary-patch",
      "digest": "sha256:a1b2c3...",
      "mediaType": "application/octet-stream",
      "sizeBytes": 1048576
    },
    {
      "kind": "sbom-fragment",
      "digest": "sha256:d4e5f6...",
      "mediaType": "application/spdx+json",
      "sizeBytes": 32768
    }
  ],
  "sbomDigest": "sha256:789abc..."
}
```

**Field Definitions:**

| Field | Type | Description |
|-------|------|-------------|
| `largeBlobs[].kind` | string | Blob category: `binary-patch`, `sbom-fragment`, `debug-symbols`, etc. |
| `largeBlobs[].digest` | string | Content-addressable digest (`sha256:`, `sha384:`, `sha512:`) |
| `largeBlobs[].mediaType` | string | IANA media type of the blob |
| `largeBlobs[].sizeBytes` | long | Blob size in bytes |
| `sbomDigest` | string | Digest of the canonical SBOM associated with this delta |

#### 10.6.2 Blob Fetch Strategy

During `stella bundle verify --replay`, blobs are resolved in priority order:

1. **Embedded** (full bundles): Read from `blobs/` in bundle directory
2. **Local source** (`--blob-source /path/`): Read from specified local directory
3. **Registry** (`--blob-source https://...`): HTTP GET from OCI registry (blocked in `--offline` mode)

#### 10.6.3 Digest Verification

Fetched blobs are verified against their declared digest using the algorithm prefix:

```
sha256: → SHA-256
sha384: → SHA-384
sha512: → SHA-512
```

A mismatch fails the blob replay verification step.

### 10.7 Related Documentation

- [Golden Corpus KPIs](../../benchmarks/golden-corpus-kpis.md)
- [Golden Corpus Seed List](../../benchmarks/golden-corpus-seed-list.md)
- [Ground-Truth Corpus Specification](../../benchmarks/ground-truth-corpus.md)

---

## 11. References

- Advisory: `docs/product/advisories/21-Dec-2025 - Mapping Evidence Within Compiled Binaries.md`
- Scanner Native Analysis: `src/Scanner/StellaOps.Scanner.Analyzers.Native/`
- Existing Fingerprinting: `src/Scanner/__Libraries/StellaOps.Scanner.EntryTrace/Binary/`
- Build-ID Index: `src/Scanner/StellaOps.Scanner.Analyzers.Native/Index/`
- **Semantic Diffing Sprint:** `docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
- **Semantic Library:** `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/`
- **Semantic Tests:** `src/BinaryIndex/__Tests/StellaOps.BinaryIndex.Semantic.Tests/`
- **Golden Corpus Sprints:** `docs/implplan/SPRINT_20260121_034_BinaryIndex_golden_corpus_foundation.md`

---

## 12. Binary Micro-Witnesses

Binary micro-witnesses provide cryptographic proof of patch status at the binary level. They formalize the output of BinaryIndex's semantic diffing capabilities into an auditor-friendly, portable format.
### 12.1 Overview

A micro-witness is a DSSE (Dead Simple Signing Envelope) predicate that captures:

- Subject binary digest (SHA-256)
- CVE/patch reference
- Function-level evidence with confidence scores
- Delta-Sig fingerprint hash
- Tool versions and analysis metadata
- Optional SBOM component mapping

### 12.2 Predicate Schema

**Predicate Type:** `https://stellaops.dev/predicates/binary-micro-witness@v1`

```json
{
  "schemaVersion": "1.0.0",
  "binary": {
    "digest": "sha256:...",
    "purl": "pkg:deb/debian/openssl@3.0.11",
    "arch": "linux-amd64",
    "filename": "libssl.so.3"
  },
  "cve": {
    "id": "CVE-2024-0567",
    "advisory": "https://...",
    "patchCommit": "abc123"
  },
  "verdict": "patched",
  "confidence": 0.95,
  "evidence": [
    {
      "function": "SSL_CTX_new",
      "state": "patched",
      "score": 0.97,
      "method": "semantic_ksg",
      "hash": "sha256:..."
    }
  ],
  "deltaSigDigest": "sha256:...",
  "sbomRef": {
    "sbomDigest": "sha256:...",
    "purl": "pkg:...",
    "bomRef": "component-ref"
  },
  "tooling": {
    "binaryIndexVersion": "2.1.0",
    "lifter": "b2r2",
    "matchAlgorithm": "semantic_ksg"
  },
  "computedAt": "2026-01-28T12:00:00Z"
}
```

### 12.3 Verdicts

| Verdict | Meaning |
|---------|---------|
| `patched` | Binary matches patched version signature |
| `vulnerable` | Binary matches vulnerable version signature |
| `inconclusive` | Unable to determine (insufficient evidence) |
| `partial` | Some functions patched, others not |

### 12.4 CLI Commands

```bash
# Generate a micro-witness
stella witness generate /path/to/binary --cve CVE-2024-0567 --sbom sbom.json --output witness.json

# Verify a micro-witness
stella witness verify witness.json --offline

# Create portable bundle for air-gapped verification
stella witness bundle witness.json --output ./audit-bundle
```

### 12.5 Integration with Rekor

When `--rekor` is specified during generation, witnesses are logged to the Rekor transparency log using v2 tile-based inclusion proofs. This provides tamper-evidence and enables auditors to verify witnesses weren't backdated.
Offline verification bundles include tile proofs for air-gapped environments.

### 12.6 Related Documentation

- **Auditor Guide:** `docs/guides/binary-micro-witness-verification.md`
- **Predicate Schema:** `src/Attestor/StellaOps.Attestor.Types/schemas/stellaops-binary-micro-witness.v1.schema.json`
- **CLI Commands:** `src/Cli/StellaOps.Cli/Commands/Witness/`
- **Demo Bundle:** `demos/binary-micro-witness/`
- **Sprint:** `docs-archived/implplan/SPRINT_0128_001_BinaryIndex_binary_micro_witness.md`

---

## 13. Cross-Distro Coverage Matrix for Backport Validation

The cross-distro coverage matrix manages a curated set of high-impact CVEs with per-distribution backport status tracking. It enables systematic validation of backport detection accuracy across Alpine, Debian, and RHEL.

### 13.1 Architecture

1. **CuratedCveEntry** - One row per CVE (e.g., Heartbleed, Baron Samedit) with a cross-distro `DistroCoverageEntry` array tracking backport status per distro-version
2. **CrossDistroCoverageService** - In-memory coverage matrix with upsert, query, summary, and validation marking operations
3. **SeedBuiltInEntries** - Idempotent seeding of 5 curated high-impact CVEs (CVE-2014-0160, CVE-2021-3156, CVE-2015-0235, CVE-2023-38545, CVE-2024-6387) with pre-populated backport status across Alpine, Debian, and RHEL versions

### 13.2 Distro Families & Backport Status

| Enum | Values |
|---|---|
| `DistroFamily` | Alpine, Debian, Rhel |
| `BackportStatus` | NotPatched, Backported, NotApplicable, Unknown |

### 13.3 Models

| Type | Description |
|---|---|
| `DistroCoverageEntry` | Per distro-version: package name/version, backport status, validated flag |
| `CuratedCveEntry` | CVE with CommonName, CvssScore, CweIds, Coverage array, computed CoverageRatio |
| `CrossDistroCoverageSummary` | Aggregated counts: TotalCves, TotalEntries, ValidatedEntries, ByDistro breakdown |
| `DistroBreakdown` | Per-distro EntryCount, ValidatedCount, BackportedCount |
| `CuratedCveQuery` | Component/Distro/Status/OnlyUnvalidated filters with Limit/Offset paging |

### 13.4 Built-in Curated CVEs

| CVE | Component | Common Name | CVSS |
|---|---|---|---|
| CVE-2014-0160 | openssl | Heartbleed | 7.5 |
| CVE-2021-3156 | sudo | Baron Samedit | 7.8 |
| CVE-2015-0235 | glibc | GHOST | 10.0 |
| CVE-2023-38545 | curl | SOCKS5 heap overflow | 9.8 |
| CVE-2024-6387 | openssh | regreSSHion | 8.1 |

### 13.5 DI Registration

`ICrossDistroCoverageService` → `CrossDistroCoverageService` is registered via `TryAddSingleton` in `GoldenSetServiceCollectionExtensions.AddGoldenSetServices()`.
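To make the model shapes from 13.2 and 13.3 more concrete, here is a minimal sketch of the two core records. The property names and the `CoverageRatio` definition (validated entries divided by total entries, 0 for an empty array) are assumptions inferred from the table descriptions; they are not taken from `CrossDistroCoverageModels.cs`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum DistroFamily { Alpine, Debian, Rhel }
public enum BackportStatus { NotPatched, Backported, NotApplicable, Unknown }

// Hypothetical shape: one row per distro-version. Property names are guesses
// based on the 13.3 description (package name/version, status, validated flag).
public sealed record DistroCoverageEntry(
    DistroFamily Distro,
    string DistroVersion,
    string PackageName,
    string PackageVersion,
    BackportStatus Status,
    bool Validated = false);

// Hypothetical shape: one row per CVE with its cross-distro coverage array.
public sealed record CuratedCveEntry(
    string CveId,
    string Component,
    string CommonName,
    IReadOnlyList<DistroCoverageEntry> Coverage)
{
    // ASSUMPTION: ratio of validated entries to all entries; 0 when empty.
    public double CoverageRatio =>
        Coverage.Count == 0
            ? 0.0
            : (double)Coverage.Count(e => e.Validated) / Coverage.Count;
}
```

With this reading, an entry whose coverage array holds four distro-versions, two of them validated, would report a ratio of 0.5.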
### 13.6 OTel Metrics

Meter: `StellaOps.BinaryIndex.GoldenSet.CrossDistro`

| Counter | Description |
|---|---|
| `crossdistro.upsert.total` | CVE entries upserted |
| `crossdistro.query.total` | Coverage queries executed |
| `crossdistro.seed.total` | Built-in entries seeded |
| `crossdistro.validated.total` | Entries marked as validated |

### 13.7 Source Files

- Models: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.GoldenSet/Models/CrossDistroCoverageModels.cs`
- Interface: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.GoldenSet/Services/ICrossDistroCoverageService.cs`
- Implementation: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.GoldenSet/Services/CrossDistroCoverageService.cs`

### 13.8 Test Coverage (37 tests)

- Models: DistroFamily/BackportStatus enum counts, DistroCoverageEntry roundtrips/defaults, CuratedCveEntry coverage ratio/empty, CuratedCveQuery defaults, Summary coverage/empty
- Service: SeedBuiltInEntries population/idempotency/heartbleed/baron-samedit/distro coverage, UpsertAsync store-retrieve/overwrite/null/empty, GetByCveIdAsync unknown/case-insensitive/null, QueryAsync all/component/distro/status/unvalidated/limit-offset/ordering, GetSummaryAsync counts/empty, SetValidatedAsync mark/unknown-cve/unknown-version/summary/null, CreateBuiltInEntries deterministic/distro-coverage

---

## 14. ELF Segment Normalization for Delta Hashing

### 14.1 Purpose

The existing instruction-level normalization (X64/Arm64 pipelines) operates on disassembled instruction streams. ELF Segment Normalization fills the gap for **raw binary bytes**: it zeros position-dependent data (relocation entries, GOT/PLT displacements, alignment padding) and canonicalizes NOP sleds *before* disassembly. This enables deterministic delta hashing across builds compiled at different base addresses or link orders.
### 14.2 Key Types

| Type | Location | Purpose |
| --- | --- | --- |
| `ElfNormalizationStep` | `Normalization/ElfSegmentNormalizer.cs` | Enum of normalization passes (RelocationZeroing, GotPltCanonicalization, NopCanonicalization, JumpTableRewriting, PaddingZeroing) |
| `ElfSegmentNormalizationOptions` | same | Options record with `Default` and `Minimal` presets |
| `ElfSegmentNormalizationResult` | same | Result with NormalizedBytes, DeltaHash (SHA-256), ModifiedBytes, AppliedSteps, StepCounts, computed ModificationRatio |
| `IElfSegmentNormalizer` | same | Interface: `Normalize`, `ComputeDeltaHash` |
| `ElfSegmentNormalizer` | same | Implementation with 5 internal passes and 2 OTel counters |

### 14.3 Normalization Passes

1. **RelocationZeroing** - Scans for ELF64 RELA-shaped entries (heuristic: info field encodes valid x86-64 relocation types 1–42 with symbol index ≤ 100,000); zeros the offset and addend fields (16 bytes per entry).
2. **GotPltCanonicalization** - Detects `FF 25` (JMP [rip+disp32]) and `FF 35` (PUSH [rip+disp32]) PLT stub patterns; zeros the 4-byte displacement to remove position-dependent indirect jump targets.
3. **NopCanonicalization** - Matches 7 multi-byte x86-64 NOP variants (2–7 bytes each, per Intel SDM) and replaces them with canonical single-byte NOPs (0x90).
4. **JumpTableRewriting** - Identifies sequences of 4+ consecutive 8-byte entries sharing the same upper 32 bits (switch-statement jump tables); zeros the entries.
5. **PaddingZeroing** - Detects runs of 4+ alignment padding bytes (0xCC or 0x00) between code regions and zeros them.

### 14.4 Delta Hashing

`ComputeDeltaHash` produces a lowercase SHA-256 hex string of the normalized byte buffer. Two builds of the same source that differ only in position-dependent data (for example, because they were compiled at different base addresses) produce the same delta hash after normalization.
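The delta-hash property can be illustrated with a deliberately small sketch that implements only the GOT/PLT displacement pass from 14.3 and then hashes the result. `DeltaHashSketch` and its method names are hypothetical stand-ins, not the real `ElfSegmentNormalizer`, and the naive byte scan assumes the buffer contains code only (an `FF 25` pair inside data would also be rewritten).

```csharp
using System;
using System.Security.Cryptography;

// Illustrative sketch only; not the actual ElfSegmentNormalizer.
public static class DeltaHashSketch
{
    // Zeros the 4-byte disp32 that follows FF 25 (JMP [rip+disp32])
    // or FF 35 (PUSH [rip+disp32]). Returns a normalized copy.
    public static byte[] ZeroPltDisplacements(ReadOnlySpan<byte> input)
    {
        var output = input.ToArray();
        for (var i = 0; i + 6 <= output.Length; i++)
        {
            var isPltStub = output[i] == 0xFF
                && (output[i + 1] == 0x25 || output[i + 1] == 0x35);
            if (isPltStub)
            {
                Array.Clear(output, i + 2, 4); // wipe the displacement
                i += 5;                        // skip past this 6-byte instruction
            }
        }
        return output;
    }

    // Lowercase SHA-256 hex of the normalized bytes.
    public static string HashAfterNormalization(ReadOnlySpan<byte> input)
    {
        var normalized = ZeroPltDisplacements(input);
        return Convert.ToHexString(SHA256.HashData(normalized)).ToLowerInvariant();
    }
}
```

Two stubs such as `FF 25 10 20 00 00 C3` and `FF 25 40 30 00 00 C3` differ only in their displacement, so they hash identically after this pass, while changing the trailing `C3` to another opcode yields a different hash.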
### 14.5 OTel Instrumentation

Meter: `StellaOps.BinaryIndex.Normalization.ElfSegment`

| Counter | Description |
| --- | --- |
| `elfsegment.normalize.total` | Segments normalized |
| `elfsegment.bytes.modified` | Total bytes modified across all passes |

### 14.6 DI Registration

`IElfSegmentNormalizer` is registered as `TryAddSingleton` inside `AddNormalizationPipelines()` in `ServiceCollectionExtensions.cs`.

### 14.7 Test Coverage (35 tests)

- Models: DefaultOptions (all enabled), MinimalOptions (relocations only), ModificationRatio zero/computed, enum values
- Service: Constructor null guard, empty input result + SHA-256, ComputeDeltaHash determinism/distinct, NOP canonicalization (3-byte, 2-byte, 4-byte, no-NOP, 7-byte, single-byte), GOT/PLT (JMP disp32, PUSH disp32), alignment padding (INT3 run, zero run, short run), relocation zeroing (valid RELA, invalid entry), jump table (consecutive addresses, random data), full pipeline (deterministic hash, default vs minimal, all-disabled, step-count consistency)

---

*Document Version: 1.5.0*
*Last Updated: 2026-02-12*