# BinaryIndex Module Architecture

> **Ownership:** Scanner Guild + Concelier Guild
> **Status:** DRAFT
> **Version:** 1.0.0
> **Related:** [High-Level Architecture](../../07_HIGH_LEVEL_ARCHITECTURE.md), [Scanner Architecture](../scanner/architecture.md), [Concelier Architecture](../concelier/architecture.md)

---

## 1. Overview

The **BinaryIndex** module provides a database of vulnerable binaries that enables detection of vulnerable code at the binary level, independent of package metadata. This addresses a critical gap in vulnerability scanning: package version strings can lie (backports, custom builds, stripped metadata), but **binary identity doesn't lie**.

### 1.1 Problem Statement

Traditional vulnerability scanners rely on package version matching, which fails in several scenarios:

1. **Backported patches** - Distros backport security fixes without changing the upstream version
2. **Custom/vendored builds** - Binaries compiled from source without package metadata
3. **Stripped binaries** - Debug info and version strings removed
4. **Static linking** - Vulnerable library code embedded in the final binary
5. **Container base images** - Distroless or scratch images with no package DB

### 1.2 Solution: Binary-First Vulnerability Detection

BinaryIndex provides three tiers of binary identification:

| Tier | Method | Precision | Coverage |
|------|--------|-----------|----------|
| A | Package/version range matching | Medium | High |
| B | Build-ID/hash catalog (exact binary identity) | High | Medium |
| C | Function fingerprints (CFG/basic-block hashes) | Very High | Targeted |

### 1.3 Module Scope

**In Scope:**
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
- Binary-to-advisory mapping database
- Fingerprint storage and matching engine
- Fix index for patch-aware backport handling
- Integration with Scanner.Worker for binary lookup

**Out of Scope:**
- Binary disassembly/analysis (provided by Scanner.Analyzers.Native)
- Runtime binary tracing (provided by Zastava)
- SBOM generation (provided by Scanner)

---

## 2. Architecture

### 2.1 System Context

```
┌──────────────────────────────────────────────────────────────────────────┐
│                             External Systems                             │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐         │
│  │ Distro Repos    │   │ Debug Symbol    │   │ Upstream Source │         │
│  │ (Debian, RPM,   │   │ Servers         │   │ (GitHub, etc.)  │         │
│  │ Alpine)         │   │ (debuginfod)    │   │                 │         │
│  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘         │
└───────────│─────────────────────│─────────────────────│──────────────────┘
            │                     │                     │
            v                     v                     v
┌──────────────────────────────────────────────────────────────────────────┐
│                            BinaryIndex Module                            │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                        Corpus Ingestion Layer                        │ │
│ │  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐              │ │
│ │  │ DebianCorpus │   │ RpmCorpus    │   │ AlpineCorpus │              │ │
│ │  │ Connector    │   │ Connector    │   │ Connector    │              │ │
│ │  └──────────────┘   └──────────────┘   └──────────────┘              │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                    │                                     │
│                                    v                                     │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                           Processing Layer                           │ │
│ │  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐              │ │
│ │  │ BinaryFeature│   │ FixIndex     │   │ Fingerprint  │              │ │
│ │  │ Extractor    │   │ Builder      │   │ Generator    │              │ │
│ │  └──────────────┘   └──────────────┘   └──────────────┘              │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                    │                                     │
│                                    v                                     │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                            Storage Layer                             │ │
│ │  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐              │ │
│ │  │ PostgreSQL   │   │ RustFS       │   │ Valkey       │              │ │
│ │  │ (binaries    │   │ (fingerprint │   │ (lookup      │              │ │
│ │  │  schema)     │   │  blobs)      │   │  cache)      │              │ │
│ │  └──────────────┘   └──────────────┘   └──────────────┘              │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                    │                                     │
│                                    v                                     │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                             Query Layer                              │ │
│ │  ┌──────────────────────────────────────────────────────────────┐    │ │
│ │  │ IBinaryVulnerabilityService                                  │    │ │
│ │  │ - LookupByIdentityAsync(identity)                            │    │ │
│ │  │ - LookupByFingerprintAsync(fingerprint)                      │    │ │
│ │  │ - LookupBatchAsync(identities)                               │    │ │
│ │  │ - GetFixStatusAsync(distro, release, sourcePkg, cve)         │    │ │
│ │  └──────────────────────────────────────────────────────────────┘    │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     v
┌──────────────────────────────────────────────────────────────────────────┐
│                            Consuming Modules                             │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐         │
│  │ Scanner.Worker  │   │ Policy Engine   │   │ Findings Ledger │         │
│  │ (binary lookup  │   │ (evidence in    │   │ (match records) │         │
│  │  during scan)   │   │  proof chain)   │   │                 │         │
│  └─────────────────┘   └─────────────────┘   └─────────────────┘         │
└──────────────────────────────────────────────────────────────────────────┘
```

### 2.2 Component Breakdown

#### 2.2.1 Corpus Connectors

Plugin-based connectors that ingest binaries from distribution repositories.

```csharp
public interface IBinaryCorpusConnector
{
    string ConnectorId { get; }
    string[] SupportedDistros { get; }

    Task FetchSnapshotAsync(CorpusQuery query, CancellationToken ct);

    // Element type name (ExtractedBinary) is an assumption.
    Task<IReadOnlyList<ExtractedBinary>> ExtractBinariesAsync(PackageReference pkg, CancellationToken ct);
}
```

**Implementations:**
- `DebianBinaryCorpusConnector` - Debian/Ubuntu packages + debuginfo
- `RpmBinaryCorpusConnector` - RHEL/Fedora/CentOS + SRPM
- `AlpineBinaryCorpusConnector` - Alpine APK + APKBUILD

#### 2.2.2 Binary Feature Extractor

Extracts identity and features from binaries. Reuses existing Scanner.Analyzers.Native capabilities.

```csharp
public interface IBinaryFeatureExtractor
{
    Task<BinaryIdentity> ExtractIdentityAsync(Stream binaryStream, CancellationToken ct);
    Task<BinaryFeatures> ExtractFeaturesAsync(Stream binaryStream, ExtractorOptions opts, CancellationToken ct);
}

public sealed record BinaryIdentity(
    string Format,              // elf, pe, macho
    string? BuildId,            // ELF GNU Build-ID
    string? PeCodeViewGuid,     // PE CodeView GUID + Age
    string? MachoUuid,          // Mach-O LC_UUID
    string FileSha256,
    string TextSectionSha256);

public sealed record BinaryFeatures(
    BinaryIdentity Identity,
    string[] DynamicDeps,       // DT_NEEDED
    string[] ExportedSymbols,
    string[] ImportedSymbols,
    BinaryHardening Hardening);
```

#### 2.2.3 Fix Index Builder

Builds the patch-aware CVE fix index from distro sources.

```csharp
public interface IFixIndexBuilder
{
    Task BuildIndexAsync(DistroRelease distro, CancellationToken ct);

    Task<FixRecord?> GetFixRecordAsync(
        string distro,
        string release,
        string sourcePkg,
        string cveId,
        CancellationToken ct);
}

public sealed record FixRecord(
    string Distro,
    string Release,
    string SourcePkg,
    string CveId,
    FixState State,             // fixed, vulnerable, not_affected, wontfix, unknown
    string? FixedVersion,       // Distro version string
    FixMethod Method,           // security_feed, changelog, patch_header
    decimal Confidence,         // 0.00-1.00
    FixEvidence Evidence);

public enum FixState { Fixed, Vulnerable, NotAffected, Wontfix, Unknown }

public enum FixMethod { SecurityFeed, Changelog, PatchHeader, UpstreamPatchMatch }
```

#### 2.2.4 Fingerprint Generator

Generates function-level fingerprints for vulnerable code detection.

```csharp
public interface IVulnFingerprintGenerator
{
    Task<IReadOnlyList<VulnFingerprint>> GenerateAsync(
        string cveId,
        BinaryPair vulnAndFixed,    // Reference builds
        FingerprintOptions opts,
        CancellationToken ct);
}

public sealed record VulnFingerprint(
    string CveId,
    string Component,           // e.g., openssl
    string Architecture,        // x86-64, aarch64
    FingerprintType Type,       // basic_block, cfg, combined
    string FingerprintId,       // e.g., "bb-abc123..."
    byte[] FingerprintHash,     // 16-32 bytes
    string? FunctionHint,       // Function name if known
    decimal Confidence,
    FingerprintEvidence Evidence);

public enum FingerprintType { BasicBlock, ControlFlowGraph, StringReferences, Combined }
```

#### 2.2.5 Binary Vulnerability Service

Main query interface for consumers.
```csharp
public interface IBinaryVulnerabilityService
{
    /// <summary>
    /// Look up vulnerabilities by Build-ID or equivalent binary identity.
    /// </summary>
    Task<IReadOnlyList<BinaryVulnMatch>> LookupByIdentityAsync(
        BinaryIdentity identity,
        LookupOptions? opts = null,
        CancellationToken ct = default);

    /// <summary>
    /// Look up vulnerabilities by function fingerprint.
    /// </summary>
    Task<IReadOnlyList<BinaryVulnMatch>> LookupByFingerprintAsync(
        CodeFingerprint fingerprint,
        decimal minSimilarity = 0.95m,
        CancellationToken ct = default);

    /// <summary>
    /// Batch lookup for scan performance.
    /// </summary>
    // The dictionary key type (string) is an assumption.
    Task<IReadOnlyDictionary<string, IReadOnlyList<BinaryVulnMatch>>> LookupBatchAsync(
        IEnumerable<BinaryIdentity> identities,
        LookupOptions? opts = null,
        CancellationToken ct = default);

    /// <summary>
    /// Get distro-specific fix status (patch-aware).
    /// </summary>
    Task<FixRecord?> GetFixStatusAsync(
        string distro,
        string release,
        string sourcePkg,
        string cveId,
        CancellationToken ct = default);
}

public sealed record BinaryVulnMatch(
    string CveId,
    string VulnerablePurl,
    MatchMethod Method,         // buildid_catalog, fingerprint_match, range_match
    decimal Confidence,
    MatchEvidence Evidence);

public enum MatchMethod { BuildIdCatalog, FingerprintMatch, RangeMatch }
```

---

## 3. Data Model

### 3.1 PostgreSQL Schema (`binaries`)

The `binaries` schema stores binary identity, fingerprint, and match data.

```sql
CREATE SCHEMA IF NOT EXISTS binaries;
CREATE SCHEMA IF NOT EXISTS binaries_app;

-- RLS helper
CREATE OR REPLACE FUNCTION binaries_app.require_current_tenant()
RETURNS TEXT
LANGUAGE plpgsql STABLE SECURITY DEFINER
AS $$
DECLARE
    v_tenant TEXT;
BEGIN
    v_tenant := current_setting('app.tenant_id', true);
    IF v_tenant IS NULL OR v_tenant = '' THEN
        RAISE EXCEPTION 'app.tenant_id session variable not set';
    END IF;
    RETURN v_tenant;
END;
$$;
```

#### 3.1.1 Core Tables

See `docs/db/schemas/binaries_schema_specification.md` for complete DDL.

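
The tenant guard from Section 3.1 only takes effect once a table references it from a row-level security policy. The following is a minimal sketch of that wiring, not the module's actual DDL: it assumes a hypothetical `tenant_id TEXT` column on `binaries.binary_identity` and a hypothetical policy name, neither of which is confirmed here. The schema specification referenced above remains authoritative.

```sql
-- Sketch only: the tenant_id column and the policy name are assumptions.
ALTER TABLE binaries.binary_identity ENABLE ROW LEVEL SECURITY;
-- FORCE applies the policy to the table owner as well.
ALTER TABLE binaries.binary_identity FORCE ROW LEVEL SECURITY;

CREATE POLICY binary_identity_tenant_isolation
    ON binaries.binary_identity
    USING (tenant_id = binaries_app.require_current_tenant())
    WITH CHECK (tenant_id = binaries_app.require_current_tenant());

-- Per-transaction usage: set the tenant before querying.
BEGIN;
SELECT set_config('app.tenant_id', 'tenant-a', true);  -- true = local to this transaction
SELECT count(*) FROM binaries.binary_identity;         -- only rows for tenant-a are visible
COMMIT;
```

Because the helper raises when `app.tenant_id` is unset, a connection that forgets to set the tenant fails with an error instead of silently returning an empty result.
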

**Key Tables:**

| Table | Purpose |
|-------|---------|
| `binaries.binary_identity` | Known binary identities (Build-ID, hashes) |
| `binaries.binary_package_map` | Binary → package mapping per snapshot |
| `binaries.vulnerable_buildids` | Build-IDs known to be vulnerable |
| `binaries.vulnerable_fingerprints` | Function fingerprints for CVEs |
| `binaries.cve_fix_index` | Patch-aware fix status per distro |
| `binaries.fingerprint_matches` | Match results (findings evidence) |
| `binaries.corpus_snapshots` | Corpus ingestion tracking |

### 3.2 RustFS Layout

```
rustfs://stellaops/binaryindex/
  fingerprints///.bin
  corpus////manifest.json
  corpus////packages/.metadata.json
  evidence/.dsse.json
```

---

## 4. Integration Points

### 4.1 Scanner.Worker Integration

During container scanning, Scanner.Worker queries BinaryIndex for each extracted binary:

```mermaid
sequenceDiagram
    participant SW as Scanner.Worker
    participant BI as BinaryIndex
    participant PG as PostgreSQL
    participant FL as Findings Ledger

    SW->>SW: Extract binary from layer
    SW->>SW: Compute BinaryIdentity
    SW->>BI: LookupByIdentityAsync(identity)
    BI->>PG: Query binaries.vulnerable_buildids
    PG-->>BI: Matches
    BI->>PG: Query binaries.cve_fix_index (if distro known)
    PG-->>BI: Fix status
    BI-->>SW: BinaryVulnMatch[]
    SW->>FL: RecordFinding(match, evidence)
```

### 4.2 Concelier Integration

BinaryIndex subscribes to Concelier's advisory updates:

```mermaid
sequenceDiagram
    participant CO as Concelier
    participant BI as BinaryIndex
    participant PG as PostgreSQL

    CO->>CO: Ingest new advisory
    CO->>BI: advisory.created event
    BI->>BI: Check if affected packages in corpus
    BI->>PG: Update binaries.binary_vuln_assertion
    BI->>BI: Queue fingerprint generation (if high-impact)
```

### 4.3 Policy Integration

Binary matches are recorded as proof segments:

```json
{
  "segment_type": "binary_fingerprint_evidence",
  "payload": {
    "binary_identity": {
      "format": "elf",
      "build_id": "abc123...",
      "file_sha256": "def456..."
    },
    "matches": [
      {
        "cve_id": "CVE-2024-1234",
        "method": "buildid_catalog",
        "confidence": 0.98,
        "vulnerable_purl": "pkg:deb/debian/libssl3@1.1.1n-0+deb11u3"
      }
    ]
  }
}
```

---

## 5. MVP Roadmap

### MVP 1: Known-Build Binary Catalog (Sprint 6000.0001)

**Goal:** Query "is this Build-ID vulnerable?" with distro-level precision.

**Deliverables:**
- `binaries` PostgreSQL schema
- Build-ID to package mapping tables
- Basic CVE lookup by binary identity
- Debian/Ubuntu corpus connector

### MVP 2: Patch-Aware Backport Handling (Sprint 6000.0002)

**Goal:** Handle "version says vulnerable but distro backported the fix."

**Deliverables:**
- Fix index builder (changelog + patch header parsing)
- Distro-specific version comparison
- RPM corpus connector
- Scanner.Worker integration

### MVP 3: Binary Fingerprint Factory (Sprint 6000.0003)

**Goal:** Detect vulnerable code independent of package metadata.

**Deliverables:**
- Fingerprint storage and matching
- Reference build generation pipeline
- Fingerprint validation corpus
- High-impact CVE coverage (OpenSSL, glibc, zlib, curl)

### MVP 4: Full Scanner Integration (Sprint 6000.0004)

**Goal:** Binary evidence in production scans.

**Deliverables:**
- Scanner.Worker binary lookup integration
- Findings Ledger binary match records
- Proof segment attestations
- CLI binary match inspection

---

## 6. Security Considerations

### 6.1 Trust Boundaries

1. **Corpus Ingestion** - Packages are untrusted; extraction runs in sandboxed workers
2. **Fingerprint Generation** - Reference builds compiled in isolated environments
3. **Query API** - Tenant-isolated via RLS; no cross-tenant data leakage

### 6.2 Signing & Provenance

- All corpus snapshots are signed (DSSE)
- Fingerprint sets are versioned and signed
- Every match result references evidence digests

### 6.3 Sandbox Requirements

Binary extraction and fingerprint generation MUST run with:
- Seccomp profile restricting syscalls
- Read-only root filesystem
- No network access during analysis
- Memory/CPU limits

---

## 7. Observability

### 7.1 Metrics

| Metric | Type | Labels |
|--------|------|--------|
| `binaryindex_lookup_total` | Counter | method, result |
| `binaryindex_lookup_latency_ms` | Histogram | method |
| `binaryindex_corpus_packages_total` | Gauge | distro, release |
| `binaryindex_fingerprints_indexed` | Gauge | algorithm, component |
| `binaryindex_match_confidence` | Histogram | method |

### 7.2 Traces

- `binaryindex.lookup` - Full lookup span
- `binaryindex.corpus.ingest` - Corpus ingestion
- `binaryindex.fingerprint.generate` - Fingerprint generation

---

## 8. Configuration

```yaml
# binaryindex.yaml
binaryindex:
  enabled: true

  corpus:
    connectors:
      - type: debian
        enabled: true
        mirror: http://deb.debian.org/debian
        releases: [bookworm, bullseye]
        architectures: [amd64, arm64]
      - type: ubuntu
        enabled: true
        mirror: http://archive.ubuntu.com/ubuntu
        releases: [jammy, noble]

  fingerprinting:
    enabled: true
    algorithms: [basic_block, cfg]
    target_components:
      - openssl
      - glibc
      - zlib
      - curl
      - sqlite
    min_function_size: 16  # bytes
    max_functions_per_binary: 10000

  lookup:
    cache_ttl: 3600
    batch_size: 100
    timeout_ms: 5000

  storage:
    postgres_schema: binaries
    rustfs_bucket: stellaops/binaryindex
```

---

## 9. Testing Strategy

### 9.1 Unit Tests

- Identity extraction (Build-ID, hashes)
- Fingerprint generation determinism
- Fix index parsing (changelog, patch headers)

### 9.2 Integration Tests

- PostgreSQL schema validation
- Full corpus ingestion flow
- Scanner.Worker lookup integration

### 9.3 Regression Tests

- Known CVE detection (golden corpus)
- Backport handling (Debian libssl example)
- False positive rate validation

---

## 10. References

- Advisory: `docs/product-advisories/21-Dec-2025 - Mapping Evidence Within Compiled Binaries.md`
- Scanner Native Analysis: `src/Scanner/StellaOps.Scanner.Analyzers.Native/`
- Existing Fingerprinting: `src/Scanner/__Libraries/StellaOps.Scanner.EntryTrace/Binary/`
- Build-ID Index: `src/Scanner/StellaOps.Scanner.Analyzers.Native/Index/`

---

*Document Version: 1.0.0*
*Last Updated: 2025-12-21*