git.stella-ops.org/docs/modules/binary-index/architecture.md
2026-01-24 00:12:43 +02:00

BinaryIndex Module Architecture

Ownership: Scanner Guild + Concelier Guild
Status: DRAFT
Version: 1.0.0
Related: High-Level Architecture, Scanner Architecture, Concelier Architecture


1. Overview

The BinaryIndex module provides a vulnerable binaries database that enables detection of vulnerable code at the binary level, independent of package metadata. This addresses a critical gap in vulnerability scanning: package version strings can lie (backports, custom builds, stripped metadata), but binary identity doesn't lie.

1.1 Problem Statement

Traditional vulnerability scanners rely on package version matching, which fails in several scenarios:

  1. Backported patches - Distros backport security fixes without changing the upstream version
  2. Custom/vendored builds - Binaries compiled from source without package metadata
  3. Stripped binaries - Debug info and version strings removed
  4. Static linking - Vulnerable library code embedded in final binary
  5. Container base images - Distroless or scratch images with no package DB

1.2 Solution: Binary-First Vulnerability Detection

BinaryIndex provides three tiers of binary identification:

| Tier | Method | Precision | Coverage |
|------|--------|-----------|----------|
| A | Package/version range matching | Medium | High |
| B | Build-ID/hash catalog (exact binary identity) | High | Medium |
| C | Function fingerprints (CFG/basic-block hashes) | Very High | Targeted |

1.3 Module Scope

In Scope:

  • Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
  • Binary-to-advisory mapping database
  • Fingerprint storage and matching engine
  • Fix index for patch-aware backport handling
  • Integration with Scanner.Worker for binary lookup

Out of Scope:

  • Binary disassembly/analysis (provided by Scanner.Analyzers.Native)
  • Runtime binary tracing (provided by Zastava)
  • SBOM generation (provided by Scanner)

2. Architecture

2.1 System Context

┌──────────────────────────────────────────────────────────────────────────┐
│                         External Systems                                   │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐           │
│  │ Distro Repos     │  │ Debug Symbol    │  │ Upstream Source │           │
│  │ (Debian, RPM,    │  │ Servers         │  │ (GitHub, etc.)  │           │
│  │  Alpine)         │  │ (debuginfod)    │  │                 │           │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘           │
└───────────│─────────────────────│─────────────────────│──────────────────┘
            │                     │                     │
            v                     v                     v
┌──────────────────────────────────────────────────────────────────────────┐
│                      BinaryIndex Module                                    │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                    Corpus Ingestion Layer                            │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │ │
│  │  │ DebianCorpus │  │ RpmCorpus    │  │ AlpineCorpus │               │ │
│  │  │ Connector    │  │ Connector    │  │ Connector    │               │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘               │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
│                                │                                          │
│                                v                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                    Processing Layer                                  │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │ │
│  │  │ BinaryFeature│  │ FixIndex     │  │ Fingerprint  │               │ │
│  │  │ Extractor    │  │ Builder      │  │ Generator    │               │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘               │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
│                                │                                          │
│                                v                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                    Storage Layer                                     │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │ │
│  │  │ PostgreSQL   │  │ RustFS       │  │ Valkey       │               │ │
│  │  │ (binaries    │  │ (fingerprint │  │ (lookup      │               │ │
│  │  │  schema)     │  │  blobs)      │  │  cache)      │               │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘               │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
│                                │                                          │
│                                v                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                    Query Layer                                       │ │
│  │  ┌──────────────────────────────────────────────────────────────┐   │ │
│  │  │              IBinaryVulnerabilityService                      │   │ │
│  │  │  - LookupByIdentityAsync(identity)                           │   │ │
│  │  │  - LookupByFingerprintAsync(fingerprint)                     │   │ │
│  │  │  - LookupBatchAsync(identities)                              │   │ │
│  │  │  - GetFixStatusAsync(distro, release, sourcePkg, cve)        │   │ │
│  │  └──────────────────────────────────────────────────────────────┘   │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
                                │
                                v
┌──────────────────────────────────────────────────────────────────────────┐
│                      Consuming Modules                                     │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐           │
│  │ Scanner.Worker   │  │ Policy Engine   │  │ Findings Ledger │           │
│  │ (binary lookup   │  │ (evidence in    │  │ (match records) │           │
│  │  during scan)    │  │  proof chain)   │  │                 │           │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘           │
└──────────────────────────────────────────────────────────────────────────┘

2.2 Component Breakdown

2.2.1 Corpus Connectors

Plugin-based connectors that ingest binaries from distribution repositories.

public interface IBinaryCorpusConnector
{
    string ConnectorId { get; }
    string[] SupportedDistros { get; }

    Task<CorpusSnapshot> FetchSnapshotAsync(CorpusQuery query, CancellationToken ct);
    Task<IAsyncEnumerable<ExtractedBinary>> ExtractBinariesAsync(PackageReference pkg, CancellationToken ct);
}

Implementations:

  • DebianBinaryCorpusConnector - Debian/Ubuntu packages + debuginfo
  • RpmBinaryCorpusConnector - RHEL/Fedora/CentOS + SRPM
  • AlpineBinaryCorpusConnector - Alpine APK + APKBUILD

2.2.2 Binary Feature Extractor

Extracts identity and features from binaries. Reuses existing Scanner.Analyzers.Native capabilities.

public interface IBinaryFeatureExtractor
{
    Task<BinaryIdentity> ExtractIdentityAsync(Stream binaryStream, CancellationToken ct);
    Task<BinaryFeatures> ExtractFeaturesAsync(Stream binaryStream, ExtractorOptions opts, CancellationToken ct);
}

public sealed record BinaryIdentity(
    string Format,           // elf, pe, macho
    string? BuildId,         // ELF GNU Build-ID
    string? PeCodeViewGuid,  // PE CodeView GUID + Age
    string? MachoUuid,       // Mach-O LC_UUID
    string FileSha256,
    string TextSectionSha256);

public sealed record BinaryFeatures(
    BinaryIdentity Identity,
    string[] DynamicDeps,    // DT_NEEDED
    string[] ExportedSymbols,
    string[] ImportedSymbols,
    BinaryHardening Hardening);
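To make the BuildId field concrete, the sketch below extracts a GNU Build-ID from the raw bytes of an ELF note section or segment. It is illustrative Python, not the module's C# extractor: the function name `parse_gnu_build_id` is hypothetical, and locating the note bytes inside the ELF file (for example `.note.gnu.build-id`) is assumed to have been done already. The note layout it relies on (three 32-bit words, then a name and a descriptor, each padded to 4 bytes) is the standard ELF note format.

```python
import struct
from typing import Optional

NT_GNU_BUILD_ID = 3  # note type used for the GNU Build-ID


def _align4(n: int) -> int:
    """Round n up to the next multiple of 4 (note fields are 4-byte aligned)."""
    return (n + 3) & ~3


def parse_gnu_build_id(note_data: bytes, little_endian: bool = True) -> Optional[str]:
    """Return the Build-ID as lowercase hex, or None if no GNU build-id note exists."""
    header = "<III" if little_endian else ">III"
    offset = 0
    while offset + 12 <= len(note_data):
        namesz, descsz, note_type = struct.unpack_from(header, note_data, offset)
        name_start = offset + 12
        desc_start = name_start + _align4(namesz)
        desc_end = desc_start + descsz
        if desc_end > len(note_data):
            return None  # truncated or malformed note
        name = note_data[name_start:name_start + namesz].rstrip(b"\x00")
        if name == b"GNU" and note_type == NT_GNU_BUILD_ID:
            return note_data[desc_start:desc_end].hex()
        # Skip to the next note in the same section.
        offset = desc_start + _align4(descsz)
    return None
```

The hex string returned here is the kind of value that would populate `BinaryIdentity.BuildId`; the PE and Mach-O identifiers need their own format-specific parsing.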

2.2.3 Fix Index Builder

Builds the patch-aware CVE fix index from distro sources.

public interface IFixIndexBuilder
{
    Task BuildIndexAsync(DistroRelease distro, CancellationToken ct);
    Task<FixRecord?> GetFixRecordAsync(string distro, string release, string sourcePkg, string cveId, CancellationToken ct);
}

public sealed record FixRecord(
    string Distro,
    string Release,
    string SourcePkg,
    string CveId,
    FixState State,           // fixed, vulnerable, not_affected, wontfix, unknown
    string? FixedVersion,     // Distro version string
    FixMethod Method,         // security_feed, changelog, patch_header, upstream_match
    decimal Confidence,       // 0.00-1.00
    FixEvidence Evidence);

public enum FixState { Fixed, Vulnerable, NotAffected, Wontfix, Unknown }
public enum FixMethod { SecurityFeed, Changelog, PatchHeader, UpstreamPatchMatch }

2.2.4 Fingerprint Generator

Generates function-level fingerprints for vulnerable code detection.

public interface IVulnFingerprintGenerator
{
    Task<ImmutableArray<VulnFingerprint>> GenerateAsync(
        string cveId,
        BinaryPair vulnAndFixed,  // Reference builds
        FingerprintOptions opts,
        CancellationToken ct);
}

public sealed record VulnFingerprint(
    string CveId,
    string Component,         // e.g., openssl
    string Architecture,      // x86-64, aarch64
    FingerprintType Type,     // basic_block, cfg, combined
    string FingerprintId,     // e.g., "bb-abc123..."
    byte[] FingerprintHash,   // 16-32 bytes
    string? FunctionHint,     // Function name if known
    decimal Confidence,
    FingerprintEvidence Evidence);

public enum FingerprintType { BasicBlock, ControlFlowGraph, StringReferences, Combined }

2.2.5 Semantic Analysis Library

Library: StellaOps.BinaryIndex.Semantic
Sprint: 20260105_001_001_BINDEX - Semantic Diffing Phase 1

The Semantic Analysis Library extends fingerprint generation with IR-level semantic matching, enabling detection of semantically equivalent code despite compiler optimizations, instruction reordering, and register allocation differences.

Key Insight: Traditional instruction-level fingerprinting loses roughly 15-20% accuracy on optimized binaries. Semantic analysis instead lifts code to B2R2's Intermediate Representation (LowUIR), extracts key-semantics graphs, and computes similarity from graph hashes.

2.2.5.1 Architecture

Binary Input
    │
    v
B2R2 Disassembly → Raw Instructions
    │
    v
IR Lifting Service → LowUIR Statements
    │
    v
Semantic Graph Extractor → Key-Semantics Graph (KSG)
    │
    v
Graph Fingerprinting → Semantic Fingerprint
    │
    v
Semantic Matcher → Similarity Score + Deltas

2.2.5.2 Core Components

IR Lifting Service (IIrLiftingService)

Lifts disassembled instructions to B2R2 LowUIR:

public interface IIrLiftingService
{
    Task<LiftedFunction> LiftToIrAsync(
        IReadOnlyList<DisassembledInstruction> instructions,
        string functionName,
        LiftOptions? options = null,
        CancellationToken ct = default);
}

public sealed record LiftedFunction(
    string Name,
    ImmutableArray<IrStatement> Statements,
    ImmutableArray<IrBasicBlock> BasicBlocks);

Semantic Graph Extractor (ISemanticGraphExtractor)

Extracts key-semantics graphs capturing data dependencies, control flow, and memory operations:

public interface ISemanticGraphExtractor
{
    Task<KeySemanticsGraph> ExtractGraphAsync(
        LiftedFunction function,
        GraphExtractionOptions? options = null,
        CancellationToken ct = default);
}

public sealed record KeySemanticsGraph(
    string FunctionName,
    ImmutableArray<SemanticNode> Nodes,
    ImmutableArray<SemanticEdge> Edges,
    GraphProperties Properties);

public enum SemanticNodeType { Compute, Load, Store, Branch, Call, Return, Phi }
public enum SemanticEdgeType { DataDependency, ControlDependency, MemoryDependency }

Semantic Fingerprint Generator (ISemanticFingerprintGenerator)

Generates semantic fingerprints using Weisfeiler-Lehman graph hashing:

public interface ISemanticFingerprintGenerator
{
    Task<SemanticFingerprint> GenerateAsync(
        KeySemanticsGraph graph,
        SemanticFingerprintOptions? options = null,
        CancellationToken ct = default);
}

public sealed record SemanticFingerprint(
    string FunctionName,
    string GraphHashHex,      // WL graph hash (SHA-256)
    string OperationHashHex,  // Normalized operation sequence hash
    string DataFlowHashHex,   // Data dependency pattern hash
    int NodeCount,
    int EdgeCount,
    int CyclomaticComplexity,
    ImmutableArray<string> ApiCalls,
    SemanticFingerprintAlgorithm Algorithm);

Semantic Matcher (ISemanticMatcher)

Computes semantic similarity with weighted components:

public interface ISemanticMatcher
{
    Task<SemanticMatchResult> MatchAsync(
        SemanticFingerprint a,
        SemanticFingerprint b,
        MatchOptions? options = null,
        CancellationToken ct = default);

    Task<SemanticMatchResult> MatchWithDeltasAsync(
        SemanticFingerprint a,
        SemanticFingerprint b,
        MatchOptions? options = null,
        CancellationToken ct = default);
}

public sealed record SemanticMatchResult(
    decimal Similarity,       // 0.00-1.00
    decimal GraphSimilarity,
    decimal OperationSimilarity,
    decimal DataFlowSimilarity,
    decimal ApiCallSimilarity,
    MatchConfidence Confidence);

2.2.5.3 Algorithm Details

Weisfeiler-Lehman Graph Hashing:

  • 3 iterations of label propagation
  • SHA-256 for final hash computation
  • Deterministic node ordering via canonical sort
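The three properties above can be sketched as follows. This is a minimal Python illustration of Weisfeiler-Lehman hashing, not the library's implementation: the function name `wl_graph_hash`, the input shapes, and the choice to propagate labels along outgoing edges only are all assumptions made for the example.

```python
import hashlib


def _sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def wl_graph_hash(node_labels: dict, edges: list, iterations: int = 3) -> str:
    """Weisfeiler-Lehman hash of a directed graph with labelled nodes and edges.

    node_labels maps node id -> initial label (for example "Load").
    edges is a list of (source, target, edge_label) tuples.
    """
    # Index outgoing edges as (edge_label, target) pairs per node.
    successors = {node: [] for node in node_labels}
    for source, target, edge_label in edges:
        successors[source].append((edge_label, target))

    labels = dict(node_labels)
    for _ in range(iterations):
        new_labels = {}
        for node in labels:
            # Canonical sort makes the result independent of edge order.
            neighbourhood = sorted(
                f"{edge_label}:{labels[target]}"
                for edge_label, target in successors[node]
            )
            new_labels[node] = _sha256_hex(labels[node] + "|" + ",".join(neighbourhood))
        labels = new_labels

    # Hash the sorted multiset of final labels so node ids do not matter.
    return _sha256_hex(";".join(sorted(labels.values())))
```

Because only labels and structure feed the hash, two graphs that differ in node naming or edge order but are otherwise identical produce the same digest, which is the property the semantic fingerprint depends on.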

Similarity Weights (Default):

| Component | Weight |
|-----------|--------|
| Graph Hash | 0.35 |
| Operation Hash | 0.25 |
| Data Flow Hash | 0.25 |
| API Calls | 0.15 |
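With the default weights, the overall score is a weighted sum of the four component similarities. The Python sketch below shows only that arithmetic; how each component similarity is computed is not specified here, so the per-component scores are taken as given inputs, and the names `DEFAULT_WEIGHTS` and `combined_similarity` are hypothetical.

```python
from decimal import Decimal

# Default weights from the table above; they sum to 1.00.
DEFAULT_WEIGHTS = {
    "graph": Decimal("0.35"),
    "operation": Decimal("0.25"),
    "data_flow": Decimal("0.25"),
    "api_calls": Decimal("0.15"),
}


def combined_similarity(scores: dict, weights: dict = DEFAULT_WEIGHTS) -> Decimal:
    """Weighted sum of per-component similarities, each expected in [0, 1]."""
    if sum(weights.values()) != Decimal("1"):
        raise ValueError("weights must sum to 1")
    total = sum(weights[name] * Decimal(str(scores[name])) for name in weights)
    return total.quantize(Decimal("0.01"))


# 0.35*1.0 + 0.25*0.8 + 0.25*0.6 + 0.15*1.0 = 0.85
example = combined_similarity(
    {"graph": 1.0, "operation": 0.8, "data_flow": 0.6, "api_calls": 1.0}
)
```

Decimal arithmetic mirrors the `decimal` fields of `SemanticMatchResult` and avoids binary floating-point drift when scores are compared against thresholds.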
2.2.5.4 Integration Points

The semantic library integrates with existing BinaryIndex components:

DeltaSignatureGenerator Extension:

// Optional semantic services via constructor injection
services.AddDeltaSignaturesWithSemantic();

// Extended SymbolSignature with semantic properties
public sealed record SymbolSignature
{
    // ... existing properties ...
    public string? SemanticHashHex { get; init; }
    public ImmutableArray<string> SemanticApiCalls { get; init; }
}

PatchDiffEngine Extension:

// SemanticWeight in HashWeights
public decimal SemanticWeight { get; init; } = 0.2m;

// FunctionFingerprint extended with semantic fingerprint
public SemanticFingerprint? SemanticFingerprint { get; init; }

2.2.5.5 Test Coverage

| Category | Tests | Coverage |
|----------|-------|----------|
| Unit Tests (IR lifting, graph extraction, hashing) | 53 | Core algorithms |
| Integration Tests (full pipeline) | 9 | End-to-end flow |
| Golden Corpus (compiler variations) | 11 | Register allocation, optimization, compiler variants |
| Benchmarks (accuracy, performance) | 7 | Baseline metrics |

2.2.5.6 Current Baselines

Note: Baselines reflect foundational implementation; accuracy improves as semantic features mature.

| Metric | Baseline | Target |
|--------|----------|--------|
| Similarity (register allocation variants) | ≥0.55 | ≥0.85 |
| Overall accuracy | ≥40% | ≥70% |
| False positive rate | <10% | <5% |
| P95 fingerprint latency | <100ms | <50ms |

2.2.5.7 B2R2 LowUIR Adapter

The B2R2LowUirLiftingService implements IIrLiftingService using B2R2's native lifting capabilities. This provides cross-platform IR representation for semantic analysis.

Key Components:

public sealed class B2R2LowUirLiftingService : IIrLiftingService
{
    // Lifts to B2R2 LowUIR and maps to Stella IR model
    public Task<LiftedFunction> LiftToIrAsync(
        IReadOnlyList<DisassembledInstruction> instructions,
        string functionName,
        LiftOptions? options = null,
        CancellationToken ct = default);
}

Supported ISAs:

  • Intel (x86-32, x86-64)
  • ARM (ARMv7, ARMv8/ARM64)
  • MIPS (32/64)
  • RISC-V (64)
  • PowerPC, SPARC, SH4, AVR, EVM

IR Statement Mapping:

| B2R2 LowUIR | Stella IR Kind |
|-------------|----------------|
| Put | IrStatementKind.Store |
| Store | IrStatementKind.Store |
| Get | IrStatementKind.Load |
| Load | IrStatementKind.Load |
| BinOp | IrStatementKind.BinaryOp |
| UnOp | IrStatementKind.UnaryOp |
| Jmp | IrStatementKind.Jump |
| CJmp | IrStatementKind.ConditionalJump |
| InterJmp | IrStatementKind.IndirectJump |
| Call | IrStatementKind.Call |
| SideEffect | IrStatementKind.SideEffect |

Determinism Guarantees:

  • Statements ordered by block address (ascending)
  • Blocks sorted by entry address (ascending)
  • Consistent IR IDs across identical inputs
  • InvariantCulture used for all string formatting

2.2.5.8 B2R2 Lifter Pool

The B2R2LifterPool provides bounded pooling and warm preload for B2R2 lifting units to reduce per-call allocation overhead.

Configuration (B2R2LifterPoolOptions):

| Option | Default | Description |
|--------|---------|-------------|
| MaxPoolSizePerIsa | 4 | Maximum pooled lifters per ISA |
| EnableWarmPreload | true | Preload lifters at startup |
| WarmPreloadIsas | ["intel-64", "intel-32", "armv8-64", "armv7-32"] | ISAs to warm |
| AcquireTimeout | 5s | Timeout for acquiring a lifter |

Pool Statistics:

  • TotalPooledLifters: Lifters currently in pool
  • TotalActiveLifters: Lifters currently in use
  • IsWarm: Whether pool has been warmed
  • IsaStats: Per-ISA pool and active counts

Usage:

using var lifter = _lifterPool.Acquire(isa);
var stmts = lifter.LiftingUnit.LiftInstruction(address);
// Lifter automatically returned to pool on dispose
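The acquire-and-return pattern above can be illustrated with a small bounded pool. This Python sketch is not the B2R2LifterPool implementation: the class name `LifterPool`, the `factory` callable, and the decision to create all lifters for an ISA eagerly on first use are assumptions made to keep the example short.

```python
import queue
import threading
from contextlib import contextmanager


class LifterPool:
    """Bounded per-ISA pool; `factory` is a hypothetical callable that builds a lifter."""

    def __init__(self, factory, max_pool_size_per_isa=4, acquire_timeout_s=5.0):
        self._factory = factory
        self._max = max_pool_size_per_isa
        self._timeout = acquire_timeout_s
        self._pools = {}
        self._lock = threading.Lock()

    def _pool_for(self, isa):
        # Create and fill the pool for an ISA once, under a lock.
        with self._lock:
            pool = self._pools.get(isa)
            if pool is None:
                pool = queue.Queue(maxsize=self._max)
                for _ in range(self._max):
                    pool.put(self._factory(isa))
                self._pools[isa] = pool
            return pool

    def warm(self, isas):
        """Preload pools so the first acquire does not pay the construction cost."""
        for isa in isas:
            self._pool_for(isa)

    @contextmanager
    def acquire(self, isa):
        pool = self._pool_for(isa)
        try:
            lifter = pool.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError(f"no {isa} lifter available within {self._timeout}s") from None
        try:
            yield lifter
        finally:
            pool.put(lifter)  # always return the lifter, even if lifting failed
```

A caller would write `with pool.acquire("intel-64") as lifter: ...`, which plays the same role as the `using` statement in the C# usage above: the lifter goes back to the pool when the block exits.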
2.2.5.9 Function IR Cache

The FunctionIrCacheService provides Valkey-backed caching for computed semantic fingerprints to avoid redundant IR lifting and graph hashing.

Cache Key Structure:

(isa, b2r2_version, normalization_recipe, canonical_ir_hash)

Configuration (FunctionIrCacheOptions):

| Option | Default | Description |
|--------|---------|-------------|
| KeyPrefix | "stellaops:binidx:funccache:" | Valkey key prefix |
| CacheTtl | 4h | TTL for cached entries |
| MaxTtl | 24h | Maximum TTL |
| Enabled | true | Whether caching is enabled |
| B2R2Version | "0.9.1" | B2R2 version for cache key |
| NormalizationRecipeVersion | "v1" | Recipe version for cache key |

Cache Entry (CachedFunctionFingerprint):

  • FunctionAddress, FunctionName
  • SemanticFingerprint: The computed fingerprint
  • IrStatementCount, BasicBlockCount
  • ComputedAtUtc: ISO-8601 timestamp
  • B2R2Version, NormalizationRecipe

Invalidation Rules:

  • Cache entries expire after CacheTtl (default 4h)
  • Changing B2R2 version or normalization recipe results in cache misses
  • Manual invalidation via RemoveAsync()
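Version-driven invalidation follows directly from the key structure: because the B2R2 version and the normalization recipe are part of the key, changing either one makes old entries unreachable without deleting them. The Python sketch below illustrates this; the colon-separated key layout, the use of SHA-256 for the canonical IR hash, and the function name `function_cache_key` are assumptions, since only the four key components are specified above.

```python
import hashlib


def function_cache_key(
    isa: str,
    b2r2_version: str,
    recipe_version: str,
    canonical_ir: bytes,
    key_prefix: str = "stellaops:binidx:funccache:",
) -> str:
    """Build a cache key from (isa, b2r2_version, normalization_recipe, canonical_ir_hash)."""
    ir_hash = hashlib.sha256(canonical_ir).hexdigest()  # assumed hash function
    return f"{key_prefix}{isa}:{b2r2_version}:{recipe_version}:{ir_hash}"


ir = b"t1 := load [rsp+8]; t2 := t1 + 1; store [rsp+8], t2"
key_old = function_cache_key("intel-64", "0.9.1", "v1", ir)
key_same = function_cache_key("intel-64", "0.9.1", "v1", ir)
key_new_recipe = function_cache_key("intel-64", "0.9.1", "v2", ir)
```

Here `key_old` and `key_same` are identical, while `key_new_recipe` differs, so a lookup after a recipe bump misses and the fingerprint is recomputed. Stale entries then age out through the TTL.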

Statistics:

  • Hits, Misses, Evictions
  • Hit Rate
  • Enabled status

2.2.5.10 Ops Endpoints

BinaryIndex exposes operational endpoints for health, benchmarking, cache monitoring, and configuration visibility.

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/v1/ops/binaryindex/health | GET | Health status with lifter warmness, cache availability |
| /api/v1/ops/binaryindex/bench/run | POST | Run benchmark, return latency stats |
| /api/v1/ops/binaryindex/cache | GET | Function IR cache hit/miss statistics |
| /api/v1/ops/binaryindex/config | GET | Effective configuration (secrets redacted) |

Health Response:

{
  "status": "healthy",
  "timestamp": "2026-01-14T12:00:00Z",
  "lifterStatus": "warm",
  "lifterWarm": true,
  "lifterPoolStats": { "intel-64": 4, "armv8-64": 2 },
  "cacheStatus": "enabled",
  "cacheEnabled": true
}

Determinism Constraints:

  • All timestamps in ISO-8601 UTC format
  • ASCII-only output
  • Deterministic JSON key ordering
  • Secrets/credentials redacted from config endpoint

2.2.6 Binary Vulnerability Service

Main query interface for consumers.

public interface IBinaryVulnerabilityService
{
    /// <summary>
    /// Look up vulnerabilities by Build-ID or equivalent binary identity.
    /// </summary>
    Task<ImmutableArray<BinaryVulnMatch>> LookupByIdentityAsync(
        BinaryIdentity identity,
        LookupOptions? opts = null,
        CancellationToken ct = default);

    /// <summary>
    /// Look up vulnerabilities by function fingerprint.
    /// </summary>
    Task<ImmutableArray<BinaryVulnMatch>> LookupByFingerprintAsync(
        CodeFingerprint fingerprint,
        decimal minSimilarity = 0.95m,
        CancellationToken ct = default);

    /// <summary>
    /// Batch lookup for scan performance.
    /// </summary>
    Task<ImmutableDictionary<string, ImmutableArray<BinaryVulnMatch>>> LookupBatchAsync(
        IEnumerable<BinaryIdentity> identities,
        LookupOptions? opts = null,
        CancellationToken ct = default);

    /// <summary>
    /// Get distro-specific fix status (patch-aware).
    /// </summary>
    Task<FixRecord?> GetFixStatusAsync(
        string distro,
        string release,
        string sourcePkg,
        string cveId,
        CancellationToken ct = default);
}

public sealed record BinaryVulnMatch(
    string CveId,
    string VulnerablePurl,
    MatchMethod Method,       // buildid_catalog, fingerprint_match, range_match
    decimal Confidence,
    MatchEvidence Evidence);

public enum MatchMethod { BuildIdCatalog, FingerprintMatch, RangeMatch }

3. Data Model

3.1 PostgreSQL Schema (binaries)

The binaries schema stores binary identity, fingerprint, and match data.

CREATE SCHEMA IF NOT EXISTS binaries;
CREATE SCHEMA IF NOT EXISTS binaries_app;

-- RLS helper
CREATE OR REPLACE FUNCTION binaries_app.require_current_tenant()
RETURNS TEXT LANGUAGE plpgsql STABLE SECURITY DEFINER AS $$
DECLARE v_tenant TEXT;
BEGIN
    v_tenant := current_setting('app.tenant_id', true);
    IF v_tenant IS NULL OR v_tenant = '' THEN
        RAISE EXCEPTION 'app.tenant_id session variable not set';
    END IF;
    RETURN v_tenant;
END;
$$;

3.1.1 Core Tables

See docs/db/schemas/binaries_schema_specification.md for complete DDL.

Key Tables:

| Table | Purpose |
|-------|---------|
| binaries.binary_identity | Known binary identities (Build-ID, hashes) |
| binaries.binary_package_map | Binary → package mapping per snapshot |
| binaries.vulnerable_buildids | Build-IDs known to be vulnerable |
| binaries.vulnerable_fingerprints | Function fingerprints for CVEs |
| binaries.cve_fix_index | Patch-aware fix status per distro |
| binaries.fingerprint_matches | Match results (findings evidence) |
| binaries.corpus_snapshots | Corpus ingestion tracking |

3.2 RustFS Layout

rustfs://stellaops/binaryindex/
  fingerprints/<algorithm>/<prefix>/<fingerprint_id>.bin
  corpus/<distro>/<release>/<snapshot_id>/manifest.json
  corpus/<distro>/<release>/<snapshot_id>/packages/<pkg>.metadata.json
  evidence/<match_id>.dsse.json

4. Integration Points

4.1 Scanner.Worker Integration

During container scanning, Scanner.Worker queries BinaryIndex for each extracted binary:

sequenceDiagram
    participant SW as Scanner.Worker
    participant BI as BinaryIndex
    participant PG as PostgreSQL
    participant FL as Findings Ledger

    SW->>SW: Extract binary from layer
    SW->>SW: Compute BinaryIdentity
    SW->>BI: LookupByIdentityAsync(identity)
    BI->>PG: Query binaries.vulnerable_buildids
    PG-->>BI: Matches
    BI->>PG: Query binaries.cve_fix_index (if distro known)
    PG-->>BI: Fix status
    BI-->>SW: BinaryVulnMatch[]
    SW->>FL: RecordFinding(match, evidence)

4.2 Concelier Integration

BinaryIndex subscribes to Concelier's advisory updates:

sequenceDiagram
    participant CO as Concelier
    participant BI as BinaryIndex
    participant PG as PostgreSQL

    CO->>CO: Ingest new advisory
    CO->>BI: advisory.created event
    BI->>BI: Check if affected packages in corpus
    BI->>PG: Update binaries.binary_vuln_assertion
    BI->>BI: Queue fingerprint generation (if high-impact)

4.3 Policy Integration

Binary matches are recorded as proof segments:

{
  "segment_type": "binary_fingerprint_evidence",
  "payload": {
    "binary_identity": {
      "format": "elf",
      "build_id": "abc123...",
      "file_sha256": "def456..."
    },
    "matches": [
      {
        "cve_id": "CVE-2024-1234",
        "method": "buildid_catalog",
        "confidence": 0.98,
        "vulnerable_purl": "pkg:deb/debian/libssl3@1.1.1n-0+deb11u3"
      }
    ]
  }
}

5. MVP Roadmap

MVP 1: Known-Build Binary Catalog (Sprint 6000.0001)

Goal: Query "is this Build-ID vulnerable?" with distro-level precision.

Deliverables:

  • binaries PostgreSQL schema
  • Build-ID to package mapping tables
  • Basic CVE lookup by binary identity
  • Debian/Ubuntu corpus connector

MVP 2: Patch-Aware Backport Handling (Sprint 6000.0002)

Goal: Handle "version says vulnerable but distro backported the fix."

Deliverables:

  • Fix index builder (changelog + patch header parsing)
  • Distro-specific version comparison
  • RPM corpus connector
  • Scanner.Worker integration

MVP 3: Binary Fingerprint Factory (Sprint 6000.0003)

Goal: Detect vulnerable code independent of package metadata.

Deliverables:

  • Fingerprint storage and matching
  • Reference build generation pipeline
  • Fingerprint validation corpus
  • High-impact CVE coverage (OpenSSL, glibc, zlib, curl)

MVP 4: Full Scanner Integration (Sprint 6000.0004)

Goal: Binary evidence in production scans.

Deliverables:

  • Scanner.Worker binary lookup integration
  • Findings Ledger binary match records
  • Proof segment attestations
  • CLI binary match inspection

5b. Fix Evidence Chain

The Fix Evidence Chain provides auditable proof of why a CVE is marked as fixed (or not) for a specific distro/package combination. This is critical for patch-aware backport handling where package versions can be misleading.

5b.1 Evidence Sources

| Source | Confidence | Description |
|--------|------------|-------------|
| Security Feed (OVAL) | 0.95-0.99 | Authoritative feed from distro (Debian Security Tracker, Red Hat OVAL) |
| Patch Header (DEP-3) | 0.87-0.95 | CVE reference in Debian/Ubuntu patch metadata |
| Changelog | 0.75-0.85 | CVE mention in debian/changelog or RPM %changelog |
| Upstream Patch Match | 0.90 | Binary diff matches known upstream fix |

5b.2 Evidence Storage

Evidence is stored in two PostgreSQL tables:

-- Evidence blobs: audit trail
-- (created first because cve_fix_index references it)
CREATE TABLE binaries.fix_evidence (
    id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    evidence_type TEXT NOT NULL, -- changelog, patch_header, security_feed
    source_file TEXT,            -- Path to source file (changelog, patch)
    source_sha256 TEXT,          -- Hash of source file
    excerpt TEXT,                -- Relevant snippet (max 1KB)
    metadata JSONB NOT NULL,     -- Structured metadata
    snapshot_id UUID,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Fix index: one row per (distro, release, source_pkg, cve_id)
CREATE TABLE binaries.cve_fix_index (
    id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    distro TEXT NOT NULL,        -- debian, ubuntu, alpine, rhel
    release TEXT NOT NULL,       -- bookworm, jammy, v3.19
    source_pkg TEXT NOT NULL,
    cve_id TEXT NOT NULL,
    state TEXT NOT NULL,         -- fixed, vulnerable, not_affected, wontfix, unknown
    fixed_version TEXT,
    method TEXT NOT NULL,        -- security_feed, changelog, patch_header, upstream_match
    confidence DECIMAL(3,2) NOT NULL,
    evidence_id UUID REFERENCES binaries.fix_evidence(id),
    snapshot_id UUID,
    indexed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (tenant_id, distro, release, source_pkg, cve_id)
);

5b.3 Evidence Types

ChangelogEvidence:

{
  "evidence_type": "changelog",
  "source_file": "debian/changelog",
  "excerpt": "* Fix CVE-2024-0727: PKCS12 decoding crash",
  "metadata": {
    "version": "3.0.11-1~deb12u2",
    "line_number": 5
  }
}

PatchHeaderEvidence:

{
  "evidence_type": "patch_header",
  "source_file": "debian/patches/CVE-2024-0727.patch",
  "excerpt": "CVE: CVE-2024-0727\nOrigin: upstream, https://github.com/openssl/commit/abc123",
  "metadata": {
    "patch_sha256": "abc123def456..."
  }
}

SecurityFeedEvidence:

{
  "evidence_type": "security_feed",
  "metadata": {
    "feed_id": "debian-security-tracker",
    "entry_id": "DSA-5678-1",
    "published_at": "2024-01-15T10:00:00Z"
  }
}

5b.4 Confidence Resolution

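When multiple evidence sources exist for the same CVE, the system keeps the highest-confidence entry. In PostgreSQL the incoming row is available as EXCLUDED, and the stored row is referenced through the table alias (existing here):

-- INSERT INTO binaries.cve_fix_index AS existing (...) VALUES (...)
ON CONFLICT (tenant_id, distro, release, source_pkg, cve_id)
DO UPDATE SET
    confidence = GREATEST(existing.confidence, EXCLUDED.confidence),
    method = CASE
        WHEN existing.confidence < EXCLUDED.confidence THEN EXCLUDED.method
        ELSE existing.method
    END,
    evidence_id = CASE
        WHEN existing.confidence < EXCLUDED.confidence THEN EXCLUDED.evidence_id
        ELSE existing.evidence_id
    END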
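The same resolution rule can be expressed in memory, which is useful for reasoning about edge cases such as ties. The Python sketch below mirrors the upsert: a strictly higher confidence replaces the stored method and evidence, and anything else leaves the stored row untouched. The `FixEntry` type and `upsert` function are hypothetical names for this illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class FixEntry:
    method: str
    confidence: Decimal
    evidence_id: str


def upsert(index: dict, key: tuple, incoming: FixEntry) -> None:
    """Keep the higher-confidence entry for key; on a tie the existing entry wins."""
    existing = index.get(key)
    if existing is None or existing.confidence < incoming.confidence:
        index[key] = incoming


index = {}
key = ("tenant-a", "debian", "bookworm", "openssl", "CVE-2024-0727")
upsert(index, key, FixEntry("changelog", Decimal("0.80"), "ev-1"))
upsert(index, key, FixEntry("patch_header", Decimal("0.87"), "ev-2"))
upsert(index, key, FixEntry("changelog", Decimal("0.75"), "ev-3"))
```

After these three calls the index holds the `patch_header` entry with confidence 0.87. The later, weaker changelog evidence does not displace it.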

5b.5 Parsers

The following parsers extract CVE fix information:

| Parser | Distros | Input | Confidence |
|--------|---------|-------|------------|
| DebianChangelogParser | Debian, Ubuntu | debian/changelog | 0.80 |
| PatchHeaderParser | Debian, Ubuntu | debian/patches/*.patch (DEP-3) | 0.87 |
| AlpineSecfixesParser | Alpine | APKBUILD secfixes block | 0.95 |
| RpmChangelogParser | RHEL, Fedora, CentOS | RPM spec %changelog | 0.75 |

5b.6 Query Flow

sequenceDiagram
    participant SW as Scanner.Worker
    participant BVS as BinaryVulnerabilityService
    participant FIR as FixIndexRepository
    participant PG as PostgreSQL

    SW->>BVS: GetFixStatusAsync(debian, bookworm, openssl, CVE-2024-0727)
    BVS->>FIR: GetFixStatusAsync(...)
    FIR->>PG: SELECT FROM cve_fix_index WHERE ...
    PG-->>FIR: FixIndexEntry (state=fixed, confidence=0.87)
    FIR-->>BVS: FixStatusResult
    BVS-->>SW: {state: Fixed, confidence: 0.87, method: PatchHeader}

6. Security Considerations

6.1 Trust Boundaries

  1. Corpus Ingestion - Packages are untrusted; extraction runs in sandboxed workers
  2. Fingerprint Generation - Reference builds compiled in isolated environments
  3. Query API - Tenant-isolated via RLS; no cross-tenant data leakage

6.2 Signing & Provenance

  • All corpus snapshots are signed (DSSE)
  • Fingerprint sets are versioned and signed
  • Every match result references evidence digests

6.3 Sandbox Requirements

Binary extraction and fingerprint generation MUST run with:

  • Seccomp profile restricting syscalls
  • Read-only root filesystem
  • No network access during analysis
  • Memory/CPU limits

7. Observability

7.1 Metrics

| Metric | Type | Labels |
|--------|------|--------|
| binaryindex_lookup_total | Counter | method, result |
| binaryindex_lookup_latency_ms | Histogram | method |
| binaryindex_corpus_packages_total | Gauge | distro, release |
| binaryindex_fingerprints_indexed | Gauge | algorithm, component |
| binaryindex_match_confidence | Histogram | method |

7.2 Traces

  • binaryindex.lookup - Full lookup span
  • binaryindex.corpus.ingest - Corpus ingestion
  • binaryindex.fingerprint.generate - Fingerprint generation

7.3 Ops Endpoints

Sprint: SPRINT_20260112_007_BINIDX_binaryindex_user_config

BinaryIndex exposes read-only ops endpoints for health, bench, cache, and effective configuration:

| Endpoint | Method | Response Schema | Description |
|----------|--------|-----------------|-------------|
| /api/v1/ops/binaryindex/health | GET | BinaryIndexOpsHealthResponse | Health status, lifter warmness per ISA, cache availability |
| /api/v1/ops/binaryindex/bench/run | POST | BinaryIndexBenchResponse | Run latency benchmark, return min/max/mean/p50/p95/p99 stats |
| /api/v1/ops/binaryindex/cache | GET | BinaryIndexFunctionCacheStats | Function cache hit/miss/eviction statistics |
| /api/v1/ops/binaryindex/config | GET | BinaryIndexEffectiveConfig | Effective configuration with secrets redacted |

7.3.1 Response Schemas

BinaryIndexOpsHealthResponse:

{
  "status": "healthy",
  "timestamp": "2026-01-16T12:00:00Z",
  "components": {
    "lifterPool": { "status": "healthy", "message": null },
    "functionCache": { "status": "healthy", "message": null },
    "persistence": { "status": "healthy", "message": null }
  },
  "lifterWarmness": {
    "intel-64": { "isa": "intel-64", "warm": true, "poolSize": 4, "acquireTimeMs": 12 },
    "armv8-64": { "isa": "armv8-64", "warm": true, "poolSize": 2, "acquireTimeMs": 8 }
  }
}

BinaryIndexBenchResponse:

{
  "timestamp": "2026-01-16T12:00:00Z",
  "sampleSize": 100,
  "latencySummary": {
    "minMs": 5.2,
    "maxMs": 142.8,
    "meanMs": 28.4,
    "p50Ms": 22.1,
    "p95Ms": 78.3,
    "p99Ms": 121.5
  },
  "operations": [
    { "operation": "lifterAcquire", "samples": 100, "meanMs": 12.4 },
    { "operation": "irNormalization", "samples": 100, "meanMs": 8.7 },
    { "operation": "cacheLookup", "samples": 100, "meanMs": 1.2 }
  ]
}
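The latencySummary fields can be derived from raw samples as sketched below. The percentile method is not stated in this document, so the Python example assumes nearest-rank percentiles; the function names are hypothetical.

```python
def percentile(sorted_samples: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    n = len(sorted_samples)
    rank = max(1, (pct * n + 99) // 100)  # integer ceil(pct * n / 100)
    return sorted_samples[rank - 1]


def latency_summary(samples_ms: list) -> dict:
    """Compute the min/max/mean/p50/p95/p99 fields of a latency summary."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    return {
        "minMs": ordered[0],
        "maxMs": ordered[-1],
        "meanMs": sum(ordered) / len(ordered),
        "p50Ms": percentile(ordered, 50),
        "p95Ms": percentile(ordered, 95),
        "p99Ms": percentile(ordered, 99),
    }


summary = latency_summary([float(i) for i in range(1, 101)])
```

Integer arithmetic is used for the rank so that the result does not depend on floating-point rounding of `pct / 100`.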

BinaryIndexFunctionCacheStats:

{
  "enabled": true,
  "backend": "valkey",
  "hits": 15234,
  "misses": 892,
  "evictions": 45,
  "hitRate": 0.944,
  "keyPrefix": "stellaops:binidx:funccache:",
  "cacheTtlSeconds": 14400,
  "estimatedEntries": 12500,
  "estimatedMemoryBytes": 52428800
}
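
The derived figures in this response can be cross-checked from the raw counters. The sketch below assumes hitRate is hits / (hits + misses), which matches the sample values to within rounding but is not stated explicitly here; the average entry size is a hypothetical convenience metric, not a field of the response.

```python
def hit_rate(stats: dict) -> float:
    """Recompute the hit rate from the raw counters (assumed formula)."""
    total = stats["hits"] + stats["misses"]
    return stats["hits"] / total if total else 0.0


def average_entry_bytes(stats: dict) -> float:
    """Estimated memory divided by estimated entry count (illustrative only)."""
    entries = stats["estimatedEntries"]
    return stats["estimatedMemoryBytes"] / entries if entries else 0.0
```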

BinaryIndexEffectiveConfig:

{
  "b2r2Pool": {
    "maxPoolSizePerIsa": 4,
    "warmPreload": ["intel-64", "armv8-64"],
    "acquireTimeoutMs": 5000,
    "enableMetrics": true
  },
  "semanticLifting": {
    "b2r2Version": "1.5.0",
    "normalizationRecipeVersion": "2024.1",
    "maxInstructionsPerFunction": 10000,
    "maxFunctionsPerBinary": 5000,
    "functionLiftTimeoutMs": 30000,
    "enableDeduplication": true
  },
  "functionCache": {
    "connectionString": "********",
    "keyPrefix": "stellaops:binidx:funccache:",
    "cacheTtlSeconds": 14400,
    "maxTtlSeconds": 86400,
    "earlyExpiryPercent": 0.1,
    "maxEntrySizeBytes": 1048576
  },
  "persistence": {
    "schema": "binaries",
    "minPoolSize": 5,
    "maxPoolSize": 20,
    "commandTimeoutSeconds": 30,
    "retryOnFailure": true,
    "batchSize": 100
  },
  "backendVersions": {
    "b2r2": "1.5.0",
    "valkey": "7.2.0",
    "postgres": "15.4"
  }
}

7.3.2 Rate Limiting

The /bench/run endpoint is rate-limited to prevent load spikes:

  • Default: 5 requests per minute per tenant
  • Configurable via BinaryIndex:Ops:BenchRateLimitPerMinute
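
The limiting algorithm is not specified here. As one way to picture a per-tenant limit of 5 requests per minute, the sketch below uses a sliding-window log; the class name, the tenant key, and the injectable clock are all hypothetical.

```python
import time
from collections import defaultdict, deque


class BenchRateLimiter:
    """Sliding-window limiter: at most `limit_per_minute` calls per tenant in any 60 s."""

    def __init__(self, limit_per_minute: int = 5, clock=time.monotonic):
        self.limit = limit_per_minute
        self.clock = clock  # injectable so the behaviour can be tested deterministically
        self._hits = defaultdict(deque)  # tenant id -> timestamps of accepted calls

    def try_acquire(self, tenant_id: str) -> bool:
        now = self.clock()
        window = self._hits[tenant_id]
        # Drop timestamps that have left the 60-second window.
        while window and now - window[0] >= 60.0:
            window.popleft()
        if len(window) >= self.limit:
            return False
        window.append(now)
        return True
```

A fixed-window counter would be simpler but allows short bursts at window boundaries, which is why the sliding log is used for illustration.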

7.3.3 Secret Redaction

The config endpoint automatically redacts the values of sensitive keys:

| Redacted Key | Rule |
|--------------|------|
| connectionString | Replaced with ******** |
| password | Replaced with ******** |
| secret* | Any key starting with "secret" is replaced with ******** |
| apiKey | Replaced with ******** |
| token | Replaced with ******** |

Redaction is applied recursively to nested objects.
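
A minimal sketch of recursive redaction along these lines follows. It assumes key matching is case-insensitive and that the named keys match exactly rather than as substrings; neither detail is confirmed by this document.

```python
MASK = "********"
# Exact key names from the table, lower-cased for case-insensitive comparison (assumption).
EXACT_KEYS = {"connectionstring", "password", "apikey", "token"}


def is_sensitive(key: str) -> bool:
    lowered = key.lower()
    return lowered in EXACT_KEYS or lowered.startswith("secret")


def redact(value):
    """Return a copy of `value` with the values of sensitive keys masked at every depth."""
    if isinstance(value, dict):
        return {k: MASK if is_sensitive(k) else redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

The function builds new containers instead of mutating its input, so the live configuration object is never altered by producing the redacted view.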


8. Configuration

Sprint: SPRINT_20260112_007_BINIDX_binaryindex_user_config

8.1 Configuration Sections

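All configuration lives under the BinaryIndex section of appsettings.yaml, or in environment variables with the BINARYINDEX__ prefix. The one exception is persistence, which is configured under Postgres:BinaryIndex (see 8.1.4).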
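
To make the environment-variable form concrete, the sketch below maps a colon-separated key to a variable name. It assumes the usual .NET convention in which a double underscore stands for the section separator and names are matched case-insensitively; upper-casing is a stylistic choice, not a requirement.

```python
def env_var_name(config_key: str) -> str:
    """BinaryIndex:B2R2Pool:AcquireTimeoutMs -> BINARYINDEX__B2R2POOL__ACQUIRETIMEOUTMS"""
    return config_key.replace(":", "__").upper()


def env_overrides(environ: dict, prefix: str = "BINARYINDEX__") -> dict:
    """Collect prefixed variables as lower-cased, colon-separated configuration paths."""
    return {
        name.replace("__", ":").lower(): value
        for name, value in environ.items()
        if name.upper().startswith(prefix)
    }
```

For example, `env_overrides(os.environ)` would list every BinaryIndex override present in the current process environment.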

8.1.1 B2R2 Lifter Pool (BinaryIndex:B2R2Pool)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| MaxPoolSizePerIsa | int | 4 | Maximum lifter instances per ISA |
| WarmPreload | string[] | ["intel-64", "armv8-64"] | ISAs to warm on startup |
| AcquireTimeoutMs | int | 5000 | Timeout for lifter acquisition |
| EnableMetrics | bool | true | Emit Prometheus metrics for pool |

BinaryIndex:
  B2R2Pool:
    MaxPoolSizePerIsa: 4
    WarmPreload:
      - intel-64
      - armv8-64
    AcquireTimeoutMs: 5000
    EnableMetrics: true

8.1.2 Semantic Lifting (BinaryIndex:SemanticLifting)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| B2R2Version | string | "1.5.0" | B2R2 disassembler version |
| NormalizationRecipeVersion | string | "2024.1" | IR normalization recipe version |
| MaxInstructionsPerFunction | int | 10000 | Max instructions to lift per function |
| MaxFunctionsPerBinary | int | 5000 | Max functions to process per binary |
| FunctionLiftTimeoutMs | int | 30000 | Timeout for lifting single function |
| EnableDeduplication | bool | true | Deduplicate IR before fingerprinting |

BinaryIndex:
  SemanticLifting:
    MaxInstructionsPerFunction: 10000
    MaxFunctionsPerBinary: 5000
    FunctionLiftTimeoutMs: 30000
    EnableDeduplication: true

8.1.3 Function Cache (BinaryIndex:FunctionCache)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| ConnectionString | string | | Valkey connection string (secret) |
| KeyPrefix | string | "stellaops:binidx:funccache:" | Cache key prefix |
| CacheTtlSeconds | int | 14400 | Default cache TTL (4 hours) |
| MaxTtlSeconds | int | 86400 | Maximum TTL (24 hours) |
| EarlyExpiryPercent | decimal | 0.1 | Early expiry jitter (10%) |
| MaxEntrySizeBytes | int | 1048576 | Max entry size (1 MB) |

BinaryIndex:
  FunctionCache:
    ConnectionString: ${VALKEY_CONNECTION}  # from env
    KeyPrefix: "stellaops:binidx:funccache:"
    CacheTtlSeconds: 14400
    MaxEntrySizeBytes: 1048576
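
How the TTL settings interact is not spelled out above. One plausible interpretation, sketched below, clamps the requested TTL to MaxTtlSeconds and then shortens it by a random fraction of up to EarlyExpiryPercent so that entries written together do not all expire at once. The function and its parameters are hypothetical.

```python
import random


def effective_ttl_seconds(
    requested=None,
    cache_ttl=14400,
    max_ttl=86400,
    early_expiry_percent=0.1,
    rng=random.random,
):
    """Clamp the TTL to max_ttl, then subtract up to early_expiry_percent as jitter."""
    base = min(requested if requested is not None else cache_ttl, max_ttl)
    # rng() returns a float in [0, 1); the jitter is therefore in [0, base * percent).
    jitter = base * early_expiry_percent * rng()
    return max(1, int(base - jitter))
```

With the defaults, a TTL computed this way falls between roughly 12960 and 14400 seconds.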

8.1.4 Persistence (Postgres:BinaryIndex)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| Schema | string | "binaries" | PostgreSQL schema name |
| MinPoolSize | int | 5 | Minimum connection pool size |
| MaxPoolSize | int | 20 | Maximum connection pool size |
| CommandTimeoutSeconds | int | 30 | Command execution timeout |
| RetryOnFailure | bool | true | Retry transient failures |
| BatchSize | int | 100 | Batch insert size |

Postgres:
  BinaryIndex:
    Schema: binaries
    MinPoolSize: 5
    MaxPoolSize: 20
    CommandTimeoutSeconds: 30
    RetryOnFailure: true
    BatchSize: 100

8.1.5 Ops Configuration (BinaryIndex:Ops)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| EnableHealthEndpoint | bool | true | Enable /health endpoint |
| EnableBenchEndpoint | bool | true | Enable /bench/run endpoint |
| BenchRateLimitPerMinute | int | 5 | Rate limit for bench endpoint |
| RedactedKeys | string[] | See 7.3.3 | Keys to redact in config output |

8.2 Legacy Configuration

# binaryindex.yaml (corpus configuration)
binaryindex:
  enabled: true

  corpus:
    connectors:
      - type: debian
        enabled: true
        mirror: http://deb.debian.org/debian
        releases: [bookworm, bullseye]
        architectures: [amd64, arm64]
      - type: ubuntu
        enabled: true
        mirror: http://archive.ubuntu.com/ubuntu
        releases: [jammy, noble]

  fingerprinting:
    enabled: true
    algorithms: [basic_block, cfg]
    target_components:
      - openssl
      - glibc
      - zlib
      - curl
      - sqlite
    min_function_size: 16  # bytes
    max_functions_per_binary: 10000

  lookup:
    cache_ttl: 3600
    batch_size: 100
    timeout_ms: 5000

  storage:
    postgres_schema: binaries
    rustfs_bucket: stellaops/binaryindex

9. Testing Strategy

9.1 Unit Tests

  • Identity extraction (Build-ID, hashes)
  • Fingerprint generation determinism
  • Fix index parsing (changelog, patch headers)

9.2 Integration Tests

  • PostgreSQL schema validation
  • Full corpus ingestion flow
  • Scanner.Worker lookup integration

9.3 Regression Tests

  • Known CVE detection (golden corpus)
  • Backport handling (Debian libssl example)
  • False positive rate validation

10. Golden Corpus for Patch Provenance

Sprint: SPRINT_20260121_034/035/036 - Golden Corpus Implementation

The BinaryIndex module supports a golden corpus of patch-paired artifacts that enables offline SBOM reproducibility and binary-level patch provenance verification.

10.1 Corpus Purpose

The golden corpus provides:

  • Auditor-ready evidence bundles for air-gapped customers
  • Regression testing for binary matching accuracy
  • Proof of patch status independent of package metadata

10.2 Corpus Sources

| Source | Type | Purpose |
|--------|------|---------|
| Debian Security Tracker / DSAs | Advisory | Primary advisory linkage |
| Debian Snapshot | Binary archive | Pre/post patch binary pairs |
| Ubuntu Security Notices | Advisory | Ubuntu-specific advisories |
| Alpine secdb | Advisory | Alpine YAML advisories |
| OSV dump | Unified schema | Cross-reference and commit ranges |

10.2.1 Symbol Source Connectors

Sprint: SPRINT_20260121_035_BinaryIndex_golden_corpus_connectors_cli

The corpus ingestion layer uses pluggable connectors to retrieve symbols and metadata from upstream sources:

| Connector ID | Implementation | Protocol | Data Retrieved |
|--------------|----------------|----------|----------------|
| debuginfod-fedora | DebuginfodConnector | debuginfod HTTP | ELF debug symbols by Build-ID |
| debuginfod-ubuntu | DebuginfodConnector | debuginfod HTTP | ELF debug symbols by Build-ID |
| ddeb-ubuntu | DdebConnector | APT/HTTP | .ddeb debug packages |
| buildinfo-debian | BuildinfoConnector | HTTP | .buildinfo reproducibility records |
| secdb-alpine | AlpineSecDbConnector | Git/HTTP | secfixes YAML from APKBUILD |

Connector Interface:

public interface ISymbolSourceConnector
{
    string ConnectorId { get; }
    string DisplayName { get; }
    string[] SupportedDistros { get; }

    Task<ConnectorStatus> GetStatusAsync(CancellationToken ct);
    Task SyncAsync(SyncOptions options, CancellationToken ct);
    Task<SymbolLookupResult?> LookupByBuildIdAsync(string buildId, CancellationToken ct);
    Task<IAsyncEnumerable<SymbolRecord>> SearchAsync(SymbolSearchQuery query, CancellationToken ct);
}

Debuginfod Connector:

The DebuginfodConnector implements the debuginfod protocol for retrieving debug symbols:

  • Endpoint: GET /buildid/<build-id>/debuginfo
  • Supports federated queries across multiple debuginfod servers
  • Caches retrieved symbols in RustFS blob storage
  • Rate-limited to respect upstream server policies
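
The lookup path and the federated fallback can be sketched as follows. The URL shape comes from the endpoint listed above; the `first_hit` helper and its injected `fetch` callable are hypothetical, and caching and rate limiting are left out.

```python
import re

_HEX = re.compile(r"[0-9a-f]+")


def debuginfo_url(server: str, build_id: str) -> str:
    """Build the debuginfod URL for the debuginfo of one Build-ID."""
    normalized = build_id.strip().lower()
    if not _HEX.fullmatch(normalized):
        raise ValueError(f"Build-ID must be hexadecimal: {build_id!r}")
    return f"{server.rstrip('/')}/buildid/{normalized}/debuginfo"


def first_hit(servers, build_id, fetch):
    """Query servers in order; return (server, data) for the first hit, else None.

    `fetch(url)` is supplied by the caller and returns bytes, or None when the
    server does not have the requested Build-ID.
    """
    for server in servers:
        data = fetch(debuginfo_url(server, build_id))
        if data is not None:
            return server, data
    return None
```

Injecting `fetch` keeps the ordering logic independent of the HTTP client, so it can be exercised without network access.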

Ubuntu ddeb Connector:

The DdebConnector retrieves Ubuntu debug symbol packages (.ddeb):

  • Sources: ddebs.ubuntu.com mirror
  • Indexes: Reads Packages.xz for package metadata
  • Extraction: Unpacks .ddeb AR archives to extract DWARF symbols
  • Mapping: Links debug symbols to binary packages via Build-ID

Debian Buildinfo Connector:

The BuildinfoConnector retrieves Debian buildinfo files for reproducibility verification:

  • Source: buildinfos.debian.net and snapshot archives
  • Purpose: Provides build environment metadata for reproducible builds
  • Fields extracted: Build-Date, Build-Architecture, Checksums-Sha256
  • Integration: Cross-references with binary packages for provenance

Alpine SecDB Connector:

The AlpineSecDbConnector parses Alpine's security database:

  • Source: secfixes blocks in APKBUILD files
  • Repository: alpine/aports Git repository
  • Format: YAML blocks mapping CVEs to fixed versions
  • Example:
    secfixes:
      3.0.11-r0:
        - CVE-2024-0727
        - CVE-2024-0728
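
Once such a block has been loaded into a mapping (for instance with a YAML parser of your choice), it is convenient to invert it so that a CVE resolves to the versions that fix it. The helper below is an illustrative sketch, not the AlpineSecDbConnector's actual logic.

```python
def invert_secfixes(secfixes: dict) -> dict:
    """Turn {fixed_version: [cve, ...]} into {cve: [fixed_version, ...]}."""
    fixed_in = {}
    for version, cves in secfixes.items():
        for cve in cves or []:  # tolerate a version key with an empty list
            fixed_in.setdefault(cve, []).append(version)
    return fixed_in
```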
    

OSV Dump Parser:

The OsvDumpParser processes Google OSV database dumps for advisory cross-correlation:

  • Source: osv.dev bulk exports (JSON)
  • Purpose: CVE → commit range extraction for patch identification
  • Cross-reference: Correlates OSV entries with distribution advisories
  • Inconsistency detection: Identifies discrepancies between OSV and distro advisories

public interface IOsvDumpParser
{
    IAsyncEnumerable<OsvParsedEntry> ParseDumpAsync(Stream osvDumpStream, CancellationToken ct);
    OsvCveIndex BuildCveIndex(IEnumerable<OsvParsedEntry> entries);
    IEnumerable<AdvisoryCorrelation> CrossReferenceWithExternal(
        OsvCveIndex osvIndex,
        IEnumerable<ExternalAdvisory> externalAdvisories);
    IEnumerable<AdvisoryInconsistency> DetectInconsistencies(
        IEnumerable<AdvisoryCorrelation> correlations);
}
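
For orientation, the commit-range extraction can be pictured on a single OSV record. The sketch below relies on the public OSV schema (affected[].ranges[] entries of type GIT with introduced/fixed events, plus aliases); it is an illustration in Python and does not reflect how OsvDumpParser is implemented.

```python
def git_fix_ranges(entry: dict) -> list:
    """Extract (repo, introduced, fixed) triples from the GIT ranges of one OSV entry."""
    ranges = []
    for affected in entry.get("affected", []):
        for rng in affected.get("ranges", []):
            if rng.get("type") != "GIT":
                continue  # SEMVER and ECOSYSTEM ranges carry versions, not commits
            introduced = None
            for event in rng.get("events", []):
                if "introduced" in event:
                    introduced = event["introduced"]
                elif "fixed" in event:
                    ranges.append(
                        {"repo": rng.get("repo"), "introduced": introduced, "fixed": event["fixed"]}
                    )
                    introduced = None
    return ranges


def cve_ids(entry: dict) -> list:
    """Collect CVE identifiers from the entry id and its aliases."""
    candidates = [entry.get("id", "")] + list(entry.get("aliases", []))
    return sorted({c for c in candidates if c.startswith("CVE-")})
```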

CLI Access:

All connectors are manageable via the stella groundtruth sources CLI commands:

# List all connectors
stella groundtruth sources list

# Sync specific connector
stella groundtruth sources sync --source buildinfo-debian --full

# Enable/disable connectors
stella groundtruth sources enable ddeb-ubuntu
stella groundtruth sources disable debuginfod-fedora

See the Ground-Truth CLI Guide for complete CLI documentation.

10.3 Key Performance Indicators

| KPI | Target | Description |
|-----|--------|-------------|
| Per-function match rate | >= 90% | Functions matched in post-patch binary |
| False-negative patch detection | <= 5% | Patched functions incorrectly classified |
| SBOM canonical-hash stability | 3/3 | Determinism across independent runs |
| Binary reconstruction equivalence | Trend | Rebuilt binary matches original |
| End-to-end verify time (p95, cold) | Trend | Offline verification performance |
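
The three KPIs with numeric targets could be evaluated along the following lines. The exact definitions of the numerators and denominators are assumptions made for illustration (matched over total functions, misclassified over total patched functions, three identical SBOM hashes); the trend-only KPIs are omitted.

```python
def evaluate_kpis(matched_functions, total_functions,
                  misclassified_patched, total_patched, sbom_hashes):
    """Compare measured values against the numeric targets in the KPI table."""
    match_rate = matched_functions / total_functions if total_functions else 0.0
    false_negative_rate = misclassified_patched / total_patched if total_patched else 0.0
    return {
        "matchRate": match_rate,
        "matchRateOk": match_rate >= 0.90,
        "falseNegativeRate": false_negative_rate,
        "falseNegativeOk": false_negative_rate <= 0.05,
        # 3/3 stability: three independent runs, all producing the same hash.
        "sbomStable": len(sbom_hashes) == 3 and len(set(sbom_hashes)) == 1,
    }
```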

10.4 Validation Harness

The validation harness (IValidationHarness) orchestrates end-to-end verification:

Binary Pair (pre/post) → Symbol Recovery → IR Lifting → Fingerprinting → Matching → Metrics

10.5 Evidence Bundle Format

Evidence bundles follow OCI/ORAS conventions:

<pkg>-<advisory>-bundle.oci.tar
├── manifest.json           # OCI manifest
└── blobs/
    ├── sha256:<sbom>       # Canonical SBOM
    ├── sha256:<pre-bin>    # Pre-fix binary
    ├── sha256:<post-bin>   # Post-fix binary
    ├── sha256:<delta-sig>  # DSSE delta-sig predicate
    └── sha256:<timestamp>  # RFC 3161 timestamp

10.6 Two-Tier Bundle Design and Large Blob References

Sprint: SPRINT_20260122_040_Platform_oci_delta_attestation_pipeline (040-04)

Evidence bundles support two export modes to balance transfer speed with auditability:

| Mode | Export Flag | Contents | Use Case |
|------|-------------|----------|----------|
| Light | (default) | Manifest + attestation envelopes + metadata | Quick transfer, metadata-only audit |
| Full | --full | Light + embedded binary blobs in blobs/ | Air-gap replay, full provenance verification |

10.6.1 largeBlobs[] Field

The DeltaSigPredicate includes a largeBlobs array referencing binary artifacts that may be too large to embed in attestation payloads:

{
  "schemaVersion": "1.0.0",
  "subject": [...],
  "delta": [...],
  "largeBlobs": [
    {
      "kind": "binary-patch",
      "digest": "sha256:a1b2c3...",
      "mediaType": "application/octet-stream",
      "sizeBytes": 1048576
    },
    {
      "kind": "sbom-fragment",
      "digest": "sha256:d4e5f6...",
      "mediaType": "application/spdx+json",
      "sizeBytes": 32768
    }
  ],
  "sbomDigest": "sha256:789abc..."
}

Field Definitions:

| Field | Type | Description |
|-------|------|-------------|
| largeBlobs[].kind | string | Blob category: binary-patch, sbom-fragment, debug-symbols, etc. |
| largeBlobs[].digest | string | Content-addressable digest (sha256:<hex>, sha384:<hex>, sha512:<hex>) |
| largeBlobs[].mediaType | string | IANA media type of the blob |
| largeBlobs[].sizeBytes | long | Blob size in bytes |
| sbomDigest | string | Digest of the canonical SBOM associated with this delta |

10.6.2 Blob Fetch Strategy

During stella bundle verify --replay, blobs are resolved in priority order:

  1. Embedded (full bundles): Read from blobs/<digest-with-dash> in bundle directory
  2. Local source (--blob-source /path/): Read from specified local directory
  3. Registry (--blob-source https://...): HTTP GET from OCI registry (blocked in --offline mode)
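
The priority order above can be sketched as a small resolver. Several details are assumptions: that <digest-with-dash> means the digest with its colon replaced by a dash, that a local blob source uses the same file naming, and that a blocked registry fetch surfaces as an error. The function name and return shape are hypothetical.

```python
from pathlib import Path


def resolve_blob(digest: str, bundle_dir: str, blob_source=None, offline: bool = False):
    """Return (origin, location) for a blob, following the documented priority order."""
    file_name = digest.replace(":", "-")  # assumed meaning of <digest-with-dash>

    # 1. Embedded blob inside a full bundle.
    embedded = Path(bundle_dir) / "blobs" / file_name
    if embedded.is_file():
        return "embedded", str(embedded)

    if blob_source:
        if blob_source.startswith(("http://", "https://")):
            # 3. Registry fetch, which is not allowed in offline mode.
            if offline:
                raise PermissionError("registry blob sources are blocked in offline mode")
            return "registry", blob_source
        # 2. Local directory given as the blob source.
        local = Path(blob_source) / file_name
        if local.is_file():
            return "local", str(local)

    raise FileNotFoundError(f"blob {digest} could not be resolved")
```

The resolver only decides where a blob comes from; downloading and digest verification are separate steps.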

10.6.3 Digest Verification

Fetched blobs are verified against their declared digest using the algorithm prefix:

sha256:<hex> → SHA-256
sha384:<hex> → SHA-384
sha512:<hex> → SHA-512

A mismatch fails the blob replay verification step.
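
A minimal sketch of prefix-driven verification using Python's hashlib is shown below. The function name is hypothetical, and treating an unknown prefix as an error rather than a mismatch is an assumption.

```python
import hashlib
import hmac

# Algorithm prefixes listed above, mapped to their hashlib constructors.
_ALGORITHMS = {
    "sha256": hashlib.sha256,
    "sha384": hashlib.sha384,
    "sha512": hashlib.sha512,
}


def verify_blob(data: bytes, declared_digest: str) -> bool:
    """Return True when `data` hashes to the declared `<algorithm>:<hex>` digest."""
    algorithm, separator, expected_hex = declared_digest.partition(":")
    if not separator or algorithm not in _ALGORITHMS:
        raise ValueError(f"unsupported digest format: {declared_digest!r}")
    actual_hex = _ALGORITHMS[algorithm](data).hexdigest()
    # Constant-time comparison; hexdigest() is lower-case, so normalise the input.
    return hmac.compare_digest(actual_hex, expected_hex.lower())
```

For large blobs, the same check can be done incrementally by feeding chunks to the hash object's update() method instead of hashing one in-memory buffer.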


11. References

  • Advisory: docs/product/advisories/21-Dec-2025 - Mapping Evidence Within Compiled Binaries.md
  • Scanner Native Analysis: src/Scanner/StellaOps.Scanner.Analyzers.Native/
  • Existing Fingerprinting: src/Scanner/__Libraries/StellaOps.Scanner.EntryTrace/Binary/
  • Build-ID Index: src/Scanner/StellaOps.Scanner.Analyzers.Native/Index/
  • Semantic Diffing Sprint: docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md
  • Semantic Library: src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/
  • Semantic Tests: src/BinaryIndex/__Tests/StellaOps.BinaryIndex.Semantic.Tests/
  • Golden Corpus Sprints: docs/implplan/SPRINT_20260121_034_BinaryIndex_golden_corpus_foundation.md

Document Version: 1.2.0 Last Updated: 2026-01-21