# BinaryIndex Module Architecture

> **Ownership:** Scanner Guild + Concelier Guild
> **Status:** DRAFT
> **Version:** 1.0.0
> **Related:** [High-Level Architecture](../../ARCHITECTURE_OVERVIEW.md), [Scanner Architecture](../scanner/architecture.md), [Concelier Architecture](../concelier/architecture.md)

---

## 1. Overview

The **BinaryIndex** module provides a vulnerable binaries database that enables detection of vulnerable code at the binary level, independent of package metadata. This addresses a critical gap in vulnerability scanning: package version strings can lie (backports, custom builds, stripped metadata), but **binary identity doesn't lie**.

### 1.1 Problem Statement

Traditional vulnerability scanners rely on package version matching, which fails in several scenarios:

1. **Backported patches** - Distros backport security fixes without changing upstream version
2. **Custom/vendored builds** - Binaries compiled from source without package metadata
3. **Stripped binaries** - Debug info and version strings removed
4. **Static linking** - Vulnerable library code embedded in final binary
5. **Container base images** - Distroless or scratch images with no package DB

### 1.2 Solution: Binary-First Vulnerability Detection

BinaryIndex provides three tiers of binary identification:

| Tier | Method | Precision | Coverage |
|------|--------|-----------|----------|
| A | Package/version range matching | Medium | High |
| B | Build-ID/hash catalog (exact binary identity) | High | Medium |
| C | Function fingerprints (CFG/basic-block hashes) | Very High | Targeted |

### 1.3 Module Scope

**In Scope:**
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
- Binary-to-advisory mapping database
- Fingerprint storage and matching engine
- Fix index for patch-aware backport handling
- Integration with Scanner.Worker for binary lookup

**Out of Scope:**
- Binary disassembly/analysis (provided by Scanner.Analyzers.Native)
- Runtime binary tracing (provided by Zastava)
- SBOM generation (provided by Scanner)

---

## 2. Architecture

### 2.1 System Context

```
┌──────────────────────────────────────────────────────────────────────────┐
│                         External Systems                                   │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐           │
│  │ Distro Repos     │  │ Debug Symbol    │  │ Upstream Source │           │
│  │ (Debian, RPM,    │  │ Servers         │  │ (GitHub, etc.)  │           │
│  │  Alpine)         │  │ (debuginfod)    │  │                 │           │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘           │
└───────────│─────────────────────│─────────────────────│──────────────────┘
            │                     │                     │
            v                     v                     v
┌──────────────────────────────────────────────────────────────────────────┐
│                      BinaryIndex Module                                    │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                    Corpus Ingestion Layer                            │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │ │
│  │  │ DebianCorpus │  │ RpmCorpus    │  │ AlpineCorpus │               │ │
│  │  │ Connector    │  │ Connector    │  │ Connector    │               │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘               │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
│                                │                                          │
│                                v                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                    Processing Layer                                  │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │ │
│  │  │ BinaryFeature│  │ FixIndex     │  │ Fingerprint  │               │ │
│  │  │ Extractor    │  │ Builder      │  │ Generator    │               │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘               │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
│                                │                                          │
│                                v                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                    Storage Layer                                     │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │ │
│  │  │ PostgreSQL   │  │ RustFS       │  │ Valkey       │               │ │
│  │  │ (binaries    │  │ (fingerprint │  │ (lookup      │               │ │
│  │  │  schema)     │  │  blobs)      │  │  cache)      │               │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘               │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
│                                │                                          │
│                                v                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                    Query Layer                                       │ │
│  │  ┌──────────────────────────────────────────────────────────────┐   │ │
│  │  │              IBinaryVulnerabilityService                      │   │ │
│  │  │  - LookupByIdentityAsync(identity)                           │   │ │
│  │  │  - LookupByFingerprintAsync(fingerprint)                     │   │ │
│  │  │  - LookupBatchAsync(identities)                              │   │ │
│  │  │  - GetFixStatusAsync(distro, release, sourcePkg, cve)        │   │ │
│  │  └──────────────────────────────────────────────────────────────┘   │ │
│  └─────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
                                │
                                v
┌──────────────────────────────────────────────────────────────────────────┐
│                      Consuming Modules                                     │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐           │
│  │ Scanner.Worker   │  │ Policy Engine   │  │ Findings Ledger │           │
│  │ (binary lookup   │  │ (evidence in    │  │ (match records) │           │
│  │  during scan)    │  │  proof chain)   │  │                 │           │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘           │
└──────────────────────────────────────────────────────────────────────────┘
```

### 2.2 Component Breakdown

#### 2.2.1 Corpus Connectors

Plugin-based connectors that ingest binaries from distribution repositories.

```csharp
public interface IBinaryCorpusConnector
{
    string ConnectorId { get; }
    string[] SupportedDistros { get; }

    Task<CorpusSnapshot> FetchSnapshotAsync(CorpusQuery query, CancellationToken ct);
    IAsyncEnumerable<ExtractedBinary> ExtractBinariesAsync(PackageReference pkg, CancellationToken ct);
}
```

**Implementations:**
- `DebianBinaryCorpusConnector` - Debian/Ubuntu packages + debuginfo
- `RpmBinaryCorpusConnector` - RHEL/Fedora/CentOS + SRPM
- `AlpineBinaryCorpusConnector` - Alpine APK + APKBUILD

#### 2.2.2 Binary Feature Extractor

Extracts identity and features from binaries. Reuses existing Scanner.Analyzers.Native capabilities.

```csharp
public interface IBinaryFeatureExtractor
{
    Task<BinaryIdentity> ExtractIdentityAsync(Stream binaryStream, CancellationToken ct);
    Task<BinaryFeatures> ExtractFeaturesAsync(Stream binaryStream, ExtractorOptions opts, CancellationToken ct);
}

public sealed record BinaryIdentity(
    string Format,           // elf, pe, macho
    string? BuildId,         // ELF GNU Build-ID
    string? PeCodeViewGuid,  // PE CodeView GUID + Age
    string? MachoUuid,       // Mach-O LC_UUID
    string FileSha256,
    string TextSectionSha256);

public sealed record BinaryFeatures(
    BinaryIdentity Identity,
    string[] DynamicDeps,    // DT_NEEDED
    string[] ExportedSymbols,
    string[] ImportedSymbols,
    BinaryHardening Hardening);
```
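
As an illustration of how the identity fields might be used together, the sketch below hashes a stream to produce a `FileSha256` value and then chooses a lookup key, preferring the format-native identifier. `BinaryIdentityKeys`, `SelectLookupKey`, the key prefixes, and the fallback order are assumptions made for this example; they are not part of the module's API.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, for illustration only.
public static class BinaryIdentityKeys
{
    // Lowercase hex SHA-256 of the whole stream, in the form FileSha256 would hold.
    public static string ComputeFileSha256(Stream binaryStream)
    {
        using var sha = SHA256.Create();
        byte[] digest = sha.ComputeHash(binaryStream);
        return Convert.ToHexString(digest).ToLowerInvariant();
    }

    // Assumed preference order: the format-native identifier first,
    // falling back to the file hash when the binary carries none.
    public static string SelectLookupKey(
        string format,
        string? buildId,
        string? peCodeViewGuid,
        string? machoUuid,
        string fileSha256)
    {
        string? native = format switch
        {
            "elf" => buildId,
            "pe" => peCodeViewGuid,
            "macho" => machoUuid,
            _ => null
        };

        return string.IsNullOrEmpty(native)
            ? $"sha256:{fileSha256}"
            : $"{format}:{native}";
    }
}
```

A stripped ELF without a GNU Build-ID note would fall through to the `sha256:` key in this sketch, which is why `FileSha256` is non-nullable in `BinaryIdentity` while the format-specific identifiers are optional.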

#### 2.2.3 Fix Index Builder

Builds the patch-aware CVE fix index from distro sources.

```csharp
public interface IFixIndexBuilder
{
    Task BuildIndexAsync(DistroRelease distro, CancellationToken ct);
    Task<FixRecord?> GetFixRecordAsync(string distro, string release, string sourcePkg, string cveId, CancellationToken ct);
}

public sealed record FixRecord(
    string Distro,
    string Release,
    string SourcePkg,
    string CveId,
    FixState State,           // fixed, vulnerable, not_affected, wontfix, unknown
    string? FixedVersion,     // Distro version string
    FixMethod Method,         // security_feed, changelog, patch_header, upstream_match
    decimal Confidence,       // 0.00-1.00
    FixEvidence Evidence);

public enum FixState { Fixed, Vulnerable, NotAffected, Wontfix, Unknown }
public enum FixMethod { SecurityFeed, Changelog, PatchHeader, UpstreamPatchMatch }
```

#### 2.2.4 Fingerprint Generator

Generates function-level fingerprints for vulnerable code detection.

```csharp
public interface IVulnFingerprintGenerator
{
    Task<ImmutableArray<VulnFingerprint>> GenerateAsync(
        string cveId,
        BinaryPair vulnAndFixed,  // Reference builds
        FingerprintOptions opts,
        CancellationToken ct);
}

public sealed record VulnFingerprint(
    string CveId,
    string Component,         // e.g., openssl
    string Architecture,      // x86-64, aarch64
    FingerprintType Type,     // basic_block, cfg, combined; see FingerprintType for the full set
    string FingerprintId,     // e.g., "bb-abc123..."
    byte[] FingerprintHash,   // 16-32 bytes
    string? FunctionHint,     // Function name if known
    decimal Confidence,
    FingerprintEvidence Evidence);

public enum FingerprintType { BasicBlock, ControlFlowGraph, StringReferences, Combined }
```

#### 2.2.5 Semantic Analysis Library

> **Library:** `StellaOps.BinaryIndex.Semantic`
> **Sprint:** 20260105_001_001_BINDEX - Semantic Diffing Phase 1

The Semantic Analysis Library extends fingerprint generation with IR-level semantic matching, enabling detection of semantically equivalent code despite compiler optimizations, instruction reordering, and register allocation differences.

**Key Insight:** Traditional instruction-level fingerprinting loses roughly 15-20% accuracy on optimized binaries. Semantic analysis instead lifts code to B2R2's Intermediate Representation (LowUIR), extracts key-semantics graphs, and compares them by graph hashing.

##### 2.2.5.1 Architecture

```
Binary Input
    │
    v
B2R2 Disassembly → Raw Instructions
    │
    v
IR Lifting Service → LowUIR Statements
    │
    v
Semantic Graph Extractor → Key-Semantics Graph (KSG)
    │
    v
Graph Fingerprinting → Semantic Fingerprint
    │
    v
Semantic Matcher → Similarity Score + Deltas
```

##### 2.2.5.2 Core Components

**IR Lifting Service** (`IIrLiftingService`)

Lifts disassembled instructions to B2R2 LowUIR:

```csharp
public interface IIrLiftingService
{
    Task<LiftedFunction> LiftToIrAsync(
        IReadOnlyList<DisassembledInstruction> instructions,
        string functionName,
        LiftOptions? options = null,
        CancellationToken ct = default);
}

public sealed record LiftedFunction(
    string Name,
    ImmutableArray<IrStatement> Statements,
    ImmutableArray<IrBasicBlock> BasicBlocks);
```

**Semantic Graph Extractor** (`ISemanticGraphExtractor`)

Extracts key-semantics graphs capturing data dependencies, control flow, and memory operations:

```csharp
public interface ISemanticGraphExtractor
{
    Task<KeySemanticsGraph> ExtractGraphAsync(
        LiftedFunction function,
        GraphExtractionOptions? options = null,
        CancellationToken ct = default);
}

public sealed record KeySemanticsGraph(
    string FunctionName,
    ImmutableArray<SemanticNode> Nodes,
    ImmutableArray<SemanticEdge> Edges,
    GraphProperties Properties);

public enum SemanticNodeType { Compute, Load, Store, Branch, Call, Return, Phi }
public enum SemanticEdgeType { DataDependency, ControlDependency, MemoryDependency }
```

**Semantic Fingerprint Generator** (`ISemanticFingerprintGenerator`)

Generates semantic fingerprints using Weisfeiler-Lehman graph hashing:

```csharp
public interface ISemanticFingerprintGenerator
{
    Task<SemanticFingerprint> GenerateAsync(
        KeySemanticsGraph graph,
        SemanticFingerprintOptions? options = null,
        CancellationToken ct = default);
}

public sealed record SemanticFingerprint(
    string FunctionName,
    string GraphHashHex,      // WL graph hash (SHA-256)
    string OperationHashHex,  // Normalized operation sequence hash
    string DataFlowHashHex,   // Data dependency pattern hash
    int NodeCount,
    int EdgeCount,
    int CyclomaticComplexity,
    ImmutableArray<string> ApiCalls,
    SemanticFingerprintAlgorithm Algorithm);
```

**Semantic Matcher** (`ISemanticMatcher`)

Computes semantic similarity with weighted components:

```csharp
public interface ISemanticMatcher
{
    Task<SemanticMatchResult> MatchAsync(
        SemanticFingerprint a,
        SemanticFingerprint b,
        MatchOptions? options = null,
        CancellationToken ct = default);

    Task<SemanticMatchResult> MatchWithDeltasAsync(
        SemanticFingerprint a,
        SemanticFingerprint b,
        MatchOptions? options = null,
        CancellationToken ct = default);
}

public sealed record SemanticMatchResult(
    decimal Similarity,       // 0.00-1.00
    decimal GraphSimilarity,
    decimal OperationSimilarity,
    decimal DataFlowSimilarity,
    decimal ApiCallSimilarity,
    MatchConfidence Confidence);
```

##### 2.2.5.3 Algorithm Details

**Weisfeiler-Lehman Graph Hashing:**
- 3 iterations of label propagation
- SHA-256 for final hash computation
- Deterministic node ordering via canonical sort
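
The three points above can be sketched as follows. This is a minimal illustration of Weisfeiler-Lehman label propagation, not the library's implementation: `WlGraphHash`, the tuple-based edge representation, the use of outgoing edges only, and the separator characters are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of WL graph hashing, for illustration only.
public static class WlGraphHash
{
    private static string Sha256Hex(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text))).ToLowerInvariant();

    // nodeLabels[i] is the initial label of node i (e.g. "Load", "Store").
    public static string Compute(
        IReadOnlyList<string> nodeLabels,
        IReadOnlyList<(int From, int To, string Kind)> edges,
        int iterations = 3)
    {
        string[] labels = nodeLabels.ToArray();

        for (int round = 0; round < iterations; round++)
        {
            var next = new string[labels.Length];
            for (int node = 0; node < labels.Length; node++)
            {
                // Labels of outgoing neighbours, sorted so edge order does not matter.
                var neighbours = edges
                    .Where(e => e.From == node)
                    .Select(e => e.Kind + ":" + labels[e.To])
                    .OrderBy(s => s, StringComparer.Ordinal);

                next[node] = Sha256Hex(labels[node] + "|" + string.Join(",", neighbours));
            }
            labels = next;
        }

        // Canonical sort of the final labels so node numbering does not matter.
        var canonical = labels.OrderBy(s => s, StringComparer.Ordinal);
        return Sha256Hex(string.Join(";", canonical));
    }
}
```

Because each round depends only on a node's own label and the sorted labels of its neighbours, two graphs that differ only in node numbering or edge order produce the same hash in this sketch.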

**Similarity Weights (Default):**
| Component | Weight |
|-----------|--------|
| Graph Hash | 0.35 |
| Operation Hash | 0.25 |
| Data Flow Hash | 0.25 |
| API Calls | 0.15 |
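
A minimal sketch of how the default weights could combine per-component scores into the overall `Similarity` value. The document does not specify how each component score is derived, so the Jaccard measure for API calls, the rounding to two decimals, and the names `SemanticSimilarity`, `Jaccard`, and `Combine` are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class SemanticSimilarity
{
    // Default weights from the table above; they sum to 1.00.
    public const decimal GraphWeight = 0.35m;
    public const decimal OperationWeight = 0.25m;
    public const decimal DataFlowWeight = 0.25m;
    public const decimal ApiCallWeight = 0.15m;

    // Assumed API-call measure: |A ∩ B| / |A ∪ B|; two empty sets count as identical.
    public static decimal Jaccard(IEnumerable<string> a, IEnumerable<string> b)
    {
        var setA = new HashSet<string>(a, StringComparer.Ordinal);
        var setB = new HashSet<string>(b, StringComparer.Ordinal);
        if (setA.Count == 0 && setB.Count == 0) return 1.00m;

        int intersection = setA.Count(x => setB.Contains(x));
        int union = setA.Count + setB.Count - intersection;
        return (decimal)intersection / union;
    }

    // Weighted sum of component scores, each expected in the range 0.00-1.00.
    public static decimal Combine(decimal graph, decimal operation, decimal dataFlow, decimal apiCalls)
    {
        decimal score = GraphWeight * graph
                      + OperationWeight * operation
                      + DataFlowWeight * dataFlow
                      + ApiCallWeight * apiCalls;
        return Math.Round(score, 2, MidpointRounding.AwayFromZero);
    }
}
```

With these weights a graph-hash match alone contributes 0.35, so a pair of functions cannot reach a high overall score unless the operation and data-flow components agree as well.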

##### 2.2.5.4 Integration Points

The semantic library integrates with existing BinaryIndex components:

**DeltaSignatureGenerator Extension:**
```csharp
// Optional semantic services via constructor injection
services.AddDeltaSignaturesWithSemantic();

// Extended SymbolSignature with semantic properties
public sealed record SymbolSignature
{
    // ... existing properties ...
    public string? SemanticHashHex { get; init; }
    public ImmutableArray<string> SemanticApiCalls { get; init; }
}
```

**PatchDiffEngine Extension:**
```csharp
// SemanticWeight in HashWeights
public decimal SemanticWeight { get; init; } = 0.2m;

// FunctionFingerprint extended with semantic fingerprint
public SemanticFingerprint? SemanticFingerprint { get; init; }
```

##### 2.2.5.5 Test Coverage

| Category | Tests | Coverage |
|----------|-------|----------|
| Unit Tests (IR lifting, graph extraction, hashing) | 53 | Core algorithms |
| Integration Tests (full pipeline) | 9 | End-to-end flow |
| Golden Corpus (compiler variations) | 11 | Register allocation, optimization, compiler variants |
| Benchmarks (accuracy, performance) | 7 | Baseline metrics |

##### 2.2.5.6 Current Baselines

> **Note:** Baselines reflect foundational implementation; accuracy improves as semantic features mature.

| Metric | Baseline | Target |
|--------|----------|--------|
| Similarity (register allocation variants) | ≥0.55 | ≥0.85 |
| Overall accuracy | ≥40% | ≥70% |
| False positive rate | <10% | <5% |
| P95 fingerprint latency | <100ms | <50ms |

##### 2.2.5.7 B2R2 LowUIR Adapter

The `B2R2LowUirLiftingService` implements `IIrLiftingService` using B2R2's native lifting capabilities. This provides cross-platform IR representation for semantic analysis.

**Key Components:**

```csharp
public sealed class B2R2LowUirLiftingService : IIrLiftingService
{
    // Lifts to B2R2 LowUIR and maps to Stella IR model
    public Task<LiftedFunction> LiftToIrAsync(
        IReadOnlyList<DisassembledInstruction> instructions,
        string functionName,
        LiftOptions? options = null,
        CancellationToken ct = default);
}
```

**Supported ISAs:**
- Intel (x86-32, x86-64)
- ARM (ARMv7, ARMv8/ARM64)
- MIPS (32/64)
- RISC-V (64)
- PowerPC, SPARC, SH4, AVR, EVM

**IR Statement Mapping:**
| B2R2 LowUIR | Stella IR Kind |
|-------------|----------------|
| Put | IrStatementKind.Store |
| Store | IrStatementKind.Store |
| Get | IrStatementKind.Load |
| Load | IrStatementKind.Load |
| BinOp | IrStatementKind.BinaryOp |
| UnOp | IrStatementKind.UnaryOp |
| Jmp | IrStatementKind.Jump |
| CJmp | IrStatementKind.ConditionalJump |
| InterJmp | IrStatementKind.IndirectJump |
| Call | IrStatementKind.Call |
| SideEffect | IrStatementKind.SideEffect |

**Determinism Guarantees:**
- Statements ordered by block address (ascending)
- Blocks sorted by entry address (ascending)
- Consistent IR IDs across identical inputs
- InvariantCulture used for all string formatting

##### 2.2.5.8 B2R2 Lifter Pool

The `B2R2LifterPool` provides bounded pooling and warm preload for B2R2 lifting units to reduce per-call allocation overhead.

**Configuration (`B2R2LifterPoolOptions`):**
| Option | Default | Description |
|--------|---------|-------------|
| `MaxPoolSizePerIsa` | 4 | Maximum pooled lifters per ISA |
| `EnableWarmPreload` | true | Preload lifters at startup |
| `WarmPreloadIsas` | ["intel-64", "intel-32", "armv8-64", "armv7-32"] | ISAs to warm |
| `AcquireTimeout` | 5s | Timeout for acquiring a lifter |

**Pool Statistics:**
- `TotalPooledLifters`: Lifters currently in pool
- `TotalActiveLifters`: Lifters currently in use
- `IsWarm`: Whether pool has been warmed
- `IsaStats`: Per-ISA pool and active counts

**Usage:**
```csharp
using var lifter = _lifterPool.Acquire(isa);
var stmts = lifter.LiftingUnit.LiftInstruction(address);
// Lifter automatically returned to pool on dispose
```
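
The acquire-and-dispose pattern above can be illustrated with a generic bounded pool. This is a sketch under assumptions, not the `B2R2LifterPool` implementation: `BoundedPool<T>` and `Lease` are hypothetical names, and the sketch caps the number of instances leased at once, which may differ from how `MaxPoolSizePerIsa` is enforced.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical bounded pool, for illustration only.
public sealed class BoundedPool<T> where T : class
{
    private readonly ConcurrentBag<T> _idle = new();
    private readonly SemaphoreSlim _slots;
    private readonly Func<T> _factory;

    public BoundedPool(int maxSize, Func<T> factory)
    {
        _slots = new SemaphoreSlim(maxSize, maxSize);
        _factory = factory;
    }

    // Blocks until a slot is free or the timeout elapses (compare AcquireTimeout).
    public Lease Acquire(TimeSpan timeout)
    {
        if (!_slots.Wait(timeout))
            throw new TimeoutException("No pooled instance became available in time.");

        // Reuse an idle instance when one exists; otherwise create a new one.
        T item = _idle.TryTake(out var pooled) ? pooled : _factory();
        return new Lease(this, item);
    }

    public sealed class Lease : IDisposable
    {
        private BoundedPool<T>? _owner;
        public T Item { get; }

        internal Lease(BoundedPool<T> owner, T item)
        {
            _owner = owner;
            Item = item;
        }

        // Returns the instance to the pool; a second Dispose call is a no-op.
        public void Dispose()
        {
            var owner = Interlocked.Exchange(ref _owner, null);
            if (owner is null) return;
            owner._idle.Add(Item);
            owner._slots.Release();
        }
    }
}
```

Returning the instance from `Dispose` is what lets callers write `using var lifter = ...` and rely on scope exit, including on exceptions, to release the slot.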

##### 2.2.5.9 Function IR Cache

The `FunctionIrCacheService` provides Valkey-backed caching for computed semantic fingerprints to avoid redundant IR lifting and graph hashing.

**Cache Key Structure:**
```
(isa, b2r2_version, normalization_recipe, canonical_ir_hash)
```
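
One way the tuple above could be flattened into a Valkey key is sketched here. The `:` separator, the lowercase normalization of the hash, and the `FunctionIrCacheKeys.Build` helper are assumptions for illustration; the prefix and version values are the defaults from `FunctionIrCacheOptions`.

```csharp
using System;

// Hypothetical key builder, for illustration only.
public static class FunctionIrCacheKeys
{
    public static string Build(
        string keyPrefix,
        string isa,
        string b2r2Version,
        string normalizationRecipe,
        string canonicalIrHash)
    {
        // Lowercasing the hex hash keeps keys stable regardless of how the hash was formatted.
        string hash = canonicalIrHash.ToLowerInvariant();
        return string.Concat(keyPrefix, isa, ":", b2r2Version, ":", normalizationRecipe, ":", hash);
    }
}
```

Because the B2R2 version and recipe version are part of the key, upgrading either one naturally yields cache misses for old entries, which then age out through their TTL.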

**Configuration (`FunctionIrCacheOptions`):**
| Option | Default | Description |
|--------|---------|-------------|
| `KeyPrefix` | "stellaops:binidx:funccache:" | Valkey key prefix |
| `CacheTtl` | 4h | TTL for cached entries |
| `MaxTtl` | 24h | Maximum TTL |
| `Enabled` | true | Whether caching is enabled |
| `B2R2Version` | "0.9.1" | B2R2 version for cache key |
| `NormalizationRecipeVersion` | "v1" | Recipe version for cache key |

**Cache Entry (`CachedFunctionFingerprint`):**
- `FunctionAddress`, `FunctionName`
- `SemanticFingerprint`: The computed fingerprint
- `IrStatementCount`, `BasicBlockCount`
- `ComputedAtUtc`: ISO-8601 timestamp
- `B2R2Version`, `NormalizationRecipe`

**Invalidation Rules:**
- Cache entries expire after `CacheTtl` (default 4h)
- Changing B2R2 version or normalization recipe results in cache misses
- Manual invalidation via `RemoveAsync()`

**Statistics:**
- Hits, Misses, Evictions
- Hit Rate
- Enabled status

##### 2.2.5.10 Ops Endpoints

BinaryIndex exposes operational endpoints for health, benchmarking, cache monitoring, and configuration visibility.

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/v1/ops/binaryindex/health` | GET | Health status with lifter warmness, cache availability |
| `/api/v1/ops/binaryindex/bench/run` | POST | Run benchmark, return latency stats |
| `/api/v1/ops/binaryindex/cache` | GET | Function IR cache hit/miss statistics |
| `/api/v1/ops/binaryindex/config` | GET | Effective configuration (secrets redacted) |

**Health Response:**
```json
{
  "status": "healthy",
  "timestamp": "2026-01-14T12:00:00Z",
  "lifterStatus": "warm",
  "lifterWarm": true,
  "lifterPoolStats": { "intel-64": 4, "armv8-64": 2 },
  "cacheStatus": "enabled",
  "cacheEnabled": true
}
```

**Determinism Constraints:**
- All timestamps in ISO-8601 UTC format
- ASCII-only output
- Deterministic JSON key ordering
- Secrets/credentials redacted from config endpoint

#### 2.2.6 Binary Vulnerability Service

Main query interface for consumers.

```csharp
public interface IBinaryVulnerabilityService
{
    /// <summary>
    /// Look up vulnerabilities by Build-ID or equivalent binary identity.
    /// </summary>
    Task<ImmutableArray<BinaryVulnMatch>> LookupByIdentityAsync(
        BinaryIdentity identity,
        LookupOptions? opts = null,
        CancellationToken ct = default);

    /// <summary>
    /// Look up vulnerabilities by function fingerprint.
    /// </summary>
    Task<ImmutableArray<BinaryVulnMatch>> LookupByFingerprintAsync(
        CodeFingerprint fingerprint,
        decimal minSimilarity = 0.95m,
        CancellationToken ct = default);

    /// <summary>
    /// Batch lookup for scan performance.
    /// </summary>
    Task<ImmutableDictionary<string, ImmutableArray<BinaryVulnMatch>>> LookupBatchAsync(
        IEnumerable<BinaryIdentity> identities,
        LookupOptions? opts = null,
        CancellationToken ct = default);

    /// <summary>
    /// Get distro-specific fix status (patch-aware).
    /// </summary>
    Task<FixRecord?> GetFixStatusAsync(
        string distro,
        string release,
        string sourcePkg,
        string cveId,
        CancellationToken ct = default);
}

public sealed record BinaryVulnMatch(
    string CveId,
    string VulnerablePurl,
    MatchMethod Method,       // buildid_catalog, fingerprint_match, range_match
    decimal Confidence,
    MatchEvidence Evidence);

public enum MatchMethod { BuildIdCatalog, FingerprintMatch, RangeMatch }
```

---

## 3. Data Model

### 3.1 PostgreSQL Schema (`binaries`)

The `binaries` schema stores binary identity, fingerprint, and match data.

```sql
CREATE SCHEMA IF NOT EXISTS binaries;
CREATE SCHEMA IF NOT EXISTS binaries_app;

-- RLS helper
CREATE OR REPLACE FUNCTION binaries_app.require_current_tenant()
RETURNS TEXT LANGUAGE plpgsql STABLE SECURITY DEFINER AS $$
DECLARE v_tenant TEXT;
BEGIN
    v_tenant := current_setting('app.tenant_id', true);
    IF v_tenant IS NULL OR v_tenant = '' THEN
        RAISE EXCEPTION 'app.tenant_id session variable not set';
    END IF;
    RETURN v_tenant;
END;
$$;
```

#### 3.1.1 Core Tables

See `docs/db/schemas/binaries_schema_specification.md` for complete DDL.

**Key Tables:**

| Table | Purpose |
|-------|---------|
| `binaries.binary_identity` | Known binary identities (Build-ID, hashes) |
| `binaries.binary_package_map` | Binary → package mapping per snapshot |
| `binaries.vulnerable_buildids` | Build-IDs known to be vulnerable |
| `binaries.vulnerable_fingerprints` | Function fingerprints for CVEs |
| `binaries.cve_fix_index` | Patch-aware fix status per distro |
| `binaries.fingerprint_matches` | Match results (findings evidence) |
| `binaries.corpus_snapshots` | Corpus ingestion tracking |

### 3.2 RustFS Layout

```
rustfs://stellaops/binaryindex/
  fingerprints/<algorithm>/<prefix>/<fingerprint_id>.bin
  corpus/<distro>/<release>/<snapshot_id>/manifest.json
  corpus/<distro>/<release>/<snapshot_id>/packages/<pkg>.metadata.json
  evidence/<match_id>.dsse.json
```

---

## 4. Integration Points

### 4.1 Scanner.Worker Integration

During container scanning, Scanner.Worker queries BinaryIndex for each extracted binary:

```mermaid
sequenceDiagram
    participant SW as Scanner.Worker
    participant BI as BinaryIndex
    participant PG as PostgreSQL
    participant FL as Findings Ledger

    SW->>SW: Extract binary from layer
    SW->>SW: Compute BinaryIdentity
    SW->>BI: LookupByIdentityAsync(identity)
    BI->>PG: Query binaries.vulnerable_buildids
    PG-->>BI: Matches
    BI->>PG: Query binaries.cve_fix_index (if distro known)
    PG-->>BI: Fix status
    BI-->>SW: BinaryVulnMatch[]
    SW->>FL: RecordFinding(match, evidence)
```

### 4.2 Concelier Integration

BinaryIndex subscribes to Concelier's advisory updates:

```mermaid
sequenceDiagram
    participant CO as Concelier
    participant BI as BinaryIndex
    participant PG as PostgreSQL

    CO->>CO: Ingest new advisory
    CO->>BI: advisory.created event
    BI->>BI: Check if affected packages in corpus
    BI->>PG: Update binaries.binary_vuln_assertion
    BI->>BI: Queue fingerprint generation (if high-impact)
```

### 4.3 Policy Integration

Binary matches are recorded as proof segments:

```json
{
  "segment_type": "binary_fingerprint_evidence",
  "payload": {
    "binary_identity": {
      "format": "elf",
      "build_id": "abc123...",
      "file_sha256": "def456..."
    },
    "matches": [
      {
        "cve_id": "CVE-2024-1234",
        "method": "buildid_catalog",
        "confidence": 0.98,
        "vulnerable_purl": "pkg:deb/debian/libssl3@1.1.1n-0+deb11u3"
      }
    ]
  }
}
```

---

## 5. MVP Roadmap

### MVP 1: Known-Build Binary Catalog (Sprint 6000.0001)

**Goal:** Query "is this Build-ID vulnerable?" with distro-level precision.

**Deliverables:**
- `binaries` PostgreSQL schema
- Build-ID to package mapping tables
- Basic CVE lookup by binary identity
- Debian/Ubuntu corpus connector

### MVP 2: Patch-Aware Backport Handling (Sprint 6000.0002)

**Goal:** Handle "version says vulnerable but distro backported the fix."

**Deliverables:**
- Fix index builder (changelog + patch header parsing)
- Distro-specific version comparison
- RPM corpus connector
- Scanner.Worker integration

### MVP 3: Binary Fingerprint Factory (Sprint 6000.0003)

**Goal:** Detect vulnerable code independent of package metadata.

**Deliverables:**
- Fingerprint storage and matching
- Reference build generation pipeline
- Fingerprint validation corpus
- High-impact CVE coverage (OpenSSL, glibc, zlib, curl)

### MVP 4: Full Scanner Integration (Sprint 6000.0004)

**Goal:** Binary evidence in production scans.

**Deliverables:**
- Scanner.Worker binary lookup integration
- Findings Ledger binary match records
- Proof segment attestations
- CLI binary match inspection

---

## 5b. Fix Evidence Chain

The **Fix Evidence Chain** provides auditable proof of why a CVE is marked as fixed (or not) for a specific distro/package combination. This is critical for patch-aware backport handling where package versions can be misleading.

### 5b.1 Evidence Sources

| Source | Confidence | Description |
|--------|------------|-------------|
| **Security Feed (OVAL)** | 0.95-0.99 | Authoritative feed from distro (Debian Security Tracker, Red Hat OVAL) |
| **Patch Header (DEP-3)** | 0.87-0.95 | CVE reference in Debian/Ubuntu patch metadata |
| **Changelog** | 0.75-0.85 | CVE mention in debian/changelog or RPM %changelog |
| **Upstream Patch Match** | 0.90 | Binary diff matches known upstream fix |

### 5b.2 Evidence Storage

Evidence is stored in two PostgreSQL tables:

```sql
-- Evidence blobs: audit trail
-- Created first because cve_fix_index.evidence_id references this table.
CREATE TABLE binaries.fix_evidence (
    id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    evidence_type TEXT NOT NULL, -- changelog, patch_header, security_feed
    source_file TEXT,            -- Path to source file (changelog, patch)
    source_sha256 TEXT,          -- Hash of source file
    excerpt TEXT,                -- Relevant snippet (max 1KB)
    metadata JSONB NOT NULL,     -- Structured metadata
    snapshot_id UUID,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Fix index: one row per (distro, release, source_pkg, cve_id)
CREATE TABLE binaries.cve_fix_index (
    id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    distro TEXT NOT NULL,        -- debian, ubuntu, alpine, rhel
    release TEXT NOT NULL,       -- bookworm, jammy, v3.19
    source_pkg TEXT NOT NULL,
    cve_id TEXT NOT NULL,
    state TEXT NOT NULL,         -- fixed, vulnerable, not_affected, wontfix, unknown
    fixed_version TEXT,
    method TEXT NOT NULL,        -- security_feed, changelog, patch_header, upstream_match
    confidence DECIMAL(3,2) NOT NULL,
    evidence_id UUID REFERENCES binaries.fix_evidence(id),
    snapshot_id UUID,
    indexed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (tenant_id, distro, release, source_pkg, cve_id)
);
```

### 5b.3 Evidence Types

**ChangelogEvidence:**
```json
{
  "evidence_type": "changelog",
  "source_file": "debian/changelog",
  "excerpt": "* Fix CVE-2024-0727: PKCS12 decoding crash",
  "metadata": {
    "version": "3.0.11-1~deb12u2",
    "line_number": 5
  }
}
```

**PatchHeaderEvidence:**
```json
{
  "evidence_type": "patch_header",
  "source_file": "debian/patches/CVE-2024-0727.patch",
  "excerpt": "CVE: CVE-2024-0727\nOrigin: upstream, https://github.com/openssl/commit/abc123",
  "metadata": {
    "patch_sha256": "abc123def456..."
  }
}
```

**SecurityFeedEvidence:**
```json
{
  "evidence_type": "security_feed",
  "metadata": {
    "feed_id": "debian-security-tracker",
    "entry_id": "DSA-5678-1",
    "published_at": "2024-01-15T10:00:00Z"
  }
}
```

### 5b.4 Confidence Resolution

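When multiple evidence sources exist for the same (distro, release, source package, CVE) tuple, the upsert keeps the `confidence`, `method`, and `evidence_id` of the **highest-confidence** entry:

```sql
-- Assumes the statement begins with
--   INSERT INTO binaries.cve_fix_index AS existing (...) VALUES (...)
-- so "existing" names the stored row; EXCLUDED is the row proposed for insertion.
ON CONFLICT (tenant_id, distro, release, source_pkg, cve_id)
DO UPDATE SET
    confidence = GREATEST(existing.confidence, EXCLUDED.confidence),
    method = CASE
        WHEN existing.confidence < EXCLUDED.confidence THEN EXCLUDED.method
        ELSE existing.method
    END,
    evidence_id = CASE
        WHEN existing.confidence < EXCLUDED.confidence THEN EXCLUDED.evidence_id
        ELSE existing.evidence_id
    END
```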
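
The same resolution rule, expressed in C# for readers who want to reason about it outside the database. `FixIndexRow` and `FixConfidenceResolution` are hypothetical names; the sketch mirrors the upsert above, which updates only confidence, method, and evidence reference and leaves the stored state untouched.

```csharp
using System;

// Hypothetical in-memory view of a cve_fix_index row, for illustration only.
public sealed record FixIndexRow(string State, string Method, decimal Confidence, Guid? EvidenceId);

public static class FixConfidenceResolution
{
    // Keeps the stored row unless the incoming row has strictly higher confidence.
    public static FixIndexRow Resolve(FixIndexRow existing, FixIndexRow incoming) =>
        existing.Confidence < incoming.Confidence
            ? existing with
              {
                  Confidence = incoming.Confidence,
                  Method = incoming.Method,
                  EvidenceId = incoming.EvidenceId
              }
            : existing;
}
```

On a tie the stored row wins, so re-ingesting the same evidence does not churn `method` or `evidence_id`.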

### 5b.5 Parsers

The following parsers extract CVE fix information:

| Parser | Distros | Input | Confidence |
|--------|---------|-------|------------|
| `DebianChangelogParser` | Debian, Ubuntu | debian/changelog | 0.80 |
| `PatchHeaderParser` | Debian, Ubuntu | debian/patches/*.patch (DEP-3) | 0.87 |
| `AlpineSecfixesParser` | Alpine | APKBUILD secfixes block | 0.95 |
| `RpmChangelogParser` | RHEL, Fedora, CentOS | RPM spec %changelog | 0.75 |
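
To make the changelog case concrete, the sketch below scans changelog text for CVE identifiers and records the line on which each appears, which is the kind of data shown in the `ChangelogEvidence` example. It is a simplified illustration, not the `DebianChangelogParser` itself: `ChangelogCveScanner` is a hypothetical name, and a real parser would also need to associate each hit with the changelog entry's version.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical scanner, for illustration only.
public static class ChangelogCveScanner
{
    // CVE identifiers: "CVE-", a four-digit year, then four or more digits.
    private static readonly Regex CvePattern =
        new Regex(@"\bCVE-\d{4}-\d{4,}\b", RegexOptions.Compiled | RegexOptions.CultureInvariant);

    // Returns every CVE mention with its 1-based line number, in document order.
    public static IReadOnlyList<(string CveId, int LineNumber)> Scan(string changelogText)
    {
        var hits = new List<(string CveId, int LineNumber)>();
        string[] lines = changelogText.Split('\n');

        for (int i = 0; i < lines.Length; i++)
        {
            foreach (Match match in CvePattern.Matches(lines[i]))
            {
                hits.Add((match.Value, i + 1));
            }
        }
        return hits;
    }
}
```

A bare mention of a CVE in a changelog does not prove the fix was applied, which is consistent with the changelog parsers carrying lower confidence than patch-header or security-feed evidence.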

### 5b.6 Query Flow

```mermaid
sequenceDiagram
    participant SW as Scanner.Worker
    participant BVS as BinaryVulnerabilityService
    participant FIR as FixIndexRepository
    participant PG as PostgreSQL

    SW->>BVS: GetFixStatusAsync(debian, bookworm, openssl, CVE-2024-0727)
    BVS->>FIR: GetFixStatusAsync(...)
    FIR->>PG: SELECT FROM cve_fix_index WHERE ...
    PG-->>FIR: FixIndexEntry (state=fixed, confidence=0.87)
    FIR-->>BVS: FixStatusResult
    BVS-->>SW: {state: Fixed, confidence: 0.87, method: PatchHeader}
```

---

## 6. Security Considerations

### 6.1 Trust Boundaries

1. **Corpus Ingestion** - Packages are untrusted; extraction runs in sandboxed workers
2. **Fingerprint Generation** - Reference builds compiled in isolated environments
3. **Query API** - Tenant-isolated via RLS; no cross-tenant data leakage

### 6.2 Signing & Provenance

- All corpus snapshots are signed (DSSE)
- Fingerprint sets are versioned and signed
- Every match result references evidence digests

### 6.3 Sandbox Requirements

Binary extraction and fingerprint generation MUST run with:
- Seccomp profile restricting syscalls
- Read-only root filesystem
- No network access during analysis
- Memory/CPU limits

---

## 7. Observability

### 7.1 Metrics

| Metric | Type | Labels |
|--------|------|--------|
| `binaryindex_lookup_total` | Counter | method, result |
| `binaryindex_lookup_latency_ms` | Histogram | method |
| `binaryindex_corpus_packages_total` | Gauge | distro, release |
| `binaryindex_fingerprints_indexed` | Gauge | algorithm, component |
| `binaryindex_match_confidence` | Histogram | method |

### 7.2 Traces

- `binaryindex.lookup` - Full lookup span
- `binaryindex.corpus.ingest` - Corpus ingestion
- `binaryindex.fingerprint.generate` - Fingerprint generation

### 7.3 Ops Endpoints

> **Sprint:** SPRINT_20260112_007_BINIDX_binaryindex_user_config

BinaryIndex exposes ops endpoints for health, bench, cache, and effective configuration. All are read-only `GET` endpoints except `/bench/run`, which is a `POST` that triggers a latency benchmark:

| Endpoint | Method | Response Schema | Description |
|----------|--------|-----------------|-------------|
| `/api/v1/ops/binaryindex/health` | GET | `BinaryIndexOpsHealthResponse` | Health status, lifter warmness per ISA, cache availability |
| `/api/v1/ops/binaryindex/bench/run` | POST | `BinaryIndexBenchResponse` | Run latency benchmark, return min/max/mean/p50/p95/p99 stats |
| `/api/v1/ops/binaryindex/cache` | GET | `BinaryIndexFunctionCacheStats` | Function cache hit/miss/eviction statistics |
| `/api/v1/ops/binaryindex/config` | GET | `BinaryIndexEffectiveConfig` | Effective configuration with secrets redacted |

#### 7.3.1 Response Schemas

**BinaryIndexOpsHealthResponse:**
```json
{
  "status": "healthy",
  "timestamp": "2026-01-16T12:00:00Z",
  "components": {
    "lifterPool": { "status": "healthy", "message": null },
    "functionCache": { "status": "healthy", "message": null },
    "persistence": { "status": "healthy", "message": null }
  },
  "lifterWarmness": {
    "intel-64": { "isa": "intel-64", "warm": true, "poolSize": 4, "acquireTimeMs": 12 },
    "armv8-64": { "isa": "armv8-64", "warm": true, "poolSize": 2, "acquireTimeMs": 8 }
  }
}
```

**BinaryIndexBenchResponse:**
```json
{
  "timestamp": "2026-01-16T12:00:00Z",
  "sampleSize": 100,
  "latencySummary": {
    "minMs": 5.2,
    "maxMs": 142.8,
    "meanMs": 28.4,
    "p50Ms": 22.1,
    "p95Ms": 78.3,
    "p99Ms": 121.5
  },
  "operations": [
    { "operation": "lifterAcquire", "samples": 100, "meanMs": 12.4 },
    { "operation": "irNormalization", "samples": 100, "meanMs": 8.7 },
    { "operation": "cacheLookup", "samples": 100, "meanMs": 1.2 }
  ]
}
```

**BinaryIndexFunctionCacheStats:**
```json
{
  "enabled": true,
  "backend": "valkey",
  "hits": 15234,
  "misses": 892,
  "evictions": 45,
  "hitRate": 0.945,
  "keyPrefix": "stellaops:binidx:funccache:",
  "cacheTtlSeconds": 14400,
  "estimatedEntries": 12500,
  "estimatedMemoryBytes": 52428800
}
```

**BinaryIndexEffectiveConfig:**
```json
{
  "b2r2Pool": {
    "maxPoolSizePerIsa": 4,
    "warmPreload": ["intel-64", "armv8-64"],
    "acquireTimeoutMs": 5000,
    "enableMetrics": true
  },
  "semanticLifting": {
    "b2r2Version": "1.5.0",
    "normalizationRecipeVersion": "2024.1",
    "maxInstructionsPerFunction": 10000,
    "maxFunctionsPerBinary": 5000,
    "functionLiftTimeoutMs": 30000,
    "enableDeduplication": true
  },
  "functionCache": {
    "connectionString": "********",
    "keyPrefix": "stellaops:binidx:funccache:",
    "cacheTtlSeconds": 14400,
    "maxTtlSeconds": 86400,
    "earlyExpiryPercent": 0.1,
    "maxEntrySizeBytes": 1048576
  },
  "persistence": {
    "schema": "binaries",
    "minPoolSize": 5,
    "maxPoolSize": 20,
    "commandTimeoutSeconds": 30,
    "retryOnFailure": true,
    "batchSize": 100
  },
  "backendVersions": {
    "b2r2": "1.5.0",
    "valkey": "7.2.0",
    "postgres": "15.4"
  }
}
```

#### 7.3.2 Rate Limiting

The `/bench/run` endpoint is rate-limited to prevent load spikes:
- Default: 5 requests per minute per tenant
- Configurable via `BinaryIndex:Ops:BenchRateLimitPerMinute`
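
The source specifies the limit but not the algorithm used to enforce it. A minimal sketch, assuming a simple fixed one-minute window per tenant (the `BenchRateLimiter` type is hypothetical), might look like this:

```csharp
using System;
using System.Collections.Generic;

// Sketch only: a fixed-window counter is an assumed algorithm; the document
// states the limit (5 per minute per tenant), not how it is enforced.
public sealed class BenchRateLimiter
{
    private readonly int _limitPerMinute;
    private readonly Dictionary<string, (DateTimeOffset WindowStart, int Count)> _windows = new();
    private readonly object _gate = new();

    public BenchRateLimiter(int limitPerMinute = 5) => _limitPerMinute = limitPerMinute;

    // Returns true if the tenant may start a benchmark at time `now`.
    public bool TryAcquire(string tenantId, DateTimeOffset now)
    {
        lock (_gate)
        {
            if (!_windows.TryGetValue(tenantId, out var window) ||
                now - window.WindowStart >= TimeSpan.FromMinutes(1))
            {
                window = (now, 0); // start a fresh one-minute window for this tenant
            }

            if (window.Count >= _limitPerMinute)
            {
                return false; // window exhausted; caller should reject the request
            }

            _windows[tenantId] = (window.WindowStart, window.Count + 1);
            return true;
        }
    }
}
```

Passing the timestamp in explicitly keeps the limiter deterministic under test. A production service would more likely rely on the framework's rate-limiting middleware.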

#### 7.3.3 Secret Redaction

The config endpoint automatically redacts the values of sensitive keys:

| Redacted Keys | Handling |
|---------------|----------|
| `connectionString` | Replaced with `********` |
| `password` | Replaced with `********` |
| `secret*` | Any key starting with "secret"; replaced with `********` |
| `apiKey` | Replaced with `********` |
| `token` | Replaced with `********` |

Redaction is applied recursively to nested objects.
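
The recursive walk can be sketched with `System.Text.Json.Nodes` as follows. The `ConfigRedactor` type is hypothetical, and case-insensitive key matching is an assumption; the source lists the keys but does not state how they are compared.

```csharp
using System;
using System.Linq;
using System.Text.Json.Nodes;

// Sketch only: case-insensitive matching is an assumption.
// The key list mirrors the redaction table.
public static class ConfigRedactor
{
    private const string Mask = "********";
    private static readonly string[] ExactKeys = { "connectionString", "password", "apiKey", "token" };

    private static bool IsSensitive(string key) =>
        ExactKeys.Contains(key, StringComparer.OrdinalIgnoreCase) ||
        key.StartsWith("secret", StringComparison.OrdinalIgnoreCase);

    // Walks the JSON tree in place and masks sensitive keys at every depth.
    public static void Redact(JsonNode? node)
    {
        switch (node)
        {
            case JsonObject obj:
                // Snapshot the keys so the object can be modified while iterating.
                foreach (var key in obj.Select(pair => pair.Key).ToList())
                {
                    if (IsSensitive(key)) obj[key] = Mask;
                    else Redact(obj[key]);
                }
                break;
            case JsonArray array:
                foreach (var item in array) Redact(item);
                break;
        }
    }
}
```

Calling `ConfigRedactor.Redact(JsonNode.Parse(json))` mutates the parsed tree, so a nested `functionCache.connectionString` is masked the same way as a top-level key.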

---

## 8. Configuration

> **Sprint:** SPRINT_20260112_007_BINIDX_binaryindex_user_config

### 8.1 Configuration Sections

Configuration lives under the `BinaryIndex` section in `appsettings.yaml`, or in environment variables with the `BINARYINDEX__` prefix. The exception is persistence, which is configured under `Postgres:BinaryIndex` (see 8.1.4).

#### 8.1.1 B2R2 Lifter Pool (`BinaryIndex:B2R2Pool`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `MaxPoolSizePerIsa` | int | 4 | Maximum lifter instances per ISA |
| `WarmPreload` | string[] | ["intel-64", "armv8-64"] | ISAs to warm on startup |
| `AcquireTimeoutMs` | int | 5000 | Timeout for lifter acquisition |
| `EnableMetrics` | bool | true | Emit Prometheus metrics for pool |

```yaml
BinaryIndex:
  B2R2Pool:
    MaxPoolSizePerIsa: 4
    WarmPreload:
      - intel-64
      - armv8-64
    AcquireTimeoutMs: 5000
    EnableMetrics: true
```

#### 8.1.2 Semantic Lifting (`BinaryIndex:SemanticLifting`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `B2R2Version` | string | "1.5.0" | B2R2 disassembler version |
| `NormalizationRecipeVersion` | string | "2024.1" | IR normalization recipe version |
| `MaxInstructionsPerFunction` | int | 10000 | Max instructions to lift per function |
| `MaxFunctionsPerBinary` | int | 5000 | Max functions to process per binary |
| `FunctionLiftTimeoutMs` | int | 30000 | Timeout for lifting single function |
| `EnableDeduplication` | bool | true | Deduplicate IR before fingerprinting |

```yaml
BinaryIndex:
  SemanticLifting:
    MaxInstructionsPerFunction: 10000
    MaxFunctionsPerBinary: 5000
    FunctionLiftTimeoutMs: 30000
    EnableDeduplication: true
```

#### 8.1.3 Function Cache (`BinaryIndex:FunctionCache`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `ConnectionString` | string | — | Valkey connection string (secret) |
| `KeyPrefix` | string | "stellaops:binidx:funccache:" | Cache key prefix |
| `CacheTtlSeconds` | int | 14400 | Default cache TTL (4 hours) |
| `MaxTtlSeconds` | int | 86400 | Maximum TTL (24 hours) |
| `EarlyExpiryPercent` | decimal | 0.1 | Early expiry jitter (10%) |
| `MaxEntrySizeBytes` | int | 1048576 | Max entry size (1 MB) |

```yaml
BinaryIndex:
  FunctionCache:
    ConnectionString: ${VALKEY_CONNECTION}  # from env
    KeyPrefix: "stellaops:binidx:funccache:"
    CacheTtlSeconds: 14400
    MaxEntrySizeBytes: 1048576
```
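
The table does not spell out how `EarlyExpiryPercent` interacts with the TTL. One plausible reading, sketched below under that assumption, is that each entry's TTL is capped at `MaxTtlSeconds` and then shortened by a random fraction of up to `EarlyExpiryPercent`, so entries written together do not all expire at once. The `CacheTtl` helper is hypothetical.

```csharp
using System;

// Sketch only: one plausible reading of EarlyExpiryPercent,
// not the confirmed implementation.
public static class CacheTtl
{
    // Caps the TTL at maxTtlSeconds, then shortens it by up to earlyExpiryPercent.
    // `jitter` is a caller-supplied random value in [0, 1).
    public static TimeSpan Effective(int ttlSeconds, int maxTtlSeconds, double earlyExpiryPercent, double jitter)
    {
        var capped = Math.Min(ttlSeconds, maxTtlSeconds);
        var reduction = capped * earlyExpiryPercent * jitter;
        return TimeSpan.FromSeconds(capped - reduction);
    }
}
```

With the defaults above (`14400`, `86400`, `0.1`), this reading yields effective TTLs between roughly 3.6 and 4 hours.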

#### 8.1.4 Persistence (`Postgres:BinaryIndex`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `Schema` | string | "binaries" | PostgreSQL schema name |
| `MinPoolSize` | int | 5 | Minimum connection pool size |
| `MaxPoolSize` | int | 20 | Maximum connection pool size |
| `CommandTimeoutSeconds` | int | 30 | Command execution timeout |
| `RetryOnFailure` | bool | true | Retry transient failures |
| `BatchSize` | int | 100 | Batch insert size |

```yaml
Postgres:
  BinaryIndex:
    Schema: binaries
    MinPoolSize: 5
    MaxPoolSize: 20
    CommandTimeoutSeconds: 30
    RetryOnFailure: true
    BatchSize: 100
```

#### 8.1.5 Ops Configuration (`BinaryIndex:Ops`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `EnableHealthEndpoint` | bool | true | Enable /health endpoint |
| `EnableBenchEndpoint` | bool | true | Enable /bench/run endpoint |
| `BenchRateLimitPerMinute` | int | 5 | Rate limit for bench endpoint |
| `RedactedKeys` | string[] | See 7.3.3 | Keys to redact in config output |
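
Following the pattern of the other subsections, a fragment using the documented defaults might look like the one below. The `RedactedKeys` entries are assumed to mirror the table in 7.3.3; the exact list syntax accepted for patterns such as `secret*` is not confirmed by the source.

```yaml
BinaryIndex:
  Ops:
    EnableHealthEndpoint: true
    EnableBenchEndpoint: true
    BenchRateLimitPerMinute: 5
    RedactedKeys:          # assumed to match the keys listed in 7.3.3
      - connectionString
      - password
      - secret*
      - apiKey
      - token
```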

### 8.2 Legacy Configuration

```yaml
# binaryindex.yaml (corpus configuration)
binaryindex:
  enabled: true

  corpus:
    connectors:
      - type: debian
        enabled: true
        mirror: http://deb.debian.org/debian
        releases: [bookworm, bullseye]
        architectures: [amd64, arm64]
      - type: ubuntu
        enabled: true
        mirror: http://archive.ubuntu.com/ubuntu
        releases: [jammy, noble]

  fingerprinting:
    enabled: true
    algorithms: [basic_block, cfg]
    target_components:
      - openssl
      - glibc
      - zlib
      - curl
      - sqlite
    min_function_size: 16  # bytes
    max_functions_per_binary: 10000

  lookup:
    cache_ttl: 3600
    batch_size: 100
    timeout_ms: 5000

  storage:
    postgres_schema: binaries
    rustfs_bucket: stellaops/binaryindex
```

---

## 9. Testing Strategy

### 9.1 Unit Tests

- Identity extraction (Build-ID, hashes)
- Fingerprint generation determinism
- Fix index parsing (changelog, patch headers)

### 9.2 Integration Tests

- PostgreSQL schema validation
- Full corpus ingestion flow
- Scanner.Worker lookup integration

### 9.3 Regression Tests

- Known CVE detection (golden corpus)
- Backport handling (Debian libssl example)
- False positive rate validation

---

## 10. Golden Corpus for Patch Provenance

> **Sprint:** SPRINT_20260121_034/035/036 - Golden Corpus Implementation

The BinaryIndex module supports a **golden corpus** of patch-paired artifacts that enables offline SBOM reproducibility and binary-level patch provenance verification.

### 10.1 Corpus Purpose

The golden corpus provides:
- **Auditor-ready evidence bundles** for air-gapped customers
- **Regression testing** for binary matching accuracy
- **Proof of patch status** independent of package metadata

### 10.2 Corpus Sources

| Source | Type | Purpose |
|--------|------|---------|
| Debian Security Tracker / DSAs | Advisory | Primary advisory linkage |
| Debian Snapshot | Binary archive | Pre/post patch binary pairs |
| Ubuntu Security Notices | Advisory | Ubuntu-specific advisories |
| Alpine secdb | Advisory | Alpine YAML advisories |
| OSV dump | Unified schema | Cross-reference and commit ranges |

#### 10.2.1 Symbol Source Connectors

> **Sprint:** SPRINT_20260121_035_BinaryIndex_golden_corpus_connectors_cli

The corpus ingestion layer uses pluggable connectors to retrieve symbols and metadata from upstream sources:

| Connector ID | Implementation | Protocol | Data Retrieved |
|--------------|----------------|----------|----------------|
| `debuginfod-fedora` | `DebuginfodConnector` | debuginfod HTTP | ELF debug symbols by Build-ID |
| `debuginfod-ubuntu` | `DebuginfodConnector` | debuginfod HTTP | ELF debug symbols by Build-ID |
| `ddeb-ubuntu` | `DdebConnector` | APT/HTTP | `.ddeb` debug packages |
| `buildinfo-debian` | `BuildinfoConnector` | HTTP | `.buildinfo` reproducibility records |
| `secdb-alpine` | `AlpineSecDbConnector` | Git/HTTP | `secfixes` YAML from APKBUILD |

**Connector Interface:**

```csharp
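public interface ISymbolSourceConnector
{
    // Connector identifier, e.g. "debuginfod-fedora" (see the table above).
    string ConnectorId { get; }
    string DisplayName { get; }
    string[] SupportedDistros { get; }

    // Reports the connector's current status.
    Task<ConnectorStatus> GetStatusAsync(CancellationToken ct);
    // Synchronizes symbols and metadata from the upstream source.
    Task SyncAsync(SyncOptions options, CancellationToken ct);
    // Looks up symbols by Build-ID; the nullable result allows for a miss.
    Task<SymbolLookupResult?> LookupByBuildIdAsync(string buildId, CancellationToken ct);
    // Searches the source and yields matching symbol records.
    Task<IAsyncEnumerable<SymbolRecord>> SearchAsync(SymbolSearchQuery query, CancellationToken ct);
}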
```

**Debuginfod Connector:**

The `DebuginfodConnector` implements the [debuginfod protocol](https://sourceware.org/elfutils/Debuginfod.html) for retrieving debug symbols:

- Endpoint: `GET /buildid/<build-id>/debuginfo`
- Supports federated queries across multiple debuginfod servers
- Caches retrieved symbols in RustFS blob storage
- Rate-limited to respect upstream server policies
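
The request shape and a simple federated lookup can be sketched as follows. The `DebuginfodLookup` type is hypothetical, and "query servers in order, first success wins" is an assumed federation policy. Caching and rate limiting are omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: names are hypothetical and the federation policy is assumed.
public static class DebuginfodLookup
{
    // Builds the debuginfo URL for a Build-ID, rejecting anything that is not hex.
    public static Uri BuildDebuginfoUri(Uri server, string buildId)
    {
        if (buildId.Length == 0 || !buildId.All(Uri.IsHexDigit))
        {
            throw new ArgumentException("Build-ID must be a non-empty hex string.", nameof(buildId));
        }
        return new Uri(server, $"/buildid/{buildId}/debuginfo");
    }

    // Queries each server in order and returns the first successful body, or null on a miss.
    public static async Task<byte[]?> FetchAsync(
        HttpClient http, IEnumerable<Uri> servers, string buildId, CancellationToken ct)
    {
        foreach (var server in servers)
        {
            using var response = await http.GetAsync(BuildDebuginfoUri(server, buildId), ct);
            if (response.IsSuccessStatusCode)
            {
                return await response.Content.ReadAsByteArrayAsync(ct);
            }
        }
        return null;
    }
}
```

Validating the Build-ID before building the URL keeps untrusted input out of the request path. Network failures would need additional handling in a real connector.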

**Ubuntu ddeb Connector:**

The `DdebConnector` retrieves Ubuntu debug symbol packages (`.ddeb`):

- Sources: `ddebs.ubuntu.com` mirror
- Indexes: Reads `Packages.xz` for package metadata
- Extraction: Unpacks `.ddeb` AR archives to extract DWARF symbols
- Mapping: Links debug symbols to binary packages via Build-ID

**Debian Buildinfo Connector:**

The `BuildinfoConnector` retrieves Debian buildinfo files for reproducibility verification:

- Source: `buildinfos.debian.net` and snapshot archives
- Purpose: Provides build environment metadata for reproducible builds
- Fields extracted: `Build-Date`, `Build-Architecture`, `Checksums-Sha256`
- Integration: Cross-references with binary packages for provenance

**Alpine SecDB Connector:**

The `AlpineSecDbConnector` parses Alpine's security database:

- Source: `secfixes` blocks in APKBUILD files
- Repository: `alpine/aports` Git repository
- Format: YAML blocks keyed by fixed version, each listing the CVEs resolved in that version
- Example:
  ```yaml
  secfixes:
    3.0.11-r0:
      - CVE-2024-0727
      - CVE-2024-0728
  ```
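
A block of this shape can be inverted into a CVE-to-fixed-version map with a small line-based reader. This is a minimal sketch rather than a full YAML parser, and the `SecfixesReader` type is hypothetical. It also tolerates a leading `#`, because APKBUILD files carry the block in comment lines.

```csharp
using System;
using System.Collections.Generic;

// Sketch only: a line-based reader for the block shape shown above.
public static class SecfixesReader
{
    // Returns a map of CVE ID -> fixed version.
    public static Dictionary<string, string> Parse(string block)
    {
        var fixedIn = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string? version = null;

        foreach (var raw in block.Split('\n'))
        {
            var line = raw.Trim().TrimStart('#').Trim(); // drop indentation and comment prefix
            if (line.Length == 0 || line == "secfixes:") continue;

            if (line.EndsWith(':'))
            {
                version = line.TrimEnd(':');             // e.g. "3.0.11-r0:"
            }
            else if (line.StartsWith("- ") && version is not null)
            {
                fixedIn[line.Substring(2).Trim()] = version; // e.g. "- CVE-2024-0727"
            }
        }
        return fixedIn;
    }
}
```

For the example block, both CVE IDs map to `3.0.11-r0`.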

**OSV Dump Parser:**

The `OsvDumpParser` processes Google OSV database dumps for advisory cross-correlation:

- Source: `osv.dev` bulk exports (JSON)
- Purpose: CVE → commit range extraction for patch identification
- Cross-reference: Correlates OSV entries with distribution advisories
- Inconsistency detection: Identifies discrepancies between OSV and distro advisories

```csharp
public interface IOsvDumpParser
{
    IAsyncEnumerable<OsvParsedEntry> ParseDumpAsync(Stream osvDumpStream, CancellationToken ct);
    OsvCveIndex BuildCveIndex(IEnumerable<OsvParsedEntry> entries);
    IEnumerable<AdvisoryCorrelation> CrossReferenceWithExternal(
        OsvCveIndex osvIndex,
        IEnumerable<ExternalAdvisory> externalAdvisories);
    IEnumerable<AdvisoryInconsistency> DetectInconsistencies(
        IEnumerable<AdvisoryCorrelation> correlations);
}
```

**CLI Access:**

All connectors are manageable via the `stella groundtruth sources` CLI commands:

```bash
# List all connectors
stella groundtruth sources list

# Sync specific connector
stella groundtruth sources sync --source buildinfo-debian --full

# Enable/disable connectors
stella groundtruth sources enable ddeb-ubuntu
stella groundtruth sources disable debuginfod-fedora
```

See [Ground-Truth CLI Guide](../cli/guides/ground-truth-cli.md) for complete CLI documentation.

### 10.3 Key Performance Indicators

| KPI | Target | Description |
|-----|--------|-------------|
| Per-function match rate | >= 90% | Functions matched in post-patch binary |
| False-negative patch detection | <= 5% | Patched functions incorrectly classified as unpatched |
| SBOM canonical-hash stability | 3/3 | Identical canonical hash across three independent runs |
| Binary reconstruction equivalence | Trend | Rebuilt binary matches original |
| End-to-end verify time (p95, cold) | Trend | Offline verification performance |

### 10.4 Validation Harness

The validation harness (`IValidationHarness`) orchestrates end-to-end verification:

```
Binary Pair (pre/post) → Symbol Recovery → IR Lifting → Fingerprinting → Matching → Metrics
```

### 10.5 Evidence Bundle Format

Evidence bundles follow OCI/ORAS conventions:

```
<pkg>-<advisory>-bundle.oci.tar
├── manifest.json           # OCI manifest
└── blobs/
    ├── sha256:<sbom>       # Canonical SBOM
    ├── sha256:<pre-bin>    # Pre-fix binary
    ├── sha256:<post-bin>   # Post-fix binary
    ├── sha256:<delta-sig>  # DSSE delta-sig predicate
    └── sha256:<timestamp>  # RFC 3161 timestamp
```

### 10.6 Two-Tier Bundle Design and Large Blob References

> **Sprint:** SPRINT_20260122_040_Platform_oci_delta_attestation_pipeline (040-04)

Evidence bundles support two export modes to balance transfer speed with auditability:

| Mode | Export Flag | Contents | Use Case |
|------|------------|----------|----------|
| **Light** | (default) | Manifest + attestation envelopes + metadata | Quick transfer, metadata-only audit |
| **Full** | `--full` | Light + embedded binary blobs in `blobs/` | Air-gap replay, full provenance verification |

#### 10.6.1 `largeBlobs[]` Field

The `DeltaSigPredicate` includes a `largeBlobs` array referencing binary artifacts that may be too large to embed in attestation payloads:

```json
{
  "schemaVersion": "1.0.0",
  "subject": [...],
  "delta": [...],
  "largeBlobs": [
    {
      "kind": "binary-patch",
      "digest": "sha256:a1b2c3...",
      "mediaType": "application/octet-stream",
      "sizeBytes": 1048576
    },
    {
      "kind": "sbom-fragment",
      "digest": "sha256:d4e5f6...",
      "mediaType": "application/spdx+json",
      "sizeBytes": 32768
    }
  ],
  "sbomDigest": "sha256:789abc..."
}
```

**Field Definitions:**

| Field | Type | Description |
|-------|------|-------------|
| `largeBlobs[].kind` | string | Blob category: `binary-patch`, `sbom-fragment`, `debug-symbols`, etc. |
| `largeBlobs[].digest` | string | Content-addressable digest (`sha256:<hex>`, `sha384:<hex>`, `sha512:<hex>`) |
| `largeBlobs[].mediaType` | string | IANA media type of the blob |
| `largeBlobs[].sizeBytes` | long | Blob size in bytes |
| `sbomDigest` | string | Digest of the canonical SBOM associated with this delta |
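
For consumers that only need the blob references, the fields above could be read with `System.Text.Json` as sketched below. The record and class names are hypothetical and inferred from the field table, not taken from the code base.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Sketch only: these shapes are inferred from the field table.
public sealed record LargeBlobReference(string Kind, string Digest, string MediaType, long SizeBytes);

public sealed record DeltaSigBlobView(IReadOnlyList<LargeBlobReference>? LargeBlobs, string? SbomDigest);

public static class DeltaSigPredicateReader
{
    // Web defaults give camelCase, case-insensitive property matching.
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web);

    // Reads only largeBlobs and sbomDigest; other predicate properties are ignored.
    public static DeltaSigBlobView? Read(string json) =>
        JsonSerializer.Deserialize<DeltaSigBlobView>(json, Options);
}
```

Because unknown properties are skipped by default, the same reader works whether or not `subject` and `delta` are present in the payload.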

#### 10.6.2 Blob Fetch Strategy

During `stella bundle verify --replay`, blobs are resolved in priority order:

1. **Embedded** (full bundles): Read from `blobs/<digest-with-dash>` in the bundle directory
2. **Local source** (`--blob-source /path/`): Read from the specified local directory
3. **Registry** (`--blob-source https://...`): HTTP GET from OCI registry (blocked in `--offline` mode)
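
The resolution order can be sketched as a pure decision function. Everything named here (`BlobOrigin`, `BlobResolver`) is hypothetical. The sketch also assumes that `<digest-with-dash>` means the digest with `:` replaced by `-`, and that a local source directory uses the same file naming.

```csharp
using System;
using System.IO;

public enum BlobOrigin { Embedded, LocalSource, Registry, Unavailable }

// Sketch only: names are hypothetical and the file-naming rule is assumed.
public static class BlobResolver
{
    public static BlobOrigin Resolve(string bundleDir, string digest, string? blobSource, bool offline)
    {
        var fileName = digest.Replace(':', '-');

        // 1. Embedded blob inside a full bundle.
        if (File.Exists(Path.Combine(bundleDir, "blobs", fileName)))
        {
            return BlobOrigin.Embedded;
        }

        if (string.IsNullOrEmpty(blobSource))
        {
            return BlobOrigin.Unavailable;
        }

        // 3. Registry source: an HTTP(S) URL, blocked when running offline.
        if (blobSource.StartsWith("https://", StringComparison.OrdinalIgnoreCase) ||
            blobSource.StartsWith("http://", StringComparison.OrdinalIgnoreCase))
        {
            return offline ? BlobOrigin.Unavailable : BlobOrigin.Registry;
        }

        // 2. Local directory source.
        return File.Exists(Path.Combine(blobSource, fileName))
            ? BlobOrigin.LocalSource
            : BlobOrigin.Unavailable;
    }
}
```

Separating the decision from the actual I/O makes the `--offline` rule easy to test without network access.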

#### 10.6.3 Digest Verification

Fetched blobs are verified against their declared digest using the algorithm prefix:

```
sha256:<hex> → SHA-256
sha384:<hex> → SHA-384
sha512:<hex> → SHA-512
```

A mismatch fails the blob replay verification step.
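
A minimal sketch of this check, using the built-in hash implementations, is shown below. The `BlobDigest` type is hypothetical; the prefix-to-algorithm mapping follows the list above.

```csharp
using System;
using System.Security.Cryptography;

// Sketch only: the class name is hypothetical.
public static class BlobDigest
{
    // Returns true when `content` hashes to the declared "<algorithm>:<hex>" digest.
    public static bool Matches(string declaredDigest, byte[] content)
    {
        var parts = declaredDigest.Split(':', 2);
        if (parts.Length != 2) return false;

        // Select the hash function from the algorithm prefix.
        byte[] actual = parts[0] switch
        {
            "sha256" => SHA256.HashData(content),
            "sha384" => SHA384.HashData(content),
            "sha512" => SHA512.HashData(content),
            _ => throw new NotSupportedException($"Unsupported digest algorithm: {parts[0]}")
        };

        // Hex comparison ignores case so upper- and lowercase digests both verify.
        return string.Equals(Convert.ToHexString(actual), parts[1], StringComparison.OrdinalIgnoreCase);
    }
}
```

Throwing on an unknown prefix, rather than returning `false`, distinguishes an unsupported algorithm from a genuine content mismatch. That split is a design choice of this sketch.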

### 10.7 Related Documentation

- [Golden Corpus KPIs](../../benchmarks/golden-corpus-kpis.md)
- [Golden Corpus Seed List](../../benchmarks/golden-corpus-seed-list.md)
- [Ground-Truth Corpus Specification](../../benchmarks/ground-truth-corpus.md)

---

## 11. References

- Advisory: `docs/product/advisories/21-Dec-2025 - Mapping Evidence Within Compiled Binaries.md`
- Scanner Native Analysis: `src/Scanner/StellaOps.Scanner.Analyzers.Native/`
- Existing Fingerprinting: `src/Scanner/__Libraries/StellaOps.Scanner.EntryTrace/Binary/`
- Build-ID Index: `src/Scanner/StellaOps.Scanner.Analyzers.Native/Index/`
- **Semantic Diffing Sprint:** `docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
- **Semantic Library:** `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/`
- **Semantic Tests:** `src/BinaryIndex/__Tests/StellaOps.BinaryIndex.Semantic.Tests/`
- **Golden Corpus Sprints:** `docs/implplan/SPRINT_20260121_034_BinaryIndex_golden_corpus_foundation.md`

---

*Document Version: 1.2.0*
*Last Updated: 2026-01-21*