git.stella-ops.org/docs/modules/binary-index/architecture.md
2026-01-24 00:12:43 +02:00


# BinaryIndex Module Architecture
> **Ownership:** Scanner Guild + Concelier Guild
> **Status:** DRAFT
> **Version:** 1.0.0
> **Related:** [High-Level Architecture](../../ARCHITECTURE_OVERVIEW.md), [Scanner Architecture](../scanner/architecture.md), [Concelier Architecture](../concelier/architecture.md)
---
## 1. Overview
The **BinaryIndex** module provides a vulnerable binaries database that enables detection of vulnerable code at the binary level, independent of package metadata. This addresses a critical gap in vulnerability scanning: package version strings can lie (backports, custom builds, stripped metadata), but **binary identity doesn't lie**.
### 1.1 Problem Statement
Traditional vulnerability scanners rely on package version matching, which fails in several scenarios:
1. **Backported patches** - Distros backport security fixes without changing upstream version
2. **Custom/vendored builds** - Binaries compiled from source without package metadata
3. **Stripped binaries** - Debug info and version strings removed
4. **Static linking** - Vulnerable library code embedded in final binary
5. **Container base images** - Distroless or scratch images with no package DB
### 1.2 Solution: Binary-First Vulnerability Detection
BinaryIndex provides three tiers of binary identification:
| Tier | Method | Precision | Coverage |
|------|--------|-----------|----------|
| A | Package/version range matching | Medium | High |
| B | Build-ID/hash catalog (exact binary identity) | High | Medium |
| C | Function fingerprints (CFG/basic-block hashes) | Very High | Targeted |
### 1.3 Module Scope
**In Scope:**
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
- Binary-to-advisory mapping database
- Fingerprint storage and matching engine
- Fix index for patch-aware backport handling
- Integration with Scanner.Worker for binary lookup
**Out of Scope:**
- Binary disassembly/analysis (provided by Scanner.Analyzers.Native)
- Runtime binary tracing (provided by Zastava)
- SBOM generation (provided by Scanner)
---
## 2. Architecture
### 2.1 System Context
```
┌──────────────────────────────────────────────────────────────────────────┐
│ External Systems │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Distro Repos │ │ Debug Symbol │ │ Upstream Source │ │
│ │ (Debian, RPM, │ │ Servers │ │ (GitHub, etc.) │ │
│ │ Alpine) │ │ (debuginfod) │ │ │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
└───────────│─────────────────────│─────────────────────│──────────────────┘
│ │ │
v v v
┌──────────────────────────────────────────────────────────────────────────┐
│ BinaryIndex Module │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Corpus Ingestion Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ DebianCorpus │ │ RpmCorpus │ │ AlpineCorpus │ │ │
│ │ │ Connector │ │ Connector │ │ Connector │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Processing Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ BinaryFeature│ │ FixIndex │ │ Fingerprint │ │ │
│ │ │ Extractor │ │ Builder │ │ Generator │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ PostgreSQL │ │ RustFS │ │ Valkey │ │ │
│ │ │ (binaries │ │ (fingerprint │ │ (lookup │ │ │
│ │ │ schema) │ │ blobs) │ │ cache) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Query Layer │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ IBinaryVulnerabilityService │ │ │
│ │ │ - LookupByIdentityAsync(identity) │ │ │
│ │ │ - LookupByFingerprintAsync(fingerprint) │ │ │
│ │ │ - LookupBatchAsync(identities) │ │ │
│ │ │ - GetFixStatusAsync(distro, release, sourcePkg, cve) │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
v
┌──────────────────────────────────────────────────────────────────────────┐
│ Consuming Modules │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Scanner.Worker │ │ Policy Engine │ │ Findings Ledger │ │
│ │ (binary lookup │ │ (evidence in │ │ (match records) │ │
│ │ during scan) │ │ proof chain) │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
```
### 2.2 Component Breakdown
#### 2.2.1 Corpus Connectors
Plugin-based connectors that ingest binaries from distribution repositories.
```csharp
public interface IBinaryCorpusConnector
{
string ConnectorId { get; }
string[] SupportedDistros { get; }
Task<CorpusSnapshot> FetchSnapshotAsync(CorpusQuery query, CancellationToken ct);
Task<IAsyncEnumerable<ExtractedBinary>> ExtractBinariesAsync(PackageReference pkg, CancellationToken ct);
}
```
**Implementations:**
- `DebianBinaryCorpusConnector` - Debian/Ubuntu packages + debuginfo
- `RpmBinaryCorpusConnector` - RHEL/Fedora/CentOS + SRPM
- `AlpineBinaryCorpusConnector` - Alpine APK + APKBUILD
#### 2.2.2 Binary Feature Extractor
Extracts identity and features from binaries. Reuses existing Scanner.Analyzers.Native capabilities.
```csharp
public interface IBinaryFeatureExtractor
{
Task<BinaryIdentity> ExtractIdentityAsync(Stream binaryStream, CancellationToken ct);
Task<BinaryFeatures> ExtractFeaturesAsync(Stream binaryStream, ExtractorOptions opts, CancellationToken ct);
}
public sealed record BinaryIdentity(
string Format, // elf, pe, macho
string? BuildId, // ELF GNU Build-ID
string? PeCodeViewGuid, // PE CodeView GUID + Age
string? MachoUuid, // Mach-O LC_UUID
string FileSha256,
string TextSectionSha256);
public sealed record BinaryFeatures(
BinaryIdentity Identity,
string[] DynamicDeps, // DT_NEEDED
string[] ExportedSymbols,
string[] ImportedSymbols,
BinaryHardening Hardening);
```
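
A consumer usually wants a single lookup key per binary. The following is a minimal sketch of one way to pick the strongest identifier from a `BinaryIdentity`, assuming a hypothetical `IdentityKey.For` helper and a `kind:value` key format; neither is part of the module's API.

```csharp
// Mirrors the BinaryIdentity record above so the sketch is self-contained.
public sealed record BinaryIdentity(
    string Format,
    string? BuildId,
    string? PeCodeViewGuid,
    string? MachoUuid,
    string FileSha256,
    string TextSectionSha256);

public static class IdentityKey
{
    // Hypothetical helper: prefer the format-native build identifier and
    // fall back to the whole-file hash when the binary carries none.
    public static string For(BinaryIdentity id) => id switch
    {
        { Format: "elf", BuildId: { Length: > 0 } b } => $"build-id:{b.ToLowerInvariant()}",
        { Format: "pe", PeCodeViewGuid: { Length: > 0 } g } => $"codeview:{g.ToLowerInvariant()}",
        { Format: "macho", MachoUuid: { Length: > 0 } u } => $"macho-uuid:{u.ToLowerInvariant()}",
        _ => $"sha256:{id.FileSha256.ToLowerInvariant()}",
    };
}
```

Falling back to `FileSha256` keeps stripped binaries addressable, although a file hash only matches byte-identical files.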
#### 2.2.3 Fix Index Builder
Builds the patch-aware CVE fix index from distro sources.
```csharp
public interface IFixIndexBuilder
{
Task BuildIndexAsync(DistroRelease distro, CancellationToken ct);
Task<FixRecord?> GetFixRecordAsync(string distro, string release, string sourcePkg, string cveId, CancellationToken ct);
}
public sealed record FixRecord(
string Distro,
string Release,
string SourcePkg,
string CveId,
FixState State, // fixed, vulnerable, not_affected, wontfix, unknown
string? FixedVersion, // Distro version string
FixMethod Method, // security_feed, changelog, patch_header, upstream_match
decimal Confidence, // 0.00-1.00
FixEvidence Evidence);
public enum FixState { Fixed, Vulnerable, NotAffected, Wontfix, Unknown }
public enum FixMethod { SecurityFeed, Changelog, PatchHeader, UpstreamPatchMatch }
```
#### 2.2.4 Fingerprint Generator
Generates function-level fingerprints for vulnerable code detection.
```csharp
public interface IVulnFingerprintGenerator
{
Task<ImmutableArray<VulnFingerprint>> GenerateAsync(
string cveId,
BinaryPair vulnAndFixed, // Reference builds
FingerprintOptions opts,
CancellationToken ct);
}
public sealed record VulnFingerprint(
string CveId,
string Component, // e.g., openssl
string Architecture, // x86-64, aarch64
FingerprintType Type, // basic_block, cfg, combined
string FingerprintId, // e.g., "bb-abc123..."
byte[] FingerprintHash, // 16-32 bytes
string? FunctionHint, // Function name if known
decimal Confidence,
FingerprintEvidence Evidence);
public enum FingerprintType { BasicBlock, ControlFlowGraph, StringReferences, Combined }
```
#### 2.2.5 Semantic Analysis Library
> **Library:** `StellaOps.BinaryIndex.Semantic`
> **Sprint:** 20260105_001_001_BINDEX - Semantic Diffing Phase 1
The Semantic Analysis Library extends fingerprint generation with IR-level semantic matching, enabling detection of semantically equivalent code despite compiler optimizations, instruction reordering, and register allocation differences.
**Key Insight:** Traditional instruction-level fingerprinting loses roughly 15-20% accuracy on optimized binaries. Semantic analysis instead lifts code to B2R2's Intermediate Representation (LowUIR), extracts key-semantics graphs, and compares them using graph hashing.
##### 2.2.5.1 Architecture
```
Binary Input
v
B2R2 Disassembly → Raw Instructions
v
IR Lifting Service → LowUIR Statements
v
Semantic Graph Extractor → Key-Semantics Graph (KSG)
v
Graph Fingerprinting → Semantic Fingerprint
v
Semantic Matcher → Similarity Score + Deltas
```
##### 2.2.5.2 Core Components
**IR Lifting Service** (`IIrLiftingService`)
Lifts disassembled instructions to B2R2 LowUIR:
```csharp
public interface IIrLiftingService
{
Task<LiftedFunction> LiftToIrAsync(
IReadOnlyList<DisassembledInstruction> instructions,
string functionName,
LiftOptions? options = null,
CancellationToken ct = default);
}
public sealed record LiftedFunction(
string Name,
ImmutableArray<IrStatement> Statements,
ImmutableArray<IrBasicBlock> BasicBlocks);
```
**Semantic Graph Extractor** (`ISemanticGraphExtractor`)
Extracts key-semantics graphs capturing data dependencies, control flow, and memory operations:
```csharp
public interface ISemanticGraphExtractor
{
Task<KeySemanticsGraph> ExtractGraphAsync(
LiftedFunction function,
GraphExtractionOptions? options = null,
CancellationToken ct = default);
}
public sealed record KeySemanticsGraph(
string FunctionName,
ImmutableArray<SemanticNode> Nodes,
ImmutableArray<SemanticEdge> Edges,
GraphProperties Properties);
public enum SemanticNodeType { Compute, Load, Store, Branch, Call, Return, Phi }
public enum SemanticEdgeType { DataDependency, ControlDependency, MemoryDependency }
```
**Semantic Fingerprint Generator** (`ISemanticFingerprintGenerator`)
Generates semantic fingerprints using Weisfeiler-Lehman graph hashing:
```csharp
public interface ISemanticFingerprintGenerator
{
Task<SemanticFingerprint> GenerateAsync(
KeySemanticsGraph graph,
SemanticFingerprintOptions? options = null,
CancellationToken ct = default);
}
public sealed record SemanticFingerprint(
string FunctionName,
string GraphHashHex, // WL graph hash (SHA-256)
string OperationHashHex, // Normalized operation sequence hash
string DataFlowHashHex, // Data dependency pattern hash
int NodeCount,
int EdgeCount,
int CyclomaticComplexity,
ImmutableArray<string> ApiCalls,
SemanticFingerprintAlgorithm Algorithm);
```
**Semantic Matcher** (`ISemanticMatcher`)
Computes semantic similarity with weighted components:
```csharp
public interface ISemanticMatcher
{
Task<SemanticMatchResult> MatchAsync(
SemanticFingerprint a,
SemanticFingerprint b,
MatchOptions? options = null,
CancellationToken ct = default);
Task<SemanticMatchResult> MatchWithDeltasAsync(
SemanticFingerprint a,
SemanticFingerprint b,
MatchOptions? options = null,
CancellationToken ct = default);
}
public sealed record SemanticMatchResult(
decimal Similarity, // 0.00-1.00
decimal GraphSimilarity,
decimal OperationSimilarity,
decimal DataFlowSimilarity,
decimal ApiCallSimilarity,
MatchConfidence Confidence);
```
##### 2.2.5.3 Algorithm Details
**Weisfeiler-Lehman Graph Hashing:**
- 3 iterations of label propagation
- SHA-256 for final hash computation
- Deterministic node ordering via canonical sort
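
The three properties above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the `WlGraphHash` class, the `|`, `,` and `;` separators, and the choice to propagate labels along outgoing edges only are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class WlGraphHash
{
    // labels[i] is the initial label of node i (for example its operation kind);
    // edges are directed (From, To) pairs over node indices.
    public static string Compute(
        IReadOnlyList<string> labels,
        IReadOnlyList<(int From, int To)> edges,
        int iterations = 3)
    {
        var current = labels.ToArray();
        for (var round = 0; round < iterations; round++)
        {
            var next = new string[current.Length];
            for (var node = 0; node < current.Length; node++)
            {
                // Successor labels, sorted ordinally so the edge order in the input does not matter.
                var successors = edges
                    .Where(e => e.From == node)
                    .Select(e => current[e.To])
                    .OrderBy(l => l, StringComparer.Ordinal);
                next[node] = Sha256Hex(current[node] + "|" + string.Join(",", successors));
            }
            current = next;
        }

        // Sorting the final labels makes the hash independent of node numbering.
        return Sha256Hex(string.Join(";", current.OrderBy(l => l, StringComparer.Ordinal)));
    }

    private static string Sha256Hex(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text))).ToLowerInvariant();
}
```

Two graphs that differ only in node numbering produce the same hash, which is the property the canonical sort is there to provide.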
**Similarity Weights (Default):**
| Component | Weight |
|-----------|--------|
| Graph Hash | 0.35 |
| Operation Hash | 0.25 |
| Data Flow Hash | 0.25 |
| API Calls | 0.15 |
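
As a worked illustration of the default weights, the sketch below combines four component scores into one similarity value. It assumes each component score is already normalized to the range 0.00-1.00; how the library derives those scores from the individual hashes is not covered here, and the `SemanticSimilarity` class is hypothetical.

```csharp
public static class SemanticSimilarity
{
    // Default weights from the table above; they sum to 1.00.
    private const decimal GraphWeight = 0.35m;
    private const decimal OperationWeight = 0.25m;
    private const decimal DataFlowWeight = 0.25m;
    private const decimal ApiCallWeight = 0.15m;

    // Weighted sum of the four component similarities.
    public static decimal Combine(decimal graph, decimal operation, decimal dataFlow, decimal apiCalls) =>
        GraphWeight * graph
        + OperationWeight * operation
        + DataFlowWeight * dataFlow
        + ApiCallWeight * apiCalls;
}
```

For component scores of 1.0, 0.8, 0.6 and 0.5, the combined value is 0.35 + 0.20 + 0.15 + 0.075 = 0.775.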
##### 2.2.5.4 Integration Points
The semantic library integrates with existing BinaryIndex components:
**DeltaSignatureGenerator Extension:**
```csharp
// Optional semantic services via constructor injection
services.AddDeltaSignaturesWithSemantic();
// Extended SymbolSignature with semantic properties
public sealed record SymbolSignature
{
// ... existing properties ...
public string? SemanticHashHex { get; init; }
public ImmutableArray<string> SemanticApiCalls { get; init; }
}
```
**PatchDiffEngine Extension:**
```csharp
// SemanticWeight in HashWeights
public decimal SemanticWeight { get; init; } = 0.2m;
// FunctionFingerprint extended with semantic fingerprint
public SemanticFingerprint? SemanticFingerprint { get; init; }
```
##### 2.2.5.5 Test Coverage
| Category | Tests | Coverage |
|----------|-------|----------|
| Unit Tests (IR lifting, graph extraction, hashing) | 53 | Core algorithms |
| Integration Tests (full pipeline) | 9 | End-to-end flow |
| Golden Corpus (compiler variations) | 11 | Register allocation, optimization, compiler variants |
| Benchmarks (accuracy, performance) | 7 | Baseline metrics |
##### 2.2.5.6 Current Baselines
> **Note:** Baselines reflect foundational implementation; accuracy improves as semantic features mature.
| Metric | Baseline | Target |
|--------|----------|--------|
| Similarity (register allocation variants) | ≥0.55 | ≥0.85 |
| Overall accuracy | ≥40% | ≥70% |
| False positive rate | <10% | <5% |
| P95 fingerprint latency | <100ms | <50ms |
##### 2.2.5.7 B2R2 LowUIR Adapter
`B2R2LowUirLiftingService` implements `IIrLiftingService` using B2R2's native lifting capabilities, giving semantic analysis an architecture-independent IR.
**Key Components:**
```csharp
public sealed class B2R2LowUirLiftingService : IIrLiftingService
{
// Lifts to B2R2 LowUIR and maps to Stella IR model
public Task<LiftedFunction> LiftToIrAsync(
IReadOnlyList<DisassembledInstruction> instructions,
string functionName,
LiftOptions? options = null,
CancellationToken ct = default);
}
```
**Supported ISAs:**
- Intel (x86-32, x86-64)
- ARM (ARMv7, ARMv8/ARM64)
- MIPS (32/64)
- RISC-V (64)
- PowerPC, SPARC, SH4, AVR, EVM
**IR Statement Mapping:**
| B2R2 LowUIR | Stella IR Kind |
|-------------|----------------|
| Put | IrStatementKind.Store |
| Store | IrStatementKind.Store |
| Get | IrStatementKind.Load |
| Load | IrStatementKind.Load |
| BinOp | IrStatementKind.BinaryOp |
| UnOp | IrStatementKind.UnaryOp |
| Jmp | IrStatementKind.Jump |
| CJmp | IrStatementKind.ConditionalJump |
| InterJmp | IrStatementKind.IndirectJump |
| Call | IrStatementKind.Call |
| SideEffect | IrStatementKind.SideEffect |
**Determinism Guarantees:**
- Statements ordered by block address (ascending)
- Blocks sorted by entry address (ascending)
- Consistent IR IDs across identical inputs
- InvariantCulture used for all string formatting
##### 2.2.5.8 B2R2 Lifter Pool
The `B2R2LifterPool` provides bounded pooling and warm preload for B2R2 lifting units to reduce per-call allocation overhead.
**Configuration (`B2R2LifterPoolOptions`):**
| Option | Default | Description |
|--------|---------|-------------|
| `MaxPoolSizePerIsa` | 4 | Maximum pooled lifters per ISA |
| `EnableWarmPreload` | true | Preload lifters at startup |
| `WarmPreloadIsas` | ["intel-64", "intel-32", "armv8-64", "armv7-32"] | ISAs to warm |
| `AcquireTimeout` | 5s | Timeout for acquiring a lifter |
**Pool Statistics:**
- `TotalPooledLifters`: Lifters currently in pool
- `TotalActiveLifters`: Lifters currently in use
- `IsWarm`: Whether pool has been warmed
- `IsaStats`: Per-ISA pool and active counts
**Usage:**
```csharp
using var lifter = _lifterPool.Acquire(isa);
var stmts = lifter.LiftingUnit.LiftInstruction(address);
// Lifter automatically returned to pool on dispose
```
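
The acquire-and-dispose pattern in the usage snippet can be illustrated with a generic bounded pool. This is a sketch only: `BoundedPool<T>` and its `Lease` type are hypothetical, and the real `B2R2LifterPool` (per-ISA pools, warm preload, statistics) may be built differently.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class BoundedPool<T> where T : class
{
    private readonly ConcurrentBag<T> _idle = new();
    private readonly SemaphoreSlim _slots;
    private readonly Func<T> _factory;
    private readonly TimeSpan _acquireTimeout;

    public BoundedPool(int maxSize, TimeSpan acquireTimeout, Func<T> factory)
    {
        _slots = new SemaphoreSlim(maxSize, maxSize);
        _acquireTimeout = acquireTimeout;
        _factory = factory;
    }

    public int IdleCount => _idle.Count;

    public Lease Acquire()
    {
        // Block until a slot frees up, or give up after the configured timeout.
        if (!_slots.Wait(_acquireTimeout))
            throw new TimeoutException("No pooled instance became available in time.");
        var item = _idle.TryTake(out var pooled) ? pooled : _factory();
        return new Lease(this, item);
    }

    private void Return(T item)
    {
        _idle.Add(item);
        _slots.Release();
    }

    public sealed class Lease : IDisposable
    {
        private readonly BoundedPool<T> _owner;
        private bool _returned;

        internal Lease(BoundedPool<T> owner, T value) { _owner = owner; Value = value; }

        public T Value { get; }

        public void Dispose()
        {
            if (_returned) return; // A second Dispose must not release a second slot.
            _returned = true;
            _owner.Return(Value);
        }
    }
}
```

The semaphore enforces the upper bound (the role `MaxPoolSizePerIsa` plays above), while the bag lets returned instances be reused instead of recreated.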
##### 2.2.5.9 Function IR Cache
The `FunctionIrCacheService` provides Valkey-backed caching for computed semantic fingerprints to avoid redundant IR lifting and graph hashing.
**Cache Key Structure:**
```
(isa, b2r2_version, normalization_recipe, canonical_ir_hash)
```
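
One way to turn that tuple into a Valkey key is sketched below. The colon-joined layout and the `FunctionCacheKey` helper are assumptions for illustration; only the four components and the key prefix come from this document.

```csharp
public static class FunctionCacheKey
{
    // Joins the four key components behind the configured prefix.
    // ISA and hash are lower-cased so equivalent inputs map to the same key.
    public static string Build(
        string keyPrefix,
        string isa,
        string b2r2Version,
        string normalizationRecipe,
        string canonicalIrHash) =>
        keyPrefix + string.Join(
            ':',
            isa.ToLowerInvariant(),
            b2r2Version,
            normalizationRecipe,
            canonicalIrHash.ToLowerInvariant());
}
```

Because the B2R2 version and the recipe version are part of the key, changing either one produces different keys, and older entries simply stop being hit until they expire.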
**Configuration (`FunctionIrCacheOptions`):**
| Option | Default | Description |
|--------|---------|-------------|
| `KeyPrefix` | "stellaops:binidx:funccache:" | Valkey key prefix |
| `CacheTtl` | 4h | TTL for cached entries |
| `MaxTtl` | 24h | Maximum TTL |
| `Enabled` | true | Whether caching is enabled |
| `B2R2Version` | "0.9.1" | B2R2 version for cache key |
| `NormalizationRecipeVersion` | "v1" | Recipe version for cache key |
**Cache Entry (`CachedFunctionFingerprint`):**
- `FunctionAddress`, `FunctionName`
- `SemanticFingerprint`: The computed fingerprint
- `IrStatementCount`, `BasicBlockCount`
- `ComputedAtUtc`: ISO-8601 timestamp
- `B2R2Version`, `NormalizationRecipe`
**Invalidation Rules:**
- Cache entries expire after `CacheTtl` (default 4h)
- Changing B2R2 version or normalization recipe results in cache misses
- Manual invalidation via `RemoveAsync()`
**Statistics:**
- Hits, Misses, Evictions
- Hit Rate
- Enabled status
##### 2.2.5.10 Ops Endpoints
BinaryIndex exposes operational endpoints for health, benchmarking, cache monitoring, and configuration visibility.
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/v1/ops/binaryindex/health` | GET | Health status with lifter warmness, cache availability |
| `/api/v1/ops/binaryindex/bench/run` | POST | Run benchmark, return latency stats |
| `/api/v1/ops/binaryindex/cache` | GET | Function IR cache hit/miss statistics |
| `/api/v1/ops/binaryindex/config` | GET | Effective configuration (secrets redacted) |
**Health Response:**
```json
{
"status": "healthy",
"timestamp": "2026-01-14T12:00:00Z",
"lifterStatus": "warm",
"lifterWarm": true,
"lifterPoolStats": { "intel-64": 4, "armv8-64": 2 },
"cacheStatus": "enabled",
"cacheEnabled": true
}
```
**Determinism Constraints:**
- All timestamps in ISO-8601 UTC format
- ASCII-only output
- Deterministic JSON key ordering
- Secrets/credentials redacted from config endpoint
#### 2.2.6 Binary Vulnerability Service
Main query interface for consumers.
```csharp
public interface IBinaryVulnerabilityService
{
/// <summary>
/// Look up vulnerabilities by Build-ID or equivalent binary identity.
/// </summary>
Task<ImmutableArray<BinaryVulnMatch>> LookupByIdentityAsync(
BinaryIdentity identity,
LookupOptions? opts = null,
CancellationToken ct = default);
/// <summary>
/// Look up vulnerabilities by function fingerprint.
/// </summary>
Task<ImmutableArray<BinaryVulnMatch>> LookupByFingerprintAsync(
CodeFingerprint fingerprint,
decimal minSimilarity = 0.95m,
CancellationToken ct = default);
/// <summary>
/// Batch lookup for scan performance.
/// </summary>
Task<ImmutableDictionary<string, ImmutableArray<BinaryVulnMatch>>> LookupBatchAsync(
IEnumerable<BinaryIdentity> identities,
LookupOptions? opts = null,
CancellationToken ct = default);
/// <summary>
/// Get distro-specific fix status (patch-aware).
/// </summary>
Task<FixRecord?> GetFixStatusAsync(
string distro,
string release,
string sourcePkg,
string cveId,
CancellationToken ct = default);
}
public sealed record BinaryVulnMatch(
string CveId,
string VulnerablePurl,
MatchMethod Method, // buildid_catalog, fingerprint_match, range_match
decimal Confidence,
MatchEvidence Evidence);
public enum MatchMethod { BuildIdCatalog, FingerprintMatch, RangeMatch }
```
---
## 3. Data Model
### 3.1 PostgreSQL Schema (`binaries`)
The `binaries` schema stores binary identity, fingerprint, and match data.
```sql
CREATE SCHEMA IF NOT EXISTS binaries;
CREATE SCHEMA IF NOT EXISTS binaries_app;
-- RLS helper
CREATE OR REPLACE FUNCTION binaries_app.require_current_tenant()
RETURNS TEXT LANGUAGE plpgsql STABLE SECURITY DEFINER AS $$
DECLARE v_tenant TEXT;
BEGIN
v_tenant := current_setting('app.tenant_id', true);
IF v_tenant IS NULL OR v_tenant = '' THEN
RAISE EXCEPTION 'app.tenant_id session variable not set';
END IF;
RETURN v_tenant;
END;
$$;
```
#### 3.1.1 Core Tables
See `docs/db/schemas/binaries_schema_specification.md` for complete DDL.
**Key Tables:**
| Table | Purpose |
|-------|---------|
| `binaries.binary_identity` | Known binary identities (Build-ID, hashes) |
| `binaries.binary_package_map` | Binary-to-package mapping per corpus snapshot |
| `binaries.vulnerable_buildids` | Build-IDs known to be vulnerable |
| `binaries.vulnerable_fingerprints` | Function fingerprints for CVEs |
| `binaries.cve_fix_index` | Patch-aware fix status per distro |
| `binaries.fingerprint_matches` | Match results (findings evidence) |
| `binaries.corpus_snapshots` | Corpus ingestion tracking |
### 3.2 RustFS Layout
```
rustfs://stellaops/binaryindex/
fingerprints/<algorithm>/<prefix>/<fingerprint_id>.bin
corpus/<distro>/<release>/<snapshot_id>/manifest.json
corpus/<distro>/<release>/<snapshot_id>/packages/<pkg>.metadata.json
evidence/<match_id>.dsse.json
```
---
## 4. Integration Points
### 4.1 Scanner.Worker Integration
During container scanning, Scanner.Worker queries BinaryIndex for each extracted binary:
```mermaid
sequenceDiagram
participant SW as Scanner.Worker
participant BI as BinaryIndex
participant PG as PostgreSQL
participant FL as Findings Ledger
SW->>SW: Extract binary from layer
SW->>SW: Compute BinaryIdentity
SW->>BI: LookupByIdentityAsync(identity)
BI->>PG: Query binaries.vulnerable_buildids
PG-->>BI: Matches
BI->>PG: Query binaries.cve_fix_index (if distro known)
PG-->>BI: Fix status
BI-->>SW: BinaryVulnMatch[]
SW->>FL: RecordFinding(match, evidence)
```
### 4.2 Concelier Integration
BinaryIndex subscribes to Concelier's advisory updates:
```mermaid
sequenceDiagram
participant CO as Concelier
participant BI as BinaryIndex
participant PG as PostgreSQL
CO->>CO: Ingest new advisory
CO->>BI: advisory.created event
BI->>BI: Check if affected packages in corpus
BI->>PG: Update binaries.binary_vuln_assertion
BI->>BI: Queue fingerprint generation (if high-impact)
```
### 4.3 Policy Integration
Binary matches are recorded as proof segments:
```json
{
"segment_type": "binary_fingerprint_evidence",
"payload": {
"binary_identity": {
"format": "elf",
"build_id": "abc123...",
"file_sha256": "def456..."
},
"matches": [
{
"cve_id": "CVE-2024-1234",
"method": "buildid_catalog",
"confidence": 0.98,
"vulnerable_purl": "pkg:deb/debian/libssl3@1.1.1n-0+deb11u3"
}
]
}
}
```
---
## 5. MVP Roadmap
### MVP 1: Known-Build Binary Catalog (Sprint 6000.0001)
**Goal:** Query "is this Build-ID vulnerable?" with distro-level precision.
**Deliverables:**
- `binaries` PostgreSQL schema
- Build-ID to package mapping tables
- Basic CVE lookup by binary identity
- Debian/Ubuntu corpus connector
### MVP 2: Patch-Aware Backport Handling (Sprint 6000.0002)
**Goal:** Handle "version says vulnerable but distro backported the fix."
**Deliverables:**
- Fix index builder (changelog + patch header parsing)
- Distro-specific version comparison
- RPM corpus connector
- Scanner.Worker integration
### MVP 3: Binary Fingerprint Factory (Sprint 6000.0003)
**Goal:** Detect vulnerable code independent of package metadata.
**Deliverables:**
- Fingerprint storage and matching
- Reference build generation pipeline
- Fingerprint validation corpus
- High-impact CVE coverage (OpenSSL, glibc, zlib, curl)
### MVP 4: Full Scanner Integration (Sprint 6000.0004)
**Goal:** Binary evidence in production scans.
**Deliverables:**
- Scanner.Worker binary lookup integration
- Findings Ledger binary match records
- Proof segment attestations
- CLI binary match inspection
---
## 5b. Fix Evidence Chain
The **Fix Evidence Chain** provides auditable proof of why a CVE is marked as fixed (or not) for a specific distro/package combination. This is critical for patch-aware backport handling where package versions can be misleading.
### 5b.1 Evidence Sources
| Source | Confidence | Description |
|--------|------------|-------------|
| **Security Feed (OVAL)** | 0.95-0.99 | Authoritative feed from distro (Debian Security Tracker, Red Hat OVAL) |
| **Patch Header (DEP-3)** | 0.87-0.95 | CVE reference in Debian/Ubuntu patch metadata |
| **Changelog** | 0.75-0.85 | CVE mention in debian/changelog or RPM %changelog |
| **Upstream Patch Match** | 0.90 | Binary diff matches known upstream fix |
### 5b.2 Evidence Storage
Evidence is stored in two PostgreSQL tables:
```sql
-- Evidence blobs: audit trail (created first because cve_fix_index references it)
CREATE TABLE binaries.fix_evidence (
    id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    evidence_type TEXT NOT NULL, -- changelog, patch_header, security_feed
    source_file TEXT,            -- Path to source file (changelog, patch)
    source_sha256 TEXT,          -- Hash of source file
    excerpt TEXT,                -- Relevant snippet (max 1KB)
    metadata JSONB NOT NULL,     -- Structured metadata
    snapshot_id UUID,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Fix index: one row per (tenant_id, distro, release, source_pkg, cve_id)
CREATE TABLE binaries.cve_fix_index (
    id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    distro TEXT NOT NULL,        -- debian, ubuntu, alpine, rhel
    release TEXT NOT NULL,       -- bookworm, jammy, v3.19
    source_pkg TEXT NOT NULL,
    cve_id TEXT NOT NULL,
    state TEXT NOT NULL,         -- fixed, vulnerable, not_affected, wontfix, unknown
    fixed_version TEXT,
    method TEXT NOT NULL,        -- security_feed, changelog, patch_header, upstream_match
    confidence DECIMAL(3,2) NOT NULL,
    evidence_id UUID REFERENCES binaries.fix_evidence(id),
    snapshot_id UUID,
    indexed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (tenant_id, distro, release, source_pkg, cve_id)
);
```
### 5b.3 Evidence Types
**ChangelogEvidence:**
```json
{
"evidence_type": "changelog",
"source_file": "debian/changelog",
"excerpt": "* Fix CVE-2024-0727: PKCS12 decoding crash",
"metadata": {
"version": "3.0.11-1~deb12u2",
"line_number": 5
}
}
```
**PatchHeaderEvidence:**
```json
{
"evidence_type": "patch_header",
"source_file": "debian/patches/CVE-2024-0727.patch",
"excerpt": "CVE: CVE-2024-0727\nOrigin: upstream, https://github.com/openssl/commit/abc123",
"metadata": {
"patch_sha256": "abc123def456..."
}
}
```
**SecurityFeedEvidence:**
```json
{
"evidence_type": "security_feed",
"metadata": {
"feed_id": "debian-security-tracker",
"entry_id": "DSA-5678-1",
"published_at": "2024-01-15T10:00:00Z"
}
}
```
### 5b.4 Confidence Resolution
When multiple evidence sources exist for the same CVE, the system keeps the **highest confidence** entry:
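```sql
-- Fragment of the upsert into binaries.cve_fix_index. `existing` is the alias of the
-- target table (INSERT INTO binaries.cve_fix_index AS existing ...); EXCLUDED is the
-- row proposed for insertion.
ON CONFLICT (tenant_id, distro, release, source_pkg, cve_id)
DO UPDATE SET
    confidence = GREATEST(existing.confidence, EXCLUDED.confidence),
    method = CASE
        WHEN existing.confidence < EXCLUDED.confidence THEN EXCLUDED.method
        ELSE existing.method
    END,
    evidence_id = CASE
        WHEN existing.confidence < EXCLUDED.confidence THEN EXCLUDED.evidence_id
        ELSE existing.evidence_id
    END
```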
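
The same resolution rule can be expressed in application code, for example in a unit test double for the repository. The sketch below is a simplified, hypothetical in-memory model: `FixCandidate` and `InMemoryFixIndex` are illustrative names, and the string `key` stands in for the `(tenant_id, distro, release, source_pkg, cve_id)` tuple.

```csharp
using System.Collections.Generic;

// Simplified stand-in for a fix-index row: only the columns the upsert touches.
public sealed record FixCandidate(string Method, decimal Confidence, string EvidenceId);

public sealed class InMemoryFixIndex
{
    private readonly Dictionary<string, FixCandidate> _rows = new();

    // Keeps the stored row unless the incoming one has strictly higher confidence,
    // matching the `existing.confidence < new confidence` condition in the upsert.
    public FixCandidate Upsert(string key, FixCandidate incoming)
    {
        if (_rows.TryGetValue(key, out var existing) && existing.Confidence >= incoming.Confidence)
            return existing;

        _rows[key] = incoming;
        return incoming;
    }
}
```

On a tie the existing row wins, so re-ingesting the same evidence does not churn the `method` or `evidence_id` columns.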
### 5b.5 Parsers
The following parsers extract CVE fix information:
| Parser | Distros | Input | Confidence |
|--------|---------|-------|------------|
| `DebianChangelogParser` | Debian, Ubuntu | debian/changelog | 0.80 |
| `PatchHeaderParser` | Debian, Ubuntu | debian/patches/*.patch (DEP-3) | 0.87 |
| `AlpineSecfixesParser` | Alpine | APKBUILD secfixes block | 0.95 |
| `RpmChangelogParser` | RHEL, Fedora, CentOS | RPM spec %changelog | 0.75 |
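
To give a feel for what the changelog parsers look for, here is a minimal sketch that finds CVE identifiers in changelog text and records the line of first mention. It is not the `DebianChangelogParser` implementation: `ChangelogCveScanner` is a hypothetical name, and a real parser would also attribute each mention to the version stanza it appears under.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ChangelogCveScanner
{
    // CVE identifiers: "CVE-", a four-digit year, then four or more digits.
    private static readonly Regex CvePattern =
        new(@"\bCVE-\d{4}-\d{4,}\b", RegexOptions.Compiled | RegexOptions.CultureInvariant);

    // Returns (CVE id, 1-based line number) pairs in order of first appearance.
    public static IReadOnlyList<(string CveId, int LineNumber)> Scan(string changelogText)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var hits = new List<(string CveId, int LineNumber)>();
        var lines = changelogText.Split('\n');

        for (var i = 0; i < lines.Length; i++)
        {
            foreach (Match match in CvePattern.Matches(lines[i]))
            {
                // Report each CVE once, at the first line that mentions it.
                if (seen.Add(match.Value))
                    hits.Add((match.Value, i + 1));
            }
        }

        return hits;
    }
}
```

The recorded line number corresponds to the `line_number` field shown in the changelog evidence example in section 5b.3.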
### 5b.6 Query Flow
```mermaid
sequenceDiagram
participant SW as Scanner.Worker
participant BVS as BinaryVulnerabilityService
participant FIR as FixIndexRepository
participant PG as PostgreSQL
SW->>BVS: GetFixStatusAsync(debian, bookworm, openssl, CVE-2024-0727)
BVS->>FIR: GetFixStatusAsync(...)
FIR->>PG: SELECT FROM cve_fix_index WHERE ...
PG-->>FIR: FixIndexEntry (state=fixed, confidence=0.87)
FIR-->>BVS: FixStatusResult
BVS-->>SW: {state: Fixed, confidence: 0.87, method: PatchHeader}
```
---
## 6. Security Considerations
### 6.1 Trust Boundaries
1. **Corpus Ingestion** - Packages are untrusted; extraction runs in sandboxed workers
2. **Fingerprint Generation** - Reference builds compiled in isolated environments
3. **Query API** - Tenant-isolated via RLS; no cross-tenant data leakage
### 6.2 Signing & Provenance
- All corpus snapshots are signed (DSSE)
- Fingerprint sets are versioned and signed
- Every match result references evidence digests
### 6.3 Sandbox Requirements
Binary extraction and fingerprint generation MUST run with:
- Seccomp profile restricting syscalls
- Read-only root filesystem
- No network access during analysis
- Memory/CPU limits
---
## 7. Observability
### 7.1 Metrics
| Metric | Type | Labels |
|--------|------|--------|
| `binaryindex_lookup_total` | Counter | method, result |
| `binaryindex_lookup_latency_ms` | Histogram | method |
| `binaryindex_corpus_packages_total` | Gauge | distro, release |
| `binaryindex_fingerprints_indexed` | Gauge | algorithm, component |
| `binaryindex_match_confidence` | Histogram | method |
### 7.2 Traces
- `binaryindex.lookup` - Full lookup span
- `binaryindex.corpus.ingest` - Corpus ingestion
- `binaryindex.fingerprint.generate` - Fingerprint generation
### 7.3 Ops Endpoints
> **Sprint:** SPRINT_20260112_007_BINIDX_binaryindex_user_config
BinaryIndex exposes read-only ops endpoints for health, bench, cache, and effective configuration:
| Endpoint | Method | Response Schema | Description |
|----------|--------|-----------------|-------------|
| `/api/v1/ops/binaryindex/health` | GET | `BinaryIndexOpsHealthResponse` | Health status, lifter warmness per ISA, cache availability |
| `/api/v1/ops/binaryindex/bench/run` | POST | `BinaryIndexBenchResponse` | Run latency benchmark, return min/max/mean/p50/p95/p99 stats |
| `/api/v1/ops/binaryindex/cache` | GET | `BinaryIndexFunctionCacheStats` | Function cache hit/miss/eviction statistics |
| `/api/v1/ops/binaryindex/config` | GET | `BinaryIndexEffectiveConfig` | Effective configuration with secrets redacted |
#### 7.3.1 Response Schemas
**BinaryIndexOpsHealthResponse:**
```json
{
"status": "healthy",
"timestamp": "2026-01-16T12:00:00Z",
"components": {
"lifterPool": { "status": "healthy", "message": null },
"functionCache": { "status": "healthy", "message": null },
"persistence": { "status": "healthy", "message": null }
},
"lifterWarmness": {
"intel-64": { "isa": "intel-64", "warm": true, "poolSize": 4, "acquireTimeMs": 12 },
"armv8-64": { "isa": "armv8-64", "warm": true, "poolSize": 2, "acquireTimeMs": 8 }
}
}
```
**BinaryIndexBenchResponse:**
```json
{
"timestamp": "2026-01-16T12:00:00Z",
"sampleSize": 100,
"latencySummary": {
"minMs": 5.2,
"maxMs": 142.8,
"meanMs": 28.4,
"p50Ms": 22.1,
"p95Ms": 78.3,
"p99Ms": 121.5
},
"operations": [
{ "operation": "lifterAcquire", "samples": 100, "meanMs": 12.4 },
{ "operation": "irNormalization", "samples": 100, "meanMs": 8.7 },
{ "operation": "cacheLookup", "samples": 100, "meanMs": 1.2 }
]
}
```
**BinaryIndexFunctionCacheStats:**
```json
{
"enabled": true,
"backend": "valkey",
"hits": 15234,
"misses": 892,
"evictions": 45,
"hitRate": 0.944,
"keyPrefix": "stellaops:binidx:funccache:",
"cacheTtlSeconds": 14400,
"estimatedEntries": 12500,
"estimatedMemoryBytes": 52428800
}
```
**BinaryIndexEffectiveConfig:**
```json
{
"b2r2Pool": {
"maxPoolSizePerIsa": 4,
"warmPreload": ["intel-64", "armv8-64"],
"acquireTimeoutMs": 5000,
"enableMetrics": true
},
"semanticLifting": {
"b2r2Version": "1.5.0",
"normalizationRecipeVersion": "2024.1",
"maxInstructionsPerFunction": 10000,
"maxFunctionsPerBinary": 5000,
"functionLiftTimeoutMs": 30000,
"enableDeduplication": true
},
"functionCache": {
"connectionString": "********",
"keyPrefix": "stellaops:binidx:funccache:",
"cacheTtlSeconds": 14400,
"maxTtlSeconds": 86400,
"earlyExpiryPercent": 0.1,
"maxEntrySizeBytes": 1048576
},
"persistence": {
"schema": "binaries",
"minPoolSize": 5,
"maxPoolSize": 20,
"commandTimeoutSeconds": 30,
"retryOnFailure": true,
"batchSize": 100
},
"backendVersions": {
"b2r2": "1.5.0",
"valkey": "7.2.0",
"postgres": "15.4"
}
}
```
#### 7.3.2 Rate Limiting
The `/bench/run` endpoint is rate-limited to prevent load spikes:
- Default: 5 requests per minute per tenant
- Configurable via `BinaryIndex:Ops:BenchRateLimitPerMinute`
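
A per-tenant limit of this kind can be sketched with a fixed-window counter. The document does not say which limiting algorithm the endpoint uses, so the fixed window, the `FixedWindowLimiter` class, and the caller-supplied clock are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public sealed class FixedWindowLimiter
{
    private readonly int _limit;
    private readonly TimeSpan _window;
    private readonly Dictionary<string, (DateTimeOffset WindowStart, int Count)> _state = new();
    private readonly object _gate = new();

    public FixedWindowLimiter(int limitPerWindow, TimeSpan window)
    {
        _limit = limitPerWindow;
        _window = window;
    }

    // Returns true when the tenant may proceed. `now` is passed in so the
    // behaviour is deterministic and testable.
    public bool TryAcquire(string tenantId, DateTimeOffset now)
    {
        lock (_gate)
        {
            // Start a fresh window for unknown tenants or once the old window has elapsed.
            if (!_state.TryGetValue(tenantId, out var slot) || now - slot.WindowStart >= _window)
                slot = (now, 0);

            if (slot.Count >= _limit)
                return false;

            _state[tenantId] = (slot.WindowStart, slot.Count + 1);
            return true;
        }
    }
}
```

With the default above, the limiter would be constructed as `new FixedWindowLimiter(5, TimeSpan.FromMinutes(1))`.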
#### 7.3.3 Secret Redaction
The config endpoint automatically redacts sensitive keys:
| Key pattern | Redaction |
|-------------|-----------|
| `connectionString` | Replaced with `********` |
| `password` | Replaced with `********` |
| `secret*` (any key starting with "secret") | Replaced with `********` |
| `apiKey` | Replaced with `********` |
| `token` | Replaced with `********` |
Redaction is applied recursively to nested objects.
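The recursive behavior can be illustrated with a short sketch. This is an assumption-laden example rather than the module's code: the names `redact` and `is_sensitive` are hypothetical, and case-insensitive key matching is assumed because the document does not say how keys are compared.

```python
MASK = "********"
EXACT_KEYS = {"connectionstring", "password", "apikey", "token"}


def is_sensitive(key: str) -> bool:
    # Assumption: keys are compared case-insensitively.
    k = key.lower()
    return k in EXACT_KEYS or k.startswith("secret")


def redact(value):
    """Return a copy of `value` with sensitive keys masked at every nesting level."""
    if isinstance(value, dict):
        return {k: MASK if is_sensitive(k) else redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value
```

Building a new structure instead of mutating the input keeps the live configuration object untouched while the redacted copy is serialized.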
---
## 8. Configuration
> **Sprint:** SPRINT_20260112_007_BINIDX_binaryindex_user_config
### 8.1 Configuration Sections
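Most configuration lives under the `BinaryIndex` section of `appsettings.yaml`, or in environment variables with the `BINARYINDEX__` prefix. Persistence settings are the exception: they live under `Postgres:BinaryIndex` (see 8.1.4).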
#### 8.1.1 B2R2 Lifter Pool (`BinaryIndex:B2R2Pool`)
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `MaxPoolSizePerIsa` | int | 4 | Maximum lifter instances per ISA |
| `WarmPreload` | string[] | ["intel-64", "armv8-64"] | ISAs to warm on startup |
| `AcquireTimeoutMs` | int | 5000 | Timeout for lifter acquisition |
| `EnableMetrics` | bool | true | Emit Prometheus metrics for pool |
```yaml
BinaryIndex:
  B2R2Pool:
    MaxPoolSizePerIsa: 4
    WarmPreload:
      - intel-64
      - armv8-64
    AcquireTimeoutMs: 5000
    EnableMetrics: true
```
#### 8.1.2 Semantic Lifting (`BinaryIndex:SemanticLifting`)
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `B2R2Version` | string | "1.5.0" | B2R2 disassembler version |
| `NormalizationRecipeVersion` | string | "2024.1" | IR normalization recipe version |
| `MaxInstructionsPerFunction` | int | 10000 | Max instructions to lift per function |
| `MaxFunctionsPerBinary` | int | 5000 | Max functions to process per binary |
| `FunctionLiftTimeoutMs` | int | 30000 | Timeout for lifting single function |
| `EnableDeduplication` | bool | true | Deduplicate IR before fingerprinting |
```yaml
BinaryIndex:
  SemanticLifting:
    MaxInstructionsPerFunction: 10000
    MaxFunctionsPerBinary: 5000
    FunctionLiftTimeoutMs: 30000
    EnableDeduplication: true
```
#### 8.1.3 Function Cache (`BinaryIndex:FunctionCache`)
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `ConnectionString` | string | | Valkey connection string (secret) |
| `KeyPrefix` | string | "stellaops:binidx:funccache:" | Cache key prefix |
| `CacheTtlSeconds` | int | 14400 | Default cache TTL (4 hours) |
| `MaxTtlSeconds` | int | 86400 | Maximum TTL (24 hours) |
| `EarlyExpiryPercent` | decimal | 0.1 | Early expiry jitter (10%) |
| `MaxEntrySizeBytes` | int | 1048576 | Max entry size (1 MB) |
```yaml
BinaryIndex:
  FunctionCache:
    ConnectionString: ${VALKEY_CONNECTION}  # from env
    KeyPrefix: "stellaops:binidx:funccache:"
    CacheTtlSeconds: 14400
    MaxEntrySizeBytes: 1048576
```
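How `CacheTtlSeconds`, `MaxTtlSeconds`, and `EarlyExpiryPercent` might combine is sketched below. The exact semantics are not spelled out in this document, so the sketch assumes that a requested TTL is clamped to the maximum and then shortened by a random fraction of up to `EarlyExpiryPercent`, so that entries written together do not all expire together. The function name `effective_ttl` is hypothetical.

```python
import random


def effective_ttl(requested_s=None, default_s=14400, max_s=86400,
                  early_pct=0.1, rng=random.random):
    """Clamp the TTL to max_s, then shorten it by up to early_pct (assumed semantics)."""
    base = min(requested_s if requested_s is not None else default_s, max_s)
    jitter = base * early_pct * rng()  # rng() yields a value in [0, 1)
    return int(base - jitter)
```

With the defaults this yields a TTL between roughly 12960 and 14400 seconds, which spreads expirations out and reduces the chance of many entries being recomputed at once.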
#### 8.1.4 Persistence (`Postgres:BinaryIndex`)
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `Schema` | string | "binaries" | PostgreSQL schema name |
| `MinPoolSize` | int | 5 | Minimum connection pool size |
| `MaxPoolSize` | int | 20 | Maximum connection pool size |
| `CommandTimeoutSeconds` | int | 30 | Command execution timeout |
| `RetryOnFailure` | bool | true | Retry transient failures |
| `BatchSize` | int | 100 | Batch insert size |
```yaml
Postgres:
  BinaryIndex:
    Schema: binaries
    MinPoolSize: 5
    MaxPoolSize: 20
    CommandTimeoutSeconds: 30
    RetryOnFailure: true
    BatchSize: 100
```
#### 8.1.5 Ops Configuration (`BinaryIndex:Ops`)
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `EnableHealthEndpoint` | bool | true | Enable /health endpoint |
| `EnableBenchEndpoint` | bool | true | Enable /bench/run endpoint |
| `BenchRateLimitPerMinute` | int | 5 | Rate limit for bench endpoint |
| `RedactedKeys` | string[] | See 7.3.3 | Keys to redact in config output |
### 8.2 Legacy Configuration
```yaml
# binaryindex.yaml (corpus configuration)
binaryindex:
  enabled: true
  corpus:
    connectors:
      - type: debian
        enabled: true
        mirror: http://deb.debian.org/debian
        releases: [bookworm, bullseye]
        architectures: [amd64, arm64]
      - type: ubuntu
        enabled: true
        mirror: http://archive.ubuntu.com/ubuntu
        releases: [jammy, noble]
  fingerprinting:
    enabled: true
    algorithms: [basic_block, cfg]
    target_components:
      - openssl
      - glibc
      - zlib
      - curl
      - sqlite
    min_function_size: 16  # bytes
    max_functions_per_binary: 10000
  lookup:
    cache_ttl: 3600
    batch_size: 100
    timeout_ms: 5000
  storage:
    postgres_schema: binaries
    rustfs_bucket: stellaops/binaryindex
```
---
## 9. Testing Strategy
### 9.1 Unit Tests
- Identity extraction (Build-ID, hashes)
- Fingerprint generation determinism
- Fix index parsing (changelog, patch headers)
### 9.2 Integration Tests
- PostgreSQL schema validation
- Full corpus ingestion flow
- Scanner.Worker lookup integration
### 9.3 Regression Tests
- Known CVE detection (golden corpus)
- Backport handling (Debian libssl example)
- False positive rate validation
---
## 10. Golden Corpus for Patch Provenance
> **Sprint:** SPRINT_20260121_034/035/036 - Golden Corpus Implementation
The BinaryIndex module supports a **golden corpus** of patch-paired artifacts that enables offline SBOM reproducibility and binary-level patch provenance verification.
### 10.1 Corpus Purpose
The golden corpus provides:
- **Auditor-ready evidence bundles** for air-gapped customers
- **Regression testing** for binary matching accuracy
- **Proof of patch status** independent of package metadata
### 10.2 Corpus Sources
| Source | Type | Purpose |
|--------|------|---------|
| Debian Security Tracker / DSAs | Advisory | Primary advisory linkage |
| Debian Snapshot | Binary archive | Pre/post patch binary pairs |
| Ubuntu Security Notices | Advisory | Ubuntu-specific advisories |
| Alpine secdb | Advisory | Alpine YAML advisories |
| OSV dump | Unified schema | Cross-reference and commit ranges |
#### 10.2.1 Symbol Source Connectors
> **Sprint:** SPRINT_20260121_035_BinaryIndex_golden_corpus_connectors_cli
The corpus ingestion layer uses pluggable connectors to retrieve symbols and metadata from upstream sources:
| Connector ID | Implementation | Protocol | Data Retrieved |
|--------------|----------------|----------|----------------|
| `debuginfod-fedora` | `DebuginfodConnector` | debuginfod HTTP | ELF debug symbols by Build-ID |
| `debuginfod-ubuntu` | `DebuginfodConnector` | debuginfod HTTP | ELF debug symbols by Build-ID |
| `ddeb-ubuntu` | `DdebConnector` | APT/HTTP | `.ddeb` debug packages |
| `buildinfo-debian` | `BuildinfoConnector` | HTTP | `.buildinfo` reproducibility records |
| `secdb-alpine` | `AlpineSecDbConnector` | Git/HTTP | `secfixes` YAML from APKBUILD |
**Connector Interface:**
```csharp
public interface ISymbolSourceConnector
{
    string ConnectorId { get; }
    string DisplayName { get; }
    string[] SupportedDistros { get; }
    Task<ConnectorStatus> GetStatusAsync(CancellationToken ct);
    Task SyncAsync(SyncOptions options, CancellationToken ct);
    Task<SymbolLookupResult?> LookupByBuildIdAsync(string buildId, CancellationToken ct);
    Task<IAsyncEnumerable<SymbolRecord>> SearchAsync(SymbolSearchQuery query, CancellationToken ct);
}
```
**Debuginfod Connector:**
The `DebuginfodConnector` implements the [debuginfod protocol](https://sourceware.org/elfutils/Debuginfod.html) for retrieving debug symbols:
- Endpoint: `GET /buildid/<build-id>/debuginfo`
- Supports federated queries across multiple debuginfod servers
- Caches retrieved symbols in RustFS blob storage
- Rate-limited to respect upstream server policies
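The lookup path and the federated fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the `DebuginfodConnector` itself: the function names, the injectable `fetch` callable (returning bytes, or `None` for a miss), and the example hostnames are hypothetical. Only the `/buildid/<build-id>/debuginfo` path comes from the text. Python is used for brevity.

```python
HEX_DIGITS = set("0123456789abcdef")


def debuginfo_url(server: str, build_id: str) -> str:
    """Build the debuginfod URL for a Build-ID after checking it is plain hex."""
    bid = build_id.lower()
    if not bid or any(c not in HEX_DIGITS for c in bid):
        raise ValueError(f"not a hex Build-ID: {build_id!r}")
    return f"{server.rstrip('/')}/buildid/{bid}/debuginfo"


def federated_lookup(servers, build_id, fetch):
    """Try each server in order; return (server, payload) for the first hit, else None."""
    for server in servers:
        payload = fetch(debuginfo_url(server, build_id))
        if payload is not None:
            return server, payload
    return None


# Hypothetical usage with a stubbed fetch that only "knows" the second server.
hit = federated_lookup(
    ["https://first.example/", "https://second.example"],
    "ABCDEF01",
    lambda url: b"ELF" if "second" in url else None,
)
```

Validating the Build-ID before building the URL keeps untrusted identifiers from altering the request path. Caching and rate limiting would wrap the `fetch` callable in a real connector.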
**Ubuntu ddeb Connector:**
The `DdebConnector` retrieves Ubuntu debug symbol packages (`.ddeb`):
- Sources: `ddebs.ubuntu.com` mirror
- Indexes: Reads `Packages.xz` for package metadata
- Extraction: Unpacks `.ddeb` AR archives to extract DWARF symbols
- Mapping: Links debug symbols to binary packages via Build-ID
**Debian Buildinfo Connector:**
The `BuildinfoConnector` retrieves Debian buildinfo files for reproducibility verification:
- Source: `buildinfos.debian.net` and snapshot archives
- Purpose: Provides build environment metadata for reproducible builds
- Fields extracted: `Build-Date`, `Build-Architecture`, `Checksums-Sha256`
- Integration: Cross-references with binary packages for provenance
**Alpine SecDB Connector:**
The `AlpineSecDbConnector` parses Alpine's security database:
- Source: `secfixes` blocks in APKBUILD files
- Repository: `alpine/aports` Git repository
- Format: YAML blocks mapping each fixed package version to the CVEs it resolves
- Example:
```yaml
secfixes:
  3.0.11-r0:
    - CVE-2024-0727
    - CVE-2024-0728
```
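A `secfixes` block like the one above can be inverted into a CVE-to-fixed-version lookup with a small line-based parser. The sketch below is a simplified assumption of what such parsing could look like, not the `AlpineSecDbConnector` implementation: `parse_secfixes` is a hypothetical name, a leading `#` is stripped on the assumption that the block may appear as comments inside an APKBUILD, and no YAML library is used.

```python
def parse_secfixes(text: str) -> dict[str, str]:
    """Map each CVE to the package version that fixes it (last occurrence wins)."""
    fixed_in: dict[str, str] = {}
    version = None
    in_block = False
    for raw in text.splitlines():
        line = raw.lstrip("#").strip()  # tolerate "# " comment prefixes
        if line == "secfixes:":
            in_block = True
        elif not in_block or not line:
            continue
        elif line.endswith(":"):
            version = line[:-1]  # e.g. "3.0.11-r0"
        elif line.startswith("- ") and version:
            fixed_in[line[2:].strip()] = version
    return fixed_in


sample = """\
secfixes:
  3.0.11-r0:
    - CVE-2024-0727
    - CVE-2024-0728
"""
print(parse_secfixes(sample))
```

The inverted map is the shape a fix index usually needs: given a CVE, return the first package version known to contain the fix.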
**OSV Dump Parser:**
The `OsvDumpParser` processes Google OSV database dumps for advisory cross-correlation:
- Source: `osv.dev` bulk exports (JSON)
- Purpose: CVE → commit range extraction for patch identification
- Cross-reference: Correlates OSV entries with distribution advisories
- Inconsistency detection: Identifies discrepancies between OSV and distro advisories
```csharp
public interface IOsvDumpParser
{
    IAsyncEnumerable<OsvParsedEntry> ParseDumpAsync(Stream osvDumpStream, CancellationToken ct);
    OsvCveIndex BuildCveIndex(IEnumerable<OsvParsedEntry> entries);
    IEnumerable<AdvisoryCorrelation> CrossReferenceWithExternal(
        OsvCveIndex osvIndex,
        IEnumerable<ExternalAdvisory> externalAdvisories);
    IEnumerable<AdvisoryInconsistency> DetectInconsistencies(
        IEnumerable<AdvisoryCorrelation> correlations);
}
```
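To make the "CVE to commit range" step concrete, the sketch below pulls `GIT` ranges out of a single OSV record. It relies on the public OSV schema fields (`id`, `aliases`, `affected[].ranges[]` with `type`, `repo`, and `events`); the function name `git_fix_ranges` and the sample record are hypothetical, and this is not a rendering of `OsvDumpParser`. Python is used for brevity.

```python
def git_fix_ranges(entry: dict):
    """Return (cve_ids, [(repo, introduced_commit, fixed_commit), ...]) for one OSV record."""
    ids = [entry.get("id", "")] + entry.get("aliases", [])
    cves = [i for i in ids if i.startswith("CVE-")]
    ranges = []
    for affected in entry.get("affected", []):
        for rng in affected.get("ranges", []):
            if rng.get("type") != "GIT":
                continue  # skip SEMVER and ECOSYSTEM ranges
            introduced = None
            for event in rng.get("events", []):
                if "introduced" in event:
                    introduced = event["introduced"]
                elif "fixed" in event:
                    ranges.append((rng.get("repo"), introduced, event["fixed"]))
                    introduced = None
    return cves, ranges


# Hypothetical record for illustration only.
record = {
    "id": "OSV-EXAMPLE-1",
    "aliases": ["CVE-2024-0727"],
    "affected": [{"ranges": [
        {"type": "GIT", "repo": "https://example.org/openssl.git",
         "events": [{"introduced": "0"}, {"fixed": "deadbeef"}]},
        {"type": "ECOSYSTEM", "events": [{"introduced": "0"}, {"fixed": "3.0.13"}]},
    ]}],
}
cves, ranges = git_fix_ranges(record)
```

The resulting commit pairs are what a cross-reference step could compare against the fixed versions reported by distribution advisories.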
**CLI Access:**
All connectors are manageable via the `stella groundtruth sources` CLI commands:
```bash
# List all connectors
stella groundtruth sources list
# Sync specific connector
stella groundtruth sources sync --source buildinfo-debian --full
# Enable/disable connectors
stella groundtruth sources enable ddeb-ubuntu
stella groundtruth sources disable debuginfod-fedora
```
See the [Ground-Truth CLI Guide](../cli/guides/ground-truth-cli.md) for complete CLI documentation.
### 10.3 Key Performance Indicators
| KPI | Target | Description |
|-----|--------|-------------|
| Per-function match rate | >= 90% | Functions matched in post-patch binary |
| False-negative patch detection | <= 5% | Patched functions incorrectly classified as unpatched |
| SBOM canonical-hash stability | 3/3 | Identical canonical hash in 3 of 3 independent runs |
| Binary reconstruction equivalence | Trend | Rebuilt binary matches original |
| End-to-end verify time (p95, cold) | Trend | Offline verification performance |
### 10.4 Validation Harness
The validation harness (`IValidationHarness`) orchestrates end-to-end verification:
```
Binary Pair (pre/post) → Symbol Recovery → IR Lifting → Fingerprinting → Matching → Metrics
```
### 10.5 Evidence Bundle Format
Evidence bundles follow OCI/ORAS conventions:
```
<pkg>-<advisory>-bundle.oci.tar
├── manifest.json              # OCI manifest
└── blobs/
    ├── sha256:<sbom>          # Canonical SBOM
    ├── sha256:<pre-bin>       # Pre-fix binary
    ├── sha256:<post-bin>      # Post-fix binary
    ├── sha256:<delta-sig>     # DSSE delta-sig predicate
    └── sha256:<timestamp>     # RFC 3161 timestamp
```
### 10.6 Two-Tier Bundle Design and Large Blob References
> **Sprint:** SPRINT_20260122_040_Platform_oci_delta_attestation_pipeline (040-04)
Evidence bundles support two export modes to balance transfer speed with auditability:
| Mode | Export Flag | Contents | Use Case |
|------|------------|----------|----------|
| **Light** | (default) | Manifest + attestation envelopes + metadata | Quick transfer, metadata-only audit |
| **Full** | `--full` | Light + embedded binary blobs in `blobs/` | Air-gap replay, full provenance verification |
#### 10.6.1 `largeBlobs[]` Field
The `DeltaSigPredicate` includes a `largeBlobs` array referencing binary artifacts that may be too large to embed in attestation payloads:
```json
{
  "schemaVersion": "1.0.0",
  "subject": [...],
  "delta": [...],
  "largeBlobs": [
    {
      "kind": "binary-patch",
      "digest": "sha256:a1b2c3...",
      "mediaType": "application/octet-stream",
      "sizeBytes": 1048576
    },
    {
      "kind": "sbom-fragment",
      "digest": "sha256:d4e5f6...",
      "mediaType": "application/spdx+json",
      "sizeBytes": 32768
    }
  ],
  "sbomDigest": "sha256:789abc..."
}
```
**Field Definitions:**
| Field | Type | Description |
|-------|------|-------------|
| `largeBlobs[].kind` | string | Blob category: `binary-patch`, `sbom-fragment`, `debug-symbols`, etc. |
| `largeBlobs[].digest` | string | Content-addressable digest (`sha256:<hex>`, `sha384:<hex>`, `sha512:<hex>`) |
| `largeBlobs[].mediaType` | string | IANA media type of the blob |
| `largeBlobs[].sizeBytes` | long | Blob size in bytes |
| `sbomDigest` | string | Digest of the canonical SBOM associated with this delta |
#### 10.6.2 Blob Fetch Strategy
During `stella bundle verify --replay`, blobs are resolved in priority order:
1. **Embedded** (full bundles): Read from `blobs/<digest-with-dash>` in bundle directory
2. **Local source** (`--blob-source /path/`): Read from specified local directory
3. **Registry** (`--blob-source https://...`): HTTP GET from OCI registry (blocked in `--offline` mode)
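The priority order above can be sketched as a single resolver function. Several details are assumptions made for illustration: `<digest-with-dash>` is taken to mean the digest with `:` replaced by `-`, the registry URL layout is a placeholder rather than the real OCI blob route, and `resolve_blob` and the injectable `http_get` are hypothetical names.

```python
from pathlib import Path


def resolve_blob(digest, bundle_dir, blob_source=None, offline=False, http_get=None):
    """Return blob bytes from the first source that has them, or None if none does."""
    name = digest.replace(":", "-")  # assumed meaning of <digest-with-dash>

    # 1. Embedded blob (full bundles).
    embedded = Path(bundle_dir) / "blobs" / name
    if embedded.is_file():
        return embedded.read_bytes()
    if blob_source is None:
        return None

    # 3. Registry source; refused outright in offline mode.
    if blob_source.startswith(("http://", "https://")):
        if offline:
            raise PermissionError("registry fetch is blocked in offline mode")
        return http_get(f"{blob_source.rstrip('/')}/{digest}")  # placeholder URL layout

    # 2. Local directory source.
    local = Path(blob_source) / name
    return local.read_bytes() if local.is_file() else None
```

Checking the embedded copy first means a full bundle never touches the network, which is the property air-gapped replay depends on. Whatever the source, the returned bytes still need digest verification before use.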
#### 10.6.3 Digest Verification
Fetched blobs are verified against their declared digest using the algorithm prefix:
```
sha256:<hex> → SHA-256
sha384:<hex> → SHA-384
sha512:<hex> → SHA-512
```
A mismatch fails the blob replay verification step.
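Prefix-driven verification can be sketched with the Python standard library. The helper name `verify_digest` is hypothetical, and the choice to reject unknown prefixes with an exception is an assumption; the three supported algorithms are the ones listed above.

```python
import hashlib
import hmac

ALGORITHMS = {
    "sha256": hashlib.sha256,
    "sha384": hashlib.sha384,
    "sha512": hashlib.sha512,
}


def verify_digest(blob: bytes, declared: str) -> bool:
    """Check `blob` against a declared '<algorithm>:<hex>' digest."""
    algo, sep, expected = declared.partition(":")
    if not sep or algo not in ALGORITHMS:
        raise ValueError(f"unsupported digest: {declared!r}")
    actual = ALGORITHMS[algo](blob).hexdigest()
    # Constant-time comparison; hex case is normalized first.
    return hmac.compare_digest(actual, expected.lower())


declared = "sha256:" + hashlib.sha256(b"example blob").hexdigest()
print(verify_digest(b"example blob", declared))
```

A `False` result here corresponds to the mismatch case that fails the blob replay verification step.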
### 10.7 Related Documentation
- [Golden Corpus KPIs](../../benchmarks/golden-corpus-kpis.md)
- [Golden Corpus Seed List](../../benchmarks/golden-corpus-seed-list.md)
- [Ground-Truth Corpus Specification](../../benchmarks/ground-truth-corpus.md)
---
## 11. References
- Advisory: `docs/product/advisories/21-Dec-2025 - Mapping Evidence Within Compiled Binaries.md`
- Scanner Native Analysis: `src/Scanner/StellaOps.Scanner.Analyzers.Native/`
- Existing Fingerprinting: `src/Scanner/__Libraries/StellaOps.Scanner.EntryTrace/Binary/`
- Build-ID Index: `src/Scanner/StellaOps.Scanner.Analyzers.Native/Index/`
- **Semantic Diffing Sprint:** `docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
- **Semantic Library:** `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/`
- **Semantic Tests:** `src/BinaryIndex/__Tests/StellaOps.BinaryIndex.Semantic.Tests/`
- **Golden Corpus Sprints:** `docs/implplan/SPRINT_20260121_034_BinaryIndex_golden_corpus_foundation.md`
---
*Document Version: 1.2.0*
*Last Updated: 2026-01-21*