# BinaryIndex Module Architecture

> **Ownership:** Scanner Guild + Concelier Guild
> **Status:** DRAFT
> **Version:** 1.0.0
> **Related:** [High-Level Architecture](../../ARCHITECTURE_OVERVIEW.md), [Scanner Architecture](../scanner/architecture.md), [Concelier Architecture](../concelier/architecture.md)

---

## 1. Overview

The **BinaryIndex** module provides a vulnerable binaries database that enables detection of vulnerable code at the binary level, independent of package metadata. This addresses a critical gap in vulnerability scanning: package version strings can lie (backports, custom builds, stripped metadata), but **binary identity doesn't lie**.

### 1.1 Problem Statement

Traditional vulnerability scanners rely on package version matching, which fails in several scenarios:

1. **Backported patches** - Distros backport security fixes without changing the upstream version
2. **Custom/vendored builds** - Binaries compiled from source without package metadata
3. **Stripped binaries** - Debug info and version strings removed
4. **Static linking** - Vulnerable library code embedded in the final binary
5. **Container base images** - Distroless or scratch images with no package DB

### 1.2 Solution: Binary-First Vulnerability Detection

BinaryIndex provides three tiers of binary identification:

| Tier | Method | Precision | Coverage |
|------|--------|-----------|----------|
| A | Package/version range matching | Medium | High |
| B | Build-ID/hash catalog (exact binary identity) | High | Medium |
| C | Function fingerprints (CFG/basic-block hashes) | Very High | Targeted |

### 1.3 Module Scope

**In Scope:**
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
- Binary-to-advisory mapping database
- Fingerprint storage and matching engine
- Fix index for patch-aware backport handling
- Integration with Scanner.Worker for binary lookup

**Out of Scope:**
- Binary disassembly/analysis (provided by Scanner.Analyzers.Native)
- Runtime binary tracing (provided by Zastava)
- SBOM generation (provided by Scanner)

---

## 2. Architecture

### 2.1 System Context

```
┌──────────────────────────────────────────────────────────────────────────┐
│                             External Systems                             │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐         │
│  │ Distro Repos    │   │ Debug Symbol    │   │ Upstream Source │         │
│  │ (Debian, RPM,   │   │ Servers         │   │ (GitHub, etc.)  │         │
│  │  Alpine)        │   │ (debuginfod)    │   │                 │         │
│  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘         │
└───────────│─────────────────────│─────────────────────│──────────────────┘
            │                     │                     │
            v                     v                     v
┌──────────────────────────────────────────────────────────────────────────┐
│                            BinaryIndex Module                            │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                       Corpus Ingestion Layer                       │  │
│  │  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐            │  │
│  │  │ DebianCorpus │   │ RpmCorpus    │   │ AlpineCorpus │            │  │
│  │  │ Connector    │   │ Connector    │   │ Connector    │            │  │
│  │  └──────────────┘   └──────────────┘   └──────────────┘            │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                    │                                     │
│                                    v                                     │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                          Processing Layer                          │  │
│  │  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐            │  │
│  │  │ BinaryFeature│   │ FixIndex     │   │ Fingerprint  │            │  │
│  │  │ Extractor    │   │ Builder      │   │ Generator    │            │  │
│  │  └──────────────┘   └──────────────┘   └──────────────┘            │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                    │                                     │
│                                    v                                     │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                           Storage Layer                            │  │
│  │  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐            │  │
│  │  │ PostgreSQL   │   │ RustFS       │   │ Valkey       │            │  │
│  │  │ (binaries    │   │ (fingerprint │   │ (lookup      │            │  │
│  │  │  schema)     │   │  blobs)      │   │  cache)      │            │  │
│  │  └──────────────┘   └──────────────┘   └──────────────┘            │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                    │                                     │
│                                    v                                     │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                            Query Layer                             │  │
│  │  ┌──────────────────────────────────────────────────────────────┐  │  │
│  │  │ IBinaryVulnerabilityService                                  │  │  │
│  │  │ - LookupByIdentityAsync(identity)                            │  │  │
│  │  │ - LookupByFingerprintAsync(fingerprint)                      │  │  │
│  │  │ - LookupBatchAsync(identities)                               │  │  │
│  │  │ - GetFixStatusAsync(distro, release, sourcePkg, cve)         │  │  │
│  │  └──────────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     v
┌──────────────────────────────────────────────────────────────────────────┐
│                            Consuming Modules                             │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐         │
│  │ Scanner.Worker  │   │ Policy Engine   │   │ Findings Ledger │         │
│  │ (binary lookup  │   │ (evidence in    │   │ (match records) │         │
│  │  during scan)   │   │  proof chain)   │   │                 │         │
│  └─────────────────┘   └─────────────────┘   └─────────────────┘         │
└──────────────────────────────────────────────────────────────────────────┘
```

### 2.2 Component Breakdown

#### 2.2.1 Corpus Connectors

Plugin-based connectors that ingest binaries from distribution repositories.

```csharp
public interface IBinaryCorpusConnector
{
    string ConnectorId { get; }
    string[] SupportedDistros { get; }

    Task<CorpusSnapshot> FetchSnapshotAsync(CorpusQuery query, CancellationToken ct);
    Task<IAsyncEnumerable<ExtractedBinary>> ExtractBinariesAsync(PackageReference pkg, CancellationToken ct);
}
```

**Implementations:**
- `DebianBinaryCorpusConnector` - Debian/Ubuntu packages + debuginfo
- `RpmBinaryCorpusConnector` - RHEL/Fedora/CentOS + SRPM
- `AlpineBinaryCorpusConnector` - Alpine APK + APKBUILD

#### 2.2.2 Binary Feature Extractor

Extracts identity and features from binaries. Reuses existing Scanner.Analyzers.Native capabilities.

```csharp
public interface IBinaryFeatureExtractor
{
    Task<BinaryIdentity> ExtractIdentityAsync(Stream binaryStream, CancellationToken ct);
    Task<BinaryFeatures> ExtractFeaturesAsync(Stream binaryStream, ExtractorOptions opts, CancellationToken ct);
}

public sealed record BinaryIdentity(
    string Format,              // elf, pe, macho
    string? BuildId,            // ELF GNU Build-ID
    string? PeCodeViewGuid,     // PE CodeView GUID + Age
    string? MachoUuid,          // Mach-O LC_UUID
    string FileSha256,
    string TextSectionSha256);

public sealed record BinaryFeatures(
    BinaryIdentity Identity,
    string[] DynamicDeps,       // DT_NEEDED
    string[] ExportedSymbols,
    string[] ImportedSymbols,
    BinaryHardening Hardening);
```

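The `FileSha256` field of `BinaryIdentity` is a whole-file digest. A minimal sketch of how it could be computed follows. The `IdentityHashing` class and the lowercase-hex encoding are assumptions made for illustration and are not part of the documented API.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper (not part of BinaryIndex): computes a whole-file SHA-256.
public static class IdentityHashing
{
    // Hashes the stream from its current position to the end and returns
    // lowercase hex. The lowercase convention is an assumption.
    public static string ComputeFileSha256(Stream binaryStream)
    {
        using var sha = SHA256.Create();
        byte[] digest = sha.ComputeHash(binaryStream);
        return Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

`TextSectionSha256` would need format-aware section parsing first, which is why the document delegates extraction to Scanner.Analyzers.Native.
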
#### 2.2.3 Fix Index Builder

Builds the patch-aware CVE fix index from distro sources.

```csharp
public interface IFixIndexBuilder
{
    Task BuildIndexAsync(DistroRelease distro, CancellationToken ct);
    Task<FixRecord?> GetFixRecordAsync(string distro, string release, string sourcePkg, string cveId, CancellationToken ct);
}

public sealed record FixRecord(
    string Distro,
    string Release,
    string SourcePkg,
    string CveId,
    FixState State,             // fixed, vulnerable, not_affected, wontfix, unknown
    string? FixedVersion,       // Distro version string
    FixMethod Method,           // security_feed, changelog, patch_header, upstream_match
    decimal Confidence,         // 0.00-1.00
    FixEvidence Evidence);

public enum FixState { Fixed, Vulnerable, NotAffected, Wontfix, Unknown }
public enum FixMethod { SecurityFeed, Changelog, PatchHeader, UpstreamPatchMatch }
```

#### 2.2.4 Fingerprint Generator

Generates function-level fingerprints for vulnerable code detection.

```csharp
public interface IVulnFingerprintGenerator
{
    Task<ImmutableArray<VulnFingerprint>> GenerateAsync(
        string cveId,
        BinaryPair vulnAndFixed,    // Reference builds
        FingerprintOptions opts,
        CancellationToken ct);
}

public sealed record VulnFingerprint(
    string CveId,
    string Component,           // e.g., openssl
    string Architecture,        // x86-64, aarch64
    FingerprintType Type,       // basic_block, cfg, combined
    string FingerprintId,       // e.g., "bb-abc123..."
    byte[] FingerprintHash,     // 16-32 bytes
    string? FunctionHint,       // Function name if known
    decimal Confidence,
    FingerprintEvidence Evidence);

public enum FingerprintType { BasicBlock, ControlFlowGraph, StringReferences, Combined }
```

#### 2.2.5 Semantic Analysis Library

> **Library:** `StellaOps.BinaryIndex.Semantic`
> **Sprint:** 20260105_001_001_BINDEX - Semantic Diffing Phase 1

The Semantic Analysis Library extends fingerprint generation with IR-level semantic matching, enabling detection of semantically equivalent code despite compiler optimizations, instruction reordering, and register allocation differences.

**Key Insight:** Traditional instruction-level fingerprinting loses roughly 15-20% accuracy on optimized binaries. Semantic analysis lifts code to B2R2's Intermediate Representation (LowUIR), extracts key-semantics graphs, and uses graph hashing to compute similarity.

##### 2.2.5.1 Architecture

```
Binary Input
     │
     v
B2R2 Disassembly → Raw Instructions
     │
     v
IR Lifting Service → LowUIR Statements
     │
     v
Semantic Graph Extractor → Key-Semantics Graph (KSG)
     │
     v
Graph Fingerprinting → Semantic Fingerprint
     │
     v
Semantic Matcher → Similarity Score + Deltas
```

##### 2.2.5.2 Core Components

**IR Lifting Service** (`IIrLiftingService`)

Lifts disassembled instructions to B2R2 LowUIR:

```csharp
public interface IIrLiftingService
{
    Task<LiftedFunction> LiftToIrAsync(
        IReadOnlyList<DisassembledInstruction> instructions,
        string functionName,
        LiftOptions? options = null,
        CancellationToken ct = default);
}

public sealed record LiftedFunction(
    string Name,
    ImmutableArray<IrStatement> Statements,
    ImmutableArray<IrBasicBlock> BasicBlocks);
```

**Semantic Graph Extractor** (`ISemanticGraphExtractor`)

Extracts key-semantics graphs capturing data dependencies, control flow, and memory operations:

```csharp
public interface ISemanticGraphExtractor
{
    Task<KeySemanticsGraph> ExtractGraphAsync(
        LiftedFunction function,
        GraphExtractionOptions? options = null,
        CancellationToken ct = default);
}

public sealed record KeySemanticsGraph(
    string FunctionName,
    ImmutableArray<SemanticNode> Nodes,
    ImmutableArray<SemanticEdge> Edges,
    GraphProperties Properties);

public enum SemanticNodeType { Compute, Load, Store, Branch, Call, Return, Phi }
public enum SemanticEdgeType { DataDependency, ControlDependency, MemoryDependency }
```

**Semantic Fingerprint Generator** (`ISemanticFingerprintGenerator`)

Generates semantic fingerprints using Weisfeiler-Lehman graph hashing:

```csharp
public interface ISemanticFingerprintGenerator
{
    Task<SemanticFingerprint> GenerateAsync(
        KeySemanticsGraph graph,
        SemanticFingerprintOptions? options = null,
        CancellationToken ct = default);
}

public sealed record SemanticFingerprint(
    string FunctionName,
    string GraphHashHex,        // WL graph hash (SHA-256)
    string OperationHashHex,    // Normalized operation sequence hash
    string DataFlowHashHex,     // Data dependency pattern hash
    int NodeCount,
    int EdgeCount,
    int CyclomaticComplexity,
    ImmutableArray<string> ApiCalls,
    SemanticFingerprintAlgorithm Algorithm);
```

**Semantic Matcher** (`ISemanticMatcher`)

Computes semantic similarity with weighted components:

```csharp
public interface ISemanticMatcher
{
    Task<SemanticMatchResult> MatchAsync(
        SemanticFingerprint a,
        SemanticFingerprint b,
        MatchOptions? options = null,
        CancellationToken ct = default);

    Task<SemanticMatchResult> MatchWithDeltasAsync(
        SemanticFingerprint a,
        SemanticFingerprint b,
        MatchOptions? options = null,
        CancellationToken ct = default);
}

public sealed record SemanticMatchResult(
    decimal Similarity,         // 0.00-1.00
    decimal GraphSimilarity,
    decimal OperationSimilarity,
    decimal DataFlowSimilarity,
    decimal ApiCallSimilarity,
    MatchConfidence Confidence);
```

##### 2.2.5.3 Algorithm Details

**Weisfeiler-Lehman Graph Hashing:**
- 3 iterations of label propagation
- SHA-256 for final hash computation
- Deterministic node ordering via canonical sort

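The three properties above can be sketched as follows. This is an illustration only: the `WlGraphHash` class, the use of plain string labels and index-pair edges in place of `SemanticNode`/`SemanticEdge`, and the choice to propagate labels along outgoing edges are all assumptions, not the library's actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of Weisfeiler-Lehman hashing over a labelled directed graph.
public static class WlGraphHash
{
    public static string Compute(
        IReadOnlyList<string> labels,
        IReadOnlyList<(int From, int To)> edges,
        int iterations = 3)
    {
        // Successor lists: the neighbourhood each node aggregates over.
        var successors = Enumerable.Range(0, labels.Count)
            .Select(_ => new List<int>()).ToArray();
        foreach (var (from, to) in edges) successors[from].Add(to);

        string[] current = labels.ToArray();
        for (int i = 0; i < iterations; i++)
        {
            var next = new string[current.Length];
            for (int n = 0; n < current.Length; n++)
            {
                // Ordinal sort of neighbour labels removes edge-order dependence.
                var neighbours = successors[n]
                    .Select(m => current[m])
                    .OrderBy(s => s, StringComparer.Ordinal);
                next[n] = Sha256Hex(current[n] + "|" + string.Join(",", neighbours));
            }
            current = next;
        }

        // Canonical sort of final labels removes node-numbering dependence.
        return Sha256Hex(string.Join(";", current.OrderBy(s => s, StringComparer.Ordinal)));
    }

    private static string Sha256Hex(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));
}
```

Two graphs that differ only in node numbering hash identically under this scheme, which is the behaviour needed for matching functions across register allocation variants.
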
**Similarity Weights (Default):**

| Component | Weight |
|-----------|--------|
| Graph Hash | 0.35 |
| Operation Hash | 0.25 |
| Data Flow Hash | 0.25 |
| API Calls | 0.15 |

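The default weights sum to 1.00. One plausible way to combine the four component similarities of `SemanticMatchResult` is a weighted sum, sketched below. The weighted-sum formula and the `SimilarityWeights` class are assumptions; the document states only that the components are weighted.

```csharp
using System;

// Hypothetical sketch: weighted sum of per-component similarities (each 0.00-1.00).
public static class SimilarityWeights
{
    public const decimal Graph = 0.35m;
    public const decimal Operation = 0.25m;
    public const decimal DataFlow = 0.25m;
    public const decimal ApiCalls = 0.15m;

    public static decimal Combine(
        decimal graphSim, decimal operationSim, decimal dataFlowSim, decimal apiCallSim)
    {
        decimal score = Graph * graphSim
                      + Operation * operationSim
                      + DataFlow * dataFlowSim
                      + ApiCalls * apiCallSim;
        // Clamp to guard against out-of-range inputs.
        return Math.Clamp(score, 0m, 1m);
    }
}
```

For example, component similarities of 1.0, 0.8, 0.6, and 0.0 combine to 0.35 + 0.20 + 0.15 + 0.00 = 0.70.
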
##### 2.2.5.4 Integration Points

The semantic library integrates with existing BinaryIndex components:

**DeltaSignatureGenerator Extension:**
```csharp
// Optional semantic services via constructor injection
services.AddDeltaSignaturesWithSemantic();

// Extended SymbolSignature with semantic properties
public sealed record SymbolSignature
{
    // ... existing properties ...
    public string? SemanticHashHex { get; init; }
    public ImmutableArray<string> SemanticApiCalls { get; init; }
}
```

**PatchDiffEngine Extension:**
```csharp
// SemanticWeight in HashWeights
public decimal SemanticWeight { get; init; } = 0.2m;

// FunctionFingerprint extended with semantic fingerprint
public SemanticFingerprint? SemanticFingerprint { get; init; }
```

##### 2.2.5.5 Test Coverage

| Category | Tests | Coverage |
|----------|-------|----------|
| Unit Tests (IR lifting, graph extraction, hashing) | 53 | Core algorithms |
| Integration Tests (full pipeline) | 9 | End-to-end flow |
| Golden Corpus (compiler variations) | 11 | Register allocation, optimization, compiler variants |
| Benchmarks (accuracy, performance) | 7 | Baseline metrics |

##### 2.2.5.6 Current Baselines

> **Note:** Baselines reflect the foundational implementation; accuracy is expected to improve as semantic features mature.

| Metric | Baseline | Target |
|--------|----------|--------|
| Similarity (register allocation variants) | ≥0.55 | ≥0.85 |
| Overall accuracy | ≥40% | ≥70% |
| False positive rate | <10% | <5% |
| P95 fingerprint latency | <100ms | <50ms |

##### 2.2.5.7 B2R2 LowUIR Adapter

The `B2R2LowUirLiftingService` implements `IIrLiftingService` using B2R2's native lifting capabilities. This provides cross-platform IR representation for semantic analysis.

**Key Components:**

```csharp
public sealed class B2R2LowUirLiftingService : IIrLiftingService
{
    // Lifts to B2R2 LowUIR and maps to Stella IR model
    public Task<LiftedFunction> LiftToIrAsync(
        IReadOnlyList<DisassembledInstruction> instructions,
        string functionName,
        LiftOptions? options = null,
        CancellationToken ct = default);
}
```

**Supported ISAs:**
- Intel (x86-32, x86-64)
- ARM (ARMv7, ARMv8/ARM64)
- MIPS (32/64)
- RISC-V (64)
- PowerPC, SPARC, SH4, AVR, EVM

**IR Statement Mapping:**

| B2R2 LowUIR | Stella IR Kind |
|-------------|----------------|
| Put | IrStatementKind.Store |
| Store | IrStatementKind.Store |
| Get | IrStatementKind.Load |
| Load | IrStatementKind.Load |
| BinOp | IrStatementKind.BinaryOp |
| UnOp | IrStatementKind.UnaryOp |
| Jmp | IrStatementKind.Jump |
| CJmp | IrStatementKind.ConditionalJump |
| InterJmp | IrStatementKind.IndirectJump |
| Call | IrStatementKind.Call |
| SideEffect | IrStatementKind.SideEffect |

**Determinism Guarantees:**
- Statements ordered by block address (ascending)
- Blocks sorted by entry address (ascending)
- Consistent IR IDs across identical inputs
- InvariantCulture used for all string formatting

##### 2.2.5.8 B2R2 Lifter Pool

The `B2R2LifterPool` provides bounded pooling and warm preload for B2R2 lifting units to reduce per-call allocation overhead.

**Configuration (`B2R2LifterPoolOptions`):**

| Option | Default | Description |
|--------|---------|-------------|
| `MaxPoolSizePerIsa` | 4 | Maximum pooled lifters per ISA |
| `EnableWarmPreload` | true | Preload lifters at startup |
| `WarmPreloadIsas` | ["intel-64", "intel-32", "armv8-64", "armv7-32"] | ISAs to warm |
| `AcquireTimeout` | 5s | Timeout for acquiring a lifter |

**Pool Statistics:**
- `TotalPooledLifters`: Lifters currently in pool
- `TotalActiveLifters`: Lifters currently in use
- `IsWarm`: Whether pool has been warmed
- `IsaStats`: Per-ISA pool and active counts

**Usage:**
```csharp
using var lifter = _lifterPool.Acquire(isa);
var stmts = lifter.LiftingUnit.LiftInstruction(address);
// Lifter automatically returned to pool on dispose
```

##### 2.2.5.9 Function IR Cache

The `FunctionIrCacheService` provides Valkey-backed caching for computed semantic fingerprints to avoid redundant IR lifting and graph hashing.

**Cache Key Structure:**
```
(isa, b2r2_version, normalization_recipe, canonical_ir_hash)
```

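A sketch of how the four-part tuple could be flattened into a Valkey key string. The colon-separated layout and the `FunctionCacheKey` class are assumptions; only the tuple members and the `KeyPrefix` option of `FunctionIrCacheOptions` come from this document.

```csharp
// Hypothetical sketch: the real key layout used by FunctionIrCacheService is not
// specified here; a colon-separated string is assumed for illustration.
public static class FunctionCacheKey
{
    public static string Build(
        string keyPrefix,
        string isa,
        string b2r2Version,
        string normalizationRecipe,
        string canonicalIrHash)
    {
        return string.Concat(
            keyPrefix, isa, ":", b2r2Version, ":", normalizationRecipe, ":", canonicalIrHash);
    }
}
```

Because the B2R2 version and the normalization recipe are part of the key, upgrading either one yields different keys, so stale entries are never read and simply age out.
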
**Configuration (`FunctionIrCacheOptions`):**

| Option | Default | Description |
|--------|---------|-------------|
| `KeyPrefix` | "stellaops:binidx:funccache:" | Valkey key prefix |
| `CacheTtl` | 4h | TTL for cached entries |
| `MaxTtl` | 24h | Maximum TTL |
| `Enabled` | true | Whether caching is enabled |
| `B2R2Version` | "0.9.1" | B2R2 version for cache key |
| `NormalizationRecipeVersion` | "v1" | Recipe version for cache key |

**Cache Entry (`CachedFunctionFingerprint`):**
- `FunctionAddress`, `FunctionName`
- `SemanticFingerprint`: The computed fingerprint
- `IrStatementCount`, `BasicBlockCount`
- `ComputedAtUtc`: ISO-8601 timestamp
- `B2R2Version`, `NormalizationRecipe`

**Invalidation Rules:**
- Cache entries expire after `CacheTtl` (default 4h)
- Changing B2R2 version or normalization recipe results in cache misses
- Manual invalidation via `RemoveAsync()`

**Statistics:**
- Hits, Misses, Evictions
- Hit Rate
- Enabled status

##### 2.2.5.10 Ops Endpoints

BinaryIndex exposes operational endpoints for health, benchmarking, cache monitoring, and configuration visibility.

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/v1/ops/binaryindex/health` | GET | Health status with lifter warmness, cache availability |
| `/api/v1/ops/binaryindex/bench/run` | POST | Run benchmark, return latency stats |
| `/api/v1/ops/binaryindex/cache` | GET | Function IR cache hit/miss statistics |
| `/api/v1/ops/binaryindex/config` | GET | Effective configuration (secrets redacted) |

**Health Response:**
```json
{
  "status": "healthy",
  "timestamp": "2026-01-14T12:00:00Z",
  "lifterStatus": "warm",
  "lifterWarm": true,
  "lifterPoolStats": { "intel-64": 4, "armv8-64": 2 },
  "cacheStatus": "enabled",
  "cacheEnabled": true
}
```

**Determinism Constraints:**
- All timestamps in ISO-8601 UTC format
- ASCII-only output
- Deterministic JSON key ordering
- Secrets/credentials redacted from config endpoint

#### 2.2.6 Binary Vulnerability Service

Main query interface for consumers.

```csharp
public interface IBinaryVulnerabilityService
{
    /// <summary>
    /// Look up vulnerabilities by Build-ID or equivalent binary identity.
    /// </summary>
    Task<ImmutableArray<BinaryVulnMatch>> LookupByIdentityAsync(
        BinaryIdentity identity,
        LookupOptions? opts = null,
        CancellationToken ct = default);

    /// <summary>
    /// Look up vulnerabilities by function fingerprint.
    /// </summary>
    Task<ImmutableArray<BinaryVulnMatch>> LookupByFingerprintAsync(
        CodeFingerprint fingerprint,
        decimal minSimilarity = 0.95m,
        CancellationToken ct = default);

    /// <summary>
    /// Batch lookup for scan performance.
    /// </summary>
    Task<ImmutableDictionary<string, ImmutableArray<BinaryVulnMatch>>> LookupBatchAsync(
        IEnumerable<BinaryIdentity> identities,
        LookupOptions? opts = null,
        CancellationToken ct = default);

    /// <summary>
    /// Get distro-specific fix status (patch-aware).
    /// </summary>
    Task<FixRecord?> GetFixStatusAsync(
        string distro,
        string release,
        string sourcePkg,
        string cveId,
        CancellationToken ct = default);
}

public sealed record BinaryVulnMatch(
    string CveId,
    string VulnerablePurl,
    MatchMethod Method,         // buildid_catalog, fingerprint_match, range_match
    decimal Confidence,
    MatchEvidence Evidence);

public enum MatchMethod { BuildIdCatalog, FingerprintMatch, RangeMatch }
```

---

## 3. Data Model

### 3.1 PostgreSQL Schema (`binaries`)

The `binaries` schema stores binary identity, fingerprint, and match data.

```sql
CREATE SCHEMA IF NOT EXISTS binaries;
CREATE SCHEMA IF NOT EXISTS binaries_app;

-- RLS helper
CREATE OR REPLACE FUNCTION binaries_app.require_current_tenant()
RETURNS TEXT LANGUAGE plpgsql STABLE SECURITY DEFINER AS $$
DECLARE v_tenant TEXT;
BEGIN
    v_tenant := current_setting('app.tenant_id', true);
    IF v_tenant IS NULL OR v_tenant = '' THEN
        RAISE EXCEPTION 'app.tenant_id session variable not set';
    END IF;
    RETURN v_tenant;
END;
$$;
```

#### 3.1.1 Core Tables

See `docs/db/schemas/binaries_schema_specification.md` for complete DDL.

**Key Tables:**

| Table | Purpose |
|-------|---------|
| `binaries.binary_identity` | Known binary identities (Build-ID, hashes) |
| `binaries.binary_package_map` | Binary → package mapping per snapshot |
| `binaries.vulnerable_buildids` | Build-IDs known to be vulnerable |
| `binaries.vulnerable_fingerprints` | Function fingerprints for CVEs |
| `binaries.cve_fix_index` | Patch-aware fix status per distro |
| `binaries.fingerprint_matches` | Match results (findings evidence) |
| `binaries.corpus_snapshots` | Corpus ingestion tracking |

### 3.2 RustFS Layout

```
rustfs://stellaops/binaryindex/
    fingerprints/<algorithm>/<prefix>/<fingerprint_id>.bin
    corpus/<distro>/<release>/<snapshot_id>/manifest.json
    corpus/<distro>/<release>/<snapshot_id>/packages/<pkg>.metadata.json
    evidence/<match_id>.dsse.json
```

---

## 4. Integration Points

### 4.1 Scanner.Worker Integration

During container scanning, Scanner.Worker queries BinaryIndex for each extracted binary:

```mermaid
sequenceDiagram
    participant SW as Scanner.Worker
    participant BI as BinaryIndex
    participant PG as PostgreSQL
    participant FL as Findings Ledger

    SW->>SW: Extract binary from layer
    SW->>SW: Compute BinaryIdentity
    SW->>BI: LookupByIdentityAsync(identity)
    BI->>PG: Query binaries.vulnerable_buildids
    PG-->>BI: Matches
    BI->>PG: Query binaries.cve_fix_index (if distro known)
    PG-->>BI: Fix status
    BI-->>SW: BinaryVulnMatch[]
    SW->>FL: RecordFinding(match, evidence)
```

### 4.2 Concelier Integration

BinaryIndex subscribes to Concelier's advisory updates:

```mermaid
sequenceDiagram
    participant CO as Concelier
    participant BI as BinaryIndex
    participant PG as PostgreSQL

    CO->>CO: Ingest new advisory
    CO->>BI: advisory.created event
    BI->>BI: Check if affected packages in corpus
    BI->>PG: Update binaries.binary_vuln_assertion
    BI->>BI: Queue fingerprint generation (if high-impact)
```

### 4.3 Policy Integration

Binary matches are recorded as proof segments:

```json
{
  "segment_type": "binary_fingerprint_evidence",
  "payload": {
    "binary_identity": {
      "format": "elf",
      "build_id": "abc123...",
      "file_sha256": "def456..."
    },
    "matches": [
      {
        "cve_id": "CVE-2024-1234",
        "method": "buildid_catalog",
        "confidence": 0.98,
        "vulnerable_purl": "pkg:deb/debian/libssl3@1.1.1n-0+deb11u3"
      }
    ]
  }
}
```

---

## 5. MVP Roadmap

### MVP 1: Known-Build Binary Catalog (Sprint 6000.0001)

**Goal:** Query "is this Build-ID vulnerable?" with distro-level precision.

**Deliverables:**
- `binaries` PostgreSQL schema
- Build-ID to package mapping tables
- Basic CVE lookup by binary identity
- Debian/Ubuntu corpus connector

### MVP 2: Patch-Aware Backport Handling (Sprint 6000.0002)

**Goal:** Handle "version says vulnerable but distro backported the fix."

**Deliverables:**
- Fix index builder (changelog + patch header parsing)
- Distro-specific version comparison
- RPM corpus connector
- Scanner.Worker integration

### MVP 3: Binary Fingerprint Factory (Sprint 6000.0003)

**Goal:** Detect vulnerable code independent of package metadata.

**Deliverables:**
- Fingerprint storage and matching
- Reference build generation pipeline
- Fingerprint validation corpus
- High-impact CVE coverage (OpenSSL, glibc, zlib, curl)

### MVP 4: Full Scanner Integration (Sprint 6000.0004)

**Goal:** Binary evidence in production scans.

**Deliverables:**
- Scanner.Worker binary lookup integration
- Findings Ledger binary match records
- Proof segment attestations
- CLI binary match inspection

---

## 5b. Fix Evidence Chain

The **Fix Evidence Chain** provides auditable proof of why a CVE is marked as fixed (or not) for a specific distro/package combination. This is critical for patch-aware backport handling where package versions can be misleading.

### 5b.1 Evidence Sources

| Source | Confidence | Description |
|--------|------------|-------------|
| **Security Feed (OVAL)** | 0.95-0.99 | Authoritative feed from distro (Debian Security Tracker, Red Hat OVAL) |
| **Patch Header (DEP-3)** | 0.87-0.95 | CVE reference in Debian/Ubuntu patch metadata |
| **Changelog** | 0.75-0.85 | CVE mention in debian/changelog or RPM %changelog |
| **Upstream Patch Match** | 0.90 | Binary diff matches known upstream fix |

### 5b.2 Evidence Storage

Evidence is stored in two PostgreSQL tables:

```sql
-- Evidence blobs: audit trail (created first so the fix index can reference it)
CREATE TABLE binaries.fix_evidence (
    id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    evidence_type TEXT NOT NULL,    -- changelog, patch_header, security_feed
    source_file TEXT,               -- Path to source file (changelog, patch)
    source_sha256 TEXT,             -- Hash of source file
    excerpt TEXT,                   -- Relevant snippet (max 1KB)
    metadata JSONB NOT NULL,        -- Structured metadata
    snapshot_id UUID,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Fix index: one row per (distro, release, source_pkg, cve_id)
CREATE TABLE binaries.cve_fix_index (
    id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    distro TEXT NOT NULL,           -- debian, ubuntu, alpine, rhel
    release TEXT NOT NULL,          -- bookworm, jammy, v3.19
    source_pkg TEXT NOT NULL,
    cve_id TEXT NOT NULL,
    state TEXT NOT NULL,            -- fixed, vulnerable, not_affected, wontfix, unknown
    fixed_version TEXT,
    method TEXT NOT NULL,           -- security_feed, changelog, patch_header, upstream_match
    confidence DECIMAL(3,2) NOT NULL,
    evidence_id UUID REFERENCES binaries.fix_evidence(id),
    snapshot_id UUID,
    indexed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (tenant_id, distro, release, source_pkg, cve_id)
);
```

### 5b.3 Evidence Types

**ChangelogEvidence:**
```json
{
  "evidence_type": "changelog",
  "source_file": "debian/changelog",
  "excerpt": "* Fix CVE-2024-0727: PKCS12 decoding crash",
  "metadata": {
    "version": "3.0.11-1~deb12u2",
    "line_number": 5
  }
}
```

**PatchHeaderEvidence:**
```json
{
  "evidence_type": "patch_header",
  "source_file": "debian/patches/CVE-2024-0727.patch",
  "excerpt": "CVE: CVE-2024-0727\nOrigin: upstream, https://github.com/openssl/commit/abc123",
  "metadata": {
    "patch_sha256": "abc123def456..."
  }
}
```

**SecurityFeedEvidence:**
```json
{
  "evidence_type": "security_feed",
  "metadata": {
    "feed_id": "debian-security-tracker",
    "entry_id": "DSA-5678-1",
    "published_at": "2024-01-15T10:00:00Z"
  }
}
```

### 5b.4 Confidence Resolution

When multiple evidence sources exist for the same CVE, the system keeps the **highest confidence** entry:

```sql
-- Upsert fragment. "existing" aliases the stored row
-- (INSERT INTO binaries.cve_fix_index AS existing ...); EXCLUDED is the incoming row.
ON CONFLICT (tenant_id, distro, release, source_pkg, cve_id)
DO UPDATE SET
    confidence = GREATEST(existing.confidence, EXCLUDED.confidence),
    method = CASE
        WHEN existing.confidence < EXCLUDED.confidence THEN EXCLUDED.method
        ELSE existing.method
    END,
    evidence_id = CASE
        WHEN existing.confidence < EXCLUDED.confidence THEN EXCLUDED.evidence_id
        ELSE existing.evidence_id
    END
```

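The same rule expressed in application code, as a minimal sketch. `FixCandidate` and `ConfidenceResolution` are hypothetical names introduced here for illustration; they are not types from the BinaryIndex codebase.

```csharp
using System;

// Hypothetical stand-in for the (method, confidence, evidence_id) columns of a fix index row.
public sealed record FixCandidate(string Method, decimal Confidence, Guid EvidenceId);

public static class ConfidenceResolution
{
    // Mirrors the upsert rule: the incoming row replaces the stored one only when
    // it is strictly more confident; ties keep the existing row.
    public static FixCandidate Resolve(FixCandidate existing, FixCandidate incoming) =>
        existing.Confidence < incoming.Confidence ? incoming : existing;
}
```

With a stored changelog entry at 0.80 and an incoming patch-header entry at 0.87, the patch-header entry wins; a second changelog entry at 0.80 would leave the stored row unchanged.
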
### 5b.5 Parsers

The following parsers extract CVE fix information:

| Parser | Distros | Input | Confidence |
|--------|---------|-------|------------|
| `DebianChangelogParser` | Debian, Ubuntu | debian/changelog | 0.80 |
| `PatchHeaderParser` | Debian, Ubuntu | debian/patches/*.patch (DEP-3) | 0.87 |
| `AlpineSecfixesParser` | Alpine | APKBUILD secfixes block | 0.95 |
| `RpmChangelogParser` | RHEL, Fedora, CentOS | RPM spec %changelog | 0.75 |

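The core of a changelog parser is locating CVE identifiers and the lines they appear on. The sketch below shows only that step; `ChangelogCveScanner` is a hypothetical name, and the real `DebianChangelogParser` presumably also tracks the version stanza each mention belongs to.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch: finds CVE identifiers and their 1-based line numbers.
public static class ChangelogCveScanner
{
    // CVE IDs are "CVE-", a four-digit year, "-", then four or more digits.
    private static readonly Regex CvePattern =
        new(@"CVE-\d{4}-\d{4,}", RegexOptions.Compiled | RegexOptions.CultureInvariant);

    public static IReadOnlyList<(int LineNumber, string CveId)> Scan(string changelogText)
    {
        var hits = new List<(int LineNumber, string CveId)>();
        string[] lines = changelogText.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            foreach (Match m in CvePattern.Matches(lines[i]))
            {
                hits.Add((i + 1, m.Value));
            }
        }
        return hits;
    }
}
```

The line number and matched text correspond to the `line_number` and `excerpt` fields of the changelog evidence record.
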
### 5b.6 Query Flow

```mermaid
sequenceDiagram
    participant SW as Scanner.Worker
    participant BVS as BinaryVulnerabilityService
    participant FIR as FixIndexRepository
    participant PG as PostgreSQL

    SW->>BVS: GetFixStatusAsync(debian, bookworm, openssl, CVE-2024-0727)
    BVS->>FIR: GetFixStatusAsync(...)
    FIR->>PG: SELECT FROM cve_fix_index WHERE ...
    PG-->>FIR: FixIndexEntry (state=fixed, confidence=0.87)
    FIR-->>BVS: FixStatusResult
    BVS-->>SW: {state: Fixed, confidence: 0.87, method: PatchHeader}
```

---

## 6. Security Considerations

### 6.1 Trust Boundaries

1. **Corpus Ingestion** - Packages are untrusted; extraction runs in sandboxed workers
2. **Fingerprint Generation** - Reference builds compiled in isolated environments
3. **Query API** - Tenant-isolated via RLS; no cross-tenant data leakage

### 6.2 Signing & Provenance

- All corpus snapshots are signed (DSSE)
- Fingerprint sets are versioned and signed
- Every match result references evidence digests

### 6.3 Sandbox Requirements

Binary extraction and fingerprint generation MUST run with:
- Seccomp profile restricting syscalls
- Read-only root filesystem
- No network access during analysis
- Memory/CPU limits

---

## 7. Observability

### 7.1 Metrics

| Metric | Type | Labels |
|--------|------|--------|
| `binaryindex_lookup_total` | Counter | method, result |
| `binaryindex_lookup_latency_ms` | Histogram | method |
| `binaryindex_corpus_packages_total` | Gauge | distro, release |
| `binaryindex_fingerprints_indexed` | Gauge | algorithm, component |
| `binaryindex_match_confidence` | Histogram | method |

### 7.2 Traces

- `binaryindex.lookup` - Full lookup span
- `binaryindex.corpus.ingest` - Corpus ingestion
- `binaryindex.fingerprint.generate` - Fingerprint generation

### 7.3 Ops Endpoints

> **Sprint:** SPRINT_20260112_007_BINIDX_binaryindex_user_config

BinaryIndex exposes read-only ops endpoints for health, bench, cache, and effective configuration:

| Endpoint | Method | Response Schema | Description |
|----------|--------|-----------------|-------------|
| `/api/v1/ops/binaryindex/health` | GET | `BinaryIndexOpsHealthResponse` | Health status, lifter warmness per ISA, cache availability |
| `/api/v1/ops/binaryindex/bench/run` | POST | `BinaryIndexBenchResponse` | Run latency benchmark, return min/max/mean/p50/p95/p99 stats |
| `/api/v1/ops/binaryindex/cache` | GET | `BinaryIndexFunctionCacheStats` | Function cache hit/miss/eviction statistics |
| `/api/v1/ops/binaryindex/config` | GET | `BinaryIndexEffectiveConfig` | Effective configuration with secrets redacted |

#### 7.3.1 Response Schemas

**BinaryIndexOpsHealthResponse:**
```json
{
  "status": "healthy",
  "timestamp": "2026-01-16T12:00:00Z",
  "components": {
    "lifterPool": { "status": "healthy", "message": null },
    "functionCache": { "status": "healthy", "message": null },
    "persistence": { "status": "healthy", "message": null }
  },
  "lifterWarmness": {
    "intel-64": { "isa": "intel-64", "warm": true, "poolSize": 4, "acquireTimeMs": 12 },
    "armv8-64": { "isa": "armv8-64", "warm": true, "poolSize": 2, "acquireTimeMs": 8 }
  }
}
```

**BinaryIndexBenchResponse:**
```json
{
  "timestamp": "2026-01-16T12:00:00Z",
  "sampleSize": 100,
  "latencySummary": {
    "minMs": 5.2,
    "maxMs": 142.8,
    "meanMs": 28.4,
    "p50Ms": 22.1,
    "p95Ms": 78.3,
    "p99Ms": 121.5
  },
  "operations": [
    { "operation": "lifterAcquire", "samples": 100, "meanMs": 12.4 },
    { "operation": "irNormalization", "samples": 100, "meanMs": 8.7 },
    { "operation": "cacheLookup", "samples": 100, "meanMs": 1.2 }
  ]
}
```

**BinaryIndexFunctionCacheStats:**
```json
{
  "enabled": true,
  "backend": "valkey",
  "hits": 15234,
  "misses": 892,
  "evictions": 45,
  "hitRate": 0.944,
  "keyPrefix": "stellaops:binidx:funccache:",
  "cacheTtlSeconds": 14400,
  "estimatedEntries": 12500,
  "estimatedMemoryBytes": 52428800
}
```

**BinaryIndexEffectiveConfig:**
```json
{
  "b2r2Pool": {
    "maxPoolSizePerIsa": 4,
    "warmPreload": ["intel-64", "armv8-64"],
    "acquireTimeoutMs": 5000,
    "enableMetrics": true
  },
  "semanticLifting": {
    "b2r2Version": "1.5.0",
    "normalizationRecipeVersion": "2024.1",
    "maxInstructionsPerFunction": 10000,
    "maxFunctionsPerBinary": 5000,
    "functionLiftTimeoutMs": 30000,
    "enableDeduplication": true
  },
  "functionCache": {
    "connectionString": "********",
    "keyPrefix": "stellaops:binidx:funccache:",
    "cacheTtlSeconds": 14400,
    "maxTtlSeconds": 86400,
    "earlyExpiryPercent": 0.1,
    "maxEntrySizeBytes": 1048576
  },
  "persistence": {
    "schema": "binaries",
    "minPoolSize": 5,
    "maxPoolSize": 20,
    "commandTimeoutSeconds": 30,
    "retryOnFailure": true,
    "batchSize": 100
  },
  "backendVersions": {
    "b2r2": "1.5.0",
    "valkey": "7.2.0",
    "postgres": "15.4"
  }
}
```

#### 7.3.2 Rate Limiting

The `/bench/run` endpoint is rate-limited to prevent load spikes:
- Default: 5 requests per minute per tenant
- Configurable via `BinaryIndex:Ops:BenchRateLimitPerMinute`
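
A per-tenant limit of this kind can be illustrated with a fixed one-minute window. The sketch below is an assumption about one possible approach, not the service's limiter; `BenchRateLimiter` and `TryAcquire` are hypothetical names, and a production service would more likely rely on the hosting framework's rate-limiting middleware.

```csharp
using System;
using System.Collections.Generic;

// Fixed-window limiter keyed by tenant (illustrative only).
public sealed class BenchRateLimiter
{
    private readonly int _limitPerMinute;
    private readonly Dictionary<string, (DateTimeOffset WindowStart, int Count)> _windows = new();
    private readonly object _gate = new();

    public BenchRateLimiter(int limitPerMinute = 5) => _limitPerMinute = limitPerMinute;

    // Returns true when the tenant may start a benchmark run at time `now`.
    public bool TryAcquire(string tenantId, DateTimeOffset now)
    {
        lock (_gate)
        {
            if (!_windows.TryGetValue(tenantId, out var window) ||
                now - window.WindowStart >= TimeSpan.FromMinutes(1))
            {
                window = (now, 0); // start a fresh one-minute window
            }

            if (window.Count >= _limitPerMinute)
            {
                return false; // limit reached for the current window
            }

            _windows[tenantId] = (window.WindowStart, window.Count + 1);
            return true;
        }
    }
}
```

Passing the clock value in as a parameter keeps the limiter deterministic under test; the default of 5 corresponds to `BenchRateLimitPerMinute`.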

#### 7.3.3 Secret Redaction

The config endpoint automatically redacts sensitive keys:

| Key | Handling |
|-----|----------|
| `connectionString` | Replaced with `********` |
| `password` | Replaced with `********` |
| `secret*` | Any key starting with "secret" |
| `apiKey` | Replaced with `********` |
| `token` | Replaced with `********` |

Redaction is applied recursively to nested objects.
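
A recursive redaction pass over a JSON tree might look like the following sketch. It assumes the effective configuration is available as a `System.Text.Json.Nodes.JsonNode`; `ConfigRedactor` is a hypothetical helper, and the case-insensitive key comparison is an assumption rather than documented behavior.

```csharp
using System;
using System.Linq;
using System.Text.Json.Nodes;

public static class ConfigRedactor
{
    private const string Mask = "********";
    private static readonly string[] ExactKeys = { "connectionString", "password", "apiKey", "token" };

    private static bool IsSensitive(string key) =>
        key.StartsWith("secret", StringComparison.OrdinalIgnoreCase) ||
        ExactKeys.Any(k => string.Equals(k, key, StringComparison.OrdinalIgnoreCase));

    // Walks the tree in place and masks the value of every sensitive property.
    public static void Redact(JsonNode? node)
    {
        switch (node)
        {
            case JsonObject obj:
                // Copy the keys first because properties are replaced during the walk.
                foreach (var key in obj.Select(p => p.Key).ToList())
                {
                    if (IsSensitive(key)) obj[key] = Mask;
                    else Redact(obj[key]);
                }
                break;
            case JsonArray array:
                foreach (var item in array) Redact(item);
                break;
        }
    }
}
```

Masking the whole value, including nested objects under a sensitive key, errs on the side of hiding too much rather than too little.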

---

## 8. Configuration

> **Sprint:** SPRINT_20260112_007_BINIDX_binaryindex_user_config

### 8.1 Configuration Sections

Configuration lives under the `BinaryIndex` section of `appsettings.yaml` (persistence settings use `Postgres:BinaryIndex`, see 8.1.4) or is supplied through environment variables with the `BINARYINDEX__` prefix.

#### 8.1.1 B2R2 Lifter Pool (`BinaryIndex:B2R2Pool`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `MaxPoolSizePerIsa` | int | 4 | Maximum lifter instances per ISA |
| `WarmPreload` | string[] | ["intel-64", "armv8-64"] | ISAs to warm on startup |
| `AcquireTimeoutMs` | int | 5000 | Timeout for lifter acquisition |
| `EnableMetrics` | bool | true | Emit Prometheus metrics for pool |

```yaml
BinaryIndex:
  B2R2Pool:
    MaxPoolSizePerIsa: 4
    WarmPreload:
      - intel-64
      - armv8-64
    AcquireTimeoutMs: 5000
    EnableMetrics: true
```

#### 8.1.2 Semantic Lifting (`BinaryIndex:SemanticLifting`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `B2R2Version` | string | "1.5.0" | B2R2 disassembler version |
| `NormalizationRecipeVersion` | string | "2024.1" | IR normalization recipe version |
| `MaxInstructionsPerFunction` | int | 10000 | Max instructions to lift per function |
| `MaxFunctionsPerBinary` | int | 5000 | Max functions to process per binary |
| `FunctionLiftTimeoutMs` | int | 30000 | Timeout for lifting single function |
| `EnableDeduplication` | bool | true | Deduplicate IR before fingerprinting |

```yaml
BinaryIndex:
  SemanticLifting:
    MaxInstructionsPerFunction: 10000
    MaxFunctionsPerBinary: 5000
    FunctionLiftTimeoutMs: 30000
    EnableDeduplication: true
```

#### 8.1.3 Function Cache (`BinaryIndex:FunctionCache`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `ConnectionString` | string | (none) | Valkey connection string (secret) |
| `KeyPrefix` | string | "stellaops:binidx:funccache:" | Cache key prefix |
| `CacheTtlSeconds` | int | 14400 | Default cache TTL (4 hours) |
| `MaxTtlSeconds` | int | 86400 | Maximum TTL (24 hours) |
| `EarlyExpiryPercent` | decimal | 0.1 | Early expiry jitter (10%) |
| `MaxEntrySizeBytes` | int | 1048576 | Max entry size (1 MB) |

```yaml
BinaryIndex:
  FunctionCache:
    ConnectionString: ${VALKEY_CONNECTION}  # from env
    KeyPrefix: "stellaops:binidx:funccache:"
    CacheTtlSeconds: 14400
    MaxEntrySizeBytes: 1048576
```
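
How `EarlyExpiryPercent` is applied is not spelled out in this section. One plausible reading, sketched below purely as an assumption, is that each entry's TTL is shortened by a random fraction of up to 10% so that entries written together do not all expire at the same instant. `CacheTtl.WithEarlyExpiry` is a hypothetical helper.

```csharp
using System;

public static class CacheTtl
{
    // Caps the TTL at maxTtlSeconds, then subtracts a random share
    // in [0, earlyExpiryPercent) of it as jitter.
    public static TimeSpan WithEarlyExpiry(
        int cacheTtlSeconds, int maxTtlSeconds, double earlyExpiryPercent, Random rng)
    {
        var baseSeconds = Math.Min(cacheTtlSeconds, maxTtlSeconds);
        var jitterSeconds = baseSeconds * earlyExpiryPercent * rng.NextDouble();
        return TimeSpan.FromSeconds(baseSeconds - jitterSeconds);
    }
}
```

With the defaults above (14400 seconds, 0.1), the resulting TTL falls between 12960 and 14400 seconds.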

#### 8.1.4 Persistence (`Postgres:BinaryIndex`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `Schema` | string | "binaries" | PostgreSQL schema name |
| `MinPoolSize` | int | 5 | Minimum connection pool size |
| `MaxPoolSize` | int | 20 | Maximum connection pool size |
| `CommandTimeoutSeconds` | int | 30 | Command execution timeout |
| `RetryOnFailure` | bool | true | Retry transient failures |
| `BatchSize` | int | 100 | Batch insert size |

```yaml
Postgres:
  BinaryIndex:
    Schema: binaries
    MinPoolSize: 5
    MaxPoolSize: 20
    CommandTimeoutSeconds: 30
    RetryOnFailure: true
    BatchSize: 100
```

#### 8.1.5 Ops Configuration (`BinaryIndex:Ops`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `EnableHealthEndpoint` | bool | true | Enable /health endpoint |
| `EnableBenchEndpoint` | bool | true | Enable /bench/run endpoint |
| `BenchRateLimitPerMinute` | int | 5 | Rate limit for bench endpoint |
| `RedactedKeys` | string[] | See 7.3.3 | Keys to redact in config output |

### 8.2 Legacy Configuration

```yaml
# binaryindex.yaml (corpus configuration)
binaryindex:
  enabled: true

  corpus:
    connectors:
      - type: debian
        enabled: true
        mirror: http://deb.debian.org/debian
        releases: [bookworm, bullseye]
        architectures: [amd64, arm64]
      - type: ubuntu
        enabled: true
        mirror: http://archive.ubuntu.com/ubuntu
        releases: [jammy, noble]

  fingerprinting:
    enabled: true
    algorithms: [basic_block, cfg]
    target_components:
      - openssl
      - glibc
      - zlib
      - curl
      - sqlite
    min_function_size: 16  # bytes
    max_functions_per_binary: 10000

  lookup:
    cache_ttl: 3600
    batch_size: 100
    timeout_ms: 5000

  storage:
    postgres_schema: binaries
    rustfs_bucket: stellaops/binaryindex
```

---

## 9. Testing Strategy

### 9.1 Unit Tests

- Identity extraction (Build-ID, hashes)
- Fingerprint generation determinism
- Fix index parsing (changelog, patch headers)

### 9.2 Integration Tests

- PostgreSQL schema validation
- Full corpus ingestion flow
- Scanner.Worker lookup integration

### 9.3 Regression Tests

- Known CVE detection (golden corpus)
- Backport handling (Debian libssl example)
- False positive rate validation

---

## 10. Golden Corpus for Patch Provenance

> **Sprint:** SPRINT_20260121_034/035/036 - Golden Corpus Implementation

The BinaryIndex module supports a **golden corpus** of patch-paired artifacts that enables offline SBOM reproducibility and binary-level patch provenance verification.

### 10.1 Corpus Purpose

The golden corpus provides:
- **Auditor-ready evidence bundles** for air-gapped customers
- **Regression testing** for binary matching accuracy
- **Proof of patch status** independent of package metadata

### 10.2 Corpus Sources

| Source | Type | Purpose |
|--------|------|---------|
| Debian Security Tracker / DSAs | Advisory | Primary advisory linkage |
| Debian Snapshot | Binary archive | Pre/post patch binary pairs |
| Ubuntu Security Notices | Advisory | Ubuntu-specific advisories |
| Alpine secdb | Advisory | Alpine YAML advisories |
| OSV dump | Unified schema | Cross-reference and commit ranges |

#### 10.2.1 Symbol Source Connectors

> **Sprint:** SPRINT_20260121_035_BinaryIndex_golden_corpus_connectors_cli

The corpus ingestion layer uses pluggable connectors to retrieve symbols and metadata from upstream sources:

| Connector ID | Implementation | Protocol | Data Retrieved |
|--------------|----------------|----------|----------------|
| `debuginfod-fedora` | `DebuginfodConnector` | debuginfod HTTP | ELF debug symbols by Build-ID |
| `debuginfod-ubuntu` | `DebuginfodConnector` | debuginfod HTTP | ELF debug symbols by Build-ID |
| `ddeb-ubuntu` | `DdebConnector` | APT/HTTP | `.ddeb` debug packages |
| `buildinfo-debian` | `BuildinfoConnector` | HTTP | `.buildinfo` reproducibility records |
| `secdb-alpine` | `AlpineSecDbConnector` | Git/HTTP | `secfixes` YAML from APKBUILD |

**Connector Interface:**

```csharp
public interface ISymbolSourceConnector
{
    string ConnectorId { get; }
    string DisplayName { get; }
    string[] SupportedDistros { get; }

    Task<ConnectorStatus> GetStatusAsync(CancellationToken ct);
    Task SyncAsync(SyncOptions options, CancellationToken ct);
    Task<SymbolLookupResult?> LookupByBuildIdAsync(string buildId, CancellationToken ct);
    Task<IAsyncEnumerable<SymbolRecord>> SearchAsync(SymbolSearchQuery query, CancellationToken ct);
}
```

**Debuginfod Connector:**

The `DebuginfodConnector` implements the [debuginfod protocol](https://sourceware.org/elfutils/Debuginfod.html) for retrieving debug symbols:

- Endpoint: `GET /buildid/<build-id>/debuginfo`
- Supports federated queries across multiple debuginfod servers
- Caches retrieved symbols in RustFS blob storage
- Rate-limited to respect upstream server policies
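
The endpoint shape above can be illustrated with a small URL builder. This is a sketch under stated assumptions: `DebuginfodUrls` is a hypothetical helper, the server address in the usage note is a placeholder, and lower-casing the Build-ID is an assumption about normalization rather than documented connector behavior.

```csharp
using System;
using System.Linq;

public static class DebuginfodUrls
{
    // Builds <server>/buildid/<build-id>/debuginfo for a hex-encoded Build-ID.
    // The server URI should end with '/' so the relative path is appended
    // instead of replacing the last path segment.
    public static Uri DebugInfo(Uri server, string buildId)
    {
        if (buildId.Length == 0 || !buildId.All(Uri.IsHexDigit))
        {
            throw new ArgumentException("Build-ID must be a non-empty hex string.", nameof(buildId));
        }

        return new Uri(server, $"buildid/{buildId.ToLowerInvariant()}/debuginfo");
    }
}
```

For example, `DebuginfodUrls.DebugInfo(new Uri("https://debuginfod.example.org/"), buildId)` yields the address to request; a federated lookup could try the same relative path against each configured server in turn.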

**Ubuntu ddeb Connector:**

The `DdebConnector` retrieves Ubuntu debug symbol packages (`.ddeb`):

- Sources: `ddebs.ubuntu.com` mirror
- Indexes: Reads `Packages.xz` for package metadata
- Extraction: Unpacks `.ddeb` AR archives to extract DWARF symbols
- Mapping: Links debug symbols to binary packages via Build-ID

**Debian Buildinfo Connector:**

The `BuildinfoConnector` retrieves Debian buildinfo files for reproducibility verification:

- Source: `buildinfos.debian.net` and snapshot archives
- Purpose: Provides build environment metadata for reproducible builds
- Fields extracted: `Build-Date`, `Build-Architecture`, `Checksums-Sha256`
- Integration: Cross-references with binary packages for provenance

**Alpine SecDB Connector:**

The `AlpineSecDbConnector` parses Alpine's security database:

- Source: `secfixes` blocks in APKBUILD files
- Repository: `alpine/aports` Git repository
- Format: YAML blocks mapping CVEs to fixed versions
- Example:
  ```yaml
  secfixes:
    3.0.11-r0:
      - CVE-2024-0727
      - CVE-2024-0728
  ```

**OSV Dump Parser:**

The `OsvDumpParser` processes Google OSV database dumps for advisory cross-correlation:

- Source: `osv.dev` bulk exports (JSON)
- Purpose: CVE → commit range extraction for patch identification
- Cross-reference: Correlates OSV entries with distribution advisories
- Inconsistency detection: Identifies discrepancies between OSV and distro advisories

```csharp
public interface IOsvDumpParser
{
    IAsyncEnumerable<OsvParsedEntry> ParseDumpAsync(Stream osvDumpStream, CancellationToken ct);
    OsvCveIndex BuildCveIndex(IEnumerable<OsvParsedEntry> entries);
    IEnumerable<AdvisoryCorrelation> CrossReferenceWithExternal(
        OsvCveIndex osvIndex,
        IEnumerable<ExternalAdvisory> externalAdvisories);
    IEnumerable<AdvisoryInconsistency> DetectInconsistencies(
        IEnumerable<AdvisoryCorrelation> correlations);
}
```

**CLI Access:**

All connectors are manageable via the `stella groundtruth sources` CLI commands:

```bash
# List all connectors
stella groundtruth sources list

# Sync specific connector
stella groundtruth sources sync --source buildinfo-debian --full

# Enable/disable connectors
stella groundtruth sources enable ddeb-ubuntu
stella groundtruth sources disable debuginfod-fedora
```

See [Ground-Truth CLI Guide](../cli/guides/ground-truth-cli.md) for complete CLI documentation.

### 10.3 Key Performance Indicators

| KPI | Target | Description |
|-----|--------|-------------|
| Per-function match rate | >= 90% | Functions matched in post-patch binary |
| False-negative patch detection | <= 5% | Patched functions incorrectly classified |
| SBOM canonical-hash stability | 3/3 | Determinism across independent runs |
| Binary reconstruction equivalence | Trend | Rebuilt binary matches original |
| End-to-end verify time (p95, cold) | Trend | Offline verification performance |

### 10.4 Validation Harness

The validation harness (`IValidationHarness`) orchestrates end-to-end verification:

```
Binary Pair (pre/post) → Symbol Recovery → IR Lifting → Fingerprinting → Matching → Metrics
```

### 10.5 Evidence Bundle Format

Evidence bundles follow OCI/ORAS conventions:

```
<pkg>-<advisory>-bundle.oci.tar
├── manifest.json           # OCI manifest
└── blobs/
    ├── sha256:<sbom>       # Canonical SBOM
    ├── sha256:<pre-bin>    # Pre-fix binary
    ├── sha256:<post-bin>   # Post-fix binary
    ├── sha256:<delta-sig>  # DSSE delta-sig predicate
    └── sha256:<timestamp>  # RFC 3161 timestamp
```

### 10.6 Two-Tier Bundle Design and Large Blob References

> **Sprint:** SPRINT_20260122_040_Platform_oci_delta_attestation_pipeline (040-04)

Evidence bundles support two export modes to balance transfer speed with auditability:

| Mode | Export Flag | Contents | Use Case |
|------|------------|----------|----------|
| **Light** | (default) | Manifest + attestation envelopes + metadata | Quick transfer, metadata-only audit |
| **Full** | `--full` | Light + embedded binary blobs in `blobs/` | Air-gap replay, full provenance verification |

#### 10.6.1 `largeBlobs[]` Field

The `DeltaSigPredicate` includes a `largeBlobs` array referencing binary artifacts that may be too large to embed in attestation payloads:

```json
{
  "schemaVersion": "1.0.0",
  "subject": [...],
  "delta": [...],
  "largeBlobs": [
    {
      "kind": "binary-patch",
      "digest": "sha256:a1b2c3...",
      "mediaType": "application/octet-stream",
      "sizeBytes": 1048576
    },
    {
      "kind": "sbom-fragment",
      "digest": "sha256:d4e5f6...",
      "mediaType": "application/spdx+json",
      "sizeBytes": 32768
    }
  ],
  "sbomDigest": "sha256:789abc..."
}
```

**Field Definitions:**

| Field | Type | Description |
|-------|------|-------------|
| `largeBlobs[].kind` | string | Blob category: `binary-patch`, `sbom-fragment`, `debug-symbols`, etc. |
| `largeBlobs[].digest` | string | Content-addressable digest (`sha256:<hex>`, `sha384:<hex>`, `sha512:<hex>`) |
| `largeBlobs[].mediaType` | string | IANA media type of the blob |
| `largeBlobs[].sizeBytes` | long | Blob size in bytes |
| `sbomDigest` | string | Digest of the canonical SBOM associated with this delta |

#### 10.6.2 Blob Fetch Strategy

During `stella bundle verify --replay`, blobs are resolved in priority order:

1. **Embedded** (full bundles): Read from `blobs/<digest-with-dash>` in bundle directory
2. **Local source** (`--blob-source /path/`): Read from specified local directory
3. **Registry** (`--blob-source https://...`): HTTP GET from OCI registry (blocked in `--offline` mode)
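
The priority order can be sketched as a single resolution function. The example is illustrative only: `BlobLocator` is a hypothetical helper, the file naming inside a local source directory is assumed to match the embedded layout, and the registry URL layout is a placeholder rather than the actual OCI request path.

```csharp
using System;
using System.IO;

public static class BlobLocator
{
    // Returns a local path or a URL for the blob, or null when no source applies.
    public static string? Resolve(string bundleDir, string digest, string? blobSource, bool offline)
    {
        var fileName = digest.Replace(':', '-'); // sha256:<hex> -> sha256-<hex>

        // 1. Embedded blob shipped inside a full bundle.
        var embedded = Path.Combine(bundleDir, "blobs", fileName);
        if (File.Exists(embedded)) return embedded;

        if (blobSource is null) return null;

        var isRemote = blobSource.StartsWith("https://", StringComparison.OrdinalIgnoreCase)
                    || blobSource.StartsWith("http://", StringComparison.OrdinalIgnoreCase);

        // 2. Local directory passed via --blob-source.
        if (!isRemote)
        {
            var local = Path.Combine(blobSource, fileName);
            return File.Exists(local) ? local : null;
        }

        // 3. Remote registry, never contacted in offline mode.
        return offline ? null : $"{blobSource.TrimEnd('/')}/{digest}";
    }
}
```

Returning `null` instead of throwing lets the caller report which blobs could not be resolved before failing the replay step.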

#### 10.6.3 Digest Verification

Fetched blobs are verified against their declared digest using the algorithm prefix:

```
sha256:<hex>  → SHA-256
sha384:<hex>  → SHA-384
sha512:<hex>  → SHA-512
```

A mismatch fails the blob replay verification step.
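
A prefix-driven check along these lines could be written as follows. This is a minimal sketch, not the CLI's implementation; `BlobDigest.Verify` is a hypothetical name, and treating the hex comparison as case-insensitive is an assumption.

```csharp
using System;
using System.Security.Cryptography;

public static class BlobDigest
{
    // Recomputes the hash named by the digest prefix and compares it with the declared hex value.
    public static bool Verify(ReadOnlySpan<byte> content, string declaredDigest)
    {
        var separator = declaredDigest.IndexOf(':');
        if (separator <= 0) return false;

        var algorithm = declaredDigest[..separator];
        var expectedHex = declaredDigest[(separator + 1)..];

        byte[]? actual = algorithm switch
        {
            "sha256" => SHA256.HashData(content),
            "sha384" => SHA384.HashData(content),
            "sha512" => SHA512.HashData(content),
            _ => null, // unknown algorithm prefix: treat as a failed verification
        };

        return actual is not null &&
               string.Equals(Convert.ToHexString(actual), expectedHex, StringComparison.OrdinalIgnoreCase);
    }
}
```

Unknown prefixes are rejected rather than skipped, so a blob can never pass verification under an algorithm the verifier does not implement.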

### 10.7 Related Documentation

- [Golden Corpus KPIs](../../benchmarks/golden-corpus-kpis.md)
- [Golden Corpus Seed List](../../benchmarks/golden-corpus-seed-list.md)
- [Ground-Truth Corpus Specification](../../benchmarks/ground-truth-corpus.md)

---

## 11. References

- Advisory: `docs/product/advisories/21-Dec-2025 - Mapping Evidence Within Compiled Binaries.md`
- Scanner Native Analysis: `src/Scanner/StellaOps.Scanner.Analyzers.Native/`
- Existing Fingerprinting: `src/Scanner/__Libraries/StellaOps.Scanner.EntryTrace/Binary/`
- Build-ID Index: `src/Scanner/StellaOps.Scanner.Analyzers.Native/Index/`
- **Semantic Diffing Sprint:** `docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
- **Semantic Library:** `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/`
- **Semantic Tests:** `src/BinaryIndex/__Tests/StellaOps.BinaryIndex.Semantic.Tests/`
- **Golden Corpus Sprints:** `docs/implplan/SPRINT_20260121_034_BinaryIndex_golden_corpus_foundation.md`

---

## 12. Binary Micro-Witnesses

Binary micro-witnesses provide cryptographic proof of patch status at the binary level. They formalize the output of BinaryIndex's semantic diffing capabilities into an auditor-friendly, portable format.

### 12.1 Overview

A micro-witness is a DSSE (Dead Simple Signing Envelope) predicate that captures:
- Subject binary digest (SHA-256)
- CVE/patch reference
- Function-level evidence with confidence scores
- Delta-Sig fingerprint hash
- Tool versions and analysis metadata
- Optional SBOM component mapping

### 12.2 Predicate Schema

**Predicate Type:** `https://stellaops.dev/predicates/binary-micro-witness@v1`

```json
{
  "schemaVersion": "1.0.0",
  "binary": {
    "digest": "sha256:...",
    "purl": "pkg:deb/debian/openssl@3.0.11",
    "arch": "linux-amd64",
    "filename": "libssl.so.3"
  },
  "cve": {
    "id": "CVE-2024-0567",
    "advisory": "https://...",
    "patchCommit": "abc123"
  },
  "verdict": "patched",
  "confidence": 0.95,
  "evidence": [
    {
      "function": "SSL_CTX_new",
      "state": "patched",
      "score": 0.97,
      "method": "semantic_ksg",
      "hash": "sha256:..."
    }
  ],
  "deltaSigDigest": "sha256:...",
  "sbomRef": {
    "sbomDigest": "sha256:...",
    "purl": "pkg:...",
    "bomRef": "component-ref"
  },
  "tooling": {
    "binaryIndexVersion": "2.1.0",
    "lifter": "b2r2",
    "matchAlgorithm": "semantic_ksg"
  },
  "computedAt": "2026-01-28T12:00:00Z"
}
```

### 12.3 Verdicts

| Verdict | Meaning |
|---------|---------|
| `patched` | Binary matches patched version signature |
| `vulnerable` | Binary matches vulnerable version signature |
| `inconclusive` | Unable to determine (insufficient evidence) |
| `partial` | Some functions patched, others not |

### 12.4 CLI Commands

```bash
# Generate a micro-witness
stella witness generate /path/to/binary --cve CVE-2024-0567 --sbom sbom.json --output witness.json

# Verify a micro-witness
stella witness verify witness.json --offline

# Create portable bundle for air-gapped verification
stella witness bundle witness.json --output ./audit-bundle
```

### 12.5 Integration with Rekor

When `--rekor` is specified during generation, witnesses are logged to the Rekor transparency log using v2 tile-based inclusion proofs. This provides tamper-evidence and enables auditors to verify witnesses weren't backdated.

Offline verification bundles include tile proofs for air-gapped environments.

### 12.6 Related Documentation

- **Auditor Guide:** `docs/guides/binary-micro-witness-verification.md`
- **Predicate Schema:** `src/Attestor/StellaOps.Attestor.Types/schemas/stellaops-binary-micro-witness.v1.schema.json`
- **CLI Commands:** `src/Cli/StellaOps.Cli/Commands/Witness/`
- **Demo Bundle:** `demos/binary-micro-witness/`
- **Sprint:** `docs-archived/implplan/SPRINT_0128_001_BinaryIndex_binary_micro_witness.md`

---

## 13. Cross-Distro Coverage Matrix for Backport Validation

The cross-distro coverage matrix manages a curated set of high-impact CVEs
with per-distribution backport status tracking, enabling systematic
validation of backport detection accuracy across Alpine, Debian, and RHEL.

### 13.1 Architecture

1. **CuratedCveEntry** - One row per CVE (e.g., Heartbleed, Baron Samedit)
   with a cross-distro `DistroCoverageEntry` array tracking backport status
   per distro-version
2. **CrossDistroCoverageService** - In-memory coverage matrix with upsert,
   query, summary, and validation marking operations
3. **SeedBuiltInEntries** - Idempotent seeding of 5 curated high-impact CVEs
   (CVE-2014-0160, CVE-2021-3156, CVE-2015-0235, CVE-2023-38545, CVE-2024-6387)
   with pre-populated backport status across Alpine, Debian, and RHEL versions

### 13.2 Distro Families & Backport Status

| Enum | Values |
|---|---|
| `DistroFamily` | Alpine, Debian, Rhel |
| `BackportStatus` | NotPatched, Backported, NotApplicable, Unknown |

### 13.3 Models

| Type | Description |
|---|---|
| `DistroCoverageEntry` | Per distro-version: package name/version, backport status, validated flag |
| `CuratedCveEntry` | CVE with CommonName, CvssScore, CweIds, Coverage array, computed CoverageRatio |
| `CrossDistroCoverageSummary` | Aggregated counts: TotalCves, TotalEntries, ValidatedEntries, ByDistro breakdown |
| `DistroBreakdown` | Per-distro EntryCount, ValidatedCount, BackportedCount |
| `CuratedCveQuery` | Component/Distro/Status/OnlyUnvalidated filters with Limit/Offset paging |

### 13.4 Built-in Curated CVEs

| CVE | Component | Common Name | CVSS |
|---|---|---|---|
| CVE-2014-0160 | openssl | Heartbleed | 7.5 |
| CVE-2021-3156 | sudo | Baron Samedit | 7.8 |
| CVE-2015-0235 | glibc | GHOST | 10.0 |
| CVE-2023-38545 | curl | SOCKS5 heap overflow | 9.8 |
| CVE-2024-6387 | openssh | regreSSHion | 8.1 |

### 13.5 DI Registration

`ICrossDistroCoverageService` → `CrossDistroCoverageService` is registered via
`TryAddSingleton` in `GoldenSetServiceCollectionExtensions.AddGoldenSetServices()`.

### 13.6 OTel Metrics

Meter: `StellaOps.BinaryIndex.GoldenSet.CrossDistro`

| Counter | Description |
|---|---|
| `crossdistro.upsert.total` | CVE entries upserted |
| `crossdistro.query.total` | Coverage queries executed |
| `crossdistro.seed.total` | Built-in entries seeded |
| `crossdistro.validated.total` | Entries marked as validated |

### 13.7 Source Files

- Models: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.GoldenSet/Models/CrossDistroCoverageModels.cs`
- Interface: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.GoldenSet/Services/ICrossDistroCoverageService.cs`
- Implementation: `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.GoldenSet/Services/CrossDistroCoverageService.cs`

### 13.8 Test Coverage (37 tests)

- Models: DistroFamily/BackportStatus enum counts, DistroCoverageEntry roundtrips/defaults,
  CuratedCveEntry coverage ratio/empty, CuratedCveQuery defaults, Summary coverage/empty
- Service: SeedBuiltInEntries population/idempotency/heartbleed/baron-samedit/distro coverage,
  UpsertAsync store-retrieve/overwrite/null/empty, GetByCveIdAsync unknown/case-insensitive/null,
  QueryAsync all/component/distro/status/unvalidated/limit-offset/ordering,
  GetSummaryAsync counts/empty, SetValidatedAsync mark/unknown-cve/unknown-version/summary/null,
  CreateBuiltInEntries deterministic/distro-coverage

---

## 14. ELF Segment Normalization for Delta Hashing

### 14.1 Purpose

The existing instruction-level normalization (X64/Arm64 pipelines) operates on
disassembled instruction streams. ELF Segment Normalization fills the gap for
**raw binary bytes**: it zeros position-dependent data (relocation entries,
GOT/PLT displacements, alignment padding) and canonicalizes NOP sleds
*before* disassembly, enabling deterministic delta hashing across builds
compiled at different base addresses or link orders.

### 14.2 Key Types

| Type | Location | Purpose |
| --- | --- | --- |
| `ElfNormalizationStep` | `Normalization/ElfSegmentNormalizer.cs` | Enum of normalization passes (RelocationZeroing, GotPltCanonicalization, NopCanonicalization, JumpTableRewriting, PaddingZeroing) |
| `ElfSegmentNormalizationOptions` | same | Options record with `Default` and `Minimal` presets |
| `ElfSegmentNormalizationResult` | same | Result with NormalizedBytes, DeltaHash (SHA-256), ModifiedBytes, AppliedSteps, StepCounts, computed ModificationRatio |
| `IElfSegmentNormalizer` | same | Interface: `Normalize`, `ComputeDeltaHash` |
| `ElfSegmentNormalizer` | same | Implementation with 5 internal passes and 2 OTel counters |

### 14.3 Normalization Passes

1. **RelocationZeroing** - Scans for ELF64 RELA-shaped entries (heuristic:
   info field encodes valid x86-64 relocation types 1–42 with symbol index
   ≤100 000); zeros the offset and addend fields (16 bytes per entry).
2. **GotPltCanonicalization** - Detects `FF 25` (JMP [rip+disp32]) and
   `FF 35` (PUSH [rip+disp32]) PLT stub patterns; zeros the 4-byte
   displacement to remove position-dependent indirect jump targets.
3. **NopCanonicalization** - Matches 7 multi-byte x86-64 NOP variants
   (2–7 bytes each, per Intel SDM) and replaces them with canonical
   single-byte NOPs (0x90).
4. **JumpTableRewriting** - Identifies sequences of 4+ consecutive 8-byte
   entries sharing the same upper 32 bits (switch-statement jump tables);
   zeros the entries.
5. **PaddingZeroing** - Detects runs of 4+ alignment padding bytes (0xCC or
   0x00) between code regions and zeros them.
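
To make the pattern-based passes concrete, the GOT/PLT pass can be sketched as a byte scan. This is a simplified illustration, not the `ElfSegmentNormalizer` code; `PltCanonicalizer` is a hypothetical name, and the scan is a heuristic that matches the two opcode pairs wherever they occur in the buffer.

```csharp
using System;

public static class PltCanonicalizer
{
    // Zeros the disp32 operand of FF 25 (JMP [rip+disp32]) and FF 35 (PUSH [rip+disp32]).
    // Returns the number of bytes whose value actually changed.
    public static int ZeroDisplacements(Span<byte> code)
    {
        var modified = 0;
        var i = 0;
        while (i + 6 <= code.Length)
        {
            if (code[i] == 0xFF && (code[i + 1] == 0x25 || code[i + 1] == 0x35))
            {
                var displacement = code.Slice(i + 2, 4);
                foreach (var b in displacement)
                {
                    if (b != 0) modified++;
                }

                displacement.Clear();
                i += 6; // skip the whole 6-byte instruction
            }
            else
            {
                i++;
            }
        }

        return modified;
    }
}
```

Counting only bytes that changed keeps a value such as `ModifiedBytes` meaningful when the same buffer is normalized twice: the second run reports zero.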

### 14.4 Delta Hashing

`ComputeDeltaHash` produces a lowercase SHA-256 hex string of the normalized
byte buffer. Two builds of the same source that differ only in the
position-dependent data removed by the passes above produce the same delta hash.
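
The hashing step itself is small. The sketch below shows the described output format (lowercase SHA-256 hex) using a hypothetical `DeltaHash.Compute` helper; it is not the module's `ComputeDeltaHash` implementation.

```csharp
using System;
using System.Security.Cryptography;

public static class DeltaHash
{
    // Hashes the already-normalized bytes and renders the digest as lowercase hex.
    public static string Compute(ReadOnlySpan<byte> normalizedBytes) =>
        Convert.ToHexString(SHA256.HashData(normalizedBytes)).ToLowerInvariant();
}
```

Because the hash covers only the normalized buffer, its stability depends entirely on how much position-dependent data the preceding passes remove.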

### 14.5 OTel Instrumentation

Meter: `StellaOps.BinaryIndex.Normalization.ElfSegment`

| Counter | Description |
| --- | --- |
| `elfsegment.normalize.total` | Segments normalized |
| `elfsegment.bytes.modified` | Total bytes modified across all passes |

### 14.6 DI Registration

`IElfSegmentNormalizer` is registered as `TryAddSingleton<ElfSegmentNormalizer>`
inside `AddNormalizationPipelines()` in `ServiceCollectionExtensions.cs`.

### 14.7 Test Coverage (35 tests)

- Models: DefaultOptions (all enabled), MinimalOptions (relocations only), ModificationRatio zero/computed, enum values
- Service: Constructor null guard, empty input result + SHA-256, ComputeDeltaHash determinism/distinct,
  NOP canonicalization (3-byte, 2-byte, 4-byte, no-NOP, 7-byte, single-byte),
  GOT/PLT (JMP disp32, PUSH disp32), alignment padding (INT3 run, zero run, short run),
  relocation zeroing (valid RELA, invalid entry), jump table (consecutive addresses, random data),
  full pipeline (deterministic hash, default vs minimal, all-disabled, step-count consistency)

---

*Document Version: 1.5.0*
*Last Updated: 2026-02-08*