BinaryIndex Module Architecture
Ownership: Scanner Guild + Concelier Guild
Status: DRAFT
Version: 1.0.0
Related: High-Level Architecture, Scanner Architecture, Concelier Architecture
1. Overview
The BinaryIndex module provides a vulnerable binaries database that enables detection of vulnerable code at the binary level, independent of package metadata. This addresses a critical gap in vulnerability scanning: package version strings can lie (backports, custom builds, stripped metadata), but binary identity doesn't lie.
1.1 Problem Statement
Traditional vulnerability scanners rely on package version matching, which fails in several scenarios:
- Backported patches - Distros backport security fixes without changing upstream version
- Custom/vendored builds - Binaries compiled from source without package metadata
- Stripped binaries - Debug info and version strings removed
- Static linking - Vulnerable library code embedded in final binary
- Container base images - Distroless or scratch images with no package DB
1.2 Solution: Binary-First Vulnerability Detection
BinaryIndex provides three tiers of binary identification:
| Tier | Method | Precision | Coverage |
|---|---|---|---|
| A | Package/version range matching | Medium | High |
| B | Build-ID/hash catalog (exact binary identity) | High | Medium |
| C | Function fingerprints (CFG/basic-block hashes) | Very High | Targeted |
1.3 Module Scope
In Scope:
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
- Binary-to-advisory mapping database
- Fingerprint storage and matching engine
- Fix index for patch-aware backport handling
- Integration with Scanner.Worker for binary lookup
Out of Scope:
- Binary disassembly/analysis (provided by Scanner.Analyzers.Native)
- Runtime binary tracing (provided by Zastava)
- SBOM generation (provided by Scanner)
2. Architecture
2.1 System Context
┌──────────────────────────────────────────────────────────────────────────┐
│ External Systems │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Distro Repos │ │ Debug Symbol │ │ Upstream Source │ │
│ │ (Debian, RPM, │ │ Servers │ │ (GitHub, etc.) │ │
│ │ Alpine) │ │ (debuginfod) │ │ │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
└───────────│─────────────────────│─────────────────────│──────────────────┘
│ │ │
v v v
┌──────────────────────────────────────────────────────────────────────────┐
│ BinaryIndex Module │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Corpus Ingestion Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ DebianCorpus │ │ RpmCorpus │ │ AlpineCorpus │ │ │
│ │ │ Connector │ │ Connector │ │ Connector │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Processing Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ BinaryFeature│ │ FixIndex │ │ Fingerprint │ │ │
│ │ │ Extractor │ │ Builder │ │ Generator │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ PostgreSQL │ │ RustFS │ │ Valkey │ │ │
│ │ │ (binaries │ │ (fingerprint │ │ (lookup │ │ │
│ │ │ schema) │ │ blobs) │ │ cache) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Query Layer │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ IBinaryVulnerabilityService │ │ │
│ │ │ - LookupByIdentityAsync(identity) │ │ │
│ │ │ - LookupByFingerprintAsync(fingerprint) │ │ │
│ │ │ - LookupBatchAsync(identities) │ │ │
│ │ │ - GetFixStatusAsync(distro, release, sourcePkg, cve) │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
│
v
┌──────────────────────────────────────────────────────────────────────────┐
│ Consuming Modules │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Scanner.Worker │ │ Policy Engine │ │ Findings Ledger │ │
│ │ (binary lookup │ │ (evidence in │ │ (match records) │ │
│ │ during scan) │ │ proof chain) │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
2.2 Component Breakdown
2.2.1 Corpus Connectors
Plugin-based connectors that ingest binaries from distribution repositories.
public interface IBinaryCorpusConnector
{
string ConnectorId { get; }
string[] SupportedDistros { get; }
Task<CorpusSnapshot> FetchSnapshotAsync(CorpusQuery query, CancellationToken ct);
Task<IAsyncEnumerable<ExtractedBinary>> ExtractBinariesAsync(PackageReference pkg, CancellationToken ct);
}
Implementations:
- DebianBinaryCorpusConnector - Debian/Ubuntu packages + debuginfo
- RpmBinaryCorpusConnector - RHEL/Fedora/CentOS + SRPM
- AlpineBinaryCorpusConnector - Alpine APK + APKBUILD
2.2.2 Binary Feature Extractor
Extracts identity and features from binaries. Reuses existing Scanner.Analyzers.Native capabilities.
public interface IBinaryFeatureExtractor
{
Task<BinaryIdentity> ExtractIdentityAsync(Stream binaryStream, CancellationToken ct);
Task<BinaryFeatures> ExtractFeaturesAsync(Stream binaryStream, ExtractorOptions opts, CancellationToken ct);
}
public sealed record BinaryIdentity(
string Format, // elf, pe, macho
string? BuildId, // ELF GNU Build-ID
string? PeCodeViewGuid, // PE CodeView GUID + Age
string? MachoUuid, // Mach-O LC_UUID
string FileSha256,
string TextSectionSha256);
public sealed record BinaryFeatures(
BinaryIdentity Identity,
string[] DynamicDeps, // DT_NEEDED
string[] ExportedSymbols,
string[] ImportedSymbols,
BinaryHardening Hardening);
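As an illustration of the simplest identity field, the sketch below computes a value suitable for FileSha256 from a stream. The class name FileHasher and the lowercase-hex encoding are assumptions made for this example; the actual extractor may encode the digest differently. Build-ID, CodeView GUID, and LC_UUID extraction need format-specific parsing and are not shown.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the documented API.
public static class FileHasher
{
    // Hashes the stream from its current position to the end and
    // returns the SHA-256 digest as a lowercase hex string (assumed encoding).
    public static async Task<string> Sha256HexAsync(Stream binaryStream, CancellationToken ct = default)
    {
        using var sha = SHA256.Create();
        byte[] digest = await sha.ComputeHashAsync(binaryStream, ct);
        return Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

Because the hash covers the whole file, any change (including re-stripping or re-signing) yields a different FileSha256, which is why the record also carries TextSectionSha256 and format-specific identifiers.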
2.2.3 Fix Index Builder
Builds the patch-aware CVE fix index from distro sources.
public interface IFixIndexBuilder
{
Task BuildIndexAsync(DistroRelease distro, CancellationToken ct);
Task<FixRecord?> GetFixRecordAsync(string distro, string release, string sourcePkg, string cveId, CancellationToken ct);
}
public sealed record FixRecord(
string Distro,
string Release,
string SourcePkg,
string CveId,
FixState State, // fixed, vulnerable, not_affected, wontfix, unknown
string? FixedVersion, // Distro version string
FixMethod Method, // security_feed, changelog, patch_header, upstream_match
decimal Confidence, // 0.00-1.00
FixEvidence Evidence);
public enum FixState { Fixed, Vulnerable, NotAffected, Wontfix, Unknown }
public enum FixMethod { SecurityFeed, Changelog, PatchHeader, UpstreamPatchMatch }
2.2.4 Fingerprint Generator
Generates function-level fingerprints for vulnerable code detection.
public interface IVulnFingerprintGenerator
{
Task<ImmutableArray<VulnFingerprint>> GenerateAsync(
string cveId,
BinaryPair vulnAndFixed, // Reference builds
FingerprintOptions opts,
CancellationToken ct);
}
public sealed record VulnFingerprint(
string CveId,
string Component, // e.g., openssl
string Architecture, // x86-64, aarch64
FingerprintType Type, // basic_block, cfg, combined
string FingerprintId, // e.g., "bb-abc123..."
byte[] FingerprintHash, // 16-32 bytes
string? FunctionHint, // Function name if known
decimal Confidence,
FingerprintEvidence Evidence);
public enum FingerprintType { BasicBlock, ControlFlowGraph, StringReferences, Combined }
2.2.5 Semantic Analysis Library
Library: StellaOps.BinaryIndex.Semantic
Sprint: 20260105_001_001_BINDEX - Semantic Diffing Phase 1
The Semantic Analysis Library extends fingerprint generation with IR-level semantic matching, enabling detection of semantically equivalent code despite compiler optimizations, instruction reordering, and register allocation differences.
Key Insight: Traditional instruction-level fingerprinting loses roughly 15-20% accuracy on optimized binaries. Semantic analysis lifts code to B2R2's intermediate representation (LowUIR), extracts key-semantics graphs, and computes similarity from graph hashes.
2.2.5.1 Architecture
Binary Input
│
v
B2R2 Disassembly → Raw Instructions
│
v
IR Lifting Service → LowUIR Statements
│
v
Semantic Graph Extractor → Key-Semantics Graph (KSG)
│
v
Graph Fingerprinting → Semantic Fingerprint
│
v
Semantic Matcher → Similarity Score + Deltas
2.2.5.2 Core Components
IR Lifting Service (IIrLiftingService)
Lifts disassembled instructions to B2R2 LowUIR:
public interface IIrLiftingService
{
Task<LiftedFunction> LiftToIrAsync(
IReadOnlyList<DisassembledInstruction> instructions,
string functionName,
LiftOptions? options = null,
CancellationToken ct = default);
}
public sealed record LiftedFunction(
string Name,
ImmutableArray<IrStatement> Statements,
ImmutableArray<IrBasicBlock> BasicBlocks);
Semantic Graph Extractor (ISemanticGraphExtractor)
Extracts key-semantics graphs capturing data dependencies, control flow, and memory operations:
public interface ISemanticGraphExtractor
{
Task<KeySemanticsGraph> ExtractGraphAsync(
LiftedFunction function,
GraphExtractionOptions? options = null,
CancellationToken ct = default);
}
public sealed record KeySemanticsGraph(
string FunctionName,
ImmutableArray<SemanticNode> Nodes,
ImmutableArray<SemanticEdge> Edges,
GraphProperties Properties);
public enum SemanticNodeType { Compute, Load, Store, Branch, Call, Return, Phi }
public enum SemanticEdgeType { DataDependency, ControlDependency, MemoryDependency }
Semantic Fingerprint Generator (ISemanticFingerprintGenerator)
Generates semantic fingerprints using Weisfeiler-Lehman graph hashing:
public interface ISemanticFingerprintGenerator
{
Task<SemanticFingerprint> GenerateAsync(
KeySemanticsGraph graph,
SemanticFingerprintOptions? options = null,
CancellationToken ct = default);
}
public sealed record SemanticFingerprint(
string FunctionName,
string GraphHashHex, // WL graph hash (SHA-256)
string OperationHashHex, // Normalized operation sequence hash
string DataFlowHashHex, // Data dependency pattern hash
int NodeCount,
int EdgeCount,
int CyclomaticComplexity,
ImmutableArray<string> ApiCalls,
SemanticFingerprintAlgorithm Algorithm);
Semantic Matcher (ISemanticMatcher)
Computes semantic similarity with weighted components:
public interface ISemanticMatcher
{
Task<SemanticMatchResult> MatchAsync(
SemanticFingerprint a,
SemanticFingerprint b,
MatchOptions? options = null,
CancellationToken ct = default);
Task<SemanticMatchResult> MatchWithDeltasAsync(
SemanticFingerprint a,
SemanticFingerprint b,
MatchOptions? options = null,
CancellationToken ct = default);
}
public sealed record SemanticMatchResult(
decimal Similarity, // 0.00-1.00
decimal GraphSimilarity,
decimal OperationSimilarity,
decimal DataFlowSimilarity,
decimal ApiCallSimilarity,
MatchConfidence Confidence);
2.2.5.3 Algorithm Details
Weisfeiler-Lehman Graph Hashing:
- 3 iterations of label propagation
- SHA-256 for final hash computation
- Deterministic node ordering via canonical sort
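The three properties above can be sketched as follows. This is a minimal Weisfeiler-Lehman hash under stated assumptions, not the library's implementation: the class name WlGraphHash is hypothetical, node labels are plain strings, and edges are treated as undirected for neighbor aggregation (the real KeySemanticsGraph has typed, directed edges).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of WL graph hashing; names are illustrative only.
public static class WlGraphHash
{
    public static string Compute(
        IReadOnlyList<string> labels,
        IReadOnlyList<(int From, int To)> edges,
        int iterations = 3)
    {
        // Build adjacency lists (simplification: edges treated as undirected).
        var neighbors = new List<int>[labels.Count];
        for (int i = 0; i < labels.Count; i++) neighbors[i] = new List<int>();
        foreach (var (source, target) in edges)
        {
            neighbors[source].Add(target);
            neighbors[target].Add(source);
        }

        // Label propagation: each round replaces a node's label with a hash
        // of its own label plus the sorted labels of its neighbors.
        string[] current = labels.ToArray();
        for (int round = 0; round < iterations; round++)
        {
            var next = new string[current.Length];
            for (int i = 0; i < current.Length; i++)
            {
                var sorted = neighbors[i]
                    .Select(n => current[n])
                    .OrderBy(l => l, StringComparer.Ordinal);
                next[i] = Sha256Hex(current[i] + "|" + string.Join(",", sorted));
            }
            current = next;
        }

        // Canonical sort makes the result independent of node numbering.
        var canonical = current.OrderBy(l => l, StringComparer.Ordinal);
        return Sha256Hex(string.Join(";", canonical));
    }

    private static string Sha256Hex(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text))).ToLowerInvariant();
}
```

Sorting with StringComparer.Ordinal at both steps is what keeps the hash deterministic across cultures and across different node orderings of the same graph.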
Similarity Weights (Default):
| Component | Weight |
|---|---|
| Graph Hash | 0.35 |
| Operation Hash | 0.25 |
| Data Flow Hash | 0.25 |
| API Calls | 0.15 |
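One plausible reading of the table is a weighted sum of the four component similarities. The sketch below assumes exactly that, plus clamping and rounding to two decimals; the class name and the rounding mode are assumptions, and how each component similarity is derived from its hash is not covered here.

```csharp
using System;

// Hypothetical sketch: combines component similarities with the default weights.
public static class SemanticSimilarity
{
    public const decimal GraphWeight = 0.35m;
    public const decimal OperationWeight = 0.25m;
    public const decimal DataFlowWeight = 0.25m;
    public const decimal ApiCallWeight = 0.15m;

    // Each input is expected to be in the range 0.00-1.00.
    public static decimal Combine(decimal graph, decimal operation, decimal dataFlow, decimal apiCalls)
    {
        decimal score = GraphWeight * graph
                      + OperationWeight * operation
                      + DataFlowWeight * dataFlow
                      + ApiCallWeight * apiCalls;

        // Clamp defensively in case a caller passes an out-of-range component.
        return Math.Round(Math.Clamp(score, 0m, 1m), 2, MidpointRounding.AwayFromZero);
    }
}
```

For example, an exact graph-hash match with half-matching operation and data-flow hashes and no shared API calls gives 0.35 + 0.125 + 0.125 = 0.60.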
2.2.5.4 Integration Points
The semantic library integrates with existing BinaryIndex components:
DeltaSignatureGenerator Extension:
// Optional semantic services via constructor injection
services.AddDeltaSignaturesWithSemantic();
// Extended SymbolSignature with semantic properties
public sealed record SymbolSignature
{
// ... existing properties ...
public string? SemanticHashHex { get; init; }
public ImmutableArray<string> SemanticApiCalls { get; init; }
}
PatchDiffEngine Extension:
// SemanticWeight in HashWeights
public decimal SemanticWeight { get; init; } = 0.2m;
// FunctionFingerprint extended with semantic fingerprint
public SemanticFingerprint? SemanticFingerprint { get; init; }
2.2.5.5 Test Coverage
| Category | Tests | Coverage |
|---|---|---|
| Unit Tests (IR lifting, graph extraction, hashing) | 53 | Core algorithms |
| Integration Tests (full pipeline) | 9 | End-to-end flow |
| Golden Corpus (compiler variations) | 11 | Register allocation, optimization, compiler variants |
| Benchmarks (accuracy, performance) | 7 | Baseline metrics |
2.2.5.6 Current Baselines
Note: Baselines reflect the foundational implementation; accuracy is expected to improve as semantic features mature.
| Metric | Baseline | Target |
|---|---|---|
| Similarity (register allocation variants) | ≥0.55 | ≥0.85 |
| Overall accuracy | ≥40% | ≥70% |
| False positive rate | <10% | <5% |
| P95 fingerprint latency | <100ms | <50ms |
2.2.6 Binary Vulnerability Service
Main query interface for consumers.
public interface IBinaryVulnerabilityService
{
/// <summary>
/// Look up vulnerabilities by Build-ID or equivalent binary identity.
/// </summary>
Task<ImmutableArray<BinaryVulnMatch>> LookupByIdentityAsync(
BinaryIdentity identity,
LookupOptions? opts = null,
CancellationToken ct = default);
/// <summary>
/// Look up vulnerabilities by function fingerprint.
/// </summary>
Task<ImmutableArray<BinaryVulnMatch>> LookupByFingerprintAsync(
CodeFingerprint fingerprint,
decimal minSimilarity = 0.95m,
CancellationToken ct = default);
/// <summary>
/// Batch lookup for scan performance.
/// </summary>
Task<ImmutableDictionary<string, ImmutableArray<BinaryVulnMatch>>> LookupBatchAsync(
IEnumerable<BinaryIdentity> identities,
LookupOptions? opts = null,
CancellationToken ct = default);
/// <summary>
/// Get distro-specific fix status (patch-aware).
/// </summary>
Task<FixRecord?> GetFixStatusAsync(
string distro,
string release,
string sourcePkg,
string cveId,
CancellationToken ct = default);
}
public sealed record BinaryVulnMatch(
string CveId,
string VulnerablePurl,
MatchMethod Method, // buildid_catalog, fingerprint_match, range_match
decimal Confidence,
MatchEvidence Evidence);
public enum MatchMethod { BuildIdCatalog, FingerprintMatch, RangeMatch }
3. Data Model
3.1 PostgreSQL Schema (binaries)
The binaries schema stores binary identity, fingerprint, and match data.
CREATE SCHEMA IF NOT EXISTS binaries;
CREATE SCHEMA IF NOT EXISTS binaries_app;
-- RLS helper
CREATE OR REPLACE FUNCTION binaries_app.require_current_tenant()
RETURNS TEXT LANGUAGE plpgsql STABLE SECURITY DEFINER AS $$
DECLARE v_tenant TEXT;
BEGIN
v_tenant := current_setting('app.tenant_id', true);
IF v_tenant IS NULL OR v_tenant = '' THEN
RAISE EXCEPTION 'app.tenant_id session variable not set';
END IF;
RETURN v_tenant;
END;
$$;
3.1.1 Core Tables
See docs/db/schemas/binaries_schema_specification.md for complete DDL.
Key Tables:
| Table | Purpose |
|---|---|
| binaries.binary_identity | Known binary identities (Build-ID, hashes) |
| binaries.binary_package_map | Binary → package mapping per snapshot |
| binaries.vulnerable_buildids | Build-IDs known to be vulnerable |
| binaries.vulnerable_fingerprints | Function fingerprints for CVEs |
| binaries.cve_fix_index | Patch-aware fix status per distro |
| binaries.fingerprint_matches | Match results (findings evidence) |
| binaries.corpus_snapshots | Corpus ingestion tracking |
3.2 RustFS Layout
rustfs://stellaops/binaryindex/
fingerprints/<algorithm>/<prefix>/<fingerprint_id>.bin
corpus/<distro>/<release>/<snapshot_id>/manifest.json
corpus/<distro>/<release>/<snapshot_id>/packages/<pkg>.metadata.json
evidence/<match_id>.dsse.json
4. Integration Points
4.1 Scanner.Worker Integration
During container scanning, Scanner.Worker queries BinaryIndex for each extracted binary:
sequenceDiagram
participant SW as Scanner.Worker
participant BI as BinaryIndex
participant PG as PostgreSQL
participant FL as Findings Ledger
SW->>SW: Extract binary from layer
SW->>SW: Compute BinaryIdentity
SW->>BI: LookupByIdentityAsync(identity)
BI->>PG: Query binaries.vulnerable_buildids
PG-->>BI: Matches
BI->>PG: Query binaries.cve_fix_index (if distro known)
PG-->>BI: Fix status
BI-->>SW: BinaryVulnMatch[]
SW->>FL: RecordFinding(match, evidence)
4.2 Concelier Integration
BinaryIndex subscribes to Concelier's advisory updates:
sequenceDiagram
participant CO as Concelier
participant BI as BinaryIndex
participant PG as PostgreSQL
CO->>CO: Ingest new advisory
CO->>BI: advisory.created event
BI->>BI: Check if affected packages in corpus
BI->>PG: Update binaries.binary_vuln_assertion
BI->>BI: Queue fingerprint generation (if high-impact)
4.3 Policy Integration
Binary matches are recorded as proof segments:
{
"segment_type": "binary_fingerprint_evidence",
"payload": {
"binary_identity": {
"format": "elf",
"build_id": "abc123...",
"file_sha256": "def456..."
},
"matches": [
{
"cve_id": "CVE-2024-1234",
"method": "buildid_catalog",
"confidence": 0.98,
"vulnerable_purl": "pkg:deb/debian/libssl3@1.1.1n-0+deb11u3"
}
]
}
}
5. MVP Roadmap
MVP 1: Known-Build Binary Catalog (Sprint 6000.0001)
Goal: Query "is this Build-ID vulnerable?" with distro-level precision.
Deliverables:
- binaries PostgreSQL schema
- Build-ID to package mapping tables
- Basic CVE lookup by binary identity
- Debian/Ubuntu corpus connector
MVP 2: Patch-Aware Backport Handling (Sprint 6000.0002)
Goal: Handle "version says vulnerable but distro backported the fix."
Deliverables:
- Fix index builder (changelog + patch header parsing)
- Distro-specific version comparison
- RPM corpus connector
- Scanner.Worker integration
MVP 3: Binary Fingerprint Factory (Sprint 6000.0003)
Goal: Detect vulnerable code independent of package metadata.
Deliverables:
- Fingerprint storage and matching
- Reference build generation pipeline
- Fingerprint validation corpus
- High-impact CVE coverage (OpenSSL, glibc, zlib, curl)
MVP 4: Full Scanner Integration (Sprint 6000.0004)
Goal: Binary evidence in production scans.
Deliverables:
- Scanner.Worker binary lookup integration
- Findings Ledger binary match records
- Proof segment attestations
- CLI binary match inspection
5b. Fix Evidence Chain
The Fix Evidence Chain provides auditable proof of why a CVE is marked as fixed (or not) for a specific distro/package combination. This is critical for patch-aware backport handling where package versions can be misleading.
5b.1 Evidence Sources
| Source | Confidence | Description |
|---|---|---|
| Security Feed (OVAL) | 0.95-0.99 | Authoritative feed from distro (Debian Security Tracker, Red Hat OVAL) |
| Patch Header (DEP-3) | 0.87-0.95 | CVE reference in Debian/Ubuntu patch metadata |
| Changelog | 0.75-0.85 | CVE mention in debian/changelog or RPM %changelog |
| Upstream Patch Match | 0.90 | Binary diff matches known upstream fix |
5b.2 Evidence Storage
Evidence is stored in two PostgreSQL tables:
-- Evidence blobs: audit trail (created first because cve_fix_index references it)
CREATE TABLE binaries.fix_evidence (
id UUID PRIMARY KEY,
tenant_id TEXT NOT NULL,
evidence_type TEXT NOT NULL, -- changelog, patch_header, security_feed
source_file TEXT, -- Path to source file (changelog, patch)
source_sha256 TEXT, -- Hash of source file
excerpt TEXT, -- Relevant snippet (max 1KB)
metadata JSONB NOT NULL, -- Structured metadata
snapshot_id UUID,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Fix index: one row per (tenant_id, distro, release, source_pkg, cve_id)
CREATE TABLE binaries.cve_fix_index (
id UUID PRIMARY KEY,
tenant_id TEXT NOT NULL,
distro TEXT NOT NULL, -- debian, ubuntu, alpine, rhel
release TEXT NOT NULL, -- bookworm, jammy, v3.19
source_pkg TEXT NOT NULL,
cve_id TEXT NOT NULL,
state TEXT NOT NULL, -- fixed, vulnerable, not_affected, wontfix, unknown
fixed_version TEXT,
method TEXT NOT NULL, -- security_feed, changelog, patch_header, upstream_match
confidence DECIMAL(3,2) NOT NULL,
evidence_id UUID REFERENCES binaries.fix_evidence(id),
snapshot_id UUID,
indexed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (tenant_id, distro, release, source_pkg, cve_id)
);
5b.3 Evidence Types
ChangelogEvidence:
{
"evidence_type": "changelog",
"source_file": "debian/changelog",
"excerpt": "* Fix CVE-2024-0727: PKCS12 decoding crash",
"metadata": {
"version": "3.0.11-1~deb12u2",
"line_number": 5
}
}
PatchHeaderEvidence:
{
"evidence_type": "patch_header",
"source_file": "debian/patches/CVE-2024-0727.patch",
"excerpt": "CVE: CVE-2024-0727\nOrigin: upstream, https://github.com/openssl/commit/abc123",
"metadata": {
"patch_sha256": "abc123def456..."
}
}
SecurityFeedEvidence:
{
"evidence_type": "security_feed",
"metadata": {
"feed_id": "debian-security-tracker",
"entry_id": "DSA-5678-1",
"published_at": "2024-01-15T10:00:00Z"
}
}
5b.4 Confidence Resolution
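When multiple evidence sources exist for the same CVE, the system keeps the highest-confidence entry. In the upsert clause below, EXCLUDED is PostgreSQL's name for the row proposed for insertion, and existing is assumed to be the alias of the target table (INSERT INTO binaries.cve_fix_index AS existing ...):
ON CONFLICT (tenant_id, distro, release, source_pkg, cve_id)
DO UPDATE SET
confidence = GREATEST(existing.confidence, EXCLUDED.confidence),
method = CASE
WHEN existing.confidence < EXCLUDED.confidence THEN EXCLUDED.method
ELSE existing.method
END,
evidence_id = CASE
WHEN existing.confidence < EXCLUDED.confidence THEN EXCLUDED.evidence_id
ELSE existing.evidence_id
END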
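The same rule can be expressed in application code, which may be useful for unit-testing resolution without a database. This is a sketch only: FixCandidate and ConfidenceResolution are hypothetical names, and the record carries just the three columns the upsert touches. As in the SQL, the comparison is strict, so a tie keeps the existing entry.

```csharp
using System;

// Hypothetical stand-in for the columns updated by the upsert.
public sealed record FixCandidate(string Method, decimal Confidence, Guid EvidenceId);

public static class ConfidenceResolution
{
    // Returns the incoming entry only when it is strictly more confident;
    // otherwise the existing entry (and its evidence) is retained.
    public static FixCandidate Resolve(FixCandidate existing, FixCandidate incoming) =>
        existing.Confidence < incoming.Confidence ? incoming : existing;
}
```

For example, a changelog hit at 0.80 followed by a patch-header hit at 0.87 resolves to the patch-header entry, while a second 0.80 hit leaves the original evidence in place.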
5b.5 Parsers
The following parsers extract CVE fix information:
| Parser | Distros | Input | Confidence |
|---|---|---|---|
| DebianChangelogParser | Debian, Ubuntu | debian/changelog | 0.80 |
| PatchHeaderParser | Debian, Ubuntu | debian/patches/*.patch (DEP-3) | 0.87 |
| AlpineSecfixesParser | Alpine | APKBUILD secfixes block | 0.95 |
| RpmChangelogParser | RHEL, Fedora, CentOS | RPM spec %changelog | 0.75 |
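To make the changelog case concrete, the sketch below scans a debian/changelog text for CVE identifiers and attributes each to the version of the entry it appears in. It is an illustration under assumptions, not the DebianChangelogParser itself: the class name, the simplified header pattern, and the returned tuple shape are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch of changelog scanning; not the real parser.
public static class ChangelogCveScanner
{
    // Entry header, e.g. "openssl (3.0.11-1~deb12u2) bookworm-security; urgency=medium".
    private static readonly Regex Header =
        new(@"^(?<pkg>[a-z0-9][a-z0-9+.\-]*) \((?<ver>[^)]+)\) ", RegexOptions.Compiled);

    private static readonly Regex Cve =
        new(@"CVE-\d{4}-\d{4,}", RegexOptions.Compiled);

    public static List<(string Version, string CveId, int LineNumber)> Scan(string changelog)
    {
        var hits = new List<(string Version, string CveId, int LineNumber)>();
        string version = "";
        string[] lines = changelog.Split('\n');

        for (int i = 0; i < lines.Length; i++)
        {
            Match header = Header.Match(lines[i]);
            if (header.Success)
            {
                // A new entry starts; later CVE mentions belong to this version.
                version = header.Groups["ver"].Value;
                continue;
            }

            if (version.Length == 0) continue; // ignore text before the first entry

            foreach (Match m in Cve.Matches(lines[i]))
                hits.Add((version, m.Value, i + 1)); // 1-based line number
        }

        return hits;
    }
}
```

A mention in a changelog is weaker evidence than a security feed entry (the text may say "not affected by CVE-..."), which is consistent with the lower confidence assigned to changelog parsers in the table above.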
5b.6 Query Flow
sequenceDiagram
participant SW as Scanner.Worker
participant BVS as BinaryVulnerabilityService
participant FIR as FixIndexRepository
participant PG as PostgreSQL
SW->>BVS: GetFixStatusAsync(debian, bookworm, openssl, CVE-2024-0727)
BVS->>FIR: GetFixStatusAsync(...)
FIR->>PG: SELECT FROM cve_fix_index WHERE ...
PG-->>FIR: FixIndexEntry (state=fixed, confidence=0.87)
FIR-->>BVS: FixStatusResult
BVS-->>SW: {state: Fixed, confidence: 0.87, method: PatchHeader}
6. Security Considerations
6.1 Trust Boundaries
- Corpus Ingestion - Packages are untrusted; extraction runs in sandboxed workers
- Fingerprint Generation - Reference builds compiled in isolated environments
- Query API - Tenant-isolated via RLS; no cross-tenant data leakage
6.2 Signing & Provenance
- All corpus snapshots are signed (DSSE)
- Fingerprint sets are versioned and signed
- Every match result references evidence digests
6.3 Sandbox Requirements
Binary extraction and fingerprint generation MUST run with:
- Seccomp profile restricting syscalls
- Read-only root filesystem
- No network access during analysis
- Memory/CPU limits
7. Observability
7.1 Metrics
| Metric | Type | Labels |
|---|---|---|
| binaryindex_lookup_total | Counter | method, result |
| binaryindex_lookup_latency_ms | Histogram | method |
| binaryindex_corpus_packages_total | Gauge | distro, release |
| binaryindex_fingerprints_indexed | Gauge | algorithm, component |
| binaryindex_match_confidence | Histogram | method |
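As a sketch of how the two lookup metrics might be emitted, the example below uses .NET's System.Diagnostics.Metrics API. Whether BinaryIndex uses this API is an assumption, as are the meter name "StellaOps.BinaryIndex", the class name, and the RecordLookup helper; only the instrument names and labels come from the table.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical metrics holder; instrument names follow the table above.
public static class BinaryIndexMetrics
{
    // Meter name is an assumption for this sketch.
    private static readonly Meter MeterInstance = new("StellaOps.BinaryIndex");

    public static readonly Counter<long> LookupTotal =
        MeterInstance.CreateCounter<long>("binaryindex_lookup_total");

    public static readonly Histogram<double> LookupLatencyMs =
        MeterInstance.CreateHistogram<double>("binaryindex_lookup_latency_ms", unit: "ms");

    // Records one lookup with the labels listed for each metric.
    public static void RecordLookup(string method, string result, double elapsedMs)
    {
        LookupTotal.Add(1,
            new KeyValuePair<string, object>("method", method),
            new KeyValuePair<string, object>("result", result));

        LookupLatencyMs.Record(elapsedMs,
            new KeyValuePair<string, object>("method", method));
    }
}
```

A caller would typically wrap a lookup in a Stopwatch and invoke RecordLookup("buildid_catalog", "hit", elapsed) once the call returns.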
7.2 Traces
- binaryindex.lookup - Full lookup span
- binaryindex.corpus.ingest - Corpus ingestion
- binaryindex.fingerprint.generate - Fingerprint generation
7.3 Ops Endpoints
BinaryIndex exposes read-only ops endpoints for health, bench, cache, and effective configuration:
- GET /api/v1/ops/binaryindex/health -> BinaryIndexOpsHealthResponse
- POST /api/v1/ops/binaryindex/bench/run -> BinaryIndexBenchResponse
- GET /api/v1/ops/binaryindex/cache -> BinaryIndexFunctionCacheStats
- GET /api/v1/ops/binaryindex/config -> BinaryIndexEffectiveConfig
8. Configuration
# binaryindex.yaml
binaryindex:
enabled: true
corpus:
connectors:
- type: debian
enabled: true
mirror: http://deb.debian.org/debian
releases: [bookworm, bullseye]
architectures: [amd64, arm64]
- type: ubuntu
enabled: true
mirror: http://archive.ubuntu.com/ubuntu
releases: [jammy, noble]
fingerprinting:
enabled: true
algorithms: [basic_block, cfg]
target_components:
- openssl
- glibc
- zlib
- curl
- sqlite
min_function_size: 16 # bytes
max_functions_per_binary: 10000
lookup:
cache_ttl: 3600
batch_size: 100
timeout_ms: 5000
storage:
postgres_schema: binaries
rustfs_bucket: stellaops/binaryindex
Additional appsettings sections (case-insensitive):
- BinaryIndex:B2R2Pool - lifter pool sizing and warm ISA list.
- BinaryIndex:SemanticLifting - LowUIR enablement and deterministic controls.
- BinaryIndex:FunctionCache - Valkey function cache configuration.
- Postgres:BinaryIndex - persistence for canonical IR fingerprints.
9. Testing Strategy
9.1 Unit Tests
- Identity extraction (Build-ID, hashes)
- Fingerprint generation determinism
- Fix index parsing (changelog, patch headers)
9.2 Integration Tests
- PostgreSQL schema validation
- Full corpus ingestion flow
- Scanner.Worker lookup integration
9.3 Regression Tests
- Known CVE detection (golden corpus)
- Backport handling (Debian libssl example)
- False positive rate validation
10. References
- Advisory: docs/product/advisories/21-Dec-2025 - Mapping Evidence Within Compiled Binaries.md
- Scanner Native Analysis: src/Scanner/StellaOps.Scanner.Analyzers.Native/
- Existing Fingerprinting: src/Scanner/__Libraries/StellaOps.Scanner.EntryTrace/Binary/
- Build-ID Index: src/Scanner/StellaOps.Scanner.Analyzers.Native/Index/
- Semantic Diffing Sprint: docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md
- Semantic Library: src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/
- Semantic Tests: src/BinaryIndex/__Tests/StellaOps.BinaryIndex.Semantic.Tests/
Document Version: 1.1.1
Last Updated: 2026-01-14