Files
git.stella-ops.org/docs/modules/binaryindex/architecture.md
master 53503cb407 Add reference architecture and testing strategy documentation
- Created a new document for the Stella Ops Reference Architecture outlining the system's topology, trust boundaries, artifact association, and interfaces.
- Developed a comprehensive Testing Strategy document detailing the importance of offline readiness, interoperability, determinism, and operational guardrails.
- Introduced a README for the Testing Strategy, summarizing processing details and key concepts implemented.
- Added guidance for AI agents and developers in the tests directory, including directory structure, test categories, key patterns, and rules for test development.
2025-12-22 07:59:30 +02:00

559 lines
22 KiB
Markdown

# BinaryIndex Module Architecture
> **Ownership:** Scanner Guild + Concelier Guild
> **Status:** DRAFT
> **Version:** 1.0.0
> **Related:** [High-Level Architecture](../../07_HIGH_LEVEL_ARCHITECTURE.md), [Scanner Architecture](../scanner/architecture.md), [Concelier Architecture](../concelier/architecture.md)
---
## 1. Overview
The **BinaryIndex** module provides a vulnerable binaries database that enables detection of vulnerable code at the binary level, independent of package metadata. This addresses a critical gap in vulnerability scanning: package version strings can lie (backports, custom builds, stripped metadata), but **binary identity doesn't lie**.
### 1.1 Problem Statement
Traditional vulnerability scanners rely on package version matching, which fails in several scenarios:
1. **Backported patches** - Distros backport security fixes without changing upstream version
2. **Custom/vendored builds** - Binaries compiled from source without package metadata
3. **Stripped binaries** - Debug info and version strings removed
4. **Static linking** - Vulnerable library code embedded in final binary
5. **Container base images** - Distroless or scratch images with no package DB
### 1.2 Solution: Binary-First Vulnerability Detection
BinaryIndex provides three tiers of binary identification:
| Tier | Method | Precision | Coverage |
|------|--------|-----------|----------|
| A | Package/version range matching | Medium | High |
| B | Build-ID/hash catalog (exact binary identity) | High | Medium |
| C | Function fingerprints (CFG/basic-block hashes) | Very High | Targeted |
### 1.3 Module Scope
**In Scope:**
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
- Binary-to-advisory mapping database
- Fingerprint storage and matching engine
- Fix index for patch-aware backport handling
- Integration with Scanner.Worker for binary lookup
**Out of Scope:**
- Binary disassembly/analysis (provided by Scanner.Analyzers.Native)
- Runtime binary tracing (provided by Zastava)
- SBOM generation (provided by Scanner)
---
## 2. Architecture
### 2.1 System Context
```
┌──────────────────────────────────────────────────────────────────────────┐
│ External Systems │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Distro Repos │ │ Debug Symbol │ │ Upstream Source │ │
│ │ (Debian, RPM, │ │ Servers │ │ (GitHub, etc.) │ │
│ │ Alpine) │ │ (debuginfod) │ │ │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
└───────────│─────────────────────│─────────────────────│──────────────────┘
│ │ │
v v v
┌──────────────────────────────────────────────────────────────────────────┐
│ BinaryIndex Module │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Corpus Ingestion Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ DebianCorpus │ │ RpmCorpus │ │ AlpineCorpus │ │ │
│ │ │ Connector │ │ Connector │ │ Connector │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Processing Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ BinaryFeature│ │ FixIndex │ │ Fingerprint │ │ │
│ │ │ Extractor │ │ Builder │ │ Generator │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ PostgreSQL │ │ RustFS │ │ Valkey │ │ │
│ │ │ (binaries │ │ (fingerprint │ │ (lookup │ │ │
│ │ │ schema) │ │ blobs) │ │ cache) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Query Layer │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ IBinaryVulnerabilityService │ │ │
│ │ │ - LookupByBuildIdAsync(buildId) │ │ │
│ │ │ - LookupByFingerprintAsync(fingerprint) │ │ │
│ │ │ - LookupBatchAsync(identities) │ │ │
│ │ │ - GetFixStatusAsync(distro, release, sourcePkg, cve) │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
v
┌──────────────────────────────────────────────────────────────────────────┐
│ Consuming Modules │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Scanner.Worker │ │ Policy Engine │ │ Findings Ledger │ │
│ │ (binary lookup │ │ (evidence in │ │ (match records) │ │
│ │ during scan) │ │ proof chain) │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
```
### 2.2 Component Breakdown
#### 2.2.1 Corpus Connectors
Plugin-based connectors that ingest binaries from distribution repositories.
```csharp
public interface IBinaryCorpusConnector
{
string ConnectorId { get; }
string[] SupportedDistros { get; }
Task<CorpusSnapshot> FetchSnapshotAsync(CorpusQuery query, CancellationToken ct);
Task<IAsyncEnumerable<ExtractedBinary>> ExtractBinariesAsync(PackageReference pkg, CancellationToken ct);
}
```
**Implementations:**
- `DebianBinaryCorpusConnector` - Debian/Ubuntu packages + debuginfo
- `RpmBinaryCorpusConnector` - RHEL/Fedora/CentOS + SRPM
- `AlpineBinaryCorpusConnector` - Alpine APK + APKBUILD
#### 2.2.2 Binary Feature Extractor
Extracts identity and features from binaries. Reuses existing Scanner.Analyzers.Native capabilities.
```csharp
public interface IBinaryFeatureExtractor
{
Task<BinaryIdentity> ExtractIdentityAsync(Stream binaryStream, CancellationToken ct);
Task<BinaryFeatures> ExtractFeaturesAsync(Stream binaryStream, ExtractorOptions opts, CancellationToken ct);
}
public sealed record BinaryIdentity(
string Format, // elf, pe, macho
string? BuildId, // ELF GNU Build-ID
string? PeCodeViewGuid, // PE CodeView GUID + Age
string? MachoUuid, // Mach-O LC_UUID
string FileSha256,
string TextSectionSha256);
public sealed record BinaryFeatures(
BinaryIdentity Identity,
string[] DynamicDeps, // DT_NEEDED
string[] ExportedSymbols,
string[] ImportedSymbols,
BinaryHardening Hardening);
```
#### 2.2.3 Fix Index Builder
Builds the patch-aware CVE fix index from distro sources.
```csharp
public interface IFixIndexBuilder
{
Task BuildIndexAsync(DistroRelease distro, CancellationToken ct);
Task<FixRecord?> GetFixRecordAsync(string distro, string release, string sourcePkg, string cveId, CancellationToken ct);
}
public sealed record FixRecord(
string Distro,
string Release,
string SourcePkg,
string CveId,
FixState State, // fixed, vulnerable, not_affected, wontfix, unknown
string? FixedVersion, // Distro version string
FixMethod Method, // security_feed, changelog, patch_header
decimal Confidence, // 0.00-1.00
FixEvidence Evidence);
public enum FixState { Fixed, Vulnerable, NotAffected, Wontfix, Unknown }
public enum FixMethod { SecurityFeed, Changelog, PatchHeader, UpstreamPatchMatch }
```
#### 2.2.4 Fingerprint Generator
Generates function-level fingerprints for vulnerable code detection.
```csharp
public interface IVulnFingerprintGenerator
{
Task<ImmutableArray<VulnFingerprint>> GenerateAsync(
string cveId,
BinaryPair vulnAndFixed, // Reference builds
FingerprintOptions opts,
CancellationToken ct);
}
public sealed record VulnFingerprint(
string CveId,
string Component, // e.g., openssl
string Architecture, // x86-64, aarch64
FingerprintType Type, // basic_block, cfg, combined
string FingerprintId, // e.g., "bb-abc123..."
byte[] FingerprintHash, // 16-32 bytes
string? FunctionHint, // Function name if known
decimal Confidence,
FingerprintEvidence Evidence);
public enum FingerprintType { BasicBlock, ControlFlowGraph, StringReferences, Combined }
```
#### 2.2.5 Binary Vulnerability Service
Main query interface for consumers.
```csharp
public interface IBinaryVulnerabilityService
{
/// <summary>
/// Look up vulnerabilities by Build-ID or equivalent binary identity.
/// </summary>
Task<ImmutableArray<BinaryVulnMatch>> LookupByIdentityAsync(
BinaryIdentity identity,
LookupOptions? opts = null,
CancellationToken ct = default);
/// <summary>
/// Look up vulnerabilities by function fingerprint.
/// </summary>
Task<ImmutableArray<BinaryVulnMatch>> LookupByFingerprintAsync(
CodeFingerprint fingerprint,
decimal minSimilarity = 0.95m,
CancellationToken ct = default);
/// <summary>
/// Batch lookup for scan performance.
/// </summary>
Task<ImmutableDictionary<string, ImmutableArray<BinaryVulnMatch>>> LookupBatchAsync(
IEnumerable<BinaryIdentity> identities,
LookupOptions? opts = null,
CancellationToken ct = default);
/// <summary>
/// Get distro-specific fix status (patch-aware).
/// </summary>
Task<FixRecord?> GetFixStatusAsync(
string distro,
string release,
string sourcePkg,
string cveId,
CancellationToken ct = default);
}
public sealed record BinaryVulnMatch(
string CveId,
string VulnerablePurl,
MatchMethod Method, // buildid_catalog, fingerprint_match, range_match
decimal Confidence,
MatchEvidence Evidence);
public enum MatchMethod { BuildIdCatalog, FingerprintMatch, RangeMatch }
```
---
## 3. Data Model
### 3.1 PostgreSQL Schema (`binaries`)
The `binaries` schema stores binary identity, fingerprint, and match data.
```sql
CREATE SCHEMA IF NOT EXISTS binaries;
CREATE SCHEMA IF NOT EXISTS binaries_app;
-- RLS helper
CREATE OR REPLACE FUNCTION binaries_app.require_current_tenant()
RETURNS TEXT LANGUAGE plpgsql STABLE SECURITY DEFINER AS $$
DECLARE v_tenant TEXT;
BEGIN
v_tenant := current_setting('app.tenant_id', true);
IF v_tenant IS NULL OR v_tenant = '' THEN
RAISE EXCEPTION 'app.tenant_id session variable not set';
END IF;
RETURN v_tenant;
END;
$$;
```
#### 3.1.1 Core Tables
See `docs/db/schemas/binaries_schema_specification.md` for complete DDL.
**Key Tables:**
| Table | Purpose |
|-------|---------|
| `binaries.binary_identity` | Known binary identities (Build-ID, hashes) |
| `binaries.binary_package_map` | Binary → package mapping per snapshot |
| `binaries.vulnerable_buildids` | Build-IDs known to be vulnerable |
| `binaries.vulnerable_fingerprints` | Function fingerprints for CVEs |
| `binaries.cve_fix_index` | Patch-aware fix status per distro |
| `binaries.fingerprint_matches` | Match results (findings evidence) |
| `binaries.corpus_snapshots` | Corpus ingestion tracking |
### 3.2 RustFS Layout
```
rustfs://stellaops/binaryindex/
fingerprints/<algorithm>/<prefix>/<fingerprint_id>.bin
corpus/<distro>/<release>/<snapshot_id>/manifest.json
corpus/<distro>/<release>/<snapshot_id>/packages/<pkg>.metadata.json
evidence/<match_id>.dsse.json
```
---
## 4. Integration Points
### 4.1 Scanner.Worker Integration
During container scanning, Scanner.Worker queries BinaryIndex for each extracted binary:
```mermaid
sequenceDiagram
participant SW as Scanner.Worker
participant BI as BinaryIndex
participant PG as PostgreSQL
participant FL as Findings Ledger
SW->>SW: Extract binary from layer
SW->>SW: Compute BinaryIdentity
SW->>BI: LookupByIdentityAsync(identity)
BI->>PG: Query binaries.vulnerable_buildids
PG-->>BI: Matches
BI->>PG: Query binaries.cve_fix_index (if distro known)
PG-->>BI: Fix status
BI-->>SW: BinaryVulnMatch[]
SW->>FL: RecordFinding(match, evidence)
```
### 4.2 Concelier Integration
BinaryIndex subscribes to Concelier's advisory updates:
```mermaid
sequenceDiagram
participant CO as Concelier
participant BI as BinaryIndex
participant PG as PostgreSQL
CO->>CO: Ingest new advisory
CO->>BI: advisory.created event
BI->>BI: Check if affected packages in corpus
BI->>PG: Update binaries.binary_vuln_assertion
BI->>BI: Queue fingerprint generation (if high-impact)
```
### 4.3 Policy Integration
Binary matches are recorded as proof segments:
```json
{
"segment_type": "binary_fingerprint_evidence",
"payload": {
"binary_identity": {
"format": "elf",
"build_id": "abc123...",
"file_sha256": "def456..."
},
"matches": [
{
"cve_id": "CVE-2024-1234",
"method": "buildid_catalog",
"confidence": 0.98,
"vulnerable_purl": "pkg:deb/debian/libssl3@1.1.1n-0+deb11u3"
}
]
}
}
```
---
## 5. MVP Roadmap
### MVP 1: Known-Build Binary Catalog (Sprint 6000.0001)
**Goal:** Query "is this Build-ID vulnerable?" with distro-level precision.
**Deliverables:**
- `binaries` PostgreSQL schema
- Build-ID to package mapping tables
- Basic CVE lookup by binary identity
- Debian/Ubuntu corpus connector
### MVP 2: Patch-Aware Backport Handling (Sprint 6000.0002)
**Goal:** Handle "version says vulnerable but distro backported the fix."
**Deliverables:**
- Fix index builder (changelog + patch header parsing)
- Distro-specific version comparison
- RPM corpus connector
- Scanner.Worker integration
### MVP 3: Binary Fingerprint Factory (Sprint 6000.0003)
**Goal:** Detect vulnerable code independent of package metadata.
**Deliverables:**
- Fingerprint storage and matching
- Reference build generation pipeline
- Fingerprint validation corpus
- High-impact CVE coverage (OpenSSL, glibc, zlib, curl)
### MVP 4: Full Scanner Integration (Sprint 6000.0004)
**Goal:** Binary evidence in production scans.
**Deliverables:**
- Scanner.Worker binary lookup integration
- Findings Ledger binary match records
- Proof segment attestations
- CLI binary match inspection
---
## 6. Security Considerations
### 6.1 Trust Boundaries
1. **Corpus Ingestion** - Packages are untrusted; extraction runs in sandboxed workers
2. **Fingerprint Generation** - Reference builds compiled in isolated environments
3. **Query API** - Tenant-isolated via RLS; no cross-tenant data leakage
### 6.2 Signing & Provenance
- All corpus snapshots are signed (DSSE)
- Fingerprint sets are versioned and signed
- Every match result references evidence digests
### 6.3 Sandbox Requirements
Binary extraction and fingerprint generation MUST run with:
- Seccomp profile restricting syscalls
- Read-only root filesystem
- No network access during analysis
- Memory/CPU limits
---
## 7. Observability
### 7.1 Metrics
| Metric | Type | Labels |
|--------|------|--------|
| `binaryindex_lookup_total` | Counter | method, result |
| `binaryindex_lookup_latency_ms` | Histogram | method |
| `binaryindex_corpus_packages_total` | Gauge | distro, release |
| `binaryindex_fingerprints_indexed` | Gauge | algorithm, component |
| `binaryindex_match_confidence` | Histogram | method |
### 7.2 Traces
- `binaryindex.lookup` - Full lookup span
- `binaryindex.corpus.ingest` - Corpus ingestion
- `binaryindex.fingerprint.generate` - Fingerprint generation
---
## 8. Configuration
```yaml
# binaryindex.yaml
binaryindex:
enabled: true
corpus:
connectors:
- type: debian
enabled: true
mirror: http://deb.debian.org/debian
releases: [bookworm, bullseye]
architectures: [amd64, arm64]
- type: ubuntu
enabled: true
mirror: http://archive.ubuntu.com/ubuntu
releases: [jammy, noble]
fingerprinting:
enabled: true
algorithms: [basic_block, cfg]
target_components:
- openssl
- glibc
- zlib
- curl
- sqlite
min_function_size: 16 # bytes
max_functions_per_binary: 10000
lookup:
cache_ttl: 3600
batch_size: 100
timeout_ms: 5000
storage:
postgres_schema: binaries
rustfs_bucket: stellaops/binaryindex
```
---
## 9. Testing Strategy
### 9.1 Unit Tests
- Identity extraction (Build-ID, hashes)
- Fingerprint generation determinism
- Fix index parsing (changelog, patch headers)
### 9.2 Integration Tests
- PostgreSQL schema validation
- Full corpus ingestion flow
- Scanner.Worker lookup integration
### 9.3 Regression Tests
- Known CVE detection (golden corpus)
- Backport handling (Debian libssl example)
- False positive rate validation
---
## 10. References
- Advisory: `docs/product-advisories/21-Dec-2025 - Mapping Evidence Within Compiled Binaries.md`
- Scanner Native Analysis: `src/Scanner/StellaOps.Scanner.Analyzers.Native/`
- Existing Fingerprinting: `src/Scanner/__Libraries/StellaOps.Scanner.EntryTrace/Binary/`
- Build-ID Index: `src/Scanner/StellaOps.Scanner.Analyzers.Native/Index/`
---
*Document Version: 1.0.0*
*Last Updated: 2025-12-21*