docs consolidation, big sln build fixes, new advisories and sprints/tasks
94
docs/modules/binary-index/README.md
Normal file
@@ -0,0 +1,94 @@
|
||||
# BinaryIndex
|
||||
|
||||
**Status:** Implemented
|
||||
**Source:** `src/BinaryIndex/`
|
||||
**Owner:** Scanner Guild + Concelier Guild
|
||||
|
||||
## Purpose
|
||||
|
||||
BinaryIndex detects vulnerable binaries independently of package metadata. Package version strings can mislead (backports, custom builds, stripped metadata), so BinaryIndex identifies vulnerabilities binary-first, using Build-IDs, hash catalogs, and function fingerprints.
|
||||
|
||||
## Components
|
||||
|
||||
**Libraries:**
|
||||
- `StellaOps.BinaryIndex.Core` - Core binary identity extraction and matching engine
|
||||
- `StellaOps.BinaryIndex.Corpus` - Binary-to-advisory mapping database
|
||||
- `StellaOps.BinaryIndex.Corpus.Debian` - Debian-specific corpus support
|
||||
- `StellaOps.BinaryIndex.Fingerprints` - Function fingerprint storage and matching (CFG/basic-block hashes)
|
||||
- `StellaOps.BinaryIndex.FixIndex` - Patch-aware backport handling
|
||||
- `StellaOps.BinaryIndex.Persistence` - Storage adapters for binary catalogs
|
||||
|
||||
## Configuration
|
||||
|
||||
Configuration is typically embedded in Scanner and Concelier module settings.
|
||||
|
||||
Key features:
|
||||
- Three-tier binary identification (package/version, Build-ID/hash, function fingerprints)
|
||||
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
|
||||
- Integration with Scanner.Worker for binary lookup
|
||||
- Offline-first design with deterministic outputs
|
||||
|
||||
## Dependencies
|
||||
|
||||
- PostgreSQL (integrated with Scanner/Concelier schemas)
|
||||
- Scanner.Analyzers.Native (for binary disassembly/analysis)
|
||||
- Concelier (for advisory-to-binary mapping)
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- Architecture: `./architecture.md`
|
||||
- High-Level Architecture: `../../ARCHITECTURE_OVERVIEW.md`
|
||||
- Scanner Architecture: `../scanner/architecture.md`
|
||||
- Concelier Architecture: `../concelier/architecture.md`
|
||||
|
||||
## Current Status
|
||||
|
||||
Library implementation complete with support for ELF (Build-ID), PE (CodeView GUID), and Mach-O (UUID) binary formats. Integrated into Scanner's native binary analysis pipeline.
|
||||
|
||||
---
|
||||
|
||||
## Semantic Diffing Roadmap
|
||||
|
||||
A major enhancement to BinaryIndex is planned to enable **semantic-level binary diffing** - detecting function equivalence based on behavior rather than syntax. This addresses limitations in current byte/symbol-based matching when dealing with:
|
||||
|
||||
- Compiler optimizations (same source, different instructions)
|
||||
- Stripped binaries (no symbols)
|
||||
- Cross-compiler builds (GCC vs Clang)
|
||||
- Obfuscated code
|
||||
|
||||
### Planned Phases
|
||||
|
||||
| Phase | Description | Impact | Status |
|
||||
|-------|-------------|--------|--------|
|
||||
| **Phase 1** | IR-Level Semantic Analysis | +15% accuracy on optimized binaries | Planned |
|
||||
| **Phase 2** | Function Behavior Corpus | +10% coverage on stripped binaries | Planned |
|
||||
| **Phase 3** | Ghidra Integration | +5% edge case handling | Planned |
|
||||
| **Phase 4** | Decompiler & ML Similarity | +10% obfuscation resilience | Planned |
|
||||
|
||||
### New Libraries (Planned)
|
||||
|
||||
- `StellaOps.BinaryIndex.Semantic` - IR lifting and semantic graph fingerprints
|
||||
- `StellaOps.BinaryIndex.Corpus` - 30K+ function behavior database
|
||||
- `StellaOps.BinaryIndex.Ghidra` - Ghidra Headless integration
|
||||
- `StellaOps.BinaryIndex.Decompiler` - Decompiled code AST comparison
|
||||
- `StellaOps.BinaryIndex.ML` - CodeBERT-based function embeddings
|
||||
- `StellaOps.BinaryIndex.Ensemble` - Multi-signal decision fusion
|
||||
|
||||
### Expected Outcomes
|
||||
|
||||
| Metric | Current | Target |
|
||||
|--------|---------|--------|
|
||||
| Patch detection accuracy | ~70% | 92%+ |
|
||||
| Function identification (stripped) | ~50% | 85%+ |
|
||||
| False positive rate | ~5% | <2% |
|
||||
|
||||
### Sprint Files
|
||||
|
||||
- `docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
|
||||
- `docs/implplan/SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`
|
||||
- `docs/implplan/SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
|
||||
- `docs/implplan/SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`
|
||||
|
||||
### Architecture Documentation
|
||||
|
||||
See `./semantic-diffing.md` for comprehensive architecture documentation.
|
||||
695
docs/modules/binary-index/architecture.md
Normal file
@@ -0,0 +1,695 @@
|
||||
# BinaryIndex Module Architecture
|
||||
|
||||
> **Ownership:** Scanner Guild + Concelier Guild
|
||||
> **Status:** DRAFT
|
||||
> **Version:** 1.0.0
|
||||
> **Related:** [High-Level Architecture](../../ARCHITECTURE_OVERVIEW.md), [Scanner Architecture](../scanner/architecture.md), [Concelier Architecture](../concelier/architecture.md)
|
||||
|
||||
---
|
||||
|
||||
## 1. Overview
|
||||
|
||||
The **BinaryIndex** module provides a vulnerable binaries database that enables detection of vulnerable code at the binary level, independent of package metadata. This addresses a critical gap in vulnerability scanning: package version strings can lie (backports, custom builds, stripped metadata), but **binary identity doesn't lie**.
|
||||
|
||||
### 1.1 Problem Statement
|
||||
|
||||
Traditional vulnerability scanners rely on package version matching, which fails in several scenarios:
|
||||
|
||||
1. **Backported patches** - Distros backport security fixes without changing the upstream version
|
||||
2. **Custom/vendored builds** - Binaries compiled from source without package metadata
|
||||
3. **Stripped binaries** - Debug info and version strings removed
|
||||
4. **Static linking** - Vulnerable library code embedded in final binary
|
||||
5. **Container base images** - Distroless or scratch images with no package DB
|
||||
|
||||
### 1.2 Solution: Binary-First Vulnerability Detection
|
||||
|
||||
BinaryIndex provides three tiers of binary identification:
|
||||
|
||||
| Tier | Method | Precision | Coverage |
|
||||
|------|--------|-----------|----------|
|
||||
| A | Package/version range matching | Medium | High |
|
||||
| B | Build-ID/hash catalog (exact binary identity) | High | Medium |
|
||||
| C | Function fingerprints (CFG/basic-block hashes) | Very High | Targeted |
|
||||
|
||||
### 1.3 Module Scope
|
||||
|
||||
**In Scope:**
|
||||
- Binary identity extraction (Build-ID, PE CodeView GUID, Mach-O UUID)
|
||||
- Binary-to-advisory mapping database
|
||||
- Fingerprint storage and matching engine
|
||||
- Fix index for patch-aware backport handling
|
||||
- Integration with Scanner.Worker for binary lookup
|
||||
|
||||
**Out of Scope:**
|
||||
- Binary disassembly/analysis (provided by Scanner.Analyzers.Native)
|
||||
- Runtime binary tracing (provided by Zastava)
|
||||
- SBOM generation (provided by Scanner)
|
||||
|
||||
---
|
||||
|
||||
## 2. Architecture
|
||||
|
||||
### 2.1 System Context
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────────────────────────┐
|
||||
│ External Systems │
|
||||
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
|
||||
│ │ Distro Repos │ │ Debug Symbol │ │ Upstream Source │ │
|
||||
│ │ (Debian, RPM, │ │ Servers │ │ (GitHub, etc.) │ │
|
||||
│ │ Alpine) │ │ (debuginfod) │ │ │ │
|
||||
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
|
||||
└───────────│─────────────────────│─────────────────────│──────────────────┘
|
||||
│ │ │
|
||||
v v v
|
||||
┌──────────────────────────────────────────────────────────────────────────┐
|
||||
│ BinaryIndex Module │
|
||||
│ ┌─────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ Corpus Ingestion Layer │ │
|
||||
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
|
||||
│ │ │ DebianCorpus │ │ RpmCorpus │ │ AlpineCorpus │ │ │
|
||||
│ │ │ Connector │ │ Connector │ │ Connector │ │ │
|
||||
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
|
||||
│ └─────────────────────────────────────────────────────────────────────┘ │
|
||||
│ │ │
|
||||
│ v │
|
||||
│ ┌─────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ Processing Layer │ │
|
||||
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
|
||||
│ │ │ BinaryFeature│ │ FixIndex │ │ Fingerprint │ │ │
|
||||
│ │ │ Extractor │ │ Builder │ │ Generator │ │ │
|
||||
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
|
||||
│ └─────────────────────────────────────────────────────────────────────┘ │
|
||||
│ │ │
|
||||
│ v │
|
||||
│ ┌─────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ Storage Layer │ │
|
||||
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
|
||||
│ │ │ PostgreSQL │ │ RustFS │ │ Valkey │ │ │
|
||||
│ │ │ (binaries │ │ (fingerprint │ │ (lookup │ │ │
|
||||
│ │ │ schema) │ │ blobs) │ │ cache) │ │ │
|
||||
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
|
||||
│ └─────────────────────────────────────────────────────────────────────┘ │
|
||||
│ │ │
|
||||
│ v │
|
||||
│ ┌─────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ Query Layer │ │
|
||||
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
|
||||
│ │ │ IBinaryVulnerabilityService │ │ │
|
||||
│ │ │ - LookupByIdentityAsync(identity) │ │ │
|
||||
│ │ │ - LookupByFingerprintAsync(fingerprint) │ │ │
|
||||
│ │ │ - LookupBatchAsync(identities) │ │ │
|
||||
│ │ │ - GetFixStatusAsync(distro, release, sourcePkg, cve) │ │ │
|
||||
│ │ └──────────────────────────────────────────────────────────────┘ │ │
|
||||
│ └─────────────────────────────────────────────────────────────────────┘ │
|
||||
└──────────────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
v
|
||||
┌──────────────────────────────────────────────────────────────────────────┐
|
||||
│ Consuming Modules │
|
||||
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
|
||||
│ │ Scanner.Worker │ │ Policy Engine │ │ Findings Ledger │ │
|
||||
│ │ (binary lookup │ │ (evidence in │ │ (match records) │ │
|
||||
│ │ during scan) │ │ proof chain) │ │ │ │
|
||||
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
|
||||
└──────────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### 2.2 Component Breakdown
|
||||
|
||||
#### 2.2.1 Corpus Connectors
|
||||
|
||||
Plugin-based connectors that ingest binaries from distribution repositories.
|
||||
|
||||
```csharp
|
||||
public interface IBinaryCorpusConnector
|
||||
{
|
||||
string ConnectorId { get; }
|
||||
string[] SupportedDistros { get; }
|
||||
|
||||
Task<CorpusSnapshot> FetchSnapshotAsync(CorpusQuery query, CancellationToken ct);
|
||||
Task<IAsyncEnumerable<ExtractedBinary>> ExtractBinariesAsync(PackageReference pkg, CancellationToken ct);
|
||||
}
|
||||
```
|
||||
|
||||
**Implementations:**
|
||||
- `DebianBinaryCorpusConnector` - Debian/Ubuntu packages + debuginfo
|
||||
- `RpmBinaryCorpusConnector` - RHEL/Fedora/CentOS + SRPM
|
||||
- `AlpineBinaryCorpusConnector` - Alpine APK + APKBUILD
|
||||
|
||||
#### 2.2.2 Binary Feature Extractor
|
||||
|
||||
Extracts identity and features from binaries. Reuses existing Scanner.Analyzers.Native capabilities.
|
||||
|
||||
```csharp
|
||||
public interface IBinaryFeatureExtractor
|
||||
{
|
||||
Task<BinaryIdentity> ExtractIdentityAsync(Stream binaryStream, CancellationToken ct);
|
||||
Task<BinaryFeatures> ExtractFeaturesAsync(Stream binaryStream, ExtractorOptions opts, CancellationToken ct);
|
||||
}
|
||||
|
||||
public sealed record BinaryIdentity(
|
||||
string Format, // elf, pe, macho
|
||||
string? BuildId, // ELF GNU Build-ID
|
||||
string? PeCodeViewGuid, // PE CodeView GUID + Age
|
||||
string? MachoUuid, // Mach-O LC_UUID
|
||||
string FileSha256,
|
||||
string TextSectionSha256);
|
||||
|
||||
public sealed record BinaryFeatures(
|
||||
BinaryIdentity Identity,
|
||||
string[] DynamicDeps, // DT_NEEDED
|
||||
string[] ExportedSymbols,
|
||||
string[] ImportedSymbols,
|
||||
BinaryHardening Hardening);
|
||||
```
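As a rough illustration of the cheapest `BinaryIdentity` fields above, the following sketch classifies a binary by its magic bytes and computes the file-level SHA-256. `BinaryIdentitySketch`, `DetectFormat`, and `Sha256Hex` are hypothetical names, not part of Scanner.Analyzers.Native. Extracting the Build-ID, CodeView GUID, or LC_UUID is out of scope here, and fat (universal) Mach-O files are deliberately not handled.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration only; not the module's real extractor.
public static class BinaryIdentitySketch
{
    // Returns "elf", "pe", "macho", or "unknown" based on the leading magic bytes.
    public static string DetectFormat(ReadOnlySpan<byte> header)
    {
        if (header.Length >= 4)
        {
            uint magic = BinaryPrimitives.ReadUInt32BigEndian(header);
            if (magic == 0x7F454C46) return "elf"; // 0x7F 'E' 'L' 'F'
            // Thin Mach-O, 32-bit and 64-bit, in either byte order.
            if (magic is 0xFEEDFACE or 0xFEEDFACF or 0xCEFAEDFE or 0xCFFAEDFE) return "macho";
        }

        // Only the DOS "MZ" stub is checked; a stricter check would follow e_lfanew to "PE\0\0".
        if (header.Length >= 2 && header[0] == (byte)'M' && header[1] == (byte)'Z') return "pe";
        return "unknown";
    }

    // Lower-case hex SHA-256 of the stream, read from its current position to the end.
    public static string Sha256Hex(Stream stream)
    {
        using var sha = SHA256.Create();
        return Convert.ToHexString(sha.ComputeHash(stream)).ToLowerInvariant();
    }
}
```

A caller would typically read the first few bytes for `DetectFormat`, rewind the stream, and then hash it, so that the format and `FileSha256` come from the same bytes.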
|
||||
|
||||
#### 2.2.3 Fix Index Builder
|
||||
|
||||
Builds the patch-aware CVE fix index from distro sources.
|
||||
|
||||
```csharp
|
||||
public interface IFixIndexBuilder
|
||||
{
|
||||
Task BuildIndexAsync(DistroRelease distro, CancellationToken ct);
|
||||
Task<FixRecord?> GetFixRecordAsync(string distro, string release, string sourcePkg, string cveId, CancellationToken ct);
|
||||
}
|
||||
|
||||
public sealed record FixRecord(
|
||||
string Distro,
|
||||
string Release,
|
||||
string SourcePkg,
|
||||
string CveId,
|
||||
FixState State, // fixed, vulnerable, not_affected, wontfix, unknown
|
||||
string? FixedVersion, // Distro version string
|
||||
    FixMethod Method,       // security_feed, changelog, patch_header, upstream_patch_match
|
||||
decimal Confidence, // 0.00-1.00
|
||||
FixEvidence Evidence);
|
||||
|
||||
public enum FixState { Fixed, Vulnerable, NotAffected, Wontfix, Unknown }
|
||||
public enum FixMethod { SecurityFeed, Changelog, PatchHeader, UpstreamPatchMatch }
|
||||
```
|
||||
|
||||
#### 2.2.4 Fingerprint Generator
|
||||
|
||||
Generates function-level fingerprints for vulnerable code detection.
|
||||
|
||||
```csharp
|
||||
public interface IVulnFingerprintGenerator
|
||||
{
|
||||
Task<ImmutableArray<VulnFingerprint>> GenerateAsync(
|
||||
string cveId,
|
||||
BinaryPair vulnAndFixed, // Reference builds
|
||||
FingerprintOptions opts,
|
||||
CancellationToken ct);
|
||||
}
|
||||
|
||||
public sealed record VulnFingerprint(
|
||||
string CveId,
|
||||
string Component, // e.g., openssl
|
||||
string Architecture, // x86-64, aarch64
|
||||
    FingerprintType Type,       // basic_block, cfg, string_refs, combined
|
||||
string FingerprintId, // e.g., "bb-abc123..."
|
||||
byte[] FingerprintHash, // 16-32 bytes
|
||||
string? FunctionHint, // Function name if known
|
||||
decimal Confidence,
|
||||
FingerprintEvidence Evidence);
|
||||
|
||||
public enum FingerprintType { BasicBlock, ControlFlowGraph, StringReferences, Combined }
|
||||
```
|
||||
|
||||
#### 2.2.5 Binary Vulnerability Service
|
||||
|
||||
Main query interface for consumers.
|
||||
|
||||
```csharp
|
||||
public interface IBinaryVulnerabilityService
|
||||
{
|
||||
/// <summary>
|
||||
/// Look up vulnerabilities by Build-ID or equivalent binary identity.
|
||||
/// </summary>
|
||||
Task<ImmutableArray<BinaryVulnMatch>> LookupByIdentityAsync(
|
||||
BinaryIdentity identity,
|
||||
LookupOptions? opts = null,
|
||||
CancellationToken ct = default);
|
||||
|
||||
/// <summary>
|
||||
/// Look up vulnerabilities by function fingerprint.
|
||||
/// </summary>
|
||||
Task<ImmutableArray<BinaryVulnMatch>> LookupByFingerprintAsync(
|
||||
CodeFingerprint fingerprint,
|
||||
decimal minSimilarity = 0.95m,
|
||||
CancellationToken ct = default);
|
||||
|
||||
/// <summary>
|
||||
/// Batch lookup for scan performance.
|
||||
/// </summary>
|
||||
Task<ImmutableDictionary<string, ImmutableArray<BinaryVulnMatch>>> LookupBatchAsync(
|
||||
IEnumerable<BinaryIdentity> identities,
|
||||
LookupOptions? opts = null,
|
||||
CancellationToken ct = default);
|
||||
|
||||
/// <summary>
|
||||
/// Get distro-specific fix status (patch-aware).
|
||||
/// </summary>
|
||||
Task<FixRecord?> GetFixStatusAsync(
|
||||
string distro,
|
||||
string release,
|
||||
string sourcePkg,
|
||||
string cveId,
|
||||
CancellationToken ct = default);
|
||||
}
|
||||
|
||||
public sealed record BinaryVulnMatch(
|
||||
string CveId,
|
||||
string VulnerablePurl,
|
||||
MatchMethod Method, // buildid_catalog, fingerprint_match, range_match
|
||||
decimal Confidence,
|
||||
MatchEvidence Evidence);
|
||||
|
||||
public enum MatchMethod { BuildIdCatalog, FingerprintMatch, RangeMatch }
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. Data Model
|
||||
|
||||
### 3.1 PostgreSQL Schema (`binaries`)
|
||||
|
||||
The `binaries` schema stores binary identity, fingerprint, and match data.
|
||||
|
||||
```sql
|
||||
CREATE SCHEMA IF NOT EXISTS binaries;
|
||||
CREATE SCHEMA IF NOT EXISTS binaries_app;
|
||||
|
||||
-- RLS helper
|
||||
CREATE OR REPLACE FUNCTION binaries_app.require_current_tenant()
|
||||
RETURNS TEXT LANGUAGE plpgsql STABLE SECURITY DEFINER AS $$
|
||||
DECLARE v_tenant TEXT;
|
||||
BEGIN
|
||||
v_tenant := current_setting('app.tenant_id', true);
|
||||
IF v_tenant IS NULL OR v_tenant = '' THEN
|
||||
RAISE EXCEPTION 'app.tenant_id session variable not set';
|
||||
END IF;
|
||||
RETURN v_tenant;
|
||||
END;
|
||||
$$;
|
||||
```
|
||||
|
||||
#### 3.1.1 Core Tables
|
||||
|
||||
See `docs/db/schemas/binaries_schema_specification.md` for complete DDL.
|
||||
|
||||
**Key Tables:**
|
||||
|
||||
| Table | Purpose |
|
||||
|-------|---------|
|
||||
| `binaries.binary_identity` | Known binary identities (Build-ID, hashes) |
|
||||
| `binaries.binary_package_map` | Binary → package mapping per snapshot |
|
||||
| `binaries.vulnerable_buildids` | Build-IDs known to be vulnerable |
|
||||
| `binaries.vulnerable_fingerprints` | Function fingerprints for CVEs |
|
||||
| `binaries.cve_fix_index` | Patch-aware fix status per distro |
|
||||
| `binaries.fingerprint_matches` | Match results (findings evidence) |
|
||||
| `binaries.corpus_snapshots` | Corpus ingestion tracking |
|
||||
|
||||
### 3.2 RustFS Layout
|
||||
|
||||
```
|
||||
rustfs://stellaops/binaryindex/
|
||||
fingerprints/<algorithm>/<prefix>/<fingerprint_id>.bin
|
||||
corpus/<distro>/<release>/<snapshot_id>/manifest.json
|
||||
corpus/<distro>/<release>/<snapshot_id>/packages/<pkg>.metadata.json
|
||||
evidence/<match_id>.dsse.json
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Integration Points
|
||||
|
||||
### 4.1 Scanner.Worker Integration
|
||||
|
||||
During container scanning, Scanner.Worker queries BinaryIndex for each extracted binary:
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant SW as Scanner.Worker
|
||||
participant BI as BinaryIndex
|
||||
participant PG as PostgreSQL
|
||||
participant FL as Findings Ledger
|
||||
|
||||
SW->>SW: Extract binary from layer
|
||||
SW->>SW: Compute BinaryIdentity
|
||||
SW->>BI: LookupByIdentityAsync(identity)
|
||||
BI->>PG: Query binaries.vulnerable_buildids
|
||||
PG-->>BI: Matches
|
||||
BI->>PG: Query binaries.cve_fix_index (if distro known)
|
||||
PG-->>BI: Fix status
|
||||
BI-->>SW: BinaryVulnMatch[]
|
||||
SW->>FL: RecordFinding(match, evidence)
|
||||
```
|
||||
|
||||
### 4.2 Concelier Integration
|
||||
|
||||
BinaryIndex subscribes to Concelier's advisory updates:
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant CO as Concelier
|
||||
participant BI as BinaryIndex
|
||||
participant PG as PostgreSQL
|
||||
|
||||
CO->>CO: Ingest new advisory
|
||||
CO->>BI: advisory.created event
|
||||
BI->>BI: Check if affected packages in corpus
|
||||
BI->>PG: Update binaries.binary_vuln_assertion
|
||||
BI->>BI: Queue fingerprint generation (if high-impact)
|
||||
```
|
||||
|
||||
### 4.3 Policy Integration
|
||||
|
||||
Binary matches are recorded as proof segments:
|
||||
|
||||
```json
|
||||
{
|
||||
"segment_type": "binary_fingerprint_evidence",
|
||||
"payload": {
|
||||
"binary_identity": {
|
||||
"format": "elf",
|
||||
"build_id": "abc123...",
|
||||
"file_sha256": "def456..."
|
||||
},
|
||||
"matches": [
|
||||
      {
        "cve_id": "CVE-2024-1234",
        "method": "buildid_catalog",
        "confidence": 0.98,
        "vulnerable_purl": "pkg:deb/debian/libssl1.1@1.1.1n-0+deb11u3"
      }
|
||||
]
|
||||
}
|
||||
}
|
||||
```
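One way a consumer might produce this shape is sketched below with `System.Text.Json`. The `Proof*` record names and the `ProofSegmentSketch.Serialize` helper are assumptions made for illustration; only the JSON property names come from the example above. Note that the default encoder escapes some characters (for example `+` in a purl) as `\uXXXX` sequences, which is still valid JSON.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical DTOs mirroring the proof-segment JSON; not the module's real types.
public sealed record ProofIdentity(
    [property: JsonPropertyName("format")] string Format,
    [property: JsonPropertyName("build_id")] string? BuildId,
    [property: JsonPropertyName("file_sha256")] string FileSha256);

public sealed record ProofMatch(
    [property: JsonPropertyName("cve_id")] string CveId,
    [property: JsonPropertyName("method")] string Method,
    [property: JsonPropertyName("confidence")] decimal Confidence,
    [property: JsonPropertyName("vulnerable_purl")] string VulnerablePurl);

public sealed record ProofPayload(
    [property: JsonPropertyName("binary_identity")] ProofIdentity BinaryIdentity,
    [property: JsonPropertyName("matches")] IReadOnlyList<ProofMatch> Matches);

public sealed record ProofSegment(
    [property: JsonPropertyName("segment_type")] string SegmentType,
    [property: JsonPropertyName("payload")] ProofPayload Payload);

public static class ProofSegmentSketch
{
    // Wraps an identity and its matches in a "binary_fingerprint_evidence" segment.
    public static string Serialize(ProofIdentity identity, IReadOnlyList<ProofMatch> matches) =>
        JsonSerializer.Serialize(
            new ProofSegment("binary_fingerprint_evidence", new ProofPayload(identity, matches)));
}
```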
|
||||
|
||||
---
|
||||
|
||||
## 5. MVP Roadmap
|
||||
|
||||
### MVP 1: Known-Build Binary Catalog (Sprint 6000.0001)
|
||||
|
||||
**Goal:** Query "is this Build-ID vulnerable?" with distro-level precision.
|
||||
|
||||
**Deliverables:**
|
||||
- `binaries` PostgreSQL schema
|
||||
- Build-ID to package mapping tables
|
||||
- Basic CVE lookup by binary identity
|
||||
- Debian/Ubuntu corpus connector
|
||||
|
||||
### MVP 2: Patch-Aware Backport Handling (Sprint 6000.0002)
|
||||
|
||||
**Goal:** Handle "version says vulnerable but distro backported the fix."
|
||||
|
||||
**Deliverables:**
|
||||
- Fix index builder (changelog + patch header parsing)
|
||||
- Distro-specific version comparison
|
||||
- RPM corpus connector
|
||||
- Scanner.Worker integration
|
||||
|
||||
### MVP 3: Binary Fingerprint Factory (Sprint 6000.0003)
|
||||
|
||||
**Goal:** Detect vulnerable code independent of package metadata.
|
||||
|
||||
**Deliverables:**
|
||||
- Fingerprint storage and matching
|
||||
- Reference build generation pipeline
|
||||
- Fingerprint validation corpus
|
||||
- High-impact CVE coverage (OpenSSL, glibc, zlib, curl)
|
||||
|
||||
### MVP 4: Full Scanner Integration (Sprint 6000.0004)
|
||||
|
||||
**Goal:** Binary evidence in production scans.
|
||||
|
||||
**Deliverables:**
|
||||
- Scanner.Worker binary lookup integration
|
||||
- Findings Ledger binary match records
|
||||
- Proof segment attestations
|
||||
- CLI binary match inspection
|
||||
|
||||
---
|
||||
|
||||
## 5b. Fix Evidence Chain
|
||||
|
||||
The **Fix Evidence Chain** provides auditable proof of why a CVE is marked as fixed (or not) for a specific distro/package combination. This is critical for patch-aware backport handling where package versions can be misleading.
|
||||
|
||||
### 5b.1 Evidence Sources
|
||||
|
||||
| Source | Confidence | Description |
|
||||
|--------|------------|-------------|
|
||||
| **Security Feed (OVAL)** | 0.95-0.99 | Authoritative feed from distro (Debian Security Tracker, Red Hat OVAL) |
|
||||
| **Patch Header (DEP-3)** | 0.87-0.95 | CVE reference in Debian/Ubuntu patch metadata |
|
||||
| **Changelog** | 0.75-0.85 | CVE mention in debian/changelog or RPM %changelog |
|
||||
| **Upstream Patch Match** | 0.90 | Binary diff matches known upstream fix |
|
||||
|
||||
### 5b.2 Evidence Storage
|
||||
|
||||
Evidence is stored in two PostgreSQL tables:
|
||||
|
||||
```sql
|
||||
-- Evidence blobs: audit trail.
-- Created first because cve_fix_index.evidence_id references this table.
CREATE TABLE binaries.fix_evidence (
    id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    evidence_type TEXT NOT NULL,   -- changelog, patch_header, security_feed
    source_file TEXT,              -- Path to source file (changelog, patch)
    source_sha256 TEXT,            -- Hash of source file
    excerpt TEXT,                  -- Relevant snippet (max 1KB)
    metadata JSONB NOT NULL,       -- Structured metadata
    snapshot_id UUID,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Fix index: one row per (tenant_id, distro, release, source_pkg, cve_id)
CREATE TABLE binaries.cve_fix_index (
    id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    distro TEXT NOT NULL,          -- debian, ubuntu, alpine, rhel
    release TEXT NOT NULL,         -- bookworm, jammy, v3.19
    source_pkg TEXT NOT NULL,
    cve_id TEXT NOT NULL,
    state TEXT NOT NULL,           -- fixed, vulnerable, not_affected, wontfix, unknown
    fixed_version TEXT,
    method TEXT NOT NULL,          -- security_feed, changelog, patch_header, upstream_match
    confidence DECIMAL(3,2) NOT NULL,
    evidence_id UUID REFERENCES binaries.fix_evidence(id),
    snapshot_id UUID,
    indexed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (tenant_id, distro, release, source_pkg, cve_id)
);
|
||||
```
|
||||
|
||||
### 5b.3 Evidence Types
|
||||
|
||||
**ChangelogEvidence:**
|
||||
```json
|
||||
{
|
||||
"evidence_type": "changelog",
|
||||
"source_file": "debian/changelog",
|
||||
"excerpt": "* Fix CVE-2024-0727: PKCS12 decoding crash",
|
||||
"metadata": {
|
||||
"version": "3.0.11-1~deb12u2",
|
||||
"line_number": 5
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**PatchHeaderEvidence:**
|
||||
```json
|
||||
{
|
||||
"evidence_type": "patch_header",
|
||||
"source_file": "debian/patches/CVE-2024-0727.patch",
|
||||
"excerpt": "CVE: CVE-2024-0727\nOrigin: upstream, https://github.com/openssl/commit/abc123",
|
||||
"metadata": {
|
||||
"patch_sha256": "abc123def456..."
|
||||
}
|
||||
}
|
||||
```
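The excerpt in a patch-header record comes from the `Field: value` lines at the top of a patch file. The sketch below shows one simple way to read those fields, assuming the header ends at the first `---` line or at the start of the diff. `PatchHeaderSketch` is a hypothetical name and this is not the module's `PatchHeaderParser`; in particular, multi-line continuation values are skipped rather than joined.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical, simplified reader for DEP-3 style patch headers.
public static class PatchHeaderSketch
{
    // Returns header fields keyed case-insensitively; later duplicates overwrite earlier ones.
    public static IReadOnlyDictionary<string, string> ParseHeader(string patchText)
    {
        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        using var reader = new StringReader(patchText);
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            // Stop at the header terminator or at the first diff line ("--- a/file" also matches).
            if (line.StartsWith("---", StringComparison.Ordinal) ||
                line.StartsWith("diff ", StringComparison.Ordinal))
            {
                break;
            }

            // Skip blank lines, indented continuation lines, and lines without a field name.
            int colon = line.IndexOf(':');
            if (colon <= 0 || char.IsWhiteSpace(line[0])) continue;

            fields[line[..colon].Trim()] = line[(colon + 1)..].Trim();
        }

        return fields;
    }
}
```

Only the first colon splits a line, so URL values such as the `Origin` field survive intact.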
|
||||
|
||||
**SecurityFeedEvidence:**
|
||||
```json
|
||||
{
|
||||
"evidence_type": "security_feed",
|
||||
"metadata": {
|
||||
"feed_id": "debian-security-tracker",
|
||||
"entry_id": "DSA-5678-1",
|
||||
"published_at": "2024-01-15T10:00:00Z"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 5b.4 Confidence Resolution
|
||||
|
||||
When multiple evidence sources exist for the same fix-index key (tenant, distro, release, source package, CVE), the system keeps the **highest-confidence** entry; on a tie, the existing entry is kept:
|
||||
|
||||
```sql
-- Fragment of the upsert into binaries.cve_fix_index. "existing" is the table alias
-- (INSERT INTO binaries.cve_fix_index AS existing ...); EXCLUDED is the incoming row.
ON CONFLICT (tenant_id, distro, release, source_pkg, cve_id)
DO UPDATE SET
    confidence = GREATEST(existing.confidence, EXCLUDED.confidence),
    method = CASE
        WHEN existing.confidence < EXCLUDED.confidence THEN EXCLUDED.method
        ELSE existing.method
    END,
    evidence_id = CASE
        WHEN existing.confidence < EXCLUDED.confidence THEN EXCLUDED.evidence_id
        ELSE existing.evidence_id
    END
|
||||
```
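The same rule can be expressed in memory, which is handy for unit-testing the resolution logic without a database. This is a minimal sketch: `FixIndexEntrySketch`, `ConfidenceResolution`, and the tuple key are hypothetical names chosen for the example, and only the three columns touched by the upsert are modeled.

```csharp
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the columns the upsert updates.
public sealed record FixIndexEntrySketch(string Method, decimal Confidence, string EvidenceId);

public static class ConfidenceResolution
{
    // Keeps the entry with the strictly higher confidence; ties keep the existing entry.
    public static FixIndexEntrySketch Merge(FixIndexEntrySketch existing, FixIndexEntrySketch incoming) =>
        existing.Confidence < incoming.Confidence ? incoming : existing;

    // Upserts into an index keyed like the UNIQUE constraint on cve_fix_index.
    public static void Upsert(
        IDictionary<(string Tenant, string Distro, string Release, string SourcePkg, string CveId), FixIndexEntrySketch> index,
        (string Tenant, string Distro, string Release, string SourcePkg, string CveId) key,
        FixIndexEntrySketch incoming)
    {
        index[key] = index.TryGetValue(key, out var existing) ? Merge(existing, incoming) : incoming;
    }
}
```

Because ties keep the existing row, re-ingesting the same evidence is idempotent, which fits the module's deterministic-output goal.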
|
||||
|
||||
### 5b.5 Parsers
|
||||
|
||||
The following parsers extract CVE fix information:
|
||||
|
||||
| Parser | Distros | Input | Confidence |
|
||||
|--------|---------|-------|------------|
|
||||
| `DebianChangelogParser` | Debian, Ubuntu | debian/changelog | 0.80 |
|
||||
| `PatchHeaderParser` | Debian, Ubuntu | debian/patches/*.patch (DEP-3) | 0.87 |
|
||||
| `AlpineSecfixesParser` | Alpine | APKBUILD secfixes block | 0.95 |
|
||||
| `RpmChangelogParser` | RHEL, Fedora, CentOS | RPM spec %changelog | 0.75 |
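At their core, the changelog parsers in the table above look for CVE identifiers in free text. A minimal sketch of that step follows, assuming the standard `CVE-YYYY-NNNN` identifier form with four or more digits in the sequence part. `ChangelogCveSketch` is a hypothetical name; the real parsers also need to associate each identifier with the package version of the changelog entry it appears in, which is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: extracts CVE identifiers from changelog text.
public static class ChangelogCveSketch
{
    private static readonly Regex CvePattern =
        new Regex(@"\bCVE-\d{4}-\d{4,}\b", RegexOptions.Compiled);

    // Returns distinct CVE IDs in ordinal order so repeated runs give identical output.
    public static IReadOnlyList<string> ExtractCveIds(string text) =>
        CvePattern.Matches(text)
            .Select(m => m.Value)
            .Distinct()
            .OrderBy(id => id, StringComparer.Ordinal)
            .ToList();
}
```

For the changelog excerpt shown earlier, `ExtractCveIds("* Fix CVE-2024-0727: PKCS12 decoding crash")` yields a single entry, `CVE-2024-0727`.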
|
||||
|
||||
### 5b.6 Query Flow
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant SW as Scanner.Worker
|
||||
participant BVS as BinaryVulnerabilityService
|
||||
participant FIR as FixIndexRepository
|
||||
participant PG as PostgreSQL
|
||||
|
||||
SW->>BVS: GetFixStatusAsync(debian, bookworm, openssl, CVE-2024-0727)
|
||||
BVS->>FIR: GetFixStatusAsync(...)
|
||||
FIR->>PG: SELECT FROM cve_fix_index WHERE ...
|
||||
PG-->>FIR: FixIndexEntry (state=fixed, confidence=0.87)
|
||||
FIR-->>BVS: FixStatusResult
|
||||
BVS-->>SW: {state: Fixed, confidence: 0.87, method: PatchHeader}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Security Considerations
|
||||
|
||||
### 6.1 Trust Boundaries
|
||||
|
||||
1. **Corpus Ingestion** - Packages are untrusted; extraction runs in sandboxed workers
|
||||
2. **Fingerprint Generation** - Reference builds compiled in isolated environments
|
||||
3. **Query API** - Tenant-isolated via RLS; no cross-tenant data leakage
|
||||
|
||||
### 6.2 Signing & Provenance
|
||||
|
||||
- All corpus snapshots are signed (DSSE)
|
||||
- Fingerprint sets are versioned and signed
|
||||
- Every match result references evidence digests
|
||||
|
||||
### 6.3 Sandbox Requirements
|
||||
|
||||
Binary extraction and fingerprint generation MUST run with:
|
||||
- Seccomp profile restricting syscalls
|
||||
- Read-only root filesystem
|
||||
- No network access during analysis
|
||||
- Memory/CPU limits
|
||||
|
||||
---
|
||||
|
||||
## 7. Observability
|
||||
|
||||
### 7.1 Metrics
|
||||
|
||||
| Metric | Type | Labels |
|
||||
|--------|------|--------|
|
||||
| `binaryindex_lookup_total` | Counter | method, result |
|
||||
| `binaryindex_lookup_latency_ms` | Histogram | method |
|
||||
| `binaryindex_corpus_packages_total` | Gauge | distro, release |
|
||||
| `binaryindex_fingerprints_indexed` | Gauge | algorithm, component |
|
||||
| `binaryindex_match_confidence` | Histogram | method |
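The counter and latency histogram from the table could be emitted with the built-in `System.Diagnostics.Metrics` API, as sketched below. The meter name `StellaOps.BinaryIndex` and the `BinaryIndexMetricsSketch` class are assumptions made for illustration; only the instrument and label names come from the table.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical metrics wrapper; the meter name is an assumption.
public static class BinaryIndexMetricsSketch
{
    private static readonly Meter BinaryIndexMeter = new Meter("StellaOps.BinaryIndex");

    private static readonly Counter<long> LookupTotal =
        BinaryIndexMeter.CreateCounter<long>("binaryindex_lookup_total");

    private static readonly Histogram<double> LookupLatencyMs =
        BinaryIndexMeter.CreateHistogram<double>("binaryindex_lookup_latency_ms", unit: "ms");

    // Records one lookup with its method (e.g. buildid_catalog) and result (e.g. hit, miss).
    public static void RecordLookup(string method, string result, double elapsedMs)
    {
        LookupTotal.Add(
            1,
            new KeyValuePair<string, object?>("method", method),
            new KeyValuePair<string, object?>("result", result));
        LookupLatencyMs.Record(elapsedMs, new KeyValuePair<string, object?>("method", method));
    }
}
```

Keeping the label values to a small fixed set (match methods and outcomes) avoids high-cardinality series in the metrics backend.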
|
||||
|
||||
### 7.2 Traces
|
||||
|
||||
- `binaryindex.lookup` - Full lookup span
|
||||
- `binaryindex.corpus.ingest` - Corpus ingestion
|
||||
- `binaryindex.fingerprint.generate` - Fingerprint generation
|
||||
|
||||
---
|
||||
|
||||
## 8. Configuration
|
||||
|
||||
```yaml
|
||||
# binaryindex.yaml
|
||||
binaryindex:
|
||||
enabled: true
|
||||
|
||||
corpus:
|
||||
connectors:
|
||||
- type: debian
|
||||
enabled: true
|
||||
mirror: http://deb.debian.org/debian
|
||||
releases: [bookworm, bullseye]
|
||||
architectures: [amd64, arm64]
|
||||
- type: ubuntu
|
||||
enabled: true
|
||||
mirror: http://archive.ubuntu.com/ubuntu
|
||||
releases: [jammy, noble]
|
||||
|
||||
fingerprinting:
|
||||
enabled: true
|
||||
algorithms: [basic_block, cfg]
|
||||
target_components:
|
||||
- openssl
|
||||
- glibc
|
||||
- zlib
|
||||
- curl
|
||||
- sqlite
|
||||
min_function_size: 16 # bytes
|
||||
max_functions_per_binary: 10000
|
||||
|
||||
lookup:
|
||||
cache_ttl: 3600
|
||||
batch_size: 100
|
||||
timeout_ms: 5000
|
||||
|
||||
storage:
|
||||
postgres_schema: binaries
|
||||
rustfs_bucket: stellaops/binaryindex
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 9. Testing Strategy
|
||||
|
||||
### 9.1 Unit Tests
|
||||
|
||||
- Identity extraction (Build-ID, hashes)
|
||||
- Fingerprint generation determinism
|
||||
- Fix index parsing (changelog, patch headers)
|
||||
|
||||
### 9.2 Integration Tests
|
||||
|
||||
- PostgreSQL schema validation
|
||||
- Full corpus ingestion flow
|
||||
- Scanner.Worker lookup integration
|
||||
|
||||
### 9.3 Regression Tests
|
||||
|
||||
- Known CVE detection (golden corpus)
|
||||
- Backport handling (Debian libssl example)
|
||||
- False positive rate validation
|
||||
|
||||
---
|
||||
|
||||
## 10. References
|
||||
|
||||
- Advisory: `docs/product-advisories/21-Dec-2025 - Mapping Evidence Within Compiled Binaries.md`
|
||||
- Scanner Native Analysis: `src/Scanner/StellaOps.Scanner.Analyzers.Native/`
|
||||
- Existing Fingerprinting: `src/Scanner/__Libraries/StellaOps.Scanner.EntryTrace/Binary/`
|
||||
- Build-ID Index: `src/Scanner/StellaOps.Scanner.Analyzers.Native/Index/`
|
||||
|
||||
---
|
||||
|
||||
*Document Version: 1.0.0*
|
||||
*Last Updated: 2025-12-21*
|
||||
564
docs/modules/binary-index/semantic-diffing.md
Normal file
@@ -0,0 +1,564 @@
|
||||
# Semantic Diffing Architecture
|
||||
|
||||
> **Status:** PLANNED
|
||||
> **Version:** 1.0.0
|
||||
> **Related Sprints:**
|
||||
> - `SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
|
||||
> - `SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`
|
||||
> - `SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`
|
||||
> - `SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`
|
||||
|
||||
---
|
||||
|
||||
## 1. Executive Summary
|
||||
|
||||
Semantic diffing is an advanced binary analysis capability that detects function equivalence based on **behavior** rather than **syntax**. This enables accurate vulnerability detection in scenarios where traditional byte-level or symbol-based matching fails:
|
||||
|
||||
- **Compiler optimizations** - Same source, different instructions
|
||||
- **Obfuscation** - Intentionally altered code structure
|
||||
- **Stripped binaries** - No symbols or debug information
|
||||
- **Cross-compiler** - GCC vs Clang produce different output
|
||||
- **Backported patches** - Different version, same fix
|
||||
|
||||
### Expected Impact
|
||||
|
||||
| Capability | Current Accuracy | With Semantic Diffing |
|
||||
|------------|-----------------|----------------------|
|
||||
| Patch detection (optimized) | ~70% | 92%+ |
|
||||
| Function identification (stripped) | ~50% | 85%+ |
|
||||
| Obfuscation resilience | ~40% | 75%+ |
|
||||
| False positive rate | ~5% | <2% |
|
||||
|
||||
---
|
||||
|
||||
## 2. Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                          Semantic Diffing Architecture                          │
│                                                                                 │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                               Analysis Layer                                ││
│  │                                                                             ││
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         ││
│  │  │    B2R2     │  │   Ghidra    │  │ Decompiler  │  │     ML      │         ││
│  │  │  (Primary)  │  │ (Fallback)  │  │ (Optional)  │  │ (Optional)  │         ││
│  │  │             │  │             │  │             │  │             │         ││
│  │  │ - Disasm    │  │ - P-Code    │  │ - C output  │  │ - CodeBERT  │         ││
│  │  │ - LowUIR    │  │ - BSim      │  │ - AST parse │  │ - GraphSage │         ││
│  │  │ - CFG       │  │ - Ver.Track │  │ - Normalize │  │ - Embedding │         ││
│  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         ││
│  │         │                │                │                │                ││
│  └─────────┴────────────────┴────────────────┴────────────────┴────────────────┘│
│                                        │                                        │
│                                        v                                        │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                              Fingerprint Layer                              ││
│  │                                                                             ││
│  │  ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐        ││
│  │  │    Instruction    │  │     Semantic      │  │    Decompiled     │        ││
│  │  │    Fingerprint    │  │    Fingerprint    │  │    Fingerprint    │        ││
│  │  │                   │  │                   │  │                   │        ││
│  │  │ - BasicBlock hash │  │ - KSG graph hash  │  │ - AST hash        │        ││
│  │  │ - CFG edge hash   │  │ - WL hash         │  │ - Normalized code │        ││
│  │  │ - String refs     │  │ - DataFlow hash   │  │ - API sequence    │        ││
│  │  │ - Rolling chunks  │  │ - API calls       │  │ - Pattern hash    │        ││
│  │  └───────────────────┘  └───────────────────┘  └───────────────────┘        ││
│  │                                                                             ││
│  │  ┌───────────────────┐  ┌───────────────────┐                               ││
│  │  │       BSim        │  │   ML Embedding    │                               ││
│  │  │     Signature     │  │      Vector       │                               ││
│  │  │                   │  │                   │                               ││
│  │  │ - Feature vector  │  │ - 768-dim float[] │                               ││
│  │  │ - Significance    │  │ - Cosine sim      │                               ││
│  │  └───────────────────┘  └───────────────────┘                               ││
│  │                                                                             ││
│  └─────────────────────────────────────────────────────────────────────────────┘│
│                                        │                                        │
│                                        v                                        │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                               Matching Layer                                ││
│  │                                                                             ││
│  │  ┌───────────────────────────────────────────────────────────────────────┐  ││
│  │  │                       Ensemble Decision Engine                        │  ││
│  │  │                                                                       │  ││
│  │  │  Signal Weights:                                                      │  ││
│  │  │  - Instruction fingerprint: 15%                                       │  ││
│  │  │  - Semantic graph:          25%                                       │  ││
│  │  │  - Decompiled AST:          35%                                       │  ││
│  │  │  - ML embedding:            25%                                       │  ││
│  │  │                                                                       │  ││
│  │  │  Output: Confidence-weighted similarity score                         │  ││
│  │  │                                                                       │  ││
│  │  └───────────────────────────────────────────────────────────────────────┘  ││
│  │                                                                             ││
│  └─────────────────────────────────────────────────────────────────────────────┘│
│                                        │                                        │
│                                        v                                        │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                                Storage Layer                                ││
│  │                                                                             ││
│  │  PostgreSQL                RustFS                    Valkey                 ││
│  │  - corpus.* tables         - Fingerprint blobs       - Query cache          ││
│  │  - binaries.* tables       - Model artifacts         - Embedding index      ││
│  │  - BSim database           - Training data                                  ││
│  │                                                                             ││
│  └─────────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────────┘
```
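
The weighting in the Matching Layer can be illustrated with a short sketch. This is not the planned `EnsembleDecisionEngine`; the type and member names are hypothetical. Skipping signals that were not computed and renormalizing the remaining weights is an assumption, since the document does not say how missing signals are handled. With all four signals present at 0.9, 0.8, 1.0 and 0.6 (in the order listed in the diagram), the combined score is 0.835.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical signal names; the weights are the ones listed in the diagram above.
public enum Signal { Instruction, SemanticGraph, DecompiledAst, MlEmbedding }

public static class EnsembleSketch
{
    public static readonly IReadOnlyDictionary<Signal, decimal> Weights =
        new Dictionary<Signal, decimal>
        {
            [Signal.Instruction] = 0.15m,
            [Signal.SemanticGraph] = 0.25m,
            [Signal.DecompiledAst] = 0.35m,
            [Signal.MlEmbedding] = 0.25m,
        };

    // Combines per-signal similarities (each expected in [0, 1]) into one score.
    // Assumption: signals that were not computed are omitted from the input,
    // and the weights of the remaining signals are renormalized.
    public static decimal Combine(IReadOnlyDictionary<Signal, decimal> similarities)
    {
        if (similarities.Count == 0)
            throw new ArgumentException("At least one signal is required.", nameof(similarities));

        decimal weightSum = similarities.Keys.Sum(s => Weights[s]);
        decimal weighted = similarities.Sum(kv => Weights[kv.Key] * kv.Value);
        return weighted / weightSum;
    }
}
```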

---

## 3. Implementation Phases

### Phase 1: IR-Level Semantic Analysis (Foundation)

**Sprint:** `SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`

Leverage B2R2's Intermediate Representation (IR) for semantic-level function comparison.

**Key Components:**
- `IrLiftingService` - Lift instructions to LowUIR
- `SemanticGraphExtractor` - Build Key-Semantics Graph (KSG)
- `WeisfeilerLehmanHasher` - Graph fingerprinting
- `SemanticMatcher` - Semantic similarity scoring

**Deliverables:**
- `StellaOps.BinaryIndex.Semantic` library
- 20 tasks, ~3 weeks

### Phase 2: Function Behavior Corpus (Scale)

**Sprint:** `SPRINT_20260105_001_002_BINDEX_semdiff_corpus.md`

Build a comprehensive database of known library functions.

**Key Components:**
- Library corpus connectors (glibc, OpenSSL, zlib, curl, SQLite)
- `CorpusIngestionService` - Batch fingerprint generation
- `FunctionClusteringService` - Group similar functions
- `CorpusQueryService` - Function identification

**Deliverables:**
- `StellaOps.BinaryIndex.Corpus` library
- PostgreSQL `corpus.*` schema
- ~30,000 indexed functions
- 22 tasks, ~4 weeks

### Phase 3: Ghidra Integration (Depth)

**Sprint:** `SPRINT_20260105_001_003_BINDEX_semdiff_ghidra.md`

Add Ghidra as a secondary backend for complex cases.

**Key Components:**
- `GhidraHeadlessManager` - Process lifecycle
- `VersionTrackingService` - Multi-correlator diffing
- `GhidriffBridge` - Python interop
- `BSimService` - Behavioral similarity

**Deliverables:**
- `StellaOps.BinaryIndex.Ghidra` library
- Docker image for Ghidra Headless
- 20 tasks, ~4 weeks

### Phase 4: Decompiler & ML (Excellence)

**Sprint:** `SPRINT_20260105_001_004_BINDEX_semdiff_decompiler_ml.md`

Add the highest-fidelity semantic analysis: decompiled-code comparison and ML embeddings.

**Key Components:**
- `IDecompilerService` - Ghidra decompilation
- `AstComparisonEngine` - Structural similarity
- `OnnxInferenceEngine` - ML embeddings
- `EnsembleDecisionEngine` - Multi-signal fusion

**Deliverables:**
- `StellaOps.BinaryIndex.Decompiler` library
- `StellaOps.BinaryIndex.ML` library
- Trained CodeBERT-Binary model
- 30 tasks, ~5 weeks

---

## 4. Fingerprint Types

### 4.1 Instruction Fingerprint (Existing)

**Algorithm:** BasicBlock hash + CFG edge hash + String refs hash

**Properties:**
- Fast to compute
- Sensitive to instruction changes
- Good for exact/near-exact matches

**Weight in ensemble:** 15%

### 4.2 Semantic Fingerprint (Phase 1)

**Algorithm:** Key-Semantics Graph + Weisfeiler-Lehman hash

**Properties:**
- Captures data/control dependencies
- Resilient to register renaming
- Resilient to instruction reordering

**Weight in ensemble:** 25%
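
To make the Weisfeiler-Lehman step concrete, here is a minimal sketch of WL relabeling over a labeled directed graph. It is not the planned `WeisfeilerLehmanHasher`: the graph representation (string labels plus successor lists), the iteration count, and the use of SHA-256 are all assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class WlHashSketch
{
    // labels[i] is the initial label of node i (for example an operation kind, not a register name);
    // successors[i] lists the nodes that node i points to.
    public static string Hash(IReadOnlyList<string> labels, IReadOnlyList<int[]> successors, int iterations = 3)
    {
        string[] current = labels.ToArray();
        for (int round = 0; round < iterations; round++)
        {
            var next = new string[current.Length];
            for (int node = 0; node < current.Length; node++)
            {
                // Sort neighbor labels so the result does not depend on edge order.
                var neighborLabels = successors[node]
                    .Select(n => current[n])
                    .OrderBy(l => l, StringComparer.Ordinal);
                next[node] = Digest(current[node] + "|" + string.Join(",", neighborLabels));
            }
            current = next;
        }

        // Sort the final labels so the result does not depend on node numbering.
        return Digest(string.Join(";", current.OrderBy(l => l, StringComparer.Ordinal)));
    }

    private static string Digest(string text) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));
}
```

Two graphs that differ only in node numbering or edge order produce the same digest, which is what makes this style of fingerprint insensitive to instruction reordering. Leaving register names out of the labels is what would give resilience to register renaming. Note that this sketch yields an exact-match digest only; a similarity score would need a comparison of the per-node labels rather than a single hash.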

### 4.3 Decompiled Fingerprint (Phase 4)

**Algorithm:** Normalized AST hash + Pattern detection

**Properties:**
- Highest semantic fidelity
- Captures algorithmic structure
- Resilient to most optimizations

**Weight in ensemble:** 35%

### 4.4 ML Embedding (Phase 4)

**Algorithm:** CodeBERT-Binary transformer, 768-dim vectors

**Properties:**
- Learned similarity metric
- Captures latent patterns
- Resilient to obfuscation

**Weight in ensemble:** 25%
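
The architecture diagram lists cosine similarity as the comparison for embedding vectors. A minimal sketch of that comparison follows; the class name is hypothetical, and treating a zero vector as dissimilar to everything is an assumption. Cosine similarity ranges over [-1, 1], and how the planned engine maps it onto the [0, 1] similarity scale used by the ensemble is not specified in this document.

```csharp
using System;

public static class EmbeddingSimilaritySketch
{
    // Cosine similarity of two equal-length embedding vectors, in [-1, 1].
    public static double Cosine(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            // Accumulate in double to limit rounding error over 768 dimensions.
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // Assumption: a zero vector is treated as dissimilar to everything.
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```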

---

## 5. Matching Pipeline

```mermaid
sequenceDiagram
    participant Client
    participant DiffEngine as PatchDiffEngine
    participant B2R2
    participant Ghidra
    participant Corpus
    participant Ensemble

    Client->>DiffEngine: Compare(oldBinary, newBinary)

    par Parallel Analysis
        DiffEngine->>B2R2: Disassemble + IR lift
        DiffEngine->>Ghidra: Decompile (if needed)
    end

    B2R2-->>DiffEngine: SemanticFingerprints[]
    Ghidra-->>DiffEngine: DecompiledFunctions[]

    DiffEngine->>Corpus: IdentifyFunctions(fingerprints)
    Corpus-->>DiffEngine: FunctionMatches[]

    DiffEngine->>Ensemble: ComputeSimilarity(old, new)
    Ensemble-->>DiffEngine: EnsembleResult

    DiffEngine-->>Client: PatchDiffResult
```

---

## 6. Fallback Strategy

The system uses a tiered fallback strategy:

```
Tier 1: B2R2 IR + Semantic Graph (fast, ~90% coverage)
    │
    │ If confidence < threshold OR architecture unsupported
    v
Tier 2: Ghidra Version Tracking (slower, ~95% coverage)
    │
    │ If function is high-value (CVE-relevant)
    v
Tier 3: Decompiled AST + ML Embedding (slowest, ~99% coverage)
```

**Selection Criteria:**

| Condition | Backend | Reason |
|-----------|---------|--------|
| Standard x64/ARM64 binary | B2R2 only | Fast, accurate |
| Low B2R2 confidence (<0.7) | B2R2 + Ghidra | Validation |
| Exotic architecture | Ghidra only | Better coverage |
| CVE-affected function | Full pipeline | Maximum accuracy |
| Obfuscated binary | ML embedding | Obfuscation resilience |
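
The selection table can be read as a small decision function. The sketch below is one possible reading, not the planned implementation: the `FunctionContext` inputs are hypothetical, the precedence between rows (CVE relevance first) is an assumption, and adding ML on top of the base backend for obfuscated binaries is an interpretation of the last row.

```csharp
using System;

[Flags]
public enum Backend { None = 0, B2R2 = 1, Ghidra = 2, Decompiler = 4, Ml = 8 }

// Hypothetical inputs; the document does not define how these facts are obtained.
public sealed record FunctionContext(
    bool ArchitectureSupportedByB2R2,
    decimal? B2R2Confidence,
    bool IsCveRelevant,
    bool LooksObfuscated);

public static class BackendSelectionSketch
{
    public const decimal MinB2R2Confidence = 0.7m; // threshold from the table above

    public static Backend Select(FunctionContext ctx)
    {
        // CVE-affected functions get the full pipeline (B2R2 only where it supports the architecture).
        if (ctx.IsCveRelevant)
        {
            var full = Backend.Ghidra | Backend.Decompiler | Backend.Ml;
            return ctx.ArchitectureSupportedByB2R2 ? full | Backend.B2R2 : full;
        }

        // Exotic architectures fall back to Ghidra; everything else starts with B2R2.
        var selected = ctx.ArchitectureSupportedByB2R2 ? Backend.B2R2 : Backend.Ghidra;

        // Low B2R2 confidence adds Ghidra as a second opinion.
        if (ctx.ArchitectureSupportedByB2R2
            && ctx.B2R2Confidence is decimal confidence
            && confidence < MinB2R2Confidence)
        {
            selected |= Backend.Ghidra;
        }

        // Obfuscated binaries additionally use the ML embedding.
        if (ctx.LooksObfuscated)
            selected |= Backend.Ml;

        return selected;
    }
}
```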

---

## 7. Corpus Coverage

### Priority Libraries

| Library | Priority | Functions | CVEs |
|---------|----------|-----------|------|
| glibc | Critical | ~15,000 | 50+ |
| OpenSSL | Critical | ~8,000 | 100+ |
| zlib | High | ~200 | 5+ |
| libcurl | High | ~2,000 | 80+ |
| SQLite | High | ~1,500 | 30+ |
| libxml2 | Medium | ~1,200 | 40+ |
| libpng | Medium | ~300 | 10+ |
| expat | Medium | ~150 | 15+ |

### Architecture Coverage

| Architecture | B2R2 | Ghidra | Status |
|--------------|------|--------|--------|
| x86_64 | Excellent | Excellent | Primary |
| ARM64 | Excellent | Excellent | Primary |
| ARM32 | Good | Excellent | Secondary |
| MIPS32 | Fair | Excellent | Fallback |
| MIPS64 | Fair | Excellent | Fallback |
| RISC-V | Good | Good | Emerging |
| PPC32/64 | Fair | Excellent | Fallback |

---

## 8. Performance Characteristics

### Latency Budget

| Operation | Target | Notes |
|-----------|--------|-------|
| B2R2 disassembly | <100ms | Per function |
| IR lifting | <50ms | Per function |
| Semantic fingerprint | <50ms | Per function |
| Ghidra analysis | <30s | Per binary (startup) |
| Decompilation | <500ms | Per function |
| ML inference | <100ms | Per function |
| Ensemble decision | <10ms | Per comparison |
| **Total (Tier 1)** | **<200ms** | Per function (disassembly + IR lifting + semantic fingerprint) |
| **Total (Full)** | **<1s** | Per function; excludes the per-binary Ghidra startup cost |

### Memory Budget

| Component | Memory | Notes |
|-----------|--------|-------|
| B2R2 per binary | ~100MB | Scales with binary size |
| Ghidra per project | ~2GB | Persistent cache |
| ML model | ~500MB | ONNX loaded |
| Corpus query cache | ~100MB | LRU eviction |

---

## 9. Integration Points

### 9.1 Scanner Integration

```csharp
// Scanner.Worker uses semantic diffing for binary vulnerability detection
var result = await _binaryVulnerabilityService.LookupByFingerprintAsync(
    fingerprint,
    minSimilarity: 0.85m,
    useSemanticMatching: true,  // Enable semantic diffing
    ct);
```

### 9.2 PatchDiffEngine Enhancement

```csharp
// PatchDiffEngine now includes semantic comparison
var diff = await _patchDiffEngine.DiffAsync(
    vulnerableBinary,
    patchedBinary,
    new PatchDiffOptions
    {
        UseSemanticAnalysis = true,
        SemanticThreshold = 0.7m,
        IncludeDecompilation = true,
        IncludeMlEmbedding = true
    },
    ct);
```

### 9.3 DeltaSignature Enhancement

```csharp
// Delta signatures now include semantic fingerprints
var signature = await _deltaSignatureGenerator.GenerateSignaturesAsync(
    binaryStream,
    new DeltaSignatureRequest
    {
        Cve = "CVE-2024-1234",
        TargetSymbols = ["vulnerable_func"],
        IncludeSemanticFingerprint = true,
        IncludeDecompiledHash = true
    },
    ct);
```

---

## 10. Security Considerations

### 10.1 Sandbox Requirements

All binary analysis runs in sandboxed environments:
- Seccomp profile restricting syscalls
- Read-only root filesystem
- No network access during analysis
- Memory/CPU limits

### 10.2 Model Security

ML models are:
- Signed with DSSE attestations
- Verified before loading
- Not user-uploadable (pre-trained only)

### 10.3 Corpus Integrity

Corpus data is:
- Ingested from trusted sources only
- Signed at snapshot level
- Version-controlled with audit trail

---

## 11. Configuration

```yaml
# binaryindex.yaml - Semantic diffing configuration
binaryindex:
  semantic_diffing:
    enabled: true

    # Analysis backends
    backends:
      b2r2:
        enabled: true
        ir_lifting: true
        semantic_graph: true
      ghidra:
        enabled: true
        fallback_only: true
        min_b2r2_confidence: 0.7
        headless_timeout_ms: 30000
      decompiler:
        enabled: true
        high_value_only: true  # Only for CVE-affected functions
      ml:
        enabled: true
        model_path: /models/codebert_binary_v1.onnx
        embedding_dimension: 768

    # Ensemble weights
    ensemble:
      instruction_weight: 0.15
      semantic_weight: 0.25
      decompiled_weight: 0.35
      ml_weight: 0.25
      min_confidence: 0.6

    # Corpus
    corpus:
      auto_update: true
      update_interval_hours: 24
      libraries:
        - glibc
        - openssl
        - zlib
        - curl
        - sqlite

    # Performance
    performance:
      max_parallel_analyses: 4
      cache_ttl_seconds: 3600
      max_function_size_bytes: 1048576  # 1MB
```
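
Because the four ensemble weights are configurable, a startup check that they remain consistent may be useful. The sketch below assumes a hypothetical `EnsembleOptions` type mirroring the `ensemble` keys above; the requirement that the weights sum to exactly 1.0 is also an assumption, suggested by the defaults (0.15 + 0.25 + 0.35 + 0.25 = 1.00) rather than stated in this document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical options type mirroring the `ensemble` keys in the YAML above.
public sealed record EnsembleOptions(
    decimal InstructionWeight = 0.15m,
    decimal SemanticWeight = 0.25m,
    decimal DecompiledWeight = 0.35m,
    decimal MlWeight = 0.25m,
    decimal MinConfidence = 0.6m)
{
    // Returns human-readable problems; an empty list means the values are consistent.
    public IReadOnlyList<string> Validate()
    {
        var errors = new List<string>();
        decimal[] weights = { InstructionWeight, SemanticWeight, DecompiledWeight, MlWeight };

        if (Array.Exists(weights, w => w < 0m || w > 1m))
            errors.Add("Each weight must be between 0 and 1.");

        // decimal arithmetic keeps 0.15 + 0.25 + 0.35 + 0.25 exactly equal to 1.
        decimal sum = InstructionWeight + SemanticWeight + DecompiledWeight + MlWeight;
        if (sum != 1m)
            errors.Add($"Weights must sum to 1.0 but sum to {sum}.");

        if (MinConfidence < 0m || MinConfidence > 1m)
            errors.Add("min_confidence must be between 0 and 1.");

        return errors;
    }
}
```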

---

## 12. Metrics & Observability

### Metrics

| Metric | Type | Labels |
|--------|------|--------|
| `semantic_diffing_analysis_total` | Counter | backend, result |
| `semantic_diffing_latency_ms` | Histogram | backend, tier |
| `semantic_diffing_accuracy` | Gauge | comparison_type |
| `corpus_functions_total` | Gauge | library |
| `ml_inference_latency_ms` | Histogram | model |
| `ensemble_signal_weight` | Gauge | signal_type |
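
As an illustration of how the first two instruments in the table could be emitted from .NET, the sketch below uses `System.Diagnostics.Metrics`. The meter name and the wrapper class are placeholders rather than names from the codebase; only the instrument names and labels are taken from the table.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public sealed class SemanticDiffingMetricsSketch
{
    // The meter name is a placeholder; instrument names and labels come from the table above.
    private static readonly Meter DiffingMeter = new("StellaOps.BinaryIndex.SemanticDiffing");

    private readonly Counter<long> _analysisTotal =
        DiffingMeter.CreateCounter<long>("semantic_diffing_analysis_total");

    private readonly Histogram<double> _latencyMs =
        DiffingMeter.CreateHistogram<double>("semantic_diffing_latency_ms", unit: "ms");

    // Records one finished analysis with the labels listed for each metric.
    public void RecordAnalysis(string backend, string tier, string result, double elapsedMs)
    {
        _analysisTotal.Add(1,
            new KeyValuePair<string, object?>("backend", backend),
            new KeyValuePair<string, object?>("result", result));

        _latencyMs.Record(elapsedMs,
            new KeyValuePair<string, object?>("backend", backend),
            new KeyValuePair<string, object?>("tier", tier));
    }
}
```

A call such as `RecordAnalysis("b2r2", "tier1", "match", 42.0)` adds one to the counter and records one latency sample. The gauge metrics in the table would need observable instruments, which are omitted here to keep the sketch short.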

### Traces

- `semantic_diffing.analyze` - Full analysis span
- `semantic_diffing.b2r2.lift` - IR lifting
- `semantic_diffing.ghidra.decompile` - Decompilation
- `semantic_diffing.ml.inference` - ML embedding
- `semantic_diffing.ensemble.decide` - Ensemble decision
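
A minimal sketch of how two of these spans could nest, using `System.Diagnostics.ActivitySource`. The source name, class name, and `backend` tag are placeholders, not taken from the codebase; only the span names come from the list above.

```csharp
using System.Diagnostics;

public static class SemanticDiffingTracingSketch
{
    // The source name is a placeholder; span names come from the list above.
    public static readonly ActivitySource Source = new("StellaOps.BinaryIndex.SemanticDiffing");

    public static void Analyze(string backend)
    {
        // StartActivity returns null when no listener is sampling this source,
        // so every use of the activity is null-conditional.
        using var analyze = Source.StartActivity("semantic_diffing.analyze");
        analyze?.SetTag("backend", backend);

        using (Source.StartActivity("semantic_diffing.b2r2.lift"))
        {
            // IR lifting would run here; this span becomes a child of the analyze span.
        }
    }
}
```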

---

## 13. Testing Strategy

### Unit Tests

| Test Suite | Coverage |
|------------|----------|
| `IrLiftingServiceTests` | IR lifting correctness |
| `SemanticGraphExtractorTests` | Graph construction |
| `WeisfeilerLehmanHasherTests` | Hash stability |
| `AstComparisonEngineTests` | AST similarity |
| `OnnxInferenceEngineTests` | ML inference |
| `EnsembleDecisionEngineTests` | Weight combination |

### Integration Tests

| Test Suite | Coverage |
|------------|----------|
| `EndToEndSemanticDiffTests` | Full pipeline |
| `OptimizationResilienceTests` | O0 vs O2 vs O3 |
| `CompilerVariantTests` | GCC vs Clang |
| `GhidraFallbackTests` | Fallback scenarios |

### Golden Corpus Tests

Pre-computed test cases with known results:
- 100 CVE patch pairs (vulnerable -> fixed)
- 50 optimization variant sets
- 25 compiler variant sets
- 25 obfuscation variant sets

---

## 14. Roadmap

| Phase | Status | ETA | Impact |
|-------|--------|-----|--------|
| Phase 1: IR Semantics | Planned | 2026-01-24 | +15% accuracy |
| Phase 2: Corpus | Planned | 2026-02-15 | +10% coverage |
| Phase 3: Ghidra | Planned | 2026-02-28 | +5% edge cases |
| Phase 4: Decompiler/ML | Planned | 2026-03-31 | +10% obfuscation |
| **Total** | | | **+35-40%** |

---

## 15. References

### Internal

- `docs/modules/binary-index/architecture.md`
- `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.DeltaSig/`
- `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Fingerprints/`

### External

- [B2R2 Binary Analysis Framework](https://b2r2.org/)
- [Ghidra Patch Diffing Guide](https://cve-north-stars.github.io/docs/Ghidra-Patch-Diffing)
- [ghidriff Tool](https://github.com/clearbluejar/ghidriff)
- [SemDiff Paper (arXiv)](https://arxiv.org/abs/2308.01463)
- [SEI Semantic Equivalence Research](https://www.sei.cmu.edu/annual-reviews/2022-research-review/semantic-equivalence-checking-of-decompiled-binaries/)

---

*Document Version: 1.0.0*
*Last Updated: 2026-01-05*