sprints work.
docs/modules/binary-index/deltasig-v2-schema.md
@@ -0,0 +1,164 @@
# DeltaSig v2 Predicate Schema

> **Sprint**: SPRINT_20260119_004_BinaryIndex_deltasig_extensions
> **Status**: Implemented

## Overview

DeltaSig v2 extends the function-level binary diff predicate with:

- **Symbol Provenance**: Links function matches to ground-truth corpus sources (debuginfod, ddeb, buildinfo, secdb)
- **IR Diff References**: CAS-stored intermediate representation diffs for detailed analysis
- **Explicit Verdicts**: Clear vulnerability status with confidence scores
- **Function Match States**: Per-function vulnerable/patched/modified/unchanged classification

## Schema

**Predicate Type URI**: `https://stella-ops.org/predicates/deltasig/v2`

### Key Fields

| Field | Type | Description |
|-------|------|-------------|
| `schemaVersion` | string | Always `"2.0.0"` |
| `subject` | object | Single subject (PURL, digest, arch) |
| `functionMatches` | array | Function-level matches with evidence |
| `verdict` | string | `vulnerable`, `patched`, `partial`, `unknown`, `partially_patched`, `inconclusive` |
| `confidence` | number | 0.0-1.0 confidence score |
| `summary` | object | Aggregate statistics |

### Function Match

```json
{
  "functionId": "sha256:abc123...",
  "name": "ssl_handshake",
  "address": 4194304,
  "size": 256,
  "matchScore": 0.95,
  "matchMethod": "semantic_ksg",
  "matchState": "patched",
  "symbolProvenance": {
    "sourceId": "fedora-debuginfod",
    "observationId": "obs:gt:12345",
    "confidence": 0.98,
    "resolvedAt": "2026-01-19T12:00:00Z"
  },
  "irDiff": {
    "casDigest": "sha256:def456...",
    "statementsAdded": 5,
    "statementsRemoved": 3,
    "changedInstructions": 8
  }
}
```

### Summary

```json
{
  "totalFunctions": 150,
  "vulnerableFunctions": 0,
  "patchedFunctions": 12,
  "unknownFunctions": 138,
  "functionsWithProvenance": 45,
  "functionsWithIrDiff": 12,
  "avgMatchScore": 0.85,
  "minMatchScore": 0.42,
  "maxMatchScore": 0.99,
  "totalIrDiffSize": 1234
}
```
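
As a rough illustration of how the summary relates to `functionMatches`, the Python sketch below derives the same counters from a list of match objects. It is not the `DeltaSigServiceV2` implementation: the `summarize` helper is hypothetical, `unknownFunctions` is assumed to be everything that is neither vulnerable nor patched (which agrees with the example above, 150 = 0 + 12 + 138), and `totalIrDiffSize` is left out because the document does not define how it is measured.

```python
from statistics import mean


def summarize(function_matches):
    """Aggregate per-function matches into a summary-shaped dict (illustrative)."""

    def count_state(state):
        return sum(1 for m in function_matches if m.get("matchState") == state)

    scores = [m["matchScore"] for m in function_matches]
    vulnerable = count_state("vulnerable")
    patched = count_state("patched")
    return {
        "totalFunctions": len(function_matches),
        "vulnerableFunctions": vulnerable,
        "patchedFunctions": patched,
        # Assumption: "unknown" is whatever is neither vulnerable nor patched.
        "unknownFunctions": len(function_matches) - vulnerable - patched,
        "functionsWithProvenance": sum(1 for m in function_matches if m.get("symbolProvenance")),
        "functionsWithIrDiff": sum(1 for m in function_matches if m.get("irDiff")),
        "avgMatchScore": round(mean(scores), 4) if scores else 0.0,
        "minMatchScore": min(scores, default=0.0),
        "maxMatchScore": max(scores, default=0.0),
    }


# Hypothetical matches, shaped like the Function Match example above.
matches = [
    {"name": "ssl_handshake", "matchScore": 0.95, "matchState": "patched",
     "symbolProvenance": {"sourceId": "fedora-debuginfod"},
     "irDiff": {"casDigest": "sha256:def456..."}},
    {"name": "ssl_read", "matchScore": 0.42, "matchState": "unchanged"},
    {"name": "ssl_write", "matchScore": 0.99, "matchState": "vulnerable",
     "symbolProvenance": {"sourceId": "fedora-debuginfod"}},
]
summary = summarize(matches)
print(summary)
```

Only two of the three sample matches carry `symbolProvenance` and only one carries `irDiff`, so the provenance and IR-diff counters can be lower than `totalFunctions`, as in the document's own example.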

## Version Negotiation

Clients can request specific predicate versions:

```json
{
  "preferredVersion": "2",
  "requiredFeatures": ["provenance", "ir-diff"]
}
```

Response:

```json
{
  "version": "2.0.0",
  "predicateType": "https://stella-ops.org/predicates/deltasig/v2",
  "features": ["provenance", "ir-diff"]
}
```
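
The exchange above can be pictured with a small Python sketch. The `SUPPORTED` table and the `negotiate` function are hypothetical, and returning `None` when the version is unknown or a required feature is missing is an assumption; the document does not say how a failed negotiation is reported.

```python
# Offers keyed by the short version string a client sends as "preferredVersion".
# Only v2 is listed because this document specifies only the v2 predicate type.
SUPPORTED = {
    "2": {
        "version": "2.0.0",
        "predicateType": "https://stella-ops.org/predicates/deltasig/v2",
        "features": ["provenance", "ir-diff"],
    },
}


def negotiate(request):
    """Return the matching offer, or None if the request cannot be satisfied."""
    offer = SUPPORTED.get(request.get("preferredVersion"))
    if offer is None:
        return None  # assumption: unknown versions are simply not offered
    missing = set(request.get("requiredFeatures", [])) - set(offer["features"])
    if missing:
        return None  # assumption: every required feature must be available
    return dict(offer)


response = negotiate({"preferredVersion": "2", "requiredFeatures": ["provenance", "ir-diff"]})
print(response)
```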

## VEX Integration

DeltaSig v2 predicates can be converted to VEX observations via `IDeltaSigVexBridge`:

| DeltaSig Verdict | VEX Status |
|------------------|------------|
| `patched` | `fixed` |
| `vulnerable` | `affected` |
| `partially_patched` | `under_investigation` |
| `inconclusive` | `under_investigation` |
| `unknown` | `not_affected` (conservative) |
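
The table reads naturally as a lookup. The Python sketch below encodes exactly the rows listed above; it does not reproduce `IDeltaSigVexBridge`, and the `to_vex_status` helper is hypothetical. The Key Fields table also lists a `partial` verdict with no row here, so the sketch raises for it instead of guessing a status.

```python
# Verdict -> VEX status, copied row by row from the table above.
VERDICT_TO_VEX = {
    "patched": "fixed",
    "vulnerable": "affected",
    "partially_patched": "under_investigation",
    "inconclusive": "under_investigation",
    "unknown": "not_affected",
}


def to_vex_status(verdict):
    """Map a DeltaSig verdict to a VEX status; fail loudly on unmapped verdicts."""
    try:
        return VERDICT_TO_VEX[verdict]
    except KeyError:
        raise ValueError(f"no VEX mapping documented for verdict {verdict!r}") from None


print(to_vex_status("patched"))
print(to_vex_status("inconclusive"))
```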

### Evidence Blocks

VEX observations include evidence blocks:

1. **deltasig-summary**: Aggregate statistics
2. **deltasig-function-matches**: High-confidence matches with provenance
3. **deltasig-predicate-ref**: Reference to full predicate

## Implementation

### Core Services

| Interface | Implementation | Description |
|-----------|----------------|-------------|
| `IDeltaSigServiceV2` | `DeltaSigServiceV2` | V2 predicate generation |
| `ISymbolProvenanceResolver` | `GroundTruthProvenanceResolver` | Ground-truth lookup |
| `IIrDiffGenerator` | `IrDiffGenerator` | IR diff generation with CAS |
| `IDeltaSigVexBridge` | `DeltaSigVexBridge` | VEX observation generation |

### DI Registration

```csharp
services.AddDeltaSigV2();
```

Or with options:

```csharp
services.AddDeltaSigV2(
    configureProvenance: opts => opts.IncludeStale = false,
    configureIrDiff: opts => opts.MaxParallelism = 4
);
```

## Migration from v1

Use `DeltaSigPredicateConverter`:

```csharp
// v1 → v2
var v2 = DeltaSigPredicateConverter.ToV2(v1Predicate);

// v2 → v1
var v1 = DeltaSigPredicateConverter.ToV1(v2Predicate);
```

Notes:
- v1 → v2: Provenance and IR diff will be empty (add via resolver/generator)
- v2 → v1: Provenance and IR diff are discarded; verdict/confidence are lost

## JSON Schema

Full schema: [`docs/schemas/predicates/deltasig-v2.schema.json`](../../schemas/predicates/deltasig-v2.schema.json)

## Related Documentation

- [Ground-Truth Corpus](./ground-truth-corpus.md)
- [Semantic Diffing](./semantic-diffing.md)
- [Architecture](./architecture.md)

docs/modules/binary-index/ground-truth-corpus.md
@@ -0,0 +1,764 @@
# Ground-Truth Corpus Architecture

> **Ownership:** BinaryIndex Guild
> **Status:** DRAFT
> **Version:** 1.0.0
> **Related:** [BinaryIndex Architecture](architecture.md), [Corpus Management](corpus-management.md), [Concelier AOC](../concelier/guides/aggregation-only-contract.md)

---

## 1. Overview

The **Ground-Truth Corpus** system provides a validated function-matching oracle for binary diff accuracy measurement. It uses the same plugin-based ingestion pattern as Concelier (advisories) and Excititor (VEX), applying **Aggregation-Only Contract (AOC)** principles to ensure immutable, deterministic, and replayable data.

### 1.1 Problem Statement

Function matching and binary diffing require ground-truth data to measure accuracy:

1. **No oracle for validation** - How do we know a function match is correct?
2. **Symbols stripped in production** - Debug info unavailable at scan time
3. **Compiler/optimization variance** - Same source produces different binaries
4. **Backport detection gaps** - Need pre/post pairs to validate patch detection

### 1.2 Solution: Distro Symbol Corpus

Leverage mainstream Linux distro artifacts as ground-truth:

| Source | What It Provides | Use Case |
|--------|------------------|----------|
| **Debian `.buildinfo`** | Exact build env records, often clearsigned | Reproducible oracle, build env metadata |
| **Fedora Koji + debuginfod** | Machine-queryable debuginfo with IMA verification | Symbol recovery for stripped binaries |
| **Ubuntu ddebs** | Debug symbol packages | Symbol-grounded truth for function names |
| **Alpine SecDB** | Precise CVE-to-backport mappings | Pre/post pair curation |

### 1.3 Module Scope

**In Scope:**
- Symbol recovery connectors (debuginfod, ddebs, .buildinfo)
- Ground-truth observations (immutable, append-only)
- Pre/post security pair curation
- Validation harness for function-matching accuracy
- Deterministic manifests for replayability

**Out of Scope:**
- Function matching algorithms (see [semantic-diffing.md](semantic-diffing.md))
- Fingerprint generation (see [corpus-management.md](corpus-management.md))
- Policy decisions (provided by Policy Engine)

---

## 2. Architecture

### 2.1 System Context

```
┌──────────────────────────────────────────────────────────────────────────┐
│                         External Symbol Sources                          │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐         │
│  │     Fedora      │   │     Ubuntu      │   │     Debian      │         │
│  │   debuginfod    │   │      ddebs      │   │   .buildinfo    │         │
│  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘         │
│           │                     │                     │                  │
│  ┌────────┴────────┐   ┌────────┴────────┐   ┌────────┴────────┐         │
│  │  Alpine SecDB   │   │   reproduce.    │   │    Upstream     │         │
│  │                 │   │   debian.net    │   │    tarballs     │         │
│  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘         │
└───────────│─────────────────────│─────────────────────│──────────────────┘
            │                     │                     │
            v                     v                     v
┌──────────────────────────────────────────────────────────────────────────┐
│                        Ground-Truth Corpus Module                        │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                       Symbol Source Connectors                       │ │
│ │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │ │
│ │  │  Debuginfod  │  │     Ddeb     │  │  Buildinfo   │                │ │
│ │  │  Connector   │  │  Connector   │  │  Connector   │                │ │
│ │  └──────────────┘  └──────────────┘  └──────────────┘                │ │
│ │  ┌──────────────┐  ┌──────────────┐                                  │ │
│ │  │    SecDB     │  │   Upstream   │                                  │ │
│ │  │  Connector   │  │  Connector   │                                  │ │
│ │  └──────────────┘  └──────────────┘                                  │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                    │                                     │
│                                    v                                     │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                        AOC Write Guard Layer                         │ │
│ │  ┌────────────────────────────────────────────────────────────────┐  │ │
│ │  │ • No derived scores at ingest                                  │  │ │
│ │  │ • Immutable observations + supersedes chain                    │  │ │
│ │  │ • Mandatory provenance (source URL, hash, signature)          │  │ │
│ │  │ • Idempotent upserts (keyed by content hash)                  │  │ │
│ │  │ • Deterministic canonical JSON                                 │  │ │
│ │  └────────────────────────────────────────────────────────────────┘  │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                    │                                     │
│                                    v                                     │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                      Storage Layer (PostgreSQL)                      │ │
│ │                                                                      │ │
│ │  groundtruth.symbol_sources      - Registered symbol providers       │ │
│ │  groundtruth.raw_documents       - Immutable raw payloads            │ │
│ │  groundtruth.symbol_observations - Normalized symbol records         │ │
│ │  groundtruth.security_pairs      - Pre/post CVE binary pairs         │ │
│ │  groundtruth.validation_runs     - Benchmark execution records       │ │
│ │  groundtruth.match_results       - Function match outcomes           │ │
│ │  groundtruth.source_state        - Cursor/sync state per source      │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│                                    │                                     │
│                                    v                                     │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                          Validation Harness                          │ │
│ │  ┌────────────────────────────────────────────────────────────────┐  │ │
│ │  │ IValidationHarness                                             │  │ │
│ │  │   - RunValidationAsync(pairs, matcherConfig)                   │  │ │
│ │  │   - GetMetricsAsync(runId) -> MatchRate, FP/FN, Unmatched      │  │ │
│ │  │   - ExportReportAsync(runId, format) -> Markdown/HTML          │  │ │
│ │  └────────────────────────────────────────────────────────────────┘  │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
```

### 2.2 Component Breakdown

#### 2.2.1 Symbol Source Connectors

Plugin-based connectors following the Concelier `IFeedConnector` pattern:

```csharp
public interface ISymbolSourceConnector
{
    string SourceId { get; }
    string[] SupportedDistros { get; }

    // Three-phase pipeline (matches Concelier pattern)
    Task FetchAsync(IServiceProvider sp, CancellationToken ct);  // Download raw docs
    Task ParseAsync(IServiceProvider sp, CancellationToken ct);  // Normalize to DTOs
    Task MapAsync(IServiceProvider sp, CancellationToken ct);    // Build observations
}
```

**Implementations:**

| Connector | Source | Data Retrieved |
|-----------|--------|----------------|
| `DebuginfodConnector` | Fedora/RHEL debuginfod | ELF debuginfo, source files |
| `DdebConnector` | Ubuntu ddebs repos | .ddeb packages with DWARF |
| `BuildinfoConnector` | Debian .buildinfo | Build env, checksums, signatures |
| `SecDbConnector` | Alpine SecDB | CVE-to-fix mappings |
| `UpstreamConnector` | GitHub/tarballs | Upstream release sources |

#### 2.2.2 AOC Write Guard

Enforces aggregation-only invariants (mirrors `IAdvisoryObservationWriteGuard`):

```csharp
public interface ISymbolObservationWriteGuard
{
    WriteDisposition ValidateWrite(
        SymbolObservation candidate,
        string? existingContentHash);
}

public enum WriteDisposition
{
    Proceed,         // Insert new observation
    SkipIdentical,   // Idempotent re-insert, no-op
    RejectMutation   // Reject (append-only violation)
}
```

**Invariants Enforced:**

| Invariant | What It Forbids |
|-----------|-----------------|
| No derived scores | Reject `confidence`, `accuracy`, `match_score` at ingest |
| Immutable observations | No in-place updates; new revisions use `supersedes` |
| Mandatory provenance | Require `source_url`, `fetched_at`, `content_hash`, `signature_state` |
| Idempotent upserts | Key by `(source_id, debug_id, content_hash)` |
| Deterministic canonical | Sorted JSON keys, UTC ISO-8601, stable hashes |
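
To make the invariants concrete, here is a minimal Python sketch of a write guard operating on plain dictionaries. It is an illustration, not the C# `ISymbolObservationWriteGuard`: the `canonical_hash` and `validate_write` helpers are hypothetical, hashing compact sorted-key JSON with SHA-256 is one possible canonical form, and raising `ValueError` for derived scores or missing provenance is an assumption, since the document does not say how those violations are surfaced.

```python
import hashlib
import json
from enum import Enum

DERIVED_FIELDS = {"confidence", "accuracy", "match_score"}
REQUIRED_PROVENANCE = {"source_url", "fetched_at", "content_hash", "signature_state"}


class WriteDisposition(Enum):
    PROCEED = "Proceed"
    SKIP_IDENTICAL = "SkipIdentical"
    REJECT_MUTATION = "RejectMutation"


def canonical_hash(payload):
    """Hash a JSON-serializable payload using sorted keys and compact separators."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate_write(candidate, existing_content_hash):
    """Decide how to treat a candidate observation (a dict) against stored state."""
    derived = DERIVED_FIELDS & set(candidate)
    if derived:
        raise ValueError(f"derived scores are not allowed at ingest: {sorted(derived)}")
    missing = REQUIRED_PROVENANCE - set(candidate)
    if missing:
        raise ValueError(f"missing provenance fields: {sorted(missing)}")
    if existing_content_hash is None:
        return WriteDisposition.PROCEED          # nothing stored yet: insert
    if existing_content_hash == candidate["content_hash"]:
        return WriteDisposition.SKIP_IDENTICAL   # same content: idempotent no-op
    return WriteDisposition.REJECT_MUTATION      # different content under the same key


symbols = [{"name": "SSL_read", "address": "0x5678", "size": 512}]
observation = {
    "source_id": "debuginfod-fedora",
    "debug_id": "abc123",
    "symbols": symbols,
    "source_url": "https://debuginfod.fedoraproject.org/buildid/abc123/debuginfo",
    "fetched_at": "2026-01-19T10:00:00Z",
    "signature_state": "verified",
    "content_hash": canonical_hash(symbols),
}
print(validate_write(observation, None))
print(validate_write(observation, observation["content_hash"]))
```

Because keys are sorted before hashing, two payloads that differ only in key order produce the same `content_hash`, which is what lets a repeated ingest of the same document resolve to `SkipIdentical`.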

#### 2.2.3 Security Pair Curation

Manages pre/post CVE binary pairs for validation:

```csharp
public interface ISecurityPairService
{
    // Curate a pre/post pair for a CVE
    Task<SecurityPair> CreatePairAsync(
        string cveId,
        BinaryReference vulnerableBinary,
        BinaryReference patchedBinary,
        PairMetadata metadata,
        CancellationToken ct);

    // Get pairs for validation
    Task<ImmutableArray<SecurityPair>> GetPairsAsync(
        SecurityPairQuery query,
        CancellationToken ct);
}

public sealed record SecurityPair(
    string PairId,
    string CveId,
    BinaryReference VulnerableBinary,
    BinaryReference PatchedBinary,
    string[] AffectedFunctions,   // Symbol names of vulnerable functions
    string[] ChangedFunctions,    // Symbol names of patched functions
    DiffMetadata Diff,            // Upstream patch info
    ProvenanceInfo Provenance);
```

#### 2.2.4 Validation Harness

Runs function-matching validation with metrics:

```csharp
public interface IValidationHarness
{
    // Execute validation run
    Task<ValidationRun> RunAsync(
        ValidationConfig config,
        CancellationToken ct);

    // Get metrics for a run
    Task<ValidationMetrics> GetMetricsAsync(
        Guid runId,
        CancellationToken ct);

    // Export report
    Task<Stream> ExportReportAsync(
        Guid runId,
        ReportFormat format,
        CancellationToken ct);
}

public sealed record ValidationMetrics(
    int TotalFunctions,
    int CorrectMatches,
    int FalsePositives,
    int FalseNegatives,
    int Unmatched,
    decimal MatchRate,
    decimal Precision,
    decimal Recall,
    ImmutableArray<MismatchBucket> MismatchBuckets);

public sealed record MismatchBucket(
    string Cause,      // inlining, lto, optimization, pic_thunk
    int Count,
    ImmutableArray<FunctionRef> Examples);
```

---

## 3. Database Schema

### 3.1 Symbol Sources

```sql
CREATE TABLE groundtruth.symbol_sources (
    source_id       TEXT PRIMARY KEY,
    display_name    TEXT NOT NULL,
    connector_type  TEXT NOT NULL,  -- debuginfod, ddeb, buildinfo, secdb
    base_url        TEXT NOT NULL,
    enabled         BOOLEAN DEFAULT TRUE,
    config_json     JSONB,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW()
);
```

### 3.2 Raw Documents (Immutable)

```sql
CREATE TABLE groundtruth.raw_documents (
    digest              TEXT PRIMARY KEY,  -- sha256:{hex}
    source_id           TEXT NOT NULL REFERENCES groundtruth.symbol_sources(source_id),
    document_uri        TEXT NOT NULL,
    fetched_at          TIMESTAMPTZ NOT NULL,
    recorded_at         TIMESTAMPTZ DEFAULT NOW(),
    content_type        TEXT NOT NULL,
    content_size_bytes  INT,
    etag                TEXT,
    signature_state     TEXT,  -- verified, unverified, failed
    payload_json        JSONB,
    UNIQUE (source_id, document_uri, etag)
);

CREATE INDEX idx_raw_documents_source_fetched
    ON groundtruth.raw_documents(source_id, fetched_at DESC);
```

### 3.3 Symbol Observations (Immutable)

```sql
CREATE TABLE groundtruth.symbol_observations (
    observation_id      TEXT PRIMARY KEY,  -- groundtruth:{source}:{debug_id}:{revision}
    source_id           TEXT NOT NULL,
    debug_id            TEXT NOT NULL,     -- ELF build-id, PE GUID, Mach-O UUID
    code_id             TEXT,              -- GNU build-id or PE checksum

    -- Binary metadata
    binary_name         TEXT NOT NULL,
    binary_path         TEXT,
    architecture        TEXT NOT NULL,     -- x86_64, aarch64, armv7

    -- Package provenance
    distro              TEXT,              -- debian, ubuntu, fedora, alpine
    distro_version      TEXT,
    package_name        TEXT,
    package_version     TEXT,

    -- Symbols
    symbols_json        JSONB NOT NULL,    -- Array of {name, address, size, type}
    symbol_count        INT NOT NULL,

    -- Build metadata (from .buildinfo or debuginfo)
    compiler            TEXT,
    compiler_version    TEXT,
    optimization_level  TEXT,
    build_flags_json    JSONB,

    -- Provenance
    document_digest     TEXT REFERENCES groundtruth.raw_documents(digest),
    content_hash        TEXT NOT NULL,
    supersedes_id       TEXT REFERENCES groundtruth.symbol_observations(observation_id),

    created_at          TIMESTAMPTZ DEFAULT NOW(),

    UNIQUE (source_id, debug_id, content_hash)
);

CREATE INDEX idx_symbol_observations_debug_id
    ON groundtruth.symbol_observations(debug_id);
CREATE INDEX idx_symbol_observations_package
    ON groundtruth.symbol_observations(distro, package_name, package_version);
```

### 3.4 Security Pairs

```sql
CREATE TABLE groundtruth.security_pairs (
    pair_id                  TEXT PRIMARY KEY,
    cve_id                   TEXT NOT NULL,

    -- Vulnerable binary
    vuln_observation_id      TEXT NOT NULL
        REFERENCES groundtruth.symbol_observations(observation_id),
    vuln_debug_id            TEXT NOT NULL,

    -- Patched binary
    patch_observation_id     TEXT NOT NULL
        REFERENCES groundtruth.symbol_observations(observation_id),
    patch_debug_id           TEXT NOT NULL,

    -- Affected function mapping
    affected_functions_json  JSONB NOT NULL,  -- [{name, vuln_addr, patch_addr}]
    changed_functions_json   JSONB NOT NULL,

    -- Upstream diff reference
    upstream_commit          TEXT,
    upstream_patch_url       TEXT,

    -- Metadata
    distro                   TEXT NOT NULL,
    package_name             TEXT NOT NULL,

    created_at               TIMESTAMPTZ DEFAULT NOW(),
    created_by               TEXT
);

CREATE INDEX idx_security_pairs_cve
    ON groundtruth.security_pairs(cve_id);
CREATE INDEX idx_security_pairs_package
    ON groundtruth.security_pairs(distro, package_name);
```

### 3.5 Validation Runs

```sql
CREATE TABLE groundtruth.validation_runs (
    run_id              UUID PRIMARY KEY,
    config_json         JSONB NOT NULL,    -- Matcher config, thresholds
    started_at          TIMESTAMPTZ NOT NULL,
    completed_at        TIMESTAMPTZ,
    status              TEXT NOT NULL,     -- running, completed, failed

    -- Aggregate metrics
    total_functions     INT,
    correct_matches     INT,
    false_positives     INT,
    false_negatives     INT,
    unmatched           INT,
    match_rate          DECIMAL(5,4),
    precision           DECIMAL(5,4),
    recall              DECIMAL(5,4),

    -- Environment
    matcher_version     TEXT NOT NULL,
    corpus_snapshot_id  TEXT,

    created_by          TEXT
);

CREATE TABLE groundtruth.match_results (
    result_id         UUID PRIMARY KEY,
    run_id            UUID NOT NULL REFERENCES groundtruth.validation_runs(run_id),

    -- Ground truth
    pair_id           TEXT NOT NULL REFERENCES groundtruth.security_pairs(pair_id),
    function_name     TEXT NOT NULL,
    expected_match    BOOLEAN NOT NULL,

    -- Actual result
    actual_match      BOOLEAN,
    match_score       DECIMAL(5,4),
    matched_function  TEXT,

    -- Classification
    outcome           TEXT NOT NULL,  -- true_positive, false_positive, false_negative, unmatched
    mismatch_cause    TEXT,           -- inlining, lto, optimization, pic_thunk, etc.

    -- Debug info
    debug_json        JSONB
);

CREATE INDEX idx_match_results_run
    ON groundtruth.match_results(run_id);
CREATE INDEX idx_match_results_outcome
    ON groundtruth.match_results(run_id, outcome);
```

### 3.6 Source State (Cursor Tracking)

```sql
CREATE TABLE groundtruth.source_state (
    source_id        TEXT PRIMARY KEY REFERENCES groundtruth.symbol_sources(source_id),
    enabled          BOOLEAN DEFAULT TRUE,
    cursor_json      JSONB,  -- last_modified, last_id, pending_docs
    last_success_at  TIMESTAMPTZ,
    last_error       TEXT,
    backoff_until    TIMESTAMPTZ
);
```

---

## 4. Connector Specifications

### 4.1 Debuginfod Connector (Fedora/RHEL)

**Data Source:** `https://debuginfod.fedoraproject.org`

**Fetch Flow:**
1. Query debuginfod for build-id: `GET /buildid/{build_id}/debuginfo`
2. Retrieve DWARF sections (.debug_info, .debug_line)
3. Parse symbols using libdw
4. Store observation with IMA signature verification

**Configuration:**
```yaml
debuginfod:
  base_url: "https://debuginfod.fedoraproject.org"
  timeout_seconds: 30
  verify_ima: true
  cache_dir: "/var/cache/stellaops/debuginfod"
```
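
Step 1 of the fetch flow amounts to building a URL from `base_url` and a build-id. The Python sketch below shows only that step and performs no network access; the `debuginfo_url` helper is hypothetical, and rejecting non-hex build-ids is an assumption about reasonable input validation, not documented connector behaviour.

```python
def debuginfo_url(base_url, build_id):
    """Build the debuginfod request URL for a build-id (lowercase hex string)."""
    build_id = build_id.lower()
    if not build_id or any(c not in "0123456789abcdef" for c in build_id):
        raise ValueError("build-id must be a non-empty hex string")
    # Path layout taken from the fetch flow above: /buildid/{build_id}/debuginfo
    return f"{base_url.rstrip('/')}/buildid/{build_id}/debuginfo"


url = debuginfo_url("https://debuginfod.fedoraproject.org", "abc123def456")
print(url)
```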

### 4.2 Ddeb Connector (Ubuntu)

**Data Source:** `http://ddebs.ubuntu.com`

**Fetch Flow:**
1. Query Packages index for `-dbgsym` packages
2. Download `.ddeb` archive
3. Extract DWARF from `/usr/lib/debug/.build-id/`
4. Parse symbols, map to corresponding binary package

**Configuration:**
```yaml
ddeb:
  mirror_url: "http://ddebs.ubuntu.com"
  distributions: ["focal", "jammy", "noble"]
  components: ["main", "universe"]
  cache_dir: "/var/cache/stellaops/ddebs"
```

### 4.3 Buildinfo Connector (Debian)

**Data Source:** `https://buildinfos.debian.net`

**Fetch Flow:**
1. Query buildinfo index for package
2. Download `.buildinfo` file (often clearsigned)
3. Parse build environment (compiler, flags, checksums)
4. Cross-reference with snapshot.debian.org for exact binary

**Configuration:**
```yaml
buildinfo:
  index_url: "https://buildinfos.debian.net"
  snapshot_url: "https://snapshot.debian.org"
  reproducible_url: "https://reproduce.debian.net"
  verify_signature: true
```

### 4.4 SecDB Connector (Alpine)

**Data Source:** `https://github.com/alpinelinux/alpine-secdb`

**Fetch Flow:**
1. Clone/pull secdb repository
2. Parse YAML files per branch (v3.18, v3.19, edge)
3. Map CVE to fixed/unfixed package versions
4. Cross-reference with aports for patch info

**Configuration:**
```yaml
secdb:
  repo_url: "https://github.com/alpinelinux/alpine-secdb"
  branches: ["v3.18", "v3.19", "v3.20", "edge"]
  aports_url: "https://gitlab.alpinelinux.org/alpine/aports"
```
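
Step 3 of the fetch flow can be sketched as an inversion of the per-package `secfixes` map. The Python example below works on an already-parsed document, so no YAML library is needed. The layout it assumes (`packages` → `pkg` → `name` / `secfixes`, with each fixed version listing the CVEs it resolves) should be checked against the actual secdb files, and the `fixed_versions_by_cve` helper and the sample values are hypothetical.

```python
def fixed_versions_by_cve(secdb_doc):
    """Invert secfixes into {cve_id: [(package_name, fixed_version), ...]}."""
    index = {}
    for entry in secdb_doc.get("packages", []):
        pkg = entry["pkg"]
        for version, cves in pkg.get("secfixes", {}).items():
            for cve in cves or []:
                index.setdefault(cve, []).append((pkg["name"], version))
    return index


# Placeholder data in the assumed layout; not taken from a real secdb branch.
sample = {
    "packages": [
        {"pkg": {"name": "openssl",
                 "secfixes": {"3.0.11-r0": ["CVE-2024-1234"],
                              "3.0.10-r0": ["CVE-2024-0001", "CVE-2024-0002"]}}},
        {"pkg": {"name": "zlib", "secfixes": {"1.3-r1": ["CVE-2024-0002"]}}},
    ]
}
index = fixed_versions_by_cve(sample)
print(index["CVE-2024-1234"])
```

A CVE may map to several packages, which is why each entry is a list of `(package, version)` tuples rather than a single value.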

---

## 5. Validation Pipeline

### 5.1 Harness Workflow

```
1. Assemble
   └─> Given package + CVE, fetch: binaries, debuginfo, .buildinfo, upstream tarball

2. Recover Symbols
   └─> Resolve build-id → symbols via debuginfod/ddebs
   └─> Fallback: Debian rebuild from .buildinfo

3. Lift Functions
   └─> Batch-lift .text functions → IR
   └─> Cache per build-id

4. Fingerprint
   └─> Emit deterministic + fuzzy signatures
   └─> Store as JSON lines

5. Match
   └─> Pre→post function matching
   └─> Write row per function with scores

6. Score
   └─> Compute metrics (match rate, FP/FN, precision, recall)
   └─> Bucket mismatches by cause

7. Report
   └─> Markdown/HTML with tables + diffs
   └─> Attach env hashes and artifact URLs
```

### 5.2 Metrics Tracked

| Metric | Description |
|--------|-------------|
| `match_rate` | Correct matches / total functions |
| `precision` | True positives / (true positives + false positives) |
| `recall` | True positives / (true positives + false negatives) |
| `unmatched_rate` | Unmatched / total functions |
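
As a quick check of these definitions, the Python sketch below computes the four ratios from raw counts. It treats correct matches as the true positives, which is an assumption, although it reproduces the `matchRate`, `precision`, and `recall` values in the validation-run attestation example in section 6.2. The `validation_metrics` helper is hypothetical.

```python
def validation_metrics(total, correct, false_pos, false_neg, unmatched):
    """Compute the tracked ratios; correct matches are treated as true positives."""

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "match_rate": ratio(correct, total),
        "precision": ratio(correct, correct + false_pos),
        "recall": ratio(correct, correct + false_neg),
        "unmatched_rate": ratio(unmatched, total),
    }


# Counts taken from the validation-run attestation example in section 6.2.
metrics = validation_metrics(total=1500, correct=1380, false_pos=15, false_neg=45, unmatched=60)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With those counts, precision is 1380 / 1395 and recall is 1380 / 1425, which round to the 0.989 and 0.968 quoted in the attestation.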

### 5.3 Mismatch Buckets

| Cause | Description | Mitigation |
|-------|-------------|------------|
| `inlining` | Function inlined, no direct match | Inline expansion in fingerprint |
| `lto` | Link-time optimization changed structure | Cross-module fingerprints |
| `optimization` | Different -O level | Semantic fingerprints |
| `pic_thunk` | Position-independent code stubs | Filter PIC thunks |
| `versioned_symbol` | GLIBC symbol versioning | Version-aware matching |
| `renamed` | Symbol renamed (macro, alias) | Alias resolution |

---

## 6. Evidence Objects

### 6.1 Ground-Truth Attestation Predicate

```json
{
  "predicateType": "https://stella-ops.org/predicates/groundtruth/v1",
  "predicate": {
    "observationId": "groundtruth:debuginfod:abc123def456:1",
    "debugId": "abc123def456789...",
    "binaryIdentity": {
      "name": "libssl.so.3",
      "sha256": "sha256:...",
      "architecture": "x86_64"
    },
    "symbolSource": {
      "sourceId": "debuginfod-fedora",
      "fetchedAt": "2026-01-19T10:00:00Z",
      "documentUri": "https://debuginfod.fedoraproject.org/buildid/abc123/debuginfo",
      "signatureState": "verified"
    },
    "symbols": [
      {"name": "SSL_CTX_new", "address": "0x1234", "size": 256},
      {"name": "SSL_read", "address": "0x5678", "size": 512}
    ],
    "buildMetadata": {
      "compiler": "gcc",
      "compilerVersion": "12.2.0",
      "optimizationLevel": "O2",
      "buildFlags": ["-fstack-protector-strong", "-D_FORTIFY_SOURCE=2"]
    }
  }
}
```

### 6.2 Validation Run Attestation

```json
{
  "predicateType": "https://stella-ops.org/predicates/validation-run/v1",
  "predicate": {
    "runId": "550e8400-e29b-41d4-a716-446655440000",
    "config": {
      "matcherVersion": "binaryindex-semantic-diffing:1.2.0",
      "thresholds": {
        "minSimilarity": 0.85,
        "semanticWeight": 0.35,
        "instructionWeight": 0.25
      }
    },
    "corpus": {
      "snapshotId": "corpus:2026-01-19",
      "functionCount": 30000,
      "libraryCount": 5
    },
    "metrics": {
      "totalFunctions": 1500,
      "correctMatches": 1380,
      "falsePositives": 15,
      "falseNegatives": 45,
      "unmatched": 60,
      "matchRate": 0.92,
      "precision": 0.989,
      "recall": 0.968
    },
    "mismatchBuckets": [
      {"cause": "inlining", "count": 25},
      {"cause": "lto", "count": 12},
      {"cause": "optimization", "count": 8}
    ],
    "executedAt": "2026-01-19T10:30:00Z"
  }
}
```

---

## 7. CLI Commands

```bash
# Symbol source management
stella groundtruth sources list
stella groundtruth sources enable debuginfod-fedora
stella groundtruth sources sync --source debuginfod-fedora

# Symbol observation queries
stella groundtruth symbols lookup --debug-id abc123
stella groundtruth symbols search --package openssl --distro debian

# Security pair management
stella groundtruth pairs create \
  --cve CVE-2024-1234 \
  --vuln-pkg openssl=3.0.10-1 \
  --patch-pkg openssl=3.0.11-1

stella groundtruth pairs list --cve CVE-2024-1234

# Validation harness
stella groundtruth validate run \
  --pairs "openssl:CVE-2024-*" \
  --matcher semantic-diffing \
  --output validation-report.md

stella groundtruth validate metrics --run-id abc123
stella groundtruth validate export --run-id abc123 --format html
```

---

## 8. Doctor Checks

The ground-truth corpus integrates with Doctor for availability checks:

```csharp
// stellaops.doctor.binaryanalysis plugin
public sealed class BinaryAnalysisDoctorPlugin : IDoctorPlugin
{
    public string Name => "stellaops.doctor.binaryanalysis";

    public IEnumerable<IDoctorCheck> GetChecks()
    {
        yield return new DebuginfodAvailabilityCheck();
        yield return new DdebRepoEnabledCheck();
        yield return new BuildinfoCacheCheck();
        yield return new SymbolRecoveryFallbackCheck();
    }
}
```

| Check | Description | Remediation |
|-------|-------------|-------------|
| `debuginfod_urls_configured` | Verify `DEBUGINFOD_URLS` env | Set env variable |
| `ddeb_repos_enabled` | Check Ubuntu ddeb sources | Enable ddebs repo |
| `buildinfo_cache_accessible` | Validate buildinfos.debian.net | Check network/firewall |
| `symbol_recovery_fallback` | Ensure fallback path works | Configure local cache |

---

## 9. Air-Gap Support

For offline/air-gapped deployments:

### 9.1 Symbol Bundle Format

```
symbol-bundle-2026-01-19/
├── manifest.json           # Bundle metadata + checksums
├── sources/
│   ├── debuginfod/
│   │   └── *.debuginfo     # Pre-fetched debuginfo
│   ├── ddebs/
│   │   └── *.ddeb          # Pre-fetched ddebs
│   └── buildinfo/
│       └── *.buildinfo     # Pre-fetched buildinfo
├── observations/
│   └── *.ndjson            # Pre-computed observations
└── DSSE.envelope           # Signed attestation
```

### 9.2 Offline Sync

```bash
# Export bundle for air-gap transfer
stella groundtruth bundle export \
  --packages openssl,zlib,glibc \
  --distros debian,fedora \
  --output symbol-bundle.tar.gz

# Import bundle in air-gapped environment
stella groundtruth bundle import \
  --input symbol-bundle.tar.gz \
  --verify-signature
```

---

## 10. Related Documentation

- [BinaryIndex Architecture](architecture.md)
- [Semantic Diffing](semantic-diffing.md)
- [Corpus Management](corpus-management.md)
- [Concelier AOC](../concelier/guides/aggregation-only-contract.md)
- [Excititor Architecture](../excititor/architecture.md)