Golden Corpus KPI Specification
Version: 1.0.0
Last Updated: 2026-01-21
Source Advisory: Golden Corpus Patch-Paired Artifacts Advisory
This document specifies the Key Performance Indicators (KPIs) for the golden corpus of patch-paired artifacts, enabling measurement of SBOM reproducibility and binary-level patch provenance verification.
Overview
The golden corpus KPIs measure:
- Accuracy - How well the system detects patched vs. vulnerable code
- Reproducibility - Whether outputs are deterministic across runs
- Performance - Time to verify evidence offline
These metrics enable regression detection in CI and demonstrate corpus quality for auditors.
KPI Definitions
Per-Target KPIs
Computed for each artifact pair in the corpus:
| KPI | Formula | Target | Description |
|---|---|---|---|
| Per-function match rate | matched_functions_after / total_functions_post * 100 | >= 90% | Percentage of post-patch functions matched by the system |
| False-negative patch detection | missed_patched_funcs / total_true_patched_funcs * 100 | <= 5% | Percentage of known-patched functions incorrectly classified |
| SBOM canonical-hash stability | runs_with_same_hash / 3 | 3/3 | Determinism across 3 independent runs |
| Binary reconstruction equivalence | bytewise_equiv_rebuild / 1 | 1/1 (trend) | Whether rebuilt binary matches original |
Aggregate KPIs
Computed across the entire corpus:
| KPI | Formula | Target | Description |
|---|---|---|---|
| Corpus precision | TP / (TP + FP) | >= 95% | Overall precision of vulnerability detection |
| Corpus recall | TP / (TP + FN) | >= 90% | Overall recall of vulnerability detection |
| F1 score | 2 * (precision * recall) / (precision + recall) | >= 92% | Harmonic mean of precision and recall |
| Deterministic replay rate | deterministic_pairs / total_pairs | 100% | Pairs with identical results across runs |
| Verify time (median, cold) | p50(verify_time_cold) | Track trend | Cold-start offline verification time |
| Verify time (p95, cold) | p95(verify_time_cold) | Track trend | 95th percentile cold verification time |
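The aggregate formulas above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the nearest-rank percentile method is an assumption, since this specification does not state which percentile definition the verify-time KPIs use.

```python
import math


def aggregate_kpis(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Corpus precision, recall, and F1 from confusion counts (as fractions, not percent)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * (precision * recall) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile (assumed method) of a non-empty sample list."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Nearest rank: smallest 1-based rank covering p percent of the samples.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, `percentile(verify_times_ms, 50)` and `percentile(verify_times_ms, 95)` would give the two verify-time KPIs for a list of cold-start timings.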
Measurement Methodology
Function Match Rate
Input: Post-patch binary B_post, ground-truth function list F_gt
Output: Match rate percentage
1. Lift all functions in B_post to IR
2. Generate semantic fingerprints for each function
3. For each f in F_gt:
   - Find best-matching function in B_post by fingerprint similarity
   - Mark as matched if similarity >= 0.90
4. match_rate = |matched| / |F_gt| * 100
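A minimal sketch of steps 3 and 4, assuming the fingerprints have already been generated. The specification does not define the similarity measure, so the sketch takes it as a parameter; `jaccard` over feature sets is a stand-in chosen purely for illustration, and all names here are hypothetical.

```python
from typing import Callable, Iterable, Mapping, TypeVar

FP = TypeVar("FP")  # opaque fingerprint type


def function_match_rate(
    ground_truth: Mapping[str, FP],
    post_fingerprints: Iterable[FP],
    similarity: Callable[[FP, FP], float],
    threshold: float = 0.90,
) -> float:
    """Percentage of ground-truth functions whose best match meets the threshold."""
    if not ground_truth:
        raise ValueError("ground-truth function list is empty")
    candidates = list(post_fingerprints)
    matched = 0
    for fingerprint in ground_truth.values():
        # Best-matching function in B_post by fingerprint similarity.
        best = max((similarity(fingerprint, c) for c in candidates), default=0.0)
        if best >= threshold:
            matched += 1
    return matched / len(ground_truth) * 100


def jaccard(a: frozenset, b: frozenset) -> float:
    """Illustrative similarity only: overlap of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```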
False-Negative Detection
Input: Pre-patch binary B_pre, post-patch binary B_post, CVE patch metadata
Output: False-negative rate percentage
1. Identify functions modified by the CVE patch (from delta-sig)
2. For each modified function f_patched:
   - Compare fingerprint(f_pre) vs fingerprint(f_post)
   - Mark as "detected" if diff confidence >= 0.85
3. false_neg_rate = |undetected| / |f_patched| * 100
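Step 3 reduces to a count over per-function diff confidences. The sketch below assumes those confidences have already been computed for every function the CVE patch modified; the function name and the input mapping are hypothetical.

```python
from typing import Mapping


def false_negative_rate(
    diff_confidence: Mapping[str, float],
    threshold: float = 0.85,
) -> float:
    """Percentage of patched functions whose diff confidence falls below the threshold.

    diff_confidence maps each patched function name to the confidence that
    its pre-patch and post-patch fingerprints differ.
    """
    if not diff_confidence:
        raise ValueError("no patched functions supplied")
    undetected = [name for name, conf in diff_confidence.items() if conf < threshold]
    return len(undetected) / len(diff_confidence) * 100
```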
SBOM Canonical-Hash Stability
Input: Target artifact A
Output: Stability score (1, 2, or 3; the first run always matches itself)
1. For i in 1..3:
   - Spawn fresh process (no cache)
   - Generate SBOM for A
   - Compute canonical hash H_i
2. stability = count of (H_i == H_1)
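The counting step can be sketched as follows. Two assumptions are worth flagging: the canonicalization shown (sorted keys, compact separators, SHA-256) is a common choice but is not defined by this specification, and the `generate_sbom` callable is a hypothetical stand-in for whatever launches the generator. Spawning a fresh, cache-free process per run is left to that callable.

```python
import hashlib
import json
from typing import Callable


def canonical_hash(sbom: dict) -> str:
    """SHA-256 over a key-sorted, whitespace-free JSON encoding (assumed canonical form)."""
    canonical = json.dumps(
        sbom, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def hash_stability(generate_sbom: Callable[[], dict], runs: int = 3) -> int:
    """Number of runs whose canonical hash equals the first run's hash (1..runs)."""
    hashes = [canonical_hash(generate_sbom()) for _ in range(runs)]
    return sum(1 for h in hashes if h == hashes[0])
```

Because every hash is compared against the first run, a score of 1 means only that no later run reproduced the first result; it does not say whether the later runs agree with each other.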
Binary Reconstruction Equivalence
Input: Source package S, original binary B_orig
Output: Equivalence boolean
1. Rebuild S in deterministic chroot with SOURCE_DATE_EPOCH
2. Extract rebuilt binary B_rebuilt
3. equivalence = (sha256(B_orig) == sha256(B_rebuilt))
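Step 3 is a plain digest comparison. A minimal sketch using only the standard library, assuming both binaries are available as local files (the function names are hypothetical, and the rebuild itself is out of scope here):

```python
import hashlib
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def sha256_file(path: PathLike, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large binaries stay out of memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binaries_equivalent(original: PathLike, rebuilt: PathLike) -> bool:
    """True when the original and rebuilt binaries are byte-for-byte identical."""
    return sha256_file(original) == sha256_file(rebuilt)
```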
CI Regression Gates
Gate Thresholds
| Metric | Fail Threshold | Warn Threshold |
|---|---|---|
| Precision delta | < -1.0 pp | < -0.5 pp |
| Recall delta | < -1.0 pp | < -0.5 pp |
| F1 delta | < -1.0 pp | < -0.5 pp |
| False-negative rate delta | > +1.0 pp | > +0.5 pp |
| Deterministic replay | < 100% | N/A |
| TTFRP p95 delta | > +20% | > +10% |
Gate Actions
- Fail: Block merge, require investigation
- Warn: Allow merge, create tracking issue
- Pass: No action required
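One way to read the thresholds and actions together is sketched below. It interprets the precision, recall, and F1 gates as "a drop of more than 1.0 pp fails, a drop of more than 0.5 pp warns", and the false-negative and p95 gates as the mirror image for increases. All names are hypothetical and the boundary handling (strict comparisons) is an assumption.

```python
from enum import IntEnum
from typing import Iterable


class Gate(IntEnum):
    """Gate outcomes, ordered so that max() picks the most severe one."""
    PASS = 0
    WARN = 1
    FAIL = 2


def gate_higher_is_better(delta_pp: float, fail_pp: float = 1.0, warn_pp: float = 0.5) -> Gate:
    """Precision, recall, F1: a negative delta is a regression."""
    if delta_pp < -fail_pp:
        return Gate.FAIL
    if delta_pp < -warn_pp:
        return Gate.WARN
    return Gate.PASS


def gate_lower_is_better(delta: float, fail: float, warn: float) -> Gate:
    """False-negative rate (pp) or p95 time (%): a positive delta is a regression."""
    if delta > fail:
        return Gate.FAIL
    if delta > warn:
        return Gate.WARN
    return Gate.PASS


def gate_replay(rate: float) -> Gate:
    """Deterministic replay has no warn level: anything below 100% fails."""
    return Gate.FAIL if rate < 1.0 else Gate.PASS


def overall_gate(gates: Iterable[Gate]) -> Gate:
    """The merge decision follows the most severe individual gate."""
    return max(gates, default=Gate.PASS)
```

Under this reading, a run with a precision delta of -0.7 pp and every other metric unchanged would merge with a tracking issue, while any replay rate below 100% blocks the merge regardless of the other gates.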
Baseline Management
# View current baseline
stella groundtruth baseline show

# Update baseline after validated improvements
stella groundtruth baseline update \
  --results bench/results/20260121.json \
  --output bench/baselines/current.json \
  --reason "Improved semantic matching accuracy"

# Compare results against baseline
stella groundtruth validate check \
  --results bench/results/20260121.json \
  --baseline bench/baselines/current.json
Database Schema
-- KPI storage for validation runs
CREATE TABLE groundtruth.validation_kpis (
    run_id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    corpus_version TEXT NOT NULL,
    scanner_version TEXT NOT NULL,

    -- Per-run aggregates
    pair_count INT NOT NULL,
    function_match_rate_mean DECIMAL(5,2),
    function_match_rate_min DECIMAL(5,2),
    function_match_rate_max DECIMAL(5,2),
    false_negative_rate_mean DECIMAL(5,2),
    false_negative_rate_max DECIMAL(5,2),

    -- Stability metrics
    sbom_hash_stability_3of3_count INT,
    sbom_hash_stability_2of3_count INT,
    sbom_hash_stability_1of3_count INT,
    reconstruction_equiv_count INT,
    reconstruction_total_count INT,

    -- Performance metrics
    verify_time_median_ms INT,
    verify_time_p95_ms INT,
    verify_time_p99_ms INT,

    -- Computed aggregates
    precision DECIMAL(5,4),
    recall DECIMAL(5,4),
    f1_score DECIMAL(5,4),
    deterministic_replay_rate DECIMAL(5,4),
    computed_at TIMESTAMPTZ NOT NULL DEFAULT now(),

    -- Constraints
    CONSTRAINT fk_tenant FOREIGN KEY (tenant_id) REFERENCES tenants.tenant(id)
);

CREATE INDEX idx_validation_kpis_tenant_time
    ON groundtruth.validation_kpis(tenant_id, computed_at DESC);
CREATE INDEX idx_validation_kpis_corpus_version
    ON groundtruth.validation_kpis(corpus_version, computed_at DESC);
-- Baseline storage
CREATE TABLE groundtruth.kpi_baselines (
    baseline_id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    corpus_version TEXT NOT NULL,

    -- Reference metrics
    precision_baseline DECIMAL(5,4) NOT NULL,
    recall_baseline DECIMAL(5,4) NOT NULL,
    f1_baseline DECIMAL(5,4) NOT NULL,
    fn_rate_baseline DECIMAL(5,4) NOT NULL,
    verify_p95_baseline_ms INT NOT NULL,

    -- Metadata
    source_run_id UUID REFERENCES groundtruth.validation_kpis(run_id),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    created_by TEXT NOT NULL,
    reason TEXT,
    is_active BOOLEAN NOT NULL DEFAULT true
);

CREATE UNIQUE INDEX idx_kpi_baselines_active
    ON groundtruth.kpi_baselines(tenant_id, corpus_version)
    WHERE is_active = true;
Reporting
Validation Run Report (Markdown)
# Golden Corpus Validation Report
**Run ID:** bench-20260121-001
**Timestamp:** 2026-01-21T03:00:00Z
**Corpus Version:** 1.0.0
**Scanner Version:** 1.5.0
## Summary
| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Precision | 96.2% | >= 95% | PASS |
| Recall | 91.5% | >= 90% | PASS |
| F1 Score | 93.8% | >= 92% | PASS |
| False-Negative Rate | 3.2% | <= 5% | PASS |
| Deterministic Replay | 100% | 100% | PASS |
| SBOM Hash Stability | 10/10 at 3/3 | All 3/3 | PASS |
| Verify Time (p95) | 420ms | Trend | - |
## Regression Check
Compared against baseline `baseline-20260115-001`:
| Metric | Baseline | Current | Delta | Status |
|--------|----------|---------|-------|--------|
| Precision | 95.8% | 96.2% | +0.4 pp | IMPROVED |
| Recall | 91.2% | 91.5% | +0.3 pp | IMPROVED |
| Verify p95 | 450ms | 420ms | -6.7% | IMPROVED |
## Per-Package Results
| Package | Advisory | Match Rate | FN Rate | SBOM Stable | Recon Equiv |
|---------|----------|------------|---------|-------------|-------------|
| openssl | DSA-5678 | 94.2% | 2.1% | 3/3 | Yes |
| zlib | DSA-5432 | 98.1% | 0.0% | 3/3 | Yes |
| curl | DSA-5555 | 91.8% | 4.5% | 3/3 | No |
...
JSON Report Schema
{
  "$schema": "https://stellaops.io/schemas/validation-report.v1.json",
  "runId": "bench-20260121-001",
  "timestamp": "2026-01-21T03:00:00Z",
  "corpusVersion": "1.0.0",
  "scannerVersion": "1.5.0",
  "metrics": {
    "precision": 0.962,
    "recall": 0.915,
    "f1Score": 0.938,
    "falseNegativeRate": 0.032,
    "deterministicReplayRate": 1.0,
    "verifyTimeMedianMs": 280,
    "verifyTimeP95Ms": 420
  },
  "regressionCheck": {
    "baselineId": "baseline-20260115-001",
    "precisionDelta": 0.004,
    "recallDelta": 0.003,
    "status": "pass"
  },
  "packages": [
    {
      "package": "openssl",
      "advisory": "DSA-5678",
      "matchRate": 0.942,
      "falseNegativeRate": 0.021,
      "sbomHashStability": 3,
      "reconstructionEquivalent": true,
      "verifyTimeMs": 350
    }
  ]
}
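A consumer of this report might check the `metrics` object against the KPI targets defined earlier. The sketch below is one hypothetical way to do that: the field names come from the example above and the thresholds from the KPI tables, but `TARGETS` and `check_targets` are illustrative names, not part of any published API.

```python
import json
import operator

# (comparison, target) per metric field; values are fractions, as in the report.
TARGETS = {
    "precision": (operator.ge, 0.95),
    "recall": (operator.ge, 0.90),
    "f1Score": (operator.ge, 0.92),
    "falseNegativeRate": (operator.le, 0.05),
    "deterministicReplayRate": (operator.ge, 1.0),
}


def check_targets(report: dict) -> dict[str, bool]:
    """Map each targeted metric to True (target met) or False (target missed)."""
    metrics = report["metrics"]
    return {
        name: compare(metrics[name], target)
        for name, (compare, target) in TARGETS.items()
    }


def load_report(path: str) -> dict:
    """Read a validation report from a JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

The verify-time fields are deliberately left out of `TARGETS` because the specification tracks them as trends rather than against fixed thresholds.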