Golden Corpus KPI Specification

Version: 1.0.0
Last Updated: 2026-01-21
Source Advisory: Golden Corpus Patch-Paired Artifacts Advisory

This document specifies the Key Performance Indicators (KPIs) for the golden corpus of patch-paired artifacts, enabling measurement of SBOM reproducibility and binary-level patch provenance verification.


Overview

The golden corpus KPIs measure:

  1. Accuracy - How well the system detects patched vs. vulnerable code
  2. Reproducibility - Whether outputs are deterministic across runs
  3. Performance - Time to verify evidence offline

These metrics enable regression detection in CI and demonstrate corpus quality for auditors.


KPI Definitions

Per-Target KPIs

Computed for each artifact pair in the corpus:

| KPI | Formula | Target | Description |
|-----|---------|--------|-------------|
| Per-function match rate | `matched_functions_after / total_functions_post * 100` | >= 90% | Percentage of post-patch functions matched by the system |
| False-negative patch detection | `missed_patched_funcs / total_true_patched_funcs * 100` | <= 5% | Percentage of known-patched functions incorrectly classified |
| SBOM canonical-hash stability | `runs_with_same_hash / 3` | 3/3 | Determinism across 3 independent runs |
| Binary reconstruction equivalence | `bytewise_equiv_rebuild / 1` | 1/1 (trend) | Whether rebuilt binary matches original |

Aggregate KPIs

Computed across the entire corpus:

| KPI | Formula | Target | Description |
|-----|---------|--------|-------------|
| Corpus precision | `TP / (TP + FP)` | >= 95% | Overall precision of vulnerability detection |
| Corpus recall | `TP / (TP + FN)` | >= 90% | Overall recall of vulnerability detection |
| F1 score | `2 * (precision * recall) / (precision + recall)` | >= 92% | Harmonic mean of precision and recall |
| Deterministic replay rate | `deterministic_pairs / total_pairs` | 100% | Pairs with identical results across runs |
| Verify time (median, cold) | `p50(verify_time_cold)` | Track trend | Cold-start offline verification time |
| Verify time (p95, cold) | `p95(verify_time_cold)` | Track trend | 95th percentile cold verification time |
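
As a rough illustration of the precision, recall, and F1 formulas, the sketch below computes them from confusion-matrix counts and checks the stated targets. The function names and the sample counts are hypothetical, and returning 0.0 for an empty denominator is an assumption rather than documented behavior.

```python
def corpus_metrics(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Return precision, recall, and F1 as fractions in [0, 1]."""
    # Assumption: an empty denominator yields 0.0 instead of raising.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * (precision * recall) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def meets_targets(metrics: dict[str, float]) -> bool:
    """Check the aggregate targets: precision >= 95%, recall >= 90%, F1 >= 92%."""
    return (
        metrics["precision"] >= 0.95
        and metrics["recall"] >= 0.90
        and metrics["f1"] >= 0.92
    )


# Hypothetical counts, for illustration only.
good = corpus_metrics(tp=96, fp=4, fn=8)
bad = corpus_metrics(tp=80, fp=20, fn=5)
print(meets_targets(good), meets_targets(bad))
```

Because F1 is a harmonic mean, it is dominated by the weaker of the two inputs, which is why it carries its own target instead of being implied by the other two.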

Measurement Methodology

Function Match Rate

Input: Post-patch binary B_post, ground-truth function list F_gt
Output: Match rate percentage

1. Lift all functions in B_post to IR
2. Generate semantic fingerprints for each function
3. For each f in F_gt:
   - Find best-matching function in B_post by fingerprint similarity
   - Mark as matched if similarity >= 0.90
4. match_rate = |matched| / |F_gt| * 100
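
The matching loop in steps 3 and 4 can be sketched as follows. This is a minimal sketch, not the actual implementation: the fingerprint representation (a set of tokens) and the Jaccard similarity are stand-ins, since the real semantic fingerprint format and similarity measure are not specified here. Only the 0.90 threshold comes from the procedure above.

```python
from typing import Callable, Mapping

Fingerprint = frozenset  # stand-in type; the real fingerprint format is not specified


def jaccard(a: Fingerprint, b: Fingerprint) -> float:
    """Placeholder similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def function_match_rate(
    ground_truth: Mapping[str, Fingerprint],
    lifted: Mapping[str, Fingerprint],
    similarity: Callable[[Fingerprint, Fingerprint], float] = jaccard,
    threshold: float = 0.90,
) -> float:
    """Percentage of ground-truth functions whose best match meets the threshold."""
    if not ground_truth:
        return 0.0  # assumption: an empty ground-truth list scores 0
    matched = 0
    for gt_fp in ground_truth.values():
        # Best-matching function in B_post by fingerprint similarity.
        best = max((similarity(gt_fp, fp) for fp in lifted.values()), default=0.0)
        if best >= threshold:
            matched += 1
    return matched / len(ground_truth) * 100


# Hypothetical fingerprints, for illustration only.
f_gt = {"parse_header": frozenset({1, 2, 3}), "inflate": frozenset({4, 5})}
b_post = {"sub_401000": frozenset({1, 2, 3}), "sub_401200": frozenset({4, 9})}
rate = function_match_rate(f_gt, b_post)
print(rate)
```

Passing the similarity function as a parameter keeps the threshold logic testable independently of whichever fingerprinting scheme is in use.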

False-Negative Detection

Input: Pre-patch binary B_pre, post-patch binary B_post, CVE patch metadata
Output: False-negative rate percentage

1. Identify functions modified by the CVE patch (from delta-sig)
2. For each modified function f_patched:
   - Compare fingerprint(f_pre) vs fingerprint(f_post)
   - Mark as "detected" if diff confidence >= 0.85
3. false_neg_rate = |undetected| / |modified functions| * 100
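
A minimal sketch of the final two steps, assuming the diff confidence for each patch-modified function has already been computed elsewhere. The input mapping, the function names in the example, and the handling of an empty input are assumptions; the 0.85 threshold is the one stated above.

```python
from typing import Mapping


def false_negative_rate(
    diff_confidence: Mapping[str, float],
    threshold: float = 0.85,
) -> float:
    """Percentage of patch-modified functions whose change went undetected.

    diff_confidence maps each modified function name to the confidence that
    fingerprint(f_pre) and fingerprint(f_post) differ.
    """
    if not diff_confidence:
        return 0.0  # assumption: no modified functions means nothing was missed
    undetected = [name for name, conf in diff_confidence.items() if conf < threshold]
    return len(undetected) / len(diff_confidence) * 100


# Hypothetical confidences, for illustration only.
confidences = {
    "ssl_read": 0.95,
    "ssl_write": 0.85,  # exactly at the threshold counts as detected
    "ssl_peek": 0.40,   # below the threshold: a false negative
    "ssl_free": 0.99,
}
fn_rate = false_negative_rate(confidences)
print(fn_rate)
```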

SBOM Canonical-Hash Stability

Input: Target artifact A
Output: Stability score (1, 2, or 3; H_1 always matches itself)

1. For i in 1..3:
   - Spawn fresh process (no cache)
   - Generate SBOM for A
   - Compute canonical hash H_i
2. stability = count of (H_i == H_1)
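
The comparison in step 2 can be sketched as below. Two things here are assumptions: the canonical form (compact JSON with sorted keys) stands in for whatever canonicalization the SBOM tooling actually applies, and the SBOM generator is passed in as a plain callable, where a real harness would launch a fresh process per run.

```python
import hashlib
import json
from typing import Callable


def canonical_hash(sbom: dict) -> str:
    """SHA-256 over a canonical JSON encoding (assumed: sorted keys, no whitespace)."""
    canonical = json.dumps(sbom, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def hash_stability(generate_sbom: Callable[[], dict], runs: int = 3) -> int:
    """Count how many runs produced the same canonical hash as the first run."""
    hashes = [canonical_hash(generate_sbom()) for _ in range(runs)]
    return sum(1 for h in hashes if h == hashes[0])


# Key order differs between runs, but the canonical form is identical.
stable_runs = iter([
    {"name": "zlib", "version": "1.3"},
    {"version": "1.3", "name": "zlib"},
    {"name": "zlib", "version": "1.3"},
])
stable_score = hash_stability(lambda: next(stable_runs))

# A field that changes on every run breaks determinism.
counter = iter(range(3))
unstable_score = hash_stability(lambda: {"name": "zlib", "build": next(counter)})
print(stable_score, unstable_score)
```

Note that comparing every hash against the first run means a result of 1 does not distinguish "all three differ" from "runs 2 and 3 agree with each other but not with run 1".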

Binary Reconstruction Equivalence

Input: Source package S, original binary B_orig
Output: Equivalence boolean

1. Rebuild S in deterministic chroot with SOURCE_DATE_EPOCH
2. Extract rebuilt binary B_rebuilt
3. equivalence = (sha256(B_orig) == sha256(B_rebuilt))
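
Step 3 reduces to comparing two SHA-256 digests. The sketch below hashes each file in chunks so large binaries need not fit in memory. The helper names and the throwaway demo files are hypothetical, and the rebuild itself (steps 1 and 2) is out of scope.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reconstruction_equivalent(original: Path, rebuilt: Path) -> bool:
    """True when the rebuilt binary is byte-for-byte identical to the original."""
    return sha256_file(original) == sha256_file(rebuilt)


# Demo with throwaway files standing in for B_orig and B_rebuilt.
with tempfile.TemporaryDirectory() as tmp:
    b_orig = Path(tmp) / "orig.bin"
    b_same = Path(tmp) / "rebuilt_same.bin"
    b_diff = Path(tmp) / "rebuilt_diff.bin"
    b_orig.write_bytes(b"\x7fELF-demo")
    b_same.write_bytes(b"\x7fELF-demo")
    b_diff.write_bytes(b"\x7fELF-demo\x00")  # one extra byte
    equal_result = reconstruction_equivalent(b_orig, b_same)
    different_result = reconstruction_equivalent(b_orig, b_diff)
print(equal_result, different_result)
```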

CI Regression Gates

Gate Thresholds

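| Metric | Fail Threshold | Warn Threshold |
|--------|----------------|----------------|
| Precision delta | < -1.0 pp | < -0.5 pp |
| Recall delta | < -1.0 pp | < -0.5 pp |
| F1 delta | < -1.0 pp | < -0.5 pp |
| False-negative rate delta | > +1.0 pp | > +0.5 pp |
| Deterministic replay | < 100% | N/A |
| TTFRP p95 delta | > +20% | > +10% |

Deltas are current minus baseline; pp denotes percentage points. A gate trips when the metric moves in the unfavorable direction by more than the stated amount.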

Gate Actions

  • Fail: Block merge, require investigation
  • Warn: Allow merge, create tracking issue
  • Pass: No action required
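
One way to read the thresholds and actions together is sketched below: each metric is classified by how far it regressed against the baseline, and the overall result is the most severe individual outcome. This is an interpretation, not the actual gate implementation. The function and parameter names are hypothetical, deltas are taken as current minus baseline, and the last parameter stands for the p95 timing row of the threshold table.

```python
from enum import IntEnum


class Gate(IntEnum):
    """Ordered so that max() picks the most severe outcome."""
    PASS = 0
    WARN = 1
    FAIL = 2


def classify(regression: float, warn_at: float, fail_at: float) -> Gate:
    """Classify a regression magnitude (positive means worse than baseline)."""
    if regression > fail_at:
        return Gate.FAIL
    if regression > warn_at:
        return Gate.WARN
    return Gate.PASS


def evaluate_gates(
    precision_delta_pp: float,
    recall_delta_pp: float,
    f1_delta_pp: float,
    fn_rate_delta_pp: float,
    replay_rate: float,
    p95_delta_pct: float,
) -> Gate:
    """Combine all gates into a single outcome."""
    outcomes = [
        # Higher is better, so a negative delta is the regression.
        classify(-precision_delta_pp, 0.5, 1.0),
        classify(-recall_delta_pp, 0.5, 1.0),
        classify(-f1_delta_pp, 0.5, 1.0),
        # Lower is better, so a positive delta is the regression.
        classify(fn_rate_delta_pp, 0.5, 1.0),
        classify(p95_delta_pct, 10.0, 20.0),
        # Deterministic replay has no warn level: anything under 100% fails.
        Gate.PASS if replay_rate >= 1.0 else Gate.FAIL,
    ]
    return max(outcomes)


# Hypothetical deltas, for illustration only.
print(evaluate_gates(0.4, -0.7, 0.0, 0.0, 1.0, 0.0).name)
```

A CI job could then map `Gate.FAIL` to a non-zero exit code to block the merge and `Gate.WARN` to opening a tracking issue.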

Baseline Management

# View current baseline
stella groundtruth baseline show

# Update baseline after validated improvements
stella groundtruth baseline update \
  --results bench/results/20260121.json \
  --output bench/baselines/current.json \
  --reason "Improved semantic matching accuracy"

# Compare results against baseline
stella groundtruth validate check \
  --results bench/results/20260121.json \
  --baseline bench/baselines/current.json

Database Schema

-- KPI storage for validation runs
CREATE TABLE groundtruth.validation_kpis (
    run_id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    corpus_version TEXT NOT NULL,
    scanner_version TEXT NOT NULL,

    -- Per-run aggregates
    pair_count INT NOT NULL,
    function_match_rate_mean DECIMAL(5,2),
    function_match_rate_min DECIMAL(5,2),
    function_match_rate_max DECIMAL(5,2),
    false_negative_rate_mean DECIMAL(5,2),
    false_negative_rate_max DECIMAL(5,2),

    -- Stability metrics
    sbom_hash_stability_3of3_count INT,
    sbom_hash_stability_2of3_count INT,
    sbom_hash_stability_1of3_count INT,
    reconstruction_equiv_count INT,
    reconstruction_total_count INT,

    -- Performance metrics
    verify_time_median_ms INT,
    verify_time_p95_ms INT,
    verify_time_p99_ms INT,

    -- Computed aggregates
    precision DECIMAL(5,4),
    recall DECIMAL(5,4),
    f1_score DECIMAL(5,4),
    deterministic_replay_rate DECIMAL(5,4),

    computed_at TIMESTAMPTZ NOT NULL DEFAULT now(),

    -- Foreign keys
    CONSTRAINT fk_tenant FOREIGN KEY (tenant_id) REFERENCES tenants.tenant(id)
);

CREATE INDEX idx_validation_kpis_tenant_time
    ON groundtruth.validation_kpis(tenant_id, computed_at DESC);

CREATE INDEX idx_validation_kpis_corpus_version
    ON groundtruth.validation_kpis(corpus_version, computed_at DESC);

-- Baseline storage
CREATE TABLE groundtruth.kpi_baselines (
    baseline_id UUID PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    corpus_version TEXT NOT NULL,

    -- Reference metrics
    precision_baseline DECIMAL(5,4) NOT NULL,
    recall_baseline DECIMAL(5,4) NOT NULL,
    f1_baseline DECIMAL(5,4) NOT NULL,
    fn_rate_baseline DECIMAL(5,4) NOT NULL,
    verify_p95_baseline_ms INT NOT NULL,

    -- Metadata
    source_run_id UUID REFERENCES groundtruth.validation_kpis(run_id),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    created_by TEXT NOT NULL,
    reason TEXT,

    is_active BOOLEAN NOT NULL DEFAULT true
);

CREATE UNIQUE INDEX idx_kpi_baselines_active
    ON groundtruth.kpi_baselines(tenant_id, corpus_version)
    WHERE is_active = true;

Reporting

Validation Run Report (Markdown)

# Golden Corpus Validation Report

**Run ID:** bench-20260121-001
**Timestamp:** 2026-01-21T03:00:00Z
**Corpus Version:** 1.0.0
**Scanner Version:** 1.5.0

## Summary

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Precision | 96.2% | >= 95% | PASS |
| Recall | 91.5% | >= 90% | PASS |
| F1 Score | 93.8% | >= 92% | PASS |
| False-Negative Rate | 3.2% | <= 5% | PASS |
| Deterministic Replay | 100% | 100% | PASS |
| SBOM Hash Stability | 10/10 at 3/3 | All 3/3 | PASS |
| Verify Time (p95) | 420ms | Trend | - |

## Regression Check

Compared against baseline `baseline-20260115-001`:

| Metric | Baseline | Current | Delta | Status |
|--------|----------|---------|-------|--------|
| Precision | 95.8% | 96.2% | +0.4 pp | IMPROVED |
| Recall | 91.2% | 91.5% | +0.3 pp | IMPROVED |
| Verify p95 | 450ms | 420ms | -6.7% | IMPROVED |

## Per-Package Results

| Package | Advisory | Match Rate | FN Rate | SBOM Stable | Recon Equiv |
|---------|----------|------------|---------|-------------|-------------|
| openssl | DSA-5678 | 94.2% | 2.1% | 3/3 | Yes |
| zlib | DSA-5432 | 98.1% | 0.0% | 3/3 | Yes |
| curl | DSA-5555 | 91.8% | 4.5% | 3/3 | No |
...

JSON Report Schema

{
  "$schema": "https://stellaops.io/schemas/validation-report.v1.json",
  "runId": "bench-20260121-001",
  "timestamp": "2026-01-21T03:00:00Z",
  "corpusVersion": "1.0.0",
  "scannerVersion": "1.5.0",
  "metrics": {
    "precision": 0.962,
    "recall": 0.915,
    "f1Score": 0.938,
    "falseNegativeRate": 0.032,
    "deterministicReplayRate": 1.0,
    "verifyTimeMedianMs": 280,
    "verifyTimeP95Ms": 420
  },
  "regressionCheck": {
    "baselineId": "baseline-20260115-001",
    "precisionDelta": 0.004,
    "recallDelta": 0.003,
    "status": "pass"
  },
  "packages": [
    {
      "package": "openssl",
      "advisory": "DSA-5678",
      "matchRate": 0.942,
      "falseNegativeRate": 0.021,
      "sbomHashStability": 3,
      "reconstructionEquivalent": true,
      "verifyTimeMs": 350
    }
  ]
}
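
A consumer of this report might check the `metrics` object against the KPI targets along the following lines. This is a sketch under stated assumptions: the `check_report` helper and the threshold tables are hypothetical, the field names are taken from the example above, and the embedded JSON is a trimmed copy of that example.

```python
import json

# Targets expressed as fractions, matching the units used in the JSON report.
MINIMUMS = {
    "precision": 0.95,
    "recall": 0.90,
    "f1Score": 0.92,
    "deterministicReplayRate": 1.0,
}
MAXIMUMS = {"falseNegativeRate": 0.05}


def check_report(report: dict) -> dict[str, bool]:
    """Return a pass/fail flag for each targeted metric in the report."""
    metrics = report["metrics"]
    results = {name: metrics[name] >= floor for name, floor in MINIMUMS.items()}
    results.update({name: metrics[name] <= cap for name, cap in MAXIMUMS.items()})
    return results


# Trimmed copy of the example report above.
raw = """
{
  "runId": "bench-20260121-001",
  "metrics": {
    "precision": 0.962,
    "recall": 0.915,
    "f1Score": 0.938,
    "falseNegativeRate": 0.032,
    "deterministicReplayRate": 1.0,
    "verifyTimeMedianMs": 280,
    "verifyTimeP95Ms": 420
  }
}
"""
status = check_report(json.loads(raw))
print(all(status.values()))
```

The verify-time fields are deliberately left out of the check because their target is "track trend" rather than a fixed threshold.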