tests fixes and sprints work

2026-01-22 19:08:46 +02:00
parent c32fff8f86
commit 726d70dc7f
881 changed files with 134434 additions and 6228 deletions
--- a/docs/benchmarks/golden-corpus-kpis.md
+++ b/docs/benchmarks/golden-corpus-kpis.md
@@ -0,0 +1,310 @@
+# Golden Corpus KPI Specification
+
+> **Version**: 1.0.0
+> **Last Updated**: 2026-01-21
+> **Source Advisory**: Golden Corpus Patch-Paired Artifacts Advisory
+
+This document specifies the Key Performance Indicators (KPIs) for the golden corpus of patch-paired artifacts, enabling measurement of SBOM reproducibility and binary-level patch provenance verification.
+
+---
+
+## Overview
+
+The golden corpus KPIs measure:
+1. **Accuracy** - How reliably the system distinguishes patched code from vulnerable code
+2. **Reproducibility** - Whether outputs are deterministic across runs
+3. **Performance** - Time to verify evidence offline
+
+These metrics enable regression detection in CI and demonstrate corpus quality for auditors.
+
+---
+
+## KPI Definitions
+
+### Per-Target KPIs
+
+Computed for each artifact pair in the corpus:
+
+| KPI | Formula | Target | Description |
+|-----|---------|--------|-------------|
+| **Per-function match rate** | `matched_functions_after / total_functions_post * 100` | >= 90% | Percentage of post-patch functions matched by the system |
+| **False-negative patch detection** | `missed_patched_funcs / total_true_patched_funcs * 100` | <= 5% | Percentage of known-patched functions that the system fails to detect as patched |
+| **SBOM canonical-hash stability** | `runs_with_same_hash / 3` | 3/3 | Determinism across 3 independent runs |
+| **Binary reconstruction equivalence** | `bytewise_equiv_rebuild / 1` | 1/1 (trend) | Whether rebuilt binary matches original |
+
+### Aggregate KPIs
+
+Computed across the entire corpus:
+
+| KPI | Formula | Target | Description |
+|-----|---------|--------|-------------|
+| **Corpus precision** | `TP / (TP + FP)` | >= 95% | Overall precision of vulnerability detection |
+| **Corpus recall** | `TP / (TP + FN)` | >= 90% | Overall recall of vulnerability detection |
+| **F1 score** | `2 * (precision * recall) / (precision + recall)` | >= 92% | Harmonic mean of precision and recall |
+| **Deterministic replay rate** | `deterministic_pairs / total_pairs` | 100% | Pairs with identical results across runs |
+| **Verify time (median, cold)** | `p50(verify_time_cold)` | Track trend | Cold-start offline verification time |
+| **Verify time (p95, cold)** | `p95(verify_time_cold)` | Track trend | 95th percentile cold verification time |
+
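As a rough illustration of the aggregate formulas in the table above, the following Python sketch computes precision, recall, and F1 from raw counts, plus a percentile for the verify-time KPIs. The function names are hypothetical, and the nearest-rank percentile method is an assumption, since this specification does not say how p50 and p95 are computed. For reference, a precision of 0.95 combined with a recall of 0.90 gives an F1 of about 0.924, so the three targets are mutually consistent.

```python
import math


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) as fractions in [0, 1].

    A metric whose denominator is zero is reported as 0.0 (an assumption;
    the specification does not define this edge case).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```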
+---
+
+## Measurement Methodology
+
+### Function Match Rate
+
+```
+Input: Post-patch binary B_post, ground-truth function list F_gt
+Output: Match rate percentage
+
+1. Lift all functions in B_post to IR
+2. Generate semantic fingerprints for each function
+3. For each f in F_gt:
+   - Find best-matching function in B_post by fingerprint similarity
+   - Mark as matched if similarity >= 0.90
+4. match_rate = |matched| / |F_gt| * 100
+```
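The matching loop in steps 3 and 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: fingerprints are modelled as sets of features and compared with Jaccard similarity, which is only a stand-in because the specification does not define the real fingerprint format or similarity metric. The names `jaccard` and `function_match_rate` are hypothetical.

```python
from typing import Callable, Sequence

MATCH_THRESHOLD = 0.90  # similarity cut-off from step 3


def jaccard(a: frozenset, b: frozenset) -> float:
    """Stand-in similarity metric (assumption, not the system's real metric)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def function_match_rate(
    ground_truth: Sequence[frozenset],
    candidates: Sequence[frozenset],
    similarity: Callable[[frozenset, frozenset], float] = jaccard,
    threshold: float = MATCH_THRESHOLD,
) -> float:
    """Percentage of ground-truth functions whose best candidate meets the threshold."""
    if not ground_truth:
        raise ValueError("ground-truth function list is empty")
    matched = 0
    for fp_gt in ground_truth:
        # Best similarity over all functions lifted from the post-patch binary.
        best = max((similarity(fp_gt, fp) for fp in candidates), default=0.0)
        if best >= threshold:
            matched += 1
    return matched / len(ground_truth) * 100
```

The denominator here is the size of the ground-truth list, following step 4 of the procedure.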
+
+### False-Negative Detection
+
+```
+Input: Pre-patch binary B_pre, post-patch binary B_post, CVE patch metadata
+Output: False-negative rate percentage
+
+1. Identify functions modified by the CVE patch (from delta-sig)
+2. For each modified function f_patched:
+   - Compare fingerprint(f_pre) vs fingerprint(f_post)
+   - Mark as "detected" if diff confidence >= 0.85
+3. false_neg_rate = |undetected| / |modified functions| * 100
+```
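A minimal Python sketch of steps 2 and 3, assuming the diff confidence for each patch-modified function has already been computed elsewhere. The function name and the dictionary input shape are hypothetical; only the 0.85 threshold comes from the procedure above.

```python
DETECTION_THRESHOLD = 0.85  # diff-confidence cut-off from step 2


def false_negative_rate(
    diff_confidence: dict[str, float],
    threshold: float = DETECTION_THRESHOLD,
) -> float:
    """Percentage of patch-modified functions that were not detected.

    diff_confidence maps each function modified by the CVE patch to the
    confidence that its pre- and post-patch fingerprints differ.
    """
    if not diff_confidence:
        raise ValueError("no patch-modified functions supplied")
    undetected = [name for name, conf in diff_confidence.items() if conf < threshold]
    return len(undetected) / len(diff_confidence) * 100
```

A function whose confidence equals the threshold exactly counts as detected, matching the `>=` comparison in step 2.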
+
+### SBOM Canonical-Hash Stability
+
+```
+Input: Target artifact A
+Output: Stability score (1, 2, or 3; H_1 always equals itself, so the minimum is 1)
+
+1. For i in 1..3:
+   - Spawn fresh process (no cache)
+   - Generate SBOM for A
+   - Compute canonical hash H_i
+2. stability = count of (H_i == H_1)
+```
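The hashing and counting parts of this procedure can be sketched in Python as below. Two assumptions are worth flagging: the canonical form used here (JSON with sorted keys and compact separators, hashed with SHA-256) is illustrative only, because the specification does not define the canonicalization rules, and the fresh-process SBOM generation from step 1 is left out entirely. Both function names are hypothetical.

```python
import hashlib
import json


def canonical_hash(sbom: dict) -> str:
    """SHA-256 over an assumed canonical JSON encoding of the SBOM."""
    canonical = json.dumps(
        sbom, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def stability_score(hashes: list[str]) -> int:
    """Count the runs whose hash equals the first run's hash (step 2)."""
    if not hashes:
        raise ValueError("no hashes supplied")
    # hashes[0] always matches itself, so the result is at least 1.
    return sum(1 for h in hashes if h == hashes[0])
```

Key order in the input does not affect the hash under this encoding, which is the property the stability KPI is meant to exercise.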
+
+### Binary Reconstruction Equivalence
+
+```
+Input: Source package S, original binary B_orig
+Output: Equivalence boolean
+
+1. Rebuild S in deterministic chroot with SOURCE_DATE_EPOCH
+2. Extract rebuilt binary B_rebuilt
+3. equivalence = (sha256(B_orig) == sha256(B_rebuilt))
+```
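Step 3 amounts to comparing two SHA-256 digests. A minimal Python sketch, assuming both binaries are available as local files (the function names are hypothetical, and the deterministic rebuild itself is out of scope here):

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large binaries are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reconstruction_equivalent(original_path: str, rebuilt_path: str) -> bool:
    """True when the rebuilt binary is byte-for-byte identical to the original."""
    return sha256_file(original_path) == sha256_file(rebuilt_path)
```

Any single differing byte, such as an embedded build timestamp, makes the comparison fail, which is why the rebuild in step 1 pins `SOURCE_DATE_EPOCH`.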
+
+---
+
+## CI Regression Gates
+
+### Gate Thresholds
+
+| Metric | Fail Threshold | Warn Threshold |
+|--------|----------------|----------------|
+| Precision delta | < -1.0 pp | < -0.5 pp |
+| Recall delta | < -1.0 pp | < -0.5 pp |
+| F1 delta | < -1.0 pp | < -0.5 pp |
+| False-negative rate delta | > +1.0 pp | > +0.5 pp |
+| Deterministic replay | < 100% | N/A |
+| TTFRP p95 delta | > +20% | > +10% |
+
+### Gate Actions
+
+- **Fail**: Block merge, require investigation
+- **Warn**: Allow merge, create tracking issue
+- **Pass**: No action required
+
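The gate logic can be sketched in Python as follows. This is one possible reading of the thresholds, not the actual `stella` implementation: precision, recall, and F1 regress when they drop (more than 0.5 pp warns, more than 1.0 pp fails), the false-negative rate regresses when it rises by the same amounts, deterministic replay fails below 100%, and p95 verify time warns above a 10% slowdown and fails above 20%. The `Kpis` class and `evaluate_gate` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpis:
    """Aggregate KPIs as fractions in [0, 1], plus p95 verify time in ms."""
    precision: float
    recall: float
    f1: float
    fn_rate: float
    deterministic_replay_rate: float
    verify_p95_ms: float


_SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


def _grade(regression: float, warn_above: float, fail_above: float) -> str:
    """Grade a regression magnitude, where a larger positive value is worse."""
    if regression > fail_above:
        return "fail"
    if regression > warn_above:
        return "warn"
    return "pass"


def evaluate_gate(baseline: Kpis, current: Kpis) -> str:
    """Return the worst status ("pass", "warn", or "fail") across all gates."""
    if baseline.verify_p95_ms <= 0:
        raise ValueError("baseline p95 verify time must be positive")
    statuses = []
    # Precision, recall, F1: a drop in percentage points is the regression.
    for name in ("precision", "recall", "f1"):
        drop_pp = (getattr(baseline, name) - getattr(current, name)) * 100
        statuses.append(_grade(drop_pp, 0.5, 1.0))
    # False-negative rate: a rise in percentage points is the regression.
    rise_pp = (current.fn_rate - baseline.fn_rate) * 100
    statuses.append(_grade(rise_pp, 0.5, 1.0))
    # Deterministic replay has no warn level: anything below 100% fails.
    statuses.append("fail" if current.deterministic_replay_rate < 1.0 else "pass")
    # p95 verify time: relative slowdown in percent of the baseline.
    slowdown_pct = (
        (current.verify_p95_ms - baseline.verify_p95_ms) / baseline.verify_p95_ms * 100
    )
    statuses.append(_grade(slowdown_pct, 10.0, 20.0))
    return max(statuses, key=_SEVERITY.__getitem__)
```

Returning the worst status across metrics means a single failing gate blocks the merge even when every other metric improved.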
+### Baseline Management
+
+```bash
+# View current baseline
+stella groundtruth baseline show
+
+# Update baseline after validated improvements
+stella groundtruth baseline update \
+  --results bench/results/20260121.json \
+  --output bench/baselines/current.json \
+  --reason "Improved semantic matching accuracy"
+
+# Compare results against baseline
+stella groundtruth validate check \
+  --results bench/results/20260121.json \
+  --baseline bench/baselines/current.json
+```
+
+---
+
+## Database Schema
+
+```sql
+-- KPI storage for validation runs
+CREATE TABLE groundtruth.validation_kpis (
+    run_id UUID PRIMARY KEY,
+    tenant_id TEXT NOT NULL,
+    corpus_version TEXT NOT NULL,
+    scanner_version TEXT NOT NULL,
+
+    -- Per-run aggregates
+    pair_count INT NOT NULL,
+    function_match_rate_mean DECIMAL(5,2),
+    function_match_rate_min DECIMAL(5,2),
+    function_match_rate_max DECIMAL(5,2),
+    false_negative_rate_mean DECIMAL(5,2),
+    false_negative_rate_max DECIMAL(5,2),
+
+    -- Stability metrics
+    sbom_hash_stability_3of3_count INT,
+    sbom_hash_stability_2of3_count INT,
+    sbom_hash_stability_1of3_count INT,
+    reconstruction_equiv_count INT,
+    reconstruction_total_count INT,
+
+    -- Performance metrics
+    verify_time_median_ms INT,
+    verify_time_p95_ms INT,
+    verify_time_p99_ms INT,
+
+    -- Computed aggregates
+    precision DECIMAL(5,4),
+    recall DECIMAL(5,4),
+    f1_score DECIMAL(5,4),
+    deterministic_replay_rate DECIMAL(5,4),
+
+    computed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
+
+    -- Constraints
+    CONSTRAINT fk_tenant FOREIGN KEY (tenant_id) REFERENCES tenants.tenant(id)
+);
+
+CREATE INDEX idx_validation_kpis_tenant_time
+    ON groundtruth.validation_kpis(tenant_id, computed_at DESC);
+
+CREATE INDEX idx_validation_kpis_corpus_version
+    ON groundtruth.validation_kpis(corpus_version, computed_at DESC);
+
+-- Baseline storage
+CREATE TABLE groundtruth.kpi_baselines (
+    baseline_id UUID PRIMARY KEY,
+    tenant_id TEXT NOT NULL,
+    corpus_version TEXT NOT NULL,
+
+    -- Reference metrics
+    precision_baseline DECIMAL(5,4) NOT NULL,
+    recall_baseline DECIMAL(5,4) NOT NULL,
+    f1_baseline DECIMAL(5,4) NOT NULL,
+    fn_rate_baseline DECIMAL(5,4) NOT NULL,
+    verify_p95_baseline_ms INT NOT NULL,
+
+    -- Metadata
+    source_run_id UUID REFERENCES groundtruth.validation_kpis(run_id),
+    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
+    created_by TEXT NOT NULL,
+    reason TEXT,
+
+    is_active BOOLEAN NOT NULL DEFAULT true
+);
+
+CREATE UNIQUE INDEX idx_kpi_baselines_active
+    ON groundtruth.kpi_baselines(tenant_id, corpus_version)
+    WHERE is_active = true;
+```
+
+---
+
+## Reporting
+
+### Validation Run Report (Markdown)
+
+```markdown
+# Golden Corpus Validation Report
+
+**Run ID:** bench-20260121-001
+**Timestamp:** 2026-01-21T03:00:00Z
+**Corpus Version:** 1.0.0
+**Scanner Version:** 1.5.0
+
+## Summary
+
+| Metric | Value | Target | Status |
+|--------|-------|--------|--------|
+| Precision | 96.2% | >= 95% | PASS |
+| Recall | 91.5% | >= 90% | PASS |
+| F1 Score | 93.8% | >= 92% | PASS |
+| False-Negative Rate | 3.2% | <= 5% | PASS |
+| Deterministic Replay | 100% | 100% | PASS |
+| SBOM Hash Stability | 10/10 pairs at 3/3 | All 3/3 | PASS |
+| Verify Time (p95) | 420ms | Trend | - |
+
+## Regression Check
+
+Compared against baseline `baseline-20260115-001`:
+
+| Metric | Baseline | Current | Delta | Status |
+|--------|----------|---------|-------|--------|
+| Precision | 95.8% | 96.2% | +0.4 pp | IMPROVED |
+| Recall | 91.2% | 91.5% | +0.3 pp | IMPROVED |
+| Verify p95 | 450ms | 420ms | -6.7% | IMPROVED |
+
+## Per-Package Results
+
+| Package | Advisory | Match Rate | FN Rate | SBOM Stable | Recon Equiv |
+|---------|----------|------------|---------|-------------|-------------|
+| openssl | DSA-5678 | 94.2% | 2.1% | 3/3 | Yes |
+| zlib | DSA-5432 | 98.1% | 0.0% | 3/3 | Yes |
+| curl | DSA-5555 | 91.8% | 4.5% | 3/3 | No |
+...
+```
+
+### JSON Report Schema
+
+```json
+{
+  "$schema": "https://stellaops.io/schemas/validation-report.v1.json",
+  "runId": "bench-20260121-001",
+  "timestamp": "2026-01-21T03:00:00Z",
+  "corpusVersion": "1.0.0",
+  "scannerVersion": "1.5.0",
+  "metrics": {
+    "precision": 0.962,
+    "recall": 0.915,
+    "f1Score": 0.938,
+    "falseNegativeRate": 0.032,
+    "deterministicReplayRate": 1.0,
+    "verifyTimeMedianMs": 280,
+    "verifyTimeP95Ms": 420
+  },
+  "regressionCheck": {
+    "baselineId": "baseline-20260115-001",
+    "precisionDelta": 0.004,
+    "recallDelta": 0.003,
+    "status": "pass"
+  },
+  "packages": [
+    {
+      "package": "openssl",
+      "advisory": "DSA-5678",
+      "matchRate": 0.942,
+      "falseNegativeRate": 0.021,
+      "sbomHashStability": 3,
+      "reconstructionEquivalent": true,
+      "verifyTimeMs": 350
+    }
+  ]
+}
+```
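A consumer of this report may want to sanity-check that the reported F1 agrees with the reported precision and recall. The Python sketch below does this for the `metrics` object; the function name and the rounding tolerance of 0.0005 are assumptions, chosen because the sample values carry three decimal places.

```python
import json


def f1_is_consistent(metrics: dict, tolerance: float = 5e-4) -> bool:
    """Check that f1Score matches the harmonic mean of precision and recall."""
    p, r = metrics["precision"], metrics["recall"]
    expected = 2 * p * r / (p + r) if (p + r) else 0.0
    return abs(metrics["f1Score"] - expected) <= tolerance


# Values copied from the sample report above.
report = json.loads(
    '{"metrics": {"precision": 0.962, "recall": 0.915, "f1Score": 0.938}}'
)
print(f1_is_consistent(report["metrics"]))
```

For the sample values the harmonic mean is roughly 0.9379, which rounds to the reported 0.938.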
+
+---
+
+## Related Documentation
+
+- [Ground-Truth Corpus Specification](ground-truth-corpus.md)
+- [BinaryIndex Architecture](../modules/binary-index/architecture.md)
+- [Golden Corpus Seed List](golden-corpus-seed-list.md)
+- [Determinism and Reproducibility Reference](../product/advisories/14-Dec-2025%20-%20Determinism%20and%20Reproducibility%20Technical%20Reference.md)