git.stella-ops.org/docs/benchmarks/signals/bench-determinism.md
StellaOps Bot cfa2274d31 up
2025-11-27 21:09:47 +02:00

Determinism Benchmark (cross-scanner) — Stable (2025-11)

Source: advisory “23-Nov-2025 - Benchmarking Determinism in Vulnerability Scoring”. This doc captures the runnable harness pattern and expected outputs for task BENCH-DETERMINISM-401-057.

Goal

  • Measure determinism rate, order-invariance, and CVSS delta σ across scanners when fed identical SBOM+VEX inputs with frozen feeds.

Minimal harness (Python excerpt)

# run_bench.py (excerpt): deterministic JSON hashing
# SBOMS, VEXES, SCANNERS, shuffle, run, and normalize are assumed to be
# defined elsewhere in the harness.
import hashlib
import json

def canon(obj):
    # Canonical JSON: sorted keys, no insignificant whitespace.
    return json.dumps(obj, sort_keys=True, separators=(',', ':')).encode()

def shas(b):
    return hashlib.sha256(b).hexdigest()

results = []
for sbom, vex in zip(SBOMS, VEXES):
    for scanner, tmpl in SCANNERS.items():
        for mode in ("canonical", "shuffled"):
            for i in range(10):
                # Canonical runs use the inputs as-is; shuffled runs
                # reorder them afresh on every iteration.
                if mode == "shuffled":
                    sb, vx = shuffle(sbom), shuffle(vex)
                else:
                    sb, vx = sbom, vex
                out = run(tmpl.format(sbom=sb, vex=vx))
                norm = normalize(out)  # purl, vuln id, base_cvss, effective
                blob = canon({"scanner": scanner, "sbom": sbom,
                              "vex": vex, "findings": norm})
                results.append({
                    "hash": shas(blob), "mode": mode,
                    "run": i, "scanner": scanner, "sbom": sbom
                })
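
The excerpt calls a shuffle helper without showing it. A minimal sketch of one way to permute a parsed SBOM or VEX document follows; the name shuffle_json and the in-memory approach are assumptions for illustration, not the harness's actual helper, which may work on file paths instead.

```python
import random

def shuffle_json(obj, rng):
    """Return a copy of obj with dict key order and list item order permuted.

    Hypothetical helper: it assumes every list in the document is
    order-insensitive, which may not hold for every SBOM/VEX field.
    """
    if isinstance(obj, dict):
        keys = list(obj)
        rng.shuffle(keys)
        return {k: shuffle_json(obj[k], rng) for k in keys}
    if isinstance(obj, list):
        items = [shuffle_json(v, rng) for v in obj]
        rng.shuffle(items)
        return items
    return obj  # scalars are returned unchanged

# Usage: a seeded generator makes a failing permutation replayable.
doc = {"bomFormat": "CycloneDX",
       "components": [{"purl": "pkg:npm/a@1"}, {"purl": "pkg:npm/b@2"}]}
shuffled = shuffle_json(doc, random.Random(7))
```

Passing an explicit random.Random instance rather than using the module-level generator keeps the permutation reproducible from a recorded seed.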

Inputs

  • 35 SBOMs (CycloneDX 1.6 / SPDX 3.0.1) + matching VEX docs covering affected/not_affected/fixed.
  • Feeds bundle: vendor DBs (NVD, GHSA, OVAL) hashed and frozen.
  • Policy: single normalization profile (e.g., prefer vendor scores, CVSS v3.1).
  • Reachability dataset (optional combined run): tests/reachability/samples-public corpus; graphs produced via stella graph lift for each language sample; runtime traces optional.

Metrics

  • Determinism rate = identical_hash_runs / total_runs (20 runs per scanner/SBOM pair: 10 canonical + 10 shuffled).
  • Order-invariance failures: number of distinct hashes between canonical and shuffled runs.
  • CVSS delta σ vs reference; VEX stability (σ_after ≤ σ_before).
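
As a rough sketch, the metrics above could be computed from the harness's results list as shown below. Several choices here are assumptions rather than the harness's definitions: "identical" runs are counted as those matching the most common hash within each scanner/SBOM group, an order-invariance failure is counted for each hash seen in only one of the two modes, and σ is taken as the population standard deviation of score deltas. The function names are hypothetical.

```python
from collections import Counter, defaultdict
import statistics

def determinism_summary(results):
    """Summarise per-run records shaped like those appended by run_bench.py."""
    groups = defaultdict(list)
    for r in results:
        groups[(r["scanner"], r["sbom"])].append(r)

    identical = total = order_failures = 0
    for runs in groups.values():
        counts = Counter(r["hash"] for r in runs)
        # Assumption: runs matching the modal hash count as "identical".
        identical += counts.most_common(1)[0][1]
        total += len(runs)
        canonical = {r["hash"] for r in runs if r["mode"] == "canonical"}
        shuffled = {r["hash"] for r in runs if r["mode"] == "shuffled"}
        # Assumption: a hash seen in only one mode is one failure.
        order_failures += len(canonical ^ shuffled)

    rate = identical / total if total else 0.0
    return {"determinism_rate": rate,
            "order_invariance_failures": order_failures}

def cvss_delta_sigma(scores, reference):
    """Population std-dev of (score - reference) over shared vulnerability ids."""
    shared = sorted(scores.keys() & reference.keys())
    deltas = [scores[k] - reference[k] for k in shared]
    return statistics.pstdev(deltas) if deltas else 0.0
```

With 10 canonical runs hashing to one value and one of the 10 shuffled runs hashing to another, this yields a determinism rate of 19/20 and one order-invariance failure.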

Deliverables

  • Harness at src/Bench/StellaOps.Bench/Determinism (offline-friendly mock scanner included).
  • results/*.csv with per-run hashes plus summary.json determinism rate.
  • results/inputs.sha256 listing SBOM, VEX, and config hashes (deterministic ordering).
  • bench/reachability/dataset.sha256 listing reachability corpus inputs (graphs, runtime traces) when running combined bench.
  • CI target bench:determinism producing determinism% and σ per scanner; optional bench:reachability to recompute graph hash and runtime hit stability.
  • Source advisory: docs/product-advisories/23-Nov-2025 - Benchmarking Determinism in Vulnerability Scoring.md
  • Sprint task: BENCH-DETERMINISM-401-057 (SPRINT_0401_0001_0001_reachability_evidence_chain.md)
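
For illustration, a manifest such as results/inputs.sha256 could be generated along these lines. The "<hex digest>  <relative path>" line format (as emitted by sha256sum) and the function names are assumptions; the real harness may use a different layout.

```python
import hashlib
from pathlib import Path

def sha256_file(path):
    """Hash a file in chunks so large feed bundles need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Return manifest text with one line per file, sorted lexicographically."""
    root = Path(root)
    rel_paths = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*") if p.is_file()
    )
    # Sorting on the POSIX-style relative path keeps the ordering identical
    # across operating systems and directory-listing orders.
    return "".join(f"{sha256_file(root / rel)}  {rel}\n" for rel in rel_paths)
```

Sorting explicitly matters because directory traversal order is not guaranteed to be stable across filesystems, so an unsorted manifest could itself be a source of nondeterminism.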

How to run (local)

cd src/Bench/StellaOps.Bench/Determinism

# Run determinism bench (uses built-in mock scanner by default; defaults to 10 runs)
python run_bench.py --sboms inputs/sboms/*.json --vex inputs/vex/*.json \
  --config configs/scanners.json --shuffle --output results

# Reachability dataset (optional)
python run_reachability.py --graphs inputs/graphs/*.json \
  --runtime inputs/runtime/*.ndjson --output results

Outputs are written to results.csv (determinism), results-reach.csv and results-reach.json (reachability hashes), and the manifests inputs.sha256 and, for reachability runs, dataset.sha256. Feed bundle hashes are recorded in inputs.sha256 when the bundles are provided via DET_EXTRA_INPUTS.

How to run (CI)

  • Workflow .gitea/workflows/bench-determinism.yml calls scripts/bench/determinism-run.sh, which runs the harness with the bundled mock scanner and uploads out/bench-determinism/** (results, manifests, summary). Set DET_EXTRA_INPUTS to include frozen feed bundles in inputs.sha256. The optional DET_REACH_GRAPHS and DET_REACH_RUNTIME variables add reachability hashes and dataset.sha256.
  • Optional bench:reachability target (future) will replay reachability corpus, recompute graph hashes, and compare against expected dataset.sha256.
  • CI fails when determinism_rate < BENCH_DETERMINISM_THRESHOLD (defaults to 0.95; set via env in the workflow).
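
The threshold check in the last bullet can be sketched as follows. Only the variable name BENCH_DETERMINISM_THRESHOLD and its 0.95 default come from this document; the function name and the exit-code convention are assumptions about how such a gate might be wired.

```python
import os

def determinism_gate(determinism_rate, env=None):
    """Return a process exit code: 0 when the rate meets the threshold, else 1."""
    env = os.environ if env is None else env
    # Default mirrors the documented 0.95; the workflow may override it.
    threshold = float(env.get("BENCH_DETERMINISM_THRESHOLD", "0.95"))
    return 0 if determinism_rate >= threshold else 1
```

A CI script could load the rate from the summary file and pass the result to sys.exit, so that a rate below the threshold fails the job.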

Offline/air-gap workflow

  1. Place feeds bundle (see src/Bench/StellaOps.Bench/Determinism/inputs/feeds/README.md), SBOMs, VEX, and optional reachability corpus under offline/inputs/ with matching inputs.sha256 and (if reachability) dataset.sha256. A sample inputs/inputs.sha256 is provided for the bundled demo SBOM/VEX/config.
  2. Run ./offline_run.sh --inputs offline/inputs --output offline/results to execute the benches without network access. The script lives under src/Bench/StellaOps.Bench/Determinism and defaults to runs=10 and threshold=0.95, with manifest verification on. Use --no-verify to skip hash checks if manifests are absent.
  3. Store outputs plus manifests in Offline Kit; include DSSE envelope if signing is enabled (./sign_results.sh).
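
The manifest verification mentioned in step 2 might look roughly like the sketch below. It assumes sha256sum-style "<hex digest>  <relative path>" lines, which this document does not confirm, and verify_manifest is a hypothetical name rather than part of offline_run.sh.

```python
import hashlib
from pathlib import Path

def verify_manifest(manifest_path, root):
    """Return a list of (relative path, problem) pairs; empty means all good."""
    problems = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        expected, _, rel = line.partition("  ")
        target = Path(root) / rel
        if not target.is_file():
            problems.append((rel, "missing"))
        elif hashlib.sha256(target.read_bytes()).hexdigest() != expected:
            problems.append((rel, "hash mismatch"))
    return problems
```

Collecting every problem instead of stopping at the first one gives an operator in an air-gapped environment the full list of files to re-transfer in a single pass.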

Notes

  • Keep file ordering deterministic (lexicographic) when generating manifests.
  • Do not pull live feeds during bench runs; use frozen bundles only.
  • Reachability runs require a symbol manifest; if it is unavailable, mark the missing symbols and fail the run.