git.stella-ops.org/docs/benchmarks/signals/bench-determinism.md
StellaOps Bot d63af51f84
2025-11-26 20:23:28 +02:00

# Determinism Benchmark (cross-scanner) — Stable (2025-11)
Source: advisory “23-Nov-2025 - Benchmarking Determinism in Vulnerability Scoring”. This doc captures the runnable harness pattern and expected outputs for task BENCH-DETERMINISM-401-057.
## Goal
- Measure determinism rate, order-invariance, and CVSS delta σ across scanners when fed identical SBOM+VEX inputs with frozen feeds.
## Minimal harness (Python excerpt)
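```python
# run_bench.py (excerpt): deterministic JSON hashing
# SBOMS, VEXES, SCANNERS, shuffle(), run(), and normalize() are assumed to be
# defined elsewhere in run_bench.py.
import hashlib
import json

def canon(obj):
    """Serialize to canonical JSON bytes: sorted keys, no extra whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(',', ':')).encode()

def shas(b):
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(b).hexdigest()

results = []
for sbom, vex in zip(SBOMS, VEXES):
    for scanner, tmpl in SCANNERS.items():
        for mode in ("canonical", "shuffled"):
            for i in range(10):  # 10 runs x 2 modes = 20 runs per scanner/SBOM
                # Canonical mode feeds the inputs unchanged; shuffled mode reorders them.
                sb, vx = sbom, vex
                if mode == "shuffled":
                    sb, vx = shuffle(sbom), shuffle(vex)
                out = run(tmpl.format(sbom=sb, vex=vx))
                norm = normalize(out)  # purl, vuln id, base_cvss, effective
                # Hash the original input identifiers together with the normalized findings.
                blob = canon({"scanner": scanner, "sbom": sbom,
                              "vex": vex, "findings": norm})
                results.append({
                    "hash": shas(blob), "mode": mode,
                    "run": i, "scanner": scanner, "sbom": sbom
                })
```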
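The excerpt calls a `shuffle` helper without showing it. One possible shape is sketched below, assuming the inputs are parsed JSON documents and that only a few top-level arrays are safe to reorder. The name `shuffle_doc`, the `ORDER_FREE_KEYS` list, and the seed parameter are illustrative assumptions, not part of the harness.

```python
import copy
import random

# Hypothetical list of top-level arrays whose order should not affect findings.
ORDER_FREE_KEYS = ("components", "vulnerabilities", "dependencies")

def shuffle_doc(doc, seed=0, keys=ORDER_FREE_KEYS):
    """Return a copy of a parsed SBOM/VEX document with selected arrays reordered."""
    rng = random.Random(seed)   # seeded so a failing permutation can be replayed
    out = copy.deepcopy(doc)    # never mutate the caller's document
    for key in keys:
        if isinstance(out.get(key), list):
            rng.shuffle(out[key])
    return out

# Synthetic document for illustration only.
sbom = {
    "bomFormat": "CycloneDX",
    "components": [{"purl": f"pkg:pypi/demo{i}@1.0"} for i in range(8)],
}
shuffled = shuffle_doc(sbom, seed=7)
```

Seeding the permutation keeps the bench itself reproducible: if a scanner produces a different hash for a shuffled input, the same seed regenerates the exact ordering that triggered it.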
## Inputs
- 35 SBOMs (CycloneDX 1.6 / SPDX 3.0.1) + matching VEX docs covering affected/not_affected/fixed.
- Feeds bundle: vendor DBs (NVD, GHSA, OVAL) hashed and frozen.
- Policy: single normalization profile (e.g., prefer vendor scores, CVSS v3.1).
- Reachability dataset (optional combined run): `tests/reachability/samples-public` corpus; graphs produced via `stella graph lift` for each language sample; runtime traces optional.
## Metrics
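- Determinism rate = identical_hash_runs / total_runs, computed over the 20 runs per scanner/SBOM pair (10 canonical + 10 shuffled).
- Order-invariance failures: the number of distinct hashes that differ between canonical and shuffled runs of the same inputs.
- CVSS delta σ: standard deviation of each scanner's score deltas against the reference; VEX stability requires that σ after applying VEX does not exceed σ before (σ_after ≤ σ_before).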
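As a rough illustration of these metrics, the sketch below computes them from rows shaped like the harness `results` entries. It assumes that "identical" runs are those matching the most common hash in a scanner/SBOM group, and that σ is the population standard deviation of per-finding score deltas. The function names and sample data are hypothetical.

```python
import statistics
from collections import Counter, defaultdict

def determinism_metrics(results):
    """Group harness rows by (scanner, sbom) and summarize hash stability."""
    groups = defaultdict(list)
    for row in results:
        groups[(row["scanner"], row["sbom"])].append(row)
    summary = {}
    for key, rows in groups.items():
        hashes = Counter(r["hash"] for r in rows)
        # Assumption: "identical" runs are those matching the most common hash.
        identical = hashes.most_common(1)[0][1]
        canonical = {r["hash"] for r in rows if r["mode"] == "canonical"}
        shuffled = {r["hash"] for r in rows if r["mode"] == "shuffled"}
        summary[key] = {
            "determinism_rate": identical / len(rows),
            # Hashes seen only in shuffled mode: input order changed the output.
            "order_invariance_failures": len(shuffled - canonical),
        }
    return summary

def cvss_delta_sigma(scores, reference):
    """Population standard deviation of (scanner score - reference score)."""
    deltas = [scores[k] - reference[k] for k in scores.keys() & reference.keys()]
    return statistics.pstdev(deltas) if deltas else 0.0

# Synthetic rows: 10 canonical runs agree, 2 of 10 shuffled runs diverge.
rows = [{"scanner": "s1", "sbom": "a.json", "mode": "canonical",
         "run": i, "hash": "h0"} for i in range(10)]
rows += [{"scanner": "s1", "sbom": "a.json", "mode": "shuffled",
          "run": i, "hash": "h1" if i < 2 else "h0"} for i in range(10)]
metrics = determinism_metrics(rows)
sigma = cvss_delta_sigma({"VULN-1": 7.5, "VULN-2": 5.0},
                         {"VULN-1": 7.5, "VULN-2": 7.0})
print(metrics[("s1", "a.json")], sigma)
```

With this sample, 18 of 20 runs share a hash, so the pair scores 0.9 and would fall below a 0.95 threshold; the single shuffled-only hash counts as one order-invariance failure.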
## Deliverables
- `bench/determinism/` with harness, hashed inputs, and `results.csv`.
- `bench/determinism/inputs.sha256` listing SBOM, VEX, feed bundle hashes (deterministic ordering).
- `bench/reachability/dataset.sha256` listing reachability corpus inputs (graphs, runtime traces) when running combined bench.
- CI target `bench:determinism` producing determinism% and σ per scanner; optional `bench:reachability` to recompute graph hash and runtime hit stability.
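A manifest with deterministic ordering could be produced along the following lines. This is only a sketch, not the behavior of `freeze_feeds.sh`: it assumes `sha256sum`-style lines (digest, two spaces, relative path) sorted lexicographically by POSIX-style path, and the function name `write_manifest` is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def write_manifest(root, manifest_name="inputs.sha256"):
    """Write a sha256sum-style manifest for every file under root, sorted by path."""
    root = Path(root)
    # Sort on the POSIX-style relative path so the listing is stable across platforms.
    files = sorted(
        (p for p in root.rglob("*") if p.is_file() and p.name != manifest_name),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    text = "".join(
        f"{hashlib.sha256(p.read_bytes()).hexdigest()}  {p.relative_to(root).as_posix()}\n"
        for p in files
    )
    # Write bytes directly to avoid platform-specific newline translation.
    (root / manifest_name).write_bytes(text.encode("utf-8"))
    return text

# Demonstration on a throwaway directory with two placeholder inputs.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "vex").mkdir()
    (Path(tmp) / "vex" / "b.json").write_text("{}")
    (Path(tmp) / "a.json").write_text("{}")
    manifest = write_manifest(tmp)
print(manifest)
```

Sorting on the relative path rather than relying on directory iteration order is what makes two runs over the same inputs produce byte-identical manifests.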
## Links
- Source advisory: `docs/product-advisories/23-Nov-2025 - Benchmarking Determinism in Vulnerability Scoring.md`
- Sprint task: BENCH-DETERMINISM-401-057 (SPRINT_0401_0001_0001_reachability_evidence_chain.md)
---
## How to run (local)
```sh
cd bench/determinism
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Freeze feeds and policy hashes
./freeze_feeds.sh ../feeds/bundle.tar.gz > inputs.sha256
# Run determinism bench
python run_bench.py --sboms inputs/sboms/*.json --vex inputs/vex/*.json \
--scanners configs/scanners.yaml --runs 20 --shuffle
# Reachability dataset (optional)
python run_reachability.py --graphs ../reachability/graphs/*.json \
--runtime ../reachability/runtime/*.ndjson.gz --output results-reach.csv
```
Outputs are written to `results.csv` (determinism) and `results-reach.csv` (reachability stability) plus SHA manifests.
## How to run (CI)
- The `bench:determinism` CI target (see `.gitea/workflows/bench-determinism.yml`) runs the harness with frozen feeds and uploads `results.csv` + `inputs.sha256` as artifacts.
- The optional `bench:reachability` target replays the reachability corpus, recomputes graph hashes, and compares them against the expected `dataset.sha256`.
- CI must fail if the determinism rate is below 0.95 or if any graph hash does not match.
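The determinism threshold could be enforced with a small gate script such as the sketch below. It assumes `results.csv` carries the same columns as the harness rows (`scanner`, `sbom`, `mode`, `run`, `hash`) and that the 0.95 threshold applies to each scanner/SBOM pair; both are assumptions, and `gate` is a hypothetical name.

```python
import csv
import io
from collections import Counter, defaultdict

THRESHOLD = 0.95  # threshold taken from the CI rule above

def gate(csv_file, threshold=THRESHOLD):
    """Return (exit_code, failing), where failing maps (scanner, sbom) to its rate."""
    hashes = defaultdict(Counter)
    for row in csv.DictReader(csv_file):
        hashes[(row["scanner"], row["sbom"])][row["hash"]] += 1
    failing = {}
    for key, counts in hashes.items():
        # Share of runs that produced the most common hash for this pair.
        rate = counts.most_common(1)[0][1] / sum(counts.values())
        if rate < threshold:
            failing[key] = rate
    return (1 if failing else 0), failing

# Synthetic CSV: scanner s1 is fully stable, s2 diverges in 2 of 20 runs.
sample = "scanner,sbom,mode,run,hash\n"
sample += "".join(f"s1,a.json,canonical,{i},h0\n" for i in range(20))
sample += "".join(f"s2,a.json,canonical,{i},{'h1' if i < 2 else 'h0'}\n" for i in range(20))
code, failing = gate(io.StringIO(sample))
print(code, failing)
```

In a CI job the script would open the real `results.csv` and pass `code` to `sys.exit`, so a non-zero status fails the pipeline.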
## Offline/air-gap workflow
1. Place feeds bundle, SBOMs, VEX, and reachability corpus under `offline/inputs/` with matching `inputs.sha256` and `dataset.sha256`.
2. Run `./offline_run.sh --inputs offline/inputs --outputs offline/results` to execute both benches without network.
3. Verify hashes: `sha256sum -c offline/inputs/inputs.sha256` and `sha256sum -c offline/inputs/dataset.sha256`.
4. Store outputs plus manifests in Offline Kit; include DSSE envelope if signing is enabled (`./sign_results.sh`).
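Where `sha256sum` is not available on the air-gapped host, the verification in step 3 can be approximated in Python. The sketch below assumes text-mode manifest lines (digest, two spaces, path) with paths relative to the manifest's directory; `verify_manifest` is a hypothetical helper, not part of the bench tooling.

```python
import hashlib
import tempfile
from pathlib import Path

def verify_manifest(manifest_path):
    """Return a list of (relative_path, problem) for entries that fail verification."""
    manifest_path = Path(manifest_path)
    base = manifest_path.parent  # assumption: entries are relative to the manifest
    problems = []
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition("  ")
        target = base / name
        if not target.is_file():
            problems.append((name, "missing"))
        elif hashlib.sha256(target.read_bytes()).hexdigest() != expected:
            problems.append((name, "hash mismatch"))
    return problems

# Demonstration: a clean check, then the same check after tampering.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "sbom.json"
    data.write_text("{}")
    digest = hashlib.sha256(b"{}").hexdigest()
    manifest_file = Path(tmp) / "inputs.sha256"
    manifest_file.write_text(f"{digest}  sbom.json\n")
    clean = verify_manifest(manifest_file)
    data.write_text('{"tampered": true}')
    tampered = verify_manifest(manifest_file)
print(clean, tampered)
```

An empty list means every listed input matches its recorded digest; any entry in the list should block the offline run.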
## Notes
- Keep file ordering deterministic (lexicographic) when generating manifests.
- Do not pull live feeds during bench runs; use frozen bundles only.
- Reachability runs require a symbol manifest; if it is unavailable, mark the missing symbols and fail the run.