Reachability Benchmark Gaps (G1–G12, RD1–RD10, RB1–RB10): Remediation

Date: 2025-12-03
Status: IMPLEMENTED

This note closes BENCH-GAPS-513-018, DATASET-GAPS-513-019, and REACH-FIXTURE-GAPS-513-020 by defining manifest/schema updates, verification tooling, and operational guardrails.

What changed

  • Benchmark kit manifest + schema: benchmark/schemas/benchmark-manifest.schema.json with signed/hashed entries for cases, truth, baselines, schemas, and tools. Sample at benchmark/manifest.sample.json.
  • Offline verifier: tools/verify_manifest.py validates the manifest against local files (hashes, required entries, DSSE envelope presence) to keep runs deterministic and tamper-evident.
  • Coverage/trace schemas: schemas/coverage.schema.json and schemas/trace.schema.json govern oracle outputs referenced by manifest hashes.
  • Submission provenance checks: manifest requires SHA-256 for submission schema, scorer package, and each baseline submission; DSSE path optional but encouraged.
  • Determinism env templates: manifest captures sourceDateEpoch and per-tool pinned versions; cases must provide build seeds in case metadata.
  • Unreachability oracles: truth files must include explicit rationale for unreachable cases; manifest enforces presence of truth artifact per case.
  • Sandbox/redaction guidance: case metadata must declare sandbox and redaction policy fields (schema updated) to ensure PII removal and constrained execution.
  • Resource normalization: manifest records build/runtime resource limits (cpu/memory) for repeatable benchmarking.
  • Offline kit & checklist: dataset safety checklist at benchmark/checklists/dataset-safety.md; deterministic packaging via tools/package_offline_kit.sh.
  • Frozen baselines: Semgrep rulepack hash pinned at baselines/semgrep/rules.sha256; manifest supports hashed baseline submissions.
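The hash checks described in the list above can be illustrated with a minimal sketch. It is not the code in tools/verify_manifest.py: the function names and the entry keys `path` and `sha256` are assumptions made for illustration, and the real manifest schema may name or nest them differently.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_entries(entries: list[dict], root: Path) -> list[str]:
    """Compare each entry's recorded hash with the file found under root.

    The keys `path` and `sha256` are hypothetical, not taken from the
    actual benchmark manifest schema.
    """
    problems = []
    for entry in entries:
        target = root / entry["path"]
        if not target.is_file():
            problems.append(f"missing: {entry['path']}")
        elif sha256_file(target) != entry["sha256"].lower():
            problems.append(f"hash mismatch: {entry['path']}")
    return problems
```

An empty result means every listed artifact exists and matches its recorded digest; a caller would typically exit non-zero otherwise.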

How to use

python tools/verify_manifest.py benchmark/manifest.sample.json --root benchmark
  • The verifier fails on any hash mismatch, missing artifact, or schema violation.
  • The optional --pubkey flag additionally verifies DSSE envelopes against the supplied public key.
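To make the DSSE checks more concrete, here is a rough sketch of a structural "envelope presence" check and of the bytes a DSSE signature covers. The envelope fields (`payload`, `payloadType`, `signatures`) and the pre-authentication encoding follow the DSSE specification; the function names are hypothetical, the verifier's actual behaviour may differ, and real signature verification with a public key is deliberately left out.

```python
import base64


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the byte string that gets signed."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (
        len(type_bytes), type_bytes, len(payload), payload,
    )


def envelope_problems(envelope: dict) -> list[str]:
    """Structural check only; this does NOT verify any signature."""
    problems = [
        f"missing field: {key}"
        for key in ("payload", "payloadType", "signatures")
        if key not in envelope
    ]
    if problems:
        return problems
    try:
        # Simplification: only standard base64 is accepted here, although
        # DSSE also allows the URL-safe alphabet.
        base64.b64decode(envelope["payload"], validate=True)
    except (ValueError, TypeError):
        problems.append("payload is not valid base64")
    if not envelope["signatures"]:
        problems.append("no signatures present")
    return problems
```

A verifier with a public key would decode the payload, rebuild `pae(payloadType, payload)`, and check each signature over those bytes using a cryptography library of its choice.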

Gap mapping (summary)

  • G1–G12 (benchmark gaps): addressed via manifest schema fields (attestations, submission provenance, determinism templates, coverage/trace schema refs), offline verifier, and required resource/sandbox metadata.
  • RD1–RD10 (dataset gaps): lockfile-style manifest with hashes for SBOMs, datasets, truth, binaries; licensing/PII redaction captured via redaction.policy; semantic version + changelog required.
  • RB1–RB10 (fixtures gaps): per-case truth + evidence entries mandatory; manifest enforces presence and hashes; DSSE optional but recorded; coverage/trace schema references included.
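The per-case requirements summarised above (sandbox and redaction policy declared, a truth artifact present, a rationale for unreachable cases) could be checked along the following lines. This is only a sketch: apart from `redaction.policy`, which this note names, every key (`sandbox`, `truth`, `reachable`, `rationale`) is a hypothetical stand-in for whatever the case metadata schema actually defines.

```python
def case_problems(case: dict) -> list[str]:
    """Return the guardrail violations found in one case's metadata.

    All keys except `redaction.policy` are assumptions made for
    illustration, not names from the real schema.
    """
    problems = []
    if not case.get("sandbox"):
        problems.append("sandbox policy not declared")
    if not (case.get("redaction") or {}).get("policy"):
        problems.append("redaction.policy not declared")

    truth = case.get("truth")
    if not truth:
        problems.append("truth artifact missing")
        return problems

    # Unreachable verdicts must carry an explicit, non-empty rationale.
    rationale = str(truth.get("rationale", "")).strip()
    if truth.get("reachable") is False and not rationale:
        problems.append("unreachable case lacks a rationale")
    return problems
```

Returning a list of problems rather than raising on the first one lets a checker report every violation in a case at once.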

Follow-ups

  • When new cases land, regenerate manifest and rerun tools/verify_manifest.py in CI.
  • For production releases, sign the manifest with a DSSE envelope and populate signatures[] accordingly.
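Regenerating the manifest in CI is only useful if the output is byte-for-byte reproducible. The sketch below shows one way to get that property: sorted relative paths, a fixed JSON layout, and a timestamp taken from the conventional SOURCE_DATE_EPOCH environment variable. The function names and the `entries`, `path`, and `sha256` keys are assumptions; only `sourceDateEpoch` is a field this note mentions, and the real manifest carries more than is shown here.

```python
import hashlib
import json
import os
from pathlib import Path


def build_entries(root: Path) -> list[dict]:
    """Hash every file under root, ordered by POSIX-style relative path."""
    files = sorted(
        (p.relative_to(root).as_posix(), p)
        for p in root.rglob("*")
        if p.is_file()
    )
    return [
        {"path": rel, "sha256": hashlib.sha256(p.read_bytes()).hexdigest()}
        for rel, p in files
    ]


def render_manifest(root: Path) -> str:
    """Serialise a minimal manifest with stable key order and layout."""
    manifest = {
        # Falls back to 0 when SOURCE_DATE_EPOCH is unset (an assumption).
        "sourceDateEpoch": int(os.environ.get("SOURCE_DATE_EPOCH", "0")),
        "entries": build_entries(root),
    }
    return json.dumps(manifest, indent=2, sort_keys=True) + "\n"
```

Two runs over the same tree with the same SOURCE_DATE_EPOCH should yield identical text, so a CI job can diff the regenerated manifest against the committed one before running the verifier.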