Reachability Benchmark Launch (BENCH-LAUNCH-513-017)
Audience
- Security engineering and platform teams evaluating reachability analysis tools.
- Benchmark participants (vendors, OSS maintainers) who need deterministic scoring.
Positioning
- Deterministic by default: fixed seeds, SOURCE_DATE_EPOCH builds, sorted outputs.
- Offline ready: no registry pulls or telemetry; baselines run without network.
- Explainable: truth sets include static/dynamic evidence; scorer rewards path + guards.
- Vendor-neutral: Semgrep / CodeQL / Stella baselines provided for comparison.
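The determinism properties above can be illustrated with a minimal sketch. It is not the benchmark's own tooling: `build_timestamp`, `deterministic_report`, and the report fields are hypothetical names chosen for illustration. It assumes only that a tool wants byte-identical output for identical inputs by honoring `SOURCE_DATE_EPOCH`, using a fixed seed, and sorting everything it emits.

```python
import json
import os
import random
import time


def build_timestamp() -> int:
    """Use SOURCE_DATE_EPOCH when set; otherwise fall back to the wall clock."""
    value = os.environ.get("SOURCE_DATE_EPOCH")
    return int(value) if value is not None else int(time.time())


def deterministic_report(findings, seed=0):
    """Serialize findings so identical inputs yield identical bytes.

    The field names here are hypothetical, not the benchmark's schema.
    """
    ordered = sorted(findings)   # input order must not affect the output
    rng = random.Random(seed)    # local, fixed-seed RNG; global state untouched
    sample = rng.sample(ordered, k=min(2, len(ordered)))
    report = {
        "generated_at": build_timestamp(),
        "findings": ordered,
        "spot_check": sorted(sample),
    }
    # sort_keys plus fixed separators keep the byte layout stable.
    return json.dumps(report, sort_keys=True, separators=(",", ":"))
```

With `SOURCE_DATE_EPOCH` exported, two runs over the same findings in any order should produce the same string, which makes a simple hash comparison a usable reproducibility check.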
What’s included
- Cases across JS, Python, C (Java pending JDK availability).
- Schemas for cases, entrypoints, truth, and submissions.
- Baselines: Semgrep, CodeQL, Stella (offline).
- Tooling: scorer (rb-score), leaderboard (rb-compare), deterministic CI script (ci/run-ci.sh).
- Static site (website/) for quick start + leaderboard view.
How to try it
# Build and validate
python tools/build/build_all.py --cases cases
python tools/validate.py --schemas schemas
# Run baselines (offline)
bash baselines/semgrep/run_all.sh cases /tmp/semgrep
bash baselines/stella/run_all.sh cases /tmp/stella
bash baselines/codeql/run_all.sh cases /tmp/codeql
# Score your submission
tools/scorer/rb_score.py --truth benchmark/truth/<aggregate>.json --submission submission.json --format json
Key dates
- 2025-12-01: Public beta (v1.0.0 schemas, JS/PY/C cases, offline baselines).
- 2025-12-15 (target): Add Java track once a JDK is available in CI.
- Quarterly: hidden set rotation + leaderboard refresh.
Calls to action
- Vendors: submit offline-reproducible submission.json for inclusion on the public leaderboard.
- Practitioners: run baselines locally to benchmark internal pipelines.
- OSS: propose new cases via PR; follow the determinism checklist in docs/submission-guide.md.
Risks & mitigations
- Java track blocked (JDK): provide a runner with JDK>=17; until then Java is excluded from CI.
- Hidden set leakage: governed by the rotation policy in docs/governance.md; no public release of hidden cases.
- Telemetry drift: all runner scripts disable telemetry via environment variables; reviewers verify no network calls.
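One way a reviewer might check the "no network calls" property for Python code is sketched below. This is an assumption-laden illustration, not the benchmark's actual review procedure: `no_network` and `NetworkAccessError` are hypothetical names, and the guard only covers `socket.connect` calls made inside the same Python process. External tools launched as subprocesses would need OS-level isolation instead (for example, a network-less container or namespace).

```python
import socket
from contextlib import contextmanager


class NetworkAccessError(RuntimeError):
    """Raised when guarded code tries to open a network connection."""


@contextmanager
def no_network():
    """Temporarily make every socket.connect() call in this process fail."""
    original_connect = socket.socket.connect

    def blocked_connect(self, address):
        raise NetworkAccessError(f"network call attempted: {address!r}")

    socket.socket.connect = blocked_connect
    try:
        yield
    finally:
        # Always restore the real method, even if the guarded code raised.
        socket.socket.connect = original_connect
```

Wrapping an in-process run in `with no_network():` turns any attempted connection into an immediate, visible failure rather than a silent outbound request.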