git.stella-ops.org/docs/marketing/reachability-benchmark-launch.md
StellaOps Bot 909d9b6220
2025-12-01 21:16:22 +02:00


Reachability Benchmark Launch (BENCH-LAUNCH-513-017)

Audience

  • Security engineering and platform teams evaluating reachability analysis tools.
  • Benchmark participants (vendors, OSS maintainers) who need deterministic scoring.

Positioning

  • Deterministic by default: fixed seeds, SOURCE_DATE_EPOCH builds, sorted outputs.
  • Offline ready: no registry pulls or telemetry; baselines run without network.
  • Explainable: truth sets include static/dynamic evidence; scorer rewards path + guards.
  • Vendor-neutral: Semgrep / CodeQL / Stella baselines provided for comparison.
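
As a rough illustration of the three determinism levers named above (fixed seeds, SOURCE_DATE_EPOCH, sorted outputs), here is a minimal Python sketch. The function name `build_report`, the `case` and `sink` fields, and the default seed are hypothetical and are not taken from the benchmark's tooling or schemas.

```python
import json
import os
import random


def build_report(findings, seed=0):
    """Return a byte-stable JSON string for a list of finding dicts.

    Hypothetical helper: the field names below are illustrative only.
    """
    # Sorted outputs: order findings by a stable key before anything else.
    ordered = sorted(findings, key=lambda f: (f["case"], f["sink"]))
    # Fixed seed: any sampling is repeatable from run to run.
    rng = random.Random(seed)
    spot_check = rng.sample(ordered, k=min(2, len(ordered)))
    # SOURCE_DATE_EPOCH: read the timestamp from the environment, not the clock.
    timestamp = int(os.environ.get("SOURCE_DATE_EPOCH", "0"))
    report = {
        "generated_at": timestamp,
        "findings": ordered,
        "spot_check": [f["case"] for f in spot_check],
    }
    # Sorted keys and fixed separators keep the serialised bytes identical.
    return json.dumps(report, sort_keys=True, separators=(",", ":"))
```

With the same seed and the same SOURCE_DATE_EPOCH, two calls return identical strings even if the input list arrives in a different order.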

What's included

  • Cases across JS, Python, C (Java pending JDK availability).
  • Schemas for cases, entrypoints, truth, and submissions.
  • Baselines: Semgrep, CodeQL, Stella (offline).
  • Tooling: scorer (rb-score), leaderboard (rb-compare), deterministic CI script (ci/run-ci.sh).
  • Static site (website/) for quick start + leaderboard view.

How to try it

# Build and validate
python tools/build/build_all.py --cases cases
python tools/validate.py --schemas schemas

# Run baselines (offline)
bash baselines/semgrep/run_all.sh cases /tmp/semgrep
bash baselines/stella/run_all.sh cases /tmp/stella
bash baselines/codeql/run_all.sh cases /tmp/codeql

# Score your submission
tools/scorer/rb_score.py --truth benchmark/truth/<aggregate>.json --submission submission.json --format json
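
Before scoring, it can help to confirm that two independent runs of a tool produce the same submission. The sketch below assumes only that submission.json is ordinary JSON; the helper name `canonical_digest` is hypothetical and is not part of the benchmark tooling.

```python
import hashlib
import json


def canonical_digest(text):
    """SHA-256 hex digest of JSON text after re-serialising with sorted keys.

    Object key order is normalised; list order is preserved, so a tool that
    emits findings in a different order between runs is still flagged.
    """
    data = json.loads(text)
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Reading the submission files from two runs and comparing their digests gives a quick reproducibility check; a mismatch points to nondeterminism somewhere in the pipeline.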

Key dates

  • 2025-12-01: Public beta (v1.0.0 schemas, JS/PY/C cases, offline baselines).
  • 2025-12-15 (target): Add Java track once a JDK is available in CI.
  • Quarterly: hidden set rotation + leaderboard refresh.

Calls to action

  • Vendors: submit an offline-reproducible submission.json for inclusion on the public leaderboard.
  • Practitioners: run baselines locally to benchmark internal pipelines.
  • OSS: propose new cases via PR; follow determinism checklist in docs/submission-guide.md.

Risks & mitigations

  • Java track blocked (JDK) — provide runner with JDK>=17; until then Java is excluded from CI.
  • Hidden set leakage — governed by rotation policy in docs/governance.md; no public release of hidden cases.
  • Telemetry drift — all runner scripts disable telemetry by env; reviewers verify no network calls.
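
One way a reviewer might check the "no network calls" claim for Python tooling is to make socket creation fail for the duration of a run. This is a sketch under that assumption: the `no_network` helper is hypothetical, it only covers Python code running in the same process, and it says nothing about child processes or non-Python tools.

```python
import contextlib
import socket


@contextlib.contextmanager
def no_network():
    """Make any attempt to create a socket raise inside the with-block."""
    original = socket.socket

    def blocked(*args, **kwargs):
        raise RuntimeError("network access attempted during offline run")

    # Swap the socket constructor for one that always fails.
    socket.socket = blocked
    try:
        yield
    finally:
        # Restore the real constructor even if the block raised.
        socket.socket = original
```

Wrapping a scoring or validation call in `with no_network():` turns an accidental network call into an immediate, visible error rather than a silent request.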