# Reachability Benchmark · AGENTS

## Scope & Roles
- **Working directory:** `bench/reachability-benchmark/`
- Roles: benchmark curator (datasets, schemas), tooling engineer (scorer/CI), docs maintainer (public README/CONTRIBUTING), DevOps (deterministic builds, CI).
- Outputs are public-facing (Apache-2.0); keep artifacts deterministic and offline-friendly.

## Required Reading
- `docs/README.md`
- `docs/07_HIGH_LEVEL_ARCHITECTURE.md`
- `docs/reachability/function-level-evidence.md`
- `docs/reachability/lattice.md`
- Product advisories:
  - `docs/product-advisories/24-Nov-2025 - Designing a Deterministic Reachability Benchmark.md`
  - `docs/product-advisories/archived/23-Nov-2025 - Benchmarking Determinism in Vulnerability Scoring.md`
  - `docs/product-advisories/archived/23-Nov-2025 - Publishing a Reachability Benchmark Dataset.md`
- Sprint plan: `docs/implplan/SPRINT_0513_0001_0001_public_reachability_benchmark.md`
- DB/spec guidance for determinism and licensing: `docs/db/RULES.md`, `docs/db/VERIFICATION.md`

## Working Agreements
- Determinism: pin toolchains; set `SOURCE_DATE_EPOCH`; sort file lists; stable JSON/YAML ordering; fixed seeds for any sampling.
- Offline posture: no network at build/test time; vendored toolchains; registry pulls are forbidden, so use cached/bundled images.
- Java builds: use vendored Temurin 21 via `tools/java/ensure_jdk.sh` when `JAVA_HOME`/`javac` are absent; keep `.jdk/` out of VCS and use `build_all.py --skip-lang` when a toolchain is missing.
- Licensing: all benchmark content Apache-2.0; include LICENSE in repo root; third-party cases must have compatible licenses and attributions.
- Evidence: each case must include oracle tests/coverage proving its reachability label; store truth and submissions under `benchmark/truth/` and `benchmark/submissions/`, validated against their JSON Schemas.
- Security: no secrets; scrub URLs/tokens; deterministic CI artifacts only.
- Observability: scorer emits structured logs (JSON) with deterministic ordering; metrics optional.
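
The determinism rules above (sorted file lists, stable JSON ordering, `SOURCE_DATE_EPOCH`) can be sketched in Python as follows. This is a minimal illustration, not the repository's actual tooling: the helper names `build_timestamp`, `canonical_json`, and `build_manifest`, the manifest layout, and the fallback to epoch `0` when the variable is unset are all assumptions.

```python
import hashlib
import json
import os
from datetime import datetime, timezone


def build_timestamp() -> str:
    """Return a UTC timestamp derived from SOURCE_DATE_EPOCH.

    Assumption: fall back to epoch 0 rather than the wall clock, so an
    unset variable still produces reproducible output.
    """
    epoch = int(os.environ.get("SOURCE_DATE_EPOCH", "0"))
    moment = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def canonical_json(obj) -> bytes:
    """Serialize with sorted keys, fixed separators, and a trailing newline."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return (text + "\n").encode("utf-8")


def build_manifest(files: dict) -> dict:
    """Hypothetical manifest: paths in sorted order, each with its SHA-256."""
    entries = [
        {"path": path, "sha256": hashlib.sha256(files[path]).hexdigest()}
        for path in sorted(files)  # never rely on insertion or filesystem order
    ]
    return {"generated_at": build_timestamp(), "files": entries}
```

Two manifests built from the same files in a different insertion order serialize to the same bytes, which is the property a determinism check would compare.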

## Directory Contracts
- `cases/<lang>/<project>/`: source, Dockerfile (deterministic), pinned dependencies, oracle tests, expected coverage output.
- `schemas/`: JSON/YAML schemas for cases, entrypoints, truth, submission; include validation CLI.
- `tools/scorer/`: `rb-score` CLI; no network; pure local file IO.
- `baselines/`: reference runners (Semgrep/CodeQL/Stella) with normalized outputs.
- `ci/`: deterministic CI workflows; no cache flakiness.
- `website/`: static site (no trackers/fonts from CDN).
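
One way to enforce the "no network; pure local file IO" contract in tests is to make socket creation fail for the duration of a test. The sketch below is an assumption about how that could be done, not the project's actual harness: `no_network` and `NetworkBlockedError` are hypothetical names, and the guard only covers Python code in the current process (not subprocesses or native extensions).

```python
import socket
from contextlib import contextmanager


class NetworkBlockedError(RuntimeError):
    """Raised when code tries to open a socket inside no_network()."""


@contextmanager
def no_network():
    """Temporarily replace socket.socket so any connection attempt fails."""
    original_socket = socket.socket

    def guarded(*args, **kwargs):
        raise NetworkBlockedError("network access is disabled in this context")

    socket.socket = guarded
    try:
        yield
    finally:
        # Always restore the real class, even if the test body raised.
        socket.socket = original_socket
```

Wrapping a scorer or oracle test in `with no_network():` turns an accidental network call into an immediate, explicit failure instead of a flaky timeout.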

## Testing
- Per-case oracle tests must pass locally without network.
- Scorer unit tests: schema validation, scoring math (precision/recall/F1), explainability tiers.
- Determinism tests: run the scorer twice on identical inputs; outputs and their hashes must match.
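
The scoring math and the determinism check can be illustrated with a short Python sketch. It is not the `rb-score` implementation: the function names, the set-based inputs, the six-decimal rounding, and the convention of returning `0.0` for an empty denominator are assumptions made for illustration.

```python
import hashlib
import json


def score(truth: set, predicted: set) -> dict:
    """Compute precision, recall, and F1 over sets of reachable items."""
    tp = len(truth & predicted)   # correctly reported
    fp = len(predicted - truth)   # reported but not in truth
    fn = len(truth - predicted)   # in truth but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        # Rounding keeps the serialized report stable across platforms.
        "precision": round(precision, 6),
        "recall": round(recall, 6),
        "f1": round(f1, 6),
    }


def report_digest(report: dict) -> str:
    """Hash a canonical serialization so two runs can be compared directly."""
    payload = json.dumps(report, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

A determinism test under these assumptions scores the same inputs twice and asserts that both `report_digest` values are equal.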

## Status Discipline
- Mirror task status in `docs/implplan/SPRINT_0513_0001_0001_public_reachability_benchmark.md` when starting/pausing/completing work.
- Log material changes in the sprint's Execution Log with a UTC date.

## Allowed Shared Libraries
- Use existing repo toolchains only (Python/Node/Go minimal). No new external services. Keep scorer dependencies minimal and vendored when possible.