Here’s a clean, action‑ready blueprint for a **public reachability benchmark** you can stand up quickly and grow over time.

# Why this matters (quick)

“Reachability” asks: *is a flagged vulnerability actually executable from real entry points in this codebase/container?* A public, reproducible benchmark lets you compare tools apples‑to‑apples, drive research, and keep vendors honest.

# What to collect (dataset design)

* **Projects & languages**
  * Polyglot mix: **C/C++ (ELF/PE/Mach‑O)**, **Java/Kotlin**, **C#/.NET**, **Python**, **JavaScript/TypeScript**, **PHP**, **Go**, **Rust**.
  * For each language: small (≤5k LOC), medium (5–100k), and large (100k+) projects.
* **Ground‑truth artifacts**
  * **Seed CVEs** with known sinks (e.g., deserializers, command exec, SSRF) and **neutral projects** with *no* reachable path (negatives).
  * **Exploit oracles**: minimal PoCs or unit tests that (1) reach the sink and (2) toggle reachability via feature flags.
* **Build outputs (deterministic)**
  * **Reproducible binaries/bytecode** (strip timestamps; fixed seeds; SOURCE_DATE_EPOCH).
  * **SBOM** (CycloneDX/SPDX) + **PURLs** + **Build‑ID** (ELF .note.gnu.build‑id / PE Authentihash / Mach‑O UUID).
  * **Attestations**: in‑toto/DSSE envelopes recording toolchain versions, flags, hashes.
* **Execution traces (for truth)**
  * **CI traces**: call‑graph dumps from compilers/analyzers; unit‑test coverage; optional **dynamic traces** (eBPF / .NET ETW / Java Flight Recorder).
  * **Entry‑point manifests**: HTTP routes, CLI commands, cron/queue consumers.
* **Metadata**
  * Language, framework, package manager, compiler versions, OS/container image, optimization level, stripping info, license.

# How to label ground truth

* **Per‑vuln case**: `(component, version, sink_id)` with label **reachable / unreachable / unknown**.
* **Evidence bundle**: pointer to (a) a static call path, (b) a dynamic hit (trace/coverage), or (c) a rationale for the negative.
* **Confidence**: high (static + dynamic agree), medium (one source), low (heuristic only).

# Scoring (simple + fair)

* **Binary classification** on cases:
  * Precision, Recall, F1. Report **AU‑PR** if you output probabilities.
* **Path quality**
  * **Explainability score (0–3)**:
    * 0: “vuln reachable” w/o context
    * 1: names only (entry→…→sink)
    * 2: full interprocedural path w/ locations
    * 3: plus **inputs/guards** (taint/constraints, env flags)
* **Runtime cost**
  * Wall‑clock, peak RAM, image size; normalized by KLOC.
* **Determinism**
  * Re‑run variance (≤1% is “A”, 1–5% “B”, >5% “C”); a grading sketch follows the baselines list below.

# Avoiding overfitting

* **Train/Dev/Test** splits per language; **hidden test** projects rotated quarterly.
* **Case churn**: introduce **isomorphic variants** (rename symbols, reorder files) to punish memorization.
* **Poisoned controls**: include decoy sinks and unreachable dead‑code traps.
* **Submission rules**: require **attestations** of tool versions & flags; limit per‑case hints.

# Reference baselines (to run out‑of‑the‑box)

* **Snyk Code / reachability analysis** (JS/Java/Python, SaaS/CLI).
* **Semgrep + Pro Engine** (rules + reachability mode).
* **CodeQL** (multi‑language, query‑based analysis).
* **Joern** (C/C++/JVM code property graphs).
* **angr** (binary symbolic execution; selective for native samples).
* **Language‑specific**: pip‑audit w/ import graphs, npm with lock‑tree + route discovery, Maven + call‑graph (Soot/WALA).
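To make the determinism grade above concrete, here is a minimal sketch, assuming each re‑run of a tool produces a submission JSON in the format shown in the next section, and interpreting “re‑run variance” as the fraction of per‑case predictions that flip between runs; the function names and that interpretation are illustrative assumptions, not part of the spec.

```python
import json
from itertools import combinations

def load_predictions(path: str) -> dict:
    """Map case id -> predicted label from one submission JSON (format defined below)."""
    with open(path) as fh:
        submission = json.load(fh)
    return {case["id"]: case["prediction"] for case in submission["cases"]}

def rerun_variance(paths: list[str]) -> float:
    """Fraction of (run pair, case) combinations whose prediction differs."""
    runs = [load_predictions(p) for p in paths]
    case_ids = set().union(*runs)
    diffs = total = 0
    for a, b in combinations(runs, 2):
        for cid in case_ids:
            total += 1
            if a.get(cid) != b.get(cid):
                diffs += 1
    return diffs / total if total else 0.0

def determinism_grade(variance: float) -> str:
    """<=1% variance is an A, 1-5% a B, anything above a C."""
    if variance <= 0.01:
        return "A"
    if variance <= 0.05:
        return "B"
    return "C"

# Example: grade three repeated runs of the same tool
# print(determinism_grade(rerun_variance(["run1.json", "run2.json", "run3.json"])))
```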
# Submission format (one JSON per tool run)

```json
{
  "tool": {"name": "YourTool", "version": "1.2.3"},
  "run": {
    "commit": "…",
    "platform": "ubuntu:24.04",
    "time_s": 182.4,
    "peak_mb": 3072
  },
  "cases": [
    {
      "id": "php-shop:fastjson@1.2.68:Sink#deserialize",
      "prediction": "reachable",
      "confidence": 0.88,
      "explain": {
        "entry": "POST /api/orders",
        "path": [
          "OrdersController::create",
          "Serializer::deserialize",
          "Fastjson::parseObject"
        ],
        "guards": ["feature.flag.json_enabled==true"]
      }
    }
  ],
  "artifacts": {
    "sbom": "sha256:…",
    "attestation": "sha256:…"
  }
}
```

# Folder layout (repo)

```
/benchmark
  /cases/<lang>/<project>/<case_id>/
    case.yaml          # component@version, sink, labels, evidence refs
    entrypoints.yaml   # routes/CLIs/cron
    build/             # Dockerfiles, lockfiles, pinned toolchains
    outputs/           # SBOMs, binaries, traces (checksummed)
  /splits/{train,dev,test}.txt
  /schemas/{case.json,submission.json}
  /scripts/{build.sh, run_tests.sh, score.py}
  /docs/ (how-to, FAQs, T&Cs)
```

# Minimal **v1** (4–6 weeks of work)

1. **Languages**: JS/TS, Python, Java, C (ELF).
2. **20–30 cases**: mix of reachable/unreachable with PoC unit tests.
3. **Deterministic builds** in containers; publish SBOM + attestations.
4. **Scorer**: precision/recall/F1 + explainability, runtime, determinism.
5. **Baselines**: run CodeQL + Semgrep across all; Snyk where feasible; angr for 3 native cases.
6. **Website**: static leaderboard (per‑lang, per‑size), download links, submission guide.

# V2+ (quarterly)

* Add **.NET, PHP, Go, Rust**; broaden binary focus (PE/Mach‑O).
* Add **dynamic traces** (eBPF/ETW/JFR) and **taint oracles**.
* Introduce **config‑gated reachability** (feature flags, env, k8s secrets).
* Add **dataset cards** per case (threat model, CWE, false‑positive traps).

# Publishing & governance

* License: **CC‑BY‑SA** for metadata, **source‑compatible OSS** for code, binaries under original licenses.
* **Repro packs**: `benchmark-kit.tgz` with container recipes, hashes, and attestations.
* **Disclosure**: CVE hygiene, responsible use, opt‑out path for upstreams.
* **Stewards**: small TAC (you + two external reviewers) to approve new cases and adjudicate disputes.

# Immediate next steps (checklist)

* Lock the **schemas** (case + submission + attestation fields).
* Pick 8 seed projects (2 per language, tiered by size).
* Draft 12 sink cases (6 reachable, 6 unreachable) with unit‑test oracles.
* Script deterministic builds and **hash‑locked SBOMs**.
* Implement the scorer; publish a **starter leaderboard** with 2 baselines.
* Ship **v1 website/docs** and open submissions.

If you want, I can generate the repo scaffold (folders, YAML/JSON schemas, Dockerfiles, scorer script) so your team can `git clone` and start adding cases immediately.

Cool, let’s turn the blueprint into a concrete, developer‑friendly implementation plan. I’ll assume **v1 scope** is:

* Languages: **JavaScript/TypeScript (Node)**, **Python**, **Java**, **C (ELF)**
* ~**20–30 cases** total (reachable/unreachable mix)
* Baselines: **CodeQL**, **Semgrep**, maybe **Snyk** where licenses allow, and **angr** for a few native cases

You can expand later, but this plan is enough to get v1 shipped.

---
## 0. Overall project structure & ownership

**Owners**

* **Tech Lead** – owns architecture & final decisions
* **Benchmark Core** – 2–3 devs building schemas, scorer, infra
* **Language Tracks** – 1 dev per language (JS, Python, Java, C)
* **Website/Docs** – 1 dev

**Repo layout (target)**

```text
reachability-benchmark/
  README.md
  LICENSE
  CONTRIBUTING.md
  CODE_OF_CONDUCT.md
  benchmark/
    cases/
      js/
        express-blog/
          case-001/
            case.yaml
            entrypoints.yaml
            build/
              Dockerfile
              build.sh
            src/              # project source (or submodule)
            tests/            # unit tests as oracles
            outputs/
              sbom.cdx.json
              binary.tar.gz
              coverage.json
              traces/         # optional dynamic traces
      py/
        flask-api/...
      java/
        spring-app/...
      c/
        httpd-like/...
    schemas/
      case.schema.yaml
      entrypoints.schema.yaml
      truth.schema.yaml
      submission.schema.json
    tools/
      scorer/
        rb_score/
          __init__.py
          cli.py
          metrics.py
          loader.py
          explainability.py
        pyproject.toml
        tests/
      build/
        build_all.py
        validate_builds.py
    baselines/
      codeql/
        run_case.sh
        config/
      semgrep/
        run_case.sh
        rules/
      snyk/
        run_case.sh
      angr/
        run_case.sh
  ci/
    github/
      benchmark.yml
  website/                    # static site / leaderboard
```

---

## 1. Phase 1 – Repo & infra setup

### Task 1.1 – Create repository

**Developer:** Tech Lead

**Deliverables:**

* Repo created (`reachability-benchmark` or similar)
* `LICENSE` (e.g., Apache-2.0 or MIT)
* Basic `README.md` describing:
  * Purpose (public reachability benchmark)
  * High‑level design
  * v1 scope (langs, #cases)

### Task 1.2 – Bootstrap structure

**Developer:** Benchmark Core

Create the directory skeleton as above (without filling everything yet). Add:

```make
# benchmark/Makefile
# Note: recipe lines must be indented with a real tab.
.PHONY: test lint build

test:
	pytest benchmark/tools/scorer/tests

lint:
	black benchmark/tools/scorer
	flake8 benchmark/tools/scorer

build:
	python benchmark/tools/build/build_all.py
```

### Task 1.3 – Coding standards & tooling

**Developer:** Benchmark Core

* Add `.editorconfig`, `.gitignore`, and Python tool configs (`ruff`, `black`, or `flake8`).
* Define a minimal **PR checklist** in `CONTRIBUTING.md`:
  * Tests pass
  * Lint passes
  * New schemas come with a JSON Schema or YAML schema plus tests
  * New cases come with oracles (tests/coverage)

---

## 2. Phase 2 – Case & submission schemas

### Task 2.1 – Define case metadata format

**Developer:** Benchmark Core

Create `benchmark/schemas/case.schema.yaml` and an example `case.yaml`.

**Example `case.yaml`**

```yaml
id: "js-express-blog:001"
language: "javascript"
framework: "express"
size: "small"             # small | medium | large

component:
  name: "express-blog"
  version: "1.0.0-bench"

vulnerability:
  cve: "CVE-XXXX-YYYY"
  cwe: "CWE-502"
  description: "Unsafe deserialization via user-controlled JSON."
  sink_id: "Deserializer::parse"

ground_truth:
  label: "reachable"      # reachable | unreachable | unknown
  confidence: "high"      # high | medium | low
  evidence_files:
    - "truth.yaml"
  notes: >
    Unit test test_reachable_deserialization triggers the sink.

build:
  dockerfile: "build/Dockerfile"
  build_script: "build/build.sh"
  output:
    artifact_path: "outputs/binary.tar.gz"
    sbom_path: "outputs/sbom.cdx.json"
    coverage_path: "outputs/coverage.json"
    traces_dir: "outputs/traces"

environment:
  os_image: "ubuntu:24.04"
  compiler: null
  runtime:
    node: "20.11.0"
  source_date_epoch: 1730000000
```

**Acceptance criteria**

* The schema validates the sample `case.yaml` via a Python script:
  * `benchmark/tools/build/validate_schema.py` using `jsonschema` or `pykwalify`.
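As a starting point for that acceptance criterion, here is a minimal sketch of `validate_schema.py` taking the `jsonschema` route (the YAML schema file is loaded and treated as a JSON Schema document); the two‑argument CLI shape and the error reporting are assumptions, not a fixed interface.

```python
#!/usr/bin/env python3
"""Validate a case.yaml against benchmark/schemas/case.schema.yaml."""
import sys

import yaml
from jsonschema import Draft202012Validator

def load_yaml(path: str):
    with open(path) as fh:
        return yaml.safe_load(fh)

def main(schema_path: str, case_path: str) -> int:
    schema = load_yaml(schema_path)      # schema authored as YAML, interpreted as JSON Schema
    instance = load_yaml(case_path)
    validator = Draft202012Validator(schema)
    # Collect all violations instead of stopping at the first one.
    errors = sorted(validator.iter_errors(instance),
                    key=lambda e: [str(p) for p in e.path])
    for err in errors:
        location = "/".join(str(p) for p in err.path) or "<root>"
        print(f"{case_path}: {location}: {err.message}")
    return 1 if errors else 0

if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: validate_schema.py <schema.yaml> <case.yaml>")
    sys.exit(main(sys.argv[1], sys.argv[2]))
```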
---

### Task 2.2 – Entry points schema

**Developer:** Benchmark Core

`benchmark/schemas/entrypoints.schema.yaml`

**Example `entrypoints.yaml`**

```yaml
entries:
  http:
    - id: "POST /api/posts"
      route: "/api/posts"
      method: "POST"
      handler: "PostsController.create"
  cli:
    - id: "generate-report"
      command: "node cli.js generate-report"
      description: "Generates summary report."
  scheduled:
    - id: "daily-cleanup"
      schedule: "0 3 * * *"
      handler: "CleanupJob.run"
```

---

### Task 2.3 – Ground truth / truth schema

**Developer:** Benchmark Core + Language Tracks

`benchmark/schemas/truth.schema.yaml`

**Example `truth.yaml`**

```yaml
id: "js-express-blog:001"
cases:
  - sink_id: "Deserializer::parse"
    label: "reachable"
    dynamic_evidence:
      covered_by_tests:
        - "tests/test_reachable_deserialization.js::should_reach_sink"
      coverage_files:
        - "outputs/coverage.json"
    static_evidence:
      call_path:
        - "POST /api/posts"
        - "PostsController.create"
        - "PostsService.createFromJson"
        - "Deserializer.parse"
    config_conditions:
      - "process.env.FEATURE_JSON_ENABLED == 'true'"
    notes: "If FEATURE_JSON_ENABLED=false, path is unreachable."
```

---

### Task 2.4 – Submission schema

**Developer:** Benchmark Core

`benchmark/schemas/submission.schema.json`

**Shape**

```json
{
  "tool": { "name": "YourTool", "version": "1.2.3" },
  "run": {
    "commit": "abcd1234",
    "platform": "ubuntu:24.04",
    "time_s": 182.4,
    "peak_mb": 3072
  },
  "cases": [
    {
      "id": "js-express-blog:001",
      "prediction": "reachable",
      "confidence": 0.88,
      "explain": {
        "entry": "POST /api/posts",
        "path": [
          "PostsController.create",
          "PostsService.createFromJson",
          "Deserializer.parse"
        ],
        "guards": [
          "process.env.FEATURE_JSON_ENABLED === 'true'"
        ]
      }
    }
  ],
  "artifacts": {
    "sbom": "sha256:...",
    "attestation": "sha256:..."
  }
}
```

Write a Python validation utility:

```bash
python benchmark/tools/scorer/validate_submission.py submission.json
```

**Acceptance criteria**

* Validation fails on missing fields / wrong enum values.
* At least two sample submissions pass validation (e.g., a “perfect” one and a “random baseline”).

---

## 3. Phase 3 – Reference projects & deterministic builds

### Task 3.1 – Select and vendor v1 projects

**Developer:** Tech Lead + Language Tracks

For each language, choose:

* 1 small toy app (simple web or CLI)
* 1 medium app (more routes, multiple modules)
* Optional: 1 large app (for performance stress tests)

Add them under `benchmark/cases/<lang>/<project>/src/` (or as git submodules if you want to track upstream).

---

### Task 3.2 – Deterministic Docker build per project

**Developer:** Language Tracks

For each project:

* Create `build/Dockerfile`
* Create `build/build.sh` that:
  * Builds the app
  * Produces artifacts
  * Generates the SBOM and attestation

**Example `build/Dockerfile` (Node)**

```dockerfile
FROM node:20.11-slim

ENV NODE_ENV=production
ENV SOURCE_DATE_EPOCH=1730000000

WORKDIR /app
COPY src/ /app/
COPY package.json package-lock.json /app/
RUN npm ci --ignore-scripts && \
    npm run build || true

CMD ["node", "server.js"]
```

**Example `build.sh`**

```bash
#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(dirname "$(readlink -f "$0")")/.."
OUT_DIR="$ROOT_DIR/outputs"
mkdir -p "$OUT_DIR"

IMAGE_TAG="rb-js-express-blog:1"

# Build from the case root so src/ and the package manifests are in the context
docker build -f "$ROOT_DIR/build/Dockerfile" -t "$IMAGE_TAG" "$ROOT_DIR"

# Export image as tarball (binary artifact)
docker save "$IMAGE_TAG" | gzip > "$OUT_DIR/binary.tar.gz"

# Generate SBOM (e.g. via syft) – can be an optional stub for v1
syft packages "docker:$IMAGE_TAG" -o cyclonedx-json > "$OUT_DIR/sbom.cdx.json"

# In future: generate in-toto attestations
```

---

### Task 3.3 – Determinism checker

**Developer:** Benchmark Core

`benchmark/tools/build/validate_builds.py`:

* For each case:
  * Run `build.sh` twice
  * Compare hashes of `outputs/binary.tar.gz` and `outputs/sbom.cdx.json`
* Fail if the hashes differ.

**Acceptance criteria**

* All v1 cases produce identical artifacts across two builds on CI.

---

## 4. Phase 4 – Ground truth oracles (tests & traces)

### Task 4.1 – Add unit/integration tests for reachable cases

**Developer:** Language Tracks

For each **reachable** case:

* Add `tests/` under the project to:
  * Start the app (if necessary)
  * Send a request/trigger that reaches the vulnerable sink
  * Assert that a sentinel side effect occurs (e.g., a log line or marker file) instead of real exploitation.

Example for Node using Jest (with `supertest` for the HTTP call; module paths are illustrative):

```js
const request = require("supertest");
// Adjust the path/exports to the case's actual app module.
const { app, sinkWasReached } = require("../src/server");

test("should reach deserialization sink", async () => {
  const res = await request(app)
    .post("/api/posts")
    .send({ title: "x", body: '{"__proto__":{}}' });

  expect(res.statusCode).toBe(200);
  // Sink logs "REACH_SINK" – we check log or variable
  expect(sinkWasReached()).toBe(true);
});
```

### Task 4.2 – Instrument coverage

**Developer:** Language Tracks

* For each language, pick a coverage tool:
  * JS: `nyc` (Istanbul)
  * Python: `coverage.py`
  * Java: `jacoco`
  * C: `gcov`/`llvm-cov` (optional for v1)
* Ensure running the tests produces `outputs/coverage.json` or `.xml`, which we then convert to a simple JSON format:

```json
{
  "files": {
    "src/controllers/posts.js": {
      "lines_covered": [12, 13, 14, 27],
      "lines_total": 40
    }
  }
}
```

Create a small converter script if needed (a sketch for the Python track follows at the end of this phase).

### Task 4.3 – Optional dynamic traces

If you want richer evidence:

* JS: add middleware that logs `(entry_id, handler, sink)` triples to `outputs/traces/traces.json`
* Python: similar, using decorators
* C/Java: out of scope for v1 unless you want to invest extra time.
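To go with Task 4.2, here is a minimal converter sketch for the Python track, assuming the input is the JSON report emitted by `coverage.py` (`coverage json`); the key names read from that report and the output path are assumptions to verify against the pinned coverage version, and the other languages would need their own converters.

```python
#!/usr/bin/env python3
"""Convert a coverage.py JSON report into the benchmark's simple coverage format."""
import json
import sys

def convert(coverage_json_path: str, output_path: str) -> None:
    with open(coverage_json_path) as fh:
        report = json.load(fh)

    simple = {"files": {}}
    # "files", "executed_lines" and "summary.num_statements" are the key names
    # used by coverage.py's JSON report; verify against the version you pin.
    for path, data in report.get("files", {}).items():
        simple["files"][path] = {
            "lines_covered": sorted(data.get("executed_lines", [])),
            "lines_total": data.get("summary", {}).get("num_statements", 0),
        }

    with open(output_path, "w") as fh:
        json.dump(simple, fh, indent=2, sort_keys=True)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: convert_coverage.py <coverage.json> <outputs/coverage.json>")
    convert(sys.argv[1], sys.argv[2])
```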
---

## 5. Phase 5 – Scoring tool (CLI)

### Task 5.1 – Implement `rb-score` library + CLI

**Developer:** Benchmark Core

Create `benchmark/tools/scorer/rb_score/` with:

* `loader.py`
  * Load all `case.yaml` / `truth.yaml` files into memory.
  * Provide functions such as `load_cases() -> Dict[case_id, Case]`.
* `metrics.py`
  * Implement:
    * `compute_precision_recall(truth, predictions)`
    * `compute_path_quality_score(explain_block)` (0–3)
    * `compute_runtime_stats(run_block)`
* `cli.py`
  * CLI:

```bash
rb-score \
  --cases-root benchmark/cases \
  --submission submissions/mytool.json \
  --output results/mytool_results.json
```

**Pseudo-code for core scoring**

```python
def score_submission(truth, submission):
    y_true = []
    y_pred = []
    per_case_scores = {}

    for case_id, gt_case in truth.items():
        gt = gt_case.label  # reachable / unreachable
        pred_case = find_pred_case(submission.cases, case_id)
        # Missing predictions are treated as "unreachable"
        pred_label = pred_case.prediction if pred_case else "unreachable"

        y_true.append(gt == "reachable")
        y_pred.append(pred_label == "reachable")

        explain_score = explainability(pred_case.explain if pred_case else None)
        per_case_scores[case_id] = {
            "gt": gt,
            "pred": pred_label,
            "explainability": explain_score,
        }

    precision, recall, f1 = compute_prf(y_true, y_pred)

    return {
        "summary": {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "num_cases": len(truth),
        },
        "cases": per_case_scores,
    }
```

### Task 5.2 – Explainability scoring rules

**Developer:** Benchmark Core

Implement `explainability(explain)`:

* 0 – `explain` missing or `path` empty
* 1 – `path` present with at least 2 nodes (sink + one function)
* 2 – `path` contains:
  * Entry label (HTTP route/CLI id)
  * ≥3 nodes (entry → … → sink)
* 3 – Level 2 plus a non-empty `guards` list

Write unit tests for at least 4 scenarios (a sketch of the scoring function follows Task 5.3).

### Task 5.3 – Regression tests for scoring

Add a small test fixture:

* Tiny synthetic benchmark: 3 cases, 2 reachable, 1 unreachable.
* 3 submissions:
  * Perfect
  * All reachable
  * All unreachable

Assertions:

* Perfect: `precision=1, recall=1`
* All reachable: `recall=1, precision<1`
* All unreachable: `recall=0`; precision is undefined here (no positive predictions), so fix a convention in the scorer (e.g., report `0` or `null`) and assert that.
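Here is a minimal sketch of the Task 5.2 rules, assuming `explain` is the dict from the submission JSON (`entry`, `path`, `guards`); treating any non-empty `entry` string as satisfying the “entry label” check, and scoring single-node paths as 0, are interpretation choices you may want to tighten (e.g., by matching `entry` against `entrypoints.yaml`).

```python
def explainability(explain: dict | None) -> int:
    """Score an explain block on the 0-3 scale defined in Task 5.2."""
    if not explain:
        return 0

    path = explain.get("path") or []
    entry = explain.get("entry")
    guards = explain.get("guards") or []

    if len(path) < 2:
        # Empty or single-node paths give no usable context (convention: score 0).
        return 0

    # Level 2 requires an entry label and a path of at least entry -> ... -> sink.
    has_entry = bool(entry) and len(path) >= 3
    if not has_entry:
        return 1
    if guards:
        return 3
    return 2
```

Example fixtures along the lines of the Task 5.2 unit tests:

```python
assert explainability(None) == 0
assert explainability({"path": ["Deserializer.parse"]}) == 0
assert explainability({"path": ["PostsService.createFromJson", "Deserializer.parse"]}) == 1
assert explainability({"entry": "POST /api/posts",
                       "path": ["PostsController.create",
                                "PostsService.createFromJson",
                                "Deserializer.parse"]}) == 2
assert explainability({"entry": "POST /api/posts",
                       "path": ["PostsController.create",
                                "PostsService.createFromJson",
                                "Deserializer.parse"],
                       "guards": ["process.env.FEATURE_JSON_ENABLED === 'true'"]}) == 3
```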
---

## 6. Phase 6 – Baseline integrations

### Task 6.1 – Semgrep baseline

**Developer:** Benchmark Core (with Semgrep experience)

* `baselines/semgrep/run_case.sh`:
  * Inputs: `case_id`, `cases_root`, `output_path`
  * Steps:
    * Find `src/` for the case
    * Run `semgrep --config auto` or curated rules
    * Convert Semgrep findings into the benchmark submission format:
      * Map Semgrep rules → vulnerability types → candidate sinks
      * Heuristically guess reachability (for v1, maybe always “reachable” if the sink appears in a code path)
  * Output: `output_path` JSON conforming to `submission.schema.json`.

### Task 6.2 – CodeQL baseline

* Create CodeQL databases for each project (likely via `codeql database create`).
* Create queries targeting known sinks (e.g., `Deserialization`, `CommandInjection`).
* `baselines/codeql/run_case.sh`:
  * Build the DB (or reuse it)
  * Run queries
  * Translate results into our submission format (again as heuristic reachability).

### Task 6.3 – Optional Snyk / angr baselines

* Snyk:
  * Use `snyk test` on the project
  * Map results to dependencies & known CVEs
  * For v1, just mark a case as `reachable` if Snyk reports a reachable path (where available).
* angr:
  * For 1–2 small C samples, configure a simple analysis script.

**Acceptance criteria**

* For at least 5 cases (across languages), the baselines produce valid submission JSON.
* `rb-score` runs and yields metrics without errors.

---

## 7. Phase 7 – CI/CD

### Task 7.1 – GitHub Actions workflow

**Developer:** Benchmark Core

`ci/github/benchmark.yml`:

Jobs:

1. `lint-and-test`
   * `python -m pip install -e benchmark/tools/scorer[dev]`
   * `make lint`
   * `make test`
2. `build-cases`
   * `python benchmark/tools/build/build_all.py`
   * Run `validate_builds.py`
3. `smoke-baselines`
   * For 2–3 cases, run the Semgrep/CodeQL wrappers and ensure they emit valid submissions.

### Task 7.2 – Artifact upload

* Upload the `outputs/` tarball from `build-cases` as a workflow artifact.
* Upload `results/*.json` from scoring runs.

---

## 8. Phase 8 – Website & leaderboard

### Task 8.1 – Define results JSON format

**Developer:** Benchmark Core + Website dev

`results/leaderboard.json`:

```json
{
  "tools": [
    {
      "name": "Semgrep",
      "version": "1.60.0",
      "summary": { "precision": 0.72, "recall": 0.48, "f1": 0.58 },
      "by_language": {
        "javascript": {"precision": 0.80, "recall": 0.50, "f1": 0.62},
        "python": {"precision": 0.65, "recall": 0.45, "f1": 0.53}
      }
    }
  ]
}
```

CLI option to generate this:

```bash
rb-score compare \
  --cases-root benchmark/cases \
  --submissions submissions/*.json \
  --output results/leaderboard.json
```

### Task 8.2 – Static site

**Developer:** Website dev

Tech choice: any static framework (Next.js, Astro, Docusaurus, or even pure HTML + JS).

Pages:

* **Home**
  * What is reachability?
  * Summary of the benchmark
* **Leaderboard**
  * Renders `leaderboard.json`
  * Filters: language, case size
* **Docs**
  * How to run the benchmark locally
  * How to prepare a submission

Add a simple script to copy `results/leaderboard.json` into `website/public/` for publishing.

---

## 9. Phase 9 – Docs, governance, and contribution flow

### Task 9.1 – CONTRIBUTING.md

Include:

* How to add a new case, step by step:
  1. Create a project folder under `benchmark/cases/<lang>/<project>/case-XXX/`
  2. Add `case.yaml`, `entrypoints.yaml`, `truth.yaml`
  3. Add oracles (tests, coverage)
  4. Add deterministic `build/` assets
  5. Run local tooling:
     * `validate_schema.py`
     * `validate_builds.py --case <case_id>`
* An example PR description template.

### Task 9.2 – Governance doc

* Define **Technical Advisory Committee (TAC)** roles:
  * Approve new cases
  * Approve schema changes
  * Manage hidden test sets (future phase)
* Define the **release cadence**:
  * v1.0 with public cases
  * Quarterly updates with new hidden cases.

---

## 10. Suggested milestone breakdown (for planning / sprints)

### Milestone 1 – Foundation (1–2 sprints)

* Repo scaffolding (Tasks 1.x)
* Schemas (Tasks 2.x)
* Two tiny toy cases (one JS, one Python) with:
  * `case.yaml`, `entrypoints.yaml`, `truth.yaml`
  * Deterministic build
  * Basic unit tests
* Minimal `rb-score` with:
  * Case loading
  * Precision/recall only

**Exit:** You can run `rb-score` on a dummy submission for 2 cases.

---

### Milestone 2 – v1 dataset (2–3 sprints)

* Add ~20–30 cases across JS, Python, Java, C
* Ground truth & coverage for each
* Deterministic builds validated
* Explainability scoring implemented
* Regression tests for `rb-score`

**Exit:** Full scoring tool stable; dataset builds repeatably on CI.

---

### Milestone 3 – Baselines & site (1–2 sprints)

* Semgrep + CodeQL baselines producing valid submissions
* CI running smoke baselines
* `leaderboard.json` generator
* Static website with public leaderboard and docs

**Exit:** A public v1 benchmark you can share with external tool authors.

---

If you tell me which stack your team prefers for the site (React, plain HTML, SSG, etc.) or which CI you’re on, I can adapt this into concrete config files (e.g., a full GitHub Actions workflow, Next.js scaffold, or exact `pyproject.toml` for `rb-score`).