Here’s a clean, action‑ready blueprint for a **public reachability benchmark** you can stand up quickly and grow over time.

# Why this matters (quick)

“Reachability” asks: *is a flagged vulnerability actually executable from real entry points in this codebase/container?* A public, reproducible benchmark lets you compare tools apples‑to‑apples, drive research, and keep vendors honest.

# What to collect (dataset design)

* **Projects & languages**
  * Polyglot mix: **C/C++ (ELF/PE/Mach‑O)**, **Java/Kotlin**, **C#/.NET**, **Python**, **JavaScript/TypeScript**, **PHP**, **Go**, **Rust**.
  * For each language: small (≤5k LOC), medium (5–100k), and large (100k+) projects.
* **Ground‑truth artifacts**
  * **Seed CVEs** with known sinks (e.g., deserializers, command exec, SSRF) and **neutral projects** with *no* reachable path (negatives).
  * **Exploit oracles**: minimal PoCs or unit tests that (1) reach the sink and (2) toggle reachability via feature flags.
* **Build outputs (deterministic)**
  * **Reproducible binaries/bytecode** (strip timestamps; fixed seeds; SOURCE_DATE_EPOCH).
  * **SBOM** (CycloneDX/SPDX) + **PURLs** + **Build‑ID** (ELF .note.gnu.build‑id / PE Authentihash / Mach‑O UUID).
  * **Attestations**: in‑toto/DSSE envelopes recording toolchain versions, flags, hashes.
* **Execution traces (for truth)**
  * **CI traces**: call‑graph dumps from compilers/analyzers; unit‑test coverage; optional **dynamic traces** (eBPF / .NET ETW / Java Flight Recorder).
  * **Entry‑point manifests**: HTTP routes, CLI commands, cron/queue consumers.
* **Metadata**
  * Language, framework, package manager, compiler versions, OS/container image, optimization level, stripping info, license.

# How to label ground truth

* **Per‑vuln case**: `(component, version, sink_id)` with label **reachable / unreachable / unknown**.
* **Evidence bundle**: pointer to (a) a static call path, (b) a dynamic hit (trace/coverage), or (c) a rationale for the negative.
* **Confidence**: high (static + dynamic agree), medium (one source), low (heuristic only).

# Scoring (simple + fair)

* **Binary classification** on cases:
  * Precision, Recall, F1. Report **AU‑PR** if you output probabilities.
* **Path quality**
  * **Explainability score (0–3)**:
    * 0: “vuln reachable” w/o context
    * 1: names only (entry→…→sink)
    * 2: full interprocedural path w/ locations
    * 3: plus **inputs/guards** (taint/constraints, env flags)
* **Runtime cost**
  * Wall‑clock, peak RAM, image size; normalized by KLOC.
* **Determinism**
  * Re‑run variance (≤1% is “A”, 1–5% “B”, >5% “C”); a grading sketch follows the baselines list below.

# Avoiding overfitting

* **Train/Dev/Test** splits per language; **hidden test** projects rotated quarterly.
* **Case churn**: introduce **isomorphic variants** (rename symbols, reorder files) to punish memorization.
* **Poisoned controls**: include decoy sinks and unreachable dead‑code traps.
* **Submission rules**: require **attestations** of tool versions & flags; limit per‑case hints.

# Reference baselines (to run out‑of‑the‑box)

* **Snyk Code / reachability analysis** (JS/Java/Python, SaaS/CLI).
* **Semgrep + Pro Engine** (rules + reachability mode).
* **CodeQL** (multi‑language, query‑based analysis).
* **Joern** (C/C++/JVM code property graphs).
* **angr** (binary symbolic execution; selective for native samples).
* **Language‑specific**: pip‑audit w/ import graphs, npm with lock‑tree + route discovery, Maven + call‑graph (Soot/WALA).
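To make the determinism grade above concrete, here is a minimal sketch, assuming each re‑run of a tool produces a submission JSON in the format shown in the next section, and interpreting “re‑run variance” as the fraction of per‑case predictions that flip between runs; the function names and that interpretation are illustrative assumptions, not part of the spec.

```python
import json
from itertools import combinations

def load_predictions(path: str) -> dict:
    """Map case id -> predicted label from one submission JSON (format defined below)."""
    with open(path) as fh:
        submission = json.load(fh)
    return {case["id"]: case["prediction"] for case in submission["cases"]}

def rerun_variance(paths: list[str]) -> float:
    """Fraction of (run pair, case) combinations whose prediction differs."""
    runs = [load_predictions(p) for p in paths]
    case_ids = set().union(*runs)
    diffs = total = 0
    for a, b in combinations(runs, 2):
        for cid in case_ids:
            total += 1
            if a.get(cid) != b.get(cid):
                diffs += 1
    return diffs / total if total else 0.0

def determinism_grade(variance: float) -> str:
    """<=1% variance is an A, 1-5% a B, anything above a C."""
    if variance <= 0.01:
        return "A"
    if variance <= 0.05:
        return "B"
    return "C"

# Example: grade three repeated runs of the same tool
# print(determinism_grade(rerun_variance(["run1.json", "run2.json", "run3.json"])))
```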
# Submission format (one JSON per tool run)

```json
{
  "tool": {"name": "YourTool", "version": "1.2.3"},
  "run": {
    "commit": "…",
    "platform": "ubuntu:24.04",
    "time_s": 182.4,
    "peak_mb": 3072
  },
  "cases": [
    {
      "id": "php-shop:fastjson@1.2.68:Sink#deserialize",
      "prediction": "reachable",
      "confidence": 0.88,
      "explain": {
        "entry": "POST /api/orders",
        "path": [
          "OrdersController::create",
          "Serializer::deserialize",
          "Fastjson::parseObject"
        ],
        "guards": ["feature.flag.json_enabled==true"]
      }
    }
  ],
  "artifacts": {
    "sbom": "sha256:…",
    "attestation": "sha256:…"
  }
}
```

# Folder layout (repo)

```
/benchmark
  /cases/<lang>/<project>/<case_id>/
    case.yaml          # component@version, sink, labels, evidence refs
    entrypoints.yaml   # routes/CLIs/cron
    build/             # Dockerfiles, lockfiles, pinned toolchains
    outputs/           # SBOMs, binaries, traces (checksummed)
  /splits/{train,dev,test}.txt
  /schemas/{case.json,submission.json}
  /scripts/{build.sh, run_tests.sh, score.py}
  /docs/ (how-to, FAQs, T&Cs)
```

# Minimal **v1** (4–6 weeks of work)

1. **Languages**: JS/TS, Python, Java, C (ELF).
2. **20–30 cases**: mix of reachable/unreachable with PoC unit tests.
3. **Deterministic builds** in containers; publish SBOM + attestations.
4. **Scorer**: precision/recall/F1 + explainability, runtime, determinism.
5. **Baselines**: run CodeQL + Semgrep across all; Snyk where feasible; angr for 3 native cases.
6. **Website**: static leaderboard (per‑lang, per‑size), download links, submission guide.

# V2+ (quarterly)

* Add **.NET, PHP, Go, Rust**; broaden binary focus (PE/Mach‑O).
* Add **dynamic traces** (eBPF/ETW/JFR) and **taint oracles**.
* Introduce **config‑gated reachability** (feature flags, env, k8s secrets).
* Add **dataset cards** per case (threat model, CWE, false‑positive traps).

# Publishing & governance

* License: **CC‑BY‑SA** for metadata, **source‑compatible OSS** for code, binaries under original licenses.
* **Repro packs**: `benchmark-kit.tgz` with container recipes, hashes, and attestations.
* **Disclosure**: CVE hygiene, responsible use, opt‑out path for upstreams.
* **Stewards**: small TAC (you + two external reviewers) to approve new cases and adjudicate disputes.

# Immediate next steps (checklist)

* Lock the **schemas** (case + submission + attestation fields).
* Pick 8 seed projects (2 per language, tiered by size).
* Draft 12 sink cases (6 reachable, 6 unreachable) with unit‑test oracles.
* Script deterministic builds and **hash‑locked SBOMs**.
* Implement the scorer; publish a **starter leaderboard** with 2 baselines.
* Ship **v1 website/docs** and open submissions.

If you want, I can generate the repo scaffold (folders, YAML/JSON schemas, Dockerfiles, scorer script) so your team can `git clone` and start adding cases immediately.

Cool, let’s turn the blueprint into a concrete, developer‑friendly implementation plan. I’ll assume **v1 scope** is:

* Languages: **JavaScript/TypeScript (Node)**, **Python**, **Java**, **C (ELF)**
* ~**20–30 cases** total (reachable/unreachable mix)
* Baselines: **CodeQL**, **Semgrep**, maybe **Snyk** where licenses allow, and **angr** for a few native cases

You can expand later, but this plan is enough to get v1 shipped.

---
## 0. Overall project structure & ownership

**Owners**

* **Tech Lead** – owns architecture & final decisions
* **Benchmark Core** – 2–3 devs building schemas, scorer, infra
* **Language Tracks** – 1 dev per language (JS, Python, Java, C)
* **Website/Docs** – 1 dev

**Repo layout (target)**

```text
reachability-benchmark/
  README.md
  LICENSE
  CONTRIBUTING.md
  CODE_OF_CONDUCT.md
  benchmark/
    cases/
      js/
        express-blog/
          case-001/
            case.yaml
            entrypoints.yaml
            build/
              Dockerfile
              build.sh
            src/              # project source (or submodule)
            tests/            # unit tests as oracles
            outputs/
              sbom.cdx.json
              binary.tar.gz
              coverage.json
              traces/         # optional dynamic traces
      py/
        flask-api/...
      java/
        spring-app/...
      c/
        httpd-like/...
    schemas/
      case.schema.yaml
      entrypoints.schema.yaml
      truth.schema.yaml
      submission.schema.json
    tools/
      scorer/
        rb_score/
          __init__.py
          cli.py
          metrics.py
          loader.py
          explainability.py
        pyproject.toml
        tests/
      build/
        build_all.py
        validate_builds.py
    baselines/
      codeql/
        run_case.sh
        config/
      semgrep/
        run_case.sh
        rules/
      snyk/
        run_case.sh
      angr/
        run_case.sh
  ci/
    github/
      benchmark.yml
  website/                    # static site / leaderboard
```

---

## 1. Phase 1 – Repo & infra setup

### Task 1.1 – Create repository

**Developer:** Tech Lead

**Deliverables:**

* Repo created (`reachability-benchmark` or similar)
* `LICENSE` (e.g., Apache-2.0 or MIT)
* Basic `README.md` describing:
  * Purpose (public reachability benchmark)
  * High‑level design
  * v1 scope (langs, #cases)

### Task 1.2 – Bootstrap structure

**Developer:** Benchmark Core

Create the directory skeleton as above (without filling everything yet). Add:

```make
# benchmark/Makefile
# Note: recipe lines must be indented with a real tab.
.PHONY: test lint build

test:
	pytest benchmark/tools/scorer/tests

lint:
	black benchmark/tools/scorer
	flake8 benchmark/tools/scorer

build:
	python benchmark/tools/build/build_all.py
```

### Task 1.3 – Coding standards & tooling

**Developer:** Benchmark Core

* Add `.editorconfig`, `.gitignore`, and Python tool configs (`ruff`, `black`, or `flake8`).
* Define a minimal **PR checklist** in `CONTRIBUTING.md`:
  * Tests pass
  * Lint passes
  * New schemas come with a JSON Schema or YAML schema plus tests
  * New cases come with oracles (tests/coverage)

---

## 2. Phase 2 – Case & submission schemas

### Task 2.1 – Define case metadata format

**Developer:** Benchmark Core

Create `benchmark/schemas/case.schema.yaml` and an example `case.yaml`.

**Example `case.yaml`**

```yaml
id: "js-express-blog:001"
language: "javascript"
framework: "express"
size: "small"             # small | medium | large

component:
  name: "express-blog"
  version: "1.0.0-bench"

vulnerability:
  cve: "CVE-XXXX-YYYY"
  cwe: "CWE-502"
  description: "Unsafe deserialization via user-controlled JSON."
  sink_id: "Deserializer::parse"

ground_truth:
  label: "reachable"      # reachable | unreachable | unknown
  confidence: "high"      # high | medium | low
  evidence_files:
    - "truth.yaml"
  notes: >
    Unit test test_reachable_deserialization triggers the sink.

build:
  dockerfile: "build/Dockerfile"
  build_script: "build/build.sh"
  output:
    artifact_path: "outputs/binary.tar.gz"
    sbom_path: "outputs/sbom.cdx.json"
    coverage_path: "outputs/coverage.json"
    traces_dir: "outputs/traces"

environment:
  os_image: "ubuntu:24.04"
  compiler: null
  runtime:
    node: "20.11.0"
  source_date_epoch: 1730000000
```

**Acceptance criteria**

* The schema validates the sample `case.yaml` via a Python script:
  * `benchmark/tools/build/validate_schema.py` using `jsonschema` or `pykwalify`.
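As a starting point for that acceptance criterion, here is a minimal sketch of `validate_schema.py` taking the `jsonschema` route (the YAML schema file is loaded and treated as a JSON Schema document); the two‑argument CLI shape and the error reporting are assumptions, not a fixed interface.

```python
#!/usr/bin/env python3
"""Validate a case.yaml against benchmark/schemas/case.schema.yaml."""
import sys

import yaml
from jsonschema import Draft202012Validator

def load_yaml(path: str):
    with open(path) as fh:
        return yaml.safe_load(fh)

def main(schema_path: str, case_path: str) -> int:
    schema = load_yaml(schema_path)      # schema authored as YAML, interpreted as JSON Schema
    instance = load_yaml(case_path)
    validator = Draft202012Validator(schema)
    # Collect all violations instead of stopping at the first one.
    errors = sorted(validator.iter_errors(instance),
                    key=lambda e: [str(p) for p in e.path])
    for err in errors:
        location = "/".join(str(p) for p in err.path) or "<root>"
        print(f"{case_path}: {location}: {err.message}")
    return 1 if errors else 0

if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: validate_schema.py <schema.yaml> <case.yaml>")
    sys.exit(main(sys.argv[1], sys.argv[2]))
```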
---

### Task 2.2 – Entry points schema

**Developer:** Benchmark Core

`benchmark/schemas/entrypoints.schema.yaml`

**Example `entrypoints.yaml`**

```yaml
entries:
  http:
    - id: "POST /api/posts"
      route: "/api/posts"
      method: "POST"
      handler: "PostsController.create"
  cli:
    - id: "generate-report"
      command: "node cli.js generate-report"
      description: "Generates summary report."
  scheduled:
    - id: "daily-cleanup"
      schedule: "0 3 * * *"
      handler: "CleanupJob.run"
```

---

### Task 2.3 – Ground truth / truth schema

**Developer:** Benchmark Core + Language Tracks

`benchmark/schemas/truth.schema.yaml`

**Example `truth.yaml`**

```yaml
id: "js-express-blog:001"
cases:
  - sink_id: "Deserializer::parse"
    label: "reachable"
    dynamic_evidence:
      covered_by_tests:
        - "tests/test_reachable_deserialization.js::should_reach_sink"
      coverage_files:
        - "outputs/coverage.json"
    static_evidence:
      call_path:
        - "POST /api/posts"
        - "PostsController.create"
        - "PostsService.createFromJson"
        - "Deserializer.parse"
    config_conditions:
      - "process.env.FEATURE_JSON_ENABLED == 'true'"
    notes: "If FEATURE_JSON_ENABLED=false, path is unreachable."
```

---

### Task 2.4 – Submission schema

**Developer:** Benchmark Core

`benchmark/schemas/submission.schema.json`

**Shape**

```json
{
  "tool": { "name": "YourTool", "version": "1.2.3" },
  "run": {
    "commit": "abcd1234",
    "platform": "ubuntu:24.04",
    "time_s": 182.4,
    "peak_mb": 3072
  },
  "cases": [
    {
      "id": "js-express-blog:001",
      "prediction": "reachable",
      "confidence": 0.88,
      "explain": {
        "entry": "POST /api/posts",
        "path": [
          "PostsController.create",
          "PostsService.createFromJson",
          "Deserializer.parse"
        ],
        "guards": [
          "process.env.FEATURE_JSON_ENABLED === 'true'"
        ]
      }
    }
  ],
  "artifacts": {
    "sbom": "sha256:...",
    "attestation": "sha256:..."
  }
}
```

Write a Python validation utility:

```bash
python benchmark/tools/scorer/validate_submission.py submission.json
```

**Acceptance criteria**

* Validation fails on missing fields / wrong enum values.
* At least two sample submissions pass validation (e.g., a “perfect” one and a “random baseline”).

---

## 3. Phase 3 – Reference projects & deterministic builds

### Task 3.1 – Select and vendor v1 projects

**Developer:** Tech Lead + Language Tracks

For each language, choose:

* 1 small toy app (simple web or CLI)
* 1 medium app (more routes, multiple modules)
* Optional: 1 large app (for performance stress tests)

Add them under `benchmark/cases/<lang>/<project>/src/` (or as git submodules if you want to track upstream).

---

### Task 3.2 – Deterministic Docker build per project

**Developer:** Language Tracks

For each project:

* Create `build/Dockerfile`
* Create `build/build.sh` that:
  * Builds the app
  * Produces artifacts
  * Generates the SBOM and attestation

**Example `build/Dockerfile` (Node)**

```dockerfile
FROM node:20.11-slim

ENV NODE_ENV=production
ENV SOURCE_DATE_EPOCH=1730000000

WORKDIR /app
COPY src/ /app/
COPY package.json package-lock.json /app/
RUN npm ci --ignore-scripts && \
    npm run build || true

CMD ["node", "server.js"]
```

**Example `build.sh`**

```bash
#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(dirname "$(readlink -f "$0")")/.."
OUT_DIR="$ROOT_DIR/outputs"
mkdir -p "$OUT_DIR"

IMAGE_TAG="rb-js-express-blog:1"

# Build from the case root so src/ and the package manifests are in the context
docker build -f "$ROOT_DIR/build/Dockerfile" -t "$IMAGE_TAG" "$ROOT_DIR"

# Export image as tarball (binary artifact)
docker save "$IMAGE_TAG" | gzip > "$OUT_DIR/binary.tar.gz"

# Generate SBOM (e.g. via syft) – can be an optional stub for v1
syft packages "docker:$IMAGE_TAG" -o cyclonedx-json > "$OUT_DIR/sbom.cdx.json"

# In future: generate in-toto attestations
```

---

### Task 3.3 – Determinism checker

**Developer:** Benchmark Core

`benchmark/tools/build/validate_builds.py`:

* For each case:
  * Run `build.sh` twice
  * Compare hashes of `outputs/binary.tar.gz` and `outputs/sbom.cdx.json`
* Fail if the hashes differ.

**Acceptance criteria**

* All v1 cases produce identical artifacts across two builds on CI.

---

## 4. Phase 4 – Ground truth oracles (tests & traces)

### Task 4.1 – Add unit/integration tests for reachable cases

**Developer:** Language Tracks

For each **reachable** case:

* Add `tests/` under the project to:
  * Start the app (if necessary)
  * Send a request/trigger that reaches the vulnerable sink
  * Assert that a sentinel side effect occurs (e.g., a log line or marker file) instead of real exploitation.

Example for Node using Jest (with `supertest` for the HTTP call; module paths are illustrative):

```js
const request = require("supertest");
// Adjust the path/exports to the case's actual app module.
const { app, sinkWasReached } = require("../src/server");

test("should reach deserialization sink", async () => {
  const res = await request(app)
    .post("/api/posts")
    .send({ title: "x", body: '{"__proto__":{}}' });

  expect(res.statusCode).toBe(200);
  // Sink logs "REACH_SINK" – we check log or variable
  expect(sinkWasReached()).toBe(true);
});
```

### Task 4.2 – Instrument coverage

**Developer:** Language Tracks

* For each language, pick a coverage tool:
  * JS: `nyc` (Istanbul)
  * Python: `coverage.py`
  * Java: `jacoco`
  * C: `gcov`/`llvm-cov` (optional for v1)
* Ensure running the tests produces `outputs/coverage.json` or `.xml`, which we then convert to a simple JSON format:

```json
{
  "files": {
    "src/controllers/posts.js": {
      "lines_covered": [12, 13, 14, 27],
      "lines_total": 40
    }
  }
}
```

Create a small converter script if needed (a sketch for the Python track follows at the end of this phase).

### Task 4.3 – Optional dynamic traces

If you want richer evidence:

* JS: add middleware that logs `(entry_id, handler, sink)` triples to `outputs/traces/traces.json`
* Python: similar, using decorators
* C/Java: out of scope for v1 unless you want to invest extra time.
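To go with Task 4.2, here is a minimal converter sketch for the Python track, assuming the input is the JSON report emitted by `coverage.py` (`coverage json`); the key names read from that report and the output path are assumptions to verify against the pinned coverage version, and the other languages would need their own converters.

```python
#!/usr/bin/env python3
"""Convert a coverage.py JSON report into the benchmark's simple coverage format."""
import json
import sys

def convert(coverage_json_path: str, output_path: str) -> None:
    with open(coverage_json_path) as fh:
        report = json.load(fh)

    simple = {"files": {}}
    # "files", "executed_lines" and "summary.num_statements" are the key names
    # used by coverage.py's JSON report; verify against the version you pin.
    for path, data in report.get("files", {}).items():
        simple["files"][path] = {
            "lines_covered": sorted(data.get("executed_lines", [])),
            "lines_total": data.get("summary", {}).get("num_statements", 0),
        }

    with open(output_path, "w") as fh:
        json.dump(simple, fh, indent=2, sort_keys=True)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: convert_coverage.py <coverage.json> <outputs/coverage.json>")
    convert(sys.argv[1], sys.argv[2])
```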
---

## 5. Phase 5 – Scoring tool (CLI)

### Task 5.1 – Implement `rb-score` library + CLI

**Developer:** Benchmark Core

Create `benchmark/tools/scorer/rb_score/` with:

* `loader.py`
  * Load all `case.yaml` / `truth.yaml` files into memory.
  * Provide functions such as `load_cases() -> Dict[case_id, Case]`.
* `metrics.py`
  * Implement:
    * `compute_precision_recall(truth, predictions)`
    * `compute_path_quality_score(explain_block)` (0–3)
    * `compute_runtime_stats(run_block)`
* `cli.py`
  * CLI:

```bash
rb-score \
  --cases-root benchmark/cases \
  --submission submissions/mytool.json \
  --output results/mytool_results.json
```

**Pseudo-code for core scoring**

```python
def score_submission(truth, submission):
    y_true = []
    y_pred = []
    per_case_scores = {}

    for case_id, gt_case in truth.items():
        gt = gt_case.label  # reachable / unreachable
        pred_case = find_pred_case(submission.cases, case_id)
        # Missing predictions are treated as "unreachable"
        pred_label = pred_case.prediction if pred_case else "unreachable"

        y_true.append(gt == "reachable")
        y_pred.append(pred_label == "reachable")

        explain_score = explainability(pred_case.explain if pred_case else None)
        per_case_scores[case_id] = {
            "gt": gt,
            "pred": pred_label,
            "explainability": explain_score,
        }

    precision, recall, f1 = compute_prf(y_true, y_pred)

    return {
        "summary": {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "num_cases": len(truth),
        },
        "cases": per_case_scores,
    }
```

### Task 5.2 – Explainability scoring rules

**Developer:** Benchmark Core

Implement `explainability(explain)`:

* 0 – `explain` missing or `path` empty
* 1 – `path` present with at least 2 nodes (sink + one function)
* 2 – `path` contains:
  * Entry label (HTTP route/CLI id)
  * ≥3 nodes (entry → … → sink)
* 3 – Level 2 plus a non-empty `guards` list

Write unit tests for at least 4 scenarios (a sketch of the scoring function follows Task 5.3).

### Task 5.3 – Regression tests for scoring

Add a small test fixture:

* Tiny synthetic benchmark: 3 cases, 2 reachable, 1 unreachable.
* 3 submissions:
  * Perfect
  * All reachable
  * All unreachable

Assertions:

* Perfect: `precision=1, recall=1`
* All reachable: `recall=1, precision<1`
* All unreachable: `recall=0`; precision is undefined here (no positive predictions), so fix a convention in the scorer (e.g., report `0` or `null`) and assert that.
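Here is a minimal sketch of the Task 5.2 rules, assuming `explain` is the dict from the submission JSON (`entry`, `path`, `guards`); treating any non-empty `entry` string as satisfying the “entry label” check, and scoring single-node paths as 0, are interpretation choices you may want to tighten (e.g., by matching `entry` against `entrypoints.yaml`).

```python
def explainability(explain: dict | None) -> int:
    """Score an explain block on the 0-3 scale defined in Task 5.2."""
    if not explain:
        return 0

    path = explain.get("path") or []
    entry = explain.get("entry")
    guards = explain.get("guards") or []

    if len(path) < 2:
        # Empty or single-node paths give no usable context (convention: score 0).
        return 0

    # Level 2 requires an entry label and a path of at least entry -> ... -> sink.
    has_entry = bool(entry) and len(path) >= 3
    if not has_entry:
        return 1
    if guards:
        return 3
    return 2
```

Example fixtures along the lines of the Task 5.2 unit tests:

```python
assert explainability(None) == 0
assert explainability({"path": ["Deserializer.parse"]}) == 0
assert explainability({"path": ["PostsService.createFromJson", "Deserializer.parse"]}) == 1
assert explainability({"entry": "POST /api/posts",
                       "path": ["PostsController.create",
                                "PostsService.createFromJson",
                                "Deserializer.parse"]}) == 2
assert explainability({"entry": "POST /api/posts",
                       "path": ["PostsController.create",
                                "PostsService.createFromJson",
                                "Deserializer.parse"],
                       "guards": ["process.env.FEATURE_JSON_ENABLED === 'true'"]}) == 3
```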
---

## 6. Phase 6 – Baseline integrations

### Task 6.1 – Semgrep baseline

**Developer:** Benchmark Core (with Semgrep experience)

* `baselines/semgrep/run_case.sh`:
  * Inputs: `case_id`, `cases_root`, `output_path`
  * Steps:
    * Find `src/` for the case
    * Run `semgrep --config auto` or curated rules
    * Convert Semgrep findings into the benchmark submission format:
      * Map Semgrep rules → vulnerability types → candidate sinks
      * Heuristically guess reachability (for v1, maybe always “reachable” if the sink appears in a code path)
  * Output: `output_path` JSON conforming to `submission.schema.json`.

### Task 6.2 – CodeQL baseline

* Create CodeQL databases for each project (likely via `codeql database create`).
* Create queries targeting known sinks (e.g., `Deserialization`, `CommandInjection`).
* `baselines/codeql/run_case.sh`:
  * Build the DB (or reuse it)
  * Run queries
  * Translate results into our submission format (again as heuristic reachability).

### Task 6.3 – Optional Snyk / angr baselines

* Snyk:
  * Use `snyk test` on the project
  * Map results to dependencies & known CVEs
  * For v1, just mark a case as `reachable` if Snyk reports a reachable path (where available).
* angr:
  * For 1–2 small C samples, configure a simple analysis script.

**Acceptance criteria**

* For at least 5 cases (across languages), the baselines produce valid submission JSON.
* `rb-score` runs and yields metrics without errors.

---

## 7. Phase 7 – CI/CD

### Task 7.1 – GitHub Actions workflow

**Developer:** Benchmark Core

`ci/github/benchmark.yml`:

Jobs:

1. `lint-and-test`
   * `python -m pip install -e benchmark/tools/scorer[dev]`
   * `make lint`
   * `make test`
2. `build-cases`
   * `python benchmark/tools/build/build_all.py`
   * Run `validate_builds.py`
3. `smoke-baselines`
   * For 2–3 cases, run the Semgrep/CodeQL wrappers and ensure they emit valid submissions.

### Task 7.2 – Artifact upload

* Upload the `outputs/` tarball from `build-cases` as a workflow artifact.
* Upload `results/*.json` from scoring runs.

---

## 8. Phase 8 – Website & leaderboard

### Task 8.1 – Define results JSON format

**Developer:** Benchmark Core + Website dev

`results/leaderboard.json`:

```json
{
  "tools": [
    {
      "name": "Semgrep",
      "version": "1.60.0",
      "summary": { "precision": 0.72, "recall": 0.48, "f1": 0.58 },
      "by_language": {
        "javascript": {"precision": 0.80, "recall": 0.50, "f1": 0.62},
        "python": {"precision": 0.65, "recall": 0.45, "f1": 0.53}
      }
    }
  ]
}
```

CLI option to generate this:

```bash
rb-score compare \
  --cases-root benchmark/cases \
  --submissions submissions/*.json \
  --output results/leaderboard.json
```

### Task 8.2 – Static site

**Developer:** Website dev

Tech choice: any static framework (Next.js, Astro, Docusaurus, or even pure HTML + JS).

Pages:

* **Home**
  * What is reachability?
  * Summary of the benchmark
* **Leaderboard**
  * Renders `leaderboard.json`
  * Filters: language, case size
* **Docs**
  * How to run the benchmark locally
  * How to prepare a submission

Add a simple script to copy `results/leaderboard.json` into `website/public/` for publishing.

---

## 9. Phase 9 – Docs, governance, and contribution flow

### Task 9.1 – CONTRIBUTING.md

Include:

* How to add a new case, step by step:
  1. Create a project folder under `benchmark/cases/<lang>/<project>/case-XXX/`
  2. Add `case.yaml`, `entrypoints.yaml`, `truth.yaml`
  3. Add oracles (tests, coverage)
  4. Add deterministic `build/` assets
  5. Run local tooling:
     * `validate_schema.py`
     * `validate_builds.py --case <case_id>`
* An example PR description template.

### Task 9.2 – Governance doc

* Define **Technical Advisory Committee (TAC)** roles:
  * Approve new cases
  * Approve schema changes
  * Manage hidden test sets (future phase)
* Define the **release cadence**:
  * v1.0 with public cases
  * Quarterly updates with new hidden cases.

---

## 10. Suggested milestone breakdown (for planning / sprints)

### Milestone 1 – Foundation (1–2 sprints)

* Repo scaffolding (Tasks 1.x)
* Schemas (Tasks 2.x)
* Two tiny toy cases (one JS, one Python) with:
  * `case.yaml`, `entrypoints.yaml`, `truth.yaml`
  * Deterministic build
  * Basic unit tests
* Minimal `rb-score` with:
  * Case loading
  * Precision/recall only

**Exit:** You can run `rb-score` on a dummy submission for 2 cases.

---

### Milestone 2 – v1 dataset (2–3 sprints)

* Add ~20–30 cases across JS, Python, Java, C
* Ground truth & coverage for each
* Deterministic builds validated
* Explainability scoring implemented
* Regression tests for `rb-score`

**Exit:** Full scoring tool stable; dataset builds repeatably on CI.

---

### Milestone 3 – Baselines & site (1–2 sprints)

* Semgrep + CodeQL baselines producing valid submissions
* CI running smoke baselines
* `leaderboard.json` generator
* Static website with public leaderboard and docs

**Exit:** A public v1 benchmark you can share with external tool authors.

---

If you tell me which stack your team prefers for the site (React, plain HTML, SSG, etc.) or which CI you’re on, I can adapt this into concrete config files (e.g., a full GitHub Actions workflow, Next.js scaffold, or exact `pyproject.toml` for `rb-score`).