Here’s a clean, action‑ready blueprint for a public reachability benchmark you can stand up quickly and grow over time.
Why this matters (quick)
“Reachability” asks: is a flagged vulnerability actually executable from real entry points in this codebase/container? A public, reproducible benchmark lets you compare tools apples‑to‑apples, drive research, and keep vendors honest.
What to collect (dataset design)
- Projects & languages
  - Polyglot mix: C/C++ (ELF/PE/Mach‑O), Java/Kotlin, C#/.NET, Python, JavaScript/TypeScript, PHP, Go, Rust.
  - For each project: small (≤5k LOC), medium (5–100k), large (100k+).
- Ground‑truth artifacts
  - Seed CVEs with known sinks (e.g., deserializers, command exec, SSRF) and neutral projects with no reachable path (negatives).
  - Exploit oracles: minimal PoCs or unit tests that (1) reach the sink and (2) toggle reachability via feature flags.
- Build outputs (deterministic)
  - Reproducible binaries/bytecode (strip timestamps; fixed seeds; SOURCE_DATE_EPOCH).
  - SBOM (CycloneDX/SPDX) + PURLs + Build‑ID (ELF .note.gnu.build‑id / PE Authentihash / Mach‑O UUID).
  - Attestations: in‑toto/DSSE envelopes recording toolchain versions, flags, hashes.
- Execution traces (for truth)
  - CI traces: call‑graph dumps from compilers/analyzers; unit‑test coverage; optional dynamic traces (eBPF / .NET ETW / Java Flight Recorder).
  - Entry‑point manifests: HTTP routes, CLI commands, cron/queue consumers.
- Metadata
  - Language, framework, package manager, compiler versions, OS/container image, optimization level, stripping info, license.
How to label ground truth
- Per‑vuln case: (component, version, sink_id) with label reachable / unreachable / unknown.
- Evidence bundle: pointer to (a) static call path, (b) dynamic hit (trace/coverage), or (c) rationale for a negative.
- Confidence: high (static + dynamic agree), medium (one source), low (heuristic only).
Scoring (simple + fair)
- Binary classification on cases:
  - Precision, Recall, F1. Report AUC‑PR if you output probabilities.
- Path quality
  - Explainability score (0–3):
    - 0: “vuln reachable” w/o context
    - 1: names only (entry→…→sink)
    - 2: full interprocedural path w/ locations
    - 3: plus inputs/guards (taint/constraints, env flags)
- Runtime cost
  - Wall‑clock, peak RAM, image size; normalized by KLOC.
- Determinism
  - Re‑run variance (≤1% is “A”, 1–5% “B”, >5% “C”).
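A minimal sketch of the determinism grade, assuming “re‑run variance” is interpreted as the relative spread of a headline metric (e.g., F1) across repeated runs of the same tool; the helper name and definition are illustrative:

def determinism_grade(run_scores: list[float]) -> str:
    """Grade re-run variance: <=1% -> A, 1-5% -> B, >5% -> C (spread relative to the mean)."""
    mean = sum(run_scores) / len(run_scores)
    if mean == 0:
        return "A" if max(run_scores) == min(run_scores) else "C"
    spread = (max(run_scores) - min(run_scores)) / mean
    if spread <= 0.01:
        return "A"
    if spread <= 0.05:
        return "B"
    return "C"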
Avoiding overfitting
- Train/Dev/Test splits per language; hidden test projects rotated quarterly.
- Case churn: introduce isomorphic variants (rename symbols, reorder files) to punish memorization.
- Poisoned controls: include decoy sinks and unreachable dead‑code traps.
- Submission rules: require attestations of tool versions & flags; limit per‑case hints.
Reference baselines (to run out‑of‑the‑box)
- Snyk Code/Reachability (JS/Java/Python, SaaS/CLI).
- Semgrep + Pro Engine (rules + reachability mode).
- CodeQL (multi‑lang, LGTM‑style queries).
- Joern (C/C++/JVM code property graphs).
- angr (binary symbolic exec; selective for native samples).
- Language‑specific: pip‑audit w/ import graphs, npm with lock‑tree + route discovery, Maven + call‑graph (Soot/WALA).
Submission format (one JSON per tool run)
{
"tool": {"name": "YourTool", "version": "1.2.3"},
"run": {
"commit": "…",
"platform": "ubuntu:24.04",
"time_s": 182.4, "peak_mb": 3072
},
"cases": [
{
"id": "php-shop:fastjson@1.2.68:Sink#deserialize",
"prediction": "reachable",
"confidence": 0.88,
"explain": {
"entry": "POST /api/orders",
"path": [
"OrdersController::create",
"Serializer::deserialize",
"Fastjson::parseObject"
],
"guards": ["feature.flag.json_enabled==true"]
}
}
],
"artifacts": {
"sbom": "sha256:…", "attestation": "sha256:…"
}
}
Folder layout (repo)
/benchmark
/cases/<lang>/<project>/<case_id>/
case.yaml # component@version, sink, labels, evidence refs
entrypoints.yaml # routes/CLIs/cron
build/ # Dockerfiles, lockfiles, pinned toolchains
outputs/ # SBOMs, binaries, traces (checksummed)
/splits/{train,dev,test}.txt
/schemas/{case.json,submission.json}
/scripts/{build.sh, run_tests.sh, score.py}
/docs/ (how-to, FAQs, T&Cs)
Minimal v1 (4–6 weeks of work)
- Languages: JS/TS, Python, Java, C (ELF).
- 20–30 cases: mix of reachable/unreachable with PoC unit tests.
- Deterministic builds in containers; publish SBOM+attestations.
- Scorer: precision/recall/F1 + explainability, runtime, determinism.
- Baselines: run CodeQL + Semgrep across all; Snyk where feasible; angr for 3 native cases.
- Website: static leaderboard (per‑lang, per‑size), download links, submission guide.
V2+ (quarterly)
- Add .NET, PHP, Go, Rust; broaden binary focus (PE/Mach‑O).
- Add dynamic traces (eBPF/ETW/JFR) and taint oracles.
- Introduce config‑gated reachability (feature flags, env, k8s secrets).
- Add dataset cards per case (threat model, CWE, false‑positive traps).
Publishing & governance
- License: CC‑BY‑SA for metadata, source‑compatible OSS for code, binaries under original licenses.
- Repro packs: benchmark-kit.tgz with container recipes, hashes, and attestations.
- Disclosure: CVE hygiene, responsible use, opt‑out path for upstreams.
- Stewards: small TAC (you + two external reviewers) to approve new cases and adjudicate disputes.
Immediate next steps (checklist)
- Lock the schemas (case + submission + attestation fields).
- Pick 8 seed projects (2 per language tiered by size).
- Draft 12 sink‑cases (6 reachable, 6 unreachable) with unit‑test oracles.
- Script deterministic builds and hash‑locked SBOMs.
- Implement the scorer; publish a starter leaderboard with 2 baselines.
- Ship v1 website/docs and open submissions.
If you want, I can generate the repo scaffold (folders, YAML/JSON schemas, Dockerfiles, scorer script) so your team can git clone and start adding cases immediately.
Cool, let’s turn the blueprint into a concrete, developer‑friendly implementation plan.
I’ll assume v1 scope is:
- Languages: JavaScript/TypeScript (Node), Python, Java, C (ELF)
- ~20–30 cases total (reachable/unreachable mix)
- Baselines: CodeQL, Semgrep, maybe Snyk where licenses allow, and angr for a few native cases
You can expand later, but this plan is enough to get v1 shipped.
0. Overall project structure & ownership
Owners
- Tech Lead – owns architecture & final decisions
- Benchmark Core – 2–3 devs building schemas, scorer, infra
- Language Tracks – 1 dev per language (JS, Python, Java, C)
- Website/Docs – 1 dev
Repo layout (target)
reachability-benchmark/
README.md
LICENSE
CONTRIBUTING.md
CODE_OF_CONDUCT.md
benchmark/
cases/
js/
express-blog/
case-001/
case.yaml
entrypoints.yaml
build/
Dockerfile
build.sh
src/ # project source (or submodule)
tests/ # unit tests as oracles
outputs/
sbom.cdx.json
binary.tar.gz
coverage.json
traces/ # optional dynamic traces
py/
flask-api/...
java/
spring-app/...
c/
httpd-like/...
schemas/
case.schema.yaml
entrypoints.schema.yaml
truth.schema.yaml
submission.schema.json
tools/
scorer/
rb_score/
__init__.py
cli.py
metrics.py
loader.py
explainability.py
pyproject.toml
tests/
build/
build_all.py
validate_builds.py
baselines/
codeql/
run_case.sh
config/
semgrep/
run_case.sh
rules/
snyk/
run_case.sh
angr/
run_case.sh
ci/
github/
benchmark.yml
website/
# static site / leaderboard
1. Phase 1 – Repo & infra setup
Task 1.1 – Create repository
Developer: Tech Lead
Deliverables:
- Repo created (reachability-benchmark or similar)
- LICENSE (e.g., Apache-2.0 or MIT)
- Basic README.md describing:
  - Purpose (public reachability benchmark)
  - High‑level design
  - v1 scope (langs, #cases)
Task 1.2 – Bootstrap structure
Developer: Benchmark Core
Create directory skeleton as above (without filling everything yet).
Add:
# benchmark/Makefile
.PHONY: test lint build

test:
	pytest benchmark/tools/scorer/tests

lint:
	black benchmark/tools/scorer
	flake8 benchmark/tools/scorer

build:
	python benchmark/tools/build/build_all.py
Task 1.3 – Coding standards & tooling
Developer: Benchmark Core
- Add .editorconfig, .gitignore, and Python tool configs (ruff, black, or flake8).
- Define a minimal PR checklist in CONTRIBUTING.md:
  - Tests pass
  - Lint passes
  - New schemas have a JSON Schema or YAML schema and tests
  - New cases come with oracles (tests/coverage)
2. Phase 2 – Case & submission schemas
Task 2.1 – Define case metadata format
Developer: Benchmark Core
Create benchmark/schemas/case.schema.yaml and an example case.yaml.
Example case.yaml
id: "js-express-blog:001"
language: "javascript"
framework: "express"
size: "small" # small | medium | large
component:
name: "express-blog"
version: "1.0.0-bench"
vulnerability:
cve: "CVE-XXXX-YYYY"
cwe: "CWE-502"
description: "Unsafe deserialization via user-controlled JSON."
sink_id: "Deserializer::parse"
ground_truth:
label: "reachable" # reachable | unreachable | unknown
confidence: "high" # high | medium | low
evidence_files:
- "truth.yaml"
notes: >
Unit test test_reachable_deserialization triggers the sink.
build:
dockerfile: "build/Dockerfile"
build_script: "build/build.sh"
output:
artifact_path: "outputs/binary.tar.gz"
sbom_path: "outputs/sbom.cdx.json"
coverage_path: "outputs/coverage.json"
traces_dir: "outputs/traces"
environment:
os_image: "ubuntu:24.04"
compiler: null
runtime:
node: "20.11.0"
source_date_epoch: 1730000000
Acceptance criteria
- Schema validates the sample case.yaml with a Python script: benchmark/tools/build/validate_schema.py using jsonschema or pykwalify.
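A minimal sketch of validate_schema.py, assuming the case schema is expressed as JSON Schema stored in YAML and validated with the jsonschema library (pykwalify would work similarly):

#!/usr/bin/env python3
"""Validate a case.yaml against benchmark/schemas/case.schema.yaml (sketch)."""
import sys

import yaml
from jsonschema import Draft202012Validator


def main(schema_path: str, case_path: str) -> int:
    with open(schema_path) as f:
        schema = yaml.safe_load(f)
    with open(case_path) as f:
        case = yaml.safe_load(f)

    errors = list(Draft202012Validator(schema).iter_errors(case))
    for err in errors:
        location = "/".join(str(p) for p in err.path) or "<root>"
        print(f"{case_path}: {location}: {err.message}")
    return 1 if errors else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))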
Task 2.2 – Entry points schema
Developer: Benchmark Core
benchmark/schemas/entrypoints.schema.yaml
Example entrypoints.yaml
entries:
http:
- id: "POST /api/posts"
route: "/api/posts"
method: "POST"
handler: "PostsController.create"
cli:
- id: "generate-report"
command: "node cli.js generate-report"
description: "Generates summary report."
scheduled:
- id: "daily-cleanup"
schedule: "0 3 * * *"
handler: "CleanupJob.run"
Task 2.3 – Ground truth / truth schema
Developer: Benchmark Core + Language Tracks
benchmark/schemas/truth.schema.yaml
Example truth.yaml
id: "js-express-blog:001"
cases:
- sink_id: "Deserializer::parse"
label: "reachable"
dynamic_evidence:
covered_by_tests:
- "tests/test_reachable_deserialization.js::should_reach_sink"
coverage_files:
- "outputs/coverage.json"
static_evidence:
call_path:
- "POST /api/posts"
- "PostsController.create"
- "PostsService.createFromJson"
- "Deserializer.parse"
config_conditions:
- "process.env.FEATURE_JSON_ENABLED == 'true'"
notes: "If FEATURE_JSON_ENABLED=false, path is unreachable."
Task 2.4 – Submission schema
Developer: Benchmark Core
benchmark/schemas/submission.schema.json
Shape
{
"tool": { "name": "YourTool", "version": "1.2.3" },
"run": {
"commit": "abcd1234",
"platform": "ubuntu:24.04",
"time_s": 182.4,
"peak_mb": 3072
},
"cases": [
{
"id": "js-express-blog:001",
"prediction": "reachable",
"confidence": 0.88,
"explain": {
"entry": "POST /api/posts",
"path": [
"PostsController.create",
"PostsService.createFromJson",
"Deserializer.parse"
],
"guards": [
"process.env.FEATURE_JSON_ENABLED === 'true'"
]
}
}
],
"artifacts": {
"sbom": "sha256:...",
"attestation": "sha256:..."
}
}
Write a Python validation utility:
python benchmark/tools/scorer/validate_submission.py submission.json
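A possible shape for validate_submission.py, assuming submission.schema.json is standard JSON Schema (the fixed schema path is an assumption):

#!/usr/bin/env python3
"""Validate a tool submission against benchmark/schemas/submission.schema.json (sketch)."""
import json
import sys
from pathlib import Path

from jsonschema import Draft202012Validator

SCHEMA_PATH = Path("benchmark/schemas/submission.schema.json")


def validate(submission_path: str) -> int:
    schema = json.loads(SCHEMA_PATH.read_text())
    submission = json.loads(Path(submission_path).read_text())

    errors = list(Draft202012Validator(schema).iter_errors(submission))
    for err in errors:
        location = "/".join(str(p) for p in err.absolute_path) or "<root>"
        print(f"{submission_path}: {location}: {err.message}")

    if errors:
        print(f"FAIL: {len(errors)} schema violation(s)")
        return 1
    print("OK: submission is schema-valid")
    return 0


if __name__ == "__main__":
    sys.exit(validate(sys.argv[1]))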
Acceptance criteria
- Validation fails on missing fields / wrong enum values.
- At least two sample submissions pass validation (e.g., “perfect” and “random baseline”).
3. Phase 3 – Reference projects & deterministic builds
Task 3.1 – Select and vendor v1 projects
Developer: Tech Lead + Language Tracks
For each language, choose:
- 1 small toy app (simple web or CLI)
- 1 medium app (more routes, multiple modules)
- Optional: 1 large (for performance stress tests)
Add them under benchmark/cases/<lang>/<project>/src/
(or as git submodules if you want to track upstream).
Task 3.2 – Deterministic Docker build per project
Developer: Language Tracks
For each project:
- Create build/Dockerfile
- Create build/build.sh that:
  - Builds the app
  - Produces artifacts
  - Generates SBOM and attestation
Example build/Dockerfile (Node)
FROM node:20.11-slim
ENV NODE_ENV=production
ENV SOURCE_DATE_EPOCH=1730000000
WORKDIR /app
COPY src/ /app
COPY package.json package-lock.json /app/
RUN npm ci --ignore-scripts && \
npm run build || true
CMD ["node", "server.js"]
Example build.sh
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(dirname "$(readlink -f "$0")")/.."
OUT_DIR="$ROOT_DIR/outputs"
mkdir -p "$OUT_DIR"
IMAGE_TAG="rb-js-express-blog:1"
docker build -t "$IMAGE_TAG" "$ROOT_DIR/build"
# Export image as tarball (binary artifact)
docker save "$IMAGE_TAG" | gzip > "$OUT_DIR/binary.tar.gz"
# Generate SBOM (e.g. via syft) – can be optional stub for v1
syft packages "docker:$IMAGE_TAG" -o cyclonedx-json > "$OUT_DIR/sbom.cdx.json"
# In future: generate in-toto attestations
Task 3.3 – Determinism checker
Developer: Benchmark Core
benchmark/tools/build/validate_builds.py:
- For each case:
  - Run build.sh twice
  - Compare hashes of outputs/binary.tar.gz and outputs/sbom.cdx.json
- Fail if hashes differ.
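A minimal sketch of this determinism check, assuming each case directory contains build/build.sh and writes the two artifacts under outputs/ (the directory glob and helper names are illustrative):

#!/usr/bin/env python3
"""Re-run each case build twice and fail if artifact hashes differ (sketch)."""
import hashlib
import subprocess
import sys
from pathlib import Path

ARTIFACTS = ["outputs/binary.tar.gz", "outputs/sbom.cdx.json"]


def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_once(case_dir: Path) -> dict[str, str]:
    subprocess.run(["bash", "build/build.sh"], cwd=case_dir, check=True)
    return {a: sha256(case_dir / a) for a in ARTIFACTS}


def main(cases_root: str) -> int:
    failures = 0
    for case_dir in sorted(Path(cases_root).glob("*/*/*")):
        if not (case_dir / "build" / "build.sh").exists():
            continue
        first, second = build_once(case_dir), build_once(case_dir)
        if first != second:
            print(f"NON-DETERMINISTIC: {case_dir}")
            failures += 1
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "benchmark/cases"))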
Acceptance criteria
- All v1 cases produce identical artifacts across two builds on CI.
4. Phase 4 – Ground truth oracles (tests & traces)
Task 4.1 – Add unit/integration tests for reachable cases
Developer: Language Tracks
For each reachable case:
- Add tests/ under the project to:
  - Start the app (if necessary)
  - Send a request/trigger that reaches the vulnerable sink
  - Assert that a sentinel side effect occurs (e.g. a log line or marker file) instead of real exploitation.
Example for Node using Jest with supertest (app and sinkWasReached are exported by the project under test; the import path shown is illustrative):
const request = require("supertest");
const { app, sinkWasReached } = require("../src/app");

test("should reach deserialization sink", async () => {
  const res = await request(app)
    .post("/api/posts")
    .send({ title: "x", body: '{"__proto__":{}}' });
  expect(res.statusCode).toBe(200);
  // Sink logs "REACH_SINK" – we check the sentinel via a helper
  expect(sinkWasReached()).toBe(true);
});
Task 4.2 – Instrument coverage
Developer: Language Tracks
- For each language, pick a coverage tool:
  - JS: nyc + istanbul
  - Python: coverage.py
  - Java: JaCoCo
  - C: gcov / llvm-cov (optional for v1)
- Ensure running the tests produces outputs/coverage.json or .xml, which we then convert to a simple JSON format:
{
"files": {
"src/controllers/posts.js": {
"lines_covered": [12, 13, 14, 27],
"lines_total": 40
}
}
}
Create a small converter script if needed.
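As an example, a converter from coverage.py's JSON report into the simplified format above might look like this (field names follow coverage.py's json report as I understand it; treat them as approximate if your version differs):

#!/usr/bin/env python3
"""Convert a coverage.py JSON report into the benchmark's simplified coverage format (sketch)."""
import json
import sys


def convert(coverage_json_path: str, out_path: str) -> None:
    with open(coverage_json_path) as f:
        report = json.load(f)
    simplified = {"files": {}}
    for path, data in report.get("files", {}).items():
        simplified["files"][path] = {
            # coverage.py lists executed line numbers per file
            "lines_covered": data.get("executed_lines", []),
            "lines_total": data.get("summary", {}).get("num_statements", 0),
        }
    with open(out_path, "w") as f:
        json.dump(simplified, f, indent=2, sort_keys=True)


if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])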
Task 4.3 – Optional dynamic traces
If you want richer evidence:
- JS: add middleware that logs (entry_id, handler, sink) triples to outputs/traces/traces.json
- Python: similar, using decorators (see the sketch below)
- C/Java: out of scope for v1 unless you want to invest extra time.
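A rough sketch of the Python variant, assuming handlers are decorated by hand and the entry id is known at decoration time (record_trace and the file path are illustrative):

import functools
import json
from pathlib import Path

TRACE_FILE = Path("outputs/traces/traces.json")


def record_trace(entry_id: str, sink: str | None = None):
    """Decorator that appends (entry_id, handler, sink) triples to the trace file."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            TRACE_FILE.parent.mkdir(parents=True, exist_ok=True)
            traces = json.loads(TRACE_FILE.read_text()) if TRACE_FILE.exists() else []
            traces.append({"entry": entry_id, "handler": handler.__qualname__, "sink": sink})
            TRACE_FILE.write_text(json.dumps(traces, indent=2))
            return handler(*args, **kwargs)
        return wrapper
    return decorator


# Usage (illustrative Flask-style handler):
# @app.route("/api/posts", methods=["POST"])
# @record_trace(entry_id="POST /api/posts", sink="Deserializer::parse")
# def create_post(): ...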
5. Phase 5 – Scoring tool (CLI)
Task 5.1 – Implement rb-score library + CLI
Developer: Benchmark Core
Create benchmark/tools/scorer/rb_score/ with:
- loader.py
  - Load all case.yaml and truth.yaml files into memory.
  - Provide functions: load_cases() -> Dict[case_id, Case].
- metrics.py
  - Implement:
    - compute_precision_recall(truth, predictions)
    - compute_path_quality_score(explain_block) (0–3)
    - compute_runtime_stats(run_block)
- cli.py
  - CLI:
rb-score \
--cases-root benchmark/cases \
--submission submissions/mytool.json \
--output results/mytool_results.json
Pseudo-code for core scoring
def score_submission(truth, submission):
    y_true = []
    y_pred = []
    per_case_scores = {}

    for case_id, truth_case in truth.items():
        gt = truth_case.label  # "reachable" / "unreachable"
        pred_case = find_pred_case(submission.cases, case_id)
        # Missing predictions are treated as "unreachable"
        pred_label = pred_case.prediction if pred_case else "unreachable"

        y_true.append(gt == "reachable")
        y_pred.append(pred_label == "reachable")

        explain_score = explainability(pred_case.explain if pred_case else None)
        per_case_scores[case_id] = {
            "gt": gt,
            "pred": pred_label,
            "explainability": explain_score,
        }

    precision, recall, f1 = compute_prf(y_true, y_pred)
    return {
        "summary": {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "num_cases": len(truth),
        },
        "cases": per_case_scores,
    }
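compute_prf above can be a small dependency-free helper; a sketch matching the boolean lists built in the pseudo-code (the 0.0 convention for undefined values is an assumption the scorer must document):

def compute_prf(y_true: list[bool], y_pred: list[bool]) -> tuple[float, float, float]:
    """Precision, recall, F1 for binary reachability labels.

    By convention, values are reported as 0.0 when undefined
    (no predicted or no actual positives).
    """
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1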
Task 5.2 – Explainability scoring rules
Developer: Benchmark Core
Implement explainability(explain):
- 0 – explain missing or path empty
- 1 – path present with at least 2 nodes (sink + one function)
- 2 – path contains:
  - An entry label (HTTP route / CLI id)
  - ≥3 nodes (entry → … → sink)
- 3 – Level 2 plus a non-empty guards list
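A direct implementation of these rules (field names follow the submission schema above; the treatment of a single-node path is a convention, since the rules leave it unspecified):

def explainability(explain: dict | None) -> int:
    """Score an explain block 0-3 per the rules above."""
    if not explain or not explain.get("path"):
        return 0                            # rule 0: explain missing or path empty

    path = explain["path"]
    has_entry = bool(explain.get("entry"))
    has_guards = bool(explain.get("guards"))

    if has_entry and len(path) >= 3:        # rule 2: entry label + >=3 nodes
        return 3 if has_guards else 2       # rule 3: plus non-empty guards
    if len(path) >= 2:                      # rule 1: sink + at least one function
        return 1
    return 0                                # single-node path: no usable chain (convention)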
Unit tests for at least 4 scenarios.
Task 5.3 – Regression tests for scoring
Add a small test fixture:
- Tiny synthetic benchmark: 3 cases, 2 reachable, 1 unreachable.
- 3 submissions:
  - Perfect
  - All reachable
  - All unreachable
Assertions:
- Perfect: precision=1, recall=1
- All reachable: recall=1, precision<1 (2 of 3 predicted positives are correct)
- All unreachable: recall=0; precision is undefined (no positive predictions), so assert whichever convention the scorer adopts (e.g. 0).
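A pytest sketch of these assertions, assuming the compute_prf sketch above lives in rb_score.metrics and uses the 0.0 convention for undefined precision:

import pytest

from rb_score.metrics import compute_prf

# Ground truth for the synthetic fixture: 2 reachable, 1 unreachable
Y_TRUE = [True, True, False]


def test_perfect_submission():
    precision, recall, f1 = compute_prf(Y_TRUE, [True, True, False])
    assert precision == 1.0 and recall == 1.0 and f1 == 1.0


def test_all_reachable_submission():
    precision, recall, _ = compute_prf(Y_TRUE, [True, True, True])
    assert recall == 1.0
    assert precision == pytest.approx(2 / 3)


def test_all_unreachable_submission():
    precision, recall, _ = compute_prf(Y_TRUE, [False, False, False])
    assert recall == 0.0
    assert precision == 0.0  # scorer convention when there are no positive predictions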
6. Phase 6 – Baseline integrations
Task 6.1 – Semgrep baseline
Developer: Benchmark Core (with Semgrep experience)
- baselines/semgrep/run_case.sh:
  - Inputs: case_id, cases_root, output_path
  - Steps:
    - Find src/ for the case
    - Run semgrep --config auto or curated rules
    - Convert Semgrep findings into the benchmark submission format (see the sketch below):
      - Map Semgrep rules → vulnerability types → candidate sinks
      - Heuristically guess reachability (for v1, maybe always “reachable” if the sink appears in a code path)
  - Output: output_path JSON conforming to submission.schema.json.
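A minimal converter sketch, assuming Semgrep is invoked with --json (the rule-to-sink mapping, confidence value, and run metadata are placeholders you would curate per case):

#!/usr/bin/env python3
"""Turn `semgrep --json` output into a benchmark submission (sketch)."""
import json
import sys

# Curated mapping from Semgrep rule ids to benchmark sink ids (placeholder entry)
RULE_TO_SINK = {
    "example.placeholder.rule-id": "Deserializer::parse",
}


def convert(semgrep_json_path: str, case_id: str, out_path: str) -> None:
    with open(semgrep_json_path) as f:
        findings = json.load(f).get("results", [])

    cases = []
    for result in findings:
        sink = RULE_TO_SINK.get(result.get("check_id"))
        if not sink:
            continue
        cases.append({
            "id": case_id,
            # v1 heuristic: any matched sink is reported as reachable
            "prediction": "reachable",
            "confidence": 0.5,
            "explain": {"entry": "unknown", "path": [f"{result.get('path')}:{sink}"], "guards": []},
        })

    submission = {
        "tool": {"name": "semgrep-baseline", "version": "0.1"},
        "run": {"commit": "unknown", "platform": "local", "time_s": 0, "peak_mb": 0},
        "cases": cases,
        "artifacts": {},
    }
    with open(out_path, "w") as f:
        json.dump(submission, f, indent=2)


if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2], sys.argv[3])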
Task 6.2 – CodeQL baseline
- Create CodeQL databases for each project (likely via codeql database create).
- Create queries targeting known sinks (e.g., Deserialization, CommandInjection).
- baselines/codeql/run_case.sh:
  - Build DB (or reuse)
  - Run queries
  - Translate results into our submission format (again as heuristic reachability).
Task 6.3 – Optional Snyk / angr baselines
- Snyk:
  - Use snyk test on the project
  - Map results to dependencies & known CVEs
  - For v1, just mark a case as reachable if Snyk reports a reachable path (where available).
- angr:
  - For 1–2 small C samples, configure a simple analysis script.
Acceptance criteria
- For at least 5 cases (across languages), the baselines produce valid submission JSON.
- rb-score runs and yields metrics without errors.
7. Phase 7 – CI/CD
Task 7.1 – GitHub Actions workflow
Developer: Benchmark Core
ci/github/benchmark.yml:
Jobs:
- lint-and-test
  - python -m pip install -e "benchmark/tools/scorer[dev]"
  - make lint
  - make test
- build-cases
  - python benchmark/tools/build/build_all.py
  - Run validate_builds.py
- smoke-baselines
  - For 2–3 cases, run the Semgrep/CodeQL wrappers and ensure they emit valid submissions.
Task 7.2 – Artifact upload
- Upload the outputs/ tarball from build-cases as workflow artifacts.
- Upload results/*.json from scoring runs.
8. Phase 8 – Website & leaderboard
Task 8.1 – Define results JSON format
Developer: Benchmark Core + Website dev
results/leaderboard.json:
{
"tools": [
{
"name": "Semgrep",
"version": "1.60.0",
"summary": {
"precision": 0.72,
"recall": 0.48,
"f1": 0.58
},
"by_language": {
"javascript": {"precision": 0.80, "recall": 0.50, "f1": 0.62},
"python": {"precision": 0.65, "recall": 0.45, "f1": 0.53}
}
}
]
}
CLI option to generate this:
rb-score compare \
--cases-root benchmark/cases \
--submissions submissions/*.json \
--output results/leaderboard.json
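Internally, rb-score compare can be a small aggregation step over per-tool result files; a sketch, assuming each results file records the tool block and a per-language breakdown (both assumptions about the results layout):

#!/usr/bin/env python3
"""Merge per-tool result files into results/leaderboard.json (sketch)."""
import glob
import json
import sys


def build_leaderboard(results_glob: str, out_path: str) -> None:
    tools = []
    for path in sorted(glob.glob(results_glob)):
        with open(path) as f:
            result = json.load(f)
        tools.append({
            "name": result["tool"]["name"],
            "version": result["tool"]["version"],
            "summary": result["summary"],
            # Assumes the scorer already grouped per-case metrics by language
            "by_language": result.get("by_language", {}),
        })
    with open(out_path, "w") as f:
        json.dump({"tools": tools}, f, indent=2)


if __name__ == "__main__":
    build_leaderboard(sys.argv[1], sys.argv[2])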
Task 8.2 – Static site
Developer: Website dev
Tech choice: any static framework (Next.js, Astro, Docusaurus, or even pure HTML+JS).
Pages:
- Home
  - What is reachability?
  - Summary of the benchmark
- Leaderboard
  - Renders leaderboard.json
  - Filters: language, case size
- Docs
  - How to run the benchmark locally
  - How to prepare a submission
Add a simple script to copy results/leaderboard.json into website/public/ for publishing.
9. Phase 9 – Docs, governance, and contribution flow
Task 9.1 – CONTRIBUTING.md
Include:
- How to add a new case, step by step:
  1. Create a project folder under benchmark/cases/<lang>/<project>/case-XXX/
  2. Add case.yaml, entrypoints.yaml, truth.yaml
  3. Add oracles (tests, coverage)
  4. Add deterministic build/ assets
  5. Run local tooling:
     - validate_schema.py
     - validate_builds.py --case <id>
- Example PR description template.
Task 9.2 – Governance doc
- Define Technical Advisory Committee (TAC) roles:
  - Approve new cases
  - Approve schema changes
  - Manage hidden test sets (future phase)
- Define the release cadence:
  - v1.0 with public cases
  - Quarterly updates with new hidden cases.
10. Suggested milestone breakdown (for planning / sprints)
Milestone 1 – Foundation (1–2 sprints)
- Repo scaffolding (Tasks 1.x)
- Schemas (Tasks 2.x)
- Two tiny toy cases (one JS, one Python) with:
  - case.yaml, entrypoints.yaml, truth.yaml
  - Deterministic build
  - Basic unit tests
- Minimal rb-score with:
  - Case loading
  - Precision/recall only
Exit: You can run rb-score on a dummy submission for 2 cases.
Milestone 2 – v1 dataset (2–3 sprints)
- Add ~20–30 cases across JS, Python, Java, C
- Ground truth & coverage for each
- Deterministic builds validated
- Explainability scoring implemented
- Regression tests for rb-score
Exit: Full scoring tool stable; dataset repeatably builds on CI.
Milestone 3 – Baselines & site (1–2 sprints)
- Semgrep + CodeQL baselines producing valid submissions
- CI running smoke baselines
- leaderboard.json generator
- Static website with public leaderboard and docs
Exit: Public v1 benchmark you can share with external tool authors.
If you tell me which stack your team prefers for the site (React, plain HTML, SSG, etc.) or which CI you’re on, I can adapt this into concrete config files (e.g., a full GitHub Actions workflow, Next.js scaffold, or exact pyproject.toml for rb-score).