Here’s a clean, action‑ready blueprint for a **public reachability benchmark** you can stand up quickly and grow over time.

# Why this matters (quick)

“Reachability” asks: *is a flagged vulnerability actually executable from real entry points in this codebase/container?* A public, reproducible benchmark lets you compare tools apples‑to‑apples, drive research, and keep vendors honest.

# What to collect (dataset design)

* **Projects & languages**

  * Polyglot mix: **C/C++ (ELF/PE/Mach‑O)**, **Java/Kotlin**, **C#/.NET**, **Python**, **JavaScript/TypeScript**, **PHP**, **Go**, **Rust**.
  * For each project: small (≤5k LOC), medium (5–100k), large (100k+).

* **Ground‑truth artifacts**

  * **Seed CVEs** with known sinks (e.g., deserializers, command exec, SSRF) and **neutral projects** with *no* reachable path (negatives).
  * **Exploit oracles**: minimal PoCs or unit tests that (1) reach the sink and (2) toggle reachability via feature flags.

* **Build outputs (deterministic)**

  * **Reproducible binaries/bytecode** (strip timestamps; fixed seeds; `SOURCE_DATE_EPOCH`).
  * **SBOM** (CycloneDX/SPDX) + **PURLs** + **Build‑ID** (ELF `.note.gnu.build-id` / PE Authentihash / Mach‑O UUID).
  * **Attestations**: in‑toto/DSSE envelopes recording toolchain versions, flags, hashes.

* **Execution traces (for truth)**

  * **CI traces**: call‑graph dumps from compilers/analyzers; unit‑test coverage; optional **dynamic traces** (eBPF / .NET ETW / Java Flight Recorder).
  * **Entry‑point manifests**: HTTP routes, CLI commands, cron/queue consumers.

* **Metadata**

  * Language, framework, package manager, compiler versions, OS/container image, optimization level, stripping info, license.

# How to label ground truth

* **Per‑vuln case**: `(component, version, sink_id)` with label **reachable / unreachable / unknown**.
* **Evidence bundle**: pointer to (a) static call path, (b) dynamic hit (trace/coverage), or (c) rationale for negative.
* **Confidence**: high (static+dynamic agree), medium (one source), low (heuristic only).
# Scoring (simple + fair)

* **Binary classification** on cases (a metrics sketch follows this list):

  * Precision, Recall, F1. Report **AU‑PR** if you output probabilities.

* **Path quality**

  * **Explainability score (0–3)**:

    * 0: “vuln reachable” with no context
    * 1: names only (entry→…→sink)
    * 2: full interprocedural path with locations
    * 3: plus **inputs/guards** (taint/constraints, env flags)

* **Runtime cost**

  * Wall‑clock, peak RAM, image size; normalized by KLOC.

* **Determinism**

  * Re‑run variance (≤1% is “A”, 1–5% “B”, >5% “C”).
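
To make these definitions concrete, here is a minimal Python sketch of the core computations; the function names (`compute_prf`, `determinism_grade`) are illustrative, and the determinism grade follows the A/B/C bands above.

```python
from typing import List, Tuple

def compute_prf(y_true: List[bool], y_pred: List[bool]) -> Tuple[float, float, float]:
    """Precision/recall/F1 over reachable-vs-not labels (0/0 reported as 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def determinism_grade(run_times: List[float]) -> str:
    """Grade re-run variance: spread ≤1% is 'A', 1–5% 'B', >5% 'C'."""
    mean = sum(run_times) / len(run_times)  # assumes at least one run
    spread = (max(run_times) - min(run_times)) / mean
    return "A" if spread <= 0.01 else ("B" if spread <= 0.05 else "C")
```
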
# Avoiding overfitting

* **Train/Dev/Test** splits per language; **hidden test** projects rotated quarterly.
* **Case churn**: introduce **isomorphic variants** (rename symbols, reorder files) to punish memorization.
* **Poisoned controls**: include decoy sinks and unreachable dead‑code traps.
* **Submission rules**: require **attestations** of tool versions & flags; limit per‑case hints.

# Reference baselines (to run out‑of‑the‑box)

* **Snyk Code/Reachability** (JS/Java/Python, SaaS/CLI).
* **Semgrep + Pro Engine** (rules + reachability mode).
* **CodeQL** (multi‑lang, LGTM‑style queries).
* **Joern** (C/C++/JVM code property graphs).
* **angr** (binary symbolic exec; selective for native samples).
* **Language‑specific**: pip‑audit with import graphs, npm with lock‑tree + route discovery, Maven + call‑graph (Soot/WALA).
# Submission format (one JSON per tool run)

```json
{
  "tool": {"name": "YourTool", "version": "1.2.3"},
  "run": {
    "commit": "…",
    "platform": "ubuntu:24.04",
    "time_s": 182.4, "peak_mb": 3072
  },
  "cases": [
    {
      "id": "php-shop:fastjson@1.2.68:Sink#deserialize",
      "prediction": "reachable",
      "confidence": 0.88,
      "explain": {
        "entry": "POST /api/orders",
        "path": [
          "OrdersController::create",
          "Serializer::deserialize",
          "Fastjson::parseObject"
        ],
        "guards": ["feature.flag.json_enabled==true"]
      }
    }
  ],
  "artifacts": {
    "sbom": "sha256:…", "attestation": "sha256:…"
  }
}
```
# Folder layout (repo)

```
/benchmark
  /cases/<lang>/<project>/<case_id>/
    case.yaml          # component@version, sink, labels, evidence refs
    entrypoints.yaml   # routes/CLIs/cron
    build/             # Dockerfiles, lockfiles, pinned toolchains
    outputs/           # SBOMs, binaries, traces (checksummed)
  /splits/{train,dev,test}.txt
  /schemas/{case.json,submission.json}
  /scripts/{build.sh, run_tests.sh, score.py}
  /docs/               # how-to, FAQs, T&Cs
```
# Minimal **v1** (4–6 weeks of work)

1. **Languages**: JS/TS, Python, Java, C (ELF).
2. **20–30 cases**: mix of reachable/unreachable with PoC unit tests.
3. **Deterministic builds** in containers; publish SBOM + attestations.
4. **Scorer**: precision/recall/F1 + explainability, runtime, determinism.
5. **Baselines**: run CodeQL + Semgrep across all; Snyk where feasible; angr for 3 native cases.
6. **Website**: static leaderboard (per‑lang, per‑size), download links, submission guide.

# V2+ (quarterly)

* Add **.NET, PHP, Go, Rust**; broaden binary focus (PE/Mach‑O).
* Add **dynamic traces** (eBPF/ETW/JFR) and **taint oracles**.
* Introduce **config‑gated reachability** (feature flags, env, k8s secrets).
* Add **dataset cards** per case (threat model, CWE, false‑positive traps).

# Publishing & governance

* License: **CC‑BY‑SA** for metadata, **source‑compatible OSS** for code, binaries under original licenses.
* **Repro packs**: `benchmark-kit.tgz` with container recipes, hashes, and attestations.
* **Disclosure**: CVE hygiene, responsible use, opt‑out path for upstreams.
* **Stewards**: small TAC (you + two external reviewers) to approve new cases and adjudicate disputes.
# Immediate next steps (checklist)

* Lock the **schemas** (case + submission + attestation fields).
* Pick 8 seed projects (2 per language, tiered by size).
* Draft 12 sink‑cases (6 reachable, 6 unreachable) with unit‑test oracles.
* Script deterministic builds and **hash‑locked SBOMs**.
* Implement the scorer; publish a **starter leaderboard** with 2 baselines.
* Ship **v1 website/docs** and open submissions.

If you want, I can generate the repo scaffold (folders, YAML/JSON schemas, Dockerfiles, scorer script) so your team can `git clone` and start adding cases immediately.
Cool, let’s turn the blueprint into a concrete, developer‑friendly implementation plan.

I’ll assume **v1 scope** is:

* Languages: **JavaScript/TypeScript (Node)**, **Python**, **Java**, **C (ELF)**
* ~**20–30 cases** total (reachable/unreachable mix)
* Baselines: **CodeQL**, **Semgrep**, maybe **Snyk** where licenses allow, and **angr** for a few native cases

You can expand later, but this plan is enough to get v1 shipped.

---

## 0. Overall project structure & ownership

**Owners**

* **Tech Lead** – owns architecture & final decisions
* **Benchmark Core** – 2–3 devs building schemas, scorer, infra
* **Language Tracks** – 1 dev per language (JS, Python, Java, C)
* **Website/Docs** – 1 dev

**Repo layout (target)**
```text
reachability-benchmark/
  README.md
  LICENSE
  CONTRIBUTING.md
  CODE_OF_CONDUCT.md

  benchmark/
    cases/
      js/
        express-blog/
          case-001/
            case.yaml
            entrypoints.yaml
            build/
              Dockerfile
              build.sh
            src/            # project source (or submodule)
            tests/          # unit tests as oracles
            outputs/
              sbom.cdx.json
              binary.tar.gz
              coverage.json
              traces/       # optional dynamic traces
      py/
        flask-api/...
      java/
        spring-app/...
      c/
        httpd-like/...
    schemas/
      case.schema.yaml
      entrypoints.schema.yaml
      truth.schema.yaml
      submission.schema.json
    tools/
      scorer/
        rb_score/
          __init__.py
          cli.py
          metrics.py
          loader.py
          explainability.py
        pyproject.toml
        tests/
      build/
        build_all.py
        validate_builds.py

  baselines/
    codeql/
      run_case.sh
      config/
    semgrep/
      run_case.sh
      rules/
    snyk/
      run_case.sh
    angr/
      run_case.sh

  ci/
    github/
      benchmark.yml

  website/
    # static site / leaderboard
```

---

## 1. Phase 1 – Repo & infra setup

### Task 1.1 – Create repository

**Developer:** Tech Lead
**Deliverables:**

* Repo created (`reachability-benchmark` or similar)
* `LICENSE` (e.g., Apache-2.0 or MIT)
* Basic `README.md` describing:

  * Purpose (public reachability benchmark)
  * High‑level design
  * v1 scope (languages, number of cases)

### Task 1.2 – Bootstrap structure

**Developer:** Benchmark Core

Create the directory skeleton as above (without filling everything in yet).

Add:
```make
# benchmark/Makefile
.PHONY: test lint build

test:
	pytest benchmark/tools/scorer/tests

lint:
	black --check benchmark/tools/scorer
	flake8 benchmark/tools/scorer

build:
	python benchmark/tools/build/build_all.py
```

### Task 1.3 – Coding standards & tooling

**Developer:** Benchmark Core

* Add `.editorconfig`, `.gitignore`, and Python tool configs (`ruff`, `black`, or `flake8`).
* Define a minimal **PR checklist** in `CONTRIBUTING.md`:

  * Tests pass
  * Lint passes
  * New schemas ship with a JSON or YAML schema and tests
  * New cases come with oracles (tests/coverage)

---

## 2. Phase 2 – Case & submission schemas

### Task 2.1 – Define case metadata format

**Developer:** Benchmark Core

Create `benchmark/schemas/case.schema.yaml` and an example `case.yaml`.

**Example `case.yaml`**
```yaml
id: "js-express-blog:001"
language: "javascript"
framework: "express"
size: "small"                  # small | medium | large
component:
  name: "express-blog"
  version: "1.0.0-bench"
vulnerability:
  cve: "CVE-XXXX-YYYY"
  cwe: "CWE-502"
  description: "Unsafe deserialization via user-controlled JSON."
  sink_id: "Deserializer::parse"
ground_truth:
  label: "reachable"           # reachable | unreachable | unknown
  confidence: "high"           # high | medium | low
  evidence_files:
    - "truth.yaml"
  notes: >
    Unit test test_reachable_deserialization triggers the sink.
build:
  dockerfile: "build/Dockerfile"
  build_script: "build/build.sh"
  output:
    artifact_path: "outputs/binary.tar.gz"
    sbom_path: "outputs/sbom.cdx.json"
    coverage_path: "outputs/coverage.json"
    traces_dir: "outputs/traces"
environment:
  os_image: "ubuntu:24.04"
  compiler: null
  runtime:
    node: "20.11.0"
  source_date_epoch: 1730000000
```
**Acceptance criteria**

* Schema validates the sample `case.yaml` with a Python script:

  * `benchmark/tools/build/validate_schema.py` using `jsonschema` or `pykwalify`.
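
A minimal sketch of what `validate_schema.py` could look like, assuming the schema is authored as JSON Schema in YAML and validated with the `jsonschema` package; the CLI shape is illustrative:

```python
#!/usr/bin/env python3
"""Sketch of benchmark/tools/build/validate_schema.py (internals assumed)."""
import sys
import yaml                                        # pip install pyyaml
from jsonschema import validate, ValidationError  # pip install jsonschema

def main(schema_path: str, doc_path: str) -> int:
    with open(schema_path) as f:
        schema = yaml.safe_load(f)   # JSON Schema written as YAML
    with open(doc_path) as f:
        doc = yaml.safe_load(f)
    try:
        validate(instance=doc, schema=schema)
    except ValidationError as err:
        print(f"INVALID {doc_path}: {err.message}")
        return 1
    print(f"OK {doc_path}")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```
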
---

### Task 2.2 – Entry points schema

**Developer:** Benchmark Core

`benchmark/schemas/entrypoints.schema.yaml`

**Example `entrypoints.yaml`**
```yaml
entries:
  http:
    - id: "POST /api/posts"
      route: "/api/posts"
      method: "POST"
      handler: "PostsController.create"
  cli:
    - id: "generate-report"
      command: "node cli.js generate-report"
      description: "Generates summary report."
  scheduled:
    - id: "daily-cleanup"
      schedule: "0 3 * * *"
      handler: "CleanupJob.run"
```

---

### Task 2.3 – Ground truth / truth schema

**Developer:** Benchmark Core + Language Tracks

`benchmark/schemas/truth.schema.yaml`

**Example `truth.yaml`**
```yaml
id: "js-express-blog:001"
cases:
  - sink_id: "Deserializer::parse"
    label: "reachable"
    dynamic_evidence:
      covered_by_tests:
        - "tests/test_reachable_deserialization.js::should_reach_sink"
      coverage_files:
        - "outputs/coverage.json"
    static_evidence:
      call_path:
        - "POST /api/posts"
        - "PostsController.create"
        - "PostsService.createFromJson"
        - "Deserializer.parse"
    config_conditions:
      - "process.env.FEATURE_JSON_ENABLED == 'true'"
    notes: "If FEATURE_JSON_ENABLED=false, path is unreachable."
```

---

### Task 2.4 – Submission schema

**Developer:** Benchmark Core

`benchmark/schemas/submission.schema.json`

**Shape**
```json
{
  "tool": { "name": "YourTool", "version": "1.2.3" },
  "run": {
    "commit": "abcd1234",
    "platform": "ubuntu:24.04",
    "time_s": 182.4,
    "peak_mb": 3072
  },
  "cases": [
    {
      "id": "js-express-blog:001",
      "prediction": "reachable",
      "confidence": 0.88,
      "explain": {
        "entry": "POST /api/posts",
        "path": [
          "PostsController.create",
          "PostsService.createFromJson",
          "Deserializer.parse"
        ],
        "guards": [
          "process.env.FEATURE_JSON_ENABLED === 'true'"
        ]
      }
    }
  ],
  "artifacts": {
    "sbom": "sha256:...",
    "attestation": "sha256:..."
  }
}
```

Write a Python validation utility:

```bash
python benchmark/tools/scorer/validate_submission.py submission.json
```
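
A sketch of the utility's internals, assuming the `jsonschema` package; the hard-coded schema path and error-reporting style are illustrative:

```python
#!/usr/bin/env python3
"""Sketch of validate_submission.py: report every schema violation, not just the first."""
import json
import sys
from jsonschema import Draft202012Validator  # pip install jsonschema

SCHEMA_PATH = "benchmark/schemas/submission.schema.json"

def main(submission_path: str) -> int:
    with open(SCHEMA_PATH) as f:
        schema = json.load(f)
    with open(submission_path) as f:
        submission = json.load(f)

    validator = Draft202012Validator(schema)
    errors = sorted(validator.iter_errors(submission), key=lambda e: list(e.path))
    for err in errors:
        loc = "/".join(str(p) for p in err.path) or "<root>"
        print(f"ERROR at {loc}: {err.message}")
    return 1 if errors else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```
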
**Acceptance criteria**

* Validation fails on missing fields / wrong enum values.
* At least two sample submissions pass validation (e.g., “perfect” and “random baseline”).

---
## 3. Phase 3 – Reference projects & deterministic builds

### Task 3.1 – Select and vendor v1 projects

**Developer:** Tech Lead + Language Tracks

For each language, choose:

* 1 small toy app (simple web or CLI)
* 1 medium app (more routes, multiple modules)
* Optional: 1 large app (for performance stress tests)

Add them under `benchmark/cases/<lang>/<project>/src/`
(or as git submodules if you want to track upstream).

---
### Task 3.2 – Deterministic Docker build per project

**Developer:** Language Tracks

For each project:

* Create `build/Dockerfile`
* Create `build/build.sh` that:

  * Builds the app
  * Produces artifacts
  * Generates the SBOM and attestation

**Example `build/Dockerfile` (Node)**
```dockerfile
FROM node:20.11-slim

ENV NODE_ENV=production
ENV SOURCE_DATE_EPOCH=1730000000

WORKDIR /app
# Copy lockfiles first for better layer caching
COPY package.json package-lock.json /app/
COPY src/ /app/

RUN npm ci --ignore-scripts && \
    npm run build || true

CMD ["node", "server.js"]
```

**Example `build.sh`**
```bash
#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(dirname "$(readlink -f "$0")")/.."
OUT_DIR="$ROOT_DIR/outputs"
mkdir -p "$OUT_DIR"

IMAGE_TAG="rb-js-express-blog:1"

# Build from the case root so the Dockerfile can COPY src/ and the lockfiles
docker build -t "$IMAGE_TAG" -f "$ROOT_DIR/build/Dockerfile" "$ROOT_DIR"

# Export image as tarball (binary artifact); gzip -n omits the timestamp for determinism
docker save "$IMAGE_TAG" | gzip -n > "$OUT_DIR/binary.tar.gz"

# Generate SBOM (e.g. via syft) – can be an optional stub for v1
syft packages "docker:$IMAGE_TAG" -o cyclonedx-json > "$OUT_DIR/sbom.cdx.json"

# In future: generate in-toto attestations
```

---

### Task 3.3 – Determinism checker

**Developer:** Benchmark Core

`benchmark/tools/build/validate_builds.py` (sketch below):

* For each case:

  * Run `build.sh` twice
  * Compare hashes of `outputs/binary.tar.gz` and `outputs/sbom.cdx.json`
  * Fail if the hashes differ.
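
A minimal sketch, assuming each case keeps `build/build.sh` and the artifact paths shown in `case.yaml`:

```python
#!/usr/bin/env python3
"""Sketch of benchmark/tools/build/validate_builds.py: build twice, compare hashes."""
import hashlib
import pathlib
import subprocess
import sys

ARTIFACTS = ["outputs/binary.tar.gz", "outputs/sbom.cdx.json"]

def sha256(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def check_case(case_dir: pathlib.Path) -> bool:
    hashes = []
    for _ in range(2):
        # build.sh resolves its own paths from $0, so cwd doesn't matter
        subprocess.run(["bash", str(case_dir / "build" / "build.sh")], check=True)
        hashes.append({a: sha256(case_dir / a) for a in ARTIFACTS})
    ok = hashes[0] == hashes[1]
    print(("OK" if ok else "NON-DETERMINISTIC") + f": {case_dir}")
    return ok

if __name__ == "__main__":
    cases = [p.parent.parent for p in pathlib.Path("benchmark/cases").rglob("build/build.sh")]
    # list() so every case is checked even after a failure
    sys.exit(0 if all(list(map(check_case, cases))) else 1)
```
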
**Acceptance criteria**

* All v1 cases produce identical artifacts across two builds on CI.

---
## 4. Phase 4 – Ground truth oracles (tests & traces)

### Task 4.1 – Add unit/integration tests for reachable cases

**Developer:** Language Tracks

For each **reachable** case:

* Add `tests/` under the project to:

  * Start the app (if necessary)
  * Send a request/trigger that reaches the vulnerable sink
  * Assert that a sentinel side effect occurs (e.g. log or marker file) instead of real exploitation.

Example for Node using Jest:
```js
// Imports assumed: supertest for HTTP, plus the app and a sink-flag helper
// exposed by the project under test (paths are illustrative).
const request = require("supertest");
const { app, sinkWasReached } = require("../src/app");

test("should reach deserialization sink", async () => {
  const res = await request(app)
    .post("/api/posts")
    .send({ title: "x", body: '{"__proto__":{}}' });

  expect(res.statusCode).toBe(200);
  // Sink logs "REACH_SINK" – we check the flag rather than real exploitation
  expect(sinkWasReached()).toBe(true);
});
```

### Task 4.2 – Instrument coverage

**Developer:** Language Tracks

* For each language, pick a coverage tool:

  * JS: `nyc` (Istanbul)
  * Python: `coverage.py`
  * Java: JaCoCo
  * C: `gcov`/`llvm-cov` (optional for v1)

* Ensure running tests produces `outputs/coverage.json` or `.xml`, which we then convert to a simple JSON format:
```json
{
  "files": {
    "src/controllers/posts.js": {
      "lines_covered": [12, 13, 14, 27],
      "lines_total": 40
    }
  }
}
```
Create a small converter script if needed; a sketch for Python's `coverage.py` output follows.
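
A converter sketch from a `coverage json` report to the simple format above; the `executed_lines` and `summary.num_statements` field names match recent coverage.py releases but should be verified against your version:

```python
#!/usr/bin/env python3
"""Sketch: convert a coverage.py JSON report into the benchmark's simple format."""
import json
import sys

def convert(report_path: str, out_path: str) -> None:
    with open(report_path) as f:
        report = json.load(f)

    simple = {"files": {}}
    for path, data in report.get("files", {}).items():
        simple["files"][path] = {
            "lines_covered": sorted(data.get("executed_lines", [])),
            "lines_total": data.get("summary", {}).get("num_statements", 0),
        }

    with open(out_path, "w") as f:
        json.dump(simple, f, indent=2)

if __name__ == "__main__":
    # e.g. convert coverage.json outputs/coverage.json
    convert(sys.argv[1], sys.argv[2])
```
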
### Task 4.3 – Optional dynamic traces

If you want richer evidence:

* JS: add middleware that logs `(entry_id, handler, sink)` triples to `outputs/traces/traces.json`
* Python: similar, using decorators
* C/Java: out of scope for v1 unless you want to invest extra time.

---
## 5. Phase 5 – Scoring tool (CLI)

### Task 5.1 – Implement `rb-score` library + CLI

**Developer:** Benchmark Core

Create `benchmark/tools/scorer/rb_score/` with:

* `loader.py`

  * Load all `case.yaml` and `truth.yaml` files into memory.
  * Provide functions: `load_cases() -> Dict[case_id, Case]`.

* `metrics.py`

  * Implement:

    * `compute_precision_recall(truth, predictions)`
    * `compute_path_quality_score(explain_block)` (0–3)
    * `compute_runtime_stats(run_block)`

* `cli.py`

  * CLI:

    ```bash
    rb-score \
      --cases-root benchmark/cases \
      --submission submissions/mytool.json \
      --output results/mytool_results.json
    ```

**Pseudo-code for core scoring**
```python
def score_submission(truth, submission):
    y_true = []
    y_pred = []
    per_case_scores = {}

    # truth maps case_id -> ground-truth record
    for case_id, gt_case in truth.items():
        gt = gt_case.label  # reachable / unreachable
        pred_case = find_pred_case(submission.cases, case_id)
        pred_label = pred_case.prediction if pred_case else "unreachable"

        y_true.append(gt == "reachable")
        y_pred.append(pred_label == "reachable")

        explain_score = explainability(pred_case.explain if pred_case else None)

        per_case_scores[case_id] = {
            "gt": gt,
            "pred": pred_label,
            "explainability": explain_score,
        }

    precision, recall, f1 = compute_prf(y_true, y_pred)

    return {
        "summary": {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "num_cases": len(truth),
        },
        "cases": per_case_scores,
    }
```

### Task 5.2 – Explainability scoring rules

**Developer:** Benchmark Core

Implement `explainability(explain)` (sketch below):

* 0 – `explain` missing or `path` empty
* 1 – `path` present with at least 2 nodes (sink + one function)
* 2 – `path` contains:

  * an entry label (HTTP route / CLI id)
  * ≥3 nodes (entry → … → sink)

* 3 – level 2 plus a non-empty `guards` list

Write unit tests for at least 4 scenarios.
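
A sketch of the rubric as code; `explain` is the dict from the submission's `explain` block, and the exact thresholds are a starting point rather than final rules:

```python
def explainability(explain) -> int:
    """Score an explain block on the 0-3 rubric above."""
    if not explain or not explain.get("path"):
        return 0  # no context at all
    path = explain["path"]
    if explain.get("entry") and len(path) >= 3:
        # Full entry -> ... -> sink path; guards lift it to level 3
        return 3 if explain.get("guards") else 2
    # Names only; a single-node "path" is treated as no context
    return 1 if len(path) >= 2 else 0
```
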
### Task 5.3 – Regression tests for scoring

Add a small test fixture:

* Tiny synthetic benchmark: 3 cases, 2 reachable, 1 unreachable.
* 3 submissions:

  * Perfect
  * All reachable
  * All unreachable

Assertions:

* Perfect: `precision=1, recall=1`
* All reachable: `recall=1, precision<1`
* All unreachable: `recall=0`; precision is undefined (no positive predictions), so pin down the convention (e.g. report 0) in the test
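
A pytest sketch of these assertions; it assumes `score_submission` is exposed from `rb_score.metrics` and uses in-memory stand-ins instead of real fixture files:

```python
from types import SimpleNamespace as NS

from rb_score.metrics import score_submission  # assumed module layout

# In-memory stand-ins; the real tests would load fixtures via rb_score.loader.
TRUTH = {
    "case-1": NS(label="reachable"),
    "case-2": NS(label="reachable"),
    "case-3": NS(label="unreachable"),
}

def submission(label_by_id):
    return NS(cases=[NS(id=i, prediction=p, explain=None)
                     for i, p in label_by_id.items()])

def test_perfect_submission():
    result = score_submission(TRUTH, submission(
        {"case-1": "reachable", "case-2": "reachable", "case-3": "unreachable"}))
    assert result["summary"]["precision"] == 1.0
    assert result["summary"]["recall"] == 1.0

def test_all_reachable():
    result = score_submission(TRUTH, submission({i: "reachable" for i in TRUTH}))
    assert result["summary"]["recall"] == 1.0
    assert result["summary"]["precision"] < 1.0

def test_all_unreachable():
    result = score_submission(TRUTH, submission({i: "unreachable" for i in TRUTH}))
    assert result["summary"]["recall"] == 0.0
    assert result["summary"]["precision"] == 0.0  # our convention for 0/0
```
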
---

## 6. Phase 6 – Baseline integrations

### Task 6.1 – Semgrep baseline

**Developer:** Benchmark Core (with Semgrep experience)

* `baselines/semgrep/run_case.sh`:

  * Inputs: `case_id`, `cases_root`, `output_path`
  * Steps:

    * Find `src/` for the case
    * Run `semgrep --config auto` or curated rules
    * Convert Semgrep findings into the benchmark submission format (a converter sketch follows this list):

      * Map Semgrep rules → vulnerability types → candidate sinks
      * Heuristically guess reachability (for v1, maybe always “reachable” if the sink is in a code path)

  * Output: `output_path` JSON conforming to `submission.schema.json`.
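
A converter sketch for the Semgrep → submission step, run on `semgrep --json` output; the rule-to-sink mapping is illustrative and the always-"reachable" heuristic is the v1 shortcut described above. Semgrep's JSON field names should be verified against your Semgrep version:

```python
#!/usr/bin/env python3
"""Sketch: convert `semgrep --json` output into a benchmark submission."""
import json
import sys

# Illustrative mapping from Semgrep rule ids to benchmark sink ids.
RULE_TO_SINK = {
    "javascript.lang.security.insecure-deserialization": "Deserializer::parse",
}

def convert(semgrep_json_path: str, case_id: str, out_path: str) -> None:
    with open(semgrep_json_path) as f:
        findings = json.load(f).get("results", [])

    cases = []
    for finding in findings:
        sink = RULE_TO_SINK.get(finding.get("check_id", ""))
        if sink is None:
            continue  # rule doesn't map to a benchmark sink
        cases.append({
            "id": case_id,
            "prediction": "reachable",  # v1 heuristic: sink found => reachable
            "confidence": 0.5,
            "explain": {"path": [sink], "guards": []},
        })

    submission = {
        "tool": {"name": "semgrep-baseline", "version": "v1"},
        "run": {"commit": "unknown", "platform": "local", "time_s": 0, "peak_mb": 0},
        "cases": cases,
    }
    with open(out_path, "w") as f:
        json.dump(submission, f, indent=2)

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2], sys.argv[3])
```
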
### Task 6.2 – CodeQL baseline

* Create CodeQL databases for each project (likely via `codeql database create`).
* Create queries targeting known sinks (e.g., `Deserialization`, `CommandInjection`).
* `baselines/codeql/run_case.sh`:

  * Build the DB (or reuse it)
  * Run the queries
  * Translate results into our submission format (again as heuristic reachability).

### Task 6.3 – Optional Snyk / angr baselines

* Snyk:

  * Use `snyk test` on the project
  * Map results to dependencies & known CVEs
  * For v1, just mark as `reachable` if Snyk reports a reachable path (where available).

* angr:

  * For 1–2 small C samples, configure a simple analysis script.

**Acceptance criteria**

* For at least 5 cases (across languages), the baselines produce valid submission JSON.
* `rb-score` runs and yields metrics without errors.

---
## 7. Phase 7 – CI/CD

### Task 7.1 – GitHub Actions workflow

**Developer:** Benchmark Core

`ci/github/benchmark.yml`:

Jobs:

1. `lint-and-test`

   * `python -m pip install -e benchmark/tools/scorer[dev]`
   * `make lint`
   * `make test`

2. `build-cases`

   * `python benchmark/tools/build/build_all.py`
   * Run `validate_builds.py`

3. `smoke-baselines`

   * For 2–3 cases, run the Semgrep/CodeQL wrappers and ensure they emit valid submissions.

### Task 7.2 – Artifact upload

* Upload the `outputs/` tarball from `build-cases` as a workflow artifact.
* Upload `results/*.json` from scoring runs.

---
## 8. Phase 8 – Website & leaderboard

### Task 8.1 – Define results JSON format

**Developer:** Benchmark Core + Website dev

`results/leaderboard.json`:
```json
{
  "tools": [
    {
      "name": "Semgrep",
      "version": "1.60.0",
      "summary": {
        "precision": 0.72,
        "recall": 0.48,
        "f1": 0.58
      },
      "by_language": {
        "javascript": {"precision": 0.80, "recall": 0.50, "f1": 0.62},
        "python": {"precision": 0.65, "recall": 0.45, "f1": 0.53}
      }
    }
  ]
}
```
CLI option to generate this:

```bash
rb-score compare \
  --cases-root benchmark/cases \
  --submissions submissions/*.json \
  --output results/leaderboard.json
```
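
A sketch of the aggregation behind `rb-score compare`, assuming each per-tool results file carries its `tool` metadata plus `summary` and (optional) `by_language` blocks:

```python
import glob
import json

def build_leaderboard(result_paths, out_path="results/leaderboard.json"):
    """Merge per-tool result files into the leaderboard.json shape above.

    Assumed per-file shape: {"tool": {...}, "summary": {...}, "by_language": {...}}.
    """
    tools = []
    for path in result_paths:
        with open(path) as f:
            result = json.load(f)
        tools.append({
            "name": result["tool"]["name"],
            "version": result["tool"]["version"],
            "summary": result["summary"],
            "by_language": result.get("by_language", {}),
        })
    with open(out_path, "w") as f:
        json.dump({"tools": tools}, f, indent=2)

if __name__ == "__main__":
    build_leaderboard(sorted(glob.glob("results/*_results.json")))
```
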
### Task 8.2 – Static site

**Developer:** Website dev

Tech choice: any static framework (Next.js, Astro, Docusaurus, or even pure HTML + JS).

Pages:

* **Home**

  * What is reachability?
  * Summary of the benchmark

* **Leaderboard**

  * Renders `leaderboard.json`
  * Filters: language, case size

* **Docs**

  * How to run the benchmark locally
  * How to prepare a submission

Add a simple script to copy `results/leaderboard.json` into `website/public/` for publishing.

---
## 9. Phase 9 – Docs, governance, and contribution flow

### Task 9.1 – CONTRIBUTING.md

Include:

* How to add a new case, step by step:

  1. Create a project folder under `benchmark/cases/<lang>/<project>/case-XXX/`
  2. Add `case.yaml`, `entrypoints.yaml`, `truth.yaml`
  3. Add oracles (tests, coverage)
  4. Add deterministic `build/` assets
  5. Run the local tooling:

     * `validate_schema.py`
     * `validate_builds.py --case <id>`

* An example PR description template.

### Task 9.2 – Governance doc

* Define **Technical Advisory Committee (TAC)** roles:

  * Approve new cases
  * Approve schema changes
  * Manage hidden test sets (future phase)

* Define the **release cadence**:

  * v1.0 with public cases
  * Quarterly updates with new hidden cases.

---
## 10. Suggested milestone breakdown (for planning / sprints)

### Milestone 1 – Foundation (1–2 sprints)

* Repo scaffolding (Tasks 1.x)
* Schemas (Tasks 2.x)
* Two tiny toy cases (one JS, one Python) with:

  * `case.yaml`, `entrypoints.yaml`, `truth.yaml`
  * Deterministic build
  * Basic unit tests

* Minimal `rb-score` with:

  * Case loading
  * Precision/recall only

**Exit:** You can run `rb-score` on a dummy submission for 2 cases.

---

### Milestone 2 – v1 dataset (2–3 sprints)

* Add ~20–30 cases across JS, Python, Java, C
* Ground truth & coverage for each
* Deterministic builds validated
* Explainability scoring implemented
* Regression tests for `rb-score`

**Exit:** Full scoring tool stable; dataset builds repeatably on CI.

---

### Milestone 3 – Baselines & site (1–2 sprints)

* Semgrep + CodeQL baselines producing valid submissions
* CI running smoke baselines
* `leaderboard.json` generator
* Static website with public leaderboard and docs

**Exit:** Public v1 benchmark you can share with external tool authors.

---

If you tell me which stack your team prefers for the site (React, plain HTML, SSG, etc.) or which CI you’re on, I can adapt this into concrete config files (e.g., a full GitHub Actions workflow, Next.js scaffold, or exact `pyproject.toml` for `rb-score`).