Here’s a concrete, low‑lift way to boost Stella Ops’s visibility and prove your “deterministic, replayable” moat: publish a **sanitized subset of reachability graphs** as a public benchmark that others can run and score identically.

### What this is (plain English)

* You release a small, carefully scrubbed set of **packages + SBOMs + VEX + call‑graphs** (source & binaries) with **ground‑truth reachability labels** for a curated list of CVEs.
* You also ship a **deterministic scoring harness** (container + manifest) so anyone can reproduce the exact scores, byte‑for‑byte.

### Why it helps

* **Proof of determinism:** identical inputs → identical graphs → identical scores.
* **Research magnet:** gives labs and tool vendors a neutral yardstick; you become “the” benchmark steward.
* **Biz impact:** an easy demo for buyers; lets you publish leaderboards and whitepapers.

### Scope (MVP dataset)

* **Languages:** PHP, JS, Python, plus **binary** (ELF/PE/Mach‑O) mini-cases.
* **Units:** 20–30 packages total; 3–6 CVEs per language; 4–6 binary cases (statically & dynamically linked).
* **Artifacts per unit:**

  * Package tarball(s) or container image digest
  * SBOM (CycloneDX 1.6 + SPDX 3.0.1)
  * VEX (known‑exploited, not‑affected, under‑investigation)
  * **Call graph** (normalized JSON)
  * **Ground truth**: list of vulnerable entrypoints/edges considered *reachable*
  * **Determinism manifest**: feed URLs + rule hashes + container digests + tool versions

### Data model (keep it simple)

* `dataset.json`: index of cases with content‑addressed URIs (sha256)
* `sbom/`, `vex/`, `graphs/`, `truth/` folders mirroring the index
* `manifest.lock.json`: DSSE‑signed record of:

  * feeder rules, lattice policies, normalizers (name + version + hash)
  * container image digests for each step (scanner/cartographer/normalizer)
  * timestamp + signer (Stella Ops Authority)

### Scoring harness (deterministic)

* One Docker image: `stellaops/benchmark-harness:<tag>`
* Inputs: the dataset root + `manifest.lock.json`
* Outputs:

  * `scores.json` (precision/recall/F1, per‑case and macro)
  * `replay-proof.txt` (hashes of every artifact used)
* **No network** mode (offline‑first). Fails closed if any hash mismatches (sketched below).

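To make “fails closed” concrete, here is a minimal sketch of the check the harness could run on every artifact before scoring (the function name and error text are illustrative, not a shipped API):

```python
import hashlib
import sys
from pathlib import Path


def verify_artifact(path: Path, expected_sha256: str) -> None:
    """Refuse to proceed if a file's bytes don't match the lockfile digest."""
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    if actual != expected_sha256:
        # Fail closed: no partial scoring on tampered or drifted inputs.
        sys.exit(f"hash mismatch for {path}: refusing to score")
```
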
### Metrics (clear + auditable)

* Per case: TP/FP/FN for **reachable** functions (or edges), plus optional **sink‑reach** verification.
* Aggregates: micro/macro F1; a “Determinism Index” (the standard deviation of scores across repeated runs must be 0).
* **Repro test:** the harness re‑runs N=3 times and asserts identical outputs (hash compare; sketched below).

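A minimal sketch of that repro test: hash each run’s canonical JSON and demand exactly one distinct digest (equivalent to a standard deviation of 0). The function name is illustrative:

```python
import hashlib
import json


def determinism_index_ok(runs: list[dict]) -> bool:
    """True only if all repeated scoring runs are byte-identical after canonicalization."""
    digests = {
        hashlib.sha256(json.dumps(run, sort_keys=True).encode("utf-8")).hexdigest()
        for run in runs
    }
    return len(digests) == 1
```
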
### Sanitization & legal

* Strip any proprietary code/data; prefer OSS with permissive licenses.
* Replace real package registries with **local mirrors** and pin digests.
* Publish under **CC‑BY‑4.0** (data) + **Apache‑2.0** (harness). Add a simple **contributor license agreement** for external case submissions.

### Baselines to include (neutral + useful)

* “Naïve reachable” (all functions in the package)
* “Imports‑only” (entrypoints that match the import graph)
* “Call‑depth‑2” (bounded traversal)
* **Your** graph engine run with **frozen rules** from the manifest (as a reference, not a claim of SOTA)

### Repository layout (public)

```
stellaops-reachability-benchmark/
  dataset/
    dataset.json
    sbom/...
    vex/...
    graphs/...
    truth/...
    manifest.lock.json    (DSSE-signed)
  harness/
    Dockerfile
    runner.py             (CLI)
    schema/               (JSON Schemas for graphs, truth, scores)
  docs/
    HOWTO.md              (5-min run)
    CONTRIBUTING.md
    SANITIZATION.md
  LICENSES/
```

### Docs your team can ship in a day

* **HOWTO.md:** `docker run -v $PWD/dataset:/d -v $PWD/out:/o stellaops/benchmark-harness score /d /o`
* **SCHEMA.md:** JSON Schemas for graph and truth (keep fields minimal: `nodes`, `edges`, `purls`, `sinks`, `evidence`).
* **REPRODUCIBILITY.md:** explains DSSE signatures, the lockfile, and offline runs.
* **LIMITATIONS.md:** clarifies scope (no dynamic runtime traces in v1, etc.).

### Governance (lightweight)

* **Versioned releases:** `v0.1`, `v0.2` with changelogs.
* **Submission gate:** a PR template + CI that:

  * validates schemas
  * checks that hashes match the lockfile
  * re‑scores and compares against the contributor’s score
* **Leaderboard cadence:** a monthly markdown table regenerated by CI.

### Launch plan (2‑week sprint)

* **Day 1–2:** pick cases; finalize schemas; write SANITIZATION.md.
* **Day 3–5:** build the harness image; implement the deterministic runner; freeze `manifest.lock.json`.
* **Day 6–8:** produce ground truth; run baselines; generate initial scores.
* **Day 9–10:** docs + website README; record a 2‑minute demo GIF.
* **Day 11–12:** legal review + licenses; create issue labels (“good first case”).
* **Day 13–14:** publish, post on GitHub + LinkedIn; invite Semgrep/Snyk/OSS‑Fuzz folks to submit cases.

### Nice‑to‑have (but easy)

* **JSON Schema** for ground‑truth edges so academics can auto‑ingest.
* **Small “unknowns” registry** example to show how you annotate unresolved symbols without breaking determinism.
* **Binary mini‑lab**: a stripped vs. non‑stripped ELF pair to show your patch‑oracle technique in action (truth labels reflect the oracle result).

If you want, I can draft the repo skeleton (folders, placeholder JSON Schemas, a sample `manifest.lock.json`, and a minimal `runner.py` CLI) so you can drop it straight into GitHub.

---

Got you: let’s turn that high‑level idea into something your devs can actually pick up and ship.

Below is a **concrete implementation plan** for the *StellaOps Reachability Benchmark* repo: directory structure, components, tasks, and acceptance criteria. You can drop this straight into a ticketing system as epics → stories.

---

## 0. Tech assumptions (adjust if needed)

To be specific, I’ll assume:

* **Repo**: `stellaops-reachability-benchmark`
* **Harness language**: Python 3.11+
* **Packaging**: Docker image for the harness
* **Schemas**: JSON Schema (Draft 2020-12)
* **CI**: GitHub Actions

If your stack differs, you can still reuse the structure and acceptance criteria.

---

## 1. Repo skeleton & project bootstrap

**Goal:** Create a minimal but fully wired repo.

### Tasks

1. **Create skeleton**

   * Structure:

```text
stellaops-reachability-benchmark/
  dataset/
    dataset.json
    sbom/
    vex/
    graphs/
    truth/
    packages/
    manifest.lock.json      # initially a stub
  harness/
    reachbench/
      __init__.py
      cli.py
      dataset_loader.py
      schemas/
        graph.schema.json
        truth.schema.json
        dataset.schema.json
        scores.schema.json
    tests/
  docs/
    HOWTO.md
    SCHEMA.md
    REPRODUCIBILITY.md
    LIMITATIONS.md
    SANITIZATION.md
  .github/
    workflows/
      ci.yml
  pyproject.toml
  README.md
  LICENSE
  Dockerfile
```

2. **Bootstrap Python project**

   * `pyproject.toml` with:

     * the `reachbench` package
     * deps: `jsonschema`, `click` or `typer`, `pyyaml`, `pytest`
   * `harness/tests/` with a dummy test to ensure CI is green.

3. **Dockerfile**

   * Minimal, with pinned versions:

```Dockerfile
# For stricter reproducibility, pin this base image by digest in release builds.
FROM python:3.11-slim

WORKDIR /app
COPY . .
RUN pip install --no-cache-dir .

ENTRYPOINT ["reachbench"]
```

4. **Basic CI pipeline (`.github/workflows/ci.yml`)**

   * Jobs:

     * `lint` (e.g., `ruff` or `flake8`, your choice)
     * `test` (pytest)
     * `build-docker` (just to ensure the Dockerfile stays valid)

### Acceptance criteria

* `pip install .` works locally.
* `reachbench --help` prints CLI help (even if commands are stubs).
* CI passes on the main branch.

---

## 2. Dataset & schema definitions

**Goal:** Define all JSON formats and enforce them.

### 2.1 Define the dataset index format (`dataset/dataset.json`)

**File:** `dataset/dataset.json`

**Example:**

```jsonc
{
  "version": "0.1.0",
  "cases": [
    {
      "id": "php-wordpress-5.8-cve-2023-12345",
      "language": "php",
      "kind": "source", // "source" | "binary" | "container"
      "cves": ["CVE-2023-12345"],
      "artifacts": {
        "package": {
          "path": "packages/php/wordpress-5.8.tar.gz",
          "sha256": "…"
        },
        "sbom": {
          "path": "sbom/php/wordpress-5.8.cdx.json",
          "format": "cyclonedx-1.6",
          "sha256": "…"
        },
        "vex": {
          "path": "vex/php/wordpress-5.8.vex.json",
          "format": "csaf-2.0",
          "sha256": "…"
        },
        "graph": {
          "path": "graphs/php/wordpress-5.8.graph.json",
          "schema": "graph.schema.json",
          "sha256": "…"
        },
        "truth": {
          "path": "truth/php/wordpress-5.8.truth.json",
          "schema": "truth.schema.json",
          "sha256": "…"
        }
      }
    }
  ]
}
```

### 2.2 Define the **truth schema** (`harness/reachbench/schemas/truth.schema.json`)

**Model (conceptual):**

```jsonc
{
  "case_id": "php-wordpress-5.8-cve-2023-12345",
  "vulnerable_components": [
    {
      "cve": "CVE-2023-12345",
      "symbol": "wp_ajax_nopriv_some_vuln",
      "symbol_kind": "function",    // "function" | "method" | "binary_symbol"
      "status": "reachable",        // "reachable" | "not_reachable"
      "reachable_from": [
        {
          "entrypoint_id": "web:GET:/foo",
          "notes": "HTTP route /foo"
        }
      ],
      "evidence": "manual-analysis" // or "unit-test", "patch-oracle"
    }
  ],
  "non_vulnerable_components": [
    {
      "symbol": "wp_safe_function",
      "symbol_kind": "function",
      "status": "not_reachable",
      "evidence": "manual-analysis"
    }
  ]
}
```

**Tasks**

* Implement a JSON Schema capturing:

  * required fields: `case_id`, `vulnerable_components`
  * allowed enums for `symbol_kind`, `status`, `evidence`
* Add unit tests (sketched below) that:

  * validate a known-good truth file
  * fail on various broken ones (missing `case_id`, unknown `status`, etc.)

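A sketch of those unit tests, assuming `pytest` and the `jsonschema` package (the schema path matches the layout above):

```python
import json
from pathlib import Path

import jsonschema
import pytest

SCHEMA = json.loads(Path("harness/reachbench/schemas/truth.schema.json").read_text())


def test_valid_truth_file_passes():
    truth = {"case_id": "php-wordpress-5.8-cve-2023-12345", "vulnerable_components": []}
    jsonschema.validate(truth, SCHEMA)  # raises ValidationError on failure


def test_missing_case_id_is_rejected():
    with pytest.raises(jsonschema.ValidationError):
        jsonschema.validate({"vulnerable_components": []}, SCHEMA)


def test_unknown_status_is_rejected():
    bad = {
        "case_id": "x",
        "vulnerable_components": [
            {"symbol": "f", "symbol_kind": "function", "status": "maybe"}
        ],
    }
    with pytest.raises(jsonschema.ValidationError):
        jsonschema.validate(bad, SCHEMA)
```
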
### 2.3 Define the **graph schema** (`harness/reachbench/schemas/graph.schema.json`)

**Model (conceptual):**

```jsonc
{
  "case_id": "php-wordpress-5.8-cve-2023-12345",
  "language": "php",
  "nodes": [
    {
      "id": "func:wp_ajax_nopriv_some_vuln",
      "symbol": "wp_ajax_nopriv_some_vuln",
      "kind": "function",
      "purl": "pkg:composer/wordpress/wordpress@5.8"
    }
  ],
  "edges": [
    {
      "from": "func:wp_ajax_nopriv_some_vuln",
      "to": "func:wpdb_query",
      "kind": "call"
    }
  ],
  "entrypoints": [
    {
      "id": "web:GET:/foo",
      "symbol": "some_controller",
      "kind": "http_route"
    }
  ]
}
```

**Tasks**

* A JSON Schema with:

  * `nodes[]` (id, symbol, kind, optional purl)
  * `edges[]` (`from`, `to`, `kind`)
  * `entrypoints[]` (id, symbol, kind)
* Tests: verify that a valid graph passes and that invalid ones (missing `id`, unknown `kind`) are rejected.

### 2.4 Dataset index schema (`dataset.schema.json`)

* A JSON Schema describing `dataset.json` (version string, cases array).
* Tests: validate the example dataset file.

### Acceptance criteria

* Running a simple script (this will become `reachbench validate-dataset`) validates all JSON files in `dataset/` against the schemas without errors.
* CI fails if any dataset JSON is invalid.

---

## 3. Lockfile & determinism manifest

**Goal:** Implement `manifest.lock.json` generation and verification.

### 3.1 Lockfile structure

**File:** `dataset/manifest.lock.json`

**Example:**

```jsonc
{
  "version": "0.1.0",
  "created_at": "2025-01-15T12:00:00Z",
  "dataset": {
    "root": "dataset/",
    "sha256": "…",
    "cases": {
      "php-wordpress-5.8-cve-2023-12345": {
        "sha256": "…"
      }
    }
  },
  "tools": {
    "graph_normalizer": {
      "name": "stellaops-graph-normalizer",
      "version": "1.2.3",
      "sha256": "…"
    }
  },
  "containers": {
    "scanner_image": "ghcr.io/stellaops/scanner@sha256:…",
    "normalizer_image": "ghcr.io/stellaops/normalizer@sha256:…"
  },
  "signatures": [
    {
      "type": "dsse",
      "key_id": "stellaops-benchmark-key-1",
      "signature": "base64-encoded-blob"
    }
  ]
}
```

*(Signatures can be optional in v1, but the structure should be there.)*

### 3.2 `lockfile.py` module

**File:** `harness/reachbench/lockfile.py`

**Responsibilities**

* Compute deterministic SHA-256 digests of:

  * each case’s artifacts (path → hash from `dataset.json`)
  * the entire `dataset/` tree (sorted traversal)
* Generate a new `manifest.lock.json`:

  * `version` (hard-coded constant)
  * `created_at` (UTC ISO 8601; pin or exclude this field when comparing runs, or the lockfile won’t be byte-for-byte reproducible)
  * `dataset` section with case hashes
* Verification:

  * `verify_lockfile(dataset_root, lockfile_path)`:

    * recompute hashes
    * compare to `lockfile.dataset`
    * return a boolean + a list of mismatches

**Tasks**

1. Implement canonical hashing (sketched below):

   * For text JSON files, normalize with:

     * sorted keys
     * no whitespace
     * UTF‑8 encoding
   * For binaries (packages): raw bytes.
2. Implement `compute_dataset_hashes(dataset_root)`:

   * Returns `{"cases": {...}, "root_sha256": "…"}`.
3. Implement `write_lockfile(...)` and `verify_lockfile(...)`.
4. Tests:

   * Two calls with the same dataset produce an identical lockfile (order of `cases` keys normalized).
   * Changing any artifact file changes the root hash and causes verification to fail.

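A minimal sketch of the canonical hashing from task 1 (`compute_dataset_hashes` matches the responsibility above; the traversal details are illustrative):

```python
import hashlib
import json
from pathlib import Path


def canonical_bytes(path: Path) -> bytes:
    """JSON is re-serialized (sorted keys, no whitespace, UTF-8); other files hash as raw bytes."""
    if path.suffix == ".json":
        obj = json.loads(path.read_text(encoding="utf-8"))
        return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return path.read_bytes()


def compute_dataset_hashes(dataset_root: Path) -> dict:
    """Sorted traversal keeps the root digest stable across filesystems and OSes."""
    root = hashlib.sha256()
    files: dict[str, str] = {}
    for p in sorted(dataset_root.rglob("*")):
        if p.is_file():
            digest = hashlib.sha256(canonical_bytes(p)).hexdigest()
            files[p.relative_to(dataset_root).as_posix()] = digest
            root.update(digest.encode("ascii"))
    return {"cases": files, "root_sha256": root.hexdigest()}
```
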
### 3.3 CLI commands

Add to `cli.py` (a `typer`-based skeleton is sketched below):

* `reachbench compute-lockfile --dataset-root ./dataset --out ./dataset/manifest.lock.json`
* `reachbench verify-lockfile --dataset-root ./dataset --lockfile ./dataset/manifest.lock.json`

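One way to wire those commands, using `typer` (one of the CLI dependencies listed in section 1); the `lockfile` module calls are the ones specified in 3.2, but the wiring itself is illustrative:

```python
from pathlib import Path

import typer

from reachbench import lockfile

app = typer.Typer(help="StellaOps reachability benchmark harness")


@app.command("compute-lockfile")
def compute_lockfile(
    dataset_root: Path = typer.Option(..., "--dataset-root"),
    out: Path = typer.Option(..., "--out"),
) -> None:
    lockfile.write_lockfile(dataset_root, out)


@app.command("verify-lockfile")
def verify_lockfile(
    dataset_root: Path = typer.Option(..., "--dataset-root"),
    lockfile_path: Path = typer.Option(..., "--lockfile"),
) -> None:
    ok, mismatches = lockfile.verify_lockfile(dataset_root, lockfile_path)
    if not ok:
        for line in mismatches:  # human-readable diff, one mismatch per line
            typer.echo(line, err=True)
        raise typer.Exit(code=1)  # non-zero exit on mismatch


if __name__ == "__main__":
    app()
```
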
### Acceptance criteria

* `reachbench compute-lockfile` generates a stable file (byte-for-byte identical across runs, given a pinned `created_at`).
* `reachbench verify-lockfile` exits with:

  * code 0 on a match
  * non-zero on a mismatch (plus a human-readable diff)

---

## 4. Scoring harness CLI

**Goal:** Deterministically score participant results against ground truth.

### 4.1 Result format (participant output)

**Expectation:** participants provide a `results/` directory with one JSON file per case:

```text
results/
  php-wordpress-5.8-cve-2023-12345.json
  js-express-4.17-cve-2022-9999.json
```

**Result file example:**

```jsonc
{
  "case_id": "php-wordpress-5.8-cve-2023-12345",
  "tool_name": "my-reachability-analyzer",
  "tool_version": "1.0.0",
  "predictions": [
    {
      "cve": "CVE-2023-12345",
      "symbol": "wp_ajax_nopriv_some_vuln",
      "symbol_kind": "function",
      "status": "reachable"
    },
    {
      "cve": "CVE-2023-12345",
      "symbol": "wp_safe_function",
      "symbol_kind": "function",
      "status": "not_reachable"
    }
  ]
}
```

### 4.2 Scoring model

* Treat scoring as classification over `(cve, symbol)` pairs.
* For each case:

  * Truth positives: all `vulnerable_components` with `status == "reachable"`.
  * Truth negatives: everything marked `not_reachable` (optional in v1).
  * Predictions: all entries with `status == "reachable"`.
* Compute:

  * `TP`: predicted reachable & truth reachable.
  * `FP`: predicted reachable but truth says not reachable / unknown.
  * `FN`: truth reachable but not predicted reachable.
* Metrics:

  * Precision, Recall, F1 per case.
  * Macro-averaged metrics across all cases.

### 4.3 Implementation (`scoring.py`)

**File:** `harness/reachbench/scoring.py`

**Functions:**

* `load_truth(case_truth_path) -> TruthModel`
* `load_predictions(predictions_path) -> PredictionModel`
* `compute_case_metrics(truth, preds) -> dict` (sketched below), returning:

```python
{
    "case_id": str,
    "tp": int,
    "fp": int,
    "fn": int,
    "precision": float,
    "recall": float,
    "f1": float
}
```

* `aggregate_metrics(case_metrics_list) -> dict` with `macro_precision`, `macro_recall`, `macro_f1`, `num_cases`.

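A sketch of `compute_case_metrics`, treating truth and predictions as the raw JSON shapes shown earlier (a real implementation would go through `TruthModel`/`PredictionModel`):

```python
def compute_case_metrics(truth: dict, preds: dict) -> dict:
    """Score one case as classification over (cve, symbol) pairs."""
    truth_reachable = {
        (c.get("cve"), c["symbol"])
        for c in truth["vulnerable_components"]
        if c["status"] == "reachable"
    }
    predicted = {
        (p.get("cve"), p["symbol"])
        for p in preds.get("predictions", [])
        if p["status"] == "reachable"
    }
    tp = len(truth_reachable & predicted)
    fp = len(predicted - truth_reachable)
    fn = len(truth_reachable - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "case_id": truth["case_id"],
        "tp": tp, "fp": fp, "fn": fn,
        "precision": precision, "recall": recall, "f1": f1,
    }
```
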
### 4.4 CLI: `score`

**Signature:**

```bash
reachbench score \
  --dataset-root ./dataset \
  --results-root ./results \
  --lockfile ./dataset/manifest.lock.json \
  --out ./out/scores.json \
  [--cases 'php-*'] \
  [--repeat 3]
```

**Behavior:**

1. **Verify the lockfile** (fail closed on mismatch).

2. Load `dataset.json`; filter cases if `--cases` is set (glob).

3. For each case:

   * Load the truth file (and validate it against the schema).
   * Locate the results file (`<case_id>.json`) under `--results-root`:

     * If missing, treat every truth positive as an FN (or mark the case as “no submission”).
   * Load and validate predictions (include a JSON Schema: `results.schema.json`).
   * Compute per-case metrics.

4. Aggregate metrics.

5. Write `scores.json`:

```jsonc
{
  "version": "0.1.0",
  "dataset_version": "0.1.0",
  "generated_at": "2025-01-15T12:34:56Z",
  "macro_precision": 0.92,
  "macro_recall": 0.88,
  "macro_f1": 0.90,
  "cases": [
    {
      "case_id": "php-wordpress-5.8-cve-2023-12345",
      "tp": 10,
      "fp": 1,
      "fn": 2,
      "precision": 0.91,
      "recall": 0.83,
      "f1": 0.87
    }
  ]
}
```

6. **Determinism check**:

   * If `--repeat N` is given:

     * Re-run scoring in memory N times.
     * Compare the resulting JSON strings (canonicalized via sorted keys, with volatile fields such as `generated_at` excluded from the comparison).
     * If any differ, exit non-zero with the message “non-deterministic scoring detected”.

### 4.5 Offline-only mode

* In `cli.py`, an early check (a minimal sketch of the v1 policy):

```python
import os

if os.getenv("REACHBENCH_OFFLINE_ONLY", "1") == "1":
    # Policy: the v1 harness never imports or calls any networking libraries;
    # this flag exists so future commands can fail closed if that ever changes.
    pass
```

* Document that the harness must never reach out to the internet.

### Acceptance criteria

* Given a small artificial dataset with 2–3 cases and handcrafted results, `reachbench score` produces the expected metrics (asserted via tests).
* Running `reachbench score --repeat 3` produces identical `scores.json` output across runs.
* Missing results files are handled gracefully (and the behavior is clearly documented).

---

## 5. Baseline implementations

**Goal:** Provide in-repo baselines that use only the provided graphs (no extra tooling).

### 5.1 Baseline types

1. **Naïve reachable**: all symbols in the vulnerable package are considered reachable.
2. **Imports-only**: reachable = any symbol that:

   * appears in the graph, AND
   * is reachable from any entrypoint by a single edge or a name match.
3. **Call-depth-2** (a traversal sketch follows this list):

   * From each entrypoint, traverse up to depth 2 along `call` edges.
   * Anything at depth ≤ 2 is considered reachable.

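A bounded-traversal sketch for the call-depth-2 baseline, using the graph shape from section 2.3 (resolving entrypoints to graph nodes by matching `symbol` is an assumption; the real mapping may differ):

```python
from collections import deque


def reachable_within_depth(graph: dict, max_depth: int = 2) -> set[str]:
    """Breadth-first traversal over `call` edges, bounded at max_depth hops."""
    adjacency: dict[str, list[str]] = {}
    for edge in graph["edges"]:
        if edge["kind"] == "call":
            adjacency.setdefault(edge["from"], []).append(edge["to"])
    # Assumption: entrypoints resolve to nodes via the shared `symbol` field.
    by_symbol = {node["symbol"]: node["id"] for node in graph["nodes"]}
    start = [by_symbol[ep["symbol"]] for ep in graph["entrypoints"] if ep["symbol"] in by_symbol]
    seen = set(start)
    queue = deque((node_id, 0) for node_id in start)
    while queue:
        node_id, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in adjacency.get(node_id, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen
```
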
### 5.2 Implementation

**File:** `harness/reachbench/baselines.py`

* `baseline_naive(graph, truth) -> PredictionModel`
* `baseline_imports_only(graph, truth) -> PredictionModel`
* `baseline_call_depth_2(graph, truth) -> PredictionModel`

**CLI:**

```bash
reachbench run-baseline \
  --dataset-root ./dataset \
  --baseline naive|imports|depth2 \
  --out ./results-baseline-<baseline>/
```

Behavior:

* For each case:

  * Load the graph.
  * Generate predictions for the chosen baseline.
  * Write the result file `results-baseline-<baseline>/<case_id>.json`.

### 5.3 Tests

* Tiny synthetic dataset in `harness/tests/data/`:

  * 1–2 cases with simple graphs.
  * Known expectations for each baseline (TP/FP/FN counts).

### Acceptance criteria

* `reachbench run-baseline --baseline naive` runs end-to-end and outputs results files.
* `reachbench score` on baseline results produces stable scores.
* Tests validate baseline behavior on the synthetic cases.

---

## 6. Dataset validation & tooling

**Goal:** One command to validate everything (schemas, hashes, internal consistency).

### CLI: `validate-dataset`

```bash
reachbench validate-dataset \
  --dataset-root ./dataset \
  [--lockfile ./dataset/manifest.lock.json]
```

**Checks:**

1. `dataset.json` conforms to `dataset.schema.json`.
2. For each case:

   * all artifact paths exist
   * the `graph` file passes `graph.schema.json`
   * the `truth` file passes `truth.schema.json`
3. Optional: verify the lockfile if one is provided.

**Implementation:**

* `dataset_loader.py`:

  * `load_dataset_index(path) -> DatasetIndex`
  * `iter_cases(dataset_index)` yields case objects.
  * `validate_case(case, dataset_root) -> list[str]` (a list of error messages; sketched below).

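A `validate_case` sketch matching the signature above (error-message wording is illustrative; schema checks would layer on top):

```python
from pathlib import Path


def validate_case(case: dict, dataset_root: Path) -> list[str]:
    """Return one error string per problem; an empty list means the case is clean."""
    errors: list[str] = []
    for name, artifact in case.get("artifacts", {}).items():
        path = dataset_root / artifact["path"]
        if not path.exists():
            errors.append(f"{case['id']}: missing {name} artifact at {artifact['path']}")
    return errors
```
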
**Acceptance criteria**

* Broken paths or invalid JSON produce a clear error message and a non-zero exit code.
* A CI job calls `reachbench validate-dataset` on every push.

---

## 7. Documentation

**Goal:** Make it trivial for outsiders to use the benchmark.

### 7.1 `README.md`

* Overview:

  * What the benchmark is.
  * What it measures (reachability precision/recall).
* Quickstart:

```bash
git clone ...
cd stellaops-reachability-benchmark

# Validate dataset
reachbench validate-dataset --dataset-root ./dataset

# Run baselines
reachbench run-baseline --baseline naive --dataset-root ./dataset --out ./results-naive

# Score baselines
reachbench score --dataset-root ./dataset --results-root ./results-naive --out ./out/naive-scores.json
```

### 7.2 `docs/HOWTO.md`

* Step-by-step:

  * Installing the harness.
  * Running your own tool on the dataset.
  * Formatting your `results/`.
  * Running `reachbench score`.
  * Interpreting `scores.json`.

### 7.3 `docs/SCHEMA.md`

* Human-readable descriptions of:

  * `graph` JSON
  * `truth` JSON
  * `results` JSON
  * `scores` JSON
* Links to the actual JSON Schemas.

### 7.4 `docs/REPRODUCIBILITY.md`

* Explains:

  * the lockfile design
  * the hashing rules
  * deterministic scoring and the `--repeat` flag
  * how to verify you’re using the exact same dataset.

### 7.5 `docs/SANITIZATION.md`

* Rules for adding new cases:

  * Only use OSS or properly licensed code.
  * Strip secrets, proprietary paths, and user data.
  * How to confirm nothing sensitive is in package tarballs.

### Acceptance criteria

* A new engineer (or external user) can go from zero to “I ran the baseline and got scores” by following the docs alone.
* All example commands work as written.

---

## 8. CI/CD details

**Goal:** Keep the repo healthy and ensure determinism.

### CI jobs (GitHub Actions)

1. **`lint`**

   * Run `ruff` / `flake8` (your choice).
2. **`test`**

   * Run `pytest`.
3. **`validate-dataset`**

   * Run `reachbench validate-dataset --dataset-root ./dataset`.
4. **`determinism`**

   * A small workflow step:

     * Run `reachbench score` on a tiny test dataset with `--repeat 3`.
     * Assert success.
5. **`docker-build`**

   * `docker build` the harness image.

### Acceptance criteria

* All jobs green on main.
* PRs show a failing status if schemas or determinism break.

---

## 9. Rough “epics → stories” breakdown

You can paste this into Jira/Linear roughly as follows:

1. **Epic: Repo bootstrap & CI**

   * Story: Create repo skeleton & Python project
   * Story: Add Dockerfile & basic CI (lint + tests)

2. **Epic: Schemas & dataset plumbing**

   * Story: Implement `truth.schema.json` + tests
   * Story: Implement `graph.schema.json` + tests
   * Story: Implement `dataset.schema.json` + tests
   * Story: Implement the `validate-dataset` CLI

3. **Epic: Lockfile & determinism**

   * Story: Implement lockfile computation + verification
   * Story: Add the `compute-lockfile` & `verify-lockfile` CLI commands
   * Story: Add determinism checks in CI

4. **Epic: Scoring harness**

   * Story: Define the results format + `results.schema.json`
   * Story: Implement scoring logic (`scoring.py`)
   * Story: Implement the `score` CLI with `--repeat`
   * Story: Add unit tests for metrics

5. **Epic: Baselines**

   * Story: Implement the naive baseline
   * Story: Implement the imports-only baseline
   * Story: Implement the depth-2 baseline
   * Story: Add the `run-baseline` CLI + tests

6. **Epic: Documentation & polish**

   * Story: Write README + HOWTO
   * Story: Write the SCHEMA / REPRODUCIBILITY / SANITIZATION docs
   * Story: Final repo cleanup & examples

---

If you tell me your preferred language and CI, I can also rewrite this into exact tickets and even starter code for `cli.py` and a couple of schemas.