Here’s a concrete, low‑lift way to boost Stella Ops’s visibility and prove your “deterministic, replayable” moat: publish a **sanitized subset of reachability graphs** as a public benchmark that others can run and score identically.

### What this is (plain English)

* You release a small, carefully scrubbed set of **packages + SBOMs + VEX + call‑graphs** (source & binaries) with **ground‑truth reachability labels** for a curated list of CVEs.
* You also ship a **deterministic scoring harness** (container + manifest) so anyone can reproduce the exact scores, byte‑for‑byte.

### Why it helps

* **Proof of determinism:** identical inputs → identical graphs → identical scores.
* **Research magnet:** gives labs and tool vendors a neutral yardstick; you become “the” benchmark steward.
* **Biz impact:** easy demo for buyers; lets you publish leaderboards and whitepapers.

### Scope (MVP dataset)

* **Languages:** PHP, JS, Python, plus **binary** (ELF/PE/Mach‑O) mini‑cases.
* **Units:** 20–30 packages total; 3–6 CVEs per language; 4–6 binary cases (statically & dynamically linked).
* **Artifacts per unit:**
  * Package tarball(s) or container image digest
  * SBOM (CycloneDX 1.6 + SPDX 3.0.1)
  * VEX (known‑exploited, not‑affected, under‑investigation)
  * **Call graph** (normalized JSON)
  * **Ground truth:** list of vulnerable entrypoints/edges considered *reachable*
  * **Determinism manifest:** feed URLs + rule hashes + container digests + tool versions

### Data model (keep it simple)

* `dataset.json`: index of cases with content‑addressed URIs (sha256)
* `sbom/`, `vex/`, `graphs/`, `truth/` folders mirroring the index
* `manifest.lock.json`: DSSE‑signed record of:
  * feeder rules, lattice policies, normalizers (name + version + hash)
  * container image digests for each step (scanner/cartographer/normalizer)
  * timestamp + signer (Stella Ops Authority)

### Scoring harness (deterministic)

* One Docker image: `stellaops/benchmark-harness:<version>`
* Inputs: dataset root + `manifest.lock.json`
* Outputs:
  * `scores.json` (precision/recall/F1, per‑case and macro)
  * `replay-proof.txt` (hashes of every artifact used)
* **No‑network** mode (offline‑first). Fails closed if any hash mismatches.

### Metrics (clear + auditable)

* Per case: TP/FP/FN for **reachable** functions (or edges), plus optional **sink‑reach** verification.
* Aggregates: micro/macro F1; “Determinism Index” (stddev of repeated runs must be 0).
* **Repro test:** the harness re‑runs N=3 and asserts identical outputs (hash compare); a minimal sketch follows the repository layout below.

### Sanitization & legal

* Strip any proprietary code/data; prefer OSS with permissive licenses.
* Replace real package registries with **local mirrors** and pin digests.
* Publish under **CC‑BY‑4.0** (data) + **Apache‑2.0** (harness). Add a simple **contributor license agreement** for external case submissions.

### Baselines to include (neutral + useful)

* “Naïve reachable” (all functions in the package)
* “Imports‑only” (entrypoints that match the import graph)
* “Call‑depth‑2” (bounded traversal)
* **Your** graph engine run with **frozen rules** from the manifest (as a reference, not a claim of SOTA)

### Repository layout (public)

```
stellaops-reachability-benchmark/
  dataset/
    dataset.json
    sbom/...
    vex/...
    graphs/...
    truth/...
    manifest.lock.json   (DSSE-signed)
  harness/
    Dockerfile
    runner.py            (CLI)
    schema/              (JSON Schemas for graphs, truth, scores)
  docs/
    HOWTO.md             (5-min run)
    CONTRIBUTING.md
    SANITIZATION.md
  LICENSES/
```
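To make the “Repro test” bullet concrete, here is a minimal sketch of the hash‑compare idea. `score_fn` stands in for whatever produces the `scores.json` payload; it and the canonicalization choices are assumptions for illustration, not part of the spec.

```python
import hashlib
import json
from typing import Callable

def repro_check(score_fn: Callable[[str], dict], dataset_root: str, runs: int = 3) -> bool:
    """Re-run scoring N times and require byte-identical canonicalized output."""
    digests = set()
    for _ in range(runs):
        scores = score_fn(dataset_root)
        # Canonical form: sorted keys, no whitespace, UTF-8 bytes.
        canonical = json.dumps(scores, sort_keys=True, separators=(",", ":")).encode("utf-8")
        digests.add(hashlib.sha256(canonical).hexdigest())
    # Exactly one distinct digest corresponds to a Determinism Index of 0.
    return len(digests) == 1
```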
### Docs your team can ship in a day

* **HOWTO.md:** `docker run -v $PWD/dataset:/d -v $PWD/out:/o stellaops/benchmark-harness score /d /o`
* **SCHEMA.md:** JSON Schemas for graph and truth (keep fields minimal: `nodes`, `edges`, `purls`, `sinks`, `evidence`).
* **REPRODUCIBILITY.md:** explains DSSE signatures, the lockfile, and offline runs.
* **LIMITATIONS.md:** clarifies scope (no dynamic runtime traces in v1, etc.).

### Governance (lightweight)

* **Versioned releases:** `v0.1`, `v0.2` with changelogs.
* **Submission gate:** PR template + CI that:
  * validates schemas
  * checks that hashes match the lockfile
  * re‑scores and compares to the contributor’s score
* **Leaderboard cadence:** monthly markdown table regenerated by CI.

### Launch plan (2‑week sprint)

* **Day 1–2:** pick cases; finalize schemas; write SANITIZATION.md.
* **Day 3–5:** build the harness image; implement the deterministic runner; freeze `manifest.lock.json`.
* **Day 6–8:** produce ground truth; run baselines; generate initial scores.
* **Day 9–10:** docs + website README; record a 2‑minute demo GIF.
* **Day 11–12:** legal review + licenses; create issue labels (“good first case”).
* **Day 13–14:** publish, post on GitHub + LinkedIn; invite Semgrep/Snyk/OSS‑Fuzz folks to submit cases.

### Nice‑to‑have (but easy)

* **JSON Schema** for ground‑truth edges so academics can auto‑ingest.
* **Small “unknowns” registry** example to show how you annotate unresolved symbols without breaking determinism.
* **Binary mini‑lab:** stripped vs. non‑stripped ELF pair to show your patch‑oracle technique in action (truth labels reflect the oracle result).

If you want, I can draft the repo skeleton (folders, placeholder JSON Schemas, a sample `manifest.lock.json`, and a minimal `runner.py` CLI) so you can drop it straight into GitHub.

---

Got you — let’s turn that high‑level idea into something your devs can actually pick up and ship. Below is a **concrete implementation plan** for the *StellaOps Reachability Benchmark* repo: directory structure, components, tasks, and acceptance criteria. You can drop this straight into a ticketing system as epics → stories.

---

## 0. Tech assumptions (adjust if needed)

To be specific, I’ll assume:

* **Repo**: `stellaops-reachability-benchmark`
* **Harness language**: Python 3.11+
* **Packaging**: Docker image for the harness
* **Schemas**: JSON Schema (Draft 2020-12)
* **CI**: GitHub Actions

If your stack differs, you can still reuse the structure and acceptance criteria.

---

## 1. Repo skeleton & project bootstrap

**Goal:** Create a minimal but fully wired repo.

### Tasks

1. **Create skeleton**

   * Structure:

   ```text
   stellaops-reachability-benchmark/
     dataset/
       dataset.json
       sbom/
       vex/
       graphs/
       truth/
       packages/
       manifest.lock.json      # initially a stub
     harness/
       reachbench/
         __init__.py
         cli.py
         dataset_loader.py
         schemas/
           graph.schema.json
           truth.schema.json
           dataset.schema.json
           scores.schema.json
       tests/
     docs/
       HOWTO.md
       SCHEMA.md
       REPRODUCIBILITY.md
       LIMITATIONS.md
       SANITIZATION.md
     .github/
       workflows/
         ci.yml
     pyproject.toml
     README.md
     LICENSE
     Dockerfile
   ```

2. **Bootstrap Python project**

   * `pyproject.toml` with:
     * `reachbench` package
     * deps: `jsonschema`, `click` or `typer`, `pyyaml`, `pytest`
   * `harness/tests/` with a dummy test to ensure CI is green.

3. **Dockerfile**

   * Minimal, pinned versions:

   ```Dockerfile
   FROM python:3.11-slim
   WORKDIR /app
   COPY . .
   RUN pip install --no-cache-dir .
   ENTRYPOINT ["reachbench"]
   ```

4. **CI basic pipeline (`.github/workflows/ci.yml`)**

   * Jobs:
     * `lint` (e.g., `ruff` or `flake8`, if you want)
     * `test` (pytest)
     * `build-docker` (just to ensure the Dockerfile stays valid)

### Acceptance criteria

* `pip install .` works locally.
* `reachbench --help` prints CLI help (even if commands are stubs).
* CI passes on the main branch.

---

## 2. Dataset & schema definitions

**Goal:** Define all JSON formats and enforce them.

### 2.1 Define dataset index format (`dataset/dataset.json`)

**File:** `dataset/dataset.json`

**Example:**

```jsonc
{
  "version": "0.1.0",
  "cases": [
    {
      "id": "php-wordpress-5.8-cve-2023-12345",
      "language": "php",
      "kind": "source",              // "source" | "binary" | "container"
      "cves": ["CVE-2023-12345"],
      "artifacts": {
        "package": {
          "path": "packages/php/wordpress-5.8.tar.gz",
          "sha256": "…"
        },
        "sbom": {
          "path": "sbom/php/wordpress-5.8.cdx.json",
          "format": "cyclonedx-1.6",
          "sha256": "…"
        },
        "vex": {
          "path": "vex/php/wordpress-5.8.vex.json",
          "format": "csaf-2.0",
          "sha256": "…"
        },
        "graph": {
          "path": "graphs/php/wordpress-5.8.graph.json",
          "schema": "graph.schema.json",
          "sha256": "…"
        },
        "truth": {
          "path": "truth/php/wordpress-5.8.truth.json",
          "schema": "truth.schema.json",
          "sha256": "…"
        }
      }
    }
  ]
}
```

### 2.2 Define **truth schema** (`harness/reachbench/schemas/truth.schema.json`)

**Model (conceptual):**

```jsonc
{
  "case_id": "php-wordpress-5.8-cve-2023-12345",
  "vulnerable_components": [
    {
      "cve": "CVE-2023-12345",
      "symbol": "wp_ajax_nopriv_some_vuln",
      "symbol_kind": "function",     // "function" | "method" | "binary_symbol"
      "status": "reachable",         // "reachable" | "not_reachable"
      "reachable_from": [
        {
          "entrypoint_id": "web:GET:/foo",
          "notes": "HTTP route /foo"
        }
      ],
      "evidence": "manual-analysis"  // or "unit-test", "patch-oracle"
    }
  ],
  "non_vulnerable_components": [
    {
      "symbol": "wp_safe_function",
      "symbol_kind": "function",
      "status": "not_reachable",
      "evidence": "manual-analysis"
    }
  ]
}
```

**Tasks**

* Implement a JSON Schema capturing:
  * required fields: `case_id`, `vulnerable_components`
  * allowed enums for `symbol_kind`, `status`, `evidence`
* Add unit tests that:
  * validate a valid truth file
  * fail on various broken ones (missing `case_id`, unknown `status`, etc.)

### 2.3 Define **graph schema** (`harness/reachbench/schemas/graph.schema.json`)

**Model (conceptual):**

```jsonc
{
  "case_id": "php-wordpress-5.8-cve-2023-12345",
  "language": "php",
  "nodes": [
    {
      "id": "func:wp_ajax_nopriv_some_vuln",
      "symbol": "wp_ajax_nopriv_some_vuln",
      "kind": "function",
      "purl": "pkg:composer/wordpress/wordpress@5.8"
    }
  ],
  "edges": [
    {
      "from": "func:wp_ajax_nopriv_some_vuln",
      "to": "func:wpdb_query",
      "kind": "call"
    }
  ],
  "entrypoints": [
    {
      "id": "web:GET:/foo",
      "symbol": "some_controller",
      "kind": "http_route"
    }
  ]
}
```

**Tasks**

* JSON Schema with:
  * `nodes[]` (id, symbol, kind, optional purl)
  * `edges[]` (`from`, `to`, `kind`)
  * `entrypoints[]` (id, symbol, kind)
* Tests: verify that a valid graph passes and that invalid ones (missing `id`, unknown `kind`) are rejected.

### 2.4 Dataset index schema (`dataset.schema.json`)

* JSON Schema describing `dataset.json` (version string, cases array).
* Tests: validate the example dataset file.

### Acceptance criteria

* Running a simple script (which will become `reachbench validate-dataset`) validates all JSON files in `dataset/` against the schemas without errors.
* CI fails if any dataset JSON is invalid.

---

## 3. Lockfile & determinism manifest

**Goal:** Implement `manifest.lock.json` generation and verification.
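Before the detailed structure, here is a minimal sketch of the canonical hashing rules that 3.2 below spells out. Function names and the per‑file grouping are illustrative assumptions; the real `lockfile.py` groups hashes per case rather than per file.

```python
import hashlib
import json
from pathlib import Path

def canonical_sha256(path: Path) -> str:
    """Hash one artifact: JSON is re-serialized with sorted keys, no whitespace,
    UTF-8 encoding; everything else (package tarballs, binaries) is hashed raw."""
    data = path.read_bytes()
    if path.suffix == ".json":
        obj = json.loads(data.decode("utf-8"))
        data = json.dumps(obj, sort_keys=True, separators=(",", ":"),
                          ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

def compute_dataset_hashes(dataset_root: Path) -> dict:
    """Walk dataset/ in sorted order and derive a root digest from the per-file
    digests. The lockfile itself is excluded so it can be regenerated stably."""
    files: dict[str, str] = {}
    root = hashlib.sha256()
    for p in sorted(dataset_root.rglob("*")):
        if p.is_file() and p.name != "manifest.lock.json":
            digest = canonical_sha256(p)
            rel = str(p.relative_to(dataset_root))
            files[rel] = digest
            root.update(f"{rel}:{digest}\n".encode("utf-8"))
    return {"files": files, "root_sha256": root.hexdigest()}
```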
### 3.1 Lockfile structure

**File:** `dataset/manifest.lock.json`

**Example:**

```jsonc
{
  "version": "0.1.0",
  "created_at": "2025-01-15T12:00:00Z",
  "dataset": {
    "root": "dataset/",
    "sha256": "…",
    "cases": {
      "php-wordpress-5.8-cve-2023-12345": {
        "sha256": "…"
      }
    }
  },
  "tools": {
    "graph_normalizer": {
      "name": "stellaops-graph-normalizer",
      "version": "1.2.3",
      "sha256": "…"
    }
  },
  "containers": {
    "scanner_image": "ghcr.io/stellaops/scanner@sha256:…",
    "normalizer_image": "ghcr.io/stellaops/normalizer@sha256:…"
  },
  "signatures": [
    {
      "type": "dsse",
      "key_id": "stellaops-benchmark-key-1",
      "signature": "base64-encoded-blob"
    }
  ]
}
```

*(Signatures can be optional in v1, but the structure should be there.)*

### 3.2 `lockfile.py` module

**File:** `harness/reachbench/lockfile.py`

**Responsibilities**

* Compute a deterministic SHA‑256 digest of:
  * each case’s artifacts (path → hash from `dataset.json`)
  * the entire `dataset/` tree (sorted traversal)
* Generate a new `manifest.lock.json`:
  * `version` (hard‑coded constant)
  * `created_at` (UTC ISO 8601)
  * `dataset` section with case hashes
* Verification:
  * `verify_lockfile(dataset_root, lockfile_path)`:
    * recompute hashes
    * compare to `lockfile.dataset`
    * return a boolean + list of mismatches

**Tasks**

1. Implement canonical hashing:
   * For JSON text files, normalize with:
     * sorted keys
     * no whitespace
     * UTF‑8 encoding
   * For binaries (packages): raw bytes.
2. Implement `compute_dataset_hashes(dataset_root)`:
   * Returns `{"cases": {...}, "root_sha256": "…"}`.
3. Implement `write_lockfile(...)` and `verify_lockfile(...)`.
4. Tests:
   * Two calls with the same dataset produce identical lockfiles (order of `cases` keys normalized).
   * Changing any artifact file changes the root hash and causes verification to fail.

### 3.3 CLI commands

Add to `cli.py`:

* `reachbench compute-lockfile --dataset-root ./dataset --out ./dataset/manifest.lock.json`
* `reachbench verify-lockfile --dataset-root ./dataset --lockfile ./dataset/manifest.lock.json`

### Acceptance criteria

* `reachbench compute-lockfile` generates a stable file (byte‑for‑byte identical across runs).
* `reachbench verify-lockfile` exits with:
  * code 0 if everything matches
  * non‑zero on mismatch (plus a human‑readable diff).

---

## 4. Scoring harness CLI

**Goal:** Deterministically score participant results against the ground truth.

### 4.1 Result format (participant output)

**Expectation:** Participants provide a `results/` directory with one JSON file per case:

```text
results/
  php-wordpress-5.8-cve-2023-12345.json
  js-express-4.17-cve-2022-9999.json
```

**Result file example:**

```jsonc
{
  "case_id": "php-wordpress-5.8-cve-2023-12345",
  "tool_name": "my-reachability-analyzer",
  "tool_version": "1.0.0",
  "predictions": [
    {
      "cve": "CVE-2023-12345",
      "symbol": "wp_ajax_nopriv_some_vuln",
      "symbol_kind": "function",
      "status": "reachable"
    },
    {
      "cve": "CVE-2023-12345",
      "symbol": "wp_safe_function",
      "symbol_kind": "function",
      "status": "not_reachable"
    }
  ]
}
```

### 4.2 Scoring model

* Treat scoring as classification over `(cve, symbol)` pairs.
* For each case:
  * Truth positives: all `vulnerable_components` with `status == "reachable"`.
  * Truth negatives: everything marked `not_reachable` (optional in v1).
  * Predictions: all entries with `status == "reachable"`.
* Compute:
  * `TP`: predicted reachable & truth reachable.
  * `FP`: predicted reachable but truth says not reachable / unknown.
  * `FN`: truth reachable but not predicted reachable.
* Metrics:
  * Precision, recall, F1 per case.
  * Macro‑averaged metrics across all cases (a minimal computation sketch follows below).
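Here is a minimal sketch of that classification logic, assuming `truth` and `preds` are plain dicts parsed from the truth and result JSON above; the final `compute_case_metrics` in 4.3 may differ in shape.

```python
def compute_case_metrics(truth: dict, preds: dict) -> dict:
    """Score one case as classification over (cve, symbol) pairs (sketch only)."""
    truth_reachable = {
        (c.get("cve"), c["symbol"])
        for c in truth.get("vulnerable_components", [])
        if c["status"] == "reachable"
    }
    predicted_reachable = {
        (p.get("cve"), p["symbol"])
        for p in preds.get("predictions", [])
        if p["status"] == "reachable"
    }
    tp = len(truth_reachable & predicted_reachable)
    fp = len(predicted_reachable - truth_reachable)
    fn = len(truth_reachable - predicted_reachable)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"case_id": truth["case_id"], "tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```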
### 4.3 Implementation (`scoring.py`)

**File:** `harness/reachbench/scoring.py`

**Functions:**

* `load_truth(case_truth_path) -> TruthModel`
* `load_predictions(predictions_path) -> PredictionModel`
* `compute_case_metrics(truth, preds) -> dict`
  * returns:

    ```python
    {
        "case_id": str,
        "tp": int,
        "fp": int,
        "fn": int,
        "precision": float,
        "recall": float,
        "f1": float
    }
    ```
* `aggregate_metrics(case_metrics_list) -> dict`
  * `macro_precision`, `macro_recall`, `macro_f1`, `num_cases`.

### 4.4 CLI: `score`

**Signature:**

```bash
reachbench score \
  --dataset-root ./dataset \
  --results-root ./results \
  --lockfile ./dataset/manifest.lock.json \
  --out ./out/scores.json \
  [--cases php-*] \
  [--repeat 3]
```

**Behavior:**

1. **Verify the lockfile** (fail closed on mismatch).
2. Load `dataset.json`; filter cases if `--cases` is set (glob).
3. For each case:
   * Load the truth file (and validate its schema).
   * Locate the results file (`<case_id>.json`) under `results-root`:
     * If missing, treat all truth positives as FN (or mark the case as “no submission”).
   * Load and validate the predictions (include a JSON Schema: `results.schema.json`).
   * Compute per‑case metrics.
4. Aggregate metrics.
5. Write `scores.json`:

   ```jsonc
   {
     "version": "0.1.0",
     "dataset_version": "0.1.0",
     "generated_at": "2025-01-15T12:34:56Z",
     "macro_precision": 0.92,
     "macro_recall": 0.88,
     "macro_f1": 0.90,
     "cases": [
       {
         "case_id": "php-wordpress-5.8-cve-2023-12345",
         "tp": 10,
         "fp": 1,
         "fn": 2,
         "precision": 0.91,
         "recall": 0.83,
         "f1": 0.87
       }
     ]
   }
   ```

6. **Determinism check:**
   * If `--repeat N` is given:
     * Re‑run scoring in memory N times.
     * Compare the resulting JSON strings (canonicalized via sorted keys).
     * If any differ, exit non‑zero with a message (“non‑deterministic scoring detected”).

### 4.5 Offline‑only mode

* In `cli.py`, perform an early check:

  ```python
  if os.getenv("REACHBENCH_OFFLINE_ONLY", "1") == "1":
      # Verify no outbound network: by policy, simply ensure we never call any
      # networking libraries. (In v1, just avoid adding any such calls.)
      pass
  ```

* Document that the harness must not reach out to the internet.

### Acceptance criteria

* Given a small artificial dataset with 2–3 cases and handcrafted results, `reachbench score` produces the expected metrics (asserted via tests).
* Running `reachbench score --repeat 3` produces identical `scores.json` output across runs.
* Missing results files are handled gracefully (and clearly documented).

---

## 5. Baseline implementations

**Goal:** Provide in‑repo baselines that use only the provided graphs (no extra tooling).

### 5.1 Baseline types

1. **Naïve reachable:** all symbols in the vulnerable package are considered reachable.
2. **Imports‑only:** reachable = any symbol that:
   * appears in the graph AND
   * is reachable from any entrypoint by a single edge OR a name match.
3. **Call‑depth‑2:**
   * From each entrypoint, traverse up to depth 2 along `call` edges.
   * Anything at depth ≤ 2 is considered reachable (a traversal sketch follows 5.3).

### 5.2 Implementation

**File:** `harness/reachbench/baselines.py`

* `baseline_naive(graph, truth) -> PredictionModel`
* `baseline_imports_only(graph, truth) -> PredictionModel`
* `baseline_call_depth_2(graph, truth) -> PredictionModel`

**CLI:**

```bash
reachbench run-baseline \
  --dataset-root ./dataset \
  --baseline naive|imports|depth2 \
  --out ./results-baseline-<name>/
```

Behavior:

* For each case:
  * Load the graph.
  * Generate predictions for the chosen baseline.
  * Write a result file `results-baseline-<name>/<case_id>.json`.

### 5.3 Tests

* Tiny synthetic dataset in `harness/tests/data/`:
  * 1–2 cases with simple graphs.
  * Known expectations for each baseline (TP/FP/FN counts).
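For illustration, a minimal sketch of the depth‑2 traversal over the graph JSON from 2.3 (a bounded BFS). Matching entrypoints to nodes by symbol name is an assumption of this sketch, and the real `baseline_call_depth_2` may differ.

```python
from collections import deque

def reachable_within_depth(graph: dict, max_depth: int = 2) -> set[str]:
    """Node ids reachable from any entrypoint within `max_depth` call edges.

    `graph` follows graph.schema.json (nodes / edges / entrypoints)."""
    adjacency: dict[str, list[str]] = {}
    for edge in graph.get("edges", []):
        if edge.get("kind") == "call":
            adjacency.setdefault(edge["from"], []).append(edge["to"])

    # Entrypoints are matched to graph nodes by symbol name in this sketch.
    symbol_to_node = {n["symbol"]: n["id"] for n in graph.get("nodes", [])}
    frontier = deque(
        (symbol_to_node[e["symbol"]], 0)
        for e in graph.get("entrypoints", [])
        if e["symbol"] in symbol_to_node
    )

    seen: set[str] = {node_id for node_id, _ in frontier}
    while frontier:
        node_id, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for nxt in adjacency.get(node_id, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen
```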
### Acceptance criteria

* `reachbench run-baseline --baseline naive` runs end‑to‑end and outputs result files.
* `reachbench score` on the baseline results produces stable scores.
* Tests validate baseline behavior on the synthetic cases.

---

## 6. Dataset validation & tooling

**Goal:** One command to validate everything (schemas, hashes, internal consistency).

### CLI: `validate-dataset`

```bash
reachbench validate-dataset \
  --dataset-root ./dataset \
  [--lockfile ./dataset/manifest.lock.json]
```

**Checks:**

1. `dataset.json` conforms to `dataset.schema.json`.
2. For each case:
   * all artifact paths exist
   * the `graph` file passes `graph.schema.json`
   * the `truth` file passes `truth.schema.json`
3. Optional: verify the lockfile if one is provided.

**Implementation:**

* `dataset_loader.py`:
  * `load_dataset_index(path) -> DatasetIndex`
  * `iter_cases(dataset_index)` yields case objects.
  * `validate_case(case, dataset_root) -> list[str]` (list of error messages).

**Acceptance criteria**

* Broken paths / invalid JSON produce a clear error message and a non‑zero exit code.
* A CI job calls `reachbench validate-dataset` on every push.

---

## 7. Documentation

**Goal:** Make it trivial for outsiders to use the benchmark.

### 7.1 `README.md`

* Overview:
  * What the benchmark is.
  * What it measures (reachability precision/recall).
* Quickstart:

  ```bash
  git clone ...
  cd stellaops-reachability-benchmark

  # Validate dataset
  reachbench validate-dataset --dataset-root ./dataset

  # Run baselines
  reachbench run-baseline --baseline naive --dataset-root ./dataset --out ./results-naive

  # Score baselines
  reachbench score --dataset-root ./dataset --results-root ./results-naive --out ./out/naive-scores.json
  ```

### 7.2 `docs/HOWTO.md`

* Step‑by‑step:
  * Installing the harness.
  * Running your own tool on the dataset.
  * Formatting your `results/`.
  * Running `reachbench score`.
  * Interpreting `scores.json`.

### 7.3 `docs/SCHEMA.md`

* Human‑readable description of:
  * `graph` JSON
  * `truth` JSON
  * `results` JSON
  * `scores` JSON
* Links to the actual JSON Schemas.

### 7.4 `docs/REPRODUCIBILITY.md`

* Explains:
  * the lockfile design
  * the hashing rules
  * deterministic scoring and the `--repeat` flag
  * how to verify you’re using the exact same dataset.

### 7.5 `docs/SANITIZATION.md`

* Rules for adding new cases:
  * Only use OSS or properly licensed code.
  * Strip secrets / proprietary paths / user data.
  * How to confirm nothing sensitive is in the package tarballs.

### Acceptance criteria

* A new engineer (or external user) can go from zero to “I ran the baseline and got scores” by following the docs alone.
* All example commands work as written.

---

## 8. CI/CD details

**Goal:** Keep the repo healthy and ensure determinism.

### CI jobs (GitHub Actions)

1. **`lint`**
   * Run `ruff` / `flake8` (your choice).
2. **`test`**
   * Run `pytest`.
3. **`validate-dataset`**
   * Run `reachbench validate-dataset --dataset-root ./dataset`.
4. **`determinism`**
   * Small workflow step:
     * Run `reachbench score` on a tiny test dataset with `--repeat 3`.
     * Assert success.
5. **`docker-build`**
   * `docker build` the harness image.

### Acceptance criteria

* All jobs green on main.
* PRs show a failing status if schemas or determinism break.

---

## 9. Rough “epics → stories” breakdown

You can paste roughly the following into Jira/Linear:

1. **Epic: Repo bootstrap & CI**
   * Story: Create repo skeleton & Python project
   * Story: Add Dockerfile & basic CI (lint + tests)
2. **Epic: Schemas & dataset plumbing**
   * Story: Implement `truth.schema.json` + tests
   * Story: Implement `graph.schema.json` + tests
   * Story: Implement `dataset.schema.json` + tests
   * Story: Implement `validate-dataset` CLI
3. **Epic: Lockfile & determinism**
   * Story: Implement lockfile computation + verification
   * Story: Add `compute-lockfile` & `verify-lockfile` CLI
   * Story: Add determinism checks in CI
4. **Epic: Scoring harness**
   * Story: Define results format + `results.schema.json`
   * Story: Implement scoring logic (`scoring.py`)
   * Story: Implement `score` CLI with `--repeat`
   * Story: Add unit tests for metrics
5. **Epic: Baselines**
   * Story: Implement naive baseline
   * Story: Implement imports‑only baseline
   * Story: Implement depth‑2 baseline
   * Story: Add `run-baseline` CLI + tests
6. **Epic: Documentation & polish**
   * Story: Write README + HOWTO
   * Story: Write SCHEMA / REPRODUCIBILITY / SANITIZATION docs
   * Story: Final repo cleanup & examples

---

If you tell me your preferred language and CI, I can also rewrite this into exact tickets and even starter code for `cli.py` and a couple of schemas.
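In the meantime, here is a minimal sketch of what that `cli.py` starter could look like, assuming Typer (one of the two CLI options listed in the deps; Click would work equally well). Command bodies are stubs, and the option names simply mirror the commands described in sections 3–6.

```python
# Sketch of harness/reachbench/cli.py (stubs only, not the final implementation).
import typer

app = typer.Typer(help="StellaOps Reachability Benchmark harness")

@app.command("validate-dataset")
def validate_dataset(dataset_root: str = "./dataset", lockfile: str = ""):
    """Validate dataset.json, graphs, and truth files against the schemas."""
    typer.echo(f"validating {dataset_root} ... (stub)")

@app.command("compute-lockfile")
def compute_lockfile(dataset_root: str = "./dataset",
                     out: str = "./dataset/manifest.lock.json"):
    """Write a deterministic manifest.lock.json for the dataset."""
    typer.echo(f"writing lockfile to {out} ... (stub)")

@app.command("verify-lockfile")
def verify_lockfile(dataset_root: str = "./dataset",
                    lockfile: str = "./dataset/manifest.lock.json"):
    """Recompute hashes and compare against the lockfile; fail closed on mismatch."""
    typer.echo(f"verifying {lockfile} ... (stub)")

@app.command("run-baseline")
def run_baseline(dataset_root: str = "./dataset", baseline: str = "naive",
                 out: str = "./results-baseline"):
    """Run one of the in-repo baselines (naive | imports | depth2)."""
    typer.echo(f"running baseline {baseline} ... (stub)")

@app.command("score")
def score(dataset_root: str = "./dataset", results_root: str = "./results",
          lockfile: str = "./dataset/manifest.lock.json",
          out: str = "./out/scores.json", repeat: int = 1):
    """Score participant results against ground truth, deterministically."""
    typer.echo(f"scoring {results_root} ({repeat} repeat(s)) ... (stub)")

if __name__ == "__main__":
    app()
```

With a `[project.scripts]` entry pointing `reachbench` at `reachbench.cli:app`, `reachbench --help` would list these commands, which is enough to satisfy the section 1 acceptance criteria while the real logic lands.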