Here’s a concrete, low‑lift way to boost Stella Ops’s visibility and prove your “deterministic, replayable” moat: publish a sanitized subset of reachability graphs as a public benchmark that others can run and score identically.
What this is (plain English)
- You release a small, carefully scrubbed set of packages + SBOMs + VEX + call‑graphs (source & binaries) with ground‑truth reachability labels for a curated list of CVEs.
- You also ship a deterministic scoring harness (container + manifest) so anyone can reproduce the exact scores, byte‑for‑byte.
Why it helps
- Proof of determinism: identical inputs → identical graphs → identical scores.
- Research magnet: gives labs and tool vendors a neutral yardstick; you become “the” benchmark steward.
- Biz impact: easy demo for buyers; lets you publish leaderboards and whitepapers.
Scope (MVP dataset)
- Languages: PHP, JS, Python, plus binary (ELF/PE/Mach‑O) mini-cases.
- Units: 20–30 packages total; 3–6 CVEs per language; 4–6 binary cases (statically & dynamically linked).
- Artifacts per unit:
  - Package tarball(s) or container image digest
  - SBOM (CycloneDX 1.6 + SPDX 3.0.1)
  - VEX (known‑exploited, not‑affected, under‑investigation)
  - Call graph (normalized JSON)
  - Ground truth: list of vulnerable entrypoints/edges considered reachable
  - Determinism manifest: feed URLs + rule hashes + container digests + tool versions
Data model (keep it simple)
- dataset.json: index of cases with content‑addressed URIs (sha256)
- sbom/, vex/, graphs/, truth/ folders mirroring the index
- manifest.lock.json: DSSE‑signed record of:
  - feeder rules, lattice policies, normalizers (name + version + hash)
  - container image digests for each step (scanner/cartographer/normalizer)
  - timestamp + signer (Stella Ops Authority)
Scoring harness (deterministic)
- One Docker image: stellaops/benchmark-harness:&lt;tag&gt;
- Inputs: dataset root + manifest.lock.json
- Outputs:
  - scores.json (precision/recall/F1, per‑case and macro)
  - replay-proof.txt (hashes of every artifact used)
- No-network mode (offline‑first); fails closed if any hash mismatches.
Metrics (clear + auditable)
- Per case: TP/FP/FN for reachable functions (or edges), plus optional sink‑reach verification.
- Aggregates: micro/macro F1; “Determinism Index” (stddev of repeated runs must be 0).
- Repro test: the harness re‑runs N=3 and asserts identical outputs (hash compare).
Sanitization & legal
- Strip any proprietary code/data; prefer OSS with permissive licenses.
- Replace real package registries with local mirrors and pin digests.
- Publish under CC‑BY‑4.0 (data) + Apache‑2.0 (harness). Add a simple contributor license agreement for external case submissions.
Baselines to include (neutral + useful)
- “Naïve reachable” (all functions in package)
- “Imports‑only” (entrypoints that match import graph)
- “Call‑depth‑2” (bounded traversal)
- Your graph engine run with frozen rules from the manifest (as a reference, not a claim of SOTA)
Repository layout (public)
    stellaops-reachability-benchmark/
      dataset/
        dataset.json
        sbom/...
        vex/...
        graphs/...
        truth/...
        manifest.lock.json   (DSSE-signed)
      harness/
        Dockerfile
        runner.py            (CLI)
        schema/              (JSON Schemas for graphs, truth, scores)
      docs/
        HOWTO.md             (5-min run)
        CONTRIBUTING.md
        SANITIZATION.md
      LICENSES/
Docs your team can ship in a day
- HOWTO.md: docker run -v $PWD/dataset:/d -v $PWD/out:/o stellaops/benchmark-harness score /d /o
- SCHEMA.md: JSON Schemas for graph and truth (keep fields minimal: nodes, edges, purls, sinks, evidence).
- REPRODUCIBILITY.md: explains DSSE signatures, lockfile, and offline run.
- LIMITATIONS.md: clarifies scope (no dynamic runtime traces in v1, etc.).
Governance (lightweight)
- Versioned releases: v0.1, v0.2 with changelogs.
- Submission gate: PR template + CI that:
  - validates schemas
  - checks hashes match the lockfile
  - re‑scores and compares to the contributor’s score
- Leaderboard cadence: monthly markdown table regenerated by CI.
Launch plan (2‑week sprint)
- Day 1–2: pick cases; finalize schemas; write SANITIZATION.md.
- Day 3–5: build harness image; implement deterministic runner; freeze manifest.lock.json.
- Day 6–8: produce ground truth; run baselines; generate initial scores.
- Day 9–10: docs + website README; record a 2‑minute demo GIF.
- Day 11–12: legal review + licenses; create issue labels (“good first case”).
- Day 13–14: publish, post on GitHub + LinkedIn; invite Semgrep/Snyk/OSS‑Fuzz folks to submit cases.
Nice‑to‑have (but easy)
- JSON Schema for ground‑truth edges so academics can auto‑ingest.
- Small “unknowns” registry example to show how you annotate unresolved symbols without breaking determinism.
- Binary mini‑lab: stripped vs non‑stripped ELF pair to show your patch‑oracle technique in action (truth labels reflect oracle result).
If you want, I can draft the repo skeleton (folders, placeholder JSON Schemas, a sample manifest.lock.json, and a minimal runner.py CLI) so you can drop it straight into GitHub.
Got you — let’s turn that high‑level idea into something your devs can actually pick up and ship.
Below is a concrete implementation plan for the StellaOps Reachability Benchmark repo: directory structure, components, tasks, and acceptance criteria. You can drop this straight into a ticketing system as epics → stories.
0. Tech assumptions (adjust if needed)
To be specific, I’ll assume:
- Repo: stellaops-reachability-benchmark
- Harness language: Python 3.11+
- Packaging: Docker image for the harness
- Schemas: JSON Schema (Draft 2020-12)
- CI: GitHub Actions
If your stack differs, you can still reuse the structure and acceptance criteria.
1. Repo skeleton & project bootstrap
Goal: Create a minimal but fully wired repo.
Tasks
- Create skeleton
  - Structure:

        stellaops-reachability-benchmark/
          dataset/
            dataset.json
            sbom/
            vex/
            graphs/
            truth/
            packages/
            manifest.lock.json   # initially stub
          harness/
            reachbench/
              __init__.py
              cli.py
              dataset_loader.py
              schemas/
                graph.schema.json
                truth.schema.json
                dataset.schema.json
                scores.schema.json
            tests/
          docs/
            HOWTO.md
            SCHEMA.md
            REPRODUCIBILITY.md
            LIMITATIONS.md
            SANITIZATION.md
          .github/
            workflows/
              ci.yml
          pyproject.toml
          README.md
          LICENSE
          Dockerfile

- Bootstrap Python project
  - pyproject.toml with:
    - reachbench package
    - deps: jsonschema, click or typer, pyyaml, pytest
  - harness/tests/ with a dummy test to ensure CI is green.
- Dockerfile
  - Minimal, pinned versions:

        FROM python:3.11-slim
        WORKDIR /app
        COPY . .
        RUN pip install --no-cache-dir .
        ENTRYPOINT ["reachbench"]

- CI basic pipeline (.github/workflows/ci.yml)
  - Jobs:
    - lint (e.g., ruff or flake8 if you want)
    - test (pytest)
    - build-docker (just to ensure the Dockerfile stays valid)
Acceptance criteria
- pip install . works locally.
- reachbench --help prints CLI help (even if commands are stubs).
- CI passes on the main branch.
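For reference, a minimal cli.py stub along these lines could satisfy the acceptance criteria; this sketch assumes typer (click would look almost identical), and the command bodies stay placeholders until later epics fill them in:

```python
# harness/reachbench/cli.py -- illustrative stub, not the final implementation
import typer

app = typer.Typer(help="StellaOps Reachability Benchmark harness")

@app.command()
def validate_dataset(dataset_root: str = typer.Option(..., "--dataset-root")):
    """Validate all dataset JSON files against their schemas (stub)."""
    typer.echo(f"validate-dataset: {dataset_root} (not implemented yet)")

@app.command()
def score(
    dataset_root: str = typer.Option(..., "--dataset-root"),
    results_root: str = typer.Option(..., "--results-root"),
    out: str = typer.Option(..., "--out"),
):
    """Score participant results against ground truth (stub)."""
    typer.echo("score: not implemented yet")

if __name__ == "__main__":
    app()
```

One way to wire it up: a [project.scripts] entry in pyproject.toml such as reachbench = "reachbench.cli:app" (Typer apps are callable), so reachbench --help works after pip install .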
2. Dataset & schema definitions
Goal: Define all JSON formats and enforce them.
2.1 Define dataset index format (dataset/dataset.json)
File: dataset/dataset.json
Example:
{
"version": "0.1.0",
"cases": [
{
"id": "php-wordpress-5.8-cve-2023-12345",
"language": "php",
"kind": "source", // "source" | "binary" | "container"
"cves": ["CVE-2023-12345"],
"artifacts": {
"package": {
"path": "packages/php/wordpress-5.8.tar.gz",
"sha256": "…"
},
"sbom": {
"path": "sbom/php/wordpress-5.8.cdx.json",
"format": "cyclonedx-1.6",
"sha256": "…"
},
"vex": {
"path": "vex/php/wordpress-5.8.vex.json",
"format": "csaf-2.0",
"sha256": "…"
},
"graph": {
"path": "graphs/php/wordpress-5.8.graph.json",
"schema": "graph.schema.json",
"sha256": "…"
},
"truth": {
"path": "truth/php/wordpress-5.8.truth.json",
"schema": "truth.schema.json",
"sha256": "…"
}
}
}
]
}
2.2 Define truth schema (harness/reachbench/schemas/truth.schema.json)
Model (conceptual):
{
"case_id": "php-wordpress-5.8-cve-2023-12345",
"vulnerable_components": [
{
"cve": "CVE-2023-12345",
"symbol": "wp_ajax_nopriv_some_vuln",
"symbol_kind": "function", // "function" | "method" | "binary_symbol"
"status": "reachable", // "reachable" | "not_reachable"
"reachable_from": [
{
"entrypoint_id": "web:GET:/foo",
"notes": "HTTP route /foo"
}
],
"evidence": "manual-analysis" // or "unit-test", "patch-oracle"
}
],
"non_vulnerable_components": [
{
"symbol": "wp_safe_function",
"symbol_kind": "function",
"status": "not_reachable",
"evidence": "manual-analysis"
}
]
}
Tasks
- Implement JSON Schema capturing:
  - required fields: case_id, vulnerable_components
  - allowed enums for symbol_kind, status, evidence
- Add unit tests that (see the sketch after this list):
  - validate a valid truth file
  - fail on various broken ones (missing case_id, unknown status, etc.)
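A sketch of those tests using jsonschema and pytest; the schema path and the exact failure cases are assumptions, and the real schema must declare the status enums for the second test to trigger:

```python
# harness/tests/test_truth_schema.py -- illustrative sketch
import json
from pathlib import Path

import pytest
from jsonschema import Draft202012Validator, ValidationError

# Assumes tests run from the repo root.
SCHEMA = json.loads(Path("harness/reachbench/schemas/truth.schema.json").read_text())

VALID_TRUTH = {
    "case_id": "php-wordpress-5.8-cve-2023-12345",
    "vulnerable_components": [
        {
            "cve": "CVE-2023-12345",
            "symbol": "wp_ajax_nopriv_some_vuln",
            "symbol_kind": "function",
            "status": "reachable",
            "evidence": "manual-analysis",
        }
    ],
}

def test_valid_truth_passes():
    Draft202012Validator(SCHEMA).validate(VALID_TRUTH)

@pytest.mark.parametrize(
    "mutate",
    [
        lambda d: d.pop("case_id"),  # missing required field
        lambda d: d["vulnerable_components"][0].update(status="maybe"),  # unknown enum
    ],
)
def test_broken_truth_fails(mutate):
    doc = json.loads(json.dumps(VALID_TRUTH))  # cheap deep copy
    mutate(doc)
    with pytest.raises(ValidationError):
        Draft202012Validator(SCHEMA).validate(doc)
```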
2.3 Define graph schema (harness/reachbench/schemas/graph.schema.json)
Model (conceptual):
{
"case_id": "php-wordpress-5.8-cve-2023-12345",
"language": "php",
"nodes": [
{
"id": "func:wp_ajax_nopriv_some_vuln",
"symbol": "wp_ajax_nopriv_some_vuln",
"kind": "function",
"purl": "pkg:composer/wordpress/wordpress@5.8"
}
],
"edges": [
{
"from": "func:wp_ajax_nopriv_some_vuln",
"to": "func:wpdb_query",
"kind": "call"
}
],
"entrypoints": [
{
"id": "web:GET:/foo",
"symbol": "some_controller",
"kind": "http_route"
}
]
}
Tasks
- JSON Schema with:
  - nodes[] (id, symbol, kind, optional purl)
  - edges[] (from, to, kind)
  - entrypoints[] (id, symbol, kind)
- Tests: verify a valid graph; invalid ones (missing id, unknown kind) are rejected.
2.4 Dataset index schema (dataset.schema.json)
- JSON Schema describing dataset.json (version string, cases array).
- Tests: validate the example dataset file.
Acceptance criteria
- Running a single command (eventually reachbench validate-dataset) validates all JSON files in dataset/ against the schemas without errors.
- CI fails if any dataset JSON is invalid.
3. Lockfile & determinism manifest
Goal: Implement manifest.lock.json generation and verification.
3.1 Lockfile structure
File: dataset/manifest.lock.json
Example:
{
"version": "0.1.0",
"created_at": "2025-01-15T12:00:00Z",
"dataset": {
"root": "dataset/",
"sha256": "…",
"cases": {
"php-wordpress-5.8-cve-2023-12345": {
"sha256": "…"
}
}
},
"tools": {
"graph_normalizer": {
"name": "stellaops-graph-normalizer",
"version": "1.2.3",
"sha256": "…"
}
},
"containers": {
"scanner_image": "ghcr.io/stellaops/scanner@sha256:…",
"normalizer_image": "ghcr.io/stellaops/normalizer@sha256:…"
},
"signatures": [
{
"type": "dsse",
"key_id": "stellaops-benchmark-key-1",
"signature": "base64-encoded-blob"
}
]
}
(Signatures can be optional in v1, but the structure should be there.)
3.2 lockfile.py module
File: harness/reachbench/lockfile.py
Responsibilities
- Compute deterministic SHA-256 digests of:
  - each case’s artifacts (path → hash from dataset.json)
  - the entire dataset/ tree (sorted traversal)
- Generate a new manifest.lock.json:
  - version (hard-coded constant)
  - created_at (UTC ISO 8601)
  - dataset section with case hashes
- Verification:
  - verify_lockfile(dataset_root, lockfile_path):
    - recompute hashes
    - compare to lockfile.dataset
    - return boolean + list of mismatches
Tasks
- Implement canonical hashing (see the sketch after this list):
  - For text JSON files, normalize with:
    - sorted keys
    - no whitespace
    - UTF‑8 encoding
  - For binaries (packages): hash the raw bytes.
- Implement compute_dataset_hashes(dataset_root):
  - Returns {"cases": {...}, "root_sha256": "…"}.
- Implement write_lockfile(...) and verify_lockfile(...).
- Tests:
  - Two runs over the same dataset produce an identical lockfile (order of cases keys normalized).
  - Changing any artifact file changes the root hash and makes verification fail.
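To make the canonical-hashing rules concrete, a minimal sketch; the flat file map (rather than the per-case grouping from dataset.json) and excluding manifest.lock.json from its own hash are assumptions of this sketch:

```python
# harness/reachbench/lockfile.py -- hashing sketch
import hashlib
import json
from pathlib import Path

def file_digest(path: Path) -> str:
    """Canonical form for JSON (sorted keys, no whitespace, UTF-8); raw bytes otherwise."""
    if path.suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        payload = json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")
    else:
        payload = path.read_bytes()
    return hashlib.sha256(payload).hexdigest()

def compute_dataset_hashes(dataset_root: Path) -> dict:
    """Hash every file in sorted traversal order, then hash the canonical listing."""
    files = {
        str(p.relative_to(dataset_root)): file_digest(p)
        for p in sorted(dataset_root.rglob("*"))
        if p.is_file() and p.name != "manifest.lock.json"  # avoid self-reference
    }
    root = hashlib.sha256(
        json.dumps(files, sort_keys=True, separators=(",", ":")).encode("utf-8")
    ).hexdigest()
    return {"files": files, "root_sha256": root}
```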
3.3 CLI commands
Add to cli.py:
- reachbench compute-lockfile --dataset-root ./dataset --out ./dataset/manifest.lock.json
- reachbench verify-lockfile --dataset-root ./dataset --lockfile ./dataset/manifest.lock.json
Acceptance criteria
- reachbench compute-lockfile generates a stable file (byte-for-byte identical across runs).
- reachbench verify-lockfile exits with:
  - code 0 on match
  - non-zero on mismatch (plus a human-readable diff).
4. Scoring harness CLI
Goal: Deterministically score participant results against ground truth.
4.1 Result format (participant output)
Expectation:
Participants provide results/ with one JSON per case:
results/
php-wordpress-5.8-cve-2023-12345.json
js-express-4.17-cve-2022-9999.json
Result file example:
{
"case_id": "php-wordpress-5.8-cve-2023-12345",
"tool_name": "my-reachability-analyzer",
"tool_version": "1.0.0",
"predictions": [
{
"cve": "CVE-2023-12345",
"symbol": "wp_ajax_nopriv_some_vuln",
"symbol_kind": "function",
"status": "reachable"
},
{
"cve": "CVE-2023-12345",
"symbol": "wp_safe_function",
"symbol_kind": "function",
"status": "not_reachable"
}
]
}
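For illustration, a hypothetical helper a participant might use to emit one such file per case (write_result is not part of the harness):

```python
import json
from pathlib import Path

def write_result(out_dir: Path, case_id: str, tool: str, version: str,
                 predictions: list[dict]) -> Path:
    """Write results/<case_id>.json in the expected shape."""
    doc = {
        "case_id": case_id,
        "tool_name": tool,
        "tool_version": version,
        "predictions": predictions,
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{case_id}.json"
    path.write_text(json.dumps(doc, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path
```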
4.2 Scoring model
- Treat scoring as classification over (cve, symbol) pairs.
- For each case:
  - Truth positives: all vulnerable_components with status == "reachable".
  - Truth negatives: everything marked not_reachable (optional in v1).
  - Predictions: all entries with status == "reachable".
- Compute:
  - TP: predicted reachable & truth reachable.
  - FP: predicted reachable but truth says not reachable / unknown.
  - FN: truth reachable but not predicted reachable.
- Metrics:
  - Precision, recall, F1 per case.
  - Macro-averaged metrics across all cases.
4.3 Implementation (scoring.py)
File: harness/reachbench/scoring.py
Functions:
- load_truth(case_truth_path) -> TruthModel
- load_predictions(predictions_path) -> PredictionModel
- compute_case_metrics(truth, preds) -> dict (sketched below)
  - returns: { "case_id": str, "tp": int, "fp": int, "fn": int, "precision": float, "recall": float, "f1": float }
- aggregate_metrics(case_metrics_list) -> dict
  - macro_precision, macro_recall, macro_f1, num_cases.
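A minimal sketch of compute_case_metrics over raw dicts, following the TP/FP/FN rules in 4.2 (a real implementation would take the validated models instead):

```python
def compute_case_metrics(truth: dict, preds: dict) -> dict:
    """Score one case as classification over (cve, symbol) pairs."""
    truth_reachable = {
        (c["cve"], c["symbol"])
        for c in truth["vulnerable_components"]
        if c["status"] == "reachable"
    }
    predicted_reachable = {
        (p["cve"], p["symbol"])
        for p in preds["predictions"]
        if p["status"] == "reachable"
    }
    tp = len(truth_reachable & predicted_reachable)
    fp = len(predicted_reachable - truth_reachable)  # truth says not reachable / unknown
    fn = len(truth_reachable - predicted_reachable)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "case_id": truth["case_id"],
        "tp": tp, "fp": fp, "fn": fn,
        "precision": precision, "recall": recall, "f1": f1,
    }
```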
4.4 CLI: score
Signature:
reachbench score \
--dataset-root ./dataset \
--results-root ./results \
--lockfile ./dataset/manifest.lock.json \
--out ./out/scores.json \
[--cases php-*] \
[--repeat 3]
Behavior:
- Verify the lockfile (fail closed on mismatch).
- Load dataset.json; filter cases if --cases is set (glob).
- For each case:
  - Load the truth file (and validate its schema).
  - Locate the results file (&lt;case_id&gt;.json) under results-root:
    - If missing, treat as all FN (or mark the case as “no submission”).
  - Load and validate predictions (include a JSON Schema: results.schema.json).
  - Compute per-case metrics.
- Aggregate metrics.
- Write scores.json:

      {
        "version": "0.1.0",
        "dataset_version": "0.1.0",
        "generated_at": "2025-01-15T12:34:56Z",
        "macro_precision": 0.92,
        "macro_recall": 0.88,
        "macro_f1": 0.90,
        "cases": [
          {
            "case_id": "php-wordpress-5.8-cve-2023-12345",
            "tp": 10,
            "fp": 1,
            "fn": 2,
            "precision": 0.91,
            "recall": 0.83,
            "f1": 0.87
          }
        ]
      }

- Determinism check (see the helper sketch below):
  - If --repeat N is given:
    - Re-run scoring in-memory N times.
    - Compare the resulting JSON strings (canonicalized via sorted keys).
    - If any differ, exit non-zero with the message “non-deterministic scoring detected”.
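The repeat check can be a small helper along these lines (assert_deterministic is an illustrative name; note that the scoring callable must hold volatile fields such as generated_at fixed across repeats, or strip them before comparing):

```python
import json

def assert_deterministic(run_scoring, n: int = 3) -> str:
    """Run the in-memory scoring callable n times; fail if canonical output ever differs."""
    outputs = [
        json.dumps(run_scoring(), sort_keys=True, separators=(",", ":"))
        for _ in range(n)
    ]
    if len(set(outputs)) != 1:
        raise SystemExit("non-deterministic scoring detected")
    return outputs[0]
```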
4.5 Offline-only mode
- In cli.py, add an early check:

      if os.getenv("REACHBENCH_OFFLINE_ONLY", "1") == "1":
          # Verify no outbound network: by policy, just ensure we never call any net libs.
          # (In v1, simply avoid adding any such calls.)
          pass

- Document that the harness must not reach out to the internet.
Acceptance criteria
- Given a small artificial dataset with 2–3 cases and handcrafted results, reachbench score produces the expected metrics (asserted via tests).
- Running reachbench score --repeat 3 produces identical scores.json output across runs.
- Missing results files are handled gracefully (and clearly documented).
5. Baseline implementations
Goal: Provide in-repo baselines that use only the provided graphs (no extra tooling).
5.1 Baseline types
- Naïve reachable: all symbols in the vulnerable package are considered reachable.
- Imports-only: reachable = any symbol that:
  - appears in the graph, AND
  - is reachable from any entrypoint by a single edge or name match.
- Call-depth-2 (see the traversal sketch after this list):
  - From each entrypoint, traverse up to depth 2 along call edges.
  - Anything at depth ≤ 2 is considered reachable.
5.2 Implementation
File: harness/reachbench/baselines.py
- baseline_naive(graph, truth) -> PredictionModel
- baseline_imports_only(graph, truth) -> PredictionModel
- baseline_call_depth_2(graph, truth) -> PredictionModel
CLI:
reachbench run-baseline \
--dataset-root ./dataset \
--baseline naive|imports|depth2 \
--out ./results-baseline-<baseline>/
Behavior:
- For each case:
  - Load the graph.
  - Generate predictions for the chosen baseline.
  - Write the result file results-baseline-&lt;baseline&gt;/&lt;case_id&gt;.json.
5.3 Tests
- Tiny synthetic dataset in harness/tests/data/:
  - 1–2 cases with simple graphs.
  - Known expectations for each baseline (TP/FP/FN counts).
Acceptance criteria
- reachbench run-baseline --baseline naive runs end-to-end and outputs results files.
- reachbench score on baseline results produces stable scores.
- Tests validate baseline behavior on the synthetic cases.
6. Dataset validation & tooling
Goal: One command to validate everything (schemas, hashes, internal consistency).
CLI: validate-dataset
reachbench validate-dataset \
--dataset-root ./dataset \
[--lockfile ./dataset/manifest.lock.json]
Checks:
- dataset.json conforms to dataset.schema.json.
- For each case:
  - all artifact paths exist
  - the graph file passes graph.schema.json
  - the truth file passes truth.schema.json
- Optional: verify the lockfile if provided.
Implementation:
- dataset_loader.py:
  - load_dataset_index(path) -> DatasetIndex
  - iter_cases(dataset_index) yields case objects.
  - validate_case(case, dataset_root) -> list[str] (list of error messages; sketched below).
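A sketch of validate_case using jsonschema; the artifact shapes follow the dataset.json example in 2.1 (only graph and truth entries carry a schema key), and the message wording is illustrative:

```python
import json
from pathlib import Path

from jsonschema import Draft202012Validator

def validate_case(case: dict, dataset_root: Path, schema_dir: Path) -> list[str]:
    """Return human-readable error messages for one case (empty list = valid)."""
    errors: list[str] = []
    for artifact in case["artifacts"].values():
        path = dataset_root / artifact["path"]
        if not path.is_file():
            errors.append(f"{case['id']}: missing artifact {artifact['path']}")
            continue
        schema_name = artifact.get("schema")  # present on graph/truth artifacts
        if schema_name:
            schema = json.loads((schema_dir / schema_name).read_text(encoding="utf-8"))
            doc = json.loads(path.read_text(encoding="utf-8"))
            for err in Draft202012Validator(schema).iter_errors(doc):
                errors.append(f"{case['id']}: {artifact['path']}: {err.message}")
    return errors
```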
Acceptance criteria
- Broken paths / invalid JSON produce a clear error message and non-zero exit code.
- A CI job calls reachbench validate-dataset on every push.
7. Documentation
Goal: Make it trivial for outsiders to use the benchmark.
7.1 README.md
- Overview:
  - What the benchmark is.
  - What it measures (reachability precision/recall).
- Quickstart:

      git clone ...
      cd stellaops-reachability-benchmark

      # Validate dataset
      reachbench validate-dataset --dataset-root ./dataset

      # Run baselines
      reachbench run-baseline --baseline naive --dataset-root ./dataset --out ./results-naive

      # Score baselines
      reachbench score --dataset-root ./dataset --results-root ./results-naive --out ./out/naive-scores.json
7.2 docs/HOWTO.md
- Step-by-step:
  - Installing the harness.
  - Running your own tool on the dataset.
  - Formatting your results/ directory.
  - Running reachbench score.
  - Interpreting scores.json.
7.3 docs/SCHEMA.md
- Human-readable description of:
  - graph JSON
  - truth JSON
  - results JSON
  - scores JSON
- Links to the actual JSON Schemas.
7.4 docs/REPRODUCIBILITY.md
- Explain:
  - lockfile design
  - hashing rules
  - deterministic scoring and the --repeat flag
  - how to verify you’re using the exact same dataset.
7.5 docs/SANITIZATION.md
- Rules for adding new cases:
  - Only use OSS or properly licensed code.
  - Strip secrets, proprietary paths, and user data.
- How to confirm nothing sensitive is in package tarballs.
Acceptance criteria
- A new engineer (or external user) can go from zero to “I ran the baseline and got scores” by following docs only.
- All example commands work as written.
8. CI/CD details
Goal: Keep repo healthy and ensure determinism.
CI jobs (GitHub Actions)
- lint
  - Run ruff/flake8 (your choice).
- test
  - Run pytest.
- validate-dataset
  - Run reachbench validate-dataset --dataset-root ./dataset.
- determinism
  - Small workflow step:
    - Run reachbench score on a tiny test dataset with --repeat 3.
    - Assert success.
- docker-build
  - docker build the harness image.
Acceptance criteria
- All jobs green on main.
- PRs show failing status if schemas or determinism break.
9. Rough “epics → stories” breakdown
You can paste roughly like this into Jira/Linear:
- Epic: Repo bootstrap & CI
  - Story: Create repo skeleton & Python project
  - Story: Add Dockerfile & basic CI (lint + tests)
- Epic: Schemas & dataset plumbing
  - Story: Implement truth.schema.json + tests
  - Story: Implement graph.schema.json + tests
  - Story: Implement dataset.schema.json + tests
  - Story: Implement validate-dataset CLI
- Epic: Lockfile & determinism
  - Story: Implement lockfile computation + verification
  - Story: Add compute-lockfile & verify-lockfile CLI
  - Story: Add determinism checks in CI
- Epic: Scoring harness
  - Story: Define results format + results.schema.json
  - Story: Implement scoring logic (scoring.py)
  - Story: Implement score CLI with --repeat
  - Story: Add unit tests for metrics
- Epic: Baselines
  - Story: Implement naive baseline
  - Story: Implement imports-only baseline
  - Story: Implement depth-2 baseline
  - Story: Add run-baseline CLI + tests
- Epic: Documentation & polish
  - Story: Write README + HOWTO
  - Story: Write SCHEMA / REPRODUCIBILITY / SANITIZATION docs
  - Story: Final repo cleanup & examples
If you tell me your preferred language and CI, I can also rewrite this into exact tickets and even starter code for cli.py and a couple of schemas.