Here's a concrete, low-lift way to boost StellaOps's visibility and prove your “deterministic, replayable” moat: publish a **sanitized subset of reachability graphs** as a public benchmark that others can run and score identically.
### What this is (plain English)
* You release a small, carefully scrubbed set of **packages + SBOMs + VEX + call graphs** (source & binaries) with **ground-truth reachability labels** for a curated list of CVEs.
* You also ship a **deterministic scoring harness** (container + manifest) so anyone can reproduce the exact scores, byte-for-byte.
### Why it helps
* **Proof of determinism:** identical inputs → identical graphs → identical scores.
* **Research magnet:** gives labs and tool vendors a neutral yardstick; you become “the” benchmark steward.
* **Biz impact:** easy demo for buyers; lets you publish leaderboards and whitepapers.
### Scope (MVP dataset)
* **Languages:** PHP, JS, Python, plus **binary** (ELF/PE/Mach-O) mini-cases.
* **Units:** 20-30 packages total; 3-6 CVEs per language; 4-6 binary cases (static & dynamically-linked).
* **Artifacts per unit:**
* Package tarball(s) or container image digest
* SBOM (CycloneDX 1.6 + SPDX 3.0.1)
* VEX (known-exploited, not-affected, under-investigation)
* **Call graph** (normalized JSON)
* **Ground truth**: list of vulnerable entrypoints/edges considered *reachable*
* **Determinism manifest**: feed URLs + rule hashes + container digests + tool versions
### Data model (keep it simple)
* `dataset.json`: index of cases with content-addressed URIs (sha256)
* `sbom/`, `vex/`, `graphs/`, `truth/` folders mirroring the index
* `manifest.lock.json`: DSSE-signed record of:
* feeder rules, lattice policies, normalizers (name + version + hash)
* container image digests for each step (scanner/cartographer/normalizer)
* timestamp + signer (StellaOps Authority)
### Scoring harness (deterministic)
* One Docker image: `stellaops/benchmark-harness:<tag>`
* Inputs: dataset root + `manifest.lock.json`
* Outputs:
* `scores.json` (precision/recall/F1, per-case and macro)
* `replay-proof.txt` (hashes of every artifact used)
* **No network** mode (offline-first). Fails closed if any hash mismatches.
### Metrics (clear + auditable)
* Per case: TP/FP/FN for **reachable** functions (or edges), plus optional **sink-reach** verification.
* Aggregates: micro/macro F1; “Determinism Index” (stddev of repeated runs must be 0).
* **Repro test:** the harness re-runs scoring N=3 times and asserts identical outputs (hash compare).
### Sanitization & legal
* Strip any proprietary code/data; prefer OSS with permissive licenses.
* Replace real package registries with **local mirrors** and pin digests.
* Publish under **CC-BY-4.0** (data) + **Apache-2.0** (harness). Add a simple **contributor license agreement** for external case submissions.
### Baselines to include (neutral + useful)
* “Naïve reachable” (all functions in package)
* “Imports-only” (entrypoints that match the import graph)
* “Call-depth-2” (bounded traversal)
* **Your** graph engine run with **frozen rules** from the manifest (as a reference, not a claim of SOTA)
### Repository layout (public)
```
stellaops-reachability-benchmark/
dataset/
dataset.json
sbom/...
vex/...
graphs/...
truth/...
manifest.lock.json (DSSE-signed)
harness/
Dockerfile
runner.py (CLI)
schema/ (JSON Schemas for graphs, truth, scores)
docs/
HOWTO.md (5-min run)
CONTRIBUTING.md
SANITIZATION.md
LICENSES/
```
### Docs your team can ship in a day
* **HOWTO.md:** `docker run -v $PWD/dataset:/d -v $PWD/out:/o stellaops/benchmark-harness score /d /o`
* **SCHEMA.md:** JSON Schemas for graph and truth (keep fields minimal: `nodes`, `edges`, `purls`, `sinks`, `evidence`).
* **REPRODUCIBILITY.md:** explains DSSE signatures, lockfile, and offline run.
* **LIMITATIONS.md:** clarifies scope (no dynamic runtime traces in v1, etc.).
### Governance (lightweight)
* **Versioned releases:** `v0.1`, `v0.2` with changelogs.
* **Submission gate:** PR template + CI that:
* validates schemas
* checks hashes match lockfile
* re-scores and compares to the contributor's score
* **Leaderboard cadence:** monthly markdown table regenerated by CI.
### Launch plan (2-week sprint)
* **Day 1-2:** pick cases; finalize schemas; write SANITIZATION.md.
* **Day 3-5:** build harness image; implement deterministic runner; freeze `manifest.lock.json`.
* **Day 6-8:** produce ground truth; run baselines; generate initial scores.
* **Day 9-10:** docs + website README; record a 2-minute demo GIF.
* **Day 11-12:** legal review + licenses; create issue labels (“good first case”).
* **Day 13-14:** publish, post on GitHub + LinkedIn; invite Semgrep/Snyk/OSS-Fuzz folks to submit cases.
### Nice-to-have (but easy)
* **JSON Schema** for ground-truth edges so academics can auto-ingest.
* **Small “unknowns” registry** example to show how you annotate unresolved symbols without breaking determinism.
* **Binary mini-lab**: stripped vs non-stripped ELF pair to show your patch-oracle technique in action (truth labels reflect oracle result).
If you want, I can draft the repo skeleton (folders, placeholder JSON Schemas, a sample `manifest.lock.json`, and a minimal `runner.py` CLI) so you can drop it straight into GitHub.
Got you — let's turn that high-level idea into something your devs can actually pick up and ship.
Below is a **concrete implementation plan** for the *StellaOps Reachability Benchmark* repo: directory structure, components, tasks, and acceptance criteria. You can drop this straight into a ticketing system as epics → stories.
---
## 0. Tech assumptions (adjust if needed)
To be specific, I'll assume:
* **Repo**: `stellaops-reachability-benchmark`
* **Harness language**: Python 3.11+
* **Packaging**: Docker image for the harness
* **Schemas**: JSON Schema (Draft 2020-12)
* **CI**: GitHub Actions
If your stack differs, you can still reuse the structure and acceptance criteria.
---
## 1. Repo skeleton & project bootstrap
**Goal:** Create a minimal but fully wired repo.
### Tasks
1. **Create skeleton**
* Structure:
```text
stellaops-reachability-benchmark/
dataset/
dataset.json
sbom/
vex/
graphs/
truth/
packages/
manifest.lock.json # initially stub
harness/
reachbench/
__init__.py
cli.py
dataset_loader.py
schemas/
graph.schema.json
truth.schema.json
dataset.schema.json
scores.schema.json
tests/
docs/
HOWTO.md
SCHEMA.md
REPRODUCIBILITY.md
LIMITATIONS.md
SANITIZATION.md
.github/
workflows/
ci.yml
pyproject.toml
README.md
LICENSE
Dockerfile
```
2. **Bootstrap Python project**
* `pyproject.toml` with:
* `reachbench` package
* deps: `jsonschema`, `click` or `typer`, `pyyaml`, `pytest`
* `harness/tests/` with a dummy test to ensure CI is green.
3. **Dockerfile**
* Minimal, pinned versions:
```Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir .
ENTRYPOINT ["reachbench"]
```
4. **CI basic pipeline (`.github/workflows/ci.yml`)**
* Jobs:
* `lint` (e.g., `ruff` or `flake8` if you want)
* `test` (pytest)
* `build-docker` (just to ensure Dockerfile stays valid)
### Acceptance criteria
* `pip install .` works locally.
* `reachbench --help` prints CLI help (even if commands are stubs).
* CI passes on main branch.
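For orientation, here is a minimal `cli.py` stub that would satisfy these criteria, assuming the Typer option from the dependency list; command names and bodies are placeholders to be filled in by later sections:
```python
# harness/reachbench/cli.py (sketch; commands are stubs for now)
import typer

app = typer.Typer(help="StellaOps Reachability Benchmark harness")


@app.command("validate-dataset")
def validate_dataset(dataset_root: str = typer.Option(..., "--dataset-root")) -> None:
    """Validate dataset JSON files against their schemas (section 6)."""
    typer.echo(f"validate-dataset: {dataset_root} (not implemented yet)")


@app.command("score")
def score(
    dataset_root: str = typer.Option(..., "--dataset-root"),
    results_root: str = typer.Option(..., "--results-root"),
) -> None:
    """Score participant results against ground truth (section 4)."""
    typer.echo("score: not implemented yet")


def main() -> None:
    app()


if __name__ == "__main__":
    main()
```
The `reachbench` console script would point at `reachbench.cli:main` via `[project.scripts]` in `pyproject.toml`.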
---
## 2. Dataset & schema definitions
**Goal:** Define all JSON formats and enforce them.
### 2.1 Define dataset index format (`dataset/dataset.json`)
**File:** `dataset/dataset.json`
**Example:**
```jsonc
{
"version": "0.1.0",
"cases": [
{
"id": "php-wordpress-5.8-cve-2023-12345",
"language": "php",
"kind": "source", // "source" | "binary" | "container"
"cves": ["CVE-2023-12345"],
"artifacts": {
"package": {
"path": "packages/php/wordpress-5.8.tar.gz",
"sha256": "…"
},
"sbom": {
"path": "sbom/php/wordpress-5.8.cdx.json",
"format": "cyclonedx-1.6",
"sha256": "…"
},
"vex": {
"path": "vex/php/wordpress-5.8.vex.json",
"format": "csaf-2.0",
"sha256": "…"
},
"graph": {
"path": "graphs/php/wordpress-5.8.graph.json",
"schema": "graph.schema.json",
"sha256": "…"
},
"truth": {
"path": "truth/php/wordpress-5.8.truth.json",
"schema": "truth.schema.json",
"sha256": "…"
}
}
}
]
}
```
### 2.2 Define **truth schema** (`harness/reachbench/schemas/truth.schema.json`)
**Model (conceptual):**
```jsonc
{
"case_id": "php-wordpress-5.8-cve-2023-12345",
"vulnerable_components": [
{
"cve": "CVE-2023-12345",
"symbol": "wp_ajax_nopriv_some_vuln",
"symbol_kind": "function", // "function" | "method" | "binary_symbol"
"status": "reachable", // "reachable" | "not_reachable"
"reachable_from": [
{
"entrypoint_id": "web:GET:/foo",
"notes": "HTTP route /foo"
}
],
"evidence": "manual-analysis" // or "unit-test", "patch-oracle"
}
],
"non_vulnerable_components": [
{
"symbol": "wp_safe_function",
"symbol_kind": "function",
"status": "not_reachable",
"evidence": "manual-analysis"
}
]
}
```
**Tasks**
* Implement JSON Schema capturing (a possible shape is sketched after this list):
* required fields: `case_id`, `vulnerable_components`
* allowed enums for `symbol_kind`, `status`, `evidence`
* Add unit tests that:
* validate a valid truth file
* fail on various broken ones (missing `case_id`, unknown `status`, etc.)
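A sketch of `truth.schema.json` following the conceptual model above; the enums and required fields match the task list, but this is not the final contract:
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Truth file",
  "type": "object",
  "required": ["case_id", "vulnerable_components"],
  "additionalProperties": false,
  "properties": {
    "case_id": { "type": "string" },
    "vulnerable_components": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["cve", "symbol", "symbol_kind", "status"],
        "properties": {
          "cve": { "type": "string" },
          "symbol": { "type": "string" },
          "symbol_kind": { "enum": ["function", "method", "binary_symbol"] },
          "status": { "enum": ["reachable", "not_reachable"] },
          "evidence": { "enum": ["manual-analysis", "unit-test", "patch-oracle"] }
        }
      }
    },
    "non_vulnerable_components": { "type": "array" }
  }
}
```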
### 2.3 Define **graph schema** (`harness/reachbench/schemas/graph.schema.json`)
**Model (conceptual):**
```jsonc
{
"case_id": "php-wordpress-5.8-cve-2023-12345",
"language": "php",
"nodes": [
{
"id": "func:wp_ajax_nopriv_some_vuln",
"symbol": "wp_ajax_nopriv_some_vuln",
"kind": "function",
"purl": "pkg:composer/wordpress/wordpress@5.8"
}
],
"edges": [
{
"from": "func:wp_ajax_nopriv_some_vuln",
"to": "func:wpdb_query",
"kind": "call"
}
],
"entrypoints": [
{
"id": "web:GET:/foo",
"symbol": "some_controller",
"kind": "http_route"
}
]
}
```
**Tasks**
* JSON Schema with:
* `nodes[]` (id, symbol, kind, optional purl)
* `edges[]` (`from`, `to`, `kind`)
* `entrypoints[]` (id, symbol, kind)
* Tests: verify a valid graph; invalid ones (missing `id`, unknown `kind`) are rejected.
### 2.4 Dataset index schema (`dataset.schema.json`)
* JSON Schema describing `dataset.json` (version string, cases array).
* Tests: validate the example dataset file.
### Acceptance criteria
* Running `reachbench validate-dataset` (implemented in section 6) validates all JSON files in `dataset/` against their schemas without errors.
* CI fails if any dataset JSON is invalid.
---
## 3. Lockfile & determinism manifest
**Goal:** Implement `manifest.lock.json` generation and verification.
### 3.1 Lockfile structure
**File:** `dataset/manifest.lock.json`
**Example:**
```jsonc
{
"version": "0.1.0",
"created_at": "2025-01-15T12:00:00Z",
"dataset": {
"root": "dataset/",
"sha256": "…",
"cases": {
"php-wordpress-5.8-cve-2023-12345": {
"sha256": "…"
}
}
},
"tools": {
"graph_normalizer": {
"name": "stellaops-graph-normalizer",
"version": "1.2.3",
"sha256": "…"
}
},
"containers": {
"scanner_image": "ghcr.io/stellaops/scanner@sha256:…",
"normalizer_image": "ghcr.io/stellaops/normalizer@sha256:…"
},
"signatures": [
{
"type": "dsse",
"key_id": "stellaops-benchmark-key-1",
"signature": "base64-encoded-blob"
}
]
}
```
*(Signatures can be optional in v1, but the structure should be there.)*
### 3.2 `lockfile.py` module
**File:** `harness/reachbench/lockfile.py`
**Responsibilities**
* Compute deterministic SHA-256 digest of:
* each case's artifacts (path → hash from `dataset.json`)
* entire `dataset/` tree (sorted traversal)
* Generate new `manifest.lock.json`:
* `version` (hard-coded constant)
* `created_at` (UTC, ISO 8601)
* `dataset` section with case hashes
* Verification:
* `verify_lockfile(dataset_root, lockfile_path)`:
* recompute hashes
* compare to `lockfile.dataset`
* return boolean + list of mismatches
**Tasks**
1. Implement canonical hashing (sketched after this list):
* For text JSON files: normalize with:
* sort keys
* no whitespace
* UTF-8 encoding
* For binaries (packages): raw bytes.
2. Implement `compute_dataset_hashes(dataset_root)`:
* Returns `{"cases": {...}, "root_sha256": "…"}`.
3. Implement `write_lockfile(...)` and `verify_lockfile(...)`.
4. Tests:
* Two calls with same dataset produce identical lockfile (order of `cases` keys normalized).
* Changing any artifact file changes the root hash and causes verify to fail.
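A minimal sketch of the canonical hashing and per-case digests referenced in task 1; function names and file layout are suggestions, not a fixed API:
```python
# harness/reachbench/lockfile.py (sketch)
import hashlib
import json
from pathlib import Path


def canonical_json_bytes(path: Path) -> bytes:
    """Normalize a JSON file: sorted keys, no extra whitespace, UTF-8."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")


def file_sha256(path: Path) -> str:
    """Hash JSON canonically; packages and binaries as raw bytes."""
    payload = canonical_json_bytes(path) if path.suffix == ".json" else path.read_bytes()
    return hashlib.sha256(payload).hexdigest()


def compute_dataset_hashes(dataset_root: Path) -> dict:
    """Per-case hashes plus a root hash over the sorted set of case digests."""
    index = json.loads((dataset_root / "dataset.json").read_text(encoding="utf-8"))
    cases = {}
    for case in index["cases"]:
        digests = sorted(
            file_sha256(dataset_root / art["path"]) for art in case["artifacts"].values()
        )
        cases[case["id"]] = hashlib.sha256("".join(digests).encode()).hexdigest()
    root = hashlib.sha256(
        "".join(f"{k}:{v}" for k, v in sorted(cases.items())).encode()
    ).hexdigest()
    return {"cases": cases, "root_sha256": root}
```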
### 3.3 CLI commands
Add to `cli.py`:
* `reachbench compute-lockfile --dataset-root ./dataset --out ./dataset/manifest.lock.json`
* `reachbench verify-lockfile --dataset-root ./dataset --lockfile ./dataset/manifest.lock.json`
### Acceptance criteria
* `reachbench compute-lockfile` generates a stable file (byte-for-byte identical across runs).
* `reachbench verify-lockfile` exits with:
* code 0 if matches
* non-zero if mismatch (plus human-readable diff).
---
## 4. Scoring harness CLI
**Goal:** Deterministically score participant results against ground truth.
### 4.1 Result format (participant output)
**Expectation:**
Participants provide `results/` with one JSON per case:
```text
results/
php-wordpress-5.8-cve-2023-12345.json
js-express-4.17-cve-2022-9999.json
```
**Result file example:**
```jsonc
{
"case_id": "php-wordpress-5.8-cve-2023-12345",
"tool_name": "my-reachability-analyzer",
"tool_version": "1.0.0",
"predictions": [
{
"cve": "CVE-2023-12345",
"symbol": "wp_ajax_nopriv_some_vuln",
"symbol_kind": "function",
"status": "reachable"
},
{
"cve": "CVE-2023-12345",
"symbol": "wp_safe_function",
"symbol_kind": "function",
"status": "not_reachable"
}
]
}
```
### 4.2 Scoring model
* Treat scoring as classification over `(cve, symbol)` pairs.
* For each case:
* Truth positives: all `vulnerable_components` with `status == "reachable"`.
* Truth negatives: everything marked `not_reachable` (optional in v1).
* Predictions: all entries with `status == "reachable"`.
* Compute:
* `TP`: predicted reachable & truth reachable.
* `FP`: predicted reachable but truth says not reachable / unknown.
* `FN`: truth reachable but not predicted reachable.
* Metrics:
* Precision, Recall, F1 per case.
* Macro-averaged metrics across all cases.
### 4.3 Implementation (`scoring.py`)
**File:** `harness/reachbench/scoring.py`
**Functions:**
* `load_truth(case_truth_path) -> TruthModel`
* `load_predictions(predictions_path) -> PredictionModel`
* `compute_case_metrics(truth, preds) -> dict`
* returns:
```python
{
"case_id": str,
"tp": int,
"fp": int,
"fn": int,
"precision": float,
"recall": float,
"f1": float
}
```
* `aggregate_metrics(case_metrics_list) -> dict`
* `macro_precision`, `macro_recall`, `macro_f1`, `num_cases`.
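A sketch of the core of `scoring.py`, treating truth and predictions as sets of `(cve, symbol)` pairs per section 4.2; the dict shapes mirror the JSON examples and the helper names above:
```python
# harness/reachbench/scoring.py (sketch)

def compute_case_metrics(truth: dict, preds: dict) -> dict:
    """TP/FP/FN over (cve, symbol) pairs; ratios default to 0.0 when undefined."""
    truth_reachable = {
        (c["cve"], c["symbol"])
        for c in truth["vulnerable_components"]
        if c["status"] == "reachable"
    }
    predicted_reachable = {
        (p["cve"], p["symbol"])
        for p in preds.get("predictions", [])
        if p["status"] == "reachable"
    }
    tp = len(truth_reachable & predicted_reachable)
    fp = len(predicted_reachable - truth_reachable)
    fn = len(truth_reachable - predicted_reachable)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "case_id": truth["case_id"],
        "tp": tp, "fp": fp, "fn": fn,
        "precision": precision, "recall": recall, "f1": f1,
    }


def aggregate_metrics(case_metrics: list[dict]) -> dict:
    """Unweighted (macro) averages across cases."""
    n = len(case_metrics)
    return {
        "num_cases": n,
        "macro_precision": sum(m["precision"] for m in case_metrics) / n if n else 0.0,
        "macro_recall": sum(m["recall"] for m in case_metrics) / n if n else 0.0,
        "macro_f1": sum(m["f1"] for m in case_metrics) / n if n else 0.0,
    }
```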
### 4.4 CLI: `score`
**Signature:**
```bash
reachbench score \
--dataset-root ./dataset \
--results-root ./results \
--lockfile ./dataset/manifest.lock.json \
--out ./out/scores.json \
[--cases php-*] \
[--repeat 3]
```
**Behavior:**
1. **Verify lockfile** (fail closed if mismatch).
2. Load `dataset.json`, filter cases if `--cases` is set (glob).
3. For each case:
* Load truth file (and validate schema).
* Locate results file (`<case_id>.json`) under `results-root`:
* If missing, treat as all FN (or mark case as “no submission”).
* Load and validate predictions against `results.schema.json` (a new schema to add alongside the others).
* Compute per-case metrics.
4. Aggregate metrics.
5. Write `scores.json`:
```jsonc
{
"version": "0.1.0",
"dataset_version": "0.1.0",
"generated_at": "2025-01-15T12:34:56Z",
"macro_precision": 0.92,
"macro_recall": 0.88,
"macro_f1": 0.90,
"cases": [
{
"case_id": "php-wordpress-5.8-cve-2023-12345",
"tp": 10,
"fp": 1,
"fn": 2,
"precision": 0.91,
"recall": 0.83,
"f1": 0.87
}
]
}
```
6. **Determinism check**:
* If `--repeat N` given:
* Re-run scoring in-memory N times.
* Compare resulting JSON strings (canonicalized via sorted keys).
* If any differ, exit non-zero with message (“non-deterministic scoring detected”).
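One simple way to implement the repeat check (canonical JSON comparison); note that `generated_at` should be stamped once after the comparison, or excluded from it, so timestamps do not make the runs trivially differ:
```python
import json


def assert_deterministic(run_scoring, repeat: int = 3) -> dict:
    """Run the in-memory scoring callable N times; fail closed if any output differs."""
    outputs = [
        json.dumps(run_scoring(), sort_keys=True, separators=(",", ":"))
        for _ in range(repeat)
    ]
    if len(set(outputs)) != 1:
        raise SystemExit("non-deterministic scoring detected")
    return json.loads(outputs[0])
```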
### 4.5 Offline-only mode
* In `cli.py`, early check:
```python
if os.getenv("REACHBENCH_OFFLINE_ONLY", "1") == "1":
    # Fail closed: one option is to monkeypatch socket.socket so any outbound call raises.
    import socket
    def _no_network(*args, **kwargs):
        raise RuntimeError("REACHBENCH_OFFLINE_ONLY=1: network access is disabled")
    socket.socket = _no_network
```
* Document that the harness must not reach out to the internet.
### Acceptance criteria
* Given a small artificial dataset with 2-3 cases and hand-crafted results, `reachbench score` produces expected metrics (assert via tests).
* Running `reachbench score --repeat 3` produces identical `scores.json` across runs.
* Missing results files are handled gracefully (but clearly documented).
---
## 5. Baseline implementations
**Goal:** Provide in-repo baselines that use only the provided graphs (no extra tooling).
### 5.1 Baseline types
1. **Naïve reachable**: all symbols in the vulnerable package are considered reachable.
2. **Imports-only**: reachable = any symbol that:
* appears in the graph AND
* is reachable from any entrypoint by a single edge OR name match.
3. **Call-depth-2**:
* From each entrypoint, traverse up to depth 2 along `call` edges.
* Anything at depth ≤ 2 is considered reachable.
### 5.2 Implementation
**File:** `harness/reachbench/baselines.py`
* `baseline_naive(graph, truth) -> PredictionModel`
* `baseline_imports_only(graph, truth) -> PredictionModel`
* `baseline_call_depth_2(graph, truth) -> PredictionModel`
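A sketch of the traversal behind the depth-2 baseline, using only the graph fields from section 2.3; how entrypoints map to node ids (here: matching on `symbol`) is an assumption the real implementation should pin down:
```python
# harness/reachbench/baselines.py (sketch of the depth-2 traversal)
from collections import deque


def reachable_within_depth(graph: dict, max_depth: int = 2) -> set[str]:
    """BFS along 'call' edges from every entrypoint, up to max_depth hops."""
    adjacency: dict[str, list[str]] = {}
    for edge in graph.get("edges", []):
        if edge.get("kind") == "call":
            adjacency.setdefault(edge["from"], []).append(edge["to"])

    seen: set[str] = set()
    queue: deque[tuple[str, int]] = deque()
    for ep in graph.get("entrypoints", []):
        # Assumption: an entrypoint's symbol names the node(s) where traversal starts.
        for node in graph.get("nodes", []):
            if node["symbol"] == ep["symbol"] and node["id"] not in seen:
                seen.add(node["id"])
                queue.append((node["id"], 0))

    while queue:
        node_id, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for nxt in adjacency.get(node_id, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen
```
`baseline_call_depth_2` would then turn the reached node ids into a `PredictionModel` by marking the corresponding `(cve, symbol)` pairs as `reachable`.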
**CLI:**
```bash
reachbench run-baseline \
--dataset-root ./dataset \
--baseline naive|imports|depth2 \
--out ./results-baseline-<baseline>/
```
Behavior:
* For each case:
* Load graph.
* Generate predictions per baseline.
* Write result file `results-baseline-<baseline>/<case_id>.json`.
### 5.3 Tests
* Tiny synthetic dataset in `harness/tests/data/`:
* 1-2 cases with simple graphs.
* Known expectations for each baseline (TP/FP/FN counts).
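As an illustration, one such test against the hypothetical `reachable_within_depth` helper sketched in 5.2 (graph and expectations are synthetic):
```python
# harness/tests/test_baselines.py (sketch)
from reachbench.baselines import reachable_within_depth

SYNTHETIC_GRAPH = {
    "nodes": [
        {"id": "func:main", "symbol": "main", "kind": "function"},
        {"id": "func:a", "symbol": "a", "kind": "function"},
        {"id": "func:b", "symbol": "b", "kind": "function"},
        {"id": "func:c", "symbol": "c", "kind": "function"},
    ],
    "edges": [
        {"from": "func:main", "to": "func:a", "kind": "call"},
        {"from": "func:a", "to": "func:b", "kind": "call"},
        {"from": "func:b", "to": "func:c", "kind": "call"},
    ],
    "entrypoints": [{"id": "cli:main", "symbol": "main", "kind": "cli"}],
}


def test_depth_2_stops_at_two_hops():
    reached = reachable_within_depth(SYNTHETIC_GRAPH, max_depth=2)
    assert reached == {"func:main", "func:a", "func:b"}  # func:c is 3 hops away
```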
### Acceptance criteria
* `reachbench run-baseline --baseline naive` runs end-to-end and outputs results files.
* `reachbench score` on baseline results produces stable scores.
* Tests validate baseline behavior on synthetic cases.
---
## 6. Dataset validation & tooling
**Goal:** One command to validate everything (schemas, hashes, internal consistency).
### CLI: `validate-dataset`
```bash
reachbench validate-dataset \
--dataset-root ./dataset \
[--lockfile ./dataset/manifest.lock.json]
```
**Checks:**
1. `dataset.json` conforms to `dataset.schema.json`.
2. For each case:
* all artifact paths exist
* `graph` file passes `graph.schema.json`
* `truth` file passes `truth.schema.json`
3. Optional: verify lockfile if provided.
**Implementation:**
* `dataset_loader.py`:
* `load_dataset_index(path) -> DatasetIndex`
* `iter_cases(dataset_index)` yields case objects.
* `validate_case(case, dataset_root) -> list[str]` (list of error messages).
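A sketch of `validate_case`, assuming the `jsonschema` dependency from section 1 and that the schemas are packaged under `reachbench/schemas/`:
```python
# harness/reachbench/dataset_loader.py (sketch of validate_case)
import json
from importlib import resources
from pathlib import Path

from jsonschema import Draft202012Validator


def _load_schema(name: str) -> dict:
    schema_text = resources.files("reachbench.schemas").joinpath(name).read_text("utf-8")
    return json.loads(schema_text)


def validate_case(case: dict, dataset_root: Path) -> list[str]:
    """Return a list of human-readable error messages (empty list means valid)."""
    errors: list[str] = []
    for role, artifact in case["artifacts"].items():
        path = dataset_root / artifact["path"]
        if not path.exists():
            errors.append(f"{case['id']}: missing {role} file {artifact['path']}")
            continue
        schema_name = artifact.get("schema")  # only graph/truth entries carry a schema
        if schema_name:
            validator = Draft202012Validator(_load_schema(schema_name))
            doc = json.loads(path.read_text(encoding="utf-8"))
            for err in validator.iter_errors(doc):
                errors.append(f"{case['id']}: {role}: {err.message}")
    return errors
```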
**Acceptance criteria**
* Broken paths / invalid JSON produce a clear error message and non-zero exit code.
* CI job calls `reachbench validate-dataset` on every push.
---
## 7. Documentation
**Goal:** Make it trivial for outsiders to use the benchmark.
### 7.1 `README.md`
* Overview:
* What the benchmark is.
* What it measures (reachability precision/recall).
* Quickstart:
```bash
git clone ...
cd stellaops-reachability-benchmark
# Validate dataset
reachbench validate-dataset --dataset-root ./dataset
# Run baselines
reachbench run-baseline --baseline naive --dataset-root ./dataset --out ./results-naive
# Score baselines
reachbench score --dataset-root ./dataset --results-root ./results-naive --out ./out/naive-scores.json
```
### 7.2 `docs/HOWTO.md`
* Step-by-step:
* Installing harness.
* Running your own tool on the dataset.
* Formatting your `results/`.
* Running `reachbench score`.
* Interpreting `scores.json`.
### 7.3 `docs/SCHEMA.md`
* Human-readable description of:
* `graph` JSON
* `truth` JSON
* `results` JSON
* `scores` JSON
* Link to actual JSON Schemas.
### 7.4 `docs/REPRODUCIBILITY.md`
* Explain:
* lockfile design
* hashing rules
* deterministic scoring and `--repeat` flag
* how to verify you're using the exact same dataset.
### 7.5 `docs/SANITIZATION.md`
* Rules for adding new cases:
* Only use OSS or properly licensed code.
* Strip secrets / proprietary paths / user data.
* How to confirm nothing sensitive is in package tarballs.
### Acceptance criteria
* A new engineer (or external user) can go from zero to “I ran the baseline and got scores” by following docs only.
* All example commands work as written.
---
## 8. CI/CD details
**Goal:** Keep repo healthy and ensure determinism.
### CI jobs (GitHub Actions)
1. **`lint`**
* Run `ruff` / `flake8` (your choice).
2. **`test`**
* Run `pytest`.
3. **`validate-dataset`**
* Run `reachbench validate-dataset --dataset-root ./dataset`.
4. **`determinism`**
* Small workflow step:
* Run `reachbench score` on a tiny test dataset with `--repeat 3`.
* Assert success.
5. **`docker-build`**
* `docker build` the harness image.
### Acceptance criteria
* All jobs green on main.
* PRs show failing status if schemas or determinism break.
---
## 9. Rough “epics → stories” breakdown
You can paste this roughly as-is into Jira/Linear:
1. **Epic: Repo bootstrap & CI**
* Story: Create repo skeleton & Python project
* Story: Add Dockerfile & basic CI (lint + tests)
2. **Epic: Schemas & dataset plumbing**
* Story: Implement `truth.schema.json` + tests
* Story: Implement `graph.schema.json` + tests
* Story: Implement `dataset.schema.json` + tests
* Story: Implement `validate-dataset` CLI
3. **Epic: Lockfile & determinism**
* Story: Implement lockfile computation + verification
* Story: Add `compute-lockfile` & `verify-lockfile` CLI
* Story: Add determinism checks in CI
4. **Epic: Scoring harness**
* Story: Define results format + `results.schema.json`
* Story: Implement scoring logic (`scoring.py`)
* Story: Implement `score` CLI with `--repeat`
* Story: Add unit tests for metrics
5. **Epic: Baselines**
* Story: Implement naive baseline
* Story: Implement imports-only baseline
* Story: Implement depth-2 baseline
* Story: Add `run-baseline` CLI + tests
6. **Epic: Documentation & polish**
* Story: Write README + HOWTO
* Story: Write SCHEMA / REPRODUCIBILITY / SANITIZATION docs
* Story: Final repo cleanup & examples
---
If you tell me your preferred language and CI, I can also rewrite this into exact tickets and even starter code for `cli.py` and a couple of schemas.