Here’s a concrete, low‑lift way to boost Stella Ops’s visibility and prove your “deterministic, replayable” moat: publish a **sanitized subset of reachability graphs** as a public benchmark that others can run and score identically.

### What this is (plain English)

* You release a small, carefully scrubbed set of **packages + SBOMs + VEX + call‑graphs** (source & binaries) with **ground‑truth reachability labels** for a curated list of CVEs.
* You also ship a **deterministic scoring harness** (container + manifest) so anyone can reproduce the exact scores, byte‑for‑byte.

### Why it helps

* **Proof of determinism:** identical inputs → identical graphs → identical scores.
* **Research magnet:** gives labs and tool vendors a neutral yardstick; you become “the” benchmark steward.
* **Biz impact:** an easy demo for buyers; lets you publish leaderboards and whitepapers.

### Scope (MVP dataset)

* **Languages:** PHP, JS, Python, plus **binary** (ELF/PE/Mach‑O) mini-cases.
* **Units:** 20–30 packages total; 3–6 CVEs per language; 4–6 binary cases (statically & dynamically linked).
* **Artifacts per unit:**

  * Package tarball(s) or container image digest
  * SBOM (CycloneDX 1.6 + SPDX 3.0.1)
  * VEX (known‑exploited, not‑affected, under‑investigation)
  * **Call graph** (normalized JSON)
  * **Ground truth**: list of vulnerable entrypoints/edges considered *reachable*
  * **Determinism manifest**: feed URLs + rule hashes + container digests + tool versions

### Data model (keep it simple)

* `dataset.json`: index of cases with content‑addressed URIs (sha256)
* `sbom/`, `vex/`, `graphs/`, `truth/` folders mirroring the index
* `manifest.lock.json`: DSSE‑signed record of:

  * feeder rules, lattice policies, normalizers (name + version + hash)
  * container image digests for each step (scanner/cartographer/normalizer)
  * timestamp + signer (Stella Ops Authority)

### Scoring harness (deterministic)

* One Docker image: `stellaops/benchmark-harness:<tag>`
* Inputs: the dataset root + `manifest.lock.json`
* Outputs:

  * `scores.json` (precision/recall/F1, per‑case and macro)
  * `replay-proof.txt` (hashes of every artifact used)
* **No network** mode (offline‑first). Fails closed if any hash mismatches (sketched below).

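To make “fails closed” concrete, here is a minimal sketch of the check the harness could run on every artifact before scoring (the function name and error text are illustrative, not a shipped API):

```python
import hashlib
import sys
from pathlib import Path


def verify_artifact(path: Path, expected_sha256: str) -> None:
    """Refuse to proceed if a file's bytes don't match the lockfile digest."""
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    if actual != expected_sha256:
        # Fail closed: no partial scoring on tampered or drifted inputs.
        sys.exit(f"hash mismatch for {path}: refusing to score")
```
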
### Metrics (clear + auditable)

* Per case: TP/FP/FN for **reachable** functions (or edges), plus optional **sink‑reach** verification.
* Aggregates: micro/macro F1; a “Determinism Index” (the standard deviation of scores across repeated runs must be 0).
* **Repro test:** the harness re‑runs N=3 times and asserts identical outputs (hash compare; sketched below).

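A minimal sketch of that repro test: hash each run’s canonical JSON and demand exactly one distinct digest (equivalent to a standard deviation of 0). The function name is illustrative:

```python
import hashlib
import json


def determinism_index_ok(runs: list[dict]) -> bool:
    """True only if all repeated scoring runs are byte-identical after canonicalization."""
    digests = {
        hashlib.sha256(json.dumps(run, sort_keys=True).encode("utf-8")).hexdigest()
        for run in runs
    }
    return len(digests) == 1
```
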
### Sanitization & legal

* Strip any proprietary code/data; prefer OSS with permissive licenses.
* Replace real package registries with **local mirrors** and pin digests.
* Publish under **CC‑BY‑4.0** (data) + **Apache‑2.0** (harness). Add a simple **contributor license agreement** for external case submissions.

### Baselines to include (neutral + useful)

* “Naïve reachable” (all functions in the package)
* “Imports‑only” (entrypoints that match the import graph)
* “Call‑depth‑2” (bounded traversal)
* **Your** graph engine run with **frozen rules** from the manifest (as a reference, not a claim of SOTA)

### Repository layout (public)

```
stellaops-reachability-benchmark/
  dataset/
    dataset.json
    sbom/...
    vex/...
    graphs/...
    truth/...
    manifest.lock.json    (DSSE-signed)
  harness/
    Dockerfile
    runner.py             (CLI)
    schema/               (JSON Schemas for graphs, truth, scores)
  docs/
    HOWTO.md              (5-min run)
    CONTRIBUTING.md
    SANITIZATION.md
  LICENSES/
```

### Docs your team can ship in a day

* **HOWTO.md:** `docker run -v $PWD/dataset:/d -v $PWD/out:/o stellaops/benchmark-harness score /d /o`
* **SCHEMA.md:** JSON Schemas for graph and truth (keep fields minimal: `nodes`, `edges`, `purls`, `sinks`, `evidence`).
* **REPRODUCIBILITY.md:** explains DSSE signatures, the lockfile, and offline runs.
* **LIMITATIONS.md:** clarifies scope (no dynamic runtime traces in v1, etc.).

### Governance (lightweight)

* **Versioned releases:** `v0.1`, `v0.2` with changelogs.
* **Submission gate:** a PR template + CI that:

  * validates schemas
  * checks that hashes match the lockfile
  * re‑scores and compares against the contributor’s score
* **Leaderboard cadence:** a monthly markdown table regenerated by CI.

### Launch plan (2‑week sprint)

* **Day 1–2:** pick cases; finalize schemas; write SANITIZATION.md.
* **Day 3–5:** build the harness image; implement the deterministic runner; freeze `manifest.lock.json`.
* **Day 6–8:** produce ground truth; run baselines; generate initial scores.
* **Day 9–10:** docs + website README; record a 2‑minute demo GIF.
* **Day 11–12:** legal review + licenses; create issue labels (“good first case”).
* **Day 13–14:** publish, post on GitHub + LinkedIn; invite Semgrep/Snyk/OSS‑Fuzz folks to submit cases.

### Nice‑to‑have (but easy)

* **JSON Schema** for ground‑truth edges so academics can auto‑ingest.
* **Small “unknowns” registry** example to show how you annotate unresolved symbols without breaking determinism.
* **Binary mini‑lab**: a stripped vs. non‑stripped ELF pair to show your patch‑oracle technique in action (truth labels reflect the oracle result).

If you want, I can draft the repo skeleton (folders, placeholder JSON Schemas, a sample `manifest.lock.json`, and a minimal `runner.py` CLI) so you can drop it straight into GitHub.

---

Got you: let’s turn that high‑level idea into something your devs can actually pick up and ship.

Below is a **concrete implementation plan** for the *StellaOps Reachability Benchmark* repo: directory structure, components, tasks, and acceptance criteria. You can drop this straight into a ticketing system as epics → stories.

---

## 0. Tech assumptions (adjust if needed)

To be specific, I’ll assume:

* **Repo**: `stellaops-reachability-benchmark`
* **Harness language**: Python 3.11+
* **Packaging**: Docker image for the harness
* **Schemas**: JSON Schema (Draft 2020-12)
* **CI**: GitHub Actions

If your stack differs, you can still reuse the structure and acceptance criteria.

---

## 1. Repo skeleton & project bootstrap

**Goal:** Create a minimal but fully wired repo.

### Tasks

1. **Create skeleton**

   * Structure:

```text
stellaops-reachability-benchmark/
  dataset/
    dataset.json
    sbom/
    vex/
    graphs/
    truth/
    packages/
    manifest.lock.json      # initially a stub
  harness/
    reachbench/
      __init__.py
      cli.py
      dataset_loader.py
      schemas/
        graph.schema.json
        truth.schema.json
        dataset.schema.json
        scores.schema.json
    tests/
  docs/
    HOWTO.md
    SCHEMA.md
    REPRODUCIBILITY.md
    LIMITATIONS.md
    SANITIZATION.md
  .github/
    workflows/
      ci.yml
  pyproject.toml
  README.md
  LICENSE
  Dockerfile
```

2. **Bootstrap Python project**

   * `pyproject.toml` with:

     * the `reachbench` package
     * deps: `jsonschema`, `click` or `typer`, `pyyaml`, `pytest`
   * `harness/tests/` with a dummy test to ensure CI is green.

3. **Dockerfile**

   * Minimal, with pinned versions:

```Dockerfile
# For stricter reproducibility, pin this base image by digest in release builds.
FROM python:3.11-slim

WORKDIR /app
COPY . .
RUN pip install --no-cache-dir .

ENTRYPOINT ["reachbench"]
```

4. **Basic CI pipeline (`.github/workflows/ci.yml`)**

   * Jobs:

     * `lint` (e.g., `ruff` or `flake8`, your choice)
     * `test` (pytest)
     * `build-docker` (just to ensure the Dockerfile stays valid)

### Acceptance criteria

* `pip install .` works locally.
* `reachbench --help` prints CLI help (even if commands are stubs).
* CI passes on the main branch.

---

## 2. Dataset & schema definitions

**Goal:** Define all JSON formats and enforce them.

### 2.1 Define the dataset index format (`dataset/dataset.json`)

**File:** `dataset/dataset.json`

**Example:**

```jsonc
{
  "version": "0.1.0",
  "cases": [
    {
      "id": "php-wordpress-5.8-cve-2023-12345",
      "language": "php",
      "kind": "source", // "source" | "binary" | "container"
      "cves": ["CVE-2023-12345"],
      "artifacts": {
        "package": {
          "path": "packages/php/wordpress-5.8.tar.gz",
          "sha256": "…"
        },
        "sbom": {
          "path": "sbom/php/wordpress-5.8.cdx.json",
          "format": "cyclonedx-1.6",
          "sha256": "…"
        },
        "vex": {
          "path": "vex/php/wordpress-5.8.vex.json",
          "format": "csaf-2.0",
          "sha256": "…"
        },
        "graph": {
          "path": "graphs/php/wordpress-5.8.graph.json",
          "schema": "graph.schema.json",
          "sha256": "…"
        },
        "truth": {
          "path": "truth/php/wordpress-5.8.truth.json",
          "schema": "truth.schema.json",
          "sha256": "…"
        }
      }
    }
  ]
}
```

### 2.2 Define the **truth schema** (`harness/reachbench/schemas/truth.schema.json`)

**Model (conceptual):**

```jsonc
{
  "case_id": "php-wordpress-5.8-cve-2023-12345",
  "vulnerable_components": [
    {
      "cve": "CVE-2023-12345",
      "symbol": "wp_ajax_nopriv_some_vuln",
      "symbol_kind": "function",    // "function" | "method" | "binary_symbol"
      "status": "reachable",        // "reachable" | "not_reachable"
      "reachable_from": [
        {
          "entrypoint_id": "web:GET:/foo",
          "notes": "HTTP route /foo"
        }
      ],
      "evidence": "manual-analysis" // or "unit-test", "patch-oracle"
    }
  ],
  "non_vulnerable_components": [
    {
      "symbol": "wp_safe_function",
      "symbol_kind": "function",
      "status": "not_reachable",
      "evidence": "manual-analysis"
    }
  ]
}
```

**Tasks**

* Implement a JSON Schema capturing:

  * required fields: `case_id`, `vulnerable_components`
  * allowed enums for `symbol_kind`, `status`, `evidence`
* Add unit tests (sketched below) that:

  * validate a known-good truth file
  * fail on various broken ones (missing `case_id`, unknown `status`, etc.)

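A sketch of those unit tests, assuming `pytest` and the `jsonschema` package (the schema path matches the layout above):

```python
import json
from pathlib import Path

import jsonschema
import pytest

SCHEMA = json.loads(Path("harness/reachbench/schemas/truth.schema.json").read_text())


def test_valid_truth_file_passes():
    truth = {"case_id": "php-wordpress-5.8-cve-2023-12345", "vulnerable_components": []}
    jsonschema.validate(truth, SCHEMA)  # raises ValidationError on failure


def test_missing_case_id_is_rejected():
    with pytest.raises(jsonschema.ValidationError):
        jsonschema.validate({"vulnerable_components": []}, SCHEMA)


def test_unknown_status_is_rejected():
    bad = {
        "case_id": "x",
        "vulnerable_components": [
            {"symbol": "f", "symbol_kind": "function", "status": "maybe"}
        ],
    }
    with pytest.raises(jsonschema.ValidationError):
        jsonschema.validate(bad, SCHEMA)
```
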
### 2.3 Define the **graph schema** (`harness/reachbench/schemas/graph.schema.json`)

**Model (conceptual):**

```jsonc
{
  "case_id": "php-wordpress-5.8-cve-2023-12345",
  "language": "php",
  "nodes": [
    {
      "id": "func:wp_ajax_nopriv_some_vuln",
      "symbol": "wp_ajax_nopriv_some_vuln",
      "kind": "function",
      "purl": "pkg:composer/wordpress/wordpress@5.8"
    }
  ],
  "edges": [
    {
      "from": "func:wp_ajax_nopriv_some_vuln",
      "to": "func:wpdb_query",
      "kind": "call"
    }
  ],
  "entrypoints": [
    {
      "id": "web:GET:/foo",
      "symbol": "some_controller",
      "kind": "http_route"
    }
  ]
}
```

**Tasks**

* A JSON Schema with:

  * `nodes[]` (id, symbol, kind, optional purl)
  * `edges[]` (`from`, `to`, `kind`)
  * `entrypoints[]` (id, symbol, kind)
* Tests: verify that a valid graph passes and that invalid ones (missing `id`, unknown `kind`) are rejected.

### 2.4 Dataset index schema (`dataset.schema.json`)

* A JSON Schema describing `dataset.json` (version string, cases array).
* Tests: validate the example dataset file.

### Acceptance criteria

* Running a simple script (this will become `reachbench validate-dataset`) validates all JSON files in `dataset/` against the schemas without errors.
* CI fails if any dataset JSON is invalid.

---

## 3. Lockfile & determinism manifest

**Goal:** Implement `manifest.lock.json` generation and verification.

### 3.1 Lockfile structure

**File:** `dataset/manifest.lock.json`

**Example:**

```jsonc
{
  "version": "0.1.0",
  "created_at": "2025-01-15T12:00:00Z",
  "dataset": {
    "root": "dataset/",
    "sha256": "…",
    "cases": {
      "php-wordpress-5.8-cve-2023-12345": {
        "sha256": "…"
      }
    }
  },
  "tools": {
    "graph_normalizer": {
      "name": "stellaops-graph-normalizer",
      "version": "1.2.3",
      "sha256": "…"
    }
  },
  "containers": {
    "scanner_image": "ghcr.io/stellaops/scanner@sha256:…",
    "normalizer_image": "ghcr.io/stellaops/normalizer@sha256:…"
  },
  "signatures": [
    {
      "type": "dsse",
      "key_id": "stellaops-benchmark-key-1",
      "signature": "base64-encoded-blob"
    }
  ]
}
```

*(Signatures can be optional in v1, but the structure should be there.)*

### 3.2 `lockfile.py` module

**File:** `harness/reachbench/lockfile.py`

**Responsibilities**

* Compute deterministic SHA-256 digests of:

  * each case’s artifacts (path → hash from `dataset.json`)
  * the entire `dataset/` tree (sorted traversal)
* Generate a new `manifest.lock.json`:

  * `version` (hard-coded constant)
  * `created_at` (UTC ISO 8601; pin or exclude this field when comparing runs, or the lockfile won’t be byte-for-byte reproducible)
  * `dataset` section with case hashes
* Verification:

  * `verify_lockfile(dataset_root, lockfile_path)`:

    * recompute hashes
    * compare to `lockfile.dataset`
    * return a boolean + a list of mismatches

**Tasks**

1. Implement canonical hashing (sketched below):

   * For text JSON files, normalize with:

     * sorted keys
     * no whitespace
     * UTF‑8 encoding
   * For binaries (packages): raw bytes.
2. Implement `compute_dataset_hashes(dataset_root)`:

   * Returns `{"cases": {...}, "root_sha256": "…"}`.
3. Implement `write_lockfile(...)` and `verify_lockfile(...)`.
4. Tests:

   * Two calls with the same dataset produce an identical lockfile (order of `cases` keys normalized).
   * Changing any artifact file changes the root hash and causes verification to fail.

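A minimal sketch of the canonical hashing from task 1 (`compute_dataset_hashes` matches the responsibility above; the traversal details are illustrative):

```python
import hashlib
import json
from pathlib import Path


def canonical_bytes(path: Path) -> bytes:
    """JSON is re-serialized (sorted keys, no whitespace, UTF-8); other files hash as raw bytes."""
    if path.suffix == ".json":
        obj = json.loads(path.read_text(encoding="utf-8"))
        return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return path.read_bytes()


def compute_dataset_hashes(dataset_root: Path) -> dict:
    """Sorted traversal keeps the root digest stable across filesystems and OSes."""
    root = hashlib.sha256()
    files: dict[str, str] = {}
    for p in sorted(dataset_root.rglob("*")):
        if p.is_file():
            digest = hashlib.sha256(canonical_bytes(p)).hexdigest()
            files[p.relative_to(dataset_root).as_posix()] = digest
            root.update(digest.encode("ascii"))
    return {"cases": files, "root_sha256": root.hexdigest()}
```
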
### 3.3 CLI commands

Add to `cli.py` (a `typer`-based skeleton is sketched below):

* `reachbench compute-lockfile --dataset-root ./dataset --out ./dataset/manifest.lock.json`
* `reachbench verify-lockfile --dataset-root ./dataset --lockfile ./dataset/manifest.lock.json`

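One way to wire those commands, using `typer` (one of the CLI dependencies listed in section 1); the `lockfile` module calls are the ones specified in 3.2, but the wiring itself is illustrative:

```python
from pathlib import Path

import typer

from reachbench import lockfile

app = typer.Typer(help="StellaOps reachability benchmark harness")


@app.command("compute-lockfile")
def compute_lockfile(
    dataset_root: Path = typer.Option(..., "--dataset-root"),
    out: Path = typer.Option(..., "--out"),
) -> None:
    lockfile.write_lockfile(dataset_root, out)


@app.command("verify-lockfile")
def verify_lockfile(
    dataset_root: Path = typer.Option(..., "--dataset-root"),
    lockfile_path: Path = typer.Option(..., "--lockfile"),
) -> None:
    ok, mismatches = lockfile.verify_lockfile(dataset_root, lockfile_path)
    if not ok:
        for line in mismatches:  # human-readable diff, one mismatch per line
            typer.echo(line, err=True)
        raise typer.Exit(code=1)  # non-zero exit on mismatch


if __name__ == "__main__":
    app()
```
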
### Acceptance criteria

* `reachbench compute-lockfile` generates a stable file (byte-for-byte identical across runs, given a pinned `created_at`).
* `reachbench verify-lockfile` exits with:

  * code 0 on a match
  * non-zero on a mismatch (plus a human-readable diff)

---

## 4. Scoring harness CLI

**Goal:** Deterministically score participant results against ground truth.

### 4.1 Result format (participant output)

**Expectation:** participants provide a `results/` directory with one JSON file per case:

```text
results/
  php-wordpress-5.8-cve-2023-12345.json
  js-express-4.17-cve-2022-9999.json
```

**Result file example:**

```jsonc
{
  "case_id": "php-wordpress-5.8-cve-2023-12345",
  "tool_name": "my-reachability-analyzer",
  "tool_version": "1.0.0",
  "predictions": [
    {
      "cve": "CVE-2023-12345",
      "symbol": "wp_ajax_nopriv_some_vuln",
      "symbol_kind": "function",
      "status": "reachable"
    },
    {
      "cve": "CVE-2023-12345",
      "symbol": "wp_safe_function",
      "symbol_kind": "function",
      "status": "not_reachable"
    }
  ]
}
```

### 4.2 Scoring model

* Treat scoring as classification over `(cve, symbol)` pairs.
* For each case:

  * Truth positives: all `vulnerable_components` with `status == "reachable"`.
  * Truth negatives: everything marked `not_reachable` (optional in v1).
  * Predictions: all entries with `status == "reachable"`.
* Compute:

  * `TP`: predicted reachable & truth reachable.
  * `FP`: predicted reachable but truth says not reachable / unknown.
  * `FN`: truth reachable but not predicted reachable.
* Metrics:

  * Precision, Recall, F1 per case.
  * Macro-averaged metrics across all cases.

### 4.3 Implementation (`scoring.py`)

**File:** `harness/reachbench/scoring.py`

**Functions:**

* `load_truth(case_truth_path) -> TruthModel`
* `load_predictions(predictions_path) -> PredictionModel`
* `compute_case_metrics(truth, preds) -> dict` (sketched below), returning:

```python
{
    "case_id": str,
    "tp": int,
    "fp": int,
    "fn": int,
    "precision": float,
    "recall": float,
    "f1": float
}
```

* `aggregate_metrics(case_metrics_list) -> dict` with `macro_precision`, `macro_recall`, `macro_f1`, `num_cases`.

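A sketch of `compute_case_metrics`, treating truth and predictions as the raw JSON shapes shown earlier (a real implementation would go through `TruthModel`/`PredictionModel`):

```python
def compute_case_metrics(truth: dict, preds: dict) -> dict:
    """Score one case as classification over (cve, symbol) pairs."""
    truth_reachable = {
        (c.get("cve"), c["symbol"])
        for c in truth["vulnerable_components"]
        if c["status"] == "reachable"
    }
    predicted = {
        (p.get("cve"), p["symbol"])
        for p in preds.get("predictions", [])
        if p["status"] == "reachable"
    }
    tp = len(truth_reachable & predicted)
    fp = len(predicted - truth_reachable)
    fn = len(truth_reachable - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "case_id": truth["case_id"],
        "tp": tp, "fp": fp, "fn": fn,
        "precision": precision, "recall": recall, "f1": f1,
    }
```
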
### 4.4 CLI: `score`

**Signature:**

```bash
reachbench score \
  --dataset-root ./dataset \
  --results-root ./results \
  --lockfile ./dataset/manifest.lock.json \
  --out ./out/scores.json \
  [--cases 'php-*'] \
  [--repeat 3]
```

**Behavior:**

1. **Verify the lockfile** (fail closed on mismatch).

2. Load `dataset.json`; filter cases if `--cases` is set (glob).

3. For each case:

   * Load the truth file (and validate it against the schema).
   * Locate the results file (`<case_id>.json`) under `--results-root`:

     * If missing, treat every truth positive as an FN (or mark the case as “no submission”).
   * Load and validate predictions (include a JSON Schema: `results.schema.json`).
   * Compute per-case metrics.

4. Aggregate metrics.

5. Write `scores.json`:

```jsonc
{
  "version": "0.1.0",
  "dataset_version": "0.1.0",
  "generated_at": "2025-01-15T12:34:56Z",
  "macro_precision": 0.92,
  "macro_recall": 0.88,
  "macro_f1": 0.90,
  "cases": [
    {
      "case_id": "php-wordpress-5.8-cve-2023-12345",
      "tp": 10,
      "fp": 1,
      "fn": 2,
      "precision": 0.91,
      "recall": 0.83,
      "f1": 0.87
    }
  ]
}
```

6. **Determinism check**:

   * If `--repeat N` is given:

     * Re-run scoring in memory N times.
     * Compare the resulting JSON strings (canonicalized via sorted keys, with volatile fields such as `generated_at` excluded from the comparison).
     * If any differ, exit non-zero with the message “non-deterministic scoring detected”.

### 4.5 Offline-only mode

* In `cli.py`, an early check (a minimal sketch of the v1 policy):

```python
import os

if os.getenv("REACHBENCH_OFFLINE_ONLY", "1") == "1":
    # Policy: the v1 harness never imports or calls any networking libraries;
    # this flag exists so future commands can fail closed if that ever changes.
    pass
```

* Document that the harness must never reach out to the internet.

### Acceptance criteria

* Given a small artificial dataset with 2–3 cases and handcrafted results, `reachbench score` produces the expected metrics (asserted via tests).
* Running `reachbench score --repeat 3` produces identical `scores.json` output across runs.
* Missing results files are handled gracefully (and the behavior is clearly documented).

---

## 5. Baseline implementations

**Goal:** Provide in-repo baselines that use only the provided graphs (no extra tooling).

### 5.1 Baseline types

1. **Naïve reachable**: all symbols in the vulnerable package are considered reachable.
2. **Imports-only**: reachable = any symbol that:

   * appears in the graph, AND
   * is reachable from any entrypoint by a single edge or a name match.
3. **Call-depth-2** (a traversal sketch follows this list):

   * From each entrypoint, traverse up to depth 2 along `call` edges.
   * Anything at depth ≤ 2 is considered reachable.

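A bounded-traversal sketch for the call-depth-2 baseline, using the graph shape from section 2.3 (resolving entrypoints to graph nodes by matching `symbol` is an assumption; the real mapping may differ):

```python
from collections import deque


def reachable_within_depth(graph: dict, max_depth: int = 2) -> set[str]:
    """Breadth-first traversal over `call` edges, bounded at max_depth hops."""
    adjacency: dict[str, list[str]] = {}
    for edge in graph["edges"]:
        if edge["kind"] == "call":
            adjacency.setdefault(edge["from"], []).append(edge["to"])
    # Assumption: entrypoints resolve to nodes via the shared `symbol` field.
    by_symbol = {node["symbol"]: node["id"] for node in graph["nodes"]}
    start = [by_symbol[ep["symbol"]] for ep in graph["entrypoints"] if ep["symbol"] in by_symbol]
    seen = set(start)
    queue = deque((node_id, 0) for node_id in start)
    while queue:
        node_id, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in adjacency.get(node_id, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen
```
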
### 5.2 Implementation

**File:** `harness/reachbench/baselines.py`

* `baseline_naive(graph, truth) -> PredictionModel`
* `baseline_imports_only(graph, truth) -> PredictionModel`
* `baseline_call_depth_2(graph, truth) -> PredictionModel`

**CLI:**

```bash
reachbench run-baseline \
  --dataset-root ./dataset \
  --baseline naive|imports|depth2 \
  --out ./results-baseline-<baseline>/
```

Behavior:

* For each case:

  * Load the graph.
  * Generate predictions for the chosen baseline.
  * Write the result file `results-baseline-<baseline>/<case_id>.json`.

### 5.3 Tests

* Tiny synthetic dataset in `harness/tests/data/`:

  * 1–2 cases with simple graphs.
  * Known expectations for each baseline (TP/FP/FN counts).

### Acceptance criteria

* `reachbench run-baseline --baseline naive` runs end-to-end and outputs results files.
* `reachbench score` on baseline results produces stable scores.
* Tests validate baseline behavior on the synthetic cases.

---

## 6. Dataset validation & tooling

**Goal:** One command to validate everything (schemas, hashes, internal consistency).

### CLI: `validate-dataset`

```bash
reachbench validate-dataset \
  --dataset-root ./dataset \
  [--lockfile ./dataset/manifest.lock.json]
```

**Checks:**

1. `dataset.json` conforms to `dataset.schema.json`.
2. For each case:

   * all artifact paths exist
   * the `graph` file passes `graph.schema.json`
   * the `truth` file passes `truth.schema.json`
3. Optional: verify the lockfile if one is provided.

**Implementation:**

* `dataset_loader.py`:

  * `load_dataset_index(path) -> DatasetIndex`
  * `iter_cases(dataset_index)` yields case objects.
  * `validate_case(case, dataset_root) -> list[str]` (a list of error messages; sketched below).

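A `validate_case` sketch matching the signature above (error-message wording is illustrative; schema checks would layer on top):

```python
from pathlib import Path


def validate_case(case: dict, dataset_root: Path) -> list[str]:
    """Return one error string per problem; an empty list means the case is clean."""
    errors: list[str] = []
    for name, artifact in case.get("artifacts", {}).items():
        path = dataset_root / artifact["path"]
        if not path.exists():
            errors.append(f"{case['id']}: missing {name} artifact at {artifact['path']}")
    return errors
```
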
**Acceptance criteria**

* Broken paths or invalid JSON produce a clear error message and a non-zero exit code.
* A CI job calls `reachbench validate-dataset` on every push.

---

## 7. Documentation

**Goal:** Make it trivial for outsiders to use the benchmark.

### 7.1 `README.md`

* Overview:

  * What the benchmark is.
  * What it measures (reachability precision/recall).
* Quickstart:

```bash
git clone ...
cd stellaops-reachability-benchmark

# Validate dataset
reachbench validate-dataset --dataset-root ./dataset

# Run baselines
reachbench run-baseline --baseline naive --dataset-root ./dataset --out ./results-naive

# Score baselines
reachbench score --dataset-root ./dataset --results-root ./results-naive --out ./out/naive-scores.json
```

### 7.2 `docs/HOWTO.md`

* Step-by-step:

  * Installing the harness.
  * Running your own tool on the dataset.
  * Formatting your `results/`.
  * Running `reachbench score`.
  * Interpreting `scores.json`.

### 7.3 `docs/SCHEMA.md`

* Human-readable descriptions of:

  * `graph` JSON
  * `truth` JSON
  * `results` JSON
  * `scores` JSON
* Links to the actual JSON Schemas.

### 7.4 `docs/REPRODUCIBILITY.md`

* Explains:

  * the lockfile design
  * the hashing rules
  * deterministic scoring and the `--repeat` flag
  * how to verify you’re using the exact same dataset.

### 7.5 `docs/SANITIZATION.md`

* Rules for adding new cases:

  * Only use OSS or properly licensed code.
  * Strip secrets, proprietary paths, and user data.
  * How to confirm nothing sensitive is in package tarballs.

### Acceptance criteria

* A new engineer (or external user) can go from zero to “I ran the baseline and got scores” by following the docs alone.
* All example commands work as written.

---

## 8. CI/CD details

**Goal:** Keep the repo healthy and ensure determinism.

### CI jobs (GitHub Actions)

1. **`lint`**

   * Run `ruff` / `flake8` (your choice).
2. **`test`**

   * Run `pytest`.
3. **`validate-dataset`**

   * Run `reachbench validate-dataset --dataset-root ./dataset`.
4. **`determinism`**

   * A small workflow step:

     * Run `reachbench score` on a tiny test dataset with `--repeat 3`.
     * Assert success.
5. **`docker-build`**

   * `docker build` the harness image.

### Acceptance criteria

* All jobs green on main.
* PRs show a failing status if schemas or determinism break.

---

## 9. Rough “epics → stories” breakdown

You can paste this into Jira/Linear roughly as follows:

1. **Epic: Repo bootstrap & CI**

   * Story: Create repo skeleton & Python project
   * Story: Add Dockerfile & basic CI (lint + tests)

2. **Epic: Schemas & dataset plumbing**

   * Story: Implement `truth.schema.json` + tests
   * Story: Implement `graph.schema.json` + tests
   * Story: Implement `dataset.schema.json` + tests
   * Story: Implement the `validate-dataset` CLI

3. **Epic: Lockfile & determinism**

   * Story: Implement lockfile computation + verification
   * Story: Add the `compute-lockfile` & `verify-lockfile` CLI commands
   * Story: Add determinism checks in CI

4. **Epic: Scoring harness**

   * Story: Define the results format + `results.schema.json`
   * Story: Implement scoring logic (`scoring.py`)
   * Story: Implement the `score` CLI with `--repeat`
   * Story: Add unit tests for metrics

5. **Epic: Baselines**

   * Story: Implement the naive baseline
   * Story: Implement the imports-only baseline
   * Story: Implement the depth-2 baseline
   * Story: Add the `run-baseline` CLI + tests

6. **Epic: Documentation & polish**

   * Story: Write README + HOWTO
   * Story: Write the SCHEMA / REPRODUCIBILITY / SANITIZATION docs
   * Story: Final repo cleanup & examples

---

If you tell me your preferred language and CI, I can also rewrite this into exact tickets and even starter code for `cli.py` and a couple of schemas.