Here’s a crisp, practical way to turn Stella Ops’ “verifiable proof spine” into a moat, and how to measure it.

# Why this matters (in plain terms)

Security tools often say “trust me.” You’ll say “prove it”: every finding and every “not-affected” claim ships with cryptographic receipts anyone can verify.

---

# Differentiators to build in

**1) Bind every verdict to a graph hash**

* Compute a stable **Graph Revision ID** (Merkle root) over: SBOM nodes, edges, policies, feeds, scan params, and tool versions.
* Store the ID on each finding/VEX item; show it in the UI and APIs.
* Rule: any data change → new graph hash → new revisioned verdicts.

**2) Attach machine-verifiable receipts (in-toto/DSSE)**

* For each verdict, emit a **DSSE-wrapped in-toto statement**:
  * predicateType: `stellaops.dev/verdict@v1`
  * includes: graphRevisionId, artifact digests, rule id/version, inputs (CPE/CVE/CVSS), timestamps.
* Sign with your **Authority** (Sigstore key, offline mode supported).
* Keep receipts queryable and exportable; mirror to a Rekor-compatible ledger when online.

**3) Add reachability “call-stack slices” or binary-symbol proofs**

* For code-level reachability, store compact slices: entry → sink, with symbol names + file:line.
* For binary-only targets, include **symbol presence proofs** (e.g., Bloom filters + offsets) with the executable digest.
* Compress and embed a hash of the slice/proof inside the DSSE payload.

**4) Deterministic replay manifests**

* Alongside receipts, publish a **Replay Manifest** (inputs, feeds, rule versions, container digests) so any auditor can reproduce the same graph hash and verdicts offline.

---

# Benchmarks to publish (make them your headline KPIs)

**A) False-positive reduction vs. baseline scanners (%)**

* Method: run a public corpus (e.g., sample images + app stacks) across 3–4 popular scanners; label ground truth once; compare FP rates.
* Report: mean & p95 FP reduction.

**B) Proof coverage (% of findings with signed evidence)**

* Definition: `(# findings or VEX items carrying valid DSSE receipts) / (total surfaced items)`.
* Break out: runtime-reachable vs. unreachable, and “not-affected” claims.

**C) Triage time saved (p50/p95)**

* Measure analyst minutes from “alert created” → “final disposition.”
* A/B with receipts hidden vs. visible; publish median/p95 deltas.

**D) Determinism stability**

* Re-run identical scans N times / across nodes; publish `% identical graph hashes` and drift causes when different.

---

# Minimal implementation plan (week-by-week)

**Week 1: primitives**

* Add a Graph Revision ID generator in `scanner.webservice` (Merkle over normalized JSON of SBOM + edges + policies + toolVersions); a sketch follows this plan.
* Define a `VerdictReceipt` schema (protobuf/JSON) and DSSE envelope types.

**Week 2: signing + storage**

* Wire DSSE signing in **Authority**; offline key support + rotation.
* Persist receipts in a `Receipts` table (Postgres) keyed by `(graphRevisionId, verdictId)`; enable export (JSONL) and ledger mirroring.

**Week 3: reachability proofs**

* Add call-stack slice capture in the reachability engine; serialize compactly; hash + reference from receipts.
* Binary symbol proof module for ELF/PE: symbol bitmap + digest.

**Week 4: replay + UX**

* Emit `replay.manifest.json` per scan (inputs, tool digests).
* UI: show a **“Verified”** badge, graph hash, signature issuer, and a one-click “Copy receipt” button.
* API: `GET /verdicts/{id}/receipt`, `GET /graphs/{rev}/replay`.

**Week 5: benchmarks harness**

* Create `bench/` with golden fixtures and a runner:
  * Baseline scanner adapters
  * Ground-truth labels
  * Metrics export (FP%, proof coverage, triage time capture hooks)
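To make the Week 1 primitive concrete, here is a minimal sketch of the Graph Revision ID computation: a Merkle root over per-section hashes of the normalized inputs. Names like `GraphRevision` and `ComputeGraphRevisionId` are placeholders, and a production version would need a real canonical-JSON encoding (sorted keys, fixed number formatting) instead of the plain serializer used here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

// Sketch only: the Graph Revision ID as a Merkle root over the normalized input sections
// (SBOM nodes, edges, policies, feeds, scan params, tool versions).
public static class GraphRevision
{
    // Leaf hash: SHA-256 over a section label plus a JSON rendering of that section's content.
    private static byte[] LeafHash(string sectionName, object payload)
    {
        string encoded = sectionName + ":" + JsonSerializer.Serialize(payload);
        return SHA256.HashData(Encoding.UTF8.GetBytes(encoded));
    }

    // Classic Merkle fold: pair up hashes level by level, carrying the last one when the count is odd.
    private static byte[] MerkleRoot(List<byte[]> level)
    {
        if (level.Count == 0) return SHA256.HashData(Array.Empty<byte>());
        while (level.Count > 1)
        {
            var next = new List<byte[]>();
            for (int i = 0; i < level.Count; i += 2)
            {
                byte[] left = level[i];
                byte[] right = i + 1 < level.Count ? level[i + 1] : level[i];
                next.Add(SHA256.HashData(left.Concat(right).ToArray()));
            }
            level = next;
        }
        return level[0];
    }

    // Sections are sorted by name so the same inputs always produce the same ID.
    public static string ComputeGraphRevisionId(IReadOnlyDictionary<string, object> sections)
    {
        var leaves = sections
            .OrderBy(kv => kv.Key, StringComparer.Ordinal)
            .Select(kv => LeafHash(kv.Key, kv.Value))
            .ToList();
        return Convert.ToHexString(MerkleRoot(leaves)).ToLowerInvariant();
    }
}
```

Any change to any section (a new SBOM node, a bumped rule version, a different feed snapshot) changes its leaf hash and therefore the root, which is exactly the “any data change → new graph hash” rule above.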
---

# Developer guardrails (make these non-negotiable)

* **No receipt, no ship:** any surfaced verdict must carry a DSSE receipt.
* **Schema freeze windows:** changes to rule inputs or policy logic must bump the rule version and therefore the graph hash.
* **Replay-first CI:** PRs touching scanning/rules must pass a replay test that reproduces prior graph hashes on golden fixtures.
* **Clock safety:** use monotonic time inside receipts; add UTC wall-time separately.

---

# What to show buyers/auditors

* A short **audit kit**: sample container + your receipts + replay manifest + one command to reproduce the same graph hash.
* A one-page **benchmark readout**: FP reduction, proof coverage, and triage time saved (p50/p95), with a corpus description.

---

If you want, I’ll draft:

1. the DSSE `predicate` schema,
2. the Postgres DDL for `Receipts` and `Graphs`, and
3. a tiny .NET verification CLI (`stellaops-verify`) that replays a manifest and validates signatures.

Here’s a focused “developer guidelines” doc just for **Benchmarks for a Testable Security Moat** in Stella Ops.

---

# Stella Ops Developer Guidelines

## Benchmarks for a Testable Security Moat

> **Goal:** Benchmarks are how we *prove* Stella Ops is better, not just say it is. If a “moat” claim can’t be tied to a benchmark, it doesn’t exist. Everything here is about how you, as a developer, design, extend, and run those benchmarks.

---

## 1. What our benchmarks must measure

Every core product claim needs at least one benchmark:

1. **Detection quality**
   * Precision / recall vs ground truth.
   * False positives vs popular scanners.
   * False negatives on known-bad samples.
2. **Proof & evidence quality**
   * % of findings with **valid receipts** (DSSE).
   * % of VEX “not-affected” claims with attached proofs.
   * Reachability proof quality:
     * call-stack slice present?
     * symbol proof present for binaries?
3. **Triage & workflow impact**
   * Time-to-decision for analysts (p50/p95).
   * Click depth and context switches per decision.
   * “Verified” vs “unverified” verdict triage times.
4. **Determinism & reproducibility**
   * Same inputs → same **Graph Revision ID**.
   * Stable verdict sets across runs/nodes.

> **Rule:** If you add a feature that impacts any of these, you must either hook it into an existing benchmark or add a new one.

---

## 2. Benchmark assets and layout

**2.1 Repo layout (convention)**

Under `bench/` we maintain everything benchmark-related:

* `bench/corpus/`
  * `images/` – curated container images / tarballs.
  * `repos/` – sample codebases (with known vulns).
  * `sboms/` – canned SBOMs for edge cases.
* `bench/scenarios/`
  * `*.yaml` – scenario definitions (inputs + expected outputs).
* `bench/golden/`
  * `*.json` – golden results (expected findings, metrics).
* `bench/tools/`
  * adapters for baseline scanners, parsers, helpers.
* `bench/scripts/`
  * `run_benchmarks.[sh/cs]` – single entrypoint.

**2.2 Scenario definition (high-level)**

Each scenario YAML should minimally specify:

* **Inputs**
  * artifact references (image name / path / repo SHA / SBOM file).
  * environment knobs (features enabled/disabled).
* **Ground truth**
  * list of expected vulns (or explicit “none”).
  * for some: expected reachability (reachable/unreachable).
  * expected VEX entries (affected / not affected).
* **Expectations**
  * required metrics (e.g., “no more than 2 FPs”, “no FNs”).
  * required proof coverage (e.g., “100% of surfaced findings have receipts”).

---

## 3. Core benchmark metrics (developer-facing definitions)

Use these consistently across code and docs; a small computation sketch follows this section.

### 3.1 Detection metrics

* `true_positive_count` (TP)
* `false_positive_count` (FP)
* `false_negative_count` (FN)

Derived:

* `precision = TP / (TP + FP)`
* `recall = TP / (TP + FN)`
* For UX: track **FP per asset** and **FP per 100 findings**.

**Developer guideline:**

* When you introduce a filter, deduper, or rule tweak, add or modify scenarios so that:
  * one scenario shows the change **helps** (reduces FP or FN); and
  * another guards against regressions.

### 3.2 Moat-specific metrics

These are the ones that directly support the “testable moat” story:

1. **False-positive reduction vs baseline scanners**
   * Run baseline scanners across our corpus (via adapters in `bench/tools`).
   * Compute:
     * `baseline_fp_rate`
     * `stella_fp_rate`
     * `fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate`
2. **Proof coverage**
   * `proof_coverage_all = findings_with_valid_receipts / total_findings`
   * `proof_coverage_vex = vex_items_with_valid_receipts / total_vex_items`
   * `proof_coverage_reachable = reachable_findings_with_proofs / total_reachable_findings`
3. **Triage time improvement**
   * In test harnesses, simulate or record:
     * `time_to_triage_with_receipts`
     * `time_to_triage_without_receipts`
   * Compute median & p95 deltas.
4. **Determinism**
   * Re-run the same scenario `N` times:
     * `% runs with identical Graph Revision ID`
     * `% runs with identical verdict sets`
   * On mismatch, diff and log the cause (e.g., non-stable sort, non-pinned feed).
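To keep these definitions unambiguous when they show up in harness code, here is a minimal, illustrative C# sketch. The type and method names are placeholders (not the actual `bench/` implementation), and the zero-denominator conventions are assumptions made only so the sketch is total.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: names and edge-case conventions are assumptions, not the real harness API.
public record DetectionCounts(int TruePositives, int FalsePositives, int FalseNegatives);

public static class MoatMetrics
{
    // precision = TP / (TP + FP); returns 1.0 when nothing was surfaced (assumed convention).
    public static double Precision(DetectionCounts c) =>
        c.TruePositives + c.FalsePositives == 0
            ? 1.0
            : (double)c.TruePositives / (c.TruePositives + c.FalsePositives);

    // recall = TP / (TP + FN); returns 1.0 when there was nothing to find (assumed convention).
    public static double Recall(DetectionCounts c) =>
        c.TruePositives + c.FalseNegatives == 0
            ? 1.0
            : (double)c.TruePositives / (c.TruePositives + c.FalseNegatives);

    // fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate
    public static double FpReduction(double baselineFpRate, double stellaFpRate) =>
        baselineFpRate == 0 ? 0.0 : (baselineFpRate - stellaFpRate) / baselineFpRate;

    // proof_coverage = items_with_valid_receipts / total_items
    public static double ProofCoverage(int itemsWithValidReceipts, int totalItems) =>
        totalItems == 0 ? 1.0 : (double)itemsWithValidReceipts / totalItems;

    // Determinism: share of runs whose Graph Revision ID matches the first run's.
    public static double IdenticalGraphHashRate(IReadOnlyList<string> graphRevisionIds) =>
        graphRevisionIds.Count == 0
            ? 1.0
            : graphRevisionIds.Count(id => id == graphRevisionIds[0]) / (double)graphRevisionIds.Count;
}
```

If the real harness already has equivalents, prefer those; the point is that each published KPI maps to one small, reviewable function.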
---

## 4. How developers should work with benchmarks

### 4.1 “No feature without benchmarks”

If you’re adding or changing:

* graph structure,
* rule logic,
* scanner integration,
* VEX handling,
* proof / receipt generation,

you **must** do *at least one* of:

1. **Extend an existing scenario**
   * Add expectations that cover your change, or
   * tighten an existing bound (e.g., a lower FP threshold).
2. **Add a new scenario**
   * For new attack classes / edge cases / ecosystems.

**Anti-patterns:**

* Shipping a new capability with *no* corresponding scenario.
* Updating golden outputs without explaining why metrics changed.

### 4.2 CI gates

We treat benchmarks as **blocking**:

* Add a CI job, e.g.:
  * `make bench:quick` on every PR (small subset).
  * `make bench:full` on main / nightly.
* CI fails if:
  * Any scenario marked `strict: true` has:
    * precision or recall below its threshold, or
    * proof coverage below its configured threshold.
  * Global regressions exceed tolerance:
    * e.g., total FPs increase > X% without an explicit override.

**Developer rule:**

* If you intentionally change behavior:
  * Update the relevant golden files.
  * Include a short note in the PR (e.g., a `bench-notes.md` snippet) describing:
    * what changed,
    * why the new result is better, and
    * which moat metric it improves (FP, proof coverage, determinism, etc.).

---

## 5. Benchmark implementation guidelines

### 5.1 Make benchmarks deterministic

* **Pin everything**:
  * feed snapshots,
  * tool container digests,
  * rule versions,
  * time windows.
* Use **Replay Manifests** as the source of truth:
  * `replay.manifest.json` should contain:
    * input artifacts,
    * tool versions,
    * feed versions,
    * configuration flags.
* If a benchmark depends on time:
  * Inject a **fake clock** or explicit “as of” timestamp (a sketch follows this subsection).
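A minimal sketch of what “pin everything” and the fake clock can look like in harness code, assuming hypothetical `ReplayPins` and `IBenchClock` types; field names are illustrative and the real replay manifest schema remains the source of truth.

```csharp
using System;

// Hypothetical container for pinned inputs read from replay.manifest.json.
// Field names are illustrative; the real manifest schema is the source of truth.
public record ReplayPins(
    string FeedSnapshotId,      // e.g. a dated feed snapshot identifier
    string ToolImageDigest,     // sha256 digest of the scanner container
    string RuleVersion,         // pinned rule pack version
    DateTimeOffset AsOf);       // the "as of" instant the whole run is evaluated against

// Benchmarks never read the wall clock directly; they ask this interface instead.
public interface IBenchClock
{
    DateTimeOffset UtcNow { get; }
}

// Production code can wrap DateTimeOffset.UtcNow; benchmarks use the pinned instant.
public sealed class FixedClock : IBenchClock
{
    private readonly DateTimeOffset _asOf;
    public FixedClock(DateTimeOffset asOf) => _asOf = asOf;
    public DateTimeOffset UtcNow => _asOf;
}

// Usage in a scenario runner (sketch): the clock comes from the pins, never from the host.
// var pins = LoadPins("replay.manifest.json");   // hypothetical loader
// IBenchClock clock = new FixedClock(pins.AsOf);
```

The design point is that any code path which would otherwise call `DateTimeOffset.UtcNow` takes the clock as a dependency, so two runs with the same manifest cannot diverge on time.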
### 5.2 Keep scenarios small but meaningful

* Prefer many **focused** scenarios over a few huge ones.
* Each scenario should clearly answer:
  * “What property of Stella Ops are we testing?”
  * “What moat claim does this support?”

Examples:

* `bench/scenarios/false_pos_kubernetes.yaml`
  * Focus: config noise reduction vs baseline scanner.
* `bench/scenarios/reachability_java_webapp.yaml`
  * Focus: reachable vs unreachable vuln proofs.
* `bench/scenarios/vex_not_affected_openssl.yaml`
  * Focus: VEX correctness and proof coverage.

### 5.3 Use golden outputs, not ad-hoc assertions

* The bench harness should:
  * Run Stella Ops on scenario inputs.
  * Normalize outputs (sorted lists, stable IDs).
  * Compare to `bench/golden/*.json`.
* Golden files should include:
  * expected findings (id, severity, reachable?, etc.),
  * expected VEX entries,
  * expected metrics (precision, recall, coverage).

---

## 6. Moat-critical benchmark types (we must have all of these)

When you’re thinking about gaps, check that we have:

1. **Cross-tool comparison**
   * Same corpus, multiple scanners.
   * Metrics vs baselines for FP/FN.
2. **Proof density & quality**
   * Corpus where:
     * some vulns are reachable,
     * some are not,
     * some are not present.
   * Ensure:
     * reachable ones have rich proofs (stack slices / symbol proofs).
     * non-reachable or absent ones have:
       * correct disposition, and
       * clear receipts explaining why.
3. **VEX accuracy**
   * Scenarios with a known SBOM + known vulnerability impact.
   * Check:
     * VEX “affected”/“not-affected” matches ground truth.
     * every VEX entry has a receipt.
4. **Analyst workflow**
   * Small usability corpus for internal testing:
     * Measure time-to-triage with/without receipts.
     * Use the same scenarios across releases to track improvement.
5. **Upgrade / drift resistance**
   * Scenarios that are **expected to remain stable** across:
     * rule changes that *shouldn’t* affect outcomes.
     * feed updates (within a given version window).
   * These act as canaries for unintended regressions.

---

## 7. Developer checklist (TL;DR)

Before merging a change that touches security logic, ask yourself:

1. **Is there at least one benchmark scenario that exercises this change?**
2. **Does the change improve at least one moat metric, or is it neutral?**
3. **Have I run `make bench:quick` locally and checked diffs?**
4. **If goldens changed, did I explain why in the PR?**
5. **Did I keep benchmarks deterministic (pinned versions, fake time, etc.)?**

If any answer is “no”, fix that before merging.

---

If you’d like, as a next step I can sketch a concrete `bench/scenarios/*.yaml` and matching `bench/golden/*.json` example that encodes one *specific* moat claim (e.g., “30% fewer FPs than Scanner X on Kubernetes configs”) so your team has a ready-to-copy pattern.
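Until that concrete scenario/golden pair exists, here is a rough sketch of the normalize-and-compare step from 5.3, assuming a hypothetical `Finding` shape and a golden file that stores just an array of findings; the real golden files also carry VEX entries and metrics.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

// Hypothetical finding shape; real golden files carry more fields than this.
public record Finding(string Id, string Severity, bool Reachable);

public static class GoldenCompare
{
    private static readonly JsonSerializerOptions Options = new JsonSerializerOptions { WriteIndented = true };

    // Normalize before comparing: stable ordering so output order never causes spurious diffs.
    public static List<Finding> Normalize(IEnumerable<Finding> findings) =>
        findings.OrderBy(f => f.Id, StringComparer.Ordinal).ToList();

    // Returns true when the normalized actual findings match the golden file after canonical
    // re-serialization; otherwise `diff` holds both renderings for the PR note.
    public static bool MatchesGolden(IEnumerable<Finding> actual, string goldenPath, out string diff)
    {
        var expected = JsonSerializer.Deserialize<List<Finding>>(File.ReadAllText(goldenPath))
                       ?? new List<Finding>();

        string expectedJson = JsonSerializer.Serialize(Normalize(expected), Options);
        string actualJson = JsonSerializer.Serialize(Normalize(actual), Options);

        diff = expectedJson == actualJson ? string.Empty : $"expected:\n{expectedJson}\nactual:\n{actualJson}";
        return diff.Length == 0;
    }
}
```

The harness would call `MatchesGolden` per scenario and fail CI with the diff when it is non-empty; intentional changes then show up as an explicit golden-file update plus the bench-notes explanation described in 4.2.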