Here’s a crisp, practical way to turn Stella Ops’ “verifiable proof spine” into a moat, and how to measure it.

# Why this matters (in plain terms)

Security tools often say “trust me.” You’ll say “prove it”: every finding and every “not-affected” claim ships with cryptographic receipts anyone can verify.

---

# Differentiators to build in

**1) Bind every verdict to a graph hash**

* Compute a stable **Graph Revision ID** (Merkle root) over: SBOM nodes, edges, policies, feeds, scan params, and tool versions.
* Store the ID on each finding/VEX item; show it in the UI and APIs.
* Rule: any data change → new graph hash → new revisioned verdicts.

**2) Attach machine-verifiable receipts (in-toto/DSSE)**

* For each verdict, emit a **DSSE-wrapped in-toto statement**:
  * predicateType: `stellaops.dev/verdict@v1`
  * includes: graphRevisionId, artifact digests, rule id/version, inputs (CPE/CVE/CVSS), timestamps.
* Sign with your **Authority** (Sigstore key, offline mode supported).
* Keep receipts queryable and exportable; mirror to a Rekor-compatible ledger when online.

**3) Add reachability “call-stack slices” or binary-symbol proofs**

* For code-level reachability, store compact slices: entry → sink, with symbol names + file:line.
* For binary-only targets, include **symbol presence proofs** (e.g., Bloom filters + offsets) with the executable digest.
* Compress and embed a hash of the slice/proof inside the DSSE payload.

**4) Deterministic replay manifests**

* Alongside receipts, publish a **Replay Manifest** (inputs, feeds, rule versions, container digests) so any auditor can reproduce the same graph hash and verdicts offline.

---

# Benchmarks to publish (make them your headline KPIs)

**A) False-positive reduction vs. baseline scanners (%)**

* Method: run a public corpus (e.g., sample images + app stacks) across 3–4 popular scanners; label ground truth once; compare FP rates.
* Report: mean & p95 FP reduction.

**B) Proof coverage (% of findings with signed evidence)**

* Definition: `(# findings or VEX items carrying valid DSSE receipts) / (total surfaced items)`.
* Break out: runtime-reachable vs. unreachable, and “not-affected” claims.

**C) Triage time saved (p50/p95)**

* Measure analyst minutes from “alert created” → “final disposition.”
* A/B with receipts hidden vs. visible; publish median/p95 deltas.

**D) Determinism stability**

* Re-run identical scans N times / across nodes; publish `% identical graph hashes` and drift causes when different.

---

# Minimal implementation plan (week-by-week)

**Week 1: primitives**

* Add a Graph Revision ID generator in `scanner.webservice` (Merkle over normalized JSON of SBOM + edges + policies + toolVersions); a sketch follows this plan.
* Define a `VerdictReceipt` schema (protobuf/JSON) and DSSE envelope types.

**Week 2: signing + storage**

* Wire DSSE signing in **Authority**; offline key support + rotation.
* Persist receipts in a `Receipts` table (Postgres) keyed by `(graphRevisionId, verdictId)`; enable export (JSONL) and ledger mirroring.

**Week 3: reachability proofs**

* Add call-stack slice capture in the reachability engine; serialize compactly; hash + reference from receipts.
* Binary symbol proof module for ELF/PE: symbol bitmap + digest.

**Week 4: replay + UX**

* Emit `replay.manifest.json` per scan (inputs, tool digests).
* UI: show a **“Verified”** badge, graph hash, signature issuer, and a one-click “Copy receipt” button.
* API: `GET /verdicts/{id}/receipt`, `GET /graphs/{rev}/replay`.

**Week 5: benchmarks harness**

* Create `bench/` with golden fixtures and a runner:
  * Baseline scanner adapters
  * Ground-truth labels
  * Metrics export (FP%, proof coverage, triage time capture hooks)
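To make the Week 1 primitive concrete, here is a minimal sketch of the Graph Revision ID computation: a Merkle root over per-section hashes of the normalized inputs. Names like `GraphRevision` and `ComputeGraphRevisionId` are placeholders, and a production version would need a real canonical-JSON encoding (sorted keys, fixed number formatting) instead of the plain serializer used here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

// Sketch only: the Graph Revision ID as a Merkle root over the normalized input sections
// (SBOM nodes, edges, policies, feeds, scan params, tool versions).
public static class GraphRevision
{
    // Leaf hash: SHA-256 over a section label plus a JSON rendering of that section's content.
    private static byte[] LeafHash(string sectionName, object payload)
    {
        string encoded = sectionName + ":" + JsonSerializer.Serialize(payload);
        return SHA256.HashData(Encoding.UTF8.GetBytes(encoded));
    }

    // Classic Merkle fold: pair up hashes level by level, carrying the last one when the count is odd.
    private static byte[] MerkleRoot(List<byte[]> level)
    {
        if (level.Count == 0) return SHA256.HashData(Array.Empty<byte>());
        while (level.Count > 1)
        {
            var next = new List<byte[]>();
            for (int i = 0; i < level.Count; i += 2)
            {
                byte[] left = level[i];
                byte[] right = i + 1 < level.Count ? level[i + 1] : level[i];
                next.Add(SHA256.HashData(left.Concat(right).ToArray()));
            }
            level = next;
        }
        return level[0];
    }

    // Sections are sorted by name so the same inputs always produce the same ID.
    public static string ComputeGraphRevisionId(IReadOnlyDictionary<string, object> sections)
    {
        var leaves = sections
            .OrderBy(kv => kv.Key, StringComparer.Ordinal)
            .Select(kv => LeafHash(kv.Key, kv.Value))
            .ToList();
        return Convert.ToHexString(MerkleRoot(leaves)).ToLowerInvariant();
    }
}
```

Any change to any section (a new SBOM node, a bumped rule version, a different feed snapshot) changes its leaf hash and therefore the root, which is exactly the “any data change → new graph hash” rule above.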
---

# Developer guardrails (make these non-negotiable)

* **No receipt, no ship:** any surfaced verdict must carry a DSSE receipt.
* **Schema freeze windows:** changes to rule inputs or policy logic must bump the rule version and therefore the graph hash.
* **Replay-first CI:** PRs touching scanning/rules must pass a replay test that reproduces prior graph hashes on golden fixtures.
* **Clock safety:** use monotonic time inside receipts; add UTC wall-time separately.

---

# What to show buyers/auditors

* A short **audit kit**: sample container + your receipts + replay manifest + one command to reproduce the same graph hash.
* A one-page **benchmark readout**: FP reduction, proof coverage, and triage time saved (p50/p95), with a corpus description.

---

If you want, I’ll draft:

1. the DSSE `predicate` schema,
2. the Postgres DDL for `Receipts` and `Graphs`, and
3. a tiny .NET verification CLI (`stellaops-verify`) that replays a manifest and validates signatures.

Here’s a focused “developer guidelines” doc just for **Benchmarks for a Testable Security Moat** in Stella Ops.

---

# Stella Ops Developer Guidelines

## Benchmarks for a Testable Security Moat

> **Goal:** Benchmarks are how we *prove* Stella Ops is better, not just say it is. If a “moat” claim can’t be tied to a benchmark, it doesn’t exist. Everything here is about how you, as a developer, design, extend, and run those benchmarks.

---

## 1. What our benchmarks must measure

Every core product claim needs at least one benchmark:

1. **Detection quality**
   * Precision / recall vs ground truth.
   * False positives vs popular scanners.
   * False negatives on known-bad samples.
2. **Proof & evidence quality**
   * % of findings with **valid receipts** (DSSE).
   * % of VEX “not-affected” claims with attached proofs.
   * Reachability proof quality:
     * call-stack slice present?
     * symbol proof present for binaries?
3. **Triage & workflow impact**
   * Time-to-decision for analysts (p50/p95).
   * Click depth and context switches per decision.
   * “Verified” vs “unverified” verdict triage times.
4. **Determinism & reproducibility**
   * Same inputs → same **Graph Revision ID**.
   * Stable verdict sets across runs/nodes.

> **Rule:** If you add a feature that impacts any of these, you must either hook it into an existing benchmark or add a new one.

---

## 2. Benchmark assets and layout

**2.1 Repo layout (convention)**

Under `bench/` we maintain everything benchmark-related:

* `bench/corpus/`
  * `images/` – curated container images / tarballs.
  * `repos/` – sample codebases (with known vulns).
  * `sboms/` – canned SBOMs for edge cases.
* `bench/scenarios/`
  * `*.yaml` – scenario definitions (inputs + expected outputs).
* `bench/golden/`
  * `*.json` – golden results (expected findings, metrics).
* `bench/tools/`
  * adapters for baseline scanners, parsers, helpers.
* `bench/scripts/`
  * `run_benchmarks.[sh/cs]` – single entrypoint.

**2.2 Scenario definition (high-level)**

Each scenario YAML should minimally specify:

* **Inputs**
  * artifact references (image name / path / repo SHA / SBOM file).
  * environment knobs (features enabled/disabled).
* **Ground truth**
  * list of expected vulns (or explicit “none”).
  * for some: expected reachability (reachable/unreachable).
  * expected VEX entries (affected / not affected).
* **Expectations**
  * required metrics (e.g., “no more than 2 FPs”, “no FNs”).
  * required proof coverage (e.g., “100% of surfaced findings have receipts”).

---

## 3. Core benchmark metrics (developer-facing definitions)

Use these consistently across code and docs; a small computation sketch follows this section.

### 3.1 Detection metrics

* `true_positive_count` (TP)
* `false_positive_count` (FP)
* `false_negative_count` (FN)

Derived:

* `precision = TP / (TP + FP)`
* `recall = TP / (TP + FN)`
* For UX: track **FP per asset** and **FP per 100 findings**.

**Developer guideline:**

* When you introduce a filter, deduper, or rule tweak, add or modify scenarios so that:
  * one scenario shows the change **helps** (reduces FP or FN); and
  * another guards against regressions.

### 3.2 Moat-specific metrics

These are the ones that directly support the “testable moat” story:

1. **False-positive reduction vs baseline scanners**
   * Run baseline scanners across our corpus (via adapters in `bench/tools`).
   * Compute:
     * `baseline_fp_rate`
     * `stella_fp_rate`
     * `fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate`
2. **Proof coverage**
   * `proof_coverage_all = findings_with_valid_receipts / total_findings`
   * `proof_coverage_vex = vex_items_with_valid_receipts / total_vex_items`
   * `proof_coverage_reachable = reachable_findings_with_proofs / total_reachable_findings`
3. **Triage time improvement**
   * In test harnesses, simulate or record:
     * `time_to_triage_with_receipts`
     * `time_to_triage_without_receipts`
   * Compute median & p95 deltas.
4. **Determinism**
   * Re-run the same scenario `N` times:
     * `% runs with identical Graph Revision ID`
     * `% runs with identical verdict sets`
   * On mismatch, diff and log the cause (e.g., non-stable sort, non-pinned feed).
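To keep these definitions unambiguous when they show up in harness code, here is a minimal, illustrative C# sketch. The type and method names are placeholders (not the actual `bench/` implementation), and the zero-denominator conventions are assumptions made only so the sketch is total.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: names and edge-case conventions are assumptions, not the real harness API.
public record DetectionCounts(int TruePositives, int FalsePositives, int FalseNegatives);

public static class MoatMetrics
{
    // precision = TP / (TP + FP); returns 1.0 when nothing was surfaced (assumed convention).
    public static double Precision(DetectionCounts c) =>
        c.TruePositives + c.FalsePositives == 0
            ? 1.0
            : (double)c.TruePositives / (c.TruePositives + c.FalsePositives);

    // recall = TP / (TP + FN); returns 1.0 when there was nothing to find (assumed convention).
    public static double Recall(DetectionCounts c) =>
        c.TruePositives + c.FalseNegatives == 0
            ? 1.0
            : (double)c.TruePositives / (c.TruePositives + c.FalseNegatives);

    // fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate
    public static double FpReduction(double baselineFpRate, double stellaFpRate) =>
        baselineFpRate == 0 ? 0.0 : (baselineFpRate - stellaFpRate) / baselineFpRate;

    // proof_coverage = items_with_valid_receipts / total_items
    public static double ProofCoverage(int itemsWithValidReceipts, int totalItems) =>
        totalItems == 0 ? 1.0 : (double)itemsWithValidReceipts / totalItems;

    // Determinism: share of runs whose Graph Revision ID matches the first run's.
    public static double IdenticalGraphHashRate(IReadOnlyList<string> graphRevisionIds) =>
        graphRevisionIds.Count == 0
            ? 1.0
            : graphRevisionIds.Count(id => id == graphRevisionIds[0]) / (double)graphRevisionIds.Count;
}
```

If the real harness already has equivalents, prefer those; the point is that each published KPI maps to one small, reviewable function.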
---

## 4. How developers should work with benchmarks

### 4.1 “No feature without benchmarks”

If you’re adding or changing:

* graph structure,
* rule logic,
* scanner integration,
* VEX handling,
* proof / receipt generation,

you **must** do *at least one* of:

1. **Extend an existing scenario**
   * Add expectations that cover your change, or
   * tighten an existing bound (e.g., a lower FP threshold).
2. **Add a new scenario**
   * For new attack classes / edge cases / ecosystems.

**Anti-patterns:**

* Shipping a new capability with *no* corresponding scenario.
* Updating golden outputs without explaining why metrics changed.

### 4.2 CI gates

We treat benchmarks as **blocking**:

* Add a CI job, e.g.:
  * `make bench:quick` on every PR (small subset).
  * `make bench:full` on main / nightly.
* CI fails if:
  * Any scenario marked `strict: true` has:
    * precision or recall below its threshold, or
    * proof coverage below its configured threshold.
  * Global regressions exceed tolerance:
    * e.g., total FPs increase > X% without an explicit override.

**Developer rule:**

* If you intentionally change behavior:
  * Update the relevant golden files.
  * Include a short note in the PR (e.g., a `bench-notes.md` snippet) describing:
    * what changed,
    * why the new result is better, and
    * which moat metric it improves (FP, proof coverage, determinism, etc.).

---

## 5. Benchmark implementation guidelines

### 5.1 Make benchmarks deterministic

* **Pin everything**:
  * feed snapshots,
  * tool container digests,
  * rule versions,
  * time windows.
* Use **Replay Manifests** as the source of truth:
  * `replay.manifest.json` should contain:
    * input artifacts,
    * tool versions,
    * feed versions,
    * configuration flags.
* If a benchmark depends on time:
  * Inject a **fake clock** or explicit “as of” timestamp (a sketch follows this subsection).
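A minimal sketch of what “pin everything” and the fake clock can look like in harness code, assuming hypothetical `ReplayPins` and `IBenchClock` types; field names are illustrative and the real replay manifest schema remains the source of truth.

```csharp
using System;

// Hypothetical container for pinned inputs read from replay.manifest.json.
// Field names are illustrative; the real manifest schema is the source of truth.
public record ReplayPins(
    string FeedSnapshotId,      // e.g. a dated feed snapshot identifier
    string ToolImageDigest,     // sha256 digest of the scanner container
    string RuleVersion,         // pinned rule pack version
    DateTimeOffset AsOf);       // the "as of" instant the whole run is evaluated against

// Benchmarks never read the wall clock directly; they ask this interface instead.
public interface IBenchClock
{
    DateTimeOffset UtcNow { get; }
}

// Production code can wrap DateTimeOffset.UtcNow; benchmarks use the pinned instant.
public sealed class FixedClock : IBenchClock
{
    private readonly DateTimeOffset _asOf;
    public FixedClock(DateTimeOffset asOf) => _asOf = asOf;
    public DateTimeOffset UtcNow => _asOf;
}

// Usage in a scenario runner (sketch): the clock comes from the pins, never from the host.
// var pins = LoadPins("replay.manifest.json");   // hypothetical loader
// IBenchClock clock = new FixedClock(pins.AsOf);
```

The design point is that any code path which would otherwise call `DateTimeOffset.UtcNow` takes the clock as a dependency, so two runs with the same manifest cannot diverge on time.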
### 5.2 Keep scenarios small but meaningful

* Prefer many **focused** scenarios over a few huge ones.
* Each scenario should clearly answer:
  * “What property of Stella Ops are we testing?”
  * “What moat claim does this support?”

Examples:

* `bench/scenarios/false_pos_kubernetes.yaml`
  * Focus: config noise reduction vs baseline scanner.
* `bench/scenarios/reachability_java_webapp.yaml`
  * Focus: reachable vs unreachable vuln proofs.
* `bench/scenarios/vex_not_affected_openssl.yaml`
  * Focus: VEX correctness and proof coverage.

### 5.3 Use golden outputs, not ad-hoc assertions

* The bench harness should:
  * Run Stella Ops on scenario inputs.
  * Normalize outputs (sorted lists, stable IDs).
  * Compare to `bench/golden/*.json`.
* Golden files should include:
  * expected findings (id, severity, reachable?, etc.),
  * expected VEX entries,
  * expected metrics (precision, recall, coverage).

---

## 6. Moat-critical benchmark types (we must have all of these)

When you’re thinking about gaps, check that we have:

1. **Cross-tool comparison**
   * Same corpus, multiple scanners.
   * Metrics vs baselines for FP/FN.
2. **Proof density & quality**
   * Corpus where:
     * some vulns are reachable,
     * some are not,
     * some are not present.
   * Ensure:
     * reachable ones have rich proofs (stack slices / symbol proofs).
     * non-reachable or absent ones have:
       * correct disposition, and
       * clear receipts explaining why.
3. **VEX accuracy**
   * Scenarios with a known SBOM + known vulnerability impact.
   * Check:
     * VEX “affected”/“not-affected” matches ground truth.
     * every VEX entry has a receipt.
4. **Analyst workflow**
   * Small usability corpus for internal testing:
     * Measure time-to-triage with/without receipts.
     * Use the same scenarios across releases to track improvement.
5. **Upgrade / drift resistance**
   * Scenarios that are **expected to remain stable** across:
     * rule changes that *shouldn’t* affect outcomes.
     * feed updates (within a given version window).
   * These act as canaries for unintended regressions.

---

## 7. Developer checklist (TL;DR)

Before merging a change that touches security logic, ask yourself:

1. **Is there at least one benchmark scenario that exercises this change?**
2. **Does the change improve at least one moat metric, or is it neutral?**
3. **Have I run `make bench:quick` locally and checked diffs?**
4. **If goldens changed, did I explain why in the PR?**
5. **Did I keep benchmarks deterministic (pinned versions, fake time, etc.)?**

If any answer is “no”, fix that before merging.

---

If you’d like, as a next step I can sketch a concrete `bench/scenarios/*.yaml` and matching `bench/golden/*.json` example that encodes one *specific* moat claim (e.g., “30% fewer FPs than Scanner X on Kubernetes configs”) so your team has a ready-to-copy pattern.
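Until that concrete scenario/golden pair exists, here is a rough sketch of the normalize-and-compare step from 5.3, assuming a hypothetical `Finding` shape and a golden file that stores just an array of findings; the real golden files also carry VEX entries and metrics.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

// Hypothetical finding shape; real golden files carry more fields than this.
public record Finding(string Id, string Severity, bool Reachable);

public static class GoldenCompare
{
    private static readonly JsonSerializerOptions Options = new JsonSerializerOptions { WriteIndented = true };

    // Normalize before comparing: stable ordering so output order never causes spurious diffs.
    public static List<Finding> Normalize(IEnumerable<Finding> findings) =>
        findings.OrderBy(f => f.Id, StringComparer.Ordinal).ToList();

    // Returns true when the normalized actual findings match the golden file after canonical
    // re-serialization; otherwise `diff` holds both renderings for the PR note.
    public static bool MatchesGolden(IEnumerable<Finding> actual, string goldenPath, out string diff)
    {
        var expected = JsonSerializer.Deserialize<List<Finding>>(File.ReadAllText(goldenPath))
                       ?? new List<Finding>();

        string expectedJson = JsonSerializer.Serialize(Normalize(expected), Options);
        string actualJson = JsonSerializer.Serialize(Normalize(actual), Options);

        diff = expectedJson == actualJson ? string.Empty : $"expected:\n{expectedJson}\nactual:\n{actualJson}";
        return diff.Length == 0;
    }
}
```

The harness would call `MatchesGolden` per scenario and fail CI with the diff when it is non-empty; intentional changes then show up as an explicit golden-file update plus the bench-notes explanation described in 4.2.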