add advisories

This commit is contained in:
master
2025-12-01 17:50:11 +02:00
parent c11d87d252
commit 790801f329
7 changed files with 3723 additions and 0 deletions

Here's a crisp, practical way to turn StellaOps' “verifiable proof spine” into a moat, and how to measure it.
# Why this matters (in plain terms)
Security tools often say “trust me.” You'll say “prove it”: every finding and every “not-affected” claim ships with cryptographic receipts anyone can verify.
---
# Differentiators to build in
**1) Bind every verdict to a graph hash**
* Compute a stable **Graph Revision ID** (Merkle root) over: SBOM nodes, edges, policies, feeds, scan params, and tool versions.
* Store the ID on each finding/VEX item; show it in the UI and APIs.
* Rule: any data change → new graph hash → new revisioned verdicts.
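A minimal C# sketch of the Merkle construction, assuming the SBOM nodes, edges, policies, feed snapshots, scan params, and tool versions have already been normalized into canonical JSON leaves upstream (class and method names here are illustrative, not the shipped `scanner.webservice` API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class GraphRevision
{
    // Each leaf is one canonical (normalized, stably ordered) JSON document:
    // SBOM nodes, edges, policies, feed snapshots, scan params, tool versions.
    public static string Compute(IEnumerable<string> canonicalLeaves)
    {
        var layer = canonicalLeaves
            .OrderBy(l => l, StringComparer.Ordinal)   // stable order => stable root
            .Select(l => SHA256.HashData(Encoding.UTF8.GetBytes(l)))
            .ToList();

        if (layer.Count == 0)
            return "sha256:" + Convert.ToHexString(SHA256.HashData(Array.Empty<byte>())).ToLowerInvariant();

        while (layer.Count > 1)                        // fold pairwise up to the Merkle root
        {
            var next = new List<byte[]>();
            for (int i = 0; i < layer.Count; i += 2)
            {
                var right = i + 1 < layer.Count ? layer[i + 1] : layer[i]; // duplicate last node on odd layers
                next.Add(SHA256.HashData(layer[i].Concat(right).ToArray()));
            }
            layer = next;
        }
        return "sha256:" + Convert.ToHexString(layer[0]).ToLowerInvariant();
    }
}
```

Any change to a leaf (a new feed snapshot, a bumped rule version) changes the root, which is exactly the “any data change → new graph hash → new revisioned verdicts” rule above.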
**2) Attach machine-verifiable receipts (in-toto/DSSE)**
* For each verdict, emit a **DSSE-wrapped in-toto statement**:
* predicateType: `stellaops.dev/verdict@v1`
* includes: graphRevisionId, artifact digests, rule id/version, inputs (CPE/CVE/CVSS), timestamps.
* Sign with your **Authority** (Sigstore key, offline mode supported).
* Keep receipts queryable and exportable; mirror to a Rekor-compatible ledger when online.
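A hedged C# sketch of the wrapping itself, following the public in-toto Statement v1 layout and DSSE pre-authentication encoding; the predicate field values, subject, and key handling are placeholders rather than the real Authority integration:

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

var statement = new
{
    _type = "https://in-toto.io/Statement/v1",
    subject = new[] { new { name = "registry.example/app", digest = new { sha256 = "<image digest>" } } },
    predicateType = "stellaops.dev/verdict@v1",
    predicate = new
    {
        graphRevisionId = "<graph revision id>",
        ruleId = "<rule id>", ruleVersion = "<rule version>",
        inputs = new { cpe = "<cpe>", cve = "<cve>", cvss = 0.0 },
        issuedAt = DateTimeOffset.UtcNow              // UTC wall-clock time; monotonic time tracked separately
    }
};

byte[] payload = JsonSerializer.SerializeToUtf8Bytes(statement);
const string payloadType = "application/vnd.in-toto+json";

// DSSE pre-authentication encoding: "DSSEv1 <len(type)> <type> <len(payload)> <payload>"
byte[] pae = Encoding.UTF8
    .GetBytes($"DSSEv1 {Encoding.UTF8.GetByteCount(payloadType)} {payloadType} {payload.Length} ")
    .Concat(payload)
    .ToArray();

using var key = ECDsa.Create(ECCurve.NamedCurves.nistP256);   // stand-in for the Authority signing key
byte[] signature = key.SignData(pae, HashAlgorithmName.SHA256);

var envelope = new
{
    payloadType,
    payload = Convert.ToBase64String(payload),
    signatures = new[] { new { keyid = "authority-key-1", sig = Convert.ToBase64String(signature) } }
};
Console.WriteLine(JsonSerializer.Serialize(envelope));
```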
**3) Add reachability “call-stack slices” or binary-symbol proofs**
* For code-level reachability, store compact slices: entry → sink, with symbol names + file:line.
* For binary-only targets, include **symbol presence proofs** (e.g., Bloom filters + offsets) with executable digest.
* Compress and embed a hash of the slice/proof inside the DSSE payload.
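One possible shape for the slice, sketched as a C# record; the layout is illustrative, and the point is that a stable digest of the slice is what the DSSE payload references while the slice itself is stored compressed:

```csharp
using System;
using System.Security.Cryptography;
using System.Text.Json;

public sealed record Frame(string Symbol, string File, int Line);

public sealed record CallStackSlice(string EntryPoint, string Sink, Frame[] Frames)
{
    // Canonical serialization + SHA-256; the DSSE predicate carries only this digest,
    // e.g. a hypothetical "reachabilityProof": "sha256:<digest>" field.
    public string Digest()
    {
        byte[] canonical = JsonSerializer.SerializeToUtf8Bytes(this);
        return "sha256:" + Convert.ToHexString(SHA256.HashData(canonical)).ToLowerInvariant();
    }
}
```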
**4) Deterministic replay manifests**
* Alongside receipts, publish a **Replay Manifest** (inputs, feeds, rule versions, container digests) so any auditor can reproduce the same graph hash and verdicts offline.
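A sketch of the manifest's shape as a C# record; the field names are assumptions, the invariant is that every input feeding the graph hash is pinned by digest, snapshot id, or version:

```csharp
using System.Collections.Generic;

public sealed record ReplayManifest(
    string GraphRevisionId,
    IReadOnlyList<string> ArtifactDigests,              // container images / SBOM files, by digest
    IReadOnlyDictionary<string, string> ToolVersions,   // tool name -> container digest
    IReadOnlyDictionary<string, string> FeedSnapshots,  // feed name -> snapshot id
    IReadOnlyDictionary<string, string> Config,         // feature flags, scan params
    string RuleBundleVersion,
    string AsOf);                                        // explicit "as of" timestamp, not run-time wall clock
```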
---
# Benchmarks to publish (make them your headline KPIs)
**A) False-positive reduction vs. baseline scanners (%)**
* Method: run a public corpus (e.g., sample images + app stacks) across 3-4 popular scanners; label ground truth once; compare FP rate.
* Report: mean & p95 FP reduction.
**B) Proof coverage (% of findings with signed evidence)**
* Definition: `(# findings or VEX items carrying valid DSSE receipts) / (total surfaced items)`.
* Break out: runtime-reachable vs. unreachable, and “not-affected” claims.
**C) Triage time saved (p50/p95)**
* Measure analyst minutes from “alert created” → “final disposition.”
* A/B with receipts hidden vs. visible; publish median/p95 deltas.
**D) Determinism stability**
* Re-run identical scans N times / across nodes; publish `% identical graph hashes` and drift causes when different.
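A small harness-side sketch of KPI D, assuming a hypothetical `runScanAsync` callback that performs one scan and returns its Graph Revision ID:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DeterminismCheck
{
    public static async Task<double> PercentIdenticalAsync(Func<Task<string>> runScanAsync, int runs = 10)
    {
        var hashes = new List<string>();
        for (int i = 0; i < runs; i++)
            hashes.Add(await runScanAsync());

        // Share of runs matching the most common hash; anything else is drift to diff and explain.
        int modal = hashes.GroupBy(h => h).Max(g => g.Count());
        return 100.0 * modal / runs;
    }
}
```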
---
# Minimal implementation plan (week-by-week)
**Week 1: primitives**
* Add Graph Revision ID generator in `scanner.webservice` (Merkle over normalized JSON of SBOM+edges+policies+toolVersions).
* Define `VerdictReceipt` schema (protobuf/JSON) and DSSE envelope types.
**Week 2: signing + storage**
* Wire DSSE signing in **Authority**; offline key support + rotation.
* Persist receipts in `Receipts` table (Postgres) keyed by `(graphRevisionId, verdictId)`; enable export (JSONL) and ledger mirror.
**Week 3: reachability proofs**
* Add call-stack slice capture in reachability engine; serialize compactly; hash + reference from receipts.
* Binary symbol proof module for ELF/PE: symbol bitmap + digest.
**Week 4: replay + UX**
* Emit `replay.manifest.json` per scan (inputs, tool digests).
* UI: show **“Verified”** badge, graph hash, signature issuer, and a one-click “Copy receipt” button.
* API: `GET /verdicts/{id}/receipt`, `GET /graphs/{rev}/replay`.
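A minimal ASP.NET Core sketch of those two read-only endpoints; the in-memory dictionaries stand in for the real `Receipts` table and manifest store:

```csharp
using System.Collections.Generic;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// verdictId -> stored DSSE envelope JSON; graphRevisionId -> replay manifest JSON.
var receipts = new Dictionary<string, string>();
var replays = new Dictionary<string, string>();

app.MapGet("/verdicts/{id}/receipt", (string id) =>
    receipts.TryGetValue(id, out var envelope)
        ? Results.Content(envelope, "application/json")
        : Results.NotFound());

app.MapGet("/graphs/{rev}/replay", (string rev) =>
    replays.TryGetValue(rev, out var manifest)
        ? Results.Content(manifest, "application/json")
        : Results.NotFound());

app.Run();
```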
**Week 5: benchmarks harness**
* Create `bench/` with golden fixtures and a runner:
* Baseline scanner adapters
* Ground-truth labels
* Metrics export (FP%, proof coverage, triage time capture hooks)
---
# Developer guardrails (make these non-negotiable)
* **No receipt, no ship:** any surfaced verdict must carry a DSSE receipt.
* **Schema freeze windows:** changes to rule inputs or policy logic must bump rule version and therefore the graph hash.
* **Replay-first CI:** PRs touching scanning/rules must pass a replay test that reproduces prior graph hashes on gold fixtures.
* **Clock safety:** use monotonic time inside receipts; add UTC wall-clock time separately.
---
# What to show buyers/auditors
* A short **audit kit**: sample container + your receipts + replay manifest + one command to reproduce the same graph hash.
* A one-page **benchmark readout**: FP reduction, proof coverage, and triage time saved (p50/p95), with a corpus description.
---
If you want, I'll draft:
1. the DSSE `predicate` schema,
2. the Postgres DDL for `Receipts` and `Graphs`, and
3. a tiny .NET verification CLI (`stellaops-verify`) that replays a manifest and validates signatures.
---
Here's a focused “developer guidelines” doc just for **Benchmarks for a Testable Security Moat** in StellaOps.
---
# Stella Ops Developer Guidelines
## Benchmarks for a Testable Security Moat
> **Goal:** Benchmarks are how we *prove* StellaOps is better, not just say it is. If a “moat” claim can't be tied to a benchmark, it doesn't exist.
Everything here is about how you, as a developer, design, extend, and run those benchmarks.
---
## 1. What our benchmarks must measure
Every core product claim needs at least one benchmark:
1. **Detection quality**
* Precision / recall vs ground truth.
* False positives vs popular scanners.
* False negatives on known-bad samples.
2. **Proof & evidence quality**
* % of findings with **valid receipts** (DSSE).
* % of VEX “not-affected” with attached proofs.
* Reachability proof quality:
* call-stack slice present?
* symbol proof present for binaries?
3. **Triage & workflow impact**
* Time-to-decision for analysts (p50/p95).
* Click depth and context switches per decision.
* “Verified” vs “unverified” verdict triage times.
4. **Determinism & reproducibility**
* Same inputs → same **Graph Revision ID**.
* Stable verdict sets across runs/nodes.
> **Rule:** If you add a feature that impacts any of these, you must either hook it into an existing benchmark or add a new one.
---
## 2. Benchmark assets and layout
**2.1 Repo layout (convention)**
Under `bench/` we maintain everything benchmark-related:
* `bench/corpus/`
  * `images/` - curated container images / tarballs.
  * `repos/` - sample codebases (with known vulns).
  * `sboms/` - canned SBOMs for edge cases.
* `bench/scenarios/`
  * `*.yaml` - scenario definitions (inputs + expected outputs).
* `bench/golden/`
  * `*.json` - golden results (expected findings, metrics).
* `bench/tools/`
  * adapters for baseline scanners, parsers, helpers.
* `bench/scripts/`
  * `run_benchmarks.[sh/cs]` - single entrypoint.
**2.2 Scenario definition (high-level)**
Each scenario yaml should minimally specify:
* **Inputs**
* artifact references (image name / path / repo SHA / SBOM file).
* environment knobs (features enabled/disabled).
* **Ground truth**
* list of expected vulns (or explicit “none”).
* for some: expected reachability (reachable/unreachable).
* expected VEX entries (affected / not affected).
* **Expectations**
* required metrics (e.g., “no more than 2 FPs”, “no FNs”).
* required proof coverage (e.g., “100% of surfaced findings have receipts”).
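For reference, a typed model the bench runner could bind a scenario file into; property names mirror the list above and are a suggestion, not a frozen schema:

```csharp
using System.Collections.Generic;

public sealed record Scenario(ScenarioInputs Inputs, GroundTruth GroundTruth, Expectations Expectations);

public sealed record ScenarioInputs(
    string? Image, string? RepoSha, string? SbomFile,
    IReadOnlyDictionary<string, bool> Features);           // environment knobs on/off

public sealed record GroundTruth(
    IReadOnlyList<ExpectedVuln> ExpectedVulns,             // empty list == explicit "none"
    IReadOnlyList<ExpectedVex> ExpectedVex);

public sealed record ExpectedVuln(string Id, bool? Reachable);   // reachability expectation is optional
public sealed record ExpectedVex(string Id, string Status);      // "affected" / "not_affected"

public sealed record Expectations(
    int MaxFalsePositives,
    int MaxFalseNegatives,
    double MinProofCoverage);                              // 1.0 == every surfaced finding has a receipt
```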
---
## 3. Core benchmark metrics (developer-facing definitions)
Use these consistently across code and docs.
### 3.1 Detection metrics
* `true_positive_count` (TP)
* `false_positive_count` (FP)
* `false_negative_count` (FN)
Derived:
* `precision = TP / (TP + FP)`
* `recall = TP / (TP + FN)`
* For UX: track **FP per asset** and **FP per 100 findings**.
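The same definitions as a tiny C# helper the runner can reuse:

```csharp
public readonly record struct DetectionMetrics(int Tp, int Fp, int Fn)
{
    public double Precision        => Tp + Fp == 0 ? 0 : (double)Tp / (Tp + Fp);
    public double Recall           => Tp + Fn == 0 ? 0 : (double)Tp / (Tp + Fn);
    public double FpPer100Findings => Tp + Fp == 0 ? 0 : 100.0 * Fp / (Tp + Fp);   // FP per asset also needs the asset count
}
```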
**Developer guideline:**
* When you introduce a filter, deduper, or rule tweak, add/modify a scenario where:
* the change **helps** (reduces FP or FN); and
* a different scenario guards against regressions.
### 3.2 Moat-specific metrics
These are the ones that directly support the “testable moat” story:
1. **False-positive reduction vs baseline scanners**
* Run baseline scanners across our corpus (via adapters in `bench/tools`).
* Compute:
* `baseline_fp_rate`
* `stella_fp_rate`
* `fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate`.
2. **Proof coverage**
* `proof_coverage_all = findings_with_valid_receipts / total_findings`
* `proof_coverage_vex = vex_items_with_valid_receipts / total_vex_items`
* `proof_coverage_reachable = reachable_findings_with_proofs / total_reachable_findings`
3. **Triage time improvement**
* In test harnesses, simulate or record:
* `time_to_triage_with_receipts`
* `time_to_triage_without_receipts`
* Compute median & p95 deltas.
4. **Determinism**
* Re-run the same scenario `N` times:
* `% runs with identical Graph Revision ID`
* `% runs with identical verdict sets`
* On mismatch, diff and log the cause (e.g., a non-stable sort, a non-pinned feed).
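A sketch of the first two metric families as plain helpers (determinism is the re-run comparison described above):

```csharp
public static class MoatMetrics
{
    // fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate
    public static double FpReduction(double baselineFpRate, double stellaFpRate) =>
        baselineFpRate == 0 ? 0 : (baselineFpRate - stellaFpRate) / baselineFpRate;

    // proof_coverage_* = items with valid DSSE receipts / total items (all, VEX-only, reachable-only)
    public static double ProofCoverage(int itemsWithValidReceipts, int totalItems) =>
        totalItems == 0 ? 1.0 : (double)itemsWithValidReceipts / totalItems;
}
```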
---
## 4. How developers should work with benchmarks
### 4.1 “No feature without benchmarks”
If you're adding or changing:
* graph structure,
* rule logic,
* scanner integration,
* VEX handling,
* proof / receipt generation,
you **must** do *at least one* of:
1. **Extend an existing scenario**
* Add expectations that cover your change, or
* tighten an existing bound (e.g., lower FP threshold).
2. **Add a new scenario**
* For new attack classes / edge cases / ecosystems.
**Anti-patterns:**
* Shipping a new capability with *no* corresponding scenario.
* Updating golden outputs without explaining why metrics changed.
### 4.2 CI gates
We treat benchmarks as **blocking**:
* Add a CI job, e.g.:
* `make bench:quick` on every PR (small subset).
* `make bench:full` on main / nightly.
* CI fails if:
* Any scenario marked `strict: true` has:
* Precision or recall below its threshold.
* Proof coverage below its configured threshold.
* Global regressions above tolerance:
* e.g. total FP increases > X% without an explicit override.
**Developer rule:**
* If you intentionally change behavior:
* Update the relevant golden files.
* Include a short note in the PR (e.g., `bench-notes.md` snippet) describing:
* what changed,
* why the new result is better, and
* which moat metric it improves (FP, proof coverage, determinism, etc.).
---
## 5. Benchmark implementation guidelines
### 5.1 Make benchmarks deterministic
* **Pin everything**:
* feed snapshots,
* tool container digests,
* rule versions,
* time windows.
* Use **Replay Manifests** as the source of truth:
* `replay.manifest.json` should contain:
* input artifacts,
* tool versions,
* feed versions,
* configuration flags.
* If a benchmark depends on time:
* Inject a **fake clock** or explicit “as of” timestamp.
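A dependency-free sketch of clock injection (.NET's built-in `TimeProvider` abstraction serves the same purpose if you prefer it):

```csharp
using System;

public interface IClock { DateTimeOffset UtcNow { get; } }

public sealed class SystemClock : IClock
{
    public DateTimeOffset UtcNow => DateTimeOffset.UtcNow;
}

// In benchmarks, pin "now" to the scenario's explicit "as of" timestamp.
public sealed class FixedClock : IClock
{
    private readonly DateTimeOffset _asOf;
    public FixedClock(DateTimeOffset asOf) => _asOf = asOf;
    public DateTimeOffset UtcNow => _asOf;
}
```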
### 5.2 Keep scenarios small but meaningful
* Prefer many **focused** scenarios over a few huge ones.
* Each scenario should clearly answer:
* “What property of StellaOps are we testing?”
* “What moat claim does this support?”
Examples:
* `bench/scenarios/false_pos_kubernetes.yaml`
* Focus: config noise reduction vs baseline scanner.
* `bench/scenarios/reachability_java_webapp.yaml`
* Focus: reachable vs unreachable vuln proofs.
* `bench/scenarios/vex_not_affected_openssl.yaml`
* Focus: VEX correctness and proof coverage.
### 5.3 Use golden outputs, not ad-hoc assertions
* Bench harness should:
* Run StellaOps on scenario inputs.
* Normalize outputs (sorted lists, stable IDs).
* Compare to `bench/golden/<scenario>.json`.
* Golden file should include:
* expected findings (id, severity, reachable?, etc.),
* expected VEX entries,
* expected metrics (precision, recall, coverage).
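A sketch of that normalize-and-compare step; the `Finding` record and file handling are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

public sealed record Finding(string Id, string Severity, bool Reachable);

public static class GoldenCheck
{
    private static readonly JsonSerializerOptions Options = new() { WriteIndented = true };

    // Sort by stable id, serialize the same way the golden file was written, compare verbatim.
    public static bool Matches(IEnumerable<Finding> actual, string goldenPath, out string normalizedActual)
    {
        normalizedActual = JsonSerializer.Serialize(
            actual.OrderBy(f => f.Id, StringComparer.Ordinal).ToList(), Options);
        return normalizedActual == File.ReadAllText(goldenPath);
    }
}
```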
---
## 6. Moat-critical benchmark types (we must have all of these)
When you're thinking about gaps, check that we have:
1. **Cross-tool comparison**
* Same corpus, multiple scanners.
* Metrics vs baselines for FP/FN.
2. **Proof density & quality**
* Corpus where:
* some vulns are reachable,
* some are not,
* some are not present.
* Ensure:
* reachable ones have rich proofs (stack slices / symbol proofs).
* non-reachable or absent ones have:
* correct disposition, and
* clear receipts explaining why.
3. **VEX accuracy**
* Scenarios with known SBOM + known vulnerability impact.
* Check:
* VEX “affected”/“not-affected” matches ground truth.
* every VEX entry has a receipt.
4. **Analyst workflow**
* Small usability corpus for internal testing:
* Measure time-to-triage with/without receipts.
* Use the same scenarios across releases to track improvement.
5. **Upgrade / drift resistance**
* Scenarios that are **expected to remain stable** across:
* rule changes that *shouldn't* affect outcomes.
* feed updates (within a given version window).
* These act as canaries for unintended regressions.
---
## 7. Developer checklist (TL;DR)
Before merging a change that touches security logic, ask yourself:
1. **Is there at least one benchmark scenario that exercises this change?**
2. **Does the change improve at least one moat metric, or is it neutral?**
3. **Have I run `make bench:quick` locally and checked diffs?**
4. **If goldens changed, did I explain why in the PR?**
5. **Did I keep benchmarks deterministic (pinned versions, fake time, etc.)?**
If any answer is “no”, fix that before merging.
---
If you'd like, as a next step I can sketch a concrete `bench/scenarios/*.yaml` and matching `bench/golden/*.json` example that encodes one *specific* moat claim (e.g., “30% fewer FPs than Scanner X on Kubernetes configs”) so your team has a ready-to-copy pattern.