Here’s a crisp, practical way to turn Stella Ops’ “verifiable proof spine” into a moat—and how to measure it.

# Why this matters (in plain terms)

Security tools often say “trust me.” You’ll say “prove it”—every finding and every “not-affected” claim ships with cryptographic receipts anyone can verify.

---

# Differentiators to build in
**1) Bind every verdict to a graph hash**

* Compute a stable **Graph Revision ID** (Merkle root) over: SBOM nodes, edges, policies, feeds, scan params, and tool versions (see the sketch after this list).
* Store the ID on each finding/VEX item; show it in the UI and APIs.
* Rule: any data change → new graph hash → verdicts re-issued under the new revision.
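A minimal sketch of that Merkle computation in C#, assuming each component (SBOM nodes, edges, policies, feeds, scan params, tool versions) has already been reduced to normalized JSON; the component names and the `graph:sha256:` prefix are illustrative rather than a shipped format:

```csharp
// Sketch only: assumes component payloads are already normalized JSON
// (sorted keys, stable ordering). Component names are illustrative.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class GraphRevision
{
    // Bind the component name into the leaf so swapping two components'
    // contents cannot produce the same root.
    private static byte[] Leaf(string name, string normalizedJson) =>
        SHA256.HashData(Encoding.UTF8.GetBytes($"{name}\n{normalizedJson}"));

    // Classic Merkle fold: hash leaves pairwise until a single root remains.
    public static string ComputeId(IReadOnlyDictionary<string, string> components)
    {
        var level = components
            .OrderBy(kv => kv.Key, StringComparer.Ordinal)   // deterministic component order
            .Select(kv => Leaf(kv.Key, kv.Value))
            .ToList();

        if (level.Count == 0) throw new ArgumentException("no components");

        while (level.Count > 1)
        {
            var next = new List<byte[]>();
            for (int i = 0; i < level.Count; i += 2)
            {
                // Duplicate the last node when the level has an odd length.
                var right = i + 1 < level.Count ? level[i + 1] : level[i];
                next.Add(SHA256.HashData(level[i].Concat(right).ToArray()));
            }
            level = next;
        }
        return "graph:sha256:" + Convert.ToHexString(level[0]).ToLowerInvariant();
    }
}
```

Feeding the same normalized inputs on any node yields the same ID, which is exactly what the determinism benchmark (D, below) measures.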
**2) Attach machine-verifiable receipts (in-toto/DSSE)**

* For each verdict, emit a **DSSE-wrapped in-toto statement** (modeled below):
  * predicateType: `stellaops.dev/verdict@v1`
  * includes: graphRevisionId, artifact digests, rule id/version, inputs (CPE/CVE/CVSS), timestamps.
* Sign with your **Authority** (Sigstore key, offline mode supported).
* Keep receipts queryable and exportable; mirror to a Rekor-compatible ledger when online.
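As a sketch of what the statement and envelope could look like in code: the `Statement`/`Envelope` shapes follow the public in-toto attestation and DSSE formats, while the predicate field names simply mirror the bullets above and are assumptions rather than a frozen schema.

```csharp
// Illustrative records, not the final schema.
using System.Collections.Generic;
using System.Text.Json.Serialization;

public sealed record Subject(
    [property: JsonPropertyName("name")] string Name,
    [property: JsonPropertyName("digest")] Dictionary<string, string> Digest); // e.g. {"sha256": "..."}

public sealed record VerdictPredicate(
    [property: JsonPropertyName("graphRevisionId")] string GraphRevisionId,
    [property: JsonPropertyName("ruleId")] string RuleId,
    [property: JsonPropertyName("ruleVersion")] string RuleVersion,
    [property: JsonPropertyName("inputs")] Dictionary<string, string> Inputs,   // CPE/CVE/CVSS, etc.
    [property: JsonPropertyName("evaluatedAt")] string EvaluatedAtUtc);

public sealed record VerdictStatement(
    [property: JsonPropertyName("_type")] string Type,                  // "https://in-toto.io/Statement/v1"
    [property: JsonPropertyName("subject")] IReadOnlyList<Subject> Subjects,
    [property: JsonPropertyName("predicateType")] string PredicateType, // "stellaops.dev/verdict@v1"
    [property: JsonPropertyName("predicate")] VerdictPredicate Predicate);

public sealed record DsseSignature(
    [property: JsonPropertyName("keyid")] string KeyId,
    [property: JsonPropertyName("sig")] string Sig);                    // base64 signature

public sealed record DsseEnvelope(
    [property: JsonPropertyName("payload")] string Payload,             // base64(JSON statement)
    [property: JsonPropertyName("payloadType")] string PayloadType,     // "application/vnd.in-toto+json"
    [property: JsonPropertyName("signatures")] IReadOnlyList<DsseSignature> Signatures);
```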
**3) Add reachability “call-stack slices” or binary-symbol proofs**

* For code-level reachability, store compact slices: entry → sink, with symbol names + file:line (see the sketch below).
* For binary-only targets, include **symbol presence proofs** (e.g., Bloom filters + offsets) with the executable digest.
* Compress and embed a hash of the slice/proof inside the DSSE payload.
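One possible compact representation of a slice and its digest; the receipt would embed only the digest, with the full (compressed) slice stored alongside. Type and property names here are illustrative assumptions:

```csharp
// Illustrative shape for a compact call-stack slice and its digest.
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public sealed record Frame(string Symbol, string File, int Line);

public sealed record CallStackSlice(string EntrySymbol, string SinkSymbol, IReadOnlyList<Frame> Frames)
{
    // Digest of the serialized slice; this is what gets referenced from the DSSE payload.
    public string Digest()
    {
        // Property order follows record declaration order, so the JSON is stable
        // as long as the record shape is fixed.
        var json = JsonSerializer.Serialize(this);
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(json));
        return "sha256:" + Convert.ToHexString(hash).ToLowerInvariant();
    }
}
```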
**4) Deterministic replay manifests**

* Alongside receipts, publish a **Replay Manifest** (inputs, feeds, rule versions, container digests) so any auditor can reproduce the same graph hash and verdicts offline.

---

# Benchmarks to publish (make them your headline KPIs)
**A) False-positive reduction vs. baseline scanners (%)**

* Method: run a public corpus (e.g., sample images + app stacks) across 3–4 popular scanners; label ground truth once; compare FP rates.
* Report: mean & p95 FP reduction.

**B) Proof coverage (% of findings with signed evidence)**

* Definition: `(# findings or VEX items carrying valid DSSE receipts) / (total surfaced items)`.
* Break out: runtime-reachable vs. unreachable, and “not-affected” claims.

**C) Triage time saved (p50/p95)**

* Measure analyst minutes from “alert created” → “final disposition.”
* A/B with receipts hidden vs. visible; publish median/p95 deltas.

**D) Determinism stability**

* Re-run identical scans N times / across nodes; publish `% identical graph hashes` and drift causes when they differ.

---

# Minimal implementation plan (week-by-week)
**Week 1: primitives**

* Add a Graph Revision ID generator in `scanner.webservice` (Merkle over normalized JSON of SBOM + edges + policies + toolVersions).
* Define the `VerdictReceipt` schema (protobuf/JSON) and DSSE envelope types.

**Week 2: signing + storage**

* Wire DSSE signing into **Authority**; offline key support + rotation.
* Persist receipts in a `Receipts` table (Postgres) keyed by `(graphRevisionId, verdictId)`; enable export (JSONL) and a ledger mirror (one possible table shape is sketched below).
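A hedged sketch of what that table could look like, expressed as an embedded DDL string executed through Npgsql; the column names and types are assumptions, not the actual migration:

```csharp
// Illustrative schema sketch; the real migration may differ.
using Npgsql;

public static class ReceiptSchema
{
    public const string CreateReceiptsTable = """
        CREATE TABLE IF NOT EXISTS receipts (
            graph_revision_id TEXT        NOT NULL,
            verdict_id        TEXT        NOT NULL,
            envelope          JSONB       NOT NULL,  -- DSSE envelope (payload + signatures)
            created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
            PRIMARY KEY (graph_revision_id, verdict_id)
        );
        """;

    public static void EnsureCreated(string connectionString)
    {
        using var conn = new NpgsqlConnection(connectionString);
        conn.Open();
        using var cmd = new NpgsqlCommand(CreateReceiptsTable, conn);
        cmd.ExecuteNonQuery();
    }
}
```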
**Week 3: reachability proofs**

* Add call-stack slice capture in the reachability engine; serialize compactly; hash + reference from receipts.
* Binary symbol proof module for ELF/PE: symbol bitmap + digest.

**Week 4: replay + UX**

* Emit `replay.manifest.json` per scan (inputs, tool digests).
* UI: show a **“Verified”** badge, the graph hash, the signature issuer, and a one-click “Copy receipt” button.
* API: `GET /verdicts/{id}/receipt`, `GET /graphs/{rev}/replay` (a minimal endpoint sketch follows).
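A minimal-API sketch of those two read-only endpoints; `IReceiptStore` and `IReplayStore` are hypothetical abstractions standing in for the real persistence layer, and the handlers simply return stored JSON or 404:

```csharp
// Hypothetical store abstractions; handlers return stored JSON verbatim.
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

public interface IReceiptStore { string? FindReceiptJson(string verdictId); }        // serialized DSSE envelope
public interface IReplayStore  { string? FindManifestJson(string graphRevisionId); } // replay.manifest.json

public static class VerdictApi
{
    public static void Map(WebApplication app, IReceiptStore receipts, IReplayStore replays)
    {
        // GET /verdicts/{id}/receipt -> DSSE receipt for the verdict, or 404.
        app.MapGet("/verdicts/{id}/receipt", (string id) =>
            receipts.FindReceiptJson(id) is { } receipt
                ? (IResult)Results.Text(receipt, "application/json")
                : Results.NotFound());

        // GET /graphs/{rev}/replay -> replay manifest for the graph revision, or 404.
        app.MapGet("/graphs/{rev}/replay", (string rev) =>
            replays.FindManifestJson(rev) is { } manifest
                ? (IResult)Results.Text(manifest, "application/json")
                : Results.NotFound());
    }
}
```

Wiring is the usual `WebApplication.CreateBuilder(args).Build()` plus a call to `VerdictApi.Map(...)` before `app.Run()`.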
**Week 5: benchmarks harness**

* Create `bench/` with golden fixtures and a runner:
  * Baseline scanner adapters
  * Ground-truth labels
  * Metrics export (FP%, proof coverage, triage-time capture hooks)

---

# Developer guardrails (make these non-negotiable)

* **No receipt, no ship:** any surfaced verdict must carry a DSSE receipt.
* **Schema freeze windows:** changes to rule inputs or policy logic must bump the rule version and therefore the graph hash.
* **Replay-first CI:** PRs touching scanning/rules must pass a replay test that reproduces prior graph hashes on golden fixtures.
* **Clock safety:** use monotonic time inside receipts; record UTC wall time separately (see the sketch below).
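One reading of the clock-safety rule, sketched in C#: durations are measured with the monotonic `Stopwatch` clock (immune to wall-clock jumps), while the human-readable UTC wall time is recorded as a separate field. The record and method names are illustrative:

```csharp
// Illustrative names; only the monotonic-vs-wall-time split is the point.
using System;
using System.Diagnostics;

public sealed record ReceiptTiming(string WallTimeUtc, double EvaluationMillis);

public static class ReceiptClock
{
    public static ReceiptTiming Measure(Action evaluateVerdict)
    {
        var wall = DateTimeOffset.UtcNow;        // UTC wall time, recorded separately for audit
        long start = Stopwatch.GetTimestamp();   // monotonic ticks, safe against clock adjustments
        evaluateVerdict();
        long elapsed = Stopwatch.GetTimestamp() - start;
        double millis = elapsed * 1000.0 / Stopwatch.Frequency;
        return new ReceiptTiming(wall.ToString("O"), millis);
    }
}
```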
---

# What to show buyers/auditors

* A short **audit kit**: sample container + your receipts + replay manifest + one command to reproduce the same graph hash.
* A one-page **benchmark readout**: FP reduction, proof coverage, and triage time saved (p50/p95), with corpus description.

---

If you want, I’ll draft:

1. the DSSE `predicate` schema,
2. the Postgres DDL for `Receipts` and `Graphs`, and
3. a tiny .NET verification CLI (`stellaops-verify`) that replays a manifest and validates signatures.

---

Here’s a focused “developer guidelines” doc just for **Benchmarks for a Testable Security Moat** in Stella Ops.

---
# Stella Ops Developer Guidelines

## Benchmarks for a Testable Security Moat

> **Goal:** Benchmarks are how we *prove* Stella Ops is better, not just say it is. If a “moat” claim can’t be tied to a benchmark, it doesn’t exist.

Everything here is about how you, as a developer, design, extend, and run those benchmarks.

---

## 1. What our benchmarks must measure

Every core product claim needs at least one benchmark:
1. **Detection quality**

   * Precision / recall vs ground truth.
   * False positives vs popular scanners.
   * False negatives on known-bad samples.

2. **Proof & evidence quality**

   * % of findings with **valid receipts** (DSSE).
   * % of VEX “not-affected” with attached proofs.
   * Reachability proof quality:
     * call-stack slice present?
     * symbol proof present for binaries?

3. **Triage & workflow impact**

   * Time-to-decision for analysts (p50/p95).
   * Click depth and context switches per decision.
   * “Verified” vs “unverified” verdict triage times.

4. **Determinism & reproducibility**

   * Same inputs → same **Graph Revision ID**.
   * Stable verdict sets across runs/nodes.

> **Rule:** If you add a feature that impacts any of these, you must either hook it into an existing benchmark or add a new one.
---

## 2. Benchmark assets and layout

**2.1 Repo layout (convention)**

Under `bench/` we maintain everything benchmark-related:
* `bench/corpus/`
  * `images/` – curated container images / tarballs.
  * `repos/` – sample codebases (with known vulns).
  * `sboms/` – canned SBOMs for edge cases.
* `bench/scenarios/`
  * `*.yaml` – scenario definitions (inputs + expected outputs).
* `bench/golden/`
  * `*.json` – golden results (expected findings, metrics).
* `bench/tools/`
  * adapters for baseline scanners, parsers, helpers.
* `bench/scripts/`
  * `run_benchmarks.[sh/cs]` – single entrypoint.
**2.2 Scenario definition (high-level)**

Each scenario YAML should minimally specify (see the record sketch after this list):

* **Inputs**
  * artifact references (image name / path / repo SHA / SBOM file).
  * environment knobs (features enabled/disabled).
* **Ground truth**
  * list of expected vulns (or an explicit “none”).
  * for some: expected reachability (reachable/unreachable).
  * expected VEX entries (affected / not affected).
* **Expectations**
  * required metrics (e.g., “no more than 2 FPs”, “no FNs”).
  * required proof coverage (e.g., “100% of surfaced findings have receipts”).
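For reference, the shape the runner might deserialize a scenario into, written as C# records; the field names mirror the bullets above and are assumptions rather than the final schema (note `Strict`, which ties into the CI gates in section 4.2):

```csharp
// Illustrative scenario schema; not the shipped format.
using System.Collections.Generic;

public sealed record ScenarioInputs(
    string? Image, string? RepoSha, string? SbomFile,
    Dictionary<string, bool>? Features);                 // environment knobs

public sealed record ExpectedVuln(
    string Id,                                           // e.g. a CVE ID
    string? Reachability,                                // "reachable" / "unreachable" / null
    string? VexStatus);                                  // "affected" / "not_affected" / null

public sealed record ScenarioExpectations(
    int MaxFalsePositives,
    int MaxFalseNegatives,
    double MinProofCoverage);                            // 1.0 == 100% of surfaced findings

public sealed record BenchScenario(
    string Name,
    bool Strict,                                         // strict scenarios block CI on any miss
    ScenarioInputs Inputs,
    IReadOnlyList<ExpectedVuln> GroundTruth,
    ScenarioExpectations Expectations);
```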
---

## 3. Core benchmark metrics (developer-facing definitions)

Use these consistently across code and docs.

### 3.1 Detection metrics

* `true_positive_count` (TP)
* `false_positive_count` (FP)
* `false_negative_count` (FN)
Derived (computed in the sketch below):

* `precision = TP / (TP + FP)`
* `recall = TP / (TP + FN)`
* For UX: track **FP per asset** and **FP per 100 findings**.
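The derived metrics above, transcribed directly into code; the only addition is a guard for empty denominators, and the convention chosen there (1.0 when nothing was predicted or expected) is an assumption:

```csharp
// Direct transcription of the derived metrics; /0 convention is an assumption.
public sealed record DetectionCounts(int TruePositives, int FalsePositives, int FalseNegatives)
{
    public double Precision => TruePositives + FalsePositives == 0
        ? 1.0 : (double)TruePositives / (TruePositives + FalsePositives);

    public double Recall => TruePositives + FalseNegatives == 0
        ? 1.0 : (double)TruePositives / (TruePositives + FalseNegatives);

    // UX-oriented view: FPs per 100 surfaced findings (TP + FP).
    public double FalsePositivesPer100Findings => TruePositives + FalsePositives == 0
        ? 0.0 : 100.0 * FalsePositives / (TruePositives + FalsePositives);
}
```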
**Developer guideline:**

* When you introduce a filter, deduper, or rule tweak, add or modify scenarios so that:
  * at least one scenario shows the change **helps** (reduces FP or FN); and
  * another scenario guards against regressions.
### 3.2 Moat-specific metrics

These are the ones that directly support the “testable moat” story (a computation sketch follows the list):

1. **False-positive reduction vs baseline scanners**

   * Run baseline scanners across our corpus (via adapters in `bench/tools`).
   * Compute:
     * `baseline_fp_rate`
     * `stella_fp_rate`
     * `fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate`.

2. **Proof coverage**

   * `proof_coverage_all = findings_with_valid_receipts / total_findings`
   * `proof_coverage_vex = vex_items_with_valid_receipts / total_vex_items`
   * `proof_coverage_reachable = reachable_findings_with_proofs / total_reachable_findings`

3. **Triage time improvement**

   * In test harnesses, simulate or record:
     * `time_to_triage_with_receipts`
     * `time_to_triage_without_receipts`
   * Compute median & p95 deltas.

4. **Determinism**

   * Re-run the same scenario `N` times:
     * `% runs with identical Graph Revision ID`
     * `% runs with identical verdict sets`
   * On mismatch, diff and log the cause (e.g., non-stable sort, non-pinned feed).
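A small sketch of how the harness could compute these three families of metrics from plain counts and per-run Graph Revision IDs; the names are illustrative:

```csharp
// Illustrative metric helpers for the harness.
using System.Collections.Generic;
using System.Linq;

public static class MoatMetrics
{
    // fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate
    public static double FpReduction(double baselineFpRate, double stellaFpRate) =>
        baselineFpRate == 0 ? 0 : (baselineFpRate - stellaFpRate) / baselineFpRate;

    // proof_coverage_* = items_with_valid_receipts / total_items
    public static double ProofCoverage(int itemsWithValidReceipts, int totalItems) =>
        totalItems == 0 ? 1.0 : (double)itemsWithValidReceipts / totalItems;

    // Fraction of runs whose Graph Revision ID matches the first run's ID.
    public static double DeterminismRate(IReadOnlyList<string> graphRevisionIdsPerRun)
    {
        if (graphRevisionIdsPerRun.Count == 0) return 1.0;
        var reference = graphRevisionIdsPerRun[0];
        return (double)graphRevisionIdsPerRun.Count(id => id == reference)
               / graphRevisionIdsPerRun.Count;
    }
}
```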
---

## 4. How developers should work with benchmarks

### 4.1 “No feature without benchmarks”

If you’re adding or changing:

* graph structure,
* rule logic,
* scanner integration,
* VEX handling,
* proof / receipt generation,

you **must** do *at least one* of:

1. **Extend an existing scenario**

   * Add expectations that cover your change, or
   * tighten an existing bound (e.g., lower FP threshold).

2. **Add a new scenario**

   * For new attack classes / edge cases / ecosystems.

**Anti-patterns:**

* Shipping a new capability with *no* corresponding scenario.
* Updating golden outputs without explaining why metrics changed.
### 4.2 CI gates

We treat benchmarks as **blocking**:

* Add a CI job, e.g.:
  * `make bench:quick` on every PR (small subset).
  * `make bench:full` on main / nightly.
* CI fails if:
  * Any scenario marked `strict: true` has:
    * precision or recall below its threshold, or
    * proof coverage below its configured threshold.
  * Global regressions exceed tolerance:
    * e.g., total FP count increases by more than X% without an explicit override.

**Developer rule:**

* If you intentionally change behavior:
  * Update the relevant golden files.
  * Include a short note in the PR (e.g., a `bench-notes.md` snippet) describing:
    * what changed,
    * why the new result is better, and
    * which moat metric it improves (FP, proof coverage, determinism, etc.).
---

## 5. Benchmark implementation guidelines

### 5.1 Make benchmarks deterministic

* **Pin everything**:
  * feed snapshots,
  * tool container digests,
  * rule versions,
  * time windows.
* Use **Replay Manifests** as the source of truth:
  * `replay.manifest.json` should contain:
    * input artifacts,
    * tool versions,
    * feed versions,
    * configuration flags.
* If a benchmark depends on time:
  * Inject a **fake clock** or an explicit “as of” timestamp (see the sketch after this list).
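A sketch of the two pieces working together: a manifest record that mirrors the bullets above, and a fake clock the benchmark injects instead of the system clock so repeated runs see the same “as of” time. `IClock` and the field names are assumptions:

```csharp
// Illustrative manifest shape and clock abstraction; names are assumptions.
using System;
using System.Collections.Generic;

public sealed record ReplayManifest(
    IReadOnlyList<string> InputArtifacts,                 // image/SBOM digests, repo SHAs
    IReadOnlyDictionary<string, string> ToolVersions,
    IReadOnlyDictionary<string, string> FeedVersions,
    IReadOnlyDictionary<string, string> ConfigurationFlags,
    string AsOfUtc);                                      // explicit "as of" timestamp (ISO 8601)

public interface IClock { DateTimeOffset UtcNow { get; } }

// Benchmarks inject this instead of reading the system clock, so two runs of
// the same manifest evaluate time-dependent logic identically.
public sealed class FixedClock : IClock
{
    private readonly DateTimeOffset _asOf;
    public FixedClock(ReplayManifest manifest) => _asOf = DateTimeOffset.Parse(manifest.AsOfUtc);
    public DateTimeOffset UtcNow => _asOf;
}
```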
### 5.2 Keep scenarios small but meaningful

* Prefer many **focused** scenarios over a few huge ones.
* Each scenario should clearly answer:
  * “What property of Stella Ops are we testing?”
  * “What moat claim does this support?”

Examples:

* `bench/scenarios/false_pos_kubernetes.yaml`
  * Focus: config noise reduction vs baseline scanner.
* `bench/scenarios/reachability_java_webapp.yaml`
  * Focus: reachable vs unreachable vuln proofs.
* `bench/scenarios/vex_not_affected_openssl.yaml`
  * Focus: VEX correctness and proof coverage.
### 5.3 Use golden outputs, not ad-hoc assertions

* The bench harness should:
  * run Stella Ops on the scenario inputs,
  * normalize outputs (sorted lists, stable IDs), and
  * compare them to `bench/golden/<scenario>.json` (a minimal comparison sketch follows).
* Golden files should include:
  * expected findings (id, severity, reachable?, etc.),
  * expected VEX entries,
  * expected metrics (precision, recall, coverage).
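A minimal normalize-and-compare step, assuming findings can be reduced to stable string keys (e.g., `CVE-2024-1234@pkg:npm/lodash`); a real harness would diff the full structured golden JSON rather than flat keys:

```csharp
// Sketch: compares de-duplicated, sorted finding keys against a golden JSON array.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

public static class GoldenCheck
{
    public static bool Matches(IEnumerable<string> actualFindingKeys, string goldenPath, out string diff)
    {
        // Normalize: stable, sorted, de-duplicated keys.
        var actual = actualFindingKeys.Distinct().OrderBy(k => k, StringComparer.Ordinal).ToList();
        var golden = JsonSerializer.Deserialize<List<string>>(File.ReadAllText(goldenPath)) ?? new();
        golden = golden.Distinct().OrderBy(k => k, StringComparer.Ordinal).ToList();

        var missing = golden.Except(actual).ToList();   // expected but not produced
        var extra   = actual.Except(golden).ToList();   // produced but not expected

        diff = $"missing: [{string.Join(", ", missing)}] extra: [{string.Join(", ", extra)}]";
        return missing.Count == 0 && extra.Count == 0;
    }
}
```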
---

## 6. Moat-critical benchmark types (we must have all of these)

When you’re thinking about gaps, check that we have:

1. **Cross-tool comparison**

   * Same corpus, multiple scanners.
   * Metrics vs baselines for FP/FN.

2. **Proof density & quality**

   * A corpus where:
     * some vulns are reachable,
     * some are not, and
     * some are not present at all.
   * Ensure:
     * reachable ones have rich proofs (stack slices / symbol proofs).
     * non-reachable or absent ones have:
       * the correct disposition, and
       * clear receipts explaining why.

3. **VEX accuracy**

   * Scenarios with a known SBOM + known vulnerability impact.
   * Check that:
     * VEX “affected”/“not-affected” matches ground truth.
     * every VEX entry has a receipt.

4. **Analyst workflow**

   * A small usability corpus for internal testing:
     * Measure time-to-triage with/without receipts.
     * Use the same scenarios across releases to track improvement.

5. **Upgrade / drift resistance**

   * Scenarios that are **expected to remain stable** across:
     * rule changes that *shouldn’t* affect outcomes, and
     * feed updates (within a given version window).
   * These act as canaries for unintended regressions.
---

## 7. Developer checklist (TL;DR)

Before merging a change that touches security logic, ask yourself:

1. **Is there at least one benchmark scenario that exercises this change?**
2. **Does the change improve at least one moat metric, or is it neutral?**
3. **Have I run `make bench:quick` locally and checked diffs?**
4. **If goldens changed, did I explain why in the PR?**
5. **Did I keep benchmarks deterministic (pinned versions, fake time, etc.)?**

If any answer is “no”, fix that before merging.

---
If you’d like, as a next step I can sketch a concrete `bench/scenarios/*.yaml` and matching `bench/golden/*.json` example that encodes one *specific* moat claim (e.g., “30% fewer FPs than Scanner X on Kubernetes configs”) so your team has a ready-to-copy pattern.