add advisories

This commit is contained in:
master
2025-12-01 17:50:11 +02:00
parent c11d87d252
commit 790801f329
7 changed files with 3723 additions and 0 deletions

Here's a crisp, practical way to turn StellaOps' “verifiable proof spine” into a moat, and how to measure it.
# Why this matters (in plain terms)
Security tools often say “trust me.” You'll say “prove it”: every finding and every “not-affected” claim ships with cryptographic receipts anyone can verify.
---
# Differentiators to build in
**1) Bind every verdict to a graph hash**
* Compute a stable **Graph Revision ID** (Merkle root) over: SBOM nodes, edges, policies, feeds, scan params, and tool versions.
* Store the ID on each finding/VEX item; show it in the UI and APIs.
* Rule: any data change → new graph hash → new revisioned verdicts.
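A minimal C# sketch of the Merkle construction, assuming the SBOM nodes, edges, policies, feed snapshots, scan params, and tool versions have already been normalized into canonical JSON leaves upstream (class and method names here are illustrative, not the shipped `scanner.webservice` API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class GraphRevision
{
    // Each leaf is one canonical (normalized, stably ordered) JSON document:
    // SBOM nodes, edges, policies, feed snapshots, scan params, tool versions.
    public static string Compute(IEnumerable<string> canonicalLeaves)
    {
        var layer = canonicalLeaves
            .OrderBy(l => l, StringComparer.Ordinal)   // stable order => stable root
            .Select(l => SHA256.HashData(Encoding.UTF8.GetBytes(l)))
            .ToList();

        if (layer.Count == 0)
            return "sha256:" + Convert.ToHexString(SHA256.HashData(Array.Empty<byte>())).ToLowerInvariant();

        while (layer.Count > 1)                        // fold pairwise up to the Merkle root
        {
            var next = new List<byte[]>();
            for (int i = 0; i < layer.Count; i += 2)
            {
                var right = i + 1 < layer.Count ? layer[i + 1] : layer[i]; // duplicate last node on odd layers
                next.Add(SHA256.HashData(layer[i].Concat(right).ToArray()));
            }
            layer = next;
        }
        return "sha256:" + Convert.ToHexString(layer[0]).ToLowerInvariant();
    }
}
```

Any change to a leaf (a new feed snapshot, a bumped rule version) changes the root, which is exactly the “any data change → new graph hash → new revisioned verdicts” rule above.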
**2) Attach machine-verifiable receipts (in-toto/DSSE)**
* For each verdict, emit a **DSSE-wrapped in-toto statement**:
* predicateType: `stellaops.dev/verdict@v1`
* includes: graphRevisionId, artifact digests, rule id/version, inputs (CPE/CVE/CVSS), timestamps.
* Sign with your **Authority** (Sigstore key, offline mode supported).
* Keep receipts queryable and exportable; mirror to a Rekor-compatible ledger when online.
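A hedged C# sketch of the wrapping itself, following the public in-toto Statement v1 layout and DSSE pre-authentication encoding; the predicate field values, subject, and key handling are placeholders rather than the real Authority integration:

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

var statement = new
{
    _type = "https://in-toto.io/Statement/v1",
    subject = new[] { new { name = "registry.example/app", digest = new { sha256 = "<image digest>" } } },
    predicateType = "stellaops.dev/verdict@v1",
    predicate = new
    {
        graphRevisionId = "<graph revision id>",
        ruleId = "<rule id>", ruleVersion = "<rule version>",
        inputs = new { cpe = "<cpe>", cve = "<cve>", cvss = 0.0 },
        issuedAt = DateTimeOffset.UtcNow              // UTC wall-clock time; monotonic time tracked separately
    }
};

byte[] payload = JsonSerializer.SerializeToUtf8Bytes(statement);
const string payloadType = "application/vnd.in-toto+json";

// DSSE pre-authentication encoding: "DSSEv1 <len(type)> <type> <len(payload)> <payload>"
byte[] pae = Encoding.UTF8
    .GetBytes($"DSSEv1 {Encoding.UTF8.GetByteCount(payloadType)} {payloadType} {payload.Length} ")
    .Concat(payload)
    .ToArray();

using var key = ECDsa.Create(ECCurve.NamedCurves.nistP256);   // stand-in for the Authority signing key
byte[] signature = key.SignData(pae, HashAlgorithmName.SHA256);

var envelope = new
{
    payloadType,
    payload = Convert.ToBase64String(payload),
    signatures = new[] { new { keyid = "authority-key-1", sig = Convert.ToBase64String(signature) } }
};
Console.WriteLine(JsonSerializer.Serialize(envelope));
```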
**3) Add reachability “call-stack slices” or binary-symbol proofs**
* For code-level reachability, store compact slices: entry → sink, with symbol names + file:line.
* For binary-only targets, include **symbol presence proofs** (e.g., Bloom filters + offsets) with executable digest.
* Compress and embed a hash of the slice/proof inside the DSSE payload.
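One possible shape for the slice, sketched as a C# record; the layout is illustrative, and the point is that a stable digest of the slice is what the DSSE payload references while the slice itself is stored compressed:

```csharp
using System;
using System.Security.Cryptography;
using System.Text.Json;

public sealed record Frame(string Symbol, string File, int Line);

public sealed record CallStackSlice(string EntryPoint, string Sink, Frame[] Frames)
{
    // Canonical serialization + SHA-256; the DSSE predicate carries only this digest,
    // e.g. a hypothetical "reachabilityProof": "sha256:<digest>" field.
    public string Digest()
    {
        byte[] canonical = JsonSerializer.SerializeToUtf8Bytes(this);
        return "sha256:" + Convert.ToHexString(SHA256.HashData(canonical)).ToLowerInvariant();
    }
}
```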
**4) Deterministic replay manifests**
* Alongside receipts, publish a **Replay Manifest** (inputs, feeds, rule versions, container digests) so any auditor can reproduce the same graph hash and verdicts offline.
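A sketch of the manifest's shape as a C# record; the field names are assumptions, the invariant is that every input feeding the graph hash is pinned by digest, snapshot id, or version:

```csharp
using System.Collections.Generic;

public sealed record ReplayManifest(
    string GraphRevisionId,
    IReadOnlyList<string> ArtifactDigests,              // container images / SBOM files, by digest
    IReadOnlyDictionary<string, string> ToolVersions,   // tool name -> container digest
    IReadOnlyDictionary<string, string> FeedSnapshots,  // feed name -> snapshot id
    IReadOnlyDictionary<string, string> Config,         // feature flags, scan params
    string RuleBundleVersion,
    string AsOf);                                        // explicit "as of" timestamp, not run-time wall clock
```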
---
# Benchmarks to publish (make them your headline KPIs)
**A) False-positive reduction vs. baseline scanners (%)**
* Method: run a public corpus (e.g., sample images + app stacks) across 3-4 popular scanners; label ground truth once; compare FP rate.
* Report: mean & p95 FP reduction.
**B) Proof coverage (% of findings with signed evidence)**
* Definition: `(# findings or VEX items carrying valid DSSE receipts) / (total surfaced items)`.
* Break out: runtime-reachable vs. unreachable, and “not-affected” claims.
**C) Triage time saved (p50/p95)**
* Measure analyst minutes from “alert created” → “final disposition.”
* A/B with receipts hidden vs. visible; publish median/p95 deltas.
**D) Determinism stability**
* Re-run identical scans N times / across nodes; publish `% identical graph hashes` and drift causes when different.
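A small harness-side sketch of KPI D, assuming a hypothetical `runScanAsync` callback that performs one scan and returns its Graph Revision ID:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DeterminismCheck
{
    public static async Task<double> PercentIdenticalAsync(Func<Task<string>> runScanAsync, int runs = 10)
    {
        var hashes = new List<string>();
        for (int i = 0; i < runs; i++)
            hashes.Add(await runScanAsync());

        // Share of runs matching the most common hash; anything else is drift to diff and explain.
        int modal = hashes.GroupBy(h => h).Max(g => g.Count());
        return 100.0 * modal / runs;
    }
}
```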
---
# Minimal implementation plan (week-by-week)
**Week 1: primitives**
* Add Graph Revision ID generator in `scanner.webservice` (Merkle over normalized JSON of SBOM+edges+policies+toolVersions).
* Define `VerdictReceipt` schema (protobuf/JSON) and DSSE envelope types.
**Week 2: signing + storage**
* Wire DSSE signing in **Authority**; offline key support + rotation.
* Persist receipts in `Receipts` table (Postgres) keyed by `(graphRevisionId, verdictId)`; enable export (JSONL) and ledger mirror.
**Week 3: reachability proofs**
* Add call-stack slice capture in reachability engine; serialize compactly; hash + reference from receipts.
* Binary symbol proof module for ELF/PE: symbol bitmap + digest.
**Week 4: replay + UX**
* Emit `replay.manifest.json` per scan (inputs, tool digests).
* UI: show **“Verified”** badge, graph hash, signature issuer, and a one-click “Copy receipt” button.
* API: `GET /verdicts/{id}/receipt`, `GET /graphs/{rev}/replay`.
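A minimal ASP.NET Core sketch of those two read-only endpoints; the in-memory dictionaries stand in for the real `Receipts` table and manifest store:

```csharp
using System.Collections.Generic;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// verdictId -> stored DSSE envelope JSON; graphRevisionId -> replay manifest JSON.
var receipts = new Dictionary<string, string>();
var replays = new Dictionary<string, string>();

app.MapGet("/verdicts/{id}/receipt", (string id) =>
    receipts.TryGetValue(id, out var envelope)
        ? Results.Content(envelope, "application/json")
        : Results.NotFound());

app.MapGet("/graphs/{rev}/replay", (string rev) =>
    replays.TryGetValue(rev, out var manifest)
        ? Results.Content(manifest, "application/json")
        : Results.NotFound());

app.Run();
```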
**Week 5: benchmarks harness**
* Create `bench/` with golden fixtures and a runner:
* Baseline scanner adapters
* Ground-truth labels
* Metrics export (FP%, proof coverage, triage time capture hooks)
---
# Developer guardrails (make these non-negotiable)
* **No receipt, no ship:** any surfaced verdict must carry a DSSE receipt.
* **Schema freeze windows:** changes to rule inputs or policy logic must bump rule version and therefore the graph hash.
* **Replay-first CI:** PRs touching scanning/rules must pass a replay test that reproduces prior graph hashes on gold fixtures.
* **Clock safety:** use monotonic time inside receipts; add UTC wall-clock time separately.
---
# What to show buyers/auditors
* A short **audit kit**: sample container + your receipts + replay manifest + one command to reproduce the same graph hash.
* A one-page **benchmark readout**: FP reduction, proof coverage, and triage time saved (p50/p95), with a corpus description.
---
If you want, I'll draft:
1. the DSSE `predicate` schema,
2. the Postgres DDL for `Receipts` and `Graphs`, and
3. a tiny .NET verification CLI (`stellaops-verify`) that replays a manifest and validates signatures.
---
Here's a focused “developer guidelines” doc just for **Benchmarks for a Testable Security Moat** in StellaOps.
---
# Stella Ops Developer Guidelines
## Benchmarks for a Testable Security Moat
> **Goal:** Benchmarks are how we *prove* StellaOps is better, not just say it is. If a “moat” claim can't be tied to a benchmark, it doesn't exist.
Everything here is about how you, as a developer, design, extend, and run those benchmarks.
---
## 1. What our benchmarks must measure
Every core product claim needs at least one benchmark:
1. **Detection quality**
* Precision / recall vs ground truth.
* False positives vs popular scanners.
* False negatives on known-bad samples.
2. **Proof & evidence quality**
* % of findings with **valid receipts** (DSSE).
* % of VEX “not-affected” with attached proofs.
* Reachability proof quality:
* call-stack slice present?
* symbol proof present for binaries?
3. **Triage & workflow impact**
* Time-to-decision for analysts (p50/p95).
* Click depth and context switches per decision.
* “Verified” vs “unverified” verdict triage times.
4. **Determinism & reproducibility**
* Same inputs → same **Graph Revision ID**.
* Stable verdict sets across runs/nodes.
> **Rule:** If you add a feature that impacts any of these, you must either hook it into an existing benchmark or add a new one.
---
## 2. Benchmark assets and layout
**2.1 Repo layout (convention)**
Under `bench/` we maintain everything benchmark-related:
* `bench/corpus/`
  * `images/` - curated container images / tarballs.
  * `repos/` - sample codebases (with known vulns).
  * `sboms/` - canned SBOMs for edge cases.
* `bench/scenarios/`
  * `*.yaml` - scenario definitions (inputs + expected outputs).
* `bench/golden/`
  * `*.json` - golden results (expected findings, metrics).
* `bench/tools/`
  * adapters for baseline scanners, parsers, helpers.
* `bench/scripts/`
  * `run_benchmarks.[sh/cs]` - single entrypoint.
**2.2 Scenario definition (high-level)**
Each scenario yaml should minimally specify:
* **Inputs**
* artifact references (image name / path / repo SHA / SBOM file).
* environment knobs (features enabled/disabled).
* **Ground truth**
* list of expected vulns (or explicit “none”).
* for some: expected reachability (reachable/unreachable).
* expected VEX entries (affected / not affected).
* **Expectations**
* required metrics (e.g., “no more than 2 FPs”, “no FNs”).
* required proof coverage (e.g., “100% of surfaced findings have receipts”).
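For reference, a typed model the bench runner could bind a scenario file into; property names mirror the list above and are a suggestion, not a frozen schema:

```csharp
using System.Collections.Generic;

public sealed record Scenario(ScenarioInputs Inputs, GroundTruth GroundTruth, Expectations Expectations);

public sealed record ScenarioInputs(
    string? Image, string? RepoSha, string? SbomFile,
    IReadOnlyDictionary<string, bool> Features);           // environment knobs on/off

public sealed record GroundTruth(
    IReadOnlyList<ExpectedVuln> ExpectedVulns,             // empty list == explicit "none"
    IReadOnlyList<ExpectedVex> ExpectedVex);

public sealed record ExpectedVuln(string Id, bool? Reachable);   // reachability expectation is optional
public sealed record ExpectedVex(string Id, string Status);      // "affected" / "not_affected"

public sealed record Expectations(
    int MaxFalsePositives,
    int MaxFalseNegatives,
    double MinProofCoverage);                              // 1.0 == every surfaced finding has a receipt
```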
---
## 3. Core benchmark metrics (developer-facing definitions)
Use these consistently across code and docs.
### 3.1 Detection metrics
* `true_positive_count` (TP)
* `false_positive_count` (FP)
* `false_negative_count` (FN)
Derived:
* `precision = TP / (TP + FP)`
* `recall = TP / (TP + FN)`
* For UX: track **FP per asset** and **FP per 100 findings**.
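The same definitions as a tiny C# helper the runner can reuse:

```csharp
public readonly record struct DetectionMetrics(int Tp, int Fp, int Fn)
{
    public double Precision        => Tp + Fp == 0 ? 0 : (double)Tp / (Tp + Fp);
    public double Recall           => Tp + Fn == 0 ? 0 : (double)Tp / (Tp + Fn);
    public double FpPer100Findings => Tp + Fp == 0 ? 0 : 100.0 * Fp / (Tp + Fp);   // FP per asset also needs the asset count
}
```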
**Developer guideline:**
* When you introduce a filter, deduper, or rule tweak, add/modify a scenario where:
* the change **helps** (reduces FP or FN); and
* a different scenario guards against regressions.
### 3.2 Moat-specific metrics
These are the ones that directly support the “testable moat” story:
1. **False-positive reduction vs baseline scanners**
* Run baseline scanners across our corpus (via adapters in `bench/tools`).
* Compute:
* `baseline_fp_rate`
* `stella_fp_rate`
* `fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate`.
2. **Proof coverage**
* `proof_coverage_all = findings_with_valid_receipts / total_findings`
* `proof_coverage_vex = vex_items_with_valid_receipts / total_vex_items`
* `proof_coverage_reachable = reachable_findings_with_proofs / total_reachable_findings`
3. **Triage time improvement**
* In test harnesses, simulate or record:
* `time_to_triage_with_receipts`
* `time_to_triage_without_receipts`
* Compute median & p95 deltas.
4. **Determinism**
* Re-run the same scenario `N` times:
* `% runs with identical Graph Revision ID`
* `% runs with identical verdict sets`
* On mismatch, diff and log the cause (e.g., a non-stable sort, a non-pinned feed).
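A sketch of the first two metric families as plain helpers (determinism is the re-run comparison described above):

```csharp
public static class MoatMetrics
{
    // fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate
    public static double FpReduction(double baselineFpRate, double stellaFpRate) =>
        baselineFpRate == 0 ? 0 : (baselineFpRate - stellaFpRate) / baselineFpRate;

    // proof_coverage_* = items with valid DSSE receipts / total items (all, VEX-only, reachable-only)
    public static double ProofCoverage(int itemsWithValidReceipts, int totalItems) =>
        totalItems == 0 ? 1.0 : (double)itemsWithValidReceipts / totalItems;
}
```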
---
## 4. How developers should work with benchmarks
### 4.1 “No feature without benchmarks”
If you're adding or changing:
* graph structure,
* rule logic,
* scanner integration,
* VEX handling,
* proof / receipt generation,
you **must** do *at least one* of:
1. **Extend an existing scenario**
* Add expectations that cover your change, or
* tighten an existing bound (e.g., lower FP threshold).
2. **Add a new scenario**
* For new attack classes / edge cases / ecosystems.
**Anti-patterns:**
* Shipping a new capability with *no* corresponding scenario.
* Updating golden outputs without explaining why metrics changed.
### 4.2 CI gates
We treat benchmarks as **blocking**:
* Add a CI job, e.g.:
* `make bench:quick` on every PR (small subset).
* `make bench:full` on main / nightly.
* CI fails if:
* Any scenario marked `strict: true` has:
* Precision or recall below its threshold.
* Proof coverage below its configured threshold.
* Global regressions above tolerance:
* e.g. total FP increases > X% without an explicit override.
**Developer rule:**
* If you intentionally change behavior:
* Update the relevant golden files.
* Include a short note in the PR (e.g., `bench-notes.md` snippet) describing:
* what changed,
* why the new result is better, and
* which moat metric it improves (FP, proof coverage, determinism, etc.).
---
## 5. Benchmark implementation guidelines
### 5.1 Make benchmarks deterministic
* **Pin everything**:
* feed snapshots,
* tool container digests,
* rule versions,
* time windows.
* Use **Replay Manifests** as the source of truth:
* `replay.manifest.json` should contain:
* input artifacts,
* tool versions,
* feed versions,
* configuration flags.
* If a benchmark depends on time:
* Inject a **fake clock** or explicit “as of” timestamp.
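A dependency-free sketch of clock injection (.NET's built-in `TimeProvider` abstraction serves the same purpose if you prefer it):

```csharp
using System;

public interface IClock { DateTimeOffset UtcNow { get; } }

public sealed class SystemClock : IClock
{
    public DateTimeOffset UtcNow => DateTimeOffset.UtcNow;
}

// In benchmarks, pin "now" to the scenario's explicit "as of" timestamp.
public sealed class FixedClock : IClock
{
    private readonly DateTimeOffset _asOf;
    public FixedClock(DateTimeOffset asOf) => _asOf = asOf;
    public DateTimeOffset UtcNow => _asOf;
}
```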
### 5.2 Keep scenarios small but meaningful
* Prefer many **focused** scenarios over a few huge ones.
* Each scenario should clearly answer:
* “What property of StellaOps are we testing?”
* “What moat claim does this support?”
Examples:
* `bench/scenarios/false_pos_kubernetes.yaml`
* Focus: config noise reduction vs baseline scanner.
* `bench/scenarios/reachability_java_webapp.yaml`
* Focus: reachable vs unreachable vuln proofs.
* `bench/scenarios/vex_not_affected_openssl.yaml`
* Focus: VEX correctness and proof coverage.
### 5.3 Use golden outputs, not ad-hoc assertions
* Bench harness should:
* Run StellaOps on scenario inputs.
* Normalize outputs (sorted lists, stable IDs).
* Compare to `bench/golden/<scenario>.json`.
* Golden file should include:
* expected findings (id, severity, reachable?, etc.),
* expected VEX entries,
* expected metrics (precision, recall, coverage).
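A sketch of that normalize-and-compare step; the `Finding` record and file handling are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

public sealed record Finding(string Id, string Severity, bool Reachable);

public static class GoldenCheck
{
    private static readonly JsonSerializerOptions Options = new() { WriteIndented = true };

    // Sort by stable id, serialize the same way the golden file was written, compare verbatim.
    public static bool Matches(IEnumerable<Finding> actual, string goldenPath, out string normalizedActual)
    {
        normalizedActual = JsonSerializer.Serialize(
            actual.OrderBy(f => f.Id, StringComparer.Ordinal).ToList(), Options);
        return normalizedActual == File.ReadAllText(goldenPath);
    }
}
```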
---
## 6. Moat-critical benchmark types (we must have all of these)
When you're thinking about gaps, check that we have:
1. **Cross-tool comparison**
* Same corpus, multiple scanners.
* Metrics vs baselines for FP/FN.
2. **Proof density & quality**
* Corpus where:
* some vulns are reachable,
* some are not,
* some are not present.
* Ensure:
* reachable ones have rich proofs (stack slices / symbol proofs).
* non-reachable or absent ones have:
* correct disposition, and
* clear receipts explaining why.
3. **VEX accuracy**
* Scenarios with known SBOM + known vulnerability impact.
* Check:
* VEX “affected”/“not-affected” matches ground truth.
* every VEX entry has a receipt.
4. **Analyst workflow**
* Small usability corpus for internal testing:
* Measure time-to-triage with/without receipts.
* Use the same scenarios across releases to track improvement.
5. **Upgrade / drift resistance**
* Scenarios that are **expected to remain stable** across:
* rule changes that *shouldn't* affect outcomes.
* feed updates (within a given version window).
* These act as canaries for unintended regressions.
---
## 7. Developer checklist (TL;DR)
Before merging a change that touches security logic, ask yourself:
1. **Is there at least one benchmark scenario that exercises this change?**
2. **Does the change improve at least one moat metric, or is it neutral?**
3. **Have I run `make bench:quick` locally and checked diffs?**
4. **If goldens changed, did I explain why in the PR?**
5. **Did I keep benchmarks deterministic (pinned versions, fake time, etc.)?**
If any answer is “no”, fix that before merging.
---
If you'd like, as a next step I can sketch a concrete `bench/scenarios/*.yaml` and matching `bench/golden/*.json` example that encodes one *specific* moat claim (e.g., “30% fewer FPs than Scanner X on Kubernetes configs”) so your team has a ready-to-copy pattern.