Benchmarks for a Testable Security Moat (product advisory, 1 December 2025)

Here’s a crisp, practical way to turn StellaOps’ “verifiable proof spine” into a moat, and how to measure it.

Why this matters (in plain terms)

Security tools often say “trust me.” You’ll say “prove it”: every finding and every “not-affected” claim ships with cryptographic receipts anyone can verify.


Differentiators to build in

1) Bind every verdict to a graph hash

  • Compute a stable Graph Revision ID (Merkle root) over: SBOM nodes, edges, policies, feeds, scan params, and tool versions.
  • Store the ID on each finding/VEX item; show it in the UI and APIs.
  • Rule: any data change → new graph hash → new revisioned verdicts.
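
As a minimal sketch of one way to derive such an ID, here is a flattened Merkle-style root over pre-canonicalized inputs; the GraphInputs type and its field names are illustrative assumptions, not the actual scanner.webservice schema:

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Illustrative input bundle; the real schema lives in scanner.webservice.
public sealed record GraphInputs(
    string[] SbomNodeDigests,   // canonical digests of SBOM nodes
    string[] EdgeDigests,       // canonical digests of graph edges
    string[] PolicyDigests,     // digests of active policy documents
    string[] FeedSnapshotIds,   // pinned advisory-feed snapshot ids
    string ScanParamsJson,      // normalized JSON of scan parameters
    string ToolVersionsJson);   // normalized JSON of tool/container versions

public static class GraphRevision
{
    // Hash each leaf, sort the leaf hashes for stability, then hash the concatenation.
    public static string ComputeId(GraphInputs inputs)
    {
        var leaves = inputs.SbomNodeDigests
            .Concat(inputs.EdgeDigests)
            .Concat(inputs.PolicyDigests)
            .Concat(inputs.FeedSnapshotIds)
            .Append(inputs.ScanParamsJson)
            .Append(inputs.ToolVersionsJson)
            .Select(leaf => SHA256.HashData(Encoding.UTF8.GetBytes(leaf)))
            .OrderBy(h => Convert.ToHexString(h), StringComparer.Ordinal)
            .ToArray();

        var root = SHA256.HashData(leaves.SelectMany(h => h).ToArray());
        return "graph:sha256:" + Convert.ToHexString(root).ToLowerInvariant();
    }
}
```

A full Merkle tree with stored intermediate roots would additionally allow partial inclusion proofs (“this SBOM node is part of this revision”) at the cost of keeping the intermediate hashes.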

2) Attach machine-verifiable receipts (in-toto/DSSE)

  • For each verdict, emit a DSSE-wrapped in-toto statement:

    • predicateType: stellaops.dev/verdict@v1
    • includes: graphRevisionId, artifact digests, rule id/version, inputs (CPE/CVE/CVSS), timestamps.
  • Sign with your Authority (Sigstore key, offline mode supported).

  • Keep receipts queryable and exportable; mirror to a Rekor-compatible ledger when online.
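
A hedged sketch of what emitting such a receipt could look like in .NET. The predicateType and graphRevisionId come from the list above; the remaining field names, the key id, and the local ECDsa signing are illustrative stand-ins for the Authority integration:

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public static class VerdictReceipts
{
    // Build an in-toto Statement whose predicate carries the verdict and its evidence.
    // Everything beyond predicateType and graphRevisionId is an illustrative field name.
    public static string BuildStatement(string graphRevisionId, string artifactDigest,
        string ruleId, string ruleVersion, string verdict) =>
        JsonSerializer.Serialize(new
        {
            _type = "https://in-toto.io/Statement/v1",
            subject = new[] { new { name = "artifact", digest = new { sha256 = artifactDigest } } },
            predicateType = "stellaops.dev/verdict@v1",
            predicate = new
            {
                graphRevisionId,
                rule = new { id = ruleId, version = ruleVersion },
                verdict,
                timestampUtc = DateTimeOffset.UtcNow.ToString("O")
            }
        });

    // DSSE pre-authentication encoding: "DSSEv1 <len(type)> <type> <len(body)> <body>".
    public static byte[] PreAuthEncoding(string payloadType, byte[] payload)
    {
        var header = $"DSSEv1 {Encoding.UTF8.GetByteCount(payloadType)} {payloadType} {payload.Length} ";
        return Encoding.UTF8.GetBytes(header).Concat(payload).ToArray();
    }

    // Sign with a locally held key; in production the Authority service would hold the key.
    public static string SignEnvelope(string statementJson, ECDsa key)
    {
        const string payloadType = "application/vnd.in-toto+json";
        var payload = Encoding.UTF8.GetBytes(statementJson);
        var sig = key.SignData(PreAuthEncoding(payloadType, payload), HashAlgorithmName.SHA256);
        return JsonSerializer.Serialize(new
        {
            payloadType,
            payload = Convert.ToBase64String(payload),
            signatures = new[] { new { keyid = "authority-signing-key", sig = Convert.ToBase64String(sig) } }
        });
    }
}
```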

3) Add reachability “call-stack slices” or binary-symbol proofs

  • For code-level reachability, store compact slices: entry → sink, with symbol names + file:line.
  • For binary-only targets, include symbol presence proofs (e.g., Bloom filters + offsets) with executable digest.
  • Compress and embed a hash of the slice/proof inside the DSSE payload.
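
For illustration, a compact shape for a call-stack slice and the digest that would be embedded in the DSSE payload; all names are assumptions, and a real implementation would canonicalize the serialization before hashing:

```csharp
using System;
using System.Security.Cryptography;
using System.Text.Json;

// One frame of an entry-to-sink slice: symbol plus file:line.
public sealed record SliceFrame(string Symbol, string File, int Line);

// Compact reachability proof: ordered frames from entry point to vulnerable sink.
public sealed record CallStackSlice(string EntrySymbol, string SinkSymbol, SliceFrame[] Frames)
{
    // Digest of the serialized slice; this is what the DSSE predicate references,
    // so the (possibly large) slice itself can live in blob storage.
    public string Digest()
    {
        var json = JsonSerializer.SerializeToUtf8Bytes(this);
        return "sha256:" + Convert.ToHexString(SHA256.HashData(json)).ToLowerInvariant();
    }
}
```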

4) Deterministic replay manifests

  • Alongside receipts, publish a Replay Manifest (inputs, feeds, rule versions, container digests) so any auditor can reproduce the same graph hash and verdicts offline.

Benchmarks to publish (make them your headline KPIs)

A) False-positive reduction vs. baseline scanners (%)

  • Method: run a public corpus (e.g., sample images + app stacks) across 3-4 popular scanners; label ground truth once; compare FP rates.
  • Report: mean & p95 FP reduction.

B) Proof coverage (% of findings with signed evidence)

  • Definition: (# findings or VEX items carrying valid DSSE receipts) / (total surfaced items).
  • Break out: runtime-reachable vs. unreachable, and “not-affected” claims.

C) Triage time saved (p50/p95)

  • Measure analyst minutes from “alert created” → “final disposition.”
  • A/B with receipts hidden vs. visible; publish median/p95 deltas.

D) Determinism stability

  • Re-run identical scans N times / across nodes; publish % identical graph hashes and drift causes when different.

Minimal implementation plan (week-by-week)

Week 1: primitives

  • Add Graph Revision ID generator in scanner.webservice (Merkle over normalized JSON of SBOM+edges+policies+toolVersions).
  • Define VerdictReceipt schema (protobuf/JSON) and DSSE envelope types.

Week 2: signing + storage

  • Wire DSSE signing in Authority; offline key support + rotation.
  • Persist receipts in Receipts table (Postgres) keyed by (graphRevisionId, verdictId); enable export (JSONL) and ledger mirror.

Week 3: reachability proofs

  • Add call-stack slice capture in the reachability engine; serialize compactly; hash + reference from receipts.
  • Binary symbol proof module for ELF/PE: symbol bitmap + digest.

Week 4: replay + UX

  • Emit replay.manifest.json per scan (inputs, tool digests).
  • UI: show “Verified” badge, graph hash, signature issuer, and a one-click “Copy receipt” button.
  • API: GET /verdicts/{id}/receipt, GET /graphs/{rev}/replay.
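
A minimal ASP.NET Core sketch of the two read endpoints, assuming a hypothetical IReceiptStore abstraction over the Receipts table; the store interface, in-memory stand-in, and media types are illustrative:

```csharp
using System.Collections.Concurrent;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddSingleton<IReceiptStore, InMemoryReceiptStore>(); // swap for the Postgres-backed store
var app = builder.Build();

// Return the stored DSSE envelope for a verdict, or 404 if it was never receipted.
app.MapGet("/verdicts/{id}/receipt", (string id, IReceiptStore store) =>
    store.FindReceipt(id) is { } envelope ? Results.Text(envelope, "application/json") : Results.NotFound());

// Return the replay manifest for a graph revision so auditors can reproduce it offline.
app.MapGet("/graphs/{rev}/replay", (string rev, IReceiptStore store) =>
    store.FindReplayManifest(rev) is { } manifest ? Results.Text(manifest, "application/json") : Results.NotFound());

app.Run();

// Hypothetical storage abstraction over the Receipts table.
public interface IReceiptStore
{
    string? FindReceipt(string verdictId);
    string? FindReplayManifest(string graphRevisionId);
}

// Trivial in-memory stand-in so the sketch runs end to end.
public sealed class InMemoryReceiptStore : IReceiptStore
{
    private readonly ConcurrentDictionary<string, string> _receipts = new();
    private readonly ConcurrentDictionary<string, string> _manifests = new();
    public string? FindReceipt(string verdictId) => _receipts.TryGetValue(verdictId, out var r) ? r : null;
    public string? FindReplayManifest(string rev) => _manifests.TryGetValue(rev, out var m) ? m : null;
}
```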

Week 5: benchmarks harness

  • Create bench/ with golden fixtures and a runner:

    • Baseline scanner adapters
    • Ground-truth labels
    • Metrics export (FP%, proof coverage, triage time capture hooks)

Developer guardrails (make these non-negotiable)

  • No receipt, no ship: any surfaced verdict must carry a DSSE receipt.
  • Schema freeze windows: changes to rule inputs or policy logic must bump rule version and therefore the graph hash.
  • Replay-first CI: PRs touching scanning/rules must pass a replay test that reproduces prior graph hashes on golden fixtures.
  • Clock safety: use monotonic time inside receipts; add UTC wall-clock time separately.
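
For the clock-safety rule, a small illustrative helper that captures both representations (the type and field names are assumptions):

```csharp
using System;
using System.Diagnostics;

// Capture both time representations for a receipt: the monotonic reading orders events
// within a process even if the wall clock is adjusted; the UTC wall-clock time is
// recorded separately for human-readable auditing.
public readonly record struct ReceiptTimestamps(long MonotonicTicks, DateTimeOffset WallClockUtc)
{
    public static ReceiptTimestamps Capture() =>
        new(Stopwatch.GetTimestamp(), DateTimeOffset.UtcNow);
}
```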

What to show buyers/auditors

  • A short audit kit: sample container + your receipts + replay manifest + one command to reproduce the same graph hash.
  • A one-page benchmark readout: FP reduction, proof coverage, and triage time saved (p50/p95), with a corpus description.

If you want, I’ll draft:

  1. the DSSE predicate schema,
  2. the Postgres DDL for Receipts and Graphs, and
  3. a tiny .NET verification CLI (stellaops-verify) that replays a manifest and validates signatures.

Here’s a focused “developer guidelines” doc just for Benchmarks for a Testable Security Moat in StellaOps.

StellaOps Developer Guidelines

Benchmarks for a Testable Security Moat

Goal: Benchmarks are how we prove StellaOps is better, not just say it is. If a “moat” claim can’t be tied to a benchmark, it doesn’t exist.

Everything here is about how you, as a developer, design, extend, and run those benchmarks.


1. What our benchmarks must measure

Every core product claim needs at least one benchmark:

  1. Detection quality

    • Precision / recall vs ground truth.
    • False positives vs popular scanners.
    • False negatives on known-bad samples.
  2. Proof & evidence quality

    • % of findings with valid receipts (DSSE).

    • % of VEX “not-affected” claims with attached proofs.

    • Reachability proof quality:

      • call-stack slice present?
      • symbol proof present for binaries?
  3. Triage & workflow impact

    • Time-to-decision for analysts (p50/p95).
    • Click depth and context switches per decision.
    • “Verified” vs “unverified” verdict triage times.
  4. Determinism & reproducibility

    • Same inputs → same Graph Revision ID.
    • Stable verdict sets across runs/nodes.

Rule: If you add a feature that impacts any of these, you must either hook it into an existing benchmark or add a new one.


2. Benchmark assets and layout

2.1 Repo layout (convention)

Under bench/ we maintain everything benchmark-related:

  • bench/corpus/

    • images/ – curated container images / tarballs.
    • repos/ – sample codebases (with known vulns).
    • sboms/ – canned SBOMs for edge cases.
  • bench/scenarios/

    • *.yaml – scenario definitions (inputs + expected outputs).
  • bench/golden/

    • *.json – golden results (expected findings, metrics).
  • bench/tools/

    • adapters for baseline scanners, parsers, helpers.
  • bench/scripts/

    • run_benchmarks.[sh/cs] – the single entry point.

2.2 Scenario definition (high-level)

Each scenario yaml should minimally specify:

  • Inputs

    • artifact references (image name / path / repo SHA / SBOM file).
    • environment knobs (features enabled/disabled).
  • Ground truth

    • list of expected vulns (or explicit “none”).
    • for some: expected reachability (reachable/unreachable).
    • expected VEX entries (affected / not affected).
  • Expectations

    • required metrics (e.g., “no more than 2 FPs”, “no FNs”).
    • required proof coverage (e.g., “100% of surfaced findings have receipts”).
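
Because the runner is .NET (run_benchmarks.[sh/cs]), one illustrative way to model this shape in code so scenario YAML can be deserialized and validated; all type and field names here are assumptions, not a fixed schema:

```csharp
// Illustrative C# shape for a bench scenario; the YAML files under bench/scenarios/
// would deserialize into something like this.
public sealed record BenchScenario(
    string Name,
    bool Strict,                       // strict scenarios are blocking in CI
    ScenarioInputs Inputs,
    GroundTruth GroundTruth,
    Expectations Expectations);

public sealed record ScenarioInputs(
    string? ImageRef, string? RepoSha, string? SbomPath,
    string[] EnabledFeatures);

public sealed record GroundTruth(
    ExpectedVuln[] ExpectedVulns,
    ExpectedVex[] ExpectedVex);

public sealed record ExpectedVuln(string CveId, bool? Reachable);
public sealed record ExpectedVex(string CveId, string Status);   // "affected" / "not_affected"

public sealed record Expectations(
    int MaxFalsePositives,
    int MaxFalseNegatives,
    double MinProofCoverage);          // e.g. 1.0 => every surfaced finding has a receipt
```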

3. Core benchmark metrics (developer-facing definitions)

Use these consistently across code and docs.

3.1 Detection metrics

  • true_positive_count (TP)
  • false_positive_count (FP)
  • false_negative_count (FN)

Derived:

  • precision = TP / (TP + FP)
  • recall = TP / (TP + FN)
  • For UX: track FP per asset and FP per 100 findings.
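
A small, hedged helper that encodes these definitions; the zero-denominator conventions are a choice for the sketch, not a spec:

```csharp
public sealed record DetectionCounts(int TruePositives, int FalsePositives, int FalseNegatives)
{
    public double Precision => TruePositives + FalsePositives == 0
        ? 1.0                                  // nothing surfaced, nothing wrong
        : (double)TruePositives / (TruePositives + FalsePositives);

    public double Recall => TruePositives + FalseNegatives == 0
        ? 1.0                                  // nothing to find
        : (double)TruePositives / (TruePositives + FalseNegatives);

    // UX-facing rate: false positives per 100 surfaced findings.
    public double FpPer100Findings => TruePositives + FalsePositives == 0
        ? 0.0
        : 100.0 * FalsePositives / (TruePositives + FalsePositives);
}
```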

Developer guideline:

  • When you introduce a filter, deduper, or rule tweak, add/modify a scenario where:

    • the change helps (reduces FP or FN); and
    • a different scenario guards against regressions.

3.2 Moat-specific metrics

These are the ones that directly support the “testable moat” story:

  1. False-positive reduction vs baseline scanners

    • Run baseline scanners across our corpus (via adapters in bench/tools).

    • Compute:

      • baseline_fp_rate
      • stella_fp_rate
      • fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate.
  2. Proof coverage

    • proof_coverage_all = findings_with_valid_receipts / total_findings
    • proof_coverage_vex = vex_items_with_valid_receipts / total_vex_items
    • proof_coverage_reachable = reachable_findings_with_proofs / total_reachable_findings
  3. Triage time improvement

    • In test harnesses, simulate or record:

      • time_to_triage_with_receipts
      • time_to_triage_without_receipts
    • Compute median & p95 deltas.

  4. Determinism

    • Re-run the same scenario N times:

      • % runs with identical Graph Revision ID
      • % runs with identical verdict sets
    • On mismatch, diff and log the cause (e.g., unstable sort order, unpinned feed).
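
These definitions are mechanical; a sketch of how the harness might compute them (names are assumptions):

```csharp
using System;
using System.Linq;

public static class MoatMetrics
{
    // fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate
    public static double FpReduction(double baselineFpRate, double stellaFpRate) =>
        baselineFpRate == 0 ? 0 : (baselineFpRate - stellaFpRate) / baselineFpRate;

    // proof_coverage_* = items carrying valid DSSE receipts / total items
    public static double ProofCoverage(int itemsWithValidReceipts, int totalItems) =>
        totalItems == 0 ? 1.0 : (double)itemsWithValidReceipts / totalItems;

    // Determinism: fraction of repeated runs whose Graph Revision ID matches the first run.
    public static double GraphHashStability(string[] graphRevisionIdsPerRun)
    {
        if (graphRevisionIdsPerRun.Length == 0) return 1.0;
        var reference = graphRevisionIdsPerRun[0];
        return (double)graphRevisionIdsPerRun.Count(id => id == reference)
             / graphRevisionIdsPerRun.Length;
    }
}
```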


4. How developers should work with benchmarks

4.1 “No feature without benchmarks”

If you’re adding or changing:

  • graph structure,
  • rule logic,
  • scanner integration,
  • VEX handling,
  • proof / receipt generation,

you must do at least one of:

  1. Extend an existing scenario

    • Add expectations that cover your change, or
    • tighten an existing bound (e.g., lower FP threshold).
  2. Add a new scenario

    • For new attack classes / edge cases / ecosystems.

Anti-patterns:

  • Shipping a new capability with no corresponding scenario.
  • Updating golden outputs without explaining why metrics changed.

4.2 CI gates

We treat benchmarks as blocking:

  • Add a CI job, e.g.:

    • make bench:quick on every PR (small subset).
    • make bench:full on main / nightly.
  • CI fails if:

    • Any scenario marked strict: true has:

      • Precision or recall below its threshold.
      • Proof coverage below its configured threshold.
    • Global regressions above tolerance:

      • e.g. total FP increases > X% without an explicit override.

Developer rule:

  • If you intentionally change behavior:

    • Update the relevant golden files.

    • Include a short note in the PR (e.g., bench-notes.md snippet) describing:

      • what changed,
      • why the new result is better, and
      • which moat metric it improves (FP, proof coverage, determinism, etc.).

5. Benchmark implementation guidelines

5.1 Make benchmarks deterministic

  • Pin everything:

    • feed snapshots,
    • tool container digests,
    • rule versions,
    • time windows.
  • Use Replay Manifests as the source of truth:

    • replay.manifest.json should contain:

      • input artifacts,
      • tool versions,
      • feed versions,
      • configuration flags.
  • If a benchmark depends on time:

    • Inject a fake clock or explicit “as of” timestamp.
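
An illustrative .NET shape for the manifest plus an injectable clock; field names are assumptions and the real replay.manifest.json schema may differ:

```csharp
using System;

// Illustrative shape of replay.manifest.json.
public sealed record ReplayManifest(
    string GraphRevisionId,
    string[] InputArtifactDigests,    // images, repos, SBOMs by digest
    string[] ToolImageDigests,        // pinned scanner/tool container digests
    string[] FeedSnapshotIds,         // pinned advisory-feed snapshots
    string[] RuleVersions,
    string[] ConfigurationFlags,
    DateTimeOffset AsOfUtc);          // explicit "as of" time instead of "now"

// Inject a clock so benchmark runs never read real time directly.
public interface IClock { DateTimeOffset UtcNow { get; } }

public sealed record FixedClock(DateTimeOffset AsOf) : IClock
{
    public DateTimeOffset UtcNow => AsOf;
}
```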

5.2 Keep scenarios small but meaningful

  • Prefer many focused scenarios over a few huge ones.

  • Each scenario should clearly answer:

    • “What property of StellaOps are we testing?”
    • “What moat claim does this support?”

Examples:

  • bench/scenarios/false_pos_kubernetes.yaml

    • Focus: config noise reduction vs baseline scanner.
  • bench/scenarios/reachability_java_webapp.yaml

    • Focus: reachable vs unreachable vuln proofs.
  • bench/scenarios/vex_not_affected_openssl.yaml

    • Focus: VEX correctness and proof coverage.

5.3 Use golden outputs, not ad-hoc assertions

  • Bench harness should:

    • Run StellaOps on scenario inputs.
    • Normalize outputs (sorted lists, stable IDs).
    • Compare to bench/golden/<scenario>.json.
  • Golden file should include:

    • expected findings (id, severity, reachable?, etc.),
    • expected VEX entries,
    • expected metrics (precision, recall, coverage).
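
A minimal sketch of the normalize-and-compare step, reduced to finding IDs for brevity; the real golden files also carry VEX entries and metrics, and the helper names are assumptions:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.Json;

public static class GoldenCompare
{
    // Normalize: de-duplicate and sort by a stable key so run-to-run ordering cannot cause diffs.
    public static string[] Normalize(string[] findingIds) =>
        findingIds.Distinct().OrderBy(id => id, StringComparer.Ordinal).ToArray();

    // Compare a normalized run against bench/golden/<scenario>.json and report the diff.
    public static bool MatchesGolden(string goldenPath, string[] actualFindingIds, out string diff)
    {
        var expected = JsonSerializer.Deserialize<string[]>(File.ReadAllText(goldenPath)) ?? Array.Empty<string>();
        var actual = Normalize(actualFindingIds);
        var missing = expected.Except(actual).ToArray();   // in golden, not produced
        var extra = actual.Except(expected).ToArray();     // produced, not in golden
        diff = $"missing: [{string.Join(", ", missing)}], extra: [{string.Join(", ", extra)}]";
        return missing.Length == 0 && extra.Length == 0;
    }
}
```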

6. Moat-critical benchmark types (we must have all of these)

When you’re thinking about gaps, check that we have:

  1. Cross-tool comparison

    • Same corpus, multiple scanners.
    • Metrics vs baselines for FP/FN.
  2. Proof density & quality

    • Corpus where:

      • some vulns are reachable,
      • some are not,
      • some are not present.
    • Ensure:

      • reachable ones have rich proofs (stack slices / symbol proofs).

      • non-reachable or absent ones have:

        • correct disposition, and
        • clear receipts explaining why.
  3. VEX accuracy

    • Scenarios with known SBOM + known vulnerability impact.

    • Check:

      • VEX “affected”/“not-affected” matches ground truth.
      • every VEX entry has a receipt.
  4. Analyst workflow

    • Small usability corpus for internal testing:

      • Measure time-to-triage with/without receipts.
      • Use the same scenarios across releases to track improvement.
  5. Upgrade / drift resistance

    • Scenarios that are expected to remain stable across:

      • rule changes that shouldn’t affect outcomes.
      • feed updates (within a given version window).
    • These act as canaries for unintended regressions.


7. Developer checklist (TL;DR)

Before merging a change that touches security logic, ask yourself:

  1. Is there at least one benchmark scenario that exercises this change?
  2. Does the change improve at least one moat metric, or is it neutral?
  3. Have I run make bench:quick locally and checked diffs?
  4. If goldens changed, did I explain why in the PR?
  5. Did I keep benchmarks deterministic (pinned versions, fake time, etc.)?

If any answer is “no”, fix that before merging.


If you’d like, as a next step I can sketch a concrete bench/scenarios/*.yaml and matching bench/golden/*.json example that encodes one specific moat claim (e.g., “30% fewer FPs than Scanner X on Kubernetes configs”) so your team has a ready-to-copy pattern.