Benchmarks for a Testable Security Moat (product advisory, 1 December 2025)

Here’s a crisp, practical way to turn StellaOps’ “verifiable proof spine” into a moat, and how to measure it.

Why this matters (in plain terms)

Security tools often say “trust me.” You’ll say “prove it”: every finding and every “not-affected” claim ships with cryptographic receipts anyone can verify.


Differentiators to build in

1) Bind every verdict to a graph hash

  • Compute a stable Graph Revision ID (Merkle root) over: SBOM nodes, edges, policies, feeds, scan params, and tool versions.
  • Store the ID on each finding/VEX item; show it in the UI and APIs.
  • Rule: any data change → new graph hash → new revisioned verdicts.
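
As a minimal sketch of one way to derive such an ID, here is a flattened Merkle-style root over pre-canonicalized inputs; the GraphInputs type and its field names are illustrative assumptions, not the actual scanner.webservice schema:

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Illustrative input bundle; the real schema lives in scanner.webservice.
public sealed record GraphInputs(
    string[] SbomNodeDigests,   // canonical digests of SBOM nodes
    string[] EdgeDigests,       // canonical digests of graph edges
    string[] PolicyDigests,     // digests of active policy documents
    string[] FeedSnapshotIds,   // pinned advisory-feed snapshot ids
    string ScanParamsJson,      // normalized JSON of scan parameters
    string ToolVersionsJson);   // normalized JSON of tool/container versions

public static class GraphRevision
{
    // Hash each leaf, sort the leaf hashes for stability, then hash the concatenation.
    public static string ComputeId(GraphInputs inputs)
    {
        var leaves = inputs.SbomNodeDigests
            .Concat(inputs.EdgeDigests)
            .Concat(inputs.PolicyDigests)
            .Concat(inputs.FeedSnapshotIds)
            .Append(inputs.ScanParamsJson)
            .Append(inputs.ToolVersionsJson)
            .Select(leaf => SHA256.HashData(Encoding.UTF8.GetBytes(leaf)))
            .OrderBy(h => Convert.ToHexString(h), StringComparer.Ordinal)
            .ToArray();

        var root = SHA256.HashData(leaves.SelectMany(h => h).ToArray());
        return "graph:sha256:" + Convert.ToHexString(root).ToLowerInvariant();
    }
}
```

A full Merkle tree with stored intermediate roots would additionally allow partial inclusion proofs (“this SBOM node is part of this revision”) at the cost of keeping the intermediate hashes.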

2) Attach machine-verifiable receipts (in-toto/DSSE)

  • For each verdict, emit a DSSE-wrapped in-toto statement:

    • predicateType: stellaops.dev/verdict@v1
    • includes: graphRevisionId, artifact digests, rule id/version, inputs (CPE/CVE/CVSS), timestamps.
  • Sign with your Authority (Sigstore key, offline mode supported).

  • Keep receipts queryable and exportable; mirror to a Rekor-compatible ledger when online.
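
A hedged sketch of what emitting such a receipt could look like in .NET. The predicateType and graphRevisionId come from the list above; the remaining field names, the key id, and the local ECDsa signing are illustrative stand-ins for the Authority integration:

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public static class VerdictReceipts
{
    // Build an in-toto Statement whose predicate carries the verdict and its evidence.
    // Everything beyond predicateType and graphRevisionId is an illustrative field name.
    public static string BuildStatement(string graphRevisionId, string artifactDigest,
        string ruleId, string ruleVersion, string verdict) =>
        JsonSerializer.Serialize(new
        {
            _type = "https://in-toto.io/Statement/v1",
            subject = new[] { new { name = "artifact", digest = new { sha256 = artifactDigest } } },
            predicateType = "stellaops.dev/verdict@v1",
            predicate = new
            {
                graphRevisionId,
                rule = new { id = ruleId, version = ruleVersion },
                verdict,
                timestampUtc = DateTimeOffset.UtcNow.ToString("O")
            }
        });

    // DSSE pre-authentication encoding: "DSSEv1 <len(type)> <type> <len(body)> <body>".
    public static byte[] PreAuthEncoding(string payloadType, byte[] payload)
    {
        var header = $"DSSEv1 {Encoding.UTF8.GetByteCount(payloadType)} {payloadType} {payload.Length} ";
        return Encoding.UTF8.GetBytes(header).Concat(payload).ToArray();
    }

    // Sign with a locally held key; in production the Authority service would hold the key.
    public static string SignEnvelope(string statementJson, ECDsa key)
    {
        const string payloadType = "application/vnd.in-toto+json";
        var payload = Encoding.UTF8.GetBytes(statementJson);
        var sig = key.SignData(PreAuthEncoding(payloadType, payload), HashAlgorithmName.SHA256);
        return JsonSerializer.Serialize(new
        {
            payloadType,
            payload = Convert.ToBase64String(payload),
            signatures = new[] { new { keyid = "authority-signing-key", sig = Convert.ToBase64String(sig) } }
        });
    }
}
```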

3) Add reachability “call-stack slices” or binary-symbol proofs

  • For code-level reachability, store compact slices: entry → sink, with symbol names + file:line.
  • For binary-only targets, include symbol presence proofs (e.g., Bloom filters + offsets) with executable digest.
  • Compress and embed a hash of the slice/proof inside the DSSE payload.
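
For illustration, a compact shape for a call-stack slice and the digest that would be embedded in the DSSE payload; all names are assumptions, and a real implementation would canonicalize the serialization before hashing:

```csharp
using System;
using System.Security.Cryptography;
using System.Text.Json;

// One frame of an entry-to-sink slice: symbol plus file:line.
public sealed record SliceFrame(string Symbol, string File, int Line);

// Compact reachability proof: ordered frames from entry point to vulnerable sink.
public sealed record CallStackSlice(string EntrySymbol, string SinkSymbol, SliceFrame[] Frames)
{
    // Digest of the serialized slice; this is what the DSSE predicate references,
    // so the (possibly large) slice itself can live in blob storage.
    public string Digest()
    {
        var json = JsonSerializer.SerializeToUtf8Bytes(this);
        return "sha256:" + Convert.ToHexString(SHA256.HashData(json)).ToLowerInvariant();
    }
}
```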

4) Deterministic replay manifests

  • Alongside receipts, publish a Replay Manifest (inputs, feeds, rule versions, container digests) so any auditor can reproduce the same graph hash and verdicts offline.

Benchmarks to publish (make them your headline KPIs)

A) False-positive reduction vs. baseline scanners (%)

  • Method: run a public corpus (e.g., sample images + app stacks) across 3-4 popular scanners; label ground truth once; compare FP rates.
  • Report: mean & p95 FP reduction.

B) Proof coverage (% of findings with signed evidence)

  • Definition: (# findings or VEX items carrying valid DSSE receipts) / (total surfaced items).
  • Break out: runtime-reachable vs. unreachable, and “not-affected” claims.

C) Triage time saved (p50/p95)

  • Measure analyst minutes from “alert created” → “final disposition.”
  • A/B with receipts hidden vs. visible; publish median/p95 deltas.

D) Determinism stability

  • Re-run identical scans N times / across nodes; publish % identical graph hashes and drift causes when different.

Minimal implementation plan (week-by-week)

Week 1: primitives

  • Add Graph Revision ID generator in scanner.webservice (Merkle over normalized JSON of SBOM+edges+policies+toolVersions).
  • Define VerdictReceipt schema (protobuf/JSON) and DSSE envelope types.

Week 2: signing + storage

  • Wire DSSE signing in Authority; offline key support + rotation.
  • Persist receipts in Receipts table (Postgres) keyed by (graphRevisionId, verdictId); enable export (JSONL) and ledger mirror.

Week 3: reachability proofs

  • Add call-stack slice capture in the reachability engine; serialize compactly; hash + reference from receipts.
  • Binary symbol proof module for ELF/PE: symbol bitmap + digest.

Week 4: replay + UX

  • Emit replay.manifest.json per scan (inputs, tool digests).
  • UI: show “Verified” badge, graph hash, signature issuer, and a one-click “Copy receipt” button.
  • API: GET /verdicts/{id}/receipt, GET /graphs/{rev}/replay.
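
A minimal ASP.NET Core sketch of the two read endpoints, assuming a hypothetical IReceiptStore abstraction over the Receipts table; the store interface, in-memory stand-in, and media types are illustrative:

```csharp
using System.Collections.Concurrent;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddSingleton<IReceiptStore, InMemoryReceiptStore>(); // swap for the Postgres-backed store
var app = builder.Build();

// Return the stored DSSE envelope for a verdict, or 404 if it was never receipted.
app.MapGet("/verdicts/{id}/receipt", (string id, IReceiptStore store) =>
    store.FindReceipt(id) is { } envelope ? Results.Text(envelope, "application/json") : Results.NotFound());

// Return the replay manifest for a graph revision so auditors can reproduce it offline.
app.MapGet("/graphs/{rev}/replay", (string rev, IReceiptStore store) =>
    store.FindReplayManifest(rev) is { } manifest ? Results.Text(manifest, "application/json") : Results.NotFound());

app.Run();

// Hypothetical storage abstraction over the Receipts table.
public interface IReceiptStore
{
    string? FindReceipt(string verdictId);
    string? FindReplayManifest(string graphRevisionId);
}

// Trivial in-memory stand-in so the sketch runs end to end.
public sealed class InMemoryReceiptStore : IReceiptStore
{
    private readonly ConcurrentDictionary<string, string> _receipts = new();
    private readonly ConcurrentDictionary<string, string> _manifests = new();
    public string? FindReceipt(string verdictId) => _receipts.TryGetValue(verdictId, out var r) ? r : null;
    public string? FindReplayManifest(string rev) => _manifests.TryGetValue(rev, out var m) ? m : null;
}
```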

Week 5: benchmarks harness

  • Create bench/ with golden fixtures and a runner:

    • Baseline scanner adapters
    • Ground-truth labels
    • Metrics export (FP%, proof coverage, triage time capture hooks)

Developer guardrails (make these non-negotiable)

  • No receipt, no ship: any surfaced verdict must carry a DSSE receipt.
  • Schema freeze windows: changes to rule inputs or policy logic must bump rule version and therefore the graph hash.
  • Replay-first CI: PRs touching scanning/rules must pass a replay test that reproduces prior graph hashes on golden fixtures.
  • Clock safety: use monotonic time inside receipts; add UTC wall-clock time separately.
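
For the clock-safety rule, a small illustrative helper that captures both representations (the type and field names are assumptions):

```csharp
using System;
using System.Diagnostics;

// Capture both time representations for a receipt: the monotonic reading orders events
// within a process even if the wall clock is adjusted; the UTC wall-clock time is
// recorded separately for human-readable auditing.
public readonly record struct ReceiptTimestamps(long MonotonicTicks, DateTimeOffset WallClockUtc)
{
    public static ReceiptTimestamps Capture() =>
        new(Stopwatch.GetTimestamp(), DateTimeOffset.UtcNow);
}
```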

What to show buyers/auditors

  • A short audit kit: sample container + your receipts + replay manifest + one command to reproduce the same graph hash.
  • A one-page benchmark readout: FP reduction, proof coverage, and triage time saved (p50/p95), with a corpus description.

If you want, I’ll draft:

  1. the DSSE predicate schema,
  2. the Postgres DDL for Receipts and Graphs, and
  3. a tiny .NET verification CLI (stellaops-verify) that replays a manifest and validates signatures.

Here’s a focused “developer guidelines” doc just for Benchmarks for a Testable Security Moat in StellaOps.

StellaOps Developer Guidelines

Benchmarks for a Testable Security Moat

Goal: Benchmarks are how we prove StellaOps is better, not just say it is. If a “moat” claim can’t be tied to a benchmark, it doesn’t exist.

Everything here is about how you, as a developer, design, extend, and run those benchmarks.


1. What our benchmarks must measure

Every core product claim needs at least one benchmark:

  1. Detection quality

    • Precision / recall vs ground truth.
    • False positives vs popular scanners.
    • False negatives on known-bad samples.
  2. Proof & evidence quality

    • % of findings with valid receipts (DSSE).

    • % of VEX “not-affected” claims with attached proofs.

    • Reachability proof quality:

      • call-stack slice present?
      • symbol proof present for binaries?
  3. Triage & workflow impact

    • Time-to-decision for analysts (p50/p95).
    • Click depth and context switches per decision.
    • “Verified” vs “unverified” verdict triage times.
  4. Determinism & reproducibility

    • Same inputs → same Graph Revision ID.
    • Stable verdict sets across runs/nodes.

Rule: If you add a feature that impacts any of these, you must either hook it into an existing benchmark or add a new one.


2. Benchmark assets and layout

2.1 Repo layout (convention)

Under bench/ we maintain everything benchmark-related:

  • bench/corpus/

    • images/ – curated container images / tarballs.
    • repos/ – sample codebases (with known vulns).
    • sboms/ – canned SBOMs for edge cases.
  • bench/scenarios/

    • *.yaml – scenario definitions (inputs + expected outputs).
  • bench/golden/

    • *.json – golden results (expected findings, metrics).
  • bench/tools/

    • adapters for baseline scanners, parsers, helpers.
  • bench/scripts/

    • run_benchmarks.[sh/cs] – the single entry point.

2.2 Scenario definition (high-level)

Each scenario yaml should minimally specify:

  • Inputs

    • artifact references (image name / path / repo SHA / SBOM file).
    • environment knobs (features enabled/disabled).
  • Ground truth

    • list of expected vulns (or explicit “none”).
    • for some: expected reachability (reachable/unreachable).
    • expected VEX entries (affected / not affected).
  • Expectations

    • required metrics (e.g., “no more than 2 FPs”, “no FNs”).
    • required proof coverage (e.g., “100% of surfaced findings have receipts”).
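
Because the runner is .NET (run_benchmarks.[sh/cs]), one illustrative way to model this shape in code so scenario YAML can be deserialized and validated; all type and field names here are assumptions, not a fixed schema:

```csharp
// Illustrative C# shape for a bench scenario; the YAML files under bench/scenarios/
// would deserialize into something like this.
public sealed record BenchScenario(
    string Name,
    bool Strict,                       // strict scenarios are blocking in CI
    ScenarioInputs Inputs,
    GroundTruth GroundTruth,
    Expectations Expectations);

public sealed record ScenarioInputs(
    string? ImageRef, string? RepoSha, string? SbomPath,
    string[] EnabledFeatures);

public sealed record GroundTruth(
    ExpectedVuln[] ExpectedVulns,
    ExpectedVex[] ExpectedVex);

public sealed record ExpectedVuln(string CveId, bool? Reachable);
public sealed record ExpectedVex(string CveId, string Status);   // "affected" / "not_affected"

public sealed record Expectations(
    int MaxFalsePositives,
    int MaxFalseNegatives,
    double MinProofCoverage);          // e.g. 1.0 => every surfaced finding has a receipt
```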

3. Core benchmark metrics (developer-facing definitions)

Use these consistently across code and docs.

3.1 Detection metrics

  • true_positive_count (TP)
  • false_positive_count (FP)
  • false_negative_count (FN)

Derived:

  • precision = TP / (TP + FP)
  • recall = TP / (TP + FN)
  • For UX: track FP per asset and FP per 100 findings.
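
A small, hedged helper that encodes these definitions; the zero-denominator conventions are a choice for the sketch, not a spec:

```csharp
public sealed record DetectionCounts(int TruePositives, int FalsePositives, int FalseNegatives)
{
    public double Precision => TruePositives + FalsePositives == 0
        ? 1.0                                  // nothing surfaced, nothing wrong
        : (double)TruePositives / (TruePositives + FalsePositives);

    public double Recall => TruePositives + FalseNegatives == 0
        ? 1.0                                  // nothing to find
        : (double)TruePositives / (TruePositives + FalseNegatives);

    // UX-facing rate: false positives per 100 surfaced findings.
    public double FpPer100Findings => TruePositives + FalsePositives == 0
        ? 0.0
        : 100.0 * FalsePositives / (TruePositives + FalsePositives);
}
```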

Developer guideline:

  • When you introduce a filter, deduper, or rule tweak, add/modify a scenario where:

    • the change helps (reduces FP or FN); and
    • a different scenario guards against regressions.

3.2 Moat-specific metrics

These are the ones that directly support the “testable moat” story:

  1. False-positive reduction vs baseline scanners

    • Run baseline scanners across our corpus (via adapters in bench/tools).

    • Compute:

      • baseline_fp_rate
      • stella_fp_rate
      • fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate.
  2. Proof coverage

    • proof_coverage_all = findings_with_valid_receipts / total_findings
    • proof_coverage_vex = vex_items_with_valid_receipts / total_vex_items
    • proof_coverage_reachable = reachable_findings_with_proofs / total_reachable_findings
  3. Triage time improvement

    • In test harnesses, simulate or record:

      • time_to_triage_with_receipts
      • time_to_triage_without_receipts
    • Compute median & p95 deltas.

  4. Determinism

    • Re-run the same scenario N times:

      • % runs with identical Graph Revision ID
      • % runs with identical verdict sets
    • On mismatch, diff and log the cause (e.g., unstable sort order, unpinned feed).
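
These definitions are mechanical; a sketch of how the harness might compute them (names are assumptions):

```csharp
using System;
using System.Linq;

public static class MoatMetrics
{
    // fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate
    public static double FpReduction(double baselineFpRate, double stellaFpRate) =>
        baselineFpRate == 0 ? 0 : (baselineFpRate - stellaFpRate) / baselineFpRate;

    // proof_coverage_* = items carrying valid DSSE receipts / total items
    public static double ProofCoverage(int itemsWithValidReceipts, int totalItems) =>
        totalItems == 0 ? 1.0 : (double)itemsWithValidReceipts / totalItems;

    // Determinism: fraction of repeated runs whose Graph Revision ID matches the first run.
    public static double GraphHashStability(string[] graphRevisionIdsPerRun)
    {
        if (graphRevisionIdsPerRun.Length == 0) return 1.0;
        var reference = graphRevisionIdsPerRun[0];
        return (double)graphRevisionIdsPerRun.Count(id => id == reference)
             / graphRevisionIdsPerRun.Length;
    }
}
```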


4. How developers should work with benchmarks

4.1 “No feature without benchmarks”

If you’re adding or changing:

  • graph structure,
  • rule logic,
  • scanner integration,
  • VEX handling,
  • proof / receipt generation,

you must do at least one of:

  1. Extend an existing scenario

    • Add expectations that cover your change, or
    • tighten an existing bound (e.g., lower FP threshold).
  2. Add a new scenario

    • For new attack classes / edge cases / ecosystems.

Anti-patterns:

  • Shipping a new capability with no corresponding scenario.
  • Updating golden outputs without explaining why metrics changed.

4.2 CI gates

We treat benchmarks as blocking:

  • Add a CI job, e.g.:

    • make bench:quick on every PR (small subset).
    • make bench:full on main / nightly.
  • CI fails if:

    • Any scenario marked strict: true has:

      • Precision or recall below its threshold.
      • Proof coverage below its configured threshold.
    • Global regressions above tolerance:

      • e.g. total FP increases > X% without an explicit override.

Developer rule:

  • If you intentionally change behavior:

    • Update the relevant golden files.

    • Include a short note in the PR (e.g., bench-notes.md snippet) describing:

      • what changed,
      • why the new result is better, and
      • which moat metric it improves (FP, proof coverage, determinism, etc.).

5. Benchmark implementation guidelines

5.1 Make benchmarks deterministic

  • Pin everything:

    • feed snapshots,
    • tool container digests,
    • rule versions,
    • time windows.
  • Use Replay Manifests as the source of truth:

    • replay.manifest.json should contain:

      • input artifacts,
      • tool versions,
      • feed versions,
      • configuration flags.
  • If a benchmark depends on time:

    • Inject a fake clock or explicit “as of” timestamp.
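
An illustrative .NET shape for the manifest plus an injectable clock; field names are assumptions and the real replay.manifest.json schema may differ:

```csharp
using System;

// Illustrative shape of replay.manifest.json.
public sealed record ReplayManifest(
    string GraphRevisionId,
    string[] InputArtifactDigests,    // images, repos, SBOMs by digest
    string[] ToolImageDigests,        // pinned scanner/tool container digests
    string[] FeedSnapshotIds,         // pinned advisory-feed snapshots
    string[] RuleVersions,
    string[] ConfigurationFlags,
    DateTimeOffset AsOfUtc);          // explicit "as of" time instead of "now"

// Inject a clock so benchmark runs never read real time directly.
public interface IClock { DateTimeOffset UtcNow { get; } }

public sealed record FixedClock(DateTimeOffset AsOf) : IClock
{
    public DateTimeOffset UtcNow => AsOf;
}
```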

5.2 Keep scenarios small but meaningful

  • Prefer many focused scenarios over a few huge ones.

  • Each scenario should clearly answer:

    • “What property of StellaOps are we testing?”
    • “What moat claim does this support?”

Examples:

  • bench/scenarios/false_pos_kubernetes.yaml

    • Focus: config noise reduction vs baseline scanner.
  • bench/scenarios/reachability_java_webapp.yaml

    • Focus: reachable vs unreachable vuln proofs.
  • bench/scenarios/vex_not_affected_openssl.yaml

    • Focus: VEX correctness and proof coverage.

5.3 Use golden outputs, not ad-hoc assertions

  • Bench harness should:

    • Run StellaOps on scenario inputs.
    • Normalize outputs (sorted lists, stable IDs).
    • Compare to bench/golden/<scenario>.json.
  • Golden file should include:

    • expected findings (id, severity, reachable?, etc.),
    • expected VEX entries,
    • expected metrics (precision, recall, coverage).
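
A minimal sketch of the normalize-and-compare step, reduced to finding IDs for brevity; the real golden files also carry VEX entries and metrics, and the helper names are assumptions:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.Json;

public static class GoldenCompare
{
    // Normalize: de-duplicate and sort by a stable key so run-to-run ordering cannot cause diffs.
    public static string[] Normalize(string[] findingIds) =>
        findingIds.Distinct().OrderBy(id => id, StringComparer.Ordinal).ToArray();

    // Compare a normalized run against bench/golden/<scenario>.json and report the diff.
    public static bool MatchesGolden(string goldenPath, string[] actualFindingIds, out string diff)
    {
        var expected = JsonSerializer.Deserialize<string[]>(File.ReadAllText(goldenPath)) ?? Array.Empty<string>();
        var actual = Normalize(actualFindingIds);
        var missing = expected.Except(actual).ToArray();   // in golden, not produced
        var extra = actual.Except(expected).ToArray();     // produced, not in golden
        diff = $"missing: [{string.Join(", ", missing)}], extra: [{string.Join(", ", extra)}]";
        return missing.Length == 0 && extra.Length == 0;
    }
}
```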

6. Moat-critical benchmark types (we must have all of these)

When you’re thinking about gaps, check that we have:

  1. Cross-tool comparison

    • Same corpus, multiple scanners.
    • Metrics vs baselines for FP/FN.
  2. Proof density & quality

    • Corpus where:

      • some vulns are reachable,
      • some are not,
      • some are not present.
    • Ensure:

      • reachable ones have rich proofs (stack slices / symbol proofs).

      • non-reachable or absent ones have:

        • correct disposition, and
        • clear receipts explaining why.
  3. VEX accuracy

    • Scenarios with known SBOM + known vulnerability impact.

    • Check:

      • VEX “affected”/“not-affected” matches ground truth.
      • every VEX entry has a receipt.
  4. Analyst workflow

    • Small usability corpus for internal testing:

      • Measure time-to-triage with/without receipts.
      • Use the same scenarios across releases to track improvement.
  5. Upgrade / drift resistance

    • Scenarios that are expected to remain stable across:

      • rule changes that shouldn’t affect outcomes.
      • feed updates (within a given version window).
    • These act as canaries for unintended regressions.


7. Developer checklist (TL;DR)

Before merging a change that touches security logic, ask yourself:

  1. Is there at least one benchmark scenario that exercises this change?
  2. Does the change improve at least one moat metric, or is it neutral?
  3. Have I run make bench:quick locally and checked diffs?
  4. If goldens changed, did I explain why in the PR?
  5. Did I keep benchmarks deterministic (pinned versions, fake time, etc.)?

If any answer is “no”, fix that before merging.


If you’d like, as a next step I can sketch a concrete bench/scenarios/*.yaml and matching bench/golden/*.json example that encodes one specific moat claim (e.g., “30% fewer FPs than Scanner X on Kubernetes configs”) so your team has a ready-to-copy pattern.