Replay Fidelity as a Proof Metric

Here's a simple, practical idea to make your scans provably repeatable over time and catch drift fast.

Replay Fidelity (what, why, how)

What it is: the share of historical scans that reproduce bit-for-bit when rerun using their saved manifests (inputs, versions, rules, seeds). Higher = more deterministic system.

Why you want it: it exposes hidden nondeterminism (feed drift, time-dependent rules, race conditions, unstable dependency resolution) and proves auditability for customers/compliance.


The metric

  • Per-scan: replay_match = 1 if the SBOM/VEX/findings and their hashes are identical; otherwise 0.
  • Windowed: Replay Fidelity = (Σ replay_match) / (# historical replays in window); see the sketch below.
  • Breakdown: also track by scanner, language, image base, feed version, and environment.
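
A minimal sketch of the metric computation, assuming each replay outcome is a small record with a replay_match flag and a couple of breakdown labels (field names here are illustrative, not an existing StellaOps schema):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ReplayResult:
    scan_id: str
    replay_match: bool          # True if all artifact hashes were identical
    scanner: str                # breakdown labels (illustrative)
    ecosystem: str

def replay_fidelity(results):
    """Windowed fidelity: sum(replay_match) / number of replays in the window."""
    if not results:
        return None             # no replays in the window -> metric undefined
    return sum(r.replay_match for r in results) / len(results)

def fidelity_by(results, key):
    """Per-bucket fidelity, e.g. key=lambda r: r.scanner."""
    buckets = defaultdict(list)
    for r in results:
        buckets[key(r)].append(r)
    return {k: replay_fidelity(v) for k, v in buckets.items()}

# Example: overall and per-scanner fidelity for one window of replays.
window = [
    ReplayResult("s1", True, "stella-core", "npm"),
    ReplayResult("s2", True, "stella-core", "pip"),
    ReplayResult("s3", False, "stella-java", "maven"),
]
print(replay_fidelity(window))                      # 0.666...
print(fidelity_by(window, lambda r: r.scanner))     # {'stella-core': 1.0, 'stella-java': 0.0}
```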

What must be captured in the scan manifest

  • Exact source refs (image digest / repo SHA) and container layer digests
  • Scanner build ID + config (flags, rules, lattice/policy sets, seeds)
  • Feed snapshots (CVE DB, OVAL, vendor advisories) as content-addressed bundles
  • Normalization/version of SBOM schema (e.g., CycloneDX 1.6 vs SPDX 3.0.1)
  • Platform facts (OS/kernel, tz, locale), toolchain versions, clock policy

Pass/Fail rules you can ship

  • Green: Fidelity ≥ 0.98 over 30 days, and no bucket < 0.95
  • Warn: Any bucket drops by ≥ 2% week-over-week
  • Fail the pipeline: If fidelity < 0.90 or any regulated project < 0.95

Minimal replay harness (outline)

  1. Pick N historical scans (e.g., the last 200, or a set stratified by image and language).
  2. Restore their frozen manifest (scanner binary, feed bundle, policy lattice, seeds).
  3. Rerun in a pinned runtime (OCI digest, pinned kernel in VM, fixed TZ/locale).
  4. Compare artifacts: SBOM JSON, VEX JSON, findings list, evidence blobs → SHA256.
  5. Emit: pass/fail, diff summary, and the “cause” tag if mismatch (feed, policy, runtime, code).
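
A minimal sketch of this harness loop, with the restore/rerun/classify steps left as injected placeholders (only the SHA-256 comparison in step 4 is concrete):

```python
import hashlib

def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def compare_artifacts(original: dict, replayed: dict) -> dict:
    """Compare artifact-name -> bytes maps; return pass/fail plus a short diff summary."""
    mismatched = [
        name for name in original
        if sha256_bytes(original[name]) != sha256_bytes(replayed.get(name, b""))
    ]
    return {"match": not mismatched, "mismatched_artifacts": mismatched}

def replay_one(manifest: dict, restore, rerun, classify_cause) -> dict:
    """One harness iteration: restore frozen inputs, rerun, compare, tag the cause.

    `restore`, `rerun`, and `classify_cause` are injected callables standing in for
    whatever StellaOps already has; only the comparison logic is shown here.
    """
    original = restore(manifest)            # step 2: frozen scanner/feeds/policy + stored artifacts
    replayed = rerun(manifest)              # step 3: pinned runtime, no network, fixed TZ/locale
    result = compare_artifacts(original, replayed)           # step 4: SHA-256 comparison
    if not result["match"]:
        result["cause"] = classify_cause(manifest, result)   # step 5: feed/policy/runtime/code tag
    return result
```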

Dashboard (what to show)

  • Fidelity % (30/90-day) + sparkline
  • Top offenders (by language/scanner/policy set)
  • “Cause of mismatch” histogram (feed vs runtime vs code vs policy)
  • Click-through: deterministic diff (e.g., which CVEs flipped and why)

Quick wins for StellaOps

  • Treat feeds as immutable snapshots (content-addressed tar.zst) and record their digest in each scan.
  • Run scanner in a repro shell (OCI image digest + fixed TZ/locale + no network).
  • Normalize SBOM/VEX (key order, whitespace, float precision) before hashing.
  • Add a stella replay --from MANIFEST.json command + nightly cron to sample replays.
  • Store replay_result rows; expose /metrics for Prometheus and a CI badge: Replay Fidelity: 99.2% (see the sketch below).
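
A sketch of the /metrics side, assuming the third-party prometheus-client package; the metric name and labels are suggestions, not an existing contract:

```python
import time

from prometheus_client import Gauge, start_http_server  # third-party: pip install prometheus-client

REPLAY_FIDELITY = Gauge(
    "replay_fidelity",
    "Share of sampled historical scans that replay bit-for-bit",
    ["window", "scanner", "ecosystem"],   # breakdown labels are suggestions only
)

def publish(fidelity_by_bucket: dict) -> None:
    """fidelity_by_bucket maps (window, scanner, ecosystem) -> a value in [0, 1]."""
    for (window, scanner, ecosystem), value in fidelity_by_bucket.items():
        REPLAY_FIDELITY.labels(window=window, scanner=scanner, ecosystem=ecosystem).set(value)

if __name__ == "__main__":
    start_http_server(9108)                               # serves /metrics on :9108
    publish({("30d", "stella-core", "npm"): 0.992})       # the same value can feed the CI badge
    while True:
        time.sleep(60)                                    # keep the exporter alive
```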

Want me to draft the stella replay CLI spec and a DB table (DDL) you can drop into Postgres? Below is an extended “Replay Fidelity” design plus a concrete development implementation plan you can hand to engineering. I'm assuming StellaOps does container/app security scans that output an SBOM + findings (and optionally VEX) and uses vulnerability “feeds” plus policy/lattice/rules.


1) Extend the concept: Replay Fidelity as a product capability

1.1 Fidelity levels (so you can be strict without being brittle)

Instead of a single yes/no, define tiers that you can report and gate on:

  1. Bitwise Fidelity (BF)

    • Definition: All primary artifacts (SBOM, findings, VEX, evidence) match byte-for-byte after canonicalization.
    • Use: strongest auditability, catch ordering/nondeterminism.
  2. Semantic Fidelity (SF)

    • Definition: The meaning matches even if formatting differs (e.g., key order, whitespace, timestamps).
    • How: compare normalized objects: same packages, versions, CVEs, fix versions, severities, policy verdicts.
    • Use: protects you from “cosmetic diffs” and helps triage.
  3. Policy Fidelity (PF)

    • Definition: Final policy decision (pass/fail + reason codes) matches.
    • Use: useful when outputs may evolve but governance outcome must remain stable.

Recommended reporting:

  • Dashboard shows BF, SF, PF together.
  • Default engineering SLO: BF ≥ 0.98; compliance SLO: BF ≥ 0.95 for regulated projects; PF should be ~1.0 unless policy changed intentionally.
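
A sketch of how the three tiers could be checked for a single scan, assuming canonical artifact bytes plus parsed findings and policy decisions are at hand (field names are illustrative):

```python
import hashlib

def bitwise_match(canonical_a: bytes, canonical_b: bytes) -> bool:
    """BF: byte-for-byte equality of the canonicalized artifacts."""
    return hashlib.sha256(canonical_a).digest() == hashlib.sha256(canonical_b).digest()

def semantic_match(findings_a: list[dict], findings_b: list[dict]) -> bool:
    """SF: same set of (purl, cve, severity, fix) tuples; formatting and order are ignored."""
    def key_set(findings):
        return {(f["purl"], f["cve_id"], f["severity"], f.get("fix_version")) for f in findings}
    return key_set(findings_a) == key_set(findings_b)

def policy_match(decision_a: dict, decision_b: dict) -> bool:
    """PF: final verdict and reason codes match; everything else may differ."""
    return (decision_a["verdict"], sorted(decision_a["reason_codes"])) == \
           (decision_b["verdict"], sorted(decision_b["reason_codes"]))
```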

1.2 “Why did it drift?”—Mismatch classification taxonomy

When a replay fails, auto-tag the cause so humans don't diff JSON by hand.

Primary mismatch classes

  • Feed drift: CVE/OVAL/vendor advisory snapshot differs.
  • Policy drift: policy/lattice/rules differ (or default rule set changed).
  • Runtime drift: base image / libc / kernel / locale / tz / CPU arch differences.
  • Scanner drift: scanner binary build differs or dependency versions changed.
  • Nondeterminism: ordering instability, concurrency race, unseeded RNG, time-based logic.
  • External IO: network calls, “latest” resolution, remote package registry changes.

Output: a mismatch_reason plus a short diff_summary.


1.3 Deterministic “scan envelope” design

A replay only works if the scan is fully specified.

Scan envelope components

  • Inputs: image digest, repo commit SHA, build provenance, layers digests.
  • Scanner: scanner OCI image digest (or binary digest), config flags, feature toggles.
  • Feeds: content-addressed feed bundle digests (see §2.3).
  • Policy/rules: git commit SHA + content digest of compiled rules.
  • Environment: OS/arch, tz/locale, “clock mode”, network mode, CPU count.
  • Normalization: “canonicalization version” for SBOM/VEX/findings.

1.4 Canonicalization so “bitwise” is meaningful

To make BF achievable:

  • Canonical JSON serialization (sorted keys, stable array ordering, normalized floats)
  • Strip/normalize volatile fields (timestamps, “scan_duration_ms”, hostnames)
  • Stable ordering for lists: packages sorted by (purl, version), vulnerabilities by (cve_id, affected_purl)
  • Deterministic IDs: if you generate internal IDs, derive from stable hashes of content (not UUID4)
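
A sketch of a canonicalizer along these lines; the volatile-field list and the generic sort key are illustrative choices:

```python
import hashlib
import json

VOLATILE_FIELDS = {"timestamp", "scan_duration_ms", "hostname"}   # illustrative list

def canonicalize(obj):
    """Recursively drop volatile fields and sort list entries by a stable key."""
    if isinstance(obj, dict):
        return {k: canonicalize(v) for k, v in sorted(obj.items()) if k not in VOLATILE_FIELDS}
    if isinstance(obj, list):
        cleaned = [canonicalize(v) for v in obj]
        # Sort by each entry's canonical JSON so array order is stable regardless of input order.
        return sorted(cleaned, key=lambda v: json.dumps(v, sort_keys=True))
    return obj

def canonical_bytes(obj) -> bytes:
    """Canonical JSON: sorted keys, compact separators, ASCII-only output."""
    return json.dumps(canonicalize(obj), sort_keys=True, separators=(",", ":"),
                      ensure_ascii=True).encode()

def content_id(obj, prefix: str) -> str:
    """Deterministic internal ID derived from content instead of UUID4."""
    return f"{prefix}-{hashlib.sha256(canonical_bytes(obj)).hexdigest()[:16]}"
```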

1.5 Sampling strategy

You don't need to replay everything.

Nightly sample, stratified by:

  • language ecosystem (npm, pip, maven, go, rust…)
  • scanner engine
  • base OS
  • “regulatory tier”
  • image size/complexity

Plus: always replay “golden canaries” (a fixed set of reference images) after every scanner release and feed ingestion pipeline change.
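
A sketch of the stratified nightly sample, assuming historical scans carry the stratum labels above; a fixed RNG seed keeps the sample selection itself reproducible:

```python
import random
from collections import defaultdict

def stratified_sample(scans, strata_key, per_stratum, canaries, seed=20251212):
    """Pick up to `per_stratum` scans from each stratum, plus the fixed golden-canary set."""
    rng = random.Random(seed)                 # fixed seed -> same sample for the same history
    by_stratum = defaultdict(list)
    for scan in scans:
        by_stratum[strata_key(scan)].append(scan)
    sample = list(canaries)                   # golden canaries are always replayed
    for bucket in by_stratum.values():
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample

# Example: stratify by (ecosystem, base_os); the field names are illustrative.
# sample = stratified_sample(history, lambda s: (s["ecosystem"], s["base_os"]), 5, canary_scans)
```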


2) Technical architecture blueprint

2.1 System components

  1. Manifest Writer (in the scan pipeline)

    • Produces ScanManifest v1 JSON
    • Records all digests and versions
  2. Artifact Store

    • Stores SBOM, findings, VEX, evidence blobs
    • Stores canonical hashes for BF checks
  3. Feed Snapshotter

    • Periodically builds immutable feed bundles
    • Content-addressed (digest-keyed)
    • Stores metadata (source URLs, generation timestamp, signature)
  4. Replay Orchestrator

    • Chooses historical scans to replay
    • Launches “replay executor” jobs
  5. Replay Executor

    • Runs scanner in pinned container image
    • Network off, tz fixed, clock policy applied
    • Produces new artifacts + hashes
  6. Diff & Scoring Engine

    • Computes BF/SF/PF
    • Generates mismatch classification + diff summary
  7. Metrics + UI Dashboard

    • Prometheus metrics
    • UI for drill-down diffs

2.2 Data model (Postgres-friendly)

Core tables

  • scan_manifests

    • scan_id (pk)
    • manifest_json
    • manifest_sha256
    • created_at
  • scan_artifacts

    • scan_id (fk)
    • artifact_type (sbom|findings|vex|evidence)
    • artifact_uri
    • canonical_sha256
    • schema_version
  • feed_snapshots

    • feed_digest (pk)
    • bundle_uri
    • sources_json
    • generated_at
    • signature
  • replay_runs

    • replay_id (pk)
    • original_scan_id (fk)
    • status (queued|running|passed|failed)
    • bf_match bool, sf_match bool, pf_match bool
    • mismatch_reason
    • diff_summary_json
    • started_at, finished_at
    • executor_env_json (arch, tz, cpu, image digest)

Indexes

  • (created_at) for sampling windows
  • (mismatch_reason, finished_at) for triage
  • (scanner_version, ecosystem) for breakdown dashboards

2.3 Feed Snapshotting (the key to long-term replay)

Feed bundle format

  • feeds/<source>/<date>/... inside a tar.zst

  • manifest file inside bundle: feed_bundle_manifest.json containing:

    • source URLs
    • retrieval commit/etag (if any)
    • file hashes
    • generated_by version

Content addressing

  • Digest of the entire bundle (sha256(tar.zst)) is the reference.
  • Scans record only the digest + URI.

Immutability

  • Store bundles in object storage with WORM / retention if you need compliance.
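
A sketch of building and content-addressing a bundle, assuming the third-party zstandard package for .zst compression; only the file-hash part of feed_bundle_manifest.json is shown:

```python
import hashlib
import io
import json
import tarfile
from pathlib import Path

import zstandard  # third-party package, assumed available: pip install zstandard

def _normalize(ti: tarfile.TarInfo) -> tarfile.TarInfo:
    ti.mtime = 0                      # zero timestamps/ownership so repeated builds hash identically
    ti.uid = ti.gid = 0
    ti.uname = ti.gname = ""
    return ti

def build_feed_bundle(feeds_dir: Path, out_path: Path) -> str:
    """Tar the feeds/ tree plus its manifest, compress with zstd, return the bundle digest."""
    file_hashes = {
        str(p.relative_to(feeds_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(feeds_dir.rglob("*")) if p.is_file()
    }
    # Real bundles would also record source URLs, etags, and generated_by here.
    manifest = json.dumps({"files": file_hashes}, sort_keys=True).encode()

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add(feeds_dir, arcname="feeds", filter=_normalize)
        info = _normalize(tarfile.TarInfo("feed_bundle_manifest.json"))
        info.size = len(manifest)
        tar.addfile(info, io.BytesIO(manifest))

    compressed = zstandard.ZstdCompressor().compress(buf.getvalue())
    out_path.write_bytes(compressed)
    return "sha256:" + hashlib.sha256(compressed).hexdigest()   # scans record only digest + URI
```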

2.4 Replay execution sandbox

For determinism, enforce:

  • No network (K8s NetworkPolicy, firewall rules, or container runtime flags)

  • Fixed TZ/locale

  • Pinned container image digest

  • Clock policy

    • Either “real time but recorded” or “frozen time at original scan timestamp”
    • If scanner logic uses current date for severity windows, freeze time
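
A sketch of how the replay executor could enforce these controls when wrapping a plain docker run (the image name and the SCAN_CLOCK variable are illustrative; substitute your actual runtime or K8s equivalents):

```python
import subprocess

def run_pinned_replay(scanner_digest: str, manifest_path: str, bundles_dir: str,
                      frozen_time: str | None = None) -> int:
    """Run the scanner image by digest with network off, fixed TZ/locale, read-only mounts."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",                        # no network
        "-e", "TZ=UTC", "-e", "LC_ALL=C",           # fixed tz + locale
        "-v", f"{bundles_dir}:/bundles:ro",         # feed + policy bundles, read-only
        "-v", f"{manifest_path}:/manifest.json:ro",
    ]
    if frozen_time:
        # Actually freezing the clock needs extra tooling (e.g. libfaketime inside the image);
        # here the original timestamp is only passed through (SCAN_CLOCK is a made-up name).
        cmd += ["-e", f"SCAN_CLOCK={frozen_time}"]
    cmd += [f"registry.example/stella-scanner@{scanner_digest}",   # illustrative image reference
            "replay", "--from", "/manifest.json"]
    return subprocess.run(cmd, check=False).returncode
```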

3) Development implementation plan

I'll lay this out as workstreams plus a sprint-by-sprint plan. You can compress or expand it depending on team size.

Workstream A — Scan Manifest & Canonical Artifacts

Goal: every scan is replayable on paper, even before replays run.

Deliverables

  • ScanManifest v1 schema + writer integrated into scan pipeline
  • Canonicalization library + canonical hashing for all artifacts

Acceptance criteria

  • Every scan stores: input digests, scanner digest, policy digest, feed digest placeholders
  • Artifact hashes are stable across repeated runs in the same environment

Workstream B — Feed Snapshotting & Policy Versioning

Goal: eliminate “feed drift” by pinning immutable inputs.

Deliverables

  • Feed bundle builder + signer + uploader
  • Policy/rules bundler (compiled rules bundle, digest recorded)

Acceptance criteria

  • New scans reference feed bundle digests (not “latest”)
  • A scan can be re-run with the same feed bundle and policy bundle

Workstream C — Replay Runner & Diff Engine

Goal: execute historical scans and score BF/SF/PF with actionable diffs.

Deliverables

  • stella replay --from manifest.json
  • Orchestrator job to schedule replays
  • Diff engine + mismatch classifier
  • Storage of replay results

Acceptance criteria

  • Replay produces deterministic artifacts in a pinned environment
  • Dashboard/CLI shows BF/SF/PF + diff summary for failures

Workstream D — Observability, Dashboard, and CI Gates

Goal: make fidelity visible and enforceable.

Deliverables

  • Prometheus metrics: replay_fidelity_bf, replay_fidelity_sf, replay_fidelity_pf
  • Breakdown labels (scanner, ecosystem, policy_set, base_os)
  • Alerts for drop thresholds
  • CI gate option: “block release if BF < threshold on canary set”

Acceptance criteria

  • Engineering can see drift within 24h
  • Releases are blocked when fidelity regressions occur

4) Suggested sprint plan with concrete tasks

Sprint 0 — Design lock + baseline

Tasks

  • Define manifest schema: ScanManifest v1 fields + versioning rules
  • Decide canonicalization rules (what is normalized vs preserved)
  • Choose initial “golden canary” scan set (10–20 representative targets)
  • Add “replay-fidelity” epic with ownership & SLIs/SLOs

Exit criteria

  • Approved schema + canonicalization spec
  • Canary set stored and tagged

Sprint 1 — Manifest writer + artifact hashing (MVP)

Tasks

  • Implement manifest writer in scan pipeline

  • Store manifest_json + manifest_sha256

  • Implement canonicalization + hashing for:

    • findings list (sorted)
    • SBOM (normalized)
    • VEX (if present)
  • Persist canonical hashes in scan_artifacts

Exit criteria

  • Two identical scans in the same environment yield identical artifact hashes

  • A “manifest export” endpoint/CLI works:

    • stella scan --emit-manifest out.json

Sprint 2 — Feed snapshotter + policy bundling

Tasks

  • Build feed bundler job:

    • pull raw sources
    • normalize layout
    • generate feed_bundle_manifest.json
    • tar.zst + sha256
    • upload + record in feed_snapshots
  • Update scan pipeline:

    • resolve feed bundle digest at scan start
    • record digest in scan manifest
  • Bundle policy/lattice:

    • compile rules into an immutable artifact
    • record policy bundle digest in manifest

Exit criteria

  • Scans reference immutable feed + policy digests
  • You can fetch feed bundle by digest and reproduce the same feed inputs

Sprint 3 — Replay executor + “no network” sandbox

Tasks

  • Create replay container image / runtime wrapper

  • Implement stella replay --from MANIFEST.json

    • pulls scanner image by digest
    • mounts feed bundle + policy bundle
    • runs in network-off mode
    • applies tz/locale + clock mode
  • Store replay outputs as artifacts (replay_scan_id or replay_id linkage)

Exit criteria

  • Replay runs end-to-end for canary scans
  • Deterministic runtime controls verified (no DNS egress, fixed tz)

Sprint 4 — Diff engine + mismatch classification

Tasks

  • Implement BF compare (canonical hashes)

  • Implement SF compare (semantic JSON/object comparison)

  • Implement PF compare (policy decision equivalence)

  • Implement mismatch classification rules (a sketch follows this task list):

    • if feed digest differs → feed drift
    • if scanner digest differs → scanner drift
    • if environment differs → runtime drift
    • else → nondeterminism (with sub-tags for ordering/time/RNG)
  • Generate diff_summary_json:

    • top N changed CVEs
    • packages added/removed
    • policy verdict changes
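
A sketch of a classifier implementing those rules (manifest fields follow the ScanManifest skeleton in §5.1; the nondeterminism sub-tag heuristics and diff keys are placeholders):

```python
def classify_mismatch(original_manifest: dict, replay_manifest: dict, diff: dict) -> str:
    """Return a mismatch_reason tag by comparing the two manifests, in rule order."""
    o, r = original_manifest, replay_manifest
    if o["feeds"] != r["feeds"]:
        return "feed_drift"
    if o["policy"]["policy_bundle_digest"] != r["policy"]["policy_bundle_digest"]:
        return "policy_drift"                                    # from the §1.2 taxonomy
    if o["scanner"]["scanner_image_digest"] != r["scanner"]["scanner_image_digest"]:
        return "scanner_drift"
    if o["environment"] != r["environment"]:
        return "runtime_drift"
    # Same inputs, scanner, and environment -> nondeterminism; sub-tag heuristics are placeholders.
    if diff.get("only_ordering_changed"):
        return "nondeterminism:ordering"
    if diff.get("timestamp_like_fields_changed"):
        return "nondeterminism:time"
    return "nondeterminism:unknown"
```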

Exit criteria

  • Every failed replay has a cause tag and a diff summary that's useful in <2 minutes
  • Engineers can reproduce failures locally with the manifest

Sprint 5 — Dashboard + alerts + CI gate

Tasks

  • Expose Prometheus metrics from replay service

  • Build dashboard:

    • BF/SF/PF trends
    • breakdown by ecosystem/scanner/policy
    • mismatch cause histogram
  • Add alerting rules (drop threshold, bucket regression)

  • Add CI gate mode:

    • “run replays on canary set for this release candidate”
    • block merge if BF < target

Exit criteria

  • Fidelity visible to leadership and engineering
  • Release process is protected by canary replays

Sprint 6 — Hardening + compliance polish

Tasks

  • Backward-compatible manifest upgrades:

    • manifest_version bump rules
    • migration support
  • Artifact signing / integrity:

    • sign manifest hash
    • optional transparency log later
  • Storage & retention policies (cost controls)

  • Runbook + on-call playbook

Exit criteria

  • Audit story is complete: “show me exactly how scan X was produced”
  • Operational load is manageable and cost-bounded

5) Engineering specs you can start implementing immediately

5.1 ScanManifest v1 skeleton (example)

```json
{
  "manifest_version": "1.0",
  "scan_id": "scan_123",
  "created_at": "2025-12-12T10:15:30Z",

  "input": {
    "type": "oci_image",
    "image_ref": "registry/app@sha256:...",
    "layers": ["sha256:...", "sha256:..."],
    "source_provenance": {"repo_sha": "abc123", "build_id": "ci-999"}
  },

  "scanner": {
    "engine": "stella",
    "scanner_image_digest": "sha256:...",
    "scanner_version": "2025.12.0",
    "config_digest": "sha256:...",
    "flags": ["--deep", "--vex"]
  },

  "feeds": {
    "vuln_feed_bundle_digest": "sha256:...",
    "license_db_digest": "sha256:..."
  },

  "policy": {
    "policy_bundle_digest": "sha256:...",
    "policy_set": "prod-default"
  },

  "environment": {
    "arch": "amd64",
    "os": "linux",
    "tz": "UTC",
    "locale": "C",
    "network": "disabled",
    "clock_mode": "frozen",
    "clock_value": "2025-12-12T10:15:30Z"
  },

  "normalization": {
    "canonicalizer_version": "1.2.0",
    "sbom_schema": "cyclonedx-1.6",
    "vex_schema": "cyclonedx-vex-1.0"
  }
}
```

5.2 CLI spec (minimal)

  • stella scan ... --emit-manifest MANIFEST.json --emit-artifacts-dir out/
  • stella replay --from MANIFEST.json --out-dir replay_out/
  • stella diff --a out/ --b replay_out/ --mode bf|sf|pf --json

6) Testing strategy (to prevent determinism regressions)

Unit tests

  • Canonicalization: same object → same bytes
  • Sorting stability: randomized input order → stable output
  • Hash determinism
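
A sketch of these unit tests in pytest style, written against the canonicalizer sketched in §1.4 (the import path is illustrative):

```python
import random

from canonicalizer import canonical_bytes  # illustrative import path (see the §1.4 sketch)

SAMPLE = {
    "packages": [{"purl": "pkg:npm/a", "version": "1.0"},
                 {"purl": "pkg:npm/b", "version": "2.0"}],
    "scan_duration_ms": 1234,   # volatile field, expected to be stripped
}

def test_same_object_same_bytes():
    # Canonicalization: serializing the same object twice yields identical bytes.
    assert canonical_bytes(SAMPLE) == canonical_bytes(SAMPLE)

def test_randomized_input_order_is_stable():
    # Sorting stability: shuffling list order must not change the canonical output.
    shuffled = dict(SAMPLE)
    shuffled["packages"] = random.sample(SAMPLE["packages"], k=len(SAMPLE["packages"]))
    assert canonical_bytes(shuffled) == canonical_bytes(SAMPLE)

def test_volatile_fields_do_not_affect_hash():
    # Hash determinism: volatile fields must not leak into the canonical form.
    assert canonical_bytes(dict(SAMPLE, scan_duration_ms=9999)) == canonical_bytes(SAMPLE)
```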

Integration tests

  • Golden canaries:

    • run scan twice in same runner → BF match
    • replay from manifest → BF match
  • “Network leak” test:

    • DNS requests must be zero
  • “Clock leak” test:

    • freeze time; ensure outputs do not include real timestamps

Chaos tests

  • Vary CPU count, run concurrency, run order → still BF match
  • Randomized scheduling / thread interleavings to find races

7) Operational policies (so it stays useful)

Retention & cost controls

  • Keep full artifacts for regulated scans (e.g., 1–7 years)

  • For non-regulated:

    • keep manifests + canonical hashes long-term
    • expire heavy evidence blobs after N days
  • Compress large artifacts and dedupe by digest

Alerting examples

  • BF drops by ≥2% week-over-week (any major bucket) → warn
  • BF < 0.90 overall or regulated BF < 0.95 → page / block release

Triage workflow

  • Failed replay auto-creates a ticket with:

    • manifest link
    • mismatch_reason
    • diff_summary
    • reproduction command

8) What “done” looks like (definition of success)

  • Any customer/auditor can pick a scan from 6 months ago and you can:

    1. retrieve manifest + feed bundle + policy bundle by digest
    2. replay in a pinned sandbox
    3. show BF/SF/PF results and diffs
  • Engineering sees drift quickly and can attribute it to feed vs scanner vs runtime.


If you want, I can also provide:

  • a Postgres DDL for the tables above,
  • a Prometheus metrics contract (names + labels + example queries),
  • and a diff_summary_json schema that supports a UI “diff view” without reprocessing artifacts.