Here's a simple, practical idea to make your scans provably repeatable over time and catch drift fast.

# Replay Fidelity (what, why, how)

**What it is:** the share of historical scans that reproduce **bit-for-bit** when re-run using their saved manifests (inputs, versions, rules, seeds). Higher = more deterministic system.

**Why you want it:** it exposes hidden nondeterminism (feed drift, time-dependent rules, race conditions, unstable dependency resolution) and proves auditability for customers/compliance.

---

## The metric

* **Per-scan:** `replay_match = 1` if SBOM/VEX/findings + hashes are identical; else `0`.
* **Windowed:** `Replay Fidelity = (Σ replay_match) / (# historical replays in window)`.
* **Breakdown:** also track by scanner, language, base image, feed version, and environment.

---

## What must be captured in the scan manifest

* Exact source refs (image digest / repo SHA), container layer digests
* Scanner build ID + config (flags, rules, lattice/policy sets, seeds)
* Feed snapshots (CVE DB, OVAL, vendor advisories) as **content-addressed** bundles
* Normalization/version of the SBOM schema (e.g., CycloneDX 1.6 vs SPDX 3.0.1)
* Platform facts (OS/kernel, tz, locale), toolchain versions, clock policy

---

## Pass/Fail rules you can ship

* **Green:** fidelity ≥ 0.98 over 30 days, and no bucket < 0.95
* **Warn:** any bucket drops by ≥ 2% week-over-week
* **Fail the pipeline:** fidelity < 0.90 overall, or any regulated project < 0.95

---

## Minimal replay harness (outline)

1. Pick N historical scans (e.g., the last 200, or a sample stratified by language/ecosystem).
2. Restore their **frozen** manifest (scanner binary, feed bundle, policy lattice, seeds).
3. Re-run in a pinned runtime (OCI digest, pinned kernel in a VM, fixed TZ/locale).
4. Compare artifacts: SBOM JSON, VEX JSON, findings list, evidence blobs → SHA-256.
5. Emit: pass/fail, a diff summary, and a "cause" tag on mismatch (feed, policy, runtime, code).

---

## Dashboard (what to show)

* Fidelity % (30/90-day) + sparkline
* Top offenders (by language/scanner/policy set)
* "Cause of mismatch" histogram (feed vs runtime vs code vs policy)
* Click-through: deterministic diff (e.g., which CVEs flipped and why)

---

## Quick wins for Stella Ops

* Treat **feeds as immutable snapshots** (content-addressed tar.zst) and record their digest in each scan.
* Run the scanner in a **repro shell** (OCI image digest + fixed TZ/locale + no network).
* Normalize SBOM/VEX (key order, whitespace, float precision) before hashing (see the sketch below).
* Add a `stella replay --from MANIFEST.json` command + a nightly cron job to sample replays.
* Store `replay_result` rows; expose `/metrics` for Prometheus and a CI badge: `Replay Fidelity: 99.2%`.
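As a starting point for the normalize-before-hash quick win and the per-scan/windowed metric, here is a minimal Python sketch. The volatile key names and artifact shapes are assumptions, and artifact-specific list ordering is omitted:

```python
import hashlib
import json

# Keys assumed to be volatile; swap in the real schema's fields.
VOLATILE_KEYS = {"timestamp", "scan_duration_ms", "hostname"}

def _strip_volatile(node):
    """Recursively drop volatile keys so only stable content gets hashed."""
    if isinstance(node, dict):
        return {k: _strip_volatile(v) for k, v in node.items() if k not in VOLATILE_KEYS}
    if isinstance(node, list):
        return [_strip_volatile(v) for v in node]
    return node

def canonical_sha256(artifact: dict) -> str:
    """Canonical JSON: sorted keys, compact separators, UTF-8, volatile fields stripped."""
    canonical = json.dumps(_strip_volatile(artifact), sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def replay_match(original: dict, replayed: dict) -> int:
    """Per-scan metric: 1 if the canonical hashes are identical, else 0."""
    return int(canonical_sha256(original) == canonical_sha256(replayed))

def replay_fidelity(matches: list[int]) -> float:
    """Windowed metric: sum(replay_match) / number of replays in the window."""
    return sum(matches) / len(matches) if matches else 1.0  # empty window → nothing drifted
```

Storing one canonical hash per artifact (SBOM, findings, VEX) turns the bitwise comparison in the replay harness into a simple string-equality check.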
Want me to draft the `stella replay` CLI spec and the DB table (DDL) you can drop into Postgres?

---

Below is an extended "Replay Fidelity" design **plus a concrete development implementation plan** you can hand to engineering. I'm assuming Stella Ops is doing container/app security scans that output SBOM + findings (and optionally VEX), and uses vulnerability "feeds" and policy/lattice/rules.

---

## 1) Extend the concept: Replay Fidelity as a product capability

### 1.1 Fidelity levels (so you can be strict without being brittle)

Instead of a single yes/no, define **tiers** that you can report and gate on:

1. **Bitwise Fidelity (BF)**
   * *Definition:* All primary artifacts (SBOM, findings, VEX, evidence) match **byte-for-byte** after canonicalization.
   * *Use:* strongest auditability, catches ordering/nondeterminism.
2. **Semantic Fidelity (SF)**
   * *Definition:* The *meaning* matches even if formatting differs (e.g., key order, whitespace, timestamps).
   * *How:* compare normalized objects: same packages, versions, CVEs, fix versions, severities, policy verdicts.
   * *Use:* protects you from "cosmetic diffs" and helps triage.
3. **Policy Fidelity (PF)**
   * *Definition:* Final policy decision (pass/fail + reason codes) matches.
   * *Use:* useful when outputs may evolve but the governance outcome must remain stable.

**Recommended reporting:**

* Dashboard shows BF, SF, PF together.
* Default engineering SLO: **BF ≥ 0.98**; compliance SLO: **BF ≥ 0.95** for regulated projects; PF should be ~1.0 unless policy changed intentionally.

---

### 1.2 "Why did it drift?" — mismatch classification taxonomy

When a replay fails, auto-tag the cause so humans don't diff JSON by hand.

**Primary mismatch classes**

* **Feed drift:** CVE/OVAL/vendor advisory snapshot differs.
* **Policy drift:** policy/lattice/rules differ (or the default rule set changed).
* **Runtime drift:** base image / libc / kernel / locale / tz / CPU arch differences.
* **Scanner drift:** scanner binary build differs or dependency versions changed.
* **Nondeterminism:** ordering instability, concurrency race, unseeded RNG, time-based logic.
* **External IO:** network calls, "latest" resolution, remote package registry changes.

**Output:** a `mismatch_reason` plus a short `diff_summary`.

---

### 1.3 Deterministic "scan envelope" design

A replay only works if the scan is fully specified.

**Scan envelope components**

* **Inputs:** image digest, repo commit SHA, build provenance, layer digests.
* **Scanner:** scanner OCI image digest (or binary digest), config flags, feature toggles.
* **Feeds:** content-addressed feed bundle digests (see §2.3).
* **Policy/rules:** git commit SHA + content digest of compiled rules.
* **Environment:** OS/arch, tz/locale, "clock mode", network mode, CPU count.
* **Normalization:** "canonicalization version" for SBOM/VEX/findings.

---

### 1.4 Canonicalization so "bitwise" is meaningful

To make BF achievable:

* Canonical JSON serialization (sorted keys, stable array ordering, normalized floats)
* Strip/normalize volatile fields (timestamps, "scan_duration_ms", hostnames)
* Stable ordering for lists: packages sorted by `(purl, version)`, vulnerabilities by `(cve_id, affected_purl)`
* Deterministic IDs: if you generate internal IDs, derive them from stable hashes of content (not UUID4)

---

### 1.5 Sampling strategy

You don't need to replay everything.

**Nightly sample:** stratified by:

* language ecosystem (npm, pip, maven, go, rust…)
* scanner engine
* base OS
* "regulatory tier"
* image size/complexity

**Plus:** always replay "golden canaries" (a fixed set of reference images) after every scanner release and feed ingestion pipeline change.
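A minimal sketch of that nightly sampler, assuming each historical scan is available as a dict carrying its stratum fields; the field names, per-stratum quota, and seed are hypothetical:

```python
import random
from collections import defaultdict

# Stratum fields assumed to be recorded on each historical scan record.
STRATUM_FIELDS = ("ecosystem", "scanner_engine", "base_os", "regulatory_tier")

def nightly_sample(scans: list[dict], canaries: list[dict],
                   per_stratum: int = 5, seed: int = 0) -> list[dict]:
    """Pick up to `per_stratum` scans per stratum, plus every golden canary."""
    rng = random.Random(seed)  # seeded so the sampler itself is reproducible
    buckets = defaultdict(list)
    for scan in scans:
        buckets[tuple(scan.get(f) for f in STRATUM_FIELDS)].append(scan)
    sample = list(canaries)  # canaries are always replayed
    for bucket in buckets.values():
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample
```

Seeding the sampler keeps the replay workload itself reproducible, which makes week-over-week fidelity comparisons cleaner.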
---

## 2) Technical architecture blueprint

### 2.1 System components

1. **Manifest Writer (in the scan pipeline)**
   * Produces `ScanManifest v1` JSON
   * Records all digests and versions
2. **Artifact Store**
   * Stores SBOM, findings, VEX, evidence blobs
   * Stores canonical hashes for BF checks
3. **Feed Snapshotter**
   * Periodically builds immutable feed bundles
   * Content-addressed (digest-keyed)
   * Stores metadata (source URLs, generation timestamp, signature)
4. **Replay Orchestrator**
   * Chooses historical scans to replay
   * Launches "replay executor" jobs
5. **Replay Executor**
   * Runs the scanner in a pinned container image
   * Network off, tz fixed, clock policy applied
   * Produces new artifacts + hashes
6. **Diff & Scoring Engine**
   * Computes BF/SF/PF
   * Generates mismatch classification + diff summary
7. **Metrics + UI Dashboard**
   * Prometheus metrics
   * UI for drill-down diffs

---

### 2.2 Data model (Postgres-friendly)

**Core tables**

* `scan_manifests`
  * `scan_id (pk)`
  * `manifest_json`
  * `manifest_sha256`
  * `created_at`
* `scan_artifacts`
  * `scan_id (fk)`
  * `artifact_type` (sbom|findings|vex|evidence)
  * `artifact_uri`
  * `canonical_sha256`
  * `schema_version`
* `feed_snapshots`
  * `feed_digest (pk)`
  * `bundle_uri`
  * `sources_json`
  * `generated_at`
  * `signature`
* `replay_runs`
  * `replay_id (pk)`
  * `original_scan_id (fk)`
  * `status` (queued|running|passed|failed)
  * `bf_match bool`, `sf_match bool`, `pf_match bool`
  * `mismatch_reason`
  * `diff_summary_json`
  * `started_at`, `finished_at`
  * `executor_env_json` (arch, tz, cpu, image digest)

**Indexes**

* `(created_at)` for sampling windows
* `(mismatch_reason, finished_at)` for triage
* `(scanner_version, ecosystem)` for breakdown dashboards

---

### 2.3 Feed Snapshotting (the key to long-term replay)

**Feed bundle format**

* `feeds///...` inside a tar.zst
* manifest file inside the bundle: `feed_bundle_manifest.json` containing:
  * source URLs
  * retrieval commit/etag (if any)
  * file hashes
  * generated_by version

**Content addressing**

* Digest of the entire bundle (`sha256(tar.zst)`) is the reference.
* Scans record only the digest + URI.

**Immutability**

* Store bundles in object storage with WORM / retention if you need compliance.

---

### 2.4 Replay execution sandbox

For determinism, enforce:

* **No network** (K8s NetworkPolicy, firewall rules, or container runtime flags)
* **Fixed TZ/locale**
* **Pinned container image digest**
* **Clock policy**
  * Either "real time but recorded" or "frozen time at original scan timestamp"
  * If scanner logic uses the current date for severity windows, freeze time

---

## 3) Development implementation plan

I'll lay this out as **workstreams** plus **a sprint-by-sprint plan**. You can compress/expand depending on team size.

### Workstream A — Scan Manifest & Canonical Artifacts

**Goal:** every scan is replayable on paper, even before replays run.

**Deliverables**

* `ScanManifest v1` schema + writer integrated into the scan pipeline
* Canonicalization library + canonical hashing for all artifacts

**Acceptance criteria**

* Every scan stores: input digests, scanner digest, policy digest, feed digest placeholders
* Artifact hashes are stable across repeated runs in the same environment

---

### Workstream B — Feed Snapshotting & Policy Versioning

**Goal:** eliminate "feed drift" by pinning immutable inputs.

**Deliverables**

* Feed bundle builder + signer + uploader
* Policy/rules bundler (compiled rules bundle, digest recorded)

**Acceptance criteria**

* New scans reference feed bundle digests (not "latest")
* A scan can be re-run with the same feed bundle and policy bundle

---

### Workstream C — Replay Runner & Diff Engine

**Goal:** execute historical scans and score BF/SF/PF with actionable diffs.

**Deliverables**

* `stella replay --from manifest.json`
* Orchestrator job to schedule replays
* Diff engine + mismatch classifier (sketched below)
* Storage of replay results

**Acceptance criteria**

* Replay produces deterministic artifacts in a pinned environment
* Dashboard/CLI shows BF/SF/PF + a diff summary for failures
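A minimal sketch of the mismatch classifier deliverable, following the §1.2 taxonomy and the priority rules spelled out in Sprint 4; the manifest keys mirror the `ScanManifest v1` skeleton in §5.1, and the returned tags are illustrative rather than a fixed enum:

```python
def classify_mismatch(original_manifest: dict, replay_manifest: dict) -> str:
    """Coarse mismatch_reason for a failed replay, checked in priority order."""
    o, r = original_manifest, replay_manifest
    if o.get("feeds") != r.get("feeds"):
        return "feed_drift"
    if o.get("policy") != r.get("policy"):
        return "policy_drift"
    if o.get("scanner", {}).get("scanner_image_digest") != \
            r.get("scanner", {}).get("scanner_image_digest"):
        return "scanner_drift"
    if o.get("environment") != r.get("environment"):
        return "runtime_drift"
    # Same pinned inputs and environment but different outputs: nondeterminism
    # (ordering, concurrency race, unseeded RNG, time-based logic) or hidden external IO.
    return "nondeterminism"
```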
---

### Workstream D — Observability, Dashboard, and CI Gates

**Goal:** make fidelity visible and enforceable.

**Deliverables**

* Prometheus metrics: `replay_fidelity_bf`, `replay_fidelity_sf`, `replay_fidelity_pf`
* Breakdown labels (scanner, ecosystem, policy_set, base_os)
* Alerts for drop thresholds
* CI gate option: "block release if BF < threshold on canary set"

**Acceptance criteria**

* Engineering can see drift within 24h
* Releases are blocked when fidelity regressions occur

---

## 4) Suggested sprint plan with concrete tasks

### Sprint 0 — Design lock + baseline

**Tasks**

* Define manifest schema: `ScanManifest v1` fields + versioning rules
* Decide canonicalization rules (what is normalized vs preserved)
* Choose the initial "golden canary" scan set (10–20 representative targets)
* Add a "replay-fidelity" epic with ownership & SLIs/SLOs

**Exit criteria**

* Approved schema + canonicalization spec
* Canary set stored and tagged

---

### Sprint 1 — Manifest writer + artifact hashing (MVP)

**Tasks**

* Implement the manifest writer in the scan pipeline
* Store `manifest_json` + `manifest_sha256`
* Implement canonicalization + hashing for:
  * findings list (sorted)
  * SBOM (normalized)
  * VEX (if present)
* Persist canonical hashes in `scan_artifacts`

**Exit criteria**

* Two identical scans in the same environment yield identical artifact hashes
* A "manifest export" endpoint/CLI works:
  * `stella scan --emit-manifest out.json`

---

### Sprint 2 — Feed snapshotter + policy bundling

**Tasks**

* Build the feed bundler job:
  * pull raw sources
  * normalize layout
  * generate `feed_bundle_manifest.json`
  * tar.zst + sha256
  * upload + record in `feed_snapshots`
* Update the scan pipeline:
  * resolve the feed bundle digest at scan start
  * record the digest in the scan manifest
* Bundle policy/lattice:
  * compile rules into an immutable artifact
  * record the policy bundle digest in the manifest

**Exit criteria**

* Scans reference immutable feed + policy digests
* You can fetch a feed bundle by digest and reproduce the same feed inputs

---

### Sprint 3 — Replay executor + "no network" sandbox

**Tasks**

* Create the replay container image / runtime wrapper
* Implement `stella replay --from MANIFEST.json`:
  * pulls the scanner image by digest
  * mounts the feed bundle + policy bundle
  * runs in network-off mode
  * applies tz/locale + clock mode
* Store replay outputs as artifacts (`replay_scan_id` or `replay_id` linkage)

**Exit criteria**

* Replay runs end-to-end for canary scans
* Deterministic runtime controls verified (no DNS egress, fixed tz)

---

### Sprint 4 — Diff engine + mismatch classification

**Tasks**

* Implement BF compare (canonical hashes)
* Implement SF compare (semantic JSON/object comparison)
* Implement PF compare (policy decision equivalence)
* Implement mismatch classification rules:
  * if the feed digest differs → feed drift
  * if the scanner digest differs → scanner drift
  * if the environment differs → runtime drift
  * else → nondeterminism (with sub-tags for ordering/time/RNG)
* Generate `diff_summary_json`:
  * top N changed CVEs
  * packages added/removed
  * policy verdict changes

**Exit criteria**

* Every failed replay has a cause tag and a diff summary that is useful in <2 minutes
* Engineers can reproduce failures locally with the manifest

---

### Sprint 5 — Dashboard + alerts + CI gate

**Tasks**

* Expose Prometheus metrics from the replay service
* Build the dashboard:
  * BF/SF/PF trends
  * breakdown by ecosystem/scanner/policy
  * mismatch cause histogram
* Add alerting rules (drop threshold, bucket regression)
* Add a CI gate mode:
  * "run replays on the canary set for this release candidate"
  * block merge if BF < target

**Exit criteria**

* Fidelity visible to leadership and engineering
* Release process is protected by canary replays
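To make the Sprint 5 metrics-and-gate tasks concrete, here is a minimal sketch using the `prometheus_client` Python library; only the BF gauge is shown (SF/PF are analogous), and the port, label values, and threshold are placeholders tied to the pass/fail rules at the top of this note:

```python
from prometheus_client import Gauge, start_http_server

# Gauge name + labels follow the Workstream D deliverables; SF/PF gauges would be analogous.
REPLAY_FIDELITY_BF = Gauge(
    "replay_fidelity_bf",
    "Bitwise replay fidelity over the reporting window",
    ["scanner", "ecosystem", "policy_set", "base_os"],
)

def publish_fidelity(bucket: dict, bf: float) -> None:
    """Record the windowed BF value for one breakdown bucket."""
    REPLAY_FIDELITY_BF.labels(**bucket).set(bf)

def ci_gate(canary_bf: float, threshold: float = 0.98) -> None:
    """Sprint 5 CI gate: block the release candidate if canary-set BF misses the target."""
    if canary_bf < threshold:
        raise SystemExit(f"replay fidelity gate failed: BF={canary_bf:.3f} < {threshold}")

if __name__ == "__main__":
    start_http_server(9300)  # hypothetical port; serves /metrics for Prometheus to scrape
    publish_fidelity({"scanner": "stella", "ecosystem": "npm",
                      "policy_set": "prod-default", "base_os": "debian"}, 0.992)
    ci_gate(0.992)  # passes; a long-running replay service would keep serving /metrics
```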
---

### Sprint 6 — Hardening + compliance polish

**Tasks**

* Backward compatible manifest upgrades:
  * `manifest_version` bump rules
  * migration support
* Artifact signing / integrity:
  * sign the manifest hash
  * optional transparency log later
* Storage & retention policies (cost controls)
* Runbook + oncall playbook

**Exit criteria**

* Audit story is complete: "show me exactly how scan X was produced"
* Operational load is manageable and cost-bounded

---

## 5) Engineering specs you can start implementing immediately

### 5.1 `ScanManifest v1` skeleton (example)

```json
{
  "manifest_version": "1.0",
  "scan_id": "scan_123",
  "created_at": "2025-12-12T10:15:30Z",
  "input": {
    "type": "oci_image",
    "image_ref": "registry/app@sha256:...",
    "layers": ["sha256:...", "sha256:..."],
    "source_provenance": {"repo_sha": "abc123", "build_id": "ci-999"}
  },
  "scanner": {
    "engine": "stella",
    "scanner_image_digest": "sha256:...",
    "scanner_version": "2025.12.0",
    "config_digest": "sha256:...",
    "flags": ["--deep", "--vex"]
  },
  "feeds": {
    "vuln_feed_bundle_digest": "sha256:...",
    "license_db_digest": "sha256:..."
  },
  "policy": {
    "policy_bundle_digest": "sha256:...",
    "policy_set": "prod-default"
  },
  "environment": {
    "arch": "amd64",
    "os": "linux",
    "tz": "UTC",
    "locale": "C",
    "network": "disabled",
    "clock_mode": "frozen",
    "clock_value": "2025-12-12T10:15:30Z"
  },
  "normalization": {
    "canonicalizer_version": "1.2.0",
    "sbom_schema": "cyclonedx-1.6",
    "vex_schema": "cyclonedx-vex-1.0"
  }
}
```

---

### 5.2 CLI spec (minimal)

* `stella scan ... --emit-manifest MANIFEST.json --emit-artifacts-dir out/`
* `stella replay --from MANIFEST.json --out-dir replay_out/`
* `stella diff --a out/ --b replay_out/ --mode bf|sf|pf --json`

---

## 6) Testing strategy (to prevent determinism regressions)

### Unit tests

* Canonicalization: same object → same bytes
* Sorting stability: randomized input order → stable output
* Hash determinism

### Integration tests

* Golden canaries:
  * run the scan twice in the same runner → BF match
  * replay from the manifest → BF match
* "Network leak" test:
  * DNS requests must be zero
* "Clock leak" test:
  * freeze time; ensure outputs do not include real timestamps

### Chaos tests

* Vary CPU count, run concurrency, run order → still BF match
* Randomized scheduling / thread interleavings to find races

---

## 7) Operational policies (so it stays useful)

### Retention & cost controls

* Keep full artifacts for regulated scans (e.g., 1–7 years)
* For non-regulated scans:
  * keep manifests + canonical hashes long-term
  * expire heavy evidence blobs after N days
* Compress large artifacts and dedupe by digest

### Alerting examples

* BF drops by ≥2% week-over-week (any major bucket) → warn
* BF < 0.90 overall or regulated BF < 0.95 → page / block release

### Triage workflow

* A failed replay auto-creates a ticket with:
  * manifest link
  * mismatch_reason
  * diff_summary
  * reproduction command

---

## 8) What "done" looks like (definition of success)

* Any customer/auditor can pick a scan from 6 months ago and you can:
  1. retrieve the manifest + feed bundle + policy bundle by digest
  2. replay in a pinned sandbox
  3. show BF/SF/PF results and diffs
* Engineering sees drift quickly and can attribute it to feed vs scanner vs runtime.

---

If you want, I can also provide:

* a **Postgres DDL** for the tables above,
* a **Prometheus metrics contract** (names + labels + example queries),
* and a **diff_summary_json schema** that supports a UI "diff view" without reprocessing artifacts.