# Runbook — Replay Operations

> **Audience:** Ops Guild · Evidence Locker Guild · Scanner Guild · Authority/Signer · Attestor
> **Prereqs:** `docs/replay/DETERMINISTIC_REPLAY.md`, `docs/replay/DEVS_GUIDE_REPLAY.md`, `docs/replay/TEST_STRATEGY.md`, `docs/modules/platform/architecture-overview.md` §5

This runbook governs day-to-day replay operations, retention, and incident handling across online and air-gapped environments. Keep it in sync with the tasks in `docs/implplan/SPRINT_187_evidence_cli_replay.md`.

---

## 1 · Terminology

- **Replay Manifest** — `manifest.json` describing scan inputs, outputs, signatures.
- **Input Bundle** — `inputbundle.tar.zst` containing feeds, policies, tools, env.
- **Output Bundle** — `outputbundle.tar.zst` with SBOM, findings, VEX, logs.
- **DSSE Envelope** — Signed metadata produced by Authority/Signer.
- **RootPack** — Trusted key bundle used to validate DSSE signatures offline.

---

## 2 · Normal operations

1. **Ingestion**
   - Scanner WebService writes manifest metadata to `replay_runs`.
   - Bundles uploaded to CAS (`cas://replay/...`) and mirrored into Evidence Locker (`evidence.replay_bundles`).
   - Authority triggers DSSE signing; Attestor optionally anchors to Rekor.
2. **Verification**
   - Nightly job runs `stella verify` on the latest N replay manifests per tenant.
   - Metrics `replay_verify_total{result}`, `replay_bundle_size_bytes` recorded in Telemetry Stack (see `docs/modules/telemetry/architecture.md`).
   - Failures alert `#ops-replay` via PagerDuty with runbook link.
3. **Retention**
   - Hot CAS retention: 180 days (configurable per tenant). Cron job `replay-retention` prunes expired digests and writes audit entries.
   - Cold storage (Evidence Locker): 2 years; legal holds extend via `/evidence/holds`. Ensure holds recorded in `timeline.events` with type `replay.hold.created`.
4. **Access control**
   - Only service identities with `replay:read` scope may fetch bundles. CLI requires device or client credential flow with DPoP.

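
The retention windows in step 3 can be read as a small decision rule. The sketch below is illustrative only: `retention_state` and its parameters are hypothetical names, "2 years" is approximated as 730 days, and treating a legal hold as keeping a bundle in cold storage past that window is an assumption drawn from the wording above. It does not describe how the `replay-retention` job is actually implemented.

```python
from datetime import datetime, timedelta, timezone

HOT_RETENTION_DAYS = 180   # hot CAS default stated in this runbook (per-tenant configurable)
COLD_RETENTION_DAYS = 730  # assumption: "2 years" approximated as 730 days


def retention_state(
    stored_at: datetime,
    now: datetime,
    hot_days: int = HOT_RETENTION_DAYS,
    cold_days: int = COLD_RETENTION_DAYS,
    on_legal_hold: bool = False,
) -> str:
    """Classify a bundle as 'hot', 'cold', or 'expired' by its age.

    Hypothetical helper: a legal hold is assumed to keep the bundle in
    cold storage even after the cold window has elapsed.
    """
    age = now - stored_at
    if age <= timedelta(days=hot_days):
        return "hot"
    if on_legal_hold or age <= timedelta(days=cold_days):
        return "cold"
    return "expired"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A bundle stored 200 days ago has left the hot window.
    print(retention_state(now - timedelta(days=200), now))
```
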
---

## 3 · Incident response (Replay Integrity)

| Step | Action | Owner | Notes |
|------|--------|-------|-------|
| 1 | Page Ops via `replay_verify_total{result="failed"}` alert | Observability | Include scan id, tenant, failure codes |
| 2 | Lock affected bundles (`POST /evidence/holds`) | Evidence Locker | Reference incident ticket |
| 3 | Re-run `stella verify` with `--explain` to gather diffs | Scanner Guild | Attach diff JSON to incident |
| 4 | Check Rekor inclusion proofs (`stella verify --ledger`) | Attestor | Flag if ledger mismatch or stale |
| 5 | If tool hash drift → coordinate Signer for rotation | Authority/Signer | Rotate DSSE profile, update RootPack |
| 6 | Update incident timeline (`docs/runbooks/replay_ops.md` -> Incident Log) | Ops Guild | Record timestamps and decisions |
| 7 | Close hold once resolved, publish postmortem | Ops + Docs | Postmortem must reference replay spec sections |

---

## 4 · Air-gapped workflow

1. Receive Offline Kit bundle containing:
   - `offline/replay//manifest.json`
   - Bundles + DSSE signatures
   - RootPack snapshot
2. Run `stella replay manifest.json --strict --offline` using local CLI.
3. Load feed/policy snapshots from kit; never hit external networks.
4. Store verification logs under `ops/offline/replay//`.
5. Sync results back to Evidence Locker once connectivity restored.

---

## 5 · Maintenance checklist

- [ ] RootPack rotated quarterly; CLI/Evidence Locker updated with new fingerprints.
- [ ] CAS retention job executed successfully in the past 24 hours.
- [ ] Replay verification metrics present in dashboards (x64 + arm64 lanes).
- [ ] Runbook incident log updated (see section 6) for the last drill.
- [ ] Offline kit instructions verified against current CLI version.

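
The two time-based checklist items (quarterly RootPack rotation, retention job success within 24 hours) lend themselves to an automated freshness check. The following is a minimal sketch under stated assumptions: `checklist_status` and its argument names are hypothetical, "quarterly" is approximated as 92 days, and the timestamps are assumed to come from whatever audit source records these events. The remaining checklist items still need manual review.

```python
from datetime import datetime, timedelta, timezone

ROOTPACK_MAX_AGE = timedelta(days=92)        # assumption: "quarterly" approximated as 92 days
RETENTION_JOB_MAX_AGE = timedelta(hours=24)  # window taken from the checklist above


def checklist_status(
    now: datetime,
    last_rootpack_rotation: datetime,
    last_retention_success: datetime,
) -> dict:
    """Report whether the two time-based checklist items are currently met.

    Hypothetical helper: callers supply the timestamps themselves.
    """
    return {
        "rootpack_rotated": now - last_rootpack_rotation <= ROOTPACK_MAX_AGE,
        "retention_job_recent": now - last_retention_success <= RETENTION_JOB_MAX_AGE,
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    status = checklist_status(
        now,
        last_rootpack_rotation=now - timedelta(days=30),
        last_retention_success=now - timedelta(hours=2),
    )
    # Print one line per checklist item with a pass/fail marker.
    for item, ok in status.items():
        print(f"[{'x' if ok else ' '}] {item}")
```
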
---

## 6 · Incident log

| Date (UTC) | Incident ID | Tenant | Summary | Follow-up |
|------------|-------------|--------|---------|-----------|
| _TBD_ | | | | |

---

## 7 · References

- `docs/replay/DETERMINISTIC_REPLAY.md`
- `docs/replay/DEVS_GUIDE_REPLAY.md`
- `docs/replay/TEST_STRATEGY.md`
- `docs/modules/platform/architecture-overview.md` §5
- `docs/modules/evidence-locker/architecture.md`
- `docs/modules/telemetry/architecture.md`
- `docs/implplan/SPRINT_187_evidence_cli_replay.md`

---

*Created: 2025-11-03 — Update alongside replay task status changes.*