Runbook — Replay Operations
Audience: Ops Guild · Evidence Locker Guild · Scanner Guild · Authority/Signer · Attestor
Prereqs: `docs/replay/DETERMINISTIC_REPLAY.md`, `docs/replay/DEVS_GUIDE_REPLAY.md`, `docs/replay/TEST_STRATEGY.md`, `docs/modules/platform/architecture-overview.md` §5
This runbook governs day-to-day replay operations, retention, and incident handling across online and air-gapped environments. Keep it in sync with the tasks in docs/implplan/SPRINT_187_evidence_cli_replay.md.
1 · Terminology
- Replay Manifest: `manifest.json` describing scan inputs, outputs, signatures.
- Input Bundle: `inputbundle.tar.zst` containing feeds, policies, tools, env.
- Output Bundle: `outputbundle.tar.zst` with SBOM, findings, VEX, logs.
- DSSE Envelope: Signed metadata produced by Authority/Signer.
- RootPack: Trusted key bundle used to validate DSSE signatures offline.
2 · Normal operations
- Ingestion
  - Scanner WebService writes manifest metadata to `replay_runs`.
  - Bundles uploaded to CAS (`cas://replay/...`) and mirrored into Evidence Locker (`evidence.replay_bundles`).
  - Authority triggers DSSE signing; Attestor optionally anchors to Rekor.
- Verification
  - Nightly job runs `stella verify` on the latest N replay manifests per tenant.
  - Metrics `replay_verify_total{result}`, `replay_bundle_size_bytes` recorded in Telemetry Stack (see `docs/modules/telemetry/architecture.md`).
  - Failures alert `#ops-replay` via PagerDuty with runbook link.
- Retention
  - Hot CAS retention: 180 days (configurable per tenant). Cron job `replay-retention` prunes expired digests and writes audit entries.
  - Cold storage (Evidence Locker): 2 years; legal holds extend via `/evidence/holds`. Ensure holds recorded in `timeline.events` with type `replay.hold.created`.
- Access control
  - Only service identities with `replay:read` scope may fetch bundles. CLI requires device or client credential flow with DPoP.
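
The hot-retention rule above can be sketched as a small selection function. This is only an illustration of the 180-day window with per-tenant overrides, not the actual `replay-retention` job: the `BundleRecord` type, its fields, and the `retention_by_tenant` mapping are hypothetical names introduced here, and legal holds (which the runbook ties to cold storage) are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Runbook default for hot CAS retention; tenants may override it.
HOT_RETENTION = timedelta(days=180)


@dataclass(frozen=True)
class BundleRecord:
    """Hypothetical view of one CAS entry; not the real schema."""
    digest: str
    tenant: str
    stored_at: datetime


def expired_digests(records, now, retention_by_tenant=None):
    """Return digests whose hot-retention window has elapsed, in input order."""
    overrides = retention_by_tenant or {}
    expired = []
    for rec in records:
        # Fall back to the 180-day default when the tenant has no override.
        retention = overrides.get(rec.tenant, HOT_RETENTION)
        if now - rec.stored_at > retention:
            expired.append(rec.digest)
    return expired


if __name__ == "__main__":
    now = datetime(2025, 11, 3, tzinfo=timezone.utc)
    sample = [
        BundleRecord("sha256:aaa", "tenant-a", now - timedelta(days=181)),
        BundleRecord("sha256:bbb", "tenant-a", now - timedelta(days=10)),
    ]
    print(expired_digests(sample, now))
```

A real job would also write an audit entry for each pruned digest, as the Retention bullet requires; that step is omitted because the audit format is not described here.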
3 · Incident response (Replay Integrity)
| Step | Action | Owner | Notes |
|---|---|---|---|
| 1 | Page Ops via `replay_verify_total{result="failed"}` alert | Observability | Include scan id, tenant, failure codes |
| 2 | Lock affected bundles (`POST /evidence/holds`) | Evidence Locker | Reference incident ticket |
| 3 | Re-run `stella verify` with `--explain` to gather diffs | Scanner Guild | Attach diff JSON to incident |
| 4 | Check Rekor inclusion proofs (`stella verify --ledger`) | Attestor | Flag if ledger mismatch or stale |
| 5 | If tool hash drift → coordinate Signer for rotation | Authority/Signer | Rotate DSSE profile, update RootPack |
| 6 | Update incident timeline (`docs/runbooks/replay_ops.md` -> Incident Log) | Ops Guild | Record timestamps and decisions |
| 7 | Close hold once resolved, publish postmortem | Ops + Docs | Postmortem must reference replay spec sections |
4 · Air-gapped workflow
- Receive Offline Kit bundle containing:
  - `offline/replay/<scan-id>/manifest.json`
  - Bundles + DSSE signatures
  - RootPack snapshot
- Run `stella replay manifest.json --strict --offline` using local CLI.
- Load feed/policy snapshots from kit; never hit external networks.
- Store verification logs under `ops/offline/replay/<scan-id>/`.
- Sync results back to Evidence Locker once connectivity restored.
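
The run-and-log steps above could be wrapped in a small script along these lines. It is a minimal sketch under stated assumptions: the command line is the one given in this section, but the kit layout beyond `offline/replay/<scan-id>/manifest.json`, the `verify.log` file name, and the choice to run from the manifest's directory are illustrative guesses, not documented CLI behaviour.

```python
import subprocess
from pathlib import Path


def offline_replay_command() -> list[str]:
    """Command line exactly as written in the workflow step."""
    return ["stella", "replay", "manifest.json", "--strict", "--offline"]


def log_dir_for(scan_id: str, root: Path = Path("ops/offline/replay")) -> Path:
    """Directory where verification logs for one scan are stored."""
    return root / scan_id


def run_offline_replay(kit_root: Path, scan_id: str) -> int:
    """Run the replay from the manifest's directory and capture its output.

    Assumption: the CLI resolves bundles relative to the working directory.
    """
    manifest_dir = kit_root / "offline" / "replay" / scan_id
    logs = log_dir_for(scan_id)
    logs.mkdir(parents=True, exist_ok=True)
    # "verify.log" is a hypothetical file name chosen for this sketch.
    with open(logs / "verify.log", "wb") as out:
        result = subprocess.run(
            offline_replay_command(),
            cwd=manifest_dir,
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,  # let the caller decide how to handle a failed replay
        )
    return result.returncode
```

The wrapper returns the exit code rather than raising, so a failed replay still leaves its log in place for the later sync to Evidence Locker.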
5 · Maintenance checklist
- RootPack rotated quarterly; CLI/Evidence Locker updated with new fingerprints.
- CAS retention job executed successfully in the past 24 hours.
- Replay verification metrics present in dashboards (x64 + arm64 lanes).
- Runbook incident log updated (see section 6) for the last drill.
- Offline kit instructions verified against current CLI version.
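
Two of the checklist items are time-based and could be checked mechanically. The sketch below assumes the timestamps are already available from whatever system records them; the function name, the item labels, and the reading of "quarterly" as at most 92 days are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "quarterly" is read as no more than 92 days between rotations.
ROOTPACK_MAX_AGE = timedelta(days=92)
# From the checklist: the retention job must have succeeded in the past 24 hours.
RETENTION_JOB_MAX_AGE = timedelta(hours=24)


def overdue_items(now, rootpack_rotated_at, retention_job_succeeded_at):
    """Return labels for the time-based checklist items that are overdue."""
    overdue = []
    if now - rootpack_rotated_at > ROOTPACK_MAX_AGE:
        overdue.append("rootpack-rotation")
    if now - retention_job_succeeded_at > RETENTION_JOB_MAX_AGE:
        overdue.append("cas-retention-job")
    return overdue


if __name__ == "__main__":
    now = datetime(2025, 11, 3, 12, 0, tzinfo=timezone.utc)
    print(overdue_items(
        now,
        rootpack_rotated_at=now - timedelta(days=100),
        retention_job_succeeded_at=now - timedelta(hours=3),
    ))
```

The remaining items (dashboard coverage, incident log, offline kit instructions) need human review and are not modelled here.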
6 · Incident log
| Date (UTC) | Incident ID | Tenant | Summary | Follow-up |
|---|---|---|---|---|
| TBD | | | | |
7 · References
- `docs/replay/DETERMINISTIC_REPLAY.md`
- `docs/replay/DEVS_GUIDE_REPLAY.md`
- `docs/replay/TEST_STRATEGY.md`
- `docs/modules/platform/architecture-overview.md` §5
- `docs/modules/evidence-locker/architecture.md`
- `docs/modules/telemetry/architecture.md`
- `docs/implplan/SPRINT_187_evidence_cli_replay.md`
Created: 2025-11-03 — Update alongside replay task status changes.