Runbook - Replay Operations
Audience: Ops Guild / Evidence Locker Guild / Scanner Guild / Authority/Signer / Attestor
Prereqs: `docs/replay/DETERMINISTIC_REPLAY.md`, `docs/replay/DEVS_GUIDE_REPLAY.md`, `docs/replay/TEST_STRATEGY.md`, `docs/modules/platform/architecture-overview.md`
This runbook governs day-to-day replay operations, retention, and incident handling across online and air-gapped environments. Keep it in sync with the tasks in docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md.
1 Terminology
- Replay Manifest - `manifest.json` describing scan inputs, outputs, signatures.
- Input Bundle - `inputbundle.tar.zst` containing feeds, policies, tools, env.
- Output Bundle - `outputbundle.tar.zst` with SBOM, findings, VEX, logs.
- DSSE Envelope - Signed metadata produced by Authority/Signer.
- RootPack - Trusted key bundle used to validate DSSE signatures offline.
2 Normal operations
- Ingestion
  - Scanner WebService writes manifest metadata to `replay_runs`.
  - Bundles uploaded to CAS (`cas://replay/...`) and mirrored into Evidence Locker (`evidence.replay_bundles`).
  - Authority triggers DSSE signing; Attestor optionally anchors to Rekor.
- Verification
  - Nightly job runs `stella verify` on the latest N replay manifests per tenant.
  - Metrics `replay_verify_total{result}`, `replay_bundle_size_bytes` recorded in Telemetry Stack (see `docs/modules/telemetry/architecture.md`).
  - Failures alert `#ops-replay` via PagerDuty with runbook link.
- Retention
  - Hot CAS retention: 180 days (configurable per tenant). Cron job `replay-retention` prunes expired digests and writes audit entries.
  - Cold storage (Evidence Locker): 2 years; legal holds extend via `/evidence/holds`. Ensure holds recorded in `timeline.events` with type `replay.hold.created`.
  - Retention declaration: validate against `docs/schemas/replay-retention.schema.json` (frozen 2025-12-10). Include `retention_policy_id`, `tenant_id`, `bundle_type`, `retention_days`, `legal_hold`, `purge_after`, `checksum`, `created_at`. Audit checksum via DSSE envelope when persisting.
- Access control
  - Only service identities with `replay:read` scope may fetch bundles. CLI requires device or client credential flow with DPoP.
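The retention declaration fields listed above can be pre-checked before schema validation. The following is a minimal sketch, not the schema itself: the field names come from this runbook, but the value types, the ISO 8601 timestamp format, the example values, and the `is_purgeable` rule (past `purge_after` and no legal hold) are assumptions made for illustration. The authoritative check remains validation against `docs/schemas/replay-retention.schema.json`.

```python
from datetime import datetime, timedelta, timezone

# Field names come from the runbook; their types and formats are assumptions.
REQUIRED_FIELDS = (
    "retention_policy_id",
    "tenant_id",
    "bundle_type",
    "retention_days",
    "legal_hold",
    "purge_after",
    "checksum",
    "created_at",
)


def missing_fields(declaration: dict) -> list[str]:
    """Return the required fields absent from a retention declaration."""
    return [name for name in REQUIRED_FIELDS if name not in declaration]


def is_purgeable(declaration: dict, now: datetime) -> bool:
    """Hypothetical purge rule: past purge_after and not under legal hold.

    Assumes purge_after is an ISO 8601 timestamp carrying a UTC offset.
    """
    if declaration["legal_hold"]:
        return False
    purge_after = datetime.fromisoformat(declaration["purge_after"])
    return now >= purge_after


# Example declaration using the 180-day hot CAS default from this runbook.
created = datetime(2025, 1, 1, tzinfo=timezone.utc)
example = {
    "retention_policy_id": "example-policy",  # hypothetical value
    "tenant_id": "tenant-a",                  # hypothetical value
    "bundle_type": "input",                   # hypothetical value
    "retention_days": 180,
    "legal_hold": False,
    "purge_after": (created + timedelta(days=180)).isoformat(),
    "checksum": "sha256:...",                 # placeholder, not a real digest
    "created_at": created.isoformat(),
}
```

A declaration with `legal_hold` set is never reported as purgeable here, which mirrors the rule that legal holds extend retention.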
3 Incident response (Replay Integrity)
| Step | Action | Owner | Notes |
|---|---|---|---|
| 1 | Page Ops via `replay_verify_total{result="failed"}` alert | Observability | Include scan id, tenant, failure codes |
| 2 | Lock affected bundles (`POST /evidence/holds`) | Evidence Locker | Reference incident ticket |
| 3 | Re-run `stella verify` with `--explain` to gather diffs | Scanner Guild | Attach diff JSON to incident |
| 4 | Check Rekor inclusion proofs (`stella verify --ledger`) | Attestor | Flag if ledger mismatch or stale |
| 5 | If tool hash drift -> coordinate Signer for rotation | Authority/Signer | Rotate DSSE profile, update RootPack |
| 6 | Update incident timeline (`docs/runbooks/replay_ops.md` -> Incident Log) | Ops Guild | Record timestamps and decisions |
| 7 | Close hold once resolved, publish postmortem | Ops + Docs | Postmortem must reference replay spec sections |
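Steps 3 and 4 both re-run `stella verify` with a different flag. A small helper can keep the two invocations consistent when scripting an incident. This is a sketch under assumptions: only the `--explain` and `--ledger` flags come from this runbook, while passing the manifest path as a positional argument and the helper name `build_verify_command` are hypothetical.

```python
def build_verify_command(manifest_path: str, *, explain: bool = False,
                         ledger: bool = False) -> list[str]:
    """Build an argv list for `stella verify`.

    Assumption: the manifest path is a positional argument. Only the
    --explain and --ledger flags are taken from the runbook.
    """
    argv = ["stella", "verify", manifest_path]
    if explain:
        argv.append("--explain")  # step 3: gather diffs
    if ledger:
        argv.append("--ledger")   # step 4: check Rekor inclusion proofs
    return argv
```

The returned list can be handed to `subprocess.run(argv, capture_output=True, text=True)` so the captured output can be attached to the incident.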
4 Air-gapped workflow
- Receive Offline Kit bundle containing:
  - `offline/replay/<scan-id>/manifest.json`
  - Bundles + DSSE signatures
  - RootPack snapshot
- Run `stella replay manifest.json --strict --offline` using local CLI.
- Load feed/policy snapshots from kit; never hit external networks.
- Store verification logs under `ops/offline/replay/<scan-id>/`.
- Sync results back to Evidence Locker once connectivity restored.
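A preflight check before the offline replay can catch an incomplete kit early. The sketch below is illustrative only: the manifest path, the bundle file names, the log directory, and the replay command come from this runbook, but placing the bundles next to `manifest.json`, running the command from that directory, and all function names are assumptions. DSSE signature and RootPack file names are not specified here, so they are not checked.

```python
from pathlib import Path

# Bundle file names come from the Terminology section. Their location next to
# manifest.json is an assumption; the runbook only fixes the manifest path.
EXPECTED_FILES = ("manifest.json", "inputbundle.tar.zst", "outputbundle.tar.zst")


def kit_dir(kit_root: Path, scan_id: str) -> Path:
    """Directory holding the replay kit for one scan."""
    return kit_root / "offline" / "replay" / scan_id


def missing_kit_files(kit_root: Path, scan_id: str) -> list[str]:
    """Return the expected kit files that are not present on disk."""
    base = kit_dir(kit_root, scan_id)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


def log_dir(ops_root: Path, scan_id: str) -> Path:
    """Directory where verification logs for one scan are stored."""
    return ops_root / "ops" / "offline" / "replay" / scan_id


def replay_command() -> list[str]:
    """The invocation given in the runbook, assumed to run from kit_dir."""
    return ["stella", "replay", "manifest.json", "--strict", "--offline"]
```

An empty result from `missing_kit_files` means the replay can proceed; anything else should be resolved with the kit's producer before running the CLI.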
5 Maintenance checklist
- RootPack rotated quarterly; CLI/Evidence Locker updated with new fingerprints.
- CAS retention job executed successfully in the past 24 hours.
- Replay verification metrics present in dashboards (x64 + arm64 lanes).
- Runbook incident log updated (see section 6) for the last drill.
- Offline kit instructions verified against current CLI version.
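The time-based checklist items can be expressed as simple freshness checks. This is a minimal sketch: the 24-hour window comes from the checklist, while reading "quarterly" as at most 92 days and the names `ran_within`, `RETENTION_JOB_WINDOW`, and `ROOTPACK_ROTATION_WINDOW` are assumptions. How the last-success and last-rotation timestamps are obtained is left out.

```python
from datetime import datetime, timedelta, timezone

RETENTION_JOB_WINDOW = timedelta(hours=24)     # from the checklist
ROOTPACK_ROTATION_WINDOW = timedelta(days=92)  # assumption: "quarterly" as <= 92 days


def ran_within(last_event: datetime, now: datetime, window: timedelta) -> bool:
    """True if last_event falls inside the trailing window ending at now."""
    return timedelta(0) <= now - last_event <= window


# Example: a retention job that succeeded 23 hours ago passes the check.
now = datetime(2025, 11, 3, 12, 0, tzinfo=timezone.utc)
retention_ok = ran_within(now - timedelta(hours=23), now, RETENTION_JOB_WINDOW)
```

Both timestamps should be timezone-aware (UTC) so the subtraction is well defined.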
6 Incident log
| Date (UTC) | Incident ID | Tenant | Summary | Follow-up |
|---|---|---|---|---|
| TBD |
7 References
- `docs/replay/DETERMINISTIC_REPLAY.md`
- `docs/replay/DEVS_GUIDE_REPLAY.md`
- `docs/replay/TEST_STRATEGY.md`
- `docs/modules/platform/architecture-overview.md` section 5
- `docs/modules/evidence-locker/architecture.md`
- `docs/modules/telemetry/architecture.md`
- `docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md`
Created: 2025-11-03 - Update alongside replay task status changes.