# Runbook — Replay Operations
> **Audience:** Ops Guild · Evidence Locker Guild · Scanner Guild · Authority/Signer · Attestor
>
> **Prereqs:** `docs/replay/DETERMINISTIC_REPLAY.md`, `docs/replay/DEVS_GUIDE_REPLAY.md`, `docs/replay/TEST_STRATEGY.md`, `docs/modules/platform/architecture-overview.md` §5

This runbook governs day-to-day replay operations, retention, and incident handling across online and air-gapped environments. Keep it in sync with the tasks in `docs/implplan/SPRINT_187_evidence_locker_cli_integration.md`.

---
## 1 · Terminology
- **Replay Manifest** — `manifest.json` describing scan inputs, outputs, signatures.
- **Input Bundle** — `inputbundle.tar.zst` containing feeds, policies, tools, env.
- **Output Bundle** — `outputbundle.tar.zst` with SBOM, findings, VEX, logs.
- **DSSE Envelope** — Signed metadata produced by Authority/Signer.
- **RootPack** — Trusted key bundle used to validate DSSE signatures offline.
---
## 2 · Normal operations
1. **Ingestion**
   - Scanner WebService writes manifest metadata to `replay_runs`.
   - Bundles are uploaded to CAS (`cas://replay/...`) and mirrored into Evidence Locker (`evidence.replay_bundles`).
   - Authority triggers DSSE signing; Attestor optionally anchors to Rekor.
2. **Verification**
   - Nightly job runs `stella verify` on the latest N replay manifests per tenant.
   - Metrics `replay_verify_total{result}` and `replay_bundle_size_bytes` are recorded in the Telemetry Stack (see `docs/modules/telemetry/architecture.md`).
   - Failures alert `#ops-replay` via PagerDuty with a link to this runbook.
3. **Retention**
   - Hot CAS retention: 180 days (configurable per tenant). Cron job `replay-retention` prunes expired digests and writes audit entries.
   - Cold storage (Evidence Locker): 2 years; legal holds extend retention via `/evidence/holds`. Ensure holds are recorded in `timeline.events` with type `replay.hold.created`.
4. **Access control**
   - Only service identities with the `replay:read` scope may fetch bundles. The CLI requires the device or client credential flow with DPoP.
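
The retention rules above can be sketched as a small expiry check. This is a minimal illustration, not the `replay-retention` job itself. The `BundleRecord` shape and all function names are hypothetical, "2 years" is approximated as 730 days, and holds are applied only to cold storage because that is the only place this runbook says they extend retention.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

HOT_RETENTION = timedelta(days=180)   # hot CAS default stated in this runbook
COLD_RETENTION = timedelta(days=730)  # "2 years", approximated (assumption)


@dataclass(frozen=True)
class BundleRecord:
    """Hypothetical record shape; the real schema is not defined here."""
    digest: str
    stored_at: datetime
    legal_hold: bool = False


def hot_expired(bundle: BundleRecord, now: datetime,
                retention: timedelta = HOT_RETENTION) -> bool:
    """True once the bundle is older than the (per-tenant) hot retention."""
    return now - bundle.stored_at >= retention


def cold_expired(bundle: BundleRecord, now: datetime) -> bool:
    """True once cold retention has lapsed; a legal hold always blocks expiry."""
    if bundle.legal_hold:
        return False
    return now - bundle.stored_at >= COLD_RETENTION


def prunable_hot_digests(bundles, now: datetime,
                         retention: timedelta = HOT_RETENTION) -> list:
    """Digests a hot-CAS pruning pass could remove, in input order."""
    return [b.digest for b in bundles if hot_expired(b, now, retention)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [BundleRecord("sha256:example", now - timedelta(days=181))]
    print(prunable_hot_digests(sample, now))
```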
---
## 3 · Incident response (Replay Integrity)
| Step | Action | Owner | Notes |
|------|--------|-------|-------|
| 1 | Page Ops via `replay_verify_total{result="failed"}` alert | Observability | Include scan id, tenant, failure codes |
| 2 | Lock affected bundles (`POST /evidence/holds`) | Evidence Locker | Reference incident ticket |
| 3 | Re-run `stella verify` with `--explain` to gather diffs | Scanner Guild | Attach diff JSON to incident |
| 4 | Check Rekor inclusion proofs (`stella verify --ledger`) | Attestor | Flag if ledger mismatch or stale |
| 5 | If tool hash drift → coordinate Signer for rotation | Authority/Signer | Rotate DSSE profile, update RootPack |
| 6 | Update incident timeline (`docs/runbooks/replay_ops.md` §6, Incident log) | Ops Guild | Record timestamps and decisions |
| 7 | Close hold once resolved, publish postmortem | Ops + Docs | Postmortem must reference replay spec sections |
---
## 4 · Air-gapped workflow
1. Receive Offline Kit bundle containing:
   - `offline/replay/<scan-id>/manifest.json`
   - Bundles + DSSE signatures
   - RootPack snapshot
2. Run `stella replay manifest.json --strict --offline` using local CLI.
3. Load feed/policy snapshots from kit; never hit external networks.
4. Store verification logs under `ops/offline/replay/<scan-id>/`.
5. Sync results back to Evidence Locker once connectivity restored.
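
Before running the replay in step 2, a quick preflight can confirm that the kit is laid out as expected. The sketch below assumes the two bundle files named in section 1 sit next to `manifest.json` and that the log path in step 4 is relative to a root you choose. Neither is confirmed by this runbook, and the function names are hypothetical. DSSE signature and RootPack file names are not specified here, so they are not checked.

```python
from pathlib import Path

# manifest.json comes from the kit layout in this section. The bundle names
# come from section 1; placing them beside the manifest is an assumption.
EXPECTED_FILES = ("manifest.json", "inputbundle.tar.zst", "outputbundle.tar.zst")


def kit_dir(kit_root, scan_id: str) -> Path:
    """Directory holding one scan's replay material inside the Offline Kit."""
    return Path(kit_root) / "offline" / "replay" / scan_id


def log_dir(ops_root, scan_id: str) -> Path:
    """Where step 4 stores verification logs (root location is an assumption)."""
    return Path(ops_root) / "ops" / "offline" / "replay" / scan_id


def missing_kit_files(kit_root, scan_id: str, expected=EXPECTED_FILES) -> list:
    """Return the expected file names that are absent from the kit directory."""
    base = kit_dir(kit_root, scan_id)
    return [name for name in expected if not (base / name).is_file()]
```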
---
## 5 · Maintenance checklist
- [ ] RootPack rotated quarterly; CLI/Evidence Locker updated with new fingerprints.
- [ ] CAS retention job executed successfully in the past 24 hours.
- [ ] Replay verification metrics present in dashboards (x64 + arm64 lanes).
- [ ] Runbook incident log updated (see section 6) for the last drill.
- [ ] Offline kit instructions verified against current CLI version.
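
Two of the checklist items are time-based and can be checked mechanically. The following is a rough sketch: where the "last success" timestamps come from is left open, "quarterly" is approximated as 92 days, and the function and key names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION_JOB_MAX_AGE = timedelta(hours=24)  # from the checklist
ROOTPACK_MAX_AGE = timedelta(days=92)        # "quarterly", approximated (assumption)


def is_fresh(last_success: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if last_success is not in the future and no older than max_age."""
    age = now - last_success
    return timedelta(0) <= age <= max_age


def checklist_status(last_retention_run: datetime,
                     last_rootpack_rotation: datetime,
                     now: datetime) -> dict:
    """Evaluate the two time-based checklist items; True means the box can be ticked."""
    return {
        "cas_retention_job": is_fresh(last_retention_run, now, RETENTION_JOB_MAX_AGE),
        "rootpack_rotation": is_fresh(last_rootpack_rotation, now, ROOTPACK_MAX_AGE),
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(checklist_status(now - timedelta(hours=3), now - timedelta(days=30), now))
```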
---
## 6 · Incident log
| Date (UTC) | Incident ID | Tenant | Summary | Follow-up |
|------------|-------------|--------|---------|-----------|
| _TBD_ | | | | |
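
To keep entries in the table above consistent, a row can be generated rather than typed by hand. This is only a sketch: the helper name is hypothetical, the date is rendered as `YYYY-MM-DD` in UTC (the table header asks for UTC but does not fix a format), and any incident ID scheme is up to the team.

```python
from datetime import datetime, timezone


def incident_row(when: datetime, incident_id: str, tenant: str,
                 summary: str, follow_up: str) -> str:
    """Format one Markdown row matching the incident log columns."""
    if when.tzinfo is None:
        # Refuse naive datetimes so the UTC date is never guessed.
        raise ValueError("when must be timezone-aware")
    date_utc = when.astimezone(timezone.utc).strftime("%Y-%m-%d")
    cells = [date_utc, incident_id, tenant, summary, follow_up]
    # Escape pipes so free text cannot break the table layout.
    escaped = [cell.strip().replace("|", "\\|") for cell in cells]
    return "| " + " | ".join(escaped) + " |"


if __name__ == "__main__":
    print(incident_row(datetime.now(timezone.utc), "INC-EXAMPLE", "tenant-a",
                       "Replay verify failed", "Postmortem pending"))
```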
---
## 7 · References
- `docs/replay/DETERMINISTIC_REPLAY.md`
- `docs/replay/DEVS_GUIDE_REPLAY.md`
- `docs/replay/TEST_STRATEGY.md`
- `docs/modules/platform/architecture-overview.md` §5
- `docs/modules/evidence-locker/architecture.md`
- `docs/modules/telemetry/architecture.md`
- `docs/implplan/SPRINT_187_evidence_locker_cli_integration.md`
---
*Created: 2025-11-03 — Update alongside replay task status changes.*