# Runbook — Replay Operations
> **Audience:** Ops Guild · Evidence Locker Guild · Scanner Guild · Authority/Signer · Attestor
> **Prereqs:** `docs/replay/DETERMINISTIC_REPLAY.md`, `docs/replay/DEVS_GUIDE_REPLAY.md`, `docs/replay/TEST_STRATEGY.md`, `docs/modules/platform/architecture-overview.md` §5

This runbook governs day-to-day replay operations, retention, and incident handling across online and air-gapped environments. Keep it in sync with the tasks in `docs/implplan/SPRINT_187_evidence_locker_cli_integration.md`.

---
## 1 · Terminology
- **Replay Manifest** — `manifest.json` describing scan inputs, outputs, signatures.
- **Input Bundle** — `inputbundle.tar.zst` containing feeds, policies, tools, env.
- **Output Bundle** — `outputbundle.tar.zst` with SBOM, findings, VEX, logs.
- **DSSE Envelope** — Signed metadata produced by Authority/Signer.
- **RootPack** — Trusted key bundle used to validate DSSE signatures offline.
---
## 2 · Normal operations
1. **Ingestion**
   - Scanner WebService writes manifest metadata to `replay_runs`.
   - Bundles are uploaded to CAS (`cas://replay/...`) and mirrored into Evidence Locker (`evidence.replay_bundles`).
   - Authority triggers DSSE signing; Attestor optionally anchors to Rekor.
2. **Verification**
   - Nightly job runs `stella verify` on the latest N replay manifests per tenant.
   - Metrics `replay_verify_total{result}` and `replay_bundle_size_bytes` are recorded in the Telemetry Stack (see `docs/modules/telemetry/architecture.md`).
   - Failures alert `#ops-replay` via PagerDuty with a runbook link.
3. **Retention**
   - Hot CAS retention: 180 days (configurable per tenant). Cron job `replay-retention` prunes expired digests and writes audit entries.
   - Cold storage (Evidence Locker): 2 years; legal holds extend via `/evidence/holds`. Ensure holds are recorded in `timeline.events` with type `replay.hold.created`.
4. **Access control**
   - Only service identities with the `replay:read` scope may fetch bundles. The CLI requires the device or client credential flow with DPoP.
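
The retention rules in item 3 can be sketched as a simple date check. This is a minimal illustration, not the `replay-retention` job itself: the function name `is_expired`, the two constants, the approximation of 2 years as 730 days, and the choice to let a hold flag block pruning in either tier are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Defaults taken from the retention bullets; real values are per-tenant config.
HOT_CAS_RETENTION = timedelta(days=180)
COLD_RETENTION = timedelta(days=2 * 365)  # "2 years", approximated as 730 days


def is_expired(created_at: datetime, now: datetime, retention: timedelta,
               on_legal_hold: bool = False) -> bool:
    """Return True when a bundle is past its retention window and not held."""
    if on_legal_hold:
        return False  # assumption: a legal hold always blocks pruning
    return now - created_at > retention


# Hypothetical bundle created 306 days before the check.
created = datetime(2025, 1, 1, tzinfo=timezone.utc)
now = datetime(2025, 11, 3, tzinfo=timezone.utc)

print(is_expired(created, now, HOT_CAS_RETENTION))        # past the hot window
print(is_expired(created, now, COLD_RETENTION))           # still inside cold window
print(is_expired(created, now, HOT_CAS_RETENTION, True))  # held, so never pruned
```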
---
## 3 · Incident response (Replay Integrity)
| Step | Action | Owner | Notes |
|------|--------|-------|-------|
| 1 | Page Ops via `replay_verify_total{result="failed"}` alert | Observability | Include scan id, tenant, failure codes |
| 2 | Lock affected bundles (`POST /evidence/holds`) | Evidence Locker | Reference incident ticket |
| 3 | Re-run `stella verify` with `--explain` to gather diffs | Scanner Guild | Attach diff JSON to incident |
| 4 | Check Rekor inclusion proofs (`stella verify --ledger`) | Attestor | Flag if ledger mismatch or stale |
| 5 | If tool hashes have drifted, coordinate rotation with Signer | Authority/Signer | Rotate DSSE profile, update RootPack |
| 6 | Update incident timeline (this runbook, section 6 · Incident log) | Ops Guild | Record timestamps and decisions |
| 7 | Close hold once resolved, publish postmortem | Ops + Docs | Postmortem must reference replay spec sections |
---
## 4 · Air-gapped workflow
1. Receive Offline Kit bundle containing:
   - `offline/replay/<scan-id>/manifest.json`
   - Bundles + DSSE signatures
   - RootPack snapshot
2. Run `stella replay manifest.json --strict --offline` using local CLI.
3. Load feed/policy snapshots from kit; never hit external networks.
4. Store verification logs under `ops/offline/replay/<scan-id>/`.
5. Sync results back to Evidence Locker once connectivity restored.
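
The steps above can be sketched as a small planning helper that checks the kit layout before anything is executed. This is only an illustration under stated assumptions: `plan_offline_replay` and its return shape are hypothetical, the command is copied from step 2 and assumed to run from the directory holding `manifest.json`, and the log directory is returned as a relative path because the runbook does not say what it is relative to.

```python
from pathlib import Path


def plan_offline_replay(kit_root: Path, scan_id: str) -> tuple[list[str], Path, Path]:
    """Return (command, working directory, log directory) for one offline replay."""
    # Kit layout from step 1: offline/replay/<scan-id>/manifest.json
    replay_dir = kit_root / "offline" / "replay" / scan_id
    manifest = replay_dir / "manifest.json"
    if not manifest.is_file():
        raise FileNotFoundError(f"manifest missing from kit: {manifest}")

    # Command from step 2, assumed to be run with replay_dir as the cwd.
    command = ["stella", "replay", "manifest.json", "--strict", "--offline"]

    # Log location from step 4; the base directory is the operator's choice.
    log_dir = Path("ops") / "offline" / "replay" / scan_id
    return command, replay_dir, log_dir


# Typical use (not executed here): pass `command` to subprocess.run with
# cwd=replay_dir, capture the output, and write it under log_dir.
```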
---
## 5 · Maintenance checklist
- [ ] RootPack rotated quarterly; CLI/Evidence Locker updated with new fingerprints.
- [ ] CAS retention job executed successfully in the past 24 hours.
- [ ] Replay verification metrics present in dashboards (x64 + arm64 lanes).
- [ ] Runbook incident log updated (see section 6) for the last drill.
- [ ] Offline kit instructions verified against current CLI version.
---
## 6 · Incident log
| Date (UTC) | Incident ID | Tenant | Summary | Follow-up |
|------------|-------------|--------|---------|-----------|
| _TBD_ | | | | |
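
To keep entries consistent, a row for the table above could be generated rather than typed by hand. The helper below is a hypothetical sketch: `incident_log_row` is not part of any Stella tooling, the `YYYY-MM-DD` date format is an assumption, and the incident ID and tenant in the example are made up.

```python
from datetime import datetime, timedelta, timezone


def incident_log_row(when: datetime, incident_id: str, tenant: str,
                     summary: str, follow_up: str) -> str:
    """Format one Markdown row matching the incident log columns."""
    if when.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime so UTC conversion is exact")
    date_utc = when.astimezone(timezone.utc).strftime("%Y-%m-%d")
    cells = [date_utc, incident_id, tenant, summary, follow_up]
    # Escape pipes so free text cannot break the table layout.
    escaped = [cell.replace("|", "\\|") for cell in cells]
    return "| " + " | ".join(escaped) + " |"


# Hypothetical incident recorded late evening at UTC-2, which is the next day in UTC.
local_time = datetime(2025, 11, 3, 23, 30, tzinfo=timezone(timedelta(hours=-2)))
print(incident_log_row(local_time, "INC-0001", "tenant-a",
                       "Nightly verify failed", "Rotate RootPack"))
```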
---
## 7 · References
- `docs/replay/DETERMINISTIC_REPLAY.md`
- `docs/replay/DEVS_GUIDE_REPLAY.md`
- `docs/replay/TEST_STRATEGY.md`
- `docs/modules/platform/architecture-overview.md` §5
- `docs/modules/evidence-locker/architecture.md`
- `docs/modules/telemetry/architecture.md`
- `docs/implplan/SPRINT_187_evidence_locker_cli_integration.md`
---
*Created: 2025-11-03 — Update alongside replay task status changes.*