git.stella-ops.org/docs/runbooks/replay_ops.md
StellaOps Bot 49922dff5a
2025-12-11 02:32:18 +02:00

Runbook - Replay Operations

Audience: Ops Guild / Evidence Locker Guild / Scanner Guild / Authority/Signer / Attestor
Prereqs: docs/replay/DETERMINISTIC_REPLAY.md, docs/replay/DEVS_GUIDE_REPLAY.md, docs/replay/TEST_STRATEGY.md, docs/modules/platform/architecture-overview.md

This runbook governs day-to-day replay operations, retention, and incident handling across online and air-gapped environments. Keep it in sync with the tasks in docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md.


1 Terminology

  • Replay Manifest - manifest.json describing scan inputs, outputs, signatures.
  • Input Bundle - inputbundle.tar.zst containing feeds, policies, tools, env.
  • Output Bundle - outputbundle.tar.zst with SBOM, findings, VEX, logs.
  • DSSE Envelope - Signed metadata produced by Authority/Signer.
  • RootPack - Trusted key bundle used to validate DSSE signatures offline.
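
For orientation: under the DSSE specification, a signer signs a pre-authentication encoding (PAE) of the payload type and payload rather than the raw payload bytes. The sketch below shows that encoding and the unsigned envelope shape. The payload type and payload used in the example call are placeholders taken from the DSSE specification's own illustration, not values used by Authority/Signer.

```python
import base64


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes that get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b" ".join([
        b"DSSEv1",
        str(len(type_bytes)).encode("ascii"),  # byte length of the type
        type_bytes,
        str(len(payload)).encode("ascii"),     # byte length of the payload
        payload,
    ])


def unsigned_envelope(payload_type: str, payload: bytes) -> dict:
    """Envelope shape before signing; each signature entry carries keyid and sig."""
    return {
        "payloadType": payload_type,
        "payload": base64.b64encode(payload).decode("ascii"),
        "signatures": [],
    }


# Placeholder type and payload, for illustration only.
to_sign = pae("http://example.com/HelloWorld", b"hello world")
```

Offline verification against a RootPack would recompute the same PAE bytes from the envelope and check each signature over them.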

2 Normal operations

  1. Ingestion
    • Scanner WebService writes manifest metadata to replay_runs.
    • Bundles are uploaded to CAS (cas://replay/...) and mirrored into Evidence Locker (evidence.replay_bundles).
    • Authority triggers DSSE signing; Attestor optionally anchors to Rekor.
  2. Verification
    • Nightly job runs stella verify on the latest N replay manifests per tenant.
    • Metrics replay_verify_total{result} and replay_bundle_size_bytes are recorded in the Telemetry Stack (see docs/modules/telemetry/architecture.md).
    • Failures alert #ops-replay via PagerDuty; the alert includes a link to this runbook.
  3. Retention
    • Hot CAS retention: 180 days (configurable per tenant). Cron job replay-retention prunes expired digests and writes audit entries.
    • Cold storage (Evidence Locker): 2 years; legal holds extend retention via /evidence/holds. Ensure holds are recorded in timeline.events with type replay.hold.created.
    • Retention declaration: validate against docs/schemas/replay-retention.schema.json (frozen 2025-12-10). Include retention_policy_id, tenant_id, bundle_type, retention_days, legal_hold, purge_after, checksum, created_at. When persisting a declaration, audit its checksum via the DSSE envelope.
  4. Access control
    • Only service identities with replay:read scope may fetch bundles. CLI requires device or client credential flow with DPoP.
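
As a rough pre-check before running full schema validation, the retention declaration fields listed above can be tested for presence, and a purge decision can be derived from legal_hold and purge_after. This is a minimal sketch only: the field names come from this runbook, but the value types (ISO 8601 timestamps with an explicit offset, a boolean legal_hold) and the example values are assumptions. docs/schemas/replay-retention.schema.json remains the authority.

```python
from datetime import datetime, timezone
from typing import List

# Field names as listed in this runbook.
REQUIRED_FIELDS = (
    "retention_policy_id", "tenant_id", "bundle_type", "retention_days",
    "legal_hold", "purge_after", "checksum", "created_at",
)


def missing_fields(declaration: dict) -> List[str]:
    """Return required field names absent from the declaration, in order."""
    return [name for name in REQUIRED_FIELDS if name not in declaration]


def is_purgeable(declaration: dict, now: datetime) -> bool:
    """A bundle under legal hold is never purgeable; otherwise compare purge_after."""
    if declaration["legal_hold"]:
        return False
    # Assumption: purge_after is an ISO 8601 string with an explicit UTC offset.
    purge_after = datetime.fromisoformat(declaration["purge_after"])
    return now >= purge_after


# Hypothetical declaration; every value here is illustrative.
example = {
    "retention_policy_id": "hot-cas-default",
    "tenant_id": "tenant-a",
    "bundle_type": "input",
    "retention_days": 180,
    "legal_hold": False,
    "purge_after": "2026-06-09T00:00:00+00:00",
    "checksum": "sha256:<digest>",
    "created_at": "2025-12-11T00:00:00+00:00",
}
```

Keeping the legal-hold check ahead of the date comparison means a hold always wins, even when purge_after has already passed.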

3 Incident response (Replay Integrity)

| Step | Action | Owner | Notes |
|------|--------|-------|-------|
| 1 | Page Ops via replay_verify_total{result="failed"} alert | Observability | Include scan id, tenant, failure codes |
| 2 | Lock affected bundles (POST /evidence/holds) | Evidence Locker | Reference incident ticket |
| 3 | Re-run stella verify with --explain to gather diffs | Scanner Guild | Attach diff JSON to incident |
| 4 | Check Rekor inclusion proofs (stella verify --ledger) | Attestor | Flag if ledger mismatch or stale |
| 5 | If tool hash drift -> coordinate Signer for rotation | Authority/Signer | Rotate DSSE profile, update RootPack |
| 6 | Update incident timeline (docs/runbooks/replay_ops.md -> Incident Log) | Ops Guild | Record timestamps and decisions |
| 7 | Close hold once resolved, publish postmortem | Ops + Docs | Postmortem must reference replay spec sections |
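
Steps 3 and 4 above can be scripted so that responders run the same two invocations every time. The sketch below only builds the argument lists and does not execute anything. The --explain and --ledger flags come from this runbook; passing the manifest path as a positional argument is an assumption, so check it against the current CLI help before relying on it.

```python
import shlex
from typing import List


def verify_commands(manifest_path: str) -> List[List[str]]:
    """Build argv lists for the two verification passes used in an incident."""
    return [
        # Step 3: gather diffs for the incident ticket.
        ["stella", "verify", manifest_path, "--explain"],
        # Step 4: check Rekor inclusion proofs.
        ["stella", "verify", manifest_path, "--ledger"],
    ]


# Print the commands in copy-pasteable form; the manifest path is assumed positional.
for argv in verify_commands("manifest.json"):
    print(shlex.join(argv))
```

Each list can be passed directly to subprocess.run, which avoids shell quoting issues with unusual manifest paths.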

4 Air-gapped workflow

  1. Receive Offline Kit bundle containing:
    • offline/replay/<scan-id>/manifest.json
    • Bundles + DSSE signatures
    • RootPack snapshot
  2. Run stella replay manifest.json --strict --offline using local CLI.
  3. Load feed/policy snapshots from kit; never hit external networks.
  4. Store verification logs under ops/offline/replay/<scan-id>/.
  5. Sync results back to Evidence Locker once connectivity restored.
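
The workflow above can be sketched as a small planning helper that confirms the manifest is present in the kit, prepares the log directory, and builds the replay command. The directory layout and the stella replay flags come from this runbook. The helper name, the kit_root and ops_root parameters, and the choice to pass the full manifest path are illustrative assumptions.

```python
from pathlib import Path
from typing import List, Tuple


def plan_offline_replay(kit_root: Path, scan_id: str, ops_root: Path) -> Tuple[List[str], Path]:
    """Check the kit layout and return (argv, log_dir) for an offline replay."""
    manifest = kit_root / "offline" / "replay" / scan_id / "manifest.json"
    if not manifest.is_file():
        raise FileNotFoundError(f"manifest missing from kit: {manifest}")

    # Verification logs are stored under ops/offline/replay/<scan-id>/.
    log_dir = ops_root / "ops" / "offline" / "replay" / scan_id
    log_dir.mkdir(parents=True, exist_ok=True)

    # Flags as given in this runbook; nothing here touches the network.
    argv = ["stella", "replay", str(manifest), "--strict", "--offline"]
    return argv, log_dir
```

Failing fast on a missing manifest keeps an incomplete kit from producing a partial verification log that might later be synced to Evidence Locker.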

5 Maintenance checklist

  • RootPack rotated quarterly; CLI/Evidence Locker updated with new fingerprints.
  • CAS retention job executed successfully in the past 24 hours.
  • Replay verification metrics present in dashboards (x64 + arm64 lanes).
  • Runbook incident log updated (see section 6) for the last drill.
  • Offline kit instructions verified against current CLI version.
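
The two time-based checklist items lend themselves to an automated check. The sketch below flags a stale retention job and an overdue RootPack rotation. The 24-hour window comes from the checklist; treating "quarterly" as 92 days is an assumption, and how the two timestamps are obtained is left to the caller.

```python
from datetime import datetime, timedelta, timezone
from typing import List

RETENTION_JOB_MAX_AGE = timedelta(hours=24)
ROOTPACK_MAX_AGE = timedelta(days=92)  # assumption: "quarterly" read as 92 days


def checklist_findings(last_retention_success: datetime,
                       last_rootpack_rotation: datetime,
                       now: datetime) -> List[str]:
    """Return a human-readable finding for each time-based item that is overdue."""
    findings = []
    if now - last_retention_success > RETENTION_JOB_MAX_AGE:
        findings.append("CAS retention job has not succeeded in the past 24 hours")
    if now - last_rootpack_rotation > ROOTPACK_MAX_AGE:
        findings.append("RootPack rotation is overdue")
    return findings
```

An empty list means both items pass; the remaining checklist items (dashboards, incident log, offline kit instructions) still need a manual review.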

6 Incident log

| Date (UTC) | Incident ID | Tenant | Summary | Follow-up |
|------------|-------------|--------|---------|-----------|
| TBD | | | | |

7 References

  • docs/replay/DETERMINISTIC_REPLAY.md
  • docs/replay/DEVS_GUIDE_REPLAY.md
  • docs/replay/TEST_STRATEGY.md
  • docs/modules/platform/architecture-overview.md section 5
  • docs/modules/evidence-locker/architecture.md
  • docs/modules/telemetry/architecture.md
  • docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md

Created: 2025-11-03 - Update alongside replay task status changes.