Runbook — Replay Operations

Audience: Ops Guild · Evidence Locker Guild · Scanner Guild · Authority/Signer · Attestor
Prereqs: docs/replay/DETERMINISTIC_REPLAY.md, docs/replay/DEVS_GUIDE_REPLAY.md, docs/replay/TEST_STRATEGY.md, docs/modules/platform/architecture-overview.md §5

This runbook governs day-to-day replay operations, retention, and incident handling across online and air-gapped environments. Keep it in sync with the tasks in docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md.


1 · Terminology

  • Replay Manifest: manifest.json describing scan inputs, outputs, and signatures.
  • Input Bundle: inputbundle.tar.zst containing feeds, policies, tools, and env.
  • Output Bundle: outputbundle.tar.zst with SBOM, findings, VEX, and logs.
  • DSSE Envelope: signed metadata produced by Authority/Signer.
  • RootPack: trusted key bundle used to validate DSSE signatures offline.

2 · Normal operations

  1. Ingestion
    • Scanner WebService writes manifest metadata to replay_runs.
    • Bundles are uploaded to CAS (cas://replay/...) and mirrored into Evidence Locker (evidence.replay_bundles).
    • Authority triggers DSSE signing; Attestor optionally anchors to Rekor.
  2. Verification
    • Nightly job runs stella verify on the latest N replay manifests per tenant.
    • Metrics replay_verify_total{result} and replay_bundle_size_bytes are recorded in the Telemetry Stack (see docs/modules/telemetry/architecture.md).
    • Failures alert #ops-replay via PagerDuty with runbook link.
  3. Retention
    • Hot CAS retention: 180 days (configurable per tenant). Cron job replay-retention prunes expired digests and writes audit entries.
    • Cold storage (Evidence Locker): 2 years; legal holds extend via /evidence/holds. Ensure holds are recorded in timeline.events with type replay.hold.created.
  4. Access control
    • Only service identities with replay:read scope may fetch bundles. CLI requires device or client credential flow with DPoP.
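
The retention rules above can be sketched as a small eligibility check. This is an illustration only, not the replay-retention job itself: the `BundleRecord` shape, the function name, the reading of "2 years" as 730 days, and the choice to skip held bundles in both tiers are assumptions. Only the 180-day and 2-year defaults and the existence of legal holds come from this runbook.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Default from this runbook; per-tenant overrides would be passed in by the caller.
HOT_RETENTION = timedelta(days=180)
# Assumption: "2 years" is approximated as 730 days.
COLD_RETENTION = timedelta(days=730)


@dataclass(frozen=True)
class BundleRecord:
    """Hypothetical record shape; the real schema is not defined in this runbook."""
    digest: str
    stored_at: datetime
    on_legal_hold: bool = False


def is_prunable(bundle: BundleRecord, now: datetime, retention: timedelta) -> bool:
    """Return True when the bundle is past its retention window and not held."""
    if bundle.on_legal_hold:
        # Assumption: a legal hold blocks pruning until the hold is closed.
        return False
    return now - bundle.stored_at > retention


if __name__ == "__main__":
    now = datetime(2025, 12, 1, tzinfo=timezone.utc)
    expired = BundleRecord("sha256:aaa", now - timedelta(days=200))
    held = BundleRecord("sha256:bbb", now - timedelta(days=200), on_legal_hold=True)
    print(is_prunable(expired, now, HOT_RETENTION))  # past 180 days, no hold
    print(is_prunable(held, now, HOT_RETENTION))     # past 180 days, but held
```

Passing the retention window as a parameter keeps the per-tenant configurability mentioned above outside the check itself.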

3 · Incident response (Replay Integrity)

| Step | Action | Owner | Notes |
|------|--------|-------|-------|
| 1 | Page Ops via replay_verify_total{result="failed"} alert | Observability | Include scan id, tenant, failure codes |
| 2 | Lock affected bundles (POST /evidence/holds) | Evidence Locker | Reference incident ticket |
| 3 | Re-run stella verify with --explain to gather diffs | Scanner Guild | Attach diff JSON to incident |
| 4 | Check Rekor inclusion proofs (stella verify --ledger) | Attestor | Flag if ledger mismatch or stale |
| 5 | If tool hash drift → coordinate Signer for rotation | Authority/Signer | Rotate DSSE profile, update RootPack |
| 6 | Update incident timeline (docs/runbooks/replay_ops.md -> Incident Log) | Ops Guild | Record timestamps and decisions |
| 7 | Close hold once resolved, publish postmortem | Ops + Docs | Postmortem must reference replay spec sections |

4 · Air-gapped workflow

  1. Receive Offline Kit bundle containing:
    • offline/replay/<scan-id>/manifest.json
    • Bundles + DSSE signatures
    • RootPack snapshot
  2. Run stella replay manifest.json --strict --offline using local CLI.
  3. Load feed/policy snapshots from kit; never hit external networks.
  4. Store verification logs under ops/offline/replay/<scan-id>/.
  5. Sync results back to Evidence Locker once connectivity restored.
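
The paths and the command in the steps above can be kept consistent with a small helper. This is a sketch under stated assumptions: the function names and the scan-id validation are hypothetical, and the helper only builds the argument list and paths quoted in this section; it does not run the CLI or touch the network.

```python
from pathlib import PurePosixPath


def _check_scan_id(scan_id: str) -> str:
    """Reject ids that would escape the per-scan directory (hypothetical guard)."""
    if not scan_id or "/" in scan_id or scan_id in {".", ".."}:
        raise ValueError(f"invalid scan id: {scan_id!r}")
    return scan_id


def kit_manifest_path(scan_id: str) -> PurePosixPath:
    """Location of the manifest inside the Offline Kit (step 1)."""
    return PurePosixPath("offline/replay") / _check_scan_id(scan_id) / "manifest.json"


def offline_replay_command(manifest_path: str) -> list[str]:
    """Argument list for the offline replay run, using the flags given in step 2."""
    return ["stella", "replay", manifest_path, "--strict", "--offline"]


def verification_log_dir(scan_id: str) -> PurePosixPath:
    """Directory where verification logs for a scan are stored (step 4)."""
    return PurePosixPath("ops/offline/replay") / _check_scan_id(scan_id)


if __name__ == "__main__":
    scan = "scan-123"  # placeholder id for illustration
    print(offline_replay_command(str(kit_manifest_path(scan))))
    print(verification_log_dir(scan))
```

Building the argument list as a Python list rather than a shell string avoids quoting problems if it is later handed to `subprocess.run`.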

5 · Maintenance checklist

  • RootPack rotated quarterly; CLI/Evidence Locker updated with new fingerprints.
  • CAS retention job executed successfully in the past 24 hours.
  • Replay verification metrics present in dashboards (x64 + arm64 lanes).
  • Runbook incident log updated (see section 6) for the last drill.
  • Offline kit instructions verified against current CLI version.
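
The two time-based checklist items lend themselves to an automated check. The sketch below is illustrative only: the function name and its inputs are hypothetical, and "quarterly" is read as at most 92 days between rotations, which is an assumption rather than a documented threshold. The 24-hour window comes from the checklist.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "quarterly" means at most 92 days between RootPack rotations.
ROOTPACK_MAX_AGE = timedelta(days=92)
# From the checklist: the CAS retention job must have succeeded in the past 24 hours.
RETENTION_JOB_MAX_AGE = timedelta(hours=24)


def checklist_findings(
    last_rootpack_rotation: datetime,
    last_retention_success: datetime,
    now: datetime,
) -> list[str]:
    """Return a list of overdue items; an empty list means both checks pass."""
    findings = []
    if now - last_rootpack_rotation > ROOTPACK_MAX_AGE:
        findings.append("RootPack rotation overdue")
    if now - last_retention_success > RETENTION_JOB_MAX_AGE:
        findings.append("CAS retention job has not succeeded in the past 24 hours")
    return findings


if __name__ == "__main__":
    now = datetime(2025, 12, 6, tzinfo=timezone.utc)
    print(checklist_findings(now - timedelta(days=30), now - timedelta(hours=2), now))
    print(checklist_findings(now - timedelta(days=120), now - timedelta(hours=30), now))
```

The remaining items (dashboard presence, incident log, offline kit instructions) need human review and are left out of the sketch.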

6 · Incident log

| Date (UTC) | Incident ID | Tenant | Summary | Follow-up |
|------------|-------------|--------|---------|-----------|
| TBD | | | | |

7 · References

  • docs/replay/DETERMINISTIC_REPLAY.md
  • docs/replay/DEVS_GUIDE_REPLAY.md
  • docs/replay/TEST_STRATEGY.md
  • docs/modules/platform/architecture-overview.md §5
  • docs/modules/evidence-locker/architecture.md
  • docs/modules/telemetry/architecture.md
  • docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md

Created: 2025-11-03 — Update alongside replay task status changes.