Runbook — Replay Operations

Audience: Ops Guild · Evidence Locker Guild · Scanner Guild · Authority/Signer · Attestor
Prereqs: docs/replay/DETERMINISTIC_REPLAY.md, docs/replay/DEVS_GUIDE_REPLAY.md, docs/replay/TEST_STRATEGY.md, docs/modules/platform/architecture-overview.md §5

This runbook governs day-to-day replay operations, retention, and incident handling across online and air-gapped environments. Keep it in sync with the tasks in docs/implplan/SPRINT_187_evidence_cli_replay.md.


1 · Terminology

  • Replay Manifest: manifest.json describing scan inputs, outputs, signatures.
  • Input Bundle: inputbundle.tar.zst containing feeds, policies, tools, env.
  • Output Bundle: outputbundle.tar.zst with SBOM, findings, VEX, logs.
  • DSSE Envelope: Signed metadata produced by Authority/Signer.
  • RootPack: Trusted key bundle used to validate DSSE signatures offline.

2 · Normal operations

  1. Ingestion
    • Scanner WebService writes manifest metadata to replay_runs.
    • Bundles uploaded to CAS (cas://replay/...) and mirrored into Evidence Locker (evidence.replay_bundles).
    • Authority triggers DSSE signing; Attestor optionally anchors to Rekor.
  2. Verification
    • Nightly job runs stella verify on the latest N replay manifests per tenant.
    • Metrics replay_verify_total{result}, replay_bundle_size_bytes recorded in Telemetry Stack (see docs/modules/telemetry/architecture.md).
    • Failures alert #ops-replay via PagerDuty with runbook link.
  3. Retention
    • Hot CAS retention: 180 days (configurable per tenant). Cron job replay-retention prunes expired digests and writes audit entries.
    • Cold storage (Evidence Locker): 2 years; legal holds extend via /evidence/holds. Ensure holds are recorded in timeline.events with type replay.hold.created.
  4. Access control
    • Only service identities with replay:read scope may fetch bundles. CLI requires device or client credential flow with DPoP.
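
The verification and retention rules above can be sketched in a few lines. This is a minimal illustration, not the actual job code: the `ReplayRun` shape, the function names, and the 730-day approximation of "2 years" are assumptions made for the example, and the real `replay_runs` schema and `replay-retention` job may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Defaults taken from the retention bullets above.
HOT_RETENTION = timedelta(days=180)        # hot CAS, configurable per tenant
COLD_RETENTION = timedelta(days=2 * 365)   # Evidence Locker; "2 years" approximated as 730 days


@dataclass(frozen=True)
class ReplayRun:
    """Hypothetical row shape; the real replay_runs schema may differ."""
    tenant: str
    scan_id: str
    created_at: datetime


def latest_per_tenant(runs, n):
    """Return the n most recent runs for each tenant, newest first."""
    by_tenant = defaultdict(list)
    for run in runs:
        by_tenant[run.tenant].append(run)
    return {
        tenant: sorted(items, key=lambda r: r.created_at, reverse=True)[:n]
        for tenant, items in by_tenant.items()
    }


def is_prunable(created_at, now, retention=HOT_RETENTION, on_hold=False):
    """True once the retention window has elapsed and no legal hold applies."""
    if on_hold:
        return False
    return now - created_at > retention
```

The nightly job would feed each selected run to `stella verify`, while the pruning job would skip any digest for which `is_prunable` returns `False`. Treating a legal hold as an unconditional veto mirrors the "legal holds extend" wording above.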

3 · Incident response (Replay Integrity)

| Step | Action | Owner | Notes |
|------|--------|-------|-------|
| 1 | Page Ops via replay_verify_total{result="failed"} alert | Observability | Include scan id, tenant, failure codes |
| 2 | Lock affected bundles (POST /evidence/holds) | Evidence Locker | Reference incident ticket |
| 3 | Re-run stella verify with --explain to gather diffs | Scanner Guild | Attach diff JSON to incident |
| 4 | Check Rekor inclusion proofs (stella verify --ledger) | Attestor | Flag if ledger mismatch or stale |
| 5 | If tool hash drift → coordinate Signer for rotation | Authority/Signer | Rotate DSSE profile, update RootPack |
| 6 | Update incident timeline (docs/runbooks/replay_ops.md -> Incident Log) | Ops Guild | Record timestamps and decisions |
| 7 | Close hold once resolved, publish postmortem | Ops + Docs | Postmortem must reference replay spec sections |
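
For steps 3 and 4, it can help to generate both verification commands from one place so the incident ticket records exactly what was run. The sketch below only assembles argument lists. The `--explain` and `--ledger` flags come from the table above, but passing the manifest path as a positional argument is an assumption modeled on the `stella replay manifest.json` form in section 4; check the CLI help before relying on it.

```python
import shlex


def verify_commands(manifest_path):
    """Build the step 3 and step 4 invocations.

    Assumption: `stella verify` accepts the manifest path positionally.
    """
    base = ["stella", "verify", manifest_path]
    return {
        "explain": base + ["--explain"],  # step 3: gather diffs
        "ledger": base + ["--ledger"],    # step 4: check Rekor inclusion proofs
    }


def render(commands):
    """Shell-quoted command lines, suitable for pasting into the incident ticket."""
    return {step: shlex.join(argv) for step, argv in commands.items()}
```

Calling `render(verify_commands("manifest.json"))` yields one quoted command line per step.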

4 · Air-gapped workflow

  1. Receive Offline Kit bundle containing:
    • offline/replay/<scan-id>/manifest.json
    • Bundles + DSSE signatures
    • RootPack snapshot
  2. Run stella replay manifest.json --strict --offline using local CLI.
  3. Load feed/policy snapshots from kit; never hit external networks.
  4. Store verification logs under ops/offline/replay/<scan-id>/.
  5. Sync results back to Evidence Locker once connectivity restored.
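
Steps 1, 2 and 4 above can be wrapped in a small script so every offline replay leaves logs in the expected place. This is a sketch under stated assumptions: the log file names (`replay.stdout.log`, `replay.stderr.log`) and the function names are hypothetical, and only the command line and the directory layout are taken from the steps above.

```python
import subprocess
from pathlib import Path


def build_replay_command(manifest: Path) -> list[str]:
    # Flags copied from step 2 of the workflow above.
    return ["stella", "replay", str(manifest), "--strict", "--offline"]


def log_dir_for(scan_id: str, root: Path = Path("ops/offline/replay")) -> Path:
    # Step 4: logs live under ops/offline/replay/<scan-id>/.
    return root / scan_id


def run_offline_replay(kit_root: Path, scan_id: str, log_root: Path,
                       runner=subprocess.run) -> int:
    """Replay one scan from an unpacked Offline Kit and keep its logs.

    `runner` is injectable so the wrapper can be exercised without the CLI.
    """
    manifest = kit_root / "offline" / "replay" / scan_id / "manifest.json"
    if not manifest.is_file():
        raise FileNotFoundError(manifest)

    logs = log_dir_for(scan_id, log_root)
    logs.mkdir(parents=True, exist_ok=True)

    result = runner(build_replay_command(manifest), capture_output=True, text=True)
    # Hypothetical file names; adjust to local conventions.
    (logs / "replay.stdout.log").write_text(result.stdout)
    (logs / "replay.stderr.log").write_text(result.stderr)
    return result.returncode
```

A non-zero return code should be treated as a verification failure and handled per section 3 once connectivity is restored.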

5 · Maintenance checklist

  • RootPack rotated quarterly; CLI/Evidence Locker updated with new fingerprints.
  • CAS retention job executed successfully in the past 24 hours.
  • Replay verification metrics present in dashboards (x64 + arm64 lanes).
  • Runbook incident log updated (see section 6) for the last drill.
  • Offline kit instructions verified against current CLI version.
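
The two time-based items in this checklist are easy to check mechanically. The sketch below flags them when they are overdue. Treating "quarterly" as 92 days is an assumption, as are the function and item names; the timestamps would come from wherever rotation and job runs are recorded.

```python
from datetime import datetime, timedelta, timezone

ROOTPACK_MAX_AGE = timedelta(days=92)        # "quarterly", approximated (assumption)
RETENTION_JOB_MAX_AGE = timedelta(hours=24)  # from the checklist above


def overdue_items(now, last_rootpack_rotation, last_retention_success):
    """Return the names of checklist items whose deadline has passed."""
    overdue = []
    if now - last_rootpack_rotation > ROOTPACK_MAX_AGE:
        overdue.append("rootpack-rotation")
    if now - last_retention_success > RETENTION_JOB_MAX_AGE:
        overdue.append("cas-retention-job")
    return overdue
```

An empty list means both items are within their windows; the remaining checklist items still need a manual review.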

6 · Incident log

| Date (UTC) | Incident ID | Tenant | Summary | Follow-up |
|------------|-------------|--------|---------|-----------|
| TBD | | | | |

7 · References

  • docs/replay/DETERMINISTIC_REPLAY.md
  • docs/replay/DEVS_GUIDE_REPLAY.md
  • docs/replay/TEST_STRATEGY.md
  • docs/modules/platform/architecture-overview.md §5
  • docs/modules/evidence-locker/architecture.md
  • docs/modules/telemetry/architecture.md
  • docs/implplan/SPRINT_187_evidence_cli_replay.md

Created: 2025-11-03 — Update alongside replay task status changes.