From 98934170ca04cc3646c3606bee05fe15fc9be520 Mon Sep 17 00:00:00 2001 From: StellaOps Bot Date: Sat, 6 Dec 2025 23:30:12 +0000 Subject: [PATCH] ops: add policy incident runbook draft --- ...SPRINT_0502_0001_0001_ops_deployment_ii.md | 4 +- docs/implplan/tasks-all.md | 4 +- docs/runbooks/policy-incident.md | 50 +++++++++++++++++++ 3 files changed, 55 insertions(+), 3 deletions(-) create mode 100644 docs/runbooks/policy-incident.md diff --git a/docs/implplan/SPRINT_0502_0001_0001_ops_deployment_ii.md b/docs/implplan/SPRINT_0502_0001_0001_ops_deployment_ii.md index d3b614e1d..5785e9402 100644 --- a/docs/implplan/SPRINT_0502_0001_0001_ops_deployment_ii.md +++ b/docs/implplan/SPRINT_0502_0001_0001_ops_deployment_ii.md @@ -20,7 +20,7 @@ ## Delivery Tracker | # | Task ID | Status | Key dependency / next step | Owners | Task Definition | | --- | --- | --- | --- | --- | --- | -| 1 | DEPLOY-POLICY-27-002 | TODO | Depends on DEPLOY-POLICY-27-001 | Deployment Guild, Policy Guild | Document rollout/rollback playbooks for policy publish/promote (canary, emergency freeze, evidence retrieval) under `docs/runbooks/policy-incident.md` | +| 1 | DEPLOY-POLICY-27-002 | DOING (draft 2025-12-06) | Pending policy overlay/digests from DEPLOY-POLICY-27-001; draft runbook at `docs/runbooks/policy-incident.md` | Deployment Guild, Policy Guild | Document rollout/rollback playbooks for policy publish/promote (canary, emergency freeze, evidence retrieval) under `docs/runbooks/policy-incident.md` | | 2 | DEPLOY-VEX-30-001 | DOING (dev-mock digests 2025-12-06) | Mock digests published in `deploy/releases/2025.09-mock-dev.yaml`; production still awaits real artefacts | Deployment Guild, VEX Lens Guild | Provide Helm/Compose overlays, scaling defaults, offline kit instructions for VEX Lens service | | 3 | DEPLOY-VEX-30-002 | DOING (dev-mock digests 2025-12-06) | Depends on DEPLOY-VEX-30-001 | Deployment Guild, Issuer Directory Guild | Package Issuer Directory deployment manifests, backups, security hardening guidance | | 4 | DEPLOY-VULN-29-001 | DOING (dev-mock digests 2025-12-06) | Mock digests available in `deploy/releases/2025.09-mock-dev.yaml`; production pins pending | Deployment Guild, Findings Ledger Guild | Helm/Compose overlays for Findings Ledger + projector incl. DB migrations, Merkle anchor jobs, scaling guidance | @@ -33,6 +33,7 @@ ## Execution Log | Date (UTC) | Update | Owner | | --- | --- | --- | +| 2025-12-06 | Drafted policy incident runbook (`docs/runbooks/policy-incident.md`); set DEPLOY-POLICY-27-002 to DOING pending policy overlay/digests. | Deployment Guild | | 2025-12-06 | Header normalised to standard template; no content/status changes. | Project Mgmt | | 2025-12-06 | Seeded mock dev release manifest (`deploy/releases/2025.09-mock-dev.yaml`) covering VEX Lens and Findings/Vuln stacks; tasks moved to DOING (dev-mock) for development packaging. Production release still awaits real digests. | Deployment Guild | | 2025-12-06 | Added mock downloads manifest at `deploy/downloads/manifest.json` to unblock dev/test; production still requires signed console artefacts. | Deployment Guild | @@ -53,6 +54,7 @@ - Risk: Offline kit instructions must avoid external image pulls; ensure pinned digests and air-gap copy steps. - VEX Lens and Findings/Vuln overlays blocked: release digests absent from `deploy/releases/2025.09-stable.yaml`; cannot pin images or publish offline bundles until artefacts land. - Console downloads manifest blocked: console images/bundles not published, so `deploy/downloads/manifest.json` cannot be signed/updated. +- Policy incident runbook is draft-only until DEPLOY-POLICY-27-001 delivers policy overlay schema and production digests. ## Next Checkpoints | Date (UTC) | Session / Owner | Target outcome | Fallback / Escalation | diff --git a/docs/implplan/tasks-all.md b/docs/implplan/tasks-all.md index 1027e96a4..8ed1e5063 100644 --- a/docs/implplan/tasks-all.md +++ b/docs/implplan/tasks-all.md @@ -539,7 +539,7 @@ | DEPLOY-PACKS-42-001 | DOING (dev-mock 2025-12-06) | 2025-12-06 | SPRINT_0501_0001_0001_ops_deployment_i | Deployment Guild · Packs Registry Guild | ops/deployment | Provide deployment manifests for packs-registry and task-runner services, including Helm/Compose overlays, scaling defaults, and secret templates. | Wait for pack registry schema | AGDP0101 | | DEPLOY-PACKS-43-001 | DOING (dev-mock 2025-12-06) | 2025-12-06 | SPRINT_0501_0001_0001_ops_deployment_i | Deployment Guild · Task Runner Guild | ops/deployment | Ship remote Task Runner worker profiles, object storage bootstrap, approval workflow integration, and Offline Kit packaging instructions. Dependencies: DEPLOY-PACKS-42-001. | Needs #7 artifacts | AGDP0101 | | DEPLOY-POLICY-27-001 | DOING (dev-mock 2025-12-06) | 2025-12-05 | SPRINT_0501_0001_0001_ops_deployment_i | Deployment Guild · Policy Registry Guild | ops/deployment | Produce Helm/Compose overlays for Policy Registry + simulation workers (migrations, buckets, signing keys, tenancy defaults). | WEPO0101 | DVPL0105 | -| DEPLOY-POLICY-27-002 | TODO | | SPRINT_0502_0001_0001_ops_deployment_ii | Deployment Guild · Policy Guild | ops/deployment | Document rollout/rollback playbooks for policy publish/promote (canary strategy, emergency freeze, evidence retrieval). | DEPLOY-POLICY-27-001 | DVPL0105 | +| DEPLOY-POLICY-27-002 | DOING (draft 2025-12-06) | 2025-12-06 | SPRINT_0502_0001_0001_ops_deployment_ii | Deployment Guild · Policy Guild | ops/deployment | Drafted `docs/runbooks/policy-incident.md` (publish/promote, freeze, evidence); awaiting policy overlay schema/digests from DEPLOY-POLICY-27-001. | DEPLOY-POLICY-27-001 | DVPL0105 | | DEPLOY-VEX-30-001 | DOING (dev-mock 2025-12-06) | 2025-12-06 | SPRINT_0502_0001_0001_ops_deployment_ii | Deployment + VEX Lens Guild | ops/deployment | Provide Helm/Compose overlays, scaling defaults, and offline kit instructions for VEX Lens service. | Wait for CCWO0101 schema | DVPL0101 | | DEPLOY-VEX-30-002 | DOING (dev-mock 2025-12-06) | 2025-12-06 | SPRINT_0502_0001_0001_ops_deployment_ii | Deployment Guild | ops/deployment | Package Issuer Directory deployment manifests, backups, and security hardening guidance. Dependencies: DEPLOY-VEX-30-001. | Depends on #5 | DVPL0101 | | DEPLOY-VULN-29-001 | DOING (dev-mock 2025-12-06) | 2025-12-06 | SPRINT_0502_0001_0001_ops_deployment_ii | Deployment + Vuln Guild | ops/deployment | Produce Helm/Compose overlays for Findings Ledger + projector, including DB migrations, Merkle anchor jobs, and scaling guidance. | Needs CCWO0101 | DVPL0101 | @@ -2753,7 +2753,7 @@ | DEPLOY-PACKS-42-001 | DOING (dev-mock 2025-12-06) | 2025-12-06 | SPRINT_0501_0001_0001_ops_deployment_i | Deployment Guild · Packs Registry Guild | ops/deployment | Provide deployment manifests for packs-registry and task-runner services, including Helm/Compose overlays, scaling defaults, and secret templates. | Wait for pack registry schema | AGDP0101 | | DEPLOY-PACKS-43-001 | DOING (dev-mock 2025-12-06) | 2025-12-06 | SPRINT_0501_0001_0001_ops_deployment_i | Deployment Guild · Task Runner Guild | ops/deployment | Ship remote Task Runner worker profiles, object storage bootstrap, approval workflow integration, and Offline Kit packaging instructions. Dependencies: DEPLOY-PACKS-42-001. | Needs #7 artifacts | AGDP0101 | | DEPLOY-POLICY-27-001 | DOING (dev-mock 2025-12-06) | 2025-12-05 | SPRINT_0501_0001_0001_ops_deployment_i | Deployment Guild · Policy Registry Guild | ops/deployment | Produce Helm/Compose overlays for Policy Registry + simulation workers, including Mongo migrations, object storage buckets, signing key secrets, and tenancy defaults. | Needs registry schema + secrets | AGDP0101 | -| DEPLOY-POLICY-27-002 | TODO | | SPRINT_0502_0001_0001_ops_deployment_ii | Deployment Guild · Policy Guild | ops/deployment | Document rollout/rollback playbooks for policy publish/promote (canary strategy, emergency freeze toggle, evidence retrieval) under `/docs/runbooks/policy-incident.md`. Dependencies: DEPLOY-POLICY-27-001. | Depends on 27-001 | AGDP0101 | +| DEPLOY-POLICY-27-002 | DOING (draft 2025-12-06) | 2025-12-06 | SPRINT_0502_0001_0001_ops_deployment_ii | Deployment Guild · Policy Guild | ops/deployment | Drafted `docs/runbooks/policy-incident.md` (publish/promote, freeze, evidence); finalize once DEPLOY-POLICY-27-001 ships schema/digests. | Depends on 27-001 | AGDP0101 | | DEPLOY-VEX-30-001 | DOING (dev-mock 2025-12-06) | 2025-12-06 | SPRINT_0502_0001_0001_ops_deployment_ii | Deployment + VEX Lens Guild | ops/deployment | Provide Helm/Compose overlays, scaling defaults, and offline kit instructions for VEX Lens service. | Wait for CCWO0101 schema | DVPL0101 | | DEPLOY-VEX-30-002 | DOING (dev-mock 2025-12-06) | 2025-12-06 | SPRINT_0502_0001_0001_ops_deployment_ii | Deployment Guild | ops/deployment | Package Issuer Directory deployment manifests, backups, and security hardening guidance. Dependencies: DEPLOY-VEX-30-001. | Depends on #5 | DVPL0101 | | DEPLOY-VULN-29-001 | DOING (dev-mock 2025-12-06) | 2025-12-06 | SPRINT_0502_0001_0001_ops_deployment_ii | Deployment + Vuln Guild | ops/deployment | Produce Helm/Compose overlays for Findings Ledger + projector, including DB migrations, Merkle anchor jobs, and scaling guidance. | Needs CCWO0101 | DVPL0101 | diff --git a/docs/runbooks/policy-incident.md b/docs/runbooks/policy-incident.md new file mode 100644 index 000000000..804ff65d1 --- /dev/null +++ b/docs/runbooks/policy-incident.md @@ -0,0 +1,50 @@ +# Policy Publish / Incident Runbook (draft) + +Status: DRAFT — pending policy-registry overlay and production digests. Use for dev/mock exercises until policy release artefacts land. + +## Scope +- Policy Registry publish/promote workflows (canary → full rollout). +- Emergency freeze for publish endpoints. +- Evidence capture for audits and postmortems. + +## Pre-flight checks (dev vs. prod) +1) Validate manifests + - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json` + - Prod: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-stable.yaml --downloads deploy/downloads/manifest.json` + - Confirm `.gitea/workflows/release-manifest-verify.yml` is green for the target manifest change. +2) Render deployment plan (no apply yet) + - Helm: `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml` + - Compose (dev): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml` +3) Backups + - Run `deploy/compose/scripts/backup.sh` before production rollout; archive Mongo/Redis/ObjectStore snapshots to the regulated vault. + +## Canary publish → promote +1) Prepare override (temporary) + - Create `deploy/helm/stellaops/values-policy-canary.yaml` with a single replica, reduced worker counts, and an isolated ingress path for policy publish. + - Keep `mock.enabled=false`; only use real digests when available. +2) Dry-run render + - `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml` +3) Apply canary + - `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m` + - Monitor: `kubectl logs deployment/policy-registry -n stellaops --tail=200 -f` and readiness probes; rollback on errors. +4) Promote + - Remove the canary override from the release branch; rerender with `values-prod.yaml` only and redeploy. + - Update the release manifest with final policy digests and rerun `release-manifest-verify`. + +## Emergency freeze +- Hard stop publishes while keeping read access + - `kubectl scale deployment/policy-registry -n stellaops --replicas=0` + - Alternatively, apply a NetworkPolicy that blocks ingress to the publish endpoint while leaving status/read paths open. +- Manifest gate + - Remove policy entries from the target `deploy/releases/*.yaml` and rerun `.gitea/workflows/release-manifest-verify.yml` so pipelines fail closed until the issue is cleared. + +## Evidence capture +- Release artefacts: copy the exact release manifest, `/tmp/policy-canary.yaml`, and `/tmp/policy-compose.yaml` used for rollout. +- Runtime state: `kubectl get deploy,po,svc -n stellaops -l app=policy-registry -o yaml > /tmp/policy-live.yaml`. +- Logs: `kubectl logs deployment/policy-registry -n stellaops --since=1h > /tmp/policy-logs.txt`. +- Package as `tar -czf policy-incident-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/policy-*.yaml /tmp/policy-*.txt` and store in the audit bucket. + +## Open items (blockers) +- Replace mock digests with production pins in `deploy/releases/*` once provided. +- Update the canary override file with the real policy-registry chart values (service/env schema pending from DEPLOY-POLICY-27-001). +- Add Grafana/Prometheus dashboard references once policy metrics are exposed.