git.stella-ops.org/docs/runbooks/policy-incident.md
2025-12-06 23:30:12 +00:00

# Policy Publish / Incident Runbook (draft)
Status: DRAFT — pending policy-registry overlay and production digests. Use for dev/mock exercises until policy release artefacts land.
## Scope
- Policy Registry publish/promote workflows (canary → full rollout).
- Emergency freeze for publish endpoints.
- Evidence capture for audits and postmortems.
## Pre-flight checks (dev vs. prod)
1) Validate manifests
   - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
   - Prod: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-stable.yaml --downloads deploy/downloads/manifest.json`
   - Confirm `.gitea/workflows/release-manifest-verify.yml` is green for the target manifest change.
2) Render deployment plan (no apply yet)
   - Helm: `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml`
   - Compose (dev): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml`
3) Backups
   - Run `deploy/compose/scripts/backup.sh` before production rollout; archive Mongo/Redis/ObjectStore snapshots to the regulated vault.
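The manifest check and Helm render above can be chained in a small wrapper so that a failed validation stops the run before anything is rendered. The following is a minimal sketch, not part of the repository: the function names, the `DRY_RUN` and `TARGET_ENV` variables, and the choice to render the Helm plan only for `prod` are assumptions made for illustration. The manifest paths and commands are the ones listed in this runbook.

```shell
#!/bin/sh
# Pre-flight sketch (hypothetical wrapper, not a repository script).
# DRY_RUN=1 (the default here) prints commands instead of running them.
set -eu

DRY_RUN="${DRY_RUN:-1}"
PLAN_OUT="${PLAN_OUT:-/tmp/policy-plan.yaml}"

# Map an environment name to the release manifest named in this runbook.
manifest_for() {
  case "$1" in
    dev)  echo "deploy/releases/2025.09-mock-dev.yaml" ;;
    prod) echo "deploy/releases/2025.09-stable.yaml" ;;
    *)    echo "unknown environment: $1" >&2; return 2 ;;
  esac
}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf 'DRY-RUN: %s\n' "$*"
  else
    "$@"
  fi
}

preflight() {
  env_name="$1"
  manifest="$(manifest_for "$env_name")"

  # Step 1: validate the manifest; set -e aborts the run if this fails.
  run python ops/devops/release/check_release_manifest.py "$manifest" \
    --downloads deploy/downloads/manifest.json

  # Step 2: render the Helm plan for prod only (dev uses the Compose path).
  if [ "$env_name" = "prod" ]; then
    set -- helm template stellaops ./deploy/helm/stellaops \
      -f deploy/helm/stellaops/values-prod.yaml \
      -f deploy/helm/stellaops/values-orchestrator.yaml
    if [ "$DRY_RUN" = "1" ]; then
      run "$@"
    else
      "$@" > "$PLAN_OUT"
    fi
  fi
}

preflight "${TARGET_ENV:-dev}"
```

Running it with `TARGET_ENV=prod DRY_RUN=0` would execute both steps for real; the default settings only print the dev manifest check.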
## Canary publish → promote
1) Prepare override (temporary)
   - Create `deploy/helm/stellaops/values-policy-canary.yaml` with a single replica, reduced worker counts, and an isolated ingress path for policy publish.
   - Keep `mock.enabled=false`; only use real digests when available.
2) Dry-run render
   - `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml`
   - `--validate` checks the rendered manifests against the cluster your current kube context points at, so this step needs cluster access.
3) Apply canary
   - `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m`
   - Monitor: `kubectl logs deployment/policy-registry -n stellaops --tail=200 -f` and readiness probes. `--atomic` rolls back automatically if the upgrade itself fails; roll back manually if errors appear after the upgrade completes.
4) Promote
   - Remove the canary override from the release branch; re-render with `values-prod.yaml` only and redeploy.
   - Update the release manifest with final policy digests and rerun `release-manifest-verify`.
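Between applying the canary and promoting it, the "monitor, then roll back on errors" decision can be made explicit with a readiness gate. This is a rough sketch under assumptions: the `canary_gate` function, the `DRY_RUN` switch, and the 5-minute timeout are hypothetical, and the Helm release is assumed to live in the current kube context's namespace because the commands in this runbook pass no `-n` flag to Helm.

```shell
#!/bin/sh
# Canary gate sketch (hypothetical helper, not a repository script).
# DRY_RUN=1 (the default here) prints commands instead of running them.
set -eu

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf 'DRY-RUN: %s\n' "$*"
  else
    "$@"
  fi
}

# Wait for the canary rollout; roll the Helm release back if it never
# becomes ready within the timeout.
canary_gate() {
  if run kubectl rollout status deployment/policy-registry -n stellaops --timeout=5m; then
    echo "canary ready: safe to continue to the promote step"
  else
    echo "canary not ready: rolling back" >&2
    # With no revision argument, helm rollback targets the previous release.
    run helm rollback stellaops
    return 1
  fi
}

canary_gate
```

The gate only covers readiness. Application-level errors in the logs still need a human check before promotion.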
## Emergency freeze
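- Hard stop publishes (keeping read access where possible)
  - `kubectl scale deployment/policy-registry -n stellaops --replicas=0` stops every policy-registry pod, so any read/status paths served by the same deployment go down with it. Capture logs first (see Evidence capture): `kubectl logs` cannot read from pods that no longer exist.
  - To keep read access, apply a NetworkPolicy that blocks ingress to the publish endpoint while leaving status/read paths open. NetworkPolicy filters by pod, namespace, IP block, and port, not by HTTP path, so this works only if publish traffic uses its own port or pod; otherwise block the publish path at the ingress layer.
- Manifest gate
  - Remove policy entries from the target `deploy/releases/*.yaml` and rerun `.gitea/workflows/release-manifest-verify.yml` so pipelines fail closed until the issue is cleared.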
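For the NetworkPolicy variant of the freeze, a manifest along the following lines could be generated and reviewed before it is applied. Treat it as a sketch under stated assumptions: the policy name, the `freeze_policy` function, the `APPLY` and `READ_PORT` variables, and port `8080` are placeholders, since the policy-registry service schema is still pending. It also assumes that publish traffic uses a different port from read/status traffic, that pods carry the `app=policy-registry` label, and that the cluster's network plugin enforces NetworkPolicy.

```shell
#!/bin/sh
# Publish-freeze sketch (hypothetical helper, not a repository script).
set -eu

# freeze_policy [READ_PORT]: print a NetworkPolicy that admits ingress to
# policy-registry pods on READ_PORT only. 8080 is a placeholder port.
freeze_policy() {
  read_port="${1:-8080}"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: policy-registry-publish-freeze
  namespace: stellaops
spec:
  podSelector:
    matchLabels:
      app: policy-registry
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: ${read_port}
EOF
}

# Print the manifest for review by default; APPLY=1 sends it to the cluster.
if [ "${APPLY:-0}" = "1" ]; then
  freeze_policy "${READ_PORT:-8080}" | kubectl apply -f -
else
  freeze_policy "${READ_PORT:-8080}"
fi
```

NetworkPolicies are additive, so any other policy that already allows the publish port would keep it reachable. Lifting the freeze would be `kubectl delete networkpolicy policy-registry-publish-freeze -n stellaops`, again using the placeholder name.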
## Evidence capture
- Release artefacts: copy the exact release manifest, `/tmp/policy-canary.yaml`, and `/tmp/policy-compose.yaml` used for rollout.
- Runtime state: `kubectl get deploy,po,svc -n stellaops -l app=policy-registry -o yaml > /tmp/policy-live.yaml`.
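- Logs: `kubectl logs deployment/policy-registry -n stellaops --since=1h > /tmp/policy-logs.txt`. With more than one replica this reads from a single pod; to cover all pods, select by label instead, for example `kubectl logs -n stellaops -l app=policy-registry --since=1h --tail=-1 --prefix > /tmp/policy-logs.txt` (assuming the `app=policy-registry` label used above).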
- Package as `tar -czf policy-incident-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/policy-*.yaml /tmp/policy-*.txt` and store in the audit bucket.
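The packaging step can also record a checksum so the bundle can be verified once it reaches the audit bucket. The sketch below is one possible shape, not an existing script: `package_evidence` and its arguments are hypothetical, and it assumes GNU `sha256sum` is available (on macOS, `shasum -a 256` would be the substitute).

```shell
#!/bin/sh
# Evidence packaging sketch (hypothetical helper, not a repository script).
set -eu

# package_evidence SRC_DIR OUT_DIR
# Bundles SRC_DIR/policy-*.yaml and SRC_DIR/policy-*.txt into a UTC-stamped
# tarball in OUT_DIR, writes a .sha256 file next to it, and prints the
# bundle path. Fails if either kind of evidence file is missing.
package_evidence() {
  src="$1"
  out="$(cd "$2" && pwd)"   # absolute path, because tar runs inside SRC_DIR
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  bundle="$out/policy-incident-$stamp.tar.gz"

  # Running tar from inside SRC_DIR keeps member names relative.
  (cd "$src" && tar -czf "$bundle" policy-*.yaml policy-*.txt)

  # Record a checksum so the archive can be verified after upload.
  (cd "$out" && sha256sum "$(basename "$bundle")" > "$(basename "$bundle").sha256")

  echo "$bundle"
}

# When called as a script with two arguments, package immediately.
if [ "$#" -eq 2 ]; then
  package_evidence "$1" "$2"
fi
```

For the files listed above, the call would be `package_evidence /tmp /tmp`. Verification after download is `sha256sum -c policy-incident-<stamp>.tar.gz.sha256`, run in the directory holding both files.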
## Open items (blockers)
- Replace mock digests with production pins in `deploy/releases/*` once provided.
- Update the canary override file with the real policy-registry chart values (service/env schema pending from DEPLOY-POLICY-27-001).
- Add Grafana/Prometheus dashboard references once policy metrics are exposed.