Policy Publish / Incident Runbook (draft)
Status: DRAFT — pending policy-registry overlay and production digests. Use for dev/mock exercises until policy release artefacts land.
Scope
- Policy Registry publish/promote workflows (canary → full rollout).
- Emergency freeze for publish endpoints.
- Evidence capture for audits and postmortems.
Pre-flight checks (dev vs. prod)
- Validate manifests
  - Dev/mock:
    python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json
  - Prod:
    python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-stable.yaml --downloads deploy/downloads/manifest.json
  - Confirm .gitea/workflows/release-manifest-verify.yml is green for the target manifest change.
- Render deployment plan (no apply yet)
  - Helm:
    helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml
  - Compose (dev):
    USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml
- Backups
  - Run deploy/compose/scripts/backup.sh before production rollout; archive Mongo/Redis/ObjectStore snapshots to the regulated vault.
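
The manifest validation step above can be wrapped so that the environment name selects the manifest, which avoids pasting the wrong path during an incident. This is a minimal sketch, not part of the repository: the function names are hypothetical, the paths are the ones listed above, and it assumes check_release_manifest.py signals failure through a non-zero exit code and is run from the repository root.

```python
import subprocess
import sys

# Paths as named in this runbook.
RELEASE_MANIFESTS = {
    "dev": "deploy/releases/2025.09-mock-dev.yaml",
    "prod": "deploy/releases/2025.09-stable.yaml",
}
CHECK_SCRIPT = "ops/devops/release/check_release_manifest.py"
DOWNLOADS_MANIFEST = "deploy/downloads/manifest.json"


def build_check_command(env: str) -> list[str]:
    """Return the argv that validates the release manifest for `env`."""
    if env not in RELEASE_MANIFESTS:
        raise ValueError(
            f"unknown environment {env!r}; expected one of {sorted(RELEASE_MANIFESTS)}"
        )
    return [
        sys.executable,
        CHECK_SCRIPT,
        RELEASE_MANIFESTS[env],
        "--downloads",
        DOWNLOADS_MANIFEST,
    ]


def run_preflight(env: str) -> int:
    """Run the check and return its exit code (assumed non-zero on failure)."""
    return subprocess.run(build_check_command(env), check=False).returncode
```

Calling run_preflight("prod") before rendering the deployment plan keeps the dev and prod invocations identical apart from the manifest path.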
Canary publish → promote
- Prepare override (temporary)
  - Create deploy/helm/stellaops/values-policy-canary.yaml with a single replica, reduced worker counts, and an isolated ingress path for policy publish.
  - Keep mock.enabled=false; only use real digests when available.
- Dry-run render
  helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml
- Apply canary
  helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m
  - Monitor logs and readiness probes; roll back on errors:
    kubectl logs deployment/policy-registry -n stellaops --tail=200 -f
- Promote
  - Remove the canary override from the release branch; re-render with values-prod.yaml only and redeploy.
  - Update the release manifest with final policy digests and rerun release-manifest-verify.
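
The canary and promote steps differ only in which values files are layered onto the chart. The sketch below builds both command lines from one place so the file order stays consistent; Helm gives later -f files precedence over earlier ones, which is why the canary override comes last. The function names are hypothetical, and reusing the same helm upgrade flags for the promote step is an assumption, since the runbook only says "redeploy".

```python
import shlex

RELEASE = "stellaops"
CHART = "./deploy/helm/stellaops"
PROD_VALUES = "deploy/helm/stellaops/values-prod.yaml"
CANARY_VALUES = "deploy/helm/stellaops/values-policy-canary.yaml"


def values_flags(canary: bool) -> list[str]:
    """Return -f flags in precedence order; the canary overlay, if any, goes last."""
    files = [PROD_VALUES] + ([CANARY_VALUES] if canary else [])
    flags: list[str] = []
    for path in files:
        flags += ["-f", path]
    return flags


def render_command(canary: bool) -> list[str]:
    """Dry-run render with `helm template`; nothing is applied."""
    return ["helm", "template", RELEASE, CHART, *values_flags(canary),
            "--debug", "--validate"]


def upgrade_command(canary: bool) -> list[str]:
    """`helm upgrade --install`; --atomic rolls the release back if the upgrade fails."""
    return ["helm", "upgrade", "--install", RELEASE, CHART, *values_flags(canary),
            "--atomic", "--timeout", "10m"]


def as_shell(argv: list[str]) -> str:
    """Quote an argv list so it can be pasted into a shell or an incident log."""
    return shlex.join(argv)
```

For example, as_shell(upgrade_command(canary=True)) reproduces the canary apply command above, and upgrade_command(canary=False) gives the promote variant with values-prod.yaml only.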
Emergency freeze
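- Hard stop publishes
  - Scale the registry to zero. This also takes the registry's own read and status endpoints offline, so use it only when a full stop is acceptable:
    kubectl scale deployment/policy-registry -n stellaops --replicas=0
  - To keep read access, apply a NetworkPolicy that blocks ingress to the publish endpoint while leaving status/read paths open. NetworkPolicy matches on pods, namespaces, IP blocks, and ports rather than HTTP paths, so this requires the publish endpoint to be reachable on its own port or pod.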
- Manifest gate
  - Remove policy entries from the target deploy/releases/*.yaml and rerun .gitea/workflows/release-manifest-verify.yml so pipelines fail closed until the issue is cleared.
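
The NetworkPolicy option can be sketched as a manifest generator. This is an illustration under stated assumptions, not the registry's actual configuration: it assumes read/status traffic is served on a separate port from publish (the port number passed in is hypothetical, as is the policy name), that pods carry the app=policy-registry label used elsewhere in this runbook, and that the cluster's network plugin enforces NetworkPolicy. The policy admits ingress on the read port only, so every other port on those pods, including the publish port, is denied unless another policy allows it.

```python
import json


def freeze_policy(read_port: int, namespace: str = "stellaops") -> dict:
    """Build a NetworkPolicy that admits ingress to policy-registry on `read_port` only."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            # Hypothetical name; pick one that matches local conventions.
            "name": "policy-registry-publish-freeze",
            "namespace": namespace,
        },
        "spec": {
            "podSelector": {"matchLabels": {"app": "policy-registry"}},
            "policyTypes": ["Ingress"],
            # A rule with ports but no `from` clause admits any source on that port.
            "ingress": [{"ports": [{"protocol": "TCP", "port": read_port}]}],
        },
    }


def to_manifest(policy: dict) -> str:
    """Serialize as JSON, which `kubectl apply -f -` accepts alongside YAML."""
    return json.dumps(policy, indent=2)
```

Printing to_manifest(freeze_policy(8080)) and piping it to kubectl apply -f - would apply the freeze; deleting the policy lifts it. The value 8080 is a placeholder until the service schema is available.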
Evidence capture
- Release artefacts: copy the exact release manifest, /tmp/policy-canary.yaml, and /tmp/policy-compose.yaml used for rollout.
- Runtime state:
  kubectl get deploy,po,svc -n stellaops -l app=policy-registry -o yaml > /tmp/policy-live.yaml
- Logs:
  kubectl logs deployment/policy-registry -n stellaops --since=1h > /tmp/policy-logs.txt
- Package and store in the audit bucket:
  tar -czf policy-incident-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/policy-*.yaml /tmp/policy-*.txt
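
The packaging step can also be scripted, which helps when evidence is collected from a directory other than /tmp or when the timestamp needs to be fixed for a test. The sketch below mirrors the tar command's file patterns and UTC timestamp format; the function name and parameters are hypothetical. Unlike the tar command, it stores files under their base names rather than with the tmp/ prefix, and it fails loudly if nothing matches instead of producing an empty archive.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def package_evidence(src_dir, out_dir, now=None):
    """Bundle policy-*.yaml and policy-*.txt from src_dir into a UTC-stamped tar.gz."""
    # `now` should be timezone-aware; it defaults to the current UTC time.
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

    src = Path(src_dir)
    files = sorted(src.glob("policy-*.yaml")) + sorted(src.glob("policy-*.txt"))
    if not files:
        raise FileNotFoundError(f"no policy-* evidence files found in {src}")

    archive = Path(out_dir) / f"policy-incident-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in files:
            # Store under the base name so the archive unpacks into one flat directory.
            tar.add(path, arcname=path.name)
    return archive
```

A call such as package_evidence("/tmp", ".") returns the archive path, which can then be uploaded to the audit bucket.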
Open items (blockers)
- Replace mock digests with production pins in deploy/releases/* once provided.
- Update the canary override file with the real policy-registry chart values (service/env schema pending from DEPLOY-POLICY-27-001).
- Add Grafana/Prometheus dashboard references once policy metrics are exposed.