# Policy Publish / Incident Runbook (draft)

Status: DRAFT, pending the policy-registry overlay and production digests. Use for dev/mock exercises until policy release artefacts land.

## Scope
- Policy Registry publish/promote workflows (canary → full rollout).
- Emergency freeze for publish endpoints.
- Evidence capture for audits and postmortems.

## Pre-flight checks (dev vs. prod)
1) Validate manifests
   - Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
   - Prod: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-stable.yaml --downloads deploy/downloads/manifest.json`
   - Confirm `.gitea/workflows/release-manifest-verify.yml` is green for the target manifest change.
2) Render deployment plan (no apply yet)
   - Helm: `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml`
   - Compose (dev): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml`
3) Backups
   - Run `deploy/compose/scripts/backup.sh` before production rollout; archive Mongo/Redis/ObjectStore snapshots to the regulated vault.

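The dev and prod manifest checks in step 1 differ only in the manifest path, so they can be wrapped in one helper. The following is a minimal sketch, not something shipped in the repository: the function names `manifest_for_env` and `preflight_manifest` are hypothetical, and it assumes the commands run from the repository root.

```shell
# Map an environment name to the release manifest named in this runbook.
manifest_for_env() {
  case "$1" in
    dev)  echo "deploy/releases/2025.09-mock-dev.yaml" ;;
    prod) echo "deploy/releases/2025.09-stable.yaml" ;;
    *)    echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Validate the manifest for one environment.
# Fails before running anything if the environment name is not recognised.
preflight_manifest() {
  manifest=$(manifest_for_env "$1") || return 1
  python ops/devops/release/check_release_manifest.py "$manifest" \
    --downloads deploy/downloads/manifest.json
}
```

For example, `preflight_manifest prod` runs the prod check, while a mistyped environment name returns a non-zero status without invoking the checker.
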
## Canary publish → promote
1) Prepare override (temporary)
   - Create `deploy/helm/stellaops/values-policy-canary.yaml` with a single replica, reduced worker counts, and an isolated ingress path for policy publish.
   - Keep `mock.enabled=false`; use real digests only, once they are available.
2) Dry-run render
   - `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml`
   - `--validate` checks the rendered manifests against the cluster your kubeconfig currently targets, so select the intended context first.
3) Apply canary
   - `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m`
   - Monitor: follow `kubectl logs deployment/policy-registry -n stellaops --tail=200 -f` and watch the readiness probes; roll back on errors.
4) Promote
   - Remove the canary override from the release branch; re-render with `values-prod.yaml` only and redeploy.
   - Update the release manifest with final policy digests and rerun `release-manifest-verify`.

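`--atomic` rolls back an upgrade that fails to become ready, but it does not cover a canary that degrades after the upgrade has succeeded. As a sketch of the "roll back on errors" step, the hypothetical helper below re-checks the rollout and reverts to the previous Helm revision if the deployment is not ready. The name `canary_gate` and the 5-minute timeout are assumptions, and it assumes Helm and kubectl use the same cluster context as the upgrade.

```shell
# Hypothetical helper: re-check the canary and revert if it is not ready.
canary_gate() {
  if kubectl rollout status deployment/policy-registry -n stellaops --timeout=5m; then
    echo "canary ready"
  else
    echo "canary not ready; rolling back" >&2
    # With no revision argument, helm rollback returns to the previous release.
    helm rollback stellaops
    return 1
  fi
}
```

Running `canary_gate` at intervals during the soak period gives a single pass/fail signal; the non-zero exit status lets a pipeline stop before the promote step.
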
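## Emergency freeze
- Hard stop publishes while keeping read access
  - `kubectl scale deployment/policy-registry -n stellaops --replicas=0`. This stops every request the deployment serves, so it keeps read access only if reads are served by another component.
  - Alternatively, apply a NetworkPolicy that blocks ingress to the publish endpoint while leaving status/read paths open. NetworkPolicy rules match pods, namespaces, IP blocks, and ports, not HTTP paths, so this works only if the publish endpoint has its own port or pod; otherwise block the publish path at the ingress layer.
- Manifest gate
  - Remove policy entries from the target `deploy/releases/*.yaml` and rerun `.gitea/workflows/release-manifest-verify.yml` so pipelines fail closed until the issue is cleared.
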
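The NetworkPolicy option can be sketched as a function that prints a manifest admitting ingress only on the read/status port. Everything specific here is an assumption: the function name `freeze_policy_manifest`, the policy name, a dedicated read port that is separate from the publish port, and pods labelled `app=policy-registry` (the selector used under Evidence capture). NetworkPolicies are additive, so this has no effect if another policy already allows the publish port, and it requires a CNI plugin that enforces NetworkPolicy.

```shell
# Print a NetworkPolicy that allows ingress to policy-registry pods only on
# the given read/status port. ASSUMPTION: publish traffic uses another port.
freeze_policy_manifest() {
  read_port=${1:?usage: freeze_policy_manifest <read-port>}
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: policy-registry-publish-freeze
  namespace: stellaops
spec:
  podSelector:
    matchLabels:
      app: policy-registry
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: ${read_port}
EOF
}
```

With a hypothetical read port of 8081, `freeze_policy_manifest 8081 | kubectl apply -f -` applies the freeze, and `kubectl delete networkpolicy policy-registry-publish-freeze -n stellaops` lifts it.
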
## Evidence capture
- Release artefacts: copy the exact release manifest, `/tmp/policy-canary.yaml`, and `/tmp/policy-compose.yaml` used for rollout.
- Runtime state: `kubectl get deploy,po,svc -n stellaops -l app=policy-registry -o yaml > /tmp/policy-live.yaml`.
- Logs: `kubectl logs deployment/policy-registry -n stellaops --since=1h > /tmp/policy-logs.txt`.
- Package as `tar -czf policy-incident-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/policy-*.yaml /tmp/policy-*.txt` and store in the audit bucket. The globs match only files named `/tmp/policy-*`, so copy the release manifest under a matching name or pass its path explicitly.

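The packaging step can be made stricter by listing the evidence files explicitly, so that a missing file fails loudly instead of being silently skipped by a glob. This is a minimal sketch under stated assumptions: `package_evidence` is a hypothetical name, and the caller supplies the output directory and the file list.

```shell
# Bundle the named evidence files into a UTC-timestamped archive and print
# the archive path. Refuses to run if any listed file is missing.
package_evidence() {
  out_dir=$1; shift
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  archive="$out_dir/policy-incident-$stamp.tar.gz"
  for f in "$@"; do
    [ -f "$f" ] || { echo "missing evidence file: $f" >&2; return 1; }
  done
  tar -czf "$archive" "$@" && echo "$archive"
}
```

A call such as `package_evidence . deploy/releases/2025.09-stable.yaml /tmp/policy-canary.yaml /tmp/policy-live.yaml /tmp/policy-logs.txt` includes the release manifest alongside the rendered plan, runtime state, and logs.
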
## Open items (blockers)
- Replace mock digests with production pins in `deploy/releases/*` once provided.
- Update the canary override file with the real policy-registry chart values (service/env schema pending from DEPLOY-POLICY-27-001).
- Add Grafana/Prometheus dashboard references once policy metrics are exposed.