git.stella-ops.org/docs/runbooks/policy-incident.md
2025-12-06 23:30:12 +00:00

Policy Publish / Incident Runbook (draft)

Status: DRAFT — pending policy-registry overlay and production digests. Use for dev/mock exercises until policy release artefacts land.

Scope

  • Policy Registry publish/promote workflows (canary → full rollout).
  • Emergency freeze for publish endpoints.
  • Evidence capture for audits and postmortems.

Pre-flight checks (dev vs. prod)

  1. Validate manifests
    • Dev/mock: python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json
    • Prod: python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-stable.yaml --downloads deploy/downloads/manifest.json
    • Confirm .gitea/workflows/release-manifest-verify.yml is green for the target manifest change.
  2. Render deployment plan (no apply yet)
    • Helm: helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml
    • Compose (dev): USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml
  3. Backups
    • Run deploy/compose/scripts/backup.sh before production rollout; archive Mongo/Redis/ObjectStore snapshots to the regulated vault.
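
As a sketch of step 1, the manifest check could be wrapped so that the target environment is always stated explicitly. The function name preflight_manifest and its dev|prod argument are illustrative assumptions, not part of the repository; the manifest paths and the checker invocation are the ones listed above.

```shell
# Sketch only: meant to be sourced from the repository root.
# preflight_manifest and its dev|prod argument are hypothetical helpers.
preflight_manifest() {
  local env="$1" manifest
  case "$env" in
    dev)  manifest="deploy/releases/2025.09-mock-dev.yaml" ;;
    prod) manifest="deploy/releases/2025.09-stable.yaml" ;;
    *)    echo "usage: preflight_manifest dev|prod" >&2; return 2 ;;
  esac
  # Fail early on a wrong path instead of inside the checker.
  [ -f "$manifest" ] || { echo "missing manifest: $manifest" >&2; return 1; }
  python ops/devops/release/check_release_manifest.py "$manifest" \
    --downloads deploy/downloads/manifest.json
}
```

Calling preflight_manifest prod before rendering the plan keeps a mock manifest from being validated by mistake during a production rollout.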

Canary publish → promote

  1. Prepare override (temporary)
    • Create deploy/helm/stellaops/values-policy-canary.yaml with a single replica, reduced worker counts, and an isolated ingress path for policy publish.
    • Keep mock.enabled=false. The canary must use real digests only; do not substitute mock digests while production pins are unavailable.
  2. Dry-run render
    • helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml
  3. Apply canary
    • helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m
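    • Monitor: kubectl logs deployment/policy-registry -n stellaops --tail=200 -f and watch the readiness probes. --atomic rolls back a failed upgrade automatically; if errors appear after the upgrade completes, roll back manually with helm rollback stellaops.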
  4. Promote
    • Remove the canary override from the release branch; rerender with values-prod.yaml only and redeploy.
    • Update the release manifest with final policy digests and rerun release-manifest-verify.
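
The canary steps can be chained so that a failed render or an unready deployment stops the sequence. This is a minimal sketch under stated assumptions: the function name canary_publish, the three path variables, and the 5m rollout timeout are illustrative, while the helm and kubectl invocations reuse the flags shown above.

```shell
# Sketch only: meant to be sourced from the repository root.
# canary_publish and the variables below are hypothetical helpers.
CHART=./deploy/helm/stellaops
BASE_VALUES=deploy/helm/stellaops/values-prod.yaml
CANARY_VALUES=deploy/helm/stellaops/values-policy-canary.yaml

canary_publish() {
  # 1. Dry-run render. --validate needs cluster access, so a failure here
  #    stops the sequence before anything is applied.
  helm template stellaops "$CHART" -f "$BASE_VALUES" -f "$CANARY_VALUES" \
    --debug --validate > /tmp/policy-canary.yaml || return 1

  # 2. Apply. --atomic makes Helm roll back on its own if the upgrade fails.
  helm upgrade --install stellaops "$CHART" -f "$BASE_VALUES" -f "$CANARY_VALUES" \
    --atomic --timeout 10m || return 1

  # 3. Wait for readiness; roll the release back if the rollout does not settle.
  if ! kubectl rollout status deployment/policy-registry -n stellaops --timeout=5m; then
    helm rollback stellaops
    return 1
  fi
}
```

The rendered /tmp/policy-canary.yaml is left in place on purpose, since the evidence capture step expects it.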

Emergency freeze

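  • Hard stop publishes
    • kubectl scale deployment/policy-registry -n stellaops --replicas=0 (this stops every policy-registry pod, so read and status paths go down along with publish).
    • To keep read access, instead block ingress to the publish endpoint while leaving status/read paths open. A standard Kubernetes NetworkPolicy matches on pods, namespaces, IP blocks, and ports, not on HTTP paths, so this needs the publish endpoint on its own port or a rule at the ingress layer.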
  • Manifest gate
    • Remove policy entries from the target deploy/releases/*.yaml and rerun .gitea/workflows/release-manifest-verify.yml so pipelines fail closed until the issue is cleared.
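
Scaling to zero discards the previous replica count unless it is recorded first. The sketch below saves it before the freeze and restores it afterwards; STATE_FILE and both function names are illustrative assumptions, and the kubectl calls use the deployment and namespace named above.

```shell
# Sketch only: STATE_FILE and both functions are hypothetical helpers.
STATE_FILE="${STATE_FILE:-/tmp/policy-registry-replicas}"

freeze_policy_registry() {
  local replicas
  replicas=$(kubectl get deployment/policy-registry -n stellaops \
    -o jsonpath='{.spec.replicas}') || return 1
  # Do not overwrite a saved count with 0 if the freeze is run twice.
  if [ "$replicas" != "0" ]; then
    echo "$replicas" > "$STATE_FILE"
  fi
  kubectl scale deployment/policy-registry -n stellaops --replicas=0
}

unfreeze_policy_registry() {
  [ -s "$STATE_FILE" ] || { echo "no saved replica count in $STATE_FILE" >&2; return 1; }
  kubectl scale deployment/policy-registry -n stellaops --replicas="$(cat "$STATE_FILE")"
}
```

If the deployment is managed by Helm, a later helm upgrade would also reset the replica count, so the unfreeze helper is only needed when no redeploy follows the incident.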

Evidence capture

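  • Release artefacts: keep /tmp/policy-canary.yaml and /tmp/policy-compose.yaml exactly as used for rollout, and copy the exact release manifest alongside them under a name matching /tmp/policy-*.yaml (for example /tmp/policy-release-manifest.yaml) so the packaging step picks it up.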
  • Runtime state: kubectl get deploy,po,svc -n stellaops -l app=policy-registry -o yaml > /tmp/policy-live.yaml.
  • Logs: kubectl logs deployment/policy-registry -n stellaops --since=1h > /tmp/policy-logs.txt.
  • Package as tar -czf policy-incident-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/policy-*.yaml /tmp/policy-*.txt and store in the audit bucket.
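
One possible variant of the packaging step archives relative paths and adds a checksum list, which makes the bundle easier to verify later. This is a sketch, not the established procedure: the function name bundle_evidence, its two arguments, and the SHA256SUMS file are assumptions, and it relies on sha256sum from GNU coreutils being installed.

```shell
# Sketch only: bundle_evidence is a hypothetical helper.
# Usage: bundle_evidence <dir holding policy-* files> <output dir>
bundle_evidence() {
  local src_dir="$1" out_dir stamp archive
  # Resolve the output directory before changing into the source directory.
  out_dir=$(cd "$2" && pwd) || return 1
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  archive="$out_dir/policy-incident-$stamp.tar.gz"
  (
    cd "$src_dir" || exit 1
    # Checksums let an auditor confirm the files were not altered after capture.
    # An unmatched glob makes sha256sum fail, which aborts the bundle.
    sha256sum policy-*.yaml policy-*.txt > SHA256SUMS || exit 1
    tar -czf "$archive" SHA256SUMS policy-*.yaml policy-*.txt
  ) || return 1
  echo "$archive"
}
```

For the files produced above, bundle_evidence /tmp /tmp prints the archive path, which can then be uploaded to the audit bucket.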

Open items (blockers)

  • Replace mock digests with production pins in deploy/releases/* once provided.
  • Update the canary override file with the real policy-registry chart values (service/env schema pending from DEPLOY-POLICY-27-001).
  • Add Grafana/Prometheus dashboard references once policy metrics are exposed.