# Orchestrator Incident Response & GA Readiness

## Alert links
- Prometheus rules: `ops/devops/orchestrator/alerts.yaml` (includes burn-rate).
- Dashboard: `ops/devops/orchestrator/grafana/orchestrator-overview.json`.

## Runbook (by alert)
- **QueueDepthHigh / DLQDepthHigh**
  - Check backlog cause: slow workers vs. downstream dependency.
  - Scale workers + clear DLQ after snapshot; if DLQ cause is transient, replay via `replay-smoke.sh` after fixes.
- **FailuresHigh / ErrorCluster / FailureBurnRateHigh**
  - Inspect failing job type from alert labels.
  - Pause new dispatch for the job type; ship hotfix or rollback offending worker image.
  - Validate with `scripts/orchestrator/probe.sh` then `smoke.sh` to ensure infra is healthy.
- **LeaseStall**
  - Look for stuck locks in Postgres `locks` view; force release or restart the worker set.
  - Confirm NATS health (probe) and worker heartbeats.
- **Backpressure**
  - Increase rate-limit budgets temporarily; ensure backlog drains; restore defaults after stability.

## Synthetic checks
- `scripts/orchestrator/probe.sh`: psql ping, mongo ping, NATS pub/ping; writes `out/orchestrator-probe/status.txt`.
- `scripts/orchestrator/smoke.sh`: end-to-end infra smoke, emits readiness.
- `scripts/orchestrator/replay-smoke.sh`: restart stack then run smoke to prove restart/replay works.

## GA readiness checklist
- [ ] Burn-rate alerting enabled in Prometheus/Alertmanager (see `alerts.yaml` rule `OrchestratorFailureBurnRateHigh`).
- [ ] Dashboard imported and linked in on-call rotation.
- [ ] Synthetic probe cron in CI/ops runner publishing `status.txt` artifact daily.
- [ ] Replay smoke scheduled post-deploy to validate persistence/volumes.
- [ ] Backup/restore for Postgres & Mongo verified weekly (not automated here).
- [ ] NATS JetStream retention + DLQ policy reviewed and documented.

## Escalation
- Primary: Orchestrator on-call.
- Secondary: DevOps Guild (release).
- Page when any critical alert persists for more than 15 minutes or when two critical alerts fire simultaneously.
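
The paging rule above can be read as a small decision function. The following is a minimal sketch, not part of the orchestrator tooling: `FiringAlert` and `should_page` are hypothetical names, the severities given to the sample alerts are illustrative only, and treating more than two simultaneous criticals as a page is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Threshold from the escalation rule: page once a critical has lasted more than 15 minutes.
PERSIST_THRESHOLD = timedelta(minutes=15)


@dataclass(frozen=True)
class FiringAlert:
    """Hypothetical record of a currently firing alert (not an Alertmanager type)."""
    name: str
    severity: str         # e.g. "critical" or "warning"
    started_at: datetime  # when the alert began firing


def should_page(alerts: list[FiringAlert], now: datetime) -> bool:
    """Return True when the escalation rule says to page."""
    criticals = [a for a in alerts if a.severity == "critical"]
    # Two criticals firing together page immediately
    # (assumption: three or more page as well).
    if len(criticals) >= 2:
        return True
    # A lone critical pages only after it has persisted past the threshold.
    return any(now - a.started_at > PERSIST_THRESHOLD for a in criticals)


# Sample data; the severities assigned here are illustrative, not taken from alerts.yaml.
now = datetime(2024, 1, 1, 12, 0)
lone_recent = [FiringAlert("LeaseStall", "critical", now - timedelta(minutes=5))]
lone_old = [FiringAlert("LeaseStall", "critical", now - timedelta(minutes=20))]
dual_recent = lone_recent + [
    FiringAlert("DLQDepthHigh", "critical", now - timedelta(minutes=1))
]
warning_old = [FiringAlert("Backpressure", "warning", now - timedelta(hours=2))]

print(should_page(lone_recent, now))   # single critical, 5 minutes old
print(should_page(lone_old, now))      # single critical, 20 minutes old
print(should_page(dual_recent, now))   # two criticals at once
print(should_page(warning_old, now))   # long-running warning only
```

The comparison is strict, so a critical that has been firing for exactly 15 minutes does not page yet; adjust it if the on-call policy intends "15 minutes or more".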