up

2025-11-25 22:09:44 +02:00
parent 6bee1fdcf5
commit 9f6e6f7fb3
116 changed files with 4495 additions and 730 deletions
--- a/ops/devops/orchestrator/incident-response.md
+++ b/ops/devops/orchestrator/incident-response.md
@@ -0,0 +1,37 @@
+# Orchestrator Incident Response & GA Readiness
+
+## Alert links
+- Prometheus rules: `ops/devops/orchestrator/alerts.yaml` (includes burn-rate).
+- Dashboard: `ops/devops/orchestrator/grafana/orchestrator-overview.json`.
+
+## Runbook (by alert)
+- **QueueDepthHigh / DLQDepthHigh**
+  - Check backlog cause: slow workers vs. downstream dependency.
+  - Scale workers and clear the DLQ only after taking a snapshot; if the DLQ cause was transient, replay via `scripts/orchestrator/replay-smoke.sh` once fixes are in.
+- **FailuresHigh / ErrorCluster / FailureBurnRateHigh**
+  - Inspect failing job type from alert labels.
+  - Pause new dispatch for the job type; ship a hotfix or roll back the offending worker image.
+  - Validate with `scripts/orchestrator/probe.sh`, then `scripts/orchestrator/smoke.sh`, to ensure infra is healthy.
+- **LeaseStall**
+  - Look for stuck locks in the Postgres `locks` view; force-release them or restart the worker set.
+  - Confirm NATS health (via `scripts/orchestrator/probe.sh`) and worker heartbeats.
+- **Backpressure**
+  - Increase rate-limit budgets temporarily; confirm the backlog drains; restore the defaults once the system is stable.
+
+## Synthetic checks
+- `scripts/orchestrator/probe.sh` — psql ping, mongo ping, NATS pub/ping; writes `out/orchestrator-probe/status.txt`.
+- `scripts/orchestrator/smoke.sh` — end-to-end infra smoke, emits readiness.
+- `scripts/orchestrator/replay-smoke.sh` — restart stack then run smoke to prove restart/replay works.
+
+## GA readiness checklist
+- [ ] Burn-rate alerting enabled in Prometheus/Alertmanager (see `alerts.yaml` rule `OrchestratorFailureBurnRateHigh`).
+- [ ] Dashboard imported into Grafana and linked from the on-call rotation.
+- [ ] Synthetic probe scheduled as a daily cron job on the CI/ops runner, publishing `status.txt` as an artifact.
+- [ ] Replay smoke scheduled post-deploy to validate persistence/volumes.
+- [ ] Backup/restore for Postgres & Mongo verified weekly (not automated here).
+- [ ] NATS JetStream retention + DLQ policy reviewed and documented.
+
+## Escalation
+- Primary: Orchestrator on-call.
+- Secondary: DevOps Guild (release).
+- Page when any critical alert persists for more than 15 minutes or when two critical alerts fire at the same time.
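
The paging rule in the Escalation section could be expressed as a small predicate. The following is a minimal sketch, not part of the commit: `FiringAlert`, `should_page`, and the `"critical"` severity label are hypothetical names chosen for illustration, and "dual criticals" is assumed to mean two or more critical alerts firing at once.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# ">15m" from the escalation rule.
PERSIST_THRESHOLD = timedelta(minutes=15)


@dataclass(frozen=True)
class FiringAlert:
    """Hypothetical record of one currently firing alert."""
    name: str             # e.g. "LeaseStall"
    severity: str         # assumed label values: "critical", "warning"
    started_at: datetime  # when the alert began firing


def should_page(alerts: list[FiringAlert], now: datetime) -> bool:
    """Return True when the escalation rule says to page.

    Pages if two or more critical alerts fire at the same time, or if
    any single critical alert has been firing for more than 15 minutes.
    """
    critical = [a for a in alerts if a.severity == "critical"]
    if len(critical) >= 2:
        return True
    return any(now - a.started_at > PERSIST_THRESHOLD for a in critical)
```

Under these assumptions, a single critical alert that started 16 minutes ago triggers a page, while one that started exactly 15 minutes ago does not, because the rule reads "more than 15 minutes". In practice this logic would more likely live in Alertmanager routing than in application code; the sketch only makes the rule's two branches explicit.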