up
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Signals CI & Image / signals-ci (push) Has been cancelled
Policy Lint & Smoke / policy-lint (push) Has been cancelled
Policy Simulation / policy-simulate (push) Has been cancelled
SDK Publish & Sign / sdk-publish (push) Has been cancelled
AOC Guard CI / aoc-guard (push) Has been cancelled
AOC Guard CI / aoc-verify (push) Has been cancelled
Concelier Attestation Tests / attestation-tests (push) Has been cancelled
devportal-offline / build-offline (push) Has been cancelled
37
ops/devops/orchestrator/incident-response.md
Normal file
@@ -0,0 +1,37 @@
# Orchestrator Incident Response & GA Readiness

## Alert links
- Prometheus rules: `ops/devops/orchestrator/alerts.yaml` (includes the burn-rate rule).
- Dashboard: `ops/devops/orchestrator/grafana/orchestrator-overview.json`.

## Runbook (by alert)
- **QueueDepthHigh / DLQDepthHigh**
  - Identify the cause of the backlog: slow workers or a degraded downstream dependency.
  - Scale up workers. Snapshot the DLQ before clearing it; if the DLQ cause is transient, replay via `replay-smoke.sh` once fixes are in place.
- **FailuresHigh / ErrorCluster / FailureBurnRateHigh**
  - Identify the failing job type from the alert labels.
  - Pause new dispatch for that job type, then ship a hotfix or roll back the offending worker image.
  - Validate with `scripts/orchestrator/probe.sh`, then `smoke.sh`, to confirm the infrastructure is healthy.
- **LeaseStall**
  - Look for stuck locks in the Postgres `locks` view; force-release them or restart the worker set.
  - Confirm NATS health (via the probe) and worker heartbeats.
- **Backpressure**
  - Temporarily increase rate-limit budgets, confirm the backlog drains, and restore the defaults once the system is stable.

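For responders unfamiliar with the term behind `FailureBurnRateHigh`: a burn rate is the observed failure ratio divided by the error budget (1 minus the SLO target). The sketch below shows that arithmetic only. The actual expression lives in `alerts.yaml` and is not reproduced here; the function names, the SLO target, and the threshold in the usage note are hypothetical.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget would be spent exactly at the end of
    the SLO period; higher values mean it runs out sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        return 0.0  # no jobs in the window, so nothing was burned
    error_budget = 1.0 - slo_target  # allowed failure ratio
    return (failed / total) / error_budget


def burn_rate_alert(long_window: float, short_window: float, threshold: float) -> bool:
    """Multiwindow check: fire only if both windows exceed the threshold.

    Requiring the short window as well lets the alert clear quickly once
    the failures stop. The threshold is an assumption, not taken from
    alerts.yaml.
    """
    return long_window > threshold and short_window > threshold
```

With a hypothetical 99% SLO, 50 failed jobs out of 1000 give a burn rate of roughly 5, meaning the budget is being spent about five times faster than the SLO allows.
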
## Synthetic checks
- `scripts/orchestrator/probe.sh`: psql ping, mongo ping, NATS pub/ping; writes `out/orchestrator-probe/status.txt`.
- `scripts/orchestrator/smoke.sh`: end-to-end infra smoke; emits readiness.
- `scripts/orchestrator/replay-smoke.sh`: restarts the stack, then runs the smoke test to prove restart/replay works.

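Since the probe writes `out/orchestrator-probe/status.txt` and the GA checklist expects that artifact daily, a freshness check on the file is one way to notice a probe that has silently stopped running. This is a minimal sketch: the function name and the 24-hour default are assumptions, and it deliberately does not parse the file because its format is not documented here.

```python
import os
import time
from typing import Optional

ONE_DAY_SECONDS = 24 * 60 * 60  # assumed from the "daily" checklist item


def probe_status_is_fresh(
    path: str,
    max_age_seconds: float = ONE_DAY_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """Return True if the file exists, is non-empty, and was modified recently.

    Only file metadata is inspected; the contents of status.txt are not
    interpreted.
    """
    try:
        stat = os.stat(path)
    except FileNotFoundError:
        return False
    if stat.st_size == 0:
        return False  # an empty artifact suggests the probe did not finish
    current = time.time() if now is None else now
    return (current - stat.st_mtime) <= max_age_seconds
```

A scheduler or CI job could call `probe_status_is_fresh("out/orchestrator-probe/status.txt")` and fail the run when it returns `False`.
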
## GA readiness checklist
- [ ] Burn-rate alerting enabled in Prometheus/Alertmanager (see rule `OrchestratorFailureBurnRateHigh` in `alerts.yaml`).
- [ ] Dashboard imported and linked from the on-call rotation.
- [ ] Synthetic probe scheduled (cron) on a CI/ops runner, publishing the `status.txt` artifact daily.
- [ ] Replay smoke scheduled post-deploy to validate persistence/volumes.
- [ ] Backup/restore for Postgres & Mongo verified weekly (not automated here).
- [ ] NATS JetStream retention + DLQ policy reviewed and documented.

## Escalation
- Primary: Orchestrator on-call.
- Secondary: DevOps Guild (release).
- Page when any critical alert persists for more than 15 minutes or when two critical alerts fire simultaneously.

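The paging rule above can be written as a small predicate, which may help when encoding it in routing or tooling. This is a sketch under stated assumptions: the `FiringAlert` type and `should_page` function are hypothetical, "dual criticals" is read as two or more critical alerts firing at once, and which alerts carry critical severity is not specified in this runbook.

```python
from dataclasses import dataclass
from typing import Iterable

PERSIST_THRESHOLD_SECONDS = 15 * 60  # "persists >15m" from the escalation rule


@dataclass(frozen=True)
class FiringAlert:
    """Hypothetical view of a currently firing alert."""
    name: str
    severity: str        # e.g. "critical" or "warning" (assumed labels)
    firing_since: float  # Unix timestamp when the alert started firing


def should_page(alerts: Iterable[FiringAlert], now: float) -> bool:
    """Return True when the escalation rule says to page."""
    criticals = [a for a in alerts if a.severity == "critical"]
    # Two or more criticals at the same time page immediately.
    if len(criticals) >= 2:
        return True
    # A single critical pages only after it has persisted beyond the threshold.
    return any(now - a.firing_since > PERSIST_THRESHOLD_SECONDS for a in criticals)
```

Note the strict comparison: an alert that has been firing for exactly 15 minutes does not yet page, matching the ">15m" wording.
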