Orchestrator Incident Response & GA Readiness
Alert links
- Prometheus rules: ops/devops/orchestrator/alerts.yaml (includes burn-rate).
- Dashboard: ops/devops/orchestrator/grafana/orchestrator-overview.json.
Runbook (by alert)
- QueueDepthHigh / DLQDepthHigh
  - Check backlog cause: slow workers vs. downstream dependency.
  - Scale workers + clear DLQ after snapshot; if DLQ cause is transient, replay via replay-smoke.sh after fixes.
- FailuresHigh / ErrorCluster / FailureBurnRateHigh
  - Inspect failing job type from alert labels.
  - Pause new dispatch for the job type; ship hotfix or rollback offending worker image.
  - Validate with scripts/orchestrator/probe.sh then smoke.sh to ensure infra is healthy.
- LeaseStall
  - Look for stuck locks in Postgres locks view; force release or restart the worker set.
  - Confirm NATS health (probe) and worker heartbeats.
- Backpressure
  - Increase rate-limit budgets temporarily; ensure backlog drains; restore defaults after stability.
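The runbook above can be sketched as a small lookup that an on-call helper might use to print the first steps for a firing alert. This is only an illustration: the table structure and the `steps_for` helper are hypothetical, and stripping an `Orchestrator` prefix is an assumption based on the rule name OrchestratorFailureBurnRateHigh mentioned in the checklist. The step text is copied from the runbook.

```python
# Hypothetical lookup table. Alert names and steps come from the runbook;
# the data structure and helper are illustrative only.
RUNBOOK = {
    ("QueueDepthHigh", "DLQDepthHigh"): [
        "Check backlog cause: slow workers vs. downstream dependency.",
        "Scale workers and clear DLQ after snapshot; replay via replay-smoke.sh after fixes.",
    ],
    ("FailuresHigh", "ErrorCluster", "FailureBurnRateHigh"): [
        "Inspect failing job type from alert labels.",
        "Pause new dispatch for the job type; ship hotfix or rollback.",
        "Validate with probe.sh then smoke.sh.",
    ],
    ("LeaseStall",): [
        "Look for stuck locks in Postgres; force release or restart the worker set.",
        "Confirm NATS health (probe) and worker heartbeats.",
    ],
    ("Backpressure",): [
        "Increase rate-limit budgets temporarily; restore defaults after stability.",
    ],
}

# Assumption: rules in alerts.yaml may carry an "Orchestrator" prefix.
PREFIX = "Orchestrator"


def steps_for(alert_name):
    """Return the runbook steps for an alert, or an empty list if unknown."""
    name = alert_name
    if name.startswith(PREFIX):
        name = name[len(PREFIX):]
    for names, steps in RUNBOOK.items():
        if name in names:
            return steps
    return []


if __name__ == "__main__":
    for step in steps_for("OrchestratorFailureBurnRateHigh"):
        print("-", step)
```

Keeping grouped alerts under one key mirrors how the runbook lists them, so a change to one group's steps only needs to be made once.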
Synthetic checks
- scripts/orchestrator/probe.sh: psql ping, mongo ping, NATS pub/ping; writes out/orchestrator-probe/status.txt.
- scripts/orchestrator/smoke.sh: end-to-end infra smoke, emits readiness.
- scripts/orchestrator/replay-smoke.sh: restart stack then run smoke to prove restart/replay works.
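To make the probe's shape concrete, here is a minimal sketch of running a set of named checks and rendering a status file. It is not the actual probe.sh: the `name=ok` / `name=fail` line format, the function names, and the check names are all assumptions for illustration, and the real checks (psql, mongo, NATS) are replaced by plain callables.

```python
from pathlib import Path


def run_checks(checks):
    """Run each named check; a check that raises counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def format_status(results):
    """Render results as one 'name=ok' or 'name=fail' line per check.

    The line format is hypothetical; the real status.txt layout is not
    specified in this document.
    """
    return "".join(
        f"{name}={'ok' if ok else 'fail'}\n" for name, ok in results.items()
    )


def write_status(results, path):
    """Write the status file and return True only if every check passed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(format_status(results))
    return all(results.values())


if __name__ == "__main__":
    # Stand-in checks; a real probe would ping Postgres, Mongo, and NATS here.
    demo = run_checks({"postgres": lambda: True, "mongo": lambda: True})
    healthy = write_status(demo, "out/orchestrator-probe/status.txt")
    raise SystemExit(0 if healthy else 1)
```

Treating an exception as a failed check keeps one unreachable dependency from hiding the status of the others.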
GA readiness checklist
- Burn-rate alerting enabled in Prometheus/Alertmanager (see alerts.yaml rule OrchestratorFailureBurnRateHigh).
- Dashboard imported and linked in on-call rotation.
- Synthetic probe cron in CI/ops runner publishing status.txt artifact daily.
- Replay smoke scheduled post-deploy to validate persistence/volumes.
- Backup/restore for Postgres & Mongo verified weekly (not automated here).
- NATS JetStream retention + DLQ policy reviewed and documented.
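For readers unfamiliar with the burn-rate item in the checklist, the sketch below shows the usual arithmetic: burn rate is the observed failure ratio divided by the error budget (1 minus the SLO target). The SLO target, window pairing, and threshold here are assumptions for illustration; the values actually used by the rule in alerts.yaml are not stated in this document.

```python
def burn_rate(failed, total, slo_target):
    """Return how fast the error budget is being consumed.

    A value of 1.0 means failures arrive at exactly the rate the SLO allows;
    higher values mean the budget will be exhausted early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target
    return (failed / total) / error_budget


def should_alert(short_window_rate, long_window_rate, threshold):
    """Multi-window check: fire only if both windows exceed the threshold.

    Requiring both reduces noise from brief spikes (short window only) and
    from incidents that have already recovered (long window only).
    """
    return short_window_rate > threshold and long_window_rate > threshold


if __name__ == "__main__":
    # Assumed numbers: 99% SLO, 50 failed jobs out of 1000.
    rate = burn_rate(failed=50, total=1000, slo_target=0.99)
    print(f"burn rate is about {rate:.1f}x the allowed pace")
```

With the assumed 99% target, a 5% failure ratio burns the budget at roughly five times the sustainable pace.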
Escalation
- Primary: Orchestrator on-call.
- Secondary: DevOps Guild (release).
- Page when any critical alert persists for more than 15 minutes or when two critical alerts fire simultaneously.
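The paging rule above can be expressed as a small predicate. This is a sketch under assumptions: the `FiringAlert` type, the `severity` label value `"critical"`, and the `firing_for` field are hypothetical names, and which alerts count as critical is defined by the alert rules, not here.

```python
from dataclasses import dataclass
from datetime import timedelta

# From the escalation rule: page once a critical alert has persisted >15m.
PERSIST_THRESHOLD = timedelta(minutes=15)


@dataclass
class FiringAlert:
    name: str
    severity: str          # assumed label value, e.g. "critical" or "warning"
    firing_for: timedelta  # how long the alert has been firing


def should_page(alerts):
    """Page if two or more criticals fire together, or one persists >15m."""
    criticals = [a for a in alerts if a.severity == "critical"]
    if len(criticals) >= 2:
        return True
    return any(a.firing_for > PERSIST_THRESHOLD for a in criticals)


if __name__ == "__main__":
    firing = [
        FiringAlert("LeaseStall", "critical", timedelta(minutes=20)),
        FiringAlert("Backpressure", "warning", timedelta(minutes=40)),
    ]
    print("page" if should_page(firing) else "hold")
```

The comparison is strict, so an alert that has been firing for exactly 15 minutes does not yet page.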