git.stella-ops.org/ops/devops/orchestrator/incident-response.md
StellaOps Bot 9f6e6f7fb3
2025-11-25 22:09:44 +02:00

Orchestrator Incident Response & GA Readiness

  • Prometheus rules: ops/devops/orchestrator/alerts.yaml (includes the burn-rate alert).
  • Dashboard: ops/devops/orchestrator/grafana/orchestrator-overview.json.
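
The burn-rate expressions in alerts.yaml are not reproduced here. As a rough illustration of what a burn-rate alert measures, the sketch below uses the common SRE definition: observed error ratio divided by the error budget (1 - SLO). The 99.9% SLO, the function name, and the sample counts are assumptions for illustration, not values taken from alerts.yaml.

```python
def burn_rate(failed: int, total: int, slo: float = 0.999) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means failures arrive exactly at the rate the SLO allows;
    higher values mean the budget runs out proportionally sooner.
    The 0.999 default is an assumed SLO, not a value from alerts.yaml.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_ratio = failed / total
    error_budget = 1.0 - slo
    return error_ratio / error_budget


# Hypothetical window: 50 failed jobs out of 10,000 against a 99.9% SLO.
rate = burn_rate(failed=50, total=10_000)
print(f"burn rate: {rate:.1f}x")
```

A sustained rate well above 1 over both a short and a long window is the usual trigger for this kind of alert; the actual windows and thresholds are whatever alerts.yaml defines.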

Runbook (by alert)

  • QueueDepthHigh / DLQDepthHigh
    • Determine the backlog cause: slow workers or a failing downstream dependency.
    • Scale workers. Snapshot the DLQ before clearing it; if the DLQ cause is transient, replay via replay-smoke.sh once the fix is deployed.
  • FailuresHigh / ErrorCluster / FailureBurnRateHigh
    • Inspect failing job type from alert labels.
    • Pause new dispatch for that job type, then ship a hotfix or roll back the offending worker image.
    • Validate with scripts/orchestrator/probe.sh, then scripts/orchestrator/smoke.sh, to confirm the infrastructure is healthy.
  • LeaseStall
    • Look for stuck locks in the Postgres locks view (for example pg_locks); force-release them or restart the worker set.
    • Confirm NATS health (via probe.sh) and worker heartbeats.
  • Backpressure
    • Temporarily increase rate-limit budgets, confirm the backlog drains, and restore the defaults once the system is stable.
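
For on-call tooling (a chat bot or an alert annotation, say), the runbook above can be expressed as a simple lookup. This is a minimal sketch: the step text is condensed from this document, while the `RUNBOOK` table, the `first_steps` helper, and the handling of an "Orchestrator" name prefix are hypothetical conveniences, not part of the repository.

```python
# First-response steps per alert, condensed from the runbook above.
_QUEUE = [
    "Determine backlog cause: slow workers vs. downstream dependency.",
    "Scale workers; snapshot the DLQ before clearing it.",
]
_FAILURES = [
    "Inspect the failing job type from the alert labels.",
    "Pause dispatch for that job type; hotfix or roll back the worker image.",
    "Validate with probe.sh, then smoke.sh.",
]
_LEASE = [
    "Look for stuck locks in Postgres; force release or restart the worker set.",
    "Confirm NATS health and worker heartbeats.",
]
_BACKPRESSURE = [
    "Temporarily raise rate-limit budgets; restore defaults once stable.",
]

RUNBOOK = {
    "QueueDepthHigh": _QUEUE,
    "DLQDepthHigh": _QUEUE,
    "FailuresHigh": _FAILURES,
    "ErrorCluster": _FAILURES,
    "FailureBurnRateHigh": _FAILURES,
    "LeaseStall": _LEASE,
    "Backpressure": _BACKPRESSURE,
}


def first_steps(alert_name: str) -> list[str]:
    """Return runbook steps for an alert name.

    Rule names in alerts.yaml may carry an "Orchestrator" prefix
    (e.g. OrchestratorFailureBurnRateHigh), so it is stripped first.
    """
    key = alert_name.removeprefix("Orchestrator")  # Python 3.9+
    return RUNBOOK.get(key, ["No runbook entry; escalate to Orchestrator on-call."])


for step in first_steps("LeaseStall"):
    print("-", step)
```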

Synthetic checks

  • scripts/orchestrator/probe.sh — psql ping, mongo ping, NATS pub/ping; writes out/orchestrator-probe/status.txt.
  • scripts/orchestrator/smoke.sh — end-to-end infra smoke, emits readiness.
  • scripts/orchestrator/replay-smoke.sh — restart stack then run smoke to prove restart/replay works.
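
When validating after an incident, the probe is run before the smoke test. The sketch below wraps that ordering and stops at the first failure. It assumes the scripts signal failure with a non-zero exit code, which this document does not state; `run_checks` and `DEFAULT_CHECKS` are hypothetical names.

```python
import subprocess
from typing import Optional, Sequence, Tuple

Check = Tuple[str, Sequence[str]]

# Order taken from the runbook: probe first, then the end-to-end smoke.
DEFAULT_CHECKS: Sequence[Check] = (
    ("probe", ["scripts/orchestrator/probe.sh"]),
    ("smoke", ["scripts/orchestrator/smoke.sh"]),
)


def run_checks(checks: Sequence[Check] = DEFAULT_CHECKS) -> Optional[str]:
    """Run each check in order; return the first failing name, or None.

    Assumption: a check fails when its exit code is non-zero. A script
    that cannot be started at all is also treated as a failure.
    """
    for name, argv in checks:
        try:
            result = subprocess.run(list(argv), check=False)
        except OSError:
            return name  # missing or non-executable script
        if result.returncode != 0:
            return name
    return None
```

Called from the repository root, `run_checks()` returns `None` when both scripts succeed, or `"probe"` / `"smoke"` to indicate where validation stopped.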

GA readiness checklist

  • Burn-rate alerting enabled in Prometheus/Alertmanager (see alerts.yaml rule OrchestratorFailureBurnRateHigh).
  • Dashboard imported and linked from the on-call rotation.
  • Synthetic probe scheduled (cron) on the CI/ops runner, publishing the status.txt artifact daily.
  • Replay smoke scheduled post-deploy to validate persistence/volumes.
  • Backup/restore for Postgres & Mongo verified weekly (not automated here).
  • NATS JetStream retention + DLQ policy reviewed and documented.
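
One way to check the "published daily" item is to verify that the probe artifact is recent. The sketch below only looks at the file's modification time, because the contents of status.txt are not specified here. The 26-hour allowance (one day plus slack for scheduler jitter) and the function name are assumptions.

```python
import time
from pathlib import Path
from typing import Optional, Union

# Path written by scripts/orchestrator/probe.sh, per the section above.
STATUS_PATH = Path("out/orchestrator-probe/status.txt")


def probe_status_is_fresh(
    path: Union[str, Path] = STATUS_PATH,
    max_age_hours: float = 26.0,  # assumed: daily schedule plus slack
    now: Optional[float] = None,
) -> bool:
    """Return True if the file exists and was modified recently enough."""
    p = Path(path)
    if not p.is_file():
        return False
    current = time.time() if now is None else now
    age_seconds = current - p.stat().st_mtime
    return age_seconds <= max_age_hours * 3600


print("probe artifact fresh:", probe_status_is_fresh())
```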

Escalation

  • Primary: Orchestrator on-call.
  • Secondary: DevOps Guild (release).
  • Page when any critical alert persists for more than 15 minutes, or when two critical alerts fire simultaneously.
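
The paging rule can be sketched as a small predicate. The 15-minute threshold and the two-criticals condition come from the rule above; the `FiringAlert` type and the `"critical"` severity label are assumptions about how alerts would be represented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

PERSIST_THRESHOLD = timedelta(minutes=15)


@dataclass(frozen=True)
class FiringAlert:
    name: str
    severity: str  # assumed label values, e.g. "critical" or "warning"
    firing_since: datetime


def should_page(alerts: Iterable[FiringAlert], now: datetime) -> bool:
    """Page if two or more criticals fire at once, or one persists > 15 min."""
    criticals = [a for a in alerts if a.severity == "critical"]
    if len(criticals) >= 2:
        return True
    return any(now - a.firing_since > PERSIST_THRESHOLD for a in criticals)


now = datetime(2025, 1, 1, 12, 0)
stalled = FiringAlert("LeaseStall", "critical", datetime(2025, 1, 1, 11, 40))
print(should_page([stalled], now))  # True: firing for 20 minutes
```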