
commit 9f6e6f7fb3 (parent 6bee1fdcf5)
Author: StellaOps Bot
Date: 2025-11-25 22:09:44 +02:00
116 changed files with 4495 additions and 730 deletions

View File

@@ -15,6 +15,13 @@ COMPOSE_FILE=ops/devops/orchestrator/docker-compose.orchestrator.yml docker comp
# smoke check and emit connection strings
scripts/orchestrator/smoke.sh
cat out/orchestrator-smoke/readiness.txt
# synthetic probe (postgres/mongo/nats health)
scripts/orchestrator/probe.sh
cat out/orchestrator-probe/status.txt
# replay readiness (restart then smoke)
scripts/orchestrator/replay-smoke.sh
```
Connection strings
@@ -26,6 +33,9 @@ Connection strings
- Alerts: `ops/devops/orchestrator/alerts.yaml`
- Grafana dashboard: `ops/devops/orchestrator/grafana/orchestrator-overview.json`
- Metrics expected: `job_queue_depth`, `job_failures_total`, `lease_extensions_total`, `job_latency_seconds_bucket` (spot-check query below).
- Runbook: `ops/devops/orchestrator/incident-response.md`
- Synthetic probes: `scripts/orchestrator/probe.sh` (writes `out/orchestrator-probe/status.txt`).
- Replay smoke: `scripts/orchestrator/replay-smoke.sh` (idempotent restart + smoke).
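To spot-check that the expected metrics are actually flowing, query Prometheus directly. A minimal sketch, assuming `PROM_URL` points at your Prometheus instance (the variable is a placeholder, not repo config):
```sh
# Instant query for queue depth; repeat with the other expected metric names.
curl -fsS "$PROM_URL/api/v1/query" --data-urlencode 'query=job_queue_depth'
```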
## CI hook (suggested)
Add a workflow step (or local cron) to run `scripts/orchestrator/smoke.sh` with `SKIP_UP=1` against existing infra and publish the `readiness.txt` artifact for traceability.
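A minimal sketch of such a hook, assuming a runner with the repo checked out; the artifact destination and timestamping are illustrative, not repo conventions:
```sh
#!/usr/bin/env bash
# Illustrative CI/cron step: reuse existing infra (SKIP_UP=1) and archive
# the readiness report with a timestamp for traceability.
set -euo pipefail
SKIP_UP=1 scripts/orchestrator/smoke.sh
ts="$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p artifacts
cp out/orchestrator-smoke/readiness.txt "artifacts/readiness-${ts}.txt"
```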

View File: ops/devops/orchestrator/alerts.yaml

@@ -28,3 +28,42 @@ groups:
    annotations:
      summary: "Leases stalled"
      description: "No lease renewals while queue has items"
  - alert: OrchestratorDLQDepthHigh
    expr: job_dlq_depth > 10
    for: 10m
    labels:
      severity: warning
      service: orchestrator
    annotations:
      summary: "DLQ depth high"
      description: "Dead-letter queue depth above 10 for 10m"
  - alert: OrchestratorBackpressure
    expr: avg_over_time(rate_limiter_backpressure_ratio[5m]) > 0.5
    for: 5m
    labels:
      severity: warning
      service: orchestrator
    annotations:
      summary: "Backpressure elevated"
      description: "Rate limiter backpressure >50% over 5m"
  - alert: OrchestratorErrorCluster
    expr: sum by (jobType) (rate(job_failures_total[5m])) > 3
    for: 5m
    labels:
      severity: critical
      service: orchestrator
    annotations:
      summary: "Error cluster detected"
      description: "Sustained failure rate above 3 failures/s for a single job type"
  - alert: OrchestratorFailureBurnRateHigh
    expr: |
      (rate(job_failures_total[5m]) / clamp_min(rate(job_processed_total[5m]), 1)) > 0.02
      and
      (rate(job_failures_total[30m]) / clamp_min(rate(job_processed_total[30m]), 1)) > 0.01
    for: 10m
    labels:
      severity: critical
      service: orchestrator
    annotations:
      summary: "Failure burn rate breaching SLO"
      description: "5m/30m failure burn rate above the 2%/1% SLO; investigate upstream jobs and dependencies."

View File: ops/devops/orchestrator/grafana/orchestrator-overview.json

@@ -36,6 +36,27 @@
"datasource": "Prometheus",
"targets": [{"expr": "histogram_quantile(0.95, sum(rate(job_latency_seconds_bucket[5m])) by (le))"}],
"fieldConfig": {"defaults": {"unit": "s"}}
},
{
"type": "timeseries",
"title": "DLQ depth",
"datasource": "Prometheus",
"targets": [{"expr": "job_dlq_depth"}],
"fieldConfig": {"defaults": {"unit": "none"}}
},
{
"type": "timeseries",
"title": "Backpressure ratio",
"datasource": "Prometheus",
"targets": [{"expr": "rate_limiter_backpressure_ratio"}],
"fieldConfig": {"defaults": {"unit": "percentunit"}}
},
{
"type": "timeseries",
"title": "Failures by job type",
"datasource": "Prometheus",
"targets": [{"expr": "rate(job_failures_total[5m])"}],
"fieldConfig": {"defaults": {"unit": "short"}}
}
],
"time": {"from": "now-6h", "to": "now"}

View File: ops/devops/orchestrator/incident-response.md

@@ -0,0 +1,37 @@
# Orchestrator Incident Response & GA Readiness
## Alert links
- Prometheus rules: `ops/devops/orchestrator/alerts.yaml` (includes burn-rate).
- Dashboard: `ops/devops/orchestrator/grafana/orchestrator-overview.json`.
## Runbook (by alert)
- **QueueDepthHigh / DLQDepthHigh**
  - Check the backlog cause: slow workers vs. a downstream dependency.
  - Scale workers and snapshot the DLQ before clearing it; if the cause was transient, replay via `replay-smoke.sh` once fixes land.
- **FailuresHigh / ErrorCluster / FailureBurnRateHigh**
  - Inspect the failing job type from the alert labels.
  - Pause new dispatch for the job type; ship a hotfix or roll back the offending worker image.
  - Validate with `scripts/orchestrator/probe.sh`, then `smoke.sh`, to confirm the infra is healthy.
- **LeaseStall**
  - Look for stuck locks in the Postgres `locks` view (see the query sketch after this list); force-release them or restart the worker set.
  - Confirm NATS health (probe) and worker heartbeats.
- **Backpressure**
  - Increase rate-limit budgets temporarily; ensure the backlog drains; restore defaults once the system is stable.
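For the LeaseStall path, a minimal stuck-lock query sketch. The exact `locks` view is deployment-specific, so this falls back to the standard PostgreSQL catalogs; `POSTGRES_DSN` is a placeholder:
```sh
# Hypothetical stuck-lock inspection; tune the age threshold to your lease TTL.
psql "$POSTGRES_DSN" -c "
  SELECT a.pid, a.state, a.query_start, l.locktype, l.mode, l.granted
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
  WHERE NOT l.granted OR a.query_start < now() - interval '15 minutes'
  ORDER BY a.query_start;"
# Force-release by terminating the holder (destructive; snapshot first):
#   psql "$POSTGRES_DSN" -c "SELECT pg_terminate_backend(<pid>);"
```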
## Synthetic checks
- `scripts/orchestrator/probe.sh` — psql ping, mongo ping, NATS pub/ping; writes `out/orchestrator-probe/status.txt`.
- `scripts/orchestrator/smoke.sh` — end-to-end infra smoke, emits readiness.
- `scripts/orchestrator/replay-smoke.sh` — restart stack then run smoke to prove restart/replay works.
## GA readiness checklist
- [ ] Burn-rate alerting enabled in Prometheus/Alertmanager (see `alerts.yaml` rule `OrchestratorFailureBurnRateHigh`).
- [ ] Dashboard imported and linked in on-call rotation.
- [ ] Synthetic probe cron in CI/ops runner publishing the `status.txt` artifact daily (crontab sketch below).
- [ ] Replay smoke scheduled post-deploy to validate persistence/volumes.
- [ ] Backup/restore for Postgres & Mongo verified weekly (not automated here).
- [ ] NATS JetStream retention + DLQ policy reviewed and documented.
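For the probe-cron item above, a minimal crontab sketch; the checkout path and artifact directory are illustrative:
```sh
# Run the synthetic probe daily at 06:00 UTC and archive the status artifact.
0 6 * * * cd /opt/stellaops && scripts/orchestrator/probe.sh && cp out/orchestrator-probe/status.txt /var/ops/artifacts/status-$(date +\%F).txt
```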
## Escalation
- Primary: Orchestrator on-call.
- Secondary: DevOps Guild (release).
- Page when any critical alert persists for more than 15 minutes, or when two or more critical alerts fire simultaneously.