CI/CD consolidation

This commit is contained in:
StellaOps Bot
2025-12-26 17:32:23 +02:00
parent a866eb6277
commit c786faae84
638 changed files with 3821 additions and 181 deletions

View File

@@ -0,0 +1,46 @@
# Orchestrator Infra Bootstrap (DEVOPS-ORCH-32-001)
## Components
- Postgres 16 (state/config)
- Mongo 7 (job ledger history)
- NATS 2.10 JetStream (queue/bus)
Compose file: `ops/devops/orchestrator/docker-compose.orchestrator.yml`
## Quick start (offline-friendly)
```bash
# bring up infra
COMPOSE_FILE=ops/devops/orchestrator/docker-compose.orchestrator.yml docker compose up -d
# smoke check and emit connection strings
scripts/orchestrator/smoke.sh
cat out/orchestrator-smoke/readiness.txt
# synthetic probe (postgres/mongo/nats health)
scripts/orchestrator/probe.sh
cat out/orchestrator-probe/status.txt
# replay readiness (restart then smoke)
scripts/orchestrator/replay-smoke.sh
```
## Connection strings
- Postgres: `postgres://orch:orchpass@localhost:55432/orchestrator`
- Mongo: `mongodb://localhost:57017`
- NATS: `nats://localhost:5422` (host port per the compose mapping `5422:4222`; monitoring on `localhost:5822`)
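As a quick sanity check, the connection strings above can be parsed programmatically before wiring them into service config; a minimal Python sketch using only the standard library (the Postgres URI is the one published above):

```python
from urllib.parse import urlsplit

# Parse the Postgres connection string emitted by the smoke script.
uri = "postgres://orch:orchpass@localhost:55432/orchestrator"
parts = urlsplit(uri)

host = parts.hostname            # "localhost"
port = parts.port                # 55432
user = parts.username            # "orch"
database = parts.path.lstrip("/")  # "orchestrator"

print(host, port, user, database)
```

This kind of check catches copy/paste drift between the README and the compose port mappings.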
## Observability
- Alerts: `ops/devops/orchestrator/alerts.yaml`
- Grafana dashboard: `ops/devops/orchestrator/grafana/orchestrator-overview.json`
- Metrics expected: `job_queue_depth`, `job_failures_total`, `lease_extensions_total`, `job_latency_seconds_bucket`.
- Runbook: `ops/devops/orchestrator/incident-response.md`
- Synthetic probes: `scripts/orchestrator/probe.sh` (writes `out/orchestrator-probe/status.txt`).
- Replay smoke: `scripts/orchestrator/replay-smoke.sh` (idempotent restart + smoke).
## CI hook (suggested)
Add a workflow step (or local cron) to run `scripts/orchestrator/smoke.sh` with `SKIP_UP=1` against existing infra and publish the `readiness.txt` artifact for traceability.
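A hypothetical GitHub Actions fragment along those lines (step names and the workflow shape are illustrative, not part of this repo):

```yaml
# Illustrative workflow steps; names are placeholders.
- name: Orchestrator readiness smoke (against existing infra)
  run: SKIP_UP=1 scripts/orchestrator/smoke.sh

- name: Publish readiness artifact
  uses: actions/upload-artifact@v4
  with:
    name: orchestrator-readiness
    path: out/orchestrator-smoke/readiness.txt
```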
## Notes
- Uses fixed host ports for determinism; adjust via a Compose override file if needed.
- Data volumes: `orch_pg_data`, `orch_mongo_data` (Docker volumes).
- No external downloads beyond the base images, which are pinned to specific tags in the compose file.
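For the port adjustment mentioned in the notes, a Compose override file is enough; a sketch (file name and the remapped host port are illustrative):

```yaml
# docker-compose.override.yml — illustrative host-port remap only.
services:
  orchestrator-postgres:
    ports:
      - "15432:5432"
```

Compose merges this over the base file automatically when both sit in the project directory, so the pinned base file stays untouched.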

View File

@@ -0,0 +1,69 @@
groups:
  - name: orchestrator-core
    rules:
      - alert: OrchestratorQueueDepthHigh
        expr: job_queue_depth > 500
        for: 10m
        labels:
          severity: warning
          service: orchestrator
        annotations:
          summary: "Queue depth high"
          description: "job_queue_depth exceeded 500 for 10m"
      - alert: OrchestratorFailuresHigh
        expr: rate(job_failures_total[5m]) > 5
        for: 5m
        labels:
          severity: critical
          service: orchestrator
        annotations:
          summary: "Job failures elevated"
          description: "Failure rate above 5/s over the last 5m"
      - alert: OrchestratorLeaseStall
        expr: rate(lease_extensions_total[5m]) == 0 and job_queue_depth > 0
        for: 5m
        labels:
          severity: critical
          service: orchestrator
        annotations:
          summary: "Leases stalled"
          description: "No lease renewals while the queue has items"
      - alert: OrchestratorDLQDepthHigh
        expr: job_dlq_depth > 10
        for: 10m
        labels:
          severity: warning
          service: orchestrator
        annotations:
          summary: "DLQ depth high"
          description: "Dead-letter queue depth above 10 for 10m"
      - alert: OrchestratorBackpressure
        expr: avg_over_time(rate_limiter_backpressure_ratio[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
          service: orchestrator
        annotations:
          summary: "Backpressure elevated"
          description: "Rate limiter backpressure >50% over 5m"
      - alert: OrchestratorErrorCluster
        expr: sum by (jobType) (rate(job_failures_total[5m])) > 3
        for: 5m
        labels:
          severity: critical
          service: orchestrator
        annotations:
          summary: "Error cluster detected"
          description: "Failure rate >3/s for a single job type"
      - alert: OrchestratorFailureBurnRateHigh
        expr: |
          (rate(job_failures_total[5m]) / clamp_min(rate(job_processed_total[5m]), 1)) > 0.02
          and
          (rate(job_failures_total[30m]) / clamp_min(rate(job_processed_total[30m]), 1)) > 0.01
        for: 10m
        labels:
          severity: critical
          service: orchestrator
        annotations:
          summary: "Failure burn rate breaching SLO"
          description: "5m/30m failure burn rate above the 2%/1% SLO; investigate upstream jobs and dependencies."
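The burn-rate rule above gates on a fast (5m) and a slow (30m) window simultaneously; the arithmetic can be sketched in plain Python (the sample rates below are made up):

```python
def clamp_min(value: float, minimum: float) -> float:
    # Mirrors PromQL clamp_min: never let the denominator fall below `minimum`.
    return max(value, minimum)

def burn_rate(failures_per_s: float, processed_per_s: float) -> float:
    # Failure ratio with a clamped denominator, as in the alert expression.
    return failures_per_s / clamp_min(processed_per_s, 1.0)

def burn_rate_breached(fast: float, slow: float) -> bool:
    # Fires only when BOTH windows breach: fast > 2% and slow > 1%.
    return fast > 0.02 and slow > 0.01

# Illustrative values: 0.5 failures/s vs 10 jobs/s over 5m, etc.
fast = burn_rate(failures_per_s=0.5, processed_per_s=10.0)    # 0.05
slow = burn_rate(failures_per_s=0.15, processed_per_s=10.0)   # 0.015
print(burn_rate_breached(fast, slow))  # True
```

Requiring both windows suppresses one-off spikes (fast window only) while still catching slow, sustained degradation.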

View File

@@ -0,0 +1,49 @@
version: "3.9"
services:
  orchestrator-postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: orch
      POSTGRES_PASSWORD: orchpass
      POSTGRES_DB: orchestrator
    volumes:
      - orch_pg_data:/var/lib/postgresql/data
    ports:
      - "55432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U orch"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
  orchestrator-mongo:
    image: mongo:7
    command: ["mongod", "--quiet", "--storageEngine=wiredTiger"]
    ports:
      - "57017:27017"
    volumes:
      - orch_mongo_data:/data/db
    healthcheck:
      test: ["CMD", "mongosh", "--quiet", "--eval", "db.adminCommand('ping')"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
  orchestrator-nats:
    image: nats:2.10-alpine
    ports:
      - "5422:4222"
      - "5822:8222"
    command: ["-js", "-m", "8222"]
    healthcheck:
      # The nats-server image does not ship the `nats` CLI; probe the
      # HTTP monitoring endpoint enabled by `-m 8222` instead.
      test: ["CMD", "wget", "-qO-", "http://localhost:8222/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
volumes:
  orch_pg_data:
  orch_mongo_data:

View File

@@ -0,0 +1,63 @@
{
  "schemaVersion": 39,
  "title": "Orchestrator Overview",
  "panels": [
    {
      "type": "stat",
      "title": "Queue Depth",
      "datasource": "Prometheus",
      "fieldConfig": {"defaults": {"unit": "none"}},
      "targets": [{"expr": "sum(job_queue_depth)"}]
    },
    {
      "type": "timeseries",
      "title": "Queue Depth by Job Type",
      "datasource": "Prometheus",
      "targets": [{"expr": "job_queue_depth"}],
      "fieldConfig": {"defaults": {"unit": "none"}}
    },
    {
      "type": "timeseries",
      "title": "Failures per second",
      "datasource": "Prometheus",
      "targets": [{"expr": "rate(job_failures_total[5m])"}],
      "fieldConfig": {"defaults": {"unit": "short"}}
    },
    {
      "type": "timeseries",
      "title": "Leases per second",
      "datasource": "Prometheus",
      "targets": [{"expr": "rate(lease_extensions_total[5m])"}],
      "fieldConfig": {"defaults": {"unit": "ops"}}
    },
    {
      "type": "timeseries",
      "title": "Job latency p95",
      "datasource": "Prometheus",
      "targets": [{"expr": "histogram_quantile(0.95, sum(rate(job_latency_seconds_bucket[5m])) by (le))"}],
      "fieldConfig": {"defaults": {"unit": "s"}}
    },
    {
      "type": "timeseries",
      "title": "DLQ depth",
      "datasource": "Prometheus",
      "targets": [{"expr": "job_dlq_depth"}],
      "fieldConfig": {"defaults": {"unit": "none"}}
    },
    {
      "type": "timeseries",
      "title": "Backpressure ratio",
      "datasource": "Prometheus",
      "targets": [{"expr": "rate_limiter_backpressure_ratio"}],
      "fieldConfig": {"defaults": {"unit": "percentunit"}}
    },
    {
      "type": "timeseries",
      "title": "Failures by job type",
      "datasource": "Prometheus",
      "targets": [{"expr": "sum by (jobType) (rate(job_failures_total[5m]))"}],
      "fieldConfig": {"defaults": {"unit": "short"}}
    }
  ],
  "time": {"from": "now-6h", "to": "now"}
}
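The p95 panel leans on `histogram_quantile` over the latency buckets; the linear interpolation it performs can be approximated in Python (the bucket data below is illustrative, not taken from a real scrape):

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate Prometheus histogram_quantile.

    `buckets` are (upper_bound, cumulative_count) pairs sorted by bound,
    ending with (inf, total). The target rank is located in a bucket and
    the result is interpolated linearly within that bucket's range.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # rank fell in the +Inf bucket
            if count == prev_count:
                return le
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan

# Illustrative 5m-rate buckets for job_latency_seconds_bucket.
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

The interpolation is why bucket boundaries matter: a p95 landing in a wide bucket gets smeared across it, so the dashboard's accuracy depends on the histogram's bucket layout.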

View File

@@ -0,0 +1,37 @@
# Orchestrator Incident Response & GA Readiness
## Alert links
- Prometheus rules: `ops/devops/orchestrator/alerts.yaml` (includes burn-rate).
- Dashboard: `ops/devops/orchestrator/grafana/orchestrator-overview.json`.
## Runbook (by alert)
- **QueueDepthHigh / DLQDepthHigh**
- Check backlog cause: slow workers vs. downstream dependency.
- Scale workers + clear DLQ after snapshot; if DLQ cause is transient, replay via `replay-smoke.sh` after fixes.
- **FailuresHigh / ErrorCluster / FailureBurnRateHigh**
- Inspect failing job type from alert labels.
- Pause new dispatch for the job type; ship hotfix or rollback offending worker image.
- Validate with `scripts/orchestrator/probe.sh` then `smoke.sh` to ensure infra is healthy.
- **LeaseStall**
- Look for stuck locks in the Postgres `pg_locks` view; force release or restart the worker set.
- Confirm NATS health (probe) and worker heartbeats.
- **Backpressure**
- Increase rate-limit budgets temporarily; ensure backlog drains; restore defaults after stability.
## Synthetic checks
- `scripts/orchestrator/probe.sh` — psql ping, mongo ping, NATS pub/ping; writes `out/orchestrator-probe/status.txt`.
- `scripts/orchestrator/smoke.sh` — end-to-end infra smoke, emits readiness.
- `scripts/orchestrator/replay-smoke.sh` — restart stack then run smoke to prove restart/replay works.
## GA readiness checklist
- [ ] Burn-rate alerting enabled in Prometheus/Alertmanager (see `alerts.yaml` rule `OrchestratorFailureBurnRateHigh`).
- [ ] Dashboard imported and linked in on-call rotation.
- [ ] Synthetic probe cron in CI/ops runner publishing `status.txt` artifact daily.
- [ ] Replay smoke scheduled post-deploy to validate persistence/volumes.
- [ ] Backup/restore for Postgres & Mongo verified weekly (not automated here).
- [ ] NATS JetStream retention + DLQ policy reviewed and documented.
## Escalation
- Primary: Orchestrator on-call.
- Secondary: DevOps Guild (release).
- Page when any critical alert persists >15m or dual criticals fire simultaneously.