CI/CD consolidation
devops/services/orchestrator-config/README.md (new file)
@@ -0,0 +1,46 @@
# Orchestrator Infra Bootstrap (DEVOPS-ORCH-32-001)

## Components
- Postgres 16 (state/config)
- Mongo 7 (job ledger history)
- NATS 2.10 JetStream (queue/bus)

Compose file: `ops/devops/orchestrator/docker-compose.orchestrator.yml`

## Quick start (offline-friendly)

```bash
# bring up infra
COMPOSE_FILE=ops/devops/orchestrator/docker-compose.orchestrator.yml docker compose up -d

# smoke check and emit connection strings
scripts/orchestrator/smoke.sh
cat out/orchestrator-smoke/readiness.txt

# synthetic probe (postgres/mongo/nats health)
scripts/orchestrator/probe.sh
cat out/orchestrator-probe/status.txt

# replay readiness (restart then smoke)
scripts/orchestrator/replay-smoke.sh
```

## Connection strings
- Postgres: `postgres://orch:orchpass@localhost:55432/orchestrator`
- Mongo: `mongodb://localhost:57017`
- NATS: `nats://localhost:5422` (host port per the compose mapping `5422:4222`)
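As a quick sanity check before wiring clients, the documented connection strings can be parsed programmatically; a minimal Python sketch (Postgres and Mongo shown, with the host ports published by the compose file):

```python
from urllib.parse import urlsplit

# parse the documented connection strings
postgres = urlsplit("postgres://orch:orchpass@localhost:55432/orchestrator")
mongo = urlsplit("mongodb://localhost:57017")

# host/port/database must match the compose port mappings
assert (postgres.hostname, postgres.port) == ("localhost", 55432)
assert postgres.username == "orch"
assert postgres.path.lstrip("/") == "orchestrator"
assert (mongo.hostname, mongo.port) == ("localhost", 57017)
```

Running this before a deploy catches copy/paste drift between the README and the compose file.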

## Observability
- Alerts: `ops/devops/orchestrator/alerts.yaml`
- Grafana dashboard: `ops/devops/orchestrator/grafana/orchestrator-overview.json`
- Expected metrics: `job_queue_depth`, `job_failures_total`, `lease_extensions_total`, `job_latency_seconds_bucket`.
- Runbook: `ops/devops/orchestrator/incident-response.md`
- Synthetic probes: `scripts/orchestrator/probe.sh` (writes `out/orchestrator-probe/status.txt`).
- Replay smoke: `scripts/orchestrator/replay-smoke.sh` (idempotent restart + smoke).
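The latency panel in the dashboard computes a p95 from the `job_latency_seconds_bucket` histogram via `histogram_quantile`. As an illustration of what that query does, here is a simplified Python re-implementation over made-up cumulative bucket counts (real Prometheus additionally handles the `+Inf` bucket and `le` label grouping):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: sorted (upper_bound, cumulative_count) pairs.
    Linearly interpolates inside the bucket containing the q-th observation,
    mirroring the core of Prometheus's histogram_quantile()."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # interpolate position of the rank within this bucket
            return lower_bound + (upper_bound - lower_bound) * (
                (rank - lower_count) / (count - lower_count)
            )
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# hypothetical cumulative counts for le=0.1s, 0.5s, 1.0s
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0)]
p95 = histogram_quantile(0.95, buckets)  # 95th observation lies in the 0.5-1.0s bucket
```

The interpolation is why wide buckets near the tail give coarse quantiles; size the `le` boundaries around the latencies you actually care about.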

## CI hook (suggested)
Add a workflow step (or a local cron job) that runs `scripts/orchestrator/smoke.sh` with `SKIP_UP=1` against the existing infra and publishes the `readiness.txt` artifact for traceability.
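Such a step could also fail the build when any service reports unhealthy. A hypothetical gate, assuming a simple `key=value` format for `readiness.txt` (the real format emitted by `smoke.sh` may differ; adjust the keys accordingly):

```python
import tempfile
from pathlib import Path

def readiness_ok(path: Path) -> bool:
    # parse key=value lines, e.g. "postgres=ready"
    status = dict(
        line.split("=", 1) for line in path.read_text().splitlines() if "=" in line
    )
    return all(status.get(svc) == "ready" for svc in ("postgres", "mongo", "nats"))

# demo with a synthetic readiness file
demo = Path(tempfile.mkdtemp()) / "readiness.txt"
demo.write_text("postgres=ready\nmongo=ready\nnats=ready\n")
ok = readiness_ok(demo)
```

In CI, exit non-zero when the gate returns `False` so the pipeline surfaces the unhealthy service instead of silently archiving the artifact.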

## Notes
- Uses fixed host ports for determinism; adjust via Compose overrides if needed.
- Data volumes: `orch_pg_data`, `orch_mongo_data` (Docker volumes).
- No external downloads beyond the base images; pin images to specific tags in the compose file.
devops/services/orchestrator-config/alerts.yaml (new file)
@@ -0,0 +1,69 @@
groups:
  - name: orchestrator-core
    rules:
      - alert: OrchestratorQueueDepthHigh
        expr: job_queue_depth > 500
        for: 10m
        labels:
          severity: warning
          service: orchestrator
        annotations:
          summary: "Queue depth high"
          description: "job_queue_depth exceeded 500 for 10m"
      - alert: OrchestratorFailuresHigh
        expr: rate(job_failures_total[5m]) > 5
        for: 5m
        labels:
          severity: critical
          service: orchestrator
        annotations:
          summary: "Job failures elevated"
          description: "Failure rate above 5/s (rate() is per-second) over the last 5m"
      - alert: OrchestratorLeaseStall
        expr: rate(lease_extensions_total[5m]) == 0 and job_queue_depth > 0
        for: 5m
        labels:
          severity: critical
          service: orchestrator
        annotations:
          summary: "Leases stalled"
          description: "No lease renewals while queue has items"
      - alert: OrchestratorDLQDepthHigh
        expr: job_dlq_depth > 10
        for: 10m
        labels:
          severity: warning
          service: orchestrator
        annotations:
          summary: "DLQ depth high"
          description: "Dead-letter queue depth above 10 for 10m"
      - alert: OrchestratorBackpressure
        expr: avg_over_time(rate_limiter_backpressure_ratio[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
          service: orchestrator
        annotations:
          summary: "Backpressure elevated"
          description: "Rate limiter backpressure >50% over 5m"
      - alert: OrchestratorErrorCluster
        expr: sum by (jobType) (rate(job_failures_total[5m])) > 3
        for: 5m
        labels:
          severity: critical
          service: orchestrator
        annotations:
          summary: "Error cluster detected"
          description: "Failure rate above 3/s for a single job type"
      - alert: OrchestratorFailureBurnRateHigh
        expr: |
          (rate(job_failures_total[5m]) / clamp_min(rate(job_processed_total[5m]), 1)) > 0.02
          and
          (rate(job_failures_total[30m]) / clamp_min(rate(job_processed_total[30m]), 1)) > 0.01
        for: 10m
        labels:
          severity: critical
          service: orchestrator
        annotations:
          summary: "Failure burn rate breaching SLO"
          description: "5m/30m failure burn rate above the 2%/1% SLO; investigate upstream jobs and dependencies."
@@ -0,0 +1,49 @@
version: "3.9"
services:
  orchestrator-postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: orch
      POSTGRES_PASSWORD: orchpass
      POSTGRES_DB: orchestrator
    volumes:
      - orch_pg_data:/var/lib/postgresql/data
    ports:
      - "55432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U orch"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  orchestrator-mongo:
    image: mongo:7
    command: ["mongod", "--quiet", "--storageEngine=wiredTiger"]
    ports:
      - "57017:27017"
    volumes:
      - orch_mongo_data:/data/db
    healthcheck:
      test: ["CMD", "mongosh", "--quiet", "--eval", "db.adminCommand('ping')"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  orchestrator-nats:
    image: nats:2.10-alpine
    ports:
      - "5422:4222"
      - "5822:8222"
    command: ["-js", "-m", "8222"]
    healthcheck:
      # the official nats image ships no `nats` CLI; probe the monitoring endpoint instead
      test: ["CMD", "wget", "-q", "-O", "-", "http://localhost:8222/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

volumes:
  orch_pg_data:
  orch_mongo_data:
@@ -0,0 +1,63 @@
{
  "schemaVersion": 39,
  "title": "Orchestrator Overview",
  "panels": [
    {
      "type": "stat",
      "title": "Queue Depth",
      "datasource": "Prometheus",
      "fieldConfig": {"defaults": {"unit": "none"}},
      "targets": [{"expr": "sum(job_queue_depth)"}]
    },
    {
      "type": "timeseries",
      "title": "Queue Depth by Job Type",
      "datasource": "Prometheus",
      "targets": [{"expr": "job_queue_depth"}],
      "fieldConfig": {"defaults": {"unit": "none"}}
    },
    {
      "type": "timeseries",
      "title": "Failures per second",
      "datasource": "Prometheus",
      "targets": [{"expr": "rate(job_failures_total[5m])"}],
      "fieldConfig": {"defaults": {"unit": "short"}}
    },
    {
      "type": "timeseries",
      "title": "Leases per second",
      "datasource": "Prometheus",
      "targets": [{"expr": "rate(lease_extensions_total[5m])"}],
      "fieldConfig": {"defaults": {"unit": "ops"}}
    },
    {
      "type": "timeseries",
      "title": "Job latency p95",
      "datasource": "Prometheus",
      "targets": [{"expr": "histogram_quantile(0.95, sum(rate(job_latency_seconds_bucket[5m])) by (le))"}],
      "fieldConfig": {"defaults": {"unit": "s"}}
    },
    {
      "type": "timeseries",
      "title": "DLQ depth",
      "datasource": "Prometheus",
      "targets": [{"expr": "job_dlq_depth"}],
      "fieldConfig": {"defaults": {"unit": "none"}}
    },
    {
      "type": "timeseries",
      "title": "Backpressure ratio",
      "datasource": "Prometheus",
      "targets": [{"expr": "rate_limiter_backpressure_ratio"}],
      "fieldConfig": {"defaults": {"unit": "percentunit"}}
    },
    {
      "type": "timeseries",
      "title": "Failures by job type",
      "datasource": "Prometheus",
      "targets": [{"expr": "sum by (jobType) (rate(job_failures_total[5m]))"}],
      "fieldConfig": {"defaults": {"unit": "short"}}
    }
  ],
  "time": {"from": "now-6h", "to": "now"}
}
devops/services/orchestrator-config/incident-response.md (new file)
@@ -0,0 +1,37 @@
# Orchestrator Incident Response & GA Readiness

## Alert links
- Prometheus rules: `ops/devops/orchestrator/alerts.yaml` (includes the burn-rate rule).
- Dashboard: `ops/devops/orchestrator/grafana/orchestrator-overview.json`.

## Runbook (by alert)
- **QueueDepthHigh / DLQDepthHigh**
  - Identify the backlog cause: slow workers vs. a downstream dependency.
  - Scale workers and clear the DLQ after taking a snapshot; if the DLQ cause was transient, replay via `replay-smoke.sh` once fixes land.
- **FailuresHigh / ErrorCluster / FailureBurnRateHigh**
  - Inspect the failing job type from the alert labels.
  - Pause new dispatch for that job type; ship a hotfix or roll back the offending worker image.
  - Validate with `scripts/orchestrator/probe.sh`, then `smoke.sh`, to confirm the infra is healthy.
- **LeaseStall**
  - Look for stuck locks in the Postgres `locks` view; force-release them or restart the worker set.
  - Confirm NATS health (probe) and worker heartbeats.
- **Backpressure**
  - Increase rate-limit budgets temporarily; ensure the backlog drains; restore defaults once stable.

## Synthetic checks
- `scripts/orchestrator/probe.sh` — psql ping, mongo ping, NATS pub/ping; writes `out/orchestrator-probe/status.txt`.
- `scripts/orchestrator/smoke.sh` — end-to-end infra smoke; emits readiness.
- `scripts/orchestrator/replay-smoke.sh` — restarts the stack, then runs smoke to prove restart/replay works.

## GA readiness checklist
- [ ] Burn-rate alerting enabled in Prometheus/Alertmanager (see `alerts.yaml` rule `OrchestratorFailureBurnRateHigh`).
- [ ] Dashboard imported and linked in the on-call rotation.
- [ ] Synthetic probe cron in a CI/ops runner publishing the `status.txt` artifact daily.
- [ ] Replay smoke scheduled post-deploy to validate persistence/volumes.
- [ ] Backup/restore for Postgres & Mongo verified weekly (not automated here).
- [ ] NATS JetStream retention + DLQ policy reviewed and documented.
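The two-window burn-rate rule (`OrchestratorFailureBurnRateHigh` in `alerts.yaml`) fires only when both the fast 5m window and the slow 30m window exceed their thresholds, which suppresses short blips while still catching sustained SLO burn. A minimal Python sketch of that logic (the rates below are illustrative per-second values, not real metrics):

```python
def clamp_min(x: float, lo: float) -> float:
    """Mirror PromQL clamp_min: raise x to at least lo (avoids divide-by-zero)."""
    return max(x, lo)

def burn_rate_breach(fail_5m: float, proc_5m: float,
                     fail_30m: float, proc_30m: float,
                     fast_slo: float = 0.02, slow_slo: float = 0.01) -> bool:
    """Both the fast (5m) and slow (30m) failure ratios must exceed
    their thresholds before the alert is considered firing."""
    fast = fail_5m / clamp_min(proc_5m, 1.0)
    slow = fail_30m / clamp_min(proc_30m, 1.0)
    return fast > fast_slo and slow > slow_slo

# a short burst alone does not fire: the slow window is still within SLO
assert burn_rate_breach(0.5, 10.0, 0.05, 10.0) is False
# a sustained breach fires: both windows above the 2%/1% thresholds
assert burn_rate_breach(0.5, 10.0, 0.2, 10.0) is True
```

The `clamp_min(..., 1)` guard means the ratio is damped when throughput is near zero, so the rule stays quiet on an idle queue.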

## Escalation
- Primary: Orchestrator on-call.
- Secondary: DevOps Guild (release).
- Page when any critical alert persists for more than 15 minutes, or when two critical alerts fire simultaneously.