
commit 9f6e6f7fb3 (parent 6bee1fdcf5)
Author: StellaOps Bot
Date: 2025-11-25 22:09:44 +02:00
116 changed files with 4495 additions and 730 deletions

View File

@@ -15,6 +15,13 @@ COMPOSE_FILE=ops/devops/orchestrator/docker-compose.orchestrator.yml docker comp
# smoke check and emit connection strings
scripts/orchestrator/smoke.sh
cat out/orchestrator-smoke/readiness.txt
# synthetic probe (postgres/mongo/nats health)
scripts/orchestrator/probe.sh
cat out/orchestrator-probe/status.txt
# replay readiness (restart then smoke)
scripts/orchestrator/replay-smoke.sh
```
Connection strings
@@ -26,6 +33,9 @@ Connection strings
- Alerts: `ops/devops/orchestrator/alerts.yaml`
- Grafana dashboard: `ops/devops/orchestrator/grafana/orchestrator-overview.json`
- Metrics expected: `job_queue_depth`, `job_failures_total`, `lease_extensions_total`, `job_latency_seconds_bucket` (spot-check query below).
- Runbook: `ops/devops/orchestrator/incident-response.md`
- Synthetic probes: `scripts/orchestrator/probe.sh` (writes `out/orchestrator-probe/status.txt`).
- Replay smoke: `scripts/orchestrator/replay-smoke.sh` (idempotent restart + smoke).
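To spot-check that the expected metrics are actually flowing, query Prometheus directly. A minimal sketch, assuming `PROM_URL` points at your Prometheus instance (the variable is a placeholder, not repo config):
```sh
# Instant query for queue depth; repeat with the other expected metric names.
curl -fsS "$PROM_URL/api/v1/query" --data-urlencode 'query=job_queue_depth'
```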
## CI hook (suggested)
Add a workflow step (or local cron) to run `scripts/orchestrator/smoke.sh` with `SKIP_UP=1` against existing infra and publish the `readiness.txt` artifact for traceability.
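A minimal sketch of such a hook, assuming a runner with the repo checked out; the artifact destination and timestamping are illustrative, not repo conventions:
```sh
#!/usr/bin/env bash
# Illustrative CI/cron step: reuse existing infra (SKIP_UP=1) and archive
# the readiness report with a timestamp for traceability.
set -euo pipefail
SKIP_UP=1 scripts/orchestrator/smoke.sh
ts="$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p artifacts
cp out/orchestrator-smoke/readiness.txt "artifacts/readiness-${ts}.txt"
```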

View File: ops/devops/orchestrator/alerts.yaml

@@ -28,3 +28,42 @@ groups:
    annotations:
      summary: "Leases stalled"
      description: "No lease renewals while queue has items"
  - alert: OrchestratorDLQDepthHigh
    expr: job_dlq_depth > 10
    for: 10m
    labels:
      severity: warning
      service: orchestrator
    annotations:
      summary: "DLQ depth high"
      description: "Dead-letter queue depth above 10 for 10m"
  - alert: OrchestratorBackpressure
    expr: avg_over_time(rate_limiter_backpressure_ratio[5m]) > 0.5
    for: 5m
    labels:
      severity: warning
      service: orchestrator
    annotations:
      summary: "Backpressure elevated"
      description: "Rate limiter backpressure >50% over 5m"
  - alert: OrchestratorErrorCluster
    expr: sum by (jobType) (rate(job_failures_total[5m])) > 3
    for: 5m
    labels:
      severity: critical
      service: orchestrator
    annotations:
      summary: "Error cluster detected"
      description: "Sustained failure rate above 3 failures/s for a single job type"
  - alert: OrchestratorFailureBurnRateHigh
    expr: |
      (rate(job_failures_total[5m]) / clamp_min(rate(job_processed_total[5m]), 1)) > 0.02
      and
      (rate(job_failures_total[30m]) / clamp_min(rate(job_processed_total[30m]), 1)) > 0.01
    for: 10m
    labels:
      severity: critical
      service: orchestrator
    annotations:
      summary: "Failure burn rate breaching SLO"
      description: "5m/30m failure burn rate above the 2%/1% SLO; investigate upstream jobs and dependencies."

View File: ops/devops/orchestrator/grafana/orchestrator-overview.json

@@ -36,6 +36,27 @@
"datasource": "Prometheus",
"targets": [{"expr": "histogram_quantile(0.95, sum(rate(job_latency_seconds_bucket[5m])) by (le))"}],
"fieldConfig": {"defaults": {"unit": "s"}}
},
{
"type": "timeseries",
"title": "DLQ depth",
"datasource": "Prometheus",
"targets": [{"expr": "job_dlq_depth"}],
"fieldConfig": {"defaults": {"unit": "none"}}
},
{
"type": "timeseries",
"title": "Backpressure ratio",
"datasource": "Prometheus",
"targets": [{"expr": "rate_limiter_backpressure_ratio"}],
"fieldConfig": {"defaults": {"unit": "percentunit"}}
},
{
"type": "timeseries",
"title": "Failures by job type",
"datasource": "Prometheus",
"targets": [{"expr": "rate(job_failures_total[5m])"}],
"fieldConfig": {"defaults": {"unit": "short"}}
}
],
"time": {"from": "now-6h", "to": "now"}

View File: ops/devops/orchestrator/incident-response.md

@@ -0,0 +1,37 @@
# Orchestrator Incident Response & GA Readiness
## Alert links
- Prometheus rules: `ops/devops/orchestrator/alerts.yaml` (includes burn-rate).
- Dashboard: `ops/devops/orchestrator/grafana/orchestrator-overview.json`.
## Runbook (by alert)
- **QueueDepthHigh / DLQDepthHigh**
  - Check the backlog cause: slow workers vs. a downstream dependency.
  - Scale workers and snapshot the DLQ before clearing it; if the cause was transient, replay via `replay-smoke.sh` once fixes land.
- **FailuresHigh / ErrorCluster / FailureBurnRateHigh**
  - Inspect the failing job type from the alert labels.
  - Pause new dispatch for the job type; ship a hotfix or roll back the offending worker image.
  - Validate with `scripts/orchestrator/probe.sh`, then `smoke.sh`, to confirm the infra is healthy.
- **LeaseStall**
  - Look for stuck locks in the Postgres `locks` view (see the query sketch after this list); force-release them or restart the worker set.
  - Confirm NATS health (probe) and worker heartbeats.
- **Backpressure**
  - Increase rate-limit budgets temporarily; ensure the backlog drains; restore defaults once the system is stable.
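For the LeaseStall path, a minimal stuck-lock query sketch. The exact `locks` view is deployment-specific, so this falls back to the standard PostgreSQL catalogs; `POSTGRES_DSN` is a placeholder:
```sh
# Hypothetical stuck-lock inspection; tune the age threshold to your lease TTL.
psql "$POSTGRES_DSN" -c "
  SELECT a.pid, a.state, a.query_start, l.locktype, l.mode, l.granted
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
  WHERE NOT l.granted OR a.query_start < now() - interval '15 minutes'
  ORDER BY a.query_start;"
# Force-release by terminating the holder (destructive; snapshot first):
#   psql "$POSTGRES_DSN" -c "SELECT pg_terminate_backend(<pid>);"
```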
## Synthetic checks
- `scripts/orchestrator/probe.sh` — psql ping, mongo ping, NATS pub/ping; writes `out/orchestrator-probe/status.txt`.
- `scripts/orchestrator/smoke.sh` — end-to-end infra smoke, emits readiness.
- `scripts/orchestrator/replay-smoke.sh` — restart stack then run smoke to prove restart/replay works.
## GA readiness checklist
- [ ] Burn-rate alerting enabled in Prometheus/Alertmanager (see `alerts.yaml` rule `OrchestratorFailureBurnRateHigh`).
- [ ] Dashboard imported and linked in on-call rotation.
- [ ] Synthetic probe cron in CI/ops runner publishing the `status.txt` artifact daily (crontab sketch below).
- [ ] Replay smoke scheduled post-deploy to validate persistence/volumes.
- [ ] Backup/restore for Postgres & Mongo verified weekly (not automated here).
- [ ] NATS JetStream retention + DLQ policy reviewed and documented.
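For the probe-cron item above, a minimal crontab sketch; the checkout path and artifact directory are illustrative:
```sh
# Run the synthetic probe daily at 06:00 UTC and archive the status artifact.
0 6 * * * cd /opt/stellaops && scripts/orchestrator/probe.sh && cp out/orchestrator-probe/status.txt /var/ops/artifacts/status-$(date +\%F).txt
```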
## Escalation
- Primary: Orchestrator on-call.
- Secondary: DevOps Guild (release).
- Page when any critical alert persists for more than 15 minutes, or when two or more critical alerts fire simultaneously.