# Orchestrator SLOs (DOCS-ORCH-34-005)
Last updated: 2025-11-25
## Service level objectives
- **Availability**: 99.9% monthly for WebService API per tenant.
- **Run completion**: P95 run duration < 5m for standard DAGs; failure rate < 1% over 30d.
- **Queue health**: backlog < 1000 items per tenant for >95% of 5m windows.
- **Event delivery**: WebSocket/stream delivery success > 99.5% (per day).
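
The targets above translate into concrete budgets. The following is a minimal sketch, not part of the Orchestrator codebase: the function names are hypothetical, "monthly" is assumed to mean a 30-day window, and the queue check assumes one maximum-depth reading per 5m window.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time for an availability target over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return timedelta(seconds=window.total_seconds() * (1.0 - slo_target))


def queue_health_ok(window_max_depths, limit=1000, required_fraction=0.95):
    """True if more than required_fraction of windows stayed below limit.

    window_max_depths: one maximum backlog reading per 5m window (assumed).
    """
    depths = list(window_max_depths)
    if not depths:
        raise ValueError("at least one window is required")
    healthy = sum(1 for depth in depths if depth < limit)
    return healthy / len(depths) > required_fraction


if __name__ == "__main__":
    # 99.9% over an assumed 30-day month leaves roughly 43 minutes of budget.
    print(error_budget(0.999, timedelta(days=30)))
    # 96 of 100 windows below the limit satisfies the >95% objective.
    print(queue_health_ok([10] * 96 + [2000] * 4))
```

Under these assumptions, the 99.9% availability target leaves about 43 minutes of downtime per tenant per 30-day month.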
## Error budget policy
- Window: 28 days. Burn alerts:
  - 2× burn: page on-call.
  - 14× burn: immediate mitigation (disable offending DAGs, scale workers).
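
As a rough illustration of the burn thresholds above, the sketch below computes a burn rate as the observed error ratio divided by the error budget ratio, then maps it to the two actions. The function names and the returned labels are hypothetical, and the document does not specify which lookback window each threshold is evaluated over, so that choice is left to the caller.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the budgeted error ratio (1 - SLO)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic, so no budget consumed
    return (bad_events / total_events) / (1.0 - slo_target)


def burn_action(rate: float) -> str:
    """Map a burn rate to the policy thresholds (labels are illustrative)."""
    if rate >= 14.0:
        return "mitigate"  # disable offending DAGs, scale workers
    if rate >= 2.0:
        return "page"  # page on-call
    return "ok"


def days_to_exhaustion(rate: float, window_days: float = 28.0) -> float:
    """Days until the whole budget is spent if the burn rate stays constant."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return window_days / rate


if __name__ == "__main__":
    # 30 failed runs out of 1000 against a 99% success target: about 3x burn.
    rate = burn_rate(30, 1000, 0.99)
    print(round(rate, 2), burn_action(rate))
    print(days_to_exhaustion(14.0))  # a sustained 14x burn spends 28 days of budget in 2 days
```

A sustained 2× burn would exhaust the 28-day budget in 14 days, and a sustained 14× burn in 2 days, which is why the higher threshold calls for immediate mitigation rather than a page alone.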
## Alerts (examples)
- Availability: `avg_over_time(probe_success{job="orchestrator-api"}[1h]) < 0.999`.
- Latency: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.5`.
- Run failures: `sum(rate(orchestrator_runs_total{status="failed"}[30m])) / sum(rate(orchestrator_runs_total[30m])) > 0.01`.
- Queue backlog: `orchestrator_queue_depth > 1000` for 10m.
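
The "for 10m" qualifier on the queue backlog alert means the condition has to hold continuously before the alert fires. The sketch below approximates that behaviour over a list of samples; it is an illustration only, `sustained_breach` is a hypothetical helper, and it does not model missing samples or evaluation intervals the way a real alerting engine would.

```python
def sustained_breach(samples, threshold=1000, hold_seconds=600):
    """True if depth stayed above threshold for at least hold_seconds,
    measured up to the most recent sample.

    samples: iterable of (unix_timestamp, queue_depth), sorted by time.
    """
    breach_start = None
    last_ts = None
    for ts, depth in samples:
        if depth > threshold:
            if breach_start is None:
                breach_start = ts  # breach begins here
        else:
            breach_start = None  # any recovery resets the timer
        last_ts = ts
    return breach_start is not None and last_ts - breach_start >= hold_seconds


if __name__ == "__main__":
    # One sample per minute for 10 minutes, backlog always above 1000.
    steady = [(t * 60, 1500) for t in range(11)]
    print(sustained_breach(steady))

    # Same series with a brief recovery at minute 5: the timer restarts.
    dipped = [(ts, 900 if ts == 300 else depth) for ts, depth in steady]
    print(sustained_breach(dipped))
```

Resetting the timer on any recovery keeps short spikes from paging, at the cost of delaying detection when the backlog hovers around the threshold.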
## Dashboards
- Golden signals per service (traffic, errors, latency, saturation).
- Run outcome panel: success/fail/cancel counts, retry counts.
- Queue panel: depth, age, worker consumption rate.
- Burn-rate panel tied to error budget.
## Ownership & review
- SLOs owned by Orchestrator Guild; reviewed quarterly or when architecture changes.
- Changes must be reflected in runbook and alert rules; update manifests for offline/air-gap monitoring kits.