# Orchestrator SLOs (DOCS-ORCH-34-005)

Last updated: 2025-11-25

## Service level objectives
- **Availability**: 99.9% monthly for WebService API per tenant.
- **Run completion**: P95 run duration < 5m for standard DAGs; failure rate <1% over 30d.
- **Queue health**: backlog < 1000 items per tenant for >95% of 5m windows.
- **Event delivery**: WebSocket/stream delivery success > 99.5% (per day).

## Error budget policy
- Window: 28 days. Burn alerts:
  - 2× burn: page on-call.
  - 14× burn: immediate mitigation (disable offending DAGs, scale workers).

## Alerts (examples)
- Availability: `probe_success{job="orchestrator-api"} < 0.999 over 1h`.
- Latency: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.5`.
- Run failures: `rate(orchestrator_runs_total{status="failed"}[30m]) / rate(orchestrator_runs_total[30m]) > 0.01`.
- Queue backlog: `orchestrator_queue_depth > 1000` for 10m.

## Dashboards
- Golden signals per service (traffic, errors, latency, saturation).
- Run outcome panel: success/fail/cancel counts, retry counts.
- Queue panel: depth, age, worker consumption rate.
- Burn-rate panel tied to error budget.

## Ownership & review
- SLOs owned by Orchestrator Guild; reviewed quarterly or when architecture changes.
- Changes must be reflected in runbook and alert rules; update manifests for offline/air-gap monitoring kits.
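
For reviewers checking alert thresholds against the error budget policy, the arithmetic can be sketched as below. This is a minimal illustration, not the Guild's alerting code: it assumes the 99.9% availability target and the 28-day window stated in this document, and the function names (`error_budget_minutes`, `burn_rate`, `days_to_exhaustion`, `action_for`) are hypothetical. Burn rate is taken here as the observed error ratio divided by the allowed error ratio, which is the usual definition but is an assumption about how the burn-rate panel is computed.

```python
# Values taken from this document; everything else is illustrative.
SLO_TARGET = 0.999       # availability objective
WINDOW_DAYS = 28         # error budget window
PAGE_BURN = 2.0          # 2x burn: page on-call
MITIGATE_BURN = 14.0     # 14x burn: immediate mitigation


def error_budget_minutes(slo: float = SLO_TARGET, window_days: int = WINDOW_DAYS) -> float:
    """Minutes of unavailability the budget allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float = SLO_TARGET) -> float:
    """Observed error ratio as a multiple of the allowed error ratio."""
    return error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: int = WINDOW_DAYS) -> float:
    """Days until a full budget is spent if this burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def action_for(rate: float) -> str:
    """Map a burn rate to the policy tier (hypothetical tier labels)."""
    if rate >= MITIGATE_BURN:
        return "mitigate"  # disable offending DAGs, scale workers
    if rate >= PAGE_BURN:
        return "page"      # page on-call
    return "ok"


if __name__ == "__main__":
    print(f"budget: {error_budget_minutes():.2f} min per {WINDOW_DAYS}d")
    for ratio in (0.0005, 0.005, 0.02):
        r = burn_rate(ratio)
        print(f"error ratio {ratio}: burn {r:.1f}x, "
              f"exhausts in {days_to_exhaustion(r):.1f}d, action={action_for(r)}")
```

Under these assumptions the 28-day budget is roughly 40.3 minutes of unavailability. A sustained 2× burn would spend it in 14 days, and a sustained 14× burn in 2 days, which is consistent with paging for the former and mitigating immediately for the latter.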