Orchestrator SLOs (DOCS-ORCH-34-005)

Last updated: 2025-11-25

Service level objectives

  • Availability: 99.9% monthly for WebService API per tenant.
  • Run completion: P95 run duration < 5m for standard DAGs; failure rate < 1% over 30d.
  • Queue health: backlog < 1000 items per tenant for >95% of 5m windows.
  • Event delivery: WebSocket/stream delivery success > 99.5% (per day).
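
The targets above translate into concrete budgets. The following is a minimal Python sketch, not part of the Orchestrator codebase: the function names are hypothetical, "monthly" is assumed to mean a 30-day month, and the queue-health check assumes one depth sample (for example the maximum) per 5m window.

```python
from typing import Sequence


def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of unavailability that an availability target permits.

    Assumes a 30-day month by default; the source only says "monthly".
    """
    return (1.0 - target) * window_days * 24 * 60


def queue_health_met(window_depths: Sequence[int],
                     limit: int = 1000,
                     target: float = 0.95) -> bool:
    """True when strictly more than `target` of the windows stayed below `limit`.

    `window_depths` holds one backlog value per 5m window for one tenant
    (an assumed representation, e.g. the maximum depth seen in the window).
    """
    if not window_depths:
        raise ValueError("at least one window is required")
    good = sum(1 for depth in window_depths if depth < limit)
    return good / len(window_depths) > target
```

Under these assumptions, a 99.9% availability target leaves roughly 43 minutes of downtime per 30 days, and a day of 288 five-minute windows tolerates at most 14 windows at or above the backlog limit.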

Error budget policy

  • Window: 28 days. Burn alerts:
    • 2× burn: page on-call.
    • 14× burn: immediate mitigation (disable offending DAGs, scale workers).
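
To make the burn multipliers concrete, here is a hedged Python sketch of the usual burn-rate arithmetic: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The function names are hypothetical, the 99.9% target is assumed to be the availability SLO, and the thresholds are assumed to be inclusive (a burn of exactly 2× pages).

```python
from fractions import Fraction
from typing import Optional

WINDOW_DAYS = 28      # error budget window from the policy
PAGE_BURN = 2         # 2x burn: page on-call
MITIGATE_BURN = 14    # 14x burn: immediate mitigation


def burn_rate(bad_events: int, total_events: int,
              slo_target: str = "0.999") -> Fraction:
    """Observed error ratio divided by the error budget.

    The target is passed as a string so Fraction keeps it exact
    and threshold comparisons are not affected by float rounding.
    """
    if total_events == 0:
        return Fraction(0)
    error_budget = 1 - Fraction(slo_target)
    return Fraction(bad_events, total_events) / error_budget


def days_to_exhaustion(rate: Fraction,
                       window_days: int = WINDOW_DAYS) -> Optional[float]:
    """Days until the whole budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return None
    return float(Fraction(window_days) / rate)


def action_for(rate: Fraction) -> str:
    """Map a burn rate onto the policy tiers (inclusive thresholds assumed)."""
    if rate >= MITIGATE_BURN:
        return "mitigate"
    if rate >= PAGE_BURN:
        return "page"
    return "ok"
```

With a 28-day window, a sustained 2× burn would exhaust the budget in 14 days and a sustained 14× burn in 2 days, which is one way to read the gap in urgency between the two tiers.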

Alerts (examples)

  • Availability: avg_over_time(probe_success{job="orchestrator-api"}[1h]) < 0.999.
  • Latency: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.5 (P95 above 0.5s per route).
  • Run failures: sum(rate(orchestrator_runs_total{status="failed"}[30m])) / sum(rate(orchestrator_runs_total[30m])) > 0.01.
  • Queue backlog: orchestrator_queue_depth > 1000 for 10m.
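
The "for 10m" qualifier on the queue backlog example means the condition must hold continuously before the alert fires; a single dip below the threshold restarts the timer. The sketch below illustrates that behaviour in Python under simplifying assumptions: `first_firing_time` is a hypothetical helper, samples are `(timestamp_seconds, depth)` pairs in time order, and evaluation happens only at sample times. It approximates rather than reproduces the alerting engine's own logic.

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(samples: Iterable[Tuple[float, float]],
                      threshold: float = 1000,
                      hold_seconds: float = 600) -> Optional[float]:
    """Return the first timestamp at which the backlog alert would fire.

    The alert fires once depth has stayed above `threshold` for at least
    `hold_seconds` without interruption; returns None if that never happens.
    """
    breach_start: Optional[float] = None
    for timestamp, depth in samples:
        if depth > threshold:
            if breach_start is None:
                breach_start = timestamp  # condition just became true
            if timestamp - breach_start >= hold_seconds:
                return timestamp
        else:
            breach_start = None  # any recovery resets the hold timer
    return None
```

For example, a backlog of 1500 sampled every 60 seconds fires at the 600-second mark, while the same series with one sample of 900 in the middle does not fire until a fresh uninterrupted 10 minutes has elapsed.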

Dashboards

  • Golden signals per service (traffic, errors, latency, saturation).
  • Run outcome panel: success/fail/cancel counts, retry counts.
  • Queue panel: depth, age, worker consumption rate.
  • Burn-rate panel tied to error budget.

Ownership & review

  • SLOs are owned by the Orchestrator Guild and reviewed quarterly or whenever the architecture changes.
  • Changes must be reflected in the runbook and alert rules; also update the manifests for offline/air-gap monitoring kits.