consolidation of some of the modules, localization fixes, product advisories work, qa work

2026-03-05 03:54:22 +02:00
parent 7bafcc3eef
commit 8e1cb9448d
3878 changed files with 72600 additions and 46861 deletions
--- a/docs/modules/jobengine/guides/orchestrator-slo.md
+++ b/docs/modules/jobengine/guides/orchestrator-slo.md
@@ -0,0 +1,30 @@
+# Orchestrator SLOs (DOCS-ORCH-34-005)
+
+Last updated: 2025-11-25
+
+## Service level objectives
+- **Availability**: 99.9% monthly for WebService API per tenant.
+- **Run completion**: P95 run duration < 5m for standard DAGs; failure rate < 1% over 30d.
+- **Queue health**: backlog < 1000 items per tenant for > 95% of 5m windows.
+- **Event delivery**: WebSocket/stream delivery success > 99.5% (per day).
+
+## Error budget policy
+- Window: 28 days. Burn alerts:
+  - 2× burn: page on-call.
+  - 14× burn: immediate mitigation (disable offending DAGs, scale workers).
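+
+As a worked example of what these multipliers imply, assume the 99.9% availability target is evaluated over the 28-day window. The target above is stated monthly, so treat the figures as approximate. The budget $B$ and the time to exhaust it at a sustained burn rate $b$ are:
+
+$$
+B = (1 - 0.999) \times 28 \times 24 \times 60\ \text{min} = 40.32\ \text{min}, \qquad T_{\text{exhaust}}(b) = \frac{28\ \text{d}}{b}
+$$
+
+At a sustained 2× burn the budget lasts 28/2 = 14 days; at 14× it lasts 28/14 = 2 days.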
+
+## Alerts (examples)
+- Availability: `avg_over_time(probe_success{job="orchestrator-api"}[1h]) < 0.999`.
+- Latency: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.5` (P95 above 0.5s on any route).
+- Run failures: `sum(rate(orchestrator_runs_total{status="failed"}[30m])) / sum(rate(orchestrator_runs_total[30m])) > 0.01`.
+- Queue backlog: `orchestrator_queue_depth > 1000` for 10m.
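+
+A minimal sketch of how these examples might look as a Prometheus alerting rule file. The group name, alert names, and `severity` labels are hypothetical, not taken from the shipped rules. The availability rule averages `probe_success` over 1h with `avg_over_time`, and the failure-ratio rule aggregates both sides with `sum()` so the numerator and denominator match despite their different `status` labels:
+
+```yaml
+groups:
+  - name: orchestrator-slo-examples        # hypothetical group name
+    rules:
+      - alert: OrchestratorAvailabilityLow  # hypothetical alert name
+        expr: avg_over_time(probe_success{job="orchestrator-api"}[1h]) < 0.999
+        labels:
+          severity: page                    # assumed severity scheme
+      - alert: OrchestratorLatencyHigh
+        expr: |
+          histogram_quantile(0.95,
+            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
+          ) > 0.5
+        labels:
+          severity: warn
+      - alert: OrchestratorRunFailuresHigh
+        expr: |
+          sum(rate(orchestrator_runs_total{status="failed"}[30m]))
+            /
+          sum(rate(orchestrator_runs_total[30m])) > 0.01
+        labels:
+          severity: page
+      - alert: OrchestratorQueueBacklog
+        expr: orchestrator_queue_depth > 1000
+        for: 10m                            # condition must hold for 10m before firing
+        labels:
+          severity: warn
+```
+
+A file in this shape can be validated with `promtool check rules <file>` before it is rolled out.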
+
+## Dashboards
+- Golden signals per service (traffic, errors, latency, saturation).
+- Run outcome panel: success/fail/cancel counts, retry counts.
+- Queue panel: depth, age, worker consumption rate.
+- Burn-rate panel tied to error budget.
+
+## Ownership & review
+- SLOs owned by Orchestrator Guild; reviewed quarterly or when architecture changes.
+- Changes must be reflected in runbook and alert rules; update manifests for offline/air-gap monitoring kits.