# SLO Burn-Rate Computation and Alert Budget Tracking
## Module
Orchestrator
## Status
VERIFIED
## Description
Computes SLO burn rates for orchestrator operations and tracks them against configurable alert budgets, enabling proactive capacity and reliability management.
## Implementation Details
- **Modules**: `src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/SloManagement/`, `src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Domain/`
- **Key Classes**:
  - `BurnRateEngine` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/SloManagement/BurnRateEngine.cs`) - computes SLO burn rate from error budget consumption over rolling windows (1h, 6h, 24h, 30d)
  - `Slo` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Domain/Slo.cs`) - SLO entity with target (e.g., 99.9%), error budget, and current burn rate
  - `SloEndpoints` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.WebService/Endpoints/SloEndpoints.cs`) - REST API for SLO queries and burn rate dashboards
  - `IncidentModeHooks` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Observability/IncidentModeHooks.cs`) - activates incident mode when burn rate exceeds alert thresholds
  - `OrchestratorGoldenSignals` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Infrastructure/Observability/OrchestratorGoldenSignals.cs`) - provides underlying error/latency data for SLO computation
  - `ScaleMetrics` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Scale/ScaleMetrics.cs`) - metrics feeding SLO saturation signals
- **Interfaces**: None (uses concrete implementations)
- **Source**: Feature matrix scan
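
For orientation, a burn rate is commonly defined as the observed error rate in a window divided by the error rate the SLO allows (1 minus the target). The following is a minimal Python sketch of that definition over the rolling windows listed above. It is not the `BurnRateEngine` implementation: the event shape (`(timestamp, succeeded)` pairs) and the function names `burn_rate` and `burn_rates` are hypothetical, and whether the engine uses this exact formula is an assumption.

```python
from datetime import datetime, timedelta

# Rolling windows named in the feature description.
WINDOWS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "24h": timedelta(hours=24),
    "30d": timedelta(days=30),
}


def burn_rate(events, now, window, target):
    """Observed error rate in the window divided by the allowed error rate.

    events: iterable of (timestamp, succeeded) pairs (hypothetical shape).
    target: SLO target as a fraction, e.g. 0.999 for 99.9%.
    """
    allowed = 1.0 - target
    if allowed <= 0:
        raise ValueError("target must be below 1.0 to leave an error budget")
    start = now - window
    outcomes = [ok for ts, ok in events if start < ts <= now]
    if not outcomes:
        return 0.0  # no traffic in the window, so nothing was burned
    error_rate = outcomes.count(False) / len(outcomes)
    return error_rate / allowed


def burn_rates(events, now, target):
    """Burn rate for every configured window, keyed by window name."""
    return {name: burn_rate(events, now, w, target) for name, w in WINDOWS.items()}


# Usage: 10 successes in the last 10 minutes, 1 failure three hours ago.
now = datetime(2026, 2, 13, 12, 0, 0)
events = [(now - timedelta(minutes=m), True) for m in range(1, 11)]
events.append((now - timedelta(hours=3), False))
rates = burn_rates(events, now, target=0.999)
```

In this example the 1h window sees only successes and reports a burn rate of 0, while the 6h and wider windows include the failure (1 error in 11 requests against a 0.1% allowance) and report a burn rate of roughly 91. A burn rate of 1 means the budget would be consumed exactly at the end of the SLO period.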
## E2E Test Plan
- [ ] Define an `Slo` with target=99.9% and error budget=43.2 minutes/month; verify the SLO is persisted
- [ ] Generate 10 successful requests and 1 failed request, and verify that the burn rate computed by `BurnRateEngine` reflects the failure
- [ ] Verify rolling window: compute burn rates for 1h, 6h, and 24h windows via `BurnRateEngine` and verify each reflects the appropriate time range
- [ ] Exceed the alert threshold (e.g., 2x burn rate) and verify `IncidentModeHooks` triggers incident mode
- [ ] Query SLO via `SloEndpoints` and verify the response includes current burn rate, remaining budget, and alert status
- [ ] Verify budget depletion: consume the entire error budget and verify the `Slo` shows 0% remaining
- [ ] Reset the SLO period (monthly rollover) and verify the error budget resets to full
- [ ] Verify multi-SLO: define SLOs for latency and availability, verify `BurnRateEngine` computes each independently
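
The figures used in the test plan can be checked with a little arithmetic. The Python sketch below shows where the 43.2 minutes/month budget comes from (a 99.9% target over a 30-day period), how remaining budget could be expressed as a fraction, and a simple threshold check. All function names are hypothetical, the 30-day period is an assumption based on the windows listed under Implementation Details, and none of this is taken from the `Slo` or `IncidentModeHooks` source.

```python
def error_budget_minutes(target, period_days=30):
    """Minutes of allowed failure in the period for a target given as a fraction."""
    return (1.0 - target) * period_days * 24 * 60


def remaining_budget_fraction(budget_minutes, consumed_minutes):
    """Fraction of the budget still available, clamped at 0 once exhausted."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return max(0.0, 1.0 - consumed_minutes / budget_minutes)


def exceeds_alert_threshold(burn_rate, threshold=2.0):
    """True when the burn rate is strictly above the alert threshold."""
    return burn_rate > threshold


# Usage: the 99.9% target from the first test-plan item.
budget = error_budget_minutes(0.999)                    # about 43.2 minutes
untouched = remaining_budget_fraction(budget, 0.0)      # full budget
depleted = remaining_budget_fraction(budget, budget)    # fully consumed
overspent = remaining_budget_fraction(budget, budget * 2)
```

Clamping at zero keeps the "0% remaining" check in the depletion test meaningful even when consumption overshoots the budget. The strict comparison mirrors the wording "exceeds" in the alert test; whether the real threshold check is inclusive is not stated in this document.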
## Verification
- Verified on 2026-02-13 via `run-002`.
- Tier 0: Source files confirmed present on disk.
- Tier 1: `dotnet build` passed (0 errors); 1292/1292 tests passed.
- Tier 2d: `docs/qa/feature-checks/runs/jobengine/slo-burn-rate-computation-and-alert-budget-tracking/run-002/tier2-integration-check.json`