# SLO Burn-Rate Computation and Alert Budget Tracking
## Module

Orchestrator

## Status

VERIFIED

## Description

SLO burn-rate computation for orchestrator operations with configurable alert budgets, enabling proactive capacity and reliability management.

## Implementation Details

- **Modules**: `src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/SloManagement/`, `src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Domain/`
- **Key Classes**:
  - `BurnRateEngine` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/SloManagement/BurnRateEngine.cs`) - computes SLO burn rate from error budget consumption over rolling windows (1h, 6h, 24h, 30d)
  - `Slo` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Domain/Slo.cs`) - SLO entity with target (e.g., 99.9%), error budget, and current burn rate
  - `SloEndpoints` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.WebService/Endpoints/SloEndpoints.cs`) - REST API for SLO queries and burn rate dashboards
  - `IncidentModeHooks` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Observability/IncidentModeHooks.cs`) - activates incident mode when burn rate exceeds alert thresholds
  - `OrchestratorGoldenSignals` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Infrastructure/Observability/OrchestratorGoldenSignals.cs`) - provides underlying error/latency data for SLO computation
  - `ScaleMetrics` (`src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Scale/ScaleMetrics.cs`) - metrics feeding SLO saturation signals
- **Interfaces**: None (uses concrete implementations)
- **Source**: Feature matrix scan

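
The rolling-window computation attributed to `BurnRateEngine` above can be illustrated with a short sketch. It is written in Python for brevity and does not mirror the C# class: the `Event` record, the `burn_rate` and `burn_rates` functions, and the choice to report an empty window as a burn rate of zero are all assumptions made for illustration. It also assumes the common definition of burn rate as the observed error rate divided by the error rate the target allows (1 minus the target).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Window lengths named in this document; the dictionary itself is illustrative.
WINDOWS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "24h": timedelta(hours=24),
    "30d": timedelta(days=30),
}


@dataclass(frozen=True)
class Event:
    """Hypothetical record of one orchestrator operation."""
    at: datetime
    ok: bool


def burn_rate(events, target, window, now):
    """Observed error rate in (now - window, now], divided by the allowed error rate."""
    start = now - window
    in_window = [e for e in events if start < e.at <= now]
    if not in_window:
        return 0.0  # assumption: no traffic means no budget is being burned
    failures = sum(1 for e in in_window if not e.ok)
    error_rate = failures / len(in_window)
    allowed = 1.0 - target  # e.g. 0.001 for a 99.9% target; assumes target < 1
    return error_rate / allowed


def burn_rates(events, target, now):
    """Burn rate for every configured window, keyed by window name."""
    return {name: burn_rate(events, target, w, now) for name, w in WINDOWS.items()}
```

With a 99.9% target, one failure among 11 operations in the past hour gives a 1h burn rate of roughly 91, while the same failure diluted across 100 operations in the past six hours gives a 6h burn rate of roughly 10. A value of 1 means the budget would be used up exactly at the end of the SLO period.
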
## E2E Test Plan

- [ ] Define an `Slo` with target=99.9% and error budget=43.2 minutes per 30-day month; verify the SLO is persisted
- [ ] Generate 10 successful requests and 1 failed request, and verify that the `BurnRateEngine` computation reflects the error
- [ ] Verify rolling windows: compute burn rates for the 1h, 6h, and 24h windows via `BurnRateEngine` and verify each reflects only the events in its time range
- [ ] Exceed the alert threshold (e.g., 2x burn rate) and verify `IncidentModeHooks` triggers incident mode
- [ ] Query the SLO via `SloEndpoints` and verify the response includes current burn rate, remaining budget, and alert status
- [ ] Verify budget depletion: consume the entire error budget and verify the `Slo` shows 0% remaining
- [ ] Reset the SLO period (monthly rollover) and verify the error budget resets to full
- [ ] Verify multi-SLO: define SLOs for latency and availability, and verify `BurnRateEngine` computes each independently

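
The figures used in the checklist above can be checked by hand. The following sketch, again in Python and not taken from the codebase, works through the arithmetic under stated assumptions: a 30-day period, a request-based burn rate, and a strict greater-than comparison against the alert threshold. All function names are hypothetical.

```python
def error_budget_minutes(target, period_days=30):
    """Allowed downtime in minutes for an availability target over the period."""
    return (1.0 - target) * period_days * 24 * 60


def request_burn_rate(successes, failures, target):
    """Observed error rate divided by the error rate the target allows."""
    total = successes + failures
    if total == 0:
        return 0.0  # assumption: no requests means nothing is burned
    return (failures / total) / (1.0 - target)


def remaining_budget_pct(budget_minutes, consumed_minutes):
    """Share of the budget left, clamped so it never drops below 0%."""
    remaining = max(budget_minutes - consumed_minutes, 0.0)
    return 100.0 * (remaining / budget_minutes)


def exceeds_threshold(burn_rate, threshold=2.0):
    """True when the burn rate is strictly above the threshold (2x is the checklist's example)."""
    return burn_rate > threshold
```

A 99.9% target over 30 days allows 0.001 × 43,200 minutes, which is the 43.2 minutes quoted in the first checklist item. Ten successes and one failure give an error rate of 1/11, about 91 times the allowed rate, so that scenario sits well above a 2x threshold. Consuming the full 43.2 minutes leaves 0% of the budget.
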
## Verification

- Verified on 2026-02-13 via `run-002`.
- Tier 0: Source files confirmed present on disk.
- Tier 1: `dotnet build` passed (0 errors); 1292/1292 tests passed.
- Tier 2d: `docs/qa/feature-checks/runs/jobengine/slo-burn-rate-computation-and-alert-budget-tracking/run-002/tier2-integration-check.json`