SLO Burn-Rate Computation and Alert Budget Tracking
Module
Orchestrator
Status
IMPLEMENTED
Description
SLO burn-rate computation for orchestrator operations with configurable alert budgets, enabling proactive capacity and reliability management.
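This section does not spell out the burn-rate formula. A common definition, assumed here rather than taken from BurnRateEngine, divides the observed error rate in a window w by the error rate the SLO allows (1 - T for target T). Under that definition, a burn rate of 1 spends the budget exactly by the end of the SLO period, and a burn rate of 2 spends a 30-day budget in 15 days:

$$
\mathrm{burn}_w = \frac{\text{failed}_w / \text{total}_w}{1 - T},
\qquad
B_{99.9\%} = (1 - 0.999) \times 30 \times 24 \times 60\ \text{min} = 43.2\ \text{min}
$$

For example, 1 failure in 11 requests gives an error rate of 1/11 ≈ 0.0909, so against a 99.9% target the burn rate is 0.0909 / 0.001 ≈ 90.9.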
Implementation Details
- Modules:
  - src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/SloManagement/
  - src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/
- Key Classes:
  - BurnRateEngine (src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/SloManagement/BurnRateEngine.cs) - computes SLO burn rate from error budget consumption over rolling windows (1h, 6h, 24h, 30d)
  - Slo (src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Slo.cs) - SLO entity with target (e.g., 99.9%), error budget, and current burn rate
  - SloEndpoints (src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/SloEndpoints.cs) - REST API for SLO queries and burn rate dashboards
  - IncidentModeHooks (src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Observability/IncidentModeHooks.cs) - activates incident mode when burn rate exceeds alert thresholds
  - OrchestratorGoldenSignals (src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Observability/OrchestratorGoldenSignals.cs) - provides underlying error/latency data for SLO computation
  - ScaleMetrics (src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scale/ScaleMetrics.cs) - metrics feeding SLO saturation signals
- Interfaces: None (uses concrete implementations)
- Source: Feature matrix scan
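A minimal Python sketch of computing burn rates over the rolling windows listed for BurnRateEngine, assuming the conventional ratio definition (observed error rate divided by 1 - target). The names Slo, BurnRateSketch, and should_enter_incident_mode, the event-list storage, and the two-window trigger rule are hypothetical illustrations, not the actual C# API of BurnRateEngine, Slo, or IncidentModeHooks:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Rolling windows listed for BurnRateEngine in the module description.
WINDOWS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "24h": timedelta(hours=24),
    "30d": timedelta(days=30),
}


@dataclass
class Slo:
    """Hypothetical stand-in for the Slo entity: a name and a target ratio."""

    name: str
    target: float  # e.g. 0.999 for a 99.9% SLO

    def __post_init__(self) -> None:
        # A 100% target leaves no error budget and would divide by zero.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    @property
    def allowed_error_rate(self) -> float:
        # Error rate the SLO tolerates: 0.001 for a 99.9% target.
        return 1.0 - self.target


@dataclass
class BurnRateSketch:
    """Hypothetical burn-rate calculator; not the real BurnRateEngine API."""

    slo: Slo
    # (timestamp, is_error) pairs; never pruned, to keep the sketch short.
    events: list[tuple[datetime, bool]] = field(default_factory=list)

    def record(self, at: datetime, is_error: bool) -> None:
        self.events.append((at, is_error))

    def burn_rate(self, window: timedelta, now: datetime) -> float:
        # Only events in [now - window, now] count toward this window.
        start = now - window
        outcomes = [err for ts, err in self.events if start <= ts <= now]
        if not outcomes:
            return 0.0  # no traffic in the window: nothing burned
        error_rate = sum(outcomes) / len(outcomes)
        # 1.0 means the budget would last exactly one SLO period.
        return error_rate / self.slo.allowed_error_rate

    def burn_rates(self, now: datetime) -> dict[str, float]:
        return {name: self.burn_rate(w, now) for name, w in WINDOWS.items()}


def should_enter_incident_mode(rates: dict[str, float], threshold: float = 2.0) -> bool:
    # Hypothetical rule: both a short and a longer window must exceed the
    # threshold, so a brief spike alone does not trigger incident mode.
    return rates["1h"] > threshold and rates["6h"] > threshold


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    engine = BurnRateSketch(Slo("availability", 0.999))
    for minute in range(1, 11):
        engine.record(now - timedelta(minutes=minute), is_error=False)
    engine.record(now - timedelta(minutes=30), is_error=True)
    rates = engine.burn_rates(now)
    print(rates["1h"], should_enter_incident_mode(rates))
```

Requiring both the 1h and 6h windows to exceed the threshold is a common multiwindow alerting pattern that filters out short spikes; this section does not say whether IncidentModeHooks checks one window or several.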
E2E Test Plan
- Define an Slo with target=99.9% and error budget=43.2 minutes/month; verify the SLO is persisted
- Generate 10 successful requests and 1 failed request and verify the BurnRateEngine computation reflects the failure
- Verify rolling windows: compute burn rates for the 1h, 6h, and 24h windows via BurnRateEngine and verify each reflects the appropriate time range
- Exceed the alert threshold (e.g., 2x burn rate) and verify IncidentModeHooks triggers incident mode
- Query the SLO via SloEndpoints and verify the response includes current burn rate, remaining budget, and alert status
- Verify budget depletion: consume the entire error budget and verify the Slo shows 0% remaining
- Reset the SLO period (monthly rollover) and verify the error budget resets to full
- Verify multi-SLO: define SLOs for latency and availability, and verify BurnRateEngine computes each independently
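The budget-depletion, rollover, and multi-SLO steps in the plan above can be expressed as plain test functions. This Python sketch assumes a time-based budget over a 30-day period (which yields the 43.2 minutes for a 99.9% target), reports an overspent budget as 0% remaining, and uses a hypothetical ErrorBudget class in place of the real Slo entity; it does not exercise the actual orchestrator:

```python
from dataclasses import dataclass

MINUTES_PER_30_DAYS = 30 * 24 * 60  # assumed length of a "monthly" SLO period


@dataclass
class ErrorBudget:
    """Hypothetical time-based error budget for one SLO (not the real Slo API)."""

    name: str
    target: float  # e.g. 0.999 for 99.9%
    period_minutes: float = MINUTES_PER_30_DAYS
    consumed_minutes: float = 0.0

    @property
    def total_minutes(self) -> float:
        # Allowed "bad" minutes in the period: (1 - target) * period length.
        return (1.0 - self.target) * self.period_minutes

    @property
    def remaining_percent(self) -> float:
        # Clamped at 0% so an overspent budget reports as fully depleted.
        left = 1.0 - self.consumed_minutes / self.total_minutes
        return max(0.0, 100.0 * left)

    def consume(self, bad_minutes: float) -> None:
        self.consumed_minutes += bad_minutes

    def roll_over(self) -> None:
        # New period (monthly rollover): the budget resets to full.
        self.consumed_minutes = 0.0


def test_budget_depletion_and_rollover() -> None:
    slo = ErrorBudget("availability", target=0.999)
    assert abs(slo.total_minutes - 43.2) < 1e-9
    assert slo.remaining_percent == 100.0

    slo.consume(slo.total_minutes)  # spend the entire budget
    assert slo.remaining_percent == 0.0

    slo.roll_over()
    assert slo.remaining_percent == 100.0


def test_multi_slo_independence() -> None:
    availability = ErrorBudget("availability", target=0.999)
    latency = ErrorBudget("latency", target=0.99)

    availability.consume(21.6)  # half of the 43.2-minute budget
    assert abs(availability.remaining_percent - 50.0) < 1e-9
    assert latency.remaining_percent == 100.0  # untouched by the other SLO
```

Modeling the latency SLO as a time-based budget is a simplification for the sketch; a request-based latency SLO would count slow requests instead of bad minutes.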