SLO Burn-Rate Computation and Alert Budget Tracking

Module

Orchestrator

Status

IMPLEMENTED

Description

Computes SLO burn rates for orchestrator operations and tracks them against configurable alert budgets, so that capacity and reliability problems can be addressed before the error budget is exhausted.

Implementation Details

  • Modules: src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/SloManagement/, src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/
  • Key Classes:
    • BurnRateEngine (src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/SloManagement/BurnRateEngine.cs) - computes SLO burn rate from error budget consumption over rolling windows (1h, 6h, 24h, 30d)
    • Slo (src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Domain/Slo.cs) - SLO entity with target (e.g., 99.9%), error budget, and current burn rate
    • SloEndpoints (src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.WebService/Endpoints/SloEndpoints.cs) - REST API for SLO queries and burn rate dashboards
    • IncidentModeHooks (src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Observability/IncidentModeHooks.cs) - activates incident mode when burn rate exceeds alert thresholds
    • OrchestratorGoldenSignals (src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Infrastructure/Observability/OrchestratorGoldenSignals.cs) - provides underlying error/latency data for SLO computation
    • ScaleMetrics (src/Orchestrator/StellaOps.Orchestrator/StellaOps.Orchestrator.Core/Scale/ScaleMetrics.cs) - metrics feeding SLO saturation signals
  • Interfaces: None (uses concrete implementations)
  • Source: Feature matrix scan
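
The rolling-window computation attributed to BurnRateEngine above can be sketched in a few lines. This is an illustrative Python sketch, not the C# implementation. It assumes burn rate means the observed error rate in a window divided by the error rate the SLO allows (1 - target), which is a common SRE convention but is not confirmed by this document. The Event type and the burn_rate and burn_rates functions are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Window lengths taken from the BurnRateEngine description above.
WINDOWS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "24h": timedelta(hours=24),
    "30d": timedelta(days=30),
}


@dataclass(frozen=True)
class Event:
    """One observed operation (hypothetical shape, not the Orchestrator's)."""
    at: datetime
    ok: bool


def burn_rate(events, target, window, now):
    """Observed error rate inside the window divided by the allowed error rate."""
    allowed = 1.0 - target
    if allowed <= 0:
        raise ValueError("target must be below 1.0")
    # Keep only events that fall inside the rolling window ending at `now`.
    recent = [e for e in events if now - window < e.at <= now]
    if not recent:
        return 0.0  # assumption: no traffic means no budget is burned
    failures = sum(1 for e in recent if not e.ok)
    return (failures / len(recent)) / allowed


def burn_rates(events, target, now):
    """Burn rate for every configured window, keyed by window name."""
    return {name: burn_rate(events, target, w, now) for name, w in WINDOWS.items()}


# Ten successes in the last hour plus one failure three hours ago.
now = datetime(2025, 1, 1, 12, 0, 0)
events = [Event(now - timedelta(minutes=5 * i), ok=True) for i in range(1, 11)]
events.append(Event(now - timedelta(hours=3), ok=False))
rates = burn_rates(events, target=0.999, now=now)
```

In this example the 1h window sees no failures, so its burn rate is 0, while the 6h, 24h, and 30d windows all include the single failure and report a burn rate of roughly 90.9 (an error rate of 1/11 against an allowed rate of 0.1%). Comparing short and long windows this way is what separates a brief spike from sustained budget consumption.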

E2E Test Plan

  • Define an Slo with target=99.9% and an error budget of 43.2 minutes/month (0.1% of a 30-day period); verify the SLO is persisted
  • Generate 10 successful requests and 1 failed request, and verify that the BurnRateEngine computation reflects the error
  • Verify rolling windows: compute burn rates for the 1h, 6h, and 24h windows via BurnRateEngine and verify that each reflects only the events inside its time range
  • Exceed the alert threshold (e.g., 2x burn rate) and verify IncidentModeHooks triggers incident mode
  • Query SLO via SloEndpoints and verify the response includes current burn rate, remaining budget, and alert status
  • Verify budget depletion: consume the entire error budget and verify the Slo shows 0% remaining
  • Reset the SLO period (monthly rollover) and verify the error budget resets to full
  • Verify multi-SLO: define SLOs for latency and availability, and verify that BurnRateEngine computes each independently
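
The figures used in the test plan can be checked with a little arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, a 30-day period is assumed for "month", and burn rate is assumed to be the observed error rate divided by the allowed error rate (1 - target). None of this is taken from the Orchestrator source.

```python
def error_budget_minutes(target, period_days=30):
    """Allowed unavailability for the period, in minutes."""
    return (1.0 - target) * period_days * 24 * 60


def remaining_budget_pct(budget_minutes, consumed_minutes):
    """Share of the budget still unspent, clamped to the 0-100 range."""
    remaining = max(budget_minutes - consumed_minutes, 0.0)
    return 100.0 * (remaining / budget_minutes)


def exceeds_threshold(rate, threshold=2.0):
    """True when the burn rate is above the alert threshold (2x assumed here)."""
    return rate > threshold


def days_to_exhaustion(rate, period_days=30):
    """Days until a full budget is spent if the burn rate stays constant."""
    return period_days / rate if rate > 0 else float("inf")


# A 99.9% target over 30 days allows about 43.2 minutes of errors.
budget = error_budget_minutes(0.999)

# 10 successes and 1 failure: error rate 1/11 against an allowed rate of 0.1%.
rate = (1 / 11) / (1.0 - 0.999)
```

Under these assumptions the 10-success, 1-failure scenario yields a burn rate near 90.9, well above a 2x threshold, so the incident-mode check would be expected to fire. A sustained 2x burn rate would spend a full 30-day budget in 15 days, and consuming all 43.2 minutes leaves 0% remaining.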