SLO Burn-Rate Computation and Alert Budget Tracking

Module

Orchestrator

Status

VERIFIED

Description

Computes SLO burn rates for orchestrator operations and tracks them against configurable alert budgets, enabling proactive capacity and reliability management.

Implementation Details

  • Modules: src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/SloManagement/, src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Domain/
  • Key Classes:
    • BurnRateEngine (src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/SloManagement/BurnRateEngine.cs) - computes SLO burn rate from error budget consumption over rolling windows (1h, 6h, 24h, 30d)
    • Slo (src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Domain/Slo.cs) - SLO entity with target (e.g., 99.9%), error budget, and current burn rate
    • SloEndpoints (src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.WebService/Endpoints/SloEndpoints.cs) - REST API for SLO queries and burn rate dashboards
    • IncidentModeHooks (src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Observability/IncidentModeHooks.cs) - activates incident mode when burn rate exceeds alert thresholds
    • OrchestratorGoldenSignals (src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Infrastructure/Observability/OrchestratorGoldenSignals.cs) - provides underlying error/latency data for SLO computation
    • ScaleMetrics (src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Core/Scale/ScaleMetrics.cs) - metrics feeding SLO saturation signals
  • Interfaces: None (uses concrete implementations)
  • Source: Feature matrix scan
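
The rolling-window computation described for BurnRateEngine can be illustrated with a small, language-neutral sketch. It assumes the common definition of burn rate (observed error rate in a window divided by the allowed error rate, 1 - target); the source does not confirm that BurnRateEngine uses exactly this formula. All names here (`Event`, `burn_rate`, `burn_rates`, `WINDOWS`) are hypothetical and do not mirror the C# API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical window table matching the windows listed for BurnRateEngine.
WINDOWS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "24h": timedelta(hours=24),
    "30d": timedelta(days=30),
}


@dataclass(frozen=True)
class Event:
    """One observed request: when it happened and whether it succeeded."""
    at: datetime
    ok: bool


def burn_rate(events, target, window, now):
    """Error rate inside the window divided by the allowed error rate.

    Assumption: burn rate = error_rate / (1 - target). A value of 1.0 means
    the budget is being spent exactly as fast as the SLO allows.
    """
    recent = [e for e in events if now - window < e.at <= now]
    if not recent:
        return 0.0  # no traffic in the window, so nothing was burned
    error_rate = sum(1 for e in recent if not e.ok) / len(recent)
    return error_rate / (1.0 - target)


def burn_rates(events, target, now):
    """Compute the burn rate for every configured window."""
    return {name: burn_rate(events, target, w, now) for name, w in WINDOWS.items()}


# Example: 10 successes and 1 failure in the last hour, 89 successes earlier.
now = datetime(2026, 2, 13, 12, 0)
events = (
    [Event(now - timedelta(minutes=30), ok=True) for _ in range(10)]
    + [Event(now - timedelta(minutes=30), ok=False)]
    + [Event(now - timedelta(hours=3), ok=True) for _ in range(89)]
)
rates = burn_rates(events, target=0.999, now=now)
# The 1h window sees 1 failure in 11 requests (burn rate of roughly 90.9);
# the 6h window dilutes it to 1 in 100 (roughly 10).
```

Short windows react quickly to a burst of failures while longer windows smooth it out, which is why the same failure yields very different burn rates per window in this sketch.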

E2E Test Plan

  • Define an Slo with target=99.9% and error budget=43.2 minutes/month (0.1% of a 30-day month); verify the SLO is persisted
  • Generate 10 successful requests and 1 failed request, then verify the burn rate computed by BurnRateEngine reflects the failure
  • Verify rolling window: compute burn rates for 1h, 6h, and 24h windows via BurnRateEngine and verify each reflects the appropriate time range
  • Exceed the alert threshold (e.g., 2x burn rate) and verify IncidentModeHooks triggers incident mode
  • Query SLO via SloEndpoints and verify the response includes current burn rate, remaining budget, and alert status
  • Verify budget depletion: consume the entire error budget and verify the Slo shows 0% remaining
  • Reset the SLO period (monthly rollover) and verify the error budget resets to full
  • Verify multi-SLO: define SLOs for latency and availability, verify BurnRateEngine computes each independently
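
The budget figures used in this plan can be checked with a short arithmetic sketch. It assumes a 30-day period and a strict "greater than" comparison for the alert threshold; neither detail is confirmed by the source, and the function names are hypothetical rather than part of Slo or IncidentModeHooks.

```python
def error_budget_minutes(target, period_days=30):
    """Allowed downtime for the period: (1 - target) of the total minutes."""
    return (1.0 - target) * period_days * 24 * 60


def remaining_budget_fraction(budget_minutes, consumed_minutes):
    """Share of the budget still available, clamped so it never goes below 0."""
    if budget_minutes <= 0:
        return 0.0
    return max(0.0, 1.0 - consumed_minutes / budget_minutes)


def exceeds_threshold(burn_rate, threshold=2.0):
    """Assumption: an alert fires only when the burn rate is strictly above the threshold."""
    return burn_rate > threshold


budget = error_budget_minutes(0.999)                   # about 43.2 minutes for 99.9%
depleted = remaining_budget_fraction(budget, budget)   # whole budget consumed
after_rollover = remaining_budget_fraction(budget, 0)  # consumption reset to zero
```

Under these assumptions, a 99.9% target over 30 days gives about 43.2 minutes of budget, full consumption leaves 0% remaining, and a monthly rollover that resets consumption restores the budget to 100%.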

Verification

  • Verified on 2026-02-13 via run-002.
  • Tier 0: Source files confirmed present on disk.
  • Tier 1: dotnet build passed (0 errors); 1292/1292 tests passed.
  • Tier 2d: docs/qa/feature-checks/runs/jobengine/slo-burn-rate-computation-and-alert-budget-tracking/run-002/tier2-integration-check.json