Unknowns SLA Monitoring

Module

Unknowns

Status

IMPLEMENTED

Description

SLA monitoring for unknowns: tracks triage and resolution timelines against configured SLA thresholds and exposes health checks for unknown queue items.

Implementation Details

  • Unknowns SLA Monitor Service: src/Unknowns/StellaOps.Unknowns.Services/UnknownsSlaMonitorService.cs -- background service that periodically checks unknown queue items against configured SLA thresholds (time-to-triage, time-to-resolution); raises alerts for SLA breaches.
  • Unknowns SLA Health Check: src/Unknowns/StellaOps.Unknowns.Services/UnknownsSlaHealthCheck.cs -- ASP.NET health check that reports SLA compliance status; returns Degraded or Unhealthy when unknowns exceed SLA thresholds, so orchestrator health monitoring can consume the result.
  • Unknowns Metrics Service: src/Unknowns/StellaOps.Unknowns.Services/UnknownsMetricsService.cs -- exposes Prometheus/OpenTelemetry metrics for unknown queue depth, average resolution time, SLA breach count, and hint coverage percentage.
  • Grey Queue Watchdog Service: src/Unknowns/StellaOps.Unknowns.Services/GreyQueueWatchdogService.cs -- monitors the grey queue (unknowns awaiting classification) and escalates items that have been pending beyond the configured watchdog timeout.
  • Unknowns Lifecycle Service: src/Unknowns/StellaOps.Unknowns.Services/UnknownsLifecycleService.cs -- manages the full lifecycle of unknown items from ingestion through classification, hint gathering, resolution, and archival; tracks state transitions for SLA timing.
  • Grey Queue Entry Model: src/Unknowns/__Libraries/StellaOps.Unknowns.Core/Models/GreyQueueEntry.cs -- data model for grey queue entries including creation timestamp, last activity timestamp, and SLA deadline fields used by the monitor.
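
The relationship between the SLA deadline on a queue entry, breach detection, and the reported health status can be illustrated with a small sketch. This is not the C# implementation: it is a language-neutral Python illustration, and the `QueueEntry` fields, the function names, and especially the `unhealthy_ratio` rule for separating Degraded from Unhealthy are assumptions made for the example. The source only states that the health check returns degraded/unhealthy when unknowns exceed SLA thresholds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Iterable, List, Optional


class HealthStatus(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


@dataclass
class QueueEntry:
    # Hypothetical stand-in for GreyQueueEntry; field names are assumptions.
    entry_id: str
    created_at: datetime
    sla_deadline: datetime
    resolved_at: Optional[datetime] = None


def find_breaches(entries: Iterable[QueueEntry], now: datetime) -> List[QueueEntry]:
    """Return unresolved entries whose SLA deadline has already passed."""
    return [e for e in entries if e.resolved_at is None and now > e.sla_deadline]


def health_status(
    entries: Iterable[QueueEntry],
    now: datetime,
    unhealthy_ratio: float = 0.5,  # assumed cut-off, not taken from the source
) -> HealthStatus:
    """Map the share of breached open entries to a health status."""
    open_entries = [e for e in entries if e.resolved_at is None]
    breaches = find_breaches(open_entries, now)
    if not breaches:
        return HealthStatus.HEALTHY
    if len(breaches) / len(open_entries) >= unhealthy_ratio:
        return HealthStatus.UNHEALTHY
    return HealthStatus.DEGRADED
```

Passing `now` as a parameter rather than reading the system clock inside the functions keeps the check deterministic, which matters for the backdated-timestamp scenarios in the test plan.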

E2E Test Plan

  • Enqueue an unknown item, let the UnknownsSlaMonitorService run its check cycle, and verify the item is reported as within SLA when the elapsed time is below the threshold
  • Enqueue an unknown item with an artificially past creation timestamp (exceeding the SLA threshold), run the monitor, and verify an SLA breach alert is raised
  • Query the UnknownsSlaHealthCheck endpoint when all unknowns are within SLA and verify it returns Healthy; then introduce an SLA breach and verify it returns Degraded
  • Verify that UnknownsMetricsService exposes correct Prometheus metrics: enqueue an item, resolve it, and confirm that the unknown_resolution_time_seconds histogram records the elapsed time
  • Enqueue a grey queue item, let GreyQueueWatchdogService run, and verify the item is escalated when it exceeds the watchdog timeout
  • Track an unknown item through its full lifecycle via UnknownsLifecycleService (ingestion -> classification -> resolution) and verify SLA timestamps are recorded at each state transition
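
The first two scenarios (an item within SLA and an item with a backdated creation timestamp) might be structured as in the sketch below. It is written in Python for brevity and runs against a hypothetical in-memory `SlaMonitor` with an injectable clock. The class, its `run_check_cycle` method, and the 24-hour threshold are assumptions for illustration and do not reflect the actual UnknownsSlaMonitorService API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List


@dataclass
class Unknown:
    # Minimal hypothetical unknown item: an ID and its creation timestamp.
    item_id: str
    created_at: datetime


@dataclass
class SlaMonitor:
    # Hypothetical in-memory stand-in for the SLA monitor service.
    threshold: timedelta
    clock: Callable[[], datetime]
    alerts: List[str] = field(default_factory=list)

    def run_check_cycle(self, items: Iterable[Unknown]) -> None:
        """Record an alert for every item older than the SLA threshold."""
        now = self.clock()
        for item in items:
            if now - item.created_at > self.threshold:
                self.alerts.append(item.item_id)


def test_within_sla_and_breach() -> List[str]:
    fixed_now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    monitor = SlaMonitor(threshold=timedelta(hours=24), clock=lambda: fixed_now)

    fresh = Unknown("fresh", fixed_now - timedelta(hours=1))   # below threshold
    stale = Unknown("stale", fixed_now - timedelta(hours=48))  # backdated past SLA

    monitor.run_check_cycle([fresh, stale])

    # Only the backdated item should trigger a breach alert.
    assert monitor.alerts == ["stale"]
    return monitor.alerts
```

Fixing the clock avoids sleeping in the test and keeps the breach condition independent of wall-clock time; the same pattern would apply to the watchdog timeout scenario.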