Unknowns SLA Monitoring
Module
Unknowns
Status
IMPLEMENTED
Description
SLA monitoring for unknowns: tracks resolution timelines and provides health checks for unknown queue items.
Implementation Details
- Unknowns SLA Monitor Service: src/Unknowns/StellaOps.Unknowns.Services/UnknownsSlaMonitorService.cs -- background service that periodically checks unknown queue items against configured SLA thresholds (time-to-triage, time-to-resolution); raises alerts for SLA breaches.
- Unknowns SLA Health Check: src/Unknowns/StellaOps.Unknowns.Services/UnknownsSlaHealthCheck.cs -- ASP.NET health check that reports SLA compliance status; returns degraded/unhealthy when unknowns exceed SLA thresholds, enabling integration with orchestrator health monitoring.
- Unknowns Metrics Service: src/Unknowns/StellaOps.Unknowns.Services/UnknownsMetricsService.cs -- exposes Prometheus/OpenTelemetry metrics for unknown queue depth, average resolution time, SLA breach count, and hint coverage percentage.
- Grey Queue Watchdog Service: src/Unknowns/StellaOps.Unknowns.Services/GreyQueueWatchdogService.cs -- monitors the grey queue (unknowns awaiting classification) and escalates items that have been pending beyond the configured watchdog timeout.
- Unknowns Lifecycle Service: src/Unknowns/StellaOps.Unknowns.Services/UnknownsLifecycleService.cs -- manages the full lifecycle of unknown items from ingestion through classification, hint gathering, resolution, and archival; tracks state transitions for SLA timing.
- Grey Queue Entry Model: src/Unknowns/__Libraries/StellaOps.Unknowns.Core/Models/GreyQueueEntry.cs -- data model for grey queue entries including creation timestamp, last activity timestamp, and SLA deadline fields used by the monitor.
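The threshold check and the health mapping described above can be illustrated with a short sketch. The services themselves are C#; Python is used here only to show the logic, and this is not the code in the repository. The field names, the `SlaThresholds` type, and the rule separating Degraded from Unhealthy (the share of breached items reaching a hypothetical `unhealthy_ratio`) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class GreyQueueEntry:
    """Hypothetical mirror of the timestamp fields a queue entry might carry."""
    id: str
    created_at: datetime
    triaged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


@dataclass
class SlaThresholds:
    """Assumed configuration shape: one limit per SLA named in the list above."""
    time_to_triage: timedelta
    time_to_resolution: timedelta


def breaches(entry: GreyQueueEntry, thresholds: SlaThresholds, now: datetime) -> List[str]:
    """Return the names of the SLA thresholds this entry has exceeded."""
    found = []
    # An item that is not yet triaged (or resolved) keeps aging until `now`.
    triage_end = entry.triaged_at or entry.resolved_at or now
    if triage_end - entry.created_at > thresholds.time_to_triage:
        found.append("time_to_triage")
    resolution_end = entry.resolved_at or now
    if resolution_end - entry.created_at > thresholds.time_to_resolution:
        found.append("time_to_resolution")
    return found


def health_status(entries, thresholds, now, unhealthy_ratio=0.25) -> str:
    """Map breach counts to a status string.

    Assumption: any breach yields Degraded; a breached share at or above
    `unhealthy_ratio` yields Unhealthy. The real rule may differ.
    """
    breached = sum(1 for e in entries if breaches(e, thresholds, now))
    if breached == 0:
        return "Healthy"
    if breached / len(entries) >= unhealthy_ratio:
        return "Unhealthy"
    return "Degraded"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
limits = SlaThresholds(timedelta(hours=4), timedelta(hours=24))
fresh = GreyQueueEntry("a", created_at=now - timedelta(hours=1))
stale = GreyQueueEntry("b", created_at=now - timedelta(hours=30))
```

Passing `now` in as a parameter, rather than reading the system clock inside the functions, keeps the check deterministic, which matters for the backdated-timestamp scenario in the test plan.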
E2E Test Plan
- Enqueue an unknown item, let the UnknownsSlaMonitorService run its check cycle, and verify the item is reported as within SLA when the elapsed time is below the threshold
- Enqueue an unknown item with an artificially past creation timestamp (exceeding the SLA threshold), run the monitor, and verify an SLA breach alert is raised
- Query the UnknownsSlaHealthCheck endpoint when all unknowns are within SLA and verify it returns Healthy; then introduce an SLA breach and verify it returns Degraded
- Verify UnknownsMetricsService exposes correct Prometheus metrics: enqueue an item, resolve it, and verify the unknown_resolution_time_seconds histogram records the elapsed time
- Enqueue a grey queue item, let GreyQueueWatchdogService run, and verify the item is escalated when it exceeds the watchdog timeout
- Track an unknown item through its full lifecycle via UnknownsLifecycleService (ingestion -> classification -> resolution) and verify SLA timestamps are recorded at each state transition
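The lifecycle and watchdog scenarios above depend on controlling time, so a test double for the clock is a natural fit. The sketch below is a Python illustration under stated assumptions, not the project's C# test code: `FakeClock`, `LifecycleTracker`, `needs_escalation`, and the three state names are hypothetical stand-ins chosen for the example.

```python
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Test clock that only moves when the test advances it."""

    def __init__(self, start: datetime):
        self._now = start

    def now(self) -> datetime:
        return self._now

    def advance(self, delta: timedelta) -> None:
        self._now += delta


class LifecycleTracker:
    """Hypothetical stand-in that records one timestamp per state transition."""

    ORDER = ["ingested", "classified", "resolved"]

    def __init__(self, clock: FakeClock):
        self._clock = clock
        self.transitions = {}  # state name -> time the state was entered

    def transition(self, state: str) -> None:
        step = len(self.transitions)
        # Reject transitions that are out of order or past the final state.
        if step >= len(self.ORDER) or state != self.ORDER[step]:
            raise ValueError(f"unexpected transition to {state!r}")
        self.transitions[state] = self._clock.now()


def needs_escalation(last_activity: datetime, now: datetime,
                     watchdog_timeout: timedelta) -> bool:
    """True when an item has been idle for longer than the watchdog timeout."""
    return now - last_activity > watchdog_timeout


# Walk one item through ingestion -> classification -> resolution.
clock = FakeClock(datetime(2024, 1, 1, tzinfo=timezone.utc))
item = LifecycleTracker(clock)
item.transition("ingested")
clock.advance(timedelta(hours=2))
item.transition("classified")
clock.advance(timedelta(hours=5))
item.transition("resolved")

resolution_time = item.transitions["resolved"] - item.transitions["ingested"]
```

Advancing the fake clock instead of sleeping lets a test cross a multi-hour SLA or watchdog boundary instantly, and makes the recorded timestamps exact enough to assert on.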