# Unknowns SLA Monitoring

## Module
Unknowns

## Status
IMPLEMENTED

## Description
SLA monitoring for unknowns: tracks resolution timelines and provides health checks for unknown queue items.

## Implementation Details
- **Unknowns SLA Monitor Service**: `src/Unknowns/StellaOps.Unknowns.Services/UnknownsSlaMonitorService.cs` -- background service that periodically checks unknown queue items against the configured SLA thresholds (time-to-triage, time-to-resolution) and raises alerts for SLA breaches.
- **Unknowns SLA Health Check**: `src/Unknowns/StellaOps.Unknowns.Services/UnknownsSlaHealthCheck.cs` -- ASP.NET health check that reports SLA compliance status; returns degraded/unhealthy when unknowns exceed SLA thresholds, enabling integration with orchestrator health monitoring.
- **Unknowns Metrics Service**: `src/Unknowns/StellaOps.Unknowns.Services/UnknownsMetricsService.cs` -- exposes Prometheus/OpenTelemetry metrics for unknown queue depth, average resolution time, SLA breach count, and hint coverage percentage.
- **Grey Queue Watchdog Service**: `src/Unknowns/StellaOps.Unknowns.Services/GreyQueueWatchdogService.cs` -- monitors the grey queue (unknowns awaiting classification) and escalates items that have been pending longer than the configured watchdog timeout.
- **Unknowns Lifecycle Service**: `src/Unknowns/StellaOps.Unknowns.Services/UnknownsLifecycleService.cs` -- manages the full lifecycle of unknown items from ingestion through classification, hint gathering, resolution, and archival; tracks state transitions for SLA timing.
- **Grey Queue Entry Model**: `src/Unknowns/__Libraries/StellaOps.Unknowns.Core/Models/GreyQueueEntry.cs` -- data model for grey queue entries, including the creation timestamp, last activity timestamp, and SLA deadline fields used by the monitor.

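
The threshold check and health mapping described above can be illustrated with a small, language-neutral sketch. It is written in Python for brevity and is not the C# implementation: `QueueItem`, `SlaThresholds`, `find_breaches`, `health_from_breaches`, and the `unhealthy_at` cut-off are all hypothetical names and assumptions. In particular, the source does not say how many breaches separate degraded from unhealthy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class HealthStatus(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


@dataclass
class QueueItem:
    # Hypothetical shape; the real GreyQueueEntry fields may differ.
    item_id: str
    created_at: datetime
    triaged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


@dataclass
class SlaThresholds:
    time_to_triage: timedelta
    time_to_resolution: timedelta


def find_breaches(items, thresholds, now):
    """Return (item_id, kind) pairs for every SLA breach as of `now`."""
    breaches = []
    for item in items:
        # A stage that is still open is measured up to `now`;
        # a completed stage is measured up to its completion time.
        triage_end = item.triaged_at if item.triaged_at is not None else now
        if triage_end - item.created_at > thresholds.time_to_triage:
            breaches.append((item.item_id, "time-to-triage"))
        resolution_end = item.resolved_at if item.resolved_at is not None else now
        if resolution_end - item.created_at > thresholds.time_to_resolution:
            breaches.append((item.item_id, "time-to-resolution"))
    return breaches


def health_from_breaches(breach_count, unhealthy_at=5):
    # Assumption: any breach degrades health, and `unhealthy_at` or more
    # breaches make it unhealthy. The real cut-offs are not documented here.
    if breach_count == 0:
        return HealthStatus.HEALTHY
    if breach_count >= unhealthy_at:
        return HealthStatus.UNHEALTHY
    return HealthStatus.DEGRADED


# Usage: one fresh item and one item past the triage threshold.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
thresholds = SlaThresholds(timedelta(hours=4), timedelta(hours=24))
items = [
    QueueItem("fresh", now - timedelta(hours=1)),
    QueueItem("stale", now - timedelta(hours=6)),
]
breaches = find_breaches(items, thresholds, now)
status = health_from_breaches(len(breaches))
```

Passing `now` as a parameter rather than reading the system clock inside the function keeps the check deterministic, which makes it easy to exercise with backdated creation timestamps.
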
## E2E Test Plan
- [ ] Enqueue an unknown item, let the `UnknownsSlaMonitorService` run its check cycle, and verify the item is reported as within SLA when the elapsed time is below the threshold
- [ ] Enqueue an unknown item with an artificially past creation timestamp (exceeding the SLA threshold), run the monitor, and verify an SLA breach alert is raised
- [ ] Query the `UnknownsSlaHealthCheck` endpoint when all unknowns are within SLA and verify it returns `Healthy`; then introduce an SLA breach and verify it returns `Degraded`
- [ ] Verify `UnknownsMetricsService` exposes correct Prometheus metrics: enqueue an item, resolve it, and verify the `unknown_resolution_time_seconds` histogram records the elapsed time
- [ ] Enqueue a grey queue item, let `GreyQueueWatchdogService` run, and verify the item is escalated when it exceeds the watchdog timeout
- [ ] Track an unknown item through its full lifecycle via `UnknownsLifecycleService` (ingestion -> classification -> resolution) and verify SLA timestamps are recorded at each state transition
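
The lifecycle item in the plan depends on timestamps being captured at each state transition. One way to make such a test deterministic is to inject a controllable clock. The following is a minimal Python sketch of that idea, not the actual `UnknownsLifecycleService`: `FakeClock`, `LifecycleTracker`, and the state names are hypothetical, and only the three states named in the test plan are modelled.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical state graph covering only the states named in the test plan.
TRANSITIONS = {
    "ingested": {"classified"},
    "classified": {"resolved"},
    "resolved": set(),
}


class FakeClock:
    """Controllable clock so tests never wait on real time."""

    def __init__(self, start):
        self.now = start

    def advance(self, delta):
        self.now += delta

    def __call__(self):
        return self.now


class LifecycleTracker:
    """Records a timestamp each time the item enters a new state."""

    def __init__(self, clock):
        self._clock = clock
        self.state = "ingested"
        self.timestamps = {"ingested": clock()}

    def transition(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.timestamps[new_state] = self._clock()

    def elapsed(self, start_state, end_state):
        # Time between entering two states, e.g. time-to-resolution.
        return self.timestamps[end_state] - self.timestamps[start_state]


# Usage: walk one item through the lifecycle with simulated delays.
clock = FakeClock(datetime(2024, 1, 1, tzinfo=timezone.utc))
tracker = LifecycleTracker(clock)
clock.advance(timedelta(hours=2))
tracker.transition("classified")
clock.advance(timedelta(hours=5))
tracker.transition("resolved")
time_to_triage = tracker.elapsed("ingested", "classified")
time_to_resolution = tracker.elapsed("ingested", "resolved")
```

The same fake clock can back the "artificially past creation timestamp" case: advance the clock beyond the threshold instead of rewriting stored timestamps.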