semi implemented and features implemented save checkpoint

2026-02-08 18:00:49 +02:00
parent 04360dff63
commit 1bf6bbf395
20895 changed files with 716795 additions and 64 deletions
--- a/docs/features/unchecked/unknowns/unknowns-sla-monitoring.md
+++ b/docs/features/unchecked/unknowns/unknowns-sla-monitoring.md
@@ -0,0 +1,26 @@
+# Unknowns SLA Monitoring
+
+## Module
+Unknowns
+
+## Status
+IMPLEMENTED
+
+## Description
+SLA monitoring for unknowns: tracks resolution timelines against configured SLA thresholds and provides health checks for unknown queue items.
+
+## Implementation Details
+- **Unknowns SLA Monitor Service**: `src/Unknowns/StellaOps.Unknowns.Services/UnknownsSlaMonitorService.cs` -- background service that periodically checks unknown queue items against configured SLA thresholds (time-to-triage, time-to-resolution); raises alerts for SLA breaches.
+- **Unknowns SLA Health Check**: `src/Unknowns/StellaOps.Unknowns.Services/UnknownsSlaHealthCheck.cs` -- ASP.NET health check that reports SLA compliance status; returns `Degraded` or `Unhealthy` when unknowns exceed SLA thresholds, enabling integration with orchestrator health monitoring.
+- **Unknowns Metrics Service**: `src/Unknowns/StellaOps.Unknowns.Services/UnknownsMetricsService.cs` -- exposes Prometheus/OpenTelemetry metrics for unknown queue depth, average resolution time, SLA breach count, and hint coverage percentage.
+- **Grey Queue Watchdog Service**: `src/Unknowns/StellaOps.Unknowns.Services/GreyQueueWatchdogService.cs` -- monitors the grey queue (unknowns awaiting classification) and escalates items that have been pending beyond the configured watchdog timeout.
+- **Unknowns Lifecycle Service**: `src/Unknowns/StellaOps.Unknowns.Services/UnknownsLifecycleService.cs` -- manages the full lifecycle of unknown items from ingestion through classification, hint gathering, resolution, and archival; tracks state transitions for SLA timing.
+- **Grey Queue Entry Model**: `src/Unknowns/__Libraries/StellaOps.Unknowns.Core/Models/GreyQueueEntry.cs` -- data model for grey queue entries including creation timestamp, last activity timestamp, and SLA deadline fields used by the monitor.
+
+## E2E Test Plan
+- [ ] Enqueue an unknown item, let the `UnknownsSlaMonitorService` run its check cycle, and verify the item is reported as within SLA when the elapsed time is below the threshold
+- [ ] Enqueue an unknown item with an artificially past creation timestamp (exceeding the SLA threshold), run the monitor, and verify an SLA breach alert is raised
+- [ ] Query the `UnknownsSlaHealthCheck` endpoint when all unknowns are within SLA and verify it returns `Healthy`; then introduce an SLA breach and verify it returns `Degraded`
+- [ ] Verify `UnknownsMetricsService` exposes correct Prometheus metrics: enqueue an item, resolve it, and verify the `unknown_resolution_time_seconds` histogram records the elapsed time
+- [ ] Enqueue a grey queue item, let `GreyQueueWatchdogService` run, and verify the item is escalated when it exceeds the watchdog timeout
+- [ ] Track an unknown item through its full lifecycle via `UnknownsLifecycleService` (ingestion -> classification -> resolution) and verify SLA timestamps are recorded at each state transition
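
Note on the test plan: the first three items rest on one piece of logic, comparing elapsed time against the time-to-triage and time-to-resolution thresholds and mapping the result to a health status. The sketch below illustrates that logic in Python rather than the project's C#, and it is not the code in `UnknownsSlaMonitorService.cs` or `UnknownsSlaHealthCheck.cs`. `QueueEntry`, `SlaThresholds`, `find_breaches`, and `health_status` are hypothetical names, the threshold values are made up, and the rule separating `Degraded` from `Unhealthy` is an assumption, since the doc does not say where that line is drawn.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class QueueEntry:
    """Hypothetical stand-in for GreyQueueEntry; field names are assumed."""
    item_id: str
    created_at: datetime
    triaged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


@dataclass
class SlaThresholds:
    """Illustrative thresholds; the real configured values are not given."""
    time_to_triage: timedelta = timedelta(hours=4)
    time_to_resolution: timedelta = timedelta(hours=24)


def find_breaches(entries, thresholds, now):
    """Return (item_id, sla_name) pairs for every SLA an entry has exceeded."""
    breaches = []
    for e in entries:
        # An open stage is measured up to `now`; a closed one up to its timestamp.
        triage_elapsed = (e.triaged_at or now) - e.created_at
        if triage_elapsed > thresholds.time_to_triage:
            breaches.append((e.item_id, "time-to-triage"))
        resolution_elapsed = (e.resolved_at or now) - e.created_at
        if resolution_elapsed > thresholds.time_to_resolution:
            breaches.append((e.item_id, "time-to-resolution"))
    return breaches


def health_status(entries, breaches):
    """Map breaches to a status. The Degraded/Unhealthy split is an assumption."""
    if not breaches:
        return "Healthy"
    breached_ids = {item_id for item_id, _ in breaches}
    # Assumed rule: more than half of the items in breach means Unhealthy.
    if len(breached_ids) * 2 > len(entries):
        return "Unhealthy"
    return "Degraded"


# Usage: two fresh items and one created 30 hours ago (past both thresholds).
now = datetime(2026, 2, 8, 16, 0, tzinfo=timezone.utc)
entries = [
    QueueEntry("fresh-1", created_at=now - timedelta(hours=1)),
    QueueEntry("fresh-2", created_at=now - timedelta(hours=2)),
    QueueEntry("stale", created_at=now - timedelta(hours=30)),
]
breaches = find_breaches(entries, SlaThresholds(), now)
print(breaches)
print(health_status(entries, breaches))
```

Passing `now` in as a parameter, instead of reading the clock inside the check, is what makes the "artificially past creation timestamp" test item straightforward: the test controls both the entry timestamps and the evaluation time.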