semi implemented and features implemented save checkpoint

2026-02-08 18:00:49 +02:00
parent 04360dff63
commit 1bf6bbf395
20895 changed files with 716795 additions and 64 deletions

@@ -0,0 +1,26 @@
# Unknowns SLA Monitoring
## Module
Unknowns
## Status
IMPLEMENTED
## Description
SLA monitoring for the Unknowns module: tracks resolution timelines for unknown queue items against configured SLA thresholds and exposes health checks that report SLA compliance.
## Implementation Details
- **Unknowns SLA Monitor Service**: `src/Unknowns/StellaOps.Unknowns.Services/UnknownsSlaMonitorService.cs` -- background service that periodically checks unknown queue items against configured SLA thresholds (time-to-triage, time-to-resolution); raises alerts for SLA breaches.
- **Unknowns SLA Health Check**: `src/Unknowns/StellaOps.Unknowns.Services/UnknownsSlaHealthCheck.cs` -- ASP.NET health check that reports SLA compliance status; returns degraded/unhealthy when unknowns exceed SLA thresholds, enabling integration with orchestrator health monitoring.
- **Unknowns Metrics Service**: `src/Unknowns/StellaOps.Unknowns.Services/UnknownsMetricsService.cs` -- exposes Prometheus/OpenTelemetry metrics for unknown queue depth, average resolution time, SLA breach count, and hint coverage percentage.
- **Grey Queue Watchdog Service**: `src/Unknowns/StellaOps.Unknowns.Services/GreyQueueWatchdogService.cs` -- monitors the grey queue (unknowns awaiting classification) and escalates items that have been pending beyond the configured watchdog timeout.
- **Unknowns Lifecycle Service**: `src/Unknowns/StellaOps.Unknowns.Services/UnknownsLifecycleService.cs` -- manages the full lifecycle of unknown items from ingestion through classification, hint gathering, resolution, and archival; tracks state transitions for SLA timing.
- **Grey Queue Entry Model**: `src/Unknowns/__Libraries/StellaOps.Unknowns.Core/Models/GreyQueueEntry.cs` -- data model for grey queue entries including creation timestamp, last activity timestamp, and SLA deadline fields used by the monitor.
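
The threshold check and health mapping described above can be sketched in a language-neutral way. This is a minimal illustration in Python, not the C# implementation: the `QueueEntry` fields, the `SlaThresholds` type, and the `unhealthy_ratio` cutoff between `Degraded` and `Unhealthy` are all assumptions made for the example, since the source only states that the health check returns degraded/unhealthy when unknowns exceed SLA thresholds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Iterable, List, Optional


class Health(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


@dataclass(frozen=True)
class QueueEntry:
    # Hypothetical stand-in for GreyQueueEntry; field names are assumptions.
    entry_id: str
    created_at: datetime
    triaged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


@dataclass(frozen=True)
class SlaThresholds:
    time_to_triage: timedelta
    time_to_resolution: timedelta


def breaches(entry: QueueEntry, sla: SlaThresholds, now: datetime) -> List[str]:
    """Return the names of the SLA thresholds this entry has exceeded."""
    found = []
    # An entry that is still open is measured up to the current time.
    triage_end = entry.triaged_at or entry.resolved_at or now
    if triage_end - entry.created_at > sla.time_to_triage:
        found.append("time_to_triage")
    resolution_end = entry.resolved_at or now
    if resolution_end - entry.created_at > sla.time_to_resolution:
        found.append("time_to_resolution")
    return found


def health(entries: Iterable[QueueEntry], sla: SlaThresholds, now: datetime,
           unhealthy_ratio: float = 0.5) -> Health:
    """Map breach counts to a health status (the ratio policy is an assumption)."""
    entries = list(entries)
    breached = sum(1 for e in entries if breaches(e, sla, now))
    if breached == 0:
        return Health.HEALTHY
    if breached / len(entries) >= unhealthy_ratio:
        return Health.UNHEALTHY
    return Health.DEGRADED
```

Passing `now` in explicitly, rather than reading the system clock inside the functions, keeps the check deterministic, which is what the backdated-timestamp scenarios in the test plan rely on.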
## E2E Test Plan
- [ ] Enqueue an unknown item, let the `UnknownsSlaMonitorService` run its check cycle, and verify the item is reported as within SLA when the elapsed time is below the threshold
- [ ] Enqueue an unknown item with a backdated creation timestamp (so that its age exceeds the SLA threshold), run the monitor, and verify an SLA breach alert is raised
- [ ] Query the `UnknownsSlaHealthCheck` endpoint when all unknowns are within SLA and verify it returns `Healthy`; then introduce an SLA breach and verify it returns `Degraded`
- [ ] Verify `UnknownsMetricsService` exposes correct Prometheus metrics: enqueue an item, resolve it, and verify `unknown_resolution_time_seconds` histogram records the elapsed time
- [ ] Enqueue a grey queue item, let `GreyQueueWatchdogService` run, and verify the item is escalated when it exceeds the watchdog timeout
- [ ] Track an unknown item through its full lifecycle via `UnknownsLifecycleService` (ingestion -> classification -> resolution) and verify SLA timestamps are recorded at each state transition
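
The lifecycle check in the last item depends on controlling time during the test. The following sketch, again in Python rather than the project's C#, shows one way to record a timestamp per state transition using an injectable clock. The `State` names, the `UnknownItem` class, the `FakeClock` helper, and the forward-only transition rule are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class State(Enum):
    # Ordered to mirror the lifecycle described above; names are assumptions.
    INGESTED = 1
    CLASSIFIED = 2
    HINTS_GATHERED = 3
    RESOLVED = 4
    ARCHIVED = 5


class FakeClock:
    """Test clock that only moves when tick() is called."""

    def __init__(self, start: datetime) -> None:
        self.now = start

    def __call__(self) -> datetime:
        return self.now

    def tick(self, delta: timedelta) -> None:
        self.now += delta


class UnknownItem:
    """Records the time of each state transition for later SLA timing."""

    def __init__(self, item_id: str, clock) -> None:
        self.item_id = item_id
        self._clock = clock
        self.state = State.INGESTED
        self.transitions = {State.INGESTED: clock()}

    def advance(self, target: State) -> None:
        # Assumed rule: transitions only move forward; stages may be skipped.
        if target.value <= self.state.value:
            raise ValueError(f"cannot move from {self.state.name} to {target.name}")
        self.state = target
        self.transitions[target] = self._clock()

    def elapsed(self, start: State, end: State) -> timedelta:
        return self.transitions[end] - self.transitions[start]


# Walk one item through ingestion -> classification -> resolution.
clock = FakeClock(datetime(2026, 1, 1, tzinfo=timezone.utc))
item = UnknownItem("unk-1", clock)
clock.tick(timedelta(hours=2))
item.advance(State.CLASSIFIED)
clock.tick(timedelta(hours=6))
item.advance(State.RESOLVED)
```

With the clock under test control, time-to-triage and time-to-resolution can be asserted exactly (here 2 hours and 8 hours) instead of waiting on real time.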