# SCHED-WORKER-16-205 — Scheduler Worker Observability

_Sprint 16 · Scheduler Worker Guild_

The scheduler worker now exposes first-class metrics covering planner latency,
runner throughput, and backlog health.

## Meter: `StellaOps.Scheduler.Worker`

| Metric | Type | Tags | Description |
| --- | --- | --- | --- |
| `scheduler_planner_runs_total` | Counter | `mode`, `status` | Planner outcomes (`enqueued`, `no_work`, `failed`). |
| `scheduler_planner_latency_seconds` | Histogram | `mode`, `status` | Time between run creation and planner completion. |
| `scheduler_runner_segments_total` | Counter | `mode`, `status` | Runner segments processed (`Completed`, `persist_failed`, `RunMissing`). |
| `scheduler_runner_images_total` | Counter | `mode`, `delta` | Images processed per mode, split by whether a delta was observed. |
| `scheduler_runner_delta_total` | Counter | `mode` | Total new findings observed. |
| `scheduler_runner_delta_critical_total` | Counter | `mode` | Critical findings observed. |
| `scheduler_runner_delta_high_total` | Counter | `mode` | High findings observed. |
| `scheduler_runner_delta_kev_total` | Counter | `mode` | KEV hits surfaced across runner segments. |
| `scheduler_run_duration_seconds` | Histogram | `mode`, `result` | End-to-end run durations (currently recorded for successful completions). |
| `scheduler_runs_active` | Up/down counter | `mode` | Active runs in-flight. |
| `scheduler_runner_backlog` | Observable gauge | `mode`, `scheduleId` | Remaining images awaiting runner processing per schedule. |

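As a rough illustration of how the `status` tag on `scheduler_planner_runs_total` might be consumed, the sketch below computes a per-mode planner failure ratio from counter samples. The sample tuples, the mode names (`mode-a`, `mode-b`), and the helper `planner_failure_ratio` are hypothetical; only the tag names and status values come from the table above.

```python
from collections import defaultdict

# Hypothetical samples for scheduler_planner_runs_total:
# (mode, status, cumulative count). Mode names are placeholders.
SAMPLES = [
    ("mode-a", "enqueued", 90.0),
    ("mode-a", "no_work", 6.0),
    ("mode-a", "failed", 4.0),
    ("mode-b", "enqueued", 10.0),
]


def planner_failure_ratio(samples):
    """Return {mode: failed / total} across all planner statuses."""
    totals = defaultdict(float)
    failed = defaultdict(float)
    for mode, status, value in samples:
        totals[mode] += value
        if status == "failed":
            failed[mode] += value
    # Skip modes with no recorded runs to avoid dividing by zero.
    return {
        mode: failed[mode] / total
        for mode, total in totals.items()
        if total > 0
    }


ratios = planner_failure_ratio(SAMPLES)
```

In a real deployment the same ratio would more likely be derived in the monitoring backend from rates over a time window rather than from raw cumulative counts; the arithmetic is the same.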
## Instrumentation notes

- Planner records latency once a run transitions out of `Planning`. `no_work`
  completions emit zero-duration runs without incrementing the active counter.
- Runner updates backlog after every segment and decrements the active counter
  when a run reaches `Completed`.
- Delta counters aggregate per severity and KEV hit; they only increment when
  `DeltaSummary` reports meaningful changes.
- Metrics are emitted regardless of Notify availability so operators can track
  queue pressure even in air-gapped deployments.

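The bookkeeping rules above can be modelled in a few lines. This is a minimal in-memory sketch, not the worker's implementation: the `WorkerMetricsModel` class, its method names, the `delta` dictionary keys, and the mode and schedule names are assumptions made for illustration. It also assumes that the active counter is incremented when the planner enqueues a run, that "meaningful changes" means a non-zero count, and that a `no_work` completion records a run duration of zero.

```python
from collections import Counter

# Assumed mapping from hypothetical delta keys to the documented counters.
DELTA_METRICS = {
    "new": "scheduler_runner_delta_total",
    "critical": "scheduler_runner_delta_critical_total",
    "high": "scheduler_runner_delta_high_total",
    "kev": "scheduler_runner_delta_kev_total",
}


class WorkerMetricsModel:
    """Hypothetical in-memory stand-in for the worker's meter."""

    def __init__(self):
        self.counters = Counter()   # (metric, mode[, status]) -> value
        self.active = Counter()     # mode -> runs in flight
        self.backlog = {}           # (mode, schedule_id) -> images remaining
        self.planner_latency = []   # (mode, status, seconds)
        self.run_durations = []     # (mode, seconds)

    def planner_completed(self, mode, status, latency_seconds):
        # Latency is recorded once the run leaves Planning.
        self.counters[("scheduler_planner_runs_total", mode, status)] += 1
        self.planner_latency.append((mode, status, latency_seconds))
        if status == "no_work":
            # Assumed reading of "zero-duration runs": no active increment.
            self.run_durations.append((mode, 0.0))
        elif status == "enqueued":
            # Assumption: an enqueued run counts as active from here on.
            self.active[mode] += 1

    def segment_processed(self, mode, schedule_id, remaining_images,
                          delta, run_completed):
        self.counters[("scheduler_runner_segments_total", mode, "Completed")] += 1
        # Backlog is refreshed after every segment.
        self.backlog[(mode, schedule_id)] = remaining_images
        # Delta counters move only when a non-zero change is reported.
        for key, metric in DELTA_METRICS.items():
            count = delta.get(key, 0)
            if count > 0:
                self.counters[(metric, mode)] += count
        if run_completed:
            self.active[mode] -= 1


model = WorkerMetricsModel()
model.planner_completed("mode-a", "no_work", 0.2)
model.planner_completed("mode-a", "enqueued", 1.5)
model.segment_processed("mode-a", "sched-1", remaining_images=3,
                        delta={"new": 2, "critical": 1}, run_completed=False)
model.segment_processed("mode-a", "sched-1", remaining_images=0,
                        delta={}, run_completed=True)
```

After the four calls, the model shows no active runs and an empty backlog for `sched-1`, and the second segment (with an empty delta) leaves the delta counters untouched.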
## Dashboards & alerts

- **Grafana dashboard:** `docs/ops/scheduler-worker-grafana-dashboard.json`
  (import into Prometheus-backed Grafana). Panels mirror the metrics above with
  mode filters.
- **Prometheus rules:** `docs/ops/scheduler-worker-prometheus-rules.yaml`
  provides planner failure/latency, backlog, and stuck-run alerts.
- **Operations guide:** see `docs/ops/scheduler-worker-operations.md` for
  runbook steps, alert context, and dashboard wiring instructions.