git.stella-ops.org/src/StellaOps.Scheduler.Worker/docs/SCHED-WORKER-16-205-OBSERVABILITY.md
Vladimir Moushkov 4d932cc1ba
feat: Implement runner execution pipeline with planner dispatch and execution services
- Introduced RunnerBackgroundService to handle execution of runner segments.
- Added RunnerExecutionService for processing segments and aggregating results.
- Implemented PlannerQueueDispatchService to manage dispatching of planner messages.
- Created PlannerQueueDispatcherBackgroundService for leasing and processing planner queue messages.
- Developed ScannerReportClient for interacting with the scanner service.
- Enhanced observability with SchedulerWorkerMetrics for tracking planner and runner performance.
- Added comprehensive documentation for the new runner execution pipeline and observability metrics.
- Implemented event emission for rescan activity and scanner report readiness.
2025-10-27 18:57:35 +02:00


# SCHED-WORKER-16-205 — Scheduler Worker Observability

_Sprint 16 · Scheduler Worker Guild_

The scheduler worker now exposes first-class metrics covering planner latency,
runner throughput, and backlog health.

## Meter: `StellaOps.Scheduler.Worker`

| Metric | Type | Tags | Description |
| --- | --- | --- | --- |
| `scheduler_planner_runs_total` | Counter | `mode`, `status` | Planner outcomes (`enqueued`, `no_work`, `failed`). |
| `scheduler_planner_latency_seconds` | Histogram | `mode`, `status` | Time between run creation and planner completion. |
| `scheduler_runner_segments_total` | Counter | `mode`, `status` | Runner segments processed (`Completed`, `persist_failed`, `RunMissing`). |
| `scheduler_runner_images_total` | Counter | `mode`, `delta` | Images processed per mode, split by whether a delta was observed. |
| `scheduler_runner_delta_total` | Counter | `mode` | Total new findings observed. |
| `scheduler_runner_delta_critical_total` | Counter | `mode` | Critical findings observed. |
| `scheduler_runner_delta_high_total` | Counter | `mode` | High findings observed. |
| `scheduler_runner_delta_kev_total` | Counter | `mode` | KEV hits surfaced across runner segments. |
| `scheduler_run_duration_seconds` | Histogram | `mode`, `result` | End-to-end run durations (currently recorded for successful completions). |
| `scheduler_runs_active` | Up/down counter | `mode` | Active runs in-flight. |
| `scheduler_runner_backlog` | Observable gauge | `mode`, `scheduleId` | Remaining images awaiting runner processing per schedule. |
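
To make the tag model in the table concrete, the following is a minimal in-memory sketch of a tagged counter and histogram, written in Python purely for illustration (the worker itself targets .NET, and none of this is the actual `SchedulerWorkerMetrics` code). The class names, the helper `record_planner_outcome`, and the mode value `example-mode` are hypothetical; only the metric names, the `mode`/`status` tags, and the planner status values come from the table above.

```python
from collections import defaultdict


def _key(tags):
    # Tag order must not matter, so normalise tags to a sorted tuple.
    return tuple(sorted(tags.items()))


class TaggedCounter:
    """Hypothetical stand-in for a monotonic counter with tags."""

    def __init__(self, name):
        self.name = name
        self._values = defaultdict(int)

    def add(self, amount=1, **tags):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self._values[_key(tags)] += amount

    def value(self, **tags):
        return self._values.get(_key(tags), 0)


class TaggedHistogram:
    """Hypothetical stand-in for a histogram; it simply keeps raw samples."""

    def __init__(self, name):
        self.name = name
        self._samples = defaultdict(list)

    def record(self, value, **tags):
        self._samples[_key(tags)].append(value)

    def samples(self, **tags):
        return list(self._samples.get(_key(tags), []))


def record_planner_outcome(runs, latency, mode, status, created_at, completed_at):
    # One counter increment and one latency sample per planner outcome,
    # both tagged with the same (mode, status) pair.
    runs.add(1, mode=mode, status=status)
    latency.record(max(0.0, completed_at - created_at), mode=mode, status=status)


runs = TaggedCounter("scheduler_planner_runs_total")
latency = TaggedHistogram("scheduler_planner_latency_seconds")

# Timestamps are plain seconds; "example-mode" is a placeholder mode value.
record_planner_outcome(runs, latency, "example-mode", "enqueued", 100.0, 102.5)
record_planner_outcome(runs, latency, "example-mode", "no_work", 200.0, 200.0)

print(runs.value(mode="example-mode", status="enqueued"))
print(latency.samples(mode="example-mode", status="enqueued"))
```

Because each distinct tag combination is a separate series, dashboards can filter by `mode` or split planner outcomes by `status` without extra metrics.
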

## Instrumentation notes

- Planner records latency once a run transitions out of `Planning`. `no_work`
  completions emit zero-duration runs without incrementing the active counter.
- Runner updates backlog after every segment and decrements the active counter
  when a run reaches `Completed`.
- Delta counters aggregate per severity and KEV hit; they only increment when
  `DeltaSummary` reports meaningful changes.
- Metrics are emitted regardless of Notify availability so operators can track
  queue pressure even in air-gapped deployments.
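
The bookkeeping described in these notes can be sketched as a small Python model. This is an illustration under stated assumptions, not the worker's implementation: `WorkerMetrics`, `Delta`, `on_planner_finished`, and `on_segment_processed` are hypothetical names, "meaningful changes" is assumed here to mean "any non-zero count", and the backlog is assumed to shrink by the number of images in each processed segment.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Delta:
    """Hypothetical per-image summary; stands in for `DeltaSummary`."""
    new_findings: int = 0
    critical: int = 0
    high: int = 0
    kev: int = 0

    def has_changes(self):
        # Assumption: any non-zero count is a "meaningful change".
        return any((self.new_findings, self.critical, self.high, self.kev))


@dataclass
class WorkerMetrics:
    runs_active: Counter = field(default_factory=Counter)   # mode -> in-flight runs
    backlog: dict = field(default_factory=dict)             # (mode, scheduleId) -> images left
    deltas: Counter = field(default_factory=Counter)        # (mode, kind) -> findings
    run_durations: list = field(default_factory=list)       # (mode, seconds)


def on_planner_finished(m, mode, schedule_id, status, image_count=0):
    if status == "no_work":
        # Zero-duration run; the active counter is left untouched.
        m.run_durations.append((mode, 0.0))
    elif status == "enqueued":
        m.runs_active[mode] += 1
        m.backlog[(mode, schedule_id)] = image_count


def on_segment_processed(m, mode, schedule_id, image_deltas, run_completed):
    # Backlog is refreshed after every segment, never below zero.
    key = (mode, schedule_id)
    m.backlog[key] = max(0, m.backlog.get(key, 0) - len(image_deltas))
    for d in image_deltas:
        if not d.has_changes():
            continue  # unchanged images do not touch the delta counters
        m.deltas[(mode, "total")] += d.new_findings
        m.deltas[(mode, "critical")] += d.critical
        m.deltas[(mode, "high")] += d.high
        m.deltas[(mode, "kev")] += d.kev
    if run_completed:
        m.runs_active[mode] -= 1


m = WorkerMetrics()
on_planner_finished(m, "example-mode", "sched-1", "enqueued", image_count=3)
on_planner_finished(m, "example-mode", "sched-2", "no_work")
on_segment_processed(m, "example-mode", "sched-1",
                     [Delta(new_findings=2, critical=1), Delta()], run_completed=False)
on_segment_processed(m, "example-mode", "sched-1",
                     [Delta(new_findings=1, kev=1)], run_completed=True)

print(m.runs_active["example-mode"], m.backlog[("example-mode", "sched-1")])
```

In this model a `no_work` outcome never touches `runs_active`, so the up/down counter returns to zero once every enqueued run has completed.
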

## Dashboards & alerts

- **Grafana dashboard:** `docs/ops/scheduler-worker-grafana-dashboard.json`
  (import into Prometheus-backed Grafana). Panels mirror the metrics above with
  mode filters.
- **Prometheus rules:** `docs/ops/scheduler-worker-prometheus-rules.yaml`
  provides planner failure/latency, backlog, and stuck-run alerts.
- **Operations guide:** see `docs/ops/scheduler-worker-operations.md` for
  runbook steps, alert context, and dashboard wiring instructions.