# SCHED-WORKER-16-205 — Scheduler Worker Observability

_Sprint 16 · Scheduler Worker Guild_

The scheduler worker now exposes first-class metrics covering planner latency,
runner throughput, and backlog health.

## Meter: `StellaOps.Scheduler.Worker`

| Metric | Type | Tags | Description |
| --- | --- | --- | --- |
| `scheduler_planner_runs_total` | Counter | `mode`, `status` | Planner outcomes (`enqueued`, `no_work`, `failed`). |
| `scheduler_planner_latency_seconds` | Histogram | `mode`, `status` | Time between run creation and planner completion. |
| `scheduler_runner_segments_total` | Counter | `mode`, `status` | Runner segments processed (`Completed`, `persist_failed`, `RunMissing`). |
| `scheduler_runner_images_total` | Counter | `mode`, `delta` | Images processed per mode, split by whether a delta was observed. |
| `scheduler_runner_delta_total` | Counter | `mode` | Total new findings observed. |
| `scheduler_runner_delta_critical_total` | Counter | `mode` | Critical findings observed. |
| `scheduler_runner_delta_high_total` | Counter | `mode` | High findings observed. |
| `scheduler_runner_delta_kev_total` | Counter | `mode` | KEV hits surfaced across runner segments. |
| `scheduler_run_duration_seconds` | Histogram | `mode`, `result` | End-to-end run durations (currently recorded for successful completions). |
| `scheduler_runs_active` | Up/down counter | `mode` | Active runs in-flight. |
| `scheduler_runner_backlog` | Observable gauge | `mode`, `scheduleId` | Remaining images awaiting runner processing per schedule. |
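
As an illustration of how the `mode` and `status` tags partition a counter, the sketch below derives a per-mode planner failure ratio from `scheduler_planner_runs_total` samples. It is a minimal Python model rather than the worker's own code: the function name, the sample layout, and the mode names are assumptions made for the example, and only the status values come from the table above.

```python
from collections import defaultdict


def planner_failure_ratio(samples):
    """Return {mode: failed / total} for scheduler_planner_runs_total samples.

    `samples` maps (mode, status) -> counter value. This layout is a
    hypothetical stand-in for whatever the metrics backend returns.
    """
    totals = defaultdict(float)
    failed = defaultdict(float)
    for (mode, status), value in samples.items():
        totals[mode] += value
        if status == "failed":
            failed[mode] += value
    # Modes with no recorded planner runs are skipped to avoid dividing by zero.
    return {mode: failed[mode] / total for mode, total in totals.items() if total > 0}


if __name__ == "__main__":
    # Illustrative numbers only; "mode-a" and "mode-b" are made-up mode names.
    example = {
        ("mode-a", "enqueued"): 8,
        ("mode-a", "no_work"): 1,
        ("mode-a", "failed"): 1,
        ("mode-b", "enqueued"): 4,
    }
    print(planner_failure_ratio(example))
```

The same grouping applies to any counter in the table that carries a `status` tag, such as `scheduler_runner_segments_total`.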

## Instrumentation notes

- Planner records latency once a run transitions out of `Planning`. `no_work`
  completions emit zero-duration runs without incrementing the active counter.
- Runner updates backlog after every segment and decrements the active counter
  when a run reaches `Completed`.
- Delta counters aggregate per severity and KEV hit; they only increment when
  `DeltaSummary` reports meaningful changes.
- Metrics are emitted regardless of Notify availability so operators can track
  queue pressure even in air-gapped deployments.
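
The bookkeeping rules in the notes above can be sketched as a small in-memory model. This is a hedged illustration in Python, not the worker's implementation: the class and method names are hypothetical, the point at which the active counter is incremented (on `enqueued`) is an assumption, and "meaningful changes" is assumed to mean at least one non-zero count in the delta summary.

```python
from collections import Counter


class WorkerMetricsModel:
    """Toy model of the active-run, backlog, and delta bookkeeping."""

    def __init__(self):
        self.runs_active = Counter()   # mode -> runs in flight
        self.backlog = {}              # (mode, schedule_id) -> remaining images
        self.run_durations = []        # (mode, seconds)
        self.delta = Counter()         # (mode, kind) -> findings observed

    def planner_finished(self, mode, status):
        if status == "enqueued":
            # Assumption: a run counts as active once the planner enqueues work.
            self.runs_active[mode] += 1
        elif status == "no_work":
            # no_work: record a zero-duration run, leave the active counter alone.
            self.run_durations.append((mode, 0.0))

    def segment_processed(self, mode, schedule_id, remaining, delta_summary=None):
        # The backlog is refreshed after every segment.
        self.backlog[(mode, schedule_id)] = remaining
        # Assumption: a summary is "meaningful" if any count is non-zero.
        if delta_summary and any(delta_summary.values()):
            for kind, count in delta_summary.items():
                self.delta[(mode, kind)] += count

    def run_completed(self, mode, seconds):
        # Reaching Completed decrements the active counter.
        self.runs_active[mode] -= 1
        self.run_durations.append((mode, seconds))


if __name__ == "__main__":
    model = WorkerMetricsModel()
    model.planner_finished("mode-a", "enqueued")
    model.segment_processed("mode-a", "sched-1", 3, {"critical": 1, "high": 0, "kev": 0})
    model.segment_processed("mode-a", "sched-1", 0)
    model.run_completed("mode-a", 12.5)
    print(model.runs_active["mode-a"], model.backlog, dict(model.delta))
```

Keeping the backlog keyed by both mode and schedule mirrors the `mode` and `scheduleId` tags on `scheduler_runner_backlog`.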

## Dashboards & alerts

- **Grafana dashboard:** `docs/modules/scheduler/operations/worker-grafana-dashboard.json`
  (import into Prometheus-backed Grafana). Panels mirror the metrics above with
  mode filters.
- **Prometheus rules:** `docs/modules/scheduler/operations/worker-prometheus-rules.yaml`
  provides planner failure/latency, backlog, and stuck-run alerts.
- **Operations guide:** see `docs/modules/scheduler/operations/worker.md` for
  runbook steps, alert context, and dashboard wiring instructions.