SCHED-WORKER-16-205 — Scheduler Worker Observability

Sprint 16 · Scheduler Worker Guild

The scheduler worker now exposes first-class metrics covering planner latency, runner throughput, and backlog health.

Meter: StellaOps.Scheduler.Worker

| Metric | Type | Tags | Description |
| --- | --- | --- | --- |
| scheduler_planner_runs_total | Counter | mode, status | Planner outcomes (enqueued, no_work, failed). |
| scheduler_planner_latency_seconds | Histogram | mode, status | Time between run creation and planner completion. |
| scheduler_runner_segments_total | Counter | mode, status | Runner segments processed (Completed, persist_failed, RunMissing). |
| scheduler_runner_images_total | Counter | mode, delta | Images processed per mode, split by whether a delta was observed. |
| scheduler_runner_delta_total | Counter | mode | Total new findings observed. |
| scheduler_runner_delta_critical_total | Counter | mode | Critical findings observed. |
| scheduler_runner_delta_high_total | Counter | mode | High findings observed. |
| scheduler_runner_delta_kev_total | Counter | mode | KEV hits surfaced across runner segments. |
| scheduler_run_duration_seconds | Histogram | mode, result | End-to-end run durations (currently recorded for successful completions). |
| scheduler_runs_active | Up/down counter | mode | Active runs in-flight. |
| scheduler_runner_backlog | Observable gauge | mode, scheduleId | Remaining images awaiting runner processing per schedule. |
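
When wiring dashboards or scrapers against this meter, it can help to hold the catalogue as data and check tag sets against it. The following is a minimal sketch in Python: the metric names, types, and tag keys are transcribed from the table, while the `CATALOGUE` layout and the `check_tags` helper are hypothetical and not part of the worker.

```python
from __future__ import annotations

# Metric name -> (instrument type, expected tag keys), transcribed from the table.
# The dict layout itself is an illustrative assumption.
CATALOGUE = {
    "scheduler_planner_runs_total": ("counter", {"mode", "status"}),
    "scheduler_planner_latency_seconds": ("histogram", {"mode", "status"}),
    "scheduler_runner_segments_total": ("counter", {"mode", "status"}),
    "scheduler_runner_images_total": ("counter", {"mode", "delta"}),
    "scheduler_runner_delta_total": ("counter", {"mode"}),
    "scheduler_runner_delta_critical_total": ("counter", {"mode"}),
    "scheduler_runner_delta_high_total": ("counter", {"mode"}),
    "scheduler_runner_delta_kev_total": ("counter", {"mode"}),
    "scheduler_run_duration_seconds": ("histogram", {"mode", "result"}),
    "scheduler_runs_active": ("up_down_counter", {"mode"}),
    "scheduler_runner_backlog": ("observable_gauge", {"mode", "scheduleId"}),
}


def check_tags(metric: str, tags: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the tags match the table."""
    if metric not in CATALOGUE:
        return [f"unknown metric: {metric}"]
    _, expected = CATALOGUE[metric]
    problems = [f"missing tag: {t}" for t in sorted(expected - tags)]
    problems += [f"unexpected tag: {t}" for t in sorted(tags - expected)]
    return problems
```

Every instrument carries the mode tag, which is what lets a single mode filter apply across all panels.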

Instrumentation notes

  • Planner records latency once a run transitions out of Planning. no_work completions emit zero-duration runs without incrementing the active counter.
  • Runner updates backlog after every segment and decrements the active counter when a run reaches Completed.
  • Delta counters aggregate per severity and KEV hit; they only increment when DeltaSummary reports meaningful changes.
  • Metrics are emitted regardless of Notify availability so operators can track queue pressure even in air-gapped deployments.
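
The notes above can be modelled with a small in-memory simulation. This is a sketch in Python rather than the worker's actual code, and it rests on several assumptions that are labelled in the comments: the shape of `DeltaSummary`, the reading of "meaningful changes" as any non-zero count, the active counter being incremented when the planner enqueues work, and the placeholder mode and schedule identifiers.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DeltaSummary:
    # Hypothetical shape; the real DeltaSummary type is not shown in this document.
    new_findings: int = 0
    critical: int = 0
    high: int = 0
    kev: int = 0

    def has_changes(self) -> bool:
        # Assumption: "meaningful" means at least one non-zero count.
        return any((self.new_findings, self.critical, self.high, self.kev))


@dataclass
class WorkerMetricsModel:
    counters: dict = field(default_factory=lambda: defaultdict(int))
    active: dict = field(default_factory=lambda: defaultdict(int))
    backlog: dict = field(default_factory=dict)
    planner_latency: list = field(default_factory=list)
    run_duration: list = field(default_factory=list)

    def planner_completed(self, mode, status, latency_seconds):
        # Called once the run has transitioned out of Planning.
        self.counters[("scheduler_planner_runs_total", mode, status)] += 1
        self.planner_latency.append((mode, status, latency_seconds))
        if status == "enqueued":
            # Assumption: the active counter goes up when work is enqueued.
            self.active[mode] += 1
        elif status == "no_work":
            # Zero-duration run; the active counter is left untouched.
            self.run_duration.append((mode, 0.0))

    def segment_processed(self, mode, schedule_id, remaining, delta,
                          run_completed=False, run_seconds=0.0):
        self.counters[("scheduler_runner_segments_total", mode, "Completed")] += 1
        # Backlog is refreshed after every segment.
        self.backlog[(mode, schedule_id)] = remaining
        if delta.has_changes():
            self.counters[("scheduler_runner_delta_total", mode)] += delta.new_findings
            self.counters[("scheduler_runner_delta_critical_total", mode)] += delta.critical
            self.counters[("scheduler_runner_delta_high_total", mode)] += delta.high
            self.counters[("scheduler_runner_delta_kev_total", mode)] += delta.kev
        if run_completed:
            # The run reached Completed: release the active slot, record duration.
            self.active[mode] -= 1
            self.run_duration.append((mode, run_seconds))


# "mode-a" and "sched-1" are placeholder values, not real modes or schedules.
m = WorkerMetricsModel()
m.planner_completed("mode-a", "no_work", 0.2)
m.planner_completed("mode-a", "enqueued", 1.5)
m.segment_processed("mode-a", "sched-1", remaining=10, delta=DeltaSummary())
m.segment_processed("mode-a", "sched-1", remaining=0,
                    delta=DeltaSummary(new_findings=3, critical=1, kev=1),
                    run_completed=True, run_seconds=42.0)
```

In this scenario the first segment leaves the delta counters untouched because its summary is empty, and the active count for the mode returns to zero once the second segment completes the run.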

Dashboards & alerts

  • Grafana dashboard: docs/ops/scheduler-worker-grafana-dashboard.json (import it into a Grafana instance backed by Prometheus). Panels mirror the metrics above, each with a mode filter.
  • Prometheus rules: docs/ops/scheduler-worker-prometheus-rules.yaml provides planner failure/latency, backlog, and stuck-run alerts.
  • Operations guide: see docs/ops/scheduler-worker-operations.md for runbook steps, alert context, and dashboard wiring instructions.
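
To illustrate the kind of condition a planner-failure or backlog alert evaluates, here is a simplified Python sketch. It does not reproduce the contents of the rules file: the thresholds, the alert labels, and the `evaluate` helper are hypothetical, and it works on raw totals, whereas real Prometheus rules would typically operate on rates over a time window.

```python
def planner_failure_ratio(runs_by_status: dict) -> float:
    """Share of planner outcomes with status "failed" (0.0 when there is no data)."""
    total = sum(runs_by_status.values())
    if total == 0:
        return 0.0
    return runs_by_status.get("failed", 0) / total


def evaluate(runs_by_status: dict, backlog_by_schedule: dict, *,
             max_failure_ratio: float = 0.05, max_backlog: int = 1000) -> list:
    # Both thresholds are placeholder values chosen for illustration only.
    alerts = []
    if planner_failure_ratio(runs_by_status) > max_failure_ratio:
        alerts.append("planner_failure_ratio")
    for schedule_id, remaining in sorted(backlog_by_schedule.items()):
        if remaining > max_backlog:
            alerts.append(f"backlog:{schedule_id}")
    return alerts
```

Here `runs_by_status` stands for scheduler_planner_runs_total summed per status tag, and `backlog_by_schedule` for the latest scheduler_runner_backlog value per scheduleId.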