Scheduler Worker Observability & Runbook

Purpose

Monitor planner and runner health for the Scheduler Worker (Sprint16 telemetry). The new .NET meters surface queue throughput, latency, backlog, and delta severities so operators can detect stalled runs before rescan SLAs slip.

Grafana note: Import docs/modules/scheduler/operations/worker-grafana-dashboard.json into the Prometheus-backed Grafana stack that scrapes the OpenTelemetry Collector.


Key metrics

| Metric | Use case | Suggested query |
| --- | --- | --- |
| scheduler_planner_runs_total{status} | Planner throughput & failure ratio | sum by (status) (rate(scheduler_planner_runs_total[5m])) |
| scheduler_planner_latency_seconds_bucket | Planning latency (p95 / p99) | histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket[5m]))) |
| scheduler_runner_segments_total{status} | Runner success vs retries | sum by (status) (rate(scheduler_runner_segments_total[5m])) |
| scheduler_runner_delta_{critical,high,total} | Newly-detected findings | sum(rate(scheduler_runner_delta_critical_total[5m])) |
| scheduler_runner_backlog{scheduleId} | Remaining digests awaiting runner | max by (scheduleId) (scheduler_runner_backlog) |
| scheduler_runs_active{mode} | Active runs in-flight | sum(scheduler_runs_active) |

These reference queries power the bundled Grafana dashboard panels. Use the mode template variable to focus on analysisOnly versus contentRefresh schedules.
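
To make the planning-latency query easier to reason about, the sketch below reproduces in plain Python the linear interpolation that Prometheus documents for histogram_quantile over classic cumulative buckets. The bucket bounds and counts are invented sample data, not values taken from the Scheduler Worker, and edge cases such as NaN inputs are not handled.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Follows the linear interpolation described for Prometheus classic
    histograms. Sketch only: q is assumed to lie in (0, 1].
    """
    buckets = sorted(buckets)          # ascending by bound; +Inf sorts last
    total = buckets[-1][1]             # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lands in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical planner latency buckets (seconds, cumulative observations).
SAMPLE_BUCKETS = [(5.0, 20), (15.0, 60), (45.0, 90), (120.0, 99), (math.inf, 100)]

if __name__ == "__main__":
    print("p50:", histogram_quantile(0.50, SAMPLE_BUCKETS))
    print("p95:", histogram_quantile(0.95, SAMPLE_BUCKETS))
```

Because the estimate is interpolated inside a bucket, its precision depends on how finely the bucket bounds are spaced around the latency range of interest.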


Grafana dashboard

  1. Import docs/modules/scheduler/operations/worker-grafana-dashboard.json (UID scheduler-worker-observability).
  2. Point the datasource variable to the Prometheus instance scraping the collector. Optional: pin the mode variable to a specific schedule mode.
  3. Panels included:
    • Planner Runs per Status: visualises success vs failure ratio.
    • Planner Latency P95: highlights degradations in ImpactIndex or Mongo lookups.
    • Runner Segments per Status: shows retry pressure and queue health.
    • New Findings per Severity: rolls up delta counters (critical/high/total).
    • Runner Backlog by Schedule: tabulates outstanding digests per schedule.
    • Active Runs: stat panel showing the current number of in-flight runs.

Capture screenshots once Grafana provisioning completes and store them under docs/assets/dashboards/ (pending automation ticket OBS-157).


Prometheus alerts

Import docs/modules/scheduler/operations/worker-prometheus-rules.yaml into your Prometheus rule configuration. The bundle defines:

  • SchedulerPlannerFailuresHigh: 5% or more of planner runs failed for 10 minutes. Page SRE.
  • SchedulerPlannerLatencyHigh: planner p95 latency remains above 45s for 10 minutes. Investigate ImpactIndex, Mongo, and Feedser/Vexer event queues.
  • SchedulerRunnerBacklogGrowing: backlog exceeded 500 images for 15 minutes. Inspect runner workers, Scanner availability, and rate limiting.
  • SchedulerRunStuck: active run count stayed flat for 30 minutes while remaining non-zero. Check stuck segments, expired leases, and scanner retries.

Hook these alerts into the existing Observability notification pathway (observability-pager routing key) and ensure service=scheduler-worker is mapped to the on-call rotation.
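
The alert conditions described above can be sketched as small Python predicates, which may help when reasoning about why an alert did or did not fire. This is an illustration of the thresholds stated in this runbook, not a transcription of worker-prometheus-rules.yaml. The "failed" status label value and the shape of the sample lists (one reading per evaluation across the alert window) are assumptions.

```python
def failure_ratio(rates_by_status, failed_label="failed"):
    """Share of planner runs that failed, given per-status run rates.

    The "failed" label value is an assumption; check the real status values.
    """
    total = sum(rates_by_status.values())
    if total == 0:
        return 0.0
    return rates_by_status.get(failed_label, 0.0) / total


def planner_failures_high(ratio_samples, threshold=0.05):
    """True when every sample in the window is at or above the 5% threshold."""
    return bool(ratio_samples) and all(r >= threshold for r in ratio_samples)


def backlog_growing(backlog_samples, threshold=500):
    """True when the backlog stayed above 500 for the whole window."""
    return bool(backlog_samples) and all(b > threshold for b in backlog_samples)


def run_stuck(active_samples):
    """True when the active-run count stayed flat and non-zero across the window."""
    if not active_samples:
        return False
    first = active_samples[0]
    return first > 0 and all(s == first for s in active_samples)


if __name__ == "__main__":
    ratio = failure_ratio({"succeeded": 95.0, "failed": 5.0})
    print("planner failures high:", planner_failures_high([ratio, ratio]))
    print("backlog growing:", backlog_growing([520, 610, 700]))
    print("run stuck:", run_stuck([3, 3, 3, 3]))
```

Each predicate requires the condition to hold for every sample in the window, which mirrors how a sustained-duration alert only fires after the condition has persisted.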


Runbook snapshot

  1. Planner failure/latency:
    • Check Planner logs for ImpactIndex or Mongo exceptions.
    • Verify Feedser/Vexer webhook health; requeue events if necessary.
    • If planner is overwhelmed, temporarily reduce schedule parallelism via stella scheduler schedule update.
  2. Runner backlog spike:
    • Confirm Scanner WebService health (/healthz).
    • Inspect runner queue for stuck segments; consider increasing runner workers or scaling scanner capacity.
    • Review rate limits (schedule limits, ImpactIndex throughput) before changing global throttles.
  3. Stuck runs:
    • Use stella scheduler runs list --state running to identify affected runs.
    • Drill into Grafana panel “Runner Backlog by Schedule” to see offending schedule IDs.
    • If a segment will not progress, use stella scheduler segments release --segment <id> to force retry after resolving root cause.
  4. Unexpected critical deltas:
    • Correlate scheduler_runner_delta_critical_total spikes with Notify events (scheduler.rescan.delta).
    • Pivot to Scanner report links for impacted digests and confirm they match upstream advisories/policies.
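
When the Grafana panel is not to hand during a backlog or stuck-run investigation, the same backlog query can be sent to the Prometheus HTTP API and the result ranked by schedule. The sketch below only builds the instant-query URL and parses a response body in the documented vector format. The host name, the schedule IDs in the sample payload, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

BACKLOG_QUERY = "max by (scheduleId) (scheduler_runner_backlog)"


def instant_query_url(base_url, promql):
    """Build a Prometheus instant-query URL (/api/v1/query) for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def top_backlogs(payload, limit=5):
    """Return (scheduleId, backlog) pairs from an instant-query response, largest first."""
    data = json.loads(payload)
    if data.get("status") != "success":
        raise ValueError(f"query failed: {data.get('error', 'unknown error')}")
    rows = [
        (sample["metric"].get("scheduleId", "<unknown>"), float(sample["value"][1]))
        for sample in data["data"]["result"]
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


# Hypothetical response body; real schedule IDs and values will differ.
SAMPLE_PAYLOAD = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"scheduleId": "hourly"}, "value": [1761775779, "40"]},
            {"metric": {"scheduleId": "nightly"}, "value": [1761775779, "812"]},
        ],
    },
})

if __name__ == "__main__":
    # prometheus.example is a placeholder host, not a real endpoint.
    print(instant_query_url("http://prometheus.example:9090", BACKLOG_QUERY))
    for schedule_id, backlog in top_backlogs(SAMPLE_PAYLOAD):
        print(schedule_id, backlog)
```

Fetching the URL is left to whatever HTTP client and authentication the environment requires.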

Document incidents and mitigation in ops/runbooks/INCIDENT_LOG.md (per SRE policy) and attach Grafana screenshots for post-mortems.


Checklist

  • Grafana dashboard imported and wired to Prometheus datasource.
  • Prometheus alert rules deployed (see above).
  • Runbook linked from on-call rotation portal.
  • Observability Guild sign-off captured for Sprint16 telemetry (OWNER: @obs-guild).