Align AOC tasks for Excititor and Concelier
@@ -1,22 +1,22 @@
# Scheduler agent guide

## Mission
Scheduler detects advisory/VEX deltas, computes impact windows, and orchestrates re-evaluations across Scanner and Policy Engine.

## Key docs
- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)

## How to get started
1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.
2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
3. Read the architecture and README for domain context before editing code or docs.
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.

## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts (see the sketch after this list).
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.
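To make the determinism guardrail concrete, here is a minimal sketch in the module's .NET stack of sorting outputs and pinning timestamps before serialisation. The type and member names are illustrative assumptions, not code from this repository.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Illustrative sketch only: order collections with a culture-invariant
// comparer and pin timestamps to UTC ISO-8601 so repeated runs produce
// byte-identical output on any machine.
public static class DeterministicOutput
{
    public static string Serialize(IEnumerable<(string Digest, DateTimeOffset SeenAt)> findings)
    {
        var ordered = findings
            .OrderBy(f => f.Digest, StringComparer.Ordinal) // stable, locale-independent sort
            .Select(f => new
            {
                digest = f.Digest,
                // Normalise to UTC with a fixed ISO-8601 shape (no local offset).
                seenAt = f.SeenAt.ToUniversalTime().ToString("yyyy-MM-dd'T'HH:mm:ss'Z'")
            })
            .ToList();

        return JsonSerializer.Serialize(ordered);
    }
}
```

Ordinal comparison and an explicit format string avoid the two usual sources of drift: culture-sensitive sorting and machine-local time zones.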
@@ -1,37 +1,37 @@
# StellaOps Scheduler

Scheduler detects advisory/VEX deltas, computes impact windows, and orchestrates re-evaluations across Scanner and Policy Engine.

## Responsibilities
- Maintain impact cursors and queues for re-scan/re-evaluate jobs.
- Expose APIs for policy-triggered rechecks and runtime hooks.
- Emit DSSE-backed completion events for downstream consumers (UI, Notify); see the envelope sketch after this list.
- Provide SLA-aware retry logic with deterministic evaluation windows.
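A minimal sketch of what a DSSE-backed completion event can look like, following the in-toto DSSE envelope layout (pre-authentication encoding, then a base64 payload plus signatures). The payload type, key id, and HMAC signer are stand-in assumptions; the service's real signing path is not shown here.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

// Sketch of DSSE envelope construction. PAE(type, body) per the DSSE spec:
//   "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body
public static class CompletionEventEnvelope
{
    public static string Wrap(byte[] payload, string payloadType, byte[] signingKey)
    {
        var header = Encoding.UTF8.GetBytes(
            $"DSSEv1 {Encoding.UTF8.GetByteCount(payloadType)} {payloadType} {payload.Length} ");
        var pae = new byte[header.Length + payload.Length];
        header.CopyTo(pae, 0);
        payload.CopyTo(pae, header.Length);

        // HMAC-SHA256 stands in for the production signer in this sketch.
        using var hmac = new HMACSHA256(signingKey);
        var sig = hmac.ComputeHash(pae);

        return JsonSerializer.Serialize(new
        {
            payloadType,
            payload = Convert.ToBase64String(payload),
            signatures = new[]
            {
                new { keyid = "scheduler-demo", sig = Convert.ToBase64String(sig) }
            }
        });
    }
}
```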
## Key components
- `StellaOps.Scheduler.WebService` control plane.
- `StellaOps.Scheduler.Worker` job executor.
- Shared libraries under `StellaOps.Scheduler.*`.

## Integrations & dependencies
- MongoDB for impact models.
- Redis/NATS for queueing.
- Policy Engine, Scanner, Notify.

## Operational notes
- Monitoring assets: ./operations/worker-grafana-dashboard.json and ./operations/worker-prometheus-rules.yaml.
- Operational runbook: ./operations/worker.md.

## Related resources
- ./operations/worker.md
- ./operations/worker-grafana-dashboard.json
- ./operations/worker-prometheus-rules.yaml

## Backlog references
- SCHED-MODELS-20-001 (policy run DTOs) and related tasks in ../../TASKS.md.
- Scheduler observability follow-ups in src/Scheduler/**/TASKS.md.

## Epic alignment
- **Epic 2 – Policy Engine & Editor:** orchestrate incremental re-evaluation and simulation runs when raw facts or policies change.
- **Epic 6 – Vulnerability Explorer:** feed triage workflows with up-to-date job status, explain traces, and ledger hooks.
- **Epic 9 – Orchestrator Dashboard:** expose job telemetry, throttling, and replay controls through orchestration dashboards.
@@ -1,9 +1,9 @@
# Task board — Scheduler

> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.

| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| SCHEDULER-DOCS-0001 | TODO | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| SCHEDULER-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| SCHEDULER-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |
@@ -1,21 +1,21 @@
# Implementation plan — Scheduler

## Current objectives
- Maintain deterministic behaviour and offline parity across releases.
- Keep documentation, telemetry, and runbooks aligned with the latest sprint outcomes.

## Workstreams
- Backlog grooming: reconcile open stories in ../../TASKS.md with this module's roadmap.
- Implementation: collaborate with service owners to land feature work defined in SPRINTS/EPIC docs.
- Validation: extend tests/fixtures to preserve determinism and provenance requirements.

## Epic milestones
- **Epic 2 – Policy Engine & Editor:** deliver incremental policy run orchestration, change streams, and explain trace propagation.
- **Epic 6 – Vulnerability Explorer:** ensure findings updates and remediation triggers integrate with scheduler outputs.
- **Epic 9 – Orchestrator Dashboard:** provide job telemetry and control surfaces consumed by the orchestrator UI/CLI.
- Track additional work (SCHED-MODELS-20-001, observability follow-ups) in ../../TASKS.md and src/Scheduler/**/TASKS.md.

## Coordination
- Review ./AGENTS.md before picking up new work.
- Sync with cross-cutting teams noted in ../../implplan/SPRINTS.md.
- Update this plan whenever scope, dependencies, or guardrails change.
@@ -1,261 +1,261 @@
{
  "title": "Scheduler Worker – Planning & Rescan",
  "uid": "scheduler-worker-observability",
  "schemaVersion": 38,
  "version": 1,
  "editable": true,
  "timezone": "",
  "graphTooltip": 0,
  "time": {
    "from": "now-24h",
    "to": "now"
  },
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "hide": 0,
        "refresh": 1,
        "current": {}
      },
      {
        "name": "mode",
        "label": "Mode",
        "type": "query",
        "datasource": {
          "type": "prometheus",
          "uid": "${datasource}"
        },
        "query": "label_values(scheduler_planner_runs_total, mode)",
        "refresh": 1,
        "multi": true,
        "includeAll": true,
        "allValue": ".*",
        "current": {
          "selected": false,
          "text": "All",
          "value": ".*"
        }
      }
    ]
  },
  "annotations": {
    "list": []
  },
  "panels": [
    {
      "id": 1,
      "title": "Planner Runs per Status",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "ops",
          "displayName": "{{status}}"
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        }
      },
      "targets": [
        {
          "expr": "sum by (status) (rate(scheduler_planner_runs_total{mode=~\"$mode\"}[5m]))",
          "legendFormat": "{{status}}",
          "refId": "A"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      }
    },
    {
      "id": 2,
      "title": "Planner Latency P95 (s)",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "s"
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        }
      },
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket{mode=~\"$mode\"}[5m])))",
          "legendFormat": "p95",
          "refId": "A"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 0
      }
    },
    {
      "id": 3,
      "title": "Runner Segments per Status",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "ops",
          "displayName": "{{status}}"
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        }
      },
      "targets": [
        {
          "expr": "sum by (status) (rate(scheduler_runner_segments_total{mode=~\"$mode\"}[5m]))",
          "legendFormat": "{{status}}",
          "refId": "A"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 8
      }
    },
    {
      "id": 4,
      "title": "New Findings per Severity",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "ops",
          "displayName": "{{severity}}"
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        }
      },
      "targets": [
        {
          "expr": "sum(rate(scheduler_runner_delta_critical_total{mode=~\"$mode\"}[5m]))",
          "legendFormat": "critical",
          "refId": "A"
        },
        {
          "expr": "sum(rate(scheduler_runner_delta_high_total{mode=~\"$mode\"}[5m]))",
          "legendFormat": "high",
          "refId": "B"
        },
        {
          "expr": "sum(rate(scheduler_runner_delta_total{mode=~\"$mode\"}[5m]))",
          "legendFormat": "total",
          "refId": "C"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 8
      }
    },
    {
      "id": 5,
      "title": "Runner Backlog by Schedule",
      "type": "table",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "displayName": "{{scheduleId}}",
          "unit": "none"
        },
        "overrides": []
      },
      "options": {
        "showHeader": true
      },
      "targets": [
        {
          "expr": "max by (scheduleId) (scheduler_runner_backlog{mode=~\"$mode\"})",
          "format": "table",
          "refId": "A"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 16
      }
    },
    {
      "id": 6,
      "title": "Active Runs",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "none"
        },
        "overrides": []
      },
      "options": {
        "orientation": "horizontal",
        "textMode": "value"
      },
      "targets": [
        {
          "expr": "sum(scheduler_runs_active{mode=~\"$mode\"})",
          "refId": "A"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 16
      }
    }
  ]
}
@@ -1,42 +1,42 @@
groups:
  - name: scheduler-worker
    interval: 30s
    rules:
      - alert: SchedulerPlannerFailuresHigh
        expr: |
          sum(rate(scheduler_planner_runs_total{status="failed"}[5m]))
            /
          sum(rate(scheduler_planner_runs_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
          service: scheduler-worker
        annotations:
          summary: "Planner failure ratio above 5%"
          description: "More than 5% of planning runs are failing. Inspect scheduler logs and ImpactIndex connectivity before queues back up."
      - alert: SchedulerPlannerLatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket[5m]))) > 45
        for: 10m
        labels:
          severity: warning
          service: scheduler-worker
        annotations:
          summary: "Planner latency p95 above 45s"
          description: "Planning latency p95 stayed above 45 seconds for 10 minutes. Check ImpactIndex, Mongo, or external selectors to prevent missed SLAs."
      - alert: SchedulerRunnerBacklogGrowing
        expr: max_over_time(scheduler_runner_backlog[15m]) > 500
        for: 15m
        labels:
          severity: warning
          service: scheduler-worker
        annotations:
          summary: "Runner backlog above 500 images"
          description: "Runner backlog exceeded 500 images over the last 15 minutes. Verify runner workers, scanner availability, and rate limits."
      - alert: SchedulerRunStuck
        expr: sum(scheduler_runs_active) > 0 and max_over_time(scheduler_runs_active[30m]) == min_over_time(scheduler_runs_active[30m])
        for: 30m
        labels:
          severity: warning
          service: scheduler-worker
        annotations:
          summary: "Scheduler runs stuck without progress"
          description: "Active runs count has remained flat for 30 minutes. Investigate stuck segments or scanner timeouts."
@@ -1,82 +1,82 @@
# Scheduler Worker – Observability & Runbook

## Purpose
Monitor planner and runner health for the Scheduler Worker (Sprint 16 telemetry). The new .NET meters surface queue throughput, latency, backlog, and delta severities so operators can detect stalled runs before rescan SLAs slip.

> **Grafana note:** Import `docs/modules/scheduler/operations/worker-grafana-dashboard.json` into the Prometheus-backed Grafana stack that scrapes the OpenTelemetry Collector.

---

## Key metrics

| Metric | Use case | Suggested query |
| --- | --- | --- |
| `scheduler_planner_runs_total{status}` | Planner throughput & failure ratio | `sum by (status) (rate(scheduler_planner_runs_total[5m]))` |
| `scheduler_planner_latency_seconds_bucket` | Planning latency (p95 / p99) | `histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket[5m])))` |
| `scheduler_runner_segments_total{status}` | Runner success vs retries | `sum by (status) (rate(scheduler_runner_segments_total[5m]))` |
| `scheduler_runner_delta_{critical,high,total}` | Newly detected findings | `sum(rate(scheduler_runner_delta_critical_total[5m]))` |
| `scheduler_runner_backlog{scheduleId}` | Remaining digests awaiting the runner | `max by (scheduleId) (scheduler_runner_backlog)` |
| `scheduler_runs_active{mode}` | Active runs in flight | `sum(scheduler_runs_active)` |

These reference queries power the bundled Grafana dashboard panels. Use the `mode` template variable to focus on `analysisOnly` versus `contentRefresh` schedules. A sketch of the instruments behind these series follows below.
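The series in this table map naturally onto .NET `System.Diagnostics.Metrics` instruments. Here is a minimal sketch of how the worker could register and record them; the meter name, instrument names, and tag set are assumptions rather than the Scheduler's actual telemetry code.

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical instrument registration; the exporter pipeline is expected to
// surface these as the scheduler_planner_* series shown above.
public sealed class SchedulerWorkerMetrics
{
    private static readonly Meter Meter = new("StellaOps.Scheduler.Worker");

    private readonly Counter<long> _plannerRuns =
        Meter.CreateCounter<long>("scheduler_planner_runs_total");
    private readonly Histogram<double> _plannerLatency =
        Meter.CreateHistogram<double>("scheduler_planner_latency_seconds", unit: "s");

    public void RecordPlannerRun(string status, string mode, double latencySeconds)
    {
        var tags = new TagList { { "status", status }, { "mode", mode } };
        _plannerRuns.Add(1, tags);                    // -> scheduler_planner_runs_total{status,mode}
        _plannerLatency.Record(latencySeconds, tags); // -> ..._latency_seconds_bucket once exported
    }
}
```

Counters surface as `_total` series and histograms expose `_bucket`/`_sum`/`_count` once scraped through a Prometheus exporter.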
---

## Grafana dashboard

1. Import `docs/modules/scheduler/operations/worker-grafana-dashboard.json` (UID `scheduler-worker-observability`).
2. Point the `datasource` variable to the Prometheus instance scraping the collector. Optional: pin the `mode` variable to a specific schedule mode.
3. Panels included:
   - **Planner Runs per Status** – visualises success vs failure ratio.
   - **Planner Latency P95** – highlights degradations in ImpactIndex or Mongo lookups.
   - **Runner Segments per Status** – shows retry pressure and queue health.
   - **New Findings per Severity** – rolls up delta counters (critical/high/total).
   - **Runner Backlog by Schedule** – tabulates outstanding digests per schedule.
   - **Active Runs** – stat panel showing the current number of in-flight runs.

Capture screenshots once Grafana provisioning completes and store them under `docs/assets/dashboards/` (pending automation ticket OBS-157).

---

## Prometheus alerts

Import `docs/modules/scheduler/operations/worker-prometheus-rules.yaml` into your Prometheus rule configuration. The bundle defines:

- **SchedulerPlannerFailuresHigh** – 5%+ of planner runs failed for 10 minutes. Page SRE.
- **SchedulerPlannerLatencyHigh** – planner p95 latency remains above 45 s for 10 minutes. Investigate ImpactIndex, Mongo, and Feedser/Vexer event queues.
- **SchedulerRunnerBacklogGrowing** – backlog exceeded 500 images for 15 minutes. Inspect runner workers, Scanner availability, and rate limiting.
- **SchedulerRunStuck** – active run count stayed flat for 30 minutes while remaining non-zero. Check stuck segments, expired leases, and scanner retries.

Hook these alerts into the existing Observability notification pathway (`observability-pager` routing key) and ensure `service=scheduler-worker` is mapped to the on-call rotation.

---

## Runbook snapshot

1. **Planner failure/latency:**
   - Check planner logs for ImpactIndex or Mongo exceptions.
   - Verify Feedser/Vexer webhook health; requeue events if necessary.
   - If the planner is overwhelmed, temporarily reduce schedule parallelism via `stella scheduler schedule update`.
2. **Runner backlog spike:**
   - Confirm Scanner WebService health (`/healthz`).
   - Inspect the runner queue for stuck segments; consider increasing runner workers or scaling scanner capacity.
   - Review rate limits (schedule limits, ImpactIndex throughput) before changing global throttles.
3. **Stuck runs:**
   - Use `stella scheduler runs list --state running` to identify affected runs.
   - Drill into the Grafana panel “Runner Backlog by Schedule” to see the offending schedule IDs.
   - If a segment will not progress, use `stella scheduler segments release --segment <id>` to force a retry after resolving the root cause.
4. **Unexpected critical deltas:**
   - Correlate `scheduler_runner_delta_critical_total` spikes with Notify events (`scheduler.rescan.delta`).
   - Pivot to the Scanner report links for impacted digests and confirm they match upstream advisories/policies.

Document incidents and mitigations in `ops/runbooks/INCIDENT_LOG.md` (per SRE policy) and attach Grafana screenshots for post-mortems.

---

## Checklist

- [ ] Grafana dashboard imported and wired to the Prometheus datasource.
- [ ] Prometheus alert rules deployed (see above).
- [ ] Runbook linked from the on-call rotation portal.
- [ ] Observability Guild sign-off captured for Sprint 16 telemetry (OWNER: @obs-guild).