consolidation of some of the modules, localization fixes, product advisories work, qa work
docs-archived/modules/scheduler/AGENTS.md (new file, 39 lines)
@@ -0,0 +1,39 @@
# Scheduler agent guide

## Mission

Scheduler detects advisory/VEX deltas, computes impact windows, and orchestrates re-evaluations across Scanner and Policy Engine. Docs in this directory are the front-door contract for contributors.

## Working directory

- `docs/modules/scheduler` (docs-only); code changes live under `src/Scheduler/**` but must be coordinated via sprint plans.

## Roles & owners

- **Docs author**: curates AGENTS/TASKS/runbooks; keeps determinism/offline guidance accurate.
- **Scheduler engineer (Worker/WebService)**: aligns implementation notes with architecture and ensures observability/runbook updates land with code.
- **Observability/Ops**: maintains dashboards/rules, documents operational SLOs and alert contracts.

## Required Reading

- `docs/modules/scheduler/README.md`
- `docs/modules/scheduler/architecture.md`
- `docs/modules/scheduler/implementation_plan.md`
- `docs/modules/platform/architecture-overview.md`

## How to work

1. Open the relevant sprint file in `docs/implplan/SPRINT_*.md` and set the task status to `DOING` there and in `docs/modules/scheduler/TASKS.md` before starting.
2. Confirm the prerequisites above are read; note any missing contracts in the sprint **Decisions & Risks**.
3. Keep outputs deterministic (stable ordering, UTC ISO-8601 timestamps, sorted lists) and offline-friendly (no external fetches without mirrors).
4. When changing behavior, update runbooks and observability assets in `./operations/`.
5. On completion, set status to `DONE` in both the sprint file and `TASKS.md`; if paused, revert to `TODO` and add a brief note.
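
Step 3 can be sketched in a few lines (names here are illustrative, not from the codebase): sort collections with an ordinal comparer before serializing, and pin timestamps to UTC ISO-8601.

```csharp
// Illustrative only: deterministic output per step 3.
using System;
using System.Linq;

var findings = new[] { "CVE-2025-0002", "CVE-2025-0001" };

// Stable ordering: sort with an ordinal comparer, never rely on insertion order.
var sorted = findings.OrderBy(f => f, StringComparer.Ordinal).ToArray();

// UTC ISO-8601 timestamp; the round-trip "o" format always renders the Z suffix for UTC.
var generatedAt = new DateTime(2025, 10, 18, 3, 12, 45, DateTimeKind.Utc).ToString("o");

Console.WriteLine(string.Join(",", sorted) + " @ " + generatedAt);
```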

## Guardrails

- Honour the Aggregation-Only Contract where applicable (see `../../aoc/aggregation-only-contract.md`).
- No undocumented schema or API contract changes; document deltas in architecture or implementation_plan.
- Keep Offline Kit parity; document air-gapped workflows for any new feature.
- Prefer deterministic fixtures and avoid machine-specific artefacts in examples.

## Testing & determinism expectations

- Examples and snippets should be reproducible; pin sample timestamps to UTC and sort collections.
- Observability examples must align with published metric names and labels; update `operations/worker-prometheus-rules.yaml` if alert semantics change.

## Status mirrors

- Sprint tracker: `/docs/implplan/SPRINT_*.md` (source of record for Delivery Tracker).
- Local tracker: `docs/modules/scheduler/TASKS.md` (mirrors sprint status; keep in sync).

docs-archived/modules/scheduler/README.md (new file, 72 lines)
@@ -0,0 +1,72 @@
# StellaOps Scheduler

Scheduler detects advisory/VEX deltas, computes impact windows, and orchestrates re-evaluations across Scanner and Policy Engine.

## Responsibilities

- Maintain impact cursors and queues for re-scan/re-evaluate jobs.
- Expose APIs for policy-triggered rechecks and runtime hooks.
- Emit DSSE-backed completion events for downstream consumers (UI, Notify).
- Provide SLA-aware retry logic with deterministic evaluation windows.
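
For orientation, a DSSE envelope has the following shape. The fields are the standard DSSE envelope fields; the payload type and contents shown are assumptions for illustration, not the Scheduler's published event schema.

```json
{
  "payloadType": "application/vnd.in-toto+json",
  "payload": "<base64-encoded completion event>",
  "signatures": [
    { "keyid": "scheduler-signing-key", "sig": "<base64 signature>" }
  ]
}
```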

## Key components

- `StellaOps.Scheduler.WebService` control plane.
- `StellaOps.Scheduler.Worker` job executor.
- Shared libraries under `StellaOps.Scheduler.*`.

## Integrations & dependencies

- PostgreSQL (schema `scheduler`) for impact models.
- Valkey/NATS for queueing.
- Policy Engine, Scanner, Notify.

## Operational notes

- Monitoring assets: `./operations/worker-grafana-dashboard.json` and `./operations/worker-prometheus-rules.yaml`.
- Operational runbook: `./operations/worker.md`.

## Related resources

- `./operations/worker.md`
- `./operations/worker-grafana-dashboard.json`
- `./operations/worker-prometheus-rules.yaml`

## Backlog references

- SCHED-MODELS-20-001 (policy run DTOs) and related tasks in `../../TASKS.md`.
- Scheduler observability follow-ups in `src/Scheduler/**/TASKS.md`.

## Implementation Status

### Current Objectives

- Maintain deterministic behaviour and offline parity across releases.
- Keep documentation, telemetry, and runbooks aligned with the latest sprint outcomes.
- Coordinate with Policy Engine on incremental re-evaluation workflows.

### Epic Milestones

- Epic 2 – Policy Engine & Editor: incremental policy run orchestration, change streams, explain trace propagation (in progress).
- Epic 6 – Vulnerability Explorer: findings updates and remediation triggers integration (in progress).
- Epic 9 – Orchestrator Dashboard: job telemetry and control surfaces for UI/CLI (planned).

### Core Capabilities

- Impact cursor maintenance and queue management for re-scan/re-evaluate jobs.
- Change-stream detection for advisory/VEX/SBOM deltas.
- Policy-triggered recheck orchestration with runtime hooks.
- SLA-aware retry logic with deterministic evaluation windows.
- DSSE-backed completion events for downstream consumers.

### Integration Points

- PostgreSQL schema (`scheduler`) for impact models and job state.
- Valkey/NATS for queueing with idempotency.
- Policy Engine, Scanner, Notify for job coordination.
- Orchestrator for backfills and incident routing.

### Operational Assets

- Monitoring: `worker-grafana-dashboard.json`, `worker-prometheus-rules.yaml`.
- Runbooks: `operations/worker.md`.
- Observability: metrics, traces, structured logs with correlation IDs.

### Technical Notes

- Coordination approach: review `AGENTS.md`; sync via `docs/implplan/SPRINT_*.md`.
- Backlog tracking: SCHED-MODELS-20-001 and related tasks in `../../TASKS.md`.
- Module tasks: `src/Scheduler/**/TASKS.md`.

## Epic alignment

- **Epic 2 – Policy Engine & Editor:** orchestrate incremental re-evaluation and simulation runs when raw facts or policies change.
- **Epic 6 – Vulnerability Explorer:** feed triage workflows with up-to-date job status, explain traces, and ledger hooks.
- **Epic 9 – Orchestrator Dashboard:** expose job telemetry, throttling, and replay controls through orchestration dashboards.

docs-archived/modules/scheduler/TASKS.md (new file, 9 lines)
@@ -0,0 +1,9 @@
# Scheduler Module Task Board

This board mirrors the active Scheduler sprint(s). Update it alongside the sprint tracker.

Source of truth: `docs/implplan/SPRINT_*.md`.

| Task ID | Status | Notes |
| --- | --- | --- |
| TBD | TODO | Populate from the active sprint. |

docs-archived/modules/scheduler/architecture.md (new file, 434 lines)
@@ -0,0 +1,434 @@
# component_architecture_scheduler.md — **Stella Ops Scheduler** (2025Q4)

> Synthesises the scheduling requirements documented across the Policy, Vulnerability Explorer, and Orchestrator module guides and implementation plans.

> **Scope.** Implementation‑ready architecture for **Scheduler**: a service that (1) **re‑evaluates** already‑cataloged images when intel changes (Conselier/Excitor/policy), (2) orchestrates **nightly** and **ad‑hoc** runs, (3) targets only the **impacted** images using the BOM‑Index, and (4) emits **report‑ready** events that downstream **Notify** fans out. Default mode is **analysis‑only** (no image pull); optional **content‑refresh** can be enabled per schedule.

---

## 0) Mission & boundaries

**Mission.** Keep scan results **current** without rescanning the world. When new advisories or VEX claims land, **pinpoint** affected images and ask the backend to recompute **verdicts** against the **existing SBOMs**. Surface only **meaningful deltas** to humans and ticket queues.

**Boundaries.**

* Scheduler **does not** compute SBOMs and **does not** sign. It calls Scanner.WebService's **/reports (analysis‑only)** endpoint and lets the backend (Policy + Excitor + Conselier) decide PASS/FAIL.
* Scheduler **may** ask Scanner to **content‑refresh** selected targets (e.g., mutable tags), but the default is **no** image pull.
* Notifications are **not** sent directly; Scheduler emits events consumed by **Notify**.

---

## 1) Runtime shape & projects

```
src/
├─ StellaOps.Scheduler.WebService/        # REST (schedules CRUD, runs, admin)
├─ StellaOps.Scheduler.Worker/            # planners + runners (N replicas)
├─ StellaOps.Scheduler.ImpactIndex/       # purl→images inverted index (roaring bitmaps)
├─ StellaOps.Scheduler.Models/            # DTOs (Schedule, Run, ImpactSet, Deltas)
├─ StellaOps.Scheduler.Storage.Postgres/  # schedules, runs, cursors, locks
├─ StellaOps.Scheduler.Queue/             # Valkey Streams / NATS abstraction
├─ StellaOps.Scheduler.Tests.*            # unit/integration/e2e
```

**Deployables**:

* **Scheduler.WebService** (stateless)
* **Scheduler.Worker** (scale‑out; planners + executors)

**Dependencies**: Authority (OpTok + DPoP/mTLS), Scanner.WebService, Conselier, Excitor, PostgreSQL, Valkey/NATS, (optional) Notify.

---

## 2) Core responsibilities

1. **Time‑based** runs: cron windows per tenant/timezone (e.g., "02:00 Europe/Sofia").
2. **Event‑driven** runs: react to **Conselier export** and **Excitor export** deltas (changed product keys / advisories / claims).
3. **Impact targeting**: map changes to **image sets** using a **global inverted index** built from Scanner's per‑image **BOM‑Index** sidecars.
4. **Run planning**: shard, pace, and rate‑limit jobs to avoid thundering herds.
5. **Execution**: call Scanner **/reports (analysis‑only)** or **/scans (content‑refresh)**; aggregate **delta** results.
6. **Events**: publish `rescan.delta` and `report.ready` summaries for **Notify** & **UI**.
7. **Control plane**: CRUD schedules, **pause/resume**, dry‑run previews, audit.

---

## 3) Data model (PostgreSQL)

**Database**: `scheduler`

* `schedules`

```
{ _id, tenantId, name, enabled, whenCron, timezone,
  mode: "analysis-only" | "content-refresh",
  selection: { scope: "all-images" | "by-namespace" | "by-repo" | "by-digest" | "by-labels",
               includeTags?: ["prod-*"], digests?: [sha256...], resolvesTags?: bool },
  onlyIf: { lastReportOlderThanDays?: int, policyRevision?: string },
  notify: { onNewFindings: bool, minSeverity: "low|medium|high|critical", includeKEV: bool },
  limits: { maxJobs?: int, ratePerSecond?: int, parallelism?: int },
  createdAt, updatedAt, createdBy, updatedBy }
```

* `runs`

```
{ _id, scheduleId?, tenantId, trigger: "cron|conselier|excitor|manual",
  reason?: { conselierExportId?, excitorExportId?, cursor? },
  state: "planning|queued|running|completed|error|cancelled",
  stats: { candidates: int, deduped: int, queued: int, completed: int, deltas: int, newCriticals: int },
  startedAt, finishedAt, error? }
```

* `impact_cursors`

```
{ _id: tenantId, conselierLastExportId, excitorLastExportId, updatedAt }
```

* `locks` (singleton schedulers, run leases)
* `audit` (CRUD actions, run outcomes)

**Indexes**:

* `schedules` on `{tenantId, enabled}`, `{whenCron}`.
* `runs` on `{tenantId, startedAt desc}`, `{state}`.
* TTL optional for completed runs (e.g., 180 days).

---

## 4) ImpactIndex (global inverted index)

Goal: translate **change keys** → **image sets** in **milliseconds**.

**Source**: Scanner produces per‑image **BOM‑Index** sidecars (purls, and `usedByEntrypoint` bitmaps). Scheduler ingests/refreshes them to build a **global** index.

**Representation**:

* Assign **image IDs** (dense ints) to catalog images.
* Keep **Roaring Bitmaps**:

  * `Contains[purl] → bitmap(imageIds)`
  * `UsedBy[purl] → bitmap(imageIds)` (subset of Contains)
* Optionally keep **Owner maps**: `{imageId → {tenantId, namespaces[], repos[]}}` for selection filters.
* Persist in RocksDB/LMDB or Valkey modules; cache hot shards in memory; snapshot to PostgreSQL for cold start.

**Update paths**:

* On a new/updated image SBOM: **merge** the per‑image set into the global maps.
* On image removal/expiry: **clear** the id from bitmaps.

**API (internal)**:

```csharp
IImpactIndex {
  ImpactSet ResolveByPurls(IEnumerable<string> purls, bool usageOnly, Selector sel);
  ImpactSet ResolveByVulns(IEnumerable<string> vulnIds, bool usageOnly, Selector sel); // optional (vuln->purl precomputed by Conselier)
  ImpactSet ResolveAll(Selector sel); // for nightly
}
```

**Selector filters**: tenant, namespaces, repos, labels, digest allowlists, `includeTags` patterns.
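
The resolve path can be sketched in memory as follows. This is a hypothetical illustration, not the shipped `StellaOps.Scheduler.ImpactIndex` code: `HashSet<int>` stands in for roaring bitmaps, and selector filtering is elided.

```csharp
// Hypothetical in-memory sketch of the purl→image resolve path.
// HashSet<int> stands in for roaring bitmaps; real shards live in RocksDB/Valkey.
using System.Collections.Generic;

sealed class InMemoryImpactIndex
{
    readonly Dictionary<string, HashSet<int>> _contains = new();
    readonly Dictionary<string, HashSet<int>> _usedBy = new();

    static void Add(Dictionary<string, HashSet<int>> map, string purl, int imageId)
    {
        if (!map.TryGetValue(purl, out var set))
            map[purl] = set = new HashSet<int>();
        set.Add(imageId);
    }

    // Merge one image's BOM-Index purls into the global maps.
    public void MergeImage(int imageId, IEnumerable<string> purls, IEnumerable<string> usedByEntrypoint)
    {
        foreach (var p in purls) Add(_contains, p, imageId);
        foreach (var p in usedByEntrypoint) Add(_usedBy, p, imageId);
    }

    // Union of image ids across all changed keys (selector filtering elided).
    public HashSet<int> ResolveByPurls(IEnumerable<string> purls, bool usageOnly)
    {
        var result = new HashSet<int>();
        var source = usageOnly ? _usedBy : _contains;
        foreach (var p in purls)
            if (source.TryGetValue(p, out var ids))
                result.UnionWith(ids);
        return result;
    }
}
```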

---

## 5) External interfaces (REST)

Base path: `/api/v1/scheduler` (Authority OpToks; scopes: `scheduler.read`, `scheduler.admin`).

### 5.1 Schedules CRUD

* `POST /schedules` → create
* `GET /schedules` → list (filter by tenant)
* `GET /schedules/{id}` → details + next run
* `PATCH /schedules/{id}` → pause/resume/update
* `DELETE /schedules/{id}` → delete (soft delete, optional)
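
For illustration, a `POST /schedules` body following the `schedules` document shape from §3 (the field values are examples, not defaults):

```json
{
  "name": "nightly-prod-rescan",
  "enabled": true,
  "whenCron": "0 2 * * *",
  "timezone": "Europe/Sofia",
  "mode": "analysis-only",
  "selection": { "scope": "by-repo", "includeTags": ["prod-*"] },
  "onlyIf": { "lastReportOlderThanDays": 7 },
  "notify": { "onNewFindings": true, "minSeverity": "high", "includeKEV": true },
  "limits": { "maxJobs": 10000, "ratePerSecond": 25, "parallelism": 4 }
}
```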

### 5.2 Run control & introspection

* `POST /run` — ad‑hoc run

```json
{ "mode": "analysis-only|content-refresh", "selection": {...}, "reason": "manual" }
```

* `GET /runs` — list with paging
* `GET /runs/{id}` — status, stats, links to deltas
* `POST /runs/{id}/cancel` — best‑effort cancel

### 5.3 Previews (dry‑run)

* `POST /preview/impact` — returns a **candidate count** and a small sample of impacted digests for the given change keys or selection.

### 5.4 Event webhooks (optional push from Conselier/Excitor)

* `POST /events/conselier-export`

```json
{ "exportId":"...", "changedProductKeys":["pkg:rpm/openssl", ...], "kev": ["CVE-..."], "window": { "from":"...","to":"..." } }
```

* `POST /events/excitor-export`

```json
{ "exportId":"...", "changedClaims":[ { "productKey":"pkg:deb/...", "vulnId":"CVE-...", "status":"not_affected→affected"} ], ... }
```

**Security**: webhooks require **mTLS** or a signed `X-Scheduler-Signature` header (Ed25519 or HMAC‑SHA‑256), plus an Authority token.

---

## 6) Planner → Runner pipeline

### 6.1 Planning algorithm (event‑driven)

```
On Export Event (Conselier/Excitor):
  keys      = Normalize(change payload)   # productKeys or vulnIds→productKeys
  usageOnly = schedule/policy hint?       # default true
  sel       = Selector for tenant/scope from schedules subscribed to events

  impacted  = ImpactIndex.ResolveByPurls(keys, usageOnly, sel)
  impacted  = ApplyOwnerFilters(impacted, sel)    # namespaces/repos/labels
  impacted  = DeduplicateByDigest(impacted)
  impacted  = EnforceLimits(impacted, limits.maxJobs)
  shards    = Shard(impacted, byHashPrefix, n=limits.parallelism)

  For each shard:
    Enqueue RunSegment (runId, shard, rate=limits.ratePerSecond)
```

**Fairness & pacing**

* Use a **leaky bucket** per tenant and per registry host.
* Prioritize **KEV‑tagged** and **critical** work first when oversubscribed.
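
Per-tenant pacing can be sketched with `System.Threading.RateLimiting` (the library noted in §14). The class and option values below are illustrative, not the shipped implementation; a token bucket approximates the leaky-bucket behaviour described above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.RateLimiting;
using System.Threading.Tasks;

// Illustrative per-tenant pacing sketch using System.Threading.RateLimiting.
sealed class TenantPacer
{
    readonly ConcurrentDictionary<string, TokenBucketRateLimiter> _buckets = new();

    TokenBucketRateLimiter GetBucket(string tenantId, int ratePerSecond) =>
        _buckets.GetOrAdd(tenantId, _ => new TokenBucketRateLimiter(
            new TokenBucketRateLimiterOptions
            {
                TokenLimit = ratePerSecond,          // burst size
                TokensPerPeriod = ratePerSecond,     // refill rate
                ReplenishmentPeriod = TimeSpan.FromSeconds(1),
                QueueLimit = int.MaxValue,           // queue rather than reject
                QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
            }));

    public async Task PaceAsync(string tenantId, int ratePerSecond, CancellationToken ct)
    {
        // Waits until a token is available for this tenant; the caller may
        // then issue one /reports call.
        using var lease = await GetBucket(tenantId, ratePerSecond).AcquireAsync(1, ct);
    }
}
```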

### 6.2 Nightly planning

```
At cron tick:
  sel        = resolve selection
  candidates = ImpactIndex.ResolveAll(sel)
  if lastReportOlderThanDays present → filter by report age (via Scanner catalog)
  shard & enqueue as above
```

### 6.3 Execution (Runner)

* Pop a **RunSegment** job → for each image digest:

  * **analysis‑only**: `POST scanner/reports { imageDigest, policyRevision? }`
  * **content‑refresh**: resolve tag→digest if needed; `POST scanner/scans { imageRef, attest? false }` then `POST /reports`
* Collect the **delta**: `newFindings`, `newCriticals`/`highs`, `links` (UI deep link, Rekor if present).
* Persist the per‑image outcome in `runs.{id}.stats` (incremental counters).
* Emit `scheduler.rescan.delta` events to **Notify** only when **delta > 0** and the severity rule matches.

---

## 7) Event model (outbound)

**Topic**: `rescan.delta` (internal bus → Notify; UI subscribes via backend).

```json
{
  "tenant": "tenant-01",
  "runId": "324af…",
  "imageDigest": "sha256:…",
  "newCriticals": 1,
  "newHigh": 2,
  "kevHits": ["CVE-2025-..."],
  "topFindings": [
    { "purl":"pkg:rpm/openssl@3.0.12-...","vulnId":"CVE-2025-...","severity":"critical","link":"https://ui/scans/..." }
  ],
  "reportUrl": "https://ui/.../scans/sha256:.../report",
  "attestation": { "uuid":"rekor-uuid", "verified": true },
  "ts": "2025-10-18T03:12:45Z"
}
```

**Also**: `report.ready` for "no‑change" summaries (digest + zero delta), which Notify can ignore by rule.

---

## 8) Security posture

* **AuthN/Z**: Authority OpToks with `aud=scheduler`; DPoP (preferred) or mTLS.
* **Multi‑tenant**: every schedule, run, and event carries `tenantId`; ImpactIndex filters by tenant‑visible images.
* **Webhook** callers (Conselier/Excitor) present **mTLS** or **HMAC** plus an Authority token.
* **Input hardening**: size caps on changed‑key lists; reject >100k keys per event; compression (zstd/gzip) allowed with limits.
* **No secrets** in logs; redact tokens and signatures.

---

## 9) Observability & SLOs

**Metrics (Prometheus)**

* `scheduler.events_total{source, result}`
* `scheduler.impact_resolve_seconds{quantile}`
* `scheduler.images_selected_total{mode}`
* `scheduler.jobs_enqueued_total{mode}`
* `scheduler.run_latency_seconds{quantile}`   // event → first verdict
* `scheduler.delta_images_total{severity}`
* `scheduler.rate_limited_total{reason}`
* `policy_simulation_queue_depth{status}` (WebService gauge)
* `policy_simulation_latency_seconds` (WebService histogram)

**Targets**

* Resolve 10k changed keys → impacted set in **<300 ms** (hot cache).
* Event → first rescan verdict in **≤60 s** (p95).
* Nightly coverage of 50k images in **≤10 min** with 10 workers (analysis‑only).

**Tracing** (OTEL): spans `plan`, `resolve`, `enqueue`, `report_call`, `persist`, `emit`.

**Webhooks**

* Policy simulation webhooks fire on terminal states (`succeeded`, `failed`, `cancelled`).
* Configure under `Scheduler:Worker:Policy:Webhook` (see `SCHED-WEB-27-002-POLICY-SIMULATION-WEBHOOKS.md`).
* Requests include the headers `X-StellaOps-Tenant` and `X-StellaOps-Run-Id` for idempotency.
---
|
||||
|
||||
## 10) Configuration (YAML)
|
||||
|
||||
```yaml
|
||||
scheduler:
|
||||
authority:
|
||||
issuer: "https://authority.internal"
|
||||
require: "dpop" # or "mtls"
|
||||
queue:
|
||||
kind: "valkey" # or "nats" (valkey uses redis:// protocol)
|
||||
url: "redis://valkey:6379/4"
|
||||
postgres:
|
||||
connectionString: "Host=postgres;Port=5432;Database=scheduler;Username=stellaops;Password=stellaops"
|
||||
impactIndex:
|
||||
storage: "rocksdb" # "rocksdb" | "valkey" | "memory"
|
||||
warmOnStart: true
|
||||
usageOnlyDefault: true
|
||||
limits:
|
||||
defaultRatePerSecond: 50
|
||||
defaultParallelism: 8
|
||||
maxJobsPerRun: 50000
|
||||
integrates:
|
||||
scannerUrl: "https://scanner-web.internal"
|
||||
conselierWebhook: true
|
||||
excitorWebhook: true
|
||||
notifications:
|
||||
emitBus: "internal" # deliver to Notify via internal bus
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 11) UI touch‑points
|
||||
|
||||
* **Schedules** page: CRUD, enable/pause, next run, last run stats, mode (analysis/content), selector preview.
|
||||
* **Runs** page: timeline; heat‑map of deltas; drill‑down to affected images.
|
||||
* **Dry‑run preview** modal: “This Conselier export touches ~3,214 images; projected deltas: ~420 (34 KEV).”
|
||||
|
||||
---

## 12) Failure modes & degradations

| Condition | Behavior |
| --- | --- |
| ImpactIndex cold / incomplete | Fall back to **All** selection for nightly; for events, cap to KEV+critical until warmed |
| Conselier/Excitor webhook storm | Coalesce by exportId; debounce 30–60 s; keep the last |
| Scanner under load (429) | Back off with jitter; respect per‑tenant leaky buckets |
| Oversubscription (too many impacted) | Prioritize KEV/critical first; spill over to the next window; UI banner shows backlog |
| Notify down | Buffer outbound events in the queue (TTL 24 h) |
| PostgreSQL slow | Cut batch sizes; sample‑log; alert ops; don't drop runs unless critical |

---

## 13) Testing matrix

* **ImpactIndex**: correctness (purl→image sets), performance, persistence after restart, memory pressure with 1M purls.
* **Planner**: dedupe, shard, fairness, limit enforcement, KEV prioritization.
* **Runner**: parallel report calls, error backoff, partial failures, idempotency.
* **End‑to‑end**: Conselier export → deltas visible in the UI in ≤60 s.
* **Security**: webhook auth (mTLS/HMAC), DPoP nonce dance, tenant isolation.
* **Chaos**: drop Scanner availability; simulate registry throttles (content‑refresh mode).
* **Nightly**: cron tick correctness across timezones and DST.

---

## 14) Implementation notes

* **Language**: .NET 10 minimal API; Channels‑based pipeline; `System.Threading.RateLimiting`.
* **Bitmaps**: Roaring via `RoaringBitmap` bindings; memory‑map large shards if RocksDB is used.
* **Cron**: Quartz‑style parser with timezone support; clock skew tolerated ±60 s.
* **Dry‑run**: use ImpactIndex only; never call Scanner.
* **Idempotency**: run segments carry deterministic keys; retries are safe.
* **Backpressure**: per‑tenant buckets; per‑host registry budgets respected when content‑refresh is enabled.
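
One way to derive a deterministic run-segment key (an assumption for illustration, not the shipped scheme) is to hash the run id plus the sorted digest list, so retries and replays map to the same key regardless of enqueue order:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class SegmentKeys
{
    // Hypothetical sketch: deterministic idempotency key for a run segment.
    public static string Compute(string runId, IEnumerable<string> imageDigests)
    {
        // Sort so the key is independent of enqueue order.
        var canonical = runId + "\n" +
            string.Join("\n", imageDigests.OrderBy(d => d, StringComparer.Ordinal));
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(hash).ToLowerInvariant();
    }
}
```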

---

## 15) Sequences (representative)

**A) Event‑driven rescan (Conselier delta)**

```mermaid
sequenceDiagram
  autonumber
  participant FE as Conselier
  participant SCH as Scheduler.Worker
  participant IDX as ImpactIndex
  participant SC as Scanner.WebService
  participant NO as Notify

  FE->>SCH: POST /events/conselier-export {exportId, changedProductKeys}
  SCH->>IDX: ResolveByPurls(keys, usageOnly=true, sel)
  IDX-->>SCH: bitmap(imageIds) → digests list
  SCH->>SC: POST /reports {imageDigest} (batch/sequenced)
  SC-->>SCH: report deltas (new criticals/highs)
  alt delta>0
    SCH->>NO: rescan.delta {digest, newCriticals, links}
  end
```

**B) Nightly rescan**

```mermaid
sequenceDiagram
  autonumber
  participant CRON as Cron
  participant SCH as Scheduler.Worker
  participant IDX as ImpactIndex
  participant SC as Scanner.WebService

  CRON->>SCH: tick (02:00 Europe/Sofia)
  SCH->>IDX: ResolveAll(selector)
  IDX-->>SCH: candidates
  SCH->>SC: POST /reports {digest} (paced)
  SC-->>SCH: results
  SCH-->>SCH: aggregate, store run stats
```

**C) Content‑refresh (tag followers)**

```mermaid
sequenceDiagram
  autonumber
  participant SCH as Scheduler
  participant SC as Scanner

  SCH->>SC: resolve tag→digest (if changed)
  alt digest changed
    SCH->>SC: POST /scans {imageRef}      # new SBOM
    SC-->>SCH: scan complete (artifacts)
    SCH->>SC: POST /reports {imageDigest}
  else unchanged
    SCH->>SC: POST /reports {imageDigest} # analysis-only
  end
```

---

## 16) Roadmap

* **Vuln‑centric impact**: pre‑join vuln→purl→images to rank by **KEV** and **exploited‑in‑the‑wild** signals.
* **Policy diff preview**: when a staged policy changes, show the projected breakage set before promotion.
* **Cross‑cluster federation**: one Scheduler instance driving many Scanner clusters (tenant isolation).
* **Windows containers**: integrate Zastava runtime hints to tighten the Usage view.

---

**End — component_architecture_scheduler.md**

docs-archived/modules/scheduler/hlc-migration-guide.md (new file, 190 lines)
@@ -0,0 +1,190 @@
# HLC Queue Ordering Migration Guide

This guide describes how to enable HLC (Hybrid Logical Clock) ordering for the Scheduler queue, transitioning from legacy `(priority, created_at)` ordering to HLC-based ordering with cryptographic chain linking.

## Overview

HLC ordering provides:

- **Deterministic global ordering**: causal consistency across distributed nodes
- **Cryptographic chain linking**: audit-safe job sequence proofs
- **Reproducible processing**: the same input produces the same chain
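
The core HLC update rules can be sketched as follows. This is a generic illustration of the algorithm, not the `StellaOps.HybridLogicalClock` API: on a local event, take the max of physical time and the current logical time, bumping a counter on ties; on receive, also fold in the remote timestamp so causality is preserved.

```csharp
using System;

// Generic HLC sketch (illustrative; not the StellaOps.HybridLogicalClock API).
record HlcTimestamp(long PhysicalMs, int Counter, string NodeId);

sealed class Hlc
{
    readonly string _nodeId;
    long _l;   // logical physical-time component
    int _c;    // counter that breaks ties within the same millisecond

    public Hlc(string nodeId) => _nodeId = nodeId;

    // Local or send event.
    public HlcTimestamp Now(long wallClockMs)
    {
        var prev = _l;
        _l = Math.Max(prev, wallClockMs);
        _c = _l == prev ? _c + 1 : 0;
        return new HlcTimestamp(_l, _c, _nodeId);
    }

    // Receive event: merge the remote timestamp so causal order is kept.
    public HlcTimestamp Receive(HlcTimestamp remote, long wallClockMs)
    {
        var prev = _l;
        _l = Math.Max(Math.Max(prev, remote.PhysicalMs), wallClockMs);
        if (_l == prev && _l == remote.PhysicalMs) _c = Math.Max(_c, remote.Counter) + 1;
        else if (_l == prev) _c += 1;
        else if (_l == remote.PhysicalMs) _c = remote.Counter + 1;
        else _c = 0;
        return new HlcTimestamp(_l, _c, _nodeId);
    }
}
```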
|
||||
|
||||
## Prerequisites
|
||||
|
||||
1. PostgreSQL 16+ with the scheduler schema
|
||||
2. HLC library dependency (`StellaOps.HybridLogicalClock`)
|
||||
3. Schema migration `002_hlc_queue_chain.sql` applied
|
||||
|
||||
## Migration Phases
|
||||
|
||||
### Phase 1: Deploy with Dual-Write Mode
|
||||
|
||||
Enable dual-write to populate the new `scheduler_log` table without affecting existing operations.
|
||||
|
||||
```yaml
|
||||
# appsettings.yaml or environment configuration
|
||||
Scheduler:
|
||||
Queue:
|
||||
Hlc:
|
||||
EnableHlcOrdering: false # Keep using legacy ordering for reads
|
||||
DualWriteMode: true # Write to both legacy and HLC tables
|
||||
```
|
||||
|
||||
```csharp
|
||||
// Program.cs or Startup.cs
|
||||
services.AddOptions<SchedulerQueueOptions>()
|
||||
.Bind(configuration.GetSection("Scheduler:Queue"))
|
||||
.ValidateDataAnnotations()
|
||||
.ValidateOnStart();
|
||||
|
||||
// Register HLC services
|
||||
services.AddHlcSchedulerServices();
|
||||
|
||||
// Register HLC clock
|
||||
services.AddSingleton<IHybridLogicalClock>(sp =>
|
||||
{
|
||||
var nodeId = Environment.MachineName; // or use a stable node identifier
|
||||
return new HybridLogicalClock(nodeId, TimeProvider.System);
|
||||
});
|
||||
```
|
||||
|
||||
**Verification:**
|
||||
- Monitor `scheduler_hlc_enqueues_total` metric for dual-write activity
|
||||
- Verify `scheduler_log` table is being populated
|
||||
- Check chain verification passes: `scheduler_chain_verifications_total{result="valid"}`
|
||||
|
||||
### Phase 2: Backfill Historical Data (Optional)
|
||||
|
||||
If you need historical jobs in the HLC chain, backfill from the existing `scheduler.jobs` table:
|
||||
|
||||
```sql
|
||||
-- Backfill script (run during maintenance window)
|
||||
-- Note: This creates a new chain starting from historical data
|
||||
-- The chain will not have valid prev_link values for historical entries
|
||||
|
||||
INSERT INTO scheduler.scheduler_log (
|
||||
tenant_id, t_hlc, partition_key, job_id, payload_hash, prev_link, link
|
||||
)
|
||||
SELECT
|
||||
tenant_id,
|
||||
-- Generate synthetic HLC timestamps based on created_at
|
||||
-- Format: YYYYMMDDHHMMSS-nodeid-counter
|
||||
TO_CHAR(created_at AT TIME ZONE 'UTC', 'YYYYMMDDHH24MISS') || '-backfill-' ||
|
||||
LPAD(ROW_NUMBER() OVER (PARTITION BY tenant_id ORDER BY created_at)::TEXT, 6, '0'),
|
||||
COALESCE(project_id, ''),
|
||||
id,
|
||||
DECODE(payload_digest, 'hex'),
|
||||
NULL, -- No chain linking for historical data
|
||||
DECODE(payload_digest, 'hex') -- Use payload_digest as link placeholder
|
||||
FROM scheduler.jobs
|
||||
WHERE status IN ('pending', 'scheduled', 'running')
|
||||
AND NOT EXISTS (
|
||||
SELECT 1 FROM scheduler.scheduler_log sl
|
||||
WHERE sl.job_id = jobs.id
|
||||
)
|
||||
ORDER BY tenant_id, created_at;
|
||||
```
|
||||
|
||||
### Phase 3: Enable HLC Ordering for Reads
|
||||
|
||||
Once dual-write is stable and backfill (if needed) is complete:
|
||||
|
||||
```yaml
|
||||
Scheduler:
|
||||
Queue:
|
||||
Hlc:
|
||||
EnableHlcOrdering: true # Use HLC ordering for reads
|
||||
DualWriteMode: true # Keep dual-write during transition
|
||||
VerifyOnDequeue: false # Optional: enable for extra validation
|
||||
```
|
||||
|
||||
**Verification:**
|
||||
- Monitor dequeue latency (should be similar to legacy)
|
||||
- Verify job processing order matches HLC order
|
||||
- Check chain integrity periodically
|
||||
|
||||
### Phase 4: Disable Dual-Write Mode
|
||||
|
||||
Once confident in HLC ordering:
|
||||
|
||||
```yaml
|
||||
Scheduler:
|
||||
Queue:
|
||||
Hlc:
|
||||
EnableHlcOrdering: true
|
||||
DualWriteMode: false # Stop writing to legacy table
|
||||
VerifyOnDequeue: false
|
||||
```
|
||||
|
||||
## Configuration Reference
|
||||
|
||||
### SchedulerHlcOptions
|
||||
|
||||
| Property | Type | Default | Description |
|
||||
|----------|------|---------|-------------|
|
||||
| `EnableHlcOrdering` | bool | false | Use HLC ordering for queue reads |
|
||||
| `DualWriteMode` | bool | false | Write to both legacy and HLC tables |
|
||||
| `VerifyOnDequeue` | bool | false | Verify chain integrity on each dequeue |
|
||||
| `MaxClockDriftMs` | int | 60000 | Maximum allowed clock drift in milliseconds |
|
||||
|
||||
## Metrics
|
||||
|
||||
| Metric | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| `scheduler_hlc_enqueues_total` | Counter | Total HLC enqueue operations |
|
||||
| `scheduler_hlc_enqueue_deduplicated_total` | Counter | Deduplicated enqueue operations |
|
||||
| `scheduler_hlc_enqueue_duration_seconds` | Histogram | Enqueue operation duration |
|
||||
| `scheduler_hlc_dequeues_total` | Counter | Total HLC dequeue operations |
|
||||
| `scheduler_hlc_dequeued_entries_total` | Counter | Total entries dequeued |
|
||||
| `scheduler_chain_verifications_total` | Counter | Chain verification operations |
|
||||
| `scheduler_chain_verification_issues_total` | Counter | Chain verification issues found |
|
||||
| `scheduler_batch_snapshots_created_total` | Counter | Batch snapshots created |
|
||||
|
||||
## Troubleshooting

### Chain Verification Failures

If chain verification reports issues:

1. Check `scheduler_chain_verification_issues_total` for the issue count.
2. Query the log for the specific issues:

   ```csharp
   var result = await chainVerifier.VerifyAsync(tenantId);
   foreach (var issue in result.Issues)
   {
       logger.LogError(
           "Chain issue at job {JobId}: {Type} - {Description}",
           issue.JobId, issue.IssueType, issue.Description);
   }
   ```

3. Common causes:
   - Database corruption: restore from backup.
   - Concurrent writes without proper locking: check transaction isolation.
   - Clock drift: verify the `MaxClockDriftMs` setting.

### Performance Considerations

- **Index usage**: Ensure `idx_scheduler_log_tenant_hlc` is being used.
- **Chain head caching**: The `chain_heads` table provides O(1) access to the latest link.
- **Batch sizes**: Adjust the dequeue batch size based on workload.

## Rollback Procedure

To roll back to legacy ordering:

```yaml
Scheduler:
  Queue:
    Hlc:
      EnableHlcOrdering: false
      DualWriteMode: false
```

The `scheduler_log` table can be retained for audit purposes or dropped if no longer needed.

## Related Documentation

- [Scheduler Architecture](architecture.md)
- [HLC Library Documentation](../../__Libraries/StellaOps.HybridLogicalClock/README.md)
- [Product Advisory: Audit-safe Job Queue Ordering](../../product/advisories/audit-safe-job-queue-ordering.md)

176
docs-archived/modules/scheduler/hlc-ordering.md
Normal file
@@ -0,0 +1,176 @@

# Scheduler HLC Ordering Architecture

This document describes the Hybrid Logical Clock (HLC) based ordering system used by the StellaOps Scheduler for audit-safe job queue operations.

## Overview

The Scheduler uses HLC timestamps instead of wall-clock time to ensure:

1. **Total ordering** of jobs across distributed nodes
2. **Audit-safe sequencing** with cryptographic chain linking
3. **Deterministic merge** when offline nodes reconnect
4. **Clock skew tolerance** in distributed deployments

## HLC Timestamp Format

An HLC timestamp consists of three components:

```
(PhysicalTime, LogicalCounter, NodeId)
```

| Component | Description | Example |
|-----------|-------------|---------|
| PhysicalTime | Unix milliseconds (UTC) | `1704585600000` |
| LogicalCounter | Monotonic counter for same-millisecond events | `0`, `1`, `2`... |
| NodeId | Unique identifier for the node | `scheduler-prod-01` |

**String format:** `{physical}:{logical}:{nodeId}`
Example: `1704585600000:0:scheduler-prod-01`

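The total order this format implies can be sketched as a comparable value type. This is an illustration only: the type and member names are hypothetical, not the HLC library's API.

```csharp
using System;

// Illustrative sketch of the (PhysicalTime, LogicalCounter, NodeId) tuple.
public readonly record struct HlcTimestamp(long Physical, int Logical, string NodeId)
    : IComparable<HlcTimestamp>
{
    // Total order: physical time first, then logical counter, then node id
    // as the deterministic tie-breaker.
    public int CompareTo(HlcTimestamp other)
    {
        var byPhysical = Physical.CompareTo(other.Physical);
        if (byPhysical != 0) return byPhysical;
        var byLogical = Logical.CompareTo(other.Logical);
        if (byLogical != 0) return byLogical;
        return string.CompareOrdinal(NodeId, other.NodeId);
    }

    // Parses "{physical}:{logical}:{nodeId}", e.g. "1704585600000:0:scheduler-prod-01".
    public static HlcTimestamp Parse(string value)
    {
        var parts = value.Split(':', 3);
        return new HlcTimestamp(long.Parse(parts[0]), int.Parse(parts[1]), parts[2]);
    }

    public override string ToString() => $"{Physical}:{Logical}:{NodeId}";
}
```

Because the node id participates in the comparison, two events with identical physical and logical components on different nodes still order deterministically.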
## Database Schema

### scheduler_log Table

```sql
CREATE TABLE scheduler.scheduler_log (
    id              BIGSERIAL PRIMARY KEY,
    t_hlc           TEXT NOT NULL,        -- HLC timestamp
    job_id          TEXT NOT NULL,        -- Job identifier
    action          TEXT NOT NULL,        -- ENQUEUE, DEQUEUE, EXECUTE, COMPLETE, FAIL
    prev_chain_link TEXT,                 -- Hash of previous entry
    chain_link      TEXT NOT NULL,        -- Hash of this entry
    payload         JSONB NOT NULL,       -- Job metadata
    tenant_id       TEXT NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_scheduler_log_hlc ON scheduler.scheduler_log (t_hlc);
CREATE INDEX idx_scheduler_log_tenant_hlc ON scheduler.scheduler_log (tenant_id, t_hlc);
CREATE INDEX idx_scheduler_log_job ON scheduler.scheduler_log (job_id);
```

### batch_snapshot Table

```sql
CREATE TABLE scheduler.batch_snapshot (
    id              BIGSERIAL PRIMARY KEY,
    snapshot_hlc    TEXT NOT NULL,        -- HLC at snapshot time
    from_chain_link TEXT NOT NULL,        -- First entry in batch
    to_chain_link   TEXT NOT NULL,        -- Last entry in batch
    entry_count     INTEGER NOT NULL,
    merkle_root     TEXT NOT NULL,        -- Merkle root of entries
    dsse_envelope   JSONB,                -- DSSE-signed attestation
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

### chain_heads Table

```sql
CREATE TABLE scheduler.chain_heads (
    tenant_id       TEXT PRIMARY KEY,
    head_chain_link TEXT NOT NULL,        -- Current chain head
    head_hlc        TEXT NOT NULL,        -- HLC of chain head
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

## Chain Link Computation

Each log entry is cryptographically linked to its predecessor:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static string ComputeChainLink(
    string tHlc,
    string jobId,
    string action,
    string? prevChainLink,
    string payloadDigest)
{
    using var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
    hasher.AppendData(Encoding.UTF8.GetBytes(tHlc));
    hasher.AppendData(Encoding.UTF8.GetBytes(jobId));
    hasher.AppendData(Encoding.UTF8.GetBytes(action));
    hasher.AppendData(Encoding.UTF8.GetBytes(prevChainLink ?? "genesis"));
    hasher.AppendData(Encoding.UTF8.GetBytes(payloadDigest));
    return Convert.ToHexString(hasher.GetHashAndReset()).ToLowerInvariant();
}
```

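The linking rule implies a straightforward verification walk. The sketch below is illustrative, not the shipped verifier: the entry shape is an assumption derived from the `scheduler_log` schema, and it reuses the `ComputeChainLink` helper shown above.

```csharp
using System;
using System.Collections.Generic;

// Hedged sketch: walk entries in HLC order, recomputing each chain link and
// checking both the stored back-pointer and the stored link.
public static bool VerifyChain(
    IEnumerable<(string THlc, string JobId, string Action,
                 string? PrevChainLink, string PayloadDigest, string ChainLink)> entriesInHlcOrder)
{
    string? expectedPrev = null; // genesis: first entry has no predecessor
    foreach (var e in entriesInHlcOrder)
    {
        // The stored back-pointer must match the previous entry's link...
        if (!string.Equals(e.PrevChainLink, expectedPrev, StringComparison.Ordinal))
            return false;

        // ...and the stored link must match the recomputed hash.
        var recomputed = ComputeChainLink(e.THlc, e.JobId, e.Action, e.PrevChainLink, e.PayloadDigest);
        if (!string.Equals(recomputed, e.ChainLink, StringComparison.Ordinal))
            return false;

        expectedPrev = e.ChainLink;
    }
    return true;
}
```

Any tampering with an earlier entry changes its hash, which breaks every subsequent back-pointer check in the walk.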
## Configuration Options

```yaml
# etc/scheduler.yaml
scheduler:
  hlc:
    enabled: true                    # Enable HLC ordering (default: true)
    nodeId: "scheduler-prod-01"      # Unique node identifier
    maxClockSkew: "00:00:05"         # Maximum tolerable clock skew (5 seconds)
    persistenceInterval: "00:01:00"  # HLC state persistence interval

  chain:
    enabled: true                    # Enable chain linking (default: true)
    batchSize: 1000                  # Entries per batch snapshot
    batchInterval: "00:05:00"        # Batch snapshot interval
    signSnapshots: true              # DSSE-sign batch snapshots
    keyId: "scheduler-signing-key"   # Key for snapshot signing
```

## Operational Considerations

### Clock Skew Handling

The HLC algorithm tolerates clock skew by:

1. Advancing the logical counter when physical time hasn't progressed
2. Rejecting events with excessive clock skew (> `maxClockSkew`)
3. Emitting the `hlc_clock_skew_rejections_total` metric for monitoring

**Alert:** `HlcClockSkewExceeded` triggers when skew > tolerance.

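The tick rule behind points 1 and 2 can be sketched as follows. This is illustrative only (the real implementation lives in the HLC core library); the class name and threading assumptions here are hypothetical, and it assumes one writer per node or external synchronization.

```csharp
using System;

// Sketch of a local-event HLC tick: adopt the wall clock when it advances,
// otherwise fall back to the logical counter, rejecting excessive skew.
public sealed class HlcClock
{
    private readonly string _nodeId;
    private readonly TimeSpan _maxClockSkew;
    private long _physical;
    private int _logical;

    public HlcClock(string nodeId, TimeSpan maxClockSkew)
    {
        _nodeId = nodeId;
        _maxClockSkew = maxClockSkew;
    }

    public (long Physical, int Logical, string NodeId) Tick()
    {
        var now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
        if (now > _physical)
        {
            // Wall clock moved forward: adopt it and reset the counter.
            _physical = now;
            _logical = 0;
        }
        else
        {
            // Wall clock stalled or regressed: keep the last physical value
            // and advance the logical counter instead (point 1 above).
            if (_physical - now > _maxClockSkew.TotalMilliseconds)
                throw new InvalidOperationException("Clock skew exceeds configured tolerance.");
            _logical++;
        }
        return (_physical, _logical, _nodeId);
    }
}
```

A receive-side tick additionally merges the remote timestamp before applying the same counter logic; that path is where the skew-rejection metric would typically be incremented.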
### Chain Verification

Verify chain integrity on startup and periodically:

```bash
# CLI command
stella scheduler chain verify --tenant-id <tenant>

# API endpoint
GET /api/v1/scheduler/chain/verify?tenantId=<tenant>
```

### Offline Merge

When offline nodes reconnect:

1. Export the local job log as a bundle
2. Import it on a connected node
3. HLC-based merge produces a deterministic ordering
4. The chain is extended with the merged entries

See `docs/operations/airgap-operations-runbook.md` for details.

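Step 3's determinism follows from sorting solely by HLC components, so every node that merges the same two bundles derives the same order. A hedged sketch, assuming entries carry the `{physical}:{logical}:{nodeId}` string format described earlier (the function name and inputs are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative deterministic merge of two exported logs, keyed by HLC.
public static IReadOnlyList<string> MergeByHlc(
    IEnumerable<string> localTHlcs, IEnumerable<string> importedTHlcs)
{
    return localTHlcs.Concat(importedTHlcs)
        .Distinct(StringComparer.Ordinal)                         // drop entries present in both bundles
        .OrderBy(t => long.Parse(t.Split(':', 3)[0]))             // physical milliseconds
        .ThenBy(t => int.Parse(t.Split(':', 3)[1]))               // logical counter
        .ThenBy(t => t.Split(':', 3)[2], StringComparer.Ordinal)  // node id tie-break
        .ToList();
}
```

Note that a plain string sort over `t_hlc` would not be safe here because the physical component is not zero-padded; parsing the components, as above, preserves the numeric order.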
## Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `hlc_ticks_total` | Counter | Total HLC tick operations |
| `hlc_clock_skew_rejections_total` | Counter | Events rejected due to clock skew |
| `hlc_physical_offset_seconds` | Gauge | Current physical time offset |
| `scheduler_chain_entries_total` | Counter | Total chain log entries |
| `scheduler_chain_verifications_total` | Counter | Chain verification operations |
| `scheduler_chain_verification_failures_total` | Counter | Failed verifications |
| `scheduler_batch_snapshots_total` | Counter | Batch snapshots created |

## Grafana Dashboard

See `devops/observability/grafana/hlc-queue-metrics.json` for the HLC monitoring dashboard.

## Related Documentation

- [HLC Core Library](../../../src/__Libraries/StellaOps.HybridLogicalClock/README.md)
- [HLC Migration Guide](./hlc-migration-guide.md)
- [Air-Gap Operations Runbook](../../operations/airgap-operations-runbook.md)
- [HLC Troubleshooting](../../operations/runbooks/hlc-troubleshooting.md)

24
docs-archived/modules/scheduler/implementation_plan.md
Normal file
@@ -0,0 +1,24 @@

# Scheduler Implementation Plan

## Purpose
Provide a living plan for Scheduler deliverables, dependencies, and evidence.

## Active work
- Track current sprints under `docs/implplan/SPRINT_*.md` for this module.
- Update this file when new scoped work is approved.

## Near-term deliverables
- TBD (add when sprint is staffed).

## Dependencies
- `docs/modules/scheduler/architecture.md`
- `docs/modules/scheduler/README.md`
- `docs/modules/platform/architecture-overview.md`

## Evidence of completion
- Code changes under `src/Scheduler/**`.
- Tests and fixtures under the module's `__Tests` / `__Libraries`.
- Docs and runbooks under `docs/modules/scheduler/**`.

## Notes
- Keep deterministic and offline-first expectations aligned with module AGENTS.

@@ -0,0 +1,261 @@

{
  "title": "Scheduler Worker – Planning & Rescan",
  "uid": "scheduler-worker-observability",
  "schemaVersion": 38,
  "version": 1,
  "editable": true,
  "timezone": "",
  "graphTooltip": 0,
  "time": {
    "from": "now-24h",
    "to": "now"
  },
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "hide": 0,
        "refresh": 1,
        "current": {}
      },
      {
        "name": "mode",
        "label": "Mode",
        "type": "query",
        "datasource": {
          "type": "prometheus",
          "uid": "${datasource}"
        },
        "query": "label_values(scheduler_planner_runs_total, mode)",
        "refresh": 1,
        "multi": true,
        "includeAll": true,
        "allValue": ".*",
        "current": {
          "selected": false,
          "text": "All",
          "value": ".*"
        }
      }
    ]
  },
  "annotations": {
    "list": []
  },
  "panels": [
    {
      "id": 1,
      "title": "Planner Runs per Status",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "ops",
          "displayName": "{{status}}"
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        }
      },
      "targets": [
        {
          "expr": "sum by (status) (rate(scheduler_planner_runs_total{mode=~\"$mode\"}[5m]))",
          "legendFormat": "{{status}}",
          "refId": "A"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      }
    },
    {
      "id": 2,
      "title": "Planner Latency P95 (s)",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "s"
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        }
      },
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket{mode=~\"$mode\"}[5m])))",
          "legendFormat": "p95",
          "refId": "A"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 0
      }
    },
    {
      "id": 3,
      "title": "Runner Segments per Status",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "ops",
          "displayName": "{{status}}"
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        }
      },
      "targets": [
        {
          "expr": "sum by (status) (rate(scheduler_runner_segments_total{mode=~\"$mode\"}[5m]))",
          "legendFormat": "{{status}}",
          "refId": "A"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 8
      }
    },
    {
      "id": 4,
      "title": "New Findings per Severity",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "ops",
          "displayName": "{{severity}}"
        },
        "overrides": []
      },
      "options": {
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        }
      },
      "targets": [
        {
          "expr": "sum(rate(scheduler_runner_delta_critical_total{mode=~\"$mode\"}[5m]))",
          "legendFormat": "critical",
          "refId": "A"
        },
        {
          "expr": "sum(rate(scheduler_runner_delta_high_total{mode=~\"$mode\"}[5m]))",
          "legendFormat": "high",
          "refId": "B"
        },
        {
          "expr": "sum(rate(scheduler_runner_delta_total{mode=~\"$mode\"}[5m]))",
          "legendFormat": "total",
          "refId": "C"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 8
      }
    },
    {
      "id": 5,
      "title": "Runner Backlog by Schedule",
      "type": "table",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "displayName": "{{scheduleId}}",
          "unit": "none"
        },
        "overrides": []
      },
      "options": {
        "showHeader": true
      },
      "targets": [
        {
          "expr": "max by (scheduleId) (scheduler_runner_backlog{mode=~\"$mode\"})",
          "format": "table",
          "refId": "A"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 16
      }
    },
    {
      "id": 6,
      "title": "Active Runs",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "none"
        },
        "overrides": []
      },
      "options": {
        "orientation": "horizontal",
        "textMode": "value"
      },
      "targets": [
        {
          "expr": "sum(scheduler_runs_active{mode=~\"$mode\"})",
          "refId": "A"
        }
      ],
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 16
      }
    }
  ]
}

@@ -0,0 +1,42 @@

groups:
  - name: scheduler-worker
    interval: 30s
    rules:
      - alert: SchedulerPlannerFailuresHigh
        expr: |
          sum(rate(scheduler_planner_runs_total{status="failed"}[5m]))
          /
          sum(rate(scheduler_planner_runs_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
          service: scheduler-worker
        annotations:
          summary: "Planner failure ratio above 5%"
          description: "More than 5% of planning runs are failing. Inspect scheduler logs and ImpactIndex connectivity before queues back up."
      - alert: SchedulerPlannerLatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket[5m]))) > 45
        for: 10m
        labels:
          severity: warning
          service: scheduler-worker
        annotations:
          summary: "Planner latency p95 above 45s"
          description: "Planning latency p95 stayed above 45 seconds for 10 minutes. Check ImpactIndex, Mongo, or external selectors to prevent missed SLAs."
      - alert: SchedulerRunnerBacklogGrowing
        expr: max_over_time(scheduler_runner_backlog[15m]) > 500
        for: 15m
        labels:
          severity: warning
          service: scheduler-worker
        annotations:
          summary: "Runner backlog above 500 images"
          description: "Runner backlog exceeded 500 images over the last 15 minutes. Verify runner workers, scanner availability, and rate limits."
      - alert: SchedulerRunStuck
        expr: sum(scheduler_runs_active) > 0 and max_over_time(scheduler_runs_active[30m]) == min_over_time(scheduler_runs_active[30m])
        for: 30m
        labels:
          severity: warning
          service: scheduler-worker
        annotations:
          summary: "Scheduler runs stuck without progress"
          description: "Active runs count has remained flat for 30 minutes. Investigate stuck segments or scanner timeouts."

82
docs-archived/modules/scheduler/operations/worker.md
Normal file
@@ -0,0 +1,82 @@

# Scheduler Worker – Observability & Runbook

## Purpose
Monitor planner and runner health for the Scheduler Worker (Sprint 16 telemetry). The new .NET meters surface queue throughput, latency, backlog, and delta severities so operators can detect stalled runs before rescan SLAs slip.

> **Grafana note:** Import `docs/modules/scheduler/operations/worker-grafana-dashboard.json` into the Prometheus-backed Grafana stack that scrapes the OpenTelemetry Collector.

---

## Key metrics

| Metric | Use case | Suggested query |
| --- | --- | --- |
| `scheduler_planner_runs_total{status}` | Planner throughput & failure ratio | `sum by (status) (rate(scheduler_planner_runs_total[5m]))` |
| `scheduler_planner_latency_seconds_bucket` | Planning latency (p95 / p99) | `histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket[5m])))` |
| `scheduler_runner_segments_total{status}` | Runner success vs retries | `sum by (status) (rate(scheduler_runner_segments_total[5m]))` |
| `scheduler_runner_delta_{critical,high}_total`, `scheduler_runner_delta_total` | Newly detected findings | `sum(rate(scheduler_runner_delta_critical_total[5m]))` |
| `scheduler_runner_backlog{scheduleId}` | Remaining digests awaiting the runner | `max by (scheduleId) (scheduler_runner_backlog)` |
| `scheduler_runs_active{mode}` | Active runs in flight | `sum(scheduler_runs_active)` |

Reference queries power the bundled Grafana dashboard panels. Use the `mode` template variable to focus on `analysisOnly` versus `contentRefresh` schedules.

---

## Grafana dashboard

1. Import `docs/modules/scheduler/operations/worker-grafana-dashboard.json` (UID `scheduler-worker-observability`).
2. Point the `datasource` variable to the Prometheus instance scraping the collector. Optional: pin the `mode` variable to a specific schedule mode.
3. Panels included:
   - **Planner Runs per Status** – visualises success vs failure ratio.
   - **Planner Latency P95** – highlights degradations in ImpactIndex or Mongo lookups.
   - **Runner Segments per Status** – shows retry pressure and queue health.
   - **New Findings per Severity** – rolls up delta counters (critical/high/total).
   - **Runner Backlog by Schedule** – tabulates outstanding digests per schedule.
   - **Active Runs** – stat panel showing the current number of in-flight runs.

Capture screenshots once Grafana provisioning completes and store them under `docs/assets/dashboards/` (pending automation ticket OBS-157).

---

## Prometheus alerts

Import `docs/modules/scheduler/operations/worker-prometheus-rules.yaml` into your Prometheus rule configuration. The bundle defines:

- **SchedulerPlannerFailuresHigh** – 5%+ of planner runs failed for 10 minutes. Page SRE.
- **SchedulerPlannerLatencyHigh** – planner p95 latency remains above 45 s for 10 minutes. Investigate ImpactIndex, Mongo, and Concelier/Excititor event queues.
- **SchedulerRunnerBacklogGrowing** – backlog exceeded 500 images for 15 minutes. Inspect runner workers, Scanner availability, and rate limiting.
- **SchedulerRunStuck** – active run count stayed flat for 30 minutes while remaining non-zero. Check stuck segments, expired leases, and scanner retries.

Hook these alerts into the existing Observability notification pathway (`observability-pager` routing key) and ensure `service=scheduler-worker` is mapped to the on-call rotation.

---

## Runbook snapshot

1. **Planner failure/latency:**
   - Check planner logs for ImpactIndex or Mongo exceptions.
   - Verify Concelier/Excititor webhook health; requeue events if necessary.
   - If the planner is overwhelmed, temporarily reduce schedule parallelism via `stella scheduler schedule update`.
2. **Runner backlog spike:**
   - Confirm Scanner WebService health (`/healthz`).
   - Inspect the runner queue for stuck segments; consider increasing runner workers or scaling scanner capacity.
   - Review rate limits (schedule limits, ImpactIndex throughput) before changing global throttles.
3. **Stuck runs:**
   - Use `stella scheduler runs list --state running` to identify affected runs.
   - Drill into the Grafana panel “Runner Backlog by Schedule” to see the offending schedule IDs.
   - If a segment will not progress, use `stella scheduler segments release --segment <id>` to force a retry after resolving the root cause.
4. **Unexpected critical deltas:**
   - Correlate `scheduler_runner_delta_critical_total` spikes with Notify events (`scheduler.rescan.delta`).
   - Pivot to Scanner report links for the impacted digests and confirm they match upstream advisories/policies.

Document incidents and mitigations in `ops/runbooks/INCIDENT_LOG.md` (per SRE policy) and attach Grafana screenshots for post-mortems.

---

## Checklist

- [ ] Grafana dashboard imported and wired to the Prometheus datasource.
- [ ] Prometheus alert rules deployed (see above).
- [ ] Runbook linked from the on-call rotation portal.
- [ ] Observability Guild sign-off captured for Sprint 16 telemetry (OWNER: @obs-guild).