feat: Implement vulnerability token signing and verification utilities
- Added VulnTokenSigner for signing JWT tokens with specified algorithms and keys.
- Introduced VulnTokenUtilities for resolving tenant and subject claims, and sanitizing context dictionaries.
- Created VulnTokenVerificationUtilities for parsing tokens, verifying signatures, and deserializing payloads.
- Developed VulnWorkflowAntiForgeryTokenIssuer for issuing anti-forgery tokens with configurable options.
- Implemented VulnWorkflowAntiForgeryTokenVerifier for verifying anti-forgery tokens and validating payloads.
- Added AuthorityVulnerabilityExplorerOptions to manage configuration for vulnerability explorer features.
- Included tests for FilesystemPackRunDispatcher to ensure proper job handling under egress policy restrictions.
# component_architecture_scheduler.md — **Stella Ops Scheduler** (2025Q4)

> Synthesises the scheduling requirements documented across the Policy, Vulnerability Explorer, and Orchestrator module guides and implementation plans.

> **Scope.** Implementation‑ready architecture for **Scheduler**: a service that (1) **re‑evaluates** already‑cataloged images when intel changes (Feedser/Vexer/policy), (2) orchestrates **nightly** and **ad‑hoc** runs, (3) targets only the **impacted** images using the BOM‑Index, and (4) emits **report‑ready** events that downstream **Notify** fans out. Default mode is **analysis‑only** (no image pull); optional **content‑refresh** can be enabled per schedule.

---

## 0) Mission & boundaries

**Mission.** Keep scan results **current** without rescanning the world. When new advisories or VEX claims land, **pinpoint** affected images and ask the backend to recompute **verdicts** against the **existing SBOMs**. Surface only **meaningful deltas** to humans and ticket queues.

**Boundaries.**

* Scheduler **does not** compute SBOMs and **does not** sign. It calls Scanner.WebService’s **/reports (analysis‑only)** endpoint and lets the backend (Policy + Vexer + Feedser) decide PASS/FAIL.
* Scheduler **may** ask Scanner to **content‑refresh** selected targets (e.g., mutable tags), but the default is **no** image pull.
* Notifications are **not** sent directly; Scheduler emits events consumed by **Notify**.

---

## 1) Runtime shape & projects

```
src/
├─ StellaOps.Scheduler.WebService/     # REST (schedules CRUD, runs, admin)
├─ StellaOps.Scheduler.Worker/         # planners + runners (N replicas)
├─ StellaOps.Scheduler.ImpactIndex/    # purl→images inverted index (roaring bitmaps)
├─ StellaOps.Scheduler.Models/         # DTOs (Schedule, Run, ImpactSet, Deltas)
├─ StellaOps.Scheduler.Storage.Mongo/  # schedules, runs, cursors, locks
├─ StellaOps.Scheduler.Queue/          # Redis Streams / NATS abstraction
├─ StellaOps.Scheduler.Tests.*         # unit/integration/e2e
```

**Deployables**:

* **Scheduler.WebService** (stateless)
* **Scheduler.Worker** (scale‑out; planners + executors)

**Dependencies**: Authority (OpTok + DPoP/mTLS), Scanner.WebService, Feedser, Vexer, MongoDB, Redis/NATS, (optional) Notify.

---

## 2) Core responsibilities

1. **Time‑based** runs: cron windows per tenant/timezone (e.g., “02:00 Europe/Sofia”).
2. **Event‑driven** runs: react to **Feedser export** and **Vexer export** deltas (changed product keys / advisories / claims).
3. **Impact targeting**: map changes to **image sets** using a **global inverted index** built from Scanner’s per‑image **BOM‑Index** sidecars.
4. **Run planning**: shard, pace, and rate‑limit jobs to avoid thundering herds.
5. **Execution**: call Scanner **/reports (analysis‑only)** or **/scans (content‑refresh)**; aggregate **delta** results.
6. **Events**: publish `rescan.delta` and `report.ready` summaries for **Notify** & **UI**.
7. **Control plane**: CRUD schedules, **pause/resume**, dry‑run previews, audit.

---

## 3) Data model (Mongo)

**Database**: `scheduler`

* `schedules`

```
{ _id, tenantId, name, enabled, whenCron, timezone,
  mode: "analysis-only" | "content-refresh",
  selection: { scope: "all-images" | "by-namespace" | "by-repo" | "by-digest" | "by-labels",
               includeTags?: ["prod-*"], digests?: [sha256...], resolvesTags?: bool },
  onlyIf: { lastReportOlderThanDays?: int, policyRevision?: string },
  notify: { onNewFindings: bool, minSeverity: "low|medium|high|critical", includeKEV: bool },
  limits: { maxJobs?: int, ratePerSecond?: int, parallelism?: int },
  createdAt, updatedAt, createdBy, updatedBy }
```

* `runs`

```
{ _id, scheduleId?, tenantId, trigger: "cron|feedser|vexer|manual",
  reason?: { feedserExportId?, vexerExportId?, cursor? },
  state: "planning|queued|running|completed|error|cancelled",
  stats: { candidates: int, deduped: int, queued: int, completed: int, deltas: int, newCriticals: int },
  startedAt, finishedAt, error? }
```

* `impact_cursors`

```
{ _id: tenantId, feedserLastExportId, vexerLastExportId, updatedAt }
```

* `locks` (singleton schedulers, run leases)

* `audit` (CRUD actions, run outcomes)

**Indexes** (see the driver sketch after this list):

* `schedules` on `{tenantId, enabled}`, `{whenCron}`.
* `runs` on `{tenantId, startedAt desc}`, `{state}`.
* TTL optional for completed runs (e.g., 180 days).
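
A minimal sketch of how these indexes might be declared with the MongoDB .NET driver. Collection and field names follow the schema sketched above; the helper name and the choice of `finishedAt` as the TTL field are assumptions, not the shipped implementation.

```csharp
using System;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

// Hypothetical helper; collection/field names mirror the data model above.
static async Task EnsureSchedulerIndexesAsync(IMongoDatabase db)
{
    var schedules = db.GetCollection<BsonDocument>("schedules");
    await schedules.Indexes.CreateOneAsync(new CreateIndexModel<BsonDocument>(
        Builders<BsonDocument>.IndexKeys.Ascending("tenantId").Ascending("enabled")));

    var runs = db.GetCollection<BsonDocument>("runs");
    await runs.Indexes.CreateOneAsync(new CreateIndexModel<BsonDocument>(
        Builders<BsonDocument>.IndexKeys.Ascending("tenantId").Descending("startedAt")));
    await runs.Indexes.CreateOneAsync(new CreateIndexModel<BsonDocument>(
        Builders<BsonDocument>.IndexKeys.Ascending("state")));

    // Optional TTL: expire completed runs roughly 180 days after they finish.
    await runs.Indexes.CreateOneAsync(new CreateIndexModel<BsonDocument>(
        Builders<BsonDocument>.IndexKeys.Ascending("finishedAt"),
        new CreateIndexOptions { ExpireAfter = TimeSpan.FromDays(180) }));
}
```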

---

## 4) ImpactIndex (global inverted index)

Goal: translate **change keys** → **image sets** in **milliseconds**.

**Source**: Scanner produces per‑image **BOM‑Index** sidecars (purls, and `usedByEntrypoint` bitmaps). Scheduler ingests/refreshes them to build a **global** index.

**Representation**:

* Assign **image IDs** (dense ints) to catalog images.
* Keep **Roaring Bitmaps**:

  * `Contains[purl] → bitmap(imageIds)`
  * `UsedBy[purl] → bitmap(imageIds)` (subset of Contains)

* Optionally keep **Owner maps**: `{imageId → {tenantId, namespaces[], repos[]}}` for selection filters.
* Persist in RocksDB/LMDB or Redis‑modules; cache hot shards in memory; snapshot to Mongo for cold start.

**Update paths**:

* On new/updated image SBOM: **merge** per‑image set into global maps.
* On image remove/expiry: **clear** id from bitmaps.

**API (internal)**:

```csharp
IImpactIndex {
  ImpactSet ResolveByPurls(IEnumerable<string> purls, bool usageOnly, Selector sel);
  ImpactSet ResolveByVulns(IEnumerable<string> vulnIds, bool usageOnly, Selector sel); // optional (vuln->purl precomputed by Feedser)
  ImpactSet ResolveAll(Selector sel);                                                  // for nightly
}
```

**Selector filters**: tenant, namespaces, repos, labels, digest allowlists, `includeTags` patterns.
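
A minimal, illustrative sketch of the `ResolveByPurls` path. It uses plain integer sets in place of roaring bitmaps and hypothetical in-memory `Contains`/`UsedBy` maps plus a predicate for the selector; a production implementation would back these with the RoaringBitmap storage described above.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: HashSet<int> stands in for a roaring bitmap of image IDs.
sealed class ImpactIndexSketch
{
    private readonly Dictionary<string, HashSet<int>> _contains = new(); // purl -> imageIds
    private readonly Dictionary<string, HashSet<int>> _usedBy   = new(); // purl -> imageIds (subset)

    public IReadOnlyCollection<int> ResolveByPurls(
        IEnumerable<string> purls, bool usageOnly, Func<int, bool> selector)
    {
        var map = usageOnly ? _usedBy : _contains;
        var result = new HashSet<int>();

        foreach (var purl in purls)
        {
            if (map.TryGetValue(purl, out var imageIds))
                result.UnionWith(imageIds);       // bitmap OR across all changed purls
        }

        result.RemoveWhere(id => !selector(id));  // tenant/namespace/repo/label filters
        return result;
    }
}
```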

---

## 5) External interfaces (REST)

Base path: `/api/v1/scheduler` (Authority OpToks; scopes: `scheduler.read`, `scheduler.admin`).

### 5.1 Schedules CRUD

* `POST /schedules` → create
* `GET /schedules` → list (filter by tenant)
* `GET /schedules/{id}` → details + next run
* `PATCH /schedules/{id}` → pause/resume/update
* `DELETE /schedules/{id}` → delete (soft delete, optional)

### 5.2 Run control & introspection

* `POST /run` — ad‑hoc run

  ```json
  { "mode": "analysis-only|content-refresh", "selection": {...}, "reason": "manual" }
  ```

* `GET /runs` — list with paging
* `GET /runs/{id}` — status, stats, links to deltas
* `POST /runs/{id}/cancel` — best‑effort cancel

### 5.3 Previews (dry‑run)

* `POST /preview/impact` — returns **candidate count** and a small sample of impacted digests for given change keys or selection.

### 5.4 Event webhooks (optional push from Feedser/Vexer)

* `POST /events/feedser-export`

  ```json
  { "exportId":"...", "changedProductKeys":["pkg:rpm/openssl", ...], "kev": ["CVE-..."], "window": { "from":"...","to":"..." } }
  ```

* `POST /events/vexer-export`

  ```json
  { "exportId":"...", "changedClaims":[ { "productKey":"pkg:deb/...", "vulnId":"CVE-...", "status":"not_affected→affected"} ], ... }
  ```

**Security**: webhook callers must present **mTLS** or an `X-Scheduler-Signature` header (an Ed25519 signature or HMAC‑SHA‑256 over the body), plus an Authority token.
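
A hedged sketch of the HMAC variant of webhook verification. The header name comes from above; the hex encoding, the shared-secret lookup, and signing the raw request body are assumptions about the framing, not a published contract.

```csharp
using System;
using System.Security.Cryptography;

// Assumes the signature header carries hex-encoded HMAC-SHA-256 over the raw request body.
static bool VerifyWebhookSignature(byte[] rawBody, string? signatureHeader, byte[] sharedSecret)
{
    if (string.IsNullOrWhiteSpace(signatureHeader))
        return false;

    byte[] presented;
    try { presented = Convert.FromHexString(signatureHeader); }
    catch (FormatException) { return false; }

    using var hmac = new HMACSHA256(sharedSecret);
    var expected = hmac.ComputeHash(rawBody);

    // Constant-time comparison to avoid timing side channels.
    return CryptographicOperations.FixedTimeEquals(presented, expected);
}
```

Ed25519 callers would instead be verified with the sender's public key; either way the Authority token check still applies.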

---

## 6) Planner → Runner pipeline

### 6.1 Planning algorithm (event‑driven)

```
On Export Event (Feedser/Vexer):
  keys = Normalize(change payload)            # productKeys or vulnIds→productKeys
  usageOnly = schedule/policy hint?           # default true
  sel = Selector for tenant/scope from schedules subscribed to events

  impacted = ImpactIndex.ResolveByPurls(keys, usageOnly, sel)
  impacted = ApplyOwnerFilters(impacted, sel) # namespaces/repos/labels
  impacted = DeduplicateByDigest(impacted)
  impacted = EnforceLimits(impacted, limits.maxJobs)
  shards   = Shard(impacted, byHashPrefix, n=limits.parallelism)

  For each shard:
    Enqueue RunSegment (runId, shard, rate=limits.ratePerSecond)
```

**Fairness & pacing**

* Use a **leaky bucket** per tenant and per registry host (see the pacing sketch after this list).
* Prioritize **KEV‑tagged** and **critical** first if oversubscribed.
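
Section 14 calls out `System.Threading.RateLimiting`; below is a minimal sketch of per-tenant pacing with that API. The class name and bucket parameters are illustrative, and per-registry-host budgets would be layered the same way.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.RateLimiting;
using System.Threading.Tasks;

// One token bucket per tenant; a lease is taken before each report call.
sealed class TenantPacer
{
    private readonly ConcurrentDictionary<string, TokenBucketRateLimiter> _limiters = new();

    public async Task PaceAsync(string tenantId, int ratePerSecond, CancellationToken ct)
    {
        var limiter = _limiters.GetOrAdd(tenantId, _ => new TokenBucketRateLimiter(
            new TokenBucketRateLimiterOptions
            {
                TokenLimit = ratePerSecond,
                TokensPerPeriod = ratePerSecond,
                ReplenishmentPeriod = TimeSpan.FromSeconds(1),
                QueueLimit = int.MaxValue,
                QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
                AutoReplenishment = true
            }));

        using RateLimitLease lease = await limiter.AcquireAsync(1, ct);
        if (!lease.IsAcquired)
            throw new InvalidOperationException($"Rate limit lease rejected for tenant {tenantId}.");
    }
}
```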

### 6.2 Nightly planning

```
At cron tick:
  sel = resolve selection
  candidates = ImpactIndex.ResolveAll(sel)
  if lastReportOlderThanDays present → filter by report age (via Scanner catalog)
  shard & enqueue as above
```

### 6.3 Execution (Runner)

* Pop a **RunSegment** job → for each image digest:

  * **analysis‑only**: `POST scanner/reports { imageDigest, policyRevision? }` (sketched after this list)
  * **content‑refresh**: resolve tag→digest if needed; `POST scanner/scans { imageRef, attest? false }` then `POST /reports`

* Collect **delta**: `newFindings`, `newCriticals`/`highs`, `links` (UI deep link, Rekor if present).
* Persist per‑image outcome in `runs.{id}.stats` (incremental counters).
* Emit `scheduler.rescan.delta` events to **Notify** only when **delta > 0** and the delta matches the severity rule.
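
A simplified sketch of the analysis-only runner step. The request payload shape follows the pseudocode above; the exact Scanner route (`/api/v1/reports` here) and the `ReportDelta` response record are assumptions for illustration.

```csharp
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical delta summary returned by Scanner's analysis-only /reports endpoint.
sealed record ReportDelta(int NewCriticals, int NewHigh, string ReportUrl);

static async Task<ReportDelta?> RequestReportAsync(
    HttpClient scanner, string imageDigest, string? policyRevision, CancellationToken ct)
{
    // Route assumed; the spec only says "POST scanner/reports { imageDigest, policyRevision? }".
    using var response = await scanner.PostAsJsonAsync(
        "/api/v1/reports",
        new { imageDigest, policyRevision },
        ct);

    response.EnsureSuccessStatusCode();
    return await response.Content.ReadFromJsonAsync<ReportDelta>(cancellationToken: ct);
}
```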

---

## 7) Event model (outbound)

**Topic**: `rescan.delta` (internal bus → Notify; UI subscribes via backend).

```json
{
  "tenant": "tenant-01",
  "runId": "324af…",
  "imageDigest": "sha256:…",
  "newCriticals": 1,
  "newHigh": 2,
  "kevHits": ["CVE-2025-..."],
  "topFindings": [
    { "purl":"pkg:rpm/openssl@3.0.12-...","vulnId":"CVE-2025-...","severity":"critical","link":"https://ui/scans/..." }
  ],
  "reportUrl": "https://ui/.../scans/sha256:.../report",
  "attestation": { "uuid":"rekor-uuid", "verified": true },
  "ts": "2025-10-18T03:12:45Z"
}
```

**Also**: `report.ready` for “no‑change” summaries (digest + zero delta), which Notify can ignore by rule.
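
For illustration, the payload above maps naturally onto a small C# record. Field names mirror the JSON example and are a sketch, not a published schema; `topFindings` and `attestation` are omitted for brevity.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Mirrors the example rescan.delta payload; illustrative only.
sealed record RescanDeltaEvent(
    [property: JsonPropertyName("tenant")] string Tenant,
    [property: JsonPropertyName("runId")] string RunId,
    [property: JsonPropertyName("imageDigest")] string ImageDigest,
    [property: JsonPropertyName("newCriticals")] int NewCriticals,
    [property: JsonPropertyName("newHigh")] int NewHigh,
    [property: JsonPropertyName("kevHits")] string[] KevHits,
    [property: JsonPropertyName("reportUrl")] string ReportUrl,
    [property: JsonPropertyName("ts")] DateTimeOffset Ts);

// Serialize before publishing to the internal bus, e.g.:
// var json = JsonSerializer.Serialize(evt);
```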

---

## 8) Security posture

* **AuthN/Z**: Authority OpToks with `aud=scheduler`; DPoP (preferred) or mTLS.
* **Multi‑tenant**: every schedule, run, and event carries `tenantId`; ImpactIndex filters by tenant‑visible images.
* **Webhook** callers (Feedser/Vexer) present **mTLS** or **HMAC** and an Authority token.
* **Input hardening**: size caps on changed key lists; reject >100k keys per event; compressed payloads (zstd/gzip) allowed with limits.
* **No secrets** in logs; redact tokens and signatures.

---

## 9) Observability & SLOs

**Metrics (Prometheus)**

* `scheduler.events_total{source, result}`
* `scheduler.impact_resolve_seconds{quantile}`
* `scheduler.images_selected_total{mode}`
* `scheduler.jobs_enqueued_total{mode}`
* `scheduler.run_latency_seconds{quantile}`  // event → first verdict
* `scheduler.delta_images_total{severity}`
* `scheduler.rate_limited_total{reason}`

**Targets**

* Resolve 10k changed keys → impacted set in **<300 ms** (hot cache).
* Event → first rescan verdict in **≤60 s** (p95).
* Nightly coverage of 50k images in **≤10 min** with 10 workers (analysis‑only).

**Tracing** (OTEL): spans `plan`, `resolve`, `enqueue`, `report_call`, `persist`, `emit`.

---

## 10) Configuration (YAML)

```yaml
scheduler:
  authority:
    issuer: "https://authority.internal"
    require: "dpop"              # or "mtls"
  queue:
    kind: "redis"                # or "nats"
    url: "redis://redis:6379/4"
  mongo:
    uri: "mongodb://mongo/scheduler"
  impactIndex:
    storage: "rocksdb"           # "rocksdb" | "redis" | "memory"
    warmOnStart: true
    usageOnlyDefault: true
  limits:
    defaultRatePerSecond: 50
    defaultParallelism: 8
    maxJobsPerRun: 50000
  integrates:
    scannerUrl: "https://scanner-web.internal"
    feedserWebhook: true
    vexerWebhook: true
  notifications:
    emitBus: "internal"          # deliver to Notify via internal bus
```
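
A sketch of binding this YAML into strongly typed options. The class and property names are inferred from the keys above and only a subset is shown; the shipped options types may differ.

```csharp
// Property names mirror the YAML keys above; illustrative, not the shipped options types.
public sealed class SchedulerOptions
{
    public AuthorityOptions Authority { get; set; } = new();
    public LimitsOptions Limits { get; set; } = new();

    public sealed class AuthorityOptions
    {
        public string Issuer { get; set; } = "";
        public string Require { get; set; } = "dpop";
    }

    public sealed class LimitsOptions
    {
        public int DefaultRatePerSecond { get; set; } = 50;
        public int DefaultParallelism { get; set; } = 8;
        public int MaxJobsPerRun { get; set; } = 50_000;
    }
}

// Typical wiring in a minimal-API Program.cs (builder is the WebApplicationBuilder):
// builder.Services.Configure<SchedulerOptions>(builder.Configuration.GetSection("scheduler"));
```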

---

## 11) UI touch‑points

* **Schedules** page: CRUD, enable/pause, next run, last run stats, mode (analysis/content), selector preview.
* **Runs** page: timeline; heat‑map of deltas; drill‑down to affected images.
* **Dry‑run preview** modal: “This Feedser export touches ~3,214 images; projected deltas: ~420 (34 KEV).”

---

## 12) Failure modes & degradations

| Condition | Behavior |
| --- | --- |
| ImpactIndex cold / incomplete | Fall back to **All** selection for nightly; for events, cap to KEV+critical until warmed |
| Feedser/Vexer webhook storm | Coalesce by exportId; debounce 30–60 s; keep last |
| Scanner under load (429) | Back off with jitter; respect per‑tenant leaky bucket |
| Oversubscription (too many impacted) | Prioritize KEV/critical first; spill over to next window; UI banner shows backlog |
| Notify down | Buffer outbound events in queue (TTL 24 h) |
| Mongo slow | Cut batch sizes; sample‑log; alert ops; don’t drop runs unless critical |

---

## 13) Testing matrix

* **ImpactIndex**: correctness (purl→image sets), performance, persistence after restart, memory pressure with 1M purls.
* **Planner**: dedupe, shard, fairness, limit enforcement, KEV prioritization.
* **Runner**: parallel report calls, error backoff, partial failures, idempotency.
* **End‑to‑end**: Feedser export → deltas visible in UI in ≤60 s.
* **Security**: webhook auth (mTLS/HMAC), DPoP nonce dance, tenant isolation.
* **Chaos**: drop scanner availability; simulate registry throttles (content‑refresh mode).
* **Nightly**: cron tick correctness across timezones and DST.

---

## 14) Implementation notes

* **Language**: .NET 10 minimal API; Channels‑based pipeline; `System.Threading.RateLimiting`.
* **Bitmaps**: Roaring via `RoaringBitmap` bindings; memory‑map large shards if RocksDB is used.
* **Cron**: Quartz‑style parser with timezone support; clock skew tolerated ±60 s.
* **Dry‑run**: use ImpactIndex only; never call Scanner.
* **Idempotency**: run segments carry deterministic keys so retries are safe (see the key sketch after this list).
* **Backpressure**: per‑tenant buckets; per‑host registry budgets respected when content‑refresh is enabled.
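
Idempotent run segments need stable identifiers. A minimal sketch of one way to derive a deterministic key from the run id and the sorted digests in a shard; the hashing scheme and canonical form are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Same runId + same digest set => same key, so re-enqueued segments de-duplicate safely.
static string ComputeSegmentKey(string runId, IEnumerable<string> imageDigests)
{
    var canonical = runId + "\n" +
        string.Join("\n", imageDigests.OrderBy(d => d, StringComparer.Ordinal));
    var hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
    return Convert.ToHexString(hash).ToLowerInvariant();
}
```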

---

## 15) Sequences (representative)

**A) Event‑driven rescan (Feedser delta)**

```mermaid
sequenceDiagram
  autonumber
  participant FE as Feedser
  participant SCH as Scheduler.Worker
  participant IDX as ImpactIndex
  participant SC as Scanner.WebService
  participant NO as Notify

  FE->>SCH: POST /events/feedser-export {exportId, changedProductKeys}
  SCH->>IDX: ResolveByPurls(keys, usageOnly=true, sel)
  IDX-->>SCH: bitmap(imageIds) → digests list
  SCH->>SC: POST /reports {imageDigest} (batch/sequenced)
  SC-->>SCH: report deltas (new criticals/highs)
  alt delta>0
    SCH->>NO: rescan.delta {digest, newCriticals, links}
  end
```

**B) Nightly rescan**

```mermaid
sequenceDiagram
  autonumber
  participant CRON as Cron
  participant SCH as Scheduler.Worker
  participant IDX as ImpactIndex
  participant SC as Scanner.WebService

  CRON->>SCH: tick (02:00 Europe/Sofia)
  SCH->>IDX: ResolveAll(selector)
  IDX-->>SCH: candidates
  SCH->>SC: POST /reports {digest} (paced)
  SC-->>SCH: results
  SCH-->>SCH: aggregate, store run stats
```

**C) Content‑refresh (tag followers)**

```mermaid
sequenceDiagram
  autonumber
  participant SCH as Scheduler
  participant SC as Scanner

  SCH->>SC: resolve tag→digest (if changed)
  alt digest changed
    SCH->>SC: POST /scans {imageRef}        # new SBOM
    SC-->>SCH: scan complete (artifacts)
    SCH->>SC: POST /reports {imageDigest}
  else unchanged
    SCH->>SC: POST /reports {imageDigest}   # analysis-only
  end
```

---

## 16) Roadmap

* **Vuln‑centric impact**: pre‑join vuln→purl→images to rank by **KEV** and **exploited‑in‑the‑wild** signals.
* **Policy diff preview**: when a staged policy changes, show the projected breakage set before promotion.
* **Cross‑cluster federation**: one Scheduler instance driving many Scanner clusters (tenant isolation).
* **Windows containers**: integrate Zastava runtime hints for Usage view tightening.

---

**End — component_architecture_scheduler.md**
# Scheduler Worker – Observability & Runbook

## Purpose

Monitor planner and runner health for the Scheduler Worker (Sprint 16 telemetry). The new .NET meters surface queue throughput, latency, backlog, and delta severities so operators can detect stalled runs before rescan SLAs slip.

> **Grafana note:** Import `docs/modules/scheduler/operations/worker-grafana-dashboard.json` into the Prometheus-backed Grafana stack that scrapes the OpenTelemetry Collector.

---

## Key metrics

| Metric | Use case | Suggested query |
| --- | --- | --- |
| `scheduler_planner_runs_total{status}` | Planner throughput & failure ratio | `sum by (status) (rate(scheduler_planner_runs_total[5m]))` |
| `scheduler_planner_latency_seconds_bucket` | Planning latency (p95 / p99) | `histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket[5m])))` |
| `scheduler_runner_segments_total{status}` | Runner success vs retries | `sum by (status) (rate(scheduler_runner_segments_total[5m]))` |
| `scheduler_runner_delta_{critical,high,total}` | Newly-detected findings | `sum(rate(scheduler_runner_delta_critical_total[5m]))` |
| `scheduler_runner_backlog{scheduleId}` | Remaining digests awaiting runner | `max by (scheduleId) (scheduler_runner_backlog)` |
| `scheduler_runs_active{mode}` | Active runs in-flight | `sum(scheduler_runs_active)` |

These reference queries power the bundled Grafana dashboard panels. Use the `mode` template variable to focus on `analysisOnly` versus `contentRefresh` schedules.
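
The names above are the Prometheus-exported forms of the worker's .NET meters. A hedged sketch of how such instruments are typically declared with `System.Diagnostics.Metrics`; the meter name, instrument set, and tag keys are assumptions, not the worker's actual definitions.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative instrument declarations only.
static class SchedulerWorkerMetrics
{
    private static readonly Meter Meter = new("StellaOps.Scheduler.Worker");

    public static readonly Counter<long> PlannerRuns =
        Meter.CreateCounter<long>("scheduler_planner_runs_total");

    public static readonly Histogram<double> PlannerLatencySeconds =
        Meter.CreateHistogram<double>("scheduler_planner_latency_seconds");

    public static readonly UpDownCounter<long> RunnerBacklog =
        Meter.CreateUpDownCounter<long>("scheduler_runner_backlog");
}

// Example usage inside the planner, tagging by outcome:
// SchedulerWorkerMetrics.PlannerRuns.Add(1, new KeyValuePair<string, object?>("status", "completed"));
```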

---

## Grafana dashboard

1. Import `docs/modules/scheduler/operations/worker-grafana-dashboard.json` (UID `scheduler-worker-observability`).
2. Point the `datasource` variable to the Prometheus instance scraping the collector. Optional: pin the `mode` variable to a specific schedule mode.
3. Panels included:
   - **Planner Runs per Status** – visualises the success vs failure ratio.
   - **Planner Latency P95** – highlights degradations in ImpactIndex or Mongo lookups.
   - **Runner Segments per Status** – shows retry pressure and queue health.
   - **New Findings per Severity** – rolls up delta counters (critical/high/total).
   - **Runner Backlog by Schedule** – tabulates outstanding digests per schedule.
   - **Active Runs** – stat panel showing the current number of in-flight runs.

Capture screenshots once Grafana provisioning completes and store them under `docs/assets/dashboards/` (pending automation ticket OBS-157).

---

## Prometheus alerts

Import `docs/modules/scheduler/operations/worker-prometheus-rules.yaml` into your Prometheus rule configuration. The bundle defines:

- **SchedulerPlannerFailuresHigh** – 5%+ of planner runs failed for 10 minutes. Page SRE.
- **SchedulerPlannerLatencyHigh** – planner p95 latency remains above 45 s for 10 minutes. Investigate ImpactIndex, Mongo, and Feedser/Vexer event queues.
- **SchedulerRunnerBacklogGrowing** – backlog exceeded 500 images for 15 minutes. Inspect runner workers, Scanner availability, and rate limiting.
- **SchedulerRunStuck** – active run count stayed flat for 30 minutes while remaining non-zero. Check stuck segments, expired leases, and scanner retries.

Hook these alerts into the existing Observability notification pathway (`observability-pager` routing key) and ensure `service=scheduler-worker` is mapped to the on-call rotation.

---

## Runbook snapshot

1. **Planner failure/latency:**
   - Check planner logs for ImpactIndex or Mongo exceptions.
   - Verify Feedser/Vexer webhook health; requeue events if necessary.
   - If the planner is overwhelmed, temporarily reduce schedule parallelism via `stella scheduler schedule update`.
2. **Runner backlog spike:**
   - Confirm Scanner WebService health (`/healthz`).
   - Inspect the runner queue for stuck segments; consider increasing runner workers or scaling scanner capacity.
   - Review rate limits (schedule limits, ImpactIndex throughput) before changing global throttles.
3. **Stuck runs:**
   - Use `stella scheduler runs list --state running` to identify affected runs.
   - Drill into the Grafana panel “Runner Backlog by Schedule” to see offending schedule IDs.
   - If a segment will not progress, use `stella scheduler segments release --segment <id>` to force a retry after resolving the root cause.
4. **Unexpected critical deltas:**
   - Correlate `scheduler_runner_delta_critical_total` spikes with Notify events (`scheduler.rescan.delta`).
   - Pivot to Scanner report links for impacted digests and confirm they match upstream advisories/policies.

Document incidents and mitigations in `ops/runbooks/INCIDENT_LOG.md` (per SRE policy) and attach Grafana screenshots for post-mortems.

---

## Checklist

- [ ] Grafana dashboard imported and wired to the Prometheus datasource.
- [ ] Prometheus alert rules deployed (see above).
- [ ] Runbook linked from the on-call rotation portal.
- [ ] Observability Guild sign-off captured for Sprint 16 telemetry (OWNER: @obs-guild).