feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules
- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established a similar documentation structure for the Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
New file: docs/modules/scheduler/architecture.md (426 lines)

# component_architecture_scheduler.md — **Stella Ops Scheduler** (2025Q4)

> Synthesises the scheduling requirements documented across the Policy, Vulnerability Explorer, and Orchestrator module guides and implementation plans.

> **Scope.** Implementation‑ready architecture for **Scheduler**: a service that (1) **re‑evaluates** already‑cataloged images when intel changes (Feedser/Vexer/policy), (2) orchestrates **nightly** and **ad‑hoc** runs, (3) targets only the **impacted** images using the BOM‑Index, and (4) emits **report‑ready** events that downstream **Notify** fans out. Default mode is **analysis‑only** (no image pull); optional **content‑refresh** can be enabled per schedule.

---

## 0) Mission & boundaries

**Mission.** Keep scan results **current** without rescanning the world. When new advisories or VEX claims land, **pinpoint** affected images and ask the backend to recompute **verdicts** against the **existing SBOMs**. Surface only **meaningful deltas** to humans and ticket queues.

**Boundaries.**

* Scheduler **does not** compute SBOMs and **does not** sign. It calls Scanner.WebService’s **/reports (analysis‑only)** endpoint and lets the backend (Policy + Vexer + Feedser) decide PASS/FAIL.
* Scheduler **may** ask Scanner to **content‑refresh** selected targets (e.g., mutable tags) but the default is **no** image pull.
* Notifications are **not** sent directly; Scheduler emits events consumed by **Notify**.

---

## 1) Runtime shape & projects

```
src/
 ├─ StellaOps.Scheduler.WebService/      # REST (schedules CRUD, runs, admin)
 ├─ StellaOps.Scheduler.Worker/          # planners + runners (N replicas)
 ├─ StellaOps.Scheduler.ImpactIndex/     # purl→images inverted index (roaring bitmaps)
 ├─ StellaOps.Scheduler.Models/          # DTOs (Schedule, Run, ImpactSet, Deltas)
 ├─ StellaOps.Scheduler.Storage.Mongo/   # schedules, runs, cursors, locks
 ├─ StellaOps.Scheduler.Queue/           # Redis Streams / NATS abstraction
 ├─ StellaOps.Scheduler.Tests.*          # unit/integration/e2e
```

**Deployables**:

* **Scheduler.WebService** (stateless)
* **Scheduler.Worker** (scale‑out; planners + executors)

**Dependencies**: Authority (OpTok + DPoP/mTLS), Scanner.WebService, Feedser, Vexer, MongoDB, Redis/NATS, (optional) Notify.

---

## 2) Core responsibilities

1. **Time‑based** runs: cron windows per tenant/timezone (e.g., “02:00 Europe/Sofia”).
2. **Event‑driven** runs: react to **Feedser export** and **Vexer export** deltas (changed product keys / advisories / claims).
3. **Impact targeting**: map changes to **image sets** using a **global inverted index** built from Scanner’s per‑image **BOM‑Index** sidecars.
4. **Run planning**: shard, pace, and rate‑limit jobs to avoid thundering herds.
5. **Execution**: call Scanner **/reports (analysis‑only)** or **/scans (content‑refresh)**; aggregate **delta** results.
6. **Events**: publish `rescan.delta` and `report.ready` summaries for **Notify** & **UI**.
7. **Control plane**: CRUD schedules, **pause/resume**, dry‑run previews, audit.

---

## 3) Data model (Mongo)

**Database**: `scheduler`

* `schedules`

  ```
  { _id, tenantId, name, enabled, whenCron, timezone,
    mode: "analysis-only" | "content-refresh",
    selection: { scope: "all-images" | "by-namespace" | "by-repo" | "by-digest" | "by-labels",
                 includeTags?: ["prod-*"], digests?: [sha256...], resolvesTags?: bool },
    onlyIf: { lastReportOlderThanDays?: int, policyRevision?: string },
    notify: { onNewFindings: bool, minSeverity: "low|medium|high|critical", includeKEV: bool },
    limits: { maxJobs?: int, ratePerSecond?: int, parallelism?: int },
    createdAt, updatedAt, createdBy, updatedBy }
  ```

* `runs`

  ```
  { _id, scheduleId?, tenantId, trigger: "cron|feedser|vexer|manual",
    reason?: { feedserExportId?, vexerExportId?, cursor? },
    state: "planning|queued|running|completed|error|cancelled",
    stats: { candidates: int, deduped: int, queued: int, completed: int, deltas: int, newCriticals: int },
    startedAt, finishedAt, error? }
  ```

* `impact_cursors`

  ```
  { _id: tenantId, feedserLastExportId, vexerLastExportId, updatedAt }
  ```

* `locks` (singleton schedulers, run leases)

* `audit` (CRUD actions, run outcomes)

**Indexes**:

* `schedules` on `{tenantId, enabled}`, `{whenCron}`.
* `runs` on `{tenantId, startedAt desc}`, `{state}`.
* TTL optional for completed runs (e.g., 180 days).
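
A minimal sketch of how `StellaOps.Scheduler.Storage.Mongo` might create these indexes with the official MongoDB C# driver. Collection and field names mirror the schemas above; the helper name, the `finishedAt` TTL field, and the 180‑day window are assumptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class SchedulerIndexes
{
    public static async Task EnsureAsync(IMongoDatabase db, CancellationToken ct = default)
    {
        var schedules = db.GetCollection<BsonDocument>("schedules");
        var runs = db.GetCollection<BsonDocument>("runs");

        await schedules.Indexes.CreateManyAsync(new[]
        {
            new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys.Ascending("tenantId").Ascending("enabled")),
            new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys.Ascending("whenCron")),
        }, ct);

        await runs.Indexes.CreateManyAsync(new[]
        {
            new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys.Ascending("tenantId").Descending("startedAt")),
            new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys.Ascending("state")),
            // Optional TTL (~180 days) for completed runs; assumes finishedAt is stored as a BSON date.
            new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys.Ascending("finishedAt"),
                new CreateIndexOptions { ExpireAfter = TimeSpan.FromDays(180) }),
        }, ct);
    }
}
```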

---

## 4) ImpactIndex (global inverted index)

Goal: translate **change keys** → **image sets** in **milliseconds**.

**Source**: Scanner produces per‑image **BOM‑Index** sidecars (purls and `usedByEntrypoint` bitmaps). Scheduler ingests/refreshes them to build a **global** index.

**Representation**:

* Assign **image IDs** (dense ints) to catalog images.
* Keep **Roaring Bitmaps**:

  * `Contains[purl]        → bitmap(imageIds)`
  * `UsedBy[purl]          → bitmap(imageIds)` (subset of Contains)
* Optionally keep **Owner maps**: `{imageId → {tenantId, namespaces[], repos[]}}` for selection filters.
* Persist in RocksDB/LMDB or Redis modules; cache hot shards in memory; snapshot to Mongo for cold start.

**Update paths**:

* On new/updated image SBOM: **merge** the per‑image set into the global maps.
* On image remove/expiry: **clear** its id from the bitmaps.

**API (internal)**:

```csharp
IImpactIndex {
  ImpactSet ResolveByPurls(IEnumerable<string> purls, bool usageOnly, Selector sel);
  ImpactSet ResolveByVulns(IEnumerable<string> vulnIds, bool usageOnly, Selector sel); // optional (vuln->purl precomputed by Feedser)
  ImpactSet ResolveAll(Selector sel); // for nightly
}
```
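
For illustration, a toy in‑memory resolver that mirrors `ResolveByPurls`: `HashSet<int>` stands in for the roaring bitmaps, and the `Selector`/`ImpactSet` records here are simplified assumptions rather than the shipped `StellaOps.Scheduler.Models` types. A production index would operate on compressed bitmaps per shard and apply the full selector filters.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record Selector(string TenantId);
public sealed record ImpactSet(string TenantId, IReadOnlyCollection<int> ImageIds, bool UsageOnly);

public sealed class InMemoryImpactIndex
{
    // Contains[purl] -> image ids; UsedBy[purl] -> subset actually loaded at entrypoint.
    private readonly Dictionary<string, HashSet<int>> _contains = new();
    private readonly Dictionary<string, HashSet<int>> _usedBy = new();
    private readonly Dictionary<int, string> _imageTenants = new();

    public void MergeImage(int imageId, string tenantId, IEnumerable<string> purls, IEnumerable<string> usedPurls)
    {
        _imageTenants[imageId] = tenantId;
        foreach (var p in purls) Add(_contains, p, imageId);
        foreach (var p in usedPurls) Add(_usedBy, p, imageId);
    }

    public ImpactSet ResolveByPurls(IEnumerable<string> purls, bool usageOnly, Selector sel)
    {
        var source = usageOnly ? _usedBy : _contains;
        var hits = new HashSet<int>();
        foreach (var purl in purls)
            if (source.TryGetValue(purl, out var ids))
                hits.UnionWith(ids);                                // bitmap OR

        hits.RemoveWhere(id => _imageTenants[id] != sel.TenantId);  // tenant filter
        return new ImpactSet(sel.TenantId, hits.ToArray(), usageOnly);
    }

    private static void Add(Dictionary<string, HashSet<int>> map, string key, int id)
    {
        if (!map.TryGetValue(key, out var set)) map[key] = set = new HashSet<int>();
        set.Add(id);
    }
}
```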

**Selector filters**: tenant, namespaces, repos, labels, digest allowlists, `includeTags` patterns.

---

## 5) External interfaces (REST)

Base path: `/api/v1/scheduler` (Authority OpToks; scopes: `scheduler.read`, `scheduler.admin`).

### 5.1 Schedules CRUD

* `POST /schedules` → create
* `GET /schedules` → list (filter by tenant)
* `GET /schedules/{id}` → details + next run
* `PATCH /schedules/{id}` → pause/resume/update
* `DELETE /schedules/{id}` → delete (soft delete, optional)

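A sketch of how `Scheduler.WebService` could expose this CRUD surface with ASP.NET Core minimal APIs. The `IScheduleStore` abstraction and the DTO shapes are assumptions, and the Authority OpTok authentication/authorization wiring (`scheduler.read` / `scheduler.admin` scopes) is elided.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var builder = WebApplication.CreateBuilder(args);
// builder.Services.AddSingleton<IScheduleStore, MongoScheduleStore>();  // registration elided

var app = builder.Build();
var schedules = app.MapGroup("/api/v1/scheduler/schedules");

schedules.MapPost("/", async (Schedule schedule, IScheduleStore store) =>
{
    var created = await store.CreateAsync(schedule);
    return Results.Created($"/api/v1/scheduler/schedules/{created.Id}", created);
});

schedules.MapGet("/", (string tenantId, IScheduleStore store) => store.ListAsync(tenantId));

schedules.MapGet("/{id}", async (string id, IScheduleStore store) =>
    await store.GetAsync(id) is { } found ? Results.Ok(found) : Results.NotFound());

schedules.MapPatch("/{id}", async (string id, SchedulePatch patch, IScheduleStore store) =>
    Results.Ok(await store.UpdateAsync(id, patch)));

schedules.MapDelete("/{id}", async (string id, IScheduleStore store) =>
{
    await store.SoftDeleteAsync(id);          // soft delete, per the list above
    return Results.NoContent();
});

app.Run();

public sealed record Schedule(string Id, string TenantId, string Name, bool Enabled, string WhenCron);
public sealed record SchedulePatch(bool? Enabled, string? WhenCron);

public interface IScheduleStore
{
    Task<Schedule> CreateAsync(Schedule schedule);
    Task<IReadOnlyList<Schedule>> ListAsync(string tenantId);
    Task<Schedule?> GetAsync(string id);
    Task<Schedule> UpdateAsync(string id, SchedulePatch patch);
    Task SoftDeleteAsync(string id);
}
```
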
### 5.2 Run control & introspection

* `POST /run` — ad‑hoc run

  ```json
  { "mode": "analysis-only|content-refresh", "selection": {...}, "reason": "manual" }
  ```
* `GET /runs` — list with paging
* `GET /runs/{id}` — status, stats, links to deltas
* `POST /runs/{id}/cancel` — best‑effort cancel

### 5.3 Previews (dry‑run)

* `POST /preview/impact` — returns **candidate count** and a small sample of impacted digests for given change keys or selection.

### 5.4 Event webhooks (optional push from Feedser/Vexer)

* `POST /events/feedser-export`

  ```json
  { "exportId":"...", "changedProductKeys":["pkg:rpm/openssl", ...], "kev": ["CVE-..."], "window": { "from":"...","to":"..." } }
  ```
* `POST /events/vexer-export`

  ```json
  { "exportId":"...", "changedClaims":[ { "productKey":"pkg:deb/...", "vulnId":"CVE-...", "status":"not_affected→affected"} ], ... }
  ```

**Security**: webhooks require **mTLS** or a signed `X-Scheduler-Signature` header (HMAC‑SHA‑256 or Ed25519), plus an Authority token.
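
A sketch of the HMAC‑SHA‑256 variant: recompute the MAC over the raw request body with the shared secret and compare it to the `X-Scheduler-Signature` value in constant time. Hex encoding of the header and the secret‑lookup mechanism are assumptions; the Ed25519 path is not shown.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class WebhookSignature
{
    public static bool Verify(string sharedSecret, string rawBody, string signatureHeader)
    {
        using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(sharedSecret));
        byte[] expected = hmac.ComputeHash(Encoding.UTF8.GetBytes(rawBody));

        byte[] presented;
        try { presented = Convert.FromHexString(signatureHeader); }
        catch (FormatException) { return false; }

        // Constant-time comparison avoids timing side channels.
        return CryptographicOperations.FixedTimeEquals(expected, presented);
    }
}
```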

---

## 6) Planner → Runner pipeline

### 6.1 Planning algorithm (event‑driven)

```
On Export Event (Feedser/Vexer):
  keys = Normalize(change payload)          # productKeys or vulnIds→productKeys
  usageOnly = schedule/policy hint?         # default true
  sel = Selector for tenant/scope from schedules subscribed to events

  impacted = ImpactIndex.ResolveByPurls(keys, usageOnly, sel)
  impacted = ApplyOwnerFilters(impacted, sel)           # namespaces/repos/labels
  impacted = DeduplicateByDigest(impacted)
  impacted = EnforceLimits(impacted, limits.maxJobs)
  shards   = Shard(impacted, byHashPrefix, n=limits.parallelism)

  For each shard:
    Enqueue RunSegment (runId, shard, rate=limits.ratePerSecond)
```

**Fairness & pacing**

* Use **leaky bucket** per tenant and per registry host.
* Prioritize **KEV‑tagged** and **critical** first if oversubscribed.

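A sketch of per‑tenant pacing built on `System.Threading.RateLimiting` (see §14), approximating the leaky‑bucket behaviour with a token bucket partitioned by tenant. The numbers are placeholders for the schedule's `limits` block.

```csharp
using System;
using System.Threading.RateLimiting;
using System.Threading.Tasks;

public sealed class TenantPacer
{
    private readonly PartitionedRateLimiter<string> _limiter =
        PartitionedRateLimiter.Create<string, string>(tenantId =>
            RateLimitPartition.GetTokenBucketLimiter(tenantId, _ => new TokenBucketRateLimiterOptions
            {
                TokenLimit = 50,                                  // burst ceiling
                TokensPerPeriod = 50,                             // ≈ limits.ratePerSecond
                ReplenishmentPeriod = TimeSpan.FromSeconds(1),
                QueueLimit = 10_000,
                QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
                AutoReplenishment = true
            }));

    public async Task<bool> WaitAsync(string tenantId)
    {
        using RateLimitLease lease = await _limiter.AcquireAsync(tenantId, permitCount: 1);
        return lease.IsAcquired;   // false => over budget; caller should requeue or back off
    }
}
```
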
### 6.2 Nightly planning

```
At cron tick:
  sel = resolve selection
  candidates = ImpactIndex.ResolveAll(sel)
  if lastReportOlderThanDays present → filter by report age (via Scanner catalog)
  shard & enqueue as above
```

### 6.3 Execution (Runner)

* Pop **RunSegment** job → for each image digest:

  * **analysis‑only**: `POST scanner/reports { imageDigest, policyRevision? }`
  * **content‑refresh**: resolve tag→digest if needed; `POST scanner/scans { imageRef, attest: false }`, then `POST /reports`
* Collect **delta**: `newFindings`, `newCriticals`/`highs`, `links` (UI deep link, Rekor if present).
* Persist per‑image outcome in `runs.{id}.stats` (incremental counters).
* Emit `scheduler.rescan.delta` events to **Notify** only when the **delta > 0** and it matches the severity rule.

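A sketch of the analysis‑only execution loop: pace per tenant, call the scanner's report endpoint per digest, and publish `rescan.delta` only when the delta is non‑zero. The DTO shapes, the `/api/v1/reports` path, and the `IEventPublisher`/`TenantPacer` helpers (the latter from the pacing sketch above) are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading;
using System.Threading.Tasks;

public sealed record ReportRequest(string ImageDigest, string? PolicyRevision);
public sealed record ReportDelta(string ImageDigest, int NewCriticals, int NewHigh, string ReportUrl);

public interface IEventPublisher
{
    Task PublishAsync<T>(string topic, T payload, CancellationToken ct);
}

public sealed class RunSegmentExecutor(HttpClient scanner, IEventPublisher bus, TenantPacer pacer)
{
    public async Task ExecuteAsync(string tenantId, IReadOnlyList<string> digests, CancellationToken ct)
    {
        foreach (var digest in digests)
        {
            if (!await pacer.WaitAsync(tenantId)) continue;          // over budget: requeue elsewhere

            using var response = await scanner.PostAsJsonAsync(
                "/api/v1/reports", new ReportRequest(digest, PolicyRevision: null), ct);
            response.EnsureSuccessStatusCode();

            var delta = await response.Content.ReadFromJsonAsync<ReportDelta>(cancellationToken: ct)
                        ?? throw new InvalidOperationException("empty report response");

            if (delta.NewCriticals + delta.NewHigh > 0)
                await bus.PublishAsync("rescan.delta", delta, ct);   // Notify fans out downstream
        }
    }
}
```
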
---

## 7) Event model (outbound)

**Topic**: `rescan.delta` (internal bus → Notify; UI subscribes via backend).

```json
{
  "tenant": "tenant-01",
  "runId": "324af…",
  "imageDigest": "sha256:…",
  "newCriticals": 1,
  "newHigh": 2,
  "kevHits": ["CVE-2025-..."],
  "topFindings": [
    { "purl":"pkg:rpm/openssl@3.0.12-...","vulnId":"CVE-2025-...","severity":"critical","link":"https://ui/scans/..." }
  ],
  "reportUrl": "https://ui/.../scans/sha256:.../report",
  "attestation": { "uuid":"rekor-uuid", "verified": true },
  "ts": "2025-10-18T03:12:45Z"
}
```

**Also**: `report.ready` for “no‑change” summaries (digest + zero delta), which Notify can ignore by rule.

---

## 8) Security posture

* **AuthN/Z**: Authority OpToks with `aud=scheduler`; DPoP (preferred) or mTLS.
* **Multi‑tenant**: every schedule, run, and event carries `tenantId`; ImpactIndex filters by tenant‑visible images.
* **Webhook** callers (Feedser/Vexer) present **mTLS** or **HMAC** and an Authority token.
* **Input hardening**: size caps on changed key lists; reject >100k keys per event; compress (zstd/gzip) allowed with limits.
* **No secrets** in logs; redact tokens and signatures.

---

## 9) Observability & SLOs

**Metrics (Prometheus)**

* `scheduler.events_total{source, result}`
* `scheduler.impact_resolve_seconds{quantile}`
* `scheduler.images_selected_total{mode}`
* `scheduler.jobs_enqueued_total{mode}`
* `scheduler.run_latency_seconds{quantile}` // event → first verdict
* `scheduler.delta_images_total{severity}`
* `scheduler.rate_limited_total{reason}`
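
A sketch of how a few of these meters could be registered with `System.Diagnostics.Metrics` so an OpenTelemetry Prometheus exporter can scrape them. Metric names follow the list above; the meter name and tag keys are assumptions.

```csharp
using System.Diagnostics.Metrics;

public sealed class SchedulerMetrics
{
    private static readonly Meter Meter = new("StellaOps.Scheduler");

    private readonly Counter<long> _events =
        Meter.CreateCounter<long>("scheduler.events_total");
    private readonly Histogram<double> _impactResolveSeconds =
        Meter.CreateHistogram<double>("scheduler.impact_resolve_seconds", unit: "s");
    private readonly Counter<long> _deltaImages =
        Meter.CreateCounter<long>("scheduler.delta_images_total");

    public void EventReceived(string source, string result) =>
        _events.Add(1, new("source", source), new("result", result));

    public void ImpactResolved(double seconds) => _impactResolveSeconds.Record(seconds);

    public void DeltaImage(string severity) => _deltaImages.Add(1, new("severity", severity));
}
```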

**Targets**

* Resolve 10k changed keys → impacted set in **<300 ms** (hot cache).
* Event → first rescan verdict in **≤60 s** (p95).
* Nightly coverage of 50k images in **≤10 min** with 10 workers (analysis‑only).

**Tracing** (OTEL): spans `plan`, `resolve`, `enqueue`, `report_call`, `persist`, `emit`.

---

## 10) Configuration (YAML)

```yaml
scheduler:
  authority:
    issuer: "https://authority.internal"
    require: "dpop"            # or "mtls"
  queue:
    kind: "redis"              # or "nats"
    url: "redis://redis:6379/4"
  mongo:
    uri: "mongodb://mongo/scheduler"
  impactIndex:
    storage: "rocksdb"         # "rocksdb" | "redis" | "memory"
    warmOnStart: true
    usageOnlyDefault: true
  limits:
    defaultRatePerSecond: 50
    defaultParallelism: 8
    maxJobsPerRun: 50000
  integrates:
    scannerUrl: "https://scanner-web.internal"
    feedserWebhook: true
    vexerWebhook: true
  notifications:
    emitBus: "internal"        # deliver to Notify via internal bus
```
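
A sketch of options classes mirroring this YAML, assuming a YAML configuration provider loads the file into `IConfiguration`. Property and section names follow the keys above; the class names are assumptions.

```csharp
public sealed class SchedulerOptions
{
    public AuthorityOptions Authority { get; set; } = new();
    public QueueOptions Queue { get; set; } = new();
    public ImpactIndexOptions ImpactIndex { get; set; } = new();
    public LimitsOptions Limits { get; set; } = new();
}

public sealed class AuthorityOptions
{
    public string Issuer { get; set; } = "";
    public string Require { get; set; } = "dpop";        // "dpop" | "mtls"
}

public sealed class QueueOptions
{
    public string Kind { get; set; } = "redis";          // "redis" | "nats"
    public string Url { get; set; } = "";
}

public sealed class ImpactIndexOptions
{
    public string Storage { get; set; } = "rocksdb";     // "rocksdb" | "redis" | "memory"
    public bool WarmOnStart { get; set; } = true;
    public bool UsageOnlyDefault { get; set; } = true;
}

public sealed class LimitsOptions
{
    public int DefaultRatePerSecond { get; set; } = 50;
    public int DefaultParallelism { get; set; } = 8;
    public int MaxJobsPerRun { get; set; } = 50_000;
}

// In WebService/Worker startup (binder call, assuming the "scheduler" section is loaded):
// builder.Services.Configure<SchedulerOptions>(builder.Configuration.GetSection("scheduler"));
```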

---

## 11) UI touch‑points

* **Schedules** page: CRUD, enable/pause, next run, last run stats, mode (analysis/content), selector preview.
* **Runs** page: timeline; heat‑map of deltas; drill‑down to affected images.
* **Dry‑run preview** modal: “This Feedser export touches ~3,214 images; projected deltas: ~420 (34 KEV).”

---

## 12) Failure modes & degradations

| Condition                            | Behavior                                                                                 |
| ------------------------------------ | ---------------------------------------------------------------------------------------- |
| ImpactIndex cold / incomplete        | Fall back to **All** selection for nightly; for events, cap to KEV+critical until warmed |
| Feedser/Vexer webhook storm          | Coalesce by exportId; debounce 30–60 s; keep only the latest (see the sketch below the table) |
| Scanner under load (429)             | Backoff with jitter; respect per‑tenant/leaky bucket                                     |
| Oversubscription (too many impacted) | Prioritize KEV/critical first; spillover to next window; UI banner shows backlog         |
| Notify down                          | Buffer outbound events in queue (TTL 24h)                                                |
| Mongo slow                           | Cut batch sizes; sample‑log; alert ops; don’t drop runs unless critical                  |

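A sketch of the webhook‑storm row: coalesce repeated deliveries per `exportId` and hand only the latest to the planner after a quiet period. The `ExportDebouncer` name, the callback shape, and the window handling are assumptions.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class ExportDebouncer(TimeSpan window, Func<string, Task> plan)
{
    private readonly ConcurrentDictionary<string, CancellationTokenSource> _pending = new();

    public void Observe(string exportId)
    {
        var cts = new CancellationTokenSource();
        // A repeat of the same exportId cancels the previous timer, restarting the quiet period.
        _pending.AddOrUpdate(exportId, cts, (_, old) => { old.Cancel(); return cts; });

        _ = Task.Run(async () =>
        {
            try
            {
                await Task.Delay(window, cts.Token);     // e.g. 30-60 s
                _pending.TryRemove(exportId, out _);
                await plan(exportId);                    // hand the coalesced export to the planner
            }
            catch (OperationCanceledException) { /* superseded by a newer delivery */ }
        });
    }
}
```
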
---

## 13) Testing matrix

* **ImpactIndex**: correctness (purl→image sets), performance, persistence after restart, memory pressure with 1M purls.
* **Planner**: dedupe, shard, fairness, limit enforcement, KEV prioritization.
* **Runner**: parallel report calls, error backoff, partial failures, idempotency.
* **End‑to‑end**: Feedser export → deltas visible in UI in ≤60 s.
* **Security**: webhook auth (mTLS/HMAC), DPoP nonce dance, tenant isolation.
* **Chaos**: drop scanner availability; simulate registry throttles (content‑refresh mode).
* **Nightly**: cron tick correctness across timezones and DST.

---

## 14) Implementation notes

* **Language**: .NET 10 minimal API; Channels‑based pipeline; `System.Threading.RateLimiting`.
* **Bitmaps**: Roaring via `RoaringBitmap` bindings; memory‑map large shards if RocksDB is used.
* **Cron**: Quartz‑style parser with timezone support; clock skew tolerated ±60 s.
* **Dry‑run**: use ImpactIndex only; never call the scanner.
* **Idempotency**: run segments carry deterministic keys so retries are safe (see the key sketch below this list).
* **Backpressure**: per‑tenant buckets; per‑host registry budgets respected when content‑refresh is enabled.
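
A sketch of one way to derive the deterministic segment keys mentioned above: hash a canonical string of the run, digest, policy revision, and mode, so retries and duplicate deliveries map to the same job. The field choice is an assumption based on §6.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class SegmentKey
{
    public static string Compute(string runId, string imageDigest, string? policyRevision, string mode)
    {
        // Canonical, order-stable input => identical key on every retry.
        var canonical = $"{runId}|{imageDigest}|{policyRevision ?? "-"}|{mode}";
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(hash).ToLowerInvariant();
    }
}
```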

---

## 15) Sequences (representative)

**A) Event‑driven rescan (Feedser delta)**

```mermaid
sequenceDiagram
  autonumber
  participant FE as Feedser
  participant SCH as Scheduler.Worker
  participant IDX as ImpactIndex
  participant SC as Scanner.WebService
  participant NO as Notify

  FE->>SCH: POST /events/feedser-export {exportId, changedProductKeys}
  SCH->>IDX: ResolveByPurls(keys, usageOnly=true, sel)
  IDX-->>SCH: bitmap(imageIds) → digests list
  SCH->>SC: POST /reports {imageDigest}  (batch/sequenced)
  SC-->>SCH: report deltas (new criticals/highs)
  alt delta>0
    SCH->>NO: rescan.delta {digest, newCriticals, links}
  end
```

**B) Nightly rescan**

```mermaid
sequenceDiagram
  autonumber
  participant CRON as Cron
  participant SCH as Scheduler.Worker
  participant IDX as ImpactIndex
  participant SC as Scanner.WebService

  CRON->>SCH: tick (02:00 Europe/Sofia)
  SCH->>IDX: ResolveAll(selector)
  IDX-->>SCH: candidates
  SCH->>SC: POST /reports {digest} (paced)
  SC-->>SCH: results
  SCH-->>SCH: aggregate, store run stats
```

**C) Content‑refresh (tag followers)**

```mermaid
sequenceDiagram
  autonumber
  participant SCH as Scheduler
  participant SC as Scanner
  SCH->>SC: resolve tag→digest (if changed)
  alt digest changed
    SCH->>SC: POST /scans {imageRef}    # new SBOM
    SC-->>SCH: scan complete (artifacts)
    SCH->>SC: POST /reports {imageDigest}
  else unchanged
    SCH->>SC: POST /reports {imageDigest}  # analysis-only
  end
```

---

## 16) Roadmap

* **Vuln‑centric impact**: pre‑join vuln→purl→images to rank by **KEV** and **exploited‑in‑the‑wild** signals.
* **Policy diff preview**: when a staged policy changes, show projected breakage set before promotion.
* **Cross‑cluster federation**: one Scheduler instance driving many Scanner clusters (tenant isolation).
* **Windows containers**: integrate Zastava runtime hints for Usage view tightening.

---

**End — component_architecture_scheduler.md**