consolidation of some of the modules, localization fixes, product advisories work, qa work

This commit is contained in:
master
2026-03-05 03:54:22 +02:00
parent 7bafcc3eef
commit 8e1cb9448d
3878 changed files with 72600 additions and 46861 deletions

@@ -0,0 +1,39 @@
# Scheduler agent guide
## Mission
Scheduler detects advisory/VEX deltas, computes impact windows, and orchestrates re-evaluations across Scanner and Policy Engine. Docs in this directory are the front-door contract for contributors.
## Working directory
- `docs/modules/scheduler` (docs-only); code changes live under `src/Scheduler/**` but must be coordinated via sprint plans.
## Roles & owners
- **Docs author**: curates AGENTS/TASKS/runbooks; keeps determinism/offline guidance accurate.
- **Scheduler engineer (Worker/WebService)**: aligns implementation notes with architecture and ensures observability/runbook updates land with code.
- **Observability/Ops**: maintains dashboards/rules, documents operational SLOs and alert contracts.
## Required Reading
- `docs/modules/scheduler/README.md`
- `docs/modules/scheduler/architecture.md`
- `docs/modules/scheduler/implementation_plan.md`
- `docs/modules/platform/architecture-overview.md`
## How to work
1. Open relevant sprint file in `docs/implplan/SPRINT_*.md` and set task status to `DOING` there and in `docs/modules/scheduler/TASKS.md` before starting.
2. Confirm prerequisites above are read; note any missing contracts in sprint **Decisions & Risks**.
3. Keep outputs deterministic (stable ordering, UTC ISO-8601 timestamps, sorted lists) and offline-friendly (no external fetches without mirrors).
4. When changing behavior, update runbooks and observability assets in `./operations/`.
5. On completion, set status to `DONE` in both the sprint file and `TASKS.md`; if paused, revert to `TODO` and add a brief note.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see `../../aoc/aggregation-only-contract.md`).
- No undocumented schema or API contract changes; document deltas in architecture or implementation_plan.
- Keep Offline Kit parity—document air-gapped workflows for any new feature.
- Prefer deterministic fixtures and avoid machine-specific artefacts in examples.
## Testing & determinism expectations
- Examples and snippets should be reproducible; pin sample timestamps to UTC and sort collections.
- Observability examples must align with published metric names and labels; update `operations/worker-prometheus-rules.yaml` if alert semantics change.
## Status mirrors
- Sprint tracker: `/docs/implplan/SPRINT_*.md` (source of record for Delivery Tracker).
- Local tracker: `docs/modules/scheduler/TASKS.md` (mirrors sprint status; keep in sync).

@@ -0,0 +1,72 @@
# StellaOps Scheduler
Scheduler detects advisory/VEX deltas, computes impact windows, and orchestrates re-evaluations across Scanner and Policy Engine.
## Responsibilities
- Maintain impact cursors and queues for re-scan/re-evaluate jobs.
- Expose APIs for policy-triggered rechecks and runtime hooks.
- Emit DSSE-backed completion events for downstream consumers (UI, Notify).
- Provide SLA-aware retry logic with deterministic evaluation windows.
## Key components
- `StellaOps.Scheduler.WebService` control plane.
- `StellaOps.Scheduler.Worker` job executor.
- Shared libraries under `StellaOps.Scheduler.*`.
## Integrations & dependencies
- PostgreSQL (schema `scheduler`) for impact models.
- Valkey/NATS for queueing.
- Policy Engine, Scanner, Notify.
## Operational notes
- Monitoring assets: `./operations/worker-grafana-dashboard.json` and `./operations/worker-prometheus-rules.yaml`.
- Operational runbook: `./operations/worker.md`.
## Related resources
- `./operations/worker.md`
- `./operations/worker-grafana-dashboard.json`
- `./operations/worker-prometheus-rules.yaml`
## Backlog references
- SCHED-MODELS-20-001 (policy run DTOs) and related tasks in `../../TASKS.md`.
- Scheduler observability follow-ups in `src/Scheduler/**/TASKS.md`.
## Implementation Status
### Current Objectives
- Maintain deterministic behaviour and offline parity across releases
- Keep documentation, telemetry, and runbooks aligned with latest sprint outcomes
- Coordinate with Policy Engine for incremental re-evaluation workflows
### Epic Milestones
- Epic 2 Policy Engine & Editor: incremental policy run orchestration, change streams, explain trace propagation (in progress)
- Epic 6 Vulnerability Explorer: findings updates and remediation triggers integration (in progress)
- Epic 9 Orchestrator Dashboard: job telemetry and control surfaces for UI/CLI (planned)
### Core Capabilities
- Impact cursor maintenance and queue management for re-scan/re-evaluate jobs
- Change-stream detection for advisory/VEX/SBOM deltas
- Policy-triggered recheck orchestration with runtime hooks
- SLA-aware retry logic with deterministic evaluation windows
- DSSE-backed completion events for downstream consumers
### Integration Points
- PostgreSQL schema (scheduler) for impact models and job state
- Valkey/NATS for queueing with idempotency
- Policy Engine, Scanner, Notify for job coordination
- Orchestrator for backfills and incident routing
### Operational Assets
- Monitoring: worker-grafana-dashboard.json, worker-prometheus-rules.yaml
- Runbooks: operations/worker.md
- Observability: metrics, traces, structured logs with correlation IDs
### Technical Notes
- Coordination approach: review `AGENTS.md`; sync via `docs/implplan/SPRINT_*.md`
- Backlog tracking: SCHED-MODELS-20-001 and related tasks in `../../TASKS.md`
- Module tasks: `src/Scheduler/**/TASKS.md`
## Epic alignment
- **Epic 2 Policy Engine & Editor:** orchestrate incremental re-evaluation and simulation runs when raw facts or policies change.
- **Epic 6 Vulnerability Explorer:** feed triage workflows with up-to-date job status, explain traces, and ledger hooks.
- **Epic 9 Orchestrator Dashboard:** expose job telemetry, throttling, and replay controls through orchestration dashboards.

@@ -0,0 +1,9 @@
# Scheduler Module Task Board
This board mirrors active Scheduler sprint(s). Update alongside the sprint tracker.
Source of truth: `docs/implplan/SPRINT_*.md`.
| Task ID | Status | Notes |
| --- | --- | --- |
| TBD | TODO | Populate from active sprint. |

@@ -0,0 +1,434 @@
# component_architecture_scheduler.md — **StellaOps Scheduler** (2025Q4)
> Synthesises the scheduling requirements documented across the Policy, Vulnerability Explorer, and Orchestrator module guides and implementation plans.
> **Scope.** Implementation-ready architecture for **Scheduler**: a service that (1) **re-evaluates** already-cataloged images when intel changes (Conselier/Excitor/policy), (2) orchestrates **nightly** and **ad-hoc** runs, (3) targets only the **impacted** images using the BOM-Index, and (4) emits **report-ready** events that downstream **Notify** fans out. Default mode is **analysis-only** (no image pull); optional **content-refresh** can be enabled per schedule.
---
## 0) Mission & boundaries
**Mission.** Keep scan results **current** without rescanning the world. When new advisories or VEX claims land, **pinpoint** affected images and ask the backend to recompute **verdicts** against the **existing SBOMs**. Surface only **meaningful deltas** to humans and ticket queues.
**Boundaries.**
* Scheduler **does not** compute SBOMs and **does not** sign. It calls Scanner.WebService's **/reports (analysis-only)** endpoint and lets the backend (Policy + Excitor + Conselier) decide PASS/FAIL.
* Scheduler **may** ask Scanner to **content-refresh** selected targets (e.g., mutable tags), but the default is **no** image pull.
* Notifications are **not** sent directly; Scheduler emits events consumed by **Notify**.
---
## 1) Runtime shape & projects
```
src/
├─ StellaOps.Scheduler.WebService/ # REST (schedules CRUD, runs, admin)
├─ StellaOps.Scheduler.Worker/ # planners + runners (N replicas)
├─ StellaOps.Scheduler.ImpactIndex/ # purl→images inverted index (roaring bitmaps)
├─ StellaOps.Scheduler.Models/ # DTOs (Schedule, Run, ImpactSet, Deltas)
├─ StellaOps.Scheduler.Storage.Postgres/ # schedules, runs, cursors, locks
├─ StellaOps.Scheduler.Queue/ # Valkey Streams / NATS abstraction
├─ StellaOps.Scheduler.Tests.* # unit/integration/e2e
```
**Deployables**:
* **Scheduler.WebService** (stateless)
* **Scheduler.Worker** (scale-out; planners + executors)
**Dependencies**: Authority (OpTok + DPoP/mTLS), Scanner.WebService, Conselier, Excitor, PostgreSQL, Valkey/NATS, (optional) Notify.
---
## 2) Core responsibilities
1. **Time-based** runs: cron windows per tenant/timezone (e.g., “02:00 Europe/Sofia”).
2. **Event-driven** runs: react to **Conselier export** and **Excitor export** deltas (changed product keys / advisories / claims).
3. **Impact targeting**: map changes to **image sets** using a **global inverted index** built from Scanner's per-image **BOM-Index** sidecars.
4. **Run planning**: shard, pace, and rate-limit jobs to avoid thundering herds.
5. **Execution**: call Scanner **/reports (analysis-only)** or **/scans (content-refresh)**; aggregate **delta** results.
6. **Events**: publish `rescan.delta` and `report.ready` summaries for **Notify** & **UI**.
7. **Control plane**: CRUD schedules, **pause/resume**, dry-run previews, audit.
---
## 3) Data model (PostgreSQL)
**Database**: `scheduler`
* `schedules`
```
{ _id, tenantId, name, enabled, whenCron, timezone,
mode: "analysis-only" | "content-refresh",
selection: { scope: "all-images" | "by-namespace" | "by-repo" | "by-digest" | "by-labels",
includeTags?: ["prod-*"], digests?: [sha256...], resolvesTags?: bool },
onlyIf: { lastReportOlderThanDays?: int, policyRevision?: string },
notify: { onNewFindings: bool, minSeverity: "low|medium|high|critical", includeKEV: bool },
limits: { maxJobs?: int, ratePerSecond?: int, parallelism?: int },
createdAt, updatedAt, createdBy, updatedBy }
```
* `runs`
```
{ _id, scheduleId?, tenantId, trigger: "cron|conselier|excitor|manual",
reason?: { conselierExportId?, excitorExportId?, cursor? },
state: "planning|queued|running|completed|error|cancelled",
stats: { candidates: int, deduped: int, queued: int, completed: int, deltas: int, newCriticals: int },
startedAt, finishedAt, error? }
```
* `impact_cursors`
```
{ _id: tenantId, conselierLastExportId, excitorLastExportId, updatedAt }
```
* `locks` (singleton schedulers, run leases)
* `audit` (CRUD actions, run outcomes)
**Indexes**:
* `schedules` on `{tenantId, enabled}`, `{whenCron}`.
* `runs` on `{tenantId, startedAt desc}`, `{state}`.
* TTL optional for completed runs (e.g., 180 days).
---
## 4) ImpactIndex (global inverted index)
Goal: translate **change keys** → **image sets** in **milliseconds**.
**Source**: Scanner produces per-image **BOM-Index** sidecars (purls, and `usedByEntrypoint` bitmaps). Scheduler ingests/refreshes them to build a **global** index.
**Representation**:
* Assign **image IDs** (dense ints) to catalog images.
* Keep **Roaring Bitmaps**:
* `Contains[purl] → bitmap(imageIds)`
* `UsedBy[purl] → bitmap(imageIds)` (subset of Contains)
* Optionally keep **Owner maps**: `{imageId → {tenantId, namespaces[], repos[]}}` for selection filters.
* Persist in RocksDB/LMDB or Valkey modules; cache hot shards in memory; snapshot to PostgreSQL for cold start.
**Update paths**:
* On new/updated image SBOM: **merge** the per-image set into the global maps.
* On image remove/expiry: **clear** id from bitmaps.
**API (internal)**:
```csharp
IImpactIndex {
ImpactSet ResolveByPurls(IEnumerable<string> purls, bool usageOnly, Selector sel);
ImpactSet ResolveByVulns(IEnumerable<string> vulnIds, bool usageOnly, Selector sel); // optional (vuln->purl precomputed by Conselier)
ImpactSet ResolveAll(Selector sel); // for nightly
}
```
**Selector filters**: tenant, namespaces, repos, labels, digest allowlists, `includeTags` patterns.
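As an illustration of the lookup above, here is a minimal Python sketch in which plain integers stand in for Roaring bitmaps. The class and method names are hypothetical, not the real `StellaOps.Scheduler.ImpactIndex` API, and selector filtering is omitted:

```python
# Sketch of purl -> image-set resolution; ints stand in for Roaring bitmaps.
class ImpactIndex:
    def __init__(self):
        self.contains = {}   # purl -> bitmap of image ids
        self.used_by = {}    # purl -> bitmap (subset of contains)

    def add_image(self, image_id, purls, used_purls=()):
        for p in purls:
            self.contains[p] = self.contains.get(p, 0) | (1 << image_id)
        for p in used_purls:
            self.used_by[p] = self.used_by.get(p, 0) | (1 << image_id)

    def resolve_by_purls(self, purls, usage_only=True):
        source = self.used_by if usage_only else self.contains
        acc = 0
        for p in purls:
            acc |= source.get(p, 0)          # union across changed keys
        return [i for i in range(acc.bit_length()) if acc >> i & 1]

idx = ImpactIndex()
idx.add_image(0, ["pkg:rpm/openssl"], used_purls=["pkg:rpm/openssl"])
idx.add_image(1, ["pkg:rpm/openssl"])                 # present but unused
idx.add_image(2, ["pkg:deb/zlib"], used_purls=["pkg:deb/zlib"])

print(idx.resolve_by_purls(["pkg:rpm/openssl"], usage_only=True))   # [0]
print(idx.resolve_by_purls(["pkg:rpm/openssl"], usage_only=False))  # [0, 1]
```

The `usageOnly` flag maps to choosing the `UsedBy` map over `Contains`, which is why the former must stay a subset of the latter.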
---
## 5) External interfaces (REST)
Base path: `/api/v1/scheduler` (Authority OpToks; scopes: `scheduler.read`, `scheduler.admin`).
### 5.1 Schedules CRUD
* `POST /schedules` → create
* `GET /schedules` → list (filter by tenant)
* `GET /schedules/{id}` → details + next run
* `PATCH /schedules/{id}` → pause/resume/update
* `DELETE /schedules/{id}` → delete (soft delete, optional)
### 5.2 Run control & introspection
* `POST /run` — ad-hoc run
```json
{ "mode": "analysis-only|content-refresh", "selection": {...}, "reason": "manual" }
```
* `GET /runs` — list with paging
* `GET /runs/{id}` — status, stats, links to deltas
* `POST /runs/{id}/cancel` — best-effort cancel
### 5.3 Previews (dry-run)
* `POST /preview/impact` — returns **candidate count** and a small sample of impacted digests for given change keys or selection.
### 5.4 Event webhooks (optional push from Conselier/Excitor)
* `POST /events/conselier-export`
```json
{ "exportId":"...", "changedProductKeys":["pkg:rpm/openssl", ...], "kev": ["CVE-..."], "window": { "from":"...","to":"..." } }
```
* `POST /events/excitor-export`
```json
{ "exportId":"...", "changedClaims":[ { "productKey":"pkg:deb/...", "vulnId":"CVE-...", "status":"not_affected→affected"} ], ... }
```
**Security**: webhook callers require **mTLS** or a signed `X-Scheduler-Signature` header (Ed25519 or HMAC-SHA-256), plus an Authority token.
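For the HMAC variant, verification is a recompute-and-compare over the raw request body. A minimal sketch, assuming hex-encoded HMAC-SHA-256 (the exact signing string and encoding are not specified here):

```python
# Sketch of verifying the X-Scheduler-Signature webhook header.
# The signing string and hex encoding are assumptions, not the documented contract.
import hashlib
import hmac

def sign(secret: bytes, body: bytes) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify(secret: bytes, body: bytes, header_value: str) -> bool:
    # compare_digest avoids leaking the mismatch position via timing
    return hmac.compare_digest(sign(secret, body), header_value)

secret = b"shared-webhook-secret"
body = b'{"exportId":"exp-1","changedProductKeys":["pkg:rpm/openssl"]}'
sig = sign(secret, body)
print(verify(secret, body, sig))            # True
print(verify(secret, body + b" ", sig))     # False: any body change invalidates
```

Verifying over the raw bytes (before JSON parsing) matters, since re-serialization is not guaranteed to be byte-identical.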
---
## 6) Planner → Runner pipeline
### 6.1 Planning algorithm (event-driven)
```
On Export Event (Conselier/Excitor):
keys = Normalize(change payload) # productKeys or vulnIds→productKeys
usageOnly = schedule/policy hint? # default true
sel = Selector for tenant/scope from schedules subscribed to events
impacted = ImpactIndex.ResolveByPurls(keys, usageOnly, sel)
impacted = ApplyOwnerFilters(impacted, sel) # namespaces/repos/labels
impacted = DeduplicateByDigest(impacted)
impacted = EnforceLimits(impacted, limits.maxJobs)
shards = Shard(impacted, byHashPrefix, n=limits.parallelism)
For each shard:
Enqueue RunSegment (runId, shard, rate=limits.ratePerSecond)
```
**Fairness & pacing**
* Use **leaky bucket** per tenant and per registry host.
* Prioritize **KEV-tagged** and **critical** first if oversubscribed.
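The dedupe → limit → shard steps from the planning algorithm can be sketched as follows; hash-prefix sharding keeps shard assignment deterministic across retries (function and variable names are illustrative):

```python
# Sketch of the planner's dedupe -> cap -> shard pipeline.
import hashlib

def plan_shards(digests, max_jobs, parallelism):
    impacted = sorted(set(digests))[:max_jobs]        # dedupe + cap, stable order
    shards = [[] for _ in range(parallelism)]
    for d in impacted:
        prefix = hashlib.sha256(d.encode()).digest()[0]   # hash-prefix routing
        shards[prefix % parallelism].append(d)
    return shards

digests = ["sha256:a", "sha256:b", "sha256:a", "sha256:c"]
shards = plan_shards(digests, max_jobs=10, parallelism=2)
print(sum(len(s) for s in shards))            # 3: duplicate digest removed
print(plan_shards(digests, 10, 2) == shards)  # True: deterministic re-plan
```

Sorting before capping makes `limits.maxJobs` enforcement reproducible for a given impacted set; a production planner would cap KEV/critical-first rather than lexicographically.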
### 6.2 Nightly planning
```
At cron tick:
sel = resolve selection
candidates = ImpactIndex.ResolveAll(sel)
if lastReportOlderThanDays present → filter by report age (via Scanner catalog)
shard & enqueue as above
```
### 6.3 Execution (Runner)
* Pop a **RunSegment** job → for each image digest:
  * **analysis-only**: `POST scanner/reports { imageDigest, policyRevision? }`
  * **content-refresh**: resolve tag→digest if needed; `POST scanner/scans { imageRef, attest? false }` then `POST /reports`
* Collect the **delta**: `newFindings`, `newCriticals`/`highs`, `links` (UI deep link, Rekor if present).
* Persist per-image outcomes in `runs.{id}.stats` (incremental counters).
* Emit `scheduler.rescan.delta` events to **Notify** only when **delta > 0** and the severity rule matches.
---
## 7) Event model (outbound)
**Topic**: `rescan.delta` (internal bus → Notify; UI subscribes via backend).
```json
{
"tenant": "tenant-01",
"runId": "324af…",
"imageDigest": "sha256:…",
"newCriticals": 1,
"newHigh": 2,
"kevHits": ["CVE-2025-..."],
"topFindings": [
{ "purl":"pkg:rpm/openssl@3.0.12-...","vulnId":"CVE-2025-...","severity":"critical","link":"https://ui/scans/..." }
],
"reportUrl": "https://ui/.../scans/sha256:.../report",
"attestation": { "uuid":"rekor-uuid", "verified": true },
"ts": "2025-10-18T03:12:45Z"
}
```
**Also**: `report.ready` for “no-change” summaries (digest + zero delta), which Notify can ignore by rule.
---
## 8) Security posture
* **AuthN/Z**: Authority OpToks with `aud=scheduler`; DPoP (preferred) or mTLS.
* **Multi-tenant**: every schedule, run, and event carries `tenantId`; ImpactIndex filters by tenant-visible images.
* **Webhook** callers (Conselier/Excitor) present **mTLS** or **HMAC** and Authority token.
* **Input hardening**: size caps on changed key lists; reject >100k keys per event; compress (zstd/gzip) allowed with limits.
* **No secrets** in logs; redact tokens and signatures.
---
## 9) Observability & SLOs
**Metrics (Prometheus)**
* `scheduler.events_total{source, result}`
* `scheduler.impact_resolve_seconds{quantile}`
* `scheduler.images_selected_total{mode}`
* `scheduler.jobs_enqueued_total{mode}`
* `scheduler.run_latency_seconds{quantile}` // event → first verdict
* `scheduler.delta_images_total{severity}`
* `scheduler.rate_limited_total{reason}`
* `policy_simulation_queue_depth{status}` (WebService gauge)
* `policy_simulation_latency_seconds` (WebService histogram)
**Targets**
* Resolve 10k changed keys → impacted set in **<300ms** (hot cache).
* Event → first rescan verdict in **≤60s** (p95).
* Nightly coverage of 50k images in **≤10min** with 10 workers (analysis-only).
**Tracing** (OTEL): spans `plan`, `resolve`, `enqueue`, `report_call`, `persist`, `emit`.
**Webhooks**
* Policy simulation webhooks fire on terminal states (`succeeded`, `failed`, `cancelled`).
* Configure under `Scheduler:Worker:Policy:Webhook` (see `SCHED-WEB-27-002-POLICY-SIMULATION-WEBHOOKS.md`).
* Requests include headers `X-StellaOps-Tenant` and `X-StellaOps-Run-Id` for idempotency.
---
## 10) Configuration (YAML)
```yaml
scheduler:
authority:
issuer: "https://authority.internal"
require: "dpop" # or "mtls"
queue:
kind: "valkey" # or "nats" (valkey uses redis:// protocol)
url: "redis://valkey:6379/4"
postgres:
connectionString: "Host=postgres;Port=5432;Database=scheduler;Username=stellaops;Password=stellaops"
impactIndex:
storage: "rocksdb" # "rocksdb" | "valkey" | "memory"
warmOnStart: true
usageOnlyDefault: true
limits:
defaultRatePerSecond: 50
defaultParallelism: 8
maxJobsPerRun: 50000
integrates:
scannerUrl: "https://scanner-web.internal"
conselierWebhook: true
excitorWebhook: true
notifications:
emitBus: "internal" # deliver to Notify via internal bus
```
---
## 11) UI touchpoints
* **Schedules** page: CRUD, enable/pause, next run, last-run stats, mode (analysis/content), selector preview.
* **Runs** page: timeline; heatmap of deltas; drill-down to affected images.
* **Dry-run preview** modal: “This Conselier export touches ~3,214 images; projected deltas: ~420 (34 KEV).”
---
## 12) Failure modes & degradations
| Condition | Behavior |
| ------------------------------------ | ---------------------------------------------------------------------------------------- |
| ImpactIndex cold / incomplete | Fall back to **All** selection for nightly; for events, cap to KEV+critical until warmed |
| Conselier/Excitor webhook storm | Coalesce by exportId; debounce 30–60s; keep last |
| Scanner under load (429) | Back off with jitter; respect per-tenant leaky buckets |
| Oversubscription (too many impacted) | Prioritize KEV/critical first; spill over to the next window; UI banner shows backlog |
| Notify down | Buffer outbound events in queue (TTL 24h) |
| PostgreSQL slow | Cut batch sizes; sample logs; alert ops; don't drop runs unless critical |
---
## 13) Testing matrix
* **ImpactIndex**: correctness (purl→image sets), performance, persistence after restart, memory pressure with 1M purls.
* **Planner**: dedupe, shard, fairness, limit enforcement, KEV prioritization.
* **Runner**: parallel report calls, error backoff, partial failures, idempotency.
* **End-to-end**: Conselier export → deltas visible in UI in ≤60s.
* **Security**: webhook auth (mTLS/HMAC), DPoP nonce dance, tenant isolation.
* **Chaos**: drop Scanner availability; simulate registry throttles (content-refresh mode).
* **Nightly**: cron tick correctness across timezones and DST.
---
## 14) Implementation notes
* **Language**: .NET 10 minimal API; Channels-based pipeline; `System.Threading.RateLimiting`.
* **Bitmaps**: Roaring via `RoaringBitmap` bindings; memory-map large shards if RocksDB is used.
* **Cron**: Quartz-style parser with timezone support; clock skew tolerated ±60s.
* **Dry-run**: use ImpactIndex only; never call Scanner.
* **Idempotency**: run segments carry deterministic keys; retries are safe.
* **Backpressure**: per-tenant buckets; per-host registry budgets respected when content-refresh is enabled.
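The idempotency note can be made concrete: deriving the segment key from canonicalized content means a retried enqueue reproduces the same key and can be safely deduplicated. A sketch, with assumed key ingredients (run id, shard index, member digests):

```python
# Sketch: a deterministic RunSegment key so re-enqueued retries are idempotent.
# The key ingredients are assumptions, not the shipped key schema.
import hashlib
import json

def segment_key(run_id: str, shard_index: int, digests: list[str]) -> str:
    payload = json.dumps(
        {"runId": run_id, "shard": shard_index, "digests": sorted(digests)},
        separators=(",", ":"), sort_keys=True)       # canonical JSON form
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = segment_key("run-1", 0, ["sha256:b", "sha256:a"])
k2 = segment_key("run-1", 0, ["sha256:a", "sha256:b"])  # order-insensitive
print(k1 == k2)   # True
```

Sorting members and fixing the JSON separators are what make the digest stable; any field the worker treats as significant must be part of the canonical payload.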
---
## 15) Sequences (representative)
**A) Event-driven rescan (Conselier delta)**
```mermaid
sequenceDiagram
autonumber
participant FE as Conselier
participant SCH as Scheduler.Worker
participant IDX as ImpactIndex
participant SC as Scanner.WebService
participant NO as Notify
FE->>SCH: POST /events/conselier-export {exportId, changedProductKeys}
SCH->>IDX: ResolveByPurls(keys, usageOnly=true, sel)
IDX-->>SCH: bitmap(imageIds) → digests list
SCH->>SC: POST /reports {imageDigest} (batch/sequenced)
SC-->>SCH: report deltas (new criticals/highs)
alt delta>0
SCH->>NO: rescan.delta {digest, newCriticals, links}
end
```
**B) Nightly rescan**
```mermaid
sequenceDiagram
autonumber
participant CRON as Cron
participant SCH as Scheduler.Worker
participant IDX as ImpactIndex
participant SC as Scanner.WebService
CRON->>SCH: tick (02:00 Europe/Sofia)
SCH->>IDX: ResolveAll(selector)
IDX-->>SCH: candidates
SCH->>SC: POST /reports {digest} (paced)
SC-->>SCH: results
SCH-->>SCH: aggregate, store run stats
```
**C) Content-refresh (tag followers)**
```mermaid
sequenceDiagram
autonumber
participant SCH as Scheduler
participant SC as Scanner
SCH->>SC: resolve tag→digest (if changed)
alt digest changed
SCH->>SC: POST /scans {imageRef} # new SBOM
SC-->>SCH: scan complete (artifacts)
SCH->>SC: POST /reports {imageDigest}
else unchanged
SCH->>SC: POST /reports {imageDigest} # analysis-only
end
```
---
## 16) Roadmap
* **Vuln-centric impact**: pre-join vuln→purl→images to rank by **KEV** and **exploited-in-the-wild** signals.
* **Policy diff preview**: when a staged policy changes, show the projected breakage set before promotion.
* **Cross-cluster federation**: one Scheduler instance driving many Scanner clusters (tenant isolation).
* **Windows containers**: integrate Zastava runtime hints for Usage view tightening.
---
**End — component_architecture_scheduler.md**

@@ -0,0 +1,190 @@
# HLC Queue Ordering Migration Guide
This guide describes how to enable HLC (Hybrid Logical Clock) ordering for the Scheduler queue, transitioning from legacy `(priority, created_at)` ordering to HLC-based ordering with cryptographic chain linking.
## Overview
HLC ordering provides:
- **Deterministic global ordering**: Causal consistency across distributed nodes
- **Cryptographic chain linking**: Audit-safe job sequence proofs
- **Reproducible processing**: Same input produces same chain
## Prerequisites
1. PostgreSQL 16+ with the scheduler schema
2. HLC library dependency (`StellaOps.HybridLogicalClock`)
3. Schema migration `002_hlc_queue_chain.sql` applied
## Migration Phases
### Phase 1: Deploy with Dual-Write Mode
Enable dual-write to populate the new `scheduler_log` table without affecting existing operations.
```yaml
# appsettings.yaml or environment configuration
Scheduler:
Queue:
Hlc:
EnableHlcOrdering: false # Keep using legacy ordering for reads
DualWriteMode: true # Write to both legacy and HLC tables
```
```csharp
// Program.cs or Startup.cs
services.AddOptions<SchedulerQueueOptions>()
.Bind(configuration.GetSection("Scheduler:Queue"))
.ValidateDataAnnotations()
.ValidateOnStart();
// Register HLC services
services.AddHlcSchedulerServices();
// Register HLC clock
services.AddSingleton<IHybridLogicalClock>(sp =>
{
var nodeId = Environment.MachineName; // or use a stable node identifier
return new HybridLogicalClock(nodeId, TimeProvider.System);
});
```
**Verification:**
- Monitor `scheduler_hlc_enqueues_total` metric for dual-write activity
- Verify `scheduler_log` table is being populated
- Check chain verification passes: `scheduler_chain_verifications_total{result="valid"}`
### Phase 2: Backfill Historical Data (Optional)
If you need historical jobs in the HLC chain, backfill from the existing `scheduler.jobs` table:
```sql
-- Backfill script (run during maintenance window)
-- Note: This creates a new chain starting from historical data
-- The chain will not have valid prev_link values for historical entries
INSERT INTO scheduler.scheduler_log (
tenant_id, t_hlc, partition_key, job_id, payload_hash, prev_link, link
)
SELECT
tenant_id,
-- Generate synthetic HLC timestamps based on created_at
-- Format: YYYYMMDDHHMMSS-nodeid-counter
TO_CHAR(created_at AT TIME ZONE 'UTC', 'YYYYMMDDHH24MISS') || '-backfill-' ||
LPAD(ROW_NUMBER() OVER (PARTITION BY tenant_id ORDER BY created_at)::TEXT, 6, '0'),
COALESCE(project_id, ''),
id,
DECODE(payload_digest, 'hex'),
NULL, -- No chain linking for historical data
DECODE(payload_digest, 'hex') -- Use payload_digest as link placeholder
FROM scheduler.jobs
WHERE status IN ('pending', 'scheduled', 'running')
AND NOT EXISTS (
SELECT 1 FROM scheduler.scheduler_log sl
WHERE sl.job_id = jobs.id
)
ORDER BY tenant_id, created_at;
```
### Phase 3: Enable HLC Ordering for Reads
Once dual-write is stable and backfill (if needed) is complete:
```yaml
Scheduler:
Queue:
Hlc:
EnableHlcOrdering: true # Use HLC ordering for reads
DualWriteMode: true # Keep dual-write during transition
VerifyOnDequeue: false # Optional: enable for extra validation
```
**Verification:**
- Monitor dequeue latency (should be similar to legacy)
- Verify job processing order matches HLC order
- Check chain integrity periodically
### Phase 4: Disable Dual-Write Mode
Once confident in HLC ordering:
```yaml
Scheduler:
Queue:
Hlc:
EnableHlcOrdering: true
DualWriteMode: false # Stop writing to legacy table
VerifyOnDequeue: false
```
## Configuration Reference
### SchedulerHlcOptions
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `EnableHlcOrdering` | bool | false | Use HLC ordering for queue reads |
| `DualWriteMode` | bool | false | Write to both legacy and HLC tables |
| `VerifyOnDequeue` | bool | false | Verify chain integrity on each dequeue |
| `MaxClockDriftMs` | int | 60000 | Maximum allowed clock drift in milliseconds |
## Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `scheduler_hlc_enqueues_total` | Counter | Total HLC enqueue operations |
| `scheduler_hlc_enqueue_deduplicated_total` | Counter | Deduplicated enqueue operations |
| `scheduler_hlc_enqueue_duration_seconds` | Histogram | Enqueue operation duration |
| `scheduler_hlc_dequeues_total` | Counter | Total HLC dequeue operations |
| `scheduler_hlc_dequeued_entries_total` | Counter | Total entries dequeued |
| `scheduler_chain_verifications_total` | Counter | Chain verification operations |
| `scheduler_chain_verification_issues_total` | Counter | Chain verification issues found |
| `scheduler_batch_snapshots_created_total` | Counter | Batch snapshots created |
## Troubleshooting
### Chain Verification Failures
If chain verification reports issues:
1. Check `scheduler_chain_verification_issues_total` for issue count
2. Query the log for specific issues:
```csharp
var result = await chainVerifier.VerifyAsync(tenantId);
foreach (var issue in result.Issues)
{
logger.LogError(
"Chain issue at job {JobId}: {Type} - {Description}",
issue.JobId, issue.IssueType, issue.Description);
}
```
3. Common causes:
- Database corruption: Restore from backup
- Concurrent writes without proper locking: Check transaction isolation
- Clock drift: Verify `MaxClockDriftMs` setting
### Performance Considerations
- **Index usage**: Ensure `idx_scheduler_log_tenant_hlc` is being used
- **Chain head caching**: The `chain_heads` table provides O(1) access to latest link
- **Batch sizes**: Adjust dequeue batch size based on workload
## Rollback Procedure
To rollback to legacy ordering:
```yaml
Scheduler:
Queue:
Hlc:
EnableHlcOrdering: false
DualWriteMode: false
```
The `scheduler_log` table can be retained for audit purposes or dropped if no longer needed.
## Related Documentation
- [Scheduler Architecture](architecture.md)
- [HLC Library Documentation](../../__Libraries/StellaOps.HybridLogicalClock/README.md)
- [Product Advisory: Audit-safe Job Queue Ordering](../../product/advisories/audit-safe-job-queue-ordering.md)

@@ -0,0 +1,176 @@
# Scheduler HLC Ordering Architecture
This document describes the Hybrid Logical Clock (HLC) based ordering system used by the StellaOps Scheduler for audit-safe job queue operations.
## Overview
The Scheduler uses HLC timestamps instead of wall-clock time to ensure:
1. **Total ordering** of jobs across distributed nodes
2. **Audit-safe sequencing** with cryptographic chain linking
3. **Deterministic merge** when offline nodes reconnect
4. **Clock skew tolerance** in distributed deployments
## HLC Timestamp Format
An HLC timestamp consists of three components:
```
(PhysicalTime, LogicalCounter, NodeId)
```
| Component | Description | Example |
|-----------|-------------|---------|
| PhysicalTime | Unix milliseconds (UTC) | `1704585600000` |
| LogicalCounter | Monotonic counter for same-millisecond events | `0`, `1`, `2`... |
| NodeId | Unique identifier for the node | `scheduler-prod-01` |
**String format:** `{physical}:{logical}:{nodeId}`
Example: `1704585600000:0:scheduler-prod-01`
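Because the components are typed, ordering should compare `(PhysicalTime, LogicalCounter, NodeId)` as a tuple rather than the raw string — lexicographic string comparison only works if the numeric fields are zero-padded. A small sketch of parsing and ordering timestamps in this format:

```python
# Sketch: parse "{physical}:{logical}:{nodeId}" and order by component tuple.
def parse_hlc(s: str):
    physical, logical, node_id = s.split(":", 2)
    return (int(physical), int(logical), node_id)   # tuple compare = total order

a = parse_hlc("1704585600000:0:scheduler-prod-01")
b = parse_hlc("1704585600000:1:scheduler-prod-01")  # same ms, later event
c = parse_hlc("1704585600001:0:scheduler-prod-02")
print(sorted([c, b, a]) == [a, b, c])   # True
```

The node id acts as the final tie-breaker, which is what makes the ordering total rather than merely causal.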
## Database Schema
### scheduler_log Table
```sql
CREATE TABLE scheduler.scheduler_log (
id BIGSERIAL PRIMARY KEY,
t_hlc TEXT NOT NULL, -- HLC timestamp
job_id TEXT NOT NULL, -- Job identifier
action TEXT NOT NULL, -- ENQUEUE, DEQUEUE, EXECUTE, COMPLETE, FAIL
prev_chain_link TEXT, -- Hash of previous entry
chain_link TEXT NOT NULL, -- Hash of this entry
payload JSONB NOT NULL, -- Job metadata
tenant_id TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_scheduler_log_hlc ON scheduler.scheduler_log (t_hlc);
CREATE INDEX idx_scheduler_log_tenant_hlc ON scheduler.scheduler_log (tenant_id, t_hlc);
CREATE INDEX idx_scheduler_log_job ON scheduler.scheduler_log (job_id);
```
### batch_snapshot Table
```sql
CREATE TABLE scheduler.batch_snapshot (
id BIGSERIAL PRIMARY KEY,
snapshot_hlc TEXT NOT NULL, -- HLC at snapshot time
from_chain_link TEXT NOT NULL, -- First entry in batch
to_chain_link TEXT NOT NULL, -- Last entry in batch
entry_count INTEGER NOT NULL,
merkle_root TEXT NOT NULL, -- Merkle root of entries
dsse_envelope JSONB, -- DSSE-signed attestation
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
### chain_heads Table
```sql
CREATE TABLE scheduler.chain_heads (
tenant_id TEXT PRIMARY KEY,
head_chain_link TEXT NOT NULL, -- Current chain head
head_hlc TEXT NOT NULL, -- HLC of chain head
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
## Chain Link Computation
Each log entry is cryptographically linked to its predecessor:
```csharp
public static string ComputeChainLink(
string tHlc,
string jobId,
string action,
string? prevChainLink,
string payloadDigest)
{
using var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
hasher.AppendData(Encoding.UTF8.GetBytes(tHlc));
hasher.AppendData(Encoding.UTF8.GetBytes(jobId));
hasher.AppendData(Encoding.UTF8.GetBytes(action));
hasher.AppendData(Encoding.UTF8.GetBytes(prevChainLink ?? "genesis"));
hasher.AppendData(Encoding.UTF8.GetBytes(payloadDigest));
return Convert.ToHexString(hasher.GetHashAndReset()).ToLowerInvariant();
}
```
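Transcribed to Python for illustration, the same hash recipe shows what a chain walk verifies; the field order and the `"genesis"` sentinel match the C# snippet above, while the entry shape is simplified:

```python
# Illustrative Python transcription of ComputeChainLink plus a chain walk.
import hashlib

def compute_chain_link(t_hlc, job_id, action, prev_link, payload_digest):
    h = hashlib.sha256()
    for part in (t_hlc, job_id, action, prev_link or "genesis", payload_digest):
        h.update(part.encode("utf-8"))
    return h.hexdigest()

def verify_chain(entries):
    """entries ordered by HLC; returns index of first bad link, or None."""
    prev = None
    for i, e in enumerate(entries):
        expected = compute_chain_link(
            e["t_hlc"], e["job_id"], e["action"], prev, e["payload_digest"])
        if e["chain_link"] != expected:
            return i
        prev = e["chain_link"]
    return None

e1 = {"t_hlc": "1:0:n1", "job_id": "j1", "action": "ENQUEUE", "payload_digest": "d1"}
e1["chain_link"] = compute_chain_link("1:0:n1", "j1", "ENQUEUE", None, "d1")
e2 = {"t_hlc": "2:0:n1", "job_id": "j2", "action": "ENQUEUE", "payload_digest": "d2"}
e2["chain_link"] = compute_chain_link("2:0:n1", "j2", "ENQUEUE", e1["chain_link"], "d2")

print(verify_chain([e1, e2]))   # None: chain intact
e2["payload_digest"] = "tampered"
print(verify_chain([e1, e2]))   # 1: tampering detected at the second entry
```

Because each link folds in its predecessor, mutating any earlier entry invalidates every link after it, which is the property batch snapshots attest to.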
## Configuration Options
```yaml
# etc/scheduler.yaml
scheduler:
hlc:
enabled: true # Enable HLC ordering (default: true)
nodeId: "scheduler-prod-01" # Unique node identifier
maxClockSkew: "00:00:05" # Maximum tolerable clock skew (5 seconds)
persistenceInterval: "00:01:00" # HLC state persistence interval
chain:
enabled: true # Enable chain linking (default: true)
batchSize: 1000 # Entries per batch snapshot
batchInterval: "00:05:00" # Batch snapshot interval
signSnapshots: true # DSSE-sign batch snapshots
keyId: "scheduler-signing-key" # Key for snapshot signing
```
## Operational Considerations
### Clock Skew Handling
The HLC algorithm tolerates clock skew by:
1. Advancing logical counter when physical time hasn't progressed
2. Rejecting events with excessive clock skew (> `maxClockSkew`)
3. Emitting `hlc_clock_skew_rejections_total` metric for monitoring
**Alert:** `HlcClockSkewExceeded` triggers when skew > tolerance.
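The tick rule can be sketched as follows; `max_skew_ms` plays the role of `maxClockSkew`, and the rejection branch is where `hlc_clock_skew_rejections_total` would be incremented. This is an illustrative model, not the `StellaOps.HybridLogicalClock` implementation:

```python
# Sketch of the HLC local-tick rule with clock-skew rejection (illustrative).
def hlc_tick(state, wall_ms, max_skew_ms=5000):
    physical, logical = state
    if wall_ms - physical < -max_skew_ms:
        # wall clock regressed beyond tolerance: reject instead of reordering
        raise ValueError("clock skew exceeds tolerance")
    if wall_ms > physical:
        return (wall_ms, 0)          # wall clock advanced: reset the counter
    return (physical, logical + 1)   # same/behind millisecond: bump the counter

state = (1704585600000, 0)
state = hlc_tick(state, 1704585600000)   # same millisecond
print(state)   # (1704585600000, 1)
state = hlc_tick(state, 1704585600005)   # clock moved forward
print(state)   # (1704585600005, 0)
```

The key invariant is that successive ticks never produce a smaller `(physical, logical)` pair, even while the wall clock stalls within the same millisecond.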
### Chain Verification
Verify chain integrity on startup and periodically:
```bash
# CLI command
stella scheduler chain verify --tenant-id <tenant>
# API endpoint
GET /api/v1/scheduler/chain/verify?tenantId=<tenant>
```
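Conceptually, verification walks the log in HLC order and recomputes each link with the same hash recipe as `ComputeChainLink` above. The Python sketch below is illustrative only; the entry field names are assumptions mirroring the log columns, not the actual storage schema.

```python
import hashlib


def compute_chain_link(t_hlc, job_id, action, prev_chain_link, payload_digest):
    """Mirror of the C# ComputeChainLink recipe shown above."""
    h = hashlib.sha256()
    for part in (t_hlc, job_id, action, prev_chain_link or "genesis", payload_digest):
        h.update(part.encode("utf-8"))
    return h.hexdigest()


def verify_chain(entries):
    """entries: iterable of dicts already ordered by HLC.

    Returns (True, None) if every stored chain_link matches the recomputed
    value, else (False, index_of_first_broken_entry).
    """
    prev = None
    for i, entry in enumerate(entries):
        expected = compute_chain_link(
            entry["t_hlc"],
            entry["job_id"],
            entry["action"],
            prev,
            entry["payload_digest"],
        )
        if entry["chain_link"] != expected:
            return False, i
        prev = entry["chain_link"]
    return True, None
```

A single tampered payload (or a reordered entry) changes the recomputed link, so every subsequent entry fails as well — the first failing index localizes the break.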
### Offline Merge
When offline nodes reconnect:
1. Export local job log as bundle
2. Import on connected node
3. HLC-based merge produces deterministic ordering
4. Chain is extended with merged entries
See `docs/operations/airgap-operations-runbook.md` for details.
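The merge step can be sketched as a set union with a deterministic sort. This assumes HLC timestamps are encoded as fixed-width, lexicographically sortable strings (consistent with the `head_hlc TEXT` column) and that `(t_hlc, job_id)` uniquely identifies an entry — both are illustrative assumptions, not the actual bundle format.

```python
def merge_job_logs(*logs):
    """Union entries from multiple job logs, deduplicating by (t_hlc, job_id)
    and ordering deterministically by HLC with job_id as a tie-breaker."""
    seen = {}
    for log in logs:
        for entry in log:
            seen[(entry["t_hlc"], entry["job_id"])] = entry
    # Sorting the composite keys yields the same order on every node.
    return [seen[key] for key in sorted(seen)]
```

Because the ordering depends only on entry contents, both nodes produce the same merged sequence, and the chain can then be extended over it identically on each side.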
## Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `hlc_ticks_total` | Counter | Total HLC tick operations |
| `hlc_clock_skew_rejections_total` | Counter | Events rejected due to clock skew |
| `hlc_physical_offset_seconds` | Gauge | Current physical time offset |
| `scheduler_chain_entries_total` | Counter | Total chain log entries |
| `scheduler_chain_verifications_total` | Counter | Chain verification operations |
| `scheduler_chain_verification_failures_total` | Counter | Failed verifications |
| `scheduler_batch_snapshots_total` | Counter | Batch snapshots created |
## Grafana Dashboard
See `devops/observability/grafana/hlc-queue-metrics.json` for the HLC monitoring dashboard.
## Related Documentation
- [HLC Core Library](../../../src/__Libraries/StellaOps.HybridLogicalClock/README.md)
- [HLC Migration Guide](./hlc-migration-guide.md)
- [Air-Gap Operations Runbook](../../operations/airgap-operations-runbook.md)
- [HLC Troubleshooting](../../operations/runbooks/hlc-troubleshooting.md)

# Scheduler Implementation Plan
## Purpose
Provide a living plan for Scheduler deliverables, dependencies, and evidence.
## Active work
- Track current sprints under `docs/implplan/SPRINT_*.md` for this module.
- Update this file when new scoped work is approved.
## Near-term deliverables
- TBD (add when sprint is staffed).
## Dependencies
- `docs/modules/scheduler/architecture.md`
- `docs/modules/scheduler/README.md`
- `docs/modules/platform/architecture-overview.md`
## Evidence of completion
- Code changes under `src/Scheduler/**`.
- Tests and fixtures under the module's `__Tests` / `__Libraries`.
- Docs and runbooks under `docs/modules/scheduler/**`.
## Notes
- Keep deterministic and offline-first expectations aligned with module AGENTS.

{
"title": "Scheduler Worker Planning & Rescan",
"uid": "scheduler-worker-observability",
"schemaVersion": 38,
"version": 1,
"editable": true,
"timezone": "",
"graphTooltip": 0,
"time": {
"from": "now-24h",
"to": "now"
},
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"hide": 0,
"refresh": 1,
"current": {}
},
{
"name": "mode",
"label": "Mode",
"type": "query",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"query": "label_values(scheduler_planner_runs_total, mode)",
"refresh": 1,
"multi": true,
"includeAll": true,
"allValue": ".*",
"current": {
"selected": false,
"text": "All",
"value": ".*"
}
}
]
},
"annotations": {
"list": []
},
"panels": [
{
"id": 1,
"title": "Planner Runs per Status",
"type": "timeseries",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "ops",
"displayName": "{{status}}"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
}
},
"targets": [
{
"expr": "sum by (status) (rate(scheduler_planner_runs_total{mode=~\"$mode\"}[5m]))",
"legendFormat": "{{status}}",
"refId": "A"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
}
},
{
"id": 2,
"title": "Planner Latency P95 (s)",
"type": "timeseries",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "s"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
}
},
"targets": [
{
"expr": "histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket{mode=~\"$mode\"}[5m])))",
"legendFormat": "p95",
"refId": "A"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
}
},
{
"id": 3,
"title": "Runner Segments per Status",
"type": "timeseries",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "ops",
"displayName": "{{status}}"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
}
},
"targets": [
{
"expr": "sum by (status) (rate(scheduler_runner_segments_total{mode=~\"$mode\"}[5m]))",
"legendFormat": "{{status}}",
"refId": "A"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
}
},
{
"id": 4,
"title": "New Findings per Severity",
"type": "timeseries",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "ops",
"displayName": "{{severity}}"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
}
},
"targets": [
{
"expr": "sum(rate(scheduler_runner_delta_critical_total{mode=~\"$mode\"}[5m]))",
"legendFormat": "critical",
"refId": "A"
},
{
"expr": "sum(rate(scheduler_runner_delta_high_total{mode=~\"$mode\"}[5m]))",
"legendFormat": "high",
"refId": "B"
},
{
"expr": "sum(rate(scheduler_runner_delta_total{mode=~\"$mode\"}[5m]))",
"legendFormat": "total",
"refId": "C"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
}
},
{
"id": 5,
"title": "Runner Backlog by Schedule",
"type": "table",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"displayName": "{{scheduleId}}",
"unit": "none"
},
"overrides": []
},
"options": {
"showHeader": true
},
"targets": [
{
"expr": "max by (scheduleId) (scheduler_runner_backlog{mode=~\"$mode\"})",
"format": "table",
"refId": "A"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
}
},
{
"id": 6,
"title": "Active Runs",
"type": "stat",
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"unit": "none"
},
"overrides": []
},
"options": {
"orientation": "horizontal",
"textMode": "value"
},
"targets": [
{
"expr": "sum(scheduler_runs_active{mode=~\"$mode\"})",
"refId": "A"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
}
}
]
}

groups:
- name: scheduler-worker
interval: 30s
rules:
- alert: SchedulerPlannerFailuresHigh
expr: sum(rate(scheduler_planner_runs_total{status="failed"}[5m]))
/
sum(rate(scheduler_planner_runs_total[5m])) > 0.05
for: 10m
labels:
severity: critical
service: scheduler-worker
annotations:
summary: "Planner failure ratio above 5%"
description: "More than 5% of planning runs are failing. Inspect scheduler logs and ImpactIndex connectivity before queues back up."
- alert: SchedulerPlannerLatencyHigh
expr: histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket[5m]))) > 45
for: 10m
labels:
severity: warning
service: scheduler-worker
annotations:
summary: "Planner latency p95 above 45s"
description: "Planning latency p95 stayed above 45 seconds for 10 minutes. Check ImpactIndex, Mongo, or external selectors to prevent missed SLAs."
- alert: SchedulerRunnerBacklogGrowing
expr: max_over_time(scheduler_runner_backlog[15m]) > 500
for: 15m
labels:
severity: warning
service: scheduler-worker
annotations:
summary: "Runner backlog above 500 images"
description: "Runner backlog exceeded 500 images over the last 15 minutes. Verify runner workers, scanner availability, and rate limits."
- alert: SchedulerRunStuck
expr: sum(scheduler_runs_active) > 0 and max_over_time(scheduler_runs_active[30m]) == min_over_time(scheduler_runs_active[30m])
for: 30m
labels:
severity: warning
service: scheduler-worker
annotations:
summary: "Scheduler runs stuck without progress"
description: "Active runs count has remained flat for 30 minutes. Investigate stuck segments or scanner timeouts."

# Scheduler Worker Observability & Runbook
## Purpose
Monitor planner and runner health for the Scheduler Worker (Sprint16 telemetry). The new .NET meters surface queue throughput, latency, backlog, and delta severities so operators can detect stalled runs before rescan SLAs slip.
> **Grafana note:** Import `docs/modules/scheduler/operations/worker-grafana-dashboard.json` into the Prometheus-backed Grafana stack that scrapes the OpenTelemetry Collector.
---
## Key metrics
| Metric | Use case | Suggested query |
| --- | --- | --- |
| `scheduler_planner_runs_total{status}` | Planner throughput & failure ratio | `sum by (status) (rate(scheduler_planner_runs_total[5m]))` |
| `scheduler_planner_latency_seconds_bucket` | Planning latency (p95 / p99) | `histogram_quantile(0.95, sum by (le) (rate(scheduler_planner_latency_seconds_bucket[5m])))` |
| `scheduler_runner_segments_total{status}` | Runner success vs retries | `sum by (status) (rate(scheduler_runner_segments_total[5m]))` |
| `scheduler_runner_delta_{critical_total,high_total,total}` | Newly detected findings | `sum(rate(scheduler_runner_delta_critical_total[5m]))` |
| `scheduler_runner_backlog{scheduleId}` | Remaining digests awaiting runner | `max by (scheduleId) (scheduler_runner_backlog)` |
| `scheduler_runs_active{mode}` | Active runs in-flight | `sum(scheduler_runs_active)` |
Reference queries power the bundled Grafana dashboard panels. Use the `mode` template variable to focus on `analysisOnly` versus `contentRefresh` schedules.
---
## Grafana dashboard
1. Import `docs/modules/scheduler/operations/worker-grafana-dashboard.json` (UID `scheduler-worker-observability`).
2. Point the `datasource` variable to the Prometheus instance scraping the collector. Optional: pin the `mode` variable to a specific schedule mode.
3. Panels included:
- **Planner Runs per Status** visualises success vs failure ratio.
- **Planner Latency P95** highlights degradations in ImpactIndex or Mongo lookups.
- **Runner Segments per Status** shows retry pressure and queue health.
- **New Findings per Severity** rolls up delta counters (critical/high/total).
- **Runner Backlog by Schedule** tabulates outstanding digests per schedule.
- **Active Runs** stat panel showing the current number of in-flight runs.
Capture screenshots once Grafana provisioning completes and store them under `docs/assets/dashboards/` (pending automation ticket OBS-157).
---
## Prometheus alerts
Import `docs/modules/scheduler/operations/worker-prometheus-rules.yaml` into your Prometheus rule configuration. The bundle defines:
- **SchedulerPlannerFailuresHigh**: more than 5% of planner runs have failed over the last 10 minutes. Page SRE.
- **SchedulerPlannerLatencyHigh**: planner p95 latency remained above 45s for 10 minutes. Investigate ImpactIndex, Mongo, and Concelier/Excititor event queues.
- **SchedulerRunnerBacklogGrowing**: backlog exceeded 500 images for 15 minutes. Inspect runner workers, Scanner availability, and rate limiting.
- **SchedulerRunStuck**: active run count stayed flat for 30 minutes while remaining non-zero. Check stuck segments, expired leases, and scanner retries.
Hook these alerts into the existing Observability notification pathway (`observability-pager` routing key) and ensure `service=scheduler-worker` is mapped to the on-call rotation.
---
## Runbook snapshot
1. **Planner failure/latency:**
- Check Planner logs for ImpactIndex or Mongo exceptions.
   - Verify Concelier/Excititor webhook health; requeue events if necessary.
- If planner is overwhelmed, temporarily reduce schedule parallelism via `stella scheduler schedule update`.
2. **Runner backlog spike:**
- Confirm Scanner WebService health (`/healthz`).
- Inspect runner queue for stuck segments; consider increasing runner workers or scaling scanner capacity.
- Review rate limits (schedule limits, ImpactIndex throughput) before changing global throttles.
3. **Stuck runs:**
- Use `stella scheduler runs list --state running` to identify affected runs.
- Drill into Grafana panel “Runner Backlog by Schedule” to see offending schedule IDs.
- If a segment will not progress, use `stella scheduler segments release --segment <id>` to force retry after resolving root cause.
4. **Unexpected critical deltas:**
- Correlate `scheduler_runner_delta_critical_total` spikes with Notify events (`scheduler.rescan.delta`).
- Pivot to Scanner report links for impacted digests and confirm they match upstream advisories/policies.
Document incidents and mitigation in `ops/runbooks/INCIDENT_LOG.md` (per SRE policy) and attach Grafana screenshots for post-mortems.
---
## Checklist
- [ ] Grafana dashboard imported and wired to Prometheus datasource.
- [ ] Prometheus alert rules deployed (see above).
- [ ] Runbook linked from on-call rotation portal.
- [ ] Observability Guild sign-off captured for Sprint16 telemetry (OWNER: @obs-guild).