Below is the “maximum documentation” bundle for Epic 9. Paste it into your repo and pretend the ingestion chaos was always under control.
Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Epic 9: Source & Job Orchestrator Dashboard
Short name: Orchestrator Dashboard
Primary service: orchestrator (scheduler, queues, rate‑limits, job state)
Surfaces: Console (Web UI), CLI, Web API
Touches: Conseiller (Feedser), Excitator (Vexer), VEX Consensus Lens, SBOM Service, Policy Engine, Findings Ledger, Authority (authN/Z), Telemetry/Analytics, Object Storage, Relational DB, Message Bus
AOC ground rule: Conseiller and Excitator aggregate but never merge. The orchestrator schedules, tracks and recovers jobs; it does not transform evidence beyond transport and storage. No “smart” merging in flight.
1) What it is
The Source & Job Orchestrator Dashboard is the control surface for every data source and pipeline run across StellaOps. It gives operators:
- Live health of all advisory/VEX/SBOM sources and derived jobs.
- End‑to‑end pipeline visibility as DAGs and timelines.
- Controls for pausing, backfilling, replaying, throttling and retrying.
- Error pattern analysis, rate‑limit observability and backpressure insights.
- Provenance and audit trails from initial fetch through parse, normalize, index and policy evaluation.
The dashboard sits over the orchestrator service, which maintains job state, schedules runs, enforces quotas and rate limits, and collects metrics from worker pools embedded in Conseiller, Excitator, SBOM and related services.
2) Why (brief)
Ingestion breaks quietly and then loudly. Without a unified control plane, you learn about it from angry users or empty indexes. This dashboard shortens incident MTTR, enables safe backfills, and makes compliance reviewers stop sending emails with twelve attachments and one emoji.
3) How it should work (maximum detail)
3.1 Capabilities
-
Source registry
- Register, tag and version connectors (OSV, GHSA, CSAF endpoints, vendor PDF scrapers, distro feeds, RSS, S3 drops, internal registries).
- Store connection details, secrets (via KMS), rate‑limit policy, schedules, and ownership metadata.
- Validate and “test connection” safely.
-
Job orchestration
- Create DAGs composed of job types: fetch, parse, normalize, dedupe, index, consensus_compute, policy_eval, crosslink, sbom_ingest, sbom_index.
- Priorities, queues, concurrency caps, exponential backoff, circuit breakers.
- Idempotency keys and output artifact hashing to avoid duplicate work.
- Event‑time watermarks for backfills without double counting.
-
Observability & control
- Gantt timeline and real‑time DAG view with critical path highlighting.
- Backpressure and queue depth heatmaps.
- Error clustering by class (HTTP 429, TLS, schema mismatch, parse failure, upstream 5xx).
- Per‑source SLOs and SLA budgets with burn‑rate alerts.
- One‑click actions: retry, replay range, pause/resume, throttle/unthrottle, reroute to canary workers.
-
Provenance & audit
- Immutable run ledger linking input artifact → every job → output artifact.
- Schema version tracking and drift detection.
- Operator actions recorded with reason and ticket reference.
-
Safety
- Secret redaction everywhere.
- Tenant isolation at API, queue and storage layers.
- AOC: no in‑flight merges of advisory or VEX content.
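A minimal sketch of how an idempotency key and an artifact hash could be derived so that repeated work is recognisable. The choice of SHA-256 over canonical JSON, the field names and the sample source id `osv-main` are assumptions for illustration, not the orchestrator's actual scheme.

```python
import hashlib
import json


def artifact_hash(payload: bytes) -> str:
    """Content hash used to recognise an artifact that already exists."""
    return hashlib.sha256(payload).hexdigest()


def idempotency_key(source_id: str, job_type: str, window_start: str,
                    window_end: str, input_hash: str) -> str:
    """Derive a stable key from the fields that define one unit of work."""
    # Canonical JSON (sorted keys, no whitespace) so equal inputs hash equally.
    canonical = json.dumps(
        {"source": source_id, "type": job_type, "from": window_start,
         "to": window_end, "input": input_hash},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


raw = b'{"id": "ADV-1"}'  # hypothetical raw advisory payload
first = idempotency_key("osv-main", "parse", "2024-01-01", "2024-01-02", artifact_hash(raw))
again = idempotency_key("osv-main", "parse", "2024-01-01", "2024-01-02", artifact_hash(raw))
```

Because the key depends only on the inputs, a retried or replayed job produces the same key and can be treated as a no-op when its output already exists.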
3.2 Core architecture
-
orchestrator (service)
- Maintains job state in Postgres (sources, runs, jobs, artifacts, dag_edges, quotas, schedules).
- Publishes work to a message bus (e.g., topic.jobs.ready.<queue>).
- Distributed token‑bucket rate limiter per source/tenant/host.
- Watchdog for stuck jobs and circuit breakers for flapping sources.
- Watermark manager for backfills (event‑time windows).
-
worker SDK
Lightweight library embedded in Conseiller/Excitator/SBOM workers to:
- Claim work, heartbeat, update progress, report metrics.
- Emit artifact metadata and checksums.
- Enforce idempotency via orchestrator‑supplied key.
-
object store
Raw payloads and intermediate artifacts organized by schema and hash:
- advisory/raw/<source_id>/<event_time>/<sha256>.json|pdf
- advisory/normalized/<schema_ver>/<hash>.json
- vex/raw|normalized/...
- sbom/raw|graph/...
-
web API
- CRUD for sources, runs, jobs, schedules, quotas.
- Control actions (retry, cancel, pause, backfill).
- Streaming updates via WebSocket/SSE for the Console.
-
console
- React app consuming Orchestrator APIs, rendering DAGs, timelines, health charts and action panels with RBAC.
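The object store layout above could be produced by small key builders like the following. This is a sketch only: the function names are hypothetical, the `event_time` string format is not specified in this document, and using SHA-256 for the normalized `<hash>` segment is an assumption.

```python
import hashlib


def raw_advisory_key(source_id: str, event_time: str, payload: bytes,
                     ext: str = "json") -> str:
    """Build a key of the form advisory/raw/<source_id>/<event_time>/<sha256>.<ext>."""
    if ext not in ("json", "pdf"):
        raise ValueError(f"unsupported raw advisory extension: {ext}")
    digest = hashlib.sha256(payload).hexdigest()
    return f"advisory/raw/{source_id}/{event_time}/{digest}.{ext}"


def normalized_advisory_key(schema_ver: str, payload: bytes) -> str:
    """Build a key of the form advisory/normalized/<schema_ver>/<hash>.json."""
    digest = hashlib.sha256(payload).hexdigest()  # assumption: same hash function
    return f"advisory/normalized/{schema_ver}/{digest}.json"


key = raw_advisory_key("osv-main", "2024-01-01", b'{"id": "ADV-1"}')
```

Content-addressed keys mean that storing the same payload twice resolves to the same object, which supports the duplicate-avoidance goals stated earlier.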
3.3 Data model (selected tables)
- sources: id, kind (advisory|vex|sbom|internal), subtype (e.g., osv, ghsa, csaf, vendor_pdf), display_name, owner_team, schedule_cron, rate_policy, enabled, secrets_ref, tags, schema_hint, created_at, updated_at.
- runs: id, source_id, trigger (schedule|manual|event|backfill), window_start, window_end, state, started_at, finished_at, stats_json.
- jobs: id, run_id, type, queue, priority, state (pending|running|succeeded|failed|canceled|deadletter), attempt, max_attempt, idempotency_key, input_artifact_id, output_artifact_id, worker_id, created_at, started_at, finished_at, error_class, error_message, metrics_json.
- dag_edges: from_job_id, to_job_id, edge_kind (success_only|always).
- artifacts: id, kind (raw|normalized|index|consensus), schema_ver, hash, uri, bytes, meta_json, created_at.
- quotas: tenant_id, resource (requests_per_min, concurrent_jobs), limit, window_sec.
- schedules: per‑source cron plus jitter, timezone, blackout windows.
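To illustrate how dag_edges and job states might drive dispatch, here is a sketch of a readiness check. The semantics assumed here are not confirmed by this document: a success_only edge is taken to require the upstream job to have succeeded, and an always edge is taken to require only that the upstream job reached a terminal state. Job ids are shortened to type names for readability.

```python
from dataclasses import dataclass
from typing import Dict, List

TERMINAL = {"succeeded", "failed", "canceled", "deadletter"}


@dataclass(frozen=True)
class DagEdge:
    from_job_id: str
    to_job_id: str
    edge_kind: str  # "success_only" or "always"


def is_ready(job_id: str, states: Dict[str, str], edges: List[DagEdge]) -> bool:
    """True when a pending job has every incoming edge satisfied."""
    if states[job_id] != "pending":
        return False
    for edge in edges:
        if edge.to_job_id != job_id:
            continue
        upstream = states[edge.from_job_id]
        if edge.edge_kind == "success_only" and upstream != "succeeded":
            return False
        if edge.edge_kind == "always" and upstream not in TERMINAL:
            return False
    return True


edges = [
    DagEdge("fetch", "parse", "success_only"),
    DagEdge("parse", "normalize", "success_only"),
]
states = {"fetch": "succeeded", "parse": "pending", "normalize": "pending"}
ready = [job for job in states if is_ready(job, states, edges)]
```

With these states only the parse job is dispatchable; normalize stays blocked until parse succeeds.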
3.4 Job lifecycle
- Plan: Scheduler creates a run for a source and plans a DAG, e.g., fetch → parse → normalize → dedupe → index → policy_eval (advisory) or fetch → parse → normalize → consensus_compute (VEX).
- Enqueue: Ready nodes become jobs with queue, priority, idempotency key and optional rate‑limit tokens reserved.
- Execute: Worker claims job, heartbeats every N seconds. Output artifacts are stored and linked. Failures are classified and retried with exponential backoff and jitter, up to max_attempt.
- Complete: Downstream nodes unblock. On run completion, orchestrator computes SLO deltas and emits run summary.
- Dead‑letter: Jobs exceeding attempts move to a DLQ with structured context and suggested remediation.
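The retry and dead-letter behaviour in the lifecycle could look roughly like this. The sketch uses the "full jitter" variant of exponential backoff; the base delay, the cap and the function names are assumptions, since the document does not fix them.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_sec: float = 2.0, cap_sec: float = 300.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay before retrying after failed attempt `attempt` (1-based)."""
    rng = rng or random.Random()
    # The ceiling doubles with each attempt and is clamped to the cap.
    ceiling = min(cap_sec, base_sec * (2 ** (attempt - 1)))
    return rng.uniform(0.0, ceiling)


def next_state(attempt: int, max_attempt: int) -> str:
    """After a failure: retry while attempts remain, otherwise dead-letter."""
    return "pending" if attempt < max_attempt else "deadletter"
```

Drawing the delay uniformly from zero up to the ceiling spreads retries out, which helps avoid many failed jobs hitting the same upstream at the same moment.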
3.5 Scheduling, backpressure, rate‑limits
- Token bucket per {tenant, source.host} with adaptive refill if upstream 429/503 seen.
- Concurrency caps per source and per job type to avoid thundering herd.
- Backpressure signals from queue depth, worker CPU, and upstream error rates; scheduler reduces inflight issuance accordingly.
- Backfills use event‑time windows with immutable watermarks to avoid re‑processing.
- Blackout windows for vendor maintenance periods.
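A single-process sketch of a token bucket with adaptive refill may help make the rate-limit behaviour concrete. The halving on 429/503, the 10% recovery step and the 5% floor are illustrative assumptions, and a real deployment would need shared state to be distributed. The clock is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Token bucket whose refill rate backs off when upstream signals overload."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.base_rate = refill_per_sec
        self.rate = refill_per_sec
        self.tokens = capacity
        self.updated = now

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; otherwise leave the bucket untouched."""
        self._refill(now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def on_response(self, status: int, now: float) -> None:
        """Halve the refill rate on 429/503; recover gradually on other responses."""
        self._refill(now)  # account for elapsed time at the old rate first
        if status in (429, 503):
            self.rate = max(self.base_rate * 0.05, self.rate / 2)
        else:
            self.rate = min(self.base_rate, self.rate * 1.1)
```

One bucket would be kept per {tenant, source.host} pair, so a throttled upstream slows only the jobs that target it.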
3.6 APIs
POST /orchestrator/sources
GET /orchestrator/sources?kind=&tag=&q=
GET /orchestrator/sources/{id}
PATCH /orchestrator/sources/{id}
POST /orchestrator/sources/{id}/actions:test|pause|resume|sync-now
POST /orchestrator/sources/{id}/backfill { "from":"2024-01-01", "to":"2024-03-01" }
GET /orchestrator/runs?source_id=&state=&from=&to=
GET /orchestrator/runs/{run_id}
GET /orchestrator/runs/{run_id}/dag
POST /orchestrator/runs/{run_id}/cancel
GET /orchestrator/jobs?state=&type=&queue=&source_id=
GET /orchestrator/jobs/{job_id}
POST /orchestrator/jobs/{job_id}/actions:retry|cancel|prioritize
GET /orchestrator/metrics/overview
GET /orchestrator/errors/top?window=1h
GET /orchestrator/quotas
PATCH /orchestrator/quotas/{tenant_id}
WS /orchestrator/streams/updates
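As an example of calling one of these endpoints, the following sketch builds the backfill request with the standard library. The host name, the bearer-token header and the helper name are assumptions; the document does not specify the authentication scheme. The request is constructed but not sent, so the snippet runs offline.

```python
import json
import urllib.request


def build_backfill_request(base_url: str, source_id: str, date_from: str,
                           date_to: str, token: str) -> urllib.request.Request:
    """Build POST /orchestrator/sources/{id}/backfill with a JSON body."""
    body = json.dumps({"from": date_from, "to": date_to}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/orchestrator/sources/{source_id}/backfill",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumption: bearer auth
        },
    )


req = build_backfill_request(
    "https://orchestrator.example.internal",  # hypothetical host
    "osv-main", "2024-01-01", "2024-03-01", token="REDACTED",
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```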
3.7 Console (Web UI)
-
Overview
- KPI tiles: sources healthy, runs in progress, queue depth, error rate, burn‑rate to SLO.
- Heatmap of source health by last 24h success ratio.
-
Sources
- Grid with filters, inline status (active, paused, throttled), next run ETA, last error class.
- Detail panel: config, secrets status (redacted), schedule, rate limits, ownership, run history, action buttons.
-
Runs
- Timeline (Gantt) with critical path, duration distribution, and per‑stage breakdown.
- Run detail: DAG view with node metrics, artifacts, logs, action menu (cancel).
-
Jobs
- Live table with state filters and “tail” view.
- Job detail: payload preview (redacted), worker, attempts, stack traces, linked artifacts.
-
Errors
- Clusters by class and signature, suggested remediations (pause source, lower concurrency, patch parser).
-
Queues & Backpressure
- Per‑queue depth, service rate, inflight, age percentiles.
- Rate‑limit tokens graphs per source host.
-
Controls
- Backfill wizard with event‑time preview and safety checks.
- Canary routing: route 5% of next 100 runs to a new worker pool.
-
A11y
- Keyboard nav, ARIA roles for DAG nodes, live regions for updates, color‑blind friendly graphs.
3.8 CLI
stella orch sources list --kind advisory --tag prod
stella orch sources add --file source.yaml
stella orch sources test <source-id>
stella orch sources pause <source-id> # or resume
stella orch sources sync-now <source-id>
stella orch sources backfill <source-id> --from 2024-01-01 --to 2024-03-01
stella orch runs list --source <id> --state running
stella orch runs show <run-id> --dag
stella orch runs cancel <run-id>
stella orch jobs list --state failed --type parse --limit 100
stella orch jobs retry <job-id>
stella orch jobs cancel <job-id>
stella orch jobs tail --queue normalize --follow
stella orch quotas get --tenant default
stella orch quotas set --tenant default --concurrent-jobs 50 --rpm 1200
Exit codes: 0 success, 2 invalid args, 4 not found, 5 denied, 7 precondition failed, 8 rate‑limited.
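For scripting against the CLI, the exit codes above can be mapped to actions. This is a sketch: treating only exit code 8 (rate-limited) as retryable is an assumption, and the helper names are hypothetical. `run_stella` is defined but not called here because it needs the `stella` binary on PATH.

```python
import subprocess
from typing import List

EXIT_MEANINGS = {
    0: "success",
    2: "invalid args",
    4: "not found",
    5: "denied",
    7: "precondition failed",
    8: "rate-limited",
}


def describe_exit(code: int) -> str:
    """Translate a documented exit code into its meaning."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def should_retry(code: int) -> bool:
    """Assumption: only rate limiting is transient enough to retry automatically."""
    return code == 8


def run_stella(args: List[str]) -> int:
    """Run `stella orch <args>` and return its exit code."""
    return subprocess.run(["stella", "orch", *args]).returncode
```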
3.9 RBAC & security
-
Roles
- Orch.Viewer: read‑only sources/runs/jobs/metrics.
- Orch.Operator: perform actions on sources and jobs, launch backfills.
- Orch.Admin: manage quotas, schedules, connector versions, and delete sources.
-
Secrets
- Stored only as references to your KMS; never persisted in cleartext.
- Console shows redact badges and last rotated timestamp.
-
Tenancy
- Source, run, job rows scoped by tenant id.
- Queue names and token buckets namespaced per tenant.
-
Compliance
- Full audit log for every operator action with “reason” and optional ticket link.
- Exportable run ledger for audits.
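The role split above can be expressed as a simple permission table. The permission names below are hypothetical labels invented for this sketch, and treating each role as a superset of the previous one is an assumption rather than something this document states.

```python
from typing import Iterable

VIEWER = {"read"}
OPERATOR = VIEWER | {"source.action", "job.action", "backfill.launch"}
ADMIN = OPERATOR | {"quota.manage", "schedule.manage", "connector.manage", "source.delete"}

ROLE_PERMISSIONS = {
    "Orch.Viewer": VIEWER,
    "Orch.Operator": OPERATOR,
    "Orch.Admin": ADMIN,
}


def is_allowed(roles: Iterable[str], permission: str) -> bool:
    """True if any of the caller's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)
```

The same check would need to run in both the API and the Console so that hidden buttons are never the only line of defence.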
3.10 Observability
-
Metrics (examples)
- orch_jobs_inflight{type,queue}
- orch_jobs_latency_ms{type,percentile}
- orch_rate_tokens_available{source}
- orch_error_rate{source,error_class}
- orch_slo_burn_rate{source,slo}
- orch_deadletter_total{source,type}
-
Traces
- Span per job with baggage: run_id, source_id, artifact_id.
- Links across services to Conseiller/Excitator/SBOM workers.
-
Logs
- Structured JSON with correlation ids, attempt numbers and redacted payload previews.
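A structured log line with a redacted payload preview might be produced as follows. The list of sensitive key names and the output field names are assumptions for illustration; a real redaction policy would live in the secrets-handling documentation.

```python
import json
from typing import Any

SENSITIVE_KEYS = {"password", "token", "secret", "authorization", "api_key"}


def redact(value: Any) -> Any:
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


def log_line(run_id: str, job_id: str, attempt: int, message: str,
             payload_preview: dict) -> str:
    """Emit one JSON log line carrying correlation ids and a redacted preview."""
    return json.dumps(
        {
            "run_id": run_id,
            "job_id": job_id,
            "attempt": attempt,
            "message": message,
            "payload_preview": redact(payload_preview),
        },
        sort_keys=True,
    )
```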
3.11 Performance targets
- Job dispatch P95 < 150 ms after dependency satisfied.
- Scheduler loop P95 < 500 ms for 10k pending jobs.
- Console live updates sub‑second at 1k events/sec per tenant.
- Backfill throughput ≥ 200 jobs/sec per worker pool with zero dupes.
3.12 Edge cases & behaviors
- Upstream 429 storms: auto‑throttle, pause optional, recommend extended jitter.
- Schema drift: parser moves job to DLQ with error_class=schema_mismatch and opens a change ticket via webhook.
- Flapping source: circuit breaker opens after N consecutive failures; requires human “resume”.
- Clock skew: watermark logic uses upstream event time; large skews flagged.
- Idempotency collisions: new attempt yields no‑op if artifact hash already exists.
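The flapping-source behaviour can be sketched as a breaker that opens after N consecutive failures and stays open until an operator resumes it. The class name, the in-memory state and the shape of the returned audit record are assumptions made for this example.

```python
class SourceCircuitBreaker:
    """Opens after `threshold` consecutive failures; only resume() closes it."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.open = False

    def record(self, succeeded: bool) -> None:
        if self.open:
            return  # results arriving while open do not change the state
        if succeeded:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            self.open = True

    def allows_dispatch(self) -> bool:
        return not self.open

    def resume(self, operator: str, reason: str) -> dict:
        """Human action: close the breaker and return a record for the audit log."""
        self.open = False
        self.consecutive_failures = 0
        return {"action": "resume", "operator": operator, "reason": reason}
```

Returning an audit record from resume() reflects the requirement that operator actions carry a reason.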
4) Implementation plan
4.1 Modules (new and updated)
-
New service:
src/StellaOps.Orchestrator
- api/ REST + WS handlers
- scheduler/ run planner, DAG builder, watermark/backfill logic
- queues/ publisher and consumer abstractions
- ratelimit/ token bucket and adaptive controller
- state/ Postgres repositories and migrations
- audit/ action logging and export
- metrics/ Prometheus exporters
- security/ tenant scoping, KMS client, secret refs
-
Worker SDKs:
src/StellaOps.Orchestrator.WorkerSdk.Go and src/StellaOps.Orchestrator.WorkerSdk.Python with job claim, heartbeat, progress, artifact publish, and structured error reporting.
-
Console:
- console/apps/orch/pages: Overview, Sources, Runs, Jobs, Errors, Queues.
- components/dag-view/, components/gantt/, components/health-heatmap/.
-
Updates to existing services:
- Conseiller/Excitator/SBOM workers adopt SDK and emit artifacts with schema/version/fingerprint.
- VEX Lens exposes consensus_compute as a jobable operation.
- Policy Engine exposes policy_eval as a job type for scheduled recalcs.
4.2 Packaging & deployment
-
Containers:
- stella/orchestrator:<ver>
- stella/worker-sdk-examples:<ver> for canary pools
-
Helm values:
- Queues/topics, per‑tenant concurrency, rate‑limit defaults, WS replica count.
- KMS integration secrets.
-
Migrations:
- Flyway/Goose migrations for new tables and indexes.
4.3 Rollout strategy
- Phase 1: Read‑only dashboard fed by existing job tables; no controls.
- Phase 2: Control actions enabled for non‑prod tenants.
- Phase 3: Backfills and quota management, then GA.
5) Documentation changes
Create/update the following, each ending with the imposed rule statement.
- /docs/orchestrator/overview.md: Concepts, roles, responsibilities, AOC alignment.
- /docs/orchestrator/architecture.md: Scheduler, DAGs, watermarks, queues, rate‑limits, data model.
- /docs/orchestrator/api.md: Endpoints, WebSocket events, error codes, examples.
- /docs/orchestrator/console.md: Screens, actions, a11y, live updates.
- /docs/orchestrator/cli.md: Commands, examples, exit codes, scripting patterns.
- /docs/orchestrator/run-ledger.md: Provenance and audit export format.
- /docs/security/secrets-handling.md: KMS references, redaction rules, operator hygiene.
- /docs/operations/orchestrator-runbook.md: Common failures, backfill guide, circuit breakers, tuning.
- /docs/schemas/artifacts.md: Artifact kinds, schema versions, hashing, storage layout.
- /docs/slo/orchestrator-slo.md: SLO definitions, measurement, alerting.
6) Engineering tasks
Backend (orchestrator)
- Stand up Postgres schemas and indices for sources, runs, jobs, dag_edges, artifacts, quotas, schedules.
- Implement scheduler: DAG planner, dependency resolver, critical path computation.
- Implement rate limiter with adaptive behavior on 429/503 and per‑tenant tokens.
- Implement watermark/backfill manager with event‑time windows and idempotency keys.
- Implement API endpoints + OpenAPI spec + request validation.
- Implement WebSocket/SSE event stream for live updates.
- Implement audit logging and export.
- Implement dead‑letter store and replay.
Worker SDKs and integrations
- Build Go/Python SDKs with claim/heartbeat/progress API.
- Integrate SDK into Conseiller, Excitator, SBOM workers; ensure artifact emission with schema ver.
- Add consensus_compute and policy_eval as job types with deterministic inputs/outputs.
Console
- Overview tiles and health heatmap.
- Source list/detail with actions and config view.
- Runs timeline (Gantt) and DAG visualization with node inspector.
- Jobs “tail” with live updates and filters.
- Errors clustering and suggested remediations.
- Queues/backpressure dashboard.
- Backfill wizard with safety checks.
Observability
- Emit metrics listed in §3.10 and wire traces across services.
- Dashboards: health, queue depth, error classes, burn‑rate, dispatch latency.
- Alerts for SLO burn and circuit breaker opens.
Security & RBAC
- Enforce tenant scoping on all endpoints; test leakage.
- Wire KMS for secret refs and redact everywhere.
- Implement Orch.Viewer|Operator|Admin roles and enforce them in both Console and API.
Docs
- Author all files in §5 with examples and screenshots.
- Cross‑link from Conseiller/Excitator/SBOM pages to the dashboard docs.
- Append imposed rule to each page.
Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
7) Acceptance criteria
- Operators can: pause/resume a source, run “sync‑now,” initiate a backfill for a date range, and retry/cancel individual jobs from Console and CLI.
- DAG and timeline reflect reality within 1 second of job state changes at P95.
- Backfills do not create duplicate artifacts; idempotency proven by hash equality.
- Rate limiter reduces 429s by ≥80% under simulated throttle tests.
- Audit log includes who/when/why for every operator action.
- Provenance ledger exports a complete chain for any artifact.
- RBAC prevents non‑admins from quota changes; tenancy isolation proven via automated tests.
- SLO dashboard shows burn‑rate and triggers alerts under injected failure.
8) Risks & mitigations
-
Orchestrator becomes a single bottleneck. Scale stateless workers horizontally; tune DB indexes; batch job state updates; cache hot paths.
-
Secret spillage. Only KMS references stored; aggressive redaction; log scrubbing in SDK.
-
Over‑eager backfills overwhelm upstream. Enforce per‑source quotas and sandbox previews; dry‑run backfills first.
-
Schema drift silently corrupts normalization. Hard‑fail on mismatch; DLQ with clear signatures; schema registry gating.
-
Flapping sources cause alert fatigue. Circuit breaker with cool‑down and deduped alerts; error budget policy.
9) Test plan
- Unit: Scheduler DAG building, topological sort, backoff math, token bucket, watermark math.
- Integration: Orchestrator ↔ worker SDK, artifact store wiring, DLQ replay, audit pipeline.
- Chaos: Inject 429 storms, packet loss, worker crashes; verify throttling and recovery.
- Backfill: Simulate overlapping windows and verify idempotency and watermark correctness.
- Perf: 10k concurrent jobs: dispatch latency, DB contention, WebSocket fan‑out.
- Security: Multi‑tenant isolation tests; KMS mock tests for secret access; RBAC matrix.
- UX/A11y: Screen reader labels on DAG, keyboard navigation, live region updates.
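For the backfill tests, overlapping windows can be checked with a helper that returns only the parts of a requested window not already covered by earlier watermarks. This is a sketch under assumptions: windows are half-open [start, end) at date granularity, and the function name is hypothetical.

```python
from datetime import date
from typing import List, Tuple

Window = Tuple[date, date]  # half-open: [start, end)


def uncovered(requested: Window, covered: List[Window]) -> List[Window]:
    """Return the sub-windows of `requested` that no covered window includes."""
    gaps: List[Window] = []
    cursor, end = requested
    for c_start, c_end in sorted(covered):
        if c_end <= cursor:
            continue  # entirely before the part we still need
        if c_start >= end:
            break  # entirely after the requested window
        if c_start > cursor:
            gaps.append((cursor, c_start))
        cursor = max(cursor, c_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


already_done = [(date(2024, 1, 1), date(2024, 2, 1))]
todo = uncovered((date(2024, 1, 15), date(2024, 3, 1)), already_done)
```

Here the overlap with January is skipped and only the window from 1 February to 1 March remains to be processed.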
10) Philosophy
- Make the invisible visible. Pipelines should be legible at a glance.
- Prefer reproducibility to heroics. Idempotency and provenance over “we think it ran.”
- Safeguards before speed. Throttle first, retry thoughtfully, never melt upstreams.
- No silent merges. Evidence remains immutable; transformations are explicit, logged and reversible.
Final reminder: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.