
Below is the “maximum documentation” bundle for Epic 9. Paste it into your repo and pretend the ingestion chaos was always under control.

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.


Epic 9: Source & Job Orchestrator Dashboard

Short name: Orchestrator Dashboard
Primary service: orchestrator (scheduler, queues, rate limits, job state)
Surfaces: Console (Web UI), CLI, Web API
Touches: Conseiller (Feedser), Excitator (Vexer), VEX Consensus Lens, SBOM Service, Policy Engine, Findings Ledger, Authority (authN/Z), Telemetry/Analytics, Object Storage, Relational DB, Message Bus

AOC ground rule: Conseiller and Excitator aggregate but never merge. The orchestrator schedules, tracks and recovers jobs; it does not transform evidence beyond transport and storage. No “smart” merging in flight.


1) What it is

The Source & Job Orchestrator Dashboard is the control surface for every data source and pipeline run across StellaOps. It gives operators:

  • Live health of all advisory/VEX/SBOM sources and derived jobs.
  • End-to-end pipeline visibility as DAGs and timelines.
  • Controls for pausing, backfilling, replaying, throttling and retrying.
  • Error-pattern analysis, rate-limit observability and backpressure insights.
  • Provenance and audit trails from initial fetch through parse, normalize, index and policy evaluation.

The dashboard sits over the orchestrator service, which maintains job state, schedules runs, enforces quotas and rate limits, and collects metrics from worker pools embedded in Conseiller, Excitator, SBOM and related services.


2) Why (brief)

Ingestion breaks quietly and then loudly. Without a unified control plane, you learn about it from angry users or empty indexes. This dashboard shortens incident MTTR, enables safe backfills, and makes compliance reviewers stop sending emails with twelve attachments and one emoji.


3) How it should work (maximum detail)

3.1 Capabilities

  • Source registry

    • Register, tag and version connectors (OSV, GHSA, CSAF endpoints, vendor PDF scrapers, distro feeds, RSS, S3 drops, internal registries).
    • Store connection details, secrets (via KMS), rate-limit policy, schedules, and ownership metadata.
    • Validate and “test connection” safely.
  • Job orchestration

    • Create DAGs composed of job types: fetch, parse, normalize, dedupe, index, consensus_compute, policy_eval, crosslink, sbom_ingest, sbom_index.
    • Priorities, queues, concurrency caps, exponential backoff, circuit breakers.
    • Idempotency keys and output artifact hashing to avoid duplicate work.
    • Event-time watermarks for backfills without double counting.
  • Observability & control

    • Gantt timeline and real-time DAG view with critical-path highlighting.
    • Backpressure and queue depth heatmaps.
    • Error clustering by class (HTTP 429, TLS, schema mismatch, parse failure, upstream 5xx).
    • Per-source SLOs and SLA budgets with burn-rate alerts.
    • One-click actions: retry, replay range, pause/resume, throttle/unthrottle, reroute to canary workers.
  • Provenance & audit

    • Immutable run ledger linking input artifact → every job → output artifact.
    • Schema version tracking and drift detection.
    • Operator actions recorded with reason and ticket reference.
  • Safety

    • Secret redaction everywhere.
    • Tenant isolation at API, queue and storage layers.
    • AOC: no in-flight merges of advisory or VEX content.

3.2 Core architecture

  • orchestrator (service)

    • Maintains job state in Postgres (sources, runs, jobs, artifacts, dag_edges, quotas, schedules).
    • Publishes work to a message bus (e.g., topic.jobs.ready.<queue>).
    • Distributed token-bucket rate limiter per source/tenant/host.
    • Watchdog for stuck jobs and circuit breakers for flapping sources.
    • Watermark manager for backfills (event-time windows).
  • worker SDK

    • Lightweight library embedded in Conseiller/Excitator/SBOM workers to:

      • Claim work, heartbeat, update progress, report metrics.
      • Emit artifact metadata and checksums.
      • Enforce idempotency via an orchestrator-supplied key (a worker-loop sketch follows this list).
  • object store

    • Raw payloads and intermediate artifacts organized by schema and hash:

      • advisory/raw/<source_id>/<event_time>/<sha256>.json|pdf
      • advisory/normalized/<schema_ver>/<hash>.json
      • vex/raw|normalized/...
      • sbom/raw|graph/...
  • web API

    • CRUD for sources, runs, jobs, schedules, quotas.
    • Control actions (retry, cancel, pause, backfill).
    • Streaming updates via WebSocket/SSE for the Console.
  • console

    • React app consuming Orchestrator APIs, rendering DAGs, timelines, health charts and action panels with RBAC.
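
To make the worker SDK contract (claim, heartbeat, progress, artifact publish) concrete, below is a minimal sketch of a one-job worker loop. It is illustrative only: the client object and every method name on it (claim_job, heartbeat, publish_artifact, complete, fail) are assumptions, not the shipped SDK surface.

```python
import hashlib

def run_once(client, queue, handler):
    """Hypothetical worker loop: claim one job, heartbeat, publish, report.

    `client` is an assumed orchestrator API client; none of the method
    names used here are confirmed SDK surface.
    """
    job = client.claim_job(queue=queue)
    if job is None:
        return  # nothing ready on this queue
    try:
        client.heartbeat(job["id"], progress=0.0)
        payload = handler(job)  # job-type handler returns the output bytes
        digest = hashlib.sha256(payload).hexdigest()
        # Idempotency: publishing the same hash under the same key is a
        # no-op on the orchestrator side (see 3.12).
        artifact = client.publish_artifact(
            job_id=job["id"],
            idempotency_key=job["idempotency_key"],
            sha256=digest,
            body=payload,
        )
        client.complete(job["id"], output_artifact_id=artifact["id"])
    except Exception as exc:
        # Structured failure: the orchestrator classifies the error and
        # decides whether to retry or dead-letter (see 3.4).
        client.fail(job["id"], error_class=type(exc).__name__,
                    error_message=str(exc))
```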

3.3 Data model (selected tables)

  • sources

    • id, kind (advisory|vex|sbom|internal), subtype (e.g., osv, ghsa, csaf, vendor_pdf), display_name, owner_team, schedule_cron, rate_policy, enabled, secrets_ref, tags, schema_hint, created_at, updated_at.
  • runs

    • id, source_id, trigger (schedule|manual|event|backfill), window_start, window_end, state, started_at, finished_at, stats_json.
  • jobs

    • id, run_id, type, queue, priority, state (pending|running|succeeded|failed|canceled|deadletter), attempt, max_attempt, idempotency_key, input_artifact_id, output_artifact_id, worker_id, created_at, started_at, finished_at, error_class, error_message, metrics_json.
  • dag_edges

    • from_job_id, to_job_id, edge_kind (success_only|always); see the dependency-resolver sketch after this list.
  • artifacts

    • id, kind (raw|normalized|index|consensus), schema_ver, hash, uri, bytes, meta_json, created_at.
  • quotas

    • tenant_id, resource (requests_per_min, concurrent_jobs), limit, window_sec.
  • schedules

    • Per-source cron plus jitter, timezone, and blackout windows.
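
To illustrate how dag_edges drives dependency resolution, here is a sketch of the Kahn-style gate that releases pending jobs once their upstream edges are satisfied. Table and field names follow 3.3; the function itself is an assumption about the scheduler internals, not its actual code.

```python
from collections import defaultdict

def ready_jobs(job_states, edges):
    """Return pending jobs whose upstream dependencies are satisfied.

    job_states: {job_id: state} using the jobs.state values from 3.3.
    edges: iterable of (from_job_id, to_job_id, edge_kind) dag_edges rows.
    """
    blocked = defaultdict(int)
    for from_id, to_id, edge_kind in edges:
        upstream = job_states[from_id]
        if edge_kind == "success_only":
            satisfied = upstream == "succeeded"
        else:  # "always" edges fire once the upstream job is terminal
            satisfied = upstream in ("succeeded", "failed", "canceled")
        if not satisfied:
            blocked[to_id] += 1
    return [job_id for job_id, state in job_states.items()
            if state == "pending" and blocked[job_id] == 0]
```

The scheduler would call something like this after every job state transition (step 4 in 3.4) to enqueue newly unblocked nodes.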

3.4 Job lifecycle

  1. Plan: Scheduler creates a run for a source and plans a DAG, e.g., fetch → parse → normalize → dedupe → index → policy_eval (advisory) or fetch → parse → normalize → consensus_compute (VEX).

  2. Enqueue: Ready nodes become jobs with a queue, priority, idempotency key and, optionally, rate-limit tokens reserved.

  3. Execute: Worker claims the job and heartbeats every N seconds. Output artifacts are stored and linked. Failures are classified and retried with exponential backoff and jitter, up to max_attempt (a sketch of the backoff math follows this list).

  4. Complete: Downstream nodes unblock. On run completion, the orchestrator computes SLO deltas and emits a run summary.

  5. Dead-letter: Jobs exceeding attempts move to a DLQ with structured context and suggested remediation.
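
The backoff in step 3 can be as small as the function below; a sketch only, since the base delay, cap, and full-jitter strategy are illustrative choices rather than fixed contract:

```python
import random

def retry_delay_sec(attempt, base=2.0, cap=300.0):
    """Exponential backoff with full jitter, capped.

    attempt is 1-based: the ceiling doubles each attempt up to `cap`, and
    a uniform draw below it spreads retries so failures do not re-synchronize.
    """
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0, ceiling)
```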

3.5 Scheduling, backpressure, rate limits

  • Token bucket per {tenant, source.host} with adaptive refill when upstream 429/503 responses are seen (a sketch follows this list).
  • Concurrency caps per source and per job type to avoid a thundering herd.
  • Backpressure signals from queue depth, worker CPU, and upstream error rates; the scheduler reduces in-flight issuance accordingly.
  • Backfills use event-time windows with immutable watermarks to avoid reprocessing.
  • Blackout windows for vendor maintenance periods.
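
A minimal sketch of the adaptive token bucket described above; the halve-on-throttle and 5% recovery constants are assumptions for illustration, not tuned values:

```python
import time

class AdaptiveTokenBucket:
    """Per-{tenant, source.host} token bucket whose refill rate backs off
    when the upstream signals throttling (429/503) and creeps back when
    responses are healthy."""

    def __init__(self, rate_per_sec, burst):
        self.base_rate = rate_per_sec   # configured refill rate
        self.rate = rate_per_sec        # current (adaptive) refill rate
        self.burst = burst              # bucket capacity
        self.tokens = burst
        self.stamp = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now

    def try_acquire(self, n=1.0):
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller should defer issuance (backpressure)

    def on_upstream_throttle(self):
        # 429/503 seen: halve the refill rate, floored at 10% of base.
        self.rate = max(self.base_rate * 0.1, self.rate * 0.5)

    def on_upstream_ok(self):
        # Healthy response: recover slowly toward the configured base rate.
        self.rate = min(self.base_rate, self.rate * 1.05)
```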

3.6 APIs

POST   /orchestrator/sources
GET    /orchestrator/sources?kind=&tag=&q=
GET    /orchestrator/sources/{id}
PATCH  /orchestrator/sources/{id}
POST   /orchestrator/sources/{id}/actions:test|pause|resume|sync-now
POST   /orchestrator/sources/{id}/backfill { "from":"2024-01-01", "to":"2024-03-01" }

GET    /orchestrator/runs?source_id=&state=&from=&to=
GET    /orchestrator/runs/{run_id}
GET    /orchestrator/runs/{run_id}/dag
POST   /orchestrator/runs/{run_id}/cancel

GET    /orchestrator/jobs?state=&type=&queue=&source_id=
GET    /orchestrator/jobs/{job_id}
POST   /orchestrator/jobs/{job_id}/actions:retry|cancel|prioritize

GET    /orchestrator/metrics/overview
GET    /orchestrator/errors/top?window=1h
GET    /orchestrator/quotas
PATCH  /orchestrator/quotas/{tenant_id}
WS     /orchestrator/streams/updates
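
As a usage example, launching the backfill above from a script might look like this (a sketch with Python's requests; the host, bearer-token auth, source id, and response shape are all assumptions):

```python
import requests

BASE = "https://stella.example.internal/orchestrator"  # assumed host
HEADERS = {"Authorization": "Bearer <token>"}          # assumed auth scheme

# POST /orchestrator/sources/{id}/backfill with the window shown in 3.6.
resp = requests.post(
    f"{BASE}/sources/src-osv-prod/backfill",           # source id is illustrative
    json={"from": "2024-01-01", "to": "2024-03-01"},
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()
run = resp.json()  # assumed to describe the planned backfill run
print(run.get("id"), run.get("state"))
```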

3.7 Console (Web UI)

  • Overview

    • KPI tiles: sources healthy, runs in progress, queue depth, error rate, burn-rate to SLO.
    • Heatmap of source health by last 24h success ratio.
  • Sources

    • Grid with filters, inline status (active, paused, throttled), next-run ETA, last error class.
    • Detail panel: config, secrets status (redacted), schedule, rate limits, ownership, run history, action buttons.
  • Runs

    • Timeline (Gantt) with critical path, duration distribution, and per-stage breakdown.
    • Run detail: DAG view with node metrics, artifacts, logs, action menu (cancel).
  • Jobs

    • Live table with state filters and “tail” view.
    • Job detail: payload preview (redacted), worker, attempts, stack traces, linked artifacts.
  • Errors

    • Clusters by class and signature, suggested remediations (pause source, lower concurrency, patch parser).
  • Queues & Backpressure

    • Per-queue depth, service rate, in-flight count, age percentiles.
    • Rate-limit token graphs per source host.
  • Controls

    • Backfill wizard with event-time preview and safety checks.
    • Canary routing: route 5% of next 100 runs to a new worker pool.
  • A11y

    • Keyboard nav, ARIA roles for DAG nodes, live regions for updates, colorblind-friendly graphs.

3.8 CLI

stella orch sources list --kind advisory --tag prod
stella orch sources add --file source.yaml
stella orch sources test <source-id>
stella orch sources pause <source-id>  # or resume
stella orch sources sync-now <source-id>
stella orch sources backfill <source-id> --from 2024-01-01 --to 2024-03-01

stella orch runs list --source <id> --state running
stella orch runs show <run-id> --dag
stella orch runs cancel <run-id>

stella orch jobs list --state failed --type parse --limit 100
stella orch jobs retry <job-id>
stella orch jobs cancel <job-id>
stella orch jobs tail --queue normalize --follow

stella orch quotas get --tenant default
stella orch quotas set --tenant default --concurrent-jobs 50 --rpm 1200

Exit codes: 0 success, 2 invalid args, 4 not found, 5 denied, 7 precondition failed, 8 rate-limited.
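
Those exit codes make scripted control loops easy to write defensively. A sketch that retries only on the rate-limited code (the retry policy itself is illustrative):

```python
import subprocess
import time

RATE_LIMITED = 8  # per the exit-code table above

def retry_job(job_id, attempts=5):
    """Retry a job via the CLI, pausing when the orchestrator rate-limits us."""
    for i in range(attempts):
        result = subprocess.run(["stella", "orch", "jobs", "retry", job_id])
        if result.returncode != RATE_LIMITED:
            return result.returncode  # 0 on success, or a non-retryable error
        time.sleep(2 ** i)            # back off before asking again
    return RATE_LIMITED
```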

3.9 RBAC & security

  • Roles

    • Orch.Viewer: read-only access to sources/runs/jobs/metrics.
    • Orch.Operator: perform actions on sources and jobs, launch backfills.
    • Orch.Admin: manage quotas, schedules, connector versions, and delete sources.
  • Secrets

    • Stored only as references to your KMS; never persisted in cleartext.
    • Console shows redaction badges and the last-rotated timestamp.
  • Tenancy

    • Source, run, job rows scoped by tenant id.
    • Queue names and token buckets namespaced per tenant.
  • Compliance

    • Full audit log for every operator action with “reason” and optional ticket link.
    • Exportable run ledger for audits.

3.10 Observability

  • Metrics (examples)

    • orch_jobs_inflight{type,queue}
    • orch_jobs_latency_ms{type,percentile}
    • orch_rate_tokens_available{source}
    • orch_error_rate{source,error_class}
    • orch_slo_burn_rate{source,slo}
    • orch_deadletter_total{source,type}
  • Traces

    • Span per job with baggage: run_id, source_id, artifact_id.
    • Links across services to Conseiller/Excitator/SBOM workers.
  • Logs

    • Structured JSON with correlation ids, attempt numbers and redacted payload previews.

3.11 Performance targets

  • Job dispatch P95 < 150 ms after dependencies are satisfied.
  • Scheduler loop P95 < 500 ms with 10k pending jobs.
  • Console live updates sub-second at 1k events/sec per tenant.
  • Backfill throughput ≥ 200 jobs/sec per worker pool with zero duplicates.

3.12 Edge cases & behaviors

  • Upstream 429 storms: auto-throttle, optionally pause, and recommend extended jitter.
  • Schema drift: parser moves the job to the DLQ with error_class=schema_mismatch and opens a change ticket via webhook.
  • Flapping source: circuit breaker opens after N consecutive failures; requires a human “resume”.
  • Clock skew: watermark logic uses upstream event time; large skews are flagged.
  • Idempotency collisions: a new attempt yields a no-op if the artifact hash already exists (a sketch follows this list).
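
Concretely, the idempotency no-op in the last bullet reduces to a hash lookup before the write; a sketch, with the store API names assumed:

```python
import hashlib

def publish_if_new(store, idempotency_key, payload):
    """Skip the write when an identical artifact already exists for this key."""
    digest = hashlib.sha256(payload).hexdigest()
    existing = store.find_artifact(idempotency_key=idempotency_key,
                                   sha256=digest)
    if existing is not None:
        return existing, False  # no-op: same key, same content hash
    return store.put_artifact(idempotency_key, digest, payload), True
```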

4) Implementation plan

4.1 Modules (new and updated)

  • New service: src/StellaOps.Orchestrator

    • api/ REST + WS handlers
    • scheduler/ run planner, DAG builder, watermark/backfill logic
    • queues/ publisher and consumer abstractions
    • ratelimit/ token bucket and adaptive controller
    • state/ Postgres repositories and migrations
    • audit/ action logging and export
    • metrics/ Prometheus exporters
    • security/ tenant scoping, KMS client, secret refs
  • Worker SDKs:

    • src/StellaOps.Orchestrator.WorkerSdk.Go and src/StellaOps.Orchestrator.WorkerSdk.Python with job claim, heartbeat, progress, artifact publish, and structured error reporting.
  • Console:

    • console/apps/orch/ pages: Overview, Sources, Runs, Jobs, Errors, Queues.
    • components/dag-view/, components/gantt/, components/health-heatmap/.
  • Updates to existing services:

    • Conseiller/Excitator/SBOM workers adopt the SDK and emit artifacts with schema/version/fingerprint.
    • VEX Lens exposes consensus_compute as a jobable operation.
    • Policy Engine exposes policy_eval as a job type for scheduled recalcs.

4.2 Packaging & deployment

  • Containers:

    • stella/orchestrator:<ver>
    • stella/worker-sdk-examples:<ver> for canary pools
  • Helm values:

    • Queues/topics, per-tenant concurrency, rate-limit defaults, WS replica count.
    • KMS integration secrets.
  • Migrations:

    • Flyway/Goose migrations for new tables and indexes.

4.3 Rollout strategy

  • Phase 1: Read-only dashboard fed by existing job tables; no controls.
  • Phase 2: Control actions enabled for non-prod tenants.
  • Phase 3: Backfills and quota management, then GA.

5) Documentation changes

Create/update the following, each ending with the imposed rule statement.

  1. /docs/orchestrator/overview.md: Concepts, roles, responsibilities, AOC alignment.

  2. /docs/orchestrator/architecture.md: Scheduler, DAGs, watermarks, queues, rate limits, data model.

  3. /docs/orchestrator/api.md: Endpoints, WebSocket events, error codes, examples.

  4. /docs/orchestrator/console.md: Screens, actions, a11y, live updates.

  5. /docs/orchestrator/cli.md: Commands, examples, exit codes, scripting patterns.

  6. /docs/orchestrator/run-ledger.md: Provenance and audit export format.

  7. /docs/security/secrets-handling.md: KMS references, redaction rules, operator hygiene.

  8. /docs/operations/orchestrator-runbook.md: Common failures, backfill guide, circuit breakers, tuning.

  9. /docs/schemas/artifacts.md: Artifact kinds, schema versions, hashing, storage layout.

  10. /docs/slo/orchestrator-slo.md: SLO definitions, measurement, alerting.


6) Engineering tasks

Backend (orchestrator)

  • Stand up Postgres schemas and indices for sources, runs, jobs, dag_edges, artifacts, quotas, schedules.
  • Implement scheduler: DAG planner, dependency resolver, critical path computation.
  • Implement rate limiter with adaptive behavior on 429/503 and per-tenant tokens.
  • Implement watermark/backfill manager with event-time windows and idempotency keys.
  • Implement API endpoints + OpenAPI spec + request validation.
  • Implement WebSocket/SSE event stream for live updates.
  • Implement audit logging and export.
  • Implement deadletter store and replay.

Worker SDKs and integrations

  • Build Go/Python SDKs with claim/heartbeat/progress API.
  • Integrate the SDK into Conseiller, Excitator, and SBOM workers; ensure artifact emission with schema version.
  • Add consensus_compute and policy_eval as job types with deterministic inputs/outputs.

Console

  • Overview tiles and health heatmap.
  • Source list/detail with actions and config view.
  • Runs timeline (Gantt) and DAG visualization with node inspector.
  • Jobs “tail” with live updates and filters.
  • Errors clustering and suggested remediations.
  • Queues/backpressure dashboard.
  • Backfill wizard with safety checks.

Observability

  • Emit metrics listed in §3.10 and wire traces across services.
  • Dashboards: health, queue depth, error classes, burn-rate, dispatch latency.
  • Alerts for SLO burn and circuit-breaker opens.

Security & RBAC

  • Enforce tenant scoping on all endpoints; test for leakage.
  • Wire KMS for secret refs and redact everywhere.
  • Implement Orch.Viewer|Operator|Admin roles and enforce them in Console and API.

Docs

  • Author all files in §5 with examples and screenshots.
  • Cross-link from Conseiller/Excitator/SBOM pages to the dashboard docs.
  • Append imposed rule to each page.

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.


7) Acceptance criteria

  • Operators can: pause/resume a source, run “sync-now”, initiate a backfill for a date range, and retry/cancel individual jobs from Console and CLI.
  • DAG and timeline reflect reality within 1 second of job state changes at P95.
  • Backfills do not create duplicate artifacts; idempotency proven by hash equality.
  • Rate limiter reduces 429s by ≥80% under simulated throttle tests.
  • Audit log includes who/when/why for every operator action.
  • Provenance ledger exports a complete chain for any artifact.
  • RBAC prevents non-admins from quota changes; tenancy isolation proven via automated tests.
  • SLO dashboard shows burn-rate and triggers alerts under injected failure.

8) Risks & mitigations

  • Orchestrator becomes a single bottleneck. Mitigation: scale stateless workers horizontally, tune DB indexes, batch job-state updates, cache hot paths.

  • Secret spillage. Mitigation: store only KMS references, redact aggressively, scrub logs in the SDK.

  • Over-eager backfills overwhelm upstreams. Mitigation: enforce per-source quotas and sandbox previews; dry-run backfills first.

  • Schema drift silently corrupts normalization. Mitigation: hard-fail on mismatch, DLQ with clear signatures, schema-registry gating.

  • Flapping sources cause alert fatigue. Mitigation: circuit breaker with cooldown and deduplicated alerts; error-budget policy.


9) Test plan

  • Unit: Scheduler DAG building, topological sort, backoff math, token bucket, watermark math.

  • Integration: Orchestrator ↔ worker SDK, artifact-store wiring, DLQ replay, audit pipeline.

  • Chaos: Inject 429 storms, packet loss, worker crashes; verify throttling and recovery.

  • Backfill: Simulate overlapping windows and verify idempotency and watermark correctness.

  • Perf: 10k concurrent jobs; measure dispatch latency, DB contention, WebSocket fan-out.

  • Security: Multi-tenant isolation tests; KMS mock tests for secret access; RBAC matrix.

  • UX/A11y: Screen-reader labels on DAG, keyboard navigation, live-region updates.


10) Philosophy

  • Make the invisible visible. Pipelines should be legible at a glance.
  • Prefer reproducibility to heroics. Idempotency and provenance over “we think it ran.”
  • Safeguards before speed. Throttle first, retry thoughtfully, never melt upstreams.
  • No silent merges. Evidence remains immutable; transformations are explicit, logged and reversible.

Final reminder: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.