
Below is the “maximum documentation” bundle for Epic 9. Paste it into your repo and pretend the ingestion chaos was always under control.

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.


Epic 9: Source & Job Orchestrator Dashboard

Short name: Orchestrator Dashboard
Primary service: orchestrator (scheduler, queues, rate limits, job state)
Surfaces: Console (Web UI), CLI, Web API
Touches: Conseiller (Feedser), Excitator (Vexer), VEX Consensus Lens, SBOM Service, Policy Engine, Findings Ledger, Authority (authN/Z), Telemetry/Analytics, Object Storage, Relational DB, Message Bus

AOC ground rule: Conseiller and Excitator aggregate but never merge. The orchestrator schedules, tracks and recovers jobs; it does not transform evidence beyond transport and storage. No “smart” merging in flight.


1) What it is

The Source & Job Orchestrator Dashboard is the control surface for every data source and pipeline run across StellaOps. It gives operators:

  • Live health of all advisory/VEX/SBOM sources and derived jobs.
  • End-to-end pipeline visibility as DAGs and timelines.
  • Controls for pausing, backfilling, replaying, throttling and retrying.
  • Error-pattern analysis, rate-limit observability and backpressure insights.
  • Provenance and audit trails from initial fetch through parse, normalize, index and policy evaluation.

The dashboard sits over the orchestrator service, which maintains job state, schedules runs, enforces quotas and rate limits, and collects metrics from worker pools embedded in Conseiller, Excitator, SBOM and related services.


2) Why (brief)

Ingestion breaks quietly and then loudly. Without a unified control plane, you learn about it from angry users or empty indexes. This dashboard shortens incident MTTR, enables safe backfills, and makes compliance reviewers stop sending emails with twelve attachments and one emoji.


3) How it should work (maximum detail)

3.1 Capabilities

  • Source registry

    • Register, tag and version connectors (OSV, GHSA, CSAF endpoints, vendor PDF scrapers, distro feeds, RSS, S3 drops, internal registries).
    • Store connection details, secrets (via KMS), rate-limit policy, schedules, and ownership metadata.
    • Validate and “test connection” safely.
  • Job orchestration

    • Create DAGs composed of job types: fetch, parse, normalize, dedupe, index, consensus_compute, policy_eval, crosslink, sbom_ingest, sbom_index.
    • Priorities, queues, concurrency caps, exponential backoff, circuit breakers.
    • Idempotency keys and output artifact hashing to avoid duplicate work.
    • Event-time watermarks for backfills without double counting.
  • Observability & control

    • Gantt timeline and real-time DAG view with critical-path highlighting.
    • Backpressure and queue depth heatmaps.
    • Error clustering by class (HTTP 429, TLS, schema mismatch, parse failure, upstream 5xx).
    • Per-source SLOs and SLA budgets with burn-rate alerts.
    • One-click actions: retry, replay range, pause/resume, throttle/unthrottle, reroute to canary workers.
  • Provenance & audit

    • Immutable run ledger linking input artifact → every job → output artifact.
    • Schema version tracking and drift detection.
    • Operator actions recorded with reason and ticket reference.
  • Safety

    • Secret redaction everywhere.
    • Tenant isolation at API, queue and storage layers.
    • AOC: no in-flight merges of advisory or VEX content.

3.2 Core architecture

  • orchestrator (service)

    • Maintains job state in Postgres (sources, runs, jobs, artifacts, dag_edges, quotas, schedules).
    • Publishes work to a message bus (e.g., topic.jobs.ready.<queue>).
    • Distributed token-bucket rate limiter per source/tenant/host.
    • Watchdog for stuck jobs and circuit breakers for flapping sources.
    • Watermark manager for backfills (event-time windows).
  • worker SDK

    • Lightweight library embedded in Conseiller/Excitator/SBOM workers to:

      • Claim work, heartbeat, update progress, report metrics.
      • Emit artifact metadata and checksums.
      • Enforce idempotency via an orchestrator-supplied key (a worker-loop sketch follows this list).
  • object store

    • Raw payloads and intermediate artifacts organized by schema and hash:

      • advisory/raw/<source_id>/<event_time>/<sha256>.json|pdf
      • advisory/normalized/<schema_ver>/<hash>.json
      • vex/raw|normalized/...
      • sbom/raw|graph/...
  • web API

    • CRUD for sources, runs, jobs, schedules, quotas.
    • Control actions (retry, cancel, pause, backfill).
    • Streaming updates via WebSocket/SSE for the Console.
  • console

    • React app consuming Orchestrator APIs, rendering DAGs, timelines, health charts and action panels with RBAC.
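
To make the worker SDK contract (claim, heartbeat, progress, artifact publish) concrete, below is a minimal sketch of a one-job worker loop. It is illustrative only: the client object and every method name on it (claim_job, heartbeat, publish_artifact, complete, fail) are assumptions, not the shipped SDK surface.

```python
import hashlib

def run_once(client, queue, handler):
    """Hypothetical worker loop: claim one job, heartbeat, publish, report.

    `client` is an assumed orchestrator API client; none of the method
    names used here are confirmed SDK surface.
    """
    job = client.claim_job(queue=queue)
    if job is None:
        return  # nothing ready on this queue
    try:
        client.heartbeat(job["id"], progress=0.0)
        payload = handler(job)  # job-type handler returns the output bytes
        digest = hashlib.sha256(payload).hexdigest()
        # Idempotency: publishing the same hash under the same key is a
        # no-op on the orchestrator side (see 3.12).
        artifact = client.publish_artifact(
            job_id=job["id"],
            idempotency_key=job["idempotency_key"],
            sha256=digest,
            body=payload,
        )
        client.complete(job["id"], output_artifact_id=artifact["id"])
    except Exception as exc:
        # Structured failure: the orchestrator classifies the error and
        # decides whether to retry or dead-letter (see 3.4).
        client.fail(job["id"], error_class=type(exc).__name__,
                    error_message=str(exc))
```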

3.3 Data model (selected tables)

  • sources

    • id, kind (advisory|vex|sbom|internal), subtype (e.g., osv, ghsa, csaf, vendor_pdf), display_name, owner_team, schedule_cron, rate_policy, enabled, secrets_ref, tags, schema_hint, created_at, updated_at.
  • runs

    • id, source_id, trigger (schedule|manual|event|backfill), window_start, window_end, state, started_at, finished_at, stats_json.
  • jobs

    • id, run_id, type, queue, priority, state (pending|running|succeeded|failed|canceled|deadletter), attempt, max_attempt, idempotency_key, input_artifact_id, output_artifact_id, worker_id, created_at, started_at, finished_at, error_class, error_message, metrics_json.
  • dag_edges

    • from_job_id, to_job_id, edge_kind (success_only|always); see the dependency-resolver sketch after this list.
  • artifacts

    • id, kind (raw|normalized|index|consensus), schema_ver, hash, uri, bytes, meta_json, created_at.
  • quotas

    • tenant_id, resource (requests_per_min, concurrent_jobs), limit, window_sec.
  • schedules

    • Per-source cron plus jitter, timezone, and blackout windows.
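
To illustrate how dag_edges drives dependency resolution, here is a sketch of the Kahn-style gate that releases pending jobs once their upstream edges are satisfied. Table and field names follow 3.3; the function itself is an assumption about the scheduler internals, not its actual code.

```python
from collections import defaultdict

def ready_jobs(job_states, edges):
    """Return pending jobs whose upstream dependencies are satisfied.

    job_states: {job_id: state} using the jobs.state values from 3.3.
    edges: iterable of (from_job_id, to_job_id, edge_kind) dag_edges rows.
    """
    blocked = defaultdict(int)
    for from_id, to_id, edge_kind in edges:
        upstream = job_states[from_id]
        if edge_kind == "success_only":
            satisfied = upstream == "succeeded"
        else:  # "always" edges fire once the upstream job is terminal
            satisfied = upstream in ("succeeded", "failed", "canceled")
        if not satisfied:
            blocked[to_id] += 1
    return [job_id for job_id, state in job_states.items()
            if state == "pending" and blocked[job_id] == 0]
```

The scheduler would call something like this after every job state transition (step 4 in 3.4) to enqueue newly unblocked nodes.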

3.4 Job lifecycle

  1. Plan: Scheduler creates a run for a source and plans a DAG, e.g., fetch → parse → normalize → dedupe → index → policy_eval (advisory) or fetch → parse → normalize → consensus_compute (VEX).

  2. Enqueue: Ready nodes become jobs with a queue, priority, idempotency key and, optionally, rate-limit tokens reserved.

  3. Execute: Worker claims the job and heartbeats every N seconds. Output artifacts are stored and linked. Failures are classified and retried with exponential backoff and jitter, up to max_attempt (a sketch of the backoff math follows this list).

  4. Complete: Downstream nodes unblock. On run completion, the orchestrator computes SLO deltas and emits a run summary.

  5. Dead-letter: Jobs exceeding attempts move to a DLQ with structured context and suggested remediation.
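
The backoff in step 3 can be as small as the function below; a sketch only, since the base delay, cap, and full-jitter strategy are illustrative choices rather than fixed contract:

```python
import random

def retry_delay_sec(attempt, base=2.0, cap=300.0):
    """Exponential backoff with full jitter, capped.

    attempt is 1-based: the ceiling doubles each attempt up to `cap`, and
    a uniform draw below it spreads retries so failures do not re-synchronize.
    """
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0, ceiling)
```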

3.5 Scheduling, backpressure, rate limits

  • Token bucket per {tenant, source.host} with adaptive refill when upstream 429/503 responses are seen (a sketch follows this list).
  • Concurrency caps per source and per job type to avoid a thundering herd.
  • Backpressure signals from queue depth, worker CPU, and upstream error rates; the scheduler reduces in-flight issuance accordingly.
  • Backfills use event-time windows with immutable watermarks to avoid reprocessing.
  • Blackout windows for vendor maintenance periods.
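
A minimal sketch of the adaptive token bucket described above; the halve-on-throttle and 5% recovery constants are assumptions for illustration, not tuned values:

```python
import time

class AdaptiveTokenBucket:
    """Per-{tenant, source.host} token bucket whose refill rate backs off
    when the upstream signals throttling (429/503) and creeps back when
    responses are healthy."""

    def __init__(self, rate_per_sec, burst):
        self.base_rate = rate_per_sec   # configured refill rate
        self.rate = rate_per_sec        # current (adaptive) refill rate
        self.burst = burst              # bucket capacity
        self.tokens = burst
        self.stamp = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now

    def try_acquire(self, n=1.0):
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller should defer issuance (backpressure)

    def on_upstream_throttle(self):
        # 429/503 seen: halve the refill rate, floored at 10% of base.
        self.rate = max(self.base_rate * 0.1, self.rate * 0.5)

    def on_upstream_ok(self):
        # Healthy response: recover slowly toward the configured base rate.
        self.rate = min(self.base_rate, self.rate * 1.05)
```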

3.6 APIs

POST   /orchestrator/sources
GET    /orchestrator/sources?kind=&tag=&q=
GET    /orchestrator/sources/{id}
PATCH  /orchestrator/sources/{id}
POST   /orchestrator/sources/{id}/actions:test|pause|resume|sync-now
POST   /orchestrator/sources/{id}/backfill { "from":"2024-01-01", "to":"2024-03-01" }

GET    /orchestrator/runs?source_id=&state=&from=&to=
GET    /orchestrator/runs/{run_id}
GET    /orchestrator/runs/{run_id}/dag
POST   /orchestrator/runs/{run_id}/cancel

GET    /orchestrator/jobs?state=&type=&queue=&source_id=
GET    /orchestrator/jobs/{job_id}
POST   /orchestrator/jobs/{job_id}/actions:retry|cancel|prioritize

GET    /orchestrator/metrics/overview
GET    /orchestrator/errors/top?window=1h
GET    /orchestrator/quotas
PATCH  /orchestrator/quotas/{tenant_id}
WS     /orchestrator/streams/updates
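
As a usage example, launching the backfill above from a script might look like this (a sketch with Python's requests; the host, bearer-token auth, source id, and response shape are all assumptions):

```python
import requests

BASE = "https://stella.example.internal/orchestrator"  # assumed host
HEADERS = {"Authorization": "Bearer <token>"}          # assumed auth scheme

# POST /orchestrator/sources/{id}/backfill with the window shown in 3.6.
resp = requests.post(
    f"{BASE}/sources/src-osv-prod/backfill",           # source id is illustrative
    json={"from": "2024-01-01", "to": "2024-03-01"},
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()
run = resp.json()  # assumed to describe the planned backfill run
print(run.get("id"), run.get("state"))
```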

3.7 Console (Web UI)

  • Overview

    • KPI tiles: sources healthy, runs in progress, queue depth, error rate, burn-rate to SLO.
    • Heatmap of source health by last 24h success ratio.
  • Sources

    • Grid with filters, inline status (active, paused, throttled), next-run ETA, last error class.
    • Detail panel: config, secrets status (redacted), schedule, rate limits, ownership, run history, action buttons.
  • Runs

    • Timeline (Gantt) with critical path, duration distribution, and per-stage breakdown.
    • Run detail: DAG view with node metrics, artifacts, logs, action menu (cancel).
  • Jobs

    • Live table with state filters and “tail” view.
    • Job detail: payload preview (redacted), worker, attempts, stack traces, linked artifacts.
  • Errors

    • Clusters by class and signature, suggested remediations (pause source, lower concurrency, patch parser).
  • Queues & Backpressure

    • Per-queue depth, service rate, in-flight count, age percentiles.
    • Rate-limit token graphs per source host.
  • Controls

    • Backfill wizard with event-time preview and safety checks.
    • Canary routing: route 5% of next 100 runs to a new worker pool.
  • A11y

    • Keyboard nav, ARIA roles for DAG nodes, live regions for updates, colorblind-friendly graphs.

3.8 CLI

stella orch sources list --kind advisory --tag prod
stella orch sources add --file source.yaml
stella orch sources test <source-id>
stella orch sources pause <source-id>  # or resume
stella orch sources sync-now <source-id>
stella orch sources backfill <source-id> --from 2024-01-01 --to 2024-03-01

stella orch runs list --source <id> --state running
stella orch runs show <run-id> --dag
stella orch runs cancel <run-id>

stella orch jobs list --state failed --type parse --limit 100
stella orch jobs retry <job-id>
stella orch jobs cancel <job-id>
stella orch jobs tail --queue normalize --follow

stella orch quotas get --tenant default
stella orch quotas set --tenant default --concurrent-jobs 50 --rpm 1200

Exit codes: 0 success, 2 invalid args, 4 not found, 5 denied, 7 precondition failed, 8 rate-limited.
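
Those exit codes make scripted control loops easy to write defensively. A sketch that retries only on the rate-limited code (the retry policy itself is illustrative):

```python
import subprocess
import time

RATE_LIMITED = 8  # per the exit-code table above

def retry_job(job_id, attempts=5):
    """Retry a job via the CLI, pausing when the orchestrator rate-limits us."""
    for i in range(attempts):
        result = subprocess.run(["stella", "orch", "jobs", "retry", job_id])
        if result.returncode != RATE_LIMITED:
            return result.returncode  # 0 on success, or a non-retryable error
        time.sleep(2 ** i)            # back off before asking again
    return RATE_LIMITED
```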

3.9 RBAC & security

  • Roles

    • Orch.Viewer: read-only access to sources/runs/jobs/metrics.
    • Orch.Operator: perform actions on sources and jobs, launch backfills.
    • Orch.Admin: manage quotas, schedules, connector versions, and delete sources.
  • Secrets

    • Stored only as references to your KMS; never persisted in cleartext.
    • Console shows redaction badges and the last-rotated timestamp.
  • Tenancy

    • Source, run, job rows scoped by tenant id.
    • Queue names and token buckets namespaced per tenant.
  • Compliance

    • Full audit log for every operator action with “reason” and optional ticket link.
    • Exportable run ledger for audits.

3.10 Observability

  • Metrics (examples)

    • orch_jobs_inflight{type,queue}
    • orch_jobs_latency_ms{type,percentile}
    • orch_rate_tokens_available{source}
    • orch_error_rate{source,error_class}
    • orch_slo_burn_rate{source,slo}
    • orch_deadletter_total{source,type}
  • Traces

    • Span per job with baggage: run_id, source_id, artifact_id.
    • Links across services to Conseiller/Excitator/SBOM workers.
  • Logs

    • Structured JSON with correlation ids, attempt numbers and redacted payload previews.

3.11 Performance targets

  • Job dispatch P95 < 150 ms after dependencies are satisfied.
  • Scheduler loop P95 < 500 ms with 10k pending jobs.
  • Console live updates sub-second at 1k events/sec per tenant.
  • Backfill throughput ≥ 200 jobs/sec per worker pool with zero duplicates.

3.12 Edge cases & behaviors

  • Upstream 429 storms: auto-throttle, optionally pause, and recommend extended jitter.
  • Schema drift: parser moves the job to the DLQ with error_class=schema_mismatch and opens a change ticket via webhook.
  • Flapping source: circuit breaker opens after N consecutive failures; requires a human “resume”.
  • Clock skew: watermark logic uses upstream event time; large skews are flagged.
  • Idempotency collisions: a new attempt yields a no-op if the artifact hash already exists (a sketch follows this list).
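
Concretely, the idempotency no-op in the last bullet reduces to a hash lookup before the write; a sketch, with the store API names assumed:

```python
import hashlib

def publish_if_new(store, idempotency_key, payload):
    """Skip the write when an identical artifact already exists for this key."""
    digest = hashlib.sha256(payload).hexdigest()
    existing = store.find_artifact(idempotency_key=idempotency_key,
                                   sha256=digest)
    if existing is not None:
        return existing, False  # no-op: same key, same content hash
    return store.put_artifact(idempotency_key, digest, payload), True
```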

4) Implementation plan

4.1 Modules (new and updated)

  • New service: src/StellaOps.Orchestrator

    • api/ REST + WS handlers
    • scheduler/ run planner, DAG builder, watermark/backfill logic
    • queues/ publisher and consumer abstractions
    • ratelimit/ token bucket and adaptive controller
    • state/ Postgres repositories and migrations
    • audit/ action logging and export
    • metrics/ Prometheus exporters
    • security/ tenant scoping, KMS client, secret refs
  • Worker SDKs:

    • src/StellaOps.Orchestrator.WorkerSdk.Go and src/StellaOps.Orchestrator.WorkerSdk.Python with job claim, heartbeat, progress, artifact publish, and structured error reporting.
  • Console:

    • console/apps/orch/ pages: Overview, Sources, Runs, Jobs, Errors, Queues.
    • components/dag-view/, components/gantt/, components/health-heatmap/.
  • Updates to existing services:

    • Conseiller/Excitator/SBOM workers adopt the SDK and emit artifacts with schema/version/fingerprint.
    • VEX Lens exposes consensus_compute as a jobable operation.
    • Policy Engine exposes policy_eval as a job type for scheduled recalcs.

4.2 Packaging & deployment

  • Containers:

    • stella/orchestrator:<ver>
    • stella/worker-sdk-examples:<ver> for canary pools
  • Helm values:

    • Queues/topics, per-tenant concurrency, rate-limit defaults, WS replica count.
    • KMS integration secrets.
  • Migrations:

    • Flyway/Goose migrations for new tables and indexes.

4.3 Rollout strategy

  • Phase 1: Read-only dashboard fed by existing job tables; no controls.
  • Phase 2: Control actions enabled for non-prod tenants.
  • Phase 3: Backfills and quota management, then GA.

5) Documentation changes

Create/update the following, each ending with the imposed rule statement.

  1. /docs/orchestrator/overview.md: Concepts, roles, responsibilities, AOC alignment.

  2. /docs/orchestrator/architecture.md: Scheduler, DAGs, watermarks, queues, rate limits, data model.

  3. /docs/orchestrator/api.md: Endpoints, WebSocket events, error codes, examples.

  4. /docs/orchestrator/console.md: Screens, actions, a11y, live updates.

  5. /docs/orchestrator/cli.md: Commands, examples, exit codes, scripting patterns.

  6. /docs/orchestrator/run-ledger.md: Provenance and audit export format.

  7. /docs/security/secrets-handling.md: KMS references, redaction rules, operator hygiene.

  8. /docs/operations/orchestrator-runbook.md: Common failures, backfill guide, circuit breakers, tuning.

  9. /docs/schemas/artifacts.md: Artifact kinds, schema versions, hashing, storage layout.

  10. /docs/slo/orchestrator-slo.md: SLO definitions, measurement, alerting.


6) Engineering tasks

Backend (orchestrator)

  • Stand up Postgres schemas and indices for sources, runs, jobs, dag_edges, artifacts, quotas, schedules.
  • Implement scheduler: DAG planner, dependency resolver, critical path computation.
  • Implement rate limiter with adaptive behavior on 429/503 and per-tenant tokens.
  • Implement watermark/backfill manager with event-time windows and idempotency keys.
  • Implement API endpoints + OpenAPI spec + request validation.
  • Implement WebSocket/SSE event stream for live updates.
  • Implement audit logging and export.
  • Implement deadletter store and replay.

Worker SDKs and integrations

  • Build Go/Python SDKs with claim/heartbeat/progress API.
  • Integrate the SDK into Conseiller, Excitator, and SBOM workers; ensure artifact emission with schema version.
  • Add consensus_compute and policy_eval as job types with deterministic inputs/outputs.

Console

  • Overview tiles and health heatmap.
  • Source list/detail with actions and config view.
  • Runs timeline (Gantt) and DAG visualization with node inspector.
  • Jobs “tail” with live updates and filters.
  • Errors clustering and suggested remediations.
  • Queues/backpressure dashboard.
  • Backfill wizard with safety checks.

Observability

  • Emit metrics listed in §3.10 and wire traces across services.
  • Dashboards: health, queue depth, error classes, burn-rate, dispatch latency.
  • Alerts for SLO burn and circuit-breaker opens.

Security & RBAC

  • Enforce tenant scoping on all endpoints; test for leakage.
  • Wire KMS for secret refs and redact everywhere.
  • Implement Orch.Viewer|Operator|Admin roles and enforce them in Console and API.

Docs

  • Author all files in §5 with examples and screenshots.
  • Cross-link from Conseiller/Excitator/SBOM pages to the dashboard docs.
  • Append imposed rule to each page.

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.


7) Acceptance criteria

  • Operators can: pause/resume a source, run “sync-now”, initiate a backfill for a date range, and retry/cancel individual jobs from Console and CLI.
  • DAG and timeline reflect reality within 1 second of job state changes at P95.
  • Backfills do not create duplicate artifacts; idempotency proven by hash equality.
  • Rate limiter reduces 429s by ≥80% under simulated throttle tests.
  • Audit log includes who/when/why for every operator action.
  • Provenance ledger exports a complete chain for any artifact.
  • RBAC prevents non-admins from quota changes; tenancy isolation proven via automated tests.
  • SLO dashboard shows burn-rate and triggers alerts under injected failure.

8) Risks & mitigations

  • Orchestrator becomes a single bottleneck. Mitigation: scale stateless workers horizontally, tune DB indexes, batch job-state updates, cache hot paths.

  • Secret spillage. Mitigation: store only KMS references, redact aggressively, scrub logs in the SDK.

  • Over-eager backfills overwhelm upstreams. Mitigation: enforce per-source quotas and sandbox previews; dry-run backfills first.

  • Schema drift silently corrupts normalization. Mitigation: hard-fail on mismatch, DLQ with clear signatures, schema-registry gating.

  • Flapping sources cause alert fatigue. Mitigation: circuit breaker with cooldown and deduplicated alerts; error-budget policy.


9) Test plan

  • Unit: Scheduler DAG building, topological sort, backoff math, token bucket, watermark math.

  • Integration: Orchestrator ↔ worker SDK, artifact-store wiring, DLQ replay, audit pipeline.

  • Chaos: Inject 429 storms, packet loss, worker crashes; verify throttling and recovery.

  • Backfill: Simulate overlapping windows and verify idempotency and watermark correctness.

  • Perf: 10k concurrent jobs; measure dispatch latency, DB contention, WebSocket fan-out.

  • Security: Multi-tenant isolation tests; KMS mock tests for secret access; RBAC matrix.

  • UX/A11y: Screen-reader labels on DAG, keyboard navigation, live-region updates.


10) Philosophy

  • Make the invisible visible. Pipelines should be legible at a glance.
  • Prefer reproducibility to heroics. Idempotency and provenance over “we think it ran.”
  • Safeguards before speed. Throttle first, retry thoughtfully, never melt upstreams.
  • No silent merges. Evidence remains immutable; transformations are explicit, logged and reversible.

Final reminder: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.