git.stella-ops.org/docs/modules/orchestrator/README.md
2025-12-25 18:50:33 +02:00


StellaOps Source & Job Orchestrator

The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.

Latest updates (2025-11-30)

  • OpenAPI discovery published at /.well-known/openapi with openapi/orchestrator.json; includes pagination/idempotency/error-envelope examples and version headers.
  • Legacy job detail/summary endpoints now emit Deprecation + Link headers pointing to the stable replacements.
  • Job leasing flows through the Task Runner bridge: allocations carry idempotency keys, lease durations, and retry hints; workers acknowledge via claim/ack and emit heartbeats.
  • Event envelopes remain interim pending ORCH-SVC-37-101; include provenance (tenant/project, job type, correlationId, task runner id) in all notifier events.
  • Authority orch:quota / orch:backfill scopes require reason/ticket audit fields; include them in runbooks and dashboard overrides.
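
The leasing and provenance fields listed above can be sketched as a single allocation envelope. This is an illustrative shape only (field names are assumptions, not the wire contract, which is published at /.well-known/openapi):

```python
import json
import uuid
from datetime import datetime, timedelta, timezone

def build_lease_allocation(tenant, project, job_type, correlation_id,
                           lease_seconds=60, max_retries=3):
    """Sketch of a lease allocation handed to a Task Runner worker.

    Field names here are illustrative assumptions; consult the published
    OpenAPI document for the real schema.
    """
    now = datetime.now(timezone.utc)
    return {
        "idempotencyKey": str(uuid.uuid4()),  # dedupes replayed allocations
        "leaseExpiresAt": (now + timedelta(seconds=lease_seconds)).isoformat(),
        "retryHint": {"maxRetries": max_retries, "backoffSeconds": 5},
        "provenance": {  # provenance is required on all notifier events
            "tenant": tenant,
            "project": project,
            "jobType": job_type,
            "correlationId": correlation_id,
        },
    }

alloc = build_lease_allocation("acme", "scans", "concelier.ingest", "corr-123")
print(json.dumps(alloc, indent=2))
```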

Responsibilities

  • Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines.
  • Expose dashboards and APIs for throttling, replays, and failover.
  • Enforce rate limits, concurrency caps, and dependency chains across queues.
  • Stream structured events and audit logs for incident response.
  • Provide Task Runner bridge semantics (claim/ack, heartbeats, progress, artifacts, backfills) for Go/Python SDKs.
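
The claim/ack/heartbeat call order described above can be sketched against an in-memory stand-in for the bridge (real workers talk HTTP via the Go/Python SDKs; the class below only illustrates the sequence of calls):

```python
class InMemoryBridge:
    """Stand-in for the Task Runner bridge API (claim/ack/heartbeat).

    A toy for illustration; the real bridge is an HTTP service with
    leases and idempotency keys.
    """
    def __init__(self, jobs):
        self._jobs = list(jobs)
        self.acked, self.heartbeats = [], []

    def claim(self, worker_id):
        return self._jobs.pop(0) if self._jobs else None

    def heartbeat(self, job_id):
        self.heartbeats.append(job_id)

    def ack(self, job_id, status):
        self.acked.append((job_id, status))

def run_worker(bridge, worker_id, handler):
    """Claim jobs until the queue drains, heartbeating around the handler."""
    while (job := bridge.claim(worker_id)) is not None:
        bridge.heartbeat(job["id"])    # keep the lease alive
        status = handler(job)
        bridge.ack(job["id"], status)  # release the lease with a result

bridge = InMemoryBridge([{"id": "job-1"}, {"id": "job-2"}])
run_worker(bridge, "worker-a", lambda job: "succeeded")
print(bridge.acked)  # [('job-1', 'succeeded'), ('job-2', 'succeeded')]
```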

Key components

  • Orchestrator WebService (control plane).
  • Queue adapters (Valkey/NATS) and job ledger.
  • Console dashboard module and CLI integration for operators.

Integrations & dependencies

  • Authority for authN/Z on operational actions.
  • Telemetry stack for job metrics and alerts.
  • Scheduler/Concelier/Excititor workers for job lifecycle.
  • Offline Kit for state export/import during air-gap refreshes.

Operational notes

  • Job recovery runbooks and dashboard JSON are maintained as described in Epic 9.
  • Rate-limit and lease reconfiguration guidelines; keep lease defaults aligned across runners and SDKs (Go/Python).
  • Log streaming: SSE/WS endpoints carry correlationId + tenant/project; buffer size and retention must be documented in runbooks.
  • When using orch:quota / orch:backfill scopes, capture reason/ticket fields in runbooks and audit checklists.
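
The SSE log-stream frames mentioned above can be consumed with a minimal parser. This sketch assumes JSON payloads in `data:` lines carrying `correlationId` and tenant fields; real frames may also carry `event:`/`id:` fields and retry directives:

```python
import json

def parse_sse(stream_text):
    """Minimal parser for an SSE log feed (illustrative only).

    Splits the stream into blank-line-delimited events and decodes the
    JSON carried in each event's data lines.
    """
    events = []
    for block in stream_text.strip().split("\n\n"):
        data_lines = [line[len("data: "):] for line in block.splitlines()
                      if line.startswith("data: ")]
        if data_lines:
            events.append(json.loads("\n".join(data_lines)))
    return events

sample = (
    'data: {"correlationId": "corr-1", "tenant": "acme", "msg": "job started"}\n'
    "\n"
    'data: {"correlationId": "corr-1", "tenant": "acme", "msg": "job finished"}\n'
)
events = parse_sse(sample)
print([e["msg"] for e in events])  # ['job started', 'job finished']
```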

Implementation Status

Phase 1: Core service & job ledger (Complete)

  • PostgreSQL schema with sources, runs, jobs, artifacts, DAG edges, quotas, schedules, incidents
  • Lease manager with heartbeats, retries, dead-letter queues
  • Token-bucket rate limiter per tenant/source.host with adaptive refill
  • Watermark/backfill orchestration for event-time windows
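
A token-bucket limiter with the adaptive behaviour described above can be sketched as follows. The halving-on-throttle step is an assumed adaptation policy, not the production algorithm:

```python
import time

class TokenBucket:
    """Per-tenant/source.host token bucket (sketch).

    'Adaptive refill' is modeled here as scaling the refill rate down
    when the upstream source signals throttling (e.g. HTTP 429).
    """
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self._last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self._last) * self.refill_per_sec)
        self._last = now

    def try_acquire(self, n=1):
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def on_upstream_throttle(self, factor=0.5):
        # Adaptive step: halve the refill rate when the source pushes back.
        self.refill_per_sec *= factor

bucket = TokenBucket(capacity=2, refill_per_sec=0.1)
print(bucket.try_acquire(), bucket.try_acquire(), bucket.try_acquire())
# True True False: the bucket is drained until it refills
```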

Phase 2: Worker SDK & artifact registry (Complete)

  • Claim/heartbeat/report contract with deterministic artifact hashing
  • Idempotency enforcement and worker SDKs for .NET/Rust/Go agents
  • Integrated with Concelier, Excititor, SBOM Service, Policy Engine
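
Deterministic artifact hashing for idempotency can be sketched by canonicalising the payload before hashing. The sorted-keys JSON scheme below is an assumption; the real SDKs may use a different canonical encoding:

```python
import hashlib
import json

def artifact_digest(payload):
    """Deterministic digest for idempotency enforcement (sketch).

    Canonicalises JSON (sorted keys, no insignificant whitespace) so the
    same logical content always yields the same digest.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = artifact_digest({"job": "j-1", "result": [1, 2]})
b = artifact_digest({"result": [1, 2], "job": "j-1"})  # key order differs
print(a == b)  # True: same content, same digest
```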

Phase 3: Observability & dashboard (In Progress)

  • Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate
  • Error clustering for HTTP 429/5xx, schema mismatches, parse errors
  • SSE/WebSocket feeds for Console updates, Gantt timeline/DAG JSON
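
The error-clustering step above can be sketched as a coarse classifier over raw job failures. Keyword/status matching here is a simplification of the real clustering:

```python
from collections import Counter

def classify(error):
    """Bucket raw job errors into coarse failure classes (sketch)."""
    status = error.get("httpStatus")
    if status == 429:
        return "rate-limited"
    if status is not None and 500 <= status < 600:
        return "upstream-5xx"
    if "schema" in error.get("message", "").lower():
        return "schema-mismatch"
    return "parse-error"

errors = [
    {"httpStatus": 429, "message": "Too Many Requests"},
    {"httpStatus": 503, "message": "Service Unavailable"},
    {"message": "schema version 3 expected, got 2"},
    {"message": "unexpected EOF while parsing"},
]
clusters = Counter(classify(e) for e in errors)
print(clusters)
```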

Phase 4: Controls & resilience (Planned)

  • Pause/resume/throttle/retry/backfill tooling
  • Dead-letter review, circuit breakers, blackouts, backpressure handling
  • Automation hooks and control plane APIs
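
The circuit-breaker behaviour planned above can be sketched as a minimal closed/open state machine. A production breaker would also track time windows, half-open probing, and per-source state:

```python
class CircuitBreaker:
    """Minimal closed/open breaker (sketch, not the planned implementation)."""
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.state = "closed"

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"  # stop dispatching to this source

    def allow(self):
        return self.state != "open"

cb = CircuitBreaker(failure_threshold=2)
cb.record_failure()
cb.record_failure()
print(cb.allow())  # False: breaker opened after two failures
cb.record_success()
print(cb.allow())  # True: a success closes the breaker again
```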

Phase 5: Offline & compliance (Planned)

  • Deterministic audit bundles (jobs.jsonl, history.jsonl, throttles.jsonl)
  • Provenance manifests and offline replay scripts
  • Tenant isolation validation and secret redaction
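
Determinism in the audit bundles above can be sketched as stable record ordering plus canonical JSON, so re-exports of the same job history are byte-identical. The `id`-sorted ordering is an assumption; the real bundle also carries history.jsonl, throttles.jsonl, and signatures:

```python
import json

def write_jobs_jsonl(jobs):
    """Render jobs.jsonl content deterministically (sketch).

    Sorts records by job id and canonicalises each JSON line so the
    output does not depend on input ordering.
    """
    lines = [json.dumps(j, sort_keys=True, separators=(",", ":"))
             for j in sorted(jobs, key=lambda j: j["id"])]
    return "\n".join(lines) + "\n"

run_a = write_jobs_jsonl([{"id": "j-2", "state": "done"},
                          {"id": "j-1", "state": "failed"}])
run_b = write_jobs_jsonl([{"id": "j-1", "state": "failed"},
                          {"id": "j-2", "state": "done"}])
print(run_a == run_b)  # True: input order does not affect the bundle
```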

Key Acceptance Criteria

  • Schedules all jobs with quotas, rate limits, idempotency; preserves provenance
  • Console reflects real-time DAG status, queue depth, SLO burn rate
  • Observability stack exposes metrics, logs, traces, incidents for stuck jobs and throttling
  • Offline audit bundles reproduce job history deterministically with verified signatures

Technical Decisions & Risks

  • Backpressure/queue overload mitigated via adaptive token buckets, circuit breakers, dynamic concurrency
  • Upstream vendor throttles managed with visible state, automatic jitter and retry
  • Tenant leakage prevented through API/queue/storage filters, fuzz tests, redaction
  • Complex DAG errors handled with diagnostics, error clustering, partial replay tooling

Epic alignment

  • Epic 9: Source & Job Orchestrator Dashboard.
  • ORCH stories in ../../TASKS.md.