# StellaOps Source & Job Orchestrator

The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.

## Latest updates (2025-11-30)

- OpenAPI discovery published at `/.well-known/openapi` with `openapi/orchestrator.json`; includes pagination/idempotency/error-envelope examples and version headers.
- Legacy job detail/summary endpoints now emit `Deprecation` + `Link` headers pointing to the stable replacements.
- Job leasing flows through the Task Runner bridge: allocations carry idempotency keys, lease durations, and retry hints; workers acknowledge via claim/ack and emit heartbeats.
- Event envelopes remain interim pending ORCH-SVC-37-101; include provenance (tenant/project, job type, correlationId, task runner id) in all notifier events.
- Authority `orch:quota` / `orch:backfill` scopes require reason/ticket audit fields; include them in runbooks and dashboard overrides.

## Responsibilities

- Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines.
- Expose dashboards and APIs for throttling, replays, and failover.
- Enforce rate-limits, concurrency and dependency chains across queues.
- Stream structured events and audit logs for incident response.
- Provide Task Runner bridge semantics (claim/ack, heartbeats, progress, artifacts, backfills) for Go/Python SDKs.

## Key components

- Orchestrator WebService (control plane).
- Queue adapters (Valkey/NATS) and job ledger.
- Console dashboard module and CLI integration for operators.

## Integrations & dependencies

- Authority for authN/Z on operational actions.
- Telemetry stack for job metrics and alerts.
- Scheduler/Concelier/Excititor workers for job lifecycle.
- Offline Kit for state export/import during air-gap refreshes.

## Operational notes

- Job recovery runbooks and dashboard JSON as described in Epic 9.
- Rate-limit and lease reconfiguration guidelines; keep lease defaults aligned across runners and SDKs (Go/Python).
- Log streaming: SSE/WS endpoints carry correlationId + tenant/project; buffer size and retention must be documented in runbooks.
- When using `orch:quota` / `orch:backfill` scopes, capture reason/ticket fields in runbooks and audit checklists.

## Implementation Status

### Phase 1 – Core service & job ledger (Complete)

- PostgreSQL schema with sources, runs, jobs, artifacts, DAG edges, quotas, schedules, incidents
- Lease manager with heartbeats, retries, dead-letter queues
- Token-bucket rate limiter per tenant/source.host with adaptive refill
- Watermark/backfill orchestration for event-time windows

### Phase 2 – Worker SDK & artifact registry (Complete)

- Claim/heartbeat/report contract with deterministic artifact hashing
- Idempotency enforcement and worker SDKs for .NET/Rust/Go agents
- Integrated with Concelier, Excititor, SBOM Service, Policy Engine

### Phase 3 – Observability & dashboard (In Progress)

- Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate
- Error clustering for HTTP 429/5xx, schema mismatches, parse errors
- SSE/WebSocket feeds for Console updates, Gantt timeline/DAG JSON

### Phase 4 – Controls & resilience (Planned)

- Pause/resume/throttle/retry/backfill tooling
- Dead-letter review, circuit breakers, blackouts, backpressure handling
- Automation hooks and control plane APIs

### Phase 5 – Offline & compliance (Planned)

- Deterministic audit bundles (jobs.jsonl, history.jsonl, throttles.jsonl)
- Provenance manifests and offline replay scripts
- Tenant isolation validation and secret redaction

### Key Acceptance Criteria

- Schedules all jobs with quotas, rate limits, idempotency; preserves provenance
- Console reflects real-time DAG status, queue depth, SLO burn rate
- Observability stack exposes metrics, logs, traces, incidents for stuck jobs and throttling
- Offline audit bundles reproduce job history deterministically with verified signatures

### Technical Decisions & Risks

- Backpressure/queue overload mitigated via adaptive token buckets, circuit breakers, dynamic concurrency
- Upstream vendor throttles managed with visible state, automatic jitter and retry
- Tenant leakage prevented through API/queue/storage filters, fuzz tests, redaction
- Complex DAG errors handled with diagnostics, error clustering, partial replay tooling

## Epic alignment

- Epic 9: Source & Job Orchestrator Dashboard.
- ORCH stories in ../../TASKS.md.
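
## Illustrative sketch: per-tenant token bucket with adaptive refill

Phase 1 lists a token-bucket rate limiter keyed per tenant/source.host with adaptive refill. The Python sketch below shows one way such a limiter could behave; it is not the Orchestrator's implementation. The class name `RateLimiter`, its methods, and the halving/additive-recovery policy are all assumptions made for illustration, and time is passed in explicitly so the behaviour is deterministic.

```python
class RateLimiter:
    """Hypothetical in-memory token buckets keyed by (tenant, host)."""

    def __init__(self, capacity, refill_per_sec, floor=0.1):
        self.capacity = float(capacity)        # maximum burst size per bucket
        self.base_refill = float(refill_per_sec)
        self.floor = floor                     # lowest refill rate after throttling
        self._buckets = {}                     # (tenant, host) -> mutable state dict

    def _bucket(self, tenant, host, now):
        key = (tenant, host)
        if key not in self._buckets:
            # New buckets start full so the first burst is allowed.
            self._buckets[key] = {
                "tokens": self.capacity,
                "refill": self.base_refill,
                "updated_at": now,
            }
        return self._buckets[key]

    def try_acquire(self, tenant, host, now, cost=1.0):
        """Refill for elapsed time, then take `cost` tokens if available."""
        b = self._bucket(tenant, host, now)
        elapsed = max(0.0, now - b["updated_at"])
        b["tokens"] = min(self.capacity, b["tokens"] + elapsed * b["refill"])
        b["updated_at"] = now
        if b["tokens"] >= cost:
            b["tokens"] -= cost
            return True
        return False

    def on_throttled(self, tenant, host, now):
        """Assumed policy: halve the refill rate when upstream returns HTTP 429."""
        b = self._bucket(tenant, host, now)
        b["refill"] = max(self.floor, b["refill"] * 0.5)

    def on_success(self, tenant, host, now, step=0.1):
        """Assumed policy: recover additively, never above the base rate."""
        b = self._bucket(tenant, host, now)
        b["refill"] = min(self.base_refill, b["refill"] + step)
```

Keying the bucket on both tenant and host means one tenant's burst against a source cannot consume another tenant's budget for the same source, which is in the spirit of the tenant-isolation goal noted under Technical Decisions & Risks.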
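
## Illustrative sketch: claim/heartbeat/ack leasing

The Task Runner bridge is described in terms of claim/ack, heartbeats, idempotency keys, lease durations, retries, and dead-lettering. The Python sketch below models that lifecycle in memory to make the state transitions concrete. It is a simplified illustration under stated assumptions, not the service's API: `LeaseManager`, `Job`, the state names, and the `max_attempts` rule are hypothetical, and the real ledger is PostgreSQL-backed rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: str
    idempotency_key: str
    attempts: int = 0
    state: str = "pending"              # pending | leased | done | dead
    lease_owner: Optional[str] = None
    lease_expires_at: float = 0.0


class LeaseManager:
    """Hypothetical in-memory lease manager (names are illustrative)."""

    def __init__(self, lease_seconds=30.0, max_attempts=3):
        self.lease_seconds = lease_seconds
        self.max_attempts = max_attempts
        self.jobs = {}       # job_id -> Job
        self._by_key = {}    # idempotency_key -> job_id

    def enqueue(self, job_id, idempotency_key):
        # A repeated idempotency key returns the existing job instead of a duplicate.
        if idempotency_key in self._by_key:
            return self.jobs[self._by_key[idempotency_key]]
        job = Job(job_id, idempotency_key)
        self.jobs[job_id] = job
        self._by_key[idempotency_key] = job_id
        return job

    def reap_expired(self, now):
        # Expired leases go back to pending, or to dead-letter after max_attempts.
        for job in self.jobs.values():
            if job.state == "leased" and now >= job.lease_expires_at:
                job.lease_owner = None
                job.state = "dead" if job.attempts >= self.max_attempts else "pending"

    def claim(self, worker_id, now):
        self.reap_expired(now)
        for job in self.jobs.values():
            if job.state == "pending":
                job.state = "leased"
                job.lease_owner = worker_id
                job.attempts += 1
                job.lease_expires_at = now + self.lease_seconds
                return job
        return None

    def _holds_lease(self, job_id, worker_id, now):
        job = self.jobs.get(job_id)
        ok = (job is not None and job.state == "leased"
              and job.lease_owner == worker_id and now < job.lease_expires_at)
        return job if ok else None

    def heartbeat(self, job_id, worker_id, now):
        # A heartbeat from the current lease holder extends the lease.
        job = self._holds_lease(job_id, worker_id, now)
        if job is None:
            return False
        job.lease_expires_at = now + self.lease_seconds
        return True

    def ack(self, job_id, worker_id, now):
        # Only the current lease holder may complete the job.
        job = self._holds_lease(job_id, worker_id, now)
        if job is None:
            return False
        job.state, job.lease_owner = "done", None
        return True
```

Rejecting heartbeats and acks from a worker that no longer holds the lease is what keeps a slow worker from overwriting the result of the worker that took over after expiry.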