
StellaOps Source & Job Orchestrator

The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.

Latest updates (2025-11-30)

  • OpenAPI discovery published at /.well-known/openapi with openapi/orchestrator.json; includes pagination/idempotency/error-envelope examples and version headers.
  • Legacy job detail/summary endpoints now emit Deprecation + Link headers pointing to the stable replacements.
  • Job leasing flows through the Task Runner bridge: allocations carry idempotency keys, lease durations, and retry hints; workers acknowledge via claim/ack and emit heartbeats.
  • Event envelopes remain interim pending ORCH-SVC-37-101; include provenance (tenant/project, job type, correlationId, task runner id) in all notifier events.
  • Authority orch:quota / orch:backfill scopes require reason/ticket audit fields; include them in runbooks and dashboard overrides.
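
The claim/ack and heartbeat flow mentioned above can be pictured with a small in-memory sketch. This is not the Orchestrator's implementation: `LeaseTable`, its method names, and the 30-second default are hypothetical, and the clock is passed in explicitly so the behaviour stays deterministic.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Lease:
    job_id: str
    worker_id: str
    lease_id: str
    expires_at: float


class LeaseTable:
    """Hypothetical in-memory stand-in for lease bookkeeping."""

    def __init__(self, lease_seconds=30.0):
        self.lease_seconds = lease_seconds
        self._leases = {}   # job_id -> active Lease
        self._claims = {}   # idempotency key -> Lease handed out for that key
        self.completed = set()

    def claim(self, job_id, worker_id, idempotency_key, now):
        # Replaying a claim with the same key returns the same live lease.
        previous = self._claims.get(idempotency_key)
        if previous is not None and previous.expires_at > now:
            return previous
        if job_id in self.completed:
            return None
        current = self._leases.get(job_id)
        if current is not None and current.expires_at > now:
            return None  # another worker holds a live lease
        lease = Lease(job_id, worker_id, uuid.uuid4().hex, now + self.lease_seconds)
        self._leases[job_id] = lease
        self._claims[idempotency_key] = lease
        return lease

    def heartbeat(self, lease_id, now):
        # A heartbeat extends a lease only while it is still live.
        for lease in self._leases.values():
            if lease.lease_id == lease_id and lease.expires_at > now:
                lease.expires_at = now + self.lease_seconds
                return True
        return False

    def ack(self, lease_id, now):
        # Acknowledging completes the job; stale or superseded leases are rejected.
        for job_id, lease in list(self._leases.items()):
            if lease.lease_id == lease_id and lease.expires_at > now:
                del self._leases[job_id]
                self.completed.add(job_id)
                return True
        return False
```

In this sketch a worker that stops sending heartbeats loses the job once `expires_at` passes, and a later `ack` with the old lease id is refused, which is one common way to keep retries from double-completing a job.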

Responsibilities

  • Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines.
  • Expose dashboards and APIs for throttling, replays, and failover.
  • Enforce rate limits, concurrency limits, and dependency chains across queues.
  • Stream structured events and audit logs for incident response.
  • Provide Task Runner bridge semantics (claim/ack, heartbeats, progress, artifacts, backfills) for Go/Python SDKs.
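
As an illustration of enforcing dependency chains together with a concurrency cap, the sketch below selects which pending jobs may start. The state names (`pending`, `running`, `succeeded`), the job names, and the function itself are assumptions made for the example, not the service's actual model.

```python
def runnable_jobs(dependencies, states, max_concurrency):
    """Pick pending jobs whose upstream jobs all succeeded, within the free slots.

    dependencies: job name -> list of upstream job names (hypothetical shape)
    states: job name -> "pending" | "running" | "succeeded" | "failed"
    """
    running = sum(1 for state in states.values() if state == "running")
    free_slots = max(0, max_concurrency - running)
    ready = sorted(
        job
        for job, upstream in dependencies.items()
        if states.get(job) == "pending"
        and all(states.get(dep) == "succeeded" for dep in upstream)
    )
    # Sorting keeps the selection deterministic when more jobs are ready than slots.
    return ready[:free_slots]


dependencies = {"fetch": [], "parse": ["fetch"], "merge": ["parse"], "export": ["fetch"]}
states = {"fetch": "succeeded", "parse": "running", "merge": "pending", "export": "pending"}
print(runnable_jobs(dependencies, states, max_concurrency=2))  # ['export']
```

Here `merge` stays blocked because `parse` has not succeeded yet, and only one slot is free because `parse` is already running.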

Key components

  • Orchestrator WebService (control plane).
  • Queue adapters (Valkey/NATS) and job ledger.
  • Console dashboard module and CLI integration for operators.

Integrations & dependencies

  • Authority for authN/Z on operational actions.
  • Telemetry stack for job metrics and alerts.
  • Scheduler/Concelier/Excititor workers for job lifecycle.
  • Offline Kit for state export/import during air-gap refreshes.

Operational notes

  • Job recovery runbooks and dashboard JSON as described in Epic 9.
  • Rate-limit and lease reconfiguration guidelines; keep lease defaults aligned across runners and SDKs (Go/Python).
  • Log streaming: SSE/WS endpoints carry correlationId + tenant/project; buffer size and retention must be documented in runbooks.
  • When using orch:quota / orch:backfill scopes, capture reason/ticket fields in runbooks and audit checklists.
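
To make the log-streaming note concrete, here is a minimal sketch of a bounded buffer that renders entries as Server-Sent Events frames carrying correlationId and tenant/project. The `LogBuffer` class, the `log` event name, and the payload field layout are assumptions; the real buffer size and retention are whatever the runbooks document.

```python
import json
from collections import deque


def sse_frame(event_id, event, payload):
    """Serialize one SSE frame; the trailing blank line terminates the frame."""
    data = json.dumps(payload, sort_keys=True)
    return f"id: {event_id}\nevent: {event}\ndata: {data}\n\n"


class LogBuffer:
    """Keeps only the newest max_events entries (hypothetical retention policy)."""

    def __init__(self, max_events):
        self._events = deque(maxlen=max_events)
        self._next_id = 1

    def append(self, tenant, project, correlation_id, message):
        payload = {
            "tenant": tenant,
            "project": project,
            "correlationId": correlation_id,
            "message": message,
        }
        self._events.append((self._next_id, payload))
        self._next_id += 1

    def replay(self, after_id=0):
        # A reconnecting client can pass the last id it saw to resume the stream.
        return [sse_frame(i, "log", p) for i, p in self._events if i > after_id]
```

A bounded `deque` drops the oldest entries silently, so a client that reconnects after a long gap may miss frames; that is the trade-off the buffer-size documentation needs to spell out.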

Implementation Status

Phase 1: Core service & job ledger (Complete)

  • PostgreSQL schema with sources, runs, jobs, artifacts, DAG edges, quotas, schedules, incidents
  • Lease manager with heartbeats, retries, dead-letter queues
  • Token-bucket rate limiter per tenant/source.host with adaptive refill
  • Watermark/backfill orchestration for event-time windows
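
A token bucket keyed by tenant and source host might look like the sketch below. The source does not say how the refill rate adapts, so the halve-on-throttle, step-up-on-success policy here is an assumption, as are all class and parameter names.

```python
class TokenBucket:
    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.updated_at = now

    def _refill(self, now):
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now

    def try_acquire(self, now, cost=1.0):
        self._refill(now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def on_throttled(self, now, factor=0.5, floor=0.1):
        # Assumed policy: cut the refill rate when the upstream answers 429.
        self._refill(now)
        self.refill_per_second = max(floor, self.refill_per_second * factor)

    def on_success(self, ceiling, step=0.1):
        # Assumed policy: recover slowly, never above the configured ceiling.
        self.refill_per_second = min(ceiling, self.refill_per_second + step)


class RateLimiter:
    """One bucket per (tenant, host) pair, created on first use."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._buckets = {}

    def bucket(self, tenant, host, now):
        key = (tenant, host)
        if key not in self._buckets:
            self._buckets[key] = TokenBucket(self.capacity, self.refill_per_second, now)
        return self._buckets[key]

    def allow(self, tenant, host, now):
        return self.bucket(tenant, host, now).try_acquire(now)
```

Keeping one bucket per pair means a throttled upstream for one tenant does not slow requests that other tenants send to the same host.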

Phase 2: Worker SDK & artifact registry (Complete)

  • Claim/heartbeat/report contract with deterministic artifact hashing
  • Idempotency enforcement and worker SDKs for .NET/Rust/Go agents
  • Integrated with Concelier, Excititor, SBOM Service, Policy Engine
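
Deterministic artifact hashing and idempotency enforcement can be sketched as follows, assuming artifacts are JSON-serializable and that SHA-256 over a canonical JSON encoding is acceptable. The actual hash scheme and registry interface are not specified in this document, so `artifact_digest` and `ArtifactRegistry` are illustrative names only.

```python
import hashlib
import json


def artifact_digest(payload):
    """Hash a canonical encoding so key order and whitespace cannot change the digest."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


class ArtifactRegistry:
    """Hypothetical registry: one digest per idempotency key."""

    def __init__(self):
        self._by_key = {}

    def report(self, idempotency_key, payload):
        digest = artifact_digest(payload)
        previous = self._by_key.get(idempotency_key)
        if previous is None:
            self._by_key[idempotency_key] = digest
            return digest
        if previous != digest:
            # Same key with different content signals a non-deterministic worker.
            raise ValueError("idempotency key reused with different content")
        return previous  # safe replay of an identical report
```

A retried report with identical content is accepted and returns the stored digest, while a retry that produces different bytes is rejected instead of silently overwriting the earlier artifact.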

Phase 3: Observability & dashboard (In Progress)

  • Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate
  • Error clustering for HTTP 429/5xx, schema mismatches, parse errors
  • SSE/WebSocket feeds for Console updates, Gantt timeline/DAG JSON
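
The error classes listed above could be grouped with something as small as the sketch below. The failure record shape (`source`, `status`, `kind`) and the class labels are assumptions chosen for the example.

```python
from collections import Counter


def failure_class(failure):
    """Map one failure record to a coarse class label (hypothetical labels)."""
    status = failure.get("status")
    if status == 429:
        return "http_429"
    if status is not None and 500 <= status <= 599:
        return "http_5xx"
    kind = failure.get("kind")
    if kind in ("schema_mismatch", "parse_error"):
        return kind
    return "other"


def cluster(failures):
    """Count failures per (source, class) pair."""
    return Counter((f["source"], failure_class(f)) for f in failures)


failures = [
    {"source": "vendor-a", "status": 429},
    {"source": "vendor-a", "status": 429},
    {"source": "vendor-a", "status": 503},
    {"source": "vendor-b", "kind": "parse_error"},
]
for (source, label), count in sorted(cluster(failures).items()):
    print(source, label, count)
```

Grouping by source and class keeps a burst of 429s from one vendor visible as a single cluster rather than as many unrelated job failures.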

Phase 4: Controls & resilience (Planned)

  • Pause/resume/throttle/retry/backfill tooling
  • Dead-letter review, circuit breakers, blackouts, backpressure handling
  • Automation hooks and control plane APIs
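
Since this phase is still planned, the following is only a sketch of the classic closed/open/half-open circuit breaker pattern, not a description of what the Orchestrator will ship. Threshold, cooldown, and method names are hypothetical.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold, cooldown_seconds):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        if self.state == "open":
            if now - self.opened_at >= self.cooldown_seconds:
                # Cooldown elapsed: let trial calls through.
                # This simplified version does not cap the number of probes.
                self.state = "half_open"
                return True
            return False
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self, now):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
```

A scheduler would call `allow` before dispatching a job to a source and report the outcome afterwards, so a failing upstream is left alone for the cooldown period instead of being retried continuously.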

Phase 5: Offline & compliance (Planned)

  • Deterministic audit bundles (jobs.jsonl, history.jsonl, throttles.jsonl)
  • Provenance manifests and offline replay scripts
  • Tenant isolation validation and secret redaction
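
One way to make a bundle file such as jobs.jsonl byte-for-byte reproducible is to sort records, sort keys, and fix the separators before hashing. The sketch below assumes each record has a `jobId` field and that a fixed set of key names marks secrets; both are illustrative assumptions. Signing is left out entirely.

```python
import hashlib
import json

SECRET_KEYS = {"token", "password", "apiKey"}  # hypothetical list of sensitive keys


def redact(value):
    """Replace values of sensitive keys, recursing through dicts and lists."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k in SECRET_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


def render_jsonl(records, sort_key="jobId"):
    """Render records as JSON Lines with a stable record and key order."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    lines = [
        json.dumps(redact(r), sort_keys=True, separators=(",", ":"))
        for r in ordered
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


def bundle_manifest(files):
    """Map each file name to the SHA-256 of its content (digests only, no signature)."""
    return {name: hashlib.sha256(content).hexdigest() for name, content in sorted(files.items())}
```

Because the output depends only on record content, two exports of the same job history produce identical bytes and therefore identical manifest digests, which is the property an offline replay needs in order to verify a bundle.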

Key Acceptance Criteria

  • Schedules all jobs under quotas, rate limits, and idempotency guarantees while preserving provenance
  • Console reflects real-time DAG status, queue depth, SLO burn rate
  • Observability stack exposes metrics, logs, traces, incidents for stuck jobs and throttling
  • Offline audit bundles reproduce job history deterministically with verified signatures

Technical Decisions & Risks

  • Backpressure/queue overload mitigated via adaptive token buckets, circuit breakers, dynamic concurrency
  • Upstream vendor throttles managed with visible state, automatic jitter and retry
  • Tenant leakage prevented through API/queue/storage filters, fuzz tests, redaction
  • Complex DAG errors handled with diagnostics, error clustering, partial replay tooling
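
The "automatic jitter and retry" mitigation can be illustrated with exponential backoff and full jitter. The base, cap, and the handling of a server-provided retry hint below are assumptions, not documented Orchestrator defaults.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, retry_after=None, rng=random):
    """Return a delay in seconds for the given zero-based retry attempt.

    The delay is drawn uniformly from [0, min(cap, base * 2**attempt)] so that
    many workers retrying at once do not synchronize. If the upstream supplied
    a retry hint (for example from a Retry-After header), the delay is never
    shorter than that hint.
    """
    ceiling = min(cap, base * (2 ** attempt))
    delay = rng.uniform(0.0, ceiling)
    if retry_after is not None:
        delay = max(delay, retry_after)
    return delay


rng = random.Random(7)  # seeded only to make the demonstration repeatable
for attempt in range(4):
    print(attempt, round(backoff_delay(attempt, rng=rng), 3))
```

Full jitter trades a slightly longer average wait for a much flatter request pattern against a vendor that is already throttling.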

Epic alignment

  • Epic 9: Source & Job Orchestrator Dashboard.
  • ORCH stories in ../../TASKS.md.