# StellaOps Source & Job Orchestrator
The Orchestrator schedules, observes, and recovers ingestion and analysis jobs across the StellaOps platform.
## Latest updates (2025-11-30)
- OpenAPI discovery published at `/.well-known/openapi` with `openapi/orchestrator.json`; includes pagination/idempotency/error-envelope examples and version headers.
- Legacy job detail/summary endpoints now emit `Deprecation` + `Link` headers pointing to the stable replacements.
- Job leasing flows through the Task Runner bridge: allocations carry idempotency keys, lease durations, and retry hints; workers acknowledge via claim/ack and emit heartbeats.
- Event envelopes remain interim pending ORCH-SVC-37-101; include provenance (tenant/project, job type, correlationId, task runner id) in all notifier events.
- Authority `orch:quota` / `orch:backfill` scopes require reason/ticket audit fields; include them in runbooks and dashboard overrides.
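
As a rough client-side illustration of the deprecation signalling noted above, the sketch below reads a response's `Deprecation` and `Link` headers and extracts a replacement URL. The `successor-version` relation type, the example paths, and the helper names are assumptions for illustration; the orchestrator's actual header values are not specified here, and the parser only handles simple `Link` values.

```python
import re

# Matches "<target>" followed by its parameters, up to the next "<".
_LINK = re.compile(r"<([^>]+)>([^<]*)")


def parse_links(value):
    """Parse a simple Link header value into (target, params) pairs."""
    links = []
    for target, raw_params in _LINK.findall(value):
        params = {}
        for part in raw_params.split(";"):
            name, sep, val = part.strip().rstrip(",").strip().partition("=")
            if sep:
                params[name.strip().lower()] = val.strip().strip('"')
        links.append((target, params))
    return links


def deprecation_notice(headers):
    """Return deprecation details for a response, or None if not deprecated."""
    lowered = {k.lower(): v for k, v in headers.items()}
    if "deprecation" not in lowered:
        return None
    links = parse_links(lowered.get("link", ""))
    # Assumption: the replacement is advertised with rel="successor-version".
    replacement = next(
        (target for target, params in links if params.get("rel") == "successor-version"),
        None,
    )
    return {"deprecation": lowered["deprecation"], "replacement": replacement}
```

A client could log the returned notice once per endpoint so operators see which callers still depend on legacy routes.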
## Responsibilities
- Track job state, throughput, and errors for Concelier, Excititor, Scheduler, and export pipelines.
- Expose dashboards and APIs for throttling, replays, and failover.
- Enforce rate limits, concurrency limits, and dependency chains across queues.
- Stream structured events and audit logs for incident response.
- Provide Task Runner bridge semantics (claim/ack, heartbeats, progress, artifacts, backfills) for Go/Python SDKs.
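
The claim/ack and heartbeat semantics listed above can be sketched with a small in-memory lease manager. This is a minimal sketch under stated assumptions, not the Task Runner bridge contract: the class, method names, and 30-second default are hypothetical, and a real implementation would persist leases in the job ledger.

```python
import time
import uuid
from dataclasses import dataclass


@dataclass
class Lease:
    job_id: str
    worker_id: str
    lease_id: str
    expires_at: float


class LeaseManager:
    """In-memory claim/heartbeat/ack sketch; all names are hypothetical."""

    def __init__(self, lease_seconds=30.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self.pending = []  # job ids waiting for a worker
        self.leases = {}   # job id -> active Lease
        self.done = set()  # acknowledged job ids

    def enqueue(self, job_id):
        self.pending.append(job_id)

    def _reap_expired(self):
        # Return jobs whose lease lapsed (missed heartbeats) to the queue.
        now = self.clock()
        for job_id, lease in list(self.leases.items()):
            if lease.expires_at <= now:
                del self.leases[job_id]
                self.pending.append(job_id)

    def claim(self, worker_id):
        """Hand the oldest pending job to a worker, or return None."""
        self._reap_expired()
        if not self.pending:
            return None
        job_id = self.pending.pop(0)
        lease = Lease(job_id, worker_id, uuid.uuid4().hex,
                      self.clock() + self.lease_seconds)
        self.leases[job_id] = lease
        return lease

    def heartbeat(self, lease):
        """Extend the lease; False means the worker no longer owns the job."""
        current = self.leases.get(lease.job_id)
        if (current is None or current.lease_id != lease.lease_id
                or current.expires_at <= self.clock()):
            return False
        current.expires_at = self.clock() + self.lease_seconds
        return True

    def ack(self, lease):
        """Mark the job done if this lease is still the registered one."""
        current = self.leases.get(lease.job_id)
        if current is None or current.lease_id != lease.lease_id:
            return False
        del self.leases[lease.job_id]
        self.done.add(lease.job_id)
        return True
```

Comparing `lease_id` on heartbeat and ack means a worker that lost its lease cannot complete a job that has since been handed to another worker.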
## Key components
- Orchestrator WebService (control plane).
- Queue adapters (Valkey/NATS) and job ledger.
- Console dashboard module and CLI integration for operators.
## Integrations & dependencies
- Authority for authN/Z on operational actions.
- Telemetry stack for job metrics and alerts.
- Scheduler/Concelier/Excititor workers for job lifecycle.
- Offline Kit for state export/import during air-gap refreshes.
## Operational notes
- Job recovery runbooks and dashboard JSON as described in Epic 9.
- Rate-limit and lease reconfiguration guidelines; keep lease defaults aligned across runners and SDKs (Go/Python).
- Log streaming: SSE/WS endpoints carry correlationId + tenant/project; buffer size and retention must be documented in runbooks.
- When using `orch:quota` / `orch:backfill` scopes, capture reason/ticket fields in runbooks and audit checklists.
## Implementation Status
### Phase 1: Core service & job ledger (Complete)
- PostgreSQL schema with sources, runs, jobs, artifacts, DAG edges, quotas, schedules, incidents
- Lease manager with heartbeats, retries, dead-letter queues
- Token-bucket rate limiter per tenant/source.host with adaptive refill
- Watermark/backfill orchestration for event-time windows
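
A token bucket keyed by tenant and source host, with a refill rate that adapts to upstream feedback, might look roughly like the sketch below. The halve-on-throttle and 10%-step recovery policy is an assumption chosen for illustration; the service's actual adaptation rule is not described here, and all class and method names are hypothetical.

```python
import time


class AdaptiveTokenBucket:
    """Token bucket whose refill rate shrinks on throttling and recovers on success."""

    def __init__(self, capacity, refill_per_sec, min_rate=0.1, clock=time.monotonic):
        self.capacity = float(capacity)
        self.max_rate = float(refill_per_sec)
        self.rate = float(refill_per_sec)
        self.min_rate = float(min_rate)
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def _refill(self):
        # Credit tokens for elapsed time at the current rate, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self, cost=1.0):
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def on_throttled(self):
        # Assumed policy: halve the refill rate when upstream pushes back.
        self._refill()
        self.rate = max(self.min_rate, self.rate / 2)

    def on_success(self):
        # Assumed policy: recover in steps of 10% of the configured rate.
        self._refill()
        self.rate = min(self.max_rate, self.rate + 0.1 * self.max_rate)


class RateLimiter:
    """Keeps one bucket per (tenant, source host) pair."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.buckets = {}

    def bucket(self, tenant, host):
        key = (tenant, host)
        if key not in self.buckets:
            self.buckets[key] = AdaptiveTokenBucket(
                self.capacity, self.refill_per_sec, clock=self.clock)
        return self.buckets[key]
```

Keying buckets by `(tenant, host)` keeps one tenant's burst against a source from consuming another tenant's budget for the same host.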
### Phase 2: Worker SDK & artifact registry (Complete)
- Claim/heartbeat/report contract with deterministic artifact hashing
- Idempotency enforcement and worker SDKs for .NET/Rust/Go agents
- Integrated with Concelier, Excititor, SBOM Service, Policy Engine
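
One common way to get deterministic artifact hashes and idempotent reports is to hash a canonical serialization and remember the result per idempotency key. The sketch below assumes canonical JSON with sorted keys and SHA-256; the actual hashing scheme and registry interface are not specified here, so every name is hypothetical.

```python
import hashlib
import json


def artifact_digest(artifact):
    """Hash a canonical JSON form so key order does not affect the digest."""
    canonical = json.dumps(
        artifact, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


class ArtifactRegistry:
    """Records one digest per idempotency key; replays return the original."""

    def __init__(self):
        self._by_key = {}

    def report(self, idempotency_key, artifact):
        """Return (digest, created); created is False for a replayed key."""
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key], False
        digest = artifact_digest(artifact)
        self._by_key[idempotency_key] = digest
        return digest, True
```

List order still matters in this scheme, so producers would need to emit lists in a stable order for digests to match across runs.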
### Phase 3: Observability & dashboard (In Progress)
- Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate
- Error clustering for HTTP 429/5xx, schema mismatches, parse errors
- SSE/WebSocket feeds for Console updates, Gantt timeline/DAG JSON
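
Error clustering along the failure classes named above can be approximated by bucketing failure records and counting them per source. This is an illustrative sketch only: the record fields (`source`, `http_status`, `kind`) and class labels are assumptions, not the service's schema.

```python
from collections import Counter


def failure_class(failure):
    """Map a failure record to a coarse class label."""
    status = failure.get("http_status")
    if status == 429:
        return "http_429"
    if status is not None and 500 <= status <= 599:
        return "http_5xx"
    kind = failure.get("kind")
    if kind in ("schema_mismatch", "parse_error"):
        return kind
    return "other"


def cluster_failures(failures):
    """Count failures per (source, failure class) pair."""
    return Counter(
        (f.get("source", "unknown"), failure_class(f)) for f in failures
    )
```

Calling `cluster_failures(records).most_common(5)` would surface the noisiest source and class pairs for a dashboard panel.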
### Phase 4: Controls & resilience (Planned)
- Pause/resume/throttle/retry/backfill tooling
- Dead-letter review, circuit breakers, blackouts, backpressure handling
- Automation hooks and control plane APIs
### Phase 5: Offline & compliance (Planned)
- Deterministic audit bundles (jobs.jsonl, history.jsonl, throttles.jsonl)
- Provenance manifests and offline replay scripts
- Tenant isolation validation and secret redaction
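
To make the idea of a deterministic audit bundle concrete, the sketch below serializes records into the three JSONL files named above with a stable sort and canonical JSON, then records a SHA-256 digest per file. The sort fields (`job_id`, `seq`) and the manifest shape are assumptions; signing and the real bundle layout are out of scope for this example.

```python
import hashlib
import json


def to_jsonl(records, sort_key):
    """Serialize records as canonical JSON lines in a stable order."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    lines = [json.dumps(r, sort_keys=True, separators=(",", ":")) for r in ordered]
    return ("\n".join(lines) + "\n").encode("utf-8") if lines else b""


def build_bundle(jobs, history, throttles):
    """Return (files, manifest): file bytes by name and a SHA-256 per file."""
    files = {
        "jobs.jsonl": to_jsonl(jobs, "job_id"),         # assumed sort field
        "history.jsonl": to_jsonl(history, "seq"),      # assumed sort field
        "throttles.jsonl": to_jsonl(throttles, "seq"),  # assumed sort field
    }
    manifest = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    return files, manifest
```

Because ordering and serialization are fixed, exporting the same records twice yields byte-identical files, which is what lets a detached signature over the manifest be verified offline.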
### Key Acceptance Criteria
- Schedules all jobs with quotas, rate limits, idempotency; preserves provenance
- Console reflects real-time DAG status, queue depth, SLO burn rate
- Observability stack exposes metrics, logs, traces, incidents for stuck jobs and throttling
- Offline audit bundles reproduce job history deterministically with verified signatures
### Technical Decisions & Risks
- Backpressure/queue overload mitigated via adaptive token buckets, circuit breakers, dynamic concurrency
- Upstream vendor throttles managed with visible state, automatic jitter and retry
- Tenant leakage prevented through API/queue/storage filters, fuzz tests, redaction
- Complex DAG errors handled with diagnostics, error clustering, partial replay tooling
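
The jitter-and-retry mitigation for upstream throttling could be sketched as exponential backoff with full jitter. The `Throttled` exception, the parameter defaults, and the choice of full jitter are assumptions for illustration rather than the orchestrator's actual retry policy.

```python
import random
import time


class Throttled(Exception):
    """Hypothetical signal that the upstream asked the caller to slow down."""


def call_with_retry(fn, max_attempts=5, base=1.0, cap=60.0,
                    sleep=time.sleep, rng=None):
    """Call fn, retrying on Throttled with capped, fully jittered backoff."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the throttle
            # Full jitter: wait a random time up to the exponential bound.
            sleep(rng.uniform(0.0, min(cap, base * 2 ** attempt)))
```

Randomizing the whole delay, rather than adding a small offset, spreads retries from many workers so they do not hit a recovering upstream at the same moment.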
## Epic alignment
- Epic 9: Source & Job Orchestrator Dashboard.
- ORCH stories in ../../TASKS.md.