
Orchestrator Overview (DOCS-ORCH-32-001)

Imposed rule: Work or tasks of this type on this component must also be applied everywhere else they apply.

Last updated: 2025-11-25

Mission & value

  • Coordinate deterministic job execution across StellaOps services (Policy, RiskEngine, VEX Lens, Export Center, Notify).
  • Provide reproducible DAG runs with tenant isolation, auditability, and Aggregation-Only Contract (AOC) alignment.
  • Stay sovereign/offline: all runners operate from bundled manifests and local queues; no external control plane.

Runtime shape

  • Services: Orchestrator WebService (API/UI), Worker (executors), Scheduler (timer-based triggers).
  • Queues: per-tenant work queues; FIFO with deterministic ordering and idempotency keys.
  • State: PostgreSQL for run metadata and DAG definitions; optional Valkey for locks/throttles; all scoped by tenant.
  • APIs: REST plus WebSocket for run status and streaming; admin endpoints require the orchestrator:admin scope plus the tenant header.
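
The queue behaviour described above (per-tenant FIFO, deterministic ordering, idempotency keys) can be modelled with a minimal in-memory sketch. The names Job, TenantQueues, enqueue and dequeue are hypothetical. This page does not describe how the real Worker queue is stored. A submission whose idempotency key was already accepted for that tenant is ignored, so the original submission keeps its place in line.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    tenant: str
    idempotency_key: str
    payload: str


class TenantQueues:
    """Hypothetical model: one FIFO queue per tenant, deduplicated by idempotency key."""

    def __init__(self) -> None:
        self._queues: dict[str, deque[Job]] = {}
        self._seen: dict[str, set[str]] = {}

    def enqueue(self, job: Job) -> bool:
        """Append the job to its tenant's queue; return False for a duplicate key."""
        seen = self._seen.setdefault(job.tenant, set())
        if job.idempotency_key in seen:
            return False  # duplicate submission: ignored, first copy keeps its position
        seen.add(job.idempotency_key)
        self._queues.setdefault(job.tenant, deque()).append(job)
        return True

    def dequeue(self, tenant: str) -> Job | None:
        """Pop the oldest job for this tenant only; other tenants' queues are untouched."""
        queue = self._queues.get(tenant)
        return queue.popleft() if queue else None
```

Keys are scoped per tenant. The same key submitted by two different tenants produces two independent jobs.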

AOC alignment

  • Orchestrator never derives policy/verdicts; it only executes declared DAG steps and records outcomes.
  • Inputs/outputs are append-only; runs are immutable with replay tokens.
  • No consensus logic; all decisions remain in owning services (Policy Engine, RiskEngine, etc.).

Determinism

  • Stable DAG evaluation order (topological with lexical tie-breaks).
  • Idempotency via run tokens and step hashes; retries preserve trace_id.
  • UTC timestamps; hashes lowercase hex; NDJSON exports ordered by (timestamp, dagId, runId).
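
The "topological with lexical tie-breaks" rule can be realised with Kahn's algorithm, keeping the ready set in a min-heap keyed on step id. Whenever several steps are unblocked at once, the lexically smallest one is emitted first. This is a sketch that assumes a DAG spec shaped as a mapping from step id to the ids it depends on. The function name and spec shape are hypothetical.

```python
from __future__ import annotations

import heapq


def deterministic_order(steps: dict[str, list[str]]) -> list[str]:
    """Return step ids in topological order, breaking ties lexically.

    `steps` maps each step id to the step ids it depends on (hypothetical shape).
    """
    # Count unmet dependencies and record reverse edges (dependency -> dependents).
    indegree = {step: 0 for step in steps}
    dependents: dict[str, list[str]] = {step: [] for step in steps}
    for step, deps in steps.items():
        for dep in deps:
            if dep not in steps:
                raise ValueError(f"{step} depends on unknown step {dep}")
            indegree[step] += 1
            dependents[dep].append(step)

    # Min-heap of ready steps: the lexically smallest ready step is always taken next.
    ready = [step for step, count in indegree.items() if count == 0]
    heapq.heapify(ready)
    order: list[str] = []
    while ready:
        step = heapq.heappop(ready)
        order.append(step)
        for child in dependents[step]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, child)

    # Any step never reaching indegree 0 sits on a cycle.
    if len(order) != len(steps):
        raise ValueError("DAG contains a cycle")
    return order
```

Take the spec {"fetch": [], "normalize": ["fetch"], "score": ["normalize"], "export": ["normalize"], "audit": []}. This yields audit, fetch, normalize, export, score, whatever order the keys were inserted in. Ties go through the heap rather than dictionary iteration order.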

Observability

  • Traces propagate traceparent/baggage through scheduler→worker→task.
  • Metrics: orchestrator_runs_total{tenant,status}, orchestrator_run_duration_seconds, orchestrator_queue_depth.
  • Logs: structured JSON, redacted, tagged with tenant, dagId, runId, status.
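
Propagation across the scheduler, worker and task hops follows the W3C Trace Context rule for the version-00 header. Each hop keeps the incoming trace-id and trace flags and substitutes its own span id as the new parent-id. This is what lets a retried step stay on the same trace_id. The sketch covers traceparent only, and baggage is omitted. The function name is hypothetical. In practice an OpenTelemetry propagator would normally perform this step.

```python
from __future__ import annotations

import re
import secrets

# W3C Trace Context, version 00: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>".
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(incoming: str) -> str:
    """Derive the traceparent a hop (e.g. worker -> task) forwards downstream.

    The trace-id and flags are preserved so the whole run stays one trace;
    only the parent-id changes, identifying the new span.
    """
    match = _TRACEPARENT.match(incoming)
    if match is None:
        raise ValueError(f"malformed traceparent: {incoming!r}")
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace-id or parent-id is invalid")
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 lowercase hex characters
    return f"00-{trace_id}-{new_span_id}-{flags}"
```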

Roles & responsibilities

  • Operator: manages DAG definitions, quotas, tenant allowlists, and SLOs.
  • Developer: defines DAG specs and task plugins; supplies offline bundles for execution.
  • Security: validates scopes, enforces AOC boundaries, reviews audit trails.

Offline posture

  • DAG specs and plugins shipped in offline bundles; runners load from local disk.
  • No outbound network access during execution unless a task explicitly declares an allowlisted endpoint.
  • Transparency: export run logs/traces/metrics as NDJSON for air-gapped review.
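
A reproducible NDJSON export needs two things. The records must follow the documented (timestamp, dagId, runId) ordering, and each line must be serialized byte-for-byte the same way every time. The sketch below assumes every record holds a UTC ISO-8601 timestamp in one fixed format (for example "2025-11-25T10:00:00Z"), so string order equals chronological order. The function name is hypothetical.

```python
import json


def to_ndjson(records: list) -> str:
    """Serialize run records as NDJSON ordered by (timestamp, dagId, runId).

    Assumes uniformly formatted UTC ISO-8601 timestamps, so sorting the
    strings sorts chronologically.
    """
    ordered = sorted(records, key=lambda r: (r["timestamp"], r["dagId"], r["runId"]))
    # sort_keys plus fixed separators make each line byte-for-byte reproducible.
    return "".join(
        json.dumps(r, sort_keys=True, separators=(",", ":"), ensure_ascii=False) + "\n"
        for r in ordered
    )
```

Sorting on the full tuple is what breaks ties between runs that share a timestamp. Two runs recorded in the same second always export in the same order, whatever order the source supplied them in.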

Safety & governance

  • Mandatory tenant header; cross-tenant DAGs forbidden.
  • Step sandboxing: resource limits per task; deny network by default.
  • Audit: every run records actor, tenant, DAG version, inputs hash, outputs hash, and rationale notes.
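
The audit fields above can be captured in an immutable record. Its input and output hashes are computed over a canonical JSON encoding and kept as lowercase hex, as the Determinism section requires. SHA-256, the canonical-JSON encoding, and all names here are assumptions for illustration; this page specifies neither the hash algorithm nor the record schema. Rejecting an empty tenant mirrors the mandatory tenant header.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass


def canonical_hash(value: object) -> str:
    """SHA-256 (assumed algorithm) over canonical JSON: sorted keys, compact separators."""
    encoded = json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()  # hexdigest() returns lowercase hex


@dataclass(frozen=True)  # frozen: a recorded audit entry cannot be mutated afterwards
class AuditRecord:
    """Hypothetical per-run audit entry; field names are illustrative only."""

    actor: str
    tenant: str
    dag_version: str
    inputs_hash: str
    outputs_hash: str
    rationale: str


def audit_run(actor: str, tenant: str, dag_version: str,
              inputs: object, outputs: object, rationale: str) -> AuditRecord:
    """Build the audit entry for one completed run."""
    if not tenant:
        raise ValueError("tenant is mandatory")  # mirrors the mandatory tenant header
    return AuditRecord(
        actor=actor,
        tenant=tenant,
        dag_version=dag_version,
        inputs_hash=canonical_hash(inputs),
        outputs_hash=canonical_hash(outputs),
        rationale=rationale,
    )
```

Because keys are sorted before hashing, two logically identical input documents hash the same even if their fields arrive in a different order.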