Orchestrator Architecture (DOCS-ORCH-32-002)

Last updated: 2025-11-25

Runtime components

  • WebService: REST + WebSocket API for DAG definitions, run status, and admin actions; issues idempotency tokens and enforces tenant isolation.
  • Scheduler: timer/cron runner that instantiates DAG runs from schedules; publishes run intents into per-tenant queues.
  • Worker: executes DAG steps; pulls from tenant queues, applies resource limits, and reports spans/metrics/logs.
  • Plugin host: task plugins (HTTP call, queue dispatch, CLI tool, script) loaded from signed bundles; execution is sandboxed with deny-by-default network.

Data model

  • DAG: directed acyclic graph whose steps run in topological order; ties between ready steps are broken lexicographically by step id for determinism.
  • Run: immutable record with runId, dagVersion, tenant, inputsHash, status, traceId, startedUtc, endedUtc.
  • Step execution: each step captures inputsHash, outputsHash, status, attempt, durationMs, logsRef, metricsRef.
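
The ordering rule for DAGs can be illustrated with a small sketch. It uses Kahn's algorithm with a min-heap so that, among steps whose dependencies are satisfied, the lexicographically smallest step id runs first. The function name `deterministic_order` and the `deps` mapping shape are assumptions made for illustration, not the Orchestrator's actual API.

```python
import heapq
from typing import Dict, Iterable, List, Set


def deterministic_order(deps: Dict[str, Iterable[str]]) -> List[str]:
    """Return step ids in topological order, breaking ties lexicographically.

    `deps` maps each step id to the ids it depends on (hypothetical shape).
    """
    # Collect every step id, including ones that only appear as dependencies.
    steps: Set[str] = set(deps)
    for parents in deps.values():
        steps.update(parents)

    indegree: Dict[str, int] = {s: 0 for s in steps}
    children: Dict[str, List[str]] = {s: [] for s in steps}
    for step, parents in deps.items():
        for parent in set(parents):  # ignore duplicate edges
            indegree[step] += 1
            children[parent].append(step)

    # A min-heap of ready steps always yields the smallest step id first.
    ready = [s for s, degree in indegree.items() if degree == 0]
    heapq.heapify(ready)

    order: List[str] = []
    while ready:
        step = heapq.heappop(ready)
        order.append(step)
        for child in children[step]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, child)

    if len(order) != len(steps):
        raise ValueError("graph contains a cycle; not a DAG")
    return order
```

Because the heap compares step ids as plain strings, the same DAG definition should produce the same order on every worker, regardless of dictionary insertion order.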

Execution flow

  1. Client or scheduler creates a run (idempotent on runToken, dagId, inputsHash).
  2. Scheduler enqueues run intent into tenant queue.
  3. Worker dequeues, reconstructs DAG ordering, and executes steps:
    • skip disabled steps;
    • apply per-step concurrency, retries, and backoff;
    • emit spans/metrics/logs with propagated traceparent.
  4. Results are persisted append-only; WebSocket pushes status to clients.
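
Two parts of this flow lend themselves to a short sketch: idempotent run creation keyed on (runToken, dagId, inputsHash), and per-step retries with backoff. Everything named here (`RunRegistry`, `backoff_delays`, `run_with_retries`) is hypothetical, the in-memory dictionary stands in for the real run store, and exponential backoff without jitter is an assumption, since the text only says "backoff".

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Run:
    run_id: str
    run_token: str
    dag_id: str
    inputs_hash: str


class RunRegistry:
    """In-memory stand-in for the run store (hypothetical)."""

    def __init__(self) -> None:
        self._runs: Dict[Tuple[str, str, str], Run] = {}

    def create_run(self, run_token: str, dag_id: str, inputs_hash: str) -> Tuple[Run, bool]:
        """Return (run, created); a repeated key returns the existing run."""
        key = (run_token, dag_id, inputs_hash)
        if key in self._runs:
            return self._runs[key], False
        run = Run(uuid.uuid4().hex, run_token, dag_id, inputs_hash)
        self._runs[key] = run
        return run, True


def backoff_delays(max_attempts: int, base_seconds: float = 1.0,
                   cap_seconds: float = 30.0) -> List[float]:
    """Delays between attempts: base * 2**n, capped (assumed policy, no jitter)."""
    return [min(cap_seconds, base_seconds * (2 ** n)) for n in range(max_attempts - 1)]


def run_with_retries(step: Callable[[], T], max_attempts: int = 3,
                     sleep: Callable[[float], None] = time.sleep) -> Tuple[T, int]:
    """Call `step` until it succeeds; return (result, attempt number)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts)
    attempt = 1
    while True:
        try:
            return step(), attempt
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the last error
            sleep(delays[attempt - 1])
            attempt += 1
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting; a production worker would presumably also record the attempt number on the step execution record.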

Storage & queues

  • Mongo stores DAG specs, versions, and run history (per-tenant collections or tenant key prefix).
  • Queues: Redis/Mongo-backed FIFO per tenant; message includes traceparent, runToken, dagVersion, inputsHash.
  • Artifacts (logs, outputs) referenced by content hash; stored in object storage or Mongo GridFS; hashes recorded in run record.
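
As a rough illustration of the storage ideas above, the sketch below keeps artifacts under their content hash and holds one FIFO queue per tenant. It is in-memory only; `ArtifactStore` and `TenantQueues` are hypothetical names, and SHA-256 is an assumed hash function (the text only specifies a content hash).

```python
import hashlib
import json
from collections import deque
from typing import Deque, Dict, Optional

REQUIRED_FIELDS = {"traceparent", "runToken", "dagVersion", "inputsHash"}


class ArtifactStore:
    """Content-addressed blob store: the key is the hash of the bytes."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()  # lowercase hex
        self._blobs.setdefault(digest, data)  # identical content is stored once
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]


class TenantQueues:
    """One FIFO queue per tenant; messages are serialized JSON objects."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[str]] = {}

    def enqueue(self, tenant: str, message: dict) -> None:
        missing = REQUIRED_FIELDS - message.keys()
        if missing:
            raise ValueError(f"message missing fields: {sorted(missing)}")
        self._queues.setdefault(tenant, deque()).append(json.dumps(message, sort_keys=True))

    def dequeue(self, tenant: str) -> Optional[dict]:
        queue = self._queues.get(tenant)
        if not queue:
            return None  # unknown tenant or empty queue
        return json.loads(queue.popleft())
```

Keeping a separate queue per tenant means a dequeue for one tenant can never return another tenant's run intent, which matches the isolation requirement stated elsewhere in this document.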

Security & AOC alignment

  • Mandatory X-Stella-Tenant; cross-tenant DAGs prohibited.
  • Scopes: orchestrator:read|write|admin; admin needed for DAG publish/delete.
  • AOC: Orchestrator only schedules/executes; no policy/severity decisions. Inputs/outputs immutable; runs replayable.
  • Sandboxing: per-step CPU/memory limits; network egress blocked unless step declares allowlist entry.
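
The checks above might look roughly like the following. This is a sketch under stated assumptions: the action names in `REQUIRED_SCOPE` are invented for illustration, only the admin requirement for DAG publish/delete comes from the text, and no scope is assumed to imply another.

```python
from typing import Iterable, Mapping
from urllib.parse import urlsplit

# Hypothetical action names. Only the admin requirement for DAG
# publish/delete is stated in the document; the rest is assumed.
REQUIRED_SCOPE = {
    "dag.publish": "orchestrator:admin",
    "dag.delete": "orchestrator:admin",
    "run.create": "orchestrator:write",  # assumed
    "run.read": "orchestrator:read",     # assumed
}


def require_tenant(headers: Mapping[str, str]) -> str:
    """Return the tenant from X-Stella-Tenant, or reject the request."""
    for name, value in headers.items():
        if name.lower() == "x-stella-tenant" and value.strip():
            return value.strip()
    raise PermissionError("missing X-Stella-Tenant header")


def is_authorized(granted_scopes: Iterable[str], action: str) -> bool:
    """Deny unknown actions; otherwise require the exact mapped scope."""
    required = REQUIRED_SCOPE.get(action)
    return required is not None and required in set(granted_scopes)


def egress_allowed(url: str, allowlist: Iterable[str]) -> bool:
    """Deny by default: allow only hosts the step explicitly declared."""
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if absent
    return host is not None and host in {h.lower() for h in allowlist}
```

For example, `egress_allowed("https://api.example.com/v1", [])` is false because the step declared no allowlist entries, which mirrors the deny-by-default posture.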

Determinism

  • Step ordering: topological + lexical tie-breaks.
  • Idempotency: runToken + inputsHash; retries reuse the same traceId; outputs are hashed (lowercase hex).
  • Timestamps UTC; NDJSON exports sorted by (startedUtc, dagId, runId).
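
A minimal sketch of how a stable inputsHash and a sorted NDJSON export could be produced. The canonicalization (sorted keys, compact separators, UTF-8) and the use of SHA-256 are assumptions, and the sort relies on `startedUtc` being an ISO-8601 UTC string in one uniform format so that string order matches time order.

```python
import hashlib
import json
from typing import Dict, List


def inputs_hash(inputs: Dict) -> str:
    """Hash a canonical JSON encoding so key order does not change the result."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()  # lowercase hex


def export_ndjson(runs: List[Dict]) -> str:
    """Emit one JSON object per line, sorted by (startedUtc, dagId, runId)."""
    ordered = sorted(runs, key=lambda r: (r["startedUtc"], r["dagId"], r["runId"]))
    return "".join(
        json.dumps(r, sort_keys=True, separators=(",", ":")) + "\n" for r in ordered
    )
```

With these choices, `inputs_hash({"a": 1, "b": 2})` and `inputs_hash({"b": 2, "a": 1})` return the same digest, and exporting the same set of runs twice yields byte-identical output.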

Offline posture

  • DAG specs and plugins shipped in signed offline bundles; no remote fetch.
  • Transparency: export runs/logs/metrics/traces as NDJSON for air-gapped audit.

Observability

  • Traces: spans named orchestrator.run and orchestrator.step, each carrying the attributes tenant, dagId, runId, stepId, status.
  • Metrics: orchestrator_runs_total{tenant,status}, orchestrator_run_duration_seconds, orchestrator_queue_depth, orchestrator_step_retries_total.
  • Logs: structured JSON, redacted, carrying trace_id, tenant, dagId, runId, stepId.
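
To show how trace context and log fields could fit together, the sketch below parses a W3C `traceparent` header, derives a child header for a step span, and builds a structured log line with the fields listed above. The helper names are hypothetical, and redaction is left out because the document does not define its rules.

```python
import json
import re
import secrets
from typing import Dict

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Dict[str, str]:
    match = _TRACEPARENT.match(header.strip())
    if not match:
        raise ValueError("malformed traceparent")
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version ff and all-zero ids as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("invalid traceparent value")
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(header: str) -> str:
    """Keep the trace id, mint a new 8-byte span id for the child span."""
    ctx = parse_traceparent(header)
    return f'{ctx["version"]}-{ctx["trace_id"]}-{secrets.token_hex(8)}-{ctx["flags"]}'


def step_log_line(header: str, tenant: str, dag_id: str, run_id: str,
                  step_id: str, message: str) -> str:
    """Build one structured JSON log line carrying the correlation fields."""
    record = {
        "trace_id": parse_traceparent(header)["trace_id"],
        "tenant": tenant,
        "dagId": dag_id,
        "runId": run_id,
        "stepId": step_id,
        "message": message,
    }
    return json.dumps(record, sort_keys=True)
```

Since the child header keeps the parent's trace id, spans and logs emitted by a worker can be joined back to the run that the WebService or Scheduler started.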

Governance & rollout

  • DAG publishing requires signature/owner metadata; versions immutable after publish.
  • Rollback: schedule a new version and disable the old one; existing runs stay immutable.
  • Upgrade path: workers hot-reload plugins from bundle catalog; scheduler is stateless.