Orchestrator Architecture (DOCS-ORCH-32-002)
Last updated: 2025-11-25
Runtime components
- WebService: REST + WebSocket API for DAG definitions, run status, and admin actions; issues idempotency tokens and enforces tenant isolation.
- Scheduler: timer/cron runner that instantiates DAG runs from schedules; publishes run intents into per-tenant queues.
- Worker: executes DAG steps; pulls from tenant queues, applies resource limits, and reports spans/metrics/logs.
- Plugin host: task plugins (HTTP call, queue dispatch, CLI tool, script) loaded from signed bundles; execution is sandboxed with deny-by-default network.
Data model
- DAG: directed acyclic graph with topological order; tie-break lexicographically by step id for determinism.
- Run: immutable record with `runId`, `dagVersion`, `tenant`, `inputsHash`, `status`, `traceId`, `startedUtc`, `endedUtc`.
- Step execution: each step captures `inputsHash`, `outputsHash`, `status`, `attempt`, `durationMs`, `logsRef`, `metricsRef`.
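The ordering rule can be sketched with Kahn's algorithm. A min-heap holds the ready steps, so whenever several steps are ready the lexically smallest step id runs first. This is a minimal sketch under assumptions: dependencies are given as hypothetical `(upstream, downstream)` pairs, and `deterministic_topo_order` is an illustrative name, not the Orchestrator's API.

```python
import heapq
from collections import defaultdict


def deterministic_topo_order(steps: list[str], edges: list[tuple[str, str]]) -> list[str]:
    """Kahn's algorithm with lexical tie-breaking on step id.

    edges are (upstream, downstream) pairs: upstream must run first.
    Raises ValueError if the graph has a cycle (i.e. is not a DAG).
    """
    indegree = {s: 0 for s in steps}
    children: dict[str, list[str]] = defaultdict(list)
    for up, down in edges:
        children[up].append(down)
        indegree[down] += 1

    # Min-heap of ready steps: popping always yields the smallest id.
    ready = [s for s, d in indegree.items() if d == 0]
    heapq.heapify(ready)
    order: list[str] = []
    while ready:
        step = heapq.heappop(ready)
        order.append(step)
        for child in children[step]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, child)

    if len(order) != len(steps):
        raise ValueError("cycle detected: step graph is not a DAG")
    return order
```

The heap always yields the smallest ready id, so the same DAG produces the same order on every worker, regardless of the order in which steps were declared.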
Execution flow
- Client or scheduler creates a run (idempotent on `runToken`, `dagId`, `inputsHash`).
- Scheduler enqueues the run intent into the tenant queue.
- Worker dequeues, reconstructs DAG ordering, and executes steps:
  - skip disabled steps;
  - apply per-step concurrency, retries, and backoff;
  - emit spans/metrics/logs with propagated `traceparent`.
- Results are persisted append-only; WebSocket pushes status to clients.
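A worker's per-step loop could look roughly like the sketch below. It skips disabled steps, retries with backoff, and keeps the run's `traceId` across attempts. The doc does not specify the backoff formula, the attempt limits, or what happens once a step exhausts its retries. The following are therefore all assumptions:

- exponential backoff delays;
- the `max_attempts` and `base_backoff_s` fields;
- stopping the run at the first failed step.

Per-step concurrency limits are left out. `Step`, `StepResult`, and `execute_steps` are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    # Hypothetical step shape; field names are illustrative, not the real spec.
    step_id: str
    action: Callable[[], str]
    enabled: bool = True
    max_attempts: int = 3
    base_backoff_s: float = 0.5


@dataclass(frozen=True)
class StepResult:
    step_id: str
    status: str    # "succeeded" | "failed" | "skipped"
    attempt: int   # attempts made (0 when skipped)
    trace_id: str  # every attempt reuses the run's traceId


def backoff_delay(attempt: int, base: float) -> float:
    """Assumed policy: base, 2*base, 4*base, ... after attempts 1, 2, 3, ..."""
    return base * (2 ** (attempt - 1))


def execute_steps(ordered: list[Step], trace_id: str,
                  sleep: Callable[[float], None] = time.sleep) -> list[StepResult]:
    results: list[StepResult] = []
    for step in ordered:
        if not step.enabled:
            results.append(StepResult(step.step_id, "skipped", 0, trace_id))
            continue
        status, attempt = "failed", 0
        for attempt in range(1, step.max_attempts + 1):
            try:
                step.action()
                status = "succeeded"
                break
            except Exception:
                # Back off before the next attempt; no sleep after the last one.
                if attempt < step.max_attempts:
                    sleep(backoff_delay(attempt, step.base_backoff_s))
        results.append(StepResult(step.step_id, status, attempt, trace_id))
        if status == "failed":
            break  # assumed: a failed step ends the run
    return results
```

Injecting `sleep` keeps the retry schedule testable without real waits.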
Storage & queues
- Mongo stores DAG specs, versions, and run history (per-tenant collections or tenant key prefix).
- Queues: Redis/Mongo-backed FIFO per tenant; each message includes `traceparent`, `runToken`, `dagVersion`, `inputsHash`.
- Artifacts (logs, outputs) are referenced by content hash and stored in object storage or Mongo GridFS; the hashes are recorded in the run record.
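The queue and artifact rules can be illustrated with in-memory stand-ins. This is a sketch only:

- Python deques and dicts take the place of Redis/Mongo and object storage/GridFS.
- The class names are hypothetical.
- SHA-256 is an assumed choice, because the doc says only "content hash".

```python
from __future__ import annotations

import hashlib
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class RunIntent:
    # Fields mirror the documented queue message; Python names are illustrative.
    tenant: str
    traceparent: str
    run_token: str
    dag_version: str
    inputs_hash: str


class TenantQueues:
    """In-memory stand-in for Redis/Mongo-backed FIFO queues, one per tenant."""

    def __init__(self) -> None:
        self._queues: dict[str, deque[RunIntent]] = defaultdict(deque)

    def enqueue(self, intent: RunIntent) -> None:
        self._queues[intent.tenant].append(intent)

    def dequeue(self, tenant: str) -> RunIntent | None:
        q = self._queues.get(tenant)  # .get() does not create empty queues
        return q.popleft() if q else None

    def depth(self, tenant: str) -> int:
        return len(self._queues.get(tenant, ()))


class ArtifactStore:
    """Content-addressed blob store keyed by SHA-256 (assumed hash)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()  # lowercase hex
        self._blobs.setdefault(digest, data)       # identical content stored once
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]
```

Keeping one queue per tenant means one tenant's backlog never delays another tenant's dequeue. Content addressing lets the run record hold only the digest.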
Security & AOC alignment
- Mandatory `X-Stella-Tenant` header; cross-tenant DAGs are prohibited.
- Scopes: `orchestrator:read|write|admin`; `admin` is required for DAG publish/delete.
- AOC: the Orchestrator only schedules and executes; it makes no policy/severity decisions. Inputs/outputs are immutable and runs are replayable.
- Sandboxing: per-step CPU/memory limits; network egress blocked unless step declares allowlist entry.
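A request-time check could combine the tenant header, the cross-tenant prohibition, and the scope rules, alongside a deny-by-default egress test. Only the `X-Stella-Tenant` header and the `orchestrator:*` scope strings come from this doc. The following are illustrative assumptions:

- the action names;
- the `Forbidden` exception;
- the one-scope-per-action mapping, with no hierarchy in which `admin` implies `write`.

```python
from urllib.parse import urlsplit

# Hypothetical action names; only the scope strings come from the doc.
REQUIRED_SCOPE = {
    "run.read": "orchestrator:read",
    "run.create": "orchestrator:write",
    "dag.publish": "orchestrator:admin",
    "dag.delete": "orchestrator:admin",
}


class Forbidden(Exception):
    pass


def authorize(headers: dict[str, str], scopes: set[str], action: str, dag_tenant: str) -> str:
    """Return the caller's tenant, or raise Forbidden."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    tenant = {k.lower(): v for k, v in headers.items()}.get("x-stella-tenant")
    if not tenant:
        raise Forbidden("missing X-Stella-Tenant header")
    if tenant != dag_tenant:
        raise Forbidden("cross-tenant access prohibited")
    required = REQUIRED_SCOPE.get(action)
    if required is None or required not in scopes:
        raise Forbidden(f"action {action!r} requires scope {required!r}")
    return tenant


def egress_allowed(url: str, allowlist: frozenset[str]) -> bool:
    """Deny by default: only hosts the step explicitly declared are reachable."""
    host = urlsplit(url).hostname
    return host is not None and host in allowlist
```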
Determinism
- Step ordering: topological + lexical tie-breaks.
- Idempotency: `runToken` + `inputsHash`; retries reuse the same `traceId`; outputs are hashed (lowercase hex).
- Timestamps are UTC; NDJSON exports are sorted by `(startedUtc, dagId, runId)`.
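One way to make `inputsHash` reproducible is to hash a canonical JSON encoding. Run creation can then be made idempotent by keying runs on `runToken`, `dagId`, and `inputsHash`, as in the execution flow. The canonicalization (sorted keys, compact separators, UTF-8), the use of SHA-256, and the UUID run ids are assumptions for illustration. `RunRegistry` is a hypothetical name.

```python
import hashlib
import json
import uuid


def inputs_hash(inputs: dict) -> str:
    """SHA-256 over canonical JSON; the canonicalization scheme is an assumption."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()  # lowercase hex


class RunRegistry:
    """Idempotent run creation: a repeated key returns the existing runId."""

    def __init__(self) -> None:
        self._runs: dict[tuple[str, str, str], str] = {}

    def create_run(self, run_token: str, dag_id: str, inputs: dict) -> tuple[str, bool]:
        key = (run_token, dag_id, inputs_hash(inputs))
        if key in self._runs:
            return self._runs[key], False  # replayed request: no new run
        run_id = str(uuid.uuid4())         # id scheme is illustrative
        self._runs[key] = run_id
        return run_id, True
```

Because key order is normalized before hashing, `{"a": 2, "b": 1}` and `{"b": 1, "a": 2}` produce the same `inputsHash`. A client retry therefore resolves to the original run.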
Offline posture
- DAG specs and plugins shipped in signed offline bundles; no remote fetch.
- Transparency: export runs/logs/metrics/traces as NDJSON for air-gapped audit.
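An air-gapped NDJSON export that follows the documented sort key `(startedUtc, dagId, runId)` might look like the sketch below. Two things are assumptions: the record shape (plain dicts with camelCase keys), and the requirement that every timestamp carry an explicit `Z` or UTC offset. Timestamps are parsed rather than compared as strings, so equivalent encodings such as `Z` and `+00:00` still sort together.

```python
import json
from datetime import datetime, timezone


def _utc(ts: str) -> datetime:
    # datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+,
    # so rewrite it as an explicit offset first.
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        raise ValueError(f"timestamp must be UTC with an explicit offset: {ts!r}")
    return dt.astimezone(timezone.utc)


def export_runs_ndjson(runs: list[dict]) -> str:
    """One JSON object per line, ordered by (startedUtc, dagId, runId)."""
    ordered = sorted(runs, key=lambda r: (_utc(r["startedUtc"]), r["dagId"], r["runId"]))
    # sort_keys + compact separators give byte-stable lines across exports.
    lines = [json.dumps(r, sort_keys=True, separators=(",", ":")) for r in ordered]
    return "\n".join(lines) + ("\n" if lines else "")
```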
Observability
- Traces: spans named `orchestrator.run` and `orchestrator.step`, with attributes `tenant`, `dagId`, `runId`, `stepId`, `status`.
- Metrics: `orchestrator_runs_total{tenant,status}`, `orchestrator_run_duration_seconds`, `orchestrator_queue_depth`, `orchestrator_step_retries_total`.
- Logs: structured JSON, redacted, carrying `trace_id`, `tenant`, `dagId`, `runId`, `stepId`.
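Propagating `traceparent` follows the W3C Trace Context format `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`. A child span keeps the trace-id and flags and mints a new parent-id. The sketch below shows that step, plus a structured log record carrying the documented fields. The redaction key list and the function names are hypothetical, and a deployment could instead rely on an OpenTelemetry SDK for both.

```python
import json
import re
import secrets

# Version-00 traceparent per W3C Trace Context (lowercase hex only).
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value: str) -> tuple[str, str, str]:
    m = _TRACEPARENT.fullmatch(value)
    # An all-zero trace-id or parent-id is invalid per the spec.
    if not m or m.group(1) == "0" * 32 or m.group(2) == "0" * 16:
        raise ValueError(f"invalid traceparent: {value!r}")
    return m.group(1), m.group(2), m.group(3)


def child_traceparent(parent: str) -> str:
    """Same trace-id and flags; fresh 8-byte parent-id for the child span."""
    trace_id, _, flags = parse_traceparent(parent)
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


REDACTED_KEYS = frozenset({"authorization", "password", "token"})  # hypothetical list


def log_line(traceparent: str, tenant: str, dag_id: str, run_id: str,
             step_id: str, msg: str, **extra: object) -> str:
    trace_id, _, _ = parse_traceparent(traceparent)
    record = {"trace_id": trace_id, "tenant": tenant, "dagId": dag_id,
              "runId": run_id, "stepId": step_id, "msg": msg}
    for key, value in extra.items():
        # Redact sensitive fields before they reach the log sink.
        record[key] = "[REDACTED]" if key.lower() in REDACTED_KEYS else value
    return json.dumps(record, sort_keys=True)
```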
Governance & rollout
- DAG publishing requires signature/owner metadata; versions immutable after publish.
- Rollback: schedule new version and disable old; runs stay immutable.
- Upgrade path: workers hot-reload plugins from bundle catalog; scheduler is stateless.
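The publish, immutability, and roll-forward rollback rules can be modeled with a small catalog. This is a hypothetical sketch: `DagCatalog` and `DagVersion` are illustrative names and integer versions are assumed. Signature verification against a trust root is out of scope; the code only checks that owner and signature metadata are present.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DagVersion:
    dag_id: str
    version: int
    spec_hash: str
    owner: str
    signature: str  # presence-checked only; cryptographic verification omitted


class DagCatalog:
    """Published versions are never mutated; rollback publishes forward."""

    def __init__(self) -> None:
        self._versions: dict[tuple[str, int], DagVersion] = {}
        self._active: dict[str, int] = {}

    def publish(self, v: DagVersion) -> None:
        if not v.owner or not v.signature:
            raise ValueError("publish requires owner and signature metadata")
        key = (v.dag_id, v.version)
        if key in self._versions:
            raise ValueError(f"{v.dag_id}@{v.version} already published; versions are immutable")
        self._versions[key] = v

    def activate(self, dag_id: str, version: int) -> None:
        """Schedule `version`; the previously active version is thereby disabled."""
        if (dag_id, version) not in self._versions:
            raise KeyError(f"{dag_id}@{version} not published")
        self._active[dag_id] = version

    def active(self, dag_id: str) -> DagVersion:
        return self._versions[(dag_id, self._active[dag_id])]
```

Rolling back from version 2 means publishing version 3 with version 1's spec and activating it. Versions 1 and 2, and any runs recorded against them, stay untouched.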