Orchestrator Architecture (DOCS-ORCH-32-002)
Last updated: 2025-11-25
Runtime components
- WebService: REST + WebSocket API for DAG definitions, run status, and admin actions; issues idempotency tokens and enforces tenant isolation.
- Scheduler: timer/cron runner that instantiates DAG runs from schedules; publishes run intents into per-tenant queues.
- Worker: executes DAG steps; pulls from tenant queues, applies resource limits, and reports spans/metrics/logs.
- Plugin host: task plugins (HTTP call, queue dispatch, CLI tool, script) loaded from signed bundles; execution is sandboxed with deny-by-default network.
Data model
- DAG: directed acyclic graph with topological order; tie-break lexicographically by step id for determinism.
- Run: immutable record with runId, dagVersion, tenant, inputsHash, status, traceId, startedUtc, endedUtc.
- Step execution: each step captures inputsHash, outputsHash, status, attempt, durationMs, logsRef, metricsRef.
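
The ordering rule for DAGs (topological order, ties broken lexicographically by step id) can be sketched with Kahn's algorithm and a min-heap of ready steps. This is a minimal illustration, not the orchestrator's implementation: the function name, the edge representation, and the use of Python's default string comparison for "lexicographic" are assumptions.

```python
import heapq


def deterministic_order(steps, edges):
    """Return step ids in topological order, breaking ties by step id.

    steps: iterable of step ids (strings).
    edges: iterable of (upstream, downstream) pairs.
    Both parameter shapes are hypothetical, chosen for illustration.
    """
    indegree = {s: 0 for s in steps}
    downstream = {s: [] for s in indegree}
    for up, down in edges:
        downstream[up].append(down)
        indegree[down] += 1

    # Min-heap of ready steps: the smallest step id is always taken first,
    # so the result does not depend on the input order of steps or edges.
    ready = [s for s, d in indegree.items() if d == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        step = heapq.heappop(ready)
        order.append(step)
        for nxt in downstream[step]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)

    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle; not a valid DAG")
    return order
```

Because ready steps are always popped smallest-first, two workers that reconstruct the ordering from the same DAG version get the same sequence.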
Execution flow
- Client or scheduler creates a run (idempotent on runToken, dagId, inputsHash).
- Scheduler enqueues run intent into tenant queue.
- Worker dequeues, reconstructs DAG ordering, and executes steps:
  - skip disabled steps;
  - apply per-step concurrency, retries, and backoff;
  - emit spans/metrics/logs with propagated traceparent.
- Results are persisted append-only; WebSocket pushes status to clients.
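
Idempotent run creation in the first step can be illustrated with an in-memory store keyed on (runToken, dagId, inputsHash). This is a sketch under stated assumptions: the `RunStore` class, SHA-256 over canonical JSON as the hash, the UUID run id, and the `queued` status value are all hypothetical and not taken from the orchestrator.

```python
import hashlib
import json
import uuid


class RunStore:
    """In-memory stand-in for the run table (hypothetical)."""

    def __init__(self):
        self._runs_by_key = {}

    def create_run(self, run_token, dag_id, inputs):
        """Return (run, created). A repeated request returns the existing run."""
        # Canonical JSON (sorted keys, fixed separators) so that equal inputs
        # always produce the same hash. SHA-256 is an assumption.
        canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
        inputs_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()

        key = (run_token, dag_id, inputs_hash)
        existing = self._runs_by_key.get(key)
        if existing is not None:
            return existing, False

        run = {
            "runId": str(uuid.uuid4()),
            "dagId": dag_id,
            "runToken": run_token,
            "inputsHash": inputs_hash,
            "status": "queued",  # hypothetical status value
        }
        self._runs_by_key[key] = run
        return run, True
```

A client that retries the same request after a timeout receives the original run rather than starting a second one.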
Storage & queues
- PostgreSQL stores DAG specs, versions, and run history (per-tenant tables or tenant key prefix).
- Queues: Redis/PostgreSQL-backed FIFO per tenant; message includes traceparent, runToken, dagVersion, inputsHash.
- Artifacts (logs, outputs) referenced by content hash; stored in object storage or PostgreSQL large objects; hashes recorded in run record.
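
The per-tenant FIFO and its message shape can be sketched in memory as follows. The message fields come from the list above; the `RunIntent` and `TenantQueues` names and the use of `collections.deque` in place of Redis or PostgreSQL are assumptions made for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class RunIntent:
    """Queue message; field names follow the document, the type is hypothetical."""
    traceparent: str
    runToken: str
    dagVersion: str
    inputsHash: str


class TenantQueues:
    """One FIFO per tenant, so one tenant's backlog never reorders another's."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def enqueue(self, tenant, intent):
        self._queues[tenant].append(intent)

    def dequeue(self, tenant):
        # .get() avoids creating an empty queue for an unknown tenant.
        queue = self._queues.get(tenant)
        return queue.popleft() if queue else None

    def depth(self, tenant):
        return len(self._queues.get(tenant, ()))
```

Keeping a separate queue per tenant means a worker only ever sees messages for the tenant it asked for.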
Security & AOC alignment
- Mandatory X-Stella-Tenant; cross-tenant DAGs prohibited.
- Scopes: orchestrator:read|write|admin; admin needed for DAG publish/delete.
- AOC: Orchestrator only schedules/executes; no policy/severity decisions. Inputs/outputs immutable; runs replayable.
- Sandboxing: per-step CPU/memory limits; network egress blocked unless step declares allowlist entry.
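
The tenant, scope, and egress rules above can be sketched as two small checks. Only the X-Stella-Tenant name, the three scope names, and the rule that publish/delete need admin come from the list; the action names, the mapping of other actions to scopes, and both function signatures are assumptions.

```python
# Action-to-scope mapping. Only the two admin entries are stated in the
# document; the write/read entries are assumptions for illustration.
REQUIRED_SCOPE = {
    "dag.publish": "orchestrator:admin",
    "dag.delete": "orchestrator:admin",
    "run.create": "orchestrator:write",
    "run.read": "orchestrator:read",
}


def authorize(headers, granted_scopes, action, resource_tenant):
    """Return (allowed, reason) for a request against a tenant-owned resource.

    headers is a plain dict here, so the lookup is case-sensitive; a real
    HTTP stack would normally treat header names case-insensitively.
    """
    tenant = headers.get("X-Stella-Tenant")
    if not tenant:
        return False, "missing X-Stella-Tenant"
    if tenant != resource_tenant:
        return False, "cross-tenant access prohibited"
    required = REQUIRED_SCOPE.get(action)
    if required is None:
        return False, "unknown action"
    if required not in granted_scopes:
        return False, "missing scope"
    return True, "ok"


def egress_allowed(host, allowlist):
    """Deny by default: a step may only reach hosts it declared."""
    return host in allowlist
```

Both checks fail closed: an absent header, an unknown action, or an empty allowlist results in a denial.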
Determinism
- Step ordering: topological + lexical tie-breaks.
- Idempotency: runToken + inputsHash; retries reuse the same traceId; outputs hashed (lowercase hex).
- Timestamps UTC; NDJSON exports sorted by (startedUtc, dagId, runId).
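
A deterministic NDJSON export with the sort key above might look like the following sketch. It assumes that startedUtc is stored as an ISO-8601 UTC string in one fixed format, so that string order matches chronological order; the function name and the choice of sorted keys within each JSON object are also assumptions.

```python
import json


def export_runs_ndjson(runs):
    """Serialize run records as NDJSON, ordered by (startedUtc, dagId, runId).

    Assumes every startedUtc uses the same fixed-width UTC format, for
    example 2025-11-25T10:00:00Z, so lexical order equals time order.
    """
    ordered = sorted(runs, key=lambda r: (r["startedUtc"], r["dagId"], r["runId"]))
    # Sorted keys and fixed separators make each line byte-stable as well.
    return "".join(
        json.dumps(run, sort_keys=True, separators=(",", ":")) + "\n"
        for run in ordered
    )
```

Two exports of the same run set then produce byte-identical files, which is useful when comparing audit bundles.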
Offline posture
- DAG specs and plugins shipped in signed offline bundles; no remote fetch.
- Transparency: export runs/logs/metrics/traces as NDJSON for air-gapped audit.
Observability
- Traces: spans named orchestrator.run, orchestrator.step with attributes tenant, dagId, runId, stepId, status.
- Metrics: orchestrator_runs_total{tenant,status}, orchestrator_run_duration_seconds, orchestrator_queue_depth, orchestrator_step_retries_total.
- Logs: structured JSON, redacted, carrying trace_id, tenant, dagId, runId, stepId.
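
A structured log line carrying the correlation fields above could be produced as in this sketch. The five context field names come from the list; the `format_log` function, the `details` sub-object, the set of redacted keys, and the `[REDACTED]` placeholder are assumptions, since the document does not say what is redacted or how.

```python
import json

CONTEXT_FIELDS = ("trace_id", "tenant", "dagId", "runId", "stepId")

# Keys whose values are masked. This set is an assumption for illustration.
REDACTED_KEYS = {"password", "token", "secret"}


def format_log(message, context, details=None):
    """Return one JSON log line with correlation fields and redacted details."""
    entry = {"message": message}
    for field in CONTEXT_FIELDS:
        # Missing context is emitted as null so every line has the same shape.
        entry[field] = context.get(field)
    entry["details"] = {
        key: "[REDACTED]" if key.lower() in REDACTED_KEYS else value
        for key, value in (details or {}).items()
    }
    return json.dumps(entry, sort_keys=True)
```

Emitting every context field on every line, even when null, lets log queries filter on trace_id or runId without special-casing absent keys.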
Governance & rollout
- DAG publishing requires signature/owner metadata; versions immutable after publish.
- Rollback: schedule new version and disable old; runs stay immutable.
- Upgrade path: workers hot-reload plugins from bundle catalog; scheduler is stateless.