up
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Signals CI & Image / signals-ci (push) Has been cancelled
Policy Lint & Smoke / policy-lint (push) Has been cancelled
Policy Simulation / policy-simulate (push) Has been cancelled
SDK Publish & Sign / sdk-publish (push) Has been cancelled
AOC Guard CI / aoc-guard (push) Has been cancelled
AOC Guard CI / aoc-verify (push) Has been cancelled
Concelier Attestation Tests / attestation-tests (push) Has been cancelled
devportal-offline / build-offline (push) Has been cancelled
44
docs/orchestrator/api.md
Normal file
@@ -0,0 +1,44 @@
# Orchestrator API (DOCS-ORCH-33-001)

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

Last updated: 2025-11-25

## Scope & headers
- Base path: `/api/v1/orchestrator`.
- Required headers: `Authorization: Bearer <token>` and `X-Stella-Tenant`; `Idempotency-Key` is also required on POSTs that mutate state. `traceparent` is recommended.
- Error envelope: see `docs/api/overview.md` (`code`/`message`/`trace_id`).
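
As a minimal sketch of the header rules above, the helper below assembles headers for one request. The function name `build_headers`, its parameters, and the fallback to a random UUID for `Idempotency-Key` are illustrative assumptions, not part of the documented API.

```python
import uuid
from typing import Optional


def build_headers(token: str, tenant: str, *, mutating: bool = False,
                  idempotency_key: Optional[str] = None,
                  traceparent: Optional[str] = None) -> dict:
    """Assemble the headers listed above for one orchestrator request."""
    if not token or not tenant:
        raise ValueError("token and tenant are both required")
    headers = {
        "Authorization": f"Bearer {token}",
        "X-Stella-Tenant": tenant,
    }
    if traceparent:
        # Recommended, not required: forward the caller's trace context.
        headers["traceparent"] = traceparent
    if mutating:
        # Assumption: a random UUID is an acceptable key when none is supplied.
        headers["Idempotency-Key"] = idempotency_key or str(uuid.uuid4())
    return headers
```

Reusing the same `idempotency_key` when retrying a failed POST is what makes the retry safe; generating a fresh key per attempt would defeat the purpose.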

## DAG management
- `POST /dags`: create/publish a DAG version. Body includes `dagId`, `version`, `steps[]`, `edges[]`, `metadata`, `signature`.
- `GET /dags`: list DAGs (stable sort by `dagId`, then `version` DESC). Filters: `dagId`, `active=true|false`.
- `GET /dags/{dagId}/{version}`: fetch a DAG definition.
- `POST /dags/{dagId}/{version}:disable`: disable a version (requires `orchestrator:admin`).

## Runs
- `POST /runs`: start a run; accepts `dagId`, optional `version`, `inputs` (object), `runToken` (idempotency). Returns `runId`, `traceId`.
- `GET /runs`: list runs with filters `dagId`, `status`, `from`, `to`. Sort: `startedUtc` DESC, then `runId`.
- `GET /runs/{runId}`: run detail with step states and hashes.
- `POST /runs/{runId}:cancel`: request cancellation (best-effort, idempotent).
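
The `GET /runs` ordering (`startedUtc` DESC, then `runId`) can be reproduced client-side as sketched below. This assumes that every `startedUtc` value uses one fixed-width ISO-8601 UTC format, so string comparison matches chronological order, and that `runId` ties are broken in ascending order; the helper name `sort_runs` is hypothetical.

```python
def sort_runs(runs):
    """Order runs by startedUtc descending, breaking ties by runId ascending."""
    # Two stable passes: secondary key first, then the primary key.
    # Assumption: startedUtc strings share one fixed-width UTC format.
    by_run_id = sorted(runs, key=lambda r: r["runId"])
    return sorted(by_run_id, key=lambda r: r["startedUtc"], reverse=True)
```

Python's sort stays stable with `reverse=True`, so runs with the same `startedUtc` keep the `runId` order from the first pass.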

## Steps & artifacts
- `GET /runs/{runId}/steps`: list step executions.
- `GET /runs/{runId}/steps/{stepId}`: step detail, including `attempts[]`, `logsRef`, `outputsHash`.
- `GET /artifacts/{hash}`: retrieve an artifact by content hash (only if the requesting tenant owns it).

## WebSocket stream
- `GET /runs/stream?dagId=&status=`: the server sends NDJSON events `run.started`, `run.updated`, `step.updated`, `run.completed`, `run.failed`, `run.cancelled`. Each event carries the fields `tenant`, `dagId`, `runId`, `status`, `timestamp`, `traceId`.
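
A consumer of the stream might decode events as in this sketch. The text above does not say which field carries the event name, so the `type` key used here is an assumption, as are the helper names.

```python
import json

# Event names taken from the list above that end a run.
TERMINAL_EVENTS = {"run.completed", "run.failed", "run.cancelled"}


def parse_events(lines):
    """Yield one dict per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def terminal_run_ids(lines):
    """Return runIds that received a terminal event, in arrival order."""
    # Assumption: the event name lives in a "type" field (hypothetical).
    return [e["runId"] for e in parse_events(lines)
            if e.get("type") in TERMINAL_EVENTS]
```

Because NDJSON is one JSON document per line, the same parser works on a saved export file and on frames read from the live stream.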

## Admin/ops
- `POST /admin/warm`: warm caches for DAGs/plugins (optional).
- `GET /admin/health`: liveness/readiness; includes queue depth per tenant.
- `GET /admin/metrics`: Prometheus scrape endpoint.

## Determinism & offline posture
- All list endpoints have deterministic ordering; pagination via `page_token`/`page_size`.
- No remote fetches; DAGs/plugins must be preloaded. Exports are available as NDJSON with stable ordering.
- Hashes are lowercase hex; timestamps are UTC ISO-8601.
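
Token-based pagination can be walked with a small generator like the sketch below. Only `page_token` and `page_size` come from the text above; the response fields `items` and `next_page_token`, and the caller-supplied `fetch_page` callable, are assumptions made for illustration.

```python
def iter_items(fetch_page, page_size=100):
    """Yield every item from a paginated list endpoint, in server order."""
    token = None
    while True:
        # fetch_page is supplied by the caller and performs the HTTP request.
        page = fetch_page(page_token=token, page_size=page_size)
        # Assumption: results arrive under "items" (hypothetical field name).
        yield from page["items"]
        # Assumption: a missing or empty "next_page_token" marks the last page.
        token = page.get("next_page_token")
        if not token:
            return
```

Since the server ordering is deterministic, concatenating pages this way yields the same sequence on every pass over unchanged data.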

## Security
- Scopes: `orchestrator:read`, `orchestrator:write`, `orchestrator:admin` (publish/disable DAGs, cache warm).
- Tenant isolation is enforced on every path; cross-tenant access is forbidden.
53
docs/orchestrator/architecture.md
Normal file
@@ -0,0 +1,53 @@
# Orchestrator Architecture (DOCS-ORCH-32-002)

Last updated: 2025-11-25

## Runtime components
- **WebService**: REST + WebSocket API for DAG definitions, run status, and admin actions; issues idempotency tokens and enforces tenant isolation.
- **Scheduler**: timer/cron runner that instantiates DAG runs from schedules; publishes run intents into per-tenant queues.
- **Worker**: executes DAG steps; pulls from tenant queues, applies resource limits, and reports spans/metrics/logs.
- **Plugin host**: task plugins (HTTP call, queue dispatch, CLI tool, script) loaded from signed bundles; execution is sandboxed with deny-by-default network.

## Data model
- **DAG**: directed acyclic graph executed in topological order; ties are broken lexicographically by step id for determinism.
- **Run**: immutable record with `runId`, `dagVersion`, `tenant`, `inputsHash`, `status`, `traceId`, `startedUtc`, `endedUtc`.
- **Step execution**: each step captures `inputsHash`, `outputsHash`, `status`, `attempt`, `durationMs`, `logsRef`, `metricsRef`.
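
The DAG ordering rule (topological, with lexicographic tie-breaks on step id) can be sketched with Kahn's algorithm and a min-heap. This is one way to obtain such an order, not necessarily how the Worker implements it; the function name and the `(src, dst)` edge representation are assumptions.

```python
import heapq


def deterministic_order(steps, edges):
    """Kahn's algorithm with a min-heap so ready steps emerge in lexical order."""
    indegree = {s: 0 for s in steps}
    children = {s: [] for s in steps}
    for src, dst in edges:  # assumption: an edge (src, dst) means src runs first
        children[src].append(dst)
        indegree[dst] += 1
    ready = [s for s, d in indegree.items() if d == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        step = heapq.heappop(ready)  # smallest step id among the ready steps
        order.append(step)
        for child in children[step]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, child)
    if len(order) != len(indegree):
        raise ValueError("cycle detected: not a DAG")
    return order
```

The heap is what makes the result independent of the order in which steps and edges were declared, which a plain queue would not guarantee.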

## Execution flow
1) Client or scheduler creates a run (idempotent on `runToken`, `dagId`, `inputsHash`).
2) Scheduler enqueues the run intent into the tenant queue.
3) Worker dequeues, reconstructs DAG ordering, and executes steps:
   - skip disabled steps;
   - apply per-step concurrency, retries, and backoff;
   - emit spans/metrics/logs with propagated `traceparent`.
4) Results are persisted append-only; WebSocket pushes status to clients.
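
Step 1 of the flow above, idempotent run creation, might look like the following in-memory sketch. The text does not define how `inputsHash` is computed, so SHA-256 over canonical JSON is an assumption, as are the `RunRegistry` class and the use of UUIDs for `runId`.

```python
import hashlib
import json
import uuid


def inputs_hash(inputs):
    """Hash canonical JSON (sorted keys, no extra whitespace) to lowercase hex."""
    # Assumption: SHA-256 over canonical JSON; the source names no algorithm.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RunRegistry:
    """In-memory stand-in for the run store (hypothetical)."""

    def __init__(self):
        self._runs = {}

    def create_run(self, run_token, dag_id, inputs):
        """Return the existing runId for a repeated request, else mint a new one."""
        key = (run_token, dag_id, inputs_hash(inputs))
        if key not in self._runs:
            self._runs[key] = str(uuid.uuid4())
        return self._runs[key]
```

Canonicalizing the inputs before hashing means two requests that differ only in JSON key order resolve to the same run.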

## Storage & queues
- Mongo stores DAG specs, versions, and run history (per-tenant collections or tenant key prefix).
- Queues: Redis/Mongo-backed FIFO per tenant; each message includes `traceparent`, `runToken`, `dagVersion`, `inputsHash`.
- Artifacts (logs, outputs) are referenced by content hash and stored in object storage or Mongo GridFS; hashes are recorded in the run record.
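
Content-addressed artifact storage with a tenant ownership check can be sketched as below. The hash algorithm (SHA-256) and the `ArtifactStore` class are assumptions for illustration; the real backing store is object storage or GridFS as stated above.

```python
import hashlib


class ArtifactStore:
    """In-memory stand-in for a content-addressed artifact store (hypothetical)."""

    def __init__(self):
        self._blobs = {}   # content hash -> bytes
        self._owners = {}  # content hash -> set of owning tenants

    def put(self, tenant, data: bytes) -> str:
        """Store data under its content hash and record the owning tenant."""
        # Assumption: SHA-256; hexdigest() already returns lowercase hex.
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        self._owners.setdefault(digest, set()).add(tenant)
        return digest

    def get(self, tenant, digest: str) -> bytes:
        """Return the artifact only if this tenant owns it."""
        if tenant not in self._owners.get(digest, set()):
            raise PermissionError("artifact not owned by tenant")
        return self._blobs[digest]
```

Identical content stored by two tenants shares one blob but keeps separate ownership entries, so deduplication does not weaken tenant isolation.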

## Security & AOC alignment
- Mandatory `X-Stella-Tenant`; cross-tenant DAGs prohibited.
- Scopes: `orchestrator:read|write|admin`; admin needed for DAG publish/disable.
- AOC: Orchestrator only schedules/executes; no policy/severity decisions. Inputs/outputs immutable; runs replayable.
- Sandboxing: per-step CPU/memory limits; network egress blocked unless the step declares an allowlist entry.

## Determinism
- Step ordering: topological + lexical tie-breaks.
- Idempotency: `runToken` + `inputsHash`; retries reuse the same `traceId`; outputs hashed (lowercase hex).
- Timestamps UTC; NDJSON exports sorted by `(startedUtc, dagId, runId)`.

## Offline posture
- DAG specs and plugins shipped in signed offline bundles; no remote fetch.
- Transparency: export runs/logs/metrics/traces as NDJSON for air-gapped audit.

## Observability
- Traces: spans named `orchestrator.run`, `orchestrator.step` with attributes `tenant`, `dagId`, `runId`, `stepId`, `status`.
- Metrics: `orchestrator_runs_total{tenant,status}`, `orchestrator_run_duration_seconds`, `orchestrator_queue_depth`, `orchestrator_step_retries_total`.
- Logs: structured JSON, redacted, carrying `trace_id`, `tenant`, `dagId`, `runId`, `stepId`.
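
To connect the propagated `traceparent` with the `trace_id` carried in logs, the sketch below extracts the trace id from a W3C Trace Context header and emits one structured log line. The helper names and the extra `message` field are assumptions; redaction is out of scope here.

```python
import json
import re

# W3C traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def trace_id_from(traceparent):
    """Return the 32-hex-character trace id from a traceparent header."""
    match = _TRACEPARENT.match(traceparent)
    if not match:
        raise ValueError("malformed traceparent")
    return match.group(2)


def log_line(traceparent, tenant, dag_id, run_id, step_id, message):
    """Serialize one log record with the fields listed above, keys sorted."""
    record = {
        "trace_id": trace_id_from(traceparent),
        "tenant": tenant,
        "dagId": dag_id,
        "runId": run_id,
        "stepId": step_id,
        "message": message,  # assumption: free-text field, not in the source list
    }
    return json.dumps(record, sort_keys=True)
```

Sorting the keys keeps log lines byte-stable for identical records, which fits the determinism goals stated elsewhere in this document.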

## Governance & rollout
- DAG publishing requires signature/owner metadata; versions are immutable after publish.
- Rollback: schedule a new version and disable the old one; runs stay immutable.
- Upgrade path: workers hot-reload plugins from the bundle catalog; the scheduler is stateless.
35
docs/orchestrator/cli.md
Normal file
@@ -0,0 +1,35 @@
# Orchestrator CLI (DOCS-ORCH-33-003)

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

Last updated: 2025-11-25

## Commands
- `stella orch dag list`: list DAGs (stable order by `dagId`, then `version` DESC). Flags: `--dag-id`, `--active`.
- `stella orch dag publish --file dag.yaml --signature sig.dsse`: publish a DAG version (idempotent on signature).
- `stella orch dag disable --dag-id <id> --version <ver>`: disable a version.
- `stella orch run start --dag-id <id> [--version <ver>] --inputs inputs.json [--run-token <uuid>]`: start a run.
- `stella orch run list [--dag-id <id>] [--status running|completed|failed|cancelled] [--from ISO] [--to ISO]`: list runs.
- `stella orch run cancel --run-id <id>`: request cancellation.
- `stella orch run logs --run-id <id> [--step-id <step>]`: fetch logs/artifacts (tenant scoped).
- `stella orch run stream --dag-id <id>`: stream NDJSON run events (matches the WebSocket feed).

## Global flags
- `--tenant <id>` (required), `--api-url`, `--token`, `--traceparent`, `--output json|table`, `--page-size`, `--page-token`.

## Determinism & offline
- Client-side sorting mirrors the API's order exactly; table output uses a fixed column order.
- Works offline against a local WebService; no external downloads.
- All timestamps are printed in UTC; hashes are lowercase hex.

## Exit codes
- `0` success; `1` validation/HTTP error; `2` auth/tenant missing; `3` cancellation rejected.
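
Scripts that wrap the CLI may want readable messages for these exit codes. The lookup below is a minimal sketch; the names `EXIT_MEANINGS` and `describe_exit` are hypothetical, and codes outside the documented set are reported as unknown instead of being guessed.

```python
# Meanings copied from the exit code list above.
EXIT_MEANINGS = {
    0: "success",
    1: "validation/HTTP error",
    2: "auth/tenant missing",
    3: "cancellation rejected",
}


def describe_exit(code: int) -> str:
    """Translate a CLI exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")
```

A wrapper would typically pass the `returncode` from `subprocess.run([...])` into `describe_exit` when logging a failed invocation.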

## Examples
```bash
# Start a run with idempotency token
stella orch run start --dag-id policy-refresh --inputs inputs.json --run-token 3e2b3d2e-1f21-4c2d-9a9d-123456789abc --tenant acme

# Stream run updates
stella orch run stream --dag-id policy-refresh --tenant acme --output json
```
33
docs/orchestrator/console.md
Normal file
@@ -0,0 +1,33 @@
# Orchestrator Console (DOCS-ORCH-33-002)

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

Last updated: 2025-11-25

## Views
- **Run list**: deterministic table sorted by `startedUtc` DESC, then `runId`; filters by `dagId`, `status`, `owner`, and time range.
- **Run detail**: step graph in topological order; shows status, attempts, duration, logs link, outputs hash.
- **DAG catalog**: shows published versions with signatures and enable/disable state.
- **Queue health**: per-tenant queue depth/age, retry counts, worker availability.

## Actions
- Start run (select DAG/version, supply inputs JSON, optional run token).
- Cancel run (best-effort).
- Download artifacts/logs (tenant-scoped).
- Stream live updates (WebSocket) for selected DAGs/runs.

## Accessibility & UX
- Keyboard shortcuts: `f` focus filter, `r` refresh, `s` start run dialog.
- All timestamps are UTC; durations show raw milliseconds in a tooltip.
- Color palette meets WCAG AA; status badges have icons + text.
- Loading states are deterministic; no infinite spinners: show “No data” with a retry option.

## Determinism & offline
- Client-side sorting mirrors API order; pagination uses stable `page_token`.
- Console operates against a local WebService; no external CDNs; fonts bundled.
- Exports (runs, steps) are available as NDJSON for air-gapped audits.

## Safety
- Tenant is enforced via session; cross-tenant DAGs are hidden.
- No raw secrets displayed; logs are redacted server-side.
- Run cancellation asks for confirmation and records the rationale for audit.
46
docs/orchestrator/overview.md
Normal file
@@ -0,0 +1,46 @@
# Orchestrator Overview (DOCS-ORCH-32-001)

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

Last updated: 2025-11-25

## Mission & value
- Coordinate deterministic job execution across StellaOps services (Policy, RiskEngine, VEX Lens, Export Center, Notify).
- Provide reproducible DAG runs with tenant isolation, auditability, and Aggregation-Only Contract (AOC) alignment.
- Stay sovereign/offline: all runners operate from bundled manifests and local queues; no external control plane.

## Runtime shape
- **Services**: Orchestrator WebService (API/UI), Worker (executors), Scheduler (timer-based triggers).
- **Queues**: per-tenant work queues; FIFO with deterministic ordering and idempotency keys.
- **State**: Mongo for run metadata and DAG definitions; optional Redis for locks/throttles; all scoped by tenant.
- **APIs**: REST + WebSocket for run status/stream; admin endpoints require `orchestrator:admin` plus tenant header.

## AOC alignment
- Orchestrator never derives policy/verdicts; it only executes declared DAG steps and records outcomes.
- Inputs/outputs are append-only; runs are immutable with replay tokens.
- No consensus logic; all decisions remain in owning services (Policy Engine, RiskEngine, etc.).

## Determinism
- Stable DAG evaluation order (topological with lexical tie-breaks).
- Idempotency via run tokens and step hashes; retries preserve `trace_id`.
- UTC timestamps; hashes lowercase hex; NDJSON exports ordered by `(timestamp, dagId, runId)`.

## Observability
- Traces propagate `traceparent`/`baggage` through scheduler→worker→task.
- Metrics: `orchestrator_runs_total{tenant,status}`, `orchestrator_run_duration_seconds`, `orchestrator_queue_depth`.
- Logs: structured JSON, redacted, tagged with `tenant`, `dagId`, `runId`, `status`.

## Roles & responsibilities
- **Operator**: manages DAG definitions, quotas, tenant allowlists, SLOs.
- **Developer**: defines DAG specs and task plugins; supplies offline bundles for execution.
- **Security**: validates scopes, enforces AOC boundaries, reviews audit trails.

## Offline posture
- DAG specs and plugins shipped in offline bundles; runners load from local disk.
- No outbound network during execution unless a task explicitly declares an allowlisted endpoint.
- Transparency: export run logs/traces/metrics as NDJSON for air-gapped review.

## Safety & governance
- Mandatory tenant header; cross-tenant DAGs forbidden.
- Step sandboxing: resource limits per task; deny network by default.
- Audit: every run records actor, tenant, DAG version, inputs hash, outputs hash, and rationale notes.
36
docs/orchestrator/run-ledger.md
Normal file
@@ -0,0 +1,36 @@
# Orchestrator Run Ledger (DOCS-ORCH-34-001)

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

Last updated: 2025-11-25

## Purpose
Immutable record of every DAG run and step execution for audit, replay, and offline export.

## Record schema (conceptual)
- `tenant`, `runId`, `dagId`, `dagVersion`, `runToken`, `traceId`
- `status` (`running|completed|failed|cancelled`)
- `inputsHash`, `outputsHash` (overall)
- `startedUtc`, `endedUtc`, `durationMs`
- `steps[]`:
  - `stepId`, `status`, `attempt`, `startedUtc`, `endedUtc`, `durationMs`
  - `inputsHash`, `outputsHash`, `logsRef`, `metricsRef`, `errorCode`, `retryable`
- `events[]` (optional): ordered list of significant events with `timestamp`, `type`, `message`, `actor`

## Storage
- Mongo collection partitioned by tenant; indexes on `(tenant, dagId, runId)`, `(tenant, status, startedUtc)`.
- Artifacts/logs referenced by content hash; stored separately (object storage/GridFS).
- Append-only updates; run status transitions are monotonic.
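
Monotonic status transitions can be enforced with a small guard such as the sketch below. It assumes that a run starts in `running` and that `completed`, `failed`, and `cancelled` are final, which is one reading of the status list above; `append_status` is a hypothetical helper.

```python
# Assumption: these three statuses are final; a run never leaves them.
TERMINAL = {"completed", "failed", "cancelled"}


def append_status(history, new_status):
    """Return history plus new_status, refusing backward or repeated transitions."""
    if not history:
        if new_status != "running":
            raise ValueError("a run must start in 'running'")
        return [new_status]
    current = history[-1]
    if current in TERMINAL:
        raise ValueError(f"run already ended as {current!r}")
    if new_status not in TERMINAL:
        raise ValueError(f"cannot move from {current!r} to {new_status!r}")
    # Append-only: build a new list and leave the caller's history untouched.
    return history + [new_status]
```

Returning a new list instead of mutating the input mirrors the append-only stance of the ledger: earlier history is never rewritten.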

## Exports
- NDJSON export sorted by `startedUtc`, then `runId`; includes steps/events inline.
- Exports include a manifest with hash and count for determinism.
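
A deterministic export with its manifest might be produced as in this sketch. The manifest field names (`count`, `sha256`), the choice of SHA-256, and the compact sorted-key JSON encoding are assumptions; only the sort order and the presence of a hash and count come from the text above.

```python
import hashlib
import json


def export_runs(runs):
    """Return (ndjson_text, manifest) with records in startedUtc, runId order."""
    ordered = sorted(runs, key=lambda r: (r["startedUtc"], r["runId"]))
    # Sorted keys and fixed separators keep each line byte-stable.
    body = "".join(
        json.dumps(r, sort_keys=True, separators=(",", ":")) + "\n"
        for r in ordered
    )
    manifest = {
        "count": len(ordered),
        # Assumption: SHA-256 over the UTF-8 bytes of the NDJSON body.
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
    return body, manifest
```

An auditor can recompute the hash over the received file and compare it with the manifest to confirm that nothing was added, dropped, or reordered.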

## Observability
- Metrics derived from the ledger: run counts, durations, failure rates, retry counts.
- Trace links preserved via stored `traceId`.

## Governance
- Runs are never mutated or deleted; cancellation is recorded as an event.
- Access is tenant-scoped; admin queries require `orchestrator:admin`.
- Replay tokens can be derived from `inputsHash` + `dagVersion`; consumers must log rationale when replaying.