up
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Signals CI & Image / signals-ci (push) Has been cancelled
Policy Lint & Smoke / policy-lint (push) Has been cancelled
Policy Simulation / policy-simulate (push) Has been cancelled
SDK Publish & Sign / sdk-publish (push) Has been cancelled
AOC Guard CI / aoc-guard (push) Has been cancelled
AOC Guard CI / aoc-verify (push) Has been cancelled
Concelier Attestation Tests / attestation-tests (push) Has been cancelled
devportal-offline / build-offline (push) Has been cancelled

This commit is contained in:
StellaOps Bot
2025-11-25 22:09:44 +02:00
parent 6bee1fdcf5
commit 9f6e6f7fb3
116 changed files with 4495 additions and 730 deletions

44
docs/orchestrator/api.md Normal file

@@ -0,0 +1,44 @@
# Orchestrator API (DOCS-ORCH-33-001)
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Last updated: 2025-11-25
## Scope & headers
- Base path: `/api/v1/orchestrator`.
- Required headers: `Authorization: Bearer <token>`, `X-Stella-Tenant`, `traceparent` (recommended), `Idempotency-Key` for POSTs that mutate state.
- Error envelope: see `docs/api/overview.md` (code/message/trace_id).
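A minimal sketch of assembling the headers listed above. The helper name `build_headers` is hypothetical, and using a random UUID as the `Idempotency-Key` value is an assumption; the source does not specify the key format.

```python
import uuid
from typing import Dict, Optional


def build_headers(token: str, tenant: str, *, mutating: bool = False,
                  traceparent: Optional[str] = None) -> Dict[str, str]:
    """Assemble the request headers for one Orchestrator API call."""
    headers = {
        "Authorization": f"Bearer {token}",
        "X-Stella-Tenant": tenant,
    }
    if traceparent is not None:
        # Recommended, not required.
        headers["traceparent"] = traceparent
    if mutating:
        # Assumption: any unique string is acceptable as an idempotency key.
        headers["Idempotency-Key"] = str(uuid.uuid4())
    return headers


print(build_headers("example-token", "acme", mutating=True))
```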
## DAG management
- `POST /dags` — create/publish DAG version. Body includes `dagId`, `version`, `steps[]`, `edges[]`, `metadata`, `signature`.
- `GET /dags` — list DAGs (stable sort by `dagId`, then `version` DESC). Filters: `dagId`, `active=true|false`.
- `GET /dags/{dagId}/{version}` — fetch DAG definition.
- `POST /dags/{dagId}/{version}:disable` — disable a version (requires `orchestrator:admin`).
## Runs
- `POST /runs` — start a run; accepts `dagId`, optional `version`, `inputs` (object), `runToken` (idempotency). Returns `runId`, `traceId`.
- `GET /runs` — list runs with filters `dagId`, `status`, `from`, `to`. Sort: `startedUtc` DESC, then `runId`.
- `GET /runs/{runId}` — run detail with step states and hashes.
- `POST /runs/{runId}:cancel` — request cancellation (best-effort, idempotent).
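As an illustration of `POST /runs`, the sketch below builds the request without sending it. The host and port in `BASE_URL` and the function name are hypothetical; only the body fields `dagId`, `version`, `inputs`, and `runToken` come from the list above.

```python
import json
import urllib.request

# Hypothetical local endpoint; only the path suffix comes from this document.
BASE_URL = "http://localhost:8080/api/v1/orchestrator"


def start_run_request(dag_id, inputs, run_token, headers, version=None):
    """Build (but do not send) a POST /runs request."""
    body = {"dagId": dag_id, "inputs": inputs, "runToken": run_token}
    if version is not None:
        body["version"] = version
    all_headers = dict(headers)
    all_headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        url=f"{BASE_URL}/runs",
        data=json.dumps(body, sort_keys=True).encode("utf-8"),
        method="POST",
        headers=all_headers,
    )


req = start_run_request(
    "policy-refresh",
    {"scope": "all"},
    "3e2b3d2e-1f21-4c2d-9a9d-123456789abc",
    {"Authorization": "Bearer example-token", "X-Stella-Tenant": "acme"},
)
print(req.get_method(), req.full_url)
```

Reusing the same `runToken` on a retry is what makes the call safe to repeat; the optional `version` is omitted from the body when not supplied.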
## Steps & artifacts
- `GET /runs/{runId}/steps` — list step executions.
- `GET /runs/{runId}/steps/{stepId}` — step detail, including `attempts[]`, `logsRef`, `outputsHash`.
- `GET /artifacts/{hash}` — retrieve artifact by content hash (if tenant owns it).
## WebSocket stream
- `GET /runs/stream?dagId=&status=` — server sends NDJSON events: `run.started`, `run.updated`, `step.updated`, `run.completed`, `run.failed`, `run.cancelled`. Fields: `tenant`, `dagId`, `runId`, `status`, `timestamp`, `traceId`.
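A minimal sketch of consuming the NDJSON feed once lines have been received. It assumes the event name is carried in a field called `type`; the source lists the event names and the other fields but does not name that field.

```python
import json

TERMINAL_EVENTS = {"run.completed", "run.failed", "run.cancelled"}


def parse_events(lines):
    """Yield one dict per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def final_status(lines):
    """Map runId -> status taken from each run's terminal event."""
    result = {}
    for event in parse_events(lines):
        # Assumption: the event name lives in a "type" field.
        if event.get("type") in TERMINAL_EVENTS:
            result[event["runId"]] = event["status"]
    return result


sample = [
    '{"type": "run.started", "tenant": "acme", "dagId": "policy-refresh", '
    '"runId": "r1", "status": "running", "timestamp": "2025-11-25T20:09:44Z", "traceId": "t1"}',
    "",
    '{"type": "run.completed", "tenant": "acme", "dagId": "policy-refresh", '
    '"runId": "r1", "status": "completed", "timestamp": "2025-11-25T20:10:02Z", "traceId": "t1"}',
]
print(final_status(sample))
```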
## Admin/ops
- `POST /admin/warm` — warm caches for DAGs/plugins (optional).
- `GET /admin/health` — liveness/readiness; includes queue depth per tenant.
- `GET /admin/metrics` — Prometheus scrape endpoint.
## Determinism & offline posture
- All list endpoints have deterministic ordering; pagination via `page_token`/`page_size`.
- No remote fetches; DAGs/plugins must be preloaded. Exports available as NDJSON with stable ordering.
- Hashes lowercase hex; timestamps UTC ISO-8601.
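The pagination contract can be sketched as a loop that follows page tokens until none is returned. The response field names `items` and `next_page_token` are assumptions, and `fake_fetch` is an in-memory stand-in for a list endpoint, not the real service.

```python
def iter_pages(fetch_page, page_size=50):
    """Yield every item from a paginated list endpoint, in server order."""
    token = None
    while True:
        page = fetch_page(page_token=token, page_size=page_size)
        yield from page["items"]
        token = page.get("next_page_token")
        if not token:
            break


# In-memory stand-in for a list endpoint with deterministic ordering.
RUNS = [f"run-{i:03d}" for i in range(7)]


def fake_fetch(page_token=None, page_size=50):
    start = int(page_token or 0)
    end = start + page_size
    return {
        "items": RUNS[start:end],
        "next_page_token": str(end) if end < len(RUNS) else None,
    }


print(list(iter_pages(fake_fetch, page_size=3)))
```

Because ordering is deterministic, concatenating pages reproduces the full list regardless of `page_size`.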
## Security
- Scopes: `orchestrator:read`, `orchestrator:write`, `orchestrator:admin` (publish/disable DAGs, cache warm).
- Tenant isolation enforced on every path; cross-tenant access forbidden.


@@ -0,0 +1,53 @@
# Orchestrator Architecture (DOCS-ORCH-32-002)
Last updated: 2025-11-25
## Runtime components
- **WebService**: REST + WebSocket API for DAG definitions, run status, and admin actions; issues idempotency tokens and enforces tenant isolation.
- **Scheduler**: timer/cron runner that instantiates DAG runs from schedules; publishes run intents into per-tenant queues.
- **Worker**: executes DAG steps; pulls from tenant queues, applies resource limits, and reports spans/metrics/logs.
- **Plugin host**: task plugins (HTTP call, queue dispatch, CLI tool, script) loaded from signed bundles; execution is sandboxed with deny-by-default network.
## Data model
- **DAG**: directed acyclic graph with topological order; tie-break lexicographically by step id for determinism.
- **Run**: immutable record with `runId`, `dagVersion`, `tenant`, `inputsHash`, `status`, `traceId`, `startedUtc`, `endedUtc`.
- **Step execution**: each step captures `inputsHash`, `outputsHash`, `status`, `attempt`, `durationMs`, `logsRef`, `metricsRef`.
## Execution flow
1) Client or scheduler creates a run (idempotent on `runToken`, `dagId`, `inputsHash`).
2) Scheduler enqueues run intent into tenant queue.
3) Worker dequeues, reconstructs DAG ordering, and executes steps:
   - skip disabled steps;
   - apply per-step concurrency, retries, and backoff;
   - emit spans/metrics/logs with propagated `traceparent`.
4) Results are persisted append-only; WebSocket pushes status to clients.
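Step 1 of the flow above can be illustrated with an in-memory registry keyed on the idempotency tuple. `RunRegistry` and the `runId` format are hypothetical; a real service would persist this mapping rather than hold it in a dict.

```python
import uuid


class RunRegistry:
    """In-memory stand-in for run creation keyed on the idempotency tuple."""

    def __init__(self):
        self._by_key = {}

    def create_run(self, run_token, dag_id, inputs_hash):
        """Return the existing runId for a repeated request, else create one."""
        key = (run_token, dag_id, inputs_hash)
        if key not in self._by_key:
            # Hypothetical runId format; the source does not specify one.
            self._by_key[key] = uuid.uuid4().hex
        return self._by_key[key]


registry = RunRegistry()
first = registry.create_run("token-1", "policy-refresh", "ab12")
again = registry.create_run("token-1", "policy-refresh", "ab12")
print(first == again)
```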
## Storage & queues
- Mongo stores DAG specs, versions, and run history (per-tenant collections or tenant key prefix).
- Queues: Redis/Mongo-backed FIFO per tenant; message includes `traceparent`, `runToken`, `dagVersion`, `inputsHash`.
- Artifacts (logs, outputs) referenced by content hash; stored in object storage or Mongo GridFS; hashes recorded in run record.
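A sketch of content addressing for artifacts and inputs. SHA-256 and the canonical JSON form (sorted keys, compact separators) are assumptions; the source only states that hashes are lowercase hex.

```python
import hashlib
import json


def content_hash(data: bytes) -> str:
    """Hash raw bytes; hexdigest() already returns lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def inputs_hash(inputs: dict) -> str:
    """Hash a canonical JSON encoding so key order does not change the result."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return content_hash(canonical.encode("utf-8"))


print(inputs_hash({"b": 2, "a": 1}) == inputs_hash({"a": 1, "b": 2}))
```

Canonicalizing before hashing matters here: without it, two semantically identical input objects could produce different hashes and defeat idempotency checks.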
## Security & AOC alignment
- Mandatory `X-Stella-Tenant`; cross-tenant DAGs prohibited.
- Scopes: `orchestrator:read|write|admin`; admin needed for DAG publish/disable.
- AOC: Orchestrator only schedules/executes; no policy/severity decisions. Inputs/outputs immutable; runs replayable.
- Sandboxing: per-step CPU/memory limits; network egress blocked unless step declares allowlist entry.
## Determinism
- Step ordering: topological + lexical tie-breaks.
- Idempotency: `runToken` + `inputsHash`; retries reuse same `traceId`; outputs hashed (lowercase hex).
- Timestamps UTC; NDJSON exports sorted by `(startedUtc, dagId, runId)`.
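The step-ordering rule can be sketched with Kahn's algorithm and a min-heap, so that whenever several steps are ready the lexicographically smallest step id runs first. The function name and the sample step ids are illustrative, not taken from the Worker implementation.

```python
import heapq


def deterministic_order(steps, edges):
    """Topological order; ties among ready steps break by step id."""
    indegree = {s: 0 for s in steps}
    children = {s: [] for s in steps}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    ready = [s for s, d in indegree.items() if d == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        step = heapq.heappop(ready)  # smallest step id among ready steps
        order.append(step)
        for child in children[step]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, child)
    if len(order) != len(steps):
        raise ValueError("graph contains a cycle")
    return order


print(deterministic_order(
    ["fetch", "build", "audit", "publish"],
    [("fetch", "build"), ("fetch", "audit"), ("build", "publish"), ("audit", "publish")],
))
# ['fetch', 'audit', 'build', 'publish']
```

The result does not depend on the order in which `steps` or `edges` are listed, which is what makes replays reproducible.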
## Offline posture
- DAG specs and plugins shipped in signed offline bundles; no remote fetch.
- Transparency: export runs/logs/metrics/traces as NDJSON for air-gapped audit.
## Observability
- Traces: spans named `orchestrator.run`, `orchestrator.step` with attributes `tenant`, `dagId`, `runId`, `stepId`, `status`.
- Metrics: `orchestrator_runs_total{tenant,status}`, `orchestrator_run_duration_seconds`, `orchestrator_queue_depth`, `orchestrator_step_retries_total`.
- Logs: structured JSON, redacted, carrying `trace_id`, `tenant`, `dagId`, `runId`, `stepId`.
## Governance & rollout
- DAG publishing requires signature/owner metadata; versions immutable after publish.
- Rollback: publish a new version and disable the old one; existing runs stay immutable.
- Upgrade path: workers hot-reload plugins from bundle catalog; scheduler is stateless.

35
docs/orchestrator/cli.md Normal file

@@ -0,0 +1,35 @@
# Orchestrator CLI (DOCS-ORCH-33-003)
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Last updated: 2025-11-25
## Commands
- `stella orch dag list` — list DAGs (stable order by `dagId`, `version` DESC). Flags: `--dag-id`, `--active`.
- `stella orch dag publish --file dag.yaml --signature sig.dsse` — publish DAG version (idempotent on signature).
- `stella orch dag disable --dag-id <id> --version <ver>` — disable version.
- `stella orch run start --dag-id <id> [--version <ver>] --inputs inputs.json [--run-token <uuid>]` — start run.
- `stella orch run list [--dag-id <id>] [--status running|completed|failed|cancelled] [--from ISO] [--to ISO]` — list runs.
- `stella orch run cancel --run-id <id>` — request cancellation.
- `stella orch run logs --run-id <id> [--step-id <step>]` — fetch logs/artifacts (tenant scoped).
- `stella orch run stream --dag-id <id>` — stream NDJSON run events (matches WebSocket feed).
## Global flags
- `--tenant <id>` (required), `--api-url`, `--token`, `--traceparent`, `--output json|table`, `--page-size`, `--page-token`.
## Determinism & offline
- Client-side sorting matches the API's ordering exactly; table output uses a fixed column order.
- Works offline against local WebService; no external downloads.
- All timestamps are printed in UTC; hashes are lowercase hex.
## Exit codes
- `0` success; `1` validation/HTTP error; `2` auth/tenant missing; `3` cancellation rejected.
## Examples
```bash
# Start a run with idempotency token
stella orch run start --dag-id policy-refresh --inputs inputs.json --run-token 3e2b3d2e-1f21-4c2d-9a9d-123456789abc --tenant acme
# Stream run updates
stella orch run stream --dag-id policy-refresh --tenant acme --output json
```


@@ -0,0 +1,33 @@
# Orchestrator Console (DOCS-ORCH-33-002)
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Last updated: 2025-11-25
## Views
- **Run list**: deterministic table sorted by `startedUtc` DESC then `runId`; filters by `dagId`, `status`, `owner`, `time range`.
- **Run detail**: step graph with topological order; shows status, attempts, duration, logs link, outputs hash.
- **DAG catalog**: shows published versions with signatures and enable/disable state.
- **Queue health**: per-tenant queue depth/age, retry counts, worker availability.
## Actions
- Start run (select DAG/version, supply inputs JSON, optional run token).
- Cancel run (best-effort).
- Download artifacts/logs (tenant-scoped).
- Stream live updates (WebSocket) for selected DAGs/runs.
## Accessibility & UX
- Keyboard shortcuts: `f` focus filter, `r` refresh, `s` start run dialog.
- All timestamps are UTC; durations show the raw millisecond value in a tooltip.
- Color palette meets WCAG AA; status badges have icons + text.
- Loading states are deterministic: no infinite spinners; show “No data” with a retry option.
## Determinism & offline
- Client-side sorting mirrors API order; pagination uses stable `page_token`.
- Console operates against local WebService; no external CDNs; fonts bundled.
- Exports (runs, steps) available as NDJSON for air-gapped audits.
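A sketch of client-side sorting that mirrors the run-list order (`startedUtc` DESC, then `runId`). It assumes timestamps are uniformly formatted ISO-8601 UTC strings, so string comparison matches chronological order, and that the `runId` tie-break is ascending, which the source does not state.

```python
def sort_runs(runs):
    """Order runs by startedUtc descending, then runId ascending."""
    # Two stable passes: secondary key first, then the primary key.
    by_id = sorted(runs, key=lambda r: r["runId"])
    return sorted(by_id, key=lambda r: r["startedUtc"], reverse=True)


runs = [
    {"runId": "b", "startedUtc": "2025-11-25T10:00:00Z"},
    {"runId": "a", "startedUtc": "2025-11-25T10:00:00Z"},
    {"runId": "c", "startedUtc": "2025-11-25T12:00:00Z"},
]
print([r["runId"] for r in sort_runs(runs)])
# ['c', 'a', 'b']
```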
## Safety
- Tenant enforced via session; cross-tenant DAGs hidden.
- No raw secrets displayed; logs redacted server-side.
- Run cancellation confirms and records rationale for audit.


@@ -0,0 +1,46 @@
# Orchestrator Overview (DOCS-ORCH-32-001)
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Last updated: 2025-11-25
## Mission & value
- Coordinate deterministic job execution across StellaOps services (Policy, RiskEngine, VEX Lens, Export Center, Notify).
- Provide reproducible DAG runs with tenant isolation, auditability, and Aggregation-Only Contract (AOC) alignment.
- Stay sovereign/offline: all runners operate from bundled manifests and local queues; no external control plane.
## Runtime shape
- **Services**: Orchestrator WebService (API/UI), Worker (executors), Scheduler (timer-based triggers).
- **Queues**: per-tenant work queues; FIFO with deterministic ordering and idempotency keys.
- **State**: Mongo for run metadata and DAG definitions; optional Redis for locks/throttles; all scoped by tenant.
- **APIs**: REST + WebSocket for run status/stream; admin endpoints require `orchestrator:admin` plus tenant header.
## AOC alignment
- Orchestrator never derives policy/verdicts; it only executes declared DAG steps and records outcomes.
- Inputs/outputs are append-only; runs are immutable with replay tokens.
- No consensus logic; all decisions remain in owning services (Policy Engine, RiskEngine, etc.).
## Determinism
- Stable DAG evaluation order (topological with lexical tie-breaks).
- Idempotency via run tokens and step hashes; retries preserve `trace_id`.
- UTC timestamps; hashes lowercase hex; NDJSON exports ordered by `(startedUtc, dagId, runId)`.
## Observability
- Traces propagate `traceparent`/`baggage` through scheduler→worker→task.
- Metrics: `orchestrator_runs_total{tenant,status}`, `orchestrator_run_duration_seconds`, `orchestrator_queue_depth`.
- Logs: structured JSON, redacted, tagged with `tenant`, `dagId`, `runId`, `status`.
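Trace propagation across scheduler, worker, and task can be sketched as deriving a child `traceparent` that keeps the trace id and replaces the parent span id. The four-field layout follows the W3C Trace Context format; the helper name is hypothetical and this is not the Orchestrator's own tracing code.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(parent: str) -> str:
    """Keep the trace id and flags; mint a new span id for the next hop."""
    match = _TRACEPARENT.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    version, trace_id, _parent_span, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))
```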
## Roles & responsibilities
- **Operator**: manage DAG definitions, quotas, tenant allowlists, SLOs.
- **Developer**: defines DAG specs and task plugins; supplies offline bundles for execution.
- **Security**: validates scopes, enforces AOC boundaries, reviews audit trails.
## Offline posture
- DAG specs and plugins shipped in offline bundles; runners load from local disk.
- No outbound network during execution unless task explicitly declares an allowlisted endpoint.
- Transparency: export run logs/traces/metrics as NDJSON for air-gapped review.
## Safety & governance
- Mandatory tenant header; cross-tenant DAGs forbidden.
- Step sandboxing: resource limits per task; deny network by default.
- Audit: every run records actor, tenant, DAG version, inputs hash, outputs hash, and rationale notes.


@@ -0,0 +1,36 @@
# Orchestrator Run Ledger (DOCS-ORCH-34-001)
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Last updated: 2025-11-25
## Purpose
Immutable record of every DAG run and step execution for audit, replay, and offline export.
## Record schema (conceptual)
- `tenant`, `runId`, `dagId`, `dagVersion`, `runToken`, `traceId`
- `status` (`running|completed|failed|cancelled`)
- `inputsHash`, `outputsHash` (overall)
- `startedUtc`, `endedUtc`, `durationMs`
- `steps[]`:
  - `stepId`, `status`, `attempt`, `startedUtc`, `endedUtc`, `durationMs`
  - `inputsHash`, `outputsHash`, `logsRef`, `metricsRef`, `errorCode`, `retryable`
- `events[]` (optional): ordered list of significant events with `timestamp`, `type`, `message`, `actor`
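The conceptual schema can be written down as frozen dataclasses to make the immutability explicit. Field names follow the list above; the Python types and which fields are optional are assumptions, since the source gives names only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StepRecord:
    stepId: str
    status: str
    attempt: int
    startedUtc: str
    endedUtc: Optional[str] = None
    durationMs: Optional[int] = None
    inputsHash: Optional[str] = None
    outputsHash: Optional[str] = None
    logsRef: Optional[str] = None
    metricsRef: Optional[str] = None
    errorCode: Optional[str] = None
    retryable: Optional[bool] = None


@dataclass(frozen=True)
class RunRecord:
    tenant: str
    runId: str
    dagId: str
    dagVersion: str
    runToken: str
    traceId: str
    status: str  # running | completed | failed | cancelled
    inputsHash: str
    startedUtc: str
    outputsHash: Optional[str] = None
    endedUtc: Optional[str] = None
    durationMs: Optional[int] = None
    steps: Tuple[StepRecord, ...] = ()
    events: Tuple[dict, ...] = ()


run = RunRecord(
    tenant="acme", runId="r1", dagId="policy-refresh", dagVersion="1",
    runToken="token-1", traceId="t1", status="running",
    inputsHash="ab12", startedUtc="2025-11-25T20:09:44Z",
    steps=(StepRecord(stepId="fetch", status="running", attempt=1,
                      startedUtc="2025-11-25T20:09:45Z"),),
)
print(run.steps[0].stepId)
```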
## Storage
- Mongo collection partitioned by tenant; indexes on `(tenant, dagId, runId)`, `(tenant, status, startedUtc)`.
- Artifacts/logs referenced by content hash; stored separately (object storage/GridFS).
- Append-only updates; run status transitions are monotonic.
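Monotonic status transitions can be sketched as a small lookup table. The assumption here is that `running` is the only non-terminal status and that terminal statuses never change; the source lists the four statuses but not the allowed transitions.

```python
TERMINAL = frozenset({"completed", "failed", "cancelled"})

# Assumption: a run may only move from "running" to a terminal status.
ALLOWED = {"running": TERMINAL}


def can_transition(current: str, new: str) -> bool:
    """Return True if the ledger may record a move from current to new."""
    return new in ALLOWED.get(current, frozenset())


print(can_transition("running", "completed"))   # True
print(can_transition("completed", "running"))   # False
```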
## Exports
- NDJSON export sorted by `startedUtc`, then `runId`; includes steps/events inline.
- Exports include manifest with hash and count for determinism.
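A sketch of a deterministic export: records are sorted by `startedUtc` then `runId`, serialized one JSON object per line, and summarized in a manifest. The manifest field names `count` and `sha256`, the choice of SHA-256, and the compact sorted-key JSON encoding are assumptions.

```python
import hashlib
import json


def export_runs(runs):
    """Return (ndjson_bytes, manifest) with a stable, reproducible byte layout."""
    ordered = sorted(runs, key=lambda r: (r["startedUtc"], r["runId"]))
    lines = [json.dumps(r, sort_keys=True, separators=(",", ":")) for r in ordered]
    payload = ("\n".join(lines) + "\n").encode("utf-8") if lines else b""
    manifest = {"count": len(lines), "sha256": hashlib.sha256(payload).hexdigest()}
    return payload, manifest


records = [
    {"runId": "r2", "startedUtc": "2025-11-25T10:00:00Z"},
    {"runId": "r1", "startedUtc": "2025-11-25T10:00:00Z"},
    {"runId": "r0", "startedUtc": "2025-11-25T11:00:00Z"},
]
payload, manifest = export_runs(records)
print(manifest["count"], manifest["sha256"][:12])
```

Two exports of the same records produce identical bytes and therefore identical manifest hashes, whatever order the records were read in.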
## Observability
- Metrics derived from ledger: run counts, durations, failure rates, retry counts.
- Trace links preserved via stored `traceId`.
## Governance
- Runs never mutated or deleted; cancellation recorded as an event.
- Access is tenant-scoped; admin queries require `orchestrator:admin`.
- Replay tokens can be derived from `inputsHash` + `dagVersion`; consumers must log rationale when replaying.