Orchestrator Runbook (DOCS-ORCH-34-003)
Last updated: 2025-11-25
Pre-flight
- Ensure Mongo and queue backend reachable; health at /api/v1/orchestrator/admin/health green.
- Verify tenant allowlist and scopes (orchestrator:*) configured in Authority.
- Plugin bundles present and signatures verified.
Common operations
- Start a run: POST /api/v1/orchestrator/runs or stella orch run start ....
- Cancel a run: POST /runs/{runId}:cancel; best-effort, idempotent.
- Stream status: WebSocket /runs/stream or CLI stella orch run stream.
- Export ledger: NDJSON export by time window for audits.
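The start and cancel calls above can be sketched as plain HTTP requests. This is a minimal illustration, not the orchestrator's documented client: the host name, the bearer-token header, and the request body fields (dagId, runToken) are assumptions, and the cancel path is assumed to live under the same /api/v1/orchestrator prefix as the start path. The functions only build the requests; nothing is sent.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://orchestrator.example.internal"  # hypothetical host
API_PREFIX = "/api/v1/orchestrator"                 # prefix taken from the start-run path


def build_start_run(dag_id: str, run_token: str, bearer: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the runs endpoint."""
    # Body field names are assumptions made for illustration only.
    body = json.dumps({"dagId": dag_id, "runToken": run_token}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}{API_PREFIX}/runs",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {bearer}",
        },
    )


def build_cancel_run(run_id: str, bearer: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /runs/{runId}:cancel."""
    # Percent-encode the run id so unusual characters cannot alter the path.
    safe_id = urllib.parse.quote(run_id, safe="")
    return urllib.request.Request(
        url=f"{BASE_URL}{API_PREFIX}/runs/{safe_id}:cancel",
        data=b"",
        method="POST",
        headers={"Authorization": f"Bearer {bearer}"},
    )
```

Because cancel is described as best-effort and idempotent, a caller can pass the cancel request to urllib.request.urlopen again after a timeout without first checking whether the earlier attempt landed.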
Incident response
- Queue backlog: Check queue depth; scale workers or pause schedulers; drain oldest first. Verify no stuck plugin.
- Repeated failures: Inspect run ledger for error.code; compare inputsHash and plugin version; roll back DAG version if regression.
- Plugin auth errors: rotate secretRef in Authority; warm worker cache; re-run impacted DAGs.
- Scheduler runaway: disable offending DAG version; clear scheduled triggers; confirm queue drains.
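For the repeated-failures case, the comparison of inputsHash against plugin version can be scripted over an NDJSON ledger export. The sketch below assumes each ledger entry is a JSON object carrying inputsHash, an optional error.code, and a pluginVersion field; the pluginVersion field name is hypothetical, and entries without error.code are treated as successful.

```python
import json
from collections import defaultdict


def find_version_regressions(ledger_ndjson: str) -> list:
    """Flag inputs that pass on one plugin version and fail on another."""
    # inputsHash -> pluginVersion -> list of error codes (None means no error)
    outcomes = defaultdict(lambda: defaultdict(list))
    for line in ledger_ndjson.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        code = (entry.get("error") or {}).get("code")
        version = str(entry.get("pluginVersion", "unknown"))  # hypothetical field
        outcomes[entry.get("inputsHash")][version].append(code)

    suspects = []
    for inputs_hash, by_version in outcomes.items():
        passing = sorted(v for v, codes in by_version.items()
                         if all(c is None for c in codes))
        failing = sorted(v for v, codes in by_version.items()
                         if any(c is not None for c in codes))
        # Identical inputs with mixed results across versions suggest a regression.
        if passing and failing:
            error_codes = sorted({c for codes in by_version.values()
                                  for c in codes if c is not None})
            suspects.append({
                "inputsHash": inputs_hash,
                "passing": passing,
                "failing": failing,
                "errorCodes": error_codes,
            })
    return suspects
```

An empty result does not rule out a regression; it only means no input hash in the export was seen both passing and failing.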
Health checks
- GET /admin/health: liveness/readiness + queue depth.
- Metrics: orchestrator_runs_total, orchestrator_queue_depth, orchestrator_step_retries_total, orchestrator_run_duration_seconds.
- Logs: structured JSON with tenant, dagId, runId, status; check for redaction markers.
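A queue-depth check against the metrics listed above might look like the following. It assumes the metrics are scraped in the Prometheus text exposition format, which the runbook does not state, and the threshold is a hypothetical value to be replaced with the deployment's own limit.

```python
from typing import Optional

MAX_QUEUE_DEPTH = 500.0  # hypothetical threshold; tune per deployment


def read_metric(metrics_text: str, name: str) -> Optional[float]:
    """Sum every sample of `name` across label sets; None if absent."""
    total = 0.0
    found = False
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        if "{" in line:
            # name{labels} value [timestamp]; split on the last closing brace
            head, _, tail = line.rpartition("}")
            metric_name = head.split("{", 1)[0]
        else:
            # name value [timestamp]
            metric_name, _, tail = line.partition(" ")
        if metric_name != name:
            continue
        total += float(tail.split()[0])
        found = True
    return total if found else None


def queue_depth_ok(metrics_text: str, limit: float = MAX_QUEUE_DEPTH) -> bool:
    """True only when the gauge is present and at or below the limit."""
    depth = read_metric(metrics_text, "orchestrator_queue_depth")
    return depth is not None and depth <= limit
```

Treating a missing gauge as a failure is deliberate here: a scrape that omits orchestrator_queue_depth should prompt a look rather than pass silently.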
Determinism/immutability
- Runs are append-only; do not mutate ledger entries. Use new DAG versions for fixes.
- Idempotency via runToken; reruns should reuse the same token when repeating intended work.
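One way to make token reuse automatic is to derive the token from the work itself. The runbook does not say how runToken is produced, so the scheme below (SHA-256 over a canonical JSON form of tenant, DAG id, and inputs) is purely an illustrative assumption, as is the function name.

```python
import hashlib
import json


def derive_run_token(tenant: str, dag_id: str, inputs: dict) -> str:
    """Hypothetical scheme: identical intended work yields an identical token."""
    # Sorted keys and fixed separators give a stable byte representation,
    # so dictionary ordering does not change the resulting token.
    canonical = json.dumps(
        {"tenant": tenant, "dagId": dag_id, "inputs": inputs},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this scheme a retry of the same work reuses the token without any bookkeeping, while a change to the inputs produces a new token and therefore a new run.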
Offline/air-gap
- Keep plugin bundles and DAG specs in sealed storage; no remote fetch.
- Export logs/metrics/traces as NDJSON for offline analysis; include manifest/hash.
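The export-with-manifest step can be sketched as follows. The manifest layout (format, records, sha256) is an assumption chosen for illustration, not a format defined by the orchestrator.

```python
import hashlib
import json


def export_ndjson(records: list) -> tuple:
    """Serialize records as NDJSON and return (payload_bytes, manifest)."""
    lines = [json.dumps(r, sort_keys=True, separators=(",", ":")) for r in records]
    payload = ("\n".join(lines) + "\n").encode("utf-8") if lines else b""
    manifest = {
        "format": "ndjson",
        "records": len(lines),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    return payload, manifest


def verify_export(payload: bytes, manifest: dict) -> bool:
    """Recompute the hash and line count on the receiving side."""
    # json.dumps escapes newlines inside strings, so one b"\n" per record.
    return (
        hashlib.sha256(payload).hexdigest() == manifest["sha256"]
        and payload.count(b"\n") == manifest["records"]
    )
```

Running verify_export after the bundle crosses the air gap confirms that the payload was not truncated or altered in transit.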
Quick checks
- Health green, queue depth normal.
- Latest plugin bundle signatures valid.
- No secrets in logs (spot-check redaction).
- Error budget within SLO (see docs/observability/metrics-and-slos.md).
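The redaction spot-check from the list above can be partly automated over the structured JSON logs. Both the set of sensitive key names and the redaction marker string below are hypothetical; substitute whatever the deployment's logger actually emits.

```python
import json

SENSITIVE_KEYS = {"password", "secret", "token", "authorization"}  # hypothetical
REDACTION_MARKER = "[REDACTED]"                                    # hypothetical


def unredacted_fields(log_line: str) -> list:
    """Return dotted paths of sensitive string fields that are not redacted."""
    hits = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                child = f"{path}.{key}" if path else key
                is_sensitive = key.lower() in SENSITIVE_KEYS
                if is_sensitive and isinstance(value, str):
                    if value != REDACTION_MARKER:
                        hits.append(child)
                else:
                    walk(value, child)  # descend into nested structures
        elif isinstance(node, list):
            for index, item in enumerate(node):
                walk(item, f"{path}[{index}]")

    walk(json.loads(log_line), "")
    return hits
```

A key-name match is only a heuristic: secrets embedded in free-text message fields will not be caught, so a manual sample is still worthwhile.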