# Orchestrator Runbook (DOCS-ORCH-34-003)

Last updated: 2025-11-25

## Pre-flight

- Ensure Mongo and queue backend reachable; health at `/api/v1/orchestrator/admin/health` green.
- Verify tenant allowlist and scopes (`orchestrator:*`) configured in Authority.
- Plugin bundles present and signatures verified.

## Common operations

- **Start a run**: `POST /api/v1/orchestrator/runs` or `stella orch run start ...`.
- **Cancel a run**: `POST /runs/{runId}:cancel`; best-effort, idempotent.
- **Stream status**: WebSocket `/runs/stream` or CLI `stella orch run stream`.
- **Export ledger**: NDJSON export by time window for audits.

## Incident response

- **Queue backlog**: Check queue depth; scale workers or pause schedulers; drain oldest first. Verify no stuck plugin.
- **Repeated failures**: Inspect run ledger for `error.code`; compare `inputsHash` and plugin version; roll back DAG version if regression.
- **Plugin auth errors**: Rotate `secretRef` in Authority; warm worker cache; re-run impacted DAGs.
- **Scheduler runaway**: Disable offending DAG version; clear scheduled triggers; confirm queue drains.

## Health checks

- `GET /admin/health`: liveness/readiness + queue depth.
- Metrics: `orchestrator_runs_total`, `orchestrator_queue_depth`, `orchestrator_step_retries_total`, `orchestrator_run_duration_seconds`.
- Logs: structured JSON with `tenant`, `dagId`, `runId`, `status`; check for redaction markers.

## Determinism/immutability

- Runs are append-only; do not mutate ledger entries. Use new DAG versions for fixes.
- Idempotency via `runToken`; reruns should reuse the same token when repeating intended work.

## Offline/air-gap

- Keep plugin bundles and DAG specs in sealed storage; no remote fetch.
- Export logs/metrics/traces as NDJSON for offline analysis; include manifest/hash.

## Quick checks

- [ ] Health green, queue depth normal.
- [ ] Latest plugin bundle signatures valid.
- [ ] No secrets in logs (spot-check redaction).
- [ ] Error budget within SLO (see `docs/observability/metrics-and-slos.md`).
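
The redaction spot-check in the list above could be scripted roughly as follows. This is a minimal sketch, not the orchestrator's own tooling. It assumes a log sample is available as NDJSON (one JSON object per line) carrying the `tenant`, `dagId`, `runId`, and `status` fields named under Health checks. The `[REDACTED]` marker, the secret-looking patterns, and the `spot_check` helper are all hypothetical; replace them with whatever your deployment actually uses.

```python
import json
import re
from typing import Iterable, List

# Fields the runbook says every structured log entry carries.
REQUIRED_FIELDS = ("tenant", "dagId", "runId", "status")

# Assumption: the real redaction marker may differ in your deployment.
REDACTION_MARKER = "[REDACTED]"

# Hypothetical patterns for string values that look like unredacted secrets.
SECRET_PATTERNS = [
    # key=value or key: value pairs whose value is not the redaction marker
    re.compile(
        r"(?i)\b(password|passwd|api[_-]?key|token|secret)\b\s*[=:]\s*"
        r"(?!" + re.escape(REDACTION_MARKER) + r")\S+"
    ),
    # HTTP bearer credentials
    re.compile(r"(?i)\bbearer\s+[a-z0-9._~+/-]{16,}"),
    # PEM-encoded private keys
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]


def _strings(value):
    """Yield every string nested anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from _strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from _strings(item)


def spot_check(lines: Iterable[str]) -> List[str]:
    """Return human-readable findings for a sample of NDJSON log lines."""
    findings = []
    for number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # blank lines are not counted as findings
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            findings.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(entry, dict):
            findings.append(f"line {number}: not a JSON object")
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            findings.append(f"line {number}: missing fields {missing}")
        # Report at most one secret finding per line.
        if any(p.search(text) for text in _strings(entry) for p in SECRET_PATTERNS):
            findings.append(f"line {number}: possible unredacted secret")
    return findings
```

An empty result only means none of the patterns matched in the sample; it is not proof that the logs are free of secrets. The sketch inspects string values only, so a sensitive value stored under a telling key (for example a `password` field) would need an additional key-based check.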