# Orchestrator Runbook (DOCS-ORCH-34-003)
Last updated: 2025-11-25
## Pre-flight
- Ensure Mongo and the queue backend are reachable and that the health endpoint `/api/v1/orchestrator/admin/health` reports green.
- Verify that the tenant allowlist and scopes (`orchestrator:*`) are configured in Authority.
- Confirm that plugin bundles are present and their signatures verify.
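
The health portion of the pre-flight can be scripted. The sketch below is a minimal illustration, not the Orchestrator's actual client: the endpoint path comes from the list above, while the response fields (`status`, `queueDepth`), the `max_queue_depth` threshold, and the helper names are assumptions made for the example.

```python
import json
import urllib.request

HEALTH_PATH = "/api/v1/orchestrator/admin/health"  # path from the runbook


def is_green(payload: dict, max_queue_depth: int = 1000) -> bool:
    """Return True when the payload looks healthy.

    Assumed (hypothetical) payload shape: {"status": "green", "queueDepth": 12}.
    """
    return (
        payload.get("status") == "green"
        and payload.get("queueDepth", 0) <= max_queue_depth
    )


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET the health endpoint and decode the JSON body."""
    url = base_url.rstrip("/") + HEALTH_PATH
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

In use, `is_green(fetch_health("https://orchestrator.example.internal"))` would gate the rest of the pre-flight; the host name is a placeholder.
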
## Common operations
- **Start a run**: `POST /api/v1/orchestrator/runs` or `stella orch run start ...`.
- **Cancel a run**: `POST /runs/{runId}:cancel`; cancellation is best-effort and idempotent.
- **Stream status**: WebSocket `/runs/stream` or CLI `stella orch run stream`.
- **Export ledger**: NDJSON export by time window for audits.
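
For scripted use, the start and cancel calls above can be prepared as plain HTTP requests. This is a sketch under stated assumptions: the request body fields (`dagId`, `runToken`), the bearer-token header, and the idea that the cancel route sits under the same `/api/v1/orchestrator` prefix as the start route are not confirmed by this runbook, and the function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "/api/v1/orchestrator"  # prefix taken from the start-run path


def start_run_request(host: str, bearer: str, dag_id: str, run_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST that starts a run."""
    # Body fields are hypothetical; check the API reference for the real schema.
    body = json.dumps({"dagId": dag_id, "runToken": run_token}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}{API_BASE}/runs",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {bearer}",
            "Content-Type": "application/json",
        },
    )


def cancel_run_request(host: str, bearer: str, run_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST that cancels a run."""
    # Assumes the cancel route lives under the same prefix as the start route.
    url = f"{host.rstrip('/')}{API_BASE}/runs/{quote(run_id, safe='')}:cancel"
    return urllib.request.Request(
        url=url,
        data=b"",
        method="POST",
        headers={"Authorization": f"Bearer {bearer}"},
    )
```

Either object can be passed to `urllib.request.urlopen` to send it. Because the runbook describes cancellation as idempotent, retrying the cancel request after a timeout should be safe.
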
## Incident response
- **Queue backlog**: Check queue depth; scale workers or pause schedulers; drain the oldest items first. Verify that no plugin is stuck.
- **Repeated failures**: Inspect the run ledger for `error.code`; compare `inputsHash` and plugin version across failing runs; roll back the DAG version if a regression is confirmed.
- **Plugin auth errors**: Rotate `secretRef` in Authority; warm the worker cache; re-run the impacted DAGs.
- **Scheduler runaway**: Disable the offending DAG version; clear scheduled triggers; confirm that the queue drains.
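
When triaging repeated failures, it helps to tally ledger entries by error code and plugin version before deciding on a rollback. The sketch below assumes an NDJSON ledger export in which each line carries a nested `error.code` and a `pluginVersion` field; the runbook names `error.code` and `inputsHash`, but the exact field layout and the `pluginVersion` key are assumptions.

```python
import json
from collections import Counter


def failure_breakdown(ndjson_text: str) -> Counter:
    """Count failed ledger entries per (error code, plugin version)."""
    counts: Counter = Counter()
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the export
        entry = json.loads(line)
        code = (entry.get("error") or {}).get("code")
        if code is None:
            continue  # entry carries no error, so it is not a failure
        # "pluginVersion" is a hypothetical field name.
        counts[(code, entry.get("pluginVersion", "unknown"))] += 1
    return counts
```

A count concentrated on one plugin version points toward a regression in that version, whereas the same code spread across versions suggests an environmental cause such as expired credentials.
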
## Health checks
- `GET /admin/health` (under `/api/v1/orchestrator`, as in Pre-flight): liveness/readiness plus queue depth.
- Metrics: `orchestrator_runs_total`, `orchestrator_queue_depth`, `orchestrator_step_retries_total`, `orchestrator_run_duration_seconds`.
- Logs: structured JSON with `tenant`, `dagId`, `runId`, `status`; check for redaction markers.
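
The metrics listed above can be read from a scrape without extra tooling. The sketch below assumes the metrics are exposed in the Prometheus text exposition format, which the metric names suggest but this runbook does not state; the parser is deliberately simplified and the function name is hypothetical.

```python
def sum_samples(exposition: str, name: str) -> float:
    """Sum every sample of metric `name` across its label sets."""
    total = 0.0
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line and "}" in line:
            metric = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1]  # text after the label block
        else:
            metric, _, rest = line.partition(" ")
        if metric != name:
            continue
        fields = rest.split()
        if fields:
            total += float(fields[0])  # first field is the value; a timestamp may follow
    return total
```

Summing across label sets gives a fleet-wide queue depth when the gauge is reported per tenant or per worker; whether such labels exist here is an assumption.
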
## Determinism/immutability
- Runs are append-only; do not mutate ledger entries. Use new DAG versions for fixes.
- Idempotency is provided via `runToken`: a rerun that repeats the same intended work should reuse the same token.
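
One way to make token reuse automatic is to derive the token from the identity of the work itself. This is purely illustrative: the runbook does not say how `runToken` values are generated, so the derivation below (a SHA-256 over tenant, DAG id, DAG version, and `inputsHash`) and the function name are assumptions.

```python
import hashlib
import json


def derive_run_token(tenant: str, dag_id: str, dag_version: str, inputs_hash: str) -> str:
    """Derive a deterministic token: identical work yields an identical token."""
    # Canonical JSON (sorted keys, no whitespace) keeps the digest stable.
    canonical = json.dumps(
        {
            "tenant": tenant,
            "dagId": dag_id,
            "dagVersion": dag_version,
            "inputsHash": inputs_hash,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this scheme a retry of the same work reuses the token without any stored state, while a fix shipped as a new DAG version naturally produces a new token.
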
## Offline/air-gap
- Keep plugin bundles and DAG specs in sealed storage; do not fetch them remotely.
- Export logs/metrics/traces as NDJSON for offline analysis; include a manifest with hashes.
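
An export with an accompanying manifest might look like the following sketch. The manifest layout (`file`, `sha256`, `bytes`, `records`) and the helper names are hypothetical; the runbook only requires that a manifest/hash accompany the NDJSON.

```python
import hashlib
import json


def to_ndjson(records: list[dict]) -> bytes:
    """Serialize records as one compact JSON object per line."""
    # Sorted keys keep the output byte-stable for identical input.
    lines = [json.dumps(r, sort_keys=True, separators=(",", ":")) for r in records]
    return ("\n".join(lines) + "\n").encode("utf-8") if lines else b""


def build_manifest(file_name: str, payload: bytes) -> dict:
    """Describe an export so it can be verified on the air-gapped side."""
    return {
        "file": file_name,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "bytes": len(payload),
        "records": payload.count(b"\n"),  # one newline per record
    }


def verify(manifest: dict, payload: bytes) -> bool:
    """Check that the payload still matches the hash in its manifest."""
    return hashlib.sha256(payload).hexdigest() == manifest["sha256"]
```

Writing the payload and the manifest as separate files lets the receiving side run `verify` before analysis, without any network access.
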
## Quick checks
- [ ] Health green, queue depth normal.
- [ ] Latest plugin bundle signatures valid.
- [ ] No secrets in logs (spot-check redaction).
- [ ] Error budget within SLO (see `docs/observability/metrics-and-slos.md`).
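
The redaction spot-check can be partly automated. The sketch below validates that a structured log line carries the fields named under Health checks and flags string values that look like unredacted credentials. The `[REDACTED]` marker text and the suspect-key pattern are assumptions, and a regex scan like this is a heuristic rather than proof that no secret leaked.

```python
import json
import re

REQUIRED_FIELDS = ("tenant", "dagId", "runId", "status")  # from the runbook
REDACTION_MARKER = "[REDACTED]"  # hypothetical marker text
SUSPECT = re.compile(r"(?i)\b(password|secret|api[_-]?key|token)\b\s*[=:]\s*(\S+)")


def audit_log_line(line: str) -> list[str]:
    """Return findings for one JSON log line; an empty list means it looks clean."""
    findings = []
    record = json.loads(line)
    for field in REQUIRED_FIELDS:
        if field not in record:
            findings.append(f"missing field: {field}")
    for key, value in record.items():
        if not isinstance(value, str):
            continue
        for match in SUSPECT.finditer(value):
            # A key=value pair is acceptable only if the value was redacted.
            if REDACTION_MARKER not in match.group(2):
                findings.append(f"possible secret in {key!r}")
    return findings
```

Running this over a sample of recent log lines gives a quick pass/fail signal for the checklist item; any finding still warrants a manual look.
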