# Orchestrator Runbook (DOCS-ORCH-34-003)
Last updated: 2025-11-25
## Pre-flight
- Ensure PostgreSQL and the queue backend are reachable, and that `/api/v1/orchestrator/admin/health` reports green.
- Verify that the tenant allowlist and scopes (`orchestrator:*`) are configured in Authority.
- Confirm that plugin bundles are present and their signatures have been verified.
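
The health check above can be scripted. The following is a minimal sketch, not the orchestrator's own tooling: the health path comes from this runbook, but the response shape (a JSON body with a top-level `status` field equal to `green`) and the helper names are assumptions to adapt to your deployment.

```python
import json
import urllib.request

HEALTH_PATH = "/api/v1/orchestrator/admin/health"  # path taken from this runbook


def health_url(base_url: str) -> str:
    """Join a deployment base URL with the health path without doubling slashes."""
    return base_url.rstrip("/") + HEALTH_PATH


def is_green(http_status: int, body: str) -> bool:
    """Return True for HTTP 200 with a body whose status is "green".

    ASSUMPTION: the body is a JSON object with a top-level "status" field.
    """
    if http_status != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return str(payload.get("status", "")).lower() == "green"


def probe(base_url: str, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; any network or decoding error counts as unhealthy."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return is_green(resp.status, resp.read().decode("utf-8"))
    except (OSError, ValueError):
        return False
```

Calling `probe("https://orch.example.internal")` (a placeholder host) returns `False` on any failure, so it can gate the rest of a pre-flight script.
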
## Common operations
- **Start a run**: `POST /api/v1/orchestrator/runs` or `stella orch run start ...`.
- **Cancel a run**: `POST /runs/{runId}:cancel`. Cancellation is best-effort and idempotent, so repeating the call is safe.
- **Stream status**: WebSocket `/runs/stream` or CLI `stella orch run stream`.
- **Export ledger**: NDJSON export by time window for audits.
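
For scripting the start and cancel calls, a small sketch of URL construction follows. It assumes the short `/runs/...` paths sit under the same `/api/v1/orchestrator` prefix as the start endpoint, which this runbook implies but does not state; the helper names are hypothetical and the request bodies are left out because they are not specified here.

```python
from urllib.parse import quote

# ASSUMPTION: all run endpoints share this prefix.
API_PREFIX = "/api/v1/orchestrator"


def start_run_request(base_url: str) -> tuple[str, str]:
    """Return the (method, url) pair for starting a run."""
    return ("POST", base_url.rstrip("/") + API_PREFIX + "/runs")


def cancel_run_request(base_url: str, run_id: str) -> tuple[str, str]:
    """Return the (method, url) pair for cancelling a run.

    The run ID is percent-encoded so that unusual characters cannot alter the path.
    """
    encoded = quote(run_id, safe="")
    return ("POST", f"{base_url.rstrip('/')}{API_PREFIX}/runs/{encoded}:cancel")
```

Because cancellation is idempotent, a caller can retry the cancel request on a timeout without first checking whether the earlier attempt landed.
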
## Incident response
- **Queue backlog**: Check queue depth; scale workers or pause schedulers; drain the oldest items first. Verify that no plugin is stuck.
- **Repeated failures**: Inspect the run ledger for `error.code`; compare `inputsHash` and plugin version across the failing runs; roll back the DAG version if the failures are a regression.
- **Plugin auth errors**: Rotate `secretRef` in Authority; warm the worker cache; re-run the impacted DAGs.
- **Scheduler runaway**: Disable the offending DAG version; clear scheduled triggers; confirm that the queue drains.
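
The repeated-failures triage can be sketched as a grouping over exported ledger entries. `error.code` and `inputsHash` are named in this runbook; the `pluginVersion` field name and the overall entry shape are assumptions for illustration only.

```python
from collections import defaultdict


def summarize_failures(entries: list[dict]) -> dict[str, dict]:
    """Group ledger entries that carry an error code.

    For each code, report how many entries failed and which input hashes and
    plugin versions were involved. ASSUMPTION: entries look like
    {"error": {"code": ...}, "inputsHash": ..., "pluginVersion": ...}.
    """
    groups = defaultdict(lambda: {"count": 0, "hashes": set(), "versions": set()})
    for entry in entries:
        code = (entry.get("error") or {}).get("code")
        if code is None:
            continue  # not a failed entry
        group = groups[code]
        group["count"] += 1
        if entry.get("inputsHash") is not None:
            group["hashes"].add(entry["inputsHash"])
        if entry.get("pluginVersion") is not None:
            group["versions"].add(entry["pluginVersion"])
    return {
        code: {
            "count": g["count"],
            "inputsHashes": sorted(g["hashes"]),
            "pluginVersions": sorted(g["versions"]),
        }
        for code, g in groups.items()
    }
```

An error code that appears with a single `inputsHash` points at the inputs. One that spans many hashes but appears only from a particular plugin version onward is a more plausible regression candidate.
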
## Health checks
- `GET /admin/health` (full path `/api/v1/orchestrator/admin/health`): liveness/readiness and queue depth.
- Metrics: `orchestrator_runs_total`, `orchestrator_queue_depth`, `orchestrator_step_retries_total`, `orchestrator_run_duration_seconds`.
- Logs: structured JSON with `tenant`, `dagId`, `runId`, `status`; check for redaction markers.
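
A quick way to spot-check log structure is to confirm that each line parses as JSON and carries the four fields listed above. This sketch assumes one JSON object per line; it does not check redaction markers, whose format is not specified in this runbook.

```python
import json

REQUIRED_FIELDS = ("tenant", "dagId", "runId", "status")  # from this runbook


def missing_fields(log_line: str) -> list[str]:
    """Return the required fields absent from one log line.

    A line that is not a JSON object is reported as missing every field.
    """
    try:
        record = json.loads(log_line)
    except json.JSONDecodeError:
        return list(REQUIRED_FIELDS)
    if not isinstance(record, dict):
        return list(REQUIRED_FIELDS)
    return [field for field in REQUIRED_FIELDS if field not in record]
```

Running this over a sample of recent lines and listing those with a non-empty result gives a fast signal that a component has stopped emitting structured logs.
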
## Determinism/immutability
- Runs are append-only; do not mutate ledger entries. Use new DAG versions for fixes.
- Idempotency is keyed on `runToken`: reuse the same token when a rerun repeats the same intended work.
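
The two rules above can be illustrated with a toy in-memory model. This is not the orchestrator's implementation: the class, its method names, and the run ID format are hypothetical, and it only shows how token-keyed submission and append-only recording interact.

```python
class RunLedger:
    """Toy append-only ledger with idempotent submission keyed on a run token."""

    def __init__(self) -> None:
        self._entries: list[dict] = []        # only ever appended to
        self._run_by_token: dict[str, str] = {}

    def submit(self, run_token: str, dag_id: str) -> str:
        """Create a run for this token, or return the existing run ID."""
        if run_token in self._run_by_token:
            return self._run_by_token[run_token]  # repeat submission: no new run
        run_id = f"run-{len(self._run_by_token) + 1}"  # hypothetical ID format
        self._run_by_token[run_token] = run_id
        self._entries.append({"runId": run_id, "dagId": dag_id, "status": "queued"})
        return run_id

    def record(self, run_id: str, status: str) -> None:
        """Record a status change as a new entry; earlier entries stay untouched."""
        self._entries.append({"runId": run_id, "status": status})

    def entries(self) -> tuple[dict, ...]:
        """Return copies so callers cannot mutate ledger history."""
        return tuple(dict(entry) for entry in self._entries)
```

Submitting twice with the same token yields the same run ID and a single `queued` entry, while every status change adds a row instead of editing one.
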
## Offline/air-gap
- Keep plugin bundles and DAG specs in sealed storage; do not fetch them remotely.
- Export logs, metrics, and traces as NDJSON for offline analysis; include a manifest with hashes.
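
A minimal sketch of an NDJSON export with an accompanying manifest follows. The manifest fields (`file`, `sha256`, `bytes`, `records`) are an assumed layout, not a format defined by this runbook. Keys are sorted so that the same records always produce the same bytes and therefore the same hash.

```python
import hashlib
import json


def to_ndjson(records: list[dict]) -> bytes:
    """Serialize records as NDJSON: one compact JSON object per line."""
    lines = [json.dumps(r, sort_keys=True, separators=(",", ":")) for r in records]
    return "".join(line + "\n" for line in lines).encode("utf-8")


def build_manifest(filename: str, payload: bytes) -> dict:
    """Describe an export so it can be verified after transfer.

    ASSUMPTION: this manifest layout is illustrative only.
    """
    return {
        "file": filename,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "bytes": len(payload),
        # json.dumps escapes newlines inside strings, so each b"\n" ends a record.
        "records": payload.count(b"\n"),
    }
```

On the receiving side, recomputing the SHA-256 of the transferred file and comparing it with the manifest value confirms that the export arrived intact.
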
## Quick checks
- [ ] Health green, queue depth normal.
- [ ] Latest plugin bundle signatures valid.
- [ ] No secrets in logs (spot-check redaction).
- [ ] Error budget within SLO (see `docs/observability/metrics-and-slos.md`).