Orchestrator Runbook (DOCS-ORCH-34-003)

Last updated: 2025-11-25

Pre-flight

  • Ensure MongoDB and the queue backend are reachable; the health endpoint at /api/v1/orchestrator/admin/health should report green.
  • Verify that the tenant allowlist and scopes (orchestrator:*) are configured in Authority.
  • Confirm that plugin bundles are present and their signatures have been verified.
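
The health part of the pre-flight list can be scripted. The following is a minimal sketch, not the project's tooling: the path comes from this runbook, but the response shape (a JSON object with a top-level "status" field equal to "green") and the base URL are assumptions to adjust against the real service.

```python
import json
import urllib.request

HEALTH_PATH = "/api/v1/orchestrator/admin/health"  # path as given in this runbook


def is_green(payload) -> bool:
    # Assumption: the body is a JSON object with a top-level "status"
    # field whose healthy value is "green". Anything else is not green.
    if not isinstance(payload, dict):
        return False
    return str(payload.get("status", "")).lower() == "green"


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers 200 with a green status."""
    url = base_url.rstrip("/") + HEALTH_PATH
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            return is_green(json.load(resp))
    except (OSError, ValueError):
        # Unreachable host, HTTP error, or a non-JSON body: treat as not green.
        return False
```

For example, `check_health("https://orchestrator.example.internal")` (a hypothetical host) could gate a deployment script before any run is started.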

Common operations

  • Start a run: POST /api/v1/orchestrator/runs or stella orch run start ....
  • Cancel a run: POST /runs/{runId}:cancel. Cancellation is best-effort and idempotent.
  • Stream status: WebSocket /runs/stream or CLI stella orch run stream.
  • Export ledger: export NDJSON for a given time window when audits require it.
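
As an illustration of the start and cancel calls above, here is a hedged sketch that only builds the HTTP requests. The bearer-token header, the request body fields, and the assumption that the cancel route sits under the same /api/v1/orchestrator prefix as the start route are all hypothetical; the runbook does not specify them.

```python
import json
import urllib.parse
import urllib.request

API_PREFIX = "/api/v1/orchestrator"  # prefix taken from the start-run path


def _post(base_url, path, token, body=None):
    """Build (but do not send) a POST request."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Authorization": f"Bearer {token}"}  # assumed auth scheme
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method="POST"
    )


def start_run_request(base_url, token, body):
    # The body schema is not documented here; callers pass whatever the API expects.
    return _post(base_url, API_PREFIX + "/runs", token, body)


def cancel_run_request(base_url, token, run_id):
    # Assumption: the cancel route lives under the same prefix as /runs.
    run = urllib.parse.quote(run_id, safe="")
    return _post(base_url, f"{API_PREFIX}/runs/{run}:cancel", token)
```

Passing either request to `urllib.request.urlopen` would send it. Because cancellation is described as idempotent, repeating the cancel call after a timeout should be safe.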

Incident response

  • Queue backlog: check queue depth, then scale workers or pause schedulers and drain the oldest items first. Verify that no plugin is stuck.
  • Repeated failures: inspect the run ledger for error.code; compare inputsHash and plugin version across the failing runs; roll back the DAG version if the failures point to a regression.
  • Plugin auth errors: rotate the secretRef in Authority, warm the worker cache, and re-run the impacted DAGs.
  • Scheduler runaway: disable the offending DAG version, clear scheduled triggers, and confirm that the queue drains.
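
The repeated-failures triage can be sketched against an NDJSON ledger export. This is an illustrative helper under assumptions: error.code and inputsHash are named in this runbook, while the nesting of `error` as an object and the `pluginVersion` field name are hypothetical.

```python
import json
from collections import defaultdict


def group_failures(ndjson_text):
    """Group ledger entries that carry an error code, keyed by that code."""
    groups = defaultdict(list)
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Assumption: failures expose the code as entry["error"]["code"].
        code = (entry.get("error") or {}).get("code")
        if code is not None:
            groups[code].append(entry)
    return dict(groups)


def summarize(groups):
    """Per error code: failure count plus distinct input hashes and plugin versions."""
    return {
        code: {
            "count": len(entries),
            "inputsHashes": sorted({e.get("inputsHash", "?") for e in entries}),
            # "pluginVersion" is a hypothetical field name.
            "pluginVersions": sorted({e.get("pluginVersion", "?") for e in entries}),
        }
        for code, entries in groups.items()
    }
```

If one error code recurs with a single inputsHash and a single plugin version, while earlier runs with the same inputs succeeded on another version, a regression is the more likely explanation and a rollback is worth considering.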

Health checks

  • GET /api/v1/orchestrator/admin/health: liveness/readiness plus queue depth.
  • Metrics: orchestrator_runs_total, orchestrator_queue_depth, orchestrator_step_retries_total, orchestrator_run_duration_seconds.
  • Logs: structured JSON with tenant, dagId, runId, status; check for redaction markers.
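
To watch orchestrator_queue_depth without a full monitoring stack, a scraped metrics payload can be parsed directly. The sketch below assumes the metrics are exposed in the Prometheus text exposition format, which this runbook does not state; only the metric names come from the list above.

```python
def metric_total(exposition_text, name):
    """Sum all samples of `name` across label sets; None if the metric is absent."""
    total = 0.0
    seen = False
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        head, brace, rest = line.partition("{")
        if brace:
            # Labeled sample: name{labels} value [timestamp]
            metric = head
            value_part = rest.rpartition("}")[2]
        else:
            # Unlabeled sample: name value [timestamp]
            metric, _, value_part = line.partition(" ")
        if metric.strip() != name:
            continue
        fields = value_part.split()
        if not fields:
            continue
        total += float(fields[0])
        seen = True
    return total if seen else None
```

A caller might compare `metric_total(text, "orchestrator_queue_depth")` against a locally chosen threshold; what counts as a normal depth depends on the deployment.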

Determinism/immutability

  • Runs are append-only; do not mutate ledger entries. Use new DAG versions for fixes.
  • Idempotency is keyed on runToken; a rerun that repeats the same intended work should reuse the same token.
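
The runbook does not say how a runToken is produced. One hypothetical client-side convention, shown here only as a sketch, is to derive it deterministically from the identity of the intended work, so that repeating the same work yields the same token and changed inputs yield a new one. The choice of fields and of SHA-256 are assumptions.

```python
import hashlib
import json


def derive_run_token(tenant, dag_id, dag_version, inputs):
    """Hash a canonical JSON form of the work identity (hypothetical scheme)."""
    canonical = json.dumps(
        {
            "tenant": tenant,
            "dagId": dag_id,
            "dagVersion": dag_version,
            "inputs": inputs,
        },
        sort_keys=True,          # key order must not change the token
        separators=(",", ":"),   # no whitespace variation
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this scheme a retry after a client timeout reuses the original token, while a fix shipped as a new DAG version produces a different one, which matches the append-only model described above.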

Offline/air-gap

  • Keep plugin bundles and DAG specs in sealed storage; do not fetch them remotely.
  • Export logs/metrics/traces as NDJSON for offline analysis; include a manifest with hashes.
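
A minimal sketch of the export step follows: records are serialized as NDJSON and a manifest records the size and SHA-256 of each file so the receiving side can verify the transfer. The manifest layout and file names are assumptions, not a format defined by this runbook.

```python
import hashlib
import json


def to_ndjson(records):
    """Serialize records as NDJSON bytes: one compact JSON object per line."""
    return b"".join(
        json.dumps(r, sort_keys=True, separators=(",", ":")).encode("utf-8") + b"\n"
        for r in records
    )


def build_manifest(files):
    """Map each export name to its byte size and SHA-256 (hypothetical layout)."""
    return {
        name: {"bytes": len(data), "sha256": hashlib.sha256(data).hexdigest()}
        for name, data in sorted(files.items())
    }


def mismatches(manifest, files):
    """Names that are missing or whose content no longer matches the manifest."""
    current = build_manifest(files)
    return sorted(n for n, meta in manifest.items() if current.get(n) != meta)
```

On the air-gapped side, `mismatches(manifest, files)` returning an empty list indicates that every listed file arrived intact.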

Quick checks

  • Health green, queue depth normal.
  • Latest plugin bundle signatures valid.
  • No secrets in logs (spot-check redaction).
  • Error budget within SLO (see docs/observability/metrics-and-slos.md).
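
The redaction spot-check can be partly automated. The sketch below scans structured JSON log lines for two unambiguous leak shapes, a bearer token and a PEM private-key header. The patterns are illustrative assumptions and will not catch every secret, so a manual spot-check still applies.

```python
import json
import re

# Illustrative patterns only; extend them for the secrets used locally.
SUSPECT_PATTERNS = [
    re.compile(r"(?i)bearer\s+[a-z0-9._~+/=-]{16,}"),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]


def find_leaks(log_text):
    """Return (line number, field) pairs whose string value looks like a secret."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except ValueError:
            hits.append((lineno, "<unparseable>"))  # not structured JSON
            continue
        if not isinstance(record, dict):
            continue
        for key, value in record.items():
            if isinstance(value, str) and any(p.search(value) for p in SUSPECT_PATTERNS):
                hits.append((lineno, key))
    return hits
```

An empty result from `find_leaks` on a sample of recent logs is a useful signal, though it is not proof that redaction is complete.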