Runbook: Release Orchestrator - Gate Evaluation Timeout
Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-004 - Release Orchestrator Runbooks
Metadata
| Field | Value |
|---|---|
| Component | Release Orchestrator |
| Severity | High |
| On-call scope | Platform team |
| Last updated | 2026-01-17 |
| Doctor check | check.orchestrator.gate-timeout |
Symptoms
- Promotion gates timing out before completing evaluation
- Alert OrchestratorGateTimeout firing
- Error: "gate evaluation timeout exceeded"
- Promotion stuck waiting for gate response
- Metric orchestrator_gate_timeout_total increasing
Impact
| Impact Type | Description |
|---|---|
| User-facing | Promotions delayed or blocked; release pipeline stalled |
| Data integrity | No data loss; promotion can be retried |
| SLA impact | Release SLO violated if timeout persists |
Diagnosis
Quick checks
- Check Doctor diagnostics:
  stella doctor --check check.orchestrator.gate-timeout
- Identify timed-out gates:
  stella promotion gates <promotion-id> --status timeout
- Check gate service health:
  stella orch gate-services status
Deep diagnosis
- Check specific gate latency:
  stella orch gate stats --gate <gate-name> --last 1h
  Look for: P95 latency, timeout rate
- Check external service connectivity:
  stella orch connectivity --gate <gate-name>
- Check gate evaluation logs:
  stella orch logs --gate <gate-name> --promotion <promotion-id>
  Look for: Slow queries, external API delays
- Check policy engine latency (for policy gates):
  stella policy stats --last 10m
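When reading the gate stats output, it can help to be clear about what the two "look for" figures mean. The following is a minimal sketch, not the orchestrator's own calculation: it assumes P95 is computed with the nearest-rank method and that an evaluation counts as timed out when its duration reaches the configured timeout. The function names and the sample data are hypothetical.

```python
def p95(latencies_s):
    """Nearest-rank 95th percentile of a non-empty list of latencies (seconds)."""
    ordered = sorted(latencies_s)
    # 1-based nearest-rank index: ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def timeout_rate(latencies_s, timeout_s):
    """Fraction of evaluations whose duration reached the timeout."""
    timed_out = sum(1 for value in latencies_s if value >= timeout_s)
    return timed_out / len(latencies_s)


# Hypothetical samples: 20 gate evaluations against an assumed 120 s timeout.
samples = [4.0] * 17 + [90.0, 120.0, 120.0]
summary = {
    "p95_s": p95(samples),
    "timeout_rate": timeout_rate(samples, 120.0),
}
```

A P95 that sits at or near the timeout value, as in these samples, suggests the timeout is cutting off the slow tail rather than the gate failing outright.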
Resolution
Immediate mitigation
- Increase timeout for specific gate:
  stella orch config set gates.<gate-name>.timeout 5m
  stella orch reload
- Skip the timed-out gate (requires approval):
  stella promotion gate skip <promotion-id> <gate-name> \
    --reason "External service timeout - approved by <approver>"
- Retry the promotion:
  stella promotion retry <promotion-id>
Root cause fix
If external service is slow:
- Configure gate retry with backoff:
  stella orch config set gates.<gate-name>.retries 3
  stella orch config set gates.<gate-name>.retry_backoff 5s
- Enable gate result caching:
  stella orch config set gates.<gate-name>.cache_ttl 5m
- Configure circuit breaker:
  stella orch config set gates.<gate-name>.circuit_breaker.enabled true
  stella orch config set gates.<gate-name>.circuit_breaker.threshold 5
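To illustrate how the retry and circuit breaker settings above interact, here is a minimal sketch under stated assumptions. The runbook does not say whether retry_backoff is fixed or growing, nor how the breaker recovers, so this sketch assumes a doubling backoff and a time-based cool-down (`reset_after_s`, a hypothetical parameter). All class and function names are illustrative and are not part of the orchestrator.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected untried."""


class CircuitBreaker:
    def __init__(self, threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.threshold = threshold          # consecutive failures before opening
        self.reset_after_s = reset_after_s  # assumed cool-down before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("gate circuit open")
            # Cool-down elapsed: allow one trial call; one more failure reopens.
            self.opened_at = None
            self.failures = self.threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retries(fn, retries=3, backoff_s=5.0, sleep=time.sleep):
    """Call fn, retrying up to `retries` times with a doubling delay."""
    delay = backoff_s
    for attempt in range(retries + 1):
        try:
            return fn()
        except CircuitOpenError:
            raise  # an open breaker means "stop trying", so do not retry
        except Exception:
            if attempt == retries:
                raise
            sleep(delay)
            delay *= 2
```

With retries 3 and a 5s backoff, this sketch would wait 5s, 10s, and 20s between attempts, so the gate timeout needs to leave room for those waits or the retries never get a chance to run.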
If policy evaluation is slow:
- Optimize policy (see policy-evaluation-slow.md runbook)
- Increase policy worker count:
  stella policy config set opa.workers 4
If evidence retrieval is slow:
- Enable evidence pre-fetching:
  stella orch config set gates.evidence_prefetch true
- Increase evidence cache:
  stella orch config set evidence.cache_size 1000
  stella orch config set evidence.cache_ttl 10m
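The two evidence cache settings trade memory for fewer slow lookups. As a rough sketch of how a size limit and a TTL combine, the example below assumes cache_size is a count of entries and that the oldest-used entry is evicted first. The runbook states neither, and `TTLCache` is a hypothetical name, not the orchestrator's implementation.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Bounded cache whose entries expire after ttl_s seconds."""

    def __init__(self, max_size=1000, ttl_s=600.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl_s = ttl_s
        self.clock = clock
        self._items = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl_s)
        self._items.move_to_end(key)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used
```

A longer TTL raises the hit rate but also lengthens the window in which a gate may evaluate against stale evidence, so the TTL is worth setting per how quickly the underlying evidence changes.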
Verification
# Retry promotion
stella promotion retry <promotion-id>
# Monitor gate evaluation
stella promotion gates <promotion-id> --watch
# Check gate latency improved
stella orch gate stats --gate <gate-name> --last 10m
# Verify no timeouts
stella orch logs --filter "timeout" --last 30m
Prevention
- Timeouts: Set appropriate timeouts based on gate SLAs (default: 2m)
- Monitoring: Alert on gate P95 latency > 1m
- Caching: Enable caching for slow gates
- Circuit breakers: Enable circuit breakers for external service gates
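The monitoring rule above compares a measured P95 against a duration written in the same shorthand the config commands use (5s, 1m, 2m). A minimal sketch of that comparison follows. It assumes only the s, m, and h suffixes with whole numbers, which may not match everything the real configuration accepts, and both function names are hypothetical.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a shorthand duration such as '5s' or '2m' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def latency_alert(p95_s, threshold="1m"):
    """True when the observed P95 (seconds) exceeds the alert threshold."""
    return p95_s > parse_duration(threshold)
```

Keeping the alert threshold (1m) at half the default timeout (2m) means the alert fires while gates are slow but still completing, before promotions start timing out.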
Related Resources
- Architecture: docs/modules/release-orchestrator/gates.md
- Related runbooks: orchestrator-promotion-stuck.md, policy-evaluation-slow.md
- Dashboard: Grafana > Stella Ops > Gate Latency