Runbook: Release Orchestrator - Gate Evaluation Timeout

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-004 - Release Orchestrator Runbooks

Metadata

| Field | Value |
|-------|-------|
| Component | Release Orchestrator |
| Severity | High |
| On-call scope | Platform team |
| Last updated | 2026-01-17 |
| Doctor check | check.orchestrator.gate-timeout |

Symptoms

  • Promotion gates timing out before completing evaluation
  • Alert OrchestratorGateTimeout firing
  • Error: "gate evaluation timeout exceeded"
  • Promotion stuck waiting for gate response
  • Metric orchestrator_gate_timeout_total increasing

Impact

| Impact Type | Description |
|-------------|-------------|
| User-facing | Promotions delayed or blocked; release pipeline stalled |
| Data integrity | No data loss; promotion can be retried |
| SLA impact | Release SLO violated if timeout persists |

Diagnosis

Quick checks

  1. Check Doctor diagnostics:

    stella doctor --check check.orchestrator.gate-timeout
    
  2. Identify timed-out gates:

    stella promotion gates <promotion-id> --status timeout
    
  3. Check gate service health:

    stella orch gate-services status
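
The three quick checks can be run in one pass. The sketch below is a minimal wrapper, not part of the Stella CLI: the function name `quick_checks` is hypothetical, and it assumes each `stella` command exits non-zero when its check fails, which this runbook does not confirm.

```shell
# Hypothetical helper: run the three quick checks and count failures.
# Assumption: each stella command exits non-zero when its check fails.
quick_checks() {
  promotion_id="$1"
  failed=0

  stella doctor --check check.orchestrator.gate-timeout || failed=$((failed + 1))
  stella promotion gates "$promotion_id" --status timeout || failed=$((failed + 1))
  stella orch gate-services status || failed=$((failed + 1))

  echo "quick checks failed: $failed of 3"
  # Return success only when every check passed.
  [ "$failed" -eq 0 ]
}
```

Call it as `quick_checks <promotion-id>`. All three commands always run, so one failing check does not hide the output of the others.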
    

Deep diagnosis

  1. Check specific gate latency:

    stella orch gate stats --gate <gate-name> --last 1h
    

    Look for: P95 latency close to or above the gate timeout (default: 2m), and a rising timeout rate

  2. Check external service connectivity:

    stella orch connectivity --gate <gate-name>
    
  3. Check gate evaluation logs:

    stella orch logs --gate <gate-name> --promotion <promotion-id>
    

    Look for: slow queries and delays in external API calls

  4. Check policy engine latency (for policy gates):

    stella policy stats --last 10m
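
During an incident it can help to capture the output of all four deep-diagnosis commands in one place for later review. The following is a rough sketch: the function name `collect_gate_diagnostics` and the output file names are hypothetical, while the `stella` invocations are the ones listed above.

```shell
# Hypothetical helper: save the deep-diagnosis output for one gate to a directory.
collect_gate_diagnostics() {
  gate="$1"
  promotion_id="$2"
  out_dir="$3"

  mkdir -p "$out_dir"

  # "|| true" keeps collecting even if one command fails; its error
  # output still lands in the corresponding file.
  stella orch gate stats --gate "$gate" --last 1h > "$out_dir/gate-stats.txt" 2>&1 || true
  stella orch connectivity --gate "$gate" > "$out_dir/connectivity.txt" 2>&1 || true
  stella orch logs --gate "$gate" --promotion "$promotion_id" > "$out_dir/gate-logs.txt" 2>&1 || true
  stella policy stats --last 10m > "$out_dir/policy-stats.txt" 2>&1 || true

  echo "diagnostics written to $out_dir"
}
```

Call it as `collect_gate_diagnostics <gate-name> <promotion-id> <output-dir>`. The policy stats are only relevant for policy gates, but collecting them unconditionally keeps the helper simple.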
    

Resolution

Immediate mitigation

  1. Increase timeout for specific gate:

    stella orch config set gates.<gate-name>.timeout 5m
    stella orch reload
    
  2. Skip the timed-out gate (requires approval):

    stella promotion gate skip <promotion-id> <gate-name> \
      --reason "External service timeout - approved by <approver>"
    
  3. Retry the promotion:

    stella promotion retry <promotion-id>
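
Steps 1 and 3 are often applied together: raise the timeout, reload, then retry. A minimal sketch of that sequence follows, assuming the commands above behave as described; the function name `raise_timeout_and_retry` is hypothetical. Skipping a gate (step 2) is deliberately left out because it requires approval.

```shell
# Hypothetical helper: raise one gate's timeout, reload, then retry the promotion.
# Each step runs only if the previous one succeeded.
raise_timeout_and_retry() {
  gate="$1"
  promotion_id="$2"
  timeout="${3:-5m}"   # defaults to the 5m value used in step 1

  stella orch config set "gates.${gate}.timeout" "$timeout" &&
    stella orch reload &&
    stella promotion retry "$promotion_id"
}
```

Call it as `raise_timeout_and_retry <gate-name> <promotion-id> [timeout]`. Chaining with `&&` avoids retrying a promotion against a configuration that failed to apply or reload.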
    

Root cause fix

If an external service is slow:

  1. Configure gate retry with backoff:

    stella orch config set gates.<gate-name>.retries 3
    stella orch config set gates.<gate-name>.retry_backoff 5s
    
  2. Enable gate result caching:

    stella orch config set gates.<gate-name>.cache_ttl 5m
    
  3. Configure circuit breaker:

    stella orch config set gates.<gate-name>.circuit_breaker.enabled true
    stella orch config set gates.<gate-name>.circuit_breaker.threshold 5
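
To illustrate what the `retries` and `retry_backoff` settings are meant to achieve, here is a generic retry-with-backoff loop in shell. It is only a conceptual sketch: the function name `retry_with_backoff` is hypothetical, and the doubling delay is an assumption, since this runbook does not say whether the orchestrator uses a fixed or a growing backoff.

```shell
# Hypothetical illustration of retry with backoff (not the orchestrator's code).
# Usage: retry_with_backoff <max-retries> <initial-backoff-seconds> <command...>
retry_with_backoff() {
  max_retries="$1"
  backoff_seconds="$2"
  shift 2

  attempt=0
  while :; do
    # Run the command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_retries" ]; then
      echo "giving up after $((attempt + 1)) attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$backoff_seconds"
    # Assumption: double the wait after every failed attempt.
    backoff_seconds=$((backoff_seconds * 2))
  done
}
```

With `retries 3` and `retry_backoff 5s`, a loop like this would make up to four attempts and wait 5, 10, and 20 seconds between them. Total gate time must still fit inside the configured gate timeout, so retries and timeout should be tuned together.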
    

If policy evaluation is slow:

  1. Optimize the policy (see the policy-evaluation-slow.md runbook)

  2. Increase policy worker count:

    stella policy config set opa.workers 4
    

If evidence retrieval is slow:

  1. Enable evidence pre-fetching:

    stella orch config set gates.evidence_prefetch true
    
  2. Increase the evidence cache size and TTL:

    stella orch config set evidence.cache_size 1000
    stella orch config set evidence.cache_ttl 10m
    

Verification

# Retry promotion
stella promotion retry <promotion-id>

# Monitor gate evaluation
stella promotion gates <promotion-id> --watch

# Check gate latency improved
stella orch gate stats --gate <gate-name> --last 10m

# Verify no timeouts
stella orch logs --filter "timeout" --last 30m
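
The final check can be turned into a pass/fail result for scripting. This sketch assumes that `stella orch logs --filter "timeout"` prints one line per matching log entry and nothing when there are no matches; that output format is an assumption, and the function name `verify_no_timeouts` is hypothetical.

```shell
# Hypothetical helper: succeed only if no timeout entries were logged recently.
# Assumption: the logs command prints one line per matching entry.
verify_no_timeouts() {
  window="${1:-30m}"

  # Treat a failing logs command as "unknown" rather than "no timeouts".
  if ! logs=$(stella orch logs --filter "timeout" --last "$window"); then
    echo "could not read orchestrator logs" >&2
    return 2
  fi

  # Count non-empty lines; grep exits 1 on zero matches, hence "|| true".
  count=$(printf '%s\n' "$logs" | grep -c . || true)
  echo "timeout log lines in last $window: $count"
  [ "$count" -eq 0 ]
}
```

Call it as `verify_no_timeouts` or `verify_no_timeouts 10m`. Exit status 0 means no matching lines, 1 means timeouts were found, and 2 means the logs could not be read.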

Prevention

  • Timeouts: Set appropriate timeouts based on gate SLAs (default: 2m)
  • Monitoring: Alert on gate P95 latency > 1m
  • Caching: Enable caching for slow gates
  • Circuit breakers: Enable circuit breakers for external service gates

Related Resources

  • Architecture: docs/modules/release-orchestrator/gates.md
  • Related runbooks: orchestrator-promotion-stuck.md, policy-evaluation-slow.md
  • Dashboard: Grafana > Stella Ops > Gate Latency