Runbook: Release Orchestrator - Promotion Job Not Progressing

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-004 - Release Orchestrator Runbooks

Metadata

Field          Value
Component      Release Orchestrator
Severity       Critical
On-call scope  Platform team, Release team
Last updated   2026-01-17
Doctor check   check.orchestrator.job-health

Symptoms

  • Promotion job stuck in "in_progress" state for more than 10 minutes
  • No progress updates in promotion timeline
  • Alert OrchestratorPromotionStuck firing
  • UI shows promotion spinner indefinitely
  • Downstream environment not receiving promoted artifact
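
The 10-minute threshold in the first symptom can be checked mechanically. The following is a minimal sketch, assuming you already have the promotion's last-update time as a Unix epoch timestamp (this runbook does not say how to extract it). The helper name `is_stuck` and its arguments are hypothetical and not part of the stella CLI.

```shell
# Hypothetical helper: succeed (exit 0) when a promotion counts as stuck.
# $1 = last update time, Unix epoch seconds
# $2 = current time, Unix epoch seconds (defaults to now)
# $3 = threshold in minutes (defaults to 10, the symptom threshold above)
is_stuck() {
  is_last_update="$1"
  is_now="${2:-$(date +%s)}"
  is_threshold_min="${3:-10}"
  is_elapsed=$(( is_now - is_last_update ))
  # "Stuck" means strictly more than the threshold has elapsed.
  [ "$is_elapsed" -gt $(( is_threshold_min * 60 )) ]
}
```

For example, `is_stuck "$last_update_epoch" && echo "promotion exceeds the 10-minute threshold"`, where `last_update_epoch` is a variable you fill in yourself.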

Impact

Impact Type     Description
User-facing     Release blocked, cannot promote to target environment
Data integrity  Artifact is safe; promotion can be retried
SLA impact      Release SLO violated if not resolved within 30 minutes

Diagnosis

Quick checks

  1. Check Doctor diagnostics:

    stella doctor --check check.orchestrator.job-health
    
  2. Check promotion status:

    stella promotion status <promotion-id>
    

    Look for: Current step, last update time, any error messages

  3. Check orchestrator service:

    stella orch status
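
The three quick checks can be run back to back during an incident. This is a sketch only: it uses just the commands listed above and assumes that each one exits non-zero when it finds a problem, which this runbook does not confirm. The function name `quick_checks` is hypothetical.

```shell
# Run the three quick checks in order and count how many failed.
# ASSUMPTION: a non-zero exit status from stella means the check failed.
quick_checks() {
  qc_promotion_id="$1"
  qc_failed=0
  stella doctor --check check.orchestrator.job-health || qc_failed=$((qc_failed + 1))
  stella promotion status "$qc_promotion_id" || qc_failed=$((qc_failed + 1))
  stella orch status || qc_failed=$((qc_failed + 1))
  echo "quick checks failed: $qc_failed"
  # Exit 0 only when every check passed.
  [ "$qc_failed" -eq 0 ]
}
```

Call it as `quick_checks <promotion-id>`. All three commands run even if an earlier one fails, so a single pass shows the full picture.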
    

Deep diagnosis

  1. Get detailed promotion trace:

    stella promotion trace <promotion-id> --verbose
    

    Look for: Which step is stuck, any timeouts

  2. Check gate evaluation status:

    stella promotion gates <promotion-id>
    

    Problem if: Gate stuck waiting for external service

  3. Check target environment connectivity:

    stella orch connectivity --target <env-name>
    
  4. Check for lock contention:

    stella orch locks list
    

    Problem if: Stale locks on the artifact or environment
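
To keep evidence for the incident record, the four deep-diagnosis commands can be captured to files before any mitigation changes the state. This is a sketch under stated assumptions: the function name, output directory, and file names are hypothetical, and only the commands shown above are invoked.

```shell
# Save the output of each deep-diagnosis command in its own file.
# $1 = promotion id, $2 = target environment name, $3 = output directory
collect_diagnostics() {
  diag_promotion_id="$1"
  diag_env_name="$2"
  diag_out_dir="$3"
  mkdir -p "$diag_out_dir" || return 1
  # Capture stderr as well, so timeouts and error messages are preserved.
  stella promotion trace "$diag_promotion_id" --verbose > "$diag_out_dir/trace.txt" 2>&1
  stella promotion gates "$diag_promotion_id" > "$diag_out_dir/gates.txt" 2>&1
  stella orch connectivity --target "$diag_env_name" > "$diag_out_dir/connectivity.txt" 2>&1
  stella orch locks list > "$diag_out_dir/locks.txt" 2>&1
  echo "diagnostics written to $diag_out_dir"
}
```

Collecting everything first means the trace and lock list reflect the stuck state, not the state after a gate skip or lock release.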


Resolution

Immediate mitigation

  1. If gate is stuck waiting for external service:

    # Skip the stuck gate (requires approval)
    stella promotion gate skip <promotion-id> <gate-name> --reason "External service timeout"
    
  2. If lock is stale:

    # Release the lock (use with caution)
    stella orch locks release <lock-id> --force
    
  3. If orchestrator is unresponsive:

    stella service restart orchestrator
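
Because a forced lock release is marked "use with caution", one option is to wrap it in a confirmation step. The sketch below is illustrative: `release_lock_confirmed` is a hypothetical helper, and the only stella command it runs is the release command shown in step 2.

```shell
# Ask the operator to retype the lock id before forcing the release.
# $1 = lock id (as shown by `stella orch locks list`)
release_lock_confirmed() {
  rl_lock_id="$1"
  printf 'Type the lock id to confirm forced release of %s: ' "$rl_lock_id" >&2
  read -r rl_answer || return 1
  if [ "$rl_answer" != "$rl_lock_id" ]; then
    echo "confirmation mismatch; lock not released" >&2
    return 1
  fi
  stella orch locks release "$rl_lock_id" --force
}
```

Retyping the id guards against releasing the wrong lock when several are listed.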
    

Root cause fix

If external gate service is slow:

  1. Increase gate timeout:

    stella orch config set gates.<gate-name>.timeout 5m
    
  2. Configure gate retry:

    stella orch config set gates.<gate-name>.retries 3
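
When tuning these two settings together, it helps to check the worst-case wait against the 30-minute release SLO. The sketch below applies both settings and estimates that wait. The function names are hypothetical, and the arithmetic assumes that "retries" means extra attempts after the first, that each attempt may use the full timeout, and that there is no backoff between attempts. None of this is confirmed by this runbook.

```shell
# Apply both gate settings from the steps above in one call.
# Defaults mirror the example values in this runbook (5m timeout, 3 retries).
tune_gate() {
  tg_gate="$1"
  tg_timeout="${2:-5m}"
  tg_retries="${3:-3}"
  stella orch config set "gates.${tg_gate}.timeout" "$tg_timeout" &&
    stella orch config set "gates.${tg_gate}.retries" "$tg_retries"
}

# Worst-case minutes a single gate can block a promotion.
# ASSUMPTION: total attempts = retries + 1, each using the full timeout.
# $1 = timeout in whole minutes, $2 = retries
worst_case_gate_wait_min() {
  wc_timeout_min="$1"
  wc_retries="$2"
  echo $(( wc_timeout_min * (wc_retries + 1) ))
}
```

With the example values, `worst_case_gate_wait_min 5 3` prints 20. Under these assumptions, a single slow gate could use 20 of the 30 minutes the SLO allows.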
    

If target environment is unreachable:

  1. Check network connectivity to the target (stella orch connectivity --target <env-name>, as in Deep diagnosis)
  2. Verify credentials for target environment:

    stella orch credentials verify --target <env-name>
    

If there is database lock contention:

  1. Increase lock timeout:

    stella orch config set locks.timeout 60s
    
  2. Enable optimistic locking:

    stella orch config set locks.mode optimistic
    

Verification

# Check promotion completed
stella promotion status <promotion-id>

# Verify artifact in target environment
stella orch artifacts list --env <target-env> --filter <artifact-digest>

# Check no stuck promotions
stella promotion list --status in_progress --older-than 5m
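
After a mitigation, the last check can be polled until it comes back clean. This is a sketch that assumes `stella promotion list` prints nothing when no promotion matches the filter. This runbook does not document that behaviour, so adjust the emptiness test to the real output. The function name and its parameters are hypothetical.

```shell
# Poll the stuck-promotion list until it is empty or attempts run out.
# $1 = maximum number of checks (default 10)
# $2 = seconds to wait between checks (default 30)
# ASSUMPTION: empty output means no matching promotions.
wait_for_no_stuck() {
  wn_attempts="${1:-10}"
  wn_interval="${2:-30}"
  wn_i=1
  while [ "$wn_i" -le "$wn_attempts" ]; do
    wn_out=$(stella promotion list --status in_progress --older-than 5m)
    if [ -z "$wn_out" ]; then
      echo "no stuck promotions after $wn_i check(s)"
      return 0
    fi
    # Sleep only when another check will follow.
    if [ "$wn_i" -lt "$wn_attempts" ]; then
      sleep "$wn_interval"
    fi
    wn_i=$((wn_i + 1))
  done
  echo "still stuck after $wn_attempts check(s)" >&2
  return 1
}
```

With the defaults, the loop gives up after roughly four and a half minutes (ten checks, nine waits of 30 seconds).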

Prevention

  • Timeouts: Configure appropriate timeouts for all gates
  • Monitoring: Alert on promotions stuck for more than 10 minutes
  • Health checks: Enable connectivity pre-checks before promotion
  • Documentation: Document SLAs for external gate services

Related resources

  • Architecture: docs/modules/release-orchestrator/architecture.md
  • Related runbooks: orchestrator-gate-timeout.md, orchestrator-evidence-missing.md
  • Dashboard: Grafana > Stella Ops > Release Orchestrator