Runbook: Release Orchestrator - Promotion Job Not Progressing
Sprint: SPRINT_20260117_029_DOCS_runbook_coverage Task: RUN-004 - Release Orchestrator Runbooks
Metadata
| Field | Value |
|---|---|
| Component | Release Orchestrator |
| Severity | Critical |
| On-call scope | Platform team, Release team |
| Last updated | 2026-01-17 |
| Doctor check | check.orchestrator.job-health |
Symptoms
- Promotion job stuck in "in_progress" state for >10 minutes
- No progress updates in promotion timeline
- Alert OrchestratorPromotionStuck firing
- UI shows promotion spinner indefinitely
- Downstream environment not receiving promoted artifact
Impact
| Impact Type | Description |
|---|---|
| User-facing | Release blocked, cannot promote to target environment |
| Data integrity | Artifact is safe; promotion can be retried |
| SLA impact | Release SLO violated if not resolved within 30 minutes |
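The two time thresholds above (stuck after 10 minutes, SLO violated after 30) can be turned into a small triage helper. This is a minimal sketch: the function name and the choice to measure from the last progress update are assumptions, not part of the Stella CLI.

```shell
#!/bin/sh
# Thresholds taken from this runbook.
STUCK_AFTER_MIN=10   # "in_progress" for more than 10 minutes counts as stuck
SLO_LIMIT_MIN=30     # release SLO is violated if not resolved within 30 minutes

# classify_promotion_age <last-update-epoch> <now-epoch>
# Hypothetical helper: prints ok, stuck, or slo_violated.
# ASSUMPTION: the clock starts at the last progress update.
classify_promotion_age() {
    elapsed_s=$(( $2 - $1 ))
    if [ "$elapsed_s" -gt $(( SLO_LIMIT_MIN * 60 )) ]; then
        echo "slo_violated"
    elif [ "$elapsed_s" -gt $(( STUCK_AFTER_MIN * 60 )) ]; then
        echo "stuck"
    else
        echo "ok"
    fi
}
```

Typical use is `classify_promotion_age "$last_update_epoch" "$(date +%s)"`, where the last update time is read from the promotion status output.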
Diagnosis
Quick checks
- Check Doctor diagnostics:
    stella doctor --check check.orchestrator.job-health
- Check promotion status:
    stella promotion status <promotion-id>
  Look for: Current step, last update time, any error messages
- Check orchestrator service:
    stella orch status
Deep diagnosis
- Get detailed promotion trace:
    stella promotion trace <promotion-id> --verbose
  Look for: Which step is stuck, any timeouts
- Check gate evaluation status:
    stella promotion gates <promotion-id>
  Problem if: Gate stuck waiting for external service
- Check target environment connectivity:
    stella orch connectivity --target <env-name>
- Check for lock contention:
    stella orch locks list
  Problem if: Stale locks on the artifact or environment
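Because the mitigations below (gate skip, lock release, service restart) can erase the state being investigated, it may help to capture the output of every diagnosis command first. The sketch below runs the quick and deep checks from this section and saves each output to its own file. The function names and file names are hypothetical, and it assumes the CLI exits non-zero when a command fails.

```shell
#!/bin/sh
# run_step <file-name> <command...>
# Hypothetical helper: saves stdout and stderr of one command to a file
# in DIAG_OUT_DIR and counts the commands that exit non-zero.
run_step() {
    step_file="$DIAG_OUT_DIR/$1.txt"
    shift
    if ! "$@" >"$step_file" 2>&1; then
        echo "FAILED: $*" >&2
        DIAG_FAILURES=$((DIAG_FAILURES + 1))
    fi
}

# collect_diagnostics <promotion-id> <env-name> <out-dir>
# Returns 0 only if every diagnosis command succeeded.
# ASSUMPTION: a failing stella command exits with a non-zero status.
collect_diagnostics() {
    promotion_id=$1
    env_name=$2
    DIAG_OUT_DIR=$3
    DIAG_FAILURES=0
    mkdir -p "$DIAG_OUT_DIR" || return 1

    # Quick checks
    run_step doctor       stella doctor --check check.orchestrator.job-health
    run_step status       stella promotion status "$promotion_id"
    run_step orch-status  stella orch status
    # Deep diagnosis
    run_step trace        stella promotion trace "$promotion_id" --verbose
    run_step gates        stella promotion gates "$promotion_id"
    run_step connectivity stella orch connectivity --target "$env_name"
    run_step locks        stella orch locks list

    [ "$DIAG_FAILURES" -eq 0 ]
}
```

A call such as `collect_diagnostics <promotion-id> <env-name> ./diag-output` leaves seven text files that can be attached to the incident record.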
Resolution
Immediate mitigation
- If gate is stuck waiting for external service:
    # Skip the stuck gate (requires approval)
    stella promotion gate skip <promotion-id> <gate-name> --reason "External service timeout"
- If lock is stale:
    # Release the lock (use with caution)
    stella orch locks release <lock-id> --force
- If orchestrator is unresponsive:
    stella service restart orchestrator
Root cause fix
If external gate service is slow:
- Increase gate timeout:
    stella orch config set gates.<gate-name>.timeout 5m
- Configure gate retry:
    stella orch config set gates.<gate-name>.retries 3
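Timeout and retry settings multiply, so it is worth checking the worst-case wait against the 30-minute release SLO before applying them. The sketch below assumes that `retries` means additional attempts after the first and that each attempt may use the full timeout with no backoff; the orchestrator's actual retry semantics are not stated here and should be confirmed. Both function names are hypothetical.

```shell
#!/bin/sh
# worst_case_gate_wait_min <timeout-minutes> <retries>
# ASSUMPTION: one initial attempt plus <retries> further attempts,
# each running for the full timeout, with no delay between attempts.
worst_case_gate_wait_min() {
    echo $(( $1 * ($2 + 1) ))
}

# fits_release_slo <timeout-minutes> <retries> [slo-minutes]
# Succeeds if the worst-case wait stays under the SLO (default 30 minutes).
fits_release_slo() {
    slo_min=${3:-30}
    [ "$(worst_case_gate_wait_min "$1" "$2")" -lt "$slo_min" ]
}
```

Under these assumptions, the values shown above (timeout 5m, retries 3) give a worst case of 20 minutes for a single gate, which still fits the SLO but leaves little room for other steps.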
If target environment is unreachable:
- Check network connectivity to target
- Verify credentials for target environment:
    stella orch credentials verify --target <env-name>
If database lock contention:
- Increase lock timeout:
    stella orch config set locks.timeout 60s
- Enable optimistic locking:
    stella orch config set locks.mode optimistic
Verification
# Check promotion completed
stella promotion status <promotion-id>
# Verify artifact in target environment
stella orch artifacts list --env <target-env> --filter <artifact-digest>
# Check no stuck promotions
stella promotion list --status in_progress --older-than 5m
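After a mitigation, the last check above usually needs to be repeated until it comes back clean. A minimal polling sketch follows. `wait_until` and `no_stuck_promotions` are hypothetical helpers, and the check assumes the list command prints nothing when no promotion matches, which should be verified against real CLI output.

```shell
#!/bin/sh
# wait_until <max-attempts> <interval-seconds> <command...>
# Re-runs the command until it succeeds or the attempts run out.
wait_until() {
    max_attempts=$1
    interval_s=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Sleep only if another attempt will follow.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$interval_s"
        fi
    done
    return 1
}

# ASSUMPTION: empty output means no promotion is stuck. If the CLI prints
# a header or a "none found" message, this test must be adapted.
no_stuck_promotions() {
    [ -z "$(stella promotion list --status in_progress --older-than 5m)" ]
}
```

For example, `wait_until 10 30 no_stuck_promotions` checks up to ten times, 30 seconds apart, and returns non-zero if promotions are still stuck at the end.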
Prevention
- Timeouts: Configure appropriate timeouts for all gates
- Monitoring: Alert on promotions stuck > 10 minutes
- Health checks: Enable connectivity pre-checks before promotion
- Documentation: Document SLAs for external gate services
Related Resources
- Architecture: docs/modules/release-orchestrator/architecture.md
- Related runbooks: orchestrator-gate-timeout.md, orchestrator-evidence-missing.md
- Dashboard: Grafana > Stella Ops > Release Orchestrator