synergy moats product advisory implementations
168
docs/operations/runbooks/orchestrator-promotion-stuck.md
Normal file
@@ -0,0 +1,168 @@
# Runbook: Release Orchestrator - Promotion Job Not Progressing

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | Critical |
| **On-call scope** | Platform team, Release team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.job-health` |

---

## Symptoms

- [ ] Promotion job stuck in "in_progress" state for >10 minutes
- [ ] No progress updates in promotion timeline
- [ ] Alert `OrchestratorPromotionStuck` firing
- [ ] UI shows promotion spinner indefinitely
- [ ] Downstream environment not receiving promoted artifact

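The 10-minute threshold in the first symptom can be checked mechanically. The sketch below is a minimal example and not part of the Stella tooling: `is_promotion_stuck` and `STUCK_THRESHOLD_SECONDS` are hypothetical names, and it assumes you can obtain the promotion's last update time as epoch seconds (for example, by reading it from the promotion timeline).

```bash
# Hypothetical helper: decide whether a promotion counts as "stuck".
# Threshold taken from the Symptoms list: more than 10 minutes without progress.
STUCK_THRESHOLD_SECONDS=600

# is_promotion_stuck LAST_UPDATE_EPOCH [NOW_EPOCH]
# Exit status 0 if more than the threshold has elapsed, 1 otherwise.
# NOW_EPOCH defaults to the current time.
is_promotion_stuck() {
  local last_update="$1"
  local now="${2:-$(date +%s)}"
  local elapsed=$(( now - last_update ))
  [ "$elapsed" -gt "$STUCK_THRESHOLD_SECONDS" ]
}

# Example (values are illustrative):
#   is_promotion_stuck 1768644000 1768644700 && echo "promotion is stuck"
```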
---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Release blocked, cannot promote to target environment |
| **Data integrity** | Artifact is safe; promotion can be retried |
| **SLA impact** | Release SLO violated if not resolved within 30 minutes |

---

## Diagnosis

### Quick checks

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check check.orchestrator.job-health
   ```

2. **Check promotion status:**
   ```bash
   stella promotion status <promotion-id>
   ```
   Look for: Current step, last update time, any error messages

3. **Check orchestrator service:**
   ```bash
   stella orch status
   ```

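The quick checks above can be chained so that on-call staff run them in a fixed order. This is a minimal sketch, assuming each `stella` command exits non-zero when it detects a problem (the runbook does not state this); `run_quick_checks` is a hypothetical helper name.

```bash
# run_quick_checks PROMOTION_ID
# Runs the three quick checks in runbook order and keeps going after a failure,
# so one broken check does not hide the output of the others.
# Assumption: a non-zero exit status from stella means the check failed.
run_quick_checks() {
  local promotion_id="$1"
  local failed=0

  echo "== Doctor diagnostics"
  stella doctor --check check.orchestrator.job-health || failed=$((failed + 1))

  echo "== Promotion status"
  stella promotion status "$promotion_id" || failed=$((failed + 1))

  echo "== Orchestrator service"
  stella orch status || failed=$((failed + 1))

  echo "== $failed of 3 quick checks failed"
  [ "$failed" -eq 0 ]
}
```

Calling `run_quick_checks <promotion-id>` prints each command's output under a short header and returns non-zero if any check failed, which makes it usable from a triage script.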
### Deep diagnosis

1. **Get detailed promotion trace:**
   ```bash
   stella promotion trace <promotion-id> --verbose
   ```
   Look for: Which step is stuck, any timeouts

2. **Check gate evaluation status:**
   ```bash
   stella promotion gates <promotion-id>
   ```
   Problem if: Gate stuck waiting for external service

3. **Check target environment connectivity:**
   ```bash
   stella orch connectivity --target <env-name>
   ```

4. **Check for lock contention:**
   ```bash
   stella orch locks list
   ```
   Problem if: Stale locks on the artifact or environment

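During an incident it helps to capture the deep-diagnosis output before applying any mitigation, since a restart or lock release changes the state being inspected. The sketch below saves the output of the four commands above to files; `collect_promotion_diagnostics`, the file names, and the default directory name are hypothetical choices and not part of the Stella CLI.

```bash
# collect_promotion_diagnostics PROMOTION_ID ENV_NAME [OUT_DIR]
# Saves the output of each deep-diagnosis command to its own file and prints
# the directory that was written.
collect_promotion_diagnostics() {
  local promotion_id="$1"
  local env_name="$2"
  local out_dir="${3:-promotion-diag-$promotion_id}"

  mkdir -p "$out_dir" || return 1

  # "|| true": a failing command is itself evidence, so keep its error text
  # in the file and continue with the remaining commands.
  stella promotion trace "$promotion_id" --verbose >"$out_dir/trace.txt" 2>&1 || true
  stella promotion gates "$promotion_id" >"$out_dir/gates.txt" 2>&1 || true
  stella orch connectivity --target "$env_name" >"$out_dir/connectivity.txt" 2>&1 || true
  stella orch locks list >"$out_dir/locks.txt" 2>&1 || true

  echo "$out_dir"
}

# Example: collect_promotion_diagnostics <promotion-id> <env-name> ./incident-notes
```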
---

## Resolution

### Immediate mitigation

1. **If gate is stuck waiting for external service:**
   ```bash
   # Skip the stuck gate (requires approval)
   stella promotion gate skip <promotion-id> <gate-name> --reason "External service timeout"
   ```

2. **If lock is stale:**
   ```bash
   # Release the lock (use with caution)
   stella orch locks release <lock-id> --force
   ```

3. **If orchestrator is unresponsive:**
   ```bash
   stella service restart orchestrator
   ```

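Because two of these mitigations are destructive (skipping a gate, force-releasing a lock), it can be useful to print the command for review before anyone runs it. The following is a minimal dry-run sketch: `suggest_mitigation` and the cause labels `gate`, `lock`, and `orchestrator` are hypothetical, and the function only echoes the commands listed above without executing them.

```bash
# suggest_mitigation CAUSE [ARGS...]
# Prints (does not run) the runbook command for a diagnosed cause.
#   gate PROMOTION_ID GATE_NAME
#   lock LOCK_ID
#   orchestrator
suggest_mitigation() {
  case "$1" in
    gate)
      echo "stella promotion gate skip $2 $3 --reason \"External service timeout\""
      ;;
    lock)
      echo "stella orch locks release $2 --force"
      ;;
    orchestrator)
      echo "stella service restart orchestrator"
      ;;
    *)
      echo "unknown cause: $1 (expected gate, lock, or orchestrator)" >&2
      return 2
      ;;
  esac
}
```

The printed line can be pasted into the incident channel for the required approval before it is executed.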
### Root cause fix

**If external gate service is slow:**

1. Increase gate timeout:
   ```bash
   stella orch config set gates.<gate-name>.timeout 5m
   ```

2. Configure gate retry:
   ```bash
   stella orch config set gates.<gate-name>.retries 3
   ```

**If target environment is unreachable:**

1. Check network connectivity to target
2. Verify credentials for target environment:
   ```bash
   stella orch credentials verify --target <env-name>
   ```

**If database lock contention:**

1. Increase lock timeout:
   ```bash
   stella orch config set locks.timeout 60s
   ```

2. Enable optimistic locking:
   ```bash
   stella orch config set locks.mode optimistic
   ```

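Timeout and retry settings multiply. As a rough worked example, assume `retries` counts additional attempts after the first and that each attempt may run for the full timeout with no backoff (the runbook does not specify these semantics). Under that assumption, the example values of `5m` and `3` allow a gate to hold a promotion for up to 20 minutes, which is longer than the 10-minute stuck threshold. `worst_case_gate_wait` is a hypothetical helper.

```bash
# worst_case_gate_wait TIMEOUT_SECONDS RETRIES
# Prints the longest time, in seconds, a gate could block a promotion,
# assuming one initial attempt plus RETRIES further attempts, each of which
# may take the full timeout (assumed semantics, not confirmed by the runbook).
worst_case_gate_wait() {
  local timeout_seconds="$1"
  local retries="$2"
  echo $(( timeout_seconds * (retries + 1) ))
}

# Example with the values used above (5m = 300 s, 3 retries):
#   worst_case_gate_wait 300 3    # prints 1200, i.e. 20 minutes
```

If the result exceeds the stuck-promotion threshold, expect the `OrchestratorPromotionStuck` alert to fire while the gate is still retrying, and size the timeout or retry count accordingly.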
### Verification

```bash
# Check promotion completed
stella promotion status <promotion-id>

# Verify artifact in target environment
stella orch artifacts list --env <target-env> --filter <artifact-digest>

# Check no stuck promotions
stella promotion list --status in_progress --older-than 5m
```

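After a mitigation, the promotion may take a little while to finish, so verification is often a matter of polling rather than a single check. The sketch below is a generic polling helper; `wait_until` is a hypothetical name. It deliberately takes the success test as a command, because the output format of `stella promotion status` is not described here, so the predicate is left to the reader.

```bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds (exit 0) or TIMEOUT_SECONDS have passed.
# Returns 0 on success, 1 if the deadline is reached first.
wait_until() {
  local timeout="$1"
  local interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))

  while ! "$@"; do
    # Give up once the deadline has passed; otherwise wait and try again.
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval"
  done
}

# Example (promotion_completed is a predicate you would write yourself, e.g.
# a function that inspects the output of `stella promotion status`):
#   wait_until 300 15 promotion_completed <promotion-id> || echo "still not done"
```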
---

## Prevention

- [ ] **Timeouts:** Configure appropriate timeouts for all gates
- [ ] **Monitoring:** Alert on promotions stuck > 10 minutes
- [ ] **Health checks:** Enable connectivity pre-checks before promotion
- [ ] **Documentation:** Document SLAs for external gate services

---

## Related Resources

- **Architecture:** `docs/modules/release-orchestrator/architecture.md`
- **Related runbooks:** `orchestrator-gate-timeout.md`, `orchestrator-evidence-missing.md`
- **Dashboard:** Grafana > Stella Ops > Release Orchestrator