synergy moats product advisory implementations
189
docs/operations/runbooks/orchestrator-rollback-failed.md
Normal file

# Runbook: Release Orchestrator - Rollback Operation Failed

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | Critical |
| **On-call scope** | Platform team, Release team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.rollback-health` |

---

## Symptoms

- [ ] Rollback operation failing or stuck
- [ ] Alert `OrchestratorRollbackFailed` firing
- [ ] Error: "rollback failed" or "cannot restore previous version"
- [ ] Target environment in an inconsistent state
- [ ] Previous artifact not available for deployment

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Rollback blocked; a potentially broken release stays live in production |
| **Data integrity** | Environment may be left in a partially rolled-back state |
| **SLA impact** | Incident resolution blocked; outage extended until the rollback completes |

---

## Diagnosis

### Quick checks

1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.orchestrator.rollback-health
```

2. **Check rollback status:**
```bash
stella rollback status <rollback-id>
```

3. **Check previous deployment history:**
```bash
stella orch deployments list --env <env-name> --last 10
```
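
During an incident it can help to run all three quick checks in one pass and see every result, even if one command fails. The function below is a minimal sketch of that, not part of the `stella` CLI. The name `run_quick_checks` is hypothetical. Counting failures assumes `stella` exits non-zero when a command fails, which this runbook does not document.

```bash
# Hypothetical helper: runs the three quick checks in order and keeps going
# when one of them fails, so every result is visible in a single pass.
# Assumes `stella` is on PATH and exits non-zero on failure (not documented).
run_quick_checks() {
  local env_name="$1" rollback_id="$2" failures=0

  echo "== Doctor: check.orchestrator.rollback-health =="
  stella doctor --check check.orchestrator.rollback-health || failures=$((failures + 1))

  echo "== Rollback status: ${rollback_id} =="
  stella rollback status "$rollback_id" || failures=$((failures + 1))

  echo "== Last 10 deployments in ${env_name} =="
  stella orch deployments list --env "$env_name" --last 10 || failures=$((failures + 1))

  # Exit status is the number of commands that failed (0 = all succeeded).
  return "$failures"
}
```

Usage: `run_quick_checks <env-name> <rollback-id> 2>&1 | tee quick-checks.log` also keeps a copy of the output for the incident timeline.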

### Deep diagnosis

1. **Check why the rollback failed:**
```bash
stella rollback trace <rollback-id> --verbose
```
Look for: which step failed, and the error message it reported.

2. **Check previous artifact availability:**
```bash
stella orch artifacts get <previous-digest> --check
```
Problem if: the artifact has been deleted or is no longer in the registry.

3. **Check environment state:**
```bash
stella orch env status <env-name> --detailed
```

4. **Check for deployment locks:**
```bash
stella orch locks list --env <env-name>
```
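
When the cause is not obvious, capture the output of all four deep-diagnosis commands to files before anyone changes the environment. That way the evidence survives for the post-incident review. This is a minimal sketch. `collect_rollback_evidence` and the output directory layout are hypothetical, and the function uses only the `stella` commands listed above.

```bash
# Hypothetical helper: saves the output of each deep-diagnosis command to its
# own file. A failing command does not stop collection of the others.
collect_rollback_evidence() {
  local env_name="$1" rollback_id="$2" previous_digest="$3"
  local out_dir="${4:-rollback-evidence-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$out_dir" || return 1

  stella rollback trace "$rollback_id" --verbose       > "$out_dir/trace.txt" 2>&1 || true
  stella orch artifacts get "$previous_digest" --check > "$out_dir/artifact-check.txt" 2>&1 || true
  stella orch env status "$env_name" --detailed        > "$out_dir/env-status.txt" 2>&1 || true
  stella orch locks list --env "$env_name"             > "$out_dir/locks.txt" 2>&1 || true

  # Print the directory so it can be archived or attached to the incident.
  echo "$out_dir"
}
```

Usage: `collect_rollback_evidence <env-name> <rollback-id> <previous-digest>` writes to a timestamped directory unless a fourth argument names one.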
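
---

## Resolution

### Immediate mitigation

1. **Force-release the deployment lock if it is stuck:**
```bash
stella orch locks release --env <env-name> --force
```
Before forcing, check the lock holder with `stella orch locks list --env <env-name>` (Deep diagnosis, step 4). Force-releasing a lock that an active deployment still holds can let two operations run against the environment at once.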
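
Because `--force` overrides whatever currently holds the lock, you can wrap it so the engineer first sees the lock list and must type the environment name before the release runs. This is a minimal sketch. `force_release_lock` is a hypothetical helper that calls only the two `stella orch locks` commands from this runbook.

```bash
# Hypothetical guard around the forced lock release: show current locks,
# then require the environment name to be typed back before forcing.
force_release_lock() {
  local env_name="$1" answer=""

  stella orch locks list --env "$env_name" || return 1

  printf 'Type "%s" to force-release its deployment lock: ' "$env_name" >&2
  read -r answer || answer=""

  if [ "$answer" != "$env_name" ]; then
    echo "Confirmation did not match; lock left in place." >&2
    return 1
  fi
  stella orch locks release --env "$env_name" --force
}
```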

2. **Roll back manually to a specific artifact:**
```bash
stella deploy --env <env-name> --artifact <previous-digest> --force
```
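
Before forcing a deploy, re-run the availability check from Deep diagnosis step 2 so the forced deploy does not fail on a missing artifact. This is a minimal sketch of that ordering. `rollback_to_digest` is hypothetical, and the sketch assumes `stella orch artifacts get <digest> --check` exits non-zero when the artifact is unavailable (not documented here).

```bash
# Hypothetical helper: only force-deploy a digest that passes the
# orchestrator's artifact check. Assumes a non-zero exit means "unavailable".
rollback_to_digest() {
  local env_name="$1" digest="$2"

  if ! stella orch artifacts get "$digest" --check; then
    echo "Artifact ${digest} failed the availability check; use step 3 instead." >&2
    return 1
  fi
  stella deploy --env "$env_name" --artifact "$digest" --force
}
```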

3. **If the previous artifact is unavailable, deploy the last known good version:**
```bash
# Find the most recent successful deployment and note its digest
stella orch deployments list --env <env-name> --status success
# Redeploy that digest
stella deploy --env <env-name> --artifact <last-good-digest>
```
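
A digest copied by hand from the deployment list is easy to mistype. The sketch below rejects anything that is not `sha256:` followed by 64 lowercase hex characters before calling `stella deploy`. That format is an assumption based on the OCI content-digest convention; this runbook does not say which digest algorithm the registry uses. `is_sha256_digest` and `deploy_last_known_good` are hypothetical names.

```bash
# Returns 0 if $1 is "sha256:" followed by 64 lowercase hex characters
# (OCI-style digest; assumed, not confirmed by this runbook).
is_sha256_digest() {
  local hex="${1#sha256:}"
  [ "$hex" != "$1" ] || return 1     # the sha256: prefix must be present
  [ "${#hex}" -eq 64 ] || return 1   # sha256 is 32 bytes = 64 hex chars
  case "$hex" in
    *[!0-9a-f]*) return 1 ;;         # reject any non-hex character
  esac
  return 0
}

# Hypothetical wrapper for step 3: validate the digest, then deploy without --force.
deploy_last_known_good() {
  local env_name="$1" digest="$2"
  if ! is_sha256_digest "$digest"; then
    echo "Not a sha256 digest: ${digest}" >&2
    return 2
  fi
  stella deploy --env "$env_name" --artifact "$digest"
}
```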

### Root cause fix

**If the previous artifact is not in the registry:**

1. Check the artifact retention policy:
```bash
stella registry retention show
```

2. Restore the artifact from the backup registry:
```bash
stella registry restore --artifact <digest> --from backup
```

3. Increase artifact retention:
```bash
stella registry retention set --min-versions 10
```
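
Chain the restore in step 2 with the availability check from Deep diagnosis step 2. That way the orchestrator confirms it can resolve the artifact before you retry the rollback. This is a minimal sketch. `restore_and_verify_artifact` is hypothetical and relies on non-zero exit codes to signal failure, which is an assumption.

```bash
# Hypothetical helper: restore an artifact from the backup registry, then
# confirm the orchestrator can resolve it. Non-zero exit = failure (assumed).
restore_and_verify_artifact() {
  local digest="$1"

  stella registry restore --artifact "$digest" --from backup || return 1
  stella orch artifacts get "$digest" --check
}
```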

**If the deployment service is unavailable:**

1. Check connectivity to the deployment target:
```bash
stella orch connectivity --target <env-name>
```

2. Check the deployment agent status:
```bash
stella orch agent status --env <env-name>
```
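
Running the two checks in order narrows the failure to one layer: if the orchestrator cannot reach the target, the agent status is not yet meaningful. This is a minimal sketch. `check_deployment_path` and its return codes (1 for connectivity, 2 for the agent) are hypothetical, and the sketch assumes both commands exit non-zero on failure.

```bash
# Hypothetical triage helper: returns 1 if the target is unreachable,
# 2 if it is reachable but the deployment agent check fails, 0 otherwise.
check_deployment_path() {
  local env_name="$1"

  if ! stella orch connectivity --target "$env_name"; then
    echo "FAIL: orchestrator cannot reach ${env_name}" >&2
    return 1
  fi
  if ! stella orch agent status --env "$env_name"; then
    echo "FAIL: deployment agent unhealthy in ${env_name}" >&2
    return 2
  fi
  echo "OK: connectivity and agent status checks passed for ${env_name}"
}
```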

**If the configuration has drifted:**

1. Check the environment configuration for drift:
```bash
stella orch env config diff <env-name>
```

2. Reset the environment to its baseline state:
```bash
stella orch env reset <env-name> --to-baseline
```
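
Resetting to the baseline presumably overwrites the drifted configuration. Saving the diff first keeps the root-cause evidence. This is a minimal sketch. `reset_env_with_record` and the record file name are hypothetical. The diff's exit status is deliberately ignored, on the assumption that a diff-style command may exit non-zero when it finds differences; the runbook does not say.

```bash
# Hypothetical helper: save the configuration diff to a file, then reset.
# The diff's exit status is ignored because diff-style tools often exit
# non-zero when differences exist.
reset_env_with_record() {
  local env_name="$1"
  local record="${2:-drift-${1}-$(date -u +%Y%m%dT%H%M%SZ).txt}"

  stella orch env config diff "$env_name" > "$record" 2>&1 || true
  echo "Configuration diff saved to ${record}"

  stella orch env reset "$env_name" --to-baseline
}
```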

**If the database state is inconsistent:**

1. Verify the orchestrator database:
```bash
stella orch db verify
```

2. Repair the deployment state:
```bash
stella orch repair --deployment <deployment-id>
```
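
The two database steps fit a verify, repair-if-needed, re-verify sequence, so a repair is confirmed rather than assumed. This is a minimal sketch. `repair_if_inconsistent` is hypothetical and assumes `stella orch db verify` exits non-zero when it finds an inconsistency, which this runbook does not document.

```bash
# Hypothetical helper: repair a deployment's state only when verification
# fails, then verify again. Assumes non-zero exit from `db verify` = inconsistent.
repair_if_inconsistent() {
  local deployment_id="$1"

  if stella orch db verify; then
    echo "Orchestrator database verified; no repair needed."
    return 0
  fi

  stella orch repair --deployment "$deployment_id" || return 1

  # Re-verify so success is confirmed rather than assumed.
  stella orch db verify
}
```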

### Verification

```bash
# Verify rollback completed
stella rollback status <rollback-id>

# Verify environment state
stella orch env status <env-name>

# Verify correct version deployed
stella orch deployments current --env <env-name>

# Health check the environment
stella orch health-check --env <env-name>
```
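
You can run the four verification commands together and get a one-line pass/fail summary per command. This is a minimal sketch. `verify_rollback` and `run_check` are hypothetical, and PASS only means the command exited zero, assuming `stella` signals failure through its exit status. Still read the printed output, in particular to confirm that `deployments current` shows the digest you rolled back to.

```bash
# Run one command, show its output, and print PASS/FAIL based on exit status.
run_check() {
  local label="$1"
  shift
  if "$@"; then
    echo "PASS  ${label}"
  else
    echo "FAIL  ${label}"
    return 1
  fi
}

# Hypothetical wrapper around the four verification commands.
# Returns non-zero if any command failed.
verify_rollback() {
  local env_name="$1" rollback_id="$2" failed=0

  run_check "rollback status"    stella rollback status "$rollback_id"             || failed=1
  run_check "environment status" stella orch env status "$env_name"                || failed=1
  run_check "current deployment" stella orch deployments current --env "$env_name" || failed=1
  run_check "health-check"       stella orch health-check --env "$env_name"        || failed=1

  return "$failed"
}
```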

---

## Prevention

- [ ] **Retention:** Maintain at least 5 previous versions in registry
- [ ] **Testing:** Test rollback procedure in staging regularly
- [ ] **Monitoring:** Alert on rollback failures immediately
- [ ] **Documentation:** Document manual rollback procedures per environment

---

## Related Resources

- **Architecture:** `docs/modules/release-orchestrator/rollback.md`
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `orchestrator-evidence-missing.md`
- **Rollback procedures:** `docs/operations/rollback-procedures.md`