Runbook: Release Orchestrator - Rollback Operation Failed

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-004 - Release Orchestrator Runbooks

Metadata

Field           Value
Component       Release Orchestrator
Severity        Critical
On-call scope   Platform team, Release team
Last updated    2026-01-17
Doctor check    check.orchestrator.rollback-health

Symptoms

  • Rollback operation failing or stuck
  • Alert OrchestratorRollbackFailed firing
  • Error: "rollback failed" or "cannot restore previous version"
  • Target environment in inconsistent state
  • Previous artifact not available for deployment

Impact

Impact Type      Description
User-facing      Rollback blocked; potentially broken release in production
Data integrity   Environment may be in partial rollback state
SLA impact       Incident resolution blocked; extended outage

Diagnosis

Quick checks

  1. Check Doctor diagnostics:

    stella doctor --check check.orchestrator.rollback-health
    
  2. Check rollback status:

    stella rollback status <rollback-id>
    
  3. Check previous deployment history:

    stella orch deployments list --env <env-name> --last 10
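
The three quick checks above can be run in one pass. The following is a minimal sketch, not part of the official tooling: the `quick_checks` function name is hypothetical, and it assumes each `stella` command reports failure through a non-zero exit status.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the three quick checks in order and count failures.
# Assumption: each stella command exits non-zero when its check fails.

quick_checks() {
  local rollback_id="$1" env_name="$2" failed=0

  echo "== Doctor diagnostics =="
  stella doctor --check check.orchestrator.rollback-health || failed=$((failed + 1))

  echo "== Rollback status =="
  stella rollback status "$rollback_id" || failed=$((failed + 1))

  echo "== Recent deployments =="
  stella orch deployments list --env "$env_name" --last 10 || failed=$((failed + 1))

  # The return value is the number of commands that exited non-zero (0 to 3).
  echo "quick checks with non-zero exit: $failed"
  return "$failed"
}
```

Usage: `quick_checks <rollback-id> <env-name>`. A failing check does not stop the later ones, so all three outputs are available when triaging.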
    

Deep diagnosis

  1. Check why rollback failed:

    stella rollback trace <rollback-id> --verbose
    

    Look for: which step failed and the error message it reported

  2. Check previous artifact availability:

    stella orch artifacts get <previous-digest> --check
    

    Problem if: the artifact has been deleted or is no longer in the registry

  3. Check environment state:

    stella orch env status <env-name> --detailed
    
  4. Check for deployment locks:

    stella orch locks list --env <env-name>
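
During an incident it can help to save the deep-diagnosis output before attempting any fix. The sketch below is one possible way to do that. The `capture` and `collect_diagnosis` names, the output directory layout, and the `failures.txt` file are all hypothetical, and it assumes the `stella` commands signal problems through a non-zero exit status.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: save the output of each deep-diagnosis command to a file.

# capture <dir> <name> <command...>
# Writes combined stdout/stderr to <dir>/<name>.txt and notes non-zero exits.
capture() {
  local dir="$1" name="$2"
  shift 2
  "$@" >"$dir/$name.txt" 2>&1 || echo "$name: exit $?" >>"$dir/failures.txt"
}

# collect_diagnosis <rollback-id> <env-name> <previous-digest> [output-dir]
collect_diagnosis() {
  local rollback_id="$1" env_name="$2" prev_digest="$3"
  local dir="${4:-rollback-diagnosis-$(date +%Y%m%dT%H%M%S)}"
  mkdir -p "$dir" || return 1

  capture "$dir" trace    stella rollback trace "$rollback_id" --verbose
  capture "$dir" artifact stella orch artifacts get "$prev_digest" --check
  capture "$dir" env      stella orch env status "$env_name" --detailed
  capture "$dir" locks    stella orch locks list --env "$env_name"

  # Print the directory so the caller can attach it to the incident record.
  echo "$dir"
}
```

A command that fails does not abort the collection; its exit code is appended to `failures.txt` so the remaining evidence is still gathered.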
    

Resolution

Immediate mitigation

  1. Force-release the deployment lock if it is stuck:

    stella orch locks release --env <env-name> --force
    
  2. Roll back manually by deploying the previous artifact:

    stella deploy --env <env-name> --artifact <previous-digest> --force
    
  3. If the previous artifact is unavailable, deploy the last known good version:

    stella orch deployments list --env <env-name> --status success
    stella deploy --env <env-name> --artifact <last-good-digest>
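
The choice between steps 2 and 3 can be sketched as a small script. This is an illustration under stated assumptions rather than a supported tool: the `run` and `mitigate_rollback` names and the `DRY_RUN` variable are hypothetical, and it assumes `stella orch artifacts get <digest> --check` exits non-zero when the artifact is missing. The forced lock release in step 1 is left out on purpose, since deciding that a lock is stuck needs operator judgement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick between "redeploy previous artifact" and
# "fall back to last known good". Defaults to a dry run.

# run <command...>: print the command; execute it only when DRY_RUN=0
run() {
  echo "+ $*"
  [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

# mitigate_rollback <env-name> <previous-digest>
mitigate_rollback() {
  local env_name="$1" prev_digest="$2"

  # Assumption: --check exits non-zero when the artifact is not available.
  if stella orch artifacts get "$prev_digest" --check >/dev/null 2>&1; then
    run stella deploy --env "$env_name" --artifact "$prev_digest" --force
  else
    echo "previous artifact unavailable; choose a last known good digest from:"
    stella orch deployments list --env "$env_name" --status success
  fi
}
```

Usage: `mitigate_rollback <env-name> <previous-digest>` prints the deploy command without running it; `DRY_RUN=0 mitigate_rollback <env-name> <previous-digest>` executes it. Keeping the dry run as the default means a forced deploy never happens by accident.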
    

Root cause fix

If previous artifact not in registry:

  1. Check artifact retention policy:

    stella registry retention show
    
  2. Restore from backup registry:

    stella registry restore --artifact <digest> --from backup
    
  3. Increase artifact retention:

    stella registry retention set --min-versions 10
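
Restoring an artifact and confirming that it is usable again can be combined. The sketch below chains the restore command with the availability check from the deep diagnosis. The `restore_and_verify` name is hypothetical, and it assumes both commands report failure through a non-zero exit status.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restore an artifact from the backup registry only when
# it is missing, then re-check that it is available.

# restore_and_verify <digest>
restore_and_verify() {
  local digest="$1"

  # Skip the restore when the artifact is already present.
  if stella orch artifacts get "$digest" --check >/dev/null 2>&1; then
    echo "artifact already available: $digest"
    return 0
  fi

  stella registry restore --artifact "$digest" --from backup || {
    echo "restore failed for $digest" >&2
    return 1
  }

  # Re-check so that a restore which reports success but leaves the
  # artifact unavailable is still caught.
  stella orch artifacts get "$digest" --check
}
```

The function returns non-zero if either the restore or the follow-up check fails, so it can gate a subsequent `stella deploy`.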
    

If deployment service unavailable:

  1. Check deployment target connectivity:

    stella orch connectivity --target <env-name>
    
  2. Check deployment agent status:

    stella orch agent status --env <env-name>
    

If configuration drift:

  1. Check environment configuration:

    stella orch env config diff <env-name>
    
  2. Reset environment to known state:

    stella orch env reset <env-name> --to-baseline
    

If database state inconsistent:

  1. Check orchestrator database:

    stella orch db verify
    
  2. Repair deployment state:

    stella orch repair --deployment <deployment-id>
    

Verification

# Verify rollback completed
stella rollback status <rollback-id>

# Verify environment state
stella orch env status <env-name>

# Verify correct version deployed
stella orch deployments current --env <env-name>

# Health check the environment
stella orch health-check --env <env-name>
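
An environment may need a short time to settle after a rollback, so a single health check can fail spuriously. The following sketch retries the health check a few times before giving up. The `wait_healthy` name and the default attempt count and delay are hypothetical, and it assumes `stella orch health-check` exits non-zero while the environment is unhealthy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry the environment health check with a fixed delay.

# wait_healthy <env-name> [attempts] [delay-seconds]
wait_healthy() {
  local env_name="$1" attempts="${2:-5}" delay="${3:-10}" i=1

  while [ "$i" -le "$attempts" ]; do
    if stella orch health-check --env "$env_name"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done

  echo "still unhealthy after $attempts attempts" >&2
  return 1
}
```

Usage: `wait_healthy <env-name> 6 30` checks up to six times, 30 seconds apart. A non-zero return means the verification has not passed and the incident should stay open.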

Prevention

  • Retention: Maintain at least 5 previous versions in registry
  • Testing: Test rollback procedure in staging regularly
  • Monitoring: Alert on rollback failures immediately
  • Documentation: Document manual rollback procedures per environment

Related Resources

  • Architecture: docs/modules/release-orchestrator/rollback.md
  • Related runbooks: orchestrator-promotion-stuck.md, orchestrator-evidence-missing.md
  • Rollback procedures: docs/operations/rollback-procedures.md