Files
git.stella-ops.org/docs/operations/runbooks/backup-restore-ops.md

9.5 KiB

Sprint: SPRINT_20260117_029_Runbook_coverage_expansion

Task: RUN-004 - Backup/Restore Runbook

Backup and Restore Operations Runbook

Status: PRODUCTION-READY (2026-01-17 UTC)

Scope

Comprehensive backup and restore procedures for all Stella Ops components including database, evidence locker, configuration, and secrets.


Backup Architecture Overview

Backup Components

Component Backup Type Default Schedule Retention
PostgreSQL Full + WAL Daily full, continuous WAL 30 days
Evidence Locker Incremental Daily 90 days
Configuration Snapshot Daily + on change 90 days
Secrets Encrypted snapshot Daily 30 days
Attestation Keys Encrypted export Weekly 1 year

Storage Locations

  • Primary: /var/lib/stellaops/backups/ (local)
  • Secondary: S3/Azure Blob/GCS (configurable)
  • Offline: Removable media for air-gap scenarios

Pre-flight Checklist

Environment Verification

# Check backup service status
stella backup status

# Verify backup storage
stella doctor --check check.storage.backup

# List recent backups
stella backup list --last 7d

# Test backup restore capability
stella backup test-restore --latest --dry-run

Metrics to Watch

  • stella_backup_last_success_timestamp - Last successful backup
  • stella_backup_duration_seconds - Backup duration
  • stella_backup_size_bytes - Backup size
  • stella_restore_test_last_success - Last restore test

Standard Procedures

SP-001: Create Manual Backup

When: Before upgrades, schema changes, or major configuration changes Duration: 5-30 minutes depending on data volume

  1. Create full system backup:

    stella backup create --full --name "pre-upgrade-$(date +%Y%m%d)"
    
  2. Or create component-specific backup:

    # Database only
    stella backup create --type database --name "db-pre-migration"
    
    # Evidence locker only
    stella backup create --type evidence --name "evidence-snapshot"
    
    # Configuration only
    stella backup create --type config --name "config-backup"
    
  3. Verify backup:

    stella backup verify --name "pre-upgrade-$(date +%Y%m%d)"
    
  4. Copy to offsite storage (recommended):

    stella backup copy --name "pre-upgrade-$(date +%Y%m%d)" --destination s3://backup-bucket/
    

SP-002: Verify Backup Integrity

Frequency: Weekly Duration: 15-60 minutes

  1. List backups for verification:

    stella backup list --unverified
    
  2. Verify backup integrity:

    # Verify specific backup
    stella backup verify --name <backup-name>
    
    # Verify all unverified
    stella backup verify --all-unverified
    
  3. Test restore (non-destructive):

    stella backup test-restore --name <backup-name> --target /tmp/restore-test
    
  4. Record verification result:

    stella backup log-verification --name <backup-name> --result success
    

SP-003: Restore from Backup

CAUTION: This is a destructive operation

Full System Restore

  1. Stop all services:

    stella service stop --all
    
  2. List available backups:

    stella backup list --type full
    
  3. Restore:

    # Dry run first
    stella backup restore --name <backup-name> --dry-run
    
    # Execute restore
    stella backup restore --name <backup-name> --confirm
    
  4. Start services:

    stella service start --all
    
  5. Verify restoration:

    stella doctor --all
    stella service health
    

Component-Specific Restore

  1. Database restore:

    stella service stop --service api,release-orchestrator
    stella backup restore --type database --name <backup-name> --confirm
    stella db migrate  # Apply any pending migrations
    stella service start --service api,release-orchestrator
    
  2. Evidence locker restore:

    stella backup restore --type evidence --name <backup-name> --confirm
    stella evidence verify --mode quick
    
  3. Configuration restore:

    stella backup restore --type config --name <backup-name> --confirm
    stella service restart --graceful
    

SP-004: Point-in-Time Recovery (Database)

  1. Identify target recovery point:

    # List WAL archives
    stella backup wal-list --after <start-date> --before <end-date>
    
  2. Perform PITR:

    stella backup restore-pitr --to-time "2026-01-17T10:30:00Z" --confirm
    
  3. Verify data state:

    stella db verify-integrity
    

Backup Schedules

Configure Backup Schedule

# View current schedule
stella backup schedule show

# Set database backup schedule
stella backup schedule set --type database --cron "0 2 * * *"

# Set evidence backup schedule
stella backup schedule set --type evidence --cron "0 3 * * *"

# Set configuration backup schedule
stella backup schedule set --type config --cron "0 4 * * *" --on-change

Retention Policy

# View retention policy
stella backup retention show

# Set retention
stella backup retention set --type database --days 30
stella backup retention set --type evidence --days 90
stella backup retention set --type config --days 90

# Apply retention (cleanup old backups)
stella backup retention apply

Incident Procedures

INC-001: Backup Failure

Symptoms:

  • Alert: StellaBackupFailed
  • Missing recent backup

Investigation:

# Check backup logs
stella backup logs --last 24h

# Check disk space
stella doctor --check check.storage.diskspace,check.storage.backup

# Test backup operation
stella backup test --type database

Resolution:

  1. Disk space issue:

    stella backup retention apply --force
    stella backup cleanup --expired
    
  2. Database connectivity:

    stella doctor --check check.postgres.connectivity
    
  3. Permission issue:

    • Check backup directory permissions
    • Verify service account access
  4. Retry backup:

    stella backup create --type <failed-type> --retry
    

INC-002: Restore Failure

Symptoms:

  • Restore command fails
  • Services not starting after restore

Investigation:

# Check restore logs
stella backup restore-logs --last-attempt

# Verify backup integrity
stella backup verify --name <backup-name>

# Check disk space
stella doctor --check check.storage.diskspace

Resolution:

  1. Corrupted backup:

    # Try previous backup
    stella backup list --type <type>
    stella backup restore --name <previous-backup> --confirm
    
  2. Version mismatch:

    # Check backup version
    stella backup info --name <backup-name>
    
    # Restore with migration
    stella backup restore --name <backup-name> --with-migration
    
  3. Disk space:

    • Free space or expand volume
    • Restore to alternate location

INC-003: Backup Storage Full

Symptoms:

  • Alert: StellaBackupStorageFull
  • New backups failing

Immediate Actions:

# Check storage
stella backup storage stats

# Emergency cleanup
stella backup cleanup --keep-last 3

# Delete specific old backups
stella backup delete --older-than 14d --confirm

Resolution:

  1. Adjust retention:

    stella backup retention set --type database --days 14
    stella backup retention apply
    
  2. Expand storage:

    • Add disk space
    • Configure offsite storage
  3. Archive to cold storage:

    stella backup archive --older-than 30d --destination s3://archive-bucket/
    

Disaster Recovery Scenarios

DR-001: Complete System Loss

  1. Provision new infrastructure
  2. Install Stella Ops
  3. Restore from offsite backup:
    stella backup restore --source s3://backup-bucket/latest-full.tar.gz --confirm
    
  4. Verify all components
  5. Update DNS/load balancer

DR-002: Database Corruption

  1. Stop services
  2. Restore database from latest clean backup:
    stella backup restore --type database --name <last-known-good>
    
  3. Apply WAL to near-corruption point (PITR)
  4. Verify data integrity
  5. Resume services

DR-003: Evidence Locker Loss

  1. Restore evidence from backup:
    stella backup restore --type evidence --name <backup-name>
    
  2. Rebuild index:
    stella evidence index rebuild
    
  3. Verify anchor chain:
    stella evidence anchor verify --all
    

Offline/Air-Gap Backup

Creating Offline Backup

# Create encrypted offline bundle
stella backup create-offline \
  --output /media/usb/stellaops-backup-$(date +%Y%m%d).enc \
  --encrypt \
  --passphrase-file /secure/backup-key

# Verify offline backup
stella backup verify-offline --input /media/usb/stellaops-backup-*.enc

Restoring from Offline Backup

# Restore from offline backup
stella backup restore-offline \
  --input /media/usb/stellaops-backup-*.enc \
  --passphrase-file /secure/backup-key \
  --confirm

Monitoring Dashboard

Access: Grafana → Dashboards → Stella Ops → Backup Status

Key panels:

  • Last backup success time
  • Backup size trend
  • Backup duration
  • Restore test status
  • Storage utilization

Evidence Capture

stella backup diagnostics --output /tmp/backup-diag-$(date +%Y%m%dT%H%M%S).tar.gz

Escalation Path

  1. L1 (On-call): Retry failed backups, basic troubleshooting
  2. L2 (Platform team): Restore operations, schedule adjustments
  3. L3 (Architecture): Disaster recovery execution

Last updated: 2026-01-17 (UTC)