Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
Task: RUN-004 - Backup/Restore Runbook
Backup and Restore Operations Runbook
Status: PRODUCTION-READY (2026-01-17 UTC)
Scope
Comprehensive backup and restore procedures for all Stella Ops components including database, evidence locker, configuration, and secrets.
Backup Architecture Overview
Backup Components
| Component | Backup Type | Default Schedule | Retention |
|---|---|---|---|
| PostgreSQL | Full + WAL | Daily full, continuous WAL | 30 days |
| Evidence Locker | Incremental | Daily | 90 days |
| Configuration | Snapshot | Daily + on change | 90 days |
| Secrets | Encrypted snapshot | Daily | 30 days |
| Attestation Keys | Encrypted export | Weekly | 1 year |
Storage Locations
- Primary: /var/lib/stellaops/backups/ (local)
- Secondary: S3/Azure Blob/GCS (configurable)
- Offline: Removable media for air-gap scenarios
Pre-flight Checklist
Environment Verification
# Check backup service status
stella backup status
# Verify backup storage
stella doctor --check check.storage.backup
# List recent backups
stella backup list --last 7d
# Test backup restore capability
stella backup test-restore --latest --dry-run
Metrics to Watch
- stella_backup_last_success_timestamp: Last successful backup
- stella_backup_duration_seconds: Backup duration
- stella_backup_size_bytes: Backup size
- stella_restore_test_last_success: Last restore test
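A freshness check on the first metric can be sketched in shell. This is a minimal sketch that assumes stella_backup_last_success_timestamp is exported as Unix epoch seconds, which this runbook does not state. The function name backup_is_fresh and the threshold used below are illustrative and not part of Stella Ops.

```shell
# Hypothetical helper: succeed if the last successful backup is recent enough.
# Assumes the metric value is Unix epoch seconds (not confirmed by this runbook).
# Usage: backup_is_fresh <last_success_epoch> <now_epoch> <max_age_seconds>
backup_is_fresh() {
  last_success=$1
  now=$2
  max_age=$3
  age=$((now - last_success))
  # A negative age means clock skew or a bad value; treat it as not fresh.
  [ "$age" -ge 0 ] && [ "$age" -le "$max_age" ]
}
```

With a daily schedule, a threshold slightly above 24 hours (for example 93600 seconds, which is 26 hours) leaves room for normal variation in backup duration: `backup_is_fresh "$ts" "$(date +%s)" 93600 || echo "backup is stale"`, where `$ts` is the metric value obtained from your monitoring system.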
Standard Procedures
SP-001: Create Manual Backup
When: Before upgrades, schema changes, or major configuration changes
Duration: 5-30 minutes depending on data volume
- Create full system backup:
  stella backup create --full --name "pre-upgrade-$(date +%Y%m%d)"
- Or create component-specific backup:
  # Database only
  stella backup create --type database --name "db-pre-migration"
  # Evidence locker only
  stella backup create --type evidence --name "evidence-snapshot"
  # Configuration only
  stella backup create --type config --name "config-backup"
- Verify backup:
  stella backup verify --name "pre-upgrade-$(date +%Y%m%d)"
- Copy to offsite storage (recommended):
  stella backup copy --name "pre-upgrade-$(date +%Y%m%d)" --destination s3://backup-bucket/
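Because each SP-001 step re-evaluates $(date +%Y%m%d), a run that crosses midnight would verify or copy under a different name than the one it created. The sketch below evaluates the name once and stops at the first failing step. The wrapper manual_backup and its argument are hypothetical; the stella subcommands and flags are the ones listed in SP-001.

```shell
# Sketch: run the SP-001 steps with one fixed backup name.
# manual_backup is a hypothetical wrapper, not a Stella Ops command.
# Usage: manual_backup <offsite-destination>
manual_backup() {
  destination=$1
  # Evaluate the date once so create, verify, and copy use the same name.
  name="pre-upgrade-$(date +%Y%m%d)"
  stella backup create --full --name "$name" || return 1
  stella backup verify --name "$name" || return 1
  stella backup copy --name "$name" --destination "$destination" || return 1
  # Print the name so the caller can record it in the change ticket.
  printf '%s\n' "$name"
}
```

For example, `manual_backup s3://backup-bucket/` runs all three steps and prints the backup name only if every step succeeded.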
SP-002: Verify Backup Integrity
Frequency: Weekly
Duration: 15-60 minutes
- List backups for verification:
  stella backup list --unverified
- Verify backup integrity:
  # Verify specific backup
  stella backup verify --name <backup-name>
  # Verify all unverified
  stella backup verify --all-unverified
- Test restore (non-destructive):
  stella backup test-restore --name <backup-name> --target /tmp/restore-test
- Record verification result:
  stella backup log-verification --name <backup-name> --result success
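The verify and record steps above can be chained so that a result is logged only when verification actually passed. This is a sketch with a hypothetical wrapper name, verify_and_log. Only "--result success" appears in this runbook, so the sketch does not guess at a value for failures and reports them on stderr instead.

```shell
# Sketch: verify one backup and record the result only when verification passes.
# verify_and_log is a hypothetical wrapper, not a Stella Ops command.
# Usage: verify_and_log <backup-name>
verify_and_log() {
  name=$1
  if stella backup verify --name "$name"; then
    stella backup log-verification --name "$name" --result success
  else
    # No failure value for --result is documented here, so nothing is logged.
    echo "verification failed for $name; escalate before relying on it" >&2
    return 1
  fi
}
```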
SP-003: Restore from Backup
CAUTION: This is a destructive operation
Full System Restore
- Stop all services:
  stella service stop --all
- List available backups:
  stella backup list --type full
- Restore:
  # Dry run first
  stella backup restore --name <backup-name> --dry-run
  # Execute restore
  stella backup restore --name <backup-name> --confirm
- Start services:
  stella service start --all
- Verify restoration:
  stella doctor --all
  stella service health
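The "dry run first" rule is easy to skip under pressure, so it can help to make the destructive step depend on it. The sketch below follows the same order as the steps above and refuses to run the real restore unless the dry run exits successfully. The wrapper guarded_restore is hypothetical, and it assumes a failed dry run returns a non-zero exit status.

```shell
# Sketch: full system restore that aborts unless the dry run succeeds.
# guarded_restore is a hypothetical wrapper; the stella commands are those
# listed in SP-003. Assumes a failed dry run exits non-zero.
# Usage: guarded_restore <backup-name>
guarded_restore() {
  name=$1
  stella service stop --all || return 1
  if ! stella backup restore --name "$name" --dry-run; then
    echo "dry run failed; restore not executed, services left stopped" >&2
    return 1
  fi
  stella backup restore --name "$name" --confirm || return 1
  stella service start --all || return 1
  # Both health checks must pass for the restore to count as verified.
  stella doctor --all && stella service health
}
```

On a failed dry run the services stay stopped on purpose, so the operator can decide whether to try another backup or restart on the existing data.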
Component-Specific Restore
- Database restore:
  stella service stop --service api,release-orchestrator
  stella backup restore --type database --name <backup-name> --confirm
  stella db migrate  # Apply any pending migrations
  stella service start --service api,release-orchestrator
- Evidence locker restore:
  stella backup restore --type evidence --name <backup-name> --confirm
  stella evidence verify --mode quick
- Configuration restore:
  stella backup restore --type config --name <backup-name> --confirm
  stella service restart --graceful
SP-004: Point-in-Time Recovery (Database)
- Identify target recovery point:
  # List WAL archives
  stella backup wal-list --after <start-date> --before <end-date>
- Perform PITR:
  stella backup restore-pitr --to-time "2026-01-17T10:30:00Z" --confirm
- Verify data state:
  stella db verify-integrity
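A mistyped recovery target is one of the easier ways to restore to the wrong moment. The sketch below checks that a timestamp has the same shape as the example above before it is passed to --to-time. It assumes the command expects exactly this UTC form, which the runbook shows but does not specify, and the helper name is_utc_timestamp is hypothetical.

```shell
# Sketch: accept only timestamps shaped like 2026-01-17T10:30:00Z.
# is_utc_timestamp is a hypothetical pre-check. It validates the shape only,
# not calendar correctness (for example, it does not reject month 19).
is_utc_timestamp() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9]:[0-5][0-9]Z) return 0 ;;
    *) return 1 ;;
  esac
}
```

A local time or an offset such as +02:00 is rejected, which forces the operator to convert to UTC explicitly before starting the recovery.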
Backup Schedules
Configure Backup Schedule
# View current schedule
stella backup schedule show
# Set database backup schedule
stella backup schedule set --type database --cron "0 2 * * *"
# Set evidence backup schedule
stella backup schedule set --type evidence --cron "0 3 * * *"
# Set configuration backup schedule
stella backup schedule set --type config --cron "0 4 * * *" --on-change
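The schedules above use five-field cron expressions, so "0 2 * * *" means minute 0 of hour 2 every day. When reviewing a schedule change, a small helper that labels each field can catch swapped minute and hour values. The helper describe_cron below is hypothetical, and it assumes the --cron flag takes the standard five-field form used in these examples.

```shell
# Sketch: label the five fields of a standard cron expression.
# describe_cron is a hypothetical review aid, not a Stella Ops command.
describe_cron() {
  # read splits on whitespace without glob-expanding the "*" fields.
  read -r minute hour dom month dow extra <<EOF
$1
EOF
  if [ -z "$dow" ] || [ -n "$extra" ]; then
    echo "expected exactly 5 cron fields" >&2
    return 1
  fi
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$minute" "$hour" "$dom" "$month" "$dow"
}
```

For the database schedule, `describe_cron "0 2 * * *"` prints `minute=0 hour=2 day-of-month=* month=* day-of-week=*`.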
Retention Policy
# View retention policy
stella backup retention show
# Set retention
stella backup retention set --type database --days 30
stella backup retention set --type evidence --days 90
stella backup retention set --type config --days 90
# Apply retention (cleanup old backups)
stella backup retention apply
Incident Procedures
INC-001: Backup Failure
Symptoms:
- Alert: StellaBackupFailed
- Missing recent backup
Investigation:
# Check backup logs
stella backup logs --last 24h
# Check disk space
stella doctor --check check.storage.diskspace,check.storage.backup
# Test backup operation
stella backup test --type database
Resolution:
- Disk space issue:
  stella backup retention apply --force
  stella backup cleanup --expired
- Database connectivity:
  stella doctor --check check.postgres.connectivity
- Permission issue:
  - Check backup directory permissions
  - Verify service account access
- Retry backup:
  stella backup create --type <failed-type> --retry
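For the disk space case, an independent reading from the operating system can confirm what the Doctor check reports. The sketch below extracts the use percentage from POSIX df -P output. The helper capacity_from_df is hypothetical, and it assumes the file system name contains no spaces, which would shift the columns.

```shell
# Sketch: extract the use percentage from `df -P` output read on stdin.
# capacity_from_df is a hypothetical helper. It relies on the POSIX df -P
# layout, where the fifth column of the second line is the capacity ("87%").
capacity_from_df() {
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}
```

A typical use is `df -P /var/lib/stellaops/backups | capacity_from_df`, which prints a bare number that can be compared against a threshold before retrying the backup.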
INC-002: Restore Failure
Symptoms:
- Restore command fails
- Services not starting after restore
Investigation:
# Check restore logs
stella backup restore-logs --last-attempt
# Verify backup integrity
stella backup verify --name <backup-name>
# Check disk space
stella doctor --check check.storage.diskspace
Resolution:
- Corrupted backup:
  # Try previous backup
  stella backup list --type <type>
  stella backup restore --name <previous-backup> --confirm
- Version mismatch:
  # Check backup version
  stella backup info --name <backup-name>
  # Restore with migration
  stella backup restore --name <backup-name> --with-migration
- Disk space:
  - Free space or expand volume
  - Restore to alternate location
INC-003: Backup Storage Full
Symptoms:
- Alert: StellaBackupStorageFull
- New backups failing
Immediate Actions:
# Check storage
stella backup storage stats
# Emergency cleanup
stella backup cleanup --keep-last 3
# Delete specific old backups
stella backup delete --older-than 14d --confirm
Resolution:
- Adjust retention:
  stella backup retention set --type database --days 14
  stella backup retention apply
- Expand storage:
  - Add disk space
  - Configure offsite storage
- Archive to cold storage:
  stella backup archive --older-than 30d --destination s3://archive-bucket/
Disaster Recovery Scenarios
DR-001: Complete System Loss
- Provision new infrastructure
- Install Stella Ops
- Restore from offsite backup:
  stella backup restore --source s3://backup-bucket/latest-full.tar.gz --confirm
- Verify all components
- Update DNS/load balancer
DR-002: Database Corruption
- Stop services
- Restore database from latest clean backup:
  stella backup restore --type database --name <last-known-good>
- Replay WAL to a point just before the corruption (PITR, see SP-004)
- Verify data integrity
- Resume services
DR-003: Evidence Locker Loss
- Restore evidence from backup:
  stella backup restore --type evidence --name <backup-name>
- Rebuild index:
  stella evidence index rebuild
- Verify anchor chain:
  stella evidence anchor verify --all
Offline/Air-Gap Backup
Creating Offline Backup
# Create encrypted offline bundle
stella backup create-offline \
--output /media/usb/stellaops-backup-$(date +%Y%m%d).enc \
--encrypt \
--passphrase-file /secure/backup-key
# Verify offline backup
stella backup verify-offline --input /media/usb/stellaops-backup-*.enc
Restoring from Offline Backup
# Restore from offline backup
stella backup restore-offline \
--input /media/usb/stellaops-backup-*.enc \
--passphrase-file /secure/backup-key \
--confirm
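The wildcard in the --input arguments above expands to every matching file, so a USB drive holding several dated bundles would pass more than one path. The sketch below picks a single bundle, relying on the YYYYMMDD suffix from the create step so that name order equals date order. The helper latest_offline_bundle is hypothetical.

```shell
# Sketch: print the newest stellaops-backup-YYYYMMDD.enc file in a directory.
# latest_offline_bundle is a hypothetical helper. It relies on glob results
# being sorted by name, which for YYYYMMDD suffixes is chronological order.
# Usage: latest_offline_bundle <directory>
latest_offline_bundle() {
  dir=$1
  latest=
  for f in "$dir"/stellaops-backup-*.enc; do
    # Skip the literal pattern that is left behind when nothing matches.
    [ -e "$f" ] || continue
    latest=$f
  done
  # Fail with a non-zero status when no bundle was found.
  [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

The result can then be passed as one explicit path, for example `bundle=$(latest_offline_bundle /media/usb) && stella backup verify-offline --input "$bundle"`.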
Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → Backup Status
Key panels:
- Last backup success time
- Backup size trend
- Backup duration
- Restore test status
- Storage utilization
Evidence Capture
stella backup diagnostics --output /tmp/backup-diag-$(date +%Y%m%dT%H%M%S).tar.gz
Escalation Path
- L1 (On-call): Retry failed backups, basic troubleshooting
- L2 (Platform team): Restore operations, schedule adjustments
- L3 (Architecture): Disaster recovery execution
Last updated: 2026-01-17 (UTC)