Files
git.stella-ops.org/docs/operations/runbooks/backup-restore-ops.md

450 lines
9.5 KiB
Markdown

# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
# Task: RUN-004 - Backup/Restore Runbook
# Backup and Restore Operations Runbook
Status: PRODUCTION-READY (2026-01-17 UTC)
## Scope
Comprehensive backup and restore procedures for all Stella Ops components including database, evidence locker, configuration, and secrets.
---
## Backup Architecture Overview
### Backup Components
| Component | Backup Type | Default Schedule | Retention |
|-----------|-------------|------------------|-----------|
| PostgreSQL | Full + WAL | Daily full, continuous WAL | 30 days |
| Evidence Locker | Incremental | Daily | 90 days |
| Configuration | Snapshot | Daily + on change | 90 days |
| Secrets | Encrypted snapshot | Daily | 30 days |
| Attestation Keys | Encrypted export | Weekly | 1 year |
### Storage Locations
- **Primary:** `/var/lib/stellaops/backups/` (local)
- **Secondary:** S3/Azure Blob/GCS (configurable)
- **Offline:** Removable media for air-gap scenarios
---
## Pre-flight Checklist
### Environment Verification
```bash
# Check backup service status
stella backup status
# Verify backup storage
stella doctor --check check.storage.backup
# List recent backups
stella backup list --last 7d
# Test backup restore capability
stella backup test-restore --latest --dry-run
```
### Metrics to Watch
- `stella_backup_last_success_timestamp` - Last successful backup
- `stella_backup_duration_seconds` - Backup duration
- `stella_backup_size_bytes` - Backup size
- `stella_restore_test_last_success` - Last restore test
---
## Standard Procedures
### SP-001: Create Manual Backup
**When:** Before upgrades, schema changes, or major configuration changes
**Duration:** 5-30 minutes depending on data volume
1. Create full system backup:
```bash
stella backup create --full --name "pre-upgrade-$(date +%Y%m%d)"
```
2. Or create component-specific backup:
```bash
# Database only
stella backup create --type database --name "db-pre-migration"
# Evidence locker only
stella backup create --type evidence --name "evidence-snapshot"
# Configuration only
stella backup create --type config --name "config-backup"
```
3. Verify backup:
```bash
stella backup verify --name "pre-upgrade-$(date +%Y%m%d)"
```
4. Copy to offsite storage (recommended):
```bash
stella backup copy --name "pre-upgrade-$(date +%Y%m%d)" --destination s3://backup-bucket/
```
### SP-002: Verify Backup Integrity
**Frequency:** Weekly
**Duration:** 15-60 minutes
1. List backups for verification:
```bash
stella backup list --unverified
```
2. Verify backup integrity:
```bash
# Verify specific backup
stella backup verify --name <backup-name>
# Verify all unverified
stella backup verify --all-unverified
```
3. Test restore (non-destructive):
```bash
stella backup test-restore --name <backup-name> --target /tmp/restore-test
```
4. Record verification result:
```bash
stella backup log-verification --name <backup-name> --result success
```
### SP-003: Restore from Backup
**CAUTION: This is a destructive operation**
#### Full System Restore
1. Stop all services:
```bash
stella service stop --all
```
2. List available backups:
```bash
stella backup list --type full
```
3. Restore:
```bash
# Dry run first
stella backup restore --name <backup-name> --dry-run
# Execute restore
stella backup restore --name <backup-name> --confirm
```
4. Start services:
```bash
stella service start --all
```
5. Verify restoration:
```bash
stella doctor --all
stella service health
```
#### Component-Specific Restore
1. Database restore:
```bash
stella service stop --service api,release-orchestrator
stella backup restore --type database --name <backup-name> --confirm
stella db migrate # Apply any pending migrations
stella service start --service api,release-orchestrator
```
2. Evidence locker restore:
```bash
stella backup restore --type evidence --name <backup-name> --confirm
stella evidence verify --mode quick
```
3. Configuration restore:
```bash
stella backup restore --type config --name <backup-name> --confirm
stella service restart --graceful
```
### SP-004: Point-in-Time Recovery (Database)
1. Identify target recovery point:
```bash
# List WAL archives
stella backup wal-list --after <start-date> --before <end-date>
```
2. Perform PITR:
```bash
stella backup restore-pitr --to-time "2026-01-17T10:30:00Z" --confirm
```
3. Verify data state:
```bash
stella db verify-integrity
```
---
## Backup Schedules
### Configure Backup Schedule
```bash
# View current schedule
stella backup schedule show
# Set database backup schedule
stella backup schedule set --type database --cron "0 2 * * *"
# Set evidence backup schedule
stella backup schedule set --type evidence --cron "0 3 * * *"
# Set configuration backup schedule
stella backup schedule set --type config --cron "0 4 * * *" --on-change
```
### Retention Policy
```bash
# View retention policy
stella backup retention show
# Set retention
stella backup retention set --type database --days 30
stella backup retention set --type evidence --days 90
stella backup retention set --type config --days 90
# Apply retention (cleanup old backups)
stella backup retention apply
```
---
## Incident Procedures
### INC-001: Backup Failure
**Symptoms:**
- Alert: `StellaBackupFailed`
- Missing recent backup
**Investigation:**
```bash
# Check backup logs
stella backup logs --last 24h
# Check disk space
stella doctor --check check.storage.diskspace,check.storage.backup
# Test backup operation
stella backup test --type database
```
**Resolution:**
1. **Disk space issue:**
```bash
stella backup retention apply --force
stella backup cleanup --expired
```
2. **Database connectivity:**
```bash
stella doctor --check check.postgres.connectivity
```
3. **Permission issue:**
- Check backup directory permissions
- Verify service account access
4. **Retry backup:**
```bash
stella backup create --type <failed-type> --retry
```
### INC-002: Restore Failure
**Symptoms:**
- Restore command fails
- Services not starting after restore
**Investigation:**
```bash
# Check restore logs
stella backup restore-logs --last-attempt
# Verify backup integrity
stella backup verify --name <backup-name>
# Check disk space
stella doctor --check check.storage.diskspace
```
**Resolution:**
1. **Corrupted backup:**
```bash
# Try previous backup
stella backup list --type <type>
stella backup restore --name <previous-backup> --confirm
```
2. **Version mismatch:**
```bash
# Check backup version
stella backup info --name <backup-name>
# Restore with migration
stella backup restore --name <backup-name> --with-migration
```
3. **Disk space:**
- Free space or expand volume
- Restore to alternate location
### INC-003: Backup Storage Full
**Symptoms:**
- Alert: `StellaBackupStorageFull`
- New backups failing
**Immediate Actions:**
```bash
# Check storage
stella backup storage stats
# Emergency cleanup
stella backup cleanup --keep-last 3
# Delete specific old backups
stella backup delete --older-than 14d --confirm
```
**Resolution:**
1. **Adjust retention:**
```bash
stella backup retention set --type database --days 14
stella backup retention apply
```
2. **Expand storage:**
- Add disk space
- Configure offsite storage
3. **Archive to cold storage:**
```bash
stella backup archive --older-than 30d --destination s3://archive-bucket/
```
---
## Disaster Recovery Scenarios
### DR-001: Complete System Loss
1. Provision new infrastructure
2. Install Stella Ops
3. Restore from offsite backup:
```bash
stella backup restore --source s3://backup-bucket/latest-full.tar.gz --confirm
```
4. Verify all components
5. Update DNS/load balancer
### DR-002: Database Corruption
1. Stop services
2. Restore database from latest clean backup:
```bash
stella backup restore --type database --name <last-known-good>
```
3. Apply WAL to near-corruption point (PITR)
4. Verify data integrity
5. Resume services
### DR-003: Evidence Locker Loss
1. Restore evidence from backup:
```bash
stella backup restore --type evidence --name <backup-name>
```
2. Rebuild index:
```bash
stella evidence index rebuild
```
3. Verify anchor chain:
```bash
stella evidence anchor verify --all
```
---
## Offline/Air-Gap Backup
### Creating Offline Backup
```bash
# Create encrypted offline bundle
stella backup create-offline \
--output /media/usb/stellaops-backup-$(date +%Y%m%d).enc \
--encrypt \
--passphrase-file /secure/backup-key
# Verify offline backup
stella backup verify-offline --input /media/usb/stellaops-backup-*.enc
```
### Restoring from Offline Backup
```bash
# Restore from offline backup
stella backup restore-offline \
--input /media/usb/stellaops-backup-*.enc \
--passphrase-file /secure/backup-key \
--confirm
```
---
## Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → Backup Status
Key panels:
- Last backup success time
- Backup size trend
- Backup duration
- Restore test status
- Storage utilization
---
## Evidence Capture
```bash
stella backup diagnostics --output /tmp/backup-diag-$(date +%Y%m%dT%H%M%S).tar.gz
```
---
## Escalation Path
1. **L1 (On-call):** Retry failed backups, basic troubleshooting
2. **L2 (Platform team):** Restore operations, schedule adjustments
3. **L3 (Architecture):** Disaster recovery execution
---
_Last updated: 2026-01-17 (UTC)_