450 lines
9.5 KiB
Markdown
450 lines
9.5 KiB
Markdown
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
|
|
# Task: RUN-004 - Backup/Restore Runbook
|
|
# Backup and Restore Operations Runbook
|
|
|
|
Status: PRODUCTION-READY (2026-01-17 UTC)
|
|
|
|
## Scope
|
|
Comprehensive backup and restore procedures for all Stella Ops components including database, evidence locker, configuration, and secrets.
|
|
|
|
---
|
|
|
|
## Backup Architecture Overview
|
|
|
|
### Backup Components
|
|
|
|
| Component | Backup Type | Default Schedule | Retention |
|
|
|-----------|-------------|------------------|-----------|
|
|
| PostgreSQL | Full + WAL | Daily full, continuous WAL | 30 days |
|
|
| Evidence Locker | Incremental | Daily | 90 days |
|
|
| Configuration | Snapshot | Daily + on change | 90 days |
|
|
| Secrets | Encrypted snapshot | Daily | 30 days |
|
|
| Attestation Keys | Encrypted export | Weekly | 1 year |
|
|
|
|
### Storage Locations
|
|
|
|
- **Primary:** `/var/lib/stellaops/backups/` (local)
|
|
- **Secondary:** S3/Azure Blob/GCS (configurable)
|
|
- **Offline:** Removable media for air-gap scenarios
|
|
|
|
---
|
|
|
|
## Pre-flight Checklist
|
|
|
|
### Environment Verification
|
|
```bash
|
|
# Check backup service status
|
|
stella backup status
|
|
|
|
# Verify backup storage
|
|
stella doctor --check check.storage.backup
|
|
|
|
# List recent backups
|
|
stella backup list --last 7d
|
|
|
|
# Test backup restore capability
|
|
stella backup test-restore --latest --dry-run
|
|
```
|
|
|
|
### Metrics to Watch
|
|
- `stella_backup_last_success_timestamp` - Last successful backup
|
|
- `stella_backup_duration_seconds` - Backup duration
|
|
- `stella_backup_size_bytes` - Backup size
|
|
- `stella_restore_test_last_success` - Last restore test
|
|
|
|
---
|
|
|
|
## Standard Procedures
|
|
|
|
### SP-001: Create Manual Backup
|
|
|
|
**When:** Before upgrades, schema changes, or major configuration changes
|
|
**Duration:** 5-30 minutes depending on data volume
|
|
|
|
1. Create full system backup:
|
|
```bash
|
|
stella backup create --full --name "pre-upgrade-$(date +%Y%m%d)"
|
|
```
|
|
|
|
2. Or create component-specific backup:
|
|
```bash
|
|
# Database only
|
|
stella backup create --type database --name "db-pre-migration"
|
|
|
|
# Evidence locker only
|
|
stella backup create --type evidence --name "evidence-snapshot"
|
|
|
|
# Configuration only
|
|
stella backup create --type config --name "config-backup"
|
|
```
|
|
|
|
3. Verify backup:
|
|
```bash
|
|
stella backup verify --name "pre-upgrade-$(date +%Y%m%d)"
|
|
```
|
|
|
|
4. Copy to offsite storage (recommended):
|
|
```bash
|
|
stella backup copy --name "pre-upgrade-$(date +%Y%m%d)" --destination s3://backup-bucket/
|
|
```
|
|
|
|
### SP-002: Verify Backup Integrity
|
|
|
|
**Frequency:** Weekly
|
|
**Duration:** 15-60 minutes
|
|
|
|
1. List backups for verification:
|
|
```bash
|
|
stella backup list --unverified
|
|
```
|
|
|
|
2. Verify backup integrity:
|
|
```bash
|
|
# Verify specific backup
|
|
stella backup verify --name <backup-name>
|
|
|
|
# Verify all unverified
|
|
stella backup verify --all-unverified
|
|
```
|
|
|
|
3. Test restore (non-destructive):
|
|
```bash
|
|
stella backup test-restore --name <backup-name> --target /tmp/restore-test
|
|
```
|
|
|
|
4. Record verification result:
|
|
```bash
|
|
stella backup log-verification --name <backup-name> --result success
|
|
```
|
|
|
|
### SP-003: Restore from Backup
|
|
|
|
**CAUTION: This is a destructive operation**
|
|
|
|
#### Full System Restore
|
|
|
|
1. Stop all services:
|
|
```bash
|
|
stella service stop --all
|
|
```
|
|
|
|
2. List available backups:
|
|
```bash
|
|
stella backup list --type full
|
|
```
|
|
|
|
3. Restore:
|
|
```bash
|
|
# Dry run first
|
|
stella backup restore --name <backup-name> --dry-run
|
|
|
|
# Execute restore
|
|
stella backup restore --name <backup-name> --confirm
|
|
```
|
|
|
|
4. Start services:
|
|
```bash
|
|
stella service start --all
|
|
```
|
|
|
|
5. Verify restoration:
|
|
```bash
|
|
stella doctor --all
|
|
stella service health
|
|
```
|
|
|
|
#### Component-Specific Restore
|
|
|
|
1. Database restore:
|
|
```bash
|
|
stella service stop --service api,release-orchestrator
|
|
stella backup restore --type database --name <backup-name> --confirm
|
|
stella db migrate # Apply any pending migrations
|
|
stella service start --service api,release-orchestrator
|
|
```
|
|
|
|
2. Evidence locker restore:
|
|
```bash
|
|
stella backup restore --type evidence --name <backup-name> --confirm
|
|
stella evidence verify --mode quick
|
|
```
|
|
|
|
3. Configuration restore:
|
|
```bash
|
|
stella backup restore --type config --name <backup-name> --confirm
|
|
stella service restart --graceful
|
|
```
|
|
|
|
### SP-004: Point-in-Time Recovery (Database)
|
|
|
|
1. Identify target recovery point:
|
|
```bash
|
|
# List WAL archives
|
|
stella backup wal-list --after <start-date> --before <end-date>
|
|
```
|
|
|
|
2. Perform PITR:
|
|
```bash
|
|
stella backup restore-pitr --to-time "2026-01-17T10:30:00Z" --confirm
|
|
```
|
|
|
|
3. Verify data state:
|
|
```bash
|
|
stella db verify-integrity
|
|
```
|
|
|
|
---
|
|
|
|
## Backup Schedules
|
|
|
|
### Configure Backup Schedule
|
|
|
|
```bash
|
|
# View current schedule
|
|
stella backup schedule show
|
|
|
|
# Set database backup schedule
|
|
stella backup schedule set --type database --cron "0 2 * * *"
|
|
|
|
# Set evidence backup schedule
|
|
stella backup schedule set --type evidence --cron "0 3 * * *"
|
|
|
|
# Set configuration backup schedule
|
|
stella backup schedule set --type config --cron "0 4 * * *" --on-change
|
|
```
|
|
|
|
### Retention Policy
|
|
|
|
```bash
|
|
# View retention policy
|
|
stella backup retention show
|
|
|
|
# Set retention
|
|
stella backup retention set --type database --days 30
|
|
stella backup retention set --type evidence --days 90
|
|
stella backup retention set --type config --days 90
|
|
|
|
# Apply retention (cleanup old backups)
|
|
stella backup retention apply
|
|
```
|
|
|
|
---
|
|
|
|
## Incident Procedures
|
|
|
|
### INC-001: Backup Failure
|
|
|
|
**Symptoms:**
|
|
- Alert: `StellaBackupFailed`
|
|
- Missing recent backup
|
|
|
|
**Investigation:**
|
|
```bash
|
|
# Check backup logs
|
|
stella backup logs --last 24h
|
|
|
|
# Check disk space
|
|
stella doctor --check check.storage.diskspace,check.storage.backup
|
|
|
|
# Test backup operation
|
|
stella backup test --type database
|
|
```
|
|
|
|
**Resolution:**
|
|
|
|
1. **Disk space issue:**
|
|
```bash
|
|
stella backup retention apply --force
|
|
stella backup cleanup --expired
|
|
```
|
|
|
|
2. **Database connectivity:**
|
|
```bash
|
|
stella doctor --check check.postgres.connectivity
|
|
```
|
|
|
|
3. **Permission issue:**
|
|
- Check backup directory permissions
|
|
- Verify service account access
|
|
|
|
4. **Retry backup:**
|
|
```bash
|
|
stella backup create --type <failed-type> --retry
|
|
```
|
|
|
|
### INC-002: Restore Failure
|
|
|
|
**Symptoms:**
|
|
- Restore command fails
|
|
- Services not starting after restore
|
|
|
|
**Investigation:**
|
|
```bash
|
|
# Check restore logs
|
|
stella backup restore-logs --last-attempt
|
|
|
|
# Verify backup integrity
|
|
stella backup verify --name <backup-name>
|
|
|
|
# Check disk space
|
|
stella doctor --check check.storage.diskspace
|
|
```
|
|
|
|
**Resolution:**
|
|
|
|
1. **Corrupted backup:**
|
|
```bash
|
|
# Try previous backup
|
|
stella backup list --type <type>
|
|
stella backup restore --name <previous-backup> --confirm
|
|
```
|
|
|
|
2. **Version mismatch:**
|
|
```bash
|
|
# Check backup version
|
|
stella backup info --name <backup-name>
|
|
|
|
# Restore with migration
|
|
stella backup restore --name <backup-name> --with-migration
|
|
```
|
|
|
|
3. **Disk space:**
|
|
- Free space or expand volume
|
|
- Restore to alternate location
|
|
|
|
### INC-003: Backup Storage Full
|
|
|
|
**Symptoms:**
|
|
- Alert: `StellaBackupStorageFull`
|
|
- New backups failing
|
|
|
|
**Immediate Actions:**
|
|
```bash
|
|
# Check storage
|
|
stella backup storage stats
|
|
|
|
# Emergency cleanup
|
|
stella backup cleanup --keep-last 3
|
|
|
|
# Delete specific old backups
|
|
stella backup delete --older-than 14d --confirm
|
|
```
|
|
|
|
**Resolution:**
|
|
|
|
1. **Adjust retention:**
|
|
```bash
|
|
stella backup retention set --type database --days 14
|
|
stella backup retention apply
|
|
```
|
|
|
|
2. **Expand storage:**
|
|
- Add disk space
|
|
- Configure offsite storage
|
|
|
|
3. **Archive to cold storage:**
|
|
```bash
|
|
stella backup archive --older-than 30d --destination s3://archive-bucket/
|
|
```
|
|
|
|
---
|
|
|
|
## Disaster Recovery Scenarios
|
|
|
|
### DR-001: Complete System Loss
|
|
|
|
1. Provision new infrastructure
|
|
2. Install Stella Ops
|
|
3. Restore from offsite backup:
|
|
```bash
|
|
stella backup restore --source s3://backup-bucket/latest-full.tar.gz --confirm
|
|
```
|
|
4. Verify all components
|
|
5. Update DNS/load balancer
|
|
|
|
### DR-002: Database Corruption
|
|
|
|
1. Stop services
|
|
2. Restore database from latest clean backup:
|
|
```bash
|
|
stella backup restore --type database --name <last-known-good>
|
|
```
|
|
3. Apply WAL to near-corruption point (PITR)
|
|
4. Verify data integrity
|
|
5. Resume services
|
|
|
|
### DR-003: Evidence Locker Loss
|
|
|
|
1. Restore evidence from backup:
|
|
```bash
|
|
stella backup restore --type evidence --name <backup-name>
|
|
```
|
|
2. Rebuild index:
|
|
```bash
|
|
stella evidence index rebuild
|
|
```
|
|
3. Verify anchor chain:
|
|
```bash
|
|
stella evidence anchor verify --all
|
|
```
|
|
|
|
---
|
|
|
|
## Offline/Air-Gap Backup
|
|
|
|
### Creating Offline Backup
|
|
|
|
```bash
|
|
# Create encrypted offline bundle
|
|
stella backup create-offline \
|
|
--output /media/usb/stellaops-backup-$(date +%Y%m%d).enc \
|
|
--encrypt \
|
|
--passphrase-file /secure/backup-key
|
|
|
|
# Verify offline backup
|
|
stella backup verify-offline --input /media/usb/stellaops-backup-*.enc
|
|
```
|
|
|
|
### Restoring from Offline Backup
|
|
|
|
```bash
|
|
# Restore from offline backup
|
|
stella backup restore-offline \
|
|
--input /media/usb/stellaops-backup-*.enc \
|
|
--passphrase-file /secure/backup-key \
|
|
--confirm
|
|
```
|
|
|
|
---
|
|
|
|
## Monitoring Dashboard
|
|
|
|
Access: Grafana → Dashboards → Stella Ops → Backup Status
|
|
|
|
Key panels:
|
|
- Last backup success time
|
|
- Backup size trend
|
|
- Backup duration
|
|
- Restore test status
|
|
- Storage utilization
|
|
|
|
---
|
|
|
|
## Evidence Capture
|
|
|
|
```bash
|
|
stella backup diagnostics --output /tmp/backup-diag-$(date +%Y%m%dT%H%M%S).tar.gz
|
|
```
|
|
|
|
---
|
|
|
|
## Escalation Path
|
|
|
|
1. **L1 (On-call):** Retry failed backups, basic troubleshooting
|
|
2. **L2 (Platform team):** Restore operations, schedule adjustments
|
|
3. **L3 (Architecture):** Disaster recovery execution
|
|
|
|
---
|
|
|
|
_Last updated: 2026-01-17 (UTC)_
|