synergy moats product advisory implementations
docs/operations/runbooks/backup-restore-ops.md
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
# Task: RUN-004 - Backup/Restore Runbook
# Backup and Restore Operations Runbook

Status: PRODUCTION-READY (2026-01-17 UTC)

## Scope
Comprehensive backup and restore procedures for all Stella Ops components including database, evidence locker, configuration, and secrets.

---

## Backup Architecture Overview

### Backup Components

| Component | Backup Type | Default Schedule | Retention |
|-----------|-------------|------------------|-----------|
| PostgreSQL | Full + WAL | Daily full, continuous WAL | 30 days |
| Evidence Locker | Incremental | Daily | 90 days |
| Configuration | Snapshot | Daily + on change | 90 days |
| Secrets | Encrypted snapshot | Daily | 30 days |
| Attestation Keys | Encrypted export | Weekly | 1 year |

### Storage Locations

- **Primary:** `/var/lib/stellaops/backups/` (local)
- **Secondary:** S3/Azure Blob/GCS (configurable)
- **Offline:** Removable media for air-gap scenarios

---

## Pre-flight Checklist

### Environment Verification
```bash
# Check backup service status
stella backup status

# Verify backup storage
stella doctor --check check.storage.backup

# List recent backups
stella backup list --last 7d

# Test backup restore capability
stella backup test-restore --latest --dry-run
```

### Metrics to Watch
- `stella_backup_last_success_timestamp` - Last successful backup
- `stella_backup_duration_seconds` - Backup duration
- `stella_backup_size_bytes` - Backup size
- `stella_restore_test_last_success` - Last restore test

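
A freshness check on `stella_backup_last_success_timestamp` could be sketched as follows. This assumes the metric value is a Unix timestamp in seconds, which the runbook does not state. The function name, the placeholder timestamp, and the 26-hour threshold (a daily schedule plus two hours of grace) are illustrative. How the value is read from the metrics backend is left out.

```bash
#!/usr/bin/env bash
# Sketch: decide whether the last successful backup is too old.
# Assumption: the metric value is a Unix timestamp in seconds.

# backup_is_stale LAST_SUCCESS_EPOCH NOW_EPOCH MAX_AGE_SECONDS
# Returns 0 (true) when the backup is older than MAX_AGE_SECONDS.
backup_is_stale() {
  local last_success=$1 now=$2 max_age=$3
  (( now - last_success > max_age ))
}

# Placeholder reading, not real data: 2026-01-17T00:00:00Z.
last_success=1768608000
now=$(date +%s)

# Illustrative threshold: daily schedule plus a 2-hour grace period.
if backup_is_stale "$last_success" "$now" $((26 * 3600)); then
  echo "WARN: last successful backup is older than 26h"
else
  echo "OK: last successful backup is within 26h"
fi
```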
---

## Standard Procedures

### SP-001: Create Manual Backup

**When:** Before upgrades, schema changes, or major configuration changes
**Duration:** 5-30 minutes depending on data volume

1. Create full system backup:
   ```bash
   stella backup create --full --name "pre-upgrade-$(date +%Y%m%d)"
   ```

2. Or create component-specific backup:
   ```bash
   # Database only
   stella backup create --type database --name "db-pre-migration"

   # Evidence locker only
   stella backup create --type evidence --name "evidence-snapshot"

   # Configuration only
   stella backup create --type config --name "config-backup"
   ```

3. Verify backup:
   ```bash
   stella backup verify --name "pre-upgrade-$(date +%Y%m%d)"
   ```

4. Copy to offsite storage (recommended):
   ```bash
   stella backup copy --name "pre-upgrade-$(date +%Y%m%d)" --destination s3://backup-bucket/
   ```

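
Steps 1, 3, and 4 each re-evaluate `$(date +%Y%m%d)`, so a run that crosses midnight would verify or copy under a different name than the one created. A minimal sketch that computes the name once is shown below. It uses only the `stella` subcommands listed above. The function name is hypothetical, and it assumes `stella` exits non-zero on failure.

```bash
#!/usr/bin/env bash
# Sketch: run SP-001 steps 1, 3, and 4 with a single fixed backup name.
# Assumption: each stella command returns a non-zero status on failure.

# pre_upgrade_backup DESTINATION
pre_upgrade_backup() {
  local destination=$1
  local name
  name="pre-upgrade-$(date +%Y%m%d)"   # evaluated once, reused below

  stella backup create --full --name "$name" || return 1
  stella backup verify --name "$name" || return 1
  stella backup copy --name "$name" --destination "$destination" || return 1
  echo "Backup '$name' created, verified, and copied to $destination"
}

# Usage (bucket name taken from step 4 above):
#   pre_upgrade_backup s3://backup-bucket/
```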
### SP-002: Verify Backup Integrity

**Frequency:** Weekly
**Duration:** 15-60 minutes

1. List backups for verification:
   ```bash
   stella backup list --unverified
   ```

2. Verify backup integrity:
   ```bash
   # Verify specific backup
   stella backup verify --name <backup-name>

   # Verify all unverified
   stella backup verify --all-unverified
   ```

3. Test restore (non-destructive):
   ```bash
   stella backup test-restore --name <backup-name> --target /tmp/restore-test
   ```

4. Record verification result:
   ```bash
   stella backup log-verification --name <backup-name> --result success
   ```

### SP-003: Restore from Backup

**CAUTION: This is a destructive operation**

#### Full System Restore

1. Stop all services:
   ```bash
   stella service stop --all
   ```

2. List available backups:
   ```bash
   stella backup list --type full
   ```

3. Restore:
   ```bash
   # Dry run first
   stella backup restore --name <backup-name> --dry-run

   # Execute restore
   stella backup restore --name <backup-name> --confirm
   ```

4. Start services:
   ```bash
   stella service start --all
   ```

5. Verify restoration:
   ```bash
   stella doctor --all
   stella service health
   ```

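
The full restore sequence could be wrapped so that the destructive step only runs after a clean dry run. The sketch below is one way to do that. The function name is hypothetical, and it assumes `stella` signals failure through a non-zero exit status, which this runbook does not document.

```bash
#!/usr/bin/env bash
# Sketch: full system restore gated on a successful dry run.
# Assumption: stella commands exit non-zero when they fail.

# full_restore BACKUP_NAME
full_restore() {
  local name=${1:-}
  if [[ -z $name ]]; then
    echo "usage: full_restore <backup-name>" >&2
    return 2
  fi

  stella service stop --all || return 1

  # Dry run first; leave existing data untouched if it reports a problem.
  # Note: services stay stopped in that case so the operator can decide.
  if ! stella backup restore --name "$name" --dry-run; then
    echo "Dry run failed for '$name'; restore not attempted." >&2
    return 1
  fi

  stella backup restore --name "$name" --confirm || return 1
  stella service start --all || return 1

  # Post-restore verification, as in step 5.
  stella doctor --all && stella service health
}
```

If the dry run fails, the function returns before `--confirm` is ever issued, which keeps the failure investigation (see INC-002) separate from any data change.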
#### Component-Specific Restore

1. Database restore:
   ```bash
   stella service stop --service api,release-orchestrator
   stella backup restore --type database --name <backup-name> --confirm
   stella db migrate  # Apply any pending migrations
   stella service start --service api,release-orchestrator
   ```

2. Evidence locker restore:
   ```bash
   stella backup restore --type evidence --name <backup-name> --confirm
   stella evidence verify --mode quick
   ```

3. Configuration restore:
   ```bash
   stella backup restore --type config --name <backup-name> --confirm
   stella service restart --graceful
   ```

### SP-004: Point-in-Time Recovery (Database)

1. Identify target recovery point:
   ```bash
   # List WAL archives
   stella backup wal-list --after <start-date> --before <end-date>
   ```

2. Perform PITR:
   ```bash
   stella backup restore-pitr --to-time "2026-01-17T10:30:00Z" --confirm
   ```

3. Verify data state:
   ```bash
   stella db verify-integrity
   ```

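
Because a mistyped recovery target is easy to miss, the `--to-time` value could be shape-checked before the PITR command is issued. The sketch below assumes the CLI expects the `YYYY-MM-DDTHH:MM:SSZ` form used in step 2. The function name is hypothetical, and the check covers format only, not whether WAL archives actually reach that point.

```bash
#!/usr/bin/env bash
# Sketch: validate the shape of a PITR target timestamp.
# Assumption: the CLI expects UTC in the form YYYY-MM-DDTHH:MM:SSZ.

# is_utc_timestamp VALUE
# Returns 0 when VALUE matches the expected form (format check only).
is_utc_timestamp() {
  local re='^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])T([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]Z$'
  [[ $1 =~ $re ]]
}

target="2026-01-17T10:30:00Z"
if is_utc_timestamp "$target"; then
  echo "Target accepted: $target"
else
  echo "Refusing PITR: '$target' is not in YYYY-MM-DDTHH:MM:SSZ form" >&2
fi
```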
---

## Backup Schedules

### Configure Backup Schedule

```bash
# View current schedule
stella backup schedule show

# Set database backup schedule
stella backup schedule set --type database --cron "0 2 * * *"

# Set evidence backup schedule
stella backup schedule set --type evidence --cron "0 3 * * *"

# Set configuration backup schedule
stella backup schedule set --type config --cron "0 4 * * *" --on-change
```

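
The three cron expressions above use the standard five-field form (minute, hour, day of month, month, day of week), so the jobs are staggered one hour apart. As a rough aid for reading such expressions, the hypothetical helper below prints the daily run time for the simple "fixed minute and hour, every day" case and rejects anything more complex. The time zone is whatever the scheduler uses, which this runbook does not specify.

```bash
#!/usr/bin/env bash
# Sketch: show the daily run time of a simple five-field cron expression.

# cron_daily_time "MIN HOUR DOM MON DOW"
# Prints HH:MM for expressions that fire once a day at a fixed time;
# returns 1 for anything else (lists, ranges, steps, specific days).
cron_daily_time() {
  local min hour dom mon dow extra
  read -r min hour dom mon dow extra <<< "$1"
  [[ -z ${extra:-} && $min =~ ^[0-9]+$ && $hour =~ ^[0-9]+$ ]] || return 1
  [[ $dom == '*' && $mon == '*' && $dow == '*' ]] || return 1
  (( 10#$min < 60 && 10#$hour < 24 )) || return 1
  printf '%02d:%02d\n' "$((10#$hour))" "$((10#$min))"
}

# The schedules from the commands above.
for expr in "0 2 * * *" "0 3 * * *" "0 4 * * *"; do
  printf '%-12s -> %s\n' "$expr" "$(cron_daily_time "$expr")"
done
```

Keeping the start times apart avoids the database, evidence, and configuration backups competing for the same disk and network bandwidth.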
### Retention Policy

```bash
# View retention policy
stella backup retention show

# Set retention
stella backup retention set --type database --days 30
stella backup retention set --type evidence --days 90
stella backup retention set --type config --days 90

# Apply retention (cleanup old backups)
stella backup retention apply
```

---

## Incident Procedures

### INC-001: Backup Failure

**Symptoms:**
- Alert: `StellaBackupFailed`
- Missing recent backup

**Investigation:**
```bash
# Check backup logs
stella backup logs --last 24h

# Check disk space
stella doctor --check check.storage.diskspace,check.storage.backup

# Test backup operation
stella backup test --type database
```

**Resolution:**

1. **Disk space issue:**
   ```bash
   stella backup retention apply --force
   stella backup cleanup --expired
   ```

2. **Database connectivity:**
   ```bash
   stella doctor --check check.postgres.connectivity
   ```

3. **Permission issue:**
   - Check backup directory permissions
   - Verify service account access

4. **Retry backup:**
   ```bash
   stella backup create --type <failed-type> --retry
   ```

### INC-002: Restore Failure

**Symptoms:**
- Restore command fails
- Services not starting after restore

**Investigation:**
```bash
# Check restore logs
stella backup restore-logs --last-attempt

# Verify backup integrity
stella backup verify --name <backup-name>

# Check disk space
stella doctor --check check.storage.diskspace
```

**Resolution:**

1. **Corrupted backup:**
   ```bash
   # Try previous backup
   stella backup list --type <type>
   stella backup restore --name <previous-backup> --confirm
   ```

2. **Version mismatch:**
   ```bash
   # Check backup version
   stella backup info --name <backup-name>

   # Restore with migration
   stella backup restore --name <backup-name> --with-migration
   ```

3. **Disk space:**
   - Free space or expand volume
   - Restore to alternate location

### INC-003: Backup Storage Full

**Symptoms:**
- Alert: `StellaBackupStorageFull`
- New backups failing

**Immediate Actions:**
```bash
# Check storage
stella backup storage stats

# Emergency cleanup
stella backup cleanup --keep-last 3

# Delete specific old backups
stella backup delete --older-than 14d --confirm
```

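
Before running the emergency cleanup, the fill level of the primary backup volume can be confirmed independently of the `stella` CLI. The sketch below reads it from POSIX `df -P` for the primary storage path listed under Storage Locations. The function name and the 90% threshold are illustrative, not Stella Ops defaults.

```bash
#!/usr/bin/env bash
# Sketch: report how full the backup volume is, using POSIX df output.

# usage_percent PATH
# Prints the used-space percentage (integer, no % sign) of the
# filesystem that holds PATH.
usage_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

backup_dir=/var/lib/stellaops/backups/
threshold=90   # illustrative threshold, not a Stella Ops default

if [[ -d $backup_dir ]]; then
  used=$(usage_percent "$backup_dir")
  if (( used >= threshold )); then
    echo "Backup volume at ${used}%: cleanup or expansion needed"
  else
    echo "Backup volume at ${used}%"
  fi
else
  echo "Backup directory $backup_dir not found on this host" >&2
fi
```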
**Resolution:**

1. **Adjust retention:**
   ```bash
   stella backup retention set --type database --days 14
   stella backup retention apply
   ```

2. **Expand storage:**
   - Add disk space
   - Configure offsite storage

3. **Archive to cold storage:**
   ```bash
   stella backup archive --older-than 30d --destination s3://archive-bucket/
   ```

---

## Disaster Recovery Scenarios

### DR-001: Complete System Loss

1. Provision new infrastructure
2. Install Stella Ops
3. Restore from offsite backup:
   ```bash
   stella backup restore --source s3://backup-bucket/latest-full.tar.gz --confirm
   ```
4. Verify all components
5. Update DNS/load balancer

### DR-002: Database Corruption

1. Stop services
2. Restore database from latest clean backup:
   ```bash
   stella backup restore --type database --name <last-known-good>
   ```
3. Apply WAL to near-corruption point (PITR)
4. Verify data integrity
5. Resume services

### DR-003: Evidence Locker Loss

1. Restore evidence from backup:
   ```bash
   stella backup restore --type evidence --name <backup-name>
   ```
2. Rebuild index:
   ```bash
   stella evidence index rebuild
   ```
3. Verify anchor chain:
   ```bash
   stella evidence anchor verify --all
   ```

---

## Offline/Air-Gap Backup

### Creating Offline Backup

```bash
# Create encrypted offline bundle
stella backup create-offline \
  --output /media/usb/stellaops-backup-$(date +%Y%m%d).enc \
  --encrypt \
  --passphrase-file /secure/backup-key

# Verify offline backup
stella backup verify-offline --input /media/usb/stellaops-backup-*.enc
```

### Restoring from Offline Backup

```bash
# Restore from offline backup
stella backup restore-offline \
  --input /media/usb/stellaops-backup-*.enc \
  --passphrase-file /secure/backup-key \
  --confirm
```

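
The `stellaops-backup-*.enc` glob in the commands above expands to every bundle on the media, so with more than one bundle present `--input` would receive several paths. A small helper could pick a single file instead. The sketch below relies on the `YYYYMMDD` suffix from the create command sorting chronologically; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: select the newest date-stamped offline bundle in a directory.

# latest_bundle DIR
# Prints the newest stellaops-backup-YYYYMMDD.enc in DIR, relying on the
# date-stamped names sorting chronologically. Returns 1 if none exist.
latest_bundle() {
  local dir=$1 newest="" f
  for f in "$dir"/stellaops-backup-*.enc; do
    [[ -e $f ]] || continue            # glob matched nothing
    if [[ $f > $newest ]]; then
      newest=$f
    fi
  done
  [[ -n $newest ]] || return 1
  printf '%s\n' "$newest"
}

# Usage, with the flags from the commands above:
#   bundle=$(latest_bundle /media/usb) &&
#     stella backup verify-offline --input "$bundle"
```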
---

## Monitoring Dashboard

Access: Grafana → Dashboards → Stella Ops → Backup Status

Key panels:
- Last backup success time
- Backup size trend
- Backup duration
- Restore test status
- Storage utilization

---

## Evidence Capture

```bash
stella backup diagnostics --output /tmp/backup-diag-$(date +%Y%m%dT%H%M%S).tar.gz
```

---

## Escalation Path

1. **L1 (On-call):** Retry failed backups, basic troubleshooting
2. **L2 (Platform team):** Restore operations, schedule adjustments
3. **L3 (Architecture):** Disaster recovery execution

---

_Last updated: 2026-01-17 (UTC)_