Files

372 lines
7.6 KiB
Markdown

# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
# Task: RUN-001 - PostgreSQL Operations Runbook
# PostgreSQL Database Runbook (dev-mock ready)
Status: PRODUCTION-READY (2026-01-17 UTC)
## Scope
PostgreSQL database operations including monitoring, maintenance, backup/restore, and common incident handling for Stella Ops deployments.
---
## Pre-flight Checklist
### Environment Verification
```bash
# Check database connection
stella db ping
# Verify connection pool health
stella doctor --check check.postgres.connectivity,check.postgres.pool
# Check migration status
stella db migrations status
```
### Metrics to Watch
- `stella_postgres_connections_active` - Active connections (should be < 80% of max)
- `stella_postgres_query_duration_seconds` - P99 query latency (target: < 100ms)
- `stella_postgres_pool_waiting` - Connections waiting for pool (should be 0)
---
## Standard Procedures
### SP-001: Daily Health Check
**Frequency:** Daily or on-demand
**Duration:** ~5 minutes
1. Run comprehensive health check:
```bash
stella doctor --category database --format json > /tmp/db-health-$(date +%Y%m%d).json
```
2. Review slow queries from last 24h:
```bash
stella db queries --slow --period 24h --limit 20
```
3. Check replication status (if applicable):
```bash
stella db replication status
```
4. Verify backup completion:
```bash
stella backup status --type database
```
### SP-002: Connection Pool Tuning
**When:** Pool exhaustion alerts or high wait times
1. Check current pool usage:
```bash
stella db pool stats --detailed
```
2. Identify connection-holding queries:
```bash
stella db queries --active --sort duration
```
3. Adjust pool size (if needed):
```bash
# Review current settings
stella config get Database:MaxPoolSize
# Increase pool size
stella config set Database:MaxPoolSize 150
# Restart affected services
stella service restart --service release-orchestrator
```
4. Verify improvement:
```bash
stella db pool watch --duration 5m
```
### SP-003: Backup and Restore
**Backup:**
```bash
# Create immediate backup
stella backup create --type database --name "pre-upgrade-$(date +%Y%m%d)"
# Verify backup
stella backup verify --latest
```
**Restore:**
```bash
# List available backups
stella backup list --type database
# Restore to specific point (CAUTION: destructive)
stella backup restore --id <backup-id> --confirm
# Verify restoration
stella db ping
stella db migrations status
```
### SP-004: Migration Execution
1. Pre-migration backup:
```bash
stella backup create --type database --name "pre-migration"
```
2. Run migrations:
```bash
# Dry run first
stella db migrate --dry-run
# Apply migrations
stella db migrate
```
3. Verify migration success:
```bash
stella db migrations status
stella doctor --check check.postgres.migrations
```
---
## Incident Procedures
### INC-001: Connection Pool Exhaustion
**Symptoms:**
- Alert: `StellaPostgresPoolExhausted`
- Error logs: "connection pool exhausted, waiting for available connection"
- Increased request latency
**Investigation:**
```bash
# Check pool status
stella db pool stats
# Find long-running queries
stella db queries --active --sort duration --limit 10
# Check for connection leaks
stella db connections --by-client
```
**Resolution:**
1. **Immediate relief** - Terminate long-running queries:
```bash
# Identify stuck queries
stella db queries --active --duration ">5m"
# Terminate specific query (use with caution)
stella db query terminate --pid <pid>
```
2. **Scale pool** (if legitimate load):
```bash
stella config set Database:MaxPoolSize 200
stella service restart --graceful
```
3. **Fix leaks** (if application bug):
- Review application logs for unclosed connections
- Deploy fix to affected service
### INC-002: Slow Query Performance
**Symptoms:**
- Alert: `StellaPostgresQueryLatencyHigh`
- P99 query latency > 500ms
**Investigation:**
```bash
# Get slow query report
stella db queries --slow --period 1h --format json > /tmp/slow-queries.json
# Analyze specific query
stella db query explain --sql "SELECT ..." --analyze
# Check table statistics
stella db stats tables --sort bloat
```
**Resolution:**
1. **Index optimization:**
```bash
# Get index recommendations
stella db index suggest --table <table>
# Create recommended index
stella db index create --table <table> --columns "col1,col2"
```
2. **Vacuum/analyze:**
```bash
stella db vacuum --table <table>
stella db analyze --table <table>
```
3. **Query optimization** - Review and rewrite problematic queries
### INC-003: Database Connectivity Loss
**Symptoms:**
- Alert: `StellaPostgresConnectionFailed`
- All services reporting database connection errors
**Investigation:**
```bash
# Test basic connectivity
stella db ping
# Check DNS resolution
stella network dns-lookup <db-host>
# Check firewall/network
stella network test --host <db-host> --port 5432
```
**Resolution:**
1. **Network issue:**
- Verify security groups / firewall rules
- Check VPN/tunnel status if applicable
- Verify DNS resolution
2. **Database server issue:**
- Check PostgreSQL service status on server
- Review PostgreSQL logs
- Check disk space on database server
3. **Credential issue:**
```bash
stella db verify-credentials
stella secrets rotate --scope database
```
### INC-004: Disk Space Alert
**Symptoms:**
- Alert: `StellaPostgresDiskSpaceWarning` or `Critical`
- Database write failures
**Investigation:**
```bash
# Check disk usage
stella db disk-usage
# Find large tables
stella db stats tables --sort size --limit 20
# Check for bloat
stella db stats tables --sort bloat
```
**Resolution:**
1. **Immediate cleanup:**
```bash
# Vacuum to reclaim space
stella db vacuum --full --table <large-table>
# Clean old data (if retention policy allows)
stella db prune --table evidence_artifacts --older-than 90d --dry-run
```
2. **Archive old data:**
```bash
stella db archive --table findings_history --older-than 180d
```
3. **Expand disk** (if legitimate growth):
- Follow cloud provider procedure to expand volume
- Resize filesystem
---
## Maintenance Windows
### Weekly Maintenance (Sunday 02:00 UTC)
1. Run vacuum analyze on all tables:
```bash
stella db vacuum --analyze --all-tables
```
2. Update table statistics:
```bash
stella db analyze --all-tables
```
3. Clean temporary files:
```bash
stella db cleanup --temp-files
```
### Monthly Maintenance (First Sunday 03:00 UTC)
1. Full vacuum on large tables:
```bash
stella db vacuum --full --table findings --table verdicts
```
2. Reindex if needed:
```bash
stella db reindex --concurrently --table findings
```
3. Archive old data per retention policy:
```bash
stella db archive --apply-retention
```
---
## Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → PostgreSQL
Key panels:
- Connection pool utilization
- Query latency percentiles
- Disk usage trend
- Replication lag (if applicable)
- Active queries count
---
## Evidence Capture
For any incident, capture:
```bash
# Comprehensive database state
stella db diagnostics --output /tmp/db-diag-$(date +%Y%m%dT%H%M%S).tar.gz
```
Bundle includes:
- Connection stats
- Active queries
- Lock information
- Table statistics
- Recent slow query log
- Configuration snapshot
---
## Escalation Path
1. **L1 (On-call):** Standard procedures, restart services
2. **L2 (Database team):** Query optimization, schema changes
3. **L3 (Vendor support):** Hardware/cloud platform issues
---
_Last updated: 2026-01-17 (UTC)_