# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion # Task: RUN-001 - PostgreSQL Operations Runbook # PostgreSQL Database Runbook (dev-mock ready) Status: PRODUCTION-READY (2026-01-17 UTC) ## Scope PostgreSQL database operations including monitoring, maintenance, backup/restore, and common incident handling for Stella Ops deployments. --- ## Pre-flight Checklist ### Environment Verification ```bash # Check database connection stella db ping # Verify connection pool health stella doctor --check check.postgres.connectivity,check.postgres.pool # Check migration status stella db migrations status ``` ### Metrics to Watch - `stella_postgres_connections_active` - Active connections (should be < 80% of max) - `stella_postgres_query_duration_seconds` - P99 query latency (target: < 100ms) - `stella_postgres_pool_waiting` - Connections waiting for pool (should be 0) --- ## Standard Procedures ### SP-001: Daily Health Check **Frequency:** Daily or on-demand **Duration:** ~5 minutes 1. Run comprehensive health check: ```bash stella doctor --category database --format json > /tmp/db-health-$(date +%Y%m%d).json ``` 2. Review slow queries from last 24h: ```bash stella db queries --slow --period 24h --limit 20 ``` 3. Check replication status (if applicable): ```bash stella db replication status ``` 4. Verify backup completion: ```bash stella backup status --type database ``` ### SP-002: Connection Pool Tuning **When:** Pool exhaustion alerts or high wait times 1. Check current pool usage: ```bash stella db pool stats --detailed ``` 2. Identify connection-holding queries: ```bash stella db queries --active --sort duration ``` 3. Adjust pool size (if needed): ```bash # Review current settings stella config get Database:MaxPoolSize # Increase pool size stella config set Database:MaxPoolSize 150 # Restart affected services stella service restart --service release-orchestrator ``` 4. Verify improvement: ```bash stella db pool watch --duration 5m ``` ### SP-003: Backup and Restore **Backup:** ```bash # Create immediate backup stella backup create --type database --name "pre-upgrade-$(date +%Y%m%d)" # Verify backup stella backup verify --latest ``` **Restore:** ```bash # List available backups stella backup list --type database # Restore to specific point (CAUTION: destructive) stella backup restore --id --confirm # Verify restoration stella db ping stella db migrations status ``` ### SP-004: Migration Execution 1. Pre-migration backup: ```bash stella backup create --type database --name "pre-migration" ``` 2. Run migrations: ```bash # Dry run first stella db migrate --dry-run # Apply migrations stella db migrate ``` 3. Verify migration success: ```bash stella db migrations status stella doctor --check check.postgres.migrations ``` --- ## Incident Procedures ### INC-001: Connection Pool Exhaustion **Symptoms:** - Alert: `StellaPostgresPoolExhausted` - Error logs: "connection pool exhausted, waiting for available connection" - Increased request latency **Investigation:** ```bash # Check pool status stella db pool stats # Find long-running queries stella db queries --active --sort duration --limit 10 # Check for connection leaks stella db connections --by-client ``` **Resolution:** 1. **Immediate relief** - Terminate long-running queries: ```bash # Identify stuck queries stella db queries --active --duration ">5m" # Terminate specific query (use with caution) stella db query terminate --pid ``` 2. **Scale pool** (if legitimate load): ```bash stella config set Database:MaxPoolSize 200 stella service restart --graceful ``` 3. **Fix leaks** (if application bug): - Review application logs for unclosed connections - Deploy fix to affected service ### INC-002: Slow Query Performance **Symptoms:** - Alert: `StellaPostgresQueryLatencyHigh` - P99 query latency > 500ms **Investigation:** ```bash # Get slow query report stella db queries --slow --period 1h --format json > /tmp/slow-queries.json # Analyze specific query stella db query explain --sql "SELECT ..." --analyze # Check table statistics stella db stats tables --sort bloat ``` **Resolution:** 1. **Index optimization:** ```bash # Get index recommendations stella db index suggest --table # Create recommended index stella db index create --table
--columns "col1,col2" ``` 2. **Vacuum/analyze:** ```bash stella db vacuum --table
stella db analyze --table
``` 3. **Query optimization** - Review and rewrite problematic queries ### INC-003: Database Connectivity Loss **Symptoms:** - Alert: `StellaPostgresConnectionFailed` - All services reporting database connection errors **Investigation:** ```bash # Test basic connectivity stella db ping # Check DNS resolution stella network dns-lookup # Check firewall/network stella network test --host --port 5432 ``` **Resolution:** 1. **Network issue:** - Verify security groups / firewall rules - Check VPN/tunnel status if applicable - Verify DNS resolution 2. **Database server issue:** - Check PostgreSQL service status on server - Review PostgreSQL logs - Check disk space on database server 3. **Credential issue:** ```bash stella db verify-credentials stella secrets rotate --scope database ``` ### INC-004: Disk Space Alert **Symptoms:** - Alert: `StellaPostgresDiskSpaceWarning` or `Critical` - Database write failures **Investigation:** ```bash # Check disk usage stella db disk-usage # Find large tables stella db stats tables --sort size --limit 20 # Check for bloat stella db stats tables --sort bloat ``` **Resolution:** 1. **Immediate cleanup:** ```bash # Vacuum to reclaim space stella db vacuum --full --table # Clean old data (if retention policy allows) stella db prune --table evidence_artifacts --older-than 90d --dry-run ``` 2. **Archive old data:** ```bash stella db archive --table findings_history --older-than 180d ``` 3. **Expand disk** (if legitimate growth): - Follow cloud provider procedure to expand volume - Resize filesystem --- ## Maintenance Windows ### Weekly Maintenance (Sunday 02:00 UTC) 1. Run vacuum analyze on all tables: ```bash stella db vacuum --analyze --all-tables ``` 2. Update table statistics: ```bash stella db analyze --all-tables ``` 3. Clean temporary files: ```bash stella db cleanup --temp-files ``` ### Monthly Maintenance (First Sunday 03:00 UTC) 1. Full vacuum on large tables: ```bash stella db vacuum --full --table findings --table verdicts ``` 2. Reindex if needed: ```bash stella db reindex --concurrently --table findings ``` 3. Archive old data per retention policy: ```bash stella db archive --apply-retention ``` --- ## Monitoring Dashboard Access: Grafana → Dashboards → Stella Ops → PostgreSQL Key panels: - Connection pool utilization - Query latency percentiles - Disk usage trend - Replication lag (if applicable) - Active queries count --- ## Evidence Capture For any incident, capture: ```bash # Comprehensive database state stella db diagnostics --output /tmp/db-diag-$(date +%Y%m%dT%H%M%S).tar.gz ``` Bundle includes: - Connection stats - Active queries - Lock information - Table statistics - Recent slow query log - Configuration snapshot --- ## Escalation Path 1. **L1 (On-call):** Standard procedures, restart services 2. **L2 (Database team):** Query optimization, schema changes 3. **L3 (Vendor support):** Hardware/cloud platform issues --- _Last updated: 2026-01-17 (UTC)_