Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
Task: RUN-001 - PostgreSQL Operations Runbook
PostgreSQL Database Runbook (dev-mock ready)
Status: PRODUCTION-READY (2026-01-17 UTC)
Scope
PostgreSQL database operations including monitoring, maintenance, backup/restore, and common incident handling for Stella Ops deployments.
Pre-flight Checklist
Environment Verification
# Check database connection
stella db ping
# Verify connection pool health
stella doctor --check check.postgres.connectivity,check.postgres.pool
# Check migration status
stella db migrations status
Metrics to Watch
- stella_postgres_connections_active - Active connections (should be < 80% of max)
- stella_postgres_query_duration_seconds - P99 query latency (target: < 100ms)
- stella_postgres_pool_waiting - Connections waiting for pool (should be 0)
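The 80% guideline for active connections comes down to simple integer arithmetic. The following is a minimal sketch, assuming the active and maximum connection counts have already been read from the metrics backend. The function name `connection_usage_status` is hypothetical and is not part of the stella CLI.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the stella CLI): classify connection usage
# against the "< 80% of max" guideline.
# Usage: connection_usage_status <active> <max>
connection_usage_status() {
  local active=$1 max=$2
  if [ "$max" -le 0 ]; then
    echo "UNKNOWN"
    return 1
  fi
  # Integer percentage, rounded down.
  local pct=$(( active * 100 / max ))
  if [ "$pct" -lt 80 ]; then
    echo "OK ${pct}%"
  else
    echo "WARN ${pct}%"
  fi
}
```

For example, `connection_usage_status 120 150` reports a warning because 120 of 150 connections is exactly 80%.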
Standard Procedures
SP-001: Daily Health Check
Frequency: Daily or on-demand
Duration: ~5 minutes
- Run comprehensive health check:
  stella doctor --category database --format json > /tmp/db-health-$(date +%Y%m%d).json
- Review slow queries from last 24h:
  stella db queries --slow --period 24h --limit 20
- Check replication status (if applicable):
  stella db replication status
- Verify backup completion:
  stella backup status --type database
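The four steps of SP-001 can be wrapped in one function so that a scheduler can run them and alert on a non-zero exit. This is a sketch only: `daily_health_check`, `STELLA_BIN`, and `CHECK_REPLICATION` are hypothetical names introduced for illustration, and the subcommands are exactly those listed in the procedure.

```bash
#!/usr/bin/env bash
# Sketch of SP-001 as a single function. STELLA_BIN, CHECK_REPLICATION and
# daily_health_check are hypothetical names, not part of the stella CLI.
STELLA_BIN="${STELLA_BIN:-stella}"

daily_health_check() {
  local out_dir="${1:-/tmp}"
  local stamp failed=0
  stamp="$(date +%Y%m%d)"

  # Step 1: full health report, saved as JSON for later comparison.
  "$STELLA_BIN" doctor --category database --format json \
    > "${out_dir}/db-health-${stamp}.json" || failed=1
  # Step 2: slow queries from the last 24 hours.
  "$STELLA_BIN" db queries --slow --period 24h --limit 20 || failed=1
  # Step 3: replication status (set CHECK_REPLICATION=0 where not applicable).
  if [ "${CHECK_REPLICATION:-1}" = "1" ]; then
    "$STELLA_BIN" db replication status || failed=1
  fi
  # Step 4: confirm the database backup completed.
  "$STELLA_BIN" backup status --type database || failed=1

  # Non-zero if any step failed, so the caller can raise an alert.
  return "$failed"
}
```

Every step runs even if an earlier one fails, so a single run still collects as much of the picture as possible.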
SP-002: Connection Pool Tuning
When: Pool exhaustion alerts or high wait times
- Check current pool usage:
  stella db pool stats --detailed
- Identify connection-holding queries:
  stella db queries --active --sort duration
- Adjust pool size (if needed):
  # Review current settings
  stella config get Database:MaxPoolSize
  # Increase pool size
  stella config set Database:MaxPoolSize 150
  # Restart affected services
  stella service restart --service release-orchestrator
- Verify improvement:
  stella db pool watch --duration 5m
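Before raising the pool size, it is worth checking that the total still fits under the server's own limit. PostgreSQL caps connections with `max_connections` and holds some back via `superuser_reserved_connections`. The sketch below assumes each service instance keeps its own pool of `Database:MaxPoolSize` connections, which this runbook does not confirm. `pool_fits` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: does a proposed pool size fit under PostgreSQL's limit?
# Assumes one pool per service instance (an assumption, not confirmed here).
# Usage: pool_fits <max_connections> <reserved> <instances> <pool_size>
pool_fits() {
  local max_conn=$1 reserved=$2 instances=$3 pool_size=$4
  local needed=$(( instances * pool_size ))
  local available=$(( max_conn - reserved ))
  if [ "$needed" -le "$available" ]; then
    echo "fits: ${needed} of ${available} usable connections"
  else
    echo "too large: ${needed} needed, ${available} usable"
    return 1
  fi
}
```

The server-side values can be read in psql with `SHOW max_connections;` and `SHOW superuser_reserved_connections;`.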
SP-003: Backup and Restore
Backup:
# Create immediate backup
stella backup create --type database --name "pre-upgrade-$(date +%Y%m%d)"
# Verify backup
stella backup verify --latest
Restore:
# List available backups
stella backup list --type database
# Restore to specific point (CAUTION: destructive)
stella backup restore --id <backup-id> --confirm
# Verify restoration
stella db ping
stella db migrations status
SP-004: Migration Execution
- Pre-migration backup:
  stella backup create --type database --name "pre-migration"
- Run migrations:
  # Dry run first
  stella db migrate --dry-run
  # Apply migrations
  stella db migrate
- Verify migration success:
  stella db migrations status
  stella doctor --check check.postgres.migrations
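The order of SP-004 matters: the real migration should never run if the backup or the dry run failed. A minimal sketch of that gating follows. `run_migration` and `STELLA_BIN` are hypothetical names, and the sketch assumes each stella command exits non-zero on failure.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper for SP-004. Assumes stella exits non-zero on failure.
STELLA_BIN="${STELLA_BIN:-stella}"

# Each step runs only if the previous one succeeded, so a failed backup or
# dry run never reaches the real migration.
run_migration() {
  "$STELLA_BIN" backup create --type database --name "pre-migration" \
    && "$STELLA_BIN" db migrate --dry-run \
    && "$STELLA_BIN" db migrate \
    && "$STELLA_BIN" db migrations status \
    && "$STELLA_BIN" doctor --check check.postgres.migrations
}
```

The function returns the exit status of the first failing step, or zero when all five succeed.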
Incident Procedures
INC-001: Connection Pool Exhaustion
Symptoms:
- Alert: StellaPostgresPoolExhausted
- Error logs: "connection pool exhausted, waiting for available connection"
- Increased request latency
Investigation:
# Check pool status
stella db pool stats
# Find long-running queries
stella db queries --active --sort duration --limit 10
# Check for connection leaks
stella db connections --by-client
Resolution:
- Immediate relief - Terminate long-running queries:
  # Identify stuck queries
  stella db queries --active --duration ">5m"
  # Terminate specific query (use with caution)
  stella db query terminate --pid <pid>
- Scale pool (if legitimate load):
  stella config set Database:MaxPoolSize 200
  stella service restart --graceful
- Fix leaks (if application bug):
  - Review application logs for unclosed connections
  - Deploy fix to affected service
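If the stella CLI itself cannot be used during the incident, a similar view is available directly from PostgreSQL. The sketch below is not a description of what `stella db queries` does internally. It only prints a query against the standard `pg_stat_activity` view, and `long_running_sql` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical fallback: print SQL listing sessions busy for more than N minutes.
# Usage: long_running_sql <minutes>
long_running_sql() {
  local minutes=$1
  # Reject anything that is not a plain non-negative integer before it reaches SQL.
  case "$minutes" in
    ''|*[!0-9]*) echo "minutes must be a non-negative integer" >&2; return 1 ;;
  esac
  # state <> 'idle' keeps 'idle in transaction' sessions, a common sign of leaks.
  cat <<SQL
SELECT pid, usename, application_name, state,
       now() - query_start AS running_for,
       left(query, 80) AS query_preview
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - interval '${minutes} minutes'
ORDER BY query_start;
SQL
}

# Example (assumes psql can reach the database via the usual PG* variables):
#   long_running_sql 5 | psql -X
# To end one backend after review: SELECT pg_terminate_backend(<pid>);
```

Grouping the same view by `application_name` or `client_addr` gives a rough per-client connection count when hunting for a leak.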
INC-002: Slow Query Performance
Symptoms:
- Alert: StellaPostgresQueryLatencyHigh
- P99 query latency > 500ms
Investigation:
# Get slow query report
stella db queries --slow --period 1h --format json > /tmp/slow-queries.json
# Analyze specific query
stella db query explain --sql "SELECT ..." --analyze
# Check table statistics
stella db stats tables --sort bloat
Resolution:
- Index optimization:
  # Get index recommendations
  stella db index suggest --table <table>
  # Create recommended index
  stella db index create --table <table> --columns "col1,col2"
- Vacuum/analyze:
  stella db vacuum --table <table>
  stella db analyze --table <table>
- Query optimization - Review and rewrite problematic queries
INC-003: Database Connectivity Loss
Symptoms:
- Alert: StellaPostgresConnectionFailed
- All services reporting database connection errors
Investigation:
# Test basic connectivity
stella db ping
# Check DNS resolution
stella network dns-lookup <db-host>
# Check firewall/network
stella network test --host <db-host> --port 5432
Resolution:
- Network issue:
  - Verify security groups / firewall rules
  - Check VPN/tunnel status if applicable
  - Verify DNS resolution
- Database server issue:
  - Check PostgreSQL service status on server
  - Review PostgreSQL logs
  - Check disk space on database server
- Credential issue:
  stella db verify-credentials
  stella secrets rotate --scope database
INC-004: Disk Space Alert
Symptoms:
- Alert: StellaPostgresDiskSpaceWarning or Critical
- Database write failures
Investigation:
# Check disk usage
stella db disk-usage
# Find large tables
stella db stats tables --sort size --limit 20
# Check for bloat
stella db stats tables --sort bloat
Resolution:
- Immediate cleanup:
  # Vacuum to reclaim space
  stella db vacuum --full --table <large-table>
  # Clean old data (if retention policy allows)
  stella db prune --table evidence_artifacts --older-than 90d --dry-run
- Archive old data:
  stella db archive --table findings_history --older-than 180d
- Expand disk (if legitimate growth):
  - Follow cloud provider procedure to expand volume
  - Resize filesystem
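One caveat with the full vacuum during a disk alert: in PostgreSQL, `VACUUM FULL` writes a complete new copy of the table before releasing the old one, and it holds an exclusive lock on the table while doing so. On a nearly full volume it can therefore fail or make things worse. The sketch below checks headroom first. `free_kib` and `can_vacuum_full` are hypothetical helpers, and the check has to run on the database host against the filesystem that holds the data directory.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; run on the database host.

# Free space in KiB on the filesystem holding <path>, from POSIX df output.
free_kib() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# VACUUM FULL needs roughly the table's own size in free space, because the
# old copy is only dropped once the new one is complete.
# Usage: can_vacuum_full <table_kib> <free_kib>
can_vacuum_full() {
  local table_kib=$1 free=$2
  [ "$free" -gt "$table_kib" ]
}

# Example (the data directory path is an assumption):
#   can_vacuum_full "$table_kib" "$(free_kib /var/lib/postgresql)" \
#     || echo "not enough headroom; prune, archive, or expand the disk first"
```

When headroom is short, the prune, archive, and expand steps are the safer first moves.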
Maintenance Windows
Weekly Maintenance (Sunday 02:00 UTC)
- Run vacuum analyze on all tables:
  stella db vacuum --analyze --all-tables
- Update table statistics:
  stella db analyze --all-tables
- Clean temporary files:
  stella db cleanup --temp-files
Monthly Maintenance (First Sunday 03:00 UTC)
- Full vacuum on large tables:
  stella db vacuum --full --table findings --table verdicts
- Reindex if needed:
  stella db reindex --concurrently --table findings
- Archive old data per retention policy:
  stella db archive --apply-retention
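Scheduling the monthly window with cron has a known pitfall: when both the day-of-month and day-of-week fields are restricted, standard cron runs the job if either matches, so `0 3 1-7 * 0` fires on every Sunday and on every one of days 1 to 7. A common workaround is to schedule the job for every Sunday and guard it in the script. The sketch below shows such a guard. `is_first_sunday` is a hypothetical helper, and how these windows are actually scheduled is not stated in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical guard for the monthly maintenance window.
# Usage: is_first_sunday <weekday 1-7, Monday=1> <day of month>
is_first_sunday() {
  local weekday=$1
  local dom=$(( 10#$2 ))   # force base 10 so "08" and "09" are not read as octal
  [ "$weekday" -eq 7 ] && [ "$dom" -le 7 ]
}

# Example inside a script that cron starts every Sunday at 03:00 UTC:
#   if is_first_sunday "$(date -u +%u)" "$(date -u +%d)"; then
#     stella db archive --apply-retention
#   fi
```

Passing the date parts as arguments keeps the function easy to test for any calendar day.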
Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → PostgreSQL
Key panels:
- Connection pool utilization
- Query latency percentiles
- Disk usage trend
- Replication lag (if applicable)
- Active queries count
Evidence Capture
For any incident, capture:
# Comprehensive database state
stella db diagnostics --output /tmp/db-diag-$(date +%Y%m%dT%H%M%S).tar.gz
Bundle includes:
- Connection stats
- Active queries
- Lock information
- Table statistics
- Recent slow query log
- Configuration snapshot
Escalation Path
- L1 (On-call): Standard procedures, restart services
- L2 (Database team): Query optimization, schema changes
- L3 (Vendor support): Hardware/cloud platform issues
Last updated: 2026-01-17 (UTC)