Files
git.stella-ops.org/docs/operations/runbooks/postgres-ops.md

7.6 KiB

Sprint: SPRINT_20260117_029_Runbook_coverage_expansion

Task: RUN-001 - PostgreSQL Operations Runbook

PostgreSQL Database Runbook (dev-mock ready)

Status: PRODUCTION-READY (2026-01-17 UTC)

Scope

PostgreSQL database operations including monitoring, maintenance, backup/restore, and common incident handling for Stella Ops deployments.


Pre-flight Checklist

Environment Verification

# Check database connection
stella db ping

# Verify connection pool health
stella doctor --check check.postgres.connectivity,check.postgres.pool

# Check migration status
stella db migrations status

Metrics to Watch

  • stella_postgres_connections_active - Active connections (should be < 80% of max)
  • stella_postgres_query_duration_seconds - P99 query latency (target: < 100ms)
  • stella_postgres_pool_waiting - Connections waiting for pool (should be 0)

Standard Procedures

SP-001: Daily Health Check

Frequency: Daily or on-demand Duration: ~5 minutes

  1. Run comprehensive health check:

    stella doctor --category database --format json > /tmp/db-health-$(date +%Y%m%d).json
    
  2. Review slow queries from last 24h:

    stella db queries --slow --period 24h --limit 20
    
  3. Check replication status (if applicable):

    stella db replication status
    
  4. Verify backup completion:

    stella backup status --type database
    

SP-002: Connection Pool Tuning

When: Pool exhaustion alerts or high wait times

  1. Check current pool usage:

    stella db pool stats --detailed
    
  2. Identify connection-holding queries:

    stella db queries --active --sort duration
    
  3. Adjust pool size (if needed):

    # Review current settings
    stella config get Database:MaxPoolSize
    
    # Increase pool size
    stella config set Database:MaxPoolSize 150
    
    # Restart affected services
    stella service restart --service release-orchestrator
    
  4. Verify improvement:

    stella db pool watch --duration 5m
    

SP-003: Backup and Restore

Backup:

# Create immediate backup
stella backup create --type database --name "pre-upgrade-$(date +%Y%m%d)"

# Verify backup
stella backup verify --latest

Restore:

# List available backups
stella backup list --type database

# Restore to specific point (CAUTION: destructive)
stella backup restore --id <backup-id> --confirm

# Verify restoration
stella db ping
stella db migrations status

SP-004: Migration Execution

  1. Pre-migration backup:

    stella backup create --type database --name "pre-migration"
    
  2. Run migrations:

    # Dry run first
    stella db migrate --dry-run
    
    # Apply migrations
    stella db migrate
    
  3. Verify migration success:

    stella db migrations status
    stella doctor --check check.postgres.migrations
    

Incident Procedures

INC-001: Connection Pool Exhaustion

Symptoms:

  • Alert: StellaPostgresPoolExhausted
  • Error logs: "connection pool exhausted, waiting for available connection"
  • Increased request latency

Investigation:

# Check pool status
stella db pool stats

# Find long-running queries
stella db queries --active --sort duration --limit 10

# Check for connection leaks
stella db connections --by-client

Resolution:

  1. Immediate relief - Terminate long-running queries:

    # Identify stuck queries
    stella db queries --active --duration ">5m"
    
    # Terminate specific query (use with caution)
    stella db query terminate --pid <pid>
    
  2. Scale pool (if legitimate load):

    stella config set Database:MaxPoolSize 200
    stella service restart --graceful
    
  3. Fix leaks (if application bug):

    • Review application logs for unclosed connections
    • Deploy fix to affected service

INC-002: Slow Query Performance

Symptoms:

  • Alert: StellaPostgresQueryLatencyHigh
  • P99 query latency > 500ms

Investigation:

# Get slow query report
stella db queries --slow --period 1h --format json > /tmp/slow-queries.json

# Analyze specific query
stella db query explain --sql "SELECT ..." --analyze

# Check table statistics
stella db stats tables --sort bloat

Resolution:

  1. Index optimization:

    # Get index recommendations
    stella db index suggest --table <table>
    
    # Create recommended index
    stella db index create --table <table> --columns "col1,col2"
    
  2. Vacuum/analyze:

    stella db vacuum --table <table>
    stella db analyze --table <table>
    
  3. Query optimization - Review and rewrite problematic queries

INC-003: Database Connectivity Loss

Symptoms:

  • Alert: StellaPostgresConnectionFailed
  • All services reporting database connection errors

Investigation:

# Test basic connectivity
stella db ping

# Check DNS resolution
stella network dns-lookup <db-host>

# Check firewall/network
stella network test --host <db-host> --port 5432

Resolution:

  1. Network issue:

    • Verify security groups / firewall rules
    • Check VPN/tunnel status if applicable
    • Verify DNS resolution
  2. Database server issue:

    • Check PostgreSQL service status on server
    • Review PostgreSQL logs
    • Check disk space on database server
  3. Credential issue:

    stella db verify-credentials
    stella secrets rotate --scope database
    

INC-004: Disk Space Alert

Symptoms:

  • Alert: StellaPostgresDiskSpaceWarning or Critical
  • Database write failures

Investigation:

# Check disk usage
stella db disk-usage

# Find large tables
stella db stats tables --sort size --limit 20

# Check for bloat
stella db stats tables --sort bloat

Resolution:

  1. Immediate cleanup:

    # Vacuum to reclaim space
    stella db vacuum --full --table <large-table>
    
    # Clean old data (if retention policy allows)
    stella db prune --table evidence_artifacts --older-than 90d --dry-run
    
  2. Archive old data:

    stella db archive --table findings_history --older-than 180d
    
  3. Expand disk (if legitimate growth):

    • Follow cloud provider procedure to expand volume
    • Resize filesystem

Maintenance Windows

Weekly Maintenance (Sunday 02:00 UTC)

  1. Run vacuum analyze on all tables:

    stella db vacuum --analyze --all-tables
    
  2. Update table statistics:

    stella db analyze --all-tables
    
  3. Clean temporary files:

    stella db cleanup --temp-files
    

Monthly Maintenance (First Sunday 03:00 UTC)

  1. Full vacuum on large tables:

    stella db vacuum --full --table findings --table verdicts
    
  2. Reindex if needed:

    stella db reindex --concurrently --table findings
    
  3. Archive old data per retention policy:

    stella db archive --apply-retention
    

Monitoring Dashboard

Access: Grafana → Dashboards → Stella Ops → PostgreSQL

Key panels:

  • Connection pool utilization
  • Query latency percentiles
  • Disk usage trend
  • Replication lag (if applicable)
  • Active queries count

Evidence Capture

For any incident, capture:

# Comprehensive database state
stella db diagnostics --output /tmp/db-diag-$(date +%Y%m%dT%H%M%S).tar.gz

Bundle includes:

  • Connection stats
  • Active queries
  • Lock information
  • Table statistics
  • Recent slow query log
  • Configuration snapshot

Escalation Path

  1. L1 (On-call): Standard procedures, restart services
  2. L2 (Database team): Query optimization, schema changes
  3. L3 (Vendor support): Hardware/cloud platform issues

Last updated: 2026-01-17 (UTC)