372 lines
7.6 KiB
Markdown
372 lines
7.6 KiB
Markdown
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
|
|
# Task: RUN-001 - PostgreSQL Operations Runbook
|
|
# PostgreSQL Database Runbook (dev-mock ready)
|
|
|
|
Status: PRODUCTION-READY (2026-01-17 UTC)
|
|
|
|
## Scope
|
|
PostgreSQL database operations including monitoring, maintenance, backup/restore, and common incident handling for Stella Ops deployments.
|
|
|
|
---
|
|
|
|
## Pre-flight Checklist
|
|
|
|
### Environment Verification
|
|
```bash
|
|
# Check database connection
|
|
stella db ping
|
|
|
|
# Verify connection pool health
|
|
stella doctor --check check.postgres.connectivity,check.postgres.pool
|
|
|
|
# Check migration status
|
|
stella db migrations status
|
|
```
|
|
|
|
### Metrics to Watch
|
|
- `stella_postgres_connections_active` - Active connections (should be < 80% of max)
|
|
- `stella_postgres_query_duration_seconds` - P99 query latency (target: < 100ms)
|
|
- `stella_postgres_pool_waiting` - Connections waiting for pool (should be 0)
|
|
|
|
---
|
|
|
|
## Standard Procedures
|
|
|
|
### SP-001: Daily Health Check
|
|
|
|
**Frequency:** Daily or on-demand
|
|
**Duration:** ~5 minutes
|
|
|
|
1. Run comprehensive health check:
|
|
```bash
|
|
stella doctor --category database --format json > /tmp/db-health-$(date +%Y%m%d).json
|
|
```
|
|
|
|
2. Review slow queries from last 24h:
|
|
```bash
|
|
stella db queries --slow --period 24h --limit 20
|
|
```
|
|
|
|
3. Check replication status (if applicable):
|
|
```bash
|
|
stella db replication status
|
|
```
|
|
|
|
4. Verify backup completion:
|
|
```bash
|
|
stella backup status --type database
|
|
```
|
|
|
|
### SP-002: Connection Pool Tuning
|
|
|
|
**When:** Pool exhaustion alerts or high wait times
|
|
|
|
1. Check current pool usage:
|
|
```bash
|
|
stella db pool stats --detailed
|
|
```
|
|
|
|
2. Identify connection-holding queries:
|
|
```bash
|
|
stella db queries --active --sort duration
|
|
```
|
|
|
|
3. Adjust pool size (if needed):
|
|
```bash
|
|
# Review current settings
|
|
stella config get Database:MaxPoolSize
|
|
|
|
# Increase pool size
|
|
stella config set Database:MaxPoolSize 150
|
|
|
|
# Restart affected services
|
|
stella service restart --service release-orchestrator
|
|
```
|
|
|
|
4. Verify improvement:
|
|
```bash
|
|
stella db pool watch --duration 5m
|
|
```
|
|
|
|
### SP-003: Backup and Restore
|
|
|
|
**Backup:**
|
|
```bash
|
|
# Create immediate backup
|
|
stella backup create --type database --name "pre-upgrade-$(date +%Y%m%d)"
|
|
|
|
# Verify backup
|
|
stella backup verify --latest
|
|
```
|
|
|
|
**Restore:**
|
|
```bash
|
|
# List available backups
|
|
stella backup list --type database
|
|
|
|
# Restore to specific point (CAUTION: destructive)
|
|
stella backup restore --id <backup-id> --confirm
|
|
|
|
# Verify restoration
|
|
stella db ping
|
|
stella db migrations status
|
|
```
|
|
|
|
### SP-004: Migration Execution
|
|
|
|
1. Pre-migration backup:
|
|
```bash
|
|
stella backup create --type database --name "pre-migration"
|
|
```
|
|
|
|
2. Run migrations:
|
|
```bash
|
|
# Dry run first
|
|
stella db migrate --dry-run
|
|
|
|
# Apply migrations
|
|
stella db migrate
|
|
```
|
|
|
|
3. Verify migration success:
|
|
```bash
|
|
stella db migrations status
|
|
stella doctor --check check.postgres.migrations
|
|
```
|
|
|
|
---
|
|
|
|
## Incident Procedures
|
|
|
|
### INC-001: Connection Pool Exhaustion
|
|
|
|
**Symptoms:**
|
|
- Alert: `StellaPostgresPoolExhausted`
|
|
- Error logs: "connection pool exhausted, waiting for available connection"
|
|
- Increased request latency
|
|
|
|
**Investigation:**
|
|
```bash
|
|
# Check pool status
|
|
stella db pool stats
|
|
|
|
# Find long-running queries
|
|
stella db queries --active --sort duration --limit 10
|
|
|
|
# Check for connection leaks
|
|
stella db connections --by-client
|
|
```
|
|
|
|
**Resolution:**
|
|
|
|
1. **Immediate relief** - Terminate long-running queries:
|
|
```bash
|
|
# Identify stuck queries
|
|
stella db queries --active --duration ">5m"
|
|
|
|
# Terminate specific query (use with caution)
|
|
stella db query terminate --pid <pid>
|
|
```
|
|
|
|
2. **Scale pool** (if legitimate load):
|
|
```bash
|
|
stella config set Database:MaxPoolSize 200
|
|
stella service restart --graceful
|
|
```
|
|
|
|
3. **Fix leaks** (if application bug):
|
|
- Review application logs for unclosed connections
|
|
- Deploy fix to affected service
|
|
|
|
### INC-002: Slow Query Performance
|
|
|
|
**Symptoms:**
|
|
- Alert: `StellaPostgresQueryLatencyHigh`
|
|
- P99 query latency > 500ms
|
|
|
|
**Investigation:**
|
|
```bash
|
|
# Get slow query report
|
|
stella db queries --slow --period 1h --format json > /tmp/slow-queries.json
|
|
|
|
# Analyze specific query
|
|
stella db query explain --sql "SELECT ..." --analyze
|
|
|
|
# Check table statistics
|
|
stella db stats tables --sort bloat
|
|
```
|
|
|
|
**Resolution:**
|
|
|
|
1. **Index optimization:**
|
|
```bash
|
|
# Get index recommendations
|
|
stella db index suggest --table <table>
|
|
|
|
# Create recommended index
|
|
stella db index create --table <table> --columns "col1,col2"
|
|
```
|
|
|
|
2. **Vacuum/analyze:**
|
|
```bash
|
|
stella db vacuum --table <table>
|
|
stella db analyze --table <table>
|
|
```
|
|
|
|
3. **Query optimization** - Review and rewrite problematic queries
|
|
|
|
### INC-003: Database Connectivity Loss
|
|
|
|
**Symptoms:**
|
|
- Alert: `StellaPostgresConnectionFailed`
|
|
- All services reporting database connection errors
|
|
|
|
**Investigation:**
|
|
```bash
|
|
# Test basic connectivity
|
|
stella db ping
|
|
|
|
# Check DNS resolution
|
|
stella network dns-lookup <db-host>
|
|
|
|
# Check firewall/network
|
|
stella network test --host <db-host> --port 5432
|
|
```
|
|
|
|
**Resolution:**
|
|
|
|
1. **Network issue:**
|
|
- Verify security groups / firewall rules
|
|
- Check VPN/tunnel status if applicable
|
|
- Verify DNS resolution
|
|
|
|
2. **Database server issue:**
|
|
- Check PostgreSQL service status on server
|
|
- Review PostgreSQL logs
|
|
- Check disk space on database server
|
|
|
|
3. **Credential issue:**
|
|
```bash
|
|
stella db verify-credentials
|
|
stella secrets rotate --scope database
|
|
```
|
|
|
|
### INC-004: Disk Space Alert
|
|
|
|
**Symptoms:**
|
|
- Alert: `StellaPostgresDiskSpaceWarning` or `Critical`
|
|
- Database write failures
|
|
|
|
**Investigation:**
|
|
```bash
|
|
# Check disk usage
|
|
stella db disk-usage
|
|
|
|
# Find large tables
|
|
stella db stats tables --sort size --limit 20
|
|
|
|
# Check for bloat
|
|
stella db stats tables --sort bloat
|
|
```
|
|
|
|
**Resolution:**
|
|
|
|
1. **Immediate cleanup:**
|
|
```bash
|
|
# Vacuum to reclaim space
|
|
stella db vacuum --full --table <large-table>
|
|
|
|
# Clean old data (if retention policy allows)
|
|
stella db prune --table evidence_artifacts --older-than 90d --dry-run
|
|
```
|
|
|
|
2. **Archive old data:**
|
|
```bash
|
|
stella db archive --table findings_history --older-than 180d
|
|
```
|
|
|
|
3. **Expand disk** (if legitimate growth):
|
|
- Follow cloud provider procedure to expand volume
|
|
- Resize filesystem
|
|
|
|
---
|
|
|
|
## Maintenance Windows
|
|
|
|
### Weekly Maintenance (Sunday 02:00 UTC)
|
|
|
|
1. Run vacuum analyze on all tables:
|
|
```bash
|
|
stella db vacuum --analyze --all-tables
|
|
```
|
|
|
|
2. Update table statistics:
|
|
```bash
|
|
stella db analyze --all-tables
|
|
```
|
|
|
|
3. Clean temporary files:
|
|
```bash
|
|
stella db cleanup --temp-files
|
|
```
|
|
|
|
### Monthly Maintenance (First Sunday 03:00 UTC)
|
|
|
|
1. Full vacuum on large tables:
|
|
```bash
|
|
stella db vacuum --full --table findings --table verdicts
|
|
```
|
|
|
|
2. Reindex if needed:
|
|
```bash
|
|
stella db reindex --concurrently --table findings
|
|
```
|
|
|
|
3. Archive old data per retention policy:
|
|
```bash
|
|
stella db archive --apply-retention
|
|
```
|
|
|
|
---
|
|
|
|
## Monitoring Dashboard
|
|
|
|
Access: Grafana → Dashboards → Stella Ops → PostgreSQL
|
|
|
|
Key panels:
|
|
- Connection pool utilization
|
|
- Query latency percentiles
|
|
- Disk usage trend
|
|
- Replication lag (if applicable)
|
|
- Active queries count
|
|
|
|
---
|
|
|
|
## Evidence Capture
|
|
|
|
For any incident, capture:
|
|
```bash
|
|
# Comprehensive database state
|
|
stella db diagnostics --output /tmp/db-diag-$(date +%Y%m%dT%H%M%S).tar.gz
|
|
```
|
|
|
|
Bundle includes:
|
|
- Connection stats
|
|
- Active queries
|
|
- Lock information
|
|
- Table statistics
|
|
- Recent slow query log
|
|
- Configuration snapshot
|
|
|
|
---
|
|
|
|
## Escalation Path
|
|
|
|
1. **L1 (On-call):** Standard procedures, restart services
|
|
2. **L2 (Database team):** Query optimization, schema changes
|
|
3. **L3 (Vendor support):** Hardware/cloud platform issues
|
|
|
|
---
|
|
|
|
_Last updated: 2026-01-17 (UTC)_
|