synergy moats product advisory implementations
This commit is contained in:
371
docs/operations/runbooks/postgres-ops.md
Normal file
371
docs/operations/runbooks/postgres-ops.md
Normal file
@@ -0,0 +1,371 @@
|
||||
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
|
||||
# Task: RUN-001 - PostgreSQL Operations Runbook
|
||||
# PostgreSQL Database Runbook (dev-mock ready)
|
||||
|
||||
Status: PRODUCTION-READY (2026-01-17 UTC)
|
||||
|
||||
## Scope
|
||||
PostgreSQL database operations including monitoring, maintenance, backup/restore, and common incident handling for Stella Ops deployments.
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight Checklist
|
||||
|
||||
### Environment Verification
|
||||
```bash
|
||||
# Check database connection
|
||||
stella db ping
|
||||
|
||||
# Verify connection pool health
|
||||
stella doctor --check check.postgres.connectivity,check.postgres.pool
|
||||
|
||||
# Check migration status
|
||||
stella db migrations status
|
||||
```
|
||||
|
||||
### Metrics to Watch
|
||||
- `stella_postgres_connections_active` - Active connections (should be < 80% of max)
|
||||
- `stella_postgres_query_duration_seconds` - P99 query latency (target: < 100ms)
|
||||
- `stella_postgres_pool_waiting` - Connections waiting for pool (should be 0)
|
||||
|
||||
---
|
||||
|
||||
## Standard Procedures
|
||||
|
||||
### SP-001: Daily Health Check
|
||||
|
||||
**Frequency:** Daily or on-demand
|
||||
**Duration:** ~5 minutes
|
||||
|
||||
1. Run comprehensive health check:
|
||||
```bash
|
||||
stella doctor --category database --format json > /tmp/db-health-$(date +%Y%m%d).json
|
||||
```
|
||||
|
||||
2. Review slow queries from last 24h:
|
||||
```bash
|
||||
stella db queries --slow --period 24h --limit 20
|
||||
```
|
||||
|
||||
3. Check replication status (if applicable):
|
||||
```bash
|
||||
stella db replication status
|
||||
```
|
||||
|
||||
4. Verify backup completion:
|
||||
```bash
|
||||
stella backup status --type database
|
||||
```
|
||||
|
||||
### SP-002: Connection Pool Tuning
|
||||
|
||||
**When:** Pool exhaustion alerts or high wait times
|
||||
|
||||
1. Check current pool usage:
|
||||
```bash
|
||||
stella db pool stats --detailed
|
||||
```
|
||||
|
||||
2. Identify connection-holding queries:
|
||||
```bash
|
||||
stella db queries --active --sort duration
|
||||
```
|
||||
|
||||
3. Adjust pool size (if needed):
|
||||
```bash
|
||||
# Review current settings
|
||||
stella config get Database:MaxPoolSize
|
||||
|
||||
# Increase pool size
|
||||
stella config set Database:MaxPoolSize 150
|
||||
|
||||
# Restart affected services
|
||||
stella service restart --service release-orchestrator
|
||||
```
|
||||
|
||||
4. Verify improvement:
|
||||
```bash
|
||||
stella db pool watch --duration 5m
|
||||
```
|
||||
|
||||
### SP-003: Backup and Restore
|
||||
|
||||
**Backup:**
|
||||
```bash
|
||||
# Create immediate backup
|
||||
stella backup create --type database --name "pre-upgrade-$(date +%Y%m%d)"
|
||||
|
||||
# Verify backup
|
||||
stella backup verify --latest
|
||||
```
|
||||
|
||||
**Restore:**
|
||||
```bash
|
||||
# List available backups
|
||||
stella backup list --type database
|
||||
|
||||
# Restore to specific point (CAUTION: destructive)
|
||||
stella backup restore --id <backup-id> --confirm
|
||||
|
||||
# Verify restoration
|
||||
stella db ping
|
||||
stella db migrations status
|
||||
```
|
||||
|
||||
### SP-004: Migration Execution
|
||||
|
||||
1. Pre-migration backup:
|
||||
```bash
|
||||
stella backup create --type database --name "pre-migration"
|
||||
```
|
||||
|
||||
2. Run migrations:
|
||||
```bash
|
||||
# Dry run first
|
||||
stella db migrate --dry-run
|
||||
|
||||
# Apply migrations
|
||||
stella db migrate
|
||||
```
|
||||
|
||||
3. Verify migration success:
|
||||
```bash
|
||||
stella db migrations status
|
||||
stella doctor --check check.postgres.migrations
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Incident Procedures
|
||||
|
||||
### INC-001: Connection Pool Exhaustion
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaPostgresPoolExhausted`
|
||||
- Error logs: "connection pool exhausted, waiting for available connection"
|
||||
- Increased request latency
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check pool status
|
||||
stella db pool stats
|
||||
|
||||
# Find long-running queries
|
||||
stella db queries --active --sort duration --limit 10
|
||||
|
||||
# Check for connection leaks
|
||||
stella db connections --by-client
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Immediate relief** - Terminate long-running queries:
|
||||
```bash
|
||||
# Identify stuck queries
|
||||
stella db queries --active --duration ">5m"
|
||||
|
||||
# Terminate specific query (use with caution)
|
||||
stella db query terminate --pid <pid>
|
||||
```
|
||||
|
||||
2. **Scale pool** (if legitimate load):
|
||||
```bash
|
||||
stella config set Database:MaxPoolSize 200
|
||||
stella service restart --graceful
|
||||
```
|
||||
|
||||
3. **Fix leaks** (if application bug):
|
||||
- Review application logs for unclosed connections
|
||||
- Deploy fix to affected service
|
||||
|
||||
### INC-002: Slow Query Performance
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaPostgresQueryLatencyHigh`
|
||||
- P99 query latency > 500ms
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Get slow query report
|
||||
stella db queries --slow --period 1h --format json > /tmp/slow-queries.json
|
||||
|
||||
# Analyze specific query
|
||||
stella db query explain --sql "SELECT ..." --analyze
|
||||
|
||||
# Check table statistics
|
||||
stella db stats tables --sort bloat
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Index optimization:**
|
||||
```bash
|
||||
# Get index recommendations
|
||||
stella db index suggest --table <table>
|
||||
|
||||
# Create recommended index
|
||||
stella db index create --table <table> --columns "col1,col2"
|
||||
```
|
||||
|
||||
2. **Vacuum/analyze:**
|
||||
```bash
|
||||
stella db vacuum --table <table>
|
||||
stella db analyze --table <table>
|
||||
```
|
||||
|
||||
3. **Query optimization** - Review and rewrite problematic queries
|
||||
|
||||
### INC-003: Database Connectivity Loss
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaPostgresConnectionFailed`
|
||||
- All services reporting database connection errors
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Test basic connectivity
|
||||
stella db ping
|
||||
|
||||
# Check DNS resolution
|
||||
stella network dns-lookup <db-host>
|
||||
|
||||
# Check firewall/network
|
||||
stella network test --host <db-host> --port 5432
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Network issue:**
|
||||
- Verify security groups / firewall rules
|
||||
- Check VPN/tunnel status if applicable
|
||||
- Verify DNS resolution
|
||||
|
||||
2. **Database server issue:**
|
||||
- Check PostgreSQL service status on server
|
||||
- Review PostgreSQL logs
|
||||
- Check disk space on database server
|
||||
|
||||
3. **Credential issue:**
|
||||
```bash
|
||||
stella db verify-credentials
|
||||
stella secrets rotate --scope database
|
||||
```
|
||||
|
||||
### INC-004: Disk Space Alert
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaPostgresDiskSpaceWarning` or `Critical`
|
||||
- Database write failures
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check disk usage
|
||||
stella db disk-usage
|
||||
|
||||
# Find large tables
|
||||
stella db stats tables --sort size --limit 20
|
||||
|
||||
# Check for bloat
|
||||
stella db stats tables --sort bloat
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Immediate cleanup:**
|
||||
```bash
|
||||
# Vacuum to reclaim space
|
||||
stella db vacuum --full --table <large-table>
|
||||
|
||||
# Clean old data (if retention policy allows)
|
||||
stella db prune --table evidence_artifacts --older-than 90d --dry-run
|
||||
```
|
||||
|
||||
2. **Archive old data:**
|
||||
```bash
|
||||
stella db archive --table findings_history --older-than 180d
|
||||
```
|
||||
|
||||
3. **Expand disk** (if legitimate growth):
|
||||
- Follow cloud provider procedure to expand volume
|
||||
- Resize filesystem
|
||||
|
||||
---
|
||||
|
||||
## Maintenance Windows
|
||||
|
||||
### Weekly Maintenance (Sunday 02:00 UTC)
|
||||
|
||||
1. Run vacuum analyze on all tables:
|
||||
```bash
|
||||
stella db vacuum --analyze --all-tables
|
||||
```
|
||||
|
||||
2. Update table statistics:
|
||||
```bash
|
||||
stella db analyze --all-tables
|
||||
```
|
||||
|
||||
3. Clean temporary files:
|
||||
```bash
|
||||
stella db cleanup --temp-files
|
||||
```
|
||||
|
||||
### Monthly Maintenance (First Sunday 03:00 UTC)
|
||||
|
||||
1. Full vacuum on large tables:
|
||||
```bash
|
||||
stella db vacuum --full --table findings --table verdicts
|
||||
```
|
||||
|
||||
2. Reindex if needed:
|
||||
```bash
|
||||
stella db reindex --concurrently --table findings
|
||||
```
|
||||
|
||||
3. Archive old data per retention policy:
|
||||
```bash
|
||||
stella db archive --apply-retention
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Dashboard
|
||||
|
||||
Access: Grafana → Dashboards → Stella Ops → PostgreSQL
|
||||
|
||||
Key panels:
|
||||
- Connection pool utilization
|
||||
- Query latency percentiles
|
||||
- Disk usage trend
|
||||
- Replication lag (if applicable)
|
||||
- Active queries count
|
||||
|
||||
---
|
||||
|
||||
## Evidence Capture
|
||||
|
||||
For any incident, capture:
|
||||
```bash
|
||||
# Comprehensive database state
|
||||
stella db diagnostics --output /tmp/db-diag-$(date +%Y%m%dT%H%M%S).tar.gz
|
||||
```
|
||||
|
||||
Bundle includes:
|
||||
- Connection stats
|
||||
- Active queries
|
||||
- Lock information
|
||||
- Table statistics
|
||||
- Recent slow query log
|
||||
- Configuration snapshot
|
||||
|
||||
---
|
||||
|
||||
## Escalation Path
|
||||
|
||||
1. **L1 (On-call):** Standard procedures, restart services
|
||||
2. **L2 (Database team):** Query optimization, schema changes
|
||||
3. **L3 (Vendor support):** Hardware/cloud platform issues
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
Reference in New Issue
Block a user