synergy moats product advisory implementations

2026-01-17 01:30:03 +02:00
parent 77ff029205
commit d8d9c0a6e3
106 changed files with 20603 additions and 123 deletions
--- a/docs/operations/runbooks/postgres-ops.md
+++ b/docs/operations/runbooks/postgres-ops.md
@@ -0,0 +1,371 @@
+# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
+# Task: RUN-001 - PostgreSQL Operations Runbook
+# PostgreSQL Database Runbook (dev-mock ready)
+
+Status: PRODUCTION-READY (2026-01-17 UTC)
+
+## Scope
+PostgreSQL database operations including monitoring, maintenance, backup/restore, and common incident handling for Stella Ops deployments.
+
+---
+
+## Pre-flight Checklist
+
+### Environment Verification
+```bash
+# Check database connection
+stella db ping
+
+# Verify connection pool health
+stella doctor --check check.postgres.connectivity,check.postgres.pool
+
+# Check migration status
+stella db migrations status
+```
+
+### Metrics to Watch
+- `stella_postgres_connections_active` - Active connections (should be < 80% of max)
+- `stella_postgres_query_duration_seconds` - P99 query latency (target: < 100ms)
+- `stella_postgres_pool_waiting` - Connections waiting for pool (should be 0)
+
+---
+
+## Standard Procedures
+
+### SP-001: Daily Health Check
+
+**Frequency:** Daily or on-demand
+**Duration:** ~5 minutes
+
+1. Run comprehensive health check:
+   ```bash
+   stella doctor --category database --format json > /tmp/db-health-$(date +%Y%m%d).json
+   ```
+
+2. Review slow queries from last 24h:
+   ```bash
+   stella db queries --slow --period 24h --limit 20
+   ```
+
+3. Check replication status (if applicable):
+   ```bash
+   stella db replication status
+   ```
+
+4. Verify backup completion:
+   ```bash
+   stella backup status --type database
+   ```
+
+### SP-002: Connection Pool Tuning
+
+**When:** Pool exhaustion alerts or high wait times
+
+1. Check current pool usage:
+   ```bash
+   stella db pool stats --detailed
+   ```
+
+2. Identify connection-holding queries:
+   ```bash
+   stella db queries --active --sort duration
+   ```
+
+3. Adjust pool size (if needed):
+   ```bash
+   # Review current settings
+   stella config get Database:MaxPoolSize
+   
+   # Increase pool size
+   stella config set Database:MaxPoolSize 150
+   
+   # Restart affected services
+   stella service restart --service release-orchestrator
+   ```
+
+4. Verify improvement:
+   ```bash
+   stella db pool watch --duration 5m
+   ```
+
+### SP-003: Backup and Restore
+
+**Backup:**
+```bash
+# Create immediate backup
+stella backup create --type database --name "pre-upgrade-$(date +%Y%m%d)"
+
+# Verify backup
+stella backup verify --latest
+```
+
+**Restore:**
+```bash
+# List available backups
+stella backup list --type database
+
+# Restore to specific point (CAUTION: destructive)
+stella backup restore --id <backup-id> --confirm
+
+# Verify restoration
+stella db ping
+stella db migrations status
+```
+
+### SP-004: Migration Execution
+
+1. Pre-migration backup:
+   ```bash
+   stella backup create --type database --name "pre-migration"
+   ```
+
+2. Run migrations:
+   ```bash
+   # Dry run first
+   stella db migrate --dry-run
+   
+   # Apply migrations
+   stella db migrate
+   ```
+
+3. Verify migration success:
+   ```bash
+   stella db migrations status
+   stella doctor --check check.postgres.migrations
+   ```
+
+---
+
+## Incident Procedures
+
+### INC-001: Connection Pool Exhaustion
+
+**Symptoms:**
+- Alert: `StellaPostgresPoolExhausted`
+- Error logs: "connection pool exhausted, waiting for available connection"
+- Increased request latency
+
+**Investigation:**
+```bash
+# Check pool status
+stella db pool stats
+
+# Find long-running queries
+stella db queries --active --sort duration --limit 10
+
+# Check for connection leaks
+stella db connections --by-client
+```
+
+**Resolution:**
+
+1. **Immediate relief** - Terminate long-running queries:
+   ```bash
+   # Identify stuck queries
+   stella db queries --active --duration ">5m"
+   
+   # Terminate specific query (use with caution)
+   stella db query terminate --pid <pid>
+   ```
+
+2. **Scale pool** (if legitimate load):
+   ```bash
+   stella config set Database:MaxPoolSize 200
+   stella service restart --graceful
+   ```
+
+3. **Fix leaks** (if application bug):
+   - Review application logs for unclosed connections
+   - Deploy fix to affected service
+
+### INC-002: Slow Query Performance
+
+**Symptoms:**
+- Alert: `StellaPostgresQueryLatencyHigh`
+- P99 query latency > 500ms
+
+**Investigation:**
+```bash
+# Get slow query report
+stella db queries --slow --period 1h --format json > /tmp/slow-queries.json
+
+# Analyze specific query
+stella db query explain --sql "SELECT ..." --analyze
+
+# Check table statistics
+stella db stats tables --sort bloat
+```
+
+**Resolution:**
+
+1. **Index optimization:**
+   ```bash
+   # Get index recommendations
+   stella db index suggest --table <table>
+   
+   # Create recommended index
+   stella db index create --table <table> --columns "col1,col2"
+   ```
+
+2. **Vacuum/analyze:**
+   ```bash
+   stella db vacuum --table <table>
+   stella db analyze --table <table>
+   ```
+
+3. **Query optimization** - Review and rewrite problematic queries
+
+### INC-003: Database Connectivity Loss
+
+**Symptoms:**
+- Alert: `StellaPostgresConnectionFailed`
+- All services reporting database connection errors
+
+**Investigation:**
+```bash
+# Test basic connectivity
+stella db ping
+
+# Check DNS resolution
+stella network dns-lookup <db-host>
+
+# Check firewall/network
+stella network test --host <db-host> --port 5432
+```
+
+**Resolution:**
+
+1. **Network issue:**
+   - Verify security groups / firewall rules
+   - Check VPN/tunnel status if applicable
+   - Verify DNS resolution
+
+2. **Database server issue:**
+   - Check PostgreSQL service status on server
+   - Review PostgreSQL logs
+   - Check disk space on database server
+
+3. **Credential issue:**
+   ```bash
+   stella db verify-credentials
+   stella secrets rotate --scope database
+   ```
+
+### INC-004: Disk Space Alert
+
+**Symptoms:**
+- Alert: `StellaPostgresDiskSpaceWarning` or `Critical`
+- Database write failures
+
+**Investigation:**
+```bash
+# Check disk usage
+stella db disk-usage
+
+# Find large tables
+stella db stats tables --sort size --limit 20
+
+# Check for bloat
+stella db stats tables --sort bloat
+```
+
+**Resolution:**
+
+1. **Immediate cleanup:**
+   ```bash
+   # Vacuum to reclaim space
+   stella db vacuum --full --table <large-table>
+   
+   # Clean old data (if retention policy allows)
+   stella db prune --table evidence_artifacts --older-than 90d --dry-run
+   ```
+
+2. **Archive old data:**
+   ```bash
+   stella db archive --table findings_history --older-than 180d
+   ```
+
+3. **Expand disk** (if legitimate growth):
+   - Follow cloud provider procedure to expand volume
+   - Resize filesystem
+
+---
+
+## Maintenance Windows
+
+### Weekly Maintenance (Sunday 02:00 UTC)
+
+1. Run vacuum analyze on all tables:
+   ```bash
+   stella db vacuum --analyze --all-tables
+   ```
+
+2. Update table statistics:
+   ```bash
+   stella db analyze --all-tables
+   ```
+
+3. Clean temporary files:
+   ```bash
+   stella db cleanup --temp-files
+   ```
+
+### Monthly Maintenance (First Sunday 03:00 UTC)
+
+1. Full vacuum on large tables:
+   ```bash
+   stella db vacuum --full --table findings --table verdicts
+   ```
+
+2. Reindex if needed:
+   ```bash
+   stella db reindex --concurrently --table findings
+   ```
+
+3. Archive old data per retention policy:
+   ```bash
+   stella db archive --apply-retention
+   ```
+
+---
+
+## Monitoring Dashboard
+
+Access: Grafana → Dashboards → Stella Ops → PostgreSQL
+
+Key panels:
+- Connection pool utilization
+- Query latency percentiles
+- Disk usage trend
+- Replication lag (if applicable)
+- Active queries count
+
+---
+
+## Evidence Capture
+
+For any incident, capture:
+```bash
+# Comprehensive database state
+stella db diagnostics --output /tmp/db-diag-$(date +%Y%m%dT%H%M%S).tar.gz
+```
+
+Bundle includes:
+- Connection stats
+- Active queries
+- Lock information
+- Table statistics
+- Recent slow query log
+- Configuration snapshot
+
+---
+
+## Escalation Path
+
+1. **L1 (On-call):** Standard procedures, restart services
+2. **L2 (Database team):** Query optimization, schema changes
+3. **L3 (Vendor support):** Hardware/cloud platform issues
+
+---
+
+_Last updated: 2026-01-17 (UTC)_