# PostgreSQL Operations Guide

**Version:** 1.0.0
**Last Updated:** 2025-12-10
**Status:** Active

This guide covers PostgreSQL operations for StellaOps, including setup, performance tuning, monitoring, backup/restore, and scaling recommendations.

---

## 1. Overview

StellaOps uses PostgreSQL (≥16) as the primary control-plane database with per-module schema isolation. MongoDB is retained only for legacy modules not yet converted.

### 1.1 Schema Topology

```
┌─────────────────────────────────────────────────────┐
│                 PostgreSQL Cluster                  │
│ ┌─────────────────────────────────────────────────┐ │
│ │              stellaops (database)               │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │authority│ │  vuln   │ │   vex   │ │scheduler│ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ notify  │ │ policy  │ │  packs  │ │ issuer  │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ ┌─────────┐                                     │ │
│ │ │  audit  │ (cross-cutting audit schema)        │ │
│ │ └─────────┘                                     │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
```

### 1.2 Module Schema Ownership

| Schema | Owner Module | Primary Tables |
|--------|--------------|----------------|
| `authority` | Authority | tenants, users, roles, tokens, licenses |
| `vuln` | Concelier | sources, advisories, advisory_affected, kev_flags |
| `vex` | Excititor | projects, graph_revisions, statements, observations |
| `scheduler` | Scheduler | schedules, runs, graph_jobs, workers, locks |
| `notify` | Notify | channels, templates, rules, deliveries |
| `policy` | Policy | packs, rules, evaluations, exceptions |
| `concelier` | Concelier | documents, dtos, states, exports |
| `audit` | Shared | audit_log (cross-cutting) |
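For illustration, per-module isolation is typically enforced with one owner role per schema. A minimal sketch, assuming a `scheduler_svc` role name — the role naming and grant layout here are illustrative, not the shipped migrations:

```sql
-- Illustrative sketch: one login role per module, owning only its schema
CREATE ROLE scheduler_svc LOGIN;
CREATE SCHEMA IF NOT EXISTS scheduler AUTHORIZATION scheduler_svc;

-- Keep other modules out of this schema
REVOKE ALL ON SCHEMA scheduler FROM PUBLIC;

-- The cross-cutting audit schema stays writable by every module role
GRANT USAGE ON SCHEMA audit TO scheduler_svc;
GRANT INSERT ON audit.audit_log TO scheduler_svc;
```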
---

## 2. Performance Configuration

### 2.1 Enable pg_stat_statements

The `pg_stat_statements` extension is essential for query performance analysis. Enable it in your PostgreSQL configuration:

**postgresql.conf:**

```ini
# Load the extension at startup
shared_preload_libraries = 'pg_stat_statements'

# Configuration
pg_stat_statements.max = 10000
pg_stat_statements.track = all
pg_stat_statements.track_utility = on
pg_stat_statements.track_planning = on
```

**Enable in database:**

```sql
-- Create the extension (requires superuser)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Verify installation
SELECT * FROM pg_stat_statements LIMIT 1;

-- Reset statistics (useful after configuration changes)
SELECT pg_stat_statements_reset();
```

### 2.2 Recommended PostgreSQL Settings

**Memory Configuration (adjust based on available RAM):**

```ini
# For a server with 16GB RAM dedicated to PostgreSQL:
shared_buffers = 4GB              # 25% of RAM
effective_cache_size = 12GB       # 75% of RAM
maintenance_work_mem = 1GB        # For VACUUM, CREATE INDEX
work_mem = 64MB                   # Per-operation sort memory

# Connection management
max_connections = 200             # Adjust based on pooling
```

**Write-Ahead Log (WAL):**

```ini
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB
min_wal_size = 1GB
```

**Query Planner:**

```ini
random_page_cost = 1.1            # For SSDs (default 4.0 is for HDDs)
effective_io_concurrency = 200    # For SSDs
default_statistics_target = 100   # Increase for complex queries
```

**Parallel Query:**

```ini
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_parallel_maintenance_workers = 4
```

### 2.3 Connection Pooling (PgBouncer)

**Recommended PgBouncer configuration:**

```ini
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 60
query_timeout = 30
```

**Session configuration (set on connection open):**

```sql
SET app.tenant_id = '<tenant-id>';
SET timezone = 'UTC';
SET statement_timeout = '30s';
```

Note that with `pool_mode = transaction`, session-level `SET` does not survive across transactions; use `SET LOCAL` inside the transaction that needs the value.
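The `app.tenant_id` setting can be read back in queries via `current_setting()`. As a sketch of how tenant scoping could build on it — the row-level-security policy below is illustrative, not a shipped object; check each module's migrations for the real policies:

```sql
-- Read the per-connection tenant back inside a query
SELECT current_setting('app.tenant_id');

-- Illustrative row-level-security policy keyed on the same setting
ALTER TABLE scheduler.runs ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON scheduler.runs
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
```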
---

## 3. Query Performance Analysis

### 3.1 Identifying Slow Queries

**Top queries by total time:**

```sql
SELECT substring(query, 1, 100) as query_preview,
       calls,
       round(total_exec_time::numeric, 2) as total_ms,
       round(mean_exec_time::numeric, 2) as mean_ms,
       round((100 * total_exec_time / sum(total_exec_time) over())::numeric, 2) as percent_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```

**Queries with high mean execution time:**

```sql
SELECT substring(query, 1, 100) as query_preview,
       calls,
       round(mean_exec_time::numeric, 2) as mean_ms,
       round(stddev_exec_time::numeric, 2) as stddev_ms,
       rows
FROM pg_stat_statements
WHERE calls > 10
ORDER BY mean_exec_time DESC
LIMIT 20;
```

**Queries with high buffer usage (I/O intensive):**

```sql
SELECT substring(query, 1, 100) as query_preview,
       calls,
       shared_blks_hit + shared_blks_read as total_blks,
       round(100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0), 2) as hit_ratio
FROM pg_stat_statements
WHERE shared_blks_hit + shared_blks_read > 1000
ORDER BY shared_blks_read DESC
LIMIT 20;
```

### 3.2 Using EXPLAIN ANALYZE

**Basic usage:**

```sql
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM vuln.advisories
WHERE state = 'active' AND severity = 'critical'
ORDER BY modified_at DESC
LIMIT 100;
```

**Understanding output - key indicators:**

- **Seq Scan** on large tables = missing index
- **Hash Join** vs **Nested Loop** - consider data sizes
- **Rows** estimate vs actual - statistics accuracy
- **Buffers: shared hit/read** - cache effectiveness

**Example analysis:**

```text
-- Bad: Sequential scan on large table
Seq Scan on advisories  (cost=0.00..50000.00 rows=1000 width=100)
  Filter: ((state = 'active') AND (severity = 'critical'))
  Rows Removed by Filter: 99000

-- Good: Index scan
Index Scan using idx_advisories_state_severity on advisories
  Index Cond: ((state = 'active') AND (severity = 'critical'))
```

### 3.3 Index Analysis

**Find unused indexes:**

```sql
SELECT schemaname || '.' || relname as table,
       indexrelname as index,
       pg_size_pretty(pg_relation_size(indexrelid)) as size,
       idx_scan as scans
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND schemaname NOT IN ('pg_catalog', 'pg_toast')
ORDER BY pg_relation_size(indexrelid) DESC;
```

**Find missing indexes (tables with high sequential scans):**

```sql
SELECT schemaname || '.' || relname as table,
       seq_scan,
       seq_tup_read,
       idx_scan,
       round(100.0 * idx_scan / nullif(seq_scan + idx_scan, 0), 2) as idx_usage_pct
FROM pg_stat_user_tables
WHERE seq_scan > 100
ORDER BY seq_tup_read DESC
LIMIT 20;
```

**Duplicate indexes:**

```sql
SELECT pg_size_pretty(sum(pg_relation_size(idx))::bigint) as size,
       array_agg(idx) as indexes,
       indrelid::regclass as table,
       indkey as columns
FROM (
    SELECT indexrelid::regclass as idx, indrelid, indkey
    FROM pg_index
) sub
GROUP BY indrelid, indkey
HAVING count(*) > 1;
```
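When the planner's row estimates diverge badly from actuals (Section 3.2), raising the per-column statistics target and re-analyzing is often the cheapest fix. A minimal sketch — the table and column are illustrative:

```sql
-- Collect finer-grained statistics for a skewed column, then refresh them
ALTER TABLE vuln.advisories ALTER COLUMN severity SET STATISTICS 500;
ANALYZE vuln.advisories;
```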
---

## 4. Index Guidelines for StellaOps

### 4.1 Standard Index Patterns

All tenant-scoped tables should have composite indexes starting with `tenant_id`:

```sql
-- Standard tenant + primary lookup pattern
CREATE INDEX idx_<table>_tenant_<column>
    ON <schema>.<table> (tenant_id, <column>);

-- Time-based queries
CREATE INDEX idx_<table>_tenant_time
    ON <schema>.<table> (tenant_id, created_at DESC);

-- State/status filtering
CREATE INDEX idx_<table>_tenant_state
    ON <schema>.<table> (tenant_id, state)
    WHERE state IN ('active', 'pending');
```

### 4.2 Module-Specific Indexes

**Authority schema:**

```sql
CREATE INDEX idx_users_tenant ON authority.users(tenant_id);
CREATE INDEX idx_users_email ON authority.users(email) WHERE email IS NOT NULL;
CREATE INDEX idx_tokens_expires ON authority.tokens(expires_at) WHERE revoked_at IS NULL;
```

**Vuln schema:**

```sql
CREATE INDEX idx_advisories_primary_vuln ON vuln.advisories(primary_vuln_id);
CREATE INDEX idx_advisories_modified ON vuln.advisories(modified_at DESC);
CREATE INDEX idx_advisory_aliases_value ON vuln.advisory_aliases(alias_value);
CREATE INDEX idx_advisory_affected_purl ON vuln.advisory_affected(package_purl) WHERE package_purl IS NOT NULL;
```

**Scheduler schema:**

```sql
CREATE INDEX idx_runs_tenant_state ON scheduler.runs(tenant_id, state);
CREATE INDEX idx_runs_state_created ON scheduler.runs(state, created_at) WHERE state IN ('pending', 'queued', 'running');
CREATE INDEX idx_graph_jobs_tenant_status ON scheduler.graph_jobs(tenant_id, status);
```

### 4.3 JSONB Indexes

```sql
-- GIN index for containment queries (@>, ?, ?&, ?|)
CREATE INDEX idx_<table>_<column>_gin
    ON <schema>.<table> USING GIN (<column>);

-- Expression index for specific JSON paths
CREATE INDEX idx_<table>_<column>_path
    ON <schema>.<table> ((<column>->>'specific_key'));
```
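As a usage sketch, a containment query like the following is what the GIN pattern above accelerates — the `document` column on `vex.statements` is assumed here for illustration:

```sql
-- Served by a GIN index on the jsonb column via the @> containment operator
SELECT id
FROM vex.statements
WHERE document @> '{"status": "not_affected"}'::jsonb;
```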
---

## 5. Monitoring Setup

### 5.1 Key Metrics to Monitor

**Connection metrics:**

```sql
-- Current connections by state
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

-- Connections by database/user
SELECT datname, usename, count(*) FROM pg_stat_activity GROUP BY datname, usename;
```

**Cache effectiveness:**

```sql
-- Database-level cache hit ratio (should be >99%)
SELECT datname,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) as cache_hit_ratio
FROM pg_stat_database
WHERE datname = 'stellaops';
```

**Table bloat and maintenance:**

```sql
-- Tables needing VACUUM
SELECT schemaname || '.' || relname as table,
       n_dead_tup,
       n_live_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
```

### 5.2 Prometheus Metrics

Use `postgres_exporter` for Prometheus integration. Key alert rules:

```yaml
# Alert rules for PostgreSQL
groups:
  - name: postgresql
    rules:
      - alert: PostgreSQLHighConnections
        expr: pg_stat_activity_count > (pg_settings_max_connections * 0.8)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL connection count {{ $value }} exceeds 80% of max_connections"

      - alert: PostgreSQLLowCacheHitRatio
        expr: pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) < 0.95
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL cache hit ratio below 95%"

      - alert: PostgreSQLDeadlocks
        expr: rate(pg_stat_database_deadlocks[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL deadlocks detected"

      - alert: PostgreSQLSlowQueries
        expr: pg_stat_activity_max_tx_duration > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Long-running transaction detected (>5min)"
```

### 5.3 Grafana Dashboard

Import the PostgreSQL dashboard (ID: 9628) or create custom panels for:

1. **Connection Pool** - Active/idle/waiting connections
2. **Query Performance** - QPS, latency percentiles
3. **Cache Hit Ratio** - Database and table level
4. **Disk I/O** - Read/write IOPS and throughput
5. **Replication Lag** - For HA setups
6. **Lock Waits** - Blocked queries count (see the sketch below)
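For the Lock Waits panel, a custom exporter query can count currently blocked backends. A minimal sketch — wiring it into `postgres_exporter`'s custom-query file is left as an assumption:

```sql
-- Number of backends currently waiting on a lock
SELECT count(*) AS blocked_queries
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```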
---

## 6. Performance Baselines

### 6.1 Expected Performance Targets

| Operation | Target P95 | Notes |
|-----------|------------|-------|
| Simple key lookup | < 5ms | Single row by UUID |
| Tenant-filtered list | < 50ms | 100 rows with pagination |
| Advisory search | < 100ms | With FTS and filters |
| VEX statement insert | < 20ms | Single statement |
| Scheduler job enqueue | < 10ms | With lock acquisition |
| Report generation | < 500ms | Full SBOM evaluation |

### 6.2 Baseline Queries

Run these periodically to establish baselines:

```sql
-- Authority: User lookup
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM authority.users
WHERE tenant_id = '<tenant-uuid>' AND normalized_username = 'testuser';

-- Vuln: Advisory search
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM vuln.advisories
WHERE state = 'active'
  AND to_tsvector('english', title || ' ' || coalesce(summary, '')) @@ plainto_tsquery('critical vulnerability')
ORDER BY modified_at DESC
LIMIT 50;

-- Scheduler: Pending jobs
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM scheduler.runs
WHERE tenant_id = '<tenant-uuid>' AND state = 'pending'
ORDER BY created_at
LIMIT 100;
```

### 6.3 Load Testing

Use `pgbench` for baseline load testing:

```bash
# Initialize test data
pgbench -i -s 50 stellaops

# Run benchmark (60 seconds, 10 clients)
pgbench -c 10 -j 4 -T 60 stellaops

# Custom script benchmark
pgbench -c 10 -j 4 -T 60 -f custom_workload.sql stellaops
```

---

## 7. Backup and Restore

### 7.1 Backup Strategy

**Daily full backup with pg_dump:**

```bash
#!/bin/bash
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR=/var/backups/postgresql

pg_dump -Fc -Z 9 \
  --host="${PGHOST}" \
  --port="${PGPORT}" \
  --username="${PGUSER}" \
  --dbname=stellaops \
  --file="${BACKUP_DIR}/stellaops_${DATE}.dump"

# Retain last 7 days
find ${BACKUP_DIR} -name "*.dump" -mtime +7 -delete
```

**Continuous WAL archiving:**

```ini
# postgresql.conf
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'
```

### 7.2 Point-in-Time Recovery

PITR replays archived WAL on top of a *physical* base backup taken with `pg_basebackup`; a logical `pg_dump` archive cannot serve as the recovery base.

```bash
# Stop PostgreSQL
systemctl stop postgresql

# Restore the physical base backup into an empty data directory
# (path assumes a tar-format pg_basebackup was taken alongside the dumps)
rm -rf ${PGDATA:?}/*
tar -xf /var/backups/postgresql/base_backup.tar -C ${PGDATA}

# Configure recovery (PostgreSQL 12+: recovery.signal + postgresql.conf)
cat >> ${PGDATA}/postgresql.auto.conf << EOF
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '2025-12-10 14:30:00 UTC'
EOF
touch ${PGDATA}/recovery.signal

# Start PostgreSQL
systemctl start postgresql
```

### 7.3 Backup Verification

```bash
# Test restore into a scratch database
createdb stellaops_test
pg_restore -d stellaops_test /var/backups/postgresql/stellaops_backup.dump

# Verify data integrity
psql -d stellaops_test -c "SELECT count(*) FROM authority.users;"
psql -d stellaops_test -c "SELECT count(*) FROM vuln.advisories;"

# Cleanup
dropdb stellaops_test
```
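Restores are only as good as the WAL archive feeding them. Before trusting a recovery point, confirm the archiving setup from Section 7.1 is keeping up:

```sql
-- Archiver health: failures here mean the PITR chain has gaps
SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_archived_time
FROM pg_stat_archiver;
```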
---

## 8. Scaling Recommendations

### 8.1 Vertical Scaling

| Load Level | vCPUs | RAM | Storage | Connections |
|------------|-------|-----|---------|-------------|
| Development | 2 | 4GB | 50GB SSD | 50 |
| Small (<1k images) | 4 | 16GB | 200GB SSD | 100 |
| Medium (1k-10k images) | 8 | 32GB | 500GB SSD | 200 |
| Large (10k+ images) | 16 | 64GB | 1TB+ NVMe | 500 |

### 8.2 Horizontal Scaling

**Read replicas for reporting:**

```yaml
# Primary for writes
primary:
  host: postgres-primary.internal
  port: 5432

# Replicas for reads (round-robin)
replicas:
  - host: postgres-replica-1.internal
    port: 5432
  - host: postgres-replica-2.internal
    port: 5432
```

**Connection routing in application:**

- Writes → Primary
- Heavy reads (reports, dashboards) → Replicas
- Scheduler impact queries → Replicas with acceptable lag

### 8.3 Table Partitioning

For high-volume tables (>100M rows), consider partitioning:

```sql
-- Partition scheduler.runs by created_at
CREATE TABLE scheduler.runs_partitioned (
    LIKE scheduler.runs INCLUDING ALL
) PARTITION BY RANGE (created_at);

-- Monthly partitions
CREATE TABLE scheduler.runs_y2025m12
    PARTITION OF scheduler.runs_partitioned
    FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');

-- Automate partition creation
-- See: pg_partman extension
```

### 8.4 Connection Pooling at Scale

For >1000 concurrent connections, deploy PgBouncer as a sidecar or dedicated service:

```yaml
# Kubernetes deployment with PgBouncer sidecar
containers:
  - name: app
    env:
      - name: DATABASE_URL
        value: "postgresql://localhost:6432/stellaops"
  - name: pgbouncer
    image: pgbouncer/pgbouncer:1.21.0
    ports:
      - containerPort: 6432
```

---

## 9. Troubleshooting

### 9.1 Common Issues

**High connection count:**

```sql
-- Identify connection sources
SELECT client_addr, usename, state, count(*)
FROM pg_stat_activity
GROUP BY 1, 2, 3
ORDER BY 4 DESC;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '30 minutes';
```

**Lock contention:**

```sql
-- Find blocking queries
SELECT blocked.pid AS blocked_pid,
       blocked.query AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';
```

**Table bloat:**

```sql
-- Check table and index sizes
SELECT schemaname || '.' || relname as table,
       pg_size_pretty(pg_total_relation_size(relid)) as total_size,
       pg_size_pretty(pg_table_size(relid)) as table_size,
       pg_size_pretty(pg_indexes_size(relid)) as index_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;

-- Manual VACUUM FULL for severe bloat (blocks writes!)
VACUUM (FULL, ANALYZE) scheduler.runs;
```

### 9.2 Emergency Procedures

**Kill long-running queries:**

```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '10 minutes'
  AND query NOT LIKE '%pg_stat%';
```

**Force checkpoint (before maintenance):**

```sql
CHECKPOINT;
```

**Emergency read-only mode:**

```sql
ALTER DATABASE stellaops SET default_transaction_read_only = on;
```

---

## 10. Air-Gap Considerations

### 10.1 Offline Setup

PostgreSQL 16+ is bundled in the air-gap kit. See `docs/24_OFFLINE_KIT.md` for import instructions.

**Docker image digest (pinned):**

```yaml
postgres:
  image: docker.io/library/postgres:16@sha256:<digest>
```
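After importing the bundled image, it is worth confirming the expected major version and that the required extensions ship with it — a quick check from any `psql` session:

```sql
-- Verify server version and bundled extensions after an offline import
SELECT version();

SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name IN ('pg_stat_statements', 'pg_trgm', 'btree_gin', 'pgcrypto');
```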
### 10.2 Migrations in Air-Gap

All migrations are embedded in application assemblies; no network access is required:

```bash
# Run migrations manually
dotnet run --project src/Tools/MigrationRunner -- \
  --connection "Host=postgres;Database=stellaops;..." \
  --schema all
```

### 10.3 Backup in Air-Gap

```bash
# Local backup with encryption
pg_dump -Fc stellaops | gpg --encrypt -r backup@stellaops.local > backup.dump.gpg

# Restore
gpg --decrypt backup.dump.gpg | pg_restore -d stellaops
```

---

## Appendix A: Quick Reference

### Connection String Template

```
Host=<host>;Port=5432;Database=stellaops;Username=<user>;Password=<password>;
Pooling=true;MinPoolSize=5;MaxPoolSize=20;ConnectionIdleLifetime=300;
CommandTimeout=30;Timeout=15;
```

### Essential Commands

```bash
# Connect to database
psql -h localhost -U stellaops -d stellaops

# Check version
psql -c "SELECT version();"

# List schemas
psql -c "\dn"

# List tables in schema
psql -c "\dt vuln.*"

# Table structure
psql -c "\d vuln.advisories"

# Current activity
psql -c "SELECT * FROM pg_stat_activity;"
```

### Useful Extensions

```sql
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;  -- Query statistics
CREATE EXTENSION IF NOT EXISTS pg_trgm;             -- Fuzzy text search
CREATE EXTENSION IF NOT EXISTS btree_gin;           -- GIN for scalars
CREATE EXTENSION IF NOT EXISTS pgcrypto;            -- Cryptographic functions
```
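As a usage sketch for `pg_trgm`, a trigram index enables fuzzy matching on advisory titles — the index and query below are illustrative, not shipped objects:

```sql
-- Trigram index for fuzzy title search (requires pg_trgm)
CREATE INDEX idx_advisories_title_trgm
    ON vuln.advisories USING GIN (title gin_trgm_ops);

-- Similarity match using the % operator from pg_trgm
SELECT id, title
FROM vuln.advisories
WHERE title % 'openssl hartbleed';  -- tolerant of the misspelling
```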