# PostgreSQL Operations Guide
**Version:** 1.0.0
**Last Updated:** 2025-12-10
**Status:** Active
This guide covers PostgreSQL operations for StellaOps, including setup, performance tuning, monitoring, backup/restore, and scaling recommendations.

---
## 1. Overview
StellaOps uses PostgreSQL (≥16) as the primary control-plane database with per-module schema isolation. MongoDB is retained only for legacy modules not yet converted.
### 1.1 Schema Topology
```
┌───────────────────────────────────────────────────────────────┐
│                      PostgreSQL Cluster                       │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │                   stellaops (database)                    │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐           │ │
│ │ │authority│ │  vuln   │ │   vex   │ │scheduler│           │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘           │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐           │ │
│ │ │ notify  │ │ policy  │ │  packs  │ │ issuer  │           │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘           │ │
│ │ ┌─────────┐                                               │ │
│ │ │  audit  │  (cross-cutting audit schema)                 │ │
│ │ └─────────┘                                               │ │
│ └───────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
```
### 1.2 Module Schema Ownership
| Schema | Owner Module | Primary Tables |
|--------|--------------|----------------|
| `authority` | Authority | tenants, users, roles, tokens, licenses |
| `vuln` | Concelier | sources, advisories, advisory_affected, kev_flags |
| `vex` | Excititor | projects, graph_revisions, statements, observations |
| `scheduler` | Scheduler | schedules, runs, graph_jobs, workers, locks |
| `notify` | Notify | channels, templates, rules, deliveries |
| `policy` | Policy | packs, rules, evaluations, exceptions |
| `concelier` | Concelier | documents, dtos, states, exports |
| `audit` | Shared | audit_log (cross-cutting) |
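Schema isolation is typically enforced with one database role per module. A minimal provisioning sketch for one module; the `stellaops_<module>` role naming is an assumption for illustration, not a StellaOps convention:
```sql
-- Illustrative per-module role and schema layout.
CREATE ROLE stellaops_scheduler LOGIN PASSWORD '<secret>';
CREATE SCHEMA IF NOT EXISTS scheduler AUTHORIZATION stellaops_scheduler;

-- Keep other modules out of this schema; the owning role retains access.
REVOKE ALL ON SCHEMA scheduler FROM PUBLIC;
```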
---
## 2. Performance Configuration
### 2.1 Enable pg_stat_statements
The `pg_stat_statements` extension is essential for query performance analysis. Enable it in your PostgreSQL configuration:
**postgresql.conf:**
```ini
# Load the extension at startup
shared_preload_libraries = 'pg_stat_statements'

# Configuration
pg_stat_statements.max = 10000
pg_stat_statements.track = all
pg_stat_statements.track_utility = on
pg_stat_statements.track_planning = on
```
**Enable in database:**
```sql
-- Create the extension (requires superuser)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Verify installation
SELECT * FROM pg_stat_statements LIMIT 1;

-- Reset statistics (useful after configuration changes)
SELECT pg_stat_statements_reset();
```
### 2.2 Recommended PostgreSQL Settings
**Memory Configuration (adjust based on available RAM):**
```ini
# For a server with 16GB RAM dedicated to PostgreSQL:
shared_buffers = 4GB             # 25% of RAM
effective_cache_size = 12GB      # 75% of RAM
maintenance_work_mem = 1GB       # For VACUUM, CREATE INDEX
work_mem = 64MB                  # Per-operation sort memory

# Connection management
max_connections = 200            # Adjust based on pooling
```
**Write-Ahead Log (WAL):**
```ini
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB
min_wal_size = 1GB
```
**Query Planner:**
```ini
random_page_cost = 1.1 # For SSDs (default 4.0 is for HDDs)
effective_io_concurrency = 200 # For SSDs
default_statistics_target = 100 # Increase for complex queries
```
**Parallel Query:**
```ini
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_parallel_maintenance_workers = 4
```
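These settings can also be applied without hand-editing `postgresql.conf`, via `ALTER SYSTEM` (which writes `postgresql.auto.conf`). Note that memory settings such as `shared_buffers` still require a restart, while planner settings only need a reload:
```sql
ALTER SYSTEM SET random_page_cost = 1.1;
ALTER SYSTEM SET effective_io_concurrency = 200;
SELECT pg_reload_conf();                   -- picks up reloadable settings

ALTER SYSTEM SET shared_buffers = '4GB';   -- takes effect after restart
```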
### 2.3 Connection Pooling (PgBouncer)
**Recommended PgBouncer configuration:**
```ini
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 60
query_timeout = 30
```
**Session configuration (set on connection open):**
```sql
SET app.tenant_id = '<tenant-uuid>';
SET timezone = 'UTC';
SET statement_timeout = '30s';
```
Note: with `pool_mode = transaction`, plain `SET` does not survive transaction boundaries, because the server connection is reassigned to another client after each commit. Issue `SET LOCAL` inside each transaction instead, or move these into connection parameters.
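If row-level security is enabled on tenant-scoped tables, the `app.tenant_id` setting can enforce isolation in the database itself. A minimal sketch, assuming `scheduler.runs` carries a `tenant_id uuid` column; the policy name is illustrative:
```sql
-- current_setting(..., true) returns NULL when the variable is unset,
-- so sessions without a tenant context see no rows.
ALTER TABLE scheduler.runs ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON scheduler.runs
  USING (tenant_id = current_setting('app.tenant_id', true)::uuid);
```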
---
## 3. Query Performance Analysis
### 3.1 Identifying Slow Queries
**Top queries by total time:**
```sql
SELECT
  substring(query, 1, 100) as query_preview,
  calls,
  round(total_exec_time::numeric, 2) as total_ms,
  round(mean_exec_time::numeric, 2) as mean_ms,
  round((100 * total_exec_time / sum(total_exec_time) over())::numeric, 2) as percent_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```
**Queries with high mean execution time:**
```sql
SELECT
  substring(query, 1, 100) as query_preview,
  calls,
  round(mean_exec_time::numeric, 2) as mean_ms,
  round(stddev_exec_time::numeric, 2) as stddev_ms,
  rows
FROM pg_stat_statements
WHERE calls > 10
ORDER BY mean_exec_time DESC
LIMIT 20;
```
**Queries with high buffer usage (I/O intensive):**
```sql
SELECT
  substring(query, 1, 100) as query_preview,
  calls,
  shared_blks_hit + shared_blks_read as total_blks,
  round(100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0), 2) as hit_ratio
FROM pg_stat_statements
WHERE shared_blks_hit + shared_blks_read > 1000
ORDER BY shared_blks_read DESC
LIMIT 20;
```
### 3.2 Using EXPLAIN ANALYZE
**Basic usage:**
```sql
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM vuln.advisories
WHERE state = 'active' AND severity = 'critical'
ORDER BY modified_at DESC
LIMIT 100;
```
**Understanding output - key indicators:**
- **Seq Scan** on large tables = missing index
- **Hash Join** vs **Nested Loop** - consider data sizes
- **Rows** estimate vs actual - statistics accuracy
- **Buffers: shared hit/read** - cache effectiveness
**Example analysis:**
```sql
-- Bad: Sequential scan on large table
Seq Scan on advisories  (cost=0.00..50000.00 rows=1000 width=100)
  Filter: ((state = 'active') AND (severity = 'critical'))
  Rows Removed by Filter: 99000

-- Good: Index scan
Index Scan using idx_advisories_state_severity on advisories
  Index Cond: ((state = 'active') AND (severity = 'critical'))
```
### 3.3 Index Analysis
**Find unused indexes:**
```sql
SELECT
  schemaname || '.' || relname as table,
  indexrelname as index,
  pg_size_pretty(pg_relation_size(indexrelid)) as size,
  idx_scan as scans
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND schemaname NOT IN ('pg_catalog', 'pg_toast')
ORDER BY pg_relation_size(indexrelid) DESC;
```
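Once an index is confirmed unused (check every replica as well, since `pg_stat_user_indexes` counters are per-instance, and make sure the stats window covers periodic jobs), it can be dropped without blocking writes:
```sql
-- Index name is illustrative; DROP INDEX CONCURRENTLY cannot run
-- inside a transaction block.
DROP INDEX CONCURRENTLY vuln.idx_advisories_obsolete;
```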
**Find missing indexes (tables with high sequential scans):**
```sql
SELECT
  schemaname || '.' || relname as table,
  seq_scan,
  seq_tup_read,
  idx_scan,
  round(100.0 * idx_scan / nullif(seq_scan + idx_scan, 0), 2) as idx_usage_pct
FROM pg_stat_user_tables
WHERE seq_scan > 100
ORDER BY seq_tup_read DESC
LIMIT 20;
```
**Duplicate indexes:**
```sql
SELECT
  pg_size_pretty(sum(pg_relation_size(idx))::bigint) as size,
  array_agg(idx) as indexes,
  indrelid::regclass as table,
  indkey as columns
FROM (
  SELECT indexrelid::regclass as idx, indrelid, indkey
  FROM pg_index
) sub
GROUP BY indrelid, indkey
HAVING count(*) > 1;
```
---
## 4. Index Guidelines for StellaOps
### 4.1 Standard Index Patterns
All tenant-scoped tables should have composite indexes starting with `tenant_id`:
```sql
-- Standard tenant + primary lookup pattern
CREATE INDEX idx_<table>_tenant_<field> ON <schema>.<table>(tenant_id, <field>);

-- Time-based queries
CREATE INDEX idx_<table>_tenant_time ON <schema>.<table>(tenant_id, created_at DESC);

-- State/status filtering
CREATE INDEX idx_<table>_tenant_state ON <schema>.<table>(tenant_id, state)
  WHERE state IN ('active', 'pending');
```
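As a concrete instance of these templates, assuming `notify.deliveries` follows the tenant-scoped conventions above (`tenant_id`, `created_at`, and `state` are assumed column names):
```sql
CREATE INDEX idx_deliveries_tenant_time
  ON notify.deliveries(tenant_id, created_at DESC);

CREATE INDEX idx_deliveries_tenant_state
  ON notify.deliveries(tenant_id, state)
  WHERE state IN ('active', 'pending');
```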
### 4.2 Module-Specific Indexes
**Authority schema:**
```sql
CREATE INDEX idx_users_tenant ON authority.users(tenant_id);
CREATE INDEX idx_users_email ON authority.users(email) WHERE email IS NOT NULL;
CREATE INDEX idx_tokens_expires ON authority.tokens(expires_at) WHERE revoked_at IS NULL;
```
**Vuln schema:**
```sql
CREATE INDEX idx_advisories_primary_vuln ON vuln.advisories(primary_vuln_id);
CREATE INDEX idx_advisories_modified ON vuln.advisories(modified_at DESC);
CREATE INDEX idx_advisory_aliases_value ON vuln.advisory_aliases(alias_value);
CREATE INDEX idx_advisory_affected_purl ON vuln.advisory_affected(package_purl)
  WHERE package_purl IS NOT NULL;
```
**Scheduler schema:**
```sql
CREATE INDEX idx_runs_tenant_state ON scheduler.runs(tenant_id, state);
CREATE INDEX idx_runs_state_created ON scheduler.runs(state, created_at)
  WHERE state IN ('pending', 'queued', 'running');
CREATE INDEX idx_graph_jobs_tenant_status ON scheduler.graph_jobs(tenant_id, status);
```
### 4.3 JSONB Indexes
```sql
-- GIN index for containment queries (@>, ?, ?&, ?|)
CREATE INDEX idx_<table>_<column>_gin ON <schema>.<table> USING GIN (<column>);
-- Expression index for specific JSON paths
CREATE INDEX idx_<table>_<column>_path ON <schema>.<table> ((<column>->>'specific_key'));
```
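For example, a GIN index on a hypothetical `labels` JSONB column on `notify.channels` would serve containment lookups like these (column, key, and `id` names are illustrative):
```sql
-- Served by: CREATE INDEX ... USING GIN (labels);
SELECT id FROM notify.channels
WHERE labels @> '{"env": "production"}';

-- Served by an expression index on (labels->>'region'):
SELECT id FROM notify.channels
WHERE labels->>'region' = 'eu-west-1';
```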
---
## 5. Monitoring Setup
### 5.1 Key Metrics to Monitor
**Connection metrics:**
```sql
-- Current connections by state
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;

-- Connections by database/user
SELECT datname, usename, count(*)
FROM pg_stat_activity
GROUP BY datname, usename;
```
**Cache effectiveness:**
```sql
-- Database-level cache hit ratio (should be >99%)
SELECT
  datname,
  round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) as cache_hit_ratio
FROM pg_stat_database
WHERE datname = 'stellaops';
```
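The same ratio at table level (via `pg_statio_user_tables`) helps attribute a low database-wide ratio to specific hot tables:
```sql
-- Per-table cache hit ratio, worst I/O offenders first
SELECT
  schemaname || '.' || relname as table,
  round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 2) as hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 20;
```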
**Table bloat and maintenance:**
```sql
-- Tables needing VACUUM
SELECT
  schemaname || '.' || relname as table,
  n_dead_tup,
  n_live_tup,
  round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
  last_vacuum,
  last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
```
### 5.2 Prometheus Metrics
Use `postgres_exporter` for Prometheus integration. Example alert rules:
```yaml
# Alert rules for PostgreSQL
groups:
  - name: postgresql
    rules:
      - alert: PostgreSQLHighConnections
        expr: sum(pg_stat_activity_count) > max(pg_settings_max_connections) * 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL connection count {{ $value }} above 80% of max_connections"

      - alert: PostgreSQLLowCacheHitRatio
        expr: rate(pg_stat_database_blks_hit[5m]) / (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m])) < 0.95
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL cache hit ratio below 95%"

      - alert: PostgreSQLDeadlocks
        expr: rate(pg_stat_database_deadlocks[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL deadlocks detected"

      - alert: PostgreSQLSlowQueries
        expr: pg_stat_activity_max_tx_duration > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Long-running transaction detected (>5min)"
```
### 5.3 Grafana Dashboard
Import the PostgreSQL dashboard (ID: 9628) or create custom panels for:
1. **Connection Pool** - Active/idle/waiting connections
2. **Query Performance** - QPS, latency percentiles
3. **Cache Hit Ratio** - Database and table level
4. **Disk I/O** - Read/write IOPS and throughput
5. **Replication Lag** - For HA setups
6. **Lock Waits** - Blocked queries count
---
## 6. Performance Baselines
### 6.1 Expected Performance Targets
| Operation | Target P95 | Notes |
|-----------|------------|-------|
| Simple key lookup | < 5ms | Single row by UUID |
| Tenant-filtered list | < 50ms | 100 rows with pagination |
| Advisory search | < 100ms | With FTS and filters |
| VEX statement insert | < 20ms | Single statement |
| Scheduler job enqueue | < 10ms | With lock acquisition |
| Report generation | < 500ms | Full SBOM evaluation |
### 6.2 Baseline Queries
Run these periodically to establish baselines:
```sql
-- Authority: User lookup
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM authority.users
WHERE tenant_id = '<uuid>' AND normalized_username = 'testuser';

-- Vuln: Advisory search
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM vuln.advisories
WHERE state = 'active'
  AND to_tsvector('english', title || ' ' || coalesce(summary, '')) @@ plainto_tsquery('critical vulnerability')
ORDER BY modified_at DESC
LIMIT 50;

-- Scheduler: Pending jobs
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM scheduler.runs
WHERE tenant_id = '<uuid>' AND state = 'pending'
ORDER BY created_at
LIMIT 100;
```
### 6.3 Load Testing
Use `pgbench` for baseline load testing:
```bash
# Initialize test data
pgbench -i -s 50 stellaops
# Run benchmark (60 seconds, 10 clients)
pgbench -c 10 -j 4 -T 60 stellaops
# Custom script benchmark
pgbench -c 10 -j 4 -T 60 -f custom_workload.sql stellaops
```
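The `custom_workload.sql` referenced above is any pgbench script. A hypothetical sketch exercising the scheduler pending-jobs path (the `:limit` variable and the `run_id` column are assumptions, not part of the documented schema):
```sql
-- pgbench substitutes :variables per transaction;
-- \set draws a new random value each time.
\set limit random(10, 100)
SELECT run_id, state, created_at
FROM scheduler.runs
WHERE state = 'pending'
ORDER BY created_at
LIMIT :limit;
```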
---
## 7. Backup and Restore
### 7.1 Backup Strategy
**Daily full backup with pg_dump:**
```bash
#!/bin/bash
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR=/var/backups/postgresql

pg_dump -Fc -Z 9 \
  --host="${PGHOST}" \
  --port="${PGPORT}" \
  --username="${PGUSER}" \
  --dbname=stellaops \
  --file="${BACKUP_DIR}/stellaops_${DATE}.dump"

# Retain last 7 days
find "${BACKUP_DIR}" -name "*.dump" -mtime +7 -delete
```
**Continuous WAL archiving:**
```ini
# postgresql.conf
archive_mode = on
# Guard against overwriting an already-archived segment
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
```
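Point-in-time recovery (section 7.2) replays archived WAL on top of a physical base backup, which `pg_dump` cannot provide. A minimal sketch using `pg_basebackup` (paths and credentials are illustrative):
```bash
# Take a physical base backup as the PITR starting point
pg_basebackup \
  --host="${PGHOST}" \
  --username="${PGUSER}" \
  --pgdata="/var/backups/postgresql/base_$(date +%Y%m%d)" \
  --format=tar \
  --wal-method=stream \
  --checkpoint=fast \
  --progress
```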
### 7.2 Point-in-Time Recovery
```bash
# Stop PostgreSQL
systemctl stop postgresql

# Restore the physical base backup (PITR replays WAL on top of a
# file-level base backup such as one taken with pg_basebackup; a
# logical pg_dump archive cannot serve as the starting point)
rm -rf "${PGDATA:?}"/*
tar -xf /var/backups/postgresql/base_backup.tar -C "${PGDATA}"

# Configure recovery (PostgreSQL 12+: recovery.signal + postgresql.auto.conf)
cat >> ${PGDATA}/postgresql.auto.conf << EOF
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '2025-12-10 14:30:00 UTC'
EOF
touch ${PGDATA}/recovery.signal

# Start PostgreSQL
systemctl start postgresql
```
### 7.3 Backup Verification
```bash
# Test restore into a scratch database (-C would recreate the original
# database name; create the scratch database explicitly instead)
createdb stellaops_test
pg_restore -d stellaops_test /var/backups/postgresql/stellaops_backup.dump
# Verify data integrity
psql -d stellaops_test -c "SELECT count(*) FROM authority.users;"
psql -d stellaops_test -c "SELECT count(*) FROM vuln.advisories;"
# Cleanup
dropdb stellaops_test
```
---
## 8. Scaling Recommendations
### 8.1 Vertical Scaling
| Load Level | vCPUs | RAM | Storage | Connections |
|------------|-------|-----|---------|-------------|
| Development | 2 | 4GB | 50GB SSD | 50 |
| Small (<1k images) | 4 | 16GB | 200GB SSD | 100 |
| Medium (1k-10k images) | 8 | 32GB | 500GB SSD | 200 |
| Large (10k+ images) | 16 | 64GB | 1TB+ NVMe | 500 |
### 8.2 Horizontal Scaling
**Read replicas for reporting:**
```yaml
# Primary for writes
primary:
  host: postgres-primary.internal
  port: 5432

# Replicas for reads (round-robin)
replicas:
  - host: postgres-replica-1.internal
    port: 5432
  - host: postgres-replica-2.internal
    port: 5432
```
**Connection routing in application:**
- Writes → Primary
- Heavy reads (reports, dashboards) → Replicas
- Scheduler impact queries → Replicas with acceptable lag
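For libpq-compatible clients, multi-host connection strings with `target_session_attrs` give a simple form of this routing (hosts taken from the example above; `read-only` requires libpq 14+):
```
# Writes: only a host accepting read-write sessions is selected
postgresql://postgres-primary.internal:5432/stellaops?target_session_attrs=read-write

# Heavy reads: spread across standbys
postgresql://postgres-replica-1.internal:5432,postgres-replica-2.internal:5432/stellaops?target_session_attrs=read-only
```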
### 8.3 Table Partitioning
For high-volume tables (>100M rows), consider partitioning:
```sql
-- Partition scheduler.runs by created_at
CREATE TABLE scheduler.runs_partitioned (
  LIKE scheduler.runs INCLUDING ALL
) PARTITION BY RANGE (created_at);

-- Monthly partitions
CREATE TABLE scheduler.runs_y2025m12
  PARTITION OF scheduler.runs_partitioned
  FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');

-- Automate partition creation
-- See: pg_partman extension
```
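If pg_partman is not available, future partitions can be scripted. A hypothetical sketch that pre-creates next month's partition, following the `runs_yYYYYmMM` naming above:
```sql
-- Safe to run repeatedly (IF NOT EXISTS); schedule ahead of month end.
DO $$
DECLARE
  start_date date := date_trunc('month', now() + interval '1 month');
  end_date   date := start_date + interval '1 month';
  part_name  text := format('runs_y%sm%s',
                            to_char(start_date, 'YYYY'),
                            to_char(start_date, 'MM'));
BEGIN
  EXECUTE format(
    'CREATE TABLE IF NOT EXISTS scheduler.%I
       PARTITION OF scheduler.runs_partitioned
       FOR VALUES FROM (%L) TO (%L)',
    part_name, start_date, end_date);
END $$;
```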
### 8.4 Connection Pooling at Scale
For >1000 concurrent connections, deploy PgBouncer as a sidecar or dedicated service:
```yaml
# Kubernetes deployment with PgBouncer sidecar
containers:
  - name: app
    env:
      - name: DATABASE_URL
        value: "postgresql://localhost:6432/stellaops"
  - name: pgbouncer
    image: pgbouncer/pgbouncer:1.21.0
    ports:
      - containerPort: 6432
```
---
## 9. Troubleshooting
### 9.1 Common Issues
**High connection count:**
```sql
-- Identify connection sources
SELECT client_addr, usename, state, count(*)
FROM pg_stat_activity
GROUP BY 1, 2, 3
ORDER BY 4 DESC;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '30 minutes';
```
**Lock contention:**
```sql
-- Find blocking queries
SELECT
  blocked.pid as blocked_pid,
  blocked.query as blocked_query,
  blocking.pid as blocking_pid,
  blocking.query as blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';
```
**Table bloat:**
```sql
-- Check table and index sizes
SELECT
  schemaname || '.' || relname as table,
  pg_size_pretty(pg_total_relation_size(relid)) as total_size,
  pg_size_pretty(pg_table_size(relid)) as table_size,
  pg_size_pretty(pg_indexes_size(relid)) as index_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;
-- Manual VACUUM FULL for severe bloat (takes an ACCESS EXCLUSIVE lock:
-- blocks reads and writes for the duration!)
VACUUM (FULL, ANALYZE) scheduler.runs;
```
### 9.2 Emergency Procedures
**Kill long-running queries:**
```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '10 minutes'
  AND query NOT LIKE '%pg_stat%';
```
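A gentler first step is `pg_cancel_backend`, which cancels the running query but keeps the connection alive:
```sql
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '10 minutes'
  AND pid <> pg_backend_pid();
```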
**Force checkpoint (before maintenance):**
```sql
CHECKPOINT;
```
**Emergency read-only mode (applies to new sessions; existing sessions keep their current setting until they reconnect):**
```sql
ALTER DATABASE stellaops SET default_transaction_read_only = on;
```
---
## 10. Air-Gap Considerations
### 10.1 Offline Setup
PostgreSQL 16+ is bundled in the air-gap kit. See `docs/24_OFFLINE_KIT.md` for import instructions.
**Docker image digest (pinned):**
```yaml
postgres:
  image: docker.io/library/postgres:16@sha256:<pinned-digest>
```
### 10.2 Migrations in Air-Gap
All migrations are embedded in application assemblies. No network access required:
```bash
# Run migrations manually
dotnet run --project src/Tools/MigrationRunner -- \
--connection "Host=postgres;Database=stellaops;..." \
--schema all
```
### 10.3 Backup in Air-Gap
```bash
# Local backup with encryption
pg_dump -Fc stellaops | gpg --encrypt -r backup@stellaops.local > backup.dump.gpg
# Restore
gpg --decrypt backup.dump.gpg | pg_restore -d stellaops
```
---
## Appendix A: Quick Reference
### Connection String Template
```
Host=<host>;Port=5432;Database=stellaops;Username=<user>;Password=<pass>;
Pooling=true;MinPoolSize=5;MaxPoolSize=20;ConnectionIdleLifetime=300;
CommandTimeout=30;Timeout=15;
```
### Essential Commands
```bash
# Connect to database
psql -h localhost -U stellaops -d stellaops
# Check version
psql -c "SELECT version();"
# List schemas
psql -c "\dn"
# List tables in schema
psql -c "\dt vuln.*"
# Table structure
psql -c "\d vuln.advisories"
# Current activity
psql -c "SELECT * FROM pg_stat_activity;"
```
### Useful Extensions
```sql
CREATE EXTENSION IF NOT EXISTS pg_stat_statements; -- Query statistics
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- Fuzzy text search
CREATE EXTENSION IF NOT EXISTS btree_gin; -- GIN for scalars
CREATE EXTENSION IF NOT EXISTS pgcrypto; -- Cryptographic functions
```