PostgreSQL Operations Guide

Version: 1.0.0 | Last Updated: 2025-12-10 | Status: Active

This guide covers PostgreSQL operations for StellaOps, including setup, performance tuning, monitoring, backup/restore, and scaling recommendations.


1. Overview

StellaOps uses PostgreSQL (≥16) as the primary control-plane database with per-module schema isolation. MongoDB is retained only for legacy modules not yet converted.

1.1 Schema Topology

┌─────────────────────────────────────────────────────────────────┐
│                    PostgreSQL Cluster                            │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                    stellaops (database)                      ││
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐           ││
│  │  │authority│ │  vuln   │ │   vex   │ │scheduler│           ││
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘           ││
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐           ││
│  │  │ notify  │ │ policy  │ │ packs   │ │ issuer  │           ││
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘           ││
│  │  ┌─────────┐                                                ││
│  │  │  audit  │  (cross-cutting audit schema)                  ││
│  │  └─────────┘                                                ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

1.2 Module Schema Ownership

Schema      Owner Module   Primary Tables
authority   Authority      tenants, users, roles, tokens, licenses
vuln        Concelier      sources, advisories, advisory_affected, kev_flags
vex         Excititor      projects, graph_revisions, statements, observations
scheduler   Scheduler      schedules, runs, graph_jobs, workers, locks
notify      Notify         channels, templates, rules, deliveries
policy      Policy         packs, rules, evaluations, exceptions
concelier   Concelier      documents, dtos, states, exports
audit       Shared         audit_log (cross-cutting)
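
Each module owns exactly one schema, which keeps blast radius and migrations isolated. A minimal provisioning sketch, assuming one dedicated login role per module (role names are illustrative; the actual bootstrap scripts ship with each module):

-- Illustrative: one login role per module, owning its schema
CREATE ROLE authority_svc LOGIN PASSWORD '<secret>';
CREATE SCHEMA IF NOT EXISTS authority AUTHORIZATION authority_svc;

-- Keep other modules out
REVOKE ALL ON SCHEMA authority FROM PUBLIC;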

2. Performance Configuration

2.1 Enable pg_stat_statements

The pg_stat_statements extension is essential for query performance analysis. Enable it in your PostgreSQL configuration:

postgresql.conf:

# Load the extension at startup
shared_preload_libraries = 'pg_stat_statements'

# Configuration
pg_stat_statements.max = 10000
pg_stat_statements.track = all
pg_stat_statements.track_utility = on
pg_stat_statements.track_planning = on

Enable in database:

-- Create the extension (requires superuser)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Verify installation
SELECT * FROM pg_stat_statements LIMIT 1;

-- Reset statistics (useful after configuration changes)
SELECT pg_stat_statements_reset();

2.2 Core Server Tuning

Memory Configuration (adjust based on available RAM):

# For a server with 16GB RAM dedicated to PostgreSQL:
shared_buffers = 4GB                  # 25% of RAM
effective_cache_size = 12GB           # 75% of RAM
maintenance_work_mem = 1GB            # For VACUUM, CREATE INDEX
work_mem = 64MB                       # Per-operation sort memory

# Connection management
max_connections = 200                 # Adjust based on pooling

Write-Ahead Log (WAL):

wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB
min_wal_size = 1GB

Query Planner:

random_page_cost = 1.1                # For SSDs (default 4.0 is for HDDs)
effective_io_concurrency = 200        # For SSDs
default_statistics_target = 100       # Increase for complex queries

Parallel Query:

max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_parallel_maintenance_workers = 4
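
After a configuration change, confirm the live values (shared_buffers requires a restart; most of the others take effect on reload). The pending_restart column flags values that have been edited but not yet applied:

SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
               'random_page_cost', 'max_parallel_workers');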

2.3 Connection Pooling (PgBouncer)

Recommended PgBouncer configuration:

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 60
query_timeout = 30

Session configuration (set on connection open; see the transaction-scoped variant below when using transaction pooling):

SET app.tenant_id = '<tenant-uuid>';
SET timezone = 'UTC';
SET statement_timeout = '30s';
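
With pool_mode = transaction, PgBouncer hands the same server connection to different clients between transactions, so plain SET values can be lost or leak across tenants. A safer sketch scopes the settings to each transaction:

BEGIN;
SET LOCAL app.tenant_id = '<tenant-uuid>';   -- reverts automatically at COMMIT/ROLLBACK
SET LOCAL statement_timeout = '30s';
-- ... tenant-scoped work ...
COMMIT;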

3. Query Performance Analysis

3.1 Identifying Slow Queries

Top queries by total time:

SELECT
    substring(query, 1, 100) as query_preview,
    calls,
    round(total_exec_time::numeric, 2) as total_ms,
    round(mean_exec_time::numeric, 2) as mean_ms,
    round((100 * total_exec_time / sum(total_exec_time) over())::numeric, 2) as percent_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Queries with high mean execution time:

SELECT
    substring(query, 1, 100) as query_preview,
    calls,
    round(mean_exec_time::numeric, 2) as mean_ms,
    round(stddev_exec_time::numeric, 2) as stddev_ms,
    rows
FROM pg_stat_statements
WHERE calls > 10
ORDER BY mean_exec_time DESC
LIMIT 20;

Queries with high buffer usage (I/O intensive):

SELECT
    substring(query, 1, 100) as query_preview,
    calls,
    shared_blks_hit + shared_blks_read as total_blks,
    round(100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0), 2) as hit_ratio
FROM pg_stat_statements
WHERE shared_blks_hit + shared_blks_read > 1000
ORDER BY shared_blks_read DESC
LIMIT 20;

3.2 Using EXPLAIN ANALYZE

Basic usage:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM vuln.advisories
WHERE state = 'active' AND severity = 'critical'
ORDER BY modified_at DESC
LIMIT 100;

Key indicators when reading the output:

  • Seq Scan on large tables = missing index
  • Hash Join vs Nested Loop - consider data sizes
  • Rows estimate vs actual - statistics accuracy
  • Buffers: shared hit/read - cache effectiveness

Example analysis:

-- Bad: Sequential scan on large table
Seq Scan on advisories  (cost=0.00..50000.00 rows=1000 width=100)
  Filter: ((state = 'active') AND (severity = 'critical'))
  Rows Removed by Filter: 99000

-- Good: Index scan
Index Scan using idx_advisories_state_severity on advisories
  Index Cond: ((state = 'active') AND (severity = 'critical'))

3.3 Index Analysis

Find unused indexes:

SELECT
    schemaname || '.' || relname as table,
    indexrelname as index,
    pg_size_pretty(pg_relation_size(indexrelid)) as size,
    idx_scan as scans
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND schemaname NOT IN ('pg_catalog', 'pg_toast')
ORDER BY pg_relation_size(indexrelid) DESC;

Find missing indexes (tables with high sequential scans):

SELECT
    schemaname || '.' || relname as table,
    seq_scan,
    seq_tup_read,
    idx_scan,
    round(100.0 * idx_scan / nullif(seq_scan + idx_scan, 0), 2) as idx_usage_pct
FROM pg_stat_user_tables
WHERE seq_scan > 100
ORDER BY seq_tup_read DESC
LIMIT 20;

Duplicate indexes:

SELECT
    pg_size_pretty(sum(pg_relation_size(idx))::bigint) as size,
    array_agg(idx) as indexes,
    indrelid::regclass as table,
    indkey as columns
FROM (
    SELECT indexrelid::regclass as idx, indrelid, indkey
    FROM pg_index
) sub
GROUP BY indrelid, indkey
HAVING count(*) > 1;
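
Once an index is confirmed unused or redundant, it can be dropped without blocking writes (index name illustrative):

-- Must run outside a transaction block
DROP INDEX CONCURRENTLY IF EXISTS vuln.idx_advisories_modified_dup;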

4. Index Guidelines for StellaOps

4.1 Standard Index Patterns

All tenant-scoped tables should have composite indexes starting with tenant_id:

-- Standard tenant + primary lookup pattern
CREATE INDEX idx_<table>_tenant_<field> ON <schema>.<table>(tenant_id, <field>);

-- Time-based queries
CREATE INDEX idx_<table>_tenant_time ON <schema>.<table>(tenant_id, created_at DESC);

-- State/status filtering
CREATE INDEX idx_<table>_tenant_state ON <schema>.<table>(tenant_id, state)
    WHERE state IN ('active', 'pending');

4.2 Module-Specific Indexes

Authority schema:

CREATE INDEX idx_users_tenant ON authority.users(tenant_id);
CREATE INDEX idx_users_email ON authority.users(email) WHERE email IS NOT NULL;
CREATE INDEX idx_tokens_expires ON authority.tokens(expires_at) WHERE revoked_at IS NULL;

Vuln schema:

CREATE INDEX idx_advisories_primary_vuln ON vuln.advisories(primary_vuln_id);
CREATE INDEX idx_advisories_modified ON vuln.advisories(modified_at DESC);
CREATE INDEX idx_advisory_aliases_value ON vuln.advisory_aliases(alias_value);
CREATE INDEX idx_advisory_affected_purl ON vuln.advisory_affected(package_purl)
    WHERE package_purl IS NOT NULL;

Scheduler schema:

CREATE INDEX idx_runs_tenant_state ON scheduler.runs(tenant_id, state);
CREATE INDEX idx_runs_state_created ON scheduler.runs(state, created_at)
    WHERE state IN ('pending', 'queued', 'running');
CREATE INDEX idx_graph_jobs_tenant_status ON scheduler.graph_jobs(tenant_id, status);

4.3 JSONB Indexes

-- GIN index for containment queries (@>, ?, ?&, ?|)
CREATE INDEX idx_<table>_<column>_gin ON <schema>.<table> USING GIN (<column>);

-- Expression index for specific JSON paths
CREATE INDEX idx_<table>_<column>_path ON <schema>.<table> ((<column>->>'specific_key'));
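
For illustration, the queries each index type serves (table and key names here are hypothetical):

-- Containment, served by the GIN index
SELECT * FROM policy.packs WHERE metadata @> '{"status": "published"}';

-- Specific path, served by the expression index
SELECT * FROM policy.packs WHERE metadata->>'specific_key' = 'high';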

5. Monitoring Setup

5.1 Key Metrics to Monitor

Connection metrics:

-- Current connections by state
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;

-- Connections by database/user
SELECT datname, usename, count(*)
FROM pg_stat_activity
GROUP BY datname, usename;

Cache effectiveness:

-- Database-level cache hit ratio (should be >99%)
SELECT
    datname,
    round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) as cache_hit_ratio
FROM pg_stat_database
WHERE datname = 'stellaops';

Table bloat and maintenance:

-- Tables needing VACUUM
SELECT
    schemaname || '.' || relname as table,
    n_dead_tup,
    n_live_tup,
    round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
    last_vacuum,
    last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
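
Tables that appear here repeatedly are candidates for more aggressive per-table autovacuum settings; a sketch with illustrative thresholds (tune against the observed dead_pct):

ALTER TABLE scheduler.runs SET (
    autovacuum_vacuum_scale_factor = 0.05,   -- vacuum at ~5% dead tuples instead of the 20% default
    autovacuum_analyze_scale_factor = 0.02
);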

5.2 Prometheus Metrics

Use postgres_exporter for Prometheus integration. Example alert rules:

# Alert rules for PostgreSQL
groups:
  - name: postgresql
    rules:
      - alert: PostgreSQLHighConnections
        expr: sum by (instance) (pg_stat_activity_count) > on (instance) (pg_settings_max_connections * 0.8)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL connection count above 80% of max_connections ({{ $value }} active)"

      - alert: PostgreSQLLowCacheHitRatio
        expr: rate(pg_stat_database_blks_hit[5m]) / (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m])) < 0.95
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL cache hit ratio below 95%"

      - alert: PostgreSQLDeadlocks
        expr: rate(pg_stat_database_deadlocks[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL deadlocks detected"

      - alert: PostgreSQLSlowQueries
        expr: pg_stat_activity_max_tx_duration > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Long-running transaction detected (>5min)"

5.3 Grafana Dashboard

Import the PostgreSQL dashboard (ID: 9628) or create custom panels for:

  1. Connection Pool - Active/idle/waiting connections
  2. Query Performance - QPS, latency percentiles
  3. Cache Hit Ratio - Database and table level
  4. Disk I/O - Read/write IOPS and throughput
  5. Replication Lag - For HA setups
  6. Lock Waits - Blocked queries count

6. Performance Baselines

6.1 Expected Performance Targets

Operation               Target P95   Notes
Simple key lookup       < 5ms        Single row by UUID
Tenant-filtered list    < 50ms       100 rows with pagination
Advisory search         < 100ms      With FTS and filters
VEX statement insert    < 20ms       Single statement
Scheduler job enqueue   < 10ms       With lock acquisition
Report generation       < 500ms      Full SBOM evaluation

6.2 Baseline Queries

Run these periodically to establish baselines:

-- Authority: User lookup
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM authority.users
WHERE tenant_id = '<uuid>' AND normalized_username = 'testuser';

-- Vuln: Advisory search
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM vuln.advisories
WHERE state = 'active'
  AND to_tsvector('english', title || ' ' || coalesce(summary, '')) @@ plainto_tsquery('critical vulnerability')
ORDER BY modified_at DESC
LIMIT 50;
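
If the advisory search reports a sequential scan, an expression GIN index matching the predicate exactly keeps it within target (index name illustrative):

CREATE INDEX idx_advisories_fts ON vuln.advisories
    USING GIN (to_tsvector('english', title || ' ' || coalesce(summary, '')));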

-- Scheduler: Pending jobs
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM scheduler.runs
WHERE tenant_id = '<uuid>' AND state = 'pending'
ORDER BY created_at
LIMIT 100;

6.3 Load Testing

Use pgbench for baseline load testing:

# Initialize pgbench's own sample tables (pgbench_*), scale factor 50
pgbench -i -s 50 stellaops

# Run benchmark (60 seconds, 10 clients)
pgbench -c 10 -j 4 -T 60 stellaops

# Custom script benchmark
pgbench -c 10 -j 4 -T 60 -f custom_workload.sql stellaops
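
A minimal custom_workload.sql sketch; the query mirrors the advisory-search baseline from section 6.2 and should be replaced with your actual hot paths:

-- custom_workload.sql
SELECT * FROM vuln.advisories
WHERE state = 'active'
ORDER BY modified_at DESC
LIMIT 50;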

7. Backup and Restore

7.1 Backup Strategy

Daily full backup with pg_dump:

#!/bin/bash
set -euo pipefail

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR=/var/backups/postgresql
mkdir -p "${BACKUP_DIR}"

pg_dump -Fc -Z 9 \
  --host="${PGHOST}" \
  --port="${PGPORT}" \
  --username="${PGUSER}" \
  --dbname=stellaops \
  --file="${BACKUP_DIR}/stellaops_${DATE}.dump"

# Retain last 7 days
find "${BACKUP_DIR}" -name "*.dump" -mtime +7 -delete

Continuous WAL archiving:

# postgresql.conf
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
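
Note that the pg_dump backups above are logical: point-in-time recovery (next section) additionally requires periodic physical base backups taken with pg_basebackup. Confirm archiving is keeping up via pg_stat_archiver:

SELECT archived_count, failed_count, last_archived_wal, last_archived_time
FROM pg_stat_archiver;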

7.2 Point-in-Time Recovery

# PITR replays archived WAL on top of a physical base backup (pg_basebackup);
# a logical pg_dump cannot be recovered to a point in time.

# Stop PostgreSQL
systemctl stop postgresql

# Restore the base backup into an empty data directory (backup path illustrative)
rm -rf "${PGDATA:?}"/*
tar -xf /var/backups/postgresql/base_backup.tar -C "${PGDATA}"

# Configure recovery (PostgreSQL 12+: recovery.signal replaces recovery.conf)
cat >> "${PGDATA}"/postgresql.auto.conf << EOF
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '2025-12-10 14:30:00 UTC'
EOF

touch "${PGDATA}"/recovery.signal

# Start PostgreSQL
systemctl start postgresql

7.3 Backup Verification

# Restore into a scratch database (pg_restore -C would recreate "stellaops" itself)
createdb stellaops_test
pg_restore -d stellaops_test /var/backups/postgresql/stellaops_backup.dump

# Verify data integrity
psql -d stellaops_test -c "SELECT count(*) FROM authority.users;"
psql -d stellaops_test -c "SELECT count(*) FROM vuln.advisories;"

# Cleanup
dropdb stellaops_test

8. Scaling Recommendations

8.1 Vertical Scaling

Load Level               vCPUs   RAM    Storage      Connections
Development              2       4GB    50GB SSD     50
Small (<1k images)       4       16GB   200GB SSD    100
Medium (1k-10k images)   8       32GB   500GB SSD    200
Large (10k+ images)      16      64GB   1TB+ NVMe    500

8.2 Horizontal Scaling

Read replicas for reporting:

# Primary for writes
primary:
  host: postgres-primary.internal
  port: 5432

# Replicas for reads (round-robin)
replicas:
  - host: postgres-replica-1.internal
    port: 5432
  - host: postgres-replica-2.internal
    port: 5432

Connection routing in application:

  • Writes → Primary
  • Heavy reads (reports, dashboards) → Replicas
  • Scheduler impact queries → Replicas with acceptable lag
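
To verify that a replica's lag is acceptable before routing reads to it, run on the replica:

SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;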

8.3 Table Partitioning

For high-volume tables (>100M rows), consider partitioning:

-- Partition scheduler.runs by created_at
CREATE TABLE scheduler.runs_partitioned (
    LIKE scheduler.runs INCLUDING ALL
) PARTITION BY RANGE (created_at);

-- Monthly partitions
CREATE TABLE scheduler.runs_y2025m12
    PARTITION OF scheduler.runs_partitioned
    FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');

-- Automate partition creation
-- See: pg_partman extension
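
A DEFAULT partition (PostgreSQL 11+) catches rows outside every defined range, so inserts keep succeeding while next month's partition is still unprovisioned:

CREATE TABLE scheduler.runs_default
    PARTITION OF scheduler.runs_partitioned DEFAULT;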

8.4 Connection Pooling at Scale

For >1000 concurrent connections, deploy PgBouncer as a sidecar or dedicated service:

# Kubernetes deployment with PgBouncer sidecar
containers:
  - name: app
    env:
      - name: DATABASE_URL
        value: "postgresql://localhost:6432/stellaops"
  - name: pgbouncer
    image: pgbouncer/pgbouncer:1.21.0
    ports:
      - containerPort: 6432

9. Troubleshooting

9.1 Common Issues

High connection count:

-- Identify connection sources
SELECT client_addr, usename, state, count(*)
FROM pg_stat_activity
GROUP BY 1, 2, 3
ORDER BY 4 DESC;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '30 minutes';

Lock contention:

-- Find blocking queries
SELECT
    blocked.pid as blocked_pid,
    blocked.query as blocked_query,
    blocking.pid as blocking_pid,
    blocking.query as blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';
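
Migrations are a frequent root cause of such chains: DDL queues behind long-running transactions and then blocks everything else. A sketch of a guard (timeout and column name are illustrative):

SET lock_timeout = '5s';
-- Fails fast instead of silently queueing behind long transactions
ALTER TABLE scheduler.runs ADD COLUMN IF NOT EXISTS example_note text;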

Table bloat:

-- Check table and index sizes
SELECT
    schemaname || '.' || relname as table,
    pg_size_pretty(pg_total_relation_size(relid)) as total_size,
    pg_size_pretty(pg_table_size(relid)) as table_size,
    pg_size_pretty(pg_indexes_size(relid)) as index_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;

-- Manual VACUUM FULL for severe bloat (takes ACCESS EXCLUSIVE: blocks reads and writes!)
VACUUM (FULL, ANALYZE) scheduler.runs;

9.2 Emergency Procedures

Kill long-running queries:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '10 minutes'
  AND query NOT LIKE '%pg_stat%';

Force checkpoint (before maintenance):

CHECKPOINT;

Emergency read-only mode:

ALTER DATABASE stellaops SET default_transaction_read_only = on;
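
This affects only sessions opened after the change; existing sessions keep their current setting. Revert with:

ALTER DATABASE stellaops RESET default_transaction_read_only;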

10. Air-Gap Considerations

10.1 Offline Setup

PostgreSQL 16+ is bundled in the air-gap kit. See docs/24_OFFLINE_KIT.md for import instructions.

Docker image digest (pinned):

postgres:
  image: docker.io/library/postgres:16@sha256:<pinned-digest>

10.2 Migrations in Air-Gap

All migrations are embedded in application assemblies. No network access required:

# Run migrations manually
dotnet run --project src/Tools/MigrationRunner -- \
  --connection "Host=postgres;Database=stellaops;..." \
  --schema all

10.3 Backup in Air-Gap

# Local backup with encryption
pg_dump -Fc stellaops | gpg --encrypt -r backup@stellaops.local > backup.dump.gpg

# Restore
gpg --decrypt backup.dump.gpg | pg_restore -d stellaops

Appendix A: Quick Reference

Connection String Template

Host=<host>;Port=5432;Database=stellaops;Username=<user>;Password=<pass>;
Pooling=true;Minimum Pool Size=5;Maximum Pool Size=20;Connection Idle Lifetime=300;
Command Timeout=30;Timeout=15;

Essential Commands

# Connect to database
psql -h localhost -U stellaops -d stellaops

# Check version
psql -c "SELECT version();"

# List schemas
psql -c "\dn"

# List tables in schema
psql -c "\dt vuln.*"

# Table structure
psql -c "\d vuln.advisories"

# Current activity
psql -c "SELECT * FROM pg_stat_activity;"

Useful Extensions

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;  -- Query statistics
CREATE EXTENSION IF NOT EXISTS pg_trgm;             -- Fuzzy text search
CREATE EXTENSION IF NOT EXISTS btree_gin;           -- GIN for scalars
CREATE EXTENSION IF NOT EXISTS pgcrypto;            -- Cryptographic functions