PostgreSQL Operations Guide
Version: 1.0.0 | Last Updated: 2025-12-10 | Status: Active
This guide covers PostgreSQL operations for StellaOps, including setup, performance tuning, monitoring, backup/restore, and scaling recommendations.
1. Overview
StellaOps uses PostgreSQL (≥16) as the primary control-plane database with per-module schema isolation. MongoDB is retained only for legacy modules not yet converted.
1.1 Schema Topology
┌─────────────────────────────────────────────────────────────────┐
│ PostgreSQL Cluster │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ stellaops (database) ││
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ││
│ │ │authority│ │ vuln │ │ vex │ │scheduler│ ││
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ ││
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ││
│ │ │ notify │ │ policy │ │ packs │ │ issuer │ ││
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ ││
│ │ ┌─────────┐ ││
│ │ │ audit │ (cross-cutting audit schema) ││
│ │ └─────────┘ ││
│ └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
1.2 Module Schema Ownership
| Schema | Owner Module | Primary Tables |
|---|---|---|
| authority | Authority | tenants, users, roles, tokens, licenses |
| vuln | Concelier | sources, advisories, advisory_affected, kev_flags |
| vex | Excititor | projects, graph_revisions, statements, observations |
| scheduler | Scheduler | schedules, runs, graph_jobs, workers, locks |
| notify | Notify | channels, templates, rules, deliveries |
| policy | Policy | packs, rules, evaluations, exceptions |
| concelier | Concelier | documents, dtos, states, exports |
| audit | Shared | audit_log (cross-cutting) |
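Schema provisioning follows this per-module ownership pattern; a hypothetical sketch (role names are illustrative, not the actual migration scripts):
-- Each module connects as a role that owns only its own schema
CREATE ROLE scheduler_svc LOGIN;
CREATE SCHEMA IF NOT EXISTS scheduler AUTHORIZATION scheduler_svc;
-- Keep the module role from creating objects outside its schema
REVOKE CREATE ON DATABASE stellaops FROM scheduler_svc;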
2. Performance Configuration
2.1 Enable pg_stat_statements
The pg_stat_statements extension is essential for query performance analysis. Enable it in your PostgreSQL configuration:
postgresql.conf:
# Load the extension at startup
shared_preload_libraries = 'pg_stat_statements'
# Configuration
pg_stat_statements.max = 10000
pg_stat_statements.track = all
pg_stat_statements.track_utility = on
pg_stat_statements.track_planning = on
Enable in database:
-- Create the extension (requires superuser)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Verify installation
SELECT * FROM pg_stat_statements LIMIT 1;
-- Reset statistics (useful after configuration changes)
SELECT pg_stat_statements_reset();
2.2 Recommended PostgreSQL Settings
Memory Configuration (adjust based on available RAM):
# For a server with 16GB RAM dedicated to PostgreSQL:
shared_buffers = 4GB # 25% of RAM
effective_cache_size = 12GB # 75% of RAM
maintenance_work_mem = 1GB # For VACUUM, CREATE INDEX
work_mem = 64MB # Per-operation sort memory
# Connection management
max_connections = 200 # Adjust based on pooling
Write-Ahead Log (WAL):
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB
min_wal_size = 1GB
Query Planner:
random_page_cost = 1.1 # For SSDs (default 4.0 is for HDDs)
effective_io_concurrency = 200 # For SSDs
default_statistics_target = 100 # Increase for complex queries
Parallel Query:
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_parallel_maintenance_workers = 4
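Most of these live in postgresql.conf, but they can also be applied at runtime with ALTER SYSTEM (which writes postgresql.auto.conf); only restart-scope parameters such as shared_buffers require a full restart:
-- Persisted to postgresql.auto.conf
ALTER SYSTEM SET random_page_cost = 1.1;
-- A reload picks up reload-scope parameters immediately
SELECT pg_reload_conf();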
2.3 Connection Pooling (PgBouncer)
Recommended PgBouncer configuration:
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 60
query_timeout = 30
Session configuration (set on connection open):
SET app.tenant_id = '<tenant-uuid>';
SET timezone = 'UTC';
SET statement_timeout = '30s';
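The app.tenant_id setting is the natural hook for row-level security. A hypothetical sketch of the pattern (the policy and table are illustrative, not the actual StellaOps definitions):
-- Scope every query on the table to the connection's tenant
ALTER TABLE scheduler.runs ENABLE ROW LEVEL SECURITY;
CREATE POLICY runs_tenant_isolation ON scheduler.runs
    USING (tenant_id = current_setting('app.tenant_id')::uuid);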
3. Query Performance Analysis
3.1 Identifying Slow Queries
Top queries by total time:
SELECT
substring(query, 1, 100) as query_preview,
calls,
round(total_exec_time::numeric, 2) as total_ms,
round(mean_exec_time::numeric, 2) as mean_ms,
round((100 * total_exec_time / sum(total_exec_time) over())::numeric, 2) as percent_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
Queries with high mean execution time:
SELECT
substring(query, 1, 100) as query_preview,
calls,
round(mean_exec_time::numeric, 2) as mean_ms,
round(stddev_exec_time::numeric, 2) as stddev_ms,
rows
FROM pg_stat_statements
WHERE calls > 10
ORDER BY mean_exec_time DESC
LIMIT 20;
Queries with high buffer usage (I/O intensive):
SELECT
substring(query, 1, 100) as query_preview,
calls,
shared_blks_hit + shared_blks_read as total_blks,
round(100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0), 2) as hit_ratio
FROM pg_stat_statements
WHERE shared_blks_hit + shared_blks_read > 1000
ORDER BY shared_blks_read DESC
LIMIT 20;
3.2 Using EXPLAIN ANALYZE
Basic usage:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM vuln.advisories
WHERE state = 'active' AND severity = 'critical'
ORDER BY modified_at DESC
LIMIT 100;
Understanding output - key indicators:
- Seq Scan on large tables = missing index
- Hash Join vs Nested Loop - consider data sizes
- Rows estimate vs actual - statistics accuracy
- Buffers: shared hit/read - cache effectiveness
Example analysis:
-- Bad: Sequential scan on large table
Seq Scan on advisories (cost=0.00..50000.00 rows=1000 width=100)
Filter: ((state = 'active') AND (severity = 'critical'))
Rows Removed by Filter: 99000
-- Good: Index scan
Index Scan using idx_advisories_state_severity on advisories
Index Cond: ((state = 'active') AND (severity = 'critical'))
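The second plan presupposes an index covering the filter columns; assuming the column names from the query above, the DDL would look like:
-- Serves the state/severity filter; the trailing modified_at DESC
-- also lets the planner skip the sort for the ORDER BY
CREATE INDEX idx_advisories_state_severity
    ON vuln.advisories (state, severity, modified_at DESC);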
3.3 Index Analysis
Find unused indexes:
SELECT
schemaname || '.' || relname as table_name,
indexrelname as index,
pg_size_pretty(pg_relation_size(indexrelid)) as size,
idx_scan as scans
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND schemaname NOT IN ('pg_catalog', 'pg_toast')
ORDER BY pg_relation_size(indexrelid) DESC;
Find missing indexes (tables with high sequential scans):
SELECT
schemaname || '.' || relname as table_name,
seq_scan,
seq_tup_read,
idx_scan,
round(100.0 * idx_scan / nullif(seq_scan + idx_scan, 0), 2) as idx_usage_pct
FROM pg_stat_user_tables
WHERE seq_scan > 100
ORDER BY seq_tup_read DESC
LIMIT 20;
Duplicate indexes:
-- Heuristic: flags indexes with identical column lists; it does not
-- distinguish index type, operator class, expressions, or partial predicates
SELECT
pg_size_pretty(sum(pg_relation_size(idx))::bigint) as size,
array_agg(idx) as indexes,
indrelid::regclass as table_name,
indkey as columns
FROM (
SELECT indexrelid::regclass as idx, indrelid, indkey
FROM pg_index
) sub
GROUP BY indrelid, indkey
HAVING count(*) > 1;
4. Index Guidelines for StellaOps
4.1 Standard Index Patterns
All tenant-scoped tables should have composite indexes starting with tenant_id:
-- Standard tenant + primary lookup pattern
CREATE INDEX idx_<table>_tenant_<field> ON <schema>.<table>(tenant_id, <field>);
-- Time-based queries
CREATE INDEX idx_<table>_tenant_time ON <schema>.<table>(tenant_id, created_at DESC);
-- State/status filtering
CREATE INDEX idx_<table>_tenant_state ON <schema>.<table>(tenant_id, state)
WHERE state IN ('active', 'pending');
4.2 Module-Specific Indexes
Authority schema:
CREATE INDEX idx_users_tenant ON authority.users(tenant_id);
CREATE INDEX idx_users_email ON authority.users(email) WHERE email IS NOT NULL;
CREATE INDEX idx_tokens_expires ON authority.tokens(expires_at) WHERE revoked_at IS NULL;
Vuln schema:
CREATE INDEX idx_advisories_primary_vuln ON vuln.advisories(primary_vuln_id);
CREATE INDEX idx_advisories_modified ON vuln.advisories(modified_at DESC);
CREATE INDEX idx_advisory_aliases_value ON vuln.advisory_aliases(alias_value);
CREATE INDEX idx_advisory_affected_purl ON vuln.advisory_affected(package_purl)
WHERE package_purl IS NOT NULL;
Scheduler schema:
CREATE INDEX idx_runs_tenant_state ON scheduler.runs(tenant_id, state);
CREATE INDEX idx_runs_state_created ON scheduler.runs(state, created_at)
WHERE state IN ('pending', 'queued', 'running');
CREATE INDEX idx_graph_jobs_tenant_status ON scheduler.graph_jobs(tenant_id, status);
4.3 JSONB Indexes
-- GIN index for containment queries (@>, ?, ?&, ?|)
CREATE INDEX idx_<table>_<column>_gin ON <schema>.<table> USING GIN (<column>);
-- Expression index for specific JSON paths
CREATE INDEX idx_<table>_<column>_path ON <schema>.<table> ((<column>->>'specific_key'));
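A concrete instance, assuming a hypothetical JSONB column named labels on vuln.advisories:
-- GIN index plus a containment query it can serve
CREATE INDEX idx_advisories_labels_gin ON vuln.advisories USING GIN (labels);
SELECT id FROM vuln.advisories WHERE labels @> '{"ecosystem": "npm"}';
-- Expression index for one frequently filtered key
CREATE INDEX idx_advisories_labels_eco ON vuln.advisories ((labels->>'ecosystem'));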
5. Monitoring Setup
5.1 Key Metrics to Monitor
Connection metrics:
-- Current connections by state
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;
-- Connections by database/user
SELECT datname, usename, count(*)
FROM pg_stat_activity
GROUP BY datname, usename;
Cache effectiveness:
-- Database-level cache hit ratio (should be >99%)
SELECT
datname,
round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) as cache_hit_ratio
FROM pg_stat_database
WHERE datname = 'stellaops';
Table bloat and maintenance:
-- Tables needing VACUUM
SELECT
schemaname || '.' || relname as table_name,
n_dead_tup,
n_live_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
5.2 Prometheus Metrics
Use postgres_exporter for Prometheus integration. Key metrics:
# Alert rules for PostgreSQL
groups:
  - name: postgresql
    rules:
      - alert: PostgreSQLHighConnections
        expr: pg_stat_activity_count > (pg_settings_max_connections * 0.8)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL connection count above 80% of max_connections (current: {{ $value }})"
      - alert: PostgreSQLLowCacheHitRatio
        expr: pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) < 0.95
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL cache hit ratio below 95%"
      - alert: PostgreSQLDeadlocks
        expr: rate(pg_stat_database_deadlocks[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL deadlocks detected"
      - alert: PostgreSQLSlowQueries
        expr: pg_stat_activity_max_tx_duration > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Long-running transaction detected (>5min)"
5.3 Grafana Dashboard
Import the PostgreSQL dashboard (ID: 9628) or create custom panels for:
- Connection Pool - Active/idle/waiting connections
- Query Performance - QPS, latency percentiles
- Cache Hit Ratio - Database and table level
- Disk I/O - Read/write IOPS and throughput
- Replication Lag - For HA setups
- Lock Waits - Blocked queries count
6. Performance Baselines
6.1 Expected Performance Targets
| Operation | Target P95 | Notes |
|---|---|---|
| Simple key lookup | < 5ms | Single row by UUID |
| Tenant-filtered list | < 50ms | 100 rows with pagination |
| Advisory search | < 100ms | With FTS and filters |
| VEX statement insert | < 20ms | Single statement |
| Scheduler job enqueue | < 10ms | With lock acquisition |
| Report generation | < 500ms | Full SBOM evaluation |
6.2 Baseline Queries
Run these periodically to establish baselines:
-- Authority: User lookup
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM authority.users
WHERE tenant_id = '<uuid>' AND normalized_username = 'testuser';
-- Vuln: Advisory search
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM vuln.advisories
WHERE state = 'active'
AND to_tsvector('english', title || ' ' || coalesce(summary, '')) @@ plainto_tsquery('critical vulnerability')
ORDER BY modified_at DESC
LIMIT 50;
-- Scheduler: Pending jobs
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM scheduler.runs
WHERE tenant_id = '<uuid>' AND state = 'pending'
ORDER BY created_at
LIMIT 100;
6.3 Load Testing
Use pgbench for baseline load testing:
# Initialize test data
pgbench -i -s 50 stellaops
# Run benchmark (60 seconds, 10 clients)
pgbench -c 10 -j 4 -T 60 stellaops
# Custom script benchmark
pgbench -c 10 -j 4 -T 60 -f custom_workload.sql stellaops
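The custom script can replay a hot StellaOps query instead of the default TPC-B workload; one hypothetical custom_workload.sql (reusing the pending-jobs query from section 6.2):
-- Illustrative workload only
SELECT * FROM scheduler.runs
WHERE state = 'pending'
ORDER BY created_at
LIMIT 100;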
7. Backup and Restore
7.1 Backup Strategy
Daily full backup with pg_dump:
#!/bin/bash
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR=/var/backups/postgresql
pg_dump -Fc -Z 9 \
--host="${PGHOST}" \
--port="${PGPORT}" \
--username="${PGUSER}" \
--dbname=stellaops \
--file="${BACKUP_DIR}/stellaops_${DATE}.dump"
# Retain last 7 days
find ${BACKUP_DIR} -name "*.dump" -mtime +7 -delete
Continuous WAL archiving:
# postgresql.conf
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'
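Point-in-time recovery (section 7.2) replays archived WAL on top of a physical base backup, so take one periodically in addition to the logical dumps; a minimal sketch (target directory is illustrative):
# Physical base backup: tar format, compressed, WAL streamed alongside
pg_basebackup -D /var/backups/postgresql/base_$(date +%Y%m%d) -Ft -z -X stream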
7.2 Point-in-Time Recovery
# Stop PostgreSQL
systemctl stop postgresql
# Restore the physical base backup into an empty data directory.
# PITR requires a base backup taken with pg_basebackup; a logical
# pg_dump archive cannot replay WAL. (Backup path is illustrative.)
rm -rf ${PGDATA}/*
tar -xzf /var/backups/postgresql/base_20251210/base.tar.gz -C ${PGDATA}
# Configure recovery (PostgreSQL 12+: recovery.signal + postgresql.auto.conf)
cat >> ${PGDATA}/postgresql.auto.conf << EOF
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '2025-12-10 14:30:00 UTC'
EOF
touch ${PGDATA}/recovery.signal
# Start PostgreSQL
systemctl start postgresql
7.3 Backup Verification
# Test restore into a scratch database
createdb stellaops_test
pg_restore -d stellaops_test /var/backups/postgresql/stellaops_backup.dump
# Verify data integrity
psql -d stellaops_test -c "SELECT count(*) FROM authority.users;"
psql -d stellaops_test -c "SELECT count(*) FROM vuln.advisories;"
# Cleanup
dropdb stellaops_test
8. Scaling Recommendations
8.1 Vertical Scaling
| Load Level | vCPUs | RAM | Storage | Connections |
|---|---|---|---|---|
| Development | 2 | 4GB | 50GB SSD | 50 |
| Small (<1k images) | 4 | 16GB | 200GB SSD | 100 |
| Medium (1k-10k images) | 8 | 32GB | 500GB SSD | 200 |
| Large (10k+ images) | 16 | 64GB | 1TB+ NVMe | 500 |
8.2 Horizontal Scaling
Read replicas for reporting:
# Primary for writes
primary:
  host: postgres-primary.internal
  port: 5432
# Replicas for reads (round-robin)
replicas:
  - host: postgres-replica-1.internal
    port: 5432
  - host: postgres-replica-2.internal
    port: 5432
Connection routing in application:
- Writes → Primary
- Heavy reads (reports, dashboards) → Replicas
- Scheduler impact queries → Replicas with acceptable lag
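Before routing latency-sensitive reads to a replica, verify its lag on the primary:
-- Byte lag per replica, measured on the primary
SELECT client_addr, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;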
8.3 Table Partitioning
For high-volume tables (>100M rows), consider partitioning:
-- Partition scheduler.runs by created_at
CREATE TABLE scheduler.runs_partitioned (
LIKE scheduler.runs INCLUDING ALL
) PARTITION BY RANGE (created_at);
-- Monthly partitions
CREATE TABLE scheduler.runs_y2025m12
PARTITION OF scheduler.runs_partitioned
FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');
-- Automate partition creation
-- See: pg_partman extension
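Until pg_partman is adopted, next month's partition can be created ahead of time from a scheduled job; a hedged sketch following the naming scheme above:
-- Creates e.g. scheduler.runs_y2026m01 one month in advance
DO $$
DECLARE
    start_ts date := date_trunc('month', now() + interval '1 month');
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS scheduler.runs_y%sm%s
           PARTITION OF scheduler.runs_partitioned
           FOR VALUES FROM (%L) TO (%L)',
        to_char(start_ts, 'YYYY'), to_char(start_ts, 'MM'),
        start_ts, start_ts + interval '1 month');
END $$;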
8.4 Connection Pooling at Scale
For >1000 concurrent connections, deploy PgBouncer as a sidecar or dedicated service:
# Kubernetes deployment with PgBouncer sidecar
containers:
  - name: app
    env:
      - name: DATABASE_URL
        value: "postgresql://localhost:6432/stellaops"
  - name: pgbouncer
    image: pgbouncer/pgbouncer:1.21.0
    ports:
      - containerPort: 6432
9. Troubleshooting
9.1 Common Issues
High connection count:
-- Identify connection sources
SELECT client_addr, usename, state, count(*)
FROM pg_stat_activity
GROUP BY 1, 2, 3
ORDER BY 4 DESC;
-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '30 minutes';
Lock contention:
-- Find blocking queries
SELECT
blocked.pid as blocked_pid,
blocked.query as blocked_query,
blocking.pid as blocking_pid,
blocking.query as blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';
Table bloat:
-- Check table and index sizes
SELECT
schemaname || '.' || relname as table_name,
pg_size_pretty(pg_total_relation_size(relid)) as total_size,
pg_size_pretty(pg_table_size(relid)) as table_size,
pg_size_pretty(pg_indexes_size(relid)) as index_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;
-- Manual VACUUM FULL for severe bloat (blocks writes!)
VACUUM (FULL, ANALYZE) scheduler.runs;
9.2 Emergency Procedures
Kill long-running queries:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < now() - interval '10 minutes'
AND query NOT LIKE '%pg_stat%';
Force checkpoint (before maintenance):
CHECKPOINT;
Emergency read-only mode (applies to new sessions only; revert with the same command and off):
ALTER DATABASE stellaops SET default_transaction_read_only = on;
10. Air-Gap Considerations
10.1 Offline Setup
PostgreSQL 16+ is bundled in the air-gap kit. See docs/24_OFFLINE_KIT.md for import instructions.
Docker image digest (pinned):
postgres:
image: docker.io/library/postgres:16@sha256:<pinned-digest>
10.2 Migrations in Air-Gap
All migrations are embedded in application assemblies. No network access required:
# Run migrations manually
dotnet run --project src/Tools/MigrationRunner -- \
--connection "Host=postgres;Database=stellaops;..." \
--schema all
10.3 Backup in Air-Gap
# Local backup with encryption
pg_dump -Fc stellaops | gpg --encrypt -r backup@stellaops.local > backup.dump.gpg
# Restore
gpg --decrypt backup.dump.gpg | pg_restore -d stellaops
Appendix A: Quick Reference
Connection String Template
Host=<host>;Port=5432;Database=stellaops;Username=<user>;Password=<pass>;
Pooling=true;MinPoolSize=5;MaxPoolSize=20;ConnectionIdleLifetime=300;
CommandTimeout=30;Timeout=15;
Essential Commands
# Connect to database
psql -h localhost -U stellaops -d stellaops
# Check version
psql -c "SELECT version();"
# List schemas
psql -c "\dn"
# List tables in schema
psql -c "\dt vuln.*"
# Table structure
psql -c "\d vuln.advisories"
# Current activity
psql -c "SELECT * FROM pg_stat_activity;"
Useful Extensions
CREATE EXTENSION IF NOT EXISTS pg_stat_statements; -- Query statistics
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- Fuzzy text search
CREATE EXTENSION IF NOT EXISTS btree_gin; -- GIN for scalars
CREATE EXTENSION IF NOT EXISTS pgcrypto; -- Cryptographic functions