
| checkId | plugin | severity | tags |
| --- | --- | --- | --- |
| `check.postgres.pool` | `stellaops.doctor.postgres` | warn | `database`, `postgres`, `pool`, `connections` |

# PostgreSQL Connection Pool

## What It Checks

Connects to PostgreSQL and queries `pg_stat_activity` and `pg_settings` to evaluate connection pool health:

- **Pool usage ratio:** warn above 70%, fail above 90% (active connections / max connections).
- **Waiting connections:** warn if any connections are waiting for a pool slot.

Evidence collected: `ActiveConnections`, `IdleConnections`, `MaxConnections`, `UsageRatio`, `ConfiguredMaxPoolSize`, `ConfiguredMinPoolSize`, `WaitingConnections`.

The check requires `ConnectionStrings:StellaOps` or `Database:ConnectionString` to be configured.

The SQL query executed:

```sql
SELECT
  (SELECT count(*) FROM pg_stat_activity WHERE state = 'active') as active,
  (SELECT count(*) FROM pg_stat_activity WHERE state = 'idle') as idle,
  (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') as max_conn,
  (SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Client') as waiting
```
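The threshold logic described above can be sketched as follows. This is a hypothetical helper mirroring the documented thresholds (warn above 70% usage, fail above 90%, warn on any waiting connections), not the actual check code:

```python
def evaluate_pool(active: int, idle: int, max_conn: int, waiting: int):
    """Classify pool health from the four columns returned by the query.

    Returns (severity, evidence) where severity is "pass", "warn", or "fail".
    """
    usage_ratio = active / max_conn
    evidence = {
        "ActiveConnections": active,
        "IdleConnections": idle,
        "MaxConnections": max_conn,
        "UsageRatio": round(usage_ratio, 2),
        "WaitingConnections": waiting,
    }
    if usage_ratio > 0.90:          # fail above 90% usage
        return "fail", evidence
    if usage_ratio > 0.70 or waiting > 0:  # warn above 70% or any waiters
        return "warn", evidence
    return "pass", evidence
```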

## Why It Matters

PostgreSQL connection exhaustion is one of the most common causes of service outages in Stella Ops. When the connection pool is exhausted, all services that need the database start timing out, causing cascading failures across the platform. Waiting connections indicate that requests are already queuing for database access, which translates directly to increased latency for end users. Connection leaks, if not caught early, will eventually exhaust the pool completely.

## Common Causes

- Connection leak in application code (connections opened but never returned to the pool)
- Long-running queries holding connections open
- Pool size too small for the workload (too many services sharing a single pool)
- Sudden spike in database requests (bulk scan, CI surge) arriving faster than connections are released
- Connection timeout configured too long, allowing stale connections to occupy slots
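The leak scenario in the first bullet can be illustrated with a toy fixed-size pool (illustrative only; Npgsql manages pooling automatically when connections are properly disposed):

```python
class ToyPool:
    """Minimal fixed-size pool illustrating leak-driven exhaustion."""

    def __init__(self, size: int):
        self.free = size

    def acquire(self):
        if self.free == 0:
            raise RuntimeError("pool exhausted: no free connections")
        self.free -= 1
        return object()  # stand-in for a real connection

    def release(self, conn):
        self.free += 1


pool = ToyPool(size=2)

# Leaky pattern: acquiring without releasing exhausts the pool.
leaked = [pool.acquire(), pool.acquire()]
try:
    pool.acquire()
except RuntimeError as exc:
    print(exc)  # pool exhausted: no free connections

# Correct pattern: always release (in .NET, `using var conn = ...`).
for conn in leaked:
    pool.release(conn)
assert pool.free == 2
```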

## How to Fix

### Docker Compose

```bash
# Check active database connections
docker compose -f docker-compose.stella-ops.yml exec postgres \
  psql -U stellaops -d stellaops_platform -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Terminate idle connections
docker compose -f docker-compose.stella-ops.yml exec postgres \
  psql -U stellaops -d stellaops_platform -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"
```

Increase `max_connections` in PostgreSQL by overriding the server command in `docker-compose.stella-ops.yml`:

```yaml
services:
  postgres:
    command: >
      postgres
      -c max_connections=200
      -c shared_buffers=256MB
```

Increase the Npgsql pool size via the connection string:

```yaml
services:
  platform:
    environment:
      ConnectionStrings__StellaOps: "Host=postgres;Database=stellaops_platform;Username=stellaops;Password=stellaops;Maximum Pool Size=50;Minimum Pool Size=5"
```
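To confirm the pool settings actually reach the service, the `Key=Value;Key=Value` connection-string format can be parsed directly. This is a hypothetical verification helper for illustration; Npgsql itself parses these keys at startup:

```python
def parse_conn_string(conn_str: str) -> dict:
    """Split an ADO.NET-style 'Key=Value;Key=Value' string into a dict."""
    pairs = (part.split("=", 1) for part in conn_str.split(";") if part)
    return {k.strip(): v.strip() for k, v in pairs}


cs = ("Host=postgres;Database=stellaops_platform;Username=stellaops;"
      "Password=stellaops;Maximum Pool Size=50;Minimum Pool Size=5")
settings = parse_conn_string(cs)
assert settings["Maximum Pool Size"] == "50"
assert settings["Minimum Pool Size"] == "5"
```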

### Bare Metal / systemd

```bash
# Check connection statistics
psql -U stellaops -d stellaops_platform -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Check for long-running queries
psql -U stellaops -d stellaops_platform -c \
  "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"

# Increase max connections (requires a server restart to take effect)
sudo -u postgres psql -c "ALTER SYSTEM SET max_connections = 200;"
sudo systemctl restart postgresql
```

### Kubernetes / Helm

```bash
# Check connection pool from inside a pod
kubectl exec -it <postgres-pod> -- psql -U stellaops -d stellaops_platform -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Terminate idle connections
kubectl exec -it <postgres-pod> -- psql -U stellaops -d stellaops_platform -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"
```

Set in Helm `values.yaml`:

```yaml
postgresql:
  maxConnections: 200
  sharedBuffers: 256MB

platform:
  database:
    connectionString: "Host=postgres;Database=stellaops_platform;Username=stellaops;Password=stellaops;Maximum Pool Size=50;Minimum Pool Size=5"
```

## Verification

```bash
stella doctor run --check check.postgres.pool
```
## Related Checks

- `check.postgres.connectivity` -- connectivity issues compound pool problems
- `check.postgres.migrations` -- schema issues can cause queries to hang, consuming connections
- `check.operations.job-queue` -- a database bottleneck slows job processing