Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/postgres/pool.md
+++ b/docs/doctor/articles/postgres/pool.md
@@ -0,0 +1,123 @@
+---
+checkId: check.postgres.pool
+plugin: stellaops.doctor.postgres
+severity: warn
+tags: [database, postgres, pool, connections]
+---
+# PostgreSQL Connection Pool
+
+## What It Checks
+Connects to PostgreSQL and queries `pg_stat_activity` and `pg_settings` to evaluate connection pool health:
+
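+- **Pool usage ratio** (active connections / max connections): warn above 70%, fail above 90%.
+- **Waiting connections**: warn if any backend reports `wait_event_type = 'Client'`, which the check counts as waiting. PostgreSQL sets this wait type while a backend waits on client socket activity, so ordinary idle sessions (`ClientRead`) are included; requests queuing for an Npgsql pool slot wait on the client side and do not appear in `pg_stat_activity`.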
+
+Evidence collected: `ActiveConnections`, `IdleConnections`, `MaxConnections`, `UsageRatio`, `ConfiguredMaxPoolSize`, `ConfiguredMinPoolSize`, `WaitingConnections`.
+
+The check requires `ConnectionStrings:StellaOps` or `Database:ConnectionString` to be configured.
+
+The SQL query executed:
+
+```sql
+SELECT
+  (SELECT count(*) FROM pg_stat_activity WHERE state = 'active') as active,
+  (SELECT count(*) FROM pg_stat_activity WHERE state = 'idle') as idle,
+  (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') as max_conn,
+  (SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Client') as waiting
+```
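+
+The thresholds above can be reproduced by hand. This is a minimal sketch, assuming the ratio is computed exactly as active connections divided by `max_connections`; the `status` labels are illustrative and the check's actual implementation may differ:
+
+```sql
+-- Sketch only: mirrors the documented 70% / 90% thresholds.
+WITH stats AS (
+  SELECT
+    (SELECT count(*) FROM pg_stat_activity WHERE state = 'active') AS active,
+    (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_conn
+)
+SELECT
+  active,
+  max_conn,
+  round(active::numeric / max_conn, 3) AS usage_ratio,
+  CASE
+    WHEN active::numeric / max_conn > 0.90 THEN 'fail'
+    WHEN active::numeric / max_conn > 0.70 THEN 'warn'
+    ELSE 'pass'
+  END AS status  -- illustrative labels, not the check's output format
+FROM stats;
+```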
+
+## Why It Matters
+PostgreSQL connection exhaustion is one of the most common causes of service outages in Stella Ops. When the connection pool is exhausted, all services that need the database start timing out, causing cascading failures across the platform. Waiting connections indicate that requests are already queuing for database access, which translates directly to increased latency for end users. Connection leaks, if not caught early, will eventually exhaust the pool completely.
+
+## Common Causes
+- Connection leak in application code (connections opened but not returned to pool)
+- Long-running queries holding connections open
+- Pool size too small for the workload, or too many services competing for the same `max_connections` limit
+- Sudden spike in database requests (bulk scan, CI surge)
+- All pool connections in use during peak load
+- Connection timeout configured too long, allowing stale connections to occupy slots
+- Requests arriving faster than connections are released
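+
+Several of these causes can be told apart by querying `pg_stat_activity` directly. This is a sketch, assuming services set `Application Name` in their connection strings (otherwise `application_name` is empty and the grouping is less useful):
+
+```sql
+-- Connections per application and state. A count for one application
+-- that keeps growing over time may point to a leak.
+SELECT application_name, state, count(*) AS connections
+FROM pg_stat_activity
+WHERE backend_type = 'client backend'
+GROUP BY application_name, state
+ORDER BY connections DESC;
+
+-- Sessions holding a transaction open without doing work.
+SELECT pid, application_name, now() - state_change AS idle_for, left(query, 80) AS last_query
+FROM pg_stat_activity
+WHERE state = 'idle in transaction'
+ORDER BY idle_for DESC;
+```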
+
+## How to Fix
+
+### Docker Compose
+```bash
+# Check active database connections
+docker compose -f docker-compose.stella-ops.yml exec postgres \
+  psql -U stellaops -d stellaops_platform -c \
+  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
+
+# Terminate connections that have been idle for more than 10 minutes
+# (state_change records when the session entered its current state)
+docker compose -f docker-compose.stella-ops.yml exec postgres \
+  psql -U stellaops -d stellaops_platform -c \
+  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
+```
+
+Increase `max_connections` in PostgreSQL by editing `docker-compose.stella-ops.yml`:
+
+```yaml
+services:
+  postgres:
+    command: >
+      postgres
+      -c max_connections=200
+      -c shared_buffers=256MB
+```
+
+Increase Npgsql pool size via connection string:
+
+```yaml
+services:
+  platform:
+    environment:
+      ConnectionStrings__StellaOps: "Host=postgres;Database=stellaops_platform;Username=stellaops;Password=stellaops;Maximum Pool Size=50;Minimum Pool Size=5"
+```
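+
+Each service instance keeps its own Npgsql pool, so the worst-case connection count is roughly the sum of `Maximum Pool Size` over all instances. As a hypothetical example, four instances at `Maximum Pool Size=50` could open up to 200 connections, which would consume a `max_connections` of 200 entirely. A sketch for checking the approximate headroom on the server:
+
+```sql
+-- Approximate number of connection slots still available to ordinary clients.
+SELECT
+  current_setting('max_connections')::int AS max_conn,
+  current_setting('superuser_reserved_connections')::int AS reserved,
+  (SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend') AS in_use,
+  current_setting('max_connections')::int
+    - current_setting('superuser_reserved_connections')::int
+    - (SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend') AS remaining;
+```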
+
+### Bare Metal / systemd
+```bash
+# Check connection statistics
+psql -U stellaops -d stellaops_platform -c \
+  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
+
+# Check for long-running queries
+psql -U stellaops -d stellaops_platform -c \
+  "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"
+
+# Increase max connections
+sudo -u postgres psql -c "ALTER SYSTEM SET max_connections = 200;"
+sudo systemctl restart postgresql
+```
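+
+`max_connections` only takes effect after a server restart. To confirm that the new value is live, a query such as the following can be run through `psql`; `pending_restart` is `t` while a changed value is still waiting for a restart:
+
+```sql
+-- Shows the value currently in effect and whether a restart is still pending.
+SELECT name, setting, pending_restart
+FROM pg_settings
+WHERE name = 'max_connections';
+```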
+
+### Kubernetes / Helm
+```bash
+# Check connection pool from inside a pod
+kubectl exec -it <postgres-pod> -- psql -U stellaops -d stellaops_platform -c \
+  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
+
+# Terminate connections that have been idle for more than 10 minutes
+# (state_change records when the session entered its current state)
+kubectl exec -it <postgres-pod> -- psql -U stellaops -d stellaops_platform -c \
+  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
+```
+
+Set in Helm `values.yaml`:
+
+```yaml
+postgresql:
+  maxConnections: 200
+  sharedBuffers: 256MB
+
+platform:
+  database:
+    connectionString: "Host=postgres;Database=stellaops_platform;Username=stellaops;Password=stellaops;Maximum Pool Size=50;Minimum Pool Size=5"
+```
+
+## Verification
+```bash
+stella doctor run --check check.postgres.pool
+```
+
+## Related Checks
+- `check.postgres.connectivity` -- connectivity issues compound pool problems
+- `check.postgres.migrations` -- schema issues can cause queries to hang, consuming connections
+- `check.operations.job-queue` -- database bottleneck slows job processing