Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

@@ -0,0 +1,124 @@
---
checkId: check.postgres.connectivity
plugin: stellaops.doctor.postgres
severity: fail
tags: [database, postgres, connectivity, core]
---
# PostgreSQL Connectivity
## What It Checks
Opens a connection to PostgreSQL and executes `SELECT version(), current_timestamp` to verify that the database is accessible and responsive. The check measures round-trip latency and reports the following conditions:
- **Critical latency**: fail if response time exceeds 500ms.
- **Warning latency**: warn if response time exceeds 100ms.
- **Connection timeout**: fail if the connection attempt exceeds 10 seconds.
- **Connection failure**: fail on authentication errors, DNS failures, or network issues.
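
The latency thresholds above can be sketched as a small shell function. This is an illustration only, not the check's actual implementation: the function name `classify_latency` and the `pass` status label are assumptions, while the 100 ms and 500 ms limits come from the list above.

```bash
# Hypothetical helper: maps a latency in whole milliseconds to a status,
# using the documented thresholds (warn above 100 ms, fail above 500 ms).
classify_latency() {
  local ms=$1
  if [ "$ms" -gt 500 ]; then
    echo fail
  elif [ "$ms" -gt 100 ]; then
    echo warn
  else
    echo pass   # "pass" is an assumed label for the healthy case
  fi
}
```

For example, `classify_latency 250` prints `warn`, because 250 ms exceeds the warning threshold but not the critical one.
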
The connection string password is masked in all evidence output.
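
When sharing a connection string in a ticket or log, similar masking can be approximated by hand. The sketch below assumes a semicolon-delimited connection string with a `Password=` key, as in the examples later in this article; the `mask_password` name and the `****` placeholder are illustrative and may not match the format the check itself emits.

```bash
# Hypothetical helper: replaces the value of the Password key with a placeholder.
# Everything up to the next ';' (or the end of the string) is treated as the password.
mask_password() {
  printf '%s\n' "$1" | sed -E 's/(Password=)[^;]*/\1****/'
}
```

Calling `mask_password "Host=postgres;Port=5432;Password=stellaops"` leaves the host and port intact and prints `Password=****` in place of the secret.
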
Evidence collected: `ConnectionString` (masked), `LatencyMs`, `Version`, `ServerTime`, `Status`, `Threshold`, `ErrorCode`, `ErrorMessage`, `TimeoutSeconds`.
The check requires `ConnectionStrings:StellaOps` or `Database:ConnectionString` to be configured.
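
A quick way to see whether either key is supplied through the environment is sketched below. It relies on the .NET convention that `__` in an environment variable name stands for `:` in a configuration key, so `Database__ConnectionString` is a derived name that does not appear elsewhere in this article. The function name is hypothetical, and the test says nothing about values provided through other configuration sources such as settings files.

```bash
# Hypothetical helper: succeeds if either documented key is set as an
# environment variable (using the "__" form of the ":" separator).
has_connection_string() {
  [ -n "${ConnectionStrings__StellaOps:-}" ] || [ -n "${Database__ConnectionString:-}" ]
}
```

Run it inside the service's environment, for example `has_connection_string && echo configured || echo missing`.
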
## Why It Matters
PostgreSQL is the primary data store for the entire Stella Ops platform. Every service depends on it for configuration, state, and transactional data. If the database is unreachable, the platform is effectively down. High latency propagates through every database operation, degrading the performance of all services, API endpoints, and background jobs simultaneously. This is the most fundamental infrastructure health check.
## Common Causes
- Database server not running or crashed
- Network connectivity issues between the application and database
- Firewall blocking the database port (5432)
- DNS resolution failure for the database hostname
- Invalid connection string (wrong host, port, or database name)
- Authentication failure (wrong username or password)
- Database does not exist
- Database server overloaded (high CPU, memory pressure, I/O saturation)
- Network latency between application and database hosts
- Slow queries blocking connections
- SSL/TLS certificate issues
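
To narrow down which of these causes applies, the exit status of `pg_isready` is a useful first signal. The sketch below maps the statuses documented for `pg_isready` to a short description; the function name is hypothetical. Note that `pg_isready` does not validate credentials or the database name, so authentication failures and a missing database only show up with a real client such as `psql`.

```bash
# Hypothetical helper: describes a pg_isready exit status.
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, during startup)" ;;
    2) echo "no response (server down, wrong host or port, or traffic blocked)" ;;
    3) echo "no attempt made (invalid parameters)" ;;
    *) echo "unexpected exit status: $1" ;;
  esac
}
```

A typical use is to run `pg_isready -h localhost -p 5432` and then pass `$?` to `describe_pg_isready`.
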
## How to Fix
### Docker Compose
```bash
# Check postgres container status
docker compose -f docker-compose.stella-ops.yml ps postgres
# Test direct connection
docker compose -f docker-compose.stella-ops.yml exec postgres \
  pg_isready -U stellaops -d stellaops_platform
# View postgres logs
docker compose -f docker-compose.stella-ops.yml logs --tail 100 postgres
# Restart postgres if needed
docker compose -f docker-compose.stella-ops.yml restart postgres
```
Verify the connection string in the service environment:
```yaml
services:
  platform:
    environment:
      ConnectionStrings__StellaOps: "Host=postgres;Port=5432;Database=stellaops_platform;Username=stellaops;Password=stellaops"
```
### Bare Metal / systemd
```bash
# Check PostgreSQL service status
sudo systemctl status postgresql
# Test connectivity
pg_isready -h localhost -p 5432 -U stellaops -d stellaops_platform
# Check PostgreSQL logs
sudo tail -n 100 /var/log/postgresql/postgresql-*.log
# Verify connection string
stella config get ConnectionStrings:StellaOps
# Test connection manually
psql -h localhost -p 5432 -U stellaops -d stellaops_platform -c "SELECT 1;"
```
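
To compare against the latency thresholds by hand, a command can be timed from the shell. This is a rough sketch, not how the check measures latency: it assumes GNU `date` (for the `%N` nanosecond format), the `measure_ms` name is hypothetical, and the figure includes process startup and connection setup, so it will read higher than the check's own measurement.

```bash
# Hypothetical helper: runs a command, prints the elapsed wall-clock time in
# milliseconds, and returns the command's exit status. Requires GNU date.
measure_ms() {
  local start end rc=0
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1 || rc=$?
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
  return "$rc"
}
```

For example: `measure_ms psql -h localhost -p 5432 -U stellaops -d stellaops_platform -c "SELECT version(), current_timestamp;"`.
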
### Kubernetes / Helm
```bash
# Check PostgreSQL pod status
kubectl get pods -l app=postgresql
# Test connectivity from an application pod
kubectl exec -it <platform-pod> -- pg_isready -h postgres -p 5432
# View PostgreSQL pod logs
kubectl logs -l app=postgresql --tail=100
# Check service DNS resolution
kubectl exec -it <platform-pod> -- nslookup postgres
```
Verify the connection string stored in the secret:
```bash
kubectl get secret stellaops-db-credentials -o jsonpath='{.data.connection-string}' | base64 -d
```
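
Once the secret is decoded, individual fields can be pulled out to compare them with the Helm values. The sketch below assumes the semicolon-delimited `Key=Value` format used in this article and matches key names case-insensitively; `conn_field` is a hypothetical helper and does not handle quoted values or spaces around keys.

```bash
# Hypothetical helper: prints the value of one key from a "Key=Value;Key=Value"
# connection string. Values may themselves contain "=".
conn_field() {
  printf '%s\n' "$1" | tr ';' '\n' |
    awk -F= -v key="$2" 'tolower($1) == tolower(key) { sub(/^[^=]*=/, ""); print; exit }'
}
```

For example, `conn_field "Host=postgres;Port=5432;Database=stellaops_platform" Port` prints `5432`.
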
Set the database connection in the Helm `values.yaml`:
```yaml
postgresql:
  host: postgres
  port: 5432
  database: stellaops_platform
  auth:
    existingSecret: stellaops-db-credentials
```
## Verification
```bash
stella doctor run --check check.postgres.connectivity
```
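
After a restart, the database may need a few seconds before it accepts connections, so it can help to repeat the verification. The generic retry wrapper below is a sketch: the `retry` name is hypothetical, and using it with `stella doctor run` assumes that the command exits with a non-zero status when the check fails, which this article does not confirm.

```bash
# Hypothetical helper: retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Wait before the next attempt, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example: `retry 5 2 stella doctor run --check check.postgres.connectivity`.
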
## Related Checks
- `check.postgres.pool` -- pool exhaustion can masquerade as connectivity issues
- `check.postgres.migrations` -- migration checks depend on connectivity
- `check.operations.job-queue` -- database issues cause job queue failures