Doctor plugin checks: implement health check classes and documentation
Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
124
docs/doctor/articles/postgres/connectivity.md
Normal file
@@ -0,0 +1,124 @@
---
checkId: check.postgres.connectivity
plugin: stellaops.doctor.postgres
severity: fail
tags: [database, postgres, connectivity, core]
---
# PostgreSQL Connectivity

## What It Checks
Opens a connection to PostgreSQL and executes `SELECT version(), current_timestamp` to verify the database is accessible and responsive. Measures round-trip latency:

- **Critical latency**: fail if response time exceeds 500ms.
- **Warning latency**: warn if response time exceeds 100ms.
- **Connection timeout**: fail if the connection attempt exceeds 10 seconds.
- **Connection failure**: fail on authentication errors, DNS failures, or network issues.

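The latency rules in the list above can be sketched as a small shell function. This is only an illustration, not the check's actual implementation: the function name `latency_status` and the `pass` label are assumptions, while the 100 ms and 500 ms limits come from this article.

```bash
# Hypothetical helper: map a round-trip latency (whole milliseconds) to a status.
# "exceeds" is read as strictly greater than the threshold.
latency_status() {
  local latency_ms=$1
  if [ "$latency_ms" -gt 500 ]; then
    echo "fail"   # critical latency
  elif [ "$latency_ms" -gt 100 ]; then
    echo "warn"   # warning latency
  else
    echo "pass"   # assumed label for a healthy result
  fi
}

latency_status 42    # below both thresholds
latency_status 250   # above the warning threshold only
latency_status 900   # above the critical threshold
```

The critical threshold is tested first so that a very slow response is never reported as a mere warning.
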
The connection string password is masked in all evidence output.

Evidence collected: `ConnectionString` (masked), `LatencyMs`, `Version`, `ServerTime`, `Status`, `Threshold`, `ErrorCode`, `ErrorMessage`, `TimeoutSeconds`.

The check requires `ConnectionStrings:StellaOps` or `Database:ConnectionString` to be configured.

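When copying a connection string into a ticket or chat, the same kind of masking can be applied by hand. The sketch below is an assumption about what masking might look like (the helper name and the `********` placeholder are made up, and the check's own mask format may differ). It only matches the exact spelling `Password=`.

```bash
# Hypothetical helper: hide the Password value of a "key=value;key=value" connection string.
mask_connection_string() {
  # Replace everything between "Password=" and the next ";" (or end of string).
  printf '%s\n' "$1" | sed -E 's/(Password=)[^;]*/\1********/'
}

mask_connection_string "Host=postgres;Port=5432;Database=stellaops_platform;Username=stellaops;Password=stellaops"
```
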
## Why It Matters
PostgreSQL is the primary data store for the entire Stella Ops platform. Every service depends on it for configuration, state, and transactional data. If the database is unreachable, the platform is effectively down. High latency propagates through every database operation, degrading the performance of all services, API endpoints, and background jobs simultaneously. This is the most fundamental infrastructure health check.

## Common Causes
- Database server not running or crashed
- Network connectivity issues between the application and database
- Firewall blocking the database port (5432)
- DNS resolution failure for the database hostname
- Invalid connection string (wrong host, port, or database name)
- Authentication failure (wrong username or password)
- Database does not exist
- Database server overloaded (high CPU, memory pressure, I/O saturation)
- Network latency between application and database hosts
- Slow queries blocking connections
- SSL/TLS certificate issues

## How to Fix

### Docker Compose
```bash
# Check postgres container status
docker compose -f docker-compose.stella-ops.yml ps postgres

# Test direct connection
docker compose -f docker-compose.stella-ops.yml exec postgres \
  pg_isready -U stellaops -d stellaops_platform

# View postgres logs
docker compose -f docker-compose.stella-ops.yml logs --tail 100 postgres

# Restart postgres if needed
docker compose -f docker-compose.stella-ops.yml restart postgres
```

Verify connection string in environment:

```yaml
services:
  platform:
    environment:
      ConnectionStrings__StellaOps: "Host=postgres;Port=5432;Database=stellaops_platform;Username=stellaops;Password=stellaops"
```

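The environment variable uses a double underscore because .NET's environment-variable configuration provider maps `__` to the `:` separator, so `ConnectionStrings__StellaOps` corresponds to `ConnectionStrings:StellaOps`. A quick pre-flight check inside the container could look like the sketch below. The function name and the lookup order are assumptions; this article only states that one of the two keys must be configured.

```bash
# Hypothetical pre-flight check: print whichever connection string variable is set.
# The precedence (ConnectionStrings first, then Database) is assumed, not documented.
resolve_connection_string() {
  if [ -n "${ConnectionStrings__StellaOps:-}" ]; then
    printf '%s\n' "$ConnectionStrings__StellaOps"
  elif [ -n "${Database__ConnectionString:-}" ]; then
    printf '%s\n' "$Database__ConnectionString"
  else
    echo "no connection string configured" >&2
    return 1
  fi
}

# Report only presence, so the password never reaches the terminal or logs.
if resolve_connection_string >/dev/null 2>&1; then
  echo "connection string found"
else
  echo "connection string missing"
fi
```
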
### Bare Metal / systemd
```bash
# Check PostgreSQL service status
sudo systemctl status postgresql

# Test connectivity
pg_isready -h localhost -p 5432 -U stellaops -d stellaops_platform

# Check PostgreSQL logs
sudo tail -100 /var/log/postgresql/postgresql-*.log

# Verify connection string
stella config get ConnectionStrings:StellaOps

# Test connection manually
psql -h localhost -p 5432 -U stellaops -d stellaops_platform -c "SELECT 1;"
```

### Kubernetes / Helm
```bash
# Check PostgreSQL pod status
kubectl get pods -l app=postgresql

# Test connectivity from an application pod
kubectl exec -it <platform-pod> -- pg_isready -h postgres -p 5432

# View PostgreSQL pod logs
kubectl logs -l app=postgresql --tail=100

# Check service DNS resolution
kubectl exec -it <platform-pod> -- nslookup postgres
```

Verify connection string in secret:

```bash
kubectl get secret stellaops-db-credentials -o jsonpath='{.data.connection-string}' | base64 -d
```

Set in Helm `values.yaml`:

```yaml
postgresql:
  host: postgres
  port: 5432
  database: stellaops_platform
  auth:
    existingSecret: stellaops-db-credentials
```

## Verification
```
stella doctor run --check check.postgres.connectivity
```

## Related Checks
- `check.postgres.pool` -- pool exhaustion can masquerade as connectivity issues
- `check.postgres.migrations` -- migration checks depend on connectivity
- `check.operations.job-queue` -- database issues cause job queue failures

127
docs/doctor/articles/postgres/migrations.md
Normal file
@@ -0,0 +1,127 @@
---
checkId: check.postgres.migrations
plugin: stellaops.doctor.postgres
severity: warn
tags: [database, postgres, migrations, schema]
---
# PostgreSQL Migration Status

## What It Checks
Connects to PostgreSQL and examines the EF Core migration history to identify pending migrations:

1. **Migration table existence**: checks for the `__EFMigrationsHistory` table in the `public` schema. Warns if the table does not exist.
2. **Applied migrations**: queries the migration history table (ordered by `MigrationId` descending) to determine which migrations have been applied.
3. **Pending migrations**: compares applied migrations against the expected set to identify any unapplied migrations. Warns if pending migrations are found.

Evidence collected: `TableExists`, `AppliedCount`, `PendingCount`, `LatestApplied`, `PendingMigrations`, `Status`.

The check requires `ConnectionStrings:StellaOps` or `Database:ConnectionString` to be configured.

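The pending-migration comparison in step 3 is a set difference: every expected migration ID that is absent from the applied list is pending. A minimal shell sketch of that idea follows. The function name and the sample IDs are invented for illustration; the real check performs this comparison inside the plugin, not in shell.

```bash
# Hypothetical helper: print expected migration IDs that are not in the applied list.
# Both arguments are newline-separated lists of migration IDs.
pending_migrations() {
  local expected=$1 applied=$2
  printf '%s\n' "$expected" | while IFS= read -r id; do
    [ -n "$id" ] || continue
    # -x matches the whole line, -F treats the ID as a literal string.
    printf '%s\n' "$applied" | grep -qxF -- "$id" || printf '%s\n' "$id"
  done
}

# Made-up IDs in the usual EF Core "timestamp_Name" shape.
expected=$(printf '%s\n' 20240101000000_Initial 20240201000000_AddJobs 20240301000000_AddIndexes)
applied=$(printf '%s\n' 20240101000000_Initial 20240201000000_AddJobs)
pending_migrations "$expected" "$applied"
```

With the sample data, the only ID printed is the one missing from the applied list.
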
## Why It Matters
Pending database migrations mean the database schema does not match what the application code expects. This causes 500 errors when the application tries to access tables or columns that do not exist, or uses schema features that have not been applied. In Stella Ops, missing migrations are the number one cause of service failures after an upgrade. Services may start and appear healthy but fail on the first database operation that touches a missing table or column.

## Common Causes
- New deployment with schema changes but migration not executed
- Migration was not run after a version update
- Previous migration attempt failed partway through
- Database initialized without EF Core (manual SQL scripts used instead)
- Migration history table was accidentally dropped
- First deployment to a fresh database with no migration history
- Auto-migration disabled or not configured in service startup

## How to Fix

### Docker Compose
```bash
# Check migration status
docker compose -f docker-compose.stella-ops.yml exec platform \
  stella db migrations status

# Apply pending migrations
docker compose -f docker-compose.stella-ops.yml exec platform \
  stella db migrate

# If auto-migration is configured, restart the service (it migrates on startup)
docker compose -f docker-compose.stella-ops.yml restart platform

# Verify migration status after applying
docker compose -f docker-compose.stella-ops.yml exec platform \
  stella db migrations list
```

Ensure auto-migration is enabled:

```yaml
services:
  platform:
    environment:
      Platform__AutoMigrate: "true"
```

### Bare Metal / systemd
```bash
# List pending migrations
stella db migrations list --pending

# Apply pending migrations
stella db migrate

# Verify all migrations are applied
stella db migrations status

# If auto-migration is configured, restart the service
sudo systemctl restart stellaops-platform
```

Edit `/etc/stellaops/platform/appsettings.json` to enable auto-migration:

```json
{
  "Platform": {
    "AutoMigrate": true
  }
}
```

### Kubernetes / Helm
```bash
# Check migration status
kubectl exec -it <platform-pod> -- stella db migrations status

# Apply pending migrations
kubectl exec -it <platform-pod> -- stella db migrate

# Or use a migration Job
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: stellaops-migrate
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: stellaops/platform:latest
          command: ["stella", "db", "migrate"]
      restartPolicy: Never
EOF
```

Set in Helm `values.yaml`:

```yaml
platform:
  autoMigrate: true
  migrations:
    runOnStartup: true
```

## Verification
```
stella doctor run --check check.postgres.migrations
```

## Related Checks
- `check.postgres.connectivity` -- migrations require a working database connection
- `check.postgres.pool` -- connection pool issues can cause migration failures

123
docs/doctor/articles/postgres/pool.md
Normal file
@@ -0,0 +1,123 @@
---
checkId: check.postgres.pool
plugin: stellaops.doctor.postgres
severity: warn
tags: [database, postgres, pool, connections]
---
# PostgreSQL Connection Pool

## What It Checks
Connects to PostgreSQL and queries `pg_stat_activity` and `pg_settings` to evaluate connection pool health:

- **Pool usage ratio**: warn above 70%, fail above 90% (active connections / max connections).
- **Waiting connections**: warn if any connections are waiting for a pool slot.

Evidence collected: `ActiveConnections`, `IdleConnections`, `MaxConnections`, `UsageRatio`, `ConfiguredMaxPoolSize`, `ConfiguredMinPoolSize`, `WaitingConnections`.

The check requires `ConnectionStrings:StellaOps` or `Database:ConnectionString` to be configured.

The SQL query executed:

```sql
SELECT
    (SELECT count(*) FROM pg_stat_activity WHERE state = 'active') as active,
    (SELECT count(*) FROM pg_stat_activity WHERE state = 'idle') as idle,
    (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') as max_conn,
    (SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Client') as waiting
```

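Given the `active`, `max_conn`, and `waiting` columns returned by that query, the thresholds described above can be applied as in the sketch below. The function name, the `pass` label, and the way the ratio and waiting signals are combined (a ratio failure takes precedence over any warning) are assumptions; the article states only the individual thresholds.

```bash
# Hypothetical helper: classify pool health from the query's active, max_conn and waiting values.
pool_status() {
  local active=$1 max_conn=$2 waiting=$3
  # Compare active/max_conn with 90% and 70% using integers only (no rounding).
  if [ $(( active * 100 )) -gt $(( max_conn * 90 )) ]; then
    echo "fail"
  elif [ $(( active * 100 )) -gt $(( max_conn * 70 )) ] || [ "$waiting" -gt 0 ]; then
    echo "warn"
  else
    echo "pass"   # assumed label for a healthy result
  fi
}

pool_status 40 200 0    # 20% usage, nothing waiting
pool_status 150 200 0   # 75% usage
pool_status 190 200 3   # 95% usage
```

Cross-multiplying instead of dividing keeps the comparison exact, so 70 of 100 connections is not rounded into a warning.
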
## Why It Matters
PostgreSQL connection exhaustion is one of the most common causes of service outages in Stella Ops. When the connection pool is exhausted, all services that need the database start timing out, causing cascading failures across the platform. Waiting connections indicate that requests are already queuing for database access, which translates directly to increased latency for end users. Connection leaks, if not caught early, will eventually exhaust the pool completely.

## Common Causes
- Connection leak in application code (connections opened but not returned to pool)
- Long-running queries holding connections open
- Pool size too small for the workload (too many services sharing a single pool)
- Sudden spike in database requests (bulk scan, CI surge)
- All pool connections in use during peak load
- Connection timeout configured too long, allowing stale connections to occupy slots
- Requests arriving faster than connections are released

## How to Fix

### Docker Compose
```bash
# Check active database connections
docker compose -f docker-compose.stella-ops.yml exec postgres \
  psql -U stellaops -d stellaops_platform -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Terminate idle connections
docker compose -f docker-compose.stella-ops.yml exec postgres \
  psql -U stellaops -d stellaops_platform -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"

# Increase max connections in PostgreSQL
# In docker-compose.stella-ops.yml:
```

```yaml
services:
  postgres:
    command: >
      postgres
      -c max_connections=200
      -c shared_buffers=256MB
```

Increase Npgsql pool size via connection string:

```yaml
services:
  platform:
    environment:
      ConnectionStrings__StellaOps: "Host=postgres;Database=stellaops_platform;Username=stellaops;Password=stellaops;Maximum Pool Size=50;Minimum Pool Size=5"
```

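When raising `Maximum Pool Size`, it helps to compare the worst case against the server limit, because each service process keeps its own pool. The sketch below is a rough sizing aid under that assumption; the function name and the instance counts are hypothetical, and the values 200 and 50 are the ones used in the snippets above.

```bash
# Hypothetical sizing aid: server connections left if every pool fills completely.
# A negative result means the pools together can exceed max_connections.
pool_headroom() {
  local max_connections=$1 max_pool_size=$2 instances=$3
  echo $(( max_connections - max_pool_size * instances ))
}

pool_headroom 200 50 3   # prints 50
pool_headroom 200 50 5   # prints -50
```

If the result is negative, either lower the per-service pool size or raise `max_connections` so that a burst across all services cannot exhaust the server.
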
### Bare Metal / systemd
```bash
# Check connection statistics
psql -U stellaops -d stellaops_platform -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Check for long-running queries
psql -U stellaops -d stellaops_platform -c \
  "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"

# Increase max connections
sudo -u postgres psql -c "ALTER SYSTEM SET max_connections = 200;"
sudo systemctl restart postgresql
```

### Kubernetes / Helm
```bash
# Check connection pool from inside a pod
kubectl exec -it <postgres-pod> -- psql -U stellaops -d stellaops_platform -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Terminate idle connections
kubectl exec -it <postgres-pod> -- psql -U stellaops -d stellaops_platform -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"
```

Set in Helm `values.yaml`:

```yaml
postgresql:
  maxConnections: 200
  sharedBuffers: 256MB

platform:
  database:
    connectionString: "Host=postgres;Database=stellaops_platform;Username=stellaops;Password=stellaops;Maximum Pool Size=50;Minimum Pool Size=5"
```

## Verification
```
stella doctor run --check check.postgres.pool
```

## Related Checks
- `check.postgres.connectivity` -- connectivity issues compound pool problems
- `check.postgres.migrations` -- schema issues can cause queries to hang, consuming connections
- `check.operations.job-queue` -- database bottleneck slows job processing