doctor: complete runtime check documentation sprint

Signed-off-by: master <>
This commit is contained in:
master
2026-03-31 23:26:24 +03:00
parent 404d50bcb7
commit 152c1b1357
54 changed files with 2210 additions and 258 deletions

@@ -0,0 +1,60 @@
---
checkId: check.db.connection
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, connectivity, quick]
---
# Database Connection
## What It Checks
Opens a PostgreSQL connection using `Doctor:Plugins:Database:ConnectionString` or `ConnectionStrings:DefaultConnection` and runs `SELECT version(), current_database(), current_user`.
The check passes only when the connection opens and the probe query returns successfully. Connection failures, authentication failures, DNS errors, and network timeouts fail the check.
## Why It Matters
Doctor cannot validate migrations, pool health, or schema state if the platform cannot reach PostgreSQL. A broken connection path usually means startup failures, API errors, and background job disruption across the suite.
## Common Causes
- `ConnectionStrings__DefaultConnection` is missing or malformed
- PostgreSQL is not running or not listening on the configured host and port
- DNS, firewall, or container networking prevents the Doctor service from reaching PostgreSQL
- Username, password, database name, or TLS settings are incorrect
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps postgres
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 postgres
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres pg_isready -U stellaops -d stellaops
```
Set the Doctor connection string with compose-style environment variables:
```yaml
services:
doctor-web:
environment:
ConnectionStrings__DefaultConnection: Host=postgres;Port=5432;Database=stellaops;Username=stellaops;Password=${STELLAOPS_DB_PASSWORD}
```
### Bare Metal / systemd
```bash
pg_isready -h <db-host> -p 5432 -U <db-user> -d <db-name>
psql "host=<db-host> port=5432 dbname=<db-name> user=<db-user> password=<password>" -c "SELECT 1"
```
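Note that `psql` expects libpq `key=value` pairs separated by spaces, not the Npgsql `Key=Value;` form used in `ConnectionStrings__DefaultConnection`. A small helper can translate one into the other for manual probing; the function name is illustrative and only the keys used in this guide are mapped:

```shell
#!/usr/bin/env bash
# Translate an Npgsql "Key=Value;" connection string into the libpq
# "key=value" form that psql accepts. Illustrative helper, not part of
# the platform; unmapped keys are silently dropped.
npgsql_to_psql() {
  local cs="$1" out="" pair key val
  local -a pairs
  IFS=';' read -r -a pairs <<< "$cs"
  for pair in "${pairs[@]}"; do
    key="${pair%%=*}"
    val="${pair#*=}"
    case "${key,,}" in
      host)     out+="host=$val " ;;
      port)     out+="port=$val " ;;
      database) out+="dbname=$val " ;;
      username) out+="user=$val " ;;
      password) out+="password=$val " ;;
    esac
  done
  printf '%s\n' "${out% }"
}

npgsql_to_psql "Host=postgres;Port=5432;Database=stellaops;Username=stellaops;Password=secret"
```

With that in place, `psql "$(npgsql_to_psql "$ConnectionStrings__DefaultConnection")" -c "SELECT version(), current_database(), current_user;"` reproduces the Doctor probe by hand.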
### Kubernetes / Helm
```bash
kubectl exec deploy/doctor-web -- pg_isready -h <postgres-service> -p 5432 -U <db-user> -d <db-name>
kubectl get secret <db-secret> -o yaml
```
## Verification
```bash
stella doctor --check check.db.connection
```
## Related Checks
- `check.db.latency` - uses the same connection path and highlights performance issues after basic connectivity works
- `check.db.pool.health` - validates connection pressure after connectivity is restored

@@ -0,0 +1,53 @@
---
checkId: check.db.latency
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, latency, performance]
---
# Query Latency
## What It Checks
Runs two warmup queries and then measures five `SELECT 1` probes plus five temporary-table `INSERT` probes against PostgreSQL.
The check warns when the p95 latency exceeds `50ms` and fails when the p95 latency exceeds `200ms`.
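The threshold logic can be illustrated with a self-contained sketch. The sample timings below are invented, and the real check measures its probes in-process rather than through the shell:

```shell
#!/usr/bin/env bash
# Sketch of the p95 evaluation against the documented 50ms/200ms
# thresholds. Sample values are invented for illustration.
p95() {
  # value at position ceil(0.95 * n) of the sorted samples
  printf '%s\n' "$@" | sort -n | awk -v n="$#" '
    NR == int((n * 95 + 99) / 100) { print; exit }'
}

samples=(12 14 9 11 230 13 10 12 15 11)  # probe durations in ms
p=$(p95 "${samples[@]}")
echo "p95=${p}ms"
if [ "$p" -gt 200 ]; then
  echo "fail"
elif [ "$p" -gt 50 ]; then
  echo "warn"
else
  echo "pass"
fi
```

A single slow outlier among ten probes lands at p95, which is exactly why the check uses a percentile rather than the mean.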
## Why It Matters
Healthy connectivity is not enough if the database path is slow. Elevated query latency turns into slow UI pages, delayed releases, and queue backlogs across the platform.
## Common Causes
- CPU, memory, or I/O pressure on the PostgreSQL host
- Cross-host or cross-region latency between Doctor and PostgreSQL
- Lock contention or long-running transactions
- Shared infrastructure saturation in the default compose stack
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT * FROM pg_locks WHERE NOT granted;"
docker compose -f devops/compose/docker-compose.stella-ops.yml stats postgres
```
Tune connection placement and storage before raising thresholds. If the database is remote, keep `doctor-web` and PostgreSQL on the same low-latency network segment.
### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT * FROM pg_locks WHERE NOT granted;"
```
### Kubernetes / Helm
```bash
kubectl top pod -n <namespace> <postgres-pod>
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT now();"
```
## Verification
```bash
stella doctor --check check.db.latency
```
## Related Checks
- `check.db.connection` - basic reachability must pass before latency numbers are meaningful
- `check.db.pool.health` - pool saturation often shows up as latency first

@@ -0,0 +1,52 @@
---
checkId: check.db.migrations.failed
plugin: stellaops.doctor.database
severity: fail
tags: [database, migrations, postgres, schema]
---
# Failed Migrations
## What It Checks
Reads the `stella_migration_history` table, when present, and reports rows marked `failed` or `incomplete`.
If the tracking table does not exist, the check returns an informational result and assumes the service uses a different migration mechanism.
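A minimal sketch of the status mapping described above. The migration IDs are hypothetical, and treating statuses other than the documented `failed`/`incomplete` pair as non-blocking is an assumption:

```shell
#!/usr/bin/env bash
# Map stella_migration_history status values onto check outcomes.
# Only failed/incomplete are documented as failures; everything else
# is assumed informational here.
classify() {
  case "$1" in
    failed|incomplete) echo "fail: $2 ($1)" ;;
    applied)           echo "ok: $2" ;;
    *)                 echo "info: $2 has unrecognised status '$1'" ;;
  esac
}

classify applied 20250101000000_Init
classify failed  20250301000000_AddAuditLog
```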
## Why It Matters
Partially applied migrations leave schemas in undefined states. That is a common cause of startup failures and runtime `500` errors after upgrades.
## Common Causes
- A migration script failed during deployment
- The database user lacks DDL permissions
- Two processes attempted to apply migrations concurrently
- An interrupted deployment left the migration history half-written
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 200 doctor-web
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT migration_id, status, error_message, applied_at FROM stella_migration_history ORDER BY applied_at DESC LIMIT 10;"
```
Fix the underlying SQL or permission problem, then restart the owning service so startup migrations run again.
### Bare Metal / systemd
```bash
journalctl -u <service-name> -n 200
dotnet ef database update
```
### Kubernetes / Helm
```bash
kubectl logs deploy/<service-name> -n <namespace> --tail=200
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT migration_id, status FROM stella_migration_history;"
```
## Verification
```bash
stella doctor --check check.db.migrations.failed
```
## Related Checks
- `check.db.migrations.pending` - pending migrations often follow a failed rollout
- `check.db.schema.version` - schema consistency should be rechecked after cleanup

@@ -0,0 +1,52 @@
---
checkId: check.db.migrations.pending
plugin: stellaops.doctor.database
severity: warn
tags: [database, migrations, postgres, schema]
---
# Pending Migrations
## What It Checks
Looks for the `__EFMigrationsHistory` table and reports the latest applied migration recorded there.
This runtime check does not diff the database against the assembly directly; it tells you whether migration history exists and what the latest applied migration is.
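When you do want the diff the runtime check skips, the applied history can be compared against the assembly's migration list by hand. The migration names below are hypothetical; `applied` would come from the `__EFMigrationsHistory` query and `assembly` from `dotnet ef migrations list`:

```shell
#!/usr/bin/env bash
# Hypothetical migration names standing in for real query output.
applied='20250101000000_Init
20250201000000_AddUsers'
assembly='20250101000000_Init
20250201000000_AddUsers
20250301000000_AddAuditLog'

# comm -13 prints lines present only in the second sorted input,
# i.e. migrations the assembly knows about but the database never ran.
comm -13 <(sort <<< "$applied") <(sort <<< "$assembly")
```

Recent EF Core versions also annotate pending entries directly in `dotnet ef migrations list` output when they can reach the database.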
## Why It Matters
Missing or stale migration history usually means a fresh environment was bootstrapped incorrectly or schema changes were never applied on startup.
## Common Causes
- Startup migrations are not wired for the owning service
- The database was reset and the service never converged the schema
- The service is using a different schema owner than operators expect
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 200 doctor-web
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT \"MigrationId\" FROM \"__EFMigrationsHistory\" ORDER BY \"MigrationId\" DESC;"
```
Confirm the owning service calls startup migrations on boot instead of relying on one-off SQL initialization scripts.
### Bare Metal / systemd
```bash
journalctl -u <service-name> -n 200
dotnet ef migrations list
dotnet ef database update
```
### Kubernetes / Helm
```bash
kubectl logs deploy/<service-name> -n <namespace> --tail=200
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT COUNT(*) FROM \"__EFMigrationsHistory\";"
```
## Verification
```bash
stella doctor --check check.db.migrations.pending
```
## Related Checks
- `check.db.migrations.failed` - diagnose broken runs before retrying
- `check.db.schema.version` - validates the resulting schema shape

@@ -0,0 +1,51 @@
---
checkId: check.db.permissions
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, permissions, security]
---
# Database Permissions
## What It Checks
Inspects the current PostgreSQL user, whether it is a superuser, whether it can create databases or roles, and whether it has access to application schemas.
The check warns when the app runs as a superuser and fails when the user cannot use the `public` schema.
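A sketch of that decision logic, assuming boolean probe results in the `t`/`f` form psql prints for `SELECT rolsuper FROM pg_roles WHERE rolname = current_user` and `SELECT has_schema_privilege(current_user, 'public', 'USAGE')`:

```shell
#!/usr/bin/env bash
# Map the two privilege probes onto the documented outcomes.
# Inputs mirror psql's t/f boolean output.
evaluate() {
  local is_super="$1" has_public="$2"
  if [ "$has_public" != "t" ]; then
    echo "fail: no USAGE on schema public"
  elif [ "$is_super" = "t" ]; then
    echo "warn: running as superuser"
  else
    echo "pass"
  fi
}

evaluate t t   # admin account with access
evaluate f f   # locked-out service account
evaluate f t   # least-privilege service account
```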
## Why It Matters
Over-privileged accounts increase blast radius. Under-privileged accounts break startup migrations and normal CRUD paths.
## Common Causes
- The connection string still uses `postgres` or another admin account
- Grants were not applied after creating a dedicated service account
- Restrictive schema privileges were added manually
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "CREATE USER stellaops WITH PASSWORD '<strong-password>';"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT CONNECT ON DATABASE stellaops TO stellaops;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT USAGE ON SCHEMA public TO stellaops;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO stellaops;"
```
Update `ConnectionStrings__DefaultConnection` after the grants are in place.
### Bare Metal / systemd
```bash
psql -h <db-host> -U postgres -d <db-name> -c "ALTER ROLE <app-user> NOSUPERUSER NOCREATEDB NOCREATEROLE;"
```
### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U postgres -d <db-name> -c "\du"
```
## Verification
```bash
stella doctor --check check.db.permissions
```
## Related Checks
- `check.db.migrations.failed` - missing privileges frequently break migrations
- `check.db.connection` - credentials and grants must both be correct

@@ -0,0 +1,50 @@
---
checkId: check.db.pool.health
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, pool, connections]
---
# Connection Pool Health
## What It Checks
Queries `pg_stat_activity` for the current database and evaluates total connections, active connections, idle connections, waiting connections, and sessions stuck `idle in transaction`.
The check warns when more than five sessions are `idle in transaction` or when total usage exceeds `80%` of server capacity.
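A worked example of the capacity threshold, using invented counts in place of live query output:

```shell
#!/usr/bin/env bash
# Invented counts standing in for live results:
total=68            # SELECT count(*) FROM pg_stat_activity WHERE datname = current_database();
max_connections=100 # SHOW max_connections;
reserved=3          # SHOW superuser_reserved_connections;

usable=$((max_connections - reserved))
pct=$((total * 100 / usable))
echo "using ${pct}% of ${usable} usable server connections"
if [ "$pct" -gt 80 ]; then
  echo "warn: connection pressure"
fi
```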
## Why It Matters
Pool pressure turns into request latency, migration timeouts, and job backlog. `idle in transaction` sessions are especially dangerous because they hold locks while doing nothing useful.
## Common Causes
- Application code is not closing transactions
- Connection leaks keep sessions open after requests complete
- `max_connections` is too low for the number of app instances
- Long-running requests or deadlocks block pooled connections
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE datname = current_database();"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT pid, query FROM pg_stat_activity WHERE state = 'idle in transaction';"
```
### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SHOW max_connections;"
```
Review the owning service for transaction scopes that stay open across network calls or retries.
### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT count(*) FROM pg_stat_activity;"
```
## Verification
```bash
stella doctor --check check.db.pool.health
```
## Related Checks
- `check.db.pool.size` - configuration and runtime pressure need to agree
- `check.db.latency` - latency usually rises before the pool is fully exhausted

@@ -0,0 +1,56 @@
---
checkId: check.db.pool.size
plugin: stellaops.doctor.database
severity: warn
tags: [database, postgres, pool, configuration]
---
# Connection Pool Size
## What It Checks
Parses the Npgsql connection string and compares `Pooling`, `MinPoolSize`, and `MaxPoolSize` against PostgreSQL `max_connections` minus reserved superuser slots.
The check warns when pooling is disabled or when `MaxPoolSize` exceeds practical server capacity. It returns info when `MinPoolSize=0`.
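A worked sizing example, under the assumption that every replica can open its full `MaxPoolSize` at once:

```shell
#!/usr/bin/env bash
# Worst-case pool arithmetic across replicas. Values are invented;
# the server figures would come from SHOW max_connections and
# SHOW superuser_reserved_connections.
replicas=5
max_pool_size=25       # MaxPoolSize in the connection string
max_connections=100
reserved=3

ceiling=$((replicas * max_pool_size))
budget=$((max_connections - reserved))
echo "worst case ${ceiling} connections against a budget of ${budget}"
if [ "$ceiling" -gt "$budget" ]; then
  echo "lower MaxPoolSize to at most $((budget / replicas)) per replica"
fi
```

Five replicas at `MaxPoolSize=25` can demand 125 connections against 97 usable slots, which is exactly the coordinated-sizing mistake listed above.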
## Why It Matters
Pool sizing mistakes create either avoidable cold-start latency or connection storms that starve PostgreSQL.
## Common Causes
- `Pooling=false` left over from local troubleshooting
- `MaxPoolSize` copied from another environment without checking server capacity
- Multiple app replicas sharing the same PostgreSQL limit without coordinated sizing
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SHOW max_connections;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SHOW superuser_reserved_connections;"
```
Set an explicit connection string:
```yaml
services:
doctor-web:
environment:
ConnectionStrings__DefaultConnection: Host=postgres;Port=5432;Database=stellaops;Username=stellaops;Password=${STELLAOPS_DB_PASSWORD};Pooling=true;MinPoolSize=5;MaxPoolSize=25
```
### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SHOW max_connections;"
```
### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SHOW max_connections;"
```
## Verification
```bash
stella doctor --check check.db.pool.size
```
## Related Checks
- `check.db.pool.health` - validates that configured limits behave correctly at runtime
- `check.db.connection` - pooling changes should not break base connectivity

@@ -0,0 +1,49 @@
---
checkId: check.db.schema.version
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, schema, migrations]
---
# Schema Version
## What It Checks
Counts non-system schemas and tables, inspects the latest EF migration entry when available, and warns when PostgreSQL reports unvalidated foreign-key constraints.
Unvalidated constraints usually indicate an interrupted migration or manual DDL drift.
## Why It Matters
Schema drift is a common source of runtime breakage after upgrades. Unvalidated constraints can hide partial migrations long after deployment appears complete.
## Common Causes
- A migration failed after creating constraints but before validation
- Manual schema changes bypassed startup migrations
- The database was restored from an inconsistent backup
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT conname FROM pg_constraint WHERE NOT convalidated;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT \"MigrationId\" FROM \"__EFMigrationsHistory\" ORDER BY \"MigrationId\" DESC LIMIT 5;"
```
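Once the underlying data is consistent, each constraint returned by the first query above can be cleared with `ALTER TABLE ... VALIDATE CONSTRAINT`. This sketch generates the statements from a hypothetical two-row result (table and constraint names are invented):

```shell
#!/usr/bin/env bash
# Turn a "table constraint" listing into VALIDATE statements.
# The input rows below are illustrative; real rows would come from
# pg_constraint WHERE NOT convalidated.
while read -r rel con; do
  echo "ALTER TABLE ${rel} VALIDATE CONSTRAINT ${con};"
done <<'EOF'
orders fk_orders_customer
audit_log fk_audit_actor
EOF
```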
Re-run the owning service with startup migrations enabled after fixing the underlying schema issue.
### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT COUNT(*) FROM pg_constraint WHERE NOT convalidated;"
```
### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT nspname FROM pg_namespace;"
```
## Verification
```bash
stella doctor --check check.db.schema.version
```
## Related Checks
- `check.db.migrations.failed` - failed migrations are the most common cause of schema inconsistency
- `check.db.migrations.pending` - verify history after cleanup