doctor: complete runtime check documentation sprint
Signed-off-by: master <>

docs/doctor/articles/postgres/db-connection.md

---
checkId: check.db.connection
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, connectivity, quick]
---
# Database Connection

## What It Checks
Opens a PostgreSQL connection using `Doctor:Plugins:Database:ConnectionString` or `ConnectionStrings:DefaultConnection` and runs `SELECT version(), current_database(), current_user`.

The check passes only when the connection opens and the probe query returns successfully. Connection failures, authentication failures, DNS errors, and network timeouts fail the check.

## Why It Matters
Doctor cannot validate migrations, pool health, or schema state if the platform cannot reach PostgreSQL. A broken connection path usually means startup failures, API errors, and background job disruption across the suite.

## Common Causes
- `ConnectionStrings__DefaultConnection` is missing or malformed
- PostgreSQL is not running or not listening on the configured host and port
- DNS, firewall, or container networking prevents the Doctor service from reaching PostgreSQL
- Username, password, database name, or TLS settings are incorrect

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps postgres
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 postgres
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres pg_isready -U stellaops -d stellaops
```

Set the Doctor connection string with compose-style environment variables:

```yaml
services:
  doctor-web:
    environment:
      ConnectionStrings__DefaultConnection: Host=postgres;Port=5432;Database=stellaops;Username=stellaops;Password=${STELLAOPS_DB_PASSWORD}
```

### Bare Metal / systemd
```bash
pg_isready -h <db-host> -p 5432 -U <db-user> -d <db-name>
# psql expects libpq keyword/value syntax, not the Npgsql "Key=Value;" format
psql "host=<db-host> port=5432 dbname=<db-name> user=<db-user> password=<password>" -c "SELECT 1"
```
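
`psql` cannot read the Npgsql `Key=Value;` format used by `ConnectionStrings__DefaultConnection`, so to test the exact string Doctor sees you first have to translate it into libpq keyword/value syntax. The function below is a minimal, hypothetical sketch (`npgsql_to_libpq` is not a Stella Ops tool). It assumes values contain no `;` and there are no spaces around `=`. It maps only `Host`/`Server`, `Port`, `Database`, `Username`/`User Id` and `Password`, and it drops client-side settings such as `Pooling` or `MaxPoolSize`.

```bash
# Hypothetical helper: translate a simple Npgsql connection string into
# libpq keyword/value syntax so psql can use it.
# Assumes no ';' inside values and no spaces around '='.
npgsql_to_libpq() {
  local -a pairs
  local pair key value libpq_key out=""
  # Split "Key=Value;Key=Value" on ';' without glob expansion.
  IFS=';' read -r -a pairs <<< "$1"
  for pair in "${pairs[@]}"; do
    [ -z "$pair" ] && continue
    key="${pair%%=*}"
    value="${pair#*=}"
    # Normalise keys: case-insensitive, spaces removed ("User Id" -> "userid").
    key="$(printf '%s' "$key" | tr -d ' ' | tr '[:upper:]' '[:lower:]')"
    case "$key" in
      host|server) libpq_key=host ;;
      port) libpq_key=port ;;
      database) libpq_key=dbname ;;
      username|userid) libpq_key=user ;;
      password) libpq_key=password ;;
      *) continue ;;  # client-side options (Pooling, MaxPoolSize, ...) are skipped
    esac
    # libpq needs backslashes and single quotes escaped inside quoted values.
    value="$(printf '%s' "$value" | sed -e 's/\\/\\\\/g' -e "s/'/\\\\'/g")"
    out+="${out:+ }${libpq_key}='${value}'"
  done
  printf '%s\n' "$out"
}
```

For example, `psql "$(npgsql_to_libpq "$ConnectionStrings__DefaultConnection")" -c "SELECT 1"` runs a probe with the service's own settings. The password ends up on the command line, so on shared hosts prefer `PGPASSWORD` or a `~/.pgpass` file.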

### Kubernetes / Helm
```bash
kubectl exec -n <namespace> deploy/doctor-web -- pg_isready -h <postgres-service> -p 5432 -U <db-user> -d <db-name>
# Values under data: are base64-encoded
kubectl get secret -n <namespace> <db-secret> -o yaml
```

## Verification
```bash
stella doctor --check check.db.connection
```

## Related Checks
- `check.db.latency` - uses the same connection path and highlights performance issues after basic connectivity works
- `check.db.pool.health` - validates connection pressure after connectivity is restored

docs/doctor/articles/postgres/db-latency.md

---
checkId: check.db.latency
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, latency, performance]
---
# Query Latency

## What It Checks
Runs two warmup queries and then measures five `SELECT 1` probes plus five temporary-table `INSERT` probes against PostgreSQL.

The check warns when the p95 latency exceeds `50ms` and fails when the p95 latency exceeds `200ms`.
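
To compare latencies you measure yourself against these thresholds, a small shell function can compute a p95 and apply the same `50ms` / `200ms` cut-offs. This is a hypothetical sketch. It uses the nearest-rank percentile, and the exact percentile method Doctor applies is an assumption.

```bash
# Hypothetical sketch: read latency samples in milliseconds (one per line) on
# stdin, compute the nearest-rank p95, and classify it with the documented
# thresholds (warn above 50ms, fail above 200ms).
classify_p95() {
  sort -n | awk '
    { v[NR] = $1 + 0 }                    # samples arrive sorted ascending
    END {
      if (NR == 0) { print "no samples"; exit 1 }
      r = int(0.95 * NR)                  # nearest rank = ceil(0.95 * N)
      if (r < 0.95 * NR) r++
      status = "pass"
      if (v[r] > 200) status = "fail"
      else if (v[r] > 50) status = "warn"
      printf "p95=%gms %s\n", v[r], status
    }'
}
```

With this method and the ten probes described above, p95 equals the slowest sample, so a single outlier is enough to trigger the warning.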

## Why It Matters
Healthy connectivity is not enough if the database path is slow. Elevated query latency turns into slow UI pages, delayed releases, and queue backlogs across the platform.

## Common Causes
- CPU, memory, or I/O pressure on the PostgreSQL host
- Cross-host or cross-region latency between Doctor and PostgreSQL
- Lock contention or long-running transactions
- Shared infrastructure saturation in the default compose stack

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT * FROM pg_locks WHERE NOT granted;"
docker compose -f devops/compose/docker-compose.stella-ops.yml stats postgres
```

Tune connection placement and storage before raising thresholds. If PostgreSQL runs on a separate host, keep it on the same low-latency network segment as `doctor-web`.

### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT * FROM pg_locks WHERE NOT granted;"
```

### Kubernetes / Helm
```bash
kubectl top pod -n <namespace> <postgres-pod>
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT now();"
```

## Verification
```bash
stella doctor --check check.db.latency
```

## Related Checks
- `check.db.connection` - basic reachability must pass before latency numbers are meaningful
- `check.db.pool.health` - pool saturation often shows up as latency first

docs/doctor/articles/postgres/db-migrations-failed.md

---
checkId: check.db.migrations.failed
plugin: stellaops.doctor.database
severity: fail
tags: [database, migrations, postgres, schema]
---
# Failed Migrations

## What It Checks
Reads the `stella_migration_history` table, when present, and reports rows marked `failed` or `incomplete`.

If the tracking table does not exist, the check returns an informational result and assumes the service uses a different migration mechanism.
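
The same rule can be applied by hand to `psql` output. The function below is a hypothetical sketch. It reads `migration_id|status` rows as printed by `psql -At` (unaligned, tuples only, `|` separator) and assumes the status values are stored literally as `failed` and `incomplete`.

```bash
# Hypothetical sketch: filter "migration_id|status" rows (psql -At output)
# and print the ones this check would flag. Exits 1 if any are found.
report_failed_migrations() {
  awk -F'|' '
    $2 == "failed" || $2 == "incomplete" { print $1 " (" $2 ")"; bad++ }
    END { if (bad) exit 1 }'
}
```

For example, `psql -h <db-host> -U <db-user> -d <db-name> -At -c "SELECT migration_id, status FROM stella_migration_history;" | report_failed_migrations`. The non-zero exit status makes it easy to gate a deployment script.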

## Why It Matters
Partially applied migrations leave schemas in undefined states. That is a common cause of startup failures and runtime `500` errors after upgrades.

## Common Causes
- A migration script failed during deployment
- The database user lacks DDL permissions
- Two processes attempted to apply migrations concurrently
- An interrupted deployment left the migration history half-written

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 200 doctor-web
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT migration_id, status, error_message, applied_at FROM stella_migration_history ORDER BY applied_at DESC LIMIT 10;"
```

Fix the underlying SQL or permission problem, then restart the owning service so startup migrations run again.

### Bare Metal / systemd
```bash
journalctl -u <service-name> -n 200
# Run from the owning service's project directory (or pass --project)
dotnet ef database update
```

### Kubernetes / Helm
```bash
kubectl logs deploy/<service-name> -n <namespace> --tail=200
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT migration_id, status FROM stella_migration_history;"
```

## Verification
```bash
stella doctor --check check.db.migrations.failed
```

## Related Checks
- `check.db.migrations.pending` - pending migrations often follow a failed rollout
- `check.db.schema.version` - schema consistency should be rechecked after cleanup

docs/doctor/articles/postgres/db-migrations-pending.md

---
checkId: check.db.migrations.pending
plugin: stellaops.doctor.database
severity: warn
tags: [database, migrations, postgres, schema]
---
# Pending Migrations

## What It Checks
Looks for the `__EFMigrationsHistory` table and reports the latest applied migration recorded there.

This runtime check does not diff the database against the migrations compiled into the service assembly. It only reports whether migration history exists and which migration was applied most recently.

## Why It Matters
Missing or stale migration history usually means a fresh environment was bootstrapped incorrectly or schema changes were never applied on startup.

## Common Causes
- Startup migrations are not wired for the owning service
- The database was reset and the service never converged the schema
- The service is using a different schema owner than operators expect

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 200 doctor-web
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT \"MigrationId\" FROM \"__EFMigrationsHistory\" ORDER BY \"MigrationId\" DESC;"
```

Confirm that the owning service runs migrations on startup instead of relying on one-off SQL initialization scripts.

### Bare Metal / systemd
```bash
journalctl -u <service-name> -n 200
# Run from the owning service's project directory (or pass --project)
dotnet ef migrations list
dotnet ef database update
```
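
Because the check does not compare the database with the service's migrations, you can do that comparison yourself using the two lists gathered above. The sketch below is hypothetical and assumes EF Core's `yyyyMMddHHmmss_Name` migration IDs. It takes one file with the applied IDs (for example `psql -At` output of the `__EFMigrationsHistory` query) and one file with the output of `dotnet ef migrations list`. It then prints the IDs that exist in the code but not in the database.

```bash
# Hypothetical sketch: print migrations that are available but not applied.
#   $1 = file with applied migration IDs (from __EFMigrationsHistory)
#   $2 = file with `dotnet ef migrations list` output
# EF Core IDs start with a 14-digit timestamp, so a byte-wise sort is chronological.
pending_migrations() {
  local applied="$1" available="$2"
  # Extract migration IDs wherever they appear on a line (ignores build output
  # and any trailing annotation), then sort both lists the same way.
  # comm -13 keeps only lines unique to the second list.
  LC_ALL=C comm -13 \
    <(grep -oE '[0-9]{14}_[A-Za-z0-9_]+' "$applied" | LC_ALL=C sort -u) \
    <(grep -oE '[0-9]{14}_[A-Za-z0-9_]+' "$available" | LC_ALL=C sort -u)
}
```

An empty result means every migration found in the list output is already recorded in `__EFMigrationsHistory`.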

### Kubernetes / Helm
```bash
kubectl logs deploy/<service-name> -n <namespace> --tail=200
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT COUNT(*) FROM \"__EFMigrationsHistory\";"
```

## Verification
```bash
stella doctor --check check.db.migrations.pending
```

## Related Checks
- `check.db.migrations.failed` - diagnose broken runs before retrying
- `check.db.schema.version` - validates the resulting schema shape

docs/doctor/articles/postgres/db-permissions.md

---
checkId: check.db.permissions
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, permissions, security]
---
# Database Permissions

## What It Checks
Inspects the current PostgreSQL user. It checks whether the user is a superuser, whether it can create databases or roles, and whether it has access to application schemas.

The check warns when the application connects as a superuser and fails when the user cannot use the `public` schema.
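
Both conditions can be read in one query. `pg_roles.rolsuper` shows whether the current role is a superuser, and `has_schema_privilege('public', 'USAGE')` shows whether it can use the `public` schema. The function below is a hypothetical sketch that applies the warn/fail rule to that query's `psql -At` output. Its return codes (0 pass, 1 warn, 2 fail) are an assumption for scripting and are not Doctor's own.

```bash
# Hypothetical sketch: classify one "rolsuper|public_usage" row (t/f values),
# as produced by:
#   psql -At -c "SELECT rolsuper, has_schema_privilege('public', 'USAGE')
#                FROM pg_roles WHERE rolname = current_user;"
# Returns 0 = pass, 1 = warn (superuser), 2 = fail (no USAGE on public).
classify_db_permissions() {
  local rolsuper usage
  IFS='|' read -r rolsuper usage
  if [ "$usage" != "t" ]; then
    echo "fail: current user cannot use schema public"
    return 2
  fi
  if [ "$rolsuper" = "t" ]; then
    echo "warn: application connects as a superuser"
    return 1
  fi
  echo "pass"
}
```

The missing-`USAGE` case is checked first so that a failure is never reported as a mere warning.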

## Why It Matters
Over-privileged accounts increase blast radius. Under-privileged accounts break startup migrations and normal CRUD paths.

## Common Causes
- The connection string still uses `postgres` or another admin account
- Grants were not applied after creating a dedicated service account
- Restrictive schema privileges were added manually

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "CREATE USER stellaops WITH PASSWORD '<strong-password>';"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT CONNECT ON DATABASE stellaops TO stellaops;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT USAGE ON SCHEMA public TO stellaops;"
# Applies to existing tables only; tables created later need ALTER DEFAULT PRIVILEGES
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO stellaops;"
```

Once the grants are in place, update `ConnectionStrings__DefaultConnection` to use the dedicated account.

### Bare Metal / systemd
```bash
psql -h <db-host> -U postgres -d <db-name> -c "ALTER ROLE <app-user> NOSUPERUSER NOCREATEDB NOCREATEROLE;"
```

### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U postgres -d <db-name> -c "\du"
```

## Verification
```bash
stella doctor --check check.db.permissions
```

## Related Checks
- `check.db.migrations.failed` - missing privileges frequently break migrations
- `check.db.connection` - credentials and grants must both be correct

docs/doctor/articles/postgres/db-pool-health.md

---
checkId: check.db.pool.health
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, pool, connections]
---
# Connection Pool Health

## What It Checks
Queries `pg_stat_activity` for the current database and evaluates total, active, idle, and waiting connections, plus sessions stuck in the `idle in transaction` state.

The check warns when more than five sessions are `idle in transaction` or when total usage exceeds `80%` of server capacity.
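
Both warning rules can be reproduced with one `pg_stat_activity` query and a little shell arithmetic. In this hypothetical sketch, "server capacity" is taken to be `max_connections`. Whether Doctor subtracts reserved slots first is an assumption left open.

```bash
# Hypothetical sketch: apply the documented pool-health warnings.
#   $1 = total sessions, $2 = sessions idle in transaction, $3 = max_connections
# Inputs could come from:
#   SELECT count(*),
#          count(*) FILTER (WHERE state = 'idle in transaction'),
#          current_setting('max_connections')::int
#   FROM pg_stat_activity WHERE datname = current_database();
classify_pool_health() {
  local total="$1" idle_tx="$2" max_conn="$3" status=pass
  if [ "$idle_tx" -gt 5 ]; then
    echo "warn: $idle_tx sessions idle in transaction"
    status=warn
  fi
  # "Exceeds 80%" without floating point: total/max > 0.8  <=>  total*100 > max*80
  if [ $((total * 100)) -gt $((max_conn * 80)) ]; then
    echo "warn: $total of $max_conn connections in use"
    status=warn
  fi
  if [ "$status" = pass ]; then
    echo "pass"
    return 0
  fi
  return 1
}
```

Exactly 80% usage still passes, because the documented rule triggers only when usage *exceeds* the threshold.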

## Why It Matters
Pool pressure turns into request latency, migration timeouts, and job backlog. `idle in transaction` sessions are especially dangerous because they hold locks while doing nothing useful.

## Common Causes
- Application code is not closing transactions
- Connection leaks keep sessions open after requests complete
- `max_connections` is too low for the number of app instances
- Long-running requests or deadlocks block pooled connections

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE datname = current_database();"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT pid, query FROM pg_stat_activity WHERE state = 'idle in transaction';"
```

### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SHOW max_connections;"
```

Review the owning service for transaction scopes that stay open across network calls or retries.

### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT count(*) FROM pg_stat_activity;"
```

## Verification
```bash
stella doctor --check check.db.pool.health
```

## Related Checks
- `check.db.pool.size` - configuration and runtime pressure need to agree
- `check.db.latency` - latency usually rises before the pool is fully exhausted

docs/doctor/articles/postgres/db-pool-size.md

---
checkId: check.db.pool.size
plugin: stellaops.doctor.database
severity: warn
tags: [database, postgres, pool, configuration]
---
# Connection Pool Size

## What It Checks
Parses the Npgsql connection string and compares `Pooling`, `MinPoolSize`, and `MaxPoolSize` against PostgreSQL `max_connections` minus reserved superuser slots.

The check warns when pooling is disabled or when `MaxPoolSize` exceeds practical server capacity. It returns an informational result when `MinPoolSize=0`.

## Why It Matters
Pool sizing mistakes create either avoidable cold-start latency or connection storms that starve PostgreSQL.

## Common Causes
- `Pooling=false` left over from local troubleshooting
- `MaxPoolSize` copied from another environment without checking server capacity
- Multiple app replicas sharing the same PostgreSQL limit without coordinated sizing

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SHOW max_connections;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SHOW superuser_reserved_connections;"
```

Set explicit pool settings in the connection string:

```yaml
services:
  doctor-web:
    environment:
      ConnectionStrings__DefaultConnection: Host=postgres;Port=5432;Database=stellaops;Username=stellaops;Password=${STELLAOPS_DB_PASSWORD};Pooling=true;MinPoolSize=5;MaxPoolSize=25
```
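
When several replicas share one PostgreSQL server, their combined `MaxPoolSize` should fit inside `max_connections` minus `superuser_reserved_connections`. The helper below is a hypothetical sketch of that arithmetic. It ignores other services and admin sessions that also use the server, so treat its result as an upper bound.

```bash
# Hypothetical sketch: largest per-replica MaxPoolSize that keeps all replicas
# within the non-reserved connection slots.
#   $1 = max_connections, $2 = superuser_reserved_connections, $3 = replica count
max_pool_budget() {
  local max_conn="$1" reserved="$2" replicas="$3"
  if [ "$replicas" -lt 1 ] || [ "$max_conn" -le "$reserved" ]; then
    echo "invalid input" >&2
    return 1
  fi
  # Integer division rounds down, which is the safe direction here.
  echo $(( (max_conn - reserved) / replicas ))
}
```

With PostgreSQL's default `max_connections=100` and `superuser_reserved_connections=3`, four replicas get a budget of 24 each. At that scale, `MaxPoolSize=25` from the example above would already oversubscribe the non-reserved slots.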

### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SHOW max_connections;"
```

### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SHOW max_connections;"
```

## Verification
```bash
stella doctor --check check.db.pool.size
```

## Related Checks
- `check.db.pool.health` - validates that configured limits behave correctly at runtime
- `check.db.connection` - pooling changes should not break base connectivity

docs/doctor/articles/postgres/db-schema-version.md

---
checkId: check.db.schema.version
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, schema, migrations]
---
# Schema Version

## What It Checks
Counts non-system schemas and tables, inspects the latest EF migration entry when available, and warns when PostgreSQL reports unvalidated foreign-key constraints.

Unvalidated constraints usually indicate an interrupted migration or manual DDL drift.

## Why It Matters
Schema drift is a common source of runtime breakage after upgrades. Unvalidated constraints can hide partial migrations long after deployment appears complete.

## Common Causes
- A migration failed after creating constraints but before validation
- Manual schema changes bypassed startup migrations
- The database was restored from an inconsistent backup

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT conrelid::regclass, conname FROM pg_constraint WHERE contype = 'f' AND NOT convalidated;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT \"MigrationId\" FROM \"__EFMigrationsHistory\" ORDER BY \"MigrationId\" DESC LIMIT 5;"
```

Re-run the owning service with startup migrations enabled after fixing the underlying schema issue.
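
If the data is already consistent and only validation was interrupted, PostgreSQL can validate an existing constraint in place with `ALTER TABLE ... VALIDATE CONSTRAINT`. The function below is a hypothetical sketch that turns `table|constraint` rows from `psql -At` into those statements for review. It assumes the table column comes from `conrelid::regclass`, which PostgreSQL already quotes and schema-qualifies where needed.

```bash
# Hypothetical sketch: emit VALIDATE CONSTRAINT statements for review from rows of
#   SELECT conrelid::regclass, conname FROM pg_constraint
#   WHERE contype = 'f' AND NOT convalidated;
# run with psql -At (one "table|constraint" row per line).
validate_statements() {
  local table constraint
  while IFS='|' read -r table constraint || [ -n "$table" ]; do
    [ -z "$table" ] && continue
    # Quote the constraint name as an identifier, doubling embedded double quotes.
    constraint="$(printf '%s' "$constraint" | sed 's/"/""/g')"
    printf 'ALTER TABLE %s VALIDATE CONSTRAINT "%s";\n' "$table" "$constraint"
  done
}
```

Validation scans the table and fails if any existing rows violate the constraint. Fix the offending data before running the generated statements.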

### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT COUNT(*) FROM pg_constraint WHERE contype = 'f' AND NOT convalidated;"
```

### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT nspname FROM pg_namespace WHERE nspname !~ '^pg_' AND nspname <> 'information_schema';"
```

## Verification
```bash
stella doctor --check check.db.schema.version
```

## Related Checks
- `check.db.migrations.failed` - failed migrations are the most common cause of schema inconsistency
- `check.db.migrations.pending` - verify history after cleanup