doctor: complete runtime check documentation sprint

Signed-off-by: master <>
This commit is contained in:
master
2026-03-31 23:26:24 +03:00
parent 404d50bcb7
commit 152c1b1357
54 changed files with 2210 additions and 258 deletions

@@ -0,0 +1,60 @@
---
checkId: check.db.connection
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, connectivity, quick]
---
# Database Connection
## What It Checks
Opens a PostgreSQL connection using `Doctor:Plugins:Database:ConnectionString` or `ConnectionStrings:DefaultConnection` and runs `SELECT version(), current_database(), current_user`.
The check passes only when the connection opens and the probe query returns successfully. Connection failures, authentication failures, DNS errors, and network timeouts fail the check.
## Why It Matters
Doctor cannot validate migrations, pool health, or schema state if the platform cannot reach PostgreSQL. A broken connection path usually means startup failures, API errors, and background job disruption across the suite.
## Common Causes
- `ConnectionStrings__DefaultConnection` is missing or malformed
- PostgreSQL is not running or not listening on the configured host and port
- DNS, firewall, or container networking prevents the Doctor service from reaching PostgreSQL
- Username, password, database name, or TLS settings are incorrect
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps postgres
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 postgres
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres pg_isready -U stellaops -d stellaops
```
Set the Doctor connection string with compose-style environment variables:
```yaml
services:
doctor-web:
environment:
ConnectionStrings__DefaultConnection: Host=postgres;Port=5432;Database=stellaops;Username=stellaops;Password=${STELLAOPS_DB_PASSWORD}
```
### Bare Metal / systemd
```bash
pg_isready -h <db-host> -p 5432 -U <db-user> -d <db-name>
psql "host=<db-host> port=5432 dbname=<db-name> user=<db-user> password=<password>" -c "SELECT 1"
```
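Note that `psql` expects libpq `key=value` pairs separated by spaces, not the Npgsql `Key=Value;` form used in `ConnectionStrings__DefaultConnection`. A small helper can translate one into the other for manual probing; the function name is illustrative and only the keys used in this guide are mapped:

```shell
#!/usr/bin/env bash
# Translate an Npgsql "Key=Value;" connection string into the libpq
# "key=value" form that psql accepts. Illustrative helper, not part of
# the platform; unmapped keys are silently dropped.
npgsql_to_psql() {
  local cs="$1" out="" pair key val
  local -a pairs
  IFS=';' read -r -a pairs <<< "$cs"
  for pair in "${pairs[@]}"; do
    key="${pair%%=*}"
    val="${pair#*=}"
    case "${key,,}" in
      host)     out+="host=$val " ;;
      port)     out+="port=$val " ;;
      database) out+="dbname=$val " ;;
      username) out+="user=$val " ;;
      password) out+="password=$val " ;;
    esac
  done
  printf '%s\n' "${out% }"
}

npgsql_to_psql "Host=postgres;Port=5432;Database=stellaops;Username=stellaops;Password=secret"
```

With that in place, `psql "$(npgsql_to_psql "$ConnectionStrings__DefaultConnection")" -c "SELECT version(), current_database(), current_user;"` reproduces the Doctor probe by hand.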
### Kubernetes / Helm
```bash
kubectl exec deploy/doctor-web -- pg_isready -h <postgres-service> -p 5432 -U <db-user> -d <db-name>
kubectl get secret <db-secret> -o yaml
```
## Verification
```bash
stella doctor --check check.db.connection
```
## Related Checks
- `check.db.latency` - uses the same connection path and highlights performance issues after basic connectivity works
- `check.db.pool.health` - validates connection pressure after connectivity is restored

@@ -0,0 +1,53 @@
---
checkId: check.db.latency
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, latency, performance]
---
# Query Latency
## What It Checks
Runs two warmup queries and then measures five `SELECT 1` probes plus five temporary-table `INSERT` probes against PostgreSQL.
The check warns when the p95 latency exceeds `50ms` and fails when the p95 latency exceeds `200ms`.
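The threshold logic can be illustrated with a self-contained sketch. The sample timings below are invented, and the real check measures its probes in-process rather than through the shell:

```shell
#!/usr/bin/env bash
# Sketch of the p95 evaluation against the documented 50ms/200ms
# thresholds. Sample values are invented for illustration.
p95() {
  # value at position ceil(0.95 * n) of the sorted samples
  printf '%s\n' "$@" | sort -n | awk -v n="$#" '
    NR == int((n * 95 + 99) / 100) { print; exit }'
}

samples=(12 14 9 11 230 13 10 12 15 11)  # probe durations in ms
p=$(p95 "${samples[@]}")
echo "p95=${p}ms"
if [ "$p" -gt 200 ]; then
  echo "fail"
elif [ "$p" -gt 50 ]; then
  echo "warn"
else
  echo "pass"
fi
```

A single slow outlier among ten probes lands at p95, which is exactly why the check uses a percentile rather than the mean.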
## Why It Matters
Healthy connectivity is not enough if the database path is slow. Elevated query latency turns into slow UI pages, delayed releases, and queue backlogs across the platform.
## Common Causes
- CPU, memory, or I/O pressure on the PostgreSQL host
- Cross-host or cross-region latency between Doctor and PostgreSQL
- Lock contention or long-running transactions
- Shared infrastructure saturation in the default compose stack
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT * FROM pg_locks WHERE NOT granted;"
docker compose -f devops/compose/docker-compose.stella-ops.yml stats postgres
```
Tune connection placement and storage before raising thresholds. If the database is remote, keep `doctor-web` and PostgreSQL on the same low-latency network segment.
### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT * FROM pg_locks WHERE NOT granted;"
```
### Kubernetes / Helm
```bash
kubectl top pod -n <namespace> <postgres-pod>
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT now();"
```
## Verification
```bash
stella doctor --check check.db.latency
```
## Related Checks
- `check.db.connection` - basic reachability must pass before latency numbers are meaningful
- `check.db.pool.health` - pool saturation often shows up as latency first

@@ -0,0 +1,52 @@
---
checkId: check.db.migrations.failed
plugin: stellaops.doctor.database
severity: fail
tags: [database, migrations, postgres, schema]
---
# Failed Migrations
## What It Checks
Reads the `stella_migration_history` table, when present, and reports rows marked `failed` or `incomplete`.
If the tracking table does not exist, the check returns an informational result and assumes the service uses a different migration mechanism.
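A minimal sketch of the status mapping described above. The migration IDs are hypothetical, and treating statuses other than the documented `failed`/`incomplete` pair as non-blocking is an assumption:

```shell
#!/usr/bin/env bash
# Map stella_migration_history status values onto check outcomes.
# Only failed/incomplete are documented as failures; everything else
# is assumed informational here.
classify() {
  case "$1" in
    failed|incomplete) echo "fail: $2 ($1)" ;;
    applied)           echo "ok: $2" ;;
    *)                 echo "info: $2 has unrecognised status '$1'" ;;
  esac
}

classify applied 20250101000000_Init
classify failed  20250301000000_AddAuditLog
```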
## Why It Matters
Partially applied migrations leave schemas in undefined states. That is a common cause of startup failures and runtime `500` errors after upgrades.
## Common Causes
- A migration script failed during deployment
- The database user lacks DDL permissions
- Two processes attempted to apply migrations concurrently
- An interrupted deployment left the migration history half-written
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 200 doctor-web
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT migration_id, status, error_message, applied_at FROM stella_migration_history ORDER BY applied_at DESC LIMIT 10;"
```
Fix the underlying SQL or permission problem, then restart the owning service so startup migrations run again.
### Bare Metal / systemd
```bash
journalctl -u <service-name> -n 200
dotnet ef database update
```
### Kubernetes / Helm
```bash
kubectl logs deploy/<service-name> -n <namespace> --tail=200
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT migration_id, status FROM stella_migration_history;"
```
## Verification
```bash
stella doctor --check check.db.migrations.failed
```
## Related Checks
- `check.db.migrations.pending` - pending migrations often follow a failed rollout
- `check.db.schema.version` - schema consistency should be rechecked after cleanup

@@ -0,0 +1,52 @@
---
checkId: check.db.migrations.pending
plugin: stellaops.doctor.database
severity: warn
tags: [database, migrations, postgres, schema]
---
# Pending Migrations
## What It Checks
Looks for the `__EFMigrationsHistory` table and reports the latest applied migration recorded there.
This runtime check does not diff the database against the assembly directly; it tells you whether migration history exists and what the latest applied migration is.
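When you do want the diff the runtime check skips, the applied history can be compared against the assembly's migration list by hand. The migration names below are hypothetical; `applied` would come from the `__EFMigrationsHistory` query and `assembly` from `dotnet ef migrations list`:

```shell
#!/usr/bin/env bash
# Hypothetical migration names standing in for real query output.
applied='20250101000000_Init
20250201000000_AddUsers'
assembly='20250101000000_Init
20250201000000_AddUsers
20250301000000_AddAuditLog'

# comm -13 prints lines present only in the second sorted input,
# i.e. migrations the assembly knows about but the database never ran.
comm -13 <(sort <<< "$applied") <(sort <<< "$assembly")
```

Recent EF Core versions also annotate pending entries directly in `dotnet ef migrations list` output when they can reach the database.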
## Why It Matters
Missing or stale migration history usually means a fresh environment was bootstrapped incorrectly or schema changes were never applied on startup.
## Common Causes
- Startup migrations are not wired for the owning service
- The database was reset and the service never converged the schema
- The service is using a different schema owner than operators expect
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 200 doctor-web
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT \"MigrationId\" FROM \"__EFMigrationsHistory\" ORDER BY \"MigrationId\" DESC;"
```
Confirm the owning service calls startup migrations on boot instead of relying on one-off SQL initialization scripts.
### Bare Metal / systemd
```bash
journalctl -u <service-name> -n 200
dotnet ef migrations list
dotnet ef database update
```
### Kubernetes / Helm
```bash
kubectl logs deploy/<service-name> -n <namespace> --tail=200
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT COUNT(*) FROM \"__EFMigrationsHistory\";"
```
## Verification
```bash
stella doctor --check check.db.migrations.pending
```
## Related Checks
- `check.db.migrations.failed` - diagnose broken runs before retrying
- `check.db.schema.version` - validates the resulting schema shape

@@ -0,0 +1,51 @@
---
checkId: check.db.permissions
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, permissions, security]
---
# Database Permissions
## What It Checks
Inspects the current PostgreSQL user, whether it is a superuser, whether it can create databases or roles, and whether it has access to application schemas.
The check warns when the app runs as a superuser and fails when the user cannot use the `public` schema.
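A sketch of that decision logic, assuming boolean probe results in the `t`/`f` form psql prints for `SELECT rolsuper FROM pg_roles WHERE rolname = current_user` and `SELECT has_schema_privilege(current_user, 'public', 'USAGE')`:

```shell
#!/usr/bin/env bash
# Map the two privilege probes onto the documented outcomes.
# Inputs mirror psql's t/f boolean output.
evaluate() {
  local is_super="$1" has_public="$2"
  if [ "$has_public" != "t" ]; then
    echo "fail: no USAGE on schema public"
  elif [ "$is_super" = "t" ]; then
    echo "warn: running as superuser"
  else
    echo "pass"
  fi
}

evaluate t t   # admin account with access
evaluate f f   # locked-out service account
evaluate f t   # least-privilege service account
```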
## Why It Matters
Over-privileged accounts increase blast radius. Under-privileged accounts break startup migrations and normal CRUD paths.
## Common Causes
- The connection string still uses `postgres` or another admin account
- Grants were not applied after creating a dedicated service account
- Restrictive schema privileges were added manually
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "CREATE USER stellaops WITH PASSWORD '<strong-password>';"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT CONNECT ON DATABASE stellaops TO stellaops;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT USAGE ON SCHEMA public TO stellaops;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO stellaops;"
```
Update `ConnectionStrings__DefaultConnection` after the grants are in place.
### Bare Metal / systemd
```bash
psql -h <db-host> -U postgres -d <db-name> -c "ALTER ROLE <app-user> NOSUPERUSER NOCREATEDB NOCREATEROLE;"
```
### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U postgres -d <db-name> -c "\du"
```
## Verification
```bash
stella doctor --check check.db.permissions
```
## Related Checks
- `check.db.migrations.failed` - missing privileges frequently break migrations
- `check.db.connection` - credentials and grants must both be correct

@@ -0,0 +1,50 @@
---
checkId: check.db.pool.health
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, pool, connections]
---
# Connection Pool Health
## What It Checks
Queries `pg_stat_activity` for the current database and evaluates total connections, active connections, idle connections, waiting connections, and sessions stuck `idle in transaction`.
The check warns when more than five sessions are `idle in transaction` or when total usage exceeds `80%` of server capacity.
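A worked example of the capacity threshold, using invented counts in place of live query output:

```shell
#!/usr/bin/env bash
# Invented counts standing in for live results:
total=68            # SELECT count(*) FROM pg_stat_activity WHERE datname = current_database();
max_connections=100 # SHOW max_connections;
reserved=3          # SHOW superuser_reserved_connections;

usable=$((max_connections - reserved))
pct=$((total * 100 / usable))
echo "using ${pct}% of ${usable} usable server connections"
if [ "$pct" -gt 80 ]; then
  echo "warn: connection pressure"
fi
```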
## Why It Matters
Pool pressure turns into request latency, migration timeouts, and job backlog. `idle in transaction` sessions are especially dangerous because they hold locks while doing nothing useful.
## Common Causes
- Application code is not closing transactions
- Connection leaks keep sessions open after requests complete
- `max_connections` is too low for the number of app instances
- Long-running requests or deadlocks block pooled connections
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE datname = current_database();"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT pid, query FROM pg_stat_activity WHERE state = 'idle in transaction';"
```
### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SHOW max_connections;"
```
Review the owning service for transaction scopes that stay open across network calls or retries.
### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT count(*) FROM pg_stat_activity;"
```
## Verification
```bash
stella doctor --check check.db.pool.health
```
## Related Checks
- `check.db.pool.size` - configuration and runtime pressure need to agree
- `check.db.latency` - latency usually rises before the pool is fully exhausted

@@ -0,0 +1,56 @@
---
checkId: check.db.pool.size
plugin: stellaops.doctor.database
severity: warn
tags: [database, postgres, pool, configuration]
---
# Connection Pool Size
## What It Checks
Parses the Npgsql connection string and compares `Pooling`, `MinPoolSize`, and `MaxPoolSize` against PostgreSQL `max_connections` minus reserved superuser slots.
The check warns when pooling is disabled or when `MaxPoolSize` exceeds practical server capacity. It returns info when `MinPoolSize=0`.
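A worked sizing example, under the assumption that every replica can open its full `MaxPoolSize` at once:

```shell
#!/usr/bin/env bash
# Worst-case pool arithmetic across replicas. Values are invented;
# the server figures would come from SHOW max_connections and
# SHOW superuser_reserved_connections.
replicas=5
max_pool_size=25       # MaxPoolSize in the connection string
max_connections=100
reserved=3

ceiling=$((replicas * max_pool_size))
budget=$((max_connections - reserved))
echo "worst case ${ceiling} connections against a budget of ${budget}"
if [ "$ceiling" -gt "$budget" ]; then
  echo "lower MaxPoolSize to at most $((budget / replicas)) per replica"
fi
```

Five replicas at `MaxPoolSize=25` can demand 125 connections against 97 usable slots, which is exactly the coordinated-sizing mistake listed above.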
## Why It Matters
Pool sizing mistakes create either avoidable cold-start latency or connection storms that starve PostgreSQL.
## Common Causes
- `Pooling=false` left over from local troubleshooting
- `MaxPoolSize` copied from another environment without checking server capacity
- Multiple app replicas sharing the same PostgreSQL limit without coordinated sizing
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SHOW max_connections;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SHOW superuser_reserved_connections;"
```
Set an explicit connection string:
```yaml
services:
doctor-web:
environment:
ConnectionStrings__DefaultConnection: Host=postgres;Port=5432;Database=stellaops;Username=stellaops;Password=${STELLAOPS_DB_PASSWORD};Pooling=true;MinPoolSize=5;MaxPoolSize=25
```
### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SHOW max_connections;"
```
### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SHOW max_connections;"
```
## Verification
```bash
stella doctor --check check.db.pool.size
```
## Related Checks
- `check.db.pool.health` - validates that configured limits behave correctly at runtime
- `check.db.connection` - pooling changes should not break base connectivity

@@ -0,0 +1,49 @@
---
checkId: check.db.schema.version
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, schema, migrations]
---
# Schema Version
## What It Checks
Counts non-system schemas and tables, inspects the latest EF migration entry when available, and warns when PostgreSQL reports unvalidated foreign-key constraints.
Unvalidated constraints usually indicate an interrupted migration or manual DDL drift.
## Why It Matters
Schema drift is a common source of runtime breakage after upgrades. Unvalidated constraints can hide partial migrations long after deployment appears complete.
## Common Causes
- A migration failed after creating constraints but before validation
- Manual schema changes bypassed startup migrations
- The database was restored from an inconsistent backup
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT conname FROM pg_constraint WHERE NOT convalidated;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT \"MigrationId\" FROM \"__EFMigrationsHistory\" ORDER BY \"MigrationId\" DESC LIMIT 5;"
```
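Once the underlying data is consistent, each constraint returned by the first query above can be cleared with `ALTER TABLE ... VALIDATE CONSTRAINT`. This sketch generates the statements from a hypothetical two-row result (table and constraint names are invented):

```shell
#!/usr/bin/env bash
# Turn a "table constraint" listing into VALIDATE statements.
# The input rows below are illustrative; real rows would come from
# pg_constraint WHERE NOT convalidated.
while read -r rel con; do
  echo "ALTER TABLE ${rel} VALIDATE CONSTRAINT ${con};"
done <<'EOF'
orders fk_orders_customer
audit_log fk_audit_actor
EOF
```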
Re-run the owning service with startup migrations enabled after fixing the underlying schema issue.
### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT COUNT(*) FROM pg_constraint WHERE NOT convalidated;"
```
### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT nspname FROM pg_namespace;"
```
## Verification
```bash
stella doctor --check check.db.schema.version
```
## Related Checks
- `check.db.migrations.failed` - failed migrations are the most common cause of schema inconsistency
- `check.db.migrations.pending` - verify history after cleanup