doctor: complete runtime check documentation sprint

Signed-off-by: master <>
This commit is contained in:
master
2026-03-31 23:26:24 +03:00
parent 404d50bcb7
commit 152c1b1357
54 changed files with 2210 additions and 258 deletions

View File

@@ -0,0 +1,53 @@
---
checkId: check.observability.alerting
plugin: stellaops.doctor.observability
severity: info
tags: [observability, alerting, notifications]
---
# Alerting Configuration
## What It Checks
Looks for configured alert destinations such as Alertmanager, Slack, email recipients, or PagerDuty routing keys.
The check reports info when alerting is explicitly disabled or when no destination is configured. It warns only when a destination is present but obviously malformed, such as invalid email addresses.
## Why It Matters
Metrics and logs are not actionable if nobody is notified when thresholds are crossed. Production installs should route alerts somewhere outside the application process.
## Common Causes
- Alerting was never configured after initial compose bring-up
- Notification secrets were omitted from environment variables
- Recipient lists contain placeholders or invalid values
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web printenv | grep -E 'ALERT|SLACK|PAGERDUTY|SMTP'
```
Example compose-style configuration:
```yaml
services:
doctor-web:
environment:
Alerting__Enabled: "true"
Alerting__AlertManagerUrl: http://alertmanager:9093
Alerting__Email__Recipients__0: ops@example.com
```
### Bare Metal / systemd
Configure `Alerting:*` settings in the service configuration and ensure secrets come from the platform secrets provider rather than clear text files.
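As a sketch (unit and file names here are hypothetical), a systemd drop-in can pull alerting secrets from a root-only environment file instead of embedding them in the unit:
```ini
# /etc/systemd/system/stellaops-doctor.service.d/alerting.conf
[Service]
# /etc/stellaops/alerting.env holds keys such as Alerting__Enabled=true
# (mode 0600, owned by root)
EnvironmentFile=/etc/stellaops/alerting.env
```
Run `systemctl daemon-reload` and restart the unit after adding the drop-in.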
### Kubernetes / Helm
Store webhook URLs and routing keys in Secrets, then mount them into `Alerting:*` values.
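One hedged sketch of that wiring, assuming a Secret named `doctor-alerting` created out of band; the Secret name, key, and the `Alerting__PagerDuty__RoutingKey` setting path are illustrative, not fixed by the check:
```yaml
# Workload env section
env:
  - name: Alerting__PagerDuty__RoutingKey
    valueFrom:
      secretKeyRef:
        name: doctor-alerting
        key: pagerduty-routing-key
```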
## Verification
```bash
stella doctor --check check.observability.alerting
```
## Related Checks
- `check.observability.metrics` - alerting is usually driven by metrics
- `check.observability.logging` - logs are the fallback when alerts are missing

View File

@@ -0,0 +1,53 @@
---
checkId: check.observability.healthchecks
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, healthchecks, readiness, liveness]
---
# Health Check Endpoints
## What It Checks
Evaluates the configured health, readiness, and liveness paths and optionally probes `http://localhost:<port><path>` when a health-check port is configured.
The check warns when endpoints are unreachable, when timeouts are outside the `1s` to `60s` range, or when readiness and liveness collapse onto the same path.
## Why It Matters
Broken health probes cause restart loops, failed rolling upgrades, and misleading orchestration signals.
## Common Causes
- The service exposes `/health` but not `/health/ready` or `/health/live`
- Health-check ports differ from the actual bound HTTP port
- Probe timeout values were copied from another service without validation
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/health
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/health/ready
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/health/live
```
Set explicit paths and a reasonable timeout:
```yaml
HealthChecks__Path: /health
HealthChecks__ReadinessPath: /health/ready
HealthChecks__LivenessPath: /health/live
HealthChecks__Timeout: 30
```
### Bare Metal / systemd
Verify reverse proxies and firewalls do not block the health port.
### Kubernetes / Helm
Point readiness and liveness probes at separate endpoints whenever startup and steady-state behavior differ.
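A minimal probe sketch matching the paths above; the port and timing values are illustrative defaults, not values mandated by the check:
```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```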
## Verification
```bash
stella doctor --check check.observability.healthchecks
```
## Related Checks
- `check.core.services.health` - aggregates the underlying ASP.NET health checks when available
- `check.observability.metrics` - shared listener misconfiguration can break both endpoints

View File

@@ -0,0 +1,49 @@
---
checkId: check.observability.logging
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, logging, structured-logs]
---
# Logging Configuration
## What It Checks
Reads default and framework log levels and looks for structured logging via `Logging:Structured`, JSON console formatting, or a `Serilog` configuration section.
The check warns when default logging is `Debug` or `Trace`, when Microsoft categories are too verbose, or when structured logging is missing.
## Why It Matters
Unstructured logs slow incident response and make exports difficult to analyze. Overly verbose framework logging also drives storage growth and noise.
## Common Causes
- Only the default ASP.NET console logger is configured
- `Logging:Structured` or `Serilog` settings were omitted from compose values
- Troubleshooting log levels were left enabled in production
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Logging__LogLevel__Default: Information
Logging__LogLevel__Microsoft: Warning
Logging__Structured: "true"
```
If Serilog is used, make sure the console sink emits JSON or another structured format that downstream tooling can parse.
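If Serilog is configured through `Serilog.Settings.Configuration`, the compact JSON formatter can be selected with environment keys like the following; this assumes the `Serilog.Formatting.Compact` package is referenced and is a sketch, not a setting guaranteed by this service:
```yaml
services:
  doctor-web:
    environment:
      Serilog__WriteTo__0__Name: Console
      Serilog__WriteTo__0__Args__formatter: "Serilog.Formatting.Compact.CompactJsonFormatter, Serilog.Formatting.Compact"
```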
### Bare Metal / systemd
Keep framework namespaces at `Warning` or stricter unless you are collecting short-lived debugging evidence.
### Kubernetes / Helm
Ensure log collectors expect the same output format the application emits.
## Verification
```bash
stella doctor --check check.observability.logging
```
## Related Checks
- `check.observability.alerting` - alerting often relies on structured log pipelines
- `check.security.audit.logging` - audit logs should follow the same transport and retention standards

View File

@@ -0,0 +1,53 @@
---
checkId: check.observability.metrics
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, metrics, prometheus]
---
# Metrics Collection
## What It Checks
Inspects `Metrics:*`, `Prometheus:*`, and `OpenTelemetry:Metrics:*` settings. When a metrics port is configured and an `IHttpClientFactory` is available, the check probes `http://localhost:<port><path>`.
The check returns info when metrics are disabled or absent, and warns when the configured endpoint cannot be reached.
## Why It Matters
Metrics are the primary input for alerting, SLO tracking, and capacity planning. Missing or unreachable endpoints remove the fastest signal operators have.
## Common Causes
- Metrics were never enabled in the deployment configuration
- The metrics path or port does not match the listener exposed by the service
- A sidecar or reverse proxy blocks local probing
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Metrics__Enabled: "true"
Metrics__Path: /metrics
Metrics__Port: 8080
```
Probe the endpoint from inside the container:
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/metrics
```
### Bare Metal / systemd
Bind the metrics port explicitly if the service does not share the main HTTP listener.
### Kubernetes / Helm
Align the `ServiceMonitor` or Prometheus scrape config with the same path and port the app exposes.
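A hedged `ServiceMonitor` sketch, assuming the Prometheus Operator is installed and the Service exposes a named port carrying the metrics path; the label and port names are illustrative:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: doctor-web
spec:
  selector:
    matchLabels:
      app: doctor-web        # must match the Service's labels
  endpoints:
    - port: http             # named port on the Service
      path: /metrics         # keep in sync with Metrics__Path
```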
## Verification
```bash
stella doctor --check check.observability.metrics
```
## Related Checks
- `check.observability.otel` - OpenTelemetry metrics often share the same collector path
- `check.observability.alerting` - metrics are usually the source for alert rules

View File

@@ -0,0 +1,52 @@
---
checkId: check.observability.otel
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, opentelemetry, tracing, metrics]
---
# OpenTelemetry Configuration
## What It Checks
Reads `OpenTelemetry:*`, `Telemetry:*`, and `OTEL_*` settings for endpoint, service name, tracing enablement, metrics enablement, and sampling ratio. When possible, it probes the collector host directly.
The check reports info when no OTLP endpoint is configured and warns when the service name is missing, tracing or metrics are disabled, sampling is too low, or the collector is unreachable.
## Why It Matters
OpenTelemetry is the main path for exporting traces and metrics to external systems. Broken collector settings silently remove cross-service visibility.
## Common Causes
- `OTEL_EXPORTER_OTLP_ENDPOINT` was omitted from compose or environment settings
- `OTEL_SERVICE_NAME` was never set
- Collector networking differs between local and deployed environments
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
OTEL_SERVICE_NAME: doctor-web
OpenTelemetry__Tracing__Enabled: "true"
OpenTelemetry__Metrics__Enabled: "true"
```
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://otel-collector:4318/
```
### Bare Metal / systemd
Keep the collector endpoint in the service unit or configuration file and verify firewalls allow traffic on the OTLP port.
### Kubernetes / Helm
Use cluster-local collector service names and inject `OTEL_SERVICE_NAME` per workload.
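One way to inject a per-workload service name without hardcoding it is the downward API; the collector address below assumes a cluster-local collector in an `observability` namespace and is illustrative:
```yaml
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector.observability.svc.cluster.local:4317
  - name: OTEL_SERVICE_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['app.kubernetes.io/name']
```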
## Verification
```bash
stella doctor --check check.observability.otel
```
## Related Checks
- `check.observability.tracing` - validates trace-specific tuning once OTLP export is wired
- `check.observability.metrics` - metrics export often shares the same collector

View File

@@ -0,0 +1,48 @@
---
checkId: check.observability.tracing
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, tracing, correlation]
---
# Distributed Tracing
## What It Checks
Validates trace enablement, propagator, sampling ratio, exporter type, and whether HTTP and database instrumentation are turned on.
The check reports info when tracing is explicitly disabled and warns when the sampling ratio is invalid or too low, or when important instrumentation is turned off.
## Why It Matters
Tracing is the fastest way to understand cross-service latency and identify the exact hop that is failing. Disabling instrumentation removes that evidence.
## Common Causes
- Sampling ratio set to `0` during load testing and never restored
- Only outbound HTTP traces are enabled while database spans remain off
- Propagator or exporter defaults differ between services
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Tracing__Enabled: "true"
Tracing__SamplingRatio: "1.0"
Tracing__Instrumentation__Http: "true"
Tracing__Instrumentation__Database: "true"
```
### Bare Metal / systemd
Keep `Tracing:SamplingRatio` between `0.01` and `1.0` unless you are deliberately suppressing traces for a benchmark.
### Kubernetes / Helm
Propagate the same trace configuration across all services in the release path so correlation IDs remain intact.
## Verification
```bash
stella doctor --check check.observability.tracing
```
## Related Checks
- `check.observability.otel` - exporter connectivity must work before traces leave the process
- `check.servicegraph.timeouts` - tracing is most useful when diagnosing timeout-related issues

View File

@@ -0,0 +1,60 @@
---
checkId: check.db.connection
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, connectivity, quick]
---
# Database Connection
## What It Checks
Opens a PostgreSQL connection using `Doctor:Plugins:Database:ConnectionString` or `ConnectionStrings:DefaultConnection` and runs `SELECT version(), current_database(), current_user`.
The check passes only when the connection opens and the probe query returns successfully. Connection failures, authentication failures, DNS errors, and network timeouts fail the check.
## Why It Matters
Doctor cannot validate migrations, pool health, or schema state if the platform cannot reach PostgreSQL. A broken connection path usually means startup failures, API errors, and background job disruption across the suite.
## Common Causes
- `ConnectionStrings__DefaultConnection` is missing or malformed
- PostgreSQL is not running or not listening on the configured host and port
- DNS, firewall, or container networking prevents the Doctor service from reaching PostgreSQL
- Username, password, database name, or TLS settings are incorrect
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps postgres
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 postgres
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres pg_isready -U stellaops -d stellaops
```
Set the Doctor connection string with compose-style environment variables:
```yaml
services:
doctor-web:
environment:
ConnectionStrings__DefaultConnection: Host=postgres;Port=5432;Database=stellaops;Username=stellaops;Password=${STELLAOPS_DB_PASSWORD}
```
### Bare Metal / systemd
```bash
pg_isready -h <db-host> -p 5432 -U <db-user> -d <db-name>
psql "host=<db-host> port=5432 dbname=<db-name> user=<db-user> password=<password>" -c "SELECT 1"
```
### Kubernetes / Helm
```bash
kubectl exec deploy/doctor-web -- pg_isready -h <postgres-service> -p 5432 -U <db-user> -d <db-name>
kubectl get secret <db-secret> -o yaml
```
## Verification
```bash
stella doctor --check check.db.connection
```
## Related Checks
- `check.db.latency` - uses the same connection path and highlights performance issues after basic connectivity works
- `check.db.pool.health` - validates connection pressure after connectivity is restored

View File

@@ -0,0 +1,53 @@
---
checkId: check.db.latency
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, latency, performance]
---
# Query Latency
## What It Checks
Runs two warmup queries and then measures five `SELECT 1` probes plus five temporary-table `INSERT` probes against PostgreSQL.
The check warns when the p95 latency exceeds `50ms` and fails when the p95 latency exceeds `200ms`.
## Why It Matters
Healthy connectivity is not enough if the database path is slow. Elevated query latency turns into slow UI pages, delayed releases, and queue backlogs across the platform.
## Common Causes
- CPU, memory, or I/O pressure on the PostgreSQL host
- Cross-host or cross-region latency between Doctor and PostgreSQL
- Lock contention or long-running transactions
- Shared infrastructure saturation in the default compose stack
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT * FROM pg_locks WHERE NOT granted;"
docker compose -f devops/compose/docker-compose.stella-ops.yml stats postgres
```
Tune connection placement and storage before raising thresholds. If the database is remote, keep `doctor-web` and PostgreSQL on the same low-latency network segment.
### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT * FROM pg_locks WHERE NOT granted;"
```
### Kubernetes / Helm
```bash
kubectl top pod -n <namespace> <postgres-pod>
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT now();"
```
## Verification
```bash
stella doctor --check check.db.latency
```
## Related Checks
- `check.db.connection` - basic reachability must pass before latency numbers are meaningful
- `check.db.pool.health` - pool saturation often shows up as latency first

View File

@@ -0,0 +1,52 @@
---
checkId: check.db.migrations.failed
plugin: stellaops.doctor.database
severity: fail
tags: [database, migrations, postgres, schema]
---
# Failed Migrations
## What It Checks
Reads the `stella_migration_history` table, when present, and reports rows marked `failed` or `incomplete`.
If the tracking table does not exist, the check reports an informational result and assumes the service uses a different migration mechanism.
## Why It Matters
Partially applied migrations leave schemas in undefined states. That is a common cause of startup failures and runtime `500` errors after upgrades.
## Common Causes
- A migration script failed during deployment
- The database user lacks DDL permissions
- Two processes attempted to apply migrations concurrently
- An interrupted deployment left the migration history half-written
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 200 doctor-web
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT migration_id, status, error_message, applied_at FROM stella_migration_history ORDER BY applied_at DESC LIMIT 10;"
```
Fix the underlying SQL or permission problem, then restart the owning service so startup migrations run again.
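If a failed history row itself blocks the retry, it may need manual cleanup. This is a sketch that assumes the owning service re-applies missing migrations on startup; confirm the root cause is fixed and back up the table before deleting anything:
```sql
-- Review first; column names match the query above.
SELECT migration_id, status, error_message
FROM stella_migration_history
WHERE status IN ('failed', 'incomplete');

-- Then clear the failed rows so the next startup retries them.
DELETE FROM stella_migration_history
WHERE status IN ('failed', 'incomplete');
```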
### Bare Metal / systemd
```bash
journalctl -u <service-name> -n 200
dotnet ef database update
```
### Kubernetes / Helm
```bash
kubectl logs deploy/<service-name> -n <namespace> --tail=200
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT migration_id, status FROM stella_migration_history;"
```
## Verification
```bash
stella doctor --check check.db.migrations.failed
```
## Related Checks
- `check.db.migrations.pending` - pending migrations often follow a failed rollout
- `check.db.schema.version` - schema consistency should be rechecked after cleanup

View File

@@ -0,0 +1,52 @@
---
checkId: check.db.migrations.pending
plugin: stellaops.doctor.database
severity: warn
tags: [database, migrations, postgres, schema]
---
# Pending Migrations
## What It Checks
Looks for the `__EFMigrationsHistory` table and reports the latest applied migration recorded there.
This runtime check does not diff the database against the assembly directly; it tells you whether migration history exists and what the latest applied migration is.
## Why It Matters
Missing or stale migration history usually means a fresh environment was bootstrapped incorrectly or schema changes were never applied on startup.
## Common Causes
- Startup migrations are not wired for the owning service
- The database was reset and the service never converged the schema
- The service is using a different schema owner than operators expect
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 200 doctor-web
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT \"MigrationId\" FROM \"__EFMigrationsHistory\" ORDER BY \"MigrationId\" DESC;"
```
Confirm the owning service calls startup migrations on boot instead of relying on one-off SQL initialization scripts.
### Bare Metal / systemd
```bash
journalctl -u <service-name> -n 200
dotnet ef migrations list
dotnet ef database update
```
### Kubernetes / Helm
```bash
kubectl logs deploy/<service-name> -n <namespace> --tail=200
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT COUNT(*) FROM \"__EFMigrationsHistory\";"
```
## Verification
```bash
stella doctor --check check.db.migrations.pending
```
## Related Checks
- `check.db.migrations.failed` - diagnose broken runs before retrying
- `check.db.schema.version` - validates the resulting schema shape

View File

@@ -0,0 +1,51 @@
---
checkId: check.db.permissions
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, permissions, security]
---
# Database Permissions
## What It Checks
Inspects the current PostgreSQL user, whether it is a superuser, whether it can create databases or roles, and whether it has access to application schemas.
The check warns when the app runs as a superuser and fails when the user cannot use the `public` schema.
## Why It Matters
Over-privileged accounts increase blast radius. Under-privileged accounts break startup migrations and normal CRUD paths.
## Common Causes
- The connection string still uses `postgres` or another admin account
- Grants were not applied after creating a dedicated service account
- Restrictive schema privileges were added manually
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "CREATE USER stellaops WITH PASSWORD '<strong-password>';"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT CONNECT ON DATABASE stellaops TO stellaops;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT USAGE ON SCHEMA public TO stellaops;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO stellaops;"
```
Update `ConnectionStrings__DefaultConnection` after the grants are in place.
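Grants on `ALL TABLES` cover only existing objects. If migrations later create new tables, default privileges keep the service account working; run this as the role that creates the tables, or add `FOR ROLE <owner>` (role name matches the example above):
```sql
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO stellaops;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT USAGE, SELECT ON SEQUENCES TO stellaops;
```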
### Bare Metal / systemd
```bash
psql -h <db-host> -U postgres -d <db-name> -c "ALTER ROLE <app-user> NOSUPERUSER NOCREATEDB NOCREATEROLE;"
```
### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U postgres -d <db-name> -c "\du"
```
## Verification
```bash
stella doctor --check check.db.permissions
```
## Related Checks
- `check.db.migrations.failed` - missing privileges frequently break migrations
- `check.db.connection` - credentials and grants must both be correct

View File

@@ -0,0 +1,50 @@
---
checkId: check.db.pool.health
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, pool, connections]
---
# Connection Pool Health
## What It Checks
Queries `pg_stat_activity` for the current database and evaluates total connections, active connections, idle connections, waiting connections, and sessions stuck `idle in transaction`.
The check warns when more than five sessions are `idle in transaction` or when total usage exceeds `80%` of server capacity.
## Why It Matters
Pool pressure turns into request latency, migration timeouts, and job backlog. `idle in transaction` sessions are especially dangerous because they hold locks while doing nothing useful.
## Common Causes
- Application code is not closing transactions
- Connection leaks keep sessions open after requests complete
- `max_connections` is too low for the number of app instances
- Long-running requests or deadlocks block pooled connections
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE datname = current_database();"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT pid, query FROM pg_stat_activity WHERE state = 'idle in transaction';"
```
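When leaked sessions must be cleared before the application fix lands, PostgreSQL can terminate them and cap future ones; the 10-minute window below is an illustrative threshold, not a value the check enforces:
```sql
-- Terminate sessions idle in transaction for more than 10 minutes (inspect pids first).
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - interval '10 minutes';

-- Optional guardrail: have the server end such sessions automatically.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
SELECT pg_reload_conf();
```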
### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SHOW max_connections;"
```
Review the owning service for transaction scopes that stay open across network calls or retries.
### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT count(*) FROM pg_stat_activity;"
```
## Verification
```bash
stella doctor --check check.db.pool.health
```
## Related Checks
- `check.db.pool.size` - configuration and runtime pressure need to agree
- `check.db.latency` - latency usually rises before the pool is fully exhausted

View File

@@ -0,0 +1,56 @@
---
checkId: check.db.pool.size
plugin: stellaops.doctor.database
severity: warn
tags: [database, postgres, pool, configuration]
---
# Connection Pool Size
## What It Checks
Parses the Npgsql connection string and compares `Pooling`, `MinPoolSize`, and `MaxPoolSize` against PostgreSQL `max_connections` minus reserved superuser slots.
The check warns when pooling is disabled or when `MaxPoolSize` exceeds practical server capacity. It returns info when `MinPoolSize` is `0`.
## Why It Matters
Pool sizing mistakes create either avoidable cold-start latency or connection storms that starve PostgreSQL.
## Common Causes
- `Pooling=false` left over from local troubleshooting
- `Max Pool Size` copied from another environment without checking server capacity
- Multiple app replicas sharing the same PostgreSQL limit without coordinated sizing
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SHOW max_connections;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SHOW superuser_reserved_connections;"
```
Set an explicit connection string:
```yaml
services:
doctor-web:
environment:
ConnectionStrings__DefaultConnection: Host=postgres;Port=5432;Database=stellaops;Username=stellaops;Password=${STELLAOPS_DB_PASSWORD};Pooling=true;MinPoolSize=5;MaxPoolSize=25
```
### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SHOW max_connections;"
```
### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SHOW max_connections;"
```
## Verification
```bash
stella doctor --check check.db.pool.size
```
## Related Checks
- `check.db.pool.health` - validates that configured limits behave correctly at runtime
- `check.db.connection` - pooling changes should not break base connectivity

View File

@@ -0,0 +1,49 @@
---
checkId: check.db.schema.version
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, schema, migrations]
---
# Schema Version
## What It Checks
Counts non-system schemas and tables, inspects the latest EF migration entry when available, and warns when PostgreSQL reports unvalidated foreign-key constraints.
Unvalidated constraints usually indicate an interrupted migration or manual DDL drift.
## Why It Matters
Schema drift is a common source of runtime breakage after upgrades. Unvalidated constraints can hide partial migrations long after deployment appears complete.
## Common Causes
- A migration failed after creating constraints but before validation
- Manual schema changes bypassed startup migrations
- The database was restored from an inconsistent backup
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT conname FROM pg_constraint WHERE NOT convalidated;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT \"MigrationId\" FROM \"__EFMigrationsHistory\" ORDER BY \"MigrationId\" DESC LIMIT 5;"
```
Re-run the owning service with startup migrations enabled after fixing the underlying schema issue.
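Once the offending rows are repaired, PostgreSQL can validate a previously unvalidated constraint in place; the table and constraint names come from the `pg_constraint` query above:
```sql
-- Takes a SHARE UPDATE EXCLUSIVE lock, so it can run while the app stays online.
ALTER TABLE <table-name> VALIDATE CONSTRAINT <constraint-name>;
```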
### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT COUNT(*) FROM pg_constraint WHERE NOT convalidated;"
```
### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT nspname FROM pg_namespace;"
```
## Verification
```bash
stella doctor --check check.db.schema.version
```
## Related Checks
- `check.db.migrations.failed` - failed migrations are the most common cause of schema inconsistency
- `check.db.migrations.pending` - verify history after cleanup

View File

@@ -0,0 +1,56 @@
---
checkId: check.servicegraph.backend
plugin: stellaops.doctor.servicegraph
severity: fail
tags: [servicegraph, backend, api, connectivity]
---
# Backend API Connectivity
## What It Checks
Reads `StellaOps:BackendUrl` or `BackendUrl`, appends `/health`, and performs an HTTP GET through `IHttpClientFactory`.
The check passes on a successful response, warns when latency exceeds `2000ms`, and fails on non-success status codes or connection errors.
## Why It Matters
The backend API is the control plane entry point for many Stella Ops flows. If it is unreachable, UI features and cross-service orchestration degrade quickly.
## Common Causes
- `StellaOps__BackendUrl` points to the wrong host, port, or scheme
- The backend service is down or returning `5xx`
- DNS, proxy, or network rules block access from the Doctor service
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
StellaOps__BackendUrl: http://platform-web:8080
```
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://platform-web:8080/health
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 platform-web
```
### Bare Metal / systemd
```bash
curl -fsS http://<backend-host>:<port>/health
journalctl -u <backend-service> -n 200
```
### Kubernetes / Helm
```bash
kubectl exec deploy/doctor-web -n <namespace> -- curl -fsS http://<backend-service>.<namespace>.svc.cluster.local:<port>/health
kubectl logs deploy/<backend-service> -n <namespace> --tail=200
```
## Verification
```bash
stella doctor --check check.servicegraph.backend
```
## Related Checks
- `check.servicegraph.endpoints` - validates the rest of the service graph after the main backend is reachable
- `check.servicegraph.timeouts` - slow backend responses often trace back to timeout tuning

View File

@@ -0,0 +1,48 @@
---
checkId: check.servicegraph.circuitbreaker
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, resilience, circuit-breaker]
---
# Circuit Breaker Status
## What It Checks
Reads `Resilience:Enabled` or `HttpClient:Resilience:Enabled` and, when enabled, validates `BreakDurationSeconds`, `FailureThreshold`, and `SamplingDurationSeconds`.
The check reports info when resilience is not configured, warns when `BreakDurationSeconds < 5` or `FailureThreshold < 2`, and passes otherwise.
## Why It Matters
Circuit breakers protect external dependencies from retry storms. Bad thresholds either trip too aggressively or never trip when a downstream service is failing.
## Common Causes
- Resilience policies were never enabled on outgoing HTTP clients
- Thresholds were copied from a benchmark profile into production
- Multiple services use different resilience defaults, making failures unpredictable
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Resilience__Enabled: "true"
Resilience__CircuitBreaker__BreakDurationSeconds: "30"
Resilience__CircuitBreaker__FailureThreshold: "5"
Resilience__CircuitBreaker__SamplingDurationSeconds: "60"
```
### Bare Metal / systemd
Keep breaker settings in the same configuration source used for HTTP client registration so the service and Doctor observe the same values.
### Kubernetes / Helm
Standardize resilience values across backend-facing workloads instead of per-pod overrides.
## Verification
```bash
stella doctor --check check.servicegraph.circuitbreaker
```
## Related Checks
- `check.servicegraph.backend` - breaker policy protects this path when the backend degrades
- `check.servicegraph.timeouts` - timeout settings and breaker settings should be tuned together

View File

@@ -0,0 +1,53 @@
---
checkId: check.servicegraph.endpoints
plugin: stellaops.doctor.servicegraph
severity: fail
tags: [servicegraph, services, endpoints, connectivity]
---
# Service Endpoints
## What It Checks
Collects configured service URLs for Authority, Scanner, Concelier, Excititor, Attestor, VexLens, and Gateway, appends `/health`, and probes each endpoint.
The check fails when any configured endpoint is unreachable or returns a non-success status. If no endpoints are configured, the check is skipped.
## Why It Matters
Stella Ops is a multi-service platform. A single broken internal endpoint can stall release orchestration, evidence generation, or advisory workflows even when the main web process is alive.
## Common Causes
- One or more `StellaOps:*Url` values are missing or point to the wrong internal service name
- Internal DNS or network routing is broken
- The target workload is up but not exposing `/health`
## How to Fix
### Docker Compose
Set the internal URLs explicitly:
```yaml
StellaOps__AuthorityUrl: http://authority-web:8080
StellaOps__ScannerUrl: http://scanner-web:8080
StellaOps__GatewayUrl: http://web:8080
```
Probe each endpoint from the Doctor container:
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://authority-web:8080/health
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://scanner-web:8080/health
```
### Bare Metal / systemd
Confirm the service-discovery or reverse-proxy names resolve from the Doctor host.
### Kubernetes / Helm
Use cluster-local service DNS names and confirm that each workload exposes its health endpoint on the same port the URL references.
## Verification
```bash
stella doctor --check check.servicegraph.endpoints
```
## Related Checks
- `check.servicegraph.backend` - the backend is usually the first endpoint operators validate
- `check.servicegraph.mq` - asynchronous workflows also depend on messaging, not only HTTP endpoints


@@ -0,0 +1,56 @@
---
checkId: check.servicegraph.mq
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, messaging, rabbitmq, connectivity]
---
# Message Queue Connectivity
## What It Checks
Reads `RabbitMQ:Host` or `Messaging:RabbitMQ:Host` plus an optional port, defaulting to `5672`, and attempts a TCP connection.
The check skips when RabbitMQ is not configured and fails on timeouts, DNS failures, or refused connections.
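A minimal sketch of the probe logic, assuming only the host/port semantics described above:
```python
import socket
from typing import Optional

def probe_amqp(host: Optional[str], port: int = 5672, timeout: float = 5.0) -> str:
    """Mirror the check's outcomes: 'skip' when no host is configured,
    'pass' on a successful TCP connect, 'fail' on timeout, DNS failure,
    or a refused connection."""
    if not host:
        return "skip"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "pass"
    except OSError:  # covers timeouts, gaierror (DNS), and refusals
        return "fail"

print(probe_amqp(None))  # skip: RabbitMQ not configured
```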
## Why It Matters
Release tasks, notifications, and deferred work often depend on a functioning message broker. A dead queue path turns healthy APIs into backlogged systems.
## Common Causes
- `RabbitMQ__Host` is unset or points to the wrong broker
- The broker container is down
- AMQP traffic is blocked between Doctor and RabbitMQ
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
RabbitMQ__Host: rabbitmq
RabbitMQ__Port: "5672"
```
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps rabbitmq
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 rabbitmq
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web sh -lc "nc -zv rabbitmq 5672"
```
### Bare Metal / systemd
```bash
nc -zv <rabbit-host> 5672
```
### Kubernetes / Helm
```bash
kubectl exec deploy/doctor-web -n <namespace> -- sh -lc "nc -zv <rabbit-service> 5672"
```
## Verification
```bash
stella doctor --check check.servicegraph.mq
```
## Related Checks
- `check.servicegraph.valkey` - cache and queue connectivity usually fail together when service networking is broken
- `check.servicegraph.timeouts` - aggressive timeouts can make a slow broker look unavailable


@@ -0,0 +1,48 @@
---
checkId: check.servicegraph.timeouts
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, timeouts, configuration]
---
# Service Timeouts
## What It Checks
Validates `HttpClient:Timeout`, `Database:CommandTimeout`, `Cache:OperationTimeout`, and `HealthChecks:Timeout`.
The check warns when the HTTP timeout is below `5s` or above `300s`, the database timeout is below `5s` or above `120s`, the cache timeout exceeds `30s`, or the health-check timeout exceeds the HTTP timeout.
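Those threshold rules can be written out as a small validator (a sketch of the documented thresholds, not the check's actual code):
```python
def validate_timeouts(http_s, db_s, cache_s, health_s):
    """Return the warnings the documented thresholds would produce.
    All values are in seconds."""
    warnings = []
    if not 5 <= http_s <= 300:
        warnings.append("HttpClient:Timeout outside 5-300s")
    if not 5 <= db_s <= 120:
        warnings.append("Database:CommandTimeout outside 5-120s")
    if cache_s > 30:
        warnings.append("Cache:OperationTimeout above 30s")
    if health_s > http_s:
        warnings.append("HealthChecks:Timeout exceeds HttpClient:Timeout")
    return warnings

print(validate_timeouts(100, 30, 5, 10))  # [] - these values produce no warnings
```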
## Why It Matters
Timeouts define how quickly failures surface and how long stuck work ties up resources. Poor values cause either premature failures or prolonged resource exhaustion.
## Common Causes
- Defaults from one environment were copied into another with very different latency
- Health-check timeout was set higher than the main request timeout
- Cache or database timeouts were raised to hide underlying performance problems
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
HttpClient__Timeout: "100"
Database__CommandTimeout: "30"
Cache__OperationTimeout: "5"
HealthChecks__Timeout: "10"
```
### Bare Metal / systemd
Tune timeouts from measured service latencies, not from guesswork. Raise values only after understanding the slower dependency.
### Kubernetes / Helm
Keep application timeouts lower than ingress, service-mesh, and job-level deadlines so failures happen in the component that owns the retry policy.
## Verification
```bash
stella doctor --check check.servicegraph.timeouts
```
## Related Checks
- `check.servicegraph.backend` - timeout misconfiguration often shows up as backend failures first
- `check.db.latency` - high database latency can force operators to revisit timeout values


@@ -0,0 +1,52 @@
---
checkId: check.servicegraph.valkey
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, valkey, redis, cache]
---
# Valkey/Redis Connectivity
## What It Checks
Reads `Valkey:ConnectionString`, `Redis:ConnectionString`, `ConnectionStrings:Valkey`, or `ConnectionStrings:Redis`, parses the host and port, and opens a TCP connection.
The check skips when no cache connection string is configured and fails when parsing fails or the target cannot be reached.
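The parsing and probing steps can be sketched like this (assuming StackExchange.Redis-style connection strings; the real parser may accept more variants):
```python
import socket

def parse_cache_target(conn_str):
    """Extract (host, port) from a connection string such as
    'valkey:6379,password=secret'. Raises ValueError when no endpoint
    can be found - the condition the check reports as a parse failure."""
    first = conn_str.split(",")[0].strip()
    if not first:
        raise ValueError("no endpoint in connection string")
    host, _, port = first.partition(":")
    return host, int(port) if port else 6379  # default Valkey/Redis port

def probe_cache(conn_str, timeout=5.0):
    """'skip' when unconfigured, 'pass' on TCP connect, 'fail' otherwise."""
    if not conn_str:
        return "skip"
    try:
        host, port = parse_cache_target(conn_str)
        with socket.create_connection((host, port), timeout=timeout):
            return "pass"
    except (ValueError, OSError):
        return "fail"

print(parse_cache_target("valkey:6379,password=secret"))  # ('valkey', 6379)
```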
## Why It Matters
Cache unavailability affects queue coordination, state caching, and latency-sensitive platform features. A malformed connection string is also an early warning that the environment is not wired correctly.
## Common Causes
- The cache connection string is missing, malformed, or still points to a previous environment
- The Valkey/Redis service is not running
- Container networking or DNS is broken
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Valkey__ConnectionString: valkey:6379,password=${STELLAOPS_VALKEY_PASSWORD}
```
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps valkey
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web sh -lc "nc -zv valkey 6379"
```
### Bare Metal / systemd
```bash
redis-cli -h <valkey-host> -p 6379 ping
```
### Kubernetes / Helm
Use a cluster-local service name in the connection string and verify that the port matches the one exposed by the StatefulSet or Service.
## Verification
```bash
stella doctor --check check.servicegraph.valkey
```
## Related Checks
- `check.servicegraph.mq` - both checks validate internal service-network connectivity
- `check.servicegraph.endpoints` - broad service discovery issues usually affect cache endpoints too


@@ -0,0 +1,56 @@
---
checkId: check.verification.artifact.pull
plugin: stellaops.doctor.verification
severity: fail
tags: [verification, artifact, registry, supply-chain]
---
# Test Artifact Pull
## What It Checks
Requires the verification plugin to be enabled and a test artifact to be configured with either `Doctor:Plugins:Verification:TestArtifact:Reference` or `Doctor:Plugins:Verification:TestArtifact:OfflineBundlePath`.
For offline mode it checks the bundle file exists. For online mode it performs a registry `HEAD` request against the OCI manifest and optionally compares the returned digest to the expected digest.
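The online-mode probe boils down to a manifest `HEAD` request, roughly as below (a sketch: real registries typically require a bearer-token exchange first, which is omitted here):
```python
from urllib.request import Request, urlopen

ACCEPT_MANIFEST = ", ".join([
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.v2+json",
])

def manifest_url(registry: str, repository: str, reference: str) -> str:
    """Build the OCI distribution manifest URL for a tag or digest reference."""
    return f"https://{registry}/v2/{repository}/manifests/{reference}"

def head_manifest(registry, repository, reference, token=None, timeout=10.0):
    """HEAD the manifest and return the Docker-Content-Digest header so the
    caller can compare it against the expected digest."""
    req = Request(manifest_url(registry, repository, reference),
                  method="HEAD", headers={"Accept": ACCEPT_MANIFEST})
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("Docker-Content-Digest")
```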
## Why It Matters
The rest of the verification pipeline is meaningless if Doctor cannot retrieve the artifact it is supposed to validate.
## Common Causes
- No test artifact reference or offline bundle path is configured
- Registry credentials are missing or do not allow manifest access
- The artifact digest or tag points to content that no longer exists
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Doctor__Plugins__Verification__Enabled: "true"
Doctor__Plugins__Verification__TestArtifact__Reference: ghcr.io/example/app@sha256:<digest>
```
For air-gapped mode:
```yaml
Doctor__Plugins__Verification__TestArtifact__OfflineBundlePath: /var/lib/stella/verification/offline-bundle.json
```
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web crane manifest ghcr.io/example/app@sha256:<digest>
```
### Bare Metal / systemd
Use an immutable digest reference instead of a mutable tag whenever possible.
### Kubernetes / Helm
Mount registry credentials into the Doctor workload; if the cluster is disconnected, mount the offline bundle itself at the configured path.
## Verification
```bash
stella doctor --check check.verification.artifact.pull
```
## Related Checks
- `check.verification.signature` - signature validation depends on the same artifact input
- `check.integration.oci.pull` - registry authorization issues often show up there too


@@ -0,0 +1,50 @@
---
checkId: check.verification.policy.engine
plugin: stellaops.doctor.verification
severity: fail
tags: [verification, policy, vex, compliance]
---
# Policy Engine Evaluation
## What It Checks
Requires the verification plugin plus a configured test artifact. In offline mode it looks for policy results inside the exported bundle. In online mode it validates `Policy:Engine:Enabled`, a policy reference, and `Policy:VexAware`.
The check fails when the policy engine is disabled, warns when no policy reference is configured or when VEX-aware evaluation is off, and passes when the prerequisites are present.
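The outcome mapping for online mode is essentially the following (a sketch; the wording of the real findings will differ):
```python
def policy_engine_outcome(engine_enabled: bool, policy_ref: str, vex_aware: bool) -> str:
    """Map the online-mode prerequisites to the documented outcomes:
    fail when the engine is off, warn when a policy ref or VEX-aware
    evaluation is missing, pass otherwise."""
    if not engine_enabled:
        return "fail"
    if not policy_ref or not vex_aware:
        return "warn"
    return "pass"

print(policy_engine_outcome(True, "policy://default/release-gate", True))  # pass
```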
## Why It Matters
Release verification is only trustworthy if the same policy engine and VEX rules used in production can be exercised by Doctor.
## Common Causes
- `Policy__Engine__Enabled` is false
- No default or test policy reference is configured
- Policy rules were not updated to account for VEX justifications
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Policy__Engine__Enabled: "true"
Policy__DefaultPolicyRef: policy://default/release-gate
Policy__VexAware: "true"
Doctor__Plugins__Verification__PolicyTest__PolicyRef: policy://default/release-gate
```
If you use offline verification, export the bundle with policy data included before copying it into the air-gapped environment.
### Bare Metal / systemd
Keep the Doctor policy reference aligned with the policy engine configuration used by release orchestration.
### Kubernetes / Helm
Store the policy ref in ConfigMaps and enforce the same value across the policy engine and Doctor service.
## Verification
```bash
stella doctor --check check.verification.policy.engine
```
## Related Checks
- `check.verification.vex.validation` - VEX-aware policy only helps if VEX collection works
- `check.verification.sbom.validation` - policy evaluation usually consumes SBOM and vulnerability evidence


@@ -0,0 +1,52 @@
---
checkId: check.verification.sbom.validation
plugin: stellaops.doctor.verification
severity: fail
tags: [verification, sbom, cyclonedx, spdx]
---
# SBOM Validation
## What It Checks
Requires the verification plugin plus a test artifact. In offline mode it looks for CycloneDX or SPDX JSON inside the bundle. In online mode it checks whether `Scanner:SbomGeneration:Enabled` or `Attestor:SbomAttestation:Enabled` is turned on.
The check warns when SBOM generation and attestation are both disabled, and fails when the offline bundle is missing or contains no recognizable SBOM.
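Offline-mode detection can be sketched with the discriminating fields of each format (`bomFormat` for CycloneDX, `spdxVersion` for SPDX, per the public specs):
```python
import json

def detect_sbom_format(document: str):
    """Classify a JSON document as 'cyclonedx', 'spdx', or None when no
    recognizable SBOM is present - the failing offline-bundle condition."""
    try:
        data = json.loads(document)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if data.get("bomFormat") == "CycloneDX":
        return "cyclonedx"
    if "spdxVersion" in data:
        return "spdx"
    return None

print(detect_sbom_format('{"bomFormat": "CycloneDX", "specVersion": "1.5"}'))  # cyclonedx
```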
## Why It Matters
SBOMs are the input for downstream vulnerability analysis, policy decisions, and customer evidence exports. If SBOM generation is off, release evidence is incomplete.
## Common Causes
- The build pipeline is not producing SBOMs
- SBOM attestation is disabled even though verification expects it
- Offline bundles were exported without `--include-sbom`
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Scanner__SbomGeneration__Enabled: "true"
Attestor__SbomAttestation__Enabled: "true"
```
For offline mode:
```bash
stella verification bundle export --include-sbom --output /var/lib/stella/verification/offline-bundle.json
```
### Bare Metal / systemd
Enable SBOM generation in the scanner and keep artifact attachments immutable once published.
### Kubernetes / Helm
Mount the same scanner and attestor config into Doctor that the production verification pipeline uses.
## Verification
```bash
stella doctor --check check.verification.sbom.validation
```
## Related Checks
- `check.verification.artifact.pull` - the artifact must be reachable before attached SBOMs can be validated
- `check.verification.policy.engine` - policy rules commonly consume SBOM-derived vulnerability data


@@ -0,0 +1,56 @@
---
checkId: check.verification.signature
plugin: stellaops.doctor.verification
severity: fail
tags: [verification, signatures, dsse, rekor]
---
# Signature Verification
## What It Checks
Requires the verification plugin plus a test artifact. In offline mode it looks for DSSE-style signature material in the bundle. In online mode it checks `Sigstore:Enabled` and verifies the Rekor log endpoint is reachable.
The check reports info when Sigstore is disabled, and fails when the offline bundle is missing or Rekor cannot be reached.
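The offline-mode shape test can be sketched against the DSSE envelope fields (`payload`, `payloadType`, `signatures`); the real check may validate more than this:
```python
import json

def looks_like_dsse(document: str) -> bool:
    """True when the JSON has the shape of a DSSE envelope: string payload
    and payloadType plus a non-empty signatures array."""
    try:
        data = json.loads(document)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return (
        isinstance(data.get("payload"), str)
        and isinstance(data.get("payloadType"), str)
        and isinstance(data.get("signatures"), list)
        and len(data["signatures"]) > 0
    )
```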
## Why It Matters
Signature verification is the minimum control that proves the artifact under review was signed by the expected identity in the supply chain.
## Common Causes
- `Sigstore__Enabled` is false
- Rekor URL is unreachable from the Doctor workload
- Offline bundles were exported without signatures
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Sigstore__Enabled: "true"
Sigstore__RekorUrl: https://rekor.sigstore.dev
```
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS https://rekor.sigstore.dev/api/v1/log
```
For offline verification:
```bash
stella verification bundle export --include-signatures --output /var/lib/stella/verification/offline-bundle.json
```
### Bare Metal / systemd
Ensure the Doctor host trusts the CA chain used by the Rekor endpoint, or point the configuration at the approved internal Rekor deployment.
### Kubernetes / Helm
Prefer an internal Rekor service URL in disconnected or regulated clusters.
## Verification
```bash
stella doctor --check check.verification.signature
```
## Related Checks
- `check.attestation.rekor.connectivity` - validates the transparency log path more directly
- `check.verification.artifact.pull` - signature checks need a reachable artifact reference

View File

@@ -0,0 +1,52 @@
---
checkId: check.verification.vex.validation
plugin: stellaops.doctor.verification
severity: fail
tags: [verification, vex, csaf, openvex]
---
# VEX Validation
## What It Checks
Requires the verification plugin plus a test artifact. In offline mode it looks for OpenVEX, CSAF VEX, or CycloneDX VEX content inside the bundle. In online mode it validates `VexHub:Collection:Enabled` and at least one configured VEX feed URL.
The check reports info when VEX collection is disabled, warns when feeds are missing, and fails only for unusable offline bundle inputs.
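Offline-mode format detection can be sketched with the discriminating fields of each spec (a simplification of whatever the real check inspects):
```python
import json

def detect_vex_format(document: str):
    """Classify VEX content as 'openvex', 'csaf', or 'cyclonedx' using the
    top-level fields each format defines: an openvex @context, a CSAF
    document.category of csaf_vex, or a CycloneDX BOM with vulnerabilities."""
    try:
        data = json.loads(document)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    ctx = data.get("@context")
    if isinstance(ctx, str) and "openvex" in ctx:
        return "openvex"
    doc = data.get("document")
    if isinstance(doc, dict) and doc.get("category") == "csaf_vex":
        return "csaf"
    if data.get("bomFormat") == "CycloneDX" and "vulnerabilities" in data:
        return "cyclonedx"
    return None
```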
## Why It Matters
VEX data is what allows policy to distinguish exploitable findings from known-not-affected cases. Without it, release gates become overly noisy or overly permissive.
## Common Causes
- `VexHub__Collection__Enabled` is false
- Vendor or internal VEX feeds were never configured
- Offline bundles were exported without `--include-vex`
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
VexHub__Collection__Enabled: "true"
VexHub__Feeds__0__Url: https://vendor.example/vex.json
```
For offline mode:
```bash
stella verification bundle export --include-vex --output /var/lib/stella/verification/offline-bundle.json
```
### Bare Metal / systemd
Keep VEX feeds in a controlled mirror if the environment cannot reach upstream vendors directly.
### Kubernetes / Helm
Mount VEX feed configuration from the same source used by the running VexHub deployment.
## Verification
```bash
stella doctor --check check.verification.vex.validation
```
## Related Checks
- `check.verification.policy.engine` - VEX-aware policy is only as good as the VEX data it receives
- `check.verification.sbom.validation` - VEX statements refer to components identified in the SBOM


@@ -1,181 +0,0 @@
# Sprint 20260326_001 — Doctor Health Checks Documentation
## Topic & Scope
- Document every Doctor health check (99 checks across 16 plugins) with precise, actionable remediation.
- Each check must have: what it tests, why it matters, exact fix steps, Docker compose specifics, and verification.
- Fix false-positive checks that fail on default Docker compose installations.
- Working directory: `docs/modules/doctor/`, `src/Doctor/__Plugins/`
- Expected evidence: docs, improved check messages, tests.
## Dependencies & Concurrency
- No upstream dependencies. Can be parallelized by plugin.
- Depends on the 4 check code fixes already applied (RequiredSettings, EnvironmentVariables, SecretsConfiguration, DockerSocket).
## Documentation Prerequisites
- `docs/modules/doctor/architecture.md` — existing Doctor architecture overview
- `docs/modules/doctor/registry-checks.md` — existing check registry reference
- `devops/compose/docker-compose.stella-ops.yml` — the reference deployment
## Delivery Tracker
### DOC-001 - Create check reference index
Status: TODO
Dependency: none
Owners: Documentation author
Task description:
- Create `docs/modules/doctor/checks/README.md` with a master table of all 99 checks
- Columns: Check ID, Plugin, Category, Severity, Summary, Docker Compose Status (Pass/Warn/Fail/N/A)
- Group by plugin (Core, Security, Docker, Agent, Attestor, Auth, etc.)
- Include quick-reference severity legend
Completion criteria:
- [ ] All 99 checks listed with correct metadata
- [ ] Docker Compose Status column filled from actual test run
### DOC-002 - Core Plugin checks documentation (9 checks)
Status: TODO
Dependency: DOC-001
Owners: Documentation author
Task description:
- Create `docs/modules/doctor/checks/core.md`
- Document each check:
- **check.core.config.required**: What settings are checked, key variants (colon vs `__`), compose env var names, how to add missing settings
- **check.core.env.variables**: Which env vars are checked, why `ASPNETCORE_ENVIRONMENT` may not be set in compose, when this is OK
- **check.core.health.endpoint**: Health endpoint configuration
- **check.core.memory**: Memory threshold configuration
- **check.core.startup.time**: Expected startup time ranges
- Each remaining core check
- For each check: Symptom → Root Cause → Fix → Verify
Completion criteria:
- [ ] Each check has: description, what it tests, severity, fix steps, Docker compose notes, verification command
### DOC-003 - Security Plugin checks documentation
Status: TODO
Dependency: DOC-001
Owners: Documentation author
Task description:
- Create `docs/modules/doctor/checks/security.md`
- Document: check.security.secrets, check.security.tls, check.security.cors, check.security.headers
- Include: which keys are considered "secrets" vs DSNs, vault provider configuration, development vs production guidance
Completion criteria:
- [ ] Each check documented with fix steps and Docker compose notes
### DOC-004 - Docker Plugin checks documentation
Status: TODO
Dependency: DOC-001
Owners: Documentation author
Task description:
- Create `docs/modules/doctor/checks/docker.md`
- Document: check.docker.socket, check.docker.daemon, check.docker.images
- Include: container-vs-host detection, socket mount instructions, Windows named pipe notes
Completion criteria:
- [ ] Each check documented with container-aware behavior explained
### DOC-005 - Agent Plugin checks documentation (11 checks)
Status: TODO
Dependency: DOC-001
Owners: Documentation author
Task description:
- Create `docs/modules/doctor/checks/agent.md`
- Document all 11 agent checks: capacity, certificates, cluster health/quorum, heartbeat, resources, versions, stale detection, task failure rate, task backlog
Completion criteria:
- [ ] Each check documented with thresholds, configuration options, fix steps
### DOC-006 - Attestor Plugin checks documentation (6 checks)
Status: TODO
Dependency: DOC-001
Owners: Documentation author
Task description:
- Create `docs/modules/doctor/checks/attestor.md`
- Document: cosign key material, clock skew, Rekor connectivity/verification, signing key expiration, transparency log consistency
Completion criteria:
- [ ] Each check documented including air-gap/offline scenarios
### DOC-007 - Auth Plugin checks documentation (4 checks)
Status: TODO
Dependency: DOC-001
Owners: Documentation author
Task description:
- Create `docs/modules/doctor/checks/auth.md`
- Document: auth configuration, OIDC provider connectivity, signing key health, token service health
Completion criteria:
- [ ] Each check documented with OIDC troubleshooting steps
### DOC-008 - Remaining plugins documentation
Status: TODO
Dependency: DOC-001
Owners: Documentation author
Task description:
- Create one doc per remaining plugin:
- `docs/modules/doctor/checks/binary-analysis.md` (6 checks)
- `docs/modules/doctor/checks/compliance.md` (7 checks)
- `docs/modules/doctor/checks/crypto.md` (6 checks)
- `docs/modules/doctor/checks/environment.md` (6 checks)
- `docs/modules/doctor/checks/evidence-locker.md` (4 checks)
- `docs/modules/doctor/checks/observability.md` (4 checks)
- `docs/modules/doctor/checks/notify.md` (9 checks)
- `docs/modules/doctor/checks/operations.md` (3 checks)
- `docs/modules/doctor/checks/policy.md` (1 check)
- `docs/modules/doctor/checks/postgres.md` (3 checks)
- `docs/modules/doctor/checks/release.md` (6 checks)
- `docs/modules/doctor/checks/scanner.md` (7 checks)
- `docs/modules/doctor/checks/storage.md` (3 checks)
- `docs/modules/doctor/checks/timestamping.md` (9 checks)
- `docs/modules/doctor/checks/vex.md` (3 checks)
Completion criteria:
- [ ] Every check across all 16 plugins documented
### DOC-009 - Improve check remediation messages in code
Status: TODO
Dependency: DOC-002 through DOC-008
Owners: Developer
Task description:
- For each check, update the `WithRemediation()` steps to include:
- Exact commands (not vague "configure X")
- Docker compose env var names (using `__` separator)
- File paths relative to the compose directory
- Link to the documentation page (e.g., "See docs/modules/doctor/checks/core.md")
- Update `WithCauses()` to be specific, not generic
Completion criteria:
- [ ] All 99 checks have precise, copy-pasteable remediation steps
- [ ] No check reports a generic "configure X" without specifying how
- [ ] Docker compose installations pass all checks that should pass
### DOC-010 - Docker compose default pass baseline
Status: TODO
Dependency: DOC-009
Owners: QA / Test Automation
Task description:
- Run all 99 Doctor checks against a fresh `docker compose up` installation
- Document which checks MUST pass, which are expected warnings, which are N/A
- Create `docs/modules/doctor/compose-baseline.md` with the expected results
- Add any remaining code fixes for false positives
Completion criteria:
- [ ] Baseline document created
- [ ] Zero false-positive FAILs on fresh Docker compose install
- [ ] All WARN checks documented as expected or fixed
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-26 | Sprint created. 4 code fixes applied (RequiredSettings, EnvironmentVariables, SecretsConfiguration, DockerSocket). | Planning |
## Decisions & Risks
- Risk: 99 checks is a large documentation surface. Parallelize by plugin.
- Decision: Each plugin gets its own doc file for maintainability.
- Decision: Remediation messages in code should link to docs, not duplicate full instructions.
## Next Checkpoints
- DOC-001 (index): 1 day
- DOC-002 through DOC-008 (all plugin docs): 3-5 days
- DOC-009 (code remediation improvements): 2 days
- DOC-010 (baseline): 1 day


@@ -0,0 +1,188 @@
# Doctor Runtime Check Index
## Scope
- Runtime catalog source: `GET /api/v1/doctor/checks` on 2026-03-31.
- Docker compose baseline source: doctor run `dr_20260331_195122_99ff09`, captured from the locally running default stack.
- Canonical remediation content lives in `docs/doctor/articles/**`; this index maps the live runtime catalog to those articles.
## Runtime Summary
| Plugin | Checks |
| --- | ---: |
| `stellaops.doctor.attestation` | 3 |
| `stellaops.doctor.binaryanalysis` | 6 |
| `stellaops.doctor.compliance` | 7 |
| `stellaops.doctor.core` | 9 |
| `stellaops.doctor.database` | 8 |
| `stellaops.doctor.docker` | 5 |
| `stellaops.doctor.environment` | 6 |
| `stellaops.doctor.integration` | 16 |
| `stellaops.doctor.observability` | 6 |
| `stellaops.doctor.release` | 6 |
| `stellaops.doctor.scanner` | 7 |
| `stellaops.doctor.security` | 11 |
| `stellaops.doctor.servicegraph` | 6 |
| `stellaops.doctor.verification` | 5 |
## Baseline Legend
- `pass`: expected healthy result in the captured compose baseline.
- `info`: informational only; not a release blocker in the captured baseline.
- `warn`: action needed or recommended; not a hard failure in the captured baseline.
- `fail`: baseline failure observed in the captured runtime.
- `skip`: not applicable in the captured runtime context.
## `stellaops.doctor.attestation`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.attestation.clock.skew` | `warn` | `warn` | [article](../../../doctor/articles/attestor/clock-skew.md) |
| `check.attestation.cosign.keymaterial` | `fail` | `skip` | [article](../../../doctor/articles/attestor/cosign-keymaterial.md) |
| `check.attestation.rekor.connectivity` | `fail` | `skip` | [article](../../../doctor/articles/attestor/rekor-connectivity.md) |
## `stellaops.doctor.binaryanalysis`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.binaryanalysis.buildinfo.cache` | `warn` | `warn` | [article](../../../doctor/articles/binary-analysis/buildinfo-cache.md) |
| `check.binaryanalysis.corpus.kpi.baseline` | `warn` | `warn` | [article](../../../doctor/articles/binary-analysis/kpi-baseline-exists.md) |
| `check.binaryanalysis.corpus.mirror.freshness` | `warn` | `warn` | [article](../../../doctor/articles/binary-analysis/corpus-mirror-freshness.md) |
| `check.binaryanalysis.ddeb.enabled` | `warn` | `warn` | [article](../../../doctor/articles/binary-analysis/ddeb-repo-enabled.md) |
| `check.binaryanalysis.debuginfod.available` | `warn` | `info` | [article](../../../doctor/articles/binary-analysis/debuginfod-availability.md) |
| `check.binaryanalysis.symbol.recovery.fallback` | `warn` | `info` | [article](../../../doctor/articles/binary-analysis/symbol-recovery-fallback.md) |
## `stellaops.doctor.compliance`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.compliance.attestation-signing` | `fail` | `skip` | [article](../../../doctor/articles/compliance/attestation-signing.md) |
| `check.compliance.audit-readiness` | `warn` | `skip` | [article](../../../doctor/articles/compliance/audit-readiness.md) |
| `check.compliance.evidence-integrity` | `fail` | `skip` | [article](../../../doctor/articles/compliance/evidence-integrity.md) |
| `check.compliance.evidence-rate` | `fail` | `skip` | [article](../../../doctor/articles/compliance/evidence-rate.md) |
| `check.compliance.export-readiness` | `warn` | `skip` | [article](../../../doctor/articles/compliance/export-readiness.md) |
| `check.compliance.framework` | `warn` | `skip` | [article](../../../doctor/articles/compliance/framework.md) |
| `check.compliance.provenance-completeness` | `fail` | `skip` | [article](../../../doctor/articles/compliance/provenance-completeness.md) |
## `stellaops.doctor.core`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.core.auth.config` | `warn` | `skip` | [article](../../../doctor/articles/core/auth-config.md) |
| `check.core.config.loaded` | `fail` | `pass` | [article](../../../doctor/articles/core/config-loaded.md) |
| `check.core.config.required` | `fail` | `fail` | [article](../../../doctor/articles/core/config-required.md) |
| `check.core.crypto.available` | `fail` | `pass` | [article](../../../doctor/articles/core/crypto-available.md) |
| `check.core.env.diskspace` | `fail` | `pass` | [article](../../../doctor/articles/core/env-diskspace.md) |
| `check.core.env.memory` | `warn` | `pass` | [article](../../../doctor/articles/core/env-memory.md) |
| `check.core.env.variables` | `warn` | `warn` | [article](../../../doctor/articles/core/env-variables.md) |
| `check.core.services.dependencies` | `fail` | `pass` | [article](../../../doctor/articles/core/services-dependencies.md) |
| `check.core.services.health` | `fail` | `skip` | [article](../../../doctor/articles/core/services-health.md) |
## `stellaops.doctor.database`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.db.connection` | `fail` | `skip` | [article](../../../doctor/articles/postgres/db-connection.md) |
| `check.db.latency` | `fail` | `skip` | [article](../../../doctor/articles/postgres/db-latency.md) |
| `check.db.migrations.failed` | `fail` | `skip` | [article](../../../doctor/articles/postgres/db-migrations-failed.md) |
| `check.db.migrations.pending` | `warn` | `skip` | [article](../../../doctor/articles/postgres/db-migrations-pending.md) |
| `check.db.permissions` | `fail` | `skip` | [article](../../../doctor/articles/postgres/db-permissions.md) |
| `check.db.pool.health` | `fail` | `skip` | [article](../../../doctor/articles/postgres/db-pool-health.md) |
| `check.db.pool.size` | `warn` | `skip` | [article](../../../doctor/articles/postgres/db-pool-size.md) |
| `check.db.schema.version` | `fail` | `skip` | [article](../../../doctor/articles/postgres/db-schema-version.md) |
## `stellaops.doctor.docker`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.docker.apiversion` | `warn` | `skip` | [article](../../../doctor/articles/docker/apiversion.md) |
| `check.docker.daemon` | `fail` | `fail` | [article](../../../doctor/articles/docker/daemon.md) |
| `check.docker.network` | `warn` | `skip` | [article](../../../doctor/articles/docker/network.md) |
| `check.docker.socket` | `fail` | `fail` | [article](../../../doctor/articles/docker/socket.md) |
| `check.docker.storage` | `warn` | `skip` | [article](../../../doctor/articles/docker/storage.md) |
## `stellaops.doctor.environment`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.environment.capacity` | `warn` | `skip` | [article](../../../doctor/articles/environment/environment-capacity.md) |
| `check.environment.connectivity` | `warn` | `skip` | [article](../../../doctor/articles/environment/environment-connectivity.md) |
| `check.environment.deployments` | `warn` | `skip` | [article](../../../doctor/articles/environment/environment-deployment-health.md) |
| `check.environment.drift` | `warn` | `skip` | [article](../../../doctor/articles/environment/environment-drift.md) |
| `check.environment.network.policy` | `warn` | `skip` | [article](../../../doctor/articles/environment/environment-network-policy.md) |
| `check.environment.secrets` | `warn` | `skip` | [article](../../../doctor/articles/environment/environment-secret-health.md) |
## `stellaops.doctor.integration`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.integration.ci.system` | `warn` | `skip` | [article](../../../doctor/articles/integration/ci-system-connectivity.md) |
| `check.integration.git` | `warn` | `skip` | [article](../../../doctor/articles/integration/git-provider-api.md) |
| `check.integration.ldap` | `warn` | `skip` | [article](../../../doctor/articles/integration/ldap-connectivity.md) |
| `check.integration.oci.capabilities` | `info` | `skip` | [article](../../../doctor/articles/integration/registry-capability-probe.md) |
| `check.integration.oci.credentials` | `fail` | `skip` | [article](../../../doctor/articles/integration/registry-credentials.md) |
| `check.integration.oci.pull` | `fail` | `skip` | [article](../../../doctor/articles/integration/registry-pull-authorization.md) |
| `check.integration.oci.push` | `fail` | `skip` | [article](../../../doctor/articles/integration/registry-push-authorization.md) |
| `check.integration.oci.referrers` | `warn` | `skip` | [article](../../../doctor/articles/integration/registry-referrers-api.md) |
| `check.integration.oci.registry` | `warn` | `skip` | [article](../../../doctor/articles/integration/oci-registry-connectivity.md) |
| `check.integration.oidc` | `warn` | `skip` | [article](../../../doctor/articles/integration/oidc-provider.md) |
| `check.integration.s3.storage` | `warn` | `skip` | [article](../../../doctor/articles/integration/object-storage.md) |
| `check.integration.secrets.manager` | `fail` | `skip` | [article](../../../doctor/articles/integration/secrets-manager-connectivity.md) |
| `check.integration.slack` | `info` | `skip` | [article](../../../doctor/articles/integration/slack-webhook.md) |
| `check.integration.smtp` | `warn` | `skip` | [article](../../../doctor/articles/integration/smtp-connectivity.md) |
| `check.integration.teams` | `info` | `skip` | [article](../../../doctor/articles/integration/teams-webhook.md) |
| `check.integration.webhooks` | `warn` | `skip` | [article](../../../doctor/articles/integration/webhook-health.md) |
## `stellaops.doctor.observability`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.observability.alerting` | `info` | `info` | [article](../../../doctor/articles/observability/observability-alerting.md) |
| `check.observability.healthchecks` | `warn` | `pass` | [article](../../../doctor/articles/observability/observability-healthchecks.md) |
| `check.observability.logging` | `warn` | `warn` | [article](../../../doctor/articles/observability/observability-logging.md) |
| `check.observability.metrics` | `warn` | `info` | [article](../../../doctor/articles/observability/observability-metrics.md) |
| `check.observability.otel` | `warn` | `info` | [article](../../../doctor/articles/observability/observability-otel.md) |
| `check.observability.tracing` | `warn` | `pass` | [article](../../../doctor/articles/observability/observability-tracing.md) |
## `stellaops.doctor.release`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.release.active` | `warn` | `skip` | [article](../../../doctor/articles/release/active.md) |
| `check.release.configuration` | `warn` | `skip` | [article](../../../doctor/articles/release/configuration.md) |
| `check.release.environment.readiness` | `warn` | `skip` | [article](../../../doctor/articles/release/environment-readiness.md) |
| `check.release.promotion.gates` | `warn` | `skip` | [article](../../../doctor/articles/release/promotion-gates.md) |
| `check.release.rollback.readiness` | `warn` | `skip` | [article](../../../doctor/articles/release/rollback-readiness.md) |
| `check.release.schedule` | `info` | `skip` | [article](../../../doctor/articles/release/schedule.md) |
## `stellaops.doctor.scanner`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.scanner.queue` | `warn` | `skip` | [article](../../../doctor/articles/scanner/queue.md) |
| `check.scanner.reachability` | `warn` | `skip` | [article](../../../doctor/articles/scanner/reachability.md) |
| `check.scanner.resources` | `warn` | `skip` | [article](../../../doctor/articles/scanner/resources.md) |
| `check.scanner.sbom` | `warn` | `skip` | [article](../../../doctor/articles/scanner/sbom.md) |
| `check.scanner.slice.cache` | `warn` | `skip` | [article](../../../doctor/articles/scanner/slice-cache.md) |
| `check.scanner.vuln` | `warn` | `skip` | [article](../../../doctor/articles/scanner/vuln.md) |
| `check.scanner.witness.graph` | `warn` | `skip` | [article](../../../doctor/articles/scanner/witness-graph.md) |
## `stellaops.doctor.security`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.security.apikey` | `warn` | `skip` | [article](../../../doctor/articles/security/apikey.md) |
| `check.security.audit.logging` | `warn` | `warn` | [article](../../../doctor/articles/security/audit-logging.md) |
| `check.security.cors` | `warn` | `warn` | [article](../../../doctor/articles/security/cors.md) |
| `check.security.encryption` | `warn` | `skip` | [article](../../../doctor/articles/security/encryption.md) |
| `check.security.evidence.integrity` | `fail` | `skip` | [article](../../../doctor/articles/security/evidence-integrity.md) |
| `check.security.headers` | `warn` | `warn` | [article](../../../doctor/articles/security/headers.md) |
| `check.security.jwt.config` | `fail` | `skip` | [article](../../../doctor/articles/security/jwt-config.md) |
| `check.security.password.policy` | `warn` | `skip` | [article](../../../doctor/articles/security/password-policy.md) |
| `check.security.ratelimit` | `warn` | `info` | [article](../../../doctor/articles/security/ratelimit.md) |
| `check.security.secrets` | `fail` | `fail` | [article](../../../doctor/articles/security/secrets.md) |
| `check.security.tls.certificate` | `fail` | `pass` | [article](../../../doctor/articles/security/tls-certificate.md) |
## `stellaops.doctor.servicegraph`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.servicegraph.backend` | `fail` | `skip` | [article](../../../doctor/articles/servicegraph/servicegraph-backend.md) |
| `check.servicegraph.circuitbreaker` | `warn` | `info` | [article](../../../doctor/articles/servicegraph/servicegraph-circuitbreaker.md) |
| `check.servicegraph.endpoints` | `fail` | `skip` | [article](../../../doctor/articles/servicegraph/servicegraph-endpoints.md) |
| `check.servicegraph.mq` | `warn` | `skip` | [article](../../../doctor/articles/servicegraph/servicegraph-mq.md) |
| `check.servicegraph.timeouts` | `warn` | `pass` | [article](../../../doctor/articles/servicegraph/servicegraph-timeouts.md) |
| `check.servicegraph.valkey` | `warn` | `pass` | [article](../../../doctor/articles/servicegraph/servicegraph-valkey.md) |
## `stellaops.doctor.verification`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.verification.artifact.pull` | `fail` | `skip` | [article](../../../doctor/articles/verification/verification-artifact-pull.md) |
| `check.verification.policy.engine` | `fail` | `skip` | [article](../../../doctor/articles/verification/verification-policy-engine.md) |
| `check.verification.sbom.validation` | `fail` | `skip` | [article](../../../doctor/articles/verification/verification-sbom-validation.md) |
| `check.verification.signature` | `fail` | `skip` | [article](../../../doctor/articles/verification/verification-signature.md) |
| `check.verification.vex.validation` | `fail` | `skip` | [article](../../../doctor/articles/verification/verification-vex-validation.md) |

# Doctor Compose Baseline
## Evidence
- Runtime source: local default stack reachable at `http://127.1.0.26/api/v1/doctor`.
- Catalog snapshot: `GET /api/v1/doctor/checks` on 2026-03-31.
- Baseline run: `dr_20260331_195122_99ff09`.
- Duration: `12103ms`.
## Baseline Summary
| Status | Count |
| --- | ---: |
| `pass` | 10 |
| `info` | 7 |
| `warn` | 10 |
| `fail` | 4 |
| `skip` | 70 |
| `total` | 101 |
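As a sanity check, the counts in the table above can be re-totalled; a minimal shell sketch using the captured values:

```bash
# Status counts from baseline run dr_20260331_195122_99ff09 (see table above).
pass=10 info=7 warn=10 fail=4 skip=70

total=$((pass + info + warn + fail + skip))
non_skip=$((total - skip))

echo "total checks: $total"              # matches the 101 checks in the catalog
echo "checks that actually ran: $non_skip"
```

The non-skip count (31) is the portion of the catalog the default local stack can exercise without extra wiring.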
## Capture Notes
- This baseline was captured from the locally running default compose stack, not from a second fresh stack.
- A parallel `docker compose up` was not used because `devops/compose/docker-compose.stella-ops.yml` hardcodes container names, which would conflict with the already running environment.
- The runtime catalog currently exposes `101` checks across `14` plugins, superseding the stale sprint text that still referenced `99` checks across `16` plugins.
## Observed Failures
| Check ID | Diagnosis | Notes |
| --- | --- | --- |
| `check.core.config.required` | Missing 2 required setting(s) | Missing `ConnectionStrings:DefaultConnection` and `Logging:LogLevel:Default` in the captured runtime. |
| `check.docker.daemon` | Cannot connect to Docker daemon: Connection failed | Doctor ran without a reachable Docker daemon socket. |
| `check.docker.socket` | 1 Docker socket issue(s) | `/var/run/docker.sock` was absent in the captured container context. |
| `check.security.secrets` | 2 secrets management issue(s) found | The runtime reported no secrets provider plus a potential plain-text connection string. |
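If the `check.core.config.required` failure is caused by the two missing settings, they can be supplied through the compose environment. A sketch only, assuming the standard double-underscore mapping from `ConnectionStrings:DefaultConnection` and `Logging:LogLevel:Default`; the connection string value is a placeholder, not the real one:

```yaml
services:
  doctor-web:
    environment:
      # Placeholder connection string; substitute the real backing database.
      ConnectionStrings__DefaultConnection: "Host=postgres;Database=stella;Username=stella;Password=change-me"
      Logging__LogLevel__Default: Information
```

Clearing these two settings should also remove the plain-text connection string finding from `check.security.secrets` once the value is moved into the platform secrets provider.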
## Observed Warnings
| Check ID | Diagnosis |
| --- | --- |
| `check.attestation.clock.skew` | System clock is off by 5.5 seconds (threshold: 5s) |
| `check.binaryanalysis.buildinfo.cache` | Debian buildinfo services are reachable but cache directory does not exist |
| `check.binaryanalysis.corpus.kpi.baseline` | KPI baseline directory does not exist: `/var/lib/stella/baselines` |
| `check.binaryanalysis.corpus.mirror.freshness` | Corpus mirrors directory does not exist: `/var/lib/stella/mirrors` |
| `check.binaryanalysis.ddeb.enabled` | Ubuntu ddeb repository is not configured but `ddebs.ubuntu.com` is reachable |
| `check.core.env.variables` | No environment configuration variables detected |
| `check.observability.logging` | 1 logging configuration issue(s) |
| `check.security.audit.logging` | 2 audit logging issue(s) |
| `check.security.cors` | 1 CORS configuration issue(s) found |
| `check.security.headers` | 5 security header(s) not configured |
## Observed Informational Results
| Check ID | Diagnosis |
| --- | --- |
| `check.binaryanalysis.debuginfod.available` | `DEBUGINFOD_URLS` not configured but default Fedora debuginfod is reachable |
| `check.binaryanalysis.symbol.recovery.fallback` | Symbol recovery operational with 1/3 sources available |
| `check.observability.alerting` | No alerting destinations configured |
| `check.observability.metrics` | Metrics configuration not found |
| `check.observability.otel` | OpenTelemetry endpoint not configured |
| `check.security.ratelimit` | Rate limiting configuration not found |
| `check.servicegraph.circuitbreaker` | Circuit breakers not configured |
## Healthy Baseline Results
The captured runtime returned `pass` for:
- `check.core.config.loaded`
- `check.core.crypto.available`
- `check.core.env.diskspace`
- `check.core.env.memory`
- `check.core.services.dependencies`
- `check.observability.healthchecks`
- `check.observability.tracing`
- `check.security.tls.certificate`
- `check.servicegraph.timeouts`
- `check.servicegraph.valkey`
## Skipped Checks
- `70` checks were skipped because the captured local stack did not provide the required runtime context, credentials, test artifacts, or dependent services.
- Skips are expected for the database, integration, release, scanner, and verification groups when the default local stack is not fully wired for end-to-end release validation.
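To see which groups dominate the skips, the skipped check IDs can be grouped by their plugin segment; a sketch over an abridged, hand-picked sample (the full run reported 70 skips):

```bash
# Abridged sample of skipped check IDs from the captured run.
skipped='check.integration.git
check.integration.ldap
check.release.active
check.release.schedule
check.scanner.queue
check.verification.signature'

# Count occurrences of the second dotted segment (the plugin group).
printf '%s\n' "$skipped" | awk -F. '{print $2}' | sort | uniq -c | sort -rn
```

Running the same pipeline over the full run output shows whether the skips cluster in the groups named above or spread across the catalog.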
## Follow-Up
- Use [the runtime check index](./checks/README.md) to map each runtime check to its article.
- Rebuild and rerun the Doctor services before claiming a fresh-stack zero-false-positive baseline; this document only records the captured live baseline from 2026-03-31.