doctor: complete runtime check documentation sprint
Signed-off-by: master <>
53
docs/doctor/articles/observability/observability-alerting.md
Normal file
@@ -0,0 +1,53 @@
---
checkId: check.observability.alerting
plugin: stellaops.doctor.observability
severity: info
tags: [observability, alerting, notifications]
---
# Alerting Configuration

## What It Checks
Looks for configured alert destinations such as Alertmanager, Slack, email recipients, or PagerDuty routing keys.

The check reports info when alerting is explicitly disabled or when no destination is configured. It warns only when a destination is present but obviously malformed, such as invalid email addresses.
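
As a rough illustration of the "obviously malformed" case, the sketch below applies a loose format test to a recipient address. The function name `is_plausible_email` and the pattern are assumptions made for this example; the Doctor plugin's real validation rule is not shown in this article.

```bash
# Hypothetical helper, not part of the stella CLI.
# Succeeds when the value has a local part, one "@", and a dotted domain.
is_plausible_email() {
  local addr="$1"
  local re='^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$'
  [[ "$addr" =~ $re ]]
}

# Example: flag a leftover placeholder before it reaches the deployment.
for recipient in "ops@example.com" "changeme"; do
  if is_plausible_email "$recipient"; then
    echo "looks fine: $recipient"
  else
    echo "suspicious: $recipient"
  fi
done
```

Running a test like this against each `Alerting__Email__Recipients__N` value before deployment is one way to catch placeholders early.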

## Why It Matters
Metrics and logs are not actionable if nobody is notified when thresholds are crossed. Production installs should route alerts somewhere outside the application process.

## Common Causes
- Alerting was never configured after initial compose bring-up
- Notification secrets were omitted from environment variables
- Recipient lists contain placeholders or invalid values

## How to Fix

### Docker Compose
```bash
# -i makes the match case-insensitive so mixed-case keys such as Alerting__Enabled are listed too
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web printenv | grep -iE 'ALERT|SLACK|PAGERDUTY|SMTP'
```

Example compose-style configuration:

```yaml
services:
  doctor-web:
    environment:
      Alerting__Enabled: "true"
      Alerting__AlertManagerUrl: http://alertmanager:9093
      Alerting__Email__Recipients__0: ops@example.com
```

### Bare Metal / systemd
Configure `Alerting:*` settings in the service configuration and ensure secrets come from the platform secrets provider rather than cleartext files.

### Kubernetes / Helm
Store webhook URLs and routing keys in Secrets, then map them onto the `Alerting:*` settings.

## Verification
```bash
stella doctor --check check.observability.alerting
```

## Related Checks
- `check.observability.metrics` - alerting is usually driven by metrics
- `check.observability.logging` - logs are the fallback when alerts are missing
@@ -0,0 +1,53 @@
---
checkId: check.observability.healthchecks
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, healthchecks, readiness, liveness]
---
# Health Check Endpoints

## What It Checks
Evaluates the configured health, readiness, and liveness paths and optionally probes `http://localhost:<port><path>` when a health-check port is configured.

The check warns when endpoints are unreachable, when timeouts are outside the `1s` to `60s` range, or when readiness and liveness collapse onto the same path.
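
The two configuration rules above (timeout range and distinct probe paths) can be sketched as a small shell function. This is only an illustration: `check_probe_config` is a hypothetical name, the timeout is assumed to be a whole number of seconds, and reachability probing is left out.

```bash
# Hypothetical helper, not part of the stella CLI.
# Arguments: timeout in seconds, readiness path, liveness path.
# Prints one warning per problem, or "ok" when neither rule is violated.
check_probe_config() {
  local timeout_s="$1" ready_path="$2" live_path="$3" problems=0
  if (( timeout_s < 1 || timeout_s > 60 )); then
    echo "warn: timeout ${timeout_s}s is outside 1s-60s"
    problems=1
  fi
  if [[ "$ready_path" == "$live_path" ]]; then
    echo "warn: readiness and liveness share ${ready_path}"
    problems=1
  fi
  if (( problems == 0 )); then
    echo "ok"
  fi
}

check_probe_config 30 /health/ready /health/live
check_probe_config 120 /health /health
```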

## Why It Matters
Broken health probes turn into bad restart loops, failed rolling upgrades, and misleading orchestration signals.

## Common Causes
- The service exposes `/health` but not `/health/ready` or `/health/live`
- Health-check ports differ from the actual bound HTTP port
- Probe timeout values were copied from another service without validation

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/health
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/health/ready
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/health/live
```

Set explicit paths and a reasonable timeout:

```yaml
HealthChecks__Path: /health
HealthChecks__ReadinessPath: /health/ready
HealthChecks__LivenessPath: /health/live
HealthChecks__Timeout: 30
```

### Bare Metal / systemd
Verify reverse proxies and firewalls do not block the health port.

### Kubernetes / Helm
Point readiness and liveness probes at separate endpoints whenever startup and steady-state behavior differ.

## Verification
```bash
stella doctor --check check.observability.healthchecks
```

## Related Checks
- `check.core.services.health` - aggregates the underlying ASP.NET health checks when available
- `check.observability.metrics` - shared listener misconfiguration can break both endpoints
49
docs/doctor/articles/observability/observability-logging.md
Normal file
@@ -0,0 +1,49 @@
---
checkId: check.observability.logging
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, logging, structured-logs]
---
# Logging Configuration

## What It Checks
Reads default and framework log levels and looks for structured logging via `Logging:Structured`, JSON console formatting, or a `Serilog` configuration section.

The check warns when default logging is `Debug` or `Trace`, when Microsoft categories are too verbose, or when structured logging is missing.
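
The default-level rule can be sketched as follows. The helper name `classify_default_level` is hypothetical, and the sketch covers only the `Debug`/`Trace` condition, not the framework-category or structured-logging conditions.

```bash
# Hypothetical helper, not part of the stella CLI.
# Prints "warn" for Debug or Trace (any letter case), otherwise "ok".
classify_default_level() {
  local level
  # Lower-case the input so "Debug", "DEBUG", and "debug" behave the same.
  level="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$level" in
    trace|debug) echo "warn" ;;
    *)           echo "ok" ;;
  esac
}

classify_default_level "Debug"
classify_default_level "Information"
```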

## Why It Matters
Unstructured logs slow incident response and make exports difficult to analyze. Overly verbose framework logging also drives storage growth and noise.

## Common Causes
- Only the default ASP.NET console logger is configured
- `Logging:Structured` or `Serilog` settings were omitted from compose values
- Troubleshooting log levels were left enabled in production

## How to Fix

### Docker Compose
```yaml
services:
  doctor-web:
    environment:
      Logging__LogLevel__Default: Information
      Logging__LogLevel__Microsoft: Warning
      Logging__Structured: "true"
```

If Serilog is used, make sure the console sink emits JSON or another structured format that downstream tooling can parse.

### Bare Metal / systemd
Keep framework namespaces at `Warning` or stricter unless you are collecting short-lived debugging evidence.

### Kubernetes / Helm
Ensure log collectors expect the same output format the application emits.

## Verification
```bash
stella doctor --check check.observability.logging
```

## Related Checks
- `check.observability.alerting` - alerting often relies on structured log pipelines
- `check.security.audit.logging` - audit logs should follow the same transport and retention standards
53
docs/doctor/articles/observability/observability-metrics.md
Normal file
@@ -0,0 +1,53 @@
---
checkId: check.observability.metrics
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, metrics, prometheus]
---
# Metrics Collection

## What It Checks
Inspects `Metrics:*`, `Prometheus:*`, and `OpenTelemetry:Metrics:*` settings. When a metrics port is configured and an `IHttpClientFactory` is available, the check probes `http://localhost:<port><path>`.

The check returns info when metrics are disabled or absent, and warns when the configured endpoint cannot be reached.
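
To reproduce the probe by hand, it helps to build the same `http://localhost:<port><path>` URL from the configured values. The sketch below is an assumption-laden illustration: `metrics_probe_url` is a hypothetical name, and the `/metrics` default is taken from the compose example in this article rather than from the check's source.

```bash
# Hypothetical helper, not part of the stella CLI.
# Arguments: port, optional path (assumed default: /metrics).
metrics_probe_url() {
  local port="$1" path="${2:-/metrics}"
  # Make sure the path begins with a slash even if it was configured without one.
  path="/${path#/}"
  echo "http://localhost:${port}${path}"
}

metrics_probe_url 8080 /metrics
# Inside the container, the result can be passed to curl:
#   curl -fsS "$(metrics_probe_url 8080 /metrics)"
```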

## Why It Matters
Metrics are the primary input for alerting, SLO tracking, and capacity planning. Missing or unreachable endpoints remove the fastest signal operators have.

## Common Causes
- Metrics were never enabled in the deployment configuration
- The metrics path or port does not match the listener exposed by the service
- A sidecar or reverse proxy blocks local probing

## How to Fix

### Docker Compose
```yaml
services:
  doctor-web:
    environment:
      Metrics__Enabled: "true"
      Metrics__Path: /metrics
      Metrics__Port: 8080
```

Probe the endpoint from inside the container:

```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/metrics
```

### Bare Metal / systemd
Bind the metrics port explicitly if the service does not share the main HTTP listener.

### Kubernetes / Helm
Align the `ServiceMonitor` or Prometheus scrape config with the same path and port the app exposes.

## Verification
```bash
stella doctor --check check.observability.metrics
```

## Related Checks
- `check.observability.otel` - OpenTelemetry metrics often share the same collector path
- `check.observability.alerting` - metrics are usually the source for alert rules
52
docs/doctor/articles/observability/observability-otel.md
Normal file
@@ -0,0 +1,52 @@
---
checkId: check.observability.otel
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, opentelemetry, tracing, metrics]
---
# OpenTelemetry Configuration

## What It Checks
Reads `OpenTelemetry:*`, `Telemetry:*`, and `OTEL_*` settings for endpoint, service name, tracing enablement, metrics enablement, and sampling ratio. When possible, it probes the collector host directly.

The check reports info when no OTLP endpoint is configured and warns when the service name is missing, tracing or metrics are disabled, sampling is too low, or the collector is unreachable.

## Why It Matters
OpenTelemetry is the main path for exporting traces and metrics to external systems. Broken collector settings silently remove cross-service visibility.

## Common Causes
- `OTEL_EXPORTER_OTLP_ENDPOINT` was omitted from compose or environment settings
- `OTEL_SERVICE_NAME` was never set
- Collector networking differs between local and deployed environments

## How to Fix

### Docker Compose
```yaml
services:
  doctor-web:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_SERVICE_NAME: doctor-web
      OpenTelemetry__Tracing__Enabled: "true"
      OpenTelemetry__Metrics__Enabled: "true"
```

```bash
# 4318 is the default OTLP/HTTP port; the exporter endpoint above uses 4317 (OTLP/gRPC)
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://otel-collector:4318/
```

### Bare Metal / systemd
Keep the collector endpoint in the service unit or configuration file and verify firewalls allow traffic on the OTLP port.

### Kubernetes / Helm
Use cluster-local collector service names and inject `OTEL_SERVICE_NAME` per workload.

## Verification
```bash
stella doctor --check check.observability.otel
```

## Related Checks
- `check.observability.tracing` - validates trace-specific tuning once OTLP export is wired
- `check.observability.metrics` - metrics export often shares the same collector
48
docs/doctor/articles/observability/observability-tracing.md
Normal file
@@ -0,0 +1,48 @@
---
checkId: check.observability.tracing
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, tracing, correlation]
---
# Distributed Tracing

## What It Checks
Validates trace enablement, propagator, sampling ratio, exporter type, and whether HTTP and database instrumentation are turned on.

The check reports info when tracing is explicitly disabled and warns when sampling is invalid, too low, or when important instrumentation is turned off.
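
The sampling rule can be illustrated with a short sketch. The `0.01` lower bound is borrowed from the range this article recommends for `Tracing:SamplingRatio`; whether the check uses exactly that threshold is an assumption, and `classify_sampling_ratio` is a hypothetical name.

```bash
# Hypothetical helper, not part of the stella CLI.
# Prints "invalid", "too-low", or "ok" for a sampling ratio given as text.
classify_sampling_ratio() {
  awk -v r="$1" 'BEGIN {
    # Reject anything that is not a plain non-negative decimal number.
    if (r !~ /^[0-9]+(\.[0-9]+)?$/) { print "invalid"; exit }
    r += 0
    if (r > 1.0)       print "invalid"
    else if (r < 0.01) print "too-low"
    else               print "ok"
  }'
}

classify_sampling_ratio 1.0
classify_sampling_ratio 0
```

A ratio of `0`, the leftover from load testing mentioned under Common Causes, is classified as too low here rather than invalid.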

## Why It Matters
Tracing is the fastest way to understand cross-service latency and identify the exact hop that is failing. Disabling instrumentation removes that evidence.

## Common Causes
- Sampling ratio set to `0` during load testing and never restored
- Only outbound HTTP traces are enabled while database spans remain off
- Propagator or exporter defaults differ between services

## How to Fix

### Docker Compose
```yaml
services:
  doctor-web:
    environment:
      Tracing__Enabled: "true"
      Tracing__SamplingRatio: "1.0"
      Tracing__Instrumentation__Http: "true"
      Tracing__Instrumentation__Database: "true"
```

### Bare Metal / systemd
Keep `Tracing:SamplingRatio` between `0.01` and `1.0` unless you are deliberately suppressing traces for a benchmark.

### Kubernetes / Helm
Propagate the same trace configuration across all services in the release path so correlation IDs remain intact.

## Verification
```bash
stella doctor --check check.observability.tracing
```

## Related Checks
- `check.observability.otel` - exporter connectivity must work before traces leave the process
- `check.servicegraph.timeouts` - tracing is most useful when diagnosing timeout-related issues
60
docs/doctor/articles/postgres/db-connection.md
Normal file
@@ -0,0 +1,60 @@
---
checkId: check.db.connection
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, connectivity, quick]
---
# Database Connection

## What It Checks
Opens a PostgreSQL connection using `Doctor:Plugins:Database:ConnectionString` or `ConnectionStrings:DefaultConnection` and runs `SELECT version(), current_database(), current_user`.

The check passes only when the connection opens and the probe query returns successfully. Connection failures, authentication failures, DNS errors, and network timeouts fail the check.

## Why It Matters
Doctor cannot validate migrations, pool health, or schema state if the platform cannot reach PostgreSQL. A broken connection path usually means startup failures, API errors, and background job disruption across the suite.

## Common Causes
- `ConnectionStrings__DefaultConnection` is missing or malformed
- PostgreSQL is not running or not listening on the configured host and port
- DNS, firewall, or container networking prevents the Doctor service from reaching PostgreSQL
- Username, password, database name, or TLS settings are incorrect

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps postgres
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 postgres
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres pg_isready -U stellaops -d stellaops
```

Set the Doctor connection string with compose-style environment variables:

```yaml
services:
  doctor-web:
    environment:
      ConnectionStrings__DefaultConnection: Host=postgres;Port=5432;Database=stellaops;Username=stellaops;Password=${STELLAOPS_DB_PASSWORD}
```

### Bare Metal / systemd
```bash
pg_isready -h <db-host> -p 5432 -U <db-user> -d <db-name>
# psql expects a libpq conninfo string (space-separated, lower-case keys), not the Npgsql format used by the service
psql "host=<db-host> port=5432 dbname=<db-name> user=<db-user> password=<password>" -c "SELECT 1"
```

### Kubernetes / Helm
```bash
kubectl exec deploy/doctor-web -- pg_isready -h <postgres-service> -p 5432 -U <db-user> -d <db-name>
kubectl get secret <db-secret> -o yaml
```

## Verification
```bash
stella doctor --check check.db.connection
```

## Related Checks
- `check.db.latency` - uses the same connection path and highlights performance issues after basic connectivity works
- `check.db.pool.health` - validates connection pressure after connectivity is restored
53
docs/doctor/articles/postgres/db-latency.md
Normal file
@@ -0,0 +1,53 @@
---
checkId: check.db.latency
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, latency, performance]
---
# Query Latency

## What It Checks
Runs two warmup queries and then measures five `SELECT 1` probes plus five temporary-table `INSERT` probes against PostgreSQL.

The check warns when the p95 latency exceeds `50ms` and fails when the p95 latency exceeds `200ms`.
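
The threshold logic can be sketched in shell. The article does not say which percentile method the check uses, so the nearest-rank definition below is an assumption, as are the helper names `p95_ms` and `classify_latency`.

```bash
# Hypothetical helpers, not part of the stella CLI.

# Prints the nearest-rank p95 of the millisecond samples passed as arguments.
p95_ms() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $1 }
    END {
      rank = int((NR * 95 + 99) / 100)   # integer form of ceil(0.95 * NR)
      print v[rank]
    }'
}

# Maps a p95 value in milliseconds onto the documented thresholds.
classify_latency() {
  awk -v p="$1" 'BEGIN {
    if (p > 200)     print "fail"
    else if (p > 50) print "warn"
    else             print "pass"
  }'
}

# Ten illustrative samples (made-up values): one slow probe among fast ones.
p95="$(p95_ms 3 4 5 4 3 12 60 11 10 13)"
echo "p95=${p95}ms -> $(classify_latency "$p95")"
```

With only ten samples, the nearest-rank p95 is the slowest sample, so under this definition a single slow probe is enough to trigger a warning.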

## Why It Matters
Healthy connectivity is not enough if the database path is slow. Elevated query latency turns into slow UI pages, delayed releases, and queue backlogs across the platform.

## Common Causes
- CPU, memory, or I/O pressure on the PostgreSQL host
- Cross-host or cross-region latency between Doctor and PostgreSQL
- Lock contention or long-running transactions
- Shared infrastructure saturation in the default compose stack

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT * FROM pg_locks WHERE NOT granted;"
docker compose -f devops/compose/docker-compose.stella-ops.yml stats postgres
```

Tune connection placement and storage before raising thresholds. If the database is remote, keep `doctor-web` and PostgreSQL on the same low-latency network segment.

### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT * FROM pg_locks WHERE NOT granted;"
```

### Kubernetes / Helm
```bash
kubectl top pod -n <namespace> <postgres-pod>
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT now();"
```

## Verification
```bash
stella doctor --check check.db.latency
```

## Related Checks
- `check.db.connection` - basic reachability must pass before latency numbers are meaningful
- `check.db.pool.health` - pool saturation often shows up as latency first
52
docs/doctor/articles/postgres/db-migrations-failed.md
Normal file
@@ -0,0 +1,52 @@
---
checkId: check.db.migrations.failed
plugin: stellaops.doctor.database
severity: fail
tags: [database, migrations, postgres, schema]
---
# Failed Migrations

## What It Checks
Reads the `stella_migration_history` table, when present, and reports rows marked `failed` or `incomplete`.

If the tracking table does not exist, the check reports info and assumes the service is using a different migration mechanism.

## Why It Matters
Partially applied migrations leave schemas in undefined states. That is a common cause of startup failures and runtime `500` errors after upgrades.

## Common Causes
- A migration script failed during deployment
- The database user lacks DDL permissions
- Two processes attempted to apply migrations concurrently
- An interrupted deployment left the migration history half-written

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 200 doctor-web
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT migration_id, status, error_message, applied_at FROM stella_migration_history ORDER BY applied_at DESC LIMIT 10;"
```

Fix the underlying SQL or permission problem, then restart the owning service so startup migrations run again.

### Bare Metal / systemd
```bash
journalctl -u <service-name> -n 200
dotnet ef database update
```

### Kubernetes / Helm
```bash
kubectl logs deploy/<service-name> -n <namespace> --tail=200
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT migration_id, status FROM stella_migration_history;"
```

## Verification
```bash
stella doctor --check check.db.migrations.failed
```

## Related Checks
- `check.db.migrations.pending` - pending migrations often follow a failed rollout
- `check.db.schema.version` - schema consistency should be rechecked after cleanup
52
docs/doctor/articles/postgres/db-migrations-pending.md
Normal file
@@ -0,0 +1,52 @@
---
checkId: check.db.migrations.pending
plugin: stellaops.doctor.database
severity: warn
tags: [database, migrations, postgres, schema]
---
# Pending Migrations

## What It Checks
Looks for the `__EFMigrationsHistory` table and reports the latest applied migration recorded there.

This runtime check does not diff the database against the assembly directly; it tells you whether migration history exists and what the latest applied migration is.

## Why It Matters
Missing or stale migration history usually means a fresh environment was bootstrapped incorrectly or schema changes were never applied on startup.

## Common Causes
- Startup migrations are not wired for the owning service
- The database was reset and the service never converged the schema
- The service is using a different schema owner than operators expect

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 200 doctor-web
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT \"MigrationId\" FROM \"__EFMigrationsHistory\" ORDER BY \"MigrationId\" DESC;"
```

Confirm the owning service calls startup migrations on boot instead of relying on one-off SQL initialization scripts.

### Bare Metal / systemd
```bash
journalctl -u <service-name> -n 200
dotnet ef migrations list
dotnet ef database update
```

### Kubernetes / Helm
```bash
kubectl logs deploy/<service-name> -n <namespace> --tail=200
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT COUNT(*) FROM \"__EFMigrationsHistory\";"
```

## Verification
```bash
stella doctor --check check.db.migrations.pending
```

## Related Checks
- `check.db.migrations.failed` - diagnose broken runs before retrying
- `check.db.schema.version` - validates the resulting schema shape
51
docs/doctor/articles/postgres/db-permissions.md
Normal file
@@ -0,0 +1,51 @@
---
checkId: check.db.permissions
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, permissions, security]
---
# Database Permissions

## What It Checks
Inspects the current PostgreSQL user, whether it is a superuser, whether it can create databases or roles, and whether it has access to application schemas.

The check warns when the app runs as a superuser and fails when the user cannot use the `public` schema.

## Why It Matters
Over-privileged accounts increase blast radius. Under-privileged accounts break startup migrations and normal CRUD paths.

## Common Causes
- The connection string still uses `postgres` or another admin account
- Grants were not applied after creating a dedicated service account
- Restrictive schema privileges were added manually

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "CREATE USER stellaops WITH PASSWORD '<strong-password>';"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT CONNECT ON DATABASE stellaops TO stellaops;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT USAGE ON SCHEMA public TO stellaops;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U postgres -d stellaops -c "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO stellaops;"
```

Update `ConnectionStrings__DefaultConnection` after the grants are in place.

### Bare Metal / systemd
```bash
psql -h <db-host> -U postgres -d <db-name> -c "ALTER ROLE <app-user> NOSUPERUSER NOCREATEDB NOCREATEROLE;"
```

### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U postgres -d <db-name> -c "\du"
```

## Verification
```bash
stella doctor --check check.db.permissions
```

## Related Checks
- `check.db.migrations.failed` - missing privileges frequently break migrations
- `check.db.connection` - credentials and grants must both be correct
50
docs/doctor/articles/postgres/db-pool-health.md
Normal file
@@ -0,0 +1,50 @@
---
checkId: check.db.pool.health
plugin: stellaops.doctor.database
severity: fail
tags: [database, postgres, pool, connections]
---
# Connection Pool Health

## What It Checks
Queries `pg_stat_activity` for the current database and evaluates total connections, active connections, idle connections, waiting connections, and sessions stuck `idle in transaction`.

The check warns when more than five sessions are `idle in transaction` or when total usage exceeds `80%` of server capacity.
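
The two warning conditions can be sketched as below. The sketch assumes "server capacity" means PostgreSQL's `max_connections`, which this article does not state explicitly, and `classify_pool` is a hypothetical name.

```bash
# Hypothetical helper, not part of the stella CLI.
# Arguments: total connections, idle-in-transaction sessions, max_connections.
classify_pool() {
  local total="$1" idle_in_tx="$2" max_conn="$3"
  # Integer math: total / max_conn > 80%  is the same as  total * 100 > max_conn * 80.
  if (( idle_in_tx > 5 || total * 100 > max_conn * 80 )); then
    echo "warn"
  else
    echo "pass"
  fi
}

classify_pool 40 0 100   # light usage
classify_pool 85 0 100   # above 80% of capacity
classify_pool 10 6 100   # too many idle-in-transaction sessions
```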

## Why It Matters
Pool pressure turns into request latency, migration timeouts, and job backlog. `idle in transaction` sessions are especially dangerous because they hold locks while doing nothing useful.

## Common Causes
- Application code is not closing transactions
- Connection leaks keep sessions open after requests complete
- `max_connections` is too low for the number of app instances
- Long-running requests or deadlocks block pooled connections

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE datname = current_database();"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT pid, query FROM pg_stat_activity WHERE state = 'idle in transaction';"
```

### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SHOW max_connections;"
```

Review the owning service for transaction scopes that stay open across network calls or retries.

### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT count(*) FROM pg_stat_activity;"
```

## Verification
```bash
stella doctor --check check.db.pool.health
```

## Related Checks
- `check.db.pool.size` - configuration and runtime pressure need to agree
- `check.db.latency` - latency usually rises before the pool is fully exhausted
56
docs/doctor/articles/postgres/db-pool-size.md
Normal file
@@ -0,0 +1,56 @@
---
checkId: check.db.pool.size
plugin: stellaops.doctor.database
severity: warn
tags: [database, postgres, pool, configuration]
---
# Connection Pool Size

## What It Checks
Parses the Npgsql connection string and compares `Pooling`, `MinPoolSize`, and `MaxPoolSize` against PostgreSQL `max_connections` minus reserved superuser slots.

The check warns when pooling is disabled or when `Max Pool Size` exceeds practical server capacity. It returns info when `MinPoolSize=0`.
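
The comparison can be sketched as follows. This assumes "practical server capacity" is `max_connections` minus `superuser_reserved_connections`, as the first paragraph suggests; the exact rule and the order of the conditions inside the check are not documented here, and `classify_pool_size` is a hypothetical name.

```bash
# Hypothetical helper, not part of the stella CLI.
# Arguments: Pooling (true/false), MinPoolSize, MaxPoolSize,
#            max_connections, superuser_reserved_connections.
classify_pool_size() {
  local pooling="$1" min_pool="$2" max_pool="$3" max_conn="$4" reserved="$5"
  local available=$(( max_conn - reserved ))
  if [[ "$pooling" != "true" ]]; then
    echo "warn: pooling disabled"
  elif (( max_pool > available )); then
    echo "warn: MaxPoolSize ${max_pool} exceeds ${available} available connections"
  elif (( min_pool == 0 )); then
    echo "info: MinPoolSize is 0"
  else
    echo "pass"
  fi
}

classify_pool_size true 5 25 100 3
classify_pool_size true 5 200 100 3
```

When several replicas share one PostgreSQL server, the same comparison is worth repeating with `MaxPoolSize` multiplied by the replica count.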

## Why It Matters
Pool sizing mistakes create either avoidable cold-start latency or connection storms that starve PostgreSQL.

## Common Causes
- `Pooling=false` left over from local troubleshooting
- `Max Pool Size` copied from another environment without checking server capacity
- Multiple app replicas sharing the same PostgreSQL limit without coordinated sizing

## How to Fix

### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SHOW max_connections;"
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SHOW superuser_reserved_connections;"
```

Set an explicit connection string:

```yaml
services:
  doctor-web:
    environment:
      ConnectionStrings__DefaultConnection: Host=postgres;Port=5432;Database=stellaops;Username=stellaops;Password=${STELLAOPS_DB_PASSWORD};Pooling=true;MinPoolSize=5;MaxPoolSize=25
```

### Bare Metal / systemd
```bash
psql -h <db-host> -U <db-user> -d <db-name> -c "SHOW max_connections;"
```

### Kubernetes / Helm
```bash
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SHOW max_connections;"
```

## Verification
```bash
stella doctor --check check.db.pool.size
```

## Related Checks
- `check.db.pool.health` - validates that configured limits behave correctly at runtime
- `check.db.connection` - pooling changes should not break base connectivity
49
docs/doctor/articles/postgres/db-schema-version.md
Normal file
@@ -0,0 +1,49 @@
|
||||
---
|
||||
checkId: check.db.schema.version
|
||||
plugin: stellaops.doctor.database
|
||||
severity: fail
|
||||
tags: [database, postgres, schema, migrations]
|
||||
---
|
||||
# Schema Version
|
||||
|
||||
## What It Checks
|
||||
Counts non-system schemas and tables, inspects the latest EF migration entry when available, and warns when PostgreSQL reports unvalidated foreign-key constraints.
|
||||
|
||||
Unvalidated constraints usually indicate an interrupted migration or manual DDL drift.
|
||||
|
||||
## Why It Matters
|
||||
Schema drift is a common source of runtime breakage after upgrades. Unvalidated constraints can hide partial migrations long after deployment appears complete.
|
||||
|
||||
## Common Causes
|
||||
- A migration failed after creating constraints but before validation
|
||||
- Manual schema changes bypassed startup migrations
|
||||
- The database was restored from an inconsistent backup
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```bash
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT conname FROM pg_constraint WHERE NOT convalidated;"
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml exec postgres psql -U stellaops -d stellaops -c "SELECT \"MigrationId\" FROM \"__EFMigrationsHistory\" ORDER BY \"MigrationId\" DESC LIMIT 5;"
|
||||
```
|
||||
|
||||
Re-run the owning service with startup migrations enabled after fixing the underlying schema issue.
|
||||
|
||||
### Bare Metal / systemd
|
||||
```bash
|
||||
psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT COUNT(*) FROM pg_constraint WHERE NOT convalidated;"
|
||||
```
|
||||
|
||||
### Kubernetes / Helm
|
||||
```bash
|
||||
kubectl exec -n <namespace> <postgres-pod> -- psql -U <db-user> -d <db-name> -c "SELECT nspname FROM pg_namespace WHERE nspname NOT LIKE 'pg_%' AND nspname <> 'information_schema';"
|
||||
```
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
stella doctor --check check.db.schema.version
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.db.migrations.failed` - failed migrations are the most common cause of schema inconsistency
|
||||
- `check.db.migrations.pending` - verify history after cleanup
|
||||
56
docs/doctor/articles/servicegraph/servicegraph-backend.md
Normal file
@@ -0,0 +1,56 @@
|
||||
---
|
||||
checkId: check.servicegraph.backend
|
||||
plugin: stellaops.doctor.servicegraph
|
||||
severity: fail
|
||||
tags: [servicegraph, backend, api, connectivity]
|
||||
---
|
||||
# Backend API Connectivity
|
||||
|
||||
## What It Checks
|
||||
Reads `StellaOps:BackendUrl` or `BackendUrl`, appends `/health`, and performs an HTTP GET through `IHttpClientFactory`.
|
||||
|
||||
The check passes on a successful response, warns when latency exceeds `2000ms`, and fails on non-success status codes or connection errors.
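
The pass, warn, and fail rules above can be sketched as a small shell helper. This is an illustration under stated assumptions rather than the check's actual code: `classify_probe` is a hypothetical name, and "successful response" is assumed to mean any 2xx status.

```bash
# classify_probe HTTP_STATUS LATENCY_MS
# Hypothetical helper mirroring the documented rules; 2xx is assumed to mean success.
classify_probe() {
  probe_status=$1
  probe_latency_ms=$2
  if [ "$probe_status" -lt 200 ] || [ "$probe_status" -ge 300 ]; then
    echo fail      # non-success status, including 000 for a failed connection
  elif [ "$probe_latency_ms" -gt 2000 ]; then
    echo warn      # reachable but slower than the 2000ms threshold
  else
    echo pass
  fi
}

classify_probe 200 150    # pass
classify_probe 200 2500   # warn
classify_probe 503 40     # fail
```

When probing by hand, `curl -o /dev/null -s -w '%{http_code} %{time_total}'` reports the status code and the total time in seconds, which can be converted to milliseconds before being passed to a helper like this one.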
|
||||
|
||||
## Why It Matters
|
||||
The backend API is the control plane entry point for many Stella Ops flows. If it is unreachable, UI features and cross-service orchestration degrade quickly.
|
||||
|
||||
## Common Causes
|
||||
- `StellaOps__BackendUrl` points to the wrong host, port, or scheme
|
||||
- The backend service is down or returning `5xx`
|
||||
- DNS, proxy, or network rules block access from the Doctor service
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```yaml
|
||||
services:
|
||||
doctor-web:
|
||||
environment:
|
||||
StellaOps__BackendUrl: http://platform-web:8080
|
||||
```
|
||||
|
||||
```bash
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://platform-web:8080/health
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 platform-web
|
||||
```
|
||||
|
||||
### Bare Metal / systemd
|
||||
```bash
|
||||
curl -fsS http://<backend-host>:<port>/health
|
||||
journalctl -u <backend-service> -n 200
|
||||
```
|
||||
|
||||
### Kubernetes / Helm
|
||||
```bash
|
||||
kubectl exec deploy/doctor-web -n <namespace> -- curl -fsS http://<backend-service>.<namespace>.svc.cluster.local:<port>/health
|
||||
kubectl logs deploy/<backend-service> -n <namespace> --tail=200
|
||||
```
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
stella doctor --check check.servicegraph.backend
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.servicegraph.endpoints` - validates the rest of the service graph after the main backend is reachable
|
||||
- `check.servicegraph.timeouts` - slow backend responses often trace back to timeout tuning
|
||||
@@ -0,0 +1,48 @@
|
||||
---
|
||||
checkId: check.servicegraph.circuitbreaker
|
||||
plugin: stellaops.doctor.servicegraph
|
||||
severity: warn
|
||||
tags: [servicegraph, resilience, circuit-breaker]
|
||||
---
|
||||
# Circuit Breaker Status
|
||||
|
||||
## What It Checks
|
||||
Reads `Resilience:Enabled` or `HttpClient:Resilience:Enabled` and, when enabled, validates `BreakDurationSeconds`, `FailureThreshold`, and `SamplingDurationSeconds`.
|
||||
|
||||
The check reports info when resilience is not configured, warns when `BreakDurationSeconds < 5` or `FailureThreshold < 2`, and passes otherwise.
|
||||
|
||||
## Why It Matters
|
||||
Circuit breakers protect external dependencies from retry storms. Thresholds set too low trip the breaker on transient errors, while thresholds set too high never trip it when a downstream service is failing.
|
||||
|
||||
## Common Causes
|
||||
- Resilience policies were never enabled on outgoing HTTP clients
|
||||
- Thresholds were copied from a benchmark profile into production
|
||||
- Multiple services use different resilience defaults, making failures unpredictable
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```yaml
|
||||
services:
|
||||
doctor-web:
|
||||
environment:
|
||||
Resilience__Enabled: "true"
|
||||
Resilience__CircuitBreaker__BreakDurationSeconds: "30"
|
||||
Resilience__CircuitBreaker__FailureThreshold: "5"
|
||||
Resilience__CircuitBreaker__SamplingDurationSeconds: "60"
|
||||
```
|
||||
|
||||
### Bare Metal / systemd
|
||||
Keep breaker settings in the same configuration source used for HTTP client registration so the service and Doctor observe the same values.
|
||||
|
||||
### Kubernetes / Helm
|
||||
Standardize resilience values across backend-facing workloads instead of setting per-pod overrides.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
stella doctor --check check.servicegraph.circuitbreaker
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.servicegraph.backend` - breaker policy protects this path when the backend degrades
|
||||
- `check.servicegraph.timeouts` - timeout settings and breaker settings should be tuned together
|
||||
53
docs/doctor/articles/servicegraph/servicegraph-endpoints.md
Normal file
@@ -0,0 +1,53 @@
|
||||
---
|
||||
checkId: check.servicegraph.endpoints
|
||||
plugin: stellaops.doctor.servicegraph
|
||||
severity: fail
|
||||
tags: [servicegraph, services, endpoints, connectivity]
|
||||
---
|
||||
# Service Endpoints
|
||||
|
||||
## What It Checks
|
||||
Collects configured service URLs for Authority, Scanner, Concelier, Excititor, Attestor, VexLens, and Gateway, appends `/health`, and probes each endpoint.
|
||||
|
||||
The check fails when any configured endpoint is unreachable or returns a non-success status. If no endpoints are configured, the check is skipped.
|
||||
|
||||
## Why It Matters
|
||||
Stella Ops is a multi-service platform. A single broken internal endpoint can stall release orchestration, evidence generation, or advisory workflows even when the main web process is alive.
|
||||
|
||||
## Common Causes
|
||||
- One or more `StellaOps:*Url` values are missing or point to the wrong internal service name
|
||||
- Internal DNS or network routing is broken
|
||||
- The target workload is up but not exposing `/health`
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
Set the internal URLs explicitly:
|
||||
|
||||
```yaml
|
||||
StellaOps__AuthorityUrl: http://authority-web:8080
|
||||
StellaOps__ScannerUrl: http://scanner-web:8080
|
||||
StellaOps__GatewayUrl: http://web:8080
|
||||
```
|
||||
|
||||
Probe each endpoint from the Doctor container:
|
||||
|
||||
```bash
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://authority-web:8080/health
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://scanner-web:8080/health
|
||||
```
|
||||
|
||||
### Bare Metal / systemd
|
||||
Confirm the service-discovery or reverse-proxy names resolve from the Doctor host.
|
||||
|
||||
### Kubernetes / Helm
|
||||
Use cluster-local service DNS names and check that each workload exposes a health endpoint on the same port the URL references.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
stella doctor --check check.servicegraph.endpoints
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.servicegraph.backend` - the backend is usually the first endpoint operators validate
|
||||
- `check.servicegraph.mq` - asynchronous workflows also depend on messaging, not only HTTP endpoints
|
||||
56
docs/doctor/articles/servicegraph/servicegraph-mq.md
Normal file
@@ -0,0 +1,56 @@
|
||||
---
|
||||
checkId: check.servicegraph.mq
|
||||
plugin: stellaops.doctor.servicegraph
|
||||
severity: warn
|
||||
tags: [servicegraph, messaging, rabbitmq, connectivity]
|
||||
---
|
||||
# Message Queue Connectivity
|
||||
|
||||
## What It Checks
|
||||
Reads `RabbitMQ:Host` or `Messaging:RabbitMQ:Host` plus an optional port, defaulting to `5672`, and attempts a TCP connection.
|
||||
|
||||
The check is skipped when RabbitMQ is not configured and fails on timeouts, DNS failures, or refused connections.
|
||||
|
||||
## Why It Matters
|
||||
Release tasks, notifications, and deferred work often depend on a functioning message broker. A dead queue path turns healthy APIs into backlogged systems.
|
||||
|
||||
## Common Causes
|
||||
- `RabbitMQ__Host` is unset or points to the wrong broker
|
||||
- The broker container is down
|
||||
- AMQP traffic is blocked between Doctor and RabbitMQ
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```yaml
|
||||
services:
|
||||
doctor-web:
|
||||
environment:
|
||||
RabbitMQ__Host: rabbitmq
|
||||
RabbitMQ__Port: "5672"
|
||||
```
|
||||
|
||||
```bash
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml ps rabbitmq
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 rabbitmq
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web sh -lc "nc -zv rabbitmq 5672"
|
||||
```
|
||||
|
||||
### Bare Metal / systemd
|
||||
```bash
|
||||
nc -zv <rabbit-host> 5672
|
||||
```
|
||||
|
||||
### Kubernetes / Helm
|
||||
```bash
|
||||
kubectl exec deploy/doctor-web -n <namespace> -- sh -lc "nc -zv <rabbit-service> 5672"
|
||||
```
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
stella doctor --check check.servicegraph.mq
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.servicegraph.valkey` - cache and queue connectivity usually fail together when service networking is broken
|
||||
- `check.servicegraph.timeouts` - aggressive timeouts can make a slow broker look unavailable
|
||||
48
docs/doctor/articles/servicegraph/servicegraph-timeouts.md
Normal file
@@ -0,0 +1,48 @@
|
||||
---
|
||||
checkId: check.servicegraph.timeouts
|
||||
plugin: stellaops.doctor.servicegraph
|
||||
severity: warn
|
||||
tags: [servicegraph, timeouts, configuration]
|
||||
---
|
||||
# Service Timeouts
|
||||
|
||||
## What It Checks
|
||||
Validates `HttpClient:Timeout`, `Database:CommandTimeout`, `Cache:OperationTimeout`, and `HealthChecks:Timeout`.
|
||||
|
||||
The check warns when HTTP timeout is below `5s` or above `300s`, database timeout is below `5s` or above `120s`, cache timeout exceeds `30s`, or health-check timeout exceeds the HTTP timeout.
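
The thresholds above can be tried out against candidate values before they are deployed. The sketch below is illustrative only: `check_timeouts` is a hypothetical helper, and it assumes all four settings are expressed in whole seconds, which the compose example suggests but the source does not state outright.

```bash
# check_timeouts HTTP DB CACHE HEALTH
# Hypothetical helper applying the documented thresholds; values assumed to be whole seconds.
check_timeouts() {
  t_http=$1; t_db=$2; t_cache=$3; t_health=$4
  t_status=pass
  if [ "$t_http" -lt 5 ] || [ "$t_http" -gt 300 ]; then
    echo "warn: HttpClient:Timeout=${t_http}s is outside 5-300s"; t_status=warn
  fi
  if [ "$t_db" -lt 5 ] || [ "$t_db" -gt 120 ]; then
    echo "warn: Database:CommandTimeout=${t_db}s is outside 5-120s"; t_status=warn
  fi
  if [ "$t_cache" -gt 30 ]; then
    echo "warn: Cache:OperationTimeout=${t_cache}s exceeds 30s"; t_status=warn
  fi
  if [ "$t_health" -gt "$t_http" ]; then
    echo "warn: HealthChecks:Timeout=${t_health}s exceeds the HTTP timeout of ${t_http}s"; t_status=warn
  fi
  echo "result: $t_status"
}

check_timeouts 100 30 5 10   # prints "result: pass"
```

The values in the call match the compose example under How to Fix, so that configuration would pass every threshold.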
|
||||
|
||||
## Why It Matters
|
||||
Timeouts define how quickly failures surface and how long stuck work ties up resources. Poor values cause either premature failures or prolonged resource exhaustion.
|
||||
|
||||
## Common Causes
|
||||
- Defaults from one environment were copied into another with very different latency
|
||||
- Health-check timeout was set higher than the main request timeout
|
||||
- Cache or database timeouts were raised to hide underlying performance problems
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```yaml
|
||||
services:
|
||||
doctor-web:
|
||||
environment:
|
||||
HttpClient__Timeout: "100"
|
||||
Database__CommandTimeout: "30"
|
||||
Cache__OperationTimeout: "5"
|
||||
HealthChecks__Timeout: "10"
|
||||
```
|
||||
|
||||
### Bare Metal / systemd
|
||||
Tune timeouts from measured service latencies, not from guesswork. Raise values only after understanding the slower dependency.
|
||||
|
||||
### Kubernetes / Helm
|
||||
Keep application timeouts lower than ingress, service-mesh, and job-level deadlines so failures happen in the component that owns the retry policy.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
stella doctor --check check.servicegraph.timeouts
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.servicegraph.backend` - timeout misconfiguration often shows up as backend failures first
|
||||
- `check.db.latency` - high database latency can force operators to revisit timeout values
|
||||
52
docs/doctor/articles/servicegraph/servicegraph-valkey.md
Normal file
@@ -0,0 +1,52 @@
|
||||
---
|
||||
checkId: check.servicegraph.valkey
|
||||
plugin: stellaops.doctor.servicegraph
|
||||
severity: warn
|
||||
tags: [servicegraph, valkey, redis, cache]
|
||||
---
|
||||
# Valkey/Redis Connectivity
|
||||
|
||||
## What It Checks
|
||||
Reads `Valkey:ConnectionString`, `Redis:ConnectionString`, `ConnectionStrings:Valkey`, or `ConnectionStrings:Redis`, parses the host and port, and opens a TCP connection.
|
||||
|
||||
The check is skipped when no cache connection string is configured and fails when parsing fails or the target cannot be reached.
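
To see what "parses the host and port" means for a connection string like the one in the compose example, here is a rough shell sketch. It is not the check's parser: `parse_cache_endpoint` is a hypothetical helper, it only reads the first comma-separated token, and the fallback to port `6379` when none is given is an assumption.

```bash
# parse_cache_endpoint CONNECTION_STRING -> prints "host port"
# Hypothetical helper; only the first comma-separated token is inspected.
parse_cache_endpoint() {
  cache_conn=$1
  cache_endpoint=${cache_conn%%,*}      # e.g. "valkey:6379" from "valkey:6379,password=..."
  cache_host=${cache_endpoint%%:*}
  case $cache_endpoint in
    *:*) cache_port=${cache_endpoint##*:} ;;
    *)   cache_port=6379 ;;             # assumed default when no port is given
  esac
  case $cache_port in
    ''|*[!0-9]*) echo "unparseable port in '$cache_endpoint'" >&2; return 1 ;;
  esac
  if [ -z "$cache_host" ]; then
    echo "missing host in '$cache_endpoint'" >&2
    return 1
  fi
  echo "$cache_host $cache_port"
}

parse_cache_endpoint 'valkey:6379,password=example'   # prints "valkey 6379"
```

The resulting host and port are what a TCP probe such as `nc -zv` would then be pointed at.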
|
||||
|
||||
## Why It Matters
|
||||
Cache unavailability affects queue coordination, state caching, and latency-sensitive platform features. A malformed connection string is also an early warning that the environment is not wired correctly.
|
||||
|
||||
## Common Causes
|
||||
- The cache connection string is missing, malformed, or still points to a previous environment
|
||||
- The Valkey/Redis service is not running
|
||||
- Container networking or DNS is broken
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```yaml
|
||||
services:
|
||||
doctor-web:
|
||||
environment:
|
||||
Valkey__ConnectionString: valkey:6379,password=${STELLAOPS_VALKEY_PASSWORD}
|
||||
```
|
||||
|
||||
```bash
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml ps valkey
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web sh -lc "nc -zv valkey 6379"
|
||||
```
|
||||
|
||||
### Bare Metal / systemd
|
||||
```bash
|
||||
redis-cli -h <valkey-host> -p 6379 ping
|
||||
```
|
||||
|
||||
### Kubernetes / Helm
|
||||
Use a cluster-local service name in the connection string and verify the port exposed by the StatefulSet or Service.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
stella doctor --check check.servicegraph.valkey
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.servicegraph.mq` - both checks validate internal service-network connectivity
|
||||
- `check.servicegraph.endpoints` - broad service discovery issues usually affect cache endpoints too
|
||||
@@ -0,0 +1,56 @@
|
||||
---
|
||||
checkId: check.verification.artifact.pull
|
||||
plugin: stellaops.doctor.verification
|
||||
severity: fail
|
||||
tags: [verification, artifact, registry, supply-chain]
|
||||
---
|
||||
# Test Artifact Pull
|
||||
|
||||
## What It Checks
|
||||
Requires the verification plugin to be enabled and a test artifact to be configured with either `Doctor:Plugins:Verification:TestArtifact:Reference` or `Doctor:Plugins:Verification:TestArtifact:OfflineBundlePath`.
|
||||
|
||||
For offline mode it checks the bundle file exists. For online mode it performs a registry `HEAD` request against the OCI manifest and optionally compares the returned digest to the expected digest.
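
For the online path, the manifest location follows the OCI Distribution layout `/v2/<repository>/manifests/<reference>`. The sketch below shows how a reference could be turned into that URL. It is an illustration, not Doctor's implementation: `manifest_url` is a hypothetical helper, it assumes the reference always starts with a registry host, and `sha256:abc123` is a placeholder rather than a real digest.

```bash
# manifest_url REFERENCE -> https://<registry>/v2/<repository>/manifests/<digest-or-tag>
# Hypothetical helper; assumes the reference begins with a registry host.
manifest_url() {
  oci_ref=$1
  oci_registry=${oci_ref%%/*}     # text before the first slash
  oci_rest=${oci_ref#*/}          # repository plus @digest or :tag
  case $oci_rest in
    *@*) oci_repo=${oci_rest%%@*}; oci_target=${oci_rest#*@} ;;
    *:*) oci_repo=${oci_rest%:*};  oci_target=${oci_rest##*:} ;;
    *)   oci_repo=$oci_rest;       oci_target=latest ;;   # assumed default tag
  esac
  echo "https://${oci_registry}/v2/${oci_repo}/manifests/${oci_target}"
}

manifest_url "ghcr.io/example/app@sha256:abc123"
# prints https://ghcr.io/v2/example/app/manifests/sha256:abc123
```

A manual `HEAD` against that URL (for example with `curl -I` and an `Accept` header for the manifest media type) typically also needs a registry bearer token, so an authorization failure here usually points at credentials rather than at the artifact.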
|
||||
|
||||
## Why It Matters
|
||||
The rest of the verification pipeline is meaningless if Doctor cannot retrieve the artifact it is supposed to validate.
|
||||
|
||||
## Common Causes
|
||||
- No test artifact reference or offline bundle path is configured
|
||||
- Registry credentials are missing or do not allow manifest access
|
||||
- The artifact digest or tag points to content that no longer exists
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```yaml
|
||||
services:
|
||||
doctor-web:
|
||||
environment:
|
||||
Doctor__Plugins__Verification__Enabled: "true"
|
||||
Doctor__Plugins__Verification__TestArtifact__Reference: ghcr.io/example/app@sha256:<digest>
|
||||
```
|
||||
|
||||
For air-gapped mode:
|
||||
|
||||
```yaml
|
||||
Doctor__Plugins__Verification__TestArtifact__OfflineBundlePath: /var/lib/stella/verification/offline-bundle.json
|
||||
```
|
||||
|
||||
```bash
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web crane manifest ghcr.io/example/app@sha256:<digest>
|
||||
```
|
||||
|
||||
### Bare Metal / systemd
|
||||
Use an immutable digest reference instead of a mutable tag whenever possible.
|
||||
|
||||
### Kubernetes / Helm
|
||||
Mount registry credentials into the Doctor workload. If the cluster is disconnected, also mount the offline bundle path.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
stella doctor --check check.verification.artifact.pull
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.verification.signature` - signature validation depends on the same artifact input
|
||||
- `check.integration.oci.pull` - registry authorization issues often show up there too
|
||||
@@ -0,0 +1,50 @@
|
||||
---
|
||||
checkId: check.verification.policy.engine
|
||||
plugin: stellaops.doctor.verification
|
||||
severity: fail
|
||||
tags: [verification, policy, vex, compliance]
|
||||
---
|
||||
# Policy Engine Evaluation
|
||||
|
||||
## What It Checks
|
||||
Requires the verification plugin plus a configured test artifact. In offline mode it looks for policy results inside the exported bundle. In online mode it validates `Policy:Engine:Enabled`, a policy reference, and `Policy:VexAware`.
|
||||
|
||||
The check fails when the policy engine is disabled, warns when no policy reference is configured or when VEX-aware evaluation is off, and passes when the prerequisites are present.
|
||||
|
||||
## Why It Matters
|
||||
Release verification is only trustworthy if the same policy engine and VEX rules used in production can be exercised by Doctor.
|
||||
|
||||
## Common Causes
|
||||
- `Policy__Engine__Enabled` is false
|
||||
- No default or test policy reference is configured
|
||||
- Policy rules were not updated to account for VEX justifications
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```yaml
|
||||
services:
|
||||
doctor-web:
|
||||
environment:
|
||||
Policy__Engine__Enabled: "true"
|
||||
Policy__DefaultPolicyRef: policy://default/release-gate
|
||||
Policy__VexAware: "true"
|
||||
Doctor__Plugins__Verification__PolicyTest__PolicyRef: policy://default/release-gate
|
||||
```
|
||||
|
||||
If you use offline verification, export the bundle with policy data included before copying it into the air-gapped environment.
|
||||
|
||||
### Bare Metal / systemd
|
||||
Keep the Doctor policy reference aligned with the policy engine configuration used by release orchestration.
|
||||
|
||||
### Kubernetes / Helm
|
||||
Store the policy reference in a ConfigMap and use the same value for both the policy engine and the Doctor service.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
stella doctor --check check.verification.policy.engine
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.verification.vex.validation` - VEX-aware policy only helps if VEX collection works
|
||||
- `check.verification.sbom.validation` - policy evaluation usually consumes SBOM and vulnerability evidence
|
||||
@@ -0,0 +1,52 @@
|
||||
---
|
||||
checkId: check.verification.sbom.validation
|
||||
plugin: stellaops.doctor.verification
|
||||
severity: fail
|
||||
tags: [verification, sbom, cyclonedx, spdx]
|
||||
---
|
||||
# SBOM Validation
|
||||
|
||||
## What It Checks
|
||||
Requires the verification plugin plus a test artifact. In offline mode it looks for CycloneDX or SPDX JSON inside the bundle. In online mode it checks whether `Scanner:SbomGeneration:Enabled` or `Attestor:SbomAttestation:Enabled` is turned on.
|
||||
|
||||
The check warns when SBOM generation and attestation are both disabled, and fails when the offline bundle is missing or contains no recognizable SBOM.
|
||||
|
||||
## Why It Matters
|
||||
SBOMs are the input for downstream vulnerability analysis, policy decisions, and customer evidence exports. If SBOM generation is off, release evidence is incomplete.
|
||||
|
||||
## Common Causes
|
||||
- The build pipeline is not producing SBOMs
|
||||
- SBOM attestation is disabled even though verification expects it
|
||||
- Offline bundles were exported without `--include-sbom`
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```yaml
|
||||
services:
|
||||
doctor-web:
|
||||
environment:
|
||||
Scanner__SbomGeneration__Enabled: "true"
|
||||
Attestor__SbomAttestation__Enabled: "true"
|
||||
```
|
||||
|
||||
For offline mode:
|
||||
|
||||
```bash
|
||||
stella verification bundle export --include-sbom --output /var/lib/stella/verification/offline-bundle.json
|
||||
```
|
||||
|
||||
### Bare Metal / systemd
|
||||
Enable SBOM generation in the scanner and keep artifact attachments immutable once published.
|
||||
|
||||
### Kubernetes / Helm
|
||||
Mount the same scanner and attestor config into Doctor that the production verification pipeline uses.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
stella doctor --check check.verification.sbom.validation
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.verification.artifact.pull` - the artifact must be reachable before attached SBOMs can be validated
|
||||
- `check.verification.policy.engine` - policy rules commonly consume SBOM-derived vulnerability data
|
||||
56
docs/doctor/articles/verification/verification-signature.md
Normal file
@@ -0,0 +1,56 @@
|
||||
---
|
||||
checkId: check.verification.signature
|
||||
plugin: stellaops.doctor.verification
|
||||
severity: fail
|
||||
tags: [verification, signatures, dsse, rekor]
|
||||
---
|
||||
# Signature Verification
|
||||
|
||||
## What It Checks
|
||||
Requires the verification plugin plus a test artifact. In offline mode it looks for DSSE-style signature material in the bundle. In online mode it checks `Sigstore:Enabled` and verifies the Rekor log endpoint is reachable.
|
||||
|
||||
The check reports info when Sigstore is disabled, and fails when the offline bundle is missing or Rekor cannot be reached.
|
||||
|
||||
## Why It Matters
|
||||
Signature verification is the minimum control that proves the artifact under review was produced and signed through the expected supply-chain path.
|
||||
|
||||
## Common Causes
|
||||
- `Sigstore__Enabled` is false
|
||||
- Rekor URL is unreachable from the Doctor workload
|
||||
- Offline bundles were exported without signatures
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```yaml
|
||||
services:
|
||||
doctor-web:
|
||||
environment:
|
||||
Sigstore__Enabled: "true"
|
||||
Sigstore__RekorUrl: https://rekor.sigstore.dev
|
||||
```
|
||||
|
||||
```bash
|
||||
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS https://rekor.sigstore.dev/api/v1/log
|
||||
```
|
||||
|
||||
For offline verification:
|
||||
|
||||
```bash
|
||||
stella verification bundle export --include-signatures --output /var/lib/stella/verification/offline-bundle.json
|
||||
```
|
||||
|
||||
### Bare Metal / systemd
|
||||
Ensure the Doctor host trusts the CA chain used by the Rekor endpoint or use the approved internal Rekor deployment.
|
||||
|
||||
### Kubernetes / Helm
|
||||
Prefer an internal Rekor service URL in disconnected or regulated clusters.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
stella doctor --check check.verification.signature
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.attestation.rekor.connectivity` - validates the transparency log path more directly
|
||||
- `check.verification.artifact.pull` - signature checks need a reachable artifact reference
|
||||
@@ -0,0 +1,52 @@
|
||||
---
|
||||
checkId: check.verification.vex.validation
|
||||
plugin: stellaops.doctor.verification
|
||||
severity: fail
|
||||
tags: [verification, vex, csaf, openvex]
|
||||
---
|
||||
# VEX Validation
|
||||
|
||||
## What It Checks
|
||||
Requires the verification plugin plus a test artifact. In offline mode it looks for OpenVEX, CSAF VEX, or CycloneDX VEX content inside the bundle. In online mode it validates `VexHub:Collection:Enabled` and at least one configured VEX feed URL.
|
||||
|
||||
The check reports info when VEX collection is disabled, warns when feeds are missing, and fails only for unusable offline bundle inputs.
|
||||
|
||||
## Why It Matters
|
||||
VEX data is what allows policy to distinguish exploitable findings from known-not-affected cases. Without it, release gates become overly noisy or overly permissive.
|
||||
|
||||
## Common Causes
|
||||
- `VexHub__Collection__Enabled` is false
|
||||
- Vendor or internal VEX feeds were never configured
|
||||
- Offline bundles were exported without `--include-vex`
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```yaml
|
||||
services:
|
||||
doctor-web:
|
||||
environment:
|
||||
VexHub__Collection__Enabled: "true"
|
||||
VexHub__Feeds__0__Url: https://vendor.example/vex.json
|
||||
```
|
||||
|
||||
For offline mode:
|
||||
|
||||
```bash
|
||||
stella verification bundle export --include-vex --output /var/lib/stella/verification/offline-bundle.json
|
||||
```
|
||||
|
||||
### Bare Metal / systemd
|
||||
Keep VEX feeds in a controlled mirror if the environment cannot reach upstream vendors directly.
|
||||
|
||||
### Kubernetes / Helm
|
||||
Mount VEX feed configuration from the same source used by the running VexHub deployment.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
stella doctor --check check.verification.vex.validation
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.verification.policy.engine` - VEX-aware policy is only as good as the VEX data it receives
|
||||
- `check.verification.sbom.validation` - VEX statements refer to components identified in the SBOM
|
||||
@@ -1,181 +0,0 @@
|
||||
# Sprint 20260326_001 — Doctor Health Checks Documentation
|
||||
|
||||
## Topic & Scope
|
||||
- Document every Doctor health check (99 checks across 16 plugins) with precise, actionable remediation.
|
||||
- Each check must have: what it tests, why it matters, exact fix steps, Docker compose specifics, and verification.
|
||||
- Fix false-positive checks that fail on default Docker compose installations.
|
||||
- Working directory: `docs/modules/doctor/`, `src/Doctor/__Plugins/`
|
||||
- Expected evidence: docs, improved check messages, tests.
|
||||
|
||||
## Dependencies & Concurrency
|
||||
- No upstream dependencies. Can be parallelized by plugin.
|
||||
- Depends on the 4 check code fixes already applied (RequiredSettings, EnvironmentVariables, SecretsConfiguration, DockerSocket).
|
||||
|
||||
## Documentation Prerequisites
|
||||
- `docs/modules/doctor/architecture.md` — existing Doctor architecture overview
|
||||
- `docs/modules/doctor/registry-checks.md` — existing check registry reference
|
||||
- `devops/compose/docker-compose.stella-ops.yml` — the reference deployment
|
||||
|
||||
## Delivery Tracker
|
||||
|
||||
### DOC-001 - Create check reference index
|
||||
Status: TODO
|
||||
Dependency: none
|
||||
Owners: Documentation author
|
||||
Task description:
|
||||
- Create `docs/modules/doctor/checks/README.md` with a master table of all 99 checks
|
||||
- Columns: Check ID, Plugin, Category, Severity, Summary, Docker Compose Status (Pass/Warn/Fail/N/A)
|
||||
- Group by plugin (Core, Security, Docker, Agent, Attestor, Auth, etc.)
|
||||
- Include quick-reference severity legend
|
||||
|
||||
Completion criteria:
|
||||
- [ ] All 99 checks listed with correct metadata
|
||||
- [ ] Docker Compose Status column filled from actual test run
|
||||
|
||||
### DOC-002 - Core Plugin checks documentation (9 checks)
|
||||
Status: TODO
|
||||
Dependency: DOC-001
|
||||
Owners: Documentation author
|
||||
Task description:
|
||||
- Create `docs/modules/doctor/checks/core.md`
|
||||
- Document each check:
|
||||
- **check.core.config.required**: What settings are checked, key variants (colon vs `__`), compose env var names, how to add missing settings
|
||||
- **check.core.env.variables**: Which env vars are checked, why `ASPNETCORE_ENVIRONMENT` may not be set in compose, when this is OK
|
||||
- **check.core.health.endpoint**: Health endpoint configuration
|
||||
- **check.core.memory**: Memory threshold configuration
|
||||
- **check.core.startup.time**: Expected startup time ranges
|
||||
- Each remaining core check
|
||||
- For each check: Symptom → Root Cause → Fix → Verify
|
||||
|
||||
Completion criteria:
|
||||
- [ ] Each check has: description, what it tests, severity, fix steps, Docker compose notes, verification command
|
||||
|
||||
### DOC-003 - Security Plugin checks documentation
|
||||
Status: TODO
|
||||
Dependency: DOC-001
|
||||
Owners: Documentation author
|
||||
Task description:
|
||||
- Create `docs/modules/doctor/checks/security.md`
|
||||
- Document: check.security.secrets, check.security.tls, check.security.cors, check.security.headers
|
||||
- Include: which keys are considered "secrets" vs DSNs, vault provider configuration, development vs production guidance
|
||||
|
||||
Completion criteria:
|
||||
- [ ] Each check documented with fix steps and Docker compose notes
|
||||
|
||||
### DOC-004 - Docker Plugin checks documentation
|
||||
Status: TODO
|
||||
Dependency: DOC-001
|
||||
Owners: Documentation author
|
||||
Task description:
|
||||
- Create `docs/modules/doctor/checks/docker.md`
|
||||
- Document: check.docker.socket, check.docker.daemon, check.docker.images
|
||||
- Include: container-vs-host detection, socket mount instructions, Windows named pipe notes
|
||||
|
||||
Completion criteria:
|
||||
- [ ] Each check documented with container-aware behavior explained
|
||||
|
||||
### DOC-005 - Agent Plugin checks documentation (11 checks)
|
||||
Status: TODO
|
||||
Dependency: DOC-001
|
||||
Owners: Documentation author
|
||||
Task description:
|
||||
- Create `docs/modules/doctor/checks/agent.md`
|
||||
- Document all 11 agent checks: capacity, certificates, cluster health/quorum, heartbeat, resources, versions, stale detection, task failure rate, task backlog
|
||||
|
||||
Completion criteria:
|
||||
- [ ] Each check documented with thresholds, configuration options, fix steps
|
||||
|
||||
### DOC-006 - Attestor Plugin checks documentation (6 checks)
Status: TODO
Dependency: DOC-001
Owners: Documentation author
Task description:
- Create `docs/modules/doctor/checks/attestor.md`
- Document: cosign key material, clock skew, Rekor connectivity/verification, signing key expiration, transparency log consistency

Completion criteria:
- [ ] Each check documented including air-gap/offline scenarios

### DOC-007 - Auth Plugin checks documentation (4 checks)
Status: TODO
Dependency: DOC-001
Owners: Documentation author
Task description:
- Create `docs/modules/doctor/checks/auth.md`
- Document: auth configuration, OIDC provider connectivity, signing key health, token service health

Completion criteria:
- [ ] Each check documented with OIDC troubleshooting steps

### DOC-008 - Remaining plugins documentation
Status: TODO
Dependency: DOC-001
Owners: Documentation author
Task description:
- Create one doc per remaining plugin:
  - `docs/modules/doctor/checks/binary-analysis.md` (6 checks)
  - `docs/modules/doctor/checks/compliance.md` (7 checks)
  - `docs/modules/doctor/checks/crypto.md` (6 checks)
  - `docs/modules/doctor/checks/environment.md` (6 checks)
  - `docs/modules/doctor/checks/evidence-locker.md` (4 checks)
  - `docs/modules/doctor/checks/observability.md` (4 checks)
  - `docs/modules/doctor/checks/notify.md` (9 checks)
  - `docs/modules/doctor/checks/operations.md` (3 checks)
  - `docs/modules/doctor/checks/policy.md` (1 check)
  - `docs/modules/doctor/checks/postgres.md` (3 checks)
  - `docs/modules/doctor/checks/release.md` (6 checks)
  - `docs/modules/doctor/checks/scanner.md` (7 checks)
  - `docs/modules/doctor/checks/storage.md` (3 checks)
  - `docs/modules/doctor/checks/timestamping.md` (9 checks)
  - `docs/modules/doctor/checks/vex.md` (3 checks)

Completion criteria:
- [ ] Every check across all 16 plugins documented

### DOC-009 - Improve check remediation messages in code
Status: TODO
Dependency: DOC-002 through DOC-008
Owners: Developer
Task description:
- For each check, update the `WithRemediation()` steps to include:
  - Exact commands (not vague "configure X")
  - Docker compose env var names (using `__` separator)
  - File paths relative to the compose directory
  - Link to the documentation page (e.g., "See docs/modules/doctor/checks/core.md")
- Update `WithCauses()` to be specific, not generic

Completion criteria:
- [ ] All 99 checks have precise, copy-pasteable remediation steps
- [ ] No check reports a generic "configure X" without specifying how
- [ ] Docker compose installations pass all checks that should pass

### DOC-010 - Docker compose default pass baseline
Status: TODO
Dependency: DOC-009
Owners: QA / Test Automation
Task description:
- Run all 99 Doctor checks against a fresh `docker compose up` installation
- Document which checks MUST pass, which are expected warnings, which are N/A
- Create `docs/modules/doctor/compose-baseline.md` with the expected results
- Add any remaining code fixes for false positives

Completion criteria:
- [ ] Baseline document created
- [ ] Zero false-positive FAILs on fresh Docker compose install
- [ ] All WARN checks documented as expected or fixed

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-26 | Sprint created. 4 code fixes applied (RequiredSettings, EnvironmentVariables, SecretsConfiguration, DockerSocket). | Planning |

## Decisions & Risks
- Risk: 99 checks is a large documentation surface. Parallelize by plugin.
- Decision: Each plugin gets its own doc file for maintainability.
- Decision: Remediation messages in code should link to docs, not duplicate full instructions.

## Next Checkpoints
- DOC-001 (index): 1 day
- DOC-002 through DOC-008 (all plugin docs): 3-5 days
- DOC-009 (code remediation improvements): 2 days
- DOC-010 (baseline): 1 day
188
docs/modules/doctor/checks/README.md
Normal file
@@ -0,0 +1,188 @@
# Doctor Runtime Check Index

## Scope
- Runtime catalog source: `GET /api/v1/doctor/checks` on 2026-03-31.
- Docker compose baseline source: run `dr_20260331_195122_99ff09` captured from the locally running default stack.
- Canonical remediation content lives in `docs/doctor/articles/**`; this index maps the live runtime catalog to those articles.

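The article links in this index are relative to the index file, so they can be sanity-checked by resolving them against its directory. The following is a minimal sketch, assuming the index lives at the path shown in the file header; `resolve_article` is a hypothetical helper name, not part of Doctor.

```python
import posixpath

# Repository path of this index, taken from the file header of this change.
INDEX_PATH = "docs/modules/doctor/checks/README.md"


def resolve_article(link: str, index_path: str = INDEX_PATH) -> str:
    """Resolve a relative article link against the directory holding the index."""
    base_dir = posixpath.dirname(index_path)
    return posixpath.normpath(posixpath.join(base_dir, link))


# One link copied from the observability table in this index.
link = "../../../doctor/articles/observability/observability-alerting.md"
print(resolve_article(link))
# prints "docs/doctor/articles/observability/observability-alerting.md"
```

Every resolved path should fall under `docs/doctor/articles/`, which is where the Scope section says the canonical remediation content lives.
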
## Runtime Summary
| Plugin | Checks |
| --- | ---: |
| `stellaops.doctor.attestation` | 3 |
| `stellaops.doctor.binaryanalysis` | 6 |
| `stellaops.doctor.compliance` | 7 |
| `stellaops.doctor.core` | 9 |
| `stellaops.doctor.database` | 8 |
| `stellaops.doctor.docker` | 5 |
| `stellaops.doctor.environment` | 6 |
| `stellaops.doctor.integration` | 16 |
| `stellaops.doctor.observability` | 6 |
| `stellaops.doctor.release` | 6 |
| `stellaops.doctor.scanner` | 7 |
| `stellaops.doctor.security` | 11 |
| `stellaops.doctor.servicegraph` | 6 |
| `stellaops.doctor.verification` | 5 |

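The per-plugin counts can be cross-checked with a few lines of Python. This sketch copies the counts from the Runtime Summary table and also shows how the same table might be regenerated from a catalog payload. The `plugin` and `checkId` field names are an assumption borrowed from the article front matter, not a confirmed shape of the `GET /api/v1/doctor/checks` response.

```python
from collections import Counter

# Counts copied from the Runtime Summary table.
RUNTIME_SUMMARY = {
    "stellaops.doctor.attestation": 3,
    "stellaops.doctor.binaryanalysis": 6,
    "stellaops.doctor.compliance": 7,
    "stellaops.doctor.core": 9,
    "stellaops.doctor.database": 8,
    "stellaops.doctor.docker": 5,
    "stellaops.doctor.environment": 6,
    "stellaops.doctor.integration": 16,
    "stellaops.doctor.observability": 6,
    "stellaops.doctor.release": 6,
    "stellaops.doctor.scanner": 7,
    "stellaops.doctor.security": 11,
    "stellaops.doctor.servicegraph": 6,
    "stellaops.doctor.verification": 5,
}


def count_by_plugin(checks):
    """Return {plugin: number of checks}, sorted by plugin name."""
    counts = Counter(entry["plugin"] for entry in checks)
    return dict(sorted(counts.items()))


# Hypothetical catalog excerpt; the field names are assumed, see the note above.
sample = [
    {"checkId": "check.docker.daemon", "plugin": "stellaops.doctor.docker"},
    {"checkId": "check.docker.socket", "plugin": "stellaops.doctor.docker"},
    {"checkId": "check.observability.alerting", "plugin": "stellaops.doctor.observability"},
]

print(sum(RUNTIME_SUMMARY.values()), len(RUNTIME_SUMMARY))  # prints "101 14"
print(count_by_plugin(sample))
```

The first line of output matches the 101 checks across 14 plugins recorded in the compose baseline.
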
## Baseline Legend
- `pass`: expected healthy result in the captured compose baseline.
- `info`: informational only; not a release blocker in the captured baseline.
- `warn`: action needed or recommended; not a hard failure in the captured baseline.
- `fail`: baseline failure observed in the captured runtime.
- `skip`: not applicable in the captured runtime context.

## `stellaops.doctor.attestation`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.attestation.clock.skew` | `warn` | `warn` | [article](../../../doctor/articles/attestor/clock-skew.md) |
| `check.attestation.cosign.keymaterial` | `fail` | `skip` | [article](../../../doctor/articles/attestor/cosign-keymaterial.md) |
| `check.attestation.rekor.connectivity` | `fail` | `skip` | [article](../../../doctor/articles/attestor/rekor-connectivity.md) |

## `stellaops.doctor.binaryanalysis`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.binaryanalysis.buildinfo.cache` | `warn` | `warn` | [article](../../../doctor/articles/binary-analysis/buildinfo-cache.md) |
| `check.binaryanalysis.corpus.kpi.baseline` | `warn` | `warn` | [article](../../../doctor/articles/binary-analysis/kpi-baseline-exists.md) |
| `check.binaryanalysis.corpus.mirror.freshness` | `warn` | `warn` | [article](../../../doctor/articles/binary-analysis/corpus-mirror-freshness.md) |
| `check.binaryanalysis.ddeb.enabled` | `warn` | `warn` | [article](../../../doctor/articles/binary-analysis/ddeb-repo-enabled.md) |
| `check.binaryanalysis.debuginfod.available` | `warn` | `info` | [article](../../../doctor/articles/binary-analysis/debuginfod-availability.md) |
| `check.binaryanalysis.symbol.recovery.fallback` | `warn` | `info` | [article](../../../doctor/articles/binary-analysis/symbol-recovery-fallback.md) |

## `stellaops.doctor.compliance`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.compliance.attestation-signing` | `fail` | `skip` | [article](../../../doctor/articles/compliance/attestation-signing.md) |
| `check.compliance.audit-readiness` | `warn` | `skip` | [article](../../../doctor/articles/compliance/audit-readiness.md) |
| `check.compliance.evidence-integrity` | `fail` | `skip` | [article](../../../doctor/articles/compliance/evidence-integrity.md) |
| `check.compliance.evidence-rate` | `fail` | `skip` | [article](../../../doctor/articles/compliance/evidence-rate.md) |
| `check.compliance.export-readiness` | `warn` | `skip` | [article](../../../doctor/articles/compliance/export-readiness.md) |
| `check.compliance.framework` | `warn` | `skip` | [article](../../../doctor/articles/compliance/framework.md) |
| `check.compliance.provenance-completeness` | `fail` | `skip` | [article](../../../doctor/articles/compliance/provenance-completeness.md) |

## `stellaops.doctor.core`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.core.auth.config` | `warn` | `skip` | [article](../../../doctor/articles/core/auth-config.md) |
| `check.core.config.loaded` | `fail` | `pass` | [article](../../../doctor/articles/core/config-loaded.md) |
| `check.core.config.required` | `fail` | `fail` | [article](../../../doctor/articles/core/config-required.md) |
| `check.core.crypto.available` | `fail` | `pass` | [article](../../../doctor/articles/core/crypto-available.md) |
| `check.core.env.diskspace` | `fail` | `pass` | [article](../../../doctor/articles/core/env-diskspace.md) |
| `check.core.env.memory` | `warn` | `pass` | [article](../../../doctor/articles/core/env-memory.md) |
| `check.core.env.variables` | `warn` | `warn` | [article](../../../doctor/articles/core/env-variables.md) |
| `check.core.services.dependencies` | `fail` | `pass` | [article](../../../doctor/articles/core/services-dependencies.md) |
| `check.core.services.health` | `fail` | `skip` | [article](../../../doctor/articles/core/services-health.md) |

## `stellaops.doctor.database`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.db.connection` | `fail` | `skip` | [article](../../../doctor/articles/postgres/db-connection.md) |
| `check.db.latency` | `fail` | `skip` | [article](../../../doctor/articles/postgres/db-latency.md) |
| `check.db.migrations.failed` | `fail` | `skip` | [article](../../../doctor/articles/postgres/db-migrations-failed.md) |
| `check.db.migrations.pending` | `warn` | `skip` | [article](../../../doctor/articles/postgres/db-migrations-pending.md) |
| `check.db.permissions` | `fail` | `skip` | [article](../../../doctor/articles/postgres/db-permissions.md) |
| `check.db.pool.health` | `fail` | `skip` | [article](../../../doctor/articles/postgres/db-pool-health.md) |
| `check.db.pool.size` | `warn` | `skip` | [article](../../../doctor/articles/postgres/db-pool-size.md) |
| `check.db.schema.version` | `fail` | `skip` | [article](../../../doctor/articles/postgres/db-schema-version.md) |

## `stellaops.doctor.docker`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.docker.apiversion` | `warn` | `skip` | [article](../../../doctor/articles/docker/apiversion.md) |
| `check.docker.daemon` | `fail` | `fail` | [article](../../../doctor/articles/docker/daemon.md) |
| `check.docker.network` | `warn` | `skip` | [article](../../../doctor/articles/docker/network.md) |
| `check.docker.socket` | `fail` | `fail` | [article](../../../doctor/articles/docker/socket.md) |
| `check.docker.storage` | `warn` | `skip` | [article](../../../doctor/articles/docker/storage.md) |

## `stellaops.doctor.environment`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.environment.capacity` | `warn` | `skip` | [article](../../../doctor/articles/environment/environment-capacity.md) |
| `check.environment.connectivity` | `warn` | `skip` | [article](../../../doctor/articles/environment/environment-connectivity.md) |
| `check.environment.deployments` | `warn` | `skip` | [article](../../../doctor/articles/environment/environment-deployment-health.md) |
| `check.environment.drift` | `warn` | `skip` | [article](../../../doctor/articles/environment/environment-drift.md) |
| `check.environment.network.policy` | `warn` | `skip` | [article](../../../doctor/articles/environment/environment-network-policy.md) |
| `check.environment.secrets` | `warn` | `skip` | [article](../../../doctor/articles/environment/environment-secret-health.md) |

## `stellaops.doctor.integration`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.integration.ci.system` | `warn` | `skip` | [article](../../../doctor/articles/integration/ci-system-connectivity.md) |
| `check.integration.git` | `warn` | `skip` | [article](../../../doctor/articles/integration/git-provider-api.md) |
| `check.integration.ldap` | `warn` | `skip` | [article](../../../doctor/articles/integration/ldap-connectivity.md) |
| `check.integration.oci.capabilities` | `info` | `skip` | [article](../../../doctor/articles/integration/registry-capability-probe.md) |
| `check.integration.oci.credentials` | `fail` | `skip` | [article](../../../doctor/articles/integration/registry-credentials.md) |
| `check.integration.oci.pull` | `fail` | `skip` | [article](../../../doctor/articles/integration/registry-pull-authorization.md) |
| `check.integration.oci.push` | `fail` | `skip` | [article](../../../doctor/articles/integration/registry-push-authorization.md) |
| `check.integration.oci.referrers` | `warn` | `skip` | [article](../../../doctor/articles/integration/registry-referrers-api.md) |
| `check.integration.oci.registry` | `warn` | `skip` | [article](../../../doctor/articles/integration/oci-registry-connectivity.md) |
| `check.integration.oidc` | `warn` | `skip` | [article](../../../doctor/articles/integration/oidc-provider.md) |
| `check.integration.s3.storage` | `warn` | `skip` | [article](../../../doctor/articles/integration/object-storage.md) |
| `check.integration.secrets.manager` | `fail` | `skip` | [article](../../../doctor/articles/integration/secrets-manager-connectivity.md) |
| `check.integration.slack` | `info` | `skip` | [article](../../../doctor/articles/integration/slack-webhook.md) |
| `check.integration.smtp` | `warn` | `skip` | [article](../../../doctor/articles/integration/smtp-connectivity.md) |
| `check.integration.teams` | `info` | `skip` | [article](../../../doctor/articles/integration/teams-webhook.md) |
| `check.integration.webhooks` | `warn` | `skip` | [article](../../../doctor/articles/integration/webhook-health.md) |

## `stellaops.doctor.observability`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.observability.alerting` | `info` | `info` | [article](../../../doctor/articles/observability/observability-alerting.md) |
| `check.observability.healthchecks` | `warn` | `pass` | [article](../../../doctor/articles/observability/observability-healthchecks.md) |
| `check.observability.logging` | `warn` | `warn` | [article](../../../doctor/articles/observability/observability-logging.md) |
| `check.observability.metrics` | `warn` | `info` | [article](../../../doctor/articles/observability/observability-metrics.md) |
| `check.observability.otel` | `warn` | `info` | [article](../../../doctor/articles/observability/observability-otel.md) |
| `check.observability.tracing` | `warn` | `pass` | [article](../../../doctor/articles/observability/observability-tracing.md) |

## `stellaops.doctor.release`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.release.active` | `warn` | `skip` | [article](../../../doctor/articles/release/active.md) |
| `check.release.configuration` | `warn` | `skip` | [article](../../../doctor/articles/release/configuration.md) |
| `check.release.environment.readiness` | `warn` | `skip` | [article](../../../doctor/articles/release/environment-readiness.md) |
| `check.release.promotion.gates` | `warn` | `skip` | [article](../../../doctor/articles/release/promotion-gates.md) |
| `check.release.rollback.readiness` | `warn` | `skip` | [article](../../../doctor/articles/release/rollback-readiness.md) |
| `check.release.schedule` | `info` | `skip` | [article](../../../doctor/articles/release/schedule.md) |

## `stellaops.doctor.scanner`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.scanner.queue` | `warn` | `skip` | [article](../../../doctor/articles/scanner/queue.md) |
| `check.scanner.reachability` | `warn` | `skip` | [article](../../../doctor/articles/scanner/reachability.md) |
| `check.scanner.resources` | `warn` | `skip` | [article](../../../doctor/articles/scanner/resources.md) |
| `check.scanner.sbom` | `warn` | `skip` | [article](../../../doctor/articles/scanner/sbom.md) |
| `check.scanner.slice.cache` | `warn` | `skip` | [article](../../../doctor/articles/scanner/slice-cache.md) |
| `check.scanner.vuln` | `warn` | `skip` | [article](../../../doctor/articles/scanner/vuln.md) |
| `check.scanner.witness.graph` | `warn` | `skip` | [article](../../../doctor/articles/scanner/witness-graph.md) |

## `stellaops.doctor.security`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.security.apikey` | `warn` | `skip` | [article](../../../doctor/articles/security/apikey.md) |
| `check.security.audit.logging` | `warn` | `warn` | [article](../../../doctor/articles/security/audit-logging.md) |
| `check.security.cors` | `warn` | `warn` | [article](../../../doctor/articles/security/cors.md) |
| `check.security.encryption` | `warn` | `skip` | [article](../../../doctor/articles/security/encryption.md) |
| `check.security.evidence.integrity` | `fail` | `skip` | [article](../../../doctor/articles/security/evidence-integrity.md) |
| `check.security.headers` | `warn` | `warn` | [article](../../../doctor/articles/security/headers.md) |
| `check.security.jwt.config` | `fail` | `skip` | [article](../../../doctor/articles/security/jwt-config.md) |
| `check.security.password.policy` | `warn` | `skip` | [article](../../../doctor/articles/security/password-policy.md) |
| `check.security.ratelimit` | `warn` | `info` | [article](../../../doctor/articles/security/ratelimit.md) |
| `check.security.secrets` | `fail` | `fail` | [article](../../../doctor/articles/security/secrets.md) |
| `check.security.tls.certificate` | `fail` | `pass` | [article](../../../doctor/articles/security/tls-certificate.md) |

## `stellaops.doctor.servicegraph`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.servicegraph.backend` | `fail` | `skip` | [article](../../../doctor/articles/servicegraph/servicegraph-backend.md) |
| `check.servicegraph.circuitbreaker` | `warn` | `info` | [article](../../../doctor/articles/servicegraph/servicegraph-circuitbreaker.md) |
| `check.servicegraph.endpoints` | `fail` | `skip` | [article](../../../doctor/articles/servicegraph/servicegraph-endpoints.md) |
| `check.servicegraph.mq` | `warn` | `skip` | [article](../../../doctor/articles/servicegraph/servicegraph-mq.md) |
| `check.servicegraph.timeouts` | `warn` | `pass` | [article](../../../doctor/articles/servicegraph/servicegraph-timeouts.md) |
| `check.servicegraph.valkey` | `warn` | `pass` | [article](../../../doctor/articles/servicegraph/servicegraph-valkey.md) |

## `stellaops.doctor.verification`
| Check ID | Severity | Baseline | Article |
| --- | --- | --- | --- |
| `check.verification.artifact.pull` | `fail` | `skip` | [article](../../../doctor/articles/verification/verification-artifact-pull.md) |
| `check.verification.policy.engine` | `fail` | `skip` | [article](../../../doctor/articles/verification/verification-policy-engine.md) |
| `check.verification.sbom.validation` | `fail` | `skip` | [article](../../../doctor/articles/verification/verification-sbom-validation.md) |
| `check.verification.signature` | `fail` | `skip` | [article](../../../doctor/articles/verification/verification-signature.md) |
| `check.verification.vex.validation` | `fail` | `skip` | [article](../../../doctor/articles/verification/verification-vex-validation.md) |
77
docs/modules/doctor/compose-baseline.md
Normal file
@@ -0,0 +1,77 @@
# Doctor Compose Baseline

## Evidence
- Runtime source: local default stack reachable at `http://127.1.0.26/api/v1/doctor`.
- Catalog snapshot: `GET /api/v1/doctor/checks` on 2026-03-31.
- Baseline run: `dr_20260331_195122_99ff09`.
- Duration: `12103ms`.

## Baseline Summary
| Status | Count |
| --- | ---: |
| `pass` | 10 |
| `info` | 7 |
| `warn` | 10 |
| `fail` | 4 |
| `skip` | 70 |
| `total` | 101 |

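A summary of this shape can be rebuilt from per-check results by tallying statuses. The sketch below is illustrative only: the `checkId` and `status` field names are assumptions about how a run's results might be represented, and `summarize` is a hypothetical helper. It also confirms that the counts in the table add up to the stated total.

```python
from collections import Counter

# Status vocabulary taken from the baseline legend in the check index.
STATUSES = ("pass", "info", "warn", "fail", "skip")

# Counts copied from the Baseline Summary table.
BASELINE_COUNTS = {"pass": 10, "info": 7, "warn": 10, "fail": 4, "skip": 70}


def summarize(results):
    """Tally results by status and append a total, rejecting unknown statuses."""
    counts = Counter(result["status"] for result in results)
    unknown = set(counts) - set(STATUSES)
    if unknown:
        raise ValueError(f"unexpected status values: {sorted(unknown)}")
    summary = {status: counts[status] for status in STATUSES}
    summary["total"] = sum(summary.values())
    return summary


# Four results taken from this baseline; the record shape is assumed.
sample = [
    {"checkId": "check.core.config.loaded", "status": "pass"},
    {"checkId": "check.docker.daemon", "status": "fail"},
    {"checkId": "check.observability.alerting", "status": "info"},
    {"checkId": "check.db.connection", "status": "skip"},
]

print(summarize(sample))
print(sum(BASELINE_COUNTS.values()))  # prints "101"
```
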
## Capture Notes
- This baseline was captured from the locally running default compose stack, not from a second fresh stack.
- A parallel `docker compose up` was not used because `devops/compose/docker-compose.stella-ops.yml` hardcodes container names, which would conflict with the already running environment.
- The runtime catalog currently exposes `101` checks across `14` plugins. That supersedes the stale sprint text that still referenced `99` checks across `16` plugins.

## Observed Failures
| Check ID | Diagnosis | Notes |
| --- | --- | --- |
| `check.core.config.required` | Missing 2 required setting(s) | Missing `ConnectionStrings:DefaultConnection` and `Logging:LogLevel:Default` in the captured runtime. |
| `check.docker.daemon` | Cannot connect to Docker daemon: Connection failed | Doctor ran without a reachable Docker daemon socket. |
| `check.docker.socket` | 1 Docker socket issue(s) | `/var/run/docker.sock` was absent in the captured container context. |
| `check.security.secrets` | 2 secrets management issue(s) found | The runtime reported no secrets provider plus a potential plain-text connection string. |

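The two settings reported by `check.core.config.required` are written with `:` separators, while compose environment blocks use the `__` separator. A minimal sketch of the translation follows, assuming the services read configuration through the standard .NET environment variable provider (which treats `__` as the hierarchy separator). `to_env_var` is a hypothetical helper, and whether supplying these variables clears the failure is not confirmed by the captured run.

```python
def to_env_var(config_key: str) -> str:
    """Translate a colon-delimited configuration key to its env var spelling."""
    return config_key.replace(":", "__")


# Keys reported as missing in the captured runtime.
missing = ["ConnectionStrings:DefaultConnection", "Logging:LogLevel:Default"]

for key in missing:
    print(f"{key} -> {to_env_var(key)}")
# prints:
# ConnectionStrings:DefaultConnection -> ConnectionStrings__DefaultConnection
# Logging:LogLevel:Default -> Logging__LogLevel__Default
```

The values themselves are deployment-specific and intentionally not shown here. Note that a connection string placed directly in a compose file is the kind of plain-text setting `check.security.secrets` reported.
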
## Observed Warnings
| Check ID | Diagnosis |
| --- | --- |
| `check.attestation.clock.skew` | System clock is off by 5.5 seconds (threshold: 5s) |
| `check.binaryanalysis.buildinfo.cache` | Debian buildinfo services are reachable but cache directory does not exist |
| `check.binaryanalysis.corpus.kpi.baseline` | KPI baseline directory does not exist: `/var/lib/stella/baselines` |
| `check.binaryanalysis.corpus.mirror.freshness` | Corpus mirrors directory does not exist: `/var/lib/stella/mirrors` |
| `check.binaryanalysis.ddeb.enabled` | Ubuntu ddeb repository is not configured but `ddebs.ubuntu.com` is reachable |
| `check.core.env.variables` | No environment configuration variables detected |
| `check.observability.logging` | 1 logging configuration issue(s) |
| `check.security.audit.logging` | 2 audit logging issue(s) |
| `check.security.cors` | 1 CORS configuration issue(s) found |
| `check.security.headers` | 5 security header(s) not configured |

## Observed Informational Results
| Check ID | Diagnosis |
| --- | --- |
| `check.binaryanalysis.debuginfod.available` | `DEBUGINFOD_URLS` not configured but default Fedora debuginfod is reachable |
| `check.binaryanalysis.symbol.recovery.fallback` | Symbol recovery operational with 1/3 sources available |
| `check.observability.alerting` | No alerting destinations configured |
| `check.observability.metrics` | Metrics configuration not found |
| `check.observability.otel` | OpenTelemetry endpoint not configured |
| `check.security.ratelimit` | Rate limiting configuration not found |
| `check.servicegraph.circuitbreaker` | Circuit breakers not configured |

## Healthy Baseline Results
The captured runtime returned `pass` for:

- `check.core.config.loaded`
- `check.core.crypto.available`
- `check.core.env.diskspace`
- `check.core.env.memory`
- `check.core.services.dependencies`
- `check.observability.healthchecks`
- `check.observability.tracing`
- `check.security.tls.certificate`
- `check.servicegraph.timeouts`
- `check.servicegraph.valkey`

## Skipped Checks
- `70` checks were skipped because the captured local stack did not provide the required runtime context, credentials, test artifacts, or dependent services.
- Skips are expected for the database, integration, release, scanner, and verification groups when the default local stack is not fully wired for end-to-end release validation. The compliance and environment groups were also skipped in full in this capture.

## Follow-Up
- Use [the runtime check index](./checks/README.md) to map each runtime check to its article.
- Rebuild and rerun the Doctor services before claiming a fresh-stack zero-false-positive baseline; this document only records the captured live baseline from 2026-03-31.