doctor: complete runtime check documentation sprint

Signed-off-by: master <>
This commit is contained in:
master
2026-03-31 23:26:24 +03:00
parent 404d50bcb7
commit 152c1b1357
54 changed files with 2210 additions and 258 deletions


@@ -0,0 +1,53 @@
---
checkId: check.observability.alerting
plugin: stellaops.doctor.observability
severity: info
tags: [observability, alerting, notifications]
---
# Alerting Configuration
## What It Checks
Looks for configured alert destinations such as Alertmanager, Slack, email recipients, or PagerDuty routing keys.
The check reports info when alerting is explicitly disabled or when no destination is configured. It warns only when a destination is present but obviously malformed, such as invalid email addresses.
## Why It Matters
Metrics and logs are not actionable if nobody is notified when thresholds are crossed. Production installs should route alerts somewhere outside the application process.
## Common Causes
- Alerting was never configured after initial compose bring-up
- Notification secrets were omitted from environment variables
- Recipient lists contain placeholders or invalid values
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web printenv | grep -E 'ALERT|SLACK|PAGERDUTY|SMTP'
```
Example compose-style configuration:
```yaml
services:
doctor-web:
environment:
Alerting__Enabled: "true"
Alerting__AlertManagerUrl: http://alertmanager:9093
Alerting__Email__Recipients__0: ops@example.com
```
### Bare Metal / systemd
Configure `Alerting:*` settings in the service configuration and ensure secrets come from the platform secrets provider rather than clear-text files.
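A systemd drop-in can supply the same `Alerting:*` values as environment overrides. This is a sketch; the unit name and file path are assumptions, not shipped artifacts:

```ini
# /etc/systemd/system/stellaops-doctor-web.service.d/alerting.conf
# Hypothetical drop-in; unit name and values are illustrative.
[Service]
Environment=Alerting__Enabled=true
Environment=Alerting__AlertManagerUrl=http://alertmanager:9093
# Prefer an EnvironmentFile with restricted permissions (or the platform
# secrets provider) over inlining webhook URLs or routing keys here.
```

After editing, run `systemctl daemon-reload` and restart the service so the overrides take effect.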
### Kubernetes / Helm
Store webhook URLs and routing keys in Secrets, then mount them into `Alerting:*` values.
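A sketch of wiring a Secret into an `Alerting:*` value via the container env; the Secret name, key, and setting path are assumptions:

```yaml
# Illustrative env wiring in the Deployment pod spec; the Secret
# "doctor-alerting" is assumed to be created out of band.
env:
  - name: Alerting__PagerDuty__RoutingKey
    valueFrom:
      secretKeyRef:
        name: doctor-alerting
        key: pagerduty-routing-key
```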
## Verification
```bash
stella doctor --check check.observability.alerting
```
## Related Checks
- `check.observability.metrics` - alerting is usually driven by metrics
- `check.observability.logging` - logs are the fallback when alerts are missing


@@ -0,0 +1,53 @@
---
checkId: check.observability.healthchecks
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, healthchecks, readiness, liveness]
---
# Health Check Endpoints
## What It Checks
Evaluates the configured health, readiness, and liveness paths and optionally probes `http://localhost:<port><path>` when a health-check port is configured.
The check warns when endpoints are unreachable, when timeouts are outside the `1s` to `60s` range, or when readiness and liveness collapse onto the same path.
## Why It Matters
Broken health probes turn into bad restart loops, failed rolling upgrades, and misleading orchestration signals.
## Common Causes
- The service exposes `/health` but not `/health/ready` or `/health/live`
- Health-check ports differ from the actual bound HTTP port
- Probe timeout values were copied from another service without validation
## How to Fix
### Docker Compose
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/health
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/health/ready
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/health/live
```
Set explicit paths and a reasonable timeout:
```yaml
HealthChecks__Path: /health
HealthChecks__ReadinessPath: /health/ready
HealthChecks__LivenessPath: /health/live
HealthChecks__Timeout: 30  # seconds; keep within the 1s-60s window
```
### Bare Metal / systemd
Verify reverse proxies and firewalls do not block the health port.
### Kubernetes / Helm
Point readiness and liveness probes at separate endpoints whenever startup and steady-state behavior differ.
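For example, probes pointed at the two separate endpoints; the port and threshold values here are assumptions to adapt per workload:

```yaml
# Illustrative probe stanza; port and timing values are assumptions.
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
```

Keeping liveness less aggressive than readiness avoids restart loops while the service is merely warming up.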
## Verification
```bash
stella doctor --check check.observability.healthchecks
```
## Related Checks
- `check.core.services.health` - aggregates the underlying ASP.NET health checks when available
- `check.observability.metrics` - shared listener misconfiguration can break both endpoints


@@ -0,0 +1,49 @@
---
checkId: check.observability.logging
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, logging, structured-logs]
---
# Logging Configuration
## What It Checks
Reads default and framework log levels and looks for structured logging via `Logging:Structured`, JSON console formatting, or a `Serilog` configuration section.
The check warns when default logging is `Debug` or `Trace`, when Microsoft categories are too verbose, or when structured logging is missing.
## Why It Matters
Unstructured logs slow incident response and make exports difficult to analyze. Overly verbose framework logging also drives storage growth and noise.
## Common Causes
- Only the default ASP.NET console logger is configured
- `Logging:Structured` or `Serilog` settings were omitted from compose values
- Troubleshooting log levels were left enabled in production
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Logging__LogLevel__Default: Information
Logging__LogLevel__Microsoft: Warning
Logging__Structured: "true"
```
If Serilog is used, make sure the console sink emits JSON or another structured format that downstream tooling can parse.
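One way to get JSON console output through Serilog's standard configuration keys, expressed as compose-style environment overrides (a sketch; the formatter assembly must actually be referenced by the application):

```yaml
# Assumes the app references Serilog.Settings.Configuration and
# Serilog.Formatting.Compact; treat both as assumptions to verify.
Serilog__WriteTo__0__Name: Console
Serilog__WriteTo__0__Args__formatter: "Serilog.Formatting.Compact.CompactJsonFormatter, Serilog.Formatting.Compact"
```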
### Bare Metal / systemd
Keep framework namespaces at `Warning` or stricter unless you are collecting short-lived debugging evidence.
### Kubernetes / Helm
Ensure log collectors expect the same output format the application emits.
## Verification
```bash
stella doctor --check check.observability.logging
```
## Related Checks
- `check.observability.alerting` - alerting often relies on structured log pipelines
- `check.security.audit.logging` - audit logs should follow the same transport and retention standards


@@ -0,0 +1,53 @@
---
checkId: check.observability.metrics
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, metrics, prometheus]
---
# Metrics Collection
## What It Checks
Inspects `Metrics:*`, `Prometheus:*`, and `OpenTelemetry:Metrics:*` settings. When a metrics port is configured and an `IHttpClientFactory` is available, the check probes `http://localhost:<port><path>`.
The check returns info when metrics are disabled or absent, and warns when the configured endpoint cannot be reached.
## Why It Matters
Metrics are the primary input for alerting, SLO tracking, and capacity planning. Missing or unreachable endpoints remove the fastest signal operators have.
## Common Causes
- Metrics were never enabled in the deployment configuration
- The metrics path or port does not match the listener exposed by the service
- A sidecar or reverse proxy blocks local probing
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Metrics__Enabled: "true"
Metrics__Path: /metrics
Metrics__Port: 8080
```
Probe the endpoint from inside the container:
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/metrics
```
### Bare Metal / systemd
Bind the metrics port explicitly if the service does not share the main HTTP listener.
### Kubernetes / Helm
Align the `ServiceMonitor` or Prometheus scrape config with the same path and port the app exposes.
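A minimal `ServiceMonitor` sketch aligned with the compose values above; the selector labels and port name are assumptions that must match your Service:

```yaml
# Illustrative ServiceMonitor (Prometheus Operator CRD); labels are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: doctor-web
spec:
  selector:
    matchLabels:
      app: doctor-web
  endpoints:
    - port: http        # must match the Service's named port
      path: /metrics    # keep in sync with Metrics__Path
```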
## Verification
```bash
stella doctor --check check.observability.metrics
```
## Related Checks
- `check.observability.otel` - OpenTelemetry metrics often share the same collector path
- `check.observability.alerting` - metrics are usually the source for alert rules


@@ -0,0 +1,52 @@
---
checkId: check.observability.otel
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, opentelemetry, tracing, metrics]
---
# OpenTelemetry Configuration
## What It Checks
Reads `OpenTelemetry:*`, `Telemetry:*`, and `OTEL_*` settings for endpoint, service name, tracing enablement, metrics enablement, and sampling ratio. When possible, it probes the collector host directly.
The check reports info when no OTLP endpoint is configured and warns when the service name is missing, tracing or metrics are disabled, sampling is too low, or the collector is unreachable.
## Why It Matters
OpenTelemetry is the main path for exporting traces and metrics to external systems. Broken collector settings silently remove cross-service visibility.
## Common Causes
- `OTEL_EXPORTER_OTLP_ENDPOINT` was omitted from compose or environment settings
- `OTEL_SERVICE_NAME` was never set
- Collector networking differs between local and deployed environments
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
OTEL_SERVICE_NAME: doctor-web
OpenTelemetry__Tracing__Enabled: "true"
OpenTelemetry__Metrics__Enabled: "true"
```
Probe the collector's OTLP/HTTP port (4318; the endpoint above uses the gRPC port 4317) from inside the container:
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://otel-collector:4318/
```
### Bare Metal / systemd
Keep the collector endpoint in the service unit or configuration file and verify firewalls allow traffic on the OTLP port.
### Kubernetes / Helm
Use cluster-local collector service names and inject `OTEL_SERVICE_NAME` per workload.
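Per-workload injection can reuse pod metadata so each Deployment reports its own service name; the collector DNS name and label key below are assumptions:

```yaml
# Illustrative per-workload OTEL env wiring in the pod spec.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector.observability.svc:4317
  - name: OTEL_SERVICE_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['app.kubernetes.io/name']
```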
## Verification
```bash
stella doctor --check check.observability.otel
```
## Related Checks
- `check.observability.tracing` - validates trace-specific tuning once OTLP export is wired
- `check.observability.metrics` - metrics export often shares the same collector


@@ -0,0 +1,48 @@
---
checkId: check.observability.tracing
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, tracing, correlation]
---
# Distributed Tracing
## What It Checks
Validates trace enablement, propagator, sampling ratio, exporter type, and whether HTTP and database instrumentation are turned on.
The check reports info when tracing is explicitly disabled and warns when sampling is invalid, too low, or when important instrumentation is turned off.
## Why It Matters
Tracing is the fastest way to understand cross-service latency and identify the exact hop that is failing. Disabling instrumentation removes that evidence.
## Common Causes
- Sampling ratio set to `0` during load testing and never restored
- Only outbound HTTP traces are enabled while database spans remain off
- Propagator or exporter defaults differ between services
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Tracing__Enabled: "true"
Tracing__SamplingRatio: "1.0"
Tracing__Instrumentation__Http: "true"
Tracing__Instrumentation__Database: "true"
```
### Bare Metal / systemd
Keep `Tracing:SamplingRatio` between `0.01` and `1.0` unless you are deliberately suppressing traces for a benchmark.
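A quick pre-restart sanity check for the ratio can be scripted; the variable name mirrors the compose key above and the bounds come from this guidance:

```shell
#!/bin/sh
# Validate that the configured sampling ratio falls in the recommended
# 0.01-1.0 window before restarting the service.
ratio="${Tracing__SamplingRatio:-1.0}"
if awk -v r="$ratio" 'BEGIN { exit !(r >= 0.01 && r <= 1.0) }'; then
  echo "sampling ratio $ratio is within range"
else
  echo "sampling ratio $ratio is out of range" >&2
fi
```
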
### Kubernetes / Helm
Propagate the same trace configuration across all services in the release path so correlation IDs remain intact.
## Verification
```bash
stella doctor --check check.observability.tracing
```
## Related Checks
- `check.observability.otel` - exporter connectivity must work before traces leave the process
- `check.servicegraph.timeouts` - tracing is most useful when diagnosing timeout-related issues