doctor: complete runtime check documentation sprint

Signed-off-by: master <>
2026-03-31 23:26:24 +03:00
parent 404d50bcb7
commit 152c1b1357
54 changed files with 2210 additions and 258 deletions
--- a/docs/doctor/articles/observability/observability-alerting.md
+++ b/docs/doctor/articles/observability/observability-alerting.md
@@ -0,0 +1,53 @@
+---
+checkId: check.observability.alerting
+plugin: stellaops.doctor.observability
+severity: info
+tags: [observability, alerting, notifications]
+---
+# Alerting Configuration
+
+## What It Checks
+Looks for configured alert destinations such as Alertmanager, Slack, email recipients, or PagerDuty routing keys.
+
+The check reports info when alerting is explicitly disabled or when no destination is configured. It warns only when a destination is present but clearly malformed, for example an invalid email address.
+
+## Why It Matters
+Metrics and logs are not actionable if nobody is notified when thresholds are crossed. Production installs should route alerts somewhere outside the application process.
+
+## Common Causes
+- Alerting was never configured after initial compose bring-up
+- Notification secrets were omitted from environment variables
+- Recipient lists contain placeholders or invalid values
+
+## How to Fix
+
+### Docker Compose
+```bash
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web printenv | grep -E 'ALERT|SLACK|PAGERDUTY|SMTP'
+```
+
+Example compose-style configuration:
+
+```yaml
+services:
+  doctor-web:
+    environment:
+      Alerting__Enabled: "true"
+      Alerting__AlertManagerUrl: http://alertmanager:9093
+      Alerting__Email__Recipients__0: ops@example.com
+```
+
+### Bare Metal / systemd
+Configure `Alerting:*` settings in the service configuration and ensure secrets come from the platform secrets provider rather than clear-text files.
+
+### Kubernetes / Helm
+Store webhook URLs and routing keys in Secrets, then expose them to the workload as `Alerting:*` values.
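+
+A minimal sketch of that pattern, assuming a Secret named `doctor-alerting` with a key `alertmanager-url` (both names are hypothetical and not taken from any StellaOps chart). It reuses the `Alerting__AlertManagerUrl` variable from the compose example:
+
+```yaml
+# Hypothetical Secret; replace the name, key, and value with your own.
+apiVersion: v1
+kind: Secret
+metadata:
+  name: doctor-alerting
+type: Opaque
+stringData:
+  alertmanager-url: http://alertmanager:9093
+---
+# Container spec fragment: inject the Secret value as an environment variable.
+env:
+  - name: Alerting__AlertManagerUrl
+    valueFrom:
+      secretKeyRef:
+        name: doctor-alerting
+        key: alertmanager-url
+```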
+
+## Verification
+```bash
+stella doctor --check check.observability.alerting
+```
+
+## Related Checks
+- `check.observability.metrics` - alerting is usually driven by metrics
+- `check.observability.logging` - logs are the fallback when alerts are missing
--- a/docs/doctor/articles/observability/observability-healthchecks.md
+++ b/docs/doctor/articles/observability/observability-healthchecks.md
@@ -0,0 +1,53 @@
+---
+checkId: check.observability.healthchecks
+plugin: stellaops.doctor.observability
+severity: warn
+tags: [observability, healthchecks, readiness, liveness]
+---
+# Health Check Endpoints
+
+## What It Checks
+Evaluates the configured health, readiness, and liveness paths and optionally probes `http://localhost:<port><path>` when a health-check port is configured.
+
+The check warns when endpoints are unreachable, when timeouts are outside the `1s` to `60s` range, or when readiness and liveness collapse onto the same path.
+
+## Why It Matters
+Broken health probes turn into bad restart loops, failed rolling upgrades, and misleading orchestration signals.
+
+## Common Causes
+- The service exposes `/health` but not `/health/ready` or `/health/live`
+- Health-check ports differ from the actual bound HTTP port
+- Probe timeout values were copied from another service without validation
+
+## How to Fix
+
+### Docker Compose
+```bash
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/health
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/health/ready
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/health/live
+```
+
+Set explicit paths and a timeout within the `1s` to `60s` range in the service's `environment` block:
+
+```yaml
+HealthChecks__Path: /health
+HealthChecks__ReadinessPath: /health/ready
+HealthChecks__LivenessPath: /health/live
+HealthChecks__Timeout: 30
+```
+
+### Bare Metal / systemd
+Verify reverse proxies and firewalls do not block the health port.
+
+### Kubernetes / Helm
+Point readiness and liveness probes at separate endpoints whenever startup and steady-state behavior differ.
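+
+A container spec fragment along these lines keeps the two probes on separate paths. The paths match the compose example above; port `8080` and the timing values are assumptions to adjust per service:
+
+```yaml
+# Readiness gates traffic; liveness triggers restarts. Keep the paths distinct.
+readinessProbe:
+  httpGet:
+    path: /health/ready
+    port: 8080
+  periodSeconds: 10
+  timeoutSeconds: 5
+livenessProbe:
+  httpGet:
+    path: /health/live
+    port: 8080
+  initialDelaySeconds: 15   # assumed startup allowance
+  periodSeconds: 20
+  timeoutSeconds: 5
+```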
+
+## Verification
+```bash
+stella doctor --check check.observability.healthchecks
+```
+
+## Related Checks
+- `check.core.services.health` - aggregates the underlying ASP.NET health checks when available
+- `check.observability.metrics` - shared listener misconfiguration can break both endpoints
--- a/docs/doctor/articles/observability/observability-logging.md
+++ b/docs/doctor/articles/observability/observability-logging.md
@@ -0,0 +1,49 @@
+---
+checkId: check.observability.logging
+plugin: stellaops.doctor.observability
+severity: warn
+tags: [observability, logging, structured-logs]
+---
+# Logging Configuration
+
+## What It Checks
+Reads default and framework log levels and looks for structured logging via `Logging:Structured`, JSON console formatting, or a `Serilog` configuration section.
+
+The check warns when default logging is `Debug` or `Trace`, when Microsoft categories are too verbose, or when structured logging is missing.
+
+## Why It Matters
+Unstructured logs slow incident response and make exports difficult to analyze. Overly verbose framework logging also drives storage growth and noise.
+
+## Common Causes
+- Only the default ASP.NET console logger is configured
+- `Logging:Structured` or `Serilog` settings were omitted from compose values
+- Troubleshooting log levels were left enabled in production
+
+## How to Fix
+
+### Docker Compose
+```yaml
+services:
+  doctor-web:
+    environment:
+      Logging__LogLevel__Default: Information
+      Logging__LogLevel__Microsoft: Warning
+      Logging__Structured: "true"
+```
+
+If Serilog is used, make sure the console sink emits JSON or another structured format that downstream tooling can parse.
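+
+Without Serilog, the built-in .NET JSON console formatter is one way to get structured output. The sketch below uses the standard `Logging:Console:FormatterName` setting; whether the check treats this exact key as "JSON console formatting" is an assumption:
+
+```yaml
+services:
+  doctor-web:
+    environment:
+      # Built-in .NET console formatter; emits one JSON object per log entry.
+      Logging__Console__FormatterName: json
+```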
+
+### Bare Metal / systemd
+Keep framework namespaces at `Warning` or stricter unless you are collecting short-lived debugging evidence.
+
+### Kubernetes / Helm
+Ensure log collectors expect the same output format the application emits.
+
+## Verification
+```bash
+stella doctor --check check.observability.logging
+```
+
+## Related Checks
+- `check.observability.alerting` - alerting often relies on structured log pipelines
+- `check.security.audit.logging` - audit logs should follow the same transport and retention standards
--- a/docs/doctor/articles/observability/observability-metrics.md
+++ b/docs/doctor/articles/observability/observability-metrics.md
@@ -0,0 +1,53 @@
+---
+checkId: check.observability.metrics
+plugin: stellaops.doctor.observability
+severity: warn
+tags: [observability, metrics, prometheus]
+---
+# Metrics Collection
+
+## What It Checks
+Inspects `Metrics:*`, `Prometheus:*`, and `OpenTelemetry:Metrics:*` settings. When a metrics port is configured and an `IHttpClientFactory` is available, the check probes `http://localhost:<port><path>`.
+
+The check returns info when metrics are disabled or absent, and warns when the configured endpoint cannot be reached.
+
+## Why It Matters
+Metrics are the primary input for alerting, SLO tracking, and capacity planning. Missing or unreachable endpoints remove the fastest signal operators have.
+
+## Common Causes
+- Metrics were never enabled in the deployment configuration
+- The metrics path or port does not match the listener exposed by the service
+- A sidecar or reverse proxy blocks local probing
+
+## How to Fix
+
+### Docker Compose
+```yaml
+services:
+  doctor-web:
+    environment:
+      Metrics__Enabled: "true"
+      Metrics__Path: /metrics
+      Metrics__Port: 8080
+```
+
+Probe the endpoint from inside the container:
+
+```bash
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://localhost:8080/metrics
+```
+
+### Bare Metal / systemd
+Bind the metrics port explicitly if the service does not share the main HTTP listener.
+
+### Kubernetes / Helm
+Align the `ServiceMonitor` or Prometheus scrape config with the same path and port the app exposes.
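+
+As a sketch, a `ServiceMonitor` for the Prometheus Operator might look like the following. The label selector and the Service port name `http` are hypothetical; the path mirrors `Metrics__Path` from the compose example:
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: doctor-web
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: doctor-web   # hypothetical Service label
+  endpoints:
+    - port: http          # assumed name of the Service port that maps to 8080
+      path: /metrics      # must match Metrics__Path
+      interval: 30s
+```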
+
+## Verification
+```bash
+stella doctor --check check.observability.metrics
+```
+
+## Related Checks
+- `check.observability.otel` - OpenTelemetry metrics often share the same collector path
+- `check.observability.alerting` - metrics are usually the source for alert rules
--- a/docs/doctor/articles/observability/observability-otel.md
+++ b/docs/doctor/articles/observability/observability-otel.md
@@ -0,0 +1,52 @@
+---
+checkId: check.observability.otel
+plugin: stellaops.doctor.observability
+severity: warn
+tags: [observability, opentelemetry, tracing, metrics]
+---
+# OpenTelemetry Configuration
+
+## What It Checks
+Reads `OpenTelemetry:*`, `Telemetry:*`, and `OTEL_*` settings for endpoint, service name, tracing enablement, metrics enablement, and sampling ratio. When possible, it probes the collector host directly.
+
+The check reports info when no OTLP endpoint is configured and warns when the service name is missing, tracing or metrics are disabled, sampling is too low, or the collector is unreachable.
+
+## Why It Matters
+OpenTelemetry is the main path for exporting traces and metrics to external systems. Broken collector settings silently remove cross-service visibility.
+
+## Common Causes
+- `OTEL_EXPORTER_OTLP_ENDPOINT` was omitted from compose or environment settings
+- `OTEL_SERVICE_NAME` was never set
+- Collector networking differs between local and deployed environments
+
+## How to Fix
+
+### Docker Compose
+```yaml
+services:
+  doctor-web:
+    environment:
+      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
+      OTEL_SERVICE_NAME: doctor-web
+      OpenTelemetry__Tracing__Enabled: "true"
+      OpenTelemetry__Metrics__Enabled: "true"
+```
+
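+```bash
+# 4318 is the collector's OTLP/HTTP port; the endpoint configured above uses 4317 (OTLP/gRPC).
+# Print only the HTTP status: any response, even 404, shows the collector is reachable.
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -sS -o /dev/null -w '%{http_code}\n' http://otel-collector:4318/
+```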
+
+### Bare Metal / systemd
+Keep the collector endpoint in the service unit or configuration file and verify firewalls allow traffic on the OTLP port.
+
+### Kubernetes / Helm
+Use cluster-local collector service names and inject `OTEL_SERVICE_NAME` per workload.
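+
+One possible container spec fragment, assuming the collector runs as a Service named `otel-collector` in an `observability` namespace and that each pod carries an `app.kubernetes.io/name` label (all three are assumptions):
+
+```yaml
+env:
+  - name: OTEL_EXPORTER_OTLP_ENDPOINT
+    # Cluster-local DNS name of the hypothetical collector Service (OTLP/gRPC port).
+    value: http://otel-collector.observability.svc.cluster.local:4317
+  - name: OTEL_SERVICE_NAME
+    # Downward API: derive the service name from the pod's own label.
+    valueFrom:
+      fieldRef:
+        fieldPath: metadata.labels['app.kubernetes.io/name']
+```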
+
+## Verification
+```bash
+stella doctor --check check.observability.otel
+```
+
+## Related Checks
+- `check.observability.tracing` - validates trace-specific tuning once OTLP export is wired
+- `check.observability.metrics` - metrics export often shares the same collector
--- a/docs/doctor/articles/observability/observability-tracing.md
+++ b/docs/doctor/articles/observability/observability-tracing.md
@@ -0,0 +1,48 @@
+---
+checkId: check.observability.tracing
+plugin: stellaops.doctor.observability
+severity: warn
+tags: [observability, tracing, correlation]
+---
+# Distributed Tracing
+
+## What It Checks
+Validates trace enablement, propagator, sampling ratio, exporter type, and whether HTTP and database instrumentation are turned on.
+
+The check reports info when tracing is explicitly disabled and warns when sampling is invalid, too low, or when important instrumentation is turned off.
+
+## Why It Matters
+Tracing is the fastest way to understand cross-service latency and identify the exact hop that is failing. Disabling instrumentation removes that evidence.
+
+## Common Causes
+- Sampling ratio set to `0` during load testing and never restored
+- Only outbound HTTP traces are enabled while database spans remain off
+- Propagator or exporter defaults differ between services
+
+## How to Fix
+
+### Docker Compose
+```yaml
+services:
+  doctor-web:
+    environment:
+      Tracing__Enabled: "true"
+      Tracing__SamplingRatio: "1.0"
+      Tracing__Instrumentation__Http: "true"
+      Tracing__Instrumentation__Database: "true"
+```
+
+### Bare Metal / systemd
+Keep `Tracing:SamplingRatio` between `0.01` and `1.0` unless you are deliberately suppressing traces for a benchmark.
+
+### Kubernetes / Helm
+Propagate the same trace configuration across all services in the release path so correlation IDs remain intact.
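+
+A shared ConfigMap is one way to keep these values identical across workloads. The sketch below reuses the `Tracing__*` keys from the compose example; the ConfigMap name `stella-tracing` is hypothetical:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: stella-tracing
+data:
+  Tracing__Enabled: "true"
+  Tracing__SamplingRatio: "1.0"
+  Tracing__Instrumentation__Http: "true"
+  Tracing__Instrumentation__Database: "true"
+---
+# Container spec fragment, repeated in every Deployment on the release path.
+envFrom:
+  - configMapRef:
+      name: stella-tracing
+```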
+
+## Verification
+```bash
+stella doctor --check check.observability.tracing
+```
+
+## Related Checks
+- `check.observability.otel` - exporter connectivity must work before traces leave the process
+- `check.servicegraph.timeouts` - tracing is most useful when diagnosing timeout-related issues