doctor: complete runtime check documentation sprint

Signed-off-by: master <>
2026-03-31 23:26:24 +03:00
parent 404d50bcb7
commit 152c1b1357
54 changed files with 2210 additions and 258 deletions
--- a/docs/doctor/articles/observability/observability-tracing.md
+++ b/docs/doctor/articles/observability/observability-tracing.md
@@ -0,0 +1,48 @@
+---
+checkId: check.observability.tracing
+plugin: stellaops.doctor.observability
+severity: warn
+tags: [observability, tracing, correlation]
+---
+# Distributed Tracing
+
+## What It Checks
+Validates trace enablement, propagator, sampling ratio, exporter type, and whether HTTP and database instrumentation are turned on.
+
+The check reports info when tracing is explicitly disabled. It warns when the sampling ratio is invalid or too low, or when HTTP or database instrumentation is turned off.
+
+## Why It Matters
+Tracing is the fastest way to understand cross-service latency and identify the exact hop that is failing. Disabling instrumentation removes that evidence.
+
+## Common Causes
+- Sampling ratio set to `0` during load testing and never restored
+- Only outbound HTTP traces are enabled while database spans remain off
+- Propagator or exporter defaults differ between services
+
+## How to Fix
+
+### Docker Compose
+```yaml
+services:
+  doctor-web:
+    environment:
+      Tracing__Enabled: "true"
+      Tracing__SamplingRatio: "1.0"
+      Tracing__Instrumentation__Http: "true"
+      Tracing__Instrumentation__Database: "true"
+```
+
+### Bare Metal / systemd
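+Keep `Tracing:SamplingRatio` (the same setting as `Tracing__SamplingRatio` in the Docker Compose example) between `0.01` and `1.0` unless you are deliberately suppressing traces for a benchmark, and restore it once the benchmark is finished.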
+
+### Kubernetes / Helm
+Apply the same trace configuration (propagator, sampling ratio, exporter, and instrumentation flags) to every service in the release path so correlation IDs remain intact from one hop to the next.
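+
+A minimal sketch of how this could look as a Deployment fragment, assuming the service reads the same environment variables as the Docker Compose example above. The container name `doctor-web` is carried over from that example for illustration only; the layout of your chart or manifests may differ.
+
+```yaml
+# Illustrative fragment only: the container name is an assumption,
+# and the variable names mirror the Docker Compose example.
+spec:
+  template:
+    spec:
+      containers:
+        - name: doctor-web
+          env:
+            - name: Tracing__Enabled
+              value: "true"
+            - name: Tracing__SamplingRatio
+              value: "1.0"
+            - name: Tracing__Instrumentation__Http
+              value: "true"
+            - name: Tracing__Instrumentation__Database
+              value: "true"
+```
+
+Using identical values for each service in the release keeps sampling and instrumentation settings from drifting between hops.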
+
+## Verification
+```bash
+stella doctor --check check.observability.tracing
+```
+
+## Related Checks
+- `check.observability.otel` - exporter connectivity must work before traces leave the process
+- `check.servicegraph.timeouts` - tracing is most useful when diagnosing timeout-related issues