doctor: complete runtime check documentation sprint
Signed-off-by: master <>
docs/doctor/articles/observability/observability-tracing.md
@@ -0,0 +1,48 @@
---
checkId: check.observability.tracing
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, tracing, correlation]
---
# Distributed Tracing

## What It Checks
Validates that tracing is enabled and inspects the configured propagator, sampling ratio, exporter type, and whether HTTP and database instrumentation are turned on.

The check reports info when tracing is explicitly disabled. It warns when the sampling ratio is invalid or too low, or when HTTP or database instrumentation is turned off.
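
The decision rules above can be sketched as a small function. This is a minimal illustration, not the plugin's actual implementation: the `TracingConfig` type, the `evaluate_tracing` name, the `pass` result, and the `0.01` lower bound (borrowed from the Bare Metal guidance in this article) are all assumptions.

```python
from dataclasses import dataclass

# Assumed lower bound for a "too low" ratio, taken from the 0.01 figure in the
# Bare Metal guidance; the real check may use a different threshold.
MIN_SAMPLING_RATIO = 0.01


@dataclass
class TracingConfig:
    """Hypothetical in-memory view of the Tracing settings."""
    enabled: bool = True
    sampling_ratio: float = 1.0
    http_instrumentation: bool = True
    database_instrumentation: bool = True


def evaluate_tracing(cfg: TracingConfig) -> tuple[str, list[str]]:
    """Return (severity, messages) following the rules described above."""
    # Explicitly disabled tracing is reported as info, not as a warning.
    if not cfg.enabled:
        return "info", ["tracing is explicitly disabled"]

    issues = []
    # A ratio outside 0.0-1.0 (including NaN) is invalid; a tiny one is too low.
    if not 0.0 <= cfg.sampling_ratio <= 1.0:
        issues.append(f"sampling ratio {cfg.sampling_ratio} is outside 0.0-1.0")
    elif cfg.sampling_ratio < MIN_SAMPLING_RATIO:
        issues.append(f"sampling ratio {cfg.sampling_ratio} is below {MIN_SAMPLING_RATIO}")
    if not cfg.http_instrumentation:
        issues.append("HTTP instrumentation is turned off")
    if not cfg.database_instrumentation:
        issues.append("database instrumentation is turned off")

    # "pass" is an assumed name for the healthy result.
    return ("warn", issues) if issues else ("pass", [])
```

Under these assumptions, a ratio of `0` left over from a load test and a disabled database instrumentation flag would each contribute one message to a single `warn` result.
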

## Why It Matters
Tracing is the fastest way to understand cross-service latency and identify the exact hop that is failing. Disabling instrumentation removes that evidence.

## Common Causes
- Sampling ratio set to `0` during load testing and never restored
- Only outbound HTTP traces are enabled while database spans remain off
- Propagator or exporter defaults differ between services

## How to Fix

### Docker Compose
```yaml
services:
  doctor-web:
    environment:
      Tracing__Enabled: "true"
      Tracing__SamplingRatio: "1.0"
      Tracing__Instrumentation__Http: "true"
      Tracing__Instrumentation__Database: "true"
```

### Bare Metal / systemd
Keep `Tracing:SamplingRatio` (`Tracing__SamplingRatio` when set as an environment variable, as in the Docker Compose example) between `0.01` and `1.0` unless you are deliberately suppressing traces for a benchmark.

### Kubernetes / Helm
Apply the same trace configuration (propagator, sampling ratio, exporter) to every service in the release path so trace context propagates end to end and correlation IDs remain intact.
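
One way to spot configuration drift before a release is to compare the tracing settings of each service side by side. The sketch below assumes the settings have already been collected into a dictionary per service. The `find_tracing_drift` helper, the `Propagator` and `Exporter` key names, and the sample services and values are hypothetical and not part of the Doctor CLI.

```python
from collections import defaultdict

# Settings compared across services. "Propagator" and "Exporter" are assumed
# key names; adjust them to match the real configuration schema.
COMPARED_KEYS = ("Enabled", "SamplingRatio", "Propagator", "Exporter")


def find_tracing_drift(configs: dict[str, dict[str, str]]) -> dict[str, dict[str, list[str]]]:
    """Map each inconsistent key to {value: [services using that value]}."""
    drift = {}
    for key in COMPARED_KEYS:
        by_value = defaultdict(list)
        # Group services by the value they use for this key.
        for service, settings in sorted(configs.items()):
            by_value[settings.get(key, "<unset>")].append(service)
        # More than one distinct value means the services disagree.
        if len(by_value) > 1:
            drift[key] = dict(by_value)
    return drift


# Hypothetical services and values, for illustration only.
configs = {
    "doctor-web": {"Enabled": "true", "SamplingRatio": "1.0", "Propagator": "w3c"},
    "gateway": {"Enabled": "true", "SamplingRatio": "0.1", "Propagator": "w3c"},
}
print(find_tracing_drift(configs))
```

With the sample data, only `SamplingRatio` is reported, because both services agree on every other compared key (an unset `Exporter` on both sides counts as agreement).
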

## Verification
```bash
stella doctor --check check.observability.tracing
```

## Related Checks
- `check.observability.otel` - exporter connectivity must work before traces leave the process
- `check.servicegraph.timeouts` - tracing is most useful when diagnosing timeout-related issues