doctor: complete runtime check documentation sprint

Signed-off-by: master <>
2026-03-31 23:26:24 +03:00
parent 404d50bcb7
commit 152c1b1357
54 changed files with 2210 additions and 258 deletions
--- a/docs/doctor/articles/servicegraph/servicegraph-timeouts.md
+++ b/docs/doctor/articles/servicegraph/servicegraph-timeouts.md
@@ -0,0 +1,48 @@
+---
+checkId: check.servicegraph.timeouts
+plugin: stellaops.doctor.servicegraph
+severity: warn
+tags: [servicegraph, timeouts, configuration]
+---
+# Service Timeouts
+
+## What It Checks
+Validates four timeout settings: `HttpClient:Timeout`, `Database:CommandTimeout`, `Cache:OperationTimeout`, and `HealthChecks:Timeout`.
+
+The check warns when HTTP timeout is below `5s` or above `300s`, database timeout is below `5s` or above `120s`, cache timeout exceeds `30s`, or health-check timeout exceeds the HTTP timeout.
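+
+As an illustration of these thresholds, the hypothetical Compose fragment below would be expected to trip every rule at once. It assumes the values are expressed in seconds and that the environment variable names map to the configuration keys as in the Docker Compose example under "How to Fix"; neither assumption is confirmed by the check itself.
+
+```yaml
+# Illustrative misconfiguration only; do not deploy these values.
+services:
+  doctor-web:
+    environment:
+      HttpClient__Timeout: "3"          # below the 5s lower bound
+      Database__CommandTimeout: "180"   # above the 120s upper bound
+      Cache__OperationTimeout: "45"     # exceeds 30s
+      HealthChecks__Timeout: "10"       # exceeds the HTTP timeout (3)
+```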
+
+## Why It Matters
+Timeouts define how quickly failures surface and how long stuck work ties up resources. Values that are too low fail requests that would have succeeded; values that are too high let stuck calls hold connections and threads long enough to exhaust them.
+
+## Common Causes
+- Defaults from one environment were copied into another with very different latency
+- Health-check timeout was set higher than the main request timeout
+- Cache or database timeouts were raised to hide underlying performance problems
+
+## How to Fix
+
+### Docker Compose
+```yaml
+services:
+  doctor-web:
+    environment:
+      HttpClient__Timeout: "100"
+      Database__CommandTimeout: "30"
+      Cache__OperationTimeout: "5"
+      HealthChecks__Timeout: "10"
+```
+
+### Bare Metal / systemd
+Tune timeouts from measured service latencies, not from guesswork. Raise values only after understanding the slower dependency.
+
+### Kubernetes / Helm
+Keep application timeouts lower than ingress, service-mesh, and job-level deadlines so failures happen in the component that owns the retry policy.
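+
+A minimal sketch of passing the same settings to a pod through container environment variables follows. The container name is hypothetical, the values mirror the Docker Compose example above, and the fragment assumes your chart lets you set `env` entries on the Deployment.
+
+```yaml
+# Hypothetical Deployment fragment; adjust names to match your chart.
+spec:
+  template:
+    spec:
+      containers:
+        - name: doctor-web
+          env:
+            - name: HttpClient__Timeout
+              value: "100"
+            - name: Database__CommandTimeout
+              value: "30"
+            - name: Cache__OperationTimeout
+              value: "5"
+            - name: HealthChecks__Timeout
+              value: "10"
+```
+
+Whatever values you choose, compare them against the ingress and service-mesh deadlines in front of the pod so the application timeout stays the shorter of the two.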
+
+## Verification
+```bash
+stella doctor --check check.servicegraph.timeouts
+```
+
+## Related Checks
+- `check.servicegraph.backend` - timeout misconfiguration often shows up as backend failures first
+- `check.db.latency` - high database latency can force operators to revisit timeout values