doctor: complete runtime check documentation sprint

Signed-off-by: master <>
master
2026-03-31 23:26:24 +03:00
parent 404d50bcb7
commit 152c1b1357
54 changed files with 2210 additions and 258 deletions

@@ -0,0 +1,48 @@
---
checkId: check.servicegraph.timeouts
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, timeouts, configuration]
---
# Service Timeouts
## What It Checks
Validates `HttpClient:Timeout`, `Database:CommandTimeout`, `Cache:OperationTimeout`, and `HealthChecks:Timeout`.
The check warns when HTTP timeout is below `5s` or above `300s`, database timeout is below `5s` or above `120s`, cache timeout exceeds `30s`, or health-check timeout exceeds the HTTP timeout.
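
The thresholds above can be sketched as a small function. This is an illustrative reading of the rules, not the plugin's actual implementation: the function name, parameter names, and message wording are hypothetical, and all values are assumed to be in seconds.

```python
def timeout_warnings(http: float, database: float, cache: float, health: float) -> list[str]:
    """Return one message per violated rule; all values are assumed to be seconds."""
    warnings = []
    # HTTP timeout must fall within 5s..300s (bounds inclusive).
    if http < 5 or http > 300:
        warnings.append(f"HttpClient:Timeout {http}s is outside 5s..300s")
    # Database command timeout must fall within 5s..120s (bounds inclusive).
    if database < 5 or database > 120:
        warnings.append(f"Database:CommandTimeout {database}s is outside 5s..120s")
    # Cache operations should not wait longer than 30s.
    if cache > 30:
        warnings.append(f"Cache:OperationTimeout {cache}s exceeds 30s")
    # A health check should not outlast the HTTP timeout.
    if health > http:
        warnings.append(f"HealthChecks:Timeout {health}s exceeds HttpClient:Timeout {http}s")
    return warnings
```

Under these assumptions, `timeout_warnings(100, 30, 5, 10)` returns an empty list, while a health-check timeout of `120` against an HTTP timeout of `100` produces a single warning.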
## Why It Matters
Timeouts define how quickly failures surface and how long stuck work ties up resources. Values that are too low fail requests prematurely; values that are too high let stuck work hold resources until they are exhausted.
## Common Causes
- Defaults from one environment were copied into another with very different latency
- Health-check timeout was set higher than the main request timeout
- Cache or database timeouts were raised to hide underlying performance problems
## How to Fix
### Docker Compose
```yaml
services:
  doctor-web:
    environment:
      HttpClient__Timeout: "100"
      Database__CommandTimeout: "30"
      Cache__OperationTimeout: "5"
      HealthChecks__Timeout: "10"
```
### Bare Metal / systemd
Tune timeouts from measured service latencies, not from guesswork. Raise values only after understanding the slower dependency.
### Kubernetes / Helm
Keep application timeouts lower than ingress, service-mesh, and job-level deadlines so failures happen in the component that owns the retry policy.
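
The ordering rule above can be checked mechanically before a rollout. The sketch below is a hypothetical helper, not part of `stella doctor`: the function name and layer labels are made up for illustration, and every value is assumed to be in seconds.

```python
def outer_deadlines_too_short(app_timeout_s: float, outer_deadlines_s: dict[str, float]) -> list[str]:
    """Return the outer layers whose deadline is not strictly above the application timeout.

    Layer names (for example "ingress" or "mesh") are illustrative labels only.
    """
    return sorted(
        name
        for name, deadline in outer_deadlines_s.items()
        if deadline <= app_timeout_s
    )
```

For an application timeout of `100`, `outer_deadlines_too_short(100, {"ingress": 60, "mesh": 120, "job": 100})` returns `["ingress", "job"]`: both layers would cut the request off at or before the application's own timeout.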
## Verification
```bash
stella doctor --check check.servicegraph.timeouts
```
## Related Checks
- `check.servicegraph.backend` - timeout misconfiguration often shows up as backend failures first
- `check.db.latency` - high database latency can force operators to revisit timeout values