doctor: complete runtime check documentation sprint
Signed-off-by: master <>
docs/doctor/articles/servicegraph/servicegraph-timeouts.md
@@ -0,0 +1,48 @@
---
checkId: check.servicegraph.timeouts
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, timeouts, configuration]
---
# Service Timeouts

## What It Checks
Validates the `HttpClient:Timeout`, `Database:CommandTimeout`, `Cache:OperationTimeout`, and `HealthChecks:Timeout` settings.

The check warns when any of the following is true:
- The HTTP timeout is below `5s` or above `300s`
- The database command timeout is below `5s` or above `120s`
- The cache operation timeout exceeds `30s`
- The health-check timeout exceeds the HTTP timeout
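
The thresholds above can be sketched as a small function. This is an illustrative reconstruction of the documented rules, not the plugin's actual implementation; the function name `timeout_warnings` and its message strings are hypothetical, and all values are assumed to be in seconds.

```python
from __future__ import annotations


def timeout_warnings(http: float, db: float, cache: float, health: float) -> list[str]:
    """Return one warning per documented rule that the given timeouts (seconds) violate."""
    warnings: list[str] = []
    # HTTP timeout must stay within 5s..300s (bounds inclusive).
    if http < 5 or http > 300:
        warnings.append("HttpClient:Timeout should be between 5s and 300s")
    # Database command timeout must stay within 5s..120s (bounds inclusive).
    if db < 5 or db > 120:
        warnings.append("Database:CommandTimeout should be between 5s and 120s")
    # Cache timeout only has an upper bound.
    if cache > 30:
        warnings.append("Cache:OperationTimeout should not exceed 30s")
    # Health checks should not wait longer than ordinary HTTP requests.
    if health > http:
        warnings.append("HealthChecks:Timeout should not exceed HttpClient:Timeout")
    return warnings
```

For example, `timeout_warnings(100, 30, 5, 10)` returns an empty list, because every value sits inside its documented range.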

## Why It Matters
Timeouts define how quickly failures surface and how long stuck work ties up resources. Values that are too low fail requests that would have succeeded; values that are too high let stalled calls hold connections and threads until resources run out.

## Common Causes
- Defaults from one environment were copied into another with very different latency
- Health-check timeout was set higher than the main request timeout
- Cache or database timeouts were raised to hide underlying performance problems

## How to Fix

### Docker Compose
```yaml
services:
  doctor-web:
    environment:
      # Sample values inside the ranges this check accepts (read as seconds)
      HttpClient__Timeout: "100"
      Database__CommandTimeout: "30"
      Cache__OperationTimeout: "5"
      HealthChecks__Timeout: "10"
```

### Bare Metal / systemd
Tune timeouts from measured service latencies, not from guesswork. Raise values only after understanding the slower dependency.
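
One way to turn measured latencies into a starting value is sketched below. The approach (nearest-rank 99th percentile multiplied by a headroom factor, then clamped) is an assumption made for illustration, not a policy stated by this check; `suggest_timeout` and its parameters are hypothetical names, and the default floor and ceiling simply reuse the HTTP range described earlier.

```python
def suggest_timeout(latencies_s, headroom=2.0, floor_s=5.0, ceiling_s=300.0):
    """Suggest a timeout in seconds from latency samples measured in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    n = len(ordered)
    # Nearest-rank 99th percentile: rank = ceil(0.99 * n), using integer math.
    rank = (99 * n + 99) // 100
    p99 = ordered[rank - 1]
    # Add headroom above the slow tail, then clamp to the allowed range.
    return min(max(p99 * headroom, floor_s), ceiling_s)
```

A suggestion that lands on the ceiling is a signal to investigate the slow dependency before accepting the value.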

### Kubernetes / Helm
Keep application timeouts lower than ingress, service-mesh, and job-level deadlines so failures happen in the component that owns the retry policy.
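
The ordering rule can be checked mechanically once the deadlines of each layer are known. The sketch below is illustrative only: the function name and the layer names in the example are hypothetical, and collecting the real values from ingress, mesh, or job configuration is left out.

```python
def layers_not_above_app(app_timeout_s, outer_deadlines_s):
    """Return the names of outer layers whose deadline is not above the app timeout."""
    return sorted(
        name
        for name, deadline in outer_deadlines_s.items()
        if deadline <= app_timeout_s
    )
```

For example, `layers_not_above_app(100, {"ingress": 120, "mesh": 90, "job": 100})` flags `job` and `mesh`, since both would cut a request off at or before the application's own 100-second timeout.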

## Verification
```bash
stella doctor --check check.servicegraph.timeouts
```

## Related Checks
- `check.servicegraph.backend` - timeout misconfiguration often shows up as backend failures first
- `check.db.latency` - high database latency can force operators to revisit timeout values