doctor: complete runtime check documentation sprint

Signed-off-by: master <>
master
2026-03-31 23:26:24 +03:00
parent 404d50bcb7
commit 152c1b1357
54 changed files with 2210 additions and 258 deletions

@@ -0,0 +1,48 @@
---
checkId: check.servicegraph.circuitbreaker
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, resilience, circuit-breaker]
---
# Circuit Breaker Status
## What It Checks
Reads `Resilience:Enabled` or `HttpClient:Resilience:Enabled` and, when enabled, validates `BreakDurationSeconds`, `FailureThreshold`, and `SamplingDurationSeconds`.
The check reports info when resilience is not configured, warns when `BreakDurationSeconds < 5` or `FailureThreshold < 2`, and passes otherwise.
## Why It Matters
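Circuit breakers protect external dependencies from retry storms by failing fast while a dependency is unhealthy. Badly chosen thresholds undermine that: a threshold set too low trips the breaker on a single transient failure, while one set too high never trips it even though a downstream service is failing.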
## Common Causes
- Resilience policies were never enabled on outgoing HTTP clients
- Thresholds were copied from a benchmark profile into production
- Multiple services use different resilience defaults, which makes failure behavior unpredictable
## How to Fix
### Docker Compose
```yaml
services:
  doctor-web:
    environment:
      Resilience__Enabled: "true"
      Resilience__CircuitBreaker__BreakDurationSeconds: "30"
      Resilience__CircuitBreaker__FailureThreshold: "5"
      Resilience__CircuitBreaker__SamplingDurationSeconds: "60"
```
### Bare Metal / systemd
Keep breaker settings in the same configuration source used for HTTP client registration so the service and Doctor observe the same values.
### Kubernetes / Helm
Standardize resilience values across backend-facing workloads instead of relying on per-pod overrides.
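One way to share a single set of values is a ConfigMap that every backend-facing workload imports. The sketch below is illustrative only: the ConfigMap name `resilience-defaults`, the Deployment name, the labels, and the image are assumptions rather than names from a StellaOps chart. The keys and values mirror the Docker Compose example.

```yaml
# Shared defaults; the name "resilience-defaults" is hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: resilience-defaults
data:
  Resilience__Enabled: "true"
  Resilience__CircuitBreaker__BreakDurationSeconds: "30"
  Resilience__CircuitBreaker__FailureThreshold: "5"
  Resilience__CircuitBreaker__SamplingDurationSeconds: "60"
---
# Minimal Deployment that imports the shared defaults as environment variables.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: doctor-web                 # assumed; reuses the Compose service name
spec:
  selector:
    matchLabels:
      app: doctor-web
  template:
    metadata:
      labels:
        app: doctor-web
    spec:
      containers:
        - name: doctor-web
          image: registry.example.invalid/doctor-web:latest   # placeholder image
          envFrom:
            - configMapRef:
                name: resilience-defaults
```

Because each workload references the same ConfigMap, changing a threshold in one place applies to all of them at their next rollout, which avoids the drift that per-pod `env` overrides introduce.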
## Verification
```bash
stella doctor --check check.servicegraph.circuitbreaker
```
## Related Checks
- `check.servicegraph.backend` - breaker policy protects this path when the backend degrades
- `check.servicegraph.timeouts` - timeout settings and breaker settings should be tuned together