doctor: complete runtime check documentation sprint
Signed-off-by: master <>
@@ -0,0 +1,48 @@
---
checkId: check.servicegraph.circuitbreaker
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, resilience, circuit-breaker]
---
# Circuit Breaker Status

## What It Checks
Reads `Resilience:Enabled` or `HttpClient:Resilience:Enabled` and, when enabled, validates `BreakDurationSeconds`, `FailureThreshold`, and `SamplingDurationSeconds`.

The check reports info when resilience is not configured, warns when `BreakDurationSeconds < 5` or `FailureThreshold < 2`, and passes otherwise.
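
The rule above can be sketched as a small shell function. This only illustrates the documented thresholds and is not the Doctor plugin's actual implementation; the function name `classify_breaker` and its argument order are hypothetical.

```bash
# Sketch of the documented rule only; not the Doctor plugin's code.
# Usage: classify_breaker ENABLED BREAK_DURATION_SECONDS FAILURE_THRESHOLD
classify_breaker() {
  local enabled="$1" break_seconds="$2" failure_threshold="$3"

  # Resilience not configured: informational result.
  if [ "$enabled" != "true" ]; then
    echo "info"
    return 0
  fi

  # Either value below its documented floor produces a warning.
  if [ "$break_seconds" -lt 5 ] || [ "$failure_threshold" -lt 2 ]; then
    echo "warn"
    return 0
  fi

  echo "pass"
}
```

For example, `classify_breaker true 30 5` prints `pass`, while `classify_breaker true 3 5` prints `warn` because the break duration is below 5 seconds.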

## Why It Matters
Circuit breakers protect external dependencies from retry storms. Badly tuned thresholds either trip the breaker too aggressively or never trip it while a downstream service is failing.

## Common Causes
- Resilience policies were never enabled on outgoing HTTP clients
- Thresholds were copied from a benchmark profile into production
- Multiple services use different resilience defaults, making failures unpredictable

## How to Fix

### Docker Compose
```yaml
services:
  doctor-web:
    environment:
      Resilience__Enabled: "true"
      Resilience__CircuitBreaker__BreakDurationSeconds: "30"
      Resilience__CircuitBreaker__FailureThreshold: "5"
      Resilience__CircuitBreaker__SamplingDurationSeconds: "60"
```

### Bare Metal / systemd
Keep breaker settings in the same configuration source used for HTTP client registration so the service and Doctor observe the same values.
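
One way to share a single source is an environment file that both units load. The sketch below assumes the settings are supplied as environment variables, as in the Docker Compose example; the helper name `write_breaker_env` and the file path in the comments are hypothetical.

```bash
# Hypothetical helper: writes the breaker settings to one environment file
# so that every unit loading it sees identical values.
# Keys and values mirror the Docker Compose example.
write_breaker_env() {
  local env_file="$1"
  cat > "$env_file" <<'EOF'
Resilience__Enabled=true
Resilience__CircuitBreaker__BreakDurationSeconds=30
Resilience__CircuitBreaker__FailureThreshold=5
Resilience__CircuitBreaker__SamplingDurationSeconds=60
EOF
}

# Example (hypothetical path):
#   write_breaker_env /etc/stellaops/resilience.env
# Then load it from each unit's [Service] section with:
#   EnvironmentFile=/etc/stellaops/resilience.env
```

Pointing the service unit and the Doctor unit at the same file avoids the two drifting apart when one is edited.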

### Kubernetes / Helm
Standardize resilience values across backend-facing workloads instead of relying on per-pod overrides.
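
As a rough sketch of applying one set of values everywhere, the function below prints (but does not run) a `kubectl set env` command per deployment. The function name and the deployment names in the usage line are hypothetical, and the values are taken from the Docker Compose example.

```bash
# Prints one `kubectl set env` command per deployment name given as an
# argument, so every workload receives identical breaker settings.
# Nothing is executed; review the output before running it.
render_breaker_env_commands() {
  local name
  for name in "$@"; do
    echo "kubectl set env deployment/${name}" \
      "Resilience__Enabled=true" \
      "Resilience__CircuitBreaker__BreakDurationSeconds=30" \
      "Resilience__CircuitBreaker__FailureThreshold=5" \
      "Resilience__CircuitBreaker__SamplingDurationSeconds=60"
  done
}

# Example with hypothetical deployment names:
#   render_breaker_env_commands doctor-web backend-api
```

For Helm-managed workloads, the same values are better set once in the chart's shared values, since changes made with `kubectl set env` are overwritten by the next `helm upgrade`.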

## Verification
```bash
stella doctor --check check.servicegraph.circuitbreaker
```

## Related Checks
- `check.servicegraph.backend` - breaker policy protects this path when the backend degrades
- `check.servicegraph.timeouts` - timeout settings and breaker settings should be tuned together