doctor: complete runtime check documentation sprint

Signed-off-by: master <>
2026-03-31 23:26:24 +03:00
parent 404d50bcb7
commit 152c1b1357
54 changed files with 2210 additions and 258 deletions
--- a/docs/doctor/articles/servicegraph/servicegraph-circuitbreaker.md
+++ b/docs/doctor/articles/servicegraph/servicegraph-circuitbreaker.md
@@ -0,0 +1,48 @@
+---
+checkId: check.servicegraph.circuitbreaker
+plugin: stellaops.doctor.servicegraph
+severity: warn
+tags: [servicegraph, resilience, circuit-breaker]
+---
+# Circuit Breaker Status
+
+## What It Checks
+Reads `Resilience:Enabled` or `HttpClient:Resilience:Enabled` and, when enabled, validates `BreakDurationSeconds`, `FailureThreshold`, and `SamplingDurationSeconds`.
+
+The check reports info when resilience is not configured, warns when `BreakDurationSeconds < 5` or `FailureThreshold < 2`, and passes otherwise.
+
+## Why It Matters
+Circuit breakers stop callers from hammering a failing dependency with retries. Thresholds set too low trip the breaker on transient errors; thresholds set too high never trip it while a downstream service is failing.
+
+## Common Causes
+- Resilience policies were never enabled on outgoing HTTP clients
+- Thresholds were copied from a benchmark profile into production
+- Multiple services use different resilience defaults, making failure behavior unpredictable
+
+## How to Fix
+
+### Docker Compose
+```yaml
+services:
+  doctor-web:
+    environment:
+      Resilience__Enabled: "true"
+      Resilience__CircuitBreaker__BreakDurationSeconds: "30"
+      Resilience__CircuitBreaker__FailureThreshold: "5"
+      Resilience__CircuitBreaker__SamplingDurationSeconds: "60"
+```
+
+### Bare Metal / systemd
+Keep breaker settings in the same configuration source used for HTTP client registration so the service and Doctor observe the same values.
+
+### Kubernetes / Helm
+Standardize resilience values across backend-facing workloads instead of setting per-pod overrides.
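+
+A minimal sketch of what shared values could look like, assuming the workloads read the same environment variables as the Docker Compose example above. The container name is a hypothetical placeholder, and the fragment shows only the pod template portion of a Deployment, not a complete manifest. In a Helm chart, this `env` list would typically come from one shared values block or template so every workload renders identical settings.
+
+```yaml
+# Fragment of a Deployment pod template (illustrative, not a full manifest).
+spec:
+  template:
+    spec:
+      containers:
+        - name: doctor-web  # hypothetical container name
+          env:
+            # Same keys and values as the Docker Compose example.
+            - name: Resilience__Enabled
+              value: "true"
+            - name: Resilience__CircuitBreaker__BreakDurationSeconds
+              value: "30"
+            - name: Resilience__CircuitBreaker__FailureThreshold
+              value: "5"
+            - name: Resilience__CircuitBreaker__SamplingDurationSeconds
+              value: "60"
+```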
+
+## Verification
+```bash
+stella doctor --check check.servicegraph.circuitbreaker
+```
+
+## Related Checks
+- `check.servicegraph.backend` - breaker policy protects this path when the backend degrades
+- `check.servicegraph.timeouts` - timeout settings and breaker settings should be tuned together