master 152c1b1357 doctor: complete runtime check documentation sprint

Signed-off-by: master <>

2026-03-31 23:26:24 +03:00

checkId, plugin, severity, tags

Circuit Breaker Status

What It Checks

Reads Resilience:Enabled or HttpClient:Resilience:Enabled and, when enabled, validates BreakDurationSeconds, FailureThreshold, and SamplingDurationSeconds.

The check reports info when resilience is not configured, warns when BreakDurationSeconds < 5 or FailureThreshold < 2, and passes otherwise.
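The decision rule above can be sketched in a few lines. This is a minimal illustration, not Doctor's actual implementation: the `BreakerSettings` type and `evaluate` function are hypothetical names, and treating `Enabled = false` the same as "not configured" is an assumption. The source gives no threshold for SamplingDurationSeconds, so the sketch leaves it out.

```python
from dataclasses import dataclass


@dataclass
class BreakerSettings:
    # Hypothetical container for the values the check reads.
    enabled: bool                  # Resilience:Enabled or HttpClient:Resilience:Enabled
    break_duration_seconds: int    # BreakDurationSeconds
    failure_threshold: int         # FailureThreshold


def evaluate(settings: BreakerSettings) -> str:
    """Return "info", "warn", or "pass" following the rule described above."""
    # Assumption: a disabled policy is reported the same way as a missing one.
    if not settings.enabled:
        return "info"
    # Warn when the breaker reopens too quickly or trips on a single failure.
    if settings.break_duration_seconds < 5 or settings.failure_threshold < 2:
        return "warn"
    return "pass"


if __name__ == "__main__":
    print(evaluate(BreakerSettings(True, 30, 5)))
    print(evaluate(BreakerSettings(True, 3, 5)))
```

With the values from the Docker Compose example (30 and 5), this sketch returns `pass`; lowering BreakDurationSeconds to 3 returns `warn`.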

Why It Matters

Circuit breakers protect external dependencies from retry storms. Thresholds that are too low trip the breaker on transient errors, while thresholds that are too high never trip it even when a downstream service is failing.

Common Causes

Resilience policies were never enabled on outgoing HTTP clients
Thresholds were copied from a benchmark profile into production
Multiple services use different resilience defaults, making failures unpredictable

How to Fix

Docker Compose

services:
  doctor-web:
    environment:
      Resilience__Enabled: "true"
      Resilience__CircuitBreaker__BreakDurationSeconds: "30"
      Resilience__CircuitBreaker__FailureThreshold: "5"
      Resilience__CircuitBreaker__SamplingDurationSeconds: "60"

Bare Metal / systemd

Keep breaker settings in the same configuration source used for HTTP client registration so the service and Doctor observe the same values.
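If the service is configured through environment variables under systemd, the colon-separated keys need the same `__` form used in the Docker Compose example. The sketch below assumes the service uses the standard .NET environment-variable configuration provider, where `__` stands in for `:`; the helper names are hypothetical, and the output is meant to be pasted into a unit file or drop-in by hand.

```python
from typing import Dict, List


def to_env_name(config_key: str) -> str:
    """Convert a colon-separated configuration key to its environment-variable form."""
    # Assumption: the .NET convention of replacing ":" with "__".
    return config_key.replace(":", "__")


def systemd_environment_lines(values: Dict[str, str]) -> List[str]:
    """Render one systemd Environment= directive per configuration key."""
    return [f'Environment="{to_env_name(key)}={value}"' for key, value in values.items()]


if __name__ == "__main__":
    # Same values as the Docker Compose example.
    settings = {
        "Resilience:Enabled": "true",
        "Resilience:CircuitBreaker:BreakDurationSeconds": "30",
        "Resilience:CircuitBreaker:FailureThreshold": "5",
        "Resilience:CircuitBreaker:SamplingDurationSeconds": "60",
    }
    for line in systemd_environment_lines(settings):
        print(line)
```

Generating the lines from one dictionary keeps the service and Doctor on a single source of values, which is the point of the guidance above.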

Kubernetes / Helm

Standardize resilience values across backend-facing workloads instead of setting per-pod overrides.
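One way to spot drift before standardizing is to compare the resilience variables each workload actually receives. The sketch below is illustrative only: it assumes you have already collected each workload's environment into a dictionary (for example from rendered manifests), and the workload name `scanner-web` is a made-up example.

```python
from collections import defaultdict
from typing import Dict, List


def find_divergent_keys(workloads: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return each Resilience__ key that has more than one distinct value across workloads."""
    seen = defaultdict(set)
    for env in workloads.values():
        for key, value in env.items():
            if key.startswith("Resilience__"):
                seen[key].add(value)
    # Keep only keys whose values disagree between workloads.
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


if __name__ == "__main__":
    workloads = {
        "doctor-web": {"Resilience__CircuitBreaker__FailureThreshold": "5"},
        "scanner-web": {"Resilience__CircuitBreaker__FailureThreshold": "2"},  # hypothetical workload
    }
    print(find_divergent_keys(workloads))
```

The sketch only compares keys that are present; a workload that omits a key entirely is not flagged and would need a separate check.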

Verification

stella doctor --check check.servicegraph.circuitbreaker

Related Checks

check.servicegraph.backend - breaker policy protects this path when the backend degrades
check.servicegraph.timeouts - timeout settings and breaker settings should be tuned together
