doctor: complete runtime check documentation sprint

Signed-off-by: master <>
This commit is contained in:
master
2026-03-31 23:26:24 +03:00
parent 404d50bcb7
commit 152c1b1357
54 changed files with 2210 additions and 258 deletions

View File

@@ -0,0 +1,56 @@
---
checkId: check.servicegraph.backend
plugin: stellaops.doctor.servicegraph
severity: fail
tags: [servicegraph, backend, api, connectivity]
---
# Backend API Connectivity
## What It Checks
Reads `StellaOps:BackendUrl` or `BackendUrl`, appends `/health`, and performs an HTTP GET through `IHttpClientFactory`.
The check passes on a successful response, warns when latency exceeds `2000ms`, and fails on non-success status codes or connection errors.
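The probe logic described above is roughly equivalent to the following sketch (Python used for illustration only; the actual check runs on .NET through `IHttpClientFactory`, and the function name here is hypothetical):

```python
import time
import urllib.request
import urllib.error

# Illustrative sketch of the documented probe: GET <base>/health, pass on a
# 2xx response, warn above the 2000 ms latency budget, fail on everything else.
def probe_backend(base_url: str, warn_ms: int = 2000, timeout_s: float = 10.0) -> str:
    url = base_url.rstrip("/") + "/health"
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            elapsed_ms = (time.monotonic() - start) * 1000
            if 200 <= resp.status < 300:
                return "warn" if elapsed_ms > warn_ms else "pass"
            return "fail"
    except urllib.error.URLError:
        # Covers connection errors, DNS failures, and non-success HTTP codes.
        return "fail"
```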
## Why It Matters
The backend API is the control plane entry point for many Stella Ops flows. If it is unreachable, UI features and cross-service orchestration degrade quickly.
## Common Causes
- `StellaOps__BackendUrl` points to the wrong host, port, or scheme
- The backend service is down or returning `5xx`
- DNS, proxy, or network rules block access from the Doctor service
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
StellaOps__BackendUrl: http://platform-web:8080
```
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://platform-web:8080/health
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 platform-web
```
### Bare Metal / systemd
```bash
curl -fsS http://<backend-host>:<port>/health
journalctl -u <backend-service> -n 200
```
### Kubernetes / Helm
```bash
kubectl exec deploy/doctor-web -n <namespace> -- curl -fsS http://<backend-service>.<namespace>.svc.cluster.local:<port>/health
kubectl logs deploy/<backend-service> -n <namespace> --tail=200
```
## Verification
```bash
stella doctor --check check.servicegraph.backend
```
## Related Checks
- `check.servicegraph.endpoints` - validates the rest of the service graph after the main backend is reachable
- `check.servicegraph.timeouts` - slow backend responses often trace back to timeout tuning



@@ -0,0 +1,48 @@
---
checkId: check.servicegraph.circuitbreaker
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, resilience, circuit-breaker]
---
# Circuit Breaker Status
## What It Checks
Reads `Resilience:Enabled` or `HttpClient:Resilience:Enabled` and, when enabled, validates `BreakDurationSeconds`, `FailureThreshold`, and `SamplingDurationSeconds`.
The check reports info when resilience is not configured, warns when `BreakDurationSeconds < 5` or `FailureThreshold < 2`, and passes otherwise.
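The decision rules above can be sketched like this (hypothetical Python, not the Doctor implementation; `SamplingDurationSeconds` is validated as well but no warning threshold is documented for it):

```python
# Maps the documented inputs to a check outcome: info when resilience is off,
# warn when BreakDurationSeconds < 5 or FailureThreshold < 2, pass otherwise.
def breaker_check(enabled: bool, break_duration_s: int, failure_threshold: int) -> str:
    if not enabled:
        return "info"
    if break_duration_s < 5 or failure_threshold < 2:
        return "warn"
    return "pass"
```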
## Why It Matters
Circuit breakers protect external dependencies from retry storms. Bad thresholds either trip too aggressively or never trip when a downstream service is failing.
## Common Causes
- Resilience policies were never enabled on outgoing HTTP clients
- Thresholds were copied from a benchmark profile into production
- Multiple services use different resilience defaults, making failures unpredictable
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Resilience__Enabled: "true"
Resilience__CircuitBreaker__BreakDurationSeconds: "30"
Resilience__CircuitBreaker__FailureThreshold: "5"
Resilience__CircuitBreaker__SamplingDurationSeconds: "60"
```
### Bare Metal / systemd
Keep breaker settings in the same configuration source used for HTTP client registration so the service and Doctor observe the same values.
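For example, a systemd drop-in can carry the same keys used in the Compose example (the unit name and file path below are placeholders):

```ini
# /etc/systemd/system/<backend-service>.service.d/resilience.conf
[Service]
Environment=Resilience__Enabled=true
Environment=Resilience__CircuitBreaker__BreakDurationSeconds=30
Environment=Resilience__CircuitBreaker__FailureThreshold=5
Environment=Resilience__CircuitBreaker__SamplingDurationSeconds=60
```

Run `systemctl daemon-reload` and restart the unit so the new environment takes effect.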
### Kubernetes / Helm
Standardize resilience values across backend-facing workloads rather than overriding them per pod.
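If your chart maps values into container environment variables, a shared values fragment keeps workloads aligned (the `env` key layout below is illustrative and depends on your chart):

```yaml
# values.yaml fragment - adapt the key layout to your chart's env mapping
env:
  Resilience__Enabled: "true"
  Resilience__CircuitBreaker__BreakDurationSeconds: "30"
  Resilience__CircuitBreaker__FailureThreshold: "5"
  Resilience__CircuitBreaker__SamplingDurationSeconds: "60"
```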
## Verification
```bash
stella doctor --check check.servicegraph.circuitbreaker
```
## Related Checks
- `check.servicegraph.backend` - breaker policy protects this path when the backend degrades
- `check.servicegraph.timeouts` - timeout settings and breaker settings should be tuned together


@@ -0,0 +1,53 @@
---
checkId: check.servicegraph.endpoints
plugin: stellaops.doctor.servicegraph
severity: fail
tags: [servicegraph, services, endpoints, connectivity]
---
# Service Endpoints
## What It Checks
Collects configured service URLs for Authority, Scanner, Concelier, Excititor, Attestor, VexLens, and Gateway, appends `/health`, and probes each endpoint.
The check fails when any configured endpoint is unreachable or returns a non-success status. If no endpoints are configured, the check is skipped.
## Why It Matters
Stella Ops is a multi-service platform. A single broken internal endpoint can stall release orchestration, evidence generation, or advisory workflows even when the main web process is alive.
## Common Causes
- One or more `StellaOps:*Url` values are missing or point to the wrong internal service name
- Internal DNS or network routing is broken
- The target workload is up but not exposing `/health`
## How to Fix
### Docker Compose
Set the internal URLs explicitly:
```yaml
services:
  doctor-web:
    environment:
      StellaOps__AuthorityUrl: http://authority-web:8080
      StellaOps__ScannerUrl: http://scanner-web:8080
      StellaOps__GatewayUrl: http://web:8080
```
Probe each endpoint from the Doctor container:
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://authority-web:8080/health
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://scanner-web:8080/health
```
### Bare Metal / systemd
Confirm the service-discovery or reverse-proxy names resolve from the Doctor host.
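A quick resolution sweep from the Doctor host can look like this (the host names are placeholders; substitute the hosts from your `StellaOps:*Url` settings):

```shell
# Check that each configured service host resolves from the Doctor host.
for host in authority.internal scanner.internal gateway.internal; do
  if getent hosts "$host" > /dev/null 2>&1; then
    echo "$host: resolves"
  else
    echo "$host: no DNS entry"
  fi
done
```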
### Kubernetes / Helm
Use cluster-local service DNS names and check that each workload exports a health endpoint through the same port the URL references.
## Verification
```bash
stella doctor --check check.servicegraph.endpoints
```
## Related Checks
- `check.servicegraph.backend` - the backend is usually the first endpoint operators validate
- `check.servicegraph.mq` - asynchronous workflows also depend on messaging, not only HTTP endpoints


@@ -0,0 +1,56 @@
---
checkId: check.servicegraph.mq
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, messaging, rabbitmq, connectivity]
---
# Message Queue Connectivity
## What It Checks
Reads `RabbitMQ:Host` or `Messaging:RabbitMQ:Host` plus an optional port, defaulting to `5672`, and attempts a TCP connection.
The check skips when RabbitMQ is not configured and fails on timeouts, DNS failures, or refused connections.
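The connectivity test amounts to a plain TCP connect, roughly as follows (Python sketch for illustration; the real check lives inside the Doctor service):

```python
import socket

# Attempt a TCP connection to the broker; timeouts, DNS failures, and
# refused connections all surface as OSError and map to a failed check.
def probe_amqp(host: str, port: int = 5672, timeout_s: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```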
## Why It Matters
Release tasks, notifications, and deferred work often depend on a functioning message broker. A dead queue path turns healthy APIs into backlogged systems.
## Common Causes
- `RabbitMQ__Host` is unset or points to the wrong broker
- The broker container is down
- AMQP traffic is blocked between Doctor and RabbitMQ
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
RabbitMQ__Host: rabbitmq
RabbitMQ__Port: "5672"
```
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps rabbitmq
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 rabbitmq
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web sh -lc "nc -zv rabbitmq 5672"
```
### Bare Metal / systemd
```bash
nc -zv <rabbit-host> 5672
```
### Kubernetes / Helm
```bash
kubectl exec deploy/doctor-web -n <namespace> -- sh -lc "nc -zv <rabbit-service> 5672"
```
## Verification
```bash
stella doctor --check check.servicegraph.mq
```
## Related Checks
- `check.servicegraph.valkey` - cache and queue connectivity usually fail together when service networking is broken
- `check.servicegraph.timeouts` - aggressive timeouts can make a slow broker look unavailable


@@ -0,0 +1,48 @@
---
checkId: check.servicegraph.timeouts
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, timeouts, configuration]
---
# Service Timeouts
## What It Checks
Validates `HttpClient:Timeout`, `Database:CommandTimeout`, `Cache:OperationTimeout`, and `HealthChecks:Timeout`.
The check warns when HTTP timeout is below `5s` or above `300s`, database timeout is below `5s` or above `120s`, cache timeout exceeds `30s`, or health-check timeout exceeds the HTTP timeout.
## Why It Matters
Timeouts define how quickly failures surface and how long stuck work ties up resources. Poor values cause either premature failures or prolonged resource exhaustion.
## Common Causes
- Defaults from one environment were copied into another with very different latency
- Health-check timeout was set higher than the main request timeout
- Cache or database timeouts were raised to hide underlying performance problems
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
HttpClient__Timeout: "100"
Database__CommandTimeout: "30"
Cache__OperationTimeout: "5"
HealthChecks__Timeout: "10"
```
### Bare Metal / systemd
Tune timeouts from measured service latencies, not from guesswork. Raise values only after understanding why the dependency is slow.
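As in the Compose example, the keys can be pinned through a systemd drop-in (the unit name and path are placeholders; the values repeat the illustrative defaults, in seconds):

```ini
# /etc/systemd/system/<backend-service>.service.d/timeouts.conf
[Service]
Environment=HttpClient__Timeout=100
Environment=Database__CommandTimeout=30
Environment=Cache__OperationTimeout=5
Environment=HealthChecks__Timeout=10
```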
### Kubernetes / Helm
Keep application timeouts lower than ingress, service-mesh, and job-level deadlines so failures happen in the component that owns the retry policy.
## Verification
```bash
stella doctor --check check.servicegraph.timeouts
```
## Related Checks
- `check.servicegraph.backend` - timeout misconfiguration often shows up as backend failures first
- `check.db.latency` - high database latency can force operators to revisit timeout values


@@ -0,0 +1,52 @@
---
checkId: check.servicegraph.valkey
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, valkey, redis, cache]
---
# Valkey/Redis Connectivity
## What It Checks
Reads `Valkey:ConnectionString`, `Redis:ConnectionString`, `ConnectionStrings:Valkey`, or `ConnectionStrings:Redis`, parses the host and port, and opens a TCP connection.
The check skips when no cache connection string is configured and fails when the connection string cannot be parsed or the target cannot be reached.
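Extracting the host and port from a comma-delimited connection string like the Compose example below can be sketched like this (hypothetical helper, not the actual parser):

```python
# Parses "host" and "port" from the first comma-separated segment of a
# connection string such as "valkey:6379,password=secret".
def parse_cache_endpoint(conn: str, default_port: int = 6379):
    first = conn.split(",", 1)[0].strip()
    host, sep, port = first.partition(":")
    if not host:
        raise ValueError("connection string has no host")
    return host, int(port) if sep else default_port
```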
## Why It Matters
Cache unavailability affects queue coordination, state caching, and latency-sensitive platform features. A malformed connection string is also an early warning that the environment is not wired correctly.
## Common Causes
- The cache connection string is missing, malformed, or still points to a previous environment
- The Valkey/Redis service is not running
- Container networking or DNS is broken
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Valkey__ConnectionString: valkey:6379,password=${STELLAOPS_VALKEY_PASSWORD}
```
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps valkey
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web sh -lc "nc -zv valkey 6379"
```
### Bare Metal / systemd
```bash
redis-cli -h <valkey-host> -p 6379 ping
```
### Kubernetes / Helm
Use a cluster-local service name in the connection string and verify the port exposed by the StatefulSet or Service.
## Verification
```bash
stella doctor --check check.servicegraph.valkey
```
## Related Checks
- `check.servicegraph.mq` - both checks validate internal service-network connectivity
- `check.servicegraph.endpoints` - broad service discovery issues usually affect cache endpoints too