doctor: complete runtime check documentation sprint

Signed-off-by: master <>
This commit is contained in:
master
2026-03-31 23:26:24 +03:00
parent 404d50bcb7
commit 152c1b1357
54 changed files with 2210 additions and 258 deletions

View File

@@ -0,0 +1,56 @@
---
checkId: check.servicegraph.backend
plugin: stellaops.doctor.servicegraph
severity: fail
tags: [servicegraph, backend, api, connectivity]
---
# Backend API Connectivity
## What It Checks
Reads `StellaOps:BackendUrl` or `BackendUrl`, appends `/health`, and performs an HTTP GET through `IHttpClientFactory`.
The check passes on a successful response, warns when latency exceeds `2000ms`, and fails on non-success status codes or connection errors.
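The probe logic described above is roughly equivalent to the following sketch (Python used for illustration only; the actual check runs on .NET through `IHttpClientFactory`, and the function name here is hypothetical):

```python
import time
import urllib.request
import urllib.error

# Illustrative sketch of the documented probe: GET <base>/health, pass on a
# 2xx response, warn above the 2000 ms latency budget, fail on everything else.
def probe_backend(base_url: str, warn_ms: int = 2000, timeout_s: float = 10.0) -> str:
    url = base_url.rstrip("/") + "/health"
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            elapsed_ms = (time.monotonic() - start) * 1000
            if 200 <= resp.status < 300:
                return "warn" if elapsed_ms > warn_ms else "pass"
            return "fail"
    except urllib.error.URLError:
        # Covers connection errors, DNS failures, and non-success HTTP codes.
        return "fail"
```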
## Why It Matters
The backend API is the control plane entry point for many Stella Ops flows. If it is unreachable, UI features and cross-service orchestration degrade quickly.
## Common Causes
- `StellaOps__BackendUrl` points to the wrong host, port, or scheme
- The backend service is down or returning `5xx`
- DNS, proxy, or network rules block access from the Doctor service
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
StellaOps__BackendUrl: http://platform-web:8080
```
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://platform-web:8080/health
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 platform-web
```
### Bare Metal / systemd
```bash
curl -fsS http://<backend-host>:<port>/health
journalctl -u <backend-service> -n 200
```
### Kubernetes / Helm
```bash
kubectl exec deploy/doctor-web -n <namespace> -- curl -fsS http://<backend-service>.<namespace>.svc.cluster.local:<port>/health
kubectl logs deploy/<backend-service> -n <namespace> --tail=200
```
## Verification
```bash
stella doctor --check check.servicegraph.backend
```
## Related Checks
- `check.servicegraph.endpoints` - validates the rest of the service graph after the main backend is reachable
- `check.servicegraph.timeouts` - slow backend responses often trace back to timeout tuning



@@ -0,0 +1,48 @@
---
checkId: check.servicegraph.circuitbreaker
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, resilience, circuit-breaker]
---
# Circuit Breaker Status
## What It Checks
Reads `Resilience:Enabled` or `HttpClient:Resilience:Enabled` and, when enabled, validates `BreakDurationSeconds`, `FailureThreshold`, and `SamplingDurationSeconds`.
The check reports info when resilience is not configured, warns when `BreakDurationSeconds < 5` or `FailureThreshold < 2`, and passes otherwise.
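The decision rules above can be sketched like this (hypothetical Python, not the Doctor implementation; `SamplingDurationSeconds` is validated as well but no warning threshold is documented for it):

```python
# Maps the documented inputs to a check outcome: info when resilience is off,
# warn when BreakDurationSeconds < 5 or FailureThreshold < 2, pass otherwise.
def breaker_check(enabled: bool, break_duration_s: int, failure_threshold: int) -> str:
    if not enabled:
        return "info"
    if break_duration_s < 5 or failure_threshold < 2:
        return "warn"
    return "pass"
```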
## Why It Matters
Circuit breakers protect external dependencies from retry storms. Bad thresholds either trip too aggressively or never trip when a downstream service is failing.
## Common Causes
- Resilience policies were never enabled on outgoing HTTP clients
- Thresholds were copied from a benchmark profile into production
- Multiple services use different resilience defaults, making failures unpredictable
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Resilience__Enabled: "true"
Resilience__CircuitBreaker__BreakDurationSeconds: "30"
Resilience__CircuitBreaker__FailureThreshold: "5"
Resilience__CircuitBreaker__SamplingDurationSeconds: "60"
```
### Bare Metal / systemd
Keep breaker settings in the same configuration source used for HTTP client registration so the service and Doctor observe the same values.
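For example, a systemd drop-in can carry the same keys used in the Compose example (the unit name and file path below are placeholders):

```ini
# /etc/systemd/system/<backend-service>.service.d/resilience.conf
[Service]
Environment=Resilience__Enabled=true
Environment=Resilience__CircuitBreaker__BreakDurationSeconds=30
Environment=Resilience__CircuitBreaker__FailureThreshold=5
Environment=Resilience__CircuitBreaker__SamplingDurationSeconds=60
```

Run `systemctl daemon-reload` and restart the unit so the new environment takes effect.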
### Kubernetes / Helm
Standardize resilience values across backend-facing workloads rather than overriding them per pod.
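If your chart maps values into container environment variables, a shared values fragment keeps workloads aligned (the `env` key layout below is illustrative and depends on your chart):

```yaml
# values.yaml fragment - adapt the key layout to your chart's env mapping
env:
  Resilience__Enabled: "true"
  Resilience__CircuitBreaker__BreakDurationSeconds: "30"
  Resilience__CircuitBreaker__FailureThreshold: "5"
  Resilience__CircuitBreaker__SamplingDurationSeconds: "60"
```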
## Verification
```bash
stella doctor --check check.servicegraph.circuitbreaker
```
## Related Checks
- `check.servicegraph.backend` - breaker policy protects this path when the backend degrades
- `check.servicegraph.timeouts` - timeout settings and breaker settings should be tuned together


@@ -0,0 +1,53 @@
---
checkId: check.servicegraph.endpoints
plugin: stellaops.doctor.servicegraph
severity: fail
tags: [servicegraph, services, endpoints, connectivity]
---
# Service Endpoints
## What It Checks
Collects configured service URLs for Authority, Scanner, Concelier, Excititor, Attestor, VexLens, and Gateway, appends `/health`, and probes each endpoint.
The check fails when any configured endpoint is unreachable or returns a non-success status. If no endpoints are configured, the check is skipped.
## Why It Matters
Stella Ops is a multi-service platform. A single broken internal endpoint can stall release orchestration, evidence generation, or advisory workflows even when the main web process is alive.
## Common Causes
- One or more `StellaOps:*Url` values are missing or point to the wrong internal service name
- Internal DNS or network routing is broken
- The target workload is up but not exposing `/health`
## How to Fix
### Docker Compose
Set the internal URLs explicitly:
```yaml
services:
  doctor-web:
    environment:
      StellaOps__AuthorityUrl: http://authority-web:8080
      StellaOps__ScannerUrl: http://scanner-web:8080
      StellaOps__GatewayUrl: http://web:8080
```
Probe each endpoint from the Doctor container:
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://authority-web:8080/health
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://scanner-web:8080/health
```
### Bare Metal / systemd
Confirm the service-discovery or reverse-proxy names resolve from the Doctor host.
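A quick resolution sweep from the Doctor host can look like this (the host names are placeholders; substitute the hosts from your `StellaOps:*Url` settings):

```shell
# Check that each configured service host resolves from the Doctor host.
for host in authority.internal scanner.internal gateway.internal; do
  if getent hosts "$host" > /dev/null 2>&1; then
    echo "$host: resolves"
  else
    echo "$host: no DNS entry"
  fi
done
```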
### Kubernetes / Helm
Use cluster-local service DNS names and check that each workload exports a health endpoint through the same port the URL references.
## Verification
```bash
stella doctor --check check.servicegraph.endpoints
```
## Related Checks
- `check.servicegraph.backend` - the backend is usually the first endpoint operators validate
- `check.servicegraph.mq` - asynchronous workflows also depend on messaging, not only HTTP endpoints


@@ -0,0 +1,56 @@
---
checkId: check.servicegraph.mq
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, messaging, rabbitmq, connectivity]
---
# Message Queue Connectivity
## What It Checks
Reads `RabbitMQ:Host` or `Messaging:RabbitMQ:Host` plus an optional port, defaulting to `5672`, and attempts a TCP connection.
The check skips when RabbitMQ is not configured and fails on timeouts, DNS failures, or refused connections.
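The connectivity test amounts to a plain TCP connect, roughly as follows (Python sketch for illustration; the real check lives inside the Doctor service):

```python
import socket

# Attempt a TCP connection to the broker; timeouts, DNS failures, and
# refused connections all surface as OSError and map to a failed check.
def probe_amqp(host: str, port: int = 5672, timeout_s: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```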
## Why It Matters
Release tasks, notifications, and deferred work often depend on a functioning message broker. A dead queue path turns healthy APIs into backlogged systems.
## Common Causes
- `RabbitMQ__Host` is unset or points to the wrong broker
- The broker container is down
- AMQP traffic is blocked between Doctor and RabbitMQ
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
RabbitMQ__Host: rabbitmq
RabbitMQ__Port: "5672"
```
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps rabbitmq
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 rabbitmq
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web sh -lc "nc -zv rabbitmq 5672"
```
### Bare Metal / systemd
```bash
nc -zv <rabbit-host> 5672
```
### Kubernetes / Helm
```bash
kubectl exec deploy/doctor-web -n <namespace> -- sh -lc "nc -zv <rabbit-service> 5672"
```
## Verification
```bash
stella doctor --check check.servicegraph.mq
```
## Related Checks
- `check.servicegraph.valkey` - cache and queue connectivity usually fail together when service networking is broken
- `check.servicegraph.timeouts` - aggressive timeouts can make a slow broker look unavailable


@@ -0,0 +1,48 @@
---
checkId: check.servicegraph.timeouts
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, timeouts, configuration]
---
# Service Timeouts
## What It Checks
Validates `HttpClient:Timeout`, `Database:CommandTimeout`, `Cache:OperationTimeout`, and `HealthChecks:Timeout`.
The check warns when HTTP timeout is below `5s` or above `300s`, database timeout is below `5s` or above `120s`, cache timeout exceeds `30s`, or health-check timeout exceeds the HTTP timeout.
## Why It Matters
Timeouts define how quickly failures surface and how long stuck work ties up resources. Poor values cause either premature failures or prolonged resource exhaustion.
## Common Causes
- Defaults from one environment were copied into another with very different latency
- Health-check timeout was set higher than the main request timeout
- Cache or database timeouts were raised to hide underlying performance problems
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
HttpClient__Timeout: "100"
Database__CommandTimeout: "30"
Cache__OperationTimeout: "5"
HealthChecks__Timeout: "10"
```
### Bare Metal / systemd
Tune timeouts from measured service latencies, not from guesswork. Raise values only after understanding why the dependency is slow.
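As in the Compose example, the keys can be pinned through a systemd drop-in (the unit name and path are placeholders; the values repeat the illustrative defaults, in seconds):

```ini
# /etc/systemd/system/<backend-service>.service.d/timeouts.conf
[Service]
Environment=HttpClient__Timeout=100
Environment=Database__CommandTimeout=30
Environment=Cache__OperationTimeout=5
Environment=HealthChecks__Timeout=10
```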
### Kubernetes / Helm
Keep application timeouts lower than ingress, service-mesh, and job-level deadlines so failures happen in the component that owns the retry policy.
## Verification
```bash
stella doctor --check check.servicegraph.timeouts
```
## Related Checks
- `check.servicegraph.backend` - timeout misconfiguration often shows up as backend failures first
- `check.db.latency` - high database latency can force operators to revisit timeout values


@@ -0,0 +1,52 @@
---
checkId: check.servicegraph.valkey
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, valkey, redis, cache]
---
# Valkey/Redis Connectivity
## What It Checks
Reads `Valkey:ConnectionString`, `Redis:ConnectionString`, `ConnectionStrings:Valkey`, or `ConnectionStrings:Redis`, parses the host and port, and opens a TCP connection.
The check skips when no cache connection string is configured and fails when the connection string cannot be parsed or the target cannot be reached.
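Extracting the host and port from a comma-delimited connection string like the Compose example below can be sketched like this (hypothetical helper, not the actual parser):

```python
# Parses "host" and "port" from the first comma-separated segment of a
# connection string such as "valkey:6379,password=secret".
def parse_cache_endpoint(conn: str, default_port: int = 6379):
    first = conn.split(",", 1)[0].strip()
    host, sep, port = first.partition(":")
    if not host:
        raise ValueError("connection string has no host")
    return host, int(port) if sep else default_port
```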
## Why It Matters
Cache unavailability affects queue coordination, state caching, and latency-sensitive platform features. A malformed connection string is also an early warning that the environment is not wired correctly.
## Common Causes
- The cache connection string is missing, malformed, or still points to a previous environment
- The Valkey/Redis service is not running
- Container networking or DNS is broken
## How to Fix
### Docker Compose
```yaml
services:
doctor-web:
environment:
Valkey__ConnectionString: valkey:6379,password=${STELLAOPS_VALKEY_PASSWORD}
```
```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps valkey
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web sh -lc "nc -zv valkey 6379"
```
### Bare Metal / systemd
```bash
redis-cli -h <valkey-host> -p 6379 ping
```
### Kubernetes / Helm
Use a cluster-local service name in the connection string and verify the port exposed by the StatefulSet or Service.
## Verification
```bash
stella doctor --check check.servicegraph.valkey
```
## Related Checks
- `check.servicegraph.mq` - both checks validate internal service-network connectivity
- `check.servicegraph.endpoints` - broad service discovery issues usually affect cache endpoints too