doctor: complete runtime check documentation sprint
Signed-off-by: master <>
56
docs/doctor/articles/servicegraph/servicegraph-backend.md
Normal file
@@ -0,0 +1,56 @@
---
checkId: check.servicegraph.backend
plugin: stellaops.doctor.servicegraph
severity: fail
tags: [servicegraph, backend, api, connectivity]
---

# Backend API Connectivity

## What It Checks

Reads `StellaOps:BackendUrl` or `BackendUrl`, appends `/health`, and performs an HTTP GET through `IHttpClientFactory`.

The check passes on a successful response, warns when latency exceeds `2000ms`, and fails on non-success status codes or connection errors.
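The pass/warn/fail decision can be sketched as follows. This is an illustrative Python sketch, not the actual implementation (the real check runs in .NET through `IHttpClientFactory`); the function name and threshold constant are assumptions:

```python
import time
import urllib.error
import urllib.request

WARN_LATENCY_MS = 2000  # documented warn threshold

def classify_backend_health(base_url: str, timeout_s: float = 10.0) -> str:
    """Probe <base_url>/health and map the outcome to pass/warn/fail."""
    url = base_url.rstrip("/") + "/health"
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            elapsed_ms = (time.monotonic() - start) * 1000
            if not 200 <= resp.status < 300:
                return "fail"  # non-success status code
            if elapsed_ms > WARN_LATENCY_MS:
                return "warn"  # reachable, but slower than the threshold
            return "pass"
    except (urllib.error.URLError, OSError):
        return "fail"          # connection error or DNS failure
```

Note that a `4xx`/`5xx` response and a refused connection both map to `fail`; only latency above the threshold downgrades a healthy response to `warn`.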
## Why It Matters

The backend API is the control-plane entry point for many Stella Ops flows. If it is unreachable, UI features and cross-service orchestration degrade quickly.

## Common Causes

- `StellaOps__BackendUrl` points to the wrong host, port, or scheme
- The backend service is down or returning `5xx`
- DNS, proxy, or network rules block access from the Doctor service

## How to Fix
### Docker Compose

```yaml
services:
  doctor-web:
    environment:
      StellaOps__BackendUrl: http://platform-web:8080
```

```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://platform-web:8080/health
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 platform-web
```
### Bare Metal / systemd

```bash
curl -fsS http://<backend-host>:<port>/health
journalctl -u <backend-service> -n 200
```

### Kubernetes / Helm

```bash
kubectl exec deploy/doctor-web -n <namespace> -- curl -fsS http://<backend-service>.<namespace>.svc.cluster.local:<port>/health
kubectl logs deploy/<backend-service> -n <namespace> --tail=200
```

## Verification

```bash
stella doctor --check check.servicegraph.backend
```

## Related Checks

- `check.servicegraph.endpoints` - validates the rest of the service graph after the main backend is reachable
- `check.servicegraph.timeouts` - slow backend responses often trace back to timeout tuning
||||
@@ -0,0 +1,48 @@
---
checkId: check.servicegraph.circuitbreaker
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, resilience, circuit-breaker]
---

# Circuit Breaker Status

## What It Checks

Reads `Resilience:Enabled` or `HttpClient:Resilience:Enabled` and, when enabled, validates `BreakDurationSeconds`, `FailureThreshold`, and `SamplingDurationSeconds`.

The check reports info when resilience is not configured, warns when `BreakDurationSeconds < 5` or `FailureThreshold < 2`, and passes otherwise.
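The documented rules can be sketched in Python. This is a hedged illustration, not the check's real code, which reads .NET configuration keys; the function name and dict shape are assumptions:

```python
def evaluate_breaker_config(settings: dict) -> str:
    """Map circuit-breaker settings to a Doctor-style outcome."""
    if not settings.get("Enabled", False):
        return "info"  # resilience not configured at all
    if settings.get("BreakDurationSeconds", 0) < 5:
        return "warn"  # breaker reopens too quickly to shed load
    if settings.get("FailureThreshold", 0) < 2:
        return "warn"  # a single failure would trip the breaker
    return "pass"
```

For example, `{"Enabled": True, "BreakDurationSeconds": 30, "FailureThreshold": 5}` passes, while the same settings with `BreakDurationSeconds: 2` produce a warning.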
## Why It Matters

Circuit breakers protect external dependencies from retry storms. Bad thresholds either trip too aggressively or never trip when a downstream service is failing.

## Common Causes

- Resilience policies were never enabled on outgoing HTTP clients
- Thresholds were copied from a benchmark profile into production
- Multiple services use different resilience defaults, making failures unpredictable

## How to Fix
### Docker Compose

```yaml
services:
  doctor-web:
    environment:
      Resilience__Enabled: "true"
      Resilience__CircuitBreaker__BreakDurationSeconds: "30"
      Resilience__CircuitBreaker__FailureThreshold: "5"
      Resilience__CircuitBreaker__SamplingDurationSeconds: "60"
```
### Bare Metal / systemd

Keep breaker settings in the same configuration source used for HTTP client registration so the service and Doctor observe the same values.

### Kubernetes / Helm

Standardize resilience values across backend-facing workloads instead of per-pod overrides.

## Verification

```bash
stella doctor --check check.servicegraph.circuitbreaker
```

## Related Checks

- `check.servicegraph.backend` - breaker policy protects this path when the backend degrades
- `check.servicegraph.timeouts` - timeout settings and breaker settings should be tuned together
53
docs/doctor/articles/servicegraph/servicegraph-endpoints.md
Normal file
@@ -0,0 +1,53 @@
---
checkId: check.servicegraph.endpoints
plugin: stellaops.doctor.servicegraph
severity: fail
tags: [servicegraph, services, endpoints, connectivity]
---

# Service Endpoints

## What It Checks

Collects configured service URLs for Authority, Scanner, Concelier, Excititor, Attestor, VexLens, and Gateway, appends `/health`, and probes each endpoint.

The check fails when any configured endpoint is unreachable or returns a non-success status. If no endpoints are configured, the check is skipped.
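The probe loop can be sketched as follows. This is an illustrative Python sketch under assumed names (the real check is .NET and reads `StellaOps:*Url` settings); the map keys and function name are hypothetical:

```python
import urllib.error
import urllib.request

def probe_endpoints(service_urls: dict, timeout_s: float = 5.0) -> dict:
    """Probe <url>/health for each configured service.

    Returns 'skip' when nothing is configured, 'fail' with the list of
    unreachable services otherwise, and 'pass' when all probes succeed.
    """
    if not service_urls:
        return {"status": "skip", "failures": []}
    failures = []
    for name, base in service_urls.items():
        try:
            with urllib.request.urlopen(base.rstrip("/") + "/health",
                                        timeout=timeout_s) as resp:
                if not 200 <= resp.status < 300:
                    failures.append(name)
        except (urllib.error.URLError, OSError):
            failures.append(name)
    return {"status": "fail" if failures else "pass", "failures": failures}
```

A single unreachable entry is enough to fail the whole check, which is why the result carries the per-service failure list for diagnosis.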
## Why It Matters

Stella Ops is a multi-service platform. A single broken internal endpoint can stall release orchestration, evidence generation, or advisory workflows even when the main web process is alive.

## Common Causes

- One or more `StellaOps:*Url` values are missing or point to the wrong internal service name
- Internal DNS or network routing is broken
- The target workload is up but not exposing `/health`

## How to Fix
### Docker Compose

Set the internal URLs explicitly:

```yaml
StellaOps__AuthorityUrl: http://authority-web:8080
StellaOps__ScannerUrl: http://scanner-web:8080
StellaOps__GatewayUrl: http://web:8080
```

Probe each endpoint from the Doctor container:

```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://authority-web:8080/health
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web curl -fsS http://scanner-web:8080/health
```
### Bare Metal / systemd

Confirm the service-discovery or reverse-proxy names resolve from the Doctor host.
### Kubernetes / Helm

Use cluster-local service DNS names and check that each workload exposes its health endpoint on the same port the URL references.

## Verification

```bash
stella doctor --check check.servicegraph.endpoints
```

## Related Checks

- `check.servicegraph.backend` - the backend is usually the first endpoint operators validate
- `check.servicegraph.mq` - asynchronous workflows also depend on messaging, not only HTTP endpoints
56
docs/doctor/articles/servicegraph/servicegraph-mq.md
Normal file
@@ -0,0 +1,56 @@
---
checkId: check.servicegraph.mq
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, messaging, rabbitmq, connectivity]
---

# Message Queue Connectivity

## What It Checks

Reads `RabbitMQ:Host` or `Messaging:RabbitMQ:Host` plus an optional port, defaulting to `5672`, and attempts a TCP connection.

The check skips when RabbitMQ is not configured and fails on timeouts, DNS failures, or refused connections.
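The TCP-level behaviour can be sketched in Python. A minimal sketch under assumed names, not the check's actual .NET implementation; note it only verifies socket reachability, not AMQP credentials:

```python
import socket

def probe_amqp(host, port: int = 5672, timeout_s: float = 3.0) -> str:
    """TCP reachability probe mirroring the documented behaviour:
    skip when no host is configured; fail on DNS errors, refused
    connections, or timeouts; pass when the socket opens."""
    if not host:
        return "skip"
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "pass"
    except OSError:  # covers gaierror, ConnectionRefusedError, timeouts
        return "fail"
```

A passing probe therefore confirms network wiring only; authentication or vhost problems would still surface later in the application logs.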
## Why It Matters

Release tasks, notifications, and deferred work often depend on a functioning message broker. A dead queue path turns healthy APIs into backlogged systems.

## Common Causes

- `RabbitMQ__Host` is unset or points to the wrong broker
- The broker container is down
- AMQP traffic is blocked between Doctor and RabbitMQ

## How to Fix
### Docker Compose

```yaml
services:
  doctor-web:
    environment:
      RabbitMQ__Host: rabbitmq
      RabbitMQ__Port: "5672"
```

```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps rabbitmq
docker compose -f devops/compose/docker-compose.stella-ops.yml logs --tail 100 rabbitmq
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web sh -lc "nc -zv rabbitmq 5672"
```
### Bare Metal / systemd

```bash
nc -zv <rabbit-host> 5672
```

### Kubernetes / Helm

```bash
kubectl exec deploy/doctor-web -n <namespace> -- sh -lc "nc -zv <rabbit-service> 5672"
```

## Verification

```bash
stella doctor --check check.servicegraph.mq
```

## Related Checks

- `check.servicegraph.valkey` - cache and queue connectivity usually fail together when service networking is broken
- `check.servicegraph.timeouts` - aggressive timeouts can make a slow broker look unavailable
48
docs/doctor/articles/servicegraph/servicegraph-timeouts.md
Normal file
@@ -0,0 +1,48 @@
---
checkId: check.servicegraph.timeouts
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, timeouts, configuration]
---

# Service Timeouts

## What It Checks

Validates `HttpClient:Timeout`, `Database:CommandTimeout`, `Cache:OperationTimeout`, and `HealthChecks:Timeout`.

The check warns when the HTTP timeout is below `5s` or above `300s`, the database timeout is below `5s` or above `120s`, the cache timeout exceeds `30s`, or the health-check timeout exceeds the HTTP timeout.
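The documented thresholds can be sketched as a simple rule set. An illustrative Python sketch with hypothetical names; the real check reads the .NET configuration keys listed above:

```python
def check_timeouts(http_s: float, db_s: float,
                   cache_s: float, health_s: float) -> list:
    """Return one warning per out-of-range timeout value."""
    warnings = []
    if not 5 <= http_s <= 300:
        warnings.append("HttpClient:Timeout outside 5-300s")
    if not 5 <= db_s <= 120:
        warnings.append("Database:CommandTimeout outside 5-120s")
    if cache_s > 30:
        warnings.append("Cache:OperationTimeout above 30s")
    if health_s > http_s:
        warnings.append("HealthChecks:Timeout exceeds HttpClient:Timeout")
    return warnings
```

The Compose values shown later in this article (`100`, `30`, `5`, `10`) produce no warnings under these rules.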
## Why It Matters

Timeouts define how quickly failures surface and how long stuck work ties up resources. Poor values cause either premature failures or prolonged resource exhaustion.

## Common Causes

- Defaults from one environment were copied into another with very different latency
- Health-check timeout was set higher than the main request timeout
- Cache or database timeouts were raised to hide underlying performance problems

## How to Fix
### Docker Compose

```yaml
services:
  doctor-web:
    environment:
      HttpClient__Timeout: "100"
      Database__CommandTimeout: "30"
      Cache__OperationTimeout: "5"
      HealthChecks__Timeout: "10"
```
### Bare Metal / systemd

Tune timeouts from measured service latencies, not from guesswork. Raise values only after understanding the slower dependency.

### Kubernetes / Helm

Keep application timeouts lower than ingress, service-mesh, and job-level deadlines so failures happen in the component that owns the retry policy.

## Verification

```bash
stella doctor --check check.servicegraph.timeouts
```

## Related Checks

- `check.servicegraph.backend` - timeout misconfiguration often shows up as backend failures first
- `check.db.latency` - high database latency can force operators to revisit timeout values
52
docs/doctor/articles/servicegraph/servicegraph-valkey.md
Normal file
@@ -0,0 +1,52 @@
---
checkId: check.servicegraph.valkey
plugin: stellaops.doctor.servicegraph
severity: warn
tags: [servicegraph, valkey, redis, cache]
---

# Valkey/Redis Connectivity

## What It Checks

Reads `Valkey:ConnectionString`, `Redis:ConnectionString`, `ConnectionStrings:Valkey`, or `ConnectionStrings:Redis`, parses the host and port, and opens a TCP connection.

The check skips when no cache connection string is configured and fails when parsing fails or the target cannot be reached.
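The parse-then-connect behaviour can be sketched in Python. A hedged illustration with assumed names, treating the first comma-separated segment as the endpoint in the StackExchange.Redis connection-string style used in the Compose example; the real check is .NET:

```python
import socket

def parse_cache_endpoint(conn_str):
    """Extract (host, port) from 'host:port,option=value,...'; None if malformed."""
    first = conn_str.split(",")[0].strip()
    if not first or "=" in first:
        return None  # first segment is an option, not an endpoint
    host, _, port = first.partition(":")
    try:
        return (host, int(port) if port else 6379)  # default Valkey/Redis port
    except ValueError:
        return None

def probe_cache(conn_str, timeout_s: float = 3.0) -> str:
    if not conn_str:
        return "skip"   # no cache configured
    endpoint = parse_cache_endpoint(conn_str)
    if endpoint is None:
        return "fail"   # malformed connection string
    try:
        with socket.create_connection(endpoint, timeout=timeout_s):
            return "pass"
    except OSError:
        return "fail"   # DNS failure, refusal, or timeout
```

Note that a malformed string fails rather than skips, which is what makes this check useful as an early wiring signal.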
## Why It Matters

Cache unavailability affects queue coordination, state caching, and latency-sensitive platform features. A malformed connection string is also an early warning that the environment is not wired correctly.

## Common Causes

- The cache connection string is missing, malformed, or still points to a previous environment
- The Valkey/Redis service is not running
- Container networking or DNS is broken

## How to Fix
### Docker Compose

```yaml
services:
  doctor-web:
    environment:
      Valkey__ConnectionString: valkey:6379,password=${STELLAOPS_VALKEY_PASSWORD}
```

```bash
docker compose -f devops/compose/docker-compose.stella-ops.yml ps valkey
docker compose -f devops/compose/docker-compose.stella-ops.yml exec doctor-web sh -lc "nc -zv valkey 6379"
```
### Bare Metal / systemd

```bash
redis-cli -h <valkey-host> -p 6379 ping
```

### Kubernetes / Helm

Use a cluster-local service name in the connection string and verify the port exposed by the StatefulSet or Service.

## Verification

```bash
stella doctor --check check.servicegraph.valkey
```

## Related Checks

- `check.servicegraph.mq` - both checks validate internal service-network connectivity
- `check.servicegraph.endpoints` - broad service discovery issues usually affect cache endpoints too