Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

@@ -0,0 +1,114 @@
---
checkId: check.auth.token-service
plugin: stellaops.doctor.auth
severity: fail
tags: [auth, service, health]
---
# Token Service Health
## What It Checks
Verifies the availability and performance of the token service endpoint (`/connect/token`). The check evaluates four conditions:
1. **Service unavailable** -- token endpoint is not responding. Result: **Fail** with the endpoint URL and error message.
2. **Critically slow** -- response time exceeds **2000ms**. Result: **Fail** with actual response time and threshold.
3. **Slow** -- response time exceeds **500ms** but does not exceed 2000ms. Result: **Warn** with response time, threshold, and token issuance count.
4. **Healthy** -- service is available and response time does not exceed 500ms. Result: **Pass** with response time, tokens issued in last 24 hours, and active session count.
Evidence collected: `ServiceAvailable` (YES/NO), `Endpoint`, `ResponseTimeMs`, `CriticalThreshold` (2000), `WarningThreshold` (500), `TokensIssuedLast24h`, `ActiveSessions`, `Error`.
The check always runs (`CanRun` returns true).
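The four conditions above can be sketched as a small shell function. This is an illustration of the documented decision order only, not the plugin's actual implementation: the function name `classify_token_service` is hypothetical, and treating a response time of exactly 500ms or 2000ms as not exceeding the threshold is an assumption read from the word "exceeds".

```bash
#!/usr/bin/env bash
# Illustrative sketch only; not the Doctor plugin's real code.
# Threshold values come from the CriticalThreshold and WarningThreshold
# evidence fields documented above.
CRITICAL_THRESHOLD_MS=2000
WARNING_THRESHOLD_MS=500

# classify_token_service AVAILABLE RESPONSE_MS   (hypothetical helper name)
#   AVAILABLE   - "YES" or "NO", mirroring the ServiceAvailable evidence field
#   RESPONSE_MS - integer response time in milliseconds
# Prints Fail, Warn, or Pass.
classify_token_service() {
  local available="$1" response_ms="$2"

  if [ "$available" != "YES" ]; then
    echo "Fail"   # condition 1: service unavailable
  elif [ "$response_ms" -gt "$CRITICAL_THRESHOLD_MS" ]; then
    echo "Fail"   # condition 2: critically slow
  elif [ "$response_ms" -gt "$WARNING_THRESHOLD_MS" ]; then
    echo "Warn"   # condition 3: slow (assumed: strictly above the threshold)
  else
    echo "Pass"   # condition 4: healthy
  fi
}
```

For example, `classify_token_service YES 742` prints `Warn`, and `classify_token_service NO 0` prints `Fail` regardless of the response time.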
## Why It Matters
The token service is the single point through which all access tokens are issued. If it is unavailable, no user can log in, no service can authenticate, and every API call fails with 401. Even when the service is available, slow responses degrade user logins, cause automated integrations to time out, and make the platform feel unresponsive. This check is typically the first to detect Authority database issues or resource starvation.
## Common Causes
- Authority service not running (container stopped, process crashed)
- Token endpoint misconfigured (wrong path, wrong port)
- Database connectivity issue (Authority cannot query clients/keys)
- Database performance issues (slow queries for token validation)
- Service overloaded (high authentication request volume)
- Resource contention (CPU/memory pressure on Authority host)
- Higher than normal load (warning-level)
- Database query performance degraded (warning-level)
## How to Fix
### Docker Compose
```bash
# Check Authority service status
docker compose -f devops/compose/docker-compose.stella-ops.yml ps authority
# View Authority service logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs authority --tail 200
# Restart Authority service
docker compose -f devops/compose/docker-compose.stella-ops.yml restart authority
# Test token endpoint directly (prints HTTP status code and total time in seconds)
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" http://localhost:80/connect/token
# Check database connectivity
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella doctor run --check check.storage.postgres
```
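The curl command above reports `time_total` in seconds, whereas the check's thresholds (500 and 2000) are expressed in milliseconds. A minimal sketch for converting and comparing the two follows; the helper names `seconds_to_ms` and `exceeds_threshold` are hypothetical and not part of the `stella` CLI.

```bash
#!/usr/bin/env bash
# seconds_to_ms SECONDS   (hypothetical helper name)
# Converts a decimal seconds value, as printed by curl's %{time_total},
# to whole milliseconds, rounded to the nearest millisecond.
seconds_to_ms() {
  awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000 + 0.5 }'
}

# exceeds_threshold RESPONSE_MS THRESHOLD_MS   (hypothetical helper name)
# Exits 0 when the response time is strictly above the threshold.
exceeds_threshold() {
  [ "$1" -gt "$2" ]
}
```

A measured `time_total` of `0.742` converts to 742 milliseconds, which is above the 500ms warning threshold but below the 2000ms critical threshold.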
### Bare Metal / systemd
```bash
# Check authority service status
stella auth status
# Restart authority service
stella service restart authority
# Check database connectivity
stella doctor run --check check.storage.postgres
# Monitor service metrics
stella auth metrics --period 1h
# Review database performance
stella doctor run --check check.storage.performance
# Watch metrics in real-time (warning-level slowness)
stella auth metrics --watch
```
### Kubernetes / Helm
```bash
# Check authority pod status
kubectl get pods -l app.kubernetes.io/component=authority -n stellaops
# View pod logs
kubectl logs -l app.kubernetes.io/component=authority -n stellaops --tail=200
# Check resource usage
kubectl top pods -l app.kubernetes.io/component=authority -n stellaops
# Restart authority pods
kubectl rollout restart deployment/stellaops-authority -n stellaops
# Scale up if under load
kubectl scale deployment stellaops-authority --replicas=3 -n stellaops
# Check liveness/readiness probe status
kubectl describe pod -l app.kubernetes.io/component=authority -n stellaops | grep -E -A5 "Liveness|Readiness"
```
## Verification
```bash
stella doctor run --check check.auth.token-service
```
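After a restart, the token service may need some time before the check passes again, so it can help to re-run the verification command a few times. The sketch below is a generic retry wrapper; `retry_until_ok` is a hypothetical helper, and using it with `stella doctor run` assumes the CLI exits with a non-zero status when the check does not pass, which this article does not confirm.

```bash
#!/usr/bin/env bash
# retry_until_ok ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY_SECONDS between tries.
# Returns 0 as soon as COMMAND exits 0, or 1 if every attempt fails.
retry_until_ok() {
  local attempts="$1" delay="$2"
  shift 2

  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Skip the sleep after the final attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

Under the exit-code assumption stated above, `retry_until_ok 10 6 stella doctor run --check check.auth.token-service` would retry for up to about a minute before giving up.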
## Related Checks
- `check.auth.config` -- auth must be configured before the token service can function
- `check.auth.signing-key` -- token issuance requires a valid signing key
- `check.auth.oidc` -- if delegating to external OIDC, that provider must also be healthy