Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/auth/token-service.md
+++ b/docs/doctor/articles/auth/token-service.md
@@ -0,0 +1,114 @@
+---
+checkId: check.auth.token-service
+plugin: stellaops.doctor.auth
+severity: fail
+tags: [auth, service, health]
+---
+# Token Service Health
+
+## What It Checks
+
+Verifies the availability and performance of the token service endpoint (`/connect/token`). The check evaluates four conditions:
+
+1. **Service unavailable** -- token endpoint is not responding. Result: **Fail** with the endpoint URL and error message.
+2. **Critically slow** -- response time exceeds **2000ms**. Result: **Fail** with actual response time and threshold.
+3. **Slow** -- response time exceeds **500ms** but does not exceed 2000ms. Result: **Warn** with response time, threshold, and token issuance count.
+4. **Healthy** -- service is available and response time is 500ms or less. Result: **Pass** with response time, tokens issued in last 24 hours, and active session count.
+
+Evidence collected: `ServiceAvailable` (YES/NO), `Endpoint`, `ResponseTimeMs`, `CriticalThreshold` (2000), `WarningThreshold` (500), `TokensIssuedLast24h`, `ActiveSessions`, `Error`.
+
+The check always runs (`CanRun` returns true).
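
The four outcomes listed above can be sketched as a small shell function. This is an illustration only, not the check's actual implementation: the function name `classify_token_service` is hypothetical, and reading "exceeds" as strictly greater than (so exactly 2000ms is a warning and exactly 500ms passes) is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: maps availability and response time to a check result.
# Thresholds are the ones stated in the article: 500 ms warning, 2000 ms critical.
classify_token_service() {
  local available="$1"     # "YES" or "NO", mirroring the ServiceAvailable evidence field
  local response_ms="$2"   # response time in whole milliseconds
  local warn_ms=500
  local crit_ms=2000

  if [ "$available" != "YES" ]; then
    echo "Fail"            # condition 1: service unavailable
  elif [ "$response_ms" -gt "$crit_ms" ]; then
    echo "Fail"            # condition 2: critically slow
  elif [ "$response_ms" -gt "$warn_ms" ]; then
    echo "Warn"            # condition 3: slow
  else
    echo "Pass"            # condition 4: healthy
  fi
}
```

For example, `classify_token_service YES 800` prints `Warn`, while `classify_token_service NO 100` prints `Fail` regardless of the response time.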
+
+## Why It Matters
+
+The token service is the single point through which all access tokens are issued. If it is unavailable, no user can log in, no service can authenticate, and every API call fails with 401. Even if the service is available but slow, user login experiences degrade, automated integrations time out, and the platform feels unresponsive. This check is typically the first to detect Authority database issues or resource starvation.
+
+## Common Causes
+
+- Authority service not running (container stopped, process crashed)
+- Token endpoint misconfigured (wrong path, wrong port)
+- Database connectivity issue (Authority cannot query clients/keys)
+- Database performance issues (slow queries for token validation)
+- Service overloaded (high authentication request volume)
+- Resource contention (CPU/memory pressure on Authority host)
+- Higher than normal load (warning-level)
+- Database query performance degraded (warning-level)
+
+## How to Fix
+
+### Docker Compose
+
+```bash
+# Check Authority service status
+docker compose -f devops/compose/docker-compose.stella-ops.yml ps authority
+
+# View Authority service logs
+docker compose -f devops/compose/docker-compose.stella-ops.yml logs authority --tail 200
+
+# Restart Authority service
+docker compose -f devops/compose/docker-compose.stella-ops.yml restart authority
+
+# Test token endpoint directly
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
+  curl -s -o /dev/null -w "%{http_code} %{time_total}s" http://localhost:80/connect/token
+
+# Check database connectivity
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
+  stella doctor run --check check.storage.postgres
+```
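
The `curl` command above reports `time_total` in decimal seconds, while the check's thresholds are expressed in milliseconds. A minimal sketch of the conversion follows; the helper name `to_ms` is hypothetical, and a timing taken from inside the container may differ from the value the Doctor check records as `ResponseTimeMs`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert curl's time_total (decimal seconds)
# to whole milliseconds, rounded to the nearest millisecond.
to_ms() {
  awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000 + 0.5 }'
}

# Example usage (assumes the Authority container ships curl, as the command above does;
# -T disables pseudo-TTY allocation so the output can be captured cleanly):
#   t=$(docker compose -f devops/compose/docker-compose.stella-ops.yml exec -T authority \
#         curl -s -o /dev/null -w "%{time_total}" http://localhost:80/connect/token)
#   echo "ResponseTimeMs=$(to_ms "$t")"
```

The resulting number can be compared directly with the 500ms warning and 2000ms critical thresholds.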
+
+### Bare Metal / systemd
+
+```bash
+# Check authority service status
+stella auth status
+
+# Restart authority service
+stella service restart authority
+
+# Check database connectivity
+stella doctor run --check check.storage.postgres
+
+# Monitor service metrics
+stella auth metrics --period 1h
+
+# Review database performance
+stella doctor run --check check.storage.performance
+
+# Watch metrics in real-time (warning-level slowness)
+stella auth metrics --watch
+```
+
+### Kubernetes / Helm
+
+```bash
+# Check authority pod status
+kubectl get pods -l app.kubernetes.io/component=authority -n stellaops
+
+# View pod logs
+kubectl logs -l app.kubernetes.io/component=authority -n stellaops --tail=200
+
+# Check resource usage
+kubectl top pods -l app.kubernetes.io/component=authority -n stellaops
+
+# Restart authority pods
+kubectl rollout restart deployment/stellaops-authority -n stellaops
+
+# Scale up if under load
+kubectl scale deployment stellaops-authority --replicas=3 -n stellaops
+
+# Check liveness/readiness probe status
+kubectl describe pod -l app.kubernetes.io/component=authority -n stellaops | grep -A5 "Liveness\|Readiness"
+```
+
+## Verification
+
+```bash
+stella doctor run --check check.auth.token-service
+```
+
+## Related Checks
+
+- `check.auth.config` -- auth must be configured before the token service can function
+- `check.auth.signing-key` -- token issuance requires a valid signing key
+- `check.auth.oidc` -- if delegating to external OIDC, that provider must also be healthy