Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in: master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

@@ -0,0 +1,108 @@
---
checkId: check.auth.config
plugin: stellaops.doctor.auth
severity: fail
tags: [auth, security, core, config]
---
# Auth Configuration
## What It Checks
Validates the overall authentication configuration by inspecting three layers in sequence; the check passes only when all three are healthy:
1. **Authentication configured** -- verifies that the auth subsystem has been set up (issuer URL present, basic config loaded). If not: **Fail** with "Authentication not configured".
2. **Signing keys available** -- checks whether signing keys exist for token issuance. If configured but no keys: **Fail** with "No signing keys available".
3. **Signing key expiration** -- checks if the active signing key is approaching expiration. If it will expire soon: **Warn** with the number of days remaining.
4. **All healthy** -- issuer URL configured, signing keys available, key not near expiry. Result: **Pass**.
Evidence collected: `AuthConfigured` (YES/NO), `IssuerConfigured` (YES/NO), `IssuerUrl`, `SigningKeysConfigured`/`SigningKeysAvailable` (YES/NO), `KeyExpiration` (days), `ActiveClients` count, `ActiveScopes` count.
The check always runs (`CanRun` returns true).
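The decision order above can be sketched as a small shell function. This is an illustrative sketch, not the plugin's implementation: the function name `classify_auth_config`, its arguments, and the 30-day default warning window are assumptions. The article says only that the key is "approaching expiration"; 30 days is borrowed from the related `check.auth.signing-key` article.

```bash
#!/usr/bin/env bash
# Illustrative sketch of the check.auth.config decision order (hypothetical helper).
#   $1 auth_configured  YES|NO
#   $2 keys_available   YES|NO
#   $3 days_to_expiry   integer days until the active signing key expires
#   $4 warn_days        warning window in days; 30 is an assumed default
classify_auth_config() {
  local auth_configured="$1" keys_available="$2" days_to_expiry="$3" warn_days="${4:-30}"

  if [ "$auth_configured" != "YES" ]; then
    echo "fail: Authentication not configured"
  elif [ "$keys_available" != "YES" ]; then
    echo "fail: No signing keys available"
  elif [ "$days_to_expiry" -le "$warn_days" ]; then
    echo "warn: signing key expires in ${days_to_expiry} days"
  else
    echo "pass"
  fi
}
```

Each layer is evaluated only if the previous one succeeded, so a missing issuer URL is reported before any key problem.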
## Why It Matters
Authentication is the foundation of every API call in Stella Ops. If the auth subsystem is not configured, no user can log in, no service-to-service call can authenticate, and the entire platform is non-functional. Missing signing keys mean tokens cannot be issued, and an expiring key that is not rotated will cause a hard outage when it expires.
## Common Causes
- Authority service not configured (fresh installation without `stella setup auth`)
- Missing issuer URL configuration in environment variables or config files
- Signing keys not yet generated (first-run setup incomplete)
- Key material corrupted (disk failure, accidental deletion)
- HSM/PKCS#11 module not accessible (hardware key store offline)
- Signing key approaching expiration without scheduled rotation
## How to Fix
### Docker Compose
```bash
# Check Authority service configuration
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
cat /app/appsettings.json | grep -A5 "Issuer\|Signing"
# Set issuer URL via environment variable
# In .env or docker-compose.override.yml:
# AUTHORITY__ISSUER__URL=https://stella-ops.local/authority
# Restart Authority service after config changes
docker compose -f devops/compose/docker-compose.stella-ops.yml restart authority
# Generate signing keys
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
stella keys generate --type rsa
```
### Bare Metal / systemd
```bash
# Run initial auth setup
stella setup auth
# Configure issuer URL
stella auth configure --issuer https://auth.yourdomain.com
# Generate signing keys
stella keys generate --type rsa
# Rotate signing keys (if approaching expiration)
stella keys rotate
# Schedule automatic key rotation
stella keys rotate --schedule 30d
# Check key store health
stella doctor run --check check.crypto.keystore
```
### Kubernetes / Helm
```bash
# Check authority pod configuration
kubectl get configmap stellaops-authority-config -n stellaops -o yaml
# Set issuer URL in Helm values
# authority:
# issuer:
# url: "https://auth.yourdomain.com"
helm upgrade stellaops stellaops/stellaops -f values.yaml
# Generate signing keys inside the authority pod
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
stella keys generate --type rsa
# Check secrets for key material
kubectl get secret stellaops-signing-keys -n stellaops
```
## Verification
```bash
stella doctor run --check check.auth.config
```
## Related Checks
- `check.auth.signing-key` -- deeper signing key health (algorithm, size, rotation schedule)
- `check.auth.token-service` -- verifies token endpoint is responsive
- `check.auth.oidc` -- external OIDC provider connectivity

@@ -0,0 +1,100 @@
---
checkId: check.auth.oidc
plugin: stellaops.doctor.auth
severity: warn
tags: [auth, oidc, connectivity]
---
# OIDC Provider Connectivity
## What It Checks
Tests connectivity to an external OIDC provider by performing real HTTP requests. The check reads the issuer URL from configuration keys (in priority order): `Authentication:Oidc:Issuer`, `Auth:Oidc:Authority`, `Oidc:Issuer`. If none is configured, the check passes immediately (local authority mode).
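The priority order can be pictured with environment variables. This is a minimal sketch, assuming the configuration follows the .NET convention of writing `:` in a key as `__` in an environment variable name. Only `AUTHENTICATION__OIDC__ISSUER` appears in this article; the other two variable names are inferred from that convention, and `resolve_oidc_issuer` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Return the first non-empty issuer, in the priority order listed above.
# Prints "local-authority" when none is set (the check passes immediately in that case).
resolve_oidc_issuer() {
  local issuer="${AUTHENTICATION__OIDC__ISSUER:-${AUTH__OIDC__AUTHORITY:-${OIDC__ISSUER:-}}}"
  if [ -z "$issuer" ]; then
    echo "local-authority"
  else
    echo "$issuer"
  fi
}
```

Because `${VAR:-fallback}` treats an empty value like an unset one, a blank higher-priority key falls through to the next key instead of masking it.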
When an external provider is configured, the check performs a multi-step validation:
1. **Fetch discovery document** -- HTTP GET to `{issuerUrl}/.well-known/openid-configuration` with a 10-second timeout. If unreachable: **Fail** with connection error type classification (ssl_error, dns_failure, refused, timeout, connection_failed).
2. **Validate discovery fields** -- Parses the discovery JSON and verifies presence of `authorization_endpoint`, `token_endpoint`, and `jwks_uri`. If any are missing: **Warn** listing the missing fields.
3. **Fetch JWKS** -- HTTP GET to the `jwks_uri` from the discovery document. Counts the number of keys in the `keys` array. If zero keys: **Warn** (token validation may fail).
4. **All healthy** -- provider reachable, discovery valid, JWKS has keys. Result: **Pass**.
Evidence collected: `issuer_url`, `discovery_reachable`, `discovery_response_ms`, `authorization_endpoint_present`, `token_endpoint_present`, `jwks_uri_present`, `jwks_key_count`, `jwks_fetch_ms`, `http_status_code`, `error_message`, `connection_error_type`.
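For manual triage, the failure classes in step 1 can be approximated from curl's exit status. This is a hedged sketch: the mapping from curl exit codes to the `connection_error_type` labels is an assumption made for illustration (the check's own classification logic is not shown in this article), and `discovery_url` and `classify_curl_exit` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Build the discovery URL from an issuer, tolerating one trailing slash.
discovery_url() {
  local issuer="${1%/}"
  echo "${issuer}/.well-known/openid-configuration"
}

# Map a curl exit status to the error-type labels used by the check (assumed mapping).
classify_curl_exit() {
  case "$1" in
    0)     echo "ok" ;;
    6)     echo "dns_failure" ;;       # curl: could not resolve host
    7)     echo "refused" ;;           # curl: failed to connect (often connection refused)
    28)    echo "timeout" ;;           # curl: operation timed out
    35|60) echo "ssl_error" ;;         # curl: TLS handshake or certificate verification failed
    *)     echo "connection_failed" ;;
  esac
}

# Example (performs a real request, so it is left commented out):
#   curl -s --max-time 10 -o /dev/null "$(discovery_url "https://idp.example.com")"
#   classify_curl_exit "$?"
```

The 10-second `--max-time` mirrors the timeout described in step 1; `idp.example.com` is a placeholder host.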
## Why It Matters
When Stella Ops is configured to delegate authentication to an external OIDC provider (Azure AD, Keycloak, Okta, etc.), all user logins and token validations depend on that provider being reachable and correctly configured. A connectivity failure means users cannot log in, and services cannot validate tokens, leading to a platform-wide authentication outage.
## Common Causes
- OIDC provider is down or undergoing maintenance
- Network connectivity issue (proxy misconfiguration, firewall rule change)
- DNS resolution failure for the provider hostname
- Firewall blocking outbound HTTPS to the provider
- Discovery document missing required fields (misconfigured provider)
- Token endpoint misconfigured after provider upgrade
- JWKS endpoint returning empty key set (key rotation in progress)
- OIDC provider rate limiting or returning errors
## How to Fix
### Docker Compose
```bash
# Test OIDC provider connectivity from the authority container
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  curl -s "https://<oidc-issuer>/.well-known/openid-configuration" | jq .
# Check DNS resolution
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
nslookup <oidc-host>
# Set OIDC configuration via environment
# AUTHENTICATION__OIDC__ISSUER=https://login.microsoftonline.com/<tenant>/v2.0
```
### Bare Metal / systemd
```bash
# Test provider connectivity
curl -s "https://<oidc-issuer>/.well-known/openid-configuration" | jq .
# Check DNS resolution
nslookup <oidc-host>
# Validate OIDC configuration
stella auth oidc validate
# Check JWKS endpoint
curl -s "$(curl -s "https://<oidc-issuer>/.well-known/openid-configuration" | jq -r .jwks_uri)" | jq .
# Check network connectivity
stella doctor run --check check.network.dns
```
### Kubernetes / Helm
```bash
# Test from authority pod
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
  curl -s "https://<oidc-issuer>/.well-known/openid-configuration" | jq .
# Check NetworkPolicy allows egress to OIDC provider
kubectl get networkpolicy -n stellaops -o yaml | grep -A10 egress
# Set OIDC configuration in Helm values
# authority:
# oidc:
# issuer: "https://login.microsoftonline.com/<tenant>/v2.0"
helm upgrade stellaops stellaops/stellaops -f values.yaml
```
## Verification
```bash
stella doctor run --check check.auth.oidc
```
## Related Checks
- `check.auth.config` -- overall auth configuration health
- `check.auth.signing-key` -- local signing key health (used when not delegating to external OIDC)
- `check.auth.token-service` -- token endpoint availability

@@ -0,0 +1,106 @@
---
checkId: check.auth.signing-key
plugin: stellaops.doctor.auth
severity: fail
tags: [auth, security, keys]
---
# Signing Key Health
## What It Checks
Verifies the health of the active signing key used for token issuance. The check evaluates three conditions in sequence:
1. **No active key** -- if `HasActiveKey` is false: **Fail** with "No active signing key available". Evidence includes `ActiveKey: NONE` and total key count.
2. **Approaching expiration** -- if the active key expires within **30 days** (`ExpirationWarningDays`): **Warn** with the number of days remaining. Evidence includes key ID, algorithm, days until expiration, and whether rotation is scheduled.
3. **Healthy** -- active key exists with more than 30 days until expiration. Result: **Pass**. Evidence includes key ID, algorithm, key size (bits), days until expiration, and rotation schedule status.
The check always runs (`CanRun` returns true).
Evidence collected: `ActiveKeyId`, `Algorithm`, `KeySize`, `DaysUntilExpiration`, `RotationScheduled` (YES/NO), `TotalKeys`.
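The three conditions can be sketched in shell as follows. This is an illustration under stated assumptions, not the plugin's code: `days_until` and `classify_signing_key` are hypothetical helpers, and the only value taken from the article is the 30-day warning window (`ExpirationWarningDays`).

```bash
#!/usr/bin/env bash
# Whole days between two Unix timestamps (expiry, now), truncated toward zero.
days_until() {
  local expiry_epoch="$1" now_epoch="$2"
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Sketch of the check.auth.signing-key decision order.
#   $1 has_active_key        true|false
#   $2 days_until_expiration integer
#   $3 warning window in days; defaults to 30 (ExpirationWarningDays)
classify_signing_key() {
  local has_active_key="$1" days="$2" warn_days="${3:-30}"
  if [ "$has_active_key" != "true" ]; then
    echo "fail"
  elif [ "$days" -le "$warn_days" ]; then
    echo "warn"
  else
    echo "pass"
  fi
}
```

A key with exactly 30 days left is classified as a warning here, matching the article's wording that a healthy key has more than 30 days until expiration.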
## Why It Matters
The signing key signs every JWT issued by the Authority service. If no active key exists, no tokens can be issued, and authentication stops working across the entire platform. If the key is approaching expiration without a rotation plan, the platform faces a hard outage on the expiration date, because all tokens signed with the key become unverifiable.
## Common Causes
- Signing keys not generated (incomplete setup)
- All keys expired without rotation
- Key store corrupted (file system issue, accidental deletion)
- Key rotation not scheduled (manual process that was forgotten)
- Previous rotation attempt failed (permissions, HSM connectivity)
## How to Fix
### Docker Compose
```bash
# Check current key status
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
stella keys status
# Generate new signing key
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
stella keys generate --type rsa --bits 4096
# Activate the new key
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
stella keys activate
# Rotate keys
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
stella keys rotate
```
### Bare Metal / systemd
```bash
# Generate new signing key
stella keys generate --type rsa --bits 4096
# Activate the key
stella keys activate
# Rotate signing key
stella keys rotate
# Schedule automatic rotation (every 30 days)
stella keys rotate --schedule 30d
# Check key status
stella keys status
```
### Kubernetes / Helm
```bash
# Check key status
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
stella keys status
# Generate and activate key
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
  stella keys generate --type rsa --bits 4096
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
  stella keys activate
# Set automatic rotation in Helm values
# authority:
# signing:
# autoRotate: true
# rotationIntervalDays: 30
helm upgrade stellaops stellaops/stellaops -f values.yaml
# Check that the signing key secret exists and list its entries (values are not printed)
kubectl describe secret stellaops-signing-keys -n stellaops
```
## Verification
```bash
stella doctor run --check check.auth.signing-key
```
## Related Checks
- `check.auth.config` -- overall auth configuration including signing key presence
- `check.auth.token-service` -- token issuance depends on a healthy signing key
- `check.attestation.keymaterial` -- attestor signing keys (separate from auth signing keys)

@@ -0,0 +1,114 @@
---
checkId: check.auth.token-service
plugin: stellaops.doctor.auth
severity: fail
tags: [auth, service, health]
---
# Token Service Health
## What It Checks
Verifies the availability and performance of the token service endpoint (`/connect/token`). The check evaluates four conditions:
1. **Service unavailable** -- token endpoint is not responding. Result: **Fail** with the endpoint URL and error message.
2. **Critically slow** -- response time exceeds **2000ms**. Result: **Fail** with actual response time and threshold.
3. **Slow** -- response time exceeds **500ms** but is under 2000ms. Result: **Warn** with response time, threshold, and token issuance count.
4. **Healthy** -- service is available and response time is under 500ms. Result: **Pass** with response time, tokens issued in last 24 hours, and active session count.
Evidence collected: `ServiceAvailable` (YES/NO), `Endpoint`, `ResponseTimeMs`, `CriticalThreshold` (2000), `WarningThreshold` (500), `TokensIssuedLast24h`, `ActiveSessions`, `Error`.
The check always runs (`CanRun` returns true).
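The thresholds can be sketched as a shell classifier. This is a hedged illustration: `seconds_to_ms` and `classify_token_service` are hypothetical helpers, and the treatment of the exact boundary values (500 ms counts as pass, 2000 ms counts as warn) is an assumption, since the article specifies only "exceeds" and "under".

```bash
#!/usr/bin/env bash
# Thresholds from the article: warn above 500 ms, fail above 2000 ms.
WARNING_THRESHOLD_MS=500
CRITICAL_THRESHOLD_MS=2000

# Convert a curl %{time_total} value (seconds, fractional) to whole milliseconds.
seconds_to_ms() {
  awk -v t="$1" 'BEGIN { printf "%.0f\n", t * 1000 }'
}

#   $1 available         YES|NO
#   $2 response_time_ms  integer milliseconds
classify_token_service() {
  local available="$1" ms="$2"
  if [ "$available" != "YES" ]; then
    echo "fail: service unavailable"
  elif [ "$ms" -gt "$CRITICAL_THRESHOLD_MS" ]; then
    echo "fail: critically slow (${ms}ms > ${CRITICAL_THRESHOLD_MS}ms)"
  elif [ "$ms" -gt "$WARNING_THRESHOLD_MS" ]; then
    echo "warn: slow (${ms}ms > ${WARNING_THRESHOLD_MS}ms)"
  else
    echo "pass"
  fi
}
```

A timing captured with `curl -w "%{time_total}"` can be passed through `seconds_to_ms` and then into `classify_token_service YES <ms>` to compare a manual measurement against the same thresholds.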
## Why It Matters
The token service is the single point through which all access tokens are issued. If it is unavailable, no user can log in, no service can authenticate, and every API call fails with 401. Even if the service is available but slow, user login experiences degrade, automated integrations time out, and the platform feels unresponsive. This check is typically the first to detect Authority database issues or resource starvation.
## Common Causes
- Authority service not running (container stopped, process crashed)
- Token endpoint misconfigured (wrong path, wrong port)
- Database connectivity issue (Authority cannot query clients/keys)
- Database performance issues (slow queries for token validation)
- Service overloaded (high authentication request volume)
- Resource contention (CPU/memory pressure on Authority host)
- Higher than normal load (warning-level)
- Database query performance degraded (warning-level)
## How to Fix
### Docker Compose
```bash
# Check Authority service status
docker compose -f devops/compose/docker-compose.stella-ops.yml ps authority
# View Authority service logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs authority --tail 200
# Restart Authority service
docker compose -f devops/compose/docker-compose.stella-ops.yml restart authority
# Test token endpoint directly
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
curl -s -o /dev/null -w "%{http_code} %{time_total}s" http://localhost:80/connect/token
# Check database connectivity
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
stella doctor run --check check.storage.postgres
```
### Bare Metal / systemd
```bash
# Check authority service status
stella auth status
# Restart authority service
stella service restart authority
# Check database connectivity
stella doctor run --check check.storage.postgres
# Monitor service metrics
stella auth metrics --period 1h
# Review database performance
stella doctor run --check check.storage.performance
# Watch metrics in real-time (warning-level slowness)
stella auth metrics --watch
```
### Kubernetes / Helm
```bash
# Check authority pod status
kubectl get pods -l app.kubernetes.io/component=authority -n stellaops
# View pod logs
kubectl logs -l app.kubernetes.io/component=authority -n stellaops --tail=200
# Check resource usage
kubectl top pods -l app.kubernetes.io/component=authority -n stellaops
# Restart authority pods
kubectl rollout restart deployment/stellaops-authority -n stellaops
# Scale up if under load
kubectl scale deployment stellaops-authority --replicas=3 -n stellaops
# Check liveness/readiness probe status
kubectl describe pod -l app.kubernetes.io/component=authority -n stellaops | grep -A5 "Liveness\|Readiness"
```
## Verification
```bash
stella doctor run --check check.auth.token-service
```
## Related Checks
- `check.auth.config` -- auth must be configured before the token service can function
- `check.auth.signing-key` -- token issuance requires a valid signing key
- `check.auth.oidc` -- if delegating to external OIDC, that provider must also be healthy
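
Because the four auth checks reference one another, it can be convenient to run them together. The loop below is a sketch under assumptions: it uses only the `stella doctor run --check <id>` form shown in the Verification sections, it assumes the command exits non-zero when a check does not pass (not confirmed by these articles), and `STELLA_BIN` and `run_auth_checks` are hypothetical names.

```bash
#!/usr/bin/env bash
# STELLA_BIN is a hypothetical override so the loop can be dry-run, e.g. STELLA_BIN=echo.
STELLA_BIN="${STELLA_BIN:-stella}"
AUTH_CHECKS="check.auth.config check.auth.signing-key check.auth.token-service check.auth.oidc"

run_auth_checks() {
  local failed=0 check
  for check in $AUTH_CHECKS; do
    # Assumption: a non-zero exit status means the check did not pass.
    "$STELLA_BIN" doctor run --check "$check" || failed=$((failed + 1))
  done
  echo "auth checks with non-zero exit: ${failed}"
  [ "$failed" -eq 0 ]
}
```

The order is a suggestion that follows the dependencies described above: configuration first, then the signing key, then the token service, then the external provider.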