Doctor plugin checks: implement health check classes and documentation
Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
108
docs/doctor/articles/auth/config.md
Normal file
@@ -0,0 +1,108 @@
---
checkId: check.auth.config
plugin: stellaops.doctor.auth
severity: fail
tags: [auth, security, core, config]
---
# Auth Configuration

## What It Checks

Validates the overall authentication configuration by evaluating three conditions in sequence; the check passes only when none of them fails or warns:

1. **Authentication configured** -- verifies that the auth subsystem has been set up (issuer URL present, basic config loaded). If not: **Fail** with "Authentication not configured".
2. **Signing keys available** -- checks whether signing keys exist for token issuance. If auth is configured but no keys exist: **Fail** with "No signing keys available".
3. **Signing key expiration** -- checks whether the active signing key is approaching expiration. If it will expire soon: **Warn** with the number of days remaining.
4. **All healthy** -- issuer URL configured, signing keys available, key not near expiry. Result: **Pass**.

Evidence collected: `AuthConfigured` (YES/NO), `IssuerConfigured` (YES/NO), `IssuerUrl`, `SigningKeysConfigured`/`SigningKeysAvailable` (YES/NO), `KeyExpiration` (days), `ActiveClients` count, `ActiveScopes` count.

The check always runs (`CanRun` returns true).
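
The order of evaluation described in this section can be sketched as a small shell function. This is an illustration only, not the plugin's implementation: the function name, its argument names, and the 30-day default warning window are assumptions (the article only says the key "will expire soon").

```bash
#!/usr/bin/env bash
# Illustrative sketch: mirrors the documented order of evaluation only.
# Arguments (hypothetical names):
#   $1 auth_configured   YES|NO
#   $2 keys_available    YES|NO
#   $3 days_to_expiry    whole days until the active key expires
#   $4 warn_days         warning window in days (assumed default: 30)
classify_auth_config() {
  local auth_configured=$1 keys_available=$2 days_to_expiry=$3 warn_days=${4:-30}

  if [ "$auth_configured" != "YES" ]; then
    echo "fail: Authentication not configured"
  elif [ "$keys_available" != "YES" ]; then
    echo "fail: No signing keys available"
  elif [ "$days_to_expiry" -le "$warn_days" ]; then
    echo "warn: signing key expires in ${days_to_expiry} days"
  else
    echo "pass"
  fi
}

classify_auth_config YES YES 12
```

Because the conditions are checked in order, a missing configuration masks any key problem: the first failing condition determines the result.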

## Why It Matters

Authentication is the foundation of every API call in Stella Ops. If the auth subsystem is not configured, no user can log in, no service-to-service call can authenticate, and the entire platform is non-functional. Missing signing keys mean tokens cannot be issued, and an expiring key that is not rotated will cause a hard outage when it expires.

## Common Causes

- Authority service not configured (fresh installation without `stella setup auth`)
- Missing issuer URL configuration in environment variables or config files
- Signing keys not yet generated (first-run setup incomplete)
- Key material corrupted (disk failure, accidental deletion)
- HSM/PKCS#11 module not accessible (hardware key store offline)
- Signing key approaching expiration without scheduled rotation

## How to Fix

### Docker Compose

```bash
# Check Authority service configuration
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  cat /app/appsettings.json | grep -A5 "Issuer\|Signing"

# Set issuer URL via environment variable
# In .env or docker-compose.override.yml:
# AUTHORITY__ISSUER__URL=https://stella-ops.local/authority

# Recreate the Authority container after config changes
# (a plain restart does not pick up changed environment values)
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d authority

# Generate signing keys
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella keys generate --type rsa
```

### Bare Metal / systemd

```bash
# Run initial auth setup
stella setup auth

# Configure issuer URL
stella auth configure --issuer https://auth.yourdomain.com

# Generate signing keys
stella keys generate --type rsa

# Rotate signing keys (if approaching expiration)
stella keys rotate

# Schedule automatic key rotation
stella keys rotate --schedule 30d

# Check key store health
stella doctor run --check check.crypto.keystore
```

### Kubernetes / Helm

```bash
# Check authority pod configuration
kubectl get configmap stellaops-authority-config -n stellaops -o yaml

# Set issuer URL in Helm values
# authority:
#   issuer:
#     url: "https://auth.yourdomain.com"
helm upgrade stellaops stellaops/stellaops -f values.yaml

# Generate keys inside the authority pod
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
  stella keys generate --type rsa

# Check secrets for key material
kubectl get secret stellaops-signing-keys -n stellaops
```

## Verification

```
stella doctor run --check check.auth.config
```

## Related Checks

- `check.auth.signing-key` -- deeper signing key health (algorithm, size, rotation schedule)
- `check.auth.token-service` -- verifies token endpoint is responsive
- `check.auth.oidc` -- external OIDC provider connectivity
100
docs/doctor/articles/auth/oidc.md
Normal file
@@ -0,0 +1,100 @@
---
checkId: check.auth.oidc
plugin: stellaops.doctor.auth
severity: warn
tags: [auth, oidc, connectivity]
---
# OIDC Provider Connectivity

## What It Checks

Tests connectivity to an external OIDC provider by performing real HTTP requests. The check reads the issuer URL from configuration keys (in priority order): `Authentication:Oidc:Issuer`, `Auth:Oidc:Authority`, `Oidc:Issuer`. If none is configured, the check passes immediately (local authority mode).

When an external provider is configured, the check performs a multi-step validation:

1. **Fetch discovery document** -- HTTP GET to `{issuerUrl}/.well-known/openid-configuration` with a 10-second timeout. If unreachable: **Fail** with connection error type classification (ssl_error, dns_failure, refused, timeout, connection_failed).
2. **Validate discovery fields** -- Parses the discovery JSON and verifies presence of `authorization_endpoint`, `token_endpoint`, and `jwks_uri`. If any are missing: **Warn** listing the missing fields.
3. **Fetch JWKS** -- HTTP GET to the `jwks_uri` from the discovery document. Counts the number of keys in the `keys` array. If zero keys: **Warn** (token validation may fail).
4. **All healthy** -- provider reachable, discovery valid, JWKS has keys. Result: **Pass**.

Evidence collected: `issuer_url`, `discovery_reachable`, `discovery_response_ms`, `authorization_endpoint_present`, `token_endpoint_present`, `jwks_uri_present`, `jwks_key_count`, `jwks_fetch_ms`, `http_status_code`, `error_message`, `connection_error_type`.
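
When reproducing a discovery failure by hand, the `connection_error_type` values named in step 1 can be approximated from curl's exit status. The mapping below is a rough sketch, not the check's own classifier: the function name and the `none` label for success are assumptions, and curl's exit codes only loosely correspond to the categories the check reports.

```bash
#!/usr/bin/env bash
# Approximate mapping from a curl exit status to the error categories in step 1.
# The function name and the "none" label are hypothetical.
classify_curl_exit() {
  case "$1" in
    0)     echo "none" ;;               # request completed
    6)     echo "dns_failure" ;;        # could not resolve host
    7)     echo "refused" ;;            # failed to connect to host
    28)    echo "timeout" ;;            # operation timed out
    35|60) echo "ssl_error" ;;          # TLS handshake or certificate problem
    *)     echo "connection_failed" ;;  # anything else
  esac
}

# Usage against a real provider (replace the placeholder first):
#   curl -s --max-time 10 -o /dev/null "https://<oidc-issuer>/.well-known/openid-configuration"
#   classify_curl_exit $?
classify_curl_exit 28
```

The `--max-time 10` in the usage comment matches the 10-second timeout stated for the discovery request.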

## Why It Matters

When Stella Ops is configured to delegate authentication to an external OIDC provider (Azure AD, Keycloak, Okta, etc.), all user logins and token validations depend on that provider being reachable and correctly configured. A connectivity failure means users cannot log in, and services cannot validate tokens, leading to a platform-wide authentication outage.

## Common Causes

- OIDC provider is down or undergoing maintenance
- Network connectivity issue (proxy misconfiguration, firewall rule change)
- DNS resolution failure for the provider hostname
- Firewall blocking outbound HTTPS to the provider
- Discovery document missing required fields (misconfigured provider)
- Token endpoint misconfigured after provider upgrade
- JWKS endpoint returning empty key set (key rotation in progress)
- OIDC provider rate limiting or returning errors

## How to Fix

### Docker Compose

```bash
# Test OIDC provider connectivity from the authority container
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  curl -s https://<oidc-issuer>/.well-known/openid-configuration | jq .

# Check DNS resolution
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  nslookup <oidc-host>

# Set OIDC configuration via environment
# AUTHENTICATION__OIDC__ISSUER=https://login.microsoftonline.com/<tenant>/v2.0
```

### Bare Metal / systemd

```bash
# Test provider connectivity
curl -s https://<oidc-issuer>/.well-known/openid-configuration | jq .

# Check DNS resolution
nslookup <oidc-host>

# Validate OIDC configuration
stella auth oidc validate

# Check JWKS endpoint (the URL is read from the discovery document)
curl -s "$(curl -s https://<oidc-issuer>/.well-known/openid-configuration | jq -r .jwks_uri)" | jq .

# Check network connectivity
stella doctor run --check check.network.dns
```

### Kubernetes / Helm

```bash
# Test from authority pod
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
  curl -s https://<oidc-issuer>/.well-known/openid-configuration | jq .

# Check NetworkPolicy allows egress to OIDC provider
kubectl get networkpolicy -n stellaops -o yaml | grep -A10 egress

# Set OIDC configuration in Helm values
# authority:
#   oidc:
#     issuer: "https://login.microsoftonline.com/<tenant>/v2.0"
helm upgrade stellaops stellaops/stellaops -f values.yaml
```

## Verification

```
stella doctor run --check check.auth.oidc
```

## Related Checks

- `check.auth.config` -- overall auth configuration health
- `check.auth.signing-key` -- local signing key health (used when not delegating to external OIDC)
- `check.auth.token-service` -- token endpoint availability
106
docs/doctor/articles/auth/signing-key.md
Normal file
@@ -0,0 +1,106 @@
---
checkId: check.auth.signing-key
plugin: stellaops.doctor.auth
severity: fail
tags: [auth, security, keys]
---
# Signing Key Health

## What It Checks

Verifies the health of the active signing key used for token issuance. The check evaluates three conditions in sequence:

1. **No active key** -- if `HasActiveKey` is false: **Fail** with "No active signing key available". Evidence includes `ActiveKey: NONE` and total key count.
2. **Approaching expiration** -- if the active key expires within **30 days** (`ExpirationWarningDays`): **Warn** with the number of days remaining. Evidence includes key ID, algorithm, days until expiration, and whether rotation is scheduled.
3. **Healthy** -- active key exists with more than 30 days until expiration. Result: **Pass**. Evidence includes key ID, algorithm, key size (bits), days until expiration, and rotation schedule status.

The check always runs (`CanRun` returns true).

Evidence collected: `ActiveKeyId`, `Algorithm`, `KeySize`, `DaysUntilExpiration`, `RotationScheduled` (YES/NO), `TotalKeys`.
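
As a rough illustration of the 30-day rule, the days remaining can be derived from an expiry timestamp and compared with the warning window. The function name and its epoch-second arguments are assumptions made for this sketch; only the 30-day threshold and the failure message come from the article.

```bash
#!/usr/bin/env bash
# Illustrative sketch of the documented thresholds, not the plugin's code.
EXPIRATION_WARNING_DAYS=30   # threshold stated in this article

# $1 has_active_key (true|false)
# $2 key expiry as Unix epoch seconds
# $3 current time as Unix epoch seconds
signing_key_status() {
  local has_active_key=$1 expires_epoch=$2 now_epoch=$3

  if [ "$has_active_key" != "true" ]; then
    echo "fail: No active signing key available"
    return
  fi

  # Whole days remaining (integer division, 86400 seconds per day)
  local days_left=$(( (expires_epoch - now_epoch) / 86400 ))

  if [ "$days_left" -le "$EXPIRATION_WARNING_DAYS" ]; then
    echo "warn: ${days_left} days until expiration"
  else
    echo "pass: ${days_left} days until expiration"
  fi
}

now=1700000000
signing_key_status true $(( now + 10 * 86400 )) "$now"
```

A key with exactly 30 days left is treated as a warning here, which matches "expires within 30 days" and "more than 30 days" for a pass.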

## Why It Matters

The signing key is used to sign every JWT issued by the Authority service. If no active key exists, no tokens can be issued, and the entire platform's authentication stops working. If the key is approaching expiration without a rotation plan, the platform faces a hard outage on the expiration date -- all tokens signed with the key become unverifiable.

## Common Causes

- Signing keys not generated (incomplete setup)
- All keys expired without rotation
- Key store corrupted (file system issue, accidental deletion)
- Key rotation not scheduled (manual process that was forgotten)
- Previous rotation attempt failed (permissions, HSM connectivity)

## How to Fix

### Docker Compose

```bash
# Check current key status
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella keys status

# Generate new signing key
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella keys generate --type rsa --bits 4096

# Activate the new key
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella keys activate

# Rotate keys
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella keys rotate
```

### Bare Metal / systemd

```bash
# Generate new signing key
stella keys generate --type rsa --bits 4096

# Activate the key
stella keys activate

# Rotate signing key
stella keys rotate

# Schedule automatic rotation (every 30 days)
stella keys rotate --schedule 30d

# Check key status
stella keys status
```

### Kubernetes / Helm

```bash
# Check key status
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
  stella keys status

# Generate a new signing key
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
  stella keys generate --type rsa --bits 4096

# Activate the new key
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
  stella keys activate

# Set automatic rotation in Helm values
# authority:
#   signing:
#     autoRotate: true
#     rotationIntervalDays: 30
helm upgrade stellaops stellaops/stellaops -f values.yaml

# Check signing key secret (lists entry names and sizes without printing key material)
kubectl describe secret stellaops-signing-keys -n stellaops
```

## Verification

```
stella doctor run --check check.auth.signing-key
```

## Related Checks

- `check.auth.config` -- overall auth configuration including signing key presence
- `check.auth.token-service` -- token issuance depends on a healthy signing key
- `check.attestation.keymaterial` -- attestor signing keys (separate from auth signing keys)
114
docs/doctor/articles/auth/token-service.md
Normal file
@@ -0,0 +1,114 @@
---
checkId: check.auth.token-service
plugin: stellaops.doctor.auth
severity: fail
tags: [auth, service, health]
---
# Token Service Health

## What It Checks

Verifies the availability and performance of the token service endpoint (`/connect/token`). The check evaluates four conditions:

1. **Service unavailable** -- token endpoint is not responding. Result: **Fail** with the endpoint URL and error message.
2. **Critically slow** -- response time exceeds **2000ms**. Result: **Fail** with actual response time and threshold.
3. **Slow** -- response time exceeds **500ms** but is under 2000ms. Result: **Warn** with response time, threshold, and token issuance count.
4. **Healthy** -- service is available and response time is under 500ms. Result: **Pass** with response time, tokens issued in last 24 hours, and active session count.

Evidence collected: `ServiceAvailable` (YES/NO), `Endpoint`, `ResponseTimeMs`, `CriticalThreshold` (2000), `WarningThreshold` (500), `TokensIssuedLast24h`, `ActiveSessions`, `Error`.

The check always runs (`CanRun` returns true).
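
The two latency thresholds can be pictured as a simple classification. This is a sketch under stated assumptions: the function and variable names are hypothetical, and because the article does not say how a response of exactly 500ms or 2000ms is treated, the sketch assumes "exceeds" means strictly greater than.

```bash
#!/usr/bin/env bash
# Illustrative sketch of the documented thresholds, not the plugin's code.
CRITICAL_THRESHOLD_MS=2000
WARNING_THRESHOLD_MS=500

# $1 service_available (YES|NO), $2 response time in whole milliseconds
classify_token_service() {
  local available=$1 response_ms=$2

  if [ "$available" != "YES" ]; then
    echo "fail: service unavailable"
  elif [ "$response_ms" -gt "$CRITICAL_THRESHOLD_MS" ]; then
    echo "fail: ${response_ms}ms exceeds ${CRITICAL_THRESHOLD_MS}ms"
  elif [ "$response_ms" -gt "$WARNING_THRESHOLD_MS" ]; then
    echo "warn: ${response_ms}ms exceeds ${WARNING_THRESHOLD_MS}ms"
  else
    echo "pass: ${response_ms}ms"
  fi
}

classify_token_service YES 750
```

Availability is tested first, so an unreachable endpoint is always reported as a failure regardless of any timing value.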

## Why It Matters

The token service is the single point through which all access tokens are issued. If it is unavailable, no user can log in, no service can authenticate, and every API call fails with 401. Even if the service is available but slow, user login experiences degrade, automated integrations time out, and the platform feels unresponsive. This check is typically the first to detect Authority database issues or resource starvation.

## Common Causes

- Authority service not running (container stopped, process crashed)
- Token endpoint misconfigured (wrong path, wrong port)
- Database connectivity issue (Authority cannot query clients/keys)
- Database performance issues (slow queries for token validation)
- Service overloaded (high authentication request volume)
- Resource contention (CPU/memory pressure on Authority host)
- Higher than normal load (warning-level)
- Database query performance degraded (warning-level)

## How to Fix

### Docker Compose

```bash
# Check Authority service status
docker compose -f devops/compose/docker-compose.stella-ops.yml ps authority

# View Authority service logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs authority --tail 200

# Restart Authority service
docker compose -f devops/compose/docker-compose.stella-ops.yml restart authority

# Test token endpoint directly
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  curl -s -o /dev/null -w "%{http_code} %{time_total}s" http://localhost:80/connect/token

# Check database connectivity
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella doctor run --check check.storage.postgres
```

### Bare Metal / systemd

```bash
# Check authority service status
stella auth status

# Restart authority service
stella service restart authority

# Check database connectivity
stella doctor run --check check.storage.postgres

# Monitor service metrics
stella auth metrics --period 1h

# Review database performance
stella doctor run --check check.storage.performance

# Watch metrics in real-time (warning-level slowness)
stella auth metrics --watch
```

### Kubernetes / Helm

```bash
# Check authority pod status
kubectl get pods -l app.kubernetes.io/component=authority -n stellaops

# View pod logs
kubectl logs -l app.kubernetes.io/component=authority -n stellaops --tail=200

# Check resource usage
kubectl top pods -l app.kubernetes.io/component=authority -n stellaops

# Restart authority pods
kubectl rollout restart deployment/stellaops-authority -n stellaops

# Scale up if under load
kubectl scale deployment stellaops-authority --replicas=3 -n stellaops

# Check liveness/readiness probe status
kubectl describe pod -l app.kubernetes.io/component=authority -n stellaops | grep -A5 "Liveness\|Readiness"
```

## Verification

```
stella doctor run --check check.auth.token-service
```

## Related Checks

- `check.auth.config` -- auth must be configured before the token service can function
- `check.auth.signing-key` -- token issuance requires a valid signing key
- `check.auth.oidc` -- if delegating to external OIDC, that provider must also be healthy