Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions


@@ -0,0 +1,124 @@
---
checkId: check.policy.engine
plugin: stellaops.doctor.policy
severity: fail
tags: [policy, core, health]
---
# Policy Engine Health
## What It Checks
Performs a three-part health check against the policy engine (OPA):
1. **Compilation**: queries `/health` to verify the engine is responding, `/v1/policies` to count loaded policies and verify they compiled, and `/v1/status` for engine version and cache metrics.
2. **Evaluation**: sends a canary POST to `/v1/data/system/health` with a minimal input and measures response time. HTTP 200 or 404 are acceptable (no policy at that path is fine). HTTP 500 indicates an engine error. Evaluation latency above 100ms triggers a warning.
3. **Storage**: queries `/v1/data` to verify the policy data store is accessible and counts top-level data entries.

If any of the three sub-checks fails, the overall result is `fail`. If all three pass but evaluation latency exceeds 100ms, the result is `warn`.
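
The aggregation rule above can be sketched in shell. This is a minimal illustration, not the plugin's implementation: the function names and the `ok`/`fail`/`pass` status strings are hypothetical, and treating every HTTP code other than 200 or 404 as a failure is an assumption (the article only names 500 explicitly).

```bash
# Hypothetical helpers; names and status strings are illustrative assumptions.

# Map the canary evaluation's HTTP status code to a sub-check status.
# 200 and 404 are acceptable; any other code is treated as a failure here
# (assumption: the article only names 500 as an engine error).
evaluation_status() {
  case "$1" in
    200|404) echo "ok" ;;
    *)       echo "fail" ;;
  esac
}

# Combine the three sub-check statuses with the evaluation latency
# (integer milliseconds) into the overall result.
overall_result() {
  local compilation="$1" evaluation="$2" storage="$3" latency_ms="$4"
  local threshold_ms=100
  if [ "$compilation" != "ok" ] || [ "$evaluation" != "ok" ] || [ "$storage" != "ok" ]; then
    echo "fail"
  elif [ "$latency_ms" -gt "$threshold_ms" ]; then
    echo "warn"
  else
    echo "pass"
  fi
}
```

For example, `overall_result ok "$(evaluation_status 404)" ok 250` yields `warn`: every sub-check passed, but latency exceeded the 100ms threshold.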
Evidence collected: `engine_type`, `engine_version`, `engine_url`, `compilation_status`, `evaluation_status`, `storage_status`, `policy_count`, `compilation_time_ms`, `evaluation_latency_p50_ms`, `cache_hit_ratio`, `last_compilation_error`, `evaluation_error`, `storage_error`.
The check requires `Policy:Engine:Url` or `PolicyEngine:BaseUrl` to be configured.
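
As a rough sketch of how that requirement might look from a shell, the helper below resolves the engine URL from environment variables. It assumes the .NET-style mapping of configuration keys to environment variables (`:` becomes `__`, as in the `Policy__Engine__Url` entry of the Docker Compose example in this article) and assumes that `Policy:Engine:Url` takes precedence over `PolicyEngine:BaseUrl`. Neither the precedence nor the `PolicyEngine__BaseUrl` variable name is confirmed by this article, and `resolve_engine_url` is a hypothetical name.

```bash
# Resolve the policy engine URL from the environment.
# Assumptions: double-underscore environment variable names, and
# Policy:Engine:Url winning over PolicyEngine:BaseUrl when both are set.
resolve_engine_url() {
  if [ -n "${Policy__Engine__Url:-}" ]; then
    printf '%s\n' "$Policy__Engine__Url"
  elif [ -n "${PolicyEngine__BaseUrl:-}" ]; then
    printf '%s\n' "$PolicyEngine__BaseUrl"
  else
    # Neither key is set, so the check has no engine to talk to.
    echo "policy engine URL is not configured" >&2
    return 1
  fi
}
```

Running `resolve_engine_url` before the doctor check is a quick way to confirm that at least one of the two keys is visible to the process environment.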
## Why It Matters
The policy engine is the decision authority for all release gates, promotion approvals, and security policy enforcement. If the policy engine is down, no release can pass its policy gate. If compilation fails, policies are not enforced. Slow evaluation delays release pipelines. A corrupt or inaccessible policy store means decisions are being made against stale or missing rules, which can result in either blocked releases or unintended policy bypasses.
## Common Causes
- Policy engine service (OPA) not running or crashed
- Policy storage backend unavailable (bundled or external)
- OPA/Rego compilation error in a recently pushed policy
- Policy cache corrupted after abnormal shutdown
- Policy evaluation slower than expected due to complex rules
- Network connectivity issue between Stella Ops services and the policy engine
- Firewall blocking access to the policy engine port
- DNS resolution failure for the policy engine hostname
## How to Fix
### Docker Compose
```bash
# Check policy engine container status
docker compose -f docker-compose.stella-ops.yml ps policy-engine
# View policy engine logs
docker compose -f docker-compose.stella-ops.yml logs --tail 200 policy-engine
# Test engine health directly
curl -s http://localhost:8181/health
# Recompile all policies
stella policy compile --all
# Warm the policy cache
stella policy cache warm
```
```yaml
services:
  policy-engine:
    environment:
      Policy__Engine__Url: "http://policy-engine:8181"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8181/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```
### Bare Metal / systemd
```bash
# Check OPA service status
sudo systemctl status stellaops-policy-engine
# View logs
sudo journalctl -u stellaops-policy-engine --since "1 hour ago"
# Restart the service
sudo systemctl restart stellaops-policy-engine
# Verify health
curl -s http://localhost:8181/health
# Recompile policies
stella policy compile --all
```
### Kubernetes / Helm
```bash
# Check policy engine pods
kubectl get pods -l app=stellaops-policy-engine
# View pod logs
kubectl logs -l app=stellaops-policy-engine --tail=200
# Restart policy engine
kubectl rollout restart deployment stellaops-policy-engine
# Verify health from within the cluster
kubectl exec -it <any-stellaops-pod> -- curl -s http://stellaops-policy-engine:8181/health
```
Set in Helm `values.yaml`:
```yaml
policyEngine:
  replicas: 2
  resources:
    limits:
      memory: 1Gi
      cpu: "1"
  livenessProbe:
    httpGet:
      path: /health
      port: 8181
    initialDelaySeconds: 10
    periodSeconds: 30
```
## Verification
```bash
stella doctor run --check check.policy.engine
```
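
If the check still reports a failure or a latency warning, the evaluation canary can be reproduced by hand. The sketch below is an approximation of that probe, not the check's actual code: the helper names are hypothetical, the default URL `http://localhost:8181` is taken from the examples in this article, and the empty `{"input": {}}` body is an assumption about what "minimal input" means.

```bash
# Send the canary evaluation and print "<http_code> <seconds>".
# Assumptions: default base URL and an empty input document.
probe_evaluation() {
  local base_url="${1:-http://localhost:8181}"
  curl -s -o /dev/null -w '%{http_code} %{time_total}\n' \
    -X POST -H 'Content-Type: application/json' \
    -d '{"input": {}}' \
    "$base_url/v1/data/system/health"
}

# Convert curl's time_total (seconds, e.g. 0.042317) to whole milliseconds.
seconds_to_ms() {
  awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000 }'
}

# Summarize one probe result against the thresholds described in this article.
describe_probe() {
  local code="$1" ms status latency
  ms="$(seconds_to_ms "$2")"
  case "$code" in
    200|404) status="acceptable" ;;
    500)     status="engine error" ;;
    *)       status="unexpected" ;;
  esac
  if [ "$ms" -gt 100 ]; then latency="slow"; else latency="ok"; fi
  echo "http=$code ($status) latency_ms=$ms ($latency)"
}

# Example (requires a reachable engine):
#   set -- $(probe_evaluation http://localhost:8181)
#   describe_probe "$1" "$2"
```

A single request measures only one sample, so it will not match the check's `evaluation_latency_p50_ms` exactly. Repeating the probe a few times gives a better picture.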
## Related Checks
- `check.release.promotion.gates` -- promotion gates depend on policy engine availability
- `check.postgres.connectivity` -- policy storage may depend on database connectivity