Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/policy/engine.md
+++ b/docs/doctor/articles/policy/engine.md
@@ -0,0 +1,124 @@
+---
+checkId: check.policy.engine
+plugin: stellaops.doctor.policy
+severity: fail
+tags: [policy, core, health]
+---
+# Policy Engine Health
+
+## What It Checks
+Performs a three-part health check against the policy engine (OPA):
+
+1. **Compilation**: queries `/health` to verify the engine is responding, `/v1/policies` to count loaded policies and verify they compiled, and `/v1/status` for engine version and cache metrics.
+2. **Evaluation**: sends a canary POST to `/v1/data/system/health` with a minimal input and measures response time. HTTP 200 and 404 are both acceptable (a 404 only means no policy is defined at that path). HTTP 500 indicates an engine error. Evaluation latency above 100ms triggers a warning.
+3. **Storage**: queries `/v1/data` to verify the policy data store is accessible and counts top-level data entries.
+
+If any of the three sub-checks fails, the overall result is `fail`. If all three pass but evaluation latency exceeds 100ms, the result is `warn`.
+
+Evidence collected: `engine_type`, `engine_version`, `engine_url`, `compilation_status`, `evaluation_status`, `storage_status`, `policy_count`, `compilation_time_ms`, `evaluation_latency_p50_ms`, `cache_hit_ratio`, `last_compilation_error`, `evaluation_error`, `storage_error`.
+
+The check requires `Policy:Engine:Url` or `PolicyEngine:BaseUrl` to be configured.
+
+## Why It Matters
+The policy engine is the decision authority for all release gates, promotion approvals, and security policy enforcement. If the policy engine is down, no release can pass its policy gate. If compilation fails, the affected policies are not enforced. Slow evaluation delays release pipelines. A corrupt or inaccessible policy store means decisions are made against stale or missing rules, which can result in either blocked releases or unintended policy bypasses.
+
+## Common Causes
+- Policy engine service (OPA) not running or crashed
+- Policy storage backend unavailable (bundled or external)
+- OPA/Rego compilation error in a recently pushed policy
+- Policy cache corrupted after abnormal shutdown
+- Policy evaluation slower than expected due to complex rules
+- Network connectivity issue between Stella Ops services and the policy engine
+- Firewall blocking access to the policy engine port
+- DNS resolution failure for the policy engine hostname
+
+## How to Fix
+
+### Docker Compose
+```bash
+# Check policy engine container status
+docker compose -f docker-compose.stella-ops.yml ps policy-engine
+
+# View policy engine logs
+docker compose -f docker-compose.stella-ops.yml logs --tail 200 policy-engine
+
+# Test engine health directly (-f returns non-zero on HTTP errors, -S still prints errors under -s)
+curl -fsS http://localhost:8181/health
+
+# Recompile all policies
+stella policy compile --all
+
+# Warm the policy cache
+stella policy cache warm
+```
+
+```yaml
+services:
+  policy-engine:
+    environment:
+      Policy__Engine__Url: "http://policy-engine:8181"
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:8181/health"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+```
+
+### Bare Metal / systemd
+```bash
+# Check OPA service status
+sudo systemctl status stellaops-policy-engine
+
+# View logs
+sudo journalctl -u stellaops-policy-engine --since "1 hour ago"
+
+# Restart the service
+sudo systemctl restart stellaops-policy-engine
+
+# Verify health (-f returns non-zero on HTTP errors, -S still prints errors under -s)
+curl -fsS http://localhost:8181/health
+
+# Recompile policies
+stella policy compile --all
+```
+
+### Kubernetes / Helm
+```bash
+# Check policy engine pods
+kubectl get pods -l app=stellaops-policy-engine
+
+# View pod logs
+kubectl logs -l app=stellaops-policy-engine --tail=200
+
+# Restart policy engine
+kubectl rollout restart deployment stellaops-policy-engine
+
+# Verify health from within the cluster
+kubectl exec -it <any-stellaops-pod> -- curl -s http://stellaops-policy-engine:8181/health
+```
+
+Set in Helm `values.yaml`:
+
+```yaml
+policyEngine:
+  replicas: 2
+  resources:
+    limits:
+      memory: 1Gi
+      cpu: "1"
+  livenessProbe:
+    httpGet:
+      path: /health
+      port: 8181
+    initialDelaySeconds: 10
+    periodSeconds: 30
+```
+
+## Verification
+```bash
+stella doctor run --check check.policy.engine
+```
+
+## Related Checks
+- `check.release.promotion.gates` -- promotion gates depend on policy engine availability
+- `check.postgres.connectivity` -- policy storage may depend on database connectivity
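
Note on the result rule in the article above: the "What It Checks" section describes how three sub-check outcomes and one latency measurement combine into a single result. A minimal Python sketch of that rule follows, for illustration only. It is not the plugin's implementation, and the names `SubCheck`, `canary_ok`, and `overall_result` are hypothetical. The article names only `fail` and `warn`, so the `pass` label is an assumption, as is treating every canary status other than 200 and 404 as a failure (the article only calls out 500).

```python
from dataclasses import dataclass
from typing import Optional

# Warning threshold stated in the article: latency *above* 100 ms warns.
LATENCY_WARN_MS = 100.0


@dataclass
class SubCheck:
    """Outcome of one sub-check (compilation, evaluation, or storage)."""
    name: str
    ok: bool
    error: Optional[str] = None


def canary_ok(http_status: int) -> bool:
    """Return True when the canary evaluation response is acceptable.

    200 and 404 are acceptable per the article. Rejecting all other
    statuses (not just 500) is an assumption made for this sketch.
    """
    return http_status in (200, 404)


def overall_result(
    compilation: SubCheck,
    evaluation: SubCheck,
    storage: SubCheck,
    evaluation_latency_ms: float,
) -> str:
    """Combine the three sub-checks into one result string."""
    # Any failing sub-check makes the whole check fail.
    if not (compilation.ok and evaluation.ok and storage.ok):
        return "fail"
    # All sub-checks passed; slow evaluation downgrades the result to warn.
    if evaluation_latency_ms > LATENCY_WARN_MS:
        return "warn"
    # "pass" is an assumed label; the article does not name the success state.
    return "pass"


if __name__ == "__main__":
    compilation = SubCheck("compilation", ok=True)
    evaluation = SubCheck("evaluation", ok=canary_ok(404))
    storage = SubCheck("storage", ok=True)
    print(overall_result(compilation, evaluation, storage, 135.0))
```

In this sketch a failure always takes precedence over latency, so a slow but broken engine reports `fail` rather than `warn`, matching the order in which the article states the two conditions.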