
master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00



Policy Engine Health

What It Checks

Performs a three-part health check against the policy engine (OPA):

Compilation: queries /health to verify the engine is responding, /v1/policies to count loaded policies and verify they compiled, and /v1/status for engine version and cache metrics.
Evaluation: sends a canary POST to /v1/data/system/health with a minimal input and measures response time. HTTP 200 and 404 are both acceptable (a 404 only means no policy is defined at that path). HTTP 500 indicates an engine error. Evaluation latency above 100ms triggers a warning.
Storage: queries /v1/data to verify the policy data store is accessible and counts top-level data entries.
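The evaluation rules above can be sketched as a small shell helper. This is an illustration only, not the check's actual implementation: the function names, the `{"input": {}}` request body, and the choice to treat every status code other than 200 or 404 as a failure are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one canary evaluation result.
# Arguments: HTTP status code, response time in whole milliseconds.
classify_evaluation() {
  local http_code="$1" latency_ms="$2"
  case "$http_code" in
    200|404)
      # A 404 only means no policy is defined at the canary path.
      if [ "$latency_ms" -gt 100 ]; then
        echo "warn"
      else
        echo "pass"
      fi
      ;;
    *)
      # HTTP 500 is the documented engine error. Treating all other
      # codes as failures as well is an assumption of this sketch.
      echo "fail"
      ;;
  esac
}

# Hypothetical wrapper: send the canary POST and classify the response.
# Argument: engine base URL, e.g. http://localhost:8181
run_canary() {
  local base_url="$1" out code seconds latency_ms
  out="$(curl -s -o /dev/null -w '%{http_code} %{time_total}' \
    -X POST -H 'Content-Type: application/json' -d '{"input": {}}' \
    "$base_url/v1/data/system/health")"
  code="${out% *}"
  seconds="${out#* }"
  # curl reports seconds as a decimal; convert to truncated milliseconds.
  latency_ms="$(awk -v s="$seconds" 'BEGIN { printf "%d", s * 1000 }')"
  classify_evaluation "$code" "$latency_ms"
}
```

Calling `run_canary http://localhost:8181` against a reachable engine prints `pass`, `warn`, or `fail`. Because the latency is truncated to whole milliseconds, a response just over 100ms may still be reported as `pass`.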

If any of the three sub-checks fails, the overall result is fail. If all three pass but evaluation latency exceeds 100ms, the result is warn.
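As a rough sketch, the aggregation rule could be expressed like this. The function name and the `pass`/`fail` strings used for sub-check results are hypothetical and are not taken from the Doctor plugin source.

```shell
#!/usr/bin/env bash
# Hypothetical helper: combine the three sub-check results.
# Arguments: compilation, evaluation, storage (each "pass" or "fail"),
# followed by the evaluation latency in whole milliseconds.
overall_status() {
  local compilation="$1" evaluation="$2" storage="$3" latency_ms="$4"
  local s
  for s in "$compilation" "$evaluation" "$storage"; do
    if [ "$s" = "fail" ]; then
      # Any failing sub-check fails the whole check.
      echo "fail"
      return
    fi
  done
  # All sub-checks passed; slow evaluation downgrades the result to warn.
  if [ "$latency_ms" -gt 100 ]; then
    echo "warn"
  else
    echo "pass"
  fi
}
```

For example, `overall_status pass pass pass 250` prints `warn`, while `overall_status pass pass fail 20` prints `fail` regardless of latency.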

Evidence collected: engine_type, engine_version, engine_url, compilation_status, evaluation_status, storage_status, policy_count, compilation_time_ms, evaluation_latency_p50_ms, cache_hit_ratio, last_compilation_error, evaluation_error, storage_error.

The check requires Policy:Engine:Url or PolicyEngine:BaseUrl to be configured.
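When debugging a misconfigured deployment, it can help to confirm which of the two keys is actually set. The sketch below assumes the keys are supplied as environment variables using the double-underscore form seen in the Docker Compose example (`Policy__Engine__Url`). The `PolicyEngine__BaseUrl` spelling and the precedence order are assumptions, not documented behavior.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the configured policy engine URL, or fail.
# Assumed precedence: Policy__Engine__Url first, then PolicyEngine__BaseUrl.
resolve_policy_engine_url() {
  local url="${Policy__Engine__Url:-${PolicyEngine__BaseUrl:-}}"
  if [ -z "$url" ]; then
    echo "policy engine URL is not configured" >&2
    return 1
  fi
  printf '%s\n' "$url"
}
```

A typical use would be `url="$(resolve_policy_engine_url)" && curl -s "$url/health"`, which skips the health request when neither key is set.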

Why It Matters

The policy engine is the decision authority for all release gates, promotion approvals, and security policy enforcement. If the policy engine is down, no release can pass its policy gate. If compilation fails, policies are not enforced. Slow evaluation delays release pipelines. A corrupt or inaccessible policy store means decisions are being made against stale or missing rules, which can result in either blocked releases or unintended policy bypasses.

Common Causes

Policy engine service (OPA) not running or crashed
Policy storage backend unavailable (bundled or external)
OPA/Rego compilation error in a recently pushed policy
Policy cache corrupted after abnormal shutdown
Policy evaluation slower than expected due to complex rules
Network connectivity issue between Stella Ops services and the policy engine
Firewall blocking access to the policy engine port
DNS resolution failure for the policy engine hostname
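The last three causes (network, firewall, DNS) can be hard to tell apart from the check output alone. One possible triage order is sketched below: resolve the hostname, then probe the port, then call /health. The function names are hypothetical, `getent` and `nc` are assumed to be installed on the host, and the URL parser does not handle IPv6 literals or embedded credentials.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split an http(s) URL into "host port".
url_host_port() {
  local url="$1" scheme rest hostport host port
  scheme="${url%%://*}"
  rest="${url#*://}"
  hostport="${rest%%/*}"
  host="${hostport%%:*}"
  if [ "$hostport" = "$host" ]; then
    # No explicit port: fall back to the scheme default.
    if [ "$scheme" = "https" ]; then port=443; else port=80; fi
  else
    port="${hostport##*:}"
  fi
  echo "$host $port"
}

# Hypothetical helper: report the first layer that fails.
# Argument: engine base URL, e.g. http://policy-engine:8181
diagnose_policy_engine() {
  local url="$1" hp host port
  hp="$(url_host_port "$url")"
  host="${hp% *}"
  port="${hp#* }"
  getent hosts "$host" >/dev/null \
    || { echo "dns: cannot resolve $host"; return 1; }
  nc -z -w 3 "$host" "$port" \
    || { echo "network: cannot reach $host:$port"; return 1; }
  curl -fsS -m 5 "$url/health" >/dev/null \
    || { echo "engine: /health did not return success"; return 1; }
  echo "ok"
}
```

Run `diagnose_policy_engine http://policy-engine:8181` from the host or container that reports the failure, since DNS and firewall rules can differ between network namespaces.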

How to Fix

Docker Compose

# Check policy engine container status
docker compose -f docker-compose.stella-ops.yml ps policy-engine

# View policy engine logs
docker compose -f docker-compose.stella-ops.yml logs --tail 200 policy-engine

# Test engine health directly
curl -s http://localhost:8181/health

# Recompile all policies
stella policy compile --all

# Warm the policy cache
stella policy cache warm

services:
  policy-engine:
    environment:
      Policy__Engine__Url: "http://policy-engine:8181"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8181/health"]
      interval: 30s
      timeout: 10s
      retries: 3

Bare Metal / systemd

# Check OPA service status
sudo systemctl status stellaops-policy-engine

# View logs
sudo journalctl -u stellaops-policy-engine --since "1 hour ago"

# Restart the service
sudo systemctl restart stellaops-policy-engine

# Verify health
curl -s http://localhost:8181/health

# Recompile policies
stella policy compile --all

Kubernetes / Helm

# Check policy engine pods
kubectl get pods -l app=stellaops-policy-engine

# View pod logs
kubectl logs -l app=stellaops-policy-engine --tail=200

# Restart policy engine
kubectl rollout restart deployment stellaops-policy-engine

# Verify health from within the cluster
kubectl exec -it <any-stellaops-pod> -- curl -s http://stellaops-policy-engine:8181/health

Set in Helm values.yaml:

policyEngine:
  replicas: 2
  resources:
    limits:
      memory: 1Gi
      cpu: "1"
  livenessProbe:
    httpGet:
      path: /health
      port: 8181
    initialDelaySeconds: 10
    periodSeconds: 30

Verification

stella doctor run --check check.policy.engine

Related Checks

check.release.promotion.gates -- promotion gates depend on policy engine availability
check.postgres.connectivity -- policy storage may depend on database connectivity
