---
checkId: check.policy.engine
plugin: stellaops.doctor.policy
severity: fail
tags: [policy, core, health]
---

# Policy Engine Health

## What It Checks

Performs a three-part health check against the policy engine (OPA):

1. **Compilation**: queries `/health` to verify the engine is responding, `/v1/policies` to count loaded policies and verify they compiled, and `/v1/status` for engine version and cache metrics.
2. **Evaluation**: sends a canary POST to `/v1/data/system/health` with a minimal input and measures response time. HTTP 200 or 404 are acceptable (no policy at that path is fine). HTTP 500 indicates an engine error. Evaluation latency above 100ms triggers a warning.
3. **Storage**: queries `/v1/data` to verify the policy data store is accessible and counts top-level data entries.

If any of the three sub-checks fail, the overall result is fail. If all pass but evaluation latency exceeds 100ms, the result is warn.

Evidence collected: `engine_type`, `engine_version`, `engine_url`, `compilation_status`, `evaluation_status`, `storage_status`, `policy_count`, `compilation_time_ms`, `evaluation_latency_p50_ms`, `cache_hit_ratio`, `last_compilation_error`, `evaluation_error`, `storage_error`.

The check requires `Policy:Engine:Url` or `PolicyEngine:BaseUrl` to be configured.

## Why It Matters

The policy engine is the decision authority for all release gates, promotion approvals, and security policy enforcement. If the policy engine is down, no release can pass its policy gate. If compilation fails, policies are not enforced. Slow evaluation delays release pipelines. A corrupt or inaccessible policy store means decisions are being made against stale or missing rules, which can result in either blocked releases or unintended policy bypasses.

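
The pass/warn/fail rules described under "What It Checks" can be sketched as two small shell helpers. This is an illustrative sketch, not the check's actual implementation: the function names `classify_evaluation` and `overall_result` are hypothetical, latency is assumed to be a whole number of milliseconds, and treating every status other than 200 or 404 as a failure is an assumption (the text above only names HTTP 500 explicitly).

```shell
#!/usr/bin/env bash
# Hypothetical helpers that mirror the documented rules; names are illustrative only.

# Classify the evaluation canary from its HTTP status and latency (integer ms).
# 200 and 404 are acceptable; any other status is treated as a failure (assumption).
# Latency above 100 ms downgrades an otherwise passing result to "warn".
classify_evaluation() {
  local http_status="$1" latency_ms="$2"
  case "$http_status" in
    200|404) ;;
    *) echo "fail"; return 0 ;;
  esac
  if [ "$latency_ms" -gt 100 ]; then
    echo "warn"
  else
    echo "pass"
  fi
}

# Combine the compilation, evaluation, and storage results into the overall result.
# Any "fail" wins; otherwise a slow evaluation yields "warn"; otherwise "pass".
overall_result() {
  local compilation="$1" evaluation="$2" storage="$3" result
  for result in "$compilation" "$evaluation" "$storage"; do
    if [ "$result" = "fail" ]; then
      echo "fail"
      return 0
    fi
  done
  if [ "$evaluation" = "warn" ]; then
    echo "warn"
  else
    echo "pass"
  fi
}
```

For example, `overall_result pass "$(classify_evaluation 200 130)" pass` prints `warn`, because the canary succeeded but took longer than 100 ms.
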
## Common Causes

- Policy engine service (OPA) not running or crashed
- Policy storage backend unavailable (bundled or external)
- OPA/Rego compilation error in a recently pushed policy
- Policy cache corrupted after abnormal shutdown
- Policy evaluation slower than expected due to complex rules
- Network connectivity issue between Stella Ops services and the policy engine
- Firewall blocking access to the policy engine port
- DNS resolution failure for the policy engine hostname

## How to Fix

### Docker Compose

```bash
# Check policy engine container status
docker compose -f docker-compose.stella-ops.yml ps policy-engine

# View policy engine logs
docker compose -f docker-compose.stella-ops.yml logs --tail 200 policy-engine

# Test engine health directly
curl -s http://localhost:8181/health

# Recompile all policies
stella policy compile --all

# Warm the policy cache
stella policy cache warm
```

```yaml
services:
  policy-engine:
    environment:
      Policy__Engine__Url: "http://policy-engine:8181"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8181/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```

### Bare Metal / systemd

```bash
# Check OPA service status
sudo systemctl status stellaops-policy-engine

# View logs
sudo journalctl -u stellaops-policy-engine --since "1 hour ago"

# Restart the service
sudo systemctl restart stellaops-policy-engine

# Verify health
curl -s http://localhost:8181/health

# Recompile policies
stella policy compile --all
```

### Kubernetes / Helm

```bash
# Check policy engine pods
kubectl get pods -l app=stellaops-policy-engine

# View pod logs
kubectl logs -l app=stellaops-policy-engine --tail=200

# Restart policy engine
kubectl rollout restart deployment stellaops-policy-engine

# Verify health from within the cluster
# (replace <pod-name> with a pod that has curl and can reach the service)
kubectl exec -it <pod-name> -- curl -s http://stellaops-policy-engine:8181/health
```

Set in Helm `values.yaml`:

```yaml
policyEngine:
  replicas: 2
  resources:
    limits:
      memory: 1Gi
      cpu: "1"
  livenessProbe:
    httpGet:
      path: /health
      port: 8181
    initialDelaySeconds: 10
    periodSeconds: 30
```

## Verification

```
stella doctor run --check check.policy.engine
```

## Related Checks

- `check.release.promotion.gates` -- promotion gates depend on policy engine availability
- `check.postgres.connectivity` -- policy storage may depend on database connectivity