Runbook: Policy Engine - OPA Process Crashed
Sprint: SPRINT_20260117_029_DOCS_runbook_coverage Task: RUN-003 - Policy Engine Runbooks
Metadata
| Field | Value |
|---|---|
| Component | Policy Engine |
| Severity | Critical |
| On-call scope | Platform team |
| Last updated | 2026-01-17 |
| Doctor check | check.policy.opa-health |
Symptoms
- Policy evaluations failing with "OPA unavailable" error
- Alert `PolicyOPACrashed` firing
- OPA process exited unexpectedly
- Error: "connection refused" when connecting to OPA
- Metric `policy_opa_restarts_total` increasing
Impact
| Impact Type | Description |
|---|---|
| User-facing | All policy evaluations fail; gate decisions blocked |
| Data integrity | No data loss; decisions delayed until OPA recovers |
| SLA impact | Gate latency SLO violated; release pipeline blocked |
Diagnosis
Quick checks
- Check Doctor diagnostics: `stella doctor --check check.policy.opa-health`
- Check OPA process status: `stella policy status`. Look for: OPA process state, restart count
- Check OPA logs for crash reason: `stella policy opa logs --last 30m --level error`
Deep diagnosis
- Check OPA memory usage before crash: `stella policy stats --opa-metrics`. Problem if: Memory usage near limit before crash
- Check for problematic policy: `stella policy list --last-error`. Look for: Policies that caused evaluation errors
- Check OPA configuration: `stella policy opa config show`. Look for: Invalid configuration, missing bundles
- Check for infinite loops in Rego: `stella policy analyze --detect-loops`
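The deep-diagnosis findings map onto the branches under "Root cause fix". A minimal Python sketch of that mapping follows. The `Findings` structure, the order in which branches are tried, and the 0.9 "near limit" threshold are illustrative assumptions, not values reported by the `stella` CLI.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Hand-entered results of the deep-diagnosis steps (hypothetical structure)."""
    memory_used_ratio: float  # memory used / memory limit shortly before the crash
    policies_in_error: int    # number of policies that reported evaluation errors
    bundle_verified: bool     # whether the bundle integrity check passed
    config_valid: bool        # whether the OPA configuration looked valid


def root_cause_branch(f: Findings, near_limit: float = 0.9) -> str:
    """Return the 'Root cause fix' branch to try first.

    The ordering and the default threshold are assumptions for illustration.
    """
    if f.memory_used_ratio >= near_limit:
        return "OOM killed"
    if f.policies_in_error > 0:
        return "policy caused crash"
    if not f.bundle_verified:
        return "bundle loading failed"
    if not f.config_valid:
        return "configuration issue"
    return "undetermined"
```

If the function returns "undetermined", none of the listed causes matched and the crash logs need a closer manual read.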
Resolution
Immediate mitigation
- Restart OPA process: `stella policy opa restart`
- If OPA keeps crashing, start in safe mode: `stella policy opa start --safe-mode`. Note: Safe mode disables custom policies
- Enable fail-open temporarily (if allowed by policy): `stella policy config set failopen true`, then `stella policy reload`. Warning: Only use if compliance allows fail-open mode
Root cause fix
If OOM killed:
- Increase OPA memory limit: `stella policy opa config set memory_limit 2Gi`, then `stella policy opa restart`
- Enable garbage collection tuning: `stella policy opa config set gc_min_heap_size 256Mi`, then `stella policy opa config set gc_max_heap_size 1Gi`
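Before applying new memory values, it can help to check that they are mutually consistent. The Python sketch below parses the binary-suffix quantities used above and compares them. The rule that the GC heap bounds should sit at or below `memory_limit` is an assumption about how these settings relate; this runbook does not state it.

```python
import re

# Binary suffixes as used in the values above (2Gi, 256Mi, 1Gi).
_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_quantity(text: str) -> int:
    """Convert a string such as '256Mi' into a byte count."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", text)
    if match is None:
        raise ValueError(f"unsupported quantity: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_heap_settings(memory_limit: str, gc_min: str, gc_max: str) -> list:
    """Return a list of problems; an empty list means the values are consistent.

    Assumed rule (not from the runbook): gc_min <= gc_max <= memory_limit.
    """
    limit, low, high = (parse_quantity(v) for v in (memory_limit, gc_min, gc_max))
    problems = []
    if low > high:
        problems.append("gc_min_heap_size exceeds gc_max_heap_size")
    if high > limit:
        problems.append("gc_max_heap_size exceeds memory_limit")
    return problems
```

With the values from the steps above, `check_heap_settings("2Gi", "256Mi", "1Gi")` reports no problems.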
If policy caused crash:
- Identify problematic policy: `stella policy list --status error`
- Disable the problematic policy: `stella policy disable <policy-id>`, then `stella policy reload`
- Fix and re-enable: `stella policy validate --file <fixed-policy.rego>`, then `stella policy update <policy-id> --file <fixed-policy.rego>`, then `stella policy enable <policy-id>`
If bundle loading failed:
- Check bundle integrity: `stella policy bundle verify`
- Rebuild bundle: `stella policy bundle build --output bundle.tar.gz`, then `stella policy bundle load bundle.tar.gz`
If configuration issue:
- Reset to default configuration: `stella policy opa config reset`
- Reconfigure with validated settings: `stella policy opa config set workers 4`, then `stella policy opa config set decision_log true`, then `stella policy opa restart`
Verification
# Check OPA is running
stella policy status
# Check OPA health
stella policy opa health
# Test policy evaluation
stella policy evaluate --test
# Check no crashes in recent logs
stella policy opa logs --level error --last 30m
# Monitor stability
stella policy stats --watch
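The first three checks above can be run in sequence by a small wrapper. The Python sketch below is one possible shape. It assumes each command exits non-zero on failure, which this runbook does not confirm. The log check and the blocking `stats --watch` step are left out because their exit codes say nothing about stability.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Commands copied from the verification steps above.
CHECKS = [
    ["stella", "policy", "status"],
    ["stella", "policy", "opa", "health"],
    ["stella", "policy", "evaluate", "--test"],
]


def run_command(argv: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(argv), check=False).returncode


def first_failure(runner: Callable[[Sequence[str]], int] = run_command) -> Optional[str]:
    """Run the checks in order; return the first failing command, or None if all pass.

    Assumes a non-zero exit code means the check failed.
    """
    for argv in CHECKS:
        if runner(argv) != 0:
            return " ".join(argv)
    return None
```

The `runner` parameter exists so the sequencing logic can be exercised without a live `stella` installation.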
Prevention
- Resources: Set appropriate memory limits based on policy complexity
- Validation: Validate all policies before deployment
- Monitoring: Alert on OPA restart count > 2 in 10 minutes
- Testing: Load test policies before production deployment
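The monitoring rule above (more than 2 restarts in 10 minutes) amounts to a sliding-window count. A minimal Python sketch of that condition follows. The function name and the idea of feeding it a list of restart timestamps are illustrative; in practice the rule would more likely be expressed in the alerting system against `policy_opa_restarts_total`.

```python
from datetime import datetime, timedelta


def should_alert(restart_times, now, window=timedelta(minutes=10), threshold=2):
    """Return True when more than `threshold` restarts fall inside the trailing window."""
    cutoff = now - window
    # Count only restarts that happened after the cutoff and not in the future.
    recent = [t for t in restart_times if cutoff < t <= now]
    return len(recent) > threshold
```

For example, three restarts at 11:52, 11:55 and 11:59 evaluated at 12:00 trigger the alert, while two restarts in the same window do not.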
Related Resources
- Architecture: `docs/modules/policy/architecture.md`
- Related runbooks: `policy-evaluation-slow.md`, `policy-compilation-failed.md`
- Doctor check: `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Policy/`
- OPA documentation: https://www.openpolicyagent.org/docs/latest/