Runbook: Policy Engine - OPA Process Crashed

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-003 - Policy Engine Runbooks

Metadata

Field          Value
Component      Policy Engine
Severity       Critical
On-call scope  Platform team
Last updated   2026-01-17
Doctor check   check.policy.opa-health

Symptoms

  • Policy evaluations failing with "OPA unavailable" error
  • Alert PolicyOPACrashed firing
  • OPA process exited unexpectedly
  • Error: "connection refused" when connecting to OPA
  • Metric policy_opa_restarts_total increasing

Impact

Impact Type     Description
User-facing     All policy evaluations fail; gate decisions blocked
Data integrity  No data loss; decisions delayed until OPA recovers
SLA impact      Gate latency SLO violated; release pipeline blocked

Diagnosis

Quick checks

  1. Check Doctor diagnostics:

    stella doctor --check check.policy.opa-health
    
  2. Check OPA process status:

    stella policy status
    

    Look for: OPA process state, restart count

  3. Check OPA logs for crash reason:

    stella policy opa logs --last 30m --level error
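
The three quick checks can be captured in one pass so the output is available for the incident record. The sketch below is a minimal wrapper that uses only the commands listed above; the function name and the default log file name are illustrative assumptions, not part of the stella CLI.

```shell
# Run the three quick checks in order and save the combined output.
# Assumption: run_quick_checks and the default file name are hypothetical helpers.
run_quick_checks() {
    out="${1:-opa-quick-checks.log}"
    {
        echo "== Doctor diagnostics =="
        stella doctor --check check.policy.opa-health
        echo "== OPA process status =="
        stella policy status
        echo "== OPA error logs (last 30m) =="
        stella policy opa logs --last 30m --level error
    } >"$out" 2>&1
    echo "Quick-check output written to $out"
}
```

Calling `run_quick_checks incident-opa.log` writes all three outputs to one file. A failing check does not stop the later ones, which is usually what you want while OPA is down.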
    

Deep diagnosis

  1. Check OPA memory usage before crash:

    stella policy stats --opa-metrics
    

    Problem if: Memory usage was near the configured limit before the crash

  2. Check for problematic policy:

    stella policy list --last-error
    

    Look for: Policies that caused evaluation errors

  3. Check OPA configuration:

    stella policy opa config show
    

    Look for: Invalid configuration, missing bundles

  4. Check for infinite loops in Rego:

    stella policy analyze --detect-loops
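
"Near limit" in step 1 is easier to judge with an explicit threshold. The helper below is a sketch under two assumptions: you read the used and limit values (in bytes) from the metrics output yourself, because the output format of `stella policy stats --opa-metrics` is not documented here, and 90% is an arbitrary default rather than a documented threshold.

```shell
# mem_near_limit USED_BYTES LIMIT_BYTES [THRESHOLD_PCT]
# Returns 0 when usage is at or above THRESHOLD_PCT of the limit (default 90),
# 1 when it is below, and 2 when the limit is not a positive number.
# Assumption: the 90% default is illustrative only.
mem_near_limit() {
    used="$1"; limit="$2"; threshold="${3:-90}"
    [ "$limit" -gt 0 ] || return 2
    pct=$(( used * 100 / limit ))   # integer percentage, rounded down
    [ "$pct" -ge "$threshold" ]
}
```

For example, `mem_near_limit 1992294400 2147483648 && echo "likely OOM"` flags usage of about 1.86 GiB against a 2 GiB limit.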
    

Resolution

Immediate mitigation

  1. Restart OPA process:

    stella policy opa restart
    
  2. If OPA keeps crashing, start in safe mode:

    stella policy opa start --safe-mode
    

    Note: Safe mode disables custom policies

  3. Enable fail-open temporarily (only if your compliance policy allows it):

    stella policy config set failopen true
    stella policy reload
    

    Warning: Only use if compliance allows fail-open mode
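
The first two mitigation steps form an escalation ladder that can be scripted. The sketch below assumes that `stella policy opa health` exits non-zero while OPA is unhealthy, which this runbook does not confirm. The function name and the `SETTLE_SECONDS` variable are hypothetical. Fail-open is deliberately left out, because it needs a human compliance decision.

```shell
# Restart OPA, wait briefly, and fall back to safe mode if it is still unhealthy.
# Assumption: `stella policy opa health` signals an unhealthy OPA via its exit code.
mitigate_opa_crash() {
    stella policy opa restart
    sleep "${SETTLE_SECONDS:-5}"    # give the process time to come up
    if stella policy opa health; then
        echo "OPA healthy after restart"
        return 0
    fi
    echo "OPA still unhealthy; starting in safe mode (custom policies disabled)" >&2
    stella policy opa start --safe-mode
}
```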

Root cause fix

If OPA was OOM-killed:

  1. Increase OPA memory limit:

    stella policy opa config set memory_limit 2Gi
    stella policy opa restart
    
  2. Tune the garbage collection heap bounds:

    stella policy opa config set gc_min_heap_size 256Mi
    stella policy opa config set gc_max_heap_size 1Gi
    

If policy caused crash:

  1. Identify problematic policy:

    stella policy list --status error
    
  2. Disable the problematic policy:

    stella policy disable <policy-id>
    stella policy reload
    
  3. Fix and re-enable:

    stella policy validate --file <fixed-policy.rego>
    stella policy update <policy-id> --file <fixed-policy.rego>
    stella policy enable <policy-id>
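
Step 3 is safest when a failed validation stops the sequence before the policy is updated or re-enabled. The wrapper below is a sketch of that guard. It assumes `stella policy validate` returns a non-zero exit code on failure, and the function name is hypothetical.

```shell
# reenable_fixed_policy POLICY_ID POLICY_FILE
# Validate first; update and enable only if validation succeeds.
# Assumption: `stella policy validate` exits non-zero when the policy is invalid.
reenable_fixed_policy() {
    policy_id="$1"; policy_file="$2"
    if [ -z "$policy_id" ] || [ ! -f "$policy_file" ]; then
        echo "usage: reenable_fixed_policy <policy-id> <fixed-policy.rego>" >&2
        return 2
    fi
    stella policy validate --file "$policy_file" || {
        echo "validation failed; policy $policy_id stays disabled" >&2
        return 1
    }
    stella policy update "$policy_id" --file "$policy_file" &&
        stella policy enable "$policy_id"
}
```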
    

If bundle loading failed:

  1. Check bundle integrity:

    stella policy bundle verify
    
  2. Rebuild bundle:

    stella policy bundle build --output bundle.tar.gz
    stella policy bundle load bundle.tar.gz
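
The two bundle steps can be chained so that a rebuild happens only when verification fails. This sketch assumes `stella policy bundle verify` reports a failed integrity check through its exit code; the function name is hypothetical.

```shell
# repair_bundle [BUNDLE_PATH]
# Rebuild and reload the bundle only when verification fails.
# Assumption: `stella policy bundle verify` exits non-zero on a bad bundle.
repair_bundle() {
    bundle="${1:-bundle.tar.gz}"
    if stella policy bundle verify; then
        echo "bundle verified; no rebuild needed"
        return 0
    fi
    stella policy bundle build --output "$bundle" &&
        stella policy bundle load "$bundle"
}
```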
    

If configuration issue:

  1. Reset to default configuration:

    stella policy opa config reset
    
  2. Reconfigure with validated settings:

    stella policy opa config set workers 4
    stella policy opa config set decision_log true
    stella policy opa restart
    

Verification

# Check OPA is running
stella policy status

# Check OPA health
stella policy opa health

# Test policy evaluation
stella policy evaluate --test

# Check no crashes in recent logs
stella policy opa logs --level error --last 30m

# Monitor stability
stella policy stats --watch
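
The non-interactive checks above can be combined into a single pass/fail summary. The sketch below assumes each command signals failure through its exit code, which this runbook does not confirm. The log review and `stella policy stats --watch` are left as manual steps because they need a human to read the output.

```shell
# Run the status, health, and test-evaluation checks and report each result.
# Returns 0 only if every check passes.
# Assumption: each stella command exits non-zero on failure.
verify_opa_recovery() {
    failed=0
    for check in "policy status" "policy opa health" "policy evaluate --test"; do
        # Word splitting on $check is intentional: each entry is an argument list.
        if stella $check >/dev/null 2>&1; then
            echo "PASS: stella $check"
        else
            echo "FAIL: stella $check"
            failed=1
        fi
    done
    return "$failed"
}
```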

Prevention

  • Resources: Set appropriate memory limits based on policy complexity
  • Validation: Validate all policies before deployment
  • Monitoring: Alert when OPA restarts more than 2 times within 10 minutes
  • Testing: Load test policies before production deployment
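
The monitoring rule above can be expressed as a comparison of two samples of `policy_opa_restarts_total` taken 10 minutes apart. The helper below is a sketch of that comparison only. How the samples are collected, and the handling of a counter reset, are assumptions rather than documented behavior.

```shell
# restart_alert_needed PREV_TOTAL CURR_TOTAL [MAX_RESTARTS]
# Returns 0 when the counter grew by more than MAX_RESTARTS (default 2)
# between two samples taken 10 minutes apart.
restart_alert_needed() {
    prev="$1"; curr="$2"; max="${3:-2}"
    delta=$(( curr - prev ))
    # Assumption: a negative delta means the counter was reset, so the
    # current value is treated as the number of restarts since the reset.
    [ "$delta" -ge 0 ] || delta="$curr"
    [ "$delta" -gt "$max" ]
}
```

For example, `restart_alert_needed 5 8` succeeds (3 restarts), while `restart_alert_needed 5 7` does not (exactly 2).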

Related Resources

  • Architecture: docs/modules/policy/architecture.md
  • Related runbooks: policy-evaluation-slow.md, policy-compilation-failed.md
  • Doctor check: src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Policy/
  • OPA documentation: https://www.openpolicyagent.org/docs/latest/