Runbook: Policy Engine - Evaluation Latency High

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-003 - Policy Engine Runbooks

Metadata

Field          Value
Component      Policy Engine
Severity       High
On-call scope  Platform team
Last updated   2026-01-17
Doctor check   check.policy.evaluation-latency

Symptoms

  • Policy evaluation takes >500ms (warning) or >2s (critical)
  • Gate decisions timing out in CI/CD pipelines
  • Alert PolicyEvaluationSlow firing
  • Metric policy_evaluation_duration_seconds P95 > 1s
  • Users report "policy check taking too long"
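
The two latency thresholds above can be turned into a small triage helper. This is a minimal sketch: `classify_latency_ms` is a hypothetical function, not part of the stella CLI, and it assumes you already have a single evaluation latency as a whole number of milliseconds.

```shell
#!/bin/sh
# Classify one evaluation latency (whole milliseconds) against the
# thresholds from the Symptoms list: >500ms is a warning, >2s is critical.
# classify_latency_ms is a hypothetical helper, not a stella command.
classify_latency_ms() {
    latency_ms=$1
    if [ "$latency_ms" -gt 2000 ]; then
        echo "critical"
    elif [ "$latency_ms" -gt 500 ]; then
        echo "warning"
    else
        echo "ok"
    fi
}
```

For example, `classify_latency_ms 750` falls in the warning band, while exactly 500 still counts as ok because the threshold is strictly greater than 500ms.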

Impact

Impact Type     Description
User-facing     Slow release gate checks, CI/CD pipeline delays
Data integrity  No data loss; decisions are still correct
SLA impact      Gate latency SLO violated (target: P95 < 500ms)

Diagnosis

Quick checks

  1. Check Doctor diagnostics:

    stella doctor --check check.policy.evaluation-latency
    
  2. Check policy engine status:

    stella policy status
    
  3. Check recent evaluation times:

    stella policy stats --last 10m
    

    Look for: P95 latency, cache hit rate
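
During an incident it can help to run the three quick checks in one pass. The sketch below wraps the commands listed above in a hypothetical `quick_checks` function; it assumes `stella` is on the PATH and returns a non-zero exit code on failure, which this runbook does not confirm.

```shell
#!/bin/sh
# Run the three quick checks in order and keep going if one fails, so a
# single broken command does not hide the output of the others.
# quick_checks is a hypothetical wrapper; the stella commands are the ones
# listed in the steps above.
quick_checks() {
    failed=0
    for args in \
        "doctor --check check.policy.evaluation-latency" \
        "policy status" \
        "policy stats --last 10m"
    do
        echo "== stella $args"
        # $args is intentionally unquoted so it splits into separate words.
        stella $args || { echo "!! failed: stella $args" >&2; failed=1; }
    done
    return "$failed"
}
```

Calling `quick_checks` prints each command's output under a header line and returns 1 if any of the checks failed.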

Deep diagnosis

  1. Profile a slow evaluation:

    stella policy evaluate --image <image-ref> --profile
    

    Look for: Which phase is slowest (parse, compile, execute)

  2. Check OPA compilation cache:

    stella policy cache stats
    

    Problem if: Cache hit rate < 90%

  3. Check policy complexity:

    stella policy analyze --complexity
    

    Problem if: Cyclomatic complexity > 50 or rule count > 200

  4. Check external data fetches:

    stella policy logs --filter "external fetch" --level debug
    

    Problem if: Many external fetches or slow responses
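
The "Problem if" thresholds above are easy to misread under pressure, so here is a sketch that encodes them. The function names are hypothetical, and because the output format of `stella policy cache stats` and `stella policy analyze` is not documented here, the helpers take the raw numbers as arguments instead of parsing command output.

```shell
#!/bin/sh
# Integer cache hit rate in percent, rounded down. Rounding down never
# reports a rate below 90% as healthy.
cache_hit_pct() {
    hits=$1
    misses=$2
    total=$((hits + misses))
    if [ "$total" -eq 0 ]; then
        echo 0      # no lookups yet: treat as 0% rather than divide by zero
        return
    fi
    echo $((hits * 100 / total))
}

# Succeeds when the hit rate is at least 90% (problem if < 90%).
cache_is_healthy() {
    [ "$(cache_hit_pct "$1" "$2")" -ge 90 ]
}

# Succeeds when cyclomatic complexity > 50 or rule count > 200.
policy_too_complex() {
    complexity=$1
    rule_count=$2
    [ "$complexity" -gt 50 ] || [ "$rule_count" -gt 200 ]
}
```

Usage: `cache_is_healthy 899 101 || echo "cache hit rate below 90%"` flags a cache with 899 hits and 101 misses, and `policy_too_complex 35 240 && echo "simplify or split"` flags a policy by rule count alone.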


Resolution

Immediate mitigation

  1. Clear and warm the compilation cache:

    stella policy cache clear
    stella policy cache warm
    
  2. Increase OPA worker count:

    stella policy config set opa.workers 4
    stella policy reload
    
  3. Enable evaluation result caching:

    stella policy config set cache.evaluation_ttl 60s
    stella policy reload
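
Steps 2 and 3 both follow a config change with a reload. A sketch of a wrapper that reloads only when the change was accepted is shown below; `set_and_reload` is a hypothetical helper, and it assumes `stella policy config set` exits non-zero when it rejects a key or value.

```shell
#!/bin/sh
# Apply one config change, then reload the policy engine only if the
# change succeeded. Avoids reloading after a rejected setting.
# set_and_reload is a hypothetical helper built from the commands above.
set_and_reload() {
    key=$1
    value=$2
    stella policy config set "$key" "$value" && stella policy reload
}
```

With this in place, the two mitigations become `set_and_reload opa.workers 4` and `set_and_reload cache.evaluation_ttl 60s`.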
    

Root cause fix

If policy is too complex:

  1. Analyze and simplify policy:

    stella policy analyze --suggest-optimizations
    
  2. Split large policies into modules:

    stella policy refactor --auto-split
    

If external data fetches are slow:

  1. Increase external data cache TTL:

    stella policy config set external_data.cache_ttl 5m
    
  2. Pre-fetch external data:

    stella policy external-data prefetch
    

If Rego compilation is slow:

  1. Enable partial evaluation:

    stella policy config set opa.partial_eval true
    
  2. Pre-compile policies:

    stella policy compile --all
    

Verification

# Run evaluation and check latency
stella policy evaluate --image <image-ref> --timing

# Check P95 latency
stella policy stats --last 5m

# Verify cache is effective
stella policy cache stats
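
To confirm the fix against the SLO target (P95 < 500ms), you can compute a P95 from your own timing samples. The sketch below uses the nearest-rank method and assumes the samples are whole milliseconds, one per line on stdin; how you extract them from `--timing` output is left open because that format is not documented here. `p95_ms` is a hypothetical helper.

```shell
#!/bin/sh
# Print the nearest-rank P95 of newline-separated millisecond samples
# read from stdin. Returns non-zero if there are no samples.
p95_ms() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR)
            print v[rank]
        }'
}
```

Usage: with samples in a hypothetical file `samples_ms.txt`, `[ "$(p95_ms < samples_ms.txt)" -lt 500 ] && echo "within SLO"`. Nearest-rank is a simple choice that always returns an observed value; a metrics backend may interpolate and report a slightly different figure.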

Prevention

  • Review: Check policy complexity before deployment
  • Monitoring: Alert on P95 latency > 300ms
  • Caching: Ensure evaluation cache is enabled
  • Pre-warming: Add cache warming to deployment pipeline
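
For the pre-warming item, a deployment pipeline step might look like the sketch below. The retry loop, its defaults, and the `warm_policy_cache` name are assumptions added for illustration; only `stella policy cache warm` comes from this runbook, and the sketch assumes it exits non-zero on failure.

```shell
#!/bin/sh
# Warm the policy compilation cache after a deploy, retrying a few times
# in case the engine is still starting up.
# warm_policy_cache, the attempt count and the delay are hypothetical.
warm_policy_cache() {
    attempts=${1:-3}    # maximum number of tries
    delay_s=${2:-5}     # seconds to wait between tries
    i=1
    while [ "$i" -le "$attempts" ]; do
        if stella policy cache warm; then
            return 0
        fi
        echo "cache warm attempt $i/$attempts failed" >&2
        [ "$i" -lt "$attempts" ] && sleep "$delay_s"
        i=$((i + 1))
    done
    return 1
}
```

In a pipeline, `warm_policy_cache 3 10 || echo "warning: cache not warmed"` keeps a failed warm-up from blocking the release while still leaving a trace in the logs. Whether a failed warm-up should fail the deploy is a team decision.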

Related Resources

  • Architecture: docs/modules/policy/architecture.md
  • Related runbooks: policy-opa-crash.md, policy-compilation-failed.md
  • Dashboard: Grafana > Stella Ops > Policy Engine