Runbook: Policy Engine - Evaluation Latency High
Sprint: SPRINT_20260117_029_DOCS_runbook_coverage Task: RUN-003 - Policy Engine Runbooks
Metadata
| Field | Value |
|---|---|
| Component | Policy Engine |
| Severity | High |
| On-call scope | Platform team |
| Last updated | 2026-01-17 |
| Doctor check | check.policy.evaluation-latency |
Symptoms
- Policy evaluation takes >500ms (warning) or >2s (critical)
- Gate decisions timing out in CI/CD pipelines
- Alert `PolicyEvaluationSlow` firing
- Metric `policy_evaluation_duration_seconds` P95 > 1s
- Users report "policy check taking too long"
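The warning and critical thresholds above can be expressed as a small triage helper. This is a minimal sketch: `classify_latency_ms` is a hypothetical function, not part of the `stella` CLI, and it assumes the latency is supplied as a whole number of milliseconds.

```shell
#!/bin/sh
# Classify one evaluation latency (whole milliseconds) against the
# thresholds in the Symptoms list: >500 ms is a warning, >2 s is critical.
# classify_latency_ms is a hypothetical helper, not a stella subcommand.
classify_latency_ms() {
    ms=$1
    if [ "$ms" -gt 2000 ]; then
        echo critical
    elif [ "$ms" -gt 500 ]; then
        echo warning
    else
        echo ok
    fi
}
```

For example, `classify_latency_ms 750` prints `warning`. Both comparisons are strict, so exactly 500 ms still counts as `ok`.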
Impact
| Impact Type | Description |
|---|---|
| User-facing | Slow release gate checks, CI/CD pipeline delays |
| Data integrity | No data loss; decisions are still correct |
| SLA impact | Gate latency SLO violated (target: P95 < 500ms) |
Diagnosis
Quick checks
- Check Doctor diagnostics:
  `stella doctor --check check.policy.evaluation-latency`
- Check policy engine status:
  `stella policy status`
- Check recent evaluation times:
  `stella policy stats --last 10m`
  Look for: P95 latency, cache hit rate
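During an incident it can help to run the quick checks as one unit. The sketch below wraps the three commands listed above in a hypothetical `quick_checks` function. It assumes each `stella` command exits non-zero on failure, which this runbook does not confirm.

```shell
#!/bin/sh
# Run the three quick checks in order and count how many failed.
# quick_checks is a hypothetical wrapper; the stella commands are the ones
# listed in this runbook, and their exit-code behavior is an assumption.
quick_checks() {
    failed=0
    echo "== Doctor diagnostics =="
    stella doctor --check check.policy.evaluation-latency || failed=$((failed + 1))
    echo "== Policy engine status =="
    stella policy status || failed=$((failed + 1))
    echo "== Recent evaluation stats (last 10m) =="
    stella policy stats --last 10m || failed=$((failed + 1))
    echo "quick checks failed: $failed"
    # Succeed only if every check succeeded.
    [ "$failed" -eq 0 ]
}
```

All three checks run even when an earlier one fails, so a single invocation gives the full picture before moving on to deep diagnosis.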
Deep diagnosis
- Profile a slow evaluation:
  `stella policy evaluate --image <image-ref> --profile`
  Look for: Which phase is slowest (parse, compile, execute)
- Check OPA compilation cache:
  `stella policy cache stats`
  Problem if: Cache hit rate < 90%
- Check policy complexity:
  `stella policy analyze --complexity`
  Problem if: Cyclomatic complexity > 50 or rule count > 200
- Check external data fetches:
  `stella policy logs --filter "external fetch" --level debug`
  Problem if: Many external fetches or slow responses
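The numeric "Problem if" thresholds above can be checked mechanically once the values have been read from the command output. This is a sketch only: `check_policy_health` is a hypothetical helper that takes the numbers as arguments (whole numbers, with the hit rate as a percentage). It does not parse `stella` output, whose format is not shown in this runbook.

```shell
#!/bin/sh
# Flag values that cross the "Problem if" thresholds from deep diagnosis.
# Hypothetical helper: the operator reads the numbers from the CLI output
# and passes them in; nothing here parses stella output.
# Usage: check_policy_health <cache_hit_pct> <cyclomatic_complexity> <rule_count>
check_policy_health() {
    hit_pct=$1
    complexity=$2
    rules=$3
    problems=0
    if [ "$hit_pct" -lt 90 ]; then
        echo "cache hit rate ${hit_pct}% is below 90%"
        problems=$((problems + 1))
    fi
    if [ "$complexity" -gt 50 ] || [ "$rules" -gt 200 ]; then
        echo "policy too complex (complexity=$complexity, rules=$rules)"
        problems=$((problems + 1))
    fi
    # Succeed only when no threshold was crossed.
    [ "$problems" -eq 0 ]
}
```

For example, `check_policy_health 82 35 140` reports the low cache hit rate and returns non-zero, which points toward the cache mitigation rather than a policy refactor.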
Resolution
Immediate mitigation
- Clear and warm the compilation cache:
  `stella policy cache clear`
  `stella policy cache warm`
- Increase OPA worker count:
  `stella policy config set opa.workers 4`
  `stella policy reload`
- Enable evaluation result caching:
  `stella policy config set cache.evaluation_ttl 60s`
  `stella policy reload`
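Each mitigation above is a two-step sequence, and the second step only makes sense if the first succeeded. The following sketch chains the steps with `&&`. The function names are hypothetical, and it assumes `stella` exits non-zero when a step fails.

```shell
#!/bin/sh
# Hypothetical wrappers around the mitigation commands in this runbook.
# Each stops at the first failing step, so a reload is never issued after
# a failed config change. Non-zero exit on failure is an assumption.
refresh_compilation_cache() {
    stella policy cache clear && stella policy cache warm
}

set_and_reload() {
    # $1 = config key, $2 = value
    stella policy config set "$1" "$2" && stella policy reload
}
```

Usage would mirror the list above: `refresh_compilation_cache`, then `set_and_reload opa.workers 4` or `set_and_reload cache.evaluation_ttl 60s` as needed.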
Root cause fix
If policy is too complex:
- Analyze and simplify policy:
  `stella policy analyze --suggest-optimizations`
- Split large policies into modules:
  `stella policy refactor --auto-split`
If external data fetches are slow:
- Increase external data cache TTL:
  `stella policy config set external_data.cache_ttl 5m`
- Pre-fetch external data:
  `stella policy external-data prefetch`
If Rego compilation is slow:
- Enable partial evaluation:
  `stella policy config set opa.partial_eval true`
- Pre-compile policies:
  `stella policy compile --all`
Verification
# Run evaluation and check latency
stella policy evaluate --image <image-ref> --timing
# Check P95 latency
stella policy stats --last 5m
# Verify cache is effective
stella policy cache stats
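If individual timings are collected by hand, for instance from repeated `--timing` runs, a P95 can be computed locally and compared with the 500 ms SLO target. The sketch below uses the nearest-rank method. `p95_ms` is a hypothetical helper that expects one latency in milliseconds per line on stdin; how the samples are gathered is left to the operator.

```shell
#!/bin/sh
# Nearest-rank P95 over latency samples (milliseconds, one per line on stdin).
# p95_ms is a hypothetical helper, not a stella subcommand.
p95_ms() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            # rank = ceil(0.95 * NR), computed without fractional rounding issues
            rank = int((NR * 95 + 99) / 100)
            print v[rank]
        }'
}
```

With the samples in a file (the name `samples.txt` is only an example), `[ "$(p95_ms < samples.txt)" -lt 500 ]` succeeds when the measured P95 is under the SLO target.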
Prevention
- Review: Review policy complexity before deployment
- Monitoring: Alert on P95 latency > 300ms
- Caching: Ensure evaluation cache is enabled
- Pre-warming: Add cache warming to deployment pipeline
Related Resources
- Architecture: `docs/modules/policy/architecture.md`
- Related runbooks: `policy-opa-crash.md`, `policy-compilation-failed.md`
- Dashboard: Grafana > Stella Ops > Policy Engine