synergy moats product advisory implementations
docs/operations/runbooks/policy-evaluation-slow.md
@@ -0,0 +1,174 @@
# Runbook: Policy Engine - Evaluation Latency High

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.evaluation-latency` |

---

## Symptoms

- [ ] Policy evaluation takes >500ms (warning) or >2s (critical)
- [ ] Gate decisions time out in CI/CD pipelines
- [ ] Alert `PolicyEvaluationSlow` is firing
- [ ] Metric `policy_evaluation_duration_seconds` P95 > 1s
- [ ] Users report "policy check taking too long"

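The warning and critical thresholds in the first symptom can be encoded as a small helper for scripts or triage notes. This is a minimal sketch: the function name `classify_latency_ms` is hypothetical, and it assumes the latency is supplied as a whole number of milliseconds.

```shell
# Hypothetical helper: classify one evaluation latency (integer milliseconds)
# against the thresholds listed under Symptoms.
classify_latency_ms() {
  ms="$1"
  if [ "$ms" -gt 2000 ]; then
    echo "critical"   # slower than 2s
  elif [ "$ms" -gt 500 ]; then
    echo "warning"    # slower than 500ms
  else
    echo "ok"
  fi
}
```

For example, `classify_latency_ms 750` prints `warning`.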
---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Slow release gate checks, CI/CD pipeline delays |
| **Data integrity** | No data loss; decisions are still correct |
| **SLA impact** | Gate latency SLO violated (target: P95 < 500ms) |

---

## Diagnosis

### Quick checks

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check check.policy.evaluation-latency
   ```

2. **Check policy engine status:**
   ```bash
   stella policy status
   ```

3. **Check recent evaluation times:**
   ```bash
   stella policy stats --last 10m
   ```
   Look for: P95 latency, cache hit rate

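During an incident it can help to run all three quick checks in one pass and keep going even if one of them fails. The wrapper below is a sketch that uses only the commands listed above; the function name `run_quick_checks` is hypothetical, and it assumes `stella` exits non-zero when a check fails, which this runbook does not state.

```shell
# Hypothetical wrapper: run every quick check, report failures, never stop early.
run_quick_checks() {
  failures=0
  # Each entry is one argument list for the stella CLI, copied from the steps above.
  for args in \
    "doctor --check check.policy.evaluation-latency" \
    "policy status" \
    "policy stats --last 10m"
  do
    echo "== stella $args"
    # Unquoted on purpose so the string splits into separate arguments.
    if ! stella $args; then
      echo "!! failed: stella $args" >&2
      failures=$((failures + 1))
    fi
  done
  # Succeed only if every check succeeded.
  [ "$failures" -eq 0 ]
}
```

Call `run_quick_checks` from an interactive shell and redirect its output into the incident notes if a record is needed.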
### Deep diagnosis

1. **Profile a slow evaluation:**
   ```bash
   stella policy evaluate --image <image-ref> --profile
   ```
   Look for: which phase is slowest (parse, compile, or execute)

2. **Check OPA compilation cache:**
   ```bash
   stella policy cache stats
   ```
   Problem if: cache hit rate is below 90%

3. **Check policy complexity:**
   ```bash
   stella policy analyze --complexity
   ```
   Problem if: cyclomatic complexity > 50 or rule count > 200

4. **Check external data fetches:**
   ```bash
   stella policy logs --filter "external fetch" --level debug
   ```
   Problem if: there are many external fetches or their responses are slow

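The numeric "Problem if" thresholds above can be checked mechanically once the values have been read off the command output. The sketch below assumes the operator passes the numbers in by hand, because the output format of `stella policy cache stats` and `stella policy analyze --complexity` is not documented here. The function name `check_policy_health` is hypothetical, and all three inputs are assumed to be integers.

```shell
# Hypothetical helper: compare observed values against the deep-diagnosis thresholds.
# Usage: check_policy_health <cache-hit-rate-percent> <cyclomatic-complexity> <rule-count>
check_policy_health() {
  hit_rate="$1"
  complexity="$2"
  rules="$3"
  status=0
  if [ "$hit_rate" -lt 90 ]; then
    echo "cache hit rate ${hit_rate}% is below 90%"
    status=1
  fi
  if [ "$complexity" -gt 50 ]; then
    echo "cyclomatic complexity ${complexity} exceeds 50"
    status=1
  fi
  if [ "$rules" -gt 200 ]; then
    echo "rule count ${rules} exceeds 200"
    status=1
  fi
  # Non-zero exit status means at least one threshold was crossed.
  return "$status"
}
```

For example, `check_policy_health 82 35 240` reports the low hit rate and the high rule count, and returns a non-zero status.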
---

## Resolution

### Immediate mitigation

1. **Clear and warm the compilation cache:**
   ```bash
   stella policy cache clear
   stella policy cache warm
   ```

2. **Increase OPA worker count:**
   ```bash
   stella policy config set opa.workers 4
   stella policy reload
   ```

3. **Enable evaluation result caching:**
   ```bash
   stella policy config set cache.evaluation_ttl 60s
   stella policy reload
   ```

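If the three mitigation steps are applied together, a dry-run switch makes it possible to review the exact command sequence before touching a live engine. The sketch below replays the commands above in the same order and stops at the first failure. The names `run`, `mitigate_slow_evaluation`, and the `DRY_RUN` variable are hypothetical and not part of the `stella` CLI.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Replays the immediate-mitigation steps in runbook order; stops at the first failure.
mitigate_slow_evaluation() {
  run stella policy cache clear &&
  run stella policy cache warm &&
  run stella policy config set opa.workers 4 &&
  run stella policy reload &&
  run stella policy config set cache.evaluation_ttl 60s &&
  run stella policy reload
}
```

Running `DRY_RUN=1 mitigate_slow_evaluation` prints the six commands without executing them; running it without `DRY_RUN` applies them.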
### Root cause fix

**If the policy is too complex:**

1. Analyze and simplify the policy:
   ```bash
   stella policy analyze --suggest-optimizations
   ```

2. Split large policies into modules:
   ```bash
   stella policy refactor --auto-split
   ```

**If external data fetches are slow:**

1. Increase the external data cache TTL:
   ```bash
   stella policy config set external_data.cache_ttl 5m
   ```

2. Pre-fetch external data:
   ```bash
   stella policy external-data prefetch
   ```

**If Rego compilation is slow:**

1. Enable partial evaluation:
   ```bash
   stella policy config set opa.partial_eval true
   ```

2. Pre-compile policies:
   ```bash
   stella policy compile --all
   ```

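The three branches above map a diagnosed cause to a pair of commands. A lookup helper such as the following sketch can print the relevant pair for review; it deliberately prints rather than executes, so nothing changes until the operator runs the commands. The function name `root_cause_commands` and the cause labels `complexity`, `external-data`, and `compilation` are hypothetical shorthand for the three headings.

```shell
# Hypothetical helper: print the root-cause fix commands for a diagnosed cause.
root_cause_commands() {
  case "$1" in
    complexity)
      echo "stella policy analyze --suggest-optimizations"
      echo "stella policy refactor --auto-split"
      ;;
    external-data)
      echo "stella policy config set external_data.cache_ttl 5m"
      echo "stella policy external-data prefetch"
      ;;
    compilation)
      echo "stella policy config set opa.partial_eval true"
      echo "stella policy compile --all"
      ;;
    *)
      echo "unknown cause: $1 (expected complexity, external-data, or compilation)" >&2
      return 2
      ;;
  esac
}
```

For example, `root_cause_commands external-data` prints the cache TTL change followed by the prefetch command.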
### Verification

```bash
# Run evaluation and check latency
stella policy evaluate --image <image-ref> --timing

# Check P95 latency
stella policy stats --last 5m

# Verify cache is effective
stella policy cache stats
```

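To sanity-check the reported P95 against a handful of manual timings, the percentile can be computed locally. The sketch below uses the nearest-rank method over one latency value per line on standard input. The function name `p95` is hypothetical, and collecting the values is left to the operator because the output format of `--timing` is not documented in this runbook.

```shell
# Hypothetical helper: nearest-rank 95th percentile of numbers read from stdin,
# one value per line. Fails with a non-zero status on empty input.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # Nearest rank = ceil(0.95 * N), computed with integer arithmetic.
      rank = int((NR * 95 + 99) / 100)
      print v[rank]
    }'
}
```

For example, `printf '%s\n' 120 480 95 310 | p95` prints `480`, the slowest of the four samples.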
---

## Prevention

- [ ] **Review:** Review policy complexity before deployment
- [ ] **Monitoring:** Alert on P95 latency > 300ms
- [ ] **Caching:** Ensure evaluation cache is enabled
- [ ] **Pre-warming:** Add cache warming to deployment pipeline

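For the pre-warming item, a deployment step might retry the warm command a few times so that a transient failure does not leave the cache cold. This is a sketch under assumptions: the function name `warm_cache_with_retry` and the `WARM_RETRY_DELAY` variable are hypothetical, the retry policy is not something this runbook prescribes, and `stella policy cache warm` is assumed to exit non-zero on failure.

```shell
# Hypothetical deployment step: warm the policy cache, retrying on failure.
# Usage: warm_cache_with_retry [attempts]   (default 3 attempts)
warm_cache_with_retry() {
  attempts="${1:-3}"
  delay="${WARM_RETRY_DELAY:-5}"   # seconds between attempts
  n=1
  while [ "$n" -le "$attempts" ]; do
    if stella policy cache warm; then
      return 0
    fi
    echo "cache warm attempt $n of $attempts failed" >&2
    # Sleep between attempts, but not after the final one.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}
```

A pipeline could call `warm_cache_with_retry 5` after the policy engine starts and decide separately whether a failed warm-up should block the rollout.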
---

## Related Resources

- **Architecture:** `docs/modules/policy/architecture.md`
- **Related runbooks:** `policy-opa-crash.md`, `policy-compilation-failed.md`
- **Dashboard:** Grafana > Stella Ops > Policy Engine