synergy moats product advisory implementations
docs/operations/runbooks/policy-opa-crash.md
# Runbook: Policy Engine - OPA Process Crashed

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | Critical |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.opa-health` |

---

## Symptoms

- [ ] Policy evaluations failing with "OPA unavailable" error
- [ ] Alert `PolicyOPACrashed` firing
- [ ] OPA process exited unexpectedly
- [ ] Error: "connection refused" when connecting to OPA
- [ ] Metric `policy_opa_restarts_total` increasing

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | All policy evaluations fail; gate decisions blocked |
| **Data integrity** | No data loss; decisions delayed until OPA recovers |
| **SLA impact** | Gate latency SLO violated; release pipeline blocked |

---

## Diagnosis

### Quick checks

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check check.policy.opa-health
   ```

2. **Check OPA process status:**
   ```bash
   stella policy status
   ```
   Look for: OPA process state, restart count

3. **Check OPA logs for crash reason:**
   ```bash
   stella policy opa logs --last 30m --level error
   ```
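
During an incident it can help to capture all three quick checks in one transcript. The following is a minimal sketch that uses only the commands listed above; the function name `quick_checks` and the default transcript path are hypothetical, and it assumes each command exits non-zero on failure.

```bash
#!/usr/bin/env bash
# Run the quick checks in order and keep a transcript for the incident record.
# quick_checks and the transcript path are illustrative, not part of the stella CLI.
quick_checks() {
  local out="${1:-opa-crash-quick-checks.log}"
  local failures=0
  local cmd
  : > "$out"
  # Each heredoc line is one command from the quick checks above.
  while IFS= read -r cmd; do
    printf '### %s\n' "$cmd" >> "$out"
    # $cmd is unquoted on purpose so it splits into command and arguments.
    # stdin is redirected so the command cannot consume the heredoc.
    if ! $cmd >> "$out" 2>&1 </dev/null; then
      printf '(exit status non-zero)\n' >> "$out"
      failures=$((failures + 1))
    fi
  done <<'EOF'
stella doctor --check check.policy.opa-health
stella policy status
stella policy opa logs --last 30m --level error
EOF
  printf '%s check(s) failed; transcript in %s\n' "$failures" "$out"
  [ "$failures" -eq 0 ]
}
```

Calling `quick_checks /tmp/opa-incident.log` writes each command and its output under a `###` header, so the file can be attached to the incident ticket as-is.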

### Deep diagnosis

1. **Check OPA memory usage before crash:**
   ```bash
   stella policy stats --opa-metrics
   ```
   Problem if: memory usage was near the limit before the crash

2. **Check for problematic policy:**
   ```bash
   stella policy list --last-error
   ```
   Look for: Policies that caused evaluation errors

3. **Check OPA configuration:**
   ```bash
   stella policy opa config show
   ```
   Look for: Invalid configuration, missing bundles

4. **Check for infinite loops in Rego:**
   ```bash
   stella policy analyze --detect-loops
   ```
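
If memory usage was near the limit, the Linux kernel log may confirm an OOM kill. The sketch below filters kernel log text for OOM events that name a process called `opa`. The process name and direct access to the host's kernel log are assumptions; containerized deployments may record this elsewhere.

```bash
# Print kernel-log lines that record an OOM kill of a process named "opa".
# Reads log text on stdin, so it works with dmesg, journalctl -k, or a saved file.
find_opa_oom_kills() {
  # First filter: lines written by the OOM killer.
  # Second filter: keep only lines that name the opa process.
  grep -Ei 'out of memory|oom-kill|oom_reaper' | grep -E '\(opa\)|task=opa,'
}
```

Typical use is `dmesg -T | find_opa_oom_kills` or `journalctl -k --since "1 hour ago" | find_opa_oom_kills`. No output means the kernel log available to you shows no OOM kill of `opa`, which does not rule out a limit enforced by another layer.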

---

## Resolution

### Immediate mitigation

1. **Restart OPA process:**
   ```bash
   stella policy opa restart
   ```

2. **If OPA keeps crashing, start in safe mode:**
   ```bash
   stella policy opa start --safe-mode
   ```
   Note: Safe mode disables custom policies

3. **Enable fail-open temporarily (if allowed by policy):**
   ```bash
   stella policy config set failopen true
   stella policy reload
   ```
   **Warning:** Only use if compliance allows fail-open mode, and turn it off again once OPA is stable
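
Steps 1 and 2 form an escalation: restart first, then fall back to safe mode. A minimal sketch of that sequence follows. It assumes `stella policy opa health` exits non-zero while OPA is unhealthy, which this runbook does not confirm; the function name, attempt count, and wait time are hypothetical.

```bash
# Try a normal restart a few times; fall back to safe mode if OPA never reports healthy.
# Assumption: `stella policy opa health` exits non-zero while OPA is unhealthy.
restart_opa_with_fallback() {
  local attempts="${1:-3}"
  local wait_seconds="${2:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # A failed restart command is tolerated; the health check decides the outcome.
    stella policy opa restart || true
    sleep "$wait_seconds"
    if stella policy opa health; then
      echo "OPA healthy after restart attempt $i"
      return 0
    fi
    i=$((i + 1))
  done
  echo "OPA unhealthy after $attempts restarts; starting safe mode (custom policies disabled)" >&2
  stella policy opa start --safe-mode
}
```

The sketch leaves fail-open out on purpose: that step depends on a compliance decision and is better made by a person than by a script.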

### Root cause fix

**If OPA was OOM-killed:**

1. Increase OPA memory limit:
   ```bash
   stella policy opa config set memory_limit 2Gi
   stella policy opa restart
   ```

2. Tune garbage collection:
   ```bash
   stella policy opa config set gc_min_heap_size 256Mi
   stella policy opa config set gc_max_heap_size 1Gi
   ```

**If policy caused crash:**

1. Identify problematic policy:
   ```bash
   stella policy list --status error
   ```

2. Disable the problematic policy:
   ```bash
   stella policy disable <policy-id>
   stella policy reload
   ```

3. Fix and re-enable:
   ```bash
   stella policy validate --file <fixed-policy.rego>
   stella policy update <policy-id> --file <fixed-policy.rego>
   stella policy enable <policy-id>
   ```
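
The fix-and-re-enable step can be gated so that a policy which still fails validation is never switched back on. The sketch below adds a local `opa check` (the upstream OPA CLI's parse-and-compile check) ahead of the three `stella` commands. It assumes the `opa` binary is installed on the operator's machine and that `stella policy validate` exits non-zero on failure; the function name is hypothetical.

```bash
# Re-enable a policy only after every check passes; stop at the first failure.
# Assumptions: upstream `opa` binary is on PATH; stella commands exit non-zero on failure.
fix_and_reenable_policy() {
  local policy_id="$1"
  local policy_file="$2"
  # `opa check` parses and compiles the Rego file without evaluating it.
  opa check "$policy_file" || { echo "opa check failed; policy left disabled" >&2; return 1; }
  stella policy validate --file "$policy_file" || { echo "validation failed; policy left disabled" >&2; return 1; }
  stella policy update "$policy_id" --file "$policy_file" || return 1
  stella policy enable "$policy_id"
}
```

For example, `fix_and_reenable_policy <policy-id> <fixed-policy.rego>` runs the same commands as step 3, in the same order, and leaves the policy disabled if any of them fails.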

**If bundle loading failed:**

1. Check bundle integrity:
   ```bash
   stella policy bundle verify
   ```

2. Rebuild bundle:
   ```bash
   stella policy bundle build --output bundle.tar.gz
   stella policy bundle load bundle.tar.gz
   ```
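
Before loading a rebuilt bundle, a quick look inside the archive can catch a truncated or empty build. OPA bundles are gzipped tarballs, so plain `tar` is enough. This is a sketch under that assumption; the function name is hypothetical, and it does not replace `stella policy bundle verify`.

```bash
# List a bundle's contents and report whether it holds any Rego modules.
inspect_bundle() {
  local bundle="$1"
  local listing rego_count
  # tar -tzf fails on a truncated or non-gzip archive, which catches corrupt bundles.
  listing="$(tar -tzf "$bundle")" || { echo "unreadable archive: $bundle" >&2; return 1; }
  printf '%s\n' "$listing"
  # grep -c prints 0 and exits non-zero when nothing matches, hence the `|| true`.
  rego_count="$(printf '%s\n' "$listing" | grep -c '\.rego$' || true)"
  echo "rego modules: $rego_count"
  [ "$rego_count" -gt 0 ]
}
```

Running `inspect_bundle bundle.tar.gz` after the build step prints the file list and returns non-zero if the archive is unreadable or contains no `.rego` files.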

**If configuration issue:**

1. Reset to default configuration:
   ```bash
   stella policy opa config reset
   ```

2. Reconfigure with validated settings:
   ```bash
   stella policy opa config set workers 4
   stella policy opa config set decision_log true
   stella policy opa restart
   ```

### Verification

```bash
# Check OPA is running
stella policy status

# Check OPA health
stella policy opa health

# Test policy evaluation
stella policy evaluate --test

# Confirm no crashes in recent logs
stella policy opa logs --level error --last 30m

# Monitor stability
stella policy stats --watch
```
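
The first three verification commands are non-interactive and can be chained into a single pass/fail gate. This is a sketch only: it assumes each command exits non-zero when its check fails, and the function name is hypothetical. The log review and `--watch` steps stay manual.

```bash
# Run the non-interactive verification steps and stop at the first failure.
# Assumption: each stella command exits non-zero when its check fails.
verify_opa_recovery() {
  local step
  for step in "policy status" "policy opa health" "policy evaluate --test"; do
    # $step is unquoted on purpose so it splits into separate arguments.
    if stella $step; then
      echo "PASS: stella $step"
    else
      echo "FAIL: stella $step" >&2
      return 1
    fi
  done
  echo "Automated checks passed; review error logs and watch stats manually"
}
```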

---

## Prevention

- [ ] **Resources:** Set appropriate memory limits based on policy complexity
- [ ] **Validation:** Validate all policies before deployment
- [ ] **Monitoring:** Alert on OPA restart count > 2 in 10 minutes
- [ ] **Testing:** Load test policies before production deployment
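
The validation item can be enforced as a pre-deployment step that runs `stella policy validate --file` over every Rego file. A minimal sketch, assuming policies live as `.rego` files under one directory and that the command exits non-zero for an invalid file; the function name and the `policies` default are hypothetical.

```bash
# Validate every Rego file under a directory; fail if any file is rejected.
# The directory layout is an assumption; file paths containing newlines are not supported.
validate_all_policies() {
  local policy_dir="${1:-policies}"
  local failed=0
  local list file
  list="$(find "$policy_dir" -type f -name '*.rego' | sort)"
  [ -n "$list" ] || { echo "no .rego files under $policy_dir" >&2; return 1; }
  # A heredoc keeps the loop in the current shell so the failure count survives.
  while IFS= read -r file; do
    if ! stella policy validate --file "$file" </dev/null; then
      echo "INVALID: $file" >&2
      failed=$((failed + 1))
    fi
  done <<EOF
$list
EOF
  [ "$failed" -eq 0 ]
}
```

Wired into a deployment pipeline, a non-zero return from `validate_all_policies` would block the release before an invalid policy reaches OPA.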

---

## Related Resources

- **Architecture:** `docs/modules/policy/architecture.md`
- **Related runbooks:** `policy-evaluation-slow.md`, `policy-compilation-failed.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Policy/`
- **OPA documentation:** https://www.openpolicyagent.org/docs/latest/