Files
git.stella-ops.org/docs/operations/runbooks/policy-version-mismatch.md

196 lines
4.2 KiB
Markdown

# Runbook: Policy Engine - Policy Version Conflicts
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | Medium |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.version-consistency` |
---
## Symptoms
- [ ] Policy evaluation returning unexpected results
- [ ] Alert `PolicyVersionMismatch` firing
- [ ] Error: "policy version conflict" or "bundle version mismatch"
- [ ] Different nodes evaluating with different policy versions
- [ ] Inconsistent gate decisions for same artifact
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Inconsistent policy decisions; unpredictable gate results |
| **Data integrity** | Decisions may not match expected policy behavior |
| **SLA impact** | Gate accuracy SLO violated; trust in decisions reduced |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.version-consistency
```
2. **Check policy version across nodes:**
```bash
stella policy version --all-nodes
```
3. **Check active policy version:**
```bash
stella policy active --show-version
```
### Deep diagnosis
1. **Compare versions across instances:**
```bash
stella policy version diff --all-instances
```
Problem if: Different versions on different nodes
2. **Check bundle distribution status:**
```bash
stella policy bundle status --all-nodes
```
3. **Check for failed deployments:**
```bash
stella policy deployments list --status failed --last 24h
```
4. **Check OPA bundle sync:**
```bash
stella policy opa bundle-status
```
---
## Resolution
### Immediate mitigation
1. **Force sync to latest version:**
```bash
stella policy sync --force --all-nodes
```
2. **Pin specific version:**
```bash
stella policy pin --version <version>
stella policy sync --all-nodes
```
3. **Restart policy engines to force reload:**
```bash
stella service restart policy-engine --all-nodes
```
### Root cause fix
**If bundle distribution failed:**
1. Check bundle storage:
```bash
stella policy bundle storage-status
```
2. Rebuild and redistribute bundle:
```bash
stella policy bundle build
stella policy bundle distribute --all-nodes
```
**If node out of sync:**
1. Check specific node status:
```bash
stella policy status --node <node-id>
```
2. Force node resync:
```bash
stella policy sync --node <node-id> --force
```
3. Verify node is receiving updates:
```bash
stella policy bundle check-subscription --node <node-id>
```
**If concurrent deployments caused conflict:**
1. Check deployment history:
```bash
stella policy deployments list --last 1h
```
2. Resolve to single version:
```bash
stella policy resolve-conflict --to-version <version>
```
3. Enable deployment locking:
```bash
stella policy config set deployment.locking true
```
**If OPA bundle polling issue:**
1. Check OPA bundle configuration:
```bash
stella policy opa config show | grep bundle
```
2. Decrease polling interval for faster sync:
```bash
stella policy opa config set bundle.polling.min_delay_seconds 10
stella policy opa config set bundle.polling.max_delay_seconds 30
```
### Verification
```bash
# Verify all nodes on same version
stella policy version --all-nodes
# Test consistent evaluation
stella policy evaluate --test --all-nodes
# Verify bundle status
stella policy bundle status --all-nodes
# Check no version warnings
stella policy logs --filter "version" --level warning --last 30m
```
---
## Prevention
- [ ] **Locking:** Enable deployment locking to prevent concurrent updates
- [ ] **Monitoring:** Alert on version drift between nodes
- [ ] **Sync:** Configure aggressive bundle polling for fast convergence
- [ ] **Testing:** Deploy to staging before production to catch issues
---
## Related Resources
- **Architecture:** `docs/modules/policy/versioning.md`
- **Related runbooks:** `policy-opa-crash.md`, `policy-storage-unavailable.md`
- **Deployment guide:** `docs/operations/policy-deployment.md`