# Runbook: Policy Engine - Policy Version Conflicts

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | Medium |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.version-consistency` |

---

## Symptoms

- [ ] Policy evaluations return unexpected results
- [ ] Alert `PolicyVersionMismatch` is firing
- [ ] Errors such as "policy version conflict" or "bundle version mismatch"
- [ ] Different nodes evaluate with different policy versions
- [ ] Gate decisions for the same artifact are inconsistent

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Inconsistent policy decisions; unpredictable gate results |
| **Data integrity** | Decisions may not match expected policy behavior |
| **SLA impact** | Gate accuracy SLO violated; trust in decisions reduced |

---

## Diagnosis

### Quick checks

1. **Run the Doctor check:**
```bash
stella doctor --check check.policy.version-consistency
```

2. **Check the policy version on each node:**
```bash
stella policy version --all-nodes
```

3. **Check the active policy version:**
```bash
stella policy active --show-version
```

### Deep diagnosis

1. **Compare versions across instances:**
```bash
stella policy version diff --all-instances
```
Problem if: different nodes report different policy versions.

2. **Check bundle distribution status:**
```bash
stella policy bundle status --all-nodes
```

3. **Check for failed deployments in the last 24 hours:**
```bash
stella policy deployments list --status failed --last 24h
```

4. **Check OPA bundle sync status:**
```bash
stella policy opa bundle-status
```
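The problem condition in step 1 reduces to "more than one distinct version is reported". As a minimal sketch, assuming the version report can be reduced to one `<node-id> <version>` pair per line (this runbook does not document the output format of `stella policy version --all-nodes` or `stella policy version diff`, so that input format is an assumption), the hypothetical helper `check_version_drift` groups node IDs by version and signals drift through its exit status:

```bash
#!/usr/bin/env bash
# Hypothetical helper: detect policy version drift across nodes.
# ASSUMPTION: stdin carries one "<node-id> <version>" pair per line; this
# format is illustrative, not the documented output of any stella command.
# Exit status: 0 = all nodes agree, 1 = drift, 2 = no usable input.
check_version_drift() {
  awk '
    BEGIN { n = 0 }
    NF >= 2 {
      if (!($2 in nodes)) n++          # count each distinct version once
      nodes[$2] = nodes[$2] " " $1     # append the node ID under its version
    }
    END {
      if (n == 0) { print "no version data"; exit 2 }
      if (n == 1) {
        for (v in nodes) print "consistent: all nodes on " v
        exit 0
      }
      print "DRIFT: " n " distinct policy versions"
      for (v in nodes) print "  " v ":" nodes[v]
      exit 1
    }'
}

# Example with made-up input (exits 1 and lists both versions):
#   printf 'node-a 2026.1\nnode-b 2026.2\n' | check_version_drift
```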

---

## Resolution

### Immediate mitigation

1. **Force a sync to the latest version:**
```bash
stella policy sync --force --all-nodes
```

2. **Pin a specific version and sync all nodes:**
```bash
stella policy pin --version <version>
stella policy sync --all-nodes
```

3. **Restart the policy engines to force a reload:**
```bash
stella service restart policy-engine --all-nodes
```

### Root cause fix

**If bundle distribution failed:**

1. Check bundle storage:
```bash
stella policy bundle storage-status
```

2. Rebuild and redistribute the bundle:
```bash
stella policy bundle build
stella policy bundle distribute --all-nodes
```

**If a node is out of sync:**

1. Check the node's status:
```bash
stella policy status --node <node-id>
```

2. Force a resync of the node:
```bash
stella policy sync --node <node-id> --force
```

3. Verify that the node is receiving updates:
```bash
stella policy bundle check-subscription --node <node-id>
```
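When only a few nodes are affected and a cluster-wide `--all-nodes` sync is not wanted, the per-node steps above can be applied in a loop. This is a sketch: `resync_nodes` is a hypothetical helper that takes node IDs as arguments, it calls only the two `stella` subcommands shown in this section, and it assumes `stella` exits non-zero when a command fails, which this runbook does not state.

```bash
#!/usr/bin/env bash
# Hypothetical helper: force-resync the given nodes one at a time, then check
# each node's bundle subscription. Node IDs are passed as arguments.
# ASSUMPTION: stella exits non-zero when a command fails (not confirmed here).
# Usage: resync_nodes <node-id> [<node-id> ...]
resync_nodes() {
  local node
  local -a failed=()
  for node in "$@"; do
    echo "==> resyncing ${node}"
    if ! stella policy sync --node "${node}" --force; then
      failed+=("${node}")
      continue                          # skip the subscription check on failure
    fi
    stella policy bundle check-subscription --node "${node}" || failed+=("${node}")
  done
  if ((${#failed[@]} > 0)); then
    echo "resync failed for: ${failed[*]}" >&2
    return 1
  fi
}
```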

**If concurrent deployments caused the conflict:**

1. Check the deployment history for the last hour:
```bash
stella policy deployments list --last 1h
```

2. Resolve to a single version:
```bash
stella policy resolve-conflict --to-version <version>
```

3. Enable deployment locking to prevent further concurrent deployments:
```bash
stella policy config set deployment.locking true
```

**If OPA bundle polling is the cause:**

1. Check the OPA bundle configuration:
```bash
stella policy opa config show | grep bundle
```

2. Decrease the polling interval so nodes converge faster (at the cost of more frequent requests to the bundle source):
```bash
stella policy opa config set bundle.polling.min_delay_seconds 10
stella policy opa config set bundle.polling.max_delay_seconds 30
```
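Shortening the interval speeds convergence because, in OPA's periodic polling mode, OPA waits between `min_delay_seconds` and `max_delay_seconds` before each bundle download. After a new bundle is published, a node can therefore keep evaluating the previous one for up to about `max_delay_seconds` plus the time needed to download and activate the new bundle. The bash sketch below models that bound under stated simplifications: the helper names and the activation allowance are hypothetical, and drawing the delay uniformly with `$RANDOM` approximates, rather than reproduces, OPA's scheduler.

```bash
#!/usr/bin/env bash
# Simplified model of OPA bundle polling bounds (not OPA's actual scheduler).

# next_poll_delay MIN MAX -> prints a delay in [MIN, MAX] seconds, drawn
# uniformly at random with $RANDOM as an approximation.
next_poll_delay() {
  local min=$1 max=$2
  if ((min > max)); then
    echo "min_delay_seconds must not exceed max_delay_seconds" >&2
    return 1
  fi
  echo $((min + RANDOM % (max - min + 1)))
}

# worst_case_staleness MAX ACTIVATE -> upper bound, in seconds, on how long a
# node may keep the previous bundle after publication: a node that polled just
# before publication waits up to MAX, then needs ACTIVATE (a hypothetical
# allowance for download and activation) to load the new bundle.
worst_case_staleness() {
  local max=$1 activate=$2
  echo $((max + activate))
}

# With the values from step 2 and an assumed 5-second activation allowance:
#   next_poll_delay 10 30        # some value between 10 and 30
#   worst_case_staleness 30 5    # 35
```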

### Verification

```bash
# Verify that all nodes report the same policy version
stella policy version --all-nodes

# Run a test evaluation on every node and confirm consistent results
stella policy evaluate --test --all-nodes

# Verify bundle status
stella policy bundle status --all-nodes

# Confirm no version-related warnings were logged in the last 30 minutes
stella policy logs --filter "version" --level warning --last 30m
```
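Immediately after a forced sync or restart, nodes may still be converging, so one failing check is not conclusive on its own. A minimal retry loop that re-runs a check until it succeeds or a deadline passes is sketched here; `wait_for_convergence` is a hypothetical helper, and using it with the commands above assumes they exit non-zero while the nodes disagree, which this runbook does not specify.

```bash
#!/usr/bin/env bash
# Hypothetical helper: re-run a check until it succeeds or a deadline passes.
# Usage: wait_for_convergence TIMEOUT_SECONDS INTERVAL_SECONDS CMD [ARGS...]
# ASSUMPTION: CMD exits 0 once the nodes have converged.
wait_for_convergence() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$((SECONDS + timeout))   # SECONDS: bash's elapsed-time counter
  until "$@"; do
    if ((SECONDS >= deadline)); then
      echo "no convergence after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "${interval}"
  done
}

# Example (assumes a non-zero exit while versions differ):
#   wait_for_convergence 120 10 stella policy version --all-nodes
```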

---

## Prevention

- [ ] **Locking:** Enable deployment locking to prevent concurrent updates
- [ ] **Monitoring:** Alert on version drift between nodes
- [ ] **Sync:** Configure short bundle polling intervals for fast convergence
- [ ] **Testing:** Deploy policy changes to staging before production to catch version issues early

---

## Related Resources

- **Architecture:** `docs/modules/policy/versioning.md`
- **Related runbooks:** `policy-opa-crash.md`, `policy-storage-unavailable.md`
- **Deployment guide:** `docs/operations/policy-deployment.md`