synergy moats product advisory implementations
docs/operations/runbooks/policy-storage-unavailable.md
@@ -0,0 +1,178 @@
# Runbook: Policy Engine - Policy Storage Backend Down

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | Critical |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.storage-health` |

---

## Symptoms

- [ ] Policy operations failing with "storage unavailable"
- [ ] Alert `PolicyStorageUnavailable` firing
- [ ] Error: "failed to connect to policy store" or "database connection refused"
- [ ] Policy updates not persisting
- [ ] OPA unable to load bundles from storage

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Policy updates fail; cached policies may still work |
| **Data integrity** | Policy changes not persisted; risk of inconsistent state |
| **SLA impact** | Policy management blocked; evaluations use cached data |

---

## Diagnosis

### Quick checks

1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.storage-health
```

2. **Check storage connectivity:**
```bash
stella policy storage status
```

3. **Check database health:**
```bash
stella db status --component policy
```
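
During an incident it can help to run the three quick checks in one pass and see at a glance which ones fail. The following is a minimal sketch, not part of the `stella` CLI: the function name `run_quick_checks` is hypothetical, and it assumes each command exits non-zero when its check fails, which this runbook does not confirm.

```bash
#!/usr/bin/env bash
# Sketch: run the quick checks in order and print PASS/FAIL per command.
# Assumption: each `stella` command exits non-zero when its check fails.

run_quick_checks() {
  local failed=0
  local cmd
  # Commands copied verbatim from the quick checks in this runbook.
  local checks=(
    "stella doctor --check check.policy.storage-health"
    "stella policy storage status"
    "stella db status --component policy"
  )
  for cmd in "${checks[@]}"; do
    # Word splitting is intentional: each entry is a plain command line.
    if $cmd </dev/null >/dev/null 2>&1; then
      echo "PASS: $cmd"
    else
      echo "FAIL: $cmd"
      failed=$((failed + 1))
    fi
  done
  # The return code is the number of failed checks (0 means all passed).
  return "$failed"
}
```

Call `run_quick_checks` from a shell where `stella` is on the `PATH`. Command output is discarded here, so rerun any failing command by hand to see its details.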

### Deep diagnosis

1. **Check PostgreSQL connectivity:**
```bash
stella db ping --database policy
```

2. **Check connection pool status:**
```bash
stella db pool-status --database policy
```
Problem if: Pool exhausted, connections timing out

3. **Check storage logs:**
```bash
stella policy logs --filter "storage" --level error --last 30m
```

4. **Check disk space (if local storage):**
```bash
stella policy storage disk-usage
```
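
The deep-diagnosis output is often wanted later for the incident record. The sketch below captures each command's output in its own file under one directory. The function names, file names, and default directory are hypothetical, and it assumes the commands write their findings to stdout or stderr.

```bash
#!/usr/bin/env bash
# Sketch: save deep-diagnosis output to files for the incident record.
# Assumption: the `stella` commands print their findings to stdout/stderr.

capture_to() {
  # Usage: capture_to FILE COMMAND [ARGS...]
  local file="$1"
  shift
  local rc=0
  # A failing command is expected during an outage: record it and carry on.
  "$@" >"$file" 2>&1 || rc=$?
  echo "exit_code=$rc" >>"$file"
}

collect_policy_storage_diagnostics() {
  # Hypothetical default location; pass a directory as $1 to override it.
  local out_dir="${1:-./policy-storage-diag-$(date +%Y%m%dT%H%M%S)}"
  mkdir -p "$out_dir" || return 1

  capture_to "$out_dir/db-ping.txt" stella db ping --database policy
  capture_to "$out_dir/pool-status.txt" stella db pool-status --database policy
  capture_to "$out_dir/storage-logs.txt" \
    stella policy logs --filter "storage" --level error --last 30m
  capture_to "$out_dir/disk-usage.txt" stella policy storage disk-usage

  # Print the directory so the caller can attach it to the incident.
  echo "$out_dir"
}
```

Each file ends with an `exit_code=` line, so a reviewer can tell a failing command from one that printed nothing.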

---

## Resolution

### Immediate mitigation

1. **Enable read-only mode (use cached policies):**
```bash
stella policy config set storage.read_only true
stella policy reload
```

2. **Switch to backup storage:**
```bash
stella policy storage failover --to backup
```

3. **Restart policy service to reconnect:**
```bash
stella service restart policy-engine
```

### Root cause fix

**If database connection issue:**

1. Check database status:
```bash
stella db status --database policy --verbose
```

2. Restart database connection pool:
```bash
stella db pool-restart --database policy
```

3. Check and increase connection limits:
```bash
stella db config set policy.max_connections 50
```

**If disk space exhausted:**

1. Check storage usage:
```bash
stella policy storage disk-usage --verbose
```

2. Clean old policy versions:
```bash
stella policy versions cleanup --older-than 30d
```

3. Increase storage capacity

**If storage corruption:**

1. Verify storage integrity:
```bash
stella policy storage verify
```

2. Restore from backup:
```bash
stella policy storage restore --from-backup latest
```

### Verification

```bash
# Check storage status
stella policy storage status

# Test write operation
stella policy storage test-write

# Test policy update
stella policy update --test

# Verify no errors
stella policy logs --filter "storage" --level error --last 30m
```
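
After a restart or failover, storage may take a short while to report healthy, so the first verification command can fail even though recovery is under way. The sketch below retries the status check before running the write and update tests. The function names and the default attempt count and delay are hypothetical, and it assumes `stella policy storage status` exits 0 only when storage is healthy.

```bash
#!/usr/bin/env bash
# Sketch: wait for storage to recover, then run the write and update tests.
# Assumption: `stella policy storage status` exits 0 only when storage is healthy.

wait_for_policy_storage() {
  local attempts="${1:-10}"  # hypothetical default: 10 tries
  local delay="${2:-6}"      # hypothetical default: 6 seconds between tries
  local i
  for ((i = 1; i <= attempts; i++)); do
    if stella policy storage status >/dev/null 2>&1; then
      echo "storage healthy after $i attempt(s)"
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  echo "storage still unavailable after $attempts attempt(s)" >&2
  return 1
}

verify_policy_storage() {
  wait_for_policy_storage "$@" || return 1
  # Stop at the first failing step so it is clear which one broke.
  stella policy storage test-write || return 1
  stella policy update --test || return 1
}
```

For example, `verify_policy_storage 20 3` would poll for up to about a minute before giving up. The error-log check from the block above is left as a manual step, because this runbook does not say how that command reports an empty result.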

---

## Prevention

- [ ] **Monitoring:** Alert on storage connection failures immediately
- [ ] **Redundancy:** Configure backup storage for failover
- [ ] **Cleanup:** Schedule regular cleanup of old policy versions
- [ ] **Capacity:** Monitor disk usage and plan for growth

---

## Related Resources

- **Architecture:** `docs/modules/policy/storage.md`
- **Related runbooks:** `policy-opa-crash.md`, `postgres-ops.md`
- **Database setup:** `docs/operations/database-configuration.md`