# Runbook: Policy Engine - Policy Storage Backend Down
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | Critical |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.storage-health` |
---
## Symptoms
- [ ] Policy operations failing with "storage unavailable"
- [ ] Alert `PolicyStorageUnavailable` firing
- [ ] Error: "failed to connect to policy store" or "database connection refused"
- [ ] Policy updates not persisting
- [ ] OPA unable to load bundles from storage
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Policy updates fail; cached policies may still work |
| **Data integrity** | Policy changes not persisted; risk of inconsistent state |
| **SLA impact** | Policy management blocked; evaluations use cached data |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.storage-health
```
2. **Check storage connectivity:**
```bash
stella policy storage status
```
3. **Check database health:**
```bash
stella db status --component policy
```
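
The three quick checks above can be run in one pass. The following is a minimal sketch, not part of the `stella` CLI: the helper names `run_check` and `quick_checks` are hypothetical, and it assumes each command exits non-zero when its check fails, which this runbook does not state.

```bash
# Hypothetical helper: run one command quietly and print OK or FAIL for it.
# Assumption: the command exits non-zero when the check fails.
run_check() {
    if "$@" >/dev/null 2>&1; then
        echo "OK   $*"
    else
        echo "FAIL $*"
        return 1
    fi
}

# Hypothetical wrapper: run all three quick checks, even if an earlier one fails.
quick_checks() {
    failed=0
    run_check stella doctor --check check.policy.storage-health || failed=1
    run_check stella policy storage status || failed=1
    run_check stella db status --component policy || failed=1
    # Non-zero when at least one check failed.
    return "$failed"
}
```

Calling `quick_checks` prints one `OK` or `FAIL` line per command and returns non-zero if any of them failed, which shows at a glance which layer to investigate first.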
### Deep diagnosis
1. **Check PostgreSQL connectivity:**
```bash
stella db ping --database policy
```
2. **Check connection pool status:**
```bash
stella db pool-status --database policy
```
Problem if: the pool is exhausted or connections are timing out.
3. **Check storage logs:**
```bash
stella policy logs --filter "storage" --level error --last 30m
```
4. **Check disk space (if local storage):**
```bash
stella policy storage disk-usage
```
---
## Resolution
### Immediate mitigation
1. **Enable read-only mode (use cached policies):**
```bash
stella policy config set storage.read_only true
stella policy reload
```
2. **Switch to backup storage:**
```bash
stella policy storage failover --to backup
```
3. **Restart policy service to reconnect:**
```bash
stella service restart policy-engine
```
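
After a failover or a service restart, storage may take a moment to report healthy. The sketch below polls the status command until it succeeds or the attempts run out. It assumes `stella policy storage status` exits 0 once storage is reachable, which this runbook does not confirm; the function name `wait_for_policy_storage` and its default values are hypothetical.

```bash
# Hypothetical helper: poll storage status with a bounded number of attempts.
# Assumption: `stella policy storage status` exits 0 when storage is reachable.
wait_for_policy_storage() {
    attempts=${1:-12}   # assumed default: 12 tries
    delay=${2:-5}       # assumed default: 5 seconds between tries
    i=1
    while [ "$i" -le "$attempts" ]; do
        if stella policy storage status >/dev/null 2>&1; then
            echo "storage reachable after $i attempt(s)"
            return 0
        fi
        i=$((i + 1))
        # Sleep only when another attempt will follow.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    echo "storage still unavailable after $attempts attempt(s)" >&2
    return 1
}
```

For example, `stella service restart policy-engine && wait_for_policy_storage 12 5` would wait up to roughly a minute before giving up, at which point the root cause steps apply.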
### Root cause fix
**If database connection issue:**
1. Check database status:
```bash
stella db status --database policy --verbose
```
2. Restart database connection pool:
```bash
stella db pool-restart --database policy
```
3. Increase the connection limit if the pool is too small:
```bash
stella db config set policy.max_connections 50
```
**If disk space exhausted:**
1. Check storage usage:
```bash
stella policy storage disk-usage --verbose
```
2. Clean old policy versions:
```bash
stella policy versions cleanup --older-than 30d
```
3. Increase storage capacity if cleanup does not free enough space.
**If storage corruption:**
1. Verify storage integrity:
```bash
stella policy storage verify
```
2. If verification fails, restore from the latest backup:
```bash
stella policy storage restore --from-backup latest
```
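
The two corruption steps are conditional: a restore is only warranted when verification fails. A minimal sketch of that guard follows, assuming `stella policy storage verify` exits non-zero when it finds corruption (not stated in this runbook). The function name `restore_if_corrupt` is hypothetical, and because a restore replaces current data, running the two commands by hand may be preferable in production.

```bash
# Hypothetical guard: restore from backup only when verification fails.
# Assumption: `stella policy storage verify` exits non-zero on corruption.
restore_if_corrupt() {
    if stella policy storage verify; then
        echo "storage verified; skipping restore"
        return 0
    fi
    echo "verification failed; restoring from latest backup"
    # The function's exit status is that of the restore command.
    stella policy storage restore --from-backup latest
}
```

The function returns the exit status of the restore when one is attempted, so a caller can tell a skipped restore from a failed one.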
### Verification
```bash
# Check storage status
stella policy storage status
# Test write operation
stella policy storage test-write
# Test policy update
stella policy update --test
# Verify no errors
stella policy logs --filter "storage" --level error --last 30m
```
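
The verification commands can be chained so that the first failure stops the run. This is a sketch under two assumptions that the runbook does not state: each command exits non-zero on failure, and the log query prints nothing when no matching entries exist. The function name `verify_policy_storage` is hypothetical. Errors logged during the outage itself remain inside the 30-minute window, so the final check may only pass some time after recovery.

```bash
# Hypothetical wrapper: run the verification steps and stop at the first failure.
# Assumption: each stella command exits non-zero when it fails.
verify_policy_storage() {
    stella policy storage status     || { echo "FAILED: storage status" >&2; return 1; }
    stella policy storage test-write || { echo "FAILED: test write" >&2; return 1; }
    stella policy update --test      || { echo "FAILED: policy update test" >&2; return 1; }
    # Assumption: the log query prints nothing when there are no matching errors.
    errors=$(stella policy logs --filter "storage" --level error --last 30m) || {
        echo "FAILED: could not read storage logs" >&2
        return 1
    }
    if [ -n "$errors" ]; then
        echo "FAILED: storage errors still logged in the last 30m" >&2
        return 1
    fi
    echo "policy storage verified"
}
```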
---
## Prevention
- [ ] **Monitoring:** Alert on storage connection failures immediately
- [ ] **Redundancy:** Configure backup storage for failover
- [ ] **Cleanup:** Schedule regular cleanup of old policy versions
- [ ] **Capacity:** Monitor disk usage and plan for growth
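
For the capacity item, a threshold check on the filesystem that holds policy storage can feed an existing alerting job. The sketch below uses the standard `df -P` output rather than any `stella` command. The function names, the default threshold of 80, and the path in the closing comment are hypothetical; substitute the real data directory for your deployment.

```bash
# Print the used-space percentage (without the % sign) for the filesystem holding a path.
disk_used_percent() {
    # -P selects the POSIX output format; column 5 of the data line is the capacity, e.g. 42%.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical check: return 1 when usage is at or above the threshold, 2 if usage is unknown.
check_disk_threshold() {
    path=$1
    threshold=${2:-80}   # assumed default threshold in percent
    used=$(disk_used_percent "$path")
    [ -n "$used" ] || return 2
    if [ "$used" -ge "$threshold" ]; then
        echo "WARN: $path is ${used}% full (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: $path is ${used}% full"
}

# Example call (hypothetical path): check_disk_threshold /var/lib/policy-store 80
```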
---
## Related Resources
- **Architecture:** `docs/modules/policy/storage.md`
- **Related runbooks:** `policy-opa-crash.md`, `postgres-ops.md`
- **Database setup:** `docs/operations/database-configuration.md`