Runbook: Policy Engine - Policy Storage Backend Down
Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-003 - Policy Engine Runbooks
Metadata
| Field | Value |
|---|---|
| Component | Policy Engine |
| Severity | Critical |
| On-call scope | Platform team |
| Last updated | 2026-01-17 |
| Doctor check | check.policy.storage-health |
Symptoms
- Policy operations failing with "storage unavailable"
- Alert PolicyStorageUnavailable firing
- Error: "failed to connect to policy store" or "database connection refused"
- Policy updates not persisting
- OPA unable to load bundles from storage
Impact
| Impact Type | Description |
|---|---|
| User-facing | Policy updates fail; cached policies may still work |
| Data integrity | Policy changes not persisted; risk of inconsistent state |
| SLA impact | Policy management blocked; evaluations use cached data |
Diagnosis
Quick checks
- Check Doctor diagnostics:
    stella doctor --check check.policy.storage-health
- Check storage connectivity:
    stella policy storage status
- Check database health:
    stella db status --component policy
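The three quick checks can be run in one pass during triage. The following is a minimal sketch, assuming each stella command exits non-zero when its check fails (the runbook does not state the exit codes); the helper names run_check and quick_checks are hypothetical.

```shell
#!/bin/sh
# Sketch: run the quick checks in order and summarise the results.
# Assumption: each stella command exits non-zero when the check fails.

run_check() {
    # $1 is a human-readable label; the remaining arguments are the command.
    label=$1
    shift
    if "$@"; then
        echo "PASS: $label"
        return 0
    fi
    echo "FAIL: $label"
    return 1
}

quick_checks() {
    failures=0
    run_check "doctor diagnostics" \
        stella doctor --check check.policy.storage-health || failures=$((failures + 1))
    run_check "storage connectivity" \
        stella policy storage status || failures=$((failures + 1))
    run_check "database health" \
        stella db status --component policy || failures=$((failures + 1))
    echo "failed checks: $failures"
    # Succeed only when every check passed.
    [ "$failures" -eq 0 ]
}
```

Calling quick_checks prints the output of each command followed by a PASS or FAIL line, so a single failing layer (Doctor, storage, or database) stands out before moving on to deep diagnosis.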
Deep diagnosis
- Check PostgreSQL connectivity:
    stella db ping --database policy
- Check connection pool status:
    stella db pool-status --database policy
  Problem if: Pool exhausted, connections timing out
- Check storage logs:
    stella policy logs --filter "storage" --level error --last 30m
- Check disk space (if local storage):
    stella policy storage disk-usage
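It can help to capture the output of the deep-diagnosis commands in one file before applying any mitigation, so the evidence survives a restart. A minimal sketch follows; the function name collect_storage_diagnostics and the default file name are hypothetical, and only the commands listed above are used.

```shell
#!/bin/sh
# Sketch: save the deep-diagnosis output to a single file for the incident record.
# The output path is a hypothetical default; pass another path as $1.

collect_storage_diagnostics() {
    out=${1:-policy-storage-diag.txt}
    {
        echo "== db ping =="
        stella db ping --database policy 2>&1
        echo "== pool status =="
        stella db pool-status --database policy 2>&1
        echo "== storage error logs (last 30m) =="
        stella policy logs --filter "storage" --level error --last 30m 2>&1
        echo "== disk usage =="
        stella policy storage disk-usage 2>&1
    } > "$out"
    echo "diagnostics written to $out"
}
```

Each section is written even if the preceding command fails, so a refused connection appears in the file as an error message rather than as a missing section.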
Resolution
Immediate mitigation
- Enable read-only mode (use cached policies):
    stella policy config set storage.read_only true
    stella policy reload
- Switch to backup storage:
    stella policy storage failover --to backup
- Restart policy service to reconnect:
    stella service restart policy-engine
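The mitigation steps above could be chained so that each later step runs only if storage is still unreachable. This is one possible ordering, not a prescribed procedure: it assumes that stella policy storage status exits 0 once storage is reachable, which the runbook does not confirm, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: staged mitigation that stops as soon as storage responds again.
# Assumption: `stella policy storage status` exits 0 when storage is reachable.

storage_healthy() {
    stella policy storage status >/dev/null 2>&1
}

mitigate_storage_outage() {
    # Step 1: serve cached policies while storage is down.
    stella policy config set storage.read_only true || return 1
    stella policy reload || return 1
    echo "read-only mode enabled"

    # Step 2: try the backup store.
    if stella policy storage failover --to backup && storage_healthy; then
        echo "recovered via failover"
        return 0
    fi

    # Step 3: restart the service so it reconnects.
    if stella service restart policy-engine && storage_healthy; then
        echo "recovered via restart"
        return 0
    fi

    echo "storage still unavailable; continue with root cause fix"
    return 1
}
```

The sketch leaves read-only mode enabled in every outcome. Turning it off again once storage is confirmed healthy remains a deliberate operator decision.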
Root cause fix
If database connection issue:
- Check database status:
    stella db status --database policy --verbose
- Restart database connection pool:
    stella db pool-restart --database policy
- Check and increase connection limits:
    stella db config set policy.max_connections 50
If disk space exhausted:
- Check storage usage:
    stella policy storage disk-usage --verbose
- Clean old policy versions:
    stella policy versions cleanup --older-than 30d
- Increase storage capacity
If storage corruption:
- Verify storage integrity:
    stella policy storage verify
- Restore from backup:
    stella policy storage restore --from-backup latest
Verification
# Check storage status
stella policy storage status
# Test write operation
stella policy storage test-write
# Test policy update
stella policy update --test
# Verify no errors
stella policy logs --filter "storage" --level error --last 30m
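The verification commands can be wrapped so that the first failing step is reported and later steps are skipped. A minimal sketch, under two assumptions not stated in the runbook: the commands exit non-zero on failure, and the logs command prints nothing when no entries match. The function name verify_storage_recovery is hypothetical.

```shell
#!/bin/sh
# Sketch: run the verification steps in order and stop at the first failure.
# Assumptions: commands exit non-zero on failure; the logs command prints
# nothing when there are no matching entries.

verify_storage_recovery() {
    # $1 is the log window; 30m matches the runbook.
    window=${1:-30m}

    stella policy storage status || { echo "storage status check failed"; return 1; }
    stella policy storage test-write || { echo "test write failed"; return 1; }
    stella policy update --test || { echo "test policy update failed"; return 1; }

    errors=$(stella policy logs --filter "storage" --level error --last "$window")
    if [ -n "$errors" ]; then
        echo "storage errors still present in the last $window:"
        echo "$errors"
        return 1
    fi
    echo "storage verified"
}
```

Right after recovery, a 30-minute window will usually still contain errors from the outage itself, so a failure of the last step may only mean that the window has not yet rolled past the incident.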
Prevention
- Monitoring: Alert on storage connection failures immediately
- Redundancy: Configure backup storage for failover
- Cleanup: Schedule regular cleanup of old policy versions
- Capacity: Monitor disk usage and plan for growth
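For the capacity item, a simple threshold check can run from a scheduler. The sketch below uses the standard df utility rather than stella policy storage disk-usage, whose output format is not shown in this runbook. The storage path passed to the function and the 80% default threshold are assumptions to adapt locally.

```shell
#!/bin/sh
# Sketch: warn when the filesystem holding the policy store passes a threshold.
# The path argument and the 80% default are assumptions, not runbook values.

# Print the used-space percentage (digits only) of the filesystem holding $1.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Exit 0 below the threshold, 1 at or above it, 2 if usage cannot be read.
check_disk_usage() {
    path=$1
    threshold=${2:-80}
    used=$(disk_used_pct "$path")
    if [ -z "$used" ]; then
        echo "cannot read disk usage for $path"
        return 2
    fi
    if [ "$used" -ge "$threshold" ]; then
        echo "WARN: $path is ${used}% full (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: $path is ${used}% full"
}
```

The distinct exit codes let an alerting wrapper tell "disk filling up" apart from "check itself is broken", which matters when the check runs unattended.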
Related Resources
- Architecture: docs/modules/policy/storage.md
- Related runbooks: policy-opa-crash.md, postgres-ops.md
- Database setup: docs/operations/database-configuration.md