Runbook: Policy Engine - Policy Storage Backend Down

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-003 - Policy Engine Runbooks

Metadata

| Field | Value |
|-------|-------|
| Component | Policy Engine |
| Severity | Critical |
| On-call scope | Platform team |
| Last updated | 2026-01-17 |
| Doctor check | check.policy.storage-health |

Symptoms

  • Policy operations failing with "storage unavailable"
  • Alert PolicyStorageUnavailable firing
  • Error: "failed to connect to policy store" or "database connection refused"
  • Policy updates not persisting
  • OPA unable to load bundles from storage

Impact

| Impact Type | Description |
|-------------|-------------|
| User-facing | Policy updates fail; cached policies may still work |
| Data integrity | Policy changes not persisted; risk of inconsistent state |
| SLA impact | Policy management blocked; evaluations use cached data |

Diagnosis

Quick checks

  1. Check Doctor diagnostics:

    stella doctor --check check.policy.storage-health
    
  2. Check storage connectivity:

    stella policy storage status
    
  3. Check database health:

    stella db status --component policy
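
The three quick checks can be run as one pass so that a single summary shows which layer is failing. The sketch below is a minimal wrapper, assuming each `stella` command exits non-zero when its check fails (the runbook does not state the exit-code behavior); `run_quick_checks` is a hypothetical helper name.

```shell
# Run the three quick checks in order and report which ones fail.
# Assumption: each stella command exits non-zero when its check fails.
run_quick_checks() {
    failed=0
    # Each entry is one command from the quick-check list above.
    for check in \
        "doctor --check check.policy.storage-health" \
        "policy storage status" \
        "db status --component policy"
    do
        # $check is intentionally unquoted so it splits into arguments.
        if stella $check; then
            echo "OK:   stella $check"
        else
            echo "FAIL: stella $check"
            failed=$((failed + 1))
        fi
    done
    echo "$failed of 3 quick checks failed"
    # Return success only when every check passed.
    [ "$failed" -eq 0 ]
}
```

Calling `run_quick_checks` continues past a failing check on purpose: knowing that the Doctor check and the storage status both fail while the database status passes narrows the search before deep diagnosis.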
    

Deep diagnosis

  1. Check PostgreSQL connectivity:

    stella db ping --database policy
    
  2. Check connection pool status:

    stella db pool-status --database policy
    

    Problem if: the pool is exhausted or connections are timing out

  3. Check storage logs:

    stella policy logs --filter "storage" --level error --last 30m
    
  4. Check disk space (if local storage):

    stella policy storage disk-usage
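
During an incident it helps to capture the output of the deep-diagnosis commands before any mitigation changes the state. The following is a sketch only: the helper names, the default directory `policy-storage-diag`, and the file names are hypothetical, and it assumes the four commands are read-only.

```shell
# Run one stella command, save its output, and record its exit code.
save_stella_output() {
    name=$1
    shift
    stella "$@" > "$DIAG_DIR/$name.txt" 2>&1
    echo "$name exit=$?" >> "$DIAG_DIR/summary.txt"
}

# Capture all four deep-diagnosis commands into one directory.
# Assumption: these commands only read state and are safe mid-incident.
capture_policy_storage_diagnostics() {
    DIAG_DIR=${1:-policy-storage-diag}   # hypothetical default directory
    mkdir -p "$DIAG_DIR" || return 1
    : > "$DIAG_DIR/summary.txt"          # start with an empty summary

    # A failing command does not stop the run; its output is still evidence.
    save_stella_output db-ping      db ping --database policy
    save_stella_output pool-status  db pool-status --database policy
    save_stella_output storage-logs policy logs --filter "storage" --level error --last 30m
    save_stella_output disk-usage   policy storage disk-usage
    echo "Diagnostics written to $DIAG_DIR"
}
```

The `summary.txt` file lists one exit code per command, which gives a quick view of which checks failed without opening each output file.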
    

Resolution

Immediate mitigation

  1. Enable read-only mode (use cached policies):

    stella policy config set storage.read_only true
    stella policy reload
    
  2. Switch to backup storage:

    stella policy storage failover --to backup
    
  3. Restart policy service to reconnect:

    stella service restart policy-engine
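
The read-only step consists of two commands that only make sense together: a reload after a rejected config change would leave the engine in its previous mode. A minimal sketch of that ordering, assuming `stella policy config set` exits non-zero when the change is rejected (`enable_policy_read_only` is a hypothetical helper name):

```shell
# Switch the policy engine to read-only mode, then reload.
# Assumption: `stella policy config set` exits non-zero if the change is rejected.
enable_policy_read_only() {
    if ! stella policy config set storage.read_only true; then
        echo "could not set storage.read_only; skipping reload" >&2
        return 1
    fi
    if ! stella policy reload; then
        echo "config set, but reload failed; read-only mode may not be active" >&2
        return 2
    fi
    echo "read-only mode requested; reload completed"
}
```

The distinct return codes (1 for the config step, 2 for the reload) let a calling script decide whether to fall back to the failover or restart steps.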
    

Root cause fix

If database connection issue:

  1. Check database status:

    stella db status --database policy --verbose
    
  2. Restart database connection pool:

    stella db pool-restart --database policy
    
  3. Increase the connection limit if the pool is being exhausted:

    stella db config set policy.max_connections 50
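
After a pool restart, connectivity may take a moment to return, so a single ping can give a false negative. The sketch below restarts the pool and then pings with a bounded retry. It assumes `stella db ping` exits 0 once the policy database is reachable; the retry count and delay are hypothetical defaults, as is the helper name.

```shell
# Restart the connection pool, then ping until the database answers.
# Assumption: `stella db ping` exits 0 once the policy database is reachable.
restart_pool_and_wait() {
    attempts=${1:-5}   # hypothetical retry budget
    delay=${2:-2}      # hypothetical seconds between pings
    stella db pool-restart --database policy || return 1

    i=1
    while [ "$i" -le "$attempts" ]; do
        if stella db ping --database policy; then
            echo "policy database reachable after $i ping(s)"
            return 0
        fi
        # Wait between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "policy database still unreachable after $attempts pings" >&2
    return 1
}
```

If the function returns non-zero, the pool restart alone did not restore connectivity and the connection-limit step or the database itself is the next place to look.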
    

If disk space exhausted:

  1. Check storage usage:

    stella policy storage disk-usage --verbose
    
  2. Clean old policy versions:

    stella policy versions cleanup --older-than 30d
    
  3. Increase storage capacity
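
To judge whether cleanup freed enough space before committing to a capacity increase, usage can be printed before and after the cleanup. This is a sketch under the assumption that the output format of `disk-usage` is not known, so the output is only printed and never parsed; `cleanup_and_compare_usage` is a hypothetical helper name.

```shell
# Print disk usage, clean up old policy versions, then print usage again.
# Assumption: disk-usage output format is unknown, so it is shown, not parsed.
cleanup_and_compare_usage() {
    retention=${1:-30d}   # 30d is the value used in the cleanup step above
    echo "--- usage before cleanup ---"
    stella policy storage disk-usage --verbose
    # Stop here if cleanup fails, so the "after" figure is never misleading.
    stella policy versions cleanup --older-than "$retention" || return 1
    echo "--- usage after cleanup ---"
    stella policy storage disk-usage --verbose
}
```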

If storage corruption:

  1. Verify storage integrity:

    stella policy storage verify
    
  2. Restore from backup:

    stella policy storage restore --from-backup latest
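
A restore overwrites current storage, so it is worth gating it on the integrity check rather than running both steps unconditionally. A minimal sketch, assuming `stella policy storage verify` exits non-zero when it finds corruption (`verify_or_restore` is a hypothetical helper name):

```shell
# Restore from backup only when the integrity check fails.
# Assumption: `stella policy storage verify` exits non-zero on corruption.
verify_or_restore() {
    if stella policy storage verify; then
        echo "storage verified; no restore needed"
        return 0
    fi
    echo "integrity check failed; restoring from latest backup" >&2
    stella policy storage restore --from-backup latest || return 1
    # Verify again so a damaged backup is caught immediately.
    stella policy storage verify
}
```

The second `verify` call makes the function's exit status reflect the state after the restore, not just whether the restore command itself ran.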
    

Verification

    # Check storage status
    stella policy storage status

    # Test write operation
    stella policy storage test-write

    # Test policy update
    stella policy update --test

    # Verify no errors
    stella policy logs --filter "storage" --level error --last 30m
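
The verification commands can be chained so that the first failing step is named explicitly. The sketch below rests on two assumptions that the runbook does not confirm: each command exits non-zero on failure, and the log query prints nothing when no matching errors exist. `verify_policy_storage_recovery` is a hypothetical helper name.

```shell
# Run the verification steps in order and stop at the first failure.
# Assumptions: each command exits non-zero on failure, and the log query
# prints nothing when there are no matching errors.
verify_policy_storage_recovery() {
    stella policy storage status     || { echo "status check failed" >&2; return 1; }
    stella policy storage test-write || { echo "test write failed" >&2; return 1; }
    stella policy update --test      || { echo "test policy update failed" >&2; return 1; }

    # Any output from the error-level log query is treated as a failure.
    errors=$(stella policy logs --filter "storage" --level error --last 30m)
    if [ -n "$errors" ]; then
        echo "storage errors logged in the last 30 minutes:" >&2
        echo "$errors" >&2
        return 1
    fi
    echo "policy storage recovered"
}
```

Errors from the outage itself can remain inside the 30-minute window, so a failure on the last step shortly after recovery may only mean the window has not yet rolled past the incident. Comparing the timestamps of the reported errors against the recovery time resolves that.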

Prevention

  • Monitoring: Alert on storage connection failures immediately
  • Redundancy: Configure backup storage for failover
  • Cleanup: Schedule regular cleanup of old policy versions
  • Capacity: Monitor disk usage and plan for growth
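
As an illustration of the scheduled cleanup item, the wrapper below could be called from a scheduler such as cron. It is a sketch only: the function name, the `POLICY_RETENTION` environment variable, the script path, and the schedule in the comment are all hypothetical, and the 30d default simply mirrors the cleanup command used earlier in this runbook.

```shell
# Wrapper intended for a scheduler such as cron.
# Hypothetical crontab entry (script path and schedule are assumptions):
#   0 3 * * 0  /usr/local/bin/policy-cleanup.sh >> /var/log/policy-cleanup.log 2>&1
scheduled_policy_cleanup() {
    retention=${POLICY_RETENTION:-30d}   # hypothetical environment variable
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) starting cleanup, retention=$retention"
    if stella policy versions cleanup --older-than "$retention"; then
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) cleanup finished"
    else
        # Report failure on stderr and via exit status so the scheduler can alert.
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) cleanup FAILED" >&2
        return 1
    fi
}
```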

Related resources

  • Architecture: docs/modules/policy/storage.md
  • Related runbooks: policy-opa-crash.md, postgres-ops.md
  • Database setup: docs/operations/database-configuration.md