# Policy Pipeline Playbook

Scope: policy compile → simulation → approval → promotion path.

## Dashboards
- Grafana: import `ops/devops/observability/grafana/policy-pipeline.json` (datasource `Prometheus`).
- Key tiles: Compile p99, Simulation Queue Depth, Approval p95, Promotion Success Rate, Promotion Outcomes.

## Alerts (Prometheus)
- Rules: `ops/devops/observability/policy-alerts.yaml`
  - `PolicyCompileLatencyP99High` (p99 > 5s for 10m)
  - `PolicySimulationQueueBacklog` (queue depth > 100 for 10m)
  - `PolicyApprovalLatencyHigh` (p95 > 30s for 15m)
  - `PolicyPromotionFailureRate` (failures > 20% over 15m)
  - `PolicyPromotionStall` (no successes while queue non-empty for 10m)
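
The two promotion conditions can be sketched in Python to make the thresholds concrete. This is an illustration only, not the contents of `policy-alerts.yaml`: the function names are hypothetical, the inputs are assumed to be counter increases over the alert window, and the `for` durations are not modelled.

```python
def promotion_failure_ratio(successes: float, failures: float) -> float:
    """Share of promotions that failed in the window (0.0 when there were none)."""
    total = successes + failures
    return failures / total if total > 0 else 0.0


def failure_rate_breached(successes: float, failures: float,
                          threshold: float = 0.20) -> bool:
    """PolicyPromotionFailureRate condition: failures strictly above 20%."""
    return promotion_failure_ratio(successes, failures) > threshold


def promotion_stalled(successes: float, queue_depth: float) -> bool:
    """PolicyPromotionStall condition: no successes while work is still queued."""
    return successes == 0 and queue_depth > 0
```

Note that exactly 20% failures does not breach the threshold in this sketch, because the comparison is strict.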

## Runbook
1. **Compile latency alert**
   - Check build nodes for CPU throttling or saturation; verify the cache hit rate for the policy engine.
   - Do a rolling restart of a single runner; if the alert persists, scale policy compile workers (+1) or purge the stale cache.
2. **Simulation backlog**
   - Inspect queue per stage (panel "Queue Depth by Stage").
   - If the backlog is confined to one stage, increase concurrency for that stage or drain stuck items; otherwise, add workers.
3. **Approval latency high**
   - Look for blocked approvals (UI/API outages). Re-run the approval service health check; if it fails, fail over to the standby.
4. **Promotion failure rate/stall**
   - Pull recent logs for the promotion job; group failures by reason (policy validation vs. target registry).
   - If the failures are registry errors, pause promotions and file an incident with the registry owner; if they are policy validation errors, revert the latest policy change or apply an override to unblock critical tenants.
5. **Verification**
   - After mitigation, confirm that the promotion success rate gauge recovers to above 95% and that queues drain to baseline (< 10).
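
Steps 2 and 5 are simple decision rules, sketched below in Python. This is a minimal sketch under assumptions: the function names and stage names are hypothetical, and a stage is treated as backed up when its depth is at or above the baseline of 10, which the runbook does not define explicitly.

```python
from typing import Mapping

BASELINE_DEPTH = 10      # step 5: queues should drain below this
MIN_SUCCESS_RATE = 0.95  # step 5: promotion success rate should exceed this


def backlog_action(depth_by_stage: Mapping[str, int],
                   baseline: int = BASELINE_DEPTH) -> str:
    """Step 2: choose a mitigation based on how many stages are backed up."""
    # Assumption: "backed up" means depth at or above the baseline.
    backed_up = [s for s, depth in depth_by_stage.items() if depth >= baseline]
    if not backed_up:
        return "no backlog"
    if len(backed_up) == 1:
        return f"raise concurrency or drain stuck items in {backed_up[0]}"
    return "add workers"


def mitigation_verified(success_rate: float,
                        depth_by_stage: Mapping[str, int]) -> bool:
    """Step 5: success rate above 95% and every queue below baseline."""
    queues_ok = all(d < BASELINE_DEPTH for d in depth_by_stage.values())
    return success_rate > MIN_SUCCESS_RATE and queues_ok
```

For example, `backlog_action({"parse": 140, "evaluate": 2})` points at the single backed-up stage, while a backlog spread across both stages returns `"add workers"`.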

## Escalation
- Primary: Policy On-Call (week N roster).
- Secondary: DevOps Guild (release).
- Page if two critical alerts fire concurrently or any critical alert lasts > 30m.
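
The paging rule can be written as a small predicate. The sketch below is hypothetical: `FiringAlert` and `should_page` are illustrative names, and it assumes each alert carries a `severity` value of `"critical"` where applicable, which this playbook does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class FiringAlert:
    name: str
    severity: str          # assumed label; "critical" triggers the paging rule
    started_at: datetime   # when the alert began firing


def should_page(alerts: Sequence[FiringAlert], now: datetime,
                max_duration: timedelta = timedelta(minutes=30)) -> bool:
    """Page on two concurrent critical alerts, or one lasting over 30 minutes."""
    critical = [a for a in alerts if a.severity == "critical"]
    if len(critical) >= 2:
        return True
    return any(now - a.started_at > max_duration for a in critical)
```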

## Notes
- Metrics assumed available: `policy_compile_duration_seconds_bucket`, `policy_simulation_queue_depth`, `policy_approval_latency_seconds_bucket`, `policy_promotion_outcomes_total{outcome=*}`.
- Keep alert thresholds stable unless the load profile changes; adjust them in Git with approval from the Policy and DevOps leads.
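
The p99 and p95 tiles are derived from the `_bucket` histogram metrics. As a rough illustration of how a quantile is estimated from cumulative buckets, the sketch below uses linear interpolation within the target bucket, similar in spirit to PromQL's `histogram_quantile`. The function name and the sample bucket values are made up for the example.

```python
import math
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The pairs should include the +Inf bucket, e.g. (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up compile-duration buckets (seconds, cumulative counts).
sample = [(1.0, 50), (2.5, 90), (5.0, 98), (10.0, 100), (math.inf, 100)]
p99 = estimate_quantile(0.99, sample)
```

With these sample buckets the estimate lands between 5s and 10s, which would exceed the 5s threshold of `PolicyCompileLatencyP99High`. The estimate is only as precise as the bucket boundaries allow.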