# Policy Pipeline Playbook

Scope: policy compile → simulation → approval → promotion path.

## Dashboards

- Grafana: import `ops/devops/observability/grafana/policy-pipeline.json` (datasource `Prometheus`).
- Key tiles: Compile p99, Simulation Queue Depth, Approval p95, Promotion Success Rate, Promotion Outcomes.

## Alerts (Prometheus)

- Rules: `ops/devops/observability/policy-alerts.yaml`
  - `PolicyCompileLatencyP99High` (p99 > 5s for 10m)
  - `PolicySimulationQueueBacklog` (queue depth > 100 for 10m)
  - `PolicyApprovalLatencyHigh` (p95 > 30s for 15m)
  - `PolicyPromotionFailureRate` (failures > 20% over 15m)
  - `PolicyPromotionStall` (no successes while queue non-empty for 10m)
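
To make the thresholds above concrete, here is a minimal sketch of how they could map onto PromQL expressions. It is not a copy of `policy-alerts.yaml`; the real rules may differ. The metric names come from the Notes section of this playbook, but the rate windows, the `outcome="success"` / `outcome="failure"` label values, and the use of `policy_simulation_queue_depth` as the queue in the stall rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertSketch:
    name: str
    expr: str                   # assumed PromQL, not taken from policy-alerts.yaml
    for_: Optional[str] = None  # hold time before firing; None if not stated


ALERTS = [
    AlertSketch(
        "PolicyCompileLatencyP99High",
        "histogram_quantile(0.99, sum by (le) "
        "(rate(policy_compile_duration_seconds_bucket[5m]))) > 5",
        "10m",
    ),
    AlertSketch(
        "PolicySimulationQueueBacklog",
        "max(policy_simulation_queue_depth) > 100",
        "10m",
    ),
    AlertSketch(
        "PolicyApprovalLatencyHigh",
        "histogram_quantile(0.95, sum by (le) "
        "(rate(policy_approval_latency_seconds_bucket[5m]))) > 30",
        "15m",
    ),
    AlertSketch(
        # "over 15m" is read here as the rate window; no hold time is assumed.
        "PolicyPromotionFailureRate",
        'sum(rate(policy_promotion_outcomes_total{outcome="failure"}[15m])) '
        "/ sum(rate(policy_promotion_outcomes_total[15m])) > 0.2",
    ),
    AlertSketch(
        # Assumes the relevant queue is the simulation queue depth gauge.
        "PolicyPromotionStall",
        'sum(rate(policy_promotion_outcomes_total{outcome="success"}[10m])) == 0 '
        "and max(policy_simulation_queue_depth) > 0",
        "10m",
    ),
]


def to_rule(alert: AlertSketch) -> dict:
    """Render one sketch in the shape of a Prometheus alerting rule entry."""
    rule = {"alert": alert.name, "expr": alert.expr}
    if alert.for_ is not None:
        rule["for"] = alert.for_
    return rule
```

Calling `to_rule(ALERTS[0])` returns a dictionary with `alert`, `expr`, and `for` keys that can be compared against the corresponding entry in the rules file when reviewing a threshold change.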

## Runbook

1. **Compile latency alert**
   - Check whether build nodes are hitting their CPU cap, and verify the cache hit rate for the policy engine.
   - Roll-restart a single runner; if the latency persists, scale the policy compile workers (+1) or purge the stale cache.
2. **Simulation backlog**
   - Inspect the queue depth per stage (panel "Queue Depth by Stage").
   - If the backlog is limited to one stage, increase concurrency for that stage or drain stuck items; otherwise, add workers.
3. **Approval latency high**
   - Look for blocked approvals (UI/API outages). Re-run the approval service health check and, if it fails, fail over to the standby.
4. **Promotion failure rate/stall**
   - Pull recent logs for the promotion job and compare failure reasons (policy validation vs. target registry).
   - For registry errors, pause promotions and file an incident with the registry owner. For policy validation errors, revert the latest policy change or apply an override to unblock critical tenants.
5. **Verification**
   - After mitigation, confirm that the promotion success rate gauge recovers to >95% and that queues drain to baseline (<10).
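
The decision in step 2 and the recovery check in step 5 can be sketched as two small functions. This is an illustration under stated assumptions, not existing tooling: the function names, the stage names in the usage example, and the `dominance` cutoff (the share of queued items at which a backlog counts as "limited to one stage") are hypothetical. The 95% and 10 thresholds come from step 5.

```python
def triage_backlog(depth_by_stage, dominance=0.8):
    """Suggest an action for a simulation backlog (runbook step 2).

    `dominance` is a hypothetical cutoff: if a single stage holds at least
    this share of all queued items, the backlog is treated as limited to it.
    """
    total = sum(depth_by_stage.values())
    if total == 0:
        return "no backlog"
    stage, depth = max(depth_by_stage.items(), key=lambda kv: kv[1])
    if depth / total >= dominance:
        return f"increase concurrency or drain stuck items in stage '{stage}'"
    return "add workers"


def is_recovered(success_rate, depth_by_stage, min_rate=0.95, baseline=10):
    """Verification (runbook step 5): success rate above 95% and every
    queue below the baseline depth of 10."""
    queues_ok = all(depth < baseline for depth in depth_by_stage.values())
    return success_rate > min_rate and queues_ok
```

For example, `triage_backlog({"parse": 4, "evaluate": 180, "report": 6})` points at the `evaluate` stage because it holds about 95% of the queued items, while `is_recovered(0.97, {"parse": 2, "evaluate": 8})` returns `True`.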

## Escalation

- Primary: Policy On-Call (week N roster).
- Secondary: DevOps Guild (release).
- Page if two critical alerts fire concurrently or any critical alert lasts longer than 30m.
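
The paging rule above can be expressed as a short predicate. This is a sketch only: the `should_page` name and the shape of the `firing` mapping (alert name to a `(severity, started_at)` pair) are assumptions, and in practice this logic would more likely live in the alert routing configuration.

```python
from datetime import datetime, timedelta


def should_page(firing, now, max_duration=timedelta(minutes=30)):
    """Return True if the escalation rule says to page.

    `firing` maps alert name -> (severity, started_at) for alerts that are
    currently firing; this structure is hypothetical.
    """
    critical_starts = [
        started for severity, started in firing.values() if severity == "critical"
    ]
    # Two or more critical alerts firing at the same time.
    if len(critical_starts) >= 2:
        return True
    # Any single critical alert that has lasted longer than 30 minutes.
    return any(now - started > max_duration for started in critical_starts)
```

With `now = datetime(2025, 1, 1, 12, 0)`, a single critical alert that started at 11:20 triggers a page (40 minutes), while one that started at 11:45 does not.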

## Notes

- Metrics assumed available: `policy_compile_duration_seconds_bucket`, `policy_simulation_queue_depth`, `policy_approval_latency_seconds_bucket`, `policy_promotion_outcomes_total{outcome=*}`.
- Keep alert thresholds stable unless the load profile changes; adjust them in Git with approval from the Policy and DevOps leads.