up
ops/devops/observability/policy-playbook.md
@@ -0,0 +1,39 @@
# Policy Pipeline Playbook
Scope: policy compile → simulation → approval → promotion path.
## Dashboards
- Grafana: import `ops/devops/observability/grafana/policy-pipeline.json` (datasource `Prometheus`).
- Key tiles: Compile p99, Simulation Queue Depth, Approval p95, Promotion Success Rate, Promotion Outcomes.
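After importing, it can help to confirm that the dashboard JSON actually contains the key tiles listed above. The following is a minimal sketch, assuming the panel titles in `policy-pipeline.json` match the tile names in this playbook exactly (that match is an assumption, as are the helper names `panel_titles` and `missing_tiles`). It relies only on the standard Grafana dashboard layout of a top-level `panels` list, with row panels optionally nesting their own `panels`.

```python
# Tile names copied from this playbook; exact panel titles in the JSON are assumed.
EXPECTED_TILES = [
    "Compile p99",
    "Simulation Queue Depth",
    "Approval p95",
    "Promotion Success Rate",
    "Promotion Outcomes",
]


def panel_titles(dashboard: dict) -> set[str]:
    """Collect panel titles, including panels nested inside row panels."""
    titles = set()
    stack = list(dashboard.get("panels", []))
    while stack:
        panel = stack.pop()
        if "title" in panel:
            titles.add(panel["title"])
        # Collapsed rows keep their child panels under their own "panels" key.
        stack.extend(panel.get("panels", []))
    return titles


def missing_tiles(dashboard: dict) -> list[str]:
    """Return the expected tiles that are absent, in playbook order."""
    present = panel_titles(dashboard)
    return [tile for tile in EXPECTED_TILES if tile not in present]
```

To use it, load the file with `json.loads(Path("ops/devops/observability/grafana/policy-pipeline.json").read_text())` and pass the result to `missing_tiles`; an empty list means every key tile was found.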
## Alerts (Prometheus)
- Rules: `ops/devops/observability/policy-alerts.yaml`
- `PolicyCompileLatencyP99High` (p99 > 5s for 10m)
- `PolicySimulationQueueBacklog` (queue depth > 100 for 10m)
- `PolicyApprovalLatencyHigh` (p95 > 30s for 15m)
- `PolicyPromotionFailureRate` (failures > 20% over 15m)
- `PolicyPromotionStall` (no successes while queue non-empty for 10m)
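Each threshold alert above pairs a limit with a hold duration: the condition has to stay true for the whole window before the alert fires. The sketch below models that behaviour in plain Python for the four threshold-based rules. It is a simplified illustration, not the contents of `policy-alerts.yaml`: the `Rule` type and the `fires` helper are hypothetical, and real Prometheus evaluates rules at its own evaluation interval. `PolicyPromotionStall` is left out because it combines two conditions rather than a single threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    threshold: float   # value that must be exceeded
    for_seconds: int   # how long the breach must last


# Thresholds and durations from the alert list; the 20% rate is written as 0.20.
RULES = {
    "PolicyCompileLatencyP99High": Rule(threshold=5.0, for_seconds=600),
    "PolicySimulationQueueBacklog": Rule(threshold=100, for_seconds=600),
    "PolicyApprovalLatencyHigh": Rule(threshold=30.0, for_seconds=900),
    "PolicyPromotionFailureRate": Rule(threshold=0.20, for_seconds=900),
}


def fires(rule: Rule, samples: list[tuple[float, float]]) -> bool:
    """Decide whether a rule fires for (unix_timestamp, value) samples sorted by time.

    True when the value has stayed above the threshold without interruption
    for at least rule.for_seconds, measured up to the latest sample.
    """
    breach_start = None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts
        else:
            # Any sample at or below the threshold resets the hold timer.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= rule.for_seconds
```

A single dip below the threshold resets the timer, which is why a noisy queue hovering around 100 may never trigger `PolicySimulationQueueBacklog` even though it looks unhealthy on the dashboard.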
## Runbook
1. **Compile latency alert**
   - Check whether build nodes are hitting their CPU cap; verify the policy engine's cache hit rate.
   - Restart a single runner (rolling restart); if the alert persists, scale policy compile workers (+1) or purge the stale cache.
2. **Simulation backlog**
   - Inspect queue depth per stage (panel "Queue Depth by Stage").
   - If the backlog is limited to one stage, increase concurrency for that stage or drain stuck items; otherwise, add workers.
3. **Approval latency high**
   - Look for blocked approvals (UI/API outages). Re-run the approval service health check; fail over to the standby.
4. **Promotion failure rate/stall**
   - Pull recent logs for the promotion job; compare failure reasons (policy validation vs. target registry).
   - If the failures are registry errors, pause promotions and file an incident with the registry owner; if they are policy validation errors, revert the latest policy change or apply an override to unblock critical tenants.
5. **Verification**
   - After mitigation, ensure the promotion success rate gauge recovers to above 95% and queues drain to baseline (<10).
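Two of the runbook steps reduce to simple checks that can be scripted: the simulation backlog decision (one stage versus several) and the final verification. The sketch below shows one possible encoding. The function names are hypothetical, and the rule that a stage counts as backed up once it reaches the baseline depth of 10 is an assumption, since the playbook does not define "limited to one stage" precisely.

```python
BASELINE_QUEUE_DEPTH = 10    # "baseline (<10)" from the verification step
TARGET_SUCCESS_RATE = 0.95   # ">95%" from the verification step


def backlog_action(depth_by_stage: dict[str, int]) -> str:
    """Suggest the runbook action for a simulation backlog.

    Assumption: a stage is backed up when its depth is at or above baseline.
    """
    hot = sorted(
        stage for stage, depth in depth_by_stage.items()
        if depth >= BASELINE_QUEUE_DEPTH
    )
    if not hot:
        return "no backlog"
    if len(hot) == 1:
        return f"increase concurrency or drain stuck items in stage '{hot[0]}'"
    return "add workers"


def recovered(success_rate: float, depth_by_stage: dict[str, int]) -> bool:
    """Verification step: success rate above 95% and every queue below baseline."""
    return success_rate > TARGET_SUCCESS_RATE and all(
        depth < BASELINE_QUEUE_DEPTH for depth in depth_by_stage.values()
    )
```

For example, `backlog_action({"parse": 3, "evaluate": 140})` points at the `evaluate` stage, while a backlog spread across both stages returns `"add workers"`. The stage names here are placeholders.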
## Escalation
- Primary: Policy On-Call (week N roster).
- Secondary: DevOps Guild (release).
- Page if two critical alerts fire concurrently or any critical alert lasts >30m.
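The paging rule above can be expressed as a small predicate. This is a sketch under stated assumptions: the `ActiveAlert` type and its `severity` field are hypothetical, and which alerts carry a `critical` severity is not stated in this playbook.

```python
from dataclasses import dataclass

PAGE_AFTER_SECONDS = 30 * 60  # "any critical alert lasts >30m"


@dataclass(frozen=True)
class ActiveAlert:
    name: str
    severity: str      # assumed label values, e.g. "critical" or "warning"
    started_at: float  # unix timestamp when the alert began firing


def should_page(alerts: list[ActiveAlert], now: float) -> bool:
    """Page when two critical alerts fire together or one lasts over 30 minutes."""
    critical = [a for a in alerts if a.severity == "critical"]
    if len(critical) >= 2:
        return True
    return any(now - a.started_at > PAGE_AFTER_SECONDS for a in critical)
```

Non-critical alerts never trigger a page in this sketch, regardless of how long they have been firing.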
## Notes
- Metrics assumed available: `policy_compile_duration_seconds_bucket`, `policy_simulation_queue_depth`, `policy_approval_latency_seconds_bucket`, `policy_promotion_outcomes_total{outcome=*}`.
- Keep alert thresholds stable unless load profile changes; adjust in Git with approval from Policy + DevOps leads.
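The `_bucket` metrics listed above are cumulative histograms, so the p99 and p95 figures are estimates interpolated from bucket counts. The sketch below mirrors the linear interpolation that Prometheus's `histogram_quantile` performs, applied to plain counts instead of rates, plus a failure-rate helper for the outcomes counter. The bucket boundaries in the usage note and the `outcome` label values `success` and `failure` are assumptions; this playbook does not list them.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (le, cumulative_count) pairs.

    Buckets must be sorted by upper bound `le` and end with +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_le
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


def failure_rate(outcomes: dict[str, float]) -> float:
    """Share of promotions that failed; label values are assumed."""
    total = sum(outcomes.values())
    if total == 0:
        return 0.0
    return outcomes.get("failure", 0.0) / total
```

With hypothetical buckets `[(1, 50), (2.5, 90), (5, 98), (10, 100), (math.inf, 100)]`, the 0.99 quantile lands in the 5s to 10s bucket and interpolates to 7.5s, which would breach the 5s compile threshold. Because the estimate depends on bucket boundaries, a coarse bucket layout near 5s or 30s makes the corresponding alerts less precise.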