git.stella-ops.org/ops/devops/observability/policy-playbook.md
StellaOps Bot 9f6e6f7fb3
2025-11-25 22:09:44 +02:00

Policy Pipeline Playbook

Scope: policy compile → simulation → approval → promotion path.

Dashboards

  • Grafana: import ops/devops/observability/grafana/policy-pipeline.json (datasource Prometheus).
  • Key tiles: Compile p99, Simulation Queue Depth, Approval p95, Promotion Success Rate, Promotion Outcomes.

Alerts (Prometheus)

  • Rules: ops/devops/observability/policy-alerts.yaml
    • PolicyCompileLatencyP99High (p99 > 5s for 10m)
    • PolicySimulationQueueBacklog (queue depth > 100 for 10m)
    • PolicyApprovalLatencyHigh (p95 > 30s for 15m)
    • PolicyPromotionFailureRate (failures > 20% over 15m)
    • PolicyPromotionStall (no successes while queue non-empty for 10m)
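
The two promotion conditions above can be sketched in Python as a way to reason about what the rules measure. This is a minimal sketch, not the contents of policy-alerts.yaml: the function names are hypothetical, the `outcome` label value "success" is an assumption (the playbook only states that an `outcome` label exists), and which queue the stall rule watches is also assumed. The "for 10m" / "over 15m" hold times are left to Prometheus; the functions evaluate a single window.

```python
from typing import Mapping

FAILURE_RATE_THRESHOLD = 0.20  # playbook: failures > 20% over 15m


def promotion_failure_rate(increase_by_outcome: Mapping[str, float]) -> float:
    """Fraction of promotions in the window that did not succeed.

    `increase_by_outcome` maps an `outcome` label value to the increase of
    policy_promotion_outcomes_total over the window.
    The label value "success" is an assumption, not taken from the playbook.
    """
    total = sum(increase_by_outcome.values())
    if total <= 0:
        return 0.0  # no promotions in the window, so nothing failed
    succeeded = increase_by_outcome.get("success", 0.0)
    return (total - succeeded) / total


def failure_rate_breached(increase_by_outcome: Mapping[str, float]) -> bool:
    """True when the failure fraction is strictly above the threshold."""
    return promotion_failure_rate(increase_by_outcome) > FAILURE_RATE_THRESHOLD


def promotion_stalled(increase_by_outcome: Mapping[str, float], queue_depth: float) -> bool:
    """True when work is queued but no promotion succeeded in the window."""
    return queue_depth > 0 and increase_by_outcome.get("success", 0.0) == 0
```

For example, 70 successes and 30 failures in the window give a failure rate of 0.30, which breaches the 20% threshold; 80 successes and 20 failures sit exactly on the threshold and do not.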

Runbook

  1. Compile latency alert
    • Check whether build nodes are hitting their CPU cap; verify cache hit rates for the policy engine.
    • Rolling-restart a single runner; if the alert persists, scale policy compile workers (+1) or purge the stale cache.
  2. Simulation backlog
    • Inspect queue per stage (panel "Queue Depth by Stage").
    • If the backlog is limited to one stage, increase concurrency for that stage or drain stuck items; otherwise, add workers.
  3. Approval latency high
    • Look for blocked approvals (UI/API outages). Re-run the approval service health check; if it fails, fail over to the standby.
  4. Promotion failure rate/stall
    • Pull recent logs for the promotion job and compare failure reasons (policy validation vs. target registry).
    • If the failures are registry errors, pause promotions and file an incident with the registry owner; if they are policy validation errors, revert the latest policy change or apply an override to unblock critical tenants.
  5. Verification
    • After mitigation, confirm that the promotion success rate gauge recovers to above 95% and that queues drain to baseline (<10).
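
The verification step can be written down as a small check. The sketch below assumes the success rate is available as a fraction between 0 and 1 and the queue depths as a mapping keyed by stage; the function name and the stage names in the usage note are hypothetical.

```python
from typing import List, Mapping

SUCCESS_RATE_TARGET = 0.95  # playbook: success rate recovers above 95%
QUEUE_BASELINE = 10         # playbook: queues drain to baseline (< 10)


def outstanding_issues(success_rate: float,
                       queue_depth_by_stage: Mapping[str, float]) -> List[str]:
    """Return the verification criteria that are still unmet (empty means recovered)."""
    issues = []
    if success_rate <= SUCCESS_RATE_TARGET:
        issues.append(
            f"promotion success rate {success_rate:.0%} is not above {SUCCESS_RATE_TARGET:.0%}"
        )
    # Report stages in a stable order so repeated checks are easy to compare.
    for stage, depth in sorted(queue_depth_by_stage.items()):
        if depth >= QUEUE_BASELINE:
            issues.append(f"queue '{stage}' depth {depth} is not below {QUEUE_BASELINE}")
    return issues
```

Calling `outstanding_issues(0.97, {"compile": 2, "simulate": 0})` returns an empty list, while a success rate of 0.90 with a simulate queue at 40 returns two entries, one per unmet criterion.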

Escalation

  • Primary: Policy On-Call (week N roster).
  • Secondary: DevOps Guild (release).
  • Page if two critical alerts fire concurrently or any critical alert lasts longer than 30m.
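
The paging rule above can be expressed as a predicate over the currently firing alerts. This is a sketch under assumptions: the `FiringAlert` type and the severity value "critical" are hypothetical, since the playbook does not say how severity is labelled on the alerts.

```python
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple

MAX_CRITICAL_DURATION = timedelta(minutes=30)  # playbook: any critical alert lasting > 30m


class FiringAlert(NamedTuple):
    name: str
    severity: str          # "critical" is an assumed label value
    firing_since: datetime


def should_page(alerts: Iterable[FiringAlert], now: datetime) -> bool:
    """Page on two or more concurrent critical alerts, or one lasting over 30 minutes."""
    critical = [a for a in alerts if a.severity == "critical"]
    if len(critical) >= 2:
        return True
    return any(now - a.firing_since > MAX_CRITICAL_DURATION for a in critical)
```

A single critical alert that has been firing for 10 minutes does not page; the same alert at 31 minutes does, as do any two critical alerts firing at the same time.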

Notes

  • Metrics assumed available: policy_compile_duration_seconds_bucket, policy_simulation_queue_depth, policy_approval_latency_seconds_bucket, policy_promotion_outcomes_total{outcome=*}.
  • Keep alert thresholds stable unless load profile changes; adjust in Git with approval from Policy + DevOps leads.
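
For readers unfamiliar with how a p99 or p95 is derived from the `_bucket` metrics listed above, the following sketch reproduces, in simplified form, the linear interpolation that Prometheus's `histogram_quantile` applies to classic cumulative histograms. The bucket boundaries in the usage note are made up for illustration and are not the buckets the policy services actually export.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` holds (le upper bound, cumulative count) pairs and must include
    the +Inf bucket, as Prometheus `_bucket` series do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With illustrative buckets `[(0.5, 900), (1, 980), (5, 995), (10, 1000), (math.inf, 1000)]`, the p99 rank is 990, which lands in the 1s to 5s bucket and interpolates to roughly 3.67s, below the 5s compile threshold. The estimate is only as fine as the bucket layout, which is worth keeping in mind when a threshold sits inside a wide bucket.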