# Alerting Rules
Prometheus alerting rules for the Release Orchestrator.
**Status:** Planned (not yet implemented)
**Source:** Architecture Advisory Section 13.4
**Related Modules:** Metrics, Observability

## Overview
The Release Orchestrator provides Prometheus alerting rules for monitoring promotions, deployments, agents, and integrations.
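These rules would typically live in a rule file referenced from the Prometheus server configuration. A minimal sketch (the path and file name are illustrative):

```yaml
# prometheus.yml (excerpt)
rule_files:
  - /etc/prometheus/rules/stella-alerts.yaml
```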
## High Priority Alerts
### Security Gate Block Rate

```yaml
- alert: PromotionGateBlockRate
  expr: |
    rate(stella_security_gate_results_total{result="blocked"}[1h]) /
    rate(stella_security_gate_results_total[1h]) > 0.5
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High rate of security gate blocks"
    description: "More than 50% of promotions are being blocked by security gates"
```
### Deployment Failure Rate

```yaml
- alert: DeploymentFailureRate
  expr: |
    rate(stella_deployments_total{status="failed"}[1h]) /
    rate(stella_deployments_total[1h]) > 0.1
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High deployment failure rate"
    description: "More than 10% of deployments are failing"
```
### Agent Offline

```yaml
- alert: AgentOffline
  expr: |
    stella_agents_status{status="offline"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Agent offline"
    description: "Agent {{ $labels.agent_id }} has been offline for 5 minutes"
```
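Rules like the one above can be unit-tested offline with `promtool test rules`. A minimal sketch for the `AgentOffline` alert (the file names and the `agent-1` label value are illustrative):

```yaml
# stella-alerts-test.yaml -- run with: promtool test rules stella-alerts-test.yaml
rule_files:
  - stella-alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # agent-1 reports offline (value 1) for the whole window
      - series: 'stella_agents_status{agent_id="agent-1",status="offline"}'
        values: '1+0x10'
    alert_rule_test:
      # after the 5m "for" clause has elapsed, the alert should be firing
      - eval_time: 6m
        alertname: AgentOffline
        exp_alerts:
          - exp_labels:
              severity: warning
              agent_id: agent-1
              status: offline
            exp_annotations:
              summary: "Agent offline"
              description: "Agent agent-1 has been offline for 5 minutes"
```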
### Promotion Stuck

```yaml
- alert: PromotionStuck
  expr: |
    time() - stella_promotion_start_time{status="deploying"} > 1800
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Promotion stuck in deploying state"
    description: "Promotion {{ $labels.promotion_id }} has been deploying for more than 30 minutes"
```
### Integration Unhealthy

```yaml
- alert: IntegrationUnhealthy
  expr: |
    stella_integration_health{status="unhealthy"} == 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Integration unhealthy"
    description: "Integration {{ $labels.integration_name }} has been unhealthy for 10 minutes"
```
## Medium Priority Alerts
### Workflow Step Timeout

```yaml
- alert: WorkflowStepTimeout
  expr: |
    stella_workflow_step_duration_seconds > 600
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Workflow step taking too long"
    description: "Step {{ $labels.step_type }} in workflow {{ $labels.workflow_run_id }} has been running for more than 10 minutes"
```
### Evidence Generation Failure

```yaml
- alert: EvidenceGenerationFailure
  expr: |
    rate(stella_evidence_generation_failures_total[1h]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Evidence generation failures"
    description: "Evidence generation is failing, affecting audit compliance"
```
### Target Health Degraded

```yaml
- alert: TargetHealthDegraded
  expr: |
    stella_target_health{status!="healthy"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Target health degraded"
    description: "Target {{ $labels.target_name }} is reporting {{ $labels.status }}"
```
### Approval Timeout

```yaml
- alert: ApprovalTimeout
  expr: |
    time() - stella_promotion_approval_requested_time > 86400
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Promotion awaiting approval for too long"
    description: "Promotion {{ $labels.promotion_id }} has been waiting for approval for more than 24 hours"
```
## Low Priority Alerts
### Database Connection Pool

```yaml
- alert: DatabaseConnectionPoolExhausted
  expr: |
    stella_db_connection_pool_available < 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Database connection pool running low"
    description: "Only {{ $value }} database connections available"
```
### Plugin Error Rate

```yaml
- alert: PluginErrorRate
  expr: |
    rate(stella_plugin_errors_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Plugin errors detected"
    description: "Plugin {{ $labels.plugin_id }} is experiencing errors"
```
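Note that the `- alert:` snippets above are rule fragments: Prometheus only loads rules that are wrapped in a `groups:` block. A sketch of the assembled file, using one of the rules above (the group name and evaluation interval are illustrative):

```yaml
# stella-alerts.yaml
groups:
  - name: stella-release-orchestrator
    interval: 1m
    rules:
      - alert: AgentOffline
        expr: stella_agents_status{status="offline"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent offline"
          description: "Agent {{ $labels.agent_id }} has been offline for 5 minutes"
      # ... remaining rules from the sections above
```

Running `promtool check rules stella-alerts.yaml` validates the file before it is deployed.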
## Alert Routing
### Example AlertManager Configuration
```yaml
# alertmanager.yaml
route:
  receiver: default
  group_by: [alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
    - match:
        severity: warning
      receiver: slack
receivers:
  - name: default
    webhook_configs:
      - url: http://webhook.example.com/alerts
  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PAGERDUTY_KEY}
        severity: critical
  - name: slack
    slack_configs:
      - channel: '#alerts'
        api_url: ${SLACK_WEBHOOK_URL}
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
```
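Because critical alerts route to both PagerDuty and (via `continue: true`) onward matching, it can be useful to suppress warning-level noise while a related critical alert is firing. A sketch using AlertManager's `inhibit_rules` (matching on `alertname` assumes related alerts share a name, which may not hold for every rule above):

```yaml
# alertmanager.yaml (excerpt)
inhibit_rules:
  # mute warning notifications while a critical alert with the same name fires
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname]
```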
## Dashboard Integration
### Grafana Alert Panels
Recommended dashboard panels for alerts:
| Panel | Query |
|---|---|
| Active Alerts | `count(ALERTS{alertstate="firing"})` |
| Alert History | `count_over_time(ALERTS{alertstate="firing"}[24h])` |
| By Severity | `count(ALERTS{alertstate="firing"}) by (severity)` |
| By Component | `count(ALERTS{alertstate="firing"}) by (alertname)` |