add release orchestrator docs and fill sprint gaps
docs/modules/release-orchestrator/operations/alerting.md (new file, 246 lines)
# Alerting Rules

> Prometheus alerting rules for the Release Orchestrator.

**Status:** Planned (not yet implemented)
**Source:** [Architecture Advisory Section 13.4](../../../product/advisories/09-Jan-2026%20-%20Stella%20Ops%20Orchestrator%20Architecture.md)
**Related Modules:** [Metrics](metrics.md), [Observability Overview](overview.md)
## Overview

The Release Orchestrator provides Prometheus alerting rules for monitoring promotions, deployments, agents, and integrations.
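The alert definitions in this document are written as individual rule entries. To be loaded by Prometheus they would sit inside a standard rule-group file; the sketch below shows the expected shape, with the file path and group name chosen for illustration only:

```yaml
# /etc/prometheus/rules/release-orchestrator.yml  (illustrative path)
groups:
  - name: release-orchestrator  # illustrative group name
    rules:
      # Any alert from this document slots in unchanged, e.g.:
      - alert: AgentOffline
        expr: |
          stella_agents_status{status="offline"} == 1
        for: 5m
        labels:
          severity: warning
```

Once the rules exist, `promtool check rules <file>` validates the file before Prometheus loads it.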
---

## High Priority Alerts

### Security Gate Block Rate

```yaml
- alert: PromotionGateBlockRate
  expr: |
    rate(stella_security_gate_results_total{result="blocked"}[1h]) /
    rate(stella_security_gate_results_total[1h]) > 0.5
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High rate of security gate blocks"
    description: "More than 50% of promotions are being blocked by security gates"
```
### Deployment Failure Rate

```yaml
- alert: DeploymentFailureRate
  expr: |
    rate(stella_deployments_total{status="failed"}[1h]) /
    rate(stella_deployments_total[1h]) > 0.1
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High deployment failure rate"
    description: "More than 10% of deployments are failing"
```
### Agent Offline

```yaml
- alert: AgentOffline
  expr: |
    stella_agents_status{status="offline"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Agent offline"
    description: "Agent {{ $labels.agent_id }} has been offline for 5 minutes"
```
### Promotion Stuck

```yaml
- alert: PromotionStuck
  expr: |
    time() - stella_promotion_start_time{status="deploying"} > 1800
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Promotion stuck in deploying state"
    description: "Promotion {{ $labels.promotion_id }} has been deploying for more than 30 minutes"
```
### Integration Unhealthy

```yaml
- alert: IntegrationUnhealthy
  expr: |
    stella_integration_health{status="unhealthy"} == 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Integration unhealthy"
    description: "Integration {{ $labels.integration_name }} has been unhealthy for 10 minutes"
```

---
## Medium Priority Alerts

### Workflow Step Timeout

```yaml
- alert: WorkflowStepTimeout
  expr: |
    stella_workflow_step_duration_seconds > 600
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Workflow step taking too long"
    description: "Step {{ $labels.step_type }} in workflow {{ $labels.workflow_run_id }} has been running for more than 10 minutes"
```
### Evidence Generation Failure

```yaml
- alert: EvidenceGenerationFailure
  expr: |
    rate(stella_evidence_generation_failures_total[1h]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Evidence generation failures"
    description: "Evidence generation is failing, affecting audit compliance"
```
### Target Health Degraded

```yaml
- alert: TargetHealthDegraded
  expr: |
    stella_target_health{status!="healthy"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Target health degraded"
    description: "Target {{ $labels.target_name }} is reporting {{ $labels.status }}"
```
### Approval Timeout

```yaml
- alert: ApprovalTimeout
  expr: |
    time() - stella_promotion_approval_requested_time > 86400
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Promotion awaiting approval for too long"
    description: "Promotion {{ $labels.promotion_id }} has been waiting for approval for more than 24 hours"
```

---
## Low Priority Alerts

### Database Connection Pool

```yaml
- alert: DatabaseConnectionPoolExhausted
  expr: |
    stella_db_connection_pool_available < 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Database connection pool running low"
    description: "Only {{ $value }} database connections available"
```
### Plugin Error Rate

```yaml
- alert: PluginErrorRate
  expr: |
    rate(stella_plugin_errors_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Plugin errors detected"
    description: "Plugin {{ $labels.plugin_id }} is experiencing errors"
```

---
## Alert Routing

### Example AlertManager Configuration
```yaml
# alertmanager.yaml
route:
  receiver: default
  group_by: [alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true

    - match:
        severity: warning
      receiver: slack

receivers:
  - name: default
    webhook_configs:
      - url: http://webhook.example.com/alerts

  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PAGERDUTY_KEY}
        severity: critical

  - name: slack
    slack_configs:
      - channel: '#alerts'
        api_url: ${SLACK_WEBHOOK_URL}
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
```
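A common companion to this routing is an inhibit rule, so that when a critical alert fires, the matching warning-level alert is muted rather than double-notified; a minimal sketch, assuming the paired alerts share an `alertname` label:

```yaml
# Possible addition to alertmanager.yaml (not part of the planned config above)
inhibit_rules:
  - source_match:
      severity: critical   # a firing critical alert...
    target_match:
      severity: warning    # ...suppresses the warning variant
    equal: [alertname]     # only when both carry the same alertname
```

The whole file can be validated with `amtool check-config alertmanager.yaml` before reload.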
---

## Dashboard Integration

### Grafana Alert Panels

Recommended dashboard panels for alerts:

| Panel | Query |
|-------|-------|
| Active Alerts | `count(ALERTS{alertstate="firing"})` |
| Alert History | `count_over_time(ALERTS{alertstate="firing"}[24h])` |
| By Severity | `count(ALERTS{alertstate="firing"}) by (severity)` |
| By Component | `count(ALERTS{alertstate="firing"}) by (alertname)` |
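The `ALERTS{alertname, alertstate}` series used above is synthesized by Prometheus itself, so these panels need no extra instrumentation. If dashboards should consume a single pre-aggregated series instead, a recording rule is one option; a sketch, with the group and rule names chosen for illustration:

```yaml
groups:
  - name: release-orchestrator-dashboards  # illustrative group name
    rules:
      - record: stella:alerts_firing:count
        # vector(0) keeps the series present even when nothing is firing
        expr: count(ALERTS{alertstate="firing"}) or vector(0)
```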
---

## See Also

- [Metrics](metrics.md)
- [Observability Overview](overview.md)
- [Logging](logging.md)
- [Tracing](tracing.md)