
# Alerting Rules

Prometheus alerting rules for the Release Orchestrator.

Status: Planned (not yet implemented)
Source: Architecture Advisory Section 13.4
Related Modules: Metrics, Observability

## Overview

The Release Orchestrator provides Prometheus alerting rules for monitoring promotions, deployments, agents, and integrations.
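
The `- alert:` snippets below are fragments: Prometheus only loads alerting rules that sit under a `groups:` entry in a rule file, and that file must be listed under `rule_files:` in `prometheus.yml`. A minimal skeleton, with an illustrative path and group name:

```yaml
# /etc/prometheus/rules/release-orchestrator.yaml (illustrative path)
# Reference this file from prometheus.yml:
#   rule_files:
#     - /etc/prometheus/rules/release-orchestrator.yaml
groups:
  - name: release-orchestrator
    rules:
      # ... paste the alert rules from the sections below ...
```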


## High Priority Alerts

### Security Gate Block Rate

```yaml
- alert: PromotionGateBlockRate
  expr: |
    rate(stella_security_gate_results_total{result="blocked"}[1h]) /
    rate(stella_security_gate_results_total[1h]) > 0.5
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High rate of security gate blocks"
    description: "More than 50% of promotions are being blocked by security gates"
```

### Deployment Failure Rate

```yaml
- alert: DeploymentFailureRate
  expr: |
    rate(stella_deployments_total{status="failed"}[1h]) /
    rate(stella_deployments_total[1h]) > 0.1
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High deployment failure rate"
    description: "More than 10% of deployments are failing"
```

### Agent Offline

```yaml
- alert: AgentOffline
  expr: |
    stella_agents_status{status="offline"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Agent offline"
    description: "Agent {{ $labels.agent_id }} has been offline for 5 minutes"
```

### Promotion Stuck

```yaml
- alert: PromotionStuck
  expr: |
    time() - stella_promotion_start_time{status="deploying"} > 1800
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Promotion stuck in deploying state"
    description: "Promotion {{ $labels.promotion_id }} has been deploying for more than 30 minutes"
```

### Integration Unhealthy

```yaml
- alert: IntegrationUnhealthy
  expr: |
    stella_integration_health{status="unhealthy"} == 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Integration unhealthy"
    description: "Integration {{ $labels.integration_name }} has been unhealthy for 10 minutes"
```

## Medium Priority Alerts

### Workflow Step Timeout

```yaml
- alert: WorkflowStepTimeout
  expr: |
    stella_workflow_step_duration_seconds > 600
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Workflow step taking too long"
    description: "Step {{ $labels.step_type }} in workflow {{ $labels.workflow_run_id }} has been running for more than 10 minutes"
```

### Evidence Generation Failure

```yaml
- alert: EvidenceGenerationFailure
  expr: |
    rate(stella_evidence_generation_failures_total[1h]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Evidence generation failures"
    description: "Evidence generation is failing, affecting audit compliance"
```

### Target Health Degraded

```yaml
- alert: TargetHealthDegraded
  expr: |
    stella_target_health{status!="healthy"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Target health degraded"
    description: "Target {{ $labels.target_name }} is reporting {{ $labels.status }}"
```

### Approval Timeout

```yaml
- alert: ApprovalTimeout
  expr: |
    time() - stella_promotion_approval_requested_time > 86400
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Promotion awaiting approval for too long"
    description: "Promotion {{ $labels.promotion_id }} has been waiting for approval for more than 24 hours"
```

## Low Priority Alerts

### Database Connection Pool

```yaml
- alert: DatabaseConnectionPoolExhausted
  expr: |
    stella_db_connection_pool_available < 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Database connection pool running low"
    description: "Only {{ $value }} database connections available"
```

### Plugin Error Rate

```yaml
- alert: PluginErrorRate
  expr: |
    rate(stella_plugin_errors_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Plugin errors detected"
    description: "Plugin {{ $labels.plugin_id }} is experiencing errors"
```

## Alert Routing

### Example Alertmanager Configuration

```yaml
# alertmanager.yaml
route:
  receiver: default
  group_by: [alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true

    - match:
        severity: warning
      receiver: slack

receivers:
  - name: default
    webhook_configs:
      - url: http://webhook.example.com/alerts

  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PAGERDUTY_KEY}
        severity: critical

  - name: slack
    slack_configs:
      - channel: '#alerts'
        api_url: ${SLACK_WEBHOOK_URL}
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
```
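
Alertmanager does not expand environment variables in its configuration, so the `${PAGERDUTY_KEY}` and `${SLACK_WEBHOOK_URL}` placeholders above are assumed to be substituted by deployment tooling (for example `envsubst`) before the file is loaded; the result can be checked with `amtool check-config alertmanager.yaml`. On Alertmanager 0.22 and newer, the deprecated `match:` blocks can also be written with the `matchers:` syntax:

```yaml
routes:
  - matchers:
      - severity = "critical"
    receiver: pagerduty
    continue: true
```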

## Dashboard Integration

### Grafana Alert Panels

Recommended dashboard panels for alerts:

| Panel | Query |
| --- | --- |
| Active Alerts | `count(ALERTS{alertstate="firing"})` |
| Alert History | `count_over_time(ALERTS{alertstate="firing"}[24h])` |
| By Severity | `count(ALERTS{alertstate="firing"}) by (severity)` |
| By Alert Name | `count(ALERTS{alertstate="firing"}) by (alertname)` |
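
If dashboards are managed as code, the panels above can be shipped alongside the rules via Grafana's file-based dashboard provisioning. A minimal provider sketch (paths and names are illustrative):

```yaml
# grafana/provisioning/dashboards/release-orchestrator.yaml (illustrative)
apiVersion: 1
providers:
  - name: release-orchestrator-alerts
    folder: Release Orchestrator
    type: file
    options:
      path: /var/lib/grafana/dashboards   # directory holding the dashboard JSON files
```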

## See Also