add release orchestrator docs and fill sprint gaps
docs/modules/release-orchestrator/operations/alerting.md (new file, 246 lines)
# Alerting Rules

> Prometheus alerting rules for the Release Orchestrator.

**Status:** Planned (not yet implemented)
**Source:** [Architecture Advisory Section 13.4](../../../product/advisories/09-Jan-2026%20-%20Stella%20Ops%20Orchestrator%20Architecture.md)
**Related Modules:** [Metrics](metrics.md), [Observability Overview](overview.md)
## Overview

The Release Orchestrator provides Prometheus alerting rules for monitoring promotions, deployments, agents, and integrations.
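The alert definitions in this document are written as individual rule entries. To be loaded by Prometheus they would sit inside a standard rule-group file; the sketch below shows the expected shape, with the file path and group name chosen for illustration only:

```yaml
# /etc/prometheus/rules/release-orchestrator.yml  (illustrative path)
groups:
  - name: release-orchestrator  # illustrative group name
    rules:
      # Any alert from this document slots in unchanged, e.g.:
      - alert: AgentOffline
        expr: |
          stella_agents_status{status="offline"} == 1
        for: 5m
        labels:
          severity: warning
```

Once the rules exist, `promtool check rules <file>` validates the file before Prometheus loads it.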
---

## High Priority Alerts

### Security Gate Block Rate

```yaml
- alert: PromotionGateBlockRate
  expr: |
    rate(stella_security_gate_results_total{result="blocked"}[1h]) /
    rate(stella_security_gate_results_total[1h]) > 0.5
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High rate of security gate blocks"
    description: "More than 50% of promotions are being blocked by security gates"
```
### Deployment Failure Rate

```yaml
- alert: DeploymentFailureRate
  expr: |
    rate(stella_deployments_total{status="failed"}[1h]) /
    rate(stella_deployments_total[1h]) > 0.1
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High deployment failure rate"
    description: "More than 10% of deployments are failing"
```
### Agent Offline

```yaml
- alert: AgentOffline
  expr: |
    stella_agents_status{status="offline"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Agent offline"
    description: "Agent {{ $labels.agent_id }} has been offline for 5 minutes"
```
### Promotion Stuck

```yaml
- alert: PromotionStuck
  expr: |
    time() - stella_promotion_start_time{status="deploying"} > 1800
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Promotion stuck in deploying state"
    description: "Promotion {{ $labels.promotion_id }} has been deploying for more than 30 minutes"
```
### Integration Unhealthy

```yaml
- alert: IntegrationUnhealthy
  expr: |
    stella_integration_health{status="unhealthy"} == 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Integration unhealthy"
    description: "Integration {{ $labels.integration_name }} has been unhealthy for 10 minutes"
```

---
## Medium Priority Alerts

### Workflow Step Timeout

```yaml
- alert: WorkflowStepTimeout
  expr: |
    stella_workflow_step_duration_seconds > 600
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Workflow step taking too long"
    description: "Step {{ $labels.step_type }} in workflow {{ $labels.workflow_run_id }} has been running for more than 10 minutes"
```
### Evidence Generation Failure

```yaml
- alert: EvidenceGenerationFailure
  expr: |
    rate(stella_evidence_generation_failures_total[1h]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Evidence generation failures"
    description: "Evidence generation is failing, affecting audit compliance"
```
### Target Health Degraded

```yaml
- alert: TargetHealthDegraded
  expr: |
    stella_target_health{status!="healthy"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Target health degraded"
    description: "Target {{ $labels.target_name }} is reporting {{ $labels.status }}"
```
### Approval Timeout

```yaml
- alert: ApprovalTimeout
  expr: |
    time() - stella_promotion_approval_requested_time > 86400
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Promotion awaiting approval for too long"
    description: "Promotion {{ $labels.promotion_id }} has been waiting for approval for more than 24 hours"
```

---
## Low Priority Alerts

### Database Connection Pool

```yaml
- alert: DatabaseConnectionPoolExhausted
  expr: |
    stella_db_connection_pool_available < 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Database connection pool running low"
    description: "Only {{ $value }} database connections available"
```
### Plugin Error Rate

```yaml
- alert: PluginErrorRate
  expr: |
    rate(stella_plugin_errors_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Plugin errors detected"
    description: "Plugin {{ $labels.plugin_id }} is experiencing errors"
```

---
## Alert Routing

### Example AlertManager Configuration
```yaml
# alertmanager.yaml
route:
  receiver: default
  group_by: [alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true

    - match:
        severity: warning
      receiver: slack

receivers:
  - name: default
    webhook_configs:
      - url: http://webhook.example.com/alerts

  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PAGERDUTY_KEY}
        severity: critical

  - name: slack
    slack_configs:
      - channel: '#alerts'
        api_url: ${SLACK_WEBHOOK_URL}
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
```
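A common companion to this routing is an inhibit rule, so that when a critical alert fires, the matching warning-level alert is muted rather than double-notified; a minimal sketch, assuming the paired alerts share an `alertname` label:

```yaml
# Possible addition to alertmanager.yaml (not part of the planned config above)
inhibit_rules:
  - source_match:
      severity: critical   # a firing critical alert...
    target_match:
      severity: warning    # ...suppresses the warning variant
    equal: [alertname]     # only when both carry the same alertname
```

The whole file can be validated with `amtool check-config alertmanager.yaml` before reload.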
---

## Dashboard Integration

### Grafana Alert Panels

Recommended dashboard panels for alerts:

| Panel | Query |
|-------|-------|
| Active Alerts | `count(ALERTS{alertstate="firing"})` |
| Alert History | `count_over_time(ALERTS{alertstate="firing"}[24h])` |
| By Severity | `count(ALERTS{alertstate="firing"}) by (severity)` |
| By Component | `count(ALERTS{alertstate="firing"}) by (alertname)` |
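The `ALERTS{alertname, alertstate}` series used above is synthesized by Prometheus itself, so these panels need no extra instrumentation. If dashboards should consume a single pre-aggregated series instead, a recording rule is one option; a sketch, with the group and rule names chosen for illustration:

```yaml
groups:
  - name: release-orchestrator-dashboards  # illustrative group name
    rules:
      - record: stella:alerts_firing:count
        # vector(0) keeps the series present even when nothing is firing
        expr: count(ALERTS{alertstate="firing"}) or vector(0)
```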
---

## See Also

- [Metrics](metrics.md)
- [Observability Overview](overview.md)
- [Logging](logging.md)
- [Tracing](tracing.md)