release orchestrator pivot, architecture and planning

2026-01-10 22:37:22 +02:00
parent c84f421e2f
commit d509c44411
130 changed files with 70292 additions and 721 deletions
--- a/docs/modules/release-orchestrator/operations/metrics.md
+++ b/docs/modules/release-orchestrator/operations/metrics.md
@@ -0,0 +1,274 @@
+# Metrics Specification
+
+## Overview
+
+Release Orchestrator exposes Prometheus-compatible metrics for monitoring deployment health, performance, and operational status.
+
+## Core Metrics
+
+### Release Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_releases_total` | counter | Total releases created | `tenant`, `status` |
+| `stella_releases_active` | gauge | Currently active releases | `tenant`, `status` |
+| `stella_release_components_count` | histogram | Components per release | `tenant` |
+
+### Promotion Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_promotions_total` | counter | Total promotions | `tenant`, `env`, `status` |
+| `stella_promotions_in_progress` | gauge | Promotions currently in progress | `tenant`, `env` |
+| `stella_promotion_duration_seconds` | histogram | Time from request to completion | `tenant`, `env`, `status` |
+| `stella_approval_pending_count` | gauge | Pending approvals | `tenant`, `env` |
+| `stella_approval_duration_seconds` | histogram | Time to approve | `tenant`, `env` |
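+
+As an illustration of how the approval histogram can be queried, the sketch below estimates the mean time to approve per environment. It assumes the histogram is exported in the standard Prometheus form with `_sum` and `_count` series; the 1h window is an arbitrary choice, not a value from this specification.
+
+```promql
+# Mean approval time per environment over the last hour (sketch).
+# Assumes standard histogram series: <name>_sum and <name>_count.
+sum by (env) (rate(stella_approval_duration_seconds_sum[1h]))
+/
+sum by (env) (rate(stella_approval_duration_seconds_count[1h]))
+```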
+
+### Deployment Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_deployments_total` | counter | Total deployments | `tenant`, `env`, `strategy`, `status` |
+| `stella_deployment_duration_seconds` | histogram | Deployment duration | `tenant`, `env`, `strategy` |
+| `stella_deployment_tasks_total` | counter | Total deployment tasks | `tenant`, `status` |
+| `stella_deployment_task_duration_seconds` | histogram | Task duration | `target_type` |
+| `stella_rollbacks_total` | counter | Total rollbacks | `tenant`, `env`, `reason` |
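+
+One way to combine the counters above is a rollback ratio per environment. This is a sketch only: the 24h window is an assumption, and environments with no rollback series in that window drop out of the result rather than reporting 0.
+
+```promql
+# Rollbacks per deployment, by environment, over the last 24 hours (sketch).
+sum by (env) (increase(stella_rollbacks_total[24h]))
+/
+sum by (env) (increase(stella_deployments_total[24h]))
+```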
+
+### Agent Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_agents_connected` | gauge | Connected agents | `tenant` |
+| `stella_agents_by_status` | gauge | Agents by status | `tenant`, `status` |
+| `stella_agent_tasks_total` | counter | Tasks executed by agents | `agent`, `type`, `status` |
+| `stella_agent_task_duration_seconds` | histogram | Agent task duration | `agent`, `type` |
+| `stella_agent_heartbeat_age_seconds` | gauge | Seconds since last heartbeat | `agent` |
+| `stella_agent_resource_cpu_percent` | gauge | Agent CPU usage | `agent` |
+| `stella_agent_resource_memory_percent` | gauge | Agent memory usage | `agent` |
+
+### Workflow Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_workflow_runs_total` | counter | Workflow executions | `tenant`, `template`, `status` |
+| `stella_workflow_runs_active` | gauge | Currently running workflows | `tenant`, `template` |
+| `stella_workflow_duration_seconds` | histogram | Workflow duration | `template`, `status` |
+| `stella_workflow_step_duration_seconds` | histogram | Step execution time | `step_type`, `status` |
+| `stella_workflow_step_retries_total` | counter | Step retry count | `step_type` |
+
+### Target Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_targets_total` | gauge | Total targets | `tenant`, `env`, `type` |
+| `stella_targets_by_health` | gauge | Targets by health status | `tenant`, `env`, `health` |
+| `stella_target_drift_detected` | gauge | Targets with drift | `tenant`, `env` |
+
+### Integration Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_integrations_total` | gauge | Configured integrations | `tenant`, `type` |
+| `stella_integration_health` | gauge | Integration health (1 = healthy, 0 = unhealthy) | `tenant`, `integration` |
+| `stella_integration_requests_total` | counter | Requests to integrations | `integration`, `operation`, `status` |
+| `stella_integration_latency_seconds` | histogram | Integration request latency | `integration`, `operation` |
+
+### Gate Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_gate_evaluations_total` | counter | Gate evaluations | `tenant`, `gate_type`, `result` |
+| `stella_gate_evaluation_duration_seconds` | histogram | Gate evaluation time | `gate_type` |
+| `stella_gate_blocks_total` | counter | Blocked promotions by gate | `tenant`, `gate_type`, `env` |
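+
+The two gate counters can be combined into an approximate block rate per gate type. The sketch assumes each blocked promotion corresponds to one counted evaluation, which this specification does not state, so treat the result as indicative.
+
+```promql
+# Approximate share of gate evaluations that blocked a promotion (sketch).
+# Aggregates away `tenant` and `env` so both sides match on `gate_type`.
+sum by (gate_type) (rate(stella_gate_blocks_total[1h]))
+/
+sum by (gate_type) (rate(stella_gate_evaluations_total[1h]))
+```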
+
+## API Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_http_requests_total` | counter | HTTP requests | `method`, `path`, `status` |
+| `stella_http_request_duration_seconds` | histogram | Request latency | `method`, `path` |
+| `stella_http_requests_in_flight` | gauge | Active requests | `method` |
+| `stella_http_request_size_bytes` | histogram | Request size | `method`, `path` |
+| `stella_http_response_size_bytes` | histogram | Response size | `method`, `path` |
+
+## Evidence Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_evidence_packets_total` | counter | Evidence packets generated | `tenant`, `type` |
+| `stella_evidence_packet_size_bytes` | histogram | Evidence packet size | `type` |
+| `stella_evidence_verification_total` | counter | Evidence verifications | `result` |
+
+## Prometheus Configuration
+
+```yaml
+# prometheus.yml
+global:
+  scrape_interval: 15s
+  evaluation_interval: 15s
+
+scrape_configs:
+  - job_name: 'stella-orchestrator'
+    static_configs:
+      - targets: ['stella-orchestrator:9090']
+    metrics_path: /metrics
+    scheme: https
+    tls_config:
+      ca_file: /etc/prometheus/ca.crt
+
+  - job_name: 'stella-agents'
+    kubernetes_sd_configs:
+      - role: pod
+        selectors:
+          - role: pod
+            label: "app.kubernetes.io/name=stella-agent"
+    relabel_configs:
+      - source_labels: [__meta_kubernetes_pod_label_agent_id]
+        target_label: agent_id
+```
+
+## Histogram Buckets
+
+### Duration Buckets (seconds)
+
+```yaml
+# Short operations (API calls, gate evaluations)
+short_duration_buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
+
+# Medium operations (workflow steps)
+medium_duration_buckets: [0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300]
+
+# Long operations (deployments)
+long_duration_buckets: [1, 5, 10, 30, 60, 120, 300, 600, 1200, 3600]
+```
+
+### Size Buckets (bytes)
+
+```yaml
+# Request/response sizes
+size_buckets: [100, 1000, 10000, 100000, 1000000, 10000000]
+
+# Evidence packet sizes
+evidence_buckets: [1000, 10000, 100000, 500000, 1000000, 5000000]
+```
+
+## SLI Definitions
+
+### Availability SLI
+
+```promql
+# API availability (99.9% target)
+sum(rate(stella_http_requests_total{status!~"5.."}[5m]))
+/
+sum(rate(stella_http_requests_total[5m]))
+```
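+
+A 99.9% availability target leaves an error budget of 0.1% of requests, which over a 30-day window equals roughly 43.2 minutes of total outage (43,200 minutes × 0.001). A burn-rate query expresses how fast that budget is being spent. The 1h window below is an assumption, not part of this specification.
+
+```promql
+# Error-budget burn rate against the 99.9% target (sketch).
+# 1 means the budget is being spent exactly at the sustainable rate;
+# values above 1 exhaust it before the SLO window ends.
+(
+  1 - (
+    sum(rate(stella_http_requests_total{status!~"5.."}[1h]))
+    /
+    sum(rate(stella_http_requests_total[1h]))
+  )
+) / (1 - 0.999)
+```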
+
+### Latency SLI
+
+```promql
+# API latency P99 (target: < 500ms)
+histogram_quantile(0.99,
+  sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le)
+)
+```
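+
+The same target can be expressed as a ratio, which is often easier to alert on than a quantile: P99 under 500ms is equivalent to at least 99% of requests completing within 0.5 s. The sketch assumes the HTTP duration histogram uses the short duration buckets listed under Histogram Buckets, so that a `le="0.5"` bucket exists.
+
+```promql
+# Fraction of requests served within 500ms (sketch; should stay >= 0.99).
+# Assumes a bucket boundary at 0.5 seconds.
+sum(rate(stella_http_request_duration_seconds_bucket{le="0.5"}[5m]))
+/
+sum(rate(stella_http_request_duration_seconds_count[5m]))
+```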
+
+### Deployment Success SLI
+
+```promql
+# Deployment success rate (99% target)
+sum(rate(stella_deployments_total{status="succeeded"}[24h]))
+/
+sum(rate(stella_deployments_total[24h]))
+```
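+
+The SLI expressions above could be precomputed as recording rules so that dashboards and alerts share one definition. The group name, rule names, and interval below are hypothetical; the rule names only follow the common `level:metric:operations` naming convention.
+
+```yaml
+# Sketch of recording rules for the SLIs; all names here are assumptions.
+groups:
+  - name: stella-orchestrator-sli
+    interval: 30s
+    rules:
+      - record: stella:http_availability:ratio_rate5m
+        expr: |
+          sum(rate(stella_http_requests_total{status!~"5.."}[5m]))
+          /
+          sum(rate(stella_http_requests_total[5m]))
+
+      - record: stella:http_request_duration_seconds:p99_5m
+        expr: |
+          histogram_quantile(0.99,
+            sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le)
+          )
+
+      - record: stella:deployment_success:ratio_rate24h
+        expr: |
+          sum(rate(stella_deployments_total{status="succeeded"}[24h]))
+          /
+          sum(rate(stella_deployments_total[24h]))
+```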
+
+## Alert Rules
+
+```yaml
+groups:
+  - name: stella-orchestrator
+    rules:
+      - alert: HighDeploymentFailureRate
+        expr: |
+          sum(rate(stella_deployments_total{status="failed"}[1h]))
+          /
+          sum(rate(stella_deployments_total[1h])) > 0.1
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: High deployment failure rate
+          description: More than 10% of deployments failed in the last hour
+
+      - alert: AgentOffline
+        expr: stella_agent_heartbeat_age_seconds > 120
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: Agent {{ $labels.agent }} offline
+          description: Agent has not sent heartbeat for > 2 minutes
+
+      - alert: PendingApprovalsStale
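+        # Assumes a `stella_promotion_request_timestamp` gauge (Unix seconds) carrying
+        # `tenant` and `env` labels. It is not listed in the metric tables above and
+        # must be exported for this rule to fire.
+        expr: |
+          stella_approval_pending_count > 0
+          and on (tenant, env)
+          time() - stella_promotion_request_timestamp > 3600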
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: Stale pending approvals
+          description: Approvals pending for more than 1 hour
+
+      - alert: IntegrationUnhealthy
+        expr: stella_integration_health == 0
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: Integration {{ $labels.integration }} unhealthy
+          description: Integration health check failing
+
+      - alert: HighAPILatency
+        expr: |
+          histogram_quantile(0.99,
+            sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le, path)
+          ) > 1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: High API latency on {{ $labels.path }}
+          description: P99 latency exceeds 1 second
+```
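+
+Alert rules like these can be checked with `promtool test rules` before rollout. The sketch below exercises only `AgentOffline`. It assumes the rules above are saved as `alerts.yml` (a hypothetical file name) and uses a made-up agent `agent-1` whose heartbeat age grows by 60 seconds per minute.
+
+```yaml
+# alerts_test.yml (hypothetical file name); run with: promtool test rules alerts_test.yml
+rule_files:
+  - alerts.yml            # assumed location of the alert rules above
+
+evaluation_interval: 1m
+
+tests:
+  - interval: 1m
+    input_series:
+      # Heartbeat age starts at 0 and grows by 60s every minute.
+      - series: 'stella_agent_heartbeat_age_seconds{agent="agent-1"}'
+        values: '0+60x10'
+    alert_rule_test:
+      # Age exceeds 120s from minute 3 on, so with `for: 2m` the alert
+      # is firing by minute 6.
+      - eval_time: 6m
+        alertname: AgentOffline
+        exp_alerts:
+          - exp_labels:
+              severity: warning
+              agent: agent-1
+            exp_annotations:
+              summary: Agent agent-1 offline
+              description: Agent has not sent heartbeat for > 2 minutes
+```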
+
+## Grafana Dashboards
+
+### Main Dashboard Panels
+
+1. **Deployment Pipeline Overview**
+   - Promotions per environment (time series)
+   - Success/failure rates (gauge)
+   - Active deployments (stat)
+
+2. **Agent Health**
+   - Connected agents (stat)
+   - Agent status distribution (pie chart)
+   - Heartbeat age (table)
+
+3. **Gate Performance**
+   - Gate evaluation counts (bar chart)
+   - Block rate by gate type (time series)
+   - Evaluation latency (heatmap)
+
+4. **API Performance**
+   - Request rate (time series)
+   - Error rate (time series)
+   - Latency distribution (heatmap)
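+
+The panels above map onto the metrics in this specification roughly as follows. These queries are illustrative starting points; the 5m windows are assumptions, and the actual dashboard definitions may differ.
+
+```promql
+# Promotions per environment (time series)
+sum by (env) (rate(stella_promotions_total[5m]))
+
+# Connected agents (stat)
+sum(stella_agents_connected)
+
+# Agent status distribution (pie chart)
+sum by (status) (stella_agents_by_status)
+
+# Heartbeat age (table)
+max by (agent) (stella_agent_heartbeat_age_seconds)
+
+# Request rate and 5xx error rate (time series)
+sum(rate(stella_http_requests_total[5m]))
+sum(rate(stella_http_requests_total{status=~"5.."}[5m]))
+
+# Latency distribution (heatmap, one series per bucket)
+sum by (le) (rate(stella_http_request_duration_seconds_bucket[5m]))
+```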
+
+## References
+
+- [Operations Overview](overview.md)
+- [Logging](logging.md)
+- [Tracing](tracing.md)
+- [Alerting](alerting.md)