release orchestrator pivot, architecture and planning
docs/modules/release-orchestrator/operations/metrics.md
# Metrics Specification

## Overview

Release Orchestrator exposes Prometheus-compatible metrics for monitoring deployment health, performance, and operational status.

## Core Metrics

### Release Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_releases_total` | counter | Total releases created | `tenant`, `status` |
| `stella_releases_active` | gauge | Currently active releases | `tenant`, `status` |
| `stella_release_components_count` | histogram | Components per release | `tenant` |

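As a rough illustration of what "Prometheus-compatible" means on the wire, the sketch below renders one counter family from the table above in the Prometheus text exposition format. It uses only the Python standard library; the helper names (`escape_label_value`, `render_counter`) and the tenant and status values are hypothetical and are not part of Release Orchestrator.

```python
def escape_label_value(value: str) -> str:
    # The text format requires backslash, double quote, and newline to be escaped.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family.

    `samples` maps a tuple of (label_name, label_value) pairs to a number.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical tenant and status values, for illustration only.
text = render_counter(
    "stella_releases_total",
    "Total releases created",
    {
        (("tenant", "acme"), ("status", "ready")): 12,
        (("tenant", "acme"), ("status", "draft")): 3,
    },
)
print(text)
```

Each distinct combination of label values becomes its own time series, which is why high-cardinality labels such as `agent` or `path` deserve some care.
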
### Promotion Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_promotions_total` | counter | Total promotions | `tenant`, `env`, `status` |
| `stella_promotions_in_progress` | gauge | Promotions currently in progress | `tenant`, `env` |
| `stella_promotion_duration_seconds` | histogram | Time from request to completion | `tenant`, `env`, `status` |
| `stella_approval_pending_count` | gauge | Pending approvals | `tenant`, `env` |
| `stella_approval_duration_seconds` | histogram | Time to approve | `tenant`, `env` |

### Deployment Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_deployments_total` | counter | Total deployments | `tenant`, `env`, `strategy`, `status` |
| `stella_deployment_duration_seconds` | histogram | Deployment duration | `tenant`, `env`, `strategy` |
| `stella_deployment_tasks_total` | counter | Total deployment tasks | `tenant`, `status` |
| `stella_deployment_task_duration_seconds` | histogram | Task duration | `target_type` |
| `stella_rollbacks_total` | counter | Total rollbacks | `tenant`, `env`, `reason` |

### Agent Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_agents_connected` | gauge | Connected agents | `tenant` |
| `stella_agents_by_status` | gauge | Agents by status | `tenant`, `status` |
| `stella_agent_tasks_total` | counter | Tasks executed by agents | `agent`, `type`, `status` |
| `stella_agent_task_duration_seconds` | histogram | Agent task duration | `agent`, `type` |
| `stella_agent_heartbeat_age_seconds` | gauge | Seconds since last heartbeat | `agent` |
| `stella_agent_resource_cpu_percent` | gauge | Agent CPU usage | `agent` |
| `stella_agent_resource_memory_percent` | gauge | Agent memory usage | `agent` |

### Workflow Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_workflow_runs_total` | counter | Workflow executions | `tenant`, `template`, `status` |
| `stella_workflow_runs_active` | gauge | Currently running workflows | `tenant`, `template` |
| `stella_workflow_duration_seconds` | histogram | Workflow duration | `template`, `status` |
| `stella_workflow_step_duration_seconds` | histogram | Step execution time | `step_type`, `status` |
| `stella_workflow_step_retries_total` | counter | Step retry count | `step_type` |

### Target Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_targets_total` | gauge | Total targets | `tenant`, `env`, `type` |
| `stella_targets_by_health` | gauge | Targets by health status | `tenant`, `env`, `health` |
| `stella_target_drift_detected` | gauge | Targets with drift | `tenant`, `env` |

### Integration Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_integrations_total` | gauge | Configured integrations | `tenant`, `type` |
| `stella_integration_health` | gauge | Integration health (1=healthy) | `tenant`, `integration` |
| `stella_integration_requests_total` | counter | Requests to integrations | `integration`, `operation`, `status` |
| `stella_integration_latency_seconds` | histogram | Integration request latency | `integration`, `operation` |

### Gate Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_gate_evaluations_total` | counter | Gate evaluations | `tenant`, `gate_type`, `result` |
| `stella_gate_evaluation_duration_seconds` | histogram | Gate evaluation time | `gate_type` |
| `stella_gate_blocks_total` | counter | Blocked promotions by gate | `tenant`, `gate_type`, `env` |

## API Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_http_requests_total` | counter | HTTP requests | `method`, `path`, `status` |
| `stella_http_request_duration_seconds` | histogram | Request latency | `method`, `path` |
| `stella_http_requests_in_flight` | gauge | Active requests | `method` |
| `stella_http_request_size_bytes` | histogram | Request size | `method`, `path` |
| `stella_http_response_size_bytes` | histogram | Response size | `method`, `path` |

## Evidence Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_evidence_packets_total` | counter | Evidence packets generated | `tenant`, `type` |
| `stella_evidence_packet_size_bytes` | histogram | Evidence packet size | `type` |
| `stella_evidence_verification_total` | counter | Evidence verifications | `result` |

## Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt

  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id
```

## Histogram Buckets

### Duration Buckets (seconds)

```yaml
# Short operations (API calls, gate evaluations)
short_duration_buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# Medium operations (workflow steps)
medium_duration_buckets: [0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300]

# Long operations (deployments)
long_duration_buckets: [1, 5, 10, 30, 60, 120, 300, 600, 1200, 3600]
```

### Size Buckets (bytes)

```yaml
# Request/response sizes
size_buckets: [100, 1000, 10000, 100000, 1000000, 10000000]

# Evidence packet sizes
evidence_buckets: [1000, 10000, 100000, 500000, 1000000, 5000000]
```

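To make the bucket lists above more concrete, here is a minimal sketch of how observations map onto cumulative `le` buckets, which is the shape Prometheus histograms take. It is not the orchestrator's instrumentation code; `cumulative_bucket_counts` and the sample observations are hypothetical.

```python
from bisect import bisect_left

# Bounds copied from short_duration_buckets above.
SHORT_DURATION_BUCKETS = [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]


def cumulative_bucket_counts(observations, bounds):
    """Return {le: count}, where each count includes all smaller buckets."""
    per_bucket = [0] * (len(bounds) + 1)  # the last slot is the implicit +Inf bucket
    for value in observations:
        # bisect_left finds the first bound >= value, i.e. the smallest
        # bucket whose upper bound (le) still contains the observation.
        per_bucket[bisect_left(bounds, value)] += 1

    counts = {}
    running = 0
    for bound, n in zip(bounds, per_bucket):
        running += n
        counts[str(bound)] = running
    counts["+Inf"] = running + per_bucket[-1]
    return counts


# Hypothetical request durations in seconds; 12 s exceeds every finite bound.
counts = cumulative_bucket_counts([0.005, 0.05, 0.3, 12], SHORT_DURATION_BUCKETS)
print(counts)
```

An observation larger than the top bound is only visible in `+Inf`, so a bucket list whose largest bound sits well below typical values loses most of its resolution.
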
## SLI Definitions

### Availability SLI

```promql
# API availability (99.9% target)
sum(rate(stella_http_requests_total{status!~"5.."}[5m]))
/
sum(rate(stella_http_requests_total[5m]))
```

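The arithmetic behind the 99.9% target can be sketched as follows. This is an illustration only: the function names and request counts are hypothetical, and the error-budget framing is an assumption layered on top of the SLI rather than something this specification defines.

```python
def availability(total_requests: float, server_errors: float) -> float:
    """Fraction of requests that did not end in a 5xx status."""
    if total_requests == 0:
        return float("nan")  # PromQL also yields NaN for 0/0
    return (total_requests - server_errors) / total_requests


def error_budget_remaining(total_requests: float, server_errors: float,
                           target: float = 0.999) -> float:
    """Share of the error budget still unspent (1.0 = untouched, < 0 = target missed)."""
    allowed_errors = (1.0 - target) * total_requests
    if allowed_errors == 0:
        return float("nan")
    return 1.0 - server_errors / allowed_errors


# Hypothetical window: one million requests, 400 of them 5xx responses.
print(availability(1_000_000, 400))
print(error_budget_remaining(1_000_000, 400))
```

At a 99.9% target, one request in a thousand may fail, so 400 failures out of a million use up 40% of the budget.
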
### Latency SLI

```promql
# API latency P99 < 500ms
histogram_quantile(0.99,
  sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le)
)
```

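`histogram_quantile` estimates a quantile by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for classic histograms, intended only to show why the result depends on the bucket bounds; it skips edge cases that Prometheus handles, and the bucket counts are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in +Inf: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts over a subset of the short duration bounds.
sample = [(0.1, 90), (0.25, 95), (0.5, 99), (1, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, sample)
p50 = histogram_quantile(0.5, sample)
print(p99, p50)
```

Because the estimate can never be more precise than the bucket it lands in, having a bound exactly at the 0.5 s objective (as `short_duration_buckets` does) keeps the SLI comparison meaningful.
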
### Deployment Success SLI

```promql
# Deployment success rate (99% target)
sum(rate(stella_deployments_total{status="succeeded"}[24h]))
/
sum(rate(stella_deployments_total[24h]))
```

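Since both sides of the ratio use the same 24h window, the ratio of rates equals the ratio of counter increases. The sketch below shows that computation on raw counter samples, including the way a counter reset (for example after a restart) is treated. It ignores the extrapolation Prometheus applies at window edges, and the sample values and function names are hypothetical.

```python
import math


def counter_increase(samples):
    """Total increase across consecutive samples, treating any drop as a reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value counts as new increase.
        total += (curr - prev) if curr >= prev else curr
    return total


def success_ratio(succeeded_samples, all_samples):
    """Ratio of succeeded deployments to all deployments in the window."""
    all_increase = counter_increase(all_samples)
    if all_increase == 0:
        return math.nan  # no deployments in the window
    return counter_increase(succeeded_samples) / all_increase


# Hypothetical scraped values of stella_deployments_total over one window.
print(counter_increase([10, 12, 15, 2, 4]))
print(success_ratio([100, 148, 199], [100, 150, 200]))
```

A window with no deployments yields NaN rather than 0 or 1, which is worth keeping in mind when the SLI feeds a dashboard or an alert threshold.
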
## Alert Rules

```yaml
groups:
  - name: stella-orchestrator
    rules:
      - alert: HighDeploymentFailureRate
        expr: |
          sum(rate(stella_deployments_total{status="failed"}[1h]))
          /
          sum(rate(stella_deployments_total[1h])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High deployment failure rate
          description: More than 10% of deployments failing in the last hour

      - alert: AgentOffline
        expr: stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Agent {{ $labels.agent }} offline
          description: Agent has not sent heartbeat for > 2 minutes

      # NOTE: stella_promotion_request_timestamp is not listed in the metric
      # tables above. This rule assumes it is exported as a Unix-timestamp
      # gauge carrying the same tenant and env labels as
      # stella_approval_pending_count, so that the `and` match succeeds.
      - alert: PendingApprovalsStale
        expr: |
          stella_approval_pending_count > 0
          and
          time() - stella_promotion_request_timestamp > 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Stale pending approvals
          description: Approvals pending for more than 1 hour

      - alert: IntegrationUnhealthy
        expr: stella_integration_health == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Integration {{ $labels.integration }} unhealthy
          description: Integration health check failing

      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le, path)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High API latency on {{ $labels.path }}
          description: P99 latency exceeds 1 second
```

## Grafana Dashboards

### Main Dashboard Panels

1. **Deployment Pipeline Overview**
   - Promotions per environment (time series)
   - Success/failure rates (gauge)
   - Active deployments (stat)

2. **Agent Health**
   - Connected agents (stat)
   - Agent status distribution (pie chart)
   - Heartbeat age (table)

3. **Gate Performance**
   - Gate evaluation counts (bar chart)
   - Block rate by gate type (time series)
   - Evaluation latency (heatmap)

4. **API Performance**
   - Request rate (time series)
   - Error rate (time series)
   - Latency distribution (heatmap)

## References

- [Operations Overview](overview.md)
- [Logging](logging.md)
- [Tracing](tracing.md)
- [Alerting](alerting.md)