# Metrics Specification

## Overview

Release Orchestrator exposes Prometheus-compatible metrics for monitoring deployment health, performance, and operational status.

## Core Metrics

### Release Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_releases_total` | counter | Total releases created | `tenant`, `status` |
| `stella_releases_active` | gauge | Currently active releases | `tenant`, `status` |
| `stella_release_components_count` | histogram | Components per release | `tenant` |

### Promotion Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_promotions_total` | counter | Total promotions | `tenant`, `env`, `status` |
| `stella_promotions_in_progress` | gauge | Promotions currently in progress | `tenant`, `env` |
| `stella_promotion_duration_seconds` | histogram | Time from request to completion | `tenant`, `env`, `status` |
| `stella_approval_pending_count` | gauge | Pending approvals | `tenant`, `env` |
| `stella_approval_duration_seconds` | histogram | Time to approve | `tenant`, `env` |

### Deployment Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_deployments_total` | counter | Total deployments | `tenant`, `env`, `strategy`, `status` |
| `stella_deployment_duration_seconds` | histogram | Deployment duration | `tenant`, `env`, `strategy` |
| `stella_deployment_tasks_total` | counter | Total deployment tasks | `tenant`, `status` |
| `stella_deployment_task_duration_seconds` | histogram | Task duration | `target_type` |
| `stella_rollbacks_total` | counter | Total rollbacks | `tenant`, `env`, `reason` |

### Agent Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_agents_connected` | gauge | Connected agents | `tenant` |
| `stella_agents_by_status` | gauge | Agents by status | `tenant`, `status` |
| `stella_agent_tasks_total` | counter | Tasks executed by agents | `agent`, `type`, `status` |
| `stella_agent_task_duration_seconds` | histogram | Agent task duration | `agent`, `type` |
| `stella_agent_heartbeat_age_seconds` | gauge | Seconds since last heartbeat | `agent` |
| `stella_agent_resource_cpu_percent` | gauge | Agent CPU usage | `agent` |
| `stella_agent_resource_memory_percent` | gauge | Agent memory usage | `agent` |

### Workflow Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_workflow_runs_total` | counter | Workflow executions | `tenant`, `template`, `status` |
| `stella_workflow_runs_active` | gauge | Currently running workflows | `tenant`, `template` |
| `stella_workflow_duration_seconds` | histogram | Workflow duration | `template`, `status` |
| `stella_workflow_step_duration_seconds` | histogram | Step execution time | `step_type`, `status` |
| `stella_workflow_step_retries_total` | counter | Step retry count | `step_type` |

### Target Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_targets_total` | gauge | Total targets | `tenant`, `env`, `type` |
| `stella_targets_by_health` | gauge | Targets by health status | `tenant`, `env`, `health` |
| `stella_target_drift_detected` | gauge | Targets with drift | `tenant`, `env` |

### Integration Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_integrations_total` | gauge | Configured integrations | `tenant`, `type` |
| `stella_integration_health` | gauge | Integration health (1=healthy) | `tenant`, `integration` |
| `stella_integration_requests_total` | counter | Requests to integrations | `integration`, `operation`, `status` |
| `stella_integration_latency_seconds` | histogram | Integration request latency | `integration`, `operation` |

### Gate Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_gate_evaluations_total` | counter | Gate evaluations | `tenant`, `gate_type`, `result` |
| `stella_gate_evaluation_duration_seconds` | histogram | Gate evaluation time | `gate_type` |
| `stella_gate_blocks_total` | counter | Blocked promotions by gate | `tenant`, `gate_type`, `env` |

## API Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_http_requests_total` | counter | HTTP requests | `method`, `path`, `status` |
| `stella_http_request_duration_seconds` | histogram | Request latency | `method`, `path` |
| `stella_http_requests_in_flight` | gauge | Active requests | `method` |
| `stella_http_request_size_bytes` | histogram | Request size | `method`, `path` |
| `stella_http_response_size_bytes` | histogram | Response size | `method`, `path` |

## Evidence Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_evidence_packets_total` | counter | Evidence packets generated | `tenant`, `type` |
| `stella_evidence_packet_size_bytes` | histogram | Evidence packet size | `type` |
| `stella_evidence_verification_total` | counter | Evidence verifications | `result` |

## Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt

  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id
```

## Histogram Buckets

### Duration Buckets (seconds)

```yaml
# Short operations (API calls, gate evaluations)
short_duration_buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# Medium operations (workflow steps)
medium_duration_buckets: [0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300]

# Long operations (deployments)
long_duration_buckets: [1, 5, 10, 30, 60, 120, 300, 600, 1200, 3600]
```

### Size Buckets (bytes)

```yaml
# Request/response sizes
size_buckets: [100, 1000, 10000, 100000, 1000000, 10000000]

# Evidence packet sizes
evidence_buckets: [1000, 10000, 100000, 500000, 1000000, 5000000]
```

## SLI Definitions

### Availability SLI

```promql
# API availability (99.9% target)
sum(rate(stella_http_requests_total{status!~"5.."}[5m]))
/
sum(rate(stella_http_requests_total[5m]))
```

### Latency SLI

```promql
# API latency P99 < 500ms
histogram_quantile(0.99,
  sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le)
)
```

### Deployment Success SLI

```promql
# Deployment success rate (99% target)
sum(rate(stella_deployments_total{status="succeeded"}[24h]))
/
sum(rate(stella_deployments_total[24h]))
```

## Alert Rules

```yaml
groups:
  - name: stella-orchestrator
    rules:
      - alert: HighDeploymentFailureRate
        expr: |
          sum(rate(stella_deployments_total{status="failed"}[1h]))
          /
          sum(rate(stella_deployments_total[1h])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High deployment failure rate
          description: More than 10% of deployments failing in the last hour

      - alert: AgentOffline
        expr: stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Agent {{ $labels.agent }} offline
          description: Agent has not sent heartbeat for > 2 minutes

      # Note: stella_promotion_request_timestamp is not listed in the metric
      # tables above; this rule assumes it is exported as a Unix timestamp in
      # seconds with labels that match stella_approval_pending_count.
      - alert: PendingApprovalsStale
        expr: |
          stella_approval_pending_count > 0
          and
          time() - stella_promotion_request_timestamp > 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Stale pending approvals
          description: Approvals pending for more than 1 hour

      - alert: IntegrationUnhealthy
        expr: stella_integration_health == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Integration {{ $labels.integration }} unhealthy
          description: Integration health check failing

      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le, path)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High API latency on {{ $labels.path }}
          description: P99 latency exceeds 1 second
```

## Grafana Dashboards

### Main Dashboard Panels

1. **Deployment Pipeline Overview**
   - Promotions per environment (time series)
   - Success/failure rates (gauge)
   - Active deployments (stat)

2. **Agent Health**
   - Connected agents (stat)
   - Agent status distribution (pie chart)
   - Heartbeat age (table)

3. **Gate Performance**
   - Gate evaluation counts (bar chart)
   - Block rate by gate type (time series)
   - Evaluation latency (heatmap)

4. **API Performance**
   - Request rate (time series)
   - Error rate (time series)
   - Latency distribution (heatmap)

## References

- [Operations Overview](overview.md)
- [Logging](logging.md)
- [Tracing](tracing.md)
- [Alerting](alerting.md)
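
## Appendix: Estimating P99 from Histogram Buckets

The Latency SLI and the `HighAPILatency` alert both rely on `histogram_quantile`, whose result depends on the bucket boundaries chosen for the histogram. The sketch below approximates that calculation in Python for the short duration buckets: it finds the bucket containing the target rank and interpolates linearly inside it. The function name `estimate_quantile` and the sample counts are hypothetical and exist only for illustration; this is a simplified model of classic cumulative buckets, not the Prometheus implementation.

```python
import math

# Short duration bucket upper bounds (seconds), as listed in this specification.
SHORT_DURATION_BUCKETS = [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# Hypothetical cumulative counts for 1000 requests: one entry per finite
# bucket above, plus a final entry for the implicit +Inf bucket.
SAMPLE_CUMULATIVE_COUNTS = [100, 300, 600, 850, 950, 985, 995, 999, 1000, 1000, 1000]


def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate the q-quantile from cumulative histogram bucket counts.

    upper_bounds holds the finite bucket bounds in ascending order.
    cumulative_counts has one extra trailing entry for the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if len(cumulative_counts) != len(upper_bounds) + 1:
        raise ValueError("expected one count per finite bucket plus +Inf")

    total = cumulative_counts[-1]
    if total == 0:
        return math.nan  # no observations, so no meaningful quantile

    rank = q * total
    for i, upper in enumerate(upper_bounds):
        if cumulative_counts[i] >= rank:
            # The lowest bucket is assumed to start at 0.
            lower = upper_bounds[i - 1] if i > 0 else 0.0
            below = cumulative_counts[i - 1] if i > 0 else 0
            in_bucket = cumulative_counts[i] - below
            if in_bucket == 0:
                return float(lower)
            # Linear interpolation inside the bucket that contains the rank.
            return lower + (upper - lower) * (rank - below) / in_bucket

    # The rank falls into the +Inf bucket: report the largest finite bound.
    return float(upper_bounds[-1])


if __name__ == "__main__":
    p99 = estimate_quantile(0.99, SHORT_DURATION_BUCKETS, SAMPLE_CUMULATIVE_COUNTS)
    print(f"estimated P99: {p99:.3f}s")  # roughly 0.75s for the sample counts
```

With these sample counts the 990th request falls in the bucket between 0.5 and 1 second, so the estimate lands midway at roughly 0.75 s and would miss the 500 ms target. The estimate can never be more precise than the bucket that contains it, which is why a boundary at the SLI threshold (0.5 here) is useful.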