# Metrics Specification

## Overview

Release Orchestrator exposes Prometheus-compatible metrics for monitoring deployment health, performance, and operational status.
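
As a rough illustration of what "Prometheus-compatible" means on the wire, the sketch below renders one counter in the Prometheus text exposition format. The helper names and the `acme` tenant value are hypothetical and are not part of the orchestrator's code; only the metric and label names come from the tables in this document.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict, value) -> str:
    """Render a single sample line such as: name{k="v"} 3"""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


def render_counter(name: str, help_text: str, samples: list) -> str:
    """Render HELP/TYPE metadata followed by one line per (labels, value) sample."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    lines += [render_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


# Hypothetical sample: three succeeded releases for a tenant called "acme".
print(render_counter(
    "stella_releases_total",
    "Total releases created",
    [({"tenant": "acme", "status": "succeeded"}, 3)],
))
```

A real service would normally rely on a Prometheus client library rather than hand-rolled rendering; the sketch only shows the shape of the scraped output.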

## Core Metrics

### Release Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_releases_total` | counter | Total releases created | `tenant`, `status` |
| `stella_releases_active` | gauge | Currently active releases | `tenant`, `status` |
| `stella_release_components_count` | histogram | Components per release | `tenant` |

### Promotion Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_promotions_total` | counter | Total promotions | `tenant`, `env`, `status` |
| `stella_promotions_in_progress` | gauge | Promotions currently in progress | `tenant`, `env` |
| `stella_promotion_duration_seconds` | histogram | Time from request to completion | `tenant`, `env`, `status` |
| `stella_approval_pending_count` | gauge | Pending approvals | `tenant`, `env` |
| `stella_approval_duration_seconds` | histogram | Time from approval request to decision | `tenant`, `env` |

### Deployment Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_deployments_total` | counter | Total deployments | `tenant`, `env`, `strategy`, `status` |
| `stella_deployment_duration_seconds` | histogram | Deployment duration | `tenant`, `env`, `strategy` |
| `stella_deployment_tasks_total` | counter | Total deployment tasks | `tenant`, `status` |
| `stella_deployment_task_duration_seconds` | histogram | Task duration | `target_type` |
| `stella_rollbacks_total` | counter | Total rollbacks | `tenant`, `env`, `reason` |

### Agent Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_agents_connected` | gauge | Connected agents | `tenant` |
| `stella_agents_by_status` | gauge | Agents by status | `tenant`, `status` |
| `stella_agent_tasks_total` | counter | Tasks executed by agents | `agent`, `type`, `status` |
| `stella_agent_task_duration_seconds` | histogram | Agent task duration | `agent`, `type` |
| `stella_agent_heartbeat_age_seconds` | gauge | Seconds since last heartbeat | `agent` |
| `stella_agent_resource_cpu_percent` | gauge | Agent CPU usage | `agent` |
| `stella_agent_resource_memory_percent` | gauge | Agent memory usage | `agent` |

### Workflow Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_workflow_runs_total` | counter | Workflow executions | `tenant`, `template`, `status` |
| `stella_workflow_runs_active` | gauge | Currently running workflows | `tenant`, `template` |
| `stella_workflow_duration_seconds` | histogram | Workflow duration | `template`, `status` |
| `stella_workflow_step_duration_seconds` | histogram | Step execution time | `step_type`, `status` |
| `stella_workflow_step_retries_total` | counter | Step retry count | `step_type` |

### Target Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_targets_total` | gauge | Total targets | `tenant`, `env`, `type` |
| `stella_targets_by_health` | gauge | Targets by health status | `tenant`, `env`, `health` |
| `stella_target_drift_detected` | gauge | Targets with drift | `tenant`, `env` |

### Integration Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_integrations_total` | gauge | Configured integrations | `tenant`, `type` |
| `stella_integration_health` | gauge | Integration health (1 = healthy, 0 = unhealthy) | `tenant`, `integration` |
| `stella_integration_requests_total` | counter | Requests to integrations | `integration`, `operation`, `status` |
| `stella_integration_latency_seconds` | histogram | Integration request latency | `integration`, `operation` |

### Gate Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_gate_evaluations_total` | counter | Gate evaluations | `tenant`, `gate_type`, `result` |
| `stella_gate_evaluation_duration_seconds` | histogram | Gate evaluation time | `gate_type` |
| `stella_gate_blocks_total` | counter | Blocked promotions by gate | `tenant`, `gate_type`, `env` |

## API Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_http_requests_total` | counter | HTTP requests | `method`, `path`, `status` |
| `stella_http_request_duration_seconds` | histogram | Request latency | `method`, `path` |
| `stella_http_requests_in_flight` | gauge | Active requests | `method` |
| `stella_http_request_size_bytes` | histogram | Request size | `method`, `path` |
| `stella_http_response_size_bytes` | histogram | Response size | `method`, `path` |

## Evidence Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_evidence_packets_total` | counter | Evidence packets generated | `tenant`, `type` |
| `stella_evidence_packet_size_bytes` | histogram | Evidence packet size | `type` |
| `stella_evidence_verification_total` | counter | Evidence verifications | `result` |

## Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt

  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id
```

## Histogram Buckets

### Duration Buckets (seconds)

```yaml
# Short operations (API calls, gate evaluations)
short_duration_buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# Medium operations (workflow steps)
medium_duration_buckets: [0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300]

# Long operations (deployments)
long_duration_buckets: [1, 5, 10, 30, 60, 120, 300, 600, 1200, 3600]
```
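
Prometheus histogram buckets are cumulative: the bucket with upper bound `le` counts every observation less than or equal to that bound, and an implicit `+Inf` bucket counts everything. The sketch below shows how a few hypothetical deployment durations would land in the long-operation buckets; the function name and the sample durations are illustrative assumptions.

```python
import bisect
import math

# Long-operation bounds copied from the configuration above (seconds).
LONG_DURATION_BUCKETS = [1, 5, 10, 30, 60, 120, 300, 600, 1200, 3600]


def cumulative_bucket_counts(observations, bounds):
    """Return {le: count}, where count includes every observation <= le."""
    per_bucket = [0] * (len(bounds) + 1)  # final slot is the +Inf bucket
    for value in observations:
        # bisect_left returns the index of the first bound >= value,
        # i.e. the smallest bucket the observation fits into.
        per_bucket[bisect.bisect_left(bounds, value)] += 1

    counts = {}
    running = 0
    for bound, n in zip(list(bounds) + [math.inf], per_bucket):
        running += n
        counts[bound] = running
    return counts


# Hypothetical durations: 0.5 s, 45 s, exactly 60 s, and 4000 s (over an hour).
counts = cumulative_bucket_counts([0.5, 45, 60, 4000], LONG_DURATION_BUCKETS)
for le, count in counts.items():
    print(f'le="{le}" {count}')
```

An observation above the largest finite bound (here 3600 s) is only visible in the `+Inf` bucket, which is worth keeping in mind when picking bounds for very slow deployments.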

### Size Buckets (bytes)

```yaml
# Request/response sizes
size_buckets: [100, 1000, 10000, 100000, 1000000, 10000000]

# Evidence packet sizes
evidence_buckets: [1000, 10000, 100000, 500000, 1000000, 5000000]
```

## SLI Definitions

### Availability SLI

```promql
# API availability (99.9% target)
sum(rate(stella_http_requests_total{status!~"5.."}[5m]))
/
sum(rate(stella_http_requests_total[5m]))
```
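
To make the 99.9% target concrete, the following sketch turns request counts into an availability ratio and the share of error budget still unspent. The function names and the request counts are hypothetical; only the 99.9% target comes from the SLI above.

```python
def availability(total_requests: float, server_errors: float):
    """Fraction of requests that did not return a 5xx status, or None without traffic."""
    if total_requests <= 0:
        return None  # the ratio is undefined when there were no requests
    return (total_requests - server_errors) / total_requests


def error_budget_remaining(total_requests: float, server_errors: float,
                           target: float = 0.999):
    """Share of the error budget left: 1.0 = untouched, 0.0 = spent, < 0 = overspent."""
    allowed_errors = (1 - target) * total_requests
    if allowed_errors <= 0:
        return None
    return 1 - server_errors / allowed_errors


# Hypothetical window: 100,000 requests, 50 of which returned 5xx.
print(availability(100_000, 50))            # close to 0.9995
print(error_budget_remaining(100_000, 50))  # close to 0.5, half the budget is left
```

With a 99.9% target, 100,000 requests allow roughly 100 server errors, so 50 errors consume about half of the budget for that window.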

### Latency SLI

```promql
# API latency P99 (target: < 500ms)
histogram_quantile(0.99,
  sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le)
)
```
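
`histogram_quantile` estimates a quantile by finding the bucket that contains the requested rank and interpolating linearly inside it. The sketch below is a simplified re-implementation for classic histograms, written for illustration only; it restricts `q` to the range (0, 1], assumes the lowest bucket starts at 0, and uses made-up bucket counts.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (le, cumulative_count) sorted by le, ending with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket plus the +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window

    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if upper == math.inf:
        # The rank falls beyond the largest finite bound, so that bound is reported.
        return buckets[-2][0]

    lower, below = (0.0, 0) if index == 0 else buckets[index - 1]
    # Linear interpolation within the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request-duration buckets (seconds) for 100 requests.
sample = [(0.1, 50), (0.25, 80), (0.5, 98), (1, 100), (math.inf, 100)]
print(histogram_quantile(0.99, sample))  # about 0.75 s, above the 500 ms target
```

Because the result is interpolated, its precision depends on bucket placement: a P99 that falls in the 0.5 to 1 second bucket can only be located approximately within that range.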

### Deployment Success SLI

```promql
# Deployment success rate (99% target)
sum(rate(stella_deployments_total{status="succeeded"}[24h]))
/
sum(rate(stella_deployments_total[24h]))
```

## Alert Rules

```yaml
groups:
  - name: stella-orchestrator
    rules:
      - alert: HighDeploymentFailureRate
        expr: |
          sum(rate(stella_deployments_total{status="failed"}[1h]))
          /
          sum(rate(stella_deployments_total[1h])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High deployment failure rate
          description: More than 10% of deployments failed in the last hour

      - alert: AgentOffline
        expr: stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Agent {{ $labels.agent }} offline
          description: Agent has not sent a heartbeat for more than 2 minutes

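      # NOTE: stella_promotion_request_timestamp is not listed in the metric
      # tables in this document. This rule assumes it is exported as a Unix
      # timestamp in seconds with the same `tenant` and `env` labels as
      # stella_approval_pending_count, so that `and` can match the series.
      - alert: PendingApprovalsStale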
        expr: |
          stella_approval_pending_count > 0
          and
          time() - stella_promotion_request_timestamp > 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Stale pending approvals
          description: Approvals pending for more than 1 hour

      - alert: IntegrationUnhealthy
        expr: stella_integration_health == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Integration {{ $labels.integration }} unhealthy
          description: Integration health check failing

      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le, path)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High API latency on {{ $labels.path }}
          description: P99 latency exceeds 1 second
```
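
The `for:` clause means a rule must stay true across consecutive evaluations for the given duration before the alert fires; until then it is pending. The sketch below models that behaviour for `AgentOffline` with a hypothetical 60-second evaluation interval and made-up heartbeat ages. It is a simplified model of the state transitions, not Prometheus code.

```python
def alert_states(samples, threshold=120, hold_seconds=120):
    """Return the alert state at each evaluation.

    samples: list of (eval_time_seconds, heartbeat_age_seconds) in time order.
    threshold and hold_seconds mirror `> 120` and `for: 2m` in the rule above.
    """
    states = []
    active_since = None  # time at which the expression first became true
    for eval_time, age in samples:
        if age > threshold:
            if active_since is None:
                active_since = eval_time
            held_for = eval_time - active_since
            states.append("firing" if held_for >= hold_seconds else "pending")
        else:
            # Any evaluation where the expression is false resets the timer.
            active_since = None
            states.append("inactive")
    return states


# Hypothetical agent: last heartbeat shortly before t=0, recovers before t=240.
timeline = [(0, 30), (60, 130), (120, 190), (180, 250), (240, 10)]
print(alert_states(timeline))
# ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

Under these assumptions the alert fires once the heartbeat age has exceeded 120 seconds for a further 2 minutes, so an agent is reported offline roughly 4 minutes after its last heartbeat.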

## Grafana Dashboards

### Main Dashboard Panels

1. **Deployment Pipeline Overview**
   - Promotions per environment (time series)
   - Success/failure rates (gauge)
   - Active deployments (stat)

2. **Agent Health**
   - Connected agents (stat)
   - Agent status distribution (pie chart)
   - Heartbeat age (table)

3. **Gate Performance**
   - Gate evaluation counts (bar chart)
   - Block rate by gate type (time series)
   - Evaluation latency (heatmap)

4. **API Performance**
   - Request rate (time series)
   - Error rate (time series)
   - Latency distribution (heatmap)

## References

- [Operations Overview](overview.md)
- [Logging](logging.md)
- [Tracing](tracing.md)
- [Alerting](alerting.md)