Metrics Specification

Overview

Release Orchestrator exposes Prometheus-compatible metrics for monitoring deployment health, performance, and operational status.
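
For orientation, a Prometheus scrape returns samples in the text exposition format, one `name{labels} value` line per series. The sketch below renders such a line by hand. The `format_sample` helper and the label values (`acme`, `deployed`) are illustrative assumptions, not part of Release Orchestrator.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample line: name{label="value",...} value."""
    if labels:
        pairs = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels.items())
        return f"{name}{{{pairs}}} {value:g}"
    return f"{name} {value:g}"


# Hypothetical tenant and status values, for illustration only.
line = format_sample("stella_releases_total", {"tenant": "acme", "status": "deployed"}, 42)
print(line)
```

Each distinct combination of label values produces its own series, which is why the Labels column in the tables below matters for cardinality.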

Core Metrics

Release Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| stella_releases_total | counter | Total releases created | tenant, status |
| stella_releases_active | gauge | Currently active releases | tenant, status |
| stella_release_components_count | histogram | Components per release | tenant |

Promotion Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| stella_promotions_total | counter | Total promotions | tenant, env, status |
| stella_promotions_in_progress | gauge | Promotions currently in progress | tenant, env |
| stella_promotion_duration_seconds | histogram | Time from request to completion | tenant, env, status |
| stella_approval_pending_count | gauge | Pending approvals | tenant, env |
| stella_approval_duration_seconds | histogram | Time to approve | tenant, env |

Deployment Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| stella_deployments_total | counter | Total deployments | tenant, env, strategy, status |
| stella_deployment_duration_seconds | histogram | Deployment duration | tenant, env, strategy |
| stella_deployment_tasks_total | counter | Total deployment tasks | tenant, status |
| stella_deployment_task_duration_seconds | histogram | Task duration | target_type |
| stella_rollbacks_total | counter | Total rollbacks | tenant, env, reason |

Agent Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| stella_agents_connected | gauge | Connected agents | tenant |
| stella_agents_by_status | gauge | Agents by status | tenant, status |
| stella_agent_tasks_total | counter | Tasks executed by agents | agent, type, status |
| stella_agent_task_duration_seconds | histogram | Agent task duration | agent, type |
| stella_agent_heartbeat_age_seconds | gauge | Seconds since last heartbeat | agent |
| stella_agent_resource_cpu_percent | gauge | Agent CPU usage | agent |
| stella_agent_resource_memory_percent | gauge | Agent memory usage | agent |

Workflow Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| stella_workflow_runs_total | counter | Workflow executions | tenant, template, status |
| stella_workflow_runs_active | gauge | Currently running workflows | tenant, template |
| stella_workflow_duration_seconds | histogram | Workflow duration | template, status |
| stella_workflow_step_duration_seconds | histogram | Step execution time | step_type, status |
| stella_workflow_step_retries_total | counter | Step retry count | step_type |

Target Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| stella_targets_total | gauge | Total targets | tenant, env, type |
| stella_targets_by_health | gauge | Targets by health status | tenant, env, health |
| stella_target_drift_detected | gauge | Targets with drift | tenant, env |

Integration Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| stella_integrations_total | gauge | Configured integrations | tenant, type |
| stella_integration_health | gauge | Integration health (1=healthy) | tenant, integration |
| stella_integration_requests_total | counter | Requests to integrations | integration, operation, status |
| stella_integration_latency_seconds | histogram | Integration request latency | integration, operation |

Gate Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| stella_gate_evaluations_total | counter | Gate evaluations | tenant, gate_type, result |
| stella_gate_evaluation_duration_seconds | histogram | Gate evaluation time | gate_type |
| stella_gate_blocks_total | counter | Blocked promotions by gate | tenant, gate_type, env |

API Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| stella_http_requests_total | counter | HTTP requests | method, path, status |
| stella_http_request_duration_seconds | histogram | Request latency | method, path |
| stella_http_requests_in_flight | gauge | Active requests | method |
| stella_http_request_size_bytes | histogram | Request size | method, path |
| stella_http_response_size_bytes | histogram | Response size | method, path |

Evidence Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| stella_evidence_packets_total | counter | Evidence packets generated | tenant, type |
| stella_evidence_packet_size_bytes | histogram | Evidence packet size | type |
| stella_evidence_verification_total | counter | Evidence verifications | result |

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt

  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id

Histogram Buckets

Duration Buckets (seconds)

# Short operations (API calls, gate evaluations)
short_duration_buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# Medium operations (workflow steps)
medium_duration_buckets: [0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300]

# Long operations (deployments)
long_duration_buckets: [1, 5, 10, 30, 60, 120, 300, 600, 1200, 3600]
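
Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its `le` bound, and an implicit `+Inf` bucket counts all observations. The sketch below illustrates this with the short duration buckets. The sample durations and the `cumulative_bucket_counts` helper are hypothetical.

```python
import math

SHORT_DURATION_BUCKETS = [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]


def cumulative_bucket_counts(observations, bounds):
    """Return {upper_bound: number of observations <= upper_bound}, including +Inf."""
    counts = {bound: 0 for bound in bounds}
    counts[math.inf] = 0
    for value in observations:
        for bound in counts:
            if value <= bound:
                counts[bound] += 1
    return counts


# Hypothetical gate evaluation durations in seconds.
durations = [0.004, 0.03, 0.03, 0.2, 12.0]
counts = cumulative_bucket_counts(durations, SHORT_DURATION_BUCKETS)
for bound, count in counts.items():
    label = "+Inf" if math.isinf(bound) else f"{bound:g}"
    print(f'le="{label}" {count}')
```

The 12-second observation lands only in the `+Inf` bucket, so a quantile estimate for it cannot exceed the highest finite bound. Operations that regularly run past 10 seconds fit the medium or long bucket sets better.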

Size Buckets (bytes)

# Request/response sizes
size_buckets: [100, 1000, 10000, 100000, 1000000, 10000000]

# Evidence packet sizes
evidence_buckets: [1000, 10000, 100000, 500000, 1000000, 5000000]
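
Bucket count also drives series count. A classic Prometheus histogram exposes one series per bucket (including the implicit `+Inf` bucket) plus `_sum` and `_count`, for every label combination. The sketch below estimates that footprint; the method and path counts are made-up inputs.

```python
SIZE_BUCKETS = [100, 1000, 10000, 100000, 1000000, 10000000]


def series_per_label_set(bounds):
    """Series one classic histogram exposes for a single label combination."""
    bucket_series = len(bounds) + 1  # explicit bounds plus the implicit +Inf bucket
    return bucket_series + 2  # plus the _sum and _count series


def total_series(bounds, label_combinations):
    """Total series for a histogram across all label combinations."""
    return series_per_label_set(bounds) * label_combinations


# Hypothetical: 4 HTTP methods x 25 paths for stella_http_request_size_bytes.
print(total_series(SIZE_BUCKETS, 4 * 25))
```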

SLI Definitions

Availability SLI

# API availability (99.9% target)
sum(rate(stella_http_requests_total{status!~"5.."}[5m]))
/
sum(rate(stella_http_requests_total[5m]))
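
The same ratio can be worked through offline from raw request counts. This is a minimal sketch, assuming the 99.9% target applies to a single counting window. The function names and sample numbers are hypothetical, and treating a window with no traffic as fully available is an assumption (the PromQL expression yields no usable value in that case).

```python
AVAILABILITY_TARGET = 0.999  # 99.9% target from the SLI above


def availability(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return (total_requests - server_errors) / total_requests


def error_budget_remaining(total_requests: int, server_errors: int,
                           target: float = AVAILABILITY_TARGET) -> float:
    """Fraction of the allowed 5xx responses still unspent (negative if overspent)."""
    allowed_errors = total_requests * (1 - target)
    if allowed_errors == 0:
        return 1.0 if server_errors == 0 else float("-inf")
    return 1 - server_errors / allowed_errors


# Hypothetical window: 100,000 requests, 40 of them 5xx.
print(availability(100_000, 40))
print(error_budget_remaining(100_000, 40))
```

With these numbers the target allows about 100 failed requests, so 40 failures leave roughly 60% of the budget.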

Latency SLI

# API latency P99 < 500ms
histogram_quantile(0.99,
  sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le)
)
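
`histogram_quantile` estimates a quantile by finding the bucket that contains the target rank and interpolating linearly inside it. The sketch below is a simplified re-implementation for illustration, assuming `0 < q <= 1` and at least one finite bucket. It is not the Prometheus source, and the bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound and
    ending with the (math.inf, total) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, count_below = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = cumulative - count_below
            fraction = (rank - count_below) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, count_below = upper_bound, cumulative
    return math.nan


# Hypothetical cumulative counts for 1,000 requests over the short duration buckets.
example = [(0.01, 100), (0.025, 300), (0.05, 600), (0.1, 850), (0.25, 950),
           (0.5, 985), (1, 995), (2.5, 999), (5, 1000), (10, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, example)
print(p99)
```

Here the rank 990 falls in the 0.5 to 1 second bucket, so the estimate is about 0.75 seconds and misses the 500 ms target. The result is only as precise as the bucket boundaries around the threshold.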

Deployment Success SLI

# Deployment success rate (99% target)
sum(rate(stella_deployments_total{status="succeeded"}[24h]))
/
sum(rate(stella_deployments_total[24h]))

Alert Rules

groups:
  - name: stella-orchestrator
    rules:
      - alert: HighDeploymentFailureRate
        expr: |
          sum(rate(stella_deployments_total{status="failed"}[1h]))
          /
          sum(rate(stella_deployments_total[1h])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High deployment failure rate
          description: More than 10% of deployments failed in the last hour

      - alert: AgentOffline
        expr: stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Agent {{ $labels.agent }} offline
          description: Agent has not sent a heartbeat for more than 2 minutes

      - alert: PendingApprovalsStale
        # Requires stella_promotion_request_timestamp (Unix seconds), which the
        # Promotion Metrics table does not list.
        expr: |
          stella_approval_pending_count > 0
          and
          time() - stella_promotion_request_timestamp > 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Stale pending approvals
          description: Approvals pending for more than 1 hour

      - alert: IntegrationUnhealthy
        expr: stella_integration_health == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Integration {{ $labels.integration }} unhealthy
          description: Integration health check failing

      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le, path)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High API latency on {{ $labels.path }}
          description: P99 latency exceeds 1 second
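
The `for` clause delays firing: an alert turns pending at the first evaluation where its expression holds, and fires once the expression has held at every evaluation for the full duration. The sketch below simulates this for a threshold like the one in `AgentOffline`, assuming the 15-second `evaluation_interval` from the Prometheus configuration. The `first_firing_time` helper and the sample data are hypothetical.

```python
EVALUATION_INTERVAL = 15  # seconds, matching evaluation_interval in prometheus.yml


def first_firing_time(samples, threshold, for_seconds):
    """Return the evaluation time at which the alert starts firing, or None.

    samples: list of (evaluation_time_seconds, value) in time order.
    """
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # alert becomes pending
            if timestamp - active_since >= for_seconds:
                return timestamp  # condition held for the whole `for` window
        else:
            active_since = None  # condition cleared, pending state resets
    return None


# Hypothetical agent that stops sending heartbeats at t=0, so heartbeat age equals t.
samples = [(t, float(t)) for t in range(0, 301, EVALUATION_INTERVAL)]
print(first_firing_time(samples, threshold=120, for_seconds=120))
```

In this simulation the expression first holds at t=135 and the alert fires at t=255, so the threshold and the `for` duration add up: the agent has been silent for over four minutes by the time the alert fires.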

Grafana Dashboards

Main Dashboard Panels

  1. Deployment Pipeline Overview
    • Promotions per environment (time series)
    • Success/failure rates (gauge)
    • Active deployments (stat)

  2. Agent Health
    • Connected agents (stat)
    • Agent status distribution (pie chart)
    • Heartbeat age (table)

  3. Gate Performance
    • Gate evaluation counts (bar chart)
    • Block rate by gate type (time series)
    • Evaluation latency (heatmap)

  4. API Performance
    • Request rate (time series)
    • Error rate (time series)
    • Latency distribution (heatmap)