master d509c44411 release orchestrator pivot, architecture and planning

2026-01-10 22:37:22 +02:00

Metrics Specification

Overview

Release Orchestrator exposes Prometheus-compatible metrics for monitoring deployment health, performance, and operational status.

Core Metrics

Release Metrics

Metric	Type	Description	Labels
`stella_releases_total`	counter	Total releases created	`tenant`, `status`
`stella_releases_active`	gauge	Currently active releases	`tenant`, `status`
`stella_release_components_count`	histogram	Components per release	`tenant`
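
As a rough illustration of what "Prometheus-compatible" means on the wire, the sketch below renders `stella_releases_total` in the Prometheus text exposition format. The `tenant` and `status` values are hypothetical, and the helper names `escape_label_value` and `render_counter` are invented for this example; a real service would normally use a Prometheus client library rather than formatting lines by hand.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family; `samples` maps tuples of (label, value) pairs to numbers."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical tenant and status values, for illustration only.
samples = {
    (("tenant", "acme"), ("status", "deployed")): 42,
    (("tenant", "acme"), ("status", "failed")): 3,
}
exposition = render_counter("stella_releases_total", "Total releases created", samples)
print(exposition)
```

Each distinct combination of label values becomes its own time series, which is why the label sets in these tables are kept small.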

Promotion Metrics

Metric	Type	Description	Labels
`stella_promotions_total`	counter	Total promotions	`tenant`, `env`, `status`
`stella_promotions_in_progress`	gauge	Promotions currently in progress	`tenant`, `env`
`stella_promotion_duration_seconds`	histogram	Time from request to completion	`tenant`, `env`, `status`
`stella_approval_pending_count`	gauge	Pending approvals	`tenant`, `env`
`stella_approval_duration_seconds`	histogram	Time from approval request to approval decision	`tenant`, `env`

Deployment Metrics

Metric	Type	Description	Labels
`stella_deployments_total`	counter	Total deployments	`tenant`, `env`, `strategy`, `status`
`stella_deployment_duration_seconds`	histogram	Deployment duration	`tenant`, `env`, `strategy`
`stella_deployment_tasks_total`	counter	Total deployment tasks	`tenant`, `status`
`stella_deployment_task_duration_seconds`	histogram	Task duration	`target_type`
`stella_rollbacks_total`	counter	Total rollbacks	`tenant`, `env`, `reason`

Agent Metrics

Metric	Type	Description	Labels
`stella_agents_connected`	gauge	Connected agents	`tenant`
`stella_agents_by_status`	gauge	Agents by status	`tenant`, `status`
`stella_agent_tasks_total`	counter	Tasks executed by agents	`agent`, `type`, `status`
`stella_agent_task_duration_seconds`	histogram	Agent task duration	`agent`, `type`
`stella_agent_heartbeat_age_seconds`	gauge	Seconds since last heartbeat	`agent`
`stella_agent_resource_cpu_percent`	gauge	Agent CPU usage	`agent`
`stella_agent_resource_memory_percent`	gauge	Agent memory usage	`agent`

Workflow Metrics

Metric	Type	Description	Labels
`stella_workflow_runs_total`	counter	Workflow executions	`tenant`, `template`, `status`
`stella_workflow_runs_active`	gauge	Currently running workflows	`tenant`, `template`
`stella_workflow_duration_seconds`	histogram	Workflow duration	`template`, `status`
`stella_workflow_step_duration_seconds`	histogram	Step execution time	`step_type`, `status`
`stella_workflow_step_retries_total`	counter	Step retry count	`step_type`

Target Metrics

Metric	Type	Description	Labels
`stella_targets_total`	gauge	Total targets	`tenant`, `env`, `type`
`stella_targets_by_health`	gauge	Targets by health status	`tenant`, `env`, `health`
`stella_target_drift_detected`	gauge	Targets with drift	`tenant`, `env`

Integration Metrics

Metric	Type	Description	Labels
`stella_integrations_total`	gauge	Configured integrations	`tenant`, `type`
`stella_integration_health`	gauge	Integration health (1 = healthy, 0 = unhealthy)	`tenant`, `integration`
`stella_integration_requests_total`	counter	Requests to integrations	`integration`, `operation`, `status`
`stella_integration_latency_seconds`	histogram	Integration request latency	`integration`, `operation`

Gate Metrics

Metric	Type	Description	Labels
`stella_gate_evaluations_total`	counter	Gate evaluations	`tenant`, `gate_type`, `result`
`stella_gate_evaluation_duration_seconds`	histogram	Gate evaluation time	`gate_type`
`stella_gate_blocks_total`	counter	Blocked promotions by gate	`tenant`, `gate_type`, `env`

API Metrics

Metric	Type	Description	Labels
`stella_http_requests_total`	counter	HTTP requests	`method`, `path`, `status`
`stella_http_request_duration_seconds`	histogram	Request latency	`method`, `path`
`stella_http_requests_in_flight`	gauge	Active requests	`method`
`stella_http_request_size_bytes`	histogram	Request size	`method`, `path`
`stella_http_response_size_bytes`	histogram	Response size	`method`, `path`

Evidence Metrics

Metric	Type	Description	Labels
`stella_evidence_packets_total`	counter	Evidence packets generated	`tenant`, `type`
`stella_evidence_packet_size_bytes`	histogram	Evidence packet size	`type`
`stella_evidence_verification_total`	counter	Evidence verifications	`result`

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt

  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id

Histogram Buckets

Duration Buckets (seconds)

# Short operations (API calls, gate evaluations)
short_duration_buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# Medium operations (workflow steps)
medium_duration_buckets: [0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300]

# Long operations (deployments)
long_duration_buckets: [1, 5, 10, 30, 60, 120, 300, 600, 1200, 3600]

Size Buckets (bytes)

# Request/response sizes
size_buckets: [100, 1000, 10000, 100000, 1000000, 10000000]

# Evidence packet sizes
evidence_buckets: [1000, 10000, 100000, 500000, 1000000, 5000000]
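
Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound (`le`), and an implicit `+Inf` bucket counts all observations. The sketch below shows how a few hypothetical latency observations would land in the short-duration buckets listed above; the function name `cumulative_bucket_counts` is an assumption for this example.

```python
import math
from bisect import bisect_left

SHORT_DURATION_BUCKETS = [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]


def cumulative_bucket_counts(observations, bounds):
    """Return (upper_bound, cumulative_count) pairs, ending with the +Inf bucket."""
    per_bucket = [0] * (len(bounds) + 1)  # the last slot is the +Inf bucket
    for value in observations:
        # bisect_left returns the index of the first bound >= value,
        # which matches the inclusive `le` semantics.
        per_bucket[bisect_left(bounds, value)] += 1
    result = []
    running = 0
    for bound, count in zip(list(bounds) + [math.inf], per_bucket):
        running += count
        result.append((bound, running))
    return result


# Hypothetical request durations in seconds.
counts = cumulative_bucket_counts([0.004, 0.03, 0.1, 0.3, 12], SHORT_DURATION_BUCKETS)
for bound, count in counts:
    print(f'le="{bound}" {count}')
```

An observation of 12 seconds exceeds the largest finite bound, so it is only visible in the `+Inf` bucket; this is one reason to choose bounds that cover the expected range of the operation.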

SLI Definitions

Availability SLI

# API availability (99.9% target)
sum(rate(stella_http_requests_total{status!~"5.."}[5m]))
/
sum(rate(stella_http_requests_total[5m]))
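
To make the arithmetic behind this ratio concrete, the sketch below computes the same quantity from per-status request counts over a window. The counts are hypothetical, and the function names are assumptions for this example.

```python
import math
import re


def availability(requests_by_status):
    """Fraction of requests whose status does not match the pattern 5.."""
    total = sum(requests_by_status.values())
    if total == 0:
        return math.nan  # no traffic in the window: the ratio is undefined
    good = sum(
        count
        for status, count in requests_by_status.items()
        if re.fullmatch("5..", status) is None
    )
    return good / total


def meets_target(requests_by_status, target=0.999):
    """Compare the SLI against the 99.9% availability target."""
    return availability(requests_by_status) >= target


# Hypothetical request counts over one window.
window = {"200": 9990, "404": 5, "500": 3, "503": 2}
print(availability(window), meets_target(window))
```

Client errors such as 404 count as "good" here, mirroring the `status!~"5.."` matcher in the query.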

Latency SLI

# API latency P99 (target: < 500 ms)
histogram_quantile(0.99,
  sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le)
)
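
`histogram_quantile` estimates a quantile by locating the bucket that contains the target rank and interpolating linearly inside it, so the result is only as precise as the bucket boundaries. The following is a simplified sketch of that estimation, assuming cumulative counts over the short-duration buckets; the counts are hypothetical and the function is not Prometheus's own implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket plus +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if upper == math.inf:
        return buckets[-2][0]  # fall back to the highest finite bound
    lower, below = buckets[index - 1] if index > 0 else (0.0, 0)
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative counts for stella_http_request_duration_seconds.
buckets = [
    (0.01, 120), (0.025, 400), (0.05, 700), (0.1, 900), (0.25, 970),
    (0.5, 990), (1, 998), (2.5, 1000), (5, 1000), (10, 1000), (math.inf, 1000),
]
p99 = histogram_quantile(0.99, buckets)
p50 = histogram_quantile(0.5, buckets)
print(p50, p99)
```

Because a 500 ms target coincides with the `0.5` bucket bound, the estimate near that threshold is as accurate as the histogram allows; a target that falls between two bounds would depend on the interpolation.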

Deployment Success SLI

# Deployment success rate (99% target)
sum(rate(stella_deployments_total{status="succeeded"}[24h]))
/
sum(rate(stella_deployments_total[24h]))
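
Both sides of this ratio use `rate()` over the same window, so the window length cancels and the result equals the ratio of counter increases. The sketch below shows a simplified increase calculation that compensates for counter resets (for example after a process restart). It omits the extrapolation to window boundaries that Prometheus performs, and the sample values are hypothetical.

```python
import math


def increase(samples):
    """Total increase across consecutive counter samples, tolerating resets."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter dropped, so it restarted from zero;
            # everything it has counted since then is new.
            total += current
    return total


def success_ratio(succeeded_samples, all_samples):
    """Ratio of increases; NaN when no deployments happened in the window."""
    denominator = increase(all_samples)
    if denominator == 0:
        return math.nan
    return increase(succeeded_samples) / denominator


# Hypothetical samples of stella_deployments_total over one window.
ratio = success_ratio([100, 148, 198], [100, 150, 200])
print(ratio)
```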

Alert Rules

groups:
  - name: stella-orchestrator
    rules:
      - alert: HighDeploymentFailureRate
        expr: |
          sum(rate(stella_deployments_total{status="failed"}[1h]))
          /
          sum(rate(stella_deployments_total[1h])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High deployment failure rate
          description: More than 10% of deployments failing in the last hour

      - alert: AgentOffline
        expr: stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Agent {{ $labels.agent }} offline
          description: Heartbeat age has stayed above 120 seconds for at least 2 minutes

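      # Note: stella_promotion_request_timestamp is not listed in the metric
      # tables of this specification. This rule assumes it is exported as a
      # gauge holding the Unix time of the pending request, with the same
      # label set as stella_approval_pending_count, because `and` only
      # matches series whose labels are identical.
      - alert: PendingApprovalsStale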
        expr: |
          stella_approval_pending_count > 0
          and
          time() - stella_promotion_request_timestamp > 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Stale pending approvals
          description: Approvals pending for more than 1 hour

      - alert: IntegrationUnhealthy
        expr: stella_integration_health == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Integration {{ $labels.integration }} unhealthy
          description: Integration health check failing

      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le, path)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High API latency on {{ $labels.path }}
          description: P99 latency exceeds 1 second
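
The `for` clause in these rules delays firing: an alert is pending while its expression is true and only fires once the expression has stayed true for the whole `for` duration. The sketch below models that behaviour for the `AgentOffline` rule, assuming one evaluation every 60 seconds and hypothetical heartbeat ages; it is a simplification of Prometheus's rule evaluation, and `alert_states` is a name invented for this example.

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return the alert state at each evaluation: inactive, pending, or firing."""
    states = []
    active_since = None  # time at which the expression last became true
    for step, value in enumerate(values):
        now = step * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# Hypothetical stella_agent_heartbeat_age_seconds values, one per evaluation.
ages = [30, 130, 190, 250, 20]
states = alert_states(ages, threshold=120, for_seconds=120, interval_seconds=60)
print(states)
```

With a 120-second threshold and `for: 2m`, an agent has typically been silent for about four minutes by the time the alert fires.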

Grafana Dashboards

Main Dashboard Panels

Deployment Pipeline Overview

- Promotions per environment (time series)
- Success/failure rates (gauge)
- Active deployments (stat)

Agent Health

- Connected agents (stat)
- Agent status distribution (pie chart)
- Heartbeat age (table)

Gate Performance

- Gate evaluation counts (bar chart)
- Block rate by gate type (time series)
- Evaluation latency (heatmap)

API Performance

- Request rate (time series)
- Error rate (time series)
- Latency distribution (heatmap)
