Operations Overview

Observability Stack

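Release Orchestrator emits four observability signals: metrics, structured logs, distributed traces, and events. Each signal is exported to its own backend, and the backends are visualized together in Grafana.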

              OBSERVABILITY ARCHITECTURE

  ┌─────────────────────────────────────────────────────────────────────────────┐
  │                     RELEASE ORCHESTRATOR                                     │
  │                                                                             │
  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
  │  │ Metrics     │  │ Logs        │  │ Traces      │  │ Events      │        │
  │  │ Exporter    │  │ Collector   │  │ Exporter    │  │ Publisher   │        │
  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
  │         │                │                │                │                │
  └─────────┼────────────────┼────────────────┼────────────────┼────────────────┘
            │                │                │                │
            ▼                ▼                ▼                ▼
  ┌─────────────────────────────────────────────────────────────────────────────┐
  │                      OBSERVABILITY BACKENDS                                  │
  │                                                                             │
  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
  │  │ Prometheus  │  │ Loki /      │  │ Jaeger /    │  │ Event       │        │
  │  │ / Mimir     │  │Elasticsearch│  │ Tempo       │  │ Bus         │        │
  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
  │         │                │                │                │                │
  │         └────────────────┴────────────────┴────────────────┘                │
  │                                    │                                        │
  │                                    ▼                                        │
  │                           ┌─────────────────┐                               │
  │                           │    Grafana      │                               │
  │                           │   Dashboards    │                               │
  │                           └─────────────────┘                               │
  │                                                                             │
  └─────────────────────────────────────────────────────────────────────────────┘

Metrics

Core Metrics

Metric                                 Type       Description             Labels
stella_releases_total                  counter    Total releases created  tenant, status
stella_promotions_total                counter    Total promotions        tenant, env, status
stella_deployments_total               counter    Total deployments       tenant, env, strategy
stella_deployment_duration_seconds     histogram  Deployment duration     tenant, env, strategy
stella_rollbacks_total                 counter    Total rollbacks         tenant, env, reason
stella_agents_connected                gauge      Connected agents        tenant
stella_targets_total                   gauge      Total targets           tenant, env, type
stella_workflow_runs_total             counter    Workflow executions     tenant, template, status
stella_workflow_step_duration_seconds  histogram  Step execution time     step_type
stella_approval_pending_count          gauge      Pending approvals       tenant, env
stella_approval_duration_seconds       histogram  Time to approve         tenant, env

API Metrics

Metric                                 Type       Description             Labels
stella_http_requests_total             counter    HTTP requests           method, path, status
stella_http_request_duration_seconds   histogram  Request latency         method, path
stella_http_requests_in_flight         gauge      Active requests         method

Agent Metrics

Metric                                 Type       Description             Labels
stella_agent_tasks_total               counter    Tasks executed          agent, type, status
stella_agent_task_duration_seconds     histogram  Task duration           agent, type
stella_agent_heartbeat_age_seconds     gauge      Since last heartbeat    agent
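
To illustrate how the metric names and labels in the tables above become individual time series, the sketch below renders counter samples in the Prometheus text exposition format. It is a hand-rolled illustration rather than the orchestrator's actual exporter; `renderCounter`, `escapeLabelValue`, and the sample values are assumptions made for this example.

```typescript
// Hypothetical helper, not part of Release Orchestrator: renders counter
// samples in the Prometheus text exposition format.
type Labels = Record<string, string>;

interface Sample {
  labels: Labels;
  value: number;
}

// Label values must escape backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

function renderCounter(metric: string, help: string, samples: Sample[]): string {
  const lines = [`# HELP ${metric} ${help}`, `# TYPE ${metric} counter`];
  for (const sample of samples) {
    const pairs = Object.keys(sample.labels)
      .map((key) => `${key}="${escapeLabelValue(sample.labels[key])}"`)
      .join(',');
    // Each distinct label combination is its own series line.
    lines.push(pairs ? `${metric}{${pairs}} ${sample.value}` : `${metric} ${sample.value}`);
  }
  return lines.join('\n') + '\n';
}

// Illustrative values only.
const agentTasksExposition = renderCounter('stella_agent_tasks_total', 'Tasks executed', [
  { labels: { agent: 'agent-1', type: 'deploy', status: 'success' }, value: 42 },
  { labels: { agent: 'agent-1', type: 'deploy', status: 'failed' }, value: 3 },
]);
```

Every distinct label combination produces a separate series, which is why high-cardinality labels such as `path` are usually normalized to route templates before export.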

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt

  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id

Logging

Log Format

{
  "timestamp": "2026-01-09T10:30:00.123Z",
  "level": "info",
  "message": "Deployment started",
  "service": "deploy-orchestrator",
  "version": "1.0.0",
  "traceId": "abc123def456",
  "spanId": "789ghi",
  "tenantId": "tenant-uuid",
  "correlationId": "corr-uuid",
  "context": {
    "deploymentJobId": "job-uuid",
    "releaseId": "release-uuid",
    "environmentId": "env-uuid"
  }
}

Log Levels

Level  Usage
error  Failures requiring attention
warn   Degraded operation, recoverable issues
info   Business events (deployment started, approval granted)
debug  Detailed operational info
trace  Very detailed debugging
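
The table orders levels from most to least severe. A minimal sketch of threshold filtering follows, assuming the conventional rule that a configured level emits itself and everything more severe; the source does not state this rule, and `shouldLog` is a hypothetical helper.

```typescript
// Levels ordered from most to least severe, matching the table above.
const LOG_LEVELS = ['error', 'warn', 'info', 'debug', 'trace'] as const;
type LogLevel = (typeof LOG_LEVELS)[number];

// Hypothetical helper: true when a message at `level` should be emitted
// under the configured `threshold` (assumed semantics, see lead-in).
function shouldLog(level: LogLevel, threshold: LogLevel): boolean {
  return LOG_LEVELS.indexOf(level) <= LOG_LEVELS.indexOf(threshold);
}
```

With a threshold of `info`, this passes `error`, `warn`, and `info` messages and drops `debug` and `trace`.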

Structured Logging Configuration

// Logging configuration
const loggerConfig = {
  level: process.env.LOG_LEVEL || 'info',
  format: 'json',
  outputs: [
    {
      type: 'stdout',
      format: 'json'
    },
    {
      type: 'file',
      path: '/var/log/stella/orchestrator.log',
      rotation: {
        maxSize: '100MB',
        maxFiles: 10
      }
    }
  ],
  // Sensitive field masking
  redact: [
    'password',
    'token',
    'secret',
    'credentials',
    'authorization'
  ]
};
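
The `redact` list names fields to mask, but the matching rule is not specified. The sketch below assumes exact, case-insensitive key matching at any nesting depth; `redactFields` and the `[REDACTED]` placeholder are hypothetical, not the logger's actual behavior.

```typescript
const REDACTED = '[REDACTED]';

// Hypothetical helper: returns a deep copy of `value` in which every property
// whose key matches an entry in `redactKeys` (case-insensitive) is replaced
// with a placeholder. Plain objects and arrays are traversed; other values
// are copied as-is.
function redactFields(value: unknown, redactKeys: string[]): unknown {
  const keys = redactKeys.map((k) => k.toLowerCase());
  const walk = (node: unknown): unknown => {
    if (Array.isArray(node)) return node.map((item) => walk(item));
    if (node !== null && typeof node === 'object') {
      const record = node as Record<string, unknown>;
      const out: Record<string, unknown> = {};
      for (const key of Object.keys(record)) {
        out[key] = keys.indexOf(key.toLowerCase()) >= 0 ? REDACTED : walk(record[key]);
      }
      return out;
    }
    return node;
  };
  return walk(value);
}

// Illustrative input only.
const maskedLogEntry = redactFields(
  { message: 'Login', context: { user: 'alice', password: 'hunter2', headers: { Authorization: 'Bearer abc' } } },
  ['password', 'token', 'secret', 'credentials', 'authorization'],
);
```

Masking on a copy rather than in place keeps the original object usable by the caller after the log line is written.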

Important Log Events

Event                 Level  Description
deployment.started    info   Deployment job started
deployment.completed  info   Deployment successful
deployment.failed     error  Deployment failed
rollback.initiated    warn   Rollback triggered
approval.granted      info   Promotion approved
approval.denied       info   Promotion rejected
agent.connected       info   Agent came online
agent.disconnected    warn   Agent went offline
security.gate.failed  warn   Security check blocked

Distributed Tracing

Trace Context Propagation

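// Trace context in requests
// Express types are assumed from the handler signature below
import type { Request, Response, NextFunction } from 'express';

interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  sampled: boolean;
  baggage?: Record<string, string>;
}

// W3C Trace Context headers
// traceparent: 00-{traceId}-{spanId}-{flags}
// tracestate: stella=...

// Minimal tracer shape assumed by this example; substitute your tracing SDK's types
interface Span {
  setAttribute(key: string, value: unknown): void;
  end(): void;
}
interface Tracer {
  startSpan(name: string, options: { parent?: TraceContext; attributes?: Record<string, unknown> }): Span;
}

// Request extended with the fields this middleware reads and writes
type TracedRequest = Request & { tenantId?: string; span?: Span };

// Example trace propagation
class TracingMiddleware {
  constructor(private readonly tracer: Tracer) {}

  handle(req: TracedRequest, res: Response, next: NextFunction): void {
    // The header may be absent or repeated; use the first value if repeated
    const raw = req.headers['traceparent'];
    const traceContext = this.parseTraceParent(Array.isArray(raw) ? raw[0] : raw);

    // Start span for this request
    const span = this.tracer.startSpan('http.request', {
      parent: traceContext,
      attributes: {
        'http.method': req.method,
        'http.url': req.url,
        'http.user_agent': req.headers['user-agent'],
        'tenant.id': req.tenantId
      }
    });

    // Attach to request for downstream use
    req.span = span;

    res.on('finish', () => {
      span.setAttribute('http.status_code', res.statusCode);
      span.end();
    });

    next();
  }

  // Returns undefined when the header is missing or malformed; the span then starts without a parent
  private parseTraceParent(header?: string): TraceContext | undefined {
    const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header ?? '');
    if (!match) return undefined;
    return { traceId: match[1], spanId: match[2], sampled: (parseInt(match[3], 16) & 1) === 1 };
  }
}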
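
For the context to reach downstream services such as agents, the outgoing request needs a `traceparent` header as well. The sketch below builds one in the W3C format noted in the comments above. `formatTraceParent` and `OutgoingTraceContext` are hypothetical names, and the IDs are the sample values from the W3C Trace Context specification.

```typescript
// Subset of the trace context needed to build a traceparent header.
interface OutgoingTraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

// Hypothetical helper: builds "00-{traceId}-{spanId}-{flags}".
// Version is fixed at "00"; only the sampled bit of the flags is set.
function formatTraceParent(ctx: OutgoingTraceContext): string {
  if (!/^[0-9a-f]{32}$/.test(ctx.traceId) || /^0+$/.test(ctx.traceId)) {
    throw new Error('traceId must be 32 lowercase hex characters and not all zeros');
  }
  if (!/^[0-9a-f]{16}$/.test(ctx.spanId) || /^0+$/.test(ctx.spanId)) {
    throw new Error('spanId must be 16 lowercase hex characters and not all zeros');
  }
  const flags = ctx.sampled ? '01' : '00';
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

const outgoingTraceParent = formatTraceParent({
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
  sampled: true,
});
```

The `spanId` sent downstream should be the ID of the span making the call, so the receiving service records it as its parent.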

Key Spans

Span Name           Description             Attributes
deployment.execute  Full deployment         release_id, environment
task.dispatch       Task dispatch to agent  target_id, agent_id
agent.execute       Agent task execution    task_type, duration
workflow.run        Workflow execution      template_id, status
workflow.step       Individual step         step_type, node_id
approval.wait       Waiting for approval    promotion_id, duration
gate.evaluate       Gate evaluation         gate_type, result

Jaeger Configuration

# jaeger-config.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: stella-jaeger
spec:
  strategy: production
  collector:
    maxReplicas: 5
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
    secretName: jaeger-es-secret
  ingress:
    enabled: true

Alerting

Alert Rules

# prometheus-rules.yaml
groups:
  - name: stella.deployment
    rules:
      - alert: DeploymentFailureRateHigh
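        # Assumes stella_deployments_total also carries a status label; the Core Metrics table lists only tenant, env, strategy.
        expr: |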
          sum(rate(stella_deployments_total{status="failed"}[5m])) /
          sum(rate(stella_deployments_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High deployment failure rate"
          description: "More than 10% of deployments are failing"

      - alert: DeploymentDurationHigh
        expr: |
          histogram_quantile(0.95, sum(rate(stella_deployment_duration_seconds_bucket[5m])) by (le, tenant)) > 600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Deployment duration high"
          description: "P95 deployment duration exceeds 10 minutes"

      - alert: RollbackRateHigh
        expr: |
          sum(increase(stella_rollbacks_total[1h])) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rollback rate"
          description: "More than 3 rollbacks in the last hour"

  - name: stella.agents
    rules:
      - alert: AgentOffline
        expr: |
          stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Agent offline"
          description: "Agent {{ $labels.agent }} has not sent a heartbeat for more than 2 minutes"

      - alert: AgentPoolLow
        expr: |
          sum by (tenant) (stella_agents_connected) < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low agent count"
          description: "Fewer than 2 agents online for tenant {{ $labels.tenant }}"

  - name: stella.approvals
    rules:
      - alert: ApprovalBacklogHigh
        expr: |
          stella_approval_pending_count > 10
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Approval backlog growing"
          description: "More than 10 pending approvals for over an hour"

      - alert: ApprovalWaitLong
        expr: |
          histogram_quantile(0.90, sum(rate(stella_approval_duration_seconds_bucket[1h])) by (le)) > 86400
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Long approval wait times"
          description: "P90 approval wait time exceeds 24 hours"
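
The duration alerts above rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside cumulative buckets. The sketch below reproduces that estimate for a single set of buckets so the thresholds are easier to reason about. `estimateQuantile` and the bucket counts are illustrative assumptions, and the sketch covers only the common case (0 < q <= 1, positive upper bounds).

```typescript
interface Bucket {
  le: number;    // upper bound; the last bucket must use Infinity
  count: number; // cumulative number of observations <= le
}

// Hypothetical helper: estimates quantile q from cumulative buckets sorted by
// ascending `le`, interpolating linearly inside the bucket that holds the rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (!(q > 0 && q <= 1)) throw new Error('q must be in (0, 1]');
  if (buckets.length < 2 || buckets[buckets.length - 1].le !== Infinity) {
    throw new Error('need at least two buckets, the last with le = Infinity');
  }
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // Find the first bucket whose cumulative count reaches the rank.
  let i = 0;
  while (buckets[i].count < rank) i++;
  // If the rank falls in the +Inf bucket, report the highest finite bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le; // first bucket starts at 0
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Illustrative deployment durations in seconds (not real data).
const deploymentBuckets: Bucket[] = [
  { le: 60, count: 10 },
  { le: 300, count: 60 },
  { le: 600, count: 90 },
  { le: 1200, count: 100 },
  { le: Infinity, count: 100 },
];
const estimatedP95 = estimateQuantile(0.95, deploymentBuckets);
```

With these buckets the P95 estimate lands midway through the 600 to 1200 second bucket, at about 900 seconds, so the `DeploymentDurationHigh` threshold of 600 would be exceeded. The estimate is only as precise as the bucket boundaries, so placing a boundary at the alert threshold makes the alert more reliable.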

PagerDuty Integration

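// Alertmanager receivers and routing, shown as an object literal (Alertmanager itself reads this structure from YAML)
const alertManagerConfig = {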
  receivers: [
    {
      name: "stella-critical",
      pagerduty_configs: [
        {
          service_key: "${PAGERDUTY_SERVICE_KEY}",
          severity: "critical"
        }
      ]
    },
    {
      name: "stella-warning",
      slack_configs: [
        {
          api_url: "${SLACK_WEBHOOK_URL}",
          channel: "#stella-alerts",
          send_resolved: true
        }
      ]
    }
  ],
  route: {
    receiver: "stella-warning",
    routes: [
      {
        match: { severity: "critical" },
        receiver: "stella-critical"
      }
    ]
  }
}
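
The routing block sends critical alerts to PagerDuty and everything else to Slack. The sketch below shows that selection logic in simplified form, assuming first-match-wins among child routes and equality matching only; `resolveReceiver` is a hypothetical helper, not Alertmanager's implementation, which also supports regex matchers, `continue`, and grouping.

```typescript
type AlertLabels = Record<string, string>;

interface Route {
  receiver: string;
  match?: AlertLabels; // every listed label must equal the alert's label
  routes?: Route[];    // child routes, evaluated in order
}

// Hypothetical helper: descends into the first matching child route and
// falls back to the current route's receiver when no child matches.
function resolveReceiver(route: Route, labels: AlertLabels): string {
  for (const child of route.routes ?? []) {
    const matchers = child.match ?? {};
    const matches = Object.keys(matchers).every((key) => labels[key] === matchers[key]);
    if (matches) return resolveReceiver(child, labels);
  }
  return route.receiver;
}

const rootRoute: Route = {
  receiver: 'stella-warning',
  routes: [{ match: { severity: 'critical' }, receiver: 'stella-critical' }],
};
```

Under this configuration an `AgentOffline` alert (severity critical) resolves to `stella-critical`, while `ApprovalWaitLong` (severity info) falls through to the default `stella-warning` receiver.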

Dashboards

Deployment Dashboard

Key panels:

  • Deployment rate over time
  • Success/failure ratio
  • Average deployment duration
  • Deployment duration histogram
  • Active deployments by environment
  • Recent deployment list

Agent Health Dashboard

Key panels:

  • Connected agents count
  • Agent heartbeat status
  • Tasks per agent
  • Task success rate by agent
  • Agent resource utilization

Approval Dashboard

Key panels:

  • Pending approvals count
  • Approval response time
  • Approvals by user
  • Rejection reasons breakdown

Health Endpoints

Application Health

GET /health

Response:

{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 86400,
  "checks": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "vault": { "status": "healthy", "latency": 10 }
  }
}

In Valkey-backed deployments, the redis check reflects the Redis-compatible Valkey cache.
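
One way to derive the top-level `status` from the individual checks is sketched below. It assumes the service is healthy only when every dependency is healthy. The `unhealthy` value, the millisecond latency unit, and `aggregateHealth` are assumptions, since the source shows only a fully healthy response.

```typescript
// 'unhealthy' is an assumed counterpart to the 'healthy' value shown above.
type CheckStatus = 'healthy' | 'unhealthy';

interface DependencyCheck {
  status: CheckStatus;
  latency: number; // unit not stated in the source; milliseconds assumed
}

interface HealthReport {
  status: CheckStatus;
  checks: Record<string, DependencyCheck>;
}

// Hypothetical helper: overall status is healthy only if all checks are healthy.
function aggregateHealth(checks: Record<string, DependencyCheck>): HealthReport {
  const allHealthy = Object.keys(checks).every((key) => checks[key].status === 'healthy');
  return { status: allHealthy ? 'healthy' : 'unhealthy', checks };
}

const degradedReport = aggregateHealth({
  database: { status: 'healthy', latency: 5 },
  redis: { status: 'unhealthy', latency: 250 },
  vault: { status: 'healthy', latency: 10 },
});
```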

Readiness Probe

GET /health/ready

Liveness Probe

GET /health/live

Performance Tuning

Database Connection Pool

const poolConfig = {
  min: 5,
  max: 20,
  acquireTimeout: 30000,
  idleTimeout: 600000,
  connectionTimeout: 10000
};

Cache Configuration

const cacheConfig = {
  // Release cache
  releases: {
    ttl: 300,           // 5 minutes
    maxSize: 1000
  },
  // Target cache
  targets: {
    ttl: 60,            // 1 minute
    maxSize: 5000
  },
  // Workflow template cache
  templates: {
    ttl: 3600,          // 1 hour
    maxSize: 100
  }
};
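
The `ttl` and `maxSize` settings can be read as follows: entries expire after `ttl` seconds, and the cache never holds more than `maxSize` entries. The sketch below implements that reading with oldest-inserted eviction. The eviction policy, the `TtlCache` class, and the injectable clock are assumptions for illustration, not the orchestrator's cache.

```typescript
interface CacheOptions {
  ttl: number;     // seconds, as in cacheConfig
  maxSize: number; // maximum number of entries
}

// Hypothetical in-memory cache with per-entry expiry and a size cap.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private readonly options: CacheOptions,
    private readonly now: () => number = () => Date.now(), // injectable for tests
  ) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // drop expired entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // Deleting first moves a rewritten key to the end of insertion order.
    this.entries.delete(key);
    if (this.entries.size >= this.options.maxSize) {
      // Evict the oldest inserted entry (first key in Map iteration order).
      const oldest = this.entries.keys().next();
      if (!oldest.done) this.entries.delete(oldest.value);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.options.ttl * 1000 });
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A usage such as `new TtlCache<Target>({ ttl: 60, maxSize: 5000 })` would mirror the `targets` settings above, where `Target` stands for whatever type the cache holds.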

Rate Limiting

const rateLimitConfig = {
  // API rate limits
  api: {
    windowMs: 60000,    // 1 minute
    max: 1000,          // requests per window
    burst: 100          // burst allowance
  },
  // Webhook rate limits
  webhooks: {
    windowMs: 60000,
    max: 100
  },
  // Per-tenant limits
  tenant: {
    windowMs: 60000,
    max: 500
  }
};
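
A minimal sketch of how `windowMs` and `max` could be enforced with a fixed-window counter per key (for example, per tenant ID) follows. The fixed-window algorithm, `FixedWindowLimiter`, and the injectable clock are assumptions; the source does not say which algorithm is used, and the `burst` setting is not modeled here.

```typescript
interface WindowLimit {
  windowMs: number; // window length in milliseconds
  max: number;      // requests allowed per window
}

// Hypothetical fixed-window limiter: counts requests per key and resets the
// count once the window that started at the first request has elapsed.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(
    private readonly limit: WindowLimit,
    private readonly now: () => number = () => Date.now(), // injectable for tests
  ) {}

  // Returns true if the request for `key` is within the limit.
  allow(key: string): boolean {
    const t = this.now();
    let current = this.windows.get(key);
    if (!current || t - current.start >= this.limit.windowMs) {
      current = { start: t, count: 0 };
      this.windows.set(key, current);
    }
    if (current.count >= this.limit.max) return false;
    current.count += 1;
    return true;
  }
}
```

Fixed windows are simple but allow up to twice `max` requests across a window boundary. A sliding window or token bucket avoids that at the cost of more state.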
