stella-ops.org/git.stella-ops.org

master c32fff8f86 license switch agpl -> busl1, sprints work, new product advisories

2026-01-20 15:32:20 +02:00

Operations Overview

Observability Stack

The Release Orchestrator exports metrics, structured logs, distributed traces, and events to dedicated backends, with Grafana dashboards on top.

              OBSERVABILITY ARCHITECTURE

  ┌─────────────────────────────────────────────────────────────────────────────┐
  │                     RELEASE ORCHESTRATOR                                     │
  │                                                                             │
  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
  │  │ Metrics     │  │ Logs        │  │ Traces      │  │ Events      │        │
  │  │ Exporter    │  │ Collector   │  │ Exporter    │  │ Publisher   │        │
  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
  │         │                │                │                │                │
  └─────────┼────────────────┼────────────────┼────────────────┼────────────────┘
            │                │                │                │
            ▼                ▼                ▼                ▼
  ┌─────────────────────────────────────────────────────────────────────────────┐
  │                      OBSERVABILITY BACKENDS                                  │
  │                                                                             │
  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
  │  │ Prometheus  │  │ Loki /      │  │ Jaeger /    │  │ Event       │        │
  │  │ / Mimir     │  │ Elasticsearch│  │ Tempo       │  │ Bus         │        │
  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
  │         │                │                │                │                │
  │         └────────────────┴────────────────┴────────────────┘                │
  │                                    │                                        │
  │                                    ▼                                        │
  │                           ┌─────────────────┐                               │
  │                           │    Grafana      │                               │
  │                           │   Dashboards    │                               │
  │                           └─────────────────┘                               │
  │                                                                             │
  └─────────────────────────────────────────────────────────────────────────────┘

Metrics

Core Metrics

Metric	Type	Description	Labels
`stella_releases_total`	counter	Total releases created	`tenant`, `status`
`stella_promotions_total`	counter	Total promotions	`tenant`, `env`, `status`
`stella_deployments_total`	counter	Total deployments	`tenant`, `env`, `strategy`, `status`
`stella_deployment_duration_seconds`	histogram	Deployment duration	`tenant`, `env`, `strategy`
`stella_rollbacks_total`	counter	Total rollbacks	`tenant`, `env`, `reason`
`stella_agents_connected`	gauge	Connected agents	`tenant`
`stella_targets_total`	gauge	Total targets	`tenant`, `env`, `type`
`stella_workflow_runs_total`	counter	Workflow executions	`tenant`, `template`, `status`
`stella_workflow_step_duration_seconds`	histogram	Step execution time	`step_type`
`stella_approval_pending_count`	gauge	Pending approvals	`tenant`, `env`
`stella_approval_duration_seconds`	histogram	Time to approve	`tenant`, `env`

API Metrics

Metric	Type	Description	Labels
`stella_http_requests_total`	counter	HTTP requests	`method`, `path`, `status`
`stella_http_request_duration_seconds`	histogram	Request latency	`method`, `path`
`stella_http_requests_in_flight`	gauge	Active requests	`method`

Agent Metrics

Metric	Type	Description	Labels
`stella_agent_tasks_total`	counter	Tasks executed	`agent`, `type`, `status`
`stella_agent_task_duration_seconds`	histogram	Task duration	`agent`, `type`
`stella_agent_heartbeat_age_seconds`	gauge	Since last heartbeat	`agent`

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt

  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id

Logging

Log Format

{
  "timestamp": "2026-01-09T10:30:00.123Z",
  "level": "info",
  "message": "Deployment started",
  "service": "deploy-orchestrator",
  "version": "1.0.0",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "tenantId": "tenant-uuid",
  "correlationId": "corr-uuid",
  "context": {
    "deploymentJobId": "job-uuid",
    "releaseId": "release-uuid",
    "environmentId": "env-uuid"
  }
}

Log Levels

Level	Usage
`error`	Failures requiring attention
`warn`	Degraded operation, recoverable issues
`info`	Business events (deployment started, approval granted)
`debug`	Detailed operational info
`trace`	Very detailed debugging
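
A configured level usually acts as a threshold. The sketch below assumes the levels are ordered as in the table, from most to least severe, and that a message is emitted when it is at least as severe as the configured level. The `shouldLog` function is a hypothetical helper for illustration.

```typescript
// Levels in the order listed in the table: most severe first.
const LOG_LEVELS = ["error", "warn", "info", "debug", "trace"] as const;
type LogLevel = (typeof LOG_LEVELS)[number];

// A lower index means higher severity, so a message passes when its
// index does not exceed the index of the configured threshold.
function shouldLog(messageLevel: LogLevel, configuredLevel: LogLevel): boolean {
  return LOG_LEVELS.indexOf(messageLevel) <= LOG_LEVELS.indexOf(configuredLevel);
}
```

With the level set to `info`, `error`, `warn`, and `info` messages pass and `debug` and `trace` messages are dropped.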

Structured Logging Configuration

// Logging configuration
const loggerConfig = {
  level: process.env.LOG_LEVEL || 'info',
  format: 'json',
  outputs: [
    {
      type: 'stdout',
      format: 'json'
    },
    {
      type: 'file',
      path: '/var/log/stella/orchestrator.log',
      rotation: {
        maxSize: '100MB',
        maxFiles: 10
      }
    }
  ],
  // Sensitive field masking
  redact: [
    'password',
    'token',
    'secret',
    'credentials',
    'authorization'
  ]
};
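
The `redact` list names fields whose values must never reach the log output. How the logger applies it is not shown here; one plausible approach, sketched below, walks the log entry recursively and replaces the value of any key whose name matches a listed field. The case-insensitive exact-name match and the `[REDACTED]` placeholder are assumptions of this sketch.

```typescript
const REDACTED_KEYS = ["password", "token", "secret", "credentials", "authorization"];

// Return a deep copy of `value` with sensitive fields masked.
// Matching is by exact key name, case-insensitive (an assumption of this sketch).
function redact(value: unknown, keys: string[] = REDACTED_KEYS): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, keys));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const result: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      result[key] =
        keys.indexOf(key.toLowerCase()) >= 0 ? "[REDACTED]" : redact(source[key], keys);
    }
    return result;
  }
  // Primitives (strings, numbers, booleans, null, undefined) pass through unchanged.
  return value;
}
```

Because the function builds a new object instead of mutating its input, the original request or context object stays usable after logging.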

Important Log Events

Event	Level	Description
`deployment.started`	info	Deployment job started
`deployment.completed`	info	Deployment successful
`deployment.failed`	error	Deployment failed
`rollback.initiated`	warn	Rollback triggered
`approval.granted`	info	Promotion approved
`approval.denied`	info	Promotion rejected
`agent.connected`	info	Agent came online
`agent.disconnected`	warn	Agent went offline
`security.gate.failed`	warn	Security check blocked

Distributed Tracing

Trace Context Propagation

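// Types below assume an Express application; adjust the import for other frameworks.
import type { NextFunction, Request, Response } from 'express';

// Trace context in requests
interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  sampled: boolean;
  baggage?: Record<string, string>;
}

// W3C Trace Context headers
// traceparent: 00-{traceId}-{spanId}-{flags}
// tracestate: stella=...

// Minimal tracer surface this example assumes; substitute your tracing SDK's types.
interface Span {
  setAttribute(key: string, value: unknown): void;
  end(): void;
}

interface Tracer {
  startSpan(
    name: string,
    options: { parent?: TraceContext; attributes: Record<string, unknown> }
  ): Span;
}

// Request fields set by upstream middleware (tenantId) and by this one (span).
type TracedRequest = Request & { tenantId?: string; span?: Span };

// Example trace propagation
class TracingMiddleware {
  constructor(
    private readonly tracer: Tracer,
    // Parses a traceparent header; returns undefined when it is missing or malformed.
    private readonly parseTraceParent: (header?: string) => TraceContext | undefined
  ) {}

  handle(req: TracedRequest, res: Response, next: NextFunction): void {
    // Node exposes headers as string | string[] | undefined; use the first value.
    const header = req.headers['traceparent'];
    const traceContext = this.parseTraceParent(Array.isArray(header) ? header[0] : header);

    // Start span for this request
    const span = this.tracer.startSpan('http.request', {
      parent: traceContext,
      attributes: {
        'http.method': req.method,
        'http.url': req.url,
        'http.user_agent': req.headers['user-agent'],
        'tenant.id': req.tenantId
      }
    });

    // Attach to request for downstream use
    req.span = span;

    // End the span once the response has been sent
    res.on('finish', () => {
      span.setAttribute('http.status_code', res.statusCode);
      span.end();
    });

    next();
  }
}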
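
The middleware above relies on a `parseTraceParent` helper that is not shown. A minimal sketch of such a parser follows, assuming the four-field `traceparent` layout defined by W3C Trace Context (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The `ParsedTraceParent` type is a hypothetical name for this example. Headers with additional fields, which future versions of the specification may allow, are rejected here for simplicity.

```typescript
interface ParsedTraceParent {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

// version - trace-id - parent-id - trace-flags, all lowercase hex.
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns undefined for a missing or malformed header so that the caller
// can start a fresh trace instead of failing the request.
function parseTraceParent(
  header: string | string[] | undefined
): ParsedTraceParent | undefined {
  const raw = Array.isArray(header) ? header[0] : header;
  if (raw === undefined) return undefined;

  const match = TRACEPARENT_PATTERN.exec(raw.trim());
  if (match === null) return undefined;

  const [, version, traceId, spanId, flags] = match;
  // Version ff and all-zero IDs are invalid per the specification.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(spanId)) {
    return undefined;
  }

  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

Returning `undefined` instead of throwing keeps a bad header sent by a client from turning into a failed request; the span is simply started without a parent.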

Key Spans

Span Name	Description	Attributes
`deployment.execute`	Full deployment	`release_id`, `environment`
`task.dispatch`	Task dispatch to agent	`target_id`, `agent_id`
`agent.execute`	Agent task execution	`task_type`, `duration`
`workflow.run`	Workflow execution	`template_id`, `status`
`workflow.step`	Individual step	`step_type`, `node_id`
`approval.wait`	Waiting for approval	`promotion_id`, `duration`
`gate.evaluate`	Gate evaluation	`gate_type`, `result`

Jaeger Configuration

# jaeger-config.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: stella-jaeger
spec:
  strategy: production
  collector:
    maxReplicas: 5
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
    secretName: jaeger-es-secret
  ingress:
    enabled: true

Alerting

Alert Rules

# prometheus-rules.yaml
groups:
  - name: stella.deployment
    rules:
      - alert: DeploymentFailureRateHigh
        expr: |
          sum(rate(stella_deployments_total{status="failed"}[5m])) /
          sum(rate(stella_deployments_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High deployment failure rate"
          description: "More than 10% of deployments are failing"

      - alert: DeploymentDurationHigh
        expr: |
          histogram_quantile(0.95, sum(rate(stella_deployment_duration_seconds_bucket[5m])) by (le, tenant)) > 600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Deployment duration high"
          description: "P95 deployment duration exceeds 10 minutes"

      - alert: RollbackRateHigh
        expr: |
          sum(increase(stella_rollbacks_total[1h])) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rollback rate"
          description: "More than 3 rollbacks in the last hour"

  - name: stella.agents
    rules:
      - alert: AgentOffline
        expr: |
          stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Agent offline"
          description: "Agent {{ $labels.agent }} has not sent heartbeat for 2 minutes"

      - alert: AgentPoolLow
        expr: |
          sum(stella_agents_connected) by (tenant) < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low agent count"
          description: "Fewer than 2 agents online for tenant {{ $labels.tenant }}"

  - name: stella.approvals
    rules:
      - alert: ApprovalBacklogHigh
        expr: |
          stella_approval_pending_count > 10
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Approval backlog growing"
          description: "More than 10 pending approvals for over an hour"

      - alert: ApprovalWaitLong
        # The 1d rate window is an assumed value; size it to your approval volume.
        expr: |
          histogram_quantile(0.90, sum(rate(stella_approval_duration_seconds_bucket[1d])) by (le)) > 86400
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Long approval wait times"
          description: "P90 approval wait time exceeds 24 hours"

PagerDuty Integration

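// Alertmanager reads its configuration as YAML; this object mirrors that structure.
const alertManagerConfig = {
  receivers: [
    {
      // Critical alerts page the on-call engineer via PagerDuty
      name: "stella-critical",
      pagerduty_configs: [
        {
          service_key: "${PAGERDUTY_SERVICE_KEY}",
          severity: "critical"
        }
      ]
    },
    {
      // Everything else goes to Slack
      name: "stella-warning",
      slack_configs: [
        {
          api_url: "${SLACK_WEBHOOK_URL}",
          channel: "#stella-alerts",
          send_resolved: true
        }
      ]
    }
  ],
  route: {
    // Default receiver for alerts that match no child route
    receiver: "stella-warning",
    routes: [
      {
        match: { severity: "critical" },
        receiver: "stella-critical"
      }
    ]
  }
};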
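
To make the routing behaviour concrete: an alert enters at the root route, the first child route whose `match` labels all equal the alert's labels takes over, and if no child matches the alert stays with the parent's receiver. The sketch below models only that part of Alertmanager's routing. It ignores `continue`, regex matchers, grouping, and receiver inheritance, and `Route`, `matchesAll`, and `resolveReceiver` are hypothetical names for this example.

```typescript
interface Route {
  receiver: string;
  match?: Record<string, string>;
  routes?: Route[];
}

// True when every label required by the route is present with the same value.
function matchesAll(required: Record<string, string>, labels: Record<string, string>): boolean {
  return Object.keys(required).every((key) => labels[key] === required[key]);
}

// Walk the tree depth-first; the first matching child wins, otherwise
// the alert is delivered to the current route's receiver.
function resolveReceiver(route: Route, labels: Record<string, string>): string {
  for (const child of route.routes ?? []) {
    if (matchesAll(child.match ?? {}, labels)) {
      return resolveReceiver(child, labels);
    }
  }
  return route.receiver;
}

// The routing tree from the configuration above.
const routingTree: Route = {
  receiver: "stella-warning",
  routes: [{ match: { severity: "critical" }, receiver: "stella-critical" }],
};
```

Under this model an alert labelled `severity: critical` resolves to `stella-critical`, and any other alert, including one with no `severity` label, falls back to `stella-warning`.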

Dashboards

Deployment Dashboard

Key panels:

Deployment rate over time
Success/failure ratio
Average deployment duration
Deployment duration histogram
Active deployments by environment
Recent deployment list

Agent Health Dashboard

Key panels:

Connected agents count
Agent heartbeat status
Tasks per agent
Task success rate by agent
Agent resource utilization

Approval Dashboard

Key panels:

Pending approvals count
Approval response time
Approvals by user
Rejection reasons breakdown

Health Endpoints

Application Health

GET /health

Response:

{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 86400,
  "checks": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "vault": { "status": "healthy", "latency": 10 }
  }
}

In Valkey-backed deployments, the `redis` check reports the health of the Redis-compatible Valkey cache.
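
The top-level `status` presumably summarises the individual dependency checks. A minimal sketch of that aggregation follows. The `unhealthy` value, the 200/503 status codes, and the reading of `latency` as milliseconds are assumptions of this example, since the response above only shows the healthy case.

```typescript
type CheckStatus = "healthy" | "unhealthy";

interface CheckResult {
  status: CheckStatus;
  latency: number; // assumed to be milliseconds
}

interface HealthReport {
  status: CheckStatus;
  httpStatus: number;
  checks: Record<string, CheckResult>;
}

// The service is healthy only when every dependency check is healthy.
function buildHealthReport(checks: Record<string, CheckResult>): HealthReport {
  const allHealthy = Object.keys(checks).every((key) => checks[key].status === "healthy");
  return {
    status: allHealthy ? "healthy" : "unhealthy",
    httpStatus: allHealthy ? 200 : 503, // assumed mapping to HTTP status codes
    checks,
  };
}
```

Returning a non-2xx status when any check fails lets load balancers and orchestrators act on the endpoint without parsing the JSON body.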

Readiness Probe

GET /health/ready

Liveness Probe

GET /health/live

Performance Tuning

Database Connection Pool

const poolConfig = {
  min: 5,
  max: 20,
  acquireTimeout: 30000,
  idleTimeout: 600000,
  connectionTimeout: 10000
};

Cache Configuration

const cacheConfig = {
  // Release cache
  releases: {
    ttl: 300,           // 5 minutes
    maxSize: 1000
  },
  // Target cache
  targets: {
    ttl: 60,            // 1 minute
    maxSize: 5000
  },
  // Workflow template cache
  templates: {
    ttl: 3600,          // 1 hour
    maxSize: 100
  }
};
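
Each cache entry above pairs a time-to-live with a size cap. The sketch below shows one way those two settings can interact: entries expire `ttl` seconds after they are written, and once `maxSize` is reached the oldest written entry is evicted. The eviction policy is an assumption, since the configuration does not name one, and `TtlCache` is a hypothetical class for illustration.

```typescript
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly maxSize: number;
  private readonly now: () => number;

  // `now` is injectable so that expiry can be tested without waiting.
  constructor(ttlSeconds: number, maxSize: number, now: () => number = Date.now) {
    this.ttlMs = ttlSeconds * 1000;
    this.maxSize = maxSize;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      // Expired entries are removed lazily on read.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // Deleting first moves a rewritten key to the end of the insertion order.
    this.entries.delete(key);
    if (this.entries.size >= this.maxSize) {
      // A Map iterates in insertion order, so the first key is the oldest write.
      const oldest = this.entries.keys().next();
      if (!oldest.done) this.entries.delete(oldest.value);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

For example, `new TtlCache<Release>(300, 1000)` would correspond to the `releases` settings, where `Release` stands for whatever type the cache holds.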

Rate Limiting

const rateLimitConfig = {
  // API rate limits
  api: {
    windowMs: 60000,    // 1 minute
    max: 1000,          // requests per window
    burst: 100          // burst allowance
  },
  // Webhook rate limits
  webhooks: {
    windowMs: 60000,
    max: 100
  },
  // Per-tenant limits
  tenant: {
    windowMs: 60000,
    max: 500
  }
};
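
The `windowMs` and `max` pairs above suggest a windowed counter. The algorithm itself is not stated, so the sketch below assumes the simplest variant, a fixed window per key: the first request opens a window, up to `max` requests are allowed inside it, and the counter resets once `windowMs` has elapsed. The `burst` allowance is not modelled, and `FixedWindowLimiter` is a hypothetical class for illustration.

```typescript
class FixedWindowLimiter {
  private readonly windows = new Map<string, { startedAt: number; count: number }>();
  private readonly windowMs: number;
  private readonly max: number;
  private readonly now: () => number;

  // `now` is injectable so that window rollover can be tested deterministically.
  constructor(windowMs: number, max: number, now: () => number = Date.now) {
    this.windowMs = windowMs;
    this.max = max;
    this.now = now;
  }

  // Returns true when the request identified by `key` is within its limit.
  allow(key: string): boolean {
    const currentTime = this.now();
    let current = this.windows.get(key);
    if (current === undefined || currentTime - current.startedAt >= this.windowMs) {
      // No window yet, or the previous one has elapsed: open a new one.
      current = { startedAt: currentTime, count: 0 };
      this.windows.set(key, current);
    }
    if (current.count >= this.max) return false;
    current.count += 1;
    return true;
  }
}
```

Keying the limiter by tenant ID, as in `new FixedWindowLimiter(60000, 500)`, would correspond to the per-tenant settings. A fixed window can admit up to twice `max` requests across a window boundary, which is one reason production limiters often use sliding windows or token buckets instead.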
