# Operations Overview

## Observability Stack

Release Orchestrator emits four signals: metrics, structured logs, distributed traces, and events. Each signal is exported to its own backend, and Grafana dashboards sit on top of those backends.

```
              OBSERVABILITY ARCHITECTURE

  ┌─────────────────────────────────────────────────────────────────────────────┐
  │                     RELEASE ORCHESTRATOR                                     │
  │                                                                             │
  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
  │  │ Metrics     │  │ Logs        │  │ Traces      │  │ Events      │        │
  │  │ Exporter    │  │ Collector   │  │ Exporter    │  │ Publisher   │        │
  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
  │         │                │                │                │                │
  └─────────┼────────────────┼────────────────┼────────────────┼────────────────┘
            │                │                │                │
            ▼                ▼                ▼                ▼
  ┌─────────────────────────────────────────────────────────────────────────────┐
  │                      OBSERVABILITY BACKENDS                                  │
  │                                                                             │
  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
  │  │ Prometheus  │  │ Loki /      │  │ Jaeger /    │  │ Event       │        │
  │  │ / Mimir     │  │Elasticsearch│  │ Tempo       │  │ Bus         │        │
  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
  │         │                │                │                │                │
  │         └────────────────┴────────────────┴────────────────┘                │
  │                                    │                                        │
  │                                    ▼                                        │
  │                           ┌─────────────────┐                               │
  │                           │    Grafana      │                               │
  │                           │   Dashboards    │                               │
  │                           └─────────────────┘                               │
  │                                                                             │
  └─────────────────────────────────────────────────────────────────────────────┘
```

## Metrics

### Core Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_releases_total` | counter | Total releases created | `tenant`, `status` |
| `stella_promotions_total` | counter | Total promotions | `tenant`, `env`, `status` |
| `stella_deployments_total` | counter | Total deployments | `tenant`, `env`, `strategy`, `status` |
| `stella_deployment_duration_seconds` | histogram | Deployment duration | `tenant`, `env`, `strategy` |
| `stella_rollbacks_total` | counter | Total rollbacks | `tenant`, `env`, `reason` |
| `stella_agents_connected` | gauge | Connected agents | `tenant` |
| `stella_targets_total` | gauge | Total targets | `tenant`, `env`, `type` |
| `stella_workflow_runs_total` | counter | Workflow executions | `tenant`, `template`, `status` |
| `stella_workflow_step_duration_seconds` | histogram | Step execution time | `step_type` |
| `stella_approval_pending_count` | gauge | Pending approvals | `tenant`, `env` |
| `stella_approval_duration_seconds` | histogram | Time to approve | `tenant`, `env` |

### API Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_http_requests_total` | counter | HTTP requests | `method`, `path`, `status` |
| `stella_http_request_duration_seconds` | histogram | Request latency | `method`, `path` |
| `stella_http_requests_in_flight` | gauge | Active requests | `method` |

### Agent Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_agent_tasks_total` | counter | Tasks executed | `agent`, `type`, `status` |
| `stella_agent_task_duration_seconds` | histogram | Task duration | `agent`, `type` |
| `stella_agent_heartbeat_age_seconds` | gauge | Seconds since last heartbeat | `agent` |

### Prometheus Configuration

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt

  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id
```
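
The `/metrics` endpoint scraped above serves samples in the Prometheus text exposition format. As a rough illustration of what one line of that output looks like, the sketch below renders a single sample with labels. `renderSample` and `escapeLabelValue` are hypothetical helpers written for this example, not part of the orchestrator, and the tenant value `acme` is made up.

```typescript
// Hypothetical helpers: render one sample in the Prometheus text exposition format.
type Labels = Record<string, string>;

function escapeLabelValue(value: string): string {
  // Backslash, double quote, and newline must be escaped inside label values
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

function renderSample(name: string, labels: Labels, value: number): string {
  const pairs = Object.keys(labels).map((key) => `${key}="${escapeLabelValue(labels[key])}"`);
  const labelPart = pairs.length > 0 ? `{${pairs.join(',')}}` : '';
  return `${name}${labelPart} ${value}`;
}

// Gauge from the Core Metrics table, with an assumed tenant value
const sampleLine = renderSample('stella_agents_connected', { tenant: 'acme' }, 3);
// sampleLine === 'stella_agents_connected{tenant="acme"} 3'
```

In practice a Prometheus client library would produce this output; the sketch only shows the shape of what the scraper receives.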

## Logging

### Log Format

```json
{
  "timestamp": "2026-01-09T10:30:00.123Z",
  "level": "info",
  "message": "Deployment started",
  "service": "deploy-orchestrator",
  "version": "1.0.0",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "tenantId": "tenant-uuid",
  "correlationId": "corr-uuid",
  "context": {
    "deploymentJobId": "job-uuid",
    "releaseId": "release-uuid",
    "environmentId": "env-uuid"
  }
}
```

### Log Levels

| Level | Usage |
|-------|-------|
| `error` | Failures requiring attention |
| `warn` | Degraded operation, recoverable issues |
| `info` | Business events (deployment started, approval granted) |
| `debug` | Detailed operational info |
| `trace` | Very detailed debugging |

### Structured Logging Configuration

```typescript
// Logging configuration
const loggerConfig = {
  level: process.env.LOG_LEVEL || 'info',
  format: 'json',
  outputs: [
    {
      type: 'stdout',
      format: 'json'
    },
    {
      type: 'file',
      path: '/var/log/stella/orchestrator.log',
      rotation: {
        maxSize: '100MB',
        maxFiles: 10
      }
    }
  ],
  // Sensitive field masking
  redact: [
    'password',
    'token',
    'secret',
    'credentials',
    'authorization'
  ]
};
```
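
To make the `level` and `redact` settings above more concrete, here is a minimal sketch of how a logger might apply them. It assumes the conventional severity ordering for the documented levels and exact, case-insensitive key matching for redaction; `shouldLog`, `redact`, and the `[REDACTED]` placeholder are illustrative names, not the orchestrator's actual implementation.

```typescript
// Levels from the Log Levels table, most severe first (assumed ordering)
const LEVELS = ['error', 'warn', 'info', 'debug', 'trace'] as const;
type LogLevel = (typeof LEVELS)[number];

// True when a message at `level` should be emitted under the configured threshold
function shouldLog(level: LogLevel, threshold: LogLevel): boolean {
  return LEVELS.indexOf(level) <= LEVELS.indexOf(threshold);
}

const REDACT_KEYS = ['password', 'token', 'secret', 'credentials', 'authorization'];
const MASK = '[REDACTED]'; // hypothetical placeholder value

// Recursively masks values whose key matches a redact entry (case-insensitive)
function redact(value: unknown, keys: string[] = REDACT_KEYS): unknown {
  if (Array.isArray(value)) return value.map((item) => redact(item, keys));
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = keys.indexOf(key.toLowerCase()) !== -1 ? MASK : redact(source[key], keys);
    }
    return out;
  }
  return value;
}

// Example: a debug message is dropped at the default 'info' threshold
const emitDebug = shouldLog('debug', 'info');
const safeContext = redact({ releaseId: 'release-uuid', token: 'abc', nested: { Password: 'x' } });
```

Redacting a copy rather than mutating the input keeps the original object usable by the caller after the log line is written.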

### Important Log Events

| Event | Level | Description |
|-------|-------|-------------|
| `deployment.started` | info | Deployment job started |
| `deployment.completed` | info | Deployment successful |
| `deployment.failed` | error | Deployment failed |
| `rollback.initiated` | warn | Rollback triggered |
| `approval.granted` | info | Promotion approved |
| `approval.denied` | info | Promotion rejected |
| `agent.connected` | info | Agent came online |
| `agent.disconnected` | warn | Agent went offline |
| `security.gate.failed` | warn | Security check blocked |

## Distributed Tracing

### Trace Context Propagation

```typescript
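import type { Request, Response, NextFunction } from 'express';

// Trace context in requests
interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  sampled: boolean;
  baggage?: Record<string, string>;
}

// W3C Trace Context headers
// traceparent: 00-{traceId}-{spanId}-{flags}
// tracestate: stella=...

// Minimal tracer surface used below; substitute the tracing SDK in use
interface Span {
  setAttribute(key: string, value: unknown): void;
  end(): void;
}
interface Tracer {
  startSpan(
    name: string,
    options: { parent?: TraceContext; attributes?: Record<string, unknown> }
  ): Span;
}

// Example trace propagation
// Assumes `tenantId` and `span` are added to Request via declaration merging
class TracingMiddleware {
  constructor(private readonly tracer: Tracer) {}

  handle(req: Request, res: Response, next: NextFunction): void {
    const traceContext = this.parseTraceParent(req.headers['traceparent']);

    // Start span for this request
    const span = this.tracer.startSpan('http.request', {
      parent: traceContext,
      attributes: {
        'http.method': req.method,
        'http.url': req.url,
        'http.user_agent': req.headers['user-agent'],
        'tenant.id': req.tenantId
      }
    });

    // Attach to request for downstream use
    req.span = span;

    res.on('finish', () => {
      span.setAttribute('http.status_code', res.statusCode);
      span.end();
    });

    next();
  }

  // Returns undefined for a missing or malformed header so the tracer can start a new trace
  private parseTraceParent(header: string | string[] | undefined): TraceContext | undefined {
    const value = Array.isArray(header) ? header[0] : header;
    const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(value ?? '');
    if (!match) return undefined;
    return {
      traceId: match[1],
      spanId: match[2],
      sampled: (parseInt(match[3], 16) & 0x01) === 1
    };
  }
}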
```
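
The middleware above handles the inbound side. For the outbound side, for example when a task is dispatched to an agent, the same context has to be written back into a `traceparent` header. The sketch below shows one way to do that under the W3C format noted above; `formatTraceParent` and `OutgoingTraceContext` are hypothetical names, and the IDs are sample values.

```typescript
interface OutgoingTraceContext {
  traceId: string;   // 32 lowercase hex characters
  spanId: string;    // 16 lowercase hex characters (the current span)
  sampled: boolean;
}

// Builds a version-00 traceparent header value: 00-{traceId}-{spanId}-{flags}
function formatTraceParent(ctx: OutgoingTraceContext): string {
  if (!/^[0-9a-f]{32}$/.test(ctx.traceId) || !/^[0-9a-f]{16}$/.test(ctx.spanId)) {
    throw new Error('invalid trace context');
  }
  const flags = ctx.sampled ? '01' : '00';
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

const outgoingHeader = formatTraceParent({
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
  sampled: true
});
// outgoingHeader === '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
```

Validating before formatting avoids emitting a header that downstream services would silently discard.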

### Key Spans

| Span Name | Description | Attributes |
|-----------|-------------|------------|
| `deployment.execute` | Full deployment | `release_id`, `environment` |
| `task.dispatch` | Task dispatch to agent | `target_id`, `agent_id` |
| `agent.execute` | Agent task execution | `task_type`, `duration` |
| `workflow.run` | Workflow execution | `template_id`, `status` |
| `workflow.step` | Individual step | `step_type`, `node_id` |
| `approval.wait` | Waiting for approval | `promotion_id`, `duration` |
| `gate.evaluate` | Gate evaluation | `gate_type`, `result` |

### Jaeger Configuration

```yaml
# jaeger-config.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: stella-jaeger
spec:
  strategy: production
  collector:
    maxReplicas: 5
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
    secretName: jaeger-es-secret
  ingress:
    enabled: true
```

## Alerting

### Alert Rules

```yaml
# prometheus-rules.yaml
groups:
  - name: stella.deployment
    rules:
      - alert: DeploymentFailureRateHigh
        expr: |
          sum(rate(stella_deployments_total{status="failed"}[5m])) /
          sum(rate(stella_deployments_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High deployment failure rate"
          description: "More than 10% of deployments are failing"

      - alert: DeploymentDurationHigh
        expr: |
          histogram_quantile(0.95, sum(rate(stella_deployment_duration_seconds_bucket[5m])) by (le, tenant)) > 600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Deployment duration high"
          description: "P95 deployment duration exceeds 10 minutes"

      - alert: RollbackRateHigh
        expr: |
          sum(increase(stella_rollbacks_total[1h])) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rollback rate"
          description: "More than 3 rollbacks in the last hour"

  - name: stella.agents
    rules:
      - alert: AgentOffline
        expr: |
          stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Agent offline"
          description: "Agent {{ $labels.agent }} has not sent heartbeat for 2 minutes"

      - alert: AgentPoolLow
        expr: |
          sum(stella_agents_connected) by (tenant) < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low agent count"
          description: "Fewer than 2 agents online for tenant {{ $labels.tenant }}"

  - name: stella.approvals
    rules:
      - alert: ApprovalBacklogHigh
        expr: |
          stella_approval_pending_count > 10
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Approval backlog growing"
          description: "More than 10 pending approvals for over an hour"

      - alert: ApprovalWaitLong
        expr: |
          histogram_quantile(0.90, sum(rate(stella_approval_duration_seconds_bucket[1h])) by (le)) > 86400
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Long approval wait times"
          description: "P90 approval wait time exceeds 24 hours"
```
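
The duration alerts above depend on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified version of that estimate, useful for reasoning about how bucket boundaries affect the alert threshold. The bucket boundaries and counts are invented for illustration, and Prometheus itself handles further edge cases that are omitted here.

```typescript
interface Bucket {
  le: number;     // upper bound in seconds; Infinity for the +Inf bucket
  count: number;  // cumulative number of observations <= le
}

// Simplified quantile estimate over cumulative histogram buckets
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length === 0) return NaN;
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  if (sorted[i].le === Infinity) {
    // Rank falls in the +Inf bucket: report the highest finite bound
    return i > 0 ? sorted[i - 1].le : NaN;
  }

  // The lowest bucket is assumed to start at 0
  const lowerBound = i > 0 ? sorted[i - 1].le : 0;
  const lowerCount = i > 0 ? sorted[i - 1].count : 0;
  const inBucket = sorted[i].count - lowerCount;
  if (inBucket === 0) return lowerBound;
  return lowerBound + (sorted[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Hypothetical deployment durations: 80 of 100 finish within 300 s, all within 600 s
const p95 = estimateQuantile(0.95, [
  { le: 60, count: 20 },
  { le: 300, count: 80 },
  { le: 600, count: 100 },
  { le: Infinity, count: 100 }
]);
// p95 is approximately 525 seconds, below the 600 s threshold of DeploymentDurationHigh
```

Because the result is interpolated, the estimate can never exceed the highest finite bucket bound, so bucket boundaries should extend past any threshold an alert compares against.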

### PagerDuty Integration

```yaml
# alertmanager.yml
# Alertmanager does not expand environment variables itself; substitute the
# ${...} placeholders when rendering this file at deploy time.
receivers:
  - name: stella-critical
    pagerduty_configs:
      - service_key: "${PAGERDUTY_SERVICE_KEY}"
        severity: critical

  - name: stella-warning
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL}"
        channel: "#stella-alerts"
        send_resolved: true

route:
  receiver: stella-warning
  routes:
    - match:
        severity: critical
      receiver: stella-critical
```
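
The routing tree above sends `critical` alerts to PagerDuty and everything else to Slack. The following sketch models that decision so the outcome for a given label set is easy to check. It is a simplification that assumes exact-match label matchers and first-match-wins ordering; `resolveReceiver` is a hypothetical helper, and options such as `continue`, regex matchers, and grouping are ignored.

```typescript
interface Route {
  receiver: string;
  match?: Record<string, string>;
  routes?: Route[];
}

// Walks child routes in order; falls back to the parent's receiver when none match
function resolveReceiver(route: Route, labels: Record<string, string>): string {
  for (const child of route.routes || []) {
    const matchers = child.match || {};
    const matches = Object.keys(matchers).every((key) => labels[key] === matchers[key]);
    if (matches) return resolveReceiver(child, labels);
  }
  return route.receiver;
}

const rootRoute: Route = {
  receiver: 'stella-warning',
  routes: [{ match: { severity: 'critical' }, receiver: 'stella-critical' }]
};

const criticalTarget = resolveReceiver(rootRoute, { alertname: 'AgentOffline', severity: 'critical' });
const infoTarget = resolveReceiver(rootRoute, { alertname: 'ApprovalWaitLong', severity: 'info' });
// criticalTarget === 'stella-critical', infoTarget === 'stella-warning'
```

Note that `info` alerts also land on the default receiver, since only `critical` has a dedicated route.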

## Dashboards

### Deployment Dashboard

Key panels:
- Deployment rate over time
- Success/failure ratio
- Average deployment duration
- Deployment duration histogram
- Active deployments by environment
- Recent deployment list

### Agent Health Dashboard

Key panels:
- Connected agents count
- Agent heartbeat status
- Tasks per agent
- Task success rate by agent
- Agent resource utilization

### Approval Dashboard

Key panels:
- Pending approvals count
- Approval response time
- Approvals by user
- Rejection reasons breakdown

## Health Endpoints

### Application Health

```http
GET /health
```

Response:
```json
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 86400,
  "checks": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "vault": { "status": "healthy", "latency": 10 }
  }
}
```
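
One plausible way to derive the top-level `status` in the response above is to report `healthy` only when every dependency check is healthy. The sketch below illustrates that rule. The `unhealthy` value, the millisecond unit for `latency`, and the `aggregateHealth` helper are assumptions made for this example; the actual endpoint may use different values or a degraded state.

```typescript
type HealthStatus = 'healthy' | 'unhealthy'; // 'unhealthy' is an assumed value

interface CheckResult {
  status: HealthStatus;
  latency: number; // assumed to be milliseconds
}

interface HealthSummary {
  status: HealthStatus;
  checks: Record<string, CheckResult>;
}

// Overall status is healthy only when every dependency check is healthy
function aggregateHealth(checks: Record<string, CheckResult>): HealthSummary {
  const allHealthy = Object.keys(checks).every((name) => checks[name].status === 'healthy');
  return { status: allHealthy ? 'healthy' : 'unhealthy', checks };
}

const summary = aggregateHealth({
  database: { status: 'healthy', latency: 5 },
  redis: { status: 'unhealthy', latency: 250 },
  vault: { status: 'healthy', latency: 10 }
});
// summary.status === 'unhealthy' because the redis check failed
```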

In Valkey-backed deployments, the `redis` check reflects the Redis-compatible
Valkey cache.

### Readiness Probe

```http
GET /health/ready
```

### Liveness Probe

```http
GET /health/live
```

## Performance Tuning

### Database Connection Pool

```typescript
const poolConfig = {
  min: 5,
  max: 20,
  acquireTimeout: 30000,
  idleTimeout: 600000,
  connectionTimeout: 10000
};
```

### Cache Configuration

```typescript
const cacheConfig = {
  // Release cache
  releases: {
    ttl: 300,           // 5 minutes
    maxSize: 1000
  },
  // Target cache
  targets: {
    ttl: 60,            // 1 minute
    maxSize: 5000
  },
  // Workflow template cache
  templates: {
    ttl: 3600,          // 1 hour
    maxSize: 100
  }
};
```
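
To show how `ttl` and `maxSize` interact, here is a minimal in-memory cache sketch. It assumes `ttl` is in seconds (as the comments above suggest) and that the oldest inserted entry is evicted once `maxSize` is reached; the real cache may use a different eviction policy or a Redis-compatible backend. `TtlCache` is a hypothetical class for illustration.

```typescript
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlSeconds: number;
  private readonly maxSize: number;

  constructor(ttlSeconds: number, maxSize: number) {
    this.ttlSeconds = ttlSeconds;
    this.maxSize = maxSize;
  }

  // `now` is injectable (milliseconds) so expiry can be tested deterministically
  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxSize) {
      // Evict the oldest inserted entry; Map iterates in insertion order
      const oldest = this.entries.keys().next();
      if (!oldest.done) this.entries.delete(oldest.value);
    }
    this.entries.set(key, { value, expiresAt: now + this.ttlSeconds * 1000 });
  }

  get size(): number {
    return this.entries.size;
  }
}

// Mirrors the `releases` settings above: 5 minute TTL, up to 1000 entries
const releaseCache = new TtlCache<string>(300, 1000);
releaseCache.set('release-uuid', 'cached release payload');
```

A short TTL for targets and a long one for templates, as configured above, trades freshness against load: the faster the underlying data changes, the shorter the TTL should be.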

### Rate Limiting

```typescript
const rateLimitConfig = {
  // API rate limits
  api: {
    windowMs: 60000,    // 1 minute
    max: 1000,          // requests per window
    burst: 100          // burst allowance
  },
  // Webhook rate limits
  webhooks: {
    windowMs: 60000,
    max: 100
  },
  // Per-tenant limits
  tenant: {
    windowMs: 60000,
    max: 500
  }
};
```
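
The `windowMs` and `max` settings above describe a window-based limit. As a rough sketch of the idea, the fixed-window counter below allows up to `max` requests per window and rejects the rest until the window rolls over. This is an assumption about the algorithm, since the configuration does not say whether a fixed window, sliding window, or token bucket is used, and the `burst` setting is not modelled. `FixedWindowLimiter` is a hypothetical class.

```typescript
class FixedWindowLimiter {
  private windowStart = 0;
  private count = 0;
  private readonly windowMs: number;
  private readonly max: number;

  constructor(windowMs: number, max: number) {
    this.windowMs = windowMs;
    this.max = max;
  }

  // Returns true when the request is allowed; `now` is injectable for testing
  allow(now: number = Date.now()): boolean {
    if (now - this.windowStart >= this.windowMs) {
      // Start a new window and reset the counter
      this.windowStart = now;
      this.count = 0;
    }
    if (this.count >= this.max) return false;
    this.count += 1;
    return true;
  }
}

// Mirrors the `webhooks` settings above: 100 requests per minute
const webhookLimiter = new FixedWindowLimiter(60000, 100);
const firstAllowed = webhookLimiter.allow();
```

Per-tenant limits could be layered on top by keeping one limiter instance per tenant ID, for example in a `Map`.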

## References

- [Metrics Reference](metrics.md)
- [Logging Guide](logging.md)
- [Tracing Setup](tracing.md)
- [Alert Configuration](alerting.md)