release orchestrator pivot, architecture and planning

2026-01-10 22:37:22 +02:00
parent c84f421e2f
commit d509c44411
130 changed files with 70292 additions and 721 deletions
--- a/docs/modules/release-orchestrator/operations/overview.md
+++ b/docs/modules/release-orchestrator/operations/overview.md
@@ -0,0 +1,508 @@
+# Operations Overview
+
+## Observability Stack
+
+Release Orchestrator exposes metrics, structured logs, distributed traces, and events. Each signal is shipped to its own backend and visualized in Grafana.
+
+```
+              OBSERVABILITY ARCHITECTURE
+
+  ┌─────────────────────────────────────────────────────────────────────────────┐
+  │                     RELEASE ORCHESTRATOR                                     │
+  │                                                                             │
+  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
+  │  │ Metrics     │  │ Logs        │  │ Traces      │  │ Events      │        │
+  │  │ Exporter    │  │ Collector   │  │ Exporter    │  │ Publisher   │        │
+  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
+  │         │                │                │                │                │
+  └─────────┼────────────────┼────────────────┼────────────────┼────────────────┘
+            │                │                │                │
+            ▼                ▼                ▼                ▼
+  ┌─────────────────────────────────────────────────────────────────────────────┐
+  │                      OBSERVABILITY BACKENDS                                  │
+  │                                                                             │
+  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
+  │  │ Prometheus  │  │ Loki /      │  │ Jaeger /    │  │ Event       │        │
+  │  │ / Mimir     │  │ Elasticsearch│  │ Tempo       │  │ Bus         │        │
+  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
+  │         │                │                │                │                │
+  │         └────────────────┴────────────────┴────────────────┘                │
+  │                                    │                                        │
+  │                                    ▼                                        │
+  │                           ┌─────────────────┐                               │
+  │                           │    Grafana      │                               │
+  │                           │   Dashboards    │                               │
+  │                           └─────────────────┘                               │
+  │                                                                             │
+  └─────────────────────────────────────────────────────────────────────────────┘
+```
+
+## Metrics
+
+### Core Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_releases_total` | counter | Total releases created | `tenant`, `status` |
+| `stella_promotions_total` | counter | Total promotions | `tenant`, `env`, `status` |
+| `stella_deployments_total` | counter | Total deployments | `tenant`, `env`, `strategy`, `status` |
+| `stella_deployment_duration_seconds` | histogram | Deployment duration | `tenant`, `env`, `strategy` |
+| `stella_rollbacks_total` | counter | Total rollbacks | `tenant`, `env`, `reason` |
+| `stella_agents_connected` | gauge | Connected agents | `tenant` |
+| `stella_targets_total` | gauge | Total targets | `tenant`, `env`, `type` |
+| `stella_workflow_runs_total` | counter | Workflow executions | `tenant`, `template`, `status` |
+| `stella_workflow_step_duration_seconds` | histogram | Step execution time | `step_type` |
+| `stella_approval_pending_count` | gauge | Pending approvals | `tenant`, `env` |
+| `stella_approval_duration_seconds` | histogram | Time to approve | `tenant`, `env` |
+
+### API Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_http_requests_total` | counter | HTTP requests | `method`, `path`, `status` |
+| `stella_http_request_duration_seconds` | histogram | Request latency | `method`, `path` |
+| `stella_http_requests_in_flight` | gauge | Active requests | `method` |
+
+### Agent Metrics
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `stella_agent_tasks_total` | counter | Tasks executed | `agent`, `type`, `status` |
+| `stella_agent_task_duration_seconds` | histogram | Task duration | `agent`, `type` |
+| `stella_agent_heartbeat_age_seconds` | gauge | Seconds since last heartbeat | `agent` |
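+
+As an illustration of how a gauge such as `stella_agent_heartbeat_age_seconds` might be produced, the sketch below renders it in the Prometheus text exposition format. The `lastHeartbeatMs` map and the function names are hypothetical; the actual exporter may be built differently.
+
+```typescript
+// Escape a label value for the Prometheus text exposition format
+function escapeLabelValue(value: string): string {
+  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
+}
+
+// lastHeartbeatMs: hypothetical map of agent ID -> last heartbeat time (epoch ms)
+function renderHeartbeatAge(lastHeartbeatMs: Map<string, number>, nowMs: number): string {
+  const lines = ['# TYPE stella_agent_heartbeat_age_seconds gauge'];
+  lastHeartbeatMs.forEach((seenMs, agent) => {
+    // Clamp at zero in case of clock skew between agent and orchestrator
+    const ageSeconds = Math.max(0, (nowMs - seenMs) / 1000);
+    lines.push(
+      `stella_agent_heartbeat_age_seconds{agent="${escapeLabelValue(agent)}"} ${ageSeconds}`
+    );
+  });
+  return lines.join('\n') + '\n';
+}
+```
+
+Computing the age at scrape time, rather than storing it, keeps the gauge accurate even if the agent stops reporting.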
+
+### Prometheus Configuration
+
+```yaml
+# prometheus.yml
+scrape_configs:
+  - job_name: 'stella-orchestrator'
+    static_configs:
+      - targets: ['stella-orchestrator:9090']
+    metrics_path: /metrics
+    scheme: https
+    tls_config:
+      ca_file: /etc/prometheus/ca.crt
+
+  - job_name: 'stella-agents'
+    kubernetes_sd_configs:
+      - role: pod
+        selectors:
+          - role: pod
+            label: "app.kubernetes.io/name=stella-agent"
+    relabel_configs:
+      - source_labels: [__meta_kubernetes_pod_label_agent_id]
+        target_label: agent_id
+```
+
+## Logging
+
+### Log Format
+
+```json
+{
+  "timestamp": "2026-01-09T10:30:00.123Z",
+  "level": "info",
+  "message": "Deployment started",
+  "service": "deploy-orchestrator",
+  "version": "1.0.0",
+  "traceId": "abc123def456",
+  "spanId": "789ghi",
+  "tenantId": "tenant-uuid",
+  "correlationId": "corr-uuid",
+  "context": {
+    "deploymentJobId": "job-uuid",
+    "releaseId": "release-uuid",
+    "environmentId": "env-uuid"
+  }
+}
+```
+
+### Log Levels
+
+| Level | Usage |
+|-------|-------|
+| `error` | Failures requiring attention |
+| `warn` | Degraded operation, recoverable issues |
+| `info` | Business events (deployment started, approval granted) |
+| `debug` | Detailed operational info |
+| `trace` | Very detailed debugging |
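+
+A minimal sketch of level filtering, assuming the levels are ordered from most to least severe as in the table above. `shouldLog` is a hypothetical helper, not part of the orchestrator's API.
+
+```typescript
+// Levels ordered from most to least severe, matching the table
+const LEVELS = ['error', 'warn', 'info', 'debug', 'trace'] as const;
+type Level = (typeof LEVELS)[number];
+
+// A message is emitted when it is at least as severe as the configured threshold
+function shouldLog(level: Level, threshold: Level): boolean {
+  return LEVELS.indexOf(level) <= LEVELS.indexOf(threshold);
+}
+```
+
+With a threshold of `info`, `error`, `warn`, and `info` messages pass while `debug` and `trace` are dropped.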
+
+### Structured Logging Configuration
+
+```typescript
+// Logging configuration
+const loggerConfig = {
+  level: process.env.LOG_LEVEL || 'info',
+  format: 'json',
+  outputs: [
+    {
+      type: 'stdout',
+      format: 'json'
+    },
+    {
+      type: 'file',
+      path: '/var/log/stella/orchestrator.log',
+      rotation: {
+        maxSize: '100MB',
+        maxFiles: 10
+      }
+    }
+  ],
+  // Sensitive field masking
+  redact: [
+    'password',
+    'token',
+    'secret',
+    'credentials',
+    'authorization'
+  ]
+};
+```
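+
+The `redact` list could be applied along the following lines. This is a sketch only: the `[REDACTED]` marker, the case-insensitive key match, and the `redactFields` name are assumptions, not the logger's documented behavior.
+
+```typescript
+const REDACT_KEYS = ['password', 'token', 'secret', 'credentials', 'authorization'];
+
+// Recursively copy a log payload, masking any value whose key is in the list
+function redactFields(value: unknown, keys: string[] = REDACT_KEYS): unknown {
+  if (Array.isArray(value)) {
+    return value.map((item) => redactFields(item, keys));
+  }
+  if (value !== null && typeof value === 'object') {
+    const out: Record<string, unknown> = {};
+    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
+      // Assumption: keys are compared case-insensitively and must match exactly
+      out[key] = keys.includes(key.toLowerCase()) ? '[REDACTED]' : redactFields(inner, keys);
+    }
+    return out;
+  }
+  return value;
+}
+```
+
+The function returns a copy, so the original object can still be used by the caller after logging.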
+
+### Important Log Events
+
+| Event | Level | Description |
+|-------|-------|-------------|
+| `deployment.started` | info | Deployment job started |
+| `deployment.completed` | info | Deployment successful |
+| `deployment.failed` | error | Deployment failed |
+| `rollback.initiated` | warn | Rollback triggered |
+| `approval.granted` | info | Promotion approved |
+| `approval.denied` | info | Promotion rejected |
+| `agent.connected` | info | Agent came online |
+| `agent.disconnected` | warn | Agent went offline |
+| `security.gate.failed` | warn | Security check blocked |
+
+## Distributed Tracing
+
+### Trace Context Propagation
+
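+```typescript
+import type { Request, Response, NextFunction } from 'express';
+
+// Trace context in requests
+interface TraceContext {
+  traceId: string;
+  spanId: string;
+  parentSpanId?: string;
+  sampled: boolean;
+  baggage?: Record<string, string>;
+}
+
+// Minimal tracer surface assumed by this example (not a specific library's API)
+interface Span {
+  setAttribute(key: string, value: unknown): void;
+  end(): void;
+}
+interface Tracer {
+  startSpan(
+    name: string,
+    options: { parent?: TraceContext; attributes?: Record<string, unknown> }
+  ): Span;
+}
+
+// Express request extended with the fields this middleware reads and writes
+type TracedRequest = Request & { tenantId?: string; span?: Span };
+
+// W3C Trace Context headers
+// traceparent: 00-{traceId}-{spanId}-{flags}
+// tracestate: stella=...
+
+// Example trace propagation
+class TracingMiddleware {
+  private readonly tracer: Tracer;
+
+  constructor(tracer: Tracer) {
+    this.tracer = tracer;
+  }
+
+  // Parse a version-00 traceparent header; undefined when absent or malformed
+  private parseTraceParent(header: string | string[] | undefined): TraceContext | undefined {
+    const value = Array.isArray(header) ? header[0] : header;
+    const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(value ?? '');
+    if (!match) return undefined;
+    const sampled = (parseInt(match[3], 16) & 0x01) === 1;
+    return { traceId: match[1], spanId: match[2], sampled };
+  }
+
+  // Arrow function keeps `this` bound when the handler is passed to app.use()
+  handle = (req: TracedRequest, res: Response, next: NextFunction): void => {
+    const traceContext = this.parseTraceParent(req.headers['traceparent']);
+
+    // Start span for this request
+    const span = this.tracer.startSpan('http.request', {
+      parent: traceContext,
+      attributes: {
+        'http.method': req.method,
+        'http.url': req.url,
+        'http.user_agent': req.headers['user-agent'],
+        'tenant.id': req.tenantId
+      }
+    });
+
+    // Attach to request for downstream use
+    req.span = span;
+
+    res.on('finish', () => {
+      span.setAttribute('http.status_code', res.statusCode);
+      span.end();
+    });
+
+    next();
+  };
+}
+```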
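+
+For outgoing calls, the same `traceparent` layout has to be written back onto the request. The sketch below shows one way to derive a child context and format the header. `TraceIds`, `childOf`, and `formatTraceParent` are hypothetical names, and `Math.random` is used only to keep the example dependency-free.
+
+```typescript
+interface TraceIds {
+  traceId: string;  // 32 lowercase hex characters
+  spanId: string;   // 16 lowercase hex characters
+  sampled: boolean;
+}
+
+// Generate `length` random lowercase hex characters (not cryptographically secure)
+function randomHex(length: number): string {
+  let out = '';
+  for (let i = 0; i < length; i++) {
+    out += Math.floor(Math.random() * 16).toString(16);
+  }
+  return out;
+}
+
+// A child span keeps the trace ID and sampling decision but gets a new span ID
+function childOf(parent: TraceIds): TraceIds {
+  return { traceId: parent.traceId, spanId: randomHex(16), sampled: parent.sampled };
+}
+
+// Format as a version-00 W3C traceparent header value
+function formatTraceParent(ctx: TraceIds): string {
+  const flags = ctx.sampled ? '01' : '00';
+  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
+}
+```
+
+A caller would set the `traceparent` header of the outgoing request to `formatTraceParent(childOf(current))`.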
+
+### Key Spans
+
+| Span Name | Description | Attributes |
+|-----------|-------------|------------|
+| `deployment.execute` | Full deployment | `release_id`, `environment` |
+| `task.dispatch` | Task dispatch to agent | `target_id`, `agent_id` |
+| `agent.execute` | Agent task execution | `task_type`, `duration` |
+| `workflow.run` | Workflow execution | `template_id`, `status` |
+| `workflow.step` | Individual step | `step_type`, `node_id` |
+| `approval.wait` | Waiting for approval | `promotion_id`, `duration` |
+| `gate.evaluate` | Gate evaluation | `gate_type`, `result` |
+
+### Jaeger Configuration
+
+```yaml
+# jaeger-config.yaml
+apiVersion: jaegertracing.io/v1
+kind: Jaeger
+metadata:
+  name: stella-jaeger
+spec:
+  strategy: production
+  collector:
+    maxReplicas: 5
+  storage:
+    type: elasticsearch
+    options:
+      es:
+        server-urls: https://elasticsearch:9200
+    secretName: jaeger-es-secret
+  ingress:
+    enabled: true
+```
+
+## Alerting
+
+### Alert Rules
+
+```yaml
+# prometheus-rules.yaml
+groups:
+  - name: stella.deployment
+    rules:
+      - alert: DeploymentFailureRateHigh
+        expr: |
+          sum(rate(stella_deployments_total{status="failed"}[5m])) /
+          sum(rate(stella_deployments_total[5m])) > 0.1
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "High deployment failure rate"
+          description: "More than 10% of deployments are failing"
+
+      - alert: DeploymentDurationHigh
+        expr: |
+          histogram_quantile(0.95, sum(rate(stella_deployment_duration_seconds_bucket[5m])) by (le, tenant)) > 600
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Deployment duration high"
+          description: "P95 deployment duration exceeds 10 minutes"
+
+      - alert: RollbackRateHigh
+        expr: |
+          sum(increase(stella_rollbacks_total[1h])) > 3
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High rollback rate"
+          description: "More than 3 rollbacks in the last hour"
+
+  - name: stella.agents
+    rules:
+      - alert: AgentOffline
+        expr: |
+          stella_agent_heartbeat_age_seconds > 120
+        for: 2m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Agent offline"
+          description: "Agent {{ $labels.agent }} has not sent heartbeat for 2 minutes"
+
+      - alert: AgentPoolLow
+        expr: |
+          sum(stella_agents_connected) by (tenant) < 2
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Low agent count"
+          description: "Fewer than 2 agents online for tenant {{ $labels.tenant }}"
+
+  - name: stella.approvals
+    rules:
+      - alert: ApprovalBacklogHigh
+        expr: |
+          stella_approval_pending_count > 10
+        for: 1h
+        labels:
+          severity: warning
+        annotations:
+          summary: "Approval backlog growing"
+          description: "More than 10 pending approvals for over an hour"
+
+      - alert: ApprovalWaitLong
+        expr: |
+          histogram_quantile(0.90, sum(rate(stella_approval_duration_seconds_bucket[24h])) by (le)) > 86400
+        for: 1h
+        labels:
+          severity: info
+        annotations:
+          summary: "Long approval wait times"
+          description: "P90 approval wait time exceeds 24 hours"
+```
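+
+The duration alerts above depend on how `histogram_quantile` estimates a percentile from cumulative buckets. The following is a simplified sketch of that estimate (linear interpolation within the bucket that contains the target rank), not Prometheus's actual implementation. The `Bucket` type and `estimateQuantile` name are illustrative.
+
+```typescript
+// Cumulative bucket as exposed by a classic Prometheus histogram
+interface Bucket {
+  le: number;     // upper bound in seconds; Infinity for the +Inf bucket
+  count: number;  // cumulative number of observations <= le
+}
+
+// Expects 0 < q <= 1 and buckets sorted by `le`, with the +Inf bucket last
+function estimateQuantile(q: number, buckets: Bucket[]): number {
+  if (buckets.length < 2) return NaN;
+  const total = buckets[buckets.length - 1].count;
+  if (total === 0) return NaN;
+
+  const rank = q * total;
+  const i = buckets.findIndex((b) => b.count >= rank);
+
+  // Rank falls in the +Inf bucket: report the highest finite bound
+  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;
+
+  const lower = i === 0 ? 0 : buckets[i - 1].le;
+  const below = i === 0 ? 0 : buckets[i - 1].count;
+  const inBucket = buckets[i].count - below;
+  if (inBucket === 0) return buckets[i].le;
+
+  // Interpolate linearly between the bucket's lower and upper bounds
+  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
+}
+```
+
+One consequence: when the target rank lands in the `+Inf` bucket, the estimate is capped at the highest finite bound. For a threshold such as `> 600` to fire, the histogram therefore needs at least one finite bucket boundary above 600 seconds.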
+
+### PagerDuty Integration
+
+```yaml
+# alertmanager.yml
+# ${...} values are placeholders. Alertmanager does not expand environment
+# variables itself, so substitute them before the file is loaded.
+receivers:
+  - name: stella-critical
+    pagerduty_configs:
+      - service_key: "${PAGERDUTY_SERVICE_KEY}"
+        severity: critical
+
+  - name: stella-warning
+    slack_configs:
+      - api_url: "${SLACK_WEBHOOK_URL}"
+        channel: "#stella-alerts"
+        send_resolved: true
+
+route:
+  # Default receiver for anything not matched below
+  receiver: stella-warning
+  routes:
+    - match:
+        severity: critical
+      receiver: stella-critical
+```
+
+## Dashboards
+
+### Deployment Dashboard
+
+Key panels:
+- Deployment rate over time
+- Success/failure ratio
+- Average deployment duration
+- Deployment duration histogram
+- Active deployments by environment
+- Recent deployment list
+
+### Agent Health Dashboard
+
+Key panels:
+- Connected agents count
+- Agent heartbeat status
+- Tasks per agent
+- Task success rate by agent
+- Agent resource utilization
+
+### Approval Dashboard
+
+Key panels:
+- Pending approvals count
+- Approval response time
+- Approvals by user
+- Rejection reasons breakdown
+
+## Health Endpoints
+
+### Application Health
+
+```http
+GET /health
+```
+
+Response:
+```json
+{
+  "status": "healthy",
+  "version": "1.0.0",
+  "uptime": 86400,
+  "checks": {
+    "database": { "status": "healthy", "latency": 5 },
+    "redis": { "status": "healthy", "latency": 2 },
+    "vault": { "status": "healthy", "latency": 10 }
+  }
+}
+```
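+
+One plausible way to derive the top-level `status` from the individual checks is sketched below. The rule (healthy only if every dependency is healthy) and the `unhealthy` value are assumptions; the source only shows the all-healthy response.
+
+```typescript
+type CheckStatus = 'healthy' | 'unhealthy';
+
+interface DependencyCheck {
+  status: CheckStatus;
+  latency: number;
+}
+
+// Assumed rule: the service is healthy only when every dependency check is healthy
+function overallStatus(checks: Record<string, DependencyCheck>): CheckStatus {
+  const allHealthy = Object.keys(checks).every((name) => checks[name].status === 'healthy');
+  return allHealthy ? 'healthy' : 'unhealthy';
+}
+```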
+
+### Readiness Probe
+
+```http
+GET /health/ready
+```
+
+### Liveness Probe
+
+```http
+GET /health/live
+```
+
+## Performance Tuning
+
+### Database Connection Pool
+
+```typescript
+const poolConfig = {
+  min: 5,
+  max: 20,
+  acquireTimeout: 30000,
+  idleTimeout: 600000,
+  connectionTimeout: 10000
+};
+```
+
+### Cache Configuration
+
+```typescript
+const cacheConfig = {
+  // Release cache
+  releases: {
+    ttl: 300,           // 5 minutes
+    maxSize: 1000
+  },
+  // Target cache
+  targets: {
+    ttl: 60,            // 1 minute
+    maxSize: 5000
+  },
+  // Workflow template cache
+  templates: {
+    ttl: 3600,          // 1 hour
+    maxSize: 100
+  }
+};
+```
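+
+To show how `ttl` and `maxSize` interact, here is a minimal in-memory cache sketch. It assumes `ttl` is in seconds (as the comments above suggest) and evicts the oldest inserted entry when full. The `TtlCache` class is hypothetical and the real cache may use a different eviction policy.
+
+```typescript
+class TtlCache<V> {
+  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
+  private readonly ttlMs: number;
+  private readonly maxSize: number;
+  private readonly now: () => number;
+
+  // `now` is injectable so expiry can be tested without waiting
+  constructor(ttlSeconds: number, maxSize: number, now: () => number = Date.now) {
+    this.ttlMs = ttlSeconds * 1000;
+    this.maxSize = maxSize;
+    this.now = now;
+  }
+
+  get(key: string): V | undefined {
+    const entry = this.entries.get(key);
+    if (entry === undefined) return undefined;
+    if (entry.expiresAt <= this.now()) {
+      // Expired entries are removed lazily on read
+      this.entries.delete(key);
+      return undefined;
+    }
+    return entry.value;
+  }
+
+  set(key: string, value: V): void {
+    // Deleting first moves a re-inserted key to the end of the insertion order
+    this.entries.delete(key);
+    if (this.entries.size >= this.maxSize) {
+      const oldest = this.entries.keys().next().value;
+      if (oldest !== undefined) this.entries.delete(oldest);
+    }
+    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
+  }
+}
+```
+
+A release cache matching the configuration above would be created as `new TtlCache(cacheConfig.releases.ttl, cacheConfig.releases.maxSize)`.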
+
+### Rate Limiting
+
+```typescript
+const rateLimitConfig = {
+  // API rate limits
+  api: {
+    windowMs: 60000,    // 1 minute
+    max: 1000,          // requests per window
+    burst: 100          // burst allowance
+  },
+  // Webhook rate limits
+  webhooks: {
+    windowMs: 60000,
+    max: 100
+  },
+  // Per-tenant limits
+  tenant: {
+    windowMs: 60000,
+    max: 500
+  }
+};
+```
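+
+The `windowMs` and `max` settings read like a fixed-window counter. The sketch below shows that technique under that assumption. `FixedWindowLimiter` is a hypothetical class, and the `burst` allowance is not modeled.
+
+```typescript
+class FixedWindowLimiter {
+  private readonly windows = new Map<string, { start: number; count: number }>();
+  private readonly windowMs: number;
+  private readonly max: number;
+  private readonly now: () => number;
+
+  // `now` is injectable so window rollover can be tested deterministically
+  constructor(windowMs: number, max: number, now: () => number = Date.now) {
+    this.windowMs = windowMs;
+    this.max = max;
+    this.now = now;
+  }
+
+  // Returns true when the request identified by `key` is within its limit
+  allow(key: string): boolean {
+    if (this.max <= 0) return false;
+    const t = this.now();
+    const current = this.windows.get(key);
+    if (current === undefined || t - current.start >= this.windowMs) {
+      // First request for this key, or the previous window has elapsed
+      this.windows.set(key, { start: t, count: 1 });
+      return true;
+    }
+    if (current.count < this.max) {
+      current.count += 1;
+      return true;
+    }
+    return false;
+  }
+}
+```
+
+Keying the limiter by tenant ID gives the per-tenant limit; keying by client or API token gives the API limit. Fixed windows allow up to twice `max` requests across a window boundary, which is one reason a separate burst setting may exist.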
+
+## References
+
+- [Metrics Reference](metrics.md)
+- [Logging Guide](logging.md)
+- [Tracing Setup](tracing.md)
+- [Alert Configuration](alerting.md)