# Operations Overview

## Observability Stack

Release Orchestrator exposes metrics, structured logs, distributed traces, and events. Each signal is shipped to a dedicated backend, and the metric, log, and trace backends are visualized together in Grafana.

```
                         OBSERVABILITY ARCHITECTURE

┌─────────────────────────────────────────────────────────────────────────────┐
│                            RELEASE ORCHESTRATOR                             │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Metrics   │  │    Logs     │  │   Traces    │  │   Events    │         │
│  │  Exporter   │  │  Collector  │  │  Exporter   │  │  Publisher  │         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
│         │                │                │                │                │
└─────────┼────────────────┼────────────────┼────────────────┼────────────────┘
          │                │                │                │
          ▼                ▼                ▼                ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           OBSERVABILITY BACKENDS                            │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ Prometheus  │  │   Loki /    │  │  Jaeger /   │  │    Event    │         │
│  │   / Mimir   │  │Elasticsearch│  │    Tempo    │  │     Bus     │         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
│         │                │                │                │                │
│         └────────────────┴────────────────┴────────────────┘                │
│                                   │                                         │
│                                   ▼                                         │
│                          ┌─────────────────┐                                │
│                          │     Grafana     │                                │
│                          │   Dashboards    │                                │
│                          └─────────────────┘                                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Metrics

### Core Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_releases_total` | counter | Total releases created | `tenant`, `status` |
| `stella_promotions_total` | counter | Total promotions | `tenant`, `env`, `status` |
| `stella_deployments_total` | counter | Total deployments | `tenant`, `env`, `strategy` |
| `stella_deployment_duration_seconds` | histogram | Deployment duration | `tenant`, `env`, `strategy` |
| `stella_rollbacks_total` | counter | Total rollbacks | `tenant`, `env`, `reason` |
| `stella_agents_connected` | gauge | Connected agents | `tenant` |
| `stella_targets_total` | gauge | Total targets | `tenant`, `env`, `type` |
| `stella_workflow_runs_total` | counter | Workflow executions | `tenant`, `template`, `status` |
| `stella_workflow_step_duration_seconds` | histogram | Step execution time | `step_type` |
| `stella_approval_pending_count` | gauge | Pending approvals | `tenant`, `env` |
| `stella_approval_duration_seconds` | histogram | Time to approve | `tenant`, `env` |

### API Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_http_requests_total` | counter | HTTP requests | `method`, `path`, `status` |
| `stella_http_request_duration_seconds` | histogram | Request latency | `method`, `path` |
| `stella_http_requests_in_flight` | gauge | Active requests | `method` |

### Agent Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_agent_tasks_total` | counter | Tasks executed | `agent`, `type`, `status` |
| `stella_agent_task_duration_seconds` | histogram | Task duration | `agent`, `type` |
| `stella_agent_heartbeat_age_seconds` | gauge | Seconds since last heartbeat | `agent` |

### Prometheus Configuration

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt

  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id
```

## Logging

### Log Format

```json
{
  "timestamp": "2026-01-09T10:30:00.123Z",
  "level": "info",
  "message": "Deployment started",
  "service": "deploy-orchestrator",
  "version": "1.0.0",
  "traceId": "abc123def456",
  "spanId": "789ghi",
  "tenantId": "tenant-uuid",
  "correlationId": "corr-uuid",
  "context": {
    "deploymentJobId": "job-uuid",
    "releaseId": "release-uuid",
    "environmentId": "env-uuid"
  }
}
```

### Log Levels

| Level | Usage |
|-------|-------|
| `error` | Failures requiring attention |
| `warn` | Degraded operation, recoverable issues |
| `info` | Business events (deployment started, approval granted) |
| `debug` | Detailed operational info |
| `trace` | Very detailed debugging |

### Structured Logging Configuration

```typescript
// Logging configuration
const loggerConfig = {
  level: process.env.LOG_LEVEL || 'info',
  format: 'json',
  outputs: [
    { type: 'stdout', format: 'json' },
    {
      type: 'file',
      path: '/var/log/stella/orchestrator.log',
      rotation: { maxSize: '100MB', maxFiles: 10 }
    }
  ],
  // Sensitive field masking
  redact: [
    'password',
    'token',
    'secret',
    'credentials',
    'authorization'
  ]
};
```

### Important Log Events

| Event | Level | Description |
|-------|-------|-------------|
| `deployment.started` | info | Deployment job started |
| `deployment.completed` | info | Deployment successful |
| `deployment.failed` | error | Deployment failed |
| `rollback.initiated` | warn | Rollback triggered |
| `approval.granted` | info | Promotion approved |
| `approval.denied` | info | Promotion rejected |
| `agent.connected` | info | Agent came online |
| `agent.disconnected` | warn | Agent went offline |
| `security.gate.failed` | warn | Security check blocked |

## Distributed Tracing

### Trace Context Propagation

```typescript
// Trace context in requests
interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  sampled: boolean;
  baggage?: Record<string, string>;
}

// W3C Trace Context headers
// traceparent: 00-{traceId}-{spanId}-{flags}
// tracestate: stella=...

// Example trace propagation
// Note: `this.tracer` and `this.parseTraceParent` are not defined in this
// excerpt and are assumed to be provided elsewhere.
class TracingMiddleware {
  handle(req: Request, res: Response, next: NextFunction): void {
    const traceparent = req.headers['traceparent'];
    const traceContext = this.parseTraceParent(traceparent);

    // Start span for this request
    const span = this.tracer.startSpan('http.request', {
      parent: traceContext,
      attributes: {
        'http.method': req.method,
        'http.url': req.url,
        'http.user_agent': req.headers['user-agent'],
        'tenant.id': req.tenantId
      }
    });

    // Attach to request for downstream use
    req.span = span;

    res.on('finish', () => {
      span.setAttribute('http.status_code', res.statusCode);
      span.end();
    });

    next();
  }
}
```

### Key Spans

| Span Name | Description | Attributes |
|-----------|-------------|------------|
| `deployment.execute` | Full deployment | `release_id`, `environment` |
| `task.dispatch` | Task dispatch to agent | `target_id`, `agent_id` |
| `agent.execute` | Agent task execution | `task_type`, `duration` |
| `workflow.run` | Workflow execution | `template_id`, `status` |
| `workflow.step` | Individual step | `step_type`, `node_id` |
| `approval.wait` | Waiting for approval | `promotion_id`, `duration` |
| `gate.evaluate` | Gate evaluation | `gate_type`, `result` |

### Jaeger Configuration

```yaml
# jaeger-config.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: stella-jaeger
spec:
  strategy: production
  collector:
    maxReplicas: 5
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
    secretName: jaeger-es-secret
  ingress:
    enabled: true
```

## Alerting

### Alert Rules

```yaml
# prometheus-rules.yaml
groups:
  - name: stella.deployment
    rules:
      - alert: DeploymentFailureRateHigh
        expr: |
          sum(rate(stella_deployments_total{status="failed"}[5m])) /
          sum(rate(stella_deployments_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High deployment failure rate"
          description: "More than 10% of deployments are failing"

      - alert: DeploymentDurationHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(stella_deployment_duration_seconds_bucket[5m])) by (le, tenant)) > 600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Deployment duration high"
          description: "P95 deployment duration exceeds 10 minutes"

      - alert: RollbackRateHigh
        # increase() counts events over the window; rate() would yield a per-second value
        expr: |
          sum(increase(stella_rollbacks_total[1h])) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rollback rate"
          description: "More than 3 rollbacks in the last hour"

  - name: stella.agents
    rules:
      - alert: AgentOffline
        expr: |
          stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Agent offline"
          description: "Agent {{ $labels.agent }} has not sent heartbeat for 2 minutes"

      - alert: AgentPoolLow
        # The gauge value is the connected-agent count, so sum it per tenant
        expr: |
          sum(stella_agents_connected) by (tenant) < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low agent count"
          description: "Fewer than 2 agents online for tenant {{ $labels.tenant }}"

  - name: stella.approvals
    rules:
      - alert: ApprovalBacklogHigh
        expr: |
          stella_approval_pending_count > 10
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Approval backlog growing"
          description: "More than 10 pending approvals for over an hour"

      - alert: ApprovalWaitLong
        # histogram_quantile expects bucket rates aggregated by le
        expr: |
          histogram_quantile(0.90,
            sum(rate(stella_approval_duration_seconds_bucket[1h])) by (le)) > 86400
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Long approval wait times"
          description: "P90 approval wait time exceeds 24 hours"
```

### PagerDuty Integration

Critical alerts page through PagerDuty; all other alerts go to Slack by default.

```typescript
// Alertmanager receivers and routing, expressed as a TypeScript object
const alertManagerConfig = {
  receivers: [
    {
      name: "stella-critical",
      pagerduty_configs: [
        {
          service_key: "${PAGERDUTY_SERVICE_KEY}",
          severity: "critical"
        }
      ]
    },
    {
      name: "stella-warning",
      slack_configs: [
        {
          api_url: "${SLACK_WEBHOOK_URL}",
          channel: "#stella-alerts",
          send_resolved: true
        }
      ]
    }
  ],
  route: {
    // Default receiver; critical alerts are rerouted below
    receiver: "stella-warning",
    routes: [
      {
        match: { severity: "critical" },
        receiver: "stella-critical"
      }
    ]
  }
};
```

## Dashboards

### Deployment Dashboard

Key panels:
- Deployment rate over time
- Success/failure ratio
- Average deployment duration
- Deployment duration histogram
- Active deployments by environment
- Recent deployment list

### Agent Health Dashboard

Key panels:
- Connected agents count
- Agent heartbeat status
- Tasks per agent
- Task success rate by agent
- Agent resource utilization

### Approval Dashboard

Key panels:
- Pending approvals count
- Approval response time
- Approvals by user
- Rejection reasons breakdown

## Health Endpoints

### Application Health

```http
GET /health
```

Response:

```json
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 86400,
  "checks": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "vault": { "status": "healthy", "latency": 10 }
  }
}
```

In Valkey-backed deployments, the `redis` check reflects the Redis-compatible Valkey cache.

### Readiness Probe

```http
GET /health/ready
```

### Liveness Probe

```http
GET /health/live
```

## Performance Tuning

### Database Connection Pool

```typescript
const poolConfig = {
  min: 5,
  max: 20,
  acquireTimeout: 30000,
  idleTimeout: 600000,
  connectionTimeout: 10000
};
```

### Cache Configuration

```typescript
const cacheConfig = {
  // Release cache
  releases: {
    ttl: 300,        // 5 minutes
    maxSize: 1000
  },
  // Target cache
  targets: {
    ttl: 60,         // 1 minute
    maxSize: 5000
  },
  // Workflow template cache
  templates: {
    ttl: 3600,       // 1 hour
    maxSize: 100
  }
};
```

### Rate Limiting

```typescript
const rateLimitConfig = {
  // API rate limits
  api: {
    windowMs: 60000,  // 1 minute
    max: 1000,        // requests per window
    burst: 100        // burst allowance
  },
  // Webhook rate limits
  webhooks: {
    windowMs: 60000,
    max: 100
  },
  // Per-tenant limits
  tenant: {
    windowMs: 60000,
    max: 500
  }
};
```

## References

- [Metrics Reference](metrics.md)
- [Logging Guide](logging.md)
- [Tracing Setup](tracing.md)
- [Alert Configuration](alerting.md)
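
## Appendix: Parsing `traceparent`

The tracing middleware under Trace Context Propagation calls a `parseTraceParent` helper that this document does not define. The following is a minimal sketch of what such a helper might look like, assuming the W3C `traceparent` layout quoted there (`00-{traceId}-{spanId}-{flags}`) with a 32-hex-digit trace ID, a 16-hex-digit span ID, and the lowest flag bit meaning "sampled". The function signature, the `TRACEPARENT_PATTERN` constant, and the choice to return `undefined` on malformed input are assumptions made for illustration, not the orchestrator's actual implementation.

```typescript
// Same shape as the TraceContext interface shown under Trace Context Propagation.
interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  sampled: boolean;
  baggage?: Record<string, string>;
}

// Hypothetical constant: version "00", 32 hex trace ID, 16 hex span ID, 2 hex flags.
const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns undefined when the header is missing or malformed,
// so the caller can start a new root trace instead.
function parseTraceParent(
  header: string | string[] | undefined
): TraceContext | undefined {
  // Node-style header maps may hold repeated headers as an array; use the first.
  const value = Array.isArray(header) ? header[0] : header;
  if (!value) {
    return undefined;
  }

  const match = TRACEPARENT_PATTERN.exec(value.trim());
  if (!match) {
    return undefined;
  }

  const [, traceId, spanId, flags] = match;

  // All-zero IDs are treated as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) {
    return undefined;
  }

  return {
    traceId,
    spanId,
    // Bit 0 of the flags byte is the "sampled" flag.
    sampled: (parseInt(flags, 16) & 0x01) === 0x01
  };
}
```

For example, `parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")` yields a context with `sampled: true`, while a missing or malformed header yields `undefined`. Returning `undefined` rather than throwing keeps request handling on the happy path when upstream callers send no trace context.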