Operations Overview
Observability Stack
Release Orchestrator provides comprehensive observability through metrics, logging, and distributed tracing.
OBSERVABILITY ARCHITECTURE
┌─────────────────────────────────────────────────────────────────────────────┐
│                            RELEASE ORCHESTRATOR                             │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Metrics   │  │    Logs     │  │   Traces    │  │   Events    │         │
│  │  Exporter   │  │  Collector  │  │  Exporter   │  │  Publisher  │         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
│         │                │                │                │                │
└─────────┼────────────────┼────────────────┼────────────────┼────────────────┘
          │                │                │                │
          ▼                ▼                ▼                ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           OBSERVABILITY BACKENDS                            │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ Prometheus  │  │   Loki /    │  │  Jaeger /   │  │    Event    │         │
│  │   / Mimir   │  │Elasticsearch│  │    Tempo    │  │     Bus     │         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
│         │                │                │                │                │
│         └────────────────┴────────────────┴────────────────┘                │
│                                  │                                          │
│                                  ▼                                          │
│                         ┌─────────────────┐                                 │
│                         │     Grafana     │                                 │
│                         │   Dashboards    │                                 │
│                         └─────────────────┘                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Metrics
Core Metrics
| Metric | Type | Description | Labels |
|---|---|---|---|
| stella_releases_total | counter | Total releases created | tenant, status |
| stella_promotions_total | counter | Total promotions | tenant, env, status |
| stella_deployments_total | counter | Total deployments | tenant, env, strategy |
| stella_deployment_duration_seconds | histogram | Deployment duration | tenant, env, strategy |
| stella_rollbacks_total | counter | Total rollbacks | tenant, env, reason |
| stella_agents_connected | gauge | Connected agents | tenant |
| stella_targets_total | gauge | Total targets | tenant, env, type |
| stella_workflow_runs_total | counter | Workflow executions | tenant, template, status |
| stella_workflow_step_duration_seconds | histogram | Step execution time | step_type |
| stella_approval_pending_count | gauge | Pending approvals | tenant, env |
| stella_approval_duration_seconds | histogram | Time to approve | tenant, env |
API Metrics
| Metric | Type | Description | Labels |
|---|---|---|---|
| stella_http_requests_total | counter | HTTP requests | method, path, status |
| stella_http_request_duration_seconds | histogram | Request latency | method, path |
| stella_http_requests_in_flight | gauge | Active requests | method |
Agent Metrics
| Metric | Type | Description | Labels |
|---|---|---|---|
| stella_agent_tasks_total | counter | Tasks executed | agent, type, status |
| stella_agent_task_duration_seconds | histogram | Task duration | agent, type |
| stella_agent_heartbeat_age_seconds | gauge | Seconds since last heartbeat | agent |
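To make the label sets in the tables above concrete, the following is a minimal sketch of a labeled counter rendered in the Prometheus text exposition format. The `LabeledCounter` class and `escapeLabelValue` helper are hypothetical names used for illustration; they are not part of Release Orchestrator, and a real service would normally rely on a Prometheus client library instead.

```typescript
// Hypothetical in-memory counter, for illustration only.
type Labels = Record<string, string>;

// Escape backslash, double quote and newline, as the text format requires.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

class LabeledCounter {
  private readonly series = new Map<string, { labels: Labels; value: number }>();

  constructor(
    private readonly name: string,
    private readonly help: string
  ) {}

  // Increment the series identified by this exact label set.
  inc(labels: Labels, amount: number = 1): void {
    // Sort label names so key order does not create duplicate series.
    const key = JSON.stringify(
      Object.keys(labels).sort().map((k) => [k, labels[k]])
    );
    const entry = this.series.get(key) ?? { labels, value: 0 };
    entry.value += amount;
    this.series.set(key, entry);
  }

  // Render all series in the Prometheus text exposition format.
  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`
    ];
    this.series.forEach(({ labels, value }) => {
      const pairs = Object.keys(labels)
        .map((k) => `${k}="${escapeLabelValue(labels[k])}"`)
        .join(",");
      lines.push(`${this.name}{${pairs}} ${value}`);
    });
    return lines.join("\n") + "\n";
  }
}
```

Each distinct combination of label values becomes its own time series, which is why high-cardinality values (for example raw request paths) are usually normalized before being used as labels.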
Prometheus Configuration
# prometheus.yml
scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt

  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id
Logging
Log Format
{
  "timestamp": "2026-01-09T10:30:00.123Z",
  "level": "info",
  "message": "Deployment started",
  "service": "deploy-orchestrator",
  "version": "1.0.0",
  "traceId": "abc123def456",
  "spanId": "789ghi",
  "tenantId": "tenant-uuid",
  "correlationId": "corr-uuid",
  "context": {
    "deploymentJobId": "job-uuid",
    "releaseId": "release-uuid",
    "environmentId": "env-uuid"
  }
}
Log Levels
| Level | Usage |
|---|---|
| error | Failures requiring attention |
| warn | Degraded operation, recoverable issues |
| info | Business events (deployment started, approval granted) |
| debug | Detailed operational info |
| trace | Very detailed debugging |
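As a small illustration of how a configured level acts as a threshold, the sketch below assumes the levels are ordered from most to least severe exactly as listed in the table. The `shouldLog` function is a hypothetical helper, not an API of Release Orchestrator.

```typescript
// Levels from most to least severe, in the order of the table above.
const LOG_LEVELS = ["error", "warn", "info", "debug", "trace"] as const;
type LogLevel = (typeof LOG_LEVELS)[number];

// True when a message at `level` should be emitted under `threshold`.
// With threshold "info", error/warn/info pass and debug/trace are dropped.
function shouldLog(level: LogLevel, threshold: LogLevel): boolean {
  return LOG_LEVELS.indexOf(level) <= LOG_LEVELS.indexOf(threshold);
}
```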
Structured Logging Configuration
// Logging configuration
const loggerConfig = {
  level: process.env.LOG_LEVEL || 'info',
  format: 'json',
  outputs: [
    {
      type: 'stdout',
      format: 'json'
    },
    {
      type: 'file',
      path: '/var/log/stella/orchestrator.log',
      rotation: {
        maxSize: '100MB',
        maxFiles: 10
      }
    }
  ],
  // Sensitive field masking
  redact: [
    'password',
    'token',
    'secret',
    'credentials',
    'authorization'
  ]
};
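The `redact` list above names fields whose values must not reach the log output. One way such masking could work is sketched below; the `redact` function, the `[REDACTED]` placeholder, and the case-insensitive exact-name matching are all assumptions made for illustration, not the orchestrator's actual behavior.

```typescript
// Field names to mask, mirroring the `redact` list in the configuration.
const REDACT_KEYS = ["password", "token", "secret", "credentials", "authorization"];

// Recursively copy a value, replacing sensitive fields with a placeholder.
// Only plain objects and arrays are traversed; other values pass through.
function redact(value: unknown, keys: string[] = REDACT_KEYS): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, keys));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const result: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      const sensitive = keys.indexOf(key.toLowerCase()) >= 0;
      result[key] = sensitive ? "[REDACTED]" : redact(source[key], keys);
    }
    return result;
  }
  return value;
}
```

The function returns a copy rather than mutating its input, so the original request or context object can still be used after logging.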
Important Log Events
| Event | Level | Description |
|---|---|---|
| deployment.started | info | Deployment job started |
| deployment.completed | info | Deployment successful |
| deployment.failed | error | Deployment failed |
| rollback.initiated | warn | Rollback triggered |
| approval.granted | info | Promotion approved |
| approval.denied | info | Promotion rejected |
| agent.connected | info | Agent came online |
| agent.disconnected | warn | Agent went offline |
| security.gate.failed | warn | Security check blocked |
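The mapping from event name to level in the table can be kept in one place so that call sites cannot log an event at the wrong level. The sketch below is one possible shape; `buildEventLog` and the `event` field on the entry are hypothetical additions for illustration and do not appear in the log format shown earlier.

```typescript
type EventLevel = "error" | "warn" | "info";

// Default level per event name, copied from the table above.
const EVENT_LEVELS: Record<string, EventLevel> = {
  "deployment.started": "info",
  "deployment.completed": "info",
  "deployment.failed": "error",
  "rollback.initiated": "warn",
  "approval.granted": "info",
  "approval.denied": "info",
  "agent.connected": "info",
  "agent.disconnected": "warn",
  "security.gate.failed": "warn"
};

interface EventLogEntry {
  timestamp: string;
  level: EventLevel;
  event: string; // assumed field carrying the event name
  message: string;
  tenantId: string;
  context: Record<string, string>;
}

// Build a structured entry for a known event; unknown names are rejected.
function buildEventLog(
  event: string,
  message: string,
  tenantId: string,
  context: Record<string, string> = {},
  now: Date = new Date()
): EventLogEntry {
  if (!Object.prototype.hasOwnProperty.call(EVENT_LEVELS, event)) {
    throw new Error(`Unknown log event: ${event}`);
  }
  return {
    timestamp: now.toISOString(),
    level: EVENT_LEVELS[event],
    event,
    message,
    tenantId,
    context
  };
}
```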
Distributed Tracing
Trace Context Propagation
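import type { Request, Response, NextFunction } from 'express';

// Trace context in requests
interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  sampled: boolean;
  baggage?: Record<string, string>;
}

// Minimal tracer surface this example relies on (illustrative; adapt to your tracing SDK)
interface Span {
  setAttribute(key: string, value: unknown): void;
  end(): void;
}
interface Tracer {
  startSpan(name: string, options: { parent?: TraceContext; attributes?: Record<string, unknown> }): Span;
}

// Request fields this middleware reads (tenantId) and writes (span)
type TracedRequest = Request & { tenantId?: string; span?: Span };

// W3C Trace Context headers
// traceparent: 00-{traceId}-{spanId}-{flags}
// tracestate: stella=...

// Example trace propagation
class TracingMiddleware {
  constructor(private readonly tracer: Tracer) {}

  handle(req: TracedRequest, res: Response, next: NextFunction): void {
    // A header may arrive as a string or a string array; use the first value
    const header = req.headers['traceparent'];
    const traceContext = this.parseTraceParent(Array.isArray(header) ? header[0] : header);

    // Start span for this request
    const span = this.tracer.startSpan('http.request', {
      parent: traceContext,
      attributes: {
        'http.method': req.method,
        'http.url': req.url,
        'http.user_agent': req.headers['user-agent'],
        'tenant.id': req.tenantId
      }
    });

    // Attach to request for downstream use
    req.span = span;
    res.on('finish', () => {
      span.setAttribute('http.status_code', res.statusCode);
      span.end();
    });
    next();
  }

  // Parse a version-00 traceparent header; returns undefined when absent or malformed
  private parseTraceParent(header?: string): TraceContext | undefined {
    const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header ?? '');
    if (!match) return undefined;
    return {
      traceId: match[1],
      spanId: match[2],
      sampled: (parseInt(match[3], 16) & 1) === 1
    };
  }
}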
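Propagation also has an outbound side: when the orchestrator calls another service or dispatches work to an agent, the current context has to be written back into a `traceparent` header. The sketch below shows one way to build that header in the `00-{traceId}-{spanId}-{flags}` layout noted above. `formatTraceParent` and `OutgoingTraceContext` are hypothetical names; the validation rules follow the W3C Trace Context format (lowercase hex, fixed lengths, all-zero IDs invalid).

```typescript
// Subset of the trace context needed to build an outgoing header.
interface OutgoingTraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters (the current span)
  sampled: boolean;
}

// Build a version-00 traceparent header value for an outgoing request.
function formatTraceParent(ctx: OutgoingTraceContext): string {
  if (!/^[0-9a-f]{32}$/.test(ctx.traceId) || /^0+$/.test(ctx.traceId)) {
    throw new Error("traceId must be 32 lowercase hex characters and not all zeros");
  }
  if (!/^[0-9a-f]{16}$/.test(ctx.spanId) || /^0+$/.test(ctx.spanId)) {
    throw new Error("spanId must be 16 lowercase hex characters and not all zeros");
  }
  // Bit 0 of the flags byte is the "sampled" flag.
  const flags = ctx.sampled ? "01" : "00";
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}
```

The receiving side treats the span ID in this header as the parent of the span it starts, which is what links the two services into one trace.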
Key Spans
| Span Name | Description | Attributes |
|---|---|---|
| deployment.execute | Full deployment | release_id, environment |
| task.dispatch | Task dispatch to agent | target_id, agent_id |
| agent.execute | Agent task execution | task_type, duration |
| workflow.run | Workflow execution | template_id, status |
| workflow.step | Individual step | step_type, node_id |
| approval.wait | Waiting for approval | promotion_id, duration |
| gate.evaluate | Gate evaluation | gate_type, result |
Jaeger Configuration
# jaeger-config.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: stella-jaeger
spec:
  strategy: production
  collector:
    maxReplicas: 5
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
    secretName: jaeger-es-secret
  ingress:
    enabled: true
Alerting
Alert Rules
# prometheus-rules.yaml
groups:
  - name: stella.deployment
    rules:
      - alert: DeploymentFailureRateHigh
        # Assumes stella_deployments_total also carries a status label
        expr: |
          sum(rate(stella_deployments_total{status="failed"}[5m])) /
          sum(rate(stella_deployments_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High deployment failure rate"
          description: "More than 10% of deployments are failing"

      - alert: DeploymentDurationHigh
        expr: |
          histogram_quantile(0.95, sum(rate(stella_deployment_duration_seconds_bucket[5m])) by (le, tenant)) > 600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Deployment duration high"
          description: "P95 deployment duration exceeds 10 minutes"

      - alert: RollbackRateHigh
        # increase() counts rollbacks over the window; rate() would yield a per-second value
        expr: |
          sum(increase(stella_rollbacks_total[1h])) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rollback rate"
          description: "More than 3 rollbacks in the last hour"
  - name: stella.agents
    rules:
      - alert: AgentOffline
        expr: |
          stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Agent offline"
          description: "Agent {{ $labels.agent }} has not sent a heartbeat for more than 2 minutes"

      - alert: AgentPoolLow
        # stella_agents_connected is a gauge whose value is the agent count,
        # so sum its value per tenant rather than counting series
        expr: |
          sum(stella_agents_connected) by (tenant) < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low agent count"
          description: "Fewer than 2 agents online for tenant {{ $labels.tenant }}"
  - name: stella.approvals
    rules:
      - alert: ApprovalBacklogHigh
        expr: |
          stella_approval_pending_count > 10
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Approval backlog growing"
          description: "More than 10 pending approvals for over an hour"

      - alert: ApprovalWaitLong
        # Apply rate() and aggregate by le before taking the quantile;
        # the 1h window is a suggested value, tune it to your approval volume
        expr: |
          histogram_quantile(0.90, sum(rate(stella_approval_duration_seconds_bucket[1h])) by (le)) > 86400
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Long approval wait times"
          description: "P90 approval wait time exceeds 24 hours"
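Two of the rules above rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw observations. The sketch below reproduces the linear interpolation PromQL applies to classic histograms, so the thresholds are easier to reason about. `histogramQuantile` and `Bucket` are hypothetical names, and the sketch assumes positive bucket bounds, as is the case for durations.

```typescript
// One cumulative histogram bucket: number of observations with value <= le.
interface Bucket {
  le: number;
  count: number;
}

// Estimate quantile q (0..1) from cumulative buckets sorted by `le`.
// The last bucket must be the +Inf bucket.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const last = buckets.length - 1;
  if (buckets.length < 2 || buckets[last].le !== Infinity) return NaN;
  const total = buckets[last].count;
  if (total === 0) return NaN;
  if (q < 0) return -Infinity;
  if (q > 1) return Infinity;

  const rank = q * total;
  // Find the first bucket whose cumulative count reaches the rank.
  let i = 0;
  while (i < last && buckets[i].count < rank) i++;
  // A rank that lands in the +Inf bucket is clamped to the highest finite bound.
  if (i === last) return buckets[last - 1].le;

  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const countInBucket = buckets[i].count - countBelow;
  if (countInBucket === 0) return lowerBound;
  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / countInBucket);
}
```

Because the result is interpolated inside a bucket, its accuracy depends on the bucket boundaries: a 600-second alert threshold is only meaningful if a bucket edge sits at or near 600. Applying `rate()` to the buckets first scales all counts by the same factor, so the interpolation is unchanged.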
PagerDuty Integration
// Alertmanager receivers and routing, expressed as a config object
const alertManagerConfig = {
  receivers: [
    {
      name: "stella-critical",
      pagerduty_configs: [
        {
          service_key: "${PAGERDUTY_SERVICE_KEY}",
          severity: "critical"
        }
      ]
    },
    {
      name: "stella-warning",
      slack_configs: [
        {
          api_url: "${SLACK_WEBHOOK_URL}",
          channel: "#stella-alerts",
          send_resolved: true
        }
      ]
    }
  ],
  // Everything goes to Slack by default; critical alerts page via PagerDuty
  route: {
    receiver: "stella-warning",
    routes: [
      {
        match: { severity: "critical" },
        receiver: "stella-critical"
      }
    ]
  }
};
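The routing tree above sends an alert to the first child route whose `match` labels all equal the alert's labels, and otherwise to the parent's receiver. The sketch below models only that part of Alertmanager's behavior (no `continue`, regex matchers, or grouping); `resolveReceiver` and `AlertRoute` are hypothetical names.

```typescript
interface AlertRoute {
  receiver: string;
  match?: Record<string, string>;
  routes?: AlertRoute[];
}

// Walk the routing tree and return the receiver name for an alert's labels.
function resolveReceiver(route: AlertRoute, labels: Record<string, string>): string {
  for (const child of route.routes ?? []) {
    const match = child.match ?? {};
    const matches = Object.keys(match).every((key) => labels[key] === match[key]);
    // The first matching child wins; descend in case it has children of its own.
    if (matches) return resolveReceiver(child, labels);
  }
  // No child matched: the alert stays with this route's receiver.
  return route.receiver;
}

// The route from the configuration above.
const stellaRoute: AlertRoute = {
  receiver: "stella-warning",
  routes: [{ match: { severity: "critical" }, receiver: "stella-critical" }]
};
```

With this tree, `severity: warning` and `severity: info` alerts both fall through to `stella-warning`, so the info-level approval alert is delivered to the Slack channel as well.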
Dashboards
Deployment Dashboard
Key panels:
- Deployment rate over time
- Success/failure ratio
- Average deployment duration
- Deployment duration histogram
- Active deployments by environment
- Recent deployment list
Agent Health Dashboard
Key panels:
- Connected agents count
- Agent heartbeat status
- Tasks per agent
- Task success rate by agent
- Agent resource utilization
Approval Dashboard
Key panels:
- Pending approvals count
- Approval response time
- Approvals by user
- Rejection reasons breakdown
Health Endpoints
Application Health
GET /health
Response:
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 86400,
  "checks": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "vault": { "status": "healthy", "latency": 10 }
  }
}
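The top-level `status` in the response above can be derived from the individual dependency checks. The sketch below shows one way to do that. The `unhealthy` status value, the `buildHealthReport` and `httpStatusFor` helpers, and the 200/503 mapping are assumptions for illustration; the source shows only the healthy case.

```typescript
type CheckStatus = "healthy" | "unhealthy"; // "unhealthy" is an assumed value

interface CheckResult {
  status: CheckStatus;
  latency: number; // same unit as in the sample response
}

interface HealthReport {
  status: CheckStatus;
  version: string;
  uptime: number;
  checks: Record<string, CheckResult>;
}

// Overall status is healthy only when every dependency check is healthy.
function buildHealthReport(
  checks: Record<string, CheckResult>,
  version: string,
  uptimeSeconds: number
): HealthReport {
  const allHealthy = Object.keys(checks).every((name) => checks[name].status === "healthy");
  return {
    status: allHealthy ? "healthy" : "unhealthy",
    version,
    uptime: uptimeSeconds,
    checks
  };
}

// Assumed mapping to an HTTP status code for the /health endpoint.
function httpStatusFor(report: HealthReport): number {
  return report.status === "healthy" ? 200 : 503;
}
```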
Readiness Probe
GET /health/ready
Liveness Probe
GET /health/live
Performance Tuning
Database Connection Pool
const poolConfig = {
  min: 5,
  max: 20,
  acquireTimeout: 30000,    // 30 seconds
  idleTimeout: 600000,      // 10 minutes
  connectionTimeout: 10000  // 10 seconds
};
Cache Configuration
const cacheConfig = {
  // Release cache
  releases: {
    ttl: 300, // 5 minutes
    maxSize: 1000
  },
  // Target cache
  targets: {
    ttl: 60, // 1 minute
    maxSize: 5000
  },
  // Workflow template cache
  templates: {
    ttl: 3600, // 1 hour
    maxSize: 100
  }
};
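To illustrate how `ttl` and `maxSize` interact, here is a minimal sketch of a cache that honors both settings. It assumes `ttl` is in seconds (as the comments above indicate) and evicts the oldest inserted entry when full; the eviction policy and the `TtlCache` class itself are assumptions, since the source does not say how the orchestrator's cache is implemented.

```typescript
interface CachePolicy {
  ttl: number;     // seconds
  maxSize: number; // maximum number of entries
}

class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();

  // `now` is injectable so expiry can be tested without waiting.
  constructor(
    private readonly policy: CachePolicy,
    private readonly now: () => number = () => Date.now()
  ) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      // Expired entries are removed lazily on read.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // Deleting first moves a rewritten key to the end of the insertion order.
    this.entries.delete(key);
    if (this.entries.size >= this.policy.maxSize) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.policy.ttl * 1000 });
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A short TTL with a large `maxSize`, as configured for targets, favors freshness over hit rate; the long TTL for templates reflects data that changes rarely.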
Rate Limiting
const rateLimitConfig = {
  // API rate limits
  api: {
    windowMs: 60000, // 1 minute
    max: 1000,       // requests per window
    burst: 100       // burst allowance
  },
  // Webhook rate limits
  webhooks: {
    windowMs: 60000,
    max: 100
  },
  // Per-tenant limits
  tenant: {
    windowMs: 60000,
    max: 500
  }
};
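The `windowMs` and `max` settings describe a budget of requests per time window. The sketch below interprets them as a fixed-window counter keyed by an arbitrary string (for example a tenant ID). The source does not state which algorithm the orchestrator uses, so the fixed-window choice and the `FixedWindowLimiter` class are assumptions, and the `burst` setting is not modeled here.

```typescript
interface RateLimitPolicy {
  windowMs: number; // window length in milliseconds
  max: number;      // requests allowed per window
}

class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();

  // `now` is injectable so window rollover can be tested deterministically.
  constructor(
    private readonly policy: RateLimitPolicy,
    private readonly now: () => number = () => Date.now()
  ) {}

  // Returns true when the request identified by `key` fits in the current window.
  allow(key: string): boolean {
    const t = this.now();
    let window = this.windows.get(key);
    if (window === undefined || t - window.start >= this.policy.windowMs) {
      // Start a fresh window for this key.
      window = { start: t, count: 0 };
      this.windows.set(key, window);
    }
    if (window.count >= this.policy.max) return false;
    window.count += 1;
    return true;
  }
}
```

A fixed window is simple but allows up to twice `max` requests across a window boundary; a sliding window or token bucket smooths that out at the cost of more state.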