release orchestrator pivot, architecture and planning

2026-01-10 22:37:22 +02:00
parent c84f421e2f
commit d509c44411
130 changed files with 70292 additions and 721 deletions


@@ -0,0 +1,508 @@
# Operations Overview
## Observability Stack
Release Orchestrator exports metrics, structured logs, distributed traces, and events. Each signal goes to a dedicated backend, and Grafana dashboards sit on top of those backends.
```
OBSERVABILITY ARCHITECTURE
┌─────────────────────────────────────────────────────────────────────────────┐
│                            RELEASE ORCHESTRATOR                             │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Metrics   │  │    Logs     │  │   Traces    │  │   Events    │         │
│  │  Exporter   │  │  Collector  │  │  Exporter   │  │  Publisher  │         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
│         │                │                │                │                │
└─────────┼────────────────┼────────────────┼────────────────┼────────────────┘
          │                │                │                │
          ▼                ▼                ▼                ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           OBSERVABILITY BACKENDS                            │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ Prometheus  │  │   Loki /    │  │  Jaeger /   │  │    Event    │         │
│  │   / Mimir   │  │Elasticsearch│  │    Tempo    │  │     Bus     │         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
│         │                │                │                │                │
│         └────────────────┴───────┬────────┴────────────────┘                │
│                                  │                                          │
│                                  ▼                                          │
│                         ┌─────────────────┐                                 │
│                         │     Grafana     │                                 │
│                         │   Dashboards    │                                 │
│                         └─────────────────┘                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Metrics
### Core Metrics
| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_releases_total` | counter | Total releases created | `tenant`, `status` |
| `stella_promotions_total` | counter | Total promotions | `tenant`, `env`, `status` |
| `stella_deployments_total` | counter | Total deployments | `tenant`, `env`, `strategy` |
| `stella_deployment_duration_seconds` | histogram | Deployment duration | `tenant`, `env`, `strategy` |
| `stella_rollbacks_total` | counter | Total rollbacks | `tenant`, `env`, `reason` |
| `stella_agents_connected` | gauge | Connected agents | `tenant` |
| `stella_targets_total` | gauge | Total targets | `tenant`, `env`, `type` |
| `stella_workflow_runs_total` | counter | Workflow executions | `tenant`, `template`, `status` |
| `stella_workflow_step_duration_seconds` | histogram | Step execution time | `step_type` |
| `stella_approval_pending_count` | gauge | Pending approvals | `tenant`, `env` |
| `stella_approval_duration_seconds` | histogram | Time to approve | `tenant`, `env` |
### API Metrics
| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_http_requests_total` | counter | HTTP requests | `method`, `path`, `status` |
| `stella_http_request_duration_seconds` | histogram | Request latency | `method`, `path` |
| `stella_http_requests_in_flight` | gauge | Active requests | `method` |
### Agent Metrics
| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_agent_tasks_total` | counter | Tasks executed | `agent`, `type`, `status` |
| `stella_agent_task_duration_seconds` | histogram | Task duration | `agent`, `type` |
| `stella_agent_heartbeat_age_seconds` | gauge | Seconds since last heartbeat | `agent` |
### Prometheus Configuration
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt
  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id
```
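The scrape jobs above read the Prometheus text exposition format from `/metrics`. As a rough illustration of how a counter from the Core Metrics table might be rendered in that format, here is a minimal, dependency-free sketch. The `LabelledCounter` class is hypothetical and not part of Release Orchestrator; a real service would more likely use a Prometheus client library, and label values are assumed to need no escaping.

```typescript
// Hypothetical, dependency-free counter; not the orchestrator's actual exporter.
type Labels = Record<string, string>;

class LabelledCounter {
  private readonly values = new Map<string, number>();
  private readonly name: string;
  private readonly help: string;
  private readonly labelNames: string[];

  constructor(name: string, help: string, labelNames: string[]) {
    this.name = name;
    this.help = help;
    this.labelNames = labelNames;
  }

  // Build a stable series key such as tenant="acme",env="prod".
  // Assumes label values contain no quotes, backslashes, or newlines.
  private seriesKey(labels: Labels): string {
    return this.labelNames.map((n) => `${n}="${labels[n] ?? ''}"`).join(',');
  }

  // Increment the series identified by the given label values
  inc(labels: Labels, by: number = 1): void {
    const key = this.seriesKey(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  // Render HELP and TYPE lines followed by one sample line per series
  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    this.values.forEach((value, key) => {
      lines.push(`${this.name}{${key}} ${value}`);
    });
    return lines.join('\n') + '\n';
  }
}

const deployments = new LabelledCounter(
  'stella_deployments_total',
  'Total deployments',
  ['tenant', 'env', 'strategy']
);
deployments.inc({ tenant: 'acme', env: 'prod', strategy: 'rolling' });
deployments.inc({ tenant: 'acme', env: 'prod', strategy: 'rolling' });
const exposition = deployments.render();
```

Fixing the label order at construction time keeps each label combination mapped to exactly one series, regardless of the key order callers pass in.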
## Logging
### Log Format
```json
{
  "timestamp": "2026-01-09T10:30:00.123Z",
  "level": "info",
  "message": "Deployment started",
  "service": "deploy-orchestrator",
  "version": "1.0.0",
  "traceId": "abc123def456",
  "spanId": "789ghi",
  "tenantId": "tenant-uuid",
  "correlationId": "corr-uuid",
  "context": {
    "deploymentJobId": "job-uuid",
    "releaseId": "release-uuid",
    "environmentId": "env-uuid"
  }
}
```
### Log Levels
| Level | Usage |
|-------|-------|
| `error` | Failures requiring attention |
| `warn` | Degraded operation, recoverable issues |
| `info` | Business events (deployment started, approval granted) |
| `debug` | Detailed operational info |
| `trace` | Very detailed debugging |
### Structured Logging Configuration
```typescript
// Logging configuration
const loggerConfig = {
  level: process.env.LOG_LEVEL || 'info',
  format: 'json',
  outputs: [
    {
      type: 'stdout',
      format: 'json'
    },
    {
      type: 'file',
      path: '/var/log/stella/orchestrator.log',
      rotation: {
        maxSize: '100MB',
        maxFiles: 10
      }
    }
  ],
  // Sensitive field masking
  redact: [
    'password',
    'token',
    'secret',
    'credentials',
    'authorization'
  ]
};
```
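The `redact` list names fields to mask but does not say how they are matched. The sketch below shows one possible interpretation: a recursive walk that replaces the value of any key equal to a listed name, compared case-insensitively. The `redactFields` helper and the `[REDACTED]` placeholder are assumptions for illustration, not the logger's documented behavior.

```typescript
// Hypothetical helper: masks values whose key matches one of the `redact` entries.
const REDACT_KEYS = ['password', 'token', 'secret', 'credentials', 'authorization'];

function redactFields(value: unknown, keys: string[] = REDACT_KEYS): unknown {
  // Recurse into arrays element by element
  if (Array.isArray(value)) {
    return value.map((item) => redactFields(item, keys));
  }
  // Recurse into plain objects, masking sensitive keys
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const result: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      const sensitive = keys.indexOf(key.toLowerCase()) !== -1;
      result[key] = sensitive ? '[REDACTED]' : redactFields(source[key], keys);
    }
    return result;
  }
  // Primitives are returned unchanged
  return value;
}

const masked = redactFields({
  message: 'Agent registered',
  context: {
    agentId: 'agent-1',
    token: 'abc',
    headers: [{ Authorization: 'Bearer xyz' }]
  }
}) as {
  message: string;
  context: { agentId: string; token: string; headers: Array<{ Authorization: string }> };
};
```

The function returns a copy rather than mutating its input, so the original object can still be used by the code that produced it.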
### Important Log Events
| Event | Level | Description |
|-------|-------|-------------|
| `deployment.started` | info | Deployment job started |
| `deployment.completed` | info | Deployment successful |
| `deployment.failed` | error | Deployment failed |
| `rollback.initiated` | warn | Rollback triggered |
| `approval.granted` | info | Promotion approved |
| `approval.denied` | info | Promotion rejected |
| `agent.connected` | info | Agent came online |
| `agent.disconnected` | warn | Agent went offline |
| `security.gate.failed` | warn | Security check blocked |
## Distributed Tracing
### Trace Context Propagation
```typescript
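import type { Request, Response, NextFunction } from 'express'; // assumes an Express-style server
// Trace context in requests
interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  sampled: boolean;
  baggage?: Record<string, string>;
}
// W3C Trace Context headers
// traceparent: 00-{traceId}-{spanId}-{flags}
// tracestate: stella=...
// Minimal tracer shapes assumed for this example; a tracing library would normally supply them
interface Span {
  setAttribute(key: string, value: unknown): void;
  end(): void;
}
interface Tracer {
  startSpan(name: string, options: { parent?: TraceContext; attributes: Record<string, unknown> }): Span;
}
// Request extended with the fields this middleware reads and writes
type TracedRequest = Request & { tenantId?: string; span?: Span };
// Example trace propagation
class TracingMiddleware {
  private readonly tracer: Tracer;
  constructor(tracer: Tracer) {
    this.tracer = tracer;
  }
  // Parse a version-00 traceparent header; returns undefined if it is missing or malformed
  private parseTraceParent(header?: string): TraceContext | undefined {
    const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header ?? '');
    if (!match) return undefined;
    return {
      traceId: match[1],
      spanId: match[2],
      sampled: (parseInt(match[3], 16) & 1) === 1
    };
  }
  handle(req: TracedRequest, res: Response, next: NextFunction): void {
    const raw = req.headers['traceparent'];
    const traceContext = this.parseTraceParent(Array.isArray(raw) ? raw[0] : raw);
    // Start span for this request
    const span = this.tracer.startSpan('http.request', {
      parent: traceContext,
      attributes: {
        'http.method': req.method,
        'http.url': req.url,
        'http.user_agent': req.headers['user-agent'],
        'tenant.id': req.tenantId
      }
    });
    // Attach to request for downstream use
    req.span = span;
    // Record the status code and close the span once the response has been sent
    res.on('finish', () => {
      span.setAttribute('http.status_code', res.statusCode);
      span.end();
    });
    next();
  }
}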
```
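The middleware covers the inbound side. For the outbound side, a service forwarding a request has to serialize its current context back into a `traceparent` header. The following is a minimal sketch of that step under the W3C version-00 layout; `OutgoingTraceContext` and `formatTraceParent` are hypothetical names, and only the sampled flag bit is handled.

```typescript
// Hypothetical helper for building an outgoing W3C traceparent header.
interface OutgoingTraceContext {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters
  sampled: boolean;
}

function formatTraceParent(ctx: OutgoingTraceContext): string {
  // The W3C format rejects malformed and all-zero identifiers
  if (!/^[0-9a-f]{32}$/.test(ctx.traceId) || /^0+$/.test(ctx.traceId)) {
    throw new Error('traceId must be 32 hex characters and not all zeros');
  }
  if (!/^[0-9a-f]{16}$/.test(ctx.spanId) || /^0+$/.test(ctx.spanId)) {
    throw new Error('spanId must be 16 hex characters and not all zeros');
  }
  // Bit 0 of the flags byte carries the sampling decision
  const flags = ctx.sampled ? '01' : '00';
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

const outgoingHeader = formatTraceParent({
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
  sampled: true
});
```

Validating before formatting means a corrupted context fails loudly at the caller instead of producing a header that downstream services silently discard.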
### Key Spans
| Span Name | Description | Attributes |
|-----------|-------------|------------|
| `deployment.execute` | Full deployment | `release_id`, `environment` |
| `task.dispatch` | Task dispatch to agent | `target_id`, `agent_id` |
| `agent.execute` | Agent task execution | `task_type`, `duration` |
| `workflow.run` | Workflow execution | `template_id`, `status` |
| `workflow.step` | Individual step | `step_type`, `node_id` |
| `approval.wait` | Waiting for approval | `promotion_id`, `duration` |
| `gate.evaluate` | Gate evaluation | `gate_type`, `result` |
### Jaeger Configuration
```yaml
# jaeger-config.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: stella-jaeger
spec:
  strategy: production
  collector:
    maxReplicas: 5
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
    secretName: jaeger-es-secret
  ingress:
    enabled: true
```
## Alerting
### Alert Rules
```yaml
# prometheus-rules.yaml
groups:
  - name: stella.deployment
    rules:
      - alert: DeploymentFailureRateHigh
        # Requires a `status` label on stella_deployments_total,
        # which the Core Metrics table does not list.
        expr: |
          sum(rate(stella_deployments_total{status="failed"}[5m])) /
          sum(rate(stella_deployments_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High deployment failure rate"
          description: "More than 10% of deployments are failing"
      - alert: DeploymentDurationHigh
        expr: |
          histogram_quantile(0.95, sum(rate(stella_deployment_duration_seconds_bucket[5m])) by (le, tenant)) > 600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Deployment duration high"
          description: "P95 deployment duration exceeds 10 minutes"
      - alert: RollbackRateHigh
        # increase() counts rollbacks over the window; rate() would return a per-second value
        expr: |
          sum(increase(stella_rollbacks_total[1h])) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rollback rate"
          description: "More than 3 rollbacks in the last hour"
  - name: stella.agents
    rules:
      - alert: AgentOffline
        expr: |
          stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Agent offline"
          description: "Agent {{ $labels.agent }} has not sent a heartbeat for more than 2 minutes"
      - alert: AgentPoolLow
        # stella_agents_connected is a per-tenant gauge, so sum its value rather than counting series
        expr: |
          sum(stella_agents_connected) by (tenant) < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low agent count"
          description: "Fewer than 2 agents online for tenant {{ $labels.tenant }}"
  - name: stella.approvals
    rules:
      - alert: ApprovalBacklogHigh
        expr: |
          stella_approval_pending_count > 10
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Approval backlog growing"
          description: "More than 10 pending approvals for over an hour"
      - alert: ApprovalWaitLong
        # histogram_quantile needs per-bucket rates aggregated by `le`, not raw bucket counters
        expr: |
          histogram_quantile(0.90, sum(rate(stella_approval_duration_seconds_bucket[1h])) by (le)) > 86400
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Long approval wait times"
          description: "P90 approval wait time exceeds 24 hours"
```
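The duration alerts rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counts instead of from raw observations. The sketch below shows the linear interpolation involved, in simplified form; `estimateQuantile` is a hypothetical function, the bucket values are invented, and edge cases that Prometheus handles (such as negative bounds) are left out.

```typescript
// Simplified illustration of bucket-based quantile estimation; not Prometheus source code.
interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative count of observations <= le
}

function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].count >= rank) {
      // A rank that lands in the +Inf bucket is reported as the highest finite bound
      if (sorted[i].le === Infinity) {
        return i > 0 ? sorted[i - 1].le : NaN;
      }
      const lower = i === 0 ? 0 : sorted[i - 1].le;
      const below = i === 0 ? 0 : sorted[i - 1].count;
      const inBucket = sorted[i].count - below;
      // Assume observations are spread evenly inside the bucket
      return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
    }
  }
  return NaN;
}

// Invented deployment-duration buckets for illustration
const durationBuckets: Bucket[] = [
  { le: 60, count: 10 },
  { le: 300, count: 50 },
  { le: 600, count: 90 },
  { le: Infinity, count: 100 }
];
const p25 = estimateQuantile(0.25, durationBuckets);
const p95 = estimateQuantile(0.95, durationBuckets);
```

One practical consequence is that an estimate can never exceed the largest finite bucket bound, so a threshold such as `> 600` only fires if the histogram defines buckets above 600 seconds.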
### PagerDuty Integration
```typescript
// Alertmanager receivers and routing, written as an object that mirrors alertmanager.yml
const alertManagerConfig = {
  receivers: [
    {
      name: "stella-critical",
      pagerduty_configs: [
        {
          service_key: "${PAGERDUTY_SERVICE_KEY}",
          severity: "critical"
        }
      ]
    },
    {
      name: "stella-warning",
      slack_configs: [
        {
          api_url: "${SLACK_WEBHOOK_URL}",
          channel: "#stella-alerts",
          send_resolved: true
        }
      ]
    }
  ],
  // Default receiver is Slack; alerts labelled severity=critical go to PagerDuty
  route: {
    receiver: "stella-warning",
    routes: [
      {
        match: { severity: "critical" },
        receiver: "stella-critical"
      }
    ]
  }
};
```
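To show how the `route` tree above selects a receiver, here is a minimal sketch of first-match routing. `Route` and `resolveReceiver` are hypothetical names, and the logic is a simplification: it supports only exact `match` label equality and ignores Alertmanager features such as `continue`, regex matchers, and grouping.

```typescript
// Simplified first-match routing; not Alertmanager's implementation.
interface Route {
  receiver: string;
  match?: Record<string, string>;
  routes?: Route[];
}

function resolveReceiver(route: Route, labels: Record<string, string>): string {
  for (const child of route.routes ?? []) {
    const matchers = child.match ?? {};
    // A child route applies when every matcher equals the alert's label value
    const applies = Object.keys(matchers).every((key) => labels[key] === matchers[key]);
    if (applies) {
      return resolveReceiver(child, labels);
    }
  }
  // No child matched, so fall back to this route's own receiver
  return route.receiver;
}

const rootRoute: Route = {
  receiver: 'stella-warning',
  routes: [{ match: { severity: 'critical' }, receiver: 'stella-critical' }]
};

const criticalReceiver = resolveReceiver(rootRoute, { alertname: 'AgentOffline', severity: 'critical' });
const warningReceiver = resolveReceiver(rootRoute, { alertname: 'AgentPoolLow', severity: 'warning' });
const infoReceiver = resolveReceiver(rootRoute, { alertname: 'ApprovalWaitLong', severity: 'info' });
```

Under this reading, `info` alerts such as `ApprovalWaitLong` fall through to the default `stella-warning` receiver, since no child route matches them.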
## Dashboards
### Deployment Dashboard
Key panels:
- Deployment rate over time
- Success/failure ratio
- Average deployment duration
- Deployment duration histogram
- Active deployments by environment
- Recent deployment list
### Agent Health Dashboard
Key panels:
- Connected agents count
- Agent heartbeat status
- Tasks per agent
- Task success rate by agent
- Agent resource utilization
### Approval Dashboard
Key panels:
- Pending approvals count
- Approval response time
- Approvals by user
- Rejection reasons breakdown
## Health Endpoints
### Application Health
```http
GET /health
```
Response:
```json
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 86400,
  "checks": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "vault": { "status": "healthy", "latency": 10 }
  }
}
```
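The response above combines per-dependency checks into one top-level `status`. A minimal sketch of that aggregation follows. Several details are assumptions, since the source shows only the healthy case: the `unhealthy` status value, the rule that any failing check makes the service unhealthy, and the 200/503 status-code mapping. `buildHealthResponse` is a hypothetical name.

```typescript
// Hypothetical aggregation of dependency checks into a /health response body.
type CheckStatus = 'healthy' | 'unhealthy'; // 'unhealthy' is an assumed value

interface CheckResult {
  status: CheckStatus;
  latency: number;
}

interface HealthResponse {
  status: CheckStatus;
  version: string;
  uptime: number;
  checks: Record<string, CheckResult>;
}

function buildHealthResponse(
  version: string,
  uptime: number,
  checks: Record<string, CheckResult>
): { httpStatus: number; body: HealthResponse } {
  // Assumed rule: the service is healthy only if every dependency is healthy
  const allHealthy = Object.keys(checks).every((name) => checks[name].status === 'healthy');
  return {
    httpStatus: allHealthy ? 200 : 503, // assumed mapping
    body: { status: allHealthy ? 'healthy' : 'unhealthy', version, uptime, checks }
  };
}

const healthyResult = buildHealthResponse('1.0.0', 86400, {
  database: { status: 'healthy', latency: 5 },
  redis: { status: 'healthy', latency: 2 },
  vault: { status: 'healthy', latency: 10 }
});
const degradedResult = buildHealthResponse('1.0.0', 86400, {
  database: { status: 'healthy', latency: 5 },
  vault: { status: 'unhealthy', latency: 10 }
});
```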
### Readiness Probe
```http
GET /health/ready
```
### Liveness Probe
```http
GET /health/live
```
## Performance Tuning
### Database Connection Pool
```typescript
const poolConfig = {
  min: 5,
  max: 20,
  acquireTimeout: 30000,
  idleTimeout: 600000,
  connectionTimeout: 10000
};
```
### Cache Configuration
```typescript
const cacheConfig = {
  // Release cache
  releases: {
    ttl: 300, // 5 minutes
    maxSize: 1000
  },
  // Target cache
  targets: {
    ttl: 60, // 1 minute
    maxSize: 5000
  },
  // Workflow template cache
  templates: {
    ttl: 3600, // 1 hour
    maxSize: 100
  }
};
```
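To illustrate how `ttl` and `maxSize` might interact, here is a small in-memory cache sketch. The `TtlCache` class is hypothetical, and the eviction policy (drop the oldest inserted entry when full) is an assumption, since the configuration does not name one. The clock is injectable so expiry can be exercised without waiting.

```typescript
// Hypothetical TTL cache with a size cap; eviction policy is assumed, not documented.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly maxSize: number;
  private readonly now: () => number;

  constructor(options: { ttl: number; maxSize: number }, now: () => number = () => Date.now()) {
    this.ttlMs = options.ttl * 1000; // ttl is given in seconds, as in cacheConfig
    this.maxSize = options.maxSize;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    // Expired entries are removed lazily on read
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    if (!this.entries.has(key) && this.entries.size >= this.maxSize) {
      // Map iterates in insertion order, so the first key is the oldest entry
      const oldest = this.entries.keys().next();
      if (!oldest.done) this.entries.delete(oldest.value);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

let fakeNow = 0;
const releaseCache = new TtlCache<string>({ ttl: 300, maxSize: 2 }, () => fakeNow);
releaseCache.set('rel-1', 'v1.0.0');
releaseCache.set('rel-2', 'v1.1.0');
releaseCache.set('rel-3', 'v1.2.0'); // cache is full, so rel-1 is evicted
```

The short `targets` TTL in the configuration suggests target state changes often; a cache like this trades that freshness against load on the backing store.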
### Rate Limiting
```typescript
const rateLimitConfig = {
  // API rate limits
  api: {
    windowMs: 60000, // 1 minute
    max: 1000,       // requests per window
    burst: 100       // burst allowance
  },
  // Webhook rate limits
  webhooks: {
    windowMs: 60000,
    max: 100
  },
  // Per-tenant limits
  tenant: {
    windowMs: 60000,
    max: 500
  }
};
```
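The `windowMs` and `max` settings fit several limiting algorithms. As one possible reading, the sketch below implements a fixed-window counter keyed by caller. `FixedWindowLimiter` is a hypothetical class, the choice of algorithm is an assumption, and `burst` is not modelled because the configuration does not define its semantics.

```typescript
// Hypothetical fixed-window limiter; the orchestrator's actual algorithm is not specified.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly windowMs: number;
  private readonly max: number;
  private readonly now: () => number;

  constructor(options: { windowMs: number; max: number }, now: () => number = () => Date.now()) {
    this.windowMs = options.windowMs;
    this.max = options.max;
    this.now = now;
  }

  // Returns true if the request is allowed, false if the key is over its limit
  allow(key: string): boolean {
    const time = this.now();
    const current = this.windows.get(key);
    // Start a new window when none exists or the previous one has elapsed
    if (!current || time - current.start >= this.windowMs) {
      this.windows.set(key, { start: time, count: 1 });
      return true;
    }
    if (current.count < this.max) {
      current.count += 1;
      return true;
    }
    return false;
  }
}

let limiterClock = 0;
const webhookLimiter = new FixedWindowLimiter({ windowMs: 60000, max: 2 }, () => limiterClock);
const firstAllowed = webhookLimiter.allow('tenant-a');
const secondAllowed = webhookLimiter.allow('tenant-a');
const thirdAllowed = webhookLimiter.allow('tenant-a'); // over the limit of 2
const otherTenantAllowed = webhookLimiter.allow('tenant-b'); // separate key, separate window
limiterClock = 60000;
const afterResetAllowed = webhookLimiter.allow('tenant-a'); // new window
```

Fixed windows are simple but allow up to twice `max` requests across a window boundary; a sliding window or token bucket would smooth that out at the cost of more state.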
## References
- [Metrics Reference](metrics.md)
- [Logging Guide](logging.md)
- [Tracing Setup](tracing.md)
- [Alert Configuration](alerting.md)