release orchestrator pivot, architecture and planning
docs/modules/release-orchestrator/operations/metrics.md
@@ -0,0 +1,274 @@
# Metrics Specification

## Overview

Release Orchestrator exposes Prometheus-compatible metrics for monitoring deployment health, performance, and operational status.
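
To make "Prometheus-compatible" concrete, the sketch below renders one counter family in the Prometheus text exposition format, which is what a scrape of `/metrics` returns. The helper names (`renderCounter`, `escapeLabelValue`) and the sample label values (`acme`, `created`) are hypothetical and chosen only for illustration; the orchestrator's real exporter is not shown in this document.

```typescript
// Hypothetical helper: renders one counter family in the Prometheus text exposition format.
type Labels = Record<string, string>;

interface Sample {
  labels: Labels;
  value: number;
}

// Label values must escape backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

function renderCounter(name: string, help: string, samples: Sample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(",");
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join("\n") + "\n";
}

// Illustrative sample; tenant and status values are made up.
const output = renderCounter("stella_releases_total", "Total releases created", [
  { labels: { tenant: "acme", status: "created" }, value: 42 },
]);
console.log(output);
```

Each label listed in the tables below becomes one `key="value"` pair on the sample line, so every distinct label combination is a separate time series.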

## Core Metrics

### Release Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_releases_total` | counter | Total releases created | `tenant`, `status` |
| `stella_releases_active` | gauge | Currently active releases | `tenant`, `status` |
| `stella_release_components_count` | histogram | Components per release | `tenant` |

### Promotion Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_promotions_total` | counter | Total promotions | `tenant`, `env`, `status` |
| `stella_promotions_in_progress` | gauge | Promotions currently in progress | `tenant`, `env` |
| `stella_promotion_duration_seconds` | histogram | Time from request to completion | `tenant`, `env`, `status` |
| `stella_approval_pending_count` | gauge | Pending approvals | `tenant`, `env` |
| `stella_approval_duration_seconds` | histogram | Time to approve | `tenant`, `env` |

### Deployment Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_deployments_total` | counter | Total deployments | `tenant`, `env`, `strategy`, `status` |
| `stella_deployment_duration_seconds` | histogram | Deployment duration | `tenant`, `env`, `strategy` |
| `stella_deployment_tasks_total` | counter | Total deployment tasks | `tenant`, `status` |
| `stella_deployment_task_duration_seconds` | histogram | Task duration | `target_type` |
| `stella_rollbacks_total` | counter | Total rollbacks | `tenant`, `env`, `reason` |

### Agent Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_agents_connected` | gauge | Connected agents | `tenant` |
| `stella_agents_by_status` | gauge | Agents by status | `tenant`, `status` |
| `stella_agent_tasks_total` | counter | Tasks executed by agents | `agent`, `type`, `status` |
| `stella_agent_task_duration_seconds` | histogram | Agent task duration | `agent`, `type` |
| `stella_agent_heartbeat_age_seconds` | gauge | Seconds since last heartbeat | `agent` |
| `stella_agent_resource_cpu_percent` | gauge | Agent CPU usage | `agent` |
| `stella_agent_resource_memory_percent` | gauge | Agent memory usage | `agent` |

### Workflow Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_workflow_runs_total` | counter | Workflow executions | `tenant`, `template`, `status` |
| `stella_workflow_runs_active` | gauge | Currently running workflows | `tenant`, `template` |
| `stella_workflow_duration_seconds` | histogram | Workflow duration | `template`, `status` |
| `stella_workflow_step_duration_seconds` | histogram | Step execution time | `step_type`, `status` |
| `stella_workflow_step_retries_total` | counter | Step retry count | `step_type` |

### Target Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_targets_total` | gauge | Total targets | `tenant`, `env`, `type` |
| `stella_targets_by_health` | gauge | Targets by health status | `tenant`, `env`, `health` |
| `stella_target_drift_detected` | gauge | Targets with drift | `tenant`, `env` |

### Integration Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_integrations_total` | gauge | Configured integrations | `tenant`, `type` |
| `stella_integration_health` | gauge | Integration health (1=healthy) | `tenant`, `integration` |
| `stella_integration_requests_total` | counter | Requests to integrations | `integration`, `operation`, `status` |
| `stella_integration_latency_seconds` | histogram | Integration request latency | `integration`, `operation` |

### Gate Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_gate_evaluations_total` | counter | Gate evaluations | `tenant`, `gate_type`, `result` |
| `stella_gate_evaluation_duration_seconds` | histogram | Gate evaluation time | `gate_type` |
| `stella_gate_blocks_total` | counter | Blocked promotions by gate | `tenant`, `gate_type`, `env` |

## API Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_http_requests_total` | counter | HTTP requests | `method`, `path`, `status` |
| `stella_http_request_duration_seconds` | histogram | Request latency | `method`, `path` |
| `stella_http_requests_in_flight` | gauge | Active requests | `method` |
| `stella_http_request_size_bytes` | histogram | Request size | `method`, `path` |
| `stella_http_response_size_bytes` | histogram | Response size | `method`, `path` |

## Evidence Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_evidence_packets_total` | counter | Evidence packets generated | `tenant`, `type` |
| `stella_evidence_packet_size_bytes` | histogram | Evidence packet size | `type` |
| `stella_evidence_verification_total` | counter | Evidence verifications | `result` |

## Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt

  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id
```

## Histogram Buckets

### Duration Buckets (seconds)

```yaml
# Short operations (API calls, gate evaluations)
short_duration_buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# Medium operations (workflow steps)
medium_duration_buckets: [0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300]

# Long operations (deployments)
long_duration_buckets: [1, 5, 10, 30, 60, 120, 300, 600, 1200, 3600]
```

### Size Buckets (bytes)

```yaml
# Request/response sizes
size_buckets: [100, 1000, 10000, 100000, 1000000, 10000000]

# Evidence packet sizes
evidence_buckets: [1000, 10000, 100000, 500000, 1000000, 5000000]
```
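
Prometheus histogram buckets are cumulative: an observation increments every bucket whose upper bound (`le`) is at least the observed value, plus the implicit `+Inf` bucket. The sketch below illustrates that behaviour for the short-duration boundaries listed above. The function name `cumulativeCounts` and the sample observations are hypothetical.

```typescript
// Boundaries copied from short_duration_buckets above (seconds).
const shortDurationBuckets = [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];

// Returns one cumulative count per upper bound; the final entry is the +Inf bucket.
function cumulativeCounts(buckets: number[], observations: number[]): number[] {
  const counts = new Array<number>(buckets.length + 1).fill(0);
  for (const value of observations) {
    buckets.forEach((upperBound, i) => {
      if (value <= upperBound) counts[i] += 1;
    });
    counts[buckets.length] += 1; // +Inf matches every observation
  }
  return counts;
}

// Three made-up request durations: 20 ms, 300 ms, and 12 s.
const counts = cumulativeCounts(shortDurationBuckets, [0.02, 0.3, 12]);
console.log(counts);
```

The 12 s observation only lands in `+Inf`, which is why an operation that routinely exceeds the largest boundary should use a longer bucket set: quantile estimates cannot resolve anything above the highest finite bound.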

## SLI Definitions

### Availability SLI

```promql
# API availability (99.9% target)
sum(rate(stella_http_requests_total{status!~"5.."}[5m]))
/
sum(rate(stella_http_requests_total[5m]))
```
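
The availability ratio above is usually read against an error budget. The sketch below shows the arithmetic under the 99.9% target, treating any 5xx response as a failure, as the query does. `errorBudget` and its field names are hypothetical, and the request counts are invented for the example.

```typescript
interface BudgetReport {
  availability: number;    // fraction of requests that were not 5xx
  budgetRemaining: number; // fraction of the error budget left; negative means overspent
}

function errorBudget(total: number, serverErrors: number, target: number): BudgetReport {
  if (target <= 0 || target >= 1) {
    throw new RangeError("target must be strictly between 0 and 1");
  }
  if (total === 0) {
    // No traffic: nothing failed, so the budget is untouched.
    return { availability: 1, budgetRemaining: 1 };
  }
  const availability = (total - serverErrors) / total;
  const allowedErrors = total * (1 - target);
  return { availability, budgetRemaining: 1 - serverErrors / allowedErrors };
}

// 1,000,000 requests with 250 server errors against a 99.9% target.
const report = errorBudget(1_000_000, 250, 0.999);
console.log(report);
```

With a 99.9% target, one million requests allow roughly 1,000 failures, so 250 failures consume about a quarter of the budget.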

### Latency SLI

```promql
# API latency P99 < 500ms
histogram_quantile(0.99,
  sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le)
)
```
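
`histogram_quantile` does not see raw observations; it estimates the quantile by linear interpolation inside the bucket that contains the target rank. The following is a simplified sketch of that estimate for classic histograms, intended only to show why the result depends on bucket boundaries. `histogramQuantile` and the `Bucket` type are hypothetical names, the bucket counts are made up, and edge cases that Prometheus handles (non-monotonic counts, NaN inputs, quantiles outside 0..1) are reduced to returning `NaN`.

```typescript
interface Bucket {
  le: number;    // upper bound; Infinity for the +Inf bucket
  count: number; // cumulative number of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (!last || last.le !== Infinity || last.count === 0) return NaN;

  const rank = q * last.count;
  const idx = sorted.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound instead.
  if (idx === sorted.length - 1) {
    return sorted.length > 1 ? sorted[idx - 1].le : NaN;
  }
  const lower = idx === 0 ? 0 : sorted[idx - 1].le;
  const prevCount = idx === 0 ? 0 : sorted[idx - 1].count;
  const upper = sorted[idx].le;
  // Assume observations are spread evenly between the bucket's bounds.
  return lower + (upper - lower) * ((rank - prevCount) / (sorted[idx].count - prevCount));
}

// Made-up cumulative counts for a request-duration histogram (seconds).
const sampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.25, count: 80 },
  { le: 0.5, count: 95 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.9, sampleBuckets));
```

Because the estimate is capped at the highest finite boundary, a threshold can only be detected reliably if a bucket boundary sits at or near it.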

### Deployment Success SLI

```promql
# Deployment success rate (99% target)
sum(rate(stella_deployments_total{status="succeeded"}[24h]))
/
sum(rate(stella_deployments_total[24h]))
```

## Alert Rules

```yaml
groups:
  - name: stella-orchestrator
    rules:
      - alert: HighDeploymentFailureRate
        expr: |
          sum(rate(stella_deployments_total{status="failed"}[1h]))
          /
          sum(rate(stella_deployments_total[1h])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High deployment failure rate
          description: More than 10% of deployments failing in the last hour

      - alert: AgentOffline
        expr: stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Agent {{ $labels.agent }} offline
          description: Agent has not sent heartbeat for > 2 minutes

      - alert: PendingApprovalsStale
        # NOTE: stella_promotion_request_timestamp is not listed in the metric
        # tables in this document; this rule only evaluates if it is exported.
        expr: |
          stella_approval_pending_count > 0
          and
          time() - stella_promotion_request_timestamp > 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Stale pending approvals
          description: Approvals pending for more than 1 hour

      - alert: IntegrationUnhealthy
        expr: stella_integration_health == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Integration {{ $labels.integration }} unhealthy
          description: Integration health check failing

      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(stella_http_request_duration_seconds_bucket[5m])) by (le, path)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High API latency on {{ $labels.path }}
          description: P99 latency exceeds 1 second
```

## Grafana Dashboards

### Main Dashboard Panels

1. **Deployment Pipeline Overview**
   - Promotions per environment (time series)
   - Success/failure rates (gauge)
   - Active deployments (stat)

2. **Agent Health**
   - Connected agents (stat)
   - Agent status distribution (pie chart)
   - Heartbeat age (table)

3. **Gate Performance**
   - Gate evaluation counts (bar chart)
   - Block rate by gate type (time series)
   - Evaluation latency (heatmap)

4. **API Performance**
   - Request rate (time series)
   - Error rate (time series)
   - Latency distribution (heatmap)

## References

- [Operations Overview](overview.md)
- [Logging](logging.md)
- [Tracing](tracing.md)
- [Alerting](alerting.md)
docs/modules/release-orchestrator/operations/overview.md
@@ -0,0 +1,508 @@
# Operations Overview

## Observability Stack

Release Orchestrator provides observability through metrics, structured logging, and distributed tracing.

```
                          OBSERVABILITY ARCHITECTURE

┌─────────────────────────────────────────────────────────────────────────────┐
│                            RELEASE ORCHESTRATOR                             │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Metrics   │  │    Logs     │  │   Traces    │  │   Events    │         │
│  │  Exporter   │  │  Collector  │  │  Exporter   │  │  Publisher  │         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
│         │                │                │                │                │
└─────────┼────────────────┼────────────────┼────────────────┼────────────────┘
          │                │                │                │
          ▼                ▼                ▼                ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           OBSERVABILITY BACKENDS                            │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ Prometheus  │  │   Loki /    │  │  Jaeger /   │  │    Event    │         │
│  │   / Mimir   │  │Elasticsearch│  │    Tempo    │  │     Bus     │         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
│         │                │                │                │                │
│         └────────────────┴────────────────┴────────────────┘                │
│                                  │                                          │
│                                  ▼                                          │
│                         ┌─────────────────┐                                 │
│                         │     Grafana     │                                 │
│                         │   Dashboards    │                                 │
│                         └─────────────────┘                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Metrics

### Core Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_releases_total` | counter | Total releases created | `tenant`, `status` |
| `stella_promotions_total` | counter | Total promotions | `tenant`, `env`, `status` |
| `stella_deployments_total` | counter | Total deployments | `tenant`, `env`, `strategy`, `status` |
| `stella_deployment_duration_seconds` | histogram | Deployment duration | `tenant`, `env`, `strategy` |
| `stella_rollbacks_total` | counter | Total rollbacks | `tenant`, `env`, `reason` |
| `stella_agents_connected` | gauge | Connected agents | `tenant` |
| `stella_targets_total` | gauge | Total targets | `tenant`, `env`, `type` |
| `stella_workflow_runs_total` | counter | Workflow executions | `tenant`, `template`, `status` |
| `stella_workflow_step_duration_seconds` | histogram | Step execution time | `step_type`, `status` |
| `stella_approval_pending_count` | gauge | Pending approvals | `tenant`, `env` |
| `stella_approval_duration_seconds` | histogram | Time to approve | `tenant`, `env` |

### API Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_http_requests_total` | counter | HTTP requests | `method`, `path`, `status` |
| `stella_http_request_duration_seconds` | histogram | Request latency | `method`, `path` |
| `stella_http_requests_in_flight` | gauge | Active requests | `method` |

### Agent Metrics

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `stella_agent_tasks_total` | counter | Tasks executed | `agent`, `type`, `status` |
| `stella_agent_task_duration_seconds` | histogram | Task duration | `agent`, `type` |
| `stella_agent_heartbeat_age_seconds` | gauge | Seconds since last heartbeat | `agent` |

### Prometheus Configuration

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'stella-orchestrator'
    static_configs:
      - targets: ['stella-orchestrator:9090']
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt

  - job_name: 'stella-agents'
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            label: "app.kubernetes.io/name=stella-agent"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agent_id]
        target_label: agent_id
```

## Logging

### Log Format

```json
{
  "timestamp": "2026-01-09T10:30:00.123Z",
  "level": "info",
  "message": "Deployment started",
  "service": "deploy-orchestrator",
  "version": "1.0.0",
  "traceId": "abc123def456",
  "spanId": "789ghi",
  "tenantId": "tenant-uuid",
  "correlationId": "corr-uuid",
  "context": {
    "deploymentJobId": "job-uuid",
    "releaseId": "release-uuid",
    "environmentId": "env-uuid"
  }
}
```

### Log Levels

| Level | Usage |
|-------|-------|
| `error` | Failures requiring attention |
| `warn` | Degraded operation, recoverable issues |
| `info` | Business events (deployment started, approval granted) |
| `debug` | Detailed operational info |
| `trace` | Very detailed debugging |

### Structured Logging Configuration

```typescript
// Logging configuration
const loggerConfig = {
  level: process.env.LOG_LEVEL || 'info',
  format: 'json',
  outputs: [
    {
      type: 'stdout',
      format: 'json'
    },
    {
      type: 'file',
      path: '/var/log/stella/orchestrator.log',
      rotation: {
        maxSize: '100MB',
        maxFiles: 10
      }
    }
  ],
  // Sensitive field masking
  redact: [
    'password',
    'token',
    'secret',
    'credentials',
    'authorization'
  ]
};
```
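
The `redact` list above names the fields to mask but not how masking is applied. The sketch below is one possible interpretation: a recursive walk over a JSON-like log payload that replaces the value of any key matching the list exactly, ignoring case. The function name `redact`, the `[REDACTED]` placeholder, and the exact-match rule are all assumptions; the real logger may match substrings or paths instead.

```typescript
// Keys copied from the redact list above.
const REDACT_KEYS = ["password", "token", "secret", "credentials", "authorization"];

// Assumption: keys are matched exactly and case-insensitively; values are plain JSON-like data.
function redact(value: unknown, keys: string[] = REDACT_KEYS): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, keys));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = keys.includes(key.toLowerCase()) ? "[REDACTED]" : redact(inner, keys);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

const masked = redact({
  user: "alice",
  password: "hunter2",
  nested: { Token: "abc", list: [{ secret: "s3cr3t" }] },
});
console.log(JSON.stringify(masked));
```

Returning a copy rather than mutating the input keeps the original object usable by the caller after it has been logged.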

### Important Log Events

| Event | Level | Description |
|-------|-------|-------------|
| `deployment.started` | info | Deployment job started |
| `deployment.completed` | info | Deployment successful |
| `deployment.failed` | error | Deployment failed |
| `rollback.initiated` | warn | Rollback triggered |
| `approval.granted` | info | Promotion approved |
| `approval.denied` | info | Promotion rejected |
| `agent.connected` | info | Agent came online |
| `agent.disconnected` | warn | Agent went offline |
| `security.gate.failed` | warn | Security check blocked |

## Distributed Tracing

### Trace Context Propagation

```typescript
import type { NextFunction, Request, Response } from 'express';

// Trace context in requests
interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  sampled: boolean;
  baggage?: Record<string, string>;
}

// W3C Trace Context headers
//   traceparent: 00-{traceId}-{spanId}-{flags}
//   tracestate:  stella=...

// Minimal tracer surface used below (illustrative; adapt to the tracing SDK in use)
interface Span {
  setAttribute(key: string, value: string | number | boolean): void;
  end(): void;
}

interface Tracer {
  startSpan(
    name: string,
    options: { parent?: TraceContext; attributes: Record<string, unknown> }
  ): Span;
}

// Request extended with the fields this middleware reads and writes
type TracedRequest = Request & { tenantId?: string; span?: Span };

// Example trace propagation
class TracingMiddleware {
  constructor(
    private readonly tracer: Tracer,
    private readonly parseTraceParent: (header?: string) => TraceContext | undefined
  ) {}

  handle(req: TracedRequest, res: Response, next: NextFunction): void {
    // Node exposes repeated headers as arrays; use the first value
    const header = req.headers['traceparent'];
    const traceparent = Array.isArray(header) ? header[0] : header;
    const traceContext = this.parseTraceParent(traceparent);

    // Start span for this request
    const span = this.tracer.startSpan('http.request', {
      parent: traceContext,
      attributes: {
        'http.method': req.method,
        'http.url': req.url,
        'http.user_agent': req.headers['user-agent'],
        'tenant.id': req.tenantId
      }
    });

    // Attach to request for downstream use
    req.span = span;

    // Record the status code and close the span once the response is sent
    res.on('finish', () => {
      span.setAttribute('http.status_code', res.statusCode);
      span.end();
    });

    next();
  }
}
```
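
The `traceparent` header has a fixed layout: a two-digit version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and two hex digits of flags whose lowest bit marks the trace as sampled. The sketch below parses and re-emits that header. It handles only version `00`, and the names `ParsedTraceParent`, `parseTraceParent`, and `formatTraceParent` are hypothetical rather than the orchestrator's actual helpers.

```typescript
interface ParsedTraceParent {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

// version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex flags (lowercase only)
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string | undefined): ParsedTraceParent | undefined {
  const match = header ? TRACEPARENT_RE.exec(header.trim()) : null;
  if (!match) return undefined;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

// Builds the header to send downstream, for example when dispatching a task to an agent.
function formatTraceParent(ctx: ParsedTraceParent): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const parsed = parseTraceParent(incoming);
console.log(parsed, parsed ? formatTraceParent(parsed) : "invalid header");
```

Returning `undefined` for a malformed header lets the caller start a fresh trace instead of propagating a corrupt one.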

### Key Spans

| Span Name | Description | Attributes |
|-----------|-------------|------------|
| `deployment.execute` | Full deployment | `release_id`, `environment` |
| `task.dispatch` | Task dispatch to agent | `target_id`, `agent_id` |
| `agent.execute` | Agent task execution | `task_type`, `duration` |
| `workflow.run` | Workflow execution | `template_id`, `status` |
| `workflow.step` | Individual step | `step_type`, `node_id` |
| `approval.wait` | Waiting for approval | `promotion_id`, `duration` |
| `gate.evaluate` | Gate evaluation | `gate_type`, `result` |

### Jaeger Configuration

```yaml
# jaeger-config.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: stella-jaeger
spec:
  strategy: production
  collector:
    maxReplicas: 5
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
    secretName: jaeger-es-secret
  ingress:
    enabled: true
```

## Alerting

### Alert Rules

```yaml
# prometheus-rules.yaml
groups:
  - name: stella.deployment
    rules:
      - alert: DeploymentFailureRateHigh
        expr: |
          sum(rate(stella_deployments_total{status="failed"}[5m])) /
          sum(rate(stella_deployments_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High deployment failure rate"
          description: "More than 10% of deployments are failing"

      - alert: DeploymentDurationHigh
        expr: |
          histogram_quantile(0.95, sum(rate(stella_deployment_duration_seconds_bucket[5m])) by (le, tenant)) > 600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Deployment duration high"
          description: "P95 deployment duration exceeds 10 minutes"

      - alert: RollbackRateHigh
        # increase() counts rollbacks over the window; rate() would yield a per-second value
        expr: |
          sum(increase(stella_rollbacks_total[1h])) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rollback rate"
          description: "More than 3 rollbacks in the last hour"

  - name: stella.agents
    rules:
      - alert: AgentOffline
        expr: |
          stella_agent_heartbeat_age_seconds > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Agent offline"
          description: "Agent {{ $labels.agent }} has not sent heartbeat for 2 minutes"

      - alert: AgentPoolLow
        # stella_agents_connected is a per-tenant gauge with no status label, so sum its value
        expr: |
          sum by (tenant) (stella_agents_connected) < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low agent count"
          description: "Fewer than 2 agents online for tenant {{ $labels.tenant }}"

  - name: stella.approvals
    rules:
      - alert: ApprovalBacklogHigh
        expr: |
          stella_approval_pending_count > 10
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Approval backlog growing"
          description: "More than 10 pending approvals for over an hour"

      - alert: ApprovalWaitLong
        # Requires approval-duration buckets that extend beyond 86400s;
        # histogram_quantile cannot report above the highest finite bucket bound
        expr: |
          histogram_quantile(0.90, sum by (le) (rate(stella_approval_duration_seconds_bucket[1d]))) > 86400
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Long approval wait times"
          description: "P90 approval wait time exceeds 24 hours"
```

### PagerDuty Integration

```typescript
// Alertmanager receivers and routing, expressed as a TypeScript object
const alertManagerConfig = {
  receivers: [
    {
      name: "stella-critical",
      pagerduty_configs: [
        {
          service_key: "${PAGERDUTY_SERVICE_KEY}",
          severity: "critical"
        }
      ]
    },
    {
      name: "stella-warning",
      slack_configs: [
        {
          api_url: "${SLACK_WEBHOOK_URL}",
          channel: "#stella-alerts",
          send_resolved: true
        }
      ]
    }
  ],
  // Default receiver is stella-warning; critical alerts are rerouted to PagerDuty
  route: {
    receiver: "stella-warning",
    routes: [
      {
        match: { severity: "critical" },
        receiver: "stella-critical"
      }
    ]
  }
};
```

## Dashboards

### Deployment Dashboard

Key panels:
- Deployment rate over time
- Success/failure ratio
- Average deployment duration
- Deployment duration histogram
- Active deployments by environment
- Recent deployment list

### Agent Health Dashboard

Key panels:
- Connected agents count
- Agent heartbeat status
- Tasks per agent
- Task success rate by agent
- Agent resource utilization

### Approval Dashboard

Key panels:
- Pending approvals count
- Approval response time
- Approvals by user
- Rejection reasons breakdown

## Health Endpoints

### Application Health

```http
GET /health
```

Response:
```json
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 86400,
  "checks": {
    "database": { "status": "healthy", "latency": 5 },
    "redis": { "status": "healthy", "latency": 2 },
    "vault": { "status": "healthy", "latency": 10 }
  }
}
```
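
The response shape suggests that the top-level `status` is derived from the individual dependency checks. The sketch below shows one way that roll-up could work: the service is reported healthy only when every check is healthy. The `"unhealthy"` status value, the millisecond unit for `latency`, the 200/503 mapping, and the function names are assumptions, since the document only shows the all-healthy case.

```typescript
// Assumption: "unhealthy" is the failing state; the document only shows "healthy".
type CheckStatus = "healthy" | "unhealthy";

interface CheckResult {
  status: CheckStatus;
  latency: number; // assumed to be milliseconds
}

interface HealthSummary {
  status: CheckStatus;
  checks: Record<string, CheckResult>;
}

// Overall status is healthy only if every dependency check is healthy.
function summarizeHealth(checks: Record<string, CheckResult>): HealthSummary {
  const allHealthy = Object.values(checks).every((check) => check.status === "healthy");
  return { status: allHealthy ? "healthy" : "unhealthy", checks };
}

// Assumed mapping: 200 when healthy, 503 otherwise, so probes can rely on the status code.
function httpStatusFor(summary: HealthSummary): number {
  return summary.status === "healthy" ? 200 : 503;
}

const summary = summarizeHealth({
  database: { status: "healthy", latency: 5 },
  redis: { status: "healthy", latency: 2 },
  vault: { status: "unhealthy", latency: 10 },
});
console.log(summary.status, httpStatusFor(summary));
```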

### Readiness Probe

```http
GET /health/ready
```

### Liveness Probe

```http
GET /health/live
```

## Performance Tuning

### Database Connection Pool

```typescript
const poolConfig = {
  min: 5,                   // connections kept open
  max: 20,                  // upper bound on open connections
  acquireTimeout: 30000,    // 30 seconds
  idleTimeout: 600000,      // 10 minutes
  connectionTimeout: 10000  // 10 seconds
};
```

### Cache Configuration

```typescript
const cacheConfig = {
  // Release cache
  releases: {
    ttl: 300, // 5 minutes
    maxSize: 1000
  },
  // Target cache
  targets: {
    ttl: 60, // 1 minute
    maxSize: 5000
  },
  // Workflow template cache
  templates: {
    ttl: 3600, // 1 hour
    maxSize: 100
  }
};
```
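
Each cache entry above combines a time-to-live with a size cap. The sketch below shows how those two settings might interact in a small in-memory cache: entries expire after `ttl` seconds, and inserting into a full cache evicts the oldest entry. The class name `TtlCache`, the injected clock, and the oldest-first eviction policy are assumptions; the document does not state which eviction policy the orchestrator uses.

```typescript
interface CacheOptions {
  ttl: number;     // seconds, as in cacheConfig above
  maxSize: number; // maximum number of entries
}

class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly options: CacheOptions;
  private readonly now: () => number;

  // The clock is injectable so expiry can be exercised without waiting.
  constructor(options: CacheOptions, now: () => number = Date.now) {
    this.options = options;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // drop expired entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key); // re-inserting moves the key to the newest position
    if (this.entries.size >= this.options.maxSize) {
      // Map iterates in insertion order, so the first key is the oldest entry.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.options.ttl * 1000 });
  }
}

// Usage with a fake clock (milliseconds) and a deliberately tiny cache.
let clock = 0;
const cache = new TtlCache<string>({ ttl: 60, maxSize: 2 }, () => clock);
cache.set("a", "1");
cache.set("b", "2");
cache.set("c", "3"); // evicts "a"
```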

### Rate Limiting

```typescript
const rateLimitConfig = {
  // API rate limits
  api: {
    windowMs: 60000, // 1 minute
    max: 1000,       // requests per window
    burst: 100       // burst allowance
  },
  // Webhook rate limits
  webhooks: {
    windowMs: 60000,
    max: 100
  },
  // Per-tenant limits
  tenant: {
    windowMs: 60000,
    max: 500
  }
};
```
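
The `windowMs` and `max` pairs above describe a windowed limit. The sketch below implements the simplest variant, a fixed-window counter keyed by caller, to show how the two values interact. `FixedWindowLimiter` is a hypothetical name, the fixed-window algorithm is an assumption (the orchestrator may use a sliding window or token bucket), and `burst` is left out because its semantics are not specified here.

```typescript
interface WindowLimit {
  windowMs: number; // window length in milliseconds
  max: number;      // requests allowed per window
}

class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: WindowLimit;
  private readonly now: () => number;

  // The clock is injectable so window rollover can be exercised without waiting.
  constructor(limit: WindowLimit, now: () => number = Date.now) {
    this.limit = limit;
    this.now = now;
  }

  // Returns true if the request identified by `key` fits in the current window.
  allow(key: string): boolean {
    const t = this.now();
    let current = this.windows.get(key);
    if (!current || t - current.start >= this.limit.windowMs) {
      current = { start: t, count: 0 }; // start a fresh window for this key
      this.windows.set(key, current);
    }
    if (current.count >= this.limit.max) return false;
    current.count += 1;
    return true;
  }
}

// Usage with a fake clock and a small limit; keys could be tenant IDs.
let limiterClock = 0;
const limiter = new FixedWindowLimiter({ windowMs: 60000, max: 2 }, () => limiterClock);
const decisions = [limiter.allow("tenant-a"), limiter.allow("tenant-a"), limiter.allow("tenant-a")];
console.log(decisions);
```

A fixed window is easy to reason about but lets up to twice `max` requests through across a window boundary, which is one reason a separate burst setting is often paired with it.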

## References

- [Metrics Reference](metrics.md)
- [Logging Guide](logging.md)
- [Tracing Setup](tracing.md)
- [Alert Configuration](alerting.md)