Telemetry Storage Deployment (DEVOPS-OBS-50-002)
Audience: DevOps Guild, Observability Guild
Scope: Prometheus (metrics), Tempo (traces), Loki (logs) storage backends with tenant isolation, TLS, retention policies, and Authority integration.
1. Components & Ports
| Service | Port | Purpose | TLS |
|---|---|---|---|
| Prometheus | 9090 | Metrics API / alerting | Client auth (mTLS) to scrape collector |
| Tempo | 3200 | Trace ingest + API | mTLS (client cert required) |
| Loki | 3100 | Log ingest + API | mTLS (client cert required) |
The collector forwards OTLP traffic to Tempo (traces), Prometheus scrapes the collector’s /metrics endpoint, and Loki is used for log search.
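The port table above can be turned into a quick reachability probe. This is a minimal sketch, not part of the repository tooling: the `BACKENDS` mapping and the `probe`/`probe_all` helpers are illustrative names. A successful TCP connect only shows that the port is open, not that mTLS is configured correctly.

```python
import socket

# Ports copied from the table above; the mapping name is illustrative.
BACKENDS = {"prometheus": 9090, "tempo": 3200, "loki": 3100}


def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def probe_all(host: str) -> dict[str, bool]:
    """Probe every backend port on one host (for example the Compose host)."""
    return {name: probe(host, port) for name, port in BACKENDS.items()}
```

Calling `probe_all("localhost")` after the Compose stack is up would be one way to confirm the three listeners before running the smoke test.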
2. Local validation (Compose)
./ops/devops/telemetry/generate_dev_tls.sh
cd deploy/compose
# Start collector + storage stack
docker compose -f docker-compose.telemetry.yaml up -d
docker compose -f docker-compose.telemetry-storage.yaml up -d
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
Configuration files live in deploy/telemetry/storage/. Adjust the overrides before shipping to staging/production.
3. Kubernetes blueprint
Deploy Prometheus, Tempo, and Loki to the observability namespace. The Helm values snippet below illustrates the key settings (charts not yet versioned—define them in the observability repo):
prometheus:
  server:
    extraFlags:
      - web.enable-lifecycle
    persistentVolume:
      enabled: true
      size: 200Gi
  additionalScrapeConfigsSecret: stellaops-prometheus-scrape
  extraSecretMounts:
    - name: otel-mtls
      secretName: stellaops-otel-tls-stage
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: otel-token
      secretName: stellaops-prometheus-token
      mountPath: /etc/telemetry/auth
      readOnly: true
loki:
  auth_enabled: true
  singleBinary:
    replicas: 2
  storage:
    type: filesystem
  existingSecretForTls: stellaops-otel-tls-stage
  runtimeConfig:
    configMap:
      name: stellaops-loki-tenant-overrides
tempo:
  server:
    http_listen_port: 3200
  storage:
    trace:
      backend: s3
      s3:
        endpoint: tempo-minio.observability.svc:9000
        bucket: tempo-traces
  multitenancyEnabled: true
  extraVolumeMounts:
    - name: otel-mtls
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: tempo-tenant-overrides
      mountPath: /etc/telemetry/tenants
      readOnly: true
Staging bootstrap commands
kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f -
# TLS material (generated via ops/devops/telemetry/generate_dev_tls.sh or from PKI)
kubectl -n observability create secret generic stellaops-otel-tls-stage \
--from-file=tls.crt=collector-stage.crt \
--from-file=tls.key=collector-stage.key \
--from-file=ca.crt=collector-ca.crt
# Prometheus bearer token issued by Authority (scope obs:read)
kubectl -n observability create secret generic stellaops-prometheus-token \
--from-file=token=prometheus-stage.token
# Tenant overrides
kubectl -n observability create configmap stellaops-loki-tenant-overrides \
--from-file=overrides.yaml=deploy/telemetry/storage/tenants/loki-overrides.yaml
kubectl -n observability create configmap tempo-tenant-overrides \
--from-file=tempo-overrides.yaml=deploy/telemetry/storage/tenants/tempo-overrides.yaml
# Additional scrape config referencing the collector service
kubectl -n observability create secret generic stellaops-prometheus-scrape \
--from-file=prometheus-additional.yaml=deploy/telemetry/storage/prometheus.yaml
Provision the following secrets/configs (names can be overridden via Helm values):
| Name | Type | Notes |
|---|---|---|
| stellaops-otel-tls-stage | Secret | Shared CA + server cert/key for collector/storage mTLS. |
| stellaops-prometheus-token | Secret | Bearer token minted by Authority (obs:read). |
| stellaops-loki-tenant-overrides | ConfigMap | Text from deploy/telemetry/storage/tenants/loki-overrides.yaml. |
| tempo-tenant-overrides | ConfigMap | Text from deploy/telemetry/storage/tenants/tempo-overrides.yaml. |
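Before installing the charts, it can help to confirm that every object in the table exists. The sketch below is illustrative only: it assumes the names are fed from `kubectl -n observability get secrets,configmaps -o name`, which prints one `kind/name` entry per line, and the helper names (`parse_kubectl_names`, `missing_resources`) are hypothetical.

```python
# Names and kinds copied from the table above.
REQUIRED = {
    "Secret": {"stellaops-otel-tls-stage", "stellaops-prometheus-token"},
    "ConfigMap": {"stellaops-loki-tenant-overrides", "tempo-tenant-overrides"},
}

# Maps the lowercase prefix printed by `kubectl ... -o name` to a kind.
KIND_BY_PREFIX = {"secret": "Secret", "configmap": "ConfigMap"}


def parse_kubectl_names(output: str) -> dict[str, set[str]]:
    """Group `kind/name` lines by kind, ignoring kinds we do not track."""
    found: dict[str, set[str]] = {}
    for line in output.splitlines():
        prefix, _, name = line.strip().partition("/")
        kind = KIND_BY_PREFIX.get(prefix)
        if kind and name:
            found.setdefault(kind, set()).add(name)
    return found


def missing_resources(existing: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return the required names, per kind, that are absent from `existing`."""
    report: dict[str, list[str]] = {}
    for kind, names in REQUIRED.items():
        absent = sorted(names - existing.get(kind, set()))
        if absent:
            report[kind] = absent
    return report
```

An empty result from `missing_resources(parse_kubectl_names(output))` means all four objects are present. If the names were overridden via Helm values, `REQUIRED` would need to change to match.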
4. Authority & tenancy integration
- Create Authority clients for each backend (observability-prometheus, observability-loki, observability-tempo).
  stella authority client create observability-prometheus \
    --scopes obs:read \
    --audience observability --description "Prometheus collector scrape"
  stella authority client create observability-loki \
    --scopes obs:logs timeline:read \
    --audience observability --description "Loki ingestion"
  stella authority client create observability-tempo \
    --scopes obs:traces \
    --audience observability --description "Tempo ingestion"
- Mint tokens/credentials and store them in the secrets above (see staging bootstrap commands). Example:
  stella authority token issue observability-prometheus --ttl 30d > prometheus-stage.token
- Update ingress/gateway policies to forward X-StellaOps-Tenant into Loki/Tempo so tenant headers propagate end-to-end, and ensure each workload sets tenant.id attributes (see docs/observability/observability.md).
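The tenant-forwarding step can be sketched as a small header-mapping function. Loki and Tempo read the tenant from the `X-Scope-OrgID` header when multi-tenancy is enabled. Copying `X-StellaOps-Tenant` into that header, and attaching the Authority token as a bearer credential, are assumptions about how the gateway is wired, not something this document specifies. The function name is hypothetical.

```python
TENANT_HEADER = "X-StellaOps-Tenant"  # header named in this document
UPSTREAM_HEADER = "X-Scope-OrgID"     # tenant header read by Loki and Tempo


def upstream_headers(incoming: dict[str, str], token: str) -> dict[str, str]:
    """Build headers for a Loki/Tempo request from an incoming request.

    Raises ValueError when the tenant header is absent or blank, so a
    request never reaches the backend without a tenant.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in incoming.items()}
    tenant = lowered.get(TENANT_HEADER.lower(), "").strip()
    if not tenant:
        raise ValueError(f"missing {TENANT_HEADER} header")
    return {
        TENANT_HEADER: tenant,
        UPSTREAM_HEADER: tenant,  # assumed mapping, see lead-in
        "Authorization": f"Bearer {token}",
    }
```

Rejecting requests without a tenant, rather than falling back to a default, keeps misrouted traffic from landing in a shared tenant.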
5. Retention & isolation
- Adjust deploy/telemetry/storage/tenants/*.yaml to set per-tenant retention and ingestion limits.
- Configure object storage (S3, GCS, Azure Blob) when moving beyond filesystem storage.
- For air-gapped deployments, mirror the telemetry bundle using ops/devops/telemetry/package_offline_bundle.py and import inside the Offline Kit staging directory.
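When testing a per-tenant retention setting, it helps to know the earliest moment a sample should disappear. The sketch below computes that from an ingestion timestamp and a retention string. It assumes simple values such as `720h` or `30d`, and all function names are hypothetical. Backends typically delete data during periodic compaction, so treat the result as a lower bound rather than an exact deletion time.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Maps the unit suffix to the matching timedelta keyword argument.
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def parse_retention(value: str) -> timedelta:
    """Parse a simple retention string such as '720h' or '30d'."""
    match = re.fullmatch(r"(\d+)([hdw])", value.strip())
    if match is None:
        raise ValueError(f"unsupported retention value: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def expires_at(ingested: datetime, retention: str) -> datetime:
    """Earliest time at which a sample ingested at `ingested` may expire."""
    return ingested + parse_retention(retention)


def is_expired(ingested: datetime, retention: str,
               now: Optional[datetime] = None) -> bool:
    """True once the retention window for the sample has elapsed."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current >= expires_at(ingested, retention)
```

Pass timezone-aware timestamps throughout, since comparing a naive `datetime` with the aware default for `now` raises `TypeError`.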
6. Operational checklist
- Certificates rotated and secrets updated.
- Prometheus scrape succeeds (curl -sk --cert client.crt --key client.key https://collector:9464).
- Tempo and Loki report tenant activity (/api/status).
- Retention policy tested by uploading sample data and verifying expiry.
- Alerts wired into SLO evaluator (DEVOPS-OBS-51-001).
- Component rule packs imported (e.g. docs/ops/scheduler-worker-prometheus-rules.yaml).
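The Prometheus scrape item in the checklist can also be scripted. The sketch below mirrors the `curl -sk --cert ... --key ...` call using only the Python standard library and counts the sample lines returned. The function names are hypothetical, and the URL is passed in by the caller because the exact path is not fixed here. Like `-k`, it skips server certificate verification, so it suits a smoke test only.

```python
import ssl
import urllib.request


def count_samples(exposition: str) -> int:
    """Count sample lines in Prometheus text exposition output.

    Blank lines and lines starting with '#' (HELP/TYPE/comments) are skipped.
    """
    return sum(
        1
        for line in exposition.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


def scrape(url: str, cert: str, key: str, timeout: float = 5.0) -> int:
    """Fetch `url` with a client certificate and return the sample count."""
    ctx = ssl.create_default_context()
    # Equivalent of curl's -k: do not verify the server certificate.
    # check_hostname must be disabled before verify_mode is relaxed.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    # Present the client certificate required for mTLS.
    ctx.load_cert_chain(certfile=cert, keyfile=key)
    with urllib.request.urlopen(url, context=ctx, timeout=timeout) as resp:
        return count_samples(resp.read().decode("utf-8", errors="replace"))
```

A call such as `scrape("https://collector:9464/metrics", "client.crt", "client.key")` returning a positive number would indicate that the endpoint answers and exposes metrics.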
7. References
- deploy/telemetry/storage/README.md
- deploy/compose/docker-compose.telemetry-storage.yaml
- docs/ops/telemetry-collector.md
- docs/observability/observability.md