# Telemetry Storage Deployment (DEVOPS-OBS-50-002)

> **Audience:** DevOps Guild, Observability Guild
>
> **Scope:** Prometheus (metrics), Tempo (traces), Loki (logs) storage backends with tenant isolation, TLS, retention policies, and Authority integration.

---

## 1. Components & Ports

| Service | Port | Purpose | TLS |
|-----------|------|---------|-----|
| Prometheus | 9090 | Metrics API / alerting | Client auth (mTLS) to scrape collector |
| Tempo | 3200 | Trace ingest + API | mTLS (client cert required) |
| Loki | 3100 | Log ingest + API | mTLS (client cert required) |

The collector forwards OTLP traffic to Tempo (traces), Prometheus scrapes the collector’s `/metrics` endpoint, and Loki is used for log search.
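Before running the heavier smoke tests, it can help to confirm that each backend port from the table accepts connections. The following is a minimal Python sketch, not repository tooling: the `BACKENDS` mapping mirrors the ports above, while the `probe` and `probe_all` helpers and the `localhost` default are assumptions for a local run. It only checks that a TCP connection opens and does not attempt the mTLS handshake.

```python
import socket

# Ports taken from the components table; the host is an assumption.
BACKENDS = {"prometheus": 9090, "tempo": 3200, "loki": 3100}


def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def probe_all(host: str = "localhost") -> dict[str, bool]:
    """Probe every backend and map the service name to its reachability."""
    return {name: probe(host, port) for name, port in BACKENDS.items()}


if __name__ == "__main__":
    for name, ok in probe_all().items():
        print(f"{name:<12} {'reachable' if ok else 'unreachable'}")
```

A reachable port says nothing about certificates or tenancy, so treat this as a quick first check before the collector smoke test.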

---

## 2. Local validation (Compose)

```bash
# Generate development TLS material (run from the repository root)
./ops/devops/telemetry/generate_dev_tls.sh
cd deploy/compose
# Start collector + storage stack
docker compose -f docker-compose.telemetry.yaml up -d
docker compose -f docker-compose.telemetry-storage.yaml up -d
# Smoke-test the collector
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```

Configuration files live in `deploy/telemetry/storage/`. Adjust the overrides before shipping to staging/production.

---

## 3. Kubernetes blueprint

Deploy Prometheus, Tempo, and Loki to the `observability` namespace. The Helm values snippet below illustrates the key settings (the charts are not yet versioned; define them in the observability repo):

```yaml
prometheus:
  server:
    extraFlags:
      - web.enable-lifecycle
    persistentVolume:
      enabled: true
      size: 200Gi
    additionalScrapeConfigsSecret: stellaops-prometheus-scrape
    extraSecretMounts:
      - name: otel-mtls
        secretName: stellaops-otel-tls-stage
        mountPath: /etc/telemetry/tls
        readOnly: true
      - name: otel-token
        secretName: stellaops-prometheus-token
        mountPath: /etc/telemetry/auth
        readOnly: true

loki:
  auth_enabled: true
  singleBinary:
    replicas: 2
  storage:
    type: filesystem
  existingSecretForTls: stellaops-otel-tls-stage
  runtimeConfig:
    configMap:
      name: stellaops-loki-tenant-overrides

tempo:
  server:
    http_listen_port: 3200
  storage:
    trace:
      backend: s3
      s3:
        endpoint: tempo-minio.observability.svc:9000
        bucket: tempo-traces
  multitenancyEnabled: true
  extraVolumeMounts:
    - name: otel-mtls
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: tempo-tenant-overrides
      mountPath: /etc/telemetry/tenants
      readOnly: true
```

### Staging bootstrap commands

```bash
kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f -

# TLS material (generated via ops/devops/telemetry/generate_dev_tls.sh or from PKI)
kubectl -n observability create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector-stage.crt \
  --from-file=tls.key=collector-stage.key \
  --from-file=ca.crt=collector-ca.crt

# Prometheus bearer token issued by Authority (scope obs:read)
kubectl -n observability create secret generic stellaops-prometheus-token \
  --from-file=token=prometheus-stage.token

# Tenant overrides
kubectl -n observability create configmap stellaops-loki-tenant-overrides \
  --from-file=overrides.yaml=deploy/telemetry/storage/tenants/loki-overrides.yaml

kubectl -n observability create configmap tempo-tenant-overrides \
  --from-file=tempo-overrides.yaml=deploy/telemetry/storage/tenants/tempo-overrides.yaml

# Additional scrape config referencing the collector service
kubectl -n observability create secret generic stellaops-prometheus-scrape \
  --from-file=prometheus-additional.yaml=deploy/telemetry/storage/prometheus.yaml
```

Provision the following secrets/configs (names can be overridden via Helm values):

| Name | Type | Notes |
|------|------|-------|
| `stellaops-otel-tls-stage` | Secret | Shared CA + server cert/key for collector/storage mTLS. |
| `stellaops-prometheus-token` | Secret | Bearer token minted by Authority (`obs:read`). |
| `stellaops-prometheus-scrape` | Secret | Additional scrape config from `deploy/telemetry/storage/prometheus.yaml`. |
| `stellaops-loki-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/loki-overrides.yaml`. |
| `tempo-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/tempo-overrides.yaml`. |
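A preflight check can confirm that the resources provisioned above exist before the Helm release is installed. The sketch below is illustrative only and assumes `kubectl` is on the `PATH` with access to the `observability` namespace. The resource names come from the bootstrap commands, while the `kubectl_get`, `exists`, and `missing` helpers are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

NAMESPACE = "observability"
# (kind, name) pairs created by the staging bootstrap commands.
REQUIRED = [
    ("secret", "stellaops-otel-tls-stage"),
    ("secret", "stellaops-prometheus-token"),
    ("secret", "stellaops-prometheus-scrape"),
    ("configmap", "stellaops-loki-tenant-overrides"),
    ("configmap", "tempo-tenant-overrides"),
]


def kubectl_get(kind: str, name: str) -> list[str]:
    """Build the `kubectl get` command for a single resource."""
    return ["kubectl", "-n", NAMESPACE, "get", kind, name, "-o", "name"]


def exists(cmd: Sequence[str]) -> bool:
    """Run the command; kubectl exits non-zero when the resource is missing."""
    try:
        return subprocess.run(cmd, capture_output=True, check=False).returncode == 0
    except FileNotFoundError:
        # kubectl itself is not installed on this machine.
        return False


def missing(run: Callable[[Sequence[str]], bool] = exists) -> list[str]:
    """Return 'kind/name' for every required resource that is absent."""
    return [
        f"{kind}/{name}"
        for kind, name in REQUIRED
        if not run(kubectl_get(kind, name))
    ]


if __name__ == "__main__":
    absent = missing()
    print("all resources present" if not absent else "missing: " + ", ".join(absent))
```

The `run` parameter exists so the check can be exercised without a cluster, for example by passing a stub that always returns `True`.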

---

## 4. Authority & tenancy integration

1. Create Authority clients for each backend (`observability-prometheus`, `observability-loki`, `observability-tempo`).
   ```bash
   stella authority client create observability-prometheus \
     --scopes obs:read \
     --audience observability --description "Prometheus collector scrape"
   stella authority client create observability-loki \
     --scopes obs:logs timeline:read \
     --audience observability --description "Loki ingestion"
   stella authority client create observability-tempo \
     --scopes obs:traces \
     --audience observability --description "Tempo ingestion"
   ```
2. Mint tokens/credentials and store them in the secrets above (see staging bootstrap commands). Example:
   ```bash
   stella authority token issue observability-prometheus --ttl 30d > prometheus-stage.token
   ```
3. Update ingress/gateway policies to forward `X-StellaOps-Tenant` into Loki/Tempo so tenant headers propagate end-to-end, and ensure each workload sets `tenant.id` attributes (see `docs/observability/observability.md`).
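To make the header propagation in step 3 concrete, the sketch below builds a log push request that carries the tenant header. It is a hedged illustration, not the project's client. `/loki/api/v1/push` is Loki's standard push endpoint, but the `loki.observability.svc` host, the `tenant-a` tenant, and the `build_push_request` and `mtls_context` helpers are assumptions, and it presumes the gateway translates `X-StellaOps-Tenant` into whatever header the backend expects.

```python
import json
import ssl
import time
import urllib.request

TENANT_HEADER = "X-StellaOps-Tenant"  # header name from step 3


def build_push_request(base_url: str, tenant: str, line: str,
                       labels: dict[str, str]) -> urllib.request.Request:
    """Build a Loki push request tagged with the tenant header."""
    payload = {
        "streams": [{
            "stream": labels,
            # Each value is [<unix epoch in nanoseconds, as a string>, <log line>].
            "values": [[str(time.time_ns()), line]],
        }]
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", TENANT_HEADER: tenant},
        method="POST",
    )


def mtls_context(ca: str, cert: str, key: str) -> ssl.SSLContext:
    """Create a client TLS context that presents a certificate (mTLS)."""
    ctx = ssl.create_default_context(cafile=ca)
    ctx.load_cert_chain(certfile=cert, keyfile=key)
    return ctx


if __name__ == "__main__":
    # Hypothetical in-cluster address; nothing is sent here.
    req = build_push_request("https://loki.observability.svc:3100",
                             "tenant-a", "hello from smoke test",
                             {"job": "smoke"})
    print(req.get_method(), req.full_url)
    print(req.header_items())
```

To send the request, pass it to `urllib.request.urlopen` together with `context=mtls_context(...)`, pointing at the CA, certificate, and key mounted under `/etc/telemetry/tls`. A tenant that is rejected at this point usually means a gateway policy is dropping the header.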

---

## 5. Retention & isolation

- Adjust `deploy/telemetry/storage/tenants/*.yaml` to set per-tenant retention and ingestion limits.
- Configure object storage (S3, GCS, Azure Blob) when moving beyond filesystem storage.
- For air-gapped deployments, mirror the telemetry bundle using `ops/devops/telemetry/package_offline_bundle.py` and import it into the Offline Kit staging directory.

---

## 6. Operational checklist

- [ ] Certificates rotated and secrets updated.
- [ ] Prometheus scrape succeeds (`curl -sk --cert client.crt --key client.key https://collector:9464/metrics`).
- [ ] Tempo and Loki report tenant activity (`/api/status`).
- [ ] Retention policy tested by uploading sample data and verifying expiry.
- [ ] Alerts wired into SLO evaluator (DEVOPS-OBS-51-001).
- [ ] Component rule packs imported (e.g. `docs/ops/scheduler-worker-prometheus-rules.yaml`).
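The scrape check in the list above can also be scripted. The sketch below is an illustration under stated assumptions, not repository tooling: it fetches the collector's `/metrics` endpoint with a client certificate and counts the samples in the Prometheus text exposition format. The `collector:9464` address and certificate file names follow the checklist's `curl` example, the `ca.crt` file name is an assumption, and the helper names are hypothetical.

```python
import ssl
import urllib.request


def count_samples(exposition: str) -> int:
    """Count sample lines, skipping blanks and '# HELP' / '# TYPE' comments."""
    return sum(
        1
        for raw in exposition.splitlines()
        if raw.strip() and not raw.lstrip().startswith("#")
    )


def fetch_metrics(url: str, ca: str, cert: str, key: str) -> str:
    """GET the metrics endpoint over mTLS and return the response body."""
    ctx = ssl.create_default_context(cafile=ca)
    ctx.load_cert_chain(certfile=cert, keyfile=key)
    with urllib.request.urlopen(url, context=ctx, timeout=5) as resp:
        return resp.read().decode("utf-8")


def scrape_ok(url: str, ca: str, cert: str, key: str) -> bool:
    """Return True when the endpoint answers and exposes at least one sample."""
    try:
        return count_samples(fetch_metrics(url, ca, cert, key)) > 0
    except OSError:
        # Missing files, TLS failures, and connection errors all derive from OSError.
        return False


if __name__ == "__main__":
    ok = scrape_ok("https://collector:9464/metrics",
                   "ca.crt", "client.crt", "client.key")
    print("scrape ok" if ok else "scrape failed")
```

Unlike `curl -k`, this verifies the server certificate against the supplied CA, so it also fails when the collector presents a certificate the CA did not issue or one that does not match the host name.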

---

## 7. References

- `deploy/telemetry/storage/README.md`
- `deploy/compose/docker-compose.telemetry-storage.yaml`
- `docs/ops/telemetry-collector.md`
- `docs/observability/observability.md`