git.stella-ops.org/docs/ops/telemetry-storage.md
Vladimir Moushkov 4d932cc1ba
2025-10-27 18:57:35 +02:00

# Telemetry Storage Deployment (DEVOPS-OBS-50-002)
> **Audience:** DevOps Guild, Observability Guild
>
> **Scope:** Prometheus (metrics), Tempo (traces), Loki (logs) storage backends with tenant isolation, TLS, retention policies, and Authority integration.
---
## 1. Components & Ports
| Service | Port | Purpose | TLS |
|-----------|------|---------|-----|
| Prometheus | 9090 | Metrics API / alerting | Client auth (mTLS) to scrape collector |
| Tempo | 3200 | Trace ingest + API | mTLS (client cert required) |
| Loki | 3100 | Log ingest + API | mTLS (client cert required) |
The collector forwards OTLP traffic to Tempo (traces), Prometheus scrapes the collector's `/metrics` endpoint, and Loki is used for log search.
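
As a quick way to probe each backend, the ports in the table can be combined with the upstream readiness paths (`/-/ready` for Prometheus, `/ready` for Tempo and Loki). The helper below is a minimal sketch: the function name is hypothetical, and the default `https` scheme is an assumption, so pass `http` for any endpoint that is not served over TLS in your environment.

```bash
# Print the readiness URL for one storage backend.
# Ports come from the table above; paths are the upstream defaults.
# Usage: telemetry_ready_url <service> [host] [scheme]
telemetry_ready_url() {
  service=$1
  host=${2:-localhost}
  scheme=${3:-https}   # assumption: the endpoint is served over TLS
  case "$service" in
    prometheus) port=9090; path=/-/ready ;;
    tempo)      port=3200; path=/ready ;;
    loki)       port=3100; path=/ready ;;
    *) echo "unknown service: $service" >&2; return 1 ;;
  esac
  printf '%s://%s:%s%s\n' "$scheme" "$host" "$port" "$path"
}
```

For example, `telemetry_ready_url loki` prints a URL that can be handed to `curl` together with the client certificate and key.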
---
## 2. Local validation (Compose)
```bash
./ops/devops/telemetry/generate_dev_tls.sh
cd deploy/compose
# Start collector + storage stack
docker compose -f docker-compose.telemetry.yaml up -d
docker compose -f docker-compose.telemetry-storage.yaml up -d
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```
Configuration files live in `deploy/telemetry/storage/`. Adjust the overrides before shipping to staging/production.
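
The storage containers can take a few seconds to become ready after `docker compose up -d`, so it may help to wait before running the smoke test. The following is a sketch of a generic retry helper, not part of the repository; the function name and the certificate file names in the usage comment are assumptions.

```bash
# Retry a command until it succeeds or the attempts run out.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example (hypothetical certificate file names):
#   retry 30 2 curl -sf --cacert ca.crt --cert client.crt --key client.key \
#     https://localhost:3100/ready
```

With 30 attempts and a 2-second delay, the example waits roughly one minute for Loki before giving up.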
---
## 3. Kubernetes blueprint
Deploy Prometheus, Tempo, and Loki to the `observability` namespace. The Helm values snippet below illustrates the key settings (the charts are not yet versioned; define them in the observability repo):
```yaml
prometheus:
  server:
    extraFlags:
      - web.enable-lifecycle
    persistentVolume:
      enabled: true
      size: 200Gi
  additionalScrapeConfigsSecret: stellaops-prometheus-scrape
  extraSecretMounts:
    - name: otel-mtls
      secretName: stellaops-otel-tls-stage
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: otel-token
      secretName: stellaops-prometheus-token
      mountPath: /etc/telemetry/auth
      readOnly: true
loki:
  auth_enabled: true
  singleBinary:
    replicas: 2
  storage:
    type: filesystem
  existingSecretForTls: stellaops-otel-tls-stage
  runtimeConfig:
    configMap:
      name: stellaops-loki-tenant-overrides
tempo:
  server:
    http_listen_port: 3200
  storage:
    trace:
      backend: s3
      s3:
        endpoint: tempo-minio.observability.svc:9000
        bucket: tempo-traces
  multitenancyEnabled: true
  extraVolumeMounts:
    - name: otel-mtls
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: tempo-tenant-overrides
      mountPath: /etc/telemetry/tenants
      readOnly: true
```
### Staging bootstrap commands
```bash
kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f -

# TLS material (generated via ops/devops/telemetry/generate_dev_tls.sh or from PKI)
kubectl -n observability create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector-stage.crt \
  --from-file=tls.key=collector-stage.key \
  --from-file=ca.crt=collector-ca.crt

# Prometheus bearer token issued by Authority (scope obs:read)
kubectl -n observability create secret generic stellaops-prometheus-token \
  --from-file=token=prometheus-stage.token

# Tenant overrides
kubectl -n observability create configmap stellaops-loki-tenant-overrides \
  --from-file=overrides.yaml=deploy/telemetry/storage/tenants/loki-overrides.yaml
kubectl -n observability create configmap tempo-tenant-overrides \
  --from-file=tempo-overrides.yaml=deploy/telemetry/storage/tenants/tempo-overrides.yaml

# Additional scrape config referencing the collector service
kubectl -n observability create secret generic stellaops-prometheus-scrape \
  --from-file=prometheus-additional.yaml=deploy/telemetry/storage/prometheus.yaml
```
Provision the following secrets/configs (names can be overridden via Helm values):
| Name | Type | Notes |
|------|------|-------|
| `stellaops-otel-tls-stage` | Secret | Shared CA + server cert/key for collector/storage mTLS. |
| `stellaops-prometheus-token` | Secret | Bearer token minted by Authority (`obs:read`). |
| `stellaops-loki-tenant-overrides` | ConfigMap | Contents of `deploy/telemetry/storage/tenants/loki-overrides.yaml`. |
| `tempo-tenant-overrides` | ConfigMap | Contents of `deploy/telemetry/storage/tenants/tempo-overrides.yaml`. |
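
Before installing the charts, it can be useful to confirm that every object exists in the namespace. The sketch below checks the four names from the table plus the `stellaops-prometheus-scrape` secret created by the bootstrap commands; the function name is hypothetical, and the names should be adjusted if they were overridden via Helm values.

```bash
# Verify that the expected secrets and config maps exist.
# Usage: check_telemetry_objects [namespace]
check_telemetry_objects() {
  ns=${1:-observability}
  missing=0
  for item in \
    secret/stellaops-otel-tls-stage \
    secret/stellaops-prometheus-token \
    secret/stellaops-prometheus-scrape \
    configmap/stellaops-loki-tenant-overrides \
    configmap/tempo-tenant-overrides
  do
    if kubectl -n "$ns" get "$item" >/dev/null 2>&1; then
      echo "ok       $item"
    else
      echo "MISSING  $item"
      missing=$((missing + 1))
    fi
  done
  # Non-zero exit status when anything is missing.
  [ "$missing" -eq 0 ]
}
```

Running `check_telemetry_objects` with no argument checks the `observability` namespace; the exit status makes it usable as a gate in a deployment script.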
---
## 4. Authority & tenancy integration
1. Create Authority clients for each backend (`observability-prometheus`, `observability-loki`, `observability-tempo`).
   ```bash
   stella authority client create observability-prometheus \
     --scopes obs:read \
     --audience observability --description "Prometheus collector scrape"
   stella authority client create observability-loki \
     --scopes obs:logs timeline:read \
     --audience observability --description "Loki ingestion"
   stella authority client create observability-tempo \
     --scopes obs:traces \
     --audience observability --description "Tempo ingestion"
   ```
2. Mint tokens/credentials and store them in the secrets above (see staging bootstrap commands). Example:
   ```bash
   stella authority token issue observability-prometheus --ttl 30d > prometheus-stage.token
   ```
3. Update ingress/gateway policies to forward `X-StellaOps-Tenant` into Loki/Tempo so tenant headers propagate end-to-end, and ensure each workload sets `tenant.id` attributes (see `docs/observability/observability.md`).
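
To check tenant isolation by hand, requests can be sent directly to Loki or Tempo with an explicit tenant header. With multi-tenancy enabled, both backends read the tenant from the `X-Scope-OrgID` header; that the gateway maps `X-StellaOps-Tenant` onto this header is an assumption here, not something this document specifies. The function name, the `TLS_DIR` variable, and the host name in the usage comment are hypothetical; the file names follow the keys of `stellaops-otel-tls-stage`.

```bash
# Run curl against Loki or Tempo on behalf of one tenant.
# Usage: tenant_curl <tenant-id> <url> [extra curl args...]
tenant_curl() {
  tenant=$1
  url=$2
  shift 2
  tls_dir=${TLS_DIR:-/etc/telemetry/tls}   # assumed location of the mTLS material
  curl -sS --fail \
    --cacert "$tls_dir/ca.crt" \
    --cert "$tls_dir/tls.crt" \
    --key "$tls_dir/tls.key" \
    -H "X-Scope-OrgID: $tenant" \
    "$@" "$url"
}

# Example (hypothetical host name):
#   tenant_curl tenant-a https://loki.observability.svc:3100/loki/api/v1/labels
```

Repeating the same query with a second tenant ID should return only that tenant's labels if isolation is working.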
---
## 5. Retention & isolation
- Adjust `deploy/telemetry/storage/tenants/*.yaml` to set per-tenant retention and ingestion limits.
- Configure object storage (S3, GCS, Azure Blob) when moving beyond filesystem storage.
- For air-gapped deployments, mirror the telemetry bundle using `ops/devops/telemetry/package_offline_bundle.py` and import it into the Offline Kit staging directory.
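
As an illustration of per-tenant limits, Loki's runtime overrides file uses a top-level `overrides:` map keyed by tenant ID. The sketch below prints such entries; it is not the content of the repository's `loki-overrides.yaml`, the function name and tenant IDs are hypothetical, and setting the burst size to twice the rate is an arbitrary choice. Note that Loki applies `retention_period` only when compactor retention is enabled.

```bash
# Print a Loki runtime-overrides entry for one tenant.
# Usage: loki_tenant_override <tenant-id> <retention-days> <ingestion-rate-mb>
loki_tenant_override() {
  tenant=$1
  retention_hours=$(( $2 * 24 ))   # express the retention in hours, e.g. 168h
  rate_mb=$3
  printf '  %s:\n' "$tenant"
  printf '    retention_period: %sh\n' "$retention_hours"
  printf '    ingestion_rate_mb: %s\n' "$rate_mb"
  printf '    ingestion_burst_size_mb: %s\n' "$(( rate_mb * 2 ))"
}

# Assemble a complete overrides document on stdout.
{
  echo 'overrides:'
  loki_tenant_override tenant-a 7 4
  loki_tenant_override tenant-b 30 8
}
```

Redirect the output to a scratch file and compare it with the existing overrides before replacing anything.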
---
## 6. Operational checklist
- [ ] Certificates rotated and secrets updated.
- [ ] Prometheus scrape succeeds (`curl -sk --cert client.crt --key client.key https://collector:9464/metrics`).
- [ ] Tempo and Loki report tenant activity (`/api/status`).
- [ ] Retention policy tested by uploading sample data and verifying expiry.
- [ ] Alerts wired into SLO evaluator (DEVOPS-OBS-51-001).
- [ ] Component rule packs imported (e.g. `docs/ops/scheduler-worker-prometheus-rules.yaml`).
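
For the certificate item in the checklist, `openssl x509 -checkend` can flag certificates that are close to expiry. The helper below is a minimal sketch with a hypothetical function name; the 30-day threshold in the usage comment is an example, not a documented policy.

```bash
# Succeed if the certificate is still valid <days> from now.
# Usage: cert_valid_for <cert-file> <days>
cert_valid_for() {
  cert=$1
  days=$2
  # -checkend takes seconds and exits non-zero if the cert expires within that window.
  openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" >/dev/null
}

# Example: extract the staged certificate and require 30 more days of validity.
#   kubectl -n observability get secret stellaops-otel-tls-stage \
#     -o jsonpath='{.data.tls\.crt}' | base64 -d > /tmp/stage.crt
#   cert_valid_for /tmp/stage.crt 30 || echo "rotate soon"
```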
---
## 7. References
- `deploy/telemetry/storage/README.md`
- `deploy/compose/docker-compose.telemetry-storage.yaml`
- `docs/ops/telemetry-collector.md`
- `docs/observability/observability.md`