Telemetry Storage Deployment (DEVOPS-OBS-50-002)

Audience: DevOps Guild, Observability Guild

Scope: Prometheus (metrics), Tempo (traces), Loki (logs) storage backends with tenant isolation, TLS, retention policies, and Authority integration.


1. Components & Ports

| Service | Port | Purpose | TLS |
|---|---|---|---|
| Prometheus | 9090 | Metrics API / alerting | Client auth (mTLS) to scrape collector |
| Tempo | 3200 | Trace ingest + API | mTLS (client cert required) |
| Loki | 3100 | Log ingest + API | mTLS (client cert required) |

The collector forwards OTLP traffic to Tempo (traces), Prometheus scrapes the collector's /metrics endpoint, and Loki is used for log search.


2. Local validation (Compose)

./ops/devops/telemetry/generate_dev_tls.sh
cd deploy/compose
# Start collector + storage stack
docker compose -f docker-compose.telemetry.yaml up -d
docker compose -f docker-compose.telemetry-storage.yaml up -d
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost

Configuration files live in deploy/telemetry/storage/. Adjust the overrides before shipping to staging/production.
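
After the stack starts, it can help to wait for each backend to report ready before running further checks. The sketch below polls the readiness paths that Prometheus (`/-/ready`), Tempo (`/ready`), and Loki (`/ready`) expose. The hosts, URL schemes, and certificate file arguments are assumptions about the local Compose stack, not values taken from the repository; adjust them to match docker-compose.telemetry-storage.yaml.

```python
import http.client
import ssl
import time
import urllib.request
from typing import Callable, Dict, List, Optional

# Readiness paths exposed by Prometheus, Tempo, and Loki. Hosts, schemes, and
# published ports are assumptions about the local Compose stack.
READY_URLS: Dict[str, str] = {
    "prometheus": "http://localhost:9090/-/ready",
    "tempo": "https://localhost:3200/ready",
    "loki": "https://localhost:3100/ready",
}


def make_probe(ca_file: Optional[str] = None,
               cert_file: Optional[str] = None,
               key_file: Optional[str] = None) -> Callable[[str], bool]:
    """Build a probe that returns True when a URL answers with HTTP 200."""
    context = ssl.create_default_context(cafile=ca_file)
    if cert_file:
        # Present a client certificate to the mTLS-protected backends.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)

    def probe(url: str) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=3, context=context) as resp:
                return resp.status == 200
        except (OSError, http.client.HTTPException):
            # Connection refused, TLS failure, timeout, or non-2xx response.
            return False

    return probe


def wait_until_ready(urls: Dict[str, str],
                     probe: Callable[[str], bool],
                     attempts: int = 30,
                     delay: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Poll every URL until all are ready; return the names still not ready."""
    pending = dict(urls)
    for attempt in range(attempts):
        # Keep only the services whose probe still fails.
        pending = {name: url for name, url in pending.items() if not probe(url)}
        if not pending:
            return []
        if attempt < attempts - 1:
            sleep(delay)
    return sorted(pending)
```

A call such as `wait_until_ready(READY_URLS, make_probe("ca.crt", "client.crt", "client.key"))` returns an empty list once all three services answer; the certificate file names here are placeholders for whatever generate_dev_tls.sh produces.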


3. Kubernetes blueprint

Deploy Prometheus, Tempo, and Loki to the observability namespace. The Helm values snippet below illustrates the key settings (charts not yet versioned—define them in the observability repo):

prometheus:
  server:
    extraFlags:
      - web.enable-lifecycle
    persistentVolume:
      enabled: true
      size: 200Gi
  additionalScrapeConfigsSecret: stellaops-prometheus-scrape
  extraSecretMounts:
    - name: otel-mtls
      secretName: stellaops-otel-tls-stage
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: otel-token
      secretName: stellaops-prometheus-token
      mountPath: /etc/telemetry/auth
      readOnly: true

loki:
  auth_enabled: true
  singleBinary:
    replicas: 2
  storage:
    type: filesystem
  existingSecretForTls: stellaops-otel-tls-stage
  runtimeConfig:
    configMap:
      name: stellaops-loki-tenant-overrides

tempo:
  server:
    http_listen_port: 3200
  storage:
    trace:
      backend: s3
      s3:
        endpoint: tempo-minio.observability.svc:9000
        bucket: tempo-traces
  multitenancyEnabled: true
  extraVolumeMounts:
    - name: otel-mtls
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: tempo-tenant-overrides
      mountPath: /etc/telemetry/tenants
      readOnly: true

Staging bootstrap commands

kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f -

# TLS material (generated via ops/devops/telemetry/generate_dev_tls.sh or from PKI)
kubectl -n observability create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector-stage.crt \
  --from-file=tls.key=collector-stage.key \
  --from-file=ca.crt=collector-ca.crt

# Prometheus bearer token issued by Authority (scope obs:read)
kubectl -n observability create secret generic stellaops-prometheus-token \
  --from-file=token=prometheus-stage.token

# Tenant overrides
kubectl -n observability create configmap stellaops-loki-tenant-overrides \
  --from-file=overrides.yaml=deploy/telemetry/storage/tenants/loki-overrides.yaml

kubectl -n observability create configmap tempo-tenant-overrides \
  --from-file=tempo-overrides.yaml=deploy/telemetry/storage/tenants/tempo-overrides.yaml

# Additional scrape config referencing the collector service
kubectl -n observability create secret generic stellaops-prometheus-scrape \
  --from-file=prometheus-additional.yaml=deploy/telemetry/storage/prometheus.yaml
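
For orientation, the additional scrape config could look roughly like the job built below. This is a sketch rather than the contents of deploy/telemetry/storage/prometheus.yaml: the job name is hypothetical, the target `collector:9464` is borrowed from the operational checklist, and the file paths are derived from the secret keys and mount paths shown above. The output is JSON, which YAML parsers accept.

```python
import json
from typing import Any, Dict, List


def build_scrape_config(target: str = "collector:9464") -> List[Dict[str, Any]]:
    """Return a one-job Prometheus scrape config for the collector (sketch)."""
    tls_dir = "/etc/telemetry/tls"    # mountPath of the otel-mtls secret
    auth_dir = "/etc/telemetry/auth"  # mountPath of the otel-token secret
    return [{
        "job_name": "stellaops-otel-collector",  # hypothetical job name
        "scheme": "https",
        "metrics_path": "/metrics",
        "tls_config": {
            # File names follow the keys used when creating the TLS secret.
            "ca_file": f"{tls_dir}/ca.crt",
            "cert_file": f"{tls_dir}/tls.crt",
            "key_file": f"{tls_dir}/tls.key",
        },
        "authorization": {
            "type": "Bearer",
            # The token secret stores the bearer token under the key "token".
            "credentials_file": f"{auth_dir}/token",
        },
        "static_configs": [{"targets": [target]}],
    }]


def render(config: List[Dict[str, Any]]) -> str:
    """Serialise the config as indented JSON (a subset of YAML)."""
    return json.dumps(config, indent=2)
```

Printing `render(build_scrape_config())` gives a file body that can be compared against the real prometheus.yaml before the secret is created.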

Provision the following secrets/configs (names can be overridden via Helm values):

| Name | Type | Notes |
|---|---|---|
| stellaops-otel-tls-stage | Secret | Shared CA + server cert/key for collector/storage mTLS. |
| stellaops-prometheus-token | Secret | Bearer token minted by Authority (obs:read). |
| stellaops-loki-tenant-overrides | ConfigMap | Text from deploy/telemetry/storage/tenants/loki-overrides.yaml. |
| tempo-tenant-overrides | ConfigMap | Text from deploy/telemetry/storage/tenants/tempo-overrides.yaml. |
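
Before installing the charts, a quick check that every object from the bootstrap commands exists can save a failed rollout. The sketch below compares the output of `kubectl get secret,configmap -o name` against the names used in the staging bootstrap commands; the helper names are illustrative, and the list needs updating if the names are overridden via Helm values.

```python
import subprocess
from typing import Iterable, List

# Names taken from the staging bootstrap commands, in "kind/name" form as
# printed by `kubectl get ... -o name`.
REQUIRED = [
    "secret/stellaops-otel-tls-stage",
    "secret/stellaops-prometheus-token",
    "secret/stellaops-prometheus-scrape",
    "configmap/stellaops-loki-tenant-overrides",
    "configmap/tempo-tenant-overrides",
]


def missing_resources(existing: Iterable[str],
                      required: Iterable[str] = REQUIRED) -> List[str]:
    """Return the required resources that are absent from the listing."""
    present = {line.strip() for line in existing if line.strip()}
    return [name for name in required if name not in present]


def list_resources(namespace: str = "observability") -> List[str]:
    """List secrets and configmaps in the namespace (requires kubectl access)."""
    result = subprocess.run(
        ["kubectl", "-n", namespace, "get", "secret,configmap", "-o", "name"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

With cluster access, `missing_resources(list_resources())` returns an empty list when everything is provisioned.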

4. Authority & tenancy integration

  1. Create Authority clients for each backend (observability-prometheus, observability-loki, observability-tempo).
    stella authority client create observability-prometheus \
      --scopes obs:read \
      --audience observability --description "Prometheus collector scrape"
    stella authority client create observability-loki \
      --scopes obs:logs timeline:read \
      --audience observability --description "Loki ingestion"
    stella authority client create observability-tempo \
      --scopes obs:traces \
      --audience observability --description "Tempo ingestion"
    
  2. Mint tokens/credentials and store them in the secrets above (see staging bootstrap commands). Example:
    stella authority token issue observability-prometheus --ttl 30d > prometheus-stage.token
    
  3. Update ingress/gateway policies to forward X-StellaOps-Tenant into Loki/Tempo so tenant headers propagate end-to-end, and ensure each workload sets tenant.id attributes (see docs/observability/observability.md).
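
Loki and Tempo identify the tenant from the `X-Scope-OrgID` request header, so the gateway step above usually amounts to a header translation. The sketch below shows that mapping under two assumptions that this document does not confirm: that the gateway is responsible for the translation, and that tenant IDs fit the conservative pattern used here.

```python
import re
from typing import Dict, Mapping

GATEWAY_HEADER = "X-StellaOps-Tenant"  # header named in this document
BACKEND_HEADER = "X-Scope-OrgID"       # tenant header read by Loki and Tempo

# Conservative tenant-id pattern; an assumption for this sketch, not a
# documented StellaOps rule.
_TENANT_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")


def backend_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Translate the gateway tenant header into the backend tenant header."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in incoming.items()}
    tenant = lowered.get(GATEWAY_HEADER.lower(), "").strip()
    if not _TENANT_RE.match(tenant):
        # Rejecting here keeps unlabelled traffic out of a shared tenant.
        raise ValueError(f"missing or invalid {GATEWAY_HEADER} header")
    return {BACKEND_HEADER: tenant}
```

Failing closed on a missing header is a design choice for the sketch: with `auth_enabled: true`, Loki rejects requests that carry no tenant, so surfacing the error at the gateway gives a clearer signal.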

5. Retention & isolation

  • Adjust deploy/telemetry/storage/tenants/*.yaml to set per-tenant retention and ingestion limits.
  • Configure object storage (S3, GCS, Azure Blob) when moving beyond filesystem storage.
  • For air-gapped deployments, mirror the telemetry bundle using ops/devops/telemetry/package_offline_bundle.py and import it into the Offline Kit staging directory.
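
As an illustration of per-tenant limits, the sketch below renders a Loki runtime overrides file in the `overrides: <tenant>: <limits>` shape that Loki's runtime configuration uses. The tenant names and values are hypothetical, and the real loki-overrides.yaml may be organised differently. The keys `retention_period`, `ingestion_rate_mb`, and `ingestion_burst_size_mb` are standard Loki limits; per-tenant retention generally also depends on retention being enabled in the Loki compactor.

```python
from typing import Mapping, Union

Limit = Union[int, str]


def render_loki_overrides(tenants: Mapping[str, Mapping[str, Limit]]) -> str:
    """Render per-tenant limits as a Loki runtime overrides document.

    Only handles simple scalar values and tenant names that need no YAML
    quoting, which is enough for a sketch.
    """
    lines = ["overrides:"]
    for tenant in sorted(tenants):
        lines.append(f"  {tenant}:")
        for key in sorted(tenants[tenant]):
            lines.append(f"    {key}: {tenants[tenant][key]}")
    return "\n".join(lines) + "\n"


# Hypothetical tenants and limits, for illustration only.
EXAMPLE = {
    "tenant-a": {
        "retention_period": "168h",
        "ingestion_rate_mb": 4,
        "ingestion_burst_size_mb": 8,
    },
    "tenant-b": {
        "retention_period": "720h",
        "ingestion_rate_mb": 8,
        "ingestion_burst_size_mb": 16,
    },
}
```

Calling `print(render_loki_overrides(EXAMPLE))` produces text that can be diffed against the checked-in overrides before the ConfigMap is recreated.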

6. Operational checklist

  • Certificates rotated and secrets updated.
  • Prometheus scrape succeeds (curl -sk --cert client.crt --key client.key https://collector:9464).
  • Tempo and Loki report tenant activity (/api/status).
  • Retention policy tested by uploading sample data and verifying expiry.
  • Alerts wired into SLO evaluator (DEVOPS-OBS-51-001).
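
The certificate-rotation item can be backed by a small expiry check. The sketch below works on the `notAfter` string in the format printed by `openssl x509 -enddate -noout` (after the `notAfter=` prefix) or returned by Python's `getpeercert()`. The 30-day threshold is an assumption, not a documented policy.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400.0


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the days left before a certificate's notAfter timestamp.

    `not_after` looks like "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / SECONDS_PER_DAY


def needs_rotation(not_after: str,
                   threshold_days: float = 30.0,
                   now: Optional[float] = None) -> bool:
    """True when the certificate expires within the threshold (or already has)."""
    return days_until_expiry(not_after, now) < threshold_days
```

Passing `now` explicitly keeps the check deterministic in tests; in routine use it defaults to the current time.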

7. References

  • deploy/telemetry/storage/README.md
  • deploy/compose/docker-compose.telemetry-storage.yaml
  • docs/ops/telemetry-collector.md
  • docs/observability/observability.md