# Telemetry Storage Deployment (DEVOPS-OBS-50-002)

> **Audience:** DevOps Guild, Observability Guild
>
> **Scope:** Prometheus (metrics), Tempo (traces), Loki (logs) storage backends with tenant isolation, TLS, retention policies, and Authority integration.

---

## 1. Components & Ports

| Service    | Port | Purpose                | TLS                                    |
|------------|------|------------------------|----------------------------------------|
| Prometheus | 9090 | Metrics API / alerting | Client auth (mTLS) to scrape collector |
| Tempo      | 3200 | Trace ingest + API     | mTLS (client cert required)            |
| Loki       | 3100 | Log ingest + API       | mTLS (client cert required)            |

The collector forwards OTLP traffic to Tempo (traces), Prometheus scrapes the collector’s `/metrics` endpoint, and Loki is used for log search.
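
As a quick sanity check of the port table, the sketch below probes each backend with a plain TCP connect. It is illustrative only: the `localhost` hostnames are placeholders, and a successful connect shows that something is listening, not that mTLS or tenancy is configured correctly.

```python
import socket

# Ports are copied from the table above; the hostnames are placeholders.
BACKENDS = {
    "prometheus": ("localhost", 9090),
    "tempo": ("localhost", 3200),
    "loki": ("localhost", 3100),
}


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_backends(backends=BACKENDS, probe=is_listening) -> dict:
    """Map each service name to the result of probing its host and port."""
    return {name: probe(host, port) for name, (host, port) in backends.items()}
```

Calling `check_backends()` after the stack is up returns a dict such as `{"prometheus": True, ...}`; the `probe` parameter exists so the check can be swapped for a TLS-aware one later.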

---

## 2. Local validation (Compose)

```bash
# Generate development TLS material
./ops/devops/telemetry/generate_dev_tls.sh
cd deploy/compose
# Start collector + storage stack
docker compose -f docker-compose.telemetry.yaml up -d
docker compose -f docker-compose.telemetry-storage.yaml up -d
# Smoke-test the collector
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```

Configuration files live in `deploy/telemetry/storage/`. Adjust the overrides before shipping to staging/production.

---

## 3. Kubernetes blueprint

Deploy Prometheus, Tempo, and Loki to the `observability` namespace. The Helm values snippet below illustrates the key settings (the charts are not yet versioned; define them in the observability repo):

```yaml
prometheus:
  server:
    extraFlags:
      - web.enable-lifecycle
    persistentVolume:
      enabled: true
      size: 200Gi
  additionalScrapeConfigsSecret: stellaops-prometheus-scrape
  extraSecretMounts:
    - name: otel-mtls
      secretName: stellaops-otel-tls-stage
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: otel-token
      secretName: stellaops-prometheus-token
      mountPath: /etc/telemetry/auth
      readOnly: true

loki:
  auth_enabled: true
  singleBinary:
    replicas: 2
  storage:
    type: filesystem
  existingSecretForTls: stellaops-otel-tls-stage
  runtimeConfig:
    configMap:
      name: stellaops-loki-tenant-overrides

tempo:
  server:
    http_listen_port: 3200
  storage:
    trace:
      backend: s3
      s3:
        endpoint: tempo-minio.observability.svc:9000
        bucket: tempo-traces
  multitenancyEnabled: true
  extraVolumeMounts:
    - name: otel-mtls
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: tempo-tenant-overrides
      mountPath: /etc/telemetry/tenants
      readOnly: true
```

### Staging bootstrap commands

```bash
kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f -

# TLS material (generated via ops/devops/telemetry/generate_dev_tls.sh or from PKI)
kubectl -n observability create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector-stage.crt \
  --from-file=tls.key=collector-stage.key \
  --from-file=ca.crt=collector-ca.crt

# Prometheus bearer token issued by Authority (scope obs:read)
kubectl -n observability create secret generic stellaops-prometheus-token \
  --from-file=token=prometheus-stage.token

# Tenant overrides
kubectl -n observability create configmap stellaops-loki-tenant-overrides \
  --from-file=overrides.yaml=deploy/telemetry/storage/tenants/loki-overrides.yaml

kubectl -n observability create configmap tempo-tenant-overrides \
  --from-file=tempo-overrides.yaml=deploy/telemetry/storage/tenants/tempo-overrides.yaml

# Additional scrape config referencing the collector service
kubectl -n observability create secret generic stellaops-prometheus-scrape \
  --from-file=prometheus-additional.yaml=deploy/telemetry/storage/prometheus.yaml
```

Provision the following secrets/configs (names can be overridden via Helm values):

| Name | Type | Notes |
|------|------|-------|
| `stellaops-otel-tls-stage` | Secret | Shared CA + server cert/key for collector/storage mTLS. |
| `stellaops-prometheus-token` | Secret | Bearer token minted by Authority (`obs:read`). |
| `stellaops-prometheus-scrape` | Secret | Additional scrape config from `deploy/telemetry/storage/prometheus.yaml`. |
| `stellaops-loki-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/loki-overrides.yaml`. |
| `tempo-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/tempo-overrides.yaml`. |

---

## 4. Authority & tenancy integration

1. Create Authority clients for each backend (`observability-prometheus`, `observability-loki`, `observability-tempo`).
   ```bash
   stella authority client create observability-prometheus \
     --scopes obs:read \
     --audience observability --description "Prometheus collector scrape"
   stella authority client create observability-loki \
     --scopes obs:logs timeline:read \
     --audience observability --description "Loki ingestion"
   stella authority client create observability-tempo \
     --scopes obs:traces \
     --audience observability --description "Tempo ingestion"
   ```
2. Mint tokens/credentials and store them in the secrets above (see staging bootstrap commands). Example:
   ```bash
   stella authority token issue observability-prometheus --ttl 30d > prometheus-stage.token
   ```
3. Update ingress/gateway policies to forward `X-StellaOps-Tenant` into Loki/Tempo so tenant headers propagate end-to-end, and ensure each workload sets `tenant.id` attributes (see `docs/observability/observability.md`).
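
To make the header propagation in step 3 concrete, here is a minimal sketch of the mapping a gateway might apply before proxying to Loki or Tempo. Both backends read the tenant from the `X-Scope-OrgID` header when multi-tenancy is enabled. Whether the StellaOps gateway copies `X-StellaOps-Tenant` into that header or relies on some other mechanism is an assumption here, and `upstream_headers` is a hypothetical helper, not part of any existing codebase.

```python
STELLAOPS_TENANT_HEADER = "X-StellaOps-Tenant"  # header named in this document
UPSTREAM_TENANT_HEADER = "X-Scope-OrgID"  # tenant header read by Loki and Tempo


def upstream_headers(incoming: dict) -> dict:
    """Build the tenant headers for the proxied Loki/Tempo request.

    Raises ValueError when the tenant header is missing or blank, so an
    unscoped request is rejected instead of being forwarded without a tenant.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in incoming.items()}
    tenant = lowered.get(STELLAOPS_TENANT_HEADER.lower(), "").strip()
    if not tenant:
        raise ValueError(f"missing {STELLAOPS_TENANT_HEADER} header")
    return {
        STELLAOPS_TENANT_HEADER: tenant,
        UPSTREAM_TENANT_HEADER: tenant,  # assumed mapping, see lead-in
    }
```

Rejecting requests without a tenant keeps isolation failures visible at the gateway rather than surfacing later as missing or misplaced data in the backends.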

---

## 5. Retention & isolation

- Adjust `deploy/telemetry/storage/tenants/*.yaml` to set per-tenant retention and ingestion limits.
- Configure object storage (S3, GCS, Azure Blob) when moving beyond filesystem storage.
- For air-gapped deployments, mirror the telemetry bundle using `ops/devops/telemetry/package_offline_bundle.py` and import inside the Offline Kit staging directory.
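
When testing a per-tenant retention setting, it helps to know when a sample should disappear. The sketch below computes that from a single-unit duration string such as `744h` or `30d`. The values are hypothetical; the actual keys and values in the tenant override files are not shown in this document, and the backends may delete data somewhat later than the computed time because deletion runs periodically.

```python
import re
from datetime import datetime, timedelta

# Map duration suffixes to timedelta keyword arguments.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_retention(value: str) -> timedelta:
    """Parse a single-unit duration such as '744h' or '30d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhdw])", value.strip())
    if match is None:
        raise ValueError(f"unsupported retention value: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def expires_at(ingested: datetime, retention: str) -> datetime:
    """Return the earliest time at which data ingested at `ingested` may expire."""
    return ingested + parse_retention(retention)
```

For a retention test in staging, a short value (minutes or hours) on a dedicated test tenant keeps the wait manageable.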

---

## 6. Operational checklist

- [ ] Certificates rotated and secrets updated.
- [ ] Prometheus scrape succeeds (`curl -sk --cert client.crt --key client.key https://collector:9464/metrics`).
- [ ] Tempo and Loki report tenant activity (`/api/status`).
- [ ] Retention policy tested by uploading sample data and verifying expiry.
- [ ] Alerts wired into SLO evaluator (DEVOPS-OBS-51-001).
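
The scrape check can also be scripted. The following sketch fetches the collector's metrics endpoint with a client certificate and lists the metric names it returns. The file names and URL mirror the `curl` example in the checklist; the `ca.crt` path is an assumption, and unlike `curl -k` this version verifies the server certificate against that CA.

```python
import ssl
import urllib.request


def fetch_metrics(url: str, cert: str, key: str, ca: str) -> str:
    """GET a metrics endpoint over mTLS and return the response body as text."""
    context = ssl.create_default_context(cafile=ca)  # verify server against our CA
    context.load_cert_chain(certfile=cert, keyfile=key)  # present the client cert
    with urllib.request.urlopen(url, context=context, timeout=10) as response:
        return response.read().decode("utf-8")


def sample_names(exposition: str) -> set:
    """Return the metric names found in Prometheus text exposition output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names
```

A check could then call `sample_names(fetch_metrics("https://collector:9464/metrics", "client.crt", "client.key", "ca.crt"))` and fail if the result is empty.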

---

## 7. References

- `deploy/telemetry/storage/README.md`
- `deploy/compose/docker-compose.telemetry-storage.yaml`
- `docs/ops/telemetry-collector.md`
- `docs/observability/observability.md`