Align AOC tasks for Excititor and Concelier
@@ -1,113 +1,113 @@
# Telemetry Collector Deployment Guide

> **Scope:** DevOps Guild, Observability Guild, and operators enabling the StellaOps telemetry pipeline (DEVOPS-OBS-50-001 / DEVOPS-OBS-50-003).

This guide describes how to deploy the default OpenTelemetry Collector packaged with Stella Ops, validate its ingest endpoints, and prepare an offline-ready bundle for air-gapped environments.

---

## 1. Overview

The collector terminates OTLP traffic from Stella Ops services and exports metrics, traces, and logs.

| Endpoint | Purpose | TLS | Authentication |
| -------- | ------- | --- | -------------- |
| `:4317` | OTLP gRPC ingest | mTLS | Client certificate issued by collector CA |
| `:4318` | OTLP HTTP ingest | mTLS | Client certificate issued by collector CA |
| `:9464` | Prometheus scrape | mTLS | Same client certificate |
| `:13133` | Health check | mTLS | Same client certificate |
| `:1777` | pprof diagnostics | mTLS | Same client certificate |

The default configuration lives at `deploy/telemetry/otel-collector-config.yaml` and mirrors the Helm values in the `stellaops` chart.

---

## 2. Local validation (Compose)

```bash
# 1. Generate dev certificates (CA + collector + client)
./ops/devops/telemetry/generate_dev_tls.sh

# 2. Start the collector overlay
cd deploy/compose
docker compose -f docker-compose.telemetry.yaml up -d

# 3. Start the storage overlay (Prometheus, Tempo, Loki)
docker compose -f docker-compose.telemetry-storage.yaml up -d

# 4. Run the smoke test (OTLP HTTP)
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```

The smoke test posts sample traces, metrics, and logs and verifies that the collector increments the `otelcol_receiver_accepted_*` counters exposed via the Prometheus exporter. The storage overlay gives you a local Prometheus/Tempo/Loki stack to confirm end-to-end wiring. The same client certificate can be used by local services to weave traces together. See [`Telemetry Storage Deployment`](storage.md) for the storage configuration guidelines used in staging/production.
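
As a manual complement to the smoke test, the counters can also be read straight from the collector's Prometheus endpoint. The following is a minimal sketch, not part of the shipped tooling: the helper names, the `CERT_DIR` variable, and the certificate file names are assumptions (substitute whatever `generate_dev_tls.sh` actually writes), and `/metrics` is assumed to be the scrape path on port 9464.

```bash
# Filter Prometheus exposition text (read from stdin) down to the
# otelcol_receiver_accepted_* samples, skipping "# HELP" / "# TYPE" lines.
accepted_counters() {
  grep -E '^otelcol_receiver_accepted_[a-z_]+(\{| )'
}

# Scrape the local collector over mTLS and print only the accepted counters.
# CERT_DIR and the file names below are hypothetical.
# Usage: CERT_DIR=path/to/dev-certs scrape_accepted_counters
scrape_accepted_counters() {
  curl -fsS \
    --cert "${CERT_DIR:-./certs}/client.crt" \
    --key "${CERT_DIR:-./certs}/client.key" \
    --cacert "${CERT_DIR:-./certs}/ca.crt" \
    "https://localhost:9464/metrics" | accepted_counters
}
```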

---

## 3. Kubernetes deployment

Enable the collector in Helm by setting the following values (example shown for the dev profile):

```yaml
telemetry:
  collector:
    enabled: true
    defaultTenant: <tenant>
    tls:
      secretName: stellaops-otel-tls-<env>
```

Provide a Kubernetes secret named `stellaops-otel-tls-<env>` (for staging: `stellaops-otel-tls-stage`) with the keys `tls.crt`, `tls.key`, and `ca.crt`. The secret must contain the collector certificate, private key, and issuing CA respectively. Example:

```bash
kubectl create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector.crt \
  --from-file=tls.key=collector.key \
  --from-file=ca.crt=ca.crt
```

Helm renders the collector deployment, service, and config map automatically:

```bash
helm upgrade --install stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-dev.yaml
```

Update client workloads to trust `ca.crt` and present client certificates that chain back to the same CA.
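
For workloads instrumented with an OpenTelemetry SDK, one way to wire this up is through the standard OTLP exporter environment variables. This is a sketch under stated assumptions: the mount path `/etc/otel/tls`, the service name `stellaops-otel-collector`, and the client certificate file names are hypothetical, and SDK support for the client-certificate variables varies by language.

```bash
# Hypothetical mount point for the client TLS material inside the workload.
TLS_DIR="${TLS_DIR:-/etc/otel/tls}"
# Hypothetical in-cluster address of the collector (4317 is the OTLP gRPC port).
COLLECTOR_ADDR="${COLLECTOR_ADDR:-stellaops-otel-collector:4317}"

export OTEL_EXPORTER_OTLP_ENDPOINT="https://${COLLECTOR_ADDR}"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
# CA bundle used to verify the collector's server certificate.
export OTEL_EXPORTER_OTLP_CERTIFICATE="${TLS_DIR}/ca.crt"
# Client certificate and key presented to the collector for mTLS.
export OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE="${TLS_DIR}/client.crt"
export OTEL_EXPORTER_OTLP_CLIENT_KEY="${TLS_DIR}/client.key"
```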

---

## 4. Offline packaging (DEVOPS-OBS-50-003)

Use the packaging helper to produce a tarball that can be mirrored inside the Offline Kit or air-gapped sites:

```bash
python ops/devops/telemetry/package_offline_bundle.py --output out/telemetry/telemetry-bundle.tar.gz
```

The script gathers:

- `deploy/telemetry/README.md`
- Collector configuration (`deploy/telemetry/otel-collector-config.yaml` and Helm copy)
- Helm template/values for the collector
- Compose overlay (`deploy/compose/docker-compose.telemetry.yaml`)

The tarball ships with a `.sha256` checksum. To attach a Cosign signature, add `--sign` and provide the `COSIGN_KEY_REF`/`COSIGN_IDENTITY_TOKEN` environment variables (or use the `--cosign-key` flag).

Distribute the bundle alongside certificates generated by your PKI. For air-gapped installs, regenerate certificates inside the enclave and recreate the `stellaops-otel-tls-<env>` secret.
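
On the receiving side, the checksum can be verified before the bundle is unpacked. The helper below is an illustrative sketch: the name `verify_bundle` is hypothetical, and it assumes the checksum file sits next to the tarball as `<bundle>.sha256` with the hex digest as the first field on its first line, which may not match the layout the packaging script actually produces.

```bash
# Compare a bundle's SHA-256 digest with the first field of its checksum file.
# Returns 0 on a match, non-zero on a mismatch or when either file is missing.
# Usage: verify_bundle out/telemetry/telemetry-bundle.tar.gz && echo "checksum OK"
verify_bundle() {
  bundle="$1"
  checksum_file="${2:-$1.sha256}"   # assumed naming convention
  [ -f "$bundle" ] && [ -f "$checksum_file" ] || return 1
  expected=$(awk 'NR == 1 { print $1 }' "$checksum_file")
  actual=$(sha256sum "$bundle" | awk '{ print $1 }')
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```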

---

## 5. Operational checks

1. **Health probes** – `kubectl exec` into the collector pod and run `curl -fsSk --cert client.crt --key client.key --cacert ca.crt https://127.0.0.1:13133/healthz`.
2. **Metrics scrape** – confirm Prometheus ingests `otelcol_receiver_accepted_*` counters.
3. **Trace correlation** – ensure services propagate `trace_id` and `tenant.id` attributes; refer to `docs/observability/observability.md` for expected spans.
4. **Certificate rotation** – when rotating the CA, update the secret and restart the collector; in a staged rollout, distribute the new client certificates before enabling `require_client_certificate`.
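
To decide when a rotation is due, certificate expiry can be checked ahead of time with `openssl x509 -checkend`. This is a sketch rather than shipped tooling: the function names and the 30-day default window are assumptions, and the certificate paths are whatever files your PKI issued.

```bash
# Return 0 if the certificate stays valid for at least N more days.
# Returns non-zero if it expires sooner or cannot be read.
# `openssl x509 -checkend` takes its window in seconds.
cert_valid_for_days() {
  cert="$1"
  days="${2:-30}"
  openssl x509 -checkend "$((days * 86400))" -noout -in "$cert" >/dev/null
}

# Print a line for every certificate argument that is inside the rotation window.
# Usage: report_expiring collector.crt client.crt ca.crt
report_expiring() {
  for c in "$@"; do
    cert_valid_for_days "$c" "${ROTATE_WITHIN_DAYS:-30}" \
      || echo "rotate soon: $c"
  done
}
```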

---

## 6. Related references

- `deploy/telemetry/README.md` – source configuration and local workflow.
- `ops/devops/telemetry/smoke_otel_collector.py` – OTLP smoke test.
- `docs/observability/observability.md` – metrics/traces/logs taxonomy.
- `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md` – release checklist for telemetry assets.

@@ -1,25 +1,25 @@
# Telemetry Storage Deployment (DEVOPS-OBS-50-002)

> **Audience:** DevOps Guild, Observability Guild
>
> **Scope:** Prometheus (metrics), Tempo (traces), Loki (logs) storage backends with tenant isolation, TLS, retention policies, and Authority integration.

---

## 1. Components & Ports

| Service | Port | Purpose | TLS |
|-----------|------|---------|-----|
| Prometheus | 9090 | Metrics API / alerting | Client auth (mTLS) to scrape collector |
| Tempo | 3200 | Trace ingest + API | mTLS (client cert required) |
| Loki | 3100 | Log ingest + API | mTLS (client cert required) |

The collector forwards OTLP traffic to Tempo (traces), Prometheus scrapes the collector’s `/metrics` endpoint, and Loki is used for log search.

---

## 2. Local validation (Compose)

```bash
./ops/devops/telemetry/generate_dev_tls.sh
cd deploy/compose
@@ -31,145 +31,145 @@ python ../../ops/devops/telemetry/validate_storage_stack.py
```

Configuration files live in `deploy/telemetry/storage/`. Adjust the overrides before shipping to staging/production.

---

## 3. Kubernetes blueprint

Deploy Prometheus, Tempo, and Loki to the `observability` namespace. The Helm values snippet below illustrates the key settings (the charts are not yet versioned; define them in the observability repo):

```yaml
prometheus:
  server:
    extraFlags:
      - web.enable-lifecycle
    persistentVolume:
      enabled: true
      size: 200Gi
  additionalScrapeConfigsSecret: stellaops-prometheus-scrape
  extraSecretMounts:
    - name: otel-mtls
      secretName: stellaops-otel-tls-stage
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: otel-token
      secretName: stellaops-prometheus-token
      mountPath: /etc/telemetry/auth
      readOnly: true

loki:
  auth_enabled: true
  singleBinary:
    replicas: 2
  storage:
    type: filesystem
  existingSecretForTls: stellaops-otel-tls-stage
  runtimeConfig:
    configMap:
      name: stellaops-loki-tenant-overrides

tempo:
  server:
    http_listen_port: 3200
  storage:
    trace:
      backend: s3
      s3:
        endpoint: tempo-minio.observability.svc:9000
        bucket: tempo-traces
  multitenancyEnabled: true
  extraVolumeMounts:
    - name: otel-mtls
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: tempo-tenant-overrides
      mountPath: /etc/telemetry/tenants
      readOnly: true
```

### Staging bootstrap commands

```bash
kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f -

# TLS material (generated via ops/devops/telemetry/generate_dev_tls.sh or from PKI)
kubectl -n observability create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector-stage.crt \
  --from-file=tls.key=collector-stage.key \
  --from-file=ca.crt=collector-ca.crt

# Prometheus bearer token issued by Authority (scope obs:read)
kubectl -n observability create secret generic stellaops-prometheus-token \
  --from-file=token=prometheus-stage.token

# Tenant overrides
kubectl -n observability create configmap stellaops-loki-tenant-overrides \
  --from-file=overrides.yaml=deploy/telemetry/storage/tenants/loki-overrides.yaml

kubectl -n observability create configmap tempo-tenant-overrides \
  --from-file=tempo-overrides.yaml=deploy/telemetry/storage/tenants/tempo-overrides.yaml

# Additional scrape config referencing the collector service
kubectl -n observability create secret generic stellaops-prometheus-scrape \
  --from-file=prometheus-additional.yaml=deploy/telemetry/storage/prometheus.yaml
```
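
Before running the bootstrap commands, it can help to confirm that every input file they reference is present, since `kubectl create ... --from-file` fails when a path is missing. A minimal pre-flight sketch follows; the helper names `missing_files` and `preflight_staging` are hypothetical, and the file list is copied from the commands above on the assumption that they run from the repository root with the certificate and token files in the working directory.

```bash
# Print every argument that is not an existing regular file.
# Returns 1 if at least one path is missing, 0 otherwise.
missing_files() {
  rc=0
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f"
      rc=1
    fi
  done
  return "$rc"
}

# Check the inputs referenced by the staging bootstrap commands.
# Usage: preflight_staging && echo "inputs present"
preflight_staging() {
  missing_files \
    collector-stage.crt collector-stage.key collector-ca.crt \
    prometheus-stage.token \
    deploy/telemetry/storage/tenants/loki-overrides.yaml \
    deploy/telemetry/storage/tenants/tempo-overrides.yaml \
    deploy/telemetry/storage/prometheus.yaml
}
```
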

Provision the following secrets/configs (names can be overridden via Helm values):

| Name | Type | Notes |
|------|------|-------|
| `stellaops-otel-tls-stage` | Secret | Shared CA + server cert/key for collector/storage mTLS. |
| `stellaops-prometheus-token` | Secret | Bearer token minted by Authority (`obs:read`). |
| `stellaops-loki-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/loki-overrides.yaml`. |
| `tempo-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/tempo-overrides.yaml`. |

---

## 4. Authority & tenancy integration

1. Create Authority clients for each backend (`observability-prometheus`, `observability-loki`, `observability-tempo`).
   ```bash
   stella authority client create observability-prometheus \
     --scopes obs:read \
     --audience observability --description "Prometheus collector scrape"
   stella authority client create observability-loki \
     --scopes obs:logs timeline:read \
     --audience observability --description "Loki ingestion"
   stella authority client create observability-tempo \
     --scopes obs:traces \
     --audience observability --description "Tempo ingestion"
   ```
2. Mint tokens/credentials and store them in the secrets above (see staging bootstrap commands). Example:
   ```bash
   stella authority token issue observability-prometheus --ttl 30d > prometheus-stage.token
   ```
3. Update ingress/gateway policies to forward `X-StellaOps-Tenant` into Loki/Tempo so tenant headers propagate end-to-end, and ensure each workload sets `tenant.id` attributes (see `docs/observability/observability.md`).

---

## 5. Retention & isolation

- Adjust `deploy/telemetry/storage/tenants/*.yaml` to set per-tenant retention and ingestion limits.
- Configure object storage (S3, GCS, Azure Blob) when moving beyond filesystem storage.
- For air-gapped deployments, mirror the telemetry bundle using `ops/devops/telemetry/package_offline_bundle.py` and import it inside the Offline Kit staging directory.

---

## 6. Operational checklist

- [ ] Certificates rotated and secrets updated.
- [ ] Prometheus scrape succeeds (`curl -sk --cert client.crt --key client.key https://collector:9464`).
- [ ] Tempo and Loki report tenant activity (`/api/status`).
- [ ] Retention policy tested by uploading sample data and verifying expiry.
- [ ] `python ops/devops/telemetry/validate_storage_stack.py` passes before committing updated configs.
- [ ] Alerts wired into SLO evaluator (DEVOPS-OBS-51-001).
- [ ] Component rule packs imported (e.g. `docs/modules/scheduler/operations/worker-prometheus-rules.yaml`).

---

## 7. References

- `deploy/telemetry/storage/README.md`
- `deploy/compose/docker-compose.telemetry-storage.yaml`
- `docs/modules/telemetry/operations/collector.md`
- `docs/observability/observability.md`