# Telemetry Collector Deployment Guide
> **Scope:** DevOps Guild, Observability Guild, and operators enabling the StellaOps telemetry pipeline (DEVOPS-OBS-50-001 / DEVOPS-OBS-50-003).

This guide describes how to deploy the default OpenTelemetry Collector packaged with StellaOps, validate its ingest endpoints, and prepare an offline-ready bundle for air-gapped environments.
---
## 1. Overview
The collector terminates OTLP traffic from StellaOps services and exports metrics, traces, and logs.
| Endpoint | Purpose | TLS | Authentication |
| -------- | ------- | --- | -------------- |
| `:4317` | OTLP gRPC ingest | mTLS | Client certificate issued by collector CA |
| `:4318` | OTLP HTTP ingest | mTLS | Client certificate issued by collector CA |
| `:9464` | Prometheus scrape | mTLS | Same client certificate |
| `:13133` | Health check | mTLS | Same client certificate |
| `:1777` | pprof diagnostics | mTLS | Same client certificate |
The default configuration lives at `deploy/telemetry/otel-collector-config.yaml` and mirrors the Helm values in the `stellaops` chart.
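As a quick reachability check for the listeners in the table above, the following is a minimal sketch rather than part of the StellaOps tooling. It only tests whether a TCP connection can be opened; it does not perform the mTLS handshake. The names `COLLECTOR_PORTS` and `probe_ports` are hypothetical.

```python
import socket

# Ports copied from the endpoint table; the labels are for display only.
COLLECTOR_PORTS = {
    4317: "OTLP gRPC ingest",
    4318: "OTLP HTTP ingest",
    9464: "Prometheus scrape",
    13133: "Health check",
    1777: "pprof diagnostics",
}


def probe_ports(host, ports=COLLECTOR_PORTS, timeout=1.0):
    """Return {port: bool}, where True means a TCP connect to host:port succeeded.

    This checks reachability only. It does not present a client certificate,
    so a True result says nothing about whether mTLS is configured correctly.
    """
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and unreachable hosts.
            results[port] = False
    return results
```

Calling `probe_ports("localhost")` after starting the Compose overlay should report each listed port as reachable, assuming the overlay publishes all of them on the host.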
---
## 2. Local validation (Compose)
```bash
# 1. Generate dev certificates (CA + collector + client)
./ops/devops/telemetry/generate_dev_tls.sh
# 2. Start the collector overlay
cd deploy/compose
docker compose -f docker-compose.telemetry.yaml up -d
# 3. Start the storage overlay (Prometheus, Tempo, Loki)
docker compose -f docker-compose.telemetry-storage.yaml up -d
# 4. Run the smoke test (OTLP HTTP)
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```
The smoke test posts sample traces, metrics, and logs, then verifies that the collector increments the `otelcol_receiver_accepted_*` counters exposed via the Prometheus exporter. The storage overlay provides a local Prometheus/Tempo/Loki stack for confirming end-to-end wiring. Local services can present the same client certificate, so their traces pass through the same collector and can be correlated. See [`Telemetry Storage Deployment`](telemetry-storage.md) for the storage configuration guidelines used in staging/production.
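The counter check can be illustrated with a short sketch. This is not the logic of `smoke_otel_collector.py`, which is not shown here. It assumes you have two scrapes of the Prometheus endpoint as plain text and want to know which `otelcol_receiver_accepted_*` counters grew between them. The function names are hypothetical.

```python
def accepted_counters(metrics_text):
    """Sum otelcol_receiver_accepted_* samples per metric name.

    Input is Prometheus text exposition format. Samples that differ only in
    labels (for example transport="grpc" and transport="http") are added up.
    """
    totals = {}
    for raw_line in metrics_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # "<name>{labels} <value> [timestamp]"; label values may hold spaces
            name = line[: line.index("{")]
            value_part = line[line.rindex("}") + 1:]
        else:
            # "<name> <value> [timestamp]"
            name, _, value_part = line.partition(" ")
        if not name.startswith("otelcol_receiver_accepted_"):
            continue
        fields = value_part.split()
        if fields:
            totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


def counters_increased(before, after):
    """Return the sorted names of counters whose total grew between scrapes."""
    return sorted(
        name for name, value in after.items() if value > before.get(name, 0.0)
    )
```

Scrape once before sending telemetry and once after, pass both texts through `accepted_counters`, and expect `counters_increased` to list one counter per signal that was sent.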
---
## 3. Kubernetes deployment
Enable the collector in Helm by setting the following values (example shown for the dev profile):
```yaml
telemetry:
  collector:
    enabled: true
    defaultTenant: <tenant>
    tls:
      secretName: stellaops-otel-tls-<env>
```
Provide a Kubernetes secret named `stellaops-otel-tls-<env>` (for staging: `stellaops-otel-tls-stage`) with the keys `tls.crt`, `tls.key`, and `ca.crt`. The secret must contain the collector certificate, private key, and issuing CA respectively. Example:
```bash
kubectl create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector.crt \
  --from-file=tls.key=collector.key \
  --from-file=ca.crt=ca.crt
```
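If you prefer to generate the secret as a manifest (for example to commit a sealed or templated version), the sketch below builds an equivalent `Opaque` Secret and rejects input that lacks any of the three required keys. It is an illustration, not StellaOps tooling, and `build_tls_secret` is a hypothetical helper name.

```python
import base64

# Keys the collector secret is expected to carry, per the paragraph above.
REQUIRED_KEYS = ("tls.crt", "tls.key", "ca.crt")


def build_tls_secret(name, files):
    """Build a generic (Opaque) Secret manifest from a {key: bytes} mapping.

    Raises ValueError if any of tls.crt, tls.key, or ca.crt is missing.
    Kubernetes expects the values under "data" to be base64-encoded.
    """
    missing = [key for key in REQUIRED_KEYS if key not in files]
    if missing:
        raise ValueError("missing secret keys: " + ", ".join(missing))
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(files[key]).decode("ascii")
            for key in REQUIRED_KEYS
        },
    }
```

Serialising the result with `json.dumps` produces a document that `kubectl apply -f -` accepts, since kubectl reads JSON as well as YAML.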
Helm renders the collector deployment, service, and config map automatically:
```bash
helm upgrade --install stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-dev.yaml
```
Update client workloads to trust `ca.crt` and present client certificates that chain back to the same CA.
---
## 4. Offline packaging (DEVOPS-OBS-50-003)
Use the packaging helper to produce a tarball that can be mirrored inside the Offline Kit or air-gapped sites:
```bash
python ops/devops/telemetry/package_offline_bundle.py --output out/telemetry/telemetry-bundle.tar.gz
```
The script gathers:
- `deploy/telemetry/README.md`
- Collector configuration (`deploy/telemetry/otel-collector-config.yaml` and Helm copy)
- Helm template/values for the collector
- Compose overlay (`deploy/compose/docker-compose.telemetry.yaml`)
The tarball ships with a `.sha256` checksum. To attach a Cosign signature, add `--sign` and provide `COSIGN_KEY_REF`/`COSIGN_IDENTITY_TOKEN` env vars (or use the `--cosign-key` flag).
Distribute the bundle alongside certificates generated by your PKI. For air-gapped installs, regenerate certificates inside the enclave and recreate the `stellaops-otel-tls-<env>` secret.
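On the receiving side, the checksum can be verified before the bundle is unpacked. The sketch below assumes the `.sha256` file follows the `sha256sum` layout (hex digest as the first whitespace-separated token); the actual layout written by `package_offline_bundle.py` is not documented here, so confirm it first. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle_path, checksum_path):
    """Return True if the bundle digest matches the recorded checksum.

    Assumption: the checksum file's first token is the hex digest, as in
    sha256sum output ("<digest>  <filename>").
    """
    tokens = Path(checksum_path).read_text(encoding="utf-8").split()
    expected = tokens[0].lower() if tokens else ""
    return sha256_of(bundle_path) == expected
```

A mismatch should be treated as a failed transfer or a tampered bundle, and the tarball should not be imported into the enclave.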
---
## 5. Operational checks
1. **Health probes:** `kubectl exec` into the collector pod and run `curl -fsSk --cert client.crt --key client.key --cacert ca.crt https://127.0.0.1:13133/healthz`.
2. **Metrics scrape:** confirm Prometheus ingests `otelcol_receiver_accepted_*` counters.
3. **Trace correlation:** ensure services propagate `trace_id` and `tenant.id` attributes; refer to `docs/observability/observability.md` for expected spans.
4. **Certificate rotation:** when rotating the CA, update the secret and restart the collector. If the rollout is staged, distribute new client certificates before enabling `require_client_certificate`.
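The trace correlation check can be scripted against exported spans. The sketch below assumes each span has been loaded into a dictionary with a top-level `trace_id` and an `attributes` mapping. That shape is hypothetical and will differ depending on how you export spans from Tempo or the collector.

```python
def find_uncorrelated(spans):
    """Return (span label, missing fields) for spans lacking correlation data.

    Assumed span shape (hypothetical):
        {"name": str, "trace_id": str, "attributes": {"tenant.id": str, ...}}
    """
    problems = []
    for index, span in enumerate(spans):
        missing = []
        if not span.get("trace_id"):
            missing.append("trace_id")
        if not span.get("attributes", {}).get("tenant.id"):
            missing.append("tenant.id")
        if missing:
            # Fall back to a positional label when the span has no name.
            problems.append((span.get("name", f"span[{index}]"), missing))
    return problems
```

An empty result means every sampled span carried both fields; any entries point at the services that still need propagation fixes.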
---
## 6. Related references
- `deploy/telemetry/README.md`: source configuration and local workflow.
- `ops/devops/telemetry/smoke_otel_collector.py`: OTLP smoke test.
- `docs/observability/observability.md`: metrics/traces/logs taxonomy.
- `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md`: release checklist for telemetry assets.