feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules
- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
								docs/modules/telemetry/operations/collector.md
									
									
									
									
									
							| @@ -0,0 +1,113 @@ | ||||
| # Telemetry Collector Deployment Guide | ||||
|  | ||||
| > **Scope:** DevOps Guild, Observability Guild, and operators enabling the StellaOps telemetry pipeline (DEVOPS-OBS-50-001 / DEVOPS-OBS-50-003). | ||||
|  | ||||
| This guide describes how to deploy the default OpenTelemetry Collector packaged with Stella Ops, validate its ingest endpoints, and prepare an offline-ready bundle for air-gapped environments. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 1. Overview | ||||
|  | ||||
| The collector terminates OTLP traffic from Stella Ops services and exports metrics, traces, and logs. | ||||
|  | ||||
| | Endpoint | Purpose | TLS | Authentication | | ||||
| | -------- | ------- | --- | -------------- | | ||||
| | `:4317`  | OTLP gRPC ingest | mTLS | Client certificate issued by collector CA | | ||||
| | `:4318`  | OTLP HTTP ingest | mTLS | Client certificate issued by collector CA | | ||||
| | `:9464`  | Prometheus scrape | mTLS | Same client certificate | | ||||
| | `:13133` | Health check | mTLS | Same client certificate | | ||||
| | `:1777`  | pprof diagnostics | mTLS | Same client certificate | | ||||
|  | ||||
| The default configuration lives at `deploy/telemetry/otel-collector-config.yaml` and mirrors the Helm values in the `stellaops` chart. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 2. Local validation (Compose) | ||||
|  | ||||
| ```bash | ||||
| # 1. Generate dev certificates (CA + collector + client) | ||||
| ./ops/devops/telemetry/generate_dev_tls.sh | ||||
|  | ||||
| # 2. Start the collector overlay | ||||
| cd deploy/compose | ||||
| docker compose -f docker-compose.telemetry.yaml up -d | ||||
|  | ||||
| # 3. Start the storage overlay (Prometheus, Tempo, Loki) | ||||
| docker compose -f docker-compose.telemetry-storage.yaml up -d | ||||
|  | ||||
| # 4. Run the smoke test (OTLP HTTP) | ||||
| python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost | ||||
| ``` | ||||
|  | ||||
| The smoke test posts sample traces, metrics, and logs and verifies that the collector increments the `otelcol_receiver_accepted_*` counters exposed via the Prometheus exporter. The storage overlay gives you a local Prometheus/Tempo/Loki stack to confirm end-to-end wiring. Local services can reuse the same client certificate to weave their traces together. See [`Telemetry Storage Deployment`](storage.md) for the storage configuration guidelines used in staging/production. | ||||
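
The counter check described above can be sketched as follows. This is an illustrative helper, not the code in `smoke_otel_collector.py`: the function name `accepted_totals` is hypothetical, and it assumes the scrape endpoint returns the standard Prometheus text exposition format. Matching on the prefix only means a name with a trailing suffix is still picked up.

```python
PREFIX = "otelcol_receiver_accepted_"


def accepted_totals(metrics_text: str) -> dict[str, float]:
    """Sum every otelcol_receiver_accepted_* sample, grouped by metric name.

    Hypothetical helper: parses Prometheus text exposition output such as
    the body returned by the collector's scrape endpoint.
    """
    totals: dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Labelled sample: name{label="value",...} value [timestamp]
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1 :].split()
        else:
            # Unlabelled sample: name value [timestamp]
            name, *fields = line.split()
        if not name.startswith(PREFIX) or not fields:
            continue
        totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals
```

Calling it once before and once after the smoke test and comparing the two dictionaries shows which signal types (spans, metric points, log records) the collector actually accepted.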
|  | ||||
| --- | ||||
|  | ||||
| ## 3. Kubernetes deployment | ||||
|  | ||||
| Enable the collector in Helm by setting the following values (example shown for the dev profile): | ||||
|  | ||||
| ```yaml | ||||
| telemetry: | ||||
|   collector: | ||||
|     enabled: true | ||||
|     defaultTenant: <tenant> | ||||
|     tls: | ||||
|       secretName: stellaops-otel-tls-<env> | ||||
| ``` | ||||
|  | ||||
| Provide a Kubernetes secret named `stellaops-otel-tls-<env>` (for staging: `stellaops-otel-tls-stage`) with the keys `tls.crt`, `tls.key`, and `ca.crt`. The secret must contain the collector certificate, private key, and issuing CA respectively. Example: | ||||
|  | ||||
| ```bash | ||||
| kubectl create secret generic stellaops-otel-tls-stage \ | ||||
|   --from-file=tls.crt=collector.crt \ | ||||
|   --from-file=tls.key=collector.key \ | ||||
|   --from-file=ca.crt=ca.crt | ||||
| ``` | ||||
|  | ||||
| Helm renders the collector deployment, service, and config map automatically: | ||||
|  | ||||
| ```bash | ||||
| helm upgrade --install stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-dev.yaml | ||||
| ``` | ||||
|  | ||||
| Update client workloads to trust `ca.crt` and present client certificates that chain back to the same CA. | ||||
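
For a Python client, trusting `ca.crt` and presenting a client certificate could look like the sketch below. The function name and the file paths passed to it are placeholders, and services written in other languages will use their own TLS configuration; only the standard-library `ssl` calls are taken as given.

```python
import ssl


def client_tls_context(ca_file: str, cert_file: str, key_file: str) -> ssl.SSLContext:
    """Build an mTLS client context for talking to the collector.

    Hypothetical helper: the three paths are whatever your workload mounts
    from its TLS secret.
    """
    # Passing cafile means only the collector CA is trusted; the system
    # trust store is not loaded. Hostname checking stays enabled.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # Present the client certificate that chains back to the same CA.
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context
```

The resulting context can be passed to `urllib.request.urlopen(request, context=...)` when posting to the OTLP HTTP endpoint on `:4318`. Because hostname verification remains on, the collector certificate must list the host name the client connects to.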
|  | ||||
| --- | ||||
|  | ||||
| ## 4. Offline packaging (DEVOPS-OBS-50-003) | ||||
|  | ||||
| Use the packaging helper to produce a tarball that can be mirrored inside the Offline Kit or air-gapped sites: | ||||
|  | ||||
| ```bash | ||||
| python ops/devops/telemetry/package_offline_bundle.py --output out/telemetry/telemetry-bundle.tar.gz | ||||
| ``` | ||||
|  | ||||
| The script gathers: | ||||
|  | ||||
| - `deploy/telemetry/README.md` | ||||
| - Collector configuration (`deploy/telemetry/otel-collector-config.yaml` and Helm copy) | ||||
| - Helm template/values for the collector | ||||
| - Compose overlay (`deploy/compose/docker-compose.telemetry.yaml`) | ||||
|  | ||||
| The tarball ships with a `.sha256` checksum. To attach a Cosign signature, add `--sign` and provide `COSIGN_KEY_REF`/`COSIGN_IDENTITY_TOKEN` env vars (or use the `--cosign-key` flag). | ||||
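
Before importing the bundle on the air-gapped side, the checksum can be re-verified. The sketch below assumes the `.sha256` file sits next to the tarball and follows the common `sha256sum` layout (hex digest first, then the file name); neither detail is confirmed by the packaging script, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle: Path, checksum_file: Path) -> bool:
    """Compare the bundle digest with the first token of the checksum file.

    Assumption: the checksum file starts with the hex digest, as
    `sha256sum` output does.
    """
    expected = checksum_file.read_text().split()[0].lower()
    return sha256_of(bundle) == expected
```

For example, `verify_bundle(Path("out/telemetry/telemetry-bundle.tar.gz"), Path("out/telemetry/telemetry-bundle.tar.gz.sha256"))` returns `True` only when the tarball is byte-identical to the one that was packaged; the `.sha256` path here is a guess at the script's naming.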
|  | ||||
| Distribute the bundle alongside certificates generated by your PKI. For air-gapped installs, regenerate certificates inside the enclave and recreate the `stellaops-otel-tls-<env>` secret. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 5. Operational checks | ||||
|  | ||||
| 1. **Health probes** – `kubectl exec` into the collector pod and run `curl -fsSk --cert client.crt --key client.key --cacert ca.crt https://127.0.0.1:13133/healthz`. | ||||
| 2. **Metrics scrape** – confirm Prometheus ingests `otelcol_receiver_accepted_*` counters. | ||||
| 3. **Trace correlation** – ensure services propagate `trace_id` and `tenant.id` attributes; refer to `docs/observability/observability.md` for expected spans. | ||||
| 4. **Certificate rotation** – when rotating the CA, update the secret and restart the collector. For a staged rollout, distribute the new client certificates before enabling `require_client_certificate`. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 6. Related references | ||||
|  | ||||
| - `deploy/telemetry/README.md` – source configuration and local workflow. | ||||
| - `ops/devops/telemetry/smoke_otel_collector.py` – OTLP smoke test. | ||||
| - `docs/observability/observability.md` – metrics/traces/logs taxonomy. | ||||
| - `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md` – release checklist for telemetry assets. | ||||
							
								
								
									
								docs/modules/telemetry/operations/storage.md
									
									
									
									
									
							| @@ -0,0 +1,173 @@ | ||||
| # Telemetry Storage Deployment (DEVOPS-OBS-50-002) | ||||
|  | ||||
| > **Audience:** DevOps Guild, Observability Guild | ||||
| > | ||||
| > **Scope:** Prometheus (metrics), Tempo (traces), Loki (logs) storage backends with tenant isolation, TLS, retention policies, and Authority integration. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 1. Components & Ports | ||||
|  | ||||
| | Service   | Port | Purpose | TLS | | ||||
| |-----------|------|---------|-----| | ||||
| | Prometheus | 9090 | Metrics API / alerting | Client auth (mTLS) to scrape collector | | ||||
| | Tempo      | 3200 | Trace ingest + API | mTLS (client cert required) | | ||||
| | Loki       | 3100 | Log ingest + API | mTLS (client cert required) | | ||||
|  | ||||
| The collector forwards OTLP traffic to Tempo (traces), Prometheus scrapes the collector’s `/metrics` endpoint, and Loki is used for log search. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 2. Local validation (Compose) | ||||
|  | ||||
| ```bash | ||||
| ./ops/devops/telemetry/generate_dev_tls.sh | ||||
| cd deploy/compose | ||||
| # Start collector + storage stack | ||||
| docker compose -f docker-compose.telemetry.yaml up -d | ||||
| docker compose -f docker-compose.telemetry-storage.yaml up -d | ||||
| python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost | ||||
| ``` | ||||
|  | ||||
| Configuration files live in `deploy/telemetry/storage/`. Adjust the overrides before shipping to staging/production. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 3. Kubernetes blueprint | ||||
|  | ||||
| Deploy Prometheus, Tempo, and Loki to the `observability` namespace. The Helm values snippet below illustrates the key settings (the charts are not yet versioned; define them in the observability repo): | ||||
|  | ||||
| ```yaml | ||||
| prometheus: | ||||
|   server: | ||||
|     extraFlags: | ||||
|       - web.enable-lifecycle | ||||
|     persistentVolume: | ||||
|       enabled: true | ||||
|       size: 200Gi | ||||
|   additionalScrapeConfigsSecret: stellaops-prometheus-scrape | ||||
|   extraSecretMounts: | ||||
|     - name: otel-mtls | ||||
|       secretName: stellaops-otel-tls-stage | ||||
|       mountPath: /etc/telemetry/tls | ||||
|       readOnly: true | ||||
|     - name: otel-token | ||||
|       secretName: stellaops-prometheus-token | ||||
|       mountPath: /etc/telemetry/auth | ||||
|       readOnly: true | ||||
|  | ||||
| loki: | ||||
|   auth_enabled: true | ||||
|   singleBinary: | ||||
|     replicas: 2 | ||||
|   storage: | ||||
|     type: filesystem | ||||
|   existingSecretForTls: stellaops-otel-tls-stage | ||||
|   runtimeConfig: | ||||
|     configMap: | ||||
|       name: stellaops-loki-tenant-overrides | ||||
|  | ||||
| tempo: | ||||
|   server: | ||||
|     http_listen_port: 3200 | ||||
|   storage: | ||||
|     trace: | ||||
|       backend: s3 | ||||
|       s3: | ||||
|         endpoint: tempo-minio.observability.svc:9000 | ||||
|         bucket: tempo-traces | ||||
|   multitenancyEnabled: true | ||||
|   extraVolumeMounts: | ||||
|     - name: otel-mtls | ||||
|       mountPath: /etc/telemetry/tls | ||||
|       readOnly: true | ||||
|     - name: tempo-tenant-overrides | ||||
|       mountPath: /etc/telemetry/tenants | ||||
|       readOnly: true | ||||
| ``` | ||||
|  | ||||
| ### Staging bootstrap commands | ||||
|  | ||||
| ```bash | ||||
| kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f - | ||||
|  | ||||
| # TLS material (generated via ops/devops/telemetry/generate_dev_tls.sh or from PKI) | ||||
| kubectl -n observability create secret generic stellaops-otel-tls-stage \ | ||||
|   --from-file=tls.crt=collector-stage.crt \ | ||||
|   --from-file=tls.key=collector-stage.key \ | ||||
|   --from-file=ca.crt=collector-ca.crt | ||||
|  | ||||
| # Prometheus bearer token issued by Authority (scope obs:read) | ||||
| kubectl -n observability create secret generic stellaops-prometheus-token \ | ||||
|   --from-file=token=prometheus-stage.token | ||||
|  | ||||
| # Tenant overrides | ||||
| kubectl -n observability create configmap stellaops-loki-tenant-overrides \ | ||||
|   --from-file=overrides.yaml=deploy/telemetry/storage/tenants/loki-overrides.yaml | ||||
|  | ||||
| kubectl -n observability create configmap tempo-tenant-overrides \ | ||||
|   --from-file=tempo-overrides.yaml=deploy/telemetry/storage/tenants/tempo-overrides.yaml | ||||
|  | ||||
| # Additional scrape config referencing the collector service | ||||
| kubectl -n observability create secret generic stellaops-prometheus-scrape \ | ||||
|   --from-file=prometheus-additional.yaml=deploy/telemetry/storage/prometheus.yaml | ||||
| ``` | ||||
|  | ||||
| Provision the following secrets/configs (names can be overridden via Helm values): | ||||
|  | ||||
| | Name | Type | Notes | | ||||
| |------|------|-------| | ||||
| | `stellaops-otel-tls-stage` | Secret | Shared CA + server cert/key for collector/storage mTLS. | ||||
| | `stellaops-prometheus-token` | Secret | Bearer token minted by Authority (`obs:read`). | ||||
| | `stellaops-loki-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/loki-overrides.yaml`. | ||||
| | `tempo-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/tempo-overrides.yaml`. | ||||
| | `stellaops-prometheus-scrape` | Secret | Additional scrape config from `deploy/telemetry/storage/prometheus.yaml`, referencing the collector service. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 4. Authority & tenancy integration | ||||
|  | ||||
| 1. Create Authority clients for each backend (`observability-prometheus`, `observability-loki`, `observability-tempo`). | ||||
|    ```bash | ||||
|    stella authority client create observability-prometheus \ | ||||
|      --scopes obs:read \ | ||||
|      --audience observability --description "Prometheus collector scrape" | ||||
|    stella authority client create observability-loki \ | ||||
|      --scopes obs:logs timeline:read \ | ||||
|      --audience observability --description "Loki ingestion" | ||||
|    stella authority client create observability-tempo \ | ||||
|      --scopes obs:traces \ | ||||
|      --audience observability --description "Tempo ingestion" | ||||
|    ``` | ||||
| 2. Mint tokens/credentials and store them in the secrets above (see staging bootstrap commands). Example: | ||||
|    ```bash | ||||
|    stella authority token issue observability-prometheus --ttl 30d > prometheus-stage.token | ||||
|    ``` | ||||
| 3. Update ingress/gateway policies to forward `X-StellaOps-Tenant` into Loki/Tempo so tenant headers propagate end-to-end, and ensure each workload sets `tenant.id` attributes (see `docs/observability/observability.md`). | ||||
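
Step 3 can be pictured as a small header-mapping rule at the gateway. The sketch below rests on two assumptions that this guide does not state: that the gateway copies the `X-StellaOps-Tenant` value into `X-Scope-OrgID` (the header Loki and Tempo read when multi-tenancy is enabled), and that tenant identifiers are limited to letters, digits, `_`, and `-`. The function name is hypothetical.

```python
import re
from typing import Mapping

TENANT_HEADER = "X-StellaOps-Tenant"
ORG_ID_HEADER = "X-Scope-OrgID"  # tenant header used by Loki and Tempo

# Assumed tenant id format; adjust to whatever Authority actually issues.
_TENANT_RE = re.compile(r"[A-Za-z0-9_-]{1,150}")


def forward_tenant(incoming: Mapping[str, str]) -> dict[str, str]:
    """Return the tenant headers to attach to the upstream Loki/Tempo request.

    Raises ValueError when the tenant header is missing or malformed, so a
    request never reaches the backends without a tenant.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in incoming.items()}
    tenant = lowered.get(TENANT_HEADER.lower(), "").strip()
    if not _TENANT_RE.fullmatch(tenant):
        raise ValueError(f"missing or invalid {TENANT_HEADER} header")
    return {TENANT_HEADER: tenant, ORG_ID_HEADER: tenant}
```

Rejecting requests without a valid tenant at the gateway keeps unlabelled data out of the shared backends, which is the point of enabling `auth_enabled` and `multitenancyEnabled` in the values above.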
|  | ||||
| --- | ||||
|  | ||||
| ## 5. Retention & isolation | ||||
|  | ||||
| - Adjust `deploy/telemetry/storage/tenants/*.yaml` to set per-tenant retention and ingestion limits. | ||||
| - Configure object storage (S3, GCS, Azure Blob) when moving beyond filesystem storage. | ||||
| - For air-gapped deployments, mirror the telemetry bundle using `ops/devops/telemetry/package_offline_bundle.py` and import it into the Offline Kit staging directory. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 6. Operational checklist | ||||
|  | ||||
| - [ ] Certificates rotated and secrets updated. | ||||
| - [ ] Prometheus scrape succeeds (`curl -sk --cert client.crt --key client.key https://collector:9464/metrics`). | ||||
| - [ ] Tempo and Loki report tenant activity (`/api/status`). | ||||
| - [ ] Retention policy tested by uploading sample data and verifying expiry. | ||||
| - [ ] Alerts wired into SLO evaluator (DEVOPS-OBS-51-001). | ||||
| - [ ] Component rule packs imported (e.g. `docs/modules/scheduler/operations/worker-prometheus-rules.yaml`). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 7. References | ||||
|  | ||||
| - `deploy/telemetry/storage/README.md` | ||||
| - `deploy/compose/docker-compose.telemetry-storage.yaml` | ||||
| - `docs/modules/telemetry/operations/collector.md` | ||||
| - `docs/observability/observability.md` | ||||