feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules

- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
2025-10-30 00:09:39 +02:00
parent 3154c67978
commit 7b5bdcf4d3
503 changed files with 16136 additions and 54638 deletions


@@ -0,0 +1,22 @@
# Telemetry agent guide
## Mission
The Telemetry module captures deployment and operations guidance for the shared observability stack (collectors, storage, dashboards).
## Key docs
- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)
## How to get started
1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.
2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
3. Read the architecture and README for domain context before editing code or docs.
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.


@@ -0,0 +1,34 @@
# StellaOps Telemetry
The Telemetry module captures deployment and operations guidance for the shared observability stack (collectors, storage, dashboards).
## Responsibilities
- Deploy and operate OpenTelemetry collectors for StellaOps services.
- Provide storage configuration for Prometheus/Tempo/Loki stacks.
- Document smoke tests and offline bootstrapping steps.
- Align metrics and alert packs with module SLOs.
## Key components
- Collector deployment guide (./operations/collector.md).
- Storage deployment guide (./operations/storage.md).
- Smoke tooling in `ops/devops/telemetry/`.
## Integrations & dependencies
- DevOps pipelines for packaging telemetry bundles.
- Module-specific dashboards (scheduler, scanner, etc.).
- Security/Compliance for retention policies.
## Operational notes
- Smoke scripts live in `../../ops/devops/telemetry`.
- Bundle packaging instructions are kept in `ops/devops/telemetry`.
## Related resources
- ./operations/collector.md
- ./operations/storage.md
## Backlog references
- TELEMETRY-OBS-50-001 … 50-004 in ../../TASKS.md.
- Collector/storage automation tracked in ops/devops/TASKS.md.
## Epic alignment
- **Epic 15 Observability & Forensics:** deliver collector/storage deployments, forensic evidence retention, and observability bundles with deterministic configuration.


@@ -0,0 +1,9 @@
# Task board — Telemetry
> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.
| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| TELEMETRY-DOCS-0001 | DOING (2025-10-29) | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| TELEMETRY-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| TELEMETRY-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |


@@ -0,0 +1,41 @@
# Telemetry architecture
> Derived from Epic 15 Observability & Forensics; details collector topology, storage profiles, forensic pipelines, and offline packaging.
## 1) Topology
- **Collector tier.** OpenTelemetry Collector instances deployed per environment (TLS ingest, OTLP gRPC/HTTP receivers, tail-based sampling). Config packages delivered via Offline Kit.
- **Processing pipelines.** Pipelines for traces, metrics, logs with processors (batch, tail sampling, attributes redaction, resource detection). Profiles: `default`, `forensic` (high-retention), `airgap` (file-based exporters).
- **Exporters.** OTLP to Prometheus/Tempo/Loki (online) or file/OTLP-HTTP to Offline Kit staging (air-gapped). Exporters are allow-listed to satisfy Sovereign readiness.
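A minimal sketch of what a `default` profile could look like, built from stock OpenTelemetry Collector components; the endpoints, TLS paths, and exporter target are illustrative assumptions, not the shipped config:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/telemetry/tls/tls.crt     # illustrative paths
          key_file: /etc/telemetry/tls/tls.key
          client_ca_file: /etc/telemetry/tls/ca.crt # enables mTLS

processors:
  resourcedetection:
    detectors: [env, system]
  batch: {}

exporters:
  otlphttp:
    endpoint: https://tempo.observability.svc:4318  # assumed backend address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resourcedetection, batch]
      exporters: [otlphttp]
```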
## 2) Storage
- **Prometheus** for metrics with remote-write support and retention windows (default 30 days, forensic 180 days).
- **Tempo** (or Jaeger all-in-one) for traces with block storage backend (S3-compatible or filesystem) and deterministic chunk manifests.
- **Loki** for logs stored in immutable chunks; index shards hashed for reproducibility.
- **Forensic archive** — periodic export of raw OTLP records into signed bundles (`otlp/metrics.pb`, `otlp/traces.pb`, `otlp/logs.pb`, `manifest.json`).
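Purely as an illustration, the retention windows above might map onto backend settings like these; exact keys depend on the charts and versions you deploy:
```yaml
prometheus:
  server:
    retention: 30d            # forensic profile: 180d

tempo:
  compactor:
    compaction:
      block_retention: 720h   # 30 days of trace blocks

loki:
  limits_config:
    retention_period: 720h    # raised for forensic tenants
```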
## 3) Pipelines & Guardrails
- **Redaction.** Attribute processors strip PII/secrets based on policy-managed allowed keys. Redaction profiles mirrored in Offline Kit.
- **Sampling.** Tail sampling by service/error; incident mode (triggered by Orchestrator) promotes services to 100% sampling, extends retention, and toggles Notify alerts.
- **Alerting.** Prometheus rules/dashboards packaged with Export Center: service SLOs, queue depth, policy run latency, ingestion AOC violations.
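A hedged sketch of the redaction and tail-sampling stages using stock collector processors; the attribute keys and sampling percentage are assumptions, not the policy-managed values:
```yaml
processors:
  attributes/redact:
    actions:
      - key: http.request.header.authorization   # assumed redaction target
        action: delete
      - key: user.email                          # assumed PII attribute
        action: hash
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10                # incident mode promotes this to 100
```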
## 4) APIs & integration
- `GET /telemetry/config/profile/{name}` — download collector config bundle (YAML + signature).
- `POST /telemetry/incidents/mode` — toggle incident sampling + forensic bundle generation.
- `GET /telemetry/exports/forensic/{window}` — stream signed OTLP bundles for compliance.
- CLI commands: `stella telemetry deploy --profile default`, `stella telemetry capture --window 24h --out bundle.tar.gz`.
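As a usage sketch, toggling incident mode might look like the call below; the endpoint path comes from the list above, while the host name and request body shape are assumptions:
```bash
# Hypothetical payload; only the endpoint path is defined by this document.
curl -fsS --cert client.crt --key client.key --cacert ca.crt \
  -X POST "https://telemetry.stella.local/telemetry/incidents/mode" \
  -H "Content-Type: application/json" \
  -d '{"enabled": true, "services": ["scanner"], "cooldownMinutes": 60}'
```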
## 5) Offline support
- Offline Kit ships collector binaries/config, bootstrap scripts, dashboards, alert rules, and OTLP replay tooling. Bundles include `manifest.json` with digests, DSSE signatures, and instructions.
- For offline environments, exporters write to local filesystem; operators transfer bundles to analysis workstation using signed manifests.
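A minimal sketch of an `airgap` exporter section using the collector's file exporter; the paths and `format` choice are illustrative and the shipped profile may differ:
```yaml
exporters:
  file/traces:
    path: /var/lib/stellaops/telemetry/otlp/traces.pb   # illustrative staging path
    format: proto
  file/metrics:
    path: /var/lib/stellaops/telemetry/otlp/metrics.pb
    format: proto
  file/logs:
    path: /var/lib/stellaops/telemetry/otlp/logs.pb
    format: proto
```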
## 6) Observability of telemetry stack
- Meta-metrics: `collector_export_failures_total`, `telemetry_bundle_generation_seconds`, `telemetry_incident_mode{state}`.
- Health endpoints for collectors and storage clusters, plus dashboards for ingestion rate, retention, rule evaluations.
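A sketch of how the meta-metrics above could feed standard Prometheus alert rules; the thresholds and durations are assumptions:
```yaml
groups:
  - name: telemetry-meta
    rules:
      - alert: CollectorExportFailures
        expr: rate(collector_export_failures_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Collector is failing to export telemetry
      - alert: IncidentModeLingering
        expr: telemetry_incident_mode{state="active"} == 1
        for: 6h
        labels:
          severity: info
        annotations:
          summary: Incident mode has stayed active past the expected cooldown
```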
Refer to the module README and implementation plan for immediate context, and update this document once component boundaries and data flows are finalised.


@@ -0,0 +1,64 @@
# Implementation plan — Telemetry
## Delivery phases
- **Phase 1 Collector & pipeline profiles**
Publish OpenTelemetry collector configs (`default`, `forensic`, `airgap`), establish ingest gateways, TLS/mTLS, and attribute redaction policies.
- **Phase 2 Storage backends & retention**
Deploy Prometheus/Tempo/Loki (or equivalents) with retention tiers, bucket/object storage, deterministic manifest generation, and sealed-mode allowlists.
- **Phase 3 Incident mode & forensic capture**
Implement incident toggles (CLI/API), tail sampling adjustments, forensic bundle generation (OTLP archives, manifest/signature), and Notify hooks.
- **Phase 4 Observability dashboards & automation**
Deliver dashboards (service SLOs, queue depth, policy latency), alert rules, Grafana packages, and CLI automation for deployment and capture.
- **Phase 5 Offline & compliance**
Ship Offline Kit artefacts (collectors, configs, dashboards, replay tooling), signed bundles, and documentation for air-gapped review workflows.
- **Phase 6 Hardening & SOC handoff**
Complete RBAC integration, audit logging, incident response runbooks, performance tuning, and integration tests across services.
## Work breakdown
- **Collector configs**
- Maintain config templates per profile with processors (redaction, batching, resource detection) and exporters.
- CLI automation (`stella telemetry deploy`, `stella telemetry profile diff`), validation tests, and config signing.
- **Storage & retention**
- Provision Prometheus/Tempo/Loki (or vendor equivalents) with retention tiers (default, forensic, airgap).
- Ensure determinism (chunk manifests, content hashing), remote-write allowlists, sealed/offline modes.
- Implement archivers for forensic bundles (metrics/traces/logs) with cosign signatures (see the signing sketch after this list).
- **Incident mode**
- API/CLI to toggle incident sampling, retention escalation, Notify signals, and auto bundle capture.
- Hook into Orchestrator to respond to incidents and revert after cooldown.
- **Dashboards & alerts**
- Dashboard packages for core services (ingestion, policy, export, attestation).
- Alert rules for SLO burn, collector failure, exporter backlog, bundle generation errors.
- Self-observability metrics (`collector_export_failures_total`, `telemetry_incident_mode{}`).
- **Offline support**
- Offline Kit assets: collector binaries/configs, import scripts, dashboards, replay instructions, compliance checklists.
- File-based exporters and manual transfer workflows with signed manifests.
- **Docs & runbooks**
- Update observability overview, forensic capture guide, incident response checklist, sealed-mode instructions, RBAC matrix.
- SOC handoff package with control objectives and audit evidence.
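For the archiver item above, a minimal signing sketch with stock cosign; the key paths and bundle name are illustrative:
```bash
# Hash and sign a forensic bundle (key paths are illustrative).
sha256sum telemetry-forensic-bundle.tar.gz > telemetry-forensic-bundle.tar.gz.sha256
cosign sign-blob --key cosign.key telemetry-forensic-bundle.tar.gz \
  > telemetry-forensic-bundle.tar.gz.sig

# Verify before opening the bundle on an analysis workstation.
cosign verify-blob --key cosign.pub \
  --signature telemetry-forensic-bundle.tar.gz.sig telemetry-forensic-bundle.tar.gz
```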
## Acceptance criteria
- Collectors ingest metrics/logs/traces across deployments, applying redaction rules and tenant isolation; profiles validate via CI.
- Storage backends retain data per default/forensic/airgap SLAs with deterministic chunk manifests and sealed-mode compliance.
- Incident mode toggles sampling to 100%, extends retention, triggers Notify, and captures forensic bundles signed with cosign.
- Dashboards and alerts cover service SLOs, queue depth, policy latency, ingestion violations, and telemetry stack health.
- CLI commands (`stella telemetry deploy/capture/status`) automate config rollout, forensic capture, and verification.
- Offline bundles replay telemetry in sealed environments using provided scripts and manifests.
## Risks & mitigations
- **PII leakage:** strict redaction processors, policy-managed allowlists, audit tests.
- **Collector overload:** horizontal scaling, batching, circuit breakers, incident mode throttling.
- **Storage cost:** tiered retention, compression, pruning policies, offline archiving.
- **Air-gap drift:** offline kit refresh schedule, deterministic manifest verification.
- **Alert fatigue:** burn-rate alerts, deduping, SOC runbooks.
## Test strategy
- **Config lint/tests:** schema validation, unit tests for processors/exporters, golden configs (see the sketch after this list).
- **Integration:** simulate service traces/logs/metrics, verify pipelines, incident toggles, bundle generation.
- **Performance:** load tests with peak ingestion, long retention windows, failover scenarios.
- **Security:** redaction verification, RBAC/tenant scoping, sealed-mode tests, signed config verification.
- **Offline:** capture bundles, transfer, replay, compliance attestation.
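For the config lint step, a sketch assuming the stock `otelcol` binary; it fails fast on schema or unknown-component errors:
```bash
otelcol validate --config=deploy/telemetry/otel-collector-config.yaml
```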
## Definition of done
- Collector profiles, storage backends, incident mode, dashboards, CLI, and offline kit delivered with telemetry and documentation.
- Runbooks and SOC handoff packages published; compliance checklists appended.
- ./TASKS.md and ../../TASKS.md updated; imposed rule statements confirmed in documentation.


@@ -0,0 +1,113 @@
# Telemetry Collector Deployment Guide
> **Scope:** DevOps Guild, Observability Guild, and operators enabling the StellaOps telemetry pipeline (DEVOPS-OBS-50-001 / DEVOPS-OBS-50-003).
This guide describes how to deploy the default OpenTelemetry Collector packaged with StellaOps, validate its ingest endpoints, and prepare an offline-ready bundle for air-gapped environments.
---
## 1. Overview
The collector terminates OTLP traffic from StellaOps services and exports metrics, traces, and logs.
| Endpoint | Purpose | TLS | Authentication |
| -------- | ------- | --- | -------------- |
| `:4317` | OTLP gRPC ingest | mTLS | Client certificate issued by collector CA |
| `:4318` | OTLP HTTP ingest | mTLS | Client certificate issued by collector CA |
| `:9464` | Prometheus scrape | mTLS | Same client certificate |
| `:13133` | Health check | mTLS | Same client certificate |
| `:1777` | pprof diagnostics | mTLS | Same client certificate |
The default configuration lives at `deploy/telemetry/otel-collector-config.yaml` and mirrors the Helm values in the `stellaops` chart.
---
## 2. Local validation (Compose)
```bash
# 1. Generate dev certificates (CA + collector + client)
./ops/devops/telemetry/generate_dev_tls.sh
# 2. Start the collector overlay
cd deploy/compose
docker compose -f docker-compose.telemetry.yaml up -d
# 3. Start the storage overlay (Prometheus, Tempo, Loki)
docker compose -f docker-compose.telemetry-storage.yaml up -d
# 4. Run the smoke test (OTLP HTTP)
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```
The smoke test posts sample traces, metrics, and logs and verifies that the collector increments the `otelcol_receiver_accepted_*` counters exposed via the Prometheus exporter. The storage overlay gives you a local Prometheus/Tempo/Loki stack to confirm end-to-end wiring. The same client certificate can be reused by local services so their traces correlate end to end. See [`Telemetry Storage Deployment`](telemetry-storage.md) for the storage configuration guidelines used in staging/production.
---
## 3. Kubernetes deployment
Enable the collector in Helm by setting the following values (example shown for the dev profile):
```yaml
telemetry:
collector:
enabled: true
defaultTenant: <tenant>
tls:
secretName: stellaops-otel-tls-<env>
```
Provide a Kubernetes secret named `stellaops-otel-tls-<env>` (for staging: `stellaops-otel-tls-stage`) with the keys `tls.crt`, `tls.key`, and `ca.crt`. The secret must contain the collector certificate, private key, and issuing CA respectively. Example:
```bash
kubectl create secret generic stellaops-otel-tls-stage \
--from-file=tls.crt=collector.crt \
--from-file=tls.key=collector.key \
--from-file=ca.crt=ca.crt
```
Helm renders the collector deployment, service, and config map automatically:
```bash
helm upgrade --install stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-dev.yaml
```
Update client workloads to trust `ca.crt` and present client certificates that chain back to the same CA.
---
## 4. Offline packaging (DEVOPS-OBS-50-003)
Use the packaging helper to produce a tarball that can be mirrored inside the Offline Kit or air-gapped sites:
```bash
python ops/devops/telemetry/package_offline_bundle.py --output out/telemetry/telemetry-bundle.tar.gz
```
The script gathers:
- `deploy/telemetry/README.md`
- Collector configuration (`deploy/telemetry/otel-collector-config.yaml` and Helm copy)
- Helm template/values for the collector
- Compose overlay (`deploy/compose/docker-compose.telemetry.yaml`)
The tarball ships with a `.sha256` checksum. To attach a Cosign signature, add `--sign` and provide `COSIGN_KEY_REF`/`COSIGN_IDENTITY_TOKEN` env vars (or use the `--cosign-key` flag).
Distribute the bundle alongside certificates generated by your PKI. For air-gapped installs, regenerate certificates inside the enclave and recreate the `stellaops-otel-tls` secret.
---
## 5. Operational checks
1. **Health probes.** `kubectl exec` into the collector pod and run `curl -fsSk --cert client.crt --key client.key --cacert ca.crt https://127.0.0.1:13133/healthz`.
2. **Metrics scrape.** Confirm Prometheus ingests the `otelcol_receiver_accepted_*` counters.
3. **Trace correlation.** Ensure services propagate `trace_id` and `tenant.id` attributes; refer to `docs/observability/observability.md` for expected spans.
4. **Certificate rotation.** When rotating the CA, update the secret and restart the collector; roll out new client certificates before enabling `require_client_certificate` if staged (see the check below).
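For step 4, a quick chain-and-expiry check before restarting, assuming standard OpenSSL:
```bash
# Confirm the rotated certificate chains to the new CA, then note its expiry.
openssl verify -CAfile ca.crt collector.crt
openssl x509 -in collector.crt -noout -enddate
```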
---
## 6. Related references
- `deploy/telemetry/README.md`: source configuration and local workflow.
- `ops/devops/telemetry/smoke_otel_collector.py`: OTLP smoke test.
- `docs/observability/observability.md`: metrics/traces/logs taxonomy.
- `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md`: release checklist for telemetry assets.


@@ -0,0 +1,173 @@
# Telemetry Storage Deployment (DEVOPS-OBS-50-002)
> **Audience:** DevOps Guild, Observability Guild
>
> **Scope:** Prometheus (metrics), Tempo (traces), Loki (logs) storage backends with tenant isolation, TLS, retention policies, and Authority integration.
---
## 1. Components & Ports
| Service | Port | Purpose | TLS |
|-----------|------|---------|-----|
| Prometheus | 9090 | Metrics API / alerting | Client auth (mTLS) to scrape collector |
| Tempo | 3200 | Trace ingest + API | mTLS (client cert required) |
| Loki | 3100 | Log ingest + API | mTLS (client cert required) |
The collector forwards OTLP traffic to Tempo (traces), Prometheus scrapes the collector's `/metrics` endpoint, and Loki is used for log search.
---
## 2. Local validation (Compose)
```bash
./ops/devops/telemetry/generate_dev_tls.sh
cd deploy/compose
# Start collector + storage stack
docker compose -f docker-compose.telemetry.yaml up -d
docker compose -f docker-compose.telemetry-storage.yaml up -d
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```
Configuration files live in `deploy/telemetry/storage/`. Adjust the overrides before shipping to staging/production.
---
## 3. Kubernetes blueprint
Deploy Prometheus, Tempo, and Loki to the `observability` namespace. The Helm values snippet below illustrates the key settings (charts are not yet versioned; define them in the observability repo):
```yaml
prometheus:
server:
extraFlags:
- web.enable-lifecycle
persistentVolume:
enabled: true
size: 200Gi
additionalScrapeConfigsSecret: stellaops-prometheus-scrape
extraSecretMounts:
- name: otel-mtls
secretName: stellaops-otel-tls-stage
mountPath: /etc/telemetry/tls
readOnly: true
- name: otel-token
secretName: stellaops-prometheus-token
mountPath: /etc/telemetry/auth
readOnly: true
loki:
auth_enabled: true
singleBinary:
replicas: 2
storage:
type: filesystem
existingSecretForTls: stellaops-otel-tls-stage
runtimeConfig:
configMap:
name: stellaops-loki-tenant-overrides
tempo:
server:
http_listen_port: 3200
storage:
trace:
backend: s3
s3:
endpoint: tempo-minio.observability.svc:9000
bucket: tempo-traces
multitenancyEnabled: true
extraVolumeMounts:
- name: otel-mtls
mountPath: /etc/telemetry/tls
readOnly: true
- name: tempo-tenant-overrides
mountPath: /etc/telemetry/tenants
readOnly: true
```
### Staging bootstrap commands
```bash
kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f -
# TLS material (generated via ops/devops/telemetry/generate_dev_tls.sh or from PKI)
kubectl -n observability create secret generic stellaops-otel-tls-stage \
--from-file=tls.crt=collector-stage.crt \
--from-file=tls.key=collector-stage.key \
--from-file=ca.crt=collector-ca.crt
# Prometheus bearer token issued by Authority (scope obs:read)
kubectl -n observability create secret generic stellaops-prometheus-token \
--from-file=token=prometheus-stage.token
# Tenant overrides
kubectl -n observability create configmap stellaops-loki-tenant-overrides \
--from-file=overrides.yaml=deploy/telemetry/storage/tenants/loki-overrides.yaml
kubectl -n observability create configmap tempo-tenant-overrides \
--from-file=tempo-overrides.yaml=deploy/telemetry/storage/tenants/tempo-overrides.yaml
# Additional scrape config referencing the collector service
kubectl -n observability create secret generic stellaops-prometheus-scrape \
--from-file=prometheus-additional.yaml=deploy/telemetry/storage/prometheus.yaml
```
Provision the following secrets/configs (names can be overridden via Helm values):
| Name | Type | Notes |
|------|------|-------|
| `stellaops-otel-tls-stage` | Secret | Shared CA + server cert/key for collector/storage mTLS. |
| `stellaops-prometheus-token` | Secret | Bearer token minted by Authority (`obs:read`). |
| `stellaops-loki-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/loki-overrides.yaml`. |
| `tempo-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/tempo-overrides.yaml`. |
---
## 4. Authority & tenancy integration
1. Create Authority clients for each backend (`observability-prometheus`, `observability-loki`, `observability-tempo`).
```bash
stella authority client create observability-prometheus \
--scopes obs:read \
--audience observability --description "Prometheus collector scrape"
stella authority client create observability-loki \
--scopes "obs:logs timeline:read" \
--audience observability --description "Loki ingestion"
stella authority client create observability-tempo \
--scopes obs:traces \
--audience observability --description "Tempo ingestion"
```
2. Mint tokens/credentials and store them in the secrets above (see staging bootstrap commands). Example:
```bash
stella authority token issue observability-prometheus --ttl 30d > prometheus-stage.token
```
3. Update ingress/gateway policies to forward `X-StellaOps-Tenant` into Loki/Tempo so tenant headers propagate end-to-end, and ensure each workload sets `tenant.id` attributes (see `docs/observability/observability.md`).
---
## 5. Retention & isolation
- Adjust `deploy/telemetry/storage/tenants/*.yaml` to set per-tenant retention and ingestion limits (a sketch follows this list).
- Configure object storage (S3, GCS, Azure Blob) when moving beyond filesystem storage.
- For air-gapped deployments, mirror the telemetry bundle using `ops/devops/telemetry/package_offline_bundle.py` and import inside the Offline Kit staging directory.
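A sketch of what `loki-overrides.yaml` might contain, using standard Loki runtime-override keys; the tenant names and limits are assumptions:
```yaml
overrides:
  tenant-default:
    retention_period: 720h    # 30 days
    ingestion_rate_mb: 8
  tenant-forensic:
    retention_period: 4320h   # 180 days
    ingestion_rate_mb: 16
```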
---
## 6. Operational checklist
- [ ] Certificates rotated and secrets updated.
- [ ] Prometheus scrape succeeds (`curl -sk --cert client.crt --key client.key https://collector:9464`).
- [ ] Tempo and Loki report tenant activity (`/api/status`).
- [ ] Retention policy tested by uploading sample data and verifying expiry.
- [ ] Alerts wired into SLO evaluator (DEVOPS-OBS-51-001).
- [ ] Component rule packs imported (e.g. `docs/modules/scheduler/operations/worker-prometheus-rules.yaml`).
---
## 7. References
- `deploy/telemetry/storage/README.md`
- `deploy/compose/docker-compose.telemetry-storage.yaml`
- `docs/modules/telemetry/operations/collector.md`
- `docs/observability/observability.md`