feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules
- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established a similar documentation structure for the Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
22
docs/modules/telemetry/AGENTS.md
Normal file
@@ -0,0 +1,22 @@
# Telemetry agent guide

## Mission

The Telemetry module captures deployment and operations guidance for the shared observability stack (collectors, storage, dashboards).

## Key docs

- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)

## How to get started

1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.
2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
3. Read the architecture document and README for domain context before editing code or docs.
4. Coordinate cross-module changes through the main /AGENTS.md description and the sprint plan.

## Guardrails

- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.
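The determinism guardrail can be exercised with ordinary shell tools. A minimal sketch (the file names here are illustrative, not part of the module): sorted output with a fixed locale, a normalised UTC ISO-8601 timestamp, and a content digest that excludes the timestamp line so two runs over the same inputs agree.

```shell
set -eu

# Normalised UTC ISO-8601 timestamp (no local timezone leakage).
STAMP="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Sort outputs with a fixed locale so ordering is machine-independent.
printf '%s\n' "gamma" "alpha" "beta" | LC_ALL=C sort > components.txt

printf 'generated_at: %s\n' "$STAMP" > manifest.txt
cat components.txt >> manifest.txt

# Digest everything except the timestamp line: stable across runs and hosts.
grep -v '^generated_at:' manifest.txt | sha256sum | cut -d' ' -f1 > digest.txt
cat digest.txt
```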
34
docs/modules/telemetry/README.md
Normal file
@@ -0,0 +1,34 @@
# StellaOps Telemetry

The Telemetry module captures deployment and operations guidance for the shared observability stack (collectors, storage, dashboards).

## Responsibilities

- Deploy and operate OpenTelemetry collectors for StellaOps services.
- Provide storage configuration for Prometheus/Tempo/Loki stacks.
- Document smoke tests and offline bootstrapping steps.
- Align metrics and alert packs with module SLOs.

## Key components

- Collector deployment guide (./operations/collector.md).
- Storage deployment guide (./operations/storage.md).
- Smoke tooling in `ops/devops/telemetry/`.

## Integrations & dependencies

- DevOps pipelines for packaging telemetry bundles.
- Module-specific dashboards (scheduler, scanner, etc.).
- Security/Compliance for retention policies.

## Operational notes

- Smoke script references (../../ops/devops/telemetry).
- Bundle packaging instructions in ops/devops/telemetry.

## Related resources

- ./operations/collector.md
- ./operations/storage.md

## Backlog references

- TELEMETRY-OBS-50-001 … 50-004 in ../../TASKS.md.
- Collector/storage automation tracked in ops/devops/TASKS.md.

## Epic alignment

- **Epic 15 – Observability & Forensics:** deliver collector/storage deployments, forensic evidence retention, and observability bundles with deterministic configuration.
9
docs/modules/telemetry/TASKS.md
Normal file
@@ -0,0 +1,9 @@
# Task board — Telemetry

> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.

| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| TELEMETRY-DOCS-0001 | DOING (2025-10-29) | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| TELEMETRY-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after the next sprint demo. | Sync outcomes back to ../../TASKS.md |
| TELEMETRY-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via the ./AGENTS.md workflow |
41
docs/modules/telemetry/architecture.md
Normal file
@@ -0,0 +1,41 @@
# Telemetry architecture

> Derived from Epic 15 – Observability & Forensics; details collector topology, storage profiles, forensic pipelines, and offline packaging.

## 1) Topology

- **Collector tier.** OpenTelemetry Collector instances deployed per environment (ingest TLS, gRPC/OTLP receivers, tail-based sampling). Config packages are delivered via the Offline Kit.
- **Processing pipelines.** Pipelines for traces, metrics, and logs with processors (batch, tail sampling, attribute redaction, resource detection). Profiles: `default`, `forensic` (high retention), `airgap` (file-based exporters).
- **Exporters.** OTLP to Prometheus/Tempo/Loki (online) or file/OTLP-HTTP to Offline Kit staging (air-gapped). Exporters are allow-listed to satisfy Sovereign readiness.

## 2) Storage

- **Prometheus** for metrics, with remote-write support and retention windows (default 30 days, forensic 180 days).
- **Tempo** (or Jaeger all-in-one) for traces, with a block storage backend (S3-compatible or filesystem) and deterministic chunk manifests.
- **Loki** for logs stored in immutable chunks; index shards are hashed for reproducibility.
- **Forensic archive** — periodic export of raw OTLP records into signed bundles (`otlp/metrics.pb`, `otlp/traces.pb`, `otlp/logs.pb`, `manifest.json`).
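A forensic bundle's digest list can be checked against the archived OTLP files with standard tools. The sketch below assumes the manifest is (or can be reduced to) a plain `sha256sum`-style digest list in sorted file order; the exact `manifest.json` schema is not specified in this document.

```shell
set -eu
mkdir -p otlp
# Stand-ins for the real OTLP archives named above.
printf 'metrics' > otlp/metrics.pb
printf 'traces'  > otlp/traces.pb
printf 'logs'    > otlp/logs.pb

# Build the digest list in deterministic (locale-fixed, sorted) file order…
LC_ALL=C sha256sum otlp/metrics.pb otlp/traces.pb otlp/logs.pb \
  | LC_ALL=C sort -k2 > manifest.sha256

# …then verify it on the receiving side before trusting the bundle.
sha256sum -c manifest.sha256
```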
## 3) Pipelines & guardrails

- **Redaction.** Attribute processors strip PII/secrets based on policy-managed allowed keys. Redaction profiles are mirrored in the Offline Kit.
- **Sampling.** Tail sampling by service/error; incident mode (triggered by the Orchestrator) promotes services to 100 % sampling, extends retention, and toggles Notify alerts.
- **Alerting.** Prometheus rules and dashboards packaged with Export Center: service SLOs, queue depth, policy run latency, ingestion AOC violations.

## 4) APIs & integration

- `GET /telemetry/config/profile/{name}` — download a collector config bundle (YAML + signature).
- `POST /telemetry/incidents/mode` — toggle incident sampling and forensic bundle generation.
- `GET /telemetry/exports/forensic/{window}` — stream signed OTLP bundles for compliance.
- CLI commands: `stella telemetry deploy --profile default`, `stella telemetry capture --window 24h --out bundle.tar.gz`.

## 5) Offline support

- The Offline Kit ships collector binaries/config, bootstrap scripts, dashboards, alert rules, and OTLP replay tooling. Bundles include a `manifest.json` with digests, DSSE signatures, and instructions.
- In offline environments, exporters write to the local filesystem; operators transfer bundles to an analysis workstation using signed manifests.

## 6) Observability of the telemetry stack

- Meta-metrics: `collector_export_failures_total`, `telemetry_bundle_generation_seconds`, `telemetry_incident_mode{state}`.
- Health endpoints for collectors and storage clusters, plus dashboards for ingestion rate, retention, and rule evaluations.
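The meta-metrics lend themselves to simple self-monitoring alerts. A sketch of a Prometheus rule built on `collector_export_failures_total`; the group name, threshold, and durations are illustrative choices, not mandated here:

```yaml
groups:
  - name: telemetry-self-monitoring   # illustrative group name
    rules:
      - alert: CollectorExportFailures
        # Any export failures over the last 5 minutes, sustained for 10 minutes.
        expr: increase(collector_export_failures_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector export failures detected"
          description: "Check collector logs, exporter backlog, and storage backend health."
```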
Refer to the module README and implementation plan for immediate context, and update this document once component boundaries and data flows are finalised.
64
docs/modules/telemetry/implementation_plan.md
Normal file
@@ -0,0 +1,64 @@
# Implementation plan — Telemetry

## Delivery phases

- **Phase 1 – Collector & pipeline profiles**
  Publish OpenTelemetry collector configs (`default`, `forensic`, `airgap`), establish ingest gateways, TLS/mTLS, and attribute redaction policies.
- **Phase 2 – Storage backends & retention**
  Deploy Prometheus/Tempo/Loki (or equivalents) with retention tiers, bucket/object storage, deterministic manifest generation, and sealed-mode allowlists.
- **Phase 3 – Incident mode & forensic capture**
  Implement incident toggles (CLI/API), tail-sampling adjustments, forensic bundle generation (OTLP archives, manifest/signature), and Notify hooks.
- **Phase 4 – Observability dashboards & automation**
  Deliver dashboards (service SLOs, queue depth, policy latency), alert rules, Grafana packages, and CLI automation for deployment and capture.
- **Phase 5 – Offline & compliance**
  Ship Offline Kit artefacts (collectors, configs, dashboards, replay tooling), signed bundles, and documentation for air-gapped review workflows.
- **Phase 6 – Hardening & SOC handoff**
  Complete RBAC integration, audit logging, incident response runbooks, performance tuning, and integration tests across services.

## Work breakdown

- **Collector configs**
  - Maintain config templates per profile with processors (redaction, batching, resource detection) and exporters.
  - CLI automation (`stella telemetry deploy`, `stella telemetry profile diff`), validation tests, and config signing.
- **Storage & retention**
  - Provision Prometheus/Tempo/Loki (or vendor equivalents) with retention tiers (default, forensic, airgap).
  - Ensure determinism (chunk manifests, content hashing), remote-write allowlists, and sealed/offline modes.
  - Implement archivers for forensic bundles (metrics/traces/logs) with cosign signatures.
- **Incident mode**
  - API/CLI to toggle incident sampling, retention escalation, Notify signals, and automatic bundle capture.
  - Hook into the Orchestrator to respond to incidents and revert after a cooldown.
- **Dashboards & alerts**
  - Dashboard packages for core services (ingestion, policy, export, attestation).
  - Alert rules for SLO burn, collector failure, exporter backlog, and bundle generation errors.
  - Self-observability metrics (`collector_export_failures_total`, `telemetry_incident_mode{}`).
- **Offline support**
  - Offline Kit assets: collector binaries/configs, import scripts, dashboards, replay instructions, compliance checklists.
  - File-based exporters and manual transfer workflows with signed manifests.
- **Docs & runbooks**
  - Update the observability overview, forensic capture guide, incident response checklist, sealed-mode instructions, and RBAC matrix.
  - SOC handoff package with control objectives and audit evidence.

## Acceptance criteria

- Collectors ingest metrics/logs/traces across deployments, applying redaction rules and tenant isolation; profiles validate via CI.
- Storage backends retain data per default/forensic/airgap SLAs with deterministic chunk manifests and sealed-mode compliance.
- Incident mode toggles sampling to 100 %, extends retention, triggers Notify, and captures forensic bundles signed with cosign.
- Dashboards and alerts cover service SLOs, queue depth, policy latency, ingestion violations, and telemetry stack health.
- CLI commands (`stella telemetry deploy/capture/status`) automate config rollout, forensic capture, and verification.
- Offline bundles replay telemetry in sealed environments using the provided scripts and manifests.
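"Deterministic chunk manifests" in the criteria above reduce to stable enumeration order plus content hashing. A minimal sketch (the chunk layout is illustrative): enumerate chunks in a locale-fixed sorted order, hash each, then hash the manifest itself so two runs over identical inputs produce the same top-level digest.

```shell
set -eu
mkdir -p chunks
printf 'b-data' > chunks/b.chunk
printf 'a-data' > chunks/a.chunk

# Sorted, locale-independent enumeration keeps the manifest byte-identical
# regardless of filesystem ordering or host locale.
LC_ALL=C find chunks -type f | LC_ALL=C sort | xargs sha256sum > chunk-manifest.txt

# One digest over the manifest pins the whole chunk set.
sha256sum chunk-manifest.txt | cut -d' ' -f1 > manifest.digest
cat manifest.digest
```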
## Risks & mitigations

- **PII leakage:** strict redaction processors, policy-managed allowlists, audit tests.
- **Collector overload:** horizontal scaling, batching, circuit breakers, incident-mode throttling.
- **Storage cost:** tiered retention, compression, pruning policies, offline archiving.
- **Air-gap drift:** Offline Kit refresh schedule, deterministic manifest verification.
- **Alert fatigue:** burn-rate alerts, deduplication, SOC runbooks.

## Test strategy

- **Config lint/tests:** schema validation, unit tests for processors/exporters, golden configs.
- **Integration:** simulate service traces/logs/metrics; verify pipelines, incident toggles, and bundle generation.
- **Performance:** load tests with peak ingestion, long retention windows, and failover scenarios.
- **Security:** redaction verification, RBAC/tenant scoping, sealed-mode tests, signed config verification.
- **Offline:** capture bundles, transfer, replay, compliance attestation.

## Definition of done

- Collector profiles, storage backends, incident mode, dashboards, CLI, and the Offline Kit delivered with telemetry and documentation.
- Runbooks and SOC handoff packages published; compliance checklists appended.
- ./TASKS.md and ../../TASKS.md updated; imposed rule statements confirmed in documentation.
113
docs/modules/telemetry/operations/collector.md
Normal file
@@ -0,0 +1,113 @@
# Telemetry Collector Deployment Guide

> **Scope:** DevOps Guild, Observability Guild, and operators enabling the StellaOps telemetry pipeline (DEVOPS-OBS-50-001 / DEVOPS-OBS-50-003).

This guide describes how to deploy the default OpenTelemetry Collector packaged with Stella Ops, validate its ingest endpoints, and prepare an offline-ready bundle for air-gapped environments.

---

## 1. Overview

The collector terminates OTLP traffic from Stella Ops services and exports metrics, traces, and logs.

| Endpoint | Purpose | TLS | Authentication |
| -------- | ------- | --- | -------------- |
| `:4317` | OTLP gRPC ingest | mTLS | Client certificate issued by collector CA |
| `:4318` | OTLP HTTP ingest | mTLS | Client certificate issued by collector CA |
| `:9464` | Prometheus scrape | mTLS | Same client certificate |
| `:13133` | Health check | mTLS | Same client certificate |
| `:1777` | pprof diagnostics | mTLS | Same client certificate |

The default configuration lives at `deploy/telemetry/otel-collector-config.yaml` and mirrors the Helm values in the `stellaops` chart.

---

## 2. Local validation (Compose)

```bash
# 1. Generate dev certificates (CA + collector + client)
./ops/devops/telemetry/generate_dev_tls.sh

# 2. Start the collector overlay
cd deploy/compose
docker compose -f docker-compose.telemetry.yaml up -d

# 3. Start the storage overlay (Prometheus, Tempo, Loki)
docker compose -f docker-compose.telemetry-storage.yaml up -d

# 4. Run the smoke test (OTLP HTTP)
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```

The smoke test posts sample traces, metrics, and logs and verifies that the collector increments the `otelcol_receiver_accepted_*` counters exposed via the Prometheus exporter. The storage overlay gives you a local Prometheus/Tempo/Loki stack to confirm end-to-end wiring. The same client certificate can be used by local services to weave traces together. See [`Telemetry Storage Deployment`](storage.md) for the storage configuration guidelines used in staging/production.
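If the Python smoke tool is unavailable, an equivalent one-signal check can be driven with `curl` against the OTLP/HTTP endpoint. The sketch below builds a minimal JSON-encoded trace payload; the service name and hex IDs are placeholders, and the actual send is left commented out since it requires the collector overlay and the dev mTLS material to be in place.

```shell
set -eu
# Minimal OTLP/HTTP trace payload (JSON encoding). IDs are illustrative hex strings.
PAYLOAD='{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"smoke-check"}}]},"scopeSpans":[{"spans":[{"traceId":"5b8efff798038103d269b633813fc60c","spanId":"eee19b7ec3c1b174","name":"smoke-span","kind":1,"startTimeUnixNano":"1700000000000000000","endTimeUnixNano":"1700000001000000000"}]}]}]}'
printf '%s' "$PAYLOAD" > trace.json

# With the collector overlay running, post it over mTLS (uncomment to send):
# curl -fsS --cert client.crt --key client.key --cacert ca.crt \
#   -H 'Content-Type: application/json' --data @trace.json \
#   https://localhost:4318/v1/traces
```

On success the collector's `otelcol_receiver_accepted_spans` counter should tick up by one.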
---

## 3. Kubernetes deployment

Enable the collector in Helm by setting the following values (example shown for the dev profile):

```yaml
telemetry:
  collector:
    enabled: true
    defaultTenant: <tenant>
    tls:
      secretName: stellaops-otel-tls-<env>
```

Provide a Kubernetes secret named `stellaops-otel-tls-<env>` (for staging: `stellaops-otel-tls-stage`) with the keys `tls.crt`, `tls.key`, and `ca.crt`. The secret must contain the collector certificate, private key, and issuing CA respectively. Example:

```bash
kubectl create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector.crt \
  --from-file=tls.key=collector.key \
  --from-file=ca.crt=ca.crt
```

Helm renders the collector deployment, service, and config map automatically:

```bash
helm upgrade --install stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-dev.yaml
```

Update client workloads to trust `ca.crt` and present client certificates that chain back to the same CA.

---

## 4. Offline packaging (DEVOPS-OBS-50-003)

Use the packaging helper to produce a tarball that can be mirrored inside the Offline Kit or air-gapped sites:

```bash
python ops/devops/telemetry/package_offline_bundle.py --output out/telemetry/telemetry-bundle.tar.gz
```

The script gathers:

- `deploy/telemetry/README.md`
- Collector configuration (`deploy/telemetry/otel-collector-config.yaml` and the Helm copy)
- Helm template/values for the collector
- Compose overlay (`deploy/compose/docker-compose.telemetry.yaml`)

The tarball ships with a `.sha256` checksum. To attach a Cosign signature, add `--sign` and provide the `COSIGN_KEY_REF`/`COSIGN_IDENTITY_TOKEN` environment variables (or use the `--cosign-key` flag).
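On the receiving side, the `.sha256` sidecar should be verified before the bundle is imported. A sketch using a stand-in tarball (the real path is `out/telemetry/telemetry-bundle.tar.gz`):

```shell
set -eu
# Stand-in for the tarball produced by package_offline_bundle.py.
printf 'bundle-bytes' > telemetry-bundle.tar.gz
sha256sum telemetry-bundle.tar.gz > telemetry-bundle.tar.gz.sha256

# Verify before importing into the Offline Kit staging area;
# a non-zero exit means the bundle was corrupted or tampered with in transit.
sha256sum -c telemetry-bundle.tar.gz.sha256
```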
Distribute the bundle alongside certificates generated by your PKI. For air-gapped installs, regenerate certificates inside the enclave and recreate the `stellaops-otel-tls` secret.

---

## 5. Operational checks

1. **Health probes** – `kubectl exec` into the collector pod and run `curl -fsSk --cert client.crt --key client.key --cacert ca.crt https://127.0.0.1:13133/healthz`.
2. **Metrics scrape** – confirm Prometheus ingests `otelcol_receiver_accepted_*` counters.
3. **Trace correlation** – ensure services propagate `trace_id` and `tenant.id` attributes; refer to `docs/observability/observability.md` for expected spans.
4. **Certificate rotation** – when rotating the CA, update the secret and restart the collector; roll out new client certificates before enabling `require_client_certificate` if staged.

---

## 6. Related references

- `deploy/telemetry/README.md` – source configuration and local workflow.
- `ops/devops/telemetry/smoke_otel_collector.py` – OTLP smoke test.
- `docs/observability/observability.md` – metrics/traces/logs taxonomy.
- `docs/13_RELEASE_ENGINEERING_PLAYBOOK.md` – release checklist for telemetry assets.
173
docs/modules/telemetry/operations/storage.md
Normal file
@@ -0,0 +1,173 @@
# Telemetry Storage Deployment (DEVOPS-OBS-50-002)

> **Audience:** DevOps Guild, Observability Guild
>
> **Scope:** Prometheus (metrics), Tempo (traces), Loki (logs) storage backends with tenant isolation, TLS, retention policies, and Authority integration.

---

## 1. Components & ports

| Service | Port | Purpose | TLS |
|-----------|------|---------|-----|
| Prometheus | 9090 | Metrics API / alerting | Client auth (mTLS) to scrape collector |
| Tempo | 3200 | Trace ingest + API | mTLS (client cert required) |
| Loki | 3100 | Log ingest + API | mTLS (client cert required) |

The collector forwards OTLP traffic to Tempo (traces), Prometheus scrapes the collector's `/metrics` endpoint, and Loki is used for log search.

---

## 2. Local validation (Compose)

```bash
./ops/devops/telemetry/generate_dev_tls.sh
cd deploy/compose
# Start collector + storage stack
docker compose -f docker-compose.telemetry.yaml up -d
docker compose -f docker-compose.telemetry-storage.yaml up -d
python ../../ops/devops/telemetry/smoke_otel_collector.py --host localhost
```

Configuration files live in `deploy/telemetry/storage/`. Adjust the overrides before shipping to staging/production.

---

## 3. Kubernetes blueprint

Deploy Prometheus, Tempo, and Loki to the `observability` namespace. The Helm values snippet below illustrates the key settings (charts not yet versioned—define them in the observability repo):

```yaml
prometheus:
  server:
    extraFlags:
      - web.enable-lifecycle
    persistentVolume:
      enabled: true
      size: 200Gi
    additionalScrapeConfigsSecret: stellaops-prometheus-scrape
    extraSecretMounts:
      - name: otel-mtls
        secretName: stellaops-otel-tls-stage
        mountPath: /etc/telemetry/tls
        readOnly: true
      - name: otel-token
        secretName: stellaops-prometheus-token
        mountPath: /etc/telemetry/auth
        readOnly: true

loki:
  auth_enabled: true
  singleBinary:
    replicas: 2
  storage:
    type: filesystem
  existingSecretForTls: stellaops-otel-tls-stage
  runtimeConfig:
    configMap:
      name: stellaops-loki-tenant-overrides

tempo:
  server:
    http_listen_port: 3200
  storage:
    trace:
      backend: s3
      s3:
        endpoint: tempo-minio.observability.svc:9000
        bucket: tempo-traces
  multitenancyEnabled: true
  extraVolumeMounts:
    - name: otel-mtls
      mountPath: /etc/telemetry/tls
      readOnly: true
    - name: tempo-tenant-overrides
      mountPath: /etc/telemetry/tenants
      readOnly: true
```

### Staging bootstrap commands

```bash
kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f -

# TLS material (generated via ops/devops/telemetry/generate_dev_tls.sh or from PKI)
kubectl -n observability create secret generic stellaops-otel-tls-stage \
  --from-file=tls.crt=collector-stage.crt \
  --from-file=tls.key=collector-stage.key \
  --from-file=ca.crt=collector-ca.crt

# Prometheus bearer token issued by Authority (scope obs:read)
kubectl -n observability create secret generic stellaops-prometheus-token \
  --from-file=token=prometheus-stage.token

# Tenant overrides
kubectl -n observability create configmap stellaops-loki-tenant-overrides \
  --from-file=overrides.yaml=deploy/telemetry/storage/tenants/loki-overrides.yaml

kubectl -n observability create configmap tempo-tenant-overrides \
  --from-file=tempo-overrides.yaml=deploy/telemetry/storage/tenants/tempo-overrides.yaml

# Additional scrape config referencing the collector service
kubectl -n observability create secret generic stellaops-prometheus-scrape \
  --from-file=prometheus-additional.yaml=deploy/telemetry/storage/prometheus.yaml
```

Provision the following secrets/configs (names can be overridden via Helm values):

| Name | Type | Notes |
|------|------|-------|
| `stellaops-otel-tls-stage` | Secret | Shared CA + server cert/key for collector/storage mTLS. |
| `stellaops-prometheus-token` | Secret | Bearer token minted by Authority (`obs:read`). |
| `stellaops-loki-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/loki-overrides.yaml`. |
| `tempo-tenant-overrides` | ConfigMap | Text from `deploy/telemetry/storage/tenants/tempo-overrides.yaml`. |

---

## 4. Authority & tenancy integration

1. Create Authority clients for each backend (`observability-prometheus`, `observability-loki`, `observability-tempo`).

   ```bash
   stella authority client create observability-prometheus \
     --scopes obs:read \
     --audience observability --description "Prometheus collector scrape"
   stella authority client create observability-loki \
     --scopes obs:logs timeline:read \
     --audience observability --description "Loki ingestion"
   stella authority client create observability-tempo \
     --scopes obs:traces \
     --audience observability --description "Tempo ingestion"
   ```

2. Mint tokens/credentials and store them in the secrets above (see the staging bootstrap commands). Example:

   ```bash
   stella authority token issue observability-prometheus --ttl 30d > prometheus-stage.token
   ```

3. Update ingress/gateway policies to forward `X-StellaOps-Tenant` into Loki/Tempo so tenant headers propagate end-to-end, and ensure each workload sets `tenant.id` attributes (see `docs/observability/observability.md`).

---

## 5. Retention & isolation

- Adjust `deploy/telemetry/storage/tenants/*.yaml` to set per-tenant retention and ingestion limits.
- Configure object storage (S3, GCS, Azure Blob) when moving beyond filesystem storage.
- For air-gapped deployments, mirror the telemetry bundle using `ops/devops/telemetry/package_offline_bundle.py` and import it inside the Offline Kit staging directory.
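The per-tenant files referenced above are plain override maps. A sketch of what `loki-overrides.yaml` might contain, assuming Loki's standard runtime-overrides schema; the tenant IDs and limit values here are illustrative, not shipped defaults:

```yaml
overrides:
  tenant-a:                           # illustrative tenant ID
    retention_period: 720h            # 30 days (the default tier)
    ingestion_rate_mb: 8
    max_global_streams_per_user: 5000
  tenant-forensic:
    retention_period: 4320h           # 180 days, matching the forensic tier
```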
---

## 6. Operational checklist

- [ ] Certificates rotated and secrets updated.
- [ ] Prometheus scrape succeeds (`curl -sk --cert client.crt --key client.key https://collector:9464`).
- [ ] Tempo and Loki report tenant activity (`/api/status`).
- [ ] Retention policy tested by uploading sample data and verifying expiry.
- [ ] Alerts wired into SLO evaluator (DEVOPS-OBS-51-001).
- [ ] Component rule packs imported (e.g. `docs/modules/scheduler/operations/worker-prometheus-rules.yaml`).

---

## 7. References

- `deploy/telemetry/storage/README.md`
- `deploy/compose/docker-compose.telemetry-storage.yaml`
- `docs/modules/telemetry/operations/collector.md`
- `docs/observability/observability.md`