docs consolidation work

2025-12-25 18:48:11 +02:00
parent 2a06f780cf
commit 0103defcff
114 changed files with 4143 additions and 2395 deletions
--- a/docs/modules/telemetry/README.md
+++ b/docs/modules/telemetry/README.md
@@ -2,12 +2,12 @@

 Telemetry module captures deployment and operations guidance for the shared observability stack (collectors, storage, dashboards).

-## Latest updates (2025-11-30)
- Sprint tracker `docs/implplan/SPRINT_0330_0001_0001_docs_modules_telemetry.md` and module `TASKS.md` added to mirror status.
- Observability runbook stub + dashboard placeholder added under `operations/` (offline import).
- Storage/isolation posture references updated; align with platform docs.
-
-## Responsibilities
+## Latest updates (2025-11-30)
+- Sprint tracker `docs/implplan/SPRINT_0330_0001_0001_docs_modules_telemetry.md` and module `TASKS.md` added to mirror status.
+- Observability runbook stub + dashboard placeholder added under `operations/` (offline import).
+- Storage/isolation posture references updated; align with platform docs.
+
+## Responsibilities
 - Deploy and operate OpenTelemetry collectors for StellaOps services.
 - Provide storage configuration for Prometheus/Tempo/Loki stacks.
 - Document smoke tests and offline bootstrapping steps.
@@ -23,11 +23,11 @@ Telemetry module captures deployment and operations guidance for the shared obse
 - Module-specific dashboards (scheduler, scanner, etc.).
 - Security/Compliance for retention policies.

-## Operational notes
- Smoke script references (../../ops/devops/telemetry).
- Bundle packaging instructions in ops/devops/telemetry.
- Sprint 23 console security sign-off (2025-10-27) added the `console-security.json` Grafana board and burn-rate alert pack—ensure environments import the updated dashboards/alerts referenced in `docs/updates/2025-10-27-console-security-signoff.md`.
- Observability assets for this sprint: `operations/observability.md` and `operations/dashboards/telemetry-observability.json` (offline import).
+## Operational notes
+- Smoke script references (../../ops/devops/telemetry).
+- Bundle packaging instructions in ops/devops/telemetry.
+- Sprint 23 console security sign-off (2025-10-27) added the `console-security.json` Grafana board and burn-rate alert pack—ensure environments import the updated dashboards/alerts referenced in `docs/updates/2025-10-27-console-security-signoff.md`.
+- Observability assets for this sprint: `operations/observability.md` and `operations/dashboards/telemetry-observability.json` (offline import).

 ## Related resources
 - ./operations/collector.md
@@ -37,5 +37,64 @@ Telemetry module captures deployment and operations guidance for the shared obse
 - TELEMETRY-OBS-50-001 … 50-004 in ../../TASKS.md.
 - Collector/storage automation tracked in ops/devops/TASKS.md.

+## Implementation Status
+
+### Phase 1 – Collector & pipeline profiles (In Progress)
+- OpenTelemetry collector configs: default, forensic, airgap profiles
+- Ingest gateways with TLS/mTLS support
+- Attribute redaction policies and tenant isolation
+- CLI automation: stella telemetry deploy, stella telemetry profile diff
+
+### Phase 2 – Storage backends & retention (Planned)
+- Prometheus/Tempo/Loki deployment with retention tiers
+- Bucket/object storage with deterministic manifest generation
+- Sealed-mode allowlists and offline bundle support
+- Remote-write configuration and archivers
+
+### Phase 3 – Incident mode & forensic capture (Planned)
+- Incident toggles via CLI/API for sampling adjustments
+- Tail sampling to 100% during incidents
+- Forensic bundle generation: OTLP archives with manifest/signature
+- Notify hooks for incident escalation
+
+### Phase 4 – Observability dashboards & automation (Planned)
+- Service SLO dashboards: queue depth, policy latency, ingestion violations
+- Alert rules: burn-rate, collector failure, exporter backlog
+- Grafana packages for core services
+- Self-observability metrics
+
+### Phase 5 – Offline & compliance (Planned)
+- Offline Kit artifacts: collector binaries/configs, import scripts
+- Deterministic bundles with signed manifests
+- Replay tooling and compliance checklists
+- File-based exporters for air-gapped environments
+
+### Phase 6 – Hardening & SOC handoff (Planned)
+- RBAC integration and audit logging
+- Incident response runbooks and performance tuning
+- Integration tests across services
+- SOC handoff package with control objectives
+
+### Key Acceptance Criteria
+- Collectors ingest metrics/logs/traces with redaction rules and tenant isolation
+- Storage backends retain data per SLAs with deterministic manifests
+- Incident mode triggers forensic capture with signed bundles
+- Dashboards/alerts cover service SLOs and telemetry stack health
+- CLI automates config rollout, forensic capture, verification
+- Offline bundles replay telemetry in sealed environments
+
+### Technical Decisions & Risks
+- PII leakage prevented via strict redaction processors, policy-managed allowlists
+- Collector overload managed with horizontal scaling, batching, circuit breakers
+- Storage cost controlled via tiered retention, compression, pruning, offline archiving
+- Air-gap drift mitigated with offline kit refresh schedule, manifest verification
+- Alert fatigue reduced with burn-rate alerts, deduping, SOC runbooks
+
+### Operational Assets (Sprint 0330 · 2025-11-30)
+- Observability runbook: operations/observability.md
+- Dashboard placeholder: operations/dashboards/telemetry-observability.json
+- Console security dashboard: console-security.json (Sprint 23)
+- Burn-rate alert pack for environments
+
 ## Epic alignment
 - **Epic 15 – Observability & Forensics:** deliver collector/storage deployments, forensic evidence retention, and observability bundles with deterministic configuration.
--- a/docs/modules/telemetry/implementation_plan.md
+++ b/docs/modules/telemetry/implementation_plan.md
@@ -1,69 +0,0 @@
-# Implementation plan — Telemetry
-
-## Delivery phases
- **Phase 1 – Collector & pipeline profiles**  
-  Publish OpenTelemetry collector configs (`default`, `forensic`, `airgap`), establish ingest gateways, TLS/mTLS, and attribute redaction policies.
- **Phase 2 – Storage backends & retention**  
-  Deploy Prometheus/Tempo/Loki (or equivalents) with retention tiers, bucket/object storage, deterministic manifest generation, and sealed-mode allowlists.
- **Phase 3 – Incident mode & forensic capture**  
-  Implement incident toggles (CLI/API), tail sampling adjustments, forensic bundle generation (OTLP archives, manifest/signature), and Notify hooks.
- **Phase 4 – Observability dashboards & automation**  
-  Deliver dashboards (service SLOs, queue depth, policy latency), alert rules, Grafana packages, and CLI automation for deployment and capture.
- **Phase 5 – Offline & compliance**  
-  Ship Offline Kit artefacts (collectors, configs, dashboards, replay tooling), signed bundles, and documentation for air-gapped review workflows.
- **Phase 6 – Hardening & SOC handoff**  
-  Complete RBAC integration, audit logging, incident response runbooks, performance tuning, and integration tests across services.
-
-## Work breakdown
- **Collector configs**
-  - Maintain config templates per profile with processors (redaction, batching, resource detection) and exporters.
-  - CLI automation (`stella telemetry deploy`, `stella telemetry profile diff`), validation tests, and config signing.
- **Storage & retention**
-  - Provision Prometheus/Tempo/Loki (or vendor equivalents) with retention tiers (default, forensic, airgap).
-  - Ensure determinism (chunk manifests, content hashing), remote-write allowlists, sealed/offline modes.
-  - Implement archivers for forensic bundles (metrics/traces/logs) with cosign signatures.
- **Incident mode**
-  - API/CLI to toggle incident sampling, retention escalation, Notify signals, and auto bundle capture.
-  - Hook into Orchestrator to respond to incidents and revert after cooldown.
- **Dashboards & alerts**
-  - Dashboard packages for core services (ingestion, policy, export, attestation).
-  - Alert rules for SLO burn, collector failure, exporter backlog, bundle generation errors.
-  - Self-observability metrics (`collector_export_failures_total`, `telemetry_incident_mode{}`).
- **Offline support**
-  - Offline Kit assets: collector binaries/configs, import scripts, dashboards, replay instructions, compliance checklists.
-  - File-based exporters and manual transfer workflows with signed manifests.
- **Docs & runbooks**
-  - Update observability overview, forensic capture guide, incident response checklist, sealed-mode instructions, RBAC matrix.
-  - SOC handoff package with control objectives and audit evidence.
-
-## Acceptance criteria
- Collectors ingest metrics/logs/traces across deployments, applying redaction rules and tenant isolation; profiles validate via CI.
- Storage backends retain data per default/forensic/airgap SLAs with deterministic chunk manifests and sealed-mode compliance.
- Incident mode toggles sampling to 100 %, extends retention, triggers Notify, and captures forensic bundles signed with cosign.
- Dashboards and alerts cover service SLOs, queue depth, policy latency, ingestion violations, and telemetry stack health.
- CLI commands (`stella telemetry deploy/capture/status`) automate config rollout, forensic capture, and verification.
- Offline bundles replay telemetry in sealed environments using provided scripts and manifests.
-
-## Risks & mitigations
- **PII leakage:** strict redaction processors, policy-managed allowlists, audit tests.
- **Collector overload:** horizontal scaling, batching, circuit breakers, incident mode throttling.
- **Storage cost:** tiered retention, compression, pruning policies, offline archiving.
- **Air-gap drift:** offline kit refresh schedule, deterministic manifest verification.
- **Alert fatigue:** burn-rate alerts, deduping, SOC runbooks.
-
-## Test strategy
- **Config lint/tests:** schema validation, unit tests for processors/exporters, golden configs.
- **Integration:** simulate service traces/logs/metrics, verify pipelines, incident toggles, bundle generation.
- **Performance:** load tests with peak ingestion, long retention windows, failover scenarios.
- **Security:** redaction verification, RBAC/tenant scoping, sealed-mode tests, signed config verification.
- **Offline:** capture bundles, transfer, replay, compliance attestation.
-
-## Definition of done
- Collector profiles, storage backends, incident mode, dashboards, CLI, and offline kit delivered with telemetry and documentation.
- Runbooks and SOC handoff packages published; compliance checklists appended.
- ./TASKS.md and ../../TASKS.md updated; imposed rule statements confirmed in documentation.
-
-## Sprint alignment (2025-11-30)
- Docs refresh tracked in `docs/implplan/SPRINT_0330_0001_0001_docs_modules_telemetry.md`; statuses mirrored in `docs/modules/telemetry/TASKS.md`.
- Observability evidence lives in `operations/observability.md` with Grafana JSON stub under `operations/dashboards/`.
- Keep future doc/ops updates mirrored across sprint, TASKS, and module front doors to avoid drift.