git.stella-ops.org/docs/modules/telemetry/README.md
2025-12-25 19:09:48 +02:00

StellaOps Telemetry

The Telemetry module captures deployment and operations guidance for the shared observability stack (collectors, storage, dashboards).

Latest updates (2025-11-30)

  • Sprint tracker docs/implplan/SPRINT_0330_0001_0001_docs_modules_telemetry.md and module TASKS.md added to mirror status.
  • Observability runbook stub + dashboard placeholder added under operations/ (offline import).
  • Storage/isolation posture references updated; align with platform docs.

Responsibilities

  • Deploy and operate OpenTelemetry collectors for StellaOps services.
  • Provide storage configuration for Prometheus/Tempo/Loki stacks.
  • Document smoke tests and offline bootstrapping steps.
  • Align metrics and alert packs with module SLOs.

Key components

  • Collector deployment guide (./operations/collector.md).
  • Storage deployment guide (./operations/storage.md).
  • Smoke tooling in ops/devops/telemetry/.

Integrations & dependencies

  • DevOps pipelines for packaging telemetry bundles.
  • Module-specific dashboards (scheduler, scanner, etc.).
  • Security/Compliance for retention policies.

Operational notes

  • Smoke script references (../../ops/devops/telemetry).
  • Bundle packaging instructions in ops/devops/telemetry.
  • Sprint 23 console security sign-off (2025-10-27) added the console-security.json Grafana board and a burn-rate alert pack. Ensure environments import the updated dashboards and alerts referenced in docs/updates/2025-10-27-console-security-signoff.md.
  • Observability assets for this sprint: operations/observability.md and operations/dashboards/telemetry-observability.json (offline import).
  • Collector deployment guide: ./operations/collector.md
  • Storage deployment guide: ./operations/storage.md

Backlog references

  • TELEMETRY-OBS-50-001 … 50-004 in ../../TASKS.md.
  • Collector/storage automation tracked in ops/devops/TASKS.md.

Implementation Status

Phase 1 Collector & pipeline profiles (In Progress)

  • OpenTelemetry collector configs: default, forensic, airgap profiles
  • Ingest gateways with TLS/mTLS support
  • Attribute redaction policies and tenant isolation
  • CLI automation: stella telemetry deploy, stella telemetry profile diff
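
The attribute redaction and tenant isolation items above can be illustrated with a small sketch. This is not the module's actual policy engine: the allowlist contents, the `tenant.id` attribute key, and both function names are assumptions chosen for illustration. It shows the general shape of allowlist-based redaction, where any attribute not explicitly permitted is masked before telemetry leaves the collector.

```python
from typing import Any, Dict, Iterable, List, Mapping, Optional

# Hypothetical allowlist: attribute keys permitted to pass through unredacted.
ALLOWED_KEYS = frozenset({"service.name", "http.method", "http.status_code", "tenant.id"})
REDACTED = "[REDACTED]"


def redact_attributes(attributes: Mapping[str, Any],
                      allowed: Iterable[str] = ALLOWED_KEYS) -> Dict[str, Any]:
    """Return a copy in which every non-allowlisted value is replaced by a placeholder."""
    allowed_set = set(allowed)
    return {key: (value if key in allowed_set else REDACTED)
            for key, value in attributes.items()}


def partition_by_tenant(records: Iterable[Mapping[str, Any]]
                        ) -> Dict[Optional[str], List[Mapping[str, Any]]]:
    """Group records by their tenant.id attribute.

    Records without a tenant end up under the None key so the caller can
    reject them instead of mixing them into another tenant's stream.
    """
    grouped: Dict[Optional[str], List[Mapping[str, Any]]] = {}
    for record in records:
        grouped.setdefault(record.get("tenant.id"), []).append(record)
    return grouped
```

Defaulting to redaction (deny by default) means a newly introduced attribute stays masked until someone adds it to the allowlist, which is the safer failure mode for PII.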

Phase 2 Storage backends & retention (Planned)

  • Prometheus/Tempo/Loki deployment with retention tiers
  • Bucket/object storage with deterministic manifest generation
  • Sealed-mode allowlists and offline bundle support
  • Remote-write configuration and archivers
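
As an illustration of what "deterministic manifest generation" could mean in practice, the sketch below builds a manifest whose bytes depend only on the file contents and relative paths, not on filesystem ordering or timestamps. The manifest layout (`files`, `path`, `sha256`, `size`) and the `build_manifest` name are assumptions, not the format used by the telemetry tooling.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(root: Path) -> bytes:
    """Serialize a manifest of all files under root in a reproducible way."""
    # Sort by POSIX-style relative path so the order is platform independent.
    files = sorted(
        (path.relative_to(root).as_posix(), path)
        for path in root.rglob("*")
        if path.is_file()
    )
    entries = [
        {
            "path": rel,
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            "size": path.stat().st_size,
        }
        for rel, path in files
    ]
    # Fixed key order and separators keep the byte output stable across runs.
    return json.dumps({"files": entries}, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")
```

Two directories with identical content produce byte-identical manifests, so the manifest digest can itself be signed or compared across environments.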

Phase 3 Incident mode & forensic capture (Planned)

  • Incident toggles via CLI/API for sampling adjustments
  • Tail sampling to 100% during incidents
  • Forensic bundle generation: OTLP archives with manifest/signature
  • Notify hooks for incident escalation
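
The incident toggle described above can be sketched as a sampling rate that jumps to 100% while an incident is active. The code below is a simplification under stated assumptions: real tail sampling decides after a whole trace has been observed, whereas this sketch only shows the rate switch combined with a deterministic per-trace decision. The function names and the hash-based bucketing are illustrative, not the module's API.

```python
import hashlib


def sample_rate(baseline: float, incident_active: bool) -> float:
    """Return the effective sampling rate: everything is kept during an incident."""
    return 1.0 if incident_active else baseline


def should_keep(trace_id: str, rate: float) -> bool:
    """Decide deterministically whether a trace is kept at the given rate.

    The trace ID is hashed into a 64-bit bucket, so every collector replica
    reaches the same decision for the same trace.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # 0 <= bucket < 2**64
    return bucket < rate * 2**64
```

With `rate` at 1.0 every trace is kept, and with 0.0 none are; in between, the kept fraction approximates `rate` across many trace IDs.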

Phase 4 Observability dashboards & automation (Planned)

  • Service SLO dashboards: queue depth, policy latency, ingestion violations
  • Alert rules: burn-rate, collector failure, exporter backlog
  • Grafana packages for core services
  • Self-observability metrics
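
For readers unfamiliar with burn-rate alerting, the arithmetic is short enough to sketch. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); a value of 1.0 means the budget is being consumed exactly as fast as the SLO allows. The function names, the two-window check, and the default threshold of 14.4 are illustrative assumptions and are not taken from the alert pack shipped with StellaOps.

```python
from typing import Tuple


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Return how many times faster than allowed the error budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(long_window: Tuple[int, int], short_window: Tuple[int, int],
                slo_target: float, threshold: float = 14.4) -> bool:
    """Fire only when both a long and a short window exceed the threshold.

    Each window is an (errors, total) pair. Requiring both keeps the alert
    from firing on a brief spike and lets it clear soon after recovery.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

For example, with a 99% SLO, 200 errors out of 1000 requests is an error ratio of 0.2 against a budget of 0.01, a burn rate of about 20.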

Phase 5 Offline & compliance (Planned)

  • Offline Kit artifacts: collector binaries/configs, import scripts
  • Deterministic bundles with signed manifests
  • Replay tooling and compliance checklists
  • File-based exporters for air-gapped environments
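
Before an offline bundle is imported in a sealed environment, its contents would typically be checked against the manifest. The sketch below assumes a manifest that has already been parsed into a mapping of relative path to SHA-256 hex digest; that shape and the `verify_bundle` name are assumptions for illustration. Signature verification of the manifest itself is out of scope here and would have to happen before this step.

```python
import hashlib
from pathlib import Path
from typing import List, Mapping


def verify_bundle(root: Path, expected: Mapping[str, str]) -> List[str]:
    """Compare files under root with expected digests and return a list of problems.

    An empty list means the bundle matches the manifest exactly.
    """
    problems: List[str] = []
    for rel, want in sorted(expected.items()):
        target = root / rel
        if not target.is_file():
            problems.append("missing: " + rel)
        elif hashlib.sha256(target.read_bytes()).hexdigest() != want:
            problems.append("digest mismatch: " + rel)
    # Files present on disk but absent from the manifest are also suspicious.
    actual = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
    for rel in sorted(actual - set(expected)):
        problems.append("unexpected: " + rel)
    return problems
```

Reporting unexpected files as well as missing or altered ones helps catch drift in both directions between the kit and the manifest.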

Phase 6 Hardening & SOC handoff (Planned)

  • RBAC integration and audit logging
  • Incident response runbooks and performance tuning
  • Integration tests across services
  • SOC handoff package with control objectives

Key Acceptance Criteria

  • Collectors ingest metrics/logs/traces with redaction rules and tenant isolation
  • Storage backends retain data per SLAs with deterministic manifests
  • Incident mode triggers forensic capture with signed bundles
  • Dashboards/alerts cover service SLOs and telemetry stack health
  • CLI automates config rollout, forensic capture, verification
  • Offline bundles replay telemetry in sealed environments

Technical Decisions & Risks

  • PII leakage prevented via strict redaction processors, policy-managed allowlists
  • Collector overload managed with horizontal scaling, batching, circuit breakers
  • Storage cost controlled via tiered retention, compression, pruning, offline archiving
  • Air-gap drift mitigated with offline kit refresh schedule, manifest verification
  • Alert fatigue reduced with burn-rate alerts, deduping, SOC runbooks
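
The circuit-breaker mitigation listed for collector overload can be sketched as follows. This is a generic, minimal breaker and not the collector's implementation: the class name, the thresholds, and the injectable clock are assumptions. After a configurable number of consecutive failures, sends are refused until a cool-down has elapsed, at which point a trial send is allowed.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling a failing exporter until a cool-down period has passed."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if a send attempt may proceed right now."""
        if self.opened_at is None:
            return True
        # Half-open: permit a trial send once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        """Close the circuit and reset the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure and open (or re-open) the circuit at the threshold."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would check `allow()` before each export, then call `record_success()` or `record_failure()` depending on the outcome; while the circuit is open, batches can be buffered or dropped according to local policy.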

Operational Assets (Sprint 0330 · 2025-11-30)

  • Observability runbook: operations/observability.md
  • Dashboard placeholder: operations/dashboards/telemetry-observability.json
  • Console security dashboard: console-security.json (Sprint 23)
  • Burn-rate alert pack for environments

Epic alignment

  • Epic 15 Observability & Forensics: deliver collector/storage deployments, forensic evidence retention, and observability bundles with deterministic configuration.