Files
git.stella-ops.org/docs/modules/telemetry/README.md
2025-12-25 18:50:33 +02:00

101 lines
4.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# StellaOps Telemetry
Telemetry module captures deployment and operations guidance for the shared observability stack (collectors, storage, dashboards).
## Latest updates (2025-11-30)
- Sprint tracker `docs/implplan/SPRINT_0330_0001_0001_docs_modules_telemetry.md` and module `TASKS.md` added to mirror status.
- Observability runbook stub + dashboard placeholder added under `operations/` (offline import).
- Storage/isolation posture references updated; align with platform docs.
## Responsibilities
- Deploy and operate OpenTelemetry collectors for StellaOps services.
- Provide storage configuration for Prometheus/Tempo/Loki stacks.
- Document smoke tests and offline bootstrapping steps.
- Align metrics and alert packs with module SLOs.
## Key components
- Collector deployment guide (./operations/collector.md).
- Storage deployment guide (./operations/storage.md).
- Smoke tooling in `ops/devops/telemetry/`.
## Integrations & dependencies
- DevOps pipelines for packaging telemetry bundles.
- Module-specific dashboards (scheduler, scanner, etc.).
- Security/Compliance for retention policies.
## Operational notes
- Smoke script references (../../ops/devops/telemetry).
- Bundle packaging instructions in ops/devops/telemetry.
- Sprint 23 console security sign-off (2025-10-27) added the `console-security.json` Grafana board and burn-rate alert pack—ensure environments import the updated dashboards/alerts referenced in `docs/updates/2025-10-27-console-security-signoff.md`.
- Observability assets for this sprint: `operations/observability.md` and `operations/dashboards/telemetry-observability.json` (offline import).
## Related resources
- ./operations/collector.md
- ./operations/storage.md
## Backlog references
- TELEMETRY-OBS-50-001 … 50-004 in ../../TASKS.md.
- Collector/storage automation tracked in ops/devops/TASKS.md.
## Implementation Status
### Phase 1 Collector & pipeline profiles (In Progress)
- OpenTelemetry collector configs: default, forensic, airgap profiles
- Ingest gateways with TLS/mTLS support
- Attribute redaction policies and tenant isolation
- CLI automation: stella telemetry deploy, stella telemetry profile diff
### Phase 2 Storage backends & retention (Planned)
- Prometheus/Tempo/Loki deployment with retention tiers
- Bucket/object storage with deterministic manifest generation
- Sealed-mode allowlists and offline bundle support
- Remote-write configuration and archivers
### Phase 3 Incident mode & forensic capture (Planned)
- Incident toggles via CLI/API for sampling adjustments
- Tail sampling to 100% during incidents
- Forensic bundle generation: OTLP archives with manifest/signature
- Notify hooks for incident escalation
### Phase 4 Observability dashboards & automation (Planned)
- Service SLO dashboards: queue depth, policy latency, ingestion violations
- Alert rules: burn-rate, collector failure, exporter backlog
- Grafana packages for core services
- Self-observability metrics
### Phase 5 Offline & compliance (Planned)
- Offline Kit artifacts: collector binaries/configs, import scripts
- Deterministic bundles with signed manifests
- Replay tooling and compliance checklists
- File-based exporters for air-gapped environments
### Phase 6 Hardening & SOC handoff (Planned)
- RBAC integration and audit logging
- Incident response runbooks and performance tuning
- Integration tests across services
- SOC handoff package with control objectives
### Key Acceptance Criteria
- Collectors ingest metrics/logs/traces with redaction rules and tenant isolation
- Storage backends retain data per SLAs with deterministic manifests
- Incident mode triggers forensic capture with signed bundles
- Dashboards/alerts cover service SLOs and telemetry stack health
- CLI automates config rollout, forensic capture, verification
- Offline bundles replay telemetry in sealed environments
### Technical Decisions & Risks
- PII leakage prevented via strict redaction processors, policy-managed allowlists
- Collector overload managed with horizontal scaling, batching, circuit breakers
- Storage cost controlled via tiered retention, compression, pruning, offline archiving
- Air-gap drift mitigated with offline kit refresh schedule, manifest verification
- Alert fatigue reduced with burn-rate alerts, deduping, SOC runbooks
### Operational Assets (Sprint 0330 · 2025-11-30)
- Observability runbook: operations/observability.md
- Dashboard placeholder: operations/dashboards/telemetry-observability.json
- Console security dashboard: console-security.json (Sprint 23)
- Burn-rate alert pack for environments
## Epic alignment
- **Epic 15 Observability & Forensics:** deliver collector/storage deployments, forensic evidence retention, and observability bundles with deterministic configuration.