101 lines
4.8 KiB
Markdown
101 lines
4.8 KiB
Markdown
# StellaOps Telemetry
|
||
|
||
Telemetry module captures deployment and operations guidance for the shared observability stack (collectors, storage, dashboards).
|
||
|
||
## Latest updates (2025-11-30)
|
||
- Sprint tracker `docs/implplan/SPRINT_0330_0001_0001_docs_modules_telemetry.md` and module `TASKS.md` added to mirror status.
|
||
- Observability runbook stub + dashboard placeholder added under `operations/` (offline import).
|
||
- Storage/isolation posture references updated; align with platform docs.
|
||
|
||
## Responsibilities
|
||
- Deploy and operate OpenTelemetry collectors for StellaOps services.
|
||
- Provide storage configuration for Prometheus/Tempo/Loki stacks.
|
||
- Document smoke tests and offline bootstrapping steps.
|
||
- Align metrics and alert packs with module SLOs.
|
||
|
||
## Key components
|
||
- Collector deployment guide (./operations/collector.md).
|
||
- Storage deployment guide (./operations/storage.md).
|
||
- Smoke tooling in `ops/devops/telemetry/`.
|
||
|
||
## Integrations & dependencies
|
||
- DevOps pipelines for packaging telemetry bundles.
|
||
- Module-specific dashboards (scheduler, scanner, etc.).
|
||
- Security/Compliance for retention policies.
|
||
|
||
## Operational notes
|
||
- Smoke script references (../../ops/devops/telemetry).
|
||
- Bundle packaging instructions in ops/devops/telemetry.
|
||
- Sprint 23 console security sign-off (2025-10-27) added the `console-security.json` Grafana board and burn-rate alert pack—ensure environments import the updated dashboards/alerts referenced in `docs/updates/2025-10-27-console-security-signoff.md`.
|
||
- Observability assets for this sprint: `operations/observability.md` and `operations/dashboards/telemetry-observability.json` (offline import).
|
||
|
||
## Related resources
|
||
- ./operations/collector.md
|
||
- ./operations/storage.md
|
||
|
||
## Backlog references
|
||
- TELEMETRY-OBS-50-001 … 50-004 in ../../TASKS.md.
|
||
- Collector/storage automation tracked in ops/devops/TASKS.md.
|
||
|
||
## Implementation Status
|
||
|
||
### Phase 1 – Collector & pipeline profiles (In Progress)
|
||
- OpenTelemetry collector configs: default, forensic, airgap profiles
|
||
- Ingest gateways with TLS/mTLS support
|
||
- Attribute redaction policies and tenant isolation
|
||
- CLI automation: stella telemetry deploy, stella telemetry profile diff
|
||
|
||
### Phase 2 – Storage backends & retention (Planned)
|
||
- Prometheus/Tempo/Loki deployment with retention tiers
|
||
- Bucket/object storage with deterministic manifest generation
|
||
- Sealed-mode allowlists and offline bundle support
|
||
- Remote-write configuration and archivers
|
||
|
||
### Phase 3 – Incident mode & forensic capture (Planned)
|
||
- Incident toggles via CLI/API for sampling adjustments
|
||
- Tail sampling to 100% during incidents
|
||
- Forensic bundle generation: OTLP archives with manifest/signature
|
||
- Notify hooks for incident escalation
|
||
|
||
### Phase 4 – Observability dashboards & automation (Planned)
|
||
- Service SLO dashboards: queue depth, policy latency, ingestion violations
|
||
- Alert rules: burn-rate, collector failure, exporter backlog
|
||
- Grafana packages for core services
|
||
- Self-observability metrics
|
||
|
||
### Phase 5 – Offline & compliance (Planned)
|
||
- Offline Kit artifacts: collector binaries/configs, import scripts
|
||
- Deterministic bundles with signed manifests
|
||
- Replay tooling and compliance checklists
|
||
- File-based exporters for air-gapped environments
|
||
|
||
### Phase 6 – Hardening & SOC handoff (Planned)
|
||
- RBAC integration and audit logging
|
||
- Incident response runbooks and performance tuning
|
||
- Integration tests across services
|
||
- SOC handoff package with control objectives
|
||
|
||
### Key Acceptance Criteria
|
||
- Collectors ingest metrics/logs/traces with redaction rules and tenant isolation
|
||
- Storage backends retain data per SLAs with deterministic manifests
|
||
- Incident mode triggers forensic capture with signed bundles
|
||
- Dashboards/alerts cover service SLOs and telemetry stack health
|
||
- CLI automates config rollout, forensic capture, verification
|
||
- Offline bundles replay telemetry in sealed environments
|
||
|
||
### Technical Decisions & Risks
|
||
- PII leakage prevented via strict redaction processors, policy-managed allowlists
|
||
- Collector overload managed with horizontal scaling, batching, circuit breakers
|
||
- Storage cost controlled via tiered retention, compression, pruning, offline archiving
|
||
- Air-gap drift mitigated with offline kit refresh schedule, manifest verification
|
||
- Alert fatigue reduced with burn-rate alerts, deduping, SOC runbooks
|
||
|
||
### Operational Assets (Sprint 0330 · 2025-11-30)
|
||
- Observability runbook: operations/observability.md
|
||
- Dashboard placeholder: operations/dashboards/telemetry-observability.json
|
||
- Console security dashboard: console-security.json (Sprint 23)
|
||
- Burn-rate alert pack for environments
|
||
|
||
## Epic alignment
|
||
- **Epic 15 – Observability & Forensics:** deliver collector/storage deployments, forensic evidence retention, and observability bundles with deterministic configuration.
|