git.stella-ops.org/docs/modules/telemetry/implementation_plan.md
2025-11-30 15:38:14 +02:00


Implementation plan — Telemetry

Delivery phases

  • Phase 1: Collector & pipeline profiles
    Publish OpenTelemetry collector configs (default, forensic, airgap), establish ingest gateways, TLS/mTLS, and attribute redaction policies.
  • Phase 2: Storage backends & retention
    Deploy Prometheus/Tempo/Loki (or equivalents) with retention tiers, bucket/object storage, deterministic manifest generation, and sealed-mode allowlists.
  • Phase 3: Incident mode & forensic capture
    Implement incident toggles (CLI/API), tail sampling adjustments, forensic bundle generation (OTLP archives, manifest/signature), and Notify hooks.
  • Phase 4: Observability dashboards & automation
    Deliver dashboards (service SLOs, queue depth, policy latency), alert rules, Grafana packages, and CLI automation for deployment and capture.
  • Phase 5: Offline & compliance
    Ship Offline Kit artefacts (collectors, configs, dashboards, replay tooling), signed bundles, and documentation for air-gapped review workflows.
  • Phase 6: Hardening & SOC handoff
    Complete RBAC integration, audit logging, incident response runbooks, performance tuning, and integration tests across services.

Work breakdown

  • Collector configs
    • Maintain config templates per profile with processors (redaction, batching, resource detection) and exporters.
    • Provide CLI automation (stella telemetry deploy, stella telemetry profile diff), validation tests, and config signing.
  • Storage & retention
    • Provision Prometheus/Tempo/Loki (or vendor equivalents) with retention tiers (default, forensic, airgap).
    • Ensure determinism (chunk manifests, content hashing) and support remote-write allowlists and sealed/offline modes.
    • Implement archivers for forensic bundles (metrics/traces/logs) with cosign signatures.
  • Incident mode
    • API/CLI to toggle incident sampling, retention escalation, Notify signals, and auto bundle capture.
    • Hook into Orchestrator to respond to incidents and revert after cooldown.
  • Dashboards & alerts
    • Dashboard packages for core services (ingestion, policy, export, attestation).
    • Alert rules for SLO burn, collector failure, exporter backlog, bundle generation errors.
    • Self-observability metrics (collector_export_failures_total, telemetry_incident_mode{}).
  • Offline support
    • Offline Kit assets: collector binaries/configs, import scripts, dashboards, replay instructions, compliance checklists.
    • File-based exporters and manual transfer workflows with signed manifests.
  • Docs & runbooks
    • Update observability overview, forensic capture guide, incident response checklist, sealed-mode instructions, RBAC matrix.
    • SOC handoff package with control objectives and audit evidence.
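
The "chunk manifests, content hashing" item under Storage & retention can be illustrated with a minimal sketch. It assumes a manifest is a JSON document listing each chunk's name, size, and SHA-256 digest; the function names and manifest layout are hypothetical and not taken from the StellaOps codebase.

```python
import hashlib
import json


def build_manifest(chunks: dict[str, bytes]) -> str:
    """Return a canonical JSON manifest for the given chunk names and contents.

    Hypothetical layout: one entry per chunk with name, size, and SHA-256.
    """
    entries = [
        {"name": name, "sha256": hashlib.sha256(data).hexdigest(), "size": len(data)}
        for name, data in sorted(chunks.items())  # sort so input order never matters
    ]
    # sort_keys plus fixed separators keep the serialized text stable across runs.
    return json.dumps({"chunks": entries}, sort_keys=True, separators=(",", ":"))


def manifest_digest(manifest: str) -> str:
    """Digest of the manifest text; a signing step would cover this value."""
    return hashlib.sha256(manifest.encode("utf-8")).hexdigest()
```

Because entries are sorted and serialization is canonical, two runs over the same chunks yield byte-identical manifests, which is the property an offline verifier would rely on before checking a signature.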

Acceptance criteria

  • Collectors ingest metrics/logs/traces across deployments, applying redaction rules and tenant isolation; profiles validate via CI.
  • Storage backends retain data per default/forensic/airgap SLAs with deterministic chunk manifests and sealed-mode compliance.
  • Incident mode toggles sampling to 100%, extends retention, triggers Notify, and captures forensic bundles signed with cosign.
  • Dashboards and alerts cover service SLOs, queue depth, policy latency, ingestion violations, and telemetry stack health.
  • CLI commands (stella telemetry deploy/capture/status) automate config rollout, forensic capture, and verification.
  • Offline bundles replay telemetry in sealed environments using provided scripts and manifests.
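
The incident-mode criterion (sampling to 100%, extended retention, revert after cooldown) might be modelled roughly as follows. This is a sketch under assumptions: the class name, the baseline sampling ratio, the retention values, and the cooldown length are all placeholders, and Notify and bundle capture are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentMode:
    # All defaults below are illustrative assumptions, not documented values.
    baseline_sampling: float = 0.1
    baseline_retention_days: int = 7
    incident_retention_days: int = 30
    cooldown_seconds: float = 3600.0
    activated_at: Optional[float] = None

    def activate(self, now: float) -> None:
        """Enter incident mode at timestamp `now` (seconds)."""
        self.activated_at = now

    def tick(self, now: float) -> None:
        """Revert to baseline once the cooldown has elapsed."""
        if self.activated_at is not None and now - self.activated_at >= self.cooldown_seconds:
            self.activated_at = None

    @property
    def active(self) -> bool:
        return self.activated_at is not None

    @property
    def sampling_ratio(self) -> float:
        # Incident mode samples everything; otherwise use the baseline ratio.
        return 1.0 if self.active else self.baseline_sampling

    @property
    def retention_days(self) -> int:
        return self.incident_retention_days if self.active else self.baseline_retention_days
```

Passing the clock in as `now` rather than reading it inside the class keeps the revert logic deterministic and easy to cover in integration tests.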

Risks & mitigations

  • PII leakage: strict redaction processors, policy-managed allowlists, audit tests.
  • Collector overload: horizontal scaling, batching, circuit breakers, incident mode throttling.
  • Storage cost: tiered retention, compression, pruning policies, offline archiving.
  • Air-gap drift: offline kit refresh schedule, deterministic manifest verification.
  • Alert fatigue: burn-rate alerts, deduping, SOC runbooks.
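
As an illustration of the burn-rate alerts named under alert fatigue, the sketch below computes burn rate as the observed error ratio divided by the error budget (1 minus the SLO target) and pages only when both a long and a short window exceed a threshold. The function names are hypothetical, and the threshold is left as a parameter because the plan does not specify one.

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Error ratio divided by the error budget; 1.0 means burning exactly on budget."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    budget = 1.0 - slo
    return (errors / total) / budget


def should_page(
    long_window: tuple[int, int],
    short_window: tuple[int, int],
    slo: float,
    threshold: float,
) -> bool:
    """Page only if both windows, given as (errors, total), burn at or above threshold.

    Requiring the short window as well lets the alert clear quickly once
    the error rate recovers, which reduces repeat pages.
    """
    return (
        burn_rate(*long_window, slo) >= threshold
        and burn_rate(*short_window, slo) >= threshold
    )
```

For example, `should_page((20, 1000), (1, 1000), slo=0.999, threshold=14.4)` does not page: the long window is still burning fast, but the short window shows the error rate has already dropped.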

Test strategy

  • Config lint/tests: schema validation, unit tests for processors/exporters, golden configs.
  • Integration: simulate service traces/logs/metrics, verify pipelines, incident toggles, bundle generation.
  • Performance: load tests with peak ingestion, long retention windows, failover scenarios.
  • Security: redaction verification, RBAC/tenant scoping, sealed-mode tests, signed config verification.
  • Offline: capture bundles, transfer, replay, compliance attestation.
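
A redaction verification test could look like the sketch below. It assumes an allowlist model in which any attribute key not explicitly allowed has its value replaced; the helper names, the placeholder string, and the sample attribute keys are hypothetical and do not describe the actual collector processors.

```python
REDACTED = "[REDACTED]"  # hypothetical placeholder value


def redact_attributes(attrs: dict[str, str], allowlist: frozenset[str]) -> dict[str, str]:
    """Keep values only for allowlisted keys; replace every other value."""
    return {key: (value if key in allowlist else REDACTED) for key, value in attrs.items()}


def leaked_keys(
    original: dict[str, str], redacted: dict[str, str], allowlist: frozenset[str]
) -> list[str]:
    """Audit helper: non-allowlisted keys whose original value survived redaction."""
    return sorted(
        key
        for key, value in original.items()
        if key not in allowlist and redacted.get(key) == value
    )


# Example audit check over a sample span's attributes.
ALLOWED = frozenset({"service.name", "http.method"})
sample = {"service.name": "policy", "http.method": "GET", "user.email": "a@example.com"}
cleaned = redact_attributes(sample, ALLOWED)
```

A default-deny allowlist fails safe: a newly introduced attribute is redacted until someone explicitly allows it, and the audit helper makes that property assertable in CI.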

Definition of done

  • Collector profiles, storage backends, incident mode, dashboards, CLI, and offline kit delivered with telemetry and documentation.
  • Runbooks and SOC handoff packages published; compliance checklists appended.
  • ./TASKS.md and ../../TASKS.md updated; imposed rule statements confirmed in documentation.

Sprint alignment (2025-11-30)

  • Docs refresh tracked in docs/implplan/SPRINT_0330_0001_0001_docs_modules_telemetry.md; statuses mirrored in docs/modules/telemetry/TASKS.md.
  • Observability evidence lives in operations/observability.md, with a Grafana JSON stub under operations/dashboards/.
  • Keep future doc/ops updates mirrored across sprint, TASKS, and module front doors to avoid drift.