stella-ops.org/git.stella-ops.org

Fork 0

Files

StellaOps Bot 17d45a6d30

Airgap Sealed CI Smoke / sealed-smoke (push) Has been cancelled

Details

Docs CI / lint-and-preview (push) Has been cancelled

Details

Export Center CI / export-ci (push) Has been cancelled

Details

feat: Implement Filesystem and MongoDB provenance writers for PackRun execution context

- Added `FilesystemPackRunProvenanceWriter` to write provenance manifests to the filesystem.
- Introduced `MongoPackRunArtifactReader` to read artifacts from MongoDB.
- Created `MongoPackRunProvenanceWriter` to store provenance manifests in MongoDB.
- Developed unit tests for filesystem and MongoDB provenance writers.
- Established `ITimelineEventStore` and `ITimelineIngestionService` interfaces for timeline event handling.
- Implemented `TimelineIngestionService` to validate and persist timeline events with hashing.
- Created PostgreSQL schema and migration scripts for timeline indexing.
- Added dependency injection support for timeline indexer services.
- Developed tests for timeline ingestion and schema validation.

2025-11-30 15:38:14 +02:00

5.1 KiB

Raw Blame History

Implementation plan — Telemetry

Delivery phases

Phase 1 – Collector & pipeline profiles
Publish OpenTelemetry collector configs (default, forensic, airgap), establish ingest gateways, TLS/mTLS, and attribute redaction policies.
Phase 2 – Storage backends & retention
Deploy Prometheus/Tempo/Loki (or equivalents) with retention tiers, bucket/object storage, deterministic manifest generation, and sealed-mode allowlists.
Phase 3 – Incident mode & forensic capture
Implement incident toggles (CLI/API), tail sampling adjustments, forensic bundle generation (OTLP archives, manifest/signature), and Notify hooks.
Phase 4 – Observability dashboards & automation
Deliver dashboards (service SLOs, queue depth, policy latency), alert rules, Grafana packages, and CLI automation for deployment and capture.
Phase 5 – Offline & compliance
Ship Offline Kit artefacts (collectors, configs, dashboards, replay tooling), signed bundles, and documentation for air-gapped review workflows.
Phase 6 – Hardening & SOC handoff
Complete RBAC integration, audit logging, incident response runbooks, performance tuning, and integration tests across services.

Work breakdown

Collector configs
- Maintain config templates per profile with processors (redaction, batching, resource detection) and exporters.
- CLI automation (stella telemetry deploy, stella telemetry profile diff), validation tests, and config signing.
Storage & retention
- Provision Prometheus/Tempo/Loki (or vendor equivalents) with retention tiers (default, forensic, airgap).
- Ensure determinism (chunk manifests, content hashing), remote-write allowlists, sealed/offline modes.
- Implement archivers for forensic bundles (metrics/traces/logs) with cosign signatures.
Incident mode
- API/CLI to toggle incident sampling, retention escalation, Notify signals, and auto bundle capture.
- Hook into Orchestrator to respond to incidents and revert after cooldown.
Dashboards & alerts
- Dashboard packages for core services (ingestion, policy, export, attestation).
- Alert rules for SLO burn, collector failure, exporter backlog, bundle generation errors.
- Self-observability metrics (collector_export_failures_total, telemetry_incident_mode{}).
Offline support
- Offline Kit assets: collector binaries/configs, import scripts, dashboards, replay instructions, compliance checklists.
- File-based exporters and manual transfer workflows with signed manifests.
Docs & runbooks
- Update observability overview, forensic capture guide, incident response checklist, sealed-mode instructions, RBAC matrix.
- SOC handoff package with control objectives and audit evidence.

Acceptance criteria

Collectors ingest metrics/logs/traces across deployments, applying redaction rules and tenant isolation; profiles validate via CI.
Storage backends retain data per default/forensic/airgap SLAs with deterministic chunk manifests and sealed-mode compliance.
Incident mode toggles sampling to 100 %, extends retention, triggers Notify, and captures forensic bundles signed with cosign.
Dashboards and alerts cover service SLOs, queue depth, policy latency, ingestion violations, and telemetry stack health.
CLI commands (stella telemetry deploy/capture/status) automate config rollout, forensic capture, and verification.
Offline bundles replay telemetry in sealed environments using provided scripts and manifests.

Risks & mitigations

PII leakage: strict redaction processors, policy-managed allowlists, audit tests.
Collector overload: horizontal scaling, batching, circuit breakers, incident mode throttling.
Storage cost: tiered retention, compression, pruning policies, offline archiving.
Air-gap drift: offline kit refresh schedule, deterministic manifest verification.
Alert fatigue: burn-rate alerts, deduping, SOC runbooks.

Test strategy

Config lint/tests: schema validation, unit tests for processors/exporters, golden configs.
Integration: simulate service traces/logs/metrics, verify pipelines, incident toggles, bundle generation.
Performance: load tests with peak ingestion, long retention windows, failover scenarios.
Security: redaction verification, RBAC/tenant scoping, sealed-mode tests, signed config verification.
Offline: capture bundles, transfer, replay, compliance attestation.

Definition of done

Collector profiles, storage backends, incident mode, dashboards, CLI, and offline kit delivered with telemetry and documentation.
Runbooks and SOC handoff packages published; compliance checklists appended.
./TASKS.md and ../../TASKS.md updated; imposed rule statements confirmed in documentation.

Sprint alignment (2025-11-30)

Docs refresh tracked in docs/implplan/SPRINT_0330_0001_0001_docs_modules_telemetry.md; statuses mirrored in docs/modules/telemetry/TASKS.md.
Observability evidence lives in operations/observability.md with Grafana JSON stub under operations/dashboards/.
Keep future doc/ops updates mirrored across sprint, TASKS, and module front doors to avoid drift.

5.1 KiB Raw Blame History Unescape Escape

Implementation plan — Telemetry

Delivery phases

Work breakdown

Acceptance criteria

Risks & mitigations

Test strategy

Definition of done

Sprint alignment (2025-11-30)

5.1 KiB

Raw Blame History