# Telemetry architecture

> Derived from Epic 15 – Observability & Forensics; details collector topology, storage profiles, forensic pipelines, and offline packaging.

## 1) Topology

- **Collector tier.** OpenTelemetry Collector instances deployed per environment (ingest TLS, gRPC/OTLP receivers, tail-based sampling). Config packages delivered via Offline Kit.
- **Processing pipelines.** Pipelines for traces, metrics, logs with processors (batch, tail sampling, attributes redaction, resource detection). Profiles: `default`, `forensic` (high-retention), `airgap` (file-based exporters).
- **Exporters.** OTLP to Prometheus/Tempo/Loki (online) or file/OTLP-HTTP to Offline Kit staging (air-gapped). Exporters are allow-listed to satisfy Sovereign readiness.

## 2) Storage

- **Prometheus** for metrics with remote-write support and retention windows (default 30 days, forensic 180 days).
- **Tempo** (or Jaeger all-in-one) for traces with block storage backend (S3-compatible or filesystem) and deterministic chunk manifests.
- **Loki** for logs stored in immutable chunks; index shards hashed for reproducibility.
- **Forensic archive**: periodic export of raw OTLP records into signed bundles (`otlp/metrics.pb`, `otlp/traces.pb`, `otlp/logs.pb`, `manifest.json`).

## 3) Pipelines & Guardrails

- **Redaction.** Attribute processors strip PII/secrets based on policy-managed allowed keys. Redaction profiles mirrored in Offline Kit.
- **Sampling.** Tail sampling by service/error; incident mode (triggered by Orchestrator) promotes services to 100% sampling, extends retention, and toggles Notify alerts.
- **Alerting.** Prometheus rules and dashboards packaged with Export Center: service SLOs, queue depth, policy run latency, ingestion AOC violations.
- **Sealed-mode guard.** `StellaOps.Telemetry.Core` enforces `IEgressPolicy` on OTLP exporters; when air-gap mode is sealed, all non-loopback collector endpoints are automatically disabled and a structured warning with remediation is emitted.
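
The sealed-mode guard can be illustrated with a small sketch. It is not the `StellaOps.Telemetry.Core` implementation: the function names, the warning fields, and the rule that endpoints must carry a URL scheme are all assumptions made for illustration. It only shows the described behaviour, which is that in sealed mode loopback endpoints are kept and everything else is disabled with a structured warning.

```python
import ipaddress
from dataclasses import dataclass, field
from urllib.parse import urlsplit


def is_loopback(endpoint: str) -> bool:
    """Return True when the endpoint host is `localhost` or a loopback IP."""
    host = urlsplit(endpoint).hostname  # None when the URL has no scheme/host
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other DNS name is treated as non-loopback.
        return False


@dataclass
class GuardResult:
    allowed: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def apply_sealed_mode_guard(endpoints, sealed: bool) -> GuardResult:
    """Filter exporter endpoints; hypothetical stand-in for an egress policy."""
    result = GuardResult()
    for endpoint in endpoints:
        if not sealed or is_loopback(endpoint):
            result.allowed.append(endpoint)
            continue
        # Structured warning; the field names are illustrative only.
        result.warnings.append({
            "code": "telemetry.egress.blocked",
            "endpoint": endpoint,
            "remediation": "Use a loopback collector or unseal air-gap mode.",
        })
    return result
```

With `sealed=True`, an input such as `["http://127.0.0.1:4317", "https://collector.example.internal:4317"]` keeps only the first endpoint and produces one warning for the second.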

## 4) APIs & integration

- `GET /telemetry/config/profile/{name}`: download collector config bundle (YAML + signature).
- `POST /telemetry/incidents/mode`: toggle incident sampling + forensic bundle generation.
- `GET /telemetry/exports/forensic/{window}`: stream signed OTLP bundles for compliance.
- CLI commands: `stella telemetry deploy --profile default`, `stella telemetry capture --window 24h --out bundle.tar.gz`.

## 5) Offline support

- Offline Kit ships collector binaries/config, bootstrap scripts, dashboards, alert rules, and OTLP replay tooling. Bundles include `manifest.json` with digests, DSSE signatures, and instructions.
- For offline environments, exporters write to local filesystem; operators transfer bundles to analysis workstation using signed manifests.

## 6) Observability of telemetry stack

- Meta-metrics: `collector_export_failures_total`, `telemetry_bundle_generation_seconds`, `telemetry_incident_mode{state}`.
- Health endpoints for collectors and storage clusters, plus dashboards for ingestion rate, retention, rule evaluations.

## 7) DORA Metrics

Stella Ops tracks the four key DORA (DevOps Research and Assessment) metrics for software delivery performance:

### 7.1) Metrics Tracked

- **Deployment Frequency** (`dora_deployments_total`, `dora_deployment_frequency_per_day`): How often deployments occur per day/week.
- **Lead Time for Changes** (`dora_lead_time_hours`): Time from commit to deployment in production.
- **Change Failure Rate** (`dora_deployment_failure_total`, `dora_change_failure_rate_percent`): Percentage of deployments requiring rollback, hotfix, or failing.
- **Mean Time to Recovery (MTTR)** (`dora_time_to_recovery_hours`): Average time to recover from incidents.
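
To make the four definitions concrete, here is a minimal sketch of how they could be computed from raw events. The `Deployment` and `Incident` records, the `dora_summary` helper, and the choice of the arithmetic mean for lead time are assumptions for illustration; only the metric names come from the list above, and the real service may aggregate differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Deployment:  # hypothetical event shape
    committed_at: datetime
    deployed_at: datetime
    failed: bool  # rollback, hotfix, or failed deployment


@dataclass(frozen=True)
class Incident:  # hypothetical event shape
    started_at: datetime
    resolved_at: datetime


def _hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_summary(deployments, incidents, window_days: int) -> dict:
    """Aggregate the four DORA metrics over a window of `window_days` days."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    total = len(deployments)
    failures = sum(1 for d in deployments if d.failed)
    lead_times = [_hours(d.deployed_at - d.committed_at) for d in deployments]
    recoveries = [_hours(i.resolved_at - i.started_at) for i in incidents]
    return {
        "dora_deployment_frequency_per_day": total / window_days,
        # Mean is an assumption; a median is an equally plausible aggregate.
        "dora_lead_time_hours": mean(lead_times) if lead_times else None,
        "dora_change_failure_rate_percent":
            100.0 * failures / total if total else None,
        "dora_time_to_recovery_hours": mean(recoveries) if recoveries else None,
    }
```

Rates that would require dividing by zero are returned as `None` rather than `0`, so an empty window is not mistaken for a perfect one.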

### 7.2) Performance Classification

The system classifies teams into DORA performance levels:

- **Elite**: On-demand deployments, <24h lead time, <15% CFR, <1h MTTR
- **High**: Weekly deployments, <1 week lead time, <30% CFR, <1 day MTTR
- **Medium**: Monthly deployments, <6 months lead time, <45% CFR, <1 week MTTR
- **Low**: Quarterly or less frequent deployments with higher failure rates

### 7.3) Integration Points

- `IDoraMetricsService`: Service interface for recording deployments and incidents
- `DoraMetrics`: OpenTelemetry-style metrics class with SLO breach tracking
- DI registration: `services.AddDoraMetrics(options => { ... })`
- Events are recorded when Release Orchestrator completes promotions or rollbacks

### 7.4) SLO Tracking

Configurable SLO targets via `DoraMetricsOptions`:

- `LeadTimeSloHours` (default: 24)
- `DeploymentFrequencySloPerDay` (default: 1)
- `ChangeFailureRateSloPercent` (default: 15)
- `MttrSloHours` (default: 1)

SLO breaches are recorded as `dora_slo_breach_total` with `metric` label.

### 7.5) Outcome Analytics and Attribution (Sprint 20260208_065)

Telemetry now includes deterministic executive outcome attribution built on top of the existing DORA event stream:

- `IOutcomeAnalyticsService` (`src/Telemetry/StellaOps.Telemetry.Core/StellaOps.Telemetry.Core/IOutcomeAnalyticsService.cs`)
- `DoraOutcomeAnalyticsService` (`src/Telemetry/StellaOps.Telemetry.Core/StellaOps.Telemetry.Core/DoraOutcomeAnalyticsService.cs`)
- Outcome report models (`src/Telemetry/StellaOps.Telemetry.Core/StellaOps.Telemetry.Core/OutcomeAnalyticsModels.cs`)

Outcome attribution behavior:

- Produces `OutcomeExecutiveReport` for a fixed tenant/environment/time window with deterministic ordering.
- Adds MTTA support via `DoraIncidentEvent.AcknowledgedAt` and `TimeToAcknowledge`.
- Groups deployment outcomes by normalized pipeline (`pipeline-a`, `pipeline-b`, `unknown`) with per-pipeline change failure rate and median lead time.
- Groups incidents by severity with resolved/acknowledged counts plus MTTA/MTTR aggregates.
- Produces daily cohort slices across the requested date range for executive trend views.

Dependency injection integration:

- `TelemetryServiceCollectionExtensions.AddDoraMetrics(...)` now also registers `IOutcomeAnalyticsService`, so existing telemetry entry points automatically expose attribution reporting without additional module wiring.

Verification coverage:

- `src/Telemetry/StellaOps.Telemetry.Core/StellaOps.Telemetry.Core.Tests/OutcomeAnalyticsServiceTests.cs`
- `src/Telemetry/StellaOps.Telemetry.Core/StellaOps.Telemetry.Core.Tests/DoraMetricsServiceTests.cs`
- Full telemetry core test suite pass (`262` tests) remains green after integration.

Refer to the module README and implementation plan for immediate context, and update this document once component boundaries and data flows are finalised.

## 8) Federation DSSE Security Posture (Updated 2026-03-04)

Status:

- Advisory gap `TEL-001` is closed. Federation consent and bundle paths now emit signed DSSE envelopes instead of payload passthrough placeholders.

Implemented contract:

- Consent and bundle envelopes now use explicit DSSE JSON structure: `payloadType`, base64 `payload`, and `signatures[]` (`keyid`, `sig`).
- Consent proofs and bundle summaries carry signer identity metadata (`SignerKeyId`) for auditability.
- Bundle payload canonicalization is deterministic for identical logical inputs:
  - bucket ordering: `cveId` (ordinal), then `noisyCount` (descending), `artifactCount`, `observationCount`
  - deterministic bundle ID derivation from canonical payload seed + fixed clock input
- Bundle verification enforces:
  - envelope digest integrity (`sha256:` over envelope bytes)
  - payload type match
  - trusted-key signature verification
  - consent digest linkage (`consentDigest` in payload must match `ConsentDsseDigest`)

Signer/verifier integration and fallback:

- Federation now uses explicit abstractions:
  - `IFederationDsseEnvelopeSigner`
  - `IFederationDsseEnvelopeVerifier`
- Default adapter: `HmacFederationDsseEnvelopeService` (offline-safe HMAC-SHA256 DSSE sign/verify using local trusted key map in `FederatedTelemetryOptions`).
- Failure mode is deterministic and auditable:
  - signing failures throw `FederationSignatureException` with stable error codes (for example `federation.dsse.sign_failed`, `federation.dsse.signer_unavailable`)
  - optional unsigned fallback (`AllowUnsignedDsseFallback`) emits envelopes tagged with `offline-unsigned-fallback` for explicit operator visibility.

Verification evidence:

- `dotnet test src/Telemetry/StellaOps.Telemetry.Federation.Tests/StellaOps.Telemetry.Federation.Tests.csproj -m:1 -v minimal`
- Result: `47` passed, `0` failed.
- Coverage includes payload tamper, signature tamper, wrong-key verification failure, consent expiry + signature validity combination, and deterministic replay digest checks.

Tracking sprint:

- `docs/implplan/SPRINT_20260304_307_Telemetry_federation_dsse_bundle_hardening.md`
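
For readers unfamiliar with the envelope shape used in section 8, the following sketch shows an HMAC-SHA256 sign/verify round trip over a DSSE-style envelope with `payloadType`, base64 `payload`, and `signatures[]` (`keyid`, `sig`). It is an illustration under assumptions, not the `HmacFederationDsseEnvelopeService` code: signing over the DSSE pre-authentication encoding (PAE), the JSON serialization used for the envelope digest, and every function name and the payload type are hypothetical choices.

```python
import base64
import hashlib
import hmac
import json


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: 'DSSEv1 <len> <type> <len> <body>'."""
    t = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(t), t, len(payload), payload)


def sign_envelope(payload_type: str, payload: bytes, key_id: str, key: bytes) -> dict:
    sig = hmac.new(key, pae(payload_type, payload), hashlib.sha256).digest()
    return {
        "payloadType": payload_type,
        "payload": base64.b64encode(payload).decode("ascii"),
        "signatures": [
            {"keyid": key_id, "sig": base64.b64encode(sig).decode("ascii")}
        ],
    }


def verify_envelope(envelope: dict, expected_type: str, trusted_keys: dict) -> bool:
    """Check payload type, then require one valid signature from a trusted key."""
    if envelope.get("payloadType") != expected_type:
        return False
    try:
        payload = base64.b64decode(envelope["payload"], validate=True)
    except (KeyError, ValueError):
        return False
    message = pae(expected_type, payload)
    for entry in envelope.get("signatures", []):
        key = trusted_keys.get(entry.get("keyid"))
        if key is None:
            continue  # unknown signer: ignore rather than trust
        try:
            actual = base64.b64decode(entry.get("sig", ""), validate=True)
        except ValueError:
            continue
        expected = hmac.new(key, message, hashlib.sha256).digest()
        if hmac.compare_digest(expected, actual):  # constant-time comparison
            return True
    return False


def envelope_digest(envelope: dict) -> str:
    """'sha256:' digest over a canonical JSON form (serialization is assumed)."""
    data = json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(data).hexdigest()
```

A verifier built this way rejects a tampered payload, a tampered signature, and a signature made with an untrusted key, which mirrors the tamper cases listed under the verification evidence. Consent digest linkage and expiry checks are out of scope for the sketch.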