Telemetry architecture
Derived from Epic 15 – Observability & Forensics; details collector topology, storage profiles, forensic pipelines, and offline packaging.
1) Topology
- Collector tier. OpenTelemetry Collector instances deployed per environment (ingest TLS, gRPC/OTLP receivers, tail-based sampling). Config packages are delivered via Offline Kit.
- Processing pipelines. Pipelines for traces, metrics, logs with processors (batch, tail sampling, attributes redaction, resource detection). Profiles: default, forensic (high-retention), airgap (file-based exporters).
- Exporters. OTLP to Prometheus/Tempo/Loki (online) or file/OTLP-HTTP to Offline Kit staging (air-gapped). Exporters are allow-listed to satisfy Sovereign readiness.
2) Storage
- Prometheus for metrics with remote-write support and retention windows (default 30 days, forensic 180 days).
- Tempo (or Jaeger all-in-one) for traces with block storage backend (S3-compatible or filesystem) and deterministic chunk manifests.
- Loki for logs stored in immutable chunks; index shards hashed for reproducibility.
- Forensic archive: periodic export of raw OTLP records into signed bundles (otlp/metrics.pb, otlp/traces.pb, otlp/logs.pb, manifest.json).
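The forensic bundle layout can be illustrated with a minimal Python sketch that writes a manifest.json listing a SHA-256 digest per exported file. The file names come from the list above; the manifest schema (`files`, `path`, `sha256`) and the helper names are assumptions made for illustration, and bundle signing is not shown.

```python
import hashlib
import json
from pathlib import Path

# File names taken from the bundle layout described above.
BUNDLE_FILES = ["otlp/metrics.pb", "otlp/traces.pb", "otlp/logs.pb"]


def sha256_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_root: Path) -> dict:
    """Build a manifest dict (hypothetical schema) with one digest per file."""
    entries = [
        {"path": name, "sha256": sha256_file(bundle_root / name)}
        for name in sorted(BUNDLE_FILES)  # sorted so the output is reproducible
    ]
    return {"files": entries}


def write_manifest(bundle_root: Path) -> Path:
    """Serialize the manifest with sorted keys and write it next to the data."""
    manifest_path = bundle_root / "manifest.json"
    text = json.dumps(build_manifest(bundle_root), sort_keys=True, indent=2)
    manifest_path.write_text(text + "\n", encoding="utf-8")
    return manifest_path
```

Sorting both the file list and the JSON keys keeps the manifest byte-identical for identical inputs, which is what makes a later signature over it meaningful.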
3) Pipelines & Guardrails
- Redaction. Attribute processors strip PII/secrets based on policy-managed allowed keys. Redaction profiles mirrored in Offline Kit.
- Sampling. Tail sampling by service/error; incident mode (triggered by Orchestrator) promotes services to 100% sampling, extends retention, and toggles Notify alerts.
- Alerting. Prometheus rules and dashboards packaged with Export Center: service SLOs, queue depth, policy run latency, ingestion AOC violations.
- Sealed-mode guard. StellaOps.Telemetry.Core enforces IEgressPolicy on OTLP exporters; when air-gap mode is sealed, any non-loopback collector endpoints are automatically disabled and a structured warning with remediation is emitted.
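The sealed-mode rule can be sketched in Python as a filter over exporter endpoints. This is an illustration of the behavior described above, not the IEgressPolicy implementation: the function names, the warning fields, and the event name are hypothetical, and endpoints are assumed to be full URLs with a scheme.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_endpoint(endpoint: str) -> bool:
    """Return True when the endpoint host is localhost or a loopback IP."""
    host = urlsplit(endpoint).hostname  # None when the URL has no scheme/host
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other DNS name is treated as external.
        return False


def apply_sealed_guard(endpoints, sealed):
    """Return (enabled_endpoints, warnings) for the given sealed state."""
    if not sealed:
        return list(endpoints), []
    enabled, warnings = [], []
    for endpoint in endpoints:
        if is_loopback_endpoint(endpoint):
            enabled.append(endpoint)
        else:
            warnings.append({
                "event": "otlp_exporter_disabled",  # hypothetical event name
                "endpoint": endpoint,
                "reason": "air-gap mode is sealed",
                "remediation": "point the exporter at a loopback collector",
            })
    return enabled, warnings
```

Returning the warnings as structured records, rather than logging inside the guard, leaves the choice of log sink to the caller.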
4) APIs & integration
- GET /telemetry/config/profile/{name}: download collector config bundle (YAML + signature).
- POST /telemetry/incidents/mode: toggle incident sampling + forensic bundle generation.
- GET /telemetry/exports/forensic/{window}: stream signed OTLP bundles for compliance.
- CLI commands: stella telemetry deploy --profile default, stella telemetry capture --window 24h --out bundle.tar.gz.
5) Offline support
- Offline Kit ships collector binaries/config, bootstrap scripts, dashboards, alert rules, and OTLP replay tooling. Bundles include manifest.json with digests, DSSE signatures, and instructions.
- For offline environments, exporters write to the local filesystem; operators transfer bundles to an analysis workstation using signed manifests.
6) Observability of telemetry stack
- Meta-metrics: collector_export_failures_total, telemetry_bundle_generation_seconds, telemetry_incident_mode{state}.
- Health endpoints for collectors and storage clusters, plus dashboards for ingestion rate, retention, rule evaluations.
7) DORA Metrics
Stella Ops tracks the four key DORA (DevOps Research and Assessment) metrics for software delivery performance:
7.1) Metrics Tracked
- Deployment Frequency (dora_deployments_total, dora_deployment_frequency_per_day): how often deployments occur per day/week.
- Lead Time for Changes (dora_lead_time_hours): time from commit to deployment in production.
- Change Failure Rate (dora_deployment_failure_total, dora_change_failure_rate_percent): percentage of deployments requiring rollback, hotfix, or failing.
- Mean Time to Recovery (MTTR) (dora_time_to_recovery_hours): average time to recover from incidents.
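As a rough illustration of how the four values relate to raw events, the Python sketch below derives them from deployment and incident records. The record shapes, field names, and the use of a plain mean for lead time are assumptions for this example and are not taken from IDoraMetricsService.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Deployment:  # hypothetical record shape
    committed_at: datetime
    deployed_at: datetime
    failed: bool


@dataclass(frozen=True)
class Incident:  # hypothetical record shape
    started_at: datetime
    resolved_at: datetime


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_summary(deployments, incidents, window_days):
    """Compute the four DORA values over a window of window_days days."""
    total = len(deployments)
    failures = sum(1 for d in deployments if d.failed)
    lead = [hours_between(d.committed_at, d.deployed_at) for d in deployments]
    recovery = [hours_between(i.started_at, i.resolved_at) for i in incidents]
    return {
        "deployment_frequency_per_day": total / window_days,
        "lead_time_hours": mean(lead) if lead else None,
        "change_failure_rate_percent": 100 * failures / total if total else None,
        "time_to_recovery_hours": mean(recovery) if recovery else None,
    }
```

Empty inputs yield None rather than zero, so a window with no deployments is not mistaken for a window with a 0% failure rate.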
7.2) Performance Classification
The system classifies teams into DORA performance levels:
- Elite: On-demand deployments, <24h lead time, <15% CFR, <1h MTTR
- High: Weekly deployments, <1 week lead time, <30% CFR, <1 day MTTR
- Medium: Monthly deployments, <6 months lead time, <45% CFR, <1 week MTTR
- Low: Quarterly or less frequent deployments with higher failure rates
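A minimal Python sketch of the classification follows. The lead time, CFR, and MTTR thresholds are the ones listed above; the numeric deployment-frequency floors (one per day, one per week, one per 30 days), the 180-day reading of "6 months", and the rule that a team gets the best level whose thresholds are all met are assumptions made for illustration.

```python
# (level, min deployments/day, max lead time h, max CFR %, max MTTR h)
# Frequency floors are assumed numeric readings of on-demand/weekly/monthly.
LEVELS = [
    ("Elite", 1.0, 24, 15, 1),
    ("High", 1 / 7, 7 * 24, 30, 24),
    ("Medium", 1 / 30, 180 * 24, 45, 7 * 24),
]


def classify(deploys_per_day, lead_time_hours, cfr_percent, mttr_hours):
    """Return the first (best) level whose thresholds are all satisfied."""
    for name, min_freq, max_lead, max_cfr, max_mttr in LEVELS:
        if (
            deploys_per_day >= min_freq
            and lead_time_hours < max_lead
            and cfr_percent < max_cfr
            and mttr_hours < max_mttr
        ):
            return name
    return "Low"
```

Under this rule a single weak metric pulls the team down a level, so `classify(3, 5, 50, 0.5)` falls through to Low despite elite frequency and lead time.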
7.3) Integration Points
- IDoraMetricsService: service interface for recording deployments and incidents
- DoraMetrics: OpenTelemetry-style metrics class with SLO breach tracking
- DI registration: services.AddDoraMetrics(options => { ... })
- Events are recorded when Release Orchestrator completes promotions or rollbacks
7.4) SLO Tracking
Configurable SLO targets via DoraMetricsOptions:
- LeadTimeSloHours (default: 24)
- DeploymentFrequencySloPerDay (default: 1)
- ChangeFailureRateSloPercent (default: 15)
- MttrSloHours (default: 1)
SLO breaches are recorded as dora_slo_breach_total with a metric label.
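The breach check can be sketched in Python as a comparison of observed values against the defaults above, with one counter per metric label. The Python class is only a stand-in for DoraMetricsOptions; the label values, the observed-value keys, and the strict comparisons are assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DoraSloTargets:
    """Python stand-in for DoraMetricsOptions, using the documented defaults."""
    lead_time_slo_hours: float = 24
    deployment_frequency_slo_per_day: float = 1
    change_failure_rate_slo_percent: float = 15
    mttr_slo_hours: float = 1


def find_breaches(observed, targets):
    """Return the (hypothetical) metric labels whose SLO is breached."""
    breaches = []
    if observed["lead_time_hours"] > targets.lead_time_slo_hours:
        breaches.append("lead_time")
    # Frequency is the only target where a lower value is worse.
    if observed["deployment_frequency_per_day"] < targets.deployment_frequency_slo_per_day:
        breaches.append("deployment_frequency")
    if observed["change_failure_rate_percent"] > targets.change_failure_rate_slo_percent:
        breaches.append("change_failure_rate")
    if observed["mttr_hours"] > targets.mttr_slo_hours:
        breaches.append("mttr")
    return breaches


def record_breaches(counter, observed, targets=DoraSloTargets()):
    """Increment one counter entry per breached metric label."""
    for label in find_breaches(observed, targets):
        counter[label] += 1
    return counter
```

A `Counter` keyed by label plays the role of dora_slo_breach_total here: each evaluation adds one to every label that is out of bounds.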
7.5) Outcome Analytics and Attribution (Sprint 20260208_065)
Telemetry now includes deterministic executive outcome attribution built on top of the existing DORA event stream:
- IOutcomeAnalyticsService (src/Telemetry/StellaOps.Telemetry.Core/StellaOps.Telemetry.Core/IOutcomeAnalyticsService.cs)
- DoraOutcomeAnalyticsService (src/Telemetry/StellaOps.Telemetry.Core/StellaOps.Telemetry.Core/DoraOutcomeAnalyticsService.cs)
- Outcome report models (src/Telemetry/StellaOps.Telemetry.Core/StellaOps.Telemetry.Core/OutcomeAnalyticsModels.cs)
Outcome attribution behavior:
- Produces OutcomeExecutiveReport for a fixed tenant/environment/time window with deterministic ordering.
- Adds MTTA support via DoraIncidentEvent.AcknowledgedAt and TimeToAcknowledge.
- Groups deployment outcomes by normalized pipeline (pipeline-a, pipeline-b, unknown) with per-pipeline change failure rate and median lead time.
- Groups incidents by severity with resolved/acknowledged counts plus MTTA/MTTR aggregates.
- Produces daily cohort slices across the requested date range for executive trend views.
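The per-pipeline grouping can be illustrated with a short Python sketch. The normalization rule (trim, lower-case, blank becomes unknown), the tuple input shape, and the output field names are assumptions for this example; DoraOutcomeAnalyticsService may normalize and order differently.

```python
from collections import defaultdict
from statistics import median


def normalize_pipeline(name):
    """Trim and lower-case; missing or blank names become 'unknown' (assumed)."""
    cleaned = (name or "").strip().lower()
    return cleaned or "unknown"


def pipeline_outcomes(deployments):
    """Group (pipeline, lead_time_hours, failed) tuples by normalized pipeline."""
    groups = defaultdict(list)
    for pipeline, lead_hours, failed in deployments:
        groups[normalize_pipeline(pipeline)].append((lead_hours, failed))

    report = []
    for pipeline in sorted(groups):  # fixed ordering keeps the report stable
        rows = groups[pipeline]
        failures = sum(1 for _, failed in rows if failed)
        report.append({
            "pipeline": pipeline,
            "deployments": len(rows),
            "change_failure_rate_percent": 100 * failures / len(rows),
            "median_lead_time_hours": median([lead for lead, _ in rows]),
        })
    return report
```

Sorting the group keys before emitting rows is what makes two runs over the same events produce the same report, regardless of event arrival order.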
Dependency injection integration:
TelemetryServiceCollectionExtensions.AddDoraMetrics(...) now also registers IOutcomeAnalyticsService, so existing telemetry entry points automatically expose attribution reporting without additional module wiring.
Verification coverage:
- src/Telemetry/StellaOps.Telemetry.Core/StellaOps.Telemetry.Core.Tests/OutcomeAnalyticsServiceTests.cs
- src/Telemetry/StellaOps.Telemetry.Core/StellaOps.Telemetry.Core.Tests/DoraMetricsServiceTests.cs
- The full telemetry core test suite (262 tests) remains green after integration.
Refer to the module README and implementation plan for immediate context, and update this document once component boundaries and data flows are finalised.
8) Federation DSSE Security Posture (Updated 2026-03-04)
Status:
- Advisory gap TEL-001 is closed. Federation consent and bundle paths now emit signed DSSE envelopes instead of payload passthrough placeholders.
Implemented contract:
- Consent and bundle envelopes now use explicit DSSE JSON structure: payloadType, base64 payload, and signatures[] (keyid, sig).
- Consent proofs and bundle summaries carry signer identity metadata (SignerKeyId) for auditability.
- Bundle payload canonicalization is deterministic for identical logical inputs:
  - bucket ordering: cveId (ordinal), then noisyCount (descending), artifactCount, observationCount
  - deterministic bundle ID derivation from canonical payload seed + fixed clock input
- Bundle verification enforces:
  - envelope digest integrity (sha256: over envelope bytes)
  - payload type match
  - trusted-key signature verification
  - consent digest linkage (consentDigest in payload must match ConsentDsseDigest)
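To make the canonicalization rule concrete, here is a Python sketch that sorts buckets with the documented key order and hashes a canonical JSON encoding. Several points are assumptions: artifactCount and observationCount are sorted ascending (the text marks only noisyCount as descending), the payload field name `buckets` is hypothetical, and the sorted-key JSON encoding stands in for whatever byte format the service actually uses.

```python
import hashlib
import json


def canonical_buckets(buckets):
    """Order by cveId (code point order), noisyCount descending, then counts."""
    return sorted(
        buckets,
        key=lambda b: (
            b["cveId"],
            -b["noisyCount"],        # descending, per the documented rule
            b["artifactCount"],      # ascending is an assumption
            b["observationCount"],   # ascending is an assumption
        ),
    )


def canonical_payload_bytes(buckets, consent_digest):
    """Encode the payload with sorted keys and no insignificant whitespace."""
    payload = {
        "consentDigest": consent_digest,
        "buckets": canonical_buckets(buckets),  # hypothetical field name
    }
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def payload_digest(buckets, consent_digest):
    """Return a sha256: prefixed digest of the canonical payload bytes."""
    data = canonical_payload_bytes(buckets, consent_digest)
    return "sha256:" + hashlib.sha256(data).hexdigest()
```

Because the ordering and the encoding are both fixed, the same logical set of buckets produces the same digest no matter what order the buckets arrive in.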
Signer/verifier integration and fallback:
- Federation now uses explicit abstractions:
  - IFederationDsseEnvelopeSigner
  - IFederationDsseEnvelopeVerifier
- Default adapter: HmacFederationDsseEnvelopeService (offline-safe HMAC-SHA256 DSSE sign/verify using local trusted key map in FederatedTelemetryOptions).
- Failure mode is deterministic and auditable:
  - signing failures throw FederationSignatureException with stable error codes (for example federation.dsse.sign_failed, federation.dsse.signer_unavailable)
  - optional unsigned fallback (AllowUnsignedDsseFallback) emits envelopes tagged with offline-unsigned-fallback for explicit operator visibility.
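The envelope shape and the HMAC-based sign/verify flow can be sketched in Python as follows. The signed bytes use the pre-authentication encoding (PAE) from the DSSE specification; whether HmacFederationDsseEnvelopeService signs exactly these bytes is not stated here, so treat that, the function names, and the example payload type as assumptions.

```python
import base64
import hashlib
import hmac


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding of the payload type and body."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)


def sign_envelope(payload_type: str, payload: bytes, key_id: str, key: bytes) -> dict:
    """Build an envelope with payloadType, base64 payload, and one signature."""
    sig = hmac.new(key, pae(payload_type, payload), hashlib.sha256).digest()
    return {
        "payloadType": payload_type,
        "payload": base64.b64encode(payload).decode("ascii"),
        "signatures": [
            {"keyid": key_id, "sig": base64.b64encode(sig).decode("ascii")}
        ],
    }


def verify_envelope(envelope: dict, expected_type: str, trusted_keys: dict) -> bool:
    """Check payload type, then accept if any trusted key validates a signature."""
    if envelope["payloadType"] != expected_type:
        return False
    payload = base64.b64decode(envelope["payload"])
    message = pae(envelope["payloadType"], payload)
    for signature in envelope["signatures"]:
        key = trusted_keys.get(signature["keyid"])
        if key is None:
            continue  # unknown key ids are ignored, never trusted
        expected = hmac.new(key, message, hashlib.sha256).digest()
        # Constant-time comparison avoids leaking how many bytes matched.
        if hmac.compare_digest(expected, base64.b64decode(signature["sig"])):
            return True
    return False
```

Binding the payload type into the signed bytes means an envelope cannot be replayed under a different type, which is why the type check and the signature check are both needed.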
Verification evidence:
- dotnet test src/Telemetry/StellaOps.Telemetry.Federation.Tests/StellaOps.Telemetry.Federation.Tests.csproj -m:1 -v minimal
- Result: 47 passed, 0 failed.
- Coverage includes payload tamper, signature tamper, wrong-key verification failure, consent expiry + signature validity combination, and deterministic replay digest checks.
Tracking sprint:
docs/implplan/SPRINT_20260304_307_Telemetry_federation_dsse_bundle_hardening.md