Files
git.stella-ops.org/docs/observability/ui-telemetry.md
root 68da90a11a
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Restructure solution layout by module
2025-10-28 15:10:40 +02:00

13 KiB
Raw Blame History

Console Observability

Audience: Observability Guild, Console Guild, SRE/operators.
Scope: Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint23).
Prerequisites: Console deployed with metrics enabled (CONSOLE_METRICS_ENABLED=true) and OTLP exporters configured (OTEL_EXPORTER_OTLP_*).


1·Instrumentation Overview

  • Telemetry stack: OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose /metrics (Prometheus) and /health/*.
  • Sampling: Front-end spans sample at 5% by default (OTEL_TRACES_SAMPLER=parentbased_traceidratio). Metrics are un-sampled; log sampling is handled per category (§3).
  • Correlation IDs: Every API call carries x-stellaops-correlation-id; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.
  • Scope gating: Operators need the ui.telemetry scope to view live charts in the Admin workspace; the scope also controls access to /console/telemetry SSE streams.

2·Metrics

2.1 Experience & Navigation

Metric Type Labels Notes
ui_route_render_seconds Histogram route, tenant, device (desktop,tablet) Time between route activation and first interactive paint. Target P95 ≤1.5s (cached).
ui_request_duration_seconds Histogram service, method, status, tenant Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades.
ui_filter_apply_total Counter route, filter, tenant Increments when a global filter or context chip is applied. Used to track adoption of saved views.
ui_tenant_switch_total Counter fromTenant, toTenant, trigger (picker, shortcut, link) Emitted after a successful tenant switch; correlates with Authority ui.tenant.switch logs.
ui_offline_banner_seconds Histogram reason (authority, manifest, gateway), tenant Duration of offline banner visibility; integrate with air-gap SLAs.

2.2 Security & Session

Metric Type Labels Notes
ui_dpop_failure_total Counter endpoint, reason (nonce, jkt, clockSkew) Raised when DPoP validation fails; pair with Authority audit trail.
ui_fresh_auth_prompt_total Counter action (token.revoke, policy.activate, client.create), tenant Counts fresh-auth modals; backlog above baseline indicates workflow friction.
ui_fresh_auth_failure_total Counter action, reason (timeout,cancelled,auth_error) Optional metric (set CONSOLE_FRESH_AUTH_METRICS=true when feature flag lands).

2.3 Downloads & Offline Kit

Metric Type Labels Notes
ui_download_manifest_refresh_seconds Histogram tenant, channel (edge,stable,airgap) Time to fetch and verify downloads manifest. Target <3s.
ui_download_export_queue_depth Gauge tenant, artifactType (sbom,policy,attestation,console) Mirrors /console/downloads queue depth; triggers when offline bundles lag.
ui_download_command_copied_total Counter tenant, artifactType Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption.

2.4 Telemetry Emission & Errors

Metric Type Labels Notes
ui_telemetry_batch_failures_total Counter transport (otlp-http,otlp-grpc), reason Emitted by OTLP bridge when batches fail. Enable via CONSOLE_METRICS_VERBOSE=true.
ui_telemetry_queue_depth Gauge priority (normal,high), tenant Browser-side buffer depth; monitor for spikes under degraded collectors.

Scraping tips:

  • Enable /metrics via CONSOLE_METRICS_ENABLED=true.
  • Set OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318 and relevant headers (OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>).
  • For air-gapped sites, point the exporter to the Offline Kit collector (localhost:4318) and forward the metrics snapshot using stella offline bundle metrics.

3·Logs

  • Format: JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields: timestamp, level, action, route, tenant, subject, correlationId, dpop.jkt, device, offlineMode.
  • Categories:
    • ui.action general user interactions (route changes, command palette, filter updates). Sampled 50% by default; override with feature flag telemetry.logVerbose.
    • ui.tenant.switch always logged; includes fromTenant, toTenant, tokenId, and Authority audit correlation.
    • ui.download.commandCopied download commands copied; includes artifactId, digest, manifestVersion.
    • ui.security.anomaly DPoP mismatches, tenant header errors, CSP violations (level = Warning).
    • ui.telemetry.failure OTLP export errors; include httpStatus, batchSize, retryCount.
  • PII handling: Full emails are scrubbed; only hashed values (user:<sha256>) appear unless ui.admin + fresh-auth were granted for the action (still redacted in logs).
  • Retention: Recommended 14days for connected sites, 30days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label service="stellaops-web-ui".

4·Traces

  • Span names & attributes:
    • ui.route.transition wraps route navigation; attributes: route, tenant, renderMillis, prefetchHit.
    • ui.api.fetch HTTP fetch to backend; attributes: service, endpoint, status, networkTime.
    • ui.sse.stream Server-sent event subscriptions (status ticker, runs); attributes: channel, connectedMillis, reconnects.
    • ui.telemetry.batch Browser OTLP flush; attributes: batchSize, success, retryCount.
    • ui.policy.action Policy workspace actions (simulate, approve, activate) per docs/ui/policy-editor.md.
  • Propagation: Spans use W3C traceparent; gateway echoes header to backend APIs so traces stitch across UI → gateway → service.
  • Sampling controls: OTEL_TRACES_SAMPLER_ARG (ratio) and feature flag telemetry.forceSampling (sets to 100% for incident debugging).
  • Viewing traces: Grafana Tempo or Jaeger via collector. Filter by service.name = stellaops-console. For cross-service debugging, filter on correlationId and tenant.

5·Dashboards

5.1 Experience Overview

Panels:

  • Route render histogram (P50/P90/P99) by route.
  • Backend call latency stacked by service (ui_request_duration_seconds).
  • Offline banner duration trend (ui_offline_banner_seconds).
  • Tenant switch volume vs failure rate (overlay ui_dpop_failure_total).
  • Command palette usage (ui_filter_apply_total + ui.action log counts).

5.2 Downloads & Offline Kit

  • Manifest refresh time chart (per channel).
  • Export queue depth gauge with alert thresholds.
  • CLI command adoption (bar chart per artifact type, using ui_download_command_copied_total).
  • Offline parity banner occurrences (downloads.offlineParity flag from API → derived metric).
  • Last Offline Kit import timestamp (join with Downloads API metadata).

5.3 Security & Session

  • Fresh-auth prompt counts vs success/fail ratios.
  • DPoP failure stacked by reason.
  • Tenant mismatch warnings (from ui.security.anomaly logs).
  • Scope usage heatmap (derived from Authority audit events + UI logs).
  • CSP violation counts (browser securitypolicyviolation listener forwarded to logs).

Capture screenshots for Grafana once dashboards stabilise (docs/assets/ui/observability/*.png). Replace placeholders before releasing the doc.


6·Alerting

Alert Condition Suggested Action
ConsoleLatencyHigh ui_route_render_seconds_bucket{le="1.5"} drops below 0.95 for 3 intervals Inspect route splits, check backend latencies, review CDN cache.
BackendLatencyHigh ui_request_duration_seconds_sum / ui_request_duration_seconds_count > 1s for any service Correlate with gateway/service dashboards; escalate to owning guild.
TenantSwitchFailures Increase in ui_dpop_failure_total or ui.security.anomaly (tenant mismatch) > 3/min Validate Authority issuer, check clock skew, confirm tenant config.
FreshAuthLoop ui_fresh_auth_prompt_total spikes with matching ui_fresh_auth_failure_total Review Authority /fresh-auth endpoint, session timeout config, UX regressions.
OfflineBannerLong ui_offline_banner_seconds P95 > 120s Investigate Authority/gateway availability; verify Offline Kit freshness.
DownloadsBacklog ui_download_export_queue_depth > 5 for 10min OR queue age > alert threshold Ping Downloads service, ensure manifest pipeline (DOWNLOADS-CONSOLE-23-001) is healthy.
TelemetryExportErrors ui_telemetry_batch_failures_total > 0 for ≥5min Check collector health, credentials, or TLS trust.

Integrate alerts with Notifier (ui.alerts) or existing Ops channels. Tag incidents with component=console for correlation.


7·Feature Flags & Configuration

Flag / Env Var Purpose Default
CONSOLE_FEATURE_FLAGS Enables UI modules (runs, downloads, policies, telemetry). Telemetry panel requires telemetry. runs,downloads,policies
CONSOLE_METRICS_ENABLED Exposes /metrics for Prometheus scrape. true
CONSOLE_METRICS_VERBOSE Emits additional batching metrics (ui_telemetry_*). false
CONSOLE_LOG_LEVEL Minimum log level (Information, Debug). Use Debug for incident sampling. Information
CONSOLE_METRICS_SAMPLING (planned) Controls front-end span sampling ratio. Document once released. 0.05
OTEL_EXPORTER_OTLP_ENDPOINT Collector URL; supports HTTPS. unset
OTEL_EXPORTER_OTLP_HEADERS Comma-separated headers (auth). unset
OTEL_EXPORTER_OTLP_INSECURE Allow HTTP (dev only). false
OTEL_SERVICE_NAME Service tag for traces/logs. Set to stellaops-console. auto
CONSOLE_TELEMETRY_SSE_ENABLED Enables /console/telemetry SSE feed for dashboards. true

Feature flag changes should be tracked in release notes and mirrored in /docs/ui/navigation.md (shortcuts may change when modules toggle).


8·Offline / Air-Gapped Workflow

  • Mirror the console image and telemetry collector as part of the Offline Kit (see /docs/install/docker.md §4).
  • Scrape metrics locally via curl -k https://console.local/metrics > metrics.prom; archive alongside logs for audits.
  • Use stella offline kit import to keep the downloads manifest in sync; dashboards display staleness using ui_download_manifest_refresh_seconds.
  • When collectors are unavailable, console queues OTLP batches (up to 5min) and exposes backlog through ui_telemetry_queue_depth; export queue metrics to prove no data loss.
  • After reconnecting, run stella console status --telemetry (CLI parity pending; see DOCS-CONSOLE-23-014) or verify ui_telemetry_batch_failures_total resets to zero.
  • Retain telemetry bundles for 30days per compliance guidelines; include Grafana JSON exports in audit packages.

9·Compliance Checklist

  • /metrics scraped in staging & production; dashboards display ui_route_render_seconds, ui_request_duration_seconds, and downloads metrics.
  • OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).
  • Alert rules from §6 implemented in monitoring stack with runbooks linked.
  • Feature flags documented and change-controlled; telemetry disabled only with approval.
  • DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.
  • Offline capture workflow exercised; evidence stored in audit vault.
  • Screenshots of Grafana dashboards committed once they stabilise (update references).
  • Cross-links verified (docs/deploy/console.md, docs/security/console-security.md, docs/ui/downloads.md, docs/ui/console-overview.md).

10·References

  • /docs/deploy/console.md Metrics endpoint, OTLP config, health checks.
  • /docs/security/console-security.md Security metrics & alert hints.
  • /docs/ui/console-overview.md Telemetry primitives and performance budgets.
  • /docs/ui/downloads.md Downloads metrics and parity workflow.
  • /docs/observability/observability.md Platform-wide practices.
  • /ops/telemetry-collector.md & /ops/telemetry-storage.md Collector deployment.
  • /docs/install/docker.md Compose/Helm environment variables.

Last updated: 2025-10-28 (Sprint23).