13 KiB
Console Observability
Audience: Observability Guild, Console Guild, SRE/operators.
Scope: Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).
Prerequisites: Console deployed with metrics enabled (CONSOLE_METRICS_ENABLED=true) and OTLP exporters configured (OTEL_EXPORTER_OTLP_*).
1 · Instrumentation Overview
- Telemetry stack: OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose
/metrics(Prometheus) and/health/*. - Sampling: Front-end spans sample at 5 % by default (
OTEL_TRACES_SAMPLER=parentbased_traceidratio). Metrics are un-sampled; log sampling is handled per category (§3). - Correlation IDs: Every API call carries
x-stellaops-correlation-id; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI. - Scope gating: Operators need the
ui.telemetryscope to view live charts in the Admin workspace; the scope also controls access to/console/telemetrySSE streams.
2 · Metrics
2.1 Experience & Navigation
| Metric | Type | Labels | Notes |
|---|---|---|---|
ui_route_render_seconds |
Histogram | route, tenant, device (desktop,tablet) |
Time between route activation and first interactive paint. Target P95 ≤ 1.5 s (cached). |
ui_request_duration_seconds |
Histogram | service, method, status, tenant |
Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. |
ui_filter_apply_total |
Counter | route, filter, tenant |
Increments when a global filter or context chip is applied. Used to track adoption of saved views. |
ui_tenant_switch_total |
Counter | fromTenant, toTenant, trigger (picker, shortcut, link) |
Emitted after a successful tenant switch; correlates with Authority ui.tenant.switch logs. |
ui_offline_banner_seconds |
Histogram | reason (authority, manifest, gateway), tenant |
Duration of offline banner visibility; integrate with air-gap SLAs. |
2.2 Security & Session
| Metric | Type | Labels | Notes |
|---|---|---|---|
ui_dpop_failure_total |
Counter | endpoint, reason (nonce, jkt, clockSkew) |
Raised when DPoP validation fails; pair with Authority audit trail. |
ui_fresh_auth_prompt_total |
Counter | action (token.revoke, policy.activate, client.create), tenant |
Counts fresh-auth modals; backlog above baseline indicates workflow friction. |
ui_fresh_auth_failure_total |
Counter | action, reason (timeout,cancelled,auth_error) |
Optional metric (set CONSOLE_FRESH_AUTH_METRICS=true when feature flag lands). |
2.3 Downloads & Offline Kit
| Metric | Type | Labels | Notes |
|---|---|---|---|
ui_download_manifest_refresh_seconds |
Histogram | tenant, channel (edge,stable,airgap) |
Time to fetch and verify downloads manifest. Target < 3 s. |
ui_download_export_queue_depth |
Gauge | tenant, artifactType (sbom,policy,attestation,console) |
Mirrors /console/downloads queue depth; triggers when offline bundles lag. |
ui_download_command_copied_total |
Counter | tenant, artifactType |
Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. |
2.4 Telemetry Emission & Errors
| Metric | Type | Labels | Notes |
|---|---|---|---|
ui_telemetry_batch_failures_total |
Counter | transport (otlp-http,otlp-grpc), reason |
Emitted by OTLP bridge when batches fail. Enable via CONSOLE_METRICS_VERBOSE=true. |
ui_telemetry_queue_depth |
Gauge | priority (normal,high), tenant |
Browser-side buffer depth; monitor for spikes under degraded collectors. |
Scraping tips:
- Enable
/metricsviaCONSOLE_METRICS_ENABLED=true.- Set
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318and relevant headers (OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>).- For air-gapped sites, point the exporter to the Offline Kit collector (
localhost:4318) and forward the metrics snapshot usingstella offline bundle metrics.
3 · Logs
- Format: JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields:
timestamp,level,action,route,tenant,subject,correlationId,dpop.jkt,device,offlineMode. - Categories:
ui.action– general user interactions (route changes, command palette, filter updates). Sampled 50 % by default; override with feature flagtelemetry.logVerbose.ui.tenant.switch– always logged; includesfromTenant,toTenant,tokenId, and Authority audit correlation.ui.download.commandCopied– download commands copied; includesartifactId,digest,manifestVersion.ui.security.anomaly– DPoP mismatches, tenant header errors, CSP violations (level =Warning).ui.telemetry.failure– OTLP export errors; includehttpStatus,batchSize,retryCount.
- PII handling: Full emails are scrubbed; only hashed values (
user:<sha256>) appear unlessui.admin+ fresh-auth were granted for the action (still redacted in logs). - Retention: Recommended 14 days for connected sites, 30 days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label
service="stellaops-web-ui".
4 · Traces
- Span names & attributes:
ui.route.transition– wraps route navigation; attributes:route,tenant,renderMillis,prefetchHit.ui.api.fetch– HTTP fetch to backend; attributes:service,endpoint,status,networkTime.ui.sse.stream– Server-sent event subscriptions (status ticker, runs); attributes:channel,connectedMillis,reconnects.ui.telemetry.batch– Browser OTLP flush; attributes:batchSize,success,retryCount.ui.policy.action– Policy workspace actions (simulate, approve, activate) perdocs/ui/policy-editor.md.
- Propagation: Spans use W3C
traceparent; gateway echoes header to backend APIs so traces stitch across UI → gateway → service. - Sampling controls:
OTEL_TRACES_SAMPLER_ARG(ratio) and feature flagtelemetry.forceSampling(sets to 100 % for incident debugging). - Viewing traces: Grafana Tempo or Jaeger via collector. Filter by
service.name = stellaops-console. For cross-service debugging, filter oncorrelationIdandtenant.
5 · Dashboards
5.1 Experience Overview
Panels:
- Route render histogram (P50/P90/P99) by route.
- Backend call latency stacked by service (
ui_request_duration_seconds). - Offline banner duration trend (
ui_offline_banner_seconds). - Tenant switch volume vs failure rate (overlay
ui_dpop_failure_total). - Command palette usage (
ui_filter_apply_total+ui.actionlog counts).
5.2 Downloads & Offline Kit
- Manifest refresh time chart (per channel).
- Export queue depth gauge with alert thresholds.
- CLI command adoption (bar chart per artifact type, using
ui_download_command_copied_total). - Offline parity banner occurrences (
downloads.offlineParityflag from API → derived metric). - Last Offline Kit import timestamp (join with Downloads API metadata).
5.3 Security & Session
- Fresh-auth prompt counts vs success/fail ratios.
- DPoP failure stacked by reason.
- Tenant mismatch warnings (from
ui.security.anomalylogs). - Scope usage heatmap (derived from Authority audit events + UI logs).
- CSP violation counts (browser
securitypolicyviolationlistener forwarded to logs).
Capture screenshots for Grafana once dashboards stabilise (
docs/assets/ui/observability/*.png). Replace placeholders before releasing the doc.
6 · Alerting
| Alert | Condition | Suggested Action |
|---|---|---|
| ConsoleLatencyHigh | ui_route_render_seconds_bucket{le="1.5"} drops below 0.95 for 3 intervals |
Inspect route splits, check backend latencies, review CDN cache. |
| BackendLatencyHigh | ui_request_duration_seconds_sum / ui_request_duration_seconds_count > 1 s for any service |
Correlate with gateway/service dashboards; escalate to owning guild. |
| TenantSwitchFailures | Increase in ui_dpop_failure_total or ui.security.anomaly (tenant mismatch) > 3/min |
Validate Authority issuer, check clock skew, confirm tenant config. |
| FreshAuthLoop | ui_fresh_auth_prompt_total spikes with matching ui_fresh_auth_failure_total |
Review Authority /fresh-auth endpoint, session timeout config, UX regressions. |
| OfflineBannerLong | ui_offline_banner_seconds P95 > 120 s |
Investigate Authority/gateway availability; verify Offline Kit freshness. |
| DownloadsBacklog | ui_download_export_queue_depth > 5 for 10 min OR queue age > alert threshold |
Ping Downloads service, ensure manifest pipeline (DOWNLOADS-CONSOLE-23-001) is healthy. |
| TelemetryExportErrors | ui_telemetry_batch_failures_total > 0 for ≥5 min |
Check collector health, credentials, or TLS trust. |
Integrate alerts with Notifier (ui.alerts) or existing Ops channels. Tag incidents with component=console for correlation.
7 · Feature Flags & Configuration
| Flag / Env Var | Purpose | Default |
|---|---|---|
CONSOLE_FEATURE_FLAGS |
Enables UI modules (runs, downloads, policies, telemetry). Telemetry panel requires telemetry. |
runs,downloads,policies |
CONSOLE_METRICS_ENABLED |
Exposes /metrics for Prometheus scrape. |
true |
CONSOLE_METRICS_VERBOSE |
Emits additional batching metrics (ui_telemetry_*). |
false |
CONSOLE_LOG_LEVEL |
Minimum log level (Information, Debug). Use Debug for incident sampling. |
Information |
CONSOLE_METRICS_SAMPLING (planned) |
Controls front-end span sampling ratio. Document once released. | 0.05 |
OTEL_EXPORTER_OTLP_ENDPOINT |
Collector URL; supports HTTPS. | unset |
OTEL_EXPORTER_OTLP_HEADERS |
Comma-separated headers (auth). | unset |
OTEL_EXPORTER_OTLP_INSECURE |
Allow HTTP (dev only). | false |
OTEL_SERVICE_NAME |
Service tag for traces/logs. Set to stellaops-console. |
auto |
CONSOLE_TELEMETRY_SSE_ENABLED |
Enables /console/telemetry SSE feed for dashboards. |
true |
Feature flag changes should be tracked in release notes and mirrored in /docs/ui/navigation.md (shortcuts may change when modules toggle).
8 · Offline / Air-Gapped Workflow
- Mirror the console image and telemetry collector as part of the Offline Kit (see
/docs/install/docker.md§4). - Scrape metrics locally via
curl -k https://console.local/metrics > metrics.prom; archive alongside logs for audits. - Use
stella offline kit importto keep the downloads manifest in sync; dashboards display staleness usingui_download_manifest_refresh_seconds. - When collectors are unavailable, console queues OTLP batches (up to 5 min) and exposes backlog through
ui_telemetry_queue_depth; export queue metrics to prove no data loss. - After reconnecting, run
stella console status --telemetry(CLI parity pending; see DOCS-CONSOLE-23-014) or verifyui_telemetry_batch_failures_totalresets to zero. - Retain telemetry bundles for 30 days per compliance guidelines; include Grafana JSON exports in audit packages.
9 · Compliance Checklist
/metricsscraped in staging & production; dashboards displayui_route_render_seconds,ui_request_duration_seconds, and downloads metrics.- OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).
- Alert rules from §6 implemented in monitoring stack with runbooks linked.
- Feature flags documented and change-controlled; telemetry disabled only with approval.
- DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.
- Offline capture workflow exercised; evidence stored in audit vault.
- Screenshots of Grafana dashboards committed once they stabilise (update references).
- Cross-links verified (
docs/deploy/console.md,docs/security/console-security.md,docs/ui/downloads.md,docs/ui/console-overview.md).
10 · References
/docs/deploy/console.md– Metrics endpoint, OTLP config, health checks./docs/security/console-security.md– Security metrics & alert hints./docs/ui/console-overview.md– Telemetry primitives and performance budgets./docs/ui/downloads.md– Downloads metrics and parity workflow./docs/observability/observability.md– Platform-wide practices./ops/telemetry-collector.md&/ops/telemetry-storage.md– Collector deployment./docs/install/docker.md– Compose/Helm environment variables.
Last updated: 2025-10-28 (Sprint 23).