# Console Observability > **Audience:** Observability Guild, Console Guild, SRE/operators. > **Scope:** Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23). > **Prerequisites:** Console deployed with metrics enabled (`CONSOLE_METRICS_ENABLED=true`) and OTLP exporters configured (`OTEL_EXPORTER_OTLP_*`). --- ## 1 · Instrumentation Overview - **Telemetry stack:** OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose `/metrics` (Prometheus) and `/health/*`. - **Sampling:** Front-end spans sample at 5 % by default (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`). Metrics are un-sampled; log sampling is handled per category (§3). - **Correlation IDs:** Every API call carries `x-stellaops-correlation-id`; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI. - **Scope gating:** Operators need the `ui.telemetry` scope to view live charts in the Admin workspace; the scope also controls access to `/console/telemetry` SSE streams. --- ## 2 · Metrics ### 2.1 Experience & Navigation | Metric | Type | Labels | Notes | |--------|------|--------|-------| | `ui_route_render_seconds` | Histogram | `route`, `tenant`, `device` (`desktop`,`tablet`) | Time between route activation and first interactive paint. Target P95 ≤ 1.5 s (cached). | | `ui_request_duration_seconds` | Histogram | `service`, `method`, `status`, `tenant` | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. | | `ui_filter_apply_total` | Counter | `route`, `filter`, `tenant` | Increments when a global filter or context chip is applied. Used to track adoption of saved views. | | `ui_tenant_switch_total` | Counter | `fromTenant`, `toTenant`, `trigger` (`picker`, `shortcut`, `link`) | Emitted after a successful tenant switch; correlates with Authority `ui.tenant.switch` logs. | | `ui_offline_banner_seconds` | Histogram | `reason` (`authority`, `manifest`, `gateway`), `tenant` | Duration of offline banner visibility; integrate with air-gap SLAs. | ### 2.2 Security & Session | Metric | Type | Labels | Notes | |--------|------|--------|-------| | `ui_dpop_failure_total` | Counter | `endpoint`, `reason` (`nonce`, `jkt`, `clockSkew`) | Raised when DPoP validation fails; pair with Authority audit trail. | | `ui_fresh_auth_prompt_total` | Counter | `action` (`token.revoke`, `policy.activate`, `client.create`), `tenant` | Counts fresh-auth modals; backlog above baseline indicates workflow friction. | | `ui_fresh_auth_failure_total` | Counter | `action`, `reason` (`timeout`,`cancelled`,`auth_error`) | Optional metric (set `CONSOLE_FRESH_AUTH_METRICS=true` when feature flag lands). | ### 2.3 Downloads & Offline Kit | Metric | Type | Labels | Notes | |--------|------|--------|-------| | `ui_download_manifest_refresh_seconds` | Histogram | `tenant`, `channel` (`edge`,`stable`,`airgap`) | Time to fetch and verify downloads manifest. Target < 3 s. | | `ui_download_export_queue_depth` | Gauge | `tenant`, `artifactType` (`sbom`,`policy`,`attestation`,`console`) | Mirrors `/console/downloads` queue depth; triggers when offline bundles lag. | | `ui_download_command_copied_total` | Counter | `tenant`, `artifactType` | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. | ### 2.4 Telemetry Emission & Errors | Metric | Type | Labels | Notes | |--------|------|--------|-------| | `ui_telemetry_batch_failures_total` | Counter | `transport` (`otlp-http`,`otlp-grpc`), `reason` | Emitted by OTLP bridge when batches fail. Enable via `CONSOLE_METRICS_VERBOSE=true`. | | `ui_telemetry_queue_depth` | Gauge | `priority` (`normal`,`high`), `tenant` | Browser-side buffer depth; monitor for spikes under degraded collectors. | > **Scraping tips:** > - Enable `/metrics` via `CONSOLE_METRICS_ENABLED=true`. > - Set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318` and relevant headers (`OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer `). > - For air-gapped sites, point the exporter to the Offline Kit collector (`localhost:4318`) and forward the metrics snapshot using `stella offline bundle metrics`. --- ## 3 · Logs - **Format:** JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields: `timestamp`, `level`, `action`, `route`, `tenant`, `subject`, `correlationId`, `dpop.jkt`, `device`, `offlineMode`. - **Categories:** - `ui.action` – general user interactions (route changes, command palette, filter updates). Sampled 50 % by default; override with feature flag `telemetry.logVerbose`. - `ui.tenant.switch` – always logged; includes `fromTenant`, `toTenant`, `tokenId`, and Authority audit correlation. - `ui.download.commandCopied` – download commands copied; includes `artifactId`, `digest`, `manifestVersion`. - `ui.security.anomaly` – DPoP mismatches, tenant header errors, CSP violations (level = `Warning`). - `ui.telemetry.failure` – OTLP export errors; include `httpStatus`, `batchSize`, `retryCount`. - **PII handling:** Full emails are scrubbed; only hashed values (`user:`) appear unless `ui.admin` + fresh-auth were granted for the action (still redacted in logs). - **Retention:** Recommended 14 days for connected sites, 30 days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label `service="stellaops-web-ui"`. --- ## 4 · Traces - **Span names & attributes:** - `ui.route.transition` – wraps route navigation; attributes: `route`, `tenant`, `renderMillis`, `prefetchHit`. - `ui.api.fetch` – HTTP fetch to backend; attributes: `service`, `endpoint`, `status`, `networkTime`. - `ui.sse.stream` – Server-sent event subscriptions (status ticker, runs); attributes: `channel`, `connectedMillis`, `reconnects`. - `ui.telemetry.batch` – Browser OTLP flush; attributes: `batchSize`, `success`, `retryCount`. - `ui.policy.action` – Policy workspace actions (simulate, approve, activate) per `docs/ui/policy-editor.md`. - **Propagation:** Spans use W3C `traceparent`; gateway echoes header to backend APIs so traces stitch across UI → gateway → service. - **Sampling controls:** `OTEL_TRACES_SAMPLER_ARG` (ratio) and feature flag `telemetry.forceSampling` (sets to 100 % for incident debugging). - **Viewing traces:** Grafana Tempo or Jaeger via collector. Filter by `service.name = stellaops-console`. For cross-service debugging, filter on `correlationId` and `tenant`. --- ## 5 · Dashboards ### 5.1 Experience Overview Panels: - Route render histogram (P50/P90/P99) by route. - Backend call latency stacked by service (`ui_request_duration_seconds`). - Offline banner duration trend (`ui_offline_banner_seconds`). - Tenant switch volume vs failure rate (overlay `ui_dpop_failure_total`). - Command palette usage (`ui_filter_apply_total` + `ui.action` log counts). ### 5.2 Downloads & Offline Kit - Manifest refresh time chart (per channel). - Export queue depth gauge with alert thresholds. - CLI command adoption (bar chart per artifact type, using `ui_download_command_copied_total`). - Offline parity banner occurrences (`downloads.offlineParity` flag from API → derived metric). - Last Offline Kit import timestamp (join with Downloads API metadata). ### 5.3 Security & Session - Fresh-auth prompt counts vs success/fail ratios. - DPoP failure stacked by reason. - Tenant mismatch warnings (from `ui.security.anomaly` logs). - Scope usage heatmap (derived from Authority audit events + UI logs). - CSP violation counts (browser `securitypolicyviolation` listener forwarded to logs). > Capture screenshots for Grafana once dashboards stabilise (`docs/assets/ui/observability/*.png`). Replace placeholders before releasing the doc. --- ## 6 · Alerting | Alert | Condition | Suggested Action | |-------|-----------|------------------| | **ConsoleLatencyHigh** | `ui_route_render_seconds_bucket{le="1.5"}` drops below 0.95 for 3 intervals | Inspect route splits, check backend latencies, review CDN cache. | | **BackendLatencyHigh** | `ui_request_duration_seconds_sum / ui_request_duration_seconds_count` > 1 s for any service | Correlate with gateway/service dashboards; escalate to owning guild. | | **TenantSwitchFailures** | Increase in `ui_dpop_failure_total` or `ui.security.anomaly` (tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. | | **FreshAuthLoop** | `ui_fresh_auth_prompt_total` spikes with matching `ui_fresh_auth_failure_total` | Review Authority `/fresh-auth` endpoint, session timeout config, UX regressions. | | **OfflineBannerLong** | `ui_offline_banner_seconds` P95 > 120 s | Investigate Authority/gateway availability; verify Offline Kit freshness. | | **DownloadsBacklog** | `ui_download_export_queue_depth` > 5 for 10 min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline (`DOWNLOADS-CONSOLE-23-001`) is healthy. | | **TelemetryExportErrors** | `ui_telemetry_batch_failures_total` > 0 for ≥5 min | Check collector health, credentials, or TLS trust. | Integrate alerts with Notifier (`ui.alerts`) or existing Ops channels. Tag incidents with `component=console` for correlation. --- ## 7 · Feature Flags & Configuration | Flag / Env Var | Purpose | Default | |----------------|---------|---------| | `CONSOLE_FEATURE_FLAGS` | Enables UI modules (`runs`, `downloads`, `policies`, `telemetry`). Telemetry panel requires `telemetry`. | `runs,downloads,policies` | | `CONSOLE_METRICS_ENABLED` | Exposes `/metrics` for Prometheus scrape. | `true` | | `CONSOLE_METRICS_VERBOSE` | Emits additional batching metrics (`ui_telemetry_*`). | `false` | | `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`). Use `Debug` for incident sampling. | `Information` | | `CONSOLE_METRICS_SAMPLING` *(planned)* | Controls front-end span sampling ratio. Document once released. | `0.05` | | `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL; supports HTTPS. | unset | | `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers (auth). | unset | | `OTEL_EXPORTER_OTLP_INSECURE` | Allow HTTP (dev only). | `false` | | `OTEL_SERVICE_NAME` | Service tag for traces/logs. Set to `stellaops-console`. | auto | | `CONSOLE_TELEMETRY_SSE_ENABLED` | Enables `/console/telemetry` SSE feed for dashboards. | `true` | Feature flag changes should be tracked in release notes and mirrored in `/docs/ui/navigation.md` (shortcuts may change when modules toggle). --- ## 8 · Offline / Air-Gapped Workflow - Mirror the console image and telemetry collector as part of the Offline Kit (see `/docs/install/docker.md` §4). - Scrape metrics locally via `curl -k https://console.local/metrics > metrics.prom`; archive alongside logs for audits. - Use `stella offline kit import` to keep the downloads manifest in sync; dashboards display staleness using `ui_download_manifest_refresh_seconds`. - When collectors are unavailable, console queues OTLP batches (up to 5 min) and exposes backlog through `ui_telemetry_queue_depth`; export queue metrics to prove no data loss. - After reconnecting, run `stella console status --telemetry` *(CLI parity pending; see DOCS-CONSOLE-23-014)* or verify `ui_telemetry_batch_failures_total` resets to zero. - Retain telemetry bundles for 30 days per compliance guidelines; include Grafana JSON exports in audit packages. --- ## 9 · Compliance Checklist - [ ] `/metrics` scraped in staging & production; dashboards display `ui_route_render_seconds`, `ui_request_duration_seconds`, and downloads metrics. - [ ] OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki). - [ ] Alert rules from §6 implemented in monitoring stack with runbooks linked. - [ ] Feature flags documented and change-controlled; telemetry disabled only with approval. - [ ] DPoP/fresh-auth anomalies correlated with Authority audit logs during drill. - [ ] Offline capture workflow exercised; evidence stored in audit vault. - [ ] Screenshots of Grafana dashboards committed once they stabilise (update references). - [ ] Cross-links verified (`docs/deploy/console.md`, `docs/security/console-security.md`, `docs/ui/downloads.md`, `docs/ui/console-overview.md`). --- ## 10 · References - `/docs/deploy/console.md` – Metrics endpoint, OTLP config, health checks. - `/docs/security/console-security.md` – Security metrics & alert hints. - `/docs/ui/console-overview.md` – Telemetry primitives and performance budgets. - `/docs/ui/downloads.md` – Downloads metrics and parity workflow. - `/docs/observability/observability.md` – Platform-wide practices. - `/ops/telemetry-collector.md` & `/ops/telemetry-storage.md` – Collector deployment. - `/docs/install/docker.md` – Compose/Helm environment variables. --- *Last updated: 2025-10-28 (Sprint 23).*