Files
git.stella-ops.org/docs/observability/ui-telemetry.md
root 68da90a11a
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Restructure solution layout by module
2025-10-28 15:10:40 +02:00

192 lines
13 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Console Observability
> **Audience:** Observability Guild, Console Guild, SRE/operators.
> **Scope:** Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint23).
> **Prerequisites:** Console deployed with metrics enabled (`CONSOLE_METRICS_ENABLED=true`) and OTLP exporters configured (`OTEL_EXPORTER_OTLP_*`).
---
## 1·Instrumentation Overview
- **Telemetry stack:** OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose `/metrics` (Prometheus) and `/health/*`.
- **Sampling:** Front-end spans sample at 5% by default (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`). Metrics are un-sampled; log sampling is handled per category (§3).
- **Correlation IDs:** Every API call carries `x-stellaops-correlation-id`; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.
- **Scope gating:** Operators need the `ui.telemetry` scope to view live charts in the Admin workspace; the scope also controls access to `/console/telemetry` SSE streams.
---
## 2·Metrics
### 2.1 Experience & Navigation
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_route_render_seconds` | Histogram | `route`, `tenant`, `device` (`desktop`,`tablet`) | Time between route activation and first interactive paint. Target P95 ≤1.5s (cached). |
| `ui_request_duration_seconds` | Histogram | `service`, `method`, `status`, `tenant` | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. |
| `ui_filter_apply_total` | Counter | `route`, `filter`, `tenant` | Increments when a global filter or context chip is applied. Used to track adoption of saved views. |
| `ui_tenant_switch_total` | Counter | `fromTenant`, `toTenant`, `trigger` (`picker`, `shortcut`, `link`) | Emitted after a successful tenant switch; correlates with Authority `ui.tenant.switch` logs. |
| `ui_offline_banner_seconds` | Histogram | `reason` (`authority`, `manifest`, `gateway`), `tenant` | Duration of offline banner visibility; integrate with air-gap SLAs. |
### 2.2 Security & Session
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_dpop_failure_total` | Counter | `endpoint`, `reason` (`nonce`, `jkt`, `clockSkew`) | Raised when DPoP validation fails; pair with Authority audit trail. |
| `ui_fresh_auth_prompt_total` | Counter | `action` (`token.revoke`, `policy.activate`, `client.create`), `tenant` | Counts fresh-auth modals; backlog above baseline indicates workflow friction. |
| `ui_fresh_auth_failure_total` | Counter | `action`, `reason` (`timeout`,`cancelled`,`auth_error`) | Optional metric (set `CONSOLE_FRESH_AUTH_METRICS=true` when feature flag lands). |
### 2.3 Downloads & Offline Kit
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_download_manifest_refresh_seconds` | Histogram | `tenant`, `channel` (`edge`,`stable`,`airgap`) | Time to fetch and verify downloads manifest. Target <3s. |
| `ui_download_export_queue_depth` | Gauge | `tenant`, `artifactType` (`sbom`,`policy`,`attestation`,`console`) | Mirrors `/console/downloads` queue depth; triggers when offline bundles lag. |
| `ui_download_command_copied_total` | Counter | `tenant`, `artifactType` | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. |
### 2.4 Telemetry Emission & Errors
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_telemetry_batch_failures_total` | Counter | `transport` (`otlp-http`,`otlp-grpc`), `reason` | Emitted by OTLP bridge when batches fail. Enable via `CONSOLE_METRICS_VERBOSE=true`. |
| `ui_telemetry_queue_depth` | Gauge | `priority` (`normal`,`high`), `tenant` | Browser-side buffer depth; monitor for spikes under degraded collectors. |
> **Scraping tips:**
> - Enable `/metrics` via `CONSOLE_METRICS_ENABLED=true`.
> - Set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318` and relevant headers (`OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>`).
> - For air-gapped sites, point the exporter to the Offline Kit collector (`localhost:4318`) and forward the metrics snapshot using `stella offline bundle metrics`.
---
## 3·Logs
- **Format:** JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields: `timestamp`, `level`, `action`, `route`, `tenant`, `subject`, `correlationId`, `dpop.jkt`, `device`, `offlineMode`.
- **Categories:**
- `ui.action` general user interactions (route changes, command palette, filter updates). Sampled 50% by default; override with feature flag `telemetry.logVerbose`.
- `ui.tenant.switch` always logged; includes `fromTenant`, `toTenant`, `tokenId`, and Authority audit correlation.
- `ui.download.commandCopied` download commands copied; includes `artifactId`, `digest`, `manifestVersion`.
- `ui.security.anomaly` DPoP mismatches, tenant header errors, CSP violations (level = `Warning`).
- `ui.telemetry.failure` OTLP export errors; include `httpStatus`, `batchSize`, `retryCount`.
- **PII handling:** Full emails are scrubbed; only hashed values (`user:<sha256>`) appear unless `ui.admin` + fresh-auth were granted for the action (still redacted in logs).
- **Retention:** Recommended 14days for connected sites, 30days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label `service="stellaops-web-ui"`.
---
## 4·Traces
- **Span names & attributes:**
- `ui.route.transition` wraps route navigation; attributes: `route`, `tenant`, `renderMillis`, `prefetchHit`.
- `ui.api.fetch` HTTP fetch to backend; attributes: `service`, `endpoint`, `status`, `networkTime`.
- `ui.sse.stream` Server-sent event subscriptions (status ticker, runs); attributes: `channel`, `connectedMillis`, `reconnects`.
- `ui.telemetry.batch` Browser OTLP flush; attributes: `batchSize`, `success`, `retryCount`.
- `ui.policy.action` Policy workspace actions (simulate, approve, activate) per `docs/ui/policy-editor.md`.
- **Propagation:** Spans use W3C `traceparent`; gateway echoes header to backend APIs so traces stitch across UI gateway service.
- **Sampling controls:** `OTEL_TRACES_SAMPLER_ARG` (ratio) and feature flag `telemetry.forceSampling` (sets to 100% for incident debugging).
- **Viewing traces:** Grafana Tempo or Jaeger via collector. Filter by `service.name = stellaops-console`. For cross-service debugging, filter on `correlationId` and `tenant`.
---
## 5·Dashboards
### 5.1 Experience Overview
Panels:
- Route render histogram (P50/P90/P99) by route.
- Backend call latency stacked by service (`ui_request_duration_seconds`).
- Offline banner duration trend (`ui_offline_banner_seconds`).
- Tenant switch volume vs failure rate (overlay `ui_dpop_failure_total`).
- Command palette usage (`ui_filter_apply_total` + `ui.action` log counts).
### 5.2 Downloads & Offline Kit
- Manifest refresh time chart (per channel).
- Export queue depth gauge with alert thresholds.
- CLI command adoption (bar chart per artifact type, using `ui_download_command_copied_total`).
- Offline parity banner occurrences (`downloads.offlineParity` flag from API derived metric).
- Last Offline Kit import timestamp (join with Downloads API metadata).
### 5.3 Security & Session
- Fresh-auth prompt counts vs success/fail ratios.
- DPoP failure stacked by reason.
- Tenant mismatch warnings (from `ui.security.anomaly` logs).
- Scope usage heatmap (derived from Authority audit events + UI logs).
- CSP violation counts (browser `securitypolicyviolation` listener forwarded to logs).
> Capture screenshots for Grafana once dashboards stabilise (`docs/assets/ui/observability/*.png`). Replace placeholders before releasing the doc.
---
## 6·Alerting
| Alert | Condition | Suggested Action |
|-------|-----------|------------------|
| **ConsoleLatencyHigh** | `ui_route_render_seconds_bucket{le="1.5"}` drops below 0.95 for 3 intervals | Inspect route splits, check backend latencies, review CDN cache. |
| **BackendLatencyHigh** | `ui_request_duration_seconds_sum / ui_request_duration_seconds_count` > 1s for any service | Correlate with gateway/service dashboards; escalate to owning guild. |
| **TenantSwitchFailures** | Increase in `ui_dpop_failure_total` or `ui.security.anomaly` (tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. |
| **FreshAuthLoop** | `ui_fresh_auth_prompt_total` spikes with matching `ui_fresh_auth_failure_total` | Review Authority `/fresh-auth` endpoint, session timeout config, UX regressions. |
| **OfflineBannerLong** | `ui_offline_banner_seconds` P95 > 120s | Investigate Authority/gateway availability; verify Offline Kit freshness. |
| **DownloadsBacklog** | `ui_download_export_queue_depth` > 5 for 10min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline (`DOWNLOADS-CONSOLE-23-001`) is healthy. |
| **TelemetryExportErrors** | `ui_telemetry_batch_failures_total` > 0 for ≥5min | Check collector health, credentials, or TLS trust. |
Integrate alerts with Notifier (`ui.alerts`) or existing Ops channels. Tag incidents with `component=console` for correlation.
---
## 7·Feature Flags & Configuration
| Flag / Env Var | Purpose | Default |
|----------------|---------|---------|
| `CONSOLE_FEATURE_FLAGS` | Enables UI modules (`runs`, `downloads`, `policies`, `telemetry`). Telemetry panel requires `telemetry`. | `runs,downloads,policies` |
| `CONSOLE_METRICS_ENABLED` | Exposes `/metrics` for Prometheus scrape. | `true` |
| `CONSOLE_METRICS_VERBOSE` | Emits additional batching metrics (`ui_telemetry_*`). | `false` |
| `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`). Use `Debug` for incident sampling. | `Information` |
| `CONSOLE_METRICS_SAMPLING` *(planned)* | Controls front-end span sampling ratio. Document once released. | `0.05` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL; supports HTTPS. | unset |
| `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers (auth). | unset |
| `OTEL_EXPORTER_OTLP_INSECURE` | Allow HTTP (dev only). | `false` |
| `OTEL_SERVICE_NAME` | Service tag for traces/logs. Set to `stellaops-console`. | auto |
| `CONSOLE_TELEMETRY_SSE_ENABLED` | Enables `/console/telemetry` SSE feed for dashboards. | `true` |
Feature flag changes should be tracked in release notes and mirrored in `/docs/ui/navigation.md` (shortcuts may change when modules toggle).
---
## 8·Offline / Air-Gapped Workflow
- Mirror the console image and telemetry collector as part of the Offline Kit (see `/docs/install/docker.md` §4).
- Scrape metrics locally via `curl -k https://console.local/metrics > metrics.prom`; archive alongside logs for audits.
- Use `stella offline kit import` to keep the downloads manifest in sync; dashboards display staleness using `ui_download_manifest_refresh_seconds`.
- When collectors are unavailable, console queues OTLP batches (up to 5min) and exposes backlog through `ui_telemetry_queue_depth`; export queue metrics to prove no data loss.
- After reconnecting, run `stella console status --telemetry` *(CLI parity pending; see DOCS-CONSOLE-23-014)* or verify `ui_telemetry_batch_failures_total` resets to zero.
- Retain telemetry bundles for 30days per compliance guidelines; include Grafana JSON exports in audit packages.
---
## 9·Compliance Checklist
- [ ] `/metrics` scraped in staging & production; dashboards display `ui_route_render_seconds`, `ui_request_duration_seconds`, and downloads metrics.
- [ ] OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).
- [ ] Alert rules from §6 implemented in monitoring stack with runbooks linked.
- [ ] Feature flags documented and change-controlled; telemetry disabled only with approval.
- [ ] DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.
- [ ] Offline capture workflow exercised; evidence stored in audit vault.
- [ ] Screenshots of Grafana dashboards committed once they stabilise (update references).
- [ ] Cross-links verified (`docs/deploy/console.md`, `docs/security/console-security.md`, `docs/ui/downloads.md`, `docs/ui/console-overview.md`).
---
## 10·References
- `/docs/deploy/console.md` Metrics endpoint, OTLP config, health checks.
- `/docs/security/console-security.md` Security metrics & alert hints.
- `/docs/ui/console-overview.md` Telemetry primitives and performance budgets.
- `/docs/ui/downloads.md` Downloads metrics and parity workflow.
- `/docs/observability/observability.md` Platform-wide practices.
- `/ops/telemetry-collector.md` & `/ops/telemetry-storage.md` Collector deployment.
- `/docs/install/docker.md` Compose/Helm environment variables.
---
*Last updated: 2025-10-28 (Sprint23).*