192 lines
13 KiB
Markdown
192 lines
13 KiB
Markdown
# Console Observability
|
||
|
||
> **Audience:** Observability Guild, Console Guild, SRE/operators.
|
||
> **Scope:** Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).
|
||
> **Prerequisites:** Console deployed with metrics enabled (`CONSOLE_METRICS_ENABLED=true`) and OTLP exporters configured (`OTEL_EXPORTER_OTLP_*`).
|
||
|
||
---
|
||
|
||
## 1 · Instrumentation Overview
|
||
|
||
- **Telemetry stack:** OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose `/metrics` (Prometheus) and `/health/*`.
|
||
- **Sampling:** Front-end spans sample at 5 % by default (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`). Metrics are un-sampled; log sampling is handled per category (§3).
|
||
- **Correlation IDs:** Every API call carries `x-stellaops-correlation-id`; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.
|
||
- **Scope gating:** Operators need the `ui.telemetry` scope to view live charts in the Admin workspace; the scope also controls access to `/console/telemetry` SSE streams.
|
||
|
||
---
|
||
|
||
## 2 · Metrics
|
||
|
||
### 2.1 Experience & Navigation
|
||
|
||
| Metric | Type | Labels | Notes |
|
||
|--------|------|--------|-------|
|
||
| `ui_route_render_seconds` | Histogram | `route`, `tenant`, `device` (`desktop`,`tablet`) | Time between route activation and first interactive paint. Target P95 ≤ 1.5 s (cached). |
|
||
| `ui_request_duration_seconds` | Histogram | `service`, `method`, `status`, `tenant` | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. |
|
||
| `ui_filter_apply_total` | Counter | `route`, `filter`, `tenant` | Increments when a global filter or context chip is applied. Used to track adoption of saved views. |
|
||
| `ui_tenant_switch_total` | Counter | `fromTenant`, `toTenant`, `trigger` (`picker`, `shortcut`, `link`) | Emitted after a successful tenant switch; correlates with Authority `ui.tenant.switch` logs. |
|
||
| `ui_offline_banner_seconds` | Histogram | `reason` (`authority`, `manifest`, `gateway`), `tenant` | Duration of offline banner visibility; integrate with air-gap SLAs. |
|
||
|
||
### 2.2 Security & Session
|
||
|
||
| Metric | Type | Labels | Notes |
|
||
|--------|------|--------|-------|
|
||
| `ui_dpop_failure_total` | Counter | `endpoint`, `reason` (`nonce`, `jkt`, `clockSkew`) | Raised when DPoP validation fails; pair with Authority audit trail. |
|
||
| `ui_fresh_auth_prompt_total` | Counter | `action` (`token.revoke`, `policy.activate`, `client.create`), `tenant` | Counts fresh-auth modals; backlog above baseline indicates workflow friction. |
|
||
| `ui_fresh_auth_failure_total` | Counter | `action`, `reason` (`timeout`,`cancelled`,`auth_error`) | Optional metric (set `CONSOLE_FRESH_AUTH_METRICS=true` when feature flag lands). |
|
||
|
||
### 2.3 Downloads & Offline Kit
|
||
|
||
| Metric | Type | Labels | Notes |
|
||
|--------|------|--------|-------|
|
||
| `ui_download_manifest_refresh_seconds` | Histogram | `tenant`, `channel` (`edge`,`stable`,`airgap`) | Time to fetch and verify downloads manifest. Target < 3 s. |
|
||
| `ui_download_export_queue_depth` | Gauge | `tenant`, `artifactType` (`sbom`,`policy`,`attestation`,`console`) | Mirrors `/console/downloads` queue depth; triggers when offline bundles lag. |
|
||
| `ui_download_command_copied_total` | Counter | `tenant`, `artifactType` | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. |
|
||
|
||
### 2.4 Telemetry Emission & Errors
|
||
|
||
| Metric | Type | Labels | Notes |
|
||
|--------|------|--------|-------|
|
||
| `ui_telemetry_batch_failures_total` | Counter | `transport` (`otlp-http`,`otlp-grpc`), `reason` | Emitted by OTLP bridge when batches fail. Enable via `CONSOLE_METRICS_VERBOSE=true`. |
|
||
| `ui_telemetry_queue_depth` | Gauge | `priority` (`normal`,`high`), `tenant` | Browser-side buffer depth; monitor for spikes under degraded collectors. |
|
||
|
||
> **Scraping tips:**
|
||
> - Enable `/metrics` via `CONSOLE_METRICS_ENABLED=true`.
|
||
> - Set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318` and relevant headers (`OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>`).
|
||
> - For air-gapped sites, point the exporter to the Offline Kit collector (`localhost:4318`) and forward the metrics snapshot using `stella offline bundle metrics`.
|
||
|
||
---
|
||
|
||
## 3 · Logs
|
||
|
||
- **Format:** JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields: `timestamp`, `level`, `action`, `route`, `tenant`, `subject`, `correlationId`, `dpop.jkt`, `device`, `offlineMode`.
|
||
- **Categories:**
|
||
- `ui.action` – general user interactions (route changes, command palette, filter updates). Sampled 50 % by default; override with feature flag `telemetry.logVerbose`.
|
||
- `ui.tenant.switch` – always logged; includes `fromTenant`, `toTenant`, `tokenId`, and Authority audit correlation.
|
||
- `ui.download.commandCopied` – download commands copied; includes `artifactId`, `digest`, `manifestVersion`.
|
||
- `ui.security.anomaly` – DPoP mismatches, tenant header errors, CSP violations (level = `Warning`).
|
||
- `ui.telemetry.failure` – OTLP export errors; include `httpStatus`, `batchSize`, `retryCount`.
|
||
- **PII handling:** Full emails are scrubbed; only hashed values (`user:<sha256>`) appear unless `ui.admin` + fresh-auth were granted for the action (still redacted in logs).
|
||
- **Retention:** Recommended 14 days for connected sites, 30 days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label `service="stellaops-web-ui"`.
|
||
|
||
---
|
||
|
||
## 4 · Traces
|
||
|
||
- **Span names & attributes:**
|
||
- `ui.route.transition` – wraps route navigation; attributes: `route`, `tenant`, `renderMillis`, `prefetchHit`.
|
||
- `ui.api.fetch` – HTTP fetch to backend; attributes: `service`, `endpoint`, `status`, `networkTime`.
|
||
- `ui.sse.stream` – Server-sent event subscriptions (status ticker, runs); attributes: `channel`, `connectedMillis`, `reconnects`.
|
||
- `ui.telemetry.batch` – Browser OTLP flush; attributes: `batchSize`, `success`, `retryCount`.
|
||
- `ui.policy.action` – Policy workspace actions (simulate, approve, activate) per `docs/ui/policy-editor.md`.
|
||
- **Propagation:** Spans use W3C `traceparent`; gateway echoes header to backend APIs so traces stitch across UI → gateway → service.
|
||
- **Sampling controls:** `OTEL_TRACES_SAMPLER_ARG` (ratio) and feature flag `telemetry.forceSampling` (sets to 100 % for incident debugging).
|
||
- **Viewing traces:** Grafana Tempo or Jaeger via collector. Filter by `service.name = stellaops-console`. For cross-service debugging, filter on `correlationId` and `tenant`.
|
||
|
||
---
|
||
|
||
## 5 · Dashboards
|
||
|
||
### 5.1 Experience Overview
|
||
|
||
Panels:
|
||
- Route render histogram (P50/P90/P99) by route.
|
||
- Backend call latency stacked by service (`ui_request_duration_seconds`).
|
||
- Offline banner duration trend (`ui_offline_banner_seconds`).
|
||
- Tenant switch volume vs failure rate (overlay `ui_dpop_failure_total`).
|
||
- Command palette usage (`ui_filter_apply_total` + `ui.action` log counts).
|
||
|
||
### 5.2 Downloads & Offline Kit
|
||
|
||
- Manifest refresh time chart (per channel).
|
||
- Export queue depth gauge with alert thresholds.
|
||
- CLI command adoption (bar chart per artifact type, using `ui_download_command_copied_total`).
|
||
- Offline parity banner occurrences (`downloads.offlineParity` flag from API → derived metric).
|
||
- Last Offline Kit import timestamp (join with Downloads API metadata).
|
||
|
||
### 5.3 Security & Session
|
||
|
||
- Fresh-auth prompt counts vs success/fail ratios.
|
||
- DPoP failure stacked by reason.
|
||
- Tenant mismatch warnings (from `ui.security.anomaly` logs).
|
||
- Scope usage heatmap (derived from Authority audit events + UI logs).
|
||
- CSP violation counts (browser `securitypolicyviolation` listener forwarded to logs).
|
||
|
||
> Capture screenshots for Grafana once dashboards stabilise (`docs/assets/ui/observability/*.png`). Replace placeholders before releasing the doc.
|
||
|
||
---
|
||
|
||
## 6 · Alerting
|
||
|
||
| Alert | Condition | Suggested Action |
|
||
|-------|-----------|------------------|
|
||
| **ConsoleLatencyHigh** | `ui_route_render_seconds_bucket{le="1.5"}` drops below 0.95 for 3 intervals | Inspect route splits, check backend latencies, review CDN cache. |
|
||
| **BackendLatencyHigh** | `ui_request_duration_seconds_sum / ui_request_duration_seconds_count` > 1 s for any service | Correlate with gateway/service dashboards; escalate to owning guild. |
|
||
| **TenantSwitchFailures** | Increase in `ui_dpop_failure_total` or `ui.security.anomaly` (tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. |
|
||
| **FreshAuthLoop** | `ui_fresh_auth_prompt_total` spikes with matching `ui_fresh_auth_failure_total` | Review Authority `/fresh-auth` endpoint, session timeout config, UX regressions. |
|
||
| **OfflineBannerLong** | `ui_offline_banner_seconds` P95 > 120 s | Investigate Authority/gateway availability; verify Offline Kit freshness. |
|
||
| **DownloadsBacklog** | `ui_download_export_queue_depth` > 5 for 10 min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline (`DOWNLOADS-CONSOLE-23-001`) is healthy. |
|
||
| **TelemetryExportErrors** | `ui_telemetry_batch_failures_total` > 0 for ≥5 min | Check collector health, credentials, or TLS trust. |
|
||
|
||
Integrate alerts with Notifier (`ui.alerts`) or existing Ops channels. Tag incidents with `component=console` for correlation.
|
||
|
||
---
|
||
|
||
## 7 · Feature Flags & Configuration
|
||
|
||
| Flag / Env Var | Purpose | Default |
|
||
|----------------|---------|---------|
|
||
| `CONSOLE_FEATURE_FLAGS` | Enables UI modules (`runs`, `downloads`, `policies`, `telemetry`). Telemetry panel requires `telemetry`. | `runs,downloads,policies` |
|
||
| `CONSOLE_METRICS_ENABLED` | Exposes `/metrics` for Prometheus scrape. | `true` |
|
||
| `CONSOLE_METRICS_VERBOSE` | Emits additional batching metrics (`ui_telemetry_*`). | `false` |
|
||
| `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`). Use `Debug` for incident sampling. | `Information` |
|
||
| `CONSOLE_METRICS_SAMPLING` *(planned)* | Controls front-end span sampling ratio. Document once released. | `0.05` |
|
||
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL; supports HTTPS. | unset |
|
||
| `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers (auth). | unset |
|
||
| `OTEL_EXPORTER_OTLP_INSECURE` | Allow HTTP (dev only). | `false` |
|
||
| `OTEL_SERVICE_NAME` | Service tag for traces/logs. Set to `stellaops-console`. | auto |
|
||
| `CONSOLE_TELEMETRY_SSE_ENABLED` | Enables `/console/telemetry` SSE feed for dashboards. | `true` |
|
||
|
||
Feature flag changes should be tracked in release notes and mirrored in `/docs/ui/navigation.md` (shortcuts may change when modules toggle).
|
||
|
||
---
|
||
|
||
## 8 · Offline / Air-Gapped Workflow
|
||
|
||
- Mirror the console image and telemetry collector as part of the Offline Kit (see `/docs/install/docker.md` §4).
|
||
- Scrape metrics locally via `curl -k https://console.local/metrics > metrics.prom`; archive alongside logs for audits.
|
||
- Use `stella offline kit import` to keep the downloads manifest in sync; dashboards display staleness using `ui_download_manifest_refresh_seconds`.
|
||
- When collectors are unavailable, console queues OTLP batches (up to 5 min) and exposes backlog through `ui_telemetry_queue_depth`; export queue metrics to prove no data loss.
|
||
- After reconnecting, run `stella console status --telemetry` *(CLI parity pending; see DOCS-CONSOLE-23-014)* or verify `ui_telemetry_batch_failures_total` resets to zero.
|
||
- Retain telemetry bundles for 30 days per compliance guidelines; include Grafana JSON exports in audit packages.
|
||
|
||
---
|
||
|
||
## 9 · Compliance Checklist
|
||
|
||
- [ ] `/metrics` scraped in staging & production; dashboards display `ui_route_render_seconds`, `ui_request_duration_seconds`, and downloads metrics.
|
||
- [ ] OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).
|
||
- [ ] Alert rules from §6 implemented in monitoring stack with runbooks linked.
|
||
- [ ] Feature flags documented and change-controlled; telemetry disabled only with approval.
|
||
- [ ] DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.
|
||
- [ ] Offline capture workflow exercised; evidence stored in audit vault.
|
||
- [ ] Screenshots of Grafana dashboards committed once they stabilise (update references).
|
||
- [ ] Cross-links verified (`docs/deploy/console.md`, `docs/security/console-security.md`, `docs/ui/downloads.md`, `docs/ui/console-overview.md`).
|
||
|
||
---
|
||
|
||
## 10 · References
|
||
|
||
- `/docs/deploy/console.md` – Metrics endpoint, OTLP config, health checks.
|
||
- `/docs/security/console-security.md` – Security metrics & alert hints.
|
||
- `/docs/ui/console-overview.md` – Telemetry primitives and performance budgets.
|
||
- `/docs/ui/downloads.md` – Downloads metrics and parity workflow.
|
||
- `/docs/observability/observability.md` – Platform-wide practices.
|
||
- `/ops/telemetry-collector.md` & `/ops/telemetry-storage.md` – Collector deployment.
|
||
- `/docs/install/docker.md` – Compose/Helm environment variables.
|
||
|
||
---
|
||
|
||
*Last updated: 2025-10-28 (Sprint 23).*
|
||
|