Restructure solution layout by module

2025-10-28 15:10:40 +02:00
parent 4e3e575db5
commit 68da90a11a
4103 changed files with 192899 additions and 187024 deletions
--- a/docs/observability/policy.md
+++ b/docs/observability/policy.md
@@ -1,166 +1,166 @@
-# Policy Engine Observability
-
-> **Audience:** Observability Guild, SRE/Platform operators, Policy Guild.  
-> **Scope:** Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).  
-> **Prerequisites:** Policy Engine v2 deployed with OpenTelemetry exporters enabled (`observability:enabled=true` in config).
-
---
-
-## 1 · Instrumentation Overview
-
- **Telemetry stack:** OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
- **Namespace conventions:** `policy.*` for metrics/traces/log categories; labels use `tenant`, `policy`, `mode`, `runId`.
- **Sampling:** Default 10 % trace sampling, 1 % rule-hit log sampling; incident mode overrides to 100 % (see §6).
- **Correlation IDs:** Every API request gets `traceId` + `requestId`. CLI/UI display IDs to streamline support.
-
---
-
-## 2 · Metrics
-
-### 2.1 Run Pipeline
-
-| Metric | Type | Labels | Notes |
-|--------|------|--------|-------|
-| `policy_run_seconds` | Histogram | `tenant`, `policy`, `mode` (`full`, `incremental`, `simulate`) | P95 target ≤ 5 min incremental, ≤ 30 min full. |
-| `policy_run_queue_depth` | Gauge | `tenant` | Number of pending jobs per tenant (updated each enqueue/dequeue). |
-| `policy_run_failures_total` | Counter | `tenant`, `policy`, `reason` (`err_pol_*`, `network`, `cancelled`) | Aligns with error codes. |
-| `policy_run_retries_total` | Counter | `tenant`, `policy` | Helps identify noisy sources. |
-| `policy_run_inputs_pending_bytes` | Gauge | `tenant` | Size of buffered change batches awaiting run. |
-
-### 2.2 Evaluator Insights
-
-| Metric | Type | Labels | Notes |
-|--------|------|--------|-------|
-| `policy_rules_fired_total` | Counter | `tenant`, `policy`, `rule` | Increment per rule match (sampled). |
-| `policy_vex_overrides_total` | Counter | `tenant`, `policy`, `vendor`, `justification` | Tracks VEX precedence decisions. |
-| `policy_suppressions_total` | Counter | `tenant`, `policy`, `action` (`ignore`, `warn`, `quiet`) | Audits suppression usage. |
-| `policy_selection_batch_duration_seconds` | Histogram | `tenant`, `policy` | Measures joiner performance. |
-| `policy_materialization_conflicts_total` | Counter | `tenant`, `policy` | Non-zero indicates optimistic concurrency retries. |
-
-### 2.3 API Surface
-
-| Metric | Type | Labels | Notes |
-|--------|------|--------|-------|
-| `policy_api_requests_total` | Counter | `endpoint`, `method`, `status` | Exposed via Minimal API instrumentation. |
-| `policy_api_latency_seconds` | Histogram | `endpoint`, `method` | Budget ≤ 250 ms for GETs, ≤ 1 s for POSTs. |
-| `policy_api_rate_limited_total` | Counter | `endpoint` | Tied to throttles (`429`). |
-
-### 2.4 Queue & Change Streams
-
-| Metric | Type | Labels | Notes |
-|--------|------|--------|-------|
-| `policy_queue_leases_active` | Gauge | `tenant` | Number of leased jobs. |
-| `policy_queue_lease_expirations_total` | Counter | `tenant` | Alerts when workers fail to ack. |
-| `policy_delta_backlog_age_seconds` | Gauge | `tenant`, `source` (`concelier`, `excititor`, `sbom`) | Age of oldest unprocessed change event. |
-
---
-
-## 3 · Logs
-
- **Format:** JSON (`Serilog`). Core fields: `timestamp`, `level`, `message`, `policyId`, `policyVersion`, `tenant`, `runId`, `rule`, `traceId`, `env.sealed`, `error.code`.
- **Log categories:**
-  - `policy.run` (queue lifecycle, run begin/end, stats)
-  - `policy.evaluate` (batch execution summaries; rule-hit sampling)
-  - `policy.materialize` (Mongo operations, conflicts, retries)
-  - `policy.simulate` (diff results, CLI invocation metadata)
-  - `policy.lifecycle` (submit/review/approve events)
- **Sampling:** Rule-hit logs sample 1 % by default; toggled to 100 % in incident mode or when `--trace` flag used in CLI.
- **PII:** No user secrets recorded; user identities referenced as `user:<id>` or `group:<id>` only.
-
---
-
-## 4 · Traces
-
- Spans emit via OpenTelemetry instrumentation.
- **Primary spans:**
-  - `policy.api` – wraps HTTP request, records `endpoint`, `status`, `scope`.
-  - `policy.select` – change stream ingestion and batch assembly (attributes: `candidateCount`, `cursor`).
-  - `policy.evaluate` – evaluation batch (attributes: `batchSize`, `ruleHits`, `severityChanges`).
-  - `policy.materialize` – Mongo writes (attributes: `writes`, `historyWrites`, `retryCount`).
-  - `policy.simulate` – simulation diff generation (attributes: `sbomCount`, `diffAdded`, `diffRemoved`).
- Trace context propagated to CLI via response headers `traceparent`; UI surfaces in run detail view.
- Incident mode forces span sampling to 100 % and extends retention via Collector config override.
-
---
-
-## 5 · Dashboards
-
-### 5.1 Policy Runs Overview
-
-Widgets:
- Run duration histogram (per mode/tenant).
- Queue depth + backlog age line charts.
- Failure rate stacked by error code.
- Incremental backlog heatmap (policy × age).
- Active vs scheduled runs table.
-
-### 5.2 Rule Impact & VEX
-
- Top N rules by firings (bar chart).
- VEX overrides by vendor/justification (stacked chart).
- Suppression usage (pie + table with justifications).
- Quieted findings trend (line).
-
-### 5.3 Simulation & Approval Health
-
- Simulation diff histogram (added vs removed).
- Pending approvals by age (table with SLA colour coding).
- Compliance checklist status (lint, determinism CI, simulation evidence).
-
-> Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (`../assets/policy-observability/*.png`).
-
---
-
-## 6 · Alerting
-
-| Alert | Condition | Suggested Action |
-|-------|-----------|------------------|
-| **PolicyRunSlaBreach** | `policy_run_seconds{mode="incremental"}` P95 > 300 s for 3 windows | Check queue depth, upstream services, scale worker pool. |
-| **PolicyQueueStuck** | `policy_delta_backlog_age_seconds` > 600 | Investigate change stream connectivity. |
-| **DeterminismMismatch** | Run status `failed` with `ERR_POL_004` OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. |
-| **SimulationDrift** | CLI/CI simulation exit `20` (blocking diff) over threshold | Review policy changes before approval. |
-| **VexOverrideSpike** | `policy_vex_overrides_total` > configured baseline (per vendor) | Verify upstream VEX feed; ensure justification codes expected. |
-| **SuppressionSurge** | `policy_suppressions_total` increase > 3σ vs baseline | Audit new suppress rules; check approvals. |
-
-Alerts integrate with Notifier channels (`policy.alerts`) and Ops on-call rotations.
-
---
-
-## 7 · Incident Mode & Forensics
-
- Toggle via `POST /api/policy/incidents/activate` (requires `policy:operate` scope).
- Effects:
-  - Trace sampling → 100 %.
-  - Rule-hit log sampling → 100 %.
-  - Retention window extended to 30 days for incident duration.
-  - `policy.incident.activated` event emitted (Console + Notifier banners).
- Post-incident tasks:
-  - `stella policy run replay` for affected runs; attach bundles to incident record.
-  - Restore sampling defaults with `.../deactivate`.
-  - Update incident checklist in `/docs/policy/lifecycle.md` (section 8) with findings.
-
---
-
-## 8 · Integration Points
-
- **Authority:** Exposes metric `policy_scope_denied_total` for failed authorisation; correlate with `policy_api_requests_total`.
- **Concelier/Excititor:** Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
- **Scheduler:** Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
- **Offline Kit:** CLI exports logs + metrics snapshots (`stella offline bundle metrics`) for air-gapped audits.
-
---
-
-## 9 · Compliance Checklist
-
- [ ] **Metrics registered:** All metrics listed above exported and documented in Grafana dashboards.
- [ ] **Alert policies configured:** Ops or Observability Guild created alerts matching table in §6.
- [ ] **Sampling overrides tested:** Incident mode toggles verified in staging; retention roll-back rehearsed.
- [ ] **Trace propagation validated:** CLI/UI display trace IDs and allow copy for support.
- [ ] **Log scrubbing enforced:** Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
- [ ] **Offline capture rehearsed:** Metrics/log snapshot commands executed in sealed environment.
- [ ] **Docs cross-links:** Links to architecture, runs, lifecycle, CLI, API docs verified.
-
---
-
-*Last updated: 2025-10-26 (Sprint 20).*
-
+# Policy Engine Observability
+
+> **Audience:** Observability Guild, SRE/Platform operators, Policy Guild.  
+> **Scope:** Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).  
+> **Prerequisites:** Policy Engine v2 deployed with OpenTelemetry exporters enabled (`observability:enabled=true` in config).
+
+---
+
+## 1 · Instrumentation Overview
+
+- **Telemetry stack:** OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
+- **Namespace conventions:** `policy.*` for metrics/traces/log categories; labels use `tenant`, `policy`, `mode`, `runId`.
+- **Sampling:** Default 10 % trace sampling, 1 % rule-hit log sampling; incident mode overrides to 100 % (see §6).
+- **Correlation IDs:** Every API request gets `traceId` + `requestId`. CLI/UI display IDs to streamline support.
+
+---
+
+## 2 · Metrics
+
+### 2.1 Run Pipeline
+
+| Metric | Type | Labels | Notes |
+|--------|------|--------|-------|
+| `policy_run_seconds` | Histogram | `tenant`, `policy`, `mode` (`full`, `incremental`, `simulate`) | P95 target ≤ 5 min incremental, ≤ 30 min full. |
+| `policy_run_queue_depth` | Gauge | `tenant` | Number of pending jobs per tenant (updated each enqueue/dequeue). |
+| `policy_run_failures_total` | Counter | `tenant`, `policy`, `reason` (`err_pol_*`, `network`, `cancelled`) | Aligns with error codes. |
+| `policy_run_retries_total` | Counter | `tenant`, `policy` | Helps identify noisy sources. |
+| `policy_run_inputs_pending_bytes` | Gauge | `tenant` | Size of buffered change batches awaiting run. |
+
+### 2.2 Evaluator Insights
+
+| Metric | Type | Labels | Notes |
+|--------|------|--------|-------|
+| `policy_rules_fired_total` | Counter | `tenant`, `policy`, `rule` | Increment per rule match (sampled). |
+| `policy_vex_overrides_total` | Counter | `tenant`, `policy`, `vendor`, `justification` | Tracks VEX precedence decisions. |
+| `policy_suppressions_total` | Counter | `tenant`, `policy`, `action` (`ignore`, `warn`, `quiet`) | Audits suppression usage. |
+| `policy_selection_batch_duration_seconds` | Histogram | `tenant`, `policy` | Measures joiner performance. |
+| `policy_materialization_conflicts_total` | Counter | `tenant`, `policy` | Non-zero indicates optimistic concurrency retries. |
+
+### 2.3 API Surface
+
+| Metric | Type | Labels | Notes |
+|--------|------|--------|-------|
+| `policy_api_requests_total` | Counter | `endpoint`, `method`, `status` | Exposed via Minimal API instrumentation. |
+| `policy_api_latency_seconds` | Histogram | `endpoint`, `method` | Budget ≤ 250 ms for GETs, ≤ 1 s for POSTs. |
+| `policy_api_rate_limited_total` | Counter | `endpoint` | Tied to throttles (`429`). |
+
+### 2.4 Queue & Change Streams
+
+| Metric | Type | Labels | Notes |
+|--------|------|--------|-------|
+| `policy_queue_leases_active` | Gauge | `tenant` | Number of leased jobs. |
+| `policy_queue_lease_expirations_total` | Counter | `tenant` | Alerts when workers fail to ack. |
+| `policy_delta_backlog_age_seconds` | Gauge | `tenant`, `source` (`concelier`, `excititor`, `sbom`) | Age of oldest unprocessed change event. |
+
+---
+
+## 3 · Logs
+
+- **Format:** JSON (`Serilog`). Core fields: `timestamp`, `level`, `message`, `policyId`, `policyVersion`, `tenant`, `runId`, `rule`, `traceId`, `env.sealed`, `error.code`.
+- **Log categories:**
+  - `policy.run` (queue lifecycle, run begin/end, stats)
+  - `policy.evaluate` (batch execution summaries; rule-hit sampling)
+  - `policy.materialize` (Mongo operations, conflicts, retries)
+  - `policy.simulate` (diff results, CLI invocation metadata)
+  - `policy.lifecycle` (submit/review/approve events)
+- **Sampling:** Rule-hit logs sample 1 % by default; toggled to 100 % in incident mode or when `--trace` flag used in CLI.
+- **PII:** No user secrets recorded; user identities referenced as `user:<id>` or `group:<id>` only.
+
+---
+
+## 4 · Traces
+
+- Spans emit via OpenTelemetry instrumentation.
+- **Primary spans:**
+  - `policy.api` – wraps HTTP request, records `endpoint`, `status`, `scope`.
+  - `policy.select` – change stream ingestion and batch assembly (attributes: `candidateCount`, `cursor`).
+  - `policy.evaluate` – evaluation batch (attributes: `batchSize`, `ruleHits`, `severityChanges`).
+  - `policy.materialize` – Mongo writes (attributes: `writes`, `historyWrites`, `retryCount`).
+  - `policy.simulate` – simulation diff generation (attributes: `sbomCount`, `diffAdded`, `diffRemoved`).
+- Trace context propagated to CLI via response headers `traceparent`; UI surfaces in run detail view.
+- Incident mode forces span sampling to 100 % and extends retention via Collector config override.
+
+---
+
+## 5 · Dashboards
+
+### 5.1 Policy Runs Overview
+
+Widgets:
+- Run duration histogram (per mode/tenant).
+- Queue depth + backlog age line charts.
+- Failure rate stacked by error code.
+- Incremental backlog heatmap (policy × age).
+- Active vs scheduled runs table.
+
+### 5.2 Rule Impact & VEX
+
+- Top N rules by firings (bar chart).
+- VEX overrides by vendor/justification (stacked chart).
+- Suppression usage (pie + table with justifications).
+- Quieted findings trend (line).
+
+### 5.3 Simulation & Approval Health
+
+- Simulation diff histogram (added vs removed).
+- Pending approvals by age (table with SLA colour coding).
+- Compliance checklist status (lint, determinism CI, simulation evidence).
+
+> Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (`../assets/policy-observability/*.png`).
+
+---
+
+## 6 · Alerting
+
+| Alert | Condition | Suggested Action |
+|-------|-----------|------------------|
+| **PolicyRunSlaBreach** | `policy_run_seconds{mode="incremental"}` P95 > 300 s for 3 windows | Check queue depth, upstream services, scale worker pool. |
+| **PolicyQueueStuck** | `policy_delta_backlog_age_seconds` > 600 | Investigate change stream connectivity. |
+| **DeterminismMismatch** | Run status `failed` with `ERR_POL_004` OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. |
+| **SimulationDrift** | CLI/CI simulation exit `20` (blocking diff) over threshold | Review policy changes before approval. |
+| **VexOverrideSpike** | `policy_vex_overrides_total` > configured baseline (per vendor) | Verify upstream VEX feed; ensure justification codes expected. |
+| **SuppressionSurge** | `policy_suppressions_total` increase > 3σ vs baseline | Audit new suppress rules; check approvals. |
+
+Alerts integrate with Notifier channels (`policy.alerts`) and Ops on-call rotations.
+
+---
+
+## 7 · Incident Mode & Forensics
+
+- Toggle via `POST /api/policy/incidents/activate` (requires `policy:operate` scope).
+- Effects:
+  - Trace sampling → 100 %.
+  - Rule-hit log sampling → 100 %.
+  - Retention window extended to 30 days for incident duration.
+  - `policy.incident.activated` event emitted (Console + Notifier banners).
+- Post-incident tasks:
+  - `stella policy run replay` for affected runs; attach bundles to incident record.
+  - Restore sampling defaults with `.../deactivate`.
+  - Update incident checklist in `/docs/policy/lifecycle.md` (section 8) with findings.
+
+---
+
+## 8 · Integration Points
+
+- **Authority:** Exposes metric `policy_scope_denied_total` for failed authorisation; correlate with `policy_api_requests_total`.
+- **Concelier/Excititor:** Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
+- **Scheduler:** Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
+- **Offline Kit:** CLI exports logs + metrics snapshots (`stella offline bundle metrics`) for air-gapped audits.
+
+---
+
+## 9 · Compliance Checklist
+
+- [ ] **Metrics registered:** All metrics listed above exported and documented in Grafana dashboards.
+- [ ] **Alert policies configured:** Ops or Observability Guild created alerts matching table in §6.
+- [ ] **Sampling overrides tested:** Incident mode toggles verified in staging; retention roll-back rehearsed.
+- [ ] **Trace propagation validated:** CLI/UI display trace IDs and allow copy for support.
+- [ ] **Log scrubbing enforced:** Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
+- [ ] **Offline capture rehearsed:** Metrics/log snapshot commands executed in sealed environment.
+- [ ] **Docs cross-links:** Links to architecture, runs, lifecycle, CLI, API docs verified.
+
+---
+
+*Last updated: 2025-10-26 (Sprint 20).*
+
--- a/docs/observability/ui-telemetry.md
+++ b/docs/observability/ui-telemetry.md
@@ -1,191 +1,191 @@
-# Console Observability
-
-> **Audience:** Observability Guild, Console Guild, SRE/operators.  
-> **Scope:** Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).  
-> **Prerequisites:** Console deployed with metrics enabled (`CONSOLE_METRICS_ENABLED=true`) and OTLP exporters configured (`OTEL_EXPORTER_OTLP_*`).
-
---
-
-## 1 · Instrumentation Overview
-
- **Telemetry stack:** OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose `/metrics` (Prometheus) and `/health/*`.  
- **Sampling:** Front-end spans sample at 5 % by default (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`). Metrics are un-sampled; log sampling is handled per category (§3).  
- **Correlation IDs:** Every API call carries `x-stellaops-correlation-id`; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.  
- **Scope gating:** Operators need the `ui.telemetry` scope to view live charts in the Admin workspace; the scope also controls access to `/console/telemetry` SSE streams.
-
---
-
-## 2 · Metrics
-
-### 2.1 Experience & Navigation
-
-| Metric | Type | Labels | Notes |
-|--------|------|--------|-------|
-| `ui_route_render_seconds` | Histogram | `route`, `tenant`, `device` (`desktop`,`tablet`) | Time between route activation and first interactive paint. Target P95 ≤ 1.5 s (cached). |
-| `ui_request_duration_seconds` | Histogram | `service`, `method`, `status`, `tenant` | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. |
-| `ui_filter_apply_total` | Counter | `route`, `filter`, `tenant` | Increments when a global filter or context chip is applied. Used to track adoption of saved views. |
-| `ui_tenant_switch_total` | Counter | `fromTenant`, `toTenant`, `trigger` (`picker`, `shortcut`, `link`) | Emitted after a successful tenant switch; correlates with Authority `ui.tenant.switch` logs. |
-| `ui_offline_banner_seconds` | Histogram | `reason` (`authority`, `manifest`, `gateway`), `tenant` | Duration of offline banner visibility; integrate with air-gap SLAs. |
-
-### 2.2 Security & Session
-
-| Metric | Type | Labels | Notes |
-|--------|------|--------|-------|
-| `ui_dpop_failure_total` | Counter | `endpoint`, `reason` (`nonce`, `jkt`, `clockSkew`) | Raised when DPoP validation fails; pair with Authority audit trail. |
-| `ui_fresh_auth_prompt_total` | Counter | `action` (`token.revoke`, `policy.activate`, `client.create`), `tenant` | Counts fresh-auth modals; backlog above baseline indicates workflow friction. |
-| `ui_fresh_auth_failure_total` | Counter | `action`, `reason` (`timeout`,`cancelled`,`auth_error`) | Optional metric (set `CONSOLE_FRESH_AUTH_METRICS=true` when feature flag lands). |
-
-### 2.3 Downloads & Offline Kit
-
-| Metric | Type | Labels | Notes |
-|--------|------|--------|-------|
-| `ui_download_manifest_refresh_seconds` | Histogram | `tenant`, `channel` (`edge`,`stable`,`airgap`) | Time to fetch and verify downloads manifest. Target < 3 s. |
-| `ui_download_export_queue_depth` | Gauge | `tenant`, `artifactType` (`sbom`,`policy`,`attestation`,`console`) | Mirrors `/console/downloads` queue depth; triggers when offline bundles lag. |
-| `ui_download_command_copied_total` | Counter | `tenant`, `artifactType` | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. |
-
-### 2.4 Telemetry Emission & Errors
-
-| Metric | Type | Labels | Notes |
-|--------|------|--------|-------|
-| `ui_telemetry_batch_failures_total` | Counter | `transport` (`otlp-http`,`otlp-grpc`), `reason` | Emitted by OTLP bridge when batches fail. Enable via `CONSOLE_METRICS_VERBOSE=true`. |
-| `ui_telemetry_queue_depth` | Gauge | `priority` (`normal`,`high`), `tenant` | Browser-side buffer depth; monitor for spikes under degraded collectors. |
-
-> **Scraping tips:**  
-> - Enable `/metrics` via `CONSOLE_METRICS_ENABLED=true`.  
-> - Set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318` and relevant headers (`OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>`).  
-> - For air-gapped sites, point the exporter to the Offline Kit collector (`localhost:4318`) and forward the metrics snapshot using `stella offline bundle metrics`.
-
---
-
-## 3 · Logs
-
- **Format:** JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields: `timestamp`, `level`, `action`, `route`, `tenant`, `subject`, `correlationId`, `dpop.jkt`, `device`, `offlineMode`.  
- **Categories:**
-  - `ui.action` – general user interactions (route changes, command palette, filter updates). Sampled 50 % by default; override with feature flag `telemetry.logVerbose`.  
-  - `ui.tenant.switch` – always logged; includes `fromTenant`, `toTenant`, `tokenId`, and Authority audit correlation.  
-  - `ui.download.commandCopied` – download commands copied; includes `artifactId`, `digest`, `manifestVersion`.  
-  - `ui.security.anomaly` – DPoP mismatches, tenant header errors, CSP violations (level = `Warning`).  
-  - `ui.telemetry.failure` – OTLP export errors; include `httpStatus`, `batchSize`, `retryCount`.  
- **PII handling:** Full emails are scrubbed; only hashed values (`user:<sha256>`) appear unless `ui.admin` + fresh-auth were granted for the action (still redacted in logs).  
- **Retention:** Recommended 14 days for connected sites, 30 days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label `service="stellaops-web-ui"`.
-
---
-
-## 4 · Traces
-
- **Span names & attributes:**
-  - `ui.route.transition` – wraps route navigation; attributes: `route`, `tenant`, `renderMillis`, `prefetchHit`.  
-  - `ui.api.fetch` – HTTP fetch to backend; attributes: `service`, `endpoint`, `status`, `networkTime`.  
-  - `ui.sse.stream` – Server-sent event subscriptions (status ticker, runs); attributes: `channel`, `connectedMillis`, `reconnects`.  
-  - `ui.telemetry.batch` – Browser OTLP flush; attributes: `batchSize`, `success`, `retryCount`.  
-  - `ui.policy.action` – Policy workspace actions (simulate, approve, activate) per `docs/ui/policy-editor.md`.  
- **Propagation:** Spans use W3C `traceparent`; gateway echoes header to backend APIs so traces stitch across UI → gateway → service.  
- **Sampling controls:** `OTEL_TRACES_SAMPLER_ARG` (ratio) and feature flag `telemetry.forceSampling` (sets to 100 % for incident debugging).  
- **Viewing traces:** Grafana Tempo or Jaeger via collector. Filter by `service.name = stellaops-console`. For cross-service debugging, filter on `correlationId` and `tenant`.
-
---
-
-## 5 · Dashboards
-
-### 5.1 Experience Overview
-
-Panels:
- Route render histogram (P50/P90/P99) by route.  
- Backend call latency stacked by service (`ui_request_duration_seconds`).  
- Offline banner duration trend (`ui_offline_banner_seconds`).  
- Tenant switch volume vs failure rate (overlay `ui_dpop_failure_total`).  
- Command palette usage (`ui_filter_apply_total` + `ui.action` log counts).
-
-### 5.2 Downloads & Offline Kit
-
- Manifest refresh time chart (per channel).  
- Export queue depth gauge with alert thresholds.  
- CLI command adoption (bar chart per artifact type, using `ui_download_command_copied_total`).  
- Offline parity banner occurrences (`downloads.offlineParity` flag from API → derived metric).  
- Last Offline Kit import timestamp (join with Downloads API metadata).
-
-### 5.3 Security & Session
-
- Fresh-auth prompt counts vs success/fail ratios.  
- DPoP failure stacked by reason.  
- Tenant mismatch warnings (from `ui.security.anomaly` logs).  
- Scope usage heatmap (derived from Authority audit events + UI logs).  
- CSP violation counts (browser `securitypolicyviolation` listener forwarded to logs).
-
-> Capture screenshots for Grafana once dashboards stabilise (`docs/assets/ui/observability/*.png`). Replace placeholders before releasing the doc.
-
---
-
-## 6 · Alerting
-
-| Alert | Condition | Suggested Action |
-|-------|-----------|------------------|
-| **ConsoleLatencyHigh** | `ui_route_render_seconds_bucket{le="1.5"}` drops below 0.95 for 3 intervals | Inspect route splits, check backend latencies, review CDN cache. |
-| **BackendLatencyHigh** | `ui_request_duration_seconds_sum / ui_request_duration_seconds_count` > 1 s for any service | Correlate with gateway/service dashboards; escalate to owning guild. |
-| **TenantSwitchFailures** | Increase in `ui_dpop_failure_total` or `ui.security.anomaly` (tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. |
-| **FreshAuthLoop** | `ui_fresh_auth_prompt_total` spikes with matching `ui_fresh_auth_failure_total` | Review Authority `/fresh-auth` endpoint, session timeout config, UX regressions. |
-| **OfflineBannerLong** | `ui_offline_banner_seconds` P95 > 120 s | Investigate Authority/gateway availability; verify Offline Kit freshness. |
-| **DownloadsBacklog** | `ui_download_export_queue_depth` > 5 for 10 min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline (`DOWNLOADS-CONSOLE-23-001`) is healthy. |
-| **TelemetryExportErrors** | `ui_telemetry_batch_failures_total` > 0 for ≥5 min | Check collector health, credentials, or TLS trust. |
-
-Integrate alerts with Notifier (`ui.alerts`) or existing Ops channels. Tag incidents with `component=console` for correlation.
-
---
-
-## 7 · Feature Flags & Configuration
-
-| Flag / Env Var | Purpose | Default |
-|----------------|---------|---------|
-| `CONSOLE_FEATURE_FLAGS` | Enables UI modules (`runs`, `downloads`, `policies`, `telemetry`). Telemetry panel requires `telemetry`. | `runs,downloads,policies` |
-| `CONSOLE_METRICS_ENABLED` | Exposes `/metrics` for Prometheus scrape. | `true` |
-| `CONSOLE_METRICS_VERBOSE` | Emits additional batching metrics (`ui_telemetry_*`). | `false` |
-| `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`). Use `Debug` for incident sampling. | `Information` |
-| `CONSOLE_METRICS_SAMPLING` *(planned)* | Controls front-end span sampling ratio. Document once released. | `0.05` |
-| `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL; supports HTTPS. | unset |
-| `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers (auth). | unset |
-| `OTEL_EXPORTER_OTLP_INSECURE` | Allow HTTP (dev only). | `false` |
-| `OTEL_SERVICE_NAME` | Service tag for traces/logs. Set to `stellaops-console`. | auto |
-| `CONSOLE_TELEMETRY_SSE_ENABLED` | Enables `/console/telemetry` SSE feed for dashboards. | `true` |
-
-Feature flag changes should be tracked in release notes and mirrored in `/docs/ui/navigation.md` (shortcuts may change when modules toggle).
-
---
-
-## 8 · Offline / Air-Gapped Workflow
-
- Mirror the console image and telemetry collector as part of the Offline Kit (see `/docs/install/docker.md` §4).  
- Scrape metrics locally via `curl -k https://console.local/metrics > metrics.prom`; archive alongside logs for audits.  
- Use `stella offline kit import` to keep the downloads manifest in sync; dashboards display staleness using `ui_download_manifest_refresh_seconds`.  
- When collectors are unavailable, console queues OTLP batches (up to 5 min) and exposes backlog through `ui_telemetry_queue_depth`; export queue metrics to prove no data loss.  
- After reconnecting, run `stella console status --telemetry` *(CLI parity pending; see DOCS-CONSOLE-23-014)* or verify `ui_telemetry_batch_failures_total` resets to zero.  
- Retain telemetry bundles for 30 days per compliance guidelines; include Grafana JSON exports in audit packages.
-
---
-
-## 9 · Compliance Checklist
-
- [ ] `/metrics` scraped in staging & production; dashboards display `ui_route_render_seconds`, `ui_request_duration_seconds`, and downloads metrics.  
- [ ] OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).  
- [ ] Alert rules from §6 implemented in monitoring stack with runbooks linked.  
- [ ] Feature flags documented and change-controlled; telemetry disabled only with approval.  
- [ ] DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.  
- [ ] Offline capture workflow exercised; evidence stored in audit vault.  
- [ ] Screenshots of Grafana dashboards committed once they stabilise (update references).  
- [ ] Cross-links verified (`docs/deploy/console.md`, `docs/security/console-security.md`, `docs/ui/downloads.md`, `docs/ui/console-overview.md`).
-
---
-
-## 10 · References
-
- `/docs/deploy/console.md` – Metrics endpoint, OTLP config, health checks.  
- `/docs/security/console-security.md` – Security metrics & alert hints.  
- `/docs/ui/console-overview.md` – Telemetry primitives and performance budgets.  
- `/docs/ui/downloads.md` – Downloads metrics and parity workflow.  
- `/docs/observability/observability.md` – Platform-wide practices.  
- `/ops/telemetry-collector.md` & `/ops/telemetry-storage.md` – Collector deployment.  
- `/docs/install/docker.md` – Compose/Helm environment variables.
-
---
-
-*Last updated: 2025-10-28 (Sprint 23).* 
-
+# Console Observability
+
+> **Audience:** Observability Guild, Console Guild, SRE/operators.  
+> **Scope:** Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).  
+> **Prerequisites:** Console deployed with metrics enabled (`CONSOLE_METRICS_ENABLED=true`) and OTLP exporters configured (`OTEL_EXPORTER_OTLP_*`).
+
+---
+
+## 1 · Instrumentation Overview
+
+- **Telemetry stack:** OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose `/metrics` (Prometheus) and `/health/*`.  
+- **Sampling:** Front-end spans sample at 5 % by default (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`). Metrics are un-sampled; log sampling is handled per category (§3).  
+- **Correlation IDs:** Every API call carries `x-stellaops-correlation-id`; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.  
+- **Scope gating:** Operators need the `ui.telemetry` scope to view live charts in the Admin workspace; the scope also controls access to `/console/telemetry` SSE streams.
+
+---
+
+## 2 · Metrics
+
+### 2.1 Experience & Navigation
+
+| Metric | Type | Labels | Notes |
+|--------|------|--------|-------|
+| `ui_route_render_seconds` | Histogram | `route`, `tenant`, `device` (`desktop`,`tablet`) | Time between route activation and first interactive paint. Target P95 ≤ 1.5 s (cached). |
+| `ui_request_duration_seconds` | Histogram | `service`, `method`, `status`, `tenant` | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. |
+| `ui_filter_apply_total` | Counter | `route`, `filter`, `tenant` | Increments when a global filter or context chip is applied. Used to track adoption of saved views. |
+| `ui_tenant_switch_total` | Counter | `fromTenant`, `toTenant`, `trigger` (`picker`, `shortcut`, `link`) | Emitted after a successful tenant switch; correlates with Authority `ui.tenant.switch` logs. |
+| `ui_offline_banner_seconds` | Histogram | `reason` (`authority`, `manifest`, `gateway`), `tenant` | Duration of offline banner visibility; integrate with air-gap SLAs. |
+
+### 2.2 Security & Session
+
+| Metric | Type | Labels | Notes |
+|--------|------|--------|-------|
+| `ui_dpop_failure_total` | Counter | `endpoint`, `reason` (`nonce`, `jkt`, `clockSkew`) | Raised when DPoP validation fails; pair with Authority audit trail. |
+| `ui_fresh_auth_prompt_total` | Counter | `action` (`token.revoke`, `policy.activate`, `client.create`), `tenant` | Counts fresh-auth modals; backlog above baseline indicates workflow friction. |
+| `ui_fresh_auth_failure_total` | Counter | `action`, `reason` (`timeout`,`cancelled`,`auth_error`) | Optional metric (set `CONSOLE_FRESH_AUTH_METRICS=true` when feature flag lands). |
+
+### 2.3 Downloads & Offline Kit
+
+| Metric | Type | Labels | Notes |
+|--------|------|--------|-------|
+| `ui_download_manifest_refresh_seconds` | Histogram | `tenant`, `channel` (`edge`,`stable`,`airgap`) | Time to fetch and verify downloads manifest. Target < 3 s. |
+| `ui_download_export_queue_depth` | Gauge | `tenant`, `artifactType` (`sbom`,`policy`,`attestation`,`console`) | Mirrors `/console/downloads` queue depth; triggers when offline bundles lag. |
+| `ui_download_command_copied_total` | Counter | `tenant`, `artifactType` | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. |
+
+### 2.4 Telemetry Emission & Errors
+
+| Metric | Type | Labels | Notes |
+|--------|------|--------|-------|
+| `ui_telemetry_batch_failures_total` | Counter | `transport` (`otlp-http`,`otlp-grpc`), `reason` | Emitted by OTLP bridge when batches fail. Enable via `CONSOLE_METRICS_VERBOSE=true`. |
+| `ui_telemetry_queue_depth` | Gauge | `priority` (`normal`,`high`), `tenant` | Browser-side buffer depth; monitor for spikes under degraded collectors. |
+
+> **Scraping tips:**  
+> - Enable `/metrics` via `CONSOLE_METRICS_ENABLED=true`.  
+> - Set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318` and relevant headers (`OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>`).  
+> - For air-gapped sites, point the exporter to the Offline Kit collector (`localhost:4318`) and forward the metrics snapshot using `stella offline bundle metrics`.
+
+---
+
+## 3 · Logs
+
+- **Format:** JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields: `timestamp`, `level`, `action`, `route`, `tenant`, `subject`, `correlationId`, `dpop.jkt`, `device`, `offlineMode`.  
+- **Categories:**
+  - `ui.action` – general user interactions (route changes, command palette, filter updates). Sampled 50 % by default; override with feature flag `telemetry.logVerbose`.  
+  - `ui.tenant.switch` – always logged; includes `fromTenant`, `toTenant`, `tokenId`, and Authority audit correlation.  
+  - `ui.download.commandCopied` – download commands copied; includes `artifactId`, `digest`, `manifestVersion`.  
+  - `ui.security.anomaly` – DPoP mismatches, tenant header errors, CSP violations (level = `Warning`).  
+  - `ui.telemetry.failure` – OTLP export errors; include `httpStatus`, `batchSize`, `retryCount`.  
+- **PII handling:** Full emails are scrubbed; only hashed values (`user:<sha256>`) appear unless `ui.admin` + fresh-auth were granted for the action (still redacted in logs).  
+- **Retention:** Recommended 14 days for connected sites, 30 days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label `service="stellaops-web-ui"`.
+
+---
+
+## 4 · Traces
+
+- **Span names & attributes:**
+  - `ui.route.transition` – wraps route navigation; attributes: `route`, `tenant`, `renderMillis`, `prefetchHit`.  
+  - `ui.api.fetch` – HTTP fetch to backend; attributes: `service`, `endpoint`, `status`, `networkTime`.  
+  - `ui.sse.stream` – Server-sent event subscriptions (status ticker, runs); attributes: `channel`, `connectedMillis`, `reconnects`.  
+  - `ui.telemetry.batch` – Browser OTLP flush; attributes: `batchSize`, `success`, `retryCount`.  
+  - `ui.policy.action` – Policy workspace actions (simulate, approve, activate) per `docs/ui/policy-editor.md`.  
+- **Propagation:** Spans use W3C `traceparent`; gateway echoes header to backend APIs so traces stitch across UI → gateway → service.  
+- **Sampling controls:** `OTEL_TRACES_SAMPLER_ARG` (ratio) and feature flag `telemetry.forceSampling` (sets to 100 % for incident debugging).  
+- **Viewing traces:** Grafana Tempo or Jaeger via collector. Filter by `service.name = stellaops-console`. For cross-service debugging, filter on `correlationId` and `tenant`.
+
+---
+
+## 5 · Dashboards
+
+### 5.1 Experience Overview
+
+Panels:
+- Route render histogram (P50/P90/P99) by route.  
+- Backend call latency stacked by service (`ui_request_duration_seconds`).  
+- Offline banner duration trend (`ui_offline_banner_seconds`).  
+- Tenant switch volume vs failure rate (overlay `ui_dpop_failure_total`).  
+- Command palette usage (`ui_filter_apply_total` + `ui.action` log counts).
+
+### 5.2 Downloads & Offline Kit
+
+- Manifest refresh time chart (per channel).  
+- Export queue depth gauge with alert thresholds.  
+- CLI command adoption (bar chart per artifact type, using `ui_download_command_copied_total`).  
+- Offline parity banner occurrences (`downloads.offlineParity` flag from API → derived metric).  
+- Last Offline Kit import timestamp (join with Downloads API metadata).
+
+### 5.3 Security & Session
+
+- Fresh-auth prompt counts vs success/fail ratios.  
+- DPoP failure stacked by reason.  
+- Tenant mismatch warnings (from `ui.security.anomaly` logs).  
+- Scope usage heatmap (derived from Authority audit events + UI logs).  
+- CSP violation counts (browser `securitypolicyviolation` listener forwarded to logs).
+
+> Capture screenshots for Grafana once dashboards stabilise (`docs/assets/ui/observability/*.png`). Replace placeholders before releasing the doc.
+
+---
+
+## 6 · Alerting
+
+| Alert | Condition | Suggested Action |
+|-------|-----------|------------------|
+| **ConsoleLatencyHigh** | `ui_route_render_seconds_bucket{le="1.5"}` drops below 0.95 for 3 intervals | Inspect route splits, check backend latencies, review CDN cache. |
+| **BackendLatencyHigh** | `ui_request_duration_seconds_sum / ui_request_duration_seconds_count` > 1 s for any service | Correlate with gateway/service dashboards; escalate to owning guild. |
+| **TenantSwitchFailures** | Increase in `ui_dpop_failure_total` or `ui.security.anomaly` (tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. |
+| **FreshAuthLoop** | `ui_fresh_auth_prompt_total` spikes with matching `ui_fresh_auth_failure_total` | Review Authority `/fresh-auth` endpoint, session timeout config, UX regressions. |
+| **OfflineBannerLong** | `ui_offline_banner_seconds` P95 > 120 s | Investigate Authority/gateway availability; verify Offline Kit freshness. |
+| **DownloadsBacklog** | `ui_download_export_queue_depth` > 5 for 10 min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline (`DOWNLOADS-CONSOLE-23-001`) is healthy. |
+| **TelemetryExportErrors** | `ui_telemetry_batch_failures_total` > 0 for ≥5 min | Check collector health, credentials, or TLS trust. |
+
+Integrate alerts with Notifier (`ui.alerts`) or existing Ops channels. Tag incidents with `component=console` for correlation.
+
+---
+
+## 7 · Feature Flags & Configuration
+
+| Flag / Env Var | Purpose | Default |
+|----------------|---------|---------|
+| `CONSOLE_FEATURE_FLAGS` | Enables UI modules (`runs`, `downloads`, `policies`, `telemetry`). Telemetry panel requires `telemetry`. | `runs,downloads,policies` |
+| `CONSOLE_METRICS_ENABLED` | Exposes `/metrics` for Prometheus scrape. | `true` |
+| `CONSOLE_METRICS_VERBOSE` | Emits additional batching metrics (`ui_telemetry_*`). | `false` |
+| `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`). Use `Debug` for incident sampling. | `Information` |
+| `CONSOLE_METRICS_SAMPLING` *(planned)* | Controls front-end span sampling ratio. Document once released. | `0.05` |
+| `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL; supports HTTPS. | unset |
+| `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers (auth). | unset |
+| `OTEL_EXPORTER_OTLP_INSECURE` | Allow HTTP (dev only). | `false` |
+| `OTEL_SERVICE_NAME` | Service tag for traces/logs. Set to `stellaops-console`. | auto |
+| `CONSOLE_TELEMETRY_SSE_ENABLED` | Enables `/console/telemetry` SSE feed for dashboards. | `true` |
+
+Feature flag changes should be tracked in release notes and mirrored in `/docs/ui/navigation.md` (shortcuts may change when modules toggle).
+
+---
+
+## 8 · Offline / Air-Gapped Workflow
+
+- Mirror the console image and telemetry collector as part of the Offline Kit (see `/docs/install/docker.md` §4).  
+- Scrape metrics locally via `curl -k https://console.local/metrics > metrics.prom`; archive alongside logs for audits.  
+- Use `stella offline kit import` to keep the downloads manifest in sync; dashboards display staleness using `ui_download_manifest_refresh_seconds`.  
+- When collectors are unavailable, console queues OTLP batches (up to 5 min) and exposes backlog through `ui_telemetry_queue_depth`; export queue metrics to prove no data loss.  
+- After reconnecting, run `stella console status --telemetry` *(CLI parity pending; see DOCS-CONSOLE-23-014)* or verify `ui_telemetry_batch_failures_total` resets to zero.  
+- Retain telemetry bundles for 30 days per compliance guidelines; include Grafana JSON exports in audit packages.
+
+---
+
+## 9 · Compliance Checklist
+
+- [ ] `/metrics` scraped in staging & production; dashboards display `ui_route_render_seconds`, `ui_request_duration_seconds`, and downloads metrics.  
+- [ ] OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).  
+- [ ] Alert rules from §6 implemented in monitoring stack with runbooks linked.  
+- [ ] Feature flags documented and change-controlled; telemetry disabled only with approval.  
+- [ ] DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.  
+- [ ] Offline capture workflow exercised; evidence stored in audit vault.  
+- [ ] Screenshots of Grafana dashboards committed once they stabilise (update references).  
+- [ ] Cross-links verified (`docs/deploy/console.md`, `docs/security/console-security.md`, `docs/ui/downloads.md`, `docs/ui/console-overview.md`).
+
+---
+
+## 10 · References
+
+- `/docs/deploy/console.md` – Metrics endpoint, OTLP config, health checks.  
+- `/docs/security/console-security.md` – Security metrics & alert hints.  
+- `/docs/ui/console-overview.md` – Telemetry primitives and performance budgets.  
+- `/docs/ui/downloads.md` – Downloads metrics and parity workflow.  
+- `/docs/observability/observability.md` – Platform-wide practices.  
+- `/ops/telemetry-collector.md` & `/ops/telemetry-storage.md` – Collector deployment.  
+- `/docs/install/docker.md` – Compose/Helm environment variables.
+
+---
+
+*Last updated: 2025-10-28 (Sprint 23).* 
+