Restructure solution layout by module
Some checks failed: Docs CI / lint-and-preview (push) has been cancelled.
		| @@ -1,166 +1,166 @@ | ||||
| # Policy Engine Observability | ||||
|  | ||||
| > **Audience:** Observability Guild, SRE/Platform operators, Policy Guild.   | ||||
| > **Scope:** Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).   | ||||
| > **Prerequisites:** Policy Engine v2 deployed with OpenTelemetry exporters enabled (`observability:enabled=true` in config). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 1 · Instrumentation Overview | ||||
|  | ||||
| - **Telemetry stack:** OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo. | ||||
| - **Namespace conventions:** `policy.*` for metrics/traces/log categories; labels use `tenant`, `policy`, `mode`, `runId`. | ||||
| - **Sampling:** Default 10 % trace sampling, 1 % rule-hit log sampling; incident mode overrides both to 100 % (see §7). | ||||
| - **Correlation IDs:** Every API request gets `traceId` + `requestId`. CLI/UI display IDs to streamline support. | ||||
|  | ||||
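The sampling defaults above can be illustrated with a small sketch. It assumes a deterministic, hash-based ratio decision; the Policy Engine's actual sampler is not described in this document, and the names `sample_ratio` and `should_sample` are hypothetical.

```python
import hashlib

# Ratios taken from the Sampling bullet above.
DEFAULT_RATIOS = {"trace": 0.10, "rule_hit": 0.01}


def sample_ratio(kind: str, incident_mode: bool) -> float:
    """Return the effective ratio; incident mode overrides everything to 100 %."""
    if incident_mode:
        return 1.0
    return DEFAULT_RATIOS[kind]


def should_sample(key: str, ratio: float) -> bool:
    """Hash the key (for example a traceId) into [0, 1) and compare with the ratio.

    Hashing keeps the decision stable for a given key, so every component
    that sees the same ID reaches the same answer. This is an assumption
    made for the sketch, not a documented behaviour.
    """
    if ratio >= 1.0:
        return True
    if ratio <= 0.0:
        return False
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio
```

For example, `should_sample(trace_id, sample_ratio("trace", incident_mode=False))` keeps roughly one trace in ten, while the same call with `incident_mode=True` keeps all of them.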
| --- | ||||
|  | ||||
| ## 2 · Metrics | ||||
|  | ||||
| ### 2.1 Run Pipeline | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `policy_run_seconds` | Histogram | `tenant`, `policy`, `mode` (`full`, `incremental`, `simulate`) | P95 target ≤ 5 min incremental, ≤ 30 min full. | | ||||
| | `policy_run_queue_depth` | Gauge | `tenant` | Number of pending jobs per tenant (updated each enqueue/dequeue). | | ||||
| | `policy_run_failures_total` | Counter | `tenant`, `policy`, `reason` (`err_pol_*`, `network`, `cancelled`) | Aligns with error codes. | | ||||
| | `policy_run_retries_total` | Counter | `tenant`, `policy` | Helps identify noisy sources. | | ||||
| | `policy_run_inputs_pending_bytes` | Gauge | `tenant` | Size of buffered change batches awaiting run. | | ||||
|  | ||||
| ### 2.2 Evaluator Insights | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `policy_rules_fired_total` | Counter | `tenant`, `policy`, `rule` | Increment per rule match (sampled). | | ||||
| | `policy_vex_overrides_total` | Counter | `tenant`, `policy`, `vendor`, `justification` | Tracks VEX precedence decisions. | | ||||
| | `policy_suppressions_total` | Counter | `tenant`, `policy`, `action` (`ignore`, `warn`, `quiet`) | Audits suppression usage. | | ||||
| | `policy_selection_batch_duration_seconds` | Histogram | `tenant`, `policy` | Measures joiner performance. | | ||||
| | `policy_materialization_conflicts_total` | Counter | `tenant`, `policy` | Non-zero indicates optimistic concurrency retries. | | ||||
|  | ||||
| ### 2.3 API Surface | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `policy_api_requests_total` | Counter | `endpoint`, `method`, `status` | Exposed via Minimal API instrumentation. | | ||||
| | `policy_api_latency_seconds` | Histogram | `endpoint`, `method` | Budget ≤ 250 ms for GETs, ≤ 1 s for POSTs. | | ||||
| | `policy_api_rate_limited_total` | Counter | `endpoint` | Tied to throttles (`429`). | | ||||
|  | ||||
| ### 2.4 Queue & Change Streams | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `policy_queue_leases_active` | Gauge | `tenant` | Number of leased jobs. | | ||||
| | `policy_queue_lease_expirations_total` | Counter | `tenant` | Alerts when workers fail to ack. | | ||||
| | `policy_delta_backlog_age_seconds` | Gauge | `tenant`, `source` (`concelier`, `excititor`, `sbom`) | Age of oldest unprocessed change event. | | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 3 · Logs | ||||
|  | ||||
| - **Format:** JSON (`Serilog`). Core fields: `timestamp`, `level`, `message`, `policyId`, `policyVersion`, `tenant`, `runId`, `rule`, `traceId`, `env.sealed`, `error.code`. | ||||
| - **Log categories:** | ||||
|   - `policy.run` (queue lifecycle, run begin/end, stats) | ||||
|   - `policy.evaluate` (batch execution summaries; rule-hit sampling) | ||||
|   - `policy.materialize` (Mongo operations, conflicts, retries) | ||||
|   - `policy.simulate` (diff results, CLI invocation metadata) | ||||
|   - `policy.lifecycle` (submit/review/approve events) | ||||
| - **Sampling:** Rule-hit logs are sampled at 1 % by default; sampling switches to 100 % in incident mode or when the CLI is run with the `--trace` flag. | ||||
| - **PII:** No user secrets recorded; user identities referenced as `user:<id>` or `group:<id>` only. | ||||
|  | ||||
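To make the log shape concrete, here is a minimal sketch of a record builder that emits the core fields listed above as JSON. The service itself logs through Serilog; this Python version only illustrates the field set, and the allow-list behaviour, `build_record`, and `subject_ref` are assumptions introduced for the example.

```python
import json
from datetime import datetime, timezone

# Optional core fields named in the Format bullet above.
CORE_FIELDS = (
    "policyId", "policyVersion", "tenant", "runId",
    "rule", "traceId", "env.sealed", "error.code",
)


def subject_ref(kind: str, identifier: str) -> str:
    """Render an identity as `user:<id>` or `group:<id>`, per the PII bullet."""
    if kind not in ("user", "group"):
        raise ValueError(f"unsupported subject kind: {kind!r}")
    return f"{kind}:{identifier}"


def build_record(level: str, message: str, fields: dict) -> str:
    """Serialise one log event, keeping only the documented core fields.

    Dropping unknown keys is a hypothetical safeguard against accidentally
    logging secrets; the real scrubbing rules are not specified here.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
    }
    for name in CORE_FIELDS:
        if name in fields:
            record[name] = fields[name]
    return json.dumps(record)
```

A call such as `build_record("Information", "run completed", {"tenant": "acme", "runId": "run-1"})` yields a single JSON line suitable for a Loki pipeline.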
| --- | ||||
|  | ||||
| ## 4 · Traces | ||||
|  | ||||
| - Spans are emitted via OpenTelemetry instrumentation. | ||||
| - **Primary spans:** | ||||
|   - `policy.api` – wraps HTTP request, records `endpoint`, `status`, `scope`. | ||||
|   - `policy.select` – change stream ingestion and batch assembly (attributes: `candidateCount`, `cursor`). | ||||
|   - `policy.evaluate` – evaluation batch (attributes: `batchSize`, `ruleHits`, `severityChanges`). | ||||
|   - `policy.materialize` – Mongo writes (attributes: `writes`, `historyWrites`, `retryCount`). | ||||
|   - `policy.simulate` – simulation diff generation (attributes: `sbomCount`, `diffAdded`, `diffRemoved`). | ||||
| - Trace context is propagated to the CLI via the `traceparent` response header; the UI surfaces it in the run detail view. | ||||
| - Incident mode forces span sampling to 100 % and extends retention via Collector config override. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 5 · Dashboards | ||||
|  | ||||
| ### 5.1 Policy Runs Overview | ||||
|  | ||||
| Widgets: | ||||
| - Run duration histogram (per mode/tenant). | ||||
| - Queue depth + backlog age line charts. | ||||
| - Failure rate stacked by error code. | ||||
| - Incremental backlog heatmap (policy × age). | ||||
| - Active vs scheduled runs table. | ||||
|  | ||||
| ### 5.2 Rule Impact & VEX | ||||
|  | ||||
| - Top N rules by firings (bar chart). | ||||
| - VEX overrides by vendor/justification (stacked chart). | ||||
| - Suppression usage (pie + table with justifications). | ||||
| - Quieted findings trend (line). | ||||
|  | ||||
| ### 5.3 Simulation & Approval Health | ||||
|  | ||||
| - Simulation diff histogram (added vs removed). | ||||
| - Pending approvals by age (table with SLA colour coding). | ||||
| - Compliance checklist status (lint, determinism CI, simulation evidence). | ||||
|  | ||||
| > Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (`../assets/policy-observability/*.png`). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 6 · Alerting | ||||
|  | ||||
| | Alert | Condition | Suggested Action | | ||||
| |-------|-----------|------------------| | ||||
| | **PolicyRunSlaBreach** | `policy_run_seconds{mode="incremental"}` P95 > 300 s for 3 windows | Check queue depth, upstream services, scale worker pool. | | ||||
| | **PolicyQueueStuck** | `policy_delta_backlog_age_seconds` > 600 s | Investigate change stream connectivity. | | ||||
| | **DeterminismMismatch** | Run status `failed` with `ERR_POL_004` OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. | | ||||
| | **SimulationDrift** | CLI/CI simulation exit `20` (blocking diff) over threshold | Review policy changes before approval. | | ||||
| | **VexOverrideSpike** | `policy_vex_overrides_total` > configured baseline (per vendor) | Verify the upstream VEX feed; confirm the justification codes are expected. | | ||||
| | **SuppressionSurge** | `policy_suppressions_total` increase > 3σ vs baseline | Audit new suppress rules; check approvals. | | ||||
|  | ||||
| Alerts integrate with Notifier channels (`policy.alerts`) and Ops on-call rotations. | ||||
|  | ||||
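The **PolicyRunSlaBreach** condition combines two steps: estimating P95 from histogram buckets, then requiring the breach to persist for three windows. The sketch below shows one way to reason about that, assuming cumulative buckets and the linear interpolation commonly used for Prometheus-style histograms. The bucket boundaries in the usage note are illustrative, not the service's real configuration, and both function names are hypothetical.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    Values are interpolated linearly inside the bucket that holds the rank.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the overflow bucket: report the last finite bound.
                return prev_bound
            span = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return prev_bound


def sla_breached(p95_per_window: list, threshold_s: float = 300.0, windows: int = 3) -> bool:
    """True when the most recent `windows` P95 values all exceed the threshold."""
    recent = p95_per_window[-windows:]
    return len(recent) == windows and all(value > threshold_s for value in recent)
```

With buckets `[(60, 50), (300, 90), (600, 100), (math.inf, 100)]`, the 95th-percentile rank lands in the 300 to 600 s bucket, so a single window like this would already be above the 300 s target; the alert only fires once three consecutive windows agree.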
| --- | ||||
|  | ||||
| ## 7 · Incident Mode & Forensics | ||||
|  | ||||
| - Toggle via `POST /api/policy/incidents/activate` (requires `policy:operate` scope). | ||||
| - Effects: | ||||
|   - Trace sampling → 100 %. | ||||
|   - Rule-hit log sampling → 100 %. | ||||
|   - Retention window extended to 30 days for incident duration. | ||||
|   - `policy.incident.activated` event emitted (Console + Notifier banners). | ||||
| - Post-incident tasks: | ||||
|   - `stella policy run replay` for affected runs; attach bundles to incident record. | ||||
|   - Restore sampling defaults with `.../deactivate`. | ||||
|   - Update incident checklist in `/docs/policy/lifecycle.md` (section 8) with findings. | ||||
|  | ||||
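As a rough illustration of the activation call, the sketch below builds (but does not send) the `POST /api/policy/incidents/activate` request. The base URL, the bearer-token `Authorization` header, and the empty request body are all assumptions; only the path, the HTTP method, and the `policy:operate` scope requirement come from the text above.

```python
import urllib.request


def build_activate_request(base_url: str, token: str) -> urllib.request.Request:
    """Prepare the incident-mode activation request.

    `token` is assumed to be an access token carrying the `policy:operate`
    scope, sent as a bearer token (an assumption, not confirmed by the doc).
    """
    url = base_url.rstrip("/") + "/api/policy/incidents/activate"
    return urllib.request.Request(
        url,
        data=b"",  # empty body assumed; the real payload is not documented here
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; keeping construction separate makes the request easy to inspect or log during a drill.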
| --- | ||||
|  | ||||
| ## 8 · Integration Points | ||||
|  | ||||
| - **Authority:** Exposes metric `policy_scope_denied_total` for failed authorisation; correlate with `policy_api_requests_total`. | ||||
| - **Concelier/Excititor:** Shared trace IDs propagate via gRPC metadata to help debug upstream latency. | ||||
| - **Scheduler:** Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002). | ||||
| - **Offline Kit:** CLI exports logs + metrics snapshots (`stella offline bundle metrics`) for air-gapped audits. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 9 · Compliance Checklist | ||||
|  | ||||
| - [ ] **Metrics registered:** All metrics listed above exported and documented in Grafana dashboards. | ||||
| - [ ] **Alert policies configured:** Ops or Observability Guild created alerts matching table in §6. | ||||
| - [ ] **Sampling overrides tested:** Incident mode toggles verified in staging; retention roll-back rehearsed. | ||||
| - [ ] **Trace propagation validated:** CLI/UI display trace IDs and allow copy for support. | ||||
| - [ ] **Log scrubbing enforced:** Unit tests guarantee no secrets/PII in logs; sampling respects configuration. | ||||
| - [ ] **Offline capture rehearsed:** Metrics/log snapshot commands executed in sealed environment. | ||||
| - [ ] **Docs cross-links:** Links to architecture, runs, lifecycle, CLI, API docs verified. | ||||
|  | ||||
| --- | ||||
|  | ||||
| *Last updated: 2025-10-26 (Sprint 20).* | ||||
|  | ||||
| @@ -1,191 +1,191 @@ | ||||
| # Console Observability | ||||
|  | ||||
| > **Audience:** Observability Guild, Console Guild, SRE/operators.   | ||||
| > **Scope:** Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).   | ||||
| > **Prerequisites:** Console deployed with metrics enabled (`CONSOLE_METRICS_ENABLED=true`) and OTLP exporters configured (`OTEL_EXPORTER_OTLP_*`). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 1 · Instrumentation Overview | ||||
|  | ||||
| - **Telemetry stack:** OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose `/metrics` (Prometheus) and `/health/*`.   | ||||
| - **Sampling:** Front-end spans are sampled at 5 % by default (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`). Metrics are unsampled; log sampling is handled per category (§3).   | ||||
| - **Correlation IDs:** Every API call carries `x-stellaops-correlation-id`; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.   | ||||
| - **Scope gating:** Operators need the `ui.telemetry` scope to view live charts in the Admin workspace; the scope also controls access to `/console/telemetry` SSE streams. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 2 · Metrics | ||||
|  | ||||
| ### 2.1 Experience & Navigation | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `ui_route_render_seconds` | Histogram | `route`, `tenant`, `device` (`desktop`,`tablet`) | Time between route activation and first interactive paint. Target P95 ≤ 1.5 s (cached). | | ||||
| | `ui_request_duration_seconds` | Histogram | `service`, `method`, `status`, `tenant` | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. | | ||||
| | `ui_filter_apply_total` | Counter | `route`, `filter`, `tenant` | Increments when a global filter or context chip is applied. Used to track adoption of saved views. | | ||||
| | `ui_tenant_switch_total` | Counter | `fromTenant`, `toTenant`, `trigger` (`picker`, `shortcut`, `link`) | Emitted after a successful tenant switch; correlates with Authority `ui.tenant.switch` logs. | | ||||
| | `ui_offline_banner_seconds` | Histogram | `reason` (`authority`, `manifest`, `gateway`), `tenant` | Duration of offline banner visibility; integrate with air-gap SLAs. | | ||||
|  | ||||
| ### 2.2 Security & Session | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `ui_dpop_failure_total` | Counter | `endpoint`, `reason` (`nonce`, `jkt`, `clockSkew`) | Raised when DPoP validation fails; pair with Authority audit trail. | | ||||
| | `ui_fresh_auth_prompt_total` | Counter | `action` (`token.revoke`, `policy.activate`, `client.create`), `tenant` | Counts fresh-auth modals; a rate above baseline indicates workflow friction. | | ||||
| | `ui_fresh_auth_failure_total` | Counter | `action`, `reason` (`timeout`,`cancelled`,`auth_error`) | Optional metric (set `CONSOLE_FRESH_AUTH_METRICS=true` when feature flag lands). | | ||||
|  | ||||
| ### 2.3 Downloads & Offline Kit | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `ui_download_manifest_refresh_seconds` | Histogram | `tenant`, `channel` (`edge`,`stable`,`airgap`) | Time to fetch and verify downloads manifest. Target < 3 s. | | ||||
| | `ui_download_export_queue_depth` | Gauge | `tenant`, `artifactType` (`sbom`,`policy`,`attestation`,`console`) | Mirrors `/console/downloads` queue depth; triggers when offline bundles lag. | | ||||
| | `ui_download_command_copied_total` | Counter | `tenant`, `artifactType` | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. | | ||||
|  | ||||
| ### 2.4 Telemetry Emission & Errors | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `ui_telemetry_batch_failures_total` | Counter | `transport` (`otlp-http`,`otlp-grpc`), `reason` | Emitted by OTLP bridge when batches fail. Enable via `CONSOLE_METRICS_VERBOSE=true`. | | ||||
| | `ui_telemetry_queue_depth` | Gauge | `priority` (`normal`,`high`), `tenant` | Browser-side buffer depth; monitor for spikes under degraded collectors. | | ||||
|  | ||||
| > **Scraping tips:**   | ||||
| > - Enable `/metrics` via `CONSOLE_METRICS_ENABLED=true`.   | ||||
| > - Set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318` and relevant headers (`OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>`).   | ||||
| > - For air-gapped sites, point the exporter to the Offline Kit collector (`localhost:4318`) and forward the metrics snapshot using `stella offline bundle metrics`. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 3 · Logs | ||||
|  | ||||
| - **Format:** JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields: `timestamp`, `level`, `action`, `route`, `tenant`, `subject`, `correlationId`, `dpop.jkt`, `device`, `offlineMode`.   | ||||
| - **Categories:** | ||||
|   - `ui.action` – general user interactions (route changes, command palette, filter updates). Sampled 50 % by default; override with feature flag `telemetry.logVerbose`.   | ||||
|   - `ui.tenant.switch` – always logged; includes `fromTenant`, `toTenant`, `tokenId`, and Authority audit correlation.   | ||||
|   - `ui.download.commandCopied` – download commands copied; includes `artifactId`, `digest`, `manifestVersion`.   | ||||
|   - `ui.security.anomaly` – DPoP mismatches, tenant header errors, CSP violations (level = `Warning`).   | ||||
|   - `ui.telemetry.failure` – OTLP export errors; include `httpStatus`, `batchSize`, `retryCount`.   | ||||
| - **PII handling:** Full emails are scrubbed; only hashed values (`user:<sha256>`) appear in logs. Identities stay redacted in logs even when `ui.admin` plus fresh-auth was granted for the action.   | ||||
| - **Retention:** Recommended 14 days for connected sites, 30 days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label `service="stellaops-web-ui"`. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 4 · Traces | ||||
|  | ||||
| - **Span names & attributes:** | ||||
|   - `ui.route.transition` – wraps route navigation; attributes: `route`, `tenant`, `renderMillis`, `prefetchHit`.   | ||||
|   - `ui.api.fetch` – HTTP fetch to backend; attributes: `service`, `endpoint`, `status`, `networkTime`.   | ||||
|   - `ui.sse.stream` – Server-sent event subscriptions (status ticker, runs); attributes: `channel`, `connectedMillis`, `reconnects`.   | ||||
|   - `ui.telemetry.batch` – Browser OTLP flush; attributes: `batchSize`, `success`, `retryCount`.   | ||||
|   - `ui.policy.action` – Policy workspace actions (simulate, approve, activate) per `docs/ui/policy-editor.md`.   | ||||
| - **Propagation:** Spans use W3C `traceparent`; gateway echoes header to backend APIs so traces stitch across UI → gateway → service.   | ||||
| - **Sampling controls:** `OTEL_TRACES_SAMPLER_ARG` (ratio) and feature flag `telemetry.forceSampling` (sets to 100 % for incident debugging).   | ||||
| - **Viewing traces:** Grafana Tempo or Jaeger via collector. Filter by `service.name = stellaops-console`. For cross-service debugging, filter on `correlationId` and `tenant`. | ||||
|  | ||||
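Because stitching traces across UI, gateway, and service depends on a well-formed `traceparent` value, a small sketch of building and validating that header may help when debugging broken propagation. It follows the version `00` layout from the W3C Trace Context specification (`version-traceid-parentid-flags`); the helper names are hypothetical and future header versions are deliberately not handled.

```python
import re
import secrets

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(sampled: bool) -> str:
    """Create a fresh version-00 traceparent value with random IDs."""
    trace_id = secrets.token_hex(16)
    parent_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


def parse_traceparent(header: str):
    """Return the header's parts as a dict, or None when it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

The `trace_id` returned by `parse_traceparent` is the value to search for in Tempo or Jaeger when a request needs to be followed end to end.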
| --- | ||||
|  | ||||
| ## 5 · Dashboards | ||||
|  | ||||
| ### 5.1 Experience Overview | ||||
|  | ||||
| Panels: | ||||
| - Route render histogram (P50/P90/P99) by route.   | ||||
| - Backend call latency stacked by service (`ui_request_duration_seconds`).   | ||||
| - Offline banner duration trend (`ui_offline_banner_seconds`).   | ||||
| - Tenant switch volume vs failure rate (overlay `ui_dpop_failure_total`).   | ||||
| - Command palette usage (`ui_filter_apply_total` + `ui.action` log counts). | ||||
|  | ||||
| ### 5.2 Downloads & Offline Kit | ||||
|  | ||||
| - Manifest refresh time chart (per channel).   | ||||
| - Export queue depth gauge with alert thresholds.   | ||||
| - CLI command adoption (bar chart per artifact type, using `ui_download_command_copied_total`).   | ||||
| - Offline parity banner occurrences (`downloads.offlineParity` flag from API → derived metric).   | ||||
| - Last Offline Kit import timestamp (join with Downloads API metadata). | ||||
|  | ||||
| ### 5.3 Security & Session | ||||
|  | ||||
| - Fresh-auth prompt counts vs success/fail ratios.   | ||||
| - DPoP failure stacked by reason.   | ||||
| - Tenant mismatch warnings (from `ui.security.anomaly` logs).   | ||||
| - Scope usage heatmap (derived from Authority audit events + UI logs).   | ||||
| - CSP violation counts (browser `securitypolicyviolation` listener forwarded to logs). | ||||
|  | ||||
| > Capture screenshots for Grafana once dashboards stabilise (`docs/assets/ui/observability/*.png`). Replace placeholders before releasing the doc. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 6 · Alerting | ||||
|  | ||||
| | Alert | Condition | Suggested Action | | ||||
| |-------|-----------|------------------| | ||||
| | **ConsoleLatencyHigh** | Share of renders within target (`ui_route_render_seconds_bucket{le="1.5"}` divided by `ui_route_render_seconds_count`) drops below 0.95 for 3 intervals | Inspect route splits, check backend latencies, review CDN cache. | | ||||
| | **BackendLatencyHigh** | `ui_request_duration_seconds_sum / ui_request_duration_seconds_count` > 1 s for any service | Correlate with gateway/service dashboards; escalate to owning guild. | | ||||
| | **TenantSwitchFailures** | Increase in `ui_dpop_failure_total` or `ui.security.anomaly` (tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. | | ||||
| | **FreshAuthLoop** | `ui_fresh_auth_prompt_total` spikes with matching `ui_fresh_auth_failure_total` | Review Authority `/fresh-auth` endpoint, session timeout config, UX regressions. | | ||||
| | **OfflineBannerLong** | `ui_offline_banner_seconds` P95 > 120 s | Investigate Authority/gateway availability; verify Offline Kit freshness. | | ||||
| | **DownloadsBacklog** | `ui_download_export_queue_depth` > 5 for 10 min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline (`DOWNLOADS-CONSOLE-23-001`) is healthy. | | ||||
| | **TelemetryExportErrors** | `ui_telemetry_batch_failures_total` > 0 for ≥5 min | Check collector health, credentials, or TLS trust. | | ||||
|  | ||||
| Integrate alerts with Notifier (`ui.alerts`) or existing Ops channels. Tag incidents with `component=console` for correlation. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 7 · Feature Flags & Configuration | ||||
|  | ||||
| | Flag / Env Var | Purpose | Default | | ||||
| |----------------|---------|---------| | ||||
| | `CONSOLE_FEATURE_FLAGS` | Enables UI modules (`runs`, `downloads`, `policies`, `telemetry`). Telemetry panel requires `telemetry`. | `runs,downloads,policies` | | ||||
| | `CONSOLE_METRICS_ENABLED` | Exposes `/metrics` for Prometheus scrape. | `true` | | ||||
| | `CONSOLE_METRICS_VERBOSE` | Emits additional batching metrics (`ui_telemetry_*`). | `false` | | ||||
| | `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`). Use `Debug` for incident sampling. | `Information` | | ||||
| | `CONSOLE_METRICS_SAMPLING` *(planned)* | Controls front-end span sampling ratio. Document once released. | `0.05` | | ||||
| | `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL; supports HTTPS. | unset | | ||||
| | `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers (auth). | unset | | ||||
| | `OTEL_EXPORTER_OTLP_INSECURE` | Allow HTTP (dev only). | `false` | | ||||
| | `OTEL_SERVICE_NAME` | Service tag for traces/logs. Set to `stellaops-console`. | auto | | ||||
| | `CONSOLE_TELEMETRY_SSE_ENABLED` | Enables `/console/telemetry` SSE feed for dashboards. | `true` | | ||||
|  | ||||
| Feature flag changes should be tracked in release notes and mirrored in `/docs/ui/navigation.md` (shortcuts may change when modules toggle). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 8 · Offline / Air-Gapped Workflow | ||||
|  | ||||
| - Mirror the console image and telemetry collector as part of the Offline Kit (see `/docs/install/docker.md` §4).   | ||||
| - Scrape metrics locally via `curl -k https://console.local/metrics > metrics.prom`; archive alongside logs for audits.   | ||||
| - Use `stella offline kit import` to keep the downloads manifest in sync; dashboards display staleness using `ui_download_manifest_refresh_seconds`.   | ||||
| - When collectors are unavailable, the console queues OTLP batches (up to 5 min) and exposes the backlog through `ui_telemetry_queue_depth`; export queue metrics to prove no data loss.   | ||||
| - After reconnecting, run `stella console status --telemetry` *(CLI parity pending; see DOCS-CONSOLE-23-014)* or verify that `ui_telemetry_batch_failures_total` has stopped increasing (a counter does not reset to zero on recovery).   | ||||
| - Retain telemetry bundles for 30 days per compliance guidelines; include Grafana JSON exports in audit packages. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 9 · Compliance Checklist | ||||
|  | ||||
| - [ ] `/metrics` scraped in staging & production; dashboards display `ui_route_render_seconds`, `ui_request_duration_seconds`, and downloads metrics.   | ||||
| - [ ] OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).   | ||||
| - [ ] Alert rules from §6 implemented in monitoring stack with runbooks linked.   | ||||
| - [ ] Feature flags documented and change-controlled; telemetry disabled only with approval.   | ||||
| - [ ] DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.   | ||||
| - [ ] Offline capture workflow exercised; evidence stored in audit vault.   | ||||
| - [ ] Screenshots of Grafana dashboards committed once they stabilise (update references).   | ||||
| - [ ] Cross-links verified (`docs/deploy/console.md`, `docs/security/console-security.md`, `docs/ui/downloads.md`, `docs/ui/console-overview.md`). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 10 · References | ||||
|  | ||||
| - `/docs/deploy/console.md` – Metrics endpoint, OTLP config, health checks.   | ||||
| - `/docs/security/console-security.md` – Security metrics & alert hints.   | ||||
| - `/docs/ui/console-overview.md` – Telemetry primitives and performance budgets.   | ||||
| - `/docs/ui/downloads.md` – Downloads metrics and parity workflow.   | ||||
| - `/docs/observability/observability.md` – Platform-wide practices.   | ||||
| - `/ops/telemetry-collector.md` & `/ops/telemetry-storage.md` – Collector deployment.   | ||||
| - `/docs/install/docker.md` – Compose/Helm environment variables. | ||||
|  | ||||
| --- | ||||
|  | ||||
| *Last updated: 2025-10-28 (Sprint 23).*  | ||||
|  | ||||
| # Console Observability | ||||
|  | ||||
| > **Audience:** Observability Guild, Console Guild, SRE/operators.   | ||||
| > **Scope:** Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).   | ||||
| > **Prerequisites:** Console deployed with metrics enabled (`CONSOLE_METRICS_ENABLED=true`) and OTLP exporters configured (`OTEL_EXPORTER_OTLP_*`). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 1 · Instrumentation Overview | ||||
|  | ||||
| - **Telemetry stack:** OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose `/metrics` (Prometheus) and `/health/*`.   | ||||
| - **Sampling:** Front-end spans are sampled at 5 % by default (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`). Metrics are unsampled; log sampling is handled per category (§3).   | ||||
| - **Correlation IDs:** Every API call carries `x-stellaops-correlation-id`; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.   | ||||
| - **Scope gating:** Operators need the `ui.telemetry` scope to view live charts in the Admin workspace; the scope also controls access to `/console/telemetry` SSE streams. | ||||
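To make the default sampling behaviour concrete, here is a minimal sketch of a parent-based, trace-ID-ratio decision. It is a simplified model, not the SDK's code: the function name `should_sample` and the choice of comparing the low 64 bits of the trace ID against a threshold are assumptions for illustration.

```python
from typing import Optional

# Assumption: the ratio decision looks only at the low 64 bits of the trace ID.
TRACE_ID_LOW_BITS = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float,
                  parent_sampled: Optional[bool] = None) -> bool:
    """Simplified parent-based, trace-ID-ratio sampling decision."""
    if parent_sampled is not None:
        # A span with a parent inherits the parent's decision, so a trace
        # is either kept or dropped as a whole.
        return parent_sampled
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0, 1]")
    # Root spans: keep the trace when its ID falls below the threshold.
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_LOW_BITS) < bound
```

Because the decision is a pure function of the trace ID, every component that sees the same ID at the same ratio reaches the same verdict, which is what keeps a 5 % sample consistent across UI, gateway, and backend.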
|  | ||||
| --- | ||||
|  | ||||
| ## 2 · Metrics | ||||
|  | ||||
| ### 2.1 Experience & Navigation | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `ui_route_render_seconds` | Histogram | `route`, `tenant`, `device` (`desktop`,`tablet`) | Time between route activation and first interactive paint. Target P95 ≤ 1.5 s (cached). | | ||||
| | `ui_request_duration_seconds` | Histogram | `service`, `method`, `status`, `tenant` | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. | | ||||
| | `ui_filter_apply_total` | Counter | `route`, `filter`, `tenant` | Increments when a global filter or context chip is applied. Used to track adoption of saved views. | | ||||
| | `ui_tenant_switch_total` | Counter | `fromTenant`, `toTenant`, `trigger` (`picker`, `shortcut`, `link`) | Emitted after a successful tenant switch; correlates with Authority `ui.tenant.switch` logs. | | ||||
| | `ui_offline_banner_seconds` | Histogram | `reason` (`authority`, `manifest`, `gateway`), `tenant` | Duration of offline banner visibility; integrate with air-gap SLAs. | | ||||
|  | ||||
| ### 2.2 Security & Session | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `ui_dpop_failure_total` | Counter | `endpoint`, `reason` (`nonce`, `jkt`, `clockSkew`) | Raised when DPoP validation fails; pair with Authority audit trail. | | ||||
| | `ui_fresh_auth_prompt_total` | Counter | `action` (`token.revoke`, `policy.activate`, `client.create`), `tenant` | Counts fresh-auth modals; a prompt rate above baseline indicates workflow friction. | | ||||
| | `ui_fresh_auth_failure_total` | Counter | `action`, `reason` (`timeout`,`cancelled`,`auth_error`) | Optional metric (set `CONSOLE_FRESH_AUTH_METRICS=true` when feature flag lands). | | ||||
|  | ||||
| ### 2.3 Downloads & Offline Kit | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `ui_download_manifest_refresh_seconds` | Histogram | `tenant`, `channel` (`edge`,`stable`,`airgap`) | Time to fetch and verify downloads manifest. Target < 3 s. | | ||||
| | `ui_download_export_queue_depth` | Gauge | `tenant`, `artifactType` (`sbom`,`policy`,`attestation`,`console`) | Mirrors `/console/downloads` queue depth; drives the DownloadsBacklog alert (§6) when offline bundles lag. | | ||||
| | `ui_download_command_copied_total` | Counter | `tenant`, `artifactType` | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. | | ||||
|  | ||||
| ### 2.4 Telemetry Emission & Errors | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `ui_telemetry_batch_failures_total` | Counter | `transport` (`otlp-http`,`otlp-grpc`), `reason` | Emitted by OTLP bridge when batches fail. Enable via `CONSOLE_METRICS_VERBOSE=true`. | | ||||
| | `ui_telemetry_queue_depth` | Gauge | `priority` (`normal`,`high`), `tenant` | Browser-side buffer depth; monitor for spikes under degraded collectors. | | ||||
|  | ||||
| > **Scraping tips:**   | ||||
| > - Enable `/metrics` via `CONSOLE_METRICS_ENABLED=true`.   | ||||
| > - Set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318` and relevant headers (`OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>`).   | ||||
| > - For air-gapped sites, point the exporter to the Offline Kit collector (`localhost:4318`) and forward the metrics snapshot using `stella offline bundle metrics`. | ||||
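Once `/metrics` has been scraped into a file, the snapshot can be inspected without a Prometheus server. The sketch below parses the text exposition format and reports the deepest telemetry queue. The sample payload, the `acme` tenant, and the helper names are hypothetical, and the parser deliberately ignores escaped quotes and other edge cases.

```python
import re
from typing import Dict, Iterator, Tuple

# name{labels} value [timestamp]; the optional timestamp is ignored.
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')

Sample = Tuple[str, Dict[str, str], float]


def iter_samples(text: str) -> Iterator[Sample]:
    """Yield (metric name, labels, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks plus HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # tolerate lines this simple parser cannot read
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def max_queue_depth(text: str, metric: str = "ui_telemetry_queue_depth") -> float:
    """Return the largest value reported for the given gauge (0.0 if absent)."""
    values = [v for name, _, v in iter_samples(text) if name == metric]
    return max(values, default=0.0)


# Hypothetical snapshot; real output depends on the deployment.
SAMPLE_TEXT = """\
# TYPE ui_telemetry_queue_depth gauge
ui_telemetry_queue_depth{priority="normal",tenant="acme"} 12
ui_telemetry_queue_depth{priority="high",tenant="acme"} 3
ui_download_export_queue_depth{tenant="acme",artifactType="sbom"} 7
"""

if __name__ == "__main__":
    print(max_queue_depth(SAMPLE_TEXT))
```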
|  | ||||
| --- | ||||
|  | ||||
| ## 3 · Logs | ||||
|  | ||||
| - **Format:** JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields: `timestamp`, `level`, `action`, `route`, `tenant`, `subject`, `correlationId`, `dpop.jkt`, `device`, `offlineMode`.   | ||||
| - **Categories:** | ||||
|   - `ui.action` – general user interactions (route changes, command palette, filter updates). Sampled 50 % by default; override with feature flag `telemetry.logVerbose`.   | ||||
|   - `ui.tenant.switch` – always logged; includes `fromTenant`, `toTenant`, `tokenId`, and Authority audit correlation.   | ||||
|   - `ui.download.commandCopied` – download commands copied; includes `artifactId`, `digest`, `manifestVersion`.   | ||||
|   - `ui.security.anomaly` – DPoP mismatches, tenant header errors, CSP violations (level = `Warning`).   | ||||
|   - `ui.telemetry.failure` – OTLP export errors; include `httpStatus`, `batchSize`, `retryCount`.   | ||||
| - **PII handling:** Full emails are scrubbed; only hashed values (`user:<sha256>`) appear in logs. This also holds for actions performed with `ui.admin` + fresh-auth, which remain redacted in logs.   | ||||
| - **Retention:** Recommended 14 days for connected sites, 30 days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label `service="stellaops-web-ui"`. | ||||
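The `user:<sha256>` convention can be illustrated with a short sketch that builds a log record from a subset of the core fields. The normalisation step (trim and lowercase) and the use of an unsalted hash are assumptions made for the example; the actual log bridge may use a keyed or salted digest.

```python
import hashlib
import json
from datetime import datetime, timezone


def pseudonymize(email: str) -> str:
    """Replace an email with a stable `user:<sha256>` token.

    Assumption: the address is trimmed and lowercased before hashing so the
    same person always maps to the same token.
    """
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    return f"user:{digest}"


def build_log_record(action: str, route: str, tenant: str, email: str,
                     correlation_id: str, level: str = "Information") -> str:
    """Serialise one structured UI event without leaking the raw email."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "action": action,
        "route": route,
        "tenant": tenant,
        "subject": pseudonymize(email),
        "correlationId": correlation_id,
    }
    return json.dumps(record, sort_keys=True)
```

Hashing on the way in, rather than scrubbing at the sink, means no downstream exporter ever handles the raw address.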
|  | ||||
| --- | ||||
|  | ||||
| ## 4 · Traces | ||||
|  | ||||
| - **Span names & attributes:** | ||||
|   - `ui.route.transition` – wraps route navigation; attributes: `route`, `tenant`, `renderMillis`, `prefetchHit`.   | ||||
|   - `ui.api.fetch` – HTTP fetch to backend; attributes: `service`, `endpoint`, `status`, `networkTime`.   | ||||
|   - `ui.sse.stream` – Server-sent event subscriptions (status ticker, runs); attributes: `channel`, `connectedMillis`, `reconnects`.   | ||||
|   - `ui.telemetry.batch` – Browser OTLP flush; attributes: `batchSize`, `success`, `retryCount`.   | ||||
|   - `ui.policy.action` – Policy workspace actions (simulate, approve, activate) per `docs/ui/policy-editor.md`.   | ||||
| - **Propagation:** Spans use W3C `traceparent`; gateway echoes header to backend APIs so traces stitch across UI → gateway → service.   | ||||
| - **Sampling controls:** `OTEL_TRACES_SAMPLER_ARG` (ratio) and feature flag `telemetry.forceSampling` (sets sampling to 100 % for incident debugging).   | ||||
| - **Viewing traces:** Grafana Tempo or Jaeger via collector. Filter by `service.name = stellaops-console`. For cross-service debugging, filter on `correlationId` and `tenant`. | ||||
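When a trace fails to stitch across UI → gateway → service, checking the `traceparent` header by hand is often the quickest diagnosis. The sketch below validates the version-00 header layout (`version-traceid-parentid-flags`) and extracts the sampled flag; the type and function names are illustrative only.

```python
import re
from typing import NamedTuple, Optional

# Version-00 layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None when it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest flag bit marks "sampled"
    return TraceParent(version, trace_id, parent_id, sampled)
```

A `None` result at the gateway boundary usually means a proxy rewrote or stripped the header, which would explain traces that start fresh at the backend.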
|  | ||||
| --- | ||||
|  | ||||
| ## 5 · Dashboards | ||||
|  | ||||
| ### 5.1 Experience Overview | ||||
|  | ||||
| Panels: | ||||
| - Route render histogram (P50/P90/P99) by route.   | ||||
| - Backend call latency stacked by service (`ui_request_duration_seconds`).   | ||||
| - Offline banner duration trend (`ui_offline_banner_seconds`).   | ||||
| - Tenant switch volume vs failure rate (overlay `ui_dpop_failure_total`).   | ||||
| - Command palette usage (`ui_filter_apply_total` + `ui.action` log counts). | ||||
|  | ||||
| ### 5.2 Downloads & Offline Kit | ||||
|  | ||||
| - Manifest refresh time chart (per channel).   | ||||
| - Export queue depth gauge with alert thresholds.   | ||||
| - CLI command adoption (bar chart per artifact type, using `ui_download_command_copied_total`).   | ||||
| - Offline parity banner occurrences (`downloads.offlineParity` flag from API → derived metric).   | ||||
| - Last Offline Kit import timestamp (join with Downloads API metadata). | ||||
|  | ||||
| ### 5.3 Security & Session | ||||
|  | ||||
| - Fresh-auth prompt counts vs success/fail ratios.   | ||||
| - DPoP failure stacked by reason.   | ||||
| - Tenant mismatch warnings (from `ui.security.anomaly` logs).   | ||||
| - Scope usage heatmap (derived from Authority audit events + UI logs).   | ||||
| - CSP violation counts (browser `securitypolicyviolation` listener forwarded to logs). | ||||
|  | ||||
| > Capture screenshots for Grafana once dashboards stabilise (`docs/assets/ui/observability/*.png`). Replace placeholders before releasing the doc. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 6 · Alerting | ||||
|  | ||||
| | Alert | Condition | Suggested Action | | ||||
| |-------|-----------|------------------| | ||||
| | **ConsoleLatencyHigh** | Share of `ui_route_render_seconds` observations in the `le="1.5"` bucket (bucket ÷ `_count`) drops below 0.95 for 3 intervals | Inspect route splits, check backend latencies, review CDN cache. | | ||||
| | **BackendLatencyHigh** | `ui_request_duration_seconds_sum / ui_request_duration_seconds_count` > 1 s for any service | Correlate with gateway/service dashboards; escalate to owning guild. | | ||||
| | **TenantSwitchFailures** | Increase in `ui_dpop_failure_total` or `ui.security.anomaly` (tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. | | ||||
| | **FreshAuthLoop** | `ui_fresh_auth_prompt_total` spikes with matching `ui_fresh_auth_failure_total` | Review Authority `/fresh-auth` endpoint, session timeout config, UX regressions. | | ||||
| | **OfflineBannerLong** | `ui_offline_banner_seconds` P95 > 120 s | Investigate Authority/gateway availability; verify Offline Kit freshness. | | ||||
| | **DownloadsBacklog** | `ui_download_export_queue_depth` > 5 for 10 min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline (`DOWNLOADS-CONSOLE-23-001`) is healthy. | | ||||
| | **TelemetryExportErrors** | Rate of `ui_telemetry_batch_failures_total` > 0 for ≥ 5 min | Check collector health, credentials, or TLS trust. | | ||||
|  | ||||
| Integrate alerts with Notifier (`ui.alerts`) or existing Ops channels. Tag incidents with `component=console` for correlation. | ||||
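The ConsoleLatencyHigh condition can be sketched in a few lines: compute the share of route renders that finished within 1.5 s in each interval, then fire when that share stays under 0.95 for three intervals in a row. Reading "3 intervals" as three consecutive evaluations is an assumption, as are the function names and the choice to treat an interval without traffic as healthy.

```python
from typing import Sequence, Tuple

# (increase of the le="1.5" bucket, increase of _count) for one interval
Interval = Tuple[float, float]


def fraction_within_budget(bucket_delta: float, count_delta: float) -> float:
    """Share of observations that landed in the le="1.5" bucket."""
    if count_delta <= 0:
        return 1.0  # no traffic: treat the interval as healthy
    return bucket_delta / count_delta


def latency_alert_firing(intervals: Sequence[Interval],
                         threshold: float = 0.95,
                         required: int = 3) -> bool:
    """Fire when the most recent `required` intervals all breach the threshold.

    `intervals` is ordered oldest to newest.
    """
    if len(intervals) < required:
        return False
    recent = intervals[-required:]
    return all(fraction_within_budget(b, c) < threshold for b, c in recent)
```

Working on per-interval increases rather than raw cumulative values matters here: the raw bucket and count series only ever grow, so their ratio would smooth over a recent regression.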
|  | ||||
| --- | ||||
|  | ||||
| ## 7 · Feature Flags & Configuration | ||||
|  | ||||
| | Flag / Env Var | Purpose | Default | | ||||
| |----------------|---------|---------| | ||||
| | `CONSOLE_FEATURE_FLAGS` | Enables UI modules (`runs`, `downloads`, `policies`, `telemetry`). Telemetry panel requires `telemetry`. | `runs,downloads,policies` | | ||||
| | `CONSOLE_METRICS_ENABLED` | Exposes `/metrics` for Prometheus scrape. | `true` | | ||||
| | `CONSOLE_METRICS_VERBOSE` | Emits additional batching metrics (`ui_telemetry_*`). | `false` | | ||||
| | `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`). Use `Debug` for incident sampling. | `Information` | | ||||
| | `CONSOLE_METRICS_SAMPLING` *(planned)* | Controls front-end span sampling ratio. Document once released. | `0.05` | | ||||
| | `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL; supports HTTPS. | unset | | ||||
| | `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers (auth). | unset | | ||||
| | `OTEL_EXPORTER_OTLP_INSECURE` | Allow HTTP (dev only). | `false` | | ||||
| | `OTEL_SERVICE_NAME` | Service tag for traces/logs. Set to `stellaops-console`. | auto | | ||||
| | `CONSOLE_TELEMETRY_SSE_ENABLED` | Enables `/console/telemetry` SSE feed for dashboards. | `true` | | ||||
|  | ||||
| Feature flag changes should be tracked in release notes and mirrored in `/docs/ui/navigation.md` (shortcuts may change when modules toggle). | ||||
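As a sketch of how the comma-separated variables in the table might be interpreted, the helpers below parse `CONSOLE_FEATURE_FLAGS` and `OTEL_EXPORTER_OTLP_HEADERS` from an environment mapping. The helper names and the `x-tenant` header in the usage note are hypothetical, and URL-decoding of header values is not handled.

```python
from typing import Dict, Mapping, Set

# Default taken from the table above.
DEFAULT_FLAGS = "runs,downloads,policies"


def enabled_modules(env: Mapping[str, str]) -> Set[str]:
    """Return the set of UI modules enabled by CONSOLE_FEATURE_FLAGS."""
    raw = env.get("CONSOLE_FEATURE_FLAGS", DEFAULT_FLAGS)
    return {item.strip() for item in raw.split(",") if item.strip()}


def otlp_headers(env: Mapping[str, str]) -> Dict[str, str]:
    """Split OTEL_EXPORTER_OTLP_HEADERS into a header dictionary."""
    headers: Dict[str, str] = {}
    for pair in env.get("OTEL_EXPORTER_OTLP_HEADERS", "").split(","):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        key, value = pair.split("=", 1)  # values may themselves contain "="
        headers[key.strip()] = value.strip()
    return headers


def telemetry_panel_available(env: Mapping[str, str]) -> bool:
    """The telemetry panel requires the `telemetry` module flag."""
    return "telemetry" in enabled_modules(env)
```

For example, `otlp_headers({"OTEL_EXPORTER_OTLP_HEADERS": "authorization=Bearer abc,x-tenant=acme"})` yields two headers, and `telemetry_panel_available({})` is false because the default flag set omits `telemetry`.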
|  | ||||
| --- | ||||
|  | ||||
| ## 8 · Offline / Air-Gapped Workflow | ||||
|  | ||||
| - Mirror the console image and telemetry collector as part of the Offline Kit (see `/docs/install/docker.md` §4).   | ||||
| - Scrape metrics locally via `curl -k https://console.local/metrics > metrics.prom`; archive alongside logs for audits.   | ||||
| - Use `stella offline kit import` to keep the downloads manifest in sync; dashboards display staleness using `ui_download_manifest_refresh_seconds`.   | ||||
| - When collectors are unavailable, the console queues OTLP batches (up to 5 min) and exposes the backlog through `ui_telemetry_queue_depth`; export queue metrics to prove no data loss.   | ||||
| - After reconnecting, run `stella console status --telemetry` *(CLI parity pending; see DOCS-CONSOLE-23-014)* or verify that `ui_telemetry_batch_failures_total` has stopped increasing (a counter does not reset to zero on recovery).   | ||||
| - Retain telemetry bundles for 30 days per compliance guidelines; include Grafana JSON exports in audit packages. | ||||
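The queueing behaviour during a collector outage can be modelled as a time-bounded buffer. The sketch below is not the console's implementation: it assumes "up to 5 min" means batches older than 300 seconds are evicted, and the class name, the `dropped` counter, and the injectable clock are all illustrative.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Tuple


class BoundedBatchQueue:
    """Hold telemetry batches for at most `max_age_seconds` while offline."""

    def __init__(self, max_age_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_age = max_age_seconds
        self._clock = clock
        self._items: Deque[Tuple[float, bytes]] = deque()
        self.dropped = 0  # batches evicted because they aged out

    def _evict_expired(self) -> None:
        cutoff = self._clock() - self._max_age
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()
            self.dropped += 1

    def enqueue(self, batch: bytes) -> None:
        self._evict_expired()
        self._items.append((self._clock(), batch))

    @property
    def depth(self) -> int:
        """Current backlog; analogous to what a queue-depth gauge reports."""
        self._evict_expired()
        return len(self._items)

    def drain(self) -> List[bytes]:
        """Return all surviving batches in arrival order and empty the queue."""
        self._evict_expired()
        batches = [batch for _, batch in self._items]
        self._items.clear()
        return batches
```

Under this model, a `dropped` count of zero after reconnecting is the evidence that the outage stayed within the buffer window and no telemetry was lost.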
|  | ||||
| --- | ||||
|  | ||||
| ## 9 · Compliance Checklist | ||||
|  | ||||
| - [ ] `/metrics` scraped in staging & production; dashboards display `ui_route_render_seconds`, `ui_request_duration_seconds`, and downloads metrics.   | ||||
| - [ ] OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).   | ||||
| - [ ] Alert rules from §6 implemented in monitoring stack with runbooks linked.   | ||||
| - [ ] Feature flags documented and change-controlled; telemetry disabled only with approval.   | ||||
| - [ ] DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.   | ||||
| - [ ] Offline capture workflow exercised; evidence stored in audit vault.   | ||||
| - [ ] Screenshots of Grafana dashboards committed once they stabilise (update references).   | ||||
| - [ ] Cross-links verified (`docs/deploy/console.md`, `docs/security/console-security.md`, `docs/ui/downloads.md`, `docs/ui/console-overview.md`). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 10 · References | ||||
|  | ||||
| - `/docs/deploy/console.md` – Metrics endpoint, OTLP config, health checks.   | ||||
| - `/docs/security/console-security.md` – Security metrics & alert hints.   | ||||
| - `/docs/ui/console-overview.md` – Telemetry primitives and performance budgets.   | ||||
| - `/docs/ui/downloads.md` – Downloads metrics and parity workflow.   | ||||
| - `/docs/observability/observability.md` – Platform-wide practices.   | ||||
| - `/ops/telemetry-collector.md` & `/ops/telemetry-storage.md` – Collector deployment.   | ||||
| - `/docs/install/docker.md` – Compose/Helm environment variables. | ||||
|  | ||||
| --- | ||||
|  | ||||
| *Last updated: 2025-10-28 (Sprint 23).*  | ||||
|  | ||||
|   | ||||