feat: Implement Scheduler Worker Options and Planner Loop
	
		
			
	
		
	
	
		
	
		
			Some checks failed
		
		
	
	
		
			
				
	
				Docs CI / lint-and-preview (push) Has been cancelled
				
			
		
		
	
	
				
					
				
			
		
			Some checks failed
		
		
	
	Docs CI / lint-and-preview (push) Has been cancelled
				
			- Added `SchedulerWorkerOptions` class to encapsulate configuration for the scheduler worker. - Introduced `PlannerBackgroundService` to manage the planner loop, fetching and processing planning runs. - Created `PlannerExecutionService` to handle the execution logic for planning runs, including impact targeting and run persistence. - Developed `PlannerExecutionResult` and `PlannerExecutionStatus` to standardize execution outcomes. - Implemented validation logic within `SchedulerWorkerOptions` to ensure proper configuration. - Added documentation for the planner loop and impact targeting features. - Established health check endpoints and authentication mechanisms for the Signals service. - Created unit tests for the Signals API to ensure proper functionality and response handling. - Configured options for authority integration and fallback authentication methods.
This commit is contained in:
		
							
								
								
									
										191
									
								
								docs/observability/ui-telemetry.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										191
									
								
								docs/observability/ui-telemetry.md
									
									
									
									
									
										Normal file
									
								
							| @@ -0,0 +1,191 @@ | ||||
| # Console Observability | ||||
|  | ||||
| > **Audience:** Observability Guild, Console Guild, SRE/operators.   | ||||
| > **Scope:** Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).   | ||||
| > **Prerequisites:** Console deployed with metrics enabled (`CONSOLE_METRICS_ENABLED=true`) and OTLP exporters configured (`OTEL_EXPORTER_OTLP_*`). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 1 · Instrumentation Overview | ||||
|  | ||||
| - **Telemetry stack:** OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose `/metrics` (Prometheus) and `/health/*`.   | ||||
| - **Sampling:** Front-end spans sample at 5 % by default (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`). Metrics are un-sampled; log sampling is handled per category (§3).   | ||||
| - **Correlation IDs:** Every API call carries `x-stellaops-correlation-id`; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.   | ||||
| - **Scope gating:** Operators need the `ui.telemetry` scope to view live charts in the Admin workspace; the scope also controls access to `/console/telemetry` SSE streams. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 2 · Metrics | ||||
|  | ||||
| ### 2.1 Experience & Navigation | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `ui_route_render_seconds` | Histogram | `route`, `tenant`, `device` (`desktop`,`tablet`) | Time between route activation and first interactive paint. Target P95 ≤ 1.5 s (cached). | | ||||
| | `ui_request_duration_seconds` | Histogram | `service`, `method`, `status`, `tenant` | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. | | ||||
| | `ui_filter_apply_total` | Counter | `route`, `filter`, `tenant` | Increments when a global filter or context chip is applied. Used to track adoption of saved views. | | ||||
| | `ui_tenant_switch_total` | Counter | `fromTenant`, `toTenant`, `trigger` (`picker`, `shortcut`, `link`) | Emitted after a successful tenant switch; correlates with Authority `ui.tenant.switch` logs. | | ||||
| | `ui_offline_banner_seconds` | Histogram | `reason` (`authority`, `manifest`, `gateway`), `tenant` | Duration of offline banner visibility; integrate with air-gap SLAs. | | ||||
|  | ||||
| ### 2.2 Security & Session | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `ui_dpop_failure_total` | Counter | `endpoint`, `reason` (`nonce`, `jkt`, `clockSkew`) | Raised when DPoP validation fails; pair with Authority audit trail. | | ||||
| | `ui_fresh_auth_prompt_total` | Counter | `action` (`token.revoke`, `policy.activate`, `client.create`), `tenant` | Counts fresh-auth modals; backlog above baseline indicates workflow friction. | | ||||
| | `ui_fresh_auth_failure_total` | Counter | `action`, `reason` (`timeout`,`cancelled`,`auth_error`) | Optional metric (set `CONSOLE_FRESH_AUTH_METRICS=true` when feature flag lands). | | ||||
|  | ||||
| ### 2.3 Downloads & Offline Kit | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `ui_download_manifest_refresh_seconds` | Histogram | `tenant`, `channel` (`edge`,`stable`,`airgap`) | Time to fetch and verify downloads manifest. Target < 3 s. | | ||||
| | `ui_download_export_queue_depth` | Gauge | `tenant`, `artifactType` (`sbom`,`policy`,`attestation`,`console`) | Mirrors `/console/downloads` queue depth; triggers when offline bundles lag. | | ||||
| | `ui_download_command_copied_total` | Counter | `tenant`, `artifactType` | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. | | ||||
|  | ||||
| ### 2.4 Telemetry Emission & Errors | ||||
|  | ||||
| | Metric | Type | Labels | Notes | | ||||
| |--------|------|--------|-------| | ||||
| | `ui_telemetry_batch_failures_total` | Counter | `transport` (`otlp-http`,`otlp-grpc`), `reason` | Emitted by OTLP bridge when batches fail. Enable via `CONSOLE_METRICS_VERBOSE=true`. | | ||||
| | `ui_telemetry_queue_depth` | Gauge | `priority` (`normal`,`high`), `tenant` | Browser-side buffer depth; monitor for spikes under degraded collectors. | | ||||
|  | ||||
| > **Scraping tips:**   | ||||
| > - Enable `/metrics` via `CONSOLE_METRICS_ENABLED=true`.   | ||||
| > - Set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318` and relevant headers (`OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>`).   | ||||
| > - For air-gapped sites, point the exporter to the Offline Kit collector (`localhost:4318`) and forward the metrics snapshot using `stella offline bundle metrics`. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 3 · Logs | ||||
|  | ||||
| - **Format:** JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields: `timestamp`, `level`, `action`, `route`, `tenant`, `subject`, `correlationId`, `dpop.jkt`, `device`, `offlineMode`.   | ||||
| - **Categories:** | ||||
|   - `ui.action` – general user interactions (route changes, command palette, filter updates). Sampled 50 % by default; override with feature flag `telemetry.logVerbose`.   | ||||
|   - `ui.tenant.switch` – always logged; includes `fromTenant`, `toTenant`, `tokenId`, and Authority audit correlation.   | ||||
|   - `ui.download.commandCopied` – download commands copied; includes `artifactId`, `digest`, `manifestVersion`.   | ||||
|   - `ui.security.anomaly` – DPoP mismatches, tenant header errors, CSP violations (level = `Warning`).   | ||||
|   - `ui.telemetry.failure` – OTLP export errors; include `httpStatus`, `batchSize`, `retryCount`.   | ||||
| - **PII handling:** Full emails are scrubbed; only hashed values (`user:<sha256>`) appear unless `ui.admin` + fresh-auth were granted for the action (still redacted in logs).   | ||||
| - **Retention:** Recommended 14 days for connected sites, 30 days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label `service="stellaops-web-ui"`. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 4 · Traces | ||||
|  | ||||
| - **Span names & attributes:** | ||||
|   - `ui.route.transition` – wraps route navigation; attributes: `route`, `tenant`, `renderMillis`, `prefetchHit`.   | ||||
|   - `ui.api.fetch` – HTTP fetch to backend; attributes: `service`, `endpoint`, `status`, `networkTime`.   | ||||
|   - `ui.sse.stream` – Server-sent event subscriptions (status ticker, runs); attributes: `channel`, `connectedMillis`, `reconnects`.   | ||||
|   - `ui.telemetry.batch` – Browser OTLP flush; attributes: `batchSize`, `success`, `retryCount`.   | ||||
|   - `ui.policy.action` – Policy workspace actions (simulate, approve, activate) per `docs/ui/policy-editor.md`.   | ||||
| - **Propagation:** Spans use W3C `traceparent`; gateway echoes header to backend APIs so traces stitch across UI → gateway → service.   | ||||
| - **Sampling controls:** `OTEL_TRACES_SAMPLER_ARG` (ratio) and feature flag `telemetry.forceSampling` (sets to 100 % for incident debugging).   | ||||
| - **Viewing traces:** Grafana Tempo or Jaeger via collector. Filter by `service.name = stellaops-console`. For cross-service debugging, filter on `correlationId` and `tenant`. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 5 · Dashboards | ||||
|  | ||||
| ### 5.1 Experience Overview | ||||
|  | ||||
| Panels: | ||||
| - Route render histogram (P50/P90/P99) by route.   | ||||
| - Backend call latency stacked by service (`ui_request_duration_seconds`).   | ||||
| - Offline banner duration trend (`ui_offline_banner_seconds`).   | ||||
| - Tenant switch volume vs failure rate (overlay `ui_dpop_failure_total`).   | ||||
| - Command palette usage (`ui_filter_apply_total` + `ui.action` log counts). | ||||
|  | ||||
| ### 5.2 Downloads & Offline Kit | ||||
|  | ||||
| - Manifest refresh time chart (per channel).   | ||||
| - Export queue depth gauge with alert thresholds.   | ||||
| - CLI command adoption (bar chart per artifact type, using `ui_download_command_copied_total`).   | ||||
| - Offline parity banner occurrences (`downloads.offlineParity` flag from API → derived metric).   | ||||
| - Last Offline Kit import timestamp (join with Downloads API metadata). | ||||
|  | ||||
| ### 5.3 Security & Session | ||||
|  | ||||
| - Fresh-auth prompt counts vs success/fail ratios.   | ||||
| - DPoP failure stacked by reason.   | ||||
| - Tenant mismatch warnings (from `ui.security.anomaly` logs).   | ||||
| - Scope usage heatmap (derived from Authority audit events + UI logs).   | ||||
| - CSP violation counts (browser `securitypolicyviolation` listener forwarded to logs). | ||||
|  | ||||
| > Capture screenshots for Grafana once dashboards stabilise (`docs/assets/ui/observability/*.png`). Replace placeholders before releasing the doc. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 6 · Alerting | ||||
|  | ||||
| | Alert | Condition | Suggested Action | | ||||
| |-------|-----------|------------------| | ||||
| | **ConsoleLatencyHigh** | `ui_route_render_seconds_bucket{le="1.5"}` drops below 0.95 for 3 intervals | Inspect route splits, check backend latencies, review CDN cache. | | ||||
| | **BackendLatencyHigh** | `ui_request_duration_seconds_sum / ui_request_duration_seconds_count` > 1 s for any service | Correlate with gateway/service dashboards; escalate to owning guild. | | ||||
| | **TenantSwitchFailures** | Increase in `ui_dpop_failure_total` or `ui.security.anomaly` (tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. | | ||||
| | **FreshAuthLoop** | `ui_fresh_auth_prompt_total` spikes with matching `ui_fresh_auth_failure_total` | Review Authority `/fresh-auth` endpoint, session timeout config, UX regressions. | | ||||
| | **OfflineBannerLong** | `ui_offline_banner_seconds` P95 > 120 s | Investigate Authority/gateway availability; verify Offline Kit freshness. | | ||||
| | **DownloadsBacklog** | `ui_download_export_queue_depth` > 5 for 10 min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline (`DOWNLOADS-CONSOLE-23-001`) is healthy. | | ||||
| | **TelemetryExportErrors** | `ui_telemetry_batch_failures_total` > 0 for ≥5 min | Check collector health, credentials, or TLS trust. | | ||||
|  | ||||
| Integrate alerts with Notifier (`ui.alerts`) or existing Ops channels. Tag incidents with `component=console` for correlation. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 7 · Feature Flags & Configuration | ||||
|  | ||||
| | Flag / Env Var | Purpose | Default | | ||||
| |----------------|---------|---------| | ||||
| | `CONSOLE_FEATURE_FLAGS` | Enables UI modules (`runs`, `downloads`, `policies`, `telemetry`). Telemetry panel requires `telemetry`. | `runs,downloads,policies` | | ||||
| | `CONSOLE_METRICS_ENABLED` | Exposes `/metrics` for Prometheus scrape. | `true` | | ||||
| | `CONSOLE_METRICS_VERBOSE` | Emits additional batching metrics (`ui_telemetry_*`). | `false` | | ||||
| | `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`). Use `Debug` for incident sampling. | `Information` | | ||||
| | `CONSOLE_METRICS_SAMPLING` *(planned)* | Controls front-end span sampling ratio. Document once released. | `0.05` | | ||||
| | `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL; supports HTTPS. | unset | | ||||
| | `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers (auth). | unset | | ||||
| | `OTEL_EXPORTER_OTLP_INSECURE` | Allow HTTP (dev only). | `false` | | ||||
| | `OTEL_SERVICE_NAME` | Service tag for traces/logs. Set to `stellaops-console`. | auto | | ||||
| | `CONSOLE_TELEMETRY_SSE_ENABLED` | Enables `/console/telemetry` SSE feed for dashboards. | `true` | | ||||
|  | ||||
| Feature flag changes should be tracked in release notes and mirrored in `/docs/ui/navigation.md` (shortcuts may change when modules toggle). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 8 · Offline / Air-Gapped Workflow | ||||
|  | ||||
| - Mirror the console image and telemetry collector as part of the Offline Kit (see `/docs/install/docker.md` §4).   | ||||
| - Scrape metrics locally via `curl -k https://console.local/metrics > metrics.prom`; archive alongside logs for audits.   | ||||
| - Use `stella offline kit import` to keep the downloads manifest in sync; dashboards display staleness using `ui_download_manifest_refresh_seconds`.   | ||||
| - When collectors are unavailable, console queues OTLP batches (up to 5 min) and exposes backlog through `ui_telemetry_queue_depth`; export queue metrics to prove no data loss.   | ||||
| - After reconnecting, run `stella console status --telemetry` *(CLI parity pending; see DOCS-CONSOLE-23-014)* or verify `ui_telemetry_batch_failures_total` resets to zero.   | ||||
| - Retain telemetry bundles for 30 days per compliance guidelines; include Grafana JSON exports in audit packages. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 9 · Compliance Checklist | ||||
|  | ||||
| - [ ] `/metrics` scraped in staging & production; dashboards display `ui_route_render_seconds`, `ui_request_duration_seconds`, and downloads metrics.   | ||||
| - [ ] OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).   | ||||
| - [ ] Alert rules from §6 implemented in monitoring stack with runbooks linked.   | ||||
| - [ ] Feature flags documented and change-controlled; telemetry disabled only with approval.   | ||||
| - [ ] DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.   | ||||
| - [ ] Offline capture workflow exercised; evidence stored in audit vault.   | ||||
| - [ ] Screenshots of Grafana dashboards committed once they stabilise (update references).   | ||||
| - [ ] Cross-links verified (`docs/deploy/console.md`, `docs/security/console-security.md`, `docs/ui/downloads.md`, `docs/ui/console-overview.md`). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 10 · References | ||||
|  | ||||
| - `/docs/deploy/console.md` – Metrics endpoint, OTLP config, health checks.   | ||||
| - `/docs/security/console-security.md` – Security metrics & alert hints.   | ||||
| - `/docs/ui/console-overview.md` – Telemetry primitives and performance budgets.   | ||||
| - `/docs/ui/downloads.md` – Downloads metrics and parity workflow.   | ||||
| - `/docs/observability/observability.md` – Platform-wide practices.   | ||||
| - `/ops/telemetry-collector.md` & `/ops/telemetry-storage.md` – Collector deployment.   | ||||
| - `/docs/install/docker.md` – Compose/Helm environment variables. | ||||
|  | ||||
| --- | ||||
|  | ||||
| *Last updated: 2025-10-28 (Sprint 23).*  | ||||
|  | ||||
		Reference in New Issue
	
	Block a user