	Console Observability
Audience: Observability Guild, Console Guild, SRE/operators.
Scope: Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).
Prerequisites: Console deployed with metrics enabled (CONSOLE_METRICS_ENABLED=true) and OTLP exporters configured (OTEL_EXPORTER_OTLP_*).
1 · Instrumentation Overview
- Telemetry stack: OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose `/metrics` (Prometheus) and `/health/*`.
- Sampling: Front-end spans sample at 5 % by default (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`). Metrics are un-sampled; log sampling is handled per category (§3).
- Correlation IDs: Every API call carries `x-stellaops-correlation-id`; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.
- Scope gating: Operators need the `ui.telemetry` scope to view live charts in the Admin workspace; the scope also controls access to `/console/telemetry` SSE streams.
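To make the 5 % default concrete, here is a minimal sketch of a parent-based, trace-ID-ratio sampling decision. It follows the general approach such samplers take (compare the low 64 bits of the trace ID against `ratio * 2**64`) and is not the Console's own code; the function names `should_sample` and `decide` are hypothetical.

```python
import secrets

SAMPLE_RATIO = 0.05  # the 5 % front-end default stated above


def should_sample(trace_id: int, ratio: float = SAMPLE_RATIO) -> bool:
    """Keep a root trace when the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0, 1]")
    bound = round(ratio * (1 << 64))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound


def decide(parent_sampled, trace_id: int, ratio: float = SAMPLE_RATIO) -> bool:
    """Parent-based wrapper: follow the parent's decision when one exists."""
    if parent_sampled is not None:
        return parent_sampled
    return should_sample(trace_id, ratio)


if __name__ == "__main__":
    ids = [secrets.randbits(128) for _ in range(100_000)]
    kept = sum(should_sample(t) for t in ids)
    print(f"kept {kept / len(ids):.2%} of root traces")
```

Because the decision depends only on the trace ID, every component that sees the same ID reaches the same verdict, which keeps traces whole rather than partially sampled.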
2 · Metrics
2.1 Experience & Navigation
| Metric | Type | Labels | Notes |
|---|---|---|---|
| `ui_route_render_seconds` | Histogram | route, tenant, device (desktop, tablet) | Time between route activation and first interactive paint. Target P95 ≤ 1.5 s (cached). |
| `ui_request_duration_seconds` | Histogram | service, method, status, tenant | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. |
| `ui_filter_apply_total` | Counter | route, filter, tenant | Increments when a global filter or context chip is applied. Used to track adoption of saved views. |
| `ui_tenant_switch_total` | Counter | fromTenant, toTenant, trigger (picker, shortcut, link) | Emitted after a successful tenant switch; correlates with Authority `ui.tenant.switch` logs. |
| `ui_offline_banner_seconds` | Histogram | reason (authority, manifest, gateway), tenant | Duration of offline banner visibility; integrate with air-gap SLAs. |
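Percentile targets such as the P95 for `ui_route_render_seconds` are normally estimated from cumulative histogram buckets. The sketch below shows one common way to do that (linear interpolation inside the bucket that contains the target rank); the bucket bounds and counts in the usage lines are made-up illustration values, and `estimate_quantile` is a hypothetical helper, not part of the Console.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with float("inf").
    """
    if not buckets or not 0.0 <= q <= 1.0:
        raise ValueError("need at least one bucket and q within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Cannot interpolate into the open-ended bucket.
                return prev_bound
            if count == prev_count:
                return bound
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Illustrative counts only: 96 of 100 renders finished within 1.5 s.
    sample = [(0.5, 50), (1.0, 80), (1.5, 96), (float("inf"), 100)]
    print(f"estimated P95: {estimate_quantile(0.95, sample):.3f} s")
```

The estimate is only as fine as the bucket layout, so a bucket boundary at the 1.5 s target makes the comparison far more reliable.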
2.2 Security & Session
| Metric | Type | Labels | Notes |
|---|---|---|---|
| `ui_dpop_failure_total` | Counter | endpoint, reason (nonce, jkt, clockSkew) | Raised when DPoP validation fails; pair with Authority audit trail. |
| `ui_fresh_auth_prompt_total` | Counter | action (token.revoke, policy.activate, client.create), tenant | Counts fresh-auth modals; backlog above baseline indicates workflow friction. |
| `ui_fresh_auth_failure_total` | Counter | action, reason (timeout, cancelled, auth_error) | Optional metric (set `CONSOLE_FRESH_AUTH_METRICS=true` when feature flag lands). |
2.3 Downloads & Offline Kit
| Metric | Type | Labels | Notes |
|---|---|---|---|
| `ui_download_manifest_refresh_seconds` | Histogram | tenant, channel (edge, stable, airgap) | Time to fetch and verify downloads manifest. Target < 3 s. |
| `ui_download_export_queue_depth` | Gauge | tenant, artifactType (sbom, policy, attestation, console) | Mirrors `/console/downloads` queue depth; triggers when offline bundles lag. |
| `ui_download_command_copied_total` | Counter | tenant, artifactType | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. |
2.4 Telemetry Emission & Errors
| Metric | Type | Labels | Notes |
|---|---|---|---|
| `ui_telemetry_batch_failures_total` | Counter | transport (otlp-http, otlp-grpc), reason | Emitted by OTLP bridge when batches fail. Enable via `CONSOLE_METRICS_VERBOSE=true`. |
| `ui_telemetry_queue_depth` | Gauge | priority (normal, high), tenant | Browser-side buffer depth; monitor for spikes under degraded collectors. |
Scraping tips:
- Enable `/metrics` via `CONSOLE_METRICS_ENABLED=true`.
- Set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318` and relevant headers (`OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>`).
- For air-gapped sites, point the exporter to the Offline Kit collector (`localhost:4318`) and forward the metrics snapshot using `stella offline bundle metrics`.
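The `OTEL_EXPORTER_OTLP_HEADERS` value is a comma-separated list of `key=value` pairs, which is easy to get subtly wrong when a token contains `=` or URL-encoded characters. The following sketch shows one way to validate such a value before deploying it; `parse_otlp_headers` is a hypothetical helper, and lower-casing the header names is a choice made here for illustration.

```python
from typing import Dict
from urllib.parse import unquote


def parse_otlp_headers(raw: str) -> Dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict, URL-decoding values."""
    headers: Dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed header entry: {pair!r}")
        headers[key.strip().lower()] = unquote(value.strip())
    return headers


if __name__ == "__main__":
    print(parse_otlp_headers("authorization=Bearer%20example-token"))
```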
3 · Logs
- Format: JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields: `timestamp`, `level`, `action`, `route`, `tenant`, `subject`, `correlationId`, `dpop.jkt`, `device`, `offlineMode`.
- Categories:
  - `ui.action` – general user interactions (route changes, command palette, filter updates). Sampled 50 % by default; override with feature flag `telemetry.logVerbose`.
  - `ui.tenant.switch` – always logged; includes `fromTenant`, `toTenant`, `tokenId`, and Authority audit correlation.
  - `ui.download.commandCopied` – download commands copied; includes `artifactId`, `digest`, `manifestVersion`.
  - `ui.security.anomaly` – DPoP mismatches, tenant header errors, CSP violations (level = `Warning`).
  - `ui.telemetry.failure` – OTLP export errors; includes `httpStatus`, `batchSize`, `retryCount`.
- PII handling: Full emails are scrubbed; only hashed values (`user:<sha256>`) appear unless `ui.admin` + fresh-auth were granted for the action, and even then the value stays redacted in logs.
- Retention: Recommended 14 days for connected sites, 30 days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label `service="stellaops-web-ui"`.
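As an illustration of the log shape described above, the sketch below builds one JSON record with the core fields and a hashed `subject`. The exact hashing input is not specified in this document; hashing the trimmed, lower-cased email with SHA-256 is an assumption, and `hash_subject` and `build_log_record` are hypothetical names.

```python
import hashlib
import json
from datetime import datetime, timezone


def hash_subject(email: str) -> str:
    """Return 'user:<sha256>' for an email (normalisation is an assumption)."""
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    return f"user:{digest}"


def build_log_record(action, route, tenant, email, correlation_id,
                     level="Information", jkt=None, device="desktop",
                     offline_mode=False) -> str:
    """Serialise one structured log line with the core fields listed above."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "action": action,
        "route": route,
        "tenant": tenant,
        "subject": hash_subject(email),  # never the raw email
        "correlationId": correlation_id,
        "dpop.jkt": jkt,
        "device": device,
        "offlineMode": offline_mode,
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(build_log_record("ui.tenant.switch", "/admin", "acme",
                           "operator@example.com", "corr-123"))
```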
4 · Traces
- Span names & attributes:
  - `ui.route.transition` – wraps route navigation; attributes: `route`, `tenant`, `renderMillis`, `prefetchHit`.
  - `ui.api.fetch` – HTTP fetch to backend; attributes: `service`, `endpoint`, `status`, `networkTime`.
  - `ui.sse.stream` – Server-sent event subscriptions (status ticker, runs); attributes: `channel`, `connectedMillis`, `reconnects`.
  - `ui.telemetry.batch` – Browser OTLP flush; attributes: `batchSize`, `success`, `retryCount`.
  - `ui.policy.action` – Policy workspace actions (simulate, approve, activate) per `docs/ui/policy-editor.md`.
- Propagation: Spans use W3C `traceparent`; gateway echoes header to backend APIs so traces stitch across UI → gateway → service.
- Sampling controls: `OTEL_TRACES_SAMPLER_ARG` (ratio) and feature flag `telemetry.forceSampling` (sets to 100 % for incident debugging).
- Viewing traces: Grafana Tempo or Jaeger via collector. Filter by `service.name = stellaops-console`. For cross-service debugging, filter on `correlationId` and `tenant`.
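For readers unfamiliar with the `traceparent` header, this sketch parses the version-00 layout (`00-<trace-id>-<parent-id>-<flags>`) and derives a child header that keeps the same trace ID, which is what lets spans stitch across hops. It is a simplified illustration, not the gateway's implementation; `parse_traceparent` and `child_traceparent` are hypothetical names.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent; return None when it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Only version 00 is handled in this sketch; all-zero IDs are invalid.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for an outgoing call: same trace ID, new span ID."""
    span_id = secrets.token_hex(8)
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx = parse_traceparent(incoming)
    if ctx is not None:
        print(child_traceparent(ctx))
```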
5 · Dashboards
5.1 Experience Overview
Panels:
- Route render histogram (P50/P90/P99) by route.
- Backend call latency stacked by service (`ui_request_duration_seconds`).
- Offline banner duration trend (`ui_offline_banner_seconds`).
- Tenant switch volume vs failure rate (overlay `ui_dpop_failure_total`).
- Command palette usage (`ui_filter_apply_total` + `ui.action` log counts).
5.2 Downloads & Offline Kit
- Manifest refresh time chart (per channel).
- Export queue depth gauge with alert thresholds.
- CLI command adoption (bar chart per artifact type, using `ui_download_command_copied_total`).
- Offline parity banner occurrences (`downloads.offlineParity` flag from API → derived metric).
- Last Offline Kit import timestamp (join with Downloads API metadata).
 
5.3 Security & Session
- Fresh-auth prompt counts vs success/fail ratios.
- DPoP failure stacked by reason.
- Tenant mismatch warnings (from `ui.security.anomaly` logs).
- Scope usage heatmap (derived from Authority audit events + UI logs).
- CSP violation counts (browser `securitypolicyviolation` listener forwarded to logs).
Capture screenshots for Grafana once dashboards stabilise (`docs/assets/ui/observability/*.png`). Replace placeholders before releasing the doc.
6 · Alerting
| Alert | Condition | Suggested Action |
|---|---|---|
| ConsoleLatencyHigh | Share of observations in `ui_route_render_seconds_bucket{le="1.5"}` (relative to `ui_route_render_seconds_count`) drops below 0.95 for 3 intervals | Inspect route splits, check backend latencies, review CDN cache. |
| BackendLatencyHigh | `ui_request_duration_seconds_sum / ui_request_duration_seconds_count` > 1 s for any service | Correlate with gateway/service dashboards; escalate to owning guild. |
| TenantSwitchFailures | Increase in `ui_dpop_failure_total` or `ui.security.anomaly` (tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. |
| FreshAuthLoop | `ui_fresh_auth_prompt_total` spikes with matching `ui_fresh_auth_failure_total` | Review Authority `/fresh-auth` endpoint, session timeout config, UX regressions. |
| OfflineBannerLong | `ui_offline_banner_seconds` P95 > 120 s | Investigate Authority/gateway availability; verify Offline Kit freshness. |
| DownloadsBacklog | `ui_download_export_queue_depth` > 5 for 10 min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline (DOWNLOADS-CONSOLE-23-001) is healthy. |
| TelemetryExportErrors | `ui_telemetry_batch_failures_total` > 0 for ≥5 min | Check collector health, credentials, or TLS trust. |
Integrate alerts with Notifier (ui.alerts) or existing Ops channels. Tag incidents with component=console for correlation.
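The ConsoleLatencyHigh and BackendLatencyHigh conditions both reduce to simple arithmetic on counter deltas taken per evaluation interval. The sketch below spells that arithmetic out for a dry run against sample numbers; in practice these rules would live in the monitoring stack's own rule language, and all function names here are hypothetical.

```python
from typing import Callable, Sequence


def mean_latency_seconds(sum_delta: float, count_delta: float) -> float:
    """Mean backend latency over an interval: delta of _sum over delta of _count."""
    return sum_delta / count_delta if count_delta > 0 else 0.0


def fraction_within_target(bucket_delta: float, count_delta: float) -> float:
    """Share of renders that landed in the le="1.5" bucket during an interval."""
    return bucket_delta / count_delta if count_delta > 0 else 1.0


def breached_for(values: Sequence[float],
                 predicate: Callable[[float], bool],
                 intervals: int) -> bool:
    """True when the most recent `intervals` values all satisfy the predicate."""
    if len(values) < intervals:
        return False
    return all(predicate(v) for v in values[-intervals:])


if __name__ == "__main__":
    # Illustrative per-interval fractions; the last three are below 0.95.
    fractions = [0.97, 0.94, 0.93, 0.90]
    print("ConsoleLatencyHigh:", breached_for(fractions, lambda f: f < 0.95, 3))
    print("BackendLatencyHigh:", mean_latency_seconds(12.0, 10) > 1.0)
```

Requiring several consecutive breaching intervals, as ConsoleLatencyHigh does, avoids paging on a single slow scrape window.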
7 · Feature Flags & Configuration
| Flag / Env Var | Purpose | Default |
|---|---|---|
| `CONSOLE_FEATURE_FLAGS` | Enables UI modules (`runs`, `downloads`, `policies`, `telemetry`). Telemetry panel requires `telemetry`. | `runs,downloads,policies` |
| `CONSOLE_METRICS_ENABLED` | Exposes `/metrics` for Prometheus scrape. | `true` |
| `CONSOLE_METRICS_VERBOSE` | Emits additional batching metrics (`ui_telemetry_*`). | `false` |
| `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`). Use `Debug` for incident sampling. | `Information` |
| `CONSOLE_METRICS_SAMPLING` (planned) | Controls front-end span sampling ratio. Document once released. | `0.05` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL; supports HTTPS. | unset |
| `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers (auth). | unset |
| `OTEL_EXPORTER_OTLP_INSECURE` | Allow HTTP (dev only). | `false` |
| `OTEL_SERVICE_NAME` | Service tag for traces/logs. Set to `stellaops-console`. | auto |
| `CONSOLE_TELEMETRY_SSE_ENABLED` | Enables `/console/telemetry` SSE feed for dashboards. | `true` |
Feature flag changes should be tracked in release notes and mirrored in /docs/ui/navigation.md (shortcuts may change when modules toggle).
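When reviewing a deployment's environment, it can help to resolve the effective settings the same way the table reads: variable, then default. The sketch below does that for a handful of the variables above. The defaults come from the table; the accepted boolean spellings and the names `ConsoleTelemetryConfig` and `load_config` are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import FrozenSet, Mapping, Optional


@dataclass(frozen=True)
class ConsoleTelemetryConfig:
    feature_flags: FrozenSet[str]
    metrics_enabled: bool
    metrics_verbose: bool
    log_level: str
    sse_enabled: bool

    @property
    def telemetry_panel_available(self) -> bool:
        # The table states that the telemetry panel requires `telemetry`.
        return "telemetry" in self.feature_flags


def _as_bool(value: Optional[str], default: bool) -> bool:
    """Interpret common truthy spellings (an assumption, not a documented rule)."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_config(env: Mapping[str, str] = os.environ) -> ConsoleTelemetryConfig:
    """Resolve settings from the environment, falling back to table defaults."""
    flags = env.get("CONSOLE_FEATURE_FLAGS", "runs,downloads,policies")
    return ConsoleTelemetryConfig(
        feature_flags=frozenset(f.strip() for f in flags.split(",") if f.strip()),
        metrics_enabled=_as_bool(env.get("CONSOLE_METRICS_ENABLED"), True),
        metrics_verbose=_as_bool(env.get("CONSOLE_METRICS_VERBOSE"), False),
        log_level=env.get("CONSOLE_LOG_LEVEL", "Information"),
        sse_enabled=_as_bool(env.get("CONSOLE_TELEMETRY_SSE_ENABLED"), True),
    )


if __name__ == "__main__":
    print(load_config())
```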
8 · Offline / Air-Gapped Workflow
- Mirror the console image and telemetry collector as part of the Offline Kit (see `/docs/install/docker.md` §4).
- Scrape metrics locally via `curl -k https://console.local/metrics > metrics.prom`; archive alongside logs for audits.
- Use `stella offline kit import` to keep the downloads manifest in sync; dashboards display staleness using `ui_download_manifest_refresh_seconds`.
- When collectors are unavailable, console queues OTLP batches (up to 5 min) and exposes backlog through `ui_telemetry_queue_depth`; export queue metrics to prove no data loss.
- After reconnecting, run `stella console status --telemetry` (CLI parity pending; see DOCS-CONSOLE-23-014) or verify `ui_telemetry_batch_failures_total` resets to zero.
- Retain telemetry bundles for 30 days per compliance guidelines; include Grafana JSON exports in audit packages.
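Once `metrics.prom` has been captured, the queue-depth and batch-failure checks above can be done offline by reading the Prometheus text format directly. The sketch below is a deliberately small parser for that purpose (it ignores timestamps and exotic escaping); `read_samples` and `backlog_cleared` are hypothetical helpers, and the sample text in the usage lines is invented.

```python
import re
from typing import Dict, List, Tuple

_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)")
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def read_samples(text: str, metric: str) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, value) pairs for one metric from a text-format scrape."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None or match.group(1) != metric:
            continue
        labels = dict(_LABEL.findall(match.group(2) or ""))
        samples.append((labels, float(match.group(3))))
    return samples


def backlog_cleared(text: str) -> bool:
    """True when no batches are queued and no export failures are recorded."""
    queued = sum(v for _, v in read_samples(text, "ui_telemetry_queue_depth"))
    failed = sum(v for _, v in read_samples(text, "ui_telemetry_batch_failures_total"))
    return queued == 0 and failed == 0


if __name__ == "__main__":
    snapshot = (
        "# TYPE ui_telemetry_queue_depth gauge\n"
        'ui_telemetry_queue_depth{priority="normal",tenant="acme"} 12\n'
        'ui_telemetry_queue_depth{priority="high",tenant="acme"} 3\n'
    )
    for labels, value in read_samples(snapshot, "ui_telemetry_queue_depth"):
        print(labels, value)
    print("backlog cleared:", backlog_cleared(snapshot))
```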
 
9 · Compliance Checklist
- `/metrics` scraped in staging & production; dashboards display `ui_route_render_seconds`, `ui_request_duration_seconds`, and downloads metrics.
- OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).
- Alert rules from §6 implemented in monitoring stack with runbooks linked.
- Feature flags documented and change-controlled; telemetry disabled only with approval.
- DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.
- Offline capture workflow exercised; evidence stored in audit vault.
- Screenshots of Grafana dashboards committed once they stabilise (update references).
- Cross-links verified (`docs/deploy/console.md`, `docs/security/console-security.md`, `docs/ui/downloads.md`, `docs/ui/console-overview.md`).
10 · References
- `/docs/deploy/console.md` – Metrics endpoint, OTLP config, health checks.
- `/docs/security/console-security.md` – Security metrics & alert hints.
- `/docs/ui/console-overview.md` – Telemetry primitives and performance budgets.
- `/docs/ui/downloads.md` – Downloads metrics and parity workflow.
- `/docs/observability/observability.md` – Platform-wide practices.
- `/ops/telemetry-collector.md` & `/ops/telemetry-storage.md` – Collector deployment.
- `/docs/install/docker.md` – Compose/Helm environment variables.
Last updated: 2025-10-28 (Sprint 23).