- Added `SchedulerWorkerOptions` class to encapsulate configuration for the scheduler worker. - Introduced `PlannerBackgroundService` to manage the planner loop, fetching and processing planning runs. - Created `PlannerExecutionService` to handle the execution logic for planning runs, including impact targeting and run persistence. - Developed `PlannerExecutionResult` and `PlannerExecutionStatus` to standardize execution outcomes. - Implemented validation logic within `SchedulerWorkerOptions` to ensure proper configuration. - Added documentation for the planner loop and impact targeting features. - Established health check endpoints and authentication mechanisms for the Signals service. - Created unit tests for the Signals API to ensure proper functionality and response handling. - Configured options for authority integration and fallback authentication methods.
		
			
				
	
	
	
		
			13 KiB
		
	
	
	
	
	
	
	
			
		
		
	
	Console Observability
Audience: Observability Guild, Console Guild, SRE/operators.
Scope: Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).
Prerequisites: Console deployed with metrics enabled (CONSOLE_METRICS_ENABLED=true) and OTLP exporters configured (OTEL_EXPORTER_OTLP_*).
1 · Instrumentation Overview
- Telemetry stack: OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose /metrics(Prometheus) and/health/*.
- Sampling: Front-end spans sample at 5 % by default (OTEL_TRACES_SAMPLER=parentbased_traceidratio). Metrics are un-sampled; log sampling is handled per category (§3).
- Correlation IDs: Every API call carries x-stellaops-correlation-id; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.
- Scope gating: Operators need the ui.telemetryscope to view live charts in the Admin workspace; the scope also controls access to/console/telemetrySSE streams.
2 · Metrics
2.1 Experience & Navigation
| Metric | Type | Labels | Notes | 
|---|---|---|---|
| ui_route_render_seconds | Histogram | route,tenant,device(desktop,tablet) | Time between route activation and first interactive paint. Target P95 ≤ 1.5 s (cached). | 
| ui_request_duration_seconds | Histogram | service,method,status,tenant | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. | 
| ui_filter_apply_total | Counter | route,filter,tenant | Increments when a global filter or context chip is applied. Used to track adoption of saved views. | 
| ui_tenant_switch_total | Counter | fromTenant,toTenant,trigger(picker,shortcut,link) | Emitted after a successful tenant switch; correlates with Authority ui.tenant.switchlogs. | 
| ui_offline_banner_seconds | Histogram | reason(authority,manifest,gateway),tenant | Duration of offline banner visibility; integrate with air-gap SLAs. | 
2.2 Security & Session
| Metric | Type | Labels | Notes | 
|---|---|---|---|
| ui_dpop_failure_total | Counter | endpoint,reason(nonce,jkt,clockSkew) | Raised when DPoP validation fails; pair with Authority audit trail. | 
| ui_fresh_auth_prompt_total | Counter | action(token.revoke,policy.activate,client.create),tenant | Counts fresh-auth modals; backlog above baseline indicates workflow friction. | 
| ui_fresh_auth_failure_total | Counter | action,reason(timeout,cancelled,auth_error) | Optional metric (set CONSOLE_FRESH_AUTH_METRICS=truewhen feature flag lands). | 
2.3 Downloads & Offline Kit
| Metric | Type | Labels | Notes | 
|---|---|---|---|
| ui_download_manifest_refresh_seconds | Histogram | tenant,channel(edge,stable,airgap) | Time to fetch and verify downloads manifest. Target < 3 s. | 
| ui_download_export_queue_depth | Gauge | tenant,artifactType(sbom,policy,attestation,console) | Mirrors /console/downloadsqueue depth; triggers when offline bundles lag. | 
| ui_download_command_copied_total | Counter | tenant,artifactType | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. | 
2.4 Telemetry Emission & Errors
| Metric | Type | Labels | Notes | 
|---|---|---|---|
| ui_telemetry_batch_failures_total | Counter | transport(otlp-http,otlp-grpc),reason | Emitted by OTLP bridge when batches fail. Enable via CONSOLE_METRICS_VERBOSE=true. | 
| ui_telemetry_queue_depth | Gauge | priority(normal,high),tenant | Browser-side buffer depth; monitor for spikes under degraded collectors. | 
Scraping tips:
- Enable
/metricsviaCONSOLE_METRICS_ENABLED=true.- Set
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318and relevant headers (OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>).- For air-gapped sites, point the exporter to the Offline Kit collector (
localhost:4318) and forward the metrics snapshot usingstella offline bundle metrics.
3 · Logs
- Format: JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields: timestamp,level,action,route,tenant,subject,correlationId,dpop.jkt,device,offlineMode.
- Categories:
- ui.action– general user interactions (route changes, command palette, filter updates). Sampled 50 % by default; override with feature flag- telemetry.logVerbose.
- ui.tenant.switch– always logged; includes- fromTenant,- toTenant,- tokenId, and Authority audit correlation.
- ui.download.commandCopied– download commands copied; includes- artifactId,- digest,- manifestVersion.
- ui.security.anomaly– DPoP mismatches, tenant header errors, CSP violations (level =- Warning).
- ui.telemetry.failure– OTLP export errors; include- httpStatus,- batchSize,- retryCount.
 
- PII handling: Full emails are scrubbed; only hashed values (user:<sha256>) appear unlessui.admin+ fresh-auth were granted for the action (still redacted in logs).
- Retention: Recommended 14 days for connected sites, 30 days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label service="stellaops-web-ui".
4 · Traces
- Span names & attributes:
- ui.route.transition– wraps route navigation; attributes:- route,- tenant,- renderMillis,- prefetchHit.
- ui.api.fetch– HTTP fetch to backend; attributes:- service,- endpoint,- status,- networkTime.
- ui.sse.stream– Server-sent event subscriptions (status ticker, runs); attributes:- channel,- connectedMillis,- reconnects.
- ui.telemetry.batch– Browser OTLP flush; attributes:- batchSize,- success,- retryCount.
- ui.policy.action– Policy workspace actions (simulate, approve, activate) per- docs/ui/policy-editor.md.
 
- Propagation: Spans use W3C traceparent; gateway echoes header to backend APIs so traces stitch across UI → gateway → service.
- Sampling controls: OTEL_TRACES_SAMPLER_ARG(ratio) and feature flagtelemetry.forceSampling(sets to 100 % for incident debugging).
- Viewing traces: Grafana Tempo or Jaeger via collector. Filter by service.name = stellaops-console. For cross-service debugging, filter oncorrelationIdandtenant.
5 · Dashboards
5.1 Experience Overview
Panels:
- Route render histogram (P50/P90/P99) by route.
- Backend call latency stacked by service (ui_request_duration_seconds).
- Offline banner duration trend (ui_offline_banner_seconds).
- Tenant switch volume vs failure rate (overlay ui_dpop_failure_total).
- Command palette usage (ui_filter_apply_total+ui.actionlog counts).
5.2 Downloads & Offline Kit
- Manifest refresh time chart (per channel).
- Export queue depth gauge with alert thresholds.
- CLI command adoption (bar chart per artifact type, using ui_download_command_copied_total).
- Offline parity banner occurrences (downloads.offlineParityflag from API → derived metric).
- Last Offline Kit import timestamp (join with Downloads API metadata).
5.3 Security & Session
- Fresh-auth prompt counts vs success/fail ratios.
- DPoP failure stacked by reason.
- Tenant mismatch warnings (from ui.security.anomalylogs).
- Scope usage heatmap (derived from Authority audit events + UI logs).
- CSP violation counts (browser securitypolicyviolationlistener forwarded to logs).
Capture screenshots for Grafana once dashboards stabilise (
docs/assets/ui/observability/*.png). Replace placeholders before releasing the doc.
6 · Alerting
| Alert | Condition | Suggested Action | 
|---|---|---|
| ConsoleLatencyHigh | ui_route_render_seconds_bucket{le="1.5"}drops below 0.95 for 3 intervals | Inspect route splits, check backend latencies, review CDN cache. | 
| BackendLatencyHigh | ui_request_duration_seconds_sum / ui_request_duration_seconds_count> 1 s for any service | Correlate with gateway/service dashboards; escalate to owning guild. | 
| TenantSwitchFailures | Increase in ui_dpop_failure_totalorui.security.anomaly(tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. | 
| FreshAuthLoop | ui_fresh_auth_prompt_totalspikes with matchingui_fresh_auth_failure_total | Review Authority /fresh-authendpoint, session timeout config, UX regressions. | 
| OfflineBannerLong | ui_offline_banner_secondsP95 > 120 s | Investigate Authority/gateway availability; verify Offline Kit freshness. | 
| DownloadsBacklog | ui_download_export_queue_depth> 5 for 10 min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline ( DOWNLOADS-CONSOLE-23-001) is healthy. | 
| TelemetryExportErrors | ui_telemetry_batch_failures_total> 0 for ≥5 min | Check collector health, credentials, or TLS trust. | 
Integrate alerts with Notifier (ui.alerts) or existing Ops channels. Tag incidents with component=console for correlation.
7 · Feature Flags & Configuration
| Flag / Env Var | Purpose | Default | 
|---|---|---|
| CONSOLE_FEATURE_FLAGS | Enables UI modules ( runs,downloads,policies,telemetry). Telemetry panel requirestelemetry. | runs,downloads,policies | 
| CONSOLE_METRICS_ENABLED | Exposes /metricsfor Prometheus scrape. | true | 
| CONSOLE_METRICS_VERBOSE | Emits additional batching metrics ( ui_telemetry_*). | false | 
| CONSOLE_LOG_LEVEL | Minimum log level ( Information,Debug). UseDebugfor incident sampling. | Information | 
| CONSOLE_METRICS_SAMPLING(planned) | Controls front-end span sampling ratio. Document once released. | 0.05 | 
| OTEL_EXPORTER_OTLP_ENDPOINT | Collector URL; supports HTTPS. | unset | 
| OTEL_EXPORTER_OTLP_HEADERS | Comma-separated headers (auth). | unset | 
| OTEL_EXPORTER_OTLP_INSECURE | Allow HTTP (dev only). | false | 
| OTEL_SERVICE_NAME | Service tag for traces/logs. Set to stellaops-console. | auto | 
| CONSOLE_TELEMETRY_SSE_ENABLED | Enables /console/telemetrySSE feed for dashboards. | true | 
Feature flag changes should be tracked in release notes and mirrored in /docs/ui/navigation.md (shortcuts may change when modules toggle).
8 · Offline / Air-Gapped Workflow
- Mirror the console image and telemetry collector as part of the Offline Kit (see /docs/install/docker.md§4).
- Scrape metrics locally via curl -k https://console.local/metrics > metrics.prom; archive alongside logs for audits.
- Use stella offline kit importto keep the downloads manifest in sync; dashboards display staleness usingui_download_manifest_refresh_seconds.
- When collectors are unavailable, console queues OTLP batches (up to 5 min) and exposes backlog through ui_telemetry_queue_depth; export queue metrics to prove no data loss.
- After reconnecting, run stella console status --telemetry(CLI parity pending; see DOCS-CONSOLE-23-014) or verifyui_telemetry_batch_failures_totalresets to zero.
- Retain telemetry bundles for 30 days per compliance guidelines; include Grafana JSON exports in audit packages.
9 · Compliance Checklist
- /metricsscraped in staging & production; dashboards display- ui_route_render_seconds,- ui_request_duration_seconds, and downloads metrics.
- OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).
- Alert rules from §6 implemented in monitoring stack with runbooks linked.
- Feature flags documented and change-controlled; telemetry disabled only with approval.
- DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.
- Offline capture workflow exercised; evidence stored in audit vault.
- Screenshots of Grafana dashboards committed once they stabilise (update references).
- Cross-links verified (docs/deploy/console.md,docs/security/console-security.md,docs/ui/downloads.md,docs/ui/console-overview.md).
10 · References
- /docs/deploy/console.md– Metrics endpoint, OTLP config, health checks.
- /docs/security/console-security.md– Security metrics & alert hints.
- /docs/ui/console-overview.md– Telemetry primitives and performance budgets.
- /docs/ui/downloads.md– Downloads metrics and parity workflow.
- /docs/observability/observability.md– Platform-wide practices.
- /ops/telemetry-collector.md&- /ops/telemetry-storage.md– Collector deployment.
- /docs/install/docker.md– Compose/Helm environment variables.
Last updated: 2025-10-28 (Sprint 23).