docs consolidation and others

This commit is contained in:
master
2026-01-06 19:02:21 +02:00
parent d7bdca6d97
commit 4789027317
849 changed files with 16551 additions and 66770 deletions

View File

@@ -0,0 +1,37 @@
# Aggregation Observability
Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-LNM-22-007)
Covers metrics, traces, and logs for Link-Not-Merge (LNM) aggregation and evidence pipelines.
## Metrics
- `aggregation_ingest_latency_seconds` (histogram) — end-to-end ingest per statement; labels: `tenant`, `source`, `status`.
- `aggregation_conflict_total` (counter) — conflicts encountered; labels: `tenant`, `advisory`, `product`, `reason`.
- `aggregation_overlay_cache_hits_total` / `_misses_total` — overlay cache effectiveness; labels: `tenant`, `cache`.
- `aggregation_vex_gate_total` — VEX gating outcomes; labels: `tenant`, `status` (`affected`, `not_affected`, `unknown`).
- `aggregation_queue_depth` (gauge) — pending statements per tenant.
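The sketch below shows one way these instruments could be registered with `System.Diagnostics.Metrics`; only the metric names and label sets come from the list above, while the meter name and helper class are illustrative assumptions.
```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical registration of the aggregation instruments listed above.
// The meter name "StellaOps.Aggregation" and this helper class are assumptions.
internal static class AggregationMetrics
{
    private static readonly Meter Meter = new("StellaOps.Aggregation");

    private static readonly Histogram<double> IngestLatency =
        Meter.CreateHistogram<double>("aggregation_ingest_latency_seconds", unit: "s");

    private static readonly Counter<long> Conflicts =
        Meter.CreateCounter<long>("aggregation_conflict_total");

    public static void RecordIngest(string tenant, string source, string status, double seconds) =>
        IngestLatency.Record(seconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("source", source),
            new KeyValuePair<string, object?>("status", status));

    public static void RecordConflict(string tenant, string advisory, string product, string reason) =>
        Conflicts.Add(1,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("advisory", advisory),
            new KeyValuePair<string, object?>("product", product),
            new KeyValuePair<string, object?>("reason", reason));
}
```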
## Traces
- Span name `aggregation.process` with attributes:
- `tenant`, `advisory`, `product`, `vex_status`, `source_kind`
- `overlay_version`, `cache_hit` (bool)
- Link to upstream ingest span (`traceparent` forwarded by Excititor/Concelier).
- Export to OTLP; sampling default 10% outside prod, 100% for `status=error`.
## Logs
Structured JSON with fields: `tenant`, `advisory`, `product`, `vex_status`, `decision` (`merged|suppressed|dropped`), `reason`, `duration_ms`, `trace_id`.
## SLOs
- **Ingest latency**: p95 < 500ms per statement (steady state).
- **Cache hit rate**: >80% for overlays; alerts when below for 15 minutes.
- **Error rate**: <0.1% over 10 minute window.
## Alerts
- `HighConflictRate` — `aggregation_conflict_total` delta > 100/minute per tenant.
- `QueueBacklog` — `aggregation_queue_depth` > 10k for 5 minutes.
- `LowCacheHit` — overlay cache hit rate < 60% for 10 minutes.
## Offline/air-gap considerations
- Export metrics to local Prometheus scrape; no external sinks.
- Trace sampling and log retention configured via environment without needing control-plane access.
- Deterministic ordering preserved; cache warmers seeded from bundled fixtures.

View File

@@ -0,0 +1,29 @@
# CLI incident toggle contract (CLI-OBS-12-001)
**Goal**: define a deterministic CLI flag and config surface to enter/exit incident mode, required by TELEMETRY-OBS-55-001/56-001.
## Flags and config
- CLI flag: `--incident-mode` (bool). Defaults to false.
- Config key: `Telemetry:Incident:Enabled` (bool) and `Telemetry:Incident:TTL` (TimeSpan).
- When both the flag and the config key are specified, the flag wins, but only to opt in: the flag cannot disable incident mode that config has enabled (see the sketch below).
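A minimal sketch of that precedence rule; the resolver type is illustrative, not part of the CLI contract:
```csharp
// Hypothetical sketch of the documented precedence: the --incident-mode flag can
// only enable incident mode; it never turns it off when config enables it.
public static class IncidentModeResolver
{
    public static bool Resolve(bool cliFlagPresent, bool cliFlagValue, bool configEnabled)
    {
        // Flag wins when it opts in.
        if (cliFlagPresent && cliFlagValue)
        {
            return true;
        }

        // A flag that is absent or false cannot override a config-enabled state.
        return configEnabled;
    }
}
```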
## Effects when enabled
- Increase sampling rate ceiling to 100% for telemetry within the process.
- Add tag `incident=true` to logs/metrics/traces.
- Shorten exporter/reporting flush interval to 5s; disable external exporters when `Sealed=true`.
- Emit activation audit event `telemetry.incident.activated` with fields `{tenant, actor, source, expires_at}`.
## Persistence
- Incident flag runtime value stored in local state file `~/.stellaops/incident-mode.json` with fields `{enabled, set_at, expires_at, actor}` for offline continuity.
- File is tenant-scoped; permissions 0600.
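A sketch of persisting the state file with the documented field names and 0600 permissions; the record and store types are assumptions, not the actual CLI implementation:
```csharp
using System;
using System.IO;
using System.Text.Json;
using System.Text.Json.Serialization;

// Field names mirror the contract ({enabled, set_at, expires_at, actor}).
public sealed record IncidentModeState(
    [property: JsonPropertyName("enabled")] bool Enabled,
    [property: JsonPropertyName("set_at")] DateTimeOffset SetAt,
    [property: JsonPropertyName("expires_at")] DateTimeOffset ExpiresAt,
    [property: JsonPropertyName("actor")] string Actor);

public static class IncidentModeStore
{
    public static void Save(IncidentModeState state, string path)
    {
        var json = JsonSerializer.Serialize(state, new JsonSerializerOptions { WriteIndented = true });
        File.WriteAllText(path, json);

        // Restrict the state file to owner read/write (0600) on Unix-like hosts.
        if (!OperatingSystem.IsWindows())
        {
            File.SetUnixFileMode(path, UnixFileMode.UserRead | UnixFileMode.UserWrite);
        }
    }
}
```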
## Expiry / TTL
- Default TTL: 30 minutes unless `Telemetry:Incident:TTL` provided.
- On expiry, emit `telemetry.incident.expired` audit event.
## Validation expectations
- CLI should refuse `--incident-mode` if `--sealed` is set and external exporters are configured (must drop exporters first).
- Unit tests to cover precedence (flag over config), TTL expiry, state file perms, and audit emissions.
## Provenance
- Authored 2025-11-20 to unblock PREP-CLI-OBS-12-001 and TELEMETRY-OBS-55-001.

View File

@@ -0,0 +1,177 @@
# FN-Drift Metrics Reference
> **Sprint:** SPRINT_3404_0001_0001
> **Module:** Scanner Storage / Telemetry
## Overview
False-Negative Drift (FN-Drift) measures how often vulnerability classifications change from "not affected" or "unknown" to "affected" during rescans. This metric is critical for:
- **Accuracy Assessment**: Tracking scanner reliability over time
- **SLO Compliance**: Meeting false-negative rate targets
- **Root Cause Analysis**: Stratified analysis by drift cause
- **Feed Quality**: Identifying problematic vulnerability feeds
## Metrics
### Gauges (30-day rolling window)
| Metric | Type | Description |
|--------|------|-------------|
| `scanner.fn_drift.percent` | Gauge | 30-day rolling FN-Drift percentage |
| `scanner.fn_drift.transitions_30d` | Gauge | Total FN transitions in last 30 days |
| `scanner.fn_drift.evaluated_30d` | Gauge | Total findings evaluated in last 30 days |
| `scanner.fn_drift.cause.feed_delta` | Gauge | FN transitions caused by feed updates |
| `scanner.fn_drift.cause.rule_delta` | Gauge | FN transitions caused by rule changes |
| `scanner.fn_drift.cause.lattice_delta` | Gauge | FN transitions caused by VEX lattice changes |
| `scanner.fn_drift.cause.reachability_delta` | Gauge | FN transitions caused by reachability changes |
| `scanner.fn_drift.cause.engine` | Gauge | FN transitions caused by engine changes (should be ~0) |
### Counters (all-time)
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `scanner.classification_changes_total` | Counter | `cause` | Total classification status changes |
| `scanner.fn_transitions_total` | Counter | `cause` | Total false-negative transitions |
## Classification Statuses
| Status | Description |
|--------|-------------|
| `new` | First scan, no previous status |
| `unaffected` | Confirmed not affected |
| `unknown` | Status unknown/uncertain |
| `affected` | Confirmed affected |
| `fixed` | Previously affected, now fixed |
## Drift Causes
| Cause | Description | Expected Impact |
|-------|-------------|-----------------|
| `feed_delta` | Vulnerability feed updated (NVD, GHSA, OVAL) | High - most common cause |
| `rule_delta` | Policy rules changed | Medium - controlled by policy team |
| `lattice_delta` | VEX lattice state changed | Medium - VEX updates |
| `reachability_delta` | Reachability analysis changed | Low - improved analysis |
| `engine` | Scanner engine change | ~0 - determinism violation if >0 |
| `other` | Unknown/unclassified cause | Low - investigate if high |
## FN-Drift Definition
A **False-Negative Transition** occurs when:
- Previous status was `unaffected` or `unknown`
- New status is `affected`
This indicates the scanner previously classified a finding as "not vulnerable" but now classifies it as "vulnerable" - a false negative in the earlier scan.
### FN-Drift Rate Calculation
```
FN-Drift % = (FN Transitions / Total Reclassified) × 100
```
Where:
- **FN Transitions**: Count of `(unaffected|unknown) → affected` changes
- **Total Reclassified**: Count of all status changes (excluding `new`)
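The same calculation expressed as a small C# sketch; the `StatusChange` record is an assumption, while the formula and status names come from this document:
```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record StatusChange(string PreviousStatus, string NewStatus);

public static class FnDrift
{
    public static double Percent(IReadOnlyCollection<StatusChange> changes)
    {
        // Exclude first-time classifications; only reclassifications count.
        var reclassified = changes.Where(c => c.PreviousStatus != "new").ToList();
        if (reclassified.Count == 0)
        {
            return 0.0;
        }

        // FN transition: (unaffected|unknown) -> affected.
        var fnTransitions = reclassified.Count(c =>
            (c.PreviousStatus == "unaffected" || c.PreviousStatus == "unknown") &&
            c.NewStatus == "affected");

        return 100.0 * fnTransitions / reclassified.Count;
    }
}
```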
## SLO Thresholds
| SLO Level | FN-Drift Threshold | Alert Severity |
|-----------|-------------------|----------------|
| Target | < 1.0% | None |
| Warning | 1.0% - 2.5% | Warning |
| Critical | > 2.5% | Critical |
| Engine Drift | > 0% | Page |
### Alerting Rules
```yaml
# Example Prometheus alerting rules
groups:
  - name: fn-drift
    rules:
      - alert: FnDriftWarning
        expr: scanner_fn_drift_percent > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "FN-Drift rate above warning threshold"
      - alert: FnDriftCritical
        expr: scanner_fn_drift_percent > 2.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "FN-Drift rate above critical threshold"
      - alert: EngineDriftDetected
        expr: scanner_fn_drift_cause_engine > 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Engine-caused FN drift detected - determinism violation"
```
## Dashboard Queries
### FN-Drift Trend (Grafana)
```promql
# 30-day rolling FN-Drift percentage
scanner_fn_drift_percent
# FN transitions by cause
sum by (cause) (rate(scanner_fn_transitions_total[1h]))
# Classification changes rate
sum by (cause) (rate(scanner_classification_changes_total[1h]))
```
### Drift Cause Breakdown
```promql
# Pie chart of drift causes
topk(5,
sum by (cause) (
increase(scanner_fn_transitions_total[24h])
)
)
```
## Database Schema
### classification_history Table
```sql
CREATE TABLE scanner.classification_history (
    id BIGSERIAL PRIMARY KEY,
    artifact_digest TEXT NOT NULL,
    vuln_id TEXT NOT NULL,
    package_purl TEXT NOT NULL,
    tenant_id UUID NOT NULL,
    manifest_id UUID NOT NULL,
    execution_id UUID NOT NULL,
    previous_status TEXT NOT NULL,
    new_status TEXT NOT NULL,
    is_fn_transition BOOLEAN GENERATED ALWAYS AS (...) STORED,
    cause TEXT NOT NULL,
    cause_detail JSONB,
    changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
### fn_drift_stats Materialized View
Aggregated daily statistics for efficient dashboard queries:
- Day bucket
- Tenant ID
- Cause breakdown
- FN count and percentage
## Related Documentation
- [Determinism Technical Reference](../product-advisories/14-Dec-2025%20-%20Determinism%20and%20Reproducibility%20Technical%20Reference.md) - Section 13.2
- [Scanner Architecture](../modules/scanner/architecture.md)
- [Telemetry Stack](../modules/telemetry/architecture.md)

View File

@@ -0,0 +1,49 @@
# Logging Standards (DOCS-OBS-50-003)
Last updated: 2025-12-15
## Goals
- Deterministic, structured logs for all services.
- Keep tenant safety and redaction guarantees while enabling search, correlation, and offline analysis.
## Log shape (JSON)
Required fields:
- `timestamp` (UTC ISO-8601)
- `tenant`, `workload` (service name), `env`, `region`, `version`
- `level` (`debug|info|warn|error|fatal`)
- `category` (logger/category name), `operation` (verb/action)
- `trace_id`, `span_id`, `correlation_id` (if external)
- `message` (concise, no secrets)
- `status` (`ok|error|fault|throttle`)
- `error.code`, `error.message` (redacted), `retryable` (bool) when status != ok
Optional but recommended:
- `resource` (subject id/purl/path when safe), `http.method`, `http.status_code`, `duration_ms`, `host`, `pid`, `thread`.
## Offline Kit / air-gap import fields
When emitting logs for Offline Kit import/activation flows, keep field names stable:
- Required scope key: `tenant_id`
- Common keys: `bundle_type`, `bundle_digest`, `bundle_path`, `manifest_version`, `manifest_created_at`
- Force activation keys: `force_activate`, `force_activate_reason`
- Outcome keys: `result`, `reason_code`, `reason_message`
- Quarantine keys: `quarantine_id`, `quarantine_path`
## Redaction rules
- Never log Authorization headers, tokens, passwords, private keys, full request/response bodies.
- Redact to `"[redacted]"` and add `redaction.reason` (`secret|pii|policy`).
- Hash low-cardinality identifiers when needed (`sha256` hex) and mark `hashed=true`.
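A minimal sketch of the hashing rule, assuming a small helper type (the name is illustrative):
```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Illustrative helper for the rule above: SHA-256, lowercase hex, paired with a
// hashed=true marker the log emitter can attach to the field.
public static class LogRedaction
{
    public static (string Value, bool Hashed) HashIdentifier(string value)
    {
        var digest = SHA256.HashData(Encoding.UTF8.GetBytes(value));
        return (Convert.ToHexString(digest).ToLowerInvariant(), true);
    }
}
```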
## Determinism & offline posture
- Stable key ordering not required, but field set must be consistent per log type.
- No external enrichment; rely on bundled metadata (service map, tenant labels).
- All times UTC; newline-delimited JSON (NDJSON); LF line endings.
## Sampling & rate limits
- Info logs rate-limited per component (default 100/s); warn/error/fatal never sampled.
- Structured audit logs (`category=audit`) are never sampled and must include `actor`, `action`, `target`, `result`.
## Validation checklist
- [ ] Required fields present and non-empty.
- [ ] No secrets/PII; redaction markers recorded.
- [ ] Correlation fields (`trace_id`, `span_id`) set when spans exist.
- [ ] Log level matches outcome (errors use warn/error/fatal only).

View File

@@ -0,0 +1,113 @@
# Metrics & SLOs (DOCS-OBS-51-001)
Last updated: 2025-12-15
## Core metrics (platform-wide)
- **Requests**: `http_requests_total{tenant,workload,route,status}` (counter); latency histogram `http_request_duration_seconds`.
- **Jobs**: `worker_jobs_total{tenant,queue,status}`; `worker_job_duration_seconds`.
- **DB**: `db_query_duration_seconds{db,operation}`; `db_pool_in_use`, `db_pool_available`.
- **Cache**: `cache_requests_total{result=hit|miss}`; `cache_latency_seconds`.
- **Queue depth**: `queue_depth{tenant,queue}` (gauge).
- **Errors**: `errors_total{tenant,workload,code}`.
- **Custom module metrics**: keep namespaced (e.g., `riskengine_score_duration_seconds`, `notify_delivery_attempts_total`).
## SLOs (suggested)
- API availability: 99.9% monthly per public service.
- P95 latency: <300 ms for read endpoints; <1 s for write endpoints.
- Worker job success: >99% over 30d; P95 job duration set per queue (document locally).
- Queue backlog: alert when `queue_depth` > 1000 for 5 minutes per tenant/queue.
- Error budget policy: 28-day rolling window; burn-rate alerts at 2× and 14× budget.
## Alert examples
- High error rate: `rate(errors_total[5m]) / rate(http_requests_total[5m]) > 0.02`.
- Latency regression: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,route)) > 0.3`.
- Queue backlog: `queue_depth > 1000` for 5m.
- Job failures: `rate(worker_jobs_total{status="failed"}[10m]) > 0.01`.
## UX KPIs (triage TTFS)
- Targets:
- TTFS first evidence p95: <= 1.5s
- TTFS skeleton p95: <= 0.2s
- Clicks-to-closure median: <= 6
- Evidence completeness avg: >= 90% (>= 3.6/4)
```promql
# TTFS first evidence p50/p95
histogram_quantile(0.50, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(stellaops_ttfs_first_evidence_seconds_bucket[5m])) by (le))
# Clicks-to-closure median
histogram_quantile(0.50, sum(rate(stellaops_clicks_to_closure_bucket[5m])) by (le))
# Evidence completeness average percent (0-4 mapped to 0-100)
100 * (sum(rate(stellaops_evidence_completeness_score_sum[5m])) / clamp_min(sum(rate(stellaops_evidence_completeness_score_count[5m])), 1)) / 4
# Budget violations by phase
sum(rate(stellaops_performance_budget_violations_total[5m])) by (phase)
```
- Dashboard: `ops/devops/observability/grafana/triage-ttfs.json`
- Alerts: `ops/devops/observability/triage-alerts.yaml`
## TTFS Metrics (time-to-first-signal)
- Core metrics:
- `ttfs_latency_seconds{surface,cache_hit,signal_source,kind,phase,tenant_id}` (histogram)
- `ttfs_signal_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
- `ttfs_cache_hit_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
- `ttfs_cache_miss_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
- `ttfs_slo_breach_total{surface,cache_hit,signal_source,kind,phase,tenant_id}` (counter)
- `ttfs_error_total{surface,cache_hit,signal_source,kind,phase,tenant_id,error_type,error_code}` (counter)
- SLO targets:
- P50 < 2s, P95 < 5s (all surfaces)
- Warm path P50 < 700ms, P95 < 2.5s
- Cold path P95 < 4s
```promql
# TTFS latency p50/p95
histogram_quantile(0.50, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(ttfs_latency_seconds_bucket[5m])) by (le))
# SLO breach rate (per minute)
60 * sum(rate(ttfs_slo_breach_total[5m]))
```
## Offline Kit (air-gap) metrics
- `offlinekit_import_total{status,tenant_id}` (counter)
- `offlinekit_attestation_verify_latency_seconds{attestation_type,success}` (histogram)
- `attestor_rekor_success_total{mode}` (counter)
- `attestor_rekor_retry_total{reason}` (counter)
- `rekor_inclusion_latency{success}` (histogram)
```promql
# Import rate by status
sum(rate(offlinekit_import_total[5m])) by (status)
# Import success rate
sum(rate(offlinekit_import_total{status="success"}[5m])) / clamp_min(sum(rate(offlinekit_import_total[5m])), 1)
# Attestation verify p95 by type (success only)
histogram_quantile(0.95, sum(rate(offlinekit_attestation_verify_latency_seconds_bucket{success="true"}[5m])) by (le, attestation_type))
# Rekor inclusion latency p95 (by success)
histogram_quantile(0.95, sum(rate(rekor_inclusion_latency_bucket[5m])) by (le, success))
```
Dashboard: `docs/modules/telemetry/dashboards/offline-kit-operations.json`
## Observability hygiene
- Tag everything with `tenant`, `workload`, `env`, `region`, `version`.
- Keep metric names stable; prefer adding labels over renaming.
- No high-cardinality labels (avoid `user_id`, `path`, raw errors); bucket or hash if needed.
- Offline: scrape locally (Prometheus/OTLP); ship exports via bundle if required.
## Dashboards
- Golden signals per service: traffic, errors, saturation, latency (P50/P95/P99).
- Queue dashboards: depth, age, throughput, success/fail rates.
- Tracing overlays: link span `status` to error metrics; use exemplars where supported.
## Validation checklist
- [ ] Metrics emitted with required tags.
- [ ] Cardinality review completed (no unbounded labels).
- [ ] Alerts wired to error budget policy.
- [ ] Dashboards cover golden signals and queue health.

View File

@@ -0,0 +1,240 @@
# AOC Observability Guide
> **Audience:** Observability Guild, Concelier/Excititor SREs, platform operators.
> **Scope:** Metrics, traces, logs, dashboards, and runbooks introduced as part of the Aggregation-Only Contract (AOC) rollout (Sprint 19).
This guide captures the canonical signals emitted by Concelier and Excititor once AOC guards are active. It explains how to consume the metrics in dashboards, correlate traces/logs for incident triage, and operate in offline environments. Pair this guide with the [AOC reference](../aoc/aggregation-only-contract.md) and [architecture overview](../modules/platform/architecture-overview.md).
---
## 1·Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ingestion_write_total` | Counter | `source`, `tenant`, `result` (`ok`, `reject`, `noop`) | Counts write attempts to `advisory_raw`/`vex_raw`. Rejects correspond to guard failures. |
| `ingestion_latency_seconds` | Histogram | `source`, `tenant`, `phase` (`fetch`, `transform`, `write`) | Measures end-to-end runtime for ingestion stages. Use `quantile=0.95` for alerting. |
| `aoc_violation_total` | Counter | `source`, `tenant`, `code` (`ERR_AOC_00x`) | Total guard violations bucketed by error code. Drives dashboard pills and alert thresholds. |
| `ingestion_signature_verified_total` | Counter | `source`, `tenant`, `result` (`ok`, `fail`, `skipped`) | Tracks signature/checksum verification outcomes. |
| `advisory_revision_count` | Gauge | `source`, `tenant` | Supersedes depth for raw documents; spikes indicate noisy upstream feeds. |
| `verify_runs_total` | Counter | `tenant`, `initiator` (`ui`, `cli`, `api`, `scheduled`) | How many `stella aoc verify` or `/aoc/verify` runs executed. |
| `verify_duration_seconds` | Histogram | `tenant`, `initiator` | Runtime of verification jobs; use P95 to detect regressions. |
### 1.1 · Alerts
- **Violation spike:** Alert when `increase(aoc_violation_total[15m]) > 0` for critical sources. Page SRE if `code="ERR_AOC_005"` (signature failure) or `ERR_AOC_001` persists >30min.
- **Stale ingestion:** Alert when `max_over_time((ingestion_latency_seconds_sum / ingestion_latency_seconds_count)[30m:])` exceeds 30s or if `ingestion_write_total` has no growth for >60min.
- **Signature drop:** Warn when `rate(ingestion_signature_verified_total{result="fail"}[1h]) > 0`.
### 1.2 · `/obs/excititor/health`
`GET /obs/excititor/health` (scope `vex.admin`) returns a compact snapshot for Grafana tiles and Console widgets:
- `ingest` — overall status, worst lag (seconds), and the top connectors (status, lagSeconds, failure count, last success).
- `link` — freshness of consensus/linkset processing plus document counts and the number currently carrying conflicts.
- `signature` — recent coverage window (evaluated, with signatures, verified, failures, unsigned, coverage ratio).
- `conflicts` — rolling totals grouped by status plus per-bucket trend data for charts.
```json
{
"generatedAt": "2025-11-08T11:00:00Z",
"ingest": { "status": "healthy", "connectors": [ { "connectorId": "excititor:redhat", "lagSeconds": 45.3 } ] },
"link": { "status": "warning", "lastConsensusAt": "2025-11-08T10:57:03Z" },
"signature": { "status": "critical", "documentsEvaluated": 120, "verified": 30, "failures": 2 },
"conflicts": { "status": "warning", "conflictStatements": 325, "trend": [ { "bucketStart": "2025-11-08T10:00:00Z", "conflicts": 130 } ] }
}
```
| Setting | Default | Purpose |
|---------|---------|---------|
| `Excititor:Observability:IngestWarningThreshold` | `06:00:00` | Connector lag before `ingest.status` becomes `warning`. |
| `Excititor:Observability:IngestCriticalThreshold` | `24:00:00` | Connector lag before `ingest.status` becomes `critical`. |
| `Excititor:Observability:LinkWarningThreshold` | `00:15:00` | Maximum acceptable delay between consensus recalculations. |
| `Excititor:Observability:LinkCriticalThreshold` | `01:00:00` | Delay that marks link status as `critical`. |
| `Excititor:Observability:SignatureWindow` | `12:00:00` | Lookback window for signature coverage. |
| `Excititor:Observability:SignatureHealthyCoverage` | `0.8` | Coverage ratio that still counts as healthy. |
| `Excititor:Observability:SignatureWarningCoverage` | `0.5` | Coverage ratio that flips the status to `warning`. |
| `Excititor:Observability:ConflictTrendWindow` | `24:00:00` | Rolling window used for conflict aggregation. |
| `Excititor:Observability:ConflictTrendBucketMinutes` | `60` | Resolution of conflict `trend` buckets. |
| `Excititor:Observability:ConflictWarningRatio` | `0.15` | Fraction of consensus docs with conflicts that triggers `warning`. |
| `Excititor:Observability:ConflictCriticalRatio` | `0.3` | Ratio that marks `conflicts.status` as `critical`. |
| `Excititor:Observability:MaxConnectorDetails` | `50` | Number of connector entries returned (keeps payloads small). |
### 1.3 · Regression & DI hygiene
1. **Keep storage/integration tests green when telemetry touches persistence.**
- `./tools/postgres/local-postgres.sh start` downloads PostgreSQL 16.x (if needed), launches the instance, and prints `export EXCITITOR_TEST_POSTGRES_URI=postgresql://.../excititor-tests`. Copy that export into your shell.
- `./tools/postgres/local-postgres.sh restart` is a shortcut for "stop if running, then start" using the same dataset—use it after tweaking config or when tests need a bounce without wiping fixtures.
- `./tools/postgres/local-postgres.sh clean` stops the instance (if running) and deletes the managed data/log directories so storage tests begin from a pristine catalog.
- Run `dotnet test src/Excititor/__Tests/StellaOps.Excititor.Storage.Postgres.Tests/StellaOps.Excititor.Storage.Postgres.Tests.csproj -nologo -v minimal` (add `--filter` if you only touched specific suites). These tests exercise the same write paths that feed the dashboards, so regressions show up immediately.
- `./tools/postgres/local-postgres.sh stop` when finished so CI/dev hosts stay clean; `status|logs|shell` are available for troubleshooting.
2. **Declare optional Minimal API dependencies with `[FromServices] ... = null`.** RequestDelegateFactory treats `[FromServices] IVexSigner? signer = null` (or similar) as optional, so host startup succeeds even when tests have not registered that service. This pattern keeps observability endpoints cancellable while avoiding brittle test overrides.
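A minimal sketch of that pattern, using a placeholder `IVexSigner` stand-in and a placeholder route rather than the real endpoint contract:
```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.AspNetCore.Routing;

// Stand-in interface for illustration only.
public interface IVexSigner { }

public static class SignerHealthEndpoints
{
    public static void MapSignerHealth(this IEndpointRouteBuilder app) =>
        app.MapGet("/obs/sample/signer-health", GetSignerHealth);

    // RequestDelegateFactory treats the "= null" default as optional: mapping
    // succeeds and requests are served even when IVexSigner is not registered.
    private static IResult GetSignerHealth([FromServices] IVexSigner? signer = null)
    {
        var signature = signer is null ? "unavailable" : "enabled";
        return Results.Ok(new { signature });
    }
}
```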
---
## 2·Traces
### 2.1 · Span taxonomy
| Span name | Parent | Key attributes |
|-----------|--------|----------------|
| `ingest.fetch` | job root span | `source`, `tenant`, `uri`, `contentHash` |
| `ingest.transform` | `ingest.fetch` | `documentType` (`csaf`, `osv`, `vex`), `payloadBytes` |
| `ingest.write` | `ingest.transform` | `collection` (`advisory_raw`, `vex_raw`), `result` (`ok`, `reject`) |
| `aoc.guard` | `ingest.write` | `code` (on violation), `violationCount`, `supersedes` |
| `verify.run` | verification job root | `tenant`, `window.from`, `window.to`, `sources`, `violations` |
### 2.2 · Trace usage
- Correlate UI dashboard entries with traces via `traceId` surfaced in violation drawers (`docs/UI_GUIDE.md`).
- Use `aoc.guard` spans to inspect guard payload snapshots. Sensitive fields are redacted automatically; raw JSON lives in secure logs only.
- For scheduled verification, filter traces by `initiator="scheduled"` to compare runtimes pre/post change.
### 2.3 · Telemetry configuration (Excititor)
- Configure the web service via `Excititor:Telemetry`:
```jsonc
{
  "Excititor": {
    "Telemetry": {
      "Enabled": true,
      "EnableTracing": true,
      "EnableMetrics": true,
      "ServiceName": "stellaops-excititor-web",
      "OtlpEndpoint": "http://otel-collector:4317",
      "OtlpHeaders": {
        "Authorization": "Bearer ${OTEL_PUSH_TOKEN}"
      },
      "ResourceAttributes": {
        "env": "prod-us",
        "service.group": "ingestion"
      }
    }
  }
}
```
- Point the OTLP endpoint at the shared collector profile from §1 so Excititor metrics land in the `ingestion_*` dashboards next to Concelier. Resource attributes drive Grafana filtering (e.g., `env`, `service.group`).
- For offline/air-gap bundles set `Enabled=false` and collect the file exporter artifacts from the Offline Kit; import them into Grafana after transfer to keep time-to-truth dashboards consistent.
- Local development templates: run `tools/postgres/local-postgres.sh start` to spin up a PostgreSQL instance plus the matching `psql` client. The script prints the `export EXCITITOR_TEST_POSTGRES_URI=...` command that integration tests (e.g., `StellaOps.Excititor.Storage.Postgres.Tests`) will honor. Use `restart` for a quick bounce, `clean` to wipe data between suites, and `stop` when finished.
---
## 3·Logs
Structured logs include the following keys (JSON):
| Key | Description |
|-----|-------------|
| `traceId` | Matches OpenTelemetry trace/span IDs for cross-system correlation. |
| `tenant` | Tenant identifier enforced by Authority middleware. |
| `source.vendor` | Logical source (e.g., `redhat`, `ubuntu`, `osv`, `ghsa`). |
| `upstream.upstreamId` | Vendor-provided ID (CVE, GHSA, etc.). |
| `contentHash` | `sha256:` digest of the raw document. |
| `violation.code` | Present when guard rejects `ERR_AOC_00x`. |
| `verification.window` | Present on `/aoc/verify` job logs. |
Excititor APIs mirror these identifiers via response headers:
| Header | Purpose |
| --- | --- |
| `X-Stella-TraceId` | W3C trace/span identifier for deep-linking from Console → Grafana/Loki. |
| `X-Stella-CorrelationId` | Stable correlation identifier (respects inbound header or falls back to the request trace ID). |
Logs are shipped to the central Loki/Elasticsearch cluster. Use the template query:
```logql
{app="concelier-web"} | json | violation_code != ""
```
to spot active AOC violations.
### 3.1 · Advisory chunk API (AdvisoryAI feeds)
AdvisoryAI now leans on Concelier's `/advisories/{key}/chunks` endpoint for deterministic evidence packs. The service exports dedicated metrics so dashboards can highlight latency spikes, cache noise, or aggressive guardrail filtering before they impact AdvisoryAI responses.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `advisory_ai_chunk_requests_total` | Counter | `tenant`, `result`, `truncated`, `cache` | Count of chunk API calls, tagged with cache hits/misses and truncation state. |
| `advisory_ai_chunk_latency_milliseconds` | Histogram | `tenant`, `result`, `truncated`, `cache` | End-to-end build latency (milliseconds) for each chunk request. |
| `advisory_ai_chunk_segments` | Histogram | `tenant`, `result`, `truncated` | Number of chunk segments returned to the caller; watch for sudden drops tied to guardrails. |
| `advisory_ai_chunk_sources` | Histogram | `tenant`, `result` | How many upstream observations/sources contributed to a response (after observation limits). |
| `advisory_ai_guardrail_blocks_total` | Counter | `tenant`, `reason`, `cache` | Per-reason count of segments suppressed by guardrails (length, normalization, character set). |
Dashboards should plot latency P95/P99 next to cache hit rates and guardrail block deltas to catch degradation early. AdvisoryAI CLI/Console surfaces the same metadata so support engineers can correlate with Grafana/Loki entries using `traceId`/`correlationId` headers.
---
## 4·Dashboards
Primary Grafana dashboard: **“AOC Ingestion Health”** (`dashboards/aoc-ingestion.json`). Panels include:
1. **Sources overview:** table fed by `ingestion_write_total` and `ingestion_latency_seconds` (mirrors Console tiles).
2. **Violation trend:** stacked bar chart of `aoc_violation_total` per code.
3. **Signature success rate:** timeseries derived from `ingestion_signature_verified_total`.
4. **Supersedes depth:** gauge showing `advisory_revision_count` P95.
5. **Verification runs:** histogram and latency boxplot using `verify_runs_total` / `verify_duration_seconds`.
Secondary dashboards:
- **AOC Alerts (Ops view):** summarises active alerts, last verify run, and links to incident runbook.
- **Offline Mode Dashboard:** fed from Offline Kit imports; highlights snapshot age and queued verification jobs.
Update `docs/assets/dashboards/` with screenshots when the Grafana capture pipeline produces the latest renders.
---
## 5·Operational workflows
1. **During ingestion incident:**
- Check Console dashboard for offending sources.
- Pivot to logs using document `contentHash`.
- Re-run `stella sources ingest --dry-run` with problematic payloads to validate fixes.
- After remediation, run `stella aoc verify --since 24h` and confirm exit code `0`.
2. **Scheduled verification:**
- Configure cron job to run `stella aoc verify --format json --export ...`.
- Ship JSON to `aoc-verify` bucket and ingest into metrics using custom exporter.
- Alert on missing exports (no file uploaded within 26h).
3. **Offline kit validation:**
- Use the Offline Dashboard to ensure snapshots contain the latest metrics.
- Run verification reports locally and attach them to the bundle before distribution.
4. **Incident toggle audit:**
- Authority requires `incident_reason` when issuing `obs:incident` tokens; plan your runbooks to capture business justification.
- Auditors can call `/authority/audit/incident?limit=100` with the tenant header to list recent incident activations, including reason and issuer.
---
## 6·Offline considerations
- Metrics exporters bundled with Offline Kit write to local Prometheus snapshots; sync them with central Grafana once connectivity is restored.
- CLI verification reports should be hashed (`sha256sum`) and archived for audit trails.
- Dashboards include offline data sources (`prometheus-offline`) switchable via dropdown.
---
## 7·References
- [Aggregation-Only Contract reference](../aoc/aggregation-only-contract.md)
- [Architecture overview](../modules/platform/architecture-overview.md)
- [Console guide](../UI_GUIDE.md)
- [CLI AOC commands](../modules/cli/guides/cli-reference.md)
- [Concelier architecture](../modules/concelier/architecture.md)
- [Excititor architecture](../modules/excititor/architecture.md)
- [Scheduler Worker observability guide](../modules/scheduler/operations/worker.md)
---
## 8·Compliance checklist
- [ ] Metrics documented with label sets and alert guidance.
- [ ] Tracing span taxonomy aligned with Concelier/Excititor implementation.
- [ ] Log schema matches structured logging contracts (traceId, tenant, source, contentHash).
- [ ] Grafana dashboard references verified and screenshots scheduled.
- [ ] Offline/air-gap workflow captured.
- [ ] Cross-links to AOC reference, console, and CLI docs included.
- [ ] Observability Guild sign-off scheduled (OWNER: @obs-guild, due 2025-10-28).
---
*Last updated: 2025-10-26 (Sprint 19).*

View File

@@ -0,0 +1,166 @@
# Policy Engine Observability
> **Audience:** Observability Guild, SRE/Platform operators, Policy Guild.
> **Scope:** Metrics, logs, traces, dashboards, alerting, sampling, and incident workflows for the Policy Engine service (Sprint 20).
> **Prerequisites:** Policy Engine v2 deployed with OpenTelemetry exporters enabled (`observability:enabled=true` in config).
---
## 1·Instrumentation Overview
- **Telemetry stack:** OpenTelemetry SDK (metrics + traces), Serilog structured logging, OTLP exporters → Collector → Prometheus/Loki/Tempo.
- **Namespace conventions:** `policy.*` for metrics/traces/log categories; labels use `tenant`, `policy`, `mode`, `runId`.
- **Sampling:** Default 10% trace sampling, 1% rule-hit log sampling; incident mode overrides to 100% (see §7).
- **Correlation IDs:** Every API request gets `traceId` + `requestId`. CLI/UI display IDs to streamline support.
---
## 2·Metrics
### 2.1 Run Pipeline
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_run_seconds` | Histogram | `tenant`, `policy`, `mode` (`full`, `incremental`, `simulate`) | P95 target ≤5min incremental, ≤30min full. |
| `policy_run_queue_depth` | Gauge | `tenant` | Number of pending jobs per tenant (updated each enqueue/dequeue). |
| `policy_run_failures_total` | Counter | `tenant`, `policy`, `reason` (`err_pol_*`, `network`, `cancelled`) | Aligns with error codes. |
| `policy_run_retries_total` | Counter | `tenant`, `policy` | Helps identify noisy sources. |
| `policy_run_inputs_pending_bytes` | Gauge | `tenant` | Size of buffered change batches awaiting run. |
### 2.2 Evaluator Insights
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_rules_fired_total` | Counter | `tenant`, `policy`, `rule` | Increment per rule match (sampled). |
| `policy_vex_overrides_total` | Counter | `tenant`, `policy`, `vendor`, `justification` | Tracks VEX precedence decisions. |
| `policy_suppressions_total` | Counter | `tenant`, `policy`, `action` (`ignore`, `warn`, `quiet`) | Audits suppression usage. |
| `policy_selection_batch_duration_seconds` | Histogram | `tenant`, `policy` | Measures joiner performance. |
| `policy_materialization_conflicts_total` | Counter | `tenant`, `policy` | Non-zero indicates optimistic concurrency retries. |
### 2.3 API Surface
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_api_requests_total` | Counter | `endpoint`, `method`, `status` | Exposed via Minimal API instrumentation. |
| `policy_api_latency_seconds` | Histogram | `endpoint`, `method` | Budget ≤250ms for GETs, ≤1s for POSTs. |
| `policy_api_rate_limited_total` | Counter | `endpoint` | Tied to throttles (`429`). |
### 2.4 Queue & Change Streams
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `policy_queue_leases_active` | Gauge | `tenant` | Number of leased jobs. |
| `policy_queue_lease_expirations_total` | Counter | `tenant` | Alerts when workers fail to ack. |
| `policy_delta_backlog_age_seconds` | Gauge | `tenant`, `source` (`concelier`, `excititor`, `sbom`) | Age of oldest unprocessed change event. |
---
## 3·Logs
- **Format:** JSON (`Serilog`). Core fields: `timestamp`, `level`, `message`, `policyId`, `policyVersion`, `tenant`, `runId`, `rule`, `traceId`, `env.sealed`, `error.code`.
- **Log categories:**
- `policy.run` (queue lifecycle, run begin/end, stats)
- `policy.evaluate` (batch execution summaries; rule-hit sampling)
- `policy.materialize` (Mongo operations, conflicts, retries)
- `policy.simulate` (diff results, CLI invocation metadata)
- `policy.lifecycle` (submit/review/approve events)
- **Sampling:** Rule-hit logs sample 1% by default; toggled to 100% in incident mode or when `--trace` flag used in CLI.
- **PII:** No user secrets recorded; user identities referenced as `user:<id>` or `group:<id>` only.
---
## 4·Traces
- Spans emit via OpenTelemetry instrumentation.
- **Primary spans:**
- `policy.api` — wraps HTTP request, records `endpoint`, `status`, `scope`.
- `policy.select` — change stream ingestion and batch assembly (attributes: `candidateCount`, `cursor`).
- `policy.evaluate` — evaluation batch (attributes: `batchSize`, `ruleHits`, `severityChanges`).
- `policy.materialize` — Mongo writes (attributes: `writes`, `historyWrites`, `retryCount`).
- `policy.simulate` — simulation diff generation (attributes: `sbomCount`, `diffAdded`, `diffRemoved`).
- Trace context propagated to CLI via response headers `traceparent`; UI surfaces in run detail view.
- Incident mode forces span sampling to 100% and extends retention via Collector config override.
---
## 5·Dashboards
### 5.1 Policy Runs Overview
Widgets:
- Run duration histogram (per mode/tenant).
- Queue depth + backlog age line charts.
- Failure rate stacked by error code.
- Incremental backlog heatmap (policy × age).
- Active vs scheduled runs table.
### 5.2 Rule Impact & VEX
- Top N rules by firings (bar chart).
- VEX overrides by vendor/justification (stacked chart).
- Suppression usage (pie + table with justifications).
- Quieted findings trend (line).
### 5.3 Simulation & Approval Health
- Simulation diff histogram (added vs removed).
- Pending approvals by age (table with SLA colour coding).
- Compliance checklist status (lint, determinism CI, simulation evidence).
> Placeholders for Grafana panels should be replaced with actual screenshots once dashboards land (`../assets/policy-observability/*.png`).
---
## 6·Alerting
| Alert | Condition | Suggested Action |
|-------|-----------|------------------|
| **PolicyRunSlaBreach** | `policy_run_seconds{mode="incremental"}` P95 > 300s for 3 windows | Check queue depth, upstream services, scale worker pool. |
| **PolicyQueueStuck** | `policy_delta_backlog_age_seconds` > 600 | Investigate change stream connectivity. |
| **DeterminismMismatch** | Run status `failed` with `ERR_POL_004` OR CI replay diff | Switch to incident sampling, gather replay bundle, notify Policy Guild. |
| **SimulationDrift** | CLI/CI simulation exit `20` (blocking diff) over threshold | Review policy changes before approval. |
| **VexOverrideSpike** | `policy_vex_overrides_total` > configured baseline (per vendor) | Verify upstream VEX feed; ensure justification codes expected. |
| **SuppressionSurge** | `policy_suppressions_total` increase > 3σ vs baseline | Audit new suppress rules; check approvals. |
Alerts integrate with Notifier channels (`policy.alerts`) and Ops on-call rotations.
---
## 7·Incident Mode & Forensics
- Toggle via `POST /api/policy/incidents/activate` (requires `policy:operate` scope).
- Effects:
- Trace sampling → 100%.
- Rule-hit log sampling → 100%.
- Retention window extended to 30 days for incident duration.
- `policy.incident.activated` event emitted (Console + Notifier banners).
- Post-incident tasks:
- `stella policy run replay` for affected runs; attach bundles to incident record.
- Restore sampling defaults with `.../deactivate`.
- Update incident checklist in `/docs/policy/lifecycle.md` (section 8) with findings.
---
## 8·Integration Points
- **Authority:** Exposes metric `policy_scope_denied_total` for failed authorisation; correlate with `policy_api_requests_total`.
- **Concelier/Excititor:** Shared trace IDs propagate via gRPC metadata to help debug upstream latency.
- **Scheduler:** Future integration will push run queues into shared scheduler dashboards (planned in SCHED-MODELS-20-002).
- **Offline Kit:** CLI exports logs + metrics snapshots (`stella offline bundle metrics`) for air-gapped audits.
---
## 9·Compliance Checklist
- [ ] **Metrics registered:** All metrics listed above exported and documented in Grafana dashboards.
- [ ] **Alert policies configured:** Ops or Observability Guild created alerts matching table in §6.
- [ ] **Sampling overrides tested:** Incident mode toggles verified in staging; retention roll-back rehearsed.
- [ ] **Trace propagation validated:** CLI/UI display trace IDs and allow copy for support.
- [ ] **Log scrubbing enforced:** Unit tests guarantee no secrets/PII in logs; sampling respects configuration.
- [ ] **Offline capture rehearsed:** Metrics/log snapshot commands executed in sealed environment.
- [ ] **Docs cross-links:** Links to architecture, runs, lifecycle, CLI, API docs verified.
---
*Last updated: 2025-10-26 (Sprint 20).*

View File

@@ -0,0 +1,48 @@
# Telemetry Core Bootstrap (v1 · 2025-11-19)
## Goal
Show minimal host wiring for `StellaOps.Telemetry.Core` with deterministic defaults and sealed-mode friendliness.
## Sample (web/worker host)
```csharp
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddStellaOpsTelemetry(
    builder.Configuration,
    serviceName: "StellaOps.SampleService",
    serviceVersion: builder.Configuration["VERSION"],
    configureOptions: options =>
    {
        // Disable collector in sealed mode / air-gap
        options.Collector.Enabled = builder.Configuration.GetValue<bool>("Telemetry:Collector:Enabled", true);
        options.Collector.Endpoint = builder.Configuration["Telemetry:Collector:Endpoint"];
        options.Collector.Protocol = TelemetryCollectorProtocol.Grpc;
    },
    configureMetrics: m => m.AddAspNetCoreInstrumentation(),
    configureTracing: t => t.AddHttpClientInstrumentation());
```
## Configuration (appsettings.json)
```json
{
  "Telemetry": {
    "Collector": {
      "Enabled": true,
      "Endpoint": "https://otel-collector.example:4317",
      "Protocol": "Grpc",
      "Component": "sample-service",
      "Intent": "telemetry-export",
      "DisableOnViolation": true
    }
  }
}
```
## Determinism & safety
- UTC timestamps only; no random IDs introduced by the helper.
- Exporter is skipped when endpoint missing or egress policy denies.
- `VSTEST_DISABLE_APPDOMAIN=1` recommended for tests with `tools/linksets-ci.sh` pattern.
## Next
- Propagation adapters (50-002) will build on this bootstrap.
- Scrub/analyzer policies live under upcoming 51-001/51-002 tasks.

View File

@@ -0,0 +1,45 @@
# Telemetry propagation contract (TELEMETRY-OBS-51-001)
**Goal**: standardise trace/metrics propagation across StellaOps services so golden-signal helpers remain deterministic, tenant-safe, and offline-friendly.
## Scope
- Applies to HTTP, gRPC, background jobs, and message handlers instrumented via `StellaOps.Telemetry.Core`.
- Complements bootstrap guide (`telemetry-bootstrap.md`) and precedes metrics helper implementation.
## Required context fields
- `trace_id` / `span_id`: W3C TraceContext headers only (no B3); generate if missing.
- `tenant`: lower-case string; required for all incoming requests; default to `unknown` only in sealed/offline diagnostics jobs.
- `actor`: optional user/service principal; redacted to hash in logs when `Scrub.Sealed=true`.
- `imposed_rule`: optional string conveying enforcement context (e.g., `merge=false`).
## HTTP middleware
- Accept `traceparent`/`tracestate`; reject/strip vendor-specific headers.
- Propagate `tenant`, `actor`, `imposed-rule` via `x-stella-tenant`, `x-stella-actor`, `x-stella-imposed-rule` headers (defaults configurable via `Telemetry:Propagation`).
- Middleware entry point: `app.UseStellaOpsTelemetryContext()` plus the `TelemetryPropagationHandler` automatically added to all `HttpClient` instances when `AddStellaOpsTelemetry` is called.
- Emit exemplars: when sampling is off, attach exemplar ids to request duration and active request metrics.
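An illustrative sketch of what such a propagation handler could do; the shipped `TelemetryPropagationHandler` may differ in detail:
```csharp
using System.Diagnostics;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: copies baggage set by inbound middleware onto outbound headers.
public sealed class StellaPropagationHandlerSketch : DelegatingHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var activity = Activity.Current;
        CopyBaggage(activity, request, "tenant", "x-stella-tenant");
        CopyBaggage(activity, request, "actor", "x-stella-actor");
        CopyBaggage(activity, request, "imposed_rule", "x-stella-imposed-rule");
        return base.SendAsync(request, cancellationToken);
    }

    private static void CopyBaggage(Activity? activity, HttpRequestMessage request, string baggageKey, string header)
    {
        var value = activity?.GetBaggageItem(baggageKey);
        if (!string.IsNullOrEmpty(value) && !request.Headers.Contains(header))
        {
            request.Headers.TryAddWithoutValidation(header, value);
        }
    }
}
```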
## gRPC interceptors
- Use binary TraceContext; carry metadata keys `stella-tenant`, `stella-actor`, `stella-imposed-rule`.
- Enforce presence of `tenant`; abort with `Unauthenticated` if missing in non-sealed mode.
## Jobs & message handlers
- Wrap background job execution with Activity + baggage items (`tenant`, `actor`, `imposed_rule`).
- When publishing bus events, stamp `trace_id` and `tenant` into headers; avoid embedding PII in payloads.
## Metrics helper expectations
- Golden signals: `http.server.duration`, `http.client.duration`, `messaging.operation.duration`, `job.execution.duration`, `runtime.gc.pause`, `db.call.duration`.
- Mandatory tags: `tenant`, `service`, `endpoint`/`operation`, `result` (`ok|error|cancelled|throttled`), `sealed` (`true|false`).
- Cardinality guard: trim tag values to 64 chars (configurable) and replace values beyond the first 50 distinct entries per key with `other` (enforced by `MetricLabelGuard`).
- Helper API: `Histogram<double>.RecordRequestDuration(guard, durationMs, route, verb, status, result)` applies guard + tags consistently.
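A rough illustration of the cardinality rules above (64-character trim, first 50 distinct values per key, `other` afterwards); the real `MetricLabelGuard` may differ:
```csharp
using System.Collections.Concurrent;

// Sketch of the documented cardinality guard behaviour; not the shipped type.
public sealed class MetricLabelGuardSketch
{
    private const int MaxValueLength = 64;
    private const int MaxDistinctPerKey = 50;
    private readonly ConcurrentDictionary<string, ConcurrentDictionary<string, byte>> _seen = new();

    public string Normalize(string tagKey, string tagValue)
    {
        var trimmed = tagValue.Length > MaxValueLength ? tagValue[..MaxValueLength] : tagValue;
        var known = _seen.GetOrAdd(tagKey, _ => new ConcurrentDictionary<string, byte>());

        if (known.ContainsKey(trimmed))
        {
            return trimmed;        // already-seen values always pass through
        }

        if (known.Count >= MaxDistinctPerKey)
        {
            return "other";        // cap reached: collapse new values
        }

        known.TryAdd(trimmed, 0);  // admit a new value (approximate under concurrency)
        return trimmed;
    }
}
```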
## Determinism & offline posture
- All timestamps UTC RFC3339; sampling configs controlled via appsettings and mirrored in offline bundles.
- No external exporters when `Sealed=true`; use in-memory or file-based OTLP for air-gap.
## Tests to add with implementation
- Middleware unit tests asserting header/baggage mapping and tenant enforcement.
- Metrics helper tests ensuring required tags present and trimmed; exemplar id attached when enabled.
- Deterministic snapshot tests for serialized OTLP when sealed/offline.
## Provenance
- Authored 2025-11-20 to unblock TELEMETRY-OBS-51-001; to be refined as helpers are coded.

View File

@@ -0,0 +1,35 @@
# Telemetry scrubbing contract (TELEMETRY-OBS-51-002)
**Purpose**: define redaction/scrubbing rules for logs/traces/metrics before implementing helpers in `StellaOps.Telemetry.Core`.
## Redaction rules
- Strip or hash PII/credentials: emails, tokens, passwords, secrets, bearer/mTLS cert blobs.
- Default hash algorithm: SHA-256 hex; include `scrubbed=true` tag.
- Allowlist fields that remain: `tenant`, `trace_id`, `span_id`, `endpoint`, `result`, `sealed`.
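An illustrative regex-plus-hash scrubber for the rules above; the patterns and class name are assumptions, not the shipped filter:
```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

// Illustrative scrubber: replaces emails and bearer tokens with SHA-256 hex.
public static class TelemetryScrubberSketch
{
    private static readonly Regex Email = new(@"[^\s@]+@[^\s@]+\.[^\s@]+", RegexOptions.Compiled);
    private static readonly Regex BearerToken = new(@"(?i)bearer\s+[A-Za-z0-9\-._~+/]+=*", RegexOptions.Compiled);

    public static string Scrub(string value, string? salt = null)
    {
        var scrubbed = Email.Replace(value, m => Hash(m.Value, salt));
        return BearerToken.Replace(scrubbed, m => Hash(m.Value, salt));
    }

    private static string Hash(string value, string? salt)
    {
        // SHA-256 lowercase hex; deterministic across deployments when no salt is set.
        var bytes = Encoding.UTF8.GetBytes(salt is null ? value : salt + value);
        return Convert.ToHexString(SHA256.HashData(bytes)).ToLowerInvariant();
    }
}
```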
## Configuration knobs
- `Telemetry:Scrub:Enabled` (bool, default true).
- `Telemetry:Scrub:Sealed` (bool, default false) — when true, force scrubbing and disable external exporters.
- `Telemetry:Scrub:HashSalt` (string, optional) — per-tenant salt; omit to keep deterministic hashes across deployments.
- `Telemetry:Scrub:MaxValueLength` (int, default 256) — truncate values beyond this length before hashing.
## Logger sink expectations
- Implement scrubber as `ILogPayloadFilter` injected before sink.
- Ensure message templates remain intact; only values scrubbed.
- Preserve structured shape so downstream parsing remains deterministic.
## Metrics & traces
- Never place raw user input into metric/tag values; pass through scrubber before export.
- Span events must omit payload bodies; include keyed references only.
## Auditing
- When scrubbing occurs, add tag `scrubbed=true` and `scrub_reason` (`pii|secret|length|pattern`).
- Provide counter `telemetry.scrub.events{tenant,reason}` for observability.
## Tests to add with implementation
- Unit tests for regex-based scrubbing of tokens, emails, URLs with creds.
- Config-driven tests toggling `Enabled`/`Sealed` modes to ensure exporters are suppressed when sealed.
- Determinism test: same input yields identical hashed output when salt unset.
## Provenance
- Authored 2025-11-20 to unblock TELEMETRY-OBS-51-002 and downstream 55/56 tasks.

View File

@@ -0,0 +1,33 @@
# Sealed-mode telemetry helpers (TELEMETRY-OBS-56-001 prep)
## Objective
Define behavior and configuration for telemetry when `Sealed=true`, ensuring no external egress while preserving deterministic local traces/metrics for audits.
## Requirements
- Disable external OTLP/exporters automatically when sealed; fallback to in-memory or file OTLP (`telemetry-sealed.otlp`) with bounded size (default 10 MB, ring buffer).
- Add tag `sealed=true` to all spans/metrics/logs; suppress exemplars.
- Force scrubbing: treat `Scrub.Sealed=true` regardless of default settings.
- Sampling: cap to 10% max in sealed mode unless CLI incident toggle raises it (see CLI-OBS-12-001 contract); ceiling 100% with explicit override `Telemetry:Sealed:MaxSamplingPercent`.
- Clock source: require monotonic clock for duration; emit warning if system clock skew detected >500ms.
## Configuration keys
- `Telemetry:Sealed:Enabled` (bool) — driven by host; when true activate sealed behavior.
- `Telemetry:Sealed:Exporter` (enum `memory|file`) — default `file`.
- `Telemetry:Sealed:FilePath` (string) — default `./logs/telemetry-sealed.otlp`.
- `Telemetry:Sealed:MaxBytes` (int) — default 10_485_760 (10 MB).
- `Telemetry:Sealed:MaxSamplingPercent` (int) — default 10.
- Derived flag `Telemetry:Sealed:EffectiveIncidentMode` (read-only) exposes if incident-mode override lifted sampling ceiling.
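A sketch of the sampling ceiling logic, with parameter names mirroring the configuration keys (the helper itself is an assumption):
```csharp
using System;

// Sketch only: sealed mode caps sampling at MaxSamplingPercent unless the
// CLI incident toggle lifts the ceiling to 100%.
public static class SealedSampling
{
    public static double EffectivePercent(
        double requestedPercent,
        bool sealedMode,
        bool incidentMode,
        int maxSamplingPercent = 10)
    {
        if (!sealedMode)
        {
            return requestedPercent;
        }

        var ceiling = incidentMode ? 100 : maxSamplingPercent;
        return Math.Min(requestedPercent, ceiling);
    }
}
```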
## File exporter format
- OTLP binary, append-only, deterministic ordering by enqueue time.
- Rotate when exceeding `MaxBytes` using suffix `.1`, `.2` capped to 3 files; oldest dropped.
- Permissions 0600 by default; fail-start if path is world-readable.
## Validation tests to implement with 56-001
- Unit: sealed mode forces exporter swap and tags `sealed=true`, `scrubbed=true`.
- Unit: sampling capped at max percent unless incident override set.
- Unit: file exporter rotates deterministically and enforces 0600 perms.
- Integration: sealed + incident mode together still block external exporters and honor scrub rules.
## Provenance
- Authored 2025-11-20 to satisfy PREP-TELEMETRY-OBS-56-001 and unblock implementation.

View File

@@ -0,0 +1,38 @@
# Telemetry Standards (DOCS-OBS-50-002)
Last updated: 2025-11-25 (Docs Tasks Md.VI)
## Common envelope
- **Trace context**: `trace_id`, `span_id`, `trace_flags`; propagate W3C `traceparent` and `baggage` end to end.
- **Tenant & workload**: `tenant`, `workload` (service name), `region`, `env` (dev/stage/prod), `version` (git sha or semver).
- **Subject**: `component` (module), `operation` (verb/name), `resource` (purl/uri/subject id when safe).
- **Timing**: UTC ISO-8601 `timestamp`; durations in milliseconds as integers.
- **Outcome**: `status` (`ok|error|fault|throttle`), `error.code` (machine), `error.message` (human, redacted), `retryable` (bool).
## Scrubbing policy
- Denylist PII/secrets before emit: emails, tokens, Authorization headers, bearer fragments, private keys, passwords, session IDs.
- Redact fields to `"[redacted]"` and add `redaction.reason` (`secret|pii|tenant_policy`).
- Hash low-cardinality identifiers when needed (`sha256` lowercase hex) and mark `hashed=true`.
- Logs must not contain full request/response bodies; store hashes plus lengths. For NDJSON exports, allow hashes + selected headers only.
## Sampling defaults
- **Traces**: 10% head sampling non-prod; 100% for `status=error|fault` and for spans tagged `audit=true`. Prod default 5% with the same error/audit boost.
- **Logs**: info logs rate-limited per component (default 100/s); warn/error never sampled. Structured JSON only.
- **Metrics**: never sampled; counters/gauges/histograms use deterministic bucket boundaries documented in component specs.
## Redaction override procedure
- Overrides are rare and must be auditable.
- To allow a field temporarily, set `telemetry.redaction.overrides=<comma list>` in service config with change-ticket id; emit `redaction.override=true` tag on affected spans/logs.
- Overrides expire automatically after `telemetry.redaction.override_ttl` (default 24h); services refuse to start with expired overrides.
- All overrides are logged to `telemetry.redaction.audit` channel with actor, ticket, fields, TTL.
## Determinism & offline posture
- No external enrichers; all enrichment data must be preloaded bundles (e.g., service map, tenant metadata).
- Sorting for exports: by `timestamp`, then `workload`, then `operation`.
- Time always UTC; avoid locale-specific formats.
## Validation checklist
- [ ] `traceparent` propagated and present on inbound/outbound.
- [ ] Required fields present (`tenant`, `workload`, `operation`, `status`).
- [ ] Scrubbing tests cover auth headers and bodies.
- [ ] Sampling knobs configurable via env vars with documented defaults.

View File

@@ -0,0 +1,37 @@
# Tracing Standards (DOCS-OBS-50-004)
Last updated: 2025-11-25 (Docs Tasks Md.VI)
## Goals
- Consistent distributed tracing across services (API, workers, CLI).
- Safe for offline/air-gapped deployments.
- Deterministic span data for replay/debug.
## Context propagation
- Use W3C headers: `traceparent` (required), `baggage` (optional key/value pairs).
- Preserve incoming `trace_id` for all downstream calls; create child spans per operation.
- For async work (queues, cron), copy `traceparent` and `baggage` into the message envelope; new span links to the stored context using **links**, not a new parent.
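A sketch of the link-not-parent rule for queued work, assuming an illustrative source name and a stored `traceparent` string:
```csharp
using System.Collections.Generic;
using System.Diagnostics;

// The consumer span links to the producer's context instead of adopting it as parent.
public static class QueueTracing
{
    private static readonly ActivitySource Source = new("StellaOps.Worker");

    public static Activity? StartProcessSpan(string? storedTraceparent)
    {
        var links = new List<ActivityLink>();
        if (storedTraceparent is not null &&
            ActivityContext.TryParse(storedTraceparent, traceState: null, out var upstream))
        {
            links.Add(new ActivityLink(upstream));
        }

        return Source.StartActivity(
            "worker.process",
            ActivityKind.Consumer,
            parentContext: default,
            links: links);
    }
}
```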
## Span conventions
- Names: `<component>.<operation>` (e.g., `riskengine.simulate`, `notify.deliver`).
- Required attributes: `tenant`, `workload` (service), `env`, `region`, `version`, `operation`, `status`.
- HTTP spans: add `http.method`, `http.route`, `http.status_code`, `net.peer.name`, `net.peer.port`.
- DB spans: `db.system`, `db.name`, `db.operation`, `db.statement` (omit literals).
- Message spans: `messaging.system`, `messaging.destination`, `messaging.operation` (`send|receive|process`), `messaging.message_id`.
- Errors: set `status=error`, include `error.code`, redacted `error.message`, `retryable` (bool).
## Sampling
- Default head sampling: 10% non-prod, 5% prod.
- Always sample spans with `status=error|fault` or `audit=true`.
- Allow override via env `Tracing__SampleRate` (0–1) per service; document in runbooks.
## Offline/air-gap posture
- No external exporters; emit OTLP to local collector or file.
- Disable remote enrichment; rely on bundled service map.
- All timestamps UTC; span ids deterministic only in scope of traceparent (no GUID reuse).
## Validation checklist
- [ ] `traceparent` forwarded on every inbound/outbound call.
- [ ] Required attributes present on spans.
- [ ] Error spans include codes and redacted messages.
- [ ] Sampling knobs documented in service config.

View File

@@ -0,0 +1,190 @@
# Console Observability
> **Audience:** Observability Guild, Console Guild, SRE/operators.
> **Scope:** Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).
> **Prerequisites:** Console deployed with metrics enabled (`CONSOLE_METRICS_ENABLED=true`) and OTLP exporters configured (`OTEL_EXPORTER_OTLP_*`).
---
## 1·Instrumentation Overview
- **Telemetry stack:** OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose `/metrics` (Prometheus) and `/health/*`.
- **Sampling:** Front-end spans sample at 5% by default (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`). Metrics are un-sampled; log sampling is handled per category (§3).
- **Correlation IDs:** Every API call carries `x-stellaops-correlation-id`; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.
- **Scope gating:** Operators need the `ui.telemetry` scope to view live charts in the Admin workspace; the scope also controls access to `/console/telemetry` SSE streams.
---
## 2·Metrics
### 2.1 Experience & Navigation
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_route_render_seconds` | Histogram | `route`, `tenant`, `device` (`desktop`,`tablet`) | Time between route activation and first interactive paint. Target P95 ≤1.5s (cached). |
| `ui_request_duration_seconds` | Histogram | `service`, `method`, `status`, `tenant` | Gateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades. |
| `ui_filter_apply_total` | Counter | `route`, `filter`, `tenant` | Increments when a global filter or context chip is applied. Used to track adoption of saved views. |
| `ui_tenant_switch_total` | Counter | `fromTenant`, `toTenant`, `trigger` (`picker`, `shortcut`, `link`) | Emitted after a successful tenant switch; correlates with Authority `ui.tenant.switch` logs. |
| `ui_offline_banner_seconds` | Histogram | `reason` (`authority`, `manifest`, `gateway`), `tenant` | Duration of offline banner visibility; integrate with air-gap SLAs. |
### 2.2 Security & Session
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_dpop_failure_total` | Counter | `endpoint`, `reason` (`nonce`, `jkt`, `clockSkew`) | Raised when DPoP validation fails; pair with Authority audit trail. |
| `ui_fresh_auth_prompt_total` | Counter | `action` (`token.revoke`, `policy.activate`, `client.create`), `tenant` | Counts fresh-auth modals; backlog above baseline indicates workflow friction. |
| `ui_fresh_auth_failure_total` | Counter | `action`, `reason` (`timeout`,`cancelled`,`auth_error`) | Optional metric (set `CONSOLE_FRESH_AUTH_METRICS=true` when feature flag lands). |
### 2.3 Downloads & Offline Kit
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_download_manifest_refresh_seconds` | Histogram | `tenant`, `channel` (`edge`,`stable`,`airgap`) | Time to fetch and verify downloads manifest. Target <3s. |
| `ui_download_export_queue_depth` | Gauge | `tenant`, `artifactType` (`sbom`,`policy`,`attestation`,`console`) | Mirrors `/console/downloads` queue depth; triggers when offline bundles lag. |
| `ui_download_command_copied_total` | Counter | `tenant`, `artifactType` | Increments when users copy CLI commands from the UI. Useful to observe CLI parity adoption. |
### 2.4 Telemetry Emission & Errors
| Metric | Type | Labels | Notes |
|--------|------|--------|-------|
| `ui_telemetry_batch_failures_total` | Counter | `transport` (`otlp-http`,`otlp-grpc`), `reason` | Emitted by OTLP bridge when batches fail. Enable via `CONSOLE_METRICS_VERBOSE=true`. |
| `ui_telemetry_queue_depth` | Gauge | `priority` (`normal`,`high`), `tenant` | Browser-side buffer depth; monitor for spikes under degraded collectors. |
> **Scraping tips:**
> - Enable `/metrics` via `CONSOLE_METRICS_ENABLED=true`.
> - Set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318` and relevant headers (`OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>`).
> - For air-gapped sites, point the exporter to the Offline Kit collector (`localhost:4318`) and forward the metrics snapshot using `stella offline bundle metrics`.
---
## 3·Logs
- **Format:** JSON via the Console log bridge; emitted to stdout and to an optional OTLP log exporter. Core fields: `timestamp`, `level`, `action`, `route`, `tenant`, `subject`, `correlationId`, `dpop.jkt`, `device`, `offlineMode`.
- **Categories:**
  - `ui.action` – general user interactions (route changes, command palette, filter updates). Sampled 50% by default; override with feature flag `telemetry.logVerbose`.
  - `ui.tenant.switch` – always logged; includes `fromTenant`, `toTenant`, `tokenId`, and Authority audit correlation.
  - `ui.download.commandCopied` – download commands copied; includes `artifactId`, `digest`, `manifestVersion`.
  - `ui.security.anomaly` – DPoP mismatches, tenant header errors, CSP violations (level = `Warning`).
  - `ui.telemetry.failure` – OTLP export errors; include `httpStatus`, `batchSize`, `retryCount`.
- **PII handling:** Full email addresses are scrubbed; subjects appear only as hashed values (`user:<sha256>`). Even when `ui.admin` plus fresh-auth is granted for the action, the raw address remains redacted in logs (a hashing sketch follows this list).
- **Retention:** Recommended 14days for connected sites, 30days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label `service="stellaops-web-ui"`.
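A minimal sketch of the PII rule above, hashing the subject with the Web Crypto API before it reaches any sink. The `logUiAction` helper and its exact field set are illustrative; only the core fields listed in this section come from the contract itself.

```typescript
// Sketch: the subject is logged only as user:<sha256>, never as a raw address.
async function hashSubject(email: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(email));
  const hex = Array.from(new Uint8Array(digest))
    .map((byte) => byte.toString(16).padStart(2, '0'))
    .join('');
  return `user:${hex}`;
}

async function logUiAction(
  action: string,
  route: string,
  tenant: string,
  email: string,
  correlationId: string,
): Promise<void> {
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    level: 'Information',
    action,
    route,
    tenant,
    subject: await hashSubject(email), // hashed, per the PII rule above
    correlationId,
  }));
}
```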
---
## 4·Traces
- **Span names & attributes:**
  - `ui.route.transition` – wraps route navigation; attributes: `route`, `tenant`, `renderMillis`, `prefetchHit` (a sketch follows this list).
  - `ui.api.fetch` – HTTP fetch to backend; attributes: `service`, `endpoint`, `status`, `networkTime`.
  - `ui.sse.stream` – Server-sent event subscriptions (status ticker, runs); attributes: `channel`, `connectedMillis`, `reconnects`.
  - `ui.telemetry.batch` – Browser OTLP flush; attributes: `batchSize`, `success`, `retryCount`.
  - `ui.policy.action` – Policy workspace actions (simulate, approve, activate) per `docs/UI_GUIDE.md`.
- **Propagation:** Spans use the W3C `traceparent` header; the gateway echoes it to backend APIs so traces stitch across the UI, gateway, and backend services.
- **Sampling controls:** `OTEL_TRACES_SAMPLER_ARG` (ratio) and feature flag `telemetry.forceSampling` (sets to 100% for incident debugging).
- **Viewing traces:** Grafana Tempo or Jaeger via collector. Filter by `service.name = stellaops-console`. For cross-service debugging, filter on `correlationId` and `tenant`.
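For orientation, here is a sketch of how the `ui.route.transition` span listed above could wrap a navigation using the OpenTelemetry API. `navigateTo` is a hypothetical router call, and the attribute names mirror the list; the console's actual instrumentation may differ.

```typescript
// Sketch: wrapping a navigation in the ui.route.transition span described above.
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('stellaops-console');

async function tracedNavigation(
  route: string,
  tenant: string,
  navigateTo: (route: string) => Promise<boolean>, // hypothetical router call; resolves prefetchHit
): Promise<void> {
  await tracer.startActiveSpan('ui.route.transition', async (span) => {
    const start = performance.now();
    try {
      const prefetchHit = await navigateTo(route);
      span.setAttributes({
        route,
        tenant,
        renderMillis: Math.round(performance.now() - start),
        prefetchHit,
      });
    } finally {
      span.end();
    }
  });
}
```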
---
## 5·Dashboards
### 5.1 Experience Overview
Panels:
- Route render histogram (P50/P90/P99) by route.
- Backend call latency stacked by service (`ui_request_duration_seconds`).
- Offline banner duration trend (`ui_offline_banner_seconds`).
- Tenant switch volume vs failure rate (overlay `ui_dpop_failure_total`).
- Command palette usage (`ui_filter_apply_total` + `ui.action` log counts).
### 5.2 Downloads & Offline Kit
- Manifest refresh time chart (per channel).
- Export queue depth gauge with alert thresholds.
- CLI command adoption (bar chart per artifact type, using `ui_download_command_copied_total`).
- Offline parity banner occurrences (derived from the Downloads API `downloads.offlineParity` flag).
- Last Offline Kit import timestamp (join with Downloads API metadata).
### 5.3 Security & Session
- Fresh-auth prompt counts vs success/fail ratios.
- DPoP failure stacked by reason.
- Tenant mismatch warnings (from `ui.security.anomaly` logs).
- Scope usage heatmap (derived from Authority audit events + UI logs).
- CSP violation counts (browser `securitypolicyviolation` listener forwarded to logs; see the sketch after this list).
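A minimal sketch of that listener, assuming a hypothetical `emitSecurityAnomaly` helper that writes `ui.security.anomaly` entries in the log shape from §3:

```typescript
// Sketch: forward CSP violations into the ui.security.anomaly log category.
function emitSecurityAnomaly(fields: Record<string, unknown>): void {
  // Stand-in for the console's log bridge; level Warning per §3.
  console.warn(JSON.stringify({ level: 'Warning', action: 'ui.security.anomaly', ...fields }));
}

document.addEventListener('securitypolicyviolation', (event) => {
  emitSecurityAnomaly({
    kind: 'csp',
    blockedURI: event.blockedURI,
    violatedDirective: event.violatedDirective,
    sourceFile: event.sourceFile,
    lineNumber: event.lineNumber,
  });
});
```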
> Capture screenshots for Grafana once dashboards stabilise (`docs/assets/ui/observability/*.png`). Replace placeholders before releasing the doc.
---
## 6·Alerting
| Alert | Condition | Suggested Action |
|-------|-----------|------------------|
| **ConsoleLatencyHigh** | Share of renders completing within 1.5 s (`ui_route_render_seconds_bucket{le="1.5"}` divided by `ui_route_render_seconds_count`) drops below 0.95 for 3 consecutive intervals | Inspect route splits, check backend latencies, review CDN cache. |
| **BackendLatencyHigh** | `ui_request_duration_seconds_sum / ui_request_duration_seconds_count` > 1s for any service | Correlate with gateway/service dashboards; escalate to owning guild. |
| **TenantSwitchFailures** | Increase in `ui_dpop_failure_total` or `ui.security.anomaly` (tenant mismatch) > 3/min | Validate Authority issuer, check clock skew, confirm tenant config. |
| **FreshAuthLoop** | `ui_fresh_auth_prompt_total` spikes with matching `ui_fresh_auth_failure_total` | Review Authority `/fresh-auth` endpoint, session timeout config, UX regressions. |
| **OfflineBannerLong** | `ui_offline_banner_seconds` P95 > 120s | Investigate Authority/gateway availability; verify Offline Kit freshness. |
| **DownloadsBacklog** | `ui_download_export_queue_depth` > 5 for 10min OR queue age > alert threshold | Ping Downloads service, ensure manifest pipeline (`DOWNLOADS-CONSOLE-23-001`) is healthy. |
| **TelemetryExportErrors** | `ui_telemetry_batch_failures_total` > 0 for ≥5min | Check collector health, credentials, or TLS trust. |
Integrate alerts with Notifier (`ui.alerts`) or existing Ops channels. Tag incidents with `component=console` for correlation.
---
## 7·Feature Flags & Configuration
| Flag / Env Var | Purpose | Default |
|----------------|---------|---------|
| `CONSOLE_FEATURE_FLAGS` | Enables UI modules (`runs`, `downloads`, `policies`, `telemetry`). Telemetry panel requires `telemetry`. | `runs,downloads,policies` |
| `CONSOLE_METRICS_ENABLED` | Exposes `/metrics` for Prometheus scrape. | `true` |
| `CONSOLE_METRICS_VERBOSE` | Emits additional batching metrics (`ui_telemetry_*`). | `false` |
| `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`). Use `Debug` for incident sampling. | `Information` |
| `CONSOLE_METRICS_SAMPLING` *(planned)* | Controls front-end span sampling ratio. Document once released. | `0.05` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL; supports HTTPS. | unset |
| `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers (auth). | unset |
| `OTEL_EXPORTER_OTLP_INSECURE` | Allow HTTP (dev only). | `false` |
| `OTEL_SERVICE_NAME` | Service tag for traces/logs. Set to `stellaops-console`. | auto |
| `CONSOLE_TELEMETRY_SSE_ENABLED` | Enables `/console/telemetry` SSE feed for dashboards. | `true` |
Feature flag changes should be tracked in release notes and mirrored in `docs/UI_GUIDE.md` (navigation and workflow expectations).
---
## 8·Offline / Air-Gapped Workflow
- Mirror the console image and telemetry collector as part of the Offline Kit (see `/docs/operations/console-docker-install.md` §4).
- Scrape metrics locally via `curl -k https://console.local/metrics > metrics.prom`; archive alongside logs for audits.
- Use `stella offline kit import` to keep the downloads manifest in sync; dashboards display staleness using `ui_download_manifest_refresh_seconds`.
- When collectors are unavailable, the console queues OTLP batches (for up to 5 minutes) and exposes the backlog through `ui_telemetry_queue_depth`; export queue metrics to prove no data loss (a buffering sketch follows this list).
- After reconnecting, run `stella console status --telemetry` *(CLI parity pending; see DOCS-CONSOLE-23-014)* or verify that `ui_telemetry_batch_failures_total` has stopped increasing.
- Retain telemetry bundles for 30days per compliance guidelines; include Grafana JSON exports in audit packages.
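The buffering behaviour described above can be pictured as a small bounded queue; the sketch below is illustrative only and does not reflect the bridge's actual retry logic or serialization format.

```typescript
// Illustrative sketch: hold OTLP batches for up to 5 minutes while the collector is unreachable.
interface QueuedBatch { payload: Uint8Array; enqueuedAt: number; }

const MAX_AGE_MS = 5 * 60 * 1000;
const queue: QueuedBatch[] = [];

function enqueue(payload: Uint8Array): void {
  queue.push({ payload, enqueuedAt: Date.now() });
}

async function flush(sendToCollector: (payload: Uint8Array) => Promise<void>): Promise<void> {
  const now = Date.now();
  while (queue.length > 0) {
    const batch = queue[0];
    if (now - batch.enqueuedAt > MAX_AGE_MS) {
      queue.shift(); // expired: drop the batch rather than keep an unbounded backlog
      continue;
    }
    try {
      await sendToCollector(batch.payload);
      queue.shift(); // delivered
    } catch {
      break; // collector still unreachable; keep the backlog and retry on the next flush
    }
  }
}
```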
---
## 9·Compliance Checklist
- [ ] `/metrics` scraped in staging & production; dashboards display `ui_route_render_seconds`, `ui_request_duration_seconds`, and downloads metrics.
- [ ] OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).
- [ ] Alert rules from §6 implemented in monitoring stack with runbooks linked.
- [ ] Feature flags documented and change-controlled; telemetry disabled only with approval.
- [ ] DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.
- [ ] Offline capture workflow exercised; evidence stored in audit vault.
- [ ] Screenshots of Grafana dashboards committed once they stabilise (update references).
- [ ] Cross-links verified (`docs/deploy/console.md`, `docs/security/console-security.md`, `docs/UI_GUIDE.md`).
---
## 10·References
- `/docs/deploy/console.md` – Metrics endpoint, OTLP config, health checks.
- `/docs/security/console-security.md` – Security metrics & alert hints.
- `docs/UI_GUIDE.md` – Console workflows and offline posture.
- `/docs/observability/observability.md` – Platform-wide practices.
- `/ops/telemetry-collector.md` & `/ops/telemetry-storage.md` – Collector deployment.
- `/docs/operations/console-docker-install.md` – Compose/Helm environment variables.
---
*Last updated: 2025-10-28 (Sprint23).*