# Telemetry observability runbook (stub · 2025-11-29 demo) ## Dashboards (offline import) - Grafana JSON: `docs/modules/telemetry/operations/dashboards/telemetry-observability.json` (import locally; no external data sources assumed). - Planned panels: collector uptime, scrape errors, ingestion/backlog per tenant, storage retention headroom, query latency p95/p99, and OTLP export errors. ## Key metrics - `telemetry_collector_uptime_seconds` — per-collector uptime. - `telemetry_scrape_failures_total{job}` — scrape failures per job. - `telemetry_ingest_backlog` — queued spans/logs/metrics awaiting storage. - `telemetry_storage_retention_percent_used` — storage utilization against retention budget. - `telemetry_query_latency_seconds_bucket{route}` — API/query latency. - `telemetry_otlp_export_failures_total{signal}` — OTLP export failures by signal. ## Logs & traces - Correlate by `trace_id` and `tenant`; include `collector_id`, `pipeline`, `exporter` fields. - Traces disabled by default for air-gap; enable by setting OTLP endpoints to on-prem collectors. ## Health/diagnostics - `/health/liveness` and `/health/readiness` (collector + storage gateway) check exporter reachability and disk headroom. - `/status` exposes build version, commit, feature flags; verify against offline bundle manifest. - Storage probe: `GET /api/storage/usage` (if available) to confirm retention headroom; otherwise rely on Prometheus metrics. ## Alert hints - OTLP export failures > 0 over 5m. - Ingest backlog above threshold (configurable per tenant/workload). - Query latency p99 > 1s for `/api/query` routes. - Storage utilization > 85% of retention budget. ## Offline verification steps 1) Import Grafana JSON locally; point to Prometheus scrape labeled `telemetry`. 2) Run collector smoke: push sample OTLP spans/logs/metrics to local collector and confirm metrics emit in Prometheus. 3) Fetch `/status` and compare commit/version to offline bundle manifest. ## Evidence locations - Sprint tracker: `docs/implplan/SPRINT_0330_0001_0001_docs_modules_telemetry.md`. - Module docs: `README.md`, `architecture.md`, `implementation_plan.md`. - Dashboard stub: `operations/dashboards/telemetry-observability.json`.