# Notify observability runbook (stub · 2025-11-29 demo)

## Dashboards (offline import)
- Grafana JSON: `docs/modules/notify/operations/dashboards/notify-observability.json` (import locally; no external data sources assumed).
- Planned panels: enqueue/dequeue rate, delivery latency p95/p99, channel error rate, retry/dead-letter counts, rule evaluation latency, tenant isolation breaches (should stay 0), and notification simulation outcomes.

## Key metrics
- `notify_enqueue_total{channel}`: notifications enqueued, by channel.
- `notify_delivery_latency_seconds_bucket{channel}`: delivery latency histogram, by channel.
- `notify_delivery_failures_total{channel,reason}`: failed deliveries, by channel and reason.
- `notify_retry_total{channel}` and `notify_deadletter_total{channel}`: retries and dead letters, by channel.
- `notify_rule_eval_duration_seconds_bucket`: rule evaluation latency histogram.
- `notify_simulation_total{result}`: simulation outcomes when quiet-hours/correlation rules are applied.

## Logs & traces
- Correlate by `notificationId`, `ruleId`, `tenant`, `channel`. Include the `quietHoursApplied`, `correlationKey`, and `retries` fields.
- Traces are disabled by default for air-gapped deployments; enable them by pointing the OTLP exporter at an on-prem collector.

## Health/diagnostics
- `/health/liveness` and `/health/readiness` check queue backend reachability and channel provider credentials.
- `/status` exposes build version, commit, and feature flags; verify against the offline bundle manifest.
- Simulation probe: `/api/notify/simulate` with a sample rule set, to validate correlation/digest wiring once NOTIFY-SVC-39-001..004 land.

## Alert hints
- Delivery latency p99 > 1.5s for email/webhook channels.
- Dead-letter queue growth > threshold.
- Rule evaluation latency p99 > 500ms.
- Correlation/quiet-hours simulation failures once enabled.

## Offline verification steps
1) Import the Grafana JSON locally; point it at the Prometheus scrape labeled `notify`.
2) Run `stella notify simulate --rules samples/rules.yaml --dry-run` (once available) and confirm that metrics and logs are emitted locally.
3) Fetch `/status` and compare commit/version to the offline bundle manifest.

## Evidence locations
- Sprint tracker: `docs/implplan/SPRINT_322_docs_modules_notify.md`.
- Module docs: `README.md`, `architecture.md`, `implementation_plan.md`.
- Dashboard stub: `operations/dashboards/notify-observability.json`.
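
## Example: comparing `/status` to the bundle manifest (sketch)

The last offline verification step (compare the `/status` commit/version to the offline bundle manifest) could be scripted roughly as below. This is a minimal sketch, not the module's tooling: the JSON key names `version` and `commit`, the manifest layout, and the helper name `compare_status_to_manifest` are all assumptions made for illustration. Adjust them to whatever the real `/status` payload and manifest contain.

```python
import json


def compare_status_to_manifest(status, manifest, keys=("version", "commit")):
    """Return a list of mismatch descriptions; an empty list means a match.

    `status` and `manifest` are plain dicts. The key names are assumed,
    not taken from the actual /status schema or manifest format.
    """
    mismatches = []
    for key in keys:
        got = status.get(key)
        want = manifest.get(key)
        if got is None or want is None:
            # Treat a missing field as a failure rather than a silent pass.
            mismatches.append(f"{key}: missing (status={got!r}, manifest={want!r})")
        elif got != want:
            mismatches.append(f"{key}: status={got!r} != manifest={want!r}")
    return mismatches


if __name__ == "__main__":
    # Hypothetical payloads; in practice, load the saved /status response
    # and the offline bundle manifest from local files.
    status = json.loads('{"version": "1.4.0", "commit": "abc1234", "featureFlags": []}')
    manifest = {"version": "1.4.0", "commit": "abc1234"}
    problems = compare_status_to_manifest(status, manifest)
    print("\n".join(problems) if problems else "status matches manifest")
```

Working from saved files rather than live HTTP calls keeps the check usable in an air-gapped environment, and returning a list of mismatches (instead of a boolean) leaves a record that can be attached as evidence.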