# Notifier Runbook

Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)

## Purpose

Operational steps to deploy, monitor, and recover the Notifications service (WebService + Worker).

## Pre-flight

- Secrets stored in Authority: SMTP creds, Slack/Teams hooks, webhook HMAC keys.
- Outbound allowlist updated for target channels.
- PostgreSQL and Redis reachable; health checks pass.
- Offline kit loaded: channel manifests, default templates, rule seeds.

## Deploy

1. Apply Kubernetes manifests/Compose stack from `ops/notify/` with image digests pinned.
2. Set env:
   - `Notify__Postgres__ConnectionString`
   - `Notify__Redis__ConnectionString`
   - `Notify__Authority__BaseUrl`
   - `Notify__ChannelAllowlist`
   - `ASPNETCORE_URLS=http://0.0.0.0:8080`
3. Optionally warm caches: `POST /api/v1/notify/admin/warm` (loads rules/templates into memory).
4. Verify `GET /api/v1/notify/health` returns `ready=true`.

## Monitor

- Metrics (Prometheus):
  - `notify_delivery_attempts_total` by status/channel/tenant.
  - `notify_escalation_stage_total` by policy/stage.
  - `notify_rule_eval_seconds_bucket` for worker latency.
- Logs: structured JSON with `tenant`, `ruleId`, `deliveryId`, `channel`, `status`.
- Traces: span `notify.delivery` with linkage to originating event `traceparent` when provided.

## Common operations

- **List stuck deliveries**: `GET /api/v1/notify/deliveries?status=failed&from=`.
- **Replay delivery**: `POST /api/v1/notify/deliveries/{id}:replay` (idempotent; only re-renders if inputs unchanged).
- **Pause a tenant**: set tenant state `paused=true` via admin API; worker stops sending but keeps evaluating for audit.
- **Rotate secrets**: update Authority secret, then `POST /api/v1/notify/channels/{id}:refresh-secret`.

## Failure recovery

- Worker crash loop: check Redis connectivity, template compile errors; run `notify-worker --validate-only` using current config.
- PostgreSQL outage: worker backs off with exponential retry; after recovery, replay via `:replay` or digests as needed.
- Channel outage (e.g., Slack 5xx): throttles + retry policy handle transient errors; for extended outages, disable channel or swap to backup policy.

## Auditing

- Delivery ledger retains attempt hashes and signatures; export via `/deliveries?from=...&to=...&format=ndjson` for offline review.
- Ack events stored with actor, timestamp, source IP.

## Determinism safeguards

- Rule snapshots are versioned per tenant; upgrades swap snapshots atomically.
- Template rendering uses deterministic helpers only; no live lookups.
- Time sources are UTC; quiet hours evaluated using tenant timezone from config.

## On-call checklist

- [ ] Health endpoints green.
- [ ] Delivery failure rate < 0.5% over last hour.
- [ ] Escalation backlog empty or within SLO.
- [ ] Redis memory < 75% and PostgreSQL primary healthy.
- [ ] Latest release notes applied and channels validated.
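
The first two checklist items lend themselves to scripting. The following is a minimal Python sketch, not part of the service or its tooling. It assumes the health endpoint responds with a JSON body containing a boolean `ready` field (the runbook only says it returns `ready=true`) and that `notify_delivery_attempts_total` uses `failed` as a status label value, mirroring the `status=failed` query filter. The helper names (`failure_rate`, `failure_rate_ok`, `is_ready`, `fetch_health`) are hypothetical.

```python
"""Sketch of two on-call checks; payload shape and label values are assumptions."""
from __future__ import annotations

import json
import urllib.request

# Checklist threshold: delivery failure rate must stay below 0.5% over the last hour.
FAILURE_RATE_LIMIT = 0.005


def failure_rate(attempts_by_status: dict[str, float]) -> float:
    """Return failed / total attempts for one window; 0.0 when nothing was attempted.

    `attempts_by_status` maps a status label to the increase of
    notify_delivery_attempts_total over the window (for example, the last hour).
    The "failed" key is an assumed label value.
    """
    total = sum(attempts_by_status.values())
    if total == 0:
        return 0.0
    return attempts_by_status.get("failed", 0.0) / total


def failure_rate_ok(
    attempts_by_status: dict[str, float], limit: float = FAILURE_RATE_LIMIT
) -> bool:
    """True when the failure rate is strictly below the limit, as in the checklist."""
    return failure_rate(attempts_by_status) < limit


def is_ready(health_body: str) -> bool:
    """Interpret a health response body, assuming JSON such as {"ready": true}."""
    try:
        payload = json.loads(health_body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("ready") is True


def fetch_health(base_url: str, timeout: float = 5.0) -> bool:
    """Call GET /api/v1/notify/health on the given base URL and report readiness."""
    url = base_url.rstrip("/") + "/api/v1/notify/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return is_ready(resp.read().decode("utf-8"))
```

For example, `failure_rate_ok({"sent": 999, "failed": 1})` passes, while a window with 5 failures out of 1000 attempts sits exactly at the limit and does not. The pure functions are kept separate from `fetch_health` so the threshold logic can be exercised without a running service.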