Notifier Runbook
Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)
Purpose
Operational steps to deploy, monitor, and recover the Notifications service (WebService + Worker).
Pre-flight
- Secrets stored in Authority: SMTP creds, Slack/Teams hooks, webhook HMAC keys.
- Outbound allowlist updated for target channels.
- Mongo and Redis reachable; health checks pass.
- Offline kit loaded: channel manifests, default templates, rule seeds.
Deploy
- Apply Kubernetes manifests/Compose stack from ops/notify/ with image digests pinned.
- Set env:
  - Notify__Mongo__ConnectionString
  - Notify__Redis__ConnectionString
  - Notify__Authority__BaseUrl
  - Notify__ChannelAllowlist
  - ASPNETCORE_URLS=http://0.0.0.0:8080
- Warm caches (optional): POST /api/v1/notify/admin/warm (loads rules/templates into memory).
- Verify GET /api/v1/notify/health returns ready=true.
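The last two deploy steps can be scripted. The sketch below is illustrative only: it assumes the health endpoint answers with a JSON body containing a boolean `ready` field (the runbook only says it "returns ready=true"), and the helper names `missing_settings`, `is_ready`, and `check_health` are hypothetical.

```python
import json
import urllib.request

# Environment settings named in the Deploy section.
REQUIRED_ENV = (
    "Notify__Mongo__ConnectionString",
    "Notify__Redis__ConnectionString",
    "Notify__Authority__BaseUrl",
    "Notify__ChannelAllowlist",
    "ASPNETCORE_URLS",
)


def missing_settings(env):
    """Return the required setting names that are absent or empty in `env`."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


def is_ready(body):
    """Return True only if the payload is a JSON object with ready == true.

    Assumption: the health endpoint returns JSON such as {"ready": true}.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("ready") is True


def check_health(base_url, timeout=5.0):
    """GET /api/v1/notify/health and report whether the service is ready."""
    url = base_url.rstrip("/") + "/api/v1/notify/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status == 200 and is_ready(resp.read())
```

A deploy script might call `missing_settings(os.environ)` before rollout and `check_health("http://localhost:8080")` afterwards; the base URL here is a placeholder.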
Monitor
- Metrics (Prometheus):
  - notify_delivery_attempts_total by status/channel/tenant.
  - notify_escalation_stage_total by policy/stage.
  - notify_rule_eval_seconds_bucket for worker latency.
- Logs: structured JSON with tenant, ruleId, deliveryId, channel, status.
- Traces: span notify.delivery with linkage to originating event traceparent when provided.
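For ad-hoc inspection without a Prometheus server, the delivery counter can be aggregated straight from a scraped metrics page. This is a minimal sketch: it assumes the label names are literally `status`, `channel`, and `tenant` (the runbook lists the dimensions but not the exact label keys), and `totals_by_label` is a hypothetical helper. It handles only simple label values without escaped quotes or braces.

```python
import re
from collections import defaultdict

# Matches one sample line of the delivery counter, with an optional timestamp.
SAMPLE = re.compile(
    r"^notify_delivery_attempts_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+\d+)?$"
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def totals_by_label(exposition, label="status"):
    """Sum notify_delivery_attempts_total samples grouped by one label.

    Assumption: the label key (default "status") matches what the service emits.
    """
    totals = defaultdict(float)
    for line in exposition.splitlines():
        match = SAMPLE.match(line.strip())
        if match is None:
            continue  # comments, blank lines, and other metrics
        labels = dict(LABEL.findall(match.group("labels")))
        totals[labels.get(label, "unknown")] += float(match.group("value"))
    return dict(totals)
```

Because the metric is a counter, a rate needs the difference between two scrapes; a single page only gives totals since process start.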
Common operations
- List stuck deliveries: GET /api/v1/notify/deliveries?status=failed&from=<utc>.
- Replay delivery: POST /api/v1/notify/deliveries/{id}:replay (idempotent; only re-renders if inputs unchanged).
- Pause a tenant: set tenant state paused=true via admin API; worker stops sending but keeps evaluating for audit.
- Rotate secrets: update Authority secret, then POST /api/v1/notify/channels/{id}:refresh-secret.
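The list and replay calls above can be prepared as follows. This is a sketch under assumptions: the `from` parameter is taken to accept an ISO 8601 UTC timestamp with a `Z` suffix (the runbook only says `<utc>`), authentication headers are omitted because they are not specified here, and both function names are hypothetical.

```python
import urllib.request
from datetime import datetime, timezone
from urllib.parse import quote, urlencode


def failed_deliveries_url(base_url, since):
    """Build the URL that lists failed deliveries since a given instant.

    Assumption: `from` takes an ISO 8601 UTC timestamp such as 2025-11-25T10:00:00Z.
    """
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    query = urlencode({"status": "failed", "from": stamp})
    return f"{base_url.rstrip('/')}/api/v1/notify/deliveries?{query}"


def replay_request(base_url, delivery_id):
    """Prepare (but do not send) the POST request that replays one delivery."""
    path = f"/api/v1/notify/deliveries/{quote(delivery_id, safe='')}:replay"
    return urllib.request.Request(base_url.rstrip("/") + path, method="POST")
```

Sending is left to the caller, for example `urllib.request.urlopen(replay_request(base, delivery_id))`, once the required credentials have been added to the request.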
Failure recovery
- Worker crash loop: check Redis connectivity, template compile errors; run notify-worker --validate-only using current config.
- Mongo outage: worker backs off with exponential retry; after recovery, replay via :replay or digests as needed.
- Channel outage (e.g., Slack 5xx): throttles + retry policy handle transient errors; for extended outages, disable channel or swap to backup policy.
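To make "exponential retry" concrete, the sketch below generates a capped exponential backoff schedule with optional jitter. It illustrates the general technique only; the base delay, cap, and jitter values are assumptions and not the worker's actual configuration.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, jitter=0.0, rng=None):
    """Return the wait (in seconds) before each retry attempt.

    The delay doubles on every attempt and is clamped to `cap`. With
    jitter > 0, up to `jitter * delay` random extra seconds are added so
    that many workers do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * 2 ** attempt)
        if jitter > 0:
            delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays
```

With the defaults, `backoff_delays(8)` yields 1, 2, 4, 8, 16, 32 seconds and then stays at the 60-second cap, which helps estimate how long a worker may lag behind after Mongo recovers.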
Auditing
- Delivery ledger retains attempt hashes and signatures; export via /deliveries?from=...&to=...&format=ndjson for offline review.
- Ack events stored with actor, timestamp, source IP.
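An NDJSON export holds one JSON object per line, so it can be reviewed offline with the standard library alone. The sketch below assumes nothing about the ledger schema beyond that; the `status` field used in the usage note is an assumption borrowed from the log fields listed under Monitor, and both helpers are hypothetical.

```python
import json
from collections import Counter


def read_ndjson(text):
    """Parse NDJSON text into a list of records, skipping blank lines."""
    records = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number}: invalid JSON ({exc.msg})") from exc
    return records


def count_by(records, field):
    """Count records by the value of one top-level field."""
    return Counter(str(record.get(field, "<missing>")) for record in records)
```

For example, `count_by(read_ndjson(export_text), "status")` gives a quick tally per delivery status, provided the exported records carry such a field.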
Determinism safeguards
- Rule snapshots are versioned per tenant; upgrades swap snapshots atomically.
- Template rendering uses deterministic helpers only; no live lookups.
- Time sources are UTC; quiet hours evaluated using tenant timezone from config.
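The last safeguard, a UTC clock with quiet hours evaluated in the tenant's timezone, can be sketched as below. This is an illustration of the idea, not the service's implementation; `in_quiet_hours` and its parameters are hypothetical, and a window that wraps past midnight (for example 22:00 to 06:00) is assumed to be allowed.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, tenant_tz: tzinfo, start: time, end: time) -> bool:
    """Return True if `now_utc` falls inside the tenant-local quiet window.

    The window is half-open: [start, end). If start > end the window wraps
    past midnight. If start == end the window is empty.
    """
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(tenant_tz).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end
```

Passing the clock value in as an argument keeps the function deterministic and easy to test. In practice `tenant_tz` would typically be a `zoneinfo.ZoneInfo` built from the tenant's configured zone name, so that daylight saving changes are handled by the timezone database.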
On-call checklist
- Health endpoints green.
- Delivery failure rate < 0.5% over last hour.
- Escalation backlog empty or within SLO.
- Redis memory < 75% and Mongo primary healthy.
- Latest release notes applied and channels validated.
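The numeric items in this checklist lend themselves to an automated pre-check. The sketch below encodes the thresholds stated above (failure rate under 0.5%, Redis memory under 75%); the `Snapshot` fields and the way their values are collected are assumptions, and the escalation backlog and release-note items are left to manual review.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical container for values gathered by the on-call engineer."""
    health_ready: bool
    attempts_last_hour: int
    failures_last_hour: int
    redis_used_bytes: int
    redis_max_bytes: int
    mongo_primary_healthy: bool


def failing_checks(s):
    """Return a human-readable list of checklist items that are not met."""
    problems = []
    if not s.health_ready:
        problems.append("health endpoint not ready")
    rate = s.failures_last_hour / s.attempts_last_hour if s.attempts_last_hour else 0.0
    if rate >= 0.005:
        problems.append(f"delivery failure rate {rate:.2%} is not below 0.50%")
    if s.redis_max_bytes > 0:
        usage = s.redis_used_bytes / s.redis_max_bytes
        if usage >= 0.75:
            problems.append(f"Redis memory at {usage:.0%} is not below 75%")
    if not s.mongo_primary_healthy:
        problems.append("Mongo primary unhealthy")
    return problems
```

An empty list means the automated items pass.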