# Notifier Runbook

Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)

## Purpose
Operational steps to deploy, monitor, and recover the Notifications service (WebService + Worker).

## Pre-flight
- Secrets stored in Authority: SMTP creds, Slack/Teams hooks, webhook HMAC keys.
- Outbound allowlist updated for target channels.
- Mongo and Redis reachable; health checks pass.
- Offline kit loaded: channel manifests, default templates, rule seeds.
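
The reachability item above can be spot-checked with a plain TCP connect. This is a minimal sketch, not part of the Notifier tooling: `tcp_reachable` and `preflight` are hypothetical helper names, and the hosts and ports you pass in are your own.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def preflight(targets: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Map each named dependency to its TCP reachability result."""
    return {name: tcp_reachable(host, port) for name, (host, port) in targets.items()}
```

A call such as `preflight({"mongo": (mongo_host, 27017), "redis": (redis_host, 6379)})` returns one boolean per dependency. An open port only shows that something is listening, so this complements the service health checks rather than replacing them.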

## Deploy
1. Apply the Kubernetes manifests or Compose stack from `ops/notify/` with image digests pinned.
2. Set environment variables:
   - `Notify__Mongo__ConnectionString`
   - `Notify__Redis__ConnectionString`
   - `Notify__Authority__BaseUrl`
   - `Notify__ChannelAllowlist`
   - `ASPNETCORE_URLS=http://0.0.0.0:8080`
3. Optional: warm caches with `POST /api/v1/notify/admin/warm` (loads rules/templates into memory).
4. Verify `GET /api/v1/notify/health` returns `ready=true`.
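
Steps 2 and 4 lend themselves to a scripted check. The sketch below is hedged: the variable names come from the list above, but the helper names are hypothetical and the health response is assumed to be a JSON object of the form `{"ready": true}`, which this runbook does not confirm.

```python
import json

# Variable names taken from the Deploy list above.
REQUIRED_SETTINGS = (
    "Notify__Mongo__ConnectionString",
    "Notify__Redis__ConnectionString",
    "Notify__Authority__BaseUrl",
    "Notify__ChannelAllowlist",
    "ASPNETCORE_URLS",
)


def missing_settings(env):
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


def is_ready(body: str) -> bool:
    """Interpret a health response body.

    Assumption: the endpoint returns a JSON object with a boolean "ready" field.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("ready") is True
```

`missing_settings(os.environ)` can run in the container before the service starts, and `is_ready` can be fed the body of the health request from whatever HTTP client the deployment pipeline already uses.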

## Monitor
- Metrics (Prometheus):
  - `notify_delivery_attempts_total` by status/channel/tenant.
  - `notify_rule_eval_seconds_bucket` for worker latency.
  - `notify_escalation_stage_total` by policy/stage.
- Logs: structured JSON with `tenant`, `ruleId`, `deliveryId`, `channel`, `status`.
- Traces: span `notify.delivery` with linkage to originating event `traceparent` when provided.
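
To make the latency histogram concrete, here is a sketch of how a quantile can be estimated from cumulative `_bucket` samples using linear interpolation inside the matching bucket, the same general approach as PromQL's `histogram_quantile`. The function name and the bucket boundaries in the example are illustrative assumptions, not the worker's real configuration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last bound equal to math.inf (the "+Inf" bucket).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("buckets must not be empty")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Illustrative bucket data: bounds in seconds, counts cumulative.
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(estimate_quantile(0.95, sample))
```

Raw bucket counters accumulate from process start, so a dashboard would normally apply this to the increase over a time window rather than to the raw totals.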

## Common operations
- **List stuck deliveries**: `GET /api/v1/notify/deliveries?status=failed&from=<utc>`.
- **Replay delivery**: `POST /api/v1/notify/deliveries/{id}:replay` (idempotent; only re-renders if inputs unchanged).
- **Pause a tenant**: set tenant state `paused=true` via admin API; worker stops sending but keeps evaluating for audit.
- **Rotate secrets**: update Authority secret, then `POST /api/v1/notify/channels/{id}:refresh-secret`.
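
When scripting the list and replay operations, building the URLs with proper escaping avoids subtle mistakes. The paths below are the ones listed above. The helper names are hypothetical, and the `from` value is assumed to accept an ISO 8601 UTC timestamp, since the runbook only says `<utc>`.

```python
from datetime import datetime, timezone
from urllib.parse import quote, urlencode


def failed_deliveries_url(base_url: str, since: datetime) -> str:
    """Build the URL that lists failed deliveries from a given instant."""
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    # Assumption: the API accepts ISO 8601 in UTC with a trailing Z.
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    query = urlencode({"status": "failed", "from": stamp})
    return f"{base_url.rstrip('/')}/api/v1/notify/deliveries?{query}"


def replay_url(base_url: str, delivery_id: str) -> str:
    """Build the replay URL for one delivery; send it as a POST."""
    escaped = quote(delivery_id, safe="")
    return f"{base_url.rstrip('/')}/api/v1/notify/deliveries/{escaped}:replay"
```

Requiring a timezone-aware `since` keeps a local-time value from being sent where UTC is expected.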

## Failure recovery
- Worker crash loop: check Redis connectivity and template compile errors, then run `notify-worker --validate-only` against the current config.
- Mongo outage: the worker backs off with exponential retry; after recovery, replay affected deliveries via `:replay` or digests as needed.
- Channel outage (e.g., Slack 5xx): throttling and the retry policy handle transient errors; for extended outages, disable the channel or switch to a backup policy.
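
For readers unfamiliar with exponential retry, the sketch below shows the general shape of a capped backoff schedule. It is a generic illustration: the base, factor, cap, and jitter values are assumptions and do not describe the worker's actual retry policy.

```python
import random


def backoff_delays(attempts, base=1.0, factor=2.0, cap=60.0, jitter=0.0, rng=None):
    """Return the delay in seconds before each retry attempt.

    Delays grow as base * factor**attempt and are capped at cap. With jitter
    set to a fraction such as 0.1, each delay is scaled by a random factor in
    [1 - jitter, 1 + jitter], so a jittered delay can slightly exceed cap.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * factor ** attempt)
        if jitter:
            delay *= 1 + rng.uniform(-jitter, jitter)
        delays.append(delay)
    return delays


print(backoff_delays(8))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The cap bounds how long the worker waits once the dependency has recovered, and jitter keeps several workers from retrying at the same instant.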

## Auditing
- Delivery ledger retains attempt hashes and signatures; export via `/deliveries?from=...&to=...&format=ndjson` for offline review.
- Ack events stored with actor, timestamp, source IP.
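
An NDJSON export can be reviewed offline one line at a time. The sketch below tallies records by status and reports lines that fail to parse. The ledger schema is not documented here, so the `status` field is an assumption borrowed from the log fields listed under Monitor, and `summarize_ndjson` is a hypothetical helper.

```python
import json
from collections import Counter


def summarize_ndjson(lines):
    """Count records per status and collect 1-based numbers of unusable lines.

    Assumption: each line is a JSON object carrying a "status" field.
    """
    by_status = Counter()
    bad_lines = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            record = json.loads(line)
            by_status[record["status"]] += 1
        except (json.JSONDecodeError, KeyError, TypeError):
            bad_lines.append(number)
    return by_status, bad_lines
```

Passing an open file object works directly, for example `summarize_ndjson(open(path, encoding="utf-8"))`, because iterating a file yields one line at a time and never loads the whole export into memory.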

## Determinism safeguards
- Rule snapshots are versioned per tenant; upgrades swap snapshots atomically.
- Template rendering uses deterministic helpers only; no live lookups.
- Time sources are UTC; quiet hours evaluated using tenant timezone from config.
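
The quiet-hours rule above combines a UTC clock with a per-tenant timezone. A minimal sketch of that evaluation follows, assuming a quiet window defined by local start and end times that may wrap past midnight. The function name and the window shape are assumptions, not the service's actual configuration model.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, tenant_tz: tzinfo, start: time, end: time) -> bool:
    """Return True if now_utc falls inside the tenant's local quiet window.

    The window is [start, end) in tenant local time. If start is later than
    end, the window wraps past midnight. Equal start and end means no window.
    """
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(tenant_tz).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end


# Example with a fixed UTC-5 offset standing in for a tenant timezone.
utc_minus_5 = timezone(timedelta(hours=-5))
late_evening = datetime(2025, 11, 25, 3, 30, tzinfo=timezone.utc)  # 22:30 local
print(in_quiet_hours(late_evening, utc_minus_5, time(22, 0), time(7, 0)))  # True
```

In practice a named zone such as `zoneinfo.ZoneInfo("America/New_York")` would be used instead of a fixed offset so that daylight saving transitions are handled. Taking the current instant as an argument, rather than reading the clock inside the function, keeps the evaluation reproducible.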

## On-call checklist
- [ ] Health endpoints green.
- [ ] Delivery failure rate < 0.5% over last hour.
- [ ] Escalation backlog empty or within SLO.
- [ ] Redis memory < 75% and Mongo primary healthy.
- [ ] Latest release notes applied and channels validated.
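
The failure-rate item can be computed from the increase in `notify_delivery_attempts_total` over the last hour, grouped by status. The sketch below assumes the per-status increases have already been fetched and that failed attempts carry the status value `failed`. Both the helper names and that label value are assumptions.

```python
def failure_rate(attempts_by_status, failed_statuses=("failed",)):
    """Return failed attempts as a fraction of all attempts (0.0 if there were none)."""
    total = sum(attempts_by_status.values())
    if total == 0:
        return 0.0
    failed = sum(attempts_by_status.get(status, 0) for status in failed_statuses)
    return failed / total


def within_threshold(attempts_by_status, max_rate=0.005):
    """Check the on-call criterion: failure rate strictly below 0.5%."""
    return failure_rate(attempts_by_status) < max_rate


last_hour = {"sent": 997, "failed": 3}  # illustrative counts, not real data
print(within_threshold(last_hour))  # True
```

An hour with no attempts is treated as passing, which suits a quiet tenant but can hide a stalled worker, so it is worth reading alongside the health and backlog items.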