Notifier Runbook

Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)

Purpose

Operational steps to deploy, monitor, and recover the Notifications service (WebService + Worker).

Pre-flight

  • Secrets stored in Authority: SMTP creds, Slack/Teams hooks, webhook HMAC keys.
  • Outbound allowlist updated for target channels.
  • PostgreSQL and Redis reachable; health checks pass.
  • Offline kit loaded: channel manifests, default templates, rule seeds.

Deploy

  1. Apply the Kubernetes manifests or Compose stack from ops/notify/ with image digests pinned.
  2. Set env:
    • Notify__Postgres__ConnectionString
    • Notify__Redis__ConnectionString
    • Notify__Authority__BaseUrl
    • Notify__ChannelAllowlist
    • ASPNETCORE_URLS=http://0.0.0.0:8080
  3. Optional: warm caches with POST /api/v1/notify/admin/warm (loads rules and templates into memory).
  4. Verify GET /api/v1/notify/health returns ready=true.
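
The deploy checks above can be scripted. The following is a minimal sketch, assuming the health endpoint returns a JSON body with a boolean `ready` field (the runbook only states `ready=true`, so the exact shape is a guess) and that the service is reachable at a caller-supplied `base_url`. The helper names `missing_settings`, `is_ready`, and `check_health` are illustrative and not part of the Notifier.

```python
import json
import os
import urllib.request

# Settings named in the Deploy step; their values are environment-specific.
REQUIRED_SETTINGS = [
    "Notify__Postgres__ConnectionString",
    "Notify__Redis__ConnectionString",
    "Notify__Authority__BaseUrl",
    "Notify__ChannelAllowlist",
    "ASPNETCORE_URLS",
]


def missing_settings(env):
    """Return the required settings that are absent or empty in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


def is_ready(body):
    """Interpret a health response body; assumes JSON like {"ready": true}."""
    try:
        return json.loads(body).get("ready") is True
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: treat as not ready.
        return False


def check_health(base_url, timeout=5.0):
    """GET /api/v1/notify/health and report readiness (base_url is hypothetical)."""
    url = base_url.rstrip("/") + "/api/v1/notify/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return is_ready(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print("missing settings:", missing_settings(os.environ))
```

Running the script inside the service container lists any unset variables; `check_health` can then be called with the in-cluster address once the pods are up.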

Monitor

  • Metrics (Prometheus):
    • notify_delivery_attempts_total by status/channel/tenant.
    • notify_escalation_stage_total by policy/stage.
    • notify_rule_eval_seconds_bucket for worker latency.
  • Logs: structured JSON with tenant, ruleId, deliveryId, channel, status.
  • Traces: span notify.delivery with linkage to originating event traceparent when provided.
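
Because the logs are structured JSON, a quick triage pass can be done offline. The sketch below assumes one JSON object per line and that failed attempts carry `status` equal to `"failed"` (the value used by the deliveries API filter; the log value itself is an assumption). `failures_by_channel` is a hypothetical helper name.

```python
import json
from collections import Counter


def failures_by_channel(log_lines, failed_status="failed"):
    """Count log entries per channel whose status equals failed_status.

    Blank lines and lines that are not JSON objects are skipped
    rather than raising, since log streams often contain stray output.
    """
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except ValueError:
            continue
        if isinstance(entry, dict) and entry.get("status") == failed_status:
            counts[entry.get("channel", "unknown")] += 1
    return counts
```

Feeding it an open log file (for example `failures_by_channel(open(path))`) gives a per-channel tally that can be compared against `notify_delivery_attempts_total`.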

Common operations

  • List stuck deliveries: GET /api/v1/notify/deliveries?status=failed&from=<utc>.
  • Replay delivery: POST /api/v1/notify/deliveries/{id}:replay (idempotent; re-renders only if inputs are unchanged).
  • Pause a tenant: set the tenant state to paused=true via the admin API; the worker stops sending but keeps evaluating rules for audit.
  • Rotate secrets: update Authority secret, then POST /api/v1/notify/channels/{id}:refresh-secret.
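
The endpoint paths above can be built programmatically to avoid quoting mistakes. This is a sketch only: it assumes the `from` parameter accepts an ISO 8601 UTC timestamp such as `2025-11-25T00:00:00Z` (the runbook only says `<utc>`), and the function names are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import quote, urlencode

API_PREFIX = "/api/v1/notify"


def failed_deliveries_path(since):
    """Path for listing failed deliveries from a UTC instant onward."""
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    # Assumption: the API accepts ISO 8601 UTC timestamps with a Z suffix.
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    query = urlencode({"status": "failed", "from": stamp})
    return f"{API_PREFIX}/deliveries?{query}"


def replay_path(delivery_id):
    """Path to POST to when replaying a single delivery."""
    return f"{API_PREFIX}/deliveries/{quote(delivery_id, safe='')}:replay"


def refresh_secret_path(channel_id):
    """Path to POST to after rotating a channel secret in Authority."""
    return f"{API_PREFIX}/channels/{quote(channel_id, safe='')}:refresh-secret"
```

Percent-encoding the identifier keeps an ID containing `/` or spaces from changing the route.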

Failure recovery

  • Worker crash loop: check Redis connectivity and look for template compile errors, then run notify-worker --validate-only against the current config.
  • PostgreSQL outage: the worker backs off with exponential retry; after recovery, replay missed deliveries via :replay or digests as needed.
  • Channel outage (e.g., Slack 5xx): throttling and the retry policy handle transient errors; for extended outages, disable the channel or swap to a backup policy.
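
To reason about how long the worker may stay quiet during an outage, it helps to see what an exponential retry schedule looks like. The sketch below is a generic capped exponential backoff with optional jitter; the worker's actual base delay, factor, cap, and jitter are not stated in this runbook, so every parameter value here is a placeholder.

```python
import random


def backoff_delays(attempts, base=1.0, factor=2.0, cap=60.0, jitter=0.0, rng=None):
    """Return the wait in seconds before each retry attempt.

    Delay n is min(cap, base * factor**n), reduced by up to `jitter`
    (a fraction between 0 and 1) of its value so that many workers do
    not retry in lockstep. All defaults are illustrative placeholders.
    """
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        delay = min(cap, base * factor ** n)
        if jitter:
            delay -= delay * jitter * rng.random()
        delays.append(delay)
    return delays
```

With the placeholder defaults, `backoff_delays(8)` grows from 1 second and plateaus at the 60 second cap, which is the behaviour to expect in the worker logs while PostgreSQL or a channel is down.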

Auditing

  • Delivery ledger retains attempt hashes and signatures; export via /deliveries?from=...&to=...&format=ndjson for offline review.
  • Ack events stored with actor, timestamp, source IP.

Determinism safeguards

  • Rule snapshots are versioned per tenant; upgrades swap snapshots atomically.
  • Template rendering uses deterministic helpers only; no live lookups.
  • Time sources are UTC; quiet hours are evaluated using the tenant timezone from config.
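
The quiet-hours rule can be illustrated as follows. This is a sketch of the general technique (convert the UTC instant to the tenant's local time, then compare against a wall-clock window), not the service's implementation; `in_quiet_hours` and its parameters are hypothetical, and the fixed UTC+2 offset stands in for a real tenant timezone.

```python
from datetime import datetime, time, timedelta, timezone


def in_quiet_hours(now_utc, tenant_tz, start, end):
    """Return True if now_utc falls inside the tenant's quiet-hours window.

    start and end are local wall-clock times. A window with start > end
    (for example 22:00 to 07:00) wraps past midnight. The end is exclusive.
    """
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(tenant_tz).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end


if __name__ == "__main__":
    tz = timezone(timedelta(hours=2))  # stand-in for a tenant timezone at UTC+2
    now = datetime(2025, 11, 25, 21, 30, tzinfo=timezone.utc)  # 23:30 local
    print(in_quiet_hours(now, tz, time(22, 0), time(7, 0)))  # prints True
```

In practice `tenant_tz` would more likely be a `zoneinfo.ZoneInfo` built from the tenant's configured timezone name, so that daylight-saving transitions are handled by the timezone database.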

On-call checklist

  • Health endpoints green.
  • Delivery failure rate < 0.5% over last hour.
  • Escalation backlog empty or within SLO.
  • Redis memory < 75% and PostgreSQL primary healthy.
  • Latest release notes applied and channels validated.
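
The two numeric thresholds in the checklist can be evaluated mechanically. The sketch below assumes the caller has already obtained last-hour attempt counts per status (for example from the increase of `notify_delivery_attempts_total`) and Redis memory figures; the status label value `"failed"` and the helper names are assumptions.

```python
def failure_rate(attempts_by_status, failed_status="failed"):
    """Fraction of attempts that failed; 0.0 when there were no attempts."""
    total = sum(attempts_by_status.values())
    if total == 0:
        return 0.0
    return attempts_by_status.get(failed_status, 0) / total


def checklist_problems(attempts_by_status, redis_used_bytes, redis_max_bytes,
                       max_failure_rate=0.005, max_redis_fraction=0.75):
    """Return human-readable problems for the numeric checklist items.

    Thresholds default to the runbook values: failure rate below 0.5%
    and Redis memory below 75%. An empty list means both checks pass.
    """
    if redis_max_bytes <= 0:
        raise ValueError("redis_max_bytes must be positive")
    problems = []
    rate = failure_rate(attempts_by_status)
    if rate >= max_failure_rate:
        problems.append(
            f"delivery failure rate {rate:.2%} is not below {max_failure_rate:.2%}"
        )
    fraction = redis_used_bytes / redis_max_bytes
    if fraction >= max_redis_fraction:
        problems.append(
            f"Redis memory at {fraction:.0%} is not below {max_redis_fraction:.0%}"
        )
    return problems
```

The remaining checklist items (health endpoints, escalation backlog, release notes) still need a manual or endpoint-based check.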