2025-10-28 15:10:40 +02:00


Notifications Architecture

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

This dossier distils the Notify architecture into implementation-ready guidance for service owners, SREs, and integrators. It complements the high-level overview by detailing process boundaries, persistence models, and extensibility points.


1. Runtime shape

          ┌──────────────────┐
          │ Authority (OpTok)│
          └───────┬──────────┘
                  │
          ┌───────▼──────────┐        ┌───────────────┐
          │ Notify.WebService│◀──────▶│ MongoDB       │
Tenant API│  REST + gRPC WIP │        │ rules/channels│
          └───────▲──────────┘        │ deliveries    │
                  │                   │ digests       │
   Internal bus   │                   └───────────────┘
 (NATS/Redis/etc) │
                  │
        ┌─────────▼─────────┐      ┌───────────────┐
        │ Notify.Worker     │◀────▶│ Redis / Cache │
        │ rule eval + render│      │ throttles/locks│
        └─────────▲─────────┘      └───────▲───────┘
                  │                        │
                  │                        │
           ┌──────┴──────┐       ┌─────────┴────────┐
           │ Connectors  │──────▶│ Slack/Teams/...  │
           │ (plug-ins)  │       │ External targets │
           └─────────────┘       └──────────────────┘
  • WebService hosts REST endpoints (/channels, /rules, /templates, /deliveries, /digests, /stats) and handles schema normalisation, validation, and Authority enforcement.
  • Worker subscribes to the platform event bus, evaluates rules per tenant, applies throttles/digests, renders payloads, writes ledger entries, and invokes connectors.
  • Plug-ins live under plugins/notify/ and are loaded deterministically at service start (orderedPlugins list). Each implements connector contracts and optional health/test-preview providers.
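The deterministic plug-in load order described above can be illustrated with a small sketch. The function name `order_plugins` and the handling of plug-ins that are discovered but not listed (appended alphabetically) are assumptions made for illustration; the actual loader may reject or ignore unlisted plug-ins.

```python
from typing import Iterable, List


def order_plugins(discovered: Iterable[str], ordered_plugins: List[str]) -> List[str]:
    """Return plug-in names in a deterministic load order.

    Names in ordered_plugins come first, in the configured order.
    Discovered plug-ins that are not listed are appended alphabetically
    (an assumption; the real loader may treat them differently).
    """
    available = set(discovered)
    # dict.fromkeys drops duplicate entries while keeping the configured order.
    listed = [name for name in dict.fromkeys(ordered_plugins) if name in available]
    listed_set = set(listed)
    unlisted = sorted(name for name in available if name not in listed_set)
    return listed + unlisted


if __name__ == "__main__":
    # A directory scan may return names in arbitrary order.
    found = ["teams", "email", "slack", "webhook"]
    print(order_plugins(found, ["slack", "teams"]))
```

Because the result depends only on the configured list and the set of discovered names, two environments with the same plug-in tree and the same orderedPlugins value load connectors in the same sequence.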

Both services share options via notify.yaml (see etc/notify.yaml.sample). For dev/test scenarios an in-memory repository exists, but production requires Mongo plus Redis or NATS for durability and coordination.


2. Event ingestion and rule evaluation

  1. Subscription. Workers attach to the internal bus (Redis Streams or NATS JetStream). Each partition key is tenantId|scope.digest|event.kind to preserve order for a given artefact.
  2. Normalisation. Incoming events are hydrated into NotifyEvent envelopes. Payload JSON is normalised (sorted object keys) to preserve determinism and enable hashing.
  3. Rule snapshot. Per-tenant rule sets are cached in memory. Change streams from Mongo trigger snapshot refreshes without restart.
  4. Match pipeline.
    • Tenant check (rule.tenantId vs. event tenant).
    • Kind/namespace/repository/digest filters.
    • Severity and KEV gating based on event deltas.
    • VEX gating using NotifyRuleMatchVex.
    • Action iteration with throttle/digest decisions.
  5. Idempotency. Each action computes hash(ruleId|actionId|event.kind|scope.digest|delta.hash|dayBucket). If that key is already present within the throttle TTL, the delivery is recorded with status=Throttled and processing of the action stops.
  6. Dispatch. If the action's digest setting is instant, the renderer processes the action immediately. Otherwise the event is appended to the digest window for a later flush.
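The normalisation and idempotency steps can be sketched as follows. This is a minimal illustration, not the service's implementation: the choice of SHA-256 for the idempotency hash, the `InMemoryThrottle` class, and all function names are assumptions. Only the field order of the key and the `idem:` prefix come from this document.

```python
import hashlib
import json
import time
from typing import Any, Dict, Optional


def canonical_json(payload: Any) -> str:
    """Serialise with sorted object keys and no insignificant whitespace."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def delta_hash(payload: Any) -> str:
    """Hash of the canonical form, so key order in the input does not matter."""
    return hashlib.sha256(canonical_json(payload).encode("utf-8")).hexdigest()


def idempotency_key(rule_id: str, action_id: str, event_kind: str,
                    scope_digest: str, delta: str, day_bucket: str) -> str:
    """Join the documented fields with '|' and hash them (SHA-256 is assumed)."""
    material = "|".join([rule_id, action_id, event_kind, scope_digest, delta, day_bucket])
    return "idem:" + hashlib.sha256(material.encode("utf-8")).hexdigest()


class InMemoryThrottle:
    """Hypothetical stand-in for the Redis or Mongo throttle store."""

    def __init__(self) -> None:
        self._expiry: Dict[str, float] = {}

    def try_acquire(self, key: str, ttl_seconds: float, now: Optional[float] = None) -> bool:
        """Return True on first sight of key; False while its TTL is still running."""
        now = time.monotonic() if now is None else now
        expires = self._expiry.get(key)
        if expires is not None and expires > now:
            return False  # caller would record status=Throttled and stop
        self._expiry[key] = now + ttl_seconds
        return True
```

A worker would call `try_acquire` once per action; a `False` result corresponds to the Throttled outcome in step 5. In a shared store the check and the write need to be a single atomic operation, which this single-process sketch does not have to address.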

Failures during evaluation are logged with correlation IDs and surfaced through /stats and worker metrics (notify_rule_eval_failures_total, notify_digest_flush_errors_total).
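The match pipeline and the instant-versus-digest dispatch decision from the steps above can be sketched as an ordered list of predicates that stops at the first failure. The `Event`, `Rule`, and `Action` shapes, the numeric severity scale, and the window name `"1h"` are hypothetical; KEV and VEX gating and throttling are left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Event:
    tenant_id: str
    kind: str
    namespace: str
    severity: int  # hypothetical numeric scale, higher is worse


@dataclass
class Action:
    action_id: str
    digest: str = "instant"  # "instant" or a window name such as "1h" (assumed values)


@dataclass
class Rule:
    tenant_id: str
    kinds: List[str] = field(default_factory=list)       # empty list matches any kind
    namespaces: List[str] = field(default_factory=list)  # empty list matches any namespace
    min_severity: int = 0
    actions: List[Action] = field(default_factory=list)


# Ordered predicates; evaluation stops at the first one that fails.
CHECKS: List[Tuple[str, Callable[[Rule, Event], bool]]] = [
    ("tenant", lambda r, e: r.tenant_id == e.tenant_id),
    ("kind", lambda r, e: not r.kinds or e.kind in r.kinds),
    ("namespace", lambda r, e: not r.namespaces or e.namespace in r.namespaces),
    ("severity", lambda r, e: e.severity >= r.min_severity),
]


def first_failed_check(rule: Rule, event: Event) -> Optional[str]:
    """Return the name of the first failing check, or None if the rule matches."""
    for name, check in CHECKS:
        if not check(rule, event):
            return name
    return None


def dispatch(rule: Rule, event: Event, render_now: List, digest_windows: Dict[str, List]) -> None:
    """Queue instant actions for rendering; append the rest to a digest window."""
    if first_failed_check(rule, event) is not None:
        return
    for action in rule.actions:
        if action.digest == "instant":
            render_now.append((action.action_id, event))
        else:
            window_id = f"{rule.tenant_id}:{action.action_id}:{action.digest}"
            digest_windows.setdefault(window_id, []).append(event)
```

Returning the name of the failing check, rather than a bare boolean, makes it straightforward to label evaluation metrics or log entries with the stage that rejected an event.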


3. Rendering & connectors

  • Template resolution. The renderer picks the template in this order: action template → channel default template → locale fallback → built-in minimal template. Locale negotiation normalises locale tags to lowercase (en-US becomes en-us).
  • Helpers & partials. Exposed helpers mirror the list in notifications/templates.md. Plug-ins may register additional helpers but must remain deterministic and side-effect free.
  • Rendering output. NotifyDeliveryRendered captures:
    • channelType, format, locale
    • title, body, optional summary, textBody
    • target (redacted where necessary)
    • attachments[] (safe URLs or references)
    • bodyHash (lowercase SHA-256) for audit parity
  • Connector contract. Connectors implement INotifyConnector (send + health) and can implement INotifyChannelTestProvider for /channels/{id}/test. All plug-ins are single-tenant aware; secrets are pulled via references at send time and never persisted in Mongo.
  • Retries. Workers track attempts and retry with exponential backoff plus jitter. On permanent failure, deliveries are marked Failed with a statusReason; optional DLQ fan-out is slated for Sprint 40.
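Three mechanics from this list lend themselves to a short sketch: the template fallback chain, the lowercase SHA-256 bodyHash, and retry delays. All function names, the `BUILT_IN_MINIMAL` placeholder, the interpretation of "locale fallback" as a lookup by lowercased locale tag, and the backoff parameters (base, cap, full jitter) are assumptions for illustration.

```python
import hashlib
import random
from typing import Dict, Optional

BUILT_IN_MINIMAL = "{{title}}"  # hypothetical stand-in for the built-in minimal template


def resolve_template(action_template: Optional[str],
                     channel_default: Optional[str],
                     locale_templates: Dict[str, str],
                     locale: str) -> str:
    """Walk the documented order: action, channel default, locale fallback, built-in."""
    if action_template:
        return action_template
    if channel_default:
        return channel_default
    tag = locale.lower()  # en-US becomes en-us
    if tag in locale_templates:
        return locale_templates[tag]
    return BUILT_IN_MINIMAL


def body_hash(body: str) -> str:
    """SHA-256 of the UTF-8 body; hexdigest() already returns lowercase hex."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Exponential backoff with full jitter; attempt is 1-based."""
    source = rng or random
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return source.uniform(0.0, ceiling)
```

Full jitter (a uniform draw between zero and the exponential ceiling) is one common variant; it spreads retries from many workers so they do not hit an external target at the same moment.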

4. Persistence model

| Collection | Purpose | Key fields & indexes |
|------------|---------|----------------------|
| rules | Tenant rule definitions. | _id, tenantId, enabled; index on {tenantId, enabled}. |
| channels | Channel metadata + config references. | _id, tenantId, type; index on {tenantId, type}. |
| templates | Locale-specific render bodies. | _id, tenantId, channelType, key; index on {tenantId, channelType, key}. |
| deliveries | Ledger of rendered notifications. | _id, tenantId, sentAt; compound index on {tenantId, sentAt:-1} for history queries. |
| digests | Open digest windows per action. | _id (tenantId:actionKey:window), status; index on {tenantId, actionKey}. |
| throttles | Short-lived throttle tokens (Mongo or Redis). | Key format idem:<hash> with TTL aligned to throttle duration. |

Documents are stored using the canonical JSON serializer (NotifyCanonicalJsonSerializer) to preserve property ordering and casing. Schema migration helpers upgrade stored documents when new versions ship.
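One way to picture the schema migration helpers is a chain of single-step upgraders keyed by version. The sketch below is hypothetical throughout: the `schemaVersion` field name, the version numbers, and the `_v1_to_v2` change are invented for illustration and are not taken from the Notify codebase.

```python
from typing import Any, Callable, Dict

Document = Dict[str, Any]

CURRENT_VERSION = 2  # hypothetical latest schema version


def _v1_to_v2(doc: Document) -> Document:
    """Example step: give older documents an explicit 'enabled' flag."""
    upgraded = dict(doc)
    upgraded.setdefault("enabled", True)
    upgraded["schemaVersion"] = 2
    return upgraded


# Maps a stored version to the function that lifts it one version higher.
UPGRADERS: Dict[int, Callable[[Document], Document]] = {1: _v1_to_v2}


def upgrade(doc: Document) -> Document:
    """Apply upgraders in sequence until the document is at CURRENT_VERSION."""
    current = dict(doc)
    version = current.get("schemaVersion", 1)  # documents without a version are treated as v1
    while version < CURRENT_VERSION:
        step = UPGRADERS.get(version)
        if step is None:
            raise ValueError(f"no upgrader registered for schema version {version}")
        current = step(current)
        version = current["schemaVersion"]
    return current
```

Keeping each step small and version-to-version means a document several versions behind is upgraded by composing existing steps, and a document already at the current version passes through unchanged.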


5. Deployment & configuration

  • Configuration sources. YAML files feed typed options (NotifyMongoOptions, NotifyWorkerOptions, etc.). Environment variables can override connection strings and rate limits for production.
  • Authority integration. Two OAuth clients (notify-web, notify-web-dev) with scopes notify.read and notify.admin are required. Authority enforcement can be disabled for air-gapped dev use by providing developmentSigningKey.
  • Plug-in management. plugins.baseDirectory and orderedPlugins guarantee deterministic loading. Offline Kits copy the plug-in tree verbatim; operations must keep the order aligned across environments.
  • Observability. Workers expose structured logs (ruleId, actionId, eventId, throttleKey). Metrics include:
    • notify_rule_matches_total{tenant,eventKind}
    • notify_delivery_attempts_total{channelType,status}
    • notify_digest_open_windows{window}
    • Optional OpenTelemetry traces for rule evaluation and connector round-trips.
  • Scaling levers. Increase worker replicas to cope with bus throughput; adjust worker.prefetchCount for Redis Streams or ackWait for NATS JetStream. WebService remains stateless and scales horizontally behind the gateway.
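The layering of environment variables over YAML-derived options can be sketched as a merge step. The `NOTIFY__` prefix and the double-underscore path separator are assumptions (the separator mirrors a common .NET configuration convention); the variable names the service actually reads are not stated in this document. Values stay as strings because type coercion is out of scope here.

```python
import copy
from typing import Any, Dict, Mapping


def _match_key(node: Dict[str, Any], part: str) -> str:
    """Reuse an existing key that differs only by case, else keep the part as given."""
    for existing in node:
        if existing.lower() == part.lower():
            return existing
    return part


def apply_env_overrides(options: Dict[str, Any],
                        environ: Mapping[str, str],
                        prefix: str = "NOTIFY__") -> Dict[str, Any]:
    """Return a copy of options with PREFIX__a__b=value applied as options[a][b]."""
    merged = copy.deepcopy(options)
    for name in sorted(environ):  # sorted for a deterministic application order
        if not name.upper().startswith(prefix):
            continue
        parts = [p for p in name[len(prefix):].split("__") if p]
        if not parts:
            continue
        node = merged
        for part in parts[:-1]:
            key = _match_key(node, part)
            if not isinstance(node.get(key), dict):
                node[key] = {}
            node = node[key]
        node[_match_key(node, parts[-1])] = environ[name]
    return merged


if __name__ == "__main__":
    yaml_options = {"worker": {"prefetchCount": 64}}
    env = {"NOTIFY__WORKER__PREFETCHCOUNT": "128", "PATH": "/usr/bin"}
    print(apply_env_overrides(yaml_options, env))
```

The input options are deep-copied, so the YAML defaults remain available for comparison or for logging which values were overridden.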

6. Roadmap alignment

| Backlog | Architectural note |
|---------|--------------------|
| NOTIFY-SVC-38-001 | Standardise event envelope publication (idempotency keys); ensure bus bindings use the documented key format. |
| NOTIFY-SVC-38-002..004 | Introduce simulation endpoints and throttle dashboards; expect additional /internal/notify/simulate routes and metrics; update once merged. |
| NOTIFY-SVC-39-001..004 | Correlation engine, digests generator, simulation API, quiet hours; anticipate new Mongo documents (quietHours, correlation caches) and connector metadata (quiet mode hints). Review this guide when implementations land. |

Action: schedule a documentation sync with the Notifications Service Guild immediately after NOTIFY-SVC-39-001..004 merge to confirm schema adjustments (e.g., correlation edge storage, quiet hour calendars) and add any new persistence or API details here.


Imposed rule reminder: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.