git.stella-ops.org/docs/modules/notify/architecture-detail.md

Notifications Architecture

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

This dossier distils the Notify architecture into implementation-ready guidance for service owners, SREs, and integrators. It complements the high-level overview by detailing process boundaries, persistence models, and extensibility points.


1. Runtime shape

          ┌──────────────────┐
          │ Authority (OpTok)│
          └───────┬──────────┘
                  │
          ┌───────▼──────────┐        ┌───────────────┐
          │ Notify.WebService│◀──────▶│ PostgreSQL    │
Tenant API│  REST + gRPC WIP │        │ rules/channels│
          └───────▲──────────┘        │ deliveries    │
                  │                   │ digests       │
   Internal bus   │                   └───────────────┘
 (NATS/Valkey/etc)│
                  │
        ┌─────────▼─────────┐      ┌───────────────┐
        │ Notify.Worker     │◀────▶│ Valkey / Cache│
        │ rule eval + render│      │ throttles/locks│
        └─────────▲─────────┘      └───────▲───────┘
                  │                        │
                  │                        │
           ┌──────┴──────┐       ┌─────────┴────────┐
           │ Connectors  │──────▶│ Slack/Teams/...  │
           │ (plug-ins)  │       │ External targets │
           └─────────────┘       └──────────────────┘
  • 2025-11-02 decision — module boundaries. Keep src/Notify/ as the shared notification toolkit (engine, storage, queue, connectors) that multiple hosts can consume. src/Notifier/ retains the Worker (delivery engine) while the Notifier WebService has been merged into src/Notify/StellaOps.Notify.WebService (2026-04-08). The notifier.stella-ops.local hostname is now a DNS alias on the notify-web container.
  • Notify WebService (src/Notify/StellaOps.Notify.WebService) hosts all REST endpoints — both original Notify v1 (/channels, /rules, /templates, /deliveries, /digests, /stats) and merged Notifier v2 (/api/v2/notify/* escalation, incident, simulation, storm-breaker, etc.) — with schema normalisation, validation, and Authority enforcement.
  • Notifier Worker (src/Notifier/StellaOps.Notifier/StellaOps.Notifier.Worker) subscribes to the platform event bus, evaluates rules per tenant, applies throttles/digests, renders payloads, writes ledger entries, and invokes connectors. It remains a separate container.
  • Plug-ins live under plugins/notify/ and are loaded deterministically at service start (orderedPlugins list). Each implements connector contracts and optional health/test-preview providers.

Both services share options via notify.yaml (see etc/notify.yaml.sample). An in-memory repository is available for dev/test scenarios, but production requires PostgreSQL plus Valkey or NATS for durability and coordination.


2. Event ingestion and rule evaluation

  1. Subscription. Workers attach to the internal bus (Valkey Streams or NATS JetStream). The partition key is tenantId|scope.digest|event.kind, which preserves ordering for a given artefact.
  2. Normalisation. Incoming events are hydrated into NotifyEvent envelopes. Payload JSON is normalised (sorted object keys) to preserve determinism and enable hashing.
  3. Rule snapshot. Per-tenant rule sets are cached in memory. PostgreSQL LISTEN/NOTIFY triggers snapshot refreshes without restart.
  4. Match pipeline.
    • Tenant check (rule.tenantId vs. event tenant).
    • Kind/namespace/repository/digest filters.
    • Severity and KEV gating based on event deltas.
    • VEX gating using NotifyRuleMatchVex.
    • Action iteration with throttle/digest decisions.
  5. Idempotency. Each action computes hash(ruleId|actionId|event.kind|scope.digest|delta.hash|dayBucket); matches within throttle TTL record status=Throttled and stop.
  6. Dispatch. If digest is instant, the renderer immediately processes the action. Otherwise the event is appended to the digest window for later flush.
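
The Normalisation and Idempotency steps can be illustrated with a minimal Python sketch. It is not the service's implementation: the source only says hash(...), so the choice of SHA-256, the UTC date format used for dayBucket, and deriving delta.hash from the canonical payload are all assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def canonical_json(payload: dict) -> str:
    """Serialise with sorted object keys and no insignificant whitespace."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def delta_hash(delta: dict) -> str:
    """Hash of the canonical delta payload (SHA-256 is an assumption)."""
    return hashlib.sha256(canonical_json(delta).encode("utf-8")).hexdigest()


def idempotency_key(rule_id: str, action_id: str, event_kind: str,
                    scope_digest: str, delta: dict, ts: datetime) -> str:
    """Join the documented fields with '|' and hash the result.

    `ts` should be timezone-aware; the day bucket is the UTC calendar date
    (assumed format, not taken from the source).
    """
    day_bucket = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    material = "|".join(
        [rule_id, action_id, event_kind, scope_digest, delta_hash(delta), day_bucket]
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because keys are sorted before hashing, two payloads that differ only in property order produce the same idempotency key, which is the property the normalisation step is meant to guarantee.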

Failures during evaluation are logged with correlation IDs and surfaced through /stats and worker metrics (notify_rule_eval_failures_total, notify_digest_flush_errors_total).
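
The match pipeline in section 2 is an ordered series of gates that short-circuit on the first failure. The sketch below shows that shape only. The Rule and Event field names, the severity ordering, and the example event kind are hypothetical; VEX gating and action iteration are omitted because the source does not describe their inputs.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical severity scale, lowest to highest.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


@dataclass(frozen=True)
class Event:
    tenant_id: str
    kind: str
    namespace: str
    severity: str
    kev: bool = False


@dataclass(frozen=True)
class Rule:
    tenant_id: str
    event_kinds: FrozenSet[str] = frozenset()   # empty set means "any kind"
    namespaces: FrozenSet[str] = frozenset()    # empty set means "any namespace"
    min_severity: Optional[str] = None
    kev_only: bool = False


def matches(rule: Rule, event: Event) -> bool:
    """Apply the gates in order; the first failing gate rejects the event."""
    if rule.tenant_id != event.tenant_id:
        return False
    if rule.event_kinds and event.kind not in rule.event_kinds:
        return False
    if rule.namespaces and event.namespace not in rule.namespaces:
        return False
    if rule.min_severity is not None:
        if SEVERITY_ORDER.index(event.severity) < SEVERITY_ORDER.index(rule.min_severity):
            return False
    if rule.kev_only and not event.kev:
        return False
    return True
```

Putting the tenant check first keeps cross-tenant events from reaching any of the more expensive filters.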


3. Rendering & connectors

  • Template resolution. The renderer picks the template in this order: action template → channel default template → locale fallback → built-in minimal template. Locale negotiation normalises tags to lowercase (for example, en-US becomes en-us).
  • Helpers & partials. Exposed helpers mirror the list in templates.md. Plug-ins may register additional helpers but must remain deterministic and side-effect free.
  • Attestation lifecycle suite. Sprint 171 introduced dedicated tmpl-attest-* templates for verification failures, expiring attestations, key rotations, and transparency anomalies (see templates.md §7). Rule actions referencing those templates must populate the attestation context fields so channels stay consistent online/offline.
  • Rendering output. NotifyDeliveryRendered captures:
    • channelType, format, locale
    • title, body, optional summary, textBody
    • target (redacted where necessary)
    • attachments[] (safe URLs or references)
    • bodyHash (lowercase SHA-256) for audit parity
  • Connector contract. Connectors implement INotifyConnector (send + health) and can implement INotifyChannelTestProvider for /channels/{id}/test. All plug-ins are single-tenant aware; secrets are pulled via references at send time and never persisted in the database.
  • Retries. Workers track attempts and retry using exponential backoff with jitter. On permanent failure, deliveries are marked Failed with statusReason; optional DLQ fan-out is slated for Sprint 40.
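
The template fallback chain and the bodyHash field can be sketched as follows. This is an illustrative Python sketch, not the renderer's code: the BUILT_IN_MINIMAL identifier is hypothetical, and hashing the UTF-8 bytes of the rendered body is an assumption (the source only specifies lowercase SHA-256).

```python
import hashlib
from typing import Optional

# Hypothetical identifier for the built-in minimal template.
BUILT_IN_MINIMAL = "builtin-minimal"


def resolve_template(action_template: Optional[str],
                     channel_default: Optional[str],
                     locale_fallback: Optional[str]) -> str:
    """Return the first available template in the documented order."""
    for candidate in (action_template, channel_default, locale_fallback):
        if candidate:
            return candidate
    return BUILT_IN_MINIMAL


def normalise_locale(tag: str) -> str:
    """Lowercase the locale tag, e.g. en-US becomes en-us."""
    return tag.strip().lower()


def body_hash(body: str) -> str:
    """Lowercase hex SHA-256 of the rendered body (UTF-8 encoding assumed)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()
```

Storing the hash alongside the delivery lets an auditor confirm that a re-rendered body matches what was sent without keeping a second copy of the content.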

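One common way to combine exponential growth with jitter for delivery retries is the "full jitter" scheme shown below. The source does not say which variant the workers use, so the scheme, the base delay, the cap, and the function names are all assumptions for illustration.

```python
import random
from typing import List


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng: random.Random = random.Random()) -> float:
    """Full jitter: pick uniformly in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_schedule(max_attempts: int, base: float = 1.0, cap: float = 300.0,
                   rng: random.Random = random.Random()) -> List[float]:
    """Delays to wait before each retry, for attempts 0..max_attempts-1."""
    return [backoff_delay(a, base, cap, rng) for a in range(max_attempts)]
```

Passing a seeded random.Random instance makes the schedule reproducible in tests, while the cap keeps late attempts from waiting unboundedly long.
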
4. Persistence model

| Table | Purpose | Key fields & indexes |
|-------|---------|----------------------|
| rules | Tenant rule definitions. | id, tenant_id, enabled; index on (tenant_id, enabled). |
| channels | Channel metadata + config references. | id, tenant_id, type; index on (tenant_id, type). |
| templates | Locale-specific render bodies. | id, tenant_id, channel_type, key; index on (tenant_id, channel_type, key). |
| deliveries | Ledger of rendered notifications. | id, tenant_id, sent_at; compound index on (tenant_id, sent_at DESC) for history queries. |
| digests | Open digest windows per action. | id (tenant_id:action_key:window), status; index on (tenant_id, action_key). |
| throttles | Short-lived throttle tokens (PostgreSQL or Valkey). | Key format idem:<hash> with TTL aligned to throttle duration. |

Records are stored using the canonical JSON serializer (NotifyCanonicalJsonSerializer) to preserve property ordering and casing. Schema migration helpers upgrade stored records when new versions ship.
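
The throttles entry (key format idem:<hash> with a TTL) behaves like a set-if-absent token with expiry. The in-memory sketch below shows those semantics only; the class and method names are hypothetical, and a Valkey-backed store would more likely rely on an atomic set-if-not-exists command with an expiry than on a process-local dictionary.

```python
import time
from typing import Callable, Dict


class InMemoryThrottleStore:
    """Process-local stand-in for a throttle token store (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expiry: Dict[str, float] = {}

    def try_acquire(self, idem_hash: str, ttl_seconds: float) -> bool:
        """Return True if the action may proceed, False if it is throttled."""
        key = f"idem:{idem_hash}"
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # a live token exists: record status=Throttled upstream
        self._expiry[key] = now + ttl_seconds
        return True
```

The clock is injectable so that tests can advance time deterministically instead of sleeping.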


5. Deployment & configuration

  • Configuration sources. YAML files feed typed options (NotifyPostgresOptions, NotifyWorkerOptions, etc.). Environment variables can override connection strings and rate limits for production.
  • Authority integration. Two OAuth clients (notify-web, notify-web-dev) with scopes notify.viewer, notify.operator, and (for dev/admin flows) notify.admin are required. Authority enforcement can be disabled for air-gapped dev use by providing developmentSigningKey.
  • Plug-in management. plugins.baseDirectory and orderedPlugins guarantee deterministic loading. Offline Kits copy the plug-in tree verbatim; operations must keep the order aligned across environments.
  • Observability. Workers expose structured logs (ruleId, actionId, eventId, throttleKey). Metrics include:
    • notify_rule_matches_total{tenant,eventKind}
    • notify_delivery_attempts_total{channelType,status}
    • notify_digest_open_windows{window}
    • Optional OpenTelemetry traces for rule evaluation and connector round-trips.
  • Scaling levers. Increase worker replicas to cope with bus throughput; adjust worker.prefetchCount for Valkey Streams or ackWait for NATS JetStream. WebService remains stateless and scales horizontally behind the gateway.
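
Deterministic plug-in loading driven by orderedPlugins can be sketched as a stable sort over the discovered plug-in names. The function name and the rule that unlisted plug-ins load last in alphabetical order are assumptions; the source only states that the configured order is honoured.

```python
from typing import Dict, List, Sequence


def order_plugins(discovered: Sequence[str], ordered_plugins: Sequence[str]) -> List[str]:
    """Sort discovered plug-ins by their position in the configured list.

    Plug-ins missing from the list are placed after the listed ones and
    ordered by name (assumed tie-break) so the result never depends on
    directory enumeration order.
    """
    rank: Dict[str, int] = {name: i for i, name in enumerate(ordered_plugins)}
    return sorted(discovered, key=lambda name: (rank.get(name, len(rank)), name))
```

For example, with a configured order of slack then teams, a directory scan that returns webhook, teams, slack, email in any order always loads as slack, teams, email, webhook.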

6. Roadmap alignment

| Backlog | Architectural note |
|---------|--------------------|
| NOTIFY-SVC-38-001 | Standardise event envelope publication (idempotency keys); ensure bus bindings use the documented key format. |
| NOTIFY-SVC-38-002..004 | Introduce simulation endpoints and throttle dashboards; expect additional /internal/notify/simulate routes and metrics; update once merged. |
| NOTIFY-SVC-39-001..004 | Correlation engine, digests generator, simulation API, quiet hours; anticipate new PostgreSQL tables (quiet_hours, correlation caches) and connector metadata (quiet mode hints). Review this guide when implementations land. |

Action: schedule a documentation sync with the Notifications Service Guild immediately after NOTIFY-SVC-39-001..004 merge to confirm schema adjustments (e.g., correlation edge storage, quiet hour calendars) and add any new persistence or API details here.


Imposed rule reminder: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.