# Notifications Architecture
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

This dossier distils the Notify architecture into implementation-ready guidance for service owners, SREs, and integrators. It complements the high-level overview by detailing process boundaries, persistence models, and extensibility points.

---
## 1. Runtime shape
```
                  ┌──────────────────┐
                  │ Authority (OpTok)│
                  └───────┬──────────┘
                  ┌───────▼──────────┐        ┌───────────────┐
                  │ Notify.WebService│◀──────▶│    MongoDB    │
       Tenant API │ REST + gRPC WIP  │        │ rules/channels│
                  └───────▲──────────┘        │ deliveries    │
                          │                   │ digests       │
    Internal bus          │                   └───────────────┘
  (NATS/Redis/etc)        │
                ┌─────────▼─────────┐      ┌────────────────┐
                │   Notify.Worker   │◀────▶│ Redis / Cache  │
                │ rule eval + render│      │ throttles/locks│
                └─────────▲─────────┘      └───────▲────────┘
                          │                        │
                          │                        │
                   ┌──────┴──────┐       ┌─────────┴────────┐
                   │ Connectors  │──────▶│ Slack/Teams/...  │
                   │ (plug-ins)  │       │ External targets │
                   └─────────────┘       └──────────────────┘
```
- **WebService** hosts REST endpoints (`/channels`, `/rules`, `/templates`, `/deliveries`, `/digests`, `/stats`) and handles schema normalisation, validation, and Authority enforcement.
- **Worker** subscribes to the platform event bus, evaluates rules per tenant, applies throttles/digests, renders payloads, writes ledger entries, and invokes connectors.
- **Plug-ins** live under `plugins/notify/` and are loaded deterministically at service start (`orderedPlugins` list). Each implements connector contracts and optional health/test-preview providers.

Both services share options via `notify.yaml` (see `etc/notify.yaml.sample`). For dev/test scenarios an in-memory repository is available, but production requires MongoDB plus Redis or NATS for durability and coordination.

---
## 2. Event ingestion and rule evaluation
1. **Subscription.** Workers attach to the internal bus (Redis Streams or NATS JetStream). Each partition key is `tenantId|scope.digest|event.kind` to preserve order for a given artefact.
2. **Normalisation.** Incoming events are hydrated into `NotifyEvent` envelopes. Payload JSON is normalised (sorted object keys) to preserve determinism and enable hashing.
3. **Rule snapshot.** Per-tenant rule sets are cached in memory. Change streams from Mongo trigger snapshot refreshes without restart.
4. **Match pipeline.**
   - Tenant check (`rule.tenantId` vs. event tenant).
   - Kind/namespace/repository/digest filters.
   - Severity and KEV gating based on event deltas.
   - VEX gating using `NotifyRuleMatchVex`.
   - Action iteration with throttle/digest decisions.
5. **Idempotency.** Each action computes `hash(ruleId|actionId|event.kind|scope.digest|delta.hash|dayBucket)`. If the same key was already recorded within the throttle TTL, the delivery is recorded with `status=Throttled` and processing of that action stops.
6. **Dispatch.** If digest is `instant`, the renderer immediately processes the action. Otherwise the event is appended to the digest window for later flush.
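The match pipeline in step 4 works like a chain of gates that stops at the first failed check. The Python sketch below illustrates only that shape and is not the Notify implementation. The `Rule` and `Event` classes, their fields (`kinds`, `namespaces`, `min_severity`, `require_kev`, `honour_vex`, `new_severities`, `vex_suppressed`), the severity ranking, and the `scanner.report.ready` event kind are hypothetical stand-ins for `NotifyRule`, `NotifyEvent`, and `NotifyRuleMatchVex`. Repository and digest filters are omitted for brevity.

```python
from dataclasses import dataclass, field

# Hypothetical severity ranking; the real rule schema may differ.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass
class Event:
    """Hypothetical, flattened stand-in for a NotifyEvent envelope."""
    tenant_id: str
    kind: str
    namespace: str
    repository: str
    digest: str
    new_severities: list[str] = field(default_factory=list)  # from the delta
    kev: bool = False             # assumed: delta touches a KEV-listed finding
    vex_suppressed: bool = False  # assumed: VEX marks the finding not affected


@dataclass
class Rule:
    """Hypothetical, flattened stand-in for a tenant rule."""
    rule_id: str
    tenant_id: str
    kinds: set[str]
    namespaces: set[str] = field(default_factory=set)  # empty set = any
    min_severity: str = "low"
    require_kev: bool = False
    honour_vex: bool = True


def matches(rule: Rule, event: Event) -> bool:
    """Return True only when every gate in the simplified pipeline passes."""
    # Tenant check: a rule never matches another tenant's events.
    if rule.tenant_id != event.tenant_id:
        return False
    # Kind / namespace filters (repository and digest filters omitted).
    if event.kind not in rule.kinds:
        return False
    if rule.namespaces and event.namespace not in rule.namespaces:
        return False
    # Severity gating: some new finding must reach the rule's threshold.
    # An empty delta therefore never matches in this sketch.
    threshold = SEVERITY_RANK[rule.min_severity]
    if not any(SEVERITY_RANK.get(s, 0) >= threshold for s in event.new_severities):
        return False
    # KEV gating.
    if rule.require_kev and not event.kev:
        return False
    # VEX gating: drop findings that VEX statements mark as not affected.
    if rule.honour_vex and event.vex_suppressed:
        return False
    return True


rule = Rule("r1", "acme", kinds={"scanner.report.ready"}, min_severity="high")
evt = Event("acme", "scanner.report.ready", "prod", "api", "sha256:abc", ["critical"])
print(matches(rule, evt))  # True
```

In this sketch, the inexpensive tenant and kind checks run first. That way, most non-matching rules are rejected before the delta is inspected.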
Failures during evaluation are logged with correlation IDs and surfaced through `/stats` and worker metrics (`notify_rule_eval_failures_total`, `notify_digest_flush_errors_total`).
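Step 5's idempotency check can be sketched as follows. The pipe-joined key fields and the `idem:<hash>` token format come from this guide. Everything else is an assumption made for illustration: SHA-256 as the hash, the UTC calendar date as `dayBucket`, and an in-memory dict standing in for the Redis/Mongo throttle store.

```python
import hashlib
import time
from datetime import datetime, timezone


def idempotency_key(rule_id, action_id, event_kind, scope_digest, delta_hash, at):
    """Build hash(ruleId|actionId|event.kind|scope.digest|delta.hash|dayBucket)."""
    # Assumption: dayBucket is the UTC calendar date of the event.
    day_bucket = at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    raw = "|".join([rule_id, action_id, event_kind, scope_digest, delta_hash, day_bucket])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class ThrottleStore:
    """In-memory stand-in for throttle tokens (`idem:<hash>` with a TTL)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}  # token -> expiry timestamp from self._clock

    def try_acquire(self, key_hash, ttl_seconds):
        """Return True if the action may proceed, False if it is throttled."""
        token = f"idem:{key_hash}"
        now = self._clock()
        expires = self._expiry.get(token)
        if expires is not None and expires > now:
            return False  # seen within the TTL: caller records status=Throttled
        self._expiry[token] = now + ttl_seconds
        return True


store = ThrottleStore()
at = datetime(2025, 11, 2, 12, 0, tzinfo=timezone.utc)
key = idempotency_key("rule-1", "act-1", "scanner.report.ready", "sha256:abc", "d1", at)
print(store.try_acquire(key, ttl_seconds=300))  # True  (first delivery)
print(store.try_acquire(key, ttl_seconds=300))  # False (throttled)
```

Against Redis, the same check-and-set is usually expressed as a single `SET idem:<hash> 1 NX EX <ttl>` call. That keeps the check atomic across worker replicas, which the separate read and write in this in-memory sketch do not.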
---
## 3. Rendering & connectors
- **Template resolution.** The renderer picks the template in this order: action template → channel default template → locale fallback → built-in minimal template. Locale tags are normalised to lowercase during negotiation (for example, `en-US` becomes `en-us`).
- **Helpers & partials.** Exposed helpers mirror the list in [`notifications/templates.md`](templates.md#3-variables-helpers-and-context). Plug-ins may register additional helpers but must remain deterministic and side-effect free.
- **Rendering output.** `NotifyDeliveryRendered` captures:
  - `channelType`, `format`, `locale`
  - `title`, `body`, optional `summary`, `textBody`
  - `target` (redacted where necessary)
  - `attachments[]` (safe URLs or references)
  - `bodyHash` (lowercase SHA-256) for audit parity
- **Connector contract.** Connectors implement `INotifyConnector` (send + health) and can implement `INotifyChannelTestProvider` to back `/channels/{id}/test`. All plug-ins are single-tenant aware. Secrets are resolved from references at send time and are never persisted in Mongo.
- **Retries.** Workers track delivery attempts and retry with exponential backoff plus jitter. On permanent failure, deliveries are marked `Failed` with a `statusReason`. Optional DLQ fan-out is slated for Sprint 40.
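The template fallback chain and the computation of `bodyHash` from the bullets above can be sketched as a dictionary lookup over candidate keys. This is a minimal sketch under stated assumptions. Templates are held in a plain dict keyed by `(template_key, locale)`. The locale fallback strips region subtags (`en-us` falls back to `en`). The built-in minimal template and the template keys are hypothetical. Python `str.format` placeholders stand in for the real template syntax. `bodyHash` is taken as the lowercase hex SHA-256 of the UTF-8 rendered body.

```python
import hashlib

BUILTIN_MINIMAL = "{title}"  # hypothetical built-in minimal template


def normalise_locale(locale: str) -> str:
    """Lowercase the locale tag, e.g. 'en-US' -> 'en-us'."""
    return locale.lower()


def locale_chain(locale: str) -> list[str]:
    """'en-US' -> ['en-us', 'en'] (assumed fallback: drop trailing subtags)."""
    parts = normalise_locale(locale).split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def resolve_template(templates, action_key, channel_default_key, locale):
    """Return (template_body, source) following the documented order."""
    chain = locale_chain(locale)
    # 1-2. Action template, then channel default, in the requested locale.
    for key in (action_key, channel_default_key):
        if key and (key, chain[0]) in templates:
            return templates[(key, chain[0])], f"{key}@{chain[0]}"
    # 3. Locale fallback: retry both keys with progressively shorter tags.
    for loc in chain[1:]:
        for key in (action_key, channel_default_key):
            if key and (key, loc) in templates:
                return templates[(key, loc)], f"{key}@{loc}"
    # 4. Built-in minimal template.
    return BUILTIN_MINIMAL, "builtin"


def body_hash(rendered_body: str) -> str:
    """bodyHash: lowercase hex SHA-256 of the rendered body (UTF-8 assumed)."""
    return hashlib.sha256(rendered_body.encode("utf-8")).hexdigest()


templates = {("slack.default", "en"): "*{title}*\n{body}"}
tpl, source = resolve_template(templates, "rule-1.slack", "slack.default", "en-US")
print(source)  # slack.default@en
print(body_hash(tpl.format(title="New CVE", body="Details")))
```

Because the hash is computed over the rendered body rather than the template, two deliveries share a `bodyHash` only when their final text is byte-identical.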
---
## 4. Persistence model
| Collection | Purpose | Key fields & indexes |
|------------|---------|----------------------|
| `rules` | Tenant rule definitions. | `_id`, `tenantId`, `enabled`; index on `{tenantId, enabled}`. |
| `channels` | Channel metadata + config references. | `_id`, `tenantId`, `type`; index on `{tenantId, type}`. |
| `templates` | Locale-specific render bodies. | `_id`, `tenantId`, `channelType`, `key`; index on `{tenantId, channelType, key}`. |
| `deliveries` | Ledger of rendered notifications. | `_id`, `tenantId`, `sentAt`; compound index on `{tenantId, sentAt:-1}` for history queries. |
| `digests` | Open digest windows per action. | `_id` (`tenantId:actionKey:window`), `status`; index on `{tenantId, actionKey}`. |
| `throttles` | Short-lived throttle tokens (Mongo or Redis). | Key format `idem:<hash>` with TTL aligned to throttle duration. |
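The `digests` row and step 6 of the pipeline (instant dispatch versus appending to a digest window) can be sketched together. Only the `tenantId:actionKey:window` `_id` layout and the `instant` mode come from this guide. The following are assumptions for illustration: the window identifier is the event time floored to a fixed window length, the `hourly`/`daily` mode names, the caller-supplied `actionKey`, and a dict standing in for the Mongo collection.

```python
from collections import defaultdict
from datetime import datetime, timezone


def window_id(at: datetime, window_seconds: int) -> str:
    """Floor the timestamp to its window start (assumed window naming)."""
    epoch = int(at.astimezone(timezone.utc).timestamp())
    start = epoch - epoch % window_seconds
    return datetime.fromtimestamp(start, tz=timezone.utc).strftime("%Y%m%dT%H%M%SZ")


class DigestBuffer:
    """Dict stand-in for the `digests` collection, keyed by `_id`."""

    # Hypothetical mapping from non-instant digest modes to window lengths.
    WINDOWS = {"hourly": 3600, "daily": 86400}

    def __init__(self):
        self.docs = defaultdict(lambda: {"status": "open", "events": []})

    def dispatch(self, tenant_id, action_key, digest_mode, event, at, render):
        """Render immediately for 'instant'; otherwise append to the open window."""
        if digest_mode == "instant":
            return render([event])
        window_seconds = self.WINDOWS[digest_mode]  # KeyError for unknown modes
        doc_id = f"{tenant_id}:{action_key}:{window_id(at, window_seconds)}"
        self.docs[doc_id]["events"].append(event)
        return None  # a separate digest flush renders the window later


buf = DigestBuffer()
at = datetime(2025, 11, 2, 13, 50, tzinfo=timezone.utc)
buf.dispatch("acme", "rule1-act1", "hourly", {"id": "e1"}, at, render=len)
print(list(buf.docs))  # ['acme:rule1-act1:20251102T130000Z']
```

Flooring to the window start means every event for the same tenant, action, and window lands in the same document. A flush can then process one `_id` at a time.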
Documents are stored using the canonical JSON serializer (`NotifyCanonicalJsonSerializer`) to preserve property ordering and casing. Schema migration helpers upgrade stored documents when new versions ship.
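Canonical serialisation makes hashes reproducible because the same logical document always yields the same bytes. The sketch below approximates the sorted-key normalisation described for event payloads by using Python's standard `json` module with sorted keys, compact separators, and UTF-8. It is not `NotifyCanonicalJsonSerializer`, whose exact rules (number formatting, casing, property order for typed documents) are not specified here.

```python
import hashlib
import json


def canonical_json(value) -> str:
    """Serialise with recursively sorted keys and no insignificant whitespace."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def payload_hash(value) -> str:
    """Lowercase hex SHA-256 over the canonical UTF-8 bytes."""
    return hashlib.sha256(canonical_json(value).encode("utf-8")).hexdigest()


a = {"kind": "scanner.report.ready", "scope": {"repo": "api", "digest": "sha256:abc"}}
b = {"scope": {"digest": "sha256:abc", "repo": "api"}, "kind": "scanner.report.ready"}
print(canonical_json(a))
# {"kind":"scanner.report.ready","scope":{"digest":"sha256:abc","repo":"api"}}
print(payload_hash(a) == payload_hash(b))  # True
```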
---
## 5. Deployment & configuration
- **Configuration sources.** YAML files feed typed options (`NotifyMongoOptions`, `NotifyWorkerOptions`, etc.). Environment variables can override connection strings and rate limits for production.
- **Authority integration.** Two OAuth clients (`notify-web`, `notify-web-dev`) with scopes `notify.viewer`, `notify.operator`, and (for dev/admin flows) `notify.admin` are required. Authority enforcement can be disabled for air-gapped dev use by providing `developmentSigningKey`.
- **Plug-in management.** `plugins.baseDirectory` and `orderedPlugins` guarantee deterministic loading. Offline Kits copy the plug-in tree verbatim; operations must keep the order aligned across environments.
- **Observability.** Workers expose structured logs (`ruleId`, `actionId`, `eventId`, `throttleKey`) and optional OpenTelemetry traces for rule evaluation and connector round-trips. Metrics include:
  - `notify_rule_matches_total{tenant,eventKind}`
  - `notify_delivery_attempts_total{channelType,status}`
  - `notify_digest_open_windows{window}`
- **Scaling levers.** Increase worker replicas to cope with bus throughput; adjust `worker.prefetchCount` for Redis Streams or `ackWait` for NATS JetStream. WebService remains stateless and scales horizontally behind the gateway.
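The deterministic plug-in loading described under plug-in management can be sketched as resolving `orderedPlugins` against the plug-ins actually present under `plugins.baseDirectory`. The sketch makes these assumptions: each plug-in is a subdirectory; listed plug-ins load first in the configured order; a listed plug-in that is missing is an error; unlisted plug-ins are appended in code-point (sorted) name order. The real loader may treat unlisted or missing plug-ins differently, and the plug-in names in the example are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_load_order(base_directory, ordered_plugins):
    """Return plug-in names in a deterministic load order."""
    present = {p.name for p in Path(base_directory).iterdir() if p.is_dir()}
    missing = [name for name in ordered_plugins if name not in present]
    if missing:
        # Failing fast surfaces drift between environments at start-up.
        raise FileNotFoundError(f"plug-ins listed in orderedPlugins not found: {missing}")
    listed = list(dict.fromkeys(ordered_plugins))  # drop duplicates, keep order
    unlisted = sorted(present - set(listed))  # code-point order, locale-independent
    return listed + unlisted


with tempfile.TemporaryDirectory() as tmp:
    for name in ("email", "slack", "teams", "webhook"):
        (Path(tmp) / name).mkdir()
    print(resolve_load_order(tmp, ["slack", "teams"]))
    # ['slack', 'teams', 'email', 'webhook']
```

The result depends only on the directory contents and the configured list, not on filesystem enumeration order. Because of this, an Offline Kit that copies the plug-in tree verbatim produces the same order everywhere.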
---
## 6. Roadmap alignment
| Backlog | Architectural note |
|---------|--------------------|
| `NOTIFY-SVC-38-001` | Standardise event envelope publication (idempotency keys); ensure bus bindings use the documented key format. |
| `NOTIFY-SVC-38-002..004` | Introduce simulation endpoints and throttle dashboards; expect additional `/internal/notify/simulate` routes and metrics, and update this guide once merged. |
| `NOTIFY-SVC-39-001..004` | Correlation engine, digests generator, simulation API, and quiet hours; anticipate new Mongo documents (`quietHours`, correlation caches) and connector metadata (quiet mode hints). Review this guide when the implementations land. |

Action: schedule a documentation sync with the Notifications Service Guild immediately after `NOTIFY-SVC-39-001..004` merges to confirm schema adjustments (for example, correlation edge storage and quiet-hour calendars) and add any new persistence or API details here.

---
> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.