> **Scope.** Implementation‑ready architecture for **Notify** (aligned with Epic 11 – Notifications Studio): a rules‑driven, tenant‑aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator‑defined routing rules, renders **channel‑specific messages** (Slack/Teams/Email/Webhook/PagerDuty/OpsGenie), and delivers them **reliably** with idempotency, throttling, and digests. It is UI‑managed, auditable, and safe by default (no secrets leakage, no spam storms).

* **Console frontdoor compatibility (updated 2026-03-10).** The web console reaches Notifier Studio through the gateway-owned `/api/v1/notifier/*` prefix, which translates onto the service-local `/api/v2/notify/*` surface without requiring browser calls to raw service-prefixed routes.
* **Console admin routing truthfulness (updated 2026-04-21).** The console uses `/api/v1/notify/*` only for core Notify toolkit flows (channels, rules, deliveries, incidents, acknowledgements). Advanced admin configuration such as quiet-hours, throttles, escalation, and localization is owned by the Notifier frontdoor (`/api/v1/notifier/*` → `/api/v2/notify/*`); Platform no longer serves synthetic `/api/v1/notify/*` admin compatibility payloads. Digest schedule CRUD remains unavailable in the live API.
* **Merged Notify compat surface restoration (updated 2026-04-22).** The merged `src/Notify/*` host now maps the admin compatibility routes expected behind `/api/v1/notifier/*`, including `/api/v2/notify/channels*`, `/deliveries*`, `/simulate*`, `/quiet-hours*`, `/throttle-configs*`, `/escalation-policies*`, and `/overrides*`. Unsupported operator override CRUD now returns an explicit `501` contract response instead of a misleading `404`, and focused proof lives in `src/Notify/__Tests/StellaOps.Notify.WebService.Tests/CrudEndpointsTests.cs`.
* **Runtime durability cutover (updated 2026-04-16).** Default `src/Notifier/*` production wiring now resolves queue and storage through the shared `StellaOps.Notify.Persistence` and `StellaOps.Notify.Queue` libraries. `NullNotifyEventQueue` is allowed only in the `Testing` environment, `notify.pack_approvals` is durable, and restart-survival proof is covered by `NotifierDurableRuntimeProofTests` against real Postgres + Redis.
* **Correlation incident/throttle durability (updated 2026-04-20).** Non-testing Notify and Notifier hosts no longer keep incident correlation or throttle windows in process-local memory. Both hosts now swap `IIncidentManager` and `INotifyThrottler` onto PostgreSQL-backed runtime services using `notify.correlation_runtime_incidents` and `notify.correlation_runtime_throttle_events`, with restart-survival proof in `NotifierCorrelationDurableRuntimeTests`.
* **Localization runtime durability (updated 2026-04-20).** Non-testing Notify and Notifier hosts no longer keep tenant-managed localization bundles in process-local memory. Both hosts now swap `ILocalizationService` onto a PostgreSQL-backed runtime service using `notify.localization_bundles`, while built-in system fallback strings remain compiled defaults, with restart-survival proof in `NotifierLocalizationDurableRuntimeTests`.
* **Storm/fallback runtime durability (updated 2026-04-20).** Non-testing Notify and Notifier hosts no longer keep storm detection state, tenant fallback chains, or per-delivery fallback attempts in process-local memory. Both hosts now swap `IStormBreaker` and `IFallbackHandler` onto PostgreSQL-backed runtime services using `notify.storm_runtime_states`, `notify.storm_runtime_events`, `notify.fallback_runtime_chains`, and `notify.fallback_runtime_delivery_states`, with restart-survival proof in `NotifierStormFallbackDurableRuntimeTests`.
* **Escalation engine runtime durability (updated 2026-04-20).** Non-testing Notify and Notifier hosts no longer keep live `IEscalationEngine` state in a process-local dictionary. Both hosts now swap `IEscalationEngine` onto a PostgreSQL-backed runtime service using `notify.escalation_states`, with restart-survival proof in `NotifierEscalationRuntimeDurableTests` and startup-contract proof in `NotifyEscalationRuntimeStartupContractTests`.
* **External ack/runtime channel durability (updated 2026-04-20).** Non-testing Notifier worker hosts no longer depend on a process-local external-id bridge map or a webhook-only dispatch composition for external channels. The worker now composes `WebhookChannelDispatcher` for chat/webhook routes plus `AdapterChannelDispatcher` for `Email`, `PagerDuty`, and `OpsGenie`, durably records provider `externalId` plus `incidentId` metadata into PostgreSQL-backed delivery state, and resolves PagerDuty/OpsGenie webhook acknowledgements through PostgreSQL-backed lookup after restart. Focused proof lives in `NotifierWorkerHostWiringTests` and `NotifierAckBridgeRuntimeDurableTests`.
* **Digest scheduler runtime composition (updated 2026-04-20).** The non-testing Notifier worker now composes `DigestScheduleRunner`, `DigestGenerator`, and `ChannelDigestDistributor` in the live host. Scheduled digests remain configuration-driven and now resolve tenant IDs from `Notifier:DigestSchedule:Schedules:*:TenantIds` through `ConfiguredDigestTenantProvider` instead of the process-local `InMemoryDigestTenantProvider`. There is currently no operator-managed digest schedule CRUD surface in the live runtime; `/digests` administers open digest windows only. Focused proof lives in `NotifierWorkerHostWiringTests`.
* **Suppression admin durability (updated 2026-04-16).** Non-testing throttle configuration and operator override APIs no longer use live in-memory state.
Both hosts now resolve canonical `/api/v2/throttles*` and `/api/v2/overrides*` plus legacy `/api/v2/notify/throttle-configs*` and `/api/v2/notify/overrides*` through PostgreSQL-backed suppression services, with restart-survival proof in `NotifierSuppressionDurableRuntimeTests`.
* **Escalation/on-call durability (updated 2026-04-16).** Non-testing escalation-policy and on-call schedule APIs no longer use live in-memory services or compat repositories. Both hosts now resolve canonical `/api/v2/escalation-policies*` and `/api/v2/oncall-schedules*` plus legacy `/api/v2/notify/escalation-policies*` and `/api/v2/notify/oncall-schedules*` through PostgreSQL-backed runtime services, with restart-survival proof in `NotifierEscalationOnCallDurableRuntimeTests`.
* **Quiet-hours/maintenance durability (updated 2026-04-20).** Non-testing quiet-hours calendars and maintenance windows no longer use live in-memory compat repositories or maintenance evaluators. Both hosts now resolve canonical `/api/v2/quiet-hours*` plus legacy `/api/v2/notify/quiet-hours*` and `/api/v2/notify/maintenance-windows*` through PostgreSQL-backed runtime services on the shared `notify.quiet_hours` and `notify.maintenance_windows` tables, with restart-survival proof in `NotifierQuietHoursMaintenanceDurableRuntimeTests`. Fixed-time daily/weekly cron expressions still project truthfully into canonical schedules, and compat-authored cron shapes that cannot be flattened losslessly now evaluate natively from persisted `cronExpression` plus `duration` metadata instead of remaining inert after restart.
* **Security/dead-letter durability (updated 2026-04-16).** Non-testing webhook security, tenant isolation, dead-letter administration, and retention cleanup state no longer use live in-memory services.
Both hosts now resolve `/api/v2/security*`, `/api/v2/notify/dead-letter*`, `/api/v1/observability/dead-letters*`, and retention endpoints through PostgreSQL-backed runtime services on the shared `notify.webhook_security_configs`, `notify.webhook_validation_nonces`, `notify.tenant_resource_owners`, `notify.cross_tenant_grants`, `notify.tenant_isolation_violations`, `notify.dead_letter_entries`, `notify.retention_policies_runtime`, and `notify.retention_cleanup_executions_runtime` tables, with restart-survival proof in `NotifierSecurityDeadLetterDurableRuntimeTests`.
* **Testing-only fallback boundary (updated 2026-04-20).** `src/Notifier/*` host startup now registers those durable quiet-hours, suppression, escalation/on-call, security, and dead-letter services directly for non-testing environments instead of composing an in-memory graph and replacing it later. The remaining in-memory admin services are isolated to `Testing`, with startup-contract proof in `StartupDependencyWiringTests`.
* **Simulation runtime parity (updated 2026-04-20).** The canonical `/api/v2/simulate*` endpoints and the legacy `/api/v2/notify/simulate*` endpoints in `src/Notifier/` now resolve the same DI-composed simulation runtime, so throttling plus quiet-hours or maintenance suppression behave identically across route families.

---

## 0) Mission & boundaries

**Mission.** Convert **facts** from Stella Ops into **actionable, noise-controlled** signals where teams already live (chat, email, paging, and webhooks), with **explainable** reasons and deep links to the UI.

**Boundaries.**

* Notify **does not make policy decisions** and **does not rescan**; it **consumes** events from Scanner/Scheduler/Excititor/Concelier/Attestor/Zastava and routes them.
* Attachments are **links** (UI/attestation pages); Notify **does not** attach SBOMs or large blobs to messages.
* Secrets for channels (Slack tokens, SMTP creds) are **referenced**, not stored raw in the database.
* **2025-11-02 module boundary.** Maintain `src/Notify/` as the reusable notification toolkit (engine, storage, queue, connectors) and `src/Notifier/` as the Notifications Studio host that composes those libraries. Do not merge directories without an approved packaging RFC that covers build impacts, offline kit parity, and cross-module governance.
* **API versioning (updated 2026-02-22).** The API is split across two services:
  * **Notify** (`src/Notify/`) exposes `/api/v1/notify` — the core notification toolkit (rules, channels, deliveries, templates). This is the lean, canonical API surface.
  * **Notifier** (`src/Notifier/`) exposes `/api/v2/notify` — the full Notifications Studio with enterprise features (escalation policies, on-call schedules, storm breaker, inbox, retention, simulation, quiet hours, 73+ routes). Notifier also maintains select `/api/v1/notify` endpoints for backward compatibility.
  * Both versions are **actively maintained and production**. v2 is NOT deprecated — it is the enterprise-tier API hosted by the Notifier Studio service. The previous claim that v2 was "compatibility-only" is stale and has been corrected.
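The frontdoor translation described above (console calls under `/api/v1/notifier/*` mapped onto the service-local `/api/v2/notify/*` surface, while core `/api/v1/notify/*` toolkit routes pass through untouched) amounts to a prefix rewrite. A minimal sketch, in Python for illustration only — the actual gateway is not implemented this way, and the function name is hypothetical:

```python
# Hypothetical sketch of the gateway-owned frontdoor rewrite:
# /api/v1/notifier/* -> /api/v2/notify/*; everything else unchanged.
FRONTDOOR_PREFIX = "/api/v1/notifier/"
SERVICE_PREFIX = "/api/v2/notify/"

def translate_frontdoor_path(path: str) -> str:
    """Map a console-facing Notifier Studio path onto the service-local surface."""
    if not path.startswith(FRONTDOOR_PREFIX):
        # Core toolkit routes (/api/v1/notify/*) and other paths pass through.
        return path
    return SERVICE_PREFIX + path[len(FRONTDOOR_PREFIX):]
```

The point of the sketch is the asymmetry: only the `notifier` prefix is rewritten, so the two route families never collide.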
---

## 1) Runtime shape & projects

```
src/
├─ StellaOps.Notify.WebService/        # REST: rules/channels CRUD, test send, deliveries browse
├─ StellaOps.Notify.Worker/            # consumers + evaluators + renderers + delivery workers
├─ StellaOps.Notify.Connectors.*/      # channel plug-ins: Slack, Teams, Email, Webhook (v1)
│  └─ *.Tests/
├─ StellaOps.Notify.Engine/            # rules engine, templates, idempotency, digests, throttles
├─ StellaOps.Notify.Models/            # DTOs (Rule, Channel, Event, Delivery, Template)
├─ StellaOps.Notify.Storage.Postgres/  # canonical persistence (notify schema)
├─ StellaOps.Notify.Queue/             # bus client (Valkey Streams/NATS JetStream)
└─ StellaOps.Notify.Tests.*            # unit/integration/e2e
```

**Deployables**:

* **Notify.WebService** (stateless API)
* **Notify.Worker** (horizontal scale)

**Dependencies**: Authority (OpToks; DPoP/mTLS), **PostgreSQL** (notify schema), Valkey/NATS (bus), HTTP egress to Slack/Teams/Webhooks/PagerDuty/OpsGenie, SMTP relay for Email.

> **Configuration.** Notify.WebService bootstraps from `notify.yaml` (see `etc/notify.yaml.sample`). Use `storage.driver: postgres` and provide `postgres.notify` options (`connectionString`, `schemaName`, pool sizing, timeouts). Authority settings follow the platform defaults—when running locally without Authority, set `authority.enabled: false` and supply `developmentSigningKey` so JWTs can be validated offline.
>
> `api.rateLimits` exposes token-bucket controls for delivery history queries and test-send previews (`deliveryHistory`, `testSend`). Default values allow generous browsing while preventing accidental bursts; operators can relax/tighten the buckets per deployment.

> **Plug-ins.** All channel connectors are packaged under `/plugins/notify`.
> The ordered load list must start with Slack/Teams before Email/Webhook so chat-first actions are registered deterministically for Offline Kit bundles:
>
> ```yaml
> plugins:
>   baseDirectory: "/var/opt/stellaops"
>   directory: "plugins/notify"
>   orderedPlugins:
>     - StellaOps.Notify.Connectors.Slack
>     - StellaOps.Notify.Connectors.Teams
>     - StellaOps.Notify.Connectors.Email
>     - StellaOps.Notify.Connectors.Webhook
> ```
>
> The Offline Kit job simply copies the `plugins/notify` tree into the air-gapped bundle; the ordered list keeps connector manifests stable across environments.
>
> In the hosted Notifier worker, delivery execution is split across two deterministic dispatch paths: `WebhookChannelDispatcher` continues to handle chat/webhook routes, while `AdapterChannelDispatcher` resolves `Email`, `PagerDuty`, and `OpsGenie` through `IChannelAdapterFactory`. The provider `externalId` emitted by those adapter-backed channels must survive persistence so inbound webhook acknowledgements can be resolved after restart.

> **Authority clients.** Register two OAuth clients in StellaOps Authority: `notify-web-dev` (audience `notify.dev`) for development and `notify-web` (audience `notify`) for staging/production. Both require `notify.read` and `notify.admin` scopes and use DPoP-bound client credentials (`client_secret` in the samples). Reference entries live in `etc/authority.yaml.sample`, with placeholder secrets under `etc/secrets/notify-web*.secret.example`.

---

## 2) Responsibilities

1. **Ingest** platform events from the internal bus with strong ordering per key (e.g., image digest).
2. **Evaluate rules** (tenant‑scoped) with matchers: severity changes, namespaces, repos, labels, KEV flags, provider provenance (VEX), component keys, admission decisions, etc.
3. **Control noise**: **throttle**, **coalesce** (digest windows), and **dedupe** via idempotency keys.
4. **Render** channel‑specific messages using safe templates; include **evidence** and **links**.
5.
**Deliver** with retries/backoff; record outcome; expose delivery history to the UI.
6. **Test** paths (send test messages to channel targets) without touching live rules.
7. **Audit**: log who configured what, when, and why a message was sent.

---

## 3) Event model (inputs)

Notify subscribes to the **internal event bus** (produced by services, escaped JSON; gzip allowed with caps):

* `scanner.scan.completed` — new SBOM(s) composed; artifacts ready
* `scanner.report.ready` — analysis verdict (policy+vex) available; carries deltas summary
* `scheduler.rescan.delta` — new findings after Concelier/Excititor deltas (already summarized)
* `attestor.logged` — Rekor UUID returned (sbom/report/vex export)
* `zastava.admission` — admit/deny with reasons, namespace, image digests
* `concelier.export.completed` — new export ready (rarely notified directly; usually drives Scheduler)
* `excititor.export.completed` — new consensus snapshot (ditto)

**Canonical envelope (bus → Notify.Engine):**

```json
{
  "eventId": "uuid",
  "kind": "scanner.report.ready",
  "tenant": "tenant-01",
  "ts": "2025-10-18T05:41:22Z",
  "actor": "scanner-webservice",
  "scope": { "namespace": "payments", "repo": "ghcr.io/acme/api", "digest": "sha256:..." },
  "payload": { /* kind-specific fields, see below */ }
}
```

**Examples (payload cores):**

* `scanner.report.ready`:

  ```json
  {
    "reportId": "report-3def...",
    "verdict": "fail",
    "summary": { "total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2 },
    "delta": { "newCritical": 1, "kev": ["CVE-2025-..."] },
    "links": { "ui": "https://ui/.../reports/report-3def...", "rekor": "https://rekor/..." },
    "dsse": { "...": "..." },
    "report": { "...": "..." }
  }
  ```

  The payload embeds both the canonical report document and the DSSE envelope so connectors, Notify, and UI tooling can reuse the signed bytes without re-serialising.
* `scanner.scan.completed`:

  ```json
  {
    "reportId": "report-3def...",
    "digest": "sha256:...",
    "verdict": "fail",
    "summary": { "total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2 },
    "delta": { "newCritical": 1, "kev": ["CVE-2025-..."] },
    "policy": { "revisionId": "rev-42", "digest": "27d2..." },
    "findings": [{ "id": "finding-1", "severity": "Critical", "cve": "CVE-2025-...", "reachability": "runtime" }],
    "dsse": { "...": "..." }
  }
  ```

* `zastava.admission`:

  ```json
  {
    "decision": "deny|allow",
    "reasons": ["unsigned image", "missing SBOM"],
    "images": [{ "digest": "sha256:...", "signed": false, "hasSbom": false }]
  }
  ```

---

## 4) Rules engine — semantics

**Rule shape (simplified):**

```yaml
name: "high-critical-alerts-prod"
enabled: true
match:
  eventKinds: ["scanner.report.ready","scheduler.rescan.delta","zastava.admission"]
  namespaces: ["prod-*"]
  repos: ["ghcr.io/acme/*"]
  minSeverity: "high"        # min severity of new findings (delta context)
  kev: true                  # require KEV-tagged, or allow any if false
  verdict: ["fail","deny"]   # filter for report/admission
  vex:
    includeRejectedJustifications: false   # notify only on accepted 'affected'
actions:
  - channel: "slack:sec-alerts"   # reference to Channel object
    template: "concise"
    throttle: "5m"
  - channel: "email:soc"
    digest: "hourly"
    template: "detailed"
```

**Evaluation order**

1. **Tenant check** → discard if rule tenant ≠ event tenant.
2. **Kind filter** → discard early.
3. **Scope match** (namespace/repo/labels).
4. **Delta/severity gates** (if event carries `delta`).
5. **VEX gate** (drop if the event’s finding is not affected under policy consensus, unless the rule says otherwise).
6. **Throttling/dedup** (idempotency key) — skip if suppressed.
7. **Actions** → enqueue per‑channel job(s).

**Idempotency key**: `hash(ruleId | actionId | event.kind | scope.digest | delta.hash | day-bucket)`; ensures the “same alert” doesn’t fire more than once within the throttle window.
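The idempotency key formula above can be sketched directly. This is an illustrative Python rendering, not the service's .NET implementation; the `|` joiner, SHA‑256, and a UTC calendar-day bucket are assumptions the doc does not pin down:

```python
import hashlib
from datetime import datetime, timezone

def idempotency_key(rule_id: str, action_id: str, event_kind: str,
                    scope_digest: str, delta_hash: str, ts: datetime) -> str:
    """Sketch of hash(ruleId | actionId | event.kind | scope.digest |
    delta.hash | day-bucket): equal alerts within the same UTC day
    collapse to the same key, so repeated events are suppressed."""
    day_bucket = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    material = "|".join(
        [rule_id, action_id, event_kind, scope_digest, delta_hash, day_bucket])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Any change in the delta hash or the day bucket yields a new key, which is what allows a genuinely new finding (or the same finding on a later day) to alert again.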
**Digest windows**: maintain a per-action **coalescer**:

* Window: `5m|15m|1h|1d` (configurable); coalesces events by tenant + namespace/repo or by digest group.
* Digest messages summarize the top N items and counts, with safe truncation.

---

## 5) Channels & connectors (plug‑ins)

Channel config is **two‑part**: a **Channel** record (name, type, options) and a Secret **reference** (Vault/K8s Secret). Connectors are **restart-time plug-ins** discovered on service start (same manifest convention as Concelier/Excititor) and live under `plugins/notify//`.

**Built-in channels:**

* **Slack**: Bot token (xoxb‑…), `chat.postMessage` + `blocks`; rate limit aware (HTTP 429).
* **Microsoft Teams**: Incoming Webhook (or Graph card later); adaptive card payloads.
* **Email (SMTP)**: TLS (STARTTLS or implicit), From/To/CC/BCC; HTML+text alt; DKIM optional.
* **Generic Webhook**: POST JSON with a signature (Ed25519 or HMAC‑SHA‑256) in headers.
* **PagerDuty**: Events API v2 trigger/ack/resolve flow; the durable `dedup_key`/external id mapping is persisted with delivery state for restart-safe webhook acknowledgement handling.
* **OpsGenie**: Alert create/ack/close flow; the alias/external id is persisted with delivery state so inbound acknowledgement webhooks remain restart-safe.

**Connector contract** (implemented by plug-in assemblies):

```csharp
public interface INotifyConnector
{
    string Type { get; }   // "slack" | "teams" | "email" | "webhook" | ...
    Task SendAsync(DeliveryContext ctx, CancellationToken ct);
    Task HealthAsync(ChannelConfig cfg, CancellationToken ct);
}
```

For hosted external channels, Notifier worker adapters implement `IChannelAdapter` and are selected by `AdapterChannelDispatcher`. Those adapters must emit stable provider identifiers (`externalId`, `incidentId` where applicable) so the `IAckBridge` webhook path can recover correlation from persisted delivery rows instead of process-local memory.

**DeliveryContext** includes the **rendered content** and the **raw event** for audit.
**Test-send previews.** Plug-ins can optionally implement `INotifyChannelTestProvider` to shape `/channels/{id}/test` responses. Providers receive a sanitised `ChannelTestPreviewContext` (channel, tenant, target, timestamp, trace) and return a `NotifyDeliveryRendered` preview + metadata. When no provider is present, the host falls back to a generic preview so the endpoint always responds.

**Secrets**: `ChannelConfig.secretRef` points to an Authority‑managed secret handle or a K8s Secret path; workers load it at send-time; plug-in manifests (`notify-plugin.json`) declare capabilities and version.

---

## 6) Templates & rendering

**Template engine**: strongly typed, safe Handlebars‑style; no arbitrary code. Partial templates per channel. Deterministic outputs (prop order, no locale drift unless requested).

**Variables** (examples):

* `event.kind`, `event.ts`, `scope.namespace`, `scope.repo`, `scope.digest`
* `payload.verdict`, `payload.delta.newCritical`, `payload.links.ui`, `payload.links.rekor`
* `topFindings[]` with `purl`, `vulnId`, `severity`
* `policy.name`, `policy.revision` (if available)

**Helpers**:

* `severity_icon(sev)`, `link(text,url)`, `pluralize(n, "finding")`, `truncate(text, n)`, `code(text)`.

**Channel mapping**:

* Slack: title + blocks, limited to 50 blocks/3000 chars per section; long lists → link to UI.
* Teams: Adaptive Card schema 1.5; fallback text for older channels (surfaced as `teams.fallbackText` metadata alongside the webhook hash).
* Email: HTML + text; inline table of the top N findings, the rest behind a UI link.
* Webhook: JSON with `event`, `ruleId`, `actionId`, `summary`, `links`, and a raw `payload` subset.

**i18n**: template set per locale (English default; Bulgarian built‑in).

---

## 7) Data model (PostgreSQL)

Canonical JSON Schemas for rules/channels/events live in `docs/modules/notify/resources/schemas/`. Sample payloads intended for tests/UI mock responses are captured in `docs/modules/notify/resources/samples/`.
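To make the §6 helper semantics concrete, here is an illustrative Python sketch of two of the listed helpers. The real helpers live inside the .NET template engine; the exact pluralisation and truncation rules shown here are assumptions:

```python
def pluralize(n: int, noun: str) -> str:
    """Sketch of the `pluralize(n, "finding")` helper: naive English
    's' suffix, count included in the output."""
    return f"{n} {noun}" if n == 1 else f"{n} {noun}s"

def truncate(text: str, n: int) -> str:
    """Sketch of the `truncate(text, n)` helper: hard cap at n characters,
    replacing the tail with an ellipsis so rendered output stays within
    channel length limits deterministically."""
    return text if len(text) <= n else text[: max(0, n - 1)] + "…"
```

Determinism is the property that matters: given the same inputs, the helpers must always emit byte-identical output, since the rendered body is hashed (`bodyHash`) and compared across deliveries.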
**Database**: `stellaops_notify` (PostgreSQL)

* `rules`

  ```
  { _id, tenantId, name, enabled, match, actions, createdBy, updatedBy, createdAt, updatedAt }
  ```

* `channels`

  ```
  { _id, tenantId, name: "slack:sec-alerts", type: "slack",
    config: { webhookUrl?: "", channel: "#sec-alerts", workspace?: "...", secretRef: "ref://..." },
    createdAt, updatedAt }
  ```

* `deliveries`

  ```
  { _id, tenantId, ruleId, actionId, eventId, kind, scope,
    status: "sent|failed|throttled|digested|dropped",
    externalId?, metadata?,
    attempts: [{ ts, status, code, reason }],
    rendered: { title, body, target },   // redacted for PII; body hash stored
    sentAt, lastError? }
  ```

  PagerDuty and OpsGenie deliveries durably carry the provider `externalId` plus `metadata.incidentId` so inbound webhook acknowledgements can be resolved after worker restart without relying on a process-local bridge map.

* `digests`

  ```
  { _id, tenantId, actionKey, window: "hourly", openedAt,
    items: [{ eventId, scope, delta }], status: "open|flushed" }
  ```

* `correlation_runtime_incidents`

  ```
  { tenantId, incidentId, correlationKey, eventKind, title,
    status: "open|acknowledged|resolved", eventCount,
    firstOccurrence, lastOccurrence, acknowledgedBy?, resolvedBy?,
    eventIds: [eventId...] }
  ```

* `correlation_runtime_throttle_events`

  ```
  { tenantId, correlationKey, occurredAt }   // short-lived, also cached in Valkey
  ```

* `escalation_states`

  ```
  { tenantId, policyId, incidentId?, correlationId, currentStep, repeatIteration,
    status: "active|acknowledged|resolved|expired",
    startedAt, nextEscalationAt, acknowledgedAt?, acknowledgedBy?, metadata }
  ```

  `correlationId` is the durable lookup key for the live string incident id used by the runtime engine. `metadata` carries the runtime-only fields that do not fit the canonical columns yet: `stateId`, external `policyId`, `levelStartedAt`, terminal runtime status (`stopped|exhausted`), `stoppedAt`, `stoppedReason`, and the full escalation `history`.
**Indexes**: rules by `{tenantId, enabled}`, deliveries by `{tenantId, sentAt desc}`, digests by `{tenantId, actionKey}`.

---

## 8) External APIs (WebService)

Base path: `/api/v1/notify` (Authority OpToks; scopes: `notify.admin` for write, `notify.read` for view). *All* REST calls require the tenant header `X-StellaOps-Tenant` (matches the canonical `tenantId` stored in PostgreSQL). Payloads are normalised via `NotifySchemaMigration` before persistence to guarantee schema version pinning.

Authentication today is stubbed with Bearer tokens (`Authorization: Bearer <token>`). When Authority wiring lands, this will switch to OpTok validation + scope enforcement, but the header contract will remain the same. Service configuration exposes `notify:auth:*` keys (issuer, audience, signing key, scope names) so operators can wire the Authority JWKS or (in dev) a symmetric test key. `notify:storage:*` keys cover PostgreSQL connection/schema overrides. Both sets are required for the new API surface. Internal tooling can hit `/internal/notify//normalize` to upgrade legacy JSON and return the canonical output used in the docs fixtures.

* **Channels**
  * `POST /channels` | `GET /channels` | `GET /channels/{id}` | `PATCH /channels/{id}` | `DELETE /channels/{id}`
  * `POST /channels/{id}/test` → send a sample message (no rule evaluation); returns `202 Accepted` with a rendered preview + metadata (base keys: `channelType`, `target`, `previewProvider`, `traceId` + connector-specific entries); governed by `api.rateLimits:testSend`.
  * `GET /channels/{id}/health` → connector self‑check (returns redacted metadata: secret refs hashed, sensitive config keys masked, fallbacks noted via `teams.fallbackText`/`teams.validation.*`)
* **Rules**
  * `POST /rules` | `GET /rules` | `GET /rules/{id}` | `PATCH /rules/{id}` | `DELETE /rules/{id}`
  * `POST /rules/{id}/test` → dry‑run a rule against a **sample event** (no delivery unless `--send`)
* **Deliveries**
  * `POST /deliveries` → ingest worker delivery state (idempotent via `deliveryId`).
  * `GET /deliveries?since=...&status=...&limit=...` → list envelope `{ items, count, continuationToken }` (most recent first); base metadata keys match the test-send response (`channelType`, `target`, `previewProvider`, `traceId`); rate-limited via `api.rateLimits.deliveryHistory`. See `docs/modules/notify/resources/samples/notify-delivery-list-response.sample.json`.
  * `GET /deliveries/{id}` → detail (redacted body + metadata)
  * `POST /deliveries/{id}/retry` → force retry (admin, future sprint)
* **Admin**
  * `GET /stats` (per-tenant counts, last hour/day)
  * `GET /healthz|readyz` (liveness)
  * `POST /locks/acquire` | `POST /locks/release` — worker coordination primitives (short TTL).
  * `POST /digests` | `GET /digests/{actionKey}` | `DELETE /digests/{actionKey}` — manage open digest windows.
  * `POST /audit` | `GET /audit?since=&limit=` — append/query structured audit trail entries.

### 8.1 Ack tokens & escalation workflows

To support one-click acknowledgements from chat/email, the Notify WebService mints **DSSE ack tokens** via Authority:

* `POST /notify/ack-tokens/issue` → returns a DSSE envelope (payload type `application/vnd.stellaops.notify-ack-token+json`) describing the tenant, notification/delivery ids, channel, webhook URL, nonce, permitted actions, and TTL. Requires `notify.operator`; requesting escalation requires the caller to hold `notify.escalate` (and `notify.admin` when configured).
  Issuance enforces the Authority-side webhook allowlist (`notifications.webhooks.allowedHosts`) before minting tokens.
* `POST /notify/ack-tokens/verify` → verifies the DSSE signature, enforces expiry/tenant/action constraints, and emits audit events (`notify.ack.verified`, `notify.ack.escalated`). Scope: `notify.operator` (+ `notify.escalate` for escalation).
* `POST /notify/ack-tokens/rotate` → rotates the signing key used for ack tokens, requires `notify.admin`, and emits `notify.ack.key_rotated`/`notify.ack.key_rotation_failed` audit events. Operators must supply the new key material (file/KMS/etc. depending on `notifications.ackTokens.keySource`); Authority updates JWKS entries with `use: "notify-ack"` and retires the previous key.
* `POST /internal/notifications/ack-tokens/rotate` → legacy bootstrap path (API-key protected) retained for air-gapped initial provisioning; it forwards to the same rotation pipeline as the public endpoint.

Authority signs ack tokens using keys configured under `notifications.ackTokens`. Public JWKS responses expose these keys with `use: "notify-ack"` and `status: active|retired`, enabling offline verification by the worker/UI/CLI.

Inbound PagerDuty and OpsGenie acknowledgement webhooks must resolve provider identifiers from durable delivery state (`externalId` plus incident metadata), not from process-local runtime maps. Restart-survival is a required property of the non-testing host composition.

**Ingestion**: workers do **not** expose public ingestion; they **subscribe** to the internal bus. (Optional `/events/test` for integration testing, admin-only.)

---

## 9) Delivery pipeline (worker)

```
[Event bus] → [Ingestor] → [RuleMatcher] → [Throttle/Dedupe] → [DigestCoalescer] → [Renderer] → [Connector] → [Result]
                                                                                                    └────────→ [DeliveryStore]
```

* **Ingestor**: N consumers with per‑key ordering (key = tenant|digest|namespace).
* **RuleMatcher**: loads the active rules snapshot for the tenant into memory; vectorized predicate check.
* **Throttle/Dedupe**: consult Valkey plus the PostgreSQL `notify.correlation_runtime_throttle_events` table; on a hit, record `status=throttled`.
* **DigestCoalescer**: append to the open digest window, or flush when the timer expires.
* **Renderer**: select the template (channel+locale), inject variables, enforce length limits, compute `bodyHash`.
* **Connector**: send; handle provider‑specific rate limits and backoffs; `maxAttempts` with exponential jitter; overflow → DLQ (dead‑letter topic) + UI surfacing.

**Idempotency**: a per-action **idempotency key** is stored in Valkey (TTL = throttle window or digest window). Connectors also respect **provider** idempotency where available (e.g., Slack `client_msg_id`).

---

## 10) Reliability & rate controls

* **Per‑tenant** RPM caps (default 600/min) + **per‑channel** concurrency (Slack 1–4, Teams 1–2, Email 8–32 based on relay).
* **Backoff** map: Slack 429 → respect `Retry‑After`; SMTP 4xx → retry; 5xx → retry with jitter; permanent rejects → drop with status recorded.
* **DLQ**: NATS/Valkey stream `notify.dlq` with `{event, rule, action, error}` for operator inspection; the UI shows DLQ items.

---

## 11) Security & privacy

* **AuthZ**: all APIs require **Authority** OpToks; actions are scoped by tenant.
* **Secrets**: `secretRef` only; Notify fetches just‑in‑time from the Authority Secret proxy or a K8s Secret (mounted). No plaintext secrets in the database.
* **Egress TLS**: validate TLS certificates; pin domains per channel config; optional CA bundle override for on‑prem SMTP.
* **Webhook signing**: HMAC or Ed25519 signatures in `X-StellaOps-Signature` + a replay‑window timestamp; include the canonical body hash in the header.
* **Redaction**: deliveries store **hashes** of bodies, not full payloads, for chat/email to minimize PII retention (configurable).
* **Quiet hours**: per tenant (e.g., 22:00–06:00) route high‑sev only; defer others to digests.
* **Loop prevention**: Webhook target allowlist + event origin tags; do not ingest our own webhooks.
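The HMAC variant of the webhook-signing scheme above — signature over a timestamp plus the canonical body hash, carried in `X-StellaOps-Signature`, with a replay window on verification — can be sketched as follows. This is an illustrative Python sketch only: the exact header layout, the `t=…,sha256=…` encoding, and the 300 s window are assumptions, not the documented wire format:

```python
import hashlib
import hmac

def sign_webhook(body: bytes, secret: bytes, now: float) -> dict:
    """Sketch: MAC over "<ts>.<sha256(body)>" so both the payload and the
    replay-window timestamp are covered by the signature."""
    ts = str(int(now))
    body_hash = hashlib.sha256(body).hexdigest()
    mac = hmac.new(secret, f"{ts}.{body_hash}".encode(), hashlib.sha256)
    return {
        "X-StellaOps-Signature": f"t={ts},sha256={mac.hexdigest()}",
        "X-StellaOps-Body-Sha256": body_hash,  # canonical body hash in a header
    }

def verify_webhook(body: bytes, secret: bytes, header: str,
                   now: float, replay_window: float = 300.0) -> bool:
    """Recompute the MAC from the received body and enforce the replay window."""
    fields = dict(part.split("=", 1) for part in header.split(","))
    if abs(now - float(fields["t"])) > replay_window:
        return False  # stale or future-dated: reject as a possible replay
    expected = sign_webhook(body, secret, now=float(fields["t"]))
    expected_mac = expected["X-StellaOps-Signature"].split("sha256=")[1]
    return hmac.compare_digest(expected_mac, fields["sha256"])
```

Note the constant-time comparison (`hmac.compare_digest`) on the receiver side; a plain `==` would leak timing information about the expected MAC.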
---

## 12) Observability (Prometheus + OTEL)

* `notify.events_consumed_total{kind}`
* `notify.rules_matched_total{ruleId}`
* `notify.throttled_total{reason}`
* `notify.digest_coalesced_total{window}`
* `notify.sent_total{channel}` / `notify.failed_total{channel,code}`
* `notify.delivery_latency_seconds{channel}` (end‑to‑end)
* **Tracing**: spans `ingest`, `match`, `render`, `send`; correlation id = `eventId`.
* Runbook + dashboard stub (offline import): `operations/observability.md`, `operations/dashboards/notify-observability.json` (to be populated after the next demo).

**SLO targets**

* Event→delivery p95 **≤ 30–60 s** under nominal load.
* Failure rate p95 **< 0.5%** per hour (excluding provider outages).
* Duplicate rate **≈ 0** (idempotency working).

---

## 13) Configuration (YAML)

```yaml
notify:
  authority:
    issuer: "https://authority.internal"
    require: "dpop"            # or "mtls"
  bus:
    kind: "valkey"             # or "nats" (valkey uses the redis:// protocol)
    streams:
      - "scanner.events"
      - "scheduler.events"
      - "attestor.events"
      - "zastava.events"
  postgres:
    notify:
      connectionString: "Host=postgres;Port=5432;Database=stellaops_notify;Username=stellaops;Password=stellaops;Pooling=true"
      schemaName: "notify"
      commandTimeoutSeconds: 45
  limits:
    perTenantRpm: 600
    perChannel:
      slack:   { concurrency: 2 }
      teams:   { concurrency: 1 }
      email:   { concurrency: 8 }
      webhook: { concurrency: 8 }
  digests:
    defaultWindow: "1h"
    maxItems: 100
  quietHours:
    enabled: true
    window: "22:00-06:00"
    minSeverity: "critical"
  webhooks:
    sign:
      method: "ed25519"        # or "hmac-sha256"
      keyRef: "ref://notify/webhook-sign-key"
```

---

## 14) UI touch‑points

* **Notifications → Channels**: add Slack/Teams/Email/Webhook/PagerDuty/OpsGenie; run **health**; rotate secrets.
* **Notifications → Rules**: create/edit YAML rules with linting; test with sample events; see match rate.
* **Notifications → Deliveries**: timeline with filters (status, channel, rule); inspect the last error; retry.
* **Digest preview**: shows current window contents and when it will flush.
* **Quiet hours**: configure per tenant; show overrides.
* **DLQ**: browse dead‑letters; requeue after fix.

---

## 15) Failure modes & responses

| Condition | Behavior |
| --- | --- |
| Slack 429 / Teams 429 | Respect `Retry‑After`, back off with jitter, reduce concurrency |
| SMTP transient 4xx | Retry up to `maxAttempts`; escalate to DLQ on exhaustion |
| Invalid channel secret | Mark channel unhealthy; suppress sends; surface in UI |
| Rule explosion (matches everything) | Safety valve: per‑tenant RPM caps; auto‑pause rule after X drops; UI alert |
| Bus outage | Buffer to a bounded local queue; resume consuming when healthy |
| PostgreSQL slowness | Fall back to Valkey throttles; batch‑write deliveries; shed low‑priority notifications |

---

## 16) Testing matrix

* **Unit**: matchers, throttle math, digest coalescing, idempotency keys, template rendering edge cases.
* **Connectors**: provider‑level rate limits, payload size truncation, error mapping.
* **Integration**: synthetic event storm (10k/min); verify p95 latency & duplicate rate.
* **Security**: DPoP/mTLS on APIs; secretRef resolution; webhook signing & replay windows.
* **i18n**: localized templates render deterministically.
* **Chaos**: Slack/Teams API flaps; SMTP greylisting; Valkey hiccups; ensure graceful degradation.

---

## 17) Sequences (representative)

**A) New criticals after Concelier delta (Slack immediate + Email hourly digest)**

```mermaid
sequenceDiagram
  autonumber
  participant SCH as Scheduler
  participant NO as Notify.Worker
  participant SL as Slack
  participant SMTP as Email
  SCH->>NO: bus event scheduler.rescan.delta { newCritical:1, digest:sha256:... }
  NO->>NO: match rules (Slack immediate; Email hourly digest)
  NO->>SL: chat.postMessage (concise)
  SL-->>NO: 200 OK
  NO->>NO: append to digest window (email:soc)
  Note over NO: At window close → render digest email
  NO->>SMTP: send email (detailed digest)
  SMTP-->>NO: 250 OK
```

**B) Admission deny (Teams card + Webhook)**

```mermaid
sequenceDiagram
  autonumber
  participant ZA as Zastava
  participant NO as Notify.Worker
  participant TE as Teams
  participant WH as Webhook
  ZA->>NO: bus event zastava.admission { decision: "deny", reasons: [...] }
  NO->>TE: POST adaptive card
  TE-->>NO: 200 OK
  NO->>WH: POST JSON (signed)
  WH-->>NO: 2xx
```

---

## 18) Implementation notes

* **Language**: .NET 10; minimal API; `System.Text.Json` with a canonical writer for body hashing; Channels for pipelines.
* **Bus**: Valkey Streams (**XGROUP** consumers) or NATS JetStream for at‑least‑once delivery with acks; per‑tenant consumer groups to localize backpressure.
* **Templates**: compile and cache per rule+channel+locale; version with the rule's `updatedAt` to invalidate.
* **Rules**: store raw YAML + parsed AST; validate with schema + static checks (e.g., nonsensical combos).
* **Secrets**: pluggable secret resolver (Authority Secret proxy, K8s, Vault).
* **Rate limiting**: `System.Threading.RateLimiting` + per‑connector adapters.

---

## 19) Air‑gapped bootstrap configuration

Air‑gapped deployments ship a deterministic Notifier profile inside the Bootstrap Pack. The artefacts live under `bootstrap/notify/` after running the Offline Kit builder and include:

- `notify.yaml` — configuration derived from `etc/notify.airgap.yaml`, pointing to the sealed PostgreSQL/Authority endpoints and loading connectors from the local plug‑in directory.
- `notify-web.secret.example` — template for the Authority client secret, intended to be renamed to `notify-web.secret` before deployment.
- `README.md` — operator guide (`docs/modules/notify/bootstrap-pack.md`).
These files are copied automatically by `ops/offline-kit/build_offline_kit.py` via `copy_bootstrap_configs`. Operators mount the configuration and secret into the `StellaOps.Notify.WebService` container (Compose or Kubernetes) to keep sealed‑mode roll‑outs reproducible. (The Notifier WebService was merged into the Notify WebService; the `notifier.stella-ops.local` hostname is now an alias on the `notify-web` container.)

---

## 20) Roadmap (post‑v1)

* **Jira** ticket creation and downstream issue‑state synchronization.
* **User inbox** (in‑app notifications) + mobile push via webhook relay.
* **Anomaly suppression**: auto‑pause noisy rules with hints (learned thresholds).
* **Graph rules**: “only notify if *not_affected → affected* transition at consensus layer”.
* **Label enrichment**: pluggable taggers (business criticality, data classification) to refine matchers.