
Scope. Implementation-ready architecture for Notify (aligned with Epic 11 Notifications Studio): a rules-driven, tenant-aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator-defined routing rules, renders channel-specific messages (Slack/Teams/Email/Webhook/PagerDuty/OpsGenie), and delivers them reliably with idempotency, throttling, and digests. It is UI-managed, auditable, and safe by default (no secrets leakage, no spam storms).

  • Console frontdoor compatibility (updated 2026-03-10). The web console reaches Notifier Studio through the gateway-owned /api/v1/notifier/* prefix, which translates onto the service-local /api/v2/notify/* surface without requiring browser calls to raw service-prefixed routes.

  • Console admin routing truthfulness (updated 2026-04-21). The console uses /api/v1/notify/* only for core Notify toolkit flows (channels, rules, deliveries, incidents, acknowledgements). Advanced admin configuration such as quiet-hours, throttles, escalation, and localization is owned by the Notifier frontdoor /api/v1/notifier/* -> /api/v2/notify/*; Platform no longer serves synthetic /api/v1/notify/* admin compatibility payloads. Digest schedule CRUD remains unavailable in the live API.

  • Merged Notify compat surface restoration (updated 2026-04-22). The merged src/Notify/* host now maps the admin compatibility routes expected behind /api/v1/notifier/*, including /api/v2/notify/channels*, /deliveries*, /simulate*, /quiet-hours*, /throttle-configs*, /escalation-policies*, and /overrides*. Unsupported operator override CRUD now returns an explicit 501 contract response instead of a misleading 404, and focused proof lives in src/Notify/__Tests/StellaOps.Notify.WebService.Tests/CrudEndpointsTests.cs.

  • Runtime durability cutover (updated 2026-04-16). Default src/Notifier/* production wiring now resolves queue and storage through the shared StellaOps.Notify.Persistence and StellaOps.Notify.Queue libraries. NullNotifyEventQueue is allowed only in the Testing environment, notify.pack_approvals is durable, and restart-survival proof is covered by NotifierDurableRuntimeProofTests against real Postgres + Redis.

  • Correlation incident/throttle durability (updated 2026-04-20). Non-testing Notify and Notifier hosts no longer keep incident correlation or throttle windows in process-local memory. Both hosts now swap IIncidentManager and INotifyThrottler onto PostgreSQL-backed runtime services using notify.correlation_runtime_incidents and notify.correlation_runtime_throttle_events, with restart-survival proof in NotifierCorrelationDurableRuntimeTests.

  • Localization runtime durability (updated 2026-04-20). Non-testing Notify and Notifier hosts no longer keep tenant-managed localization bundles in process-local memory. Both hosts now swap ILocalizationService onto a PostgreSQL-backed runtime service using notify.localization_bundles, while built-in system fallback strings remain compiled defaults, with restart-survival proof in NotifierLocalizationDurableRuntimeTests.

  • Storm/fallback runtime durability (updated 2026-04-20). Non-testing Notify and Notifier hosts no longer keep storm detection state, tenant fallback chains, or per-delivery fallback attempts in process-local memory. Both hosts now swap IStormBreaker and IFallbackHandler onto PostgreSQL-backed runtime services using notify.storm_runtime_states, notify.storm_runtime_events, notify.fallback_runtime_chains, and notify.fallback_runtime_delivery_states, with restart-survival proof in NotifierStormFallbackDurableRuntimeTests.

  • Escalation engine runtime durability (updated 2026-04-20). Non-testing Notify and Notifier hosts no longer keep live IEscalationEngine state in a process-local dictionary. Both hosts now swap IEscalationEngine onto a PostgreSQL-backed runtime service using notify.escalation_states, with restart-survival proof in NotifierEscalationRuntimeDurableTests and startup-contract proof in NotifyEscalationRuntimeStartupContractTests.

  • External ack/runtime channel durability (updated 2026-04-20). Non-testing Notifier worker hosts no longer depend on a process-local external-id bridge map or a webhook-only dispatch composition for external channels. The worker now composes WebhookChannelDispatcher for chat/webhook routes plus AdapterChannelDispatcher for Email, PagerDuty, and OpsGenie, durably records provider externalId plus incidentId metadata into PostgreSQL-backed delivery state, and resolves PagerDuty/OpsGenie webhook acknowledgements through PostgreSQL-backed lookup after restart. Focused proof lives in NotifierWorkerHostWiringTests and NotifierAckBridgeRuntimeDurableTests.

  • Digest scheduler runtime composition (updated 2026-04-20). The non-testing Notifier worker now composes DigestScheduleRunner, DigestGenerator, and ChannelDigestDistributor in the live host. Scheduled digests remain configuration-driven and now resolve tenant IDs from Notifier:DigestSchedule:Schedules:*:TenantIds through ConfiguredDigestTenantProvider instead of the process-local InMemoryDigestTenantProvider. There is currently no operator-managed digest schedule CRUD surface in the live runtime; /digests administers open digest windows only. Focused proof lives in NotifierWorkerHostWiringTests.

  • Suppression admin durability (updated 2026-04-16). Non-testing throttle configuration and operator override APIs no longer use live in-memory state. Both hosts now resolve canonical /api/v2/throttles* and /api/v2/overrides* plus legacy /api/v2/notify/throttle-configs* and /api/v2/notify/overrides* through PostgreSQL-backed suppression services, with restart-survival proof in NotifierSuppressionDurableRuntimeTests.

  • Escalation/on-call durability (updated 2026-04-16). Non-testing escalation-policy and on-call schedule APIs no longer use live in-memory services or compat repositories. Both hosts now resolve canonical /api/v2/escalation-policies* and /api/v2/oncall-schedules* plus legacy /api/v2/notify/escalation-policies* and /api/v2/notify/oncall-schedules* through PostgreSQL-backed runtime services, with restart-survival proof in NotifierEscalationOnCallDurableRuntimeTests.

  • Quiet-hours/maintenance durability (updated 2026-04-20). Non-testing quiet-hours calendars and maintenance windows no longer use live in-memory compat repositories or maintenance evaluators. Both hosts now resolve canonical /api/v2/quiet-hours* plus legacy /api/v2/notify/quiet-hours* and /api/v2/notify/maintenance-windows* through PostgreSQL-backed runtime services on the shared notify.quiet_hours and notify.maintenance_windows tables, with restart-survival proof in NotifierQuietHoursMaintenanceDurableRuntimeTests. Fixed-time daily/weekly cron expressions still project truthfully into canonical schedules, and compat-authored cron shapes that cannot be flattened losslessly now evaluate natively from persisted cronExpression plus duration metadata instead of remaining inert after restart.

  • Security/dead-letter durability (updated 2026-04-16). Non-testing webhook security, tenant isolation, dead-letter administration, and retention cleanup state no longer use live in-memory services. Both hosts now resolve /api/v2/security*, /api/v2/notify/dead-letter*, /api/v1/observability/dead-letters*, and retention endpoints through PostgreSQL-backed runtime services on shared notify.webhook_security_configs, notify.webhook_validation_nonces, notify.tenant_resource_owners, notify.cross_tenant_grants, notify.tenant_isolation_violations, notify.dead_letter_entries, notify.retention_policies_runtime, and notify.retention_cleanup_executions_runtime tables, with restart-survival proof in NotifierSecurityDeadLetterDurableRuntimeTests.

  • Testing-only fallback boundary (updated 2026-04-20). src/Notifier/* host startup now registers those durable quiet-hours, suppression, escalation/on-call, security, and dead-letter services directly for non-testing environments instead of composing an in-memory graph and replacing it later. The remaining in-memory admin services are isolated to Testing, with startup-contract proof in StartupDependencyWiringTests.

  • Simulation runtime parity (updated 2026-04-20). The canonical /api/v2/simulate* endpoints and the legacy /api/v2/notify/simulate* endpoints in src/Notifier/ now resolve the same DI-composed simulation runtime, so throttling plus quiet-hours or maintenance suppression behave identically across route families.

0) Mission & boundaries

Mission. Convert facts from Stella Ops into actionable, noise-controlled signals where teams already live (chat, email, paging, and webhooks), with explainable reasons and deep links to the UI.

Boundaries.

  • Notify does not make policy decisions and does not rescan; it consumes events from Scanner/Scheduler/Excititor/Concelier/Attestor/Zastava and routes them.
  • Attachments are links (UI/attestation pages); Notify does not attach SBOMs or large blobs to messages.
  • Secrets for channels (Slack tokens, SMTP creds) are referenced, not stored raw in the database.
  • 2025-11-02 module boundary. Maintain src/Notify/ as the reusable notification toolkit (engine, storage, queue, connectors) and src/Notifier/ as the Notifications Studio host that composes those libraries. Do not merge directories without an approved packaging RFC that covers build impacts, offline kit parity, and cross-module governance.
  • API versioning (updated 2026-02-22). The API is split across two services:
    • Notify (src/Notify/) exposes /api/v1/notify — the core notification toolkit (rules, channels, deliveries, templates). This is the lean, canonical API surface.
    • Notifier (src/Notifier/) exposes /api/v2/notify — the full Notifications Studio with enterprise features (escalation policies, on-call schedules, storm breaker, inbox, retention, simulation, quiet hours, 73+ routes). Notifier also maintains select /api/v1/notify endpoints for backward compatibility.
    • Both versions are actively maintained and in production. v2 is NOT deprecated; it is the enterprise-tier API hosted by the Notifier Studio service. The previous claim that v2 was "compatibility-only" is stale and has been corrected.

1) Runtime shape & projects

src/
 ├─ StellaOps.Notify.WebService/        # REST: rules/channels CRUD, test send, deliveries browse
 ├─ StellaOps.Notify.Worker/            # consumers + evaluators + renderers + delivery workers
 ├─ StellaOps.Notify.Connectors.* /     # channel plug-ins: Slack, Teams, Email, Webhook (v1)
 │    └─ *.Tests/
 ├─ StellaOps.Notify.Engine/            # rules engine, templates, idempotency, digests, throttles
 ├─ StellaOps.Notify.Models/            # DTOs (Rule, Channel, Event, Delivery, Template)
 ├─ StellaOps.Notify.Storage.Postgres/  # canonical persistence (notify schema)
 ├─ StellaOps.Notify.Queue/             # bus client (Valkey Streams/NATS JetStream)
 └─ StellaOps.Notify.Tests.*            # unit/integration/e2e

Deployables:

  • Notify.WebService (stateless API)
  • Notify.Worker (horizontal scale)

Dependencies: Authority (OpToks; DPoP/mTLS), PostgreSQL (notify schema), Valkey/NATS (bus), HTTP egress to Slack/Teams/Webhooks/PagerDuty/OpsGenie, SMTP relay for Email.

Configuration. Notify.WebService bootstraps from notify.yaml (see etc/notify.yaml.sample). Use storage.driver: postgres and provide postgres.notify options (connectionString, schemaName, pool sizing, timeouts). Authority settings follow the platform defaults—when running locally without Authority, set authority.enabled: false and supply developmentSigningKey so JWTs can be validated offline.

api.rateLimits exposes token-bucket controls for delivery history queries and test-send previews (deliveryHistory, testSend). Default values allow generous browsing while preventing accidental bursts; operators can relax/tighten the buckets per deployment.
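These limits follow standard token-bucket semantics. A minimal sketch in Python (illustrative only: `TokenBucket` is a hypothetical helper, and the capacity and refill values here are not the shipped defaults):

```python
import time

class TokenBucket:
    """Illustrative token bucket, as used conceptually for deliveryHistory/testSend limits."""

    def __init__(self, capacity, refill_per_sec, now=time.monotonic):
        self.capacity = capacity
        self.tokens = float(capacity)      # start full: generous browsing
        self.refill_per_sec = refill_per_sec
        self.now = now
        self.last = now()

    def try_acquire(self, cost=1.0):
        t = self.now()
        # refill proportionally to elapsed time, clamped at capacity
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.refill_per_sec)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                       # burst exhausted: caller rejects the request

# deterministic clock for demonstration
clock = [0.0]
bucket = TokenBucket(capacity=2, refill_per_sec=1.0, now=lambda: clock[0])
print(bucket.try_acquire(), bucket.try_acquire(), bucket.try_acquire())  # True True False
clock[0] += 1.0
print(bucket.try_acquire())  # True: one token refilled after a second
```

Tightening a bucket means lowering capacity (burst tolerance) or the refill rate (sustained throughput) independently.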

Plug-ins. All channel connectors are packaged under <baseDirectory>/plugins/notify. The ordered load list must start with Slack/Teams before Email/Webhook so chat-first actions are registered deterministically for Offline Kit bundles:

plugins:
  baseDirectory: "/var/opt/stellaops"
  directory: "plugins/notify"
  orderedPlugins:
    - StellaOps.Notify.Connectors.Slack
    - StellaOps.Notify.Connectors.Teams
    - StellaOps.Notify.Connectors.Email
    - StellaOps.Notify.Connectors.Webhook

The Offline Kit job simply copies the plugins/notify tree into the air-gapped bundle; the ordered list keeps connector manifests stable across environments.

In the hosted Notifier worker, delivery execution is split across two deterministic dispatch paths: WebhookChannelDispatcher continues to handle chat/webhook routes, while AdapterChannelDispatcher resolves Email, PagerDuty, and OpsGenie through IChannelAdapterFactory. The provider externalId emitted by those adapter-backed channels must survive persistence so inbound webhook acknowledgements can be resolved after restart.

Authority clients. Register two OAuth clients in StellaOps Authority: notify-web-dev (audience notify.dev) for development and notify-web (audience notify) for staging/production. Both require notify.read and notify.admin scopes and use DPoP-bound client credentials (client_secret in the samples). Reference entries live in etc/authority.yaml.sample, with placeholder secrets under etc/secrets/notify-web*.secret.example.


2) Responsibilities

  1. Ingest platform events from the internal bus with strong ordering per key (e.g., image digest).
  2. Evaluate rules (tenant-scoped) with matchers: severity changes, namespaces, repos, labels, KEV flags, provider provenance (VEX), component keys, admission decisions, etc.
  3. Control noise: throttle, coalesce (digest windows), and dedupe via idempotency keys.
  4. Render channel-specific messages using safe templates; include evidence and links.
  5. Deliver with retries/backoff; record outcome; expose delivery history to UI.
  6. Test paths (send test to channel targets) without touching live rules.
  7. Audit: log who configured what, when, and why a message was sent.

3) Event model (inputs)

Notify subscribes to the internal event bus (events produced by services as escaped JSON; gzip allowed with caps):

  • scanner.scan.completed — new SBOM(s) composed; artifacts ready
  • scanner.report.ready — analysis verdict (policy+vex) available; carries deltas summary
  • scheduler.rescan.delta — new findings after Concelier/Excititor deltas (already summarized)
  • attestor.logged — Rekor UUID returned (sbom/report/vex export)
  • zastava.admission — admit/deny with reasons, namespace, image digests
  • concelier.export.completed — new export ready (rarely notified directly; usually drives Scheduler)
  • excititor.export.completed — new consensus snapshot (ditto)

Canonical envelope (bus → Notify.Engine):

{
  "eventId": "uuid",
  "kind": "scanner.report.ready",
  "tenant": "tenant-01",
  "ts": "2025-10-18T05:41:22Z",
  "actor": "scanner-webservice",
  "scope": { "namespace":"payments", "repo":"ghcr.io/acme/api", "digest":"sha256:..." },
  "payload": { /* kind-specific fields, see below */ }
}

Examples (payload cores):

  • scanner.report.ready:

    {
      "reportId": "report-3def...",
      "verdict": "fail",
      "summary": {"total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2},
      "delta": {"newCritical": 1, "kev": ["CVE-2025-..."]},
      "links": {"ui": "https://ui/.../reports/report-3def...", "rekor": "https://rekor/..."},
      "dsse": { "...": "..." },
      "report": { "...": "..." }
    }
    

    Payload embeds both the canonical report document and the DSSE envelope so connectors, Notify, and UI tooling can reuse the signed bytes without re-serialising.

  • scanner.scan.completed:

    {
      "reportId": "report-3def...",
      "digest": "sha256:...",
      "verdict": "fail",
      "summary": {"total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2},
      "delta": {"newCritical": 1, "kev": ["CVE-2025-..."]},
      "policy": {"revisionId": "rev-42", "digest": "27d2..."},
      "findings": [{"id": "finding-1", "severity": "Critical", "cve": "CVE-2025-...", "reachability": "runtime"}],
      "dsse": { "...": "..." }
    }
    
  • zastava.admission:

    { "decision":"deny|allow", "reasons":["unsigned image","missing SBOM"],
      "images":[{"digest":"sha256:...","signed":false,"hasSbom":false}] }
    

4) Rules engine — semantics

Rule shape (simplified):

name: "high-critical-alerts-prod"
enabled: true
match:
  eventKinds: ["scanner.report.ready","scheduler.rescan.delta","zastava.admission"]
  namespaces: ["prod-*"]
  repos: ["ghcr.io/acme/*"]
  minSeverity: "high"            # min of new findings (delta context)
  kev: true                      # require KEV-tagged or allow any if false
  verdict: ["fail","deny"]       # filter for report/admission
  vex:
    includeRejectedJustifications: false    # notify only on accepted 'affected'
actions:
  - channel: "slack:sec-alerts"  # reference to Channel object
    template: "concise"
    throttle: "5m"
  - channel: "email:soc"
    digest: "hourly"
    template: "detailed"

Evaluation order

  1. Tenant check → discard if rule tenant ≠ event tenant.
  2. Kind filter → discard early.
  3. Scope match (namespace/repo/labels).
  4. Delta/severity gates (if event carries delta).
  5. VEX gate (drop if the event's finding is not affected under policy consensus unless the rule says otherwise).
  6. Throttling/dedup (idempotency key) — skip if suppressed.
  7. Actions → enqueue per-channel job(s).
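The ordered gates above can be collapsed into a single predicate. A hedged sketch in Python: the field names mirror the rule shape in this section, but `maxSeverity` on the delta, the severity ranking, and glob matching via `fnmatch` are illustrative assumptions, not the engine's actual implementation:

```python
from fnmatch import fnmatch

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def rule_matches(rule, event):
    """Ordered gates: tenant -> kind -> scope -> severity/KEV -> verdict."""
    if rule["tenant"] != event["tenant"]:
        return False                                   # 1. tenant check
    m = rule["match"]
    if event["kind"] not in m["eventKinds"]:
        return False                                   # 2. kind filter
    scope = event["scope"]
    if not any(fnmatch(scope["namespace"], p) for p in m.get("namespaces", ["*"])):
        return False                                   # 3. scope match
    if not any(fnmatch(scope["repo"], p) for p in m.get("repos", ["*"])):
        return False
    delta = event["payload"].get("delta", {})
    if m.get("minSeverity"):                           # 4. delta/severity gate
        if SEVERITY_RANK[delta.get("maxSeverity", "low")] < SEVERITY_RANK[m["minSeverity"]]:
            return False
    if m.get("kev") and not delta.get("kev"):          # KEV required but absent
        return False
    if m.get("verdict") and event["payload"].get("verdict") not in m["verdict"]:
        return False
    return True

rule = {"tenant": "tenant-01",
        "match": {"eventKinds": ["scanner.report.ready"], "namespaces": ["prod-*"],
                  "repos": ["ghcr.io/acme/*"], "minSeverity": "high",
                  "kev": True, "verdict": ["fail", "deny"]}}
event = {"tenant": "tenant-01", "kind": "scanner.report.ready",
         "scope": {"namespace": "prod-payments", "repo": "ghcr.io/acme/api"},
         "payload": {"verdict": "fail",
                     "delta": {"maxSeverity": "critical", "kev": ["CVE-2025-0001"]}}}
print(rule_matches(rule, event))  # True
```

Evaluating the cheap discriminators (tenant, kind) first keeps the hot path short when most rules do not apply.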

Idempotency key: hash(ruleId | actionId | event.kind | scope.digest | delta.hash | day-bucket); ensures the “same alert” doesn't fire more than once within the throttle window.
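A minimal sketch of that key derivation. The source specifies only the key material; SHA-256, the pipe join, and the UTC day-bucket encoding here are assumptions:

```python
import hashlib
from datetime import datetime, timezone

def idempotency_key(rule_id, action_id, kind, digest, delta_hash, ts):
    """hash(ruleId | actionId | event.kind | scope.digest | delta.hash | day-bucket)"""
    day_bucket = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")  # assumed UTC day bucket
    material = "|".join([rule_id, action_id, kind, digest, delta_hash, day_bucket])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

same_day_a = idempotency_key("rule-1", "act-1", "scanner.report.ready",
                             "sha256:abc", "deadbeef",
                             datetime(2025, 10, 18, 5, 41, tzinfo=timezone.utc))
same_day_b = idempotency_key("rule-1", "act-1", "scanner.report.ready",
                             "sha256:abc", "deadbeef",
                             datetime(2025, 10, 18, 23, 59, tzinfo=timezone.utc))
print(same_day_a == same_day_b)  # True: same alert, same day bucket
```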

Digest windows: maintain a per-action coalescer:

  • Window: 5m|15m|1h|1d (configurable); coalesces events by tenant + namespace/repo or by digest group.
  • Digest messages summarize top N items and counts, with safe truncation.
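A sketch of one such coalescer, assuming a monotonic clock passed in by the caller and the maxItems cap from the configuration section; the class and method names are hypothetical:

```python
class DigestCoalescer:
    """Illustrative per-action digest window: coalesce events, flush on expiry."""

    def __init__(self, window_seconds, max_items=100):
        self.window = window_seconds
        self.max_items = max_items
        self.opened_at = None
        self.items = []

    def add(self, event, now):
        if self.opened_at is None:
            self.opened_at = now           # first event opens the window
        if len(self.items) < self.max_items:
            self.items.append(event)       # cap retained items; real digests summarize top N + counts

    def flush_if_due(self, now):
        if self.opened_at is not None and now - self.opened_at >= self.window:
            batch, self.items, self.opened_at = self.items, [], None
            return batch                   # caller renders one digest message from the batch
        return None

c = DigestCoalescer(window_seconds=3600)
c.add({"eventId": "e1"}, now=0)
c.add({"eventId": "e2"}, now=10)
print(c.flush_if_due(now=100))        # None: window still open
print(len(c.flush_if_due(now=3600)))  # 2
```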

5) Channels & connectors (plugins)

Channel config is two-part: a Channel record (name, type, options) and a Secret reference (Vault/K8s Secret). Connectors are restart-time plug-ins discovered on service start (same manifest convention as Concelier/Excititor) and live under plugins/notify/<channel>/.

Built-in channels:

  • Slack: Bot token (xoxb…), chat.postMessage + blocks; rate limit aware (HTTP 429).
  • Microsoft Teams: Incoming Webhook (or Graph card later); adaptive card payloads.
  • Email (SMTP): TLS (STARTTLS or implicit), From/To/CC/BCC; HTML+text alt; DKIM optional.
  • Generic Webhook: POST JSON with an HMAC-SHA256 or Ed25519 signature in headers.
  • PagerDuty: Events API v2 trigger/ack/resolve flow; durable dedup_key/external id mapping is persisted with delivery state for restart-safe webhook acknowledgement handling.
  • OpsGenie: Alert create/ack/close flow; alias/external id is persisted with delivery state so inbound acknowledgement webhooks remain restart-safe.

Connector contract: (implemented by plug-in assemblies)

public interface INotifyConnector {
  string Type { get; } // "slack" | "teams" | "email" | "webhook" | ...
  Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct);
  Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct);
}

For hosted external channels, Notifier worker adapters implement IChannelAdapter and are selected by AdapterChannelDispatcher. Those adapters must emit stable provider identifiers (externalId, incidentId where applicable) so the IAckBridge webhook path can recover correlation from persisted delivery rows instead of process-local memory.

DeliveryContext includes rendered content and raw event for audit.

Test-send previews. Plug-ins can optionally implement INotifyChannelTestProvider to shape /channels/{id}/test responses. Providers receive a sanitised ChannelTestPreviewContext (channel, tenant, target, timestamp, trace) and return a NotifyDeliveryRendered preview + metadata. When no provider is present, the host falls back to a generic preview so the endpoint always responds.

Secrets: ChannelConfig.secretRef points to an Authority-managed secret handle or K8s Secret path; workers load at send-time; plug-in manifests (notify-plugin.json) declare capabilities and version.


6) Templates & rendering

Template engine: strongly typed, safe Handlebars-style; no arbitrary code. Partial templates per channel. Deterministic outputs (property order, no locale drift unless requested).

Variables (examples):

  • event.kind, event.ts, scope.namespace, scope.repo, scope.digest
  • payload.verdict, payload.delta.newCritical, payload.links.ui, payload.links.rekor
  • topFindings[] with purl, vulnId, severity
  • policy.name, policy.revision (if available)

Helpers:

  • severity_icon(sev), link(text,url), pluralize(n, "finding"), truncate(text, n), code(text).
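In spirit, those helpers might look like the following (Python for brevity; the icon mapping and the Slack-style link syntax are illustrative assumptions):

```python
def severity_icon(sev):
    # illustrative mapping; real icons are channel-specific
    return {"critical": "🔴", "high": "🟠", "medium": "🟡", "low": "⚪"}.get(sev.lower(), "⚪")

def link(text, url):
    return f"<{url}|{text}>"   # Slack mrkdwn form shown; other channels differ

def pluralize(n, noun):
    return f"{n} {noun}" if n == 1 else f"{n} {noun}s"

def truncate(text, n):
    # deterministic truncation with a single-character ellipsis
    return text if len(text) <= n else text[: max(0, n - 1)] + "…"

def code(text):
    return f"`{text}`"

print(pluralize(1, "finding"))   # 1 finding
print(pluralize(3, "finding"))   # 3 findings
print(truncate("abcdefgh", 5))   # abcd…
```

Keeping helpers pure and deterministic is what lets rendered bodies hash stably across workers.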

Channel mapping:

  • Slack: title + blocks, limited to 50 blocks/3000 chars per section; long lists → link to UI.
  • Teams: Adaptive Card schema 1.5; fallback text for older channels (surfaced as teams.fallbackText metadata alongside webhook hash).
  • Email: HTML + text; inline table of top N findings, rest behind UI link.
  • Webhook: JSON with event, ruleId, actionId, summary, links, and raw payload subset.

i18n: template set per locale (English default; Bulgarian built-in).


7) Data model (PostgreSQL)

Canonical JSON Schemas for rules/channels/events live in docs/modules/notify/resources/schemas/. Sample payloads intended for tests/UI mock responses are captured in docs/modules/notify/resources/samples/.

Database: stellaops_notify (PostgreSQL)

  • rules

    { _id, tenantId, name, enabled, match, actions, createdBy, updatedBy, createdAt, updatedAt }
    
  • channels

    { _id, tenantId, name:"slack:sec-alerts", type:"slack",
      config:{ webhookUrl?:"", channel:"#sec-alerts", workspace?: "...", secretRef:"ref://..." },
      createdAt, updatedAt }
    
  • deliveries

    { _id, tenantId, ruleId, actionId, eventId, kind, scope, status:"sent|failed|throttled|digested|dropped",
      externalId?, metadata?,
      attempts:[{ts, status, code, reason}],
      rendered:{ title, body, target },    // redacted for PII; body hash stored
      sentAt, lastError? }
    

    PagerDuty and OpsGenie deliveries durably carry the provider externalId plus metadata.incidentId so inbound webhook acknowledgements can be resolved after worker restart without relying on a process-local bridge map.

  • digests

    { _id, tenantId, actionKey, window:"hourly", openedAt, items:[{eventId, scope, delta}], status:"open|flushed" }
    
  • correlation_runtime_incidents

    { tenantId, incidentId, correlationKey, eventKind, title, status:"open|acknowledged|resolved",
      eventCount, firstOccurrence, lastOccurrence, acknowledgedBy?, resolvedBy?, eventIds:[eventId...] }
    
  • correlation_runtime_throttle_events

    { tenantId, correlationKey, occurredAt }   // short-lived, also cached in Valkey
    
  • escalation_states

    { tenantId, policyId, incidentId?, correlationId, currentStep, repeatIteration,
      status:"active|acknowledged|resolved|expired", startedAt, nextEscalationAt,
      acknowledgedAt?, acknowledgedBy?, metadata }
    

    correlationId is the durable lookup key for the live string incident id used by the runtime engine. metadata carries the runtime-only fields that do not fit the canonical columns yet: stateId, external policyId, levelStartedAt, terminal runtime status (stopped|exhausted), stoppedAt, stoppedReason, and the full escalation history.

Indexes: rules by {tenantId, enabled}, deliveries by {tenantId, sentAt desc}, digests by {tenantId, actionKey}.


8) External APIs (WebService)

Base path: /api/v1/notify (Authority OpToks; scopes: notify.admin for write, notify.read for view).

All REST calls require the tenant header X-StellaOps-Tenant (matches the canonical tenantId stored in PostgreSQL). Payloads are normalised via NotifySchemaMigration before persistence to guarantee schema version pinning.

Authentication today is stubbed with Bearer tokens (Authorization: Bearer <token>). When Authority wiring lands, this will switch to OpTok validation + scope enforcement, but the header contract will remain the same.

Service configuration exposes notify:auth:* keys (issuer, audience, signing key, scope names) so operators can wire the Authority JWKS or (in dev) a symmetric test key. notify:storage:* keys cover PostgreSQL connection/schema overrides. Both sets are required for the new API surface.

Internal tooling can hit /internal/notify/<entity>/normalize to upgrade legacy JSON and return canonical output used in the docs fixtures.

  • Channels

    • POST /channels | GET /channels | GET /channels/{id} | PATCH /channels/{id} | DELETE /channels/{id}
    • POST /channels/{id}/test → send sample message (no rule evaluation); returns 202 Accepted with rendered preview + metadata (base keys: channelType, target, previewProvider, traceId + connector-specific entries); governed by api.rateLimits:testSend.
    • GET /channels/{id}/health → connector self-check (returns redacted metadata: secret refs hashed, sensitive config keys masked, fallbacks noted via teams.fallbackText/teams.validation.*)

  • Rules

    • POST /rules | GET /rules | GET /rules/{id} | PATCH /rules/{id} | DELETE /rules/{id}
    • POST /rules/{id}/test → dry-run rule against a sample event (no delivery unless --send)
  • Deliveries

    • POST /deliveries → ingest worker delivery state (idempotent via deliveryId).
    • GET /deliveries?since=...&status=...&limit=... → list envelope { items, count, continuationToken } (most recent first); base metadata keys match the test-send response (channelType, target, previewProvider, traceId); rate-limited via api.rateLimits.deliveryHistory. See docs/modules/notify/resources/samples/notify-delivery-list-response.sample.json.
    • GET /deliveries/{id} → detail (redacted body + metadata)
    • POST /deliveries/{id}/retry → force retry (admin, future sprint)
  • Admin

    • GET /stats (per tenant counts, last hour/day)
    • GET /healthz|readyz (liveness)
    • POST /locks/acquire | POST /locks/release → worker coordination primitives (short TTL).
    • POST /digests | GET /digests/{actionKey} | DELETE /digests/{actionKey} → manage open digest windows.
    • POST /audit | GET /audit?since=&limit= → append/query structured audit trail entries.

8.1 Ack tokens & escalation workflows

To support one-click acknowledgements from chat/email, the Notify WebService mints DSSE ack tokens via Authority:

  • POST /notify/ack-tokens/issue → returns a DSSE envelope (payload type application/vnd.stellaops.notify-ack-token+json) describing the tenant, notification/delivery ids, channel, webhook URL, nonce, permitted actions, and TTL. Requires notify.operator; requesting escalation requires the caller to hold notify.escalate (and notify.admin when configured). Issuance enforces the Authority-side webhook allowlist (notifications.webhooks.allowedHosts) before minting tokens.
  • POST /notify/ack-tokens/verify → verifies the DSSE signature, enforces expiry/tenant/action constraints, and emits audit events (notify.ack.verified, notify.ack.escalated). Scope: notify.operator (+notify.escalate for escalation).
  • POST /notify/ack-tokens/rotate → rotates the signing key used for ack tokens, requires notify.admin, and emits notify.ack.key_rotated/notify.ack.key_rotation_failed audit events. Operators must supply the new key material (file/KMS/etc. depending on notifications.ackTokens.keySource); Authority updates JWKS entries with use: "notify-ack" and retires the previous key.
  • POST /internal/notifications/ack-tokens/rotate → legacy bootstrap path (API-key protected) retained for air-gapped initial provisioning; it forwards to the same rotation pipeline as the public endpoint.

Authority signs ack tokens using keys configured under notifications.ackTokens. Public JWKS responses expose these keys with use: "notify-ack" and status: active|retired, enabling offline verification by the worker/UI/CLI.

Inbound PagerDuty and OpsGenie acknowledgement webhooks must resolve provider identifiers from durable delivery state (externalId plus incident metadata), not from process-local runtime maps. Restart-survival is a required property of the non-testing host composition.

Ingestion: workers do not expose public ingestion; they subscribe to the internal bus. (Optional /events/test for integration testing, admin-only.)


9) Delivery pipeline (worker)

[Event bus] → [Ingestor] → [RuleMatcher] → [Throttle/Dedupe] → [DigestCoalescer] → [Renderer] → [Connector] → [Result]
                                                 └────────→ [DeliveryStore]
  • Ingestor: N consumers with per-key ordering (key = tenant|digest|namespace).
  • RuleMatcher: loads active rules snapshot for tenant into memory; vectorized predicate check.
  • Throttle/Dedupe: consult Valkey plus PostgreSQL notify.correlation_runtime_throttle_events; if hit → record status=throttled.
  • DigestCoalescer: append to open digest window or flush when timer expires.
  • Renderer: select template (channel+locale), inject variables, enforce length limits, compute bodyHash.
  • Connector: send; handle provider-specific rate limits and backoffs; maxAttempts with exponential jitter; overflow → DLQ (dead-letter topic) + UI surfacing.

Idempotency: a per-action idempotency key stored in Valkey (TTL = throttle window or digest window). Connectors also respect provider idempotency where available (e.g., Slack client_msg_id).
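The Valkey-side check reduces to an atomic set-if-absent with a TTL (Redis/Valkey `SET key NX EX ttl`). An in-memory stand-in that mimics those semantics, for illustration only:

```python
class TtlKeyStore:
    """In-memory stand-in for Valkey `SET key NX EX ttl` (illustrative only)."""

    def __init__(self):
        self._expiry = {}

    def set_nx_ex(self, key, ttl, now):
        exp = self._expiry.get(key)
        if exp is not None and exp > now:
            return False            # key still alive: suppress (throttled/digested)
        self._expiry[key] = now + ttl
        return True                 # first sighting in window: proceed to deliver

store = TtlKeyStore()
print(store.set_nx_ex("idem:rule-1:act-1", ttl=300, now=0))    # True  -> send
print(store.set_nx_ex("idem:rule-1:act-1", ttl=300, now=60))   # False -> throttled
print(store.set_nx_ex("idem:rule-1:act-1", ttl=300, now=301))  # True  -> window expired
```

In production the real command's atomicity is what makes the check safe across horizontally scaled workers; this sketch only shows the decision logic.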


10) Reliability & rate controls

  • Per-tenant RPM caps (default 600/min) + per-channel concurrency (Slack 1–4, Teams 1–2, Email 8–32 depending on relay).
  • Backoff map: Slack 429 → respect Retry-After; SMTP 4xx → retry; 5xx → retry with jitter; permanent rejects → drop with status recorded.
  • DLQ: NATS/Valkey stream notify.dlq with {event, rule, action, error} for operator inspection; UI shows DLQ items.
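The retry timing above can be sketched as full-jitter exponential backoff with the provider's Retry-After taking precedence when present; `next_delay`, the base, and the cap are illustrative assumptions, not the shipped tuning:

```python
import random

def next_delay(attempt, retry_after=None, base=1.0, cap=300.0, rng=random.random):
    """Full-jitter exponential backoff; an explicit provider Retry-After wins."""
    if retry_after is not None:
        return retry_after                  # e.g. Slack 429 Retry-After seconds
    ceiling = min(cap, base * (2 ** attempt))
    return ceiling * rng()                  # jitter uniformly in [0, ceiling)

print(next_delay(3, rng=lambda: 0.5))       # 4.0   (min(300, 2**3) * 0.5)
print(next_delay(5, retry_after=30.0))      # 30.0  (provider-directed)
print(next_delay(20, rng=lambda: 0.5))      # 150.0 (capped at 300 before jitter)
```

Full jitter spreads retries across the whole window, which avoids synchronized retry storms when many deliveries fail at once.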

11) Security & privacy

  • AuthZ: all APIs require Authority OpToks; actions scoped by tenant.
  • Secrets: secretRef only; Notify fetches just-in-time from the Authority Secret proxy or a K8s Secret (mounted). No plaintext secrets in the database.
  • Egress TLS: validate certificates; pin domains per channel config; optional CA bundle override for on-prem SMTP.
  • Webhook signing: HMAC or Ed25519 signatures in X-StellaOps-Signature + a replay-window timestamp; include the canonical body hash in a header.
  • Redaction: deliveries store hashes of bodies, not full payloads for chat/email to minimize PII retention (configurable).
  • Quiet hours: per tenant (e.g., 22:00–06:00); route high-severity only and defer the rest to digests.
  • Loop prevention: Webhook target allowlist + event origin tags; do not ingest own webhooks.
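A hedged sketch of the webhook signing and replay-window verification described above, using HMAC-SHA256. The `t=...,v1=...` layout inside X-StellaOps-Signature and the body-hash header name are assumptions; only the signature header, the replay-window timestamp, and the canonical body hash are stated in this document:

```python
import hashlib, hmac, json

def sign_webhook(secret, body, timestamp):
    """Illustrative HMAC-SHA256 signing; the exact header layout is an assumption."""
    body_hash = hashlib.sha256(body).hexdigest()
    material = f"{timestamp}.{body_hash}".encode("utf-8")
    signature = hmac.new(secret, material, hashlib.sha256).hexdigest()
    return {
        "X-StellaOps-Signature": f"t={timestamp},v1={signature}",
        "X-StellaOps-Body-Sha256": body_hash,   # hypothetical header name
    }

def verify_webhook(secret, body, headers, now, window=300):
    parts = dict(p.split("=", 1) for p in headers["X-StellaOps-Signature"].split(","))
    ts = int(parts["t"])
    if abs(now - ts) > window:                  # replay-window check
        return False
    expected = sign_webhook(secret, body, ts)["X-StellaOps-Signature"]
    return hmac.compare_digest(expected, headers["X-StellaOps-Signature"])

body = json.dumps({"event": "scanner.report.ready"}).encode()
h = sign_webhook(b"s3cret", body, timestamp=1_760_000_000)
print(verify_webhook(b"s3cret", body, h, now=1_760_000_060))   # True
print(verify_webhook(b"s3cret", body, h, now=1_760_003_600))   # False: outside replay window
```

Constant-time comparison (`hmac.compare_digest`) and binding the timestamp into the signed material are the two properties any concrete layout should preserve.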

12) Observability (Prometheus + OTEL)

  • notify.events_consumed_total{kind}
  • notify.rules_matched_total{ruleId}
  • notify.throttled_total{reason}
  • notify.digest_coalesced_total{window}
  • notify.sent_total{channel} / notify.failed_total{channel,code}
  • notify.delivery_latency_seconds{channel} (end-to-end)
  • Tracing: spans ingest, match, render, send; correlation id = eventId.
  • Runbook + dashboard stub (offline import): operations/observability.md, operations/dashboards/notify-observability.json (to be populated after next demo).

SLO targets

  • Event→delivery p95 ≤ 30-60 s under nominal load.
  • Failure rate p95 < 0.5% per hour (excluding provider outages).
  • Duplicate rate ≈ 0 (idempotency working).

13) Configuration (YAML)

```yaml
notify:
  authority:
    issuer: "https://authority.internal"
    require: "dpop"               # or "mtls"
  bus:
    kind: "valkey"                # or "nats" (valkey uses redis:// protocol)
    streams:
      - "scanner.events"
      - "scheduler.events"
      - "attestor.events"
      - "zastava.events"
  postgres:
    notify:
      connectionString: "Host=postgres;Port=5432;Database=stellaops_notify;Username=stellaops;Password=stellaops;Pooling=true"
      schemaName: "notify"
      commandTimeoutSeconds: 45
  limits:
    perTenantRpm: 600
    perChannel:
      slack:   { concurrency: 2 }
      teams:   { concurrency: 1 }
      email:   { concurrency: 8 }
      webhook: { concurrency: 8 }
  digests:
    defaultWindow: "1h"
    maxItems: 100
  quietHours:
    enabled: true
    window: "22:00-06:00"
    minSeverity: "critical"
  webhooks:
    sign:
      method: "ed25519"           # or "hmac-sha256"
      keyRef: "ref://notify/webhook-sign-key"
```
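The quietHours settings above can be evaluated with a small window check. This is an illustrative Python sketch; the severity ranking and the handling of windows that cross midnight are assumptions for demonstration, not the service's actual code:

```python
from datetime import time

SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]  # illustrative ranking

def in_quiet_hours(now, window="22:00-06:00"):
    """True if `now` (a datetime.time) falls inside the window; handles midnight wrap."""
    start_s, end_s = window.split("-")
    start = time(*map(int, start_s.split(":")))
    end = time(*map(int, end_s.split(":")))
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window crosses midnight

def route_immediately(severity, now, window="22:00-06:00", min_severity="critical"):
    """During quiet hours only >= minSeverity goes out; the rest defers to digests."""
    if not in_quiet_hours(now, window):
        return True
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(min_severity)
```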

14) UI touchpoints

  • Notifications → Channels: add Slack/Teams/Email/Webhook/PagerDuty/OpsGenie; run health; rotate secrets.
  • Notifications → Rules: create/edit YAML rules with linting; test with sample events; see match rate.
  • Notifications → Deliveries: timeline with filters (status, channel, rule); inspect last error; retry.
  • Digest preview: shows current window contents and when it will flush.
  • Quiet hours: configure per tenant; show overrides.
  • DLQ: browse dead letters; requeue after fix.

15) Failure modes & responses

| Condition | Behavior |
| --- | --- |
| Slack 429 / Teams 429 | Respect Retry-After, back off with jitter, reduce concurrency |
| SMTP transient 4xx | Retry up to maxAttempts; escalate to DLQ on exhaustion |
| Invalid channel secret | Mark channel unhealthy; suppress sends; surface in UI |
| Rule explosion (matches everything) | Safety valve: per-tenant RPM caps; auto-pause rule after X drops; UI alert |
| Bus outage | Buffer to local queue (bounded); resume consuming when healthy |
| PostgreSQL slowness | Fall back to Valkey throttles; batch-write deliveries; shed low-priority notifications |
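The per-tenant RPM safety valve can be sketched as a sliding window. This is an illustrative Python sketch (the real service tracks counters in Valkey); the deque-based window and explicit `now` parameter are assumptions for demonstration:

```python
from collections import deque

class TenantRpmCap:
    """Sliding-window per-tenant rate cap: at most `rpm` sends per 60 s window."""

    def __init__(self, rpm=600):
        self.rpm = rpm
        self.sent = {}  # tenant -> deque of send timestamps (seconds)

    def allow(self, tenant, now):
        window = self.sent.setdefault(tenant, deque())
        while window and now - window[0] >= 60.0:
            window.popleft()  # evict sends older than the window
        if len(window) >= self.rpm:
            return False  # over cap: record as throttled, do not send
        window.append(now)
        return True
```

Sends rejected here are recorded with status=throttled rather than dropped silently, so the UI can surface a rule that is tripping the cap.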

16) Testing matrix

  • Unit: matchers, throttle math, digest coalescing, idempotency keys, template rendering edge cases.
  • Connectors: provider-level rate limits, payload size truncation, error mapping.
  • Integration: synthetic event storm (10k/min); verify p95 latency and duplicate-rate targets.
  • Security: DPoP/mTLS on APIs; secretRef resolution; webhook signing & replay windows.
  • i18n: localized templates render deterministically.
  • Chaos: Slack/Teams API flaps; SMTP greylisting; Valkey hiccups; ensure graceful degradation.

17) Sequences (representative)

A) New criticals after Concelier delta (Slack immediate + Email hourly digest)

```mermaid
sequenceDiagram
  autonumber
  participant SCH as Scheduler
  participant NO as Notify.Worker
  participant SL as Slack
  participant SMTP as Email

  SCH->>NO: bus event scheduler.rescan.delta { newCritical:1, digest:sha256:... }
  NO->>NO: match rules (Slack immediate; Email hourly digest)
  NO->>SL: chat.postMessage (concise)
  SL-->>NO: 200 OK
  NO->>NO: append to digest window (email:soc)
  Note over NO: At window close → render digest email
  NO->>SMTP: send email (detailed digest)
  SMTP-->>NO: 250 OK
```

B) Admission deny (Teams card + Webhook)

```mermaid
sequenceDiagram
  autonumber
  participant ZA as Zastava
  participant NO as Notify.Worker
  participant TE as Teams
  participant WH as Webhook

  ZA->>NO: bus event zastava.admission { decision: "deny", reasons: [...] }
  NO->>TE: POST adaptive card
  TE-->>NO: 200 OK
  NO->>WH: POST JSON (signed)
  WH-->>NO: 2xx
```

18) Implementation notes

  • Language: .NET 10; minimal API; System.Text.Json with canonical writer for body hashing; Channels for pipelines.
  • Bus: Valkey Streams (XGROUP consumers) or NATS JetStream for at-least-once delivery with ack; per-tenant consumer groups to localize backpressure.
  • Templates: compile and cache per rule+channel+locale; version with rule updatedAt to invalidate.
  • Rules: store raw YAML + parsed AST; validate with schema + static checks (e.g., nonsensical combos).
  • Secrets: pluggable secret resolver (Authority Secret proxy, K8s, Vault).
  • Rate limiting: System.Threading.RateLimiting + per-connector adapters.
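The template caching note above can be sketched as a cache keyed by rule, channel, and locale, with the rule's updatedAt acting as the invalidation token. This is an illustrative Python sketch of the pattern, not the .NET implementation; the compile callback and counter are assumptions:

```python
class TemplateCache:
    """Compiled templates cached per (rule, channel, locale); updatedAt busts stale entries."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.cache = {}  # (rule_id, channel, locale) -> (updated_at, compiled)
        self.compiles = 0  # for observing cache effectiveness in this sketch

    def get(self, rule_id, channel, locale, updated_at, source):
        key = (rule_id, channel, locale)
        hit = self.cache.get(key)
        if hit is not None and hit[0] == updated_at:
            return hit[1]  # fresh: reuse the compiled template
        self.compiles += 1
        compiled = self.compile_fn(source)
        self.cache[key] = (updated_at, compiled)
        return compiled
```

Because the rule's updatedAt is part of the cached entry, editing a rule transparently recompiles its templates on the next render with no explicit eviction step.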

19) Air-gapped bootstrap configuration

Air-gapped deployments ship a deterministic Notifier profile inside the Bootstrap Pack. The artefacts live under bootstrap/notify/ after running the Offline Kit builder and include:

  • notify.yaml — configuration derived from etc/notify.airgap.yaml, pointing to the sealed PostgreSQL/Authority endpoints and loading connectors from the local plug-in directory.
  • notify-web.secret.example — template for the Authority client secret, intended to be renamed to notify-web.secret before deployment.
  • README.md — operator guide (docs/modules/notify/bootstrap-pack.md).

These files are copied automatically by ops/offline-kit/build_offline_kit.py via copy_bootstrap_configs. Operators mount the configuration and secret into the StellaOps.Notify.WebService container (Compose or Kubernetes) to keep sealed-mode roll-outs reproducible. (Notifier WebService was merged into Notify WebService; the notifier.stella-ops.local hostname is now an alias on the notify-web container.)


20) Roadmap (post-v1)

  • Jira ticket creation and downstream issue-state synchronization.
  • User inbox (in-app notifications) + mobile push via webhook relay.
  • Anomaly suppression: auto-pause noisy rules with hints (learned thresholds).
  • Graph rules: “only notify if not_affected → affected transition at consensus layer”.
  • Label enrichment: pluggable taggers (business criticality, data classification) to refine matchers.