Scope. Implementation‑ready architecture for Notify (aligned with Epic 11 – Notifications Studio): a rules‑driven, tenant‑aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator‑defined routing rules, renders channel‑specific messages (Slack/Teams/Email/Webhook/PagerDuty/OpsGenie), and delivers them reliably with idempotency, throttling, and digests. It is UI‑managed, auditable, and safe by default (no secrets leakage, no spam storms).
- Console frontdoor compatibility (updated 2026-03-10). The web console reaches Notifier Studio through the gateway-owned `/api/v1/notifier/*` prefix, which translates onto the service-local `/api/v2/notify/*` surface without requiring browser calls to raw service-prefixed routes.
- Console admin routing truthfulness (updated 2026-04-21). The console uses `/api/v1/notify/*` only for core Notify toolkit flows (channels, rules, deliveries, incidents, acknowledgements). Advanced admin configuration such as quiet-hours, throttles, escalation, and localization is owned by the Notifier frontdoor `/api/v1/notifier/*` -> `/api/v2/notify/*`; Platform no longer serves synthetic `/api/v1/notify/*` admin compatibility payloads. Digest schedule CRUD remains unavailable in the live API.
- Merged Notify compat surface restoration (updated 2026-04-22). The merged `src/Notify/*` host now maps the admin compatibility routes expected behind `/api/v1/notifier/*`, including `/api/v2/notify/channels*`, `/deliveries*`, `/simulate*`, `/quiet-hours*`, `/throttle-configs*`, `/escalation-policies*`, and `/overrides*`. Unsupported operator override CRUD now returns an explicit `501` contract response instead of a misleading `404`, and focused proof lives in `src/Notify/__Tests/StellaOps.Notify.WebService.Tests/CrudEndpointsTests.cs`.
- Runtime durability cutover (updated 2026-04-16). Default `src/Notifier/*` production wiring now resolves queue and storage through the shared `StellaOps.Notify.Persistence` and `StellaOps.Notify.Queue` libraries. `NullNotifyEventQueue` is allowed only in the `Testing` environment, `notify.pack_approvals` is durable, and restart-survival proof is covered by `NotifierDurableRuntimeProofTests` against real Postgres + Redis.
- Correlation incident/throttle durability (updated 2026-04-20). Non-testing Notify and Notifier hosts no longer keep incident correlation or throttle windows in process-local memory. Both hosts now swap `IIncidentManager` and `INotifyThrottler` onto PostgreSQL-backed runtime services using `notify.correlation_runtime_incidents` and `notify.correlation_runtime_throttle_events`, with restart-survival proof in `NotifierCorrelationDurableRuntimeTests`.
- Localization runtime durability (updated 2026-04-20). Non-testing Notify and Notifier hosts no longer keep tenant-managed localization bundles in process-local memory. Both hosts now swap `ILocalizationService` onto a PostgreSQL-backed runtime service using `notify.localization_bundles`, while built-in system fallback strings remain compiled defaults, with restart-survival proof in `NotifierLocalizationDurableRuntimeTests`.
- Storm/fallback runtime durability (updated 2026-04-20). Non-testing Notify and Notifier hosts no longer keep storm detection state, tenant fallback chains, or per-delivery fallback attempts in process-local memory. Both hosts now swap `IStormBreaker` and `IFallbackHandler` onto PostgreSQL-backed runtime services using `notify.storm_runtime_states`, `notify.storm_runtime_events`, `notify.fallback_runtime_chains`, and `notify.fallback_runtime_delivery_states`, with restart-survival proof in `NotifierStormFallbackDurableRuntimeTests`.
- Escalation engine runtime durability (updated 2026-04-20). Non-testing Notify and Notifier hosts no longer keep live `IEscalationEngine` state in a process-local dictionary. Both hosts now swap `IEscalationEngine` onto a PostgreSQL-backed runtime service using `notify.escalation_states`, with restart-survival proof in `NotifierEscalationRuntimeDurableTests` and startup-contract proof in `NotifyEscalationRuntimeStartupContractTests`.
- External ack/runtime channel durability (updated 2026-04-20). Non-testing Notifier worker hosts no longer depend on a process-local external-id bridge map or a webhook-only dispatch composition for external channels. The worker now composes `WebhookChannelDispatcher` for chat/webhook routes plus `AdapterChannelDispatcher` for `Email`, `PagerDuty`, and `OpsGenie`, durably records provider `externalId` plus `incidentId` metadata into PostgreSQL-backed delivery state, and resolves PagerDuty/OpsGenie webhook acknowledgements through PostgreSQL-backed lookup after restart. Focused proof lives in `NotifierWorkerHostWiringTests` and `NotifierAckBridgeRuntimeDurableTests`.
- Digest scheduler runtime composition (updated 2026-04-20). The non-testing Notifier worker now composes `DigestScheduleRunner`, `DigestGenerator`, and `ChannelDigestDistributor` in the live host. Scheduled digests remain configuration-driven and now resolve tenant IDs from `Notifier:DigestSchedule:Schedules:*:TenantIds` through `ConfiguredDigestTenantProvider` instead of the process-local `InMemoryDigestTenantProvider`. There is currently no operator-managed digest schedule CRUD surface in the live runtime; `/digests` administers open digest windows only. Focused proof lives in `NotifierWorkerHostWiringTests`.
- Suppression admin durability (updated 2026-04-16). Non-testing throttle configuration and operator override APIs no longer use live in-memory state. Both hosts now resolve canonical `/api/v2/throttles*` and `/api/v2/overrides*` plus legacy `/api/v2/notify/throttle-configs*` and `/api/v2/notify/overrides*` through PostgreSQL-backed suppression services, with restart-survival proof in `NotifierSuppressionDurableRuntimeTests`.
- Escalation/on-call durability (updated 2026-04-16). Non-testing escalation-policy and on-call schedule APIs no longer use live in-memory services or compat repositories. Both hosts now resolve canonical `/api/v2/escalation-policies*` and `/api/v2/oncall-schedules*` plus legacy `/api/v2/notify/escalation-policies*` and `/api/v2/notify/oncall-schedules*` through PostgreSQL-backed runtime services, with restart-survival proof in `NotifierEscalationOnCallDurableRuntimeTests`.
- Quiet-hours/maintenance durability (updated 2026-04-20). Non-testing quiet-hours calendars and maintenance windows no longer use live in-memory compat repositories or maintenance evaluators. Both hosts now resolve canonical `/api/v2/quiet-hours*` plus legacy `/api/v2/notify/quiet-hours*` and `/api/v2/notify/maintenance-windows*` through PostgreSQL-backed runtime services on the shared `notify.quiet_hours` and `notify.maintenance_windows` tables, with restart-survival proof in `NotifierQuietHoursMaintenanceDurableRuntimeTests`. Fixed-time daily/weekly cron expressions still project truthfully into canonical schedules, and compat-authored cron shapes that cannot be flattened losslessly now evaluate natively from persisted `cronExpression` plus `duration` metadata instead of remaining inert after restart.
- Security/dead-letter durability (updated 2026-04-16). Non-testing webhook security, tenant isolation, dead-letter administration, and retention cleanup state no longer use live in-memory services. Both hosts now resolve `/api/v2/security*`, `/api/v2/notify/dead-letter*`, `/api/v1/observability/dead-letters*`, and retention endpoints through PostgreSQL-backed runtime services on shared `notify.webhook_security_configs`, `notify.webhook_validation_nonces`, `notify.tenant_resource_owners`, `notify.cross_tenant_grants`, `notify.tenant_isolation_violations`, `notify.dead_letter_entries`, `notify.retention_policies_runtime`, and `notify.retention_cleanup_executions_runtime` tables, with restart-survival proof in `NotifierSecurityDeadLetterDurableRuntimeTests`.
- Testing-only fallback boundary (updated 2026-04-20). `src/Notifier/*` host startup now registers those durable quiet-hours, suppression, escalation/on-call, security, and dead-letter services directly for non-testing environments instead of composing an in-memory graph and replacing it later. The remaining in-memory admin services are isolated to `Testing`, with startup-contract proof in `StartupDependencyWiringTests`.
- Simulation runtime parity (updated 2026-04-20). The canonical `/api/v2/simulate*` endpoints and the legacy `/api/v2/notify/simulate*` endpoints in `src/Notifier/` now resolve the same DI-composed simulation runtime, so throttling plus quiet-hours or maintenance suppression behave identically across route families.
0) Mission & boundaries
Mission. Convert facts from Stella Ops into actionable, noise-controlled signals where teams already live (chat, email, paging, and webhooks), with explainable reasons and deep links to the UI.
Boundaries.
- Notify does not make policy decisions and does not rescan; it consumes events from Scanner/Scheduler/Excititor/Concelier/Attestor/Zastava and routes them.
- Attachments are links (UI/attestation pages); Notify does not attach SBOMs or large blobs to messages.
- Secrets for channels (Slack tokens, SMTP creds) are referenced, not stored raw in the database.
- 2025-11-02 module boundary. Maintain `src/Notify/` as the reusable notification toolkit (engine, storage, queue, connectors) and `src/Notifier/` as the Notifications Studio host that composes those libraries. Do not merge directories without an approved packaging RFC that covers build impacts, offline kit parity, and cross-module governance.
- API versioning (updated 2026-02-22). The API is split across two services:
  - Notify (`src/Notify/`) exposes `/api/v1/notify` — the core notification toolkit (rules, channels, deliveries, templates). This is the lean, canonical API surface.
  - Notifier (`src/Notifier/`) exposes `/api/v2/notify` — the full Notifications Studio with enterprise features (escalation policies, on-call schedules, storm breaker, inbox, retention, simulation, quiet hours, 73+ routes). Notifier also maintains select `/api/v1/notify` endpoints for backward compatibility.
  - Both versions are actively maintained and production. v2 is NOT deprecated — it is the enterprise-tier API hosted by the Notifier Studio service. The previous claim that v2 was "compatibility-only" is stale and has been corrected.
1) Runtime shape & projects
src/
├─ StellaOps.Notify.WebService/ # REST: rules/channels CRUD, test send, deliveries browse
├─ StellaOps.Notify.Worker/ # consumers + evaluators + renderers + delivery workers
├─ StellaOps.Notify.Connectors.* / # channel plug-ins: Slack, Teams, Email, Webhook (v1)
│ └─ *.Tests/
├─ StellaOps.Notify.Engine/ # rules engine, templates, idempotency, digests, throttles
├─ StellaOps.Notify.Models/ # DTOs (Rule, Channel, Event, Delivery, Template)
├─ StellaOps.Notify.Storage.Postgres/ # canonical persistence (notify schema)
├─ StellaOps.Notify.Queue/ # bus client (Valkey Streams/NATS JetStream)
└─ StellaOps.Notify.Tests.* # unit/integration/e2e
Deployables:
- Notify.WebService (stateless API)
- Notify.Worker (horizontal scale)
Dependencies: Authority (OpToks; DPoP/mTLS), PostgreSQL (notify schema), Valkey/NATS (bus), HTTP egress to Slack/Teams/Webhooks/PagerDuty/OpsGenie, SMTP relay for Email.
Configuration. Notify.WebService bootstraps from `notify.yaml` (see `etc/notify.yaml.sample`). Use `storage.driver: postgres` and provide `postgres.notify` options (`connectionString`, `schemaName`, pool sizing, timeouts). Authority settings follow the platform defaults—when running locally without Authority, set `authority.enabled: false` and supply `developmentSigningKey` so JWTs can be validated offline.

`api.rateLimits` exposes token-bucket controls for delivery history queries and test-send previews (`deliveryHistory`, `testSend`). Default values allow generous browsing while preventing accidental bursts; operators can relax/tighten the buckets per deployment.
Plug-ins. All channel connectors are packaged under `<baseDirectory>/plugins/notify`. The ordered load list must start with Slack/Teams before Email/Webhook so chat-first actions are registered deterministically for Offline Kit bundles:

```yaml
plugins:
  baseDirectory: "/var/opt/stellaops"
  directory: "plugins/notify"
  orderedPlugins:
    - StellaOps.Notify.Connectors.Slack
    - StellaOps.Notify.Connectors.Teams
    - StellaOps.Notify.Connectors.Email
    - StellaOps.Notify.Connectors.Webhook
```

The Offline Kit job simply copies the `plugins/notify` tree into the air-gapped bundle; the ordered list keeps connector manifests stable across environments.

In the hosted Notifier worker, delivery execution is split across two deterministic dispatch paths: `WebhookChannelDispatcher` continues to handle chat/webhook routes, while `AdapterChannelDispatcher` resolves `Email`, `PagerDuty`, and `OpsGenie` through `IChannelAdapterFactory`. The provider `externalId` emitted by those adapter-backed channels must survive persistence so inbound webhook acknowledgements can be resolved after restart.
Authority clients. Register two OAuth clients in StellaOps Authority: `notify-web-dev` (audience `notify.dev`) for development and `notify-web` (audience `notify`) for staging/production. Both require `notify.read` and `notify.admin` scopes and use DPoP-bound client credentials (`client_secret` in the samples). Reference entries live in `etc/authority.yaml.sample`, with placeholder secrets under `etc/secrets/notify-web*.secret.example`.
2) Responsibilities
- Ingest platform events from internal bus with strong ordering per key (e.g., image digest).
- Evaluate rules (tenant‑scoped) with matchers: severity changes, namespaces, repos, labels, KEV flags, provider provenance (VEX), component keys, admission decisions, etc.
- Control noise: throttle, coalesce (digest windows), and dedupe via idempotency keys.
- Render channel‑specific messages using safe templates; include evidence and links.
- Deliver with retries/backoff; record outcome; expose delivery history to UI.
- Test paths (send test to channel targets) without touching live rules.
- Audit: log who configured what, when, and why a message was sent.
3) Event model (inputs)
Notify subscribes to the internal event bus (produced by services as escaped JSON; gzip allowed with caps):

- `scanner.scan.completed` — new SBOM(s) composed; artifacts ready
- `scanner.report.ready` — analysis verdict (policy+vex) available; carries deltas summary
- `scheduler.rescan.delta` — new findings after Concelier/Excititor deltas (already summarized)
- `attestor.logged` — Rekor UUID returned (sbom/report/vex export)
- `zastava.admission` — admit/deny with reasons, namespace, image digests
- `concelier.export.completed` — new export ready (rarely notified directly; usually drives Scheduler)
- `excititor.export.completed` — new consensus snapshot (ditto)
Canonical envelope (bus → Notify.Engine):
{
"eventId": "uuid",
"kind": "scanner.report.ready",
"tenant": "tenant-01",
"ts": "2025-10-18T05:41:22Z",
"actor": "scanner-webservice",
"scope": { "namespace":"payments", "repo":"ghcr.io/acme/api", "digest":"sha256:..." },
"payload": { /* kind-specific fields, see below */ }
}
Examples (payload cores):
- `scanner.report.ready`:

  ```json
  {
    "reportId": "report-3def...",
    "verdict": "fail",
    "summary": { "total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2 },
    "delta": { "newCritical": 1, "kev": ["CVE-2025-..."] },
    "links": { "ui": "https://ui/.../reports/report-3def...", "rekor": "https://rekor/..." },
    "dsse": { "...": "..." },
    "report": { "...": "..." }
  }
  ```

  Payload embeds both the canonical report document and the DSSE envelope so connectors, Notify, and UI tooling can reuse the signed bytes without re-serialising.

- `scanner.scan.completed`:

  ```json
  {
    "reportId": "report-3def...",
    "digest": "sha256:...",
    "verdict": "fail",
    "summary": { "total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2 },
    "delta": { "newCritical": 1, "kev": ["CVE-2025-..."] },
    "policy": { "revisionId": "rev-42", "digest": "27d2..." },
    "findings": [{ "id": "finding-1", "severity": "Critical", "cve": "CVE-2025-...", "reachability": "runtime" }],
    "dsse": { "...": "..." }
  }
  ```

- `zastava.admission`:

  ```json
  {
    "decision": "deny|allow",
    "reasons": ["unsigned image", "missing SBOM"],
    "images": [{ "digest": "sha256:...", "signed": false, "hasSbom": false }]
  }
  ```
4) Rules engine — semantics
Rule shape (simplified):
name: "high-critical-alerts-prod"
enabled: true
match:
eventKinds: ["scanner.report.ready","scheduler.rescan.delta","zastava.admission"]
namespaces: ["prod-*"]
repos: ["ghcr.io/acme/*"]
minSeverity: "high" # min of new findings (delta context)
kev: true # require KEV-tagged or allow any if false
verdict: ["fail","deny"] # filter for report/admission
vex:
includeRejectedJustifications: false # notify only on accepted 'affected'
actions:
- channel: "slack:sec-alerts" # reference to Channel object
template: "concise"
throttle: "5m"
- channel: "email:soc"
digest: "hourly"
template: "detailed"
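The `namespaces`/`repos` matchers above are glob patterns. As a reference for those semantics, here is a minimal sketch of scope matching; `matches_scope` and the dict shapes are hypothetical illustrations, not the actual engine API:

```python
from fnmatch import fnmatch

def matches_scope(rule: dict, event_scope: dict) -> bool:
    """Return True when the event's namespace and repo fall inside the
    rule's glob patterns (an empty pattern list matches anything)."""
    ns_patterns = rule.get("namespaces", [])
    repo_patterns = rule.get("repos", [])
    ns_ok = not ns_patterns or any(
        fnmatch(event_scope.get("namespace", ""), p) for p in ns_patterns)
    repo_ok = not repo_patterns or any(
        fnmatch(event_scope.get("repo", ""), p) for p in repo_patterns)
    return ns_ok and repo_ok

rule = {"namespaces": ["prod-*"], "repos": ["ghcr.io/acme/*"]}
print(matches_scope(rule, {"namespace": "prod-payments",
                           "repo": "ghcr.io/acme/api"}))    # True
print(matches_scope(rule, {"namespace": "staging-payments",
                           "repo": "ghcr.io/acme/api"}))    # False
```

A rule with no scope patterns matches every namespace/repo, which is exactly the "rule explosion" case the per-tenant RPM safety valve exists for.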
Evaluation order
- Tenant check → discard if rule tenant ≠ event tenant.
- Kind filter → discard early.
- Scope match (namespace/repo/labels).
- Delta/severity gates (if event carries `delta`).
- VEX gate (drop if event’s finding is not affected under policy consensus unless rule says otherwise).
- Throttling/dedup (idempotency key) — skip if suppressed.
- Actions → enqueue per‑channel job(s).
Idempotency key: hash(ruleId | actionId | event.kind | scope.digest | delta.hash | day-bucket); ensures “same alert” doesn’t fire more than once within throttle window.
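The key derivation above can be sketched as a canonical-string hash; the exact separator and the UTC day-bucket format here are illustrative assumptions:

```python
import hashlib
from datetime import datetime, timezone

def idempotency_key(rule_id: str, action_id: str, event_kind: str,
                    digest: str, delta_hash: str, ts: datetime) -> str:
    """Hash the stable identifying fields plus a UTC day bucket so the
    same alert collapses to one key within a day."""
    day_bucket = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    canonical = "|".join([rule_id, action_id, event_kind, digest,
                          delta_hash, day_bucket])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Identical inputs on the same day yield the same key; changing any field (or crossing the day bucket) yields a new key and therefore a fresh alert.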
Digest windows: maintain per action a coalescer:
- Window: `5m|15m|1h|1d` (configurable); coalesces events by tenant + namespace/repo or by digest group.
- Digest messages summarize top N items and counts, with safe truncation.
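A minimal in-memory coalescer illustrating the window mechanics; class and field names are hypothetical, and the durable runtime keeps this state in PostgreSQL rather than process memory:

```python
from dataclasses import dataclass, field

@dataclass
class DigestWindow:
    opened_at: float                       # seconds on some monotonic clock
    items: list = field(default_factory=list)

class DigestCoalescer:
    """Per-action coalescer: append events into an open window, flush once
    the window duration has elapsed, capping stored items (safe truncation)."""
    def __init__(self, window_seconds: float, max_items: int = 100):
        self.window_seconds = window_seconds
        self.max_items = max_items
        self.open: dict[str, DigestWindow] = {}

    def append(self, action_key: str, event: dict, now: float) -> None:
        w = self.open.setdefault(action_key, DigestWindow(opened_at=now))
        if len(w.items) < self.max_items:
            w.items.append(event)

    def flush_due(self, now: float) -> dict[str, list]:
        """Return and close every window whose duration has elapsed."""
        due = {k: w.items for k, w in self.open.items()
               if now - w.opened_at >= self.window_seconds}
        for k in due:
            del self.open[k]
        return due
```

The `max_items` cap mirrors the `digests.maxItems` setting in the sample configuration: the digest message summarizes the top N items and reports counts for the rest.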
5) Channels & connectors (plug‑ins)
Channel config is two‑part: a Channel record (name, type, options) and a Secret reference (Vault/K8s Secret). Connectors are restart-time plug-ins discovered on service start (same manifest convention as Concelier/Excititor) and live under plugins/notify/<channel>/.
Built-in channels:
- Slack: Bot token (`xoxb-…`), `chat.postMessage` + blocks; rate-limit aware (HTTP 429).
- Microsoft Teams: Incoming Webhook (or Graph card later); adaptive card payloads.
- Email (SMTP): TLS (STARTTLS or implicit), From/To/CC/BCC; HTML+text alt; DKIM optional.
- Generic Webhook: POST JSON with an HMAC-SHA-256 or Ed25519 signature in headers.
- PagerDuty: Events API v2 trigger/ack/resolve flow; durable `dedup_key`/external id mapping is persisted with delivery state for restart-safe webhook acknowledgement handling.
- OpsGenie: Alert create/ack/close flow; alias/external id is persisted with delivery state so inbound acknowledgement webhooks remain restart-safe.
Connector contract: (implemented by plug-in assemblies)
public interface INotifyConnector {
string Type { get; } // "slack" | "teams" | "email" | "webhook" | ...
Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct);
Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct);
}
For hosted external channels, Notifier worker adapters implement IChannelAdapter and are selected by AdapterChannelDispatcher. Those adapters must emit stable provider identifiers (externalId, incidentId where applicable) so the IAckBridge webhook path can recover correlation from persisted delivery rows instead of process-local memory.
DeliveryContext includes rendered content and raw event for audit.
Test-send previews. Plug-ins can optionally implement INotifyChannelTestProvider to shape /channels/{id}/test responses. Providers receive a sanitised ChannelTestPreviewContext (channel, tenant, target, timestamp, trace) and return a NotifyDeliveryRendered preview + metadata. When no provider is present, the host falls back to a generic preview so the endpoint always responds.
Secrets: ChannelConfig.secretRef points to Authority‑managed secret handle or K8s Secret path; workers load at send-time; plug-in manifests (notify-plugin.json) declare capabilities and version.
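The dispatch split described for the hosted worker reduces to routing on channel type. A sketch, under the assumption that channel type strings are compared case-insensitively; the real hosts compose the two dispatchers through DI rather than a lookup function:

```python
# Chat/webhook routes stay on the webhook dispatcher; adapter-backed external
# providers go through adapter dispatch (per the hosted Notifier worker split).
WEBHOOK_TYPES = {"slack", "teams", "webhook"}
ADAPTER_TYPES = {"email", "pagerduty", "opsgenie"}

def pick_dispatcher(channel_type: str) -> str:
    t = channel_type.lower()
    if t in WEBHOOK_TYPES:
        return "WebhookChannelDispatcher"
    if t in ADAPTER_TYPES:
        return "AdapterChannelDispatcher"
    raise ValueError(f"unknown channel type: {channel_type}")
```

Whatever path a delivery takes, the provider `externalId` the adapter emits must be written into the delivery row so acknowledgement webhooks can be correlated after a restart.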
6) Templates & rendering
Template engine: strongly typed, safe Handlebars‑style; no arbitrary code. Partial templates per channel. Deterministic outputs (prop order, no locale drift unless requested).
Variables (examples):

- `event.kind`, `event.ts`, `scope.namespace`, `scope.repo`, `scope.digest`
- `payload.verdict`, `payload.delta.newCritical`, `payload.links.ui`, `payload.links.rekor`
- `topFindings[]` with `purl`, `vulnId`, `severity`
- `policy.name`, `policy.revision` (if available)

Helpers: `severity_icon(sev)`, `link(text,url)`, `pluralize(n, "finding")`, `truncate(text, n)`, `code(text)`.
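A sketch of what two of those helpers might do; the exact behavior (ellipsis truncation, naive pluralization) is assumed for illustration, not taken from the engine:

```python
def pluralize(n: int, noun: str) -> str:
    """Render '1 finding' / '3 findings' (naive '-s' pluralization)."""
    return f"{n} {noun}" if n == 1 else f"{n} {noun}s"

def truncate(text: str, n: int) -> str:
    """Cut to at most n characters, appending an ellipsis when content
    was dropped, so deterministic length limits hold per channel."""
    return text if len(text) <= n else text[: max(0, n - 1)] + "…"

print(pluralize(3, "finding"))   # 3 findings
print(truncate("abcdefgh", 5))   # abcd…
```

Helpers like these are what keep rendering deterministic: no locale-dependent formatting unless a locale is requested explicitly.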
Channel mapping:
- Slack: title + blocks, limited to 50 blocks/3000 chars per section; long lists → link to UI.
- Teams: Adaptive Card schema 1.5; fallback text for older channels (surfaced as `teams.fallbackText` metadata alongside the webhook hash).
- Email: HTML + text; inline table of top N findings, rest behind UI link.
- Webhook: JSON with `event`, `ruleId`, `actionId`, `summary`, `links`, and a raw `payload` subset.
i18n: template set per locale (English default; Bulgarian built‑in).
7) Data model (PostgreSQL)
Canonical JSON Schemas for rules/channels/events live in docs/modules/notify/resources/schemas/. Sample payloads intended for tests/UI mock responses are captured in docs/modules/notify/resources/samples/.
Database: stellaops_notify (PostgreSQL)
- `rules` `{ _id, tenantId, name, enabled, match, actions, createdBy, updatedBy, createdAt, updatedAt }`
- `channels` `{ _id, tenantId, name:"slack:sec-alerts", type:"slack", config:{ webhookUrl?:"", channel:"#sec-alerts", workspace?: "...", secretRef:"ref://..." }, createdAt, updatedAt }`
- `deliveries` `{ _id, tenantId, ruleId, actionId, eventId, kind, scope, status:"sent|failed|throttled|digested|dropped", externalId?, metadata?, attempts:[{ts, status, code, reason}], rendered:{ title, body, target }, sentAt, lastError? }` (rendered content is redacted for PII; a body hash is stored). PagerDuty and OpsGenie deliveries durably carry the provider `externalId` plus `metadata.incidentId` so inbound webhook acknowledgements can be resolved after worker restart without relying on a process-local bridge map.
- `digests` `{ _id, tenantId, actionKey, window:"hourly", openedAt, items:[{eventId, scope, delta}], status:"open|flushed" }`
- `correlation_runtime_incidents` `{ tenantId, incidentId, correlationKey, eventKind, title, status:"open|acknowledged|resolved", eventCount, firstOccurrence, lastOccurrence, acknowledgedBy?, resolvedBy?, eventIds:[eventId...] }`
- `correlation_runtime_throttle_events` `{ tenantId, correlationKey, occurredAt }` (short-lived; also cached in Valkey)
- `escalation_states` `{ tenantId, policyId, incidentId?, correlationId, currentStep, repeatIteration, status:"active|acknowledged|resolved|expired", startedAt, nextEscalationAt, acknowledgedAt?, acknowledgedBy?, metadata }`. `correlationId` is the durable lookup key for the live string incident id used by the runtime engine. `metadata` carries the runtime-only fields that do not fit the canonical columns yet: `stateId`, external `policyId`, `levelStartedAt`, terminal runtime status (`stopped|exhausted`), `stoppedAt`, `stoppedReason`, and the full escalation `history`.
Indexes: rules by {tenantId, enabled}, deliveries by {tenantId, sentAt desc}, digests by {tenantId, actionKey}.
8) External APIs (WebService)
Base path: /api/v1/notify (Authority OpToks; scopes: notify.admin for write, notify.read for view).
All REST calls require the tenant header X-StellaOps-Tenant (matches the canonical tenantId stored in PostgreSQL). Payloads are normalised via NotifySchemaMigration before persistence to guarantee schema version pinning.
Authentication today is stubbed with Bearer tokens (Authorization: Bearer <token>). When Authority wiring lands, this will switch to OpTok validation + scope enforcement, but the header contract will remain the same.
Service configuration exposes notify:auth:* keys (issuer, audience, signing key, scope names) so operators can wire the Authority JWKS or (in dev) a symmetric test key. notify:storage:* keys cover PostgreSQL connection/schema overrides. Both sets are required for the new API surface.
Internal tooling can hit /internal/notify/<entity>/normalize to upgrade legacy JSON and return canonical output used in the docs fixtures.
- Channels
  - `POST /channels` | `GET /channels` | `GET /channels/{id}` | `PATCH /channels/{id}` | `DELETE /channels/{id}`
  - `POST /channels/{id}/test` → send sample message (no rule evaluation); returns `202 Accepted` with rendered preview + metadata (base keys: `channelType`, `target`, `previewProvider`, `traceId` + connector-specific entries); governed by `api.rateLimits:testSend`.
  - `GET /channels/{id}/health` → connector self-check (returns redacted metadata: secret refs hashed, sensitive config keys masked, fallbacks noted via `teams.fallbackText`/`teams.validation.*`)
- Rules
  - `POST /rules` | `GET /rules` | `GET /rules/{id}` | `PATCH /rules/{id}` | `DELETE /rules/{id}`
  - `POST /rules/{id}/test` → dry-run rule against a sample event (no delivery unless `--send`)
- Deliveries
  - `POST /deliveries` → ingest worker delivery state (idempotent via `deliveryId`).
  - `GET /deliveries?since=...&status=...&limit=...` → list envelope `{ items, count, continuationToken }` (most recent first); base metadata keys match the test-send response (`channelType`, `target`, `previewProvider`, `traceId`); rate-limited via `api.rateLimits.deliveryHistory`. See `docs/modules/notify/resources/samples/notify-delivery-list-response.sample.json`.
  - `GET /deliveries/{id}` → detail (redacted body + metadata)
  - `POST /deliveries/{id}/retry` → force retry (admin, future sprint)
- Admin
  - `GET /stats` (per-tenant counts, last hour/day)
  - `GET /healthz` | `GET /readyz` (liveness)
  - `POST /locks/acquire` | `POST /locks/release` – worker coordination primitives (short TTL).
  - `POST /digests` | `GET /digests/{actionKey}` | `DELETE /digests/{actionKey}` – manage open digest windows.
  - `POST /audit` | `GET /audit?since=&limit=` – append/query structured audit trail entries.
8.1 Ack tokens & escalation workflows
To support one-click acknowledgements from chat/email, the Notify WebService mints DSSE ack tokens via Authority:
- `POST /notify/ack-tokens/issue` → returns a DSSE envelope (payload type `application/vnd.stellaops.notify-ack-token+json`) describing the tenant, notification/delivery ids, channel, webhook URL, nonce, permitted actions, and TTL. Requires `notify.operator`; requesting escalation requires the caller to hold `notify.escalate` (and `notify.admin` when configured). Issuance enforces the Authority-side webhook allowlist (`notifications.webhooks.allowedHosts`) before minting tokens.
- `POST /notify/ack-tokens/verify` → verifies the DSSE signature, enforces expiry/tenant/action constraints, and emits audit events (`notify.ack.verified`, `notify.ack.escalated`). Scope: `notify.operator` (plus `notify.escalate` for escalation).
- `POST /notify/ack-tokens/rotate` → rotates the signing key used for ack tokens, requires `notify.admin`, and emits `notify.ack.key_rotated`/`notify.ack.key_rotation_failed` audit events. Operators must supply the new key material (file/KMS/etc. depending on `notifications.ackTokens.keySource`); Authority updates JWKS entries with `use: "notify-ack"` and retires the previous key.
- `POST /internal/notifications/ack-tokens/rotate` → legacy bootstrap path (API-key protected) retained for air-gapped initial provisioning; it forwards to the same rotation pipeline as the public endpoint.
Authority signs ack tokens using keys configured under notifications.ackTokens. Public JWKS responses expose these keys with use: "notify-ack" and status: active|retired, enabling offline verification by the worker/UI/CLI.
Inbound PagerDuty and OpsGenie acknowledgement webhooks must resolve provider identifiers from durable delivery state (externalId plus incident metadata), not from process-local runtime maps. Restart-survival is a required property of the non-testing host composition.
Ingestion: workers do not expose public ingestion; they subscribe to the internal bus. (Optional /events/test for integration testing, admin-only.)
9) Delivery pipeline (worker)
[Event bus] → [Ingestor] → [RuleMatcher] → [Throttle/Dedupe] → [DigestCoalescer] → [Renderer] → [Connector] → [Result]
└────────→ [DeliveryStore]
- Ingestor: N consumers with per‑key ordering (key = tenant|digest|namespace).
- RuleMatcher: loads active rules snapshot for tenant into memory; vectorized predicate check.
- Throttle/Dedupe: consult Valkey plus PostgreSQL `notify.correlation_runtime_throttle_events`; if hit → record `status=throttled`.
- DigestCoalescer: append to open digest window or flush when timer expires.
- Renderer: select template (channel+locale), inject variables, enforce length limits, compute `bodyHash`.
- Connector: send; handle provider-specific rate limits and backoffs; `maxAttempts` with exponential jitter; overflow → DLQ (dead-letter topic) + UI surfacing.
Idempotency: per action idempotency key stored in Valkey (TTL = throttle window or digest window). Connectors also respect provider idempotency where available (e.g., Slack client_msg_id).
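The suppression check reduces to an atomic set-if-absent with expiry. With a real Valkey client this would be `SET key 1 NX EX <ttl>`; the sketch below uses an in-memory stand-in so the window logic is visible, and the class name is hypothetical:

```python
class TtlDedupeStore:
    """In-memory stand-in for Valkey `SET ... NX EX`: the first writer wins
    until the key's TTL elapses, after which the key may fire again."""
    def __init__(self) -> None:
        self._expiry: dict[str, float] = {}   # key -> absolute expiry time

    def set_if_absent(self, key: str, ttl_seconds: float, now: float) -> bool:
        expiry = self._expiry.get(key)
        if expiry is not None and expiry > now:
            return False               # duplicate inside the window: suppress
        self._expiry[key] = now + ttl_seconds
        return True                    # first occurrence: deliver
```

Setting the TTL to the throttle window (or digest window) gives exactly the "same alert fires at most once per window" guarantee described above.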
10) Reliability & rate controls
- Per‑tenant RPM caps (default 600/min) + per‑channel concurrency (Slack 1–4, Teams 1–2, Email 8–32 based on relay).
- Backoff map: Slack 429 → respect `Retry-After`; SMTP 4xx → retry; 5xx → retry with jitter; permanent rejects → drop with status recorded.
- DLQ: NATS/Valkey stream `notify.dlq` with `{event, rule, action, error}` for operator inspection; UI shows DLQ items.
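The backoff map can be sketched as a delay function where a provider-supplied `Retry-After` overrides exponential backoff with full jitter; the base and cap values here are illustrative, not configured defaults:

```python
import random
from typing import Optional

def retry_delay(attempt: int, retry_after: Optional[float] = None,
                base: float = 1.0, cap: float = 300.0) -> float:
    """Seconds to wait before the next attempt."""
    if retry_after is not None:
        return retry_after              # e.g. Slack 429: honor Retry-After exactly
    # Full jitter: uniform over [0, min(cap, base * 2^attempt)].
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Full jitter spreads retries from many workers across the window, which keeps a provider outage from turning into a synchronized thundering herd when service recovers.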
11) Security & privacy
- AuthZ: all APIs require Authority OpToks; actions scoped by tenant.
- Secrets: `secretRef` only; Notify fetches just-in-time from Authority Secret proxy or K8s Secret (mounted). No plaintext secrets in database.
- Egress TLS: validate SSL; pin domains per channel config; optional CA bundle override for on-prem SMTP.
- Webhook signing: HMAC or Ed25519 signatures in `X-StellaOps-Signature` + replay-window timestamp; include canonical body hash in header.
- Redaction: deliveries store hashes of bodies, not full payloads for chat/email to minimize PII retention (configurable).
- Quiet hours: per tenant (e.g., 22:00–06:00) route high‑sev only; defer others to digests.
- Loop prevention: Webhook target allowlist + event origin tags; do not ingest own webhooks.
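One way the HMAC variant of the signing scheme could look. The `timestamp.body-hash` signing string and the 300-second tolerance are assumptions for illustration; only the replay-window and body-hash requirements come from the list above:

```python
import hashlib, hmac

REPLAY_WINDOW_SECONDS = 300   # assumed tolerance for the replay window

def sign(body: bytes, secret: bytes, ts: int) -> str:
    """HMAC-SHA256 over 'timestamp.body-hash' so the signature binds both
    the payload and the time it was sent."""
    body_hash = hashlib.sha256(body).hexdigest()
    msg = f"{ts}.{body_hash}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify(body: bytes, secret: bytes, ts: int, signature: str,
           now: int) -> bool:
    if abs(now - ts) > REPLAY_WINDOW_SECONDS:   # reject stale/replayed posts
        return False
    expected = sign(body, secret, ts)
    return hmac.compare_digest(expected, signature)   # constant-time compare
```

Binding the timestamp into the signed string (rather than sending it unauthenticated) is what makes the replay window enforceable: an attacker cannot re-send an old body with a fresh timestamp.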
12) Observability (Prometheus + OTEL)
- `notify.events_consumed_total{kind}`
- `notify.rules_matched_total{ruleId}`
- `notify.throttled_total{reason}`
- `notify.digest_coalesced_total{window}`
- `notify.sent_total{channel}` / `notify.failed_total{channel,code}`
- `notify.delivery_latency_seconds{channel}` (end-to-end)
- Tracing: spans `ingest`, `match`, `render`, `send`; correlation id = `eventId`.
- Runbook + dashboard stub (offline import): `operations/observability.md`, `operations/dashboards/notify-observability.json` (to be populated after next demo).
SLO targets
- Event→delivery p95 ≤ 30–60 s under nominal load.
- Failure rate p95 < 0.5% per hour (excluding provider outages).
- Duplicate rate ≈ 0 (idempotency working).
13) Configuration (YAML)
notify:
authority:
issuer: "https://authority.internal"
require: "dpop" # or "mtls"
bus:
kind: "valkey" # or "nats" (valkey uses redis:// protocol)
streams:
- "scanner.events"
- "scheduler.events"
- "attestor.events"
- "zastava.events"
postgres:
notify:
connectionString: "Host=postgres;Port=5432;Database=stellaops_notify;Username=stellaops;Password=stellaops;Pooling=true"
schemaName: "notify"
commandTimeoutSeconds: 45
limits:
perTenantRpm: 600
perChannel:
slack: { concurrency: 2 }
teams: { concurrency: 1 }
email: { concurrency: 8 }
webhook: { concurrency: 8 }
digests:
defaultWindow: "1h"
maxItems: 100
quietHours:
enabled: true
window: "22:00-06:00"
minSeverity: "critical"
webhooks:
sign:
method: "ed25519" # or "hmac-sha256"
keyRef: "ref://notify/webhook-sign-key"
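Note that the sample `quietHours.window` crosses midnight, so any containment check must handle wrap-around. A minimal sketch; `in_quiet_hours` is illustrative and assumes the `HH:MM-HH:MM` format shown in the config:

```python
from datetime import time as dtime

def in_quiet_hours(window: str, now: dtime) -> bool:
    """True when 'now' falls inside an HH:MM-HH:MM window, including
    windows that wrap past midnight (e.g. 22:00-06:00)."""
    start_s, end_s = window.split("-")
    start = dtime(*map(int, start_s.split(":")))
    end = dtime(*map(int, end_s.split(":")))
    if start <= end:                       # same-day window
        return start <= now < end
    return now >= start or now < end       # wrap-around window

print(in_quiet_hours("22:00-06:00", dtime(23, 30)))  # True
print(in_quiet_hours("22:00-06:00", dtime(12, 0)))   # False
```

During quiet hours, events below `quietHours.minSeverity` would be deferred into digests rather than delivered immediately.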
14) UI touch‑points
- Notifications → Channels: add Slack/Teams/Email/Webhook/PagerDuty/OpsGenie; run health; rotate secrets.
- Notifications → Rules: create/edit YAML rules with linting; test with sample events; see match rate.
- Notifications → Deliveries: timeline with filters (status, channel, rule); inspect last error; retry.
- Digest preview: shows current window contents and when it will flush.
- Quiet hours: configure per tenant; show overrides.
- DLQ: browse dead‑letters; requeue after fix.
15) Failure modes & responses
| Condition | Behavior |
|---|---|
| Slack 429 / Teams 429 | Respect Retry‑After, backoff with jitter, reduce concurrency |
| SMTP transient 4xx | Retry up to maxAttempts; escalate to DLQ on exhaust |
| Invalid channel secret | Mark channel unhealthy; suppress sends; surface in UI |
| Rule explosion (matches everything) | Safety valve: per‑tenant RPM caps; auto‑pause rule after X drops; UI alert |
| Bus outage | Buffer to local queue (bounded); resume consuming when healthy |
| PostgreSQL slowness | Fall back to Valkey throttles; batch write deliveries; shed low‑priority notifications |
16) Testing matrix
- Unit: matchers, throttle math, digest coalescing, idempotency keys, template rendering edge cases.
- Connectors: provider‑level rate limits, payload size truncation, error mapping.
- Integration: synthetic event storm (10k/min); verify the p95 latency and duplicate-rate SLOs hold.
- Security: DPoP/mTLS on APIs; secretRef resolution; webhook signing & replay windows.
- i18n: localized templates render deterministically.
- Chaos: Slack/Teams API flaps; SMTP greylisting; Valkey hiccups; ensure graceful degradation.
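The digest‑coalescing behavior exercised in the unit row can be sketched as a window that accepts events up to `maxItems` (100 in the config above) and still counts overflow, so the rendered digest reports the true total. This is an illustrative model, not the service's data structure:

```python
from dataclasses import dataclass, field


@dataclass
class DigestWindow:
    """Coalesce matched events until the window flushes or maxItems is hit."""
    max_items: int = 100
    items: list[dict] = field(default_factory=list)
    overflow: int = 0  # events omitted from the rendered digest, still counted

    def add(self, event: dict) -> None:
        if len(self.items) < self.max_items:
            self.items.append(event)
        else:
            self.overflow += 1

    def flush(self) -> dict:
        """Return a render-ready summary and reset for the next interval."""
        summary = {"count": len(self.items) + self.overflow,
                   "shown": list(self.items), "truncated": self.overflow}
        self.items.clear()
        self.overflow = 0
        return summary
```

Keeping the overflow counter separate means an event storm never inflates digest payload size, yet operators still see "N events, M truncated" in the rendered email.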
17) Sequences (representative)
A) New criticals after Concelier delta (Slack immediate + Email hourly digest)
sequenceDiagram
autonumber
participant SCH as Scheduler
participant NO as Notify.Worker
participant SL as Slack
participant SMTP as Email
SCH->>NO: bus event scheduler.rescan.delta { newCritical:1, digest:sha256:... }
NO->>NO: match rules (Slack immediate; Email hourly digest)
NO->>SL: chat.postMessage (concise)
SL-->>NO: 200 OK
NO->>NO: append to digest window (email:soc)
Note over NO: At window close → render digest email
NO->>SMTP: send email (detailed digest)
SMTP-->>NO: 250 OK
B) Admission deny (Teams card + Webhook)
sequenceDiagram
autonumber
participant ZA as Zastava
participant NO as Notify.Worker
participant TE as Teams
participant WH as Webhook
ZA->>NO: bus event zastava.admission { decision: "deny", reasons: [...] }
NO->>TE: POST adaptive card
TE-->>NO: 200 OK
NO->>WH: POST JSON (signed)
WH-->>NO: 2xx
18) Implementation notes
- Language: .NET 10; minimal API; `System.Text.Json` with a canonical writer for body hashing; Channels for pipelines.
- Bus: Valkey Streams (XGROUP consumers) or NATS JetStream for at‑least‑once delivery with acks; per‑tenant consumer groups to localize backpressure.
- Templates: compile and cache per rule+channel+locale; version with the rule's `updatedAt` to invalidate.
- Rules: store raw YAML + parsed AST; validate with schema + static checks (e.g., nonsensical combos).
- Secrets: pluggable secret resolver (Authority Secret proxy, K8s, Vault).
- Rate limiting: `System.Threading.RateLimiting` + per‑connector adapters.
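Canonical body hashing (the `System.Text.Json` canonical-writer note above) matters because the same logical payload must always produce the same digest for signing and redaction. A Python sketch of the idea, assuming sorted keys plus compact separators as the canonical form (the .NET writer's exact rules may differ):

```python
import hashlib
import json


def canonical_body_hash(payload: dict) -> str:
    """Hash a canonical serialization so semantically identical payloads
    yield the same digest regardless of key order or whitespace.
    (Sorted keys + compact separators stand in for the canonical writer.)"""
    canonical = json.dumps(payload, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode()).hexdigest()
```

The same digest then serves the webhook signature header, the delivery‑record redaction (store the hash, not the body), and the `digest:sha256:...` fields seen in the bus events.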
19) Air-gapped bootstrap configuration
Air-gapped deployments ship a deterministic Notifier profile inside the Bootstrap Pack. The artefacts live under `bootstrap/notify/` after running the Offline Kit builder and include:
- `notify.yaml` — configuration derived from `etc/notify.airgap.yaml`, pointing to the sealed PostgreSQL/Authority endpoints and loading connectors from the local plug-in directory.
- `notify-web.secret.example` — template for the Authority client secret, intended to be renamed to `notify-web.secret` before deployment.
- `README.md` — operator guide (`docs/modules/notify/bootstrap-pack.md`).
These files are copied automatically by `ops/offline-kit/build_offline_kit.py` via `copy_bootstrap_configs`. Operators mount the configuration and secret into the `StellaOps.Notify.WebService` container (Compose or Kubernetes) to keep sealed-mode roll-outs reproducible. (Notifier WebService was merged into Notify WebService; the `notifier.stella-ops.local` hostname is now an alias on the `notify-web` container.)
20) Roadmap (post-v1)
- Jira ticket creation and downstream issue-state synchronization.
- User inbox (in‑app notifications) + mobile push via webhook relay.
- Anomaly suppression: auto‑pause noisy rules with hints (learned thresholds).
- Graph rules: “only notify if not_affected → affected transition at consensus layer”.
- Label enrichment: pluggable taggers (business criticality, data classification) to refine matchers.