	Scope. Implementation‑ready architecture for Notify: a rules‑driven, tenant‑aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator‑defined routing rules, renders channel‑specific messages (Slack/Teams/Email/Webhook), and delivers them reliably with idempotency, throttling, and digests. It is UI‑managed, auditable, and safe by default (no secrets leakage, no spam storms).
0) Mission & boundaries
Mission. Convert facts from Stella Ops into actionable, noise‑controlled signals where teams already live (chat/email/webhooks), with explainable reasons and deep links to the UI.
Boundaries.
- Notify does not make policy decisions and does not rescan; it consumes events from Scanner/Scheduler/Vexer/Feedser/Attestor/Zastava and routes them.
- Attachments are links (UI/attestation pages); Notify does not attach SBOMs or large blobs to messages.
- Secrets for channels (Slack tokens, SMTP creds) are referenced, not stored raw in Mongo.
1) Runtime shape & projects
src/
 ├─ StellaOps.Notify.WebService/        # REST: rules/channels CRUD, test send, deliveries browse
 ├─ StellaOps.Notify.Worker/            # consumers + evaluators + renderers + delivery workers
 ├─ StellaOps.Notify.Connectors.* /     # channel plug-ins: Slack, Teams, Email, Webhook (v1)
 │    └─ *.Tests/
 ├─ StellaOps.Notify.Engine/            # rules engine, templates, idempotency, digests, throttles
 ├─ StellaOps.Notify.Models/            # DTOs (Rule, Channel, Event, Delivery, Template)
 ├─ StellaOps.Notify.Storage.Mongo/     # rules, channels, deliveries, digests, locks
 ├─ StellaOps.Notify.Queue/             # bus client (Redis Streams/NATS JetStream)
 └─ StellaOps.Notify.Tests.*            # unit/integration/e2e
Deployables:
- Notify.WebService (stateless API)
- Notify.Worker (horizontal scale)
Dependencies: Authority (OpToks; DPoP/mTLS), MongoDB, Redis/NATS (bus), HTTP egress to Slack/Teams/Webhooks, SMTP relay for Email.
Configuration. Notify.WebService bootstraps from notify.yaml (see etc/notify.yaml.sample). Use storage.driver: mongo with a production connection string; the optional memory driver exists only for tests. Authority settings follow the platform defaults; when running locally without Authority, set authority.enabled: false and supply developmentSigningKey so JWTs can be validated offline.
api.rateLimits exposes token-bucket controls for delivery history queries and test-send previews (deliveryHistory, testSend). Default values allow generous browsing while preventing accidental bursts; operators can relax or tighten the buckets per deployment.
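The token-bucket behaviour behind api.rateLimits can be sketched as follows. This is a minimal illustration in Python, not the service's .NET implementation; the class name and the capacity and refill_per_sec parameters are hypothetical and do not correspond to documented configuration keys.

```python
class TokenBucket:
    """Holds up to `capacity` tokens and refills at `refill_per_sec` tokens per second."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start full so a short initial burst is allowed
        self.last = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller would typically answer HTTP 429


bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
burst = [bucket.try_acquire(now=0.0) for _ in range(3)]  # third call is rejected
```

Passing the clock in as an argument keeps the sketch deterministic under test; a real limiter would read a monotonic clock.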
Plug-ins. All channel connectors are packaged under <baseDirectory>/plugins/notify. The ordered load list must start with Slack/Teams before Email/Webhook so chat-first actions are registered deterministically for Offline Kit bundles:
plugins:
  baseDirectory: "/var/opt/stellaops"
  directory: "plugins/notify"
  orderedPlugins:
    - StellaOps.Notify.Connectors.Slack
    - StellaOps.Notify.Connectors.Teams
    - StellaOps.Notify.Connectors.Email
    - StellaOps.Notify.Connectors.Webhook
The Offline Kit job simply copies the plugins/notify tree into the air-gapped bundle; the ordered list keeps connector manifests stable across environments.
Authority clients. Register two OAuth clients in StellaOps Authority: notify-web-dev (audience notify.dev) for development and notify-web (audience notify) for staging/production. Both require notify.read and notify.admin scopes and use DPoP-bound client credentials (client_secret in the samples). Reference entries live in etc/authority.yaml.sample, with placeholder secrets under etc/secrets/notify-web*.secret.example.
2) Responsibilities
- Ingest platform events from internal bus with strong ordering per key (e.g., image digest).
- Evaluate rules (tenant‑scoped) with matchers: severity changes, namespaces, repos, labels, KEV flags, provider provenance (VEX), component keys, admission decisions, etc.
- Control noise: throttle, coalesce (digest windows), and dedupe via idempotency keys.
- Render channel‑specific messages using safe templates; include evidence and links.
- Deliver with retries/backoff; record outcome; expose delivery history to UI.
- Test paths (send test to channel targets) without touching live rules.
- Audit: log who configured what, when, and why a message was sent.
3) Event model (inputs)
Notify subscribes to the internal event bus (produced by services, escaped JSON; gzip allowed with caps):
- scanner.scan.completed: new SBOM(s) composed; artifacts ready
- scanner.report.ready: analysis verdict (policy+vex) available; carries deltas summary
- scheduler.rescan.delta: new findings after Feedser/Vexer deltas (already summarized)
- attestor.logged: Rekor UUID returned (sbom/report/vex export)
- zastava.admission: admit/deny with reasons, namespace, image digests
- feedser.export.completed: new export ready (rarely notified directly; usually drives Scheduler)
- vexer.export.completed: new consensus snapshot (ditto)
Canonical envelope (bus → Notify.Engine):
{
  "eventId": "uuid",
  "kind": "scanner.report.ready",
  "tenant": "tenant-01",
  "ts": "2025-10-18T05:41:22Z",
  "actor": "scanner-webservice",
  "scope": { "namespace":"payments", "repo":"ghcr.io/acme/api", "digest":"sha256:..." },
  "payload": { /* kind-specific fields, see below */ }
}
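As an illustration of how a consumer might validate this envelope before rule evaluation, here is a minimal Python sketch. Treating all seven top-level fields as required is an assumption, and the Envelope class and parse_envelope function are hypothetical names, not part of Notify.Engine.

```python
import json
from dataclasses import dataclass
from datetime import datetime

# Assumption: every top-level field shown in the canonical envelope is mandatory.
REQUIRED = ("eventId", "kind", "tenant", "ts", "actor", "scope", "payload")


@dataclass(frozen=True)
class Envelope:
    event_id: str
    kind: str
    tenant: str
    ts: datetime
    actor: str
    scope: dict
    payload: dict


def parse_envelope(raw: str) -> Envelope:
    doc = json.loads(raw)
    missing = [k for k in REQUIRED if k not in doc]
    if missing:
        raise ValueError(f"envelope missing fields: {missing}")
    # Normalise the trailing "Z" so older Python versions can parse the timestamp.
    ts = datetime.fromisoformat(doc["ts"].replace("Z", "+00:00"))
    return Envelope(doc["eventId"], doc["kind"], doc["tenant"], ts,
                    doc["actor"], doc["scope"], doc["payload"])
```

Rejecting malformed envelopes at the edge keeps the later stages (matching, rendering) free of defensive checks.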
Examples (payload cores):
- scanner.report.ready:
  {
    "reportId": "report-3def...",
    "verdict": "fail",
    "summary": {"total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2},
    "delta": {"newCritical": 1, "kev": ["CVE-2025-..."]},
    "links": {"ui": "https://ui/.../reports/report-3def...", "rekor": "https://rekor/..."},
    "dsse": { "...": "..." },
    "report": { "...": "..." }
  }
  Payload embeds both the canonical report document and the DSSE envelope so connectors, Notify, and UI tooling can reuse the signed bytes without re-serialising.
- scanner.scan.completed:
  {
    "reportId": "report-3def...",
    "digest": "sha256:...",
    "verdict": "fail",
    "summary": {"total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2},
    "delta": {"newCritical": 1, "kev": ["CVE-2025-..."]},
    "policy": {"revisionId": "rev-42", "digest": "27d2..."},
    "findings": [{"id": "finding-1", "severity": "Critical", "cve": "CVE-2025-...", "reachability": "runtime"}],
    "dsse": { "...": "..." }
  }
- zastava.admission:
  {
    "decision": "deny|allow",
    "reasons": ["unsigned image", "missing SBOM"],
    "images": [{"digest": "sha256:...", "signed": false, "hasSbom": false}]
  }
4) Rules engine — semantics
Rule shape (simplified):
name: "high-critical-alerts-prod"
enabled: true
match:
  eventKinds: ["scanner.report.ready","scheduler.rescan.delta","zastava.admission"]
  namespaces: ["prod-*"]
  repos: ["ghcr.io/acme/*"]
  minSeverity: "high"            # min of new findings (delta context)
  kev: true                      # require KEV-tagged or allow any if false
  verdict: ["fail","deny"]       # filter for report/admission
  vex:
    includeRejectedJustifications: false    # notify only on accepted 'affected'
actions:
  - channel: "slack:sec-alerts"  # reference to Channel object
    template: "concise"
    throttle: "5m"
  - channel: "email:soc"
    digest: "hourly"
    template: "detailed"
Evaluation order
- Tenant check → discard if rule tenant ≠ event tenant.
- Kind filter → discard early.
- Scope match (namespace/repo/labels).
- Delta/severity gates (if event carries delta).
- VEX gate (drop if event’s finding is not affected under policy consensus unless rule says otherwise).
- Throttling/dedup (idempotency key) — skip if suppressed.
- Actions → enqueue per‑channel job(s).
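The early gates in this order (tenant, kind, scope, and the KEV part of the delta gate) might look roughly like the Python sketch below. It is illustrative only: the matches function is hypothetical, glob matching via fnmatch is an assumption about how patterns such as prod-* are interpreted, and the severity and VEX gates are omitted.

```python
from fnmatch import fnmatchcase


def matches(rule: dict, event: dict) -> bool:
    """Cheap gates first; True means the event proceeds to throttle/dedupe."""
    # 1. Tenant check.
    if rule["tenantId"] != event["tenant"]:
        return False
    m = rule["match"]
    # 2. Kind filter.
    kinds = m.get("eventKinds")
    if kinds and event["kind"] not in kinds:
        return False
    # 3. Scope match: each configured pattern list must match the event scope.
    scope = event.get("scope", {})
    for field, key in (("namespaces", "namespace"), ("repos", "repo")):
        patterns = m.get(field)
        if patterns and not any(fnmatchcase(scope.get(key, ""), p) for p in patterns):
            return False
    # 4. Delta gate (KEV only), applied only when the event carries a delta.
    delta = event.get("payload", {}).get("delta")
    if delta is not None and m.get("kev") and not delta.get("kev"):
        return False
    return True
```

Ordering the checks from cheapest to most expensive means most non-matching events are discarded after one or two comparisons.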
Idempotency key: hash(ruleId | actionId | event.kind | scope.digest | delta.hash | day-bucket); ensures “same alert” doesn’t fire more than once within throttle window.
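A possible derivation of this key is sketched below in Python. SHA-256, the length-prefixed encoding, and a UTC calendar day as the bucket are all assumptions; the text above only fixes which fields participate. The idempotency_key function name is hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def idempotency_key(rule_id: str, action_id: str, kind: str,
                    digest: str, delta_hash: str, ts: datetime) -> str:
    # Assumption: the day bucket is the UTC calendar date of the event.
    day_bucket = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    parts = [rule_id, action_id, kind, digest, delta_hash, day_bucket]
    # Length-prefix each part so ("ab", "c") and ("a", "bc") cannot collide.
    material = "".join(f"{len(p)}:{p}|" for p in parts)
    return "idem:" + hashlib.sha256(material.encode("utf-8")).hexdigest()


key = idempotency_key("rule-1", "action-0", "scanner.report.ready",
                      "sha256:abc", "d1",
                      datetime(2025, 10, 18, 5, 41, 22, tzinfo=timezone.utc))
```

In this sketch the key only identifies the alert; suppression within the throttle window would come from the TTL on the stored key.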
Digest windows: maintain per action a coalescer:
- Window: 5m|15m|1h|1d (configurable); coalesces events by tenant + namespace/repo or by digest group.
- Digest messages summarize top N items and counts, with safe truncation.
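The coalescer described above can be sketched as an in-memory structure. This Python version is illustrative only: DigestCoalescer and its method names are hypothetical, "top N" is simplified to the first N items in arrival order, and a real worker would persist open windows rather than hold them in process memory.

```python
class DigestCoalescer:
    def __init__(self, window_seconds: float, max_items: int = 100):
        self.window = window_seconds
        self.max_items = max_items
        self._open = {}  # action_key -> (opened_at, items)

    def add(self, action_key: str, item: dict, now: float) -> None:
        # The first item opens the window; later items join it.
        _, items = self._open.setdefault(action_key, (now, []))
        items.append(item)

    def flush_due(self, now: float) -> dict:
        """Return {action_key: summary} for every window whose timer has expired."""
        due = [k for k, (opened, _) in self._open.items() if now - opened >= self.window]
        out = {}
        for k in due:
            _, items = self._open.pop(k)
            out[k] = {
                "count": len(items),
                "top": items[: self.max_items],  # safe truncation of long digests
                "truncated": max(0, len(items) - self.max_items),
            }
        return out
```

A timer loop would call flush_due periodically and hand each summary to the renderer.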
5) Channels & connectors (plug‑ins)
Channel config is two‑part: a Channel record (name, type, options) and a Secret reference (Vault/K8s Secret). Connectors are restart-time plug-ins discovered on service start (same manifest convention as Concelier/Excititor) and live under plugins/notify/<channel>/.
Built‑in v1:
- Slack: Bot token (xoxb‑…), chat.postMessage + blocks; rate-limit aware (HTTP 429).
- Microsoft Teams: Incoming Webhook (or Graph card later); adaptive card payloads.
- Email (SMTP): TLS (STARTTLS or implicit), From/To/CC/BCC; HTML+text alt; DKIM optional.
- Generic Webhook: POST JSON with a signature (HMAC‑SHA‑256 or Ed25519) in headers.
Connector contract: (implemented by plug-in assemblies)
public interface INotifyConnector {
  string Type { get; } // "slack" | "teams" | "email" | "webhook" | ...
  Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct);
  Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct);
}
DeliveryContext includes rendered content and raw event for audit.
Test-send previews. Plug-ins can optionally implement INotifyChannelTestProvider to shape /channels/{id}/test responses. Providers receive a sanitised ChannelTestPreviewContext (channel, tenant, target, timestamp, trace) and return a NotifyDeliveryRendered preview + metadata. When no provider is present, the host falls back to a generic preview so the endpoint always responds.
Secrets: ChannelConfig.secretRef points to Authority‑managed secret handle or K8s Secret path; workers load at send-time; plug-in manifests (notify-plugin.json) declare capabilities and version.
6) Templates & rendering
Template engine: strongly typed, safe Handlebars‑style; no arbitrary code. Partial templates per channel. Deterministic outputs (prop order, no locale drift unless requested).
Variables (examples):
- event.kind, event.ts, scope.namespace, scope.repo, scope.digest
- payload.verdict, payload.delta.newCritical, payload.links.ui, payload.links.rekor
- topFindings[] with purl, vulnId, severity
- policy.name, policy.revision (if available)
Helpers:
- severity_icon(sev), link(text,url), pluralize(n, "finding"), truncate(text, n), code(text).
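Two of these helpers are simple enough to sketch. The Python below is a guess at plausible semantics (naive English plural, hard cut with an ellipsis), not the template engine's actual behaviour.

```python
def pluralize(n: int, noun: str) -> str:
    # Naive English plural; irregular nouns would need a lookup table.
    return f"{n} {noun}" if n == 1 else f"{n} {noun}s"


def truncate(text: str, n: int) -> str:
    # Keep the result within n characters, marking the cut with an ellipsis.
    if len(text) <= n:
        return text
    if n <= 1:
        return text[:n]
    return text[: n - 1] + "…"


summary = f"{pluralize(3, 'finding')} in {truncate('ghcr.io/acme/api', 12)}"
```

Because the helpers are pure functions of their arguments, the same inputs always render the same output, which supports the deterministic-output requirement.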
Channel mapping:
- Slack: title + blocks, limited to 50 blocks/3000 chars per section; long lists → link to UI.
- Teams: Adaptive Card schema 1.5; fallback text for older channels.
- Email: HTML + text; inline table of top N findings, rest behind UI link.
- Webhook: JSON with event, ruleId, actionId, summary, links, and raw payload subset.
i18n: template set per locale (English default; Bulgarian built‑in).
7) Data model (Mongo)
Canonical JSON Schemas for rules/channels/events live in docs/notify/schemas/. Sample payloads intended for tests/UI mock responses are captured in docs/notify/samples/.
Database: notify
- rules
  { _id, tenantId, name, enabled, match, actions, createdBy, updatedBy, createdAt, updatedAt }
- channels
  { _id, tenantId, name:"slack:sec-alerts", type:"slack",
    config:{ webhookUrl?:"", channel:"#sec-alerts", workspace?: "...", secretRef:"ref://..." },
    createdAt, updatedAt }
- deliveries
  { _id, tenantId, ruleId, actionId, eventId, kind, scope,
    status:"sent|failed|throttled|digested|dropped",
    attempts:[{ts, status, code, reason}],
    rendered:{ title, body, target },   // redacted for PII; body hash stored
    sentAt, lastError? }
- digests
  { _id, tenantId, actionKey, window:"hourly", openedAt, items:[{eventId, scope, delta}], status:"open|flushed" }
- throttles
  { key:"idem:<hash>", ttlAt }   // short-lived, also cached in Redis
Indexes: rules by {tenantId, enabled}, deliveries by {tenantId, sentAt desc}, digests by {tenantId, actionKey}.
8) External APIs (WebService)
Base path: /api/v1/notify (Authority OpToks; scopes: notify.admin for write, notify.read for view).
All REST calls require the tenant header X-StellaOps-Tenant (matches the canonical tenantId stored in Mongo). Payloads are normalised via NotifySchemaMigration before persistence to guarantee schema version pinning.
Authentication today is stubbed with Bearer tokens (Authorization: Bearer <token>). When Authority wiring lands, this will switch to OpTok validation + scope enforcement, but the header contract will remain the same.
Service configuration exposes notify:auth:* keys (issuer, audience, signing key, scope names) so operators can wire the Authority JWKS or (in dev) a symmetric test key. notify:storage:* keys cover Mongo URI/database/collection overrides. Both sets are required for the new API surface.
Internal tooling can hit /internal/notify/<entity>/normalize to upgrade legacy JSON and return canonical output used in the docs fixtures.
- Channels
  - POST /channels | GET /channels | GET /channels/{id} | PATCH /channels/{id} | DELETE /channels/{id}
  - POST /channels/{id}/test → send sample message (no rule evaluation); returns 202 Accepted with rendered preview + metadata (base keys: channelType, target, previewProvider, traceId + connector-specific entries); governed by api.rateLimits:testSend.
  - GET /channels/{id}/health → connector self‑check
- Rules
  - POST /rules | GET /rules | GET /rules/{id} | PATCH /rules/{id} | DELETE /rules/{id}
  - POST /rules/{id}/test → dry‑run rule against a sample event (no delivery unless --send)
- Deliveries
  - POST /deliveries → ingest worker delivery state (idempotent via deliveryId).
  - GET /deliveries?since=...&status=...&limit=... → list envelope { items, count, continuationToken } (most recent first); base metadata keys match the test-send response (channelType, target, previewProvider, traceId); rate-limited via api.rateLimits.deliveryHistory. See docs/notify/samples/notify-delivery-list-response.sample.json.
  - GET /deliveries/{id} → detail (redacted body + metadata)
  - POST /deliveries/{id}/retry → force retry (admin, future sprint)
- Admin
  - GET /stats (per tenant counts, last hour/day)
  - GET /healthz | GET /readyz (liveness)
  - POST /locks/acquire | POST /locks/release: worker coordination primitives (short TTL).
  - POST /digests | GET /digests/{actionKey} | DELETE /digests/{actionKey}: manage open digest windows.
  - POST /audit | GET /audit?since=&limit=: append/query structured audit trail entries.
Ingestion: workers do not expose public ingestion; they subscribe to the internal bus. (Optional /events/test for integration testing, admin‑only.)
9) Delivery pipeline (worker)
[Event bus] → [Ingestor] → [RuleMatcher] → [Throttle/Dedupe] → [DigestCoalescer] → [Renderer] → [Connector] → [Result]
                                                 └────────→ [DeliveryStore]
- Ingestor: N consumers with per‑key ordering (key = tenant|digest|namespace).
- RuleMatcher: loads active rules snapshot for tenant into memory; vectorized predicate check.
- Throttle/Dedupe: consult Redis + Mongo throttles; if hit → record status=throttled.
- DigestCoalescer: append to open digest window or flush when timer expires.
- Renderer: select template (channel+locale), inject variables, enforce length limits, compute bodyHash.
- Connector: send; handle provider‑specific rate limits and backoffs; retry up to maxAttempts with exponential backoff and jitter; overflow → DLQ (dead‑letter topic) + UI surfacing.
Idempotency: per action idempotency key stored in Redis (TTL = throttle window or digest window). Connectors also respect provider idempotency where available (e.g., Slack client_msg_id).
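Per-key ordering in the Ingestor is commonly achieved by hashing the ordering key to a fixed partition, so that all events for one key are handled by the same consumer. The Python sketch below illustrates that idea under the assumption of a fixed partition count; partition_for is a hypothetical name and the actual bus-level mechanism (Redis Streams or JetStream) may differ.

```python
import hashlib


def partition_for(tenant: str, digest: str, namespace: str, partitions: int) -> int:
    """Map the ordering key (tenant|digest|namespace) to a stable partition index."""
    key = f"{tenant}|{digest}|{namespace}"
    # Use a cryptographic hash rather than Python's built-in hash(), which is
    # salted per process and would route the same key differently across workers.
    h = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(h[:8], "big") % partitions


p = partition_for("tenant-01", "sha256:abc", "payments", partitions=8)
```

Events that share a key always land on the same partition, so a single consumer sees them in arrival order.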
10) Reliability & rate controls
- Per‑tenant RPM caps (default 600/min) + per‑channel concurrency (Slack 1–4, Teams 1–2, Email 8–32 based on relay).
- Backoff map: Slack 429 → respect Retry‑After; SMTP 4xx → retry; HTTP 5xx → retry with jitter; permanent rejects (e.g., SMTP 5xx) → drop with status recorded.
- DLQ: NATS/Redis stream notify.dlq with {event, rule, action, error} for operator inspection; UI shows DLQ items.
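The retry delay implied by the backoff map could be computed along these lines. This Python sketch assumes "full jitter" exponential backoff with a hypothetical base of 1 s and cap of 60 s; neither value comes from the Notify configuration, and next_delay is a hypothetical name.

```python
import random
from typing import Callable, Optional


def next_delay(attempt: int,
               retry_after: Optional[float] = None,
               base: float = 1.0,
               cap: float = 60.0,
               rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # A provider-supplied Retry-After always wins over the local schedule.
    if retry_after is not None:
        return max(0.0, retry_after)
    # Exponential ceiling, capped, then a uniform draw below it ("full jitter").
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

Injecting rng makes the schedule testable; spreading retries uniformly under the ceiling avoids synchronized retry bursts from many workers.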
11) Security & privacy
- AuthZ: all APIs require Authority OpToks; actions scoped by tenant.
- Secrets: secretRef only; Notify fetches just‑in‑time from Authority Secret proxy or K8s Secret (mounted). No plaintext secrets in Mongo.
- Egress TLS: validate server certificates; pin domains per channel config; optional CA bundle override for on‑prem SMTP.
- Webhook signing: HMAC or Ed25519 signatures in X-StellaOps-Signature + replay‑window timestamp; include canonical body hash in header.
- Redaction: deliveries store hashes of bodies, not full payloads for chat/email to minimize PII retention (configurable).
- Quiet hours: per tenant (e.g., 22:00–06:00) route high‑sev only; defer others to digests.
- Loop prevention: Webhook target allowlist + event origin tags; do not ingest own webhooks.
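For the HMAC variant of webhook signing, a sender and receiver might cooperate as in the Python sketch below. Only X-StellaOps-Signature is named in this document; the X-StellaOps-Timestamp and X-StellaOps-Body-SHA256 header names, the signed-string layout, and the 300-second tolerance are assumptions made for illustration.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes, timestamp: int) -> dict:
    body_hash = hashlib.sha256(body).hexdigest()
    # Sign timestamp + body hash so neither can be swapped independently.
    msg = f"{timestamp}.{body_hash}".encode("utf-8")
    sig = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return {
        "X-StellaOps-Signature": sig,
        "X-StellaOps-Timestamp": str(timestamp),   # hypothetical header name
        "X-StellaOps-Body-SHA256": body_hash,      # hypothetical header name
    }


def verify(secret: bytes, body: bytes, headers: dict, now: int, tolerance: int = 300) -> bool:
    try:
        ts = int(headers["X-StellaOps-Timestamp"])
    except (KeyError, ValueError):
        return False
    # Replay window: reject timestamps too far from the receiver's clock.
    if abs(now - ts) > tolerance:
        return False
    expected = sign(secret, body, ts)["X-StellaOps-Signature"]
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, headers.get("X-StellaOps-Signature", ""))
```

The receiver recomputes the signature from the raw request bytes, so any re-serialisation of the JSON body before verification would break the check.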
12) Observability (Prometheus + OTEL)
- notify.events_consumed_total{kind}
- notify.rules_matched_total{ruleId}
- notify.throttled_total{reason}
- notify.digest_coalesced_total{window}
- notify.sent_total{channel} / notify.failed_total{channel,code}
- notify.delivery_latency_seconds{channel} (end‑to‑end)
- Tracing: spans ingest, match, render, send; correlation id = eventId.
SLO targets
- Event→delivery p95 ≤ 30–60 s under nominal load.
- Failure rate p95 < 0.5% per hour (excluding provider outages).
- Duplicate rate ≈ 0 (idempotency working).
13) Configuration (YAML)
notify:
  authority:
    issuer: "https://authority.internal"
    require: "dpop"               # or "mtls"
  bus:
    kind: "redis"                 # or "nats"
    streams:
      - "scanner.events"
      - "scheduler.events"
      - "attestor.events"
      - "zastava.events"
  mongo:
    uri: "mongodb://mongo/notify"
  limits:
    perTenantRpm: 600
    perChannel:
      slack:   { concurrency: 2 }
      teams:   { concurrency: 1 }
      email:   { concurrency: 8 }
      webhook: { concurrency: 8 }
  digests:
    defaultWindow: "1h"
    maxItems: 100
  quietHours:
    enabled: true
    window: "22:00-06:00"
    minSeverity: "critical"
  webhooks:
    sign:
      method: "ed25519"           # or "hmac-sha256"
      keyRef: "ref://notify/webhook-sign-key"
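The quietHours window in this configuration wraps past midnight, which is easy to get wrong. The Python sketch below shows one way to evaluate it; the severity ordering, the function names, and the "send" / "digest" return values are assumptions, and the clock time is taken to be in the tenant's local zone.

```python
from datetime import time

# Assumption: severities are ordered from least to most severe as listed here.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def in_quiet_hours(now: time, window: str) -> bool:
    start_s, end_s = window.split("-")
    start, end = time.fromisoformat(start_s), time.fromisoformat(end_s)
    if start <= end:
        return start <= now < end
    # Window wraps midnight, e.g. 22:00-06:00.
    return now >= start or now < end


def route(now: time, severity: str,
          window: str = "22:00-06:00", min_severity: str = "critical") -> str:
    if not in_quiet_hours(now, window):
        return "send"
    if SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(min_severity):
        return "send"
    return "digest"  # defer to the digest window during quiet hours
```

With the defaults above, a high-severity event at 23:00 is deferred to a digest while a critical one is still sent immediately.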
14) UI touch‑points
- Notifications → Channels: add Slack/Teams/Email/Webhook; run health; rotate secrets.
- Notifications → Rules: create/edit YAML rules with linting; test with sample events; see match rate.
- Notifications → Deliveries: timeline with filters (status, channel, rule); inspect last error; retry.
- Digest preview: shows current window contents and when it will flush.
- Quiet hours: configure per tenant; show overrides.
- DLQ: browse dead‑letters; requeue after fix.
15) Failure modes & responses
| Condition | Behavior | 
|---|---|
| Slack 429 / Teams 429 | Respect Retry‑After, backoff with jitter, reduce concurrency | 
| SMTP transient 4xx | Retry up to maxAttempts; escalate to DLQ on exhaust | 
| Invalid channel secret | Mark channel unhealthy; suppress sends; surface in UI | 
| Rule explosion (matches everything) | Safety valve: per‑tenant RPM caps; auto‑pause rule after X drops; UI alert | 
| Bus outage | Buffer to local queue (bounded); resume consuming when healthy | 
| Mongo slowness | Fall back to Redis throttles; batch write deliveries; shed low‑priority notifications | 
16) Testing matrix
- Unit: matchers, throttle math, digest coalescing, idempotency keys, template rendering edge cases.
- Connectors: provider‑level rate limits, payload size truncation, error mapping.
- Integration: synthetic event storm (10k/min), ensure p95 latency & duplicate rate.
- Security: DPoP/mTLS on APIs; secretRef resolution; webhook signing & replay windows.
- i18n: localized templates render deterministically.
- Chaos: Slack/Teams API flaps; SMTP greylisting; Redis hiccups; ensure graceful degradation.
17) Sequences (representative)
A) New criticals after Feedser delta (Slack immediate + Email hourly digest)
sequenceDiagram
  autonumber
  participant SCH as Scheduler
  participant NO as Notify.Worker
  participant SL as Slack
  participant SMTP as Email
  SCH->>NO: bus event scheduler.rescan.delta { newCritical:1, digest:sha256:... }
  NO->>NO: match rules (Slack immediate; Email hourly digest)
  NO->>SL: chat.postMessage (concise)
  SL-->>NO: 200 OK
  NO->>NO: append to digest window (email:soc)
  Note over NO: At window close → render digest email
  NO->>SMTP: send email (detailed digest)
  SMTP-->>NO: 250 OK
B) Admission deny (Teams card + Webhook)
sequenceDiagram
  autonumber
  participant ZA as Zastava
  participant NO as Notify.Worker
  participant TE as Teams
  participant WH as Webhook
  ZA->>NO: bus event zastava.admission { decision: "deny", reasons: [...] }
  NO->>TE: POST adaptive card
  TE-->>NO: 200 OK
  NO->>WH: POST JSON (signed)
  WH-->>NO: 2xx
18) Implementation notes
- Language: .NET 10; minimal API; System.Text.Json with canonical writer for body hashing; Channels for pipelines.
- Bus: Redis Streams (XGROUP consumers) or NATS JetStream for at‑least‑once with ack; per‑tenant consumer groups to localize backpressure.
- Templates: compile and cache per rule+channel+locale; version with rule updatedAt to invalidate.
- Rules: store raw YAML + parsed AST; validate with schema + static checks (e.g., nonsensical combos).
- Secrets: pluggable secret resolver (Authority Secret proxy, K8s, Vault).
- Rate limiting: System.Threading.RateLimiting + per‑connector adapters.
19) Roadmap (post‑v1)
- PagerDuty/Opsgenie connectors; Jira ticket creation.
- User inbox (in‑app notifications) + mobile push via webhook relay.
- Anomaly suppression: auto‑pause noisy rules with hints (learned thresholds).
- Graph rules: “only notify if not_affected → affected transition at consensus layer”.
- Label enrichment: pluggable taggers (business criticality, data classification) to refine matchers.