Scope. Implementation‑ready architecture for Notify: a rules‑driven, tenant‑aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator‑defined routing rules, renders channel‑specific messages (Slack/Teams/Email/Webhook), and delivers them reliably with idempotency, throttling, and digests. It is UI‑managed, auditable, and safe by default (no secrets leakage, no spam storms).
0) Mission & boundaries
Mission. Convert facts from Stella Ops into actionable, noise‑controlled signals where teams already live (chat/email/webhooks), with explainable reasons and deep links to the UI.
Boundaries.
- Notify does not make policy decisions and does not rescan; it consumes events from Scanner/Scheduler/Vexer/Feedser/Attestor/Zastava and routes them.
- Attachments are links (UI/attestation pages); Notify does not attach SBOMs or large blobs to messages.
- Secrets for channels (Slack tokens, SMTP creds) are referenced, not stored raw in Mongo.
1) Runtime shape & projects
src/
├─ StellaOps.Notify.WebService/ # REST: rules/channels CRUD, test send, deliveries browse
├─ StellaOps.Notify.Worker/ # consumers + evaluators + renderers + delivery workers
├─ StellaOps.Notify.Connectors.* / # channel plug-ins: Slack, Teams, Email, Webhook (v1)
│ └─ *.Tests/
├─ StellaOps.Notify.Engine/ # rules engine, templates, idempotency, digests, throttles
├─ StellaOps.Notify.Models/ # DTOs (Rule, Channel, Event, Delivery, Template)
├─ StellaOps.Notify.Storage.Mongo/ # rules, channels, deliveries, digests, locks
├─ StellaOps.Notify.Queue/ # bus client (Redis Streams/NATS JetStream)
└─ StellaOps.Notify.Tests.* # unit/integration/e2e
Deployables:
- Notify.WebService (stateless API)
- Notify.Worker (horizontal scale)
Dependencies: Authority (OpToks; DPoP/mTLS), MongoDB, Redis/NATS (bus), HTTP egress to Slack/Teams/Webhooks, SMTP relay for Email.
Configuration. Notify.WebService bootstraps from notify.yaml (see etc/notify.yaml.sample). Use storage.driver: mongo with a production connection string; the optional memory driver exists only for tests. Authority settings follow the platform defaults: when running locally without Authority, set authority.enabled: false and supply developmentSigningKey so JWTs can be validated offline.
api.rateLimits exposes token-bucket controls for delivery history queries and test-send previews (deliveryHistory, testSend). Default values allow generous browsing while preventing accidental bursts; operators can relax or tighten the buckets per deployment.
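The deliveryHistory and testSend buckets behave like standard token buckets; a minimal sketch of the mechanics (illustrative only, not Notify's actual limiter, and the class/parameter names are invented for this example):

```python
import time

class TokenBucket:
    """Minimal token bucket: capacity caps bursts, refill rate caps sustained throughput."""
    def __init__(self, capacity: int, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_acquire(self, cost: int = 1) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_second)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=5, refill_per_second=1.0)
results = [bucket.try_acquire() for _ in range(6)]
# First five rapid requests pass; the sixth is rejected until tokens refill.
```

Capacity absorbs short UI bursts while the refill rate bounds steady-state request volume, which is why the defaults can stay "generous" without risking accidental storms.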
Plug-ins. All channel connectors are packaged under <baseDirectory>/plugins/notify. The ordered load list must start with Slack/Teams before Email/Webhook so chat-first actions are registered deterministically for Offline Kit bundles:
plugins:
  baseDirectory: "/var/opt/stellaops"
  directory: "plugins/notify"
  orderedPlugins:
    - StellaOps.Notify.Connectors.Slack
    - StellaOps.Notify.Connectors.Teams
    - StellaOps.Notify.Connectors.Email
    - StellaOps.Notify.Connectors.Webhook
The Offline Kit job simply copies the plugins/notify tree into the air-gapped bundle; the ordered list keeps connector manifests stable across environments.
Authority clients. Register two OAuth clients in StellaOps Authority: notify-web-dev (audience notify.dev) for development and notify-web (audience notify) for staging/production. Both require the notify.read and notify.admin scopes and use DPoP-bound client credentials (client_secret in the samples). Reference entries live in etc/authority.yaml.sample, with placeholder secrets under etc/secrets/notify-web*.secret.example.
2) Responsibilities
- Ingest platform events from internal bus with strong ordering per key (e.g., image digest).
- Evaluate rules (tenant‑scoped) with matchers: severity changes, namespaces, repos, labels, KEV flags, provider provenance (VEX), component keys, admission decisions, etc.
- Control noise: throttle, coalesce (digest windows), and dedupe via idempotency keys.
- Render channel‑specific messages using safe templates; include evidence and links.
- Deliver with retries/backoff; record outcome; expose delivery history to UI.
- Test paths (send test to channel targets) without touching live rules.
- Audit: log who configured what, when, and why a message was sent.
3) Event model (inputs)
Notify subscribes to the internal event bus (produced by services, escaped JSON; gzip allowed with caps):
- scanner.scan.completed — new SBOM(s) composed; artifacts ready
- scanner.report.ready — analysis verdict (policy+vex) available; carries deltas summary
- scheduler.rescan.delta — new findings after Feedser/Vexer deltas (already summarized)
- attestor.logged — Rekor UUID returned (sbom/report/vex export)
- zastava.admission — admit/deny with reasons, namespace, image digests
- feedser.export.completed — new export ready (rarely notified directly; usually drives Scheduler)
- vexer.export.completed — new consensus snapshot (ditto)
Canonical envelope (bus → Notify.Engine):
{
"eventId": "uuid",
"kind": "scanner.report.ready",
"tenant": "tenant-01",
"ts": "2025-10-18T05:41:22Z",
"actor": "scanner-webservice",
"scope": { "namespace":"payments", "repo":"ghcr.io/acme/api", "digest":"sha256:..." },
"payload": { /* kind-specific fields, see below */ }
}
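A consumer-side sanity check of this envelope can be sketched as follows (field list mirrors the JSON above; the function itself is illustrative, not Notify's actual ingest code):

```python
REQUIRED = {"eventId", "kind", "tenant", "ts", "actor", "scope", "payload"}

def validate_envelope(evt: dict) -> list:
    """Return a list of problems; an empty list means the envelope is structurally sound."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - evt.keys())]
    if "scope" in evt and not isinstance(evt["scope"], dict):
        problems.append("scope must be an object")
    return problems

evt = {"eventId": "uuid", "kind": "scanner.report.ready", "tenant": "tenant-01",
       "ts": "2025-10-18T05:41:22Z", "actor": "scanner-webservice",
       "scope": {"digest": "sha256:..."}, "payload": {}}
assert validate_envelope(evt) == []          # well-formed envelope
assert validate_envelope({"kind": "x"}) != []  # missing fields are reported
```

Rejecting malformed envelopes at the ingest boundary keeps downstream rule evaluation free of defensive null checks.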
Examples (payload cores):
- scanner.report.ready:

{
  "reportId": "report-3def...",
  "verdict": "fail",
  "summary": { "total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2 },
  "delta": { "newCritical": 1, "kev": ["CVE-2025-..."] },
  "links": { "ui": "https://ui/.../reports/report-3def...", "rekor": "https://rekor/..." },
  "dsse": { "...": "..." },
  "report": { "...": "..." }
}

The payload embeds both the canonical report document and the DSSE envelope so connectors, Notify, and UI tooling can reuse the signed bytes without re-serialising.
- scanner.scan.completed:

{
  "reportId": "report-3def...",
  "digest": "sha256:...",
  "verdict": "fail",
  "summary": { "total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2 },
  "delta": { "newCritical": 1, "kev": ["CVE-2025-..."] },
  "policy": { "revisionId": "rev-42", "digest": "27d2..." },
  "findings": [{ "id": "finding-1", "severity": "Critical", "cve": "CVE-2025-...", "reachability": "runtime" }],
  "dsse": { "...": "..." }
}

- zastava.admission:

{
  "decision": "deny|allow",
  "reasons": ["unsigned image", "missing SBOM"],
  "images": [{ "digest": "sha256:...", "signed": false, "hasSbom": false }]
}
4) Rules engine — semantics
Rule shape (simplified):
name: "high-critical-alerts-prod"
enabled: true
match:
eventKinds: ["scanner.report.ready","scheduler.rescan.delta","zastava.admission"]
namespaces: ["prod-*"]
repos: ["ghcr.io/acme/*"]
minSeverity: "high" # min of new findings (delta context)
kev: true # require KEV-tagged or allow any if false
verdict: ["fail","deny"] # filter for report/admission
vex:
includeRejectedJustifications: false # notify only on accepted 'affected'
actions:
- channel: "slack:sec-alerts" # reference to Channel object
template: "concise"
throttle: "5m"
- channel: "email:soc"
digest: "hourly"
template: "detailed"
Evaluation order
- Tenant check → discard if rule tenant ≠ event tenant.
- Kind filter → discard early.
- Scope match (namespace/repo/labels).
- Delta/severity gates (if event carries delta).
- VEX gate (drop if event’s finding is not affected under policy consensus unless rule says otherwise).
- Throttling/dedup (idempotency key) — skip if suppressed.
- Actions → enqueue per‑channel job(s).
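The scope-match step against glob patterns such as prod-* can be sketched with stdlib globbing (field names follow the rule shape above; the function itself is an illustrative stand-in for the real matcher):

```python
from fnmatch import fnmatch

def matches_scope(rule: dict, event: dict) -> bool:
    """Illustrative kind + namespace/repo glob check for a rule's match block."""
    match = rule.get("match", {})
    scope = event.get("scope", {})
    if event.get("kind") not in match.get("eventKinds", []):
        return False  # kind filter: discard early
    ns_globs = match.get("namespaces")
    if ns_globs and not any(fnmatch(scope.get("namespace", ""), g) for g in ns_globs):
        return False
    repo_globs = match.get("repos")
    if repo_globs and not any(fnmatch(scope.get("repo", ""), g) for g in repo_globs):
        return False
    return True

rule = {"match": {"eventKinds": ["scanner.report.ready"],
                  "namespaces": ["prod-*"],
                  "repos": ["ghcr.io/acme/*"]}}
event = {"kind": "scanner.report.ready",
         "scope": {"namespace": "prod-payments", "repo": "ghcr.io/acme/api"}}
assert matches_scope(rule, event)  # kind, namespace glob, and repo glob all match
```

Absent matcher fields act as wildcards, which mirrors how the simplified rule shape omits filters it does not need.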
Idempotency key: hash(ruleId | actionId | event.kind | scope.digest | delta.hash | day-bucket); ensures “same alert” doesn’t fire more than once within throttle window.
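That key can be computed deterministically; a sketch of the hashing (the SHA-256 choice, separator, and prefix are assumptions for illustration, not the service's exact scheme):

```python
import hashlib

def idempotency_key(rule_id: str, action_id: str, kind: str,
                    digest: str, delta_hash: str, day_bucket: str) -> str:
    """Stable key: the same alert within the same day bucket maps to the same hash."""
    raw = "|".join([rule_id, action_id, kind, digest, delta_hash, day_bucket])
    return "idem:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()

k1 = idempotency_key("rule-1", "act-1", "scanner.report.ready",
                     "sha256:abc", "d41d8c", "2025-10-18")
k2 = idempotency_key("rule-1", "act-1", "scanner.report.ready",
                     "sha256:abc", "d41d8c", "2025-10-18")
assert k1 == k2  # deterministic: duplicate events collapse to one key
```

Because every input is part of the canonical string, any change (new delta, new day bucket) yields a fresh key and the alert fires again.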
Digest windows: maintain a per-action coalescer:
- Window: 5m | 15m | 1h | 1d (configurable); coalesces events by tenant + namespace/repo or by digest group.
- Digest messages summarize the top N items and counts, with safe truncation.
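A digest window is essentially an accumulator keyed per action that flushes on a timer; a minimal sketch of the coalescing and top-N summarization (class and key names are hypothetical, and timer/window expiry is elided):

```python
from collections import defaultdict

class DigestCoalescer:
    """Buffers events per action key; flush() emits a top-N summary and closes the window."""
    def __init__(self, max_items: int = 100):
        self.max_items = max_items
        self.windows = defaultdict(list)  # actionKey -> buffered events

    def append(self, action_key: str, event: dict) -> None:
        if len(self.windows[action_key]) < self.max_items:
            self.windows[action_key].append(event)

    def flush(self, action_key: str, top_n: int = 5) -> dict:
        items = self.windows.pop(action_key, [])
        return {"actionKey": action_key, "total": len(items), "top": items[:top_n]}

c = DigestCoalescer()
for i in range(7):
    c.append("email:soc|hourly", {"eventId": f"evt-{i}"})
summary = c.flush("email:soc|hourly")
# summary carries total=7 but only the first 5 items; the rest stay behind a UI link
```

The max_items cap is the "safe truncation" from the bullet above: a storm never inflates a digest message beyond a bounded size.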
5) Channels & connectors (plug‑ins)
Channel config is two‑part: a Channel record (name, type, options) and a Secret reference (Vault/K8s Secret). Connectors are restart-time plug-ins discovered on service start (same manifest convention as Concelier/Excititor) and live under plugins/notify/<channel>/.
Built‑in v1:
- Slack: Bot token (xoxb‑…), chat.postMessage + blocks; rate limit aware (HTTP 429).
- Microsoft Teams: Incoming Webhook (or Graph card later); adaptive card payloads.
- Email (SMTP): TLS (STARTTLS or implicit), From/To/CC/BCC; HTML+text alt; DKIM optional.
- Generic Webhook: POST JSON signed with HMAC‑SHA‑256 or Ed25519, signature carried in headers.
Connector contract (implemented by plug-in assemblies):
public interface INotifyConnector {
string Type { get; } // "slack" | "teams" | "email" | "webhook" | ...
Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct);
Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct);
}
DeliveryContext includes rendered content and raw event for audit.
Test-send previews. Plug-ins can optionally implement INotifyChannelTestProvider to shape /channels/{id}/test responses. Providers receive a sanitised ChannelTestPreviewContext (channel, tenant, target, timestamp, trace) and return a NotifyDeliveryRendered preview + metadata. When no provider is present, the host falls back to a generic preview so the endpoint always responds.
Secrets: ChannelConfig.secretRef points to Authority‑managed secret handle or K8s Secret path; workers load at send-time; plug-in manifests (notify-plugin.json) declare capabilities and version.
6) Templates & rendering
Template engine: strongly typed, safe Handlebars‑style; no arbitrary code. Partial templates per channel. Deterministic outputs (prop order, no locale drift unless requested).
Variables (examples):
- event.kind, event.ts, scope.namespace, scope.repo, scope.digest
- payload.verdict, payload.delta.newCritical, payload.links.ui, payload.links.rekor
- topFindings[] with purl, vulnId, severity
- policy.name, policy.revision (if available)
Helpers:
severity_icon(sev), link(text, url), pluralize(n, "finding"), truncate(text, n), code(text).
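The helper semantics are simple enough to sketch; these Python equivalents are illustrative stand-ins (the real helpers live in the template engine, and the icon mapping is an assumption):

```python
def pluralize(n: int, noun: str) -> str:
    """Naive English pluralization, sufficient for counter phrases like '3 findings'."""
    return f"{n} {noun}" if n == 1 else f"{n} {noun}s"

def truncate(text: str, n: int) -> str:
    """Hard cap with an ellipsis so channel length limits are never exceeded."""
    return text if len(text) <= n else text[: max(0, n - 1)] + "…"

def severity_icon(sev: str) -> str:
    # Assumed mapping; the actual template set defines its own icons.
    return {"critical": "🔴", "high": "🟠", "medium": "🟡", "low": "🟢"}.get(sev.lower(), "⚪")

assert pluralize(1, "finding") == "1 finding"
assert pluralize(3, "finding") == "3 findings"
assert truncate("abcdef", 4) == "abc…"
```

Keeping helpers pure and deterministic is what makes rendered output reproducible across locales and re-runs.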
Channel mapping:
- Slack: title + blocks, limited to 50 blocks/3000 chars per section; long lists → link to UI.
- Teams: Adaptive Card schema 1.5; fallback text for older channels.
- Email: HTML + text; inline table of top N findings, rest behind UI link.
- Webhook: JSON with event, ruleId, actionId, summary, links, and a raw payload subset.
i18n: template set per locale (English default; Bulgarian built‑in).
7) Data model (Mongo)
Canonical JSON Schemas for rules/channels/events live in docs/notify/schemas/. Sample payloads intended for tests/UI mock responses are captured in docs/notify/samples/.
Database: notify
- rules — { _id, tenantId, name, enabled, match, actions, createdBy, updatedBy, createdAt, updatedAt }
- channels — { _id, tenantId, name:"slack:sec-alerts", type:"slack", config:{ webhookUrl?:"", channel:"#sec-alerts", workspace?: "...", secretRef:"ref://..." }, createdAt, updatedAt }
- deliveries — { _id, tenantId, ruleId, actionId, eventId, kind, scope, status:"sent|failed|throttled|digested|dropped", attempts:[{ts, status, code, reason}], rendered:{ title, body, target }, sentAt, lastError? } // body redacted for PII; body hash stored
- digests — { _id, tenantId, actionKey, window:"hourly", openedAt, items:[{eventId, scope, delta}], status:"open|flushed" }
- throttles — { key:"idem:<hash>", ttlAt } // short-lived, also cached in Redis
Indexes: rules by {tenantId, enabled}, deliveries by {tenantId, sentAt desc}, digests by {tenantId, actionKey}.
8) External APIs (WebService)
Base path: /api/v1/notify (Authority OpToks; scopes: notify.admin for write, notify.read for view).
All REST calls require the tenant header X-StellaOps-Tenant (matches the canonical tenantId stored in Mongo). Payloads are normalised via NotifySchemaMigration before persistence to guarantee schema version pinning.
Authentication today is stubbed with Bearer tokens (Authorization: Bearer <token>). When Authority wiring lands, this will switch to OpTok validation + scope enforcement, but the header contract will remain the same.
Service configuration exposes notify:auth:* keys (issuer, audience, signing key, scope names) so operators can wire the Authority JWKS or (in dev) a symmetric test key. notify:storage:* keys cover Mongo URI/database/collection overrides. Both sets are required for the new API surface.
Internal tooling can hit /internal/notify/<entity>/normalize to upgrade legacy JSON and return canonical output used in the docs fixtures.
- Channels
  - POST /channels | GET /channels | GET /channels/{id} | PATCH /channels/{id} | DELETE /channels/{id}
  - POST /channels/{id}/test → send sample message (no rule evaluation); returns 202 Accepted with rendered preview + metadata (base keys: channelType, target, previewProvider, traceId, plus connector-specific entries); governed by api.rateLimits:testSend.
  - GET /channels/{id}/health → connector self‑check
- Rules
  - POST /rules | GET /rules | GET /rules/{id} | PATCH /rules/{id} | DELETE /rules/{id}
  - POST /rules/{id}/test → dry‑run rule against a sample event (no delivery unless --send)
- Deliveries
  - POST /deliveries → ingest worker delivery state (idempotent via deliveryId).
  - GET /deliveries?since=...&status=...&limit=... → list envelope { items, count, continuationToken } (most recent first); base metadata keys match the test-send response (channelType, target, previewProvider, traceId); rate-limited via api.rateLimits.deliveryHistory. See docs/notify/samples/notify-delivery-list-response.sample.json.
  - GET /deliveries/{id} → detail (redacted body + metadata)
  - POST /deliveries/{id}/retry → force retry (admin, future sprint)
- Admin
  - GET /stats (per-tenant counts, last hour/day)
  - GET /healthz | /readyz (liveness)
  - POST /locks/acquire | POST /locks/release – worker coordination primitives (short TTL).
  - POST /digests | GET /digests/{actionKey} | DELETE /digests/{actionKey} – manage open digest windows.
  - POST /audit | GET /audit?since=&limit= – append/query structured audit trail entries.
Ingestion: workers do not expose public ingestion; they subscribe to the internal bus. (Optional /events/test for integration testing, admin‑only.)
9) Delivery pipeline (worker)
[Event bus] → [Ingestor] → [RuleMatcher] → [Throttle/Dedupe] → [DigestCoalescer] → [Renderer] → [Connector] → [Result]
└────────→ [DeliveryStore]
- Ingestor: N consumers with per‑key ordering (key = tenant|digest|namespace).
- RuleMatcher: loads the active rules snapshot for the tenant into memory; vectorized predicate check.
- Throttle/Dedupe: consult Redis + Mongo throttles; on a hit, record status=throttled.
- DigestCoalescer: append to an open digest window, or flush when the timer expires.
- Renderer: select template (channel+locale), inject variables, enforce length limits, compute bodyHash.
- Connector: send; handle provider‑specific rate limits and backoffs; maxAttempts with exponential jitter; overflow → DLQ (dead‑letter topic) + UI surfacing.
Idempotency: per action idempotency key stored in Redis (TTL = throttle window or digest window). Connectors also respect provider idempotency where available (e.g., Slack client_msg_id).
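The Redis side of this is a set-if-absent with TTL (SET key NX EX <ttl>); a pure-Python stand-in showing the suppression logic, with Redis replaced by a dict purely for illustration:

```python
import time

class ThrottleStore:
    """Stand-in for Redis SET key NX EX ttl: returns True only when the send may proceed."""
    def __init__(self):
        self._store = {}  # key -> expiry deadline (monotonic seconds)

    def acquire(self, key: str, ttl_seconds: float) -> bool:
        now = time.monotonic()
        expires = self._store.get(key)
        if expires is not None and expires > now:
            return False  # suppressed: same alert inside the throttle window
        self._store[key] = now + ttl_seconds
        return True

store = ThrottleStore()
assert store.acquire("idem:abc", ttl_seconds=300) is True   # first occurrence fires
assert store.acquire("idem:abc", ttl_seconds=300) is False  # duplicate is throttled
```

Using the throttle (or digest) window as the TTL means suppression state cleans itself up; no sweeper job is needed.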
10) Reliability & rate controls
- Per‑tenant RPM caps (default 600/min) + per‑channel concurrency (Slack 1–4, Teams 1–2, Email 8–32 based on relay).
- Backoff map: Slack 429 → respect Retry‑After; SMTP 4xx → retry; 5xx → retry with jitter; permanent rejects → drop with status recorded.
- DLQ: NATS/Redis stream notify.dlq with {event, rule, action, error} for operator inspection; UI shows DLQ items.
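Exponential backoff with jitter for the retryable cases can be sketched as follows (the base delay, cap, and attempt count here are assumptions; a provider-supplied Retry‑After always takes precedence):

```python
import random

def backoff_delays(max_attempts: int = 5, base: float = 1.0, cap: float = 60.0) -> list:
    """Full-jitter backoff: attempt i sleeps a uniform value in [0, min(cap, base * 2**i)]."""
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(max_attempts)]

delays = backoff_delays()
# Ceilings double per attempt: 1, 2, 4, 8, 16 seconds (until the 60 s cap applies);
# full jitter spreads retries so correlated failures don't stampede the provider.
```

After max_attempts the delivery overflows to the notify.dlq stream rather than retrying forever.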
11) Security & privacy
- AuthZ: all APIs require Authority OpToks; actions scoped by tenant.
- Secrets: secretRef only; Notify fetches just‑in‑time from the Authority secret proxy or a mounted K8s Secret. No plaintext secrets in Mongo.
- Egress TLS: validate certificates; pin domains per channel config; optional CA bundle override for on‑prem SMTP.
- Webhook signing: HMAC or Ed25519 signatures in X-StellaOps-Signature + replay‑window timestamp; include the canonical body hash in the header.
- Redaction: deliveries store hashes of bodies, not full payloads, for chat/email to minimize PII retention (configurable).
- Quiet hours: per tenant (e.g., 22:00–06:00) route high‑sev only; defer others to digests.
- Loop prevention: Webhook target allowlist + event origin tags; do not ingest own webhooks.
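The HMAC variant of the webhook-signing bullet can be sketched like this; the X-StellaOps-Signature header comes from the text, while the timestamp encoding, the extra body-hash header name, and the 300 s window are assumptions for illustration:

```python
import hashlib, hmac, json, time

def sign_webhook(secret: bytes, body: bytes) -> dict:
    """HMAC-SHA256 over timestamp + canonical body hash; receiver enforces a replay window."""
    ts = str(int(time.time()))
    body_hash = hashlib.sha256(body).hexdigest()
    mac = hmac.new(secret, f"{ts}.{body_hash}".encode(), hashlib.sha256).hexdigest()
    return {"X-StellaOps-Signature": f"t={ts},v1={mac}",
            "X-StellaOps-Body-Hash": body_hash}  # assumed companion header

def verify_webhook(secret: bytes, body: bytes, headers: dict, max_skew: int = 300) -> bool:
    ts, mac = (part.split("=", 1)[1]
               for part in headers["X-StellaOps-Signature"].split(","))
    if abs(time.time() - int(ts)) > max_skew:
        return False  # stale timestamp: outside the replay window
    body_hash = hashlib.sha256(body).hexdigest()
    expected = hmac.new(secret, f"{ts}.{body_hash}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(mac, expected)  # constant-time comparison

body = json.dumps({"event": "zastava.admission"}).encode()
headers = sign_webhook(b"s3cret", body)
assert verify_webhook(b"s3cret", body, headers)
```

Binding the timestamp into the MAC is what makes the replay window enforceable: an attacker cannot refresh a captured signature without the secret.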
12) Observability (Prometheus + OTEL)
- notify.events_consumed_total{kind}
- notify.rules_matched_total{ruleId}
- notify.throttled_total{reason}
- notify.digest_coalesced_total{window}
- notify.sent_total{channel} / notify.failed_total{channel,code}
- notify.delivery_latency_seconds{channel} (end‑to‑end)
- Tracing: spans ingest, match, render, send; correlation id = eventId.
SLO targets
- Event→delivery p95 ≤ 30–60 s under nominal load.
- Failure rate p95 < 0.5% per hour (excluding provider outages).
- Duplicate rate ≈ 0 (idempotency working).
13) Configuration (YAML)
notify:
authority:
issuer: "https://authority.internal"
require: "dpop" # or "mtls"
bus:
kind: "redis" # or "nats"
streams:
- "scanner.events"
- "scheduler.events"
- "attestor.events"
- "zastava.events"
mongo:
uri: "mongodb://mongo/notify"
limits:
perTenantRpm: 600
perChannel:
slack: { concurrency: 2 }
teams: { concurrency: 1 }
email: { concurrency: 8 }
webhook: { concurrency: 8 }
digests:
defaultWindow: "1h"
maxItems: 100
quietHours:
enabled: true
window: "22:00-06:00"
minSeverity: "critical"
webhooks:
sign:
method: "ed25519" # or "hmac-sha256"
keyRef: "ref://notify/webhook-sign-key"
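The quietHours window above wraps past midnight; a sketch of the containment check (function and parameter names are hypothetical, parsing the "HH:MM-HH:MM" form from the config):

```python
from datetime import time as dtime

def in_quiet_hours(now: dtime, window: str = "22:00-06:00") -> bool:
    """True when 'now' falls inside a window that may wrap past midnight."""
    start_s, end_s = window.split("-")
    start = dtime(*map(int, start_s.split(":")))
    end = dtime(*map(int, end_s.split(":")))
    if start <= end:
        return start <= now < end       # same-day window, e.g. 09:00-17:00
    return now >= start or now < end    # wrapped window, e.g. 22:00-06:00

assert in_quiet_hours(dtime(23, 30)) is True
assert in_quiet_hours(dtime(3, 0)) is True
assert in_quiet_hours(dtime(12, 0)) is False
```

During the window, events below minSeverity are deferred into digests instead of being delivered immediately.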
14) UI touch‑points
- Notifications → Channels: add Slack/Teams/Email/Webhook; run health; rotate secrets.
- Notifications → Rules: create/edit YAML rules with linting; test with sample events; see match rate.
- Notifications → Deliveries: timeline with filters (status, channel, rule); inspect last error; retry.
- Digest preview: shows current window contents and when it will flush.
- Quiet hours: configure per tenant; show overrides.
- DLQ: browse dead‑letters; requeue after fix.
15) Failure modes & responses
| Condition | Behavior |
|---|---|
| Slack 429 / Teams 429 | Respect Retry‑After, backoff with jitter, reduce concurrency |
| SMTP transient 4xx | Retry up to maxAttempts; escalate to DLQ on exhaust |
| Invalid channel secret | Mark channel unhealthy; suppress sends; surface in UI |
| Rule explosion (matches everything) | Safety valve: per‑tenant RPM caps; auto‑pause rule after X drops; UI alert |
| Bus outage | Buffer to local queue (bounded); resume consuming when healthy |
| Mongo slowness | Fall back to Redis throttles; batch write deliveries; shed low‑priority notifications |
16) Testing matrix
- Unit: matchers, throttle math, digest coalescing, idempotency keys, template rendering edge cases.
- Connectors: provider‑level rate limits, payload size truncation, error mapping.
- Integration: synthetic event storm (10k/min), ensure p95 latency & duplicate rate.
- Security: DPoP/mTLS on APIs; secretRef resolution; webhook signing & replay windows.
- i18n: localized templates render deterministically.
- Chaos: Slack/Teams API flaps; SMTP greylisting; Redis hiccups; ensure graceful degradation.
17) Sequences (representative)
A) New criticals after Feedser delta (Slack immediate + Email hourly digest)
sequenceDiagram
autonumber
participant SCH as Scheduler
participant NO as Notify.Worker
participant SL as Slack
participant SMTP as Email
SCH->>NO: bus event scheduler.rescan.delta { newCritical:1, digest:sha256:... }
NO->>NO: match rules (Slack immediate; Email hourly digest)
NO->>SL: chat.postMessage (concise)
SL-->>NO: 200 OK
NO->>NO: append to digest window (email:soc)
Note over NO: At window close → render digest email
NO->>SMTP: send email (detailed digest)
SMTP-->>NO: 250 OK
B) Admission deny (Teams card + Webhook)
sequenceDiagram
autonumber
participant ZA as Zastava
participant NO as Notify.Worker
participant TE as Teams
participant WH as Webhook
ZA->>NO: bus event zastava.admission { decision: "deny", reasons: [...] }
NO->>TE: POST adaptive card
TE-->>NO: 200 OK
NO->>WH: POST JSON (signed)
WH-->>NO: 2xx
18) Implementation notes
- Language: .NET 10; minimal API; System.Text.Json with a canonical writer for body hashing; Channels for pipelines.
- Bus: Redis Streams (XGROUP consumers) or NATS JetStream for at‑least‑once with ack; per‑tenant consumer groups to localize backpressure.
- Templates: compile and cache per rule+channel+locale; version with the rule's updatedAt to invalidate.
- Rules: store raw YAML + parsed AST; validate with schema + static checks (e.g., nonsensical combos).
- Secrets: pluggable secret resolver (Authority secret proxy, K8s, Vault).
- Rate limiting: System.Threading.RateLimiting + per‑connector adapters.
19) Roadmap (post‑v1)
- PagerDuty/Opsgenie connectors; Jira ticket creation.
- User inbox (in‑app notifications) + mobile push via webhook relay.
- Anomaly suppression: auto‑pause noisy rules with hints (learned thresholds).
- Graph rules: “only notify if not_affected → affected transition at consensus layer”.
- Label enrichment: pluggable taggers (business criticality, data classification) to refine matchers.