> **Scope.** Implementation-ready architecture for **Notify**: a rules-driven, tenant-aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator-defined routing rules, renders **channel-specific messages** (Slack/Teams/Email/Webhook), and delivers them **reliably** with idempotency, throttling, and digests. It is UI-managed, auditable, and safe by default (no secrets leakage, no spam storms).
---
## 0) Mission & boundaries
**Mission.** Convert **facts** from StellaOps into **actionable, noise-controlled** signals where teams already live (chat/email/webhooks), with **explainable** reasons and deep links to the UI.
**Boundaries.**
* Notify **does not make policy decisions** and **does not rescan**; it **consumes** events from Scanner/Scheduler/Vexer/Feedser/Attestor/Zastava and routes them.
* Attachments are **links** (UI/attestation pages); Notify **does not** attach SBOMs or large blobs to messages.
* Secrets for channels (Slack tokens, SMTP creds) are **referenced**, not stored raw in Mongo.
---
## 1) Runtime shape & projects
```
src/
├─ StellaOps.Notify.WebService/ # REST: rules/channels CRUD, test send, deliveries browse
├─ StellaOps.Notify.Worker/ # consumers + evaluators + renderers + delivery workers
├─ StellaOps.Notify.Connectors.* / # channel plug-ins: Slack, Teams, Email, Webhook (v1)
│ └─ *.Tests/
├─ StellaOps.Notify.Engine/ # rules engine, templates, idempotency, digests, throttles
├─ StellaOps.Notify.Models/ # DTOs (Rule, Channel, Event, Delivery, Template)
├─ StellaOps.Notify.Storage.Mongo/ # rules, channels, deliveries, digests, locks
├─ StellaOps.Notify.Queue/ # bus client (Redis Streams/NATS JetStream)
└─ StellaOps.Notify.Tests.* # unit/integration/e2e
```
**Deployables**:
* **Notify.WebService** (stateless API)
* **Notify.Worker** (horizontal scale)
**Dependencies**: Authority (OpToks; DPoP/mTLS), MongoDB, Redis/NATS (bus), HTTP egress to Slack/Teams/Webhooks, SMTP relay for Email.
---
## 2) Responsibilities
1. **Ingest** platform events from internal bus with strong ordering per key (e.g., image digest).
2. **Evaluate rules** (tenant-scoped) with matchers: severity changes, namespaces, repos, labels, KEV flags, provider provenance (VEX), component keys, admission decisions, etc.
3. **Control noise**: **throttle**, **coalesce** (digest windows), and **dedupe** via idempotency keys.
4. **Render** channel-specific messages using safe templates; include **evidence** and **links**.
5. **Deliver** with retries/backoff; record outcome; expose delivery history to UI.
6. **Test** paths (send test to channel targets) without touching live rules.
7. **Audit**: log who configured what, when, and why a message was sent.
---
## 3) Event model (inputs)
Notify subscribes to the **internal event bus** (produced by services, escaped JSON; gzip allowed with caps):
* `scanner.scan.completed` — new SBOM(s) composed; artifacts ready
* `scanner.report.ready` — analysis verdict (policy+vex) available; carries deltas summary
* `scheduler.rescan.delta` — new findings after Feedser/Vexer deltas (already summarized)
* `attestor.logged` — Rekor UUID returned (sbom/report/vex export)
* `zastava.admission` — admit/deny with reasons, namespace, image digests
* `feedser.export.completed` — new export ready (rarely notified directly; usually drives Scheduler)
* `vexer.export.completed` — new consensus snapshot (ditto)
**Canonical envelope (bus → Notify.Engine):**
```json
{
  "eventId": "uuid",
  "kind": "scanner.report.ready",
  "tenant": "tenant-01",
  "ts": "2025-10-18T05:41:22Z",
  "actor": "scanner-webservice",
  "scope": { "namespace": "payments", "repo": "ghcr.io/acme/api", "digest": "sha256:..." },
  "payload": { /* kind-specific fields, see below */ }
}
```
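
As a minimal sketch of how a worker might bind this envelope with `System.Text.Json`, the record types below mirror the fields shown above. The names `NotifyEvent`, `EventScope`, and `EnvelopeParser` are hypothetical and not taken from the StellaOps codebase; `payload` is kept as a raw `JsonElement` because its shape depends on `kind`.

```csharp
using System;
using System.Text.Json;

// Hypothetical DTOs mirroring the canonical envelope; type names are illustrative only.
public sealed record EventScope(string Namespace, string Repo, string Digest);

public sealed record NotifyEvent(
    string EventId, string Kind, string Tenant, DateTimeOffset Ts,
    string Actor, EventScope Scope, JsonElement Payload);

public static class EnvelopeParser
{
    // Case-insensitive matching maps "eventId" to EventId, "namespace" to Namespace, and so on.
    private static readonly JsonSerializerOptions Options =
        new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

    public static NotifyEvent Parse(string json)
    {
        NotifyEvent evt = JsonSerializer.Deserialize<NotifyEvent>(json, Options)
            ?? throw new JsonException("Envelope was null.");

        // Tenant and kind drive every later gate, so reject envelopes without them.
        if (string.IsNullOrEmpty(evt.Tenant) || string.IsNullOrEmpty(evt.Kind))
            throw new JsonException("Envelope requires 'tenant' and 'kind'.");
        return evt;
    }
}
```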
**Examples (payload cores):**
* `scanner.report.ready`:
```json
{ "verdict":"fail|warn|pass",
"delta": { "newCritical":1, "newHigh":2, "kev":["CVE-2025-..."] },
"topFindings":[{"purl":"pkg:rpm/openssl","vulnId":"CVE-2025-...","severity":"critical"}],
"links":{"ui":"https://ui/...","rekor":"https://rekor/..."} }
```
* `zastava.admission`:
```json
{ "decision":"deny|allow", "reasons":["unsigned image","missing SBOM"],
"images":[{"digest":"sha256:...","signed":false,"hasSbom":false}] }
```
---
## 4) Rules engine — semantics
**Rule shape (simplified):**
```yaml
name: "high-critical-alerts-prod"
enabled: true
match:
  eventKinds: ["scanner.report.ready","scheduler.rescan.delta","zastava.admission"]
  namespaces: ["prod-*"]
  repos: ["ghcr.io/acme/*"]
  minSeverity: "high" # min of new findings (delta context)
  kev: true # require KEV-tagged or allow any if false
  verdict: ["fail","deny"] # filter for report/admission
  vex:
    includeRejectedJustifications: false # notify only on accepted 'affected'
actions:
  - channel: "slack:sec-alerts" # reference to Channel object
    template: "concise"
    throttle: "5m"
  - channel: "email:soc"
    digest: "hourly"
    template: "detailed"
```
**Evaluation order**
1. **Tenant check** → discard if rule tenant ≠ event tenant.
2. **Kind filter** → discard early.
3. **Scope match** (namespace/repo/labels).
4. **Delta/severity gates** (if event carries `delta`).
5. **VEX gate** (drop if the event's finding is not affected under policy consensus, unless the rule says otherwise).
6. **Throttling/dedup** (idempotency key) — skip if suppressed.
7. **Actions** → enqueue per-channel job(s).
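
The first four gates of the evaluation order can be sketched as a chain of early returns. This is an illustration under assumptions, not the engine's actual code: `RuleMatch`, `EventFacts`, and `RuleGates` are hypothetical names, the severity ladder is assumed, `*` is treated as the only wildcard, and the VEX, throttle, and action steps are left out.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical, flattened views of a rule's `match` block and of an incoming event.
public sealed record RuleMatch(string TenantId, string[] EventKinds, string[] Namespaces, string[] Repos, string MinSeverity);
public sealed record EventFacts(string Tenant, string Kind, string Namespace, string Repo, string MaxNewSeverity);

public static class RuleGates
{
    // Assumed severity ladder; the source does not enumerate the levels.
    private static readonly string[] Ladder = { "low", "medium", "high", "critical" };

    // Translate a '*' glob such as "prod-*" into an anchored regular expression.
    public static bool GlobMatch(string pattern, string value) =>
        Regex.IsMatch(value, "^" + Regex.Escape(pattern).Replace("\\*", ".*") + "$");

    public static bool Matches(RuleMatch rule, EventFacts evt)
    {
        if (rule.TenantId != evt.Tenant) return false;            // 1. tenant check
        if (!rule.EventKinds.Contains(evt.Kind)) return false;    // 2. kind filter

        // 3. scope match; an empty pattern list is treated as "match all" in this sketch.
        if (rule.Namespaces.Length > 0 && !rule.Namespaces.Any(p => GlobMatch(p, evt.Namespace))) return false;
        if (rule.Repos.Length > 0 && !rule.Repos.Any(p => GlobMatch(p, evt.Repo))) return false;

        // 4. severity gate applies only when the event carries a delta.
        if (evt.MaxNewSeverity is null) return true;
        return Array.IndexOf(Ladder, evt.MaxNewSeverity) >= Array.IndexOf(Ladder, rule.MinSeverity);
    }
}
```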
**Idempotency key**: `hash(ruleId | actionId | event.kind | scope.digest | delta.hash | day-bucket)`; ensures the “same alert” doesn't fire more than once within the throttle window.
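
One possible way to compute such a key is shown below. The hash function (SHA-256), the `|` separator, the UTC `yyyy-MM-dd` day bucket, and the `IdempotencyKey` class name are all assumptions made for illustration; the source only specifies which fields participate. The `idem:` prefix follows the key format used by the `throttles` collection.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public static class IdempotencyKey
{
    public static string Compute(string ruleId, string actionId, string kind,
                                 string scopeDigest, string deltaHash, DateTimeOffset eventTs)
    {
        // Day bucket in UTC so all workers agree regardless of local time zone.
        string dayBucket = eventTs.UtcDateTime.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);

        // Join the participating fields in a fixed order before hashing.
        string material = string.Join("|", ruleId, actionId, kind, scopeDigest, deltaHash, dayBucket);
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(material));
        return "idem:" + Convert.ToHexString(hash).ToLowerInvariant();
    }
}
```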
**Digest windows**: maintain per action a **coalescer**:
* Window: `5m|15m|1h|1d` (configurable); coalesces events by tenant + namespace/repo or by digest group.
* Digest messages summarize top N items and counts, with safe truncation.
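
A coalescer for one action can be sketched as an in-memory window that collects item summaries and flushes a top-N digest once the window length has elapsed. `DigestWindow` and its output format are hypothetical; persistence to the `digests` collection and grouping by tenant or scope are omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class DigestWindow
{
    private readonly List<string> _items = new List<string>();

    public DateTimeOffset OpenedAt { get; }
    public TimeSpan Length { get; }

    public DigestWindow(DateTimeOffset openedAt, TimeSpan length)
    {
        OpenedAt = openedAt;
        Length = length;
    }

    public void Add(string summary) => _items.Add(summary);

    // The window is due once its configured length has fully elapsed.
    public bool IsDue(DateTimeOffset now) => now - OpenedAt >= Length;

    // Render the first topN items plus a count of what was truncated, then reset.
    public string Flush(int topN)
    {
        List<string> shown = _items.Take(topN).ToList();
        int hidden = _items.Count - shown.Count;
        string text = $"{_items.Count} event(s)\n" + string.Join("\n", shown.Select(s => "- " + s));
        if (hidden > 0) text += $"\n(+{hidden} more)";
        _items.Clear();
        return text;
    }
}
```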
---
## 5) Channels & connectors (plugins)
Channel config is **two-part**: a **Channel** record (name, type, options) and a Secret **reference** (Vault/K8s Secret). Connectors are **restart-time plug-ins** discovered on service start (same manifest convention as Concelier/Excititor) and live under `plugins/notify/<channel>/`.
**Built-in v1:**
* **Slack**: Bot token (xoxb…), `chat.postMessage` + `blocks`; rate limit aware (HTTP 429).
* **Microsoft Teams**: Incoming Webhook (or Graph card later); adaptive card payloads.
* **Email (SMTP)**: TLS (STARTTLS or implicit), From/To/CC/BCC; HTML+text alt; DKIM optional.
* **Generic Webhook**: POST JSON with a signature (HMAC-SHA256 or Ed25519) in headers.
**Connector contract:** (implemented by plug-in assemblies)
```csharp
using System.Threading;
using System.Threading.Tasks;

public interface INotifyConnector {
    string Type { get; } // "slack" | "teams" | "email" | "webhook" | ...
    Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct);
    Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct);
}
```
**DeliveryContext** includes **rendered content** and **raw event** for audit.
**Secrets**: `ChannelConfig.secretRef` points to an Authority-managed secret handle or K8s Secret path; workers load it at send time; plug-in manifests (`notify-plugin.json`) declare capabilities and version.
---
## 6) Templates & rendering
**Template engine**: strongly typed, safe Handlebars-style; no arbitrary code. Partial templates per channel. Deterministic outputs (property order, no locale drift unless requested).
**Variables** (examples):
* `event.kind`, `event.ts`, `scope.namespace`, `scope.repo`, `scope.digest`
* `payload.verdict`, `payload.delta.newCritical`, `payload.links.ui`, `payload.links.rekor`
* `topFindings[]` with `purl`, `vulnId`, `severity`
* `policy.name`, `policy.revision` (if available)
**Helpers**:
* `severity_icon(sev)`, `link(text,url)`, `pluralize(n, "finding")`, `truncate(text, n)`, `code(text)`.
**Channel mapping**:
* Slack: title + blocks, limited to 50 blocks/3000 chars per section; long lists → link to UI.
* Teams: Adaptive Card schema 1.5; fallback text for older channels.
* Email: HTML + text; inline table of top N findings, rest behind UI link.
* Webhook: JSON with `event`, `ruleId`, `actionId`, `summary`, `links`, and raw `payload` subset.
**i18n**: template set per locale (English default; Bulgarian built-in).
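
To make the helper signatures concrete, here is a small sketch of three of them as pure functions. The exact output formats are assumptions: the naive `s` plural suffix, the single-character ellipsis, and the `TemplateHelpers` class are illustrative, while `<url|text>` is Slack's standard mrkdwn link syntax. Numbers are formatted with the invariant culture so output does not drift with the host locale.

```csharp
using System;
using System.Globalization;

public static class TemplateHelpers
{
    // pluralize(n, "finding"): English-only sketch that appends "s" unless n == 1.
    public static string Pluralize(int n, string noun) =>
        n.ToString(CultureInfo.InvariantCulture) + " " + (n == 1 ? noun : noun + "s");

    // truncate(text, n): result never exceeds n characters, ellipsis included.
    public static string Truncate(string text, int max) =>
        text.Length <= max ? text : text.Substring(0, Math.Max(0, max - 1)) + "…";

    // link(text, url) rendered for Slack mrkdwn; other channels would need their own form.
    public static string SlackLink(string text, string url) => $"<{url}|{text}>";
}
```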
---
## 7) Data model (Mongo)
**Database**: `notify`
* `rules`
```
{ _id, tenantId, name, enabled, match, actions, createdBy, updatedBy, createdAt, updatedAt }
```
* `channels`
```
{ _id, tenantId, name:"slack:sec-alerts", type:"slack",
config:{ webhookUrl?:"", channel:"#sec-alerts", workspace?: "...", secretRef:"ref://..." },
createdAt, updatedAt }
```
* `deliveries`
```
{ _id, tenantId, ruleId, actionId, eventId, kind, scope, status:"sent|failed|throttled|digested|dropped",
attempts:[{ts, status, code, reason}],
rendered:{ title, body, target }, // redacted for PII; body hash stored
sentAt, lastError? }
```
* `digests`
```
{ _id, tenantId, actionKey, window:"hourly", openedAt, items:[{eventId, scope, delta}], status:"open|flushed" }
```
* `throttles`
```
{ key:"idem:<hash>", ttlAt } // short-lived, also cached in Redis
```
**Indexes**: rules by `{tenantId, enabled}`, deliveries by `{tenantId, sentAt desc}`, digests by `{tenantId, actionKey}`.
---
## 8) External APIs (WebService)
Base path: `/api/v1/notify` (Authority OpToks; scopes: `notify.admin` for write, `notify.read` for view).
* **Channels**
* `POST /channels` | `GET /channels` | `GET /channels/{id}` | `PATCH /channels/{id}` | `DELETE /channels/{id}`
* `POST /channels/{id}/test` → send sample message (no rule evaluation)
* `GET /channels/{id}/health` → connector self-check
* **Rules**
* `POST /rules` | `GET /rules` | `GET /rules/{id}` | `PATCH /rules/{id}` | `DELETE /rules/{id}`
* `POST /rules/{id}/test` → dry-run rule against a **sample event** (no delivery unless `--send`)
* **Deliveries**
* `GET /deliveries?tenant=...&since=...` → list
* `GET /deliveries/{id}` → detail (redacted body + metadata)
* `POST /deliveries/{id}/retry` → force retry (admin)
* **Admin**
* `GET /stats` (per tenant counts, last hour/day)
* `GET /healthz|readyz` (liveness/readiness)
**Ingestion**: workers do **not** expose public ingestion; they **subscribe** to the internal bus. (Optional `/events/test` for integration testing, admin-only.)
---
## 9) Delivery pipeline (worker)
```
[Event bus] → [Ingestor] → [RuleMatcher] → [Throttle/Dedupe] → [DigestCoalescer] → [Renderer] → [Connector] → [Result]
└────────→ [DeliveryStore]
```
* **Ingestor**: N consumers with per-key ordering (key = tenant|digest|namespace).
* **RuleMatcher**: loads active rules snapshot for tenant into memory; vectorized predicate check.
* **Throttle/Dedupe**: consult Redis + Mongo `throttles`; if hit → record `status=throttled`.
* **DigestCoalescer**: append to open digest window or flush when timer expires.
* **Renderer**: select template (channel+locale), inject variables, enforce length limits, compute `bodyHash`.
* **Connector**: send; handle provider-specific rate limits and backoffs; retry up to `maxAttempts` with exponential backoff and jitter; overflow → DLQ (dead-letter topic) + UI surfacing.
**Idempotency**: per action **idempotency key** stored in Redis (TTL = `throttle window` or `digest window`). Connectors also respect **provider** idempotency where available (e.g., Slack `client_msg_id`).
---
## 10) Reliability & rate controls
* **Per-tenant** RPM caps (default 600/min) + **per-channel** concurrency (Slack 1–4, Teams 1–2, Email 8–32 based on relay).
* **Backoff** map: Slack 429 → respect `Retry-After`; SMTP 4xx → retry; 5xx → retry with jitter; permanent rejects → drop with status recorded.
* **DLQ**: NATS/Redis stream `notify.dlq` with `{event, rule, action, error}` for operator inspection; UI shows DLQ items.
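
The backoff behaviour can be illustrated with a small delay calculator. This sketch assumes a "full jitter" strategy (a uniformly random delay between zero and an exponentially growing cap) and lets a provider-supplied `Retry-After` value take precedence; the `Backoff` class, its parameters, and the choice of jitter strategy are assumptions rather than the documented implementation.

```csharp
using System;

public static class Backoff
{
    // attempt is 1-based: the first retry uses baseDelay as its cap, the second 2x, and so on.
    public static TimeSpan NextDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay,
                                     TimeSpan? retryAfter, Random rng)
    {
        // A provider hint (for example Slack's Retry-After on HTTP 429) wins over local math.
        if (retryAfter.HasValue) return retryAfter.Value;

        double capMs = Math.Min(maxDelay.TotalMilliseconds,
                                baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1));

        // Full jitter: pick uniformly in [0, cap) to spread retries from many workers.
        return TimeSpan.FromMilliseconds(rng.NextDouble() * capMs);
    }
}
```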
---
## 11) Security & privacy
* **AuthZ**: all APIs require **Authority** OpToks; actions scoped by tenant.
* **Secrets**: `secretRef` only; Notify fetches just-in-time from Authority Secret proxy or K8s Secret (mounted). No plaintext secrets in Mongo.
* **Egress TLS**: validate TLS certificates; pin domains per channel config; optional CA bundle override for on-prem SMTP.
* **Webhook signing**: HMAC or Ed25519 signatures in `X-StellaOps-Signature` + replay-window timestamp; include canonical body hash in header.
* **Redaction**: deliveries store **hashes** of bodies, not full payloads for chat/email to minimize PII retention (configurable).
* **Quiet hours**: per tenant (e.g., 22:00–06:00) route high-sev only; defer others to digests.
* **Loop prevention**: Webhook target allowlist + event origin tags; do not ingest own webhooks.
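
For the HMAC variant of webhook signing, a receiver-side check might look like the sketch below. Only the `X-StellaOps-Signature` header name comes from this document; the signed string layout (`<unix-timestamp>.<body>`), the lowercase hex encoding, and the `WebhookSigner` class are assumptions for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class WebhookSigner
{
    // Sign "<timestamp>.<body>" so the timestamp cannot be altered independently of the body.
    public static string Sign(byte[] key, long unixTs, string body)
    {
        using var hmac = new HMACSHA256(key);
        byte[] mac = hmac.ComputeHash(Encoding.UTF8.GetBytes($"{unixTs}.{body}"));
        return Convert.ToHexString(mac).ToLowerInvariant();
    }

    public static bool Verify(byte[] key, long unixTs, string body, string signatureHex,
                              long nowUnix, long windowSeconds)
    {
        // Replay window: reject timestamps too far from the receiver's clock.
        if (Math.Abs(nowUnix - unixTs) > windowSeconds) return false;

        byte[] actual;
        try { actual = Convert.FromHexString(signatureHex); }
        catch (FormatException) { return false; }

        byte[] expected = Convert.FromHexString(Sign(key, unixTs, body));
        // Constant-time comparison avoids leaking how many leading bytes matched.
        return CryptographicOperations.FixedTimeEquals(expected, actual);
    }
}
```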
---
## 12) Observability (Prometheus + OTEL)
* `notify.events_consumed_total{kind}`
* `notify.rules_matched_total{ruleId}`
* `notify.throttled_total{reason}`
* `notify.digest_coalesced_total{window}`
* `notify.sent_total{channel}` / `notify.failed_total{channel,code}`
* `notify.delivery_latency_seconds{channel}` (end-to-end)
* **Tracing**: spans `ingest`, `match`, `render`, `send`; correlation id = `eventId`.
**SLO targets**
* Event→delivery p95 **≤ 30–60 s** under nominal load.
* Failure rate p95 **< 0.5%** per hour (excluding provider outages).
* Duplicate rate **≈ 0** (idempotency working).
---
## 13) Configuration (YAML)
```yaml
notify:
  authority:
    issuer: "https://authority.internal"
    require: "dpop" # or "mtls"
  bus:
    kind: "redis" # or "nats"
    streams:
      - "scanner.events"
      - "scheduler.events"
      - "attestor.events"
      - "zastava.events"
  mongo:
    uri: "mongodb://mongo/notify"
  limits:
    perTenantRpm: 600
    perChannel:
      slack: { concurrency: 2 }
      teams: { concurrency: 1 }
      email: { concurrency: 8 }
      webhook: { concurrency: 8 }
  digests:
    defaultWindow: "1h"
    maxItems: 100
  quietHours:
    enabled: true
    window: "22:00-06:00"
    minSeverity: "critical"
  webhooks:
    sign:
      method: "ed25519" # or "hmac-sha256"
      keyRef: "ref://notify/webhook-sign-key"
```
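The `quietHours.window` value wraps past midnight, which is easy to get wrong. The following sketch shows one way to parse the `HH:mm-HH:mm` string and test whether a local time of day falls inside it; the `QuietHours` class and the half-open interval (start inclusive, end exclusive) are assumptions, not behaviour stated by the source.

```csharp
using System;
using System.Globalization;

public static class QuietHours
{
    // Parses a window such as "22:00-06:00" into start and end times of day.
    public static (TimeSpan Start, TimeSpan End) ParseWindow(string window)
    {
        string[] parts = window.Split('-');
        if (parts.Length != 2) throw new FormatException("Expected 'HH:mm-HH:mm'.");
        return (TimeSpan.Parse(parts[0], CultureInfo.InvariantCulture),
                TimeSpan.Parse(parts[1], CultureInfo.InvariantCulture));
    }

    // Start is inclusive, end is exclusive. When start > end the window spans midnight.
    public static bool IsQuiet(TimeSpan start, TimeSpan end, TimeSpan timeOfDay) =>
        start <= end
            ? timeOfDay >= start && timeOfDay < end
            : timeOfDay >= start || timeOfDay < end;
}
```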
---
## 14) UI touchpoints
* **Notifications → Channels**: add Slack/Teams/Email/Webhook; run **health**; rotate secrets.
* **Notifications → Rules**: create/edit YAML rules with linting; test with sample events; see match rate.
* **Notifications → Deliveries**: timeline with filters (status, channel, rule); inspect last error; retry.
* **Digest preview**: shows current window contents and when it will flush.
* **Quiet hours**: configure per tenant; show overrides.
* **DLQ**: browse dead-letters; requeue after fix.
---
## 15) Failure modes & responses
| Condition | Behavior |
| ----------------------------------- | ------------------------------------------------------------------------------------- |
| Slack 429 / Teams 429 | Respect `Retry-After`, backoff with jitter, reduce concurrency |
| SMTP transient 4xx | Retry up to `maxAttempts`; escalate to DLQ on exhaust |
| Invalid channel secret | Mark channel unhealthy; suppress sends; surface in UI |
| Rule explosion (matches everything) | Safety valve: per-tenant RPM caps; auto-pause rule after X drops; UI alert |
| Bus outage | Buffer to local queue (bounded); resume consuming when healthy |
| Mongo slowness | Fall back to Redis throttles; batch write deliveries; shed low-priority notifications |
---
## 16) Testing matrix
* **Unit**: matchers, throttle math, digest coalescing, idempotency keys, template rendering edge cases.
* **Connectors**: provider-level rate limits, payload size truncation, error mapping.
* **Integration**: synthetic event storm (10k/min), ensure p95 latency & duplicate rate.
* **Security**: DPoP/mTLS on APIs; secretRef resolution; webhook signing & replay windows.
* **i18n**: localized templates render deterministically.
* **Chaos**: Slack/Teams API flaps; SMTP greylisting; Redis hiccups; ensure graceful degradation.
---
## 17) Sequences (representative)
**A) New criticals after Feedser delta (Slack immediate + Email hourly digest)**
```mermaid
sequenceDiagram
autonumber
participant SCH as Scheduler
participant NO as Notify.Worker
participant SL as Slack
participant SMTP as Email
SCH->>NO: bus event scheduler.rescan.delta { newCritical:1, digest:sha256:... }
NO->>NO: match rules (Slack immediate; Email hourly digest)
NO->>SL: chat.postMessage (concise)
SL-->>NO: 200 OK
NO->>NO: append to digest window (email:soc)
Note over NO: At window close → render digest email
NO->>SMTP: send email (detailed digest)
SMTP-->>NO: 250 OK
```
**B) Admission deny (Teams card + Webhook)**
```mermaid
sequenceDiagram
autonumber
participant ZA as Zastava
participant NO as Notify.Worker
participant TE as Teams
participant WH as Webhook
ZA->>NO: bus event zastava.admission { decision: "deny", reasons: [...] }
NO->>TE: POST adaptive card
TE-->>NO: 200 OK
NO->>WH: POST JSON (signed)
WH-->>NO: 2xx
```
---
## 18) Implementation notes
* **Language**: .NET 10; minimal API; `System.Text.Json` with canonical writer for body hashing; Channels for pipelines.
* **Bus**: Redis Streams (**XGROUP** consumers) or NATS JetStream for at-least-once with ack; per-tenant consumer groups to localize backpressure.
* **Templates**: compile and cache per rule+channel+locale; version with rule `updatedAt` to invalidate.
* **Rules**: store raw YAML + parsed AST; validate with schema + static checks (e.g., nonsensical combos).
* **Secrets**: pluggable secret resolver (Authority Secret proxy, K8s, Vault).
* **Rate limiting**: `System.Threading.RateLimiting` + per-connector adapters.
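
A per-tenant RPM cap could be expressed with `System.Threading.RateLimiting` roughly as follows. This is a sketch under assumptions: a fixed one-minute window keyed by tenant id, no queuing, and a hypothetical `TenantLimiter` wrapper; the real service may prefer a sliding window or token bucket.

```csharp
using System;
using System.Threading.RateLimiting;

public static class TenantLimiter
{
    // One fixed-window limiter per tenant, created lazily on first use.
    public static PartitionedRateLimiter<string> Create(int perTenantRpm) =>
        PartitionedRateLimiter.Create<string, string>(tenant =>
            RateLimitPartition.GetFixedWindowLimiter(tenant, _ => new FixedWindowRateLimiterOptions
            {
                PermitLimit = perTenantRpm,
                Window = TimeSpan.FromMinutes(1),
                QueueLimit = 0, // reject immediately instead of queuing when the cap is hit
            }));

    // Returns true when the tenant still has budget in the current window.
    public static bool TryAcquire(PartitionedRateLimiter<string> limiter, string tenant)
    {
        using RateLimitLease lease = limiter.AttemptAcquire(tenant);
        return lease.IsAcquired;
    }
}
```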
---
## 19) Roadmap (post-v1)
* **PagerDuty/Opsgenie** connectors; **Jira** ticket creation.
* **User inbox** (in-app notifications) + mobile push via webhook relay.
* **Anomaly suppression**: auto-pause noisy rules with hints (learned thresholds).
* **Graph rules**: “only notify if *not_affected → affected* transition at consensus layer”.
* **Label enrichment**: pluggable taggers (business criticality, data classification) to refine matchers.