feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules
- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
docs/modules/notify/architecture.md
							| @@ -0,0 +1,515 @@ | ||||
| > **Scope.** Implementation‑ready architecture for **Notify** (aligned with Epic 11 – Notifications Studio): a rules‑driven, tenant‑aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator‑defined routing rules, renders **channel‑specific messages** (Slack/Teams/Email/Webhook), and delivers them **reliably** with idempotency, throttling, and digests. It is UI‑managed, auditable, and safe by default (no secrets leakage, no spam storms). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 0) Mission & boundaries | ||||
|  | ||||
| **Mission.** Convert **facts** from Stella Ops into **actionable, noise‑controlled** signals where teams already live (chat/email/webhooks), with **explainable** reasons and deep links to the UI. | ||||
|  | ||||
| **Boundaries.** | ||||
|  | ||||
| * Notify **does not make policy decisions** and **does not rescan**; it **consumes** events from Scanner/Scheduler/Vexer/Feedser/Attestor/Zastava and routes them. | ||||
| * Attachments are **links** (UI/attestation pages); Notify **does not** attach SBOMs or large blobs to messages. | ||||
| * Secrets for channels (Slack tokens, SMTP creds) are **referenced**, not stored raw in Mongo. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 1) Runtime shape & projects | ||||
|  | ||||
| ``` | ||||
| src/ | ||||
|  ├─ StellaOps.Notify.WebService/        # REST: rules/channels CRUD, test send, deliveries browse | ||||
|  ├─ StellaOps.Notify.Worker/            # consumers + evaluators + renderers + delivery workers | ||||
|  ├─ StellaOps.Notify.Connectors.*/      # channel plug-ins: Slack, Teams, Email, Webhook (v1) | ||||
|  │    └─ *.Tests/ | ||||
|  ├─ StellaOps.Notify.Engine/            # rules engine, templates, idempotency, digests, throttles | ||||
|  ├─ StellaOps.Notify.Models/            # DTOs (Rule, Channel, Event, Delivery, Template) | ||||
|  ├─ StellaOps.Notify.Storage.Mongo/     # rules, channels, deliveries, digests, locks | ||||
|  ├─ StellaOps.Notify.Queue/             # bus client (Redis Streams/NATS JetStream) | ||||
|  └─ StellaOps.Notify.Tests.*            # unit/integration/e2e | ||||
| ``` | ||||
|  | ||||
| **Deployables**: | ||||
|  | ||||
| * **Notify.WebService** (stateless API) | ||||
| * **Notify.Worker** (horizontal scale) | ||||
|  | ||||
| **Dependencies**: Authority (OpToks; DPoP/mTLS), MongoDB, Redis/NATS (bus), HTTP egress to Slack/Teams/Webhooks, SMTP relay for Email. | ||||
|  | ||||
| > **Configuration.** Notify.WebService bootstraps from `notify.yaml` (see `etc/notify.yaml.sample`). Use `storage.driver: mongo` with a production connection string; the optional `memory` driver exists only for tests. Authority settings follow the platform defaults—when running locally without Authority, set `authority.enabled: false` and supply `developmentSigningKey` so JWTs can be validated offline. | ||||
| > | ||||
| > `api.rateLimits` exposes token-bucket controls for delivery history queries and test-send previews (`deliveryHistory`, `testSend`). Default values allow generous browsing while preventing accidental bursts; operators can relax/tighten the buckets per deployment. | ||||
|  | ||||
| > **Plug-ins.** All channel connectors are packaged under `<baseDirectory>/plugins/notify`. The ordered load list must start with Slack/Teams before Email/Webhook so chat-first actions are registered deterministically for Offline Kit bundles: | ||||
| > | ||||
| > ```yaml | ||||
| > plugins: | ||||
| >   baseDirectory: "/var/opt/stellaops" | ||||
| >   directory: "plugins/notify" | ||||
| >   orderedPlugins: | ||||
| >     - StellaOps.Notify.Connectors.Slack | ||||
| >     - StellaOps.Notify.Connectors.Teams | ||||
| >     - StellaOps.Notify.Connectors.Email | ||||
| >     - StellaOps.Notify.Connectors.Webhook | ||||
| > ``` | ||||
| > | ||||
| > The Offline Kit job simply copies the `plugins/notify` tree into the air-gapped bundle; the ordered list keeps connector manifests stable across environments. | ||||
|  | ||||
| > **Authority clients.** Register two OAuth clients in StellaOps Authority: `notify-web-dev` (audience `notify.dev`) for development and `notify-web` (audience `notify`) for staging/production. Both require `notify.read` and `notify.admin` scopes and use DPoP-bound client credentials (`client_secret` in the samples). Reference entries live in `etc/authority.yaml.sample`, with placeholder secrets under `etc/secrets/notify-web*.secret.example`. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 2) Responsibilities | ||||
|  | ||||
| 1. **Ingest** platform events from the internal bus with strong ordering per key (e.g., image digest). | ||||
| 2. **Evaluate rules** (tenant‑scoped) with matchers: severity changes, namespaces, repos, labels, KEV flags, provider provenance (VEX), component keys, admission decisions, etc. | ||||
| 3. **Control noise**: **throttle**, **coalesce** (digest windows), and **dedupe** via idempotency keys. | ||||
| 4. **Render** channel‑specific messages using safe templates; include **evidence** and **links**. | ||||
| 5. **Deliver** with retries/backoff; record outcome; expose delivery history to UI. | ||||
| 6. **Test** paths (send test to channel targets) without touching live rules. | ||||
| 7. **Audit**: log who configured what, when, and why a message was sent. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 3) Event model (inputs) | ||||
|  | ||||
| Notify subscribes to the **internal event bus** (produced by services, escaped JSON; gzip allowed with caps): | ||||
|  | ||||
| * `scanner.scan.completed` — new SBOM(s) composed; artifacts ready | ||||
| * `scanner.report.ready` — analysis verdict (policy+vex) available; carries deltas summary | ||||
| * `scheduler.rescan.delta` — new findings after Feedser/Vexer deltas (already summarized) | ||||
| * `attestor.logged` — Rekor UUID returned (sbom/report/vex export) | ||||
| * `zastava.admission` — admit/deny with reasons, namespace, image digests | ||||
| * `feedser.export.completed` — new export ready (rarely notified directly; usually drives Scheduler) | ||||
| * `vexer.export.completed` — new consensus snapshot (ditto) | ||||
|  | ||||
| **Canonical envelope (bus → Notify.Engine):** | ||||
|  | ||||
| ```json | ||||
| { | ||||
|   "eventId": "uuid", | ||||
|   "kind": "scanner.report.ready", | ||||
|   "tenant": "tenant-01", | ||||
|   "ts": "2025-10-18T05:41:22Z", | ||||
|   "actor": "scanner-webservice", | ||||
|   "scope": { "namespace":"payments", "repo":"ghcr.io/acme/api", "digest":"sha256:..." }, | ||||
|   "payload": { /* kind-specific fields, see below */ } | ||||
| } | ||||
| ``` | ||||
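
A consumer of this envelope might parse it as follows. This is a minimal Python sketch for illustration only (the service itself is .NET); the `NotifyEvent` type, the `parse_envelope` helper, and the choice of which fields are mandatory are assumptions, not part of the documented contract.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional

# Assumption: these fields are treated as mandatory; the source does not say which are optional.
REQUIRED_FIELDS = ("eventId", "kind", "tenant", "ts", "payload")


@dataclass(frozen=True)
class NotifyEvent:
    event_id: str
    kind: str
    tenant: str
    ts: datetime
    payload: Dict[str, Any]
    actor: Optional[str] = None
    scope: Dict[str, Any] = field(default_factory=dict)


def parse_envelope(raw: str) -> NotifyEvent:
    """Parse a bus message into a typed event, rejecting incomplete envelopes."""
    doc = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in doc]
    if missing:
        raise ValueError("envelope missing fields: " + ", ".join(missing))
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    ts = datetime.fromisoformat(doc["ts"].replace("Z", "+00:00"))
    return NotifyEvent(
        event_id=doc["eventId"],
        kind=doc["kind"],
        tenant=doc["tenant"],
        ts=ts,
        payload=doc["payload"],
        actor=doc.get("actor"),
        scope=doc.get("scope", {}),
    )
```

Rejecting malformed envelopes at the edge keeps the later stages (matching, rendering) free of defensive null checks.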
|  | ||||
| **Examples (payload cores):** | ||||
|  | ||||
| * `scanner.report.ready`: | ||||
|  | ||||
|   ```json | ||||
|   { | ||||
|     "reportId": "report-3def...", | ||||
|     "verdict": "fail", | ||||
|     "summary": {"total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2}, | ||||
|     "delta": {"newCritical": 1, "kev": ["CVE-2025-..."]}, | ||||
|     "links": {"ui": "https://ui/.../reports/report-3def...", "rekor": "https://rekor/..."}, | ||||
|     "dsse": { "...": "..." }, | ||||
|     "report": { "...": "..." } | ||||
|   } | ||||
|   ``` | ||||
|  | ||||
|   Payload embeds both the canonical report document and the DSSE envelope so connectors, Notify, and UI tooling can reuse the signed bytes without re-serialising. | ||||
|  | ||||
| * `scanner.scan.completed`: | ||||
|  | ||||
|   ```json | ||||
|   { | ||||
|     "reportId": "report-3def...", | ||||
|     "digest": "sha256:...", | ||||
|     "verdict": "fail", | ||||
|     "summary": {"total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2}, | ||||
|     "delta": {"newCritical": 1, "kev": ["CVE-2025-..."]}, | ||||
|     "policy": {"revisionId": "rev-42", "digest": "27d2..."}, | ||||
|     "findings": [{"id": "finding-1", "severity": "Critical", "cve": "CVE-2025-...", "reachability": "runtime"}], | ||||
|     "dsse": { "...": "..." } | ||||
|   } | ||||
|   ``` | ||||
|  | ||||
| * `zastava.admission`: | ||||
|  | ||||
|   ```json | ||||
|   { "decision":"deny|allow", "reasons":["unsigned image","missing SBOM"], | ||||
|     "images":[{"digest":"sha256:...","signed":false,"hasSbom":false}] } | ||||
|   ``` | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 4) Rules engine — semantics | ||||
|  | ||||
| **Rule shape (simplified):** | ||||
|  | ||||
| ```yaml | ||||
| name: "high-critical-alerts-prod" | ||||
| enabled: true | ||||
| match: | ||||
|   eventKinds: ["scanner.report.ready","scheduler.rescan.delta","zastava.admission"] | ||||
|   namespaces: ["prod-*"] | ||||
|   repos: ["ghcr.io/acme/*"] | ||||
|   minSeverity: "high"            # min of new findings (delta context) | ||||
|   kev: true                      # require KEV-tagged or allow any if false | ||||
|   verdict: ["fail","deny"]       # filter for report/admission | ||||
|   vex: | ||||
|     includeRejectedJustifications: false    # notify only on accepted 'affected' | ||||
| actions: | ||||
|   - channel: "slack:sec-alerts"  # reference to Channel object | ||||
|     template: "concise" | ||||
|     throttle: "5m" | ||||
|   - channel: "email:soc" | ||||
|     digest: "hourly" | ||||
|     template: "detailed" | ||||
| ``` | ||||
|  | ||||
| **Evaluation order** | ||||
|  | ||||
| 1. **Tenant check** → discard if rule tenant ≠ event tenant. | ||||
| 2. **Kind filter** → discard early. | ||||
| 3. **Scope match** (namespace/repo/labels). | ||||
| 4. **Delta/severity gates** (if event carries `delta`). | ||||
| 5. **VEX gate** (drop if event’s finding is not affected under policy consensus unless rule says otherwise). | ||||
| 6. **Throttling/dedup** (idempotency key) — skip if suppressed. | ||||
| 7. **Actions** → enqueue per‑channel job(s). | ||||
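
Steps 1 to 4 of the order above can be sketched as a single predicate. The Python below is illustrative only and rests on several assumptions: glob matching uses `fnmatchcase`, the severity scale is `low < medium < high < critical`, and `delta` carries per-severity counters named `new<Severity>` (only `newCritical` appears in the payload examples). The VEX gate and throttling are left out.

```python
from fnmatch import fnmatchcase

# Assumption: ordered severity scale; the source only shows "high" and "critical".
SEVERITIES = ["low", "medium", "high", "critical"]


def _glob_ok(patterns, value):
    """An absent or empty pattern list matches everything; otherwise any glob may match."""
    if not patterns:
        return True
    return value is not None and any(fnmatchcase(value, p) for p in patterns)


def rule_matches(rule, event):
    match = rule.get("match", {})
    # 1. Tenant check
    if rule.get("tenantId") != event.get("tenant"):
        return False
    # 2. Kind filter
    if event.get("kind") not in match.get("eventKinds", []):
        return False
    # 3. Scope match (namespace/repo)
    scope = event.get("scope", {})
    if not _glob_ok(match.get("namespaces"), scope.get("namespace")):
        return False
    if not _glob_ok(match.get("repos"), scope.get("repo")):
        return False
    # 4. Delta/severity gates, applied only when the event carries a delta
    delta = event.get("payload", {}).get("delta")
    if delta is not None:
        min_sev = match.get("minSeverity")
        if min_sev:
            gate = SEVERITIES[SEVERITIES.index(min_sev):]
            # Hypothetical counter names: newHigh, newCritical, ...
            if sum(delta.get("new" + s.capitalize(), 0) for s in gate) == 0:
                return False
        if match.get("kev") and not delta.get("kev"):
            return False
    return True
```

Ordering the cheap checks (tenant, kind) first means most non-matching events are discarded before any glob or delta work is done.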
|  | ||||
| **Idempotency key**: `hash(ruleId | actionId | event.kind | scope.digest | delta.hash | day-bucket)`; ensures “same alert” doesn’t fire more than once within throttle window. | ||||
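
One way to derive such a key is sketched below in Python. The hash function (SHA-256), the `|` joiner, and the UTC day bucket format are assumptions; only the list of inputs and the `idem:` prefix (from the `throttles` collection) come from this document.

```python
import hashlib
from datetime import datetime, timezone


def idempotency_key(rule_id, action_id, kind, digest, delta_hash, ts):
    """Build a stable key from the documented inputs; `ts` must be timezone-aware."""
    # Assumption: the day bucket is the UTC calendar date of the event.
    day_bucket = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    parts = [rule_id, action_id, kind, digest, delta_hash, day_bucket]
    material = "|".join(parts).encode("utf-8")
    return "idem:" + hashlib.sha256(material).hexdigest()


if __name__ == "__main__":
    now = datetime(2025, 10, 18, 5, 41, 22, tzinfo=timezone.utc)
    print(idempotency_key("rule-1", "action-1", "scanner.report.ready",
                          "sha256:abc", "d41d8c", now))
```

Because the bucket is part of the hashed material, the same alert produces a new key on the next day even if nothing else changed.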
|  | ||||
| **Digest windows**: maintain per action a **coalescer**: | ||||
|  | ||||
| * Window: `5m|15m|1h|1d` (configurable); coalesces events by tenant + namespace/repo or by digest group. | ||||
| * Digest messages summarize top N items and counts, with safe truncation. | ||||
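
The coalescer described above could look roughly like this in-memory Python sketch. `DigestCoalescer`, `DigestWindow`, and `summarize` are hypothetical names; the real service persists windows in the `digests` collection, which this sketch does not model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# The four window sizes listed in this document.
WINDOWS = {
    "5m": timedelta(minutes=5),
    "15m": timedelta(minutes=15),
    "1h": timedelta(hours=1),
    "1d": timedelta(days=1),
}


@dataclass
class DigestWindow:
    action_key: str
    opened_at: datetime
    items: list = field(default_factory=list)


class DigestCoalescer:
    def __init__(self, window="1h"):
        self.span = WINDOWS[window]
        self.open = {}  # action_key -> DigestWindow

    def add(self, action_key, item, now):
        """Append to the open window for this action, opening one if needed."""
        win = self.open.get(action_key)
        if win is None:
            win = self.open[action_key] = DigestWindow(action_key, now)
        win.items.append(item)

    def flush_due(self, now):
        """Remove and return every window whose span has elapsed."""
        due = [k for k, w in self.open.items() if now - w.opened_at >= self.span]
        return [self.open.pop(k) for k in due]


def summarize(win, top_n=3):
    """Keep the first N items and report how many were left out."""
    shown = win.items[:top_n]
    return {"total": len(win.items), "shown": shown,
            "omitted": len(win.items) - len(shown)}
```

Passing `now` explicitly rather than reading the clock inside the class keeps the window arithmetic deterministic in tests.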
|  | ||||
| --- | ||||
|  | ||||
| ## 5) Channels & connectors (plug‑ins) | ||||
|  | ||||
| Channel config is **two‑part**: a **Channel** record (name, type, options) and a Secret **reference** (Vault/K8s Secret). Connectors are **restart-time plug-ins** discovered on service start (same manifest convention as Concelier/Excititor) and live under `plugins/notify/<channel>/`. | ||||
|  | ||||
| **Built‑in v1:** | ||||
|  | ||||
| * **Slack**: Bot token (xoxb‑…), `chat.postMessage` + `blocks`; rate limit aware (HTTP 429). | ||||
| * **Microsoft Teams**: Incoming Webhook (or Graph card later); adaptive card payloads. | ||||
| * **Email (SMTP)**: TLS (STARTTLS or implicit), From/To/CC/BCC; HTML+text alt; DKIM optional. | ||||
| * **Generic Webhook**: POST JSON with a signature in headers (HMAC‑SHA‑256 or Ed25519). | ||||
|  | ||||
| **Connector contract:** (implemented by plug-in assemblies) | ||||
|  | ||||
| ```csharp | ||||
| public interface INotifyConnector { | ||||
|   string Type { get; } // "slack" | "teams" | "email" | "webhook" | ... | ||||
|   Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct); | ||||
|   Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct); | ||||
| } | ||||
| ``` | ||||
|  | ||||
| **DeliveryContext** includes **rendered content** and **raw event** for audit. | ||||
|  | ||||
| **Test-send previews.** Plug-ins can optionally implement `INotifyChannelTestProvider` to shape `/channels/{id}/test` responses. Providers receive a sanitised `ChannelTestPreviewContext` (channel, tenant, target, timestamp, trace) and return a `NotifyDeliveryRendered` preview + metadata. When no provider is present, the host falls back to a generic preview so the endpoint always responds. | ||||
|  | ||||
| **Secrets**: `ChannelConfig.secretRef` points to Authority‑managed secret handle or K8s Secret path; workers load at send-time; plug-in manifests (`notify-plugin.json`) declare capabilities and version. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 6) Templates & rendering | ||||
|  | ||||
| **Template engine**: strongly typed, safe Handlebars‑style; no arbitrary code. Partial templates per channel. Deterministic outputs (prop order, no locale drift unless requested). | ||||
|  | ||||
| **Variables** (examples): | ||||
|  | ||||
| * `event.kind`, `event.ts`, `scope.namespace`, `scope.repo`, `scope.digest` | ||||
| * `payload.verdict`, `payload.delta.newCritical`, `payload.links.ui`, `payload.links.rekor` | ||||
| * `topFindings[]` with `purl`, `vulnId`, `severity` | ||||
| * `policy.name`, `policy.revision` (if available) | ||||
|  | ||||
| **Helpers**: | ||||
|  | ||||
| * `severity_icon(sev)`, `link(text,url)`, `pluralize(n, "finding")`, `truncate(text, n)`, `code(text)`. | ||||
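
To make the helper semantics concrete, here is a small Python sketch of three of them. The exact behaviour is an assumption: naive `s` pluralization, an ellipsis that counts toward the length limit, and Slack's `<url|text>` link syntax with a Markdown fallback.

```python
def pluralize(n, noun):
    """Return "1 finding" or "3 findings" (naive English plural)."""
    return f"{n} {noun}" if n == 1 else f"{n} {noun}s"


def truncate(text, n):
    """Cut text to at most n characters, marking the cut with an ellipsis."""
    if n <= 0:
        return ""
    if len(text) <= n:
        return text
    return text[: n - 1] + "…"


def link(text, url, channel="slack"):
    """Render a link in the target channel's syntax."""
    if channel == "slack":
        return f"<{url}|{text}>"
    return f"[{text}]({url})"


if __name__ == "__main__":
    print(pluralize(3, "finding"))
    print(truncate("unsigned image and missing SBOM", 16))
    print(link("Report", "https://ui.example/reports/1"))
```

Keeping helpers pure (no I/O, no clock, no locale lookup) is what makes rendered output deterministic.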
|  | ||||
| **Channel mapping**: | ||||
|  | ||||
| * Slack: title + blocks, limited to 50 blocks/3000 chars per section; long lists → link to UI. | ||||
| * Teams: Adaptive Card schema 1.5; fallback text for older channels (surfaced as `teams.fallbackText` metadata alongside webhook hash). | ||||
| * Email: HTML + text; inline table of top N findings, rest behind UI link. | ||||
| * Webhook: JSON with `event`, `ruleId`, `actionId`, `summary`, `links`, and raw `payload` subset. | ||||
|  | ||||
| **i18n**: template set per locale (English default; Bulgarian built‑in). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 7) Data model (Mongo) | ||||
|  | ||||
| Canonical JSON Schemas for rules/channels/events live in `docs/modules/notify/resources/schemas/`. Sample payloads intended for tests/UI mock responses are captured in `docs/modules/notify/resources/samples/`. | ||||
|  | ||||
| **Database**: `notify` | ||||
|  | ||||
| * `rules` | ||||
|  | ||||
|   ``` | ||||
|   { _id, tenantId, name, enabled, match, actions, createdBy, updatedBy, createdAt, updatedAt } | ||||
|   ``` | ||||
|  | ||||
| * `channels` | ||||
|  | ||||
|   ``` | ||||
|   { _id, tenantId, name:"slack:sec-alerts", type:"slack", | ||||
|     config:{ webhookUrl?:"", channel:"#sec-alerts", workspace?: "...", secretRef:"ref://..." }, | ||||
|     createdAt, updatedAt } | ||||
|   ``` | ||||
|  | ||||
| * `deliveries` | ||||
|  | ||||
|   ``` | ||||
|   { _id, tenantId, ruleId, actionId, eventId, kind, scope, status:"sent|failed|throttled|digested|dropped", | ||||
|     attempts:[{ts, status, code, reason}], | ||||
|     rendered:{ title, body, target },    // redacted for PII; body hash stored | ||||
|     sentAt, lastError? } | ||||
|   ``` | ||||
|  | ||||
| * `digests` | ||||
|  | ||||
|   ``` | ||||
|   { _id, tenantId, actionKey, window:"hourly", openedAt, items:[{eventId, scope, delta}], status:"open|flushed" } | ||||
|   ``` | ||||
|  | ||||
| * `throttles` | ||||
|  | ||||
|   ``` | ||||
|   { key:"idem:<hash>", ttlAt }   // short-lived, also cached in Redis | ||||
|   ``` | ||||
|  | ||||
| **Indexes**: rules by `{tenantId, enabled}`, deliveries by `{tenantId, sentAt desc}`, digests by `{tenantId, actionKey}`. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 8) External APIs (WebService) | ||||
|  | ||||
| Base path: `/api/v1/notify` (Authority OpToks; scopes: `notify.admin` for write, `notify.read` for view). | ||||
|  | ||||
| *All* REST calls require the tenant header `X-StellaOps-Tenant` (matches the canonical `tenantId` stored in Mongo). Payloads are normalised via `NotifySchemaMigration` before persistence to guarantee schema version pinning. | ||||
|  | ||||
| Authentication today is stubbed with Bearer tokens (`Authorization: Bearer <token>`). When Authority wiring lands, this will switch to OpTok validation + scope enforcement, but the header contract will remain the same. | ||||
|  | ||||
| Service configuration exposes `notify:auth:*` keys (issuer, audience, signing key, scope names) so operators can wire the Authority JWKS or (in dev) a symmetric test key. `notify:storage:*` keys cover Mongo URI/database/collection overrides. Both sets are required for the new API surface. | ||||
|  | ||||
| Internal tooling can hit `/internal/notify/<entity>/normalize` to upgrade legacy JSON and return canonical output used in the docs fixtures. | ||||
|  | ||||
| * **Channels** | ||||
|  | ||||
|   * `POST /channels` | `GET /channels` | `GET /channels/{id}` | `PATCH /channels/{id}` | `DELETE /channels/{id}` | ||||
|   * `POST /channels/{id}/test` → send sample message (no rule evaluation); returns `202 Accepted` with rendered preview + metadata (base keys: `channelType`, `target`, `previewProvider`, `traceId` + connector-specific entries); governed by `api.rateLimits.testSend`. | ||||
|   * `GET /channels/{id}/health` → connector self‑check (returns redacted metadata: secret refs hashed, sensitive config keys masked, fallbacks noted via `teams.fallbackText`/`teams.validation.*`) | ||||
|  | ||||
| * **Rules** | ||||
|  | ||||
|   * `POST /rules` | `GET /rules` | `GET /rules/{id}` | `PATCH /rules/{id}` | `DELETE /rules/{id}` | ||||
|   * `POST /rules/{id}/test` → dry‑run rule against a **sample event** (no delivery unless `--send`) | ||||
|  | ||||
| * **Deliveries** | ||||
|  | ||||
|   * `POST /deliveries` → ingest worker delivery state (idempotent via `deliveryId`). | ||||
|   * `GET /deliveries?since=...&status=...&limit=...` → list envelope `{ items, count, continuationToken }` (most recent first); base metadata keys match the test-send response (`channelType`, `target`, `previewProvider`, `traceId`); rate-limited via `api.rateLimits.deliveryHistory`. See `docs/modules/notify/resources/samples/notify-delivery-list-response.sample.json`. | ||||
|   * `GET /deliveries/{id}` → detail (redacted body + metadata) | ||||
|   * `POST /deliveries/{id}/retry` → force retry (admin, future sprint) | ||||
|  | ||||
| * **Admin** | ||||
|  | ||||
|   * `GET /stats` (per tenant counts, last hour/day) | ||||
|   * `GET /healthz|readyz` (liveness/readiness) | ||||
|   * `POST /locks/acquire` | `POST /locks/release` – worker coordination primitives (short TTL). | ||||
|   * `POST /digests` | `GET /digests/{actionKey}` | `DELETE /digests/{actionKey}` – manage open digest windows. | ||||
|   * `POST /audit` | `GET /audit?since=&limit=` – append/query structured audit trail entries. | ||||
|  | ||||
| **Ingestion**: workers do **not** expose public ingestion; they **subscribe** to the internal bus. (Optional `/events/test` for integration testing, admin‑only.) | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 9) Delivery pipeline (worker) | ||||
|  | ||||
| ``` | ||||
| [Event bus] → [Ingestor] → [RuleMatcher] → [Throttle/Dedupe] → [DigestCoalescer] → [Renderer] → [Connector] → [Result] | ||||
|                                                  └────────→ [DeliveryStore] | ||||
| ``` | ||||
|  | ||||
| * **Ingestor**: N consumers with per‑key ordering (key = tenant|digest|namespace). | ||||
| * **RuleMatcher**: loads active rules snapshot for tenant into memory; vectorized predicate check. | ||||
| * **Throttle/Dedupe**: consult Redis + Mongo `throttles`; if hit → record `status=throttled`. | ||||
| * **DigestCoalescer**: append to open digest window or flush when timer expires. | ||||
| * **Renderer**: select template (channel+locale), inject variables, enforce length limits, compute `bodyHash`. | ||||
| * **Connector**: send; handle provider‑specific rate limits and backoffs; `maxAttempts` with exponential jitter; overflow → DLQ (dead‑letter topic) + UI surfacing. | ||||
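
The connector retry loop might be structured as in the Python sketch below. It assumes "exponential jitter" means capped exponential backoff with full jitter, and that a provider-supplied `Retry-After` value takes precedence. `SendResult`, `retry_delay`, `deliver`, and the outcome strings are hypothetical names.

```python
import random
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class SendResult:
    ok: bool
    retryable: bool = False
    retry_after: Optional[float] = None  # seconds, e.g. from a 429 response


def retry_delay(attempt, base=1.0, cap=60.0, retry_after=None, rng=random.random):
    """Delay before the next attempt: provider hint first, else full jitter."""
    if retry_after is not None:
        return max(retry_after, 0.0)
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def deliver(send, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call `send` until it succeeds, fails permanently, or attempts run out."""
    for attempt in range(max_attempts):
        result = send()
        if result.ok:
            return "sent"
        if not result.retryable:
            return "dropped"
        if attempt + 1 < max_attempts:
            sleep(retry_delay(attempt, retry_after=result.retry_after, rng=rng))
    # Attempts exhausted: the caller would publish to the dead-letter topic.
    return "dead-lettered"
```

Injecting `sleep` and `rng` lets tests run instantly and assert on the exact delays chosen.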
|  | ||||
| **Idempotency**: per action **idempotency key** stored in Redis (TTL = `throttle window` or `digest window`). Connectors also respect **provider** idempotency where available (e.g., Slack `client_msg_id`). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 10) Reliability & rate controls | ||||
|  | ||||
| * **Per‑tenant** RPM caps (default 600/min) + **per‑channel** concurrency (Slack 1–4, Teams 1–2, Email 8–32 based on relay). | ||||
| * **Backoff** map: Slack 429 → respect `Retry‑After`; SMTP 4xx → retry; HTTP 5xx → retry with jitter; permanent rejects (e.g., SMTP 5xx) → drop with status recorded. | ||||
| * **DLQ**: NATS/Redis stream `notify.dlq` with `{event, rule, action, error}` for operator inspection; UI shows DLQ items. | ||||
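
A per-tenant RPM cap can be enforced with a token bucket, sketched here in Python. Using a token bucket for this particular cap is an assumption (the document names token buckets only for `api.rateLimits`), and `TokenBucket` is a hypothetical class, not the `System.Threading.RateLimiting` API.

```python
class TokenBucket:
    """Refills continuously at rate_per_minute and allows bursts up to capacity."""

    def __init__(self, rate_per_minute, capacity=None, now=0.0):
        self.rate = rate_per_minute / 60.0  # tokens per second
        self.capacity = float(rate_per_minute if capacity is None else capacity)
        self.tokens = self.capacity
        self.updated = now

    def try_acquire(self, now, cost=1.0):
        """Take `cost` tokens if available; `now` is a monotonic time in seconds."""
        elapsed = max(now - self.updated, 0.0)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    import time

    # One bucket per tenant, sized from the documented default of 600/min.
    buckets = {"tenant-01": TokenBucket(600, now=time.monotonic())}
    print(buckets["tenant-01"].try_acquire(time.monotonic()))
```

A notification that fails `try_acquire` would be deferred or recorded as throttled rather than sent.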
|  | ||||
| --- | ||||
|  | ||||
| ## 11) Security & privacy | ||||
|  | ||||
| * **AuthZ**: all APIs require **Authority** OpToks; actions scoped by tenant. | ||||
| * **Secrets**: `secretRef` only; Notify fetches just‑in‑time from Authority Secret proxy or K8s Secret (mounted). No plaintext secrets in Mongo. | ||||
| * **Egress TLS**: validate TLS certificates; pin domains per channel config; optional CA bundle override for on‑prem SMTP. | ||||
| * **Webhook signing**: HMAC or Ed25519 signatures in `X-StellaOps-Signature` + replay‑window timestamp; include canonical body hash in header. | ||||
| * **Redaction**: deliveries store **hashes** of bodies, not full payloads for chat/email to minimize PII retention (configurable). | ||||
| * **Quiet hours**: per tenant (e.g., 22:00–06:00) route high‑sev only; defer others to digests. | ||||
| * **Loop prevention**: Webhook target allowlist + event origin tags; do not ingest own webhooks. | ||||
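
The webhook signing bullet can be illustrated with the HMAC-SHA-256 variant. In this Python sketch only `X-StellaOps-Signature` is taken from the document; the `X-StellaOps-Timestamp` and `X-StellaOps-Body-SHA256` header names, the `sha256=` prefix, the signed string layout, and the 300-second tolerance are all assumptions.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes, timestamp: int) -> dict:
    """Return the headers a sender would attach to the POST."""
    body_hash = hashlib.sha256(body).hexdigest()
    # Signing timestamp + body hash binds the signature to a point in time.
    message = f"{timestamp}.{body_hash}".encode("utf-8")
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return {
        "X-StellaOps-Signature": "sha256=" + signature,
        "X-StellaOps-Timestamp": str(timestamp),       # hypothetical header name
        "X-StellaOps-Body-SHA256": body_hash,          # hypothetical header name
    }


def verify(secret: bytes, body: bytes, headers: dict, now: int, tolerance: int = 300) -> bool:
    """Receiver side: check the replay window, then compare signatures."""
    try:
        timestamp = int(headers["X-StellaOps-Timestamp"])
    except (KeyError, ValueError):
        return False
    if abs(now - timestamp) > tolerance:
        return False  # outside the replay window
    expected = sign(secret, body, timestamp)["X-StellaOps-Signature"]
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, headers.get("X-StellaOps-Signature", ""))
```

Recomputing the signature from the received body, rather than trusting the body-hash header, ensures a tampered payload fails verification.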
|  | ||||
| --- | ||||
|  | ||||
| ## 12) Observability (Prometheus + OTEL) | ||||
|  | ||||
| * `notify.events_consumed_total{kind}` | ||||
| * `notify.rules_matched_total{ruleId}` | ||||
| * `notify.throttled_total{reason}` | ||||
| * `notify.digest_coalesced_total{window}` | ||||
| * `notify.sent_total{channel}` / `notify.failed_total{channel,code}` | ||||
| * `notify.delivery_latency_seconds{channel}` (end‑to‑end) | ||||
| * **Tracing**: spans `ingest`, `match`, `render`, `send`; correlation id = `eventId`. | ||||
|  | ||||
| **SLO targets** | ||||
|  | ||||
| * Event→delivery p95 **≤ 30–60 s** under nominal load. | ||||
| * Failure rate p95 **< 0.5%** per hour (excluding provider outages). | ||||
| * Duplicate rate **≈ 0** (idempotency working). | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 13) Configuration (YAML) | ||||
|  | ||||
| ```yaml | ||||
| notify: | ||||
|   authority: | ||||
|     issuer: "https://authority.internal" | ||||
|     require: "dpop"               # or "mtls" | ||||
|   bus: | ||||
|     kind: "redis"                 # or "nats" | ||||
|     streams: | ||||
|       - "scanner.events" | ||||
|       - "scheduler.events" | ||||
|       - "attestor.events" | ||||
|       - "zastava.events" | ||||
|   mongo: | ||||
|     uri: "mongodb://mongo/notify" | ||||
|   limits: | ||||
|     perTenantRpm: 600 | ||||
|     perChannel: | ||||
|       slack:   { concurrency: 2 } | ||||
|       teams:   { concurrency: 1 } | ||||
|       email:   { concurrency: 8 } | ||||
|       webhook: { concurrency: 8 } | ||||
|   digests: | ||||
|     defaultWindow: "1h" | ||||
|     maxItems: 100 | ||||
|   quietHours: | ||||
|     enabled: true | ||||
|     window: "22:00-06:00" | ||||
|     minSeverity: "critical" | ||||
|   webhooks: | ||||
|     sign: | ||||
|       method: "ed25519"           # or "hmac-sha256" | ||||
|       keyRef: "ref://notify/webhook-sign-key" | ||||
| ``` | ||||
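
The `quietHours` block above implies a window that wraps past midnight. The Python sketch below shows one way to evaluate it; the severity scale, the `route` function, its return values, and the assumption that `now` is already in the tenant's local time are illustrative, not documented behaviour.

```python
from datetime import time

# Assumption: ordered severity scale, lowest first.
SEVERITIES = ["low", "medium", "high", "critical"]


def parse_window(spec):
    """Turn "22:00-06:00" into a (start, end) pair of times."""
    start_s, end_s = spec.split("-")
    return time.fromisoformat(start_s), time.fromisoformat(end_s)


def in_quiet_hours(now, start, end):
    """True when `now` falls in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def route(severity, now, cfg):
    """Send immediately, or defer to a digest during quiet hours."""
    if not cfg["enabled"]:
        return "send"
    start, end = parse_window(cfg["window"])
    if not in_quiet_hours(now, start, end):
        return "send"
    if SEVERITIES.index(severity) >= SEVERITIES.index(cfg["minSeverity"]):
        return "send"
    return "digest"


if __name__ == "__main__":
    cfg = {"enabled": True, "window": "22:00-06:00", "minSeverity": "critical"}
    print(route("high", time(23, 30), cfg))
```

The wrap-around branch is the part that is easy to get wrong: a plain `start <= now < end` comparison never matches when the window starts in the evening and ends in the morning.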
|  | ||||
| --- | ||||
|  | ||||
| ## 14) UI touch‑points | ||||
|  | ||||
| * **Notifications → Channels**: add Slack/Teams/Email/Webhook; run **health**; rotate secrets. | ||||
| * **Notifications → Rules**: create/edit YAML rules with linting; test with sample events; see match rate. | ||||
| * **Notifications → Deliveries**: timeline with filters (status, channel, rule); inspect last error; retry. | ||||
| * **Digest preview**: shows current window contents and when it will flush. | ||||
| * **Quiet hours**: configure per tenant; show overrides. | ||||
| * **DLQ**: browse dead‑letters; requeue after fix. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 15) Failure modes & responses | ||||
|  | ||||
| | Condition                           | Behavior                                                                              | | ||||
| | ----------------------------------- | ------------------------------------------------------------------------------------- | | ||||
| | Slack 429 / Teams 429               | Respect `Retry‑After`, backoff with jitter, reduce concurrency                        | | ||||
| | SMTP transient 4xx                  | Retry up to `maxAttempts`; escalate to DLQ on exhaust                                 | | ||||
| | Invalid channel secret              | Mark channel unhealthy; suppress sends; surface in UI                                 | | ||||
| | Rule explosion (matches everything) | Safety valve: per‑tenant RPM caps; auto‑pause rule after X drops; UI alert            | | ||||
| | Bus outage                          | Buffer to local queue (bounded); resume consuming when healthy                        | | ||||
| | Mongo slowness                      | Fall back to Redis throttles; batch write deliveries; shed low‑priority notifications | | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 16) Testing matrix | ||||
|  | ||||
| * **Unit**: matchers, throttle math, digest coalescing, idempotency keys, template rendering edge cases. | ||||
| * **Connectors**: provider‑level rate limits, payload size truncation, error mapping. | ||||
| * **Integration**: synthetic event storm (10k/min), ensure p95 latency & duplicate rate. | ||||
| * **Security**: DPoP/mTLS on APIs; secretRef resolution; webhook signing & replay windows. | ||||
| * **i18n**: localized templates render deterministically. | ||||
| * **Chaos**: Slack/Teams API flaps; SMTP greylisting; Redis hiccups; ensure graceful degradation. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 17) Sequences (representative) | ||||
|  | ||||
| **A) New criticals after Feedser delta (Slack immediate + Email hourly digest)** | ||||
|  | ||||
| ```mermaid | ||||
| sequenceDiagram | ||||
|   autonumber | ||||
|   participant SCH as Scheduler | ||||
|   participant NO as Notify.Worker | ||||
|   participant SL as Slack | ||||
|   participant SMTP as Email | ||||
|  | ||||
|   SCH->>NO: bus event scheduler.rescan.delta { newCritical:1, digest:sha256:... } | ||||
|   NO->>NO: match rules (Slack immediate; Email hourly digest) | ||||
|   NO->>SL: chat.postMessage (concise) | ||||
|   SL-->>NO: 200 OK | ||||
|   NO->>NO: append to digest window (email:soc) | ||||
|   Note over NO: At window close → render digest email | ||||
|   NO->>SMTP: send email (detailed digest) | ||||
|   SMTP-->>NO: 250 OK | ||||
| ``` | ||||
|  | ||||
| **B) Admission deny (Teams card + Webhook)** | ||||
|  | ||||
| ```mermaid | ||||
| sequenceDiagram | ||||
|   autonumber | ||||
|   participant ZA as Zastava | ||||
|   participant NO as Notify.Worker | ||||
|   participant TE as Teams | ||||
|   participant WH as Webhook | ||||
|  | ||||
|   ZA->>NO: bus event zastava.admission { decision: "deny", reasons: [...] } | ||||
|   NO->>TE: POST adaptive card | ||||
|   TE-->>NO: 200 OK | ||||
|   NO->>WH: POST JSON (signed) | ||||
|   WH-->>NO: 2xx | ||||
| ``` | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 18) Implementation notes | ||||
|  | ||||
| * **Language**: .NET 10; minimal API; `System.Text.Json` with canonical writer for body hashing; Channels for pipelines. | ||||
| * **Bus**: Redis Streams (**XGROUP** consumers) or NATS JetStream for at‑least‑once with ack; per‑tenant consumer groups to localize backpressure. | ||||
| * **Templates**: compile and cache per rule+channel+locale; version with rule `updatedAt` to invalidate. | ||||
| * **Rules**: store raw YAML + parsed AST; validate with schema + static checks (e.g., nonsensical combos). | ||||
| * **Secrets**: pluggable secret resolver (Authority Secret proxy, K8s, Vault). | ||||
| * **Rate limiting**: `System.Threading.RateLimiting` + per‑connector adapters. | ||||
|  | ||||
| --- | ||||
|  | ||||
| ## 19) Roadmap (post‑v1) | ||||
|  | ||||
| * **PagerDuty/Opsgenie** connectors; **Jira** ticket creation. | ||||
| * **User inbox** (in‑app notifications) + mobile push via webhook relay. | ||||
| * **Anomaly suppression**: auto‑pause noisy rules with hints (learned thresholds). | ||||
| * **Graph rules**: “only notify if *not_affected → affected* transition at consensus layer”. | ||||
| * **Label enrichment**: pluggable taggers (business criticality, data classification) to refine matchers. | ||||