git.stella-ops.org/docs/ARCHITECTURE_NOTIFY.md

Scope. Implementation-ready architecture for Notify: a rules-driven, tenant-aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator-defined routing rules, renders channel-specific messages (Slack/Teams/Email/Webhook), and delivers them reliably with idempotency, throttling, and digests. It is UI-managed, auditable, and safe by default (no secrets leakage, no spam storms).


0) Mission & boundaries

Mission. Convert facts from StellaOps into actionable, noise-controlled signals where teams already live (chat/email/webhooks), with explainable reasons and deep links to the UI.

Boundaries.

  • Notify does not make policy decisions and does not rescan; it consumes events from Scanner/Scheduler/Vexer/Feedser/Attestor/Zastava and routes them.
  • Attachments are links (UI/attestation pages); Notify does not attach SBOMs or large blobs to messages.
  • Secrets for channels (Slack tokens, SMTP creds) are referenced, not stored raw in Mongo.

1) Runtime shape & projects

src/
 ├─ StellaOps.Notify.WebService/        # REST: rules/channels CRUD, test send, deliveries browse
 ├─ StellaOps.Notify.Worker/            # consumers + evaluators + renderers + delivery workers
 ├─ StellaOps.Notify.Connectors.* /     # channel plug-ins: Slack, Teams, Email, Webhook (v1)
 │    └─ *.Tests/
 ├─ StellaOps.Notify.Engine/            # rules engine, templates, idempotency, digests, throttles
 ├─ StellaOps.Notify.Models/            # DTOs (Rule, Channel, Event, Delivery, Template)
 ├─ StellaOps.Notify.Storage.Mongo/     # rules, channels, deliveries, digests, locks
 ├─ StellaOps.Notify.Queue/             # bus client (Redis Streams/NATS JetStream)
 └─ StellaOps.Notify.Tests.*            # unit/integration/e2e

Deployables:

  • Notify.WebService (stateless API)
  • Notify.Worker (horizontal scale)

Dependencies: Authority (OpToks; DPoP/mTLS), MongoDB, Redis/NATS (bus), HTTP egress to Slack/Teams/Webhooks, SMTP relay for Email.

Configuration. Notify.WebService bootstraps from notify.yaml (see etc/notify.yaml.sample). Use storage.driver: mongo with a production connection string; the optional memory driver exists only for tests. Authority settings follow the platform defaults—when running locally without Authority, set authority.enabled: false and supply developmentSigningKey so JWTs can be validated offline.

Plug-ins. All channel connectors are packaged under <baseDirectory>/plugins/notify. The ordered load list must start with Slack/Teams before Email/Webhook so chat-first actions are registered deterministically for Offline Kit bundles:

plugins:
  baseDirectory: "/var/opt/stellaops"
  directory: "plugins/notify"
  orderedPlugins:
    - StellaOps.Notify.Connectors.Slack
    - StellaOps.Notify.Connectors.Teams
    - StellaOps.Notify.Connectors.Email
    - StellaOps.Notify.Connectors.Webhook

The Offline Kit job simply copies the plugins/notify tree into the air-gapped bundle; the ordered list keeps connector manifests stable across environments.

Authority clients. Register two OAuth clients in StellaOps Authority: notify-web-dev (audience notify.dev) for development and notify-web (audience notify) for staging/production. Both require notify.read and notify.admin scopes and use DPoP-bound client credentials (client_secret in the samples). Reference entries live in etc/authority.yaml.sample, with placeholder secrets under etc/secrets/notify-web*.secret.example.


2) Responsibilities

  1. Ingest platform events from internal bus with strong ordering per key (e.g., image digest).
  2. Evaluate rules (tenant-scoped) with matchers: severity changes, namespaces, repos, labels, KEV flags, provider provenance (VEX), component keys, admission decisions, etc.
  3. Control noise: throttle, coalesce (digest windows), and dedupe via idempotency keys.
  4. Render channel-specific messages using safe templates; include evidence and links.
  5. Deliver with retries/backoff; record outcome; expose delivery history to UI.
  6. Test paths (send test to channel targets) without touching live rules.
  7. Audit: log who configured what, when, and why a message was sent.

3) Event model (inputs)

Notify subscribes to the internal event bus (produced by services, escaped JSON; gzip allowed with caps):

  • scanner.scan.completed — new SBOM(s) composed; artifacts ready
  • scanner.report.ready — analysis verdict (policy+vex) available; carries deltas summary
  • scheduler.rescan.delta — new findings after Feedser/Vexer deltas (already summarized)
  • attestor.logged — Rekor UUID returned (sbom/report/vex export)
  • zastava.admission — admit/deny with reasons, namespace, image digests
  • feedser.export.completed — new export ready (rarely notified directly; usually drives Scheduler)
  • vexer.export.completed — new consensus snapshot (ditto)

Canonical envelope (bus → Notify.Engine):

{
  "eventId": "uuid",
  "kind": "scanner.report.ready",
  "tenant": "tenant-01",
  "ts": "2025-10-18T05:41:22Z",
  "actor": "scanner-webservice",
  "scope": { "namespace":"payments", "repo":"ghcr.io/acme/api", "digest":"sha256:..." },
  "payload": { /* kind-specific fields, see below */ }
}
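For reference, the envelope deserialises into a handful of typed fields; the following is an illustrative Python sketch (the real DTOs are C# records in StellaOps.Notify.Models):

```python
import json
from dataclasses import dataclass

@dataclass
class NotifyEvent:
    """Canonical bus envelope as consumed by Notify.Engine."""
    event_id: str
    kind: str
    tenant: str
    ts: str
    actor: str
    scope: dict
    payload: dict

def parse_envelope(raw: str) -> NotifyEvent:
    doc = json.loads(raw)
    return NotifyEvent(
        event_id=doc["eventId"],
        kind=doc["kind"],
        tenant=doc["tenant"],
        ts=doc["ts"],
        actor=doc["actor"],
        scope=doc.get("scope", {}),
        payload=doc.get("payload", {}),
    )
```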

Examples (payload cores):

  • scanner.report.ready:

    {
      "reportId": "report-3def...",
      "verdict": "fail",
      "summary": {"total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2},
      "delta": {"newCritical": 1, "kev": ["CVE-2025-..."]},
      "links": {"ui": "https://ui/.../reports/report-3def...", "rekor": "https://rekor/..."},
      "dsse": { "...": "..." },
      "report": { "...": "..." }
    }
    

    Payload embeds both the canonical report document and the DSSE envelope so connectors, Notify, and UI tooling can reuse the signed bytes without re-serialising.

  • scanner.scan.completed:

    {
      "reportId": "report-3def...",
      "digest": "sha256:...",
      "verdict": "fail",
      "summary": {"total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2},
      "delta": {"newCritical": 1, "kev": ["CVE-2025-..."]},
      "policy": {"revisionId": "rev-42", "digest": "27d2..."},
      "findings": [{"id": "finding-1", "severity": "Critical", "cve": "CVE-2025-...", "reachability": "runtime"}],
      "dsse": { "...": "..." }
    }
    
  • zastava.admission:

    { "decision":"deny|allow", "reasons":["unsigned image","missing SBOM"],
      "images":[{"digest":"sha256:...","signed":false,"hasSbom":false}] }
    

4) Rules engine — semantics

Rule shape (simplified):

name: "high-critical-alerts-prod"
enabled: true
match:
  eventKinds: ["scanner.report.ready","scheduler.rescan.delta","zastava.admission"]
  namespaces: ["prod-*"]
  repos: ["ghcr.io/acme/*"]
  minSeverity: "high"            # min of new findings (delta context)
  kev: true                      # require KEV-tagged or allow any if false
  verdict: ["fail","deny"]       # filter for report/admission
  vex:
    includeRejectedJustifications: false    # notify only on accepted 'affected'
actions:
  - channel: "slack:sec-alerts"  # reference to Channel object
    template: "concise"
    throttle: "5m"
  - channel: "email:soc"
    digest: "hourly"
    template: "detailed"

Evaluation order

  1. Tenant check → discard if rule tenant ≠ event tenant.
  2. Kind filter → discard early.
  3. Scope match (namespace/repo/labels).
  4. Delta/severity gates (if event carries delta).
  5. VEX gate (drop if the event's finding is not affected under policy consensus, unless the rule says otherwise).
  6. Throttling/dedup (idempotency key) — skip if suppressed.
  7. Actions → enqueue per-channel job(s).
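The seven steps above amount to an early-exit predicate chain. The following is an illustrative Python sketch only (the real matcher is the vectorized C# implementation in Notify.Engine); field names such as `worstNewSeverity` are assumptions for illustration, not part of the published event schema, and the VEX gate is omitted:

```python
from fnmatch import fnmatch

SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}

def rule_matches(rule: dict, event: dict) -> bool:
    """Early-exit chain mirroring steps 1-4; throttling and action
    fan-out (steps 6-7) happen downstream of the match."""
    if rule["tenant"] != event["tenant"]:                  # 1. tenant check
        return False
    match = rule["match"]
    if event["kind"] not in match["eventKinds"]:           # 2. kind filter
        return False
    ns = event.get("scope", {}).get("namespace", "")
    if match.get("namespaces") and not any(                # 3. scope match (glob)
            fnmatch(ns, pat) for pat in match["namespaces"]):
        return False
    delta = event.get("payload", {}).get("delta", {})
    min_sev = match.get("minSeverity")
    if min_sev and delta:                                  # 4. delta/severity gate
        worst = delta.get("worstNewSeverity", "low")       # hypothetical field
        if SEVERITY_RANK[worst] < SEVERITY_RANK[min_sev]:
            return False
    return True
```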

Idempotency key: hash(ruleId | actionId | event.kind | scope.digest | delta.hash | day-bucket); ensures the "same alert" doesn't fire more than once within the throttle window.
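A minimal sketch of that key derivation (Python; the choice of SHA-256 and the `|` separator are implementation details assumed here, not a spec — only the input fields and the `idem:` key prefix come from this document):

```python
import hashlib

def idempotency_key(rule_id: str, action_id: str, kind: str,
                    digest: str, delta_hash: str, day_bucket: str) -> str:
    """Stable throttle key: any change in the delta or the day bucket
    yields a fresh key, so the same alert fires again at most once per
    window."""
    material = "|".join([rule_id, action_id, kind, digest, delta_hash, day_bucket])
    return "idem:" + hashlib.sha256(material.encode("utf-8")).hexdigest()
```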

Digest windows: maintain a per-action coalescer:

  • Window: 5m|15m|1h|1d (configurable); coalesces events by tenant + namespace/repo or by digest group.
  • Digest messages summarize top N items and counts, with safe truncation.
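A per-action coalescer can be as simple as a buffer with a deadline. The sketch below (Python, illustrative only) flushes on whichever comes first, max items or window expiry; the real worker also flushes on a timer so a quiet window still drains:

```python
import time

class DigestWindow:
    """Buffers coalesced events for one action until the window elapses
    or max_items is reached, then flushes one summary batch."""

    def __init__(self, window_seconds: int, max_items: int = 100):
        self.window_seconds = window_seconds
        self.max_items = max_items
        self.opened_at = None   # monotonic timestamp of first item
        self.items = []

    def add(self, item) -> list:
        """Append an item; returns the flushed batch when the window
        closes, else an empty list."""
        now = time.monotonic()
        if self.opened_at is None:
            self.opened_at = now
        self.items.append(item)
        if len(self.items) >= self.max_items or now - self.opened_at >= self.window_seconds:
            return self.flush()
        return []

    def flush(self) -> list:
        batch, self.items, self.opened_at = self.items, [], None
        return batch
```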

5) Channels & connectors (plugins)

Channel config is two-part: a Channel record (name, type, options) and a Secret reference (Vault/K8s Secret). Connectors are restart-time plug-ins discovered on service start (same manifest convention as Concelier/Excititor) and live under plugins/notify/<channel>/.

Built-in v1:

  • Slack: Bot token (xoxb…), chat.postMessage + blocks; rate limit aware (HTTP 429).
  • Microsoft Teams: Incoming Webhook (or Graph card later); adaptive card payloads.
  • Email (SMTP): TLS (STARTTLS or implicit), From/To/CC/BCC; HTML+text alt; DKIM optional.
  • Generic Webhook: POST JSON with a signature (HMAC-SHA256 or Ed25519) in headers.

Connector contract: (implemented by plug-in assemblies)

public interface INotifyConnector {
  string Type { get; } // "slack" | "teams" | "email" | "webhook" | ...
  Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct);
  Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct);
}

DeliveryContext includes rendered content and raw event for audit.

Secrets: ChannelConfig.secretRef points to an Authority-managed secret handle or a K8s Secret path; workers load it at send time; plug-in manifests (notify-plugin.json) declare capabilities and version.


6) Templates & rendering

Template engine: strongly typed, safe Handlebars-style; no arbitrary code. Partial templates per channel. Deterministic outputs (property order, no locale drift unless requested).

Variables (examples):

  • event.kind, event.ts, scope.namespace, scope.repo, scope.digest
  • payload.verdict, payload.delta.newCritical, payload.links.ui, payload.links.rekor
  • topFindings[] with purl, vulnId, severity
  • policy.name, policy.revision (if available)

Helpers:

  • severity_icon(sev), link(text,url), pluralize(n, "finding"), truncate(text, n), code(text).
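Hedged sketches of a few of these helpers (Python, illustrative only; the icon choices and the truncation marker are assumptions — the real engine is the typed template runtime described above):

```python
def pluralize(n: int, noun: str) -> str:
    """Naive English pluralization; real templates may handle locales."""
    return f"{n} {noun}" if n == 1 else f"{n} {noun}s"

def truncate(text: str, n: int) -> str:
    """Hard cap at n chars, reserving one for an ellipsis marker."""
    return text if len(text) <= n else text[: max(0, n - 1)] + "…"

def severity_icon(sev: str) -> str:
    # Hypothetical mapping; connectors may ship their own icon sets.
    return {"critical": "🔴", "high": "🟠", "medium": "🟡", "low": "⚪"}.get(sev.lower(), "⚪")
```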

Channel mapping:

  • Slack: title + blocks, limited to 50 blocks/3000 chars per section; long lists → link to UI.
  • Teams: Adaptive Card schema 1.5; fallback text for older channels.
  • Email: HTML + text; inline table of top N findings, rest behind UI link.
  • Webhook: JSON with event, ruleId, actionId, summary, links, and raw payload subset.

i18n: template set per locale (English default; Bulgarian built-in).


7) Data model (Mongo)

Canonical JSON Schemas for rules/channels/events live in docs/notify/schemas/. Sample payloads intended for tests/UI mock responses are captured in docs/notify/samples/.

Database: notify

  • rules

    { _id, tenantId, name, enabled, match, actions, createdBy, updatedBy, createdAt, updatedAt }
    
  • channels

    { _id, tenantId, name:"slack:sec-alerts", type:"slack",
      config:{ webhookUrl?:"", channel:"#sec-alerts", workspace?: "...", secretRef:"ref://..." },
      createdAt, updatedAt }
    
  • deliveries

    { _id, tenantId, ruleId, actionId, eventId, kind, scope, status:"sent|failed|throttled|digested|dropped",
      attempts:[{ts, status, code, reason}],
      rendered:{ title, body, target },    // redacted for PII; body hash stored
      sentAt, lastError? }
    
  • digests

    { _id, tenantId, actionKey, window:"hourly", openedAt, items:[{eventId, scope, delta}], status:"open|flushed" }
    
  • throttles

    { key:"idem:<hash>", ttlAt }   // short-lived, also cached in Redis
    

Indexes: rules by {tenantId, enabled}, deliveries by {tenantId, sentAt desc}, digests by {tenantId, actionKey}.


8) External APIs (WebService)

Base path: /api/v1/notify (Authority OpToks; scopes: notify.admin for write, notify.read for view).

All REST calls require the tenant header X-StellaOps-Tenant (matches the canonical tenantId stored in Mongo). Payloads are normalised via NotifySchemaMigration before persistence to guarantee schema version pinning.

Authentication today is stubbed with Bearer tokens (Authorization: Bearer <token>). When Authority wiring lands, this will switch to OpTok validation + scope enforcement, but the header contract will remain the same.

Service configuration exposes notify:auth:* keys (issuer, audience, signing key, scope names) so operators can wire the Authority JWKS or (in dev) a symmetric test key. notify:storage:* keys cover Mongo URI/database/collection overrides. Both sets are required for the new API surface.

Internal tooling can hit /internal/notify/<entity>/normalize to upgrade legacy JSON and return canonical output used in the docs fixtures.

  • Channels

    • POST /channels | GET /channels | GET /channels/{id} | PATCH /channels/{id} | DELETE /channels/{id}
    • POST /channels/{id}/test → send sample message (no rule evaluation)
    • GET /channels/{id}/health → connector self-check
  • Rules

    • POST /rules | GET /rules | GET /rules/{id} | PATCH /rules/{id} | DELETE /rules/{id}
    • POST /rules/{id}/test → dry-run rule against a sample event (no delivery unless --send)
  • Deliveries

    • POST /deliveries → ingest worker delivery state (idempotent via deliveryId).
    • GET /deliveries?since=...&status=...&limit=... → list (most recent first)
    • GET /deliveries/{id} → detail (redacted body + metadata)
    • POST /deliveries/{id}/retry → force retry (admin, future sprint)
  • Admin

    • GET /stats (per tenant counts, last hour/day)
    • GET /healthz|readyz (liveness)
    • POST /locks/acquire | POST /locks/release: worker coordination primitives (short TTL).
    • POST /digests | GET /digests/{actionKey} | DELETE /digests/{actionKey}: manage open digest windows.
    • POST /audit | GET /audit?since=&limit=: append/query structured audit trail entries.

Ingestion: workers do not expose public ingestion; they subscribe to the internal bus. (Optional /events/test for integration testing, admin-only.)


9) Delivery pipeline (worker)

[Event bus] → [Ingestor] → [RuleMatcher] → [Throttle/Dedupe] → [DigestCoalescer] → [Renderer] → [Connector] → [Result]
                                                 └────────→ [DeliveryStore]
  • Ingestor: N consumers with per-key ordering (key = tenant|digest|namespace).
  • RuleMatcher: loads active rules snapshot for tenant into memory; vectorized predicate check.
  • Throttle/Dedupe: consult Redis + Mongo throttles; if hit → record status=throttled.
  • DigestCoalescer: append to open digest window or flush when timer expires.
  • Renderer: select template (channel+locale), inject variables, enforce length limits, compute bodyHash.
  • Connector: send; handle provider-specific rate limits and backoffs; maxAttempts with exponential jitter; overflow → DLQ (dead-letter topic) + UI surfacing.

Idempotency: per action idempotency key stored in Redis (TTL = throttle window or digest window). Connectors also respect provider idempotency where available (e.g., Slack client_msg_id).


10) Reliability & rate controls

  • Per-tenant RPM caps (default 600/min) + per-channel concurrency (Slack 1–4, Teams 1–2, Email 8–32 depending on relay).
  • Backoff map: Slack 429 → respect Retry-After; SMTP 4xx → retry; 5xx → retry with jitter; permanent rejects → drop with status recorded.
  • DLQ: NATS/Redis stream notify.dlq with {event, rule, action, error} for operator inspection; UI shows DLQ items.
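The backoff map boils down to "provider hint wins, otherwise exponential with jitter". A minimal sketch, assuming full jitter and a 60-second cap (both are tunables assumed here, not documented defaults):

```python
import random
from typing import Optional

def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry `attempt` (1-based). A provider-supplied
    Retry-After always takes precedence; otherwise exponential
    backoff with full jitter, capped."""
    if retry_after is not None:
        return retry_after
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))
```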

11) Security & privacy

  • AuthZ: all APIs require Authority OpToks; actions scoped by tenant.
  • Secrets: secretRef only; Notify fetches just-in-time from the Authority secret proxy or a K8s Secret (mounted). No plaintext secrets in Mongo.
  • Egress TLS: validate TLS certificates; pin domains per channel config; optional CA bundle override for on-prem SMTP.
  • Webhook signing: HMAC or Ed25519 signatures in X-StellaOps-Signature + replay-window timestamp; include canonical body hash in header.
  • Redaction: deliveries store hashes of bodies, not full payloads for chat/email to minimize PII retention (configurable).
  • Quiet hours: per tenant (e.g., 22:00–06:00) route high-severity only; defer others to digests.
  • Loop prevention: Webhook target allowlist + event origin tags; do not ingest own webhooks.
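The HMAC-SHA256 variant of the signing scheme can be sketched as below. Only the header name and the replay-window idea come from this design; the `t=...,v1=...` layout inside the header is an assumption for illustration:

```python
import hashlib
import hmac
import time
from typing import Optional

def sign_webhook(secret: bytes, body: bytes, ts: Optional[int] = None) -> dict:
    """Produce the X-StellaOps-Signature header: HMAC-SHA256 over
    '<timestamp>.<body>' so the timestamp is covered by the MAC."""
    ts = int(time.time()) if ts is None else ts
    mac = hmac.new(secret, f"{ts}.".encode() + body, hashlib.sha256).hexdigest()
    return {"X-StellaOps-Signature": f"t={ts},v1={mac}"}

def verify_webhook(secret: bytes, body: bytes, header: str,
                   now: int, max_skew: int = 300) -> bool:
    """Receiver side: reject stale timestamps (replay window), then
    compare MACs in constant time."""
    fields = dict(kv.split("=", 1) for kv in header.split(","))
    ts = int(fields["t"])
    if abs(now - ts) > max_skew:
        return False
    expected = hmac.new(secret, f"{ts}.".encode() + body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, fields["v1"])
```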

12) Observability (Prometheus + OTEL)

  • notify.events_consumed_total{kind}
  • notify.rules_matched_total{ruleId}
  • notify.throttled_total{reason}
  • notify.digest_coalesced_total{window}
  • notify.sent_total{channel} / notify.failed_total{channel,code}
  • notify.delivery_latency_seconds{channel} (end-to-end)
  • Tracing: spans ingest, match, render, send; correlation id = eventId.

SLO targets

  • Event→delivery p95 ≤ 30–60 s under nominal load.
  • Failure rate p95 < 0.5% per hour (excluding provider outages).
  • Duplicate rate ≈ 0 (idempotency working).

13) Configuration (YAML)

notify:
  authority:
    issuer: "https://authority.internal"
    require: "dpop"               # or "mtls"
  bus:
    kind: "redis"                 # or "nats"
    streams:
      - "scanner.events"
      - "scheduler.events"
      - "attestor.events"
      - "zastava.events"
  mongo:
    uri: "mongodb://mongo/notify"
  limits:
    perTenantRpm: 600
    perChannel:
      slack:   { concurrency: 2 }
      teams:   { concurrency: 1 }
      email:   { concurrency: 8 }
      webhook: { concurrency: 8 }
  digests:
    defaultWindow: "1h"
    maxItems: 100
  quietHours:
    enabled: true
    window: "22:00-06:00"
    minSeverity: "critical"
  webhooks:
    sign:
      method: "ed25519"           # or "hmac-sha256"
      keyRef: "ref://notify/webhook-sign-key"

14) UI touchpoints

  • Notifications → Channels: add Slack/Teams/Email/Webhook; run health; rotate secrets.
  • Notifications → Rules: create/edit YAML rules with linting; test with sample events; see match rate.
  • Notifications → Deliveries: timeline with filters (status, channel, rule); inspect last error; retry.
  • Digest preview: shows current window contents and when it will flush.
  • Quiet hours: configure per tenant; show overrides.
  • DLQ: browse deadletters; requeue after fix.

15) Failure modes & responses

| Condition | Behavior |
| --- | --- |
| Slack 429 / Teams 429 | Respect Retry-After; back off with jitter; reduce concurrency |
| SMTP transient 4xx | Retry up to maxAttempts; escalate to DLQ on exhaustion |
| Invalid channel secret | Mark channel unhealthy; suppress sends; surface in UI |
| Rule explosion (matches everything) | Safety valve: per-tenant RPM caps; auto-pause rule after X drops; UI alert |
| Bus outage | Buffer to a local bounded queue; resume consuming when healthy |
| Mongo slowness | Fall back to Redis throttles; batch-write deliveries; shed low-priority notifications |

16) Testing matrix

  • Unit: matchers, throttle math, digest coalescing, idempotency keys, template rendering edge cases.
  • Connectors: provider-level rate limits, payload size truncation, error mapping.
  • Integration: synthetic event storm (10k/min), ensure p95 latency & duplicate rate.
  • Security: DPoP/mTLS on APIs; secretRef resolution; webhook signing & replay windows.
  • i18n: localized templates render deterministically.
  • Chaos: Slack/Teams API flaps; SMTP greylisting; Redis hiccups; ensure graceful degradation.

17) Sequences (representative)

A) New criticals after Feedser delta (Slack immediate + Email hourly digest)

sequenceDiagram
  autonumber
  participant SCH as Scheduler
  participant NO as Notify.Worker
  participant SL as Slack
  participant SMTP as Email

  SCH->>NO: bus event scheduler.rescan.delta { newCritical:1, digest:sha256:... }
  NO->>NO: match rules (Slack immediate; Email hourly digest)
  NO->>SL: chat.postMessage (concise)
  SL-->>NO: 200 OK
  NO->>NO: append to digest window (email:soc)
  Note over NO: At window close → render digest email
  NO->>SMTP: send email (detailed digest)
  SMTP-->>NO: 250 OK

B) Admission deny (Teams card + Webhook)

sequenceDiagram
  autonumber
  participant ZA as Zastava
  participant NO as Notify.Worker
  participant TE as Teams
  participant WH as Webhook

  ZA->>NO: bus event zastava.admission { decision: "deny", reasons: [...] }
  NO->>TE: POST adaptive card
  TE-->>NO: 200 OK
  NO->>WH: POST JSON (signed)
  WH-->>NO: 2xx

18) Implementation notes

  • Language: .NET 10; minimal API; System.Text.Json with canonical writer for body hashing; Channels for pipelines.
  • Bus: Redis Streams (XGROUP consumers) or NATS JetStream for at-least-once delivery with acks; per-tenant consumer groups to localize backpressure.
  • Templates: compile and cache per rule+channel+locale; version with rule updatedAt to invalidate.
  • Rules: store raw YAML + parsed AST; validate with schema + static checks (e.g., nonsensical combos).
  • Secrets: pluggable secret resolver (Authority Secret proxy, K8s, Vault).
  • Rate limiting: System.Threading.RateLimiting + per-connector adapters.

19) Roadmap (postv1)

  • PagerDuty/Opsgenie connectors; Jira ticket creation.
  • User inbox (in-app notifications) + mobile push via webhook relay.
  • Anomaly suppression: auto-pause noisy rules with hints (learned thresholds).
  • Graph rules: “only notify if not_affected → affected transition at consensus layer”.
  • Label enrichment: pluggable taggers (business criticality, data classification) to refine matchers.