feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules

- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
2025-10-30 00:09:39 +02:00
parent 3154c67978
commit 7b5bdcf4d3
503 changed files with 16136 additions and 54638 deletions

@@ -0,0 +1,22 @@
# Notify agent guide
## Mission
Notify evaluates operator-defined rules against platform events and dispatches channel-specific payloads with full auditability.
## Key docs
- [Module README](./README.md)
- [Architecture](./architecture.md)
- [Implementation plan](./implementation_plan.md)
- [Task board](./TASKS.md)
## How to get started
1. Open ../../implplan/SPRINTS.md and locate the stories referencing this module.
2. Review ./TASKS.md for local follow-ups and confirm status transitions (TODO → DOING → DONE/BLOCKED).
3. Read the architecture and README for domain context before editing code or docs.
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.
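
The determinism guardrail above can be illustrated with a minimal sketch. It assumes Python purely for illustration (the module itself targets .NET); `normalise_ts` and `canonical_json` are hypothetical helper names, not part of the Notify codebase.

```python
import json
from datetime import datetime, timedelta, timezone


def normalise_ts(ts: datetime) -> str:
    """Render an aware datetime as UTC ISO-8601 with a trailing 'Z'."""
    if ts.tzinfo is None:
        # Naive timestamps depend on the machine's local zone, so reject them.
        raise ValueError("naive timestamp; attach a timezone first")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def canonical_json(doc: dict) -> str:
    """Serialise with sorted keys and fixed separators for stable bytes."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


if __name__ == "__main__":
    local = datetime(2025, 10, 18, 8, 41, 22, tzinfo=timezone(timedelta(hours=3)))
    print(normalise_ts(local))
    print(canonical_json({"b": 1, "a": [2, 1]}))
```

Sorting keys and pinning separators keeps the output byte-stable across machines, which matters when a hash or signature is computed over it.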

@@ -0,0 +1,35 @@
# StellaOps Notify
Notify evaluates operator-defined rules against platform events and dispatches channel-specific payloads with full auditability.
## Responsibilities
- Process event streams and apply tenant-scoped routing rules.
- Render connector-specific payloads (email, Slack, Teams, webhook, custom).
- Enforce throttling, digests, and delivery retries.
- Surface delivery/audit data for UI and CLI consumers.
## Key components
- `StellaOps.Notify.WebService` (rules API + preview).
- `StellaOps.Notify.Worker` (delivery engine).
- Connector libraries under `StellaOps.Notify.Connectors.*`.
## Integrations & dependencies
- MongoDB for rule/channel storage.
- Redis/NATS for delivery queues.
- CLI/UI for authoring and monitoring notifications.
## Operational notes
- Schema fixtures in ./resources/schemas & ./resources/samples.
- Connector-specific monitoring dashboards.
- Offline runner guidance inside operations playbook.
## Related resources
- ./resources/schemas
- ./resources/samples
## Backlog references
- NOTIFY-SVC-38..40 (Notify backlog) referenced in `docs/README.md`.
- DOCS-NOTIFY updates tracked in ../../TASKS.md when available.
## Epic alignment
- **Epic 11 – Notifications Studio:** deliver notifications workspace, preview tooling, immutable delivery ledger, and tenant-scoped throttling/digest controls.

@@ -0,0 +1,9 @@
# Task board — Notify
> Local tasks should link back to ./AGENTS.md and mirror status updates into ../../TASKS.md when applicable.
| ID | Status | Owner(s) | Description | Notes |
|----|--------|----------|-------------|-------|
| NOTIFY-DOCS-0001 | DOING (2025-10-29) | Docs Guild | Validate that ./README.md aligns with the latest release notes. | See ./AGENTS.md |
| NOTIFY-OPS-0001 | TODO | Ops Guild | Review runbooks/observability assets after next sprint demo. | Sync outcomes back to ../../TASKS.md |
| NOTIFY-ENG-0001 | TODO | Module Team | Cross-check implementation plan milestones against ../../implplan/SPRINTS.md. | Update status via ./AGENTS.md workflow |

@@ -0,0 +1,515 @@
> **Scope.** Implementation-ready architecture for **Notify** (aligned with Epic 11 – Notifications Studio): a rules-driven, tenant-aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator-defined routing rules, renders **channel-specific messages** (Slack/Teams/Email/Webhook), and delivers them **reliably** with idempotency, throttling, and digests. It is UI-managed, auditable, and safe by default (no secrets leakage, no spam storms).
---
## 0) Mission & boundaries
**Mission.** Convert **facts** from StellaOps into **actionable, noise-controlled** signals where teams already live (chat/email/webhooks), with **explainable** reasons and deep links to the UI.
**Boundaries.**
* Notify **does not make policy decisions** and **does not rescan**; it **consumes** events from Scanner/Scheduler/Vexer/Feedser/Attestor/Zastava and routes them.
* Attachments are **links** (UI/attestation pages); Notify **does not** attach SBOMs or large blobs to messages.
* Secrets for channels (Slack tokens, SMTP creds) are **referenced**, not stored raw in Mongo.
---
## 1) Runtime shape & projects
```
src/
├─ StellaOps.Notify.WebService/ # REST: rules/channels CRUD, test send, deliveries browse
├─ StellaOps.Notify.Worker/ # consumers + evaluators + renderers + delivery workers
├─ StellaOps.Notify.Connectors.* / # channel plug-ins: Slack, Teams, Email, Webhook (v1)
│ └─ *.Tests/
├─ StellaOps.Notify.Engine/ # rules engine, templates, idempotency, digests, throttles
├─ StellaOps.Notify.Models/ # DTOs (Rule, Channel, Event, Delivery, Template)
├─ StellaOps.Notify.Storage.Mongo/ # rules, channels, deliveries, digests, locks
├─ StellaOps.Notify.Queue/ # bus client (Redis Streams/NATS JetStream)
└─ StellaOps.Notify.Tests.* # unit/integration/e2e
```
**Deployables**:
* **Notify.WebService** (stateless API)
* **Notify.Worker** (horizontal scale)
**Dependencies**: Authority (OpToks; DPoP/mTLS), MongoDB, Redis/NATS (bus), HTTP egress to Slack/Teams/Webhooks, SMTP relay for Email.
> **Configuration.** Notify.WebService bootstraps from `notify.yaml` (see `etc/notify.yaml.sample`). Use `storage.driver: mongo` with a production connection string; the optional `memory` driver exists only for tests. Authority settings follow the platform defaults—when running locally without Authority, set `authority.enabled: false` and supply `developmentSigningKey` so JWTs can be validated offline.
>
> `api.rateLimits` exposes token-bucket controls for delivery history queries and test-send previews (`deliveryHistory`, `testSend`). Default values allow generous browsing while preventing accidental bursts; operators can relax/tighten the buckets per deployment.
> **Plug-ins.** All channel connectors are packaged under `<baseDirectory>/plugins/notify`. The ordered load list must start with Slack/Teams before Email/Webhook so chat-first actions are registered deterministically for Offline Kit bundles:
>
> ```yaml
> plugins:
>   baseDirectory: "/var/opt/stellaops"
>   directory: "plugins/notify"
>   orderedPlugins:
>     - StellaOps.Notify.Connectors.Slack
>     - StellaOps.Notify.Connectors.Teams
>     - StellaOps.Notify.Connectors.Email
>     - StellaOps.Notify.Connectors.Webhook
> ```
>
> The Offline Kit job simply copies the `plugins/notify` tree into the air-gapped bundle; the ordered list keeps connector manifests stable across environments.
> **Authority clients.** Register two OAuth clients in StellaOps Authority: `notify-web-dev` (audience `notify.dev`) for development and `notify-web` (audience `notify`) for staging/production. Both require `notify.read` and `notify.admin` scopes and use DPoP-bound client credentials (`client_secret` in the samples). Reference entries live in `etc/authority.yaml.sample`, with placeholder secrets under `etc/secrets/notify-web*.secret.example`.
---
## 2) Responsibilities
1. **Ingest** platform events from internal bus with strong ordering per key (e.g., image digest).
2. **Evaluate rules** (tenant-scoped) with matchers: severity changes, namespaces, repos, labels, KEV flags, provider provenance (VEX), component keys, admission decisions, etc.
3. **Control noise**: **throttle**, **coalesce** (digest windows), and **dedupe** via idempotency keys.
4. **Render** channel-specific messages using safe templates; include **evidence** and **links**.
5. **Deliver** with retries/backoff; record outcome; expose delivery history to UI.
6. **Test** paths (send test to channel targets) without touching live rules.
7. **Audit**: log who configured what, when, and why a message was sent.
---
## 3) Event model (inputs)
Notify subscribes to the **internal event bus** (produced by services, escaped JSON; gzip allowed with caps):
* `scanner.scan.completed` — new SBOM(s) composed; artifacts ready
* `scanner.report.ready` — analysis verdict (policy+vex) available; carries deltas summary
* `scheduler.rescan.delta` — new findings after Feedser/Vexer deltas (already summarized)
* `attestor.logged` — Rekor UUID returned (sbom/report/vex export)
* `zastava.admission` — admit/deny with reasons, namespace, image digests
* `feedser.export.completed` — new export ready (rarely notified directly; usually drives Scheduler)
* `vexer.export.completed` — new consensus snapshot (ditto)
**Canonical envelope (bus → Notify.Engine):**
```json
{
"eventId": "uuid",
"kind": "scanner.report.ready",
"tenant": "tenant-01",
"ts": "2025-10-18T05:41:22Z",
"actor": "scanner-webservice",
"scope": { "namespace":"payments", "repo":"ghcr.io/acme/api", "digest":"sha256:..." },
"payload": { /* kind-specific fields, see below */ }
}
```
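
As a rough illustration of consuming this envelope, the sketch below parses the top-level fields and normalises `ts` to UTC. It is written in Python for brevity and is an assumption-laden example: `Envelope` and `parse_envelope` are hypothetical names, and the real DTOs live in `StellaOps.Notify.Models`.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone

# Top-level fields shown in the canonical envelope above.
REQUIRED = ("eventId", "kind", "tenant", "ts", "actor", "scope", "payload")


@dataclass(frozen=True)
class Envelope:
    event_id: str
    kind: str
    tenant: str
    ts: datetime
    actor: str
    scope: dict
    payload: dict


def parse_envelope(raw: str) -> Envelope:
    """Parse a bus message and fail fast when a required field is absent."""
    doc = json.loads(raw)
    missing = [key for key in REQUIRED if key not in doc]
    if missing:
        raise ValueError(f"envelope missing fields: {missing}")
    # Accept both 'Z' and explicit offsets, then normalise to UTC.
    ts = datetime.fromisoformat(doc["ts"].replace("Z", "+00:00")).astimezone(timezone.utc)
    return Envelope(doc["eventId"], doc["kind"], doc["tenant"], ts,
                    doc["actor"], doc["scope"], doc["payload"])
```

Rejecting incomplete envelopes at the edge keeps later stages (matching, rendering) free of defensive null checks.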
**Examples (payload cores):**
* `scanner.report.ready`:
```json
{
"reportId": "report-3def...",
"verdict": "fail",
"summary": {"total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2},
"delta": {"newCritical": 1, "kev": ["CVE-2025-..."]},
"links": {"ui": "https://ui/.../reports/report-3def...", "rekor": "https://rekor/..."},
"dsse": { "...": "..." },
"report": { "...": "..." }
}
```
Payload embeds both the canonical report document and the DSSE envelope so connectors, Notify, and UI tooling can reuse the signed bytes without re-serialising.
* `scanner.scan.completed`:
```json
{
"reportId": "report-3def...",
"digest": "sha256:...",
"verdict": "fail",
"summary": {"total": 12, "blocked": 2, "warned": 3, "ignored": 5, "quieted": 2},
"delta": {"newCritical": 1, "kev": ["CVE-2025-..."]},
"policy": {"revisionId": "rev-42", "digest": "27d2..."},
"findings": [{"id": "finding-1", "severity": "Critical", "cve": "CVE-2025-...", "reachability": "runtime"}],
"dsse": { "...": "..." }
}
```
* `zastava.admission`:
```json
{ "decision":"deny|allow", "reasons":["unsigned image","missing SBOM"],
"images":[{"digest":"sha256:...","signed":false,"hasSbom":false}] }
```
---
## 4) Rules engine — semantics
**Rule shape (simplified):**
```yaml
name: "high-critical-alerts-prod"
enabled: true
match:
  eventKinds: ["scanner.report.ready","scheduler.rescan.delta","zastava.admission"]
  namespaces: ["prod-*"]
  repos: ["ghcr.io/acme/*"]
  minSeverity: "high"            # min of new findings (delta context)
  kev: true                      # require KEV-tagged or allow any if false
  verdict: ["fail","deny"]       # filter for report/admission
  vex:
    includeRejectedJustifications: false  # notify only on accepted 'affected'
actions:
  - channel: "slack:sec-alerts"  # reference to Channel object
    template: "concise"
    throttle: "5m"
  - channel: "email:soc"
    digest: "hourly"
    template: "detailed"
```
**Evaluation order**
1. **Tenant check** → discard if rule tenant ≠ event tenant.
2. **Kind filter** → discard early.
3. **Scope match** (namespace/repo/labels).
4. **Delta/severity gates** (if event carries `delta`).
5. **VEX gate** (drop if the event's finding is not affected under policy consensus unless the rule says otherwise).
6. **Throttling/dedup** (idempotency key) — skip if suppressed.
7. **Actions** → enqueue per-channel job(s).
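
Steps 1 to 4 of the evaluation order can be sketched as a single predicate. This is an illustrative Python sketch, not the engine's implementation: `rule_matches` is a hypothetical name, glob matching via `fnmatchcase` is an assumption about how patterns such as `prod-*` are interpreted, and the severity, verdict, and VEX gates are omitted.

```python
from fnmatch import fnmatchcase


def rule_matches(rule: dict, event: dict) -> bool:
    """Apply the tenant, kind, scope, and KEV gates in the documented order."""
    match = rule.get("match", {})

    # 1. Tenant check: a rule never sees another tenant's events.
    if rule["tenantId"] != event["tenant"]:
        return False

    # 2. Kind filter: the cheapest test, so discard early.
    kinds = match.get("eventKinds")
    if kinds and event["kind"] not in kinds:
        return False

    # 3. Scope match: every configured pattern list must have one hit.
    scope = event.get("scope", {})
    for rule_field, scope_field in (("namespaces", "namespace"), ("repos", "repo")):
        patterns = match.get(rule_field)
        if patterns and not any(fnmatchcase(scope.get(scope_field, ""), p) for p in patterns):
            return False

    # 4. Delta gate: applies only when the event carries a delta.
    delta = event.get("payload", {}).get("delta")
    if delta is not None and match.get("kev") and not delta.get("kev"):
        return False

    return True
```

Ordering the checks from cheapest to most expensive means most non-matching events are rejected before any pattern matching runs.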
**Idempotency key**: `hash(ruleId | actionId | event.kind | scope.digest | delta.hash | day-bucket)`; ensures the “same alert” doesn't fire more than once within the throttle window.
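
A minimal sketch of deriving such a key follows. The choice of SHA-256, the `|` separator, and a UTC calendar day as the day bucket are assumptions made for illustration; `idempotency_key` is a hypothetical helper name.

```python
import hashlib
from datetime import datetime, timezone


def idempotency_key(rule_id: str, action_id: str, kind: str,
                    digest: str, delta_hash: str, ts: datetime) -> str:
    """Hash the documented key parts into a stable 'idem:<hash>' string."""
    # Day bucket: the UTC calendar day of the event (assumed granularity).
    day_bucket = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    material = "|".join([rule_id, action_id, kind, digest, delta_hash, day_bucket])
    return "idem:" + hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Two events that agree on every component within the same UTC day produce the same key, so the second one can be suppressed by a simple existence check.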
**Digest windows**: maintain per action a **coalescer**:
* Window: `5m|15m|1h|1d` (configurable); coalesces events by tenant + namespace/repo or by digest group.
* Digest messages summarize top N items and counts, with safe truncation.
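
The coalescer behaviour can be sketched in memory as follows. This is a simplified Python illustration under stated assumptions: `DigestCoalescer` here is a hypothetical class, windows are keyed by an opaque action key, and "top N" simply means the first N items in arrival order.

```python
from datetime import datetime, timedelta, timezone


class DigestCoalescer:
    """Collect items per action key and flush them once the window elapses."""

    def __init__(self, window: timedelta, top_n: int = 3):
        self.window = window
        self.top_n = top_n
        self._open = {}  # action_key -> (opened_at, items)

    def append(self, action_key: str, item: dict, now: datetime) -> None:
        # The first item for a key opens the window; later items join it.
        _, items = self._open.setdefault(action_key, (now, []))
        items.append(item)

    def flush_due(self, now: datetime) -> list:
        """Return summaries for every window that has expired, in key order."""
        due = sorted(k for k, (opened, _) in self._open.items()
                     if now - opened >= self.window)
        summaries = []
        for key in due:
            _, items = self._open.pop(key)
            summaries.append({"actionKey": key, "count": len(items),
                              "top": items[: self.top_n]})
        return summaries
```

Passing `now` explicitly instead of reading the clock keeps the coalescer deterministic and easy to test.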
---
## 5) Channels & connectors (plugins)
Channel config is **two-part**: a **Channel** record (name, type, options) and a Secret **reference** (Vault/K8s Secret). Connectors are **restart-time plug-ins** discovered on service start (same manifest convention as Concelier/Excititor) and live under `plugins/notify/<channel>/`.
**Built-in v1:**
* **Slack**: Bot token (xoxb…), `chat.postMessage` + `blocks`; rate limit aware (HTTP 429).
* **Microsoft Teams**: Incoming Webhook (or Graph card later); adaptive card payloads.
* **Email (SMTP)**: TLS (STARTTLS or implicit), From/To/CC/BCC; HTML+text alt; DKIM optional.
* **Generic Webhook**: POST JSON with a signature (Ed25519 or HMAC-SHA256) in headers.
**Connector contract:** (implemented by plug-in assemblies)
```csharp
public interface INotifyConnector {
string Type { get; } // "slack" | "teams" | "email" | "webhook" | ...
Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct);
Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct);
}
```
**DeliveryContext** includes **rendered content** and **raw event** for audit.
**Test-send previews.** Plug-ins can optionally implement `INotifyChannelTestProvider` to shape `/channels/{id}/test` responses. Providers receive a sanitised `ChannelTestPreviewContext` (channel, tenant, target, timestamp, trace) and return a `NotifyDeliveryRendered` preview + metadata. When no provider is present, the host falls back to a generic preview so the endpoint always responds.
**Secrets**: `ChannelConfig.secretRef` points to an Authority-managed secret handle or K8s Secret path; workers load at send-time; plug-in manifests (`notify-plugin.json`) declare capabilities and version.
---
## 6) Templates & rendering
**Template engine**: strongly typed, safe Handlebars-style; no arbitrary code. Partial templates per channel. Deterministic outputs (prop order, no locale drift unless requested).
**Variables** (examples):
* `event.kind`, `event.ts`, `scope.namespace`, `scope.repo`, `scope.digest`
* `payload.verdict`, `payload.delta.newCritical`, `payload.links.ui`, `payload.links.rekor`
* `topFindings[]` with `purl`, `vulnId`, `severity`
* `policy.name`, `policy.revision` (if available)
**Helpers**:
* `severity_icon(sev)`, `link(text,url)`, `pluralize(n, "finding")`, `truncate(text, n)`, `code(text)`.
**Channel mapping**:
* Slack: title + blocks, limited to 50 blocks/3000 chars per section; long lists → link to UI.
* Teams: Adaptive Card schema 1.5; fallback text for older channels (surfaced as `teams.fallbackText` metadata alongside webhook hash).
* Email: HTML + text; inline table of top N findings, rest behind UI link.
* Webhook: JSON with `event`, `ruleId`, `actionId`, `summary`, `links`, and raw `payload` subset.
**i18n**: template set per locale (English default; Bulgarian built-in).
---
## 7) Data model (Mongo)
Canonical JSON Schemas for rules/channels/events live in `docs/modules/notify/resources/schemas/`. Sample payloads intended for tests/UI mock responses are captured in `docs/modules/notify/resources/samples/`.
**Database**: `notify`
* `rules`
```
{ _id, tenantId, name, enabled, match, actions, createdBy, updatedBy, createdAt, updatedAt }
```
* `channels`
```
{ _id, tenantId, name:"slack:sec-alerts", type:"slack",
config:{ webhookUrl?:"", channel:"#sec-alerts", workspace?: "...", secretRef:"ref://..." },
createdAt, updatedAt }
```
* `deliveries`
```
{ _id, tenantId, ruleId, actionId, eventId, kind, scope, status:"sent|failed|throttled|digested|dropped",
attempts:[{ts, status, code, reason}],
rendered:{ title, body, target }, // redacted for PII; body hash stored
sentAt, lastError? }
```
* `digests`
```
{ _id, tenantId, actionKey, window:"hourly", openedAt, items:[{eventId, scope, delta}], status:"open|flushed" }
```
* `throttles`
```
{ key:"idem:<hash>", ttlAt } // short-lived, also cached in Redis
```
**Indexes**: rules by `{tenantId, enabled}`, deliveries by `{tenantId, sentAt desc}`, digests by `{tenantId, actionKey}`.
---
## 8) External APIs (WebService)
Base path: `/api/v1/notify` (Authority OpToks; scopes: `notify.admin` for write, `notify.read` for view).
*All* REST calls require the tenant header `X-StellaOps-Tenant` (matches the canonical `tenantId` stored in Mongo). Payloads are normalised via `NotifySchemaMigration` before persistence to guarantee schema version pinning.
Authentication today is stubbed with Bearer tokens (`Authorization: Bearer <token>`). When Authority wiring lands, this will switch to OpTok validation + scope enforcement, but the header contract will remain the same.
Service configuration exposes `notify:auth:*` keys (issuer, audience, signing key, scope names) so operators can wire the Authority JWKS or (in dev) a symmetric test key. `notify:storage:*` keys cover Mongo URI/database/collection overrides. Both sets are required for the new API surface.
Internal tooling can hit `/internal/notify/<entity>/normalize` to upgrade legacy JSON and return canonical output used in the docs fixtures.
* **Channels**
* `POST /channels` | `GET /channels` | `GET /channels/{id}` | `PATCH /channels/{id}` | `DELETE /channels/{id}`
* `POST /channels/{id}/test` → send sample message (no rule evaluation); returns `202 Accepted` with rendered preview + metadata (base keys: `channelType`, `target`, `previewProvider`, `traceId` + connector-specific entries); governed by `api.rateLimits:testSend`.
* `GET /channels/{id}/health` → connector self-check (returns redacted metadata: secret refs hashed, sensitive config keys masked, fallbacks noted via `teams.fallbackText`/`teams.validation.*`)
* **Rules**
* `POST /rules` | `GET /rules` | `GET /rules/{id}` | `PATCH /rules/{id}` | `DELETE /rules/{id}`
* `POST /rules/{id}/test` → dry-run rule against a **sample event** (no delivery unless `--send`)
* **Deliveries**
* `POST /deliveries` → ingest worker delivery state (idempotent via `deliveryId`).
* `GET /deliveries?since=...&status=...&limit=...` → list envelope `{ items, count, continuationToken }` (most recent first); base metadata keys match the test-send response (`channelType`, `target`, `previewProvider`, `traceId`); rate-limited via `api.rateLimits.deliveryHistory`. See `docs/modules/notify/resources/samples/notify-delivery-list-response.sample.json`.
* `GET /deliveries/{id}` → detail (redacted body + metadata)
* `POST /deliveries/{id}/retry` → force retry (admin, future sprint)
* **Admin**
* `GET /stats` (per tenant counts, last hour/day)
* `GET /healthz|readyz` (liveness/readiness)
* `POST /locks/acquire` | `POST /locks/release` → worker coordination primitives (short TTL).
* `POST /digests` | `GET /digests/{actionKey}` | `DELETE /digests/{actionKey}` → manage open digest windows.
* `POST /audit` | `GET /audit?since=&limit=` → append/query structured audit trail entries.
**Ingestion**: workers do **not** expose public ingestion; they **subscribe** to the internal bus. (Optional `/events/test` for integration testing, admin-only.)
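
The header contract described in this section (tenant header plus Bearer token under the `/api/v1/notify` base path) can be illustrated with a small request builder. This Python sketch only constructs the request and does not send it; the host name, channel id, and token in the usage are placeholders, and `build_request` is a hypothetical helper.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_PATH = "/api/v1/notify"


def build_request(base_url: str, path: str, tenant: str, token: str,
                  body: Optional[dict] = None, method: str = "GET") -> Request:
    """Build (but do not send) a Notify API request with the required headers."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = Request(f"{base_url.rstrip('/')}{BASE_PATH}{path}", data=data, method=method)
    req.add_header("X-StellaOps-Tenant", tenant)        # required on all REST calls
    req.add_header("Authorization", f"Bearer {token}")  # stubbed Bearer auth today
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


if __name__ == "__main__":
    # Placeholder host, channel id, and token; nothing is sent over the network.
    req = build_request("https://notify.example.internal", "/channels/channel-1/test",
                        "tenant-01", "dev-token", body={}, method="POST")
    print(req.get_method(), req.full_url)
```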
---
## 9) Delivery pipeline (worker)
```
[Event bus] → [Ingestor] → [RuleMatcher] → [Throttle/Dedupe] → [DigestCoalescer] → [Renderer] → [Connector] → [Result]
└────────→ [DeliveryStore]
```
* **Ingestor**: N consumers with per-key ordering (key = tenant|digest|namespace).
* **RuleMatcher**: loads active rules snapshot for tenant into memory; vectorized predicate check.
* **Throttle/Dedupe**: consult Redis + Mongo `throttles`; if hit → record `status=throttled`.
* **DigestCoalescer**: append to open digest window or flush when timer expires.
* **Renderer**: select template (channel+locale), inject variables, enforce length limits, compute `bodyHash`.
* **Connector**: send; handle provider-specific rate limits and backoffs; `maxAttempts` with exponential jitter; overflow → DLQ (dead-letter topic) + UI surfacing.
**Idempotency**: per action **idempotency key** stored in Redis (TTL = `throttle window` or `digest window`). Connectors also respect **provider** idempotency where available (e.g., Slack `client_msg_id`).
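
The TTL-based suppression described above can be sketched with an in-memory stand-in for the Redis key store. This is an illustration only: `ThrottleStore` is a hypothetical class, and a real deployment would rely on an atomic set-if-absent-with-expiry operation in Redis (for example `SET key value NX EX ttl`) rather than a Python dict.

```python
class ThrottleStore:
    """In-memory stand-in for a key store with per-key expiry."""

    def __init__(self):
        self._expiry = {}  # key -> absolute expiry time (seconds)

    def try_acquire(self, key: str, ttl_seconds: float, now: float) -> bool:
        """Return True for the first caller within the TTL, False afterwards."""
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # still inside the throttle/digest window: suppress
        self._expiry[key] = now + ttl_seconds
        return True
```

A `False` result corresponds to recording the delivery with `status=throttled` instead of sending it.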
---
## 10) Reliability & rate controls
* **Per-tenant** RPM caps (default 600/min) + **per-channel** concurrency (Slack 1–4, Teams 1–2, Email 8–32 based on relay).
* **Backoff** map: Slack 429 → respect `Retry-After`; SMTP 4xx → retry; 5xx → retry with jitter; permanent rejects → drop with status recorded.
* **DLQ**: NATS/Redis stream `notify.dlq` with `{event, rule, action, error}` for operator inspection; UI shows DLQ items.
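
One way to combine "respect `Retry-After`" with jittered exponential backoff is sketched below. The base delay, the cap, and the use of full jitter are assumptions chosen for illustration; `backoff_delay` is a hypothetical helper and the random source is injectable so the behaviour can be tested.

```python
import random
from typing import Callable, Optional


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before the next attempt (attempt counts from 0)."""
    if retry_after is not None:
        # The provider told us when to come back; honour it as-is.
        return float(retry_after)
    # Full jitter: a random point in [0, min(cap, base * 2^attempt)).
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

Jitter spreads retries from many workers over time so they do not hit the provider in lockstep after a shared failure.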
---
## 11) Security & privacy
* **AuthZ**: all APIs require **Authority** OpToks; actions scoped by tenant.
* **Secrets**: `secretRef` only; Notify fetches just-in-time from Authority Secret proxy or K8s Secret (mounted). No plaintext secrets in Mongo.
* **Egress TLS**: validate SSL; pin domains per channel config; optional CA bundle override for on-prem SMTP.
* **Webhook signing**: HMAC or Ed25519 signatures in `X-StellaOps-Signature` + replay-window timestamp; include canonical body hash in header.
* **Redaction**: deliveries store **hashes** of bodies, not full payloads for chat/email to minimize PII retention (configurable).
* **Quiet hours**: per tenant (e.g., 22:00-06:00) route high-sev only; defer others to digests.
* **Loop prevention**: Webhook target allowlist + event origin tags; do not ingest own webhooks.
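
The HMAC variant of webhook signing with a replay window can be sketched as follows. The signed message layout (`timestamp.bodyHash`), the hex encoding, and the 300-second skew are assumptions made for illustration, not the documented wire format; `sign` and `verify` are hypothetical helpers.

```python
import hashlib
import hmac


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over '<timestamp>.<sha256(body)>' (assumed layout)."""
    body_hash = hashlib.sha256(body).hexdigest()
    message = f"{timestamp}.{body_hash}".encode("ascii")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: int, max_skew: int = 300) -> bool:
    """Reject stale timestamps first, then compare in constant time."""
    if abs(now - timestamp) > max_skew:
        return False  # outside the replay window
    return hmac.compare_digest(sign(secret, timestamp, body), signature)
```

Binding the timestamp into the signed message prevents an attacker from replaying an old body with a fresh timestamp.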
---
## 12) Observability (Prometheus + OTEL)
* `notify.events_consumed_total{kind}`
* `notify.rules_matched_total{ruleId}`
* `notify.throttled_total{reason}`
* `notify.digest_coalesced_total{window}`
* `notify.sent_total{channel}` / `notify.failed_total{channel,code}`
* `notify.delivery_latency_seconds{channel}` (end-to-end)
* **Tracing**: spans `ingest`, `match`, `render`, `send`; correlation id = `eventId`.
**SLO targets**
* Event→delivery p95 **≤ 30–60 s** under nominal load.
* Failure rate p95 **< 0.5%** per hour (excluding provider outages).
* Duplicate rate **≈ 0** (idempotency working).
---
## 13) Configuration (YAML)
```yaml
notify:
  authority:
    issuer: "https://authority.internal"
    require: "dpop"            # or "mtls"
  bus:
    kind: "redis"              # or "nats"
    streams:
      - "scanner.events"
      - "scheduler.events"
      - "attestor.events"
      - "zastava.events"
  mongo:
    uri: "mongodb://mongo/notify"
  limits:
    perTenantRpm: 600
    perChannel:
      slack: { concurrency: 2 }
      teams: { concurrency: 1 }
      email: { concurrency: 8 }
      webhook: { concurrency: 8 }
  digests:
    defaultWindow: "1h"
    maxItems: 100
  quietHours:
    enabled: true
    window: "22:00-06:00"
    minSeverity: "critical"
  webhooks:
    sign:
      method: "ed25519"        # or "hmac-sha256"
      keyRef: "ref://notify/webhook-sign-key"
```
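
To illustrate how the `quietHours` settings might be applied, the sketch below checks a wall-clock time against a window that can wrap past midnight and routes by severity. The severity ordering and the `"digest"`/`"immediate"` labels are assumptions; `in_window` and `route` are hypothetical helpers.

```python
from datetime import time

# Assumed ordering from least to most severe.
SEVERITIES = ("low", "medium", "high", "critical")


def in_window(now: time, window: str) -> bool:
    """True when 'now' falls inside 'HH:MM-HH:MM', including midnight wrap."""
    start_s, end_s = window.split("-")
    start, end = time.fromisoformat(start_s), time.fromisoformat(end_s)
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window wraps past midnight


def route(now: time, severity: str, window: str = "22:00-06:00",
          min_severity: str = "critical") -> str:
    """During quiet hours, defer anything below min_severity to a digest."""
    quiet = in_window(now, window)
    if quiet and SEVERITIES.index(severity) < SEVERITIES.index(min_severity):
        return "digest"
    return "immediate"
```

The wrap-around branch is the part that is easy to get wrong: a window such as `22:00-06:00` has a start later than its end, so a plain range comparison would never match.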
---
## 14) UI touchpoints
* **Notifications → Channels**: add Slack/Teams/Email/Webhook; run **health**; rotate secrets.
* **Notifications → Rules**: create/edit YAML rules with linting; test with sample events; see match rate.
* **Notifications → Deliveries**: timeline with filters (status, channel, rule); inspect last error; retry.
* **Digest preview**: shows current window contents and when it will flush.
* **Quiet hours**: configure per tenant; show overrides.
* **DLQ**: browse dead-letters; requeue after fix.
---
## 15) Failure modes & responses
| Condition | Behavior |
| ----------------------------------- | ------------------------------------------------------------------------------------- |
| Slack 429 / Teams 429               | Respect `Retry-After`, backoff with jitter, reduce concurrency                         |
| SMTP transient 4xx                  | Retry up to `maxAttempts`; escalate to DLQ on exhaust                                  |
| Invalid channel secret              | Mark channel unhealthy; suppress sends; surface in UI                                  |
| Rule explosion (matches everything) | Safety valve: per-tenant RPM caps; auto-pause rule after X drops; UI alert             |
| Bus outage                          | Buffer to local queue (bounded); resume consuming when healthy                         |
| Mongo slowness                      | Fall back to Redis throttles; batch write deliveries; shed low-priority notifications  |
---
## 16) Testing matrix
* **Unit**: matchers, throttle math, digest coalescing, idempotency keys, template rendering edge cases.
* **Connectors**: provider-level rate limits, payload size truncation, error mapping.
* **Integration**: synthetic event storm (10k/min); verify p95 latency and duplicate rate stay within the SLO targets.
* **Security**: DPoP/mTLS on APIs; secretRef resolution; webhook signing & replay windows.
* **i18n**: localized templates render deterministically.
* **Chaos**: Slack/Teams API flaps; SMTP greylisting; Redis hiccups; ensure graceful degradation.
---
## 17) Sequences (representative)
**A) New criticals after Feedser delta (Slack immediate + Email hourly digest)**
```mermaid
sequenceDiagram
autonumber
participant SCH as Scheduler
participant NO as Notify.Worker
participant SL as Slack
participant SMTP as Email
SCH->>NO: bus event scheduler.rescan.delta { newCritical:1, digest:sha256:... }
NO->>NO: match rules (Slack immediate; Email hourly digest)
NO->>SL: chat.postMessage (concise)
SL-->>NO: 200 OK
NO->>NO: append to digest window (email:soc)
Note over NO: At window close → render digest email
NO->>SMTP: send email (detailed digest)
SMTP-->>NO: 250 OK
```
**B) Admission deny (Teams card + Webhook)**
```mermaid
sequenceDiagram
autonumber
participant ZA as Zastava
participant NO as Notify.Worker
participant TE as Teams
participant WH as Webhook
ZA->>NO: bus event zastava.admission { decision: "deny", reasons: [...] }
NO->>TE: POST adaptive card
TE-->>NO: 200 OK
NO->>WH: POST JSON (signed)
WH-->>NO: 2xx
```
---
## 18) Implementation notes
* **Language**: .NET 10; minimal API; `System.Text.Json` with canonical writer for body hashing; Channels for pipelines.
* **Bus**: Redis Streams (**XGROUP** consumers) or NATS JetStream for at-least-once delivery with ack; per-tenant consumer groups to localize backpressure.
* **Templates**: compile and cache per rule+channel+locale; version with rule `updatedAt` to invalidate.
* **Rules**: store raw YAML + parsed AST; validate with schema + static checks (e.g., nonsensical combos).
* **Secrets**: pluggable secret resolver (Authority Secret proxy, K8s, Vault).
* **Rate limiting**: `System.Threading.RateLimiting` + per-connector adapters.
---
## 19) Roadmap (post-v1)
* **PagerDuty/Opsgenie** connectors; **Jira** ticket creation.
* **User inbox** (in-app notifications) + mobile push via webhook relay.
* **Anomaly suppression**: auto-pause noisy rules with hints (learned thresholds).
* **Graph rules**: “only notify if *not_affected → affected* transition at consensus layer”.
* **Label enrichment**: pluggable taggers (business criticality, data classification) to refine matchers.

@@ -0,0 +1,61 @@
# Implementation plan — Notify
## Delivery phases
- **Phase 1 – Core rules engine & delivery ledger**
Implement rules/channels schema, event ingestion, rule evaluation, idempotent deliveries, and audit logging.
- **Phase 2 – Connectors & rendering**
Ship Slack/Teams/Email/Webhook connectors, template rendering, localization, throttling, retries, and secret referencing.
- **Phase 3 – Console & CLI authoring**
Provide UI/CLI for rule authoring, previews, channel health, delivery browsing, digests, and test sends.
- **Phase 4 – Governance & observability**
Add approvals, RBAC, tenant quotas, Notify metrics/logs/traces, dashboards, Notify-specific alerts, and Notify runbooks.
- **Phase 5 – Offline & compliance**
Produce Offline Kit bundles (rules/channels/deploy scripts), signed exports, retention policies, and auditing for regulated environments.
## Work breakdown
- **Service & worker**
- REST API for rules/channels/delivery history, idempotency middleware, digest scheduler.
- Worker pipelines for event intake, rule matching, template rendering, delivery execution, retries, and throttling.
- Delivery ledger capturing payload metadata, response, retry state, DSSE signatures.
- **Connectors**
- Slack/Teams/Email/Webhook plug-ins with configuration validation, rate limiting, error classification.
- Secrets referenced via Authority/Secret store; no plaintext storage.
- **Console & CLI**
- Console module for rules builder, condition editor, preview, test send, delivery insights, digests and schedule configuration.
- CLI (`stella notify rule|channel|delivery`) for automation, export/import.
- **Integrations**
- Event sources: Concelier, Excititor, Policy Engine, Vuln Explorer, Export Center, Attestor, Zastava, Scheduler.
- Notify events to Notify (meta) for failure escalations, accepted-risk expiration reminders.
- **Observability & ops**
- Metrics: delivery success/failure, retry counts, throttle hits, digest generation, channel health.
- Logs/traces with tenant, rule ID, channel, correlation ID; dashboards and alerts.
- Runbooks for misconfigured channels, throttling, event backlog, incident digest.
- **Docs & compliance**
- Update Notifications Studio guides, channel runbooks, security/RBAC docs, Offline Kit instructions.
- Provide compliance checklist (audit logging, retention, opt-out).
## Acceptance criteria
- Rules evaluate deterministically per event; deliveries idempotent with audit trail and DSSE signatures.
- Channel connectors support retries, rate limits, health checks, previews; secrets referenced securely.
- Console/CLI support rule creation, testing, digests, delivery browsing, and export/import workflows.
- Observability dashboards track delivery health; alerts fire for sustained failures or backlog; runbooks cover remediation.
- Offline Kit bundle contains configs, rules, digests, and deployment scripts for air-gapped installs.
- Notify respects tenancy and RBAC; governance (approvals, change log) enforced for high-impact rules.
## Risks & mitigations
- **Notification storms:** throttling, digests, dedupe windows, preview/test gating.
- **Secret compromise:** secret references only, rotation workflows, audit logging.
- **Connector API changes:** versioned adapter layer, nightly health checks, fallback channels.
- **Noise vs signal:** simulation previews, metrics, rule scoring, recommended defaults.
- **Offline parity:** export/import of rules, connectors, and digests with signed manifests.
## Test strategy
- **Unit:** rule evaluation, template rendering, connector clients, throttling, digests.
- **Integration:** end-to-end events from core services, multi-channel routing, retries, audit logging.
- **Performance:** burst throttling, digest creation, large rule sets.
- **Security:** RBAC tests, tenant isolation, secret reference validation, DSSE signature verification.
- **Offline:** export/import round-trips, Offline Kit deployment, manual delivery replay.
## Definition of done
- Notify service, workers, connectors, Console/CLI, observability, and Offline Kit assets shipped with documentation and runbooks.
- Compliance checklist appended to docs; ./TASKS.md and ../../TASKS.md updated with progress.

@@ -0,0 +1,32 @@
{
"schemaVersion": "notify.channel@1",
"channelId": "channel-slack-sec-ops",
"tenantId": "tenant-01",
"name": "slack:sec-ops",
"type": "slack",
"displayName": "SecOps Slack",
"description": "Primary incident response channel.",
"config": {
"secretRef": "ref://notify/channels/slack/sec-ops",
"target": "#sec-ops",
"properties": {
"workspace": "stellaops-sec"
},
"limits": {
"concurrency": 2,
"requestsPerMinute": 60,
"timeout": "PT10S"
}
},
"enabled": true,
"labels": {
"team": "secops"
},
"metadata": {
"createdByTask": "NOTIFY-MODELS-15-102"
},
"createdBy": "ops:amir",
"createdAt": "2025-10-18T17:02:11+00:00",
"updatedBy": "ops:amir",
"updatedAt": "2025-10-18T17:45:00+00:00"
}

@@ -0,0 +1,46 @@
{
  "items": [
    {
      "deliveryId": "delivery-7f3b6c51",
      "tenantId": "tenant-acme",
      "ruleId": "rule-critical-slack",
      "actionId": "slack-secops",
      "eventId": "4f6e9c09-01b4-4c2a-8a57-3d06de182d74",
      "kind": "scanner.report.ready",
      "status": "Sent",
      "statusReason": null,
      "rendered": {
        "channelType": "Slack",
        "format": "Slack",
        "target": "#sec-alerts",
        "title": "Critical findings detected",
        "body": "{\"text\":\"Critical findings detected\",\"blocks\":[{\"type\":\"section\",\"text\":{\"type\":\"mrkdwn\",\"text\":\"*Critical findings detected*\\n1 new critical finding across 2 images.\"}},{\"type\":\"context\",\"elements\":[{\"type\":\"mrkdwn\",\"text\":\"Preview generated 2025-10-19T16:23:41.889Z · Trace `trace-58c212`\"}]}]}",
        "summary": "1 new critical finding across 2 images.",
        "textBody": "1 new critical finding across 2 images.\nTrace: trace-58c212",
        "locale": "en-us",
        "bodyHash": "febf9b2a630d862b07f4390edfbf31f5e8b836529f5232c491f4b3f6dba4a4b2",
        "attachments": []
      },
      "attempts": [
        {
          "timestamp": "2025-10-19T16:23:42.112Z",
          "status": "Succeeded",
          "statusCode": 200,
          "reason": null
        }
      ],
      "metadata": {
        "channelType": "slack",
        "target": "#sec-alerts",
        "previewProvider": "fallback",
        "traceId": "trace-58c212",
        "slack.channel": "#sec-alerts"
      },
      "createdAt": "2025-10-19T16:23:41.889Z",
      "sentAt": "2025-10-19T16:23:42.101Z",
      "completedAt": "2025-10-19T16:23:42.112Z"
    }
  ],
  "count": 1,
  "continuationToken": "2025-10-19T16:23:41.889Z|tenant-acme:delivery-7f3b6c51"
}
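
The `continuationToken` in the sample above appears to combine the last item's `createdAt`, `tenantId`, and `deliveryId`. The sketch below splits a token of that shape for debugging. The layout is inferred from this one sample and is an assumption; clients would normally treat the token as opaque. `parse_continuation_token` is a hypothetical helper.

```python
from datetime import datetime, timezone


def parse_continuation_token(token: str):
    """Split '<createdAt>|<tenantId>:<deliveryId>' (assumed layout) into its parts."""
    stamp, _, key = token.partition("|")
    tenant_id, _, delivery_id = key.partition(":")
    if not (stamp and tenant_id and delivery_id):
        raise ValueError(f"unexpected token shape: {token!r}")
    # The sample uses millisecond precision with a literal 'Z' suffix.
    created_at = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )
    return created_at, tenant_id, delivery_id


created_at, tenant_id, delivery_id = parse_continuation_token(
    "2025-10-19T16:23:41.889Z|tenant-acme:delivery-7f3b6c51"
)
```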

@@ -0,0 +1,34 @@
{
  "eventId": "8a8d6a2f-9315-49fe-9d52-8fec79ec7aeb",
  "kind": "scanner.report.ready",
  "version": "1",
  "tenant": "tenant-01",
  "ts": "2025-10-19T03:58:42+00:00",
  "actor": "scanner-webservice",
  "scope": {
    "namespace": "prod-payment",
    "repo": "ghcr.io/acme/api",
    "digest": "sha256:79c1f9e5...",
    "labels": {
      "environment": "production"
    },
    "attributes": {}
  },
  "payload": {
    "delta": {
      "kev": [
        "CVE-2025-40123"
      ],
      "newCritical": 1,
      "newHigh": 2
    },
    "links": {
      "rekor": "https://rekor.stella.local/api/v1/log/entries/1",
      "ui": "https://ui.stella.local/reports/sha256-79c1f9e5"
    },
    "verdict": "fail"
  },
  "attributes": {
    "correlationId": "scan-23a6"
  }
}

@@ -0,0 +1,63 @@
{
  "schemaVersion": "notify.rule@1",
  "ruleId": "rule-secops-critical",
  "tenantId": "tenant-01",
  "name": "Critical digests to SecOps",
  "description": "Escalate KEV-tagged findings to on-call feeds.",
  "enabled": true,
  "match": {
    "eventKinds": [
      "scanner.report.ready",
      "scheduler.rescan.delta"
    ],
    "namespaces": [
      "prod-*"
    ],
    "repositories": [],
    "digests": [],
    "labels": [],
    "componentPurls": [],
    "minSeverity": "high",
    "verdicts": [],
    "kevOnly": true,
    "vex": {
      "includeAcceptedJustifications": false,
      "includeRejectedJustifications": false,
      "includeUnknownJustifications": false,
      "justificationKinds": [
        "component-remediated",
        "not-affected"
      ]
    }
  },
  "actions": [
    {
      "actionId": "email-digest",
      "channel": "email:soc",
      "digest": "hourly",
      "template": "digest",
      "enabled": true,
      "metadata": {
        "locale": "en-us"
      }
    },
    {
      "actionId": "slack-oncall",
      "channel": "slack:sec-ops",
      "template": "concise",
      "throttle": "PT5M",
      "metadata": {},
      "enabled": true
    }
  ],
  "labels": {
    "team": "secops"
  },
  "metadata": {
    "source": "sprint-15"
  },
  "createdBy": "ops:zoya",
  "createdAt": "2025-10-19T04:12:27+00:00",
  "updatedBy": "ops:zoya",
  "updatedAt": "2025-10-19T04:45:03+00:00"
}
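
To illustrate how a `match` block like the one above might be evaluated against an event envelope, here is a minimal sketch. The semantics are assumptions, not confirmed Notify behaviour: empty arrays are treated as "no constraint", `namespaces` entries as shell-style globs, and `kevOnly` as "the event's `payload.delta.kev` list is non-empty". The `matches` function is hypothetical, and severity, verdict, and VEX filters are left out.

```python
from fnmatch import fnmatchcase


def matches(rule_match: dict, event: dict) -> bool:
    """Evaluate a subset of rule filters against one event (assumed semantics)."""
    kinds = rule_match.get("eventKinds") or []
    if kinds and event["kind"] not in kinds:
        return False

    namespaces = rule_match.get("namespaces") or []
    namespace = event.get("scope", {}).get("namespace", "")
    if namespaces and not any(fnmatchcase(namespace, pattern) for pattern in namespaces):
        return False

    if rule_match.get("kevOnly"):
        kev = event.get("payload", {}).get("delta", {}).get("kev") or []
        if not kev:
            return False

    return True


rule_match = {
    "eventKinds": ["scanner.report.ready", "scheduler.rescan.delta"],
    "namespaces": ["prod-*"],
    "kevOnly": True,
}
event = {
    "kind": "scanner.report.ready",
    "scope": {"namespace": "prod-payment"},
    "payload": {"delta": {"kev": ["CVE-2025-40123"], "newCritical": 1}},
}
is_match = matches(rule_match, event)
```

Under these assumptions, `prod-*` selects `prod-payment`, and an event from a `staging-*` namespace or one without KEV entries would be filtered out.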

@@ -0,0 +1,19 @@
{
  "schemaVersion": "notify.template@1",
  "templateId": "tmpl-slack-concise",
  "tenantId": "tenant-01",
  "channelType": "slack",
  "key": "concise",
  "locale": "en-us",
  "body": "{{severity_icon payload.delta.newCritical}} {{summary}}",
  "description": "Slack concise message for high severity findings.",
  "renderMode": "markdown",
  "format": "slack",
  "metadata": {
    "version": "2025-10-19"
  },
  "createdBy": "ops:zoya",
  "createdAt": "2025-10-19T05:00:00+00:00",
  "updatedBy": "ops:zoya",
  "updatedAt": "2025-10-19T05:45:00+00:00"
}
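
The `body` above uses a Handlebars-like syntax with a helper call (`severity_icon`) and a plain variable (`summary`). The sketch below shows how such a string could be expanded. It is not the Notify template engine: the tiny `render` function, the `HELPERS` table, and the output of `severity_icon` are all assumptions made for illustration.

```python
import re

# Matches {{name}} and {{helper dotted.path}}; nothing more elaborate.
_TOKEN = re.compile(r"\{\{\s*(?:(\w+)\s+)?([\w.]+)\s*\}\}")


def lookup(context: dict, path: str):
    """Resolve a dotted path such as payload.delta.newCritical."""
    value = context
    for part in path.split("."):
        value = value[part]
    return value


def severity_icon(count) -> str:
    """Hypothetical helper: the real helper's output is not documented here."""
    return ":rotating_light:" if count else ":white_check_mark:"


HELPERS = {"severity_icon": severity_icon}


def render(body: str, context: dict) -> str:
    def substitute(match):
        helper, path = match.group(1), match.group(2)
        value = lookup(context, path)
        return str(HELPERS[helper](value)) if helper else str(value)

    return _TOKEN.sub(substitute, body)


context = {
    "payload": {"delta": {"newCritical": 1}},
    "summary": "1 new critical finding across 2 images.",
}
message = render("{{severity_icon payload.delta.newCritical}} {{summary}}", context)
```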

@@ -0,0 +1,73 @@
{
  "$id": "https://stella-ops.org/schemas/notify/notify-channel@1.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Notify Channel",
  "type": "object",
  "required": [
    "schemaVersion",
    "channelId",
    "tenantId",
    "name",
    "type",
    "config",
    "enabled",
    "createdAt",
    "updatedAt"
  ],
  "properties": {
    "schemaVersion": {"type": "string", "const": "notify.channel@1"},
    "channelId": {"type": "string"},
    "tenantId": {"type": "string"},
    "name": {"type": "string"},
    "type": {
      "type": "string",
      "enum": ["slack", "teams", "email", "webhook", "custom"]
    },
    "displayName": {"type": "string"},
    "description": {"type": "string"},
    "config": {"$ref": "#/$defs/channelConfig"},
    "enabled": {"type": "boolean"},
    "labels": {"$ref": "#/$defs/stringMap"},
    "metadata": {"$ref": "#/$defs/stringMap"},
    "createdBy": {"type": "string"},
    "createdAt": {"type": "string", "format": "date-time"},
    "updatedBy": {"type": "string"},
    "updatedAt": {"type": "string", "format": "date-time"}
  },
  "additionalProperties": false,
  "$defs": {
    "channelConfig": {
      "type": "object",
      "required": ["secretRef"],
      "properties": {
        "secretRef": {"type": "string"},
        "target": {"type": "string"},
        "endpoint": {"type": "string", "format": "uri"},
        "properties": {"$ref": "#/$defs/stringMap"},
        "limits": {"$ref": "#/$defs/channelLimits"}
      },
      "additionalProperties": false
    },
    "channelLimits": {
      "type": "object",
      "properties": {
        "concurrency": {"type": "integer", "minimum": 1},
        "requestsPerMinute": {"type": "integer", "minimum": 1},
        "timeout": {
          "type": "string",
          "pattern": "^P(T.*)?$",
          "description": "ISO 8601 duration"
        },
        "maxBatchSize": {"type": "integer", "minimum": 1}
      },
      "additionalProperties": false
    },
    "stringMap": {
      "type": "object",
      "patternProperties": {
        ".*": {"type": "string"}
      },
      "additionalProperties": false
    }
  }
}
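
The `timeout` pattern in `channelLimits` is worth probing, because it is not the full ISO 8601 duration grammar. The sketch below runs the same regular expression over a few strings using Python's `re` module, which behaves the same as a JSON Schema validator for a pattern this simple. The sample strings are chosen for illustration.

```python
import re

# Same expression as the schema's "pattern" keyword for limits.timeout.
TIMEOUT_PATTERN = re.compile(r"^P(T.*)?$")

samples = ["PT10S", "PT5M", "P", "PTnonsense", "P1D", "P1DT2H", "10s"]

# JSON Schema patterns are unanchored searches, hence search() rather than fullmatch().
accepted = {value: bool(TIMEOUT_PATTERN.search(value)) for value in samples}
```

The pattern accepts time-only durations such as `PT10S`, but also the bare string `P` and anything that follows `PT`. It rejects durations with a date part, such as `P1D` or `P1DT2H`, so consumers still need to parse the value properly.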

@@ -0,0 +1,56 @@
{
  "$id": "https://stella-ops.org/schemas/notify/notify-event@1.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Notify Event Envelope",
  "type": "object",
  "required": ["eventId", "kind", "tenant", "ts", "payload"],
  "properties": {
    "eventId": {"type": "string", "format": "uuid"},
    "kind": {
      "type": "string",
      "description": "Event kind identifier (e.g. scanner.report.ready).",
      "enum": [
        "scanner.report.ready",
        "scanner.scan.completed",
        "scheduler.rescan.delta",
        "attestor.logged",
        "zastava.admission",
        "feedser.export.completed",
        "vexer.export.completed"
      ]
    },
    "version": {"type": "string"},
    "tenant": {"type": "string"},
    "ts": {"type": "string", "format": "date-time"},
    "actor": {"type": "string"},
    "scope": {
      "type": "object",
      "properties": {
        "namespace": {"type": "string"},
        "repo": {"type": "string"},
        "digest": {"type": "string"},
        "component": {"type": "string"},
        "image": {"type": "string"},
        "labels": {"$ref": "#/$defs/stringMap"},
        "attributes": {"$ref": "#/$defs/stringMap"}
      },
      "additionalProperties": false
    },
    "payload": {
      "type": "object",
      "description": "Event specific body; see individual schemas for shapes.",
      "additionalProperties": true
    },
    "attributes": {"$ref": "#/$defs/stringMap"}
  },
  "additionalProperties": false,
  "$defs": {
    "stringMap": {
      "type": "object",
      "patternProperties": {
        ".*": {"type": "string"}
      },
      "additionalProperties": false
    }
  }
}
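
For quick checks without a JSON Schema library, the core constraints of the envelope above can be approximated by hand. This is a minimal sketch covering only the required fields, the `kind` enum, and the two string formats. `envelope_errors` is a hypothetical helper and does not replace full schema validation; for example, it does not enforce `additionalProperties`.

```python
import uuid
from datetime import datetime

REQUIRED = ("eventId", "kind", "tenant", "ts", "payload")
KINDS = {
    "scanner.report.ready",
    "scanner.scan.completed",
    "scheduler.rescan.delta",
    "attestor.logged",
    "zastava.admission",
    "feedser.export.completed",
    "vexer.export.completed",
}


def envelope_errors(event: dict) -> list:
    """Return human-readable problems; an empty list means the checks passed."""
    errors = [f"missing: {name}" for name in REQUIRED if name not in event]

    if "eventId" in event:
        try:
            # uuid.UUID is slightly more lenient than the schema's uuid format.
            uuid.UUID(str(event["eventId"]))
        except ValueError:
            errors.append("eventId is not a UUID")

    if "kind" in event and event["kind"] not in KINDS:
        errors.append(f"unknown kind: {event['kind']}")

    if "ts" in event:
        try:
            # Older Python versions do not accept a trailing 'Z', so normalise it.
            datetime.fromisoformat(str(event["ts"]).replace("Z", "+00:00"))
        except ValueError:
            errors.append("ts is not an ISO-8601 timestamp")

    if "payload" in event and not isinstance(event["payload"], dict):
        errors.append("payload is not an object")

    return errors


sample = {
    "eventId": "8a8d6a2f-9315-49fe-9d52-8fec79ec7aeb",
    "kind": "scanner.report.ready",
    "tenant": "tenant-01",
    "ts": "2025-10-19T03:58:42+00:00",
    "payload": {"verdict": "fail"},
}
sample_errors = envelope_errors(sample)
```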

@@ -0,0 +1,96 @@
{
  "$id": "https://stella-ops.org/schemas/notify/notify-rule@1.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Notify Rule",
  "type": "object",
  "required": [
    "schemaVersion",
    "ruleId",
    "tenantId",
    "name",
    "enabled",
    "match",
    "actions",
    "createdAt",
    "updatedAt"
  ],
  "properties": {
    "schemaVersion": {"type": "string", "const": "notify.rule@1"},
    "ruleId": {"type": "string"},
    "tenantId": {"type": "string"},
    "name": {"type": "string"},
    "description": {"type": "string"},
    "enabled": {"type": "boolean"},
    "match": {"$ref": "#/$defs/ruleMatch"},
    "actions": {
      "type": "array",
      "minItems": 1,
      "items": {"$ref": "#/$defs/ruleAction"}
    },
    "labels": {"$ref": "#/$defs/stringMap"},
    "metadata": {"$ref": "#/$defs/stringMap"},
    "createdBy": {"type": "string"},
    "createdAt": {"type": "string", "format": "date-time"},
    "updatedBy": {"type": "string"},
    "updatedAt": {"type": "string", "format": "date-time"}
  },
  "additionalProperties": false,
  "$defs": {
    "ruleMatch": {
      "type": "object",
      "properties": {
        "eventKinds": {"$ref": "#/$defs/stringArray"},
        "namespaces": {"$ref": "#/$defs/stringArray"},
        "repositories": {"$ref": "#/$defs/stringArray"},
        "digests": {"$ref": "#/$defs/stringArray"},
        "labels": {"$ref": "#/$defs/stringArray"},
        "componentPurls": {"$ref": "#/$defs/stringArray"},
        "minSeverity": {"type": "string"},
        "verdicts": {"$ref": "#/$defs/stringArray"},
        "kevOnly": {"type": "boolean"},
        "vex": {"$ref": "#/$defs/ruleMatchVex"}
      },
      "additionalProperties": false
    },
    "ruleMatchVex": {
      "type": "object",
      "properties": {
        "includeAcceptedJustifications": {"type": "boolean"},
        "includeRejectedJustifications": {"type": "boolean"},
        "includeUnknownJustifications": {"type": "boolean"},
        "justificationKinds": {"$ref": "#/$defs/stringArray"}
      },
      "additionalProperties": false
    },
    "ruleAction": {
      "type": "object",
      "required": ["actionId", "channel", "enabled"],
      "properties": {
        "actionId": {"type": "string"},
        "channel": {"type": "string"},
        "template": {"type": "string"},
        "digest": {"type": "string"},
        "throttle": {
          "type": "string",
          "pattern": "^P(T.*)?$",
          "description": "ISO 8601 duration"
        },
        "locale": {"type": "string"},
        "enabled": {"type": "boolean"},
        "metadata": {"$ref": "#/$defs/stringMap"}
      },
      "additionalProperties": false
    },
    "stringArray": {
      "type": "array",
      "items": {"type": "string"}
    },
    "stringMap": {
      "type": "object",
      "patternProperties": {
        ".*": {"type": "string"}
      },
      "additionalProperties": false
    }
  }
}
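
The `throttle` property on `ruleAction` carries a duration, and one plausible reading is "send at most one notification per key within that window". The sketch below implements that reading with an in-memory map. The `Throttle` class and the choice of `(ruleId, actionId)` as the key are assumptions for illustration; the real service may key and store throttle state differently.

```python
from datetime import datetime, timedelta, timezone


class Throttle:
    """Allow one send per key per window; later sends inside the window are dropped."""

    def __init__(self):
        self._last_sent = {}

    def allow(self, key, now: datetime, window: timedelta) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < window:
            return False
        self._last_sent[key] = now
        return True


throttle = Throttle()
key = ("rule-secops-critical", "slack-oncall")  # assumed throttle key
window = timedelta(minutes=5)                   # corresponds to "PT5M"
start = datetime(2025, 10, 19, 4, 0, tzinfo=timezone.utc)

decisions = [
    throttle.allow(key, start, window),                         # first send
    throttle.allow(key, start + timedelta(minutes=2), window),  # inside the window
    throttle.allow(key, start + timedelta(minutes=5), window),  # window has elapsed
]
```

Passing `now` explicitly rather than reading the clock inside `allow` keeps the behaviour deterministic under test.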

@@ -0,0 +1,55 @@
{
  "$id": "https://stella-ops.org/schemas/notify/notify-template@1.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Notify Template",
  "type": "object",
  "required": [
    "schemaVersion",
    "templateId",
    "tenantId",
    "channelType",
    "key",
    "locale",
    "body",
    "renderMode",
    "format",
    "createdAt",
    "updatedAt"
  ],
  "properties": {
    "schemaVersion": {"type": "string", "const": "notify.template@1"},
    "templateId": {"type": "string"},
    "tenantId": {"type": "string"},
    "channelType": {
      "type": "string",
      "enum": ["slack", "teams", "email", "webhook", "custom"]
    },
    "key": {"type": "string"},
    "locale": {"type": "string"},
    "body": {"type": "string"},
    "description": {"type": "string"},
    "renderMode": {
      "type": "string",
      "enum": ["markdown", "html", "adaptiveCard", "plainText", "json"]
    },
    "format": {
      "type": "string",
      "enum": ["slack", "teams", "email", "webhook", "json"]
    },
    "metadata": {"$ref": "#/$defs/stringMap"},
    "createdBy": {"type": "string"},
    "createdAt": {"type": "string", "format": "date-time"},
    "updatedBy": {"type": "string"},
    "updatedAt": {"type": "string", "format": "date-time"}
  },
  "additionalProperties": false,
  "$defs": {
    "stringMap": {
      "type": "object",
      "patternProperties": {
        ".*": {"type": "string"}
      },
      "additionalProperties": false
    }
  }
}