Fine. Let’s build the thing that tells everyone everything, but not all at once, and not twelve times in a row at 3 a.m.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

# Epic 11: Notifications Studio

**Short name:** `Notifications Studio`
**Primary service:** `notifier`
**Surfaces:** Console (Web UI), CLI, Web API
**Touches:** Orchestrator, Policy Engine, Findings Ledger, Conseiller (Feedser), Excitator (Vexer), VEX Consensus Lens, Export Center, Authority (authN/Z), Telemetry, Object Storage/KMS

**AOC ground rule:** Conseiller and Excitator aggregate but never merge. Notifications must link to original advisories/VEX/SBOM evidence and show policy-evaluated context without rewriting the underlying records.

---

## 1) What it is

Notifications Studio is a policy-aware, multi‑channel event routing and messaging system for StellaOps. It ingests signals from scans, policy evaluations, VEX updates, SBOM changes, orchestrated jobs, and user actions. It then decides whether to notify, whom to notify, over which channel, with what template, and at what cadence. Features include per-tenant routing rules, correlation and deduplication, throttling, quiet hours, on-call escalations, digest generation, simulation (“what would have fired”), and full provenance so compliance doesn’t faint.

Channels: email, chat (Slack-compatible webhook), Teams-compatible webhook, webhooks (generic), PagerDuty-compatible events, OpsGenie-compatible events, CLI inbox, and in‑app toast/feed. Optional SNMP trap if you insist on living dangerously.

---

## 2) Why (brief)

Because “turning everything into red pings” is not incident response, it’s chaos cosplay. Teams need signal-to-noise control, explainable routing, and reproducible outcomes tied to policy and evidence. Also, humans like sleep.

---

## 3) How it should work (maximum detail)

### 3.1 Capabilities

* **Event ingestion:** normalized events from scans, policy decisions, VEX/consensus changes, SBOM diffs, source/job lifecycle, export completion, auth/security anomalies.
* **Routing rules:** target by tenant, team, artifact scope (repo/image/package), severity/risk bands, labels/tags, environment, app, and policy decision.
* **Correlation & dedup:** collapse flapping or duplicate events into incidents; configurable keys and windows.
* **Throttling:** per-rule, per-recipient, and global throughput limits; backoff strategies.
* **Quiet hours & maintenance windows:** rule-based suppression by calendar/timezone; exceptions for “break glass.”
* **Digests:** periodic summaries (hourly/daily/weekly) with top-N items and deltas.
* **Escalations:** on-call schedules, multi-step escalation, ack/resolve feedback.
* **Templating:** versioned templates with variables, localization, channel-specific formatting, and preview/simulate.
* **Simulation:** dry-run against historical event sets to show what would have fired and why.
* **Provenance:** message includes links to exportable manifest, run IDs, advisory/VEX IDs, policy snapshot ID.
* **Privacy & PII:** redaction rules at template and field level; secrets never leave the building.

### 3.2 Event & signal model

* **Event:** atomic occurrence (e.g., `policy.violation.created`, `vex.statement.updated`, `sbom.graph.changed`, `job.failed`).

  * Required fields: `event_id`, `type`, `tenant_id`, `subject` (artifact pointer), `time`, `severity`, `risk`, `labels`, `evidence_refs[]` (URIs to advisories/VEX/SBOM), `run_id`, `policy_version` if applicable.

* **Incident:** correlated set of events described by a correlation key (e.g., `pkg@version+advisory_id+env`). Has state: `open`, `acknowledged`, `resolved`.

* **Notification:** delivery instance tied to a rule, a template version, a channel, and either an event or incident.

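
A minimal Python sketch of the three records above, for illustration only. Field names follow the list above; the concrete types, the `IncidentState` enum name, and the exactly-one-of check on `Notification` are assumptions, not the notifier's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class IncidentState(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


@dataclass(frozen=True)
class Event:
    event_id: str
    type: str                       # e.g. "policy.violation.created"
    tenant_id: str
    subject: str                    # artifact pointer
    time: datetime
    severity: str
    risk: str
    labels: tuple[str, ...]
    evidence_refs: tuple[str, ...]  # URIs to advisories/VEX/SBOM
    run_id: str
    policy_version: Optional[str] = None  # only set when applicable


@dataclass
class Incident:
    key: str                        # correlation key
    tenant_id: str
    state: IncidentState = IncidentState.OPEN
    event_ids: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class Notification:
    rule_id: str
    template_version: str
    channel: str
    event_id: Optional[str] = None
    incident_id: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: "either an event or incident" means exactly one of the two.
        if (self.event_id is None) == (self.incident_id is None):
            raise ValueError("reference exactly one of event_id or incident_id")
```
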
### 3.3 Routing model

1. **Match** events against **Subscriptions** (who cares) and **Rules** (what/when/how).
2. **Evaluate** rule conditions: event fields, policy decisions, time windows, labels, recipient attributes.
3. **Correlate** into incidents when configured.
4. **Decide** delivery policy: immediate, digest, escalate, or suppress.
5. **Dispatch** to channel adapters with rendered templates and metadata.

### 3.4 Correlation & dedup

* Correlation key expressions using templating over event fields.
* Windows: sliding window per key, e.g., 10 minutes.
* State machine:

  * New event ⇒ open incident if none exists.
  * Flapping rules merge repeats into the same incident; updates bump counters and timestamps.
  * Resolution signals or explicit acks transition state.

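
The state machine above can be sketched as follows. This is an illustration under assumptions: Python's `str.format` stands in for the real key templating, events are plain dicts, incidents live in memory, and `Correlator` is a hypothetical name.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Incident:
    key: str
    state: str            # "open" | "acknowledged" | "resolved"
    first_seen: datetime
    last_seen: datetime
    count: int = 1


class Correlator:
    def __init__(self, key_template: str, window: timedelta) -> None:
        self.key_template = key_template
        self.window = window
        self.incidents: dict[str, Incident] = {}

    def key_for(self, event: dict) -> str:
        # Stand-in for the real templating: fill the key from event fields.
        return self.key_template.format(**event)

    def ingest(self, event: dict) -> Incident:
        key = self.key_for(event)
        seen_at = event["time"]
        current = self.incidents.get(key)
        # Sliding window: every repeat inside the window extends it.
        if (
            current is not None
            and current.state != "resolved"
            and seen_at - current.last_seen <= self.window
        ):
            current.count += 1
            # max() tolerates late (out-of-order) arrivals.
            current.last_seen = max(current.last_seen, seen_at)
            return current
        # No usable incident for this key: open a new one.
        incident = Incident(key=key, state="open", first_seen=seen_at, last_seen=seen_at)
        self.incidents[key] = incident
        return incident

    def transition(self, key: str, state: str) -> None:
        """Apply an explicit ack or a resolution signal."""
        if state not in ("acknowledged", "resolved"):
            raise ValueError(f"unsupported transition target: {state}")
        self.incidents[key].state = state
```

With a 10-minute window, two events for the same key five minutes apart land in one incident with `count == 2`; an event arriving more than 10 minutes after `last_seen`, or after the incident was resolved, opens a fresh one.
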
### 3.5 Throttling & rate limiting

* Per-rule token buckets, per-recipient quotas, and global circuit breaker.
* When throttled, one **summary notification** replaces N dropped ones with counts and first/last timestamps.

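
A rough sketch of a per-rule token bucket that feeds a throttle summary, assuming timestamps are plain seconds passed in by the caller. `TokenBucket`, `ThrottleSummary`, and `admit` are hypothetical names; per-recipient quotas and the global circuit breaker are left out.

```python
from typing import Optional


class TokenBucket:
    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity          # start full
        self.updated_at = now

    def try_take(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ThrottleSummary:
    """Accumulates what a single summary notification would report."""

    def __init__(self) -> None:
        self.dropped = 0
        self.first_at: Optional[float] = None
        self.last_at: Optional[float] = None

    def record(self, now: float) -> None:
        self.dropped += 1
        if self.first_at is None:
            self.first_at = now
        self.last_at = now


def admit(bucket: TokenBucket, summary: ThrottleSummary, now: float) -> bool:
    """Return True to send now; otherwise count the drop for the summary."""
    if bucket.try_take(now):
        return True
    summary.record(now)
    return False
```

The caller would flush `summary` as one notification once the bucket admits traffic again, then start a new summary.
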
### 3.6 Quiet hours & maintenance

* Declarative schedules: cron expressions and calendar ranges with tenant timezones.
* Maintenance windows can suppress everything except allowlisted types (e.g., auth anomalies).

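
A minimal sketch of the suppression check, assuming a daily start/end window rather than full cron expressions. `QuietWindow`, `should_suppress`, and the `auth.anomaly` type name are hypothetical. The example uses a fixed UTC+2 offset for brevity; a real tenant timezone would more likely be a `zoneinfo.ZoneInfo` instance so that DST is handled.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone, tzinfo


@dataclass(frozen=True)
class QuietWindow:
    start: time                      # local wall-clock start
    end: time                        # local wall-clock end (exclusive)
    tz: tzinfo                       # tenant timezone
    allowlist: frozenset[str] = frozenset()  # "break glass" event types

    def contains(self, instant: datetime) -> bool:
        # `instant` must be timezone-aware; convert to tenant local time.
        local = instant.astimezone(self.tz).time()
        if self.start <= self.end:
            return self.start <= local < self.end
        # Window wraps past midnight, e.g. 22:00-06:00.
        return local >= self.start or local < self.end


def should_suppress(window: QuietWindow, event_type: str, instant: datetime) -> bool:
    if event_type in window.allowlist:
        return False
    return window.contains(instant)


# Example: 22:00-06:00 quiet hours for a tenant at UTC+2.
night = QuietWindow(
    start=time(22, 0),
    end=time(6, 0),
    tz=timezone(timedelta(hours=2)),
    allowlist=frozenset({"auth.anomaly"}),
)
```
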
### 3.7 Digests

* Build-time queries over the Findings Ledger: top risks, new advisories affecting production, unresolved policy violations since last digest.
* Deduplicate across subjects, link to evidence and Policy Studio simulations.

### 3.8 Escalations & acks

* Channel-specific ack flows:

  * Chat/Webhook: action buttons/links.
  * PagerDuty-like: incident key mapping, acknowledge/resolve bridging.
  * Email: signed one-click ack link landing on Console.
* Escalation steps: after T minutes unacked, notify next group; optionally change channel.

### 3.9 Templating

* JSON templates for machine channels; Markdown/HTML for human channels.
* Variables: all event fields, incident aggregates, policy rationales, and helper formatters (risk badge, subject link).
* Localization keys with default English; fallback to default when missing.

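
The localization fallback can be illustrated with a small sketch. It assumes in-memory bundles keyed by language and uses `string.Template` as a stand-in for the real template engine; the bundle keys and the `render` helper are made up for the example.

```python
from string import Template

DEFAULT_LANG = "en"

# Hypothetical localization bundles: language -> key -> template text.
BUNDLES = {
    "en": {
        "policy.denied.title": "Policy denied $subject",
        "policy.denied.body": "Risk: $risk",
    },
    "de": {
        "policy.denied.title": "Richtlinie verweigert $subject",
        # "policy.denied.body" is intentionally missing to show the fallback.
    },
}


def render(key: str, lang: str, variables: dict, bundles: dict = BUNDLES) -> str:
    # Prefer the requested language; fall back to the default when the
    # language or the key is missing from its bundle.
    text = bundles.get(lang, {}).get(key)
    if text is None:
        text = bundles[DEFAULT_LANG][key]
    # safe_substitute leaves unknown placeholders in place instead of raising.
    return Template(text).safe_substitute(variables)
```

Here `render("policy.denied.body", "de", {"risk": "high"})` falls back to the English text, since the German bundle has no body entry.
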
### 3.10 Policy integration

* Rules can depend on policy decisions, e.g., notify only when `decision=deny` or `risk >= high` as computed by Policy Engine.
* Notification actions are policy resources; RBAC enforced by Authority.

### 3.11 Security & tenancy

* All routing and content rendering occur inside tenant-scoped boundaries.
* External webhooks use per-target credentials and IP allowlists; secrets stored via KMS and never returned.
* Links include signed, short-lived tokens with least privilege.

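
One way to picture a signed, short-lived, least-privilege link token is the sketch below. It assumes an HMAC-SHA256 signature over a small JSON claim set and a caller-supplied secret (in practice fetched via KMS). The token layout, claim names, and function names are illustrative, not the notifier's actual format; the 900-second default mirrors the 15-minute ack expiry listed under the security tasks.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(secret: bytes, body: str) -> str:
    return _b64(hmac.new(secret, body.encode("utf-8"), hashlib.sha256).digest())


def sign_link_token(secret: bytes, tenant_id: str, incident_id: str,
                    action: str, now: int, ttl_seconds: int = 900) -> str:
    # Least privilege: the token names one tenant, one incident, one action.
    claims = {"tenant": tenant_id, "incident": incident_id,
              "action": action, "exp": now + ttl_seconds}
    body = _b64(json.dumps(claims, sort_keys=True, separators=(",", ":")).encode("utf-8"))
    return f"{body}.{_signature(secret, body)}"


def verify_link_token(secret: bytes, token: str, now: int) -> Optional[dict]:
    """Return the claims when the token is authentic and unexpired, else None."""
    parts = token.split(".")
    if len(parts) != 2:
        return None
    body, sig = parts
    expected = _signature(secret, body)
    # Constant-time comparison; check the signature before decoding anything.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return None
    claims = json.loads(_unb64(body))
    if now >= claims["exp"]:
        return None
    return claims
```
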
### 3.12 Observability

* Metrics: `notifier_events_total{type}`, `notifier_notifications_sent_total{channel}`, `notifier_throttled_total`, `notifier_dropped_total`, `notifier_latency_ms{phase}`, `notifier_incidents_open`.
* Traces: spans for match, correlate, render, dispatch with `event_id`, `incident_id`, `rule_id`.
* Audit: who created/edited rules, when templates changed, who acknowledged.

### 3.13 Performance targets

* 10k events/min sustained per shard with p99 end-to-end notification latency under 5 seconds for immediate routes.
* Digest rendering for 100k artifact set under 60 seconds.

### 3.14 Edge cases

* **Storms:** circuit breaker kicks in; emit single high-level summary + export link.
* **Out-of-order events:** correlate by event time and tolerate late arrivals within window.
* **Channel failure:** retry with exponential backoff; dead-letter queue with operator alerts.
* **Content size:** truncate with “+N more” and attach Export Center manifest link.

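
The channel-failure case can be sketched as below, assuming a synchronous `send` callable that raises on failure and a plain list as the dead-letter queue. Function names are hypothetical. Jitter is omitted, and a real adapter would catch transport-specific errors rather than every `Exception`.

```python
import time
from typing import Any, Callable


def backoff_delays(base: float, factor: float, max_delay: float, retries: int) -> list[float]:
    """Delay before each retry: base * factor**i, capped at max_delay."""
    return [min(max_delay, base * factor ** i) for i in range(retries)]


def dispatch_with_retry(
    send: Callable[[Any], None],
    message: Any,
    delays: list[float],
    dead_letters: list,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try once, then once more per delay; dead-letter the message if all fail."""
    last_error = None
    for attempt in range(len(delays) + 1):
        try:
            send(message)
            return True
        except Exception as exc:  # sketch only: narrow this in a real adapter
            last_error = exc
            if attempt < len(delays):
                sleep(delays[attempt])
    # Out of retries: park the message so an operator can be alerted.
    dead_letters.append((message, repr(last_error)))
    return False
```
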

---

## 4) Architecture

### 4.1 Services

* **notifier (API + workers):** rule eval, correlation, dispatch orchestration.
* **channel-adapters:** email, chat, webhook, pager, cli, in-app.
* **template-service:** manages versions, previews, localization bundles.
* **preference-service:** subscriptions, quiet hours, on-call schedules.
* **simulation-engine:** dry-run against historical events.
* **ack-bridge:** receives ack/resolve callbacks and updates incidents.

### 4.2 Storage model (selected tables)

* `notif_rules`: id, tenant_id, name, enabled, match_json, actions_json, throttle_json, correlation_expr, quiet_json, created_by, updated_at.
* `notif_subscriptions`: id, tenant_id, scope_json (teams, recipients, labels), channels[], severity_filter, policy_filter.
* `notif_templates`: id, name, version, channel, lang, content, variables_json, created_by.
* `notif_events`: event_id, tenant_id, type, time, subject, severity, risk, payload_json, run_id, policy_version.
* `notif_incidents`: id, tenant_id, key, state, first_seen, last_seen, counts_json, assignee, ack_token_hash.
* `notif_messages`: id, incident_id or event_id, rule_id, channel, state, attempt_count, error, sent_at, response_meta.
* `notif_digests`: id, window, query_json, generated_at, storage_uri, signature_uri.
* `notif_channels`: id, type, config_secret_ref, tenant_id, health.
* `notif_schedules`: tenant_id, schedule_json (quiet/maintenance), tz.

### 4.3 Message flow

1. Orchestrator emits event to `events.bus`.
2. notifier consumes, normalizes, stores to `notif_events`.
3. Rules evaluation returns actions: notify/digest/suppress/escalate.
4. Correlator looks up or creates incident; throttler checks budgets.
5. Renderer pulls template version, renders content with variables and provenance links.
6. Dispatcher sends via channel adapter; updates `notif_messages`.
7. Ack-bridge updates `notif_incidents` and may cancel pending escalations.

---

## 5) APIs

```
POST /notifications/rules
GET /notifications/rules?enabled=&tenant_id=
PATCH /notifications/rules/{rule_id}
DELETE /notifications/rules/{rule_id}

POST /notifications/templates
GET /notifications/templates?channel=&name=
GET /notifications/templates/{id}/versions
POST /notifications/templates/{id}/preview

POST /notifications/simulate # { events_query, at_time, rule_ids? }
GET /notifications/incidents?state=&key=&from=&to=
GET /notifications/incidents/{id}
POST /notifications/incidents/{id}/ack
POST /notifications/incidents/{id}/resolve

POST /notifications/digests/run
GET /notifications/messages?incident_id=&state=
GET /notifications/metrics/overview
WS /notifications/streams/live
```

### Rule example

```json
{
  "name": "High-risk policy denies in prod",
  "enabled": true,
  "match": {
    "type": ["policy.violation.created"],
    "risk_at_least": "high",
    "labels.any": ["env:prod"]
  },
  "correlation": {
    "key": "{{ subject.repo }}:{{ payload.advisory_id }}:{{ payload.version }}",
    "window": "10m"
  },
  "actions": [
    { "notify": { "channels": ["chat:secops", "pager:oncall"], "recipients": ["team:secops"] }},
    { "escalate_after": "15m", "to": ["pager:security_manager"] }
  ],
  "throttle": { "per_recipient_per_5m": 5 },
  "quiet": { "respect": ["tenant.default"] }
}
```

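
To make the `match` block concrete, here is a small sketch of how it might be evaluated against an event. The semantics are assumptions read off the example: `type` is a membership test, `risk_at_least` compares on an assumed `low < medium < high < critical` ordering, and `labels.any` requires at least one shared label. The spec does not define the risk scale, and the engineering tasks call for CEL-like predicates, which this does not attempt.

```python
# Assumed risk ordering; the document does not define the actual bands.
RISK_ORDER = ["low", "medium", "high", "critical"]


def matches(match: dict, event: dict) -> bool:
    """Return True when the event satisfies every condition present in `match`."""
    if "type" in match and event["type"] not in match["type"]:
        return False
    if "risk_at_least" in match:
        floor = RISK_ORDER.index(match["risk_at_least"])
        if RISK_ORDER.index(event["risk"]) < floor:
            return False
    if "labels.any" in match:
        # At least one label must be shared between rule and event.
        if not set(match["labels.any"]) & set(event.get("labels", [])):
            return False
    return True


# The `match` block from the rule example above.
RULE_MATCH = {
    "type": ["policy.violation.created"],
    "risk_at_least": "high",
    "labels.any": ["env:prod"],
}
```
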

---

## 6) CLI

```
stella notify rules list
stella notify rule create --file rule.json
stella notify rule test --file rule.json --events 'from=2025-01-01 to=2025-01-07'
stella notify templates list --channel chat
stella notify template preview --id tmpl_policy_denied@2 --event-file e.json
stella notify incidents list --state open
stella notify incidents ack <incident-id>
stella notify digests run --profile weekly-risk
stella notify channels add --type webhook --name chat:secops --config config.yaml
```

Exit codes: `0` ok, `2` bad args, `4` not found, `5` denied.

---

## 7) Console (Web UI)

* **Studio Home:** live feed, filters, storm indicator, quick mute.
* **Rules:** list, create, test with sample events and “explain why matched.”
* **Templates:** versioned editor with variables, preview per channel, localization bundles.
* **Recipients & Channels:** teams, emails, webhooks, on-call schedules; health checks.
* **Incidents:** kanban table with open/ack/resolved, incident drill-in, action history, ack/resolve buttons.
* **Digests:** profiles, last runs, preview of next run, subscribe/unsubscribe.
* **Simulation:** pick time window and rules; get a diff report.
* **Settings:** quiet hours, maintenance calendars, default throttles.

---

## 8) Implementation plan

### New modules

* `src/StellaOps.Notifier/`

  * `api/` REST + WS
  * `engine/` rule matching, correlation, throttling
  * `render/` templating, localization, redaction
  * `dispatch/` adapters, retries, health
  * `simulation/` dry-run planner
  * `ack/` inbound ack handlers
  * `state/` repos and migrations
  * `metrics/`, `audit/`, `security/`

* `src/StellaOps.UI/`

  * Pages: Studio, Rules, Templates, Incidents, Digests, Settings

* `src/StellaOps.Cli/`

### Updates in existing services

* Orchestrator: standardized event envelope, idempotency keys, retry semantics.
* Policy Engine: enrich violation events with decision rationale id.
* VEX Lens: emit events on consensus changes; include delta size.
* Findings Ledger: efficient queries for digest builders.
* Authority: RBAC scopes for notifications management and read.

### Packaging

* Containers: `stella/notifier:<ver>` worker and API.
* Helm: horizontal shards, channel credentials as secrets, KMS for token signing, rate limits.

### Rollout phases

1. Immediate notifications for policy violations and job failures, with email/chat/webhook.
2. Correlation, throttling, quiet hours, digests.
3. Escalations, simulation, localization, PagerDuty/OpsGenie bridges.

---

## 9) Documentation changes

Create/update:

1. `/docs/notifications/overview.md`
   Problem statement, capabilities, AOC ground rule, channel catalog.

2. `/docs/notifications/architecture.md`
   Services, data model, event flow diagrams, scaling.

3. `/docs/notifications/rules.md`
   Rule schema, examples, correlation keys, throttling, quiet hours.

4. `/docs/notifications/templates.md`
   Templating reference, variables, redaction, localization, previews.

5. `/docs/notifications/channels.md`
   Channel configuration, secrets, webhooks, retries, health checks.

6. `/docs/notifications/digests.md`
   Profiles, queries, schedules, export links.

7. `/docs/notifications/escalations.md`
   On-call schedules, acks, bridging to external incident tools.

8. `/docs/notifications/api.md`
   Endpoints, request/response samples, error model.

9. `/docs/operations/notifier-runbook.md`
   Storm handling, dead-letter recovery, throttle tuning.

10. `/docs/security/notifications-hardening.md`
    RBAC, tenant isolation, signed links, channel allowlists.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 10) Engineering tasks

### Backend

* [ ] DB migrations for all `notif_*` tables, with tenant and time indexes.
* [ ] Event envelope and ingestion consumer with exactly-once semantics via outbox/idempotency keys.
* [ ] Rule engine with CEL-like predicates and safe evaluation sandbox.
* [ ] Correlation engine with pluggable key expressions and windows.
* [ ] Throttler with token buckets and per-recipient quotas.
* [ ] Quiet hours/maintenance evaluator with timezone awareness.
* [ ] Template renderer: Markdown/HTML/JSON, localization, redaction.
* [ ] Channel adapters: email, chat-webhook, generic webhook; health checks and retries.
* [ ] Digest generator with query planner and pagination.
* [ ] Escalation and ack-bridge; signed ack links.
* [ ] Simulation engine against historical events.
* [ ] Metrics, tracing, audit logs.

### Console

* [ ] Rule editor with live “explain” and test harness.
* [ ] Template editor with versioning, preview per channel, variable inspector.
* [ ] Incidents dashboard with ack/resolve actions.
* [ ] Channel config UI with secret refs and health check button.
* [ ] Digest profile creator and preview runner.
* [ ] Live feed with WS stream and storm banner.

### CLI

* [ ] Commands listed in §6; include file-based previews and simulation outputs.
* [ ] Login flow supports ack token redemption.

### Integrations

* [ ] Orchestrator event publishers instrumented across scan, export, VEX, policy, job lifecycle.
* [ ] Policy Studio ties to template variables for rationale rendering.
* [ ] Export Center links embedded in messages for “view data safely elsewhere.”

### Security & RBAC

* [ ] Roles: `Notify.Viewer`, `Notify.Operator`, `Notify.Admin`.
* [ ] Signed URLs for ack with 15-minute expiry; rotating keys in KMS.
* [ ] Webhook IP allowlists and HMAC verification; replay protection.

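
As a rough illustration of the HMAC verification and replay protection item for inbound callbacks, the sketch below signs `timestamp.nonce.body` with HMAC-SHA256, rejects stale timestamps, and remembers nonces for the tolerance period. The message layout, the 300-second tolerance, and the `WebhookVerifier` name are assumptions; IP allowlisting is not shown.

```python
import hashlib
import hmac


def sign_webhook(secret: bytes, timestamp: int, nonce: str, body: bytes) -> str:
    # Sign the timestamp and nonce together with the body so neither can be swapped.
    message = f"{timestamp}.{nonce}.".encode("utf-8") + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class WebhookVerifier:
    def __init__(self, secret: bytes, tolerance_seconds: int = 300) -> None:
        self.secret = secret
        self.tolerance = tolerance_seconds
        self.seen: dict[str, int] = {}   # nonce -> signed timestamp

    def verify(self, timestamp: int, nonce: str, body: bytes,
               signature: str, now: int) -> bool:
        # 1. Freshness: reject anything outside the tolerance window.
        if abs(now - timestamp) > self.tolerance:
            return False
        # 2. Authenticity: constant-time signature comparison.
        expected = sign_webhook(self.secret, timestamp, nonce, body)
        if not hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8")):
            return False
        # 3. Replay: drop expired nonces, then refuse any nonce already seen.
        #    An evicted nonce cannot be replayed, because its signed timestamp
        #    would fail the freshness check above.
        self.seen = {n: ts for n, ts in self.seen.items()
                     if abs(now - ts) <= self.tolerance}
        if nonce in self.seen:
            return False
        self.seen[nonce] = timestamp
        return True
```
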
### Docs

* [ ] Author and cross-link all docs in §9; add examples and diagrams.
* [ ] Append imposed rule line to every page.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 11) Acceptance criteria

* Operators can create a rule, test it against historical events, and see exactly which notifications would be sent and why.
* Storm control reduces a 10k/min event burst to a capped set of human notifications plus one summary.
* Correlation merges duplicate policy violations into incidents with accurate counts and timelines.
* Quiet hours suppress non-critical routes and allow “break glass” types.
* Digests generate correct top-N and deltas, with links back to evidence and policy snapshot.
* Acks flow through and stop escalations within 2 seconds p95.
* RBAC enforces tenant scoping; no cross-tenant leakage in messages or acks.
* All notifications contain provenance links and never include secrets or PII.

---

## 12) Risks & mitigations

* **Notification storms still overwhelm humans.**
  Mitigation: default throttles, storm breaker, digest fallbacks, operator-visible storm banner.

* **Template injection or data leak.**
  Mitigation: sandboxed templating with allowlisted helpers, redaction defaults, content length limits, HTML sanitization.

* **Channel outages.**
  Mitigation: retries with exponential backoff, dead-letter queues, multi-channel fallbacks, health probes.

* **Timezone chaos in quiet hours.**
  Mitigation: per-recipient and per-tenant TZ with canonical UTC storage; preview UI to visualize schedules.

* **Correlation misses due to key drift.**
  Mitigation: versioned correlation expressions with simulation and backtests; safe migration plan.

---

## 13) Test plan

* **Unit:** rule predicate evaluation, correlation windows, throttler math, quiet schedule math, template rendering with redaction.
* **Integration:** end-to-end from event to message across each channel; retries and acks; digests with large datasets.
* **Load:** 10k events/min stress with mixed types; p99 latency tracked; storm breaker behavior.
* **Security:** ack token tamper tests, HMAC webhook verification, RBAC fuzzing, tenant isolation.
* **Chaos:** kill dispatcher mid-batch; ensure idempotent resend without duplicates; intentionally delay events to test late-arrival handling.
* **UX:** rule “explain” correctness, template preview parity with real sends, live feed accuracy.

---

## 14) Philosophy

* **Signal, not theater.** If a notification doesn’t drive an action, it’s spam with extra steps.
* **Explain everything.** Every route and message must be auditable and reproducible.
* **Humans first.** Respect quiet hours, escalate sanely, digest by default when noise climbs.

> Final reminder: **Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.**