docs consolidation work

StellaOps Bot
2025-12-25 10:53:53 +02:00
parent b9f71fc7e9
commit deb82b4f03
117 changed files with 852 additions and 847 deletions


@@ -17,7 +17,7 @@ Notify evaluates operator-defined rules against platform events and dispatches c
4. Coordinate cross-module changes in the main /AGENTS.md description and through the sprint plan.
## Guardrails
- Honour the Aggregation-Only Contract where applicable (see ../../ingestion/aggregation-only-contract.md).
- Honour the Aggregation-Only Contract where applicable (see ../../aoc/aggregation-only-contract.md).
- Preserve determinism: sort outputs, normalise timestamps (UTC ISO-8601), and avoid machine-specific artefacts.
- Keep Offline Kit parity in mind—document air-gapped workflows for any new feature.
- Update runbooks/observability assets when operational characteristics change.


@@ -28,25 +28,25 @@ Notify (Notifications Studio) converts platform events into tenant-scoped alerts
Status for these items is tracked in `src/Notifier/StellaOps.Notifier/TASKS.md` and sprint plans; update this README once tasks merge.
## Key docs & release alignment
- [`docs/notifications/overview.md`](../../notifications/overview.md) — summary of capabilities, imposed rules, and customer journey.
- [`docs/notifications/architecture.md`](../../notifications/architecture.md) — Notifications Studio runtime view (published 2025-10-29).
- [`docs/notifications/rules.md`](../../notifications/rules.md) — declarative matcher syntax and evaluation order.
- [`docs/notifications/digests.md`](../../notifications/digests.md) — digest windows, coalescing logic, and delivery samples.
- [`docs/notifications/templates.md`](../../notifications/templates.md) — template helpers, localisation, and redaction guidelines.
- [`docs/updates/2025-10-29-notify-docs.md`](../../updates/2025-10-29-notify-docs.md) — latest release note; follow-ups remain to validate connector metadata, quiet-hours semantics, and simulation payloads once Sprint 39 drops land.
- [`overview.md`](overview.md) — summary of capabilities, imposed rules, and customer journey.
- [`architecture.md`](architecture.md) / [`architecture-detail.md`](architecture-detail.md) — Notifications Studio runtime view.
- [`rules.md`](rules.md) — declarative matcher syntax and evaluation order.
- [`digests.md`](digests.md) — digest windows, coalescing logic, and delivery samples.
- [`templates.md`](templates.md) — template helpers, localisation, and redaction guidelines.
- [`docs/implplan/archived/updates/2025-10-29-notify-docs.md`](../../implplan/archived/updates/2025-10-29-notify-docs.md) — latest release note; follow-ups remain to validate connector metadata, quiet-hours semantics, and simulation payloads once Sprint 39 drops land.
## Integrations & dependencies
- **Storage:** PostgreSQL (schema `notify`) for rules, channels, deliveries, digests, and throttles; Valkey for worker coordination.
- **Queues:** Valkey Streams or NATS JetStream for ingestion, throttling, and DLQs (`notify.dlq`).
- **Authority:** OpTok-protected APIs, DPoP-backed CLI/UI scopes (`notify.viewer`, `notify.operator`, `notify.admin`), and secret references for channel credentials.
- **Observability:** Prometheus metrics (`notify.sent_total`, `notify.failed_total`, `notify.digest_coalesced_total`, etc.), OTEL traces, and dashboards documented in `docs/notifications/architecture.md#12-observability-prometheus--otel`.
- **Observability:** Prometheus metrics (`notify.sent_total`, `notify.failed_total`, `notify.digest_coalesced_total`, etc.), OTEL traces, and dashboards documented in `architecture-detail.md`.
## Operational notes
- Schema fixtures live in `./resources/schemas`; event and delivery samples live in `./resources/samples` for contract tests and UI mocks.
- Offline Kit bundles ship plug-ins, default templates, and seed rules; update manifests under `ops/offline-kit/` when connectors change.
- Dashboards and alert references depend on `DEVOPS-NOTIFY-39-002`; coordinate before renaming metrics or labels.
- Observability assets: `operations/observability.md` and `operations/dashboards/notify-observability.json` (offline import).
- When releasing new rule or connector features, mirror guidance into `docs/notifications/*.md` and checklists in `docs/updates/2025-10-29-notify-docs.md` until the follow-ups are closed.
- When releasing new rule or connector features, update guidance in this directory and related checklists until the follow-ups are closed.
## Epic alignment
- **Epic 11 Notifications Studio:** notifications workspace, preview tooling, immutable delivery ledger, throttling/digest controls, and forthcoming correlation/simulation features.


@@ -0,0 +1,42 @@
# Notifications API
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)
All endpoints require an `Authorization: Bearer <token>` header and an `X-Stella-Tenant` header. Responses use the common error envelope (`docs/api/overview.md`). Paths are rooted at `/api/v1/notify`.
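As a minimal sketch of the header requirements above, the following Python builds (but does not send) a request for `GET /channels`. The host `stella.example.internal`, the token value, and the helper name `build_notify_request` are hypothetical; only the `/api/v1/notify` root, the two headers, and the `type` filter come from this document.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical host; only the /api/v1/notify root is taken from the text above.
BASE_URL = "https://stella.example.internal/api/v1/notify"


def build_notify_request(path, token, tenant, query=None, method="GET"):
    """Build a request carrying the two headers every Notify endpoint requires."""
    url = BASE_URL + path
    if query:
        # Sort parameters so the generated URL is identical across runs.
        url += "?" + urlencode(sorted(query.items()))
    return Request(
        url,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "X-Stella-Tenant": tenant,
        },
    )


req = build_notify_request("/channels", "example-token", "tenant-dev", {"type": "slack"})
```

Passing `req` to `urllib.request.urlopen` would issue the call; that step is left out here so the sketch has no network dependency.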
## Channels
- `POST /channels` — create channel. Body matches `channels.md` schema. Returns `201` + channel.
- `GET /channels` — list channels (deterministic order: type ASC, id ASC). Supports `type` filter.
- `GET /channels/{id}` — fetch single channel.
- `DELETE /channels/{id}` — soft-delete; fails if referenced by active rules unless `force=true` query.
## Rules
- `POST /rules` — create/update rule; idempotency via `Idempotency-Key`.
- `GET /rules` — list rules with paging (`page_token`, `page_size`). Sorted by `name` ASC.
- `POST /rules:preview` — dry-run rule against sample event; returns matched actions and rendered templates.
## Policies & escalations
- `POST /policies/escalations` — create escalation policy (see `escalations.md`).
- `GET /policies/escalations` — list policies.
## Deliveries & digests
- `GET /deliveries` — query delivery ledger; filters: `status`, `channel`, `rule_id`, `from`, `to`. Sorted by `createdUtc` DESC then `id` ASC.
- `GET /deliveries/{id}` — single delivery with rendered payload hash and attempts.
- `POST /digests/preview` — preview digest rendering for a tenant/rule set; returns deterministic body/hash without sending.
## Acknowledgements
- `POST /acks/{token}` — acknowledge an escalation token. Validates DSSE signature, token expiry, and tenant. Returns `200` with cancellation summary.
## Simulations
- `POST /simulations/rules` — simulate a rule set for a supplied event payload; no side effects. Returns matched actions and throttling outcome.
## Health & metadata
- `GET /health` — liveness/readiness probes.
- `GET /metadata` — returns supported channel types, max payload sizes, and server version.
## Determinism notes
- All list endpoints return results in a stable order and include `next_page_token` when more results are available.
- Templates render with fixed locale `en-US` unless `Accept-Language` is provided; rendering is pure (no network calls).
- `bodyHash` uses SHA-256 over canonical JSON; repeated sends with identical inputs produce identical hashes and are de-duplicated.
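The `bodyHash` rule can be illustrated with a short sketch. The exact canonicalisation used by Notify is not specified in this file, so the encoding below (sorted keys, compact separators, UTF-8) is an assumption, and `body_hash` is a hypothetical helper name.

```python
import hashlib
import json


def body_hash(payload):
    """Lowercase hex SHA-256 over an assumed canonical JSON encoding of `payload`."""
    canonical = json.dumps(
        payload,
        sort_keys=True,         # stable key order
        separators=(",", ":"),  # no insignificant whitespace
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


a = body_hash({"title": "New critical finding", "severity": "critical"})
b = body_hash({"severity": "critical", "title": "New critical finding"})
assert a == b  # key order in the input does not change the hash
```

Because identical inputs always produce the same digest, the hash can serve as the de-duplication key for repeated sends.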


@@ -0,0 +1,120 @@
# Notifications Architecture
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
This dossier distils the Notify architecture into implementation-ready guidance for service owners, SREs, and integrators. It complements the high-level overview by detailing process boundaries, persistence models, and extensibility points.
---
## 1. Runtime shape
```
┌──────────────────┐
│ Authority (OpTok)│
└───────┬──────────┘
┌───────▼──────────┐ ┌───────────────┐
│ Notify.WebService│◀──────▶│ PostgreSQL │
Tenant API│ REST + gRPC WIP │ │ rules/channels│
└───────▲──────────┘ │ deliveries │
│ │ digests │
Internal bus │ └───────────────┘
(NATS/Valkey/etc)│
┌─────────▼─────────┐ ┌───────────────┐
│ Notify.Worker │◀────▶│ Valkey / Cache│
│ rule eval + render│ │ throttles/locks│
└─────────▲─────────┘ └───────▲───────┘
│ │
│ │
┌──────┴──────┐ ┌─────────┴────────┐
│ Connectors │──────▶│ Slack/Teams/... │
│ (plug-ins) │ │ External targets │
└─────────────┘ └──────────────────┘
```
- **2025-11-02 decision — module boundaries.** Keep `src/Notify/` as the shared notification toolkit (engine, storage, queue, connectors) that multiple hosts can consume. `src/Notifier/` remains the Notifications Studio runtime (WebService + Worker) composed from those libraries. Do not collapse the directories until a packaging RFC covers build impacts, offline kit parity, and imposed-rule propagation.
- **WebService** hosts REST endpoints (`/channels`, `/rules`, `/templates`, `/deliveries`, `/digests`, `/stats`) and handles schema normalisation, validation, and Authority enforcement.
- **Worker** subscribes to the platform event bus, evaluates rules per tenant, applies throttles/digests, renders payloads, writes ledger entries, and invokes connectors.
- **Plug-ins** live under `plugins/notify/` and are loaded deterministically at service start (`orderedPlugins` list). Each implements connector contracts and optional health/test-preview providers.
Both services share options via `notify.yaml` (see `etc/notify.yaml.sample`). For dev/test scenarios, an in-memory repository exists but production requires PostgreSQL + Valkey/NATS for durability and coordination.
---
## 2. Event ingestion and rule evaluation
1. **Subscription.** Workers attach to the internal bus (Valkey Streams or NATS JetStream). Each partition key is `tenantId|scope.digest|event.kind` to preserve order for a given artefact.
2. **Normalisation.** Incoming events are hydrated into `NotifyEvent` envelopes. Payload JSON is normalised (sorted object keys) to preserve determinism and enable hashing.
3. **Rule snapshot.** Per-tenant rule sets are cached in memory. PostgreSQL LISTEN/NOTIFY triggers snapshot refreshes without restart.
4. **Match pipeline.**
- Tenant check (`rule.tenantId` vs. event tenant).
- Kind/namespace/repository/digest filters.
- Severity and KEV gating based on event deltas.
- VEX gating using `NotifyRuleMatchVex`.
- Action iteration with throttle/digest decisions.
5. **Idempotency.** Each action computes `hash(ruleId|actionId|event.kind|scope.digest|delta.hash|dayBucket)`; matches within throttle TTL record `status=Throttled` and stop.
6. **Dispatch.** If digest is `instant`, the renderer immediately processes the action. Otherwise the event is appended to the digest window for later flush.
Failures during evaluation are logged with correlation IDs and surfaced through `/stats` and worker metrics (`notify_rule_eval_failures_total`, `notify_digest_flush_errors_total`).
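The idempotency step above can be sketched as follows. The hash function (SHA-256), the UTC calendar-day bucket, the sample event kind, and the helper name `idempotency_key` are assumptions for illustration; the field order and the `idem:` prefix follow this document.

```python
import hashlib
from datetime import datetime, timezone


def idempotency_key(rule_id, action_id, event_kind, scope_digest, delta_hash, occurred_at):
    """Return an `idem:<hash>` key for one rule action and event."""
    # Assumption: the day bucket is the UTC calendar date of the event.
    day_bucket = occurred_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    material = "|".join(
        [rule_id, action_id, event_kind, scope_digest, delta_hash, day_bucket]
    )
    return "idem:" + hashlib.sha256(material.encode("utf-8")).hexdigest()


morning = datetime(2025, 10, 24, 8, 0, tzinfo=timezone.utc)
evening = datetime(2025, 10, 24, 20, 0, tzinfo=timezone.utc)
next_day = datetime(2025, 10, 25, 8, 0, tzinfo=timezone.utc)
args = ("rule-1", "act-1", "example.event", "sha256:abc", "delta-1")

same_day_match = idempotency_key(*args, morning) == idempotency_key(*args, evening)
new_day_match = idempotency_key(*args, morning) == idempotency_key(*args, next_day)
```

Under these assumptions, two matches on the same UTC day collapse to one key, while the same event on the following day produces a fresh key.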
---
## 3. Rendering & connectors
- **Template resolution.** The renderer picks the template in this order: action template → channel default template → locale fallback → built-in minimal template. Locale negotiation reduces `en-US` to `en-us`.
- **Helpers & partials.** Exposed helpers mirror the list in [`templates.md`](templates.md#3-variables-helpers-and-context). Plug-ins may register additional helpers but must remain deterministic and side-effect free.
- **Attestation lifecycle suite.** Sprint 171 introduced dedicated `tmpl-attest-*` templates for verification failures, expiring attestations, key rotations, and transparency anomalies (see [`templates.md` §7](templates.md#7-attestation--signing-lifecycle-templates-notify-attest-74-001)). Rule actions referencing those templates must populate the attestation context fields so channels stay consistent online/offline.
- **Rendering output.** `NotifyDeliveryRendered` captures:
- `channelType`, `format`, `locale`
- `title`, `body`, optional `summary`, `textBody`
- `target` (redacted where necessary)
- `attachments[]` (safe URLs or references)
- `bodyHash` (lowercase SHA-256) for audit parity
- **Connector contract.** Connectors implement `INotifyConnector` (send + health) and can implement `INotifyChannelTestProvider` for `/channels/{id}/test`. All plug-ins are single-tenant aware; secrets are pulled via references at send time and never persisted in the database.
- **Retries.** Workers track attempts with exponential jitter. On permanent failure, deliveries are marked `Failed` with `statusReason`, and optional DLQ fan-out is slated for Sprint 40.
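The template resolution order described in this section is a first-match fallback chain. The sketch below shows only that precedence; `resolve_template`, `normalise_locale`, and the template keys are hypothetical names, and lower-casing is an assumed reading of the `en-US` to `en-us` reduction.

```python
BUILTIN_MINIMAL = "builtin-minimal"  # hypothetical key for the built-in minimal template


def normalise_locale(tag):
    """Assumed reading of locale reduction: trim and lower-case the tag."""
    return tag.strip().lower()


def resolve_template(action_template, channel_default, locale_fallback):
    """Return the first available template in the documented precedence order."""
    for candidate in (action_template, channel_default, locale_fallback):
        if candidate is not None:
            return candidate
    return BUILTIN_MINIMAL


chosen = resolve_template(None, "tmpl-slack-default", "tmpl-en-fallback")
```

Here `chosen` is the channel default, because the action did not name a template of its own.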
---
## 4. Persistence model
| Table | Purpose | Key fields & indexes |
|-------|---------|----------------------|
| `rules` | Tenant rule definitions. | `id`, `tenant_id`, `enabled`; index on `(tenant_id, enabled)`. |
| `channels` | Channel metadata + config references. | `id`, `tenant_id`, `type`; index on `(tenant_id, type)`. |
| `templates` | Locale-specific render bodies. | `id`, `tenant_id`, `channel_type`, `key`; index on `(tenant_id, channel_type, key)`. |
| `deliveries` | Ledger of rendered notifications. | `id`, `tenant_id`, `sent_at`; compound index on `(tenant_id, sent_at DESC)` for history queries. |
| `digests` | Open digest windows per action. | `id` (`tenant_id:action_key:window`), `status`; index on `(tenant_id, action_key)`. |
| `throttles` | Short-lived throttle tokens (PostgreSQL or Valkey). | Key format `idem:<hash>` with TTL aligned to throttle duration. |
Records are stored using the canonical JSON serializer (`NotifyCanonicalJsonSerializer`) to preserve property ordering and casing. Schema migration helpers upgrade stored records when new versions ship.
---
## 5. Deployment & configuration
- **Configuration sources.** YAML files feed typed options (`NotifyPostgresOptions`, `NotifyWorkerOptions`, etc.). Environment variables can override connection strings and rate limits for production.
- **Authority integration.** Two OAuth clients (`notify-web`, `notify-web-dev`) with scopes `notify.viewer`, `notify.operator`, and (for dev/admin flows) `notify.admin` are required. Authority enforcement can be disabled for air-gapped dev use by providing `developmentSigningKey`.
- **Plug-in management.** `plugins.baseDirectory` and `orderedPlugins` guarantee deterministic loading. Offline Kits copy the plug-in tree verbatim; operations must keep the order aligned across environments.
- **Observability.** Workers expose structured logs (`ruleId`, `actionId`, `eventId`, `throttleKey`). Metrics include:
- `notify_rule_matches_total{tenant,eventKind}`
- `notify_delivery_attempts_total{channelType,status}`
- `notify_digest_open_windows{window}`
- Optional OpenTelemetry traces for rule evaluation and connector round-trips.
- **Scaling levers.** Increase worker replicas to cope with bus throughput; adjust `worker.prefetchCount` for Valkey Streams or `ackWait` for NATS JetStream. WebService remains stateless and scales horizontally behind the gateway.
---
## 6. Roadmap alignment
| Backlog | Architectural note |
|---------|--------------------|
| `NOTIFY-SVC-38-001` | Standardise event envelope publication (idempotency keys); ensure bus bindings use the documented key format. |
| `NOTIFY-SVC-38-002..004` | Introduce simulation endpoints and throttle dashboards; expect additional `/internal/notify/simulate` routes and metrics; update once merged. |
| `NOTIFY-SVC-39-001..004` | Correlation engine, digests generator, simulation API, quiet hours; anticipate new PostgreSQL tables (`quiet_hours`, correlation caches) and connector metadata (quiet mode hints). Review this guide when implementations land. |
Action: schedule a documentation sync with the Notifications Service Guild immediately after `NOTIFY-SVC-39-001..004` merge to confirm schema adjustments (e.g., correlation edge storage, quiet hour calendars) and add any new persistence or API details here.
---
> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.


@@ -0,0 +1,51 @@
# Notification Channels
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)
## Supported channel types
- **Slack / Teams**: webhook-based with optional slash-command ack URLs.
- **Email (SMTP/SMTPS)**: relay-only; secrets provided via `secretRef` in Authority.
- **Generic webhook**: signed (HMAC-SHA256) payloads with replay protection and allowlisted hosts.
- **PagerDuty-style escalation webhooks**: same contract as generic webhooks but with escalation metadata.
- **Console in-app**: stored delivery rendered in UI; always enabled for each tenant.
## Channel resource schema (Notify API)
```json
{
"id": "uuid",
"tenant": "string",
"type": "slack|teams|email|webhook|pager|console",
"endpoint": "https://..." ,
"secretRef": "authority://secrets/notify/slack-hook", // optional per type
"labels": { "env": "prod", "team": "sre" },
"throttle": { "windowSeconds": 60, "max": 10 },
"quietHours": { "from": "22:00", "to": "06:00", "timezone": "UTC" },
"enabled": true,
"createdUtc": "2025-11-25T00:00:00Z"
}
```
- **Determinism**: channel ids are UUIDv5 seeded by `(tenant, type, endpoint)` when created via manifests; server generates new IDs for ad-hoc API calls.
- **Validation**: endpoints must be on the allowlist; secretRef must exist in Authority; quiet hours use 24h clock UTC.
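The determinism rule for manifest-created channels can be sketched with Python's standard `uuid` module. The namespace UUID and the way the three fields are joined into a seed are not stated in this document, so both are assumptions here, as is the helper name `manifest_channel_id`.

```python
import uuid

# Hypothetical namespace; the real namespace UUID is not stated in this document.
NOTIFY_CHANNEL_NAMESPACE = uuid.uuid5(
    uuid.NAMESPACE_URL, "https://example.invalid/notify/channels"
)


def manifest_channel_id(tenant, channel_type, endpoint):
    """Derive a repeatable channel id from (tenant, type, endpoint)."""
    # Assumption: the seed joins the three fields with a newline separator.
    seed = "\n".join([tenant, channel_type, endpoint])
    return uuid.uuid5(NOTIFY_CHANNEL_NAMESPACE, seed)


first = manifest_channel_id("tenant-dev", "slack", "https://hooks.example.invalid/a")
second = manifest_channel_id("tenant-dev", "slack", "https://hooks.example.invalid/a")
other = manifest_channel_id("tenant-dev", "slack", "https://hooks.example.invalid/b")
```

Re-importing the same manifest yields the same id (`first == second`), while changing any of the three fields yields a different one.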
## Connector rules
- No secrets are stored in Notify DB; only `secretRef` is persisted.
- Per-tenant allowlists control outbound hostnames/ports; defaults block public internet in air-gapped kits.
- Payload signing:
- Slack/Teams: bearer secret in URL (indirect via secretRef) plus optional HMAC header `X-Stella-Signature` for mirror validation.
- Webhook/Pager: HMAC `X-Stella-Signature` (hex) over body with nonce + timestamp; receivers must enforce 5-minute skew.
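A receiver-side check for the webhook signing rule might look like the sketch below. The layout of the signed message (`<timestamp>.<nonce>.` followed by the raw body) and the in-memory nonce set are assumptions; this document only states that the hex HMAC-SHA256 covers the body with nonce and timestamp and that receivers enforce a five-minute skew.

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 5 * 60  # five-minute skew window from the connector rules


def sign(secret, body, nonce, timestamp):
    """Hex HMAC-SHA256 over timestamp, nonce, and body."""
    # Assumption: the signed message is "<timestamp>.<nonce>." plus the raw body.
    message = f"{timestamp}.{nonce}.".encode("utf-8") + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret, body, nonce, timestamp, signature, seen_nonces, now=None):
    """Reject stale timestamps, replayed nonces, and bad signatures."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False
    if nonce in seen_nonces:
        return False
    expected = sign(secret, body, nonce, timestamp)
    if not hmac.compare_digest(expected, signature):
        return False
    seen_nonces.add(nonce)  # remember the nonce only after a valid signature
    return True
```

A production receiver would keep seen nonces in a store with an expiry at least as long as the skew window, rather than in an unbounded in-process set.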
## Offline posture
- Offline kits ship default channel manifests under `out/offline/notify/channels/*.json` with placeholder endpoints.
- Operators must replace endpoints and secretRefs before deploy; validation rejects placeholder values.
## Observability
- Emit `notify.channel.delivery` counter with tags: `channel_type`, `tenant`, `status` (success/fail/throttled/quiet_hours), `rule_id`.
- Store delivery attempt hashes in the delivery ledger; duplicate payloads are de-duplicated per `(channel, bodyHash)` for 24h.
## Safety checklist
- [ ] Endpoint on allowlist and TLS valid.
- [ ] `secretRef` exists in Authority and scoped to tenant.
- [ ] Quiet hours configured for non-critical alerts; throttles set for bursty rules.
- [ ] HMAC signing verified in downstream system (webhook/pager).


@@ -0,0 +1,92 @@
# Notifications Digests
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Digests coalesce multiple matching events into a single notification when rules request batched delivery. They protect responders from alert storms while preserving a deterministic record of every input.
---
## 1. Digest lifecycle
1. **Window selection.** Rule actions opt into a digest cadence by setting `actions[].digest` (`instant`, `5m`, `15m`, `1h`, `1d`). `instant` skips digest logic entirely.
2. **Aggregation.** When an event matches, the worker appends it to the open digest window (`tenantId + actionId + window`). Events include the canonical scope, delta counts, and references.
3. **Flush.** When the window expires or hits the worker's safety cap (configurable), the worker renders a digest template and emits a single delivery with status `Digested`.
4. **Audit.** The delivery ledger links back to the digest document so operators can inspect individual items and the aggregated summary.
---
## 2. Storage model
Digest state lives in PostgreSQL (`notify.digests` table) and mirrors the schema described in [modules/notify/architecture.md](../modules/notify/architecture.md#7-data-model):
```json
{
"id": "tenant-dev:act-email-compliance:1h",
"tenantId": "tenant-dev",
"actionKey": "act-email-compliance",
"window": "1h",
"openedAt": "2025-10-24T08:00:00Z",
"status": "open",
"items": [
{
"eventId": "00000000-0000-0000-0000-000000000001",
"scope": {
"namespace": "prod-payments",
"repo": "ghcr.io/acme/api",
"digest": "sha256:…"
},
"delta": {
"newCritical": 1,
"kev": 1
}
}
]
}
```
- `status` reflects whether the window is currently collecting (`open`) or has been completed (`closed`). Future revisions may introduce `flushing` for in-progress operations.
- `items[].delta` captures aggregated counts for reporting (e.g., new critical findings, KEV, quieted).
- Workers use optimistic concurrency on the document ID to avoid duplicate flushes across replicas.
---
## 3. Rendering and templates
- Digest deliveries use the same template engine as instant notifications. Templates receive an additional `digest` object with `window`, `openedAt`, `itemCount`, and `items` (findings grouped by namespace/repository when available).
- Provide digest-specific templates (e.g., `tmpl-digest-hourly`) so the body can enumerate top offenders, summarise totals, and link to detailed dashboards.
- When no template is specified, Notify falls back to channel defaults that emphasise summary counts and redirect to Console for detail.
---
## 4. API surface
| Endpoint | Description | Notes |
|----------|-------------|-------|
| `POST /digests` | Issues administrative commands (e.g., force flush, reopen) for a specific action/window. | Request body specifies the command target; requires `notify.admin`. |
| `GET /digests/{actionKey}` | Returns the currently open window (if any) for the referenced action. | Supports operators/CLI inspecting pending digests; requires `notify.viewer`. |
| `DELETE /digests/{actionKey}` | Drops the open window without notifying (emergency stop). | Emits an audit record; use sparingly. |
All routes honour the tenant header and reuse the standard Notify rate limits.
---
## 5. Worker behaviour and safety nets
- **Idempotency.** Flush operations generate a deterministic digest delivery ID (`digest:<tenant>:<actionId>:<window>:<openedAt>`). Retries reuse the same ID.
- **Throttles.** Digest generation respects action throttles; setting an aggressive throttle together with a digest window may result in deliberate skips (logged as `Throttled` in the delivery ledger).
- **Quiet hours.** Future sprint work (`NOTIFY-SVC-39-004`) integrates quiet-hour calendars. When enabled, flush timers pause during quiet windows and resume afterwards.
- **Back-pressure.** When the window reaches the configured item cap before the timer, the worker flushes early and starts a new window immediately.
- **Crash resilience.** Workers rebuild in-flight windows from PostgreSQL on startup; partially flushed windows remain closed after success or are reopened if the flush fails.
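Two of the behaviours above, the deterministic delivery ID and the early flush on reaching the item cap, can be sketched in a few lines. The class name `DigestWindow`, its fields, and the default cap of 100 are hypothetical; only the ID format is taken from this document.

```python
from dataclasses import dataclass, field


@dataclass
class DigestWindow:
    tenant: str
    action_id: str
    window: str
    opened_at: str       # UTC ISO-8601 timestamp
    item_cap: int = 100  # hypothetical default; the real cap is configurable
    items: list = field(default_factory=list)

    def delivery_id(self):
        # Format from the text: digest:<tenant>:<actionId>:<window>:<openedAt>
        return f"digest:{self.tenant}:{self.action_id}:{self.window}:{self.opened_at}"

    def append(self, event):
        """Add an event; return True when the cap is hit and an early flush is due."""
        self.items.append(event)
        return len(self.items) >= self.item_cap


w = DigestWindow("tenant-dev", "act-email-compliance", "1h", "2025-10-24T08:00:00Z", item_cap=2)
first_due = w.append({"eventId": "00000000-0000-0000-0000-000000000001"})
second_due = w.append({"eventId": "00000000-0000-0000-0000-000000000002"})
```

Since the ID depends only on the window's identity and opening time, a retried flush computes the same delivery ID and can be de-duplicated downstream.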
---
## 6. Operator guidance
- Choose hourly digests for high-volume compliance events; daily digests suit executive reporting.
- Pair digests with incident-focused instant rules so critical items surface immediately while less urgent noise is summarised.
- Monitor `/stats` output for `openDigestCount` to ensure windows are flushing; spikes may indicate downstream connector failures.
- When testing new digest templates, open a small (`5m`) window, trigger sample events, then call `POST /digests/{actionId}/flush` to validate rendering before moving to longer cadences.
---
> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.


@@ -0,0 +1,51 @@
# Escalations & Acknowledgements
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)
## Model
- **Escalation policy**: ordered stages of channels with delays; stored per tenant.
- **Acknowledgement**: DSSE-signed token embedded in messages; acknowledger must present token to stop escalation.
- **Suppression**: rules may mark events as non-escalating (informational) while still sending single notifications.
## Policy schema (conceptual)
```json
{
"id": "uuid",
"tenant": "string",
"name": "pager-policy-prod",
"stages": [
{ "delaySeconds": 0, "channels": ["slack-prod", "email-oncall"] },
{ "delaySeconds": 900, "channels": ["pager-primary"] },
{ "delaySeconds": 1800,"channels": ["pager-management"] }
],
"autoCloseMinutes": 120,
"retry": { "maxAttempts": 3, "backoffSeconds": 60 }
}
```
- Stages execute sequentially until an **ack** is recorded.
- Deterministic ordering: channels within a stage are sorted lexicographically before dispatch.
## Ack tokens
- Token payload: `{ tenant, deliveryId, expiresUtc, ruleId, actionHash }`.
- Signed with Authority-issued DSSE key; verified by Notify WebService before accepting `POST /acks/{token}`.
- Expiry defaults to 24h; tokens are single-use and idempotent.
## Escalation flow
1) Rule fires → action references an escalation policy.
2) Stage 0 deliveries sent; ledger records attempts and ack URL.
3) If no ack by `delaySeconds`, next stage dispatches; repeats until ack or final stage.
4) On ack, remaining stages are cancelled; ledger entry marked `acknowledged` with timestamp and subject.
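The flow above can be approximated by a small planning function. It assumes each `delaySeconds` is measured from the start of the escalation and that an ack arriving exactly at a stage boundary cancels that stage; neither detail is stated in this document, and `plan_stages` is a hypothetical name.

```python
def plan_stages(policy, acked_after_seconds=None):
    """Return (delaySeconds, channels) for the stages that would dispatch."""
    plan = []
    # Stages run in declared order; delays are assumed relative to escalation start.
    for stage in policy["stages"]:
        if acked_after_seconds is not None and stage["delaySeconds"] >= acked_after_seconds:
            break  # an ack cancels every stage that has not dispatched yet
        # Channels within a stage are sorted lexicographically before dispatch.
        plan.append((stage["delaySeconds"], sorted(stage["channels"])))
    return plan


policy = {
    "stages": [
        {"delaySeconds": 0, "channels": ["slack-prod", "email-oncall"]},
        {"delaySeconds": 900, "channels": ["pager-primary"]},
        {"delaySeconds": 1800, "channels": ["pager-management"]},
    ]
}
acked = plan_stages(policy, acked_after_seconds=1000)
unacked = plan_stages(policy)
```

With an ack at 1000 seconds, only the first two stages dispatch; without an ack, all three run.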
## Quiet hours & throttles
- Quiet hours suppress *new* escalations; in-flight escalations continue.
- Per-policy throttle prevents repeated escalation runs for identical `actionHash` within a configurable window (default 30m).
## Observability
- Counters: `notify.escalation.started`, `notify.escalation.stage_sent`, `notify.escalation.ack`, `notify.escalation.cancelled` tagged by `tenant`, `policy`, `stage`.
- Logs: structured `escalation.{started|stage_sent|ack|cancelled}` with delivery ids and rationale.
## Runbooks
- Update escalation policy safely: create new policy id, switch rules, then delete old policy to avoid mid-flight ambiguity.
- If a stage storms, set throttle higher or add quiet hours; do not delete the policy mid-flight—use `cancelEscalation` endpoint instead.


@@ -0,0 +1,9 @@
{
"tenant_id": "tenant-123",
"delivery_id": "00000000-0000-4000-8000-000000000001",
"channel": "email",
"subject": "User signup",
"body": "User john@example.com joined",
"redacted_body": "User ***@example.com joined",
"pii_hash": "dd4eefc8dded5d6f46c832e959ba0eef95ee8b77f10ac0aae90f7c89ad42906c"
}


@@ -0,0 +1 @@
{"template_id":"tmpl-incident-start","locale":"en-US","channel":"email","expected_hash":"05eb80e384eaf6edf0c44a655ca9064ca4e88b8ad7cefa1483eda5c9aaface00","body_sample_path":"tmpl-incident-start.email.en-US.json"}


@@ -0,0 +1,6 @@
{
"subject": "Incident started: ${incident_id}",
"body": "Incident ${incident_id} started at ${started_at}. Severity: ${severity}.",
"merge_fields": ["incident_id", "started_at", "severity"],
"preview_hash": "05eb80e384eaf6edf0c44a655ca9064ca4e88b8ad7cefa1483eda5c9aaface00"
}


@@ -0,0 +1,10 @@
{
"trace_id": "00000000000000000000000000000001",
"tenant_id": "tenant-123",
"rule_id": "RULE-INCIDENT",
"channel_id": "email-default",
"attributes": {
"delivery_id": "00000000-0000-4000-8000-000000000001",
"status": "sent"
}
}


@@ -0,0 +1,32 @@
# Notify Gaps NR1–NR10: Remediation Blueprint (source: `docs/product-advisories/31-Nov-2025 FINDINGS.md`)
## Scope
Close NR1–NR10 by defining contracts, evidence, and deterministic test hooks for the Notifier runtime (service + worker + offline kit). This doc is the detailed layer referenced by sprint `SPRINT_0171_0001_0001_notifier_i` and NOTIFY-GAPS-171-014.
## Gap requirements, evidence, and tests
| ID | Requirement | Evidence to publish | Deterministic tests/fixtures |
| --- | --- | --- | --- |
| NR1 | Versioned JSON Schemas for event envelopes, rules, templates, channels, receipts, and webhooks; DSSE-signed catalog with canonical hash recipe (BLAKE3-256 over normalized JSON). | `docs/modules/notify/schemas/notify-schemas-catalog.json` + `.dsse.json`; `docs/modules/notify/schemas/inputs.lock` capturing digests and canonicalization flags. | Golden canonicalization harness under `tests/notifications/Schemas/SchemaCanonicalizationTests.cs` using frozen inputs + hash assertions. |
| NR2 | Tenant scoping + approvals for high-impact rules (escalations, PII, cross-tenant fan-out). Every API and receipt carries `tenant_id`; RBAC/approvals enforced. | RBAC/approval matrix (`docs/modules/notify/security/tenant-approvals.md`) listing actions × roles × required approvals. | API contract tests in `StellaOps.Notifier.Tests/TenantScopeTests.cs` plus integration fixtures with mixed-tenant payloads (should reject). |
| NR3 | Deterministic rendering/localization: stable merge-field ordering, UTC ISO-8601 timestamps, locale whitelist, hashed previews recorded in ledger. | Rendering fixture pack `docs/modules/notify/fixtures/rendering/*.json`; hash ledger samples `docs/modules/notify/fixtures/rendering/index.ndjson` with BLAKE3 digests. | `StellaOps.Notifier.Tests/RenderingDeterminismTests.cs` compares golden bodies/subjects across locales/timezones; seeds fixed RNG/time. |
| NR4 | Quotas/backpressure/DLQ: per-tenant/channel quotas, burst budgets, enqueue gating, DLQ schema with redrive + idempotent keys; metrics/alerts for backlog/DLQ growth. | Quota policy `docs/modules/notify/operations/quotas.md`; DLQ schema `docs/modules/notify/schemas/dlq-notify.schema.json`. | Worker tests `StellaOps.Notifier.Tests/BackpressureAndDlqTests.cs` validating quota enforcement, DLQ insertion, redrive idempotency. |
| NR5 | Retry & idempotency: canonical `delivery_id` (UUIDv7) + dedupe key (event×rule×channel); bounded exponential backoff with jitter; idempotent connectors; ignore out-of-order acks. | Retry matrix `docs/modules/notify/operations/retries.md`; connector idempotency checklist. | `StellaOps.Notifier.Tests/RetryPolicyTests.cs` + connector harness fixtures demonstrating dedupe across duplicate events. |
| NR6 | Webhook/ack security: HMAC or mTLS/DPoP required; signed ack URLs/tokens with nonce, expiry, audience, single-use; per-tenant allowlists for domains/paths. | Security policy `docs/modules/notify/security/webhook-ack-hardening.md`; sample signed-ack token format + validation steps. | Negative-path tests `StellaOps.Notifier.Tests/WebhookSecurityTests.cs` covering wrong HMAC, replayed nonce, expired token, disallowed domain. |
| NR7 | Redaction & PII limits: classify template fields; redact secrets/PII in storage/logs; hash sensitive values; size/field allowlists; previews/logs default to redacted variant. | Redaction catalog `docs/modules/notify/security/redaction-catalog.md`; sample redacted payloads `docs/modules/notify/fixtures/redaction/*.json`. | `StellaOps.Notifier.Tests/RedactionTests.cs` asserting stored/preview payloads match redacted expectations. |
| NR8 | Observability SLO alerts: SLOs for delivery latency/success/backlog/DLQ age; standard metrics names; dashboards/alerts/runbooks; traces include tenant/rule/channel IDs with sampling rules. | Dashboard JSON `docs/modules/notify/operations/dashboards/notify-slo.json`; alert rules `docs/modules/notify/operations/alerts/notify-slo-alerts.yaml`; runbook link. | `StellaOps.Notifier.Tests/ObservabilityContractsTests.cs` verifying metric names/labels; trace exemplar fixture `docs/modules/notify/fixtures/traces/sample-trace.json`. |
| NR9 | Offline notify-kit with DSSE: bundle schemas, rules/templates, connector configs, verify script, hash list, time-anchor hook; deterministic packaging flags; tenant/env scoping; DSSE-signed manifest. | Manifest `offline/notifier/notify-kit.manifest.json`, DSSE `offline/notifier/notify-kit.manifest.dsse.json`, hash list `offline/notifier/artifact-hashes.json`, verify script `offline/notifier/verify_notify_kit.sh`. | Determinism check `tests/offline/NotifyKitDeterminismTests.sh` (shell) verifying hash list, DSSE, scope enforcement, packaging flags. |
| NR10 | Mandatory simulations & evidence before activation: dry-run against frozen fixtures; DSSE-signed simulation results attached to approvals; regression tests per high-impact rule/template change. | Simulation report `docs/modules/notify/simulations/<rule-id>-report.json` + DSSE; approval evidence log `docs/modules/notify/simulations/index.ndjson`. | `StellaOps.Notifier.Tests/SimulationGateTests.cs` enforcing simulation requirement and evidence linkage before `active=true`. |
## Delivery + governance hooks
- Add the above evidence paths to the NOTIFY-GAPS-171-014 task in `docs/implplan/SPRINT_0171_0001_0001_notifier_i.md` and mirror status in `src/Notifier/StellaOps.Notifier/TASKS.md`.
- When artifacts land, append TRX/fixture links in the sprint **Execution Log** and reference this doc under **Decisions & Risks**.
- Offline kit artefacts must follow the same packaging rules (deterministic flags, time-anchor hook, PQ dual-sign toggle) already used by the Mirror/Offline sprints.
- Simulation evidence lives in `docs/modules/notify/simulations/` (index.ndjson + per-rule reports) and is validated by contract tests under `Contracts/PolicyDocsCompletenessTests.cs`.
- Contract tests under `Contracts/` verify schema catalog ↔ DSSE alignment, fixture hashes, simulation index presence, and offline kit manifest/DSSE consistency.
## Next steps
1) Generate initial schema catalog (`notify-schemas-catalog.json`) with rule/template/channel/webhook/receipt definitions and run canonicalization harness.
2) Produce redaction catalog, quotas policy, retry matrix, and security hardening docs referenced above.
3) Add golden fixtures/tests outlined above and wire CI filters to run determinism + security suites for Notify.
4) Build notify-kit manifest + DSSE and publish `verify_notify_kit.sh` aligned with offline bundle policies.


@@ -0,0 +1,27 @@
groups:
  - name: notify-slo
    rules:
      - alert: NotifyDeliverySuccessSLO
        expr: sum(rate(notify_delivery_success_total[5m])) / sum(rate(notify_delivery_total[5m])) < 0.98
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Notify delivery success below SLO"
          description: "Success ratio below 98% over 10m"
      - alert: NotifyBacklogDepthHigh
        expr: notify_backlog_depth > 5000
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Notify backlog too high"
          description: "Backlog depth exceeded 5000 messages"
      - alert: NotifyDlqGrowth
        # notify_dlq_depth is a gauge, so deriv() gives its per-second slope; rate() is only valid for counters.
        expr: deriv(notify_dlq_depth[10m]) > 50
        for: 10m
        labels:
          severity: ticket
        annotations:
          summary: "Notify DLQ growth"
          description: "Dead letter queue growing faster than threshold"


@@ -0,0 +1,9 @@
{
"title": "Notify SLO",
"panels": [
{ "title": "Delivery success", "target": "sum(rate(notify_delivery_success_total[5m])) / sum(rate(notify_delivery_total[5m]))" },
{ "title": "Backlog depth", "target": "notify_backlog_depth" },
{ "title": "DLQ depth", "target": "notify_dlq_depth" },
{ "title": "Latency p95", "target": "histogram_quantile(0.95, rate(notify_delivery_latency_seconds_bucket[5m]))" }
]
}


@@ -0,0 +1,7 @@
# Quotas, backpressure, and DLQ (NR4)
- Per-tenant quotas: 500 deliveries/minute default; channel overrides: webhook 200/min, email 120/min, chat 240/min.
- Burst budget: 2x quota for 60 seconds, then hard clamp.
- Backpressure: reject enqueue when backlog > quota*10 or DLQ growth > 5%/min.
- DLQ schema: `docs/modules/notify/schemas/dlq-notify.schema.json`; redrive requires idempotent `delivery_id`/`dedupe_key`.
- Metrics to alert: backlog depth, DLQ depth, redrive success rate, enqueue reject count.
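
The thresholds above can be sketched as a small admission check. This is an illustration under stated assumptions, not the Notifier implementation: the names `quota_for`, `burst_limit`, and `should_reject_enqueue` are hypothetical, and DLQ growth is assumed to be expressed as a fraction per minute.

```python
from typing import Optional

# Per-minute quotas taken from the policy above; the dict layout is an assumption.
DEFAULT_QUOTA = 500
CHANNEL_QUOTAS = {"webhook": 200, "email": 120, "chat": 240}


def quota_for(channel: Optional[str] = None) -> int:
    """Return the per-minute quota, using the channel override when one exists."""
    return CHANNEL_QUOTAS.get(channel, DEFAULT_QUOTA)


def burst_limit(quota: int, seconds_in_burst: float) -> int:
    """Allow 2x quota during the first 60 seconds of a burst, then clamp to quota."""
    return quota * 2 if seconds_in_burst < 60 else quota


def should_reject_enqueue(backlog: int, quota: int, dlq_growth_per_min: float) -> bool:
    """Reject when backlog > quota*10 or DLQ growth > 5%/min (0.05 as a fraction)."""
    return backlog > quota * 10 or dlq_growth_per_min > 0.05
```

With the default quota, `should_reject_enqueue(5001, 500, 0.0)` rejects while a backlog of exactly 5000 is still admitted, since the rule is a strict "greater than".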


@@ -0,0 +1,7 @@
# Retry and idempotency policy (NR5)
- `delivery_id`: UUIDv7; `dedupe_key`: hash(event_id + rule_id + channel_id).
- Backoff: exponential with jitter; base 2s, factor 2, max 5 attempts, cap 5 minutes between attempts.
- Connectors must be idempotent; retries reuse the same `dedupe_key` and must not duplicate sends.
- Out-of-order acks ignored: only monotonic `attempt` accepted.
- Record retry outcomes in receipts and include attempt count + reason.
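
A minimal sketch of the policy above, to make the numbers concrete. Several details are assumptions rather than part of the policy: SHA-256 as the hash, a `|` separator between the key parts, "full jitter" (a uniform draw between zero and the exponential ceiling), and the function names themselves.

```python
import hashlib
import random

BASE_SECONDS = 2.0
FACTOR = 2.0
MAX_ATTEMPTS = 5
CAP_SECONDS = 300.0  # 5 minutes between attempts at most


def dedupe_key(event_id: str, rule_id: str, channel_id: str) -> str:
    """Hash of event_id + rule_id + channel_id (SHA-256 and '|' separator are assumed)."""
    material = "|".join((event_id, rule_id, channel_id)).encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def backoff_delay(attempt: int, rng: random.Random) -> float:
    """Delay before the retry that follows the given 1-based failed attempt."""
    if not 1 <= attempt < MAX_ATTEMPTS:
        raise ValueError("no retry is scheduled after the final attempt")
    ceiling = min(CAP_SECONDS, BASE_SECONDS * FACTOR ** (attempt - 1))
    return rng.uniform(0.0, ceiling)  # full jitter, an assumed variant


def accept_ack(last_attempt: int, ack_attempt: int) -> bool:
    """Only monotonically increasing attempt numbers are accepted."""
    return ack_attempt > last_attempt
```

Because every retry reuses the same `dedupe_key`, a connector can treat a repeated key as "already sent" instead of delivering twice.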


@@ -0,0 +1,78 @@
# Notifications Overview
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Notifications Studio turns raw platform events into concise, tenant-scoped alerts that reach the right responders without overwhelming them. The service is sovereign/offline-first, follows the Aggregation-Only Contract (AOC), and produces deterministic outputs so the same configuration yields identical deliveries across environments.
---
## 1. Mission & value
- **Reduce noise.** Only materially new or high-impact changes reach chat, email, or webhooks thanks to rule filters, throttles, and digest windows.
- **Explainable results.** Every delivery is traceable back to a rule, action, and event payload stored in the delivery ledger; operators can audit what fired and why.
- **Safe by default.** Secrets remain in external stores, templates are sandboxed, quiet hours and throttles prevent storms, and idempotency guarantees protect downstream systems.
- **Offline-aligned.** All configuration, templates, and plug-ins ship with Offline Kits; no external SaaS is required to send notifications.
---
## 2. Core capabilities
| Capability | What it does | Key docs |
|------------|--------------|----------|
| Rules engine | Declarative matchers for event kinds, severities, namespaces, VEX context, KEV flags, and more. | [rules.md](rules.md) |
| Channel catalog | Slack, Teams, Email, Webhook connectors loaded via restart-time plug-ins; metadata stored without secrets. | [architecture.md](architecture.md) |
| Templates | Locale-aware, deterministic rendering via safe helpers; channel defaults plus tenant-specific overrides, including the attestation lifecycle suite (`tmpl-attest-*`). | [templates.md](templates.md#7-attestation--signing-lifecycle-templates-notify-attest-74-001) |
| Digests | Coalesce bursts into periodic summaries with deterministic IDs and audit trails. | [digests.md](digests.md) |
| Delivery ledger | Tracks rendered payload hashes, attempts, throttles, and outcomes for every action. | [architecture.md](architecture.md#7-data-model) |
| Ack tokens | DSSE-signed acknowledgement tokens with webhook allowlists and escalation guardrails enforced by Authority. | [architecture.md](architecture.md#81-ack-tokens--escalation-workflows) |
---
## 3. How it fits into StellaOps
1. **Producers emit events.** Scanner, Scheduler, VEX Lens, Attestor, and Zastava publish canonical envelopes (`NotifyEvent`) onto the internal bus.
2. **Notify.Worker evaluates rules.** For each tenant, the worker applies match filters, VEX gates, throttles, and digest policies before rendering the action.
3. **Connectors deliver.** Channel plug-ins send the rendered payload to Slack/Teams/Email/Webhook targets and report back attempts and outcomes.
4. **Consumers investigate.** Operators pivot from message links into Console dashboards, SBOM views, or policy overlays with correlation IDs preserved.
The Notify WebService fronts worker state with REST APIs used by the UI and CLI. Tenants authenticate via StellaOps Authority scopes `notify.viewer`, `notify.operator`, and (for escalated actions) `notify.admin`. All operations require the tenant header (`X-StellaOps-Tenant`) to preserve sovereignty boundaries.
---
## 4. Operating model
| Area | Guidance |
|------|----------|
| **Tenancy** | Each rule, channel, template, and delivery belongs to exactly one tenant. Cross-tenant sharing is intentionally unsupported. |
| **Determinism** | Configuration persistence normalises strings and sorts collections. Template rendering produces identical `bodyHash` values when inputs match; attestation events always reference the canonical `tmpl-attest-*` keys documented in the template guide. |
| **Scaling** | Workers scale horizontally; per-tenant rule snapshots are cached and refreshed from PostgreSQL change notifications. Valkey (or Redis-compatible) guards throttles and locks. |
| **Offline** | Offline Kits include plug-ins, default templates, and seed rules. Operators can edit YAML/JSON manifests before air-gapped deployment. |
| **Security** | Channel secrets use indirection (`secretRef`), Authority-protected OAuth clients secure API access, and delivery payloads are redacted before storage where required. |
| **Module boundaries** | 2025-11-02 decision: keep `src/Notify/` as the shared notification toolkit and `src/Notifier/` as the Notifications Studio runtime host until a packaging RFC covers the implications of merging. |
---
## 5. Getting started (first 30 minutes)
| Step | Goal | Reference |
|------|------|-----------|
| 1 | Deploy Notify WebService + Worker with PostgreSQL and Valkey | [`modules/notify/architecture.md`](../modules/notify/architecture.md#1-runtime-shape--projects) |
| 2 | Register OAuth clients/scopes in Authority | [`etc/authority.yaml.sample`](../../etc/authority.yaml.sample) |
| 3 | Install channel plug-ins and capture secret references | [`plugins/notify`](../../plugins) |
| 4 | Create a tenant rule and test preview | [`POST /channels/{id}/test`](../modules/notify/architecture.md#8-external-apis-webservice) |
| 5 | Inspect deliveries and digests | `/api/v1/notify/deliveries`, `/api/v1/notify/digests` |
---
## 6. Alignment with implementation work
| Backlog item | Impact on docs | Status |
|--------------|----------------|--------|
| `NOTIFY-SVC-38-001..004` | Foundational correlation, throttling, simulation hooks. | **In progress.** Align behaviour once services publish beta APIs. |
| `NOTIFY-SVC-39-001..004` | Adds correlation engine, digest generator, simulation API, quiet hours. | **Pending.** Revisit rule/digest sections when these tasks merge. |
Action: coordinate with the Notifications Service Guild when `NOTIFY-SVC-39-001..004` land to validate payload fields, quiet-hours semantics, and any new connector metadata that should be documented here and in the channel-specific guides.
---
> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.


@@ -0,0 +1,259 @@
# Pack Approvals Notification Contract
> **Status:** Implemented (NOTIFY-SVC-37-001)
> **Last Updated:** 2025-11-27
> **OpenAPI Spec:** `src/Notifier/StellaOps.Notifier/StellaOps.Notifier.WebService/openapi/pack-approvals.yaml`
## Overview
This document defines the canonical contract for pack approval notifications between Task Runner and the Notifier service. It covers event payloads, resume token mechanics, error handling, and security requirements.
## Event Kinds
| Kind | Description | Trigger |
|------|-------------|---------|
| `pack.approval.requested` | Approval required for pack deployment | Task Runner initiates deployment requiring approval |
| `pack.approval.updated` | Approval state changed | Decision recorded or timeout |
| `pack.policy.hold` | Policy gate blocked deployment | Policy Engine rejects deployment |
| `pack.policy.released` | Policy hold lifted | Policy conditions satisfied |
## Canonical Event Schema
```json
{
  "eventId": "550e8400-e29b-41d4-a716-446655440000",
  "issuedAt": "2025-11-27T10:30:00Z",
  "kind": "pack.approval.requested",
  "packId": "pkg:oci/stellaops/scanner@v2.1.0",
  "policy": {
    "id": "policy-prod-deploy",
    "version": "1.2.3"
  },
  "decision": "pending",
  "actor": "ci-pipeline@stellaops.example.com",
  "resumeToken": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9...",
  "summary": "Deployment approval required for production scanner update",
  "labels": {
    "environment": "production",
    "team": "security"
  }
}
```
### Required Fields
| Field | Type | Description |
|-------|------|-------------|
| `eventId` | UUID | Unique event identifier; used for deduplication |
| `issuedAt` | ISO 8601 | Event timestamp in UTC |
| `kind` | string | Event type (see Event Kinds table) |
| `packId` | string | Package identifier in PURL format |
| `decision` | string | Current state: `pending`, `approved`, `rejected`, `hold`, `expired` |
| `actor` | string | Identity that triggered the event |
### Optional Fields
| Field | Type | Description |
|-------|------|-------------|
| `policy` | object | Policy metadata (`id`, `version`) |
| `resumeToken` | string | Opaque token for Task Runner resume flow |
| `summary` | string | Human-readable summary for notifications |
| `labels` | object | Custom key-value metadata |
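
The field tables above translate into a straightforward validation pass. The sketch below is illustrative only: `validate_event` and its error strings are hypothetical, and the real service may validate against the OpenAPI spec instead.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List

REQUIRED_FIELDS = ("eventId", "issuedAt", "kind", "packId", "decision", "actor")
EVENT_KINDS = {
    "pack.approval.requested",
    "pack.approval.updated",
    "pack.policy.hold",
    "pack.policy.released",
}
DECISIONS = {"pending", "approved", "rejected", "hold", "expired"}


def validate_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the event passes these checks."""
    errors = [f"{name} is required" for name in REQUIRED_FIELDS if not event.get(name)]
    kind = event.get("kind")
    if kind and kind not in EVENT_KINDS:
        errors.append(f"unknown kind: {kind}")
    decision = event.get("decision")
    if decision and decision not in DECISIONS:
        errors.append(f"unknown decision: {decision}")
    issued_at = event.get("issuedAt")
    if issued_at:
        try:
            # Older Python versions do not accept a trailing "Z", so map it to +00:00.
            parsed = datetime.fromisoformat(str(issued_at).replace("Z", "+00:00"))
        except ValueError:
            errors.append("issuedAt must be an ISO 8601 timestamp")
        else:
            if parsed.utcoffset() != timedelta(0):
                errors.append("issuedAt must be in UTC")
    return errors
```

Optional fields (`policy`, `resumeToken`, `summary`, `labels`) are deliberately not checked here, since the contract does not require them.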
## Resume Token Mechanics
### Token Flow
```
┌─────────────┐      POST /pack-approvals      ┌──────────────┐
│ Task Runner │ ──────────────────────────────►│   Notifier   │
│             │   { resumeToken: "abc123" }    │              │
│             │◄────────────────────────────── │              │
│             │    X-Resume-After: "abc123"    │              │
└─────────────┘                                └──────────────┘
       │                                              │
       │                                              │
       │          User acknowledges approval          │
       │                                              ▼
       │                              ┌───────────────────────────────┐
       │                              │ POST /pack-approvals/{id}/ack │
       │                              │ { ackToken: "..." }           │
       │                              └───────────────────────────────┘
       │                                              │
       │◄─────────────────────────────────────────────┤
       │   Resume callback (webhook or message bus)   │
```
### Token Properties
- **Format:** Opaque string; clients must not parse or modify
- **TTL:** 24 hours from `issuedAt`
- **Uniqueness:** Scoped to tenant + packId + eventId
- **Expiry Handling:** Expired tokens return `410 Gone`
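
As a small worked example of the TTL rule, the check below treats a token as expired once 24 hours have elapsed since `issuedAt`. The function name is hypothetical, and whether the boundary instant itself counts as expired is an assumption.

```python
from datetime import datetime, timedelta, timezone

RESUME_TOKEN_TTL = timedelta(hours=24)


def resume_token_expired(issued_at: datetime, now: datetime) -> bool:
    """True once the TTL has elapsed; a server would then answer 410 Gone."""
    return now - issued_at >= RESUME_TOKEN_TTL
```

For the sample event issued at `2025-11-27T10:30:00Z`, the token would be usable until just before `2025-11-28T10:30:00Z`.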
### X-Resume-After Header
When `resumeToken` is present in the request, the server echoes it in the `X-Resume-After` response header. This enables cursor-based processing for Task Runner polling.
## Error Handling
### HTTP Status Codes
| Code | Meaning | Client Action |
|------|---------|---------------|
| `200` | Duplicate request (idempotent) | Treat as success |
| `202` | Accepted for processing | Continue normal flow |
| `204` | Acknowledgement recorded | Continue normal flow |
| `400` | Validation error | Fix request and retry |
| `401` | Authentication required | Refresh token and retry |
| `403` | Insufficient permissions | Check scope; contact admin |
| `404` | Resource not found | Verify packId; may have expired |
| `410` | Token expired | Re-initiate approval flow |
| `429` | Rate limited | Retry after `Retry-After` seconds |
| `5xx` | Server error | Retry with exponential backoff |
### Error Response Format
```json
{
  "error": {
    "code": "invalid_request",
    "message": "eventId, packId, kind, decision, actor are required.",
    "traceId": "00-abc123-def456-00"
  }
}
```
### Retry Strategy
- **Transient errors (5xx, 429):** Exponential backoff starting at 1s, max 60s, max 5 retries
- **Client errors (4xx except 401 and 429):** Do not retry the same request; fix it first. For `401`, refresh the token and retry.
- **Idempotency:** Safe to retry any request with the same `Idempotency-Key`
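
A client-side sketch of the transient-error backoff described above. The doubling factor and the function name `retry_delay` are assumptions; the contract only fixes the 1s start, the 60s ceiling, and the limit of 5 retries. Honouring `Retry-After` on `429` follows the status code table.

```python
from typing import Optional

MAX_RETRIES = 5
START_SECONDS = 1.0
MAX_DELAY_SECONDS = 60.0


def retry_delay(status: int, retry_number: int, retry_after: Optional[float] = None) -> Optional[float]:
    """Seconds to wait before the given 1-based retry, or None when the client should stop."""
    transient = status == 429 or 500 <= status <= 599
    if not transient or retry_number > MAX_RETRIES:
        return None
    if status == 429 and retry_after is not None:
        return retry_after  # the server-provided hint takes precedence
    return min(MAX_DELAY_SECONDS, START_SECONDS * 2 ** (retry_number - 1))
```

With doubling, five retries wait 1, 2, 4, 8, and 16 seconds, so the 60s ceiling only matters if a larger factor or more retries are configured.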
## Security Requirements
### Authentication
All endpoints require a valid OAuth2 bearer token with one of these scopes:
- `packs.approve` — Full approval flow access
- `Notifier.Events:Write` — Event ingestion only
### Tenant Isolation
- `X-StellaOps-Tenant` header is **required** on all requests
- Server validates token tenant claim matches header
- Cross-tenant access returns `403 Forbidden`
### Idempotency
- `Idempotency-Key` header is **required** for POST endpoints
- Keys are scoped to tenant and expire after 15 minutes
- Duplicate requests within the window return `200 OK`
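
The idempotency window can be pictured with an in-memory sketch. This is a hypothetical model (`IdempotencyStore` is not a real Notifier type, and a production service would use shared storage), but it shows the tenant scoping and the 15-minute expiry.

```python
from typing import Dict, Tuple

WINDOW_SECONDS = 15 * 60


class IdempotencyStore:
    """Remembers when each (tenant, key) pair was first seen."""

    def __init__(self) -> None:
        self._first_seen: Dict[Tuple[str, str], float] = {}

    def status_for(self, tenant: str, key: str, now: float) -> int:
        """Return 202 for a first request and 200 for a duplicate inside the window."""
        scoped = (tenant, key)
        seen_at = self._first_seen.get(scoped)
        if seen_at is not None and now - seen_at < WINDOW_SECONDS:
            return 200
        self._first_seen[scoped] = now
        return 202
```

The same key used by two tenants is treated as two independent requests, which matches the tenant scoping rule.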
### HMAC Signature (Webhooks)
For webhook callbacks from Notifier to Task Runner:
```
X-StellaOps-Signature: sha256=<hex-encoded-signature>
X-StellaOps-Timestamp: <unix-timestamp>
```
Signature computed as:
```
HMAC-SHA256(secret, timestamp + "." + body)
```
Verification requirements:
- Reject if timestamp is >5 minutes old
- Reject if signature does not match
- Reject if body has been modified
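
A receiver-side sketch of the verification steps, using only the Python standard library. The header values and the signing formula come from this section; the function names and the choice to pass `now` in explicitly are assumptions made for illustration.

```python
import hashlib
import hmac

MAX_AGE_SECONDS = 5 * 60


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """Compute the X-StellaOps-Signature value: HMAC-SHA256(secret, timestamp + "." + body)."""
    message = timestamp.encode("utf-8") + b"." + body
    return "sha256=" + hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: str, body: bytes, signature_header: str, now: int) -> bool:
    """Reject stale timestamps and any signature that does not match the received body."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if now - sent_at > MAX_AGE_SECONDS:
        return False
    expected = sign(secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))
```

Because the body is part of the signed message, a modified body fails the signature check without needing a separate comparison.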
### IP Allowlist
Configurable per environment in `notifier:security:ipAllowlist`:
```yaml
notifier:
  security:
    ipAllowlist:
      - "10.0.0.0/8"
      - "192.168.1.100"
```
### Sensitive Data Handling
- **Resume tokens:** Encrypted at rest; never logged in full
- **Ack tokens:** Signed with KMS; validated on acknowledgement
- **Labels:** Redacted if keys match `secret`, `password`, `token`, `key` patterns
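
One way to read the label rule is as a key-pattern filter. The sketch below assumes case-insensitive substring matching and a `[REDACTED]` placeholder; neither detail is specified in this contract, and `redact_labels` is a hypothetical name.

```python
import re
from typing import Dict

# Patterns listed in this section; substring and case-insensitive matching are assumptions.
SENSITIVE_KEY = re.compile(r"secret|password|token|key", re.IGNORECASE)


def redact_labels(labels: Dict[str, str]) -> Dict[str, str]:
    """Replace the value of any label whose key looks sensitive."""
    return {
        name: "[REDACTED]" if SENSITIVE_KEY.search(name) else value
        for name, value in labels.items()
    }
```

Substring matching errs on the side of over-redaction: a key such as `monkey` would also be redacted because it contains `key`.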
## Audit Trail
All operations emit structured audit events:
| Event | Fields | Retention |
|-------|--------|-----------|
| `pack.approval.ingested` | packId, kind, decision, actor, eventId | 90 days |
| `pack.approval.acknowledged` | packId, ackToken, decision, actor | 90 days |
| `pack.policy.hold` | packId, policyId, reason | 90 days |
## Observability
### Metrics
| Metric | Type | Labels |
|--------|------|--------|
| `notifier_pack_approvals_total` | Counter | `kind`, `decision`, `tenant` |
| `notifier_pack_approvals_outstanding` | Gauge | `tenant` |
| `notifier_pack_approval_ack_latency_seconds` | Histogram | `decision` |
| `notifier_pack_approval_errors_total` | Counter | `code`, `tenant` |
### Structured Logs
All operations include:
- `traceId` — Distributed trace correlation
- `tenantId` — Tenant identifier
- `packId` — Package identifier
- `eventId` — Event identifier
## Integration Examples
### Task Runner → Notifier (Ingestion)
```bash
curl -X POST https://notifier.stellaops.example.com/api/v1/notify/pack-approvals \
-H "Authorization: Bearer $TOKEN" \
-H "X-StellaOps-Tenant: tenant-acme-corp" \
-H "Idempotency-Key: $(uuidgen)" \
-H "Content-Type: application/json" \
-d '{
"eventId": "550e8400-e29b-41d4-a716-446655440000",
"issuedAt": "2025-11-27T10:30:00Z",
"kind": "pack.approval.requested",
"packId": "pkg:oci/stellaops/scanner@v2.1.0",
"decision": "pending",
"actor": "ci-pipeline@stellaops.example.com",
"resumeToken": "abc123",
"summary": "Approval required for production deployment"
}'
```
### Console → Notifier (Acknowledgement)
```bash
curl -X POST https://notifier.stellaops.example.com/api/v1/notify/pack-approvals/pkg%3Aoci%2Fstellaops%2Fscanner%40v2.1.0/ack \
-H "Authorization: Bearer $TOKEN" \
-H "X-StellaOps-Tenant: tenant-acme-corp" \
-H "Content-Type: application/json" \
-d '{
"ackToken": "ack-token-xyz789",
"decision": "approved",
"comment": "Reviewed and approved"
}'
```
## Related Documents
- [Pack Approvals Integration Requirements](pack-approvals-integration.md)
- [Notifications Architecture](architecture.md)
- [Notifications API Reference](api.md)
- [Notification Templates](templates.md)


@@ -0,0 +1,62 @@
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
# Pack Approval Notification Integration — Requirements
## Overview
Task Runner now produces pack plans with explicit approval and policy-gate metadata. The Notifications service must ingest those events, persist their state, and fan out actionable alerts (approvals requested, policy holds, resumptions). This document captures the requirements for the first Notifications sprint dedicated to the Task Runner bridge.
Deliverables feed Sprint 37 tasks (`NOTIFY-SVC-37-00x`) and unblock Task Runner sprint 43 (`TASKRUN-43-001`).
## Functional Requirements
### 1. Approval Event Contract
- Define a canonical schema for **PackApprovalRequested** and **PackApprovalUpdated** events.
- Fields must include `runId`, `approvalId`, tenant context, plan hash, required grants, step identifiers, message template, and resume callback metadata.
- Provide an OpenAPI fragment and x-go/x-cs models for Task Runner and CLI compatibility.
- Document error/acknowledgement semantics (success, retryable failure, validation failure).
### 2. Ingestion & Persistence
- Expose a secure Notifications API endpoint (`POST /notifications/pack-approvals`) receiving Task Runner events.
- Validate scope (`packs.approve`, `Notifier.Events:Write`) and tenant match.
- Persist approval state transitions in PostgreSQL (`notify.pack_approvals` table) with indexes on run/approval/tenant.
- Store outbound notification audit records with correlation IDs to support Task Runner resume flow.
### 3. Notification Routing
- Derive recipients from new rule predicates (`event.kind == "pack.approval"`).
- Render approval templates (email + webhook JSON) including plan metadata and approval links (resume token).
- Emit policy gate notifications as “hold” incidents with context (parameters, messages).
- Support localization fallback and redaction of secrets (never ship approval tokens unencrypted).
### 4. Resume & Ack Handshake
- Provide an approval ack endpoint (`POST /notifications/pack-approvals/{runId}/{approvalId}/ack`) that records decision metadata and forwards to Task Runner resume hook (HTTP callback + message bus placeholder).
- Return structured responses with resume token / status for CLI integration.
- Ensure idempotent updates (dedupe by runId + approvalId + decisionHash).
### 5. Observability & Security
- Emit metrics for approval notifications queued/sent, outstanding approvals, and acknowledgement latency.
- Log audit trail events (`pack.approval.requested`, `pack.approval.acknowledged`, `pack.policy.hold`).
- Enforce HMAC or mTLS for Task Runner -> Notifier ingestion; support configurable IP allowlist.
- Provide chaos-test plan for notification failure modes (channel outage, storage failure).
## Non-Functional Requirements
- Deterministic processing: identical approval events lead to identical outbound notifications (idempotent).
- Timeouts: ingestion endpoint must respond < 500ms under nominal load.
- Retry strategy: Task Runner expects 5xx/429 for transient errors; document backoff guidance.
- Data retention: approval records retained 90 days, purge job tracked under ops runbook.
## Sprint 37 Task Mapping
| Task ID | Scope |
| --- | --- |
| **NOTIFY-SVC-37-001** | Author this contract doc, OpenAPI fragment, and schema references. Coordinate with Task Runner/Authority guilds. |
| **NOTIFY-SVC-37-002** | Implement secure ingestion endpoint, PostgreSQL persistence, and audit hooks. Provide integration tests with sample events. |
| **NOTIFY-SVC-37-003** | Build approval/policy notification templates, routing rules, and channel dispatch (email + webhook). |
| **NOTIFY-SVC-37-004** | Ship acknowledgement endpoint + Task Runner callback client, resume token handling, and metrics/dashboards. |
## Open Questions
1. Who owns the approval resume callback (Task Runner Worker vs Orchestrator)? Resolve before NOTIFY-SVC-37-004.
2. Should approvals generate incidents in the existing incident schema or in a dedicated table? The decision impacts PostgreSQL schema design.
3. Should Authority scopes for approval ingestion/ack reuse `packs.approve` or introduce `packs.approve:notify`? Coordinate with the Authority team.


@@ -0,0 +1,160 @@
# Notifications Rules
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Rules decide which platform events deserve a notification, how aggressively they should be throttled, and which channels/actions should run. They are tenant-scoped contracts that guarantee deterministic routing across Notify.Worker replicas.
---
## 1. Rule lifecycle
1. **Authoring.** Operators create or update rules through the Notify WebService (`POST /rules`, `PATCH /rules/{id}`) or UI. Payloads are normalised to the current `NotifyRule` schema version.
2. **Evaluation.** Notify.Worker evaluates enabled rules per incoming event. Tenancy is enforced first, followed by match filters, VEX gates, throttles, and digest handling.
3. **Delivery.** Matching actions are enqueued with an idempotency key to prevent storm loops. Throttle rejections and digest coalescing are recorded in the delivery ledger.
4. **Audit.** Every change carries `createdBy`/`updatedBy` plus timestamps; the delivery ledger references `ruleId`/`actionId` for traceability.
---
## 2. Rule schema reference
| Field | Type | Notes |
|-------|------|-------|
| `ruleId` | string | Stable identifier; clients may provide UUID/slug. |
| `tenantId` | string | Must match the tenant header supplied when the rule is created. |
| `name` | string | Display label shown in UI and audits. |
| `description` | string? | Optional operator-facing note. |
| `enabled` | bool | Disabled rules remain stored but skipped during evaluation. |
| `labels` | map<string,string> | Sorted, trimmed key/value tags supporting filtering. |
| `metadata` | map<string,string> | Reserved for automation; stored verbatim (sorted). |
| `match` | [`NotifyRuleMatch`](#3-match-filters) | Declarative filters applied before actions execute. |
| `actions[]` | [`NotifyRuleAction`](#4-actions-throttles-and-digests) | Ordered set of channel dispatchers; minimum one. |
| `createdBy`/`createdAt` | string?, instant | Populated automatically when omitted. |
| `updatedBy`/`updatedAt` | string?, instant | Defaults to creation values when unspecified. |
| `schemaVersion` | string | Auto-upgraded during persistence; use for migrations. |
Rules are immutable snapshots; updates produce a full document write so workers observing change streams can refresh caches deterministically.
---
## 3. Match filters
`NotifyRuleMatch` narrows which events trigger the rule. All string collections are trimmed, deduplicated, and sorted to guarantee deterministic evaluation.
| Field | Type | Behaviour |
|-------|------|-----------|
| `eventKinds[]` | string | Lower-cased; supports any canonical Notify event (`scanner.report.ready`, `scheduler.rescan.delta`, `zastava.admission`, etc.). Empty list matches all kinds. |
| `namespaces[]` | string | Exact match against `event.scope.namespace`. Supports glob-style filters via upstream enrichment (planned). |
| `repositories[]` | string | Matches `event.scope.repo`. |
| `digests[]` | string | Lower-cased; matches `event.scope.digest`. |
| `labels[]` | string | Matches event attributes or delta labels (`kev`, `critical`, `license`, …). |
| `componentPurls[]` | string | Matches component identifiers inside the event payload when provided. |
| `minSeverity` | string? | Lower-cased severity gate (e.g., `medium`, `high`, `critical`). Evaluated on new findings inside event deltas; events lacking severity bypass this gate unless set. |
| `verdicts[]` | string | Accepts scan/report verdicts (`fail`, `warn`, `block`, `escalate`, `deny`). |
| `kevOnly` | bool? | When `true`, only KEV-tagged findings fire. |
| `vex` | object | Additional gating aligned with VEX consensus; see below. |
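
The trim, deduplicate, and sort normalisation applied to string collections can be sketched as below. The helper name is hypothetical, and dropping empty entries and sorting by code point are assumptions; the `lower` flag mirrors the fields the table marks as lower-cased.

```python
from typing import Iterable, List


def normalise(values: Iterable[str], lower: bool = False) -> List[str]:
    """Trim each entry, drop empties and duplicates, and return a sorted list."""
    cleaned = set()
    for value in values:
        item = value.strip()
        if not item:
            continue
        cleaned.add(item.lower() if lower else item)
    return sorted(cleaned)
```

Two rule payloads that differ only in ordering, casing, or stray whitespace therefore normalise to the same stored list, which keeps evaluation deterministic across workers.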
### 3.1 VEX gates
`NotifyRuleMatchVex` offers fine-grained control when VEX findings accompany events:
| Field | Default | Effect |
|-------|---------|--------|
| `includeAcceptedJustifications` | `true` | Include findings marked `not_affected`/`acceptable` in consensus. |
| `includeRejectedJustifications` | `false` | Surface findings the consensus rejected. |
| `includeUnknownJustifications` | `false` | Allow findings without explicit justification. |
| `justificationKinds[]` | `[]` | Optional allow-list of justification codes (e.g., `exploit_observed`, `component_not_present`). |
If the VEX block filters out every applicable finding, the rule is treated as a non-match and no actions run.
---
## 4. Actions, throttles, and digests
Each rule requires at least one action. Actions are deduplicated and sorted by `actionId`, so prefer deterministic identifiers.
| Field | Type | Notes |
|-------|------|-------|
| `actionId` | string | Stable identifier unique within the rule. |
| `channel` | string | Reference to a channel (`channelId`) configured in `/channels`. |
| `template` | string? | Template key to use for rendering; falls back to channel default when omitted. |
| `digest` | string? | Digest window key (`instant`, `5m`, `15m`, `1h`, `1d`). `instant` bypasses coalescing. |
| `throttle` | ISO8601 duration? | Optional throttle TTL (`PT300S`, `PT1H`). Prevents duplicate deliveries when the same idempotency hash appears before expiry. |
| `locale` | string? | BCP-47 tag (stored lower-case). Template lookup falls back to channel locale then `en-us`. |
| `enabled` | bool | Disabled actions skip rendering but remain stored. |
| `metadata` | map<string,string> | Connector-specific hints (priority, layout, etc.). |
### 4.0 Attestation lifecycle templates
Rules targeting attestation/signing events (`attestor.verification.failed`, `attestor.attestation.expiring`, `authority.keys.revoked`, `attestor.transparency.anomaly`) must reference the dedicated template keys documented in [`templates.md` §7](templates.md#7-attestation--signing-lifecycle-templates-notify-attest-74-001) so payloads remain deterministic across channels and Offline Kits:
| Event kind | Required template key | Notes |
| --- | --- | --- |
| `attestor.verification.failed` | `tmpl-attest-verify-fail` | Include failure code, Rekor UUID/index, last good attestation link. |
| `attestor.attestation.expiring` | `tmpl-attest-expiry-warning` | Surface issued/expires timestamps, time remaining, renewal instructions. |
| `authority.keys.revoked` / `authority.keys.rotated` | `tmpl-attest-key-rotation` | List rotation batch ID, impacted services, remediation steps. |
| `attestor.transparency.anomaly` | `tmpl-attest-transparency-anomaly` | Highlight Rekor/witness metadata and anomaly classification. |
Locale-specific variants keep the same template key while varying `locale`; rule actions shouldn't create ad-hoc templates for these events.
### 4.1 Evaluation order
1. Verify channel exists and is enabled; disabled channels mark the delivery as `Dropped`.
2. Apply throttle idempotency key: `hash(ruleId|actionId|event.kind|scope.digest|delta.hash|dayBucket)`. Hits are logged as `Throttled`.
3. If the action defines a digest window other than `instant`, append the event to the open window and defer delivery until flush.
4. When delivery proceeds, the renderer resolves the template, locale, and metadata before invoking the connector.
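
The throttle key in step 2 can be illustrated as follows. The field order matches the formula above; SHA-256, the `YYYY-MM-DD` UTC day bucket, and the function name are assumptions, since this guide does not specify them.

```python
import hashlib
from datetime import datetime, timezone


def throttle_key(rule_id: str, action_id: str, event_kind: str,
                 scope_digest: str, delta_hash: str, occurred_at: datetime) -> str:
    """hash(ruleId|actionId|event.kind|scope.digest|delta.hash|dayBucket)."""
    if occurred_at.tzinfo is None:
        raise ValueError("occurred_at must be timezone-aware")
    day_bucket = occurred_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    material = "|".join((rule_id, action_id, event_kind, scope_digest, delta_hash, day_bucket))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Two events with the same inputs on the same UTC day produce the same key and would be recorded as `Throttled` while the throttle TTL is active; the day bucket changes the key at the next UTC day.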
---
## 5. Example rule payload
```json
{
  "ruleId": "rule-critical-soc",
  "tenantId": "tenant-dev",
  "name": "Critical scanner verdicts",
  "description": "Route KEV-tagged critical findings to SOC Slack with zero delay.",
  "enabled": true,
  "match": {
    "eventKinds": ["scanner.report.ready"],
    "labels": ["kev", "critical"],
    "minSeverity": "critical",
    "verdicts": ["fail", "block"],
    "kevOnly": true
  },
  "actions": [
    {
      "actionId": "act-slack-critical",
      "channel": "chn-slack-soc",
      "template": "tmpl-critical",
      "digest": "instant",
      "throttle": "PT300S",
      "locale": "en-us",
      "metadata": {
        "priority": "p1"
      }
    }
  ],
  "labels": {
    "owner": "soc"
  },
  "metadata": {
    "revision": "12"
  }
}
```
Dry-run calls (`POST /rules/{id}/test`) accept the same structure along with a sample Notify event payload to exercise match logic without invoking connectors.
---
## 6. Operational guidance
- Keep rule scopes narrow (namespace/repository) before relying on severity gates; this minimises noise and improves digest summarisation.
- Always configure a throttle window for instant actions to protect against repeated upstream retries.
- Use rule labels to organise dashboards and access control (e.g., `owner:soc`, `env:prod`).
- Prefer tenant-specific rule IDs so Offline Kit exports remain deterministic across environments.
- If a rule depends on derived metadata (e.g., policy verdict tags), list those dependencies in the rule description for audit readiness.
---
> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.


@@ -0,0 +1,3 @@
# Notify Schemas Catalog
Placeholder for NR1 deliverables: versioned JSON Schemas for Notify event envelopes, rules, templates, channels, receipts, and webhooks. Publish `notify-schemas-catalog.json` + `.dsse.json` here with canonicalization recipe (BLAKE3-256 over normalized JSON) and `inputs.lock` capturing digests.
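
Until the canonicalization recipe is published, the idea can be sketched as "serialise deterministically, then hash". Everything below is an assumption: the normalisation rules (sorted keys, no insignificant whitespace, UTF-8) are a guess at what "normalized JSON" means, and `hashlib.sha256` is a stand-in because BLAKE3 is not in the Python standard library. A BLAKE3 constructor with a hashlib-style interface would be passed as `hasher` to produce real catalog digests.

```python
import hashlib
import json
from typing import Any, Callable


def canonical_bytes(document: Any) -> bytes:
    """Serialise with sorted keys and compact separators, encoded as UTF-8 (assumed rules)."""
    text = json.dumps(document, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def schema_digest(document: Any, hasher: Callable[[bytes], Any] = hashlib.sha256) -> str:
    """Hex digest over the canonical bytes; sha256 here is only a stand-in for BLAKE3-256."""
    return hasher(canonical_bytes(document)).hexdigest()
```

The point of canonicalizing first is that key order and whitespace in the source file no longer affect the digest recorded in `inputs.lock`.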


@@ -0,0 +1,20 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/channel.schema.json",
"title": "Notify Channel Configuration",
"type": "object",
"required": ["schema_version", "tenant_id", "channel_id", "kind", "config"],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"channel_id": { "type": "string", "pattern": "^[A-Z0-9_-]{4,64}$" },
"kind": { "type": "string", "enum": ["email", "slack", "teams", "webhook", "sms"] },
"config": { "type": "object" },
"secrets_ref": { "type": "object", "additionalProperties": { "type": "string" } },
"rate_limit": { "type": "object" },
"enabled": { "type": "boolean", "default": true },
"created_at": { "type": "string", "format": "date-time" },
"updated_at": { "type": "string", "format": "date-time" }
},
"additionalProperties": false
}


@@ -0,0 +1,20 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/dlq-notify.schema.json",
"title": "Notify Dead Letter Entry",
"type": "object",
"required": ["schema_version", "tenant_id", "delivery_id", "reason", "payload", "first_failed_at"],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"delivery_id": { "type": "string", "pattern": "^[0-9a-fA-F-]{18,36}$" },
"reason": { "type": "string" },
"payload": { "type": "object" },
"backoff_attempts": { "type": "integer", "minimum": 0 },
"dedupe_key": { "type": "string" },
"first_failed_at": { "type": "string", "format": "date-time" },
"last_failed_at": { "type": "string", "format": "date-time" },
"redrive_after": { "type": "string", "format": "date-time" }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,26 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/event-envelope.schema.json",
"title": "Notify Event Envelope",
"type": "object",
"required": [
"schema_version",
"tenant_id",
"event_id",
"occurred_at",
"kind",
"payload"
],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"event_id": { "type": "string", "pattern": "^[0-9a-fA-F-]{18,36}$" },
"occurred_at": { "type": "string", "format": "date-time" },
"kind": { "type": "string", "minLength": 1 },
"correlation_id": { "type": "string" },
"source": { "type": "string" },
"payload": { "type": "object" },
"attributes": { "type": "object", "additionalProperties": { "type": ["string", "number", "boolean", "null"] } }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,14 @@
{
"catalog": "notify-schemas-catalog.json",
"hash_algorithm": "blake3-256",
"canonicalization": "json-normalized-utf8",
"entries": [
{ "file": "event-envelope.schema.json", "digest": "0534e778a7e24dfdcbdc66cec2902f24684ec0bdf26d708ab9bca98e6674a318" },
{ "file": "rule.schema.json", "digest": "34d4f1c2ba97b76acf85ad61f4e8de4591664eefecbc7ebb6d168aa5a998ddd1" },
{ "file": "template.schema.json", "digest": "e0a8f9bb5e5f29a11b040e7cb0e7e9a8c5d42256f9a4bd72f79460eb613dac52" },
{ "file": "channel.schema.json", "digest": "bd9e2dfb4e6e7e7a38f26cc94ae8bcdf9b8c44b1e97bf78c146711783fe8fa2b" },
{ "file": "receipt.schema.json", "digest": "fb4431019b3803081983b215fc9ca2e7618c3cf91f8274baedf72cacad8dfe46" },
{ "file": "webhook.schema.json", "digest": "54a6e0d956fd6af7e88f6508bda78221ca04cfedea4112bfefc7fa5dbfa45c09" },
{ "file": "dlq-notify.schema.json", "digest": "1330e589245b923f6e1fea6af080b7b302a97effa360a90dbef4ba3b06021b2f" }
]
}

View File

@@ -0,0 +1,11 @@
{
"payloadType": "application/vnd.notify.schema-catalog+json",
"payload": "eyJjYW5vbmljYWxpemF0aW9uIjoianNvbi1ub3JtYWxpemVkLXV0ZjgiLCJjYXRhbG9nX3ZlcnNpb24iOiJ2MS4wIiwiZ2VuZXJhdGVkX2F0IjoiMjAyNS0xMi0wNFQwMDowMDowMFoiLCJoYXNoX2FsZ29yaXRobSI6ImJsYWtlMy0yNTYiLCJzY2hlbWFzIjpbeyJkaWdlc3QiOiIwNTM0ZTc3OGE3ZTI0ZGZkY2JkYzY2Y2VjMjkwMmYyNDY4NGVjMGJkZjI2ZDcwOGFiOWJjYTk4ZTY2NzRhMzE4IiwiZmlsZSI6ImV2ZW50LWVudmVsb3BlLnNjaGVtYS5qc29uIiwiaWQiOiJldmVudC1lbnZlbG9wZSIsInZlcnNpb24iOiJ2MS4wIn0seyJkaWdlc3QiOiIzNGQ0ZjFjMmJhOTdiNzZhY2Y4NWFkNjFmNGU4ZGU0NTkxNjY0ZWVmZWNiYzdlYmI2ZDE2OGFhNWE5OThkZGQxIiwiZmlsZSI6InJ1bGUuc2NoZW1hLmpzb24iLCJpZCI6InJ1bGUiLCJ2ZXJzaW9uIjoidjEuMCJ9LHsiZGlnZXN0IjoiZTBhOGY5YmI1ZTVmMjlhMTFiMDQwZTdjYjBlN2U5YThjNWQ0MjI1NmY5YTRiZDcyZjc5NDYwZWI2MTNkYWM1MiIsImZpbGUiOiJ0ZW1wbGF0ZS5zY2hlbWEuanNvbiIsImlkIjoidGVtcGxhdGUiLCJ2ZXJzaW9uIjoidjEuMCJ9LHsiZGlnZXN0IjoiYmQ5ZTJkZmI0ZTZlN2U3YTM4ZjI2Y2M5NGFlOGJjZGY5YjhjNDRiMWU5N2JmNzhjMTQ2NzExNzgzZmU4ZmEyYiIsImZpbGUiOiJjaGFubmVsLnNjaGVtYS5qc29uIiwiaWQiOiJjaGFubmVsIiwidmVyc2lvbiI6InYxLjAifSx7ImRpZ2VzdCI6ImZiNDQzMTAxOWIzODAzMDgxOTgzYjIxNWZjOWNhMmU3NjE4YzNjZjkxZjgyNzRiYWVkZjcyY2FjYWQ4ZGZlNDYiLCJmaWxlIjoicmVjZWlwdC5zY2hlbWEuanNvbiIsImlkIjoicmVjZWlwdCIsInZlcnNpb24iOiJ2MS4wIn0seyJkaWdlc3QiOiI1NGE2ZTBkOTU2ZmQ2YWY3ZTg4ZjY1MDhiZGE3ODIyMWNhMDRjZmVkZWE0MTEyYmZlZmM3ZmE1ZGJmYTQ1YzA5IiwiZmlsZSI6IndlYmhvb2suc2NoZW1hLmpzb24iLCJpZCI6IndlYmhvb2siLCJ2ZXJzaW9uIjoidjEuMCJ9LHsiZGlnZXN0IjoiMTMzMGU1ODkyNDViOTIzZjZlMWZlYTZhZjA4MGI3YjMwMmE5N2VmZmEzNjBhOTBkYmVmNGJhM2IwNjAyMWIyZiIsImZpbGUiOiJkbHEtbm90aWZ5LnNjaGVtYS5qc29uIiwiaWQiOiJkbHEiLCJ2ZXJzaW9uIjoidjEuMCJ9XX0=",
"signatures": [
{
"sig": "99WPzzc6sCaEQHXk2B15aLxtG/Ics6qsgHYa2oDTI1g=",
"keyid": "notify-dev-hmac-001",
"signedAt": "2025-12-04T21:12:53+00:00"
}
]
}

View File

@@ -0,0 +1,15 @@
{
"catalog_version": "v1.0",
"hash_algorithm": "blake3-256",
"canonicalization": "json-normalized-utf8",
"generated_at": "2025-12-04T00:00:00Z",
"schemas": [
{ "id": "event-envelope", "file": "event-envelope.schema.json", "version": "v1.0", "digest": "0534e778a7e24dfdcbdc66cec2902f24684ec0bdf26d708ab9bca98e6674a318" },
{ "id": "rule", "file": "rule.schema.json", "version": "v1.0", "digest": "34d4f1c2ba97b76acf85ad61f4e8de4591664eefecbc7ebb6d168aa5a998ddd1" },
{ "id": "template", "file": "template.schema.json", "version": "v1.0", "digest": "e0a8f9bb5e5f29a11b040e7cb0e7e9a8c5d42256f9a4bd72f79460eb613dac52" },
{ "id": "channel", "file": "channel.schema.json", "version": "v1.0", "digest": "bd9e2dfb4e6e7e7a38f26cc94ae8bcdf9b8c44b1e97bf78c146711783fe8fa2b" },
{ "id": "receipt", "file": "receipt.schema.json", "version": "v1.0", "digest": "fb4431019b3803081983b215fc9ca2e7618c3cf91f8274baedf72cacad8dfe46" },
{ "id": "webhook", "file": "webhook.schema.json", "version": "v1.0", "digest": "54a6e0d956fd6af7e88f6508bda78221ca04cfedea4112bfefc7fa5dbfa45c09" },
{ "id": "dlq", "file": "dlq-notify.schema.json", "version": "v1.0", "digest": "1330e589245b923f6e1fea6af080b7b302a97effa360a90dbef4ba3b06021b2f" }
]
}

View File

@@ -0,0 +1,21 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/receipt.schema.json",
"title": "Notify Delivery Receipt",
"type": "object",
"required": ["schema_version", "tenant_id", "delivery_id", "rule_id", "channel", "status", "sent_at"],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"delivery_id": { "type": "string", "pattern": "^[0-9a-fA-F-]{18,36}$" },
"rule_id": { "type": "string" },
"channel": { "type": "string" },
"status": { "type": "string", "enum": ["sent", "delivered", "failed", "queued", "acknowledged"] },
"attempt": { "type": "integer", "minimum": 1 },
"sent_at": { "type": "string", "format": "date-time" },
"ack_url": { "type": "string", "format": "uri" },
"response": { "type": "object" },
"errors": { "type": "array", "items": { "type": "string" } }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,37 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/rule.schema.json",
"title": "Notify Rule",
"type": "object",
"required": [
"schema_version",
"tenant_id",
"rule_id",
"name",
"sources",
"predicates",
"actions",
"approvals_required"
],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"rule_id": { "type": "string", "pattern": "^[A-Z0-9_-]{4,64}$" },
"name": { "type": "string", "minLength": 1 },
"description": { "type": "string" },
"severity": { "type": "string", "enum": ["info", "low", "medium", "high", "critical"] },
"sources": { "type": "array", "items": { "type": "string" }, "minItems": 1 },
"predicates": { "type": "array", "items": { "type": "object" }, "minItems": 1 },
"actions": {
"type": "array",
"items": { "type": "object" },
"minItems": 1
},
"approvals_required": { "type": "integer", "minimum": 0, "maximum": 3 },
"quiet_hours": { "type": "array", "items": { "type": "string" } },
"simulation_required": { "type": "boolean", "default": true },
"created_at": { "type": "string", "format": "date-time" },
"updated_at": { "type": "string", "format": "date-time" }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,22 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/template.schema.json",
"title": "Notify Template",
"type": "object",
"required": ["schema_version", "tenant_id", "template_id", "channel", "locale", "body"],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"template_id": { "type": "string", "pattern": "^[A-Z0-9_-]{4,64}$" },
"channel": { "type": "string", "enum": ["email", "slack", "teams", "webhook", "sms"] },
"locale": { "type": "string", "pattern": "^[a-z]{2}(-[A-Z]{2})?$" },
"subject": { "type": "string" },
"body": { "type": "string" },
"helpers": { "type": "object", "additionalProperties": { "type": "string" } },
"merge_fields": { "type": "array", "items": { "type": "string" } },
"preview_hash": { "type": "string" },
"created_at": { "type": "string", "format": "date-time" },
"updated_at": { "type": "string", "format": "date-time" }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,20 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://stella-ops.org/notify/schemas/webhook.schema.json",
"title": "Notify Webhook Payload",
"type": "object",
"required": ["schema_version", "tenant_id", "delivery_id", "signature", "body"],
"properties": {
"schema_version": { "type": "string", "pattern": "^v[0-9]+\\.[0-9]+$" },
"tenant_id": { "type": "string", "minLength": 1 },
"delivery_id": { "type": "string", "pattern": "^[0-9a-fA-F-]{18,36}$" },
"signature": { "type": "string" },
"hmac_id": { "type": "string" },
"body": { "type": "object" },
"sent_at": { "type": "string", "format": "date-time" },
"nonce": { "type": "string" },
"audience": { "type": "string" },
"expires_at": { "type": "string", "format": "date-time" }
},
"additionalProperties": false
}

View File

@@ -0,0 +1,3 @@
# Notify Security Notes
Holds NR2, NR6, and NR7 artefacts: tenant/RBAC approval matrix, webhook/ack hardening policy (HMAC/mTLS/DPoP + signed acks), and redaction/PII catalog with sanitized fixture samples.

View File

@@ -0,0 +1,6 @@
# Redaction and PII catalog (NR7)
- Classify merge fields: identifiers (hash), secrets (strip), PII (mask), operational metadata (retain).
- Storage and previews must use redacted forms by default; full bodies allowed only with `Notify.Audit` permission.
- Log payloads must omit secrets; hashes use BLAKE3-256 over UTF-8 normalized values.
- Fixtures under `docs/modules/notify/fixtures/redaction/` show expected redacted shapes for templates and receipts.
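
The four handling classes above can be sketched as a single dispatch function. This is an illustration under stated assumptions: the field names in `FIELD_CLASSES` are hypothetical, unknown fields are stripped (fail-closed), masking keeps the last two characters, Unicode NFC stands in for "normalized", and `hashlib.blake2b` is used only as a standard-library stand-in because BLAKE3 is not available in Python's standard library.

```python
import hashlib
import unicodedata
from typing import Optional

# Hypothetical classification table; real merge fields come from the NR7 catalog.
FIELD_CLASSES = {
    "user.id": "identifier",
    "api.token": "secret",
    "user.email": "pii",
    "event.kind": "operational",
}


def redact(field: str, value: str) -> Optional[str]:
    """Return the redacted form of a merge field, or None when it must be stripped."""
    kind = FIELD_CLASSES.get(field, "secret")  # unknown fields are treated as secrets
    if kind == "operational":
        return value  # retain
    if kind == "pii":
        if len(value) <= 2:
            return "*" * len(value)
        return "*" * (len(value) - 2) + value[-2:]  # mask all but the last two characters
    if kind == "identifier":
        normalised = unicodedata.normalize("NFC", value).encode("utf-8")
        # Stand-in digest: BLAKE2b-256, NOT the BLAKE3-256 this catalog requires.
        return hashlib.blake2b(normalised, digest_size=32).hexdigest()
    return None  # strip secrets
```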

View File

@@ -0,0 +1,6 @@
# Tenant scoping and approvals (NR2)
- All Notify APIs require `tenant_id` in request and ledger records.
- High-impact actions (escalations, PII-bearing templates, cross-tenant fan-out) need N-of-M approvals: default 2 of 3 approvers with `Notify.Approver` role.
- Approvals captured as DSSE-signed records (future hook) and stored alongside rule change requests.
- Rejection reasons must be logged and returned in error payloads; audit log keeps requester, approver IDs, timestamps, and rule/template IDs.
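
The N-of-M rule can be sketched as a set check. A minimal illustration, assuming approvals arrive as a list of approver IDs and that the set of principals holding `Notify.Approver` is resolved elsewhere; `approval_satisfied` is a hypothetical name.

```python
from typing import Iterable, Set


def approval_satisfied(approvals: Iterable[str], eligible: Set[str], required: int = 2) -> bool:
    """True when at least `required` distinct eligible approvers have approved."""
    # A set removes repeated approvals from the same approver; ineligible IDs are ignored.
    distinct = {approver for approver in approvals if approver in eligible}
    return len(distinct) >= required
```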

View File

@@ -0,0 +1,6 @@
# Webhook and ack security (NR6)
- Webhooks must be authenticated with HMAC-SHA256 using per-tenant rotating secrets, or with mTLS/DPoP. The `hmac_id` field identifies the secret material used to sign a payload.
- Ack URLs carry signed tokens (nonce, audience, tenant_id, delivery_id, expires_at) and are single-use. Reject replay or expired tokens.
- Enforce allowlists for domains and paths per tenant; deny wildcards.
- Capture failures in observability pipeline and DLQ with redrive after investigation.
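
The HMAC and single-use ack rules above can be sketched with the Python standard library. This is a hedged illustration, not the Notify implementation: hex-encoded signatures, the `secrets_by_id` lookup, and the in-memory `AckTokenGuard` are assumptions made for the example.

```python
import hashlib
import hmac
from typing import Dict, Set


def sign_body(secret: bytes, body: bytes) -> str:
    """Compute an HMAC-SHA256 signature over the raw body (hex encoding is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_body(secrets_by_id: Dict[str, bytes], hmac_id: str, body: bytes, signature: str) -> bool:
    """Look up the secret by hmac_id and compare signatures in constant time."""
    secret = secrets_by_id.get(hmac_id)
    if secret is None:
        return False
    expected = sign_body(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


class AckTokenGuard:
    """Hypothetical in-memory replay guard; a real deployment needs durable storage."""

    def __init__(self) -> None:
        self._used: Set[str] = set()

    def accept(self, nonce: str, expires_at: float, now: float) -> bool:
        """Accept a token once, and only before it expires."""
        if now >= expires_at or nonce in self._used:
            return False
        self._used.add(nonce)
        return True
```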

View File

@@ -0,0 +1 @@
{"rule_id":"RULE-INCIDENT","tenant_id":"tenant-123","simulation_report":"sample-simulation-report.json","status":"passed"}

View File

@@ -0,0 +1,28 @@
{
"rule_id": "sample-rule",
"simulation_id": "sim-2025-12-04-001",
"executed_at": "2025-12-04T00:00:00Z",
"tenant_id": "test-tenant",
"fixtures_used": [
"docs/notifications/fixtures/rendering/tmpl-incident-start.email.en-US.json"
],
"channels_tested": ["email", "slack"],
"results": {
"events_processed": 10,
"deliveries_simulated": 20,
"delivery_success": 20,
"delivery_failure": 0,
"quota_blocked": 0,
"redaction_applied": true
},
"determinism_check": {
"hash_algorithm": "blake3-256",
"output_digest": "0000000000000000000000000000000000000000000000000000000000000000",
"verified": true
},
"approval_required": true,
"approval_status": "pending",
"evidence_links": [
"docs/notifications/fixtures/rendering/index.ndjson"
]
}

View File

@@ -0,0 +1,13 @@
{
"rule_id": "RULE-INCIDENT",
"tenant_id": "tenant-123",
"fixtures_version": "v1",
"result": "passed",
"evidence": [
{
"event_id": "evt-1",
"decision": "send",
"channel": "email"
}
]
}

View File

@@ -0,0 +1,86 @@
# Notifier Telemetry SLO Webhook Schema (1.0.0)
Purpose: define the payload emitted by Telemetry SLO evaluators toward Notifier so that NOTIFY-OBS-51-001 can consume alerts deterministically (online and offline).
## Delivery contract
- Content-Type: `application/json`
- Encoding: UTF-8
- Authentication: mTLS (service identity) or DPoP/JWT with `aud` = `notifier` and `scope` = `obs:slo:ingest`.
- Determinism: timestamps are UTC ISO-8601 with `Z`; field order stable for hashing (see canonical JSON below).
## Payload fields
```
{
"id": "uuid",
"tenant": "string", // required; aligns with orchestrator/telemetry tenant id
"service": "string", // logical service name
"host": "string", // optional; k8s node/hostname
"slo": {
"name": "string", // human-readable
"id": "string", // immutable key used for dedupe
"objective": {
"window": "PT5M", // ISO-8601 duration
"target": 0.995 // decimal between 0 and 1
}
},
"metric": {
"type": "latency|error|availability|custom",
"value": 0.0123, // double; units depend on type
"unit": "seconds|ratio|percent|count",
"labels": { // sanitized, deterministic ordering when serialized
"endpoint": "/api/jobs",
"method": "GET"
}
},
"window": {
"start": "2025-11-19T12:00:00Z",
"end": "2025-11-19T12:05:00Z"
},
"breach": {
"state": "breaching|warning|ok",
"reason": "p95 latency above objective",
"evidence": [
{
"type": "timeseries",
"href": "cas://telemetry/series/abc123",
"hash": "sha256:..."
}
]
},
"quietHours": {
"active": false,
"policyId": null
},
"trace": {
"trace_id": "optional-trace-id",
"span_id": "optional-span-id"
},
"version": "1.0.0",
"issued_at": "2025-11-19T12:05:07Z"
}
```
### Canonical JSON rules
- Sort object keys lexicographically before hashing/signing.
- Use lowercase for enum-like fields shown above.
- `version` is required so the schema can evolve; changes must be additive (new fields only).
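
The key-ordering rule can be approximated in a few lines. A minimal sketch, assuming Python's standard `json` module is an acceptable stand-in for the producer's serializer; `canonical_json` and `payload_digest` are hypothetical names, and SHA-256 is used only for illustration because this document does not name the hash applied to the payload.

```python
import hashlib
import json


def canonical_json(payload: dict) -> str:
    """Serialize with sorted keys at every nesting level and no insignificant whitespace."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def payload_digest(payload: dict) -> str:
    """Hash the canonical UTF-8 bytes (SHA-256 is an assumption for this sketch)."""
    return "sha256:" + hashlib.sha256(canonical_json(payload).encode("utf-8")).hexdigest()


# Two payloads that differ only in key order serialize identically.
a = {"window": {"start": "2025-11-19T12:00:00Z", "end": "2025-11-19T12:05:00Z"}, "id": "x"}
b = {"id": "x", "window": {"end": "2025-11-19T12:05:00Z", "start": "2025-11-19T12:00:00Z"}}
print(canonical_json(a) == canonical_json(b))
```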
### Retry and idempotency
- `id` is the idempotency key; Notifier treats duplicates as a no-op.
- Producers retry with exponential backoff up to 10 minutes; consumers respond 2xx only after persistence.
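
Both sides of this contract can be sketched as follows. The example is illustrative only: `InMemoryStore` stands in for Notifier's durable store, the 202/200 status codes are assumptions, and "up to 10 minutes" is interpreted here as a cumulative wait budget of 600 seconds.

```python
from typing import Dict, Iterator, Tuple


class InMemoryStore:
    """Hypothetical stand-in for a durable store keyed by payload id."""

    def __init__(self) -> None:
        self._seen: Dict[str, dict] = {}

    def persist_if_new(self, payload: dict) -> bool:
        """Return True when the payload was stored, False for a duplicate id."""
        key = payload["id"]
        if key in self._seen:
            return False
        self._seen[key] = payload
        return True


def handle_webhook(store: InMemoryStore, payload: dict) -> Tuple[int, str]:
    """Respond 2xx only after persistence; a duplicate id is acknowledged as a no-op."""
    created = store.persist_if_new(payload)
    return (202, "accepted") if created else (200, "duplicate")


def backoff_delays(base_seconds: float = 1.0, budget_seconds: float = 600.0) -> Iterator[float]:
    """Yield doubling delays while the cumulative wait stays within the budget."""
    elapsed, delay = 0.0, base_seconds
    while elapsed + delay <= budget_seconds:
        yield delay
        elapsed += delay
        delay *= 2
```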
### Validation checklist (for tests/CI)
- Required fields: `id`, `tenant`, `service`, `slo.id`, `slo.objective.window`, `slo.objective.target`, `metric.type`, `metric.value`, `window.start`, `window.end`, `breach.state`, `version`, `issued_at`.
- Timestamps parse with `DateTimeStyles.RoundtripKind`.
- When `breach.state=ok`, `breach.reason` may be null, but the `breach` object must still be present.
- When `quietHours.active=true`, `quietHours.policyId` must be set (non-null).
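
The checklist can be turned into a small validator for tests. A sketch under assumptions: it is written in Python rather than .NET, so the strict `YYYY-MM-DDTHH:MM:SSZ` check is a narrower stand-in for `DateTimeStyles.RoundtripKind` parsing, and `validate` with its error strings is a hypothetical shape.

```python
from datetime import datetime
from typing import Any, List

REQUIRED_PATHS = [
    "id", "tenant", "service", "slo.id", "slo.objective.window", "slo.objective.target",
    "metric.type", "metric.value", "window.start", "window.end",
    "breach.state", "version", "issued_at",
]
TIMESTAMP_PATHS = ["window.start", "window.end", "issued_at"]
_MISSING = object()


def lookup(payload: dict, path: str) -> Any:
    """Resolve a dotted path, returning _MISSING when any segment is absent."""
    node: Any = payload
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def is_utc_timestamp(value: Any) -> bool:
    """Accept only second-precision UTC timestamps ending in Z, as in the samples."""
    if not isinstance(value, str):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return False
    return True


def validate(payload: dict) -> List[str]:
    """Return a list of checklist violations; an empty list means the payload passes."""
    errors = []
    for path in REQUIRED_PATHS:
        value = lookup(payload, path)
        if value is _MISSING or value is None:
            errors.append(f"missing required field: {path}")
    for path in TIMESTAMP_PATHS:
        value = lookup(payload, path)
        if value is not _MISSING and value is not None and not is_utc_timestamp(value):
            errors.append(f"invalid UTC timestamp: {path}")
    quiet = payload.get("quietHours")
    if isinstance(quiet, dict) and quiet.get("active") is True and not quiet.get("policyId"):
        errors.append("quietHours.active=true requires policyId")
    return errors
```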
### Sample canonical JSON (minified)
```json
{"breach":{"evidence":[],"reason":"p99 latency above objective","state":"breaching"},"host":"orchestrator-0","id":"8c1d58c4-b1de-4b3c-9c7b-40a6b0f8d4c1","issued_at":"2025-11-19T12:05:07Z","metric":{"labels":{"endpoint":"/api/jobs","method":"GET"},"type":"latency","unit":"seconds","value":1.234},"quietHours":{"active":false,"policyId":null},"service":"orchestrator","slo":{"id":"orch-api-latency-p99","name":"Orchestrator API p99","objective":{"target":0.99,"window":"PT5M"}},"tenant":"default","trace":{"span_id":null,"trace_id":null},"version":"1.0.0","window":{"end":"2025-11-19T12:05:00Z","start":"2025-11-19T12:00:00Z"}}
```
### Evidence to surface in sprint tasks
- File: `docs/modules/notify/slo-webhook-schema.md` (this document).
- Sample payload (canonical) and validation checklist above.
- Dependencies: upstream Telemetry evaluator must emit `metric.labels` sanitized; Notifier to persist `id` for idempotency.

View File

@@ -0,0 +1,239 @@
# Notifications Templates
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Templates shape the payload rendered for each channel when a rule action fires. They are deterministic, locale-aware artefacts stored alongside rules so Notify.Worker replicas can render identical messages regardless of environment.
---
## 1. Template lifecycle
1. **Authoring.** Operators create templates via the API (`POST /templates`) or UI. Each template binds to a channel type (`Slack`, `Teams`, `Email`, `Webhook`, `Custom`) and a locale.
2. **Reference.** Rule actions opt in by referencing the template key (`actions[].template`). Channel defaults apply when no template is specified.
3. **Rendering.** During delivery, the worker resolves the template (locale fallbacks included), executes it using the safe Handlebars-style engine, and passes the rendered payload plus metadata to the connector.
4. **Audit.** Rendered payloads stored in the delivery ledger include the `templateId` so operators can trace which text was used.
---
## 2. Template schema reference
| Field | Type | Notes |
|-------|------|-------|
| `templateId` | string | Stable identifier (UUID/slug). |
| `tenantId` | string | Must match the tenant header in API calls. |
| `channelType` | [`NotifyChannelType`](../modules/notify/architecture.md#5-channels--connectors-plug-ins) | Determines connector payload envelope. |
| `key` | string | Human-readable key referenced by rules (`tmpl-critical`). |
| `locale` | string | BCP-47 tag, stored lower-case (`en-us`, `bg-bg`). |
| `body` | string | Template body; rendered strictly without executing arbitrary code. |
| `renderMode` | enum | `Markdown`, `Html`, `AdaptiveCard`, `PlainText`, or `Json`. Guides connector sanitisation. |
| `format` | enum | `Slack`, `Teams`, `Email`, `Webhook`, or `Json`. Signals delivery payload structure. |
| `description` | string? | Optional operator note. |
| `metadata` | map<string,string> | Sorted map for automation (layout hints, fallback text). |
| `createdBy`/`createdAt` | string?, instant | Auto-populated. |
| `updatedBy`/`updatedAt` | string?, instant | Auto-populated. |
| `schemaVersion` | string | Auto-upgraded on persistence. |
Templates are normalised: string fields trimmed, locale lower-cased, metadata sorted to preserve determinism.
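
The normalisation step can be sketched as a pure function. This is an illustration under assumptions, not the Notify implementation: it trims every top-level string field, including `body`, and `normalise_template` is a hypothetical name.

```python
def normalise_template(template: dict) -> dict:
    """Trim string fields, lower-case the locale, and sort metadata keys."""
    result = {}
    for key, value in template.items():
        result[key] = value.strip() if isinstance(value, str) else value
    if isinstance(result.get("locale"), str):
        result["locale"] = result["locale"].lower()
    metadata = result.get("metadata") or {}
    # Python dicts preserve insertion order, so inserting in sorted order fixes the key order.
    result["metadata"] = dict(sorted(metadata.items()))
    return result
```

Because the function is deterministic, two replicas given the same stored template produce byte-identical input for rendering.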
---
## 3. Variables, helpers, and context
Templates receive a structured context derived from the Notify event, rule match, and rendering metadata.
| Path | Description |
|------|-------------|
| `event.*` | Canonical event envelope (`kind`, `tenant`, `ts`, `actor`). |
| `event.scope.*` | Namespace, repository, digest, image, component identifiers, labels, attributes. |
| `payload.*` | Raw event payload (e.g., `payload.verdict`, `payload.delta.*`, `payload.links.*`). |
| `rule.*` | Rule descriptor (`ruleId`, `name`, `labels`, `metadata`). |
| `action.*` | Action descriptor (`actionId`, `channel`, `digest`, `throttle`, `metadata`). |
| `policy.*` | Policy metadata when supplied (`revisionId`, `name`). |
| `topFindings[]` | Top-N findings summarised for convenience (vulnerability ID, severity, reachability). |
| `digest.*` | When rendering digest flushes: `window`, `openedAt`, `itemCount`. |
Built-in helpers mirror the architecture dossier:
| Helper | Usage |
|--------|-------|
| `severity_icon severity` | Returns emoji/text badge representing severity. |
| `link text url` | Produces channel-safe hyperlink. |
| `pluralize count "finding"` | Adds plural suffix when `count != 1`. |
| `truncate text maxLength` | Cuts strings while preserving determinism. |
| `code text` | Formats inline code (Markdown/HTML aware). |
Connectors may expose additional helpers via partials, but must remain deterministic and side-effect free.
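
To show what "deterministic and side-effect free" means for a helper, here is a sketch of two of the built-ins as plain functions. The behaviour is assumed for illustration: the naive `s` suffix and the hard cut without an ellipsis are not confirmed by this document.

```python
def pluralize(count: int, noun: str) -> str:
    """Append a plural suffix when count != 1 (naive 's' suffix, assumed)."""
    return noun if count == 1 else noun + "s"


def truncate(text: str, max_length: int) -> str:
    """Cut text to at most max_length characters; same input always gives same output."""
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    return text if len(text) <= max_length else text[:max_length]
```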
---
## 4. Sample templates
### 4.1 Slack (Markdown + block kit)
```hbs
{{#*inline "findingLine"}}
- {{severity_icon severity}} {{vulnId}} ({{severity}}) in `{{component}}`
{{/inline}}
*:rotating_light: {{payload.summary.total}} findings {{#if payload.delta.newCritical}}(new critical: {{payload.delta.newCritical}}){{/if}}*
{{#if topFindings.length}}
Top findings:
{{#each topFindings}}{{> findingLine}}{{/each}}
{{/if}}
{{link "Open report in Console" payload.links.ui}}
```
### 4.2 Email (HTML + text alternative)
```hbs
<h2>{{payload.verdict}} for {{event.scope.repo}}</h2>
<p>{{payload.summary.total}} findings ({{payload.summary.blocked}} blocked, {{payload.summary.warned}} warned)</p>
<table>
<thead><tr><th>Finding</th><th>Severity</th><th>Package</th></tr></thead>
<tbody>
{{#each topFindings}}
<tr>
<td>{{this.vulnId}}</td>
<td>{{this.severity}}</td>
<td>{{this.component}}</td>
</tr>
{{/each}}
</tbody>
</table>
<p>{{link "View full analysis" payload.links.ui}}</p>
```
When delivering via email, connectors automatically attach a plain-text alternative derived from the rendered content to preserve accessibility.
---
## 5. Preview and validation
- `POST /channels/{id}/test` accepts an optional `templateId` and sample payload to produce a rendered preview without dispatching the event. Results include channel type, target, title/summary, locale, body hash, and connector metadata.
- UI previews rely on the same API and highlight connector fallbacks (e.g., Teams adaptive card vs. text fallback).
- Offline Kit scenarios can call `/internal/notify/templates/normalize` to ensure bundled templates match the canonical schema before packaging.
---
## 6. Best practices
- Keep channel-specific limits in mind (Slack block/character quotas, Teams adaptive card size, email line length). Lean on digests to summarise long lists.
- Provide locale-specific versions for high-volume tenants; Notify selects the closest locale, falling back to `en-us`.
- Store connector-specific hints (`metadata.layout`, `metadata.emoji`) in template metadata rather than rules when they affect rendering.
- Version template bodies through metadata (e.g., `metadata.revision: "2025-10-28"`) so tenants can track changes over time.
- Run test previews whenever introducing new helpers to confirm body hashes remain stable across environments.
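
The locale fallback mentioned above could work roughly as follows. This is a sketch: the definition of "closest" (exact tag, then bare language, then any regional variant of the same language in sorted order, then `en-us`) is an assumption, and `resolve_locale` is a hypothetical name.

```python
from typing import Iterable, Optional

DEFAULT_LOCALE = "en-us"


def resolve_locale(requested: str, available: Iterable[str]) -> Optional[str]:
    """Pick the closest available locale, falling back to en-us, else None."""
    pool = sorted({loc.lower() for loc in available})  # sorted for deterministic ties
    wanted = requested.strip().lower()
    if wanted in pool:
        return wanted
    language = wanted.split("-")[0]
    if language in pool:
        return language
    for candidate in pool:
        if candidate.split("-")[0] == language:
            return candidate
    return DEFAULT_LOCALE if DEFAULT_LOCALE in pool else None
```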
---
## 7. Attestation & signing lifecycle templates (NOTIFY-ATTEST-74-001)
Attestation lifecycle events (verification failures, expiring attestations, key revocations, transparency anomalies) reuse the same structural context so operators can differentiate urgency while reusing channels. Every template **must** surface:
- **Subject** (`payload.subject.digest`, `payload.subject.repository`, `payload.subject.tag`).
- **Attestation metadata** (`payload.attestation.kind`, `payload.attestation.id`, `payload.attestation.issuedAt`, `payload.attestation.expiresAt`).
- **Signer/Key fingerprint** (`payload.signer.kid`, `payload.signer.algorithm`, `payload.signer.rotationId`).
- **Traceability** (`payload.links.console`, `payload.links.rekor`, `payload.links.docs`).
### 7.1 Template keys & channels
| Event | Template key | Required channels | Optional channels | Notes |
| --- | --- | --- | --- | --- |
| Verification failure (`attestor.verification.failed`) | `tmpl-attest-verify-fail` | Slack `sec-alerts`, Email `supply-chain@`, Webhook (Pager/SOC) | Teams `risk-war-room`, Custom SIEM feed | Include failure code, Rekor UUID, last-known good attestation link. |
| Expiring attestation (`attestor.attestation.expiring`) | `tmpl-attest-expiry-warning` | Email summary, Slack reminder | Digest window (daily) | Provide expiration window, renewal instructions, `expires_in` helper. |
| Key revocation/rotation (`authority.keys.revoked`, `authority.keys.rotated`) | `tmpl-attest-key-rotation` | Email + Webhook | Slack (if SOC watches channel) | Add rotation batch ID, impacted tenants/services, remediation steps. |
| Transparency anomaly (`attestor.transparency.anomaly`) | `tmpl-attest-transparency-anomaly` | Slack high-priority, Webhook, PagerDuty | Email follow-up | Show Rekor index delta, witness ID, anomaly classification, recommended actions. |
Assign these keys when creating templates so rule actions can reference them deterministically (`actions[].template: "tmpl-attest-verify-fail"`).
### 7.2 Context helpers
- `attestation_status_badge status`: renders ✅/⚠️/❌ depending on verdict (`valid`, `expiring`, `failed`).
- `expires_in expiresAt now`: returns human-readable duration, constrained to deterministic units (h/d).
- `fingerprint key`: shortens long key IDs/PEMs, exposing only the last 10 characters.
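
The three helpers can be sketched as pure functions. The details are assumptions made for illustration: durations are floored to whole hours or days, elapsed deadlines render as `expired`, shortened fingerprints get a `...` prefix, and unknown statuses are returned unchanged.

```python
from datetime import datetime, timezone

BADGES = {"valid": "✅", "expiring": "⚠️", "failed": "❌"}


def attestation_status_badge(status: str) -> str:
    """Map a verdict to its badge; unknown statuses pass through unchanged (assumed)."""
    return BADGES.get(status, status)


def _parse_utc(value: str) -> datetime:
    """Parse a second-precision UTC timestamp such as 2025-12-04T00:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def expires_in(expires_at: str, now: str) -> str:
    """Return the remaining time in whole hours (under a day) or whole days."""
    seconds = int((_parse_utc(expires_at) - _parse_utc(now)).total_seconds())
    if seconds <= 0:
        return "expired"
    hours = seconds // 3600
    return f"{hours}h" if hours < 24 else f"{hours // 24}d"


def fingerprint(key: str, visible: int = 10) -> str:
    """Expose only the last `visible` characters of a long key ID."""
    compact = key.strip()
    if len(compact) <= visible:
        return compact
    return "..." + compact[-visible:]
```

Passing `now` explicitly (for example `event.ts`, as in the email sample below) keeps the output independent of the worker's clock.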
### 7.3 Slack sample (verification failure)
```hbs
:rotating_light: {{attestation_status_badge payload.failure.status}} verification failed for `{{payload.subject.digest}}`
Signer: `{{fingerprint payload.signer.kid}}` ({{payload.signer.algorithm}})
Reason: `{{payload.failure.reasonCode}}` — {{payload.failure.reason}}
Last valid attestation: {{link "Console report" payload.links.console}}
Rekor entry: {{link "Transparency log" payload.links.rekor}}
```
### 7.4 Email sample (expiring attestation)
```hbs
<h2>Attestation expiry notice</h2>
<p>The attestation for <code>{{payload.subject.repository}}</code> (digest {{payload.subject.digest}}) expires on <strong>{{payload.attestation.expiresAt}}</strong>.</p>
<ul>
<li>Issued: {{payload.attestation.issuedAt}}</li>
<li>Signer: {{payload.signer.kid}} ({{payload.signer.algorithm}})</li>
<li>Time remaining: {{expires_in payload.attestation.expiresAt event.ts}}</li>
</ul>
<p>Please rotate the attestation before expiry. Reference <a href="{{payload.links.docs}}">renewal steps</a>.</p>
```
### 7.5 Webhook sample (transparency anomaly)
```json
{
"event": "attestor.transparency.anomaly",
"tenantId": "{{event.tenant}}",
"subjectDigest": "{{payload.subject.digest}}",
"rekorIndex": "{{payload.transparency.rekorIndex}}",
"witnessId": "{{payload.transparency.witnessId}}",
"anomaly": "{{payload.transparency.classification}}",
"detailsUrl": "{{payload.links.console}}",
"recommendation": "{{payload.recommendation}}"
}
```
### 7.6 Offline kit guidance
- Bundle these templates (JSON export) under `offline/notifier/templates/attestation/`.
- Baseline English templates for Slack, Email, and Webhook ship in the repository at `offline/notifier/templates/attestation/*.template.json`; copy and localise them per tenant as needed.
- Provide localised variants for `en-us` and `de-de` at minimum; additional locales can be appended per customer.
- Include preview fixtures in Offline Kit smoke tests to guarantee channel render parity when air-gapped.
---
> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
---
## 8. Incident mode templates (NOTIFY-OBS-55-001)
Incident toggles are high-noise events that must pierce quiet hours and include audit-ready context. Use dedicated templates so downstream tooling can distinguish activation vs. recovery and surface the required evidence.
**Required context keys**
- `payload.incidentId`, `payload.reason`, `payload.startedAt` / `payload.stoppedAt`.
- `payload.links.trace` (root cause trace/span), `payload.links.evidence` (timeline/export bundle), `payload.links.timeline`.
- `payload.retentionDays` (active) and `payload.retentionBaselineDays` (post-incident).
- `payload.quietHoursOverride` (boolean) to justify bypassing quiet hours.
- `payload.legal.jurisdiction`, `payload.legal.ticket`, `payload.legal.logPath` for compliance logging.
**Template keys**
- `tmpl-incident-start` — activation notice.
- `tmpl-incident-stop` — recovery/cleanup notice.
**Slack sample (start)**
```hbs
:rotating_light: Incident mode activated for {{payload.incidentId}}
Reason: {{payload.reason}}
Trace: {{link "root span" payload.links.trace}} · Evidence: {{link "bundle" payload.links.evidence}}
Retention extended to {{payload.retentionDays}} days (baseline {{payload.retentionBaselineDays}})
Quiet hours overridden: {{payload.quietHoursOverride}}
Legal: {{payload.legal.jurisdiction}} (ticket {{payload.legal.ticket}})
```
**Email sample (stop)**
```hbs
<h2>Incident mode cleared: {{payload.incidentId}}</h2>
<p>Stopped at {{payload.stoppedAt}} — retention reset to {{payload.retentionBaselineDays}} days.</p>
<p>Timeline: {{link "view timeline" payload.links.timeline}} · Audit log: {{payload.legal.logPath}}</p>
```
See `src/Notifier/StellaOps.Notifier/docs/incident-mode-rules.sample.json` for ready-to-import rules referencing these templates with quiet-hour overrides and legal logging metadata.