chore(docs+devops): cross-module doc sync + sprint archival moves + compose updates
Bundled pre-session doc + ops work:

- docs/modules/**: sync across advisory-ai, airgap, cli, excititor, export-center, findings-ledger, notifier, notify, platform, router, sbom-service, ui, web (architectural + operational updates)
- docs/features/**: updates to checked excititor vex pipeline, developer workspace, quick verify drawer
- docs top-level: README, quickstart, API_CLI_REFERENCE, UI_GUIDE, code-of-conduct/TESTING_PRACTICES updates
- docs/qa/feature-checks/: FLOW.md + excititor state update
- docs/implplan/: remaining sprint updates + new Concelier source credentials sprint (SPRINT_20260422_003)
- docs-archived/implplan/: 30 sprint archival moves (ElkSharp series, misc completed sprints)
- devops/compose: .env + services compose + env example + router gateway config updates

File-level granularity preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -25,6 +25,7 @@ All endpoints require `Authorization: Bearer <token>` and `X-Stella-Tenant` head

- `GET /deliveries` — query delivery ledger; filters: `status`, `channel`, `rule_id`, `from`, `to`. Sorted by `createdUtc` DESC then `id` ASC.
- `GET /deliveries/{id}` — single delivery with rendered payload hash and attempts.
- `POST /digests/preview` — preview digest rendering for a tenant/rule set; returns deterministic body/hash without sending.
- Digest schedule cadence is worker configuration (`Notifier:DigestSchedule`), not an API-managed resource; current digest endpoints administer rendered output or open windows only.
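The deterministic body/hash contract behind `POST /digests/preview` can be sketched as canonical serialization plus SHA-256. This is a minimal illustration, not the service's actual schema — field names and the canonicalization details are assumptions:

```python
import hashlib
import json

def digest_preview_hash(tenant_id: str, rule_ids: list[str], items: list[dict]) -> str:
    # Canonical form: sorted keys, no insignificant whitespace, stable rule order,
    # so the same inputs always produce the same hash without sending anything.
    canonical = json.dumps(
        {"tenantId": tenant_id, "ruleIds": sorted(rule_ids), "items": items},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the hash depends only on canonical content, a preview computed now can be compared byte-for-byte against what the worker later renders.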

## Acknowledgements

- `POST /acks/{token}` — acknowledge an escalation token. Validates DSSE signature, token expiry, and tenant. Returns `200` with cancellation summary.

@@ -1,17 +1,28 @@

- > **Scope.** Implementation‑ready architecture for **Notify** (aligned with Epic 11 – Notifications Studio): a rules‑driven, tenant‑aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator‑defined routing rules, renders **channel‑specific messages** (Slack/Teams/Email/Webhook), and delivers them **reliably** with idempotency, throttling, and digests. It is UI‑managed, auditable, and safe by default (no secrets leakage, no spam storms).
+ > **Scope.** Implementation‑ready architecture for **Notify** (aligned with Epic 11 – Notifications Studio): a rules‑driven, tenant‑aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator‑defined routing rules, renders **channel‑specific messages** (Slack/Teams/Email/Webhook/PagerDuty/OpsGenie), and delivers them **reliably** with idempotency, throttling, and digests. It is UI‑managed, auditable, and safe by default (no secrets leakage, no spam storms).

* **Console frontdoor compatibility (updated 2026-03-10).** The web console reaches Notifier Studio through the gateway-owned `/api/v1/notifier/*` prefix, which translates onto the service-local `/api/v2/notify/*` surface without requiring browser calls to raw service-prefixed routes.
* **Console admin routing truthfulness (updated 2026-04-21).** The console uses `/api/v1/notify/*` only for core Notify toolkit flows (channels, rules, deliveries, incidents, acknowledgements). Advanced admin configuration such as quiet-hours, throttles, escalation, and localization is owned by the Notifier frontdoor `/api/v1/notifier/* -> /api/v2/notify/*`; Platform no longer serves synthetic `/api/v1/notify/*` admin compatibility payloads. Digest schedule CRUD remains unavailable in the live API.
* **Merged Notify compat surface restoration (updated 2026-04-22).** The merged `src/Notify/*` host now maps the admin compatibility routes expected behind `/api/v1/notifier/*`, including `/api/v2/notify/channels*`, `/deliveries*`, `/simulate*`, `/quiet-hours*`, `/throttle-configs*`, `/escalation-policies*`, and `/overrides*`. Unsupported operator override CRUD now returns an explicit `501` contract response instead of a misleading `404`, and focused proof lives in `src/Notify/__Tests/StellaOps.Notify.WebService.Tests/CrudEndpointsTests.cs`.

* **Runtime durability cutover (updated 2026-04-16).** Default `src/Notifier/*` production wiring now resolves queue and storage through the shared `StellaOps.Notify.Persistence` and `StellaOps.Notify.Queue` libraries. `NullNotifyEventQueue` is allowed only in the `Testing` environment, `notify.pack_approvals` is durable, and restart-survival proof is covered by `NotifierDurableRuntimeProofTests` against real Postgres + Redis.
* **Correlation incident/throttle durability (updated 2026-04-20).** Non-testing Notify and Notifier hosts no longer keep incident correlation or throttle windows in process-local memory. Both hosts now swap `IIncidentManager` and `INotifyThrottler` onto PostgreSQL-backed runtime services using `notify.correlation_runtime_incidents` and `notify.correlation_runtime_throttle_events`, with restart-survival proof in `NotifierCorrelationDurableRuntimeTests`.
* **Localization runtime durability (updated 2026-04-20).** Non-testing Notify and Notifier hosts no longer keep tenant-managed localization bundles in process-local memory. Both hosts now swap `ILocalizationService` onto a PostgreSQL-backed runtime service using `notify.localization_bundles`, while built-in system fallback strings remain compiled defaults, with restart-survival proof in `NotifierLocalizationDurableRuntimeTests`.
* **Storm/fallback runtime durability (updated 2026-04-20).** Non-testing Notify and Notifier hosts no longer keep storm detection state, tenant fallback chains, or per-delivery fallback attempts in process-local memory. Both hosts now swap `IStormBreaker` and `IFallbackHandler` onto PostgreSQL-backed runtime services using `notify.storm_runtime_states`, `notify.storm_runtime_events`, `notify.fallback_runtime_chains`, and `notify.fallback_runtime_delivery_states`, with restart-survival proof in `NotifierStormFallbackDurableRuntimeTests`.
* **Escalation engine runtime durability (updated 2026-04-20).** Non-testing Notify and Notifier hosts no longer keep live `IEscalationEngine` state in a process-local dictionary. Both hosts now swap `IEscalationEngine` onto a PostgreSQL-backed runtime service using `notify.escalation_states`, with restart-survival proof in `NotifierEscalationRuntimeDurableTests` and startup-contract proof in `NotifyEscalationRuntimeStartupContractTests`.
* **External ack/runtime channel durability (updated 2026-04-20).** Non-testing Notifier worker hosts no longer depend on a process-local external-id bridge map or a webhook-only dispatch composition for external channels. The worker now composes `WebhookChannelDispatcher` for chat/webhook routes plus `AdapterChannelDispatcher` for `Email`, `PagerDuty`, and `OpsGenie`, durably records provider `externalId` plus `incidentId` metadata into PostgreSQL-backed delivery state, and resolves PagerDuty/OpsGenie webhook acknowledgements through PostgreSQL-backed lookup after restart. Focused proof lives in `NotifierWorkerHostWiringTests` and `NotifierAckBridgeRuntimeDurableTests`.
* **Digest scheduler runtime composition (updated 2026-04-20).** The non-testing Notifier worker now composes `DigestScheduleRunner`, `DigestGenerator`, and `ChannelDigestDistributor` in the live host. Scheduled digests remain configuration-driven and now resolve tenant IDs from `Notifier:DigestSchedule:Schedules:*:TenantIds` through `ConfiguredDigestTenantProvider` instead of the process-local `InMemoryDigestTenantProvider`. There is currently no operator-managed digest schedule CRUD surface in the live runtime; `/digests` administers open digest windows only. Focused proof lives in `NotifierWorkerHostWiringTests`.
* **Suppression admin durability (updated 2026-04-16).** Non-testing throttle configuration and operator override APIs no longer use live in-memory state. Both hosts now resolve canonical `/api/v2/throttles*` and `/api/v2/overrides*` plus legacy `/api/v2/notify/throttle-configs*` and `/api/v2/notify/overrides*` through PostgreSQL-backed suppression services, with restart-survival proof in `NotifierSuppressionDurableRuntimeTests`.
* **Escalation/on-call durability (updated 2026-04-16).** Non-testing escalation-policy and on-call schedule APIs no longer use live in-memory services or compat repositories. Both hosts now resolve canonical `/api/v2/escalation-policies*` and `/api/v2/oncall-schedules*` plus legacy `/api/v2/notify/escalation-policies*` and `/api/v2/notify/oncall-schedules*` through PostgreSQL-backed runtime services, with restart-survival proof in `NotifierEscalationOnCallDurableRuntimeTests`.
- * **Quiet-hours/maintenance durability (updated 2026-04-16).** Non-testing quiet-hours calendars and maintenance windows no longer use live in-memory compat repositories or maintenance evaluators. Both hosts now resolve canonical `/api/v2/quiet-hours*` plus legacy `/api/v2/notify/quiet-hours*` and `/api/v2/notify/maintenance-windows*` through PostgreSQL-backed runtime services on the shared `notify.quiet_hours` and `notify.maintenance_windows` tables, with restart-survival proof in `NotifierQuietHoursMaintenanceDurableRuntimeTests`. Fixed-time daily/weekly cron expressions project truthfully into canonical schedules; more complex cron shapes are persisted for round-trip reads but remain inert until a cron-native evaluator lands.
+ * **Quiet-hours/maintenance durability (updated 2026-04-20).** Non-testing quiet-hours calendars and maintenance windows no longer use live in-memory compat repositories or maintenance evaluators. Both hosts now resolve canonical `/api/v2/quiet-hours*` plus legacy `/api/v2/notify/quiet-hours*` and `/api/v2/notify/maintenance-windows*` through PostgreSQL-backed runtime services on the shared `notify.quiet_hours` and `notify.maintenance_windows` tables, with restart-survival proof in `NotifierQuietHoursMaintenanceDurableRuntimeTests`. Fixed-time daily/weekly cron expressions still project truthfully into canonical schedules, and compat-authored cron shapes that cannot be flattened losslessly now evaluate natively from persisted `cronExpression` plus `duration` metadata instead of remaining inert after restart.
* **Security/dead-letter durability (updated 2026-04-16).** Non-testing webhook security, tenant isolation, dead-letter administration, and retention cleanup state no longer use live in-memory services. Both hosts now resolve `/api/v2/security*`, `/api/v2/notify/dead-letter*`, `/api/v1/observability/dead-letters*`, and retention endpoints through PostgreSQL-backed runtime services on shared `notify.webhook_security_configs`, `notify.webhook_validation_nonces`, `notify.tenant_resource_owners`, `notify.cross_tenant_grants`, `notify.tenant_isolation_violations`, `notify.dead_letter_entries`, `notify.retention_policies_runtime`, and `notify.retention_cleanup_executions_runtime` tables, with restart-survival proof in `NotifierSecurityDeadLetterDurableRuntimeTests`.
* **Testing-only fallback boundary (updated 2026-04-20).** `src/Notifier/*` host startup now registers those durable quiet-hours, suppression, escalation/on-call, security, and dead-letter services directly for non-testing environments instead of composing an in-memory graph and replacing it later. The remaining in-memory admin services are isolated to `Testing`, with startup-contract proof in `StartupDependencyWiringTests`.

* **Simulation runtime parity (updated 2026-04-20).** The canonical `/api/v2/simulate*` endpoints and the legacy `/api/v2/notify/simulate*` endpoints in `src/Notifier/` now resolve the same DI-composed simulation runtime, so throttling plus quiet-hours or maintenance suppression behave identically across route families.

---

## 0) Mission & boundaries

- **Mission.** Convert **facts** from Stella Ops into **actionable, noise‑controlled** signals where teams already live (chat/email/webhooks), with **explainable** reasons and deep links to the UI.
+ **Mission.** Convert **facts** from Stella Ops into **actionable, noise-controlled** signals where teams already live (chat, email, paging, and webhooks), with **explainable** reasons and deep links to the UI.

**Boundaries.**

@@ -46,7 +57,7 @@ src/

* **Notify.WebService** (stateless API)
* **Notify.Worker** (horizontal scale)

- **Dependencies**: Authority (OpToks; DPoP/mTLS), **PostgreSQL** (notify schema), Valkey/NATS (bus), HTTP egress to Slack/Teams/Webhooks, SMTP relay for Email.
+ **Dependencies**: Authority (OpToks; DPoP/mTLS), **PostgreSQL** (notify schema), Valkey/NATS (bus), HTTP egress to Slack/Teams/Webhooks/PagerDuty/OpsGenie, SMTP relay for Email.

> **Configuration.** Notify.WebService bootstraps from `notify.yaml` (see `etc/notify.yaml.sample`). Use `storage.driver: postgres` and provide `postgres.notify` options (`connectionString`, `schemaName`, pool sizing, timeouts). Authority settings follow the platform defaults—when running locally without Authority, set `authority.enabled: false` and supply `developmentSigningKey` so JWTs can be validated offline.
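A minimal `notify.yaml` fragment for the Postgres + offline-Authority shape described above. Values are placeholders and the exact nesting is an assumption; `etc/notify.yaml.sample` remains the authoritative reference:

```yaml
storage:
  driver: postgres
postgres:
  notify:
    connectionString: "Host=localhost;Database=stellaops;Username=notify;Password=change-me"
    schemaName: notify
authority:
  enabled: false                        # local development without Authority
  developmentSigningKey: "change-me"    # enables offline JWT validation
```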
>
@@ -66,6 +77,8 @@ src/
> ```
>
> The Offline Kit job simply copies the `plugins/notify` tree into the air-gapped bundle; the ordered list keeps connector manifests stable across environments.
>
> In the hosted Notifier worker, delivery execution is split across two deterministic dispatch paths: `WebhookChannelDispatcher` continues to handle chat/webhook routes, while `AdapterChannelDispatcher` resolves `Email`, `PagerDuty`, and `OpsGenie` through `IChannelAdapterFactory`. The provider `externalId` emitted by those adapter-backed channels must survive persistence so inbound webhook acknowledgements can be resolved after restart.

> **Authority clients.** Register two OAuth clients in StellaOps Authority: `notify-web-dev` (audience `notify.dev`) for development and `notify-web` (audience `notify`) for staging/production. Both require `notify.read` and `notify.admin` scopes and use DPoP-bound client credentials (`client_secret` in the samples). Reference entries live in `etc/authority.yaml.sample`, with placeholder secrets under `etc/secrets/notify-web*.secret.example`.

@@ -199,12 +212,14 @@ actions:

Channel config is **two‑part**: a **Channel** record (name, type, options) and a Secret **reference** (Vault/K8s Secret). Connectors are **restart-time plug-ins** discovered on service start (same manifest convention as Concelier/Excititor) and live under `plugins/notify/<channel>/`.

- **Built‑in v1:**
+ **Built-in channels:**

* **Slack**: Bot token (xoxb‑…), `chat.postMessage` + `blocks`; rate limit aware (HTTP 429).
* **Microsoft Teams**: Incoming Webhook (or Graph card later); adaptive card payloads.
* **Email (SMTP)**: TLS (STARTTLS or implicit), From/To/CC/BCC; HTML+text alt; DKIM optional.
* **Generic Webhook**: POST JSON with a signature in headers (Ed25519 signature or HMAC‑SHA‑256).
* **PagerDuty**: Events API v2 trigger/ack/resolve flow; durable `dedup_key`/external id mapping is persisted with delivery state for restart-safe webhook acknowledgement handling.
* **OpsGenie**: Alert create/ack/close flow; alias/external id is persisted with delivery state so inbound acknowledgement webhooks remain restart-safe.
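The HMAC-SHA-256 variant of the generic-webhook signature can be sketched as follows. The canonical-JSON choice and the idea of signing the exact bytes sent are illustrative assumptions; header names and encoding are connector configuration:

```python
import hashlib
import hmac
import json

def sign_webhook(payload: dict, secret: bytes) -> tuple[bytes, str]:
    # Serialize deterministically, then MAC over the exact bytes that go on the wire.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return body, hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_webhook(body: bytes, signature: str, secret: bytes) -> bool:
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)  # constant-time comparison
```

Receivers must recompute the MAC over the raw request body (not a re-parsed/re-serialized copy), or verification will fail on equivalent-but-differently-ordered JSON.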

**Connector contract:** (implemented by plug-in assemblies)

@@ -216,6 +231,8 @@ public interface INotifyConnector {
}
```

For hosted external channels, Notifier worker adapters implement `IChannelAdapter` and are selected by `AdapterChannelDispatcher`. Those adapters must emit stable provider identifiers (`externalId`, `incidentId` where applicable) so the `IAckBridge` webhook path can recover correlation from persisted delivery rows instead of process-local memory.

**DeliveryContext** includes **rendered content** and **raw event** for audit.

**Test-send previews.** Plug-ins can optionally implement `INotifyChannelTestProvider` to shape `/channels/{id}/test` responses. Providers receive a sanitised `ChannelTestPreviewContext` (channel, tenant, target, timestamp, trace) and return a `NotifyDeliveryRendered` preview + metadata. When no provider is present, the host falls back to a generic preview so the endpoint always responds.
@@ -274,23 +291,43 @@ Canonical JSON Schemas for rules/channels/events live in `docs/modules/notify/re

```
{ _id, tenantId, ruleId, actionId, eventId, kind, scope, status:"sent|failed|throttled|digested|dropped",
  externalId?, metadata?,
  attempts:[{ts, status, code, reason}],
  rendered:{ title, body, target },   // redacted for PII; body hash stored
  sentAt, lastError? }
```

PagerDuty and OpsGenie deliveries durably carry the provider `externalId` plus `metadata.incidentId` so inbound webhook acknowledgements can be resolved after worker restart without relying on a process-local bridge map.

* `digests`

```
{ _id, tenantId, actionKey, window:"hourly", openedAt, items:[{eventId, scope, delta}], status:"open|flushed" }
```

- * `throttles`
+ * `correlation_runtime_incidents`

```
- { key:"idem:<hash>", ttlAt }   // short-lived, also cached in Valkey
+ { tenantId, incidentId, correlationKey, eventKind, title, status:"open|acknowledged|resolved",
+   eventCount, firstOccurrence, lastOccurrence, acknowledgedBy?, resolvedBy?, eventIds:[eventId...] }
```

* `correlation_runtime_throttle_events`

```
{ tenantId, correlationKey, occurredAt }   // short-lived, also cached in Valkey
```

* `escalation_states`

```
{ tenantId, policyId, incidentId?, correlationId, currentStep, repeatIteration,
  status:"active|acknowledged|resolved|expired", startedAt, nextEscalationAt,
  acknowledgedAt?, acknowledgedBy?, metadata }
```

`correlationId` is the durable lookup key for the live string incident id used by the runtime engine. `metadata` carries the runtime-only fields that do not fit the canonical columns yet: `stateId`, external `policyId`, `levelStartedAt`, terminal runtime status (`stopped|exhausted`), `stoppedAt`, `stoppedReason`, and the full escalation `history`.

**Indexes**: rules by `{tenantId, enabled}`, deliveries by `{tenantId, sentAt desc}`, digests by `{tenantId, actionKey}`.

---

@@ -344,6 +381,8 @@ To support one-click acknowledgements from chat/email, the Notify WebService min

Authority signs ack tokens using keys configured under `notifications.ackTokens`. Public JWKS responses expose these keys with `use: "notify-ack"` and `status: active|retired`, enabling offline verification by the worker/UI/CLI.

Inbound PagerDuty and OpsGenie acknowledgement webhooks must resolve provider identifiers from durable delivery state (`externalId` plus incident metadata), not from process-local runtime maps. Restart-survival is a required property of the non-testing host composition.

**Ingestion**: workers do **not** expose public ingestion; they **subscribe** to the internal bus. (Optional `/events/test` for integration testing, admin-only.)
---

@@ -357,7 +396,7 @@ Authority signs ack tokens using keys configured under `notifications.ackTokens`

* **Ingestor**: N consumers with per‑key ordering (key = tenant|digest|namespace).
* **RuleMatcher**: loads active rules snapshot for tenant into memory; vectorized predicate check.
- * **Throttle/Dedupe**: consult Valkey + PostgreSQL `throttles`; if hit → record `status=throttled`.
+ * **Throttle/Dedupe**: consult Valkey plus PostgreSQL `notify.correlation_runtime_throttle_events`; if hit → record `status=throttled`.
* **DigestCoalescer**: append to open digest window or flush when timer expires.
* **Renderer**: select template (channel+locale), inject variables, enforce length limits, compute `bodyHash`.
* **Connector**: send; handle provider‑specific rate limits and backoffs; `maxAttempts` with exponential jitter; overflow → DLQ (dead‑letter topic) + UI surfacing.
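The "`maxAttempts` with exponential jitter" retry shape can be sketched with full jitter. Parameter values here are illustrative defaults, not the worker's actual configuration:

```python
import random

def retry_delays(max_attempts: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    # "Full jitter": each delay is uniform in [0, min(cap, base * 2**attempt)],
    # which spreads retries out and avoids synchronized thundering herds.
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays
```

Once `max_attempts` is exhausted, the delivery overflows to the dead-letter topic rather than retrying forever.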

@@ -448,7 +487,7 @@ notify:

## 14) UI touch‑points

- * **Notifications → Channels**: add Slack/Teams/Email/Webhook; run **health**; rotate secrets.
+ * **Notifications → Channels**: add Slack/Teams/Email/Webhook/PagerDuty/OpsGenie; run **health**; rotate secrets.
* **Notifications → Rules**: create/edit YAML rules with linting; test with sample events; see match rate.
* **Notifications → Deliveries**: timeline with filters (status, channel, rule); inspect last error; retry.
* **Digest preview**: shows current window contents and when it will flush.

@@ -557,7 +596,7 @@ on the `notify-web` container.)

## 20) Roadmap (post-v1)

- * **PagerDuty/Opsgenie** connectors; **Jira** ticket creation.
+ * **Jira** ticket creation and downstream issue-state synchronization.
* **User inbox** (in‑app notifications) + mobile push via webhook relay.
* **Anomaly suppression**: auto‑pause noisy rules with hints (learned thresholds).
* **Graph rules**: “only notify if *not_affected → affected* transition at consensus layer”.

@@ -2,7 +2,9 @@

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

Digests coalesce multiple matching events into a single notification when rules request batched delivery. They protect responders from alert storms while preserving a deterministic record of every input.

Scheduled digest cadence is currently configured on the worker through `Notifier:DigestSchedule`. Notify does not expose a live API for creating or editing digest schedules; the `/digests` routes below operate on open digest windows that already exist because rules and worker configuration caused them to be created. The Web console must therefore surface digest schedule CRUD as unavailable and must not emulate `/api/v1/notify/digest-schedules` with synthetic runtime data.
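The colon-delimited configuration path `Notifier:DigestSchedule:Schedules:*:TenantIds` maps onto a worker settings shape like the following. The schedule name and any field other than `TenantIds` are illustrative assumptions:

```json
{
  "Notifier": {
    "DigestSchedule": {
      "Schedules": {
        "hourly-ops": {
          "TenantIds": [ "tenant-a", "tenant-b" ]
        }
      }
    }
  }
}
```

`ConfiguredDigestTenantProvider` reads tenant IDs from this configuration; there is no runtime API that mutates it.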

---

@@ -66,7 +68,9 @@ Digest state lives in PostgreSQL (`notify.digests` table) and mirrors the schema

| `GET /digests/{actionKey}` | Returns the currently open window (if any) for the referenced action. | Supports operators/CLI inspecting pending digests; requires `notify.viewer`. |
| `DELETE /digests/{actionKey}` | Drops the open window without notifying (emergency stop). | Emits an audit record; use sparingly. |

All routes honour the tenant header and reuse the standard Notify rate limits.

There is intentionally no `POST /digest-schedules` or equivalent schedule CRUD endpoint in the live API today. If product requirements later need operator-managed schedules, that work must introduce a persisted contract, docs, and startup/runtime wiring explicitly.

---