chore(docs+devops): cross-module doc sync + sprint archival moves + compose updates
Bundled pre-session doc + ops work:
- docs/modules/**: sync across advisory-ai, airgap, cli, excititor, export-center, findings-ledger, notifier, notify, platform, router, sbom-service, ui, web (architectural + operational updates)
- docs/features/**: updates to checked excititor vex pipeline, developer workspace, quick verify drawer
- docs top-level: README, quickstart, API_CLI_REFERENCE, UI_GUIDE, code-of-conduct/TESTING_PRACTICES updates
- docs/qa/feature-checks/: FLOW.md + excititor state update
- docs/implplan/: remaining sprint updates + new Concelier source credentials sprint (SPRINT_20260422_003)
- docs-archived/implplan/: 30 sprint archival moves (ElkSharp series, misc completed sprints)
- devops/compose: .env + services compose + env example + router gateway config updates

File-level granularity preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1,17 +1,28 @@
> **Scope.** Implementation‑ready architecture for **Notify** (aligned with Epic 11 – Notifications Studio): a rules‑driven, tenant‑aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator‑defined routing rules, renders **channel‑specific messages** (Slack/Teams/Email/Webhook), and delivers them **reliably** with idempotency, throttling, and digests. It is UI‑managed, auditable, and safe by default (no secrets leakage, no spam storms).
> **Scope.** Implementation‑ready architecture for **Notify** (aligned with Epic 11 – Notifications Studio): a rules‑driven, tenant‑aware notification service that consumes platform events (scan completed, report ready, rescan deltas, attestation logged, admission decisions, etc.), evaluates operator‑defined routing rules, renders **channel‑specific messages** (Slack/Teams/Email/Webhook/PagerDuty/OpsGenie), and delivers them **reliably** with idempotency, throttling, and digests. It is UI‑managed, auditable, and safe by default (no secrets leakage, no spam storms).
* **Console frontdoor compatibility (updated 2026-03-10).** The web console reaches Notifier Studio through the gateway-owned `/api/v1/notifier/*` prefix, which translates onto the service-local `/api/v2/notify/*` surface without requiring browser calls to raw service-prefixed routes.
* **Console admin routing truthfulness (updated 2026-04-21).** The console uses `/api/v1/notify/*` only for core Notify toolkit flows (channels, rules, deliveries, incidents, acknowledgements). Advanced admin configuration such as quiet-hours, throttles, escalation, and localization is owned by the Notifier frontdoor `/api/v1/notifier/* -> /api/v2/notify/*`; Platform no longer serves synthetic `/api/v1/notify/*` admin compatibility payloads. Digest schedule CRUD remains unavailable in the live API.
* **Merged Notify compat surface restoration (updated 2026-04-22).** The merged `src/Notify/*` host now maps the admin compatibility routes expected behind `/api/v1/notifier/*`, including `/api/v2/notify/channels*`, `/deliveries*`, `/simulate*`, `/quiet-hours*`, `/throttle-configs*`, `/escalation-policies*`, and `/overrides*`. Unsupported operator override CRUD now returns an explicit `501` contract response instead of a misleading `404`, and focused proof lives in `src/Notify/__Tests/StellaOps.Notify.WebService.Tests/CrudEndpointsTests.cs`.
* **Runtime durability cutover (updated 2026-04-16).** Default `src/Notifier/*` production wiring now resolves queue and storage through the shared `StellaOps.Notify.Persistence` and `StellaOps.Notify.Queue` libraries. `NullNotifyEventQueue` is allowed only in the `Testing` environment, `notify.pack_approvals` is durable, and restart-survival proof is covered by `NotifierDurableRuntimeProofTests` against real Postgres + Redis.
* **Correlation incident/throttle durability (updated 2026-04-20).** Non-testing Notify and Notifier hosts no longer keep incident correlation or throttle windows in process-local memory. Both hosts now swap `IIncidentManager` and `INotifyThrottler` onto PostgreSQL-backed runtime services using `notify.correlation_runtime_incidents` and `notify.correlation_runtime_throttle_events`, with restart-survival proof in `NotifierCorrelationDurableRuntimeTests`.
* **Localization runtime durability (updated 2026-04-20).** Non-testing Notify and Notifier hosts no longer keep tenant-managed localization bundles in process-local memory. Both hosts now swap `ILocalizationService` onto a PostgreSQL-backed runtime service using `notify.localization_bundles`, while built-in system fallback strings remain compiled defaults, with restart-survival proof in `NotifierLocalizationDurableRuntimeTests`.
* **Storm/fallback runtime durability (updated 2026-04-20).** Non-testing Notify and Notifier hosts no longer keep storm detection state, tenant fallback chains, or per-delivery fallback attempts in process-local memory. Both hosts now swap `IStormBreaker` and `IFallbackHandler` onto PostgreSQL-backed runtime services using `notify.storm_runtime_states`, `notify.storm_runtime_events`, `notify.fallback_runtime_chains`, and `notify.fallback_runtime_delivery_states`, with restart-survival proof in `NotifierStormFallbackDurableRuntimeTests`.
* **Escalation engine runtime durability (updated 2026-04-20).** Non-testing Notify and Notifier hosts no longer keep live `IEscalationEngine` state in a process-local dictionary. Both hosts now swap `IEscalationEngine` onto a PostgreSQL-backed runtime service using `notify.escalation_states`, with restart-survival proof in `NotifierEscalationRuntimeDurableTests` and startup-contract proof in `NotifyEscalationRuntimeStartupContractTests`.
* **External ack/runtime channel durability (updated 2026-04-20).** Non-testing Notifier worker hosts no longer depend on a process-local external-id bridge map or a webhook-only dispatch composition for external channels. The worker now composes `WebhookChannelDispatcher` for chat/webhook routes plus `AdapterChannelDispatcher` for `Email`, `PagerDuty`, and `OpsGenie`, durably records provider `externalId` plus `incidentId` metadata into PostgreSQL-backed delivery state, and resolves PagerDuty/OpsGenie webhook acknowledgements through PostgreSQL-backed lookup after restart. Focused proof lives in `NotifierWorkerHostWiringTests` and `NotifierAckBridgeRuntimeDurableTests`.
* **Digest scheduler runtime composition (updated 2026-04-20).** The non-testing Notifier worker now composes `DigestScheduleRunner`, `DigestGenerator`, and `ChannelDigestDistributor` in the live host. Scheduled digests remain configuration-driven and now resolve tenant IDs from `Notifier:DigestSchedule:Schedules:*:TenantIds` through `ConfiguredDigestTenantProvider` instead of the process-local `InMemoryDigestTenantProvider`. There is currently no operator-managed digest schedule CRUD surface in the live runtime; `/digests` administers open digest windows only. Focused proof lives in `NotifierWorkerHostWiringTests`.
* **Suppression admin durability (updated 2026-04-16).** Non-testing throttle configuration and operator override APIs no longer use live in-memory state. Both hosts now resolve canonical `/api/v2/throttles*` and `/api/v2/overrides*` plus legacy `/api/v2/notify/throttle-configs*` and `/api/v2/notify/overrides*` through PostgreSQL-backed suppression services, with restart-survival proof in `NotifierSuppressionDurableRuntimeTests`.
* **Escalation/on-call durability (updated 2026-04-16).** Non-testing escalation-policy and on-call schedule APIs no longer use live in-memory services or compat repositories. Both hosts now resolve canonical `/api/v2/escalation-policies*` and `/api/v2/oncall-schedules*` plus legacy `/api/v2/notify/escalation-policies*` and `/api/v2/notify/oncall-schedules*` through PostgreSQL-backed runtime services, with restart-survival proof in `NotifierEscalationOnCallDurableRuntimeTests`.
* **Quiet-hours/maintenance durability (updated 2026-04-16).** Non-testing quiet-hours calendars and maintenance windows no longer use live in-memory compat repositories or maintenance evaluators. Both hosts now resolve canonical `/api/v2/quiet-hours*` plus legacy `/api/v2/notify/quiet-hours*` and `/api/v2/notify/maintenance-windows*` through PostgreSQL-backed runtime services on the shared `notify.quiet_hours` and `notify.maintenance_windows` tables, with restart-survival proof in `NotifierQuietHoursMaintenanceDurableRuntimeTests`. Fixed-time daily/weekly cron expressions project truthfully into canonical schedules; more complex cron shapes are persisted for round-trip reads but remain inert until a cron-native evaluator lands.
* **Quiet-hours/maintenance durability (updated 2026-04-20).** Non-testing quiet-hours calendars and maintenance windows no longer use live in-memory compat repositories or maintenance evaluators. Both hosts now resolve canonical `/api/v2/quiet-hours*` plus legacy `/api/v2/notify/quiet-hours*` and `/api/v2/notify/maintenance-windows*` through PostgreSQL-backed runtime services on the shared `notify.quiet_hours` and `notify.maintenance_windows` tables, with restart-survival proof in `NotifierQuietHoursMaintenanceDurableRuntimeTests`. Fixed-time daily/weekly cron expressions still project truthfully into canonical schedules, and compat-authored cron shapes that cannot be flattened losslessly now evaluate natively from persisted `cronExpression` plus `duration` metadata instead of remaining inert after restart.
* **Security/dead-letter durability (updated 2026-04-16).** Non-testing webhook security, tenant isolation, dead-letter administration, and retention cleanup state no longer use live in-memory services. Both hosts now resolve `/api/v2/security*`, `/api/v2/notify/dead-letter*`, `/api/v1/observability/dead-letters*`, and retention endpoints through PostgreSQL-backed runtime services on shared `notify.webhook_security_configs`, `notify.webhook_validation_nonces`, `notify.tenant_resource_owners`, `notify.cross_tenant_grants`, `notify.tenant_isolation_violations`, `notify.dead_letter_entries`, `notify.retention_policies_runtime`, and `notify.retention_cleanup_executions_runtime` tables, with restart-survival proof in `NotifierSecurityDeadLetterDurableRuntimeTests`.
* **Testing-only fallback boundary (updated 2026-04-20).** `src/Notifier/*` host startup now registers those durable quiet-hours, suppression, escalation/on-call, security, and dead-letter services directly for non-testing environments instead of composing an in-memory graph and replacing it later. The remaining in-memory admin services are isolated to `Testing`, with startup-contract proof in `StartupDependencyWiringTests`.
* **Simulation runtime parity (updated 2026-04-20).** The canonical `/api/v2/simulate*` endpoints and the legacy `/api/v2/notify/simulate*` endpoints in `src/Notifier/` now resolve the same DI-composed simulation runtime, so throttling plus quiet-hours or maintenance suppression behave identically across route families.
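The frontdoor mapping described in the notes above (`/api/v1/notifier/*` onto `/api/v2/notify/*`) can be pictured as a plain prefix rewrite. The following is a minimal sketch under that assumption only; the function name `translate_frontdoor_path` is hypothetical, and the real gateway configuration is not shown in this document.

```python
from typing import Optional

GATEWAY_PREFIX = "/api/v1/notifier"  # gateway-owned prefix named in the notes
SERVICE_PREFIX = "/api/v2/notify"    # service-local surface named in the notes


def translate_frontdoor_path(path: str) -> Optional[str]:
    """Rewrite a gateway path to the service-local path.

    Returns None when the path is not under the Notifier frontdoor prefix,
    for example the core `/api/v1/notify/*` toolkit routes.
    """
    if path == GATEWAY_PREFIX:
        return SERVICE_PREFIX
    if path.startswith(GATEWAY_PREFIX + "/"):
        # Keep everything after the prefix, including the leading slash.
        return SERVICE_PREFIX + path[len(GATEWAY_PREFIX):]
    return None
```

Matching on the prefix followed by `/` keeps look-alike paths such as `/api/v1/notify/...` from being rewritten by accident.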
---
## 0) Mission & boundaries
**Mission.** Convert **facts** from Stella Ops into **actionable, noise‑controlled** signals where teams already live (chat/email/webhooks), with **explainable** reasons and deep links to the UI.
**Mission.** Convert **facts** from Stella Ops into **actionable, noise-controlled** signals where teams already live (chat, email, paging, and webhooks), with **explainable** reasons and deep links to the UI.
**Boundaries.**
@@ -46,7 +57,7 @@ src/
* **Notify.WebService** (stateless API)
* **Notify.Worker** (horizontal scale)
**Dependencies**: Authority (OpToks; DPoP/mTLS), **PostgreSQL** (notify schema), Valkey/NATS (bus), HTTP egress to Slack/Teams/Webhooks, SMTP relay for Email.
**Dependencies**: Authority (OpToks; DPoP/mTLS), **PostgreSQL** (notify schema), Valkey/NATS (bus), HTTP egress to Slack/Teams/Webhooks/PagerDuty/OpsGenie, SMTP relay for Email.
> **Configuration.** Notify.WebService bootstraps from `notify.yaml` (see `etc/notify.yaml.sample`). Use `storage.driver: postgres` and provide `postgres.notify` options (`connectionString`, `schemaName`, pool sizing, timeouts). Authority settings follow the platform defaults—when running locally without Authority, set `authority.enabled: false` and supply `developmentSigningKey` so JWTs can be validated offline.
>
@@ -66,6 +77,8 @@ src/
> ```
>
> The Offline Kit job simply copies the `plugins/notify` tree into the air-gapped bundle; the ordered list keeps connector manifests stable across environments.
>
> In the hosted Notifier worker, delivery execution is split across two deterministic dispatch paths: `WebhookChannelDispatcher` continues to handle chat/webhook routes, while `AdapterChannelDispatcher` resolves `Email`, `PagerDuty`, and `OpsGenie` through `IChannelAdapterFactory`. The provider `externalId` emitted by those adapter-backed channels must survive persistence so inbound webhook acknowledgements can be resolved after restart.
> **Authority clients.** Register two OAuth clients in StellaOps Authority: `notify-web-dev` (audience `notify.dev`) for development and `notify-web` (audience `notify`) for staging/production. Both require `notify.read` and `notify.admin` scopes and use DPoP-bound client credentials (`client_secret` in the samples). Reference entries live in `etc/authority.yaml.sample`, with placeholder secrets under `etc/secrets/notify-web*.secret.example`.
@@ -199,12 +212,14 @@ actions:
Channel config is **two‑part**: a **Channel** record (name, type, options) and a Secret **reference** (Vault/K8s Secret). Connectors are **restart-time plug-ins** discovered on service start (same manifest convention as Concelier/Excititor) and live under `plugins/notify/<channel>/`.
**Built‑in v1:**
**Built-in channels:**
* **Slack**: Bot token (xoxb‑…), `chat.postMessage` + `blocks`; rate limit aware (HTTP 429).
* **Microsoft Teams**: Incoming Webhook (or Graph card later); adaptive card payloads.
* **Email (SMTP)**: TLS (STARTTLS or implicit), From/To/CC/BCC; HTML+text alt; DKIM optional.
* **Generic Webhook**: POST JSON with HMAC signature (Ed25519 or SHA‑256) in headers.
* **PagerDuty**: Events API v2 trigger/ack/resolve flow; durable `dedup_key`/external id mapping is persisted with delivery state for restart-safe webhook acknowledgement handling.
* **OpsGenie**: Alert create/ack/close flow; alias/external id is persisted with delivery state so inbound acknowledgement webhooks remain restart-safe.
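For the Generic Webhook channel, the SHA-256 variant of the signature could look roughly like the sketch below. It covers only an HMAC-SHA-256 signature over the exact request body; the Ed25519 option is not shown. The header name `X-Example-Signature`, the `sha256=` prefix, and the compact sorted-key JSON encoding are assumptions for illustration, not the connector's documented wire format.

```python
import hashlib
import hmac
import json
from typing import Dict, Tuple


def sign_webhook_body(secret: bytes, payload: dict) -> Tuple[bytes, Dict[str, str]]:
    """Serialize the payload and compute an HMAC-SHA-256 signature over the bytes."""
    # Sign the exact bytes that will be sent; re-serializing on the receiver
    # side could change key order or whitespace and break verification.
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode("utf-8")
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-Example-Signature": "sha256=" + digest,  # hypothetical header name
    }
    return body, headers


def verify_webhook_body(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```

A receiver would call `verify_webhook_body` with the raw request bytes before parsing the JSON.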
**Connector contract:** (implemented by plug-in assemblies)
@@ -216,6 +231,8 @@ public interface INotifyConnector {
}
```
For hosted external channels, Notifier worker adapters implement `IChannelAdapter` and are selected by `AdapterChannelDispatcher`. Those adapters must emit stable provider identifiers (`externalId`, `incidentId` where applicable) so the `IAckBridge` webhook path can recover correlation from persisted delivery rows instead of process-local memory.
**DeliveryContext** includes **rendered content** and **raw event** for audit.
**Test-send previews.** Plug-ins can optionally implement `INotifyChannelTestProvider` to shape `/channels/{id}/test` responses. Providers receive a sanitised `ChannelTestPreviewContext` (channel, tenant, target, timestamp, trace) and return a `NotifyDeliveryRendered` preview + metadata. When no provider is present, the host falls back to a generic preview so the endpoint always responds.
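The optional-provider fallback described above can be sketched as follows. This is a language-neutral illustration in Python rather than the plug-in contract itself: `build_test_preview`, the dictionary shapes, and the `previewProvider` metadata key are hypothetical stand-ins for `INotifyChannelTestProvider`, `ChannelTestPreviewContext`, and `NotifyDeliveryRendered`.

```python
from typing import Callable, Optional

# A provider takes the sanitised preview context and returns a rendered preview.
PreviewProvider = Callable[[dict], dict]


def build_test_preview(context: dict, provider: Optional[PreviewProvider] = None) -> dict:
    """Use the channel's own preview provider when present, else a generic preview."""
    if provider is not None:
        return provider(context)
    # Generic fallback so the test endpoint always has something to return.
    return {
        "title": "Test notification for " + context["channel"],
        "body": "This is a test message sent to " + context["target"] + ".",
        "metadata": {"previewProvider": "generic"},
    }
```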
@@ -274,23 +291,43 @@ Canonical JSON Schemas for rules/channels/events live in `docs/modules/notify/re
```
{ _id, tenantId, ruleId, actionId, eventId, kind, scope, status:"sent|failed|throttled|digested|dropped",
externalId?, metadata?,
attempts:[{ts, status, code, reason}],
rendered:{ title, body, target }, // redacted for PII; body hash stored
sentAt, lastError? }
```
PagerDuty and OpsGenie deliveries durably carry the provider `externalId` plus `metadata.incidentId` so inbound webhook acknowledgements can be resolved after worker restart without relying on a process-local bridge map.
* `digests`
```
{ _id, tenantId, actionKey, window:"hourly", openedAt, items:[{eventId, scope, delta}], status:"open|flushed" }
```
* `throttles`
* `correlation_runtime_incidents`
```
{ key:"idem:<hash>", ttlAt } // short-lived, also cached in Valkey
{ tenantId, incidentId, correlationKey, eventKind, title, status:"open|acknowledged|resolved",
eventCount, firstOccurrence, lastOccurrence, acknowledgedBy?, resolvedBy?, eventIds:[eventId...] }
```
* `correlation_runtime_throttle_events`
```
{ tenantId, correlationKey, occurredAt } // short-lived, also cached in Valkey
```
* `escalation_states`
```
{ tenantId, policyId, incidentId?, correlationId, currentStep, repeatIteration,
status:"active|acknowledged|resolved|expired", startedAt, nextEscalationAt,
acknowledgedAt?, acknowledgedBy?, metadata }
```
`correlationId` is the durable lookup key for the live string incident id used by the runtime engine. `metadata` carries the runtime-only fields that do not fit the canonical columns yet: `stateId`, external `policyId`, `levelStartedAt`, terminal runtime status (`stopped|exhausted`), `stoppedAt`, `stoppedReason`, and the full escalation `history`.
**Indexes**: rules by `{tenantId, enabled}`, deliveries by `{tenantId, sentAt desc}`, digests by `{tenantId, actionKey}`.
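One way to read the `correlation_runtime_throttle_events` shape above is as input to a sliding-window check: count recent events for a `{tenantId, correlationKey}` pair and throttle once a limit is reached. The sketch below assumes that reading; the 15-minute window, the `max_events` limit, and the function name `is_throttled` are illustrative defaults, not values taken from the service.

```python
from datetime import datetime, timedelta
from typing import Iterable


def is_throttled(
    events: Iterable[dict],
    tenant_id: str,
    correlation_key: str,
    now: datetime,
    window: timedelta = timedelta(minutes=15),  # assumed window length
    max_events: int = 1,                        # assumed per-window limit
) -> bool:
    """Return True when the key already has max_events inside the window."""
    cutoff = now - window
    recent = sum(
        1
        for event in events
        if event["tenantId"] == tenant_id
        and event["correlationKey"] == correlation_key
        and event["occurredAt"] > cutoff
    )
    return recent >= max_events
```

Because the check only reads persisted rows, it gives the same answer before and after a process restart, which is the property the durable tables are meant to provide.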
---
@@ -344,6 +381,8 @@ To support one-click acknowledgements from chat/email, the Notify WebService min
Authority signs ack tokens using keys configured under `notifications.ackTokens`. Public JWKS responses expose these keys with `use: "notify-ack"` and `status: active|retired`, enabling offline verification by the worker/UI/CLI.
Inbound PagerDuty and OpsGenie acknowledgement webhooks must resolve provider identifiers from durable delivery state (`externalId` plus incident metadata), not from process-local runtime maps. Restart-survival is a required property of the non-testing host composition.
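The restart-survival requirement above amounts to looking up the delivery by provider identifier in persistent storage rather than in a process-local map. The sketch below illustrates the idea with SQLite standing in for PostgreSQL; the table layout, column names, and function names are hypothetical and do not reflect the actual `notify` schema.

```python
import json
import sqlite3
from typing import Optional


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the stand-in delivery store at the given file path."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS deliveries ("
        "delivery_id TEXT PRIMARY KEY, tenant_id TEXT, channel_type TEXT, "
        "external_id TEXT, metadata TEXT)"
    )
    return conn


def record_delivery(conn, delivery_id, tenant_id, channel_type, external_id, incident_id):
    """Persist the provider externalId and incidentId alongside the delivery."""
    conn.execute(
        "INSERT INTO deliveries VALUES (?, ?, ?, ?, ?)",
        (delivery_id, tenant_id, channel_type, external_id,
         json.dumps({"incidentId": incident_id})),
    )
    conn.commit()


def resolve_ack(conn, channel_type: str, external_id: str) -> Optional[dict]:
    """Resolve an inbound acknowledgement webhook from persisted state only."""
    row = conn.execute(
        "SELECT delivery_id, tenant_id, metadata FROM deliveries "
        "WHERE channel_type = ? AND external_id = ?",
        (channel_type, external_id),
    ).fetchone()
    if row is None:
        return None
    return {
        "deliveryId": row[0],
        "tenantId": row[1],
        "incidentId": json.loads(row[2]).get("incidentId"),
    }
```

Closing the connection and reopening the same file simulates a worker restart: `resolve_ack` still finds the delivery because nothing depends on in-memory state.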
**Ingestion**: workers do **not** expose public ingestion; they **subscribe** to the internal bus. (Optional `/events/test` for integration testing, admin-only.)
---
@@ -357,7 +396,7 @@ Authority signs ack tokens using keys configured under `notifications.ackTokens`
* **Ingestor**: N consumers with per‑key ordering (key = tenant|digest|namespace).
* **RuleMatcher**: loads active rules snapshot for tenant into memory; vectorized predicate check.
* **Throttle/Dedupe**: consult Valkey + PostgreSQL `throttles`; if hit → record `status=throttled`.
* **Throttle/Dedupe**: consult Valkey plus PostgreSQL `notify.correlation_runtime_throttle_events`; if hit → record `status=throttled`.
* **DigestCoalescer**: append to open digest window or flush when timer expires.
* **Renderer**: select template (channel+locale), inject variables, enforce length limits, compute `bodyHash`.
* **Connector**: send; handle provider‑specific rate limits and backoffs; `maxAttempts` with exponential jitter; overflow → DLQ (dead‑letter topic) + UI surfacing.
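The Connector step's retry behaviour (`maxAttempts` with exponential backoff and jitter, then overflow to the DLQ) can be sketched as below. The "full jitter" variant, the base and cap values, and the names `backoff_delays` and `deliver_with_retry` are assumptions chosen for illustration; the worker's actual policy is not specified in this excerpt.

```python
import random
from typing import Callable, List


def backoff_delays(
    max_attempts: int,
    base_seconds: float = 1.0,
    cap_seconds: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> List[float]:
    """Delays to wait between attempts: random in [0, min(cap, base * 2**n))."""
    return [
        rng() * min(cap_seconds, base_seconds * 2 ** n)
        for n in range(max_attempts - 1)
    ]


def deliver_with_retry(
    send: Callable[[], None],
    max_attempts: int,
    dead_letter: Callable[[list], None],
    sleep: Callable[[float], None] = lambda seconds: None,  # injectable for tests
) -> list:
    """Try to send up to max_attempts times; hand the attempt log to the DLQ on overflow."""
    attempts = []
    delays = backoff_delays(max_attempts)
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            attempts.append({"attempt": attempt, "status": "sent"})
            return attempts
        except Exception as exc:  # a real connector would classify retryable errors
            attempts.append({"attempt": attempt, "status": "failed", "reason": str(exc)})
            if attempt < max_attempts:
                sleep(delays[attempt - 1])
    dead_letter(attempts)
    return attempts
```

The returned attempt log loosely mirrors the `attempts:[{ts, status, code, reason}]` field on deliveries, which keeps a retry history available for the UI.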
@@ -448,7 +487,7 @@ notify:
## 14) UI touch‑points
* **Notifications → Channels**: add Slack/Teams/Email/Webhook; run **health**; rotate secrets.
* **Notifications → Channels**: add Slack/Teams/Email/Webhook/PagerDuty/OpsGenie; run **health**; rotate secrets.
* **Notifications → Rules**: create/edit YAML rules with linting; test with sample events; see match rate.
* **Notifications → Deliveries**: timeline with filters (status, channel, rule); inspect last error; retry.
* **Digest preview**: shows current window contents and when it will flush.
@@ -557,7 +596,7 @@ on the `notify-web` container.)
## 20) Roadmap (post-v1)
* **PagerDuty/Opsgenie** connectors; **Jira** ticket creation.
* **Jira** ticket creation and downstream issue-state synchronization.
* **User inbox** (in‑app notifications) + mobile push via webhook relay.
* **Anomaly suppression**: auto‑pause noisy rules with hints (learned thresholds).
* **Graph rules**: “only notify if *not_affected → affected* transition at consensus layer”.