# Implementation plan — Notify
## Delivery phases
- **Phase 1: Core rules engine & delivery ledger**
  Implement rules/channels schema, event ingestion, rule evaluation, idempotent deliveries, and audit logging.
- **Phase 2: Connectors & rendering**
  Ship Slack/Teams/Email/Webhook connectors, template rendering, localization, throttling, retries, and secret referencing.
- **Phase 3: Console & CLI authoring**
  Provide UI/CLI for rule authoring, previews, channel health, delivery browsing, digests, and test sends.
- **Phase 4: Governance & observability**
  Add approvals, RBAC, tenant quotas, Notify metrics/logs/traces, dashboards, Notify-specific alerts, and Notify runbooks.
- **Phase 5: Offline & compliance**
  Produce Offline Kit bundles (rules/channels/deploy scripts), signed exports, retention policies, and auditing for regulated environments.
## Work breakdown
- **Service & worker**
  - REST API for rules/channels/delivery history, idempotency middleware, digest scheduler.
  - Worker pipelines for event intake, rule matching, template rendering, delivery execution, retries, and throttling.
  - Delivery ledger capturing payload metadata, response, retry state, and DSSE signatures.
- **Connectors**
  - Slack/Teams/Email/Webhook plug-ins with configuration validation, rate limiting, and error classification.
  - Secrets referenced via Authority/Secret store; no plaintext storage.
- **Console & CLI**
  - Console module for rules builder, condition editor, preview, test send, delivery insights, digests, and schedule configuration.
  - CLI (`stella notify rule|channel|delivery`) for automation and export/import.
- **Integrations**
  - Event sources: Concelier, Excititor, Policy Engine, Vuln Explorer, Export Center, Attestor, Zastava, Scheduler.
  - Notify's own events routed back into Notify (meta-notifications) for failure escalations and accepted-risk expiration reminders.
- **Observability & ops**
  - Metrics: delivery success/failure, retry counts, throttle hits, digest generation, channel health.
  - Logs/traces with tenant, rule ID, channel, correlation ID; dashboards and alerts.
  - Runbooks for misconfigured channels, throttling, event backlog, incident digest.
- **Docs & compliance**
  - Update Notifications Studio guides, channel runbooks, security/RBAC docs, Offline Kit instructions.
  - Provide compliance checklist (audit logging, retention, opt-out).
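
The idempotency middleware and delivery ledger listed above could be approached roughly as follows. This is a minimal in-memory sketch, not the service's actual design: the key composition (tenant, rule ID, channel, event ID), the `DeliveryLedger` class, and its field names are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


def idempotency_key(tenant: str, rule_id: str, channel: str, event_id: str) -> str:
    """Derive a stable key so one event/rule/channel tuple maps to one delivery."""
    # The unit-separator character keeps field boundaries unambiguous,
    # provided the fields themselves never contain it (an assumption here).
    material = "\x1f".join([tenant, rule_id, channel, event_id])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


@dataclass
class DeliveryLedger:
    """Hypothetical in-memory stand-in for the persistent delivery ledger."""

    entries: dict = field(default_factory=dict)

    def record_once(self, tenant: str, rule_id: str, channel: str, event_id: str) -> bool:
        """Return True if a new delivery was recorded, False if it is a replay."""
        key = idempotency_key(tenant, rule_id, channel, event_id)
        if key in self.entries:
            return False  # already recorded: do not deliver again
        self.entries[key] = {
            "tenant": tenant,
            "rule_id": rule_id,
            "channel": channel,
            "event_id": event_id,
            "attempts": 0,  # retry state would be tracked here
        }
        return True
```

A worker would call `record_once` before dispatching and skip the send when it returns `False`, so redelivered events do not produce duplicate notifications.
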
## Acceptance criteria
- Rules evaluate deterministically per event; deliveries idempotent with audit trail and DSSE signatures.
- Channel connectors support retries, rate limits, health checks, previews; secrets referenced securely.
- Console/CLI support rule creation, testing, digests, delivery browsing, and export/import workflows.
- Observability dashboards track delivery health; alerts fire for sustained failures or backlog; runbooks cover remediation.
- Offline Kit bundle contains configs, rules, digests, and deployment scripts for air-gapped installs.
- Notify respects tenancy and RBAC; governance (approvals, change log) enforced for high-impact rules.
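
To make "rules evaluate deterministically per event" concrete, here is one possible reading as a sketch: evaluation is a pure function of the rule set and the event, and results come back in a fixed order regardless of how the rules were loaded. The `Rule` shape, the equality-only `match` dictionary, and the event field names are hypothetical, not the Notify schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    tenant: str
    match: dict  # hypothetical: event field -> required value
    channel: str


def evaluate(rules, event):
    """Return (rule_id, channel) pairs for matching rules, ordered by rule ID."""
    hits = []
    # Sorting by rule ID makes the output independent of input order.
    for rule in sorted(rules, key=lambda r: r.rule_id):
        if rule.tenant != event.get("tenant"):
            continue  # tenant isolation: never match across tenants
        if all(event.get(k) == v for k, v in rule.match.items()):
            hits.append((rule.rule_id, rule.channel))
    return hits
```

Because the function reads no clock, random source, or shared state, replaying the same event against the same rules yields the same deliveries, which is what an audit trail needs.
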
## Risks & mitigations
- **Notification storms:** throttling, digests, dedupe windows, preview/test gating.
- **Secret compromise:** secret references only, rotation workflows, audit logging.
- **Connector API changes:** versioned adapter layer, nightly health checks, fallback channels.
- **Noise vs signal:** simulation previews, metrics, rule scoring, recommended defaults.
- **Offline parity:** export/import of rules, connectors, and digests with signed manifests.
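
As an illustration of the dedupe-window mitigation for notification storms, the sketch below suppresses repeats of the same key inside a fixed window. The class name, the per-key granularity, and the caller-supplied clock are assumptions for the example; the plan does not specify how windows are keyed or stored.

```python
class DedupeWindow:
    """Suppress repeats of the same key seen within `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._last_sent = {}  # key -> timestamp of the last allowed send

    def allow(self, key: str, now: float) -> bool:
        """Return True if a notification for `key` may be sent at time `now`."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # still inside the window: suppress
        self._last_sent[key] = now
        return True
```

Passing `now` explicitly rather than reading the system clock keeps the behaviour testable and repeatable; a key might be, for example, a tenant, rule ID, and channel combined.
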
## Test strategy
- **Unit:** rule evaluation, template rendering, connector clients, throttling, digests.
- **Integration:** end-to-end events from core services, multi-channel routing, retries, audit logging.
- **Performance:** burst throttling, digest creation, large rule sets.
- **Security:** RBAC tests, tenant isolation, secret reference validation, DSSE signature verification.
- **Offline:** export/import round-trips, Offline Kit deployment, manual delivery replay.
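
An export/import round-trip test could look something like the following sketch. It uses canonical JSON plus a plain SHA-256 digest as a stand-in for integrity checking; the real Offline Kit is described as using signed manifests, which this example does not implement. The bundle layout and function names are hypothetical.

```python
import hashlib
import json


def export_bundle(rules: list) -> dict:
    """Serialize rules canonically and attach a SHA-256 digest of the payload."""
    # sort_keys and fixed separators give byte-identical output for equal data.
    payload = json.dumps(rules, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"payload": payload, "sha256": digest}


def import_bundle(bundle: dict) -> list:
    """Verify the digest before parsing; raise ValueError on mismatch."""
    actual = hashlib.sha256(bundle["payload"].encode("utf-8")).hexdigest()
    if actual != bundle["sha256"]:
        raise ValueError("bundle digest mismatch")
    return json.loads(bundle["payload"])
```

A round-trip check then asserts that `import_bundle(export_bundle(rules)) == rules` and that a tampered payload is rejected.
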
## Definition of done
- Notify service, workers, connectors, Console/CLI, observability, and Offline Kit assets shipped with documentation and runbooks.
- Compliance checklist appended to docs; `./TASKS.md` and `../../TASKS.md` updated with progress.
---
## Sprint readiness tracker
> Last updated: 2025-11-27 (NOTIFY-ENG-0001)

This section maps delivery phases to implementation sprints and tracks readiness checkpoints.
### Phase 1 — Core rules engine & delivery ledger
| Task ID | Status | Sprint | Notes |
|---------|--------|--------|-------|
| NOTIFY-SVC-37-001 | ✅ DONE (2025-11-24) | SPRINT_0172_0001_0002_notifier_ii | Pack approval contract published (OpenAPI schema, payloads). |
| NOTIFY-SVC-37-002 | ✅ DONE (2025-11-24) | SPRINT_0172_0001_0002_notifier_ii | Ingestion endpoint with Mongo persistence, idempotent writes, audit trail. |
| NOTIFY-SVC-37-003 | 🔄 DOING | SPRINT_0172_0001_0002_notifier_ii | Approval/policy templates, routing predicates; dispatch/rendering pending. |
| NOTIFY-SVC-37-004 | ✅ DONE (2025-11-24) | SPRINT_0172_0001_0002_notifier_ii | Acknowledgement API, test harness, metrics. |
| NOTIFY-OAS-61-001 | ✅ DONE (2025-11-17) | SPRINT_0171_0001_0001_notifier_i | OAS with rules/templates/incidents/quiet hours endpoints. |
| NOTIFY-OAS-61-002 | ✅ DONE (2025-11-17) | SPRINT_0171_0001_0001_notifier_i | `/.well-known/openapi` discovery endpoint. |
| NOTIFY-OAS-62-001 | ✅ DONE (2025-11-17) | SPRINT_0171_0001_0001_notifier_i | SDK examples for rule CRUD. |
| NOTIFY-OAS-63-001 | ✅ DONE (2025-11-17) | SPRINT_0171_0001_0001_notifier_i | Deprecation headers and templates. |

**Checkpoint:** Core rules engine mostly complete; template dispatch/rendering in progress.
### Phase 2 — Connectors & rendering
| Task ID | Status | Sprint | Notes |
|---------|--------|--------|-------|
| NOTIFY-SVC-38-002 | 📝 TODO | SPRINT_0172_0001_0002_notifier_ii | Channel adapters (email, chat webhook, generic webhook) with retry policies. |
| NOTIFY-SVC-38-003 | 📝 TODO | SPRINT_0172_0001_0002_notifier_ii | Template service, renderer with redaction and localization. |
| NOTIFY-SVC-38-004 | 📝 TODO | SPRINT_0172_0001_0002_notifier_ii | REST + WS APIs for rules CRUD, templates preview, incidents. |
| NOTIFY-DOC-70-001 | ✅ DONE (2025-11-02) | SPRINT_0171_0001_0001_notifier_i | Architecture docs for `src/Notify` vs `src/Notifier` split. |

**Checkpoint:** Connector and rendering implementation not yet started (architecture docs are done); depends on Phase 1 completion.
### Phase 3 — Console & CLI authoring
| Task ID | Status | Sprint | Notes |
|---------|--------|--------|-------|
| NOTIFY-SVC-39-001 | 📝 TODO | SPRINT_0172_0001_0002_notifier_ii | Correlation engine with throttler, quiet hours, incident lifecycle. |
| NOTIFY-SVC-39-002 | 📝 TODO | SPRINT_0172_0001_0002_notifier_ii | Digest generator with schedule runner. |
| NOTIFY-SVC-39-003 | 📝 TODO | SPRINT_0172_0001_0002_notifier_ii | Simulation engine for dry-run rules against historical events. |
| NOTIFY-SVC-39-004 | 📝 TODO | SPRINT_0172_0001_0002_notifier_ii | Quiet hour calendars with audit logging. |

**Checkpoint:** Console/CLI authoring work not started; depends on Phase 2 completion.
### Phase 4 — Governance & observability
| Task ID | Status | Sprint | Notes |
|---------|--------|--------|-------|
| NOTIFY-SVC-40-001 | 📝 TODO | SPRINT_0172_0001_0002_notifier_ii | Escalations, on-call schedules, PagerDuty/OpsGenie adapters. |
| NOTIFY-SVC-40-002 | 📝 TODO | SPRINT_0172_0001_0002_notifier_ii | Summary storm breaker, localization bundles. |
| NOTIFY-SVC-40-003 | 📝 TODO | SPRINT_0172_0001_0002_notifier_ii | Security hardening (signed ack links, webhook HMAC). |
| NOTIFY-SVC-40-004 | 📝 TODO | SPRINT_0172_0001_0002_notifier_ii | Observability metrics/traces, dead-letter handling, chaos tests. |
| NOTIFY-OBS-51-001 | ✅ DONE (2025-11-22) | SPRINT_0171_0001_0001_notifier_i | SLO evaluator webhooks with templates/routing/suppression. |
| NOTIFY-OBS-55-001 | ✅ DONE (2025-11-22) | SPRINT_0171_0001_0001_notifier_i | Incident mode templates with evidence/trace/retention context. |
| NOTIFY-ATTEST-74-001 | ✅ DONE (2025-11-16) | SPRINT_0171_0001_0001_notifier_i | Templates for verification failures, key revocations, transparency. |
| NOTIFY-ATTEST-74-002 | 📝 TODO | SPRINT_0171_0001_0001_notifier_i | Wire notifications to key rotation/revocation events. |
| NOTIFY-RISK-66-001 | ⏳ BLOCKED | SPRINT_0171_0001_0001_notifier_i | Risk severity escalation triggers; needs POLICY-RISK-40-002. |
| NOTIFY-RISK-67-001 | ⏳ BLOCKED | SPRINT_0171_0001_0001_notifier_i | Risk profile publish/deprecate notifications. |
| NOTIFY-RISK-68-001 | ⏳ BLOCKED | SPRINT_0171_0001_0001_notifier_i | Per-profile routing, quiet hours, dedupe. |

**Checkpoint:** Observability templates (SLO webhooks, incident mode) complete; NOTIFY-SVC-40-001 through NOTIFY-SVC-40-004 not started; risk notifications blocked on upstream dependency POLICY-RISK-40-002.
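
For the webhook HMAC item in NOTIFY-SVC-40-003, a common pattern is to sign the raw request body with a shared secret and have the receiver recompute and compare the signature. The sketch below shows that general pattern with HMAC-SHA256; the function names, the hex encoding, and how the signature is transported (for example, in a header) are assumptions, not the Notify contract.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the raw webhook body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_payload(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, signature)
```

The receiver should verify against the exact bytes it received, before any JSON parsing or re-serialization, since whitespace changes alter the digest.
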
### Phase 5 — Offline & compliance
| Task ID | Status | Sprint | Notes |
|---------|--------|--------|-------|
| NOTIFY-AIRGAP-56-002 | ✅ DONE | SPRINT_0171_0001_0001_notifier_i | Bootstrap Pack with deterministic secrets and offline validation. |
| NOTIFY-TEN-48-001 | ⏳ BLOCKED | SPRINT_0173_0001_0003_notifier_iii | Tenant-scope rules/templates; needs Sprint 0172 tenancy model. |

**Checkpoint:** Offline basics complete; tenancy work blocked on upstream Sprint 0172.

---
### Overall readiness summary
| Phase | Status | Blocking items |
|-------|--------|----------------|
| **1 Core rules engine** | 🔄 In progress | NOTIFY-SVC-37-003 dispatch/rendering |
| **2 Connectors & rendering** | 📝 Not started | Phase 1 completion |
| **3 Console & CLI** | 📝 Not started | Phase 2 completion |
| **4 Governance & observability** | 🔄 Partial | POLICY-RISK-40-002 for risk notifications |
| **5 Offline & compliance** | 🔄 Partial | Sprint 0172 tenancy model |
### Cross-module dependencies
| Dependency | Required by | Status |
|------------|-------------|--------|
| Attestor payload localization | NOTIFY-ATTEST-74-002 | Freeze pending |
| POLICY-RISK-40-002 export | NOTIFY-RISK-66/67/68 | BLOCKED |
| Sprint 0172 tenancy model | NOTIFY-TEN-48-001 | In progress |
| Telemetry SLO webhook schema | NOTIFY-OBS-51-001 | ✅ Published (`docs/notifications/slo-webhook-schema.md`) |
### Next actions
1. Complete NOTIFY-SVC-37-003 dispatch/rendering wiring (Sprint 0172).
2. Start NOTIFY-SVC-38-002 channel adapters once Phase 1 closes.
3. Track POLICY-RISK-40-002 to unblock risk notification tasks.
4. Monitor Sprint 0172 tenancy model for NOTIFY-TEN-48-001.