# Notify observability runbook (stub · 2025-11-29 demo)
## Dashboards (offline import)
- Grafana JSON: `docs/modules/notify/operations/dashboards/notify-observability.json` (import locally; no external data sources assumed).
- Planned panels: enqueue/dequeue rate, delivery latency p95/p99, channel error rate, retry/dead-letter counts, rule evaluation latency, tenant isolation breaches (should stay 0), and notification simulation outcomes.
## Key metrics
- `notify_enqueue_total{channel}` — notifications enqueued by channel.
- `notify_delivery_latency_seconds_bucket{channel}` — delivery latency per channel.
- `notify_delivery_failures_total{channel,reason}` — failed deliveries.
- `notify_retry_total{channel}` and `notify_deadletter_total{channel}` — retries and dead letters.
- `notify_rule_eval_duration_seconds_bucket` — rule evaluation latency.
- `notify_simulation_total{result}` — simulation outcomes when quiet-hours or correlation rules are applied.
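
The `_bucket` metrics above are cumulative histograms, so percentile panels such as p95/p99 have to be estimated from bucket counts. The following is a minimal sketch of that estimation using linear interpolation inside the matching bucket, similar in spirit to what Prometheus does server-side. The function name and the sample bucket values are hypothetical and only illustrate the arithmetic.

```python
import math
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound_seconds, count) buckets.

    The final bucket must have an upper bound of +Inf, as histogram
    exporters normally provide. Values are assumed to be non-negative.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must end with a +Inf upper bound")

    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations recorded

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: report the
                # highest finite bound instead of infinity.
                return prev_bound
            if count == prev_count:
                return bound
            # Interpolate linearly between the bucket edges.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for one channel (illustrative values only).
sample = [(0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]
p99 = estimate_quantile(0.99, sample)
```

With these sample buckets the 99th observation falls halfway through the 1.0 to 2.0 second bucket, so the estimate is 1.5 seconds. Accuracy depends on the bucket boundaries chosen by the service.
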
## Logs & traces
- Correlate by `notificationId`, `ruleId`, `tenant`, and `channel`. Include the `quietHoursApplied`, `correlationKey`, and `retries` fields.
- Traces are disabled by default for air-gapped deployments; enable them by pointing the OTLP exporter at an on-prem collector.
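
As an illustration of the correlation fields listed above, here is a minimal sketch of a structured log entry builder. The JSON-lines shape, the `message` key, the helper name, and the sample values are assumptions made for this example; only the field names come from this runbook.

```python
import json

# Field names taken from this runbook.
CORRELATION_FIELDS = ("notificationId", "ruleId", "tenant", "channel")
CONTEXT_FIELDS = ("quietHoursApplied", "correlationKey", "retries")


def build_log_entry(message: str, **fields: object) -> str:
    """Return one JSON log line carrying the correlation fields.

    Raises ValueError when a correlation field is missing, so gaps are
    caught before logs reach the aggregator.
    """
    missing = [name for name in CORRELATION_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"missing correlation fields: {', '.join(missing)}")

    entry = {"message": message}
    for name in CORRELATION_FIELDS + CONTEXT_FIELDS:
        if name in fields:
            entry[name] = fields[name]
    return json.dumps(entry, sort_keys=True)


# Hypothetical values, for illustration only.
line = build_log_entry(
    "delivery attempted",
    notificationId="n-001",
    ruleId="r-042",
    tenant="tenant-a",
    channel="email",
    quietHoursApplied=False,
    correlationKey="ck-7",
    retries=1,
)
```

Emitting the same keys on every entry lets a single query on `notificationId` follow one notification across enqueue, rule evaluation, and delivery.
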
## Health/diagnostics
- `/health/liveness` and `/health/readiness` check queue backend reachability and channel provider credentials.
- `/status` exposes build version, commit, feature flags; verify against offline bundle manifest.
- Simulation probe: call `/api/notify/simulate` with a sample rule set to validate correlation/digest wiring once NOTIFY-SVC-39-001..004 land.
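
The health endpoints above can be checked from a script. The sketch below assumes that an HTTP 200 response means healthy, which this runbook does not state; the helper names and the base URL in the usage note are hypothetical. The HTTP call is injected so the logic can be exercised without a running service.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Paths taken from this runbook.
HEALTH_PATHS = ("/health/liveness", "/health/readiness")


def http_status(url: str) -> int:
    """Return the HTTP status code for url, or 0 if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error status
    except urllib.error.URLError:
        return 0  # connection refused, DNS failure, timeout, etc.


def probe_health(base_url: str, fetch_status: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each health path to True when it answers 200 (assumed healthy)."""
    base = base_url.rstrip("/")
    return {path: fetch_status(base + path) == 200 for path in HEALTH_PATHS}
```

For example, `probe_health("http://localhost:8080")` would query both endpoints on a local instance; the port is a placeholder and should match the actual deployment.
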
## Alert hints
- Delivery latency p99 > 1.5s for email/webhook channels.
- Dead-letter queue growth > threshold.
- Rule evaluation latency p99 > 500ms.
- Correlation/quiet-hours simulation failures once enabled.
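
The hints above can be expressed as a small evaluation function, for instance to sanity-check thresholds against captured values before writing real alert rules. This is a sketch only: the snapshot type, its field names, and the alert labels are hypothetical. The dead-letter threshold is a required parameter because this runbook does not define one.

```python
from dataclasses import dataclass
from typing import Dict, List

# Thresholds taken from the alert hints in this runbook.
DELIVERY_P99_LIMIT_S = 1.5
RULE_EVAL_P99_LIMIT_S = 0.5
LATENCY_CHANNELS = ("email", "webhook")


@dataclass
class NotifySnapshot:
    """Hypothetical point-in-time view of the Notify metrics."""
    delivery_p99_s: Dict[str, float]  # p99 delivery latency per channel
    rule_eval_p99_s: float            # p99 rule evaluation latency
    deadletter_growth: float          # dead-letter growth over the window
    simulation_failures: int = 0      # failed simulations, once enabled


def alert_hints(snap: NotifySnapshot, deadletter_threshold: float) -> List[str]:
    """Return the labels of all hints that would fire for this snapshot."""
    fired: List[str] = []
    for channel in LATENCY_CHANNELS:
        if snap.delivery_p99_s.get(channel, 0.0) > DELIVERY_P99_LIMIT_S:
            fired.append(f"delivery_latency_p99:{channel}")
    if snap.deadletter_growth > deadletter_threshold:
        fired.append("deadletter_growth")
    if snap.rule_eval_p99_s > RULE_EVAL_P99_LIMIT_S:
        fired.append("rule_eval_latency_p99")
    if snap.simulation_failures > 0:
        fired.append("simulation_failures")
    return fired
```

Only the email and webhook channels are checked for delivery latency, matching the hint as written; other channels are ignored by this sketch.
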
## Offline verification steps
1) Import the Grafana JSON locally and point it at the Prometheus scrape labeled `notify`.
2) Run `stella notify simulate --rules samples/rules.yaml --dry-run` (once available) and confirm that metrics and logs are emitted locally.
3) Fetch `/status` and compare commit/version to offline bundle manifest.
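
Step 3 can be scripted once the `/status` payload and the offline bundle manifest are both available as parsed JSON. The sketch below assumes both documents expose top-level `version` and `commit` keys; those key names and the helper name are assumptions, so adjust them to the real payloads.

```python
from typing import List, Mapping

# Assumed key names; neither payload format is defined in this runbook.
COMPARED_KEYS = ("version", "commit")


def status_mismatches(status: Mapping[str, object], manifest: Mapping[str, object]) -> List[str]:
    """Describe every compared key that is missing or differs."""
    problems: List[str] = []
    for key in COMPARED_KEYS:
        actual, expected = status.get(key), manifest.get(key)
        if actual is None or expected is None:
            problems.append(f"{key}: missing (status={actual!r}, manifest={expected!r})")
        elif actual != expected:
            problems.append(f"{key}: status={actual!r} manifest={expected!r}")
    return problems
```

An empty list means the running build matches the bundle. Any entry names the key to investigate before treating the deployment as verified.
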
## Evidence locations
- Sprint tracker: `docs/implplan/SPRINT_322_docs_modules_notify.md`.
- Module docs: `README.md`, `architecture.md`, `implementation_plan.md`.
- Dashboard stub: `operations/dashboards/notify-observability.json`.