Notify observability runbook (stub · 2025-11-29 demo)
Dashboards (offline import)
- Grafana JSON: `docs/modules/notify/operations/dashboards/notify-observability.json` (import locally; no external data sources assumed).
- Planned panels: enqueue/dequeue rate, delivery latency p95/p99, channel error rate, retry/dead-letter counts, rule evaluation latency, tenant isolation breaches (should stay 0), and notification simulation outcomes.
Key metrics
- `notify_enqueue_total{channel}`: notifications enqueued by channel.
- `notify_delivery_latency_seconds_bucket{channel}`: delivery latency per channel.
- `notify_delivery_failures_total{channel,reason}`: failed deliveries.
- `notify_retry_total{channel}` and `notify_deadletter_total{channel}`: retries and dead letters.
- `notify_rule_eval_duration_seconds_bucket`: rule evaluation latency.
- `notify_simulation_total{result}`: simulation outcomes when quiet hours/correlation rules applied.
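
The p95/p99 panels are derived from the `_bucket` histogram series. As a rough offline sanity check, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation inside the matching bucket, which approximates what Prometheus's `histogram_quantile` does. The function name, bucket bounds, and counts are hypothetical and are not taken from the Notify service.

```python
import math
from typing import List, Tuple

# Each bucket is (upper_bound_seconds, cumulative_count), as exposed by a
# Prometheus histogram. The final bucket is expected to be +Inf.
Bucket = Tuple[float, float]


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations recorded

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf; fall back to the highest
                # finite upper bound.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative sample data only (not real measurements).
sample = [(0.5, 50), (1.0, 90), (2.0, 100), (math.inf, 100)]
p99 = estimate_quantile(0.99, sample)
```

The estimate is only as fine-grained as the bucket boundaries, so a p99 alert threshold is best placed on or near a bucket bound.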
Logs & traces
- Correlate by `notificationId`, `ruleId`, `tenant`, `channel`. Include `quietHoursApplied`, `correlationKey`, `retries` fields.
- Traces disabled by default for air-gap; enable by pointing OTLP exporter to on-prem collector.
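
To illustrate the correlation fields listed above, here is a minimal sketch of a structured log line that carries them, using Python's standard `logging` module. The logger name, the JSON layout, and the helper names are assumptions for illustration; the actual Notify log format is not specified in this runbook.

```python
import json
import logging

# Field names come from the runbook; everything else here is hypothetical.
CORRELATION_FIELDS = ("notificationId", "ruleId", "tenant", "channel")
CONTEXT_FIELDS = ("quietHoursApplied", "correlationKey", "retries")


class JsonFormatter(logging.Formatter):
    """Render a record as one JSON object, keeping the correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for name in CORRELATION_FIELDS + CONTEXT_FIELDS:
            # Values passed via `extra=` become attributes on the record.
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload, sort_keys=True)


def build_logger(stream=None) -> logging.Logger:
    """Create an isolated logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("notify.sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().info("delivery attempted", extra={"notificationId": "n-1", "tenant": "t-a", "channel": "email", "retries": 2})` would then emit a single line that can be filtered on any of those keys.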
Health/diagnostics
- `/health/liveness` and `/health/readiness` check queue backend reachability and channel provider credentials.
- `/status` exposes build version, commit, feature flags; verify against offline bundle manifest.
- Simulation probe: `/api/notify/simulate` with sample rule set to validate correlation/digest wiring once NOTIFY-SVC-39-001..004 land.
Alert hints
- Delivery latency p99 > 1.5s for email/webhook channels.
- Dead-letter queue growth > threshold.
- Rule evaluation latency p99 > 500ms.
- Correlation/quiet-hours simulation failures once enabled.
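
The hints above can be expressed as a small threshold check, sketched below for use against locally collected values. The `Snapshot` type and its field names are hypothetical; the 1.5 s and 500 ms limits come from the hints, while the dead-letter threshold is left as a required parameter because the runbook does not define it.

```python
from dataclasses import dataclass
from typing import Dict, List

DELIVERY_P99_LIMIT_SECONDS = 1.5    # from the alert hints
RULE_EVAL_P99_LIMIT_SECONDS = 0.5   # 500 ms, from the alert hints
LATENCY_CHANNELS = ("email", "webhook")


@dataclass
class Snapshot:
    """Hypothetical point-in-time readings; field names are illustrative."""
    delivery_p99_seconds: Dict[str, float]  # keyed by channel
    rule_eval_p99_seconds: float
    deadletter_growth: int                  # growth over the chosen window
    simulation_failures: int = 0


def evaluate_alerts(snapshot: Snapshot, deadletter_threshold: int) -> List[str]:
    """Return the names of the hints that the snapshot violates."""
    fired: List[str] = []
    for channel in LATENCY_CHANNELS:
        p99 = snapshot.delivery_p99_seconds.get(channel)
        if p99 is not None and p99 > DELIVERY_P99_LIMIT_SECONDS:
            fired.append(f"delivery_latency_p99:{channel}")
    if snapshot.deadletter_growth > deadletter_threshold:
        fired.append("deadletter_growth")
    if snapshot.rule_eval_p99_seconds > RULE_EVAL_P99_LIMIT_SECONDS:
        fired.append("rule_eval_latency_p99")
    if snapshot.simulation_failures > 0:
        fired.append("simulation_failures")
    return fired
```

In a deployed setup these checks would more likely live in alerting rules next to the dashboard; the sketch is only meant to make the thresholds concrete.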
Offline verification steps
- Import Grafana JSON locally; point to Prometheus scrape labeled `notify`.
- Run `stella notify simulate --rules samples/rules.yaml --dry-run` (once available) and ensure metrics/logs emit locally.
- Fetch `/status` and compare commit/version to offline bundle manifest.
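
The last step, comparing `/status` with the offline bundle manifest, could be scripted along these lines. This sketch assumes both documents have already been loaded as dictionaries exposing `version` and `commit` keys; those key names and the function name are assumptions, since neither payload format is specified here.

```python
from typing import Dict, List

# Assumed key names; adjust to the real /status and manifest schemas.
COMPARED_KEYS = ("version", "commit")


def find_mismatches(status: Dict[str, str], manifest: Dict[str, str]) -> List[str]:
    """List every compared key that is missing or differs between the two."""
    mismatches: List[str] = []
    for key in COMPARED_KEYS:
        status_value = status.get(key)
        manifest_value = manifest.get(key)
        if status_value is None or manifest_value is None or status_value != manifest_value:
            mismatches.append(
                f"{key}: status={status_value!r} manifest={manifest_value!r}"
            )
    return mismatches
```

An empty result means the running build matches the bundle; any entry names the field to investigate.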
Evidence locations
- Sprint tracker: `docs/implplan/SPRINT_322_docs_modules_notify.md`.
- Module docs: `README.md`, `architecture.md`, `implementation_plan.md`.
- Dashboard stub: `operations/dashboards/notify-observability.json`.