git.stella-ops.org/docs/modules/notify/operations/observability.md
StellaOps Bot 17d45a6d30
2025-11-30 15:38:14 +02:00


# Notify observability runbook (stub · 2025-11-29 demo)
## Dashboards (offline import)
- Grafana JSON: `docs/modules/notify/operations/dashboards/notify-observability.json` (import locally; no external data sources assumed).
- Planned panels: enqueue/dequeue rate, delivery latency p95/p99, channel error rate, retry/dead-letter counts, rule evaluation latency, tenant isolation breaches (should stay 0), and notification simulation outcomes.
## Key metrics
- `notify_enqueue_total{channel}`: notifications enqueued, by channel.
- `notify_delivery_latency_seconds_bucket{channel}`: histogram buckets for delivery latency, by channel.
- `notify_delivery_failures_total{channel,reason}`: failed deliveries, by channel and failure reason.
- `notify_retry_total{channel}` and `notify_deadletter_total{channel}`: retries and dead-lettered notifications, by channel.
- `notify_rule_eval_duration_seconds_bucket`: histogram buckets for rule evaluation latency.
- `notify_simulation_total{result}`: simulation outcomes when quiet-hours or correlation rules are applied.
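
The p95/p99 panels and alerts are derived from the `_bucket` histograms listed above. As a minimal sketch of how such a percentile can be estimated from cumulative bucket counts, the function below uses linear interpolation within the matching bucket, the same general approach as Prometheus's `histogram_quantile()`. The function name and the sample bucket values are illustrative assumptions, not part of the Notify service.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets.

    The bucket list must include the +Inf bucket, as Prometheus histograms do.
    """
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include an +Inf upper bound")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound  # empty bucket edge case (e.g. q == 0)
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for notify_delivery_latency_seconds_bucket.
sample = [(0.1, 50), (0.5, 90), (1.0, 98), (2.5, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, sample)
```

With these sample buckets the p99 estimate lands between the 1.0 s and 2.5 s bounds, so its precision depends on how the bucket boundaries are chosen around the alert threshold.
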
## Logs & traces
- Correlate log entries by `notificationId`, `ruleId`, `tenant`, and `channel`. Include the `quietHoursApplied`, `correlationKey`, and `retries` fields in each entry.
- Traces are disabled by default for air-gapped deployments; enable them by pointing the OTLP exporter at an on-prem collector.
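
To illustrate the log shape described above, here is a minimal sketch of a JSON formatter that always emits the correlation fields, so entries can be filtered by any of them. It uses Python's standard `logging` module purely for illustration; the Notify service's actual logging stack and the `level`/`message` key names are assumptions.

```python
import json
import logging

# Field names taken from the runbook; everything else here is illustrative.
CORRELATION_FIELDS = (
    "notificationId", "ruleId", "tenant", "channel",
    "quietHoursApplied", "correlationKey", "retries",
)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with all correlation fields present."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        for field in CORRELATION_FIELDS:
            # Missing fields are emitted as null so the schema stays stable.
            entry[field] = getattr(record, field, None)
        return json.dumps(entry, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("notify.sketch")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

A call such as `logger.info("delivered", extra={"notificationId": "n-1", "tenant": "t-a", "channel": "email", "retries": 0})` then produces a single JSON line carrying those fields.
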
## Health/diagnostics
- `/health/liveness` and `/health/readiness` check queue backend reachability and channel provider credentials.
- `/status` exposes build version, commit, feature flags; verify against offline bundle manifest.
- Simulation probe: call `/api/notify/simulate` with a sample rule set to validate correlation/digest wiring once NOTIFY-SVC-39-001..004 land.
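
The health endpoints above can be checked from a script on the same network. The sketch below assumes that an HTTP 200 response means healthy and that the service is reachable at a base URL you supply; both are assumptions, as are the function names.

```python
from typing import Callable, Dict
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

HEALTH_PATHS = ("/health/liveness", "/health/readiness")


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the service is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:  # non-2xx responses still carry a status code
        return err.code
    except (URLError, OSError):
        return 0


def probe(base_url: str, fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each health path to True when it answers 200 (assumed to mean healthy)."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in HEALTH_PATHS}
```

For example, `probe("http://localhost:8080")` (a hypothetical address) returns a dictionary keyed by path. The `fetch` parameter is injectable so the check can be exercised without network access.
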
## Alert hints
- Delivery latency p99 > 1.5s for email/webhook channels.
- Dead-letter queue growth > threshold.
- Rule evaluation latency p99 > 500ms.
- Correlation/quiet-hours simulation failures once enabled.
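
As a rough illustration of how the hints above could be evaluated against already-computed values, the sketch below encodes the two numeric thresholds from this list. The dead-letter threshold is left as a parameter because the runbook does not specify it, and the alert names are hypothetical.

```python
from typing import Dict, List

DELIVERY_P99_LIMIT_S = 1.5   # from the runbook: email/webhook delivery p99
RULE_EVAL_P99_LIMIT_S = 0.5  # from the runbook: rule evaluation p99 (500 ms)
LATENCY_CHANNELS = ("email", "webhook")


def evaluate_alerts(
    delivery_p99_s: Dict[str, float],
    rule_eval_p99_s: float,
    deadletter_growth: float,
    deadletter_threshold: float,
) -> List[str]:
    """Return the names of alert hints that fire for the given measurements."""
    alerts: List[str] = []
    for channel in LATENCY_CHANNELS:
        p99 = delivery_p99_s.get(channel)
        if p99 is not None and p99 > DELIVERY_P99_LIMIT_S:
            alerts.append(f"delivery_latency_p99:{channel}")
    if rule_eval_p99_s > RULE_EVAL_P99_LIMIT_S:
        alerts.append("rule_eval_latency_p99")
    if deadletter_growth > deadletter_threshold:
        alerts.append("deadletter_growth")
    return alerts
```

Channels other than email and webhook are ignored for the latency check, matching the scope of the first hint.
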
## Offline verification steps
1) Import the Grafana JSON locally and point it at the Prometheus scrape labeled `notify`.
2) Run `stella notify simulate --rules samples/rules.yaml --dry-run` (once available) and confirm that metrics and logs are emitted locally.
3) Fetch `/status` and compare commit/version to offline bundle manifest.
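
Step 3 can be scripted once both documents are available as JSON. The sketch below assumes the `/status` response and the offline bundle manifest each expose top-level `version` and `commit` keys; those key names are assumptions, since the runbook does not define either schema.

```python
import json
from typing import Dict, Sequence, Tuple


def status_mismatches(
    status_json: str,
    manifest_json: str,
    keys: Sequence[str] = ("version", "commit"),  # assumed key names
) -> Dict[str, Tuple[object, object]]:
    """Return {key: (status_value, manifest_value)} for every key that differs."""
    status = json.loads(status_json)
    manifest = json.loads(manifest_json)
    return {
        key: (status.get(key), manifest.get(key))
        for key in keys
        if status.get(key) != manifest.get(key)
    }
```

An empty result means the running build matches the manifest for the compared keys; any entry shows the value reported by `/status` first and the manifest value second.
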
## Evidence locations
- Sprint tracker: `docs/implplan/SPRINT_322_docs_modules_notify.md`.
- Module docs: `README.md`, `architecture.md`, `implementation_plan.md`.
- Dashboard stub: `operations/dashboards/notify-observability.json`.