Notify observability runbook (stub · 2025-11-29 demo)

Dashboards (offline import)

  • Grafana JSON: docs/modules/notify/operations/dashboards/notify-observability.json (import locally; no external data sources assumed).
  • Planned panels: enqueue/dequeue rate, delivery latency p95/p99, channel error rate, retry/dead-letter counts, rule evaluation latency, tenant isolation breaches (should stay 0), and notification simulation outcomes.

Key metrics

  • notify_enqueue_total{channel} — notifications enqueued by channel.
  • notify_delivery_latency_seconds_bucket{channel} — delivery latency per channel.
  • notify_delivery_failures_total{channel,reason} — failed deliveries.
  • notify_retry_total{channel} and notify_deadletter_total{channel} — retries and dead letters.
  • notify_rule_eval_duration_seconds_bucket — rule evaluation latency.
  • notify_simulation_total{result} — simulation outcomes when quiet-hours or correlation rules are applied.
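
The `_bucket` series above follow the Prometheus histogram naming convention, so the p95/p99 panels are estimates derived from cumulative bucket counts. The following is a minimal Python sketch of that estimate (linear interpolation inside the matching bucket, the same idea PromQL's histogram_quantile uses). The bucket bounds and counts are made-up sample values, not the service's actual bucket layout.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound,
    ending with the math.inf ("+Inf") bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical sample: 100 deliveries on one channel.
sample = [(0.1, 50), (0.5, 90), (1.0, 98), (2.5, 100), (math.inf, 100)]
p99 = estimate_quantile(0.99, sample)  # lands in the (1.0, 2.5] bucket
```

Because the result is interpolated, its precision is bounded by the bucket widths; a p99 that falls in a wide top bucket is only a rough figure.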

Logs & traces

  • Correlate by notificationId, ruleId, tenant, and channel. Include the quietHoursApplied, correlationKey, and retries fields.
  • Traces are disabled by default in air-gapped deployments; enable them by pointing the OTLP exporter at an on-prem collector.
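
As an illustration of the correlation fields listed above, here is a small Python sketch that checks log lines for those fields and groups records by notificationId. It assumes logs are emitted as one JSON object per line, which this runbook does not state; the field names come from the list above, and everything else (function names, sample values) is hypothetical.

```python
import json

# Field names taken from the runbook; the JSON-lines log shape is an assumption.
REQUIRED_FIELDS = (
    "notificationId", "ruleId", "tenant", "channel",
    "quietHoursApplied", "correlationKey", "retries",
)

def missing_fields(log_line):
    """Return the correlation fields absent from one JSON log line."""
    record = json.loads(log_line)
    return [name for name in REQUIRED_FIELDS if name not in record]

def group_by_notification(log_lines):
    """Group parsed records by notificationId, preserving input order."""
    groups = {}
    for line in log_lines:
        record = json.loads(line)
        groups.setdefault(record.get("notificationId"), []).append(record)
    return groups

# Hypothetical sample record.
example = json.dumps({
    "notificationId": "n-1", "ruleId": "r-1", "tenant": "tenant-a",
    "channel": "email", "quietHoursApplied": False,
    "correlationKey": "k-1", "retries": 0,
})
print(missing_fields(example))
```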

Health/diagnostics

  • /health/liveness and /health/readiness check queue backend reachability and channel provider credentials.
  • /status exposes build version, commit, feature flags; verify against offline bundle manifest.
  • Simulation probe: call /api/notify/simulate with a sample rule set to validate correlation/digest wiring once NOTIFY-SVC-39-001..004 land.
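
A minimal Python sketch of polling the two health endpoints from an operator workstation, using only the standard library. It assumes a healthy endpoint answers with HTTP 200, which this stub does not specify; the base URL is a placeholder.

```python
import urllib.error
import urllib.request

def probe(base_url, path, timeout=5.0, opener=urllib.request.urlopen):
    """GET base_url + path and return (status_code, body_text).

    status_code is None when the service cannot be reached at all.
    """
    url = base_url.rstrip("/") + path
    try:
        with opener(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Non-2xx answers (for example 503 from a failing readiness check).
        return err.code, err.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as err:
        return None, str(err.reason)

def check_health(base_url, opener=urllib.request.urlopen):
    """Return {path: True/False}, treating HTTP 200 as healthy (an assumption)."""
    results = {}
    for path in ("/health/liveness", "/health/readiness"):
        code, _body = probe(base_url, path, opener=opener)
        results[path] = (code == 200)
    return results

# Usage (placeholder address):
# print(check_health("http://localhost:8080"))
```

The `opener` parameter exists only so the check can be exercised without a running service.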

Alert hints

  • Delivery latency p99 > 1.5s for email/webhook channels.
  • Dead-letter queue growth > threshold.
  • Rule evaluation latency p99 > 500ms.
  • Correlation/quiet-hours simulation failures once enabled.
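
The hints above can be read as a simple threshold check over a metrics snapshot. The Python sketch below is illustrative only: the snapshot dictionary shape and the function name are hypothetical, and the dead-letter limit is passed in by the caller because this runbook does not fix a value for it.

```python
# Limits taken from the alert hints above.
DELIVERY_P99_LIMIT_SECONDS = 1.5   # email/webhook channels
RULE_EVAL_P99_LIMIT_SECONDS = 0.5  # 500 ms

def evaluate_alerts(snapshot, deadletter_growth_limit):
    """Return alert messages for one metrics snapshot.

    snapshot keys (hypothetical shape):
      delivery_p99_seconds: dict of channel -> p99 latency in seconds
      rule_eval_p99_seconds: p99 rule evaluation latency in seconds
      deadletter_growth: dead letters added during the evaluation window
      simulation_failures: failed simulation outcomes during the window
    """
    alerts = []
    delivery = snapshot.get("delivery_p99_seconds", {})
    for channel in ("email", "webhook"):
        p99 = delivery.get(channel)
        if p99 is not None and p99 > DELIVERY_P99_LIMIT_SECONDS:
            alerts.append(f"delivery p99 {p99:.2f}s on {channel} exceeds 1.5s")
    rule_p99 = snapshot.get("rule_eval_p99_seconds")
    if rule_p99 is not None and rule_p99 > RULE_EVAL_P99_LIMIT_SECONDS:
        alerts.append(f"rule evaluation p99 {rule_p99:.3f}s exceeds 0.5s")
    if snapshot.get("deadletter_growth", 0) > deadletter_growth_limit:
        alerts.append("dead-letter queue growth exceeds the configured limit")
    if snapshot.get("simulation_failures", 0) > 0:
        alerts.append("correlation/quiet-hours simulation failures observed")
    return alerts
```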

Offline verification steps

  1. Import the Grafana JSON locally and point it at the Prometheus scrape labeled notify.
  2. Run stella notify simulate --rules samples/rules.yaml --dry-run (once available) and confirm that metrics and logs are emitted locally.
  3. Fetch /status and compare commit/version to offline bundle manifest.
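
Step 3 could be scripted along the following lines. This Python sketch assumes that both the /status response and the offline bundle manifest are JSON documents exposing top-level "version" and "commit" keys; neither format is specified in this stub, so adjust the key names to the real payloads.

```python
def compare_build_info(status, manifest, keys=("version", "commit")):
    """Return {key: (status_value, manifest_value)} for every key that differs.

    An empty dict means the running build matches the manifest.
    The key names are assumptions, not a documented schema.
    """
    mismatches = {}
    for key in keys:
        if status.get(key) != manifest.get(key):
            mismatches[key] = (status.get(key), manifest.get(key))
    return mismatches

# Hypothetical payloads, already parsed from JSON.
status_doc = {"version": "1.2.3", "commit": "abc1234", "featureFlags": []}
manifest_doc = {"version": "1.2.3", "commit": "def5678"}
print(compare_build_info(status_doc, manifest_doc))
```

A missing key on either side is reported as a mismatch with None on that side, so an incomplete manifest does not pass silently.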

Evidence locations

  • Sprint tracker: docs/implplan/SPRINT_322_docs_modules_notify.md.
  • Module docs: README.md, architecture.md, implementation_plan.md.
  • Dashboard stub: operations/dashboards/notify-observability.json.