Notifier Runbook

Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)

Purpose

Operational steps to deploy, monitor, and recover the Notifications service (WebService + Worker).

Pre-flight

Secrets stored in Authority: SMTP creds, Slack/Teams hooks, webhook HMAC keys.
Outbound allowlist updated for target channels.
Mongo and Redis reachable; health checks pass.
Offline kit loaded: channel manifests, default templates, rule seeds.

Deploy

Apply Kubernetes manifests/Compose stack from ops/notify/ with image digests pinned.
Set environment variables:
- Notify__Mongo__ConnectionString
- Notify__Redis__ConnectionString
- Notify__Authority__BaseUrl
- Notify__ChannelAllowlist
- ASPNETCORE_URLS=http://0.0.0.0:8080
Optionally warm caches: POST /api/v1/notify/admin/warm (loads rules and templates into memory).
Verify GET /api/v1/notify/health returns ready=true.
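A pre-deploy check for the settings listed above can be sketched as follows. This is a minimal illustration, not part of the Notifier tooling: the helper name `missing_notify_env` is hypothetical, and treating a blank value as missing is an assumption.

```python
from typing import Mapping

# Variable names come from the Deploy list above.
REQUIRED_ENV = (
    "Notify__Mongo__ConnectionString",
    "Notify__Redis__ConnectionString",
    "Notify__Authority__BaseUrl",
    "Notify__ChannelAllowlist",
    "ASPNETCORE_URLS",
)


def missing_notify_env(env: Mapping[str, str]) -> list[str]:
    """Return the required variables that are unset or blank, in list order.

    Hypothetical helper: treating whitespace-only values as missing is an
    assumption, not documented service behaviour.
    """
    return [name for name in REQUIRED_ENV if not env.get(name, "").strip()]
```

Calling `missing_notify_env(os.environ)` before applying the manifests returns an empty list when every setting is present, and otherwise names the ones to fix.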

Monitor

Metrics (Prometheus):
- notify_delivery_attempts_total by status/channel/tenant.
- notify_escalation_stage_total by policy/stage.
- notify_rule_eval_seconds_bucket for worker latency.
Logs: structured JSON with tenant, ruleId, deliveryId, channel, status.
Traces: span notify.delivery with linkage to originating event traceparent when provided.

Common operations

List stuck deliveries: GET /api/v1/notify/deliveries?status=failed&from=<utc>.
Replay delivery: POST /api/v1/notify/deliveries/{id}:replay (idempotent; only re-renders if inputs unchanged).
Pause a tenant: set tenant state paused=true via the admin API; the worker stops sending but keeps evaluating rules for audit.
Rotate secrets: update Authority secret, then POST /api/v1/notify/channels/{id}:refresh-secret.
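The list and replay endpoints above can be addressed from a script. The sketch below only builds the URLs and does not send requests. The helper names are hypothetical, and the timestamp format used for `from` (ISO 8601 UTC with a `Z` suffix) is an assumption, since the runbook only shows `<utc>`.

```python
from datetime import datetime, timezone
from urllib.parse import quote, urlencode

API_PREFIX = "/api/v1/notify"  # prefix used by the endpoints in this runbook


def failed_deliveries_url(base_url: str, since: datetime) -> str:
    """Build the 'list stuck deliveries' URL for failures since `since`."""
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    # Assumed format for the `from` parameter: ISO 8601 in UTC with 'Z'.
    since_utc = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    query = urlencode({"status": "failed", "from": since_utc})
    return f"{base_url.rstrip('/')}{API_PREFIX}/deliveries?{query}"


def replay_url(base_url: str, delivery_id: str) -> str:
    """Build the POST target for replaying a single delivery."""
    # Percent-encode the id so it cannot alter the path.
    safe_id = quote(delivery_id, safe="")
    return f"{base_url.rstrip('/')}{API_PREFIX}/deliveries/{safe_id}:replay"
```

Keeping URL construction separate from the HTTP call makes it easy to review which deliveries a bulk replay would touch before anything is sent.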

Failure recovery

Worker crash loop: check Redis connectivity and template compile errors; run notify-worker --validate-only against the current config.
Mongo outage: worker backs off with exponential retry; after recovery, replay via :replay or digests as needed.
Channel outage (e.g., Slack 5xx): throttling and the retry policy handle transient errors; for extended outages, disable the channel or swap to a backup policy.
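For readers unfamiliar with the exponential retry mentioned above, the schedule can be sketched as below. The base delay, growth factor, and cap are illustrative assumptions; the worker's actual retry parameters are not stated in this runbook.

```python
def backoff_delays(
    attempts: int,
    base_seconds: float = 1.0,   # assumed first delay
    factor: float = 2.0,         # assumed growth factor per attempt
    cap_seconds: float = 60.0,   # assumed upper bound on any single delay
) -> list[float]:
    """Return the wait before each retry: base * factor**n, capped."""
    return [min(cap_seconds, base_seconds * factor ** n) for n in range(attempts)]


print(backoff_delays(8))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The cap keeps a long outage from pushing retries arbitrarily far apart. Real implementations commonly add random jitter so that many workers do not retry in lockstep.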

Auditing

Delivery ledger retains attempt hashes and signatures; export via /deliveries?from=...&to=...&format=ndjson for offline review.
Ack events are stored with actor, timestamp, and source IP.
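An NDJSON export holds one JSON object per line, so it can be summarized offline with a few lines of code. The sketch below counts records by a chosen field. The field name `status` is borrowed from the structured log fields listed under Monitor; whether ledger records carry the same field is an assumption.

```python
import json
from collections import Counter


def summarize_ndjson(text: str, field: str = "status") -> Counter:
    """Count NDJSON records by the value of `field`.

    Blank lines are skipped; records lacking the field are grouped
    under the placeholder key "<missing>".
    """
    counts: Counter = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record.get(field, "<missing>")] += 1
    return counts
```

Passing `field="channel"` or `field="tenant"` gives the same breakdown along a different dimension, provided the export includes those fields.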

Determinism safeguards

Rule snapshots are versioned per tenant; upgrades swap snapshots atomically.
Template rendering uses deterministic helpers only; no live lookups.
Time sources are UTC; quiet hours are evaluated using the tenant timezone from config.
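The quiet-hours rule above (UTC clock, tenant-local window) can be illustrated with the sketch below. The function name and the half-open window convention (start inclusive, end exclusive) are assumptions, not the service's documented behaviour.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, tenant_tz: tzinfo, start: time, end: time) -> bool:
    """Return True if `now_utc` falls inside the tenant-local quiet window."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(tenant_tz).time()
    if start <= end:
        # Window within a single day, e.g. 13:00-15:00.
        return start <= local < end
    # Window wraps past midnight, e.g. 22:00-07:00.
    return local >= start or local < end


# Usage with a fixed UTC+2 offset; a zoneinfo.ZoneInfo zone would also follow DST.
tenant_tz = timezone(timedelta(hours=2))
late_evening = datetime(2025, 11, 25, 21, 30, tzinfo=timezone.utc)  # 23:30 local
print(in_quiet_hours(late_evening, tenant_tz, time(22, 0), time(7, 0)))  # True
```

Because the current time is passed in rather than read from the system clock, the same inputs always give the same answer, which matches the determinism goal of this section.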

On-call checklist

Health endpoints green.
Delivery failure rate < 0.5% over last hour.
Escalation backlog empty or within SLO.
Redis memory < 75% and Mongo primary healthy.
Latest release notes applied and channels validated.
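The failure-rate item in the checklist can be checked from per-status attempt counts, for example the one-hour increase of notify_delivery_attempts_total grouped by status. The sketch below assumes the failing status label is `failed`; the metric's actual label values are not listed in this runbook.

```python
from typing import Mapping


def delivery_failure_rate(
    attempts_by_status: Mapping[str, float],
    failed_status: str = "failed",  # assumed label value
) -> float:
    """Return failed attempts as a fraction of all attempts (0.0 if none)."""
    total = sum(attempts_by_status.values())
    if total == 0:
        return 0.0
    return attempts_by_status.get(failed_status, 0) / total


def within_failure_budget(
    attempts_by_status: Mapping[str, float],
    threshold: float = 0.005,  # 0.5%, from the checklist
) -> bool:
    """True when the failure rate is strictly below the threshold."""
    return delivery_failure_rate(attempts_by_status) < threshold
```

For example, 3 failures out of 1000 attempts gives a rate of 0.3% and passes, while 5 out of 1000 sits exactly at 0.5% and does not.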
