# Escalations & Acknowledgements

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)

## Model

- **Escalation policy**: ordered stages of channels with delays; stored per tenant.
- **Acknowledgement**: DSSE-signed token embedded in messages; acknowledger must present token to stop escalation.
- **Suppression**: rules may mark events as non-escalating (informational) while still sending single notifications.

## Policy schema (conceptual)

```json
{
  "id": "uuid",
  "tenant": "string",
  "name": "pager-policy-prod",
  "stages": [
    { "delaySeconds": 0, "channels": ["slack-prod", "email-oncall"] },
    { "delaySeconds": 900, "channels": ["pager-primary"] },
    { "delaySeconds": 1800, "channels": ["pager-management"] }
  ],
  "autoCloseMinutes": 120,
  "retry": { "maxAttempts": 3, "backoffSeconds": 60 }
}
```

- Stages execute sequentially until an **ack** is recorded.
- Deterministic ordering: channels within a stage are sorted lexicographically before dispatch.

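The ordering rule can be illustrated with a minimal Python sketch. It assumes "lexicographically" means plain ordinal string comparison (what Python's `sorted` does), and the helper name `plan_dispatch` is hypothetical, not part of Notify.

```python
import json

# Trimmed copy of the conceptual policy above; only the fields used here.
POLICY_JSON = """
{
  "name": "pager-policy-prod",
  "stages": [
    { "delaySeconds": 0,    "channels": ["slack-prod", "email-oncall"] },
    { "delaySeconds": 900,  "channels": ["pager-primary"] },
    { "delaySeconds": 1800, "channels": ["pager-management"] }
  ]
}
"""


def plan_dispatch(policy: dict) -> list[tuple[int, int, str]]:
    """Return (stage_index, delay_seconds, channel) tuples in dispatch order."""
    plan = []
    for index, stage in enumerate(policy["stages"]):
        # Sort channels so the order never depends on how the policy was authored.
        # Assumption: ordinal (code point) comparison is the intended ordering.
        for channel in sorted(stage["channels"]):
            plan.append((index, stage["delaySeconds"], channel))
    return plan


plan = plan_dispatch(json.loads(POLICY_JSON))
for entry in plan:
    print(entry)
```

With the sample policy, stage 0 dispatches `email-oncall` before `slack-prod` even though the policy lists them the other way round.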
## Ack tokens

- Token payload: `{ tenant, deliveryId, expiresUtc, ruleId, actionHash }`.
- Signed with Authority-issued DSSE key; verified by Notify WebService before accepting `POST /acks/{token}`.
- Expiry defaults to 24h; tokens are single-use and idempotent.

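One possible reading of "single-use and idempotent" is sketched below in Python. It assumes the DSSE signature has already been verified upstream and covers only the expiry and replay checks. The `AckToken` and `AckLedger` names, the in-memory store, and the result strings are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AckToken:
    """Mirrors the token payload fields listed above."""
    tenant: str
    delivery_id: str
    expires_utc: datetime
    rule_id: str
    action_hash: str


class AckLedger:
    """In-memory stand-in for whatever store records acknowledgements."""

    def __init__(self) -> None:
        self._acked: dict[tuple[str, str], datetime] = {}

    def accept(self, token: AckToken, now: datetime) -> str:
        key = (token.tenant, token.delivery_id)
        # Idempotent replay: a second presentation changes nothing.
        if key in self._acked:
            return "already_acknowledged"
        if now >= token.expires_utc:
            return "expired"
        # Single use: the first valid presentation is the only one recorded.
        self._acked[key] = now
        return "acknowledged"


issued = datetime(2025, 11, 25, 12, 0, tzinfo=timezone.utc)
token = AckToken("tenant-a", "delivery-1", issued + timedelta(hours=24), "rule-1", "abc123")
ledger = AckLedger()
first = ledger.accept(token, issued + timedelta(minutes=5))
second = ledger.accept(token, issued + timedelta(minutes=6))
```

Checking the replay case before the expiry case is a design choice in this sketch: a late retry of an already-accepted ack reports the same outcome instead of an expiry error.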
## Escalation flow

1) Rule fires → action references an escalation policy.
2) Stage 0 deliveries sent; ledger records attempts and ack URL.
3) If no ack by `delaySeconds`, next stage dispatches; repeats until ack or final stage.
4) On ack, remaining stages are cancelled; ledger entry marked `acknowledged` with timestamp and subject.

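The flow above can be approximated with a small simulation. This is a sketch under one assumption the text does not settle: each `delaySeconds` is treated as an offset from the moment the rule fired, not as a gap after the previous stage. The function name and event labels are hypothetical.

```python
def run_escalation(stages, ack_at=None):
    """Simulate one escalation run; all times are seconds since the rule fired.

    Returns a list of (time, event) tuples in the order they would occur.
    """
    events = [(0, "started")]
    for index, stage in enumerate(stages):
        due = stage["delaySeconds"]
        # An ack that arrives strictly before a stage is due cancels that stage
        # and every stage after it.
        if ack_at is not None and ack_at < due:
            events.append((ack_at, "ack"))
            events.append((ack_at, f"cancelled:{len(stages) - index}"))
            return events
        events.append((due, f"stage_sent:{index}"))
    if ack_at is not None:
        # Ack arrived after the final stage had already dispatched.
        events.append((ack_at, "ack"))
    return events


STAGES = [
    {"delaySeconds": 0, "channels": ["slack-prod", "email-oncall"]},
    {"delaySeconds": 900, "channels": ["pager-primary"]},
    {"delaySeconds": 1800, "channels": ["pager-management"]},
]

acked = run_escalation(STAGES, ack_at=1000)
unacked = run_escalation(STAGES)
```

In the `acked` run, stages 0 and 1 dispatch, the ack lands at 1000 seconds, and the one remaining stage is cancelled.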
## Quiet hours & throttles

- Quiet hours suppress *new* escalations; in-flight escalations continue.
- Per-policy throttle prevents repeated escalation runs for identical `actionHash` within a configurable window (default 30m).

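A minimal sketch of the gate that decides whether a *new* escalation may start is shown below. It assumes quiet hours are expressed as a UTC start and end time and that the throttle is keyed on `actionHash` alone within a single policy. The function names and the `last_started` dictionary are hypothetical.

```python
from datetime import datetime, time, timedelta, timezone


def in_quiet_hours(now: datetime, start: time, end: time) -> bool:
    """True if `now` falls in [start, end); handles windows that wrap midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def may_start_escalation(action_hash, now, last_started, quiet=None,
                         window=timedelta(minutes=30)):
    """Gate for new escalation runs only; in-flight runs are never consulted here."""
    if quiet is not None and in_quiet_hours(now, *quiet):
        return False
    previous = last_started.get(action_hash)
    # Throttle: refuse a repeat run for the same actionHash inside the window.
    if previous is not None and now - previous < window:
        return False
    last_started[action_hash] = now
    return True


noon = datetime(2025, 11, 25, 12, 0, tzinfo=timezone.utc)
runs = {}
started = may_start_escalation("hash-1", noon, runs)
repeat = may_start_escalation("hash-1", noon + timedelta(minutes=10), runs)
later = may_start_escalation("hash-1", noon + timedelta(minutes=31), runs)
```

Because the gate is only consulted when a run is about to start, an escalation that began before quiet hours keeps dispatching its remaining stages.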
## Observability

- Counters: `notify.escalation.started`, `notify.escalation.stage_sent`, `notify.escalation.ack`, `notify.escalation.cancelled`, tagged by `tenant`, `policy`, `stage`.
- Logs: structured `escalation.{started|stage_sent|ack|cancelled}` with delivery ids and rationale.

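To show how the counters and logs line up, here is a small sketch that uses only the Python standard library in place of a real metrics client. The `record` helper and the log field names `deliveryId` and `rationale` are assumptions for illustration; only the counter and event names come from the list above.

```python
import json
from collections import Counter

# Stand-in for a metrics backend: counts per (counter name, tag set).
metrics: Counter = Counter()


def record(event, tenant, policy, stage, delivery_id, rationale):
    """Bump the counter for `event` and return the matching structured log line."""
    tags = (("policy", policy), ("stage", stage), ("tenant", tenant))
    metrics[(f"notify.escalation.{event}", tags)] += 1
    return json.dumps({
        "event": f"escalation.{event}",
        "tenant": tenant,
        "policy": policy,
        "stage": stage,
        "deliveryId": delivery_id,
        "rationale": rationale,
    }, sort_keys=True)


line = record("stage_sent", "tenant-a", "pager-policy-prod", 1,
              "delivery-1", "no ack within 900s")
print(line)
```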
## Runbooks

- Update an escalation policy safely: create a new policy id, switch rules to it, then delete the old policy. This avoids ambiguity for escalations that are mid-flight.
- If a stage causes a notification storm, lengthen the throttle window or add quiet hours. Do not delete the policy mid-flight; use the `cancelEscalation` endpoint instead.