up

This commit is contained in:
StellaOps Bot
2025-11-25 22:09:44 +02:00
parent 6bee1fdcf5
commit 9f6e6f7fb3
116 changed files with 4495 additions and 730 deletions

@@ -0,0 +1,51 @@
# Escalations & Acknowledgements
> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Last updated: 2025-11-25 (Docs Tasks Md.V · DOCS-NOTIFY-40-001)
## Model
- **Escalation policy**: ordered stages of channels with delays; stored per tenant.
- **Acknowledgement**: DSSE-signed token embedded in messages; the acknowledger must present the token to stop escalation.
- **Suppression**: rules may mark events as non-escalating (informational) while still sending a single notification.
## Policy schema (conceptual)
```json
{
  "id": "uuid",
  "tenant": "string",
  "name": "pager-policy-prod",
  "stages": [
    { "delaySeconds": 0, "channels": ["slack-prod", "email-oncall"] },
    { "delaySeconds": 900, "channels": ["pager-primary"] },
    { "delaySeconds": 1800, "channels": ["pager-management"] }
  ],
  "autoCloseMinutes": 120,
  "retry": { "maxAttempts": 3, "backoffSeconds": 60 }
}
```
- Stages execute sequentially until an **ack** is recorded.
- Deterministic ordering: channels within a stage are sorted lexicographically before dispatch.
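
The ordering rule can be illustrated with a minimal Python sketch that flattens a policy into a dispatch plan. The function name `dispatch_plan` is hypothetical, and the use of Python's default string comparison (code point order) is an assumption about what "lexicographically" means here; the real service may use a different collation.

```python
from typing import Any


def dispatch_plan(policy: dict[str, Any]) -> list[tuple[int, int, str]]:
    """Return (stage_index, delay_seconds, channel) tuples in dispatch order.

    Stages keep the order in which the policy lists them; channels inside a
    stage are sorted so that repeated runs produce the same sequence.
    """
    plan = []
    for index, stage in enumerate(policy["stages"]):
        for channel in sorted(stage["channels"]):  # assumed: code point order
            plan.append((index, stage["delaySeconds"], channel))
    return plan


policy = {
    "name": "pager-policy-prod",
    "stages": [
        {"delaySeconds": 0, "channels": ["slack-prod", "email-oncall"]},
        {"delaySeconds": 900, "channels": ["pager-primary"]},
        {"delaySeconds": 1800, "channels": ["pager-management"]},
    ],
}

for entry in dispatch_plan(policy):
    print(entry)
```

With the example policy, `email-oncall` is dispatched before `slack-prod` in stage 0 even though the policy lists them the other way round.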
## Ack tokens
- Token payload: `{ tenant, deliveryId, expiresUtc, ruleId, actionHash }`.
- Signed with Authority-issued DSSE key; verified by Notify WebService before accepting `POST /acks/{token}`.
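- Expiry defaults to 24h. Each token is single-use, and accepting it is idempotent: a repeated `POST` with the same token does not record a second acknowledgement.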
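
A rough sketch of the signature and expiry checks follows. It wraps the payload in a DSSE-style envelope and signs the DSSE pre-authentication encoding (PAE), but several parts are assumptions made only for illustration: HMAC-SHA256 stands in for the Authority-issued key, the `payloadType` value is invented, and `issue`, `verify`, and the base64url token packaging are hypothetical names and choices. Single-use tracking is not shown.

```python
import base64
import hashlib
import hmac
import json
from datetime import datetime, timedelta, timezone

PAYLOAD_TYPE = "application/vnd.example.notify-ack+json"  # hypothetical type


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the bytes that actually get signed."""
    t = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(t), t, len(payload), payload)


def issue(payload: dict, key: bytes) -> str:
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # HMAC is a stand-in; the real signing scheme is not described here.
    sig = hmac.new(key, pae(PAYLOAD_TYPE, body), hashlib.sha256).digest()
    envelope = {
        "payloadType": PAYLOAD_TYPE,
        "payload": base64.b64encode(body).decode("ascii"),
        "signatures": [{"sig": base64.b64encode(sig).decode("ascii")}],
    }
    return base64.urlsafe_b64encode(json.dumps(envelope).encode("utf-8")).decode("ascii")


def verify(token: str, key: bytes, now: datetime) -> dict:
    """Return the payload if the signature is valid and the token is unexpired."""
    envelope = json.loads(base64.urlsafe_b64decode(token))
    body = base64.b64decode(envelope["payload"])
    expected = hmac.new(key, pae(envelope["payloadType"], body), hashlib.sha256).digest()
    actual = base64.b64decode(envelope["signatures"][0]["sig"])
    if not hmac.compare_digest(expected, actual):
        raise ValueError("bad signature")
    payload = json.loads(body)
    if now >= datetime.fromisoformat(payload["expiresUtc"]):
        raise ValueError("token expired")
    return payload


key = b"demo-key"  # placeholder secret for the sketch
issued = datetime(2025, 11, 25, 20, 0, tzinfo=timezone.utc)
token = issue(
    {
        "tenant": "t1",
        "deliveryId": "d1",
        "expiresUtc": (issued + timedelta(hours=24)).isoformat(),
        "ruleId": "r1",
        "actionHash": "abc",
    },
    key,
)
print(verify(token, key, issued + timedelta(hours=1))["deliveryId"])
```

Checking the signature before reading `expiresUtc` means the expiry is only trusted once the payload is known to be authentic.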
## Escalation flow
1) Rule fires → action references an escalation policy.
2) Stage 0 deliveries sent; ledger records attempts and ack URL.
3) If no ack is recorded by the next stage's `delaySeconds`, that stage dispatches; this repeats until an ack arrives or the final stage has run.
4) On ack, remaining stages are cancelled; ledger entry marked `acknowledged` with timestamp and subject.
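
The flow above can be sketched as a small simulation that reports which stages dispatch before an ack arrives. Two assumptions are baked in and may not match the service: each `delaySeconds` is treated as an offset from the start of the escalation (not from the previous stage), and an ack landing exactly on a stage's deadline is treated as arriving first. The function name `stages_dispatched` is hypothetical.

```python
from typing import Optional


def stages_dispatched(stage_delays: list[int], ack_at: Optional[int]) -> list[int]:
    """Return indexes of the stages that dispatch before the ack arrives.

    stage_delays: delaySeconds per stage, assumed to be ascending offsets
                  from the moment the escalation started.
    ack_at:       seconds after start at which the ack was recorded, or
                  None if nobody acknowledged.
    """
    fired = []
    for index, delay in enumerate(stage_delays):
        if ack_at is not None and ack_at <= delay:
            break  # ack recorded: this stage and all later ones are cancelled
        fired.append(index)
    return fired


print(stages_dispatched([0, 900, 1800], ack_at=600))
print(stages_dispatched([0, 900, 1800], ack_at=None))
```

With the example policy, an ack after 600 seconds means only stage 0 is sent; with no ack, all three stages run.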
## Quiet hours & throttles
- Quiet hours suppress *new* escalations; in-flight escalations continue.
- Per-policy throttle prevents repeated escalation runs for identical `actionHash` within a configurable window (default 30m).
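
A minimal sketch of the throttle, assuming one in-memory instance per policy and a window measured from the last run that was allowed. The class name `EscalationThrottle` and its `allow` method are hypothetical; a real deployment would presumably need shared, persistent state.

```python
from datetime import datetime, timedelta, timezone


class EscalationThrottle:
    """Blocks repeated escalation runs for the same actionHash within a window."""

    def __init__(self, window: timedelta = timedelta(minutes=30)) -> None:
        self.window = window
        self._last_run: dict[str, datetime] = {}

    def allow(self, action_hash: str, now: datetime) -> bool:
        last = self._last_run.get(action_hash)
        if last is not None and now - last < self.window:
            return False  # still inside the window: skip this run
        self._last_run[action_hash] = now
        return True


throttle = EscalationThrottle()
start = datetime(2025, 11, 25, 20, 0, tzinfo=timezone.utc)
print(throttle.allow("abc", start))
print(throttle.allow("abc", start + timedelta(minutes=10)))
print(throttle.allow("abc", start + timedelta(minutes=30)))
```

Rejected attempts do not reset the timer in this sketch, so a steady stream of identical events still produces one escalation run per window.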
## Observability
- Counters: `notify.escalation.started`, `notify.escalation.stage_sent`, `notify.escalation.ack`, `notify.escalation.cancelled`, each tagged by `tenant`, `policy`, and `stage`.
- Logs: structured `escalation.{started|stage_sent|ack|cancelled}` with delivery ids and rationale.
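
To show how the tags partition the counters, here is a small in-memory stand-in. It is not the service's metrics API; `increment` and the `counters` store are hypothetical, and the tag values are made up.

```python
from collections import Counter

counters: Counter = Counter()


def increment(name: str, **tags: str) -> None:
    """Count one event under a (name, sorted tag pairs) key."""
    counters[(name, tuple(sorted(tags.items())))] += 1


increment("notify.escalation.started", tenant="t1", policy="pager-policy-prod", stage="0")
increment("notify.escalation.stage_sent", tenant="t1", policy="pager-policy-prod", stage="0")
increment("notify.escalation.stage_sent", tenant="t1", policy="pager-policy-prod", stage="1")
increment("notify.escalation.ack", tenant="t1", policy="pager-policy-prod", stage="1")

for (name, tags), value in sorted(counters.items()):
    print(name, dict(tags), value)
```

Because `stage` is part of the key, `stage_sent` for stage 0 and stage 1 are separate series, which makes it possible to see how far escalations typically progress before an ack.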
## Runbooks
- Update an escalation policy safely: create a new policy id, switch rules to it, then delete the old policy to avoid mid-flight ambiguity.
- If a stage storms, widen the throttle window or add quiet hours; do not delete the policy mid-flight. Use the `cancelEscalation` endpoint instead.