Incident Mode Automation (DEVOPS-OBS-55-001)
What it does
- Auto-enables an incident feature flag when SLO burn rate crosses a threshold.
- Writes deterministic retention overrides (hours) for downstream storage/ingest.
- Auto-clears after a cooldown once burn is back under the reset threshold.
- Offline-friendly: no external calls; pure file outputs under out/incident-mode/.
Inputs
- Burn rate multiple (fast-burn): required.
- Thresholds/cooldown/retention configurable via CLI flags or env vars.
- Optional note for audit context.
Outputs
- flag.json: enabled/disabled state, plus burn rate and note.
- retention.json: retention override hours and applied time.
- last_burn.txt, cooldown.txt: trace for automation/testing.
Usage
# Activate if burn >= 2.5; otherwise advance the cooldown and clear after 15 mins below 0.4
scripts/observability/incident-mode.sh \
--burn-rate 3.2 \
--threshold 2.5 \
--reset-threshold 0.4 \
--cooldown-mins 15 \
--retention-hours 48 \
--note "api error burst"
# Later (burn back to normal):
scripts/observability/incident-mode.sh --burn-rate 0.2 --reset-threshold 0.4 --cooldown-mins 15
Outputs land in out/incident-mode/ by default (override with --state-dir).
Integration hooks
- Prometheus rule should page on SLOBurnRateFast (already in alerts-slo.yaml).
- A small runner (cron/workflow) can feed burn rate into this script from PromQL (scalar(slo:burn_rate:fast)), then distribute flag.json via configmap/secret.
- Downstream services can read retention.json to temporarily raise retention windows during incident mode.
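The runner described above could look roughly like the sketch below. It is a sketch under assumptions, not part of this repo: PROM_URL, the configmap name incident-mode-flag, and the function names are hypothetical, and it expects curl, jq, and kubectl on the PATH. It relies on the standard Prometheus HTTP API (/api/v1/query), which returns a scalar result as a [timestamp, "value"] pair.

```shell
#!/usr/bin/env bash
# Hypothetical runner sketch: PromQL -> incident-mode.sh -> configmap.
set -euo pipefail

PROM_URL="${PROM_URL:-http://localhost:9090}"   # assumed Prometheus address
STATE_DIR="${STATE_DIR:-out/incident-mode}"     # default output dir from this doc

# Read a Prometheus scalar query response on stdin and print the value.
# A scalar result is shaped like: {"data":{"result":[<ts>,"<value>"]}}
extract_scalar() { jq -r '.data.result[1]'; }

# Query the fast burn rate via the Prometheus HTTP API.
fetch_burn_rate() {
  curl -fsS --get "$PROM_URL/api/v1/query" \
    --data-urlencode 'query=scalar(slo:burn_rate:fast)' | extract_scalar
}

# Publish flag.json as a configmap (the configmap name is an assumption).
publish_flag() {
  kubectl create configmap incident-mode-flag \
    --from-file=flag.json="$STATE_DIR/flag.json" \
    --dry-run=client -o yaml | kubectl apply -f -
}

# One runner iteration; thresholds fall back to the script's env/defaults.
run_once() {
  local burn
  burn="$(fetch_burn_rate)"
  scripts/observability/incident-mode.sh --burn-rate "$burn" --state-dir "$STATE_DIR"
  publish_flag
}

# Call run_once from cron or a workflow step, for example once per minute.
```

Keeping the query, the script call, and the publish step in separate functions lets each be swapped out (for example, a secret instead of a configmap) without touching the others.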
Determinism
- Timestamps are UTC ISO-8601; no network dependencies.
- State is contained under the chosen state-dir for reproducible runs.
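As an illustration of the two points above, the helpers below produce a UTC ISO-8601 timestamp and an isolated state directory for a reproducible test run. The helper names are hypothetical and the exact timestamp format used by the script is an assumption; only --state-dir comes from this document.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not taken from incident-mode.sh.

# Print the current time as UTC ISO-8601, e.g. 2024-01-01T00:00:00Z.
utc_now() { date -u +%Y-%m-%dT%H:%M:%SZ; }

# Create and print a fresh, empty state directory for an isolated run.
fresh_state_dir() { mktemp -d "${TMPDIR:-/tmp}/incident-mode.XXXXXX"; }

# Example (not executed here):
#   dir="$(fresh_state_dir)"
#   scripts/observability/incident-mode.sh --burn-rate 3.2 --state-dir "$dir"
```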
Clearing / reset
- Cooldown counter increments only when burn stays below reset threshold.
- Once cooldown minutes are met, flag.json flips enabled=false and the script leaves prior retention files untouched (downstream can prune separately).
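The enable/clear behaviour can be sketched as a small state transition. This is an illustrative model, not the script's actual code: the function names are hypothetical, each run is assumed to represent STEP_MINS elapsed minutes, and the counter is assumed to reset to zero whenever burn rises to or above the reset threshold.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the enable/clear hysteresis; not the real script.

# ge A B: succeeds when A >= B (awk handles the float comparison).
ge() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a >= b) }'; }

# next_state ENABLED COOLDOWN BURN THRESHOLD RESET COOLDOWN_MINS STEP_MINS
# Prints "<enabled> <cooldown>" describing the state after this run.
next_state() {
  local enabled=$1 cooldown=$2 burn=$3 threshold=$4 reset=$5 limit=$6 step=$7
  if ge "$burn" "$threshold"; then
    # Burn at or above the threshold: (re)enable and restart the cooldown.
    enabled=true
    cooldown=0
  elif [ "$enabled" = true ] && ! ge "$burn" "$reset"; then
    # Burn below the reset threshold: advance the cooldown counter.
    cooldown=$((cooldown + step))
    if [ "$cooldown" -ge "$limit" ]; then
      enabled=false
      cooldown=0
    fi
  else
    # Between thresholds, or already disabled: no cooldown progress (assumed).
    cooldown=0
  fi
  echo "$enabled $cooldown"
}

# Example with the values from Usage, run every 5 minutes:
next_state true 10 0.2 2.5 0.4 15 5   # prints "false 0"
```

Using two thresholds (2.5 to enable, 0.4 to clear) plus a cooldown keeps the flag from flapping when burn hovers near a single cutoff.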