# Incident Mode Automation (DEVOPS-OBS-55-001)

## What it does
- Auto-enables an *incident* feature flag when SLO burn rate crosses a threshold.
- Writes deterministic retention overrides (hours) for downstream storage/ingest.
- Auto-clears after a cooldown once burn is back under the reset threshold.
- Offline-friendly: no external calls; pure file outputs under `out/incident-mode/`.

## Inputs
- Burn rate multiple (fast-burn): required.
- Thresholds/cooldown/retention configurable via CLI flags or env vars.
- Optional note for audit context.

## Outputs
- `flag.json`: enabled/disabled + burn rate and note.
- `retention.json`: retention override hours + applied time.
- `last_burn.txt`, `cooldown.txt`: trace for automation/testing.

## Usage
```bash
# Activate if burn >= 2.5, otherwise decay cooldown; clear after 15 mins <0.4
scripts/observability/incident-mode.sh \
  --burn-rate 3.2 \
  --threshold 2.5 \
  --reset-threshold 0.4 \
  --cooldown-mins 15 \
  --retention-hours 48 \
  --note "api error burst"

# Later (burn back to normal):
scripts/observability/incident-mode.sh --burn-rate 0.2 --reset-threshold 0.4 --cooldown-mins 15
```

Outputs land in `out/incident-mode/` by default (override with `--state-dir`).

## Integration hooks
- Prometheus rule should page on SLOBurnRateFast (already in `alerts-slo.yaml`).
- A small runner (cron/workflow) can feed burn rate into this script from PromQL (`scalar(slo:burn_rate:fast)`), then distribute `flag.json` via configmap/secret.
- Downstream services can read `retention.json` to temporarily raise retention windows during incident mode.

## Determinism
- Timestamps are UTC ISO-8601; no network dependencies.
- State is contained under the chosen `state-dir` for reproducible runs.

## Clearing / reset
- Cooldown counter increments only when burn stays below reset threshold.
- Once cooldown minutes are met, `flag.json` flips `enabled=false` and the script leaves prior retention files untouched (downstream can prune separately).
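
The enable/cooldown/clear rules described in this document can be sketched as a small pure function. This is an illustration only, not the actual `incident-mode.sh` implementation. The names `float_ge`, `float_lt`, and `next_state` are hypothetical, as is the `step` argument (minutes credited to the cooldown counter per run); the source does not say how the counter advances or what happens when burn sits between the reset threshold and the activation threshold, so the sketch simply holds the counter in that band.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the incident-mode decision logic (not the real script).

# Burn rates are decimals, so comparisons go through awk instead of [ -ge ].
float_ge() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a >= b) }'; }
float_lt() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a < b) }'; }

# next_state ENABLED COOLDOWN BURN THRESHOLD RESET LIMIT STEP
#   ENABLED   current flag state: true|false
#   COOLDOWN  minutes accumulated below the reset threshold so far
#   LIMIT     cooldown minutes required before clearing
#   STEP      minutes credited per run (assumption, see lead-in)
# Prints "<enabled> <cooldown>" for the next run.
next_state() {
  local enabled=$1 cooldown=$2 burn=$3 threshold=$4 reset=$5 limit=$6 step=$7

  if float_ge "$burn" "$threshold"; then
    # Burn at or above the threshold: (re)enter incident mode, restart cooldown.
    enabled=true
    cooldown=0
  elif [ "$enabled" = true ] && float_lt "$burn" "$reset"; then
    # Burn under the reset threshold: accumulate cooldown, clear once it is met.
    cooldown=$((cooldown + step))
    if [ "$cooldown" -ge "$limit" ]; then
      enabled=false
      cooldown=0
    fi
  fi
  # Otherwise (burn between reset and threshold, or flag already off):
  # state is left unchanged. Holding the counter here is an assumption.

  echo "$enabled $cooldown"
}

next_state false 0 3.2 2.5 0.4 15 5   # prints "true 0"
next_state true 10 0.2 2.5 0.4 15 5   # prints "false 0"
```

Keeping the decision separate from file writes makes it easy to test the thresholds without touching `flag.json` or `retention.json`.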