# Incident Mode Automation (DEVOPS-OBS-55-001)

## What it does
- Auto-enables an *incident* feature flag when SLO burn rate crosses a threshold.
- Writes deterministic retention overrides (hours) for downstream storage/ingest.
- Auto-clears after a cooldown once burn is back under the reset threshold.
- Offline-friendly: no external calls; pure file outputs under `out/incident-mode/`.

## Inputs
- Burn rate multiple (fast-burn): required.
- Thresholds/cooldown/retention configurable via CLI flags or env vars.
- Optional note for audit context.

## Outputs
- `flag.json` — enabled/disabled + burn rate and note.
- `retention.json` — retention override hours + applied time.
- `last_burn.txt`, `cooldown.txt` — trace for automation/testing.

## Usage
```bash
# Activate if burn >= 2.5, otherwise decay cooldown; clear after 15 mins <0.4
scripts/observability/incident-mode.sh \
  --burn-rate 3.2 \
  --threshold 2.5 \
  --reset-threshold 0.4 \
  --cooldown-mins 15 \
  --retention-hours 48 \
  --note "api error burst"

# Later (burn back to normal):
scripts/observability/incident-mode.sh --burn-rate 0.2 --reset-threshold 0.4 --cooldown-mins 15
```
Outputs land in `out/incident-mode/` by default (override with `--state-dir`).

## Integration hooks
- Prometheus rule should page on SLOBurnRateFast (already in `alerts-slo.yaml`).
- A small runner (cron/workflow) can feed burn rate into this script from PromQL
  (`scalar(slo:burn_rate:fast)`), then distribute `flag.json` via configmap/secret.
- Downstream services can read `retention.json` to temporarily raise retention
  windows during incident mode.

## Determinism
- Timestamps are UTC ISO-8601; no network dependencies.
- State is contained under the chosen `state-dir` for reproducible runs.

## Clearing / reset
- Cooldown counter increments only when burn stays below reset threshold.
- Once cooldown minutes are met, `flag.json` flips `enabled=false` and the script
  leaves prior retention files untouched (downstream can prune separately).