50 lines
2.0 KiB
Markdown
50 lines
2.0 KiB
Markdown
# Incident Mode Automation (DEVOPS-OBS-55-001)
|
|
|
|
## What it does
|
|
- Auto-enables an *incident* feature flag when SLO burn rate crosses a threshold.
|
|
- Writes deterministic retention overrides (hours) for downstream storage/ingest.
|
|
- Auto-clears after a cooldown once burn is back under the reset threshold.
|
|
- Offline-friendly: no external calls; pure file outputs under `out/incident-mode/`.
|
|
|
|
## Inputs
|
|
- Burn rate multiple (fast-burn): required.
|
|
- Thresholds/cooldown/retention configurable via CLI flags or env vars.
|
|
- Optional note for audit context.
|
|
|
|
## Outputs
|
|
- `flag.json` — enabled/disabled + burn rate and note.
|
|
- `retention.json` — retention override hours + applied time.
|
|
- `last_burn.txt`, `cooldown.txt` — trace for automation/testing.
|
|
|
|
## Usage
|
|
```bash
|
|
# Activate if burn >= 2.5, otherwise decay cooldown; clear after 15 mins <0.4
|
|
scripts/observability/incident-mode.sh \
|
|
--burn-rate 3.2 \
|
|
--threshold 2.5 \
|
|
--reset-threshold 0.4 \
|
|
--cooldown-mins 15 \
|
|
--retention-hours 48 \
|
|
--note "api error burst"
|
|
|
|
# Later (burn back to normal):
|
|
scripts/observability/incident-mode.sh --burn-rate 0.2 --reset-threshold 0.4 --cooldown-mins 15
|
|
```
|
|
Outputs land in `out/incident-mode/` by default (override with `--state-dir`).
|
|
|
|
## Integration hooks
|
|
- Prometheus rule should page on SLOBurnRateFast (already in `alerts-slo.yaml`).
|
|
- A small runner (cron/workflow) can feed burn rate into this script from PromQL
|
|
(`scalar(slo:burn_rate:fast)`), then distribute `flag.json` via configmap/secret.
|
|
- Downstream services can read `retention.json` to temporarily raise retention
|
|
windows during incident mode.
|
|
|
|
## Determinism
|
|
- Timestamps are UTC ISO-8601; no network dependencies.
|
|
- State is contained under the chosen `state-dir` for reproducible runs.
|
|
|
|
## Clearing / reset
|
|
- Cooldown counter increments only when burn stays below reset threshold.
|
|
- Once cooldown minutes are met, `flag.json` flips `enabled=false` and the script
|
|
leaves prior retention files untouched (downstream can prune separately).
|