work
ops/devops/observability/incident-mode.md
Normal file
@@ -0,0 +1,49 @@
# Incident Mode Automation (DEVOPS-OBS-55-001)

## What it does
- Auto-enables an *incident* feature flag when SLO burn rate crosses a threshold.
- Writes deterministic retention overrides (hours) for downstream storage/ingest.
- Auto-clears after a cooldown once burn is back under the reset threshold.
- Offline-friendly: no external calls; pure file outputs under `out/incident-mode/`.

## Inputs
- Burn rate multiple (fast-burn): required.
- Thresholds/cooldown/retention configurable via CLI flags or env vars.
- Optional note for audit context.
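
The flag/env handling described above could be sketched roughly as follows. This is not the real `incident-mode.sh`: the `INCIDENT_*` variable names, the `parse_inputs` helper, the fallback values (copied from the Usage example, not known script defaults), and the "flags override env vars" precedence are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: env vars supply defaults, CLI flags override them.
# The INCIDENT_* names are invented; the real script may use different ones.
parse_inputs() {
  threshold="${INCIDENT_THRESHOLD:-2.5}"
  reset_threshold="${INCIDENT_RESET_THRESHOLD:-0.4}"
  cooldown_mins="${INCIDENT_COOLDOWN_MINS:-15}"
  retention_hours="${INCIDENT_RETENTION_HOURS:-48}"
  state_dir="${INCIDENT_STATE_DIR:-out/incident-mode}"
  note="${INCIDENT_NOTE:-}"
  burn_rate=""

  while [ "$#" -gt 0 ]; do
    # Every flag takes a value, so a lone trailing flag is an error.
    [ "$#" -ge 2 ] || { echo "missing value for $1" >&2; return 2; }
    case "$1" in
      --burn-rate)       burn_rate="$2" ;;
      --threshold)       threshold="$2" ;;
      --reset-threshold) reset_threshold="$2" ;;
      --cooldown-mins)   cooldown_mins="$2" ;;
      --retention-hours) retention_hours="$2" ;;
      --state-dir)       state_dir="$2" ;;
      --note)            note="$2" ;;
      *) echo "unknown flag: $1" >&2; return 2 ;;
    esac
    shift 2
  done

  # The burn rate is the only required input.
  [ -n "$burn_rate" ] || { echo "--burn-rate is required" >&2; return 2; }
}
```

With this shape, `INCIDENT_COOLDOWN_MINS=30` in the environment would change the cooldown for every run, while `--cooldown-mins 15` on the command line would still win for a single invocation.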

## Outputs
- `flag.json`: enabled/disabled state, plus the burn rate and note.
- `retention.json`: retention override hours and the time they were applied.
- `last_burn.txt`, `cooldown.txt`: trace files for automation/testing.
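
A consumer of these files might check the flag along these lines. This is a minimal sketch that assumes `flag.json` stores a JSON boolean under the key `enabled` and that `jq` is installed; the exact schema is not shown here, and the `incident_enabled` helper is hypothetical.

```bash
#!/usr/bin/env bash
# Assumption: flag.json looks like {"enabled": true, ...}; key name not confirmed.
# Succeeds only when the flag file exists and reports enabled=true.
incident_enabled() {
  local state_dir="${1:-out/incident-mode}"
  [ "$(jq -r '.enabled' "${state_dir}/flag.json" 2>/dev/null)" = "true" ]
}
```

A missing or unreadable `flag.json` is treated as "not in incident mode", so a caller can write `if incident_enabled; then ...; fi` without a separate existence check.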

## Usage
```bash
# Activate if burn >= 2.5; otherwise advance the cooldown and clear after 15 mins below 0.4
scripts/observability/incident-mode.sh \
  --burn-rate 3.2 \
  --threshold 2.5 \
  --reset-threshold 0.4 \
  --cooldown-mins 15 \
  --retention-hours 48 \
  --note "api error burst"

# Later (burn back to normal):
scripts/observability/incident-mode.sh --burn-rate 0.2 --reset-threshold 0.4 --cooldown-mins 15
```
Outputs land in `out/incident-mode/` by default (override with `--state-dir`).

## Integration hooks
- The Prometheus rule should page on `SLOBurnRateFast` (already in `alerts-slo.yaml`).
- A small runner (cron/workflow) can feed the burn rate into this script from PromQL
  (`scalar(slo:burn_rate:fast)`), then distribute `flag.json` via configmap/secret.
- Downstream services can read `retention.json` to temporarily raise retention
  windows during incident mode.
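
The runner mentioned above could look something like the sketch below. It assumes Prometheus is reachable at `PROM_URL` (the default here is a placeholder), that `curl` and `jq` are available, and that the query returns the standard instant-query scalar shape. The function names are hypothetical; only the script path, the flags, and the PromQL expression come from this document.

```bash
#!/usr/bin/env bash
# Assumption: adjust PROM_URL for your environment.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Extract the value from a scalar instant-query response on stdin:
# {"status":"success","data":{"resultType":"scalar","result":[<ts>,"<value>"]}}
# jq -e exits nonzero when the value is null or no result is produced.
parse_scalar() { jq -er '.data.result[1]'; }

# Query the fast-burn expression and print its current value.
fetch_burn_rate() {
  curl -fsS --get --data-urlencode 'query=scalar(slo:burn_rate:fast)' \
    "${PROM_URL}/api/v1/query" | parse_scalar
}

# Feed the value into the script; skip the run if the query failed.
run_incident_mode() {
  local burn
  burn="$(fetch_burn_rate)" || return 1
  scripts/observability/incident-mode.sh \
    --burn-rate "$burn" \
    --threshold 2.5 --reset-threshold 0.4 --cooldown-mins 15
}
```

Calling `run_incident_mode` from cron or a scheduled workflow keeps the script itself offline: the only network call lives in the runner, and a failed query leaves the existing state files untouched.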

## Determinism
- Timestamps are UTC ISO-8601; no network dependencies.
- State is contained under the chosen `state-dir` for reproducible runs.

## Clearing / reset
- The cooldown counter increments only while burn stays below the reset threshold.
- Once the cooldown minutes are met, `flag.json` flips to `enabled=false` and the script
  leaves prior retention files untouched (downstream can prune separately).
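
The enable/clear behaviour can be read as a small state machine. The sketch below is one possible interpretation, not the script's actual code: `next_state` and `float_ge` are hypothetical helpers, the caller is assumed to pass the minutes elapsed since the previous run, and resetting the counter when burn sits between the two thresholds is an assumption.

```bash
#!/usr/bin/env bash
# float_ge A B: succeeds when A >= B (awk handles the float comparison).
float_ge() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a >= b) }'; }

# next_state BURN THRESHOLD RESET COOLDOWN_MINS ENABLED ELAPSED_MINS STEP_MINS
# Prints "<enabled> <elapsed_mins>" describing the state after this run.
next_state() {
  local burn=$1 threshold=$2 reset=$3 cooldown=$4 enabled=$5 elapsed=$6 step=$7
  if float_ge "$burn" "$threshold"; then
    enabled=true; elapsed=0          # over threshold: (re)activate, restart cooldown
  elif [ "$enabled" = true ]; then
    if float_ge "$burn" "$reset"; then
      elapsed=0                      # assumed: between thresholds restarts the cooldown
    else
      elapsed=$((elapsed + step))    # below reset threshold: count toward clearing
      if [ "$elapsed" -ge "$cooldown" ]; then
        enabled=false; elapsed=0     # cooldown met: clear the flag
      fi
    fi
  fi
  echo "$enabled $elapsed"
}
```

For example, with a 15-minute cooldown and runs every 5 minutes, three consecutive runs at a burn rate of 0.2 take an enabled flag through elapsed values 5 and 10 before clearing it on the third run.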