# Incident Mode Automation (DEVOPS-OBS-55-001)
## What it does
- Auto-enables an *incident* feature flag when SLO burn rate crosses a threshold.
- Writes deterministic retention overrides (hours) for downstream storage/ingest.
- Auto-clears after a cooldown once burn is back under the reset threshold.
- Offline-friendly: no external calls; pure file outputs under `out/incident-mode/`.
## Inputs
- Burn rate multiple (fast-burn): required.
- Thresholds/cooldown/retention configurable via CLI flags or env vars.
- Optional note for audit context.
## Outputs
- `flag.json` — enabled/disabled + burn rate and note.
- `retention.json` — retention override hours + applied time.
- `last_burn.txt`, `cooldown.txt` — trace for automation/testing.
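As a rough illustration of how a consumer might check the flag without extra tooling, the sketch below tests for an `enabled` boolean in `flag.json`. The helper name `incident_enabled` is hypothetical, and the exact JSON layout (a boolean field literally named `enabled`) is an assumption based on the `enabled=false` wording in this document; the real schema may differ.

```bash
# Hypothetical helper: exit 0 when the flag file marks incident mode as enabled.
# Assumes flag.json contains a boolean field named "enabled" (schema not confirmed).
# Arg 1: path to the flag file (defaults to the documented output location).
incident_enabled() {
  grep -Eq '"enabled"[[:space:]]*:[[:space:]]*true' \
    "${1:-out/incident-mode/flag.json}" 2>/dev/null
}
```

A caller could then branch with `if incident_enabled; then ...; fi`. A missing or unreadable file is treated as "not enabled", which seems like the safer default for downstream services.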
## Usage
```bash
# Activate if burn rate >= 2.5; otherwise advance the cooldown and clear after 15 minutes below 0.4
scripts/observability/incident-mode.sh \
  --burn-rate 3.2 \
  --threshold 2.5 \
  --reset-threshold 0.4 \
  --cooldown-mins 15 \
  --retention-hours 48 \
  --note "api error burst"
# Later (burn back to normal):
scripts/observability/incident-mode.sh --burn-rate 0.2 --reset-threshold 0.4 --cooldown-mins 15
```
Outputs land in `out/incident-mode/` by default (override with `--state-dir`).
## Integration hooks
- The Prometheus rule should page on `SLOBurnRateFast` (already defined in `alerts-slo.yaml`).
- A small runner (cron job or workflow) can feed the burn rate into this script from PromQL
  (`scalar(slo:burn_rate:fast)`), then distribute `flag.json` via a ConfigMap or Secret.
- Downstream services can read `retention.json` to temporarily raise retention
  windows during incident mode.
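The runner described above could look roughly like the sketch below. It queries the standard Prometheus HTTP API (`/api/v1/query`), pulls the value out of the scalar response, and passes it to the script. The `PROM_URL` default and the function names `extract_scalar` and `run_once` are assumptions for illustration, and the thresholds are copied from the usage example. The `sed` extraction keeps the sketch dependency-free; a JSON-aware tool such as `jq` would be more robust if it is available.

```bash
# Assumed Prometheus base URL; adjust for your environment.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Read a Prometheus scalar response on stdin and print its value, e.g.
# {"status":"success","data":{"resultType":"scalar","result":[1732514483.1,"3.2"]}}
extract_scalar() {
  sed -n 's/.*"result":\[[^,]*,"\([^"]*\)"].*/\1/p'
}

# Query the fast-burn series once and hand the value to the incident-mode script.
run_once() {
  burn="$(curl -fsS --get "$PROM_URL/api/v1/query" \
            --data-urlencode 'query=scalar(slo:burn_rate:fast)' | extract_scalar)"
  case "$burn" in
    ''|NaN) echo "no usable burn rate returned" >&2; return 1 ;;
  esac
  scripts/observability/incident-mode.sh \
    --burn-rate "$burn" \
    --threshold 2.5 \
    --reset-threshold 0.4 \
    --cooldown-mins 15
}
```

Nothing runs until `run_once` is called, so the file can be sourced from a cron wrapper or a workflow step. Note that `scalar()` yields `NaN` when the series is missing or ambiguous, which is why that value is rejected rather than forwarded.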
## Determinism
- Timestamps are UTC ISO-8601; no network dependencies.
- State is contained under the chosen `state-dir` for reproducible runs.
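For reference, a UTC ISO-8601 timestamp and an isolated state directory can be produced in shell as sketched below. Both helper names (`utc_now`, `fresh_state_dir`) are hypothetical and not taken from the script; they only illustrate the two properties listed above.

```bash
# Hypothetical helper: current time as UTC ISO-8601, second precision.
utc_now() {
  date -u +"%Y-%m-%dT%H:%M:%SZ"
}

# Hypothetical helper: create a fresh, empty state directory for one reproducible run.
fresh_state_dir() {
  mktemp -d "${TMPDIR:-/tmp}/incident-mode.XXXXXX"
}
```

A test could pass `--state-dir "$(fresh_state_dir)"` so that each run starts from an empty state and leaves the default `out/incident-mode/` untouched.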
## Clearing / reset
- The cooldown counter increments only while the burn rate stays below the reset threshold.
- Once the cooldown minutes are met, `flag.json` flips to `enabled=false`; the script
  leaves prior retention files untouched (downstream can prune them separately).
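The activate/cooldown/clear behaviour can be pictured as a small state transition, sketched below. This is not the script's actual implementation: the function name `next_state` is hypothetical, one call is assumed to represent one minute, and resetting the counter when the burn rate sits between the two thresholds is an assumption the document does not confirm.

```bash
# Hypothetical transition function; one call is assumed to represent one minute.
# Args: enabled(0|1) cooldown_count burn threshold reset_threshold cooldown_mins
# Prints: "<enabled> <cooldown_count>"
next_state() {
  awk -v en="$1" -v count="$2" -v burn="$3" -v th="$4" -v rs="$5" -v mins="$6" 'BEGIN {
    en += 0; count += 0                                  # force numeric values
    if (burn + 0 >= th + 0)      { en = 1; count = 0 }   # fast burn: (re)activate
    else if (burn + 0 < rs + 0)  { if (en) count++ }     # calm: count toward clearing
    else                         { count = 0 }           # in between: assumed to restart the count
    if (en && count >= mins + 0) { en = 0; count = 0 }   # cooldown met: clear the flag
    print en, count
  }'
}
```

For example, `next_state 0 0 3.2 2.5 0.4 15` prints `1 0` (activated), and `next_state 1 14 0.2 2.5 0.4 15` prints `0 0` (fifteenth calm minute, flag cleared). Keeping two thresholds rather than one avoids flapping when the burn rate hovers near the activation value.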