git.stella-ops.org/ops/devops/observability/incident-mode.md
StellaOps Bot 6bee1fdcf5
2025-11-25 08:01:23 +02:00

Incident Mode Automation (DEVOPS-OBS-55-001)

What it does

  • Auto-enables an incident feature flag when SLO burn rate crosses a threshold.
  • Writes deterministic retention overrides (hours) for downstream storage/ingest.
  • Auto-clears after a cooldown once burn is back under the reset threshold.
  • Offline-friendly: no external calls; pure file outputs under out/incident-mode/.
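The threshold behaviour described above can be sketched as a small classifier. This is an illustration only, not the script's actual code: the function name `classify_burn` and the "hold" outcome for burn rates between the two thresholds are assumptions, since the document does not say what happens in that band.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of incident-mode.sh): classify one burn-rate
# sample against the activation and reset thresholds.
# Prints "activate", "cooldown", or "hold" (the last one is an assumption).
classify_burn() {
  local burn="$1" threshold="$2" reset="$3"
  # awk handles the floating-point comparison; "+ 0" forces numeric context.
  awk -v b="$burn" -v t="$threshold" -v r="$reset" 'BEGIN {
    if (b + 0 >= t + 0)     print "activate"   # burn at or above threshold
    else if (b + 0 < r + 0) print "cooldown"   # burn under reset threshold
    else                    print "hold"       # in between: assumed no change
  }'
}

classify_burn 3.2 2.5 0.4
classify_burn 0.2 2.5 0.4
```

Keeping two separate thresholds gives hysteresis: a burn rate hovering near 2.5 does not toggle the flag on every run.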

Inputs

  • Burn rate multiple (fast-burn): required.
  • Thresholds/cooldown/retention configurable via CLI flags or env vars.
  • Optional note for audit context.

Outputs

  • flag.json — enabled/disabled + burn rate and note.
  • retention.json — retention override hours + applied time.
  • last_burn.txt, cooldown.txt — trace for automation/testing.
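As a rough picture of the first output, the sketch below writes a `flag.json` carrying the three pieces of information listed above. The field names (`enabled`, `burn_rate`, `note`) and the helper `write_flag` are assumptions for illustration; the real file's schema may differ.

```shell
#!/usr/bin/env bash
# Hypothetical writer: field names are assumed, not taken from the script.
write_flag() {
  local state_dir="$1" enabled="$2" burn="$3" note="$4"
  local note_escaped
  # Escape backslashes first, then double quotes, so the note stays valid JSON.
  note_escaped=$(printf '%s' "$note" | sed 's/\\/\\\\/g; s/"/\\"/g')
  mkdir -p "$state_dir"
  printf '{"enabled": %s, "burn_rate": %s, "note": "%s"}\n' \
    "$enabled" "$burn" "$note_escaped" > "$state_dir/flag.json"
}

# Example: write into a throwaway directory and show the result.
demo_dir=$(mktemp -d)
write_flag "$demo_dir" true 3.2 "api error burst"
cat "$demo_dir/flag.json"
```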

Usage

# Activate if burn rate >= 2.5; otherwise advance the cooldown, and clear after burn stays below 0.4 for 15 minutes
scripts/observability/incident-mode.sh \
  --burn-rate 3.2 \
  --threshold 2.5 \
  --reset-threshold 0.4 \
  --cooldown-mins 15 \
  --retention-hours 48 \
  --note "api error burst"

# Later (burn back to normal):
scripts/observability/incident-mode.sh --burn-rate 0.2 --reset-threshold 0.4 --cooldown-mins 15

Outputs land in out/incident-mode/ by default (override with --state-dir).

Integration hooks

  • Prometheus rule should page on SLOBurnRateFast (already in alerts-slo.yaml).
  • A small runner (cron/workflow) can feed burn rate into this script from PromQL (scalar(slo:burn_rate:fast)), then distribute flag.json via configmap/secret.
  • Downstream services can read retention.json to temporarily raise retention windows during incident mode.
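A runner along the lines of the second hook might look like the sketch below. It queries the Prometheus HTTP API (`/api/v1/query`) for `scalar(slo:burn_rate:fast)` and passes the value to the script. `PROM_URL`, `extract_scalar`, and `feed_burn_rate` are hypothetical names, and the `sed` parse assumes the compact scalar response Prometheus returns; where `jq` is available, `jq -r '.data.result[1]'` is a sturdier choice.

```shell
#!/usr/bin/env bash
# Hypothetical runner: PROM_URL and both helper names are assumptions.

# Pull the sample value out of a compact Prometheus scalar response:
#   {"status":"success","data":{"resultType":"scalar","result":[<ts>,"<value>"]}}
extract_scalar() {
  sed -n 's/.*"result":\[[^,]*,"\([^"]*\)"].*/\1/p'
}

# Query the fast burn rate and hand it to the incident-mode script.
# Extra arguments (thresholds, cooldown, note) are passed through unchanged.
feed_burn_rate() {
  local prom_url="${PROM_URL:-http://localhost:9090}"
  local burn
  burn=$(curl -fsS --get \
           --data-urlencode 'query=scalar(slo:burn_rate:fast)' \
           "$prom_url/api/v1/query" | extract_scalar)
  # Skip the run on an empty or non-numeric sample (for example NaN).
  case "$burn" in
    ''|*[!0-9.]*) echo "no usable burn rate sample" >&2; return 1 ;;
  esac
  scripts/observability/incident-mode.sh --burn-rate "$burn" "$@"
}

# Example cron/workflow step (not executed here):
#   feed_burn_rate --threshold 2.5 --reset-threshold 0.4 --cooldown-mins 15
```

The network call lives in the runner, so the script itself stays free of external calls.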

Determinism

  • Timestamps are UTC ISO-8601; no network dependencies.
  • State is contained under the chosen state-dir for reproducible runs.
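For reference, a UTC ISO-8601 timestamp of the kind mentioned above can be produced with POSIX `date`. The helper name `utc_now` and the exact second-level precision are assumptions; the script may format its timestamps differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit the current time as UTC ISO-8601, e.g. 2025-11-25T06:01:23Z.
utc_now() {
  date -u +%Y-%m-%dT%H:%M:%SZ
}

utc_now

# For a reproducible run, point the script at an isolated state directory
# (not executed here):
#   scripts/observability/incident-mode.sh --burn-rate 3.2 --state-dir "$(mktemp -d)"
```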

Clearing / reset

  • Cooldown counter increments only when burn stays below reset threshold.
  • Once the cooldown minutes have elapsed, flag.json flips to enabled=false. The script leaves existing retention files untouched, so downstream consumers can prune them separately.
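The cooldown rule can be sketched as a counter update plus a clear check. This is a guess at the mechanics, not the script's code: the helper names, the per-run step of one minute, and the reset to zero when burn rises back to or above the reset threshold are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the cooldown rule.

# next_cooldown <counter> <burn> <reset-threshold> [step-mins]
# Prints the new counter: advanced while burn is under the reset threshold,
# otherwise reset to 0 (assumed behaviour).
next_cooldown() {
  local counter="$1" burn="$2" reset="$3" step="${4:-1}"
  awk -v c="$counter" -v b="$burn" -v r="$reset" -v s="$step" 'BEGIN {
    if (b + 0 < r + 0) print c + s
    else               print 0
  }'
}

# should_clear <counter> <cooldown-mins>
# Succeeds once the counter has reached the configured cooldown.
should_clear() {
  [ "$1" -ge "$2" ]
}

counter=$(next_cooldown 14 0.2 0.4)
if should_clear "$counter" 15; then
  echo "cooldown met: flag.json would flip to enabled=false"
fi
```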