stella-ops.org/git.stella-ops.org

Fork 0

Files

StellaOps Bot 6bee1fdcf5

AOC Guard CI / aoc-guard (push) Has been cancelled

Details

AOC Guard CI / aoc-verify (push) Has been cancelled

Details

Concelier Attestation Tests / attestation-tests (push) Has been cancelled

Details

Docs CI / lint-and-preview (push) Has been cancelled

Details

work

2025-11-25 08:01:23 +02:00

2.0 KiB

Raw Blame History

Incident Mode Automation (DEVOPS-OBS-55-001)

What it does

Auto-enables an incident feature flag when SLO burn rate crosses a threshold.
Writes deterministic retention overrides (hours) for downstream storage/ingest.
Auto-clears after a cooldown once burn is back under the reset threshold.
Offline-friendly: no external calls; pure file outputs under out/incident-mode/.

Inputs

Burn rate multiple (fast-burn): required.
Thresholds/cooldown/retention configurable via CLI flags or env vars.
Optional note for audit context.

Outputs

flag.json — enabled/disabled + burn rate and note.
retention.json — retention override hours + applied time.
last_burn.txt, cooldown.txt — trace for automation/testing.

Usage

# Activate if burn >= 2.5, otherwise decay cooldown; clear after 15 mins <0.4
scripts/observability/incident-mode.sh \
  --burn-rate 3.2 \
  --threshold 2.5 \
  --reset-threshold 0.4 \
  --cooldown-mins 15 \
  --retention-hours 48 \
  --note "api error burst"

# Later (burn back to normal):
scripts/observability/incident-mode.sh --burn-rate 0.2 --reset-threshold 0.4 --cooldown-mins 15

Outputs land in out/incident-mode/ by default (override with --state-dir).

Integration hooks

Prometheus rule should page on SLOBurnRateFast (already in alerts-slo.yaml).
A small runner (cron/workflow) can feed burn rate into this script from PromQL (scalar(slo:burn_rate:fast)), then distribute flag.json via configmap/secret.
Downstream services can read retention.json to temporarily raise retention windows during incident mode.

Determinism

Timestamps are UTC ISO-8601; no network dependencies.
State is contained under the chosen state-dir for reproducible runs.

Clearing / reset

Cooldown counter increments only when burn stays below reset threshold.
Once cooldown minutes are met, flag.json flips enabled=false and the script leaves prior retention files untouched (downstream can prune separately).

2.0 KiB Raw Blame History

Incident Mode Automation (DEVOPS-OBS-55-001)

What it does

Inputs

Outputs

Usage

Integration hooks

Determinism

Clearing / reset

2.0 KiB

Raw Blame History