git.stella-ops.org/20-Dec-2025 - Moat Explanation - Risk Budgets and Diff-Aware Release Gates.md at 05833e0af21a44cf323e627808141fce0d84ca2f - git.stella-ops.org

master d0a7b88398 move docs/**/archived/* to docs-archived/**/*

2026-01-05 16:02:11 +02:00


Stella Ops Guidelines

Risk Budgets and Diff-Aware Release Gates

Audience: Product Managers (PMs) and Development Managers (DMs)

Applies to: All customer-impacting software and configuration changes shipped by Stella Ops (code, infrastructure-as-code, runtime config, feature flags, data migrations, dependency upgrades).

1) What we are optimizing for

Stella Ops ships quickly without letting change-driven incidents, security regressions, or data integrity failures become the hidden cost of “speed.”

These guidelines enforce two linked controls:

Risk Budgets — a quantitative “capacity to take risk” that prevents reliability and trust from being silently depleted.
Diff-Aware Release Gates — release checks whose strictness scales with what changed (the diff), not with generic process.

Together they let us move fast on low-risk diffs and slow down only when the change warrants it.

2) Non-negotiable principles

All changes are risk-bearing (even “small” diffs). We quantify and route them accordingly.
Risk is managed at the product/service boundary (each service has its own budget and gating profile).
Automation first, approvals last. Humans review what automation cannot reliably verify.
Blast radius is a first-class variable. A safe rollout beats a perfect code review.
Exceptions are allowed but never free. Every bypass is logged, justified, and paid back via budget reduction and follow-up controls.

3) Definitions

3.1 Risk Budget (what it is)

A Risk Budget is the amount of change-risk a product/service is allowed to take over a defined window (typically a sprint or month) without increasing the probability of customer harm beyond the agreed tolerance.

It is a management control, not a theoretical score.

3.2 Risk Budget vs. Error Budget (important distinction)

Error Budget (classic SRE): backward-looking tolerance for actual unreliability vs. SLO.
Risk Budget (this policy): forward-looking tolerance for change risk before shipping.

They interact:

If error budget is burned (service is unstable), risk budget is automatically constrained.
If risk budget is low, release gates tighten by policy.

3.3 Diff-aware release gates (what they are)

A release gate is a set of required checks (tests, scans, reviews, rollout controls) that must pass before a change can progress. Diff-aware means the gate level is determined by:

what changed (diff classification),
where it changed (criticality),
how it ships (blast radius controls),
and current operational context (incidents, SLO health, budget remaining).

4) Roles and accountability

Product Manager (PM) — accountable for risk appetite

PM responsibilities:

Define product-level risk tolerance with stakeholders (customer impact tolerance, regulatory constraints).
Approve the Risk Budget Policy settings for their product/service tier (criticality level, default gates).
Prioritize reliability work when budgets are constrained.
Own customer communications for degraded service or risk-driven release deferrals.

Development Manager (DM) — accountable for enforcement and engineering hygiene

DM responsibilities:

Ensure pipelines implement diff classification and enforce gates.
Ensure tests, telemetry, rollout mechanisms, and rollback procedures exist and are maintained.
Ensure the exceptions process is real (every exception is logged, reviewed in a postmortem, and paid back).
Own staffing/rotation decisions to ensure safe releases (on-call readiness, release captains).

Shared responsibilities

PM + DM jointly:

Review risk budget status weekly.
Resolve trade-offs: feature velocity vs. reliability/security work.
Approve gate profile changes (tighten/loosen) based on evidence.

5) Risk Budgets

5.1 Establish service tiers (criticality)

Each service/product component must be assigned a Criticality Tier:

Tier 0 – Internal only (no external customers; low business impact)
Tier 1 – Customer-facing non-critical (degradation tolerated; limited blast radius)
Tier 2 – Customer-facing critical (core workflows; meaningful revenue/trust impact)
Tier 3 – Safety/financial/data-critical (payments, auth, permissions, PII, regulated workflows)

Tier drives default budgets and minimum gates.

5.2 Choose a budget window and units

Window: default to monthly with weekly tracking; optionally sprint-based if release cadence is sprint-coupled.

Units: use Risk Points (RP), consumed by each change. (Do not overcomplicate at first; tune with data.)

Recommended initial monthly budgets (adjust after 2–3 cycles with evidence):

Tier 0: 300 RP/month
Tier 1: 200 RP/month
Tier 2: 120 RP/month
Tier 3: 80 RP/month

Interpretation: Tier 3 ships fewer “risky” changes; it can still ship frequently, but changes must be decomposed into low-risk diffs and shipped with strong controls.

5.3 Risk Point scoring (how changes consume budget)

Every change gets a Release Risk Score (RRS) in RP.

A practical baseline model:

RRS = Base(criticality) + Diff Risk + Operational Context – Mitigations

Base (criticality):

Tier 0: +1
Tier 1: +3
Tier 2: +6
Tier 3: +10

Diff Risk (additive):

+1: docs, comments, non-executed code paths, telemetry-only additions
+3: UI changes, non-core logic changes, refactors with high test coverage
+6: API contract changes, dependency upgrades, medium-complexity logic in a core path
+10: database schema migrations, auth/permission logic, data retention/PII handling
+15: infra/networking changes, encryption/key handling, payment flows, queue semantics changes

Operational Context (additive):

+5: service currently in incident or had Sev1/Sev2 in last 7 days
+3: error budget < 50% remaining
+2: on-call load high (paging above normal baseline)
+5: release during restricted windows (holidays/freeze) via exception

Mitigations (subtract):

–3: feature flag with staged rollout + instant kill switch verified
–3: canary + automated health gates + rollback tested in last 30 days
–2: high-confidence integration coverage for touched components
–2: no data migration OR backward-compatible migration with proven rollback
–2: change isolated behind permission boundary / limited cohort

Minimum RRS floor: never below 1 RP.

The DM is responsible for ensuring the pipeline calculates a default RRS automatically and requires human input only for edge cases.
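As a minimal sketch of the baseline model above, the score can be computed as a sum of point lists with a floor of 1 RP. The `Change` class, its field names, and `release_risk_score` are illustrative assumptions, not part of the policy; only the point values and the formula come from the text.

```python
from dataclasses import dataclass, field

# Base points per criticality tier, copied from the Base (criticality) list.
BASE_RP = {0: 1, 1: 3, 2: 6, 3: 10}


@dataclass
class Change:
    """Inputs to the score. Field names are hypothetical, not a fixed schema."""
    tier: int
    diff_risk: list[int] = field(default_factory=list)    # e.g. [10] for a schema migration
    context: list[int] = field(default_factory=list)      # e.g. [3] for error budget < 50%
    mitigations: list[int] = field(default_factory=list)  # positive magnitudes, e.g. [3, 2]


def release_risk_score(change: Change) -> int:
    """RRS = Base + Diff Risk + Operational Context - Mitigations, floored at 1 RP."""
    raw = (
        BASE_RP[change.tier]
        + sum(change.diff_risk)
        + sum(change.context)
        - sum(change.mitigations)
    )
    return max(1, raw)


# Tier 2 schema migration while the error budget is below 50%, shipped behind a
# verified feature flag with a backward-compatible migration:
# 6 + 10 + 3 - 3 - 2 = 14 RP
migration = Change(tier=2, diff_risk=[10], context=[3], mitigations=[3, 2])
print(release_risk_score(migration))

# Tier 0 docs change with the same mitigations: 1 + 1 - 3 - 2 = -3, floored to 1 RP
docs = Change(tier=0, diff_risk=[1], mitigations=[3, 2])
print(release_risk_score(docs))
```

Passing mitigations as positive magnitudes keeps the subtraction explicit in one place, which makes the floor easy to audit.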

5.4 Budget operating rules

Budget ledger: Maintain a per-service ledger:

Budget allocated for the window
RP consumed per release
RP remaining
Trendline (projected depletion date)
Exceptions (break-glass releases)
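One possible shape for such a ledger is sketched below. The class and field names, the service name `billing-api`, and the linear burn-rate projection are assumptions made for illustration; the policy only lists what the ledger must track.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class LedgerEntry:
    day: date
    rp: int
    is_exception: bool = False  # True for break-glass releases


@dataclass
class BudgetLedger:
    service: str
    allocated_rp: int
    window_start: date
    window_end: date
    entries: list[LedgerEntry] = field(default_factory=list)

    def record(self, day: date, rp: int, is_exception: bool = False) -> None:
        self.entries.append(LedgerEntry(day, rp, is_exception))

    @property
    def consumed_rp(self) -> int:
        return sum(e.rp for e in self.entries)

    @property
    def remaining_rp(self) -> int:
        return self.allocated_rp - self.consumed_rp

    @property
    def exceptions(self) -> list[LedgerEntry]:
        return [e for e in self.entries if e.is_exception]

    def projected_depletion(self, today: date) -> Optional[date]:
        """Linear projection from the average daily burn so far in the window.

        Returns None when the budget is not projected to run out before the
        window ends (assumed projection method, not mandated by the policy).
        """
        if self.remaining_rp <= 0:
            return today
        elapsed_days = (today - self.window_start).days + 1
        if self.consumed_rp == 0:
            return None
        # Ceiling division: whole days until the remaining RP are used up.
        days_left = -(-self.remaining_rp * elapsed_days // self.consumed_rp)
        projected = today + timedelta(days=days_left)
        return projected if projected <= self.window_end else None


# Hypothetical Tier 2 service with a 120 RP monthly budget.
ledger = BudgetLedger("billing-api", 120, date(2026, 1, 1), date(2026, 1, 31))
ledger.record(date(2026, 1, 3), 14)
ledger.record(date(2026, 1, 8), 26, is_exception=True)
print(ledger.remaining_rp)                            # 80
print(ledger.projected_depletion(date(2026, 1, 10)))  # 2026-01-30
```

In the example, 40 RP were consumed in the first 10 days (4 RP/day), so the remaining 80 RP are projected to last 20 more days.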

Control thresholds:

Green (≥60% remaining): normal operation
Yellow (30% to <60% remaining): additional caution; gates tighten by 1 level for medium/high-risk diffs
Red (>0% to <30% remaining): freeze high-risk diffs; allow only low-risk changes or reliability/security work
Exhausted (≤0%): releases restricted to incident fixes, security fixes, and rollback-only, with tightened gates and explicit sign-off
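The thresholds translate into a small lookup. This is a sketch only; the function name and the treatment of fractional percentages between 59% and 60% as Yellow are assumptions.

```python
def budget_status(allocated_rp: int, remaining_rp: int) -> str:
    """Classify the remaining budget into the four control bands."""
    pct_remaining = 100 * remaining_rp / allocated_rp
    if pct_remaining <= 0:
        return "exhausted"
    if pct_remaining < 30:
        return "red"
    if pct_remaining < 60:
        return "yellow"
    return "green"


# A Tier 2 service (120 RP/month) at different points in the window.
for remaining in (80, 60, 30, 0):
    print(remaining, budget_status(120, remaining))
```

With 120 RP allocated, 80 RP remaining is Green (about 67%), 60 RP is Yellow (50%), 30 RP is Red (25%), and 0 RP is Exhausted.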

5.5 What to do when budget is low (expected behavior)

When Yellow/Red:

PM shifts roadmap execution toward:
- reliability work, defect burn-down,
- decomposing large changes into smaller, reversible diffs,
- reducing scope of risky features.
DM enforces:
- smaller diffs,
- increased feature flagging,
- staged rollout requirements,
- improved test/observability coverage.

Budget constraints are a signal, not a punishment.

5.6 Budget replenishment and incentives

Budgets replenish on the window boundary, but we also allow earned capacity:

If a service improves change failure rate and MTTR for 2 consecutive windows, it may earn:
- +10–20% budget increase or
- one gate level relaxation for specific change categories

This must be evidence-driven (metrics, not opinions).
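A hedged sketch of the earned-capacity check follows. It assumes "improves for 2 consecutive windows" means both change failure rate and MTTR dropped window over window twice in a row, which needs three windows of history; that interpretation, and all names below, are assumptions rather than policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMetrics:
    change_failure_rate: float  # failed deploys / total deploys in the window
    mttr_minutes: float         # mean time to restore in the window


def earns_capacity(history: list[WindowMetrics]) -> bool:
    """True if both metrics improved in each of the last two windows (oldest first)."""
    if len(history) < 3:
        return False
    a, b, c = history[-3:]
    cfr_improved = c.change_failure_rate < b.change_failure_rate < a.change_failure_rate
    mttr_improved = c.mttr_minutes < b.mttr_minutes < a.mttr_minutes
    return cfr_improved and mttr_improved


def earned_budget(base_rp: int, increase_pct: int = 10) -> int:
    """Apply a budget increase within the 10-20% range named in the policy."""
    if not 10 <= increase_pct <= 20:
        raise ValueError("increase must be between 10 and 20 percent")
    return base_rp * (100 + increase_pct) // 100


history = [
    WindowMetrics(0.20, 90.0),
    WindowMetrics(0.15, 60.0),
    WindowMetrics(0.10, 45.0),
]
if earns_capacity(history):
    print(earned_budget(120, 10))  # 132 RP for a Tier 2 service
```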

6) Diff-Aware Release Gates

6.1 Diff classification (what the pipeline must detect)

At minimum, automatically classify diffs into these categories:

Code scope

Executable code vs docs-only
Core vs non-core modules (define module ownership boundaries)
Hot paths (latency-sensitive), correctness-sensitive paths

Data scope

Schema migration (additive vs breaking)
Backfill jobs / batch jobs
Data model changes impacting downstream consumers
PII / regulated data touchpoints

Security scope

Authn/authz logic
Permission checks
Secrets, key handling, encryption changes
Dependency changes with known CVEs

Infra scope

IaC changes, networking, load balancer, DNS, autoscaling
Runtime config changes (feature flags, limits, thresholds)
Queue/topic changes, retention settings

Interface scope

Public API contract changes
Backward compatibility of payloads/events
Client version dependency
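A first-pass classifier can work from changed file paths alone. The sketch below uses glob patterns from Python's standard `fnmatch` module; every path pattern and category name is a hypothetical placeholder that each repository would replace with its own layout, and path matching cannot detect semantic properties such as breaking versus additive migrations.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns per category; adapt to the repository layout.
CATEGORY_PATTERNS = {
    "docs": ["docs/*", "*.md"],
    "db": ["migrations/*", "*.sql"],
    "auth": ["src/auth/*", "src/permissions/*"],
    "infra": ["infra/*", "*.tf"],
    "api": ["api/*", "*.proto"],
    "deps": ["requirements*.txt", "package-lock.json", "go.sum"],
    "config": ["config/*", "flags/*"],
}


def classify_diff(paths: list[str]) -> set[str]:
    """Return every category touched by the changed paths.

    Paths that match no pattern fall back to the generic "code" category.
    """
    found: set[str] = set()
    for path in paths:
        matched = [
            category
            for category, patterns in CATEGORY_PATTERNS.items()
            if any(fnmatchcase(path, pattern) for pattern in patterns)
        ]
        found.update(matched or ["code"])
    return found


def is_docs_only(paths: list[str]) -> bool:
    return classify_diff(paths) == {"docs"}


changed = ["migrations/0042_add_col.sql", "src/auth/login.py", "src/app/views.py"]
print(sorted(classify_diff(changed)))      # ['auth', 'code', 'db']
print(is_docs_only(["docs/guide.md"]))     # True
```

`fnmatchcase` is used instead of `fnmatch` so that matching is case-sensitive and identical on every operating system.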

6.2 Gate levels

Define Gate Levels G0–G4. The pipeline assigns one based on diff + context + budget.

G0 — No-risk / administrative

Use for:

docs-only, comments-only, non-functional metadata

Requirements:

Lint/format checks
Basic CI pass (build)

G1 — Low risk

Use for:

small, localized code changes with strong unit coverage
non-core UI changes
telemetry additions (no removal)

Requirements:

All automated unit tests
Static analysis/linting
1 peer review (code owner not required if outside critical modules)
Automated deploy to staging
Post-deploy smoke checks

G2 — Moderate risk

Use for:

moderate logic changes in customer-facing paths
dependency upgrades
API changes that are backward compatible
config changes affecting behavior

Requirements:

G1 +
Integration tests relevant to impacted modules
Code owner review for touched modules
Feature flag required if customer impact possible
Staged rollout: canary or small cohort
Rollback plan documented in PR

G3 — High risk

Use for:

schema migrations
auth/permission changes
core business logic in critical flows
infra changes affecting availability
non-trivial concurrency/queue semantics changes

Requirements:

G2 +
Security scan + dependency audit (must pass, exceptions logged)
Migration plan (forward + rollback) reviewed
Load/performance checks if in hot path
Observability: new/updated dashboards/alerts for the change
Release captain / on-call sign-off (someone accountable live)
Progressive delivery with automatic health gates (error rate/latency)

G4 — Very high risk / safety-critical / budget-constrained releases

Use for:

Tier 3 critical systems with low budget remaining
changes during freeze windows via exception
broad blast radius changes (platform-wide)
remediation after major incident where recurrence risk is high

Requirements:

G3 +
Formal risk review (PM+DM+Security/SRE) in writing
Explicit rollback rehearsal or prior proven rollback path
Extended canary period with success criteria and abort criteria
Customer comms plan if impact is plausible
Post-release verification checklist executed and logged
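Because G2 through G4 each build on the previous level, the full checklist for a gate can be resolved by walking the inheritance chain. The sketch below is one way to encode that; the data structure and function name are assumptions, and the requirement strings are abbreviated from the lists above.

```python
# Requirements each gate adds on its own (abbreviated from the policy text).
OWN_REQUIREMENTS = {
    "G0": ["lint/format checks", "basic CI pass (build)"],
    "G1": [
        "all automated unit tests",
        "static analysis/linting",
        "1 peer review",
        "automated deploy to staging",
        "post-deploy smoke checks",
    ],
    "G2": [
        "integration tests for impacted modules",
        "code owner review",
        "feature flag if customer impact possible",
        "staged rollout",
        "rollback plan documented in PR",
    ],
    "G3": [
        "security scan + dependency audit",
        "migration plan reviewed",
        "load/performance checks if in hot path",
        "dashboards/alerts updated",
        "release captain / on-call sign-off",
        "progressive delivery with automatic health gates",
    ],
    "G4": [
        "formal written risk review",
        "rollback rehearsal or proven rollback path",
        "extended canary with success and abort criteria",
        "customer comms plan if impact is plausible",
        "post-release verification checklist logged",
    ],
}

# G2 = G1 + ..., G3 = G2 + ..., G4 = G3 + ...; G0 and G1 stand alone.
INHERITS_FROM = {"G2": "G1", "G3": "G2", "G4": "G3"}


def requirements(gate: str) -> list[str]:
    """Full requirement list for a gate, inherited items first."""
    parent = INHERITS_FROM.get(gate)
    inherited = requirements(parent) if parent else []
    return inherited + OWN_REQUIREMENTS[gate]


print(len(requirements("G0")))  # 2
print(len(requirements("G3")))  # 16
```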

6.3 Gate selection logic (policy)

Default rule:

Compute RRS (Risk Points) from diff + context.
Map RRS to default gate:
- 1–5 RP → G1
- 6–12 RP → G2
- 13–20 RP → G3
- 21+ RP → G4
Apply modifiers:
- If budget Yellow: escalate one gate for changes ≥ G2
- If budget Red: escalate one gate for changes ≥ G1 and block high-risk categories unless exception
- If active incident or error budget severely degraded: block non-fix releases by default

DM must ensure the pipeline enforces this mapping automatically.
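The default rule and its modifiers could be sketched as below. Several points are assumptions: G0 is assigned by diff classification (docs-only) rather than by score, since the RRS mapping starts at G1; "high-risk categories" is passed in as a flag from the classifier; gates are capped at G4; and the Exhausted band from section 5.4 is not modeled.

```python
def default_gate(rrs: int) -> int:
    """Map a Release Risk Score to the default gate level (1..4)."""
    if rrs <= 5:
        return 1
    if rrs <= 12:
        return 2
    if rrs <= 20:
        return 3
    return 4


def select_gate(
    rrs: int,
    budget: str = "green",  # "green", "yellow", or "red"
    *,
    docs_only: bool = False,
    high_risk_category: bool = False,
    has_exception: bool = False,
    active_incident: bool = False,
    is_fix: bool = False,
) -> str:
    """Return "G0".."G4", or "BLOCKED" when policy stops the release."""
    if active_incident and not is_fix:
        return "BLOCKED"  # non-fix releases are blocked during incidents
    if docs_only:
        return "G0"       # assumption: G0 comes from classification, not score
    gate = default_gate(rrs)
    if budget == "yellow" and gate >= 2:
        gate += 1
    elif budget == "red":
        if high_risk_category and not has_exception:
            return "BLOCKED"
        gate += 1         # every scored change is already >= G1
    return f"G{min(gate, 4)}"


print(select_gate(14))                                  # G3
print(select_gate(8, "yellow"))                         # G3
print(select_gate(4, "red"))                            # G2
print(select_gate(14, "red", high_risk_category=True))  # BLOCKED
```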

6.4 “Diff-aware” also means “blast-radius aware”

If the diff is inherently risky, reduce risk operationally:

feature flags with cohort controls
dark launches (ship code disabled)
canary deployments
blue/green with quick revert
backwards-compatible DB migrations (expand/contract pattern)
circuit breakers and rate limiting
progressive exposure by tenant / region / account segment

Large diffs are not “made safe” by more reviewers; they are made safe by reversibility and containment.

7) Exceptions (“break glass”) policy

Exceptions are permitted only when one of these is true:

incident mitigation or customer harm prevention,
urgent security fix (actively exploited or high severity),
legal/compliance deadline.

Requirements for any exception:

Recorded rationale in the PR/release ticket
Named approver(s): DM + on-call owner; PM for customer-impacting risk
Mandatory follow-up within 5 business days:
- post-incident or post-release review
- remediation tasks created and prioritized
Budget penalty: deduct additional RP from the remaining budget on top of the change's own RRS (e.g., an extra 50% of the RRS) to reflect unmanaged risk

Repeated exceptions are a governance failure and trigger gate tightening.
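To make the penalty arithmetic concrete, here is a small sketch. The 50% figure is the example rate from the policy; rounding the penalty up to a whole RP and the function name are assumptions.

```python
def exception_charge(rrs: int, penalty_pct: int = 50) -> int:
    """Total RP charged to the ledger for a break-glass release.

    The penalty is rounded up to a whole RP (assumed rounding rule).
    """
    penalty = (rrs * penalty_pct + 99) // 100  # integer ceiling of rrs * pct / 100
    return rrs + penalty


print(exception_charge(14))  # 14 + 7 = 21 RP
print(exception_charge(15))  # 15 + 8 = 23 RP
```

A change that would normally cost 14 RP therefore consumes 21 RP when shipped as an exception.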

8) Operational metrics (what PMs and DMs must review)

Minimum weekly review dashboard per service:

Risk budget remaining (RP and %)
Deploy frequency
Change failure rate
MTTR
Sev1/Sev2 count (rolling 30/90 days)
SLO / error budget status
Gate compliance rate (share of releases that passed their assigned gate without a bypass or override)
Diff size distribution (are we shipping huge diffs?)
Rollback frequency and time-to-rollback

Policy expectation:

If change failure rate or MTTR worsens materially over 2 windows, budgets tighten and gate mapping escalates until stability returns.
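Change failure rate and MTTR can be derived from per-deploy records. The sketch below assumes a hypothetical `Deploy` record with failure detection and restoration timestamps; how those timestamps are sourced (incident tooling, deploy logs) is left open.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Deploy:
    failed: bool
    detected_at: Optional[datetime] = None  # when the failure was detected
    restored_at: Optional[datetime] = None  # when service was restored


def change_failure_rate(deploys: list[Deploy]) -> float:
    """Share of deploys in the window that caused a failure."""
    if not deploys:
        return 0.0
    return sum(d.failed for d in deploys) / len(deploys)


def mttr_minutes(deploys: list[Deploy]) -> Optional[float]:
    """Mean minutes from detection to restoration; None if there were no failures."""
    durations = [
        (d.restored_at - d.detected_at).total_seconds() / 60
        for d in deploys
        if d.failed and d.detected_at and d.restored_at
    ]
    return sum(durations) / len(durations) if durations else None


window = [
    Deploy(failed=False),
    Deploy(True, datetime(2026, 1, 5, 10, 0), datetime(2026, 1, 5, 10, 30)),
    Deploy(failed=False),
    Deploy(True, datetime(2026, 1, 9, 14, 0), datetime(2026, 1, 9, 15, 30)),
]
print(change_failure_rate(window))  # 0.5
print(mttr_minutes(window))         # 60.0
```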

9) Practical operating cadence

Weekly (PM + DM)

Review budgets and trends
Identify upcoming high-risk releases and plan staged rollouts
Confirm staffing for release windows (release captain / on-call coverage)
Decide whether to defer, decompose, or harden changes

Per release (DM-led, PM informed)

Ensure correct gate level
Verify rollout + rollback readiness
Confirm monitoring/alerts exist and are watched during rollout
Execute post-release verification checklist

Monthly (leadership)

Adjust tier assignments if product criticality changed
Recalibrate budget numbers based on measured outcomes
Identify systemic causes: test gaps, observability gaps, deployment tooling gaps

10) Required templates (standardize execution)

10.1 Release Plan (required for G2+)

What is changing (1–3 bullets)
Expected customer impact (or “none”)
Diff category flags (DB/auth/infra/API/etc.)
Rollout strategy (canary/cohort/blue-green)
Abort criteria (exact metrics/thresholds)
Rollback steps (exact commands/process)
Owners during rollout (names)
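A pipeline check for this template might look like the following sketch. The field keys are hypothetical names for the seven items above, and the plan is assumed to arrive as a dictionary parsed from the PR or release ticket.

```python
# Hypothetical keys, one per Release Plan item.
REQUIRED_FIELDS = [
    "what_is_changing",
    "expected_customer_impact",  # may be the literal string "none"
    "diff_category_flags",
    "rollout_strategy",
    "abort_criteria",
    "rollback_steps",
    "owners",
]


def missing_release_plan_fields(plan: dict, gate: str) -> list[str]:
    """List the required fields that are absent or blank; G0/G1 need no plan."""
    if gate in ("G0", "G1"):
        return []
    return [
        name
        for name in REQUIRED_FIELDS
        if name not in plan or plan[name] is None or plan[name] == ""
    ]


plan = {
    "what_is_changing": ["Add nullable column to invoices table"],
    "expected_customer_impact": "none",
    "diff_category_flags": ["db"],
    "rollout_strategy": "canary",
    "owners": ["release captain on duty"],
}
print(missing_release_plan_fields(plan, "G3"))  # ['abort_criteria', 'rollback_steps']
```

Returning the list of missing fields, rather than a boolean, lets the pipeline tell the author exactly what to add before the gate can pass.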

10.2 Migration Plan (required for schema/data changes)

Migration type: additive / expand-contract / breaking (breaking is disallowed without explicit G4 approval)
Backfill approach and rate limits
Validation checks (row counts, invariants)
Rollback strategy (including data implications)

10.3 Post-release Verification Checklist (G1+)

Smoke test results
Key dashboards checked (latency, error rate, saturation)
Alerts status
User-facing workflows validated (as applicable)
Ticket updated with outcome

11) What “good” looks like

Low-risk diffs ship quickly with minimal ceremony (G0–G1).
High-risk diffs are decomposed and shipped progressively, not heroically.
Risk budgets are visible, used in planning, and treated as a real constraint.
Exceptions are rare and followed by concrete remediation.
Over time: deploy frequency stays high while change failure rate and MTTR decrease.

12) Immediate adoption checklist (first 30 days)

DM deliverables

Implement diff classification in CI/CD (at least: DB/auth/infra/API/deps/config)
Implement automatic gate mapping and enforcement
Add “release plan” and “rollback plan” checks for G2+
Add logging for gate overrides

PM deliverables

Confirm service tiering for owned areas
Approve initial monthly RP budgets
Add risk budget review to the weekly product/engineering ritual
Reprioritize work when budgets hit Yellow/Red (explicitly)
