git.stella-ops.org/docs-archived/product-advisories/2025-12-21-moat-phase2/20-Dec-2025 - Moat Explanation - Risk Budgets and Diff-Aware Release Gates.md
2026-01-05 16:02:11 +02:00


Stella Ops Guidelines

Risk Budgets and Diff-Aware Release Gates

Audience: Product Managers (PMs) and Development Managers (DMs)

Applies to: All customer-impacting software and configuration changes shipped by Stella Ops (code, infrastructure-as-code, runtime config, feature flags, data migrations, dependency upgrades).


1) What we are optimizing for

Stella Ops ships quickly without letting change-driven incidents, security regressions, or data integrity failures become the hidden cost of “speed.”

These guidelines enforce two linked controls:

  1. Risk Budgets — a quantitative “capacity to take risk” that prevents reliability and trust from being silently depleted.
  2. Diff-Aware Release Gates — release checks whose strictness scales with what changed (the diff), not with generic process.

Together they let us move fast on low-risk diffs and slow down only when the change warrants it.


2) Non-negotiable principles

  1. All changes are risk-bearing (even “small” diffs). We quantify and route them accordingly.
  2. Risk is managed at the product/service boundary (each service has its own budget and gating profile).
  3. Automation first, approvals last. Humans review what automation cannot reliably verify.
  4. Blast radius is a first-class variable. A safe rollout beats a perfect code review.
  5. Exceptions are allowed but never free. Every bypass is logged, justified, and paid back via budget reduction and follow-up controls.

3) Definitions

3.1 Risk Budget (what it is)

A Risk Budget is the amount of change-risk a product/service is allowed to take over a defined window (typically a sprint or month) without increasing the probability of customer harm beyond the agreed tolerance.

It is a management control, not a theoretical score.

3.2 Risk Budget vs. Error Budget (important distinction)

  • Error Budget (classic SRE): backward-looking tolerance for actual unreliability vs. SLO.
  • Risk Budget (this policy): forward-looking tolerance for change risk before shipping.

They interact:

  • If error budget is burned (service is unstable), risk budget is automatically constrained.
  • If risk budget is low, release gates tighten by policy.

3.3 Diff-aware release gates (what it is)

A release gate is a set of required checks (tests, scans, reviews, rollout controls) that must pass before a change can progress. Diff-aware means the gate level is determined by:

  • what changed (diff classification),
  • where it changed (criticality),
  • how it ships (blast radius controls),
  • and current operational context (incidents, SLO health, budget remaining).

4) Roles and accountability

Product Manager (PM) — accountable for risk appetite

PM responsibilities:

  • Define product-level risk tolerance with stakeholders (customer impact tolerance, regulatory constraints).
  • Approve the Risk Budget Policy settings for their product/service tier (criticality level, default gates).
  • Prioritize reliability work when budgets are constrained.
  • Own customer communications for degraded service or risk-driven release deferrals.

Development Manager (DM) — accountable for enforcement and engineering hygiene

DM responsibilities:

  • Ensure pipelines implement diff classification and enforce gates.
  • Ensure tests, telemetry, rollout mechanisms, and rollback procedures exist and are maintained.
  • Ensure the exceptions process is real (every exception is logged, reviewed in a postmortem, and paid back).
  • Own staffing/rotation decisions to ensure safe releases (on-call readiness, release captains).

Shared responsibilities

PM + DM jointly:

  • Review risk budget status weekly.
  • Resolve trade-offs: feature velocity vs. reliability/security work.
  • Approve gate profile changes (tighten/loosen) based on evidence.

5) Risk Budgets

5.1 Establish service tiers (criticality)

Each service/product component must be assigned a Criticality Tier:

  • Tier 0 Internal only (no external customers; low business impact)
  • Tier 1 Customer-facing non-critical (degradation tolerated; limited blast radius)
  • Tier 2 Customer-facing critical (core workflows; meaningful revenue/trust impact)
  • Tier 3 Safety/financial/data-critical (payments, auth, permissions, PII, regulated workflows)

Tier drives default budgets and minimum gates.

5.2 Choose a budget window and units

Window: default to monthly with weekly tracking; optionally sprint-based if release cadence is sprint-coupled.

Units: use Risk Points (RP), consumed by each change. (Do not overcomplicate at first; tune with data.)

Recommended initial monthly budgets (adjust after 2–3 cycles with evidence):

  • Tier 0: 300 RP/month
  • Tier 1: 200 RP/month
  • Tier 2: 120 RP/month
  • Tier 3: 80 RP/month

Interpretation: Tier 3 ships fewer “risky” changes; it can still ship frequently, but changes must be decomposed into low-risk diffs and shipped with strong controls.

5.3 Risk Point scoring (how changes consume budget)

Every change gets a Release Risk Score (RRS) in RP.

A practical baseline model:

RRS = Base(criticality) + Diff Risk + Operational Context − Mitigations

Base (criticality):

  • Tier 0: +1
  • Tier 1: +3
  • Tier 2: +6
  • Tier 3: +10

Diff Risk (additive):

  • +1: docs, comments, non-executed code paths, telemetry-only additions
  • +3: UI changes, non-core logic changes, refactors with high test coverage
  • +6: API contract changes, dependency upgrades, medium-complexity logic in a core path
  • +10: database schema migrations, auth/permission logic, data retention/PII handling
  • +15: infra/networking changes, encryption/key handling, payment flows, queue semantics changes

Operational Context (additive):

  • +5: service currently in incident or had Sev1/Sev2 in last 7 days
  • +3: error budget < 50% remaining
  • +2: on-call load high (paging above normal baseline)
  • +5: release during restricted windows (holidays/freeze) via exception

Mitigations (subtract):

  • −3: feature flag with staged rollout + instant kill switch verified
  • −3: canary + automated health gates + rollback tested in last 30 days
  • −2: high-confidence integration coverage for touched components
  • −2: no data migration OR backward-compatible migration with proven rollback
  • −2: change isolated behind permission boundary / limited cohort

Minimum RRS floor: never below 1 RP.

DM is responsible for making sure the pipeline can calculate a default RRS automatically and require humans only for edge cases.
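
As an illustration of the baseline model above, here is a minimal Python sketch of an RRS calculation. The point values are copied from the lists in this section; the `Change` type, its field names, and the function name are hypothetical and not part of any existing Stella Ops pipeline.

```python
from dataclasses import dataclass, field

# Base points per criticality tier, as listed in this section.
BASE_BY_TIER = {0: 1, 1: 3, 2: 6, 3: 10}


@dataclass
class Change:
    """Hypothetical description of one change, for illustration only."""
    tier: int
    diff_risk: list[int] = field(default_factory=list)    # e.g. [6] for a dependency upgrade
    context: list[int] = field(default_factory=list)      # e.g. [3] for error budget < 50%
    mitigations: list[int] = field(default_factory=list)  # positive values; they are subtracted


def release_risk_score(change: Change) -> int:
    """RRS = base + diff risk + operational context - mitigations, floored at 1 RP."""
    raw = (
        BASE_BY_TIER[change.tier]
        + sum(change.diff_risk)
        + sum(change.context)
        - sum(change.mitigations)
    )
    return max(1, raw)


# Tier 2 service, dependency upgrade (+6), error budget < 50% (+3),
# canary with tested rollback (-3), strong integration coverage (-2).
example = Change(tier=2, diff_risk=[6], context=[3], mitigations=[3, 2])
print(release_risk_score(example))  # 6 + 6 + 3 - 5 = 10
```

In this sketch the floor is applied last, so heavily mitigated low-tier changes still consume 1 RP.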

5.4 Budget operating rules

Budget ledger: Maintain a per-service ledger:

  • Budget allocated for the window
  • RP consumed per release
  • RP remaining
  • Trendline (projected depletion date)
  • Exceptions (break-glass releases)
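
One possible shape for such a ledger is sketched below. The class and field names are hypothetical, and the projected depletion date assumes a simple linear burn rate since the window start, which is only one of several reasonable trendline choices.

```python
import math
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class LedgerEntry:
    day: date
    release: str
    rp: int
    exception: bool = False  # True for break-glass releases


@dataclass
class RiskLedger:
    allocated_rp: int
    window_start: date
    entries: list[LedgerEntry] = field(default_factory=list)

    def record(self, day: date, release: str, rp: int, exception: bool = False) -> None:
        self.entries.append(LedgerEntry(day, release, rp, exception))

    @property
    def consumed_rp(self) -> int:
        return sum(e.rp for e in self.entries)

    @property
    def remaining_rp(self) -> int:
        return self.allocated_rp - self.consumed_rp

    @property
    def exceptions(self) -> list[LedgerEntry]:
        return [e for e in self.entries if e.exception]

    def projected_depletion(self, today: date) -> Optional[date]:
        """Linear projection (an assumption): average RP/day so far, extended forward."""
        if self.consumed_rp <= 0:
            return None  # nothing consumed yet, no trend to project
        if self.remaining_rp <= 0:
            return today  # already exhausted
        days_elapsed = (today - self.window_start).days + 1
        burn_per_day = self.consumed_rp / days_elapsed
        return today + timedelta(days=math.ceil(self.remaining_rp / burn_per_day))
```

For example, a 120 RP budget with 40 RP consumed over the first 10 days of the window burns 4 RP per day, so the remaining 80 RP would last roughly 20 more days under this assumption.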

Control thresholds:

  • Green (≥60% remaining): normal operation
  • Yellow (30–59%): additional caution; gates tighten by 1 level for medium/high-risk diffs
  • Red (<30%): freeze high-risk diffs; allow only low-risk changes or reliability/security work
  • Exhausted (≤0%): releases restricted to incident fixes, security fixes, and rollback-only, with tightened gates and explicit sign-off
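
The thresholds above could be encoded as a small function like the following sketch. The function name and the lowercase status strings are illustrative choices. The bands are treated as continuous here, so a value such as 59.5% falls in Yellow; that boundary handling is an assumption.

```python
def budget_status(remaining_rp: float, allocated_rp: float) -> str:
    """Map remaining budget to a control status using the thresholds above."""
    if allocated_rp <= 0:
        raise ValueError("allocated_rp must be positive")
    pct = 100 * remaining_rp / allocated_rp
    if pct <= 0:
        return "exhausted"
    if pct < 30:
        return "red"
    if pct < 60:
        return "yellow"
    return "green"


print(budget_status(72, 120))  # 60% remaining -> green
print(budget_status(36, 120))  # 30% remaining -> yellow
print(budget_status(35, 120))  # about 29% remaining -> red
```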

5.5 What to do when budget is low (expected behavior)

When Yellow/Red:

  • PM shifts roadmap execution toward:

    • reliability work, defect burn-down,
    • decomposing large changes into smaller, reversible diffs,
    • reducing scope of risky features.
  • DM enforces:

    • smaller diffs,
    • increased feature flagging,
    • staged rollout requirements,
    • improved test/observability coverage.

Budget constraints are a signal, not a punishment.

5.6 Budget replenishment and incentives

Budgets replenish on the window boundary, but we also allow earned capacity:

  • If a service improves change failure rate and MTTR for 2 consecutive windows, it may earn:

    • +10–20% budget increase or
    • one gate level relaxation for specific change categories

This must be evidence-driven (metrics, not opinions).
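
A hedged sketch of how the earned-capacity rule might be checked from metrics follows. It interprets "improves for 2 consecutive windows" as each of the last two windows being strictly better than its predecessor on both metrics, which therefore needs three data points; that interpretation, along with all names below, is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMetrics:
    change_failure_rate: float  # fraction of releases causing a failure, 0.0 to 1.0
    mttr_minutes: float


def earned_capacity_eligible(history: list[WindowMetrics]) -> bool:
    """True if both metrics improved in each of the last two windows (oldest first)."""
    if len(history) < 3:
        return False
    a, b, c = history[-3:]
    cfr_improving = c.change_failure_rate < b.change_failure_rate < a.change_failure_rate
    mttr_improving = c.mttr_minutes < b.mttr_minutes < a.mttr_minutes
    return cfr_improving and mttr_improving


def increased_budget(current_rp: int, pct: int = 10) -> int:
    """Apply a budget increase within the 10 to 20 percent range (integer math)."""
    if not 10 <= pct <= 20:
        raise ValueError("pct must be between 10 and 20")
    return current_rp + current_rp * pct // 100
```

For example, a Tier 2 service at 120 RP that qualifies for a 10% increase would move to 132 RP in the next window.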


6) Diff-Aware Release Gates

6.1 Diff classification (what the pipeline must detect)

At minimum, automatically classify diffs into these categories:

Code scope

  • Executable code vs docs-only
  • Core vs non-core modules (define module ownership boundaries)
  • Hot paths (latency-sensitive), correctness-sensitive paths

Data scope

  • Schema migration (additive vs breaking)
  • Backfill jobs / batch jobs
  • Data model changes impacting downstream consumers
  • PII / regulated data touchpoints

Security scope

  • Authn/authz logic
  • Permission checks
  • Secrets, key handling, encryption changes
  • Dependency changes with known CVEs

Infra scope

  • IaC changes, networking, load balancer, DNS, autoscaling
  • Runtime config changes (feature flags, limits, thresholds)
  • Queue/topic changes, retention settings

Interface scope

  • Public API contract changes
  • Backward compatibility of payloads/events
  • Client version dependency
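
A first approximation of this classification can be driven by changed file paths, as in the sketch below. The glob patterns and category names are hypothetical placeholders; each repository would define its own, and path matching alone cannot detect things like hot paths or breaking payload changes, which need deeper analysis.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns per scope; not taken from any real repository.
CATEGORY_PATTERNS = {
    "data": ["migrations/*", "*.sql"],
    "security": ["*/auth/*", "*/permissions/*", "*/crypto/*"],
    "infra": ["infra/*", "deploy/*", "*.tf"],
    "interface": ["api/*", "*.proto"],
    "docs": ["docs/*", "*.md"],
}


def classify_diff(changed_paths: list[str]) -> set[str]:
    """Return the set of scope flags touched by a diff.

    Paths that match no pattern are flagged as plain "code".
    """
    flags: set[str] = set()
    for path in changed_paths:
        matched = False
        for category, patterns in CATEGORY_PATTERNS.items():
            if any(fnmatchcase(path, pattern) for pattern in patterns):
                flags.add(category)
                matched = True
        if not matched:
            flags.add("code")
    return flags


print(classify_diff(["src/auth/login.py", "migrations/0042_add_col.sql"]))
```

A diff is docs-only in this sketch when the returned set is exactly `{"docs"}`.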

6.2 Gate levels

Define Gate Levels G0–G4. The pipeline assigns one based on diff + context + budget.

G0 — No-risk / administrative

Use for:

  • docs-only, comments-only, non-functional metadata

Requirements:

  • Lint/format checks
  • Basic CI pass (build)

G1 — Low risk

Use for:

  • small, localized code changes with strong unit coverage
  • non-core UI changes
  • telemetry additions (no removal)

Requirements:

  • All automated unit tests
  • Static analysis/linting
  • 1 peer review (code owner not required if outside critical modules)
  • Automated deploy to staging
  • Post-deploy smoke checks

G2 — Moderate risk

Use for:

  • moderate logic changes in customer-facing paths
  • dependency upgrades
  • API changes that are backward compatible
  • config changes affecting behavior

Requirements:

  • G1 +
  • Integration tests relevant to impacted modules
  • Code owner review for touched modules
  • Feature flag required if customer impact possible
  • Staged rollout: canary or small cohort
  • Rollback plan documented in PR

G3 — High risk

Use for:

  • schema migrations
  • auth/permission changes
  • core business logic in critical flows
  • infra changes affecting availability
  • non-trivial concurrency/queue semantics changes

Requirements:

  • G2 +
  • Security scan + dependency audit (must pass, exceptions logged)
  • Migration plan (forward + rollback) reviewed
  • Load/performance checks if in hot path
  • Observability: new/updated dashboards/alerts for the change
  • Release captain / on-call sign-off (someone accountable live)
  • Progressive delivery with automatic health gates (error rate/latency)

G4 — Very high risk / safety-critical / budget-constrained releases

Use for:

  • Tier 3 critical systems with low budget remaining
  • changes during freeze windows via exception
  • broad blast radius changes (platform-wide)
  • remediation after major incident where recurrence risk is high

Requirements:

  • G3 +
  • Formal risk review (PM+DM+Security/SRE) in writing
  • Explicit rollback rehearsal or prior proven rollback path
  • Extended canary period with success criteria and abort criteria
  • Customer comms plan if impact is plausible
  • Post-release verification checklist executed and logged

6.3 Gate selection logic (policy)

Default rule:

  1. Compute RRS (Risk Points) from diff + context.

  2. Map RRS to default gate:

    • 1–5 RP → G1
    • 6–12 RP → G2
    • 13–20 RP → G3
    • 21+ RP → G4
  3. Apply modifiers:

    • If budget Yellow: escalate one gate for changes ≥ G2
    • If budget Red: escalate one gate for changes ≥ G1 and block high-risk categories unless exception
    • If active incident or error budget severely degraded: block non-fix releases by default

DM must ensure the pipeline enforces this mapping automatically.
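
The default rule can be sketched in Python as follows, assuming the RRS bands are 1–5, 6–12, 13–20, and 21+ RP. The `GateDecision` type, the parameter names, and the cap at G4 are illustrative assumptions; only the Yellow and Red modifiers named above are handled, and G0 (docs-only) is assumed to be assigned elsewhere by diff classification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    gate: int          # 1..4, meaning G1..G4
    blocked: bool
    reason: str = ""


def default_gate(rrs: int) -> int:
    """Map a Release Risk Score (RP) to its default gate level."""
    if rrs < 1:
        raise ValueError("RRS has a floor of 1 RP")
    if rrs <= 5:
        return 1
    if rrs <= 12:
        return 2
    if rrs <= 20:
        return 3
    return 4


def select_gate(rrs: int, budget: str, *, high_risk_category: bool = False,
                active_incident: bool = False, is_fix: bool = False,
                has_exception: bool = False) -> GateDecision:
    gate = default_gate(rrs)
    # Budget modifiers: Yellow escalates >= G2, Red escalates >= G1.
    if budget == "yellow" and gate >= 2:
        gate += 1
    elif budget == "red" and gate >= 1:
        gate += 1
    gate = min(gate, 4)  # assumption: G4 is the ceiling

    if active_incident and not is_fix:
        return GateDecision(gate, True, "active incident: non-fix releases blocked")
    if budget == "red" and high_risk_category and not has_exception:
        return GateDecision(gate, True, "red budget: high-risk category needs an exception")
    return GateDecision(gate, False)


print(select_gate(10, "yellow"))  # default G2, escalated to G3
```

Returning a decision object rather than a bare gate number keeps the "blocked" outcome and its reason available for logging.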

6.4 “Diff-aware” also means “blast-radius aware”

If the diff is inherently risky, reduce risk operationally:

  • feature flags with cohort controls
  • dark launches (ship code disabled)
  • canary deployments
  • blue/green with quick revert
  • backwards-compatible DB migrations (expand/contract pattern)
  • circuit breakers and rate limiting
  • progressive exposure by tenant / region / account segment

Large diffs are not “made safe” by more reviewers; they are made safe by reversibility and containment.


7) Exceptions (“break glass”) policy

Exceptions are permitted only when one of these is true:

  • incident mitigation or customer harm prevention,
  • urgent security fix (actively exploited or high severity),
  • legal/compliance deadline.

Requirements for any exception:

  • Recorded rationale in the PR/release ticket

  • Named approver(s): DM + on-call owner; PM for customer-impacting risk

  • Mandatory follow-up within 5 business days:

    • post-incident or post-release review
    • remediation tasks created and prioritized
  • Budget penalty: subtract additional RP (e.g., +50% of the change's RRS) to reflect unmanaged risk

Repeated exceptions are a governance failure and trigger gate tightening.
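
To make the budget penalty and follow-up deadline concrete, here is a small sketch. It uses the 50% example rate from the requirements above; rounding the penalty up and skipping only weekends (not public holidays) when counting business days are assumptions, as are the function names.

```python
import math
from datetime import date, timedelta


def exception_charge(rrs: int, penalty_rate: float = 0.5) -> int:
    """Total RP charged for a break-glass release: RRS plus the penalty, rounded up."""
    return rrs + math.ceil(rrs * penalty_rate)


def follow_up_due(release_day: date, business_days: int = 5) -> date:
    """Date by which the mandatory follow-up is due, counting Monday to Friday only."""
    day = release_day
    remaining = business_days
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return day


print(exception_charge(14))               # 14 + 7 = 21 RP
print(follow_up_due(date(2025, 12, 19)))  # a Friday; due 2025-12-26
```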


8) Operational metrics (what PMs and DMs must review)

Minimum weekly review dashboard per service:

  • Risk budget remaining (RP and %)
  • Deploy frequency
  • Change failure rate
  • MTTR
  • Sev1/Sev2 count (rolling 30/90 days)
  • SLO / error budget status
  • Gate compliance rate (how often gates were bypassed)
  • Diff size distribution (are we shipping huge diffs?)
  • Rollback frequency and time-to-rollback

Policy expectation:

  • If change failure rate or MTTR worsens materially over 2 windows, budgets tighten and gate mapping escalates until stability returns.

9) Practical operating cadence

Weekly (PM + DM)

  • Review budgets and trends
  • Identify upcoming high-risk releases and plan staged rollouts
  • Confirm staffing for release windows (release captain / on-call coverage)
  • Decide whether to defer, decompose, or harden changes

Per release (DM-led, PM informed)

  • Ensure correct gate level
  • Verify rollout + rollback readiness
  • Confirm monitoring/alerts exist and are watched during rollout
  • Execute post-release verification checklist

Monthly (leadership)

  • Adjust tier assignments if product criticality changed
  • Recalibrate budget numbers based on measured outcomes
  • Identify systemic causes: test gaps, observability gaps, deployment tooling gaps

10) Required templates (standardize execution)

10.1 Release Plan (required for G2+)

  • What is changing (1–3 bullets)
  • Expected customer impact (or “none”)
  • Diff category flags (DB/auth/infra/API/etc.)
  • Rollout strategy (canary/cohort/blue-green)
  • Abort criteria (exact metrics/thresholds)
  • Rollback steps (exact commands/process)
  • Owners during rollout (names)

10.2 Migration Plan (required for schema/data changes)

  • Migration type: additive / expand-contract / breaking (breaking is disallowed without explicit G4 approval)
  • Backfill approach and rate limits
  • Validation checks (row counts, invariants)
  • Rollback strategy (including data implications)

10.3 Post-release Verification Checklist (G1+)

  • Smoke test results
  • Key dashboards checked (latency, error rate, saturation)
  • Alerts status
  • User-facing workflows validated (as applicable)
  • Ticket updated with outcome

11) What “good” looks like

  • Low-risk diffs ship quickly with minimal ceremony (G0–G1).
  • High-risk diffs are decomposed and shipped progressively, not heroically.
  • Risk budgets are visible, used in planning, and treated as a real constraint.
  • Exceptions are rare and followed by concrete remediation.
  • Over time: deploy frequency stays high while change failure rate and MTTR decrease.

12) Immediate adoption checklist (first 30 days)

DM deliverables

  • Implement diff classification in CI/CD (at least: DB/auth/infra/API/deps/config)
  • Implement automatic gate mapping and enforcement
  • Add “release plan” and “rollback plan” checks for G2+
  • Add logging for gate overrides

PM deliverables

  • Confirm service tiering for owned areas
  • Approve initial monthly RP budgets
  • Add risk budget review to the weekly product/engineering ritual
  • Reprioritize work when budgets hit Yellow/Red (explicitly)
