Stella Ops Guidelines
Risk Budgets and Diff-Aware Release Gates
Audience: Product Managers (PMs) and Development Managers (DMs)
Applies to: All customer-impacting software and configuration changes shipped by Stella Ops (code, infrastructure-as-code, runtime config, feature flags, data migrations, dependency upgrades).
1) What we are optimizing for
Stella Ops ships quickly without letting change-driven incidents, security regressions, or data integrity failures become the hidden cost of “speed.”
These guidelines enforce two linked controls:
- Risk Budgets — a quantitative “capacity to take risk” that prevents reliability and trust from being silently depleted.
- Diff-Aware Release Gates — release checks whose strictness scales with what changed (the diff), not with generic process.
Together they let us move fast on low-risk diffs and slow down only when the change warrants it.
2) Non-negotiable principles
- All changes are risk-bearing (even “small” diffs). We quantify and route them accordingly.
- Risk is managed at the product/service boundary (each service has its own budget and gating profile).
- Automation first, approvals last. Humans review what automation cannot reliably verify.
- Blast radius is a first-class variable. A safe rollout beats a perfect code review.
- Exceptions are allowed but never free. Every bypass is logged, justified, and paid back via budget reduction and follow-up controls.
3) Definitions
3.1 Risk Budget (what it is)
A Risk Budget is the amount of change-risk a product/service is allowed to take over a defined window (typically a sprint or month) without increasing the probability of customer harm beyond the agreed tolerance.
It is a management control, not a theoretical score.
3.2 Risk Budget vs. Error Budget (important distinction)
- Error Budget (classic SRE): backward-looking tolerance for actual unreliability vs. SLO.
- Risk Budget (this policy): forward-looking tolerance for change risk before shipping.
They interact:
- If error budget is burned (service is unstable), risk budget is automatically constrained.
- If risk budget is low, release gates tighten by policy.
3.3 Diff-aware release gates (what it is)
A release gate is a set of required checks (tests, scans, reviews, rollout controls) that must pass before a change can progress. Diff-aware means the gate level is determined by:
- what changed (diff classification),
- where it changed (criticality),
- how it ships (blast radius controls),
- and current operational context (incidents, SLO health, budget remaining).
4) Roles and accountability
Product Manager (PM) — accountable for risk appetite
PM responsibilities:
- Define product-level risk tolerance with stakeholders (customer impact tolerance, regulatory constraints).
- Approve the Risk Budget Policy settings for their product/service tier (criticality level, default gates).
- Prioritize reliability work when budgets are constrained.
- Own customer communications for degraded service or risk-driven release deferrals.
Development Manager (DM) — accountable for enforcement and engineering hygiene
DM responsibilities:
- Ensure pipelines implement diff classification and enforce gates.
- Ensure tests, telemetry, rollout mechanisms, and rollback procedures exist and are maintained.
- Ensure the exceptions process is real (every exception logged, reviewed in a postmortem, and paid back).
- Own staffing/rotation decisions to ensure safe releases (on-call readiness, release captains).
Shared responsibilities
PM + DM jointly:
- Review risk budget status weekly.
- Resolve trade-offs: feature velocity vs. reliability/security work.
- Approve gate profile changes (tighten/loosen) based on evidence.
5) Risk Budgets
5.1 Establish service tiers (criticality)
Each service/product component must be assigned a Criticality Tier:
- Tier 0 – Internal only (no external customers; low business impact)
- Tier 1 – Customer-facing non-critical (degradation tolerated; limited blast radius)
- Tier 2 – Customer-facing critical (core workflows; meaningful revenue/trust impact)
- Tier 3 – Safety/financial/data-critical (payments, auth, permissions, PII, regulated workflows)
Tier drives default budgets and minimum gates.
5.2 Choose a budget window and units
Window: default to monthly with weekly tracking; optionally sprint-based if release cadence is sprint-coupled.
Units: use Risk Points (RP), consumed by each change. (Do not overcomplicate at first; tune with data.)
Recommended initial monthly budgets (adjust after 2–3 cycles with evidence):
- Tier 0: 300 RP/month
- Tier 1: 200 RP/month
- Tier 2: 120 RP/month
- Tier 3: 80 RP/month
Interpretation: Tier 3 ships fewer “risky” changes; it can still ship frequently, but changes must be decomposed into low-risk diffs and shipped with strong controls.
5.3 Risk Point scoring (how changes consume budget)
Every change gets a Release Risk Score (RRS) in RP.
A practical baseline model:
RRS = Base(criticality) + Diff Risk + Operational Context – Mitigations
Base (criticality):
- Tier 0: +1
- Tier 1: +3
- Tier 2: +6
- Tier 3: +10
Diff Risk (additive):
- +1: docs, comments, non-executed code paths, telemetry-only additions
- +3: UI changes, non-core logic changes, refactors with high test coverage
- +6: API contract changes, dependency upgrades, medium-complexity logic in a core path
- +10: database schema migrations, auth/permission logic, data retention/PII handling
- +15: infra/networking changes, encryption/key handling, payment flows, queue semantics changes
Operational Context (additive):
- +5: service currently in incident or had Sev1/Sev2 in last 7 days
- +3: error budget < 50% remaining
- +2: on-call load high (paging above normal baseline)
- +5: release during restricted windows (holidays/freeze) via exception
Mitigations (subtract):
- –3: feature flag with staged rollout + instant kill switch verified
- –3: canary + automated health gates + rollback tested in last 30 days
- –2: high-confidence integration coverage for touched components
- –2: no data migration OR backward-compatible migration with proven rollback
- –2: change isolated behind permission boundary / limited cohort
Minimum RRS floor: never below 1 RP.
The DM is responsible for ensuring that the pipeline calculates a default RRS automatically and requires human input only for edge cases.
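The baseline model above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `ChangeRisk` type and `release_risk_score` function are hypothetical names, and the caller is assumed to pass the point values from the lists above.

```python
from dataclasses import dataclass, field

# Base points per criticality tier, copied from the baseline model above.
BASE_BY_TIER = {0: 1, 1: 3, 2: 6, 3: 10}
RRS_FLOOR = 1  # minimum RRS floor: never below 1 RP


@dataclass
class ChangeRisk:
    """Hypothetical container for the inputs to one RRS calculation."""
    tier: int
    diff_risk: list[int] = field(default_factory=list)    # e.g. [10] for a schema migration
    context: list[int] = field(default_factory=list)      # e.g. [3] for error budget < 50%
    mitigations: list[int] = field(default_factory=list)  # positive values, e.g. [3, 3]


def release_risk_score(change: ChangeRisk) -> int:
    """RRS = Base(criticality) + Diff Risk + Operational Context - Mitigations, floored at 1."""
    raw = (
        BASE_BY_TIER[change.tier]
        + sum(change.diff_risk)
        + sum(change.context)
        - sum(change.mitigations)
    )
    return max(RRS_FLOOR, raw)


# Tier 2 service, schema migration (+10), error budget < 50% (+3),
# feature flag with kill switch (-3), canary with tested rollback (-3).
example = ChangeRisk(tier=2, diff_risk=[10], context=[3], mitigations=[3, 3])
print(release_risk_score(example))  # 6 + 10 + 3 - 6 = 13
```

Mitigations are passed as positive numbers and subtracted inside the function, which keeps the sign convention in one place.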
5.4 Budget operating rules
Budget ledger: Maintain a per-service ledger:
- Budget allocated for the window
- RP consumed per release
- RP remaining
- Trendline (projected depletion date)
- Exceptions (break-glass releases)
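As an illustration, a per-service ledger could be modelled as below. The class and field names are hypothetical, and the trendline uses a simple linear projection (average daily burn so far continues), which is an assumption rather than something this policy prescribes.

```python
import math
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class LedgerEntry:
    day: date
    rp: int
    is_exception: bool = False  # True for break-glass releases


@dataclass
class RiskLedger:
    allocated_rp: int
    window_start: date
    entries: list[LedgerEntry] = field(default_factory=list)

    def record(self, day: date, rp: int, is_exception: bool = False) -> None:
        self.entries.append(LedgerEntry(day, rp, is_exception))

    @property
    def consumed_rp(self) -> int:
        return sum(e.rp for e in self.entries)

    @property
    def remaining_rp(self) -> int:
        return self.allocated_rp - self.consumed_rp

    @property
    def exception_count(self) -> int:
        return sum(1 for e in self.entries if e.is_exception)

    def projected_depletion(self, today: date) -> Optional[date]:
        """Linear projection of the day the budget reaches zero (None if nothing consumed)."""
        if self.remaining_rp <= 0:
            return today
        if self.consumed_rp == 0:
            return None
        elapsed_days = max(1, (today - self.window_start).days + 1)
        burn_per_day = self.consumed_rp / elapsed_days
        return today + timedelta(days=math.ceil(self.remaining_rp / burn_per_day))
```

A projected depletion date that falls inside the current window is an early signal to decompose or defer upcoming high-risk releases.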
Control thresholds:
- Green (≥60% remaining): normal operation
- Yellow (30–59%): additional caution; gates tighten by 1 level for medium/high-risk diffs
- Red (<30%): freeze high-risk diffs; allow only low-risk changes or reliability/security work
- Exhausted (≤0%): releases restricted to incident fixes, security fixes, and rollback-only, with tightened gates and explicit sign-off
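The thresholds above translate directly into a small classification function. This is a sketch; `budget_status` and its lowercase return values are hypothetical names, and the Yellow band is treated as 30% up to (but not including) 60% so that fractional percentages are covered.

```python
def budget_status(remaining_rp: float, allocated_rp: float) -> str:
    """Classify a service's budget state from RP remaining vs. RP allocated."""
    if allocated_rp <= 0:
        raise ValueError("allocated_rp must be positive")
    pct_remaining = 100 * remaining_rp / allocated_rp
    if pct_remaining <= 0:
        return "exhausted"  # incident fixes, security fixes, rollbacks only
    if pct_remaining < 30:
        return "red"        # freeze high-risk diffs
    if pct_remaining < 60:
        return "yellow"     # gates tighten for medium/high-risk diffs
    return "green"          # normal operation
```

For a Tier 2 service with 120 RP allocated, 72 RP remaining is still Green (60%), while 60 RP remaining is Yellow (50%).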
5.5 What to do when budget is low (expected behavior)
When Yellow/Red:
- PM shifts roadmap execution toward:
  - reliability work, defect burn-down,
  - decomposing large changes into smaller, reversible diffs,
  - reducing scope of risky features.
- DM enforces:
  - smaller diffs,
  - increased feature flagging,
  - staged rollout requirements,
  - improved test/observability coverage.
Budget constraints are a signal, not a punishment.
5.6 Budget replenishment and incentives
Budgets replenish on the window boundary, but we also allow earned capacity:
- If a service improves change failure rate and MTTR for 2 consecutive windows, it may earn:
  - +10–20% budget increase or
  - one gate level relaxation for specific change categories
This must be evidence-driven (metrics, not opinions).
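One way to make the eligibility check mechanical is sketched below. It assumes that "improves for 2 consecutive windows" means each of the last two windows is strictly better (lower) than the window before it, which requires three data points per metric; the function names are hypothetical, and a real policy might also demand a minimum size of improvement.

```python
def improved_for_two_windows(history: list[float]) -> bool:
    """True if each of the last two windows is strictly lower than the one before it.

    `history` is ordered oldest to newest; lower is better for both
    change failure rate and MTTR.
    """
    if len(history) < 3:
        return False
    oldest, middle, latest = history[-3:]
    return latest < middle < oldest


def eligible_for_earned_capacity(cfr_history: list[float], mttr_history: list[float]) -> bool:
    """Both metrics must improve; one improving while the other regresses does not qualify."""
    return improved_for_two_windows(cfr_history) and improved_for_two_windows(mttr_history)
```

Eligibility only opens the discussion; the size of the increase (within the 10–20% range) or the gate relaxation remains a PM + DM decision.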
6) Diff-Aware Release Gates
6.1 Diff classification (what the pipeline must detect)
At minimum, automatically classify diffs into these categories:
Code scope
- Executable code vs docs-only
- Core vs non-core modules (define module ownership boundaries)
- Hot paths (latency-sensitive), correctness-sensitive paths
Data scope
- Schema migration (additive vs breaking)
- Backfill jobs / batch jobs
- Data model changes impacting downstream consumers
- PII / regulated data touchpoints
Security scope
- Authn/authz logic
- Permission checks
- Secrets, key handling, encryption changes
- Dependency changes with known CVEs
Infra scope
- IaC changes, networking, load balancer, DNS, autoscaling
- Runtime config changes (feature flags, limits, thresholds)
- Queue/topic changes, retention settings
Interface scope
- Public API contract changes
- Backward compatibility of payloads/events
- Client version dependency
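A first version of diff classification can be driven by changed file paths alone. The sketch below is illustrative only: the glob patterns and directory layout are hypothetical and must be replaced with each service's real module boundaries, and path matching cannot detect semantic categories such as hot paths or breaking vs. additive migrations.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns per category; adjust to the repository layout.
CATEGORY_PATTERNS = {
    "data": ["*/migrations/*", "*.sql"],
    "security": ["*/auth/*", "*/permissions/*"],
    "infra": ["infra/*", "*.tf"],
    "interface": ["*.proto", "*/openapi*.yaml"],
    "deps": ["*requirements*.txt", "*package-lock.json", "*go.sum"],
    "docs": ["docs/*", "*.md"],
}


def classify_diff(changed_paths: list[str]) -> set[str]:
    """Return the set of category flags raised by a list of changed file paths."""
    flags: set[str] = set()
    for path in changed_paths:
        matched = {
            category
            for category, patterns in CATEGORY_PATTERNS.items()
            if any(fnmatchcase(path, pattern) for pattern in patterns)
        }
        # Anything that matches no pattern is treated as ordinary executable code.
        flags |= matched or {"code"}
    return flags


def is_docs_only(flags: set[str]) -> bool:
    return flags == {"docs"}
```

In a pipeline, the input would typically be the output of `git diff --name-only` against the target branch. Defaulting unmatched paths to `"code"` errs on the side of stricter gates.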
6.2 Gate levels
Define Gate Levels G0–G4. The pipeline assigns one based on diff + context + budget.
G0 — No-risk / administrative
Use for:
- docs-only, comments-only, non-functional metadata
Requirements:
- Lint/format checks
- Basic CI pass (build)
G1 — Low risk
Use for:
- small, localized code changes with strong unit coverage
- non-core UI changes
- telemetry additions (no removal)
Requirements:
- All automated unit tests
- Static analysis/linting
- 1 peer review (code owner not required if outside critical modules)
- Automated deploy to staging
- Post-deploy smoke checks
G2 — Moderate risk
Use for:
- moderate logic changes in customer-facing paths
- dependency upgrades
- API changes that are backward compatible
- config changes affecting behavior
Requirements:
- G1 +
- Integration tests relevant to impacted modules
- Code owner review for touched modules
- Feature flag required if customer impact possible
- Staged rollout: canary or small cohort
- Rollback plan documented in PR
G3 — High risk
Use for:
- schema migrations
- auth/permission changes
- core business logic in critical flows
- infra changes affecting availability
- non-trivial concurrency/queue semantics changes
Requirements:
- G2 +
- Security scan + dependency audit (must pass, exceptions logged)
- Migration plan (forward + rollback) reviewed
- Load/performance checks if in hot path
- Observability: new/updated dashboards/alerts for the change
- Release captain / on-call sign-off (someone accountable live)
- Progressive delivery with automatic health gates (error rate/latency)
G4 — Very high risk / safety-critical / budget-constrained releases
Use for:
- Tier 3 critical systems with low budget remaining
- changes during freeze windows via exception
- broad blast radius changes (platform-wide)
- remediation after major incident where recurrence risk is high
Requirements:
- G3 +
- Formal risk review (PM+DM+Security/SRE) in writing
- Explicit rollback rehearsal or prior proven rollback path
- Extended canary period with success criteria and abort criteria
- Customer comms plan if impact is plausible
- Post-release verification checklist executed and logged
6.3 Gate selection logic (policy)
Default rule:
- Compute RRS (Risk Points) from diff + context.
- Map RRS to default gate:
  - 1–5 RP → G1
  - 6–12 RP → G2
  - 13–20 RP → G3
  - 21+ RP → G4
- Apply modifiers:
  - If budget Yellow: escalate one gate for changes ≥ G2
  - If budget Red: escalate one gate for changes ≥ G1 and block high-risk categories unless exception
  - If active incident or error budget severely degraded: block non-fix releases by default
DM must ensure the pipeline enforces this mapping automatically.
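The default rule and its modifiers can be sketched as a single function. The names, the `"BLOCKED"` sentinel, and the boolean parameters are hypothetical. Two assumptions are made explicit: escalation is capped at G4, and G0 is not produced here because the mapping above defines no RRS range for it (docs-only diffs would have to be routed to G0 by diff classification). Exhausted budgets are also left out, since the modifiers above do not define them.

```python
GATES = ["G0", "G1", "G2", "G3", "G4"]


def default_gate_level(rrs: int) -> int:
    """Map an RRS (in RP) to a default gate level, per the ranges above."""
    if rrs < 1:
        raise ValueError("RRS has a floor of 1 RP")
    if rrs <= 5:
        return 1
    if rrs <= 12:
        return 2
    if rrs <= 20:
        return 3
    return 4


def select_gate(
    rrs: int,
    budget_status: str,                # "green", "yellow", or "red"
    high_risk_category: bool = False,  # e.g. schema migration, auth change
    has_exception: bool = False,       # approved break-glass exception
    unstable: bool = False,            # active incident or severely degraded error budget
    is_fix: bool = False,              # incident or security fix
) -> str:
    if budget_status not in ("green", "yellow", "red"):
        raise ValueError(f"unsupported budget status: {budget_status}")
    if unstable and not is_fix:
        return "BLOCKED"
    level = default_gate_level(rrs)
    if budget_status == "yellow" and level >= 2:
        level += 1
    elif budget_status == "red":
        if high_risk_category and not has_exception:
            return "BLOCKED"
        if level >= 1:
            level += 1
    return GATES[min(level, 4)]  # assumption: escalation is capped at G4
```

For example, an 8 RP change is G2 by default but becomes G3 under a Yellow budget, while a 4 RP change stays at G1 under Yellow and moves to G2 under Red.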
6.4 “Diff-aware” also means “blast-radius aware”
If the diff is inherently risky, reduce risk operationally:
- feature flags with cohort controls
- dark launches (ship code disabled)
- canary deployments
- blue/green with quick revert
- backwards-compatible DB migrations (expand/contract pattern)
- circuit breakers and rate limiting
- progressive exposure by tenant / region / account segment
Large diffs are not “made safe” by more reviewers; they are made safe by reversibility and containment.
7) Exceptions (“break glass”) policy
Exceptions are permitted only when one of these is true:
- incident mitigation or customer harm prevention,
- urgent security fix (actively exploited or high severity),
- legal/compliance deadline.
Requirements for any exception:
- Recorded rationale in the PR/release ticket
- Named approver(s): DM + on-call owner; PM for customer-impacting risk
- Mandatory follow-up within 5 business days:
  - post-incident or post-release review
  - remediation tasks created and prioritized
- Budget penalty: charge additional RP against the budget on top of the change's RRS (e.g., a further 50% of the RRS) to reflect unmanaged risk
Repeated exceptions are a governance failure and trigger gate tightening.
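The budget penalty is simple arithmetic. As a sketch, assuming the example rate of 50% and rounding the penalty up to a whole RP (the rounding rule is an assumption; the policy does not specify one):

```python
import math


def exception_charge(rrs: int, penalty_rate: float = 0.5) -> int:
    """Total RP charged to the ledger for a break-glass release: RRS plus penalty."""
    penalty = math.ceil(rrs * penalty_rate)  # assumption: round partial RP up
    return rrs + penalty


print(exception_charge(14))  # 14 + 7 = 21
print(exception_charge(13))  # 13 + ceil(6.5) = 13 + 7 = 20
```

A 14 RP change shipped via exception therefore consumes 21 RP, which is what makes a bypass "never free."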
8) Operational metrics (what PMs and DMs must review)
Minimum weekly review dashboard per service:
- Risk budget remaining (RP and %)
- Deploy frequency
- Change failure rate
- MTTR
- Sev1/Sev2 count (rolling 30/90 days)
- SLO / error budget status
- Gate compliance rate (how often gates were bypassed)
- Diff size distribution (are we shipping huge diffs?)
- Rollback frequency and time-to-rollback
Policy expectation:
- If change failure rate or MTTR worsens materially over 2 windows, budgets tighten and gate mapping escalates until stability returns.
9) Practical operating cadence
Weekly (PM + DM)
- Review budgets and trends
- Identify upcoming high-risk releases and plan staged rollouts
- Confirm staffing for release windows (release captain / on-call coverage)
- Decide whether to defer, decompose, or harden changes
Per release (DM-led, PM informed)
- Ensure correct gate level
- Verify rollout + rollback readiness
- Confirm monitoring/alerts exist and are watched during rollout
- Execute post-release verification checklist
Monthly (leadership)
- Adjust tier assignments if product criticality changed
- Recalibrate budget numbers based on measured outcomes
- Identify systemic causes: test gaps, observability gaps, deployment tooling gaps
10) Required templates (standardize execution)
10.1 Release Plan (required for G2+)
- What is changing (1–3 bullets)
- Expected customer impact (or “none”)
- Diff category flags (DB/auth/infra/API/etc.)
- Rollout strategy (canary/cohort/blue-green)
- Abort criteria (exact metrics/thresholds)
- Rollback steps (exact commands/process)
- Owners during rollout (names)
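A pipeline check for this template could look like the sketch below. The field keys and the dictionary representation are hypothetical; in practice the plan might be parsed from a PR description or a release ticket.

```python
# Hypothetical keys, one per line of the Release Plan template above.
REQUIRED_FIELDS = [
    "what_is_changing",
    "expected_customer_impact",  # the string "none" is a valid answer
    "diff_category_flags",
    "rollout_strategy",
    "abort_criteria",
    "rollback_steps",
    "owners",
]


def missing_release_plan_fields(gate: str, plan: dict) -> list[str]:
    """Return the template fields still missing; G0 and G1 do not require a plan."""
    if gate in ("G0", "G1"):
        return []
    return [
        name
        for name in REQUIRED_FIELDS
        if name not in plan or plan[name] in ("", None)
    ]
```

A non-empty result for a G2+ change would fail the gate and list exactly what the author still needs to fill in.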
10.2 Migration Plan (required for schema/data changes)
- Migration type: additive / expand-contract / breaking (breaking is disallowed without explicit G4 approval)
- Backfill approach and rate limits
- Validation checks (row counts, invariants)
- Rollback strategy (including data implications)
10.3 Post-release Verification Checklist (G1+)
- Smoke test results
- Key dashboards checked (latency, error rate, saturation)
- Alerts status
- User-facing workflows validated (as applicable)
- Ticket updated with outcome
11) What “good” looks like
- Low-risk diffs ship quickly with minimal ceremony (G0–G1).
- High-risk diffs are decomposed and shipped progressively, not heroically.
- Risk budgets are visible, used in planning, and treated as a real constraint.
- Exceptions are rare and followed by concrete remediation.
- Over time: deploy frequency stays high while change failure rate and MTTR decrease.
12) Immediate adoption checklist (first 30 days)
DM deliverables
- Implement diff classification in CI/CD (at least: DB/auth/infra/API/deps/config)
- Implement automatic gate mapping and enforcement
- Add “release plan” and “rollback plan” checks for G2+
- Add logging for gate overrides
PM deliverables
- Confirm service tiering for owned areas
- Approve initial monthly RP budgets
- Add risk budget review to the weekly product/engineering ritual
- Reprioritize work when budgets hit Yellow/Red (explicitly)