## Stella Ops Guidelines

### Risk Budgets and Diff-Aware Release Gates

**Audience:** Product Managers (PMs) and Development Managers (DMs)

**Applies to:** All customer-impacting software and configuration changes shipped by Stella Ops (code, infrastructure-as-code, runtime config, feature flags, data migrations, dependency upgrades).

---

## 1) What we are optimizing for

Stella Ops ships quickly **without** letting change-driven incidents, security regressions, or data integrity failures become the hidden cost of “speed.”

These guidelines enforce two linked controls:

1. **Risk Budgets**: a quantitative “capacity to take risk” that prevents reliability and trust from being silently depleted.
2. **Diff-Aware Release Gates**: release checks whose strictness scales with *what changed* (the diff), not with generic process.

Together they let us move fast on low-risk diffs and slow down only when the change warrants it.

---

## 2) Non-negotiable principles

1. **All changes are risk-bearing** (even “small” diffs). We quantify and route them accordingly.
2. **Risk is managed at the product/service boundary** (each service has its own budget and gating profile).
3. **Automation first, approvals last**. Humans review what automation cannot reliably verify.
4. **Blast radius is a first-class variable**. A safe rollout beats a perfect code review.
5. **Exceptions are allowed but never free**. Every bypass is logged, justified, and paid back via budget reduction and follow-up controls.

---

## 3) Definitions

### 3.1 Risk Budget (what it is)

A **Risk Budget** is the amount of change-risk a product/service is allowed to take over a defined window (typically a sprint or month) **without increasing the probability of customer harm beyond the agreed tolerance**.

It is a management control, not a theoretical score.

### 3.2 Risk Budget vs. Error Budget (important distinction)

* **Error Budget** (classic SRE): backward-looking tolerance for *actual* unreliability vs. SLO.
* **Risk Budget** (this policy): forward-looking tolerance for *change risk* before shipping.

They interact:

* If error budget is burned (service is unstable), risk budget is automatically constrained.
* If risk budget is low, release gates tighten by policy.

### 3.3 Diff-aware release gates (what it is)

A **release gate** is a set of required checks (tests, scans, reviews, rollout controls) that must pass before a change can progress.

**Diff-aware** means the gate level is determined by:

* what changed (diff classification),
* where it changed (criticality),
* how it ships (blast radius controls),
* and current operational context (incidents, SLO health, budget remaining).

---

## 4) Roles and accountability

### Product Manager (PM): accountable for risk appetite

PM responsibilities:

* Define product-level risk tolerance with stakeholders (customer impact tolerance, regulatory constraints).
* Approve the **Risk Budget Policy settings** for their product/service tier (criticality level, default gates).
* Prioritize reliability work when budgets are constrained.
* Own customer communications for degraded service or risk-driven release deferrals.

### Development Manager (DM): accountable for enforcement and engineering hygiene

DM responsibilities:

* Ensure pipelines implement diff classification and enforce gates.
* Ensure tests, telemetry, rollout mechanisms, and rollback procedures exist and are maintained.
* Ensure the “exceptions” process is real (logged, postmortemed, paid back).
* Own staffing/rotation decisions to ensure safe releases (on-call readiness, release captains).

### Shared responsibilities

PM + DM jointly:

* Review risk budget status weekly.
* Resolve trade-offs: feature velocity vs. reliability/security work.
* Approve gate profile changes (tighten/loosen) based on evidence.
---

## 5) Risk Budgets

### 5.1 Establish service tiers (criticality)

Each service/product component must be assigned a **Criticality Tier**:

* **Tier 0 – Internal only** (no external customers; low business impact)
* **Tier 1 – Customer-facing non-critical** (degradation tolerated; limited blast radius)
* **Tier 2 – Customer-facing critical** (core workflows; meaningful revenue/trust impact)
* **Tier 3 – Safety/financial/data-critical** (payments, auth, permissions, PII, regulated workflows)

Tier drives default budgets and minimum gates.

### 5.2 Choose a budget window and units

**Window:** default to **monthly** with weekly tracking; optionally sprint-based if release cadence is sprint-coupled.

**Units:** use **Risk Points (RP)**, consumed by each change. (Do not overcomplicate at first; tune with data.)

Recommended initial monthly budgets (adjust after 2–3 cycles with evidence):

* Tier 0: 300 RP/month
* Tier 1: 200 RP/month
* Tier 2: 120 RP/month
* Tier 3: 80 RP/month

> Interpretation: Tier 3 ships fewer “risky” changes; it can still ship frequently, but changes must be decomposed into low-risk diffs and shipped with strong controls.

### 5.3 Risk Point scoring (how changes consume budget)

Every change gets a **Release Risk Score (RRS)** in RP.
A practical baseline model:

**RRS = Base(criticality) + Diff Risk + Operational Context - Mitigations**

**Base (criticality):**

* Tier 0: +1
* Tier 1: +3
* Tier 2: +6
* Tier 3: +10

**Diff Risk (additive):**

* +1: docs, comments, non-executed code paths, telemetry-only additions
* +3: UI changes, non-core logic changes, refactors with high test coverage
* +6: API contract changes, dependency upgrades, medium-complexity logic in a core path
* +10: database schema migrations, auth/permission logic, data retention/PII handling
* +15: infra/networking changes, encryption/key handling, payment flows, queue semantics changes

**Operational Context (additive):**

* +5: service currently in incident or had Sev1/Sev2 in last 7 days
* +3: error budget < 50% remaining
* +2: on-call load high (paging above normal baseline)
* +5: release during restricted windows (holidays/freeze) via exception

**Mitigations (subtract):**

* -3: feature flag with staged rollout + instant kill switch verified
* -3: canary + automated health gates + rollback tested in last 30 days
* -2: high-confidence integration coverage for touched components
* -2: no data migration OR backward-compatible migration with proven rollback
* -2: change isolated behind permission boundary / limited cohort

**Minimum RRS floor:** never below 1 RP.

The DM is responsible for making sure the pipeline can calculate a *default* RRS automatically and requires human input only for edge cases.
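The scoring model above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the point values are taken from the lists in this section, while the flag names (`schema_auth_pii`, `flag_with_kill_switch`, and so on) and the function `release_risk_score` are hypothetical labels invented for the example.

```python
# Point values mirror section 5.3; all key names are illustrative assumptions.
BASE_BY_TIER = {0: 1, 1: 3, 2: 6, 3: 10}

DIFF_RISK = {
    "docs_or_telemetry": 1,
    "ui_or_non_core": 3,
    "api_or_dependency": 6,
    "schema_auth_pii": 10,
    "infra_crypto_payments": 15,
}

OPERATIONAL_CONTEXT = {
    "recent_incident": 5,
    "error_budget_below_half": 3,
    "oncall_load_high": 2,
    "restricted_window": 5,
}

MITIGATIONS = {
    "flag_with_kill_switch": 3,
    "canary_with_tested_rollback": 3,
    "integration_coverage": 2,
    "safe_or_no_migration": 2,
    "limited_cohort": 2,
}


def release_risk_score(tier, diff_flags, context_flags=(), mitigation_flags=()):
    """Return RRS = base + diff risk + context - mitigations, floored at 1 RP."""
    score = BASE_BY_TIER[tier]
    # Sets guard against counting the same flag twice.
    score += sum(DIFF_RISK[f] for f in set(diff_flags))
    score += sum(OPERATIONAL_CONTEXT[f] for f in set(context_flags))
    score -= sum(MITIGATIONS[f] for f in set(mitigation_flags))
    return max(1, score)


# Tier 2 schema migration while error budget is below 50%,
# shipped behind a flag with a tested canary: 6 + 10 + 3 - 3 - 3 = 13 RP.
example = release_risk_score(
    tier=2,
    diff_flags=["schema_auth_pii"],
    context_flags=["error_budget_below_half"],
    mitigation_flags=["flag_with_kill_switch", "canary_with_tested_rollback"],
)
print(example)
```

Keeping the weights in plain dictionaries makes them easy to recalibrate after the first few budget windows without touching the scoring logic.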
### 5.4 Budget operating rules

**Budget ledger:** Maintain a per-service ledger:

* Budget allocated for the window
* RP consumed per release
* RP remaining
* Trendline (projected depletion date)
* Exceptions (break-glass releases)

**Control thresholds:**

* **Green (≥60% remaining):** normal operation
* **Yellow (30–59%):** additional caution; gates tighten by 1 level for medium/high-risk diffs
* **Red (<30%):** freeze high-risk diffs; allow only low-risk changes or reliability/security work
* **Exhausted (≤0%):** releases restricted to incident fixes, security fixes, and rollback-only, with tightened gates and explicit sign-off

### 5.5 What to do when budget is low (expected behavior)

When Yellow/Red:

* PM shifts roadmap execution toward:
  * reliability work, defect burn-down,
  * decomposing large changes into smaller, reversible diffs,
  * reducing scope of risky features.
* DM enforces:
  * smaller diffs,
  * increased feature flagging,
  * staged rollout requirements,
  * improved test/observability coverage.

Budget constraints are a signal, not a punishment.

### 5.6 Budget replenishment and incentives

Budgets replenish on the window boundary, but we also allow **earned capacity**:

* If a service improves change failure rate and MTTR for 2 consecutive windows, it may earn:
  * +10–20% budget increase **or**
  * one gate level relaxation for specific change categories

This must be evidence-driven (metrics, not opinions).
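To make the ledger and control thresholds from 5.4 concrete, here is a minimal Python sketch. The class `BudgetLedger` and its method names are hypothetical. It also assumes that any value from 30% up to (but not including) 60% counts as Yellow, since the policy lists whole-number bands only.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetLedger:
    """Per-service ledger for one budget window (illustrative only)."""

    allocated_rp: int
    releases: list = field(default_factory=list)  # (release_id, rp_consumed)

    def record(self, release_id, rp):
        """Log the RP consumed by one release."""
        self.releases.append((release_id, rp))

    @property
    def remaining_rp(self):
        return self.allocated_rp - sum(rp for _, rp in self.releases)

    @property
    def remaining_pct(self):
        return 100.0 * self.remaining_rp / self.allocated_rp

    def zone(self):
        """Map the remaining percentage to the 5.4 control thresholds."""
        pct = self.remaining_pct
        if pct <= 0:
            return "exhausted"
        if pct < 30:
            return "red"
        if pct < 60:  # assumption: 59.x% is still Yellow
            return "yellow"
        return "green"


# A Tier 2 service starting the month with 120 RP.
ledger = BudgetLedger(allocated_rp=120)
ledger.record("release-1", 13)
zone_after_first = ledger.zone()   # 107 RP left, about 89%
ledger.record("release-2", 40)
zone_after_second = ledger.zone()  # 67 RP left, about 56%
print(zone_after_first, zone_after_second)
```

The trendline and exception entries from the ledger list are left out here; they could be added as extra fields once the basic accounting is in place.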
---

## 6) Diff-Aware Release Gates

### 6.1 Diff classification (what the pipeline must detect)

At minimum, automatically classify diffs into these categories:

**Code scope**

* Executable code vs docs-only
* Core vs non-core modules (define module ownership boundaries)
* Hot paths (latency-sensitive), correctness-sensitive paths

**Data scope**

* Schema migration (additive vs breaking)
* Backfill jobs / batch jobs
* Data model changes impacting downstream consumers
* PII / regulated data touchpoints

**Security scope**

* Authn/authz logic
* Permission checks
* Secrets, key handling, encryption changes
* Dependency changes with known CVEs

**Infra scope**

* IaC changes, networking, load balancer, DNS, autoscaling
* Runtime config changes (feature flags, limits, thresholds)
* Queue/topic changes, retention settings

**Interface scope**

* Public API contract changes
* Backward compatibility of payloads/events
* Client version dependency

### 6.2 Gate levels

Define **Gate Levels G0–G4**. The pipeline assigns one based on diff + context + budget.
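As one illustration of the diff input that feeds gate assignment, the classification in 6.1 could start from changed file paths. The sketch below is a simplified assumption: every path pattern is hypothetical and would need to match the real repository layout, and path matching alone cannot detect things like hot paths or breaking schema changes.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns; each repository would define its own.
CATEGORY_PATTERNS = {
    "db": ["db/migrations/*", "*.sql"],
    "auth": ["src/auth/*", "src/permissions/*"],
    "infra": ["infra/*", "*.tf"],
    "api": ["api/openapi.yaml", "proto/*"],
    "deps": ["requirements*.txt", "package-lock.json", "go.sum"],
    "config": ["config/runtime/*"],
    "docs": ["docs/*", "*.md"],
}


def classify_diff(changed_paths):
    """Return the set of diff categories touched by the changed paths."""
    categories = set()
    for path in changed_paths:
        matched = {
            category
            for category, patterns in CATEGORY_PATTERNS.items()
            if any(fnmatchcase(path, pattern) for pattern in patterns)
        }
        # Anything unmatched is treated as ordinary executable code.
        categories |= matched or {"code"}
    return categories


def is_docs_only(categories):
    """True when the diff touches documentation and nothing else."""
    return categories == {"docs"}


touched = classify_diff(["src/auth/login.py", "README.md"])
print(sorted(touched))
```

Treating unmatched paths as `code` is a deliberately conservative default: a file the patterns do not recognize is never silently classified as docs-only.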
#### G0: No-risk / administrative

Use for:

* docs-only, comments-only, non-functional metadata

Requirements:

* Lint/format checks
* Basic CI pass (build)

#### G1: Low risk

Use for:

* small, localized code changes with strong unit coverage
* non-core UI changes
* telemetry additions (no removal)

Requirements:

* All automated unit tests
* Static analysis/linting
* 1 peer review (code owner not required if outside critical modules)
* Automated deploy to staging
* Post-deploy smoke checks

#### G2: Moderate risk

Use for:

* moderate logic changes in customer-facing paths
* dependency upgrades
* API changes that are backward compatible
* config changes affecting behavior

Requirements:

* G1 +
* Integration tests relevant to impacted modules
* Code owner review for touched modules
* Feature flag required if customer impact possible
* Staged rollout: canary or small cohort
* Rollback plan documented in PR

#### G3: High risk

Use for:

* schema migrations
* auth/permission changes
* core business logic in critical flows
* infra changes affecting availability
* non-trivial concurrency/queue semantics changes

Requirements:

* G2 +
* Security scan + dependency audit (must pass, exceptions logged)
* Migration plan (forward + rollback) reviewed
* Load/performance checks if in hot path
* Observability: new/updated dashboards/alerts for the change
* Release captain / on-call sign-off (someone accountable live)
* Progressive delivery with automatic health gates (error rate/latency)

#### G4: Very high risk / safety-critical / budget-constrained releases

Use for:

* Tier 3 critical systems with low budget remaining
* changes during freeze windows via exception
* broad blast radius changes (platform-wide)
* remediation after major incident where recurrence risk is high

Requirements:

* G3 +
* Formal risk review (PM+DM+Security/SRE) in writing
* Explicit rollback rehearsal or prior proven rollback path
* Extended canary period with success criteria and abort criteria
* Customer comms plan if impact is plausible
* Post-release verification checklist executed and logged

### 6.3 Gate selection logic (policy)

Default rule:

1. Compute **RRS** (Risk Points) from diff + context.
2. Map RRS to default gate:
   * 1–5 RP → G1
   * 6–12 RP → G2
   * 13–20 RP → G3
   * 21+ RP → G4
3. Apply modifiers:
   * If **budget Yellow**: escalate one gate for changes ≥ G2
   * If **budget Red**: escalate one gate for changes ≥ G1 and block high-risk categories unless exception
   * If active incident or error budget severely degraded: block non-fix releases by default

The DM must ensure the pipeline enforces this mapping automatically.

### 6.4 “Diff-aware” also means “blast-radius aware”

If the diff is inherently risky, reduce risk operationally:

* feature flags with cohort controls
* dark launches (ship code disabled)
* canary deployments
* blue/green with quick revert
* backwards-compatible DB migrations (expand/contract pattern)
* circuit breakers and rate limiting
* progressive exposure by tenant / region / account segment

Large diffs are not “made safe” by more reviewers; they are made safe by **reversibility and containment**.

---

## 7) Exceptions (“break glass”) policy

Exceptions are permitted only when one of these is true:

* incident mitigation or customer harm prevention,
* urgent security fix (actively exploited or high severity),
* legal/compliance deadline.

**Requirements for any exception:**

* Recorded rationale in the PR/release ticket
* Named approver(s): DM + on-call owner; PM for customer-impacting risk
* Mandatory follow-up within 5 business days:
  * post-incident or post-release review
  * remediation tasks created and prioritized
* **Budget penalty:** deduct additional RP from the budget on top of the change’s RRS (e.g., a further 50% of the RRS) to reflect unmanaged risk

Repeated exceptions are a governance failure and trigger gate tightening.
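The gate mapping from 6.3 and the exception budget penalty above can be sketched together in Python. The function names are hypothetical, and several choices are assumptions rather than policy: escalation is capped at G4, an exhausted budget is treated like Red, the penalty is rounded up to a whole RP, and the "block" rules (high-risk categories, active incidents) are not modeled. G0 is reached through docs-only classification rather than RRS, so it never appears as an output here.

```python
import math

GATES = ["G0", "G1", "G2", "G3", "G4"]


def default_gate_level(rrs):
    """Map an RRS in RP to the default gate level from section 6.3."""
    if rrs >= 21:
        return 4
    if rrs >= 13:
        return 3
    if rrs >= 6:
        return 2
    return 1


def select_gate(rrs, budget_zone):
    """Apply the budget modifiers to the default gate.

    budget_zone is one of "green", "yellow", "red", "exhausted".
    """
    level = default_gate_level(rrs)
    if budget_zone == "yellow" and level >= 2:
        level += 1
    elif budget_zone in ("red", "exhausted") and level >= 1:
        level += 1  # assumption: exhausted escalates the same way as red
    return GATES[min(level, 4)]  # assumption: nothing escalates past G4


def exception_charge(rrs, penalty_rate=0.5):
    """Total RP charged for a break-glass release: RRS plus the penalty."""
    return rrs + math.ceil(rrs * penalty_rate)


# A 13 RP change is G3 by default and escalates to G4 in a Yellow budget.
print(select_gate(13, "green"), select_gate(13, "yellow"))
# Shipping it as an exception costs 13 + ceil(6.5) = 20 RP.
print(exception_charge(13))
```

The 50% rate is only the example figure given in the policy; passing `penalty_rate` as a parameter keeps it adjustable per service.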
---

## 8) Operational metrics (what PMs and DMs must review)

Minimum weekly review dashboard per service:

* **Risk budget remaining** (RP and %)
* **Deploy frequency**
* **Change failure rate**
* **MTTR**
* **Sev1/Sev2 count** (rolling 30/90 days)
* **SLO / error budget status**
* **Gate compliance rate** (how often gates were bypassed)
* **Diff size distribution** (are we shipping huge diffs?)
* **Rollback frequency and time-to-rollback**

Policy expectation:

* If change failure rate or MTTR worsens materially over 2 windows, budgets tighten and gate mapping escalates until stability returns.

---

## 9) Practical operating cadence

### Weekly (PM + DM)

* Review budgets and trends
* Identify upcoming high-risk releases and plan staged rollouts
* Confirm staffing for release windows (release captain / on-call coverage)
* Decide whether to defer, decompose, or harden changes

### Per release (DM-led, PM informed)

* Ensure correct gate level
* Verify rollout + rollback readiness
* Confirm monitoring/alerts exist and are watched during rollout
* Execute post-release verification checklist

### Monthly (leadership)

* Adjust tier assignments if product criticality changed
* Recalibrate budget numbers based on measured outcomes
* Identify systemic causes: test gaps, observability gaps, deployment tooling gaps

---

## 10) Required templates (standardize execution)

### 10.1 Release Plan (required for G2+)

* What is changing (1–3 bullets)
* Expected customer impact (or “none”)
* Diff category flags (DB/auth/infra/API/etc.)
* Rollout strategy (canary/cohort/blue-green)
* Abort criteria (exact metrics/thresholds)
* Rollback steps (exact commands/process)
* Owners during rollout (names)

### 10.2 Migration Plan (required for schema/data changes)

* Migration type: additive / expand-contract / breaking (breaking is disallowed without explicit G4 approval)
* Backfill approach and rate limits
* Validation checks (row counts, invariants)
* Rollback strategy (including data implications)

### 10.3 Post-release Verification Checklist (G1+)

* Smoke test results
* Key dashboards checked (latency, error rate, saturation)
* Alerts status
* User-facing workflows validated (as applicable)
* Ticket updated with outcome

---

## 11) What “good” looks like

* Low-risk diffs ship quickly with minimal ceremony (G0–G1).
* High-risk diffs are decomposed and shipped progressively, not heroically.
* Risk budgets are visible, used in planning, and treated as a real constraint.
* Exceptions are rare and followed by concrete remediation.
* Over time: deploy frequency stays high while change failure rate and MTTR decrease.

---

## 12) Immediate adoption checklist (first 30 days)

**DM deliverables**

* Implement diff classification in CI/CD (at least: DB/auth/infra/API/deps/config)
* Implement automatic gate mapping and enforcement
* Add “release plan” and “rollback plan” checks for G2+
* Add logging for gate overrides

**PM deliverables**

* Confirm service tiering for owned areas
* Approve initial monthly RP budgets
* Add risk budget review to the weekly product/engineering ritual
* Reprioritize work when budgets hit Yellow/Red (explicitly)