## Stella Ops Guidelines

### Risk Budgets and Diff-Aware Release Gates

**Audience:** Product Managers (PMs) and Development Managers (DMs)
**Applies to:** All customer-impacting software and configuration changes shipped by Stella Ops (code, infrastructure-as-code, runtime config, feature flags, data migrations, dependency upgrades).

---

## 1) What we are optimizing for

Stella Ops ships quickly **without** letting change-driven incidents, security regressions, or data integrity failures become the hidden cost of “speed.”

These guidelines enforce two linked controls:

1. **Risk Budgets** — a quantitative “capacity to take risk” that prevents reliability and trust from being silently depleted.
2. **Diff-Aware Release Gates** — release checks whose strictness scales with *what changed* (the diff), not with generic process.

Together they let us move fast on low-risk diffs and slow down only when the change warrants it.

---

## 2) Non-negotiable principles

1. **All changes are risk-bearing** (even “small” diffs). We quantify and route them accordingly.
2. **Risk is managed at the product/service boundary** (each service has its own budget and gating profile).
3. **Automation first, approvals last**. Humans review what automation cannot reliably verify.
4. **Blast radius is a first-class variable**. A safe rollout beats a perfect code review.
5. **Exceptions are allowed but never free**. Every bypass is logged, justified, and paid back via budget reduction and follow-up controls.

---

## 3) Definitions

### 3.1 Risk Budget (what it is)

A **Risk Budget** is the amount of change-risk a product/service is allowed to take over a defined window (typically a sprint or month) **without increasing the probability of customer harm beyond the agreed tolerance**.

It is a management control, not a theoretical score.

### 3.2 Risk Budget vs. Error Budget (important distinction)

* **Error Budget** (classic SRE): backward-looking tolerance for *actual* unreliability vs. SLO.
* **Risk Budget** (this policy): forward-looking tolerance for *change risk* before shipping.

They interact:

* If error budget is burned (service is unstable), risk budget is automatically constrained.
* If risk budget is low, release gates tighten by policy.

### 3.3 Diff-aware release gates (what it is)

A **release gate** is a set of required checks (tests, scans, reviews, rollout controls) that must pass before a change can progress.
**Diff-aware** means the gate level is determined by:

* what changed (diff classification),
* where it changed (criticality),
* how it ships (blast radius controls),
* and current operational context (incidents, SLO health, budget remaining).

---

## 4) Roles and accountability

### Product Manager (PM) — accountable for risk appetite

PM responsibilities:

* Define product-level risk tolerance with stakeholders (customer impact tolerance, regulatory constraints).
* Approve the **Risk Budget Policy settings** for their product/service tier (criticality level, default gates).
* Prioritize reliability work when budgets are constrained.
* Own customer communications for degraded service or risk-driven release deferrals.

### Development Manager (DM) — accountable for enforcement and engineering hygiene

DM responsibilities:

* Ensure pipelines implement diff classification and enforce gates.
* Ensure tests, telemetry, rollout mechanisms, and rollback procedures exist and are maintained.
* Ensure the exceptions process is real (every exception is logged, reviewed in a postmortem, and paid back).
* Own staffing/rotation decisions to ensure safe releases (on-call readiness, release captains).

### Shared responsibilities

PM + DM jointly:

* Review risk budget status weekly.
* Resolve trade-offs: feature velocity vs. reliability/security work.
* Approve gate profile changes (tighten/loosen) based on evidence.

---

## 5) Risk Budgets

### 5.1 Establish service tiers (criticality)

Each service/product component must be assigned a **Criticality Tier**:

* **Tier 0 – Internal only** (no external customers; low business impact)
* **Tier 1 – Customer-facing non-critical** (degradation tolerated; limited blast radius)
* **Tier 2 – Customer-facing critical** (core workflows; meaningful revenue/trust impact)
* **Tier 3 – Safety/financial/data-critical** (payments, auth, permissions, PII, regulated workflows)

Tier drives default budgets and minimum gates.

### 5.2 Choose a budget window and units

**Window:** default to **monthly** with weekly tracking; optionally sprint-based if release cadence is sprint-coupled.
**Units:** use **Risk Points (RP)** — consumed by each change. (Do not overcomplicate at first; tune with data.)

Recommended initial monthly budgets (adjust after 2–3 cycles with evidence):

* Tier 0: 300 RP/month
* Tier 1: 200 RP/month
* Tier 2: 120 RP/month
* Tier 3: 80 RP/month

> Interpretation: Tier 3 ships fewer “risky” changes; it can still ship frequently, but changes must be decomposed into low-risk diffs and shipped with strong controls.

### 5.3 Risk Point scoring (how changes consume budget)

Every change gets a **Release Risk Score (RRS)** in RP.

A practical baseline model:

**RRS = Base(criticality) + Diff Risk + Operational Context – Mitigations**

**Base (criticality):**

* Tier 0: +1
* Tier 1: +3
* Tier 2: +6
* Tier 3: +10

**Diff Risk (additive):**

* +1: docs, comments, non-executed code paths, telemetry-only additions
* +3: UI changes, non-core logic changes, refactors with high test coverage
* +6: API contract changes, dependency upgrades, medium-complexity logic in a core path
* +10: database schema migrations, auth/permission logic, data retention/PII handling
* +15: infra/networking changes, encryption/key handling, payment flows, queue semantics changes

**Operational Context (additive):**

* +5: service currently in incident or had Sev1/Sev2 in last 7 days
* +3: error budget < 50% remaining
* +2: on-call load high (paging above normal baseline)
* +5: release during restricted windows (holidays/freeze) via exception

**Mitigations (subtract):**

* –3: feature flag with staged rollout + instant kill switch verified
* –3: canary + automated health gates + rollback tested in last 30 days
* –2: high-confidence integration coverage for touched components
* –2: no data migration OR backward-compatible migration with proven rollback
* –2: change isolated behind permission boundary / limited cohort

**Minimum RRS floor:** never below 1 RP.

DM is responsible for making sure the pipeline can calculate a *default* RRS automatically and require humans only for edge cases.
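As a rough illustration of the baseline model above, the following Python sketch computes an RRS from the point tables in this section. The function and constant names (`release_risk_score`, `BASE_BY_TIER`) are hypothetical and not part of any existing Stella Ops pipeline; only the point values come from this document.

```python
# Point values are copied from the tables in section 5.3; names are illustrative only.
BASE_BY_TIER = {0: 1, 1: 3, 2: 6, 3: 10}


def release_risk_score(tier, diff_risk, context=(), mitigations=()):
    """RRS = Base(criticality) + Diff Risk + Operational Context - Mitigations.

    diff_risk:   diff-risk points (additive), e.g. [10] for a schema migration
    context:     operational-context points, e.g. [3] for error budget < 50%
    mitigations: mitigation credits as positive numbers, e.g. [3, 2]
    """
    raw = BASE_BY_TIER[tier] + sum(diff_risk) + sum(context) - sum(mitigations)
    return max(1, raw)  # minimum RRS floor: never below 1 RP


# Tier 2 service, schema migration (+10), error budget < 50% remaining (+3),
# feature flag with verified kill switch (-3), backward-compatible migration (-2).
print(release_risk_score(2, [10], context=[3], mitigations=[3, 2]))  # 14

# Mitigations cannot push a change below the 1 RP floor.
print(release_risk_score(0, [1], mitigations=[3, 3]))  # 1
```

In practice the inputs would come from automated diff classification and operational telemetry, with humans adjusting only the edge cases.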

### 5.4 Budget operating rules

**Budget ledger:** Maintain a per-service ledger:

* Budget allocated for the window
* RP consumed per release
* RP remaining
* Trendline (projected depletion date)
* Exceptions (break-glass releases)
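A minimal sketch of such a ledger in Python is shown here. The class name, field names, and the service name `checkout-api` are hypothetical, and the trendline uses a simple linear projection of the average daily burn, which is an assumption rather than a method this policy prescribes.

```python
import math
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class BudgetLedger:
    service: str
    allocated_rp: int              # budget allocated for the window
    window_start: date
    entries: list = field(default_factory=list)  # (day, rp, is_exception)

    def record(self, day, rp, exception=False):
        """Log the RP consumed by one release."""
        self.entries.append((day, rp, exception))

    @property
    def consumed_rp(self):
        return sum(rp for _, rp, _ in self.entries)

    @property
    def remaining_rp(self):
        return self.allocated_rp - self.consumed_rp

    @property
    def exceptions(self):
        return [entry for entry in self.entries if entry[2]]

    def projected_depletion(self, today):
        """Assumed linear trend: keep burning at the average daily rate so far."""
        elapsed_days = (today - self.window_start).days
        if elapsed_days <= 0 or self.consumed_rp <= 0:
            return None  # not enough data to project
        if self.remaining_rp <= 0:
            return today  # already depleted
        daily_burn = self.consumed_rp / elapsed_days
        return today + timedelta(days=math.ceil(self.remaining_rp / daily_burn))


ledger = BudgetLedger("checkout-api", 120, date(2024, 3, 1))
ledger.record(date(2024, 3, 4), 14)
ledger.record(date(2024, 3, 8), 16, exception=True)
print(ledger.remaining_rp)                              # 90
print(ledger.projected_depletion(date(2024, 3, 11)))    # 2024-04-10
```

A projected depletion date that falls before the window boundary is an early signal to decompose or defer upcoming high-risk work.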

**Control thresholds:**

* **Green (≥60% remaining):** normal operation
* **Yellow (30% to <60%):** additional caution; gates tighten by 1 level for medium/high-risk diffs
* **Red (<30%):** freeze high-risk diffs; allow only low-risk changes or reliability/security work
* **Exhausted (≤0%):** releases restricted to incident fixes, security fixes, and rollbacks, with tightened gates and explicit sign-off
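The thresholds can be expressed as a small lookup. This Python sketch uses a hypothetical function name (`budget_status`) and treats everything from 30% up to, but not including, 60% as Yellow so that fractional percentages always land in exactly one band.

```python
def budget_status(remaining_rp, allocated_rp):
    """Map remaining budget to the control thresholds in section 5.4."""
    pct = 100.0 * remaining_rp / allocated_rp
    if pct <= 0:
        return "Exhausted"
    if pct < 30:
        return "Red"
    if pct < 60:
        return "Yellow"
    return "Green"


print(budget_status(90, 120))  # Green  (75% remaining)
print(budget_status(40, 120))  # Yellow (about 33% remaining)
print(budget_status(20, 80))   # Red    (25% remaining)
print(budget_status(-5, 80))   # Exhausted (overdrawn, e.g. after an exception penalty)
```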

### 5.5 What to do when budget is low (expected behavior)

When Yellow/Red:

* PM shifts roadmap execution toward:

  * reliability work, defect burn-down,
  * decomposing large changes into smaller, reversible diffs,
  * reducing scope of risky features.
* DM enforces:

  * smaller diffs,
  * increased feature flagging,
  * staged rollout requirements,
  * improved test/observability coverage.

Budget constraints are a signal, not a punishment.

### 5.6 Budget replenishment and incentives

Budgets replenish on the window boundary, but we also allow **earned capacity**:

* If a service improves change failure rate and MTTR for 2 consecutive windows, it may earn:

  * +10–20% budget increase **or**
  * one gate level relaxation for specific change categories

This must be evidence-driven (metrics, not opinions).
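One possible way to make the eligibility check mechanical is sketched below in Python. It interprets "improves for 2 consecutive windows" as each of the last two windows being strictly better than the window before it, which is an assumption; the function name and the sample figures are illustrative only.

```python
def earned_capacity_eligible(cfr_by_window, mttr_by_window, windows=2):
    """True if both change failure rate and MTTR fell in each of the last `windows` windows.

    Both arguments are ordered oldest to newest; lower values are better.
    """
    def improving(series):
        recent = series[-(windows + 1):]
        if len(recent) < windows + 1:
            return False  # not enough history to judge
        return all(later < earlier for earlier, later in zip(recent, recent[1:]))

    return improving(cfr_by_window) and improving(mttr_by_window)


# Change failure rate as a fraction of deploys, MTTR in minutes (sample data).
print(earned_capacity_eligible([0.20, 0.15, 0.10], [60, 45, 40]))  # True
print(earned_capacity_eligible([0.20, 0.15, 0.15], [60, 45, 40]))  # False: CFR stalled
```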

---

## 6) Diff-Aware Release Gates

### 6.1 Diff classification (what the pipeline must detect)

At minimum, automatically classify diffs into these categories:

**Code scope**

* Executable code vs docs-only
* Core vs non-core modules (define module ownership boundaries)
* Hot paths (latency-sensitive), correctness-sensitive paths

**Data scope**

* Schema migration (additive vs breaking)
* Backfill jobs / batch jobs
* Data model changes impacting downstream consumers
* PII / regulated data touchpoints

**Security scope**

* Authn/authz logic
* Permission checks
* Secrets, key handling, encryption changes
* Dependency changes with known CVEs

**Infra scope**

* IaC changes, networking, load balancer, DNS, autoscaling
* Runtime config changes (feature flags, limits, thresholds)
* Queue/topic changes, retention settings

**Interface scope**

* Public API contract changes
* Backward compatibility of payloads/events
* Client version dependency
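A simple starting point for automated classification is path-based matching, sketched below in Python. The directory layout and glob patterns are entirely hypothetical and would need to be replaced with each repository's real structure. Path matching alone cannot distinguish, for example, an additive migration from a breaking one, so treat this as a first pass rather than a complete classifier.

```python
from fnmatch import fnmatchcase

# Hypothetical repository layout; substitute real paths per service.
CATEGORY_PATTERNS = {
    "db": ["migrations/*", "*.sql"],
    "auth": ["src/auth/*", "src/permissions/*"],
    "infra": ["terraform/*", "k8s/*", "*.tf"],
    "api": ["api/openapi.yaml", "proto/*"],
    "deps": ["requirements.txt", "package-lock.json", "go.sum"],
    "config": ["config/*", "flags/*"],
    "docs": ["docs/*", "*.md"],
}


def classify_diff(changed_paths):
    """Return the set of category flags touched by a list of changed file paths."""
    flags = set()
    for path in changed_paths:
        matched = {
            category
            for category, patterns in CATEGORY_PATTERNS.items()
            if any(fnmatchcase(path, pattern) for pattern in patterns)
        }
        # Anything unmatched is treated as ordinary executable code.
        flags |= matched or {"code"}
    return flags


print(classify_diff(["docs/guide.md"]) == {"docs"})  # True: docs-only diff
print(sorted(classify_diff([
    "migrations/001_add_col.sql",
    "src/auth/login.py",
    "src/billing/invoice.py",
])))  # ['auth', 'code', 'db']
```

Defaulting unmatched paths to `code` errs on the side of caution: a file nobody has categorized is never silently treated as docs-only.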

### 6.2 Gate levels

Define **Gate Levels G0–G4**. The pipeline assigns one based on diff + context + budget.

#### G0 — No-risk / administrative

Use for:

* docs-only, comments-only, non-functional metadata

Requirements:

* Lint/format checks
* Basic CI pass (build)

#### G1 — Low risk

Use for:

* small, localized code changes with strong unit coverage
* non-core UI changes
* telemetry additions (no removal)

Requirements:

* All automated unit tests
* Static analysis/linting
* 1 peer review (code owner not required if outside critical modules)
* Automated deploy to staging
* Post-deploy smoke checks

#### G2 — Moderate risk

Use for:

* moderate logic changes in customer-facing paths
* dependency upgrades
* API changes that are backward compatible
* config changes affecting behavior

Requirements:

* G1 +
* Integration tests relevant to impacted modules
* Code owner review for touched modules
* Feature flag required if customer impact possible
* Staged rollout: canary or small cohort
* Rollback plan documented in PR

#### G3 — High risk

Use for:

* schema migrations
* auth/permission changes
* core business logic in critical flows
* infra changes affecting availability
* non-trivial concurrency/queue semantics changes

Requirements:

* G2 +
* Security scan + dependency audit (must pass, exceptions logged)
* Migration plan (forward + rollback) reviewed
* Load/performance checks if in hot path
* Observability: new/updated dashboards/alerts for the change
* Release captain / on-call sign-off (someone accountable live)
* Progressive delivery with automatic health gates (error rate/latency)

#### G4 — Very high risk / safety-critical / budget-constrained releases

Use for:

* Tier 3 critical systems with low budget remaining
* changes during freeze windows via exception
* broad blast radius changes (platform-wide)
* remediation after major incident where recurrence risk is high

Requirements:

* G3 +
* Formal risk review (PM+DM+Security/SRE) in writing
* Explicit rollback rehearsal or prior proven rollback path
* Extended canary period with success criteria and abort criteria
* Customer comms plan if impact is plausible
* Post-release verification checklist executed and logged

### 6.3 Gate selection logic (policy)

Default rule:

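1. Compute **RRS** (in Risk Points) using the model in section 5.3.
2. Map RRS to default gate (docs-only, comments-only, and metadata-only diffs are assigned G0 directly by classification, per section 6.2):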

   * 1–5 RP → G1
   * 6–12 RP → G2
   * 13–20 RP → G3
   * 21+ RP → G4
3. Apply modifiers:

   * If **budget Yellow**: escalate one gate for changes ≥ G2
   * If **budget Red**: escalate one gate for changes ≥ G1 and block high-risk categories unless an exception is approved (see section 7)
   * If there is an active incident or the error budget is severely degraded: block non-fix releases by default

DM must ensure the pipeline enforces this mapping automatically.
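The default rule can be sketched in Python as follows. All names (`select_gate` and its parameters) are hypothetical. Because the RRS floor is 1 RP, the score mapping starts at G1; this sketch assumes G0 is assigned by diff classification (section 6.2) rather than by score, and it caps escalation at G4.

```python
GATES = ["G0", "G1", "G2", "G3", "G4"]


def default_gate_level(rrs):
    """RRS bands from section 6.3, returned as an index into GATES."""
    if rrs <= 5:
        return 1
    if rrs <= 12:
        return 2
    if rrs <= 20:
        return 3
    return 4


def select_gate(rrs, budget_status="Green", high_risk_category=False,
                active_incident=False, is_fix=False, has_exception=False):
    """Return (gate, blocked) after applying the budget and incident modifiers."""
    level = default_gate_level(rrs)
    blocked = False

    if budget_status == "Yellow" and level >= 2:
        level += 1
    elif budget_status == "Red":
        if level >= 1:
            level += 1
        if high_risk_category and not has_exception:
            blocked = True

    # Active incident (or severely degraded error budget): only fixes proceed.
    if active_incident and not is_fix:
        blocked = True

    return GATES[min(level, 4)], blocked


print(select_gate(14))                                      # ('G3', False)
print(select_gate(8, budget_status="Yellow"))               # ('G3', False)
print(select_gate(4, budget_status="Yellow"))               # ('G1', False)
print(select_gate(25, "Red", high_risk_category=True))      # ('G4', True)
print(select_gate(3, active_incident=True, is_fix=True))    # ('G1', False)
```

Returning the gate and the block decision separately lets a pipeline report which gate would apply even when the release is currently blocked.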

### 6.4 “Diff-aware” also means “blast-radius aware”

If the diff is inherently risky, reduce risk operationally:

* feature flags with cohort controls
* dark launches (ship code disabled)
* canary deployments
* blue/green with quick revert
* backwards-compatible DB migrations (expand/contract pattern)
* circuit breakers and rate limiting
* progressive exposure by tenant / region / account segment

Large diffs are not “made safe” by more reviewers; they are made safe by **reversibility and containment**.

---

## 7) Exceptions (“break glass”) policy

Exceptions are permitted only when one of these is true:

* incident mitigation or customer harm prevention,
* urgent security fix (actively exploited or high severity),
* legal/compliance deadline.

**Requirements for any exception:**

* Recorded rationale in the PR/release ticket
* Named approver(s): DM + on-call owner; PM for customer-impacting risk
* Mandatory follow-up within 5 business days:

  * post-incident or post-release review
  * remediation tasks created and prioritized
* **Budget penalty:** deduct extra RP from the budget on top of the change's normal RRS (e.g., an additional 50% of the RRS) to reflect unmanaged risk

Repeated exceptions are a governance failure and trigger gate tightening.
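To make the penalty arithmetic concrete, here is a short Python sketch. The 50% rate is the example figure from this section; rounding the penalty up to a whole RP is an assumption, as is the function name.

```python
import math


def exception_charge(rrs, penalty_rate=0.5):
    """Total RP deducted for a break-glass release: the RRS plus the penalty."""
    penalty = math.ceil(rrs * penalty_rate)  # assumed: round up to a whole RP
    return rrs + penalty


print(exception_charge(14))  # 21 (14 RP for the change + 7 RP penalty)
print(exception_charge(15))  # 23 (15 RP for the change + 8 RP penalty)
```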

---

## 8) Operational metrics (what PMs and DMs must review)

Minimum weekly review dashboard per service:

* **Risk budget remaining** (RP and %)
* **Deploy frequency**
* **Change failure rate**
* **MTTR**
* **Sev1/Sev2 count** (rolling 30/90 days)
* **SLO / error budget status**
* **Gate compliance rate** (share of releases that passed their assigned gate without a bypass or override)
* **Diff size distribution** (are we shipping huge diffs?)
* **Rollback frequency and time-to-rollback**

Policy expectation:

* If change failure rate or MTTR worsens materially over 2 windows, budgets tighten and gate mapping escalates until stability returns.

---

## 9) Practical operating cadence

### Weekly (PM + DM)

* Review budgets and trends
* Identify upcoming high-risk releases and plan staged rollouts
* Confirm staffing for release windows (release captain / on-call coverage)
* Decide whether to defer, decompose, or harden changes

### Per release (DM-led, PM informed)

* Ensure correct gate level
* Verify rollout + rollback readiness
* Confirm monitoring/alerts exist and are watched during rollout
* Execute post-release verification checklist

### Monthly (leadership)

* Adjust tier assignments if product criticality changed
* Recalibrate budget numbers based on measured outcomes
* Identify systemic causes: test gaps, observability gaps, deployment tooling gaps

---

## 10) Required templates (standardize execution)

### 10.1 Release Plan (required for G2+)

* What is changing (1–3 bullets)
* Expected customer impact (or “none”)
* Diff category flags (DB/auth/infra/API/etc.)
* Rollout strategy (canary/cohort/blue-green)
* Abort criteria (exact metrics/thresholds)
* Rollback steps (exact commands/process)
* Owners during rollout (names)

### 10.2 Migration Plan (required for schema/data changes)

* Migration type: additive / expand-contract / breaking (breaking is disallowed without explicit G4 approval)
* Backfill approach and rate limits
* Validation checks (row counts, invariants)
* Rollback strategy (including data implications)

### 10.3 Post-release Verification Checklist (G1+)

* Smoke test results
* Key dashboards checked (latency, error rate, saturation)
* Alerts status
* User-facing workflows validated (as applicable)
* Ticket updated with outcome

---

## 11) What “good” looks like

* Low-risk diffs ship quickly with minimal ceremony (G0–G1).
* High-risk diffs are decomposed and shipped progressively, not heroically.
* Risk budgets are visible, used in planning, and treated as a real constraint.
* Exceptions are rare and followed by concrete remediation.
* Over time: deploy frequency stays high while change failure rate and MTTR decrease.

---

## 12) Immediate adoption checklist (first 30 days)

**DM deliverables**

* Implement diff classification in CI/CD (at least: DB/auth/infra/API/deps/config)
* Implement automatic gate mapping and enforcement
* Add “release plan” and “rollback plan” checks for G2+
* Add logging for gate overrides

**PM deliverables**

* Confirm service tiering for owned areas
* Approve initial monthly RP budgets
* Add risk budget review to the weekly product/engineering ritual
* Reprioritize work when budgets hit Yellow/Red (explicitly)
