Multi-Tenant Policy Rollout Flow
Overview
The Multi-Tenant Policy Rollout Flow describes how StellaOps propagates policy changes across multiple tenants in a controlled, auditable manner. This flow supports staged rollouts, canary deployments, and rollback capabilities for enterprise policy governance.
Business Value: Centralized policy management with controlled rollout reduces risk of policy changes breaking production workflows while ensuring consistent security standards across the organization.
Actors
| Actor | Type | Role |
|---|---|---|
| Policy Admin | Human | Creates and approves policy changes |
| Platform Admin | Human | Manages cross-tenant rollouts |
| Policy Engine | Service | Evaluates and applies policies |
| Authority | Service | Manages tenant hierarchy |
| Notify | Service | Alerts on rollout status |
| Scheduler | Service | Orchestrates staged rollout |
Prerequisites
- Multi-tenant environment configured
- Tenant hierarchy defined (org → teams → projects)
- Policy inheritance rules established
- Rollout approval workflow configured
Tenant Hierarchy
Organization (acme-corp)
├── Team: Platform Engineering
│   ├── Project: core-services
│   └── Project: infrastructure
├── Team: Product Development
│   ├── Project: web-app
│   ├── Project: mobile-api
│   └── Project: data-pipeline
└── Team: Security
    └── Project: security-tools
Flow Diagram
┌────────────────────────────────────────────────────────────────────────────┐
│                      Multi-Tenant Policy Rollout Flow                      │
└────────────────────────────────────────────────────────────────────────────┘
┌──────────┐  ┌─────────┐  ┌───────────┐  ┌──────────┐  ┌────────┐  ┌────────┐
│  Policy  │  │ Policy  │  │ Scheduler │  │ Authority│  │ Policy │  │ Notify │
│  Admin   │  │ Store   │  │           │  │          │  │ Engine │  │        │
└────┬─────┘  └────┬────┘  └─────┬─────┘  └────┬─────┘  └───┬────┘  └───┬────┘
     │             │             │             │            │           │
     │ Create      │             │             │            │           │
     │ policy v2   │             │             │            │           │
     │────────────>│             │             │            │           │
     │             │             │             │            │           │
     │             │ Store as    │             │            │           │
     │             │ draft       │             │            │           │
     │             │───┐         │             │            │           │
     │             │   │         │             │            │           │
     │             │<──┘         │             │            │           │
     │             │             │             │            │           │
     │ Define      │             │             │            │           │
     │ rollout     │             │             │            │           │
     │──────────────────────────>│             │            │           │
     │             │             │             │            │           │
     │             │             │ Get tenant  │            │           │
     │             │             │ hierarchy   │            │           │
     │             │             │────────────>│            │           │
     │             │             │             │            │           │
     │             │             │ Tenant tree │            │           │
     │             │             │<────────────│            │           │
     │             │             │             │            │           │
     │ Rollout     │             │             │            │           │
     │ plan        │             │             │            │           │
     │<──────────────────────────│             │            │           │
     │             │             │             │            │           │
     │ Approve     │             │             │            │           │
     │──────────────────────────>│             │            │           │
     │             │             │             │            │           │
     │             │             │ Stage 1:    │            │           │
     │             │             │ Canary      │            │           │
     │             │             │─────────────────────────>│           │
     │             │             │             │            │           │
     │             │             │             │            │ Apply to  │
     │             │             │             │            │ canary    │
     │             │             │             │            │ tenant    │
     │             │             │             │            │───┐       │
     │             │             │             │            │   │       │
     │             │             │             │            │<──┘       │
     │             │             │             │            │           │
     │             │             │ Monitor     │            │           │
     │             │             │ (24h)       │            │           │
     │             │             │───┐         │            │           │
     │             │             │   │         │            │           │
     │             │             │<──┘         │            │           │
     │             │             │             │            │           │
     │             │             │ Stage 2:    │            │           │
     │             │             │ 25% tenants │            │           │
     │             │             │─────────────────────────>│           │
     │             │             │             │            │           │
     │             │             │ ...         │            │           │
     │             │             │             │            │           │
     │             │             │ Stage N:    │            │           │
     │             │             │ 100%        │            │           │
     │             │             │─────────────────────────>│           │
     │             │             │             │            │           │
     │             │             │ Complete    │            │           │
     │             │             │─────────────────────────────────────>│
     │             │             │             │            │           │
     │ Rollout     │             │             │            │           │ Notify
     │ complete    │             │             │            │           │ admins
     │<─────────────────────────────────────────────────────────────────│
     │             │             │             │            │           │
Step-by-Step
1. Policy Creation
Policy Admin creates new policy version:
# Policy Set: production-v2
version: "stella-dsl@1"
name: production
version_tag: "v2.0.0"
description: "Updated production policy with KEV blocking"
changes_from_v1:
  - added: block-kev-vulnerabilities
  - modified: critical-threshold (9.0 → 8.5)
  - removed: legacy-exception-rule
rules:
  - name: block-kev-vulnerabilities
    description: Block any KEV-listed vulnerability
    condition: kev == true
    action: FAIL
    severity: critical
  - name: no-critical-reachable
    condition: |
      severity == 'critical' AND
      cvss >= 8.5 AND
      reachability IN ['SR', 'RO', 'CR']
    action: FAIL
2. Rollout Plan Definition
Platform Admin defines rollout strategy:
{
  "rollout_id": "rollout-789",
  "policy_set": "production",
  "from_version": "v1.0.0",
  "to_version": "v2.0.0",
  "strategy": "staged",
  "stages": [
    {
      "name": "canary",
      "description": "Single low-risk tenant",
      "tenants": ["platform-eng-core-services"],
      "duration": "24h",
      "success_criteria": {
        "max_new_failures": 5,
        "max_failure_rate_increase": 0.05
      },
      "auto_proceed": false
    },
    {
      "name": "early-adopters",
      "description": "25% of tenants (by scan volume)",
      "selection": {
        "method": "percentage",
        "value": 25,
        "weight_by": "scan_volume"
      },
      "duration": "48h",
      "success_criteria": {
        "max_new_failures": 20,
        "max_failure_rate_increase": 0.10
      },
      "auto_proceed": true
    },
    {
      "name": "majority",
      "description": "75% of tenants",
      "selection": {
        "method": "percentage",
        "value": 75
      },
      "duration": "24h",
      "auto_proceed": true
    },
    {
      "name": "full",
      "description": "100% of tenants",
      "selection": {
        "method": "all"
      }
    }
  ],
  "rollback": {
    "automatic": true,
    "triggers": [
      {"metric": "failure_rate_increase", "threshold": 0.20},
      {"metric": "new_critical_blocks", "threshold": 50}
    ]
  }
}
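The `selection` object of the `early-adopters` stage can be read in more than one way. The sketch below takes one interpretation as an assumption, not as documented StellaOps behavior: tenants are added, smallest scan volume first, until they cover the requested share of total scan volume. `TenantInfo`, `scanVolume`, and `selectByScanVolume` are hypothetical names.

```typescript
// Hypothetical tenant record; only "percentage" and "scan_volume" come from the plan.
interface TenantInfo {
  id: string;
  scanVolume: number; // scans in some reference window (assumed)
}

// Adds tenants in ascending scan-volume order until the picked set covers
// at least `percent` of the total scan volume. The last tenant added may
// push the covered share above the target.
function selectByScanVolume(tenants: TenantInfo[], percent: number): string[] {
  const total = tenants.reduce((sum, t) => sum + t.scanVolume, 0);
  const target = (total * percent) / 100;
  const ordered = tenants.slice().sort((a, b) => a.scanVolume - b.scanVolume);
  const picked: string[] = [];
  let covered = 0;
  for (const t of ordered) {
    if (covered >= target) break;
    picked.push(t.id);
    covered += t.scanVolume;
  }
  return picked;
}
```

Starting with low-volume tenants keeps the early blast radius small; a count-based reading (25% of tenants, ordered by volume) would be an equally plausible alternative.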
3. Impact Analysis
Before approval, the system analyzes the potential impact of the new version against recent scan history:
{
  "impact_analysis": {
    "rollout_id": "rollout-789",
    "analysis_date": "2024-12-29T10:00:00Z",
    "historical_data_range": "30d",
    "results": {
      "total_scans_analyzed": 15234,
      "predicted_new_failures": 127,
      "predicted_failure_rate_change": "+0.83%",
      "affected_images": 89,
      "by_team": [
        {"team": "Product Development", "new_failures": 78},
        {"team": "Platform Engineering", "new_failures": 31},
        {"team": "Security", "new_failures": 18}
      ],
      "top_triggered_rules": [
        {"rule": "block-kev-vulnerabilities", "count": 45},
        {"rule": "no-critical-reachable", "count": 82}
      ],
      "recommendation": "PROCEED_WITH_CAUTION"
    }
  }
}
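The predicted figures are mutually consistent: the per-team counts sum to 127, and 127 new failures over 15234 analyzed scans is about 0.83%. A minimal sketch of that arithmetic, assuming the rate change is simply new failures divided by scans analyzed (`formatRateChange` is a hypothetical helper, not a StellaOps function):

```typescript
// Formats the change in failure rate as a signed percentage string.
// Assumes every analyzed scan is re-evaluated under the new policy, so each
// new failure moves the failure rate by 1 / totalScans.
function formatRateChange(newFailures: number, totalScans: number): string {
  if (totalScans <= 0) throw new Error("totalScans must be positive");
  const pct = (newFailures / totalScans) * 100;
  const sign = pct >= 0 ? "+" : "";
  return `${sign}${pct.toFixed(2)}%`;
}

// Per-team counts from the example above.
const byTeam = [78, 31, 18];
const predictedNewFailures = byTeam.reduce((sum, n) => sum + n, 0); // 127
console.log(formatRateChange(predictedNewFailures, 15234)); // "+0.83%"
```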
4. Approval and Initiation
Policy Admin approves rollout after review:
{
  "approval": {
    "rollout_id": "rollout-789",
    "approved_by": "policy-admin@acme.com",
    "approved_at": "2024-12-29T11:00:00Z",
    "approval_notes": "Impact acceptable. Notified affected teams.",
    "notifications_sent": [
      {"channel": "slack", "target": "#platform-engineering"},
      {"channel": "email", "target": "team-leads@acme.com"}
    ]
  }
}
5. Staged Execution
Scheduler executes each stage:
Stage 1: Canary
{
  "stage_execution": {
    "rollout_id": "rollout-789",
    "stage": "canary",
    "started_at": "2024-12-29T11:00:00Z",
    "tenants_activated": ["platform-eng-core-services"],
    "status": "monitoring"
  }
}
Stage Monitoring
{
  "stage_metrics": {
    "rollout_id": "rollout-789",
    "stage": "canary",
    "monitored_period": "24h",
    "metrics": {
      "scans_evaluated": 234,
      "new_failures": 3,
      "failure_rate_before": 0.12,
      "failure_rate_after": 0.13,
      "success_criteria_met": true
    }
  }
}
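One way to read `success_criteria_met` is as a check of the observed metrics against the stage's `success_criteria`. The sketch below assumes `max_failure_rate_increase` is an absolute difference between the before and after rates; `meetsSuccessCriteria` and the two interface names are hypothetical.

```typescript
interface SuccessCriteria {
  max_new_failures: number;
  max_failure_rate_increase: number; // assumed absolute, e.g. 0.05 = 5 points
}

interface ObservedStageMetrics {
  new_failures: number;
  failure_rate_before: number;
  failure_rate_after: number;
}

// True when both limits hold; reaching a limit exactly still counts as a pass.
function meetsSuccessCriteria(m: ObservedStageMetrics, c: SuccessCriteria): boolean {
  const increase = m.failure_rate_after - m.failure_rate_before;
  return m.new_failures <= c.max_new_failures && increase <= c.max_failure_rate_increase;
}
```

For the canary figures above (3 new failures, rate 0.12 to 0.13) and the canary criteria (5 and 0.05), the check passes, which agrees with `"success_criteria_met": true`.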
6. Progressive Rollout
After canary success, proceed to next stages:
{
  "stage_progression": {
    "rollout_id": "rollout-789",
    "completed_stages": ["canary", "early-adopters", "majority"],
    "current_stage": "full",
    "tenants_on_v2": 47,
    "tenants_on_v1": 0,
    "total_rollout_duration": "96h",
    "status": "completed"
  }
}
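The reported `total_rollout_duration` of 96h equals the sum of the stage durations in the plan (24h + 48h + 24h; the `full` stage has none). A small sketch of that sum, assuming durations always take the `<integer>h` or `<integer>d` form seen in this document; `parseDurationHours` and `totalDurationHours` are hypothetical helpers.

```typescript
// Parses strings such as "24h" or "30d" into hours. Other formats are rejected.
function parseDurationHours(duration: string): number {
  const match = /^(\d+)([hd])$/.exec(duration);
  if (match === null) throw new Error(`unsupported duration: ${duration}`);
  const value = parseInt(match[1], 10);
  return match[2] === "d" ? value * 24 : value;
}

// Sums stage durations; stages without a duration contribute nothing.
function totalDurationHours(durations: Array<string | undefined>): number {
  let total = 0;
  for (const d of durations) {
    if (d !== undefined) total += parseDurationHours(d);
  }
  return total;
}

console.log(totalDurationHours(["24h", "48h", "24h", undefined])); // 96
```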
7. Rollback (If Needed)
If a rollback trigger threshold is exceeded, the rollout is rolled back automatically:
{
  "rollback": {
    "rollout_id": "rollout-789",
    "triggered_at": "2024-12-30T15:30:00Z",
    "trigger_reason": "failure_rate_increase exceeded 0.20 threshold",
    "rollback_stage": "early-adopters",
    "tenants_rolled_back": 12,
    "action": "reverted to v1.0.0",
    "notifications_sent": true
  }
}
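The `trigger_reason` above says the metric "exceeded" its threshold, so a natural reading is a strict greater-than comparison against each entry in `rollback.triggers`. The following sketch works under that assumption; `firedTriggers` and the shape of the `observed` map are hypothetical.

```typescript
interface RollbackTrigger {
  metric: string;
  threshold: number;
}

// Returns the triggers whose observed metric is strictly above its threshold.
// Metrics missing from `observed` never fire (an assumption of this sketch).
function firedTriggers(
  triggers: RollbackTrigger[],
  observed: { [metric: string]: number | undefined },
): RollbackTrigger[] {
  return triggers.filter((t) => {
    const value = observed[t.metric];
    return value !== undefined && value > t.threshold;
  });
}
```

A scheduler could start a rollback as soon as this list is non-empty and use the first entry to build the `trigger_reason` string.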
Rollout Strategies
Blue-Green
strategy: blue_green
config:
  parallel_evaluation: true  # Both versions evaluated
  comparison_period: 24h
  switch_threshold:
    verdict_match_rate: 0.95
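A sketch of how `verdict_match_rate` might be computed during the comparison period, assuming both policy versions evaluate the same scans and their verdicts can be paired by position. `verdictMatchRate` and `shouldSwitch` are hypothetical helpers, and the verdict strings are placeholders.

```typescript
// Fraction of scans for which the old and new policy versions agree.
function verdictMatchRate(oldVerdicts: string[], newVerdicts: string[]): number {
  if (oldVerdicts.length === 0 || oldVerdicts.length !== newVerdicts.length) {
    throw new Error("verdict lists must be non-empty and of equal length");
  }
  let matches = 0;
  for (let i = 0; i < oldVerdicts.length; i++) {
    if (oldVerdicts[i] === newVerdicts[i]) matches++;
  }
  return matches / oldVerdicts.length;
}

// Switch only when agreement reaches the configured threshold (0.95 above).
function shouldSwitch(matchRate: number, threshold: number): boolean {
  return matchRate >= threshold;
}
```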
Canary with Traffic Split
strategy: canary_traffic
config:
  initial_percentage: 5
  increment: 10
  increment_interval: 4h
  max_error_rate: 0.01
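To make the ramp concrete, the sketch below expands `initial_percentage`, `increment`, and `increment_interval` into a schedule. It assumes the percentage is capped at 100 and ignores `max_error_rate`, which would presumably pause or abort the ramp; `trafficSchedule` and `TrafficStep` are hypothetical names.

```typescript
interface TrafficStep {
  atHour: number;     // hours since the rollout started
  percentage: number; // share of traffic evaluated with the new policy
}

function trafficSchedule(initial: number, increment: number, intervalHours: number): TrafficStep[] {
  if (increment <= 0) throw new Error("increment must be positive");
  const steps: TrafficStep[] = [];
  let percentage = initial;
  let atHour = 0;
  while (percentage < 100) {
    steps.push({ atHour, percentage });
    percentage += increment;
    atHour += intervalHours;
  }
  steps.push({ atHour, percentage: 100 }); // final step is capped at 100%
  return steps;
}
```

With the values above (5, 10, 4h) this yields 11 steps (5%, 15%, ..., 95%, 100%) and reaches 100% at hour 40.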
Feature Flag
strategy: feature_flag
config:
  flag_name: "policy-v2-enabled"
  default: false
  overrides:
    - tenant: "security-team"
      value: true
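Resolving the flag for a tenant could look like the sketch below, assuming an override matches on exact tenant name and the first match wins. `FeatureFlagConfig` and `isPolicyEnabled` are hypothetical names; the field names mirror the YAML above.

```typescript
interface FeatureFlagConfig {
  flag_name: string;
  default: boolean;
  overrides: Array<{ tenant: string; value: boolean }>;
}

// Returns the override for the tenant if one exists, otherwise the default.
function isPolicyEnabled(config: FeatureFlagConfig, tenant: string): boolean {
  for (const override of config.overrides) {
    if (override.tenant === tenant) return override.value;
  }
  return config.default;
}

const flag: FeatureFlagConfig = {
  flag_name: "policy-v2-enabled",
  default: false,
  overrides: [{ tenant: "security-team", value: true }],
};
console.log(isPolicyEnabled(flag, "security-team")); // true
console.log(isPolicyEnabled(flag, "web-app"));       // false
```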
Data Contracts
Rollout Plan Schema
interface RolloutPlan {
  rollout_id: string;
  policy_set: string;
  from_version: string;
  to_version: string;
  strategy: 'staged' | 'blue_green' | 'canary_traffic' | 'feature_flag';
  stages: Stage[];
  rollback: {
    automatic: boolean;
    triggers: RollbackTrigger[];
  };
  notifications: NotificationConfig[];
}
interface Stage {
  name: string;
  description?: string;
  tenants?: string[];          // explicit tenant IDs
  selection?: TenantSelection; // rule-based tenant selection
  duration?: string;           // monitoring duration, e.g. "24h" as in the rollout plan example
  success_criteria?: SuccessCriteria;
  auto_proceed?: boolean;
}
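`Stage` and `RolloutPlan` reference `TenantSelection`, `SuccessCriteria`, and `RollbackTrigger` without defining them. The definitions below are hypothetical and inferred only from the JSON examples in this document; the real contracts may carry more fields. `stageTargetsAreValid` encodes a further assumption, that a stage names its tenants explicitly or describes a selection, but not both.

```typescript
// Inferred from the "selection" objects in the rollout plan example.
interface TenantSelection {
  method: "percentage" | "all";
  value?: number;     // percentage, used when method is "percentage"
  weight_by?: string; // e.g. "scan_volume"
}

// Inferred from the "success_criteria" objects.
interface SuccessCriteria {
  max_new_failures: number;
  max_failure_rate_increase: number;
}

// Inferred from the "rollback.triggers" entries.
interface RollbackTrigger {
  metric: string;
  threshold: number;
}

// Assumed rule: exactly one of `tenants` or `selection` is present.
function stageTargetsAreValid(stage: { tenants?: string[]; selection?: TenantSelection }): boolean {
  const explicit = stage.tenants !== undefined && stage.tenants.length > 0;
  const selected = stage.selection !== undefined;
  return explicit !== selected;
}
```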
Rollout Status Schema
interface RolloutStatus {
  rollout_id: string;
  status: 'pending' | 'in_progress' | 'paused' | 'completed' | 'rolled_back' | 'failed';
  current_stage?: string;
  stages: Array<{
    name: string;
    status: 'pending' | 'active' | 'monitoring' | 'completed' | 'failed';
    started_at?: string;
    completed_at?: string;
    metrics?: StageMetrics;
  }>;
  tenant_status: Array<{
    tenant_id: string;
    policy_version: string;
    activated_at?: string;
  }>;
}
Policy Inheritance
Organization Policy (base)
    └── inherits_from: stellaops-default
Team Policy (override)
    └── inherits_from: organization
    └── overrides: [severity-thresholds]
Project Policy (final)
    └── inherits_from: team
    └── overrides: [specific-exceptions]
Resolution order: Project → Team → Organization → Platform Default
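The resolution order can be pictured as a layered merge in which the most specific layer that defines a setting wins. The sketch below assumes flat key/value settings; `resolvePolicy` and the setting keys in the example are hypothetical, and real policy sets are likely richer than this.

```typescript
type PolicySettings = { [key: string]: number | string | boolean };

// `layers` is ordered from most specific to least specific:
// [project, team, organization, platform default].
function resolvePolicy(layers: PolicySettings[]): PolicySettings {
  // Apply the least specific layer first so that more specific layers overwrite it.
  return layers
    .slice()
    .reverse()
    .reduce<PolicySettings>((resolved, layer) => ({ ...resolved, ...layer }), {});
}

const effective = resolvePolicy([
  {},                                            // project: no overrides
  { critical_threshold: 8.5 },                   // team
  { block_kev: true },                           // organization
  { critical_threshold: 9.0, block_kev: false }, // platform default
]);
console.log(effective); // { critical_threshold: 8.5, block_kev: true }
```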
Error Handling
| Error | Recovery |
|---|---|
| Stage timeout | Pause rollout, alert admin |
| Metrics unavailable | Use last known good, extend monitoring |
| Tenant unreachable | Skip tenant, continue with others |
| Rollback failure | Manual intervention required |
Observability
Metrics
| Metric | Type | Labels |
|---|---|---|
| policy_rollout_status | Gauge | rollout_id, stage |
| policy_rollout_tenant_count | Gauge | rollout_id, version |
| policy_rollout_failures_total | Counter | rollout_id, stage |
| policy_version_active | Gauge | tenant, policy_set, version |
Key Log Events
| Event | Level | Fields |
|---|---|---|
| rollout.created | INFO | rollout_id, policy_set, stages |
| rollout.stage.started | INFO | rollout_id, stage, tenants |
| rollout.stage.completed | INFO | rollout_id, stage, metrics |
| rollout.rollback | WARN | rollout_id, reason, tenants |
| rollout.completed | INFO | rollout_id, duration |
Related Flows
- Policy Evaluation Flow - Policy application
- Exception Approval Workflow - Exception handling
- Notification Flow - Rollout alerts