# Multi-Tenant Policy Rollout Flow

## Overview

The Multi-Tenant Policy Rollout Flow describes how StellaOps propagates policy changes across multiple tenants in a controlled, auditable manner. This flow supports staged rollouts, canary deployments, and rollback capabilities for enterprise policy governance.

**Business Value**: Centralized policy management with controlled rollout reduces risk of policy changes breaking production workflows while ensuring consistent security standards across the organization.

## Actors

| Actor | Type | Role |
|-------|------|------|
| Policy Admin | Human | Creates and approves policy changes |
| Platform Admin | Human | Manages cross-tenant rollouts |
| Policy Engine | Service | Evaluates and applies policies |
| Authority | Service | Manages tenant hierarchy |
| Notify | Service | Alerts on rollout status |
| Scheduler | Service | Orchestrates staged rollout |

## Prerequisites

- Multi-tenant environment configured
- Tenant hierarchy defined (org → teams → projects)
- Policy inheritance rules established
- Rollout approval workflow configured

## Tenant Hierarchy

```
Organization (acme-corp)
├── Team: Platform Engineering
│   ├── Project: core-services
│   └── Project: infrastructure
├── Team: Product Development
│   ├── Project: web-app
│   ├── Project: mobile-api
│   └── Project: data-pipeline
└── Team: Security
    └── Project: security-tools
```

## Flow Diagram

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                        Multi-Tenant Policy Rollout Flow                         │
└─────────────────────────────────────────────────────────────────────────────────┘

┌──────────┐  ┌─────────┐  ┌───────────┐  ┌──────────┐  ┌────────┐  ┌────────┐
│  Policy  │  │ Policy  │  │ Scheduler │  │ Authority│  │ Policy │  │ Notify │
│  Admin   │  │ Store   │  │           │  │          │  │ Engine │  │        │
└────┬─────┘  └────┬────┘  └─────┬─────┘  └────┬─────┘  └───┬────┘  └───┬────┘
     │             │             │             │            │           │
     │ Create      │             │             │            │           │
     │ policy v2   │             │             │            │           │
     │────────────>│             │             │            │           │
     │             │             │             │            │           │
     │             │ Store as    │             │            │           │
     │             │ draft       │             │            │           │
     │             │───┐         │             │            │           │
     │             │   │         │             │            │           │
     │             │<──┘         │             │            │           │
     │             │             │             │            │           │
     │ Define      │             │             │            │           │
     │ rollout     │             │             │            │           │
     │──────────────────────────>│             │            │           │
     │             │             │             │            │           │
     │             │             │ Get tenant  │            │           │
     │             │             │ hierarchy   │            │           │
     │             │             │────────────>│            │           │
     │             │             │             │            │           │
     │             │             │ Tenant tree │            │           │
     │             │             │<────────────│            │           │
     │             │             │             │            │           │
     │ Rollout     │             │             │            │           │
     │ plan        │             │             │            │           │
     │<──────────────────────────│             │            │           │
     │             │             │             │            │           │
     │ Approve     │             │             │            │           │
     │──────────────────────────>│             │            │           │
     │             │             │             │            │           │
     │             │             │ Stage 1:    │            │           │
     │             │             │ Canary      │            │           │
     │             │             │─────────────────────────>│           │
     │             │             │             │            │           │
     │             │             │             │            │ Apply to  │
     │             │             │             │            │ canary    │
     │             │             │             │            │ tenant    │
     │             │             │             │            │───┐       │
     │             │             │             │            │   │       │
     │             │             │             │            │<──┘       │
     │             │             │             │            │           │
     │             │             │ Monitor     │            │           │
     │             │             │ (24h)       │            │           │
     │             │             │───┐         │            │           │
     │             │             │   │         │            │           │
     │             │             │<──┘         │            │           │
     │             │             │             │            │           │
     │             │             │ Stage 2:    │            │           │
     │             │             │ 25% tenants │            │           │
     │             │             │─────────────────────────>│           │
     │             │             │             │            │           │
     │             │             │ ...         │            │           │
     │             │             │             │            │           │
     │             │             │ Stage N:    │            │           │
     │             │             │ 100%        │            │           │
     │             │             │─────────────────────────>│           │
     │             │             │             │            │           │
     │             │             │ Complete    │            │           │
     │             │             │─────────────────────────────────────>│
     │             │             │             │            │           │
     │ Rollout     │             │             │            │           │ Notify
     │ complete    │             │             │            │           │ admins
     │<─────────────────────────────────────────────────────────────────│
     │             │             │             │            │           │
```

## Step-by-Step

### 1. Policy Creation

The Policy Admin creates a new policy version:

```yaml
# Policy Set: production-v2
version: "stella-dsl@1"
name: production
version_tag: "v2.0.0"
description: "Updated production policy with KEV blocking"

changes_from_v1:
  - added: block-kev-vulnerabilities
  - modified: critical-threshold (9.0 → 8.5)
  - removed: legacy-exception-rule

rules:
  - name: block-kev-vulnerabilities
    description: Block any KEV-listed vulnerability
    condition: kev == true
    action: FAIL
    severity: critical

  - name: no-critical-reachable
    condition: |
      severity == 'critical' AND
      cvss >= 8.5 AND
      reachability IN ['SR', 'RO', 'CR']
    action: FAIL
```

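To make the two v2 rules concrete, the sketch below evaluates them against a single finding. It is a hand-written TypeScript predicate, not the stella-dsl evaluator: the `Finding` shape, the `Verdict` type, and the `evaluateV2` function are assumptions for illustration. Only the field names (`kev`, `severity`, `cvss`, `reachability`), the rule names, and the thresholds come from the policy above.

```typescript
// Hypothetical finding shape; field names mirror the rule conditions above.
interface Finding {
  kev: boolean;
  severity: 'low' | 'medium' | 'high' | 'critical';
  cvss: number;
  reachability: string; // codes such as 'SR', 'RO', 'CR' as used in the policy
}

type Verdict = { action: 'PASS' } | { action: 'FAIL'; rule: string };

// Reachability codes that the no-critical-reachable rule treats as blocking.
const BLOCKING_REACHABILITY: string[] = ['SR', 'RO', 'CR'];

// Evaluates the two production v2 rules in the order they are declared.
function evaluateV2(f: Finding): Verdict {
  if (f.kev) {
    return { action: 'FAIL', rule: 'block-kev-vulnerabilities' };
  }
  if (
    f.severity === 'critical' &&
    f.cvss >= 8.5 &&
    BLOCKING_REACHABILITY.indexOf(f.reachability) !== -1
  ) {
    return { action: 'FAIL', rule: 'no-critical-reachable' };
  }
  return { action: 'PASS' };
}
```

Under this reading, a reachable critical finding with a CVSS score of 8.7 fails v2, which is the kind of newly blocked result the lowered threshold (9.0 → 8.5) is expected to produce.
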
### 2. Rollout Plan Definition

The Platform Admin defines the rollout strategy:

```json
{
  "rollout_id": "rollout-789",
  "policy_set": "production",
  "from_version": "v1.0.0",
  "to_version": "v2.0.0",
  "strategy": "staged",
  "stages": [
    {
      "name": "canary",
      "description": "Single low-risk tenant",
      "tenants": ["platform-eng-core-services"],
      "duration": "24h",
      "success_criteria": {
        "max_new_failures": 5,
        "max_failure_rate_increase": 0.05
      },
      "auto_proceed": false
    },
    {
      "name": "early-adopters",
      "description": "25% of tenants (by scan volume)",
      "selection": {
        "method": "percentage",
        "value": 25,
        "weight_by": "scan_volume"
      },
      "duration": "48h",
      "success_criteria": {
        "max_new_failures": 20,
        "max_failure_rate_increase": 0.10
      },
      "auto_proceed": true
    },
    {
      "name": "majority",
      "description": "75% of tenants",
      "selection": {
        "method": "percentage",
        "value": 75
      },
      "duration": "24h",
      "auto_proceed": true
    },
    {
      "name": "full",
      "description": "100% of tenants",
      "selection": {
        "method": "all"
      }
    }
  ],
  "rollback": {
    "automatic": true,
    "triggers": [
      {"metric": "failure_rate_increase", "threshold": 0.20},
      {"metric": "new_critical_blocks", "threshold": 50}
    ]
  }
}
```

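A plan like the one above implies a minimum monitoring time and a set of manual gates. The sketch below derives both from the stage list. The `StageTiming` shape, `parseHours`, and `summarizePlan` are hypothetical helpers written for this example, and the parser only understands the `"<n>h"` shorthand that appears in the plan.

```typescript
// Minimal stage shape: only the fields this sketch reads from the plan above.
interface StageTiming {
  name: string;
  duration?: string; // shorthand such as "24h", as written in the plan
  auto_proceed?: boolean;
}

// Parses the "<n>h" shorthand; other units are out of scope for this sketch.
function parseHours(duration: string): number {
  const hours = Number(duration.slice(0, -1));
  if (duration.length < 2 || duration.slice(-1) !== 'h' || isNaN(hours)) {
    throw new Error('Unsupported duration: ' + duration);
  }
  return hours;
}

// Sums stage durations and lists stages that wait for a manual decision.
function summarizePlan(stages: StageTiming[]): { totalHours: number; manualGates: string[] } {
  let totalHours = 0;
  const manualGates: string[] = [];
  for (const stage of stages) {
    if (stage.duration !== undefined) {
      totalHours += parseHours(stage.duration);
    }
    if (stage.auto_proceed === false) {
      manualGates.push(stage.name);
    }
  }
  return { totalHours, manualGates };
}

// The four stages of rollout-789, reduced to the fields used here.
const rollout789Summary = summarizePlan([
  { name: 'canary', duration: '24h', auto_proceed: false },
  { name: 'early-adopters', duration: '48h', auto_proceed: true },
  { name: 'majority', duration: '24h', auto_proceed: true },
  { name: 'full' },
]);
```

For this plan the summed stage durations come to 96 hours with one manual gate after `canary`. Time spent waiting for that manual decision is not included in the sum.
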
### 3. Impact Analysis

Before approval, the system analyzes the potential impact of the new version against historical scan data:

```json
{
  "impact_analysis": {
    "rollout_id": "rollout-789",
    "analysis_date": "2024-12-29T10:00:00Z",
    "historical_data_range": "30d",
    "results": {
      "total_scans_analyzed": 15234,
      "predicted_new_failures": 127,
      "predicted_failure_rate_change": "+0.83%",
      "affected_images": 89,
      "by_team": [
        {"team": "Product Development", "new_failures": 78},
        {"team": "Platform Engineering", "new_failures": 31},
        {"team": "Security", "new_failures": 18}
      ],
      "top_triggered_rules": [
        {"rule": "block-kev-vulnerabilities", "count": 45},
        {"rule": "no-critical-reachable", "count": 82}
      ],
      "recommendation": "PROCEED_WITH_CAUTION"
    }
  }
}
```

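The headline numbers in this report are consistent with simple arithmetic over the per-team counts: 78 + 31 + 18 = 127 new failures, and 127 / 15234 ≈ 0.83%. The sketch below reproduces that calculation. It assumes `predicted_failure_rate_change` is the predicted new failures divided by the scans analyzed, which matches the example but is not confirmed as the system's actual formula; the function name is hypothetical.

```typescript
interface TeamImpact {
  team: string;
  new_failures: number;
}

// Sums per-team failures and expresses them as a signed percentage of all scans.
function predictedFailureRateChange(byTeam: TeamImpact[], totalScans: number): string {
  if (totalScans <= 0) {
    throw new Error('totalScans must be positive');
  }
  let newFailures = 0;
  for (const entry of byTeam) {
    newFailures += entry.new_failures;
  }
  const percent = (newFailures / totalScans) * 100;
  return (percent >= 0 ? '+' : '') + percent.toFixed(2) + '%';
}

// Values copied from the impact analysis above.
const rollout789Change = predictedFailureRateChange(
  [
    { team: 'Product Development', new_failures: 78 },
    { team: 'Platform Engineering', new_failures: 31 },
    { team: 'Security', new_failures: 18 },
  ],
  15234,
);
```
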
### 4. Approval and Initiation

The Policy Admin approves the rollout after reviewing the impact analysis:

```json
{
  "approval": {
    "rollout_id": "rollout-789",
    "approved_by": "policy-admin@acme.com",
    "approved_at": "2024-12-29T11:00:00Z",
    "approval_notes": "Impact acceptable. Notified affected teams.",
    "notifications_sent": [
      {"channel": "slack", "target": "#platform-engineering"},
      {"channel": "email", "target": "team-leads@acme.com"}
    ]
  }
}
```

### 5. Staged Execution

The Scheduler executes each stage in order:

#### Stage 1: Canary
```json
{
  "stage_execution": {
    "rollout_id": "rollout-789",
    "stage": "canary",
    "started_at": "2024-12-29T11:00:00Z",
    "tenants_activated": ["platform-eng-core-services"],
    "status": "monitoring"
  }
}
```

#### Stage Monitoring
```json
{
  "stage_metrics": {
    "rollout_id": "rollout-789",
    "stage": "canary",
    "monitored_period": "24h",
    "metrics": {
      "scans_evaluated": 234,
      "new_failures": 3,
      "failure_rate_before": 0.12,
      "failure_rate_after": 0.13,
      "success_criteria_met": true
    }
  }
}
```

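The `success_criteria_met` flag above can be reproduced from the stage's `success_criteria`. The sketch below is one plausible reading, not the Scheduler's actual logic: it assumes `max_failure_rate_increase` is an absolute difference between the before and after failure rates, and that a stage with `auto_proceed: false` waits for a manual decision even when the criteria are met. All type and function names are hypothetical.

```typescript
interface SuccessCriteria {
  max_new_failures: number;
  max_failure_rate_increase: number;
}

interface ObservedMetrics {
  new_failures: number;
  failure_rate_before: number;
  failure_rate_after: number;
}

// True when both limits hold; the rate increase is treated as an absolute difference.
function meetsSuccessCriteria(m: ObservedMetrics, c: SuccessCriteria): boolean {
  const increase = m.failure_rate_after - m.failure_rate_before;
  return m.new_failures <= c.max_new_failures && increase <= c.max_failure_rate_increase;
}

type StageDecision = 'proceed' | 'await_approval' | 'criteria_not_met';

// Combines the criteria check with the stage's auto_proceed setting.
function decideNextStep(met: boolean, autoProceed: boolean): StageDecision {
  if (!met) {
    return 'criteria_not_met';
  }
  return autoProceed ? 'proceed' : 'await_approval';
}

// Canary metrics and criteria copied from the examples above.
const canaryMet = meetsSuccessCriteria(
  { new_failures: 3, failure_rate_before: 0.12, failure_rate_after: 0.13 },
  { max_new_failures: 5, max_failure_rate_increase: 0.05 },
);
const canaryDecision = decideNextStep(canaryMet, false);
```

With the canary numbers (3 new failures against a limit of 5, and a rate increase of about 0.01 against a limit of 0.05), the criteria hold, and because the canary stage sets `auto_proceed` to `false` the sketch reports `await_approval`.
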
### 6. Progressive Rollout

After the canary stage succeeds, the rollout proceeds through the remaining stages:

```json
{
  "stage_progression": {
    "rollout_id": "rollout-789",
    "completed_stages": ["canary", "early-adopters", "majority"],
    "current_stage": "full",
    "tenants_on_v2": 47,
    "tenants_on_v1": 0,
    "total_rollout_duration": "96h",
    "status": "completed"
  }
}
```

### 7. Rollback (If Needed)

If a rollback trigger defined in the plan fires, the affected tenants are automatically reverted to the previous version:

```json
{
  "rollback": {
    "rollout_id": "rollout-789",
    "triggered_at": "2024-12-30T15:30:00Z",
    "trigger_reason": "failure_rate_increase exceeded 0.20 threshold",
    "rollback_stage": "early-adopters",
    "tenants_rolled_back": 12,
    "action": "reverted to v1.0.0",
    "notifications_sent": true
  }
}
```

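The rollback decision can be sketched as a comparison of observed metrics against the plan's `rollback.triggers`. The code below is an illustration under stated assumptions: a trigger fires only when the observed value is strictly greater than its threshold (matching the word "exceeded" in `trigger_reason`), and a metric with no observed value never fires. The `firedTriggers` function and the observed values are hypothetical.

```typescript
interface RollbackTrigger {
  metric: string;
  threshold: number;
}

// Returns the triggers whose observed value is strictly above the threshold.
function firedTriggers(
  triggers: RollbackTrigger[],
  observed: { [metric: string]: number },
): RollbackTrigger[] {
  return triggers.filter((trigger) => {
    const value = observed[trigger.metric];
    return value !== undefined && value > trigger.threshold;
  });
}

// Triggers copied from the rollout plan; the observed values are made up.
const planTriggers: RollbackTrigger[] = [
  { metric: 'failure_rate_increase', threshold: 0.20 },
  { metric: 'new_critical_blocks', threshold: 50 },
];
const fired = firedTriggers(planTriggers, {
  failure_rate_increase: 0.23,
  new_critical_blocks: 12,
});
const shouldRollBack = fired.length > 0;
```

Here only `failure_rate_increase` fires, which would correspond to a `trigger_reason` like the one in the example above.
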
## Rollout Strategies

### Blue-Green

```yaml
strategy: blue_green
config:
  parallel_evaluation: true  # Both versions evaluated
  comparison_period: 24h
  switch_threshold:
    verdict_match_rate: 0.95
```

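With `parallel_evaluation` enabled, each scan yields a verdict from both versions, and the switch decision depends on how often they agree. The sketch below shows one way to compute that. It assumes the match rate is the fraction of scans where both versions return the same verdict and that the switch happens when the rate is at least `verdict_match_rate`; the names are hypothetical.

```typescript
type PolicyVerdict = 'PASS' | 'FAIL';

interface VerdictPair {
  current: PolicyVerdict;   // verdict from the version in production
  candidate: PolicyVerdict; // verdict from the version under evaluation
}

// Fraction of scans on which both versions agree; 0 when there is no data.
function verdictMatchRate(pairs: VerdictPair[]): number {
  if (pairs.length === 0) {
    return 0;
  }
  let matches = 0;
  for (const pair of pairs) {
    if (pair.current === pair.candidate) {
      matches += 1;
    }
  }
  return matches / pairs.length;
}

// Switch only when agreement reaches the configured threshold.
function shouldSwitch(pairs: VerdictPair[], threshold: number): boolean {
  return verdictMatchRate(pairs) >= threshold;
}
```

Returning 0 for an empty comparison window is a deliberate choice in this sketch: with no evidence, the candidate version is not promoted.
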
### Canary with Traffic Split

```yaml
strategy: canary_traffic
config:
  initial_percentage: 5
  increment: 10
  increment_interval: 4h
  max_error_rate: 0.01
```

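The configuration above implies a ramp schedule. The sketch below expands it into explicit steps, assuming the percentage is capped at 100 and that no step is halted by `max_error_rate`. The interval is passed as a number of hours rather than parsed from `4h`, and the type and function names are hypothetical.

```typescript
interface CanaryTrafficConfig {
  initial_percentage: number;
  increment: number;
  increment_interval_hours: number; // 4 for the "4h" in the config above
}

interface ScheduleStep {
  atHour: number;
  percentage: number;
}

// Expands the config into the sequence of traffic percentages over time.
function trafficSchedule(cfg: CanaryTrafficConfig): ScheduleStep[] {
  if (cfg.increment <= 0) {
    throw new Error('increment must be positive');
  }
  let percentage = Math.min(100, cfg.initial_percentage);
  let hour = 0;
  const steps: ScheduleStep[] = [{ atHour: hour, percentage }];
  while (percentage < 100) {
    percentage = Math.min(100, percentage + cfg.increment);
    hour += cfg.increment_interval_hours;
    steps.push({ atHour: hour, percentage });
  }
  return steps;
}

const exampleSchedule = trafficSchedule({
  initial_percentage: 5,
  increment: 10,
  increment_interval_hours: 4,
});
```

For the values above, the schedule runs 5%, 15%, 25%, and so on through 95%, then caps at 100% after ten increments, which is 40 hours after the start.
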
### Feature Flag

```yaml
strategy: feature_flag
config:
  flag_name: "policy-v2-enabled"
  default: false
  overrides:
    - tenant: "security-team"
      value: true
```

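Resolving the flag for a tenant amounts to checking the overrides before falling back to the default. The sketch below assumes exact matching on the tenant name and that the first matching override wins; `FeatureFlagConfig` and `isPolicyEnabled` are hypothetical names.

```typescript
interface FeatureFlagConfig {
  flag_name: string;
  default: boolean;
  overrides: Array<{ tenant: string; value: boolean }>;
}

// Returns the override for the tenant if one exists, otherwise the default.
function isPolicyEnabled(cfg: FeatureFlagConfig, tenant: string): boolean {
  for (const override of cfg.overrides) {
    if (override.tenant === tenant) {
      return override.value;
    }
  }
  return cfg.default;
}

// Mirrors the YAML config above.
const policyV2Flag: FeatureFlagConfig = {
  flag_name: 'policy-v2-enabled',
  default: false,
  overrides: [{ tenant: 'security-team', value: true }],
};
```
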
## Data Contracts

### Rollout Plan Schema

```typescript
interface RolloutPlan {
  rollout_id: string;
  policy_set: string;
  from_version: string;
  to_version: string;
  strategy: 'staged' | 'blue_green' | 'canary_traffic' | 'feature_flag';
  stages: Stage[];
  rollback: {
    automatic: boolean;
    triggers: RollbackTrigger[];
  };
  notifications: NotificationConfig[];
}

interface Stage {
  name: string;
  description?: string;
  tenants?: string[];           // Explicit tenant list (e.g. the canary stage)
  selection?: TenantSelection;  // Rule-based tenant selection
  duration?: string;            // Duration string; examples in this document use shorthand such as "24h"
  success_criteria?: SuccessCriteria;
  auto_proceed?: boolean;
}
```

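The schema above references several types it does not define. The sketch below infers plausible shapes for them from the JSON examples earlier in this document. These are assumptions for illustration, not the published contract; in particular, `NotificationConfig` is guessed from the `notifications_sent` entries in the approval example.

```typescript
// Inferred from the "selection" objects in the rollout plan example.
interface TenantSelection {
  method: 'percentage' | 'all';
  value?: number;     // percentage of tenants, used when method is 'percentage'
  weight_by?: string; // e.g. 'scan_volume'
}

// Inferred from the "success_criteria" objects in the rollout plan example.
interface SuccessCriteria {
  max_new_failures: number;
  max_failure_rate_increase: number;
}

// Inferred from the "rollback.triggers" entries in the rollout plan example.
interface RollbackTrigger {
  metric: string;
  threshold: number;
}

// Guessed from the "notifications_sent" entries in the approval example.
interface NotificationConfig {
  channel: string; // e.g. 'slack' or 'email'
  target: string;
}

// The early-adopters selection and criteria, typed with the shapes above.
const earlyAdoptersSelection: TenantSelection = {
  method: 'percentage',
  value: 25,
  weight_by: 'scan_volume',
};
const earlyAdoptersCriteria: SuccessCriteria = {
  max_new_failures: 20,
  max_failure_rate_increase: 0.10,
};
```
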
### Rollout Status Schema

```typescript
interface RolloutStatus {
  rollout_id: string;
  status: 'pending' | 'in_progress' | 'paused' | 'completed' | 'rolled_back' | 'failed';
  current_stage?: string;
  stages: Array<{
    name: string;
    status: 'pending' | 'active' | 'monitoring' | 'completed' | 'failed';
    started_at?: string;
    completed_at?: string;
    metrics?: StageMetrics;
  }>;
  tenant_status: Array<{
    tenant_id: string;
    policy_version: string;
    activated_at?: string;
  }>;
}
```

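Counts such as `tenants_on_v2` and `tenants_on_v1` can be derived from the `tenant_status` array. The sketch below groups tenants by `policy_version`; the `countByVersion` helper is hypothetical, and the tenant IDs other than `platform-eng-core-services` are invented for the example.

```typescript
interface TenantStatus {
  tenant_id: string;
  policy_version: string;
  activated_at?: string;
}

// Counts how many tenants are currently on each policy version.
function countByVersion(statuses: TenantStatus[]): { [version: string]: number } {
  const counts: { [version: string]: number } = {};
  for (const status of statuses) {
    counts[status.policy_version] = (counts[status.policy_version] || 0) + 1;
  }
  return counts;
}

// A mid-rollout snapshot: one canary tenant on v2, two tenants still on v1.
const midRolloutCounts = countByVersion([
  { tenant_id: 'platform-eng-core-services', policy_version: 'v2.0.0', activated_at: '2024-12-29T11:00:00Z' },
  { tenant_id: 'example-tenant-a', policy_version: 'v1.0.0' },
  { tenant_id: 'example-tenant-b', policy_version: 'v1.0.0' },
]);
```
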
## Policy Inheritance

```
Organization Policy (base)
└── inherits_from: stellaops-default

Team Policy (override)
├── inherits_from: organization
└── overrides: [severity-thresholds]

Project Policy (final)
├── inherits_from: team
└── overrides: [specific-exceptions]
```

Resolution order: Project → Team → Organization → Platform Default

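The resolution order can be read as a lookup that walks from the most specific level to the least specific and stops at the first level that defines the setting. The sketch below illustrates that reading. The `PolicyLayers` shape, the `resolveSetting` function, and the setting values are hypothetical; only the level order comes from the line above.

```typescript
type Level = 'project' | 'team' | 'organization' | 'platform-default';

// Most specific level first, matching the documented resolution order.
const RESOLUTION_ORDER: Level[] = ['project', 'team', 'organization', 'platform-default'];

type PolicyLayers = { [K in Level]?: { [setting: string]: string } };

interface ResolvedSetting {
  value: string;
  source: Level;
}

// Returns the first definition found while walking the resolution order.
function resolveSetting(layers: PolicyLayers, setting: string): ResolvedSetting | undefined {
  for (const level of RESOLUTION_ORDER) {
    const layer = layers[level];
    const value = layer === undefined ? undefined : layer[setting];
    if (value !== undefined) {
      return { value, source: level };
    }
  }
  return undefined;
}

// Illustrative layers: the team overrides a threshold defined higher up.
const exampleLayers: PolicyLayers = {
  'platform-default': { 'severity-thresholds': 'critical>=9.0', 'kev-blocking': 'off' },
  organization: { 'severity-thresholds': 'critical>=8.5' },
  team: { 'severity-thresholds': 'critical>=8.0' },
};
```

With these layers, `severity-thresholds` resolves from the team level because the project defines nothing, while `kev-blocking` falls through to the platform default.
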
## Error Handling

| Error | Recovery |
|-------|----------|
| Stage timeout | Pause rollout, alert admin |
| Metrics unavailable | Use last known good, extend monitoring |
| Tenant unreachable | Skip tenant, continue with others |
| Rollback failure | Manual intervention required |

## Observability

### Metrics

| Metric | Type | Labels |
|--------|------|--------|
| `policy_rollout_status` | Gauge | `rollout_id`, `stage` |
| `policy_rollout_tenant_count` | Gauge | `rollout_id`, `version` |
| `policy_rollout_failures_total` | Counter | `rollout_id`, `stage` |
| `policy_version_active` | Gauge | `tenant`, `policy_set`, `version` |

### Key Log Events

| Event | Level | Fields |
|-------|-------|--------|
| `rollout.created` | INFO | `rollout_id`, `policy_set`, `stages` |
| `rollout.stage.started` | INFO | `rollout_id`, `stage`, `tenants` |
| `rollout.stage.completed` | INFO | `rollout_id`, `stage`, `metrics` |
| `rollout.rollback` | WARN | `rollout_id`, `reason`, `tenants` |
| `rollout.completed` | INFO | `rollout_id`, `duration` |

## Related Flows

- [Policy Evaluation Flow](04-policy-evaluation-flow.md) - Policy application
- [Exception Approval Workflow](17-exception-approval-workflow.md) - Exception handling
- [Notification Flow](05-notification-flow.md) - Rollout alerts