git.stella-ops.org/docs/flows/14-multi-tenant-policy-rollout-flow.md
StellaOps Bot ca578801fd save progress
2026-01-03 00:49:19 +02:00

Multi-Tenant Policy Rollout Flow

Overview

The Multi-Tenant Policy Rollout Flow describes how StellaOps propagates policy changes across multiple tenants in a controlled, auditable manner. This flow supports staged rollouts, canary deployments, and rollback capabilities for enterprise policy governance.

Business Value: Centralized policy management with controlled rollout reduces the risk that a policy change breaks production workflows, while keeping security standards consistent across the organization.

Actors

| Actor | Type | Role |
|-------|------|------|
| Policy Admin | Human | Creates and approves policy changes |
| Platform Admin | Human | Manages cross-tenant rollouts |
| Policy Engine | Service | Evaluates and applies policies |
| Authority | Service | Manages tenant hierarchy |
| Notify | Service | Alerts on rollout status |
| Scheduler | Service | Orchestrates staged rollout |

Prerequisites

  • Multi-tenant environment configured
  • Tenant hierarchy defined (org → teams → projects)
  • Policy inheritance rules established
  • Rollout approval workflow configured

Tenant Hierarchy

Organization (acme-corp)
├── Team: Platform Engineering
│   ├── Project: core-services
│   └── Project: infrastructure
├── Team: Product Development
│   ├── Project: web-app
│   ├── Project: mobile-api
│   └── Project: data-pipeline
└── Team: Security
    └── Project: security-tools
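
The hierarchy above can be modelled as a recursive tree whose leaves are the projects a rollout eventually targets. The following is a minimal sketch, not the Authority service's actual data model: the `TenantNode` shape and the `listProjectPaths` helper are assumptions introduced for illustration.

```typescript
// Hypothetical node shape for the tenant tree; field names are assumptions.
interface TenantNode {
  kind: "organization" | "team" | "project";
  label: string;
  children?: TenantNode[];
}

// The example hierarchy from this section.
const acmeCorp: TenantNode = {
  kind: "organization",
  label: "acme-corp",
  children: [
    {
      kind: "team",
      label: "Platform Engineering",
      children: [
        { kind: "project", label: "core-services" },
        { kind: "project", label: "infrastructure" },
      ],
    },
    {
      kind: "team",
      label: "Product Development",
      children: [
        { kind: "project", label: "web-app" },
        { kind: "project", label: "mobile-api" },
        { kind: "project", label: "data-pipeline" },
      ],
    },
    {
      kind: "team",
      label: "Security",
      children: [{ kind: "project", label: "security-tools" }],
    },
  ],
};

// Depth-first walk that returns the "org / team / project" path of every project.
function listProjectPaths(node: TenantNode, parents: string[] = []): string[] {
  const path = parents.concat(node.label);
  if (node.kind === "project") {
    return [path.join(" / ")];
  }
  let found: string[] = [];
  for (const child of node.children ?? []) {
    found = found.concat(listProjectPaths(child, path));
  }
  return found;
}

const projectPaths = listProjectPaths(acmeCorp);
console.log(projectPaths.length); // 6 projects in the example tree
```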

Flow Diagram

┌─────────────────────────────────────────────────────────────────────────────────┐
│                     Multi-Tenant Policy Rollout Flow                             │
└─────────────────────────────────────────────────────────────────────────────────┘

┌──────────┐  ┌─────────┐  ┌───────────┐  ┌──────────┐  ┌────────┐  ┌────────┐
│  Policy  │  │ Policy  │  │ Scheduler │  │ Authority│  │ Policy │  │ Notify │
│  Admin   │  │ Store   │  │           │  │          │  │ Engine │  │        │
└────┬─────┘  └────┬────┘  └─────┬─────┘  └────┬─────┘  └───┬────┘  └───┬────┘
     │             │             │             │            │           │
     │ Create      │             │             │            │           │
     │ policy v2   │             │             │            │           │
     │────────────>│             │             │            │           │
     │             │             │             │            │           │
     │             │ Store as    │             │            │           │
     │             │ draft       │             │            │           │
     │             │───┐         │             │            │           │
     │             │   │         │             │            │           │
     │             │<──┘         │             │            │           │
     │             │             │             │            │           │
     │ Define      │             │             │            │           │
     │ rollout     │             │             │            │           │
     │────────────────────────────>            │            │           │
     │             │             │             │            │           │
     │             │             │ Get tenant  │            │           │
     │             │             │ hierarchy   │            │           │
     │             │             │────────────>│            │           │
     │             │             │             │            │           │
     │             │             │ Tenant tree │            │           │
     │             │             │<────────────│            │           │
     │             │             │             │            │           │
     │ Rollout     │             │             │            │           │
     │ plan        │             │             │            │           │
     │<────────────────────────────            │            │           │
     │             │             │             │            │           │
     │ Approve     │             │             │            │           │
     │────────────────────────────>            │            │           │
     │             │             │             │            │           │
     │             │             │ Stage 1:    │            │           │
     │             │             │ Canary      │            │           │
     │             │             │─────────────────────────>│           │
     │             │             │             │            │           │
     │             │             │             │            │ Apply to  │
     │             │             │             │            │ canary    │
     │             │             │             │            │ tenant    │
     │             │             │             │            │───┐       │
     │             │             │             │            │   │       │
     │             │             │             │            │<──┘       │
     │             │             │             │            │           │
     │             │             │ Monitor     │            │           │
     │             │             │ (24h)       │            │           │
     │             │             │───┐         │            │           │
     │             │             │   │         │            │           │
     │             │             │<──┘         │            │           │
     │             │             │             │            │           │
     │             │             │ Stage 2:    │            │           │
     │             │             │ 25% tenants │            │           │
     │             │             │─────────────────────────>│           │
     │             │             │             │            │           │
     │             │             │ ...         │            │           │
     │             │             │             │            │           │
     │             │             │ Stage N:    │            │           │
     │             │             │ 100%        │            │           │
     │             │             │─────────────────────────>│           │
     │             │             │             │            │           │
     │             │             │ Complete    │            │           │
     │             │             │───────────────────────────────────────>
     │             │             │             │            │           │
     │ Rollout     │             │             │            │           │ Notify
     │ complete    │             │             │            │           │ admins
     │<────────────────────────────────────────────────────────────────────
     │             │             │             │            │           │

Step-by-Step

1. Policy Creation

The Policy Admin creates a new policy version:

# Policy Set: production-v2
version: "stella-dsl@1"
name: production
version_tag: "v2.0.0"
description: "Updated production policy with KEV blocking"

changes_from_v1:
  - added: block-kev-vulnerabilities
  - modified: critical-threshold (9.0 → 8.5)
  - removed: legacy-exception-rule

rules:
  - name: block-kev-vulnerabilities
    description: Block any KEV-listed vulnerability
    condition: kev == true
    action: FAIL
    severity: critical

  - name: no-critical-reachable
    condition: |
      severity == 'critical' AND
      cvss >= 8.5 AND
      reachability IN ['SR', 'RO', 'CR']
    action: FAIL
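
To make the effect of the two rules concrete, here is a minimal sketch that hand-translates their conditions into predicates. It does not parse stella-dsl; the `Finding` shape, the `evaluate` helper, and the sample values are assumptions for illustration only.

```typescript
// Hypothetical finding shape; field names mirror the identifiers used in the rule conditions.
interface Finding {
  kev: boolean;
  severity: string;
  cvss: number;
  reachability: string;
}

interface Rule {
  name: string;
  matches: (f: Finding) => boolean;
}

// Hand-written equivalents of the two v2 rule conditions (not a DSL interpreter).
const rules: Rule[] = [
  { name: "block-kev-vulnerabilities", matches: (f) => f.kev === true },
  {
    name: "no-critical-reachable",
    matches: (f) =>
      f.severity === "critical" &&
      f.cvss >= 8.5 &&
      ["SR", "RO", "CR"].indexOf(f.reachability) !== -1,
  },
];

// A finding FAILs if any rule matches; the names of matching rules are returned for reporting.
function evaluate(f: Finding): { verdict: "PASS" | "FAIL"; triggered: string[] } {
  const triggered = rules.filter((r) => r.matches(f)).map((r) => r.name);
  return { verdict: triggered.length > 0 ? "FAIL" : "PASS", triggered };
}

// CVSS 8.7 sits between the old 9.0 threshold and the new 8.5 threshold.
const borderline = evaluate({ kev: false, severity: "critical", cvss: 8.7, reachability: "SR" });
// "XX" is a placeholder reachability value outside the rule's list.
const kevOnly = evaluate({ kev: true, severity: "medium", cvss: 5.0, reachability: "XX" });
console.log(borderline.verdict, kevOnly.verdict);
```

The first sample shows the kind of finding that the lowered `critical-threshold` newly blocks; the second fails only because of the added KEV rule.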

2. Rollout Plan Definition

The Platform Admin defines the rollout strategy:

{
  "rollout_id": "rollout-789",
  "policy_set": "production",
  "from_version": "v1.0.0",
  "to_version": "v2.0.0",
  "strategy": "staged",
  "stages": [
    {
      "name": "canary",
      "description": "Single low-risk tenant",
      "tenants": ["platform-eng-core-services"],
      "duration": "24h",
      "success_criteria": {
        "max_new_failures": 5,
        "max_failure_rate_increase": 0.05
      },
      "auto_proceed": false
    },
    {
      "name": "early-adopters",
      "description": "25% of tenants (by scan volume)",
      "selection": {
        "method": "percentage",
        "value": 25,
        "weight_by": "scan_volume"
      },
      "duration": "48h",
      "success_criteria": {
        "max_new_failures": 20,
        "max_failure_rate_increase": 0.10
      },
      "auto_proceed": true
    },
    {
      "name": "majority",
      "description": "75% of tenants",
      "selection": {
        "method": "percentage",
        "value": 75
      },
      "duration": "24h",
      "auto_proceed": true
    },
    {
      "name": "full",
      "description": "100% of tenants",
      "selection": {
        "method": "all"
      }
    }
  ],
  "rollback": {
    "automatic": true,
    "triggers": [
      {"metric": "failure_rate_increase", "threshold": 0.20},
      {"metric": "new_critical_blocks", "threshold": 50}
    ]
  }
}
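
The `percentage` selection method leaves room for interpretation. The sketch below shows one possible reading, assuming that tenants are ranked by scan volume (highest first) and that the percentage applies to the tenant count, rounded up. The ranking direction, the rounding, the `selectByPercentage` helper, and the sample tenant IDs are all assumptions, not documented Scheduler behavior.

```typescript
// Hypothetical input: one entry per tenant with its recent scan volume.
interface TenantVolume {
  tenantId: string;
  scanVolume: number;
}

// Rank tenants by scan volume (descending, ties broken by ID) and take the top N percent.
function selectByPercentage(tenants: TenantVolume[], percent: number): string[] {
  if (percent < 0 || percent > 100) {
    throw new RangeError("percent must be between 0 and 100");
  }
  const count = Math.ceil((tenants.length * percent) / 100);
  return tenants
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.scanVolume - a.scanVolume || a.tenantId.localeCompare(b.tenantId))
    .slice(0, count)
    .map((t) => t.tenantId);
}

// Placeholder tenants, not taken from the example organization.
const sampleTenants: TenantVolume[] = [
  { tenantId: "tenant-a", scanVolume: 120 },
  { tenantId: "tenant-b", scanVolume: 900 },
  { tenantId: "tenant-c", scanVolume: 450 },
  { tenantId: "tenant-d", scanVolume: 60 },
];

const earlyAdopters = selectByPercentage(sampleTenants, 25);
const majority = selectByPercentage(sampleTenants, 75);
console.log(earlyAdopters, majority);
```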

3. Impact Analysis

Before approval, the system analyzes the potential impact of the new version against historical scan data:

{
  "impact_analysis": {
    "rollout_id": "rollout-789",
    "analysis_date": "2024-12-29T10:00:00Z",
    "historical_data_range": "30d",
    "results": {
      "total_scans_analyzed": 15234,
      "predicted_new_failures": 127,
      "predicted_failure_rate_change": "+0.83%",
      "affected_images": 89,
      "by_team": [
        {"team": "Product Development", "new_failures": 78},
        {"team": "Platform Engineering", "new_failures": 31},
        {"team": "Security", "new_failures": 18}
      ],
      "top_triggered_rules": [
        {"rule": "block-kev-vulnerabilities", "count": 45},
        {"rule": "no-critical-reachable", "count": 82}
      ],
      "recommendation": "PROCEED_WITH_CAUTION"
    }
  }
}
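
The figures in this report are internally consistent, which the following sketch checks. It assumes that `predicted_failure_rate_change` is the predicted new failures divided by the scans analyzed; the report does not state the formula, but the numbers agree with that reading. The helper names are hypothetical.

```typescript
// Values copied from the impact analysis example.
const impact = {
  total_scans_analyzed: 15234,
  predicted_new_failures: 127,
  by_team: [
    { team: "Product Development", new_failures: 78 },
    { team: "Platform Engineering", new_failures: 31 },
    { team: "Security", new_failures: 18 },
  ],
  top_triggered_rules: [
    { rule: "block-kev-vulnerabilities", count: 45 },
    { rule: "no-critical-reachable", count: 82 },
  ],
};

// Sum a numeric field over a list.
function sumBy<T>(items: T[], pick: (item: T) => number): number {
  return items.reduce((acc, item) => acc + pick(item), 0);
}

// Format new failures as a signed percentage of all analyzed scans (assumed formula).
function formatRateChange(newFailures: number, totalScans: number): string {
  const pct = (newFailures / totalScans) * 100;
  return `${pct >= 0 ? "+" : ""}${pct.toFixed(2)}%`;
}

const teamTotal = sumBy(impact.by_team, (t) => t.new_failures); // 78 + 31 + 18 = 127
const ruleTotal = sumBy(impact.top_triggered_rules, (r) => r.count); // 45 + 82 = 127
const rateChange = formatRateChange(impact.predicted_new_failures, impact.total_scans_analyzed);
console.log(teamTotal, ruleTotal, rateChange); // 127 127 "+0.83%"
```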

4. Approval and Initiation

The Policy Admin approves the rollout after reviewing the impact analysis:

{
  "approval": {
    "rollout_id": "rollout-789",
    "approved_by": "policy-admin@acme.com",
    "approved_at": "2024-12-29T11:00:00Z",
    "approval_notes": "Impact acceptable. Notified affected teams.",
    "notifications_sent": [
      {"channel": "slack", "target": "#platform-engineering"},
      {"channel": "email", "target": "team-leads@acme.com"}
    ]
  }
}

5. Staged Execution

The Scheduler executes each stage in order:

Stage 1: Canary

{
  "stage_execution": {
    "rollout_id": "rollout-789",
    "stage": "canary",
    "started_at": "2024-12-29T11:00:00Z",
    "tenants_activated": ["platform-eng-core-services"],
    "status": "monitoring"
  }
}

Stage Monitoring

{
  "stage_metrics": {
    "rollout_id": "rollout-789",
    "stage": "canary",
    "monitored_period": "24h",
    "metrics": {
      "scans_evaluated": 234,
      "new_failures": 3,
      "failure_rate_before": 0.12,
      "failure_rate_after": 0.13,
      "success_criteria_met": true
    }
  }
}
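
The `success_criteria_met` flag can be reproduced from the other fields. This sketch assumes that `max_failure_rate_increase` is an absolute difference between the two rates (0.13 - 0.12 = 0.01), which is the reading consistent with the canary passing its 0.05 limit; a relative reading (0.01 / 0.12, about 0.083) would not pass. The `meetsCriteria` helper is hypothetical.

```typescript
interface SuccessCriteria {
  max_new_failures: number;
  max_failure_rate_increase: number;
}

interface ObservedMetrics {
  new_failures: number;
  failure_rate_before: number;
  failure_rate_after: number;
}

// Both limits must hold; the rate increase is treated as an absolute difference (assumption).
function meetsCriteria(observed: ObservedMetrics, criteria: SuccessCriteria): boolean {
  const increase = observed.failure_rate_after - observed.failure_rate_before;
  return (
    observed.new_failures <= criteria.max_new_failures &&
    increase <= criteria.max_failure_rate_increase
  );
}

// Canary limits from the rollout plan and metrics from the monitoring example.
const canaryCriteria: SuccessCriteria = { max_new_failures: 5, max_failure_rate_increase: 0.05 };
const canaryObserved: ObservedMetrics = {
  new_failures: 3,
  failure_rate_before: 0.12,
  failure_rate_after: 0.13,
};

const canaryPassed = meetsCriteria(canaryObserved, canaryCriteria);
console.log(canaryPassed); // true
```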

6. Progressive Rollout

After the canary stage succeeds, the rollout proceeds through the remaining stages:

{
  "stage_progression": {
    "rollout_id": "rollout-789",
    "completed_stages": ["canary", "early-adopters", "majority"],
    "current_stage": "full",
    "tenants_on_v2": 47,
    "tenants_on_v1": 0,
    "total_rollout_duration": "96h",
    "status": "completed"
  }
}
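
The `total_rollout_duration` of 96h matches the sum of the monitored stage durations in the plan (24h canary, 48h early-adopters, 24h majority; the final stage has no duration). A small sketch of that arithmetic, assuming durations are always whole hours written as `<n>h`; the `parseHours` helper is hypothetical.

```typescript
// Parse a duration such as "24h" into a number of hours; other formats are rejected.
function parseHours(duration: string): number {
  const match = /^(\d+)h$/.exec(duration);
  if (match === null) {
    throw new Error(`unsupported duration: ${duration}`);
  }
  return Number(match[1]);
}

// Durations of the canary, early-adopters, and majority stages from the rollout plan.
const stageDurations = ["24h", "48h", "24h"];
const totalHours = stageDurations.map(parseHours).reduce((sum, h) => sum + h, 0);
console.log(`${totalHours}h`); // "96h"
```

This total excludes any time spent waiting for manual approval after the canary stage, where `auto_proceed` is `false`.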

7. Rollback (If Needed)

If a rollback trigger threshold from the rollout plan is exceeded during a stage, the affected tenants are reverted automatically:

{
  "rollback": {
    "rollout_id": "rollout-789",
    "triggered_at": "2024-12-30T15:30:00Z",
    "trigger_reason": "failure_rate_increase exceeded 0.20 threshold",
    "rollback_stage": "early-adopters",
    "tenants_rolled_back": 12,
    "action": "reverted to v1.0.0",
    "notifications_sent": true
  }
}
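
The trigger check itself is a threshold comparison per metric. The sketch below assumes a trigger fires when the observed value is strictly greater than its threshold (matching the word "exceeded" in `trigger_reason`) and that metrics with no observation are ignored. The `firedTriggers` helper and the observed values are hypothetical.

```typescript
interface RollbackTrigger {
  metric: string;
  threshold: number;
}

// Return every trigger whose observed metric exceeds its threshold.
function firedTriggers(
  triggers: RollbackTrigger[],
  observed: { [metric: string]: number },
): RollbackTrigger[] {
  return triggers.filter((t) => {
    const value = observed[t.metric];
    return value !== undefined && value > t.threshold;
  });
}

// Triggers from the rollout plan.
const planTriggers: RollbackTrigger[] = [
  { metric: "failure_rate_increase", threshold: 0.2 },
  { metric: "new_critical_blocks", threshold: 50 },
];

// Illustrative observations for the early-adopters stage (not from the source).
const observedMetrics = { failure_rate_increase: 0.23, new_critical_blocks: 12 };

const fired = firedTriggers(planTriggers, observedMetrics);
const shouldRollBack = fired.length > 0;
console.log(shouldRollBack, fired.map((t) => t.metric));
```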

Rollout Strategies

Blue-Green

strategy: blue_green
config:
  parallel_evaluation: true  # Both versions evaluated
  comparison_period: 24h
  switch_threshold:
    verdict_match_rate: 0.95
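
With `parallel_evaluation` enabled, both versions produce a verdict for each scan, and the switch decision depends on how often they agree. A minimal sketch of that comparison, assuming the switch happens when the match rate is at least the threshold; the comparison operator, the type names, and the sample data are assumptions.

```typescript
type Verdict = "PASS" | "FAIL";

// One scan evaluated by both policy versions.
interface VerdictPair {
  current: Verdict;
  candidate: Verdict;
}

// Fraction of scans on which both versions reached the same verdict.
function verdictMatchRate(pairs: VerdictPair[]): number {
  if (pairs.length === 0) {
    return 0;
  }
  const matches = pairs.filter((p) => p.current === p.candidate).length;
  return matches / pairs.length;
}

function shouldSwitch(pairs: VerdictPair[], threshold: number): boolean {
  return verdictMatchRate(pairs) >= threshold;
}

// 20 illustrative scans, one of which the candidate version newly fails.
const comparison: VerdictPair[] = [];
for (let i = 0; i < 20; i++) {
  comparison.push({ current: "PASS", candidate: i === 0 ? "FAIL" : "PASS" });
}

const matchRate = verdictMatchRate(comparison); // 19 / 20
const switchNow = shouldSwitch(comparison, 0.95);
console.log(matchRate, switchNow);
```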

Canary with Traffic Split

strategy: canary_traffic
config:
  initial_percentage: 5
  increment: 10
  increment_interval: 4h
  max_error_rate: 0.01
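
The configuration above implies a ramp from 5% upward in 10-point steps every 4 hours. The sketch below expands that into a schedule, assuming the last step is capped at 100% and that no step is paused by `max_error_rate`; the `trafficSchedule` helper is hypothetical.

```typescript
interface TrafficStep {
  atHour: number;
  percent: number;
}

// Expand initial/increment/interval into the full ramp, capping the final step at 100%.
function trafficSchedule(initial: number, increment: number, intervalHours: number): TrafficStep[] {
  if (increment <= 0) {
    throw new RangeError("increment must be positive");
  }
  const steps: TrafficStep[] = [];
  let percent = Math.min(100, initial);
  let hour = 0;
  while (true) {
    steps.push({ atHour: hour, percent });
    if (percent >= 100) {
      break;
    }
    percent = Math.min(100, percent + increment);
    hour += intervalHours;
  }
  return steps;
}

const ramp = trafficSchedule(5, 10, 4);
console.log(ramp.map((s) => `${s.atHour}h:${s.percent}%`).join(" "));
```

Under these assumptions the ramp has 11 steps and reaches 100% after 40 hours.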

Feature Flag

strategy: feature_flag
config:
  flag_name: "policy-v2-enabled"
  default: false
  overrides:
    - tenant: "security-team"
      value: true
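
Flag resolution for a tenant can be sketched as "first matching override, else the default". The `FlagConfig` shape mirrors the YAML above; the `isEnabled` helper and the exact-match lookup on tenant name are assumptions.

```typescript
interface FlagConfig {
  flag_name: string;
  default: boolean;
  overrides: Array<{ tenant: string; value: boolean }>;
}

// Use the tenant's override if one exists, otherwise fall back to the default.
function isEnabled(config: FlagConfig, tenant: string): boolean {
  for (const override of config.overrides) {
    if (override.tenant === tenant) {
      return override.value;
    }
  }
  return config.default;
}

// The configuration from this section.
const policyV2Flag: FlagConfig = {
  flag_name: "policy-v2-enabled",
  default: false,
  overrides: [{ tenant: "security-team", value: true }],
};

console.log(isEnabled(policyV2Flag, "security-team")); // true
console.log(isEnabled(policyV2Flag, "web-app")); // false
```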

Data Contracts

Rollout Plan Schema

interface RolloutPlan {
  rollout_id: string;
  policy_set: string;
  from_version: string;
  to_version: string;
  strategy: 'staged' | 'blue_green' | 'canary_traffic' | 'feature_flag';
  stages: Stage[];
  rollback: {
    automatic: boolean;
    triggers: RollbackTrigger[];
  };
  notifications: NotificationConfig[];
}

interface Stage {
  name: string;
  description?: string;
  tenants?: string[];
  selection?: TenantSelection;
  duration?: string;  // e.g. "24h"; the examples in this document use this shorthand rather than ISO-8601 ("PT24H")
  success_criteria?: SuccessCriteria;
  auto_proceed?: boolean;
}

Rollout Status Schema

interface RolloutStatus {
  rollout_id: string;
  status: 'pending' | 'in_progress' | 'paused' | 'completed' | 'rolled_back' | 'failed';
  current_stage?: string;
  stages: Array<{
    name: string;
    status: 'pending' | 'active' | 'monitoring' | 'completed' | 'failed';
    started_at?: string;
    completed_at?: string;
    metrics?: StageMetrics;
  }>;
  tenant_status: Array<{
    tenant_id: string;
    policy_version: string;
    activated_at?: string;
  }>;
}
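
The schemas above reference `TenantSelection`, `SuccessCriteria`, `RollbackTrigger`, `NotificationConfig`, and `StageMetrics` without defining them. The following definitions are a sketch inferred from the JSON examples in this document, not the authoritative contracts; in particular, `NotificationConfig` is guessed from the `notifications_sent` entries in the approval record.

```typescript
// Inferred from the "selection" objects in the rollout plan.
interface TenantSelection {
  method: "percentage" | "all";
  value?: number; // percentage of tenants, used when method is "percentage"
  weight_by?: string; // e.g. "scan_volume"
}

// Inferred from the "success_criteria" objects in the rollout plan.
interface SuccessCriteria {
  max_new_failures: number;
  max_failure_rate_increase: number;
}

// Inferred from the "rollback.triggers" entries in the rollout plan.
interface RollbackTrigger {
  metric: string;
  threshold: number;
}

// Assumed shape, based on the channel/target pairs in the approval record.
interface NotificationConfig {
  channel: string;
  target: string;
}

// Inferred from the "metrics" object in the stage monitoring example.
interface StageMetrics {
  scans_evaluated: number;
  new_failures: number;
  failure_rate_before: number;
  failure_rate_after: number;
  success_criteria_met: boolean;
}

// Example values from this document, type-checked against the sketched contracts.
const earlyAdopterSelection: TenantSelection = {
  method: "percentage",
  value: 25,
  weight_by: "scan_volume",
};

const canaryStageMetrics: StageMetrics = {
  scans_evaluated: 234,
  new_failures: 3,
  failure_rate_before: 0.12,
  failure_rate_after: 0.13,
  success_criteria_met: true,
};

const planRollbackTriggers: RollbackTrigger[] = [
  { metric: "failure_rate_increase", threshold: 0.2 },
  { metric: "new_critical_blocks", threshold: 50 },
];

console.log(earlyAdopterSelection.method, canaryStageMetrics.new_failures, planRollbackTriggers.length);
```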

Policy Inheritance

Organization Policy (base)
    └── inherits_from: stellaops-default

Team Policy (override)
    └── inherits_from: organization
    └── overrides: [severity-thresholds]

Project Policy (final)
    └── inherits_from: team
    └── overrides: [specific-exceptions]

Resolution order: Project → Team → Organization → Platform Default
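
The resolution order can be read as "the most specific level that defines a setting wins". A minimal sketch of that lookup follows; the `PolicyLayer` shape, the `resolveSetting` helper, and the setting names and values are hypothetical and only illustrate the ordering.

```typescript
interface PolicyLayer {
  level: "project" | "team" | "organization" | "platform-default";
  settings: { [key: string]: number };
}

// Walk the layers from most to least specific and return the first value found.
function resolveSetting(
  key: string,
  layers: PolicyLayer[],
): { value: number; from: string } | undefined {
  for (const layer of layers) {
    const value = layer.settings[key];
    if (value !== undefined) {
      return { value, from: layer.level };
    }
  }
  return undefined;
}

// Layers listed in resolution order: Project, Team, Organization, Platform Default.
const policyLayers: PolicyLayer[] = [
  { level: "project", settings: {} },
  { level: "team", settings: { critical_threshold: 8.0 } },
  { level: "organization", settings: { critical_threshold: 8.5, max_age_days: 30 } },
  {
    level: "platform-default",
    settings: { critical_threshold: 9.0, max_age_days: 90, grace_period_days: 7 },
  },
];

console.log(resolveSetting("critical_threshold", policyLayers)); // team override wins
console.log(resolveSetting("grace_period_days", policyLayers)); // falls through to the default
```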

Error Handling

| Error | Recovery |
|-------|----------|
| Stage timeout | Pause rollout, alert admin |
| Metrics unavailable | Use last known good, extend monitoring |
| Tenant unreachable | Skip tenant, continue with others |
| Rollback failure | Manual intervention required |

Observability

Metrics

| Metric | Type | Labels |
|--------|------|--------|
| policy_rollout_status | Gauge | rollout_id, stage |
| policy_rollout_tenant_count | Gauge | rollout_id, version |
| policy_rollout_failures_total | Counter | rollout_id, stage |
| policy_version_active | Gauge | tenant, policy_set, version |

Key Log Events

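| Event | Level | Fields |
|-------|-------|--------|
| rollout.created | INFO | rollout_id, policy_set, stages |
| rollout.stage.started | INFO | rollout_id, stage, tenants |
| rollout.stage.completed | INFO | rollout_id, stage, metrics |
| rollout.rollback | WARN | rollout_id, reason, tenants |
| rollout.completed | INFO | rollout_id, duration |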
rollout.completed INFO rollout_id, duration