git.stella-ops.org/docs/modules/release-orchestrator/enhancements/drift-remediation.md
2026-01-17 21:32:08 +02:00

Drift Remediation Automation

Overview

Drift Remediation Automation extends the existing drift detection system with intelligent, policy-driven automatic remediation. While drift detection identifies divergence between expected and actual state, remediation automation closes the loop by taking corrective action without manual intervention.

The design balances automation with safety through configurable remediation strategies, severity-based prioritization, and comprehensive audit trails.


Design Principles

  1. Safety First: Auto-remediation never executes without explicit policy authorization
  2. Gradual Escalation: Start with notifications, escalate to remediation based on drift age/severity
  3. Deterministic Actions: Remediation produces identical outcomes for identical drift states
  4. Full Auditability: Every remediation action generates signed evidence packets
  5. Blast Radius Control: Limit concurrent remediations; prevent cascading failures
  6. Human Override: Operators can pause, cancel, or override any remediation

Architecture

Component Overview

┌─────────────────────────────────────────────────────────────────────┐
│                     Drift Remediation System                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────┐    ┌──────────────────┐    ┌───────────────┐  │
│  │ DriftDetector   │───▶│ RemediationEngine│───▶│ ActionExecutor│  │
│  │ (existing)      │    │                  │    │               │  │
│  └─────────────────┘    └──────────────────┘    └───────────────┘  │
│           │                      │                      │          │
│           ▼                      ▼                      ▼          │
│  ┌─────────────────┐    ┌──────────────────┐    ┌───────────────┐  │
│  │ SeverityScorer  │    │ PolicyEvaluator  │    │ EvidenceWriter│  │
│  │                 │    │                  │    │               │  │
│  └─────────────────┘    └──────────────────┘    └───────────────┘  │
│           │                      │                      │          │
│           ▼                      ▼                      ▼          │
│  ┌─────────────────┐    ┌──────────────────┐    ┌───────────────┐  │
│  │ AlertRouter     │    │ ReconcileScheduler│   │ MetricsEmitter│  │
│  │                 │    │                  │    │               │  │
│  └─────────────────┘    └──────────────────┘    └───────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Key Components

1. SeverityScorer

Calculates drift severity based on multiple weighted factors:

public sealed record DriftSeverity
{
    public DriftSeverityLevel Level { get; init; }      // Critical, High, Medium, Low, Info
    public int Score { get; init; }                     // 0-100 numeric score
    public ImmutableArray<SeverityFactor> Factors { get; init; }
    public TimeSpan DriftAge { get; init; }
    public bool RequiresImmediate { get; init; }
}

public enum DriftSeverityLevel
{
    Info = 0,       // Cosmetic differences (labels, annotations)
    Low = 25,       // Non-critical drift (resource limits changed)
    Medium = 50,    // Functional drift (ports, volumes)
    High = 75,      // Security drift (image digest mismatch)
    Critical = 100  // Severe drift (container missing, wrong image)
}

Severity Factors:

| Factor | Weight | Description |
|---|---|---|
| Drift Type | 30% | Missing > Digest Mismatch > Status Mismatch > Unexpected |
| Drift Age | 25% | Older drift = higher severity |
| Environment Criticality | 20% | Production > Staging > Development |
| Component Criticality | 15% | Core services weighted higher |
| Blast Radius | 10% | Number of dependent services affected |
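
As a worked illustration of how the weights combine, the sketch below multiplies each factor's 0-100 sub-score by its weight and truncates the sum. The sub-scores in the example (Missing = 100, drift aged 30 to 60 minutes = 50, Production = 100, default component criticality = 50, 3 to 9 dependents = 60) are taken from the Severity Scoring Algorithm section. The names `SEVERITY_WEIGHTS` and `combineSeverityScore` are hypothetical and are not part of the service.

```typescript
// Weights copied from the Severity Factors table; they sum to 1.0.
const SEVERITY_WEIGHTS = {
  driftType: 0.30,
  driftAge: 0.25,
  environmentCriticality: 0.20,
  componentCriticality: 0.15,
  blastRadius: 0.10,
} as const;

type FactorName = keyof typeof SEVERITY_WEIGHTS;
type FactorScores = Record<FactorName, number>; // each sub-score is 0-100

// Combine per-factor sub-scores into one 0-100 score, truncated toward zero
// (mirrors the `(int)score` cast in the C# scorer).
function combineSeverityScore(scores: FactorScores): number {
  let total = 0;
  for (const factor of Object.keys(SEVERITY_WEIGHTS) as FactorName[]) {
    const sub = scores[factor];
    if (sub < 0 || sub > 100) {
      throw new RangeError(`${factor} sub-score must be within 0-100, got ${sub}`);
    }
    total += sub * SEVERITY_WEIGHTS[factor];
  }
  return Math.trunc(total);
}

// Missing container, 45 minutes old, production, default component, 5 dependents:
// 100*0.30 + 50*0.25 + 100*0.20 + 50*0.15 + 60*0.10 = 30 + 12.5 + 20 + 7.5 + 6
const exampleScore = combineSeverityScore({
  driftType: 100,
  driftAge: 50,
  environmentCriticality: 100,
  componentCriticality: 50,
  blastRadius: 60,
});
console.log(exampleScore); // 76
```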

2. RemediationPolicy

Defines when and how to remediate drift:

public sealed record RemediationPolicy
{
    public Guid Id { get; init; }
    public string Name { get; init; }
    public Guid EnvironmentId { get; init; }

    // Triggers
    public RemediationTrigger Trigger { get; init; }
    public DriftSeverityLevel MinimumSeverity { get; init; }
    public TimeSpan MinimumDriftAge { get; init; }
    public TimeSpan MaximumDriftAge { get; init; }  // Escalate to manual if exceeded

    // Actions
    public RemediationAction Action { get; init; }
    public RemediationStrategy Strategy { get; init; }

    // Safety limits
    public int MaxConcurrentRemediations { get; init; }
    public int MaxRemediationsPerHour { get; init; }
    public TimeSpan CooldownPeriod { get; init; }

    // Schedule
    public RemediationWindow? MaintenanceWindow { get; init; }
    public ImmutableArray<DayOfWeek> AllowedDays { get; init; }
    public TimeOnly AllowedStartTime { get; init; }
    public TimeOnly AllowedEndTime { get; init; }

    // Notifications
    public NotificationConfig Notifications { get; init; }
}

public enum RemediationTrigger
{
    Immediate,          // Remediate as soon as detected
    Scheduled,          // Wait for maintenance window
    AgeThreshold,       // Remediate once drift age exceeds the policy threshold
    SeverityEscalation, // Remediate when severity increases
    Manual              // Notification only, human initiates
}

public enum RemediationAction
{
    NotifyOnly,         // Alert but don't act
    Reconcile,          // Restore to expected state
    Rollback,           // Rollback to previous known-good release
    Scale,              // Adjust replica count
    Restart,            // Restart containers
    Quarantine          // Isolate drifted targets from traffic
}

public enum RemediationStrategy
{
    AllAtOnce,          // Remediate all drifted targets simultaneously
    Rolling,            // Remediate one at a time with health checks
    Canary,             // Remediate one, verify, then proceed
    BlueGreen           // Deploy to standby, switch traffic
}

3. RemediationEngine

Orchestrates the remediation process:

public sealed class RemediationEngine
{
    public async Task<RemediationPlan> CreatePlanAsync(
        DriftReport driftReport,
        RemediationPolicy policy,
        CancellationToken ct)
    {
        // 1. Score severity for each drift item
        var scoredDrifts = await _severityScorer.ScoreAsync(driftReport.Items, ct);

        // 2. Filter by policy thresholds
        var actionable = scoredDrifts
            .Where(d => d.Severity.Level >= policy.MinimumSeverity)
            .Where(d => d.Severity.DriftAge >= policy.MinimumDriftAge)
            .ToImmutableArray();

        // 3. Check maintenance window
        if (!IsWithinMaintenanceWindow(policy))
            return RemediationPlan.Deferred(actionable, policy.MaintenanceWindow);

        // 4. Check rate limits
        var allowed = await CheckRateLimitsAsync(actionable, policy, ct);

        // 5. Build execution plan
        return BuildExecutionPlan(allowed, policy);
    }

    public async Task<RemediationResult> ExecuteAsync(
        RemediationPlan plan,
        CancellationToken ct)
    {
        // Execute with blast radius control
        using var semaphore = new SemaphoreSlim(plan.Policy.MaxConcurrentRemediations);
        // Results are appended only between batches, so a plain list is sufficient
        var results = new List<TargetRemediationResult>();

        foreach (var batch in plan.Batches)
        {
            var tasks = batch.Targets.Select(async target =>
            {
                await semaphore.WaitAsync(ct);
                try
                {
                    return await RemediateTargetAsync(target, plan, ct);
                }
                finally
                {
                    semaphore.Release();
                }
            });

            var batchResults = await Task.WhenAll(tasks);
            results.AddRange(batchResults);

            // Health check between batches for rolling strategy
            if (plan.Policy.Strategy == RemediationStrategy.Rolling)
            {
                await VerifyBatchHealthAsync(batchResults, ct);
            }
        }

        // Generate evidence
        var evidence = await _evidenceWriter.WriteAsync(plan, results, ct);

        return new RemediationResult(plan.Id, results.ToImmutableArray(), evidence);
    }
}
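
`BuildExecutionPlan` is not shown in this document. One way the strategy could translate into batches is sketched below, under these assumptions: `AllAtOnce` yields a single batch, `Rolling` yields one target per batch, and `Canary` yields a single-target batch followed by batches of at most `maxConcurrent` targets. `BlueGreen` is left out because it switches traffic rather than batching targets. All names here (`planBatches`, `chunkTargets`, `PlannedBatch`) are hypothetical.

```typescript
type BatchingStrategy = "AllAtOnce" | "Rolling" | "Canary";

interface PlannedBatch<T> {
  order: number;                // 1-based execution order
  targets: T[];
  requiresHealthCheck: boolean; // assumed: verify after every non-AllAtOnce batch
}

// Split items into consecutive groups of at most `size` elements.
function chunkTargets<T>(items: readonly T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

function planBatches<T>(
  targets: readonly T[],
  strategy: BatchingStrategy,
  maxConcurrent: number,
): PlannedBatch<T>[] {
  if (!Number.isInteger(maxConcurrent) || maxConcurrent < 1) {
    throw new RangeError("maxConcurrent must be a positive integer");
  }
  if (targets.length === 0) return [];

  let groups: T[][];
  if (strategy === "AllAtOnce") {
    groups = [targets.slice()];
  } else if (strategy === "Rolling") {
    groups = chunkTargets(targets, 1);
  } else {
    // Canary: one target first, then the remainder in capped groups.
    groups = [targets.slice(0, 1)].concat(chunkTargets(targets.slice(1), maxConcurrent));
  }

  return groups.map((group, index) => ({
    order: index + 1,
    targets: group,
    requiresHealthCheck: strategy !== "AllAtOnce",
  }));
}

// Five drifted targets, canary strategy, at most two at a time after the canary.
const canaryBatches = planBatches(["a", "b", "c", "d", "e"], "Canary", 2);
console.log(canaryBatches.map((b) => b.targets)); // [["a"], ["b", "c"], ["d", "e"]]
```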

4. ReconcileScheduler

Manages scheduled reconciliation runs:

public sealed class ReconcileScheduler
{
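    private readonly TimeProvider _timeProvider;
    private readonly IRemediationPolicyStore _policyStore;
    private readonly IDriftDetector _driftDetector;
    private readonly RemediationEngine _engine;
    // Used below to load actual and expected state; these interface names are assumed
    private readonly IInventoryService _inventoryService;
    private readonly IReleaseService _releaseService;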

    public async Task RunScheduledReconciliationAsync(CancellationToken ct)
    {
        var policies = await _policyStore.GetScheduledPoliciesAsync(ct);

        foreach (var policy in policies)
        {
            if (!IsWithinWindow(policy))
                continue;

            // Detect drift
            var inventory = await _inventoryService.GetCurrentAsync(policy.EnvironmentId, ct);
            var expected = await _releaseService.GetExpectedStateAsync(policy.EnvironmentId, ct);
            var drift = _driftDetector.Detect(inventory, expected);

            if (drift.HasDrift)
            {
                var plan = await _engine.CreatePlanAsync(drift, policy, ct);
                await _engine.ExecuteAsync(plan, ct);
            }
        }
    }
}
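
The scheduler relies on an `IsWithinWindow` check that is not shown here. A minimal sketch of such a check follows, assuming the window is expressed in UTC as a start and end time plus a list of allowed weekdays, and that a window whose start is later than its end wraps past midnight and belongs to the day on which it started. `DailyWindow`, `toMinuteOfDay`, and `isWithinWindowUtc` are hypothetical names.

```typescript
// Hypothetical window shape: minutes are counted from midnight UTC.
interface DailyWindow {
  startMinute: number;            // inclusive
  endMinute: number;              // exclusive
  allowedDays: readonly number[]; // 0 = Sunday ... 6 = Saturday (UTC)
}

// Convert "HH:mm" to minutes after midnight.
function toMinuteOfDay(clock: string): number {
  const match = /^(\d{2}):(\d{2})$/.exec(clock);
  if (match === null) throw new Error(`Expected HH:mm, got "${clock}"`);
  const hours = Number(match[1]);
  const minutes = Number(match[2]);
  if (hours > 23 || minutes > 59) throw new RangeError(`Out-of-range time "${clock}"`);
  return hours * 60 + minutes;
}

function isWithinWindowUtc(now: Date, schedule: DailyWindow): boolean {
  const minuteOfDay = now.getUTCHours() * 60 + now.getUTCMinutes();
  const today = now.getUTCDay();
  const allows = (day: number): boolean => schedule.allowedDays.indexOf(day) !== -1;

  if (schedule.startMinute <= schedule.endMinute) {
    // Same-day window, e.g. 02:00-06:00.
    return allows(today)
      && minuteOfDay >= schedule.startMinute
      && minuteOfDay < schedule.endMinute;
  }
  // Wrapping window, e.g. 22:00-02:00: the part after midnight is attributed
  // to the previous day, which is where the window started.
  if (minuteOfDay >= schedule.startMinute) return allows(today);
  if (minuteOfDay < schedule.endMinute) return allows((today + 6) % 7);
  return false;
}

// Weekday window 02:00-06:00 UTC.
const weekdayNights: DailyWindow = {
  startMinute: toMinuteOfDay("02:00"),
  endMinute: toMinuteOfDay("06:00"),
  allowedDays: [1, 2, 3, 4, 5],
};
// 2026-01-19 is a Monday.
console.log(isWithinWindowUtc(new Date(Date.UTC(2026, 0, 19, 3, 0)), weekdayNights)); // true
```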

Data Models

RemediationPlan

public sealed record RemediationPlan
{
    public Guid Id { get; init; }
    public Guid DriftReportId { get; init; }
    public RemediationPolicy Policy { get; init; }
    public RemediationPlanStatus Status { get; init; }
    public ImmutableArray<RemediationBatch> Batches { get; init; }
    public DateTimeOffset CreatedAt { get; init; }
    public DateTimeOffset? ScheduledFor { get; init; }
    public DateTimeOffset? StartedAt { get; init; }
    public DateTimeOffset? CompletedAt { get; init; }
    public string? DeferralReason { get; init; }
}

public enum RemediationPlanStatus
{
    Created,
    Scheduled,
    Deferred,       // Waiting for maintenance window
    Running,
    Paused,         // Human intervention requested
    Succeeded,
    PartialSuccess, // Some targets remediated, some failed
    Failed,
    Cancelled
}

public sealed record RemediationBatch
{
    public int Order { get; init; }
    public ImmutableArray<RemediationTarget> Targets { get; init; }
    public TimeSpan? DelayAfter { get; init; }
    public bool RequiresHealthCheck { get; init; }
}

public sealed record RemediationTarget
{
    public Guid TargetId { get; init; }
    public string TargetName { get; init; }
    public DriftItem Drift { get; init; }
    public DriftSeverity Severity { get; init; }
    public RemediationAction Action { get; init; }
    public string? ActionPayload { get; init; }  // Compose file, rollback digest, etc.
}

RemediationResult

public sealed record RemediationResult
{
    public Guid PlanId { get; init; }
    public RemediationResultStatus Status { get; init; }
    public ImmutableArray<TargetRemediationResult> TargetResults { get; init; }
    public Guid EvidencePacketId { get; init; }
    public TimeSpan Duration { get; init; }
    public RemediationMetrics Metrics { get; init; }
}

public sealed record TargetRemediationResult
{
    public Guid TargetId { get; init; }
    public RemediationTargetStatus Status { get; init; }
    public string? Error { get; init; }
    public TimeSpan Duration { get; init; }
    public string? PreviousDigest { get; init; }
    public string? CurrentDigest { get; init; }
    public ImmutableArray<string> Logs { get; init; }
}

public sealed record RemediationMetrics
{
    public int TotalTargets { get; init; }
    public int Succeeded { get; init; }
    public int Failed { get; init; }
    public int Skipped { get; init; }
    public TimeSpan TotalDuration { get; init; }
    public TimeSpan AverageTargetDuration { get; init; }
}
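
To illustrate how `RemediationMetrics` and the plan outcome could be derived from per-target results, here is a small sketch. The `RemediationTargetStatus` values are not listed in this document, so the three outcomes below are an assumption, as is the rule that a run with no failures counts as `Succeeded` and a run with both successes and failures counts as `PartialSuccess`. Durations are summed per target, which differs from wall-clock time when targets run concurrently.

```typescript
// Assumed target outcomes; the real RemediationTargetStatus enum may differ.
type TargetOutcome = "Succeeded" | "Failed" | "Skipped";

interface TargetOutcomeRecord {
  targetId: string;
  status: TargetOutcome;
  durationMs: number;
}

interface MetricsSummary {
  totalTargets: number;
  succeeded: number;
  failed: number;
  skipped: number;
  totalDurationMs: number;
  averageTargetDurationMs: number;
}

function summarizeTargetResults(results: readonly TargetOutcomeRecord[]): MetricsSummary {
  let succeeded = 0;
  let failed = 0;
  let skipped = 0;
  let totalDurationMs = 0;
  for (const result of results) {
    if (result.status === "Succeeded") succeeded++;
    else if (result.status === "Failed") failed++;
    else skipped++;
    totalDurationMs += result.durationMs;
  }
  return {
    totalTargets: results.length,
    succeeded,
    failed,
    skipped,
    totalDurationMs,
    // Guard against division by zero for an empty plan.
    averageTargetDurationMs: results.length === 0 ? 0 : totalDurationMs / results.length,
  };
}

// Assumed mapping from metrics to the terminal plan statuses.
function derivePlanOutcome(metrics: MetricsSummary): "Succeeded" | "PartialSuccess" | "Failed" {
  if (metrics.failed === 0) return "Succeeded";
  return metrics.succeeded > 0 ? "PartialSuccess" : "Failed";
}

const sampleMetrics = summarizeTargetResults([
  { targetId: "t1", status: "Succeeded", durationMs: 4000 },
  { targetId: "t2", status: "Failed", durationMs: 2000 },
  { targetId: "t3", status: "Skipped", durationMs: 0 },
]);
console.log(derivePlanOutcome(sampleMetrics)); // "PartialSuccess"
```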

API Design

REST Endpoints

# Policies
POST   /api/v1/remediation/policies                    # Create policy
GET    /api/v1/remediation/policies                    # List policies
GET    /api/v1/remediation/policies/{id}               # Get policy
PUT    /api/v1/remediation/policies/{id}               # Update policy
DELETE /api/v1/remediation/policies/{id}               # Delete policy
POST   /api/v1/remediation/policies/{id}/activate      # Activate policy
POST   /api/v1/remediation/policies/{id}/deactivate    # Deactivate policy

# Plans
GET    /api/v1/remediation/plans                       # List plans
GET    /api/v1/remediation/plans/{id}                  # Get plan details
POST   /api/v1/remediation/plans/{id}/execute          # Execute deferred plan
POST   /api/v1/remediation/plans/{id}/pause            # Pause running plan
POST   /api/v1/remediation/plans/{id}/resume           # Resume paused plan
POST   /api/v1/remediation/plans/{id}/cancel           # Cancel plan

# On-demand
POST   /api/v1/remediation/preview                     # Preview remediation (dry-run)
POST   /api/v1/remediation/execute                     # Execute immediate remediation

# History
GET    /api/v1/remediation/history                     # List remediation history
GET    /api/v1/remediation/history/{id}                # Get remediation result
GET    /api/v1/remediation/history/{id}/evidence       # Get evidence packet

WebSocket Events

// Real-time remediation updates
interface RemediationEvent {
  type: 'plan.created' | 'plan.started' | 'plan.completed' |
        'target.started' | 'target.completed' | 'target.failed';
  planId: string;
  targetId?: string;
  status: string;
  progress?: number;
  message?: string;
  timestamp: string;
}
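
On the consuming side, a client could fold these events into a per-plan view. The sketch below is one possible reducer; `PlanView` and `applyRemediationEvent` are hypothetical client-side names, and the `RemediationEvent` interface is repeated only so the example is self-contained.

```typescript
// Same shape as the RemediationEvent interface above.
interface RemediationEvent {
  type: 'plan.created' | 'plan.started' | 'plan.completed' |
        'target.started' | 'target.completed' | 'target.failed';
  planId: string;
  targetId?: string;
  status: string;
  progress?: number;
  message?: string;
  timestamp: string;
}

type TargetState = 'running' | 'completed' | 'failed';

// Hypothetical client-side view model; not part of the API.
interface PlanView {
  planId: string;
  state: 'created' | 'running' | 'completed';
  targets: Record<string, TargetState>;
}

// Return a new view with the event applied; the input view is not mutated.
function applyRemediationEvent(view: PlanView, event: RemediationEvent): PlanView {
  if (event.planId !== view.planId) return view; // ignore events for other plans
  switch (event.type) {
    case 'plan.created':
      return { ...view, state: 'created' };
    case 'plan.started':
      return { ...view, state: 'running' };
    case 'plan.completed':
      return { ...view, state: 'completed' };
    case 'target.started':
    case 'target.completed':
    case 'target.failed': {
      if (event.targetId === undefined) return view; // malformed target event
      const targetState: TargetState =
        event.type === 'target.started' ? 'running'
        : event.type === 'target.completed' ? 'completed'
        : 'failed';
      return { ...view, targets: { ...view.targets, [event.targetId]: targetState } };
    }
    default:
      return view;
  }
}
```

A WebSocket `message` handler would parse each frame as a `RemediationEvent` and pass it through `applyRemediationEvent`, keeping the latest `PlanView` for rendering.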

Severity Scoring Algorithm

public sealed class SeverityScorer
{
    private readonly SeverityScoringConfig _config;

    public DriftSeverity Score(DriftItem drift, ScoringContext context)
    {
        var factors = new List<SeverityFactor>();
        var score = 0.0;

        // Factor 1: Drift Type (30%)
        var typeScore = drift.Type switch
        {
            DriftType.Missing => 100,
            DriftType.DigestMismatch => 80,
            DriftType.StatusMismatch => 50,
            DriftType.Unexpected => 30,
            _ => 10
        };
        factors.Add(new SeverityFactor("DriftType", typeScore, 0.30));
        score += typeScore * 0.30;

        // Factor 2: Drift Age (25%)
        var ageScore = CalculateAgeScore(drift.DetectedAt, context.Now);
        factors.Add(new SeverityFactor("DriftAge", ageScore, 0.25));
        score += ageScore * 0.25;

        // Factor 3: Environment Criticality (20%)
        var envScore = context.Environment.Criticality switch
        {
            EnvironmentCriticality.Production => 100,
            EnvironmentCriticality.Staging => 60,
            EnvironmentCriticality.Development => 20,
            _ => 10
        };
        factors.Add(new SeverityFactor("EnvironmentCriticality", envScore, 0.20));
        score += envScore * 0.20;

        // Factor 4: Component Criticality (15%)
        var componentScore = context.ComponentCriticality.GetValueOrDefault(drift.ComponentId, 50);
        factors.Add(new SeverityFactor("ComponentCriticality", componentScore, 0.15));
        score += componentScore * 0.15;

        // Factor 5: Blast Radius (10%)
        var blastScore = CalculateBlastRadius(drift, context.DependencyGraph);
        factors.Add(new SeverityFactor("BlastRadius", blastScore, 0.10));
        score += blastScore * 0.10;

        return new DriftSeverity
        {
            Level = ScoreToLevel((int)score),
            Score = (int)score,
            Factors = factors.ToImmutableArray(),
            DriftAge = context.Now - drift.DetectedAt,
            RequiresImmediate = score >= 90
        };
    }

    private int CalculateAgeScore(DateTimeOffset detectedAt, DateTimeOffset now)
    {
        var age = now - detectedAt;
        return age.TotalMinutes switch
        {
            < 5 => 10,      // Under 5 minutes: very fresh, low urgency
            < 30 => 30,     // Under 30 minutes: recent
            < 60 => 50,     // Under 1 hour
            < 240 => 70,    // Under 4 hours
            < 1440 => 85,   // Under 24 hours
            _ => 100        // 24 hours or more: critical
        };
    }

    private int CalculateBlastRadius(DriftItem drift, DependencyGraph graph)
    {
        var dependents = graph.GetDependents(drift.ComponentId);
        return dependents.Count switch
        {
            0 => 10,
            < 3 => 30,
            < 10 => 60,
            < 25 => 80,
            _ => 100
        };
    }
}
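
The scorer calls `ScoreToLevel`, which is not defined in this document. The sketch below shows one possible mapping. The thresholds are assumptions: High, Medium, and Low reuse the numeric values of `DriftSeverityLevel` as lower bounds, while Critical starts at 90 so that it lines up with `RequiresImmediate` instead of requiring a perfect score of 100.

```typescript
type SeverityLevelName = "Info" | "Low" | "Medium" | "High" | "Critical";

// Hypothetical lower bounds, checked from highest to lowest.
const LEVEL_FLOORS: ReadonlyArray<readonly [number, SeverityLevelName]> = [
  [90, "Critical"],
  [75, "High"],
  [50, "Medium"],
  [25, "Low"],
];

// Map a 0-100 score to a level; anything below the lowest floor is Info.
function scoreToLevel(score: number): SeverityLevelName {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`score must be within 0-100, got ${score}`);
  }
  for (const [floor, level] of LEVEL_FLOORS) {
    if (score >= floor) return level;
  }
  return "Info";
}

console.log(scoreToLevel(76)); // "High"
console.log(scoreToLevel(90)); // "Critical"
```

Whatever thresholds are chosen, keeping them in one table makes the golden tests for deterministic severity levels straightforward to write.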

Safety Mechanisms

1. Rate Limiting

public sealed class RemediationRateLimiter
{
    public async Task<RateLimitResult> CheckAsync(
        RemediationPolicy policy,
        int requestedCount,
        CancellationToken ct)
    {
        var hourlyCount = await GetHourlyRemediationCountAsync(policy.Id, ct);

        if (hourlyCount + requestedCount > policy.MaxRemediationsPerHour)
        {
            return RateLimitResult.Exceeded(
                $"Hourly limit exceeded: {hourlyCount}/{policy.MaxRemediationsPerHour}");
        }

        var lastRemediation = await GetLastRemediationAsync(policy.Id, ct);
        if (lastRemediation != null)
        {
            var timeSinceLast = _timeProvider.GetUtcNow() - lastRemediation.CompletedAt;
            if (timeSinceLast < policy.CooldownPeriod)
            {
                return RateLimitResult.Cooldown(policy.CooldownPeriod - timeSinceLast);
            }
        }

        return RateLimitResult.Allowed(requestedCount);
    }
}
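
`GetHourlyRemediationCountAsync` is not shown in this document. One way to back the hourly check is a sliding one-hour window, sketched below as an in-memory counter. This is an illustration only: `HourlyRemediationWindow` is a hypothetical name, and a real service would presumably persist the counts per policy rather than hold them in memory.

```typescript
const ONE_HOUR_MS = 60 * 60 * 1000;

// In-memory sliding-window counter for a single policy.
class HourlyRemediationWindow {
  private readonly maxPerHour: number;
  private timestamps: number[] = []; // one entry per granted remediation

  constructor(maxPerHour: number) {
    this.maxPerHour = maxPerHour;
  }

  // Grant `requested` remediations at time `nowMs` if the rolling hour
  // stays within the limit; otherwise grant nothing.
  tryAcquire(requested: number, nowMs: number): boolean {
    const cutoff = nowMs - ONE_HOUR_MS;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    if (this.timestamps.length + requested > this.maxPerHour) {
      return false;
    }
    for (let i = 0; i < requested; i++) {
      this.timestamps.push(nowMs);
    }
    return true;
  }
}

const hourlyWindow = new HourlyRemediationWindow(10);
console.log(hourlyWindow.tryAcquire(6, 0));    // true  (6 of 10 used)
console.log(hourlyWindow.tryAcquire(5, 1000)); // false (6 + 5 > 10)
console.log(hourlyWindow.tryAcquire(4, 1000)); // true  (10 of 10 used)
```

The comparison matches the C# check: a request is rejected when the current count plus the requested count would exceed `MaxRemediationsPerHour`.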

2. Blast Radius Control

// Maximum share of targets that can be remediated in one operation (percent, 0-100)
public const int MaxTargetPercentage = 25;

// Never remediate more than this many targets at once
public const int AbsoluteMaxTargets = 10;

// Minimum share of healthy targets required before remediation (fraction, 0.0-1.0; 0.75 = 75%)
public const double MinHealthyPercentage = 0.75;
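
A sketch of how these three limits could be combined into a single cap is shown below. The function name and the rounding rule are assumptions: the percentage cap is rounded down, which means fleets of fewer than four targets get a cap of zero under this rule, and a fleet below the healthy threshold gets no remediation at all.

```typescript
const MAX_TARGET_PERCENTAGE = 25;   // percent of all targets
const ABSOLUTE_MAX_TARGETS = 10;    // hard ceiling per operation
const MIN_HEALTHY_FRACTION = 0.75;  // fraction of targets that must be healthy

// Number of targets that may be remediated in one operation.
function allowedRemediationCount(
  totalTargets: number,
  healthyTargets: number,
  requested: number,
): number {
  if (totalTargets <= 0 || requested <= 0) return 0;
  // Too many unhealthy targets already: do not add risk.
  if (healthyTargets / totalTargets < MIN_HEALTHY_FRACTION) return 0;
  const byPercentage = Math.floor((totalTargets * MAX_TARGET_PERCENTAGE) / 100);
  return Math.min(requested, byPercentage, ABSOLUTE_MAX_TARGETS);
}

console.log(allowedRemediationCount(100, 90, 30)); // 10 (absolute ceiling applies)
console.log(allowedRemediationCount(20, 18, 8));   // 5  (25% of 20)
console.log(allowedRemediationCount(20, 14, 8));   // 0  (only 70% healthy)
```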

3. Circuit Breaker

public sealed class RemediationCircuitBreaker
{
    private int _consecutiveFailures;
    private DateTimeOffset? _openedAt;

    public bool IsOpen => _openedAt != null &&
        (_timeProvider.GetUtcNow() - _openedAt.Value) < _config.OpenDuration;

    public void RecordSuccess()
    {
        _consecutiveFailures = 0;
        _openedAt = null;
    }

    public void RecordFailure()
    {
        _consecutiveFailures++;
        if (_consecutiveFailures >= _config.FailureThreshold)
        {
            _openedAt = _timeProvider.GetUtcNow();
            _logger.LogWarning("Remediation circuit breaker opened after {Failures} failures",
                _consecutiveFailures);
        }
    }
}

Metrics & Observability

Prometheus Metrics

# Counters
stella_remediation_plans_total{environment, policy, status}
stella_remediation_targets_total{environment, action, status}
stella_remediation_rate_limit_hits_total{policy}

# Histograms
stella_remediation_plan_duration_seconds{environment, strategy}
stella_remediation_target_duration_seconds{environment, action}
stella_remediation_detection_to_action_seconds{environment, severity}

# Gauges
stella_drift_items_pending_remediation{environment, severity}
stella_remediation_circuit_breaker_open{policy}

Structured Logging

{
  "event": "remediation.target.completed",
  "plan_id": "abc-123",
  "target_id": "target-456",
  "environment": "production",
  "action": "reconcile",
  "drift_type": "digest_mismatch",
  "severity": "high",
  "duration_ms": 4532,
  "status": "succeeded",
  "previous_digest": "sha256:abc...",
  "current_digest": "sha256:def...",
  "correlation_id": "xyz-789"
}

Evidence Generation

Every remediation produces a sealed evidence packet:

public sealed record RemediationEvidence
{
    // What drifted
    public ImmutableArray<DriftItem> DetectedDrift { get; init; }
    public ImmutableArray<DriftSeverity> Severities { get; init; }

    // Policy applied
    public RemediationPolicy Policy { get; init; }

    // Plan executed
    public RemediationPlan Plan { get; init; }

    // Results
    public ImmutableArray<TargetRemediationResult> Results { get; init; }

    // Who/when
    public string InitiatedBy { get; init; }  // "system:auto" or user ID
    public DateTimeOffset InitiatedAt { get; init; }
    public DateTimeOffset CompletedAt { get; init; }

    // Artifacts
    public ImmutableArray<string> GeneratedArtifacts { get; init; }  // Compose files, scripts
}

Configuration

Default Policy Template

name: "production-auto-remediation"
environment_id: "prod-001"

trigger: age_threshold
minimum_severity: high
minimum_drift_age: "00:15:00"  # 15 minutes
maximum_drift_age: "1.00:00:00"  # 24 hours (1 day, d.hh:mm:ss), then escalate to manual

action: reconcile
strategy: rolling

safety:
  max_concurrent_remediations: 2
  max_remediations_per_hour: 10
  cooldown_period: "00:05:00"  # 5 minutes between remediations

schedule:
  maintenance_window:
    enabled: true
    start: "02:00"
    end: "06:00"
    timezone: "UTC"
  allowed_days: [monday, tuesday, wednesday, thursday, friday]

notifications:
  on_plan_created: true
  on_remediation_started: true
  on_remediation_completed: true
  on_remediation_failed: true
  channels:
    - type: slack
      channel: "#ops-alerts"
    - type: email
      recipients: ["ops-team@example.com"]

Test Strategy

Unit Tests

  • Severity scoring with various drift combinations
  • Rate limiting logic
  • Circuit breaker state transitions
  • Policy evaluation with edge cases

Integration Tests

  • Full remediation flow: detect → plan → execute → verify
  • Maintenance window enforcement
  • Rate limit enforcement across multiple requests
  • Evidence packet generation and signing

Chaos Tests

  • Agent failure during remediation
  • Database unavailability during plan execution
  • Concurrent remediation requests
  • Clock skew handling

Golden Tests

  • Deterministic severity scores for fixed inputs
  • Deterministic plan generation for fixed drift reports
  • Evidence packet structure validation

Migration Path

Phase 1: Foundation (Weeks 1-2)

  • Severity scoring service
  • Remediation policy model and store
  • Basic API endpoints

Phase 2: Engine (Weeks 3-4)

  • Remediation engine implementation
  • Plan creation and execution
  • Target remediation logic

Phase 3: Safety (Week 5)

  • Rate limiting
  • Circuit breaker
  • Blast radius controls

Phase 4: Scheduling (Week 6)

  • Maintenance window support
  • Scheduled reconciliation
  • Age-based escalation

Phase 5: Observability (Week 7)

  • Metrics emission
  • Evidence generation
  • Alert integration

Phase 6: UI & Polish (Week 8)

  • Web console integration
  • Real-time updates
  • Policy management UI