# Drift Remediation Automation

## Overview

Drift Remediation Automation extends the existing drift detection system with policy-driven automatic remediation. While drift detection identifies divergence between expected and actual state, remediation automation closes the loop by taking corrective action without manual intervention.

The design balances automation with safety through configurable remediation strategies, severity-based prioritization, and comprehensive audit trails.

---

## Design Principles

1. **Safety First**: Auto-remediation never executes without explicit policy authorization
2. **Gradual Escalation**: Start with notifications, escalate to remediation based on drift age/severity
3. **Deterministic Actions**: Remediation produces identical outcomes for identical drift states
4. **Full Auditability**: Every remediation action generates signed evidence packets
5. **Blast Radius Control**: Limit concurrent remediations; prevent cascading failures
6. **Human Override**: Operators can pause, cancel, or override any remediation

---

## Architecture

### Component Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                      Drift Remediation System                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────┐    ┌──────────────────┐    ┌───────────────┐   │
│  │  DriftDetector  │───▶│ RemediationEngine│───▶│ ActionExecutor│   │
│  │   (existing)    │    │                  │    │               │   │
│  └─────────────────┘    └──────────────────┘    └───────────────┘   │
│           │                      │                      │           │
│           ▼                      ▼                      ▼           │
│  ┌─────────────────┐    ┌──────────────────┐    ┌───────────────┐   │
│  │ SeverityScorer  │    │ PolicyEvaluator  │    │ EvidenceWriter│   │
│  │                 │    │                  │    │               │   │
│  └─────────────────┘    └──────────────────┘    └───────────────┘   │
│           │                      │                      │           │
│           ▼                      ▼                      ▼           │
│  ┌─────────────────┐    ┌──────────────────┐    ┌───────────────┐   │
│  │   AlertRouter   │    │ReconcileScheduler│    │ MetricsEmitter│   │
│  │                 │    │                  │    │               │   │
│  └─────────────────┘    └──────────────────┘    └───────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Key Components

#### 1. SeverityScorer

Calculates drift severity based on multiple weighted factors:

```csharp
public sealed record DriftSeverity
{
    public DriftSeverityLevel Level { get; init; }   // Critical, High, Medium, Low, Info
    public int Score { get; init; }                  // 0-100 numeric score
    public ImmutableArray<SeverityFactor> Factors { get; init; }
    public TimeSpan DriftAge { get; init; }
    public bool RequiresImmediate { get; init; }
}

public enum DriftSeverityLevel
{
    Info = 0,       // Cosmetic differences (labels, annotations)
    Low = 25,       // Non-critical drift (resource limits changed)
    Medium = 50,    // Functional drift (ports, volumes)
    High = 75,      // Security drift (image digest mismatch)
    Critical = 100  // Severe drift (container missing, wrong image)
}
```

**Severity Factors:**

| Factor | Weight | Description |
|--------|--------|-------------|
| Drift Type | 30% | Missing > Digest Mismatch > Status Mismatch > Unexpected |
| Drift Age | 25% | Older drift = higher severity |
| Environment Criticality | 20% | Production > Staging > Development |
| Component Criticality | 15% | Core services weighted higher |
| Blast Radius | 10% | Number of dependent services affected |

#### 2. RemediationPolicy

Defines when and how to remediate drift:

```csharp
public sealed record RemediationPolicy
{
    public Guid Id { get; init; }
    public string Name { get; init; }
    public Guid EnvironmentId { get; init; }

    // Triggers
    public RemediationTrigger Trigger { get; init; }
    public DriftSeverityLevel MinimumSeverity { get; init; }
    public TimeSpan MinimumDriftAge { get; init; }
    public TimeSpan MaximumDriftAge { get; init; }  // Escalate to manual if exceeded

    // Actions
    public RemediationAction Action { get; init; }
    public RemediationStrategy Strategy { get; init; }

    // Safety limits
    public int MaxConcurrentRemediations { get; init; }
    public int MaxRemediationsPerHour { get; init; }
    public TimeSpan CooldownPeriod { get; init; }

    // Schedule
    public RemediationWindow? MaintenanceWindow { get; init; }
    public ImmutableArray<DayOfWeek> AllowedDays { get; init; }
    public TimeOnly AllowedStartTime { get; init; }
    public TimeOnly AllowedEndTime { get; init; }

    // Notifications
    public NotificationConfig Notifications { get; init; }
}

public enum RemediationTrigger
{
    Immediate,           // Remediate as soon as detected
    Scheduled,           // Wait for maintenance window
    AgeThreshold,        // Remediate after drift exceeds age
    SeverityEscalation,  // Remediate when severity increases
    Manual               // Notification only, human initiates
}

public enum RemediationAction
{
    NotifyOnly,   // Alert but don't act
    Reconcile,    // Restore to expected state
    Rollback,     // Rollback to previous known-good release
    Scale,        // Adjust replica count
    Restart,      // Restart containers
    Quarantine    // Isolate drifted targets from traffic
}

public enum RemediationStrategy
{
    AllAtOnce,   // Remediate all drifted targets simultaneously
    Rolling,     // Remediate one at a time with health checks
    Canary,      // Remediate one, verify, then proceed
    BlueGreen    // Deploy to standby, switch traffic
}
```

#### 3. RemediationEngine

Orchestrates the remediation process:

```csharp
public sealed class RemediationEngine
{
    public async Task<RemediationPlan> CreatePlanAsync(
        DriftReport driftReport,
        RemediationPolicy policy,
        CancellationToken ct)
    {
        // 1. Score severity for each drift item
        var scoredDrifts = await _severityScorer.ScoreAsync(driftReport.Items, ct);

        // 2. Filter by policy thresholds
        var actionable = scoredDrifts
            .Where(d => d.Severity.Level >= policy.MinimumSeverity)
            .Where(d => d.Severity.DriftAge >= policy.MinimumDriftAge)
            .ToImmutableArray();

        // 3. Check maintenance window
        if (!IsWithinMaintenanceWindow(policy))
            return RemediationPlan.Deferred(actionable, policy.MaintenanceWindow);

        // 4. Check rate limits
        var allowed = await CheckRateLimitsAsync(actionable, policy, ct);

        // 5. Build execution plan
        return BuildExecutionPlan(allowed, policy);
    }

    public async Task<RemediationResult> ExecuteAsync(
        RemediationPlan plan,
        CancellationToken ct)
    {
        // Execute with blast radius control
        using var semaphore = new SemaphoreSlim(plan.Policy.MaxConcurrentRemediations);

        // Results are appended only after each batch has completed,
        // so a plain list is sufficient (ConcurrentBag<T> has no AddRange).
        var results = new List<TargetRemediationResult>();

        foreach (var batch in plan.Batches)
        {
            var tasks = batch.Targets.Select(async target =>
            {
                await semaphore.WaitAsync(ct);
                try
                {
                    return await RemediateTargetAsync(target, plan, ct);
                }
                finally
                {
                    semaphore.Release();
                }
            });

            var batchResults = await Task.WhenAll(tasks);
            results.AddRange(batchResults);

            // Health check between batches for rolling strategy
            if (plan.Policy.Strategy == RemediationStrategy.Rolling)
            {
                await VerifyBatchHealthAsync(batchResults, ct);
            }
        }

        // Generate evidence
        var evidence = await _evidenceWriter.WriteAsync(plan, results, ct);

        return new RemediationResult(plan.Id, results.ToImmutableArray(), evidence);
    }
}
```

#### 4. ReconcileScheduler

Manages scheduled reconciliation runs:

```csharp
public sealed class ReconcileScheduler
{
    private readonly TimeProvider _timeProvider;
    private readonly IRemediationPolicyStore _policyStore;
    private readonly IDriftDetector _driftDetector;
    private readonly RemediationEngine _engine;

    // Used below but not declared in the original sketch; interface names are assumed.
    private readonly IInventoryService _inventoryService;
    private readonly IReleaseService _releaseService;

    public async Task RunScheduledReconciliationAsync(CancellationToken ct)
    {
        var policies = await _policyStore.GetScheduledPoliciesAsync(ct);

        foreach (var policy in policies)
        {
            if (!IsWithinWindow(policy))
                continue;

            // Detect drift
            var inventory = await _inventoryService.GetCurrentAsync(policy.EnvironmentId, ct);
            var expected = await _releaseService.GetExpectedStateAsync(policy.EnvironmentId, ct);
            var drift = _driftDetector.Detect(inventory, expected);

            if (drift.HasDrift)
            {
                var plan = await _engine.CreatePlanAsync(drift, policy, ct);
                await _engine.ExecuteAsync(plan, ct);
            }
        }
    }
}
```

---

## Data Models

### RemediationPlan

```csharp
public sealed record RemediationPlan
{
    public Guid Id { get; init; }
    public Guid DriftReportId { get; init; }
    public RemediationPolicy Policy { get; init; }
    public RemediationPlanStatus Status { get; init; }
    public ImmutableArray<RemediationBatch> Batches { get; init; }
    public DateTimeOffset CreatedAt { get; init; }
    public DateTimeOffset? ScheduledFor { get; init; }
    public DateTimeOffset? StartedAt { get; init; }
    public DateTimeOffset? CompletedAt { get; init; }
    public string? DeferralReason { get; init; }
}

public enum RemediationPlanStatus
{
    Created,
    Scheduled,
    Deferred,        // Waiting for maintenance window
    Running,
    Paused,          // Human intervention requested
    Succeeded,
    PartialSuccess,  // Some targets remediated, some failed
    Failed,
    Cancelled
}

public sealed record RemediationBatch
{
    public int Order { get; init; }
    public ImmutableArray<RemediationTarget> Targets { get; init; }
    public TimeSpan? DelayAfter { get; init; }
    public bool RequiresHealthCheck { get; init; }
}

public sealed record RemediationTarget
{
    public Guid TargetId { get; init; }
    public string TargetName { get; init; }
    public DriftItem Drift { get; init; }
    public DriftSeverity Severity { get; init; }
    public RemediationAction Action { get; init; }
    public string? ActionPayload { get; init; }  // Compose file, rollback digest, etc.
}
```

### RemediationResult

```csharp
public sealed record RemediationResult
{
    public Guid PlanId { get; init; }
    public RemediationResultStatus Status { get; init; }
    public ImmutableArray<TargetRemediationResult> TargetResults { get; init; }
    public Guid EvidencePacketId { get; init; }
    public TimeSpan Duration { get; init; }
    public RemediationMetrics Metrics { get; init; }
}

public sealed record TargetRemediationResult
{
    public Guid TargetId { get; init; }
    public RemediationTargetStatus Status { get; init; }
    public string? Error { get; init; }
    public TimeSpan Duration { get; init; }
    public string? PreviousDigest { get; init; }
    public string? CurrentDigest { get; init; }
    public ImmutableArray<string> Logs { get; init; }  // Element type assumed
}

public sealed record RemediationMetrics
{
    public int TotalTargets { get; init; }
    public int Succeeded { get; init; }
    public int Failed { get; init; }
    public int Skipped { get; init; }
    public TimeSpan TotalDuration { get; init; }
    public TimeSpan AverageTargetDuration { get; init; }
}
```

---

## API Design

### REST Endpoints

```
# Policies
POST   /api/v1/remediation/policies                   # Create policy
GET    /api/v1/remediation/policies                   # List policies
GET    /api/v1/remediation/policies/{id}              # Get policy
PUT    /api/v1/remediation/policies/{id}              # Update policy
DELETE /api/v1/remediation/policies/{id}              # Delete policy
POST   /api/v1/remediation/policies/{id}/activate     # Activate policy
POST   /api/v1/remediation/policies/{id}/deactivate   # Deactivate policy

# Plans
GET    /api/v1/remediation/plans                      # List plans
GET    /api/v1/remediation/plans/{id}                 # Get plan details
POST   /api/v1/remediation/plans/{id}/execute         # Execute deferred plan
POST   /api/v1/remediation/plans/{id}/pause           # Pause running plan
POST   /api/v1/remediation/plans/{id}/resume          # Resume paused plan
POST   /api/v1/remediation/plans/{id}/cancel          # Cancel plan

# On-demand
POST   /api/v1/remediation/preview                    # Preview remediation (dry-run)
POST   /api/v1/remediation/execute                    # Execute immediate remediation

# History
GET    /api/v1/remediation/history                    # List remediation history
GET    /api/v1/remediation/history/{id}               # Get remediation result
GET    /api/v1/remediation/history/{id}/evidence      # Get evidence packet
```

### WebSocket Events

```typescript
// Real-time remediation updates
interface RemediationEvent {
  type: 'plan.created' | 'plan.started' | 'plan.completed' |
        'target.started' | 'target.completed' | 'target.failed';
  planId: string;
  targetId?: string;
  status: string;
  progress?: number;
  message?: string;
  timestamp: string;
}
```

---

## Severity Scoring Algorithm

```csharp
public sealed class SeverityScorer
{
    private readonly SeverityScoringConfig _config;

    public DriftSeverity Score(DriftItem drift, ScoringContext context)
    {
        var factors = new List<SeverityFactor>();
        var score = 0.0;

        // Factor 1: Drift Type (30%)
        var typeScore = drift.Type switch
        {
            DriftType.Missing => 100,
            DriftType.DigestMismatch => 80,
            DriftType.StatusMismatch => 50,
            DriftType.Unexpected => 30,
            _ => 10
        };
        factors.Add(new SeverityFactor("DriftType", typeScore, 0.30));
        score += typeScore * 0.30;

        // Factor 2: Drift Age (25%)
        var ageScore = CalculateAgeScore(drift.DetectedAt, context.Now);
        factors.Add(new SeverityFactor("DriftAge", ageScore, 0.25));
        score += ageScore * 0.25;

        // Factor 3: Environment Criticality (20%)
        var envScore = context.Environment.Criticality switch
        {
            EnvironmentCriticality.Production => 100,
            EnvironmentCriticality.Staging => 60,
            EnvironmentCriticality.Development => 20,
            _ => 10
        };
        factors.Add(new SeverityFactor("EnvironmentCriticality", envScore, 0.20));
        score += envScore * 0.20;

        // Factor 4: Component Criticality (15%)
        var componentScore = context.ComponentCriticality.GetValueOrDefault(drift.ComponentId, 50);
        factors.Add(new SeverityFactor("ComponentCriticality", componentScore, 0.15));
        score += componentScore * 0.15;

        // Factor 5: Blast Radius (10%)
        var blastScore = CalculateBlastRadius(drift, context.DependencyGraph);
        factors.Add(new SeverityFactor("BlastRadius", blastScore, 0.10));
        score += blastScore * 0.10;

        return new DriftSeverity
        {
            Level = ScoreToLevel((int)score),
            Score = (int)score,
            Factors = factors.ToImmutableArray(),
            DriftAge = context.Now - drift.DetectedAt,
            RequiresImmediate = score >= 90
        };
    }

    private int CalculateAgeScore(DateTimeOffset detectedAt, DateTimeOffset now)
    {
        var age = now - detectedAt;
        return age.TotalMinutes switch
        {
            < 5 => 10,      // Under 5 minutes: very fresh, low urgency
            < 30 => 30,     // Under 30 minutes: recent
            < 60 => 50,     // Under 1 hour
            < 240 => 70,    // Under 4 hours
            < 1440 => 85,   // Under 24 hours
            _ => 100        // 24 hours or more: critical
        };
    }

    private int CalculateBlastRadius(DriftItem drift, DependencyGraph graph)
    {
        var dependents = graph.GetDependents(drift.ComponentId);
        return dependents.Count switch
        {
            0 => 10,
            < 3 => 30,
            < 10 => 60,
            < 25 => 80,
            _ => 100
        };
    }
}
```

---

## Safety Mechanisms

### 1. Rate Limiting

```csharp
public sealed class RemediationRateLimiter
{
    public async Task<RateLimitResult> CheckAsync(
        RemediationPolicy policy,
        int requestedCount,
        CancellationToken ct)
    {
        var hourlyCount = await GetHourlyRemediationCountAsync(policy.Id, ct);

        if (hourlyCount + requestedCount > policy.MaxRemediationsPerHour)
        {
            return RateLimitResult.Exceeded(
                $"Hourly limit exceeded: {hourlyCount}/{policy.MaxRemediationsPerHour}");
        }

        var lastRemediation = await GetLastRemediationAsync(policy.Id, ct);
        if (lastRemediation != null)
        {
            var timeSinceLast = _timeProvider.GetUtcNow() - lastRemediation.CompletedAt;
            if (timeSinceLast < policy.CooldownPeriod)
            {
                return RateLimitResult.Cooldown(policy.CooldownPeriod - timeSinceLast);
            }
        }

        return RateLimitResult.Allowed(requestedCount);
    }
}
```

### 2. Blast Radius Control

```csharp
// Maximum percentage of targets (0-100) that can be remediated in one operation
public const int MaxTargetPercentage = 25;

// Never remediate more than this many targets at once
public const int AbsoluteMaxTargets = 10;

// Minimum fraction of healthy targets (0.0-1.0) required before remediation
public const double MinHealthyPercentage = 0.75;
```

### 3. Circuit Breaker

```csharp
public sealed class RemediationCircuitBreaker
{
    private int _consecutiveFailures;
    private DateTimeOffset? _openedAt;

    public bool IsOpen =>
        _openedAt != null &&
        (_timeProvider.GetUtcNow() - _openedAt.Value) < _config.OpenDuration;

    public void RecordSuccess()
    {
        _consecutiveFailures = 0;
        _openedAt = null;
    }

    public void RecordFailure()
    {
        _consecutiveFailures++;
        if (_consecutiveFailures >= _config.FailureThreshold)
        {
            _openedAt = _timeProvider.GetUtcNow();
            _logger.LogWarning(
                "Remediation circuit breaker opened after {Failures} failures",
                _consecutiveFailures);
        }
    }
}
```

---

## Metrics & Observability

### Prometheus Metrics

```
# Counters
stella_remediation_plans_total{environment, policy, status}
stella_remediation_targets_total{environment, action, status}
stella_remediation_rate_limit_hits_total{policy}

# Histograms
stella_remediation_plan_duration_seconds{environment, strategy}
stella_remediation_target_duration_seconds{environment, action}
stella_remediation_detection_to_action_seconds{environment, severity}

# Gauges
stella_drift_items_pending_remediation{environment, severity}
stella_remediation_circuit_breaker_open{policy}
```

### Structured Logging

```json
{
  "event": "remediation.target.completed",
  "plan_id": "abc-123",
  "target_id": "target-456",
  "environment": "production",
  "action": "reconcile",
  "drift_type": "digest_mismatch",
  "severity": "high",
  "duration_ms": 4532,
  "status": "succeeded",
  "previous_digest": "sha256:abc...",
  "current_digest": "sha256:def...",
  "correlation_id": "xyz-789"
}
```

---

## Evidence Generation

Every remediation produces a sealed evidence packet:

```csharp
public sealed record RemediationEvidence
{
    // What drifted
    public ImmutableArray<DriftItem> DetectedDrift { get; init; }
    public ImmutableArray<DriftSeverity> Severities { get; init; }

    // Policy applied
    public RemediationPolicy Policy { get; init; }

    // Plan executed
    public RemediationPlan Plan { get; init; }

    // Results
    public ImmutableArray<TargetRemediationResult> Results { get; init; }

    // Who/when
    public string InitiatedBy { get; init; }  // "system:auto" or user ID
    public DateTimeOffset InitiatedAt { get; init; }
    public DateTimeOffset CompletedAt { get; init; }

    // Artifacts (element type assumed)
    public ImmutableArray<string> GeneratedArtifacts { get; init; }  // Compose files, scripts
}
```

---

## Configuration

### Default Policy Template

```yaml
name: "production-auto-remediation"
environment_id: "prod-001"

trigger: age_threshold
minimum_severity: high
minimum_drift_age: "00:15:00"   # 15 minutes
maximum_drift_age: "24:00:00"   # 24 hours, then escalate to manual

action: reconcile
strategy: rolling

safety:
  max_concurrent_remediations: 2
  max_remediations_per_hour: 10
  cooldown_period: "00:05:00"   # 5 minutes between remediations

schedule:
  maintenance_window:
    enabled: true
    start: "02:00"
    end: "06:00"
    timezone: "UTC"
  allowed_days: [monday, tuesday, wednesday, thursday, friday]

notifications:
  on_plan_created: true
  on_remediation_started: true
  on_remediation_completed: true
  on_remediation_failed: true
  channels:
    - type: slack
      channel: "#ops-alerts"
    - type: email
      recipients: ["ops-team@example.com"]
```

---

## Test Strategy

### Unit Tests

- Severity scoring with various drift combinations
- Rate limiting logic
- Circuit breaker state transitions
- Policy evaluation with edge cases

### Integration Tests

- Full remediation flow: detect → plan → execute → verify
- Maintenance window enforcement
- Rate limit enforcement across multiple requests
- Evidence packet generation and signing

### Chaos Tests

- Agent failure during remediation
- Database unavailability during plan execution
- Concurrent remediation requests
- Clock skew handling

### Golden Tests

- Deterministic severity scores for fixed inputs
- Deterministic plan generation for fixed drift reports
- Evidence packet structure validation

---

## Migration Path

### Phase 1: Foundation (Week 1-2)

- Severity scoring service
- Remediation policy model and store
- Basic API endpoints

### Phase 2: Engine (Week 3-4)

- Remediation engine implementation
- Plan creation and execution
- Target remediation logic

### Phase 3: Safety (Week 5)

- Rate limiting
- Circuit breaker
- Blast radius controls

### Phase 4: Scheduling (Week 6)

- Maintenance window support
- Scheduled reconciliation
- Age-based escalation

### Phase 5: Observability (Week 7)

- Metrics emission
- Evidence generation
- Alert integration

### Phase 6: UI & Polish (Week 8)

- Web console integration
- Real-time updates
- Policy management UI
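
For the real-time updates planned in Phase 6, a console client could fold the `RemediationEvent` stream from the WebSocket Events section into per-plan progress. The following is a minimal sketch rather than the planned implementation: `PlanProgress`, `initialProgress`, and `applyEvent` are hypothetical names, the event payloads are made-up sample data, and the sketch assumes that each target emits `target.started` before `target.completed` or `target.failed`.

```typescript
// Event shape copied from the WebSocket Events section.
interface RemediationEvent {
  type: 'plan.created' | 'plan.started' | 'plan.completed' |
        'target.started' | 'target.completed' | 'target.failed';
  planId: string;
  targetId?: string;
  status: string;
  progress?: number;
  message?: string;
  timestamp: string;
}

// Hypothetical client-side view model; not part of the documented API.
interface PlanProgress {
  planId: string;
  state: 'pending' | 'running' | 'completed';
  running: string[];    // target IDs currently being remediated
  succeeded: string[];  // target IDs that reported target.completed
  failed: string[];     // target IDs that reported target.failed
}

function initialProgress(planId: string): PlanProgress {
  return { planId, state: 'pending', running: [], succeeded: [], failed: [] };
}

// Append id only if absent, so replayed events do not create duplicates.
function addUnique(list: string[], id: string): string[] {
  return list.indexOf(id) === -1 ? [...list, id] : list;
}

// Pure reducer: returns the next PlanProgress and never mutates its input.
function applyEvent(p: PlanProgress, e: RemediationEvent): PlanProgress {
  if (e.planId !== p.planId) return p; // ignore events for other plans
  const id = e.targetId;
  switch (e.type) {
    case 'plan.started':
      return { ...p, state: 'running' };
    case 'plan.completed':
      return { ...p, state: 'completed' };
    case 'target.started':
      return id === undefined ? p : { ...p, running: addUnique(p.running, id) };
    case 'target.completed':
    case 'target.failed': {
      if (id === undefined) return p;
      const running = p.running.filter(t => t !== id);
      return e.type === 'target.completed'
        ? { ...p, running, succeeded: addUnique(p.succeeded, id) }
        : { ...p, running, failed: addUnique(p.failed, id) };
    }
    default:
      return p; // 'plan.created' carries no progress change in this sketch
  }
}

// Sample stream (illustrative values only).
const sampleEvents: RemediationEvent[] = [
  { type: 'plan.started', planId: 'plan-1', status: 'running', timestamp: '2024-01-01T02:00:00Z' },
  { type: 'target.started', planId: 'plan-1', targetId: 't1', status: 'running', timestamp: '2024-01-01T02:00:01Z' },
  { type: 'target.started', planId: 'plan-1', targetId: 't2', status: 'running', timestamp: '2024-01-01T02:00:02Z' },
  { type: 'target.started', planId: 'plan-2', targetId: 't9', status: 'running', timestamp: '2024-01-01T02:00:03Z' },
  { type: 'target.completed', planId: 'plan-1', targetId: 't1', status: 'succeeded', timestamp: '2024-01-01T02:00:05Z' },
  { type: 'target.failed', planId: 'plan-1', targetId: 't2', status: 'failed', timestamp: '2024-01-01T02:00:06Z' },
  { type: 'plan.completed', planId: 'plan-1', status: 'partial_success', timestamp: '2024-01-01T02:00:07Z' },
];

const progress = sampleEvents.reduce(applyEvent, initialProgress('plan-1'));
console.log(progress);
```

Keeping the reducer pure means the same function can rebuild progress by replaying events after a reconnect, and it can be unit-tested without a live socket.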