Drift Remediation Automation
Overview
Drift Remediation Automation extends the existing drift detection system with intelligent, policy-driven automatic remediation. While drift detection identifies divergence between expected and actual state, remediation automation closes the loop by taking corrective action without manual intervention.
The design balances automation with safety through configurable remediation strategies, severity-based prioritization, rate and blast-radius limits, and a complete audit trail.
Design Principles
- Safety First: Auto-remediation never executes without explicit policy authorization
- Gradual Escalation: Start with notifications, escalate to remediation based on drift age/severity
- Deterministic Actions: Remediation produces identical outcomes for identical drift states
- Full Auditability: Every remediation action generates signed evidence packets
- Blast Radius Control: Limit concurrent remediations; prevent cascading failures
- Human Override: Operators can pause, cancel, or override any remediation
Architecture
Component Overview
┌─────────────────────────────────────────────────────────────────────┐
│                      Drift Remediation System                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────┐    ┌──────────────────┐    ┌───────────────┐   │
│  │  DriftDetector  │───▶│ RemediationEngine│───▶│ ActionExecutor│   │
│  │   (existing)    │    │                  │    │               │   │
│  └─────────────────┘    └──────────────────┘    └───────────────┘   │
│           │                      │                      │           │
│           ▼                      ▼                      ▼           │
│  ┌─────────────────┐    ┌──────────────────┐    ┌───────────────┐   │
│  │ SeverityScorer  │    │ PolicyEvaluator  │    │ EvidenceWriter│   │
│  │                 │    │                  │    │               │   │
│  └─────────────────┘    └──────────────────┘    └───────────────┘   │
│           │                      │                      │           │
│           ▼                      ▼                      ▼           │
│  ┌─────────────────┐    ┌──────────────────┐    ┌───────────────┐   │
│  │  AlertRouter    │    │ReconcileScheduler│    │ MetricsEmitter│   │
│  │                 │    │                  │    │               │   │
│  └─────────────────┘    └──────────────────┘    └───────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Key Components
1. SeverityScorer
Calculates drift severity based on multiple weighted factors:
public sealed record DriftSeverity
{
    public DriftSeverityLevel Level { get; init; }  // Critical, High, Medium, Low, Info
    public int Score { get; init; }                 // 0-100 numeric score
    public ImmutableArray<SeverityFactor> Factors { get; init; }
    public TimeSpan DriftAge { get; init; }
    public bool RequiresImmediate { get; init; }
}

public enum DriftSeverityLevel
{
    Info = 0,       // Cosmetic differences (labels, annotations)
    Low = 25,       // Non-critical drift (resource limits changed)
    Medium = 50,    // Functional drift (ports, volumes)
    High = 75,      // Security drift (image digest mismatch)
    Critical = 100  // Severe drift (container missing, wrong image)
}
Severity Factors:
| Factor | Weight | Description |
|---|---|---|
| Drift Type | 30% | Missing > Digest Mismatch > Status Mismatch > Unexpected |
| Drift Age | 25% | Older drift = higher severity |
| Environment Criticality | 20% | Production > Staging > Development |
| Component Criticality | 15% | Core services weighted higher |
| Blast Radius | 10% | Number of dependent services affected |
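As a worked example of how the weights combine, the sketch below scores a digest mismatch detected 45 minutes ago on a production target with five dependents, using the per-factor scores listed in the Severity Scoring Algorithm section. `WeightedScoreExample` and `Combine` are hypothetical names used only for illustration, and the weights are written as whole percentages so the arithmetic stays in integers until the final division.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of the documented API.
public static class WeightedScoreExample
{
    // Each factor is (score 0-100, weight in whole percent). Weights must sum to 100.
    public static int Combine(params (int Score, int WeightPercent)[] factors)
    {
        if (factors.Sum(f => f.WeightPercent) != 100)
            throw new ArgumentException("Weights must sum to 100.", nameof(factors));

        // Sum score * weight in integers, then divide by 100 and round once.
        var weighted = factors.Sum(f => f.Score * f.WeightPercent);
        return (int)Math.Round(weighted / 100.0, MidpointRounding.AwayFromZero);
    }

    public static int DigestMismatchInProduction() => Combine(
        (80, 30),   // Drift Type: digest mismatch
        (50, 25),   // Drift Age: between 30 and 60 minutes
        (100, 20),  // Environment Criticality: production
        (50, 15),   // Component Criticality: default when no value is configured
        (60, 10));  // Blast Radius: 3 to 9 dependents
}
```

The contributions are 24 + 12.5 + 20 + 7.5 + 6, so `DigestMismatchInProduction()` returns 70.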
2. RemediationPolicy
Defines when and how to remediate drift:
public sealed record RemediationPolicy
{
    public Guid Id { get; init; }
    public string Name { get; init; }
    public Guid EnvironmentId { get; init; }

    // Triggers
    public RemediationTrigger Trigger { get; init; }
    public DriftSeverityLevel MinimumSeverity { get; init; }
    public TimeSpan MinimumDriftAge { get; init; }
    public TimeSpan MaximumDriftAge { get; init; }  // Escalate to manual if exceeded

    // Actions
    public RemediationAction Action { get; init; }
    public RemediationStrategy Strategy { get; init; }

    // Safety limits
    public int MaxConcurrentRemediations { get; init; }
    public int MaxRemediationsPerHour { get; init; }
    public TimeSpan CooldownPeriod { get; init; }

    // Schedule
    public RemediationWindow? MaintenanceWindow { get; init; }
    public ImmutableArray<DayOfWeek> AllowedDays { get; init; }
    public TimeOnly AllowedStartTime { get; init; }
    public TimeOnly AllowedEndTime { get; init; }

    // Notifications
    public NotificationConfig Notifications { get; init; }
}

public enum RemediationTrigger
{
    Immediate,           // Remediate as soon as detected
    Scheduled,           // Wait for maintenance window
    AgeThreshold,        // Remediate after drift exceeds age
    SeverityEscalation,  // Remediate when severity increases
    Manual               // Notification only, human initiates
}

public enum RemediationAction
{
    NotifyOnly,  // Alert but don't act
    Reconcile,   // Restore to expected state
    Rollback,    // Rollback to previous known-good release
    Scale,       // Adjust replica count
    Restart,     // Restart containers
    Quarantine   // Isolate drifted targets from traffic
}

public enum RemediationStrategy
{
    AllAtOnce,  // Remediate all drifted targets simultaneously
    Rolling,    // Remediate one at a time with health checks
    Canary,     // Remediate one, verify, then proceed
    BlueGreen   // Deploy to standby, switch traffic
}
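The trigger fields of a policy can be read as a small decision gate: drift below `MinimumSeverity` is ignored, drift younger than `MinimumDriftAge` waits, and drift older than `MaximumDriftAge` is handed to a human. The following is a minimal sketch of that reading. `PolicyDecision` and `PolicyGate` are hypothetical names, and the exact boundary handling (inclusive or exclusive) is an assumption.

```csharp
using System;

// Repeated from the policy model so the sketch compiles on its own.
public enum DriftSeverityLevel { Info = 0, Low = 25, Medium = 50, High = 75, Critical = 100 }

// Hypothetical decision type; not part of the documented model.
public enum PolicyDecision { Ignore, Wait, Remediate, EscalateToManual }

public static class PolicyGate
{
    public static PolicyDecision Evaluate(
        DriftSeverityLevel severity,
        TimeSpan driftAge,
        DriftSeverityLevel minimumSeverity,
        TimeSpan minimumDriftAge,
        TimeSpan maximumDriftAge)
    {
        // Below the policy's severity floor: nothing to do.
        if (severity < minimumSeverity) return PolicyDecision.Ignore;

        // Younger than the policy's minimum age: re-evaluate later.
        if (driftAge < minimumDriftAge) return PolicyDecision.Wait;

        // Older than the policy's maximum age: automation steps aside.
        if (driftAge > maximumDriftAge) return PolicyDecision.EscalateToManual;

        return PolicyDecision.Remediate;
    }
}
```

With a minimum age of 15 minutes and a maximum of 24 hours, a `High` drift that is one hour old is remediated, while the same drift at 25 hours is escalated.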
3. RemediationEngine
Orchestrates the remediation process:
public sealed class RemediationEngine
{
    // Dependencies; constructor injection omitted for brevity.
    // Interface names are illustrative.
    private readonly ISeverityScorer _severityScorer;
    private readonly IEvidenceWriter _evidenceWriter;

    public async Task<RemediationPlan> CreatePlanAsync(
        DriftReport driftReport,
        RemediationPolicy policy,
        CancellationToken ct)
    {
        // 1. Score severity for each drift item
        var scoredDrifts = await _severityScorer.ScoreAsync(driftReport.Items, ct);

        // 2. Filter by policy thresholds
        var actionable = scoredDrifts
            .Where(d => d.Severity.Level >= policy.MinimumSeverity)
            .Where(d => d.Severity.DriftAge >= policy.MinimumDriftAge)
            .ToImmutableArray();

        // 3. Check maintenance window
        if (!IsWithinMaintenanceWindow(policy))
            return RemediationPlan.Deferred(actionable, policy.MaintenanceWindow);

        // 4. Check rate limits
        var allowed = await CheckRateLimitsAsync(actionable, policy, ct);

        // 5. Build execution plan
        return BuildExecutionPlan(allowed, policy);
    }

    public async Task<RemediationResult> ExecuteAsync(
        RemediationPlan plan,
        CancellationToken ct)
    {
        // Blast radius control: the semaphore caps in-flight remediations
        using var semaphore = new SemaphoreSlim(plan.Policy.MaxConcurrentRemediations);

        // Appended only between batches, so a plain list is safe here
        var results = new List<TargetRemediationResult>();

        foreach (var batch in plan.Batches)
        {
            var tasks = batch.Targets.Select(async target =>
            {
                await semaphore.WaitAsync(ct);
                try
                {
                    return await RemediateTargetAsync(target, plan, ct);
                }
                finally
                {
                    semaphore.Release();
                }
            });

            var batchResults = await Task.WhenAll(tasks);
            results.AddRange(batchResults);

            // Health check between batches for rolling strategy
            if (plan.Policy.Strategy == RemediationStrategy.Rolling)
            {
                await VerifyBatchHealthAsync(batchResults, ct);
            }
        }

        // Generate evidence
        var evidence = await _evidenceWriter.WriteAsync(plan, results, ct);

        return new RemediationResult
        {
            PlanId = plan.Id,
            TargetResults = results.ToImmutableArray(),
            EvidencePacketId = evidence.Id  // Assumes the writer returns the stored packet
        };
    }
}
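`BuildExecutionPlan` is not shown above. One way it might turn a strategy into batches is sketched below: `AllAtOnce` yields a single batch, `Rolling` yields one target per batch, and `Canary` yields a one-target batch followed by the remainder. `BatchPlanner` and `Split` are hypothetical names, and treating `BlueGreen` as a single batch is a simplifying assumption of this sketch rather than documented behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Repeated from the policy model so the sketch compiles on its own.
public enum RemediationStrategy { AllAtOnce, Rolling, Canary, BlueGreen }

// Hypothetical helper; not part of the documented API.
public static class BatchPlanner
{
    public static IReadOnlyList<IReadOnlyList<T>> Split<T>(
        IReadOnlyList<T> targets,
        RemediationStrategy strategy)
    {
        if (targets.Count == 0)
            return Array.Empty<IReadOnlyList<T>>();

        switch (strategy)
        {
            case RemediationStrategy.Rolling:
                // One target per batch, so health can be verified after each.
                return targets.Select(t => (IReadOnlyList<T>)new[] { t }).ToList();

            case RemediationStrategy.Canary:
                // First target alone, then everything else once it is verified.
                var batches = new List<IReadOnlyList<T>> { new[] { targets[0] } };
                if (targets.Count > 1)
                    batches.Add(targets.Skip(1).ToList());
                return batches;

            default:
                // AllAtOnce, and BlueGreen in this simplified sketch: a single batch.
                return new List<IReadOnlyList<T>> { targets };
        }
    }
}
```

Within each batch, `MaxConcurrentRemediations` still bounds how many targets run at once, so the two limits compose.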
4. ReconcileScheduler
Manages scheduled reconciliation runs:
public sealed class ReconcileScheduler
{
    private readonly TimeProvider _timeProvider;
    private readonly IRemediationPolicyStore _policyStore;
    private readonly IDriftDetector _driftDetector;
    private readonly RemediationEngine _engine;
    // Interface names for the two services below are illustrative.
    private readonly IInventoryService _inventoryService;
    private readonly IReleaseService _releaseService;

    public async Task RunScheduledReconciliationAsync(CancellationToken ct)
    {
        var policies = await _policyStore.GetScheduledPoliciesAsync(ct);

        foreach (var policy in policies)
        {
            if (!IsWithinWindow(policy))
                continue;

            // Detect drift
            var inventory = await _inventoryService.GetCurrentAsync(policy.EnvironmentId, ct);
            var expected = await _releaseService.GetExpectedStateAsync(policy.EnvironmentId, ct);
            var drift = _driftDetector.Detect(inventory, expected);

            if (drift.HasDrift)
            {
                var plan = await _engine.CreatePlanAsync(drift, policy, ct);

                // CreatePlanAsync can return a deferred plan; only execute plans that are ready.
                if (plan.Status != RemediationPlanStatus.Deferred)
                    await _engine.ExecuteAsync(plan, ct);
            }
        }
    }
}
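The window check that gates each policy is not shown. A minimal sketch, assuming the window is defined by the policy's `AllowedDays`, `AllowedStartTime`, and `AllowedEndTime` and evaluated in UTC, could look like this. `RemediationWindowCheck` is a hypothetical name; `TimeOnly.IsBetween` treats the start as inclusive and the end as exclusive, and it also handles windows that wrap past midnight.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; not part of the documented API.
public static class RemediationWindowCheck
{
    public static bool IsWithinWindow(
        DateTimeOffset now,
        IReadOnlyCollection<DayOfWeek> allowedDays,
        TimeOnly start,
        TimeOnly end)
    {
        // Assumption: windows are expressed in UTC.
        var utc = now.ToUniversalTime();

        // The day is taken at the current instant; a window that wraps past
        // midnight is therefore checked against the day it ends on.
        if (!allowedDays.Contains(utc.DayOfWeek))
            return false;

        var time = TimeOnly.FromDateTime(utc.DateTime);
        return time.IsBetween(start, end);
    }
}
```

In production code, `now` would come from the injected `TimeProvider` (`_timeProvider.GetUtcNow()`), which keeps the check deterministic under test.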
Data Models
RemediationPlan
public sealed record RemediationPlan
{
    public Guid Id { get; init; }
    public Guid DriftReportId { get; init; }
    public RemediationPolicy Policy { get; init; }
    public RemediationPlanStatus Status { get; init; }
    public ImmutableArray<RemediationBatch> Batches { get; init; }
    public DateTimeOffset CreatedAt { get; init; }
    public DateTimeOffset? ScheduledFor { get; init; }
    public DateTimeOffset? StartedAt { get; init; }
    public DateTimeOffset? CompletedAt { get; init; }
    public string? DeferralReason { get; init; }
}

public enum RemediationPlanStatus
{
    Created,
    Scheduled,
    Deferred,        // Waiting for maintenance window
    Running,
    Paused,          // Human intervention requested
    Succeeded,
    PartialSuccess,  // Some targets remediated, some failed
    Failed,
    Cancelled
}

public sealed record RemediationBatch
{
    public int Order { get; init; }
    public ImmutableArray<RemediationTarget> Targets { get; init; }
    public TimeSpan? DelayAfter { get; init; }
    public bool RequiresHealthCheck { get; init; }
}

public sealed record RemediationTarget
{
    public Guid TargetId { get; init; }
    public string TargetName { get; init; }
    public DriftItem Drift { get; init; }
    public DriftSeverity Severity { get; init; }
    public RemediationAction Action { get; init; }
    public string? ActionPayload { get; init; }  // Compose file, rollback digest, etc.
}
RemediationResult
public sealed record RemediationResult
{
    public Guid PlanId { get; init; }
    public RemediationResultStatus Status { get; init; }
    public ImmutableArray<TargetRemediationResult> TargetResults { get; init; }
    public Guid EvidencePacketId { get; init; }
    public TimeSpan Duration { get; init; }
    public RemediationMetrics Metrics { get; init; }
}

public sealed record TargetRemediationResult
{
    public Guid TargetId { get; init; }
    public RemediationTargetStatus Status { get; init; }
    public string? Error { get; init; }
    public TimeSpan Duration { get; init; }
    public string? PreviousDigest { get; init; }
    public string? CurrentDigest { get; init; }
    public ImmutableArray<string> Logs { get; init; }
}

public sealed record RemediationMetrics
{
    public int TotalTargets { get; init; }
    public int Succeeded { get; init; }
    public int Failed { get; init; }
    public int Skipped { get; init; }
    public TimeSpan TotalDuration { get; init; }
    public TimeSpan AverageTargetDuration { get; init; }
}
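The metrics and the overall outcome can be derived from the per-target results. The sketch below shows one way to do that. `TargetStatus`, `Outcome`, `Summary`, and `ResultSummarizer` are hypothetical stand-ins, since `RemediationTargetStatus` and `RemediationResultStatus` are not defined in this document, and treating `TotalDuration` as the sum of target durations (rather than wall-clock time) is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the undocumented status enums.
public enum TargetStatus { Succeeded, Failed, Skipped }
public enum Outcome { Succeeded, PartialSuccess, Failed }

public sealed record Summary(
    int TotalTargets, int Succeeded, int Failed, int Skipped,
    TimeSpan TotalDuration, TimeSpan AverageTargetDuration, Outcome Status);

public static class ResultSummarizer
{
    public static Summary Summarize(
        IReadOnlyCollection<(TargetStatus Status, TimeSpan Duration)> results)
    {
        var succeeded = results.Count(r => r.Status == TargetStatus.Succeeded);
        var failed = results.Count(r => r.Status == TargetStatus.Failed);
        var skipped = results.Count(r => r.Status == TargetStatus.Skipped);

        // Sum in ticks to avoid floating-point drift.
        var total = TimeSpan.FromTicks(results.Sum(r => r.Duration.Ticks));
        var average = results.Count == 0
            ? TimeSpan.Zero
            : TimeSpan.FromTicks(total.Ticks / results.Count);

        // No failures: success. Mixed: partial. Failures with no successes: failed.
        var status = failed == 0 ? Outcome.Succeeded
            : succeeded > 0 ? Outcome.PartialSuccess
            : Outcome.Failed;

        return new Summary(results.Count, succeeded, failed, skipped, total, average, status);
    }
}
```

Two successes (2 s and 4 s) plus one failure (6 s) summarize to a partial success with a total of 12 s and an average of 4 s per target.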
API Design
REST Endpoints
# Policies
POST /api/v1/remediation/policies # Create policy
GET /api/v1/remediation/policies # List policies
GET /api/v1/remediation/policies/{id} # Get policy
PUT /api/v1/remediation/policies/{id} # Update policy
DELETE /api/v1/remediation/policies/{id} # Delete policy
POST /api/v1/remediation/policies/{id}/activate # Activate policy
POST /api/v1/remediation/policies/{id}/deactivate # Deactivate policy
# Plans
GET /api/v1/remediation/plans # List plans
GET /api/v1/remediation/plans/{id} # Get plan details
POST /api/v1/remediation/plans/{id}/execute # Execute deferred plan
POST /api/v1/remediation/plans/{id}/pause # Pause running plan
POST /api/v1/remediation/plans/{id}/resume # Resume paused plan
POST /api/v1/remediation/plans/{id}/cancel # Cancel plan
# On-demand
POST /api/v1/remediation/preview # Preview remediation (dry-run)
POST /api/v1/remediation/execute # Execute immediate remediation
# History
GET /api/v1/remediation/history # List remediation history
GET /api/v1/remediation/history/{id} # Get remediation result
GET /api/v1/remediation/history/{id}/evidence # Get evidence packet
WebSocket Events
// Real-time remediation updates
interface RemediationEvent {
  type: 'plan.created' | 'plan.started' | 'plan.completed' |
        'target.started' | 'target.completed' | 'target.failed';
  planId: string;
  targetId?: string;
  status: string;
  progress?: number;
  message?: string;
  timestamp: string;
}
Severity Scoring Algorithm
public sealed class SeverityScorer
{
    private readonly SeverityScoringConfig _config;

    public DriftSeverity Score(DriftItem drift, ScoringContext context)
    {
        var factors = new List<SeverityFactor>();
        var score = 0.0;

        // Factor 1: Drift Type (30%)
        var typeScore = drift.Type switch
        {
            DriftType.Missing => 100,
            DriftType.DigestMismatch => 80,
            DriftType.StatusMismatch => 50,
            DriftType.Unexpected => 30,
            _ => 10
        };
        factors.Add(new SeverityFactor("DriftType", typeScore, 0.30));
        score += typeScore * 0.30;

        // Factor 2: Drift Age (25%)
        var ageScore = CalculateAgeScore(drift.DetectedAt, context.Now);
        factors.Add(new SeverityFactor("DriftAge", ageScore, 0.25));
        score += ageScore * 0.25;

        // Factor 3: Environment Criticality (20%)
        var envScore = context.Environment.Criticality switch
        {
            EnvironmentCriticality.Production => 100,
            EnvironmentCriticality.Staging => 60,
            EnvironmentCriticality.Development => 20,
            _ => 10
        };
        factors.Add(new SeverityFactor("EnvironmentCriticality", envScore, 0.20));
        score += envScore * 0.20;

        // Factor 4: Component Criticality (15%); defaults to 50 when unconfigured
        var componentScore = context.ComponentCriticality.GetValueOrDefault(drift.ComponentId, 50);
        factors.Add(new SeverityFactor("ComponentCriticality", componentScore, 0.15));
        score += componentScore * 0.15;

        // Factor 5: Blast Radius (10%)
        var blastScore = CalculateBlastRadius(drift, context.DependencyGraph);
        factors.Add(new SeverityFactor("BlastRadius", blastScore, 0.10));
        score += blastScore * 0.10;

        // Round once (rather than truncate) so Level, Score, and RequiresImmediate agree
        var total = (int)Math.Round(score, MidpointRounding.AwayFromZero);

        return new DriftSeverity
        {
            Level = ScoreToLevel(total),
            Score = total,
            Factors = factors.ToImmutableArray(),
            DriftAge = context.Now - drift.DetectedAt,
            RequiresImmediate = total >= 90
        };
    }

    private int CalculateAgeScore(DateTimeOffset detectedAt, DateTimeOffset now)
    {
        var age = now - detectedAt;
        return age.TotalMinutes switch
        {
            < 5 => 10,     // Under 5 minutes: very fresh, low urgency
            < 30 => 30,    // Under 30 minutes: recent
            < 60 => 50,    // Under 1 hour
            < 240 => 70,   // Under 4 hours
            < 1440 => 85,  // Under 24 hours
            _ => 100       // 24 hours or more: critical
        };
    }

    private int CalculateBlastRadius(DriftItem drift, DependencyGraph graph)
    {
        var dependents = graph.GetDependents(drift.ComponentId);
        return dependents.Count switch
        {
            0 => 10,
            < 3 => 30,
            < 10 => 60,
            < 25 => 80,
            _ => 100
        };
    }
}
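`Score` calls `ScoreToLevel`, which is not shown. One possible mapping is sketched below. The band boundaries are assumptions, not documented values: they reuse the numeric values of `DriftSeverityLevel` as lower bounds, except that `Critical` starts at 90 so that it lines up with the `RequiresImmediate` threshold. `SeverityBands` is a hypothetical container class.

```csharp
using System;

// Repeated from the SeverityScorer component so the sketch compiles on its own.
public enum DriftSeverityLevel { Info = 0, Low = 25, Medium = 50, High = 75, Critical = 100 }

// Hypothetical container; the thresholds below are assumptions.
public static class SeverityBands
{
    public static DriftSeverityLevel ScoreToLevel(int score) => score switch
    {
        < 0 or > 100 => throw new ArgumentOutOfRangeException(nameof(score)),
        >= 90 => DriftSeverityLevel.Critical,  // matches the RequiresImmediate cut-off
        >= 75 => DriftSeverityLevel.High,
        >= 50 => DriftSeverityLevel.Medium,
        >= 25 => DriftSeverityLevel.Low,
        _ => DriftSeverityLevel.Info
    };
}
```

Under this mapping a score of 70 is `Medium`, so a policy with `MinimumSeverity` set to `High` would not act on it. Whatever boundaries are chosen, they belong in the golden tests, because a one-point change can move a drift item across a policy threshold.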
Safety Mechanisms
1. Rate Limiting
public sealed class RemediationRateLimiter
{
    private readonly TimeProvider _timeProvider;

    public async Task<RateLimitResult> CheckAsync(
        RemediationPolicy policy,
        int requestedCount,
        CancellationToken ct)
    {
        // Hourly budget: reject if this request would push the policy over its limit
        var hourlyCount = await GetHourlyRemediationCountAsync(policy.Id, ct);
        if (hourlyCount + requestedCount > policy.MaxRemediationsPerHour)
        {
            return RateLimitResult.Exceeded(
                $"Hourly limit exceeded: {hourlyCount} used + {requestedCount} requested > {policy.MaxRemediationsPerHour}");
        }

        // Cooldown: enforce a minimum gap since the last completed remediation
        var lastRemediation = await GetLastRemediationAsync(policy.Id, ct);
        if (lastRemediation != null)
        {
            var timeSinceLast = _timeProvider.GetUtcNow() - lastRemediation.CompletedAt;
            if (timeSinceLast < policy.CooldownPeriod)
            {
                return RateLimitResult.Cooldown(policy.CooldownPeriod - timeSinceLast);
            }
        }

        return RateLimitResult.Allowed(requestedCount);
    }
}
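How the hourly count is stored is left open above. For a single-process deployment, a sliding one-hour window can be kept in memory, as in the sketch below. `SlidingHourCounter` is a hypothetical class, it assumes timestamps are recorded in chronological order, and a multi-instance deployment would need a shared store instead.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory counter; assumes Record is called in chronological order.
public sealed class SlidingHourCounter
{
    private static readonly TimeSpan Window = TimeSpan.FromHours(1);
    private readonly Queue<DateTimeOffset> _timestamps = new();
    private readonly object _gate = new();

    // Record one completed remediation.
    public void Record(DateTimeOffset completedAt)
    {
        lock (_gate)
        {
            _timestamps.Enqueue(completedAt);
        }
    }

    // Number of remediations recorded within the hour ending at 'now'.
    public int CountWithinLastHour(DateTimeOffset now)
    {
        lock (_gate)
        {
            // Drop entries that have aged out of the window.
            while (_timestamps.Count > 0 && now - _timestamps.Peek() >= Window)
            {
                _timestamps.Dequeue();
            }
            return _timestamps.Count;
        }
    }
}
```

Passing `now` in explicitly, rather than reading the system clock, keeps the counter testable with a fixed `TimeProvider`.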
2. Blast Radius Control
// Maximum share of targets that can be remediated in one operation (whole-number percent)
public const int MaxTargetPercentage = 25;

// Never remediate more than this many targets at once, regardless of fleet size
public const int AbsoluteMaxTargets = 10;

// Minimum share of healthy targets required before remediation (fraction, 0.0-1.0)
public const double MinHealthyPercentage = 0.75;
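These three limits combine into a single cap on how many targets one operation may touch. A minimal sketch of that calculation follows. `BlastRadiusLimits` and `MaxTargetsPerOperation` are hypothetical names, and two behaviors are assumptions: the percentage is rounded down, and a healthy fleet is always allowed at least one target so that small environments are not locked out.

```csharp
using System;

// Hypothetical helper built on the constants above.
public static class BlastRadiusLimits
{
    public const int MaxTargetPercentage = 25;        // whole-number percent
    public const int AbsoluteMaxTargets = 10;
    public const double MinHealthyPercentage = 0.75;  // fraction

    public static int MaxTargetsPerOperation(int totalTargets, int healthyTargets)
    {
        if (totalTargets <= 0)
            return 0;

        // Refuse to act when the fleet is already below the healthy floor.
        var requiredHealthy = Math.Ceiling(totalTargets * MinHealthyPercentage);
        if (healthyTargets < requiredHealthy)
            return 0;

        // Percentage cap (integer division rounds down), then the absolute cap.
        var byPercentage = totalTargets * MaxTargetPercentage / 100;
        return Math.Clamp(byPercentage, 1, AbsoluteMaxTargets);
    }
}
```

For a fleet of 100 healthy targets the percentage cap is 25 but the absolute cap wins, giving 10. For 20 targets with 18 healthy the cap is 5, and with only 14 healthy nothing is remediated because 15 are required.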
3. Circuit Breaker
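public sealed class RemediationCircuitBreaker
{
    private readonly object _gate = new();  // Guards state shared by concurrent remediations
    private readonly TimeProvider _timeProvider;
    private readonly CircuitBreakerConfig _config;  // Type name is illustrative
    private readonly ILogger<RemediationCircuitBreaker> _logger;
    private int _consecutiveFailures;
    private DateTimeOffset? _openedAt;

    public bool IsOpen
    {
        get
        {
            lock (_gate)
            {
                return _openedAt != null &&
                    (_timeProvider.GetUtcNow() - _openedAt.Value) < _config.OpenDuration;
            }
        }
    }

    public void RecordSuccess()
    {
        lock (_gate)
        {
            _consecutiveFailures = 0;
            _openedAt = null;
        }
    }

    public void RecordFailure()
    {
        lock (_gate)
        {
            _consecutiveFailures++;
            if (_consecutiveFailures >= _config.FailureThreshold)
            {
                _openedAt = _timeProvider.GetUtcNow();
                _logger.LogWarning("Remediation circuit breaker opened after {Failures} failures",
                    _consecutiveFailures);
            }
        }
    }
}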
Metrics & Observability
Prometheus Metrics
# Counters
stella_remediation_plans_total{environment, policy, status}
stella_remediation_targets_total{environment, action, status}
stella_remediation_rate_limit_hits_total{policy}
# Histograms
stella_remediation_plan_duration_seconds{environment, strategy}
stella_remediation_target_duration_seconds{environment, action}
stella_remediation_detection_to_action_seconds{environment, severity}
# Gauges
stella_drift_items_pending_remediation{environment, severity}
stella_remediation_circuit_breaker_open{policy}
Structured Logging
{
  "event": "remediation.target.completed",
  "plan_id": "abc-123",
  "target_id": "target-456",
  "environment": "production",
  "action": "reconcile",
  "drift_type": "digest_mismatch",
  "severity": "high",
  "duration_ms": 4532,
  "status": "succeeded",
  "previous_digest": "sha256:abc...",
  "current_digest": "sha256:def...",
  "correlation_id": "xyz-789"
}
Evidence Generation
Every remediation produces a sealed evidence packet:
public sealed record RemediationEvidence
{
    // What drifted
    public ImmutableArray<DriftItem> DetectedDrift { get; init; }
    public ImmutableArray<DriftSeverity> Severities { get; init; }

    // Policy applied
    public RemediationPolicy Policy { get; init; }

    // Plan executed
    public RemediationPlan Plan { get; init; }

    // Results
    public ImmutableArray<TargetRemediationResult> Results { get; init; }

    // Who/when
    public string InitiatedBy { get; init; }  // "system:auto" or user ID
    public DateTimeOffset InitiatedAt { get; init; }
    public DateTimeOffset CompletedAt { get; init; }

    // Artifacts
    public ImmutableArray<string> GeneratedArtifacts { get; init; }  // Compose files, scripts
}
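The sealing mechanism itself is not specified here. As an illustration only, the sketch below serializes a packet to JSON and computes an HMAC-SHA256 over the bytes, so that any later change to the packet invalidates the seal. `EvidenceSealer` is a hypothetical class, and HMAC with a shared key stands in for whatever signing scheme the system actually uses (an asymmetric signature would be the usual choice when verifiers must not hold the signing key).

```csharp
using System;
using System.Security.Cryptography;
using System.Text.Json;

// Hypothetical helper; HMAC-SHA256 is a stand-in for the real signing scheme.
public static class EvidenceSealer
{
    // Serializes the packet and returns a hex-encoded HMAC over the bytes.
    public static string Seal<T>(T evidence, byte[] key)
    {
        byte[] payload = JsonSerializer.SerializeToUtf8Bytes(evidence);
        return Convert.ToHexString(HMACSHA256.HashData(key, payload));
    }

    // Recomputes the seal and compares in constant time.
    public static bool Verify<T>(T evidence, byte[] key, string sealHex)
    {
        byte[] expected = Convert.FromHexString(Seal(evidence, key));
        byte[] actual = Convert.FromHexString(sealHex);
        return CryptographicOperations.FixedTimeEquals(expected, actual);
    }
}
```

Verification only works if serialization is stable, so a production implementation would pin serializer options or use a canonical encoding rather than rely on defaults.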
Configuration
Default Policy Template
name: "production-auto-remediation"
environment_id: "prod-001"
trigger: age_threshold
minimum_severity: high
minimum_drift_age: "00:15:00"    # 15 minutes
maximum_drift_age: "1.00:00:00"  # 24 hours (TimeSpan d.hh:mm:ss form), then escalate to manual
action: reconcile
strategy: rolling
safety:
  max_concurrent_remediations: 2
  max_remediations_per_hour: 10
  cooldown_period: "00:05:00"    # 5 minutes between remediations
schedule:
  maintenance_window:
    enabled: true
    start: "02:00"
    end: "06:00"
    timezone: "UTC"
  allowed_days: [monday, tuesday, wednesday, thursday, friday]
notifications:
  on_plan_created: true
  on_remediation_started: true
  on_remediation_completed: true
  on_remediation_failed: true
  channels:
    - type: slack
      channel: "#ops-alerts"
    - type: email
      recipients: ["ops-team@example.com"]
Test Strategy
Unit Tests
- Severity scoring with various drift combinations
- Rate limiting logic
- Circuit breaker state transitions
- Policy evaluation with edge cases
Integration Tests
- Full remediation flow: detect → plan → execute → verify
- Maintenance window enforcement
- Rate limit enforcement across multiple requests
- Evidence packet generation and signing
Chaos Tests
- Agent failure during remediation
- Database unavailability during plan execution
- Concurrent remediation requests
- Clock skew handling
Golden Tests
- Deterministic severity scores for fixed inputs
- Deterministic plan generation for fixed drift reports
- Evidence packet structure validation
Migration Path
Phase 1: Foundation (Weeks 1-2)
- Severity scoring service
- Remediation policy model and store
- Basic API endpoints
Phase 2: Engine (Weeks 3-4)
- Remediation engine implementation
- Plan creation and execution
- Target remediation logic
Phase 3: Safety (Week 5)
- Rate limiting
- Circuit breaker
- Blast radius controls
Phase 4: Scheduling (Week 6)
- Maintenance window support
- Scheduled reconciliation
- Age-based escalation
Phase 5: Observability (Week 7)
- Metrics emission
- Evidence generation
- Alert integration
Phase 6: UI & Polish (Week 8)
- Web console integration
- Real-time updates
- Policy management UI