release orchestration strengthening
@@ -0,0 +1,749 @@
# Drift Remediation Automation

## Overview

Drift Remediation Automation extends the existing drift detection system with policy-driven automatic remediation. While drift detection identifies divergence between expected and actual state, remediation automation closes the loop by taking corrective action without manual intervention.

The design balances automation with safety through configurable remediation strategies, severity-based prioritization, and comprehensive audit trails.

---

## Design Principles

1. **Safety First**: Auto-remediation never executes without explicit policy authorization
2. **Gradual Escalation**: Start with notifications, escalate to remediation based on drift age/severity
3. **Deterministic Actions**: Remediation produces identical outcomes for identical drift states
4. **Full Auditability**: Every remediation action generates signed evidence packets
5. **Blast Radius Control**: Limit concurrent remediations; prevent cascading failures
6. **Human Override**: Operators can pause, cancel, or override any remediation

---

## Architecture

### Component Overview

```
┌───────────────────────────────────────────────────────────────────────┐
│                       Drift Remediation System                        │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌─────────────────┐    ┌────────────────────┐    ┌────────────────┐  │
│  │ DriftDetector   │───▶│ RemediationEngine  │───▶│ ActionExecutor │  │
│  │ (existing)      │    │                    │    │                │  │
│  └─────────────────┘    └────────────────────┘    └────────────────┘  │
│           │                        │                       │          │
│           ▼                        ▼                       ▼          │
│  ┌─────────────────┐    ┌────────────────────┐    ┌────────────────┐  │
│  │ SeverityScorer  │    │ PolicyEvaluator    │    │ EvidenceWriter │  │
│  │                 │    │                    │    │                │  │
│  └─────────────────┘    └────────────────────┘    └────────────────┘  │
│           │                        │                       │          │
│           ▼                        ▼                       ▼          │
│  ┌─────────────────┐    ┌────────────────────┐    ┌────────────────┐  │
│  │ AlertRouter     │    │ ReconcileScheduler │    │ MetricsEmitter │  │
│  │                 │    │                    │    │                │  │
│  └─────────────────┘    └────────────────────┘    └────────────────┘  │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
```

### Key Components

#### 1. SeverityScorer

Calculates drift severity based on multiple weighted factors:

```csharp
public sealed record DriftSeverity
{
    public DriftSeverityLevel Level { get; init; }        // Critical, High, Medium, Low, Info
    public int Score { get; init; }                       // 0-100 numeric score
    public ImmutableArray<SeverityFactor> Factors { get; init; }
    public TimeSpan DriftAge { get; init; }
    public bool RequiresImmediate { get; init; }
}

public enum DriftSeverityLevel
{
    Info = 0,       // Cosmetic differences (labels, annotations)
    Low = 25,       // Non-critical drift (resource limits changed)
    Medium = 50,    // Functional drift (ports, volumes)
    High = 75,      // Security drift (image digest mismatch)
    Critical = 100  // Severe drift (container missing, wrong image)
}
```

**Severity Factors:**

| Factor | Weight | Description |
|--------|--------|-------------|
| Drift Type | 30% | Missing > Digest Mismatch > Status Mismatch > Unexpected |
| Drift Age | 25% | Older drift = higher severity |
| Environment Criticality | 20% | Production > Staging > Development |
| Component Criticality | 15% | Core services weighted higher |
| Blast Radius | 10% | Number of dependent services affected |

#### 2. RemediationPolicy

Defines when and how to remediate drift:

```csharp
public sealed record RemediationPolicy
{
    public Guid Id { get; init; }
    public string Name { get; init; }
    public Guid EnvironmentId { get; init; }

    // Triggers
    public RemediationTrigger Trigger { get; init; }
    public DriftSeverityLevel MinimumSeverity { get; init; }
    public TimeSpan MinimumDriftAge { get; init; }
    public TimeSpan MaximumDriftAge { get; init; }  // Escalate to manual if exceeded

    // Actions
    public RemediationAction Action { get; init; }
    public RemediationStrategy Strategy { get; init; }

    // Safety limits
    public int MaxConcurrentRemediations { get; init; }
    public int MaxRemediationsPerHour { get; init; }
    public TimeSpan CooldownPeriod { get; init; }

    // Schedule
    public RemediationWindow? MaintenanceWindow { get; init; }
    public ImmutableArray<DayOfWeek> AllowedDays { get; init; }
    public TimeOnly AllowedStartTime { get; init; }
    public TimeOnly AllowedEndTime { get; init; }

    // Notifications
    public NotificationConfig Notifications { get; init; }
}

public enum RemediationTrigger
{
    Immediate,           // Remediate as soon as detected
    Scheduled,           // Wait for maintenance window
    AgeThreshold,        // Remediate after drift exceeds age
    SeverityEscalation,  // Remediate when severity increases
    Manual               // Notification only, human initiates
}

public enum RemediationAction
{
    NotifyOnly,  // Alert but don't act
    Reconcile,   // Restore to expected state
    Rollback,    // Rollback to previous known-good release
    Scale,       // Adjust replica count
    Restart,     // Restart containers
    Quarantine   // Isolate drifted targets from traffic
}

public enum RemediationStrategy
{
    AllAtOnce,  // Remediate all drifted targets simultaneously
    Rolling,    // Remediate one at a time with health checks
    Canary,     // Remediate one, verify, then proceed
    BlueGreen   // Deploy to standby, switch traffic
}
```

#### 3. RemediationEngine

Orchestrates the remediation process:

```csharp
public sealed class RemediationEngine
{
    public async Task<RemediationPlan> CreatePlanAsync(
        DriftReport driftReport,
        RemediationPolicy policy,
        CancellationToken ct)
    {
        // 1. Score severity for each drift item
        var scoredDrifts = await _severityScorer.ScoreAsync(driftReport.Items, ct);

        // 2. Filter by policy thresholds
        var actionable = scoredDrifts
            .Where(d => d.Severity.Level >= policy.MinimumSeverity)
            .Where(d => d.Severity.DriftAge >= policy.MinimumDriftAge)
            .ToImmutableArray();

        // 3. Check maintenance window
        if (!IsWithinMaintenanceWindow(policy))
            return RemediationPlan.Deferred(actionable, policy.MaintenanceWindow);

        // 4. Check rate limits
        var allowed = await CheckRateLimitsAsync(actionable, policy, ct);

        // 5. Build execution plan
        return BuildExecutionPlan(allowed, policy);
    }

    public async Task<RemediationResult> ExecuteAsync(
        RemediationPlan plan,
        CancellationToken ct)
    {
        // Execute with blast radius control: the semaphore caps concurrent targets
        using var semaphore = new SemaphoreSlim(plan.Policy.MaxConcurrentRemediations);

        // Results are appended only after each batch has completed, so a plain list suffices
        var results = new List<TargetRemediationResult>();

        foreach (var batch in plan.Batches)
        {
            var tasks = batch.Targets.Select(async target =>
            {
                await semaphore.WaitAsync(ct);
                try
                {
                    return await RemediateTargetAsync(target, plan, ct);
                }
                finally
                {
                    semaphore.Release();
                }
            });

            var batchResults = await Task.WhenAll(tasks);
            results.AddRange(batchResults);

            // Health check between batches for rolling strategy
            if (plan.Policy.Strategy == RemediationStrategy.Rolling)
            {
                await VerifyBatchHealthAsync(batchResults, ct);
            }
        }

        // Generate evidence
        var evidence = await _evidenceWriter.WriteAsync(plan, results, ct);

        return new RemediationResult(plan.Id, results.ToImmutableArray(), evidence);
    }
}
```
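The engine above calls `IsWithinMaintenanceWindow(policy)` without showing it. The following is a minimal sketch of one way that check could work against the policy's `AllowedDays`, `AllowedStartTime`, and `AllowedEndTime`. The class name `MaintenanceWindowCheck`, the explicit `now` parameter, evaluating the window in UTC, and treating an empty day list as "every day" are all assumptions made for illustration, not part of the design. `TimeOnly.IsBetween` includes the start, excludes the end, and handles windows that wrap past midnight.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the real engine may evaluate windows differently.
public static class MaintenanceWindowCheck
{
    // Returns true when `now` (evaluated in UTC) is on an allowed day
    // and inside the half-open window [start, end).
    public static bool IsWithinWindow(
        DateTimeOffset now,
        IReadOnlyCollection<DayOfWeek> allowedDays,
        TimeOnly start,
        TimeOnly end)
    {
        var utc = now.ToUniversalTime();

        // Assumption: an empty day list means every day is allowed.
        if (allowedDays.Count > 0 && !allowedDays.Contains(utc.DayOfWeek))
            return false;

        return TimeOnly.FromDateTime(utc.DateTime).IsBetween(start, end);
    }
}
```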

#### 4. ReconcileScheduler

Manages scheduled reconciliation runs:

```csharp
public sealed class ReconcileScheduler
{
    private readonly TimeProvider _timeProvider;
    private readonly IRemediationPolicyStore _policyStore;
    private readonly IDriftDetector _driftDetector;
    private readonly RemediationEngine _engine;
    // Used below; the interface names are assumed
    private readonly IInventoryService _inventoryService;
    private readonly IReleaseService _releaseService;

    public async Task RunScheduledReconciliationAsync(CancellationToken ct)
    {
        var policies = await _policyStore.GetScheduledPoliciesAsync(ct);

        foreach (var policy in policies)
        {
            if (!IsWithinWindow(policy))
                continue;

            // Detect drift
            var inventory = await _inventoryService.GetCurrentAsync(policy.EnvironmentId, ct);
            var expected = await _releaseService.GetExpectedStateAsync(policy.EnvironmentId, ct);
            var drift = _driftDetector.Detect(inventory, expected);

            if (drift.HasDrift)
            {
                var plan = await _engine.CreatePlanAsync(drift, policy, ct);

                // CreatePlanAsync may return a deferred plan; do not execute it here
                if (plan.Status != RemediationPlanStatus.Deferred)
                {
                    await _engine.ExecuteAsync(plan, ct);
                }
            }
        }
    }
}
```
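The scheduler above exposes a single reconciliation pass but does not say what drives it. One option, shown here as a sketch only, is a loop built on `System.Threading.PeriodicTimer`. The `ReconcileLoop` class, the `runOnce` delegate (standing in for `RunScheduledReconciliationAsync`), and the interval parameter are hypothetical. Exceptions thrown by `runOnce` are deliberately not caught, so a real host would need to decide whether to log and continue.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical driver loop; not part of the documented design.
public static class ReconcileLoop
{
    // Invokes runOnce after every tick until the token is cancelled.
    public static async Task RunAsync(
        Func<CancellationToken, Task> runOnce,
        TimeSpan interval,
        CancellationToken ct)
    {
        using var timer = new PeriodicTimer(interval);
        try
        {
            while (await timer.WaitForNextTickAsync(ct))
            {
                await runOnce(ct);
            }
        }
        catch (OperationCanceledException)
        {
            // Cancellation is the normal shutdown path.
        }
    }
}
```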

---

## Data Models

### RemediationPlan

```csharp
public sealed record RemediationPlan
{
    public Guid Id { get; init; }
    public Guid DriftReportId { get; init; }
    public RemediationPolicy Policy { get; init; }
    public RemediationPlanStatus Status { get; init; }
    public ImmutableArray<RemediationBatch> Batches { get; init; }
    public DateTimeOffset CreatedAt { get; init; }
    public DateTimeOffset? ScheduledFor { get; init; }
    public DateTimeOffset? StartedAt { get; init; }
    public DateTimeOffset? CompletedAt { get; init; }
    public string? DeferralReason { get; init; }
}

public enum RemediationPlanStatus
{
    Created,
    Scheduled,
    Deferred,        // Waiting for maintenance window
    Running,
    Paused,          // Human intervention requested
    Succeeded,
    PartialSuccess,  // Some targets remediated, some failed
    Failed,
    Cancelled
}

public sealed record RemediationBatch
{
    public int Order { get; init; }
    public ImmutableArray<RemediationTarget> Targets { get; init; }
    public TimeSpan? DelayAfter { get; init; }
    public bool RequiresHealthCheck { get; init; }
}

public sealed record RemediationTarget
{
    public Guid TargetId { get; init; }
    public string TargetName { get; init; }
    public DriftItem Drift { get; init; }
    public DriftSeverity Severity { get; init; }
    public RemediationAction Action { get; init; }
    public string? ActionPayload { get; init; }  // Compose file, rollback digest, etc.
}
```
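How targets are grouped into `RemediationBatch` entries is not spelled out. The sketch below shows one plausible mapping from strategy to batches, using bare target IDs instead of the full records. `SketchStrategy`, `BatchPlanner`, and `followUpBatchSize` are hypothetical names, and `BlueGreen` is left out because it is not batch-oriented. The mapping follows the enum comments: all at once yields a single batch, rolling yields one target per batch, and canary yields one target first and then the rest in chunks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for RemediationStrategy (BlueGreen omitted).
public enum SketchStrategy { AllAtOnce, Rolling, Canary }

public static class BatchPlanner
{
    // Splits targets into ordered batches according to the strategy.
    public static IReadOnlyList<IReadOnlyList<Guid>> Plan(
        IReadOnlyList<Guid> targets,
        SketchStrategy strategy,
        int followUpBatchSize)
    {
        if (followUpBatchSize < 1)
            throw new ArgumentOutOfRangeException(nameof(followUpBatchSize));
        if (targets.Count == 0)
            return Array.Empty<IReadOnlyList<Guid>>();

        switch (strategy)
        {
            case SketchStrategy.Rolling:
                // One target per batch, so health can be verified between each.
                return targets.Select(t => (IReadOnlyList<Guid>)new[] { t }).ToList();

            case SketchStrategy.Canary:
                // First target alone, then the remainder in fixed-size chunks.
                var batches = new List<IReadOnlyList<Guid>> { new[] { targets[0] } };
                batches.AddRange(targets.Skip(1).Chunk(followUpBatchSize));
                return batches;

            default:
                // AllAtOnce: a single batch containing every target.
                return new IReadOnlyList<Guid>[] { targets };
        }
    }
}
```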

### RemediationResult

```csharp
public sealed record RemediationResult
{
    public Guid PlanId { get; init; }
    public RemediationResultStatus Status { get; init; }
    public ImmutableArray<TargetRemediationResult> TargetResults { get; init; }
    public Guid EvidencePacketId { get; init; }
    public TimeSpan Duration { get; init; }
    public RemediationMetrics Metrics { get; init; }
}

public sealed record TargetRemediationResult
{
    public Guid TargetId { get; init; }
    public RemediationTargetStatus Status { get; init; }
    public string? Error { get; init; }
    public TimeSpan Duration { get; init; }
    public string? PreviousDigest { get; init; }
    public string? CurrentDigest { get; init; }
    public ImmutableArray<string> Logs { get; init; }
}

public sealed record RemediationMetrics
{
    public int TotalTargets { get; init; }
    public int Succeeded { get; init; }
    public int Failed { get; init; }
    public int Skipped { get; init; }
    public TimeSpan TotalDuration { get; init; }
    public TimeSpan AverageTargetDuration { get; init; }
}
```
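The models above do not show how `RemediationMetrics` or an overall status is derived from per-target results. A minimal sketch follows, under several assumptions: `TargetStatus`, `OverallStatus`, `TargetOutcome`, `MetricsSummary`, and `ResultAggregator` are simplified hypothetical stand-ins for the real types (whose status enums are not defined in this document), `TotalDuration` is taken to be the sum of per-target durations rather than wall-clock time, and a run with no failures counts as succeeded.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-ins for the document's result types.
public enum TargetStatus { Succeeded, Failed, Skipped }
public enum OverallStatus { Succeeded, PartialSuccess, Failed }
public sealed record TargetOutcome(TargetStatus Status, TimeSpan Duration);
public sealed record MetricsSummary(
    int TotalTargets, int Succeeded, int Failed, int Skipped,
    TimeSpan TotalDuration, TimeSpan AverageTargetDuration);

public static class ResultAggregator
{
    // Counts outcomes by status and sums/averages their durations.
    public static MetricsSummary Summarize(IReadOnlyCollection<TargetOutcome> outcomes)
    {
        var total = outcomes.Count;
        var sumTicks = outcomes.Sum(o => o.Duration.Ticks);
        return new MetricsSummary(
            total,
            outcomes.Count(o => o.Status == TargetStatus.Succeeded),
            outcomes.Count(o => o.Status == TargetStatus.Failed),
            outcomes.Count(o => o.Status == TargetStatus.Skipped),
            TimeSpan.FromTicks(sumTicks),
            total == 0 ? TimeSpan.Zero : TimeSpan.FromTicks(sumTicks / total));
    }

    // No failures: Succeeded. Failures but no successes: Failed. Otherwise partial.
    public static OverallStatus Classify(MetricsSummary m) =>
        m.Failed == 0 ? OverallStatus.Succeeded
        : m.Succeeded == 0 ? OverallStatus.Failed
        : OverallStatus.PartialSuccess;
}
```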

---

## API Design

### REST Endpoints

```
# Policies
POST   /api/v1/remediation/policies                    # Create policy
GET    /api/v1/remediation/policies                    # List policies
GET    /api/v1/remediation/policies/{id}               # Get policy
PUT    /api/v1/remediation/policies/{id}               # Update policy
DELETE /api/v1/remediation/policies/{id}               # Delete policy
POST   /api/v1/remediation/policies/{id}/activate      # Activate policy
POST   /api/v1/remediation/policies/{id}/deactivate    # Deactivate policy

# Plans
GET    /api/v1/remediation/plans                       # List plans
GET    /api/v1/remediation/plans/{id}                  # Get plan details
POST   /api/v1/remediation/plans/{id}/execute          # Execute deferred plan
POST   /api/v1/remediation/plans/{id}/pause            # Pause running plan
POST   /api/v1/remediation/plans/{id}/resume           # Resume paused plan
POST   /api/v1/remediation/plans/{id}/cancel           # Cancel plan

# On-demand
POST   /api/v1/remediation/preview                     # Preview remediation (dry-run)
POST   /api/v1/remediation/execute                     # Execute immediate remediation

# History
GET    /api/v1/remediation/history                     # List remediation history
GET    /api/v1/remediation/history/{id}                # Get remediation result
GET    /api/v1/remediation/history/{id}/evidence       # Get evidence packet
```

### WebSocket Events

```typescript
// Real-time remediation updates
interface RemediationEvent {
  type: 'plan.created' | 'plan.started' | 'plan.completed' |
        'target.started' | 'target.completed' | 'target.failed';
  planId: string;
  targetId?: string;
  status: string;
  progress?: number;
  message?: string;
  timestamp: string;
}
```
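A .NET consumer of these events would need to map the camelCase payload onto a typed object. The sketch below does this with `System.Text.Json` and its web defaults (camelCase names, case-insensitive matching). `RemediationEventDto` and `RemediationEventParser` are hypothetical names, and it is an assumption that `timestamp` is an ISO 8601 string, since the interface only declares it as `string`.

```csharp
using System;
using System.Text.Json;

// Hypothetical DTO mirroring the RemediationEvent interface above.
public sealed record RemediationEventDto(
    string Type,
    string PlanId,
    string? TargetId,
    string Status,
    double? Progress,
    string? Message,
    DateTimeOffset Timestamp);

public static class RemediationEventParser
{
    // Web defaults map camelCase JSON names onto the PascalCase record members.
    private static readonly JsonSerializerOptions Options =
        new(JsonSerializerDefaults.Web);

    // Optional fields that are absent from the payload come back as null.
    public static RemediationEventDto Parse(string json) =>
        JsonSerializer.Deserialize<RemediationEventDto>(json, Options)
        ?? throw new JsonException("Event payload was null.");
}
```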

---

## Severity Scoring Algorithm

```csharp
public sealed class SeverityScorer
{
    private readonly SeverityScoringConfig _config;

    public DriftSeverity Score(DriftItem drift, ScoringContext context)
    {
        var factors = new List<SeverityFactor>();
        var score = 0.0;

        // Factor 1: Drift Type (30%)
        var typeScore = drift.Type switch
        {
            DriftType.Missing => 100,
            DriftType.DigestMismatch => 80,
            DriftType.StatusMismatch => 50,
            DriftType.Unexpected => 30,
            _ => 10
        };
        factors.Add(new SeverityFactor("DriftType", typeScore, 0.30));
        score += typeScore * 0.30;

        // Factor 2: Drift Age (25%)
        var ageScore = CalculateAgeScore(drift.DetectedAt, context.Now);
        factors.Add(new SeverityFactor("DriftAge", ageScore, 0.25));
        score += ageScore * 0.25;

        // Factor 3: Environment Criticality (20%)
        var envScore = context.Environment.Criticality switch
        {
            EnvironmentCriticality.Production => 100,
            EnvironmentCriticality.Staging => 60,
            EnvironmentCriticality.Development => 20,
            _ => 10
        };
        factors.Add(new SeverityFactor("EnvironmentCriticality", envScore, 0.20));
        score += envScore * 0.20;

        // Factor 4: Component Criticality (15%); components without an entry default to 50
        var componentScore = context.ComponentCriticality.GetValueOrDefault(drift.ComponentId, 50);
        factors.Add(new SeverityFactor("ComponentCriticality", componentScore, 0.15));
        score += componentScore * 0.15;

        // Factor 5: Blast Radius (10%)
        var blastScore = CalculateBlastRadius(drift, context.DependencyGraph);
        factors.Add(new SeverityFactor("BlastRadius", blastScore, 0.10));
        score += blastScore * 0.10;

        return new DriftSeverity
        {
            Level = ScoreToLevel((int)score),
            Score = (int)score,
            Factors = factors.ToImmutableArray(),
            DriftAge = context.Now - drift.DetectedAt,
            RequiresImmediate = score >= 90
        };
    }

    private int CalculateAgeScore(DateTimeOffset detectedAt, DateTimeOffset now)
    {
        var age = now - detectedAt;
        return age.TotalMinutes switch
        {
            < 5 => 10,      // Under 5 minutes: very fresh, low urgency
            < 30 => 30,     // Under 30 minutes: recent
            < 60 => 50,     // Under 1 hour
            < 240 => 70,    // Under 4 hours
            < 1440 => 85,   // Under 24 hours
            _ => 100        // 24 hours or more: critical
        };
    }

    private int CalculateBlastRadius(DriftItem drift, DependencyGraph graph)
    {
        var dependents = graph.GetDependents(drift.ComponentId);
        return dependents.Count switch
        {
            0 => 10,
            < 3 => 30,
            < 10 => 60,
            < 25 => 80,
            _ => 100
        };
    }
}
```
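The scorer calls `ScoreToLevel` without defining it. The sketch below shows one possible mapping together with the weighted sum as a standalone function. The thresholds are assumptions: 90 is borrowed from the `RequiresImmediate` cut-off, and 75/50/25 from the enum's numeric values. `SketchLevel` and `SeverityMath` are hypothetical names. This sketch rounds the weighted sum where the scorer above truncates it, which is a deliberate difference to avoid losing a point to floating-point error.

```csharp
using System;

// Hypothetical mirror of DriftSeverityLevel.
public enum SketchLevel { Info = 0, Low = 25, Medium = 50, High = 75, Critical = 100 }

public static class SeverityMath
{
    // Weighted sum using the documented weights, rounded to the nearest integer.
    public static int WeightedScore(int type, int age, int env, int component, int blast) =>
        (int)Math.Round(
            type * 0.30 + age * 0.25 + env * 0.20 + component * 0.15 + blast * 0.10);

    // Assumed thresholds; the document does not specify them.
    public static SketchLevel ScoreToLevel(int score) => score switch
    {
        >= 90 => SketchLevel.Critical,
        >= 75 => SketchLevel.High,
        >= 50 => SketchLevel.Medium,
        >= 25 => SketchLevel.Low,
        _ => SketchLevel.Info
    };
}

// Worked example: digest mismatch (80), 20 minutes old (30), production (100),
// default component criticality (50), two dependents (30):
// 24 + 7.5 + 20 + 7.5 + 3 = 62, which maps to Medium under these thresholds.
```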

---

## Safety Mechanisms

### 1. Rate Limiting

```csharp
public sealed class RemediationRateLimiter
{
    public async Task<RateLimitResult> CheckAsync(
        RemediationPolicy policy,
        int requestedCount,
        CancellationToken ct)
    {
        // Reject the request if it would push the policy over its hourly budget
        var hourlyCount = await GetHourlyRemediationCountAsync(policy.Id, ct);

        if (hourlyCount + requestedCount > policy.MaxRemediationsPerHour)
        {
            return RateLimitResult.Exceeded(
                $"Hourly limit exceeded: {hourlyCount}/{policy.MaxRemediationsPerHour}");
        }

        // Enforce the cooldown between consecutive remediations
        var lastRemediation = await GetLastRemediationAsync(policy.Id, ct);
        if (lastRemediation != null)
        {
            var timeSinceLast = _timeProvider.GetUtcNow() - lastRemediation.CompletedAt;
            if (timeSinceLast < policy.CooldownPeriod)
            {
                return RateLimitResult.Cooldown(policy.CooldownPeriod - timeSinceLast);
            }
        }

        return RateLimitResult.Allowed(requestedCount);
    }
}
```
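`GetHourlyRemediationCountAsync` is left abstract above. As an illustration only, an in-memory sliding-window counter could back it for a single process. `SlidingWindowCounter` is a hypothetical class, and it assumes events are recorded in chronological order. A real deployment would more likely query the remediation history store so that counts survive restarts and are shared across instances.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory counter; assumes Record is called in time order.
public sealed class SlidingWindowCounter
{
    private readonly TimeSpan _window;
    private readonly Queue<DateTimeOffset> _events = new();
    private readonly object _gate = new();

    public SlidingWindowCounter(TimeSpan window) => _window = window;

    // Registers one remediation at the given instant.
    public void Record(DateTimeOffset at)
    {
        lock (_gate)
        {
            _events.Enqueue(at);
        }
    }

    // Drops events that have aged out of the window, then returns the remainder.
    public int CountAt(DateTimeOffset now)
    {
        lock (_gate)
        {
            while (_events.Count > 0 && now - _events.Peek() >= _window)
            {
                _events.Dequeue();
            }
            return _events.Count;
        }
    }
}
```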

### 2. Blast Radius Control

```csharp
// Maximum percentage of targets that can be remediated in one operation (whole percent, 0-100)
public const int MaxTargetPercentage = 25;

// Never remediate more than this many targets at once
public const int AbsoluteMaxTargets = 10;

// Minimum healthy targets required before remediation (fraction, 0.0-1.0)
public const double MinHealthyPercentage = 0.75;
```
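The three constants above can be combined into a single gate. The following sketch shows one interpretation: refuse to act when too few targets are healthy, otherwise allow the smallest of the requested count, the percentage cap, and the absolute cap. `BlastRadius.AllowedTargets` is a hypothetical helper, and letting small fleets remediate at least one target (so that 25% of two targets does not round down to zero) is an assumption rather than documented behavior.

```csharp
using System;

// Hypothetical gate combining the documented blast radius constants.
public static class BlastRadius
{
    public const int MaxTargetPercentage = 25;        // whole percent
    public const int AbsoluteMaxTargets = 10;
    public const double MinHealthyPercentage = 0.75;  // fraction

    // Returns how many of the requested targets may be remediated right now.
    public static int AllowedTargets(int totalTargets, int healthyTargets, int requested)
    {
        if (totalTargets <= 0 || requested <= 0)
            return 0;

        // Too many unhealthy targets: do nothing and leave it to operators.
        if ((double)healthyTargets / totalTargets < MinHealthyPercentage)
            return 0;

        // Assumption: always permit at least one target in small fleets.
        var byPercentage = Math.Max(1, totalTargets * MaxTargetPercentage / 100);

        return Math.Min(requested, Math.Min(byPercentage, AbsoluteMaxTargets));
    }
}
```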

### 3. Circuit Breaker

```csharp
public sealed class RemediationCircuitBreaker
{
    private int _consecutiveFailures;
    private DateTimeOffset? _openedAt;

    public bool IsOpen => _openedAt != null &&
        (_timeProvider.GetUtcNow() - _openedAt.Value) < _config.OpenDuration;

    public void RecordSuccess()
    {
        _consecutiveFailures = 0;
        _openedAt = null;
    }

    public void RecordFailure()
    {
        _consecutiveFailures++;
        if (_consecutiveFailures >= _config.FailureThreshold)
        {
            _openedAt = _timeProvider.GetUtcNow();
            _logger.LogWarning("Remediation circuit breaker opened after {Failures} failures",
                _consecutiveFailures);
        }
    }
}
```
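Because `ExecuteAsync` remediates several targets concurrently, the breaker's counter and timestamp may be updated from multiple threads. The sketch below is a lock-based variant of the same state machine, not the documented implementation. `SketchCircuitBreaker` is a hypothetical name, and the injected `Func<DateTimeOffset>` clock stands in for `TimeProvider` so the behavior can be exercised deterministically. As in the original, a failure after the open period has elapsed reopens the breaker immediately, because the failure count is only reset on success.

```csharp
using System;

// Hypothetical thread-safe variant of the circuit breaker above.
public sealed class SketchCircuitBreaker
{
    private readonly object _gate = new();
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private readonly Func<DateTimeOffset> _clock;
    private int _consecutiveFailures;
    private DateTimeOffset? _openedAt;

    public SketchCircuitBreaker(int failureThreshold, TimeSpan openDuration, Func<DateTimeOffset> clock)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
        _clock = clock;
    }

    // Open while the configured duration since the last trip has not elapsed.
    public bool IsOpen
    {
        get
        {
            lock (_gate)
            {
                return _openedAt is { } opened && _clock() - opened < _openDuration;
            }
        }
    }

    public void RecordSuccess()
    {
        lock (_gate)
        {
            _consecutiveFailures = 0;
            _openedAt = null;
        }
    }

    public void RecordFailure()
    {
        lock (_gate)
        {
            _consecutiveFailures++;
            if (_consecutiveFailures >= _failureThreshold)
            {
                _openedAt = _clock();
            }
        }
    }
}
```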

---

## Metrics & Observability

### Prometheus Metrics

```
# Counters
stella_remediation_plans_total{environment, policy, status}
stella_remediation_targets_total{environment, action, status}
stella_remediation_rate_limit_hits_total{policy}

# Histograms
stella_remediation_plan_duration_seconds{environment, strategy}
stella_remediation_target_duration_seconds{environment, action}
stella_remediation_detection_to_action_seconds{environment, severity}

# Gauges
stella_drift_items_pending_remediation{environment, severity}
stella_remediation_circuit_breaker_open{policy}
```

### Structured Logging

```json
{
  "event": "remediation.target.completed",
  "plan_id": "abc-123",
  "target_id": "target-456",
  "environment": "production",
  "action": "reconcile",
  "drift_type": "digest_mismatch",
  "severity": "high",
  "duration_ms": 4532,
  "status": "succeeded",
  "previous_digest": "sha256:abc...",
  "current_digest": "sha256:def...",
  "correlation_id": "xyz-789"
}
```

---

## Evidence Generation

Every remediation produces a sealed evidence packet:

```csharp
public sealed record RemediationEvidence
{
    // What drifted
    public ImmutableArray<DriftItem> DetectedDrift { get; init; }
    public ImmutableArray<DriftSeverity> Severities { get; init; }

    // Policy applied
    public RemediationPolicy Policy { get; init; }

    // Plan executed
    public RemediationPlan Plan { get; init; }

    // Results
    public ImmutableArray<TargetRemediationResult> Results { get; init; }

    // Who/when
    public string InitiatedBy { get; init; }  // "system:auto" or user ID
    public DateTimeOffset InitiatedAt { get; init; }
    public DateTimeOffset CompletedAt { get; init; }

    // Artifacts
    public ImmutableArray<string> GeneratedArtifacts { get; init; }  // Compose files, scripts
}
```
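How a packet is "sealed" is not specified. The sketch below illustrates the general idea under stated assumptions: serialize the evidence deterministically, hash it with SHA-256, and sign the digest. `EvidenceSummary` is a hypothetical cut-down record, `EvidenceSealer` is a hypothetical helper, and HMAC-SHA256 with a shared key is used here purely as a stand-in for whatever signing scheme the evidence system actually employs.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

// Hypothetical, reduced form of RemediationEvidence.
public sealed record EvidenceSummary(
    Guid PlanId, string InitiatedBy, DateTimeOffset InitiatedAt, DateTimeOffset CompletedAt);

public static class EvidenceSealer
{
    // Serializes the record (properties in declaration order) and hashes the JSON.
    public static string ComputeDigest(EvidenceSummary evidence)
    {
        var json = JsonSerializer.Serialize(evidence);
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(json));
        return "sha256:" + Convert.ToHexString(hash).ToLowerInvariant();
    }

    // Stand-in signature: HMAC-SHA256 over the digest string, hex encoded.
    public static string Sign(string digest, byte[] key)
    {
        using var hmac = new HMACSHA256(key);
        return Convert.ToHexString(hmac.ComputeHash(Encoding.UTF8.GetBytes(digest)));
    }

    // Recomputes the signature and compares in constant time.
    public static bool Verify(EvidenceSummary evidence, string signature, byte[] key) =>
        CryptographicOperations.FixedTimeEquals(
            Convert.FromHexString(Sign(ComputeDigest(evidence), key)),
            Convert.FromHexString(signature));
}
```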

---

## Configuration

### Default Policy Template

```yaml
name: "production-auto-remediation"
environment_id: "prod-001"

trigger: age_threshold
minimum_severity: high
minimum_drift_age: "00:15:00"    # 15 minutes
maximum_drift_age: "1.00:00:00"  # 24 hours (d.hh:mm:ss; .NET TimeSpan hours must be 0-23), then escalate to manual

action: reconcile
strategy: rolling

safety:
  max_concurrent_remediations: 2
  max_remediations_per_hour: 10
  cooldown_period: "00:05:00"    # 5 minutes between remediations

schedule:
  maintenance_window:
    enabled: true
    start: "02:00"
    end: "06:00"
    timezone: "UTC"
  allowed_days: [monday, tuesday, wednesday, thursday, friday]

notifications:
  on_plan_created: true
  on_remediation_started: true
  on_remediation_completed: true
  on_remediation_failed: true
  channels:
    - type: slack
      channel: "#ops-alerts"
    - type: email
      recipients: ["ops-team@example.com"]
```

---

## Test Strategy

### Unit Tests

- Severity scoring with various drift combinations
- Rate limiting logic
- Circuit breaker state transitions
- Policy evaluation with edge cases

### Integration Tests

- Full remediation flow: detect → plan → execute → verify
- Maintenance window enforcement
- Rate limit enforcement across multiple requests
- Evidence packet generation and signing

### Chaos Tests

- Agent failure during remediation
- Database unavailability during plan execution
- Concurrent remediation requests
- Clock skew handling

### Golden Tests

- Deterministic severity scores for fixed inputs
- Deterministic plan generation for fixed drift reports
- Evidence packet structure validation

---

## Migration Path

### Phase 1: Foundation (Week 1-2)
- Severity scoring service
- Remediation policy model and store
- Basic API endpoints

### Phase 2: Engine (Week 3-4)
- Remediation engine implementation
- Plan creation and execution
- Target remediation logic

### Phase 3: Safety (Week 5)
- Rate limiting
- Circuit breaker
- Blast radius controls

### Phase 4: Scheduling (Week 6)
- Maintenance window support
- Scheduled reconciliation
- Age-based escalation

### Phase 5: Observability (Week 7)
- Metrics emission
- Evidence generation
- Alert integration

### Phase 6: UI & Polish (Week 8)
- Web console integration
- Real-time updates
- Policy management UI