Agent Operations & Easy Setup
Overview
The Agent Operations enhancement transforms agent deployment from a manual, error-prone process into a streamlined, self-healing experience. It provides zero-touch bootstrap, declarative configuration, comprehensive health diagnostics (Doctor plugin), and operational tooling that makes agents easy to deploy, monitor, and maintain at scale.
This enhancement complements Sprint 034 (Agent Resilience) by focusing on the operational and configuration aspects rather than the clustering and failover mechanisms.
Design Principles
- Zero-Touch Bootstrap: Agents should be deployable with a single command
- Declarative Configuration: Define the desired state; the system converges to it automatically
- Self-Diagnosing: Agents report their own health issues with remediation hints
- Operator-Friendly: Clear CLI commands, meaningful error messages, runbook links
- Secure by Default: Auto-provisioned certificates, secrets never on disk
- Observable: Rich metrics, structured logs, distributed tracing
Current Pain Points
| Pain Point | Current State | Target State |
|---|---|---|
| Certificate Management | Manual paths to cert/key/ca files | Auto-provisioned, auto-renewed |
| Configuration | Static YAML files, manual edits | Declarative config with drift detection |
| Health Monitoring | Binary alive/offline | Multi-dimensional health scoring |
| Troubleshooting | Manual log inspection | Doctor plugin with guided remediation |
| Scaling | Manual per-agent setup | Bootstrap token + auto-join |
| Updates | Manual agent binary updates | Auto-update with rollback |
| Network Issues | Silent failures | Connection diagnostics with hints |
Architecture
Component Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ Agent Operations & Setup │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │
│ │ BootstrapService │───▶│ ConfigManager │───▶│ CertificateManager│ │
│ │ │ │ │ │ │ │
│ └───────────────────┘ └───────────────────┘ └───────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │
│ │ AgentDoctor │ │ ConnectionDoctor │ │ UpdateManager │ │
│ │ │ │ │ │ │ │
│ └───────────────────┘ └───────────────────┘ └───────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │
│ │ DiagnosticReport │ │ RemediationEngine │ │ OperatorCLI │ │
│ │ │ │ │ │ │ │
│ └───────────────────┘ └───────────────────┘ └───────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Bootstrap Flow
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ stella │ │ Orchestrator│ │ Agent │
│ agent │─────▶│ (API) │─────▶│ Running │
│ bootstrap │ │ │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
│ 1. Request token │ │
│────────────────────▶│ │
│ 2. Return token │ │
│◀────────────────────│ │
│ │ │
│ 3. Start agent with token │
│─────────────────────────────────────────▶│
│ │ 4. Exchange token │
│ │◀───────────────────│
│ │ 5. Issue cert │
│ │───────────────────▶│
│ │ 6. Register │
│ │◀───────────────────│
│ │ 7. Confirm │
│ │───────────────────▶│
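Steps 4 and 5 rely on the bootstrap token being short-lived and redeemable exactly once. The following is a minimal sketch of that rule, assuming a hypothetical in-memory `BootstrapTokenStore`. It is not the token service used by `BootstrapService`, whose implementation this document does not show, and a real orchestrator would persist tokens rather than hold them in memory.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;

// Hypothetical in-memory store, for illustration only.
public sealed class BootstrapTokenStore
{
    private sealed record Entry(string AgentName, DateTimeOffset ExpiresAt);

    private readonly ConcurrentDictionary<string, Entry> _tokens = new();
    private readonly TimeProvider _time;

    public BootstrapTokenStore(TimeProvider time) => _time = time;

    // Steps 1-2: issue a random token bound to an agent name.
    public string Issue(string agentName, TimeSpan lifetime)
    {
        var token = Convert.ToHexString(RandomNumberGenerator.GetBytes(32));
        _tokens[token] = new Entry(agentName, _time.GetUtcNow() + lifetime);
        return token;
    }

    // Step 4: redeem the token. TryRemove is atomic, so a token can be
    // exchanged at most once even when requests race each other.
    public bool TryRedeem(string token, out string? agentName)
    {
        agentName = null;
        if (!_tokens.TryRemove(token, out var entry))
            return false;                    // unknown or already used
        if (entry.ExpiresAt <= _time.GetUtcNow())
            return false;                    // expired (and now removed)
        agentName = entry.AgentName;
        return true;
    }
}
```

Removing the entry before checking expiry means an expired token is cleaned up on its first use instead of lingering in the store.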
Key Components
1. Bootstrap Service
Zero-touch agent deployment:
public sealed class BootstrapService
{
public async Task<BootstrapResult> BootstrapAgentAsync(
BootstrapRequest request,
CancellationToken ct)
{
// 1. Generate bootstrap token (one-time use, 15-minute expiry)
var token = await _tokenService.GenerateBootstrapTokenAsync(
new TokenRequest
{
AgentName = request.AgentName,
Environment = request.Environment,
Capabilities = request.Capabilities,
ExpiresIn = TimeSpan.FromMinutes(15),
MaxUses = 1
}, ct);
// 2. Generate agent configuration
var config = GenerateAgentConfig(request, token);
// 3. Generate installation script
var script = GenerateInstallScript(request.Platform, config);
return new BootstrapResult
{
Token = token.Value,
TokenExpires = token.ExpiresAt,
Configuration = config,
InstallScript = script,
InstallCommand = GetOneLineInstaller(request.Platform, token)
};
}
private string GetOneLineInstaller(Platform platform, BootstrapToken token)
{
return platform switch
{
Platform.Linux => $"curl -sSL https://stella.example.com/install.sh | sudo bash -s -- --token {token.Value}",
Platform.Windows => $"iwr https://stella.example.com/install.ps1 -UseBasicParsing | iex; Install-StellaAgent -Token {token.Value}",
Platform.Docker => $"docker run -d --name stella-agent -e STELLA_BOOTSTRAP_TOKEN={token.Value} stella/agent:latest",
_ => throw new UnsupportedPlatformException(platform)
};
}
}
public sealed record BootstrapRequest
{
public string AgentName { get; init; }
public string Environment { get; init; }
public Platform Platform { get; init; }
public ImmutableArray<AgentCapability> Capabilities { get; init; }
public ImmutableDictionary<string, string> Labels { get; init; }
public string? ClusterId { get; init; } // Join existing cluster
}
public sealed record BootstrapResult
{
public string Token { get; init; }
public DateTimeOffset TokenExpires { get; init; }
public AgentConfiguration Configuration { get; init; }
public string InstallScript { get; init; }
public string InstallCommand { get; init; }
}
2. Configuration Manager
Declarative configuration with drift detection:
public sealed class AgentConfigManager
{
public async Task<ConfigurationState> ApplyConfigurationAsync(
AgentConfiguration desired,
CancellationToken ct)
{
var current = await _configStore.GetCurrentAsync(ct);
var diff = ComputeDiff(current, desired);
if (diff.HasChanges)
{
_logger.LogInformation("Configuration drift detected: {Changes}", diff.Summary);
// Validate changes are safe
var validation = await ValidateChangesAsync(diff, ct);
if (!validation.IsValid)
{
return new ConfigurationState
{
Status = ConfigStatus.ValidationFailed,
Errors = validation.Errors
};
}
// Apply changes with rollback capability
try
{
await ApplyChangesAsync(diff, ct);
await _configStore.SaveAsync(desired, ct);
return new ConfigurationState
{
Status = ConfigStatus.Applied,
AppliedChanges = diff.Changes
};
}
catch (Exception ex)
{
await RollbackAsync(current, ct);
throw new ConfigurationApplyException("Failed to apply configuration", ex);
}
}
return new ConfigurationState { Status = ConfigStatus.NoChanges };
}
public async Task<ConfigDrift> DetectDriftAsync(CancellationToken ct)
{
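var desired = await _configStore.GetDesiredAsync(ct);
var actual = await _configStore.GetActualAsync(ct);
// Derive drift from the computed diff. Record equality compares
// ImmutableArray/ImmutableDictionary members by reference, so
// desired.Equals(actual) would report drift even for equal contents.
var diff = ComputeDiff(actual, desired);
return new ConfigDrift
{
HasDrift = diff.HasChanges,
DesiredState = desired,
ActualState = actual,
Differences = diff.Changes
};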
}
}
// Declarative configuration model
public sealed record AgentConfiguration
{
// Identity
public string AgentId { get; init; }
public string AgentName { get; init; }
public string Environment { get; init; }
public ImmutableDictionary<string, string> Labels { get; init; }
// Connection
public string OrchestratorUrl { get; init; }
public TimeSpan HeartbeatInterval { get; init; } = TimeSpan.FromSeconds(30);
public TimeSpan ReconnectBackoff { get; init; } = TimeSpan.FromSeconds(5);
public int MaxReconnectAttempts { get; init; } = 10;
// Capabilities
public ImmutableArray<AgentCapability> Capabilities { get; init; }
// Resources
public ResourceLimits ResourceLimits { get; init; }
public int MaxConcurrentTasks { get; init; } = 5;
public TimeSpan DefaultTaskTimeout { get; init; } = TimeSpan.FromMinutes(30);
// Security
public CertificateConfig Certificates { get; init; }
public bool AutoRenewCertificates { get; init; } = true;
public TimeSpan CertificateRenewalThreshold { get; init; } = TimeSpan.FromDays(7);
// Clustering (optional)
public ClusterConfig? Cluster { get; init; }
// Observability
public ObservabilityConfig Observability { get; init; }
// Auto-update
public AutoUpdateConfig? AutoUpdate { get; init; }
}
public sealed record CertificateConfig
{
public CertificateSource Source { get; init; } = CertificateSource.AutoProvision;
public string? CertificatePath { get; init; } // Only if Source = File
public string? PrivateKeyPath { get; init; } // Only if Source = File
public string? CaCertificatePath { get; init; } // Only if Source = File
}
public enum CertificateSource
{
AutoProvision, // Orchestrator provisions via bootstrap
File, // Manual file paths
Vault, // HashiCorp Vault
ACME, // Let's Encrypt / ACME
AzureKeyVault, // Azure Key Vault
AWSKMS // AWS KMS/Secrets Manager
}
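`ComputeDiff` is referenced above but not shown. One possible approach, sketched here under the assumption that a configuration can be flattened into string key/value pairs (for example `observability.logging.level` mapped to `info`), is a key-by-key comparison. The `ConfigDiff`, `ConfigChange`, and `ChangeKind` names are hypothetical and not part of the design above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum ChangeKind { Added, Removed, Modified }

public sealed record ConfigChange(string Key, ChangeKind Kind, string? Old, string? New);

public static class ConfigDiff
{
    // Compares two flattened configurations and returns the changes needed
    // to move from `current` to `desired`, ordered by key for stable output.
    public static IReadOnlyList<ConfigChange> Compute(
        IReadOnlyDictionary<string, string> current,
        IReadOnlyDictionary<string, string> desired)
    {
        var changes = new List<ConfigChange>();
        var keys = current.Keys.Union(desired.Keys)
            .OrderBy(k => k, StringComparer.Ordinal);

        foreach (var key in keys)
        {
            var hasOld = current.TryGetValue(key, out var oldValue);
            var hasNew = desired.TryGetValue(key, out var newValue);

            if (hasOld && !hasNew)
                changes.Add(new ConfigChange(key, ChangeKind.Removed, oldValue, null));
            else if (!hasOld && hasNew)
                changes.Add(new ConfigChange(key, ChangeKind.Added, null, newValue));
            else if (!string.Equals(oldValue, newValue, StringComparison.Ordinal))
                changes.Add(new ConfigChange(key, ChangeKind.Modified, oldValue, newValue));
        }
        return changes;
    }
}
```

Comparing flattened values sidesteps the reference-equality semantics of `ImmutableArray<T>` and `ImmutableDictionary<TKey, TValue>` inside records, at the cost of needing a deterministic flattening step.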
3. Certificate Manager
Automatic certificate lifecycle:
public sealed class AgentCertificateManager
{
public async Task<CertificateState> EnsureCertificateAsync(CancellationToken ct)
{
var current = await GetCurrentCertificateAsync(ct);
if (current == null)
{
_logger.LogInformation("No certificate found, requesting new certificate");
return await ProvisionCertificateAsync(ct);
}
var expiresIn = current.NotAfter - _timeProvider.GetUtcNow();
var threshold = _config.CertificateRenewalThreshold;
if (expiresIn <= TimeSpan.Zero)
{
_logger.LogWarning("Certificate expired, requesting renewal");
return await RenewCertificateAsync(current, ct);
}
if (expiresIn <= threshold)
{
_logger.LogInformation(
"Certificate expires in {Days} days, renewing proactively",
expiresIn.TotalDays);
return await RenewCertificateAsync(current, ct);
}
return new CertificateState
{
Status = CertificateStatus.Valid,
Certificate = current,
ExpiresAt = current.NotAfter,
RenewalScheduled = current.NotAfter - threshold
};
}
private async Task<CertificateState> ProvisionCertificateAsync(CancellationToken ct)
{
// Generate key pair locally (private key never leaves agent)
using var rsa = RSA.Create(4096);
// Create CSR
var csr = CreateCertificateSigningRequest(rsa);
// Submit CSR to orchestrator
var signedCert = await _orchestratorClient.SubmitCSRAsync(
new CSRRequest
{
AgentId = _config.AgentId,
CSR = csr,
RequestedValidity = TimeSpan.FromDays(365)
}, ct);
// Store certificate and key securely
await _certStore.StoreCertificateAsync(signedCert, ct);
await _keyStore.StorePrivateKeyAsync(rsa, ct);
return new CertificateState
{
Status = CertificateStatus.Provisioned,
Certificate = signedCert,
ExpiresAt = signedCert.NotAfter
};
}
}
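`CreateCertificateSigningRequest` is not shown above. A minimal sketch using the .NET `CertificateRequest` API follows. The `CsrFactory` name, the choice of the agent ID as the subject common name, and the client-authentication extension are assumptions made for illustration; the orchestrator may expect a different subject or extension set.

```csharp
using System;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

// Hypothetical helper, for illustration only.
public static class CsrFactory
{
    // Builds a PEM-encoded PKCS#10 request. Only the public key and subject
    // are included; the private key stays inside the RSA instance.
    // Assumes agentId contains no characters that need escaping in a DN.
    public static string CreatePem(RSA key, string agentId)
    {
        var request = new CertificateRequest(
            new X500DistinguishedName($"CN={agentId}"),
            key,
            HashAlgorithmName.SHA256,
            RSASignaturePadding.Pkcs1);

        // Request the TLS client-authentication usage (OID 1.3.6.1.5.5.7.3.2),
        // which is what an agent presents during an mTLS handshake.
        request.CertificateExtensions.Add(
            new X509EnhancedKeyUsageExtension(
                new OidCollection { new Oid("1.3.6.1.5.5.7.3.2") },
                critical: false));

        byte[] der = request.CreateSigningRequest();
        return new string(PemEncoding.Write("CERTIFICATE REQUEST", der));
    }
}
```

The signing authority decides which requested extensions to honor, so the orchestrator should still validate the subject against the agent identity established during the token exchange.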
4. Agent Doctor (Health Checks)
Comprehensive health diagnostics:
public sealed class AgentDoctor
{
private readonly ImmutableArray<IAgentHealthCheck> _checks;
public AgentDoctor()
{
_checks = new IAgentHealthCheck[]
{
// Core checks
new CertificateExpiryCheck(),
new CertificateValidityCheck(),
new OrchestratorConnectivityCheck(),
new HeartbeatCheck(),
// Resource checks
new DiskSpaceCheck(),
new MemoryUsageCheck(),
new CpuUsageCheck(),
new FileDescriptorCheck(),
// Configuration checks
new ConfigurationValidityCheck(),
new ConfigurationDriftCheck(),
new CapabilityCheck(),
// Network checks
new RegistryConnectivityCheck(),
new DNSResolutionCheck(),
new TLSVersionCheck(),
new MTLSHandshakeCheck(),
// Task execution checks
new DockerConnectivityCheck(),
new DockerVersionCheck(),
new TaskQueueDepthCheck(),
new FailedTaskRateCheck(),
// Cluster checks (if clustered)
new ClusterMembershipCheck(),
new LeaderConnectivityCheck(),
new StateSyncCheck()
}.ToImmutableArray();
}
public async Task<AgentDiagnosticReport> RunDiagnosticsAsync(
DiagnosticOptions options,
CancellationToken ct)
{
var results = new List<HealthCheckResult>();
var startTime = _timeProvider.GetUtcNow();
foreach (var check in _checks)
{
if (options.Categories.Any() &&
!options.Categories.Contains(check.Category))
{
continue;
}
try
{
var result = await check.ExecuteAsync(ct);
results.Add(result);
if (result.Status == HealthStatus.Critical && options.StopOnCritical)
{
break;
}
}
catch (Exception ex) when (ex is not OperationCanceledException) // let cancellation propagate
{
results.Add(new HealthCheckResult
{
CheckName = check.Name,
Status = HealthStatus.Error,
Message = $"Check failed with exception: {ex.Message}",
Exception = ex
});
}
}
return new AgentDiagnosticReport
{
AgentId = _config.AgentId,
AgentName = _config.AgentName,
Timestamp = startTime,
Duration = _timeProvider.GetUtcNow() - startTime,
OverallStatus = DetermineOverallStatus(results),
Results = results.ToImmutableArray(),
Remediations = GenerateRemediations(results)
};
}
private ImmutableArray<RemediationStep> GenerateRemediations(
List<HealthCheckResult> results)
{
var remediations = new List<RemediationStep>();
foreach (var result in results.Where(r => r.Status != HealthStatus.Healthy))
{
var steps = _remediationEngine.GetRemediationSteps(result);
remediations.AddRange(steps);
}
// Sort by priority and deduplicate
return remediations
.DistinctBy(r => r.Id)
.OrderByDescending(r => r.Priority)
.ToImmutableArray();
}
}
// Individual health checks
public sealed class CertificateExpiryCheck : IAgentHealthCheck
{
public string Name => "Certificate Expiry";
public string Category => "Security";
public async Task<HealthCheckResult> ExecuteAsync(CancellationToken ct)
{
var cert = await _certManager.GetCurrentCertificateAsync(ct);
if (cert == null)
{
return new HealthCheckResult
{
CheckName = Name,
Status = HealthStatus.Critical,
Message = "No certificate found",
RemediationHint = "Run 'stella agent bootstrap' to provision certificate",
RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-no-certificate"
};
}
var expiresIn = cert.NotAfter - _timeProvider.GetUtcNow();
if (expiresIn <= TimeSpan.Zero)
{
return new HealthCheckResult
{
CheckName = Name,
Status = HealthStatus.Critical,
Message = $"Certificate expired on {cert.NotAfter:u}",
RemediationHint = "Run 'stella agent renew-cert' or restart agent for auto-renewal",
RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-cert-expired"
};
}
if (expiresIn <= TimeSpan.FromDays(7))
{
return new HealthCheckResult
{
CheckName = Name,
Status = HealthStatus.Warning,
Message = $"Certificate expires in {expiresIn.TotalDays:F0} days",
RemediationHint = "Certificate will auto-renew if enabled, or run 'stella agent renew-cert'",
Data = new Dictionary<string, object>
{
["expires_at"] = cert.NotAfter,
["expires_in_days"] = expiresIn.TotalDays
}
};
}
return new HealthCheckResult
{
CheckName = Name,
Status = HealthStatus.Healthy,
Message = $"Certificate valid until {cert.NotAfter:u} ({expiresIn.TotalDays:F0} days)",
Data = new Dictionary<string, object>
{
["expires_at"] = cert.NotAfter,
["expires_in_days"] = expiresIn.TotalDays
}
};
}
}
public sealed class OrchestratorConnectivityCheck : IAgentHealthCheck
{
public string Name => "Orchestrator Connectivity";
public string Category => "Network";
public async Task<HealthCheckResult> ExecuteAsync(CancellationToken ct)
{
var endpoint = _config.OrchestratorUrl;
try
{
// Test DNS resolution
var uri = new Uri(endpoint);
var addresses = await Dns.GetHostAddressesAsync(uri.Host, ct);
if (addresses.Length == 0)
{
return new HealthCheckResult
{
CheckName = Name,
Status = HealthStatus.Critical,
Message = $"DNS resolution failed for {uri.Host}",
RemediationHint = "Check DNS settings and network connectivity",
RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-dns-failure"
};
}
// Test TCP connection
using var tcpClient = new TcpClient();
// Convert the ValueTask to a Task once; a ValueTask must not be consumed twice
var connectTask = tcpClient.ConnectAsync(uri.Host, uri.Port, ct).AsTask();
var completed = await Task.WhenAny(
connectTask,
Task.Delay(TimeSpan.FromSeconds(5), ct));
if (completed != connectTask || !tcpClient.Connected)
{
return new HealthCheckResult
{
CheckName = Name,
Status = HealthStatus.Critical,
Message = $"TCP connection to {endpoint} timed out",
RemediationHint = "Check firewall rules and network connectivity",
RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-connection-timeout"
};
}
// Test mTLS handshake
var tlsResult = await TestMTLSHandshakeAsync(uri, ct);
if (!tlsResult.Success)
{
return new HealthCheckResult
{
CheckName = Name,
Status = HealthStatus.Critical,
Message = $"mTLS handshake failed: {tlsResult.Error}",
RemediationHint = tlsResult.RemediationHint,
RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-mtls-failure"
};
}
// Test gRPC health endpoint
var healthResult = await _orchestratorClient.HealthCheckAsync(ct);
return new HealthCheckResult
{
CheckName = Name,
Status = HealthStatus.Healthy,
Message = $"Connected to orchestrator at {endpoint}",
Data = new Dictionary<string, object>
{
["resolved_addresses"] = addresses.Select(a => a.ToString()).ToArray(),
["tls_version"] = tlsResult.TlsVersion,
["latency_ms"] = healthResult.LatencyMs
}
};
}
catch (Exception ex)
{
return new HealthCheckResult
{
CheckName = Name,
Status = HealthStatus.Critical,
Message = $"Connectivity check failed: {ex.Message}",
Exception = ex,
RemediationHint = "Check network configuration and orchestrator status",
RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-connectivity"
};
}
}
}
public sealed class DockerConnectivityCheck : IAgentHealthCheck
{
public string Name => "Docker Connectivity";
public string Category => "Runtime";
public async Task<HealthCheckResult> ExecuteAsync(CancellationToken ct)
{
try
{
var version = await _dockerClient.GetVersionAsync(ct);
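// Check minimum version. Docker can report suffixed versions such as
// "18.03.1-ce", which System.Version cannot parse, so strip the suffix first.
var minVersion = new Version(20, 10, 0);
var numericPart = version.Version.Split('-', '+')[0];
if (Version.TryParse(numericPart, out var currentVersion) && currentVersion < minVersion)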
{
return new HealthCheckResult
{
CheckName = Name,
Status = HealthStatus.Warning,
Message = $"Docker version {version.Version} is below recommended {minVersion}",
RemediationHint = "Upgrade Docker to version 20.10 or later",
Data = new Dictionary<string, object>
{
["docker_version"] = version.Version,
["api_version"] = version.ApiVersion,
["min_recommended"] = minVersion.ToString()
}
};
}
return new HealthCheckResult
{
CheckName = Name,
Status = HealthStatus.Healthy,
Message = $"Docker {version.Version} connected",
Data = new Dictionary<string, object>
{
["docker_version"] = version.Version,
["api_version"] = version.ApiVersion,
["os"] = version.Os,
["arch"] = version.Arch
}
};
}
catch (Exception ex)
{
return new HealthCheckResult
{
CheckName = Name,
Status = HealthStatus.Critical,
Message = $"Docker connectivity failed: {ex.Message}",
Exception = ex,
RemediationHint = "Ensure Docker daemon is running and agent has permission to access Docker socket",
RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-docker-connectivity"
};
}
}
}
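`DetermineOverallStatus` is called by `AgentDoctor` but not defined in this document. A plausible sketch is "the most severe result wins", alongside a simple numeric score of the kind a `stella_agent_health_score` gauge might expose. The enum ordering, the treatment of an empty result set as healthy, and the score weights are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed severity ordering; the document does not define the enum values.
public enum HealthStatus { Healthy = 0, Warning = 1, Error = 2, Critical = 3 }

public static class HealthAggregation
{
    // Overall status is the most severe individual status.
    // An empty set is treated as Healthy here, which is a design choice.
    public static HealthStatus Overall(IEnumerable<HealthStatus> statuses) =>
        statuses.DefaultIfEmpty(HealthStatus.Healthy).Max();

    // 0-100 score: a healthy check earns full credit, a warning earns half,
    // anything worse earns none. The weights are illustrative only.
    public static int Score(IReadOnlyCollection<HealthStatus> statuses)
    {
        if (statuses.Count == 0) return 100;

        double earned = statuses.Sum(s => s switch
        {
            HealthStatus.Healthy => 1.0,
            HealthStatus.Warning => 0.5,
            _ => 0.0
        });
        return (int)Math.Round(100 * earned / statuses.Count);
    }
}
```

Keeping the aggregation in pure functions makes it straightforward to cover in the unit tests listed under the test strategy.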
5. Remediation Engine
Guided problem resolution:
public sealed class RemediationEngine
{
public ImmutableArray<RemediationStep> GetRemediationSteps(
HealthCheckResult result)
{
var steps = new List<RemediationStep>();
// Match result to known remediation patterns
var pattern = _patterns.FirstOrDefault(p => p.Matches(result));
if (pattern != null)
{
steps.AddRange(pattern.Steps);
}
// Add generic remediation based on status
if (result.Status == HealthStatus.Critical)
{
steps.Add(new RemediationStep
{
Id = "check-logs",
Priority = RemediationPriority.High,
Title = "Check Agent Logs",
Description = "Review agent logs for detailed error information",
Command = "stella agent logs --tail 100",
RunbookUrl = result.RunbookUrl
});
}
return steps.ToImmutableArray();
}
private readonly ImmutableArray<RemediationPattern> _patterns = new[]
{
new RemediationPattern
{
CheckName = "Certificate Expiry",
StatusMatch = HealthStatus.Critical,
Steps = new[]
{
new RemediationStep
{
Id = "renew-cert",
Priority = RemediationPriority.Critical,
Title = "Renew Agent Certificate",
Description = "Agent certificate has expired and must be renewed",
Command = "stella agent renew-cert --force",
Automated = true
},
new RemediationStep
{
Id = "restart-agent",
Priority = RemediationPriority.High,
Title = "Restart Agent",
Description = "Restart agent to apply new certificate",
Command = "systemctl restart stella-agent",
Automated = false
}
}
},
new RemediationPattern
{
CheckName = "Orchestrator Connectivity",
MessageContains = "DNS resolution failed",
Steps = new[]
{
new RemediationStep
{
Id = "check-dns",
Priority = RemediationPriority.Critical,
Title = "Verify DNS Configuration",
Description = "Check that DNS servers are configured and reachable",
Command = "cat /etc/resolv.conf && nslookup orchestrator.example.com",
Automated = false
},
new RemediationStep
{
Id = "check-hosts",
Priority = RemediationPriority.High,
Title = "Check /etc/hosts",
Description = "Verify no conflicting entries in hosts file",
Command = "grep orchestrator /etc/hosts",
Automated = false
}
}
},
new RemediationPattern
{
CheckName = "Docker Connectivity",
Steps = new[]
{
new RemediationStep
{
Id = "check-docker-daemon",
Priority = RemediationPriority.Critical,
Title = "Check Docker Daemon",
Description = "Verify Docker daemon is running",
Command = "systemctl status docker",
Automated = false
},
new RemediationStep
{
Id = "check-docker-socket",
Priority = RemediationPriority.High,
Title = "Check Docker Socket Permissions",
Description = "Verify agent has access to Docker socket",
Command = "ls -la /var/run/docker.sock && groups stella-agent",
Automated = false
}
}
}
}.ToImmutableArray();
}
public sealed record RemediationStep
{
public string Id { get; init; }
public RemediationPriority Priority { get; init; }
public string Title { get; init; }
public string Description { get; init; }
public string? Command { get; init; }
public string? RunbookUrl { get; init; }
public bool Automated { get; init; }
public TimeSpan? EstimatedDuration { get; init; }
}
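`RemediationPattern.Matches` is used by `GetRemediationSteps` but its logic is not shown. The sketch below assumes that the check name must match exactly and that `StatusMatch` and `MessageContains` are optional filters that apply only when set. `CheckOutcome` is a hypothetical, trimmed-down stand-in for `HealthCheckResult`, and the `Steps` property is omitted.

```csharp
using System;

public enum HealthStatus { Healthy, Warning, Error, Critical }

// Hypothetical stand-in carrying only the fields needed for matching.
public sealed record CheckOutcome(string CheckName, HealthStatus Status, string Message);

public sealed record RemediationPattern
{
    public string CheckName { get; init; } = "";
    public HealthStatus? StatusMatch { get; init; }   // null = any status
    public string? MessageContains { get; init; }     // null = any message

    // A pattern matches when the check name is equal and every
    // optional criterion that has been set also holds.
    public bool Matches(CheckOutcome result) =>
        string.Equals(CheckName, result.CheckName, StringComparison.Ordinal)
        && (StatusMatch is null || StatusMatch == result.Status)
        && (MessageContains is null
            || result.Message.Contains(MessageContains, StringComparison.OrdinalIgnoreCase));
}
```

Because `GetRemediationSteps` takes the first matching pattern, more specific patterns (for example those with `MessageContains` set) would need to be listed before broader ones for the same check.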
6. Auto-Update Manager
Safe agent binary updates:
public sealed class AgentUpdateManager
{
public async Task<UpdateResult> CheckAndApplyUpdateAsync(
CancellationToken ct)
{
// Treat a missing AutoUpdate section the same as Enabled = false
if (_config.AutoUpdate?.Enabled != true)
{
return new UpdateResult { Status = UpdateStatus.Disabled };
}
// Check for available update
var available = await _updateService.CheckForUpdateAsync(
_config.AgentVersion,
_config.AutoUpdate.Channel,
ct);
if (!available.HasUpdate)
{
return new UpdateResult { Status = UpdateStatus.UpToDate };
}
// Verify update signature
var verified = await _signatureVerifier.VerifyAsync(
available.Package,
available.Signature,
ct);
if (!verified)
{
_logger.LogError("Update signature verification failed");
return new UpdateResult
{
Status = UpdateStatus.VerificationFailed,
Error = "Package signature verification failed"
};
}
// Check if update window is allowed
if (!IsInUpdateWindow())
{
_logger.LogInformation(
"Update available but outside update window, scheduling for {Window}",
_config.AutoUpdate.MaintenanceWindow);
return new UpdateResult
{
Status = UpdateStatus.Scheduled,
ScheduledFor = GetNextMaintenanceWindow()
};
}
// Drain active tasks
await DrainActiveTasksAsync(ct);
// Create the rollback point before anything is changed, so that it is
// also available to the exception handler below
var rollbackPoint = await CreateRollbackPointAsync(ct);
// Download and apply update
try
{
var packagePath = await DownloadPackageAsync(available, ct);
// Apply update
await ApplyUpdateAsync(packagePath, ct);
// Verify new version starts correctly
var healthCheck = await VerifyNewVersionAsync(ct);
if (!healthCheck.Healthy)
{
_logger.LogError("New version health check failed, rolling back");
await RollbackAsync(rollbackPoint, ct);
return new UpdateResult
{
Status = UpdateStatus.RolledBack,
Error = healthCheck.Error
};
}
return new UpdateResult
{
Status = UpdateStatus.Applied,
PreviousVersion = _config.AgentVersion,
NewVersion = available.Version
};
}
catch (Exception ex)
{
_logger.LogError(ex, "Update failed, attempting rollback");
await RollbackAsync(rollbackPoint, ct);
return new UpdateResult
{
Status = UpdateStatus.Failed,
Error = ex.Message
};
}
}
}
public sealed record AutoUpdateConfig
{
public bool Enabled { get; init; } = false;
public UpdateChannel Channel { get; init; } = UpdateChannel.Stable;
public string? MaintenanceWindow { get; init; } // Cron expression
public bool DrainBeforeUpdate { get; init; } = true;
public TimeSpan DrainTimeout { get; init; } = TimeSpan.FromMinutes(5);
public int MaxRollbackVersions { get; init; } = 3;
}
public enum UpdateChannel
{
Stable,
Beta,
Canary
}
7. Operator CLI Commands
Streamlined operational commands:
public sealed class AgentOperatorCommands
{
// Bootstrap new agent
// stella agent bootstrap --name prod-agent-01 --env production --platform linux
[Command("agent bootstrap")]
public async Task<int> BootstrapAsync(
[Option] string name,
[Option] string env,
[Option] Platform platform = Platform.Linux,
[Option] AgentCapability[]? capabilities = null, // typed so it converts to ImmutableArray<AgentCapability> below
[Option] string? cluster = null)
{
var result = await _bootstrap.BootstrapAgentAsync(new BootstrapRequest
{
AgentName = name,
Environment = env,
Platform = platform,
Capabilities = capabilities?.ToImmutableArray() ?? ImmutableArray<AgentCapability>.Empty,
ClusterId = cluster
}, _ct);
Console.WriteLine($"Bootstrap token generated (expires in 15 minutes):");
Console.WriteLine();
Console.WriteLine($" Token: {result.Token}");
Console.WriteLine();
Console.WriteLine($"One-line installer:");
Console.WriteLine($" {result.InstallCommand}");
Console.WriteLine();
Console.WriteLine($"Or download the install script:");
Console.WriteLine($" stella agent install-script --token {result.Token} --output install.sh");
return 0;
}
// Run diagnostics
// stella agent doctor [--category security] [--fix]
[Command("agent doctor")]
public async Task<int> DoctorAsync(
[Option] string? agentId = null,
[Option] string[]? categories = null,
[Option] bool fix = false,
[Option] OutputFormat format = OutputFormat.Table)
{
var options = new DiagnosticOptions
{
Categories = categories?.ToImmutableArray() ?? ImmutableArray<string>.Empty,
IncludeRemediations = true
};
var report = agentId != null
? await _doctor.RunRemoteDiagnosticsAsync(agentId, options, _ct)
: await _doctor.RunDiagnosticsAsync(options, _ct);
// Display results
RenderDiagnosticReport(report, format);
// Optionally apply automated fixes
if (fix && report.Remediations.Any(r => r.Automated))
{
Console.WriteLine();
Console.WriteLine("Applying automated remediations...");
foreach (var remediation in report.Remediations.Where(r => r.Automated))
{
Console.WriteLine($" - {remediation.Title}");
await _remediation.ApplyAsync(remediation, _ct);
}
}
return report.OverallStatus == HealthStatus.Healthy ? 0 : 1;
}
// View agent configuration
// stella agent config [--agent-id xyz] [--diff]
[Command("agent config")]
public async Task<int> ConfigAsync(
[Option] string? agentId = null,
[Option] bool diff = false,
[Option] OutputFormat format = OutputFormat.Yaml)
{
if (diff)
{
var drift = await _configManager.DetectDriftAsync(_ct);
RenderConfigDiff(drift, format);
return drift.HasDrift ? 1 : 0;
}
var config = await _configManager.GetConfigurationAsync(agentId, _ct);
RenderConfiguration(config, format);
return 0;
}
// Apply configuration changes
// stella agent apply -f agent-config.yaml
[Command("agent apply")]
public async Task<int> ApplyAsync(
[Option('f')] string configFile)
{
var config = await LoadConfigurationAsync(configFile);
var validation = await _configManager.ValidateAsync(config, _ct);
if (!validation.IsValid)
{
Console.WriteLine("Configuration validation failed:");
foreach (var error in validation.Errors)
{
Console.WriteLine($" - {error}");
}
return 1;
}
var result = await _configManager.ApplyConfigurationAsync(config, _ct);
if (result.Status == ConfigStatus.Applied)
{
Console.WriteLine($"Configuration applied successfully ({result.AppliedChanges.Length} changes)");
return 0;
}
Console.WriteLine($"Configuration apply failed: {result.Status}");
return 1;
}
// Renew certificate
// stella agent renew-cert [--force]
[Command("agent renew-cert")]
public async Task<int> RenewCertAsync(
[Option] bool force = false)
{
var result = await _certManager.RenewCertificateAsync(force, _ct);
if (result.Status == CertificateStatus.Renewed)
{
Console.WriteLine($"Certificate renewed successfully");
Console.WriteLine($" New expiry: {result.ExpiresAt:u}");
return 0;
}
Console.WriteLine($"Certificate renewal failed: {result.Error}");
return 1;
}
// View agent logs
// stella agent logs [--tail 100] [--follow] [--level error]
[Command("agent logs")]
public async Task<int> LogsAsync(
[Option] string? agentId = null,
[Option] int tail = 50,
[Option] bool follow = false,
[Option] LogLevel? level = null)
{
await foreach (var entry in _logService.StreamLogsAsync(
agentId, tail, follow, level, _ct))
{
RenderLogEntry(entry);
}
return 0;
}
// Force update
// stella agent update [--version x.y.z] [--force]
[Command("agent update")]
public async Task<int> UpdateAsync(
[Option] string? version = null,
[Option] bool force = false)
{
var result = await _updateManager.UpdateToVersionAsync(version, force, _ct);
Console.WriteLine($"Update status: {result.Status}");
if (result.Status == UpdateStatus.Applied)
{
Console.WriteLine($" Previous: {result.PreviousVersion}");
Console.WriteLine($" Current: {result.NewVersion}");
}
return result.Status == UpdateStatus.Applied ? 0 : 1;
}
}
Doctor Plugin for Server-Side
Central Doctor plugin for agent fleet health:
// src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Agent/AgentHealthPlugin.cs
public sealed class AgentHealthPlugin : IDoctorPlugin
{
public string Name => "Agent Health";
public string Description => "Monitors agent fleet health and connectivity";
public ImmutableArray<IDoctorCheck> Checks => new IDoctorCheck[]
{
new AgentHeartbeatFreshnessCheck(),
new AgentCertificateExpiryCheck(),
new AgentVersionConsistencyCheck(),
new AgentCapacityCheck(),
new StaleAgentCheck(),
new AgentClusterHealthCheck(),
new TaskQueueBacklogCheck(),
new FailedTaskRateCheck(),
new AgentResourceUtilizationCheck()
}.ToImmutableArray();
}
public sealed class AgentHeartbeatFreshnessCheck : IDoctorCheck
{
public string Name => "Agent Heartbeat Freshness";
public CheckSeverity Severity => CheckSeverity.Critical;
public async Task<DoctorCheckResult> ExecuteAsync(CancellationToken ct)
{
var agents = await _agentStore.GetAllAsync(ct);
var staleAgents = new List<string>();
var warningAgents = new List<string>();
foreach (var agent in agents.Where(a => a.Status != AgentStatus.Deactivated))
{
var heartbeatAge = _timeProvider.GetUtcNow() - agent.LastHeartbeat;
if (heartbeatAge > TimeSpan.FromMinutes(5))
{
staleAgents.Add($"{agent.Name} (last heartbeat: {heartbeatAge.TotalMinutes:F0}m ago)");
}
else if (heartbeatAge > TimeSpan.FromMinutes(2))
{
warningAgents.Add($"{agent.Name} (last heartbeat: {heartbeatAge.TotalSeconds:F0}s ago)");
}
}
if (staleAgents.Any())
{
return new DoctorCheckResult
{
Status = CheckStatus.Critical,
Message = $"{staleAgents.Count} agent(s) have stale heartbeats",
Details = staleAgents,
Remediation = "Check agent connectivity and status. Run 'stella agent doctor --agent-id <id>' for diagnostics."
};
}
if (warningAgents.Any())
{
return new DoctorCheckResult
{
Status = CheckStatus.Warning,
Message = $"{warningAgents.Count} agent(s) have delayed heartbeats",
Details = warningAgents
};
}
return new DoctorCheckResult
{
Status = CheckStatus.Healthy,
Message = $"All {agents.Count} agents have fresh heartbeats"
};
}
}
public sealed class AgentCertificateExpiryCheck : IDoctorCheck
{
public string Name => "Agent Certificate Expiry";
public CheckSeverity Severity => CheckSeverity.High;
public async Task<DoctorCheckResult> ExecuteAsync(CancellationToken ct)
{
var agents = await _agentStore.GetAllAsync(ct);
var expiringSoon = new List<string>();
var expired = new List<string>();
foreach (var agent in agents)
{
var expiresIn = agent.CertificateExpiry - _timeProvider.GetUtcNow();
if (expiresIn <= TimeSpan.Zero)
{
expired.Add($"{agent.Name} (expired {-expiresIn.TotalDays:F0} days ago)");
}
else if (expiresIn <= TimeSpan.FromDays(7))
{
expiringSoon.Add($"{agent.Name} (expires in {expiresIn.TotalDays:F0} days)");
}
}
if (expired.Any())
{
return new DoctorCheckResult
{
Status = CheckStatus.Critical,
Message = $"{expired.Count} agent(s) have expired certificates",
Details = expired,
Remediation = "Renew certificates immediately: 'stella agent renew-cert --agent-id <id>'"
};
}
if (expiringSoon.Any())
{
return new DoctorCheckResult
{
Status = CheckStatus.Warning,
Message = $"{expiringSoon.Count} agent(s) have certificates expiring soon",
Details = expiringSoon,
Remediation = "Schedule certificate renewal before expiry"
};
}
return new DoctorCheckResult
{
Status = CheckStatus.Healthy,
Message = "All agent certificates are valid"
};
}
}
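The freshness thresholds in `AgentHeartbeatFreshnessCheck` can be isolated into a pure function, which makes the boundary cases easy to unit test. This is a sketch only; `HeartbeatClassifier` and `HeartbeatState` are hypothetical names, and the thresholds are copied from the check above.

```csharp
using System;

public enum HeartbeatState { Fresh, Delayed, Stale }

// Hypothetical helper, for illustration only.
public static class HeartbeatClassifier
{
    // Thresholds mirror AgentHeartbeatFreshnessCheck:
    // older than 2 minutes is delayed, older than 5 minutes is stale.
    public static readonly TimeSpan DelayedAfter = TimeSpan.FromMinutes(2);
    public static readonly TimeSpan StaleAfter = TimeSpan.FromMinutes(5);

    public static HeartbeatState Classify(DateTimeOffset lastHeartbeat, DateTimeOffset now)
    {
        var age = now - lastHeartbeat;
        if (age > StaleAfter) return HeartbeatState.Stale;
        if (age > DelayedAfter) return HeartbeatState.Delayed;
        return HeartbeatState.Fresh;
    }
}
```

With the default 30-second `HeartbeatInterval`, these thresholds correspond to more than four missed intervals for a delayed agent and more than ten for a stale one.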
Configuration Examples
Minimal Configuration (Bootstrap)
# Bootstrapped agent - minimal config required
agent:
name: prod-agent-01
orchestrator_url: https://orchestrator.example.com:8443
# Everything else is auto-configured via bootstrap
Full Configuration
agent:
# Identity
id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
name: prod-agent-01
environment: production
labels:
region: us-east-1
tier: web
# Connection
orchestrator_url: https://orchestrator.example.com:8443
heartbeat_interval: 30s
reconnect_backoff: 5s
max_reconnect_attempts: 10
# Capabilities
capabilities:
- docker
- compose
- health_check
# Resources
max_concurrent_tasks: 5
default_task_timeout: 30m
resource_limits:
cpu_percent: 80
memory_percent: 80
disk_percent: 90
# Certificates
certificates:
source: auto_provision # auto_provision | file | vault
auto_renew: true
renewal_threshold: 7d
# Clustering (optional)
cluster:
id: prod-cluster-01
mode: active_active # active_passive | active_active | sharded
min_members: 2
# Observability
observability:
metrics:
enabled: true
port: 9090
logging:
level: info
format: json
tracing:
enabled: true
endpoint: http://jaeger:14268/api/traces
# Auto-update (optional)
auto_update:
enabled: true
channel: stable # stable | beta | canary
maintenance_window: "0 3 * * *" # 3 AM daily
drain_before_update: true
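The YAML above writes durations as `30s`, `5s`, `30m`, and `7d`, while `AgentConfiguration` uses `TimeSpan`. The document does not say how these strings are parsed; the following is one possible sketch. The `DurationParser` name and the accepted grammar (a non-negative integer followed by `s`, `m`, `h`, or `d`) are assumptions.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class DurationParser
{
    // Parses "<digits><unit>" where unit is s, m, h, or d.
    // Returns false for anything else instead of throwing.
    public static bool TryParse(string? text, out TimeSpan value)
    {
        value = default;
        if (string.IsNullOrWhiteSpace(text) || text.Length < 2) return false;

        // NumberStyles.None accepts digits only: no sign, no whitespace.
        if (!long.TryParse(text.AsSpan(0, text.Length - 1), NumberStyles.None,
                CultureInfo.InvariantCulture, out var amount))
            return false;

        long secondsPerUnit = text[^1] switch
        {
            's' => 1,
            'm' => 60,
            'h' => 3600,
            'd' => 86400,
            _ => 0
        };
        if (secondsPerUnit == 0) return false;     // unknown unit

        double total = (double)amount * secondsPerUnit;
        if (total >= TimeSpan.MaxValue.TotalSeconds) return false; // out of range

        value = TimeSpan.FromSeconds(total);
        return true;
    }
}
```

Rejecting unknown units outright, rather than defaulting to seconds, keeps a typo such as `30x` from silently becoming a valid timeout.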
CLI Quick Reference
# Bootstrap new agent
stella agent bootstrap --name prod-01 --env production --platform linux
# Run health diagnostics
stella agent doctor
stella agent doctor --category security --fix
stella agent doctor --agent-id abc123 --format json
# View/apply configuration
stella agent config
stella agent config --diff
stella agent apply -f agent-config.yaml
# Certificate management
stella agent renew-cert
stella agent renew-cert --force
# Logs and debugging
stella agent logs --tail 100
stella agent logs --follow --level error
# Updates
stella agent update
stella agent update --version 2.1.0
# Status and health
stella agent status
stella agent list --env production
stella agent health abc123
Metrics & Observability
Prometheus Metrics
# Bootstrap
stella_agent_bootstrap_total{environment, platform}
stella_agent_bootstrap_success_total{environment}
stella_agent_bootstrap_failed_total{environment, reason}
# Configuration
stella_agent_config_drift_detected_total{agent_id}
stella_agent_config_apply_total{agent_id, status}
# Certificates
stella_agent_certificate_expiry_seconds{agent_id}
stella_agent_certificate_renewal_total{agent_id, status}
# Health Checks
stella_agent_health_check_total{agent_id, check_name, status}
stella_agent_health_score{agent_id}
# Updates
stella_agent_update_available{agent_id, current_version, available_version}
stella_agent_update_applied_total{agent_id, status}
stella_agent_update_rollback_total{agent_id}
Test Strategy
Unit Tests
- Bootstrap token generation and validation
- Configuration diff computation
- Certificate lifecycle logic
- Health check execution
- Remediation matching
Integration Tests
- Full bootstrap flow
- Configuration apply with rollback
- Certificate renewal
- Auto-update with rollback
- Doctor diagnostics
E2E Tests
- Bootstrap to running agent
- Multi-agent cluster formation
- Failover scenarios
- Update and rollback scenarios
Migration Path
Phase 1: Bootstrap Service (Week 1-2)
- Bootstrap token service
- One-line installer generation
- Platform-specific install scripts
Phase 2: Configuration Manager (Week 3-4)
- Declarative configuration model
- Drift detection
- Apply with rollback
Phase 3: Certificate Manager (Week 5-6)
- Auto-provisioning
- Auto-renewal
- Multi-source support (Vault, ACME, etc.)
Phase 4: Agent Doctor (Week 7-8)
- Core health checks
- Remediation engine
- CLI integration
Phase 5: Doctor Plugin (Week 9-10)
- Server-side fleet health
- Dashboard integration
- Alerting rules
Phase 6: Auto-Update (Week 11-12)
- Update service
- Safe rollback
- Maintenance windows