# Agent Operations & Easy Setup

## Overview

The Agent Operations enhancement transforms agent deployment from a manual, error-prone process into a streamlined, self-healing experience. It provides zero-touch bootstrap, declarative configuration, comprehensive health diagnostics (Doctor plugin), and operational tooling that makes agents easy to deploy, monitor, and maintain at scale.

This enhancement complements Sprint 034 (Agent Resilience) by focusing on the operational and configuration aspects rather than the clustering and failover mechanisms.

---

## Design Principles

1. **Zero-Touch Bootstrap**: Agents should be deployable with a single command
2. **Declarative Configuration**: Define the desired state; the system converges automatically
3. **Self-Diagnosing**: Agents report their own health issues with remediation hints
4. **Operator-Friendly**: Clear CLI commands, meaningful error messages, runbook links
5. **Secure by Default**: Auto-provisioned certificates, secrets never on disk
6. **Observable**: Rich metrics, structured logs, distributed tracing

---

## Current Pain Points

| Pain Point | Current State | Target State |
|------------|---------------|--------------|
| Certificate Management | Manual paths to cert/key/ca files | Auto-provisioned, auto-renewed |
| Configuration | Static YAML files, manual edits | Declarative config with drift detection |
| Health Monitoring | Binary alive/offline | Multi-dimensional health scoring |
| Troubleshooting | Manual log inspection | Doctor plugin with guided remediation |
| Scaling | Manual per-agent setup | Bootstrap token + auto-join |
| Updates | Manual agent binary updates | Auto-update with rollback |
| Network Issues | Silent failures | Connection diagnostics with hints |

---

## Architecture

### Component Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          Agent Operations & Setup                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐    │
│  │ BootstrapService  │───▶│   ConfigManager   │───▶│ CertificateManager│    │
│  │                   │    │                   │    │                   │    │
│  └───────────────────┘    └───────────────────┘    └───────────────────┘    │
│            │                        │                        │              │
│            ▼                        ▼                        ▼              │
│  ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐    │
│  │    AgentDoctor    │    │ ConnectionDoctor  │    │   UpdateManager   │    │
│  │                   │    │                   │    │                   │    │
│  └───────────────────┘    └───────────────────┘    └───────────────────┘    │
│            │                        │                        │              │
│            ▼                        ▼                        ▼              │
│  ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐    │
│  │ DiagnosticReport  │    │ RemediationEngine │    │    OperatorCLI    │    │
│  │                   │    │                   │    │                   │    │
│  └───────────────────┘    └───────────────────┘    └───────────────────┘    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Bootstrap Flow

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   stella    │      │ Orchestrator│      │    Agent    │
│   agent     │─────▶│    (API)    │─────▶│   Running   │
│  bootstrap  │      │             │      │             │
└─────────────┘      └─────────────┘      └─────────────┘
      │                     │                    │
      │ 1. Request token    │                    │
      │────────────────────▶│                    │
      │ 2. Return token     │                    │
      │◀────────────────────│                    │
      │                     │                    │
      │ 3. Start agent with token                │
      │─────────────────────────────────────────▶│
      │                     │ 4. Exchange token  │
      │                     │◀───────────────────│
      │                     │ 5. Issue cert      │
      │                     │───────────────────▶│
      │                     │ 6. Register        │
      │                     │◀───────────────────│
      │                     │ 7. Confirm         │
      │                     │───────────────────▶│
```

---

## Key Components

### 1. Bootstrap Service

Zero-touch agent deployment:

```csharp
public sealed class BootstrapService
{
    public async Task<BootstrapResult> BootstrapAgentAsync(
        BootstrapRequest request,
        CancellationToken ct)
    {
        // 1. Generate bootstrap token (one-time use, 15-minute expiry)
        var token = await _tokenService.GenerateBootstrapTokenAsync(
            new TokenRequest
            {
                AgentName = request.AgentName,
                Environment = request.Environment,
                Capabilities = request.Capabilities,
                ExpiresIn = TimeSpan.FromMinutes(15),
                MaxUses = 1
            }, ct);

        // 2. Generate agent configuration
        var config = GenerateAgentConfig(request, token);

        // 3. Generate installation script
        var script = GenerateInstallScript(request.Platform, config);

        return new BootstrapResult
        {
            Token = token.Value,
            TokenExpires = token.ExpiresAt,
            Configuration = config,
            InstallScript = script,
            InstallCommand = GetOneLineInstaller(request.Platform, token)
        };
    }

    private string GetOneLineInstaller(Platform platform, BootstrapToken token)
    {
        return platform switch
        {
            Platform.Linux =>
                $"curl -sSL https://stella.example.com/install.sh | sudo bash -s -- --token {token.Value}",
            Platform.Windows =>
                $"iwr https://stella.example.com/install.ps1 -UseBasicParsing | iex; Install-StellaAgent -Token {token.Value}",
            Platform.Docker =>
                $"docker run -d --name stella-agent -e STELLA_BOOTSTRAP_TOKEN={token.Value} stella/agent:latest",
            _ => throw new UnsupportedPlatformException(platform)
        };
    }
}

public sealed record BootstrapRequest
{
    public string AgentName { get; init; }
    public string Environment { get; init; }
    public Platform Platform { get; init; }
    public ImmutableArray<string> Capabilities { get; init; }
    public ImmutableDictionary<string, string> Labels { get; init; }
    public string? ClusterId { get; init; }  // Join existing cluster
}

public sealed record BootstrapResult
{
    public string Token { get; init; }
    public DateTimeOffset TokenExpires { get; init; }
    public AgentConfiguration Configuration { get; init; }
    public string InstallScript { get; init; }
    public string InstallCommand { get; init; }
}
```

### 2. Configuration Manager

Declarative configuration with drift detection:

```csharp
public sealed class AgentConfigManager
{
    public async Task<ConfigurationState> ApplyConfigurationAsync(
        AgentConfiguration desired,
        CancellationToken ct)
    {
        var current = await _configStore.GetCurrentAsync(ct);
        var diff = ComputeDiff(current, desired);

        if (diff.HasChanges)
        {
            _logger.LogInformation("Configuration drift detected: {Changes}", diff.Summary);

            // Validate changes are safe
            var validation = await ValidateChangesAsync(diff, ct);
            if (!validation.IsValid)
            {
                return new ConfigurationState
                {
                    Status = ConfigStatus.ValidationFailed,
                    Errors = validation.Errors
                };
            }

            // Apply changes with rollback capability
            try
            {
                await ApplyChangesAsync(diff, ct);
                await _configStore.SaveAsync(desired, ct);

                return new ConfigurationState
                {
                    Status = ConfigStatus.Applied,
                    AppliedChanges = diff.Changes
                };
            }
            catch (Exception ex)
            {
                await RollbackAsync(current, ct);
                throw new ConfigurationApplyException("Failed to apply configuration", ex);
            }
        }

        return new ConfigurationState { Status = ConfigStatus.NoChanges };
    }

    public async Task<ConfigDrift> DetectDriftAsync(CancellationToken ct)
    {
        var desired = await _configStore.GetDesiredAsync(ct);
        var actual = await _configStore.GetActualAsync(ct);

        return new ConfigDrift
        {
            HasDrift = !desired.Equals(actual),
            DesiredState = desired,
            ActualState = actual,
            Differences = ComputeDiff(actual, desired).Changes
        };
    }
}

// Declarative configuration model
public sealed record AgentConfiguration
{
    // Identity
    public string AgentId { get; init; }
    public string AgentName { get; init; }
    public string Environment { get; init; }
    public ImmutableDictionary<string, string> Labels { get; init; }

    // Connection
    public string OrchestratorUrl { get; init; }
    public TimeSpan HeartbeatInterval { get; init; } = TimeSpan.FromSeconds(30);
    public TimeSpan ReconnectBackoff { get; init; } = TimeSpan.FromSeconds(5);
    public int MaxReconnectAttempts { get; init; } = 10;

    // Capabilities
    public ImmutableArray<string> Capabilities { get; init; }

    // Resources
    public
        ResourceLimits ResourceLimits { get; init; }
    public int MaxConcurrentTasks { get; init; } = 5;
    public TimeSpan DefaultTaskTimeout { get; init; } = TimeSpan.FromMinutes(30);

    // Security
    public CertificateConfig Certificates { get; init; }
    public bool AutoRenewCertificates { get; init; } = true;
    public TimeSpan CertificateRenewalThreshold { get; init; } = TimeSpan.FromDays(7);

    // Clustering (optional)
    public ClusterConfig? Cluster { get; init; }

    // Observability
    public ObservabilityConfig Observability { get; init; }

    // Auto-update
    public AutoUpdateConfig? AutoUpdate { get; init; }
}

public sealed record CertificateConfig
{
    public CertificateSource Source { get; init; } = CertificateSource.AutoProvision;
    public string? CertificatePath { get; init; }    // Only if Source = File
    public string? PrivateKeyPath { get; init; }     // Only if Source = File
    public string? CaCertificatePath { get; init; }  // Only if Source = File
}

public enum CertificateSource
{
    AutoProvision,  // Orchestrator provisions via bootstrap
    File,           // Manual file paths
    Vault,          // HashiCorp Vault
    ACME,           // Let's Encrypt / ACME
    AzureKeyVault,  // Azure Key Vault
    AWSKMS          // AWS KMS/Secrets Manager
}
```

### 3. Certificate Manager

Automatic certificate lifecycle:

```csharp
public sealed class AgentCertificateManager
{
    public async Task<CertificateState> EnsureCertificateAsync(CancellationToken ct)
    {
        var current = await GetCurrentCertificateAsync(ct);

        if (current == null)
        {
            _logger.LogInformation("No certificate found, requesting new certificate");
            return await ProvisionCertificateAsync(ct);
        }

        var expiresIn = current.NotAfter - _timeProvider.GetUtcNow();
        var threshold = _config.CertificateRenewalThreshold;

        if (expiresIn <= TimeSpan.Zero)
        {
            _logger.LogWarning("Certificate expired, requesting renewal");
            return await RenewCertificateAsync(current, ct);
        }

        if (expiresIn <= threshold)
        {
            _logger.LogInformation(
                "Certificate expires in {Days} days, renewing proactively",
                expiresIn.TotalDays);
            return await RenewCertificateAsync(current, ct);
        }

        return new CertificateState
        {
            Status = CertificateStatus.Valid,
            Certificate = current,
            ExpiresAt = current.NotAfter,
            RenewalScheduled = current.NotAfter - threshold
        };
    }

    private async Task<CertificateState> ProvisionCertificateAsync(CancellationToken ct)
    {
        // Generate key pair locally (private key never leaves agent)
        using var rsa = RSA.Create(4096);

        // Create CSR
        var csr = CreateCertificateSigningRequest(rsa);

        // Submit CSR to orchestrator
        var signedCert = await _orchestratorClient.SubmitCSRAsync(
            new CSRRequest
            {
                AgentId = _config.AgentId,
                CSR = csr,
                RequestedValidity = TimeSpan.FromDays(365)
            }, ct);

        // Store certificate and key securely
        await _certStore.StoreCertificateAsync(signedCert, ct);
        await _keyStore.StorePrivateKeyAsync(rsa, ct);

        return new CertificateState
        {
            Status = CertificateStatus.Provisioned,
            Certificate = signedCert,
            ExpiresAt = signedCert.NotAfter
        };
    }
}
```

### 4. Agent Doctor (Health Checks)

Comprehensive health diagnostics:

```csharp
public sealed class AgentDoctor
{
    private readonly ImmutableArray<IAgentHealthCheck> _checks;

    public AgentDoctor()
    {
        _checks = new IAgentHealthCheck[]
        {
            // Core checks
            new CertificateExpiryCheck(),
            new CertificateValidityCheck(),
            new OrchestratorConnectivityCheck(),
            new HeartbeatCheck(),

            // Resource checks
            new DiskSpaceCheck(),
            new MemoryUsageCheck(),
            new CpuUsageCheck(),
            new FileDescriptorCheck(),

            // Configuration checks
            new ConfigurationValidityCheck(),
            new ConfigurationDriftCheck(),
            new CapabilityCheck(),

            // Network checks
            new RegistryConnectivityCheck(),
            new DNSResolutionCheck(),
            new TLSVersionCheck(),
            new MTLSHandshakeCheck(),

            // Task execution checks
            new DockerConnectivityCheck(),
            new DockerVersionCheck(),
            new TaskQueueDepthCheck(),
            new FailedTaskRateCheck(),

            // Cluster checks (if clustered)
            new ClusterMembershipCheck(),
            new LeaderConnectivityCheck(),
            new StateSyncCheck()
        }.ToImmutableArray();
    }

    public async Task<AgentDiagnosticReport> RunDiagnosticsAsync(
        DiagnosticOptions options,
        CancellationToken ct)
    {
        var results = new List<HealthCheckResult>();
        var startTime = _timeProvider.GetUtcNow();

        foreach (var check in _checks)
        {
            // An empty category filter means "run every check"
            if (options.Categories.Any() && !options.Categories.Contains(check.Category))
            {
                continue;
            }

            try
            {
                var result = await check.ExecuteAsync(ct);
                results.Add(result);

                if (result.Status == HealthStatus.Critical && options.StopOnCritical)
                {
                    break;
                }
            }
            catch (Exception ex)
            {
                results.Add(new HealthCheckResult
                {
                    CheckName = check.Name,
                    Status = HealthStatus.Error,
                    Message = $"Check failed with exception: {ex.Message}",
                    Exception = ex
                });
            }
        }

        return new AgentDiagnosticReport
        {
            AgentId = _config.AgentId,
            AgentName = _config.AgentName,
            Timestamp = startTime,
            Duration = _timeProvider.GetUtcNow() - startTime,
            OverallStatus = DetermineOverallStatus(results),
            Results = results.ToImmutableArray(),
            Remediations = GenerateRemediations(results)
        };
    }

    private ImmutableArray<RemediationStep> GenerateRemediations(
        List<HealthCheckResult> results)
    {
        var remediations = new List<RemediationStep>();

        foreach
            (var result in results.Where(r => r.Status != HealthStatus.Healthy))
        {
            var steps = _remediationEngine.GetRemediationSteps(result);
            remediations.AddRange(steps);
        }

        // Deduplicate, then sort by priority
        return remediations
            .DistinctBy(r => r.Id)
            .OrderByDescending(r => r.Priority)
            .ToImmutableArray();
    }
}

// Individual health checks
public sealed class CertificateExpiryCheck : IAgentHealthCheck
{
    public string Name => "Certificate Expiry";
    public string Category => "Security";

    public async Task<HealthCheckResult> ExecuteAsync(CancellationToken ct)
    {
        var cert = await _certManager.GetCurrentCertificateAsync(ct);

        if (cert == null)
        {
            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Critical,
                Message = "No certificate found",
                RemediationHint = "Run 'stella agent bootstrap' to provision certificate",
                RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-no-certificate"
            };
        }

        var expiresIn = cert.NotAfter - _timeProvider.GetUtcNow();

        if (expiresIn <= TimeSpan.Zero)
        {
            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Critical,
                Message = $"Certificate expired on {cert.NotAfter:u}",
                RemediationHint = "Run 'stella agent renew-cert' or restart agent for auto-renewal",
                RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-cert-expired"
            };
        }

        if (expiresIn <= TimeSpan.FromDays(7))
        {
            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Warning,
                Message = $"Certificate expires in {expiresIn.TotalDays:F0} days",
                RemediationHint = "Certificate will auto-renew if enabled, or run 'stella agent renew-cert'",
                Data = new Dictionary<string, object>
                {
                    ["expires_at"] = cert.NotAfter,
                    ["expires_in_days"] = expiresIn.TotalDays
                }
            };
        }

        return new HealthCheckResult
        {
            CheckName = Name,
            Status = HealthStatus.Healthy,
            Message = $"Certificate valid until {cert.NotAfter:u} ({expiresIn.TotalDays:F0} days)",
            Data = new Dictionary<string, object>
            {
                ["expires_at"] = cert.NotAfter,
                ["expires_in_days"] = expiresIn.TotalDays
            }
        };
    }
}

public sealed class OrchestratorConnectivityCheck : IAgentHealthCheck
{
    public string Name => "Orchestrator Connectivity";
    public string Category => "Network";

    public async Task<HealthCheckResult> ExecuteAsync(CancellationToken ct)
    {
        var endpoint = _config.OrchestratorUrl;

        try
        {
            // Test DNS resolution
            var uri = new Uri(endpoint);
            var addresses = await Dns.GetHostAddressesAsync(uri.Host, ct);

            if (addresses.Length == 0)
            {
                return new HealthCheckResult
                {
                    CheckName = Name,
                    Status = HealthStatus.Critical,
                    Message = $"DNS resolution failed for {uri.Host}",
                    RemediationHint = "Check DNS settings and network connectivity",
                    RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-dns-failure"
                };
            }

            // Test TCP connection.
            // ConnectAsync returns a ValueTask, which may only be converted once,
            // so convert it to a Task up front and reuse that instance.
            using var tcpClient = new TcpClient();
            var connectTask = tcpClient.ConnectAsync(uri.Host, uri.Port, ct).AsTask();
            var completed = await Task.WhenAny(
                connectTask,
                Task.Delay(TimeSpan.FromSeconds(5), ct));

            if (completed != connectTask || !tcpClient.Connected)
            {
                return new HealthCheckResult
                {
                    CheckName = Name,
                    Status = HealthStatus.Critical,
                    Message = $"TCP connection to {endpoint} timed out",
                    RemediationHint = "Check firewall rules and network connectivity",
                    RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-connection-timeout"
                };
            }

            // Test mTLS handshake
            var tlsResult = await TestMTLSHandshakeAsync(uri, ct);
            if (!tlsResult.Success)
            {
                return new HealthCheckResult
                {
                    CheckName = Name,
                    Status = HealthStatus.Critical,
                    Message = $"mTLS handshake failed: {tlsResult.Error}",
                    RemediationHint = tlsResult.RemediationHint,
                    RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-mtls-failure"
                };
            }

            // Test gRPC health endpoint
            var healthResult = await _orchestratorClient.HealthCheckAsync(ct);

            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Healthy,
                Message = $"Connected to orchestrator at {endpoint}",
                Data = new Dictionary<string, object>
                {
                    ["resolved_addresses"] = addresses.Select(a => a.ToString()).ToArray(),
                    ["tls_version"] = tlsResult.TlsVersion,
                    ["latency_ms"] = healthResult.LatencyMs
                }
            };
        }
        catch (Exception ex)
        {
            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Critical,
                Message = $"Connectivity check failed: {ex.Message}",
                Exception = ex,
                RemediationHint = "Check network configuration and orchestrator status",
                RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-connectivity"
            };
        }
    }
}

public sealed class DockerConnectivityCheck : IAgentHealthCheck
{
    public string Name => "Docker Connectivity";
    public string Category => "Runtime";

    public async Task<HealthCheckResult> ExecuteAsync(CancellationToken ct)
    {
        try
        {
            var version = await _dockerClient.GetVersionAsync(ct);

            // Check minimum version
            var minVersion = new Version(20, 10, 0);
            var currentVersion = new Version(version.Version);

            if (currentVersion < minVersion)
            {
                return new HealthCheckResult
                {
                    CheckName = Name,
                    Status = HealthStatus.Warning,
                    Message = $"Docker version {version.Version} is below recommended {minVersion}",
                    RemediationHint = "Upgrade Docker to version 20.10 or later",
                    Data = new Dictionary<string, object>
                    {
                        ["docker_version"] = version.Version,
                        ["api_version"] = version.ApiVersion,
                        ["min_recommended"] = minVersion.ToString()
                    }
                };
            }

            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Healthy,
                Message = $"Docker {version.Version} connected",
                Data = new Dictionary<string, object>
                {
                    ["docker_version"] = version.Version,
                    ["api_version"] = version.ApiVersion,
                    ["os"] = version.Os,
                    ["arch"] = version.Arch
                }
            };
        }
        catch (Exception ex)
        {
            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Critical,
                Message = $"Docker connectivity failed: {ex.Message}",
                Exception = ex,
                RemediationHint = "Ensure Docker daemon is running and agent has permission to access Docker socket",
                RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-docker-connectivity"
            };
        }
    }
}
```

### 5. Remediation Engine

Guided problem resolution:

```csharp
public sealed class RemediationEngine
{
    public ImmutableArray<RemediationStep> GetRemediationSteps(
        HealthCheckResult result)
    {
        var steps = new List<RemediationStep>();

        // Match result to known remediation patterns
        var pattern = _patterns.FirstOrDefault(p => p.Matches(result));
        if (pattern != null)
        {
            steps.AddRange(pattern.Steps);
        }

        // Add generic remediation based on status
        if (result.Status == HealthStatus.Critical)
        {
            steps.Add(new RemediationStep
            {
                Id = "check-logs",
                Priority = RemediationPriority.High,
                Title = "Check Agent Logs",
                Description = "Review agent logs for detailed error information",
                Command = "stella agent logs --tail 100",
                RunbookUrl = result.RunbookUrl
            });
        }

        return steps.ToImmutableArray();
    }

    private readonly ImmutableArray<RemediationPattern> _patterns = new[]
    {
        new RemediationPattern
        {
            CheckName = "Certificate Expiry",
            StatusMatch = HealthStatus.Critical,
            Steps = new[]
            {
                new RemediationStep
                {
                    Id = "renew-cert",
                    Priority = RemediationPriority.Critical,
                    Title = "Renew Agent Certificate",
                    Description = "Agent certificate has expired and must be renewed",
                    Command = "stella agent renew-cert --force",
                    Automated = true
                },
                new RemediationStep
                {
                    Id = "restart-agent",
                    Priority = RemediationPriority.High,
                    Title = "Restart Agent",
                    Description = "Restart agent to apply new certificate",
                    Command = "systemctl restart stella-agent",
                    Automated = false
                }
            }
        },
        new RemediationPattern
        {
            CheckName = "Orchestrator Connectivity",
            MessageContains = "DNS resolution failed",
            Steps = new[]
            {
                new RemediationStep
                {
                    Id = "check-dns",
                    Priority = RemediationPriority.Critical,
                    Title = "Verify DNS Configuration",
                    Description = "Check that DNS servers are configured and reachable",
                    Command = "cat /etc/resolv.conf && nslookup orchestrator.example.com",
                    Automated = false
                },
                new RemediationStep
                {
                    Id = "check-hosts",
                    Priority = RemediationPriority.High,
                    Title = "Check /etc/hosts",
                    Description = "Verify no conflicting entries in hosts file",
                    Command = "grep orchestrator /etc/hosts",
                    Automated = false
                }
            }
        },
        new RemediationPattern
        {
            CheckName = "Docker Connectivity",
            Steps = new[]
            {
                new RemediationStep
                {
                    Id = "check-docker-daemon",
                    Priority = RemediationPriority.Critical,
                    Title = "Check Docker Daemon",
                    Description = "Verify Docker daemon is running",
                    Command = "systemctl status docker",
                    Automated = false
                },
                new RemediationStep
                {
                    Id = "check-docker-socket",
                    Priority = RemediationPriority.High,
                    Title = "Check Docker Socket Permissions",
                    Description = "Verify agent has access to Docker socket",
                    Command = "ls -la /var/run/docker.sock && groups stella-agent",
                    Automated = false
                }
            }
        }
    }.ToImmutableArray();
}

public sealed record RemediationStep
{
    public string Id { get; init; }
    public RemediationPriority Priority { get; init; }
    public string Title { get; init; }
    public string Description { get; init; }
    public string? Command { get; init; }
    public string? RunbookUrl { get; init; }
    public bool Automated { get; init; }
    public TimeSpan? EstimatedDuration { get; init; }
}
```

### 6. Auto-Update Manager

Safe agent binary updates:

```csharp
public sealed class AgentUpdateManager
{
    public async Task<UpdateResult> CheckAndApplyUpdateAsync(
        CancellationToken ct)
    {
        // Treat a missing AutoUpdate section the same as Enabled = false
        if (_config.AutoUpdate?.Enabled != true)
        {
            return new UpdateResult { Status = UpdateStatus.Disabled };
        }

        // Check for available update
        var available = await _updateService.CheckForUpdateAsync(
            _config.AgentVersion,
            _config.AutoUpdate.Channel,
            ct);

        if (!available.HasUpdate)
        {
            return new UpdateResult { Status = UpdateStatus.UpToDate };
        }

        // Verify update signature
        var verified = await _signatureVerifier.VerifyAsync(
            available.Package,
            available.Signature,
            ct);

        if (!verified)
        {
            _logger.LogError("Update signature verification failed");
            return new UpdateResult
            {
                Status = UpdateStatus.VerificationFailed,
                Error = "Package signature verification failed"
            };
        }

        // Check if update window is allowed
        if (!IsInUpdateWindow())
        {
            _logger.LogInformation(
                "Update available but outside update window, scheduling for {Window}",
                _config.AutoUpdate.MaintenanceWindow);

            return new UpdateResult
            {
                Status = UpdateStatus.Scheduled,
                ScheduledFor = GetNextMaintenanceWindow()
            };
        }

        // Drain active tasks
        await DrainActiveTasksAsync(ct);

        // Create the rollback point before the try block so the catch handler can use it
        var rollbackPoint = await CreateRollbackPointAsync(ct);

        // Download and apply update
        try
        {
            var packagePath = await DownloadPackageAsync(available, ct);

            // Apply update
            await ApplyUpdateAsync(packagePath, ct);

            // Verify new version starts correctly
            var healthCheck = await VerifyNewVersionAsync(ct);
            if (!healthCheck.Healthy)
            {
                _logger.LogError("New version health check failed, rolling back");
                await RollbackAsync(rollbackPoint, ct);

                return new UpdateResult
                {
                    Status = UpdateStatus.RolledBack,
                    Error = healthCheck.Error
                };
            }

            return new UpdateResult
            {
                Status = UpdateStatus.Applied,
                PreviousVersion = _config.AgentVersion,
                NewVersion = available.Version
            };
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Update failed, attempting rollback");
            await RollbackAsync(rollbackPoint, ct);

            return new UpdateResult
            {
                Status = UpdateStatus.Failed,
                Error = ex.Message
            };
        }
    }
}

public sealed record AutoUpdateConfig
{
    public bool Enabled { get; init; } = false;
    public UpdateChannel Channel { get; init; } = UpdateChannel.Stable;
    public string? MaintenanceWindow { get; init; }  // Cron expression
    public bool DrainBeforeUpdate { get; init; } = true;
    public TimeSpan DrainTimeout { get; init; } = TimeSpan.FromMinutes(5);
    public int MaxRollbackVersions { get; init; } = 3;
}

public enum UpdateChannel
{
    Stable,
    Beta,
    Canary
}
```

### 7. Operator CLI Commands

Streamlined operational commands:

```csharp
public sealed class AgentOperatorCommands
{
    // Bootstrap new agent
    // stella agent bootstrap --name prod-agent-01 --env production --platform linux
    [Command("agent bootstrap")]
    public async Task<int> BootstrapAsync(
        [Option] string name,
        [Option] string env,
        [Option] Platform platform = Platform.Linux,
        [Option] string[]? capabilities = null,
        [Option] string? cluster = null)
    {
        var result = await _bootstrap.BootstrapAgentAsync(new BootstrapRequest
        {
            AgentName = name,
            Environment = env,
            Platform = platform,
            Capabilities = capabilities?.ToImmutableArray() ?? ImmutableArray<string>.Empty,
            ClusterId = cluster
        }, _ct);

        Console.WriteLine("Bootstrap token generated (expires in 15 minutes):");
        Console.WriteLine();
        Console.WriteLine($"  Token: {result.Token}");
        Console.WriteLine();
        Console.WriteLine("One-line installer:");
        Console.WriteLine($"  {result.InstallCommand}");
        Console.WriteLine();
        Console.WriteLine("Or download the install script:");
        Console.WriteLine($"  stella agent install-script --token {result.Token} --output install.sh");

        return 0;
    }

    // Run diagnostics
    // stella agent doctor [--category security] [--fix]
    [Command("agent doctor")]
    public async Task<int> DoctorAsync(
        [Option] string? agentId = null,
        [Option] string[]? categories = null,
        [Option] bool fix = false,
        [Option] OutputFormat format = OutputFormat.Table)
    {
        var options = new DiagnosticOptions
        {
            Categories = categories?.ToImmutableArray() ?? ImmutableArray<string>.Empty,
            IncludeRemediations = true
        };

        var report = agentId != null
            ? await _doctor.RunRemoteDiagnosticsAsync(agentId, options, _ct)
            : await _doctor.RunDiagnosticsAsync(options, _ct);

        // Display results
        RenderDiagnosticReport(report, format);

        // Optionally apply automated fixes
        if (fix && report.Remediations.Any(r => r.Automated))
        {
            Console.WriteLine();
            Console.WriteLine("Applying automated remediations...");

            foreach (var remediation in report.Remediations.Where(r => r.Automated))
            {
                Console.WriteLine($"  - {remediation.Title}");
                await _remediation.ApplyAsync(remediation, _ct);
            }
        }

        return report.OverallStatus == HealthStatus.Healthy ? 0 : 1;
    }

    // View agent configuration
    // stella agent config [--agent-id xyz] [--diff]
    [Command("agent config")]
    public async Task<int> ConfigAsync(
        [Option] string? agentId = null,
        [Option] bool diff = false,
        [Option] OutputFormat format = OutputFormat.Yaml)
    {
        if (diff)
        {
            var drift = await _configManager.DetectDriftAsync(_ct);
            RenderConfigDiff(drift, format);
            return drift.HasDrift ? 1 : 0;
        }

        var config = await _configManager.GetConfigurationAsync(agentId, _ct);
        RenderConfiguration(config, format);
        return 0;
    }

    // Apply configuration changes
    // stella agent apply -f agent-config.yaml
    [Command("agent apply")]
    public async Task<int> ApplyAsync(
        [Option('f')] string configFile)
    {
        var config = await LoadConfigurationAsync(configFile);
        var validation = await _configManager.ValidateAsync(config, _ct);

        if (!validation.IsValid)
        {
            Console.WriteLine("Configuration validation failed:");
            foreach (var error in validation.Errors)
            {
                Console.WriteLine($"  - {error}");
            }
            return 1;
        }

        var result = await _configManager.ApplyConfigurationAsync(config, _ct);

        if (result.Status == ConfigStatus.Applied)
        {
            Console.WriteLine($"Configuration applied successfully ({result.AppliedChanges.Length} changes)");
            return 0;
        }

        Console.WriteLine($"Configuration apply failed: {result.Status}");
        return 1;
    }

    // Renew certificate
    // stella agent renew-cert [--force]
    [Command("agent renew-cert")]
    public async Task<int> RenewCertAsync(
        [Option] bool force = false)
    {
        var result = await _certManager.RenewCertificateAsync(force, _ct);

        if (result.Status == CertificateStatus.Renewed)
        {
            Console.WriteLine("Certificate renewed successfully");
            Console.WriteLine($"  New expiry: {result.ExpiresAt:u}");
            return 0;
        }

        Console.WriteLine($"Certificate renewal failed: {result.Error}");
        return 1;
    }

    // View agent logs
    // stella agent logs [--tail 100] [--follow] [--level error]
    [Command("agent logs")]
    public async Task<int> LogsAsync(
        [Option] string? agentId = null,
        [Option] int tail = 50,
        [Option] bool follow = false,
        [Option] LogLevel? level = null)
    {
        await foreach (var entry in _logService.StreamLogsAsync(
            agentId, tail, follow, level, _ct))
        {
            RenderLogEntry(entry);
        }
        return 0;
    }

    // Force update
    // stella agent update [--version x.y.z] [--force]
    [Command("agent update")]
    public async Task<int> UpdateAsync(
        [Option] string? version = null,
        [Option] bool force = false)
    {
        var result = await _updateManager.UpdateToVersionAsync(version, force, _ct);

        Console.WriteLine($"Update status: {result.Status}");
        if (result.Status == UpdateStatus.Applied)
        {
            Console.WriteLine($"  Previous: {result.PreviousVersion}");
            Console.WriteLine($"  Current:  {result.NewVersion}");
        }

        return result.Status == UpdateStatus.Applied ? 0 : 1;
    }
}
```

---

## Doctor Plugin for Server-Side

Central Doctor plugin for agent fleet health:

```csharp
// src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Agent/AgentHealthPlugin.cs

public sealed class AgentHealthPlugin : IDoctorPlugin
{
    public string Name => "Agent Health";
    public string Description => "Monitors agent fleet health and connectivity";

    public ImmutableArray<IDoctorCheck> Checks => new IDoctorCheck[]
    {
        new AgentHeartbeatFreshnessCheck(),
        new AgentCertificateExpiryCheck(),
        new AgentVersionConsistencyCheck(),
        new AgentCapacityCheck(),
        new StaleAgentCheck(),
        new AgentClusterHealthCheck(),
        new TaskQueueBacklogCheck(),
        new FailedTaskRateCheck(),
        new AgentResourceUtilizationCheck()
    }.ToImmutableArray();
}

public sealed class AgentHeartbeatFreshnessCheck : IDoctorCheck
{
    public string Name => "Agent Heartbeat Freshness";
    public CheckSeverity Severity => CheckSeverity.Critical;

    public async Task<DoctorCheckResult> ExecuteAsync(CancellationToken ct)
    {
        var agents = await _agentStore.GetAllAsync(ct);
        var staleAgents = new List<string>();
        var warningAgents = new List<string>();

        foreach (var agent in agents.Where(a => a.Status != AgentStatus.Deactivated))
        {
            var heartbeatAge = _timeProvider.GetUtcNow() - agent.LastHeartbeat;

            if (heartbeatAge > TimeSpan.FromMinutes(5))
            {
                staleAgents.Add($"{agent.Name} (last heartbeat: {heartbeatAge.TotalMinutes:F0}m ago)");
            }
            else if (heartbeatAge > TimeSpan.FromMinutes(2))
            {
                warningAgents.Add($"{agent.Name} (last heartbeat: {heartbeatAge.TotalSeconds:F0}s ago)");
            }
        }

        if (staleAgents.Any())
        {
            return new DoctorCheckResult
            {
                Status = CheckStatus.Critical,
                Message = $"{staleAgents.Count} agent(s) have stale heartbeats",
                Details = staleAgents,
                Remediation = "Check agent connectivity and status. Run 'stella agent doctor --agent-id <agent-id>' for diagnostics."
            };
        }

        if (warningAgents.Any())
        {
            return new DoctorCheckResult
            {
                Status = CheckStatus.Warning,
                Message = $"{warningAgents.Count} agent(s) have delayed heartbeats",
                Details = warningAgents
            };
        }

        return new DoctorCheckResult
        {
            Status = CheckStatus.Healthy,
            Message = $"All {agents.Count} agents have fresh heartbeats"
        };
    }
}

public sealed class AgentCertificateExpiryCheck : IDoctorCheck
{
    public string Name => "Agent Certificate Expiry";
    public CheckSeverity Severity => CheckSeverity.High;

    public async Task<DoctorCheckResult> ExecuteAsync(CancellationToken ct)
    {
        var agents = await _agentStore.GetAllAsync(ct);
        var expiringSoon = new List<string>();
        var expired = new List<string>();

        foreach (var agent in agents)
        {
            var expiresIn = agent.CertificateExpiry - _timeProvider.GetUtcNow();

            if (expiresIn <= TimeSpan.Zero)
            {
                expired.Add($"{agent.Name} (expired {-expiresIn.TotalDays:F0} days ago)");
            }
            else if (expiresIn <= TimeSpan.FromDays(7))
            {
                expiringSoon.Add($"{agent.Name} (expires in {expiresIn.TotalDays:F0} days)");
            }
        }

        if (expired.Any())
        {
            return new DoctorCheckResult
            {
                Status = CheckStatus.Critical,
                Message = $"{expired.Count} agent(s) have expired certificates",
                Details = expired,
                Remediation = "Renew certificates immediately: 'stella agent renew-cert --agent-id <agent-id>'"
            };
        }

        if (expiringSoon.Any())
        {
            return new DoctorCheckResult
            {
                Status = CheckStatus.Warning,
                Message = $"{expiringSoon.Count} agent(s) have certificates expiring soon",
                Details = expiringSoon,
                Remediation = "Schedule certificate renewal before expiry"
            };
        }

        return new DoctorCheckResult
        {
            Status = CheckStatus.Healthy,
            Message = "All agent certificates are valid"
        };
    }
}
```

---

## Configuration Examples

### Minimal Configuration (Bootstrap)

```yaml
# Bootstrapped agent - minimal config required
agent:
  name: prod-agent-01

orchestrator_url: https://orchestrator.example.com:8443

# Everything else is auto-configured via bootstrap
```

### Full Configuration

```yaml
agent:
  # Identity
  id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
  name: prod-agent-01
  environment: production
  labels:
    region: us-east-1
    tier: web

  # Connection
  orchestrator_url: https://orchestrator.example.com:8443
  heartbeat_interval: 30s
  reconnect_backoff: 5s
  max_reconnect_attempts: 10

  # Capabilities
  capabilities:
    - docker
    - compose
    - health_check

  # Resources
  max_concurrent_tasks: 5
  default_task_timeout: 30m
  resource_limits:
    cpu_percent: 80
    memory_percent: 80
    disk_percent: 90

  # Certificates
  certificates:
    source: auto_provision  # auto_provision | file | vault
    auto_renew: true
    renewal_threshold: 7d

  # Clustering (optional)
  cluster:
    id: prod-cluster-01
    mode: active_active  # active_passive | active_active | sharded
    min_members: 2

  # Observability
  observability:
    metrics:
      enabled: true
      port: 9090
    logging:
      level: info
      format: json
    tracing:
      enabled: true
      endpoint: http://jaeger:14268/api/traces

  # Auto-update (optional)
  auto_update:
    enabled: true
    channel: stable  # stable | beta | canary
    maintenance_window: "0 3 * * *"  # 3 AM daily
    drain_before_update: true
```

---

## CLI Quick Reference

```bash
# Bootstrap new agent
stella agent bootstrap --name prod-01 --env production --platform linux

# Run health diagnostics
stella agent doctor
stella agent doctor --category security --fix
stella agent doctor --agent-id abc123 --format json

# View/apply configuration
stella agent config
stella agent config --diff
stella agent apply -f agent-config.yaml

# Certificate management
stella agent renew-cert
stella agent renew-cert --force

# Logs and debugging
stella agent logs --tail 100
stella agent logs --follow --level error

# Updates
stella agent update
stella agent update --version 2.1.0

# Status and health
stella agent status
stella agent list --env production
stella agent health abc123
```

---

## Metrics & Observability

### Prometheus Metrics

```
# Bootstrap
stella_agent_bootstrap_total{environment, platform}
stella_agent_bootstrap_success_total{environment}
stella_agent_bootstrap_failed_total{environment, reason}

# Configuration
stella_agent_config_drift_detected_total{agent_id}
stella_agent_config_apply_total{agent_id, status}

# Certificates
stella_agent_certificate_expiry_seconds{agent_id}
stella_agent_certificate_renewal_total{agent_id, status}

# Health Checks
stella_agent_health_check_total{agent_id, check_name, status}
stella_agent_health_score{agent_id}

# Updates
stella_agent_update_available{agent_id, current_version, available_version}
stella_agent_update_applied_total{agent_id, status}
stella_agent_update_rollback_total{agent_id}
```

---

## Test Strategy

### Unit Tests

- Bootstrap token generation and validation
- Configuration diff computation
- Certificate lifecycle logic
- Health check execution
- Remediation matching

### Integration Tests

- Full bootstrap flow
- Configuration apply with rollback
- Certificate renewal
- Auto-update with rollback
- Doctor diagnostics

### E2E Tests

- Bootstrap to running agent
- Multi-agent cluster formation
- Failover scenarios
- Update and rollback scenarios

---

## Migration Path

### Phase 1: Bootstrap Service (Week 1-2)

- Bootstrap token service
- One-line installer generation
- Platform-specific install scripts

### Phase 2: Configuration Manager (Week 3-4)

- Declarative configuration model
- Drift detection
- Apply with rollback

### Phase 3: Certificate Manager (Week 5-6)

- Auto-provisioning
- Auto-renewal
- Multi-source support (Vault, ACME, etc.)

### Phase 4: Agent Doctor (Week 7-8)

- Core health checks
- Remediation engine
- CLI integration

### Phase 5: Doctor Plugin (Week 9-10)

- Server-side fleet health
- Dashboard integration
- Alerting rules

### Phase 6: Auto-Update (Week 11-12)

- Update service
- Safe rollback
- Maintenance windows
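
The maintenance-window item in Phase 6 corresponds to the `IsInUpdateWindow()` and `GetNextMaintenanceWindow()` calls in the Auto-Update Manager, neither of which is defined in this document. The following is a minimal sketch of one way they might work, not the planned implementation. It assumes the window is given in the daily cron form `minute hour * * *` (as in the `"0 3 * * *"` example), that times are evaluated in UTC, and that the window has a fixed duration. The `MaintenanceWindow` class, its method names, and the `duration` parameter are all hypothetical; `AutoUpdateConfig` has no duration setting.

```csharp
using System;

// Hypothetical helper for illustration; none of these names come from the design above.
public static class MaintenanceWindow
{
    // Accepts only the daily form "<minute> <hour> * * *" and returns the start time of day.
    public static bool TryParseDaily(string cron, out TimeSpan startOfDay)
    {
        startOfDay = default;
        var fields = cron.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        if (fields.Length != 5 || fields[2] != "*" || fields[3] != "*" || fields[4] != "*")
            return false;
        if (!int.TryParse(fields[0], out var minute) || minute < 0 || minute > 59)
            return false;
        if (!int.TryParse(fields[1], out var hour) || hour < 0 || hour > 23)
            return false;

        startOfDay = new TimeSpan(hour, minute, 0);
        return true;
    }

    // True when 'now' falls inside a window of the given (assumed) duration.
    public static bool IsInWindow(string cron, TimeSpan duration, DateTimeOffset now)
    {
        if (!TryParseDaily(cron, out var startOfDay))
            return false;

        var todayStart = new DateTimeOffset(now.UtcDateTime.Date, TimeSpan.Zero) + startOfDay;

        // Check yesterday's window as well, since it may run past midnight.
        return Contains(todayStart.AddDays(-1), duration, now)
            || Contains(todayStart, duration, now);
    }

    // Start of the next window strictly after 'now', or null if the expression is unsupported.
    public static DateTimeOffset? NextWindowStart(string cron, DateTimeOffset now)
    {
        if (!TryParseDaily(cron, out var startOfDay))
            return null;

        var todayStart = new DateTimeOffset(now.UtcDateTime.Date, TimeSpan.Zero) + startOfDay;
        return todayStart > now ? todayStart : todayStart.AddDays(1);
    }

    private static bool Contains(DateTimeOffset start, TimeSpan duration, DateTimeOffset now)
        => now >= start && now < start + duration;
}
```

With `"0 3 * * *"` and an assumed two-hour duration, 03:30 UTC falls inside the window and 06:00 UTC does not; at 06:00 UTC the next start is 03:00 UTC on the following day. A full implementation would more likely rely on a cron parsing library so that weekly or monthly schedules are supported.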