git.stella-ops.org/docs/modules/release-orchestrator/enhancements/agent-operations.md
2026-01-17 21:32:08 +02:00


Agent Operations & Easy Setup

Overview

The Agent Operations enhancement transforms agent deployment from a manual, error-prone process into a streamlined, self-healing experience. It provides zero-touch bootstrap, declarative configuration, comprehensive health diagnostics (Doctor plugin), and operational tooling that makes agents easy to deploy, monitor, and maintain at scale.

This enhancement complements Sprint 034 (Agent Resilience) by focusing on the operational and configuration aspects rather than the clustering and failover mechanisms.


Design Principles

  1. Zero-Touch Bootstrap: Agents should be deployable with a single command
  2. Declarative Configuration: Define desired state, system converges automatically
  3. Self-Diagnosing: Agents report their own health issues with remediation hints
  4. Operator-Friendly: Clear CLI commands, meaningful error messages, runbook links
  5. Secure by Default: Auto-provisioned certificates, secrets never on disk
  6. Observable: Rich metrics, structured logs, distributed tracing

Current Pain Points

| Pain Point | Current State | Target State |
| --- | --- | --- |
| Certificate Management | Manual paths to cert/key/ca files | Auto-provisioned, auto-renewed |
| Configuration | Static YAML files, manual edits | Declarative config with drift detection |
| Health Monitoring | Binary alive/offline | Multi-dimensional health scoring |
| Troubleshooting | Manual log inspection | Doctor plugin with guided remediation |
| Scaling | Manual per-agent setup | Bootstrap token + auto-join |
| Updates | Manual agent binary updates | Auto-update with rollback |
| Network Issues | Silent failures | Connection diagnostics with hints |

Architecture

Component Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                      Agent Operations & Setup                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐   │
│  │ BootstrapService  │───▶│ ConfigManager     │───▶│ CertificateManager│   │
│  │                   │    │                   │    │                   │   │
│  └───────────────────┘    └───────────────────┘    └───────────────────┘   │
│           │                        │                        │               │
│           ▼                        ▼                        ▼               │
│  ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐   │
│  │ AgentDoctor       │    │ ConnectionDoctor  │    │ UpdateManager     │   │
│  │                   │    │                   │    │                   │   │
│  └───────────────────┘    └───────────────────┘    └───────────────────┘   │
│           │                        │                        │               │
│           ▼                        ▼                        ▼               │
│  ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐   │
│  │ DiagnosticReport  │    │ RemediationEngine │    │ OperatorCLI       │   │
│  │                   │    │                   │    │                   │   │
│  └───────────────────┘    └───────────────────┘    └───────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

                         Bootstrap Flow

    ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
    │   stella    │      │ Orchestrator│      │   Agent     │
    │   agent     │─────▶│   (API)     │─────▶│   Running   │
    │   bootstrap │      │             │      │             │
    └─────────────┘      └─────────────┘      └─────────────┘
         │                     │                    │
         │  1. Request token   │                    │
         │────────────────────▶│                    │
         │  2. Return token    │                    │
         │◀────────────────────│                    │
         │                     │                    │
         │  3. Start agent with token              │
         │─────────────────────────────────────────▶│
         │                     │  4. Exchange token │
         │                     │◀───────────────────│
         │                     │  5. Issue cert     │
         │                     │───────────────────▶│
         │                     │  6. Register       │
         │                     │◀───────────────────│
         │                     │  7. Confirm        │
         │                     │───────────────────▶│
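
Steps 4 and 5 of the flow only hold up if a bootstrap token can be exchanged exactly once and only before it expires. The following is a minimal sketch of that rule, not the orchestrator's actual token service: `InMemoryBootstrapTokenStore`, `Issue`, and `TryRedeem` are hypothetical names, and the clock is injected as a plain delegate so the expiry path can be exercised without waiting.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;

// Hypothetical in-memory store, for illustration only. A real service would
// persist tokens (ideally hashed) and bind them to an agent name and environment.
public sealed class InMemoryBootstrapTokenStore
{
    private readonly ConcurrentDictionary<string, DateTimeOffset> _expiries = new();
    private readonly Func<DateTimeOffset> _now;

    public InMemoryBootstrapTokenStore(Func<DateTimeOffset> now) => _now = now;

    // Issues a random 256-bit token that expires after the given lifetime.
    public string Issue(TimeSpan lifetime)
    {
        var token = Convert.ToHexString(RandomNumberGenerator.GetBytes(32));
        _expiries[token] = _now() + lifetime;
        return token;
    }

    // TryRemove is atomic, so of two concurrent redemptions only one can succeed.
    // An expired token is removed as well, and the call returns false.
    public bool TryRedeem(string token) =>
        _expiries.TryRemove(token, out var expiresAt) && _now() < expiresAt;
}
```

Removing the entry as part of redemption, rather than checking and then deleting, is what keeps a replayed token from being accepted a second time.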

Key Components

1. Bootstrap Service

Zero-touch agent deployment:

public sealed class BootstrapService
{
    public async Task<BootstrapResult> BootstrapAgentAsync(
        BootstrapRequest request,
        CancellationToken ct)
    {
        // 1. Generate bootstrap token (one-time use, 15-minute expiry)
        var token = await _tokenService.GenerateBootstrapTokenAsync(
            new TokenRequest
            {
                AgentName = request.AgentName,
                Environment = request.Environment,
                Capabilities = request.Capabilities,
                ExpiresIn = TimeSpan.FromMinutes(15),
                MaxUses = 1
            }, ct);

        // 2. Generate agent configuration
        var config = GenerateAgentConfig(request, token);

        // 3. Generate installation script
        var script = GenerateInstallScript(request.Platform, config);

        return new BootstrapResult
        {
            Token = token.Value,
            TokenExpires = token.ExpiresAt,
            Configuration = config,
            InstallScript = script,
            InstallCommand = GetOneLineInstaller(request.Platform, token)
        };
    }

    private string GetOneLineInstaller(Platform platform, BootstrapToken token)
    {
        return platform switch
        {
            Platform.Linux => $"curl -sSL https://stella.example.com/install.sh | sudo bash -s -- --token {token.Value}",
            Platform.Windows => $"iwr https://stella.example.com/install.ps1 -UseBasicParsing | iex; Install-StellaAgent -Token {token.Value}",
            Platform.Docker => $"docker run -d --name stella-agent -e STELLA_BOOTSTRAP_TOKEN={token.Value} stella/agent:latest",
            _ => throw new UnsupportedPlatformException(platform)
        };
    }
}

public sealed record BootstrapRequest
{
    public string AgentName { get; init; }
    public string Environment { get; init; }
    public Platform Platform { get; init; }
    public ImmutableArray<AgentCapability> Capabilities { get; init; }
    public ImmutableDictionary<string, string> Labels { get; init; }
    public string? ClusterId { get; init; }  // Join existing cluster
}

public sealed record BootstrapResult
{
    public string Token { get; init; }
    public DateTimeOffset TokenExpires { get; init; }
    public AgentConfiguration Configuration { get; init; }
    public string InstallScript { get; init; }
    public string InstallCommand { get; init; }
}

2. Configuration Manager

Declarative configuration with drift detection:

public sealed class AgentConfigManager
{
    public async Task<ConfigurationState> ApplyConfigurationAsync(
        AgentConfiguration desired,
        CancellationToken ct)
    {
        var current = await _configStore.GetCurrentAsync(ct);
        var diff = ComputeDiff(current, desired);

        if (diff.HasChanges)
        {
            _logger.LogInformation("Configuration drift detected: {Changes}", diff.Summary);

            // Validate changes are safe
            var validation = await ValidateChangesAsync(diff, ct);
            if (!validation.IsValid)
            {
                return new ConfigurationState
                {
                    Status = ConfigStatus.ValidationFailed,
                    Errors = validation.Errors
                };
            }

            // Apply changes with rollback capability
            try
            {
                await ApplyChangesAsync(diff, ct);
                await _configStore.SaveAsync(desired, ct);

                return new ConfigurationState
                {
                    Status = ConfigStatus.Applied,
                    AppliedChanges = diff.Changes
                };
            }
            catch (Exception ex)
            {
                await RollbackAsync(current, ct);
                throw new ConfigurationApplyException("Failed to apply configuration", ex);
            }
        }

        return new ConfigurationState { Status = ConfigStatus.NoChanges };
    }

    public async Task<ConfigDrift> DetectDriftAsync(CancellationToken ct)
    {
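        var desired = await _configStore.GetDesiredAsync(ct);
        var actual = await _configStore.GetActualAsync(ct);

        // Derive drift from the computed diff rather than record equality:
        // records compare ImmutableArray/ImmutableDictionary members by reference,
        // so Equals() would report drift for structurally identical configurations.
        var diff = ComputeDiff(actual, desired);

        return new ConfigDrift
        {
            HasDrift = diff.HasChanges,
            DesiredState = desired,
            ActualState = actual,
            Differences = diff.Changes
        };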
    }
}

// Declarative configuration model
public sealed record AgentConfiguration
{
    // Identity
    public string AgentId { get; init; }
    public string AgentName { get; init; }
    public string Environment { get; init; }
    public ImmutableDictionary<string, string> Labels { get; init; }

    // Connection
    public string OrchestratorUrl { get; init; }
    public TimeSpan HeartbeatInterval { get; init; } = TimeSpan.FromSeconds(30);
    public TimeSpan ReconnectBackoff { get; init; } = TimeSpan.FromSeconds(5);
    public int MaxReconnectAttempts { get; init; } = 10;

    // Capabilities
    public ImmutableArray<AgentCapability> Capabilities { get; init; }

    // Resources
    public ResourceLimits ResourceLimits { get; init; }
    public int MaxConcurrentTasks { get; init; } = 5;
    public TimeSpan DefaultTaskTimeout { get; init; } = TimeSpan.FromMinutes(30);

    // Security
    public CertificateConfig Certificates { get; init; }
    public bool AutoRenewCertificates { get; init; } = true;
    public TimeSpan CertificateRenewalThreshold { get; init; } = TimeSpan.FromDays(7);

    // Clustering (optional)
    public ClusterConfig? Cluster { get; init; }

    // Observability
    public ObservabilityConfig Observability { get; init; }

    // Auto-update
    public AutoUpdateConfig? AutoUpdate { get; init; }
}

public sealed record CertificateConfig
{
    public CertificateSource Source { get; init; } = CertificateSource.AutoProvision;
    public string? CertificatePath { get; init; }  // Only if Source = File
    public string? PrivateKeyPath { get; init; }   // Only if Source = File
    public string? CaCertificatePath { get; init; } // Only if Source = File
}

public enum CertificateSource
{
    AutoProvision,  // Orchestrator provisions via bootstrap
    File,           // Manual file paths
    Vault,          // HashiCorp Vault
    ACME,           // Let's Encrypt / ACME
    AzureKeyVault,  // Azure Key Vault
    AWSKMS          // AWS KMS/Secrets Manager
}
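
`ComputeDiff` is used above but not shown. One way to approach it, assuming the configuration has first been flattened into string key/value pairs, is a key-by-key comparison. `ConfigChange` and `ConfigDiff.Compute` are hypothetical names for this sketch, and the flattening step itself is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One differing setting: Actual is null when the key is missing on the agent,
// Desired is null when the key should be removed.
public sealed record ConfigChange(string Key, string? Actual, string? Desired);

public static class ConfigDiff
{
    // Hypothetical helper: compares flattened settings by key, ordinal on values,
    // and returns the differences in a stable (sorted) order.
    public static IReadOnlyList<ConfigChange> Compute(
        IReadOnlyDictionary<string, string> actual,
        IReadOnlyDictionary<string, string> desired)
    {
        return actual.Keys.Union(desired.Keys)
            .OrderBy(k => k, StringComparer.Ordinal)
            .Select(k => new ConfigChange(
                k,
                actual.TryGetValue(k, out var a) ? a : null,
                desired.TryGetValue(k, out var d) ? d : null))
            .Where(c => !string.Equals(c.Actual, c.Desired, StringComparison.Ordinal))
            .ToList();
    }
}
```

Comparing flattened values sidesteps the reference-equality semantics of immutable collections inside records, and the sorted output keeps log summaries deterministic.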

3. Certificate Manager

Automatic certificate lifecycle:

public sealed class AgentCertificateManager
{
    public async Task<CertificateState> EnsureCertificateAsync(CancellationToken ct)
    {
        var current = await GetCurrentCertificateAsync(ct);

        if (current == null)
        {
            _logger.LogInformation("No certificate found, requesting new certificate");
            return await ProvisionCertificateAsync(ct);
        }

        var expiresIn = current.NotAfter - _timeProvider.GetUtcNow();
        var threshold = _config.CertificateRenewalThreshold;

        if (expiresIn <= TimeSpan.Zero)
        {
            _logger.LogWarning("Certificate expired, requesting renewal");
            return await RenewCertificateAsync(current, ct);
        }

        if (expiresIn <= threshold)
        {
            _logger.LogInformation(
                "Certificate expires in {Days} days, renewing proactively",
                expiresIn.TotalDays);
            return await RenewCertificateAsync(current, ct);
        }

        return new CertificateState
        {
            Status = CertificateStatus.Valid,
            Certificate = current,
            ExpiresAt = current.NotAfter,
            RenewalScheduled = current.NotAfter - threshold
        };
    }

    private async Task<CertificateState> ProvisionCertificateAsync(CancellationToken ct)
    {
        // Generate key pair locally (private key never leaves agent)
        using var rsa = RSA.Create(4096);

        // Create CSR
        var csr = CreateCertificateSigningRequest(rsa);

        // Submit CSR to orchestrator
        var signedCert = await _orchestratorClient.SubmitCSRAsync(
            new CSRRequest
            {
                AgentId = _config.AgentId,
                CSR = csr,
                RequestedValidity = TimeSpan.FromDays(365)
            }, ct);

        // Store certificate and key securely
        await _certStore.StoreCertificateAsync(signedCert, ct);
        await _keyStore.StorePrivateKeyAsync(rsa, ct);

        return new CertificateState
        {
            Status = CertificateStatus.Provisioned,
            Certificate = signedCert,
            ExpiresAt = signedCert.NotAfter
        };
    }
}
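
`CreateCertificateSigningRequest` is not shown above. A minimal sketch using the .NET `CertificateRequest` API could look like the following. The subject layout (`CN=<agent id>`) and the client-authentication usage extension are assumptions made for illustration, not the orchestrator's documented requirements, and `CsrSketch` is a hypothetical name.

```csharp
using System;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

public static class CsrSketch
{
    // Builds a DER-encoded PKCS#10 request signed with the agent's own key.
    // Only the public key is embedded; the private key stays on the agent.
    public static byte[] CreateCsr(RSA key, string agentId)
    {
        // Assumes agentId contains no characters that need escaping in a
        // distinguished name (commas, plus signs, and so on).
        var request = new CertificateRequest(
            new X500DistinguishedName($"CN={agentId}"),
            key,
            HashAlgorithmName.SHA256,
            RSASignaturePadding.Pkcs1);

        // Ask for a TLS client-authentication certificate (OID 1.3.6.1.5.5.7.3.2).
        request.CertificateExtensions.Add(
            new X509EnhancedKeyUsageExtension(
                new OidCollection { new Oid("1.3.6.1.5.5.7.3.2") },
                critical: false));

        return request.CreateSigningRequest();
    }
}
```

The issuing side remains free to ignore or override requested extensions, so the orchestrator should still apply its own policy when signing.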

4. Agent Doctor (Health Checks)

Comprehensive health diagnostics:

public sealed class AgentDoctor
{
    private readonly ImmutableArray<IAgentHealthCheck> _checks;

    public AgentDoctor()
    {
        _checks = new IAgentHealthCheck[]
        {
            // Core checks
            new CertificateExpiryCheck(),
            new CertificateValidityCheck(),
            new OrchestratorConnectivityCheck(),
            new HeartbeatCheck(),

            // Resource checks
            new DiskSpaceCheck(),
            new MemoryUsageCheck(),
            new CpuUsageCheck(),
            new FileDescriptorCheck(),

            // Configuration checks
            new ConfigurationValidityCheck(),
            new ConfigurationDriftCheck(),
            new CapabilityCheck(),

            // Network checks
            new RegistryConnectivityCheck(),
            new DNSResolutionCheck(),
            new TLSVersionCheck(),
            new MTLSHandshakeCheck(),

            // Task execution checks
            new DockerConnectivityCheck(),
            new DockerVersionCheck(),
            new TaskQueueDepthCheck(),
            new FailedTaskRateCheck(),

            // Cluster checks (if clustered)
            new ClusterMembershipCheck(),
            new LeaderConnectivityCheck(),
            new StateSyncCheck()
        }.ToImmutableArray();
    }

    public async Task<AgentDiagnosticReport> RunDiagnosticsAsync(
        DiagnosticOptions options,
        CancellationToken ct)
    {
        var results = new List<HealthCheckResult>();
        var startTime = _timeProvider.GetUtcNow();

        foreach (var check in _checks)
        {
            if (options.Categories.Any() &&
                !options.Categories.Contains(check.Category))
            {
                continue;
            }

            try
            {
                var result = await check.ExecuteAsync(ct);
                results.Add(result);

                if (result.Status == HealthStatus.Critical && options.StopOnCritical)
                {
                    break;
                }
            }
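            catch (OperationCanceledException) when (ct.IsCancellationRequested)
            {
                // Propagate caller cancellation instead of recording it as a failed check
                throw;
            }
            catch (Exception ex)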
            {
                results.Add(new HealthCheckResult
                {
                    CheckName = check.Name,
                    Status = HealthStatus.Error,
                    Message = $"Check failed with exception: {ex.Message}",
                    Exception = ex
                });
            }
        }

        return new AgentDiagnosticReport
        {
            AgentId = _config.AgentId,
            AgentName = _config.AgentName,
            Timestamp = startTime,
            Duration = _timeProvider.GetUtcNow() - startTime,
            OverallStatus = DetermineOverallStatus(results),
            Results = results.ToImmutableArray(),
            Remediations = GenerateRemediations(results)
        };
    }

    private ImmutableArray<RemediationStep> GenerateRemediations(
        List<HealthCheckResult> results)
    {
        var remediations = new List<RemediationStep>();

        foreach (var result in results.Where(r => r.Status != HealthStatus.Healthy))
        {
            var steps = _remediationEngine.GetRemediationSteps(result);
            remediations.AddRange(steps);
        }

        // Deduplicate by step ID, then sort by priority (highest first)
        return remediations
            .DistinctBy(r => r.Id)
            .OrderByDescending(r => r.Priority)
            .ToImmutableArray();
    }
}

// Individual health checks
public sealed class CertificateExpiryCheck : IAgentHealthCheck
{
    public string Name => "Certificate Expiry";
    public string Category => "Security";

    public async Task<HealthCheckResult> ExecuteAsync(CancellationToken ct)
    {
        var cert = await _certManager.GetCurrentCertificateAsync(ct);

        if (cert == null)
        {
            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Critical,
                Message = "No certificate found",
                RemediationHint = "Run 'stella agent bootstrap' to provision certificate",
                RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-no-certificate"
            };
        }

        var expiresIn = cert.NotAfter - _timeProvider.GetUtcNow();

        if (expiresIn <= TimeSpan.Zero)
        {
            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Critical,
                Message = $"Certificate expired on {cert.NotAfter:u}",
                RemediationHint = "Run 'stella agent renew-cert' or restart agent for auto-renewal",
                RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-cert-expired"
            };
        }

        if (expiresIn <= TimeSpan.FromDays(7))
        {
            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Warning,
                Message = $"Certificate expires in {expiresIn.TotalDays:F0} days",
                RemediationHint = "Certificate will auto-renew if enabled, or run 'stella agent renew-cert'",
                Data = new Dictionary<string, object>
                {
                    ["expires_at"] = cert.NotAfter,
                    ["expires_in_days"] = expiresIn.TotalDays
                }
            };
        }

        return new HealthCheckResult
        {
            CheckName = Name,
            Status = HealthStatus.Healthy,
            Message = $"Certificate valid until {cert.NotAfter:u} ({expiresIn.TotalDays:F0} days)",
            Data = new Dictionary<string, object>
            {
                ["expires_at"] = cert.NotAfter,
                ["expires_in_days"] = expiresIn.TotalDays
            }
        };
    }
}

public sealed class OrchestratorConnectivityCheck : IAgentHealthCheck
{
    public string Name => "Orchestrator Connectivity";
    public string Category => "Network";

    public async Task<HealthCheckResult> ExecuteAsync(CancellationToken ct)
    {
        var endpoint = _config.OrchestratorUrl;

        try
        {
            // Test DNS resolution
            var uri = new Uri(endpoint);
            var addresses = await Dns.GetHostAddressesAsync(uri.Host, ct);

            if (addresses.Length == 0)
            {
                return new HealthCheckResult
                {
                    CheckName = Name,
                    Status = HealthStatus.Critical,
                    Message = $"DNS resolution failed for {uri.Host}",
                    RemediationHint = "Check DNS settings and network connectivity",
                    RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-dns-failure"
                };
            }

            // Test TCP connection with a 5-second timeout enforced through a linked
            // token (the ValueTask returned by ConnectAsync must be consumed only once,
            // so it is awaited directly rather than converted with AsTask() twice)
            using var tcpClient = new TcpClient();
            using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
            timeoutCts.CancelAfter(TimeSpan.FromSeconds(5));

            try
            {
                await tcpClient.ConnectAsync(uri.Host, uri.Port, timeoutCts.Token);
            }
            catch (OperationCanceledException) when (!ct.IsCancellationRequested)
            {
                return new HealthCheckResult
                {
                    CheckName = Name,
                    Status = HealthStatus.Critical,
                    Message = $"TCP connection to {endpoint} timed out",
                    RemediationHint = "Check firewall rules and network connectivity",
                    RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-connection-timeout"
                };
            }

            // Test mTLS handshake
            var tlsResult = await TestMTLSHandshakeAsync(uri, ct);
            if (!tlsResult.Success)
            {
                return new HealthCheckResult
                {
                    CheckName = Name,
                    Status = HealthStatus.Critical,
                    Message = $"mTLS handshake failed: {tlsResult.Error}",
                    RemediationHint = tlsResult.RemediationHint,
                    RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-mtls-failure"
                };
            }

            // Test gRPC health endpoint
            var healthResult = await _orchestratorClient.HealthCheckAsync(ct);

            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Healthy,
                Message = $"Connected to orchestrator at {endpoint}",
                Data = new Dictionary<string, object>
                {
                    ["resolved_addresses"] = addresses.Select(a => a.ToString()).ToArray(),
                    ["tls_version"] = tlsResult.TlsVersion,
                    ["latency_ms"] = healthResult.LatencyMs
                }
            };
        }
        catch (Exception ex)
        {
            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Critical,
                Message = $"Connectivity check failed: {ex.Message}",
                Exception = ex,
                RemediationHint = "Check network configuration and orchestrator status",
                RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-connectivity"
            };
        }
    }
}

public sealed class DockerConnectivityCheck : IAgentHealthCheck
{
    public string Name => "Docker Connectivity";
    public string Category => "Runtime";

    public async Task<HealthCheckResult> ExecuteAsync(CancellationToken ct)
    {
        try
        {
            var version = await _dockerClient.GetVersionAsync(ct);

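            // Check minimum version. Docker may report a pre-release or build suffix
            // (for example "-rc.1"), which System.Version cannot parse, so strip it first.
            var minVersion = new Version(20, 10, 0);
            var currentVersion = Version.Parse(version.Version.Split('-', '+')[0]);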

            if (currentVersion < minVersion)
            {
                return new HealthCheckResult
                {
                    CheckName = Name,
                    Status = HealthStatus.Warning,
                    Message = $"Docker version {version.Version} is below recommended {minVersion}",
                    RemediationHint = "Upgrade Docker to version 20.10 or later",
                    Data = new Dictionary<string, object>
                    {
                        ["docker_version"] = version.Version,
                        ["api_version"] = version.ApiVersion,
                        ["min_recommended"] = minVersion.ToString()
                    }
                };
            }

            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Healthy,
                Message = $"Docker {version.Version} connected",
                Data = new Dictionary<string, object>
                {
                    ["docker_version"] = version.Version,
                    ["api_version"] = version.ApiVersion,
                    ["os"] = version.Os,
                    ["arch"] = version.Arch
                }
            };
        }
        catch (Exception ex)
        {
            return new HealthCheckResult
            {
                CheckName = Name,
                Status = HealthStatus.Critical,
                Message = $"Docker connectivity failed: {ex.Message}",
                Exception = ex,
                RemediationHint = "Ensure Docker daemon is running and agent has permission to access Docker socket",
                RunbookUrl = "https://docs.stella-ops.org/runbooks/agent-docker-connectivity"
            };
        }
    }
}
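
`RunDiagnosticsAsync` relies on `DetermineOverallStatus`, which is not shown. A common rule is "the report is as bad as its worst check". The sketch below assumes a severity ordering of Healthy < Warning < Error < Critical; the document does not state the enum's numeric values, so both the ordering and the `HealthAggregation` helper are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed ordering for this sketch: a higher value means a worse state.
public enum HealthStatus { Healthy = 0, Warning = 1, Error = 2, Critical = 3 }

public static class HealthAggregation
{
    // Returns the most severe status. An empty sequence is treated as Healthy,
    // which is a policy choice: a stricter design could report it as an error.
    public static HealthStatus Worst(IEnumerable<HealthStatus> statuses) =>
        statuses.DefaultIfEmpty(HealthStatus.Healthy).Max();
}
```

With explicit numeric values the same ordering also gives a stable secondary sort for results, independent of the order in which enum members happen to be declared.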

5. Remediation Engine

Guided problem resolution:

public sealed class RemediationEngine
{
    public ImmutableArray<RemediationStep> GetRemediationSteps(
        HealthCheckResult result)
    {
        var steps = new List<RemediationStep>();

        // Match result to known remediation patterns
        var pattern = _patterns.FirstOrDefault(p => p.Matches(result));

        if (pattern != null)
        {
            steps.AddRange(pattern.Steps);
        }

        // Add generic remediation based on status
        if (result.Status == HealthStatus.Critical)
        {
            steps.Add(new RemediationStep
            {
                Id = "check-logs",
                Priority = RemediationPriority.High,
                Title = "Check Agent Logs",
                Description = "Review agent logs for detailed error information",
                Command = "stella agent logs --tail 100",
                RunbookUrl = result.RunbookUrl
            });
        }

        return steps.ToImmutableArray();
    }

    private readonly ImmutableArray<RemediationPattern> _patterns = new[]
    {
        new RemediationPattern
        {
            CheckName = "Certificate Expiry",
            StatusMatch = HealthStatus.Critical,
            Steps = new[]
            {
                new RemediationStep
                {
                    Id = "renew-cert",
                    Priority = RemediationPriority.Critical,
                    Title = "Renew Agent Certificate",
                    Description = "Agent certificate has expired and must be renewed",
                    Command = "stella agent renew-cert --force",
                    Automated = true
                },
                new RemediationStep
                {
                    Id = "restart-agent",
                    Priority = RemediationPriority.High,
                    Title = "Restart Agent",
                    Description = "Restart agent to apply new certificate",
                    Command = "systemctl restart stella-agent",
                    Automated = false
                }
            }
        },
        new RemediationPattern
        {
            CheckName = "Orchestrator Connectivity",
            MessageContains = "DNS resolution failed",
            Steps = new[]
            {
                new RemediationStep
                {
                    Id = "check-dns",
                    Priority = RemediationPriority.Critical,
                    Title = "Verify DNS Configuration",
                    Description = "Check that DNS servers are configured and reachable",
                    Command = "cat /etc/resolv.conf && nslookup orchestrator.example.com",
                    Automated = false
                },
                new RemediationStep
                {
                    Id = "check-hosts",
                    Priority = RemediationPriority.High,
                    Title = "Check /etc/hosts",
                    Description = "Verify no conflicting entries in hosts file",
                    Command = "grep orchestrator /etc/hosts",
                    Automated = false
                }
            }
        },
        new RemediationPattern
        {
            CheckName = "Docker Connectivity",
            Steps = new[]
            {
                new RemediationStep
                {
                    Id = "check-docker-daemon",
                    Priority = RemediationPriority.Critical,
                    Title = "Check Docker Daemon",
                    Description = "Verify Docker daemon is running",
                    Command = "systemctl status docker",
                    Automated = false
                },
                new RemediationStep
                {
                    Id = "check-docker-socket",
                    Priority = RemediationPriority.High,
                    Title = "Check Docker Socket Permissions",
                    Description = "Verify agent has access to Docker socket",
                    Command = "ls -la /var/run/docker.sock && groups stella-agent",
                    Automated = false
                }
            }
        }
    }.ToImmutableArray();
}

public sealed record RemediationStep
{
    public string Id { get; init; }
    public RemediationPriority Priority { get; init; }
    public string Title { get; init; }
    public string Description { get; init; }
    public string? Command { get; init; }
    public string? RunbookUrl { get; init; }
    public bool Automated { get; init; }
    public TimeSpan? EstimatedDuration { get; init; }
}
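
The patterns above set `CheckName`, and optionally `StatusMatch` or `MessageContains`, but `RemediationPattern.Matches` itself is not shown. A plausible reading is that unset criteria act as wildcards. The sketch below illustrates that interpretation with stand-in types (`CheckOutcome`, `PatternSketch`) that are hypothetical and deliberately smaller than the document's own records.

```csharp
using System;

public enum HealthStatus { Healthy, Warning, Error, Critical }

// Stand-in for the subset of HealthCheckResult that matching needs.
public sealed record CheckOutcome(string CheckName, HealthStatus Status, string Message);

public sealed record PatternSketch
{
    public string CheckName { get; init; } = "";
    public HealthStatus? StatusMatch { get; init; }   // null matches any status
    public string? MessageContains { get; init; }     // null matches any message

    // Every criterion that is set must hold; unset criteria are ignored.
    public bool Matches(CheckOutcome outcome) =>
        outcome.CheckName == CheckName
        && (StatusMatch is null || outcome.Status == StatusMatch)
        && (MessageContains is null
            || outcome.Message.Contains(MessageContains, StringComparison.OrdinalIgnoreCase));
}
```

Because `GetRemediationSteps` takes the first matching pattern, more specific patterns for a check would need to be listed before broader ones under this interpretation.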

6. Auto-Update Manager

Safe agent binary updates:

public sealed class AgentUpdateManager
{
    public async Task<UpdateResult> CheckAndApplyUpdateAsync(
        CancellationToken ct)
    {
        // Treat a missing AutoUpdate section the same as Enabled = false
        if (_config.AutoUpdate?.Enabled != true)
        {
            return new UpdateResult { Status = UpdateStatus.Disabled };
        }

        // Check for available update
        var available = await _updateService.CheckForUpdateAsync(
            _config.AgentVersion,
            _config.AutoUpdate.Channel,
            ct);

        if (!available.HasUpdate)
        {
            return new UpdateResult { Status = UpdateStatus.UpToDate };
        }

        // Verify update signature
        var verified = await _signatureVerifier.VerifyAsync(
            available.Package,
            available.Signature,
            ct);

        if (!verified)
        {
            _logger.LogError("Update signature verification failed");
            return new UpdateResult
            {
                Status = UpdateStatus.VerificationFailed,
                Error = "Package signature verification failed"
            };
        }

        // Check if update window is allowed
        if (!IsInUpdateWindow())
        {
            _logger.LogInformation(
                "Update available but outside update window, scheduling for {Window}",
                _config.AutoUpdate.MaintenanceWindow);

            return new UpdateResult
            {
                Status = UpdateStatus.Scheduled,
                ScheduledFor = GetNextMaintenanceWindow()
            };
        }

        // Drain active tasks, honoring the DrainBeforeUpdate setting
        if (_config.AutoUpdate.DrainBeforeUpdate)
        {
            await DrainActiveTasksAsync(ct);
        }

        // Create rollback point before touching the installation so the
        // catch block below always has a valid target to roll back to
        var rollbackPoint = await CreateRollbackPointAsync(ct);

        // Download and apply update
        try
        {
            var packagePath = await DownloadPackageAsync(available, ct);

            // Apply update
            await ApplyUpdateAsync(packagePath, ct);

            // Verify new version starts correctly
            var healthCheck = await VerifyNewVersionAsync(ct);

            if (!healthCheck.Healthy)
            {
                _logger.LogError("New version health check failed, rolling back");
                await RollbackAsync(rollbackPoint, ct);

                return new UpdateResult
                {
                    Status = UpdateStatus.RolledBack,
                    Error = healthCheck.Error
                };
            }

            return new UpdateResult
            {
                Status = UpdateStatus.Applied,
                PreviousVersion = _config.AgentVersion,
                NewVersion = available.Version
            };
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Update failed, attempting rollback");
            await RollbackAsync(rollbackPoint, ct);

            return new UpdateResult
            {
                Status = UpdateStatus.Failed,
                Error = ex.Message
            };
        }
    }
}

public sealed record AutoUpdateConfig
{
    public bool Enabled { get; init; } = false;
    public UpdateChannel Channel { get; init; } = UpdateChannel.Stable;
    public string? MaintenanceWindow { get; init; }  // Cron expression
    public bool DrainBeforeUpdate { get; init; } = true;
    public TimeSpan DrainTimeout { get; init; } = TimeSpan.FromMinutes(5);
    public int MaxRollbackVersions { get; init; } = 3;
}

public enum UpdateChannel
{
    Stable,
    Beta,
    Canary
}
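The `MaintenanceWindow` setting is described as a cron expression, but `IsInUpdateWindow()` is not shown. The following is a minimal sketch of one way such a check could work, under stated assumptions: it handles only daily expressions of the form `M H * * *` (like `0 3 * * *`), and it takes a window length as a parameter because the source does not say how long a window stays open. The class and method names are hypothetical, and a real implementation would more likely rely on a full cron parsing library.

```csharp
using System;

public static class MaintenanceWindowSketch
{
    // Hypothetical helper: returns true when `now` falls inside a daily window that
    // opens at the minute and hour given by a five-field cron expression.
    // Only numeric minute/hour fields with "*" in the other three fields are supported;
    // anything else throws FormatException. The window is evaluated in the UTC offset
    // carried by `now`, and `duration` is an assumed parameter, not a documented setting.
    public static bool IsInDailyWindow(string cron, DateTimeOffset now, TimeSpan duration)
    {
        var fields = cron.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        if (fields.Length != 5 || fields[2] != "*" || fields[3] != "*" || fields[4] != "*")
            throw new FormatException($"Unsupported cron expression: '{cron}'");

        if (!int.TryParse(fields[0], out int minute) || minute < 0 || minute > 59)
            throw new FormatException($"Invalid minute field in '{cron}'");

        if (!int.TryParse(fields[1], out int hour) || hour < 0 || hour > 23)
            throw new FormatException($"Invalid hour field in '{cron}'");

        var opensAt = new TimeSpan(hour, minute, 0);
        var sinceOpen = now.TimeOfDay - opensAt;

        // A negative value means today's window has not opened yet,
        // so measure from yesterday's opening instead (handles windows that span midnight).
        if (sinceOpen < TimeSpan.Zero)
            sinceOpen += TimeSpan.FromDays(1);

        return sinceOpen < duration;
    }
}
```

With a one-hour window and `"0 3 * * *"`, a timestamp at 03:30 is inside the window, while 04:00 and 02:59 are outside it.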

7. Operator CLI Commands

Streamlined operational commands:

public sealed class AgentOperatorCommands
{
    // Bootstrap new agent
    // stella agent bootstrap --name prod-agent-01 --env production --platform linux
    [Command("agent bootstrap")]
    public async Task<int> BootstrapAsync(
        [Option] string name,
        [Option] string env,
        [Option] Platform platform = Platform.Linux,
        [Option] string[]? capabilities = null,
        [Option] string? cluster = null)
    {
        var result = await _bootstrap.BootstrapAgentAsync(new BootstrapRequest
        {
            AgentName = name,
            Environment = env,
            Platform = platform,
            Capabilities = capabilities?.ToImmutableArray() ?? ImmutableArray<AgentCapability>.Empty,
            ClusterId = cluster
        }, _ct);

        Console.WriteLine("Bootstrap token generated (expires in 15 minutes):");
        Console.WriteLine();
        Console.WriteLine($"  Token: {result.Token}");
        Console.WriteLine();
        Console.WriteLine("One-line installer:");
        Console.WriteLine($"  {result.InstallCommand}");
        Console.WriteLine();
        Console.WriteLine("Or download the install script:");
        Console.WriteLine($"  stella agent install-script --token {result.Token} --output install.sh");

        return 0;
    }

    // Run diagnostics
    // stella agent doctor [--agent-id xyz] [--category security] [--fix] [--format json]
    [Command("agent doctor")]
    public async Task<int> DoctorAsync(
        [Option] string? agentId = null,
        [Option] string[]? categories = null,
        [Option] bool fix = false,
        [Option] OutputFormat format = OutputFormat.Table)
    {
        var options = new DiagnosticOptions
        {
            Categories = categories?.ToImmutableArray() ?? ImmutableArray<string>.Empty,
            IncludeRemediations = true
        };

        var report = agentId != null
            ? await _doctor.RunRemoteDiagnosticsAsync(agentId, options, _ct)
            : await _doctor.RunDiagnosticsAsync(options, _ct);

        // Display results
        RenderDiagnosticReport(report, format);

        // Optionally apply automated fixes
        if (fix && report.Remediations.Any(r => r.Automated))
        {
            Console.WriteLine();
            Console.WriteLine("Applying automated remediations...");

            foreach (var remediation in report.Remediations.Where(r => r.Automated))
            {
                Console.WriteLine($"  - {remediation.Title}");
                await _remediation.ApplyAsync(remediation, _ct);
            }
        }

        return report.OverallStatus == HealthStatus.Healthy ? 0 : 1;
    }

    // View agent configuration
    // stella agent config [--agent-id xyz] [--diff]
    [Command("agent config")]
    public async Task<int> ConfigAsync(
        [Option] string? agentId = null,
        [Option] bool diff = false,
        [Option] OutputFormat format = OutputFormat.Yaml)
    {
        if (diff)
        {
            var drift = await _configManager.DetectDriftAsync(_ct);
            RenderConfigDiff(drift, format);
            return drift.HasDrift ? 1 : 0;
        }

        var config = await _configManager.GetConfigurationAsync(agentId, _ct);
        RenderConfiguration(config, format);
        return 0;
    }

    // Apply configuration changes
    // stella agent apply -f agent-config.yaml
    [Command("agent apply")]
    public async Task<int> ApplyAsync(
        [Option('f')] string configFile)
    {
        var config = await LoadConfigurationAsync(configFile);
        var validation = await _configManager.ValidateAsync(config, _ct);

        if (!validation.IsValid)
        {
            Console.WriteLine("Configuration validation failed:");
            foreach (var error in validation.Errors)
            {
                Console.WriteLine($"  - {error}");
            }
            return 1;
        }

        var result = await _configManager.ApplyConfigurationAsync(config, _ct);

        if (result.Status == ConfigStatus.Applied)
        {
            Console.WriteLine($"Configuration applied successfully ({result.AppliedChanges.Length} changes)");
            return 0;
        }

        Console.WriteLine($"Configuration apply failed: {result.Status}");
        return 1;
    }

    // Renew certificate
    // stella agent renew-cert [--force]
    [Command("agent renew-cert")]
    public async Task<int> RenewCertAsync(
        [Option] bool force = false)
    {
        var result = await _certManager.RenewCertificateAsync(force, _ct);

        if (result.Status == CertificateStatus.Renewed)
        {
            Console.WriteLine("Certificate renewed successfully");
            Console.WriteLine($"  New expiry: {result.ExpiresAt:u}");
            return 0;
        }

        Console.WriteLine($"Certificate renewal failed: {result.Error}");
        return 1;
    }

    // View agent logs
    // stella agent logs [--agent-id xyz] [--tail 100] [--follow] [--level error]
    [Command("agent logs")]
    public async Task<int> LogsAsync(
        [Option] string? agentId = null,
        [Option] int tail = 50,
        [Option] bool follow = false,
        [Option] LogLevel? level = null)
    {
        await foreach (var entry in _logService.StreamLogsAsync(
            agentId, tail, follow, level, _ct))
        {
            RenderLogEntry(entry);
        }

        return 0;
    }

    // Force update
    // stella agent update [--version x.y.z] [--force]
    [Command("agent update")]
    public async Task<int> UpdateAsync(
        [Option] string? version = null,
        [Option] bool force = false)
    {
        var result = await _updateManager.UpdateToVersionAsync(version, force, _ct);

        Console.WriteLine($"Update status: {result.Status}");
        if (result.Status == UpdateStatus.Applied)
        {
            Console.WriteLine($"  Previous: {result.PreviousVersion}");
            Console.WriteLine($"  Current:  {result.NewVersion}");
        }

        return result.Status == UpdateStatus.Applied ? 0 : 1;
    }
}

Server-Side Doctor Plugin

A central Doctor plugin reports on the health of the whole agent fleet:

// src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Agent/AgentHealthPlugin.cs
public sealed class AgentHealthPlugin : IDoctorPlugin
{
    public string Name => "Agent Health";
    public string Description => "Monitors agent fleet health and connectivity";

    public ImmutableArray<IDoctorCheck> Checks => new IDoctorCheck[]
    {
        new AgentHeartbeatFreshnessCheck(),
        new AgentCertificateExpiryCheck(),
        new AgentVersionConsistencyCheck(),
        new AgentCapacityCheck(),
        new StaleAgentCheck(),
        new AgentClusterHealthCheck(),
        new TaskQueueBacklogCheck(),
        new FailedTaskRateCheck(),
        new AgentResourceUtilizationCheck()
    }.ToImmutableArray();
}

public sealed class AgentHeartbeatFreshnessCheck : IDoctorCheck
{
    public string Name => "Agent Heartbeat Freshness";
    public CheckSeverity Severity => CheckSeverity.Critical;

    public async Task<DoctorCheckResult> ExecuteAsync(CancellationToken ct)
    {
        var agents = await _agentStore.GetAllAsync(ct);
        var staleAgents = new List<string>();
        var warningAgents = new List<string>();

        foreach (var agent in agents.Where(a => a.Status != AgentStatus.Deactivated))
        {
            var heartbeatAge = _timeProvider.GetUtcNow() - agent.LastHeartbeat;

            if (heartbeatAge > TimeSpan.FromMinutes(5))
            {
                staleAgents.Add($"{agent.Name} (last heartbeat: {heartbeatAge.TotalMinutes:F0}m ago)");
            }
            else if (heartbeatAge > TimeSpan.FromMinutes(2))
            {
                warningAgents.Add($"{agent.Name} (last heartbeat: {heartbeatAge.TotalSeconds:F0}s ago)");
            }
        }

        if (staleAgents.Any())
        {
            return new DoctorCheckResult
            {
                Status = CheckStatus.Critical,
                Message = $"{staleAgents.Count} agent(s) have stale heartbeats",
                Details = staleAgents,
                Remediation = "Check agent connectivity and status. Run 'stella agent doctor --agent-id <id>' for diagnostics."
            };
        }

        if (warningAgents.Any())
        {
            return new DoctorCheckResult
            {
                Status = CheckStatus.Warning,
                Message = $"{warningAgents.Count} agent(s) have delayed heartbeats",
                Details = warningAgents
            };
        }

        return new DoctorCheckResult
        {
            Status = CheckStatus.Healthy,
            Message = $"All {agents.Count} agents have fresh heartbeats"
        };
    }
}

public sealed class AgentCertificateExpiryCheck : IDoctorCheck
{
    public string Name => "Agent Certificate Expiry";
    public CheckSeverity Severity => CheckSeverity.High;

    public async Task<DoctorCheckResult> ExecuteAsync(CancellationToken ct)
    {
        var agents = await _agentStore.GetAllAsync(ct);
        var expiringSoon = new List<string>();
        var expired = new List<string>();

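        // Skip deactivated agents, matching the heartbeat freshness check
        foreach (var agent in agents.Where(a => a.Status != AgentStatus.Deactivated))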
        {
            var expiresIn = agent.CertificateExpiry - _timeProvider.GetUtcNow();

            if (expiresIn <= TimeSpan.Zero)
            {
                expired.Add($"{agent.Name} (expired {-expiresIn.TotalDays:F0} days ago)");
            }
            else if (expiresIn <= TimeSpan.FromDays(7))
            {
                expiringSoon.Add($"{agent.Name} (expires in {expiresIn.TotalDays:F0} days)");
            }
        }

        if (expired.Any())
        {
            return new DoctorCheckResult
            {
                Status = CheckStatus.Critical,
                Message = $"{expired.Count} agent(s) have expired certificates",
                Details = expired,
                Remediation = "Renew certificates immediately: run 'stella agent renew-cert' on each affected agent."
            };
        }

        if (expiringSoon.Any())
        {
            return new DoctorCheckResult
            {
                Status = CheckStatus.Warning,
                Message = $"{expiringSoon.Count} agent(s) have certificates expiring soon",
                Details = expiringSoon,
                Remediation = "Schedule certificate renewal before expiry"
            };
        }

        return new DoctorCheckResult
        {
            Status = CheckStatus.Healthy,
            Message = "All agent certificates are valid"
        };
    }
}
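The heartbeat check above applies two thresholds inline: more than 2 minutes is a warning and more than 5 minutes is critical. As a sketch only, the classification could be pulled into a pure function so the boundary cases are easy to unit test without an agent store. The type and member names here (`HeartbeatState`, `HeartbeatThresholds`, `Classify`) are hypothetical and not part of the plugin as described.

```csharp
using System;

public enum HeartbeatState { Fresh, Delayed, Stale }

public static class HeartbeatThresholds
{
    // Threshold values are taken from AgentHeartbeatFreshnessCheck above.
    public static readonly TimeSpan Warning = TimeSpan.FromMinutes(2);
    public static readonly TimeSpan Critical = TimeSpan.FromMinutes(5);

    // Classifies a heartbeat by its age. Both comparisons are strict,
    // mirroring the check above: exactly 2 minutes is still Fresh and
    // exactly 5 minutes is still Delayed.
    public static HeartbeatState Classify(DateTimeOffset now, DateTimeOffset lastHeartbeat)
    {
        var age = now - lastHeartbeat;

        if (age > Critical) return HeartbeatState.Stale;
        if (age > Warning) return HeartbeatState.Delayed;
        return HeartbeatState.Fresh;
    }
}
```

Passing `now` in explicitly plays the same role as `_timeProvider` in the check: tests can pin the clock and probe each boundary directly.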

Configuration Examples

Minimal Configuration (Bootstrap)

# Bootstrapped agent - minimal config required
agent:
  name: prod-agent-01
  orchestrator_url: https://orchestrator.example.com:8443
  # Everything else is auto-configured via bootstrap

Full Configuration

agent:
  # Identity
  id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
  name: prod-agent-01
  environment: production
  labels:
    region: us-east-1
    tier: web

  # Connection
  orchestrator_url: https://orchestrator.example.com:8443
  heartbeat_interval: 30s
  reconnect_backoff: 5s
  max_reconnect_attempts: 10

  # Capabilities
  capabilities:
    - docker
    - compose
    - health_check

  # Resources
  max_concurrent_tasks: 5
  default_task_timeout: 30m
  resource_limits:
    cpu_percent: 80
    memory_percent: 80
    disk_percent: 90

  # Certificates
  certificates:
    source: auto_provision  # auto_provision | file | vault
    auto_renew: true
    renewal_threshold: 7d

  # Clustering (optional)
  cluster:
    id: prod-cluster-01
    mode: active_active  # active_passive | active_active | sharded
    min_members: 2

  # Observability
  observability:
    metrics:
      enabled: true
      port: 9090
    logging:
      level: info
      format: json
    tracing:
      enabled: true
      endpoint: http://jaeger:14268/api/traces

  # Auto-update (optional)
  auto_update:
    enabled: true
    channel: stable  # stable | beta | canary
    maintenance_window: "0 3 * * *"  # 3 AM daily
    drain_before_update: true
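The configuration above writes durations as a number followed by a unit suffix (`30s`, `30m`, `7d`). The source does not show how these strings are parsed, so the following is a minimal sketch under assumptions: the `DurationSketch` class is hypothetical, only whole numbers are accepted, and the `h` suffix is an addition that does not appear in the example configuration.

```csharp
using System;
using System.Globalization;

public static class DurationSketch
{
    // Hypothetical parser for values such as "30s", "30m" or "7d".
    // Accepts a non-negative whole number followed by one unit character.
    public static TimeSpan Parse(string text)
    {
        if (string.IsNullOrWhiteSpace(text) || text.Length < 2)
            throw new FormatException($"Invalid duration: '{text}'");

        char unit = text[^1];        // last character is the unit suffix
        string number = text[..^1];  // everything before it is the magnitude

        // NumberStyles.None permits digits only (no sign, whitespace, or separators).
        if (!int.TryParse(number, NumberStyles.None, CultureInfo.InvariantCulture, out int parsed))
            throw new FormatException($"Invalid duration: '{text}'");

        double value = parsed;

        return unit switch
        {
            's' => TimeSpan.FromSeconds(value),
            'm' => TimeSpan.FromMinutes(value),
            'h' => TimeSpan.FromHours(value),   // assumed unit, not used in the example config
            'd' => TimeSpan.FromDays(value),
            _ => throw new FormatException($"Unknown duration unit in '{text}'")
        };
    }
}
```

For example, `DurationSketch.Parse("7d")` yields the same `TimeSpan` as `TimeSpan.FromDays(7)`, which lines up with the 7-day expiry threshold used by the certificate expiry check.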

CLI Quick Reference

# Bootstrap new agent
stella agent bootstrap --name prod-01 --env production --platform linux

# Run health diagnostics
stella agent doctor
stella agent doctor --category security --fix
stella agent doctor --agent-id abc123 --format json

# View/apply configuration
stella agent config
stella agent config --diff
stella agent apply -f agent-config.yaml

# Certificate management
stella agent renew-cert
stella agent renew-cert --force

# Logs and debugging
stella agent logs --tail 100
stella agent logs --follow --level error

# Updates
stella agent update
stella agent update --version 2.1.0

# Status and health
stella agent status
stella agent list --env production
stella agent health abc123

Metrics & Observability

Prometheus Metrics

# Bootstrap
stella_agent_bootstrap_total{environment, platform}
stella_agent_bootstrap_success_total{environment}
stella_agent_bootstrap_failed_total{environment, reason}

# Configuration
stella_agent_config_drift_detected_total{agent_id}
stella_agent_config_apply_total{agent_id, status}

# Certificates
stella_agent_certificate_expiry_seconds{agent_id}
stella_agent_certificate_renewal_total{agent_id, status}

# Health Checks
stella_agent_health_check_total{agent_id, check_name, status}
stella_agent_health_score{agent_id}

# Updates
stella_agent_update_available{agent_id, current_version, available_version}
stella_agent_update_applied_total{agent_id, status}
stella_agent_update_rollback_total{agent_id}

Test Strategy

Unit Tests

  • Bootstrap token generation and validation
  • Configuration diff computation
  • Certificate lifecycle logic
  • Health check execution
  • Remediation matching

Integration Tests

  • Full bootstrap flow
  • Configuration apply with rollback
  • Certificate renewal
  • Auto-update with rollback
  • Doctor diagnostics

E2E Tests

  • Bootstrap to running agent
  • Multi-agent cluster formation
  • Failover scenarios
  • Update and rollback scenarios

Migration Path

Phase 1: Bootstrap Service (Week 1-2)

  • Bootstrap token service
  • One-line installer generation
  • Platform-specific install scripts

Phase 2: Configuration Manager (Week 3-4)

  • Declarative configuration model
  • Drift detection
  • Apply with rollback

Phase 3: Certificate Manager (Week 5-6)

  • Auto-provisioning
  • Auto-renewal
  • Multi-source support (Vault, ACME, etc.)

Phase 4: Agent Doctor (Week 7-8)

  • Core health checks
  • Remediation engine
  • CLI integration

Phase 5: Doctor Plugin (Week 9-10)

  • Server-side fleet health
  • Dashboard integration
  • Alerting rules

Phase 6: Auto-Update (Week 11-12)

  • Update service
  • Safe rollback
  • Maintenance windows