--- checkId: check.agent.cluster.health plugin: stellaops.doctor.agent severity: fail tags: [agent, cluster, ha, resilience] --- # Agent Cluster Health ## What It Checks Monitors the health of the agent cluster when clustering is enabled. The check only runs when the configuration key `Agent:Cluster:Enabled` is set to `true`. It is designed to verify: 1. All cluster members are reachable 2. A leader is elected and healthy 3. State synchronization is working across members 4. Failover is possible if the current leader goes down **Current status:** implementation pending -- the check returns Skip with a placeholder message. The `CanRun` gate is functional (reads cluster config), but `RunAsync` does not yet perform cluster health probes. ## Why It Matters In high-availability deployments, agents form a cluster to provide redundancy and automatic failover. If cluster health degrades -- members become unreachable, leader election fails, or state sync stalls -- task dispatch can stop entirely or produce split-brain scenarios where two agents execute the same task concurrently, leading to deployment conflicts. ## Common Causes - Network partition between cluster members - Leader node crashed without triggering failover - State sync backlog due to high task volume - Clock skew between cluster members causing consensus protocol failures - Insufficient cluster members for quorum (see `check.agent.cluster.quorum`) ## How to Fix ### Docker Compose ```bash # Check cluster member containers docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent # View cluster-specific logs docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200 | grep -i cluster # Restart all agent containers to force re-election docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent ``` Set clustering configuration in your `.env` or compose override: ``` AGENT__CLUSTER__ENABLED=true AGENT__CLUSTER__MEMBERS=agent-1:8500,agent-2:8500,agent-3:8500 ``` ### Bare Metal / systemd ```bash # Check cluster status stella agent cluster status # View cluster member health stella agent cluster members # Force leader re-election if leader is unhealthy stella agent cluster elect --force # Restart agent to rejoin cluster sudo systemctl restart stella-agent ``` ### Kubernetes / Helm ```bash # Check agent StatefulSet pods kubectl get pods -l app.kubernetes.io/component=agent -n stellaops # View cluster gossip logs kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=100 | grep -i cluster # Helm values for clustering # agent: # cluster: # enabled: true # replicas: 3 helm upgrade stellaops stellaops/stellaops --set agent.cluster.enabled=true --set agent.cluster.replicas=3 ``` ## Verification ``` stella doctor run --check check.agent.cluster.health ``` ## Related Checks - `check.agent.cluster.quorum` -- verifies minimum members for consensus - `check.agent.heartbeat.freshness` -- individual agent connectivity - `check.agent.capacity` -- fleet-level task capacity