---
checkId: check.agent.cluster.quorum
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, cluster, quorum, ha]
---
# Agent Cluster Quorum

## What It Checks
Verifies that the agent cluster has enough members online to maintain quorum for leader election and consensus operations. The check only runs when `Agent:Cluster:Enabled` is `true`.

It is designed to verify:
1. The minimum number of members is online (a strict majority, `floor(n/2) + 1`, or the configured minimum)
2. Leader election is possible with the current membership
3. Split-brain prevention mechanisms are active

**Current status:** implementation pending -- the check returns Skip with a placeholder message. The `CanRun` gate is functional (it reads the cluster config), but `RunAsync` does not yet query cluster membership.

## Why It Matters
Without quorum, the agent cluster cannot elect a leader, which means no task dispatch, no failover, and potentially a complete halt of agent-driven operations. Losing quorum is often the last step before a full cluster outage. Monitoring quorum proactively lets operators add members or fix partitions before the cluster becomes non-functional.

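The quorum threshold listed under "What It Checks" can be sketched as a small shell helper. This is a minimal illustration, not the check's implementation (which is still pending): the function names `quorum_size` and `has_quorum` are hypothetical, and it assumes that a configured minimum, when set, takes precedence over the computed majority.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of the stella CLI.

# quorum_size TOTAL [MIN_MEMBERS]
# Prints the number of online members required. Assumption: a configured
# minimum (> 0) overrides the default strict majority, floor(TOTAL/2) + 1.
quorum_size() {
  local total=$1
  local configured=${2:-0}
  if (( configured > 0 )); then
    echo "$configured"
  else
    # Bash arithmetic uses integer division, so this is floor(total/2) + 1.
    echo $(( total / 2 + 1 ))
  fi
}

# has_quorum ONLINE TOTAL [MIN_MEMBERS]
# Returns 0 when enough members are online, non-zero otherwise.
has_quorum() {
  local online=$1 total=$2 configured=${3:-0}
  (( online >= $(quorum_size "$total" "$configured") ))
}
```

For example, `quorum_size 3` prints `2` and `quorum_size 5` prints `3`, so a three-member cluster tolerates one offline member and a five-member cluster tolerates two. An even-sized cluster gains no extra tolerance: `quorum_size 4` prints `3`.
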
## Common Causes
- Too many cluster members went offline simultaneously (maintenance, host failure)
- A network partition isolated a minority of members from the majority
- The cluster was scaled down below the quorum threshold
- A new deployment removed members without draining them first

## How to Fix

### Docker Compose
```bash
# Verify all agent containers are running
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent

# Scale agents to restore quorum (minimum 3 for quorum of 2)
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3
```

Ensure the cluster settings are correct in `.env`:
```
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MINMEMBERS=2
```

### Bare Metal / systemd
```bash
# Check how many cluster members are online
stella agent cluster members --status online

# If a member is down, restart it (replace <agent-host> with the affected host)
ssh <agent-host> 'sudo systemctl restart stella-agent'

# Verify quorum status
stella agent cluster quorum
```

### Kubernetes / Helm
```bash
# Check agent pod count vs desired
kubectl get statefulset stellaops-agent -n stellaops

# Scale up if below quorum
kubectl scale statefulset stellaops-agent --replicas=3 -n stellaops

# Check pod disruption budget
kubectl get pdb -n stellaops
```

Set a PodDisruptionBudget to prevent quorum loss during rollouts:
```yaml
# values.yaml
agent:
  cluster:
    enabled: true
  replicas: 3
  podDisruptionBudget:
    minAvailable: 2
```

## Verification
```
stella doctor run --check check.agent.cluster.quorum
```

## Related Checks
- `check.agent.cluster.health` -- overall cluster health including leader and sync status
- `check.agent.capacity` -- even with quorum, capacity may be insufficient
- `check.agent.heartbeat.freshness` -- individual member connectivity