---
checkId: check.agent.cluster.health
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, cluster, ha, resilience]
---
# Agent Cluster Health

## What It Checks

Monitors the health of the agent cluster when clustering is enabled. The check only runs when the configuration key `Agent:Cluster:Enabled` is set to `true`. It is designed to verify:

1. All cluster members are reachable
2. A leader is elected and healthy
3. State synchronization is working across members
4. Failover is possible if the current leader goes down

**Current status:** implementation pending -- the check returns Skip with a placeholder message. The `CanRun` gate is functional (reads cluster config), but `RunAsync` does not yet perform cluster health probes.

## Why It Matters

In high-availability deployments, agents form a cluster to provide redundancy and automatic failover. If cluster health degrades -- members become unreachable, leader election fails, or state sync stalls -- task dispatch can stop entirely or produce split-brain scenarios where two agents execute the same task concurrently, leading to deployment conflicts.

## Common Causes

- Network partition between cluster members
- Leader node crashed without triggering failover
- State sync backlog due to high task volume
- Clock skew between cluster members causing consensus protocol failures
- Insufficient cluster members for quorum (see `check.agent.cluster.quorum`)

## How to Fix

### Docker Compose

```bash
# Check cluster member containers
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent

# View cluster-specific logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200 | grep -i cluster

# Restart all agent containers to force re-election
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
```

Set clustering configuration in your `.env` or compose override:

```
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MEMBERS=agent-1:8500,agent-2:8500,agent-3:8500
```

### Bare Metal / systemd

```bash
# Check cluster status
stella agent cluster status

# View cluster member health
stella agent cluster members

# Force leader re-election if leader is unhealthy
stella agent cluster elect --force

# Restart agent to rejoin cluster
sudo systemctl restart stella-agent
```

### Kubernetes / Helm

```bash
# Check agent StatefulSet pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops

# View cluster gossip logs
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=100 | grep -i cluster

# Helm values for clustering
# agent:
#   cluster:
#     enabled: true
#     replicas: 3
helm upgrade stellaops stellaops/stellaops --set agent.cluster.enabled=true --set agent.cluster.replicas=3
```

## Verification

```
stella doctor run --check check.agent.cluster.health
```

## Related Checks

- `check.agent.cluster.quorum` -- verifies minimum members for consensus
- `check.agent.heartbeat.freshness` -- individual agent connectivity
- `check.agent.capacity` -- fleet-level task capacity