git.stella-ops.org/docs/doctor/articles/agent/cluster-health.md

| checkId | plugin | severity | tags |
| --- | --- | --- | --- |
| `check.agent.cluster.health` | `stellaops.doctor.agent` | fail | `agent`, `cluster`, `ha`, `resilience` |

# Agent Cluster Health

## What It Checks

Monitors the health of the agent cluster when clustering is enabled. The check runs only when the configuration key `Agent:Cluster:Enabled` is set to `true`. It is designed to verify that:

  1. All cluster members are reachable
  2. A leader is elected and healthy
  3. State synchronization is working across members
  4. Failover is possible if the current leader goes down

Current status: implementation pending -- the check returns `Skip` with a placeholder message. The `CanRun` gate is functional (it reads the cluster configuration), but `RunAsync` does not yet perform cluster health probes.
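While the probe logic is pending, you can at least confirm that the gate `CanRun` reads is actually switched on. A minimal sketch, assuming the standard ASP.NET Core mapping of `Agent:Cluster:Enabled` to the environment variable `AGENT__CLUSTER__ENABLED`:

```bash
# Sketch: report whether the Agent:Cluster:Enabled gate is on, based on the
# AGENT__CLUSTER__ENABLED environment variable (ASP.NET Core-style mapping).
cluster_gate_enabled() {
  case "${AGENT__CLUSTER__ENABLED:-false}" in
    [Tt]rue) echo "enabled" ;;
    *)       echo "disabled" ;;
  esac
}

AGENT__CLUSTER__ENABLED=true cluster_gate_enabled
```

If this prints `disabled`, the check will keep returning `Skip` regardless of actual cluster state.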

## Why It Matters

In high-availability deployments, agents form a cluster to provide redundancy and automatic failover. If cluster health degrades -- members become unreachable, leader election fails, or state sync stalls -- task dispatch can stop entirely or produce split-brain scenarios where two agents execute the same task concurrently, leading to deployment conflicts.

## Common Causes

- Network partition between cluster members
- Leader node crashed without triggering failover
- State-sync backlog due to high task volume
- Clock skew between cluster members causing consensus-protocol failures
- Insufficient cluster members for quorum (see `check.agent.cluster.quorum`)
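To rule out the first cause quickly, you can TCP-probe every member from a single host. A rough sketch, assuming bash and the comma-separated `host:port` format used by `AGENT__CLUSTER__MEMBERS` (hostnames here are illustrative):

```bash
# Sketch: TCP-probe each cluster member from a comma-separated host:port list.
probe_members() {
  local list=$1 member host port
  for member in ${list//,/ }; do
    host=${member%:*}
    port=${member##*:}
    # /dev/tcp is a bash feature; timeout caps each probe at 2 seconds
    if timeout 2 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
      echo "${member} reachable"
    else
      echo "${member} UNREACHABLE"
    fi
  done
}

probe_members "agent-1:8500,agent-2:8500,agent-3:8500"
```

An `UNREACHABLE` member points at a network partition or a crashed node rather than a consensus-level problem.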

## How to Fix

### Docker Compose

```bash
# Check cluster member containers
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent

# View cluster-specific logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200 | grep -i cluster

# Restart all agent containers to force re-election
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
```

Set the clustering configuration in your `.env` file or compose override:

```bash
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MEMBERS=agent-1:8500,agent-2:8500,agent-3:8500
```
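The same settings can live in a compose override file instead of `.env`. A sketch, assuming the service is named `agent` as in the commands above:

```yaml
# docker-compose.override.yml (sketch -- service name assumed)
services:
  agent:
    environment:
      AGENT__CLUSTER__ENABLED: "true"
      AGENT__CLUSTER__MEMBERS: "agent-1:8500,agent-2:8500,agent-3:8500"
```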

### Bare Metal / systemd

```bash
# Check cluster status
stella agent cluster status

# View cluster member health
stella agent cluster members

# Force leader re-election if leader is unhealthy
stella agent cluster elect --force

# Restart agent to rejoin cluster
sudo systemctl restart stella-agent
```
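After a restart, re-election can take a while to settle. A small generic retry helper (a sketch, not part of the `stella` CLI) lets you block until a status command succeeds:

```bash
# Sketch: retry a command once per second, up to a given number of attempts.
# Returns 0 as soon as the command succeeds, 1 if all attempts fail.
wait_for() {
  local attempts=$1 i
  shift
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    sleep 1
  done
  return 1
}

# Example: wait up to ~30s for the cluster to report healthy after a restart
# (run on a host with the stella CLI installed):
# wait_for 30 stella agent cluster status
```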

### Kubernetes / Helm

```bash
# Check agent StatefulSet pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops

# View cluster gossip logs
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=100 | grep -i cluster
```

Enable clustering via Helm values:

```yaml
agent:
  cluster:
    enabled: true
    replicas: 3
```

or directly on the command line:

```bash
helm upgrade stellaops stellaops/stellaops --set agent.cluster.enabled=true --set agent.cluster.replicas=3
```

## Verification

Re-run the check after applying a fix:

```bash
stella doctor run --check check.agent.cluster.health
```

## Related Checks

- `check.agent.cluster.quorum` -- verifies minimum members for consensus
- `check.agent.heartbeat.freshness` -- individual agent connectivity
- `check.agent.capacity` -- fleet-level task capacity