Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| checkId | plugin | severity | tags |
|---|---|---|---|
| `check.agent.cluster.health` | `stellaops.doctor.agent` | fail | |
# Agent Cluster Health

## What It Checks
Monitors the health of the agent cluster when clustering is enabled. The check only runs when the configuration key `Agent:Cluster:Enabled` is set to `true`. It is designed to verify:
- All cluster members are reachable
- A leader is elected and healthy
- State synchronization is working across members
- Failover is possible if the current leader goes down
**Current status:** implementation pending -- the check returns `Skip` with a placeholder message. The `CanRun` gate is functional (it reads the cluster configuration), but `RunAsync` does not yet perform cluster health probes.
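Because the gate reads standard .NET-style configuration, you can preview whether the check will run by inspecting the environment-variable form of the key. A minimal sketch -- the double-underscore mapping matches the compose configuration shown later, and the echoed strings are illustrative:

```shell
# Illustrative: Agent:Cluster:Enabled maps to AGENT__CLUSTER__ENABLED
export AGENT__CLUSTER__ENABLED=true

if [ "${AGENT__CLUSTER__ENABLED:-false}" = "true" ]; then
  RESULT="run"
else
  RESULT="be skipped"   # mirrors the check's Skip result
fi
echo "cluster health check will $RESULT"
```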
## Why It Matters
In high-availability deployments, agents form a cluster to provide redundancy and automatic failover. If cluster health degrades -- members become unreachable, leader election fails, or state sync stalls -- task dispatch can stop entirely or produce split-brain scenarios where two agents execute the same task concurrently, leading to deployment conflicts.
## Common Causes
- Network partition between cluster members
- Leader node crashed without triggering failover
- State sync backlog due to high task volume
- Clock skew between cluster members causing consensus protocol failures
- Insufficient cluster members for quorum (see `check.agent.cluster.quorum`)
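Clock skew in particular is quick to rule out: compare epoch timestamps reported by two members. A minimal sketch with hardcoded illustrative values -- in practice you would read the timestamps from each member:

```shell
# Illustrative epoch timestamps reported by two cluster members
T1=1700000000
T2=1700000003

# Absolute difference in seconds
SKEW=$(( T1 > T2 ? T1 - T2 : T2 - T1 ))
echo "clock skew: ${SKEW}s"

# Consensus protocols tolerate only small skew; flag anything over 1s
if [ "$SKEW" -gt 1 ]; then
  echo "warning: clock skew exceeds 1s; check NTP on both members"
fi
```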
## How to Fix

### Docker Compose
```shell
# Check cluster member containers
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent

# View cluster-specific logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200 | grep -i cluster

# Restart all agent containers to force re-election
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
```
Set clustering configuration in your `.env` or compose override:
```env
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MEMBERS=agent-1:8500,agent-2:8500,agent-3:8500
```
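Quorum requires a strict majority, so an even member count buys no extra fault tolerance. A quick sanity check on the member-list format -- the variable value is copied from the example above; adjust it to your deployment:

```shell
MEMBERS="agent-1:8500,agent-2:8500,agent-3:8500"

# Count comma-separated members
COUNT=$(echo "$MEMBERS" | tr ',' '\n' | wc -l)
echo "members: $COUNT"

# An even count cannot break ties in leader election
if [ $(( COUNT % 2 )) -eq 0 ]; then
  echo "warning: use an odd number of members for clean quorum"
fi
```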
### Bare Metal / systemd
```shell
# Check cluster status
stella agent cluster status

# View cluster member health
stella agent cluster members

# Force leader re-election if leader is unhealthy
stella agent cluster elect --force

# Restart agent to rejoin cluster
sudo systemctl restart stella-agent
```
### Kubernetes / Helm
```shell
# Check agent StatefulSet pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops

# View cluster gossip logs
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=100 | grep -i cluster
```

Helm values for clustering:

```yaml
agent:
  cluster:
    enabled: true
    replicas: 3
```

```shell
helm upgrade stellaops stellaops/stellaops --set agent.cluster.enabled=true --set agent.cluster.replicas=3
```
## Verification

```shell
stella doctor run --check check.agent.cluster.health
```
## Related Checks

- `check.agent.cluster.quorum` -- verifies minimum members for consensus
- `check.agent.heartbeat.freshness` -- individual agent connectivity
- `check.agent.capacity` -- fleet-level task capacity