| checkId | plugin | severity | tags |
|---|---|---|---|
| check.agent.cluster.quorum | stellaops.doctor.agent | fail | |
Agent Cluster Quorum
What It Checks
Verifies that the agent cluster has sufficient members online to maintain quorum for leader election and consensus operations. The check only runs when Agent:Cluster:Enabled is true. It is designed to verify:
- A majority of members are online (floor(n/2) + 1, or the configured minimum)
- Leader election is possible with current membership
- Split-brain prevention mechanisms are active
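The majority threshold in the first bullet is simple integer arithmetic; a minimal sketch (the quorum helper below is illustrative, not part of the stella CLI):

```shell
#!/bin/sh
# Majority quorum for an n-member cluster: floor(n/2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

quorum 3   # -> 2: one member may fail
quorum 4   # -> 3: still only one member may fail
quorum 5   # -> 3: two members may fail
```

Note that an even cluster size tolerates no more failures than the next-lower odd size, which is why odd member counts are generally recommended.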
Current status: implementation pending -- the check returns Skip with a placeholder message. The CanRun gate is functional (reads cluster config), but RunAsync does not yet query cluster membership.
Why It Matters
Without quorum, the agent cluster cannot elect a leader, which means no task dispatch, no failover, and potentially a complete halt of agent-driven operations. Losing quorum is often the step before a full cluster outage. Monitoring quorum proactively allows operators to add members or fix partitions before the cluster becomes non-functional.
Common Causes
- Too many cluster members went offline simultaneously (maintenance, host failure)
- Network partition isolating a minority of members from the majority
- Cluster scaled down below quorum threshold
- New deployment removed members without draining them first
How to Fix
Docker Compose
# Verify all agent containers are running
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent
# Scale agents to restore quorum (minimum 3 for quorum of 2)
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3
Ensure cluster member list is correct in .env:
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MINMEMBERS=2
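Before scaling, it can help to sanity-check the planned replica count against the configured minimum; a minimal sketch (the heredoc stands in for your real .env, which you would read directly):

```shell
#!/bin/sh
# Warn if the planned replica count cannot satisfy AGENT__CLUSTER__MINMEMBERS.
env_file=$(mktemp)
cat > "$env_file" <<'EOF'
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MINMEMBERS=2
EOF

replicas=3
min_members=$(grep '^AGENT__CLUSTER__MINMEMBERS=' "$env_file" | cut -d= -f2)

if [ "$replicas" -lt "$min_members" ]; then
    echo "replicas=$replicas is below MINMEMBERS=$min_members"
else
    echo "ok: replicas=$replicas satisfies MINMEMBERS=$min_members"
fi
rm -f "$env_file"
```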
Bare Metal / systemd
# Check how many cluster members are online
stella agent cluster members --status online
# If a member is down, restart it
ssh <agent-host> 'sudo systemctl restart stella-agent'
# Verify quorum status
stella agent cluster quorum
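The member and quorum checks above can be combined into a quick calculation; a minimal sketch, assuming the members command prints one online member per line (the sample output below is illustrative, not real command output):

```shell
#!/bin/sh
# Count online members and compare against the majority threshold.
# The heredoc stands in for: stella agent cluster members --status online
online_list=$(cat <<'EOF'
agent-01
agent-02
EOF
)
total_members=3                          # full cluster size
online=$(echo "$online_list" | grep -c .)
needed=$(( total_members / 2 + 1 ))

if [ "$online" -ge "$needed" ]; then
    echo "quorum held: $online/$total_members online (need $needed)"
else
    echo "quorum LOST: $online/$total_members online (need $needed)"
fi
```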
Kubernetes / Helm
# Check agent pod count vs desired
kubectl get statefulset stellaops-agent -n stellaops
# Scale up if below quorum
kubectl scale statefulset stellaops-agent --replicas=3 -n stellaops
# Check pod disruption budget
kubectl get pdb -n stellaops
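The statefulset check can also be scripted by parsing the READY column; a minimal sketch (the sample line stands in for real kubectl output, e.g. from kubectl get statefulset stellaops-agent -n stellaops --no-headers):

```shell
#!/bin/sh
# Parse "ready/desired" from a statefulset listing and flag quorum risk.
line="stellaops-agent   2/3   5d"
ready=$(echo "$line" | awk '{print $2}' | cut -d/ -f1)
desired=$(echo "$line" | awk '{print $2}' | cut -d/ -f2)
needed=$(( desired / 2 + 1 ))

if [ "$ready" -ge "$needed" ]; then
    echo "ok: $ready/$desired ready (quorum needs $needed)"
else
    echo "at risk: $ready/$desired ready (quorum needs $needed)"
fi
```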
Set a PodDisruptionBudget to prevent quorum loss during rollouts:
# values.yaml
agent:
  cluster:
    enabled: true
    replicas: 3
  podDisruptionBudget:
    minAvailable: 2
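Rendered, the resulting PodDisruptionBudget might look like the following sketch (apiVersion policy/v1 is the standard Kubernetes API; the metadata name, namespace, and label selector are assumptions about what the chart emits):

```yaml
# Illustrative rendered manifest; actual names and labels depend on the chart.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stellaops-agent
  namespace: stellaops
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: stellaops-agent
```

With minAvailable: 2 and 3 replicas, voluntary evictions (node drains, rollouts) can take down at most one agent pod at a time, so the cluster never drops below quorum.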
Verification
stella doctor run --check check.agent.cluster.quorum
Related Checks
- check.agent.cluster.health -- overall cluster health, including leader and sync status
- check.agent.capacity -- even with quorum, capacity may be insufficient
- check.agent.heartbeat.freshness -- individual member connectivity