| checkId | plugin | severity | tags |
|---|---|---|---|
| check.agent.cluster.quorum | stellaops.doctor.agent | fail | |
Agent Cluster Quorum
What It Checks
Verifies that the agent cluster has sufficient members online to maintain quorum for leader election and consensus operations. The check only runs when Agent:Cluster:Enabled is true. It is designed to verify:
- A majority of members are online (floor(n/2) + 1, or the configured minimum)
- Leader election is possible with current membership
- Split-brain prevention mechanisms are active
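The majority threshold in the first bullet is simple integer arithmetic; a minimal sketch (the quorum helper below is illustrative, not part of the stella CLI):

```shell
#!/bin/sh
# Majority quorum for an n-member cluster: floor(n/2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

quorum 3   # -> 2: one member may fail
quorum 4   # -> 3: still only one member may fail
quorum 5   # -> 3: two members may fail
```

Note that an even cluster size tolerates no more failures than the next-lower odd size, which is why odd member counts are generally recommended.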
Current status: implementation pending -- the check returns Skip with a placeholder message. The CanRun gate is functional (reads cluster config), but RunAsync does not yet query cluster membership.
Why It Matters
Without quorum, the agent cluster cannot elect a leader, which means no task dispatch, no failover, and potentially a complete halt of agent-driven operations. Losing quorum is often the step before a full cluster outage. Monitoring quorum proactively allows operators to add members or fix partitions before the cluster becomes non-functional.
Common Causes
- Too many cluster members went offline simultaneously (maintenance, host failure)
- Network partition isolating a minority of members from the majority
- Cluster scaled down below quorum threshold
- New deployment removed members without draining them first
How to Fix
Docker Compose
# Verify all agent containers are running
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent
# Scale agents to restore quorum (minimum 3 for quorum of 2)
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3
Ensure cluster member list is correct in .env:
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MINMEMBERS=2
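Before scaling, it can help to sanity-check the planned replica count against the configured minimum; a minimal sketch (the heredoc stands in for your real .env, which you would read directly):

```shell
#!/bin/sh
# Warn if the planned replica count cannot satisfy AGENT__CLUSTER__MINMEMBERS.
env_file=$(mktemp)
cat > "$env_file" <<'EOF'
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MINMEMBERS=2
EOF

replicas=3
min_members=$(grep '^AGENT__CLUSTER__MINMEMBERS=' "$env_file" | cut -d= -f2)

if [ "$replicas" -lt "$min_members" ]; then
    echo "replicas=$replicas is below MINMEMBERS=$min_members"
else
    echo "ok: replicas=$replicas satisfies MINMEMBERS=$min_members"
fi
rm -f "$env_file"
```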
Bare Metal / systemd
# Check how many cluster members are online
stella agent cluster members --status online
# If a member is down, restart it
ssh <agent-host> 'sudo systemctl restart stella-agent'
# Verify quorum status
stella agent cluster quorum
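The member and quorum checks above can be combined into a quick calculation; a minimal sketch, assuming the members command prints one online member per line (the sample output below is illustrative, not real command output):

```shell
#!/bin/sh
# Count online members and compare against the majority threshold.
# The heredoc stands in for: stella agent cluster members --status online
online_list=$(cat <<'EOF'
agent-01
agent-02
EOF
)
total_members=3                          # full cluster size
online=$(echo "$online_list" | grep -c .)
needed=$(( total_members / 2 + 1 ))

if [ "$online" -ge "$needed" ]; then
    echo "quorum held: $online/$total_members online (need $needed)"
else
    echo "quorum LOST: $online/$total_members online (need $needed)"
fi
```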
Kubernetes / Helm
# Check agent pod count vs desired
kubectl get statefulset stellaops-agent -n stellaops
# Scale up if below quorum
kubectl scale statefulset stellaops-agent --replicas=3 -n stellaops
# Check pod disruption budget
kubectl get pdb -n stellaops
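The statefulset check can also be scripted by parsing the READY column; a minimal sketch (the sample line stands in for real kubectl output, e.g. from kubectl get statefulset stellaops-agent -n stellaops --no-headers):

```shell
#!/bin/sh
# Parse "ready/desired" from a statefulset listing and flag quorum risk.
line="stellaops-agent   2/3   5d"
ready=$(echo "$line" | awk '{print $2}' | cut -d/ -f1)
desired=$(echo "$line" | awk '{print $2}' | cut -d/ -f2)
needed=$(( desired / 2 + 1 ))

if [ "$ready" -ge "$needed" ]; then
    echo "ok: $ready/$desired ready (quorum needs $needed)"
else
    echo "at risk: $ready/$desired ready (quorum needs $needed)"
fi
```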
Set a PodDisruptionBudget to prevent quorum loss during rollouts:
# values.yaml
agent:
  cluster:
    enabled: true
    replicas: 3
  podDisruptionBudget:
    minAvailable: 2
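Rendered, the resulting PodDisruptionBudget might look like the following sketch (apiVersion policy/v1 is the standard Kubernetes API; the metadata name, namespace, and label selector are assumptions about what the chart emits):

```yaml
# Illustrative rendered manifest; actual names and labels depend on the chart.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stellaops-agent
  namespace: stellaops
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: stellaops-agent
```

With minAvailable: 2 and 3 replicas, voluntary evictions (node drains, rollouts) can take down at most one agent pod at a time, so the cluster never drops below quorum.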
Verification
stella doctor run --check check.agent.cluster.quorum
Related Checks
- check.agent.cluster.health -- overall cluster health, including leader and sync status
- check.agent.capacity -- even with quorum, capacity may be insufficient
- check.agent.heartbeat.freshness -- individual member connectivity