Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.
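As a sketch, the remediation payload carries roughly this shape (field names here are illustrative, not the actual schema):

```yaml
# Hypothetical remediation metadata emitted by a check (illustrative only)
remediation:
  severity: fail            # fail | warn | info
  category: cluster
  runbook: docs/doctor/articles/check.agent.cluster.quorum.md
  fixes:
    - "Scale the agent deployment back above the quorum threshold"
```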

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Branch: master
Date: 2026-03-27 12:28:00 +02:00
Parent: fbd24e71de
Commit: c58a236d70
326 changed files with 18500 additions and 463 deletions

---
checkId: check.agent.cluster.quorum
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, cluster, quorum, ha]
---
# Agent Cluster Quorum
## What It Checks
Verifies that the agent cluster has sufficient members online to maintain quorum for leader election and consensus operations. The check only runs when `Agent:Cluster:Enabled` is `true`. It is designed to verify:
1. A majority of members is online (floor(n/2) + 1 of n, or the configured minimum)
2. Leader election is possible with current membership
3. Split-brain prevention mechanisms are active
**Current status:** implementation pending -- the check returns Skip with a placeholder message. The `CanRun` gate is functional (reads cluster config), but `RunAsync` does not yet query cluster membership.
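The quorum rule in step 1 is plain majority arithmetic; a minimal sketch (variable names are illustrative, not part of the check):

```bash
# Majority quorum: a cluster of n members needs floor(n/2) + 1 online.
n=3                        # total configured members
quorum=$(( n / 2 + 1 ))    # 2 for n=3, 3 for n=4 or n=5
echo "cluster of $n needs $quorum members online for quorum"
```

This is also why the fix sections below scale to 3 replicas: a 3-member cluster tolerates the loss of one member while keeping a quorum of 2.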
## Why It Matters
Without quorum, the agent cluster cannot elect a leader, which means no task dispatch, no failover, and potentially a complete halt of agent-driven operations. Losing quorum is often the step before a full cluster outage. Monitoring quorum proactively allows operators to add members or fix partitions before the cluster becomes non-functional.
## Common Causes
- Too many cluster members went offline simultaneously (maintenance, host failure)
- Network partition isolating a minority of members from the majority
- Cluster scaled down below quorum threshold
- New deployment removed members without draining them first
## How to Fix
### Docker Compose
```bash
# Verify all agent containers are running
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent
# Scale agents to restore quorum (minimum 3 for quorum of 2)
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3
```
Ensure cluster member list is correct in `.env`:
```
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MINMEMBERS=2
```
### Bare Metal / systemd
```bash
# Check how many cluster members are online
stella agent cluster members --status online
# If a member is down, restart it
ssh <agent-host> 'sudo systemctl restart stella-agent'
# Verify quorum status
stella agent cluster quorum
```
### Kubernetes / Helm
```bash
# Check agent pod count vs desired
kubectl get statefulset stellaops-agent -n stellaops
# Scale up if below quorum
kubectl scale statefulset stellaops-agent --replicas=3 -n stellaops
# Check pod disruption budget
kubectl get pdb -n stellaops
```
Set a PodDisruptionBudget to prevent quorum loss during rollouts:
```yaml
# values.yaml
agent:
cluster:
enabled: true
replicas: 3
podDisruptionBudget:
minAvailable: 2
```
## Verification
```bash
stella doctor run --check check.agent.cluster.quorum
```
## Related Checks
- `check.agent.cluster.health` -- overall cluster health including leader and sync status
- `check.agent.capacity` -- even with quorum, capacity may be insufficient
- `check.agent.heartbeat.freshness` -- individual member connectivity