Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.
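As a sketch, the remediation payload carries roughly this shape (field names here are illustrative, not the actual schema):

```yaml
# Hypothetical remediation metadata emitted by a check (illustrative only)
remediation:
  severity: fail            # fail | warn | info
  category: cluster
  runbook: docs/doctor/articles/check.agent.cluster.quorum.md
  fixes:
    - "Scale the agent deployment back above the quorum threshold"
```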

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Branch: master
Date: 2026-03-27 12:28:00 +02:00
Parent: fbd24e71de
Commit: c58a236d70
326 changed files with 18500 additions and 463 deletions

---
checkId: check.agent.cluster.quorum
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, cluster, quorum, ha]
---
# Agent Cluster Quorum
## What It Checks
Verifies that the agent cluster has sufficient members online to maintain quorum for leader election and consensus operations. The check only runs when `Agent:Cluster:Enabled` is `true`. It is designed to verify:
1. A majority of members is online (floor(n/2) + 1 of n, or the configured minimum)
2. Leader election is possible with current membership
3. Split-brain prevention mechanisms are active
**Current status:** implementation pending -- the check returns Skip with a placeholder message. The `CanRun` gate is functional (reads cluster config), but `RunAsync` does not yet query cluster membership.
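The quorum rule in step 1 is plain majority arithmetic; a minimal sketch (variable names are illustrative, not part of the check):

```bash
# Majority quorum: a cluster of n members needs floor(n/2) + 1 online.
n=3                        # total configured members
quorum=$(( n / 2 + 1 ))    # 2 for n=3, 3 for n=4 or n=5
echo "cluster of $n needs $quorum members online for quorum"
```

This is also why the fix sections below scale to 3 replicas: a 3-member cluster tolerates the loss of one member while keeping a quorum of 2.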
## Why It Matters
Without quorum, the agent cluster cannot elect a leader, which means no task dispatch, no failover, and potentially a complete halt of agent-driven operations. Losing quorum is often the step before a full cluster outage. Monitoring quorum proactively allows operators to add members or fix partitions before the cluster becomes non-functional.
## Common Causes
- Too many cluster members went offline simultaneously (maintenance, host failure)
- Network partition isolating a minority of members from the majority
- Cluster scaled down below quorum threshold
- New deployment removed members without draining them first
## How to Fix
### Docker Compose
```bash
# Verify all agent containers are running
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent
# Scale agents to restore quorum (minimum 3 for quorum of 2)
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3
```
Ensure cluster member list is correct in `.env`:
```
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MINMEMBERS=2
```
### Bare Metal / systemd
```bash
# Check how many cluster members are online
stella agent cluster members --status online
# If a member is down, restart it
ssh <agent-host> 'sudo systemctl restart stella-agent'
# Verify quorum status
stella agent cluster quorum
```
### Kubernetes / Helm
```bash
# Check agent pod count vs desired
kubectl get statefulset stellaops-agent -n stellaops
# Scale up if below quorum
kubectl scale statefulset stellaops-agent --replicas=3 -n stellaops
# Check pod disruption budget
kubectl get pdb -n stellaops
```
Set a PodDisruptionBudget to prevent quorum loss during rollouts:
```yaml
# values.yaml
agent:
cluster:
enabled: true
replicas: 3
podDisruptionBudget:
minAvailable: 2
```
## Verification
```bash
stella doctor run --check check.agent.cluster.quorum
```
## Related Checks
- `check.agent.cluster.health` -- overall cluster health including leader and sync status
- `check.agent.capacity` -- even with quorum, capacity may be insufficient
- `check.agent.heartbeat.freshness` -- individual member connectivity