Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/agent/heartbeat-freshness.md
+++ b/docs/doctor/articles/agent/heartbeat-freshness.md
@@ -0,0 +1,104 @@
+---
+checkId: check.agent.heartbeat.freshness
+plugin: stellaops.doctor.agent
+severity: fail
+tags: [agent, heartbeat, connectivity, quick]
+---
+# Agent Heartbeat Freshness
+
+## What It Checks
+
+Queries all non-revoked, non-inactive agents for the current tenant and classifies each by the age of its last heartbeat:
+
+1. **Stale** (> 5 minutes since last heartbeat): Result is **Fail**. Evidence lists each stale agent with the time since its last heartbeat in minutes.
+2. **Warning** (> 2 minutes but <= 5 minutes): Result is **Warn**. Evidence lists each delayed agent with the time since its last heartbeat in seconds.
+3. **Healthy** (<= 2 minutes): Result is **Pass**.
+
+If no active agents are registered, the check returns **Warn** with a prompt to bootstrap agents. If the tenant ID is missing, the check warns that it cannot evaluate agent heartbeats.
+
+Evidence collected: `TotalActive`, `Stale` count, `Warning` count, `Healthy` count, per-agent names and heartbeat ages.
+
+## Why It Matters
+
+Heartbeats are the primary signal that an agent is alive and accepting work. A stale heartbeat means the agent has stopped communicating with the orchestrator -- it may have crashed, lost network connectivity, or had its mTLS certificate expire. Tasks dispatched to a stale agent will time out, and if the stale agent is not detected promptly, the result is delayed deployments and alert fatigue.
+
+## Common Causes
+
+- Agent process has crashed or stopped
+- Network connectivity issue between agent and orchestrator
+- Firewall blocking agent heartbeat traffic (typically HTTPS on port 8443)
+- Agent host is unreachable or powered off
+- mTLS certificate has expired (see `check.agent.certificate.expiry`)
+- Agent is under heavy load (warning-level)
+- Network latency between agent and orchestrator (warning-level)
+- Agent is processing long-running tasks that block the heartbeat loop (warning-level)
+
+## How to Fix
+
+### Docker Compose
+
+```bash
+# Check agent container status
+docker compose -f devops/compose/docker-compose.stella-ops.yml ps agent
+
+# View agent logs for crash or error messages
+docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200
+
+# Restart agent container
+docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
+
+# Verify network connectivity from agent to orchestrator (-k skips TLS certificate verification; diagnostics only)
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
+  curl -k https://orchestrator:8443/health
+```
+
+### Bare Metal / systemd
+
+```bash
+# Check agent service status
+systemctl status stella-agent
+
+# View recent agent logs
+journalctl -u stella-agent --since '10 minutes ago'
+
+# Run agent diagnostics
+stella agent doctor
+
+# Check network connectivity to orchestrator (-k skips TLS certificate verification; diagnostics only)
+curl -k https://orchestrator:8443/health
+
+# If the certificate has expired, renew it
+stella agent renew-cert --force
+
+# Restart the service
+sudo systemctl restart stella-agent
+```
+
+### Kubernetes / Helm
+
+```bash
+# Check agent pod status and restarts
+kubectl get pods -l app.kubernetes.io/component=agent -n stellaops
+
+# View agent pod logs
+kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=200
+
+# List network policies and confirm that one allows agent -> orchestrator traffic
+kubectl get networkpolicy -n stellaops
+
+# Restart agent pods via rollout
+kubectl rollout restart deployment/stellaops-agent -n stellaops
+```
+
+## Verification
+
+```bash
+stella doctor run --check check.agent.heartbeat.freshness
+```
+
+## Related Checks
+
+- `check.agent.stale` -- detects agents offline for hours/days (longer threshold than heartbeat freshness)
+- `check.agent.certificate.expiry` -- expired certificates cause heartbeat authentication failures
+- `check.agent.capacity` -- heartbeat failures reduce effective fleet capacity
+- `check.agent.resource.utilization` -- overloaded agents may delay heartbeats
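
The thresholds listed under "What It Checks" can be sketched as a small shell helper. This is only an illustration, not the check's actual implementation: the function names `classify_heartbeat_age` and `overall_result` are hypothetical, and taking the worst per-agent status as the overall result is an assumption that the article implies but does not state outright.

```bash
#!/bin/sh
# Illustrative sketch only. The thresholds come from the article;
# the function names and the worst-status aggregation are assumptions.

STALE_AFTER_SECONDS=300   # more than 5 minutes -> Fail
WARN_AFTER_SECONDS=120    # more than 2 minutes -> Warn

# Classify a single agent by the age of its last heartbeat, in whole seconds.
classify_heartbeat_age() {
  age_seconds=$1
  if [ "$age_seconds" -gt "$STALE_AFTER_SECONDS" ]; then
    echo "Fail"
  elif [ "$age_seconds" -gt "$WARN_AFTER_SECONDS" ]; then
    echo "Warn"
  else
    echo "Pass"
  fi
}

# Combine per-agent results into one status (assumed: the worst status wins).
# With no agents at all, return Warn, matching the "no active agents" case.
overall_result() {
  if [ "$#" -eq 0 ]; then
    echo "Warn"
    return 0
  fi
  worst="Pass"
  for age in "$@"; do
    status=$(classify_heartbeat_age "$age")
    if [ "$status" = "Fail" ]; then
      worst="Fail"
    elif [ "$status" = "Warn" ] && [ "$worst" = "Pass" ]; then
      worst="Warn"
    fi
  done
  echo "$worst"
}
```

For example, `overall_result 30 150 45` prints `Warn`, because one agent's heartbeat is 150 seconds old and none exceeds 300 seconds. An age of exactly 300 seconds still counts as a warning, since the stale threshold is strictly greater than 5 minutes.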