Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/agent/stale.md
+++ b/docs/doctor/articles/agent/stale.md
@@ -0,0 +1,91 @@
+---
+checkId: check.agent.stale
+plugin: stellaops.doctor.agent
+severity: warn
+tags: [agent, maintenance, cleanup]
+---
+# Stale Agent Detection
+
+## What It Checks
+
+Identifies agents that have been offline (no heartbeat) for extended periods and may need investigation or decommissioning. The check inspects every agent that is neither revoked nor inactive and places it in one of three categories:
+
+1. **Decommission candidates** -- offline for more than **7 days**. Result: **Warn** listing each agent with days offline.
+2. **Stale** -- offline for more than **1 hour** but no more than 7 days. Result: **Warn** listing each agent with hours offline.
+3. **All healthy** -- no agents exceed the 1-hour stale threshold. Result: **Pass**.
+
+The check uses `LastHeartbeatAt` from the agent store. Agents with no recorded heartbeat (`null`) are treated as having an offline duration of `TimeSpan.MaxValue`, so they exceed both thresholds and are reported as decommission candidates.
+
+Evidence collected: `DecommissionCandidates` count, `StaleAgents` count, per-agent names with offline durations.
+
+## Why It Matters
+
+Stale agents add fleet management overhead, distort capacity planning, and may hold allocated resources (IP addresses, certificates, license seats) that could be reclaimed. An agent that has been offline for 7+ days is unlikely to return without intervention and should be explicitly deactivated or investigated. Ignoring stale agents leads to a growing inventory of ghost entries that obscure the true fleet state.
+
+## Common Causes
+
+- Agent host has been permanently removed (decommissioned hardware, terminated cloud instance)
+- Agent was replaced by a new instance but the old registration was not deactivated
+- Infrastructure change (network re-architecture, datacenter migration) without cleanup
+- Agent host is undergoing extended maintenance
+- Network partition isolating the agent
+- Agent process crash without auto-restart configured (systemd restart policy missing)
+
+## How to Fix
+
+### Docker Compose
+
+```bash
+# List all agent registrations with status
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
+  stella agent list --all
+
+# Deactivate a stale agent
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
+  stella agent deactivate --agent-id <agent-id>
+```
+
+### Bare Metal / systemd
+
+```bash
+# Review stale agents
+stella agent list --status stale
+
+# Deactivate agents that are no longer needed
+stella agent deactivate --agent-id <agent-id>
+
+# If the agent should still be active, investigate the host
+ssh <agent-host> 'systemctl status stella-agent'
+
+# Check network connectivity from the agent host
+ssh <agent-host> 'curl -k https://orchestrator:8443/health'
+
+# Restart agent on the host
+ssh <agent-host> 'sudo systemctl restart stella-agent'
+```
+
+### Kubernetes / Helm
+
+```bash
+# Check for terminated or evicted agent pods
+kubectl get pods -l app.kubernetes.io/component=agent -n stellaops --field-selector=status.phase!=Running
+
+# Deactivate stale agent registrations with the CLI
+stella agent deactivate --agent-id <agent-id>
+
+# If pod was evicted, check node status
+kubectl get nodes
+kubectl describe node <node-name> | grep -A5 Conditions
+```
+
+## Verification
+
+```bash
+stella doctor run --check check.agent.stale
+```
+
+## Related Checks
+
+- `check.agent.heartbeat.freshness` -- short-term heartbeat staleness (minutes vs. hours/days)
+- `check.agent.capacity` -- stale agents do not contribute to capacity
+- `check.agent.certificate.expiry` -- long-offline agents likely have expired certificates
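
The categorization described under "What It Checks" in `stale.md` can be illustrated with a minimal Python sketch. This is not the plugin's actual implementation (which appears to be .NET, given `TimeSpan.MaxValue`). The function names `classify_agent` and `check_result` are hypothetical, and treating an agent that is offline for exactly 7 days as stale rather than as a decommission candidate is an assumption. Only the 1-hour and 7-day thresholds and the handling of a missing heartbeat come from the article.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

# Thresholds stated in the article.
STALE_THRESHOLD = timedelta(hours=1)
DECOMMISSION_THRESHOLD = timedelta(days=7)


def classify_agent(last_heartbeat_at: Optional[datetime], now: datetime) -> str:
    """Return 'decommission', 'stale', or 'healthy' for a single agent.

    Hypothetical helper: the real check works on agent records from the
    agent store; here only the last heartbeat timestamp is modeled.
    """
    # A missing heartbeat is treated as the maximum possible offline
    # duration, mirroring the TimeSpan.MaxValue behavior in the article.
    if last_heartbeat_at is None:
        offline = timedelta.max
    else:
        offline = now - last_heartbeat_at

    if offline > DECOMMISSION_THRESHOLD:
        return "decommission"
    # Assumption: exactly 7 days offline still counts as stale.
    if offline > STALE_THRESHOLD:
        return "stale"
    return "healthy"


def check_result(categories: Iterable[str]) -> str:
    """Aggregate per-agent categories: any non-healthy agent yields Warn."""
    return "Warn" if any(c != "healthy" for c in categories) else "Pass"
```

For example, calling `classify_agent(None, now)` falls into the decommission branch because `timedelta.max` exceeds the 7-day threshold, which matches the article's statement that agents without a heartbeat are treated as maximally offline.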