| checkId | plugin | severity | tags |
|---|---|---|---|
| `check.agent.stale` | `stellaops.doctor.agent` | `warn` | |
# Stale Agent Detection

## What It Checks
Identifies agents that have been offline (no heartbeat) for extended periods and may need investigation or decommissioning. The check inspects all non-revoked, non-inactive agents and categorizes them:
- Decommission candidates -- offline for more than 7 days. Result: Warn listing each agent with days offline.
- Stale -- offline for more than 1 hour but less than 7 days. Result: Warn listing each agent with hours offline.
- All healthy -- no agents exceed the 1-hour stale threshold. Result: Pass.
The check uses `LastHeartbeatAt` from the agent store. Agents with no recorded heartbeat (`null`) are treated as having an offline duration of `TimeSpan.MaxValue`, so they always fall into the decommission-candidate bucket.
Evidence collected: `DecommissionCandidates` count, `StaleAgents` count, and per-agent names with offline durations.
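The categorization described above can be sketched as follows. This is a minimal illustration of the threshold logic, not the actual Doctor plugin code; the function name `categorize_agent` and the return labels are hypothetical, while the 1-hour and 7-day thresholds and the null-heartbeat handling come from the check description:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Thresholds as documented for check.agent.stale
STALE_THRESHOLD = timedelta(hours=1)
DECOMMISSION_THRESHOLD = timedelta(days=7)

def categorize_agent(last_heartbeat_at: Optional[datetime],
                     now: datetime) -> str:
    """Classify an agent by how long it has been offline.

    A null heartbeat is treated as an effectively infinite offline
    duration (mirroring TimeSpan.MaxValue in the check), so such
    agents always land in the decommission bucket.
    """
    if last_heartbeat_at is None:
        offline = timedelta.max
    else:
        offline = now - last_heartbeat_at

    if offline > DECOMMISSION_THRESHOLD:
        return "decommission-candidate"  # offline > 7 days -> Warn
    if offline > STALE_THRESHOLD:
        return "stale"                   # offline > 1 hour -> Warn
    return "healthy"                     # within threshold -> Pass

now = datetime.now(timezone.utc)
print(categorize_agent(now - timedelta(minutes=5), now))  # healthy
print(categorize_agent(now - timedelta(hours=6), now))    # stale
print(categorize_agent(None, now))                        # decommission-candidate
```

Note that both `Warn` outcomes share one threshold boundary: an agent crosses from `stale` to `decommission-candidate` the moment its offline duration exceeds seven days.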
## Why It Matters
Stale agents consume fleet management overhead, confuse capacity planning, and may hold allocated resources (IP addresses, certificates, license seats) that could be reclaimed. An agent that has been offline for 7+ days is unlikely to return without intervention and should be explicitly deactivated or investigated. Ignoring stale agents leads to a growing inventory of ghost entries that obscure the true fleet state.
## Common Causes
- Agent host has been permanently removed (decommissioned hardware, terminated cloud instance)
- Agent was replaced by a new instance but the old registration was not deactivated
- Infrastructure change (network re-architecture, datacenter migration) without cleanup
- Agent host is undergoing extended maintenance
- Network partition isolating the agent
- Agent process crash without auto-restart configured (systemd restart policy missing)
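For the last cause (no auto-restart policy), a systemd drop-in is the usual remedy. This is a generic sketch assuming the unit is named `stella-agent`; adjust the unit name and values to your deployment:

```ini
# /etc/systemd/system/stella-agent.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=10
```

Apply it with `sudo systemctl daemon-reload && sudo systemctl restart stella-agent`.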
## How to Fix

### Docker Compose

```bash
# List all agent registrations with status
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent list --all

# Deactivate a stale agent
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent deactivate --agent-id <agent-id>
```
### Bare Metal / systemd

```bash
# Review stale agents
stella agent list --status stale

# Deactivate agents that are no longer needed
stella agent deactivate --agent-id <agent-id>

# If the agent should still be active, investigate the host
ssh <agent-host> 'systemctl status stella-agent'

# Check network connectivity from the agent host
ssh <agent-host> 'curl -k https://orchestrator:8443/health'

# Restart the agent on the host
ssh <agent-host> 'sudo systemctl restart stella-agent'
```
### Kubernetes / Helm

```bash
# Check for terminated or evicted agent pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops \
  --field-selector=status.phase!=Running

# Remove stale agent registrations via API
stella agent deactivate --agent-id <agent-id>

# If a pod was evicted, check node status
kubectl get nodes
kubectl describe node <node-name> | grep -A5 Conditions
```
## Verification

```bash
stella doctor run --check check.agent.stale
```
## Related Checks

- `check.agent.heartbeat.freshness` -- short-term heartbeat staleness (minutes vs. hours/days)
- `check.agent.capacity` -- stale agents do not contribute to capacity
- `check.agent.certificate.expiry` -- long-offline agents likely have expired certificates