---
checkId: check.agent.heartbeat.freshness
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, heartbeat, connectivity, quick]
---
# Agent Heartbeat Freshness

## What It Checks

Queries all non-revoked, non-inactive agents for the current tenant and classifies each by the age of its last heartbeat:

1. **Stale** (> 5 minutes since last heartbeat): Result is **Fail**. Evidence lists each stale agent with the time since its last heartbeat in minutes.
2. **Warning** (> 2 minutes but <= 5 minutes): Result is **Warn**. Evidence lists each delayed agent with time since heartbeat in seconds.
3. **Healthy** (<= 2 minutes): Result is **Pass**.

If no active agents are registered, the check returns **Warn** with a prompt to bootstrap agents. If the tenant ID is missing, it warns about being unable to check.

Evidence collected: `TotalActive`, `Stale` count, `Warning` count, `Healthy` count, per-agent names and heartbeat ages.

## Why It Matters

Heartbeats are the primary signal that an agent is alive and accepting work. A stale heartbeat means the agent has stopped communicating with the orchestrator -- it may have crashed, lost network connectivity, or had its mTLS certificate expire. Tasks dispatched to a stale agent will time out, and the lack of timely detection causes deployment delays and alert fatigue.
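
The thresholds listed under "What It Checks" can be sketched as a small shell helper. This is only an illustration of the classification rules, not the plugin's actual implementation: the function names `classify_heartbeat` and `overall_result` are hypothetical, and the rule that the worst per-agent status determines the overall result is an assumption inferred from the list above.

```bash
# Illustrative sketch only; names are hypothetical, thresholds come from this page.
STALE_AFTER_SECONDS=300   # more than 5 minutes => Fail
WARN_AFTER_SECONDS=120    # more than 2 minutes (up to 5) => Warn

# Classify one agent by the age of its last heartbeat, in seconds.
classify_heartbeat() {
  if [ "$1" -gt "$STALE_AFTER_SECONDS" ]; then
    echo "Fail"
  elif [ "$1" -gt "$WARN_AFTER_SECONDS" ]; then
    echo "Warn"
  else
    echo "Pass"
  fi
}

# Combine per-agent ages into one result.
# Assumption: the worst status wins; no active agents yields Warn.
overall_result() {
  if [ "$#" -eq 0 ]; then
    echo "Warn"
    return 0
  fi
  worst="Pass"
  for age in "$@"; do
    case "$(classify_heartbeat "$age")" in
      Fail) worst="Fail" ;;
      Warn) if [ "$worst" = "Pass" ]; then worst="Warn"; fi ;;
    esac
  done
  echo "$worst"
}
```

For example, `overall_result 45 180 600` would evaluate three agents whose last heartbeats were 45, 180, and 600 seconds ago. Note that the boundaries are inclusive on the healthy side: exactly 120 seconds is still healthy, and exactly 300 seconds is still a warning.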

## Common Causes

- Agent process has crashed or stopped
- Network connectivity issue between agent and orchestrator
- Firewall blocking agent heartbeat traffic (typically HTTPS on port 8443)
- Agent host is unreachable or powered off
- mTLS certificate has expired (see `check.agent.certificate.expiry`)
- Agent is under heavy load (warning-level)
- Network latency between agent and orchestrator (warning-level)
- Agent is processing long-running tasks that block the heartbeat loop (warning-level)

## How to Fix

### Docker Compose

```bash
# Check agent container status
docker compose -f devops/compose/docker-compose.stella-ops.yml ps agent

# View agent logs for crash or error messages
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200

# Restart agent container
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent

# Verify network connectivity from agent to orchestrator
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  curl -k https://orchestrator:8443/health
```

### Bare Metal / systemd

```bash
# Check agent service status
systemctl status stella-agent

# View recent agent logs
journalctl -u stella-agent --since '10 minutes ago'

# Run agent diagnostics
stella agent doctor

# Check network connectivity to orchestrator
curl -k https://orchestrator:8443/health

# If certificate expired, renew it
stella agent renew-cert --force

# Restart the service
sudo systemctl restart stella-agent
```

### Kubernetes / Helm

```bash
# Check agent pod status and restarts
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops

# View agent pod logs
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=200

# Check network policy allowing agent -> orchestrator traffic
kubectl get networkpolicy -n stellaops

# Restart agent pods via rollout
kubectl rollout restart deployment/stellaops-agent -n stellaops
```

## Verification

```
stella doctor run --check check.agent.heartbeat.freshness
```
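
After a restart, an agent may need a short time before its next heartbeat is recorded, so a single verification run immediately afterwards can still report a problem. A minimal sketch of a retry wrapper follows. The helper `retry_until_ok` is hypothetical, and using it with the doctor command assumes that command exits non-zero when the check does not pass, which this page does not confirm.

```bash
# Hypothetical helper: re-run a command until it succeeds or attempts run out.
# Usage: retry_until_ok <attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
  attempts=$1
  delay_seconds=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$attempts" ]; do
    # Run the wrapped command; stop as soon as it exits 0.
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the final one.
    if [ "$attempt" -lt "$attempts" ]; then
      sleep "$delay_seconds"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (assumes a non-zero exit code on a failing check):
#   retry_until_ok 5 30 stella doctor run --check check.agent.heartbeat.freshness
```

The attempt count and delay in the example are arbitrary; choose values that cover the agent's startup time in your environment.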

## Related Checks

- `check.agent.stale` -- detects agents offline for hours/days (longer threshold than heartbeat freshness)
- `check.agent.certificate.expiry` -- expired certificates cause heartbeat authentication failures
- `check.agent.capacity` -- heartbeat failures reduce effective fleet capacity
- `check.agent.resource.utilization` -- overloaded agents may delay heartbeats