---
checkId: check.agent.heartbeat.freshness
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, heartbeat, connectivity, quick]
---
# Agent Heartbeat Freshness

## What It Checks

Queries every agent for the current tenant that is neither revoked nor inactive, and classifies each by the age of its last heartbeat:

1. **Stale** (> 5 minutes since last heartbeat): Result is **Fail**. Evidence lists each stale agent with the time since its last heartbeat in minutes.
2. **Warning** (> 2 minutes but <= 5 minutes): Result is **Warn**. Evidence lists each delayed agent with time since heartbeat in seconds.
3. **Healthy** (<= 2 minutes): Result is **Pass**.

If no active agents are registered, the check returns **Warn** with a prompt to bootstrap agents. If the tenant ID is missing, the check cannot query agents and returns a warning saying so.

Evidence collected: `TotalActive`, `Stale` count, `Warning` count, `Healthy` count, per-agent names and heartbeat ages.
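
The classification above can be sketched as a small shell helper. This is an illustration only, not the plugin's implementation: the function names `classify_heartbeat` and `overall_result` are hypothetical, and the rule that the worst per-agent status decides the overall result is an assumption inferred from the ordering of the list.

```bash
# Hypothetical helper (not part of the stella CLI): classify one agent by
# the age of its last heartbeat, given in whole seconds.
classify_heartbeat() {
  if [ "$1" -gt 300 ]; then
    echo "Stale"      # more than 5 minutes
  elif [ "$1" -gt 120 ]; then
    echo "Warning"    # more than 2 minutes, at most 5 minutes
  else
    echo "Healthy"    # at most 2 minutes
  fi
}

# Hypothetical aggregation (assumption: the worst per-agent status wins).
# Takes one heartbeat age per active agent.
overall_result() {
  local result="Pass" age
  if [ "$#" -eq 0 ]; then
    echo "Warn"       # no active agents registered
    return 0
  fi
  for age in "$@"; do
    case "$(classify_heartbeat "$age")" in
      Stale)   result="Fail" ;;
      Warning) [ "$result" = "Fail" ] || result="Warn" ;;
    esac
  done
  echo "$result"
}
```

For example, `overall_result 30 150 400` describes three agents last seen 30, 150, and 400 seconds ago; under the assumption above the third agent is stale, so the sketch reports `Fail`.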

## Why It Matters

Heartbeats are the primary signal that an agent is alive and accepting work. A stale heartbeat means the agent has stopped communicating with the orchestrator -- it may have crashed, lost network connectivity, or had its mTLS certificate expire. Tasks dispatched to a stale agent will time out. If the stale agent is not detected promptly, deployments are delayed and the resulting failures add to alert fatigue.

## Common Causes

- Agent process has crashed or stopped
- Network connectivity issue between agent and orchestrator
- Firewall blocking agent heartbeat traffic (typically HTTPS on port 8443)
- Agent host is unreachable or powered off
- mTLS certificate has expired (see `check.agent.certificate.expiry`)
- Agent is under heavy load (warning-level)
- Network latency between agent and orchestrator (warning-level)
- Agent is processing long-running tasks that block the heartbeat loop (warning-level)

## How to Fix

### Docker Compose

```bash
# Check agent container status
docker compose -f devops/compose/docker-compose.stella-ops.yml ps agent

# View agent logs for crash or error messages
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200

# Restart agent container
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent

# Verify network connectivity from agent to orchestrator
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  curl -k https://orchestrator:8443/health
```

### Bare Metal / systemd

```bash
# Check agent service status
systemctl status stella-agent

# View recent agent logs
journalctl -u stella-agent --since '10 minutes ago'

# Run agent diagnostics
stella agent doctor

# Check network connectivity to orchestrator
# (substitute your orchestrator's hostname; -k skips TLS certificate verification)
curl -k https://orchestrator:8443/health

# If certificate expired, renew it
stella agent renew-cert --force

# Restart the service
sudo systemctl restart stella-agent
```

### Kubernetes / Helm

```bash
# Check agent pod status and restarts
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops

# View agent pod logs
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=200

# Check network policy allowing agent -> orchestrator traffic
kubectl get networkpolicy -n stellaops

# Restart agent pods via rollout
kubectl rollout restart deployment/stellaops-agent -n stellaops
```

## Verification

```bash
stella doctor run --check check.agent.heartbeat.freshness
```
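
After a restart, an agent may need a short time before its next heartbeat is recorded, so a single run of the check can still report the old state. A minimal retry sketch follows. The helper name `retry_until_ok` is hypothetical, and the commented usage assumes that `stella doctor run` exits with a non-zero status when the check does not pass, which this document does not confirm.

```bash
# Hypothetical helper: re-run a command until it succeeds or attempts run out.
# Usage: retry_until_ok <max_attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
  local max_attempts="$1" delay_seconds="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt/$max_attempts did not succeed" >&2
    # Wait before the next attempt, but not after the last one
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay_seconds"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (assumes a non-zero exit status when the check does not pass):
# retry_until_ok 6 30 stella doctor run --check check.agent.heartbeat.freshness
```

The attempt count and delay in the example are arbitrary; choose values that cover the agent's startup time in your environment.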

## Related Checks

- `check.agent.stale` -- detects agents offline for hours/days (longer threshold than heartbeat freshness)
- `check.agent.certificate.expiry` -- expired certificates cause heartbeat authentication failures
- `check.agent.capacity` -- heartbeat failures reduce effective fleet capacity
- `check.agent.resource.utilization` -- overloaded agents may delay heartbeats