---
checkId: check.agent.stale
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, maintenance, cleanup]
---
# Stale Agent Detection

## What It Checks

Identifies agents that have been offline (no heartbeat) for extended periods and may need investigation or decommissioning. The check inspects all agents that are neither revoked nor inactive and categorizes them:

1. **Decommission candidates** -- offline for more than **7 days**. Result: **Warn** listing each agent with days offline.
2. **Stale** -- offline for more than **1 hour** but less than 7 days. Result: **Warn** listing each agent with hours offline.
3. **All healthy** -- no agents exceed the 1-hour stale threshold. Result: **Pass**.

The check uses `LastHeartbeatAt` from the agent store to compute how long each agent has been offline. Agents with no recorded heartbeat (`null`) are treated as having an offline duration of `TimeSpan.MaxValue`, so they always exceed the 7-day threshold.

Evidence collected: `DecommissionCandidates` count, `StaleAgents` count, per-agent names with offline durations.
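
The thresholds above can be sketched as a small shell function. This is an illustration only, not the check's actual implementation: the function name `classify_agent`, the variable names, and the category labels it prints are hypothetical, and treating an agent offline for exactly 7 days as stale is an assumption about the boundary.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the documented thresholds; not the check's own code.
STALE_SECONDS=$((60 * 60))                  # 1 hour
DECOMMISSION_SECONDS=$((7 * 24 * 60 * 60))  # 7 days

# classify_agent <last_heartbeat_epoch_or_empty> <now_epoch>
# Prints one of: healthy | stale | decommission-candidate
classify_agent() {
  local last_heartbeat="$1" now="$2"

  # No recorded heartbeat: treated as an unbounded offline duration,
  # which exceeds every threshold.
  if [ -z "$last_heartbeat" ]; then
    echo "decommission-candidate"
    return
  fi

  local offline=$(( now - last_heartbeat ))
  if [ "$offline" -gt "$DECOMMISSION_SECONDS" ]; then
    echo "decommission-candidate"
  elif [ "$offline" -gt "$STALE_SECONDS" ]; then
    # Assumption: exactly 7 days offline still counts as stale.
    echo "stale"
  else
    echo "healthy"
  fi
}

# Example: an agent last seen three hours ago.
now=$(date +%s)
classify_agent "$(( now - 3 * 3600 ))" "$now"   # prints: stale
```

Passing the current time as an argument keeps the function deterministic, which makes the threshold boundaries easy to exercise by hand.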

## Why It Matters

Stale agents consume fleet management overhead, confuse capacity planning, and may hold allocated resources (IP addresses, certificates, license seats) that could be reclaimed. An agent that has been offline for 7+ days is unlikely to return without intervention and should be explicitly deactivated or investigated. Ignoring stale agents leads to a growing inventory of ghost entries that obscure the true fleet state.

## Common Causes

- Agent host has been permanently removed (decommissioned hardware, terminated cloud instance)
- Agent was replaced by a new instance but the old registration was not deactivated
- Infrastructure change (network re-architecture, datacenter migration) without cleanup
- Agent host is undergoing extended maintenance
- Network partition isolating the agent
- Agent process crash without auto-restart configured (systemd restart policy missing)

## How to Fix

### Docker Compose

```bash
# List all agent registrations with status
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent list --all

# Deactivate a stale agent
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent deactivate --agent-id <agent-id>
```

### Bare Metal / systemd

```bash
# Review stale agents
stella agent list --status stale

# Deactivate agents that are no longer needed
stella agent deactivate --agent-id <agent-id>

# If the agent should still be active, investigate the host
ssh <agent-host> 'systemctl status stella-agent'

# Check network connectivity from the agent host (-k skips TLS certificate verification)
ssh <agent-host> 'curl -k https://orchestrator:8443/health'

# Restart agent on the host
ssh <agent-host> 'sudo systemctl restart stella-agent'
```
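
If the agent went stale because its process crashed and systemd had no restart policy, a drop-in can add one. The following is a minimal sketch: the unit name `stella-agent.service` is inferred from the commands above, while the function name `write_restart_dropin`, the file name `restart.conf`, and the 10-second delay are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# write_restart_dropin [dir]
# Writes a systemd drop-in that restarts the service after a crash.
# The default directory assumes the unit is named stella-agent.service.
write_restart_dropin() {
  local dir="${1:-/etc/systemd/system/stella-agent.service.d}"
  mkdir -p "$dir"
  cat > "$dir/restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=10
EOF
}

# Usage on the agent host (as root), followed by a reload and restart:
#   write_restart_dropin
#   systemctl daemon-reload
#   systemctl restart stella-agent
```

`Restart=on-failure` restarts the service after an unclean exit but leaves it stopped after a deliberate `systemctl stop`, which suits planned maintenance windows.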

### Kubernetes / Helm

```bash
# Check for terminated or evicted agent pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops --field-selector=status.phase!=Running

# Deactivate stale agent registrations with the CLI
stella agent deactivate --agent-id <agent-id>

# If pod was evicted, check node status
kubectl get nodes
kubectl describe node <node-name> | grep -A5 Conditions
```
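
When pod churn or host turnover has left several stale registrations, deactivating them one at a time gets tedious. A minimal sketch of a batch wrapper follows. It uses only the `stella agent deactivate --agent-id` command shown above; the function name `deactivate_agents`, the `DRY_RUN` variable, and the input file `stale-agents.txt` are hypothetical.

```bash
#!/usr/bin/env bash
# deactivate_agents: reads agent IDs from stdin, one per line.
# DRY_RUN (hypothetical variable, default 1) prints the commands instead of running them.
deactivate_agents() {
  local id
  while IFS= read -r id; do
    # Skip blank lines and lines starting with '#'.
    case "$id" in
      ''|'#'*) continue ;;
    esac
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "stella agent deactivate --agent-id $id"
    else
      stella agent deactivate --agent-id "$id"
    fi
  done
}

# Usage: review the dry-run output first, then rerun with DRY_RUN=0.
#   deactivate_agents < stale-agents.txt
#   DRY_RUN=0 deactivate_agents < stale-agents.txt
```

Defaulting to a dry run means the ID list can be reviewed before any registration is actually deactivated.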

## Verification

```bash
stella doctor run --check check.agent.stale
```

## Related Checks

- `check.agent.heartbeat.freshness` -- short-term heartbeat staleness (minutes vs. hours/days)
- `check.agent.capacity` -- stale agents do not contribute to capacity
- `check.agent.certificate.expiry` -- long-offline agents likely have expired certificates