---
checkId: check.agent.resource.utilization
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, resource, performance, capacity]
---
# Agent Resource Utilization

## What It Checks

Monitors CPU, memory, and disk utilization across the agent fleet. The check is designed to evaluate:

1. CPU utilization per agent
2. Memory utilization per agent
3. Disk space per agent (for task workspace, logs, and cached artifacts)
4. Resource usage trends (increasing/stable/decreasing)

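**Current status:** implementation pending -- the check always returns Pass with a placeholder message, so a Pass result says nothing about actual resource usage. The `CanRun` method always returns true, so the check always appears in results. Until the implementation lands, use the commands under How to Fix to inspect agents manually.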
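
While the check itself is a placeholder, the evaluation it describes can be approximated by hand. The following is a minimal sketch, not the check's actual logic: the function names, the 80/95 percent thresholds, and the 2-point trend tolerance are all assumptions chosen for illustration.

```bash
# Hypothetical thresholds; the real check's limits are not documented here.
WARN_PCT=80
FAIL_PCT=95

# classify_utilization PERCENT: print pass, warn, or fail for an integer percentage
classify_utilization() {
  local pct=$1
  if [ "$pct" -ge "$FAIL_PCT" ]; then
    echo fail
  elif [ "$pct" -ge "$WARN_PCT" ]; then
    echo warn
  else
    echo pass
  fi
}

# classify_trend OLD NEW [TOLERANCE]: compare two integer samples and print
# increasing, stable, or decreasing; changes within TOLERANCE count as stable
classify_trend() {
  local old=$1 new=$2 tol=${3:-2}
  local delta=$(( new - old ))
  if [ "$delta" -gt "$tol" ]; then
    echo increasing
  elif [ "$delta" -lt "$(( -tol ))" ]; then
    echo decreasing
  else
    echo stable
  fi
}
```

On a host with GNU coreutils, `classify_utilization "$(df --output=pcent /var/lib/stellaops | tail -1 | tr -dc '0-9')"` would classify disk usage of the agent data directory against these assumed thresholds.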

## Why It Matters

Agents that exhaust CPU, memory, or disk become unable to execute tasks reliably. CPU saturation causes task timeouts; memory exhaustion triggers OOM kills that look like intermittent crashes; disk exhaustion prevents artifact downloads and log writes. Proactive monitoring surfaces these conditions before they cascade into failures that affect deployment SLAs.

## Common Causes

- Agent running too many concurrent tasks for its resource allocation
- Disk filled by accumulated scan artifacts, logs, or cached images
- Memory leak in long-running agent process
- Noisy neighbor on shared infrastructure consuming resources
- Resource limits not configured (no cgroup/container memory cap)
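
Two of the causes above, disk accumulation and a memory leak, can be told apart with a quick look on the agent host. This is a rough sketch using only standard `du`, `sort`, and `ps`; the helper names and default values are hypothetical, and a single growth sample is only a hint, not proof of a leak.

```bash
# largest_dirs DIR [N]: print the N largest entries directly under DIR (sizes in KiB)
largest_dirs() {
  local dir=$1 n=${2:-5}
  du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}

# rss_growth_kb PID [SECONDS]: print the change in resident memory (KiB)
# of process PID over the given interval; returns 1 if the process is gone
rss_growth_kb() {
  local pid=$1 wait=${2:-60}
  local before after
  before=$(ps -o rss= -p "$pid" | tr -d ' ')
  sleep "$wait"
  after=$(ps -o rss= -p "$pid" | tr -d ' ')
  if [ -z "$before" ] || [ -z "$after" ]; then
    return 1
  fi
  echo $(( after - before ))
}
```

For example, `largest_dirs /var/lib/stellaops` shows which subdirectory dominates disk usage, and repeated `rss_growth_kb <pid> 300` samples that keep growing while the agent is idle would point toward a leak rather than load.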

## How to Fix

### Docker Compose

```bash
# Check agent container resource usage
docker stats --no-stream $(docker compose -f devops/compose/docker-compose.stella-ops.yml ps -q agent)

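# Set resource limits in a compose override file.
# Compose does not auto-load an override file when -f is given,
# so pass it explicitly with a second -f when recreating the agent.
# docker-compose.override.yml: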
#   services:
#     agent:
#       deploy:
#         resources:
#           limits:
#             cpus: '2.0'
#             memory: 4G

# Clean up old task artifacts
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent cleanup --older-than 7d
```

### Bare Metal / systemd

```bash
# Check resource usage
stella agent health <agent-id>

# View system resources on agent host
top -bn1 | head -20
df -h /var/lib/stellaops

# Clean up old task artifacts
stella agent cleanup --older-than 7d

# Adjust concurrent task limit
stella agent config --agent-id <agent-id> --set max_concurrent_tasks=4
```
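
When choosing a value for `max_concurrent_tasks`, one simple heuristic is to take the smaller of what the CPU and the memory budgets allow. The sketch below assumes you have measured a typical task's CPU (in millicores) and memory (in MiB) yourself; the function name and the per-task figures are hypothetical, not values published by the agent.

```bash
# suggest_max_tasks CPUS MEM_MIB CPU_MILLI_PER_TASK MEM_MIB_PER_TASK
# Prints the largest task count that fits both budgets, never less than 1.
suggest_max_tasks() {
  local cpus=$1 mem_mib=$2 cpu_milli_per_task=$3 mem_mib_per_task=$4
  local by_cpu=$(( cpus * 1000 / cpu_milli_per_task ))
  local by_mem=$(( mem_mib / mem_mib_per_task ))
  local n=$by_cpu
  if [ "$by_mem" -lt "$n" ]; then
    n=$by_mem
  fi
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}

# Example: 4 cores, 8 GiB RAM, tasks needing about 1 core and 1 GiB each
suggest_max_tasks 4 8192 1000 1024   # prints 4 (CPU is the limiting budget)
```

On Linux, `nproc` and the `MemTotal` line of `/proc/meminfo` can supply the first two arguments; leave headroom for the agent process itself before applying the result.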

### Kubernetes / Helm

```bash
# Check agent pod resource usage
kubectl top pods -l app.kubernetes.io/component=agent -n stellaops

# Set resource requests and limits in Helm values
# agent:
#   resources:
#     requests:
#       cpu: "500m"
#       memory: "1Gi"
#     limits:
#       cpu: "2000m"
#       memory: "4Gi"
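# Assumes the release lives in the same stellaops namespace used by the kubectl commands here
helm upgrade stellaops stellaops/stellaops -n stellaops -f values.yaml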

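# Check if pods are being OOM-killed (node-level events; may be empty on clusters that do not emit them)
kubectl get events -n stellaops --field-selector reason=OOMKilling

# Also check each agent container's last termination reason for OOMKilled
kubectl get pods -n stellaops -l app.kubernetes.io/component=agent \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'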
```

## Verification

```bash
stella doctor run --check check.agent.resource.utilization
```

## Related Checks

- `check.agent.capacity` -- resource exhaustion reduces effective capacity
- `check.agent.heartbeat.freshness` -- resource saturation can delay heartbeats
- `check.agent.task.backlog` -- high utilization combined with a backlog indicates a need to scale