---
checkId: check.agent.resource.utilization
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, resource, performance, capacity]
---

# Agent Resource Utilization

## What It Checks

Monitors CPU, memory, and disk utilization across the agent fleet. The check is designed to verify:

1. CPU utilization per agent
2. Memory utilization per agent
3. Disk space per agent (for task workspace, logs, and cached artifacts)
4. Resource usage trends (increasing/stable/decreasing)

**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true, so the check will always appear in results.

## Why It Matters

Agents that exhaust CPU, memory, or disk become unable to execute tasks reliably. CPU saturation causes task timeouts; memory exhaustion triggers OOM kills that look like intermittent crashes; disk exhaustion prevents artifact downloads and log writes. Proactive monitoring prevents these cascading failures before they impact deployment SLAs.

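
Because the check itself is still a placeholder, the following is only a rough sketch of how the per-agent thresholds and trend classification listed under "What It Checks" might be evaluated by hand on an agent host. The threshold values, the tolerance, and all function names are assumptions made for illustration; none of them come from the Stella Ops implementation.

```bash
# Hypothetical thresholds in percent; not taken from the Stella Ops implementation.
WARN_PCT=80
FAIL_PCT=95
TREND_TOLERANCE=5   # change (in percentage points) below which a trend counts as stable

# Print the used-space percentage of the filesystem that holds path $1.
# Relies on the POSIX `df -P` layout, where column 5 is the capacity.
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Map an integer utilization percentage to pass, warn, or fail.
classify_utilization() {
  if [ "$1" -ge "$FAIL_PCT" ]; then
    echo fail
  elif [ "$1" -ge "$WARN_PCT" ]; then
    echo warn
  else
    echo pass
  fi
}

# Classify a series of integer samples, oldest first, by comparing
# the first sample with the last one.
classify_trend() {
  first=$1
  for last in "$@"; do :; done   # leaves $last set to the final argument
  delta=$((last - first))
  if [ "$delta" -gt "$TREND_TOLERANCE" ]; then
    echo increasing
  elif [ "$delta" -lt "-$TREND_TOLERANCE" ]; then
    echo decreasing
  else
    echo stable
  fi
}
```

For example, `classify_utilization "$(disk_used_pct /var/lib/stellaops)"` would grade the disk that backs the agent workspace, and `classify_trend 40 55 71` prints `increasing`. Comparing only the first and last samples is a deliberately crude choice; a real implementation would more likely smooth over a window of samples.
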
## Common Causes

- Agent running too many concurrent tasks for its resource allocation
- Disk filled by accumulated scan artifacts, logs, or cached images
- Memory leak in long-running agent process
- Noisy neighbor on shared infrastructure consuming resources
- Resource limits not configured (no cgroup/container memory cap)

## How to Fix

### Docker Compose

```bash
# Check agent container resource usage
docker stats --no-stream $(docker compose -f devops/compose/docker-compose.stella-ops.yml ps -q agent)

# Set resource limits in compose override
# docker-compose.override.yml:
#   services:
#     agent:
#       deploy:
#         resources:
#           limits:
#             cpus: '2.0'
#             memory: 4G

# Clean up old task artifacts
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent cleanup --older-than 7d
```

### Bare Metal / systemd

```bash
# Check resource usage
stella agent health

# View system resources on agent host
top -bn1 | head -20
df -h /var/lib/stellaops

# Clean up old task artifacts
stella agent cleanup --older-than 7d

# Adjust concurrent task limit
stella agent config --agent-id --set max_concurrent_tasks=4
```

### Kubernetes / Helm

```bash
# Check agent pod resource usage
kubectl top pods -l app.kubernetes.io/component=agent -n stellaops

# Set resource requests and limits in Helm values
# agent:
#   resources:
#     requests:
#       cpu: "500m"
#       memory: "1Gi"
#     limits:
#       cpu: "2000m"
#       memory: "4Gi"
helm upgrade stellaops stellaops/stellaops -f values.yaml

# Check if pods are being OOM-killed
kubectl get events -n stellaops --field-selector reason=OOMKilling
```

## Verification

```
stella doctor run --check check.agent.resource.utilization
```

## Related Checks

- `check.agent.capacity` -- resource exhaustion reduces effective capacity
- `check.agent.heartbeat.freshness` -- resource saturation can delay heartbeats
- `check.agent.task.backlog` -- high utilization combined with backlog indicates need to scale