git.stella-ops.org/docs/doctor/articles/agent/resource-utilization.md

| checkId | plugin | severity | tags |
| --- | --- | --- | --- |
| `check.agent.resource.utilization` | `stellaops.doctor.agent` | `warn` | `agent`, `resource`, `performance`, `capacity` |

# Agent Resource Utilization

## What It Checks

Monitors CPU, memory, and disk utilization across the agent fleet. The check is designed to verify:

  1. CPU utilization per agent
  2. Memory utilization per agent
  3. Disk space per agent (for task workspace, logs, and cached artifacts)
  4. Resource usage trends (increasing/stable/decreasing)

**Current status:** implementation pending. The check currently always returns `Pass` with a placeholder message, and its `CanRun` method always returns `true`, so the check appears in every run's results.
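
Once implemented, the check will presumably map utilization percentages to severities. As an illustration only — the thresholds below are assumptions, not the shipped values — the classification could look like:

```shell
# Hypothetical severity thresholds -- the shipped check may use different values.
classify_utilization() {
  pct=$1
  if [ "$pct" -ge 90 ]; then echo crit
  elif [ "$pct" -ge 75 ]; then echo warn
  else echo ok
  fi
}

classify_utilization 82   # warn
```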

## Why It Matters

Agents that exhaust CPU, memory, or disk become unable to execute tasks reliably. CPU saturation causes task timeouts; memory exhaustion triggers OOM kills that look like intermittent crashes; disk exhaustion prevents artifact downloads and log writes. Proactive monitoring prevents these cascading failures before they impact deployment SLAs.

## Common Causes

  • Agent running too many concurrent tasks for its resource allocation
  • Disk filled by accumulated scan artifacts, logs, or cached images
  • Memory leak in long-running agent process
  • Noisy neighbor on shared infrastructure consuming resources
  • Resource limits not configured (no cgroup/container memory cap)
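
To see which of these causes applies on a given host, a quick disk breakdown helps. The path below is an assumed default for the agent data directory; substitute your install's workspace path:

```shell
# Show the largest consumers under the agent data directory
# (/var/lib/stellaops is an assumed default path -- adjust for your install).
du -xh --max-depth=1 /var/lib/stellaops 2>/dev/null | sort -rh | head -10
```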

## How to Fix

### Docker Compose

```bash
# Check agent container resource usage
docker stats --no-stream $(docker compose -f devops/compose/docker-compose.stella-ops.yml ps -q agent)

# Set resource limits in a compose override
# docker-compose.override.yml:
#   services:
#     agent:
#       deploy:
#         resources:
#           limits:
#             cpus: '2.0'
#             memory: 4G

# Clean up old task artifacts
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent cleanup --older-than 7d
```

### Bare Metal / systemd

```bash
# Check resource usage
stella agent health <agent-id>

# View system resources on the agent host
top -bn1 | head -20
df -h /var/lib/stellaops

# Clean up old task artifacts
stella agent cleanup --older-than 7d

# Adjust the concurrent task limit
stella agent config --agent-id <agent-id> --set max_concurrent_tasks=4
```
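
On systemd-managed hosts, cgroup caps can be applied at the unit level with a drop-in, so a runaway agent cannot starve the host. The unit name `stellaops-agent` below is an assumption; verify it against your install:

```shell
# Cap agent resources via a systemd drop-in.
# "stellaops-agent" is an assumed unit name -- confirm with `systemctl list-units`.
sudo systemctl edit stellaops-agent
# In the drop-in, add:
#   [Service]
#   MemoryMax=4G
#   CPUQuota=200%
sudo systemctl daemon-reload
sudo systemctl restart stellaops-agent
```

`MemoryMax=` and `CPUQuota=` are standard `systemd.resource-control` directives; the 4G/200% values mirror the Docker Compose example above and should be tuned to the host.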

### Kubernetes / Helm

```bash
# Check agent pod resource usage
kubectl top pods -l app.kubernetes.io/component=agent -n stellaops

# Set resource requests and limits in Helm values
# agent:
#   resources:
#     requests:
#       cpu: "500m"
#       memory: "1Gi"
#     limits:
#       cpu: "2000m"
#       memory: "4Gi"
helm upgrade stellaops stellaops/stellaops -f values.yaml

# Check whether pods are being OOM-killed
kubectl get events -n stellaops --field-selector reason=OOMKilling
```

## Verification

```bash
stella doctor run --check check.agent.resource.utilization
```

## Related Checks

- `check.agent.capacity` -- resource exhaustion reduces effective capacity
- `check.agent.heartbeat.freshness` -- resource saturation can delay heartbeats
- `check.agent.task.backlog` -- high utilization combined with a backlog indicates a need to scale