Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| checkId | plugin | severity | tags |
|---|---|---|---|
| check.agent.resource.utilization | stellaops.doctor.agent | warn | |
Agent Resource Utilization
What It Checks
Monitors CPU, memory, and disk utilization across the agent fleet. The check is designed to verify:
- CPU utilization per agent
- Memory utilization per agent
- Disk space per agent (for task workspace, logs, and cached artifacts)
- Resource usage trends (increasing/stable/decreasing)
Current status: implementation pending -- the check always returns Pass with a placeholder message. The CanRun method always returns true, so the check will always appear in results.
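Because the check is still a placeholder, the snippet below is only an illustrative sketch of how the trend dimension (increasing/stable/decreasing) might be derived from a series of utilization samples. The function name `classify_trend` and the 5-point tolerance are assumptions made for this example and are not part of the Doctor plugin.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Doctor plugin): classify a utilization
# trend from percentage samples, one number per line on stdin, oldest first.
# $1 = tolerance in percentage points (assumed default: 5).
classify_trend() {
  awk -v tol="${1:-5}" '
    NF == 0 { next }                 # ignore blank lines
    n == 0  { first = $1 }           # remember the oldest sample
            { last = $1; n++ }       # track the newest sample
    END {
      if (n < 2) { print "unknown"; exit }
      delta = last - first
      if (delta > tol + 0)       print "increasing"
      else if (delta < -(tol + 0)) print "decreasing"
      else                       print "stable"
    }'
}

# Example: printf '40\n45\n62\n' | classify_trend
```

Comparing only the first and last sample keeps the sketch short; a real implementation would likely smooth over more points.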
Why It Matters
Agents that exhaust CPU, memory, or disk become unable to execute tasks reliably. CPU saturation causes task timeouts; memory exhaustion triggers OOM kills that look like intermittent crashes; disk exhaustion prevents artifact downloads and log writes. Proactive monitoring prevents these cascading failures before they impact deployment SLAs.
Common Causes
- Agent running too many concurrent tasks for its resource allocation
- Disk filled by accumulated scan artifacts, logs, or cached images
- Memory leak in long-running agent process
- Noisy neighbor on shared infrastructure consuming resources
- Resource limits not configured (no cgroup/container memory cap)
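The last cause in the list above can be checked from inside the agent's container or service. On hosts using cgroup v2, the file `memory.max` in the process's cgroup directory contains the literal string `max` when no memory cap is set. The sketch below assumes cgroup v2 and the default path `/sys/fs/cgroup/memory.max`; the function name `memory_cap_status` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether a cgroup v2 memory cap is configured.
# $1 = path to memory.max (assumed default: /sys/fs/cgroup/memory.max).
memory_cap_status() {
  file="${1:-/sys/fs/cgroup/memory.max}"
  if [ ! -r "$file" ]; then
    # Not readable: cgroup v1 host, different mount point, or no permission.
    echo "unknown"
    return 0
  fi
  value=$(cat "$file")
  if [ "$value" = "max" ]; then
    echo "unlimited"
  else
    echo "limited:$value"   # value is the cap in bytes
  fi
}

# Example: memory_cap_status
```

An `unlimited` result means a leak or burst in the agent process can consume host memory until the kernel OOM killer intervenes.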
How to Fix
Docker Compose
# Check agent container resource usage
docker stats --no-stream $(docker compose -f devops/compose/docker-compose.stella-ops.yml ps -q agent)
# Set resource limits in compose override
# docker-compose.override.yml:
# services:
#   agent:
#     deploy:
#       resources:
#         limits:
#           cpus: '2.0'
#           memory: 4G
# Clean up old task artifacts
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent cleanup --older-than 7d
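To turn the `docker stats` output above into a quick yes/no signal, the percentages can be filtered against a threshold. This is a minimal sketch: the function name `flag_busy_containers` and the default threshold of 80 are assumptions, and the input is expected to come from `docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}'`. Note that Docker reports CPU relative to a single core, so values above 100% are possible on multi-core hosts.

```shell
#!/bin/sh
# Hypothetical helper: print container names whose CPU or memory percentage
# exceeds a threshold. Reads "<name> <cpu%> <mem%>" lines on stdin.
# $1 = threshold in percent (assumed default: 80).
flag_busy_containers() {
  awk -v limit="${1:-80}" '
    {
      cpu = $2; mem = $3
      sub(/%$/, "", cpu)    # strip the trailing percent sign
      sub(/%$/, "", mem)
      if (cpu + 0 > limit + 0 || mem + 0 > limit + 0) print $1
    }'
}

# Example:
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}' \
#     | flag_busy_containers 80
```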
Bare Metal / systemd
# Check resource usage
stella agent health <agent-id>
# View system resources on agent host
top -bn1 | head -20
df -h /var/lib/stellaops
# Clean up old task artifacts
stella agent cleanup --older-than 7d
# Adjust concurrent task limit
stella agent config --agent-id <agent-id> --set max_concurrent_tasks=4
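The disk check above can be scripted by parsing the POSIX-format output of `df -P`. The sketch below is illustrative only: the function name `disk_status` and the 80/90 percent warn/fail thresholds are assumptions, not values taken from the Doctor check.

```shell
#!/bin/sh
# Hypothetical helper: classify disk usage from `df -P <path>` output on stdin.
# $1 = warn threshold, $2 = fail threshold (assumed defaults: 80 and 90).
disk_status() {
  awk -v warn="${1:-80}" -v fail="${2:-90}" '
    NR == 2 {                      # line 1 is the df header
      used = $5                    # "Capacity" column, e.g. "85%"
      sub(/%$/, "", used)
      if (used + 0 >= fail + 0)      print "fail " used
      else if (used + 0 >= warn + 0) print "warn " used
      else                           print "pass " used
    }'
}

# Example: df -P /var/lib/stellaops | disk_status 80 90
```

The `-P` flag keeps each filesystem on a single line, which makes the column positions predictable.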
Kubernetes / Helm
# Check agent pod resource usage
kubectl top pods -l app.kubernetes.io/component=agent -n stellaops
# Set resource requests and limits in Helm values
# agent:
#   resources:
#     requests:
#       cpu: "500m"
#       memory: "1Gi"
#     limits:
#       cpu: "2000m"
#       memory: "4Gi"
helm upgrade stellaops stellaops/stellaops -f values.yaml
# Check if pods are being OOM-killed
kubectl get events -n stellaops --field-selector reason=OOMKilling
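Pods that are close to their memory limit are the likeliest OOM-kill candidates. As a rough sketch, the output of `kubectl top pods --no-headers` can be compared against the configured limit. The function name `pods_near_memory_limit` and the 90 percent default are assumptions; the parser only understands `Mi` and `Gi` suffixes and skips anything else.

```shell
#!/bin/sh
# Hypothetical helper: list pods whose memory usage is at or above a given
# percentage of the memory limit. Reads "<pod> <cpu> <memory>" lines on stdin.
# $1 = memory limit in Mi, $2 = percentage (assumed default: 90).
pods_near_memory_limit() {
  awk -v limit_mi="$1" -v pct="${2:-90}" '
    {
      mem = $3
      if (mem ~ /Gi$/)      { sub(/Gi$/, "", mem); mem = mem * 1024 }
      else if (mem ~ /Mi$/) { sub(/Mi$/, "", mem); mem = mem + 0 }
      else next             # unknown unit: skip rather than guess
      if (mem * 100 >= limit_mi * pct) print $1, mem "Mi"
    }'
}

# Example (4Gi limit = 4096Mi):
#   kubectl top pods -l app.kubernetes.io/component=agent -n stellaops --no-headers \
#     | pods_near_memory_limit 4096 90
```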
Verification
stella doctor run --check check.agent.resource.utilization
Related Checks
- check.agent.capacity -- resource exhaustion reduces effective capacity
- check.agent.heartbeat.freshness -- resource saturation can delay heartbeats
- check.agent.task.backlog -- high utilization combined with backlog indicates need to scale