
master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00

3.0 KiB



Agent Resource Utilization

What It Checks

Monitors CPU, memory, and disk utilization across the agent fleet. The check is designed to evaluate:

CPU utilization per agent
Memory utilization per agent
Disk space per agent (for task workspace, logs, and cached artifacts)
Resource usage trends (increasing/stable/decreasing)

Current status: implementation pending. The check currently returns Pass with a placeholder message regardless of actual utilization. Because the CanRun method always returns true, the check appears in every set of results.
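
Since the check itself is still a placeholder, the evaluation it describes can only be sketched. The snippet below is a minimal illustration of threshold and trend classification, not the Doctor plugin's implementation. The function names, the 80/95 percent thresholds, and the 5-point trend tolerance are all hypothetical.

```shell
#!/bin/sh
# Hypothetical thresholds in percent; not taken from the Doctor plugin.
WARN_PCT=80
FAIL_PCT=95

# classify_utilization PERCENT
# Prints pass, warn, or fail for an integer utilization percentage.
classify_utilization() {
  pct=$1
  if [ "$pct" -ge "$FAIL_PCT" ]; then
    echo fail
  elif [ "$pct" -ge "$WARN_PCT" ]; then
    echo warn
  else
    echo pass
  fi
}

# classify_trend FIRST LAST [TOLERANCE]
# Compares the oldest and newest sample (integer percentages) and prints
# increasing, stable, or decreasing. TOLERANCE defaults to 5 points.
classify_trend() {
  first=$1
  last=$2
  tol=${3:-5}
  delta=$((last - first))
  if [ "$delta" -gt "$tol" ]; then
    echo increasing
  elif [ "$delta" -lt "$((0 - tol))" ]; then
    echo decreasing
  else
    echo stable
  fi
}
```

A real implementation would likely apply separate thresholds for CPU, memory, and disk, and use more than two samples to judge a trend.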

Why It Matters

Agents that exhaust CPU, memory, or disk become unable to execute tasks reliably. CPU saturation causes task timeouts; memory exhaustion triggers OOM kills that look like intermittent crashes; disk exhaustion prevents artifact downloads and log writes. Monitoring utilization proactively surfaces these conditions before they cascade and affect deployment SLAs.

Common Causes

Agent running too many concurrent tasks for its resource allocation
Disk filled by accumulated scan artifacts, logs, or cached images
Memory leak in long-running agent process
Noisy neighbor on shared infrastructure consuming resources
Resource limits not configured (no cgroup/container memory cap)
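
The last cause above (no memory cap) can be confirmed from inside the agent's container or service. The sketch below assumes a cgroup v2 host, where the file `memory.max` contains the literal string `max` when no limit is set. The function name and the default path are illustrative assumptions, not part of the stella CLI.

```shell
#!/bin/sh
# has_memory_limit [FILE]
# Returns 0 if the cgroup v2 memory.max file holds a numeric limit,
# 1 if it holds "max" (no limit), and 2 if the file cannot be read.
# Assumption: cgroup v2 with the controller file at /sys/fs/cgroup/memory.max.
has_memory_limit() {
  limit_file=${1:-/sys/fs/cgroup/memory.max}
  [ -r "$limit_file" ] || return 2
  limit=$(cat "$limit_file")
  [ "$limit" != "max" ]
}

# Example use (commented out so the snippet has no side effects):
# if has_memory_limit; then echo "memory cap configured"; else echo "no memory cap"; fi
```

On cgroup v1 hosts the file layout differs, so this sketch would report the file as unreadable rather than give a wrong answer.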

How to Fix

Docker Compose

# Check agent container resource usage
docker stats --no-stream $(docker compose -f devops/compose/docker-compose.stella-ops.yml ps -q agent)

# Set resource limits in compose override
# docker-compose.override.yml:
#   services:
#     agent:
#       deploy:
#         resources:
#           limits:
#             cpus: '2.0'
#             memory: 4G

# Clean up old task artifacts
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent cleanup --older-than 7d
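
To turn the `docker stats` output above into a quick pass/fail signal, the numbers can be filtered against a threshold. This is a sketch under stated assumptions: the helper name `flag_high_memory` and the 85 percent threshold are hypothetical, and the input is expected in the form produced by `docker stats --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}'`.

```shell
#!/bin/sh
# flag_high_memory THRESHOLD
# Reads "NAME CPU% MEM%" lines on stdin and prints the names of containers
# whose memory percentage is above THRESHOLD.
flag_high_memory() {
  awk -v limit="$1" '{
    mem = $3
    sub(/%$/, "", mem)            # strip the trailing percent sign
    if (mem + 0 > limit + 0) print $1
  }'
}

# Example use (commented out; requires a running Docker daemon):
# docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}' \
#   $(docker compose -f devops/compose/docker-compose.stella-ops.yml ps -q agent) \
#   | flag_high_memory 85
```

Note that Docker reports the memory percentage relative to the container limit, or relative to host memory when no limit is set, so the result is most meaningful once limits are configured.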

Bare Metal / systemd

# Check resource usage
stella agent health <agent-id>

# View system resources on agent host
top -bn1 | head -20
df -h /var/lib/stellaops

# Clean up old task artifacts
stella agent cleanup --older-than 7d

# Adjust concurrent task limit
stella agent config --agent-id <agent-id> --set max_concurrent_tasks=4
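
The `df` check above can be scripted so that a cleanup is only triggered when the disk is actually filling up. The following is a minimal sketch: the helper name `disk_over_threshold` and the 85 percent threshold are hypothetical, and it assumes POSIX-format output (`df -P`) with mount points that contain no spaces.

```shell
#!/bin/sh
# disk_over_threshold THRESHOLD
# Reads `df -P` output on stdin and prints "MOUNTPOINT PCT%" for every
# filesystem whose used capacity is at or above THRESHOLD percent.
disk_over_threshold() {
  awk -v limit="$1" 'NR > 1 {
    pct = $5
    sub(/%$/, "", pct)            # the Capacity column looks like "90%"
    if (pct + 0 >= limit + 0) print $6 " " pct "%"
  }'
}

# Example use (commented out so the snippet has no side effects):
# if [ -n "$(df -P /var/lib/stellaops | disk_over_threshold 85)" ]; then
#   stella agent cleanup --older-than 7d
# fi
```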

Kubernetes / Helm

# Check agent pod resource usage
kubectl top pods -l app.kubernetes.io/component=agent -n stellaops

# Set resource requests and limits in Helm values
# agent:
#   resources:
#     requests:
#       cpu: "500m"
#       memory: "1Gi"
#     limits:
#       cpu: "2000m"
#       memory: "4Gi"
helm upgrade stellaops stellaops/stellaops -f values.yaml

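# Check for OOM kill events (node-level events may be recorded outside this namespace)
kubectl get events -n stellaops --field-selector reason=OOMKilling

# Check whether any agent container was last terminated with reason OOMKilled
kubectl get pods -n stellaops -l app.kubernetes.io/component=agent \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'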
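
To spot pods approaching their memory limit before they are OOM-killed, the `kubectl top` output can be compared against the configured limit. This is a rough sketch, not a replacement for proper metrics alerting. The helper name `pods_near_memory_limit` is hypothetical, and it assumes memory is reported in `Mi`, skipping lines in any other unit.

```shell
#!/bin/sh
# pods_near_memory_limit LIMIT_MI PCT
# Reads `kubectl top pods --no-headers` lines ("NAME CPU MEMORY") on stdin
# and prints pods whose memory use is at least PCT percent of LIMIT_MI.
# Assumption: the memory column is expressed in Mi (for example "3500Mi").
pods_near_memory_limit() {
  awk -v limit="$1" -v pct="$2" '
    $3 ~ /Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)
      if (mem * 100 >= limit * pct) print $1
    }'
}

# Example use (commented out; requires cluster access and metrics-server):
# kubectl top pods -l app.kubernetes.io/component=agent -n stellaops --no-headers \
#   | pods_near_memory_limit 4096 80
```

With the 4Gi limit from the Helm values above, a limit of 4096 Mi and a percentage of 80 would flag any pod using roughly 3.2 Gi or more.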

Verification

stella doctor run --check check.agent.resource.utilization

Related Checks

check.agent.capacity -- resource exhaustion reduces effective capacity
check.agent.heartbeat.freshness -- resource saturation can delay heartbeats
check.agent.task.backlog -- high utilization combined with backlog indicates need to scale
