Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/agent/resource-utilization.md
+++ b/docs/doctor/articles/agent/resource-utilization.md
@@ -0,0 +1,103 @@
+---
+checkId: check.agent.resource.utilization
+plugin: stellaops.doctor.agent
+severity: warn
+tags: [agent, resource, performance, capacity]
+---
+# Agent Resource Utilization
+
+## What It Checks
+
+Monitors CPU, memory, and disk utilization across the agent fleet. The check is designed to verify:
+
+1. CPU utilization per agent
+2. Memory utilization per agent
+3. Disk space per agent (for task workspace, logs, and cached artifacts)
+4. Resource usage trends (increasing/stable/decreasing)
+
+**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true, so the check will always appear in results.
+
+## Why It Matters
+
+Agents that exhaust CPU, memory, or disk become unable to execute tasks reliably. CPU saturation causes task timeouts; memory exhaustion triggers OOM kills that look like intermittent crashes; disk exhaustion prevents artifact downloads and log writes. Proactive monitoring prevents these cascading failures before they impact deployment SLAs.
+
+## Common Causes
+
+- Agent running too many concurrent tasks for its resource allocation
+- Disk filled by accumulated scan artifacts, logs, or cached images
+- Memory leak in long-running agent process
+- Noisy neighbor on shared infrastructure consuming resources
+- Resource limits not configured (no cgroup/container memory cap)
+
+## How to Fix
+
+### Docker Compose
+
+```bash
+# Check agent container resource usage
+docker stats --no-stream $(docker compose -f devops/compose/docker-compose.stella-ops.yml ps -q agent)
+
+# Set resource limits in compose override
+# docker-compose.override.yml:
+#   services:
+#     agent:
+#       deploy:
+#         resources:
+#           limits:
+#             cpus: '2.0'
+#             memory: 4G
+
+# Clean up old task artifacts
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
+  stella agent cleanup --older-than 7d
+```
+
+### Bare Metal / systemd
+
+```bash
+# Check resource usage
+stella agent health <agent-id>
+
+# View system resources on agent host
+top -bn1 | head -20
+df -h /var/lib/stellaops
+
+# Clean up old task artifacts
+stella agent cleanup --older-than 7d
+
+# Adjust concurrent task limit
+stella agent config --agent-id <agent-id> --set max_concurrent_tasks=4
+```
+
+### Kubernetes / Helm
+
+```bash
+# Check agent pod resource usage
+kubectl top pods -l app.kubernetes.io/component=agent -n stellaops
+
+# Set resource requests and limits in Helm values
+# agent:
+#   resources:
+#     requests:
+#       cpu: "500m"
+#       memory: "1Gi"
+#     limits:
+#       cpu: "2000m"
+#       memory: "4Gi"
+helm upgrade stellaops stellaops/stellaops -f values.yaml
+
+# Check if pods are being OOM-killed
+kubectl get events -n stellaops --field-selector reason=OOMKilling
+```
+
+## Verification
+
+```bash
+stella doctor run --check check.agent.resource.utilization
+```
+
+## Related Checks
+
+- `check.agent.capacity` -- resource exhaustion reduces effective capacity
+- `check.agent.heartbeat.freshness` -- resource saturation can delay heartbeats
+- `check.agent.task.backlog` -- high utilization combined with backlog indicates need to scale