Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

@@ -0,0 +1,103 @@
---
checkId: check.agent.resource.utilization
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, resource, performance, capacity]
---
# Agent Resource Utilization
## What It Checks
Monitors CPU, memory, and disk utilization across the agent fleet. The check is designed to evaluate:
1. CPU utilization per agent
2. Memory utilization per agent
3. Disk space per agent (for task workspace, logs, and cached artifacts)
4. Resource usage trends (increasing/stable/decreasing)

**Current status:** implementation pending -- the check always returns Pass with a placeholder message, so none of the items above are evaluated yet. The `CanRun` method always returns true, so the check always appears in results.
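
Because the check itself is still a placeholder, the trend item in the list above can be approximated by hand. The following is a minimal sketch, not the Doctor implementation: `classify_trend` is a hypothetical helper, the samples are assumed to be integer percentages ordered oldest first, and the 5-point tolerance is an arbitrary value chosen for illustration.

```bash
# Hypothetical helper, not part of the stella CLI: classify a utilization
# trend from a series of integer percentage samples (oldest first).
# The 5-point tolerance is an assumed value chosen for illustration.
classify_trend() {
  trend_first=$1
  for trend_last in "$@"; do :; done   # walk to the newest sample
  trend_delta=$(( trend_last - trend_first ))
  if [ "$trend_delta" -gt 5 ]; then
    echo "increasing"
  elif [ "$trend_delta" -lt -5 ]; then
    echo "decreasing"
  else
    echo "stable"
  fi
}

classify_trend 40 45 62   # prints "increasing"
```

Comparing only the first and last sample keeps the sketch short; a real check would likely smooth over more samples to avoid reacting to a single spike.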
## Why It Matters
Agents that exhaust CPU, memory, or disk become unable to execute tasks reliably. CPU saturation causes task timeouts; memory exhaustion triggers OOM kills that look like intermittent crashes; disk exhaustion prevents artifact downloads and log writes. Proactive monitoring prevents these cascading failures before they impact deployment SLAs.
## Common Causes
- Agent running too many concurrent tasks for its resource allocation
- Disk filled by accumulated scan artifacts, logs, or cached images
- Memory leak in long-running agent process
- Noisy neighbor on shared infrastructure consuming resources
- Resource limits not configured (no cgroup/container memory cap)
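
The last cause, a missing memory cap, can be confirmed from inside the agent container. This sketch assumes a cgroup v2 host, where the `memory.max` file holds the literal string `max` when no limit applies. `memory_cap_status` is a hypothetical helper and the default path is an assumption about where the container's cgroup is mounted.

```bash
# Hypothetical helper: report whether a cgroup v2 memory cap is set.
# Reads the given memory.max file; cgroup v2 stores the literal "max"
# there when no limit applies.
memory_cap_status() {
  cap_file=${1:-/sys/fs/cgroup/memory.max}
  if [ ! -r "$cap_file" ]; then
    # File missing or unreadable (for example cgroup v1, or the host root cgroup)
    echo "unknown"
    return 0
  fi
  cap_value=$(cat "$cap_file")
  if [ "$cap_value" = "max" ]; then
    echo "unlimited"
  else
    echo "limited:$cap_value"
  fi
}

memory_cap_status
```

A result of `unlimited` means the agent can consume all memory on the host, which makes it a likely source of noisy-neighbor problems.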
## How to Fix
### Docker Compose
```bash
# Check agent container resource usage
docker stats --no-stream $(docker compose -f devops/compose/docker-compose.stella-ops.yml ps -q agent)
# Set resource limits in compose override
# docker-compose.override.yml:
#   services:
#     agent:
#       deploy:
#         resources:
#           limits:
#             cpus: '2.0'
#             memory: 4G
# Clean up old task artifacts
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent cleanup --older-than 7d
```
### Bare Metal / systemd
```bash
# Check resource usage
stella agent health <agent-id>
# View system resources on agent host
top -bn1 | head -20
df -h /var/lib/stellaops
# Clean up old task artifacts
stella agent cleanup --older-than 7d
# Adjust concurrent task limit
stella agent config --agent-id <agent-id> --set max_concurrent_tasks=4
```
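The `df` step above can be turned into a scripted threshold test, for example from cron on the agent host. This is a sketch under stated assumptions: `disk_usage_percent` and `disk_status` are hypothetical helpers, and the 85% default threshold is illustrative rather than a value the check defines.

```bash
# Hypothetical helpers: flag a filesystem that is above a usage threshold.
# disk_usage_percent parses POSIX `df -P` output from stdin and prints the
# capacity column of the first data row without the percent sign.
disk_usage_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_status USAGE [THRESHOLD] prints "warn" at or above the threshold.
# The default of 85 is an assumed value, not a Doctor default.
disk_status() {
  disk_usage=$1
  disk_threshold=${2:-85}
  if [ "$disk_usage" -ge "$disk_threshold" ]; then
    echo "warn"
  else
    echo "ok"
  fi
}

# Only query the workspace path when it exists on this host.
if [ -d /var/lib/stellaops ]; then
  disk_status "$(df -P /var/lib/stellaops | disk_usage_percent)"
fi
```

Reading `df -P` output from stdin keeps the parsing step testable against captured output without touching a real disk.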
### Kubernetes / Helm
```bash
# Check agent pod resource usage
kubectl top pods -l app.kubernetes.io/component=agent -n stellaops
# Set resource requests and limits in Helm values
# agent:
#   resources:
#     requests:
#       cpu: "500m"
#       memory: "1Gi"
#     limits:
#       cpu: "2000m"
#       memory: "4Gi"
helm upgrade stellaops stellaops/stellaops -f values.yaml
# Check if pods are being OOM-killed
kubectl get events -n stellaops --field-selector reason=OOMKilling
```
## Verification
```bash
stella doctor run --check check.agent.resource.utilization
```
## Related Checks
- `check.agent.capacity` -- resource exhaustion reduces effective capacity
- `check.agent.heartbeat.freshness` -- resource saturation can delay heartbeats
- `check.agent.task.backlog` -- high utilization combined with backlog indicates need to scale