Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

@@ -0,0 +1,123 @@
---
checkId: check.docker.storage
plugin: stellaops.doctor.docker
severity: warn
tags: [docker, storage, disk]
---
# Docker Storage
## What It Checks
Validates Docker storage driver and disk space usage. The check connects to the Docker daemon and retrieves system information, then inspects:
| Condition | Result |
|---|---|
| Storage driver is not `overlay2`, `btrfs`, or `zfs` | `warn` — non-recommended driver |
| Free disk space on Docker root partition < **10 GB** (configurable via `Docker:MinFreeSpaceGb`) | `warn` |
| Disk usage > **85%** (configurable via `Docker:MaxStorageUsagePercent`) | `warn` |

The check reads the Docker root directory (typically `/var/lib/docker`) and queries drive info for that partition. On platforms where disk info is unavailable, the check still validates the storage driver.

Evidence collected includes: storage driver, Docker root directory, total space, free space, usage percentage, and whether the driver is recommended.
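
The conditions in the table can be approximated with a small shell function. This is only an illustrative sketch, not the plugin's actual implementation: the function name `classify_docker_storage` and its argument order are hypothetical, values are whole numbers, and the defaults mirror the documented thresholds (10 GB free, 85% used).

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the check's decision table; not the plugin's real code.
# Usage: classify_docker_storage <driver> <free_gb> <usage_percent> [min_free_gb] [max_usage_percent]
# Prints one "warn: ..." line per failed condition, or "pass" when none fail.
classify_docker_storage() {
  local driver=$1 free_gb=$2 usage_pct=$3
  local min_free_gb=${4:-10} max_usage_pct=${5:-85}
  local warnings=0

  # Only overlay2, btrfs, and zfs count as recommended drivers.
  case "$driver" in
    overlay2|btrfs|zfs) ;;
    *)
      echo "warn: non-recommended storage driver '$driver'"
      warnings=$((warnings + 1))
      ;;
  esac

  # Free space strictly below the minimum triggers a warning.
  if [ "$free_gb" -lt "$min_free_gb" ]; then
    echo "warn: free space ${free_gb} GB is below ${min_free_gb} GB"
    warnings=$((warnings + 1))
  fi

  # Usage strictly above the maximum triggers a warning.
  if [ "$usage_pct" -gt "$max_usage_pct" ]; then
    echo "warn: disk usage ${usage_pct}% exceeds ${max_usage_pct}%"
    warnings=$((warnings + 1))
  fi

  if [ "$warnings" -eq 0 ]; then
    echo "pass"
  fi
  return 0
}
```

On a live host, the inputs could come from `docker info --format '{{.Driver}}'`, `docker info --format '{{.DockerRootDir}}'`, and `df -Pk` run against that directory.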
## Why It Matters
Docker storage problems are a common cause of container deployment failures:
- **Non-recommended storage drivers** (e.g., `vfs`, `devicemapper`) have performance and reliability problems: `vfs` has no copy-on-write support, so every layer is a full copy, and `devicemapper` is deprecated. `overlay2` is the recommended driver for most workloads.
- **Low disk space** prevents image pulls, container creation, and volume writes. Docker images and layers consume significant space.
- **High disk usage** can cause container crashes, database corruption, and evidence write failures.

The Docker root directory often shares a partition with the OS, so storage exhaustion affects the entire host.
## Common Causes
- Storage driver is not `overlay2`, `btrfs`, or `zfs` (e.g., using legacy `devicemapper` or `vfs`)
- Low disk space on the Docker root partition (less than 10 GB free)
- Disk usage exceeds the 85% threshold
- Unused images, containers, and volumes consuming space
- Large build caches not pruned
## How to Fix
### Docker Compose
Check and clean Docker storage:
```bash
# Check disk usage
docker system df
# Detailed disk usage
docker system df -v
# Prune unused data (images, containers, networks, build cache)
docker system prune -a
# Prune volumes too (WARNING: removes data volumes)
docker system prune -a --volumes
# Check storage driver
docker info | grep "Storage Driver"
```
Configure storage thresholds:
```yaml
environment:
  Docker__MinFreeSpaceGb: "10"
  Docker__MaxStorageUsagePercent: "85"
```
### Bare Metal / systemd
Switch to the `overlay2` storage driver if it is not already in use:
```bash
# Check current driver
docker info | grep "Storage Driver"
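# Set the storage driver in /etc/docker/daemon.json, merging this key into any
# existing settings:
#   {
#     "storage-driver": "overlay2"
#   }
# Restart Docker (WARNING: images and containers created under the previous
# driver are no longer visible and may need to be re-pulled or recreated)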
sudo systemctl restart docker
```
Free up disk space:
```bash
# Find large Docker directories
du -sh /var/lib/docker/*
# Clean unused resources
docker system prune -a
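# Set up automatic weekly cleanup via cron, appending to root's existing crontab
# (piping a single line to `crontab -` would replace every existing entry)
(sudo crontab -l 2>/dev/null; echo "0 2 * * 0 docker system prune -f --filter 'until=168h'") | sudo crontab -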
```
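
An unconditional scheduled prune can be replaced by one that runs only when usage crosses the documented threshold. The sketch below is illustrative, not part of the Doctor plugin: `usage_percent` and `prune_if_above` are hypothetical helper names, and the 85% default simply mirrors `Docker:MaxStorageUsagePercent`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the usage percentage (integer, no % sign) of the
# filesystem that holds the given path, using POSIX `df -P` output.
usage_percent() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Hypothetical helper: prune only when usage is above the threshold.
# Set DRY_RUN=1 to print the prune command instead of running it.
prune_if_above() {
  local path=$1 threshold=${2:-85}
  local used
  used=$(usage_percent "$path")
  # Bail out if df produced no usable number for this path.
  [ -n "$used" ] || return 1

  if [ "$used" -gt "$threshold" ]; then
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would run: docker system prune -f --filter until=168h"
    else
      docker system prune -f --filter "until=168h"
    fi
  else
    echo "usage ${used}% is within ${threshold}%; nothing to prune"
  fi
}
```

A cron entry could then call a script that sources these functions and runs `prune_if_above /var/lib/docker 85`, with `DRY_RUN=1` as a first trial.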
### Kubernetes / Helm
Monitor node disk usage:
```bash
# Check node disk pressure
kubectl describe node <node> | grep -A 5 "Conditions"
# Check for DiskPressure condition
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[?(@.type=="DiskPressure")]}{.status}{"\n"}{end}{end}'
```
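
The jsonpath query prints one tab-separated `<node>` and `<status>` pair per line, so the nodes that actually report disk pressure can be filtered out with a short helper. This is a sketch; the function name `nodes_under_disk_pressure` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "<node>\t<status>" lines on stdin and print the
# names of nodes whose DiskPressure status is "True".
nodes_under_disk_pressure() {
  awk -F '\t' '$2 == "True" { print $1 }'
}

# Example (requires cluster access), piping in the DiskPressure query:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[?(@.type=="DiskPressure")]}{.status}{"\n"}{end}{end}' \
#     | nodes_under_disk_pressure
```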
Configure kubelet image garbage collection and eviction thresholds:
```yaml
# In kubelet config
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "15%"
```
## Verification
```bash
stella doctor run --check check.docker.storage
```
## Related Checks
- `check.core.env.diskspace` — checks general disk space (not Docker-specific)
- `check.docker.daemon` — daemon must be running to query storage info