Doctor plugin checks: implement health check classes and documentation
Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docs/doctor/articles/docker/storage.md
@@ -0,0 +1,123 @@
---
checkId: check.docker.storage
plugin: stellaops.doctor.docker
severity: warn
tags: [docker, storage, disk]
---
# Docker Storage

## What It Checks
Validates the Docker storage driver and disk space usage. The check connects to the Docker daemon, retrieves system information, and evaluates the following conditions:

| Condition | Result |
|---|---|
| Storage driver is not `overlay2`, `btrfs`, or `zfs` | `warn` (non-recommended driver) |
| Free disk space on the Docker root partition < **10 GB** (configurable via `Docker:MinFreeSpaceGb`) | `warn` |
| Disk usage > **85%** (configurable via `Docker:MaxStorageUsagePercent`) | `warn` |

The check reads the Docker root directory (typically `/var/lib/docker`) and queries drive info for that partition. On platforms where disk info is unavailable, the check still validates the storage driver.

Evidence collected includes: storage driver, Docker root directory, total space, free space, usage percentage, and whether the driver is recommended.
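
The three conditions in the table can be sketched as a small shell function. This is an illustration only, not the plugin's implementation: the function name `classify_docker_storage`, its argument order, and the output strings are hypothetical. Only the recommended drivers and the default thresholds (10 GB, 85%) come from this article.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the three conditions this check evaluates.
# Usage: classify_docker_storage DRIVER FREE_GB USAGE_PCT [MIN_FREE_GB] [MAX_USAGE_PCT]
# FREE_GB and USAGE_PCT must be whole numbers (bash arithmetic is integer-only).
classify_docker_storage() {
  local driver=$1 free_gb=$2 usage_pct=$3
  local min_free_gb=${4:-10} max_usage_pct=${5:-85}
  local status=pass

  # Condition 1: driver outside the recommended set
  case "$driver" in
    overlay2|btrfs|zfs) ;;
    *) echo "warn: storage driver '$driver' is not recommended"; status=warn ;;
  esac

  # Condition 2: free space below the minimum
  if (( free_gb < min_free_gb )); then
    echo "warn: ${free_gb} GB free is below the ${min_free_gb} GB minimum"
    status=warn
  fi

  # Condition 3: usage above the maximum
  if (( usage_pct > max_usage_pct )); then
    echo "warn: disk usage ${usage_pct}% exceeds ${max_usage_pct}%"
    status=warn
  fi

  if [[ $status == pass ]]; then
    echo "pass"
  fi
  return 0
}

classify_docker_storage overlay2 42 61   # prints "pass"
classify_docker_storage vfs 4 91         # prints three "warn: ..." lines
```

On a real host, the driver argument could come from `docker info --format '{{.Driver}}'`, and the disk figures from `df` on the directory reported by `docker info --format '{{.DockerRootDir}}'`.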

## Why It Matters
Docker storage issues are a common cause of container deployment failures:

- **Non-recommended storage drivers** (e.g., `vfs`, `devicemapper`) have performance and reliability problems. `overlay2` is the recommended driver for most workloads.
- **Low disk space** prevents image pulls, container creation, and volume writes. Docker images and layers consume significant space.
- **High disk usage** can cause container crashes, database corruption, and evidence write failures.

The Docker root directory often shares a partition with the OS, so exhausting it can affect the entire host.

## Common Causes
- Storage driver is not `overlay2`, `btrfs`, or `zfs` (e.g., using legacy `devicemapper` or `vfs`)
- Low disk space on the Docker root partition (less than 10 GB free)
- Disk usage exceeds the 85% threshold
- Unused images, containers, and volumes consuming space
- Large build caches not pruned

## How to Fix

### Docker Compose
Check and clean Docker storage:

```bash
# Check disk usage
docker system df

# Detailed disk usage
docker system df -v

# Prune unused data (stopped containers, unused networks, all unused images, build cache)
docker system prune -a

# Prune volumes too (WARNING: deletes unused volumes and the data in them)
docker system prune -a --volumes

# Check storage driver
docker info | grep "Storage Driver"
```

Configure storage thresholds:
```yaml
environment:
  Docker__MinFreeSpaceGb: "10"
  Docker__MaxStorageUsagePercent: "85"
```
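
The environment variable names above differ from the `Docker:MinFreeSpaceGb` form used in the table only in the separator. Assuming Doctor follows the common convention where a double underscore in an environment variable name stands for the `:` section separator (an assumption; this article does not state it), the mapping can be illustrated with a hypothetical helper, `env_to_config_key`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate an environment variable name into the
# colon-separated configuration key, assuming "__" stands for ":".
env_to_config_key() {
  local name=$1
  echo "${name//__/:}"
}

env_to_config_key Docker__MinFreeSpaceGb          # prints Docker:MinFreeSpaceGb
env_to_config_key Docker__MaxStorageUsagePercent  # prints Docker:MaxStorageUsagePercent
```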

### Bare Metal / systemd
Switch to the `overlay2` storage driver if it is not already in use:

```bash
# Check current driver
docker info | grep "Storage Driver"

# Set overlay2 in /etc/docker/daemon.json.
# NOTE: tee overwrites the file; if daemon.json already exists, add the key by hand instead.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "storage-driver": "overlay2"
}
EOF

# Restart Docker (WARNING: images and containers created under the old driver
# are not visible under the new one and must be re-pulled or re-created)
sudo systemctl restart docker
```

Free up disk space:
```bash
# Find large Docker directories (the glob must expand as root, hence sh -c)
sudo sh -c 'du -sh /var/lib/docker/*'

# Clean unused resources
docker system prune -a

# Set up automatic weekly cleanup via cron (appends to root's existing crontab
# instead of replacing it)
( sudo crontab -l 2>/dev/null; echo "0 2 * * 0 docker system prune -f --filter 'until=168h'" ) | sudo crontab -
```
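
A scheduled prune can also be gated so that it runs only when the partition is actually over the threshold. The following is a minimal sketch under stated assumptions: the function names and the `APPLY` variable are hypothetical, the 85% default mirrors this check's threshold, and the script only reports what it would do unless `APPLY=1` is set.

```bash
#!/usr/bin/env bash
# Hypothetical threshold-gated cleanup; names and the APPLY switch are illustrative.

# Print the usage percentage (integer) of the filesystem that holds the given path.
partition_usage_percent() {
  # -P gives one POSIX-format line per filesystem; field 5 is capacity, e.g. "42%".
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Prune only when usage of ROOT's partition exceeds MAX_PCT (default 85).
prune_if_needed() {
  local root=$1 max_pct=${2:-85} pct
  pct=$(partition_usage_percent "$root")
  if (( pct > max_pct )); then
    if [[ ${APPLY:-0} == 1 ]]; then
      docker system prune -f --filter 'until=168h'
    else
      echo "dry run: usage ${pct}% > ${max_pct}%, would run docker system prune"
    fi
  else
    echo "usage ${pct}% is within the ${max_pct}% limit"
  fi
}

# Example: check the Docker root directory when the daemon is reachable.
if docker_root=$(docker info --format '{{.DockerRootDir}}' 2>/dev/null) && [[ -d $docker_root ]]; then
  prune_if_needed "$docker_root"
fi
```

Saved as a script, this could be referenced from the cron entry with `APPLY=1` in place of the unconditional `docker system prune` call.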

### Kubernetes / Helm
Monitor node disk usage:

```bash
# Check node disk pressure
kubectl describe node <node> | grep -A 5 "Conditions"

# Check for DiskPressure condition
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[?(@.type=="DiskPressure")]}{.status}{"\n"}{end}{end}'
```

Configure kubelet image garbage collection and eviction thresholds:
```yaml
# In kubelet config
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "15%"
```

## Verification
```bash
stella doctor run --check check.docker.storage
```

## Related Checks
- `check.core.env.diskspace`: checks general disk space (not Docker-specific)
- `check.docker.daemon`: the daemon must be running to query storage info