Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

@@ -0,0 +1,94 @@
---
checkId: check.docker.apiversion
plugin: stellaops.doctor.docker
severity: warn
tags: [docker, api, compatibility]
---
# Docker API Version
## What It Checks
Validates that the Docker API version meets minimum requirements for Stella Ops. The check connects to the Docker daemon (using `Docker:Host` configuration or the platform default) and queries the API version via `System.GetVersionAsync()`.
| API Version | Result |
|---|---|
| Below **1.41** | `warn` (below minimum required) |
| At least **1.41** but below **1.43** | `warn` (below recommended) |
| **1.43** or higher | `pass` |
The minimum API version 1.41 corresponds to Docker Engine 20.10. The recommended version 1.43 corresponds to Docker Engine 24.0 (Docker Engine 23.0 reports API 1.42).
Evidence collected includes: API version, Docker version, minimum required version, recommended version, OS, build time, and git commit.
Default Docker host:
- **Linux**: `unix:///var/run/docker.sock`
- **Windows**: `npipe://./pipe/docker_engine`
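As a minimal sketch of the thresholds in the table above (not the plugin's actual implementation), the classification can be expressed in shell. The function names `version_lt` and `classify_api_version` are hypothetical. The comparison is numeric per component, so `1.9` correctly sorts below `1.41`, which a plain string comparison would get wrong.

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the documented thresholds, not the Doctor plugin's code.
MIN_API="1.41"          # minimum required API version (from the table)
RECOMMENDED_API="1.43"  # recommended API version (from the table)

# version_lt A B: succeeds when MAJOR.MINOR version A is strictly lower than B.
version_lt() {
  local a_major=${1%%.*} a_minor=${1#*.}
  local b_major=${2%%.*} b_minor=${2#*.}
  (( a_major < b_major || (a_major == b_major && a_minor < b_minor) ))
}

# classify_api_version VERSION: prints the result the table describes.
classify_api_version() {
  if version_lt "$1" "$MIN_API"; then
    echo "warn: below minimum required"
  elif version_lt "$1" "$RECOMMENDED_API"; then
    echo "warn: below recommended"
  else
    echo "pass"
  fi
}

# Usage against a live daemon:
#   classify_api_version "$(docker version --format '{{.Server.APIVersion}}')"
```

The helper assumes a two-component `MAJOR.MINOR` version string, which is the form the Docker API reports.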
## Why It Matters
Stella Ops uses Docker API features for container management, image inspection, and network configuration. Older API versions may not support required features such as:
- BuildKit-based image builds (API 1.39+).
- Multi-platform image inspection (API 1.41+).
- Container resource management improvements (API 1.43+).
Running an outdated Docker version also means missing security patches and bug fixes.
## Common Causes
- Docker Engine is outdated (version < 20.10)
- Docker Engine is functional but below the recommended version (< 24.0)
- Using a Docker-compatible runtime (Podman, containerd) that reports a lower API version
- Docker not updated after OS upgrade
## How to Fix
### Docker Compose
Update Docker Engine to the latest stable version:
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
# RHEL/CentOS
sudo yum update docker-ce docker-ce-cli containerd.io
# Verify version
docker version
```
### Bare Metal / systemd
```bash
# Check current version
docker version
# Update Docker
curl -fsSL https://get.docker.com | sh
# Restart Docker
sudo systemctl restart docker
# Verify
docker version
```
### Kubernetes / Helm
Update the container runtime on cluster nodes. The method depends on your Kubernetes distribution:
```bash
# Check node runtime version
kubectl get nodes -o wide
# For kubeadm clusters, update containerd on each node
sudo apt-get update && sudo apt-get install containerd.io
# Verify
sudo crictl version
```
## Verification
```
stella doctor run --check check.docker.apiversion
```
## Related Checks
- `check.docker.daemon` verifies Docker daemon is running (prerequisite for version check)
- `check.docker.socket` verifies Docker socket is accessible

@@ -0,0 +1,124 @@
---
checkId: check.docker.daemon
plugin: stellaops.doctor.docker
severity: fail
tags: [docker, daemon, container]
---
# Docker Daemon
## What It Checks
Validates that the Docker daemon is running and responsive. The check connects to the Docker daemon (using `Docker:Host` configuration or the platform default) and performs two operations:
1. **Ping**: Sends a ping request to verify the daemon is alive (with a configurable timeout, default 10 seconds via `Docker:TimeoutSeconds`).
2. **Version**: Retrieves version information to confirm the daemon is fully operational.
Evidence collected on success: host address, Docker version, API version, OS, architecture, and kernel version.
On failure, the check distinguishes between:
- **DockerApiException**: The daemon is running but returned an error (reports status code and response body).
- **Connection failure**: Cannot connect to the daemon at all (Docker not installed, not running, or socket inaccessible).
Default Docker host:
- **Linux**: `unix:///var/run/docker.sock`
- **Windows**: `npipe://./pipe/docker_engine`
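The ping-then-version sequence and the two failure classes can be approximated from a shell on Linux. This is a rough sketch using `curl` against the Docker Engine API endpoints `/_ping` and `/version`; the function name `probe_daemon`, its messages, and its return codes are hypothetical and are not what the check itself emits.

```bash
#!/usr/bin/env bash
# Sketch only: approximates the ping-then-version sequence with curl.
# probe_daemon and its return codes are hypothetical, not part of Stella Ops.

# probe_daemon [SOCKET] [TIMEOUT_SECONDS]
probe_daemon() {
  local sock=${1:-/var/run/docker.sock} timeout=${2:-10} status

  # Step 1: ping. curl exits non-zero when it cannot connect at all.
  status=$(curl --silent --output /dev/null --write-out '%{http_code}' \
    --max-time "$timeout" --unix-socket "$sock" http://localhost/_ping) || {
    echo "connection failure"
    return 2
  }

  # The daemon answered, but with an error status.
  if [ "$status" != "200" ]; then
    echo "daemon error: HTTP $status"
    return 1
  fi

  # Step 2: version. Prints the daemon's JSON version document.
  curl --silent --max-time "$timeout" --unix-socket "$sock" http://localhost/version
}

# Usage:
#   probe_daemon                          # default socket, 10 second timeout
#   probe_daemon /var/run/docker.sock 5   # explicit socket and timeout
```

Separating "could not connect" from "connected but got an error" matters for remediation: the first points at installation, service state, or socket access, the second at the daemon's own health.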
## Why It Matters
The Docker daemon is the core runtime for all Stella Ops containers. If the daemon is down:
- No containers can start, stop, or restart.
- Health checks for all containerized services fail.
- Image pulls and builds are impossible.
- Docker Compose operations fail entirely.
- The entire Stella Ops platform is offline in container-based deployments.
## Common Causes
- Docker daemon is not running or not accessible
- Docker is not installed on the host
- Docker service crashed or was stopped
- Docker daemon returned an error response (resource exhaustion, configuration error)
- Timeout connecting to the daemon (overloaded host, slow disk)
## How to Fix
### Docker Compose
Check and restart the Docker daemon:
```bash
# Check daemon status
sudo systemctl status docker
# Start the daemon
sudo systemctl start docker
# Enable auto-start on boot
sudo systemctl enable docker
# Verify
docker info
```
If Docker is not installed:
```bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
```
### Bare Metal / systemd
```bash
# Check status
sudo systemctl status docker
# View daemon logs
sudo journalctl -u docker --since "10 minutes ago"
# Restart the daemon
sudo systemctl restart docker
# Verify connectivity
docker version
docker info
```
If the daemon crashes repeatedly, check for resource exhaustion:
```bash
# Check disk space (Docker requires space for images/containers)
df -h /var/lib/docker
# Check memory
free -h
# Clean up Docker resources
docker system prune -a
```
### Kubernetes / Helm
On Kubernetes nodes, the container runtime (containerd or CRI-O) replaces the Docker daemon. Check the runtime:
```bash
# Check containerd status
sudo systemctl status containerd
# Check CRI-O status
sudo systemctl status crio
# Restart if needed
sudo systemctl restart containerd
```
For Docker Desktop (development):
```bash
# Restart Docker Desktop
# macOS: killall Docker && open -a Docker
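# Windows: restart Docker Desktop from its system tray menu
#          (Restart-Service docker applies only to Docker Engine installed as a Windows service)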
```
## Verification
```
stella doctor run --check check.docker.daemon
```
## Related Checks
- `check.docker.socket` — verifies the Docker socket exists and has correct permissions
- `check.docker.apiversion` — verifies the Docker API version is compatible
- `check.docker.storage` — verifies Docker storage is healthy (requires running daemon)
- `check.docker.network` — verifies Docker networks are configured (requires running daemon)

@@ -0,0 +1,104 @@
---
checkId: check.docker.network
plugin: stellaops.doctor.docker
severity: warn
tags: [docker, network, connectivity]
---
# Docker Network
## What It Checks
Validates Docker network configuration and connectivity. The check connects to the Docker daemon and lists all networks, then verifies:
1. **Required networks exist**: Checks that each network listed in `Docker:RequiredNetworks` configuration is present. Defaults to `["bridge"]` if not configured.
2. **Bridge driver available**: Verifies at least one network using the `bridge` driver exists.
Evidence collected includes: total network count, available network drivers, found/missing required networks, and bridge network name.
If the Docker daemon is unreachable, the check is skipped.
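The two verifications can be reproduced by hand with a short shell function. This is a sketch under assumptions: the function name `check_networks`, the `warn` wording, and the order of the two tests are illustrative and not taken from the plugin. It reads `name driver` pairs on standard input so it can be fed from `docker network ls`.

```bash
#!/usr/bin/env bash
# Sketch only: check_networks is a hypothetical helper, not part of Stella Ops.

# check_networks [REQUIRED_NETWORK...]
# Reads "name driver" lines on stdin and prints pass or a warning.
check_networks() {
  local listing name missing=""
  listing=$(cat)

  # Documented default when Docker:RequiredNetworks is not configured.
  if [ "$#" -eq 0 ]; then set -- bridge; fi

  # 1. Every required network must be present (exact name match).
  for name in "$@"; do
    if ! awk -v n="$name" '$1 == n { found = 1 } END { exit !found }' <<<"$listing"; then
      missing="$missing $name"
    fi
  done

  if [ -n "$missing" ]; then
    echo "warn: missing networks:$missing"
  # 2. At least one network must use the bridge driver.
  elif ! awk '$2 == "bridge" { found = 1 } END { exit !found }' <<<"$listing"; then
    echo "warn: no network uses the bridge driver"
  else
    echo "pass"
  fi
}

# Usage:
#   docker network ls --format '{{.Name}} {{.Driver}}' | check_networks bridge stellaops-network
```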
## Why It Matters
Docker networks provide isolated communication channels between containers. Stella Ops services communicate over dedicated networks for:
- **Service-to-service communication**: Platform, Authority, Gateway, and other services need to reach each other.
- **Database access**: PostgreSQL and Valkey are on specific networks.
- **Network isolation**: Separating frontend, backend, and data tiers.
Missing networks cause container DNS resolution failures and connection refused errors between services.
## Common Causes
- Required network not found (not yet created or was deleted)
- No bridge network driver available (Docker networking misconfigured)
- Docker Compose network not created (compose project not started)
- Network name mismatch between configuration and actual Docker networks
## How to Fix
### Docker Compose
Docker Compose normally creates networks automatically. If missing:
```bash
# List existing networks
docker network ls
# Start compose to create networks
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d
# Create a network manually if needed
docker network create stellaops-network
# Inspect a network
docker network inspect <network-name>
```
Configure required networks for the check:
```yaml
environment:
  Docker__RequiredNetworks__0: "stellaops-network"
  Docker__RequiredNetworks__1: "bridge"
```
### Bare Metal / systemd
For bare metal deployments, Docker networks must be created manually:
```bash
# Create required networks
docker network create --driver bridge stellaops-frontend
docker network create --driver bridge stellaops-backend
docker network create --driver bridge stellaops-data
# List networks
docker network ls
# Inspect network details
docker network inspect stellaops-backend
```
### Kubernetes / Helm
Docker networks are not used in Kubernetes; instead, Kubernetes networking (Services, NetworkPolicies) handles inter-pod communication. Configure the check to skip Docker network requirements:
```yaml
doctor:
  docker:
    requiredNetworks: []  # Not applicable in Kubernetes
```
Or verify Kubernetes networking:
```bash
# Check services
kubectl get svc -n stellaops
# Check network policies
kubectl get networkpolicy -n stellaops
# Test connectivity between pods
kubectl exec -it <pod-a> -n stellaops -- curl http://<service-b>:5000/health
```
## Verification
```
stella doctor run --check check.docker.network
```
## Related Checks
- `check.docker.daemon` — Docker daemon must be running to query networks
- `check.docker.socket` — Docker socket must be accessible to communicate with the daemon

@@ -0,0 +1,125 @@
---
checkId: check.docker.socket
plugin: stellaops.doctor.docker
severity: fail
tags: [docker, socket, permissions]
---
# Docker Socket
## What It Checks
Validates that the Docker socket exists and is accessible with correct permissions. The check behavior differs by platform:
### Linux / Unix
Checks the Unix socket at the path extracted from `Docker:Host` (default: `/var/run/docker.sock`):
| Condition | Result |
|---|---|
| Socket does not exist + running inside a container | `pass` — socket mount is optional for most services |
| Socket does not exist + not inside a container | `warn` |
| Socket exists but not readable or writable | `warn` — insufficient permissions |
| Socket exists and is readable + writable | `pass` |
The check detects whether the process is running inside a Docker container by checking for `/.dockerenv` or `/proc/1/cgroup`. A missing socket inside a container is considered normal, because most containerized services do not need direct Docker access.
### Windows
On Windows, the check only verifies that a named pipe path is configured (default: `npipe://./pipe/docker_engine`). Connectivity testing is deferred to the daemon check, because named pipe access cannot be tested the same way as a Unix socket.
Evidence collected includes: socket path, existence, readability, writability, and whether the process is running inside a container.
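The Linux decision table can be sketched in shell as follows. The helper names `in_container` and `classify_socket` are hypothetical, and the `/proc/1/cgroup` pattern is an assumption about how container detection might be done, not the check's confirmed logic. On hosts using cgroup v2 that file often carries no container hint, so `/.dockerenv` is the more reliable signal.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers mirroring the documented decision table.

# in_container: succeeds when the process appears to run inside a container.
# The grep pattern is an assumption, not taken from the plugin.
in_container() {
  [ -f /.dockerenv ] || grep -qE 'docker|kubepods|containerd' /proc/1/cgroup 2>/dev/null
}

# classify_socket [PATH]: prints the result the table describes.
classify_socket() {
  local sock=${1:-/var/run/docker.sock}

  if [ ! -e "$sock" ]; then
    if in_container; then
      echo "pass: socket absent inside container"
    else
      echo "warn: socket not found"
    fi
  elif [ -r "$sock" ] && [ -w "$sock" ]; then
    # -r and -w test access for the current user.
    echo "pass"
  else
    echo "warn: insufficient permissions"
  fi
}

# Usage:
#   classify_socket                       # default /var/run/docker.sock
#   classify_socket /run/user/1000/docker.sock
```

Run the function as the same user that runs the Stella Ops service; permissions evaluated as `root` will hide the very problem the check is meant to catch.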
## Why It Matters
The Docker socket is the communication channel between clients (CLI, SDKs, Stella Ops services) and the Docker daemon. Without socket access:
- Docker CLI commands fail.
- Services that manage containers (scanner, job engine) cannot create or inspect containers.
- Docker Compose operations fail.
- Health checks that query Docker state cannot run.
Note that most Stella Ops services do NOT need direct Docker socket access. Only services that manage containers (e.g., scanner, job engine) require the socket to be mounted.
## Common Causes
- Docker socket not found at the expected path
- Docker not installed or daemon not running
- Insufficient permissions on the socket file (user not in `docker` group)
- Docker socket not mounted into the container (for containerized services that need it)
- SELinux or AppArmor blocking socket access
## How to Fix
### Docker Compose
Mount the Docker socket for services that need container management:
```yaml
services:
  scanner:
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
  # Most services do NOT need the socket:
  platform:
    # No socket mount needed
```
Fix socket permissions on the host:
```bash
# Add your user to the docker group
sudo usermod -aG docker $USER
# Log out and back in, then verify
docker ps
```
### Bare Metal / systemd
```bash
# Check if Docker is installed
which docker
# Check socket existence
ls -la /var/run/docker.sock
# Check socket permissions
stat /var/run/docker.sock
# Add user to docker group
sudo usermod -aG docker $USER
# Log out and back in (or run `newgrp docker`) so the group change takes effect
# If socket is missing, start Docker
sudo systemctl start docker
# Verify
docker ps
```
If SELinux is blocking access:
```bash
# Check SELinux denials
sudo ausearch -m avc -ts recent | grep docker
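# Generate and load a local policy module from the recorded denials
# (review the generated .te file first; the module name is illustrative)
sudo ausearch -m avc -ts recent | audit2allow -M stellaops_docker
sudo semodule -i stellaops_docker.pp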
```
### Kubernetes / Helm
In Kubernetes, the Docker socket is typically not available. Use the container runtime socket instead:
```yaml
# For containerd
volumes:
  - name: containerd-sock
    hostPath:
      path: /run/containerd/containerd.sock
      type: Socket
```
Most Stella Ops services should NOT mount any runtime socket in Kubernetes. Only the scanner or job engine may need it for container-in-container operations.
## Verification
```
stella doctor run --check check.docker.socket
```
## Related Checks
- `check.docker.daemon` — verifies the Docker daemon is running and responsive (uses the socket)
- `check.docker.apiversion` — verifies Docker API version compatibility (requires socket access)
- `check.docker.network` — verifies Docker networks (requires socket access)
- `check.docker.storage` — verifies Docker storage (requires socket access)

@@ -0,0 +1,123 @@
---
checkId: check.docker.storage
plugin: stellaops.doctor.docker
severity: warn
tags: [docker, storage, disk]
---
# Docker Storage
## What It Checks
Validates Docker storage driver and disk space usage. The check connects to the Docker daemon and retrieves system information, then inspects:
| Condition | Result |
|---|---|
| Storage driver is not `overlay2`, `btrfs`, or `zfs` | `warn` — non-recommended driver |
| Free disk space on Docker root partition < **10 GB** (configurable via `Docker:MinFreeSpaceGb`) | `warn` |
| Disk usage > **85%** (configurable via `Docker:MaxStorageUsagePercent`) | `warn` |
The check reads the Docker root directory (typically `/var/lib/docker`) and queries drive info for that partition. On platforms where disk info is unavailable, the check still validates the storage driver.
Evidence collected includes: storage driver, Docker root directory, total space, free space, usage percentage, and whether the driver is recommended.
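The three conditions in the table can be sketched as one shell function. This is illustrative only: `evaluate_storage` is a hypothetical name, and reporting every failing condition (rather than stopping at the first) is an assumption about the check's behavior. The defaults of 10 GB and 85% come from the table.

```bash
#!/usr/bin/env bash
# Sketch only: evaluate_storage is a hypothetical helper, not part of Stella Ops.
RECOMMENDED_DRIVERS="overlay2 btrfs zfs"

# evaluate_storage DRIVER FREE_GB USED_PERCENT [MIN_FREE_GB] [MAX_USED_PERCENT]
# All numeric arguments are whole numbers.
evaluate_storage() {
  local driver=$1 free_gb=$2 used_pct=$3
  local min_free=${4:-10} max_used=${5:-85} warnings=0

  # Condition 1: storage driver must be one of the recommended drivers.
  case " $RECOMMENDED_DRIVERS " in
    *" $driver "*) ;;
    *) echo "warn: storage driver '$driver' is not recommended"; warnings=1 ;;
  esac

  # Condition 2: free space on the Docker root partition.
  if [ "$free_gb" -lt "$min_free" ]; then
    echo "warn: free space ${free_gb} GB is below ${min_free} GB"
    warnings=1
  fi

  # Condition 3: overall disk usage.
  if [ "$used_pct" -gt "$max_used" ]; then
    echo "warn: disk usage ${used_pct}% exceeds ${max_used}%"
    warnings=1
  fi

  if [ "$warnings" -eq 0 ]; then echo "pass"; fi
}

# Usage (driver and root directory come from the daemon):
#   docker info --format '{{.Driver}}'
#   docker info --format '{{.DockerRootDir}}'
#   evaluate_storage overlay2 42 61
```

Both thresholds are strict comparisons here, so exactly 10 GB free or exactly 85% used still passes; whether the real check treats the boundary the same way is not stated in this article.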
## Why It Matters
Docker storage issues are a leading cause of container deployment failures:
- **Non-recommended storage drivers** (e.g., `vfs`, `devicemapper`) have performance and reliability problems. `overlay2` is the recommended driver for most workloads.
- **Low disk space** prevents image pulls, container creation, and volume writes. Docker images and layers consume significant space.
- **High disk usage** can cause container crashes, database corruption, and evidence write failures.
The Docker root directory often shares a partition with the OS, so storage exhaustion affects the entire host.
## Common Causes
- Storage driver is not overlay2, btrfs, or zfs (e.g., using legacy `devicemapper` or `vfs`)
- Low disk space on the Docker root partition (less than 10 GB free)
- Disk usage exceeds 85% threshold
- Unused images, containers, and volumes consuming space
- Large build caches not pruned
## How to Fix
### Docker Compose
Check and clean Docker storage:
```bash
# Check disk usage
docker system df
# Detailed disk usage
docker system df -v
# Prune unused data (images, containers, networks, build cache)
docker system prune -a
# Prune volumes too (WARNING: removes data volumes)
docker system prune -a --volumes
# Check storage driver
docker info | grep "Storage Driver"
```
Configure storage thresholds:
```yaml
environment:
  Docker__MinFreeSpaceGb: "10"
  Docker__MaxStorageUsagePercent: "85"
```
### Bare Metal / systemd
Switch to overlay2 storage driver if not already using it:
```bash
# Check current driver
docker info | grep "Storage Driver"
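# Configure overlay2 by adding this setting to /etc/docker/daemon.json:
#   {
#     "storage-driver": "overlay2"
#   }
# Restart Docker (WARNING: images and containers created under the previous
# driver become inaccessible and must be re-pulled or recreated)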
sudo systemctl restart docker
```
Free up disk space:
```bash
# Find large Docker directories
du -sh /var/lib/docker/*
# Clean unused resources
docker system prune -a
# Set up automatic cleanup via cron
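# Append to root's crontab (piping only the new line would replace the existing crontab)
(sudo crontab -l 2>/dev/null; echo "0 2 * * 0 docker system prune -f --filter 'until=168h'") | sudo crontab -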
```
### Kubernetes / Helm
Monitor node disk usage:
```bash
# Check node disk pressure
kubectl describe node <node> | grep -A 5 "Conditions"
# Check for DiskPressure condition
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[?(@.type=="DiskPressure")]}{.status}{"\n"}{end}{end}'
```
Configure kubelet garbage collection thresholds:
```yaml
# In kubelet config
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "15%"
```
## Verification
```
stella doctor run --check check.docker.storage
```
## Related Checks
- `check.core.env.diskspace` — checks general disk space (not Docker-specific)
- `check.docker.daemon` — daemon must be running to query storage info