Doctor plugin checks: implement health check classes and documentation
Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
94
docs/doctor/articles/docker/apiversion.md
Normal file
@@ -0,0 +1,94 @@
---
checkId: check.docker.apiversion
plugin: stellaops.doctor.docker
severity: warn
tags: [docker, api, compatibility]
---
# Docker API Version

## What It Checks
Validates that the Docker API version meets minimum requirements for Stella Ops. The check connects to the Docker daemon (using `Docker:Host` configuration or the platform default) and queries the API version via `System.GetVersionAsync()`.

| API Version | Result |
|---|---|
| Below **1.41** | `warn`: below minimum required |
| **1.41** or higher, but below **1.43** | `warn`: below recommended |
| **1.43** or higher | `pass` |

The minimum API version 1.41 corresponds to Docker Engine 20.10+. The recommended version 1.43 corresponds to Docker Engine 24.0+.

Evidence collected includes: API version, Docker version, minimum required version, recommended version, OS, build time, and git commit.

Default Docker host:
- **Linux**: `unix:///var/run/docker.sock`
- **Windows**: `npipe://./pipe/docker_engine`

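The thresholds in the table can be expressed as a small shell sketch. This is not the plugin's implementation (the check itself queries the daemon through `System.GetVersionAsync()`); the function names `version_lt` and `classify_api_version` are hypothetical, and the boundaries assume that 1.41 itself satisfies the minimum and 1.43 itself satisfies the recommendation.

```bash
# Return success when version $1 is strictly lower than version $2.
# Versions are "major.minor" and are compared numerically per component,
# so 1.5 sorts below 1.41.
version_lt() {
  local a_major="${1%%.*}" a_minor="${1#*.}"
  local b_major="${2%%.*}" b_minor="${2#*.}"
  if [ "$a_major" -ne "$b_major" ]; then
    [ "$a_major" -lt "$b_major" ]
    return
  fi
  [ "$a_minor" -lt "$b_minor" ]
}

# Map an API version to the result column of the table.
classify_api_version() {
  local api="$1" minimum="1.41" recommended="1.43"
  if version_lt "$api" "$minimum"; then
    echo "warn: below minimum required ($minimum)"
  elif version_lt "$api" "$recommended"; then
    echo "warn: below recommended ($recommended)"
  else
    echo "pass"
  fi
}
```

On a host with the Docker CLI, the server's API version can be passed in with `classify_api_version "$(docker version --format '{{.Server.APIVersion}}')"`.
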
## Why It Matters
Stella Ops uses Docker API features for container management, image inspection, and network configuration. Older API versions may not support required features such as:

- BuildKit-based image builds (API 1.39+).
- Multi-platform image inspection (API 1.41+).
- Container resource management improvements (API 1.43+).

Running an outdated Docker version also means missing security patches and bug fixes.

## Common Causes
- Docker Engine is outdated (version < 20.10)
- Docker Engine is functional but below recommended version (< 24.0)
- Using a Docker-compatible runtime (Podman, containerd) that reports a lower API version
- Docker not updated after OS upgrade

## How to Fix

### Docker Compose
Update Docker Engine to the latest stable version:

```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io

# RHEL/CentOS
sudo yum update docker-ce docker-ce-cli containerd.io

# Verify version
docker version
```

### Bare Metal / systemd
```bash
# Check current version
docker version

# Update Docker
curl -fsSL https://get.docker.com | sh

# Restart Docker
sudo systemctl restart docker

# Verify
docker version
```

### Kubernetes / Helm
Update the container runtime on cluster nodes. The method depends on your Kubernetes distribution:

```bash
# Check node runtime version
kubectl get nodes -o wide

# For kubeadm clusters, update containerd on each node
sudo apt-get update && sudo apt-get install containerd.io

# Verify
sudo crictl version
```

## Verification
```
stella doctor run --check check.docker.apiversion
```

## Related Checks
- `check.docker.daemon`: verifies Docker daemon is running (prerequisite for version check)
- `check.docker.socket`: verifies Docker socket is accessible

124
docs/doctor/articles/docker/daemon.md
Normal file
@@ -0,0 +1,124 @@
---
checkId: check.docker.daemon
plugin: stellaops.doctor.docker
severity: fail
tags: [docker, daemon, container]
---
# Docker Daemon

## What It Checks
Validates that the Docker daemon is running and responsive. The check connects to the Docker daemon (using `Docker:Host` configuration or the platform default) and performs two operations:

1. **Ping**: Sends a ping request to verify the daemon is alive (with a configurable timeout, default 10 seconds via `Docker:TimeoutSeconds`).
2. **Version**: Retrieves version information to confirm the daemon is fully operational.

Evidence collected on success: host address, Docker version, API version, OS, architecture, and kernel version.

On failure, the check distinguishes between:
- **DockerApiException**: The daemon is running but returned an error (reports status code and response body).
- **Connection failure**: Cannot connect to the daemon at all (Docker not installed, not running, or socket inaccessible).

Default Docker host:
- **Linux**: `unix:///var/run/docker.sock`
- **Windows**: `npipe://./pipe/docker_engine`

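As a rough illustration of the ping step and the two failure modes, the sketch below calls the Docker Engine API `/_ping` endpoint over the Unix socket with `curl`. It assumes `curl` is installed and a Linux socket path; the function name `ping_docker` and its messages are hypothetical and are not taken from the plugin.

```bash
# Ping the Docker daemon over its Unix socket.
# $1: socket path (default /var/run/docker.sock)
# $2: timeout in seconds (default 10, mirroring Docker:TimeoutSeconds)
ping_docker() {
  local socket="${1:-/var/run/docker.sock}" timeout="${2:-10}" status
  # curl exits non-zero when it cannot connect at all
  # (missing socket, daemon not running, or timeout).
  if ! status=$(curl --silent --output /dev/null --write-out '%{http_code}' \
        --unix-socket "$socket" --max-time "$timeout" http://localhost/_ping); then
    echo "connection failure: cannot reach daemon at $socket"
    return 1
  fi
  # The daemon answered; anything other than 200 is an API-level error.
  if [ "$status" = "200" ]; then
    echo "daemon alive"
  else
    echo "daemon error: HTTP $status"
    return 1
  fi
}
```

Calling `ping_docker` with no arguments uses the Linux default socket and the 10-second default timeout.
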
## Why It Matters
The Docker daemon is the core runtime for all Stella Ops containers. If the daemon is down:

- No containers can start, stop, or restart.
- Health checks for all containerized services fail.
- Image pulls and builds are impossible.
- Docker Compose operations fail entirely.
- The entire Stella Ops platform is offline in container-based deployments.

## Common Causes
- Docker daemon is not running or not accessible
- Docker is not installed on the host
- Docker service crashed or was stopped
- Docker daemon returned an error response (resource exhaustion, configuration error)
- Timeout connecting to the daemon (overloaded host, slow disk)

## How to Fix

### Docker Compose
Check and restart the Docker daemon:

```bash
# Check daemon status
sudo systemctl status docker

# Start the daemon
sudo systemctl start docker

# Enable auto-start on boot
sudo systemctl enable docker

# Verify
docker info
```

If Docker is not installed:
```bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
```

### Bare Metal / systemd
```bash
# Check status
sudo systemctl status docker

# View daemon logs
sudo journalctl -u docker --since "10 minutes ago"

# Restart the daemon
sudo systemctl restart docker

# Verify connectivity
docker version
docker info
```

If the daemon crashes repeatedly, check for resource exhaustion:
```bash
# Check disk space (Docker requires space for images/containers)
df -h /var/lib/docker

# Check memory
free -h

# Clean up Docker resources
# (WARNING: removes all unused images, stopped containers, and unused networks)
docker system prune -a
```

### Kubernetes / Helm
On Kubernetes nodes, the container runtime (containerd or CRI-O) replaces the Docker daemon. Check the runtime:

```bash
# Check containerd status
sudo systemctl status containerd

# Check CRI-O status
sudo systemctl status crio

# Restart if needed
sudo systemctl restart containerd
```

For Docker Desktop (development):
```bash
# Restart Docker Desktop
# macOS: killall Docker && open -a Docker
# Windows (elevated PowerShell): Restart-Service com.docker.service
```

## Verification
```
stella doctor run --check check.docker.daemon
```

## Related Checks
- `check.docker.socket`: verifies the Docker socket exists and has correct permissions
- `check.docker.apiversion`: verifies the Docker API version is compatible
- `check.docker.storage`: verifies Docker storage is healthy (requires running daemon)
- `check.docker.network`: verifies Docker networks are configured (requires running daemon)

104
docs/doctor/articles/docker/network.md
Normal file
@@ -0,0 +1,104 @@
---
checkId: check.docker.network
plugin: stellaops.doctor.docker
severity: warn
tags: [docker, network, connectivity]
---
# Docker Network

## What It Checks
Validates Docker network configuration and connectivity. The check connects to the Docker daemon and lists all networks, then verifies:

1. **Required networks exist**: Checks that each network listed in `Docker:RequiredNetworks` configuration is present. Defaults to `["bridge"]` if not configured.
2. **Bridge driver available**: Verifies at least one network using the `bridge` driver exists.

Evidence collected includes: total network count, available network drivers, found/missing required networks, and bridge network name.

If the Docker daemon is unreachable, the check is skipped.

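The required-network comparison amounts to a set difference between the configured names and the names the daemon reports. The sketch below shows that comparison in shell; the function name `missing_networks` is hypothetical, and it reads the existing names from standard input so it can be tried without a running daemon.

```bash
# Print every required network (given as arguments) that is absent from the
# list of existing network names read from standard input, one name per line.
missing_networks() {
  local existing required
  existing=$(cat)
  for required in "$@"; do
    # -x matches whole lines only; -F treats the name as a literal string.
    if ! printf '%s\n' "$existing" | grep -qxF -- "$required"; then
      echo "$required"
    fi
  done
}
```

Against a live daemon, the input could come from `docker network ls --format '{{.Name}}' | missing_networks bridge stellaops-network`; empty output means every required network was found.
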
## Why It Matters
Docker networks provide isolated communication channels between containers. Stella Ops services communicate over dedicated networks for:

- **Service-to-service communication**: Platform, Authority, Gateway, and other services need to reach each other.
- **Database access**: PostgreSQL and Valkey are on specific networks.
- **Network isolation**: Separating frontend, backend, and data tiers.

Missing networks cause container DNS resolution failures and connection refused errors between services.

## Common Causes
- Required network not found (not yet created or was deleted)
- No bridge network driver available (Docker networking misconfigured)
- Docker Compose network not created (compose project not started)
- Network name mismatch between configuration and actual Docker networks

## How to Fix

### Docker Compose
Docker Compose normally creates networks automatically. If missing:

```bash
# List existing networks
docker network ls

# Start compose to create networks
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d

# Create a network manually if needed
docker network create stellaops-network

# Inspect a network
docker network inspect <network-name>
```

Configure required networks for the check:
```yaml
environment:
  Docker__RequiredNetworks__0: "stellaops-network"
  Docker__RequiredNetworks__1: "bridge"
```

### Bare Metal / systemd
For bare metal deployments, Docker networks must be created manually:

```bash
# Create required networks
docker network create --driver bridge stellaops-frontend
docker network create --driver bridge stellaops-backend
docker network create --driver bridge stellaops-data

# List networks
docker network ls

# Inspect network details
docker network inspect stellaops-backend
```

### Kubernetes / Helm
Docker networks are not used in Kubernetes; instead, Kubernetes networking (Services, NetworkPolicies) handles inter-pod communication. Configure the check to skip Docker network requirements:

```yaml
doctor:
  docker:
    requiredNetworks: []  # Not applicable in Kubernetes
```

Or verify Kubernetes networking:
```bash
# Check services
kubectl get svc -n stellaops

# Check network policies
kubectl get networkpolicy -n stellaops

# Test connectivity between pods
kubectl exec -it <pod-a> -n stellaops -- curl http://<service-b>:5000/health
```

## Verification
```
stella doctor run --check check.docker.network
```

## Related Checks
- `check.docker.daemon`: Docker daemon must be running to query networks
- `check.docker.socket`: Docker socket must be accessible to communicate with the daemon

125
docs/doctor/articles/docker/socket.md
Normal file
@@ -0,0 +1,125 @@
---
checkId: check.docker.socket
plugin: stellaops.doctor.docker
severity: fail
tags: [docker, socket, permissions]
---
# Docker Socket

## What It Checks
Validates that the Docker socket exists and is accessible with correct permissions. The check behavior differs by platform:

### Linux / Unix
Checks the Unix socket at the path extracted from `Docker:Host` (default: `/var/run/docker.sock`):

| Condition | Result |
|---|---|
| Socket does not exist + running inside a container | `pass`: socket mount is optional for most services |
| Socket does not exist + not inside a container | `warn` |
| Socket exists but not readable or writable | `warn`: insufficient permissions |
| Socket exists and is readable + writable | `pass` |

The check detects whether the process is running inside a Docker container by checking for `/.dockerenv` or `/proc/1/cgroup`. When running inside a container without a mounted socket, this is considered normal for services that don't need direct Docker access.

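The Linux decision table can be sketched as a shell function. This is only an approximation of the rows listed for Linux, not the plugin's code: the function name `check_socket`, its messages, and the choice to pass the container flag as an argument are all assumptions made for illustration.

```bash
# Evaluate the Linux socket conditions.
# $1: socket path (default /var/run/docker.sock)
# $2: "true" when the process runs inside a container (default "false")
check_socket() {
  local socket="${1:-/var/run/docker.sock}" in_container="${2:-false}"
  if [ ! -S "$socket" ]; then
    # No socket: acceptable inside a container, a warning on a host.
    if [ "$in_container" = "true" ]; then
      echo "pass: socket not mounted (optional inside a container)"
    else
      echo "warn: socket not found at $socket"
    fi
  elif [ -r "$socket" ] && [ -w "$socket" ]; then
    echo "pass: socket is readable and writable"
  else
    echo "warn: insufficient permissions on $socket"
  fi
}
```

One simple way to supply the container flag is to test for the marker file: `check_socket /var/run/docker.sock "$([ -f /.dockerenv ] && echo true || echo false)"`.
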
### Windows
On Windows, the check verifies that the named pipe path is configured (default: `npipe://./pipe/docker_engine`). The actual connectivity is deferred to the daemon check since named pipe access testing differs from Unix sockets.

Evidence collected includes: socket path, existence, readability, writability, and whether the process is running inside a container.

## Why It Matters
The Docker socket is the communication channel between clients (CLI, SDKs, Stella Ops services) and the Docker daemon. Without socket access:

- Docker CLI commands fail.
- Services that manage containers (scanner, job engine) cannot create or inspect containers.
- Docker Compose operations fail.
- Health checks that query Docker state cannot run.

Note that most Stella Ops services do NOT need direct Docker socket access. Only services that manage containers (e.g., scanner, job engine) require the socket to be mounted.

## Common Causes
- Docker socket not found at the expected path
- Docker not installed or daemon not running
- Insufficient permissions on the socket file (user not in `docker` group)
- Docker socket not mounted into the container (for containerized services that need it)
- SELinux or AppArmor blocking socket access

## How to Fix

### Docker Compose
Mount the Docker socket for services that need container management:

```yaml
services:
  scanner:
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  # Most services do NOT need the socket:
  platform:
    # No socket mount needed
```

Fix socket permissions on the host:
```bash
# Add your user to the docker group
sudo usermod -aG docker $USER

# Log out and back in, then verify
docker ps
```

### Bare Metal / systemd
```bash
# Check if Docker is installed
which docker

# Check socket existence
ls -la /var/run/docker.sock

# Check socket permissions
stat /var/run/docker.sock

# Add user to docker group
sudo usermod -aG docker $USER
# Log out and back in (or run `newgrp docker`) so the group change takes effect

# If socket is missing, start Docker
sudo systemctl start docker

# Verify
docker ps
```

If SELinux is blocking access:
```bash
# Check SELinux denials
sudo ausearch -m avc -ts recent | grep docker

# Allow Docker socket access by building a local policy module from the denials
# (the module name is an example; review the generated .te file before loading it)
sudo ausearch -m avc -ts recent | audit2allow -M stellaops_docker_sock
sudo semodule -i stellaops_docker_sock.pp
```

### Kubernetes / Helm
In Kubernetes, the Docker socket is typically not available. Use the container runtime socket instead:

```yaml
# For containerd
volumes:
  - name: containerd-sock
    hostPath:
      path: /run/containerd/containerd.sock
      type: Socket
```

Most Stella Ops services should NOT mount any runtime socket in Kubernetes. Only the scanner or job engine may need it for container-in-container operations.

## Verification
```
stella doctor run --check check.docker.socket
```

## Related Checks
- `check.docker.daemon`: verifies the Docker daemon is running and responsive (uses the socket)
- `check.docker.apiversion`: verifies Docker API version compatibility (requires socket access)
- `check.docker.network`: verifies Docker networks (requires socket access)
- `check.docker.storage`: verifies Docker storage (requires socket access)

123
docs/doctor/articles/docker/storage.md
Normal file
@@ -0,0 +1,123 @@
---
checkId: check.docker.storage
plugin: stellaops.doctor.docker
severity: warn
tags: [docker, storage, disk]
---
# Docker Storage

## What It Checks
Validates Docker storage driver and disk space usage. The check connects to the Docker daemon and retrieves system information, then inspects:

| Condition | Result |
|---|---|
| Storage driver is not `overlay2`, `btrfs`, or `zfs` | `warn`: non-recommended driver |
| Free disk space on Docker root partition < **10 GB** (configurable via `Docker:MinFreeSpaceGb`) | `warn` |
| Disk usage > **85%** (configurable via `Docker:MaxStorageUsagePercent`) | `warn` |

The check reads the Docker root directory (typically `/var/lib/docker`) and queries drive info for that partition. On platforms where disk info is unavailable, the check still validates the storage driver.

Evidence collected includes: storage driver, Docker root directory, total space, free space, usage percentage, and whether the driver is recommended.

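The three conditions can be combined in a short shell sketch. It is an illustration only: the function name `check_storage` and its messages are hypothetical, it takes whole-number inputs rather than querying the daemon or the disk, and the defaults simply mirror the documented `Docker:MinFreeSpaceGb` and `Docker:MaxStorageUsagePercent` values.

```bash
# Evaluate the three storage conditions.
# $1: storage driver, $2: free space in whole GB, $3: disk usage in whole percent
# $4: minimum free GB (default 10), $5: maximum usage percent (default 85)
check_storage() {
  local driver="$1" free_gb="$2" used_percent="$3"
  local min_free_gb="${4:-10}" max_used_percent="${5:-85}"
  local result=pass
  case "$driver" in
    overlay2|btrfs|zfs) ;;  # recommended drivers
    *) echo "warn: non-recommended storage driver '$driver'"; result=warn ;;
  esac
  if [ "$free_gb" -lt "$min_free_gb" ]; then
    echo "warn: only ${free_gb} GB free (minimum ${min_free_gb} GB)"
    result=warn
  fi
  if [ "$used_percent" -gt "$max_used_percent" ]; then
    echo "warn: disk usage ${used_percent}% exceeds ${max_used_percent}%"
    result=warn
  fi
  if [ "$result" = "pass" ]; then
    echo "pass"
  fi
}
```

The driver argument could be taken from a live daemon with `docker info --format '{{.Driver}}'`; each violated condition prints its own warning line, so a single run can report several problems.
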
## Why It Matters
Docker storage issues are a leading cause of container deployment failures:

- **Non-recommended storage drivers** (e.g., `vfs`, `devicemapper`) have performance and reliability problems. `overlay2` is the recommended driver for most workloads.
- **Low disk space** prevents image pulls, container creation, and volume writes. Docker images and layers consume significant space.
- **High disk usage** can cause container crashes, database corruption, and evidence write failures.

The Docker root directory often shares a partition with the OS, so storage exhaustion affects the entire host.

## Common Causes
- Storage driver is not overlay2, btrfs, or zfs (e.g., using legacy `devicemapper` or `vfs`)
- Low disk space on the Docker root partition (less than 10 GB free)
- Disk usage exceeds 85% threshold
- Unused images, containers, and volumes consuming space
- Large build caches not pruned

## How to Fix

### Docker Compose
Check and clean Docker storage:

```bash
# Check disk usage
docker system df

# Detailed disk usage
docker system df -v

# Prune unused data (images, containers, networks, build cache)
docker system prune -a

# Prune volumes too (WARNING: removes data volumes)
docker system prune -a --volumes

# Check storage driver
docker info | grep "Storage Driver"
```

Configure storage thresholds:
```yaml
environment:
  Docker__MinFreeSpaceGb: "10"
  Docker__MaxStorageUsagePercent: "85"
```

### Bare Metal / systemd
Switch to overlay2 storage driver if not already using it:

```bash
# Check current driver
docker info | grep "Storage Driver"

# Configure overlay2 in /etc/docker/daemon.json
# (this overwrites the file; merge by hand if it already contains other settings)
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "storage-driver": "overlay2"
}
EOF

# Restart Docker (WARNING: may require re-pulling images)
sudo systemctl restart docker
```

Free up disk space:
```bash
# Find large Docker directories (the directory is readable by root only)
sudo du -h --max-depth=1 /var/lib/docker

# Clean unused resources
docker system prune -a

# Set up automatic cleanup via cron
# (appends to root's existing crontab instead of replacing it)
( sudo crontab -l 2>/dev/null; echo "0 2 * * 0 docker system prune -f --filter 'until=168h'" ) | sudo crontab -
```

### Kubernetes / Helm
Monitor node disk usage:

```bash
# Check node disk pressure
kubectl describe node <node> | grep -A 5 "Conditions"

# Check for DiskPressure condition
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[?(@.type=="DiskPressure")]}{.status}{"\n"}{end}{end}'
```

Configure kubelet garbage collection thresholds:
```yaml
# In kubelet config
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "15%"
```

## Verification
```
stella doctor run --check check.docker.storage
```

## Related Checks
- `check.core.env.diskspace`: checks general disk space (not Docker-specific)
- `check.docker.daemon`: daemon must be running to query storage info