Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/storage/disk-space.md
+++ b/docs/doctor/articles/storage/disk-space.md
@@ -0,0 +1,84 @@
+---
+checkId: check.storage.diskspace
+plugin: stellaops.doctor.storage
+severity: fail
+tags: [storage, disk, capacity, core]
+---
+# Disk Space Availability
+
+## What It Checks
+Verifies disk space availability on drives used by Stella Ops. The check:
+
+- Identifies paths to check from `Storage:DataPath`, `EvidenceLocker:Path`, `Backup:Path`, and `Logging:Path` configuration (falls back to platform defaults: `/var/lib/stellaops` on Linux, `%ProgramData%\StellaOps` on Windows).
+- Gets the drive info for each path and calculates the usage ratio.
+- **Fails at 90%+ usage** (critical threshold) -- the system is at immediate risk of running out of space.
+- **Warns at 80%+ usage** (warning threshold) -- approaching capacity.
+- Reports the drive with the highest usage.
+
+## Why It Matters
+Disk exhaustion causes cascading failures: database writes fail, evidence cannot be stored, log rotation breaks, and container operations halt. This is a severity-fail check because disk exhaustion can cause data loss and service outages that are difficult to recover from.
+
+## Common Causes
+- Log files accumulating without rotation
+- Evidence artifacts consuming space
+- Backup files not rotated or pruned
+- Large container images cached on disk
+- Normal data growth approaching provisioned capacity
+
+## How to Fix
+
+### Docker Compose
+```bash
+# Check disk usage
+docker exec <platform-container> df -h
+
+# Clean up old logs
+stella storage cleanup --logs --older-than 7d
+
+# Prune Docker resources (destructive: removes all unused images and unused volumes; review first)
+docker system prune -a
+docker volume prune
+```
+
+### Bare Metal / systemd
+```bash
+# Find large files
+du -sh /var/lib/stellaops/* | sort -rh | head -20
+
+# Clean up logs
+stella storage cleanup --logs --older-than 7d
+
+# Clean up temporary files
+stella storage cleanup --temp
+
+# Review Docker disk usage
+docker system df
+```
+
+### Kubernetes / Helm
+```bash
+# Check PV capacity, then actual usage inside the pod
+kubectl get pv
+kubectl exec -it <platform-pod> -- df -h
+
+# Expand the PVC if needed (requires a StorageClass with allowVolumeExpansion: true)
+kubectl edit pvc stellaops-data  # increase storage request
+```
+
+Consider setting up automated cleanup policies:
+```yaml
+storage:
+  cleanup:
+    enabled: true
+    logRetentionDays: 30
+    tempCleanupSchedule: "0 4 * * *"
+```
+
+## Verification
+```bash
+stella doctor run --check check.storage.diskspace
+```
+
+## Related Checks
+- `check.storage.backup` — verifies backup directory accessibility
+- `check.storage.evidencelocker` — verifies evidence locker write access
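
Reviewer note on the "What It Checks" thresholds in `disk-space.md`: the 80% warning and 90% critical cut-offs can be approximated from a shell for a quick manual cross-check. The following is a minimal sketch, not the plugin's implementation. The function names `classify_usage`, `usage_percent`, and `report_worst` are hypothetical, and `df` reports rounded percentages that may differ slightly from the ratio the check computes.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the 80%/90% threshold logic described in the article.
# This is NOT the Doctor plugin's code; all function names are illustrative.

# classify_usage PERCENT -> prints "fail" (>= 90), "warn" (>= 80), or "pass"
classify_usage() {
  local pct=$1
  if [ "$pct" -ge 90 ]; then
    echo fail
  elif [ "$pct" -ge 80 ]; then
    echo warn
  else
    echo pass
  fi
}

# usage_percent PATH -> prints the integer use% of the filesystem holding PATH
usage_percent() {
  # df -P forces one line per filesystem; column 5 is the capacity, e.g. "42%"
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# report_worst PATH... -> prints "<path> <pct>% <status>" for the fullest path
report_worst() {
  local worst_pct=-1 worst_path="" path pct
  for path in "$@"; do
    [ -e "$path" ] || continue              # skip paths that are not present
    pct=$(usage_percent "$path")
    case $pct in
      ''|*[!0-9]*) continue ;;              # skip if df gave no numeric figure
    esac
    if [ "$pct" -gt "$worst_pct" ]; then
      worst_pct=$pct
      worst_path=$path
    fi
  done
  if [ -n "$worst_path" ]; then
    echo "$worst_path ${worst_pct}% $(classify_usage "$worst_pct")"
  fi
}

# Run only when paths are supplied on the command line
if [ "$#" -gt 0 ]; then
  report_worst "$@"
fi
```

Saved under a name of your choosing and invoked with one or more directories (for example the Linux default `/var/lib/stellaops` mentioned in the article), the script prints the fullest existing path with its percentage and status; paths that do not exist are skipped.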