Doctor plugin checks: implement health check classes and documentation
Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
80
docs/doctor/articles/storage/backup-directory.md
Normal file
@@ -0,0 +1,80 @@
---
checkId: check.storage.backup
plugin: stellaops.doctor.storage
severity: warn
tags: [storage, backup, disaster-recovery]
---
# Backup Directory Accessibility

## What It Checks
Verifies backup directory accessibility and recent backup presence. The check:

- Reads the backup path from `Backup:Path` or `Storage:BackupPath` configuration.
- Verifies the directory exists.
- Tests write access by creating and deleting a temp file.
- Scans for backup files (`.bak`, `.backup`, `.tar`, `.tar.gz`, `.tgz`, `.zip`, `.sql`, `.dump`) in the top-level directory.
- Warns if no backup files are found or if the most recent backup is older than 7 days.
- Fails if the directory exists but is not writable.

The check only runs when a backup path is configured.

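The steps above can be reproduced by hand with a short shell sketch. This is not the plugin's implementation: `list_backups` and `backup_check` are hypothetical helper names, the printed status words are illustrative, and the handling of a missing directory is an assumption, since the article does not state the result for that case.

```bash
# Hypothetical helper: list backup files in the top level of a directory.
# Extra arguments are passed through to find (for example: -mtime -7).
list_backups() {
    lb_dir=$1
    shift
    find "$lb_dir" -maxdepth 1 -type f \( \
        -name '*.bak' -o -name '*.backup' -o -name '*.tar' -o -name '*.tar.gz' \
        -o -name '*.tgz' -o -name '*.zip' -o -name '*.sql' -o -name '*.dump' \
        \) "$@"
}

# Hypothetical check: prints one of missing, unwritable, no-backups, stale, ok.
# Usage: backup_check <directory> [max-age-days]
backup_check() {
    bc_dir=$1
    bc_days=${2:-7}

    if [ ! -d "$bc_dir" ]; then
        echo "missing"         # assumption: the article does not name this result
        return 0
    fi

    # Write test: create a temp file, then delete it.
    if bc_tmp=$(mktemp "$bc_dir/.doctor-write-test.XXXXXX" 2>/dev/null); then
        rm -f "$bc_tmp"
    else
        echo "unwritable"      # the article says the check fails in this case
        return 0
    fi

    if [ -z "$(list_backups "$bc_dir" | head -n 1)" ]; then
        echo "no-backups"      # warn: no backup files found
    elif [ -z "$(list_backups "$bc_dir" -mtime "-$bc_days" | head -n 1)" ]; then
        echo "stale"           # warn: no backup modified within the age limit
    else
        echo "ok"
    fi
}
```

Running `backup_check /var/backups/stellaops` prints a single status word. Only the top-level directory is scanned, matching the description above, so backups kept in subdirectories are reported as `no-backups`.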
## Why It Matters
Backups are the last line of defense against data loss. If the backup directory is inaccessible, or the backups are missing or stale, the system cannot recover from database corruption, hardware failure, or accidental deletion. The 7-day staleness threshold flags backups that are no longer reasonably current.

## Common Causes
- Backup directory not created yet
- Path misconfigured or remote mount not available
- Insufficient permissions (read-only mount, wrong ownership)
- Backup job never run or failing silently
- Backup schedule disabled

## How to Fix

### Docker Compose
```yaml
environment:
  Backup__Path: "/var/backups/stellaops"
volumes:
  - backup-data:/var/backups/stellaops
```

```bash
# Create backup directory
docker exec <platform-container> mkdir -p /var/backups/stellaops

# Run initial backup
docker exec <platform-container> stella backup create --full
```

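The Compose snippet sets `Backup__Path`, while the check reads the `Backup:Path` key. Assuming the platform follows the .NET configuration convention, in which a double underscore in an environment variable name stands for the colon separator, both names refer to the same setting. The helper below is a hypothetical illustration of that mapping, not part of the `stella` CLI.

```bash
# Hypothetical helper: convert a colon-separated configuration key
# into the matching environment variable name (":" becomes "__").
config_key_to_env() {
    printf '%s\n' "$1" | sed 's/:/__/g'
}
```

For example, `config_key_to_env 'Storage:BackupPath'` prints `Storage__BackupPath`, the variable name to use for the alternative key.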
### Bare Metal / systemd
```bash
# Create backup directory
mkdir -p /var/backups/stellaops
chmod 750 /var/backups/stellaops

# Run initial backup
stella backup create --full

# Set up a schedule
stella backup schedule create --interval daily
```

### Kubernetes / Helm
```yaml
backup:
  enabled: true
  path: "/var/backups/stellaops"
  schedule: "0 3 * * *"
  persistence:
    enabled: true
    size: 100Gi
```

## Verification
```bash
stella doctor run --check check.storage.backup
```

## Related Checks
- `check.storage.diskspace`: verifies sufficient disk space is available
- `check.storage.evidencelocker`: verifies evidence locker write access
84
docs/doctor/articles/storage/disk-space.md
Normal file
@@ -0,0 +1,84 @@
---
checkId: check.storage.diskspace
plugin: stellaops.doctor.storage
severity: fail
tags: [storage, disk, capacity, core]
---
# Disk Space Availability

## What It Checks
Verifies disk space availability on drives used by Stella Ops. The check:

- Identifies paths to check from `Storage:DataPath`, `EvidenceLocker:Path`, `Backup:Path`, and `Logging:Path` configuration (falls back to platform defaults: `/var/lib/stellaops` on Linux, `%ProgramData%\StellaOps` on Windows).
- Gets the drive info for each path and calculates the usage ratio.
- **Fails at 90% usage or higher** (critical threshold): the system is at immediate risk of running out of space.
- **Warns at 80% usage or higher** (warning threshold): the drive is approaching capacity.
- Reports the most critically used drive.

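The threshold logic can be approximated from a shell with `df`. This is a sketch under assumptions, not the plugin's code: the function names are hypothetical, and the percentage that `df -P` reports is rounded and may differ slightly from the ratio the check computes from drive info.

```bash
# Map a usage percentage to a status using the thresholds described above.
classify_usage() {
    if [ "$1" -ge 90 ]; then
        echo "fail"
    elif [ "$1" -ge 80 ]; then
        echo "warn"
    else
        echo "pass"
    fi
}

# Print the usage percentage of the filesystem that holds a path.
# df -P selects the POSIX output format; column 5 is the capacity (e.g. "42%").
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report the most heavily used filesystem among the given paths.
worst_usage() {
    wu_max=-1
    wu_path=""
    for wu_p in "$@"; do
        [ -e "$wu_p" ] || continue          # skip paths that do not exist
        wu_pct=$(usage_percent "$wu_p")
        case $wu_pct in
            ''|*[!0-9]*) continue ;;        # skip non-numeric df output
        esac
        if [ "$wu_pct" -gt "$wu_max" ]; then
            wu_max=$wu_pct
            wu_path=$wu_p
        fi
    done
    if [ -z "$wu_path" ]; then
        echo "skip: no paths found"
        return 0
    fi
    echo "$(classify_usage "$wu_max"): $wu_path at ${wu_max}%"
}
```

A call such as `worst_usage /var/lib/stellaops /var/backups/stellaops` prints one line naming the fullest filesystem and its status. The two paths are the examples used elsewhere in these articles; substitute the paths configured for your installation.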
## Why It Matters
Disk exhaustion causes cascading failures: database writes fail, evidence cannot be stored, log rotation breaks, and container operations halt. This is a severity-fail check because disk exhaustion can cause data loss and service outages that are difficult to recover from.

## Common Causes
- Log files accumulating without rotation
- Evidence artifacts consuming space
- Backup files not rotated or pruned
- Large container images cached on disk
- Normal data growth approaching provisioned capacity

## How to Fix

### Docker Compose
```bash
# Check disk usage
docker exec <platform-container> df -h

# Clean up old logs
stella storage cleanup --logs --older-than 7d

# Prune Docker resources (removes unused images and volumes; review before running)
docker system prune -a
docker volume prune
```

### Bare Metal / systemd
```bash
# Find large files and directories
du -sh /var/lib/stellaops/* | sort -rh | head -20

# Clean up logs
stella storage cleanup --logs --older-than 7d

# Clean up temporary files
stella storage cleanup --temp

# Review Docker disk usage
docker system df
```

### Kubernetes / Helm
```bash
# Check PV usage
kubectl get pv
kubectl exec -it <platform-pod> -- df -h

# Expand the PVC if needed (the StorageClass must allow volume expansion)
kubectl edit pvc stellaops-data  # increase storage request
```

Consider setting up automated cleanup policies:
```yaml
storage:
  cleanup:
    enabled: true
    logRetentionDays: 30
    tempCleanupSchedule: "0 4 * * *"
```

## Verification
```bash
stella doctor run --check check.storage.diskspace
```

## Related Checks
- `check.storage.backup`: verifies backup directory accessibility
- `check.storage.evidencelocker`: verifies evidence locker write access
83
docs/doctor/articles/storage/evidence-locker-write.md
Normal file
@@ -0,0 +1,83 @@
---
checkId: check.storage.evidencelocker
plugin: stellaops.doctor.storage
severity: fail
tags: [storage, evidence, write, permissions]
---
# Evidence Locker Write Access

## What It Checks
Verifies evidence locker write permissions and performance. The check:

- Reads the evidence locker path from `EvidenceLocker:Path` or `Storage:EvidencePath`.
- Creates the directory if it does not exist.
- Writes a test file, reads it back to verify content integrity, and measures latency.
- **Fails** if the directory cannot be created, writes are denied (`UnauthorizedAccessException`), or the content read back does not match what was written (storage corruption).
- **Warns** if write latency exceeds 100 ms (elevated I/O latency, for example from a slow NFS/CIFS backend).
- Cleans up the test file after the check.

The check only runs when an evidence locker path is configured.

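The write, read-back, and latency steps can be sketched in shell as follows. This is an illustration rather than the plugin's implementation: `now_ms` and `locker_check` are hypothetical names, `date +%s%N` is a GNU extension (the sketch falls back to whole seconds elsewhere), and the timing includes process start-up overhead, so it is coarser than the latency the check measures.

```bash
# Milliseconds since the epoch. %N is a GNU date extension; where it is
# unsupported, fall back to whole-second resolution.
now_ms() {
    nm_ns=$(date +%s%N 2>/dev/null)
    case $nm_ns in
        ''|*[!0-9]*) echo "$(( $(date +%s) * 1000 ))" ;;
        *) echo "$(( nm_ns / 1000000 ))" ;;
    esac
}

# Usage: locker_check <directory> [latency-threshold-ms]
locker_check() {
    lc_dir=$1
    lc_limit_ms=${2:-100}

    # Create the directory if it does not exist.
    if ! mkdir -p "$lc_dir" 2>/dev/null; then
        echo "fail: cannot create directory"
        return 0
    fi

    lc_file="$lc_dir/.doctor-write-test.$$"
    lc_payload="doctor-write-test-$$"

    # Write the test file, then read it back, timing both steps together.
    lc_start=$(now_ms)
    if ! { printf '%s' "$lc_payload" > "$lc_file"; } 2>/dev/null; then
        echo "fail: write denied"
        return 0
    fi
    lc_readback=$(cat "$lc_file" 2>/dev/null)
    lc_end=$(now_ms)

    rm -f "$lc_file"                        # clean up the test file
    lc_elapsed=$(( lc_end - lc_start ))

    if [ "$lc_readback" != "$lc_payload" ]; then
        echo "fail: read-back mismatch"
    elif [ "$lc_elapsed" -gt "$lc_limit_ms" ]; then
        echo "warn: write latency ${lc_elapsed}ms"
    else
        echo "pass"
    fi
}
```

Run it as the service account, for example `locker_check /var/lib/stellaops/evidence`, so that the result reflects the permissions the platform actually has.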
## Why It Matters
The evidence locker stores cryptographically signed release evidence: attestations, SBOM snapshots, policy evaluation results, and audit trails. If the locker is not writable, releases cannot produce verifiable evidence, which blocks policy-gated promotions and breaks auditability guarantees. This is a severity-fail check because evidence integrity is a core platform invariant.

## Common Causes
- Insufficient file system permissions
- Directory owned by a different user
- SELinux/AppArmor blocking writes
- Disk full
- Filesystem mounted read-only
- Slow network-attached storage (NFS/CIFS) causing high latency

## How to Fix

### Docker Compose
```yaml
environment:
  EvidenceLocker__Path: "/var/lib/stellaops/evidence"
volumes:
  - evidence-data:/var/lib/stellaops/evidence
```

```bash
# Check permissions inside container
docker exec <platform-container> ls -la /var/lib/stellaops/evidence

# Fix permissions
docker exec <platform-container> chown -R stellaops:stellaops /var/lib/stellaops/evidence
```

### Bare Metal / systemd
```bash
# Create directory
mkdir -p /var/lib/stellaops/evidence

# Set ownership and permissions
chown -R stellaops:stellaops /var/lib/stellaops/evidence
chmod 750 /var/lib/stellaops/evidence

# Check disk space
df -h /var/lib/stellaops/evidence

# Check mount status (look for "ro" in the mount options)
mount | grep "$(df --output=source /var/lib/stellaops/evidence | tail -1)"
```

### Kubernetes / Helm
```yaml
evidenceLocker:
  path: "/var/lib/stellaops/evidence"
  persistence:
    enabled: true
    size: 50Gi
    storageClass: "fast-ssd"  # use fast storage to avoid latency warnings
```

## Verification
```bash
stella doctor run --check check.storage.evidencelocker
```

## Related Checks
- `check.storage.diskspace`: verifies sufficient disk space is available
- `check.storage.backup`: verifies backup directory accessibility