Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

@@ -0,0 +1,112 @@
---
checkId: check.scanner.slice.cache
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, cache, slice, performance]
---
# Slice Cache Health
## What It Checks
Queries the Scanner service at `/api/v1/cache/stats` and evaluates slice cache effectiveness:
- **Storage utilization**: warn at 80% full, fail at 95% full.
- **Hit rate**: warn below 50%, fail below 20%.
- **Eviction rate**: reported as evidence.
Evidence collected: `hit_rate`, `hits`, `misses`, `entry_count`, `used_bytes`, `total_bytes`, `storage_utilization`, `eviction_rate`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
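The thresholds above can be sketched as a small classifier. This is an illustrative sketch only, not the plugin's actual implementation: the names `CacheStats`, `classify_utilization`, `classify_hit_rate`, and `classify` are hypothetical, the boundary handling (inclusive at 80% and 95%, strict below 50% and 20%) is an assumption, and so is the choice to report the worse of the two results and to skip the hit-rate rule when no lookups have been recorded.

```python
from dataclasses import dataclass

# Ordering used to pick the worse of two results (assumed, not from the plugin).
SEVERITY_ORDER = {"pass": 0, "warn": 1, "fail": 2}


@dataclass
class CacheStats:
    """Subset of the evidence fields listed above (hypothetical container)."""
    hits: int
    misses: int
    used_bytes: int
    total_bytes: int


def classify_utilization(stats: CacheStats) -> str:
    """Warn at 80% full, fail at 95% full."""
    if stats.total_bytes <= 0:
        return "pass"  # assumption: no capacity reported, nothing to evaluate
    utilization = stats.used_bytes / stats.total_bytes
    if utilization >= 0.95:
        return "fail"
    if utilization >= 0.80:
        return "warn"
    return "pass"


def classify_hit_rate(stats: CacheStats) -> str:
    """Warn below 50% hit rate, fail below 20%."""
    lookups = stats.hits + stats.misses
    if lookups == 0:
        return "pass"  # assumption: an unused cache is not penalized
    hit_rate = stats.hits / lookups
    if hit_rate < 0.20:
        return "fail"
    if hit_rate < 0.50:
        return "warn"
    return "pass"


def classify(stats: CacheStats) -> str:
    """Return the worse of the utilization and hit-rate results."""
    results = [classify_utilization(stats), classify_hit_rate(stats)]
    return max(results, key=SEVERITY_ORDER.__getitem__)


if __name__ == "__main__":
    sample = CacheStats(hits=400, misses=600, used_bytes=3_000, total_bytes=4_000)
    print(classify(sample))
```

Deriving both ratios from the raw counters avoids guessing whether the endpoint reports `hit_rate` and `storage_utilization` as fractions or percentages.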
## Why It Matters
The slice cache stores pre-computed code slices used by the reachability engine. A healthy cache avoids re-analyzing the same dependency trees on every scan, reducing computation time from seconds to milliseconds. When the cache hit rate drops, reachability computations slow dramatically, causing queue backlog and delayed security feedback. When storage fills up, evictions accelerate and the cache thrashes, making it effectively useless.
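To make the effect of hit rate concrete, the following sketch models average lookup cost as a weighted mix of hit and miss latency. The function name `expected_latency_ms` and the default costs (5 ms per hit, 2000 ms per miss) are hypothetical values chosen only to match the "seconds to milliseconds" description above; they are not measurements from the Scanner service.

```python
def expected_latency_ms(hit_rate: float, hit_ms: float = 5.0, miss_ms: float = 2000.0) -> float:
    """Average cost of one slice lookup under a simple two-outcome model.

    hit_ms and miss_ms are illustrative assumptions, not measured values.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


if __name__ == "__main__":
    # Compare a healthy cache with the warn (50%) and fail (20%) thresholds.
    for rate in (0.9, 0.5, 0.2):
        print(f"hit rate {rate:.0%}: about {expected_latency_ms(rate):.0f} ms per lookup")
```

Under these assumed costs the average is roughly 0.2 s at a 90% hit rate, about 1 s at 50%, and about 1.6 s at 20%, because the miss cost dominates as soon as misses become common.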
## Common Causes
- Cache size limit too small for the working set of scanned artifacts
- TTL configured too long, preventing eviction of stale entries
- Eviction policy not taking effect because of a configuration error
- Unexpected growth in the number of unique slices (new projects onboarded)
- Cache was recently cleared (restart, volume reset)
- Working set larger than cache capacity
## How to Fix
### Docker Compose
```bash
# Clear stale cache entries
stella scanner cache prune --stale
# Warm the cache for active projects
stella scanner cache warm
```
Increase cache size in `docker-compose.stella-ops.yml`:
```yaml
services:
  scanner:
    environment:
      Scanner__Cache__MaxSizeBytes: "4294967296" # 4 GiB
      Scanner__Cache__TtlHours: "72"
      Scanner__Cache__EvictionPolicy: "LRU"
    volumes:
      - scanner-cache:/data/cache
```
### Bare Metal / systemd
```bash
# Check cache directory size
du -sh /var/lib/stellaops/scanner/cache
# Prune stale entries
stella scanner cache prune --stale
# Warm cache
stella scanner cache warm
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
  "Cache": {
    "MaxSizeBytes": 4294967296,
    "TtlHours": 72,
    "EvictionPolicy": "LRU",
    "DataPath": "/var/lib/stellaops/scanner/cache"
  }
}
```
```bash
sudo systemctl restart stellaops-scanner
```
### Kubernetes / Helm
```bash
# Check PVC usage for cache volume
kubectl exec -it <scanner-pod> -- df -h /data/cache
```
Set in Helm `values.yaml`:
```yaml
scanner:
  cache:
    maxSizeBytes: 4294967296 # 4 GB
    ttlHours: 72
    evictionPolicy: LRU
    persistence:
      enabled: true
      size: 10Gi
      storageClass: fast-ssd
```
## Verification
```bash
stella doctor run --check check.scanner.slice.cache
```
## Related Checks
- `check.scanner.reachability` -- cache misses directly increase computation time
- `check.scanner.resources` -- cache thrashing increases CPU and memory usage
- `check.scanner.queue` -- slow cache performance cascades into queue backlog