Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/scanner/slice-cache.md
+++ b/docs/doctor/articles/scanner/slice-cache.md
@@ -0,0 +1,112 @@
+---
+checkId: check.scanner.slice.cache
+plugin: stellaops.doctor.scanner
+severity: warn
+tags: [scanner, cache, slice, performance]
+---
+# Slice Cache Health
+
+## What It Checks
+Queries the Scanner service at `/api/v1/cache/stats` and evaluates slice cache effectiveness:
+
+- **Storage utilization**: warn at 80% full, fail at 95% full.
+- **Hit rate**: warn below 50%, fail below 20%.
+- **Eviction rate**: reported as evidence.
+
+Evidence collected: `hit_rate`, `hits`, `misses`, `entry_count`, `used_bytes`, `total_bytes`, `storage_utilization`, `eviction_rate`.
+
+The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
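
As a rough illustration of the thresholds listed above, the following shell sketch classifies a cache from raw counters. The function name `classify_slice_cache`, its argument order, the derivation of hit rate as hits / (hits + misses), the inclusive handling of the 80% and 95% boundaries, and the choice to skip the hit-rate test when there have been no lookups are all assumptions; the actual check implementation is not shown in this article.

```bash
# Hypothetical helper (not part of the stella CLI): classify slice cache
# health from raw counters, using the thresholds documented above.
# Usage: classify_slice_cache HITS MISSES USED_BYTES TOTAL_BYTES
classify_slice_cache() {
  local hits="$1" misses="$2" used="$3" total="$4"
  local lookups=$(( hits + misses ))
  local hit_pct=100 util_pct=0

  # Assumption: with no lookups yet there is no hit-rate signal, so treat
  # the hit rate as healthy rather than dividing by zero.
  if [ "$lookups" -gt 0 ]; then
    hit_pct=$(( hits * 100 / lookups ))
  fi
  if [ "$total" -gt 0 ]; then
    util_pct=$(( used * 100 / total ))
  fi

  # Fail conditions take precedence over warn conditions.
  if [ "$util_pct" -ge 95 ] || [ "$hit_pct" -lt 20 ]; then
    echo "fail"
  elif [ "$util_pct" -ge 80 ] || [ "$hit_pct" -lt 50 ]; then
    echo "warn"
  else
    echo "pass"
  fi
}
```

For example, `classify_slice_cache 40 60 1000 4000` prints `warn`: the hit rate is 40% (below 50% but not below 20%) while storage utilization is only 25%.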
+
+## Why It Matters
+The slice cache stores pre-computed code slices used by the reachability engine. A healthy cache avoids re-analyzing the same dependency trees on every scan, reducing computation time from seconds to milliseconds. When the cache hit rate drops, reachability computations slow dramatically, causing queue backlog and delayed security feedback. When storage fills up, evictions accelerate and the cache thrashes, making it effectively useless.
+
+## Common Causes
+- Cache size limit too small for the working set of scanned artifacts
+- TTL configured too long, so stale entries are not evicted
+- Eviction policy not taking effect because of a configuration error
+- Unexpected growth in the number of unique slices (for example, new projects onboarded)
+- Cache recently cleared (restart, volume reset), so the hit rate stays low until it warms up again
+
+## How to Fix
+
+### Docker Compose
+```bash
+# Clear stale cache entries
+stella scanner cache prune --stale
+
+# Warm the cache for active projects
+stella scanner cache warm
+```
+
+Increase cache size in `docker-compose.stella-ops.yml`:
+
+```yaml
+services:
+  scanner:
+    environment:
+      Scanner__Cache__MaxSizeBytes: "4294967296"  # 4 GiB
+      Scanner__Cache__TtlHours: "72"
+      Scanner__Cache__EvictionPolicy: "LRU"
+    volumes:
+      - scanner-cache:/data/cache
+```
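
`MaxSizeBytes` takes a raw byte count, and 4294967296 is 4 × 1024 × 1024 × 1024. A small shell helper can derive the value for other sizes; `gib_to_bytes` is an illustrative name chosen here, not a stella command.

```bash
# Hypothetical helper: convert a whole number of GiB to bytes for
# settings such as Scanner__Cache__MaxSizeBytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 4   # prints 4294967296
```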
+
+### Bare Metal / systemd
+```bash
+# Check cache directory size
+du -sh /var/lib/stellaops/scanner/cache
+
+# Prune stale entries
+stella scanner cache prune --stale
+
+# Warm cache
+stella scanner cache warm
+```
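
To relate the `du` output to the configured limit, the directory size can be expressed as a percentage of `MaxSizeBytes`. This is only a sketch: `cache_dir_usage_pct` is a hypothetical helper, and on-disk usage is an approximation that may differ from the `used_bytes` and `storage_utilization` values the Scanner reports.

```bash
# Hypothetical helper: print a directory's disk usage as an integer
# percentage of a byte limit.
# Usage: cache_dir_usage_pct DIRECTORY LIMIT_BYTES
cache_dir_usage_pct() {
  local dir="$1" limit_bytes="$2"
  local used_kib
  # du -sk reports usage in KiB and works with both GNU and BSD du.
  used_kib=$(du -sk "$dir" | cut -f1)
  echo $(( used_kib * 1024 * 100 / limit_bytes ))
}
```

A call such as `cache_dir_usage_pct /var/lib/stellaops/scanner/cache 4294967296` then gives a number that can be compared with the 80% and 95% thresholds.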
+
+Edit `/etc/stellaops/scanner/appsettings.json`:
+
+```json
+{
+  "Scanner": {
+    "Cache": {
+      "MaxSizeBytes": 4294967296,
+      "TtlHours": 72,
+      "EvictionPolicy": "LRU",
+      "DataPath": "/var/lib/stellaops/scanner/cache"
+    }
+  }
+}
+```
+
+```bash
+sudo systemctl restart stellaops-scanner
+```
+
+### Kubernetes / Helm
+```bash
+# Check PVC usage for cache volume
+kubectl exec -it <scanner-pod> -- df -h /data/cache
+```
+
+Set in Helm `values.yaml`:
+
+```yaml
+scanner:
+  cache:
+    maxSizeBytes: 4294967296  # 4 GiB
+    ttlHours: 72
+    evictionPolicy: LRU
+    persistence:
+      enabled: true
+      size: 10Gi
+      storageClass: fast-ssd
+```
+
+## Verification
+```bash
+stella doctor run --check check.scanner.slice.cache
+```
+
+## Related Checks
+- `check.scanner.reachability` -- cache misses directly increase computation time
+- `check.scanner.resources` -- cache thrashing increases CPU and memory usage
+- `check.scanner.queue` -- slow cache performance cascades into queue backlog