Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/scanner/resources.md
+++ b/docs/doctor/articles/scanner/resources.md
@@ -0,0 +1,119 @@
+---
+checkId: check.scanner.resources
+plugin: stellaops.doctor.scanner
+severity: warn
+tags: [scanner, resources, cpu, memory, workers]
+---
+# Scanner Resource Utilization
+
+## What It Checks
+Queries the Scanner service at `/api/v1/resources/stats` and evaluates CPU, memory, and worker pool health:
+
+- **CPU utilization**: warn at 75%, fail at 90%.
+- **Memory utilization**: warn at 80%, fail at 95%.
+- **Worker pool saturation**: warn when all workers are busy (zero idle workers).
+
+Evidence collected: `cpu_utilization`, `memory_utilization`, `memory_used_mb`, `active_workers`, `total_workers`, `idle_workers`.
+
+The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
+
+## Why It Matters
+The scanner is one of the most resource-intensive services in the Stella Ops stack. It processes container images, generates SBOMs, runs vulnerability matching, and performs reachability analysis. When scanner resources are exhausted, all downstream pipelines stall: queue depth grows, scan latency increases, and release gates time out waiting for scan results. Memory exhaustion can cause OOM kills that lose in-progress work.
+
+## Common Causes
+- High scan volume during bulk import or CI surge
+- Memory leak from accumulated scan artifacts not being garbage collected
+- Large container images (multi-GB layers) being processed concurrently
+- Insufficient CPU/memory allocation relative to workload
+- All workers busy with no capacity for new jobs
+- Worker scaling not keeping up with demand
+
+## How to Fix
+
+### Docker Compose
+```bash
+# Check scanner resource usage
+docker stats scanner --no-stream
+```
+
+To lower resource pressure, reduce concurrent jobs and cap resources
+in `docker-compose.stella-ops.yml`:
+
+```yaml
+services:
+  scanner:
+    deploy:
+      resources:
+        limits:
+          memory: 4G
+          cpus: "4.0"
+    environment:
+      Scanner__MaxConcurrentJobs: "2"
+      Scanner__Workers__Count: "4"
+```
+
+```bash
+# Restart scanner to apply new resource limits
+docker compose -f docker-compose.stella-ops.yml up -d scanner
+```
+
+### Bare Metal / systemd
+```bash
+# Check current resource usage (-d, joins multiple matching PIDs with commas, as top -p expects)
+top -p "$(pgrep -d, -f stellaops-scanner)"
+
+# Reduce concurrent processing
+stella scanner config set MaxConcurrentJobs 2
+```
+
+Edit `/etc/stellaops/scanner/appsettings.json`:
+
+```json
+{
+  "Scanner": {
+    "MaxConcurrentJobs": 2,
+    "Workers": {
+      "Count": 4
+    }
+  }
+}
+```
+
+```bash
+sudo systemctl restart stellaops-scanner
+```
+
+### Kubernetes / Helm
+```bash
+# Check pod resource usage
+kubectl top pods -l app=stellaops-scanner
+
+# Scale horizontally instead of vertically
+kubectl scale deployment stellaops-scanner --replicas=4
+```
+
+Set in Helm `values.yaml`:
+
+```yaml
+scanner:
+  replicas: 4
+  resources:
+    requests:
+      memory: 2Gi
+      cpu: "2"
+    limits:
+      memory: 4Gi
+      cpu: "4"
+  maxConcurrentJobs: 2
+```
+
+## Verification
+```bash
+stella doctor run --check check.scanner.resources
+```
+
+## Related Checks
+- `check.scanner.queue` -- resource exhaustion causes queue backlog growth
+- `check.scanner.sbom` -- memory exhaustion causes SBOM generation failures
+- `check.scanner.reachability` -- CPU constraints increase reachability computation time
+- `check.scanner.slice.cache` -- cache effectiveness reduces resource demand
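
As a reading aid for the thresholds listed under "What It Checks", the grading logic can be sketched roughly as follows. This is a minimal sketch and not the plugin's implementation: `ResourceStats`, `grade`, and `evaluate` are hypothetical names, treating "warn at 75%" as an inclusive comparison is an assumption, and so is reporting the worst of the three results as the overall status.

```python
from dataclasses import dataclass

# Thresholds as stated in the article. Whether the real check uses >= or >
# is not stated, so the inclusive comparison below is an assumption.
CPU_WARN, CPU_FAIL = 75.0, 90.0
MEM_WARN, MEM_FAIL = 80.0, 95.0

# Ordering used to pick the worst result (assumed aggregation rule).
SEVERITY_ORDER = {"pass": 0, "warn": 1, "fail": 2}


@dataclass
class ResourceStats:
    """Hypothetical holder for a subset of the evidence fields."""
    cpu_utilization: float     # percent, 0-100
    memory_utilization: float  # percent, 0-100
    idle_workers: int


def grade(value: float, warn_at: float, fail_at: float) -> str:
    """Map a utilization percentage to pass, warn, or fail."""
    if value >= fail_at:
        return "fail"
    if value >= warn_at:
        return "warn"
    return "pass"


def evaluate(stats: ResourceStats) -> str:
    """Return the worst severity across CPU, memory, and the worker pool."""
    results = [
        grade(stats.cpu_utilization, CPU_WARN, CPU_FAIL),
        grade(stats.memory_utilization, MEM_WARN, MEM_FAIL),
        # Worker pool saturation only warns: zero idle workers.
        "warn" if stats.idle_workers == 0 else "pass",
    ]
    return max(results, key=SEVERITY_ORDER.__getitem__)
```

For example, `evaluate(ResourceStats(82.0, 60.0, 1))` grades CPU as a warning and everything else as passing, so the overall result is `"warn"`.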