Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions


@@ -0,0 +1,119 @@
---
checkId: check.scanner.resources
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, resources, cpu, memory, workers]
---
# Scanner Resource Utilization
## What It Checks
Queries the Scanner service at `/api/v1/resources/stats` and evaluates CPU, memory, and worker pool health:
- **CPU utilization**: warn at 75%, fail at 90%.
- **Memory utilization**: warn at 80%, fail at 95%.
- **Worker pool saturation**: warn when all workers are busy (zero idle workers).

Evidence collected: `cpu_utilization`, `memory_utilization`, `memory_used_mb`, `active_workers`, `total_workers`, `idle_workers`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
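The threshold logic above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's implementation: the function names (`grade`, `evaluate`), the flat payload shape, the inclusive comparisons, and the "worst finding wins" aggregation are all assumptions. Only the threshold values and the evidence field names come from this article.

```python
# Thresholds from this article, in percent.
# Assumption: comparisons are inclusive ("warn at 75%" means >= 75).
CPU_WARN, CPU_FAIL = 75.0, 90.0
MEM_WARN, MEM_FAIL = 80.0, 95.0

# Used to pick the worst finding as the overall result (an assumption).
SEVERITY_ORDER = {"pass": 0, "warn": 1, "fail": 2}


def grade(value, warn_at, fail_at):
    """Map a utilization percentage to pass/warn/fail."""
    if value >= fail_at:
        return "fail"
    if value >= warn_at:
        return "warn"
    return "pass"


def evaluate(stats):
    """Grade a stats payload keyed by the evidence field names.

    `stats` is a hypothetical flat dict; the real response shape of
    /api/v1/resources/stats is not documented here.
    """
    findings = {
        "cpu": grade(stats["cpu_utilization"], CPU_WARN, CPU_FAIL),
        "memory": grade(stats["memory_utilization"], MEM_WARN, MEM_FAIL),
        # Worker saturation only ever warns: zero idle workers.
        "workers": "warn" if stats["idle_workers"] == 0 else "pass",
    }
    overall = max(findings.values(), key=SEVERITY_ORDER.__getitem__)
    return {"overall": overall, "findings": findings}
```

Under these assumptions, a payload with CPU at 91%, memory at 85%, and no idle workers grades as `fail` overall, with memory and workers each contributing a `warn`.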
## Why It Matters
The scanner is one of the most resource-intensive services in the Stella Ops stack. It processes container images, generates SBOMs, runs vulnerability matching, and performs reachability analysis. When scanner resources are exhausted, all downstream pipelines stall: queue depth grows, scan latency increases, and release gates time out waiting for scan results. Memory exhaustion can cause OOM kills that lose in-progress work.
## Common Causes
- High scan volume during bulk import or CI surge
- Memory leak from accumulated scan artifacts not being garbage collected
- Large container images (multi-GB layers) being processed concurrently
- Insufficient CPU/memory allocation relative to workload
- All workers busy with no capacity for new jobs
- Worker scaling not keeping up with demand
## How to Fix
### Docker Compose
```bash
# Check scanner resource usage
docker stats scanner --no-stream
# Reduce concurrent jobs to lower resource pressure
# In docker-compose.stella-ops.yml:
```
```yaml
services:
  scanner:
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "4.0"
    environment:
      Scanner__MaxConcurrentJobs: "2"
      Scanner__Workers__Count: "4"
```
```bash
# Restart scanner to apply new resource limits
docker compose -f docker-compose.stella-ops.yml up -d scanner
```
### Bare Metal / systemd
```bash
# Check current resource usage
# -d, joins multiple PIDs with commas, which is the list format top -p expects
top -p "$(pgrep -d, -f stellaops-scanner)"
# Reduce concurrent processing
stella scanner config set MaxConcurrentJobs 2
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
  "Scanner": {
    "MaxConcurrentJobs": 2,
    "Workers": {
      "Count": 4
    }
  }
}
```
```bash
sudo systemctl restart stellaops-scanner
```
### Kubernetes / Helm
```bash
# Check pod resource usage
kubectl top pods -l app=stellaops-scanner
# Scale horizontally instead of vertically
kubectl scale deployment stellaops-scanner --replicas=4
```
Set in Helm `values.yaml`:
```yaml
scanner:
  replicas: 4
  resources:
    requests:
      memory: 2Gi
      cpu: "2"
    limits:
      memory: 4Gi
      cpu: "4"
  maxConcurrentJobs: 2
```
## Verification
```bash
stella doctor run --check check.scanner.resources
```
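To spot-check the raw numbers behind the check result, the stats endpoint can also be read directly. The sketch below is a hypothetical helper, not part of the `stella` CLI: it assumes the endpoint returns a flat JSON object keyed by the evidence field names and that it is reachable without authentication, neither of which this article confirms.

```python
import json
import urllib.request

STATS_PATH = "/api/v1/resources/stats"  # path named in "What It Checks"

# Evidence fields listed in this article.
EVIDENCE_KEYS = (
    "cpu_utilization",
    "memory_utilization",
    "memory_used_mb",
    "active_workers",
    "total_workers",
    "idle_workers",
)


def stats_url(base_url):
    """Join the configured Scanner base URL with the stats path."""
    return base_url.rstrip("/") + STATS_PATH


def format_evidence(stats):
    """Render the evidence fields on one line; missing keys show as n/a."""
    return ", ".join(f"{key}={stats.get(key, 'n/a')}" for key in EVIDENCE_KEYS)


def fetch_stats(base_url, timeout=5.0):
    """Fetch and decode the stats payload.

    Assumption: the endpoint answers an unauthenticated GET with JSON.
    """
    with urllib.request.urlopen(stats_url(base_url), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `format_evidence(fetch_stats(base_url))` with the value configured for `Scanner:Url` prints the six evidence fields, which can then be compared against the thresholds listed under What It Checks.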
## Related Checks
- `check.scanner.queue` -- resource exhaustion causes queue backlog growth
- `check.scanner.sbom` -- memory exhaustion causes SBOM generation failures
- `check.scanner.reachability` -- CPU constraints increase reachability computation time
- `check.scanner.slice.cache` -- an effective slice cache reduces resource demand