| checkId | plugin | severity | tags |
|---|---|---|---|
| check.scanner.resources | stellaops.doctor.scanner | warn | |
Scanner Resource Utilization
What It Checks
Queries the Scanner service at /api/v1/resources/stats and evaluates CPU, memory, and worker pool health:
- CPU utilization: warn at 75%, fail at 90%.
- Memory utilization: warn at 80%, fail at 95%.
- Worker pool saturation: warn when all workers are busy (zero idle workers).
Evidence collected: cpu_utilization, memory_utilization, memory_used_mb, active_workers, total_workers, idle_workers.
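The thresholds above can be sketched as a small classifier. This is an illustration only, not the check's actual implementation: the `ScannerStats` type, the `evaluate` function, the inclusive boundaries, the "worst signal wins" aggregation, and deriving idle workers as total minus active are all assumptions; only the threshold values and evidence field names come from this article.

```python
from dataclasses import dataclass

# Thresholds from the check description (percent).
CPU_WARN, CPU_FAIL = 75.0, 90.0
MEM_WARN, MEM_FAIL = 80.0, 95.0

# Ranking used to pick the worst result across signals.
_ORDER = {"pass": 0, "warn": 1, "fail": 2}


@dataclass
class ScannerStats:
    """Hypothetical container named after the evidence fields."""
    cpu_utilization: float     # percent, 0-100 (assumed unit)
    memory_utilization: float  # percent, 0-100 (assumed unit)
    active_workers: int
    total_workers: int

    @property
    def idle_workers(self) -> int:
        # Assumption: idle = total - active; the service may report it directly.
        return max(self.total_workers - self.active_workers, 0)


def _grade(value, warn_at, fail_at):
    # Inclusive boundaries are an assumption; the article only says "at".
    if value >= fail_at:
        return "fail"
    if value >= warn_at:
        return "warn"
    return "pass"


def evaluate(stats):
    """Return (overall, per-signal results); overall is the worst signal."""
    saturated = stats.total_workers > 0 and stats.idle_workers == 0
    results = {
        "cpu": _grade(stats.cpu_utilization, CPU_WARN, CPU_FAIL),
        "memory": _grade(stats.memory_utilization, MEM_WARN, MEM_FAIL),
        # Worker saturation only ever warns, per the description.
        "workers": "warn" if saturated else "pass",
    }
    overall = max(results.values(), key=_ORDER.__getitem__)
    return overall, results
```

For example, `evaluate(ScannerStats(50, 96, 2, 4))` grades memory as `fail` and therefore reports `fail` overall, while a scanner with low CPU and memory but no idle workers reports `warn`.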
The check requires Scanner:Url or Services:Scanner:Url to be configured.
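A minimal sketch of how the stats URL might be resolved from those two settings. The `resolve_stats_url` helper, the flat dictionary of colon-separated keys, and the lookup order (`Scanner:Url` first) are assumptions made for illustration; the article states only that one of the two keys must be configured.

```python
from urllib.parse import urljoin

STATS_PATH = "api/v1/resources/stats"
# Lookup order is an assumption; the article does not say which key wins.
URL_KEYS = ("Scanner:Url", "Services:Scanner:Url")


def resolve_stats_url(config):
    """Return the stats endpoint URL, or None when no scanner URL is set."""
    for key in URL_KEYS:
        base = (config.get(key) or "").strip()
        if base:
            # A trailing slash makes urljoin keep any path prefix in the base URL.
            return urljoin(base.rstrip("/") + "/", STATS_PATH)
    return None
```

A `None` result corresponds to the unconfigured case; how the real check reports that case is not described here.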
Why It Matters
The scanner is one of the most resource-intensive services in the Stella Ops stack. It processes container images, generates SBOMs, runs vulnerability matching, and performs reachability analysis. When scanner resources are exhausted, all downstream pipelines stall: queue depth grows, scan latency increases, and release gates time out waiting for scan results. Memory exhaustion can cause OOM kills that lose in-progress work.
Common Causes
- High scan volume during bulk import or CI surge
- Memory leak from accumulated scan artifacts not being garbage collected
- Large container images (multi-GB layers) being processed concurrently
- Insufficient CPU/memory allocation relative to workload
- All workers busy with no capacity for new jobs
- Worker scaling not keeping up with demand
How to Fix
Docker Compose
# Check scanner resource usage
docker stats scanner --no-stream
# Reduce concurrent jobs to lower resource pressure
# In docker-compose.stella-ops.yml:
services:
  scanner:
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "4.0"
    environment:
      Scanner__MaxConcurrentJobs: "2"
      Scanner__Workers__Count: "4"
# Restart scanner to apply new resource limits
docker compose -f docker-compose.stella-ops.yml up -d scanner
Bare Metal / systemd
# Check current resource usage
top -p "$(pgrep -d, -f stellaops-scanner)"
# Reduce concurrent processing
stella scanner config set MaxConcurrentJobs 2
Edit /etc/stellaops/scanner/appsettings.json:
{
  "Scanner": {
    "MaxConcurrentJobs": 2,
    "Workers": {
      "Count": 4
    }
  }
}
sudo systemctl restart stellaops-scanner
Kubernetes / Helm
# Check pod resource usage
kubectl top pods -l app=stellaops-scanner
# Scale horizontally instead of vertically
kubectl scale deployment stellaops-scanner --replicas=4
Set in Helm values.yaml:
scanner:
  replicas: 4
  resources:
    requests:
      memory: 2Gi
      cpu: "2"
    limits:
      memory: 4Gi
      cpu: "4"
  maxConcurrentJobs: 2
Verification
stella doctor run --check check.scanner.resources
Related Checks
- check.scanner.queue -- resource exhaustion causes queue backlog growth
- check.scanner.sbom -- memory exhaustion causes SBOM generation failures
- check.scanner.reachability -- CPU constraints slow computation times
- check.scanner.slice.cache -- cache effectiveness reduces resource demand