master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00

3.1 KiB

Scanner Resource Utilization

What It Checks

Queries the Scanner service at /api/v1/resources/stats and evaluates CPU, memory, and worker pool health:

CPU utilization: warn at 75%, fail at 90%.
Memory utilization: warn at 80%, fail at 95%.
Worker pool saturation: warn when all workers are busy (zero idle workers).

Evidence collected: cpu_utilization, memory_utilization, memory_used_mb, active_workers, total_workers, idle_workers.
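
The thresholds above can be sketched as a small set of shell functions. This is only an illustration of the documented rules, not the Doctor plugin's actual implementation. The function names are hypothetical, and the sketch assumes that utilization values are percentages (0 to 100) and that each threshold is inclusive.

```shell
# Sketch of the documented thresholds; all function names are hypothetical.

# classify_util VALUE WARN FAIL -> prints pass, warn, or fail.
# Assumes VALUE is a percentage and that thresholds are inclusive.
classify_util() {
  awk -v v="$1" -v w="$2" -v f="$3" 'BEGIN {
    if (v + 0 >= f + 0) { print "fail" }
    else if (v + 0 >= w + 0) { print "warn" }
    else { print "pass" }
  }'
}

# classify_workers IDLE TOTAL -> prints warn when no worker is idle.
classify_workers() {
  if [ "$2" -gt 0 ] && [ "$1" -eq 0 ]; then
    echo "warn"
  else
    echo "pass"
  fi
}

# worst STATUS... -> prints the most severe status among its arguments.
worst() {
  local result="pass" s
  for s in "$@"; do
    case "$s" in
      fail) result="fail" ;;
      warn) [ "$result" = "fail" ] || result="warn" ;;
    esac
  done
  echo "$result"
}

# evaluate CPU MEM IDLE TOTAL -> overall status using the documented limits
# (CPU 75/90, memory 80/95, zero idle workers is a warning).
evaluate() {
  worst "$(classify_util "$1" 75 90)" \
        "$(classify_util "$2" 80 95)" \
        "$(classify_workers "$3" "$4")"
}
```

For example, `evaluate 82.5 60 1 4` prints `warn`, because only the CPU value crosses a warning threshold.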

The check requires Scanner:Url or Services:Scanner:Url to be configured.
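
To inspect the endpoint by hand, the URL lookup can be approximated in shell. The sketch below assumes the settings are supplied as environment variables using the double-underscore form seen elsewhere in this article (`Scanner__Url`, `Services__Scanner__Url`), and that `Scanner:Url` takes precedence when both are set. The precedence order and the function name are assumptions, not confirmed behavior of the check.

```shell
# resolve_scanner_url (hypothetical helper): prints the stats endpoint URL,
# or fails when neither setting is present.
# Assumption: Scanner__Url wins over Services__Scanner__Url.
resolve_scanner_url() {
  local base="${Scanner__Url:-${Services__Scanner__Url:-}}"
  if [ -z "$base" ]; then
    echo "scanner URL not configured" >&2
    return 1
  fi
  # Drop one trailing slash so the path is not doubled.
  printf '%s/api/v1/resources/stats\n' "${base%/}"
}
```

A manual query could then look like `curl -fsS "$(resolve_scanner_url)"`. The shape of the response body is not described here, so inspect it before scripting against it.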

Why It Matters

The scanner is one of the most resource-intensive services in the Stella Ops stack. It processes container images, generates SBOMs, runs vulnerability matching, and performs reachability analysis. When scanner resources are exhausted, all downstream pipelines stall: queue depth grows, scan latency increases, and release gates time out waiting for scan results. Memory exhaustion can cause OOM kills that lose in-progress work.

Common Causes

High scan volume during bulk import or CI surge
Memory leak from accumulated scan artifacts not being garbage collected
Large container images (multi-GB layers) being processed concurrently
Insufficient CPU/memory allocation relative to workload
All workers busy with no capacity for new jobs
Worker scaling not keeping up with demand

How to Fix

Docker Compose

# Check scanner resource usage
docker stats scanner --no-stream

# Reduce concurrent jobs to lower resource pressure
# In docker-compose.stella-ops.yml:

services:
  scanner:
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "4.0"
    environment:
      Scanner__MaxConcurrentJobs: "2"
      Scanner__Workers__Count: "4"

# Restart scanner to apply new resource limits
docker compose -f docker-compose.stella-ops.yml up -d scanner

Bare Metal / systemd

# Check current resource usage
# (pgrep -d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, -f stellaops-scanner)"

# Reduce concurrent processing
stella scanner config set MaxConcurrentJobs 2

Edit /etc/stellaops/scanner/appsettings.json:

{
  "Scanner": {
    "MaxConcurrentJobs": 2,
    "Workers": {
      "Count": 4
    }
  }
}

# Restart the scanner to apply the new settings
sudo systemctl restart stellaops-scanner

Kubernetes / Helm

# Check pod resource usage
kubectl top pods -l app=stellaops-scanner

# Scale horizontally instead of vertically
kubectl scale deployment stellaops-scanner --replicas=4

Set in Helm values.yaml:

scanner:
  replicas: 4
  resources:
    requests:
      memory: 2Gi
      cpu: "2"
    limits:
      memory: 4Gi
      cpu: "4"
  maxConcurrentJobs: 2

Verification

stella doctor run --check check.scanner.resources

Related Checks

check.scanner.queue -- resource exhaustion causes queue backlog growth
check.scanner.sbom -- memory exhaustion causes SBOM generation failures
check.scanner.reachability -- CPU constraints slow computation times
check.scanner.slice.cache -- cache effectiveness reduces resource demand
