---
checkId: check.scanner.queue
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, queue, jobs, processing]
---
# Scanner Queue Health

## What It Checks

Queries the Scanner service at `/api/v1/queue/stats` and evaluates job queue health across four dimensions:

- **Queue depth**: warn at 100+ pending jobs, fail at 500+.
- **Failure rate**: warn at 5%+ of processed jobs failing, fail at 15%+.
- **Stuck jobs**: any stuck jobs trigger an immediate fail.
- **Backlog growth**: a growing backlog triggers a warning.

Evidence collected: `queue_depth`, `processing_rate_per_min`, `stuck_jobs`, `failed_jobs`, `failure_rate`, `oldest_job_age_min`, `backlog_growing`.

The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured; otherwise it is skipped.

## Why It Matters

The scanner queue is the central work pipeline for SBOM generation, vulnerability scanning, and reachability analysis. A backlogged or stuck queue delays security findings, blocks release gates that depend on scan results, and can cascade into approval timeouts. Stuck jobs indicate a worker crash or resource failure that will not self-heal.

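
The thresholds listed under What It Checks can be sketched as a small Python function. This is an illustrative sketch, not the check's actual implementation: the function name `evaluate_queue` is hypothetical, the dictionary keys reuse the evidence names above, `failure_rate` is assumed to be a fraction (0.05 means 5%), and the precedence of fail over warn is an assumption.

```python
from typing import Any, Mapping

# Thresholds copied from the "What It Checks" list.
DEPTH_WARN, DEPTH_FAIL = 100, 500
FAILURE_WARN, FAILURE_FAIL = 0.05, 0.15  # assumed to be fractions, not percentages


def evaluate_queue(stats: Mapping[str, Any]) -> str:
    """Return "fail", "warn", or "pass" for one evidence snapshot (hypothetical helper)."""
    depth = stats.get("queue_depth", 0)
    rate = stats.get("failure_rate", 0.0)

    # Any stuck job is an immediate fail.
    if stats.get("stuck_jobs", 0) > 0:
        return "fail"
    # Fail thresholds are checked before warn thresholds (assumed precedence).
    if depth >= DEPTH_FAIL or rate >= FAILURE_FAIL:
        return "fail"
    if depth >= DEPTH_WARN or rate >= FAILURE_WARN or stats.get("backlog_growing", False):
        return "warn"
    return "pass"
```

For example, under these assumptions a snapshot with `queue_depth` of 120 and no other problems would yield a warning, while a single stuck job would yield a fail regardless of depth.
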
## Common Causes

- Scanner worker process crashed or was OOM-killed
- Job dependency (registry, database) became unavailable mid-scan
- Resource exhaustion (CPU, memory, disk) on the scanner host
- Database connection lost during job processing
- Sudden spike in image pushes overwhelming worker capacity
- Processing rate slower than ingest rate during bulk import

## How to Fix

### Docker Compose

Check scanner worker status and restart if needed:

```bash
# View scanner container logs for errors
docker compose -f docker-compose.stella-ops.yml logs --tail 200 scanner

# Restart the scanner service
docker compose -f docker-compose.stella-ops.yml restart scanner

# Scale scanner workers (if using replicas)
docker compose -f docker-compose.stella-ops.yml up -d --scale scanner=4
```

Adjust concurrency via environment variables:

```yaml
environment:
  Scanner__Queue__MaxConcurrentJobs: "4"
  Scanner__Queue__StuckJobTimeoutMinutes: "30"
```

### Bare Metal / systemd

```bash
# Check scanner service status
sudo systemctl status stellaops-scanner

# View recent logs
sudo journalctl -u stellaops-scanner --since "1 hour ago"

# Restart the service
sudo systemctl restart stellaops-scanner
```

Edit `/etc/stellaops/scanner/appsettings.json`:

```json
{
  "Queue": {
    "MaxConcurrentJobs": 4,
    "StuckJobTimeoutMinutes": 30
  }
}
```

### Kubernetes / Helm

```bash
# Check scanner pod status
kubectl get pods -l app=stellaops-scanner

# View logs for crash loops
kubectl logs -l app=stellaops-scanner --tail=200

# Scale scanner deployment
kubectl scale deployment stellaops-scanner --replicas=4
```

Set in Helm `values.yaml`:

```yaml
scanner:
  replicas: 4
  queue:
    maxConcurrentJobs: 4
    stuckJobTimeoutMinutes: 30
```

## Verification

```
stella doctor run --check check.scanner.queue
```

## Related Checks

- `check.scanner.resources` -- scanner CPU/memory utilization affecting processing rate
- `check.scanner.sbom` -- SBOM generation failures may originate from queue issues
- `check.scanner.vuln` -- vulnerability scan health depends on queue throughput
- `check.operations.job-queue` -- platform-wide job queue health