# Runbook: Scanner - Worker Not Processing Jobs

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage | Task: RUN-002 - Scanner Runbooks
## Metadata

| Field | Value |
|---|---|
| Component | Scanner |
| Severity | Critical |
| On-call scope | Platform team |
| Last updated | 2026-01-17 |
| Doctor check | `check.scanner.worker-health` |
## Symptoms

- Scan jobs stuck in "pending" or "processing" state for >5 minutes
- Scanner worker process shows 0% CPU usage
- Alert `ScannerWorkerStuck` or `ScannerQueueBacklog` firing
- UI shows "Scan in progress" indefinitely
- Metric `scanner_jobs_pending` increasing over time (a quick way to sample this is sketched below)
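To confirm that the backlog is actually growing rather than merely large, sample the queue depth a few times before digging deeper. The loop below is a minimal sketch that reuses the `stella scanner queue status` command from the diagnosis steps; the `grep` parsing assumes the output contains a "Queue depth: N" line, which is an assumption about the output format rather than a documented contract.

```bash
# Sample queue depth five times, 30 seconds apart, to see whether the backlog is growing.
# Output parsing is illustrative; adapt it to the actual `stella scanner queue status` format.
for i in 1 2 3 4 5; do
  depth=$(stella scanner queue status | grep -i 'queue depth' | grep -o '[0-9]\+' | head -n 1)
  echo "$(date -u +%H:%M:%SZ) queue_depth=${depth:-unknown}"
  sleep 30
done
```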
## Impact

| Impact Type | Description |
|---|---|
| User-facing | New scans cannot complete, blocking CI/CD pipelines and release gates |
| Data integrity | No data loss; pending jobs will resume when worker recovers |
| SLA impact | Scan latency SLO violated if not resolved within 15 minutes |
## Diagnosis

### Quick checks (< 2 minutes)

- Check Doctor diagnostics:

  ```bash
  stella doctor --check check.scanner.worker-health
  ```

- Check scanner service status:

  ```bash
  stella scanner status
  ```

  - Expected: "Scanner workers: 4 active, 0 idle"
  - Problem: "Scanner workers: 0 active" or "status: degraded"

- Check job queue depth:

  ```bash
  stella scanner queue status
  ```

  - Expected: Queue depth < 50
  - Problem: Queue depth > 100 or growing rapidly

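When paged, it can help to run all three quick checks in one pass. The wrapper below is a sketch, not part of the product; it assumes each `stella` command exits non-zero on failure, which is an assumption about the CLI rather than documented behavior.

```bash
# Run the three quick checks in sequence and stop at the first failure.
# Assumes the CLI returns a non-zero exit code when a check fails.
set -e
stella doctor --check check.scanner.worker-health
stella scanner status
stella scanner queue status
echo "Quick checks completed; inspect the output above for degraded workers or a growing queue."
```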
### Deep diagnosis

- Check worker process logs:

  ```bash
  stella scanner logs --tail 100 --level error
  ```

  Look for: "timeout", "connection refused", "out of memory"

- Check Valkey connectivity (job queue):

  ```bash
  stella doctor --check check.storage.valkey
  ```

- Check whether workers were OOM-killed (a host-level cross-check is sketched after this list):

  ```bash
  stella scanner workers inspect
  ```

  Look for: "exit_code: 137" (OOM kill) or "exit_code: 143" (SIGTERM)

- Check resource utilization:

  ```bash
  stella obs metrics --filter scanner --last 10m
  ```

  Look for: Memory > 90%, sustained CPU > 95%

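If `exit_code: 137` appears, it is worth confirming at the host level that the kernel OOM killer was responsible. These are generic Linux commands, not part of the Stella CLI, and assume you have shell access to the scanner host.

```bash
# Generic Linux cross-check for OOM-killer activity on the scanner host (not a Stella command).
sudo dmesg -T | grep -iE 'out of memory|oom-kill' | tail -n 20

# On systemd hosts, the kernel journal carries the same events:
sudo journalctl -k --since "1 hour ago" | grep -i oom | tail -n 20
```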
## Resolution

### Immediate mitigation

- Restart scanner workers:

  ```bash
  stella scanner workers restart
  ```

  This terminates the current workers and spawns fresh ones.

- If the restart fails, force-restart the scanner service:

  ```bash
  stella service restart scanner
  ```

- Verify that workers are processing again (see the sketch after this list):

  ```bash
  stella scanner queue status --watch
  ```

  Queue depth should start decreasing.

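If you prefer a non-interactive check over `--watch`, the sketch below restarts the workers and compares two queue-depth samples a minute apart. It reuses only commands from this runbook; the 60-second interval and the output parsing are assumptions, not documented behavior.

```bash
# Restart workers, then take two queue-depth samples to confirm the backlog is draining.
# Parsing assumes a "Queue depth: N" line in the output; adjust to the real format.
stella scanner workers restart
sleep 60
before=$(stella scanner queue status | grep -i 'queue depth' | grep -o '[0-9]\+' | head -n 1)
sleep 60
after=$(stella scanner queue status | grep -i 'queue depth' | grep -o '[0-9]\+' | head -n 1)
echo "queue depth: ${before:-?} -> ${after:-?}"
if [ "${after:-0}" -ge "${before:-0}" ]; then
  echo "Queue is not draining; continue with the root cause fixes below." >&2
fi
```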
### Root cause fix

If workers were OOM-killed:

- Increase the worker memory limit:

  ```bash
  stella scanner config set worker.memory_limit 4Gi
  stella scanner workers restart
  ```

- Reduce concurrent scans per worker (see the sizing note after this list):

  ```bash
  stella scanner config set worker.concurrency 2
  stella scanner workers restart
  ```

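The two settings interact: the memory limit must cover the peak usage of all scans a worker runs concurrently. The sketch below applies both changes with a single restart; the values are the ones quoted above, and the sizing rule of thumb is an assumption to validate against your own workload.

```bash
# Apply both OOM mitigations, then restart once.
# Rule of thumb (assumption): worker.memory_limit >= peak memory per scan x worker.concurrency, plus headroom.
stella scanner config set worker.memory_limit 4Gi
stella scanner config set worker.concurrency 2   # 4Gi / 2 concurrent scans leaves roughly 2Gi per scan
stella scanner workers restart
```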
If the Valkey connection failed:

- Check Valkey health (a direct connectivity probe is sketched after this list):

  ```bash
  stella doctor --check check.storage.valkey
  ```

- Restart Valkey if needed (see `valkey-connection-failure.md`).

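If the Doctor check is inconclusive and you have network access to the queue backend, a direct probe can separate a scanner-side problem from a Valkey-side one. Valkey is wire-compatible with Redis, so `valkey-cli` or `redis-cli` both work; the host and port below are placeholders, not values defined by this runbook.

```bash
# Direct liveness probe against the job-queue backend (expect: PONG).
# VALKEY_HOST and VALKEY_PORT are placeholders; substitute your deployment's endpoint.
valkey-cli -h "${VALKEY_HOST:-localhost}" -p "${VALKEY_PORT:-6379}" ping
# redis-cli -h "${VALKEY_HOST:-localhost}" -p "${VALKEY_PORT:-6379}" ping   # equivalent alternative
```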
If workers are deadlocked:

- Enable deadlock detection:

  ```bash
  stella scanner config set worker.deadlock_detection true
  stella scanner workers restart
  ```

## Verification

```bash
# Verify workers are healthy
stella doctor --check check.scanner.worker-health

# Submit a test scan
stella scan image --image alpine:latest --dry-run

# Watch queue drain
stella scanner queue status --watch

# Verify no errors in recent logs
stella scanner logs --tail 20 --level error
```
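For a post-incident checklist, the non-interactive verification steps can be wrapped in a script. This is a sketch that assumes the `stella` commands exit non-zero on failure (an assumption about the CLI); the `--watch` step is left out because it does not terminate on its own, and the log check treats any error-level output as a failure.

```bash
#!/usr/bin/env bash
# Post-mitigation verification wrapper (sketch, not part of the product).
set -euo pipefail

stella doctor --check check.scanner.worker-health
stella scan image --image alpine:latest --dry-run

# Fail if any error-level lines remain in the recent logs.
errors=$(stella scanner logs --tail 20 --level error || true)
if [ -n "${errors}" ]; then
  echo "Error-level log lines found; investigate before closing the incident." >&2
  exit 1
fi

echo "Verification passed: health check, dry-run scan, and log check all clean."
```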
## Prevention

- Alert: Ensure the `ScannerQueueBacklog` alert is configured with a threshold of < 100 pending jobs (a stopgap CLI check is sketched after this list)
- Monitoring: Add a Grafana panel for worker memory usage
- Capacity: Review worker count and memory limits during capacity planning
- Deadlock: Enable `worker.deadlock_detection` in production
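Until the `ScannerQueueBacklog` alert is wired into your alerting pipeline, a periodic CLI check can serve as a stopgap. The sketch below is illustrative only: the threshold of 100 matches this runbook, but the output parsing and the notification hook are placeholders you would need to adapt.

```bash
#!/usr/bin/env bash
# Stopgap backlog check (run from cron or a systemd timer) until the ScannerQueueBacklog alert exists.
# Parsing assumes a "Queue depth: N" line in `stella scanner queue status` output.
THRESHOLD=100
depth=$(stella scanner queue status | grep -i 'queue depth' | grep -o '[0-9]\+' | head -n 1)
if [ "${depth:-0}" -gt "${THRESHOLD}" ]; then
  echo "Scanner queue backlog: ${depth} pending jobs (threshold ${THRESHOLD})" >&2
  # Hook your paging or chat notification here (e.g. a webhook call).
fi
```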
## Related Resources

- Architecture: `docs/modules/scanner/architecture.md`
- Related runbooks: `scanner-oom.md`, `scanner-timeout.md`
- Doctor check: `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Scanner/Checks/WorkerHealthCheck.cs`
- Dashboard: Grafana > Stella Ops > Scanner Overview