
Runbook: Scanner - Worker Not Processing Jobs

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-002 - Scanner Runbooks

Metadata

| Field | Value |
| --- | --- |
| Component | Scanner |
| Severity | Critical |
| On-call scope | Platform team |
| Last updated | 2026-01-17 |
| Doctor check | check.scanner.worker-health |

Symptoms

  • Scan jobs stuck in "pending" or "processing" state for >5 minutes
  • Scanner worker process shows 0% CPU usage
  • Alert ScannerWorkerStuck or ScannerQueueBacklog firing
  • UI shows "Scan in progress" indefinitely
  • Metric scanner_jobs_pending increasing over time

Impact

| Impact Type | Description |
| --- | --- |
| User-facing | New scans cannot complete, blocking CI/CD pipelines and release gates |
| Data integrity | No data loss; pending jobs resume once the workers recover |
| SLA impact | Scan latency SLO is violated if the incident is not resolved within 15 minutes |

Diagnosis

Quick checks (< 2 minutes)

  1. Check Doctor diagnostics:

    stella doctor --check check.scanner.worker-health
    
  2. Check scanner service status:

    stella scanner status
    

    Expected: "Scanner workers: 4 active, 0 idle"
    Problem: "Scanner workers: 0 active" or "status: degraded"

  3. Check job queue depth:

    stella scanner queue status
    

    Expected: Queue depth < 50
    Problem: Queue depth > 100 or growing rapidly

    A sketch that runs all three quick checks in one pass follows this list.
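
The sketch below chains the three quick checks, assuming each stella command exits non-zero when its check fails; if your build only reports problems in its output text, read the output instead of relying on exit codes.

#!/usr/bin/env bash
# Quick triage sketch: run the three quick checks in order, stopping at the first failure.
# Assumption: each stella command exits non-zero when the check it runs fails.
set -euo pipefail

echo "== Doctor: worker health =="
stella doctor --check check.scanner.worker-health

echo "== Scanner service status =="
stella scanner status

echo "== Job queue depth =="
stella scanner queue status

echo "All quick checks passed; continue with deep diagnosis if symptoms persist."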

Deep diagnosis

  1. Check worker process logs:

    stella scanner logs --tail 100 --level error
    

    Look for: "timeout", "connection refused", "out of memory"

  2. Check Valkey connectivity (job queue):

    stella doctor --check check.storage.valkey
    
  3. Check if workers are OOM-killed:

    stella scanner workers inspect
    

    Look for: "exit_code: 137" (OOM-killed) or "exit_code: 143" (SIGTERM)

    If the workers run on Kubernetes, a cross-check at the orchestrator level is sketched after this list.

  4. Check resource utilization:

    stella obs metrics --filter scanner --last 10m
    

    Look for: Memory > 90%, CPU sustained > 95%
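
If the scanner workers run as Kubernetes pods, the OOM evidence from step 3 can be cross-checked at the orchestrator level. This is a sketch only: the stellaops namespace and the app=scanner-worker label are illustrative placeholders, not confirmed names for this deployment.

# List worker pods (namespace and label are assumptions; adjust to your deployment).
kubectl get pods -n stellaops -l app=scanner-worker

# Print each pod's last termination reason; "OOMKilled" confirms an out-of-memory kill.
kubectl get pods -n stellaops -l app=scanner-worker \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

# The "Last State" block in describe output shows exit codes and kill reasons as well.
kubectl describe pod -n stellaops -l app=scanner-worker | grep -A3 "Last State"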


Resolution

Immediate mitigation

  1. Restart scanner workers:

    stella scanner workers restart
    

    This terminates the current workers and spawns fresh ones; pending jobs stay in the queue and resume once new workers pick them up.

  2. If restart fails, force restart the scanner service:

    stella service restart scanner
    
  3. Verify workers are processing:

    stella scanner queue status --watch
    

    Queue depth should start decreasing. A bounded polling sketch follows this list as an alternative to --watch.
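
As an alternative to --watch, the bounded loop below polls the queue every 30 seconds for five minutes so you can confirm the depth is trending down. It is a sketch: the output of stella scanner queue status is assumed to include a readable depth figure, so judge the trend by eye rather than scripting pass/fail logic against it.

# Poll the queue ten times at 30-second intervals after the restart.
for _ in $(seq 1 10); do
  date -u +"%Y-%m-%dT%H:%M:%SZ"
  stella scanner queue status
  sleep 30
done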

Root cause fix

If workers were OOM-killed:

  1. Increase worker memory limit:

    stella scanner config set worker.memory_limit 4Gi
    stella scanner workers restart
    
  2. Reduce concurrent scans per worker (a rough memory-sizing check follows this list):

    stella scanner config set worker.concurrency 2
    stella scanner workers restart
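
Before settling on new values, sanity-check that the memory limit covers the intended concurrency: worker.memory_limit should exceed worker.concurrency times the peak memory of a single scan, plus the worker's baseline overhead. The figures below are placeholders, not measured values; take the real per-scan peak from the metrics gathered during deep diagnosis.

# Back-of-the-envelope sizing check (all numbers are placeholders, not measurements).
CONCURRENCY=2            # worker.concurrency you plan to set
PEAK_SCAN_MB=1500        # assumed peak memory of a single scan, in MiB
BASELINE_MB=512          # assumed baseline overhead of the worker process, in MiB
NEEDED_MB=$((CONCURRENCY * PEAK_SCAN_MB + BASELINE_MB))
NEEDED_GI=$(( (NEEDED_MB + 1023) / 1024 ))
echo "worker.memory_limit should be at least ${NEEDED_MB} MiB (~${NEEDED_GI} Gi)"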
    

If Valkey connection failed:

  1. Check Valkey health:

    stella doctor --check check.storage.valkey
    
  2. Restart Valkey if needed (see valkey-connection-failure.md)

If workers are deadlocked:

  1. Enable deadlock detection:

    stella scanner config set worker.deadlock_detection true
    stella scanner workers restart
    

Verification

# Verify workers are healthy
stella doctor --check check.scanner.worker-health

# Submit a test scan
stella scan image --image alpine:latest --dry-run

# Watch queue drain
stella scanner queue status --watch

# Verify no errors in recent logs
stella scanner logs --tail 20 --level error
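
If you want a single pass/fail signal to close the incident, the verification steps can be wrapped in a short script. This is a sketch under the assumption that each stella command exits non-zero on failure; the error-log check is kept informational because judging its output is a human call.

#!/usr/bin/env bash
# Post-incident verification wrapper (sketch; assumes non-zero exit codes on failure).
set -euo pipefail

stella doctor --check check.scanner.worker-health
stella scan image --image alpine:latest --dry-run

# Informational only: an empty error log is good, but read the output yourself.
stella scanner logs --tail 20 --level error || true

echo "Scanner worker verification passed."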

Prevention

  • Alert: Ensure the ScannerQueueBacklog alert is configured to fire when queue depth exceeds 100 jobs
  • Monitoring: Add Grafana panel for worker memory usage
  • Capacity: Review worker count and memory limits during capacity planning
  • Deadlock: Enable worker.deadlock_detection in production

References

  • Architecture: docs/modules/scanner/architecture.md
  • Related runbooks: scanner-oom.md, scanner-timeout.md
  • Doctor check: src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Scanner/Checks/WorkerHealthCheck.cs
  • Dashboard: Grafana > Stella Ops > Scanner Overview