
# Runbook: Scanner - Worker Not Processing Jobs

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | Critical |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.worker-health` |

---

## Symptoms

- [ ] Scan jobs stuck in "pending" or "processing" state for >5 minutes
- [ ] Scanner worker process shows 0% CPU usage
- [ ] Alert `ScannerWorkerStuck` or `ScannerQueueBacklog` firing
- [ ] UI shows "Scan in progress" indefinitely
- [ ] Metric `scanner_jobs_pending` increasing over time

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | New scans cannot complete, blocking CI/CD pipelines and release gates |
| **Data integrity** | No data loss; pending jobs will resume when worker recovers |
| **SLA impact** | Scan latency SLO violated if not resolved within 15 minutes |

---

## Diagnosis

### Quick checks (< 2 minutes)

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check check.scanner.worker-health
   ```

2. **Check scanner service status:**
   ```bash
   stella scanner status
   ```
   Expected: "Scanner workers: 4 active, 0 idle"
   Problem: "Scanner workers: 0 active" or "status: degraded"

3. **Check job queue depth:**
   ```bash
   stella scanner queue status
   ```
   Expected: Queue depth < 50
   Problem: Queue depth > 100 or growing rapidly
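
The queue-depth thresholds above can be captured in a small helper for scripts or notes. This is a minimal sketch: `classify_queue_depth` is a hypothetical function, not part of the `stella` CLI, and the "elevated" band for depths between 50 and 100 is an assumption, since the runbook only defines the healthy and problem ranges. Extracting the depth from `stella scanner queue status` is left out because its output format is not specified here.

```bash
# Hypothetical helper: classify a queue depth against the runbook thresholds.
# "ok" (< 50) and "problem" (> 100) come from the runbook;
# "elevated" (50-100) is an assumed middle band for illustration.
classify_queue_depth() {
  local depth="$1"
  if [ "$depth" -lt 50 ]; then
    echo "ok"
  elif [ "$depth" -gt 100 ]; then
    echo "problem"
  else
    echo "elevated"
  fi
}

classify_queue_depth 12    # prints "ok"
classify_queue_depth 75    # prints "elevated"
classify_queue_depth 240   # prints "problem"
```

A depth that is still under 100 but growing rapidly counts as a problem too, so a single reading should be compared with a later one.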

### Deep diagnosis

1. **Check worker process logs:**
   ```bash
   stella scanner logs --tail 100 --level error
   ```
   Look for: "timeout", "connection refused", "out of memory"

2. **Check Valkey connectivity (job queue):**
   ```bash
   stella doctor --check check.storage.valkey
   ```

3. **Check if workers are OOM-killed:**
   ```bash
   stella scanner workers inspect
   ```
   Look for: "exit_code: 137" (SIGKILL, typically the OOM killer) or "exit_code: 143" (SIGTERM)

4. **Check resource utilization:**
   ```bash
   stella obs metrics --filter scanner --last 10m
   ```
   Look for: Memory > 90%, CPU sustained > 95%
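
The exit codes in step 3 follow the usual shell convention that a code above 128 means the process was terminated by signal number `code - 128`. The sketch below decodes that arithmetic. `describe_exit_code` is a hypothetical helper, not a `stella` command, and it only labels the two signals named in this runbook.

```bash
# Hypothetical helper: explain a process exit code.
# Convention: exit code > 128 means "killed by signal (code - 128)".
describe_exit_code() {
  local code="$1"
  if [ "$code" -le 128 ]; then
    echo "exit $code: process exited on its own"
    return 0
  fi
  local signal=$((code - 128))
  case "$signal" in
    9)  echo "exit $code: SIGKILL (signal 9), consistent with an OOM kill" ;;
    15) echo "exit $code: SIGTERM (signal 15), a requested shutdown" ;;
    *)  echo "exit $code: terminated by signal $signal" ;;
  esac
}

describe_exit_code 137   # 137 - 128 = 9  (SIGKILL)
describe_exit_code 143   # 143 - 128 = 15 (SIGTERM)
```

SIGKILL can come from sources other than the OOM killer, so treat exit code 137 as a strong hint and confirm it against the memory metrics in step 4.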

---

## Resolution

### Immediate mitigation

1. **Restart scanner workers:**
   ```bash
   stella scanner workers restart
   ```
   This terminates the current workers and spawns fresh ones.

2. **If the restart fails, force-restart the scanner service:**
   ```bash
   stella service restart scanner
   ```

3. **Verify workers are processing:**
   ```bash
   stella scanner queue status --watch
   ```
   Queue depth should start decreasing
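
To make "queue depth should start decreasing" concrete, a few depth readings taken some seconds apart can be compared. The sketch below assumes the readings were noted manually from the `--watch` output. `queue_trend` is a hypothetical helper, and comparing only the first and last sample is a deliberate simplification.

```bash
# Hypothetical helper: given queue-depth samples in chronological order,
# report whether the queue is draining, growing, or flat.
# Only the first and last samples are compared.
queue_trend() {
  if [ "$#" -lt 2 ]; then
    echo "need at least two samples" >&2
    return 2
  fi
  local first="$1"
  local last="$1"
  local sample
  for sample in "$@"; do
    last="$sample"
  done
  if [ "$last" -lt "$first" ]; then
    echo "draining"
  elif [ "$last" -gt "$first" ]; then
    echo "growing"
  else
    echo "flat"
  fi
}

queue_trend 180 150 120   # prints "draining"
queue_trend 90 110 140    # prints "growing"
```

If the result is "growing" or "flat" after a restart, the workers are likely still not consuming jobs and the root cause steps apply.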

### Root cause fix

**If workers were OOM-killed:**

1. Increase worker memory limit:
   ```bash
   stella scanner config set worker.memory_limit 4Gi
   stella scanner workers restart
   ```

2. Reduce concurrent scans per worker:
   ```bash
   stella scanner config set worker.concurrency 2
   stella scanner workers restart
   ```
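
The two settings interact: the memory available to each scan depends on both the worker limit and the concurrency. As a rough sketch, assuming a worker's memory is shared evenly across its concurrent scans (an assumption for illustration, not documented scanner behavior), the per-scan budget can be estimated as below. `per_scan_memory_mib` is a hypothetical helper.

```bash
# Hypothetical helper: estimate the memory budget per concurrent scan in MiB.
# Assumes an even split of the worker limit, which is a simplification.
per_scan_memory_mib() {
  local limit_gib="$1"
  local concurrency="$2"
  echo $(( limit_gib * 1024 / concurrency ))
}

per_scan_memory_mib 4 2   # prints 2048 (4Gi limit, 2 concurrent scans)
per_scan_memory_mib 4 4   # prints 1024 (same limit, 4 concurrent scans)
```

Under this assumption, halving concurrency doubles the headroom per scan without changing the limit, at the cost of lower throughput per worker.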

**If Valkey connection failed:**

1. Check Valkey health:
   ```bash
   stella doctor --check check.storage.valkey
   ```

2. Restart Valkey if needed (see `valkey-connection-failure.md`)

**If workers are deadlocked:**

1. Enable deadlock detection:
   ```bash
   stella scanner config set worker.deadlock_detection true
   stella scanner workers restart
   ```

### Verification

```bash
# Verify workers are healthy
stella doctor --check check.scanner.worker-health

# Submit a test scan
stella scan image --image alpine:latest --dry-run

# Watch queue drain
stella scanner queue status --watch

# Verify no errors in recent logs
stella scanner logs --tail 20 --level error
```
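
For the final log check, the error patterns listed under Deep diagnosis ("timeout", "connection refused", "out of memory") can be counted with standard `grep`. This is a minimal sketch: `count_known_errors` is a hypothetical helper that reads log text on stdin, and the sample lines are made up for illustration, not real scanner output.

```bash
# Hypothetical helper: count stdin lines that match the error patterns
# named in this runbook (case-insensitive).
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's status at 0 while still printing the count.
count_known_errors() {
  grep -c -i -E 'timeout|connection refused|out of memory' || true
}

# Illustrative input only; real log lines will look different.
printf '%s\n' \
  'example: dial tcp: connection refused' \
  'example: job finished' \
  'example: Out Of Memory while unpacking layer' | count_known_errors   # prints 2
```

In practice the input would be piped from `stella scanner logs --tail 20 --level error`. A count of 0 after recovery matches the "no errors in recent logs" expectation.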

---

## Prevention

- [ ] **Alert:** Ensure the `ScannerQueueBacklog` alert is configured with a threshold below 100 jobs
- [ ] **Monitoring:** Add Grafana panel for worker memory usage
- [ ] **Capacity:** Review worker count and memory limits during capacity planning
- [ ] **Deadlock:** Enable `worker.deadlock_detection` in production

---

## Related Resources

- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Scanner/Checks/WorkerHealthCheck.cs`
- **Dashboard:** Grafana > Stella Ops > Scanner Overview