# Runbook: Scanner - Out of Memory on Large Images

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.memory-usage` |

---

## Symptoms

- [ ] Scanner worker exits with code 137 (OOM killed)
- [ ] Scans fail consistently for specific large images
- [ ] Error log contains "fatal error: runtime: out of memory"
- [ ] Alert `ScannerWorkerOOM` firing
- [ ] Metric `scanner_worker_restarts_total{reason="oom"}` increasing

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Large images cannot be scanned; smaller images may still work |
| **Data integrity** | No data loss; failed scans can be retried |
| **SLA impact** | Specific images are blocked from the release pipeline |

---

## Diagnosis

### Quick checks

1. **Identify the failing image:**

   ```bash
   stella scanner jobs list --status failed --last 1h
   ```

2. **Check image size:**

   ```bash
   stella image inspect <image> --format json | jq '.size'
   ```

   Problem if: image size > 2GB or layer count > 100.

3. **Check the worker memory limit:**

   ```bash
   stella scanner config get worker.memory_limit
   ```

### Deep diagnosis

1. **Profile memory usage during a scan:**

   ```bash
   stella scan image --image <image> --profile-memory
   ```

2. **Check SBOM generation memory:**

   ```bash
   stella scanner logs --filter "sbom" --level debug --last 30m
   ```

   Look for: "memory allocation failed", "heap exhausted".

3. **Identify memory-heavy layers:**

   ```bash
   stella image layers <image> --sort-by size
   ```

---

## Resolution

### Immediate mitigation

1. **Increase the worker memory limit:**

   ```bash
   stella scanner config set worker.memory_limit 8Gi
   stella scanner workers restart
   ```
2. **Enable streaming mode for large images:**

   ```bash
   stella scanner config set sbom.streaming_threshold 1Gi
   stella scanner workers restart
   ```

3. **Retry the failed scan:**

   ```bash
   stella scan image --image <image> --retry
   ```

### Root cause fix

**For consistently large images:**

1. Configure a dedicated large-image worker pool:

   ```bash
   stella scanner workers add --pool large-images --memory 16Gi --count 2
   stella scanner config set routing.large_image_threshold 2Gi
   stella scanner config set routing.large_image_pool large-images
   ```

**For images with many small files (node_modules, etc.):**

1. Enable incremental SBOM mode:

   ```bash
   stella scanner config set sbom.incremental_mode true
   ```

**For base image reuse:**

1. Enable layer caching:

   ```bash
   stella scanner config set cache.layer_dedup true
   ```

### Verification

```bash
# Retry the previously failing scan
stella scan image --image <image>

# Monitor memory during the scan
stella scanner workers stats --watch

# Verify no OOM in recent logs
stella scanner logs --filter "out of memory" --last 1h
```

---

## Prevention

- [ ] **Capacity:** Set the memory limit based on the largest expected image (recommend 4Gi minimum)
- [ ] **Routing:** Configure a large-image pool for images > 2GB
- [ ] **Monitoring:** Alert when `scanner_worker_memory_usage_bytes` exceeds 80% of the limit
- [ ] **Documentation:** Document image size limits in the user guide

---

## Related Resources

- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
- **Dashboard:** Grafana > Stella Ops > Scanner Memory
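## Appendix: Worked Examples

The size-based routing rule used above (image size > 2GB or layer count > 100 goes to the large-image pool) can be sketched as a small shell helper. This is a hypothetical illustration, not part of the `stella` CLI; the function name `route_pool` and the pool names are assumptions.

```bash
#!/bin/sh
# Hypothetical sketch of the routing decision from Diagnosis/Root cause fix.
# Mirrors routing.large_image_threshold = 2Gi and the >100-layer heuristic.

GIB=$((1024 * 1024 * 1024))
LARGE_IMAGE_THRESHOLD=$((2 * GIB))   # 2Gi, per the runbook recommendation

# route_pool SIZE_BYTES LAYER_COUNT -> prints the pool an image should use
route_pool() {
  size_bytes=$1
  layer_count=$2
  if [ "$size_bytes" -gt "$LARGE_IMAGE_THRESHOLD" ] || [ "$layer_count" -gt 100 ]; then
    echo "large-images"   # route to the dedicated high-memory pool
  else
    echo "default"        # small enough for the standard worker pool
  fi
}

route_pool $((3 * GIB)) 40    # 3GiB image: over the size threshold
route_pool $((1 * GIB)) 150   # many layers: also routed to large-images
route_pool $((1 * GIB)) 30    # small image: stays on the default pool
```

The same thresholds should be kept in sync with `routing.large_image_threshold` so that operator expectations and scanner behavior match.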
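The Prevention checklist's 80%-of-limit alert rule can also be sketched in shell. This is a minimal illustration under stated assumptions: `to_bytes` and `should_alert` are hypothetical helpers, and only the `Gi`/`Mi` suffixes used by `worker.memory_limit` in this runbook are handled.

```bash
#!/bin/sh
# Hypothetical sketch of the "alert above 80% of the memory limit" rule
# from Prevention. Not part of the stella CLI.

# to_bytes VALUE (e.g. "8Gi", "512Mi") -> prints the value in bytes
to_bytes() {
  v=$1
  case "$v" in
    *Gi) echo $(( ${v%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${v%Mi} * 1024 * 1024 )) ;;
    *)   echo "$v" ;;   # already a plain byte count
  esac
}

# should_alert USAGE_BYTES LIMIT -> "alert" if usage exceeds 80% of the limit
should_alert() {
  usage=$1
  limit_bytes=$(to_bytes "$2")
  # Compare usage*100 > limit*80 to avoid fractional arithmetic in sh
  if [ $(( usage * 100 )) -gt $(( limit_bytes * 80 )) ]; then
    echo "alert"
  else
    echo "ok"
  fi
}

should_alert 7000000000 8Gi   # ~81% of 8Gi: fires
should_alert 4000000000 8Gi   # ~47% of 8Gi: quiet
```

In practice the same comparison would live in the alerting rule for `scanner_worker_memory_usage_bytes` rather than a script; the sketch just makes the arithmetic explicit.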