# Runbook: Scanner - Out of Memory on Large Images
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.memory-usage` |
---
## Symptoms
- [ ] Scanner worker exits with code 137 (OOM killed)
- [ ] Scans fail consistently for specific large images
- [ ] Error log contains "fatal error: runtime: out of memory"
- [ ] Alert `ScannerWorkerOOM` firing
- [ ] Metric `scanner_worker_restarts_total{reason="oom"}` increasing
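The exit-code symptom above can be checked mechanically: 137 is 128 + 9 (SIGKILL), the signal the kernel OOM killer sends. A minimal sketch, using a self-killed subshell to stand in for an OOM-killed worker:

```bash
# 137 = 128 + SIGKILL(9): the status a shell reports for an OOM-killed child.
# Here a subshell SIGKILLs itself to simulate the kernel OOM killer.
status=0
sh -c 'kill -KILL $$' || status=$?
echo "exit status: $status"   # -> exit status: 137
if [ "$status" -eq 137 ]; then
  echo "worker was SIGKILLed: likely the kernel OOM killer"
fi
```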
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Large images cannot be scanned; smaller images may still work |
| **Data integrity** | No data loss; failed scans can be retried |
| **SLA impact** | Specific images blocked from release pipeline |
---
## Diagnosis
### Quick checks
1. **Identify the failing image:**
```bash
stella scanner jobs list --status failed --last 1h
```
2. **Check image size:**
```bash
stella image inspect <image-ref> --format json | jq '.size'
```
Problem if: image size > 2 GiB or layer count > 100
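The size check in step 2 can be scripted. A minimal sketch, where `SIZE_BYTES` is a hard-coded stand-in for the `jq '.size'` output above:

```bash
# Flag images over the 2 GiB threshold for the large-image pool.
# SIZE_BYTES would normally come from:
#   stella image inspect <image-ref> --format json | jq '.size'
SIZE_BYTES=3221225472                      # example: a 3 GiB image
THRESHOLD=$((2 * 1024 * 1024 * 1024))      # 2 GiB in bytes
if [ "$SIZE_BYTES" -gt "$THRESHOLD" ]; then
  echo "image exceeds 2 GiB: likely OOM risk, route to large-image pool"
fi
```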
3. **Check worker memory limit:**
```bash
stella scanner config get worker.memory_limit
```
### Deep diagnosis
1. **Profile memory usage during scan:**
```bash
stella scan image --image <image-ref> --profile-memory
```
2. **Check SBOM generation memory:**
```bash
stella scanner logs --filter "sbom" --level debug --last 30m
```
Look for: "memory allocation failed", "heap exhausted"
3. **Identify memory-heavy layers:**
```bash
stella image layers <image-ref> --sort-by size
```
---
## Resolution
### Immediate mitigation
1. **Increase worker memory limit:**
```bash
stella scanner config set worker.memory_limit 8Gi
stella scanner workers restart
```
2. **Enable streaming mode for large images:**
```bash
stella scanner config set sbom.streaming_threshold 1Gi
stella scanner workers restart
```
3. **Retry the failed scan:**
```bash
stella scan image --image <image-ref> --retry
```
### Root cause fix
**For consistently large images:**
1. Configure dedicated large-image worker pool:
```bash
stella scanner workers add --pool large-images --memory 16Gi --count 2
stella scanner config set routing.large_image_threshold 2Gi
stella scanner config set routing.large_image_pool large-images
```
**For images with many small files (node_modules, etc.):**
1. Enable incremental SBOM mode:
```bash
stella scanner config set sbom.incremental_mode true
```
**For base image reuse:**
1. Enable layer caching:
```bash
stella scanner config set cache.layer_dedup true
```
### Verification
```bash
# Retry the previously failing scan
stella scan image --image <image-ref>
# Monitor memory during scan
stella scanner workers stats --watch
# Verify no OOM in recent logs
stella scanner logs --filter "out of memory" --last 1h
```
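The log check above can be turned into a pass/fail gate, e.g. for a post-mitigation verification script. A sketch, using a sample log file in place of captured `stella scanner logs` output:

```bash
# Fail verification if any OOM messages appear in the captured log window.
# /tmp/scanner.log stands in for redirected `stella scanner logs` output.
cat > /tmp/scanner.log <<'EOF'
2026-01-17T10:00:00Z info scan completed image=registry/app:1.2
2026-01-17T10:05:00Z info scan completed image=registry/base:3.1
EOF
if grep -qi "out of memory" /tmp/scanner.log; then
  echo "OOM events still present: mitigation not effective" >&2
  exit 1
fi
echo "no OOM events in the window"
```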
---
## Prevention
- [ ] **Capacity:** Set memory limit based on largest expected image (recommend 4Gi minimum)
- [ ] **Routing:** Configure large-image pool for images > 2 GiB
- [ ] **Monitoring:** Alert on `scanner_worker_memory_usage_bytes` > 80% of limit
- [ ] **Documentation:** Document image size limits in user guide
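The 80%-of-limit alert in the monitoring item could look like the following Prometheus rule. This is a sketch: the rule group, alert name, and the companion metric `scanner_worker_memory_limit_bytes` are assumptions, not confirmed names; only `scanner_worker_memory_usage_bytes` appears in this runbook.

```yaml
# Hypothetical Prometheus alerting rule for the 80% memory threshold.
# `scanner_worker_memory_limit_bytes` is an assumed companion metric;
# substitute however your deployment exposes the worker memory limit.
groups:
  - name: scanner-memory
    rules:
      - alert: ScannerWorkerMemoryHigh
        expr: >
          scanner_worker_memory_usage_bytes
            / scanner_worker_memory_limit_bytes > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scanner worker memory above 80% of limit"
          runbook: "docs/operations/runbooks/scanner-oom.md"
```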
---
## Related Resources
- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
- **Dashboard:** Grafana > Stella Ops > Scanner Memory