# Runbook: Scanner - Out of Memory on Large Images

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.memory-usage` |

---

## Symptoms

- [ ] Scanner worker exits with code 137 (OOM killed)
- [ ] Scans fail consistently for specific large images
- [ ] Error log contains "fatal error: runtime: out of memory"
- [ ] Alert `ScannerWorkerOOM` firing
- [ ] Metric `scanner_worker_restarts_total{reason="oom"}` increasing
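
Exit code 137 in the first symptom follows the common convention that a process killed by a signal reports 128 plus the signal number, so 137 means signal 9 (SIGKILL), the signal the Linux OOM killer sends. A minimal sketch of that decoding follows; `decode_exit_status` is an illustrative helper, not part of the `stella` CLI.

```bash
# Illustrative helper (not part of the stella CLI): interpret a process
# exit status using the convention "status = 128 + signal number".
decode_exit_status() {
  status=$1
  if [ "$status" -gt 128 ]; then
    # Statuses above 128 indicate termination by a signal.
    echo "killed by signal $((status - 128))"
  else
    echo "exited with status ${status}"
  fi
}

decode_exit_status 137   # prints: killed by signal 9
```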

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Large images cannot be scanned; smaller images may still work |
| **Data integrity** | No data loss; failed scans can be retried |
| **SLA impact** | Specific images blocked from release pipeline |

---

## Diagnosis

### Quick checks

1. **Identify the failing image:**
   ```bash
   stella scanner jobs list --status failed --last 1h
   ```

2. **Check image size:**
   ```bash
   stella image inspect <image-ref> --format json | jq '.size'
   ```
   Problem if: Image size > 2GB or layer count > 100

3. **Check worker memory limit:**
   ```bash
   stella scanner config get worker.memory_limit
   ```
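
The thresholds from the image-size check can be expressed as a small test. This is only a sketch under stated assumptions: the size is taken to be a byte count, "2GB" is read as 2 GiB, and `is_large_image` is a hypothetical helper name. The layer count would need to come from a separate command, since the `jq '.size'` filter above returns only the size.

```bash
# Hypothetical helper: apply the runbook's "problem if" thresholds.
# Assumptions: size is in bytes, and "2GB" is interpreted as 2 GiB.
SIZE_LIMIT_BYTES=$((2 * 1024 * 1024 * 1024))
LAYER_LIMIT=100

is_large_image() {
  size_bytes=$1
  layer_count=$2
  # Succeeds when either threshold is exceeded.
  [ "$size_bytes" -gt "$SIZE_LIMIT_BYTES" ] || [ "$layer_count" -gt "$LAYER_LIMIT" ]
}

# Example: a 3 GiB image with 40 layers exceeds the size threshold.
if is_large_image 3221225472 40; then
  echo "large image: expect higher scanner memory use"
fi
```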

### Deep diagnosis

1. **Profile memory usage during scan:**
   ```bash
   stella scan image --image <image-ref> --profile-memory
   ```

2. **Check SBOM generation memory:**
   ```bash
   stella scanner logs --filter "sbom" --level debug --last 30m
   ```
   Look for: "memory allocation failed", "heap exhausted"

3. **Identify memory-heavy layers:**
   ```bash
   stella image layers <image-ref> --sort-by size
   ```

---

## Resolution

### Immediate mitigation

1. **Increase worker memory limit:**
   ```bash
   stella scanner config set worker.memory_limit 8Gi
   stella scanner workers restart
   ```

2. **Enable streaming mode for large images:**
   ```bash
   stella scanner config set sbom.streaming_threshold 1Gi
   stella scanner workers restart
   ```

3. **Retry the failed scan:**
   ```bash
   stella scan image --image <image-ref> --retry
   ```

### Root cause fix

**For consistently large images:**

1. Configure dedicated large-image worker pool:
   ```bash
   stella scanner workers add --pool large-images --memory 16Gi --count 2
   stella scanner config set routing.large_image_threshold 2Gi
   stella scanner config set routing.large_image_pool large-images
   ```
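
To illustrate what the routing settings above are meant to achieve, here is a rough sketch of a threshold-based pool choice. It is not the scanner's actual routing logic. The helper names `to_bytes` and `pick_pool`, the pool name `default`, and the strict greater-than comparison are all assumptions made for the example.

```bash
# Convert a quantity such as "2Gi" or "512Mi" to bytes (binary suffixes only).
to_bytes() {
  case $1 in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *)   echo "$1" ;;   # no suffix: treat the value as bytes
  esac
}

# Hypothetical routing rule: images above the threshold go to the
# large-image pool; everything else goes to an assumed "default" pool.
pick_pool() {
  image_bytes=$1
  threshold_bytes=$(to_bytes "$2")
  if [ "$image_bytes" -gt "$threshold_bytes" ]; then
    echo "large-images"
  else
    echo "default"
  fi
}

pick_pool 3221225472 2Gi   # 3 GiB image, prints: large-images
pick_pool 524288000 2Gi    # 500 MiB image, prints: default
```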

**For images with many small files (node_modules, etc.):**

1. Enable incremental SBOM mode:
   ```bash
   stella scanner config set sbom.incremental_mode true
   ```

**For base image reuse:**

1. Enable layer caching:
   ```bash
   stella scanner config set cache.layer_dedup true
   ```

### Verification

```bash
# Retry the previously failing scan
stella scan image --image <image-ref>

# Monitor memory during scan
stella scanner workers stats --watch

# Verify no OOM in recent logs
stella scanner logs --filter "out of memory" --last 1h
```
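
The log check can be scripted so that it also catches the other messages this runbook names ("memory allocation failed", "heap exhausted"). The sketch below uses a hypothetical helper, `has_oom_marker`, that reads log text on standard input. The sample lines stand in for real scanner output; piping `stella scanner logs` into it assumes the command writes plain text to standard output.

```bash
# Hypothetical helper: read log text on stdin and succeed if any of the
# OOM-related messages named in this runbook appears (case-insensitive).
has_oom_marker() {
  grep -q -i -E 'out of memory|memory allocation failed|heap exhausted'
}

# Sample input stands in for real scanner log output.
if printf '%s\n' 'scan started' 'fatal error: runtime: out of memory' | has_oom_marker; then
  echo "OOM marker found"
else
  echo "no OOM marker"
fi
```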

---

## Prevention

- [ ] **Capacity:** Set memory limit based on largest expected image (recommend 4Gi minimum)
- [ ] **Routing:** Configure large-image pool for images > 2GB
- [ ] **Monitoring:** Alert on `scanner_worker_memory_usage_bytes` > 80% of limit
- [ ] **Documentation:** Document image size limits in user guide
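
The monitoring condition reduces to a simple comparison: usage above 80% of the configured limit. A minimal sketch of that arithmetic follows; `exceeds_memory_alert` is a hypothetical helper, and in practice the condition would be written in whatever alerting system evaluates `scanner_worker_memory_usage_bytes`.

```bash
# Hypothetical check mirroring the alert condition: usage above 80% of
# the limit. Both arguments are byte counts; multiplying both sides keeps
# the comparison in integer arithmetic.
exceeds_memory_alert() {
  usage_bytes=$1
  limit_bytes=$2
  [ $((usage_bytes * 100)) -gt $((limit_bytes * 80)) ]
}

limit=$((8 * 1024 * 1024 * 1024))   # 8Gi, the limit used in the mitigation step
usage=$((7 * 1024 * 1024 * 1024))   # 7 GiB in use, i.e. 87.5% of the limit

if exceeds_memory_alert "$usage" "$limit"; then
  echo "above 80% of limit"
fi
```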

---

## Related Resources

- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
- **Dashboard:** Grafana > Stella Ops > Scanner Memory