Runbook: Scanner - Out of Memory on Large Images
Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-002 - Scanner Runbooks
Metadata
| Field | Value |
|---|---|
| Component | Scanner |
| Severity | High |
| On-call scope | Platform team |
| Last updated | 2026-01-17 |
| Doctor check | check.scanner.memory-usage |
Symptoms
- Scanner worker exits with code 137 (OOM killed)
- Scans fail consistently for specific large images
- Error log contains "fatal error: runtime: out of memory"
- Alert `ScannerWorkerOOM` firing
- Metric `scanner_worker_restarts_total{reason="oom"}` increasing
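
Exit code 137 follows the common convention that a process killed by a signal reports 128 plus the signal number: 137 is 128 + 9 (SIGKILL), the signal the kernel OOM killer sends. A minimal shell sketch of that decoding follows. The helper name `describe_exit_code` is hypothetical and not part of the `stella` CLI.

```shell
# Hypothetical helper (not part of the stella CLI): explain a worker exit code.
# Convention: exit codes above 128 mean "killed by signal (code - 128)".
describe_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        if [ "$sig" -eq 9 ]; then
            echo "killed by signal 9 (SIGKILL), consistent with an OOM kill"
        else
            echo "killed by signal $sig"
        fi
    elif [ "$code" -eq 0 ]; then
        echo "exited cleanly"
    else
        echo "exited with error code $code"
    fi
}

describe_exit_code 137
```

SIGKILL can also come from other sources, such as a manual `kill -9`, so treat 137 as a strong hint and confirm it against the alert and restart metric listed above.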
Impact
| Impact Type | Description |
|---|---|
| User-facing | Large images cannot be scanned; smaller images may still work |
| Data integrity | No data loss; failed scans can be retried |
| SLA impact | Specific images blocked from release pipeline |
Diagnosis
Quick checks
- Identify the failing image:
      stella scanner jobs list --status failed --last 1h
- Check image size:
      stella image inspect <image-ref> --format json | jq '.size'
  Problem if: image size > 2GB or layer count > 100
- Check worker memory limit:
      stella scanner config get worker.memory_limit
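
The "Problem if" rule from the image size check can be sketched as a small shell test. It assumes the `.size` value is reported in bytes and reads "2GB" as 2 GiB (2147483648 bytes) to match the `Gi` units used elsewhere in this runbook. Both are assumptions, and `is_problem_image` is a hypothetical helper.

```shell
# Hypothetical helper: apply the "Problem if" rule from the quick checks.
# Assumptions: size is in bytes; "2GB" is treated as 2 GiB.
SIZE_LIMIT_BYTES=$((2 * 1024 * 1024 * 1024))
LAYER_LIMIT=100

is_problem_image() {
    size_bytes=$1
    layer_count=$2
    [ "$size_bytes" -gt "$SIZE_LIMIT_BYTES" ] || [ "$layer_count" -gt "$LAYER_LIMIT" ]
}

# Example: a 3 GiB image with 40 layers exceeds the size limit.
if is_problem_image 3221225472 40; then
    echo "image is likely to exceed worker memory"
fi
```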
Deep diagnosis
- Profile memory usage during scan:
      stella scan image --image <image-ref> --profile-memory
- Check SBOM generation memory:
      stella scanner logs --filter "sbom" --level debug --last 30m
  Look for: "memory allocation failed", "heap exhausted"
- Identify memory-heavy layers:
      stella image layers <image-ref> --sort-by size
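
The two "Look for" strings from the SBOM log check can be matched mechanically with `grep`. The sketch below reads log text on standard input, so the output of the `stella scanner logs` command could be piped into it. `has_memory_failure` is a hypothetical helper, and the sample log lines are invented for illustration.

```shell
# Hypothetical helper: succeed if log text on stdin contains either
# memory-failure phrase named in this runbook (case-insensitive).
has_memory_failure() {
    grep -q -i -e "memory allocation failed" -e "heap exhausted"
}

# Illustrative input only; real scanner log lines will look different.
printf '%s\n' "sbom: walking layer 12" "sbom: heap exhausted" | has_memory_failure \
    && echo "memory failure found in SBOM logs"
```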
Resolution
Immediate mitigation
- Increase worker memory limit:
      stella scanner config set worker.memory_limit 8Gi
      stella scanner workers restart
- Enable streaming mode for large images:
      stella scanner config set sbom.streaming_threshold 1Gi
      stella scanner workers restart
- Retry the failed scan:
      stella scan image --image <image-ref> --retry
Root cause fix
For consistently large images:
- Configure dedicated large-image worker pool:
      stella scanner workers add --pool large-images --memory 16Gi --count 2
      stella scanner config set routing.large_image_threshold 2Gi
      stella scanner config set routing.large_image_pool large-images
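
The routing settings send images above the threshold to the dedicated pool. How the scanner compares sizes internally is not described in this runbook, so the following is only a sketch of the intended decision. It assumes sizes in bytes, a strict "greater than" comparison, and a hypothetical pool name `default` for everything else.

```shell
# Sketch of the routing decision; not the scanner's actual implementation.
# Assumptions: sizes in bytes, "default" is a hypothetical fallback pool name.
LARGE_IMAGE_THRESHOLD=$((2 * 1024 * 1024 * 1024))   # routing.large_image_threshold 2Gi
LARGE_IMAGE_POOL="large-images"                     # routing.large_image_pool

select_pool() {
    size_bytes=$1
    if [ "$size_bytes" -gt "$LARGE_IMAGE_THRESHOLD" ]; then
        echo "$LARGE_IMAGE_POOL"
    else
        echo "default"
    fi
}

select_pool 5368709120   # a 5 GiB image
```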
For images with many small files (node_modules, etc.):
- Enable incremental SBOM mode:
      stella scanner config set sbom.incremental_mode true
For base image reuse:
- Enable layer caching:
      stella scanner config set cache.layer_dedup true
Verification
# Retry the previously failing scan
stella scan image --image <image-ref>
# Monitor memory during scan
stella scanner workers stats --watch
# Verify no OOM in recent logs
stella scanner logs --filter "out of memory" --last 1h
Prevention
- Capacity: Set memory limit based on largest expected image (recommend 4Gi minimum)
- Routing: Configure large-image pool for images > 2GB
- Monitoring: Alert on `scanner_worker_memory_usage_bytes` > 80% of limit
- Documentation: Document image size limits in user guide
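
To turn the "80% of limit" rule into a concrete byte value for `scanner_worker_memory_usage_bytes`, the limit has to be converted from its `Gi` or `Mi` form. A minimal sketch follows, assuming limits are written as an integer followed by `Gi` or `Mi` (binary units). `alert_threshold_bytes` is a hypothetical helper.

```shell
# Hypothetical helper: print 80% of a memory limit, in bytes.
# Assumption: the limit is an integer followed by Gi or Mi (binary units).
alert_threshold_bytes() {
    limit=$1
    case "$limit" in
        *Gi) bytes=$(( ${limit%Gi} * 1024 * 1024 * 1024 )) ;;
        *Mi) bytes=$(( ${limit%Mi} * 1024 * 1024 )) ;;
        *)   echo "unsupported limit: $limit" >&2; return 1 ;;
    esac
    # Integer arithmetic, so the result is rounded down.
    echo $(( bytes * 80 / 100 ))
}

alert_threshold_bytes 8Gi
```

The printed value can be used as the comparison value in an alert rule on the memory usage metric.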
Related Resources
- Architecture: `docs/modules/scanner/architecture.md`
- Related runbooks: `scanner-worker-stuck.md`, `scanner-timeout.md`
- Dashboard: Grafana > Stella Ops > Scanner Memory