Runbook: Scanner - Out of Memory on Large Images

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-002 - Scanner Runbooks

Metadata

| Field | Value |
| --- | --- |
| Component | Scanner |
| Severity | High |
| On-call scope | Platform team |
| Last updated | 2026-01-17 |
| Doctor check | check.scanner.memory-usage |

Symptoms

  • Scanner worker exits with code 137 (OOM killed)
  • Scans fail consistently for specific large images
  • Error log contains "fatal error: runtime: out of memory"
  • Alert ScannerWorkerOOM firing
  • Metric scanner_worker_restarts_total{reason="oom"} increasing
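
Exit code 137 follows the common convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL), which is the signal the kernel OOM killer sends. The sketch below only illustrates that arithmetic; the function name `decode_exit_code` is hypothetical and not part of the `stella` CLI.

```shell
# Hypothetical helper (not a stella command): decode a process exit code.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
decode_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        echo "signal $((code - 128))"
    else
        echo "exit $code"
    fi
}

decode_exit_code 137   # prints "signal 9" (SIGKILL)
```

A SIGKILL on its own does not prove an OOM kill, so it is worth confirming against the log message or the restart metric listed above.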

Impact

| Impact Type | Description |
| --- | --- |
| User-facing | Large images cannot be scanned; smaller images may still work |
| Data integrity | No data loss; failed scans can be retried |
| SLA impact | Specific images blocked from release pipeline |

Diagnosis

Quick checks

  1. Identify the failing image:

    stella scanner jobs list --status failed --last 1h
    
  2. Check image size:

    stella image inspect <image-ref> --format json | jq '.size'
    

    Problem if: Image size > 2GB or layer count > 100

  3. Check worker memory limit:

    stella scanner config get worker.memory_limit
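
The "Problem if" thresholds from step 2 can be applied mechanically once the size and layer count are known. The sketch below assumes the size is reported in bytes and reads "2GB" as 2 GiB to match the `Gi` units used in the scanner configuration; both are assumptions, and `is_large_image` is a hypothetical helper rather than a `stella` command.

```shell
# Hypothetical helper: apply the "Problem if" thresholds from step 2.
# Assumptions: size is given in bytes, and "2GB" is read as 2 GiB.
SIZE_LIMIT_BYTES=$((2 * 1024 * 1024 * 1024))
LAYER_LIMIT=100

is_large_image() {
    size_bytes=$1
    layer_count=$2
    # Succeeds (returns 0) if either threshold is exceeded.
    [ "$size_bytes" -gt "$SIZE_LIMIT_BYTES" ] || [ "$layer_count" -gt "$LAYER_LIMIT" ]
}

# Example: a 3 GiB image with 40 layers is flagged on size alone.
if is_large_image 3221225472 40; then
    echo "large image: expect high scanner memory use"
fi
```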
    

Deep diagnosis

  1. Profile memory usage during scan:

    stella scan image --image <image-ref> --profile-memory
    
  2. Check SBOM generation memory:

    stella scanner logs --filter "sbom" --level debug --last 30m
    

    Look for: "memory allocation failed", "heap exhausted"

  3. Identify memory-heavy layers:

    stella image layers <image-ref> --sort-by size
    

Resolution

Immediate mitigation

  1. Increase worker memory limit:

    stella scanner config set worker.memory_limit 8Gi
    stella scanner workers restart
    
  2. Enable streaming mode for large images:

    stella scanner config set sbom.streaming_threshold 1Gi
    stella scanner workers restart
    
  3. Retry the failed scan:

    stella scan image --image <image-ref> --retry
    

Root cause fix

For consistently large images:

  1. Configure dedicated large-image worker pool:

    stella scanner workers add --pool large-images --memory 16Gi --count 2
    stella scanner config set routing.large_image_threshold 2Gi
    stella scanner config set routing.large_image_pool large-images
    

For images with many small files (node_modules, etc.):

  1. Enable incremental SBOM mode:

    stella scanner config set sbom.incremental_mode true
    

For base image reuse:

  1. Enable layer caching:

    stella scanner config set cache.layer_dedup true
    

Verification

# Retry the previously failing scan
stella scan image --image <image-ref>

# Monitor memory during scan
stella scanner workers stats --watch

# Verify no OOM in recent logs
stella scanner logs --filter "out of memory" --last 1h
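
The last check can be turned into a pass/fail step for scripted verification. This is a minimal sketch that assumes the logs command writes plain text to stdout, one entry per line; `assert_no_oom` is a hypothetical helper, and the example feeds it canned input rather than live logs.

```shell
# Hypothetical helper: read log text on stdin and fail if it mentions an OOM.
# Assumption: log output is plain text, one entry per line.
assert_no_oom() {
    # grep -c exits non-zero when nothing matches, so ignore its status.
    count=$(grep -c "out of memory" || true)
    if [ "$count" -gt 0 ]; then
        echo "found $count OOM log line(s)"
        return 1
    fi
    echo "no OOM log lines"
}

# Example with canned input instead of live logs:
printf 'scan started\nscan finished\n' | assert_no_oom
```

In practice the input would be piped from the `stella scanner logs --filter "out of memory" --last 1h` command above.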

Prevention

  • Capacity: Set the worker memory limit based on the largest expected image (4Gi minimum recommended)
  • Routing: Configure a large-image pool for images > 2GB
  • Monitoring: Alert when scanner_worker_memory_usage_bytes exceeds 80% of the worker memory limit
  • Documentation: Document image size limits in the user guide
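
To make the 80% monitoring threshold concrete, the sketch below converts a worker memory limit in GiB into the byte value at which such an alert would fire. The helper name `alert_threshold_bytes` is hypothetical, and the calculation assumes the limit is a whole number of GiB.

```shell
# Hypothetical helper: bytes at which an 80% alert would fire for a
# worker memory limit given in GiB (integer arithmetic, rounds down).
alert_threshold_bytes() {
    limit_gib=$1
    limit_bytes=$((limit_gib * 1024 * 1024 * 1024))
    echo $((limit_bytes * 80 / 100))
}

alert_threshold_bytes 8   # prints 6871947673
```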

Related resources

  • Architecture: docs/modules/scanner/architecture.md
  • Related runbooks: scanner-worker-stuck.md, scanner-timeout.md
  • Dashboard: Grafana > Stella Ops > Scanner Memory