Runbook: Scanner - Scan Timeout on Complex Images

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-002 - Scanner Runbooks

Metadata

Field          | Value
Component      | Scanner
Severity       | Medium
On-call scope  | Platform team
Last updated   | 2026-01-17
Doctor check   | check.scanner.timeout-rate

Symptoms

  • Scans failing with "timeout exceeded" error
  • Alert ScannerTimeoutExceeded firing
  • Metric scanner_scan_timeout_total increasing
  • Specific images consistently timing out
  • Error log: "scan operation exceeded timeout of X seconds"

Impact

Impact Type     | Description
User-facing     | Specific images cannot be scanned; pipeline blocked
Data integrity  | No data loss; scans can be retried with adjusted settings
SLA impact      | Release pipeline delayed for affected images

Diagnosis

Quick checks

  1. Check Doctor diagnostics:

    stella doctor --check check.scanner.timeout-rate
    
  2. Identify failing images:

    stella scanner jobs list --status timeout --last 1h
    

    Look for: Patterns in the failing images, such as a shared image type or unusually large size

  3. Check current timeout settings:

    stella scanner config get timeouts
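
The three quick checks can be bundled into one triage helper. This is a minimal sketch, not part of the Stella CLI: the function name `quick_checks` and the `STELLA` override variable are hypothetical, and passing a look-back window other than `1h` to `--last` is an assumption.

```shell
#!/bin/sh
# Hypothetical triage helper that runs the three quick checks in order.
# STELLA is an assumed override so the helper can be dry-run with STELLA=echo.
STELLA="${STELLA:-stella}"

quick_checks() {
    window="${1:-1h}"   # look-back window for --last; values other than 1h are an assumption

    echo "== Doctor diagnostics =="
    "$STELLA" doctor --check check.scanner.timeout-rate

    echo "== Timed-out jobs (last $window) =="
    "$STELLA" scanner jobs list --status timeout --last "$window"

    echo "== Current timeout settings =="
    "$STELLA" scanner config get timeouts
}
```

Setting `STELLA=echo` before calling `quick_checks` prints the commands instead of executing them, which is a cheap way to review what will run during an incident.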
    

Deep diagnosis

  1. Analyze image complexity:

    stella image inspect <image-ref> --format json | jq '{size, layers: .layers | length, files: .manifest.fileCount}'
    

    Problem if: more than 50 layers, more than 100k files, or an image size over 5 GB

  2. Check scanner worker load:

    stella scanner workers stats
    

    Problem if: All workers at capacity during timeouts

  3. Profile a scan:

    stella scan image --image <image-ref> --profile --verbose
    

    Look for: Which phase is slowest (layer extraction, SBOM generation, vuln matching)

  4. Check for filesystem-heavy images:

    stella image layers <image-ref> --sort-by file-count
    

    Problem if: Single layer with > 50k files (e.g., node_modules)
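
The image-level thresholds from step 1 can be applied mechanically once the numbers are extracted. The sketch below is illustrative only: `classify_image` is a hypothetical helper, it assumes the reported size is in bytes, and it reads "5GB" as decimal gigabytes.

```shell
#!/bin/sh
# classify_image LAYERS FILES SIZE_BYTES
# Prints one line per threshold exceeded, or "ok" when none are exceeded.
# Assumptions: size is given in bytes and 5 GB means 5 * 10^9 bytes.
classify_image() {
    layers="$1"
    files="$2"
    size_bytes="$3"
    limit_bytes=$((5 * 1000 * 1000 * 1000))
    flagged=0

    if [ "$layers" -gt 50 ]; then
        echo "layers: $layers > 50"
        flagged=1
    fi
    if [ "$files" -gt 100000 ]; then
        echo "files: $files > 100000"
        flagged=1
    fi
    if [ "$size_bytes" -gt "$limit_bytes" ]; then
        echo "size: $size_bytes bytes > 5 GB"
        flagged=1
    fi
    if [ "$flagged" -eq 0 ]; then
        echo "ok"
    fi
    return 0
}
```

For example, `classify_image 62 48000 1200000000` reports only the layer count, which suggests that layer handling, not file volume, is the first thing to investigate.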


Resolution

Immediate mitigation

  1. Increase timeout for specific image:

    stella scan image --image <image-ref> --timeout 30m
    
  2. Increase global scan timeout:

    stella scanner config set timeouts.scan 20m
    stella scanner workers restart
    
  3. Enable fast mode for initial scan:

    stella scan image --image <image-ref> --fast-mode
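
When it is unclear how much extra time an image needs, the per-image timeout can be raised in steps rather than jumping straight to the maximum. The following is a sketch under assumptions: `scan_with_escalation` and the `STELLA` override are hypothetical, the 15m and 30m steps come from the defaults quoted in this runbook, and the helper assumes the CLI exits non-zero when a scan fails. It retries on any failure, not only on timeouts.

```shell
#!/bin/sh
# Hypothetical helper: retry a scan with a larger timeout if the first attempt fails.
# Assumes `stella scan image` returns a non-zero exit status on failure.
STELLA="${STELLA:-stella}"

scan_with_escalation() {
    image="$1"
    for t in 15m 30m; do
        echo "scanning $image with --timeout $t" >&2
        if "$STELLA" scan image --image "$image" --timeout "$t"; then
            return 0
        fi
    done
    echo "scan of $image still failing at 30m; continue with the root cause fixes" >&2
    return 1
}
```

A scan that still fails at the largest step is a signal to stop raising timeouts and address the underlying cause instead.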
    

Root cause fix

If the image is too complex:

  1. Enable incremental scanning:

    stella scanner config set scan.incremental_mode true
    
  2. Configure layer caching:

    stella scanner config set cache.layer_dedup true
    stella scanner config set cache.sbom_cache true
    

If the filesystem is too large:

  1. Enable streaming SBOM generation:

    stella scanner config set sbom.streaming_threshold 500Mi
    
  2. Configure file sampling for massive images:

    stella scanner config set sbom.file_sample_max 100000
    

If vulnerability matching is slow:

  1. Enable parallel matching:

    stella scanner config set vuln.parallel_matching true
    stella scanner config set vuln.match_workers 4
    
  2. Optimize vulnerability database indexes:

    stella db optimize --component scanner
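
The three root-cause branches line up with the phases reported by a profiled scan. As a rough aid, the lookup below maps the slowest phase to the config keys worth reviewing. It is a sketch: `suggest_settings` and the phase labels are hypothetical, and pairing "layer extraction" with the "too complex" branch is an assumption rather than something the profiler states.

```shell
#!/bin/sh
# suggest_settings PHASE
# Prints the scanner config keys from this runbook that relate to the slowest phase.
# Phase labels are illustrative; match them to whatever the profile output shows.
suggest_settings() {
    case "$1" in
        layer-extraction)
            echo "scan.incremental_mode cache.layer_dedup cache.sbom_cache"
            ;;
        sbom-generation)
            echo "sbom.streaming_threshold sbom.file_sample_max"
            ;;
        vuln-matching)
            echo "vuln.parallel_matching vuln.match_workers"
            ;;
        *)
            echo "unknown phase: $1" >&2
            return 1
            ;;
    esac
}
```

Each key can then be inspected or changed with `stella scanner config set`, as shown in the steps above.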
    

Verification

# Retry the previously failing scan
stella scan image --image <image-ref> --timeout 30m

# Monitor scan progress
stella scanner jobs watch <job-id>

# Verify no timeouts in recent scans
stella scanner jobs list --status timeout --last 1h

Prevention

  • Capacity: Configure timeouts to match expected image complexity (15m default, 30m for large images)
  • Monitoring: Alert when the scan timeout rate exceeds 5%
  • Caching: Enable layer and SBOM caching for base images
  • Documentation: Document image size/complexity limits in user guide
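
The 5% monitoring threshold can be checked with integer arithmetic alone, which is convenient in cron jobs or CI steps. This is a minimal sketch: `timeout_rate_exceeded` is a hypothetical helper, and how the two counts are obtained (for example from the `scanner_scan_timeout_total` metric and a total scan count) is left to the caller.

```shell
#!/bin/sh
# timeout_rate_exceeded TIMEOUTS TOTAL [THRESHOLD_PCT]
# Exit status 0 when timeouts/total is strictly above the threshold (default 5%).
timeout_rate_exceeded() {
    timeouts="$1"
    total="$2"
    threshold="${3:-5}"

    # With no scans in the window there is nothing to alert on.
    [ "$total" -gt 0 ] || return 1

    # timeouts/total > threshold/100  is equivalent to  timeouts*100 > threshold*total
    [ $((timeouts * 100)) -gt $((threshold * total)) ]
}
```

Comparing cross-multiplied integers avoids floating-point handling, which plain `sh` does not provide.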

Related Resources

  • Architecture: docs/modules/scanner/architecture.md
  • Related runbooks: scanner-oom.md, scanner-worker-stuck.md
  • Dashboard: Grafana > Stella Ops > Scanner Performance