Runbook: Scanner - Scan Timeout on Complex Images
Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-002 - Scanner Runbooks
Metadata
| Field | Value |
|---|---|
| Component | Scanner |
| Severity | Medium |
| On-call scope | Platform team |
| Last updated | 2026-01-17 |
| Doctor check | check.scanner.timeout-rate |
Symptoms
- Scans failing with "timeout exceeded" error
- Alert ScannerTimeoutExceeded firing
- Metric scanner_scan_timeout_total increasing
- Specific images consistently timing out
- Error log: "scan operation exceeded timeout of X seconds"
Impact
| Impact Type | Description |
|---|---|
| User-facing | Specific images cannot be scanned; pipeline blocked |
| Data integrity | No data loss; scans can be retried with adjusted settings |
| SLA impact | Release pipeline delayed for affected images |
Diagnosis
Quick checks
- Check Doctor diagnostics:
    stella doctor --check check.scanner.timeout-rate
- Identify failing images:
    stella scanner jobs list --status timeout --last 1h
  Look for: Pattern in image types or sizes
- Check current timeout settings:
    stella scanner config get timeouts
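To make the "pattern in image types" check above less manual, the timed-out image references can be grouped by repository. The following is a minimal sketch, not part of the Stella tooling: it assumes you have already collected the image references from the job list into a plain Python list, and the helper names `repository_of` and `timeout_hotspots` are hypothetical.

```python
from collections import Counter


def repository_of(image_ref):
    """Return the image reference without its tag or digest."""
    # Drop a digest suffix such as "@sha256:...".
    name = image_ref.split("@", 1)[0]
    # A colon after the last slash is a tag separator; a colon before it
    # belongs to a registry host:port and must be kept.
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    if colon > last_slash:
        name = name[:colon]
    return name


def timeout_hotspots(image_refs, min_count=2):
    """List (repository, count) pairs that timed out at least min_count times."""
    counts = Counter(repository_of(ref) for ref in image_refs)
    return [(repo, n) for repo, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Illustrative references only; substitute the refs from your own job list.
    refs = [
        "registry.local:5000/team/app:1.0",
        "registry.local:5000/team/app:1.1",
        "registry.local:5000/team/app@sha256:abc",
        "docker.io/library/node:20",
    ]
    for repo, n in timeout_hotspots(refs):
        print(f"{repo}: {n} timeouts")
```

A repository that shows up repeatedly is a good candidate for the deep diagnosis steps that follow.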
Deep diagnosis
- Analyze image complexity:
    stella image inspect <image-ref> --format json | jq '{size, layers: .layers | length, files: .manifest.fileCount}'
  Problem if: > 50 layers, > 100k files, or > 5GB size
- Check scanner worker load:
    stella scanner workers stats
  Problem if: All workers at capacity during timeouts
- Profile a scan:
    stella scan image --image <image-ref> --profile --verbose
  Look for: Which phase is slowest (layer extraction, SBOM generation, vuln matching)
- Check for filesystem-heavy images:
    stella image layers <image-ref> --sort-by file-count
  Problem if: Single layer with > 50k files (e.g., node_modules)
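The "Problem if" thresholds above can be applied mechanically once the numbers are in hand. The sketch below is illustrative only: it assumes the summary is a dict with the `size`, `layers`, and `files` keys produced by the jq projection, that `size` is reported in bytes, and that 1 GB means 10^9 bytes. The function name `complexity_flags` and the flag labels are hypothetical.

```python
GB = 10**9  # assumption: sizes are in bytes and 1 GB = 10^9 bytes

# Thresholds taken from the "Problem if" notes in this runbook.
MAX_LAYERS = 50
MAX_FILES = 100_000
MAX_SIZE_BYTES = 5 * GB
MAX_FILES_PER_LAYER = 50_000


def complexity_flags(summary, layer_file_counts=()):
    """Return the list of thresholds that an image exceeds.

    summary: dict with 'size', 'layers' and 'files' keys.
    layer_file_counts: optional per-layer file counts.
    """
    flags = []
    if summary["layers"] > MAX_LAYERS:
        flags.append("too-many-layers")
    if summary["files"] > MAX_FILES:
        flags.append("too-many-files")
    if summary["size"] > MAX_SIZE_BYTES:
        flags.append("too-large")
    if any(count > MAX_FILES_PER_LAYER for count in layer_file_counts):
        flags.append("filesystem-heavy-layer")
    return flags


if __name__ == "__main__":
    # Made-up numbers for illustration.
    summary = {"size": 6 * GB, "layers": 12, "files": 240_000}
    print(complexity_flags(summary, layer_file_counts=[1_200, 61_000]))
```

An empty result suggests the image itself is within the documented limits, which points toward worker load rather than image complexity.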
Resolution
Immediate mitigation
- Increase timeout for specific image:
    stella scan image --image <image-ref> --timeout 30m
- Increase global scan timeout:
    stella scanner config set timeouts.scan 20m
    stella scanner workers restart
- Enable fast mode for initial scan:
    stella scan image --image <image-ref> --fast-mode
Root cause fix
If image is too complex:
- Enable incremental scanning:
    stella scanner config set scan.incremental_mode true
- Configure layer caching:
    stella scanner config set cache.layer_dedup true
    stella scanner config set cache.sbom_cache true
If filesystem is too large:
- Enable streaming SBOM generation:
    stella scanner config set sbom.streaming_threshold 500Gi
- Configure file sampling for massive images:
    stella scanner config set sbom.file_sample_max 100000
If vulnerability matching is slow:
- Enable parallel matching:
    stella scanner config set vuln.parallel_matching true
    stella scanner config set vuln.match_workers 4
- Optimize vulnerability database indexes:
    stella db optimize --component scanner
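One way to choose among the three root-cause branches above is to start from the slowest phase reported by the profiling step in Deep diagnosis. The sketch below assumes you have transcribed the per-phase durations by hand into a dict, since the profile output format is not specified here. The mapping from phase to branch is an interpretation of this runbook rather than documented behavior, and `suggest_branch` is a hypothetical helper.

```python
# Assumed mapping from profiled phase to the matching "Root cause fix" branch.
BRANCH_FOR_PHASE = {
    "layer extraction": "If image is too complex",
    "SBOM generation": "If filesystem is too large",
    "vuln matching": "If vulnerability matching is slow",
}


def suggest_branch(phase_seconds):
    """Return (slowest_phase, branch) for a dict of phase name -> seconds."""
    if not phase_seconds:
        raise ValueError("no phase timings supplied")
    slowest = max(phase_seconds, key=phase_seconds.get)
    # Unknown phase names yield None so the caller falls back to manual triage.
    return slowest, BRANCH_FOR_PHASE.get(slowest)


if __name__ == "__main__":
    # Made-up timings for illustration.
    timings = {"layer extraction": 95.0, "SBOM generation": 610.0, "vuln matching": 140.0}
    phase, branch = suggest_branch(timings)
    print(f"Slowest phase: {phase} -> {branch}")
```

Treat the result as a starting point: an image can exceed more than one limit, in which case several branches may apply.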
Verification
# Retry the previously failing scan
stella scan image --image <image-ref> --timeout 30m
# Monitor scan progress
stella scanner jobs watch <job-id>
# Verify no timeouts in recent scans
stella scanner jobs list --status timeout --last 1h
Prevention
- Capacity: Configure timeouts to match expected image complexity (15m default, 30m for large images)
- Monitoring: Alert when the scan timeout rate exceeds 5%
- Caching: Enable layer and SBOM caching for base images
- Documentation: Document image size/complexity limits in user guide
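The monitoring and capacity guidance above comes down to two small calculations. The sketch below shows them under stated assumptions: the timeout rate is taken as timed-out scans divided by total scans over the same window, and deciding whether an image counts as "large" is left to the caller. The function names are hypothetical and not part of any Stella API.

```python
DEFAULT_TIMEOUT_MINUTES = 15  # default from the Capacity guidance
LARGE_TIMEOUT_MINUTES = 30    # value suggested for large images
ALERT_THRESHOLD = 0.05        # alert when the timeout rate exceeds 5%


def timeout_rate(timed_out, total):
    """Fraction of scans that timed out; 0.0 when no scans ran."""
    if total == 0:
        return 0.0
    return timed_out / total


def should_alert(timed_out, total, threshold=ALERT_THRESHOLD):
    """True when the timeout rate is strictly above the threshold."""
    return timeout_rate(timed_out, total) > threshold


def pick_timeout_minutes(is_large):
    """Choose the scan timeout for an image based on its size class."""
    return LARGE_TIMEOUT_MINUTES if is_large else DEFAULT_TIMEOUT_MINUTES


if __name__ == "__main__":
    # Made-up counts for illustration.
    print(should_alert(timed_out=6, total=100))
    print(pick_timeout_minutes(is_large=True))
```

The comparison is strict, so a rate of exactly 5% does not alert, matching the "> 5%" wording.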
Related Resources
- Architecture: docs/modules/scanner/architecture.md
- Related runbooks: scanner-oom.md, scanner-worker-stuck.md
- Dashboard: Grafana > Stella Ops > Scanner Performance