# Runbook: Scanner - Scan Timeout on Complex Images

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | Medium |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.timeout-rate` |

---

## Symptoms

- [ ] Scans failing with "timeout exceeded" error
- [ ] Alert `ScannerTimeoutExceeded` firing
- [ ] Metric `scanner_scan_timeout_total` increasing
- [ ] Specific images consistently timing out
- [ ] Error log: "scan operation exceeded timeout of X seconds"

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Specific images cannot be scanned; pipeline blocked |
| **Data integrity** | No data loss; scans can be retried with adjusted settings |
| **SLA impact** | Release pipeline delayed for affected images |

---

## Diagnosis

### Quick checks

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check check.scanner.timeout-rate
   ```

2. **Identify failing images:**
   ```bash
   stella scanner jobs list --status timeout --last 1h
   ```
   Look for: Pattern in image types or sizes

3. **Check current timeout settings:**
   ```bash
   stella scanner config get timeouts
   ```

### Deep diagnosis

1. **Analyze image complexity:**
   ```bash
   stella image inspect <image-ref> --format json | jq '{size, layers: .layers | length, files: .manifest.fileCount}'
   ```
   Problem if: > 50 layers, > 100k files, or > 5GB size

2. **Check scanner worker load:**
   ```bash
   stella scanner workers stats
   ```
   Problem if: All workers at capacity during timeouts

3. **Profile a scan:**
   ```bash
   stella scan image --image <image-ref> --profile --verbose
   ```
   Look for: Which phase is slowest (layer extraction, SBOM generation, vuln matching)

4. **Check for filesystem-heavy images:**
   ```bash
   stella image layers <image-ref> --sort-by file-count
   ```
   Problem if: Single layer with > 50k files (e.g., node_modules)
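
The thresholds in steps 1 and 4 can be applied mechanically once the counts are known. The sketch below is a minimal illustration and is not part of the `stella` CLI: `classify_image` and `heavy_layers` are hypothetical helper names, the input formats are assumptions, and "5GB" is read as decimal gigabytes. The numbers would need to be pulled from the `stella image inspect` and `stella image layers` output first.

```bash
# Hypothetical helper, not part of the stella CLI.
# Usage: classify_image <layer-count> <file-count> <size-bytes>
# Prints "complex" if any Deep diagnosis threshold is exceeded, otherwise "ok".
classify_image() (
  layers=$1
  files=$2
  size_bytes=$3
  max_bytes=$((5 * 1000 * 1000 * 1000))  # assumption: "5GB" means decimal gigabytes

  if [ "$layers" -gt 50 ] || [ "$files" -gt 100000 ] || [ "$size_bytes" -gt "$max_bytes" ]; then
    echo "complex"
  else
    echo "ok"
  fi
)

# Hypothetical helper. Reads "<layer-id> <file-count>" pairs on stdin (an assumed
# format) and prints the IDs of layers that hold more than 50k files.
heavy_layers() {
  awk '$2 > 50000 { print $1 }'
}

# Example with made-up values: 62 layers already exceeds the layer threshold.
classify_image 62 140000 3200000000
printf 'layer-a 1200\nlayer-b 78000\n' | heavy_layers
```
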

---

## Resolution

### Immediate mitigation

1. **Increase timeout for specific image:**
   ```bash
   stella scan image --image <image-ref> --timeout 30m
   ```

2. **Increase global scan timeout:**
   ```bash
   stella scanner config set timeouts.scan 20m
   stella scanner workers restart
   ```

3. **Enable fast mode for initial scan:**
   ```bash
   stella scan image --image <image-ref> --fast-mode
   ```
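
The error log reports the timeout in seconds, while the commands above set it with a minute suffix, so comparing the two needs a unit conversion. The following is a minimal sketch of that conversion: `to_seconds` is a hypothetical helper, and the set of accepted suffixes (`s`, `m`, `h`, or none for seconds) is an assumption rather than a statement about what the `stella` CLI accepts.

```bash
# Hypothetical helper, not part of the stella CLI.
# Usage: to_seconds <duration>   e.g. 30m, 20m, 90, 1h
to_seconds() (
  value=$1
  number=${value%[smh]}    # drop one trailing unit letter, if present
  unit=${value#"$number"}  # the letter that was dropped (empty means seconds)

  case "$number" in
    ''|*[!0-9]*)
      echo "invalid duration: $value" >&2
      exit 1 ;;
  esac

  case "$unit" in
    ''|s) echo "$number" ;;
    m)    echo $((number * 60)) ;;
    h)    echo $((number * 3600)) ;;
  esac
)

# Example: express the per-image timeout from step 1 in seconds.
to_seconds 30m
```
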

### Root cause fix

**If image is too complex:**

1. Enable incremental scanning:
   ```bash
   stella scanner config set scan.incremental_mode true
   ```

2. Configure layer caching:
   ```bash
   stella scanner config set cache.layer_dedup true
   stella scanner config set cache.sbom_cache true
   ```

**If filesystem is too large:**

1. Enable streaming SBOM generation:
   ```bash
   stella scanner config set sbom.streaming_threshold 500Gi
   ```

2. Configure file sampling for massive images:
   ```bash
   stella scanner config set sbom.file_sample_max 100000
   ```

**If vulnerability matching is slow:**

1. Enable parallel matching:
   ```bash
   stella scanner config set vuln.parallel_matching true
   stella scanner config set vuln.match_workers 4
   ```

2. Optimize vulnerability database indexes:
   ```bash
   stella db optimize --component scanner
   ```
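
One way to pick a branch is to start from the slowest phase reported by the profiling step in Deep diagnosis. The sketch below assumes a mapping from phase to branch (layer extraction to "image is too complex", SBOM generation to "filesystem is too large", vuln matching to "vulnerability matching is slow"); that mapping, the phase labels, and the `settings_for_phase` name are illustrative assumptions. The helper only prints the config keys from this runbook and changes nothing.

```bash
# Hypothetical helper, not part of the stella CLI.
# Usage: settings_for_phase <layer-extraction|sbom-generation|vuln-matching>
# Prints the config keys from the matching Root cause fix branch, one per line.
settings_for_phase() (
  case "$1" in
    layer-extraction)
      printf '%s\n' scan.incremental_mode cache.layer_dedup cache.sbom_cache ;;
    sbom-generation)
      printf '%s\n' sbom.streaming_threshold sbom.file_sample_max ;;
    vuln-matching)
      printf '%s\n' vuln.parallel_matching vuln.match_workers ;;
    *)
      echo "unknown phase: $1" >&2
      exit 1 ;;
  esac
)

# Example: list the settings to review when SBOM generation dominates the profile.
settings_for_phase sbom-generation
```
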

### Verification

```bash
# Retry the previously failing scan
stella scan image --image <image-ref> --timeout 30m

# Monitor scan progress
stella scanner jobs watch <job-id>

# Verify no timeouts in recent scans
stella scanner jobs list --status timeout --last 1h
```

---

## Prevention

- [ ] **Capacity:** Configure appropriate timeouts based on expected image complexity (15m default, 30m for large)
- [ ] **Monitoring:** Alert on timeout rate > 5%
- [ ] **Caching:** Enable layer and SBOM caching for base images
- [ ] **Documentation:** Document image size/complexity limits in user guide
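
The 5% alert condition from the monitoring item can be checked with integer arithmetic alone. This is a minimal sketch under stated assumptions: `timeout_rate_exceeded` is a hypothetical helper, the timeout count is taken to come from `scanner_scan_timeout_total` over some window, and the source of the total scan count for the same window is not specified in this runbook.

```bash
# Hypothetical helper, not part of the stella CLI.
# Usage: timeout_rate_exceeded <timeout-count> <total-scan-count>
# Exits 0 when timeouts exceed 5% of all scans, non-zero otherwise.
# Integer math only: timeouts/total > 5/100 is the same as timeouts*100 > total*5.
timeout_rate_exceeded() (
  timeouts=$1
  total=$2
  [ "$total" -gt 0 ] || exit 1   # no scans in the window, so nothing to alert on
  [ $((timeouts * 100)) -gt $((total * 5)) ]
)

# Example with made-up counts: 6 timeouts out of 100 scans is 6%.
if timeout_rate_exceeded 6 100; then
  echo "timeout rate above 5%"
fi
```
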

---

## Related Resources

- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-oom.md`, `scanner-worker-stuck.md`
- **Dashboard:** Grafana > Stella Ops > Scanner Performance