synergy moats product advisory implementations
docs/operations/runbooks/scanner-oom.md
@@ -0,0 +1,152 @@
# Runbook: Scanner - Out of Memory on Large Images

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.memory-usage` |

---

## Symptoms

- [ ] Scanner worker exits with code 137 (OOM killed)
- [ ] Scans fail consistently for specific large images
- [ ] Error log contains "fatal error: runtime: out of memory"
- [ ] Alert `ScannerWorkerOOM` firing
- [ ] Metric `scanner_worker_restarts_total{reason="oom"}` increasing
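
The exit code in the first symptom follows the usual shell convention that a process terminated by signal N is reported as status 128 + N, so 137 corresponds to signal 9 (SIGKILL), which is the signal the kernel OOM killer sends. A minimal sketch of that decoding is shown here; `describe_exit_code` is a hypothetical helper, not part of the `stella` CLI, and a SIGKILL on its own does not prove an OOM kill, so confirm against the logs and the alert.

```bash
# Decode an exit status using the "128 + signal number" convention.
describe_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # kill -l <number> prints the signal name without the SIG prefix
    echo "killed by SIG$(kill -l "$sig")"
  else
    echo "exited with status $code (not signal-terminated)"
  fi
}

describe_exit_code 137   # 137 - 128 = 9, prints "killed by SIGKILL"
```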

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Large images cannot be scanned; smaller images may still work |
| **Data integrity** | No data loss; failed scans can be retried |
| **SLA impact** | Specific images blocked from release pipeline |

---

## Diagnosis

### Quick checks

1. **Identify the failing image:**
   ```bash
   stella scanner jobs list --status failed --last 1h
   ```

2. **Check image size:**
   ```bash
   stella image inspect <image-ref> --format json | jq '.size'
   ```
   Problem if: Image size > 2GB or layer count > 100

3. **Check worker memory limit:**
   ```bash
   stella scanner config get worker.memory_limit
   ```
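
The "Problem if" rule in step 2 can be applied mechanically once the size and layer count are known. The sketch below assumes the size is available as a byte count and reads "2GB" as 2 × 1024³ bytes; neither is confirmed by this runbook, and `is_large_image` is a hypothetical helper rather than a `stella` command.

```bash
# Thresholds taken from the runbook; the byte interpretation is an assumption.
SIZE_LIMIT_BYTES=$((2 * 1024 * 1024 * 1024))
LAYER_LIMIT=100

# Print "large" if either threshold is exceeded, otherwise "ok".
is_large_image() {
  size_bytes="$1"
  layer_count="$2"
  if [ "$size_bytes" -gt "$SIZE_LIMIT_BYTES" ] || [ "$layer_count" -gt "$LAYER_LIMIT" ]; then
    echo "large"
  else
    echo "ok"
  fi
}

is_large_image 3221225472 40    # 3 GiB, 40 layers: size over the limit
is_large_image 524288000 120    # 500 MiB, 120 layers: layer count over the limit
is_large_image 524288000 40     # both under the limits
```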

### Deep diagnosis

1. **Profile memory usage during scan:**
   ```bash
   stella scan image --image <image-ref> --profile-memory
   ```

2. **Check SBOM generation memory:**
   ```bash
   stella scanner logs --filter "sbom" --level debug --last 30m
   ```
   Look for: "memory allocation failed", "heap exhausted"

3. **Identify memory-heavy layers:**
   ```bash
   stella image layers <image-ref> --sort-by size
   ```
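
If the scanner logs have been saved to a file, the messages named in this runbook can be counted with plain `grep`. This is a sketch under that assumption: `count_oom_lines` is a hypothetical helper, and the sample log lines are invented for illustration rather than real scanner output.

```bash
# Count log lines matching the OOM-related messages named in this runbook.
count_oom_lines() {
  grep -c -i -E 'out of memory|memory allocation failed|heap exhausted' "$1"
}

# Throwaway sample log; the contents are made up for this example.
sample_log="$(mktemp)"
printf '%s\n' \
  'level=debug msg="sbom: walking layer 12"' \
  'level=error msg="sbom: memory allocation failed"' \
  'fatal error: runtime: out of memory' > "$sample_log"

count_oom_lines "$sample_log"   # two of the three lines match
rm -f "$sample_log"
```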

---

## Resolution

### Immediate mitigation

1. **Increase worker memory limit:**
   ```bash
   stella scanner config set worker.memory_limit 8Gi
   stella scanner workers restart
   ```

2. **Enable streaming mode for large images:**
   ```bash
   stella scanner config set sbom.streaming_threshold 1Gi
   stella scanner workers restart
   ```

3. **Retry the failed scan:**
   ```bash
   stella scan image --image <image-ref> --retry
   ```

### Root cause fix

**For consistently large images:**

1. Configure dedicated large-image worker pool:
   ```bash
   stella scanner workers add --pool large-images --memory 16Gi --count 2
   stella scanner config set routing.large_image_threshold 2Gi
   stella scanner config set routing.large_image_pool large-images
   ```
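
To make the effect of the threshold concrete, the sketch below converts a quantity such as `2Gi` to bytes and picks a pool for a given image size. It only illustrates the idea: the scanner's actual routing logic is not described in this runbook, the strict "greater than" comparison is an assumption, and `default` is a placeholder name for the regular pool.

```bash
# Convert a quantity such as "2Gi" or "512Mi" to bytes.
to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *)   echo "$1" ;;   # assume a plain byte count
  esac
}

# Hypothetical routing rule: images above the threshold go to the large pool.
pick_pool() {
  image_bytes="$1"
  threshold_bytes="$(to_bytes "$2")"
  if [ "$image_bytes" -gt "$threshold_bytes" ]; then
    echo "large-images"
  else
    echo "default"   # placeholder name for the regular worker pool
  fi
}

pick_pool 3221225472 2Gi   # 3 GiB image: large-images
pick_pool 1073741824 2Gi   # 1 GiB image: default
```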

**For images with many small files (node_modules, etc.):**

1. Enable incremental SBOM mode:
   ```bash
   stella scanner config set sbom.incremental_mode true
   ```

**For base image reuse:**

1. Enable layer caching:
   ```bash
   stella scanner config set cache.layer_dedup true
   ```

### Verification

```bash
# Retry the previously failing scan
stella scan image --image <image-ref>

# Monitor memory during scan
stella scanner workers stats --watch

# Verify no OOM in recent logs
stella scanner logs --filter "out of memory" --last 1h
```

---

## Prevention

- [ ] **Capacity:** Set the worker memory limit based on the largest expected image (4Gi minimum recommended)
- [ ] **Routing:** Configure a large-image pool for images > 2GB
- [ ] **Monitoring:** Alert when `scanner_worker_memory_usage_bytes` exceeds 80% of the limit
- [ ] **Documentation:** Document image size limits in the user guide
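
Because `scanner_worker_memory_usage_bytes` is reported in bytes, the 80% alert threshold has to be expressed as a byte count for whatever limit is configured. A small sketch of that arithmetic, assuming the limit is a whole number of GiB (`alert_threshold_bytes` is a hypothetical helper, and integer division truncates the result):

```bash
# 80% of a memory limit given in GiB, as a byte count.
alert_threshold_bytes() {
  limit_gib="$1"
  limit_bytes=$((limit_gib * 1024 * 1024 * 1024))
  echo $((limit_bytes * 80 / 100))
}

alert_threshold_bytes 8   # 8Gi limit: 6871947673
alert_threshold_bytes 4   # 4Gi limit: 3435973836
```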

---

## Related Resources

- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
- **Dashboard:** Grafana > Stella Ops > Scanner Memory