synergy moats product advisory implementations
docs/operations/runbooks/scanner-worker-stuck.md
@@ -0,0 +1,174 @@
# Runbook: Scanner - Worker Not Processing Jobs

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | Critical |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.worker-health` |

---

## Symptoms

- [ ] Scan jobs stuck in "pending" or "processing" state for >5 minutes
- [ ] Scanner worker process shows 0% CPU usage
- [ ] Alert `ScannerWorkerStuck` or `ScannerQueueBacklog` firing
- [ ] UI shows "Scan in progress" indefinitely
- [ ] Metric `scanner_jobs_pending` increasing over time

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | New scans cannot complete, blocking CI/CD pipelines and release gates |
| **Data integrity** | No data loss; pending jobs will resume when worker recovers |
| **SLA impact** | Scan latency SLO violated if not resolved within 15 minutes |
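
The two time thresholds in this runbook (jobs stuck for more than 5 minutes, SLO violated if not resolved within 15 minutes) can be folded into a small triage helper. The sketch below is illustrative only: the function name `incident_stage`, its labels, and the two variables are hypothetical and not part of the `stella` CLI, and the elapsed time is supplied by the operator.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify an incident by how long jobs have been stuck.
# Thresholds come from this runbook: >5 minutes = stuck, >15 minutes = SLO violated.
STUCK_AFTER_SECONDS=300
SLO_WINDOW_SECONDS=900

incident_stage() {
  local elapsed="$1"
  # Reject empty or non-numeric input before doing integer comparisons.
  case "$elapsed" in
    ''|*[!0-9]*)
      echo "elapsed seconds must be a non-negative integer" >&2
      return 2
      ;;
  esac
  if [ "$elapsed" -gt "$SLO_WINDOW_SECONDS" ]; then
    echo "slo-breached"
  elif [ "$elapsed" -gt "$STUCK_AFTER_SECONDS" ]; then
    echo "stuck"
  else
    echo "normal"
  fi
}

# Example: jobs have been pending for 7 minutes.
incident_stage 420
```
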

---

## Diagnosis

### Quick checks (< 2 minutes)

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check check.scanner.worker-health
   ```

2. **Check scanner service status:**
   ```bash
   stella scanner status
   ```
   Expected: "Scanner workers: 4 active, 0 idle"
   Problem: "Scanner workers: 0 active" or "status: degraded"

3. **Check job queue depth:**
   ```bash
   stella scanner queue status
   ```
   Expected: Queue depth < 50
   Problem: Queue depth > 100 or growing rapidly
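
The queue-depth thresholds from step 3 can be expressed as a small classifier. This is a minimal sketch under stated assumptions: `classify_queue_depth` is a hypothetical helper, the depth is passed in by hand because the output format of `stella scanner queue status` is not shown here, and the `watch` label for the 50 to 100 range is an assumption since the runbook does not name that band. It does not capture the "growing rapidly" condition, which needs more than one sample.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a queue depth to the runbook's expected/problem bands.
classify_queue_depth() {
  local depth="$1"
  # Reject empty or non-numeric input before doing integer comparisons.
  case "$depth" in
    ''|*[!0-9]*)
      echo "queue depth must be a non-negative integer" >&2
      return 2
      ;;
  esac
  if [ "$depth" -lt 50 ]; then
    echo "ok"        # runbook: expected, depth < 50
  elif [ "$depth" -gt 100 ]; then
    echo "problem"   # runbook: problem, depth > 100
  else
    echo "watch"     # assumption: the runbook leaves 50..100 undefined
  fi
}

# Example: a depth read manually from the queue status output.
classify_queue_depth 130
```
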

### Deep diagnosis

1. **Check worker process logs:**
   ```bash
   stella scanner logs --tail 100 --level error
   ```
   Look for: "timeout", "connection refused", "out of memory"

2. **Check Valkey connectivity (job queue):**
   ```bash
   stella doctor --check check.storage.valkey
   ```

3. **Check whether workers were OOM-killed:**
   ```bash
   stella scanner workers inspect
   ```
   Look for: "exit_code: 137" (SIGKILL, typically the OOM killer) or "exit_code: 143" (SIGTERM)

4. **Check resource utilization:**
   ```bash
   stella obs metrics --filter scanner --last 10m
   ```
   Look for: Memory > 90%, CPU sustained > 95%
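
The exit codes in step 3 follow the common shell convention that a code above 128 means the process was terminated by signal number `code - 128`, so 137 maps to signal 9 (SIGKILL) and 143 to signal 15 (SIGTERM). The sketch below decodes a code by that rule; `describe_exit_code` is a hypothetical helper, not a `stella` command, and the exit code is assumed to be copied by hand from the inspect output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate an exit code into the signal that ended the process.
describe_exit_code() {
  local code="$1"
  # Reject empty or non-numeric input before doing arithmetic.
  case "$code" in
    ''|*[!0-9]*)
      echo "not a numeric exit code: ${code}" >&2
      return 2
      ;;
  esac
  if [ "$code" -le 128 ]; then
    echo "exit ${code}: not signal-terminated"
    return 0
  fi
  # Codes above 128 conventionally encode 128 + signal number.
  local sig=$(( code - 128 ))
  case "$sig" in
    9)  echo "exit ${code}: SIGKILL (often the OOM killer)" ;;
    15) echo "exit ${code}: SIGTERM" ;;
    *)  echo "exit ${code}: signal ${sig}" ;;
  esac
}

describe_exit_code 137
describe_exit_code 143
```
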

---

## Resolution

### Immediate mitigation

1. **Restart scanner workers:**
   ```bash
   stella scanner workers restart
   ```
   This terminates the current workers and spawns fresh ones.

2. **If the restart fails, force-restart the scanner service:**
   ```bash
   stella service restart scanner
   ```

3. **Verify workers are processing:**
   ```bash
   stella scanner queue status --watch
   ```
   The queue depth should start decreasing.
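
Whether the queue is actually draining after a restart can be judged from a few depth readings taken while watching the queue. The sketch below compares the oldest and newest reading; `queue_trend` is a hypothetical helper and the samples are assumed to be noted down manually, since the `--watch` output format is not shown in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare the first and last queue-depth samples
# (oldest first) and report the overall direction.
queue_trend() {
  if [ "$#" -lt 2 ]; then
    echo "need at least two depth samples" >&2
    return 2
  fi
  local first="$1"
  local last
  # Walk the arguments so that "last" ends up holding the newest sample.
  for last in "$@"; do :; done
  if [ "$last" -lt "$first" ]; then
    echo "draining"
  elif [ "$last" -gt "$first" ]; then
    echo "growing"
  else
    echo "flat"
  fi
}

# Example: three readings taken after restarting the workers.
queue_trend 120 95 60
```
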

### Root cause fix

**If workers were OOM-killed:**

1. Increase worker memory limit:
   ```bash
   stella scanner config set worker.memory_limit 4Gi
   stella scanner workers restart
   ```

2. Reduce concurrent scans per worker:
   ```bash
   stella scanner config set worker.concurrency 2
   stella scanner workers restart
   ```

**If the Valkey connection failed:**

1. Check Valkey health:
   ```bash
   stella doctor --check check.storage.valkey
   ```

2. Restart Valkey if needed (see `valkey-connection-failure.md`)

**If workers are deadlocked:**

1. Enable deadlock detection:
   ```bash
   stella scanner config set worker.deadlock_detection true
   stella scanner workers restart
   ```

### Verification

```bash
# Verify workers are healthy
stella doctor --check check.scanner.worker-health

# Submit a test scan
stella scan image --image alpine:latest --dry-run

# Watch queue drain
stella scanner queue status --watch

# Verify no errors in recent logs
stella scanner logs --tail 20 --level error
```

---

## Prevention

- [ ] **Alert:** Ensure the `ScannerQueueBacklog` alert is configured with a threshold below 100 jobs
- [ ] **Monitoring:** Add a Grafana panel for worker memory usage
- [ ] **Capacity:** Review worker count and memory limits during capacity planning
- [ ] **Deadlock:** Enable `worker.deadlock_detection` in production

---

## Related Resources

- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Scanner/Checks/WorkerHealthCheck.cs`
- **Dashboard:** Grafana > Stella Ops > Scanner Overview