synergy moats product advisory implementations

2026-01-17 01:30:03 +02:00
parent 77ff029205
commit 702a27ac83
112 changed files with 21356 additions and 127 deletions
--- a/docs/operations/runbooks/scanner-worker-stuck.md
+++ b/docs/operations/runbooks/scanner-worker-stuck.md
@@ -0,0 +1,174 @@
+# Runbook: Scanner - Worker Not Processing Jobs
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-002 - Scanner Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Scanner |
+| **Severity** | Critical |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.scanner.worker-health` |
+
+---
+
+## Symptoms
+
+- [ ] Scan jobs stuck in "pending" or "processing" state for >5 minutes
+- [ ] Scanner worker process shows 0% CPU usage
+- [ ] Alert `ScannerWorkerStuck` or `ScannerQueueBacklog` firing
+- [ ] UI shows "Scan in progress" indefinitely
+- [ ] Metric `scanner_jobs_pending` increasing over time
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | New scans cannot complete, blocking CI/CD pipelines and release gates |
+| **Data integrity** | No data loss; pending jobs resume once the worker recovers |
+| **SLA impact** | Scan latency SLO violated if not resolved within 15 minutes |
+
+---
+
+## Diagnosis
+
+### Quick checks (< 2 minutes)
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.scanner.worker-health
+   ```
+
+2. **Check scanner service status:**
+   ```bash
+   stella scanner status
+   ```
+   Expected: "Scanner workers: 4 active, 0 idle"
+   Problem: "Scanner workers: 0 active" or "status: degraded"
+
+3. **Check job queue depth:**
+   ```bash
+   stella scanner queue status
+   ```
+   Expected: Queue depth < 50
+   Problem: Queue depth > 100 or growing rapidly
+
+### Deep diagnosis
+
+1. **Check worker process logs:**
+   ```bash
+   stella scanner logs --tail 100 --level error
+   ```
+   Look for: "timeout", "connection refused", "out of memory"
+
+2. **Check Valkey connectivity (job queue):**
+   ```bash
+   stella doctor --check check.storage.valkey
+   ```
+
+3. **Check whether workers were OOM-killed:**
+   ```bash
+   stella scanner workers inspect
+   ```
+   Look for: "exit_code: 137" (SIGKILL, typically the OOM killer) or "exit_code: 143" (SIGTERM)
+
+4. **Check resource utilization:**
+   ```bash
+   stella obs metrics --filter scanner --last 10m
+   ```
+   Look for: Memory > 90%, CPU sustained > 95%
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Restart scanner workers:**
+   ```bash
+   stella scanner workers restart
+   ```
+   This terminates the current workers and spawns fresh ones.
+
+2. **If the worker restart fails, restart the entire scanner service:**
+   ```bash
+   stella service restart scanner
+   ```
+
+3. **Verify workers are processing:**
+   ```bash
+   stella scanner queue status --watch
+   ```
+   The queue depth should start decreasing.
+
+### Root cause fix
+
+**If workers were OOM-killed:**
+
+1. Increase worker memory limit:
+   ```bash
+   stella scanner config set worker.memory_limit 4Gi
+   stella scanner workers restart
+   ```
+
+2. Reduce concurrent scans per worker:
+   ```bash
+   stella scanner config set worker.concurrency 2
+   stella scanner workers restart
+   ```
+
+**If Valkey connection failed:**
+
+1. Check Valkey health:
+   ```bash
+   stella doctor --check check.storage.valkey
+   ```
+
+2. Restart Valkey if needed (see `valkey-connection-failure.md`)
+
+**If workers are deadlocked:**
+
+1. Enable deadlock detection:
+   ```bash
+   stella scanner config set worker.deadlock_detection true
+   stella scanner workers restart
+   ```
+
+### Verification
+
+```bash
+# Verify workers are healthy
+stella doctor --check check.scanner.worker-health
+
+# Submit a test scan
+stella scan image --image alpine:latest --dry-run
+
+# Watch queue drain
+stella scanner queue status --watch
+
+# Verify no errors in recent logs
+stella scanner logs --tail 20 --level error
+```
+
+---
+
+## Prevention
+
+- [ ] **Alert:** Ensure the `ScannerQueueBacklog` alert is configured with a threshold below 100 jobs
+- [ ] **Monitoring:** Add Grafana panel for worker memory usage
+- [ ] **Capacity:** Review worker count and memory limits during capacity planning
+- [ ] **Deadlock:** Enable `worker.deadlock_detection` in production
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/scanner/architecture.md`
+- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
+- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Scanner/Checks/WorkerHealthCheck.cs`
+- **Dashboard:** Grafana > Stella Ops > Scanner Overview