
Runbook: Scanner - Worker Not Processing Jobs

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage
Task: RUN-002 - Scanner Runbooks

Metadata

| Field | Value |
| --- | --- |
| Component | Scanner |
| Severity | Critical |
| On-call scope | Platform team |
| Last updated | 2026-01-17 |
| Doctor check | check.scanner.worker-health |

Symptoms

  • Scan jobs stuck in "pending" or "processing" state for >5 minutes
  • Scanner worker process shows 0% CPU usage
  • Alert ScannerWorkerStuck or ScannerQueueBacklog firing
  • UI shows "Scan in progress" indefinitely
  • Metric scanner_jobs_pending increasing over time

Impact

| Impact Type | Description |
| --- | --- |
| User-facing | New scans cannot complete, blocking CI/CD pipelines and release gates |
| Data integrity | No data loss; pending jobs resume once the workers recover |
| SLA impact | Scan latency SLO is violated if the incident is not resolved within 15 minutes |

Diagnosis

Quick checks (< 2 minutes)

  1. Check Doctor diagnostics:

    stella doctor --check check.scanner.worker-health
    
  2. Check scanner service status:

    stella scanner status
    

    Expected: "Scanner workers: 4 active, 0 idle"
    Problem: "Scanner workers: 0 active" or "status: degraded"

  3. Check job queue depth:

    stella scanner queue status
    

    Expected: Queue depth < 50
    Problem: Queue depth > 100 or growing rapidly

    A sketch that runs all three quick checks in one pass follows this list.
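
The sketch below chains the three quick checks, assuming each stella command exits non-zero when its check fails; if your build only reports problems in its output text, read the output instead of relying on exit codes.

#!/usr/bin/env bash
# Quick triage sketch: run the three quick checks in order, stopping at the first failure.
# Assumption: each stella command exits non-zero when the check it runs fails.
set -euo pipefail

echo "== Doctor: worker health =="
stella doctor --check check.scanner.worker-health

echo "== Scanner service status =="
stella scanner status

echo "== Job queue depth =="
stella scanner queue status

echo "All quick checks passed; continue with deep diagnosis if symptoms persist."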

Deep diagnosis

  1. Check worker process logs:

    stella scanner logs --tail 100 --level error
    

    Look for: "timeout", "connection refused", "out of memory"

  2. Check Valkey connectivity (job queue):

    stella doctor --check check.storage.valkey
    
  3. Check if workers are OOM-killed:

    stella scanner workers inspect
    

    Look for: "exit_code: 137" (OOM-killed) or "exit_code: 143" (SIGTERM)

    If the workers run on Kubernetes, a cross-check at the orchestrator level is sketched after this list.

  4. Check resource utilization:

    stella obs metrics --filter scanner --last 10m
    

    Look for: Memory > 90%, CPU sustained > 95%
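
If the scanner workers run as Kubernetes pods, the OOM evidence from step 3 can be cross-checked at the orchestrator level. This is a sketch only: the stellaops namespace and the app=scanner-worker label are illustrative placeholders, not confirmed names for this deployment.

# List worker pods (namespace and label are assumptions; adjust to your deployment).
kubectl get pods -n stellaops -l app=scanner-worker

# Print each pod's last termination reason; "OOMKilled" confirms an out-of-memory kill.
kubectl get pods -n stellaops -l app=scanner-worker \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

# The "Last State" block in describe output shows exit codes and kill reasons as well.
kubectl describe pod -n stellaops -l app=scanner-worker | grep -A3 "Last State"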


Resolution

Immediate mitigation

  1. Restart scanner workers:

    stella scanner workers restart
    

    This terminates the current workers and spawns fresh ones; pending jobs stay in the queue and resume once new workers pick them up.

  2. If restart fails, force restart the scanner service:

    stella service restart scanner
    
  3. Verify workers are processing:

    stella scanner queue status --watch
    

    Queue depth should start decreasing. A bounded polling sketch follows this list as an alternative to --watch.
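
As an alternative to --watch, the bounded loop below polls the queue every 30 seconds for five minutes so you can confirm the depth is trending down. It is a sketch: the output of stella scanner queue status is assumed to include a readable depth figure, so judge the trend by eye rather than scripting pass/fail logic against it.

# Poll the queue ten times at 30-second intervals after the restart.
for _ in $(seq 1 10); do
  date -u +"%Y-%m-%dT%H:%M:%SZ"
  stella scanner queue status
  sleep 30
done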

Root cause fix

If workers were OOM-killed:

  1. Increase worker memory limit:

    stella scanner config set worker.memory_limit 4Gi
    stella scanner workers restart
    
  2. Reduce concurrent scans per worker (a rough memory-sizing check follows this list):

    stella scanner config set worker.concurrency 2
    stella scanner workers restart
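
Before settling on new values, sanity-check that the memory limit covers the intended concurrency: worker.memory_limit should exceed worker.concurrency times the peak memory of a single scan, plus the worker's baseline overhead. The figures below are placeholders, not measured values; take the real per-scan peak from the metrics gathered during deep diagnosis.

# Back-of-the-envelope sizing check (all numbers are placeholders, not measurements).
CONCURRENCY=2            # worker.concurrency you plan to set
PEAK_SCAN_MB=1500        # assumed peak memory of a single scan, in MiB
BASELINE_MB=512          # assumed baseline overhead of the worker process, in MiB
NEEDED_MB=$((CONCURRENCY * PEAK_SCAN_MB + BASELINE_MB))
NEEDED_GI=$(( (NEEDED_MB + 1023) / 1024 ))
echo "worker.memory_limit should be at least ${NEEDED_MB} MiB (~${NEEDED_GI} Gi)"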
    

If Valkey connection failed:

  1. Check Valkey health:

    stella doctor --check check.storage.valkey
    
  2. Restart Valkey if needed (see valkey-connection-failure.md)

If workers are deadlocked:

  1. Enable deadlock detection:

    stella scanner config set worker.deadlock_detection true
    stella scanner workers restart
    

Verification

# Verify workers are healthy
stella doctor --check check.scanner.worker-health

# Submit a test scan
stella scan image --image alpine:latest --dry-run

# Watch queue drain
stella scanner queue status --watch

# Verify no errors in recent logs
stella scanner logs --tail 20 --level error
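
If you want a single pass/fail signal to close the incident, the verification steps can be wrapped in a short script. This is a sketch under the assumption that each stella command exits non-zero on failure; the error-log check is kept informational because judging its output is a human call.

#!/usr/bin/env bash
# Post-incident verification wrapper (sketch; assumes non-zero exit codes on failure).
set -euo pipefail

stella doctor --check check.scanner.worker-health
stella scan image --image alpine:latest --dry-run

# Informational only: an empty error log is good, but read the output yourself.
stella scanner logs --tail 20 --level error || true

echo "Scanner worker verification passed."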

Prevention

  • Alert: Ensure the ScannerQueueBacklog alert is configured to fire when queue depth exceeds 100 jobs
  • Monitoring: Add Grafana panel for worker memory usage
  • Capacity: Review worker count and memory limits during capacity planning
  • Deadlock: Enable worker.deadlock_detection in production

References

  • Architecture: docs/modules/scanner/architecture.md
  • Related runbooks: scanner-oom.md, scanner-timeout.md
  • Doctor check: src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Scanner/Checks/WorkerHealthCheck.cs
  • Dashboard: Grafana > Stella Ops > Scanner Overview