---
checkId: check.operations.job-queue
plugin: stellaops.doctor.operations
severity: fail
tags: [operations, queue, jobs, core]
---
# Job Queue Health

## What It Checks
Evaluates the platform job queue health across three dimensions:

- **Worker availability**: fail immediately if there are zero active workers.
- **Queue depth**: warn at 100 or more pending jobs, fail at 500 or more.
- **Processing rate**: warn if the processing rate drops below 10 jobs/minute.

Evidence collected: `QueueDepth`, `ActiveWorkers`, `TotalWorkers`, `ProcessingRate`, `OldestJobAge`, `CompletedLast24h`, `CriticalThreshold`, `WarningThreshold`, `RateStatus`.

This check always runs (no configuration prerequisites).
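
The thresholds above can be read as a small decision procedure. The following is a minimal sketch, not the check's actual implementation: the function name `evaluate_queue` and the `pass` result label are hypothetical, and the evaluation order (workers first, then depth, then rate) is an assumption based on the "fail immediately" wording.

```bash
# Hypothetical sketch of the threshold logic described above.
# Arguments: active_workers queue_depth processing_rate (whole jobs/minute).
evaluate_queue() {
  local active_workers=$1 queue_depth=$2 rate=$3

  if (( active_workers == 0 )); then
    echo "fail"   # zero active workers fails regardless of depth or rate
  elif (( queue_depth >= 500 )); then
    echo "fail"   # critical depth threshold
  elif (( queue_depth >= 100 )); then
    echo "warn"   # warning depth threshold
  elif (( rate < 10 )); then
    echo "warn"   # processing rate below 10 jobs/minute
  else
    echo "pass"   # assumed label for a healthy result
  fi
}

evaluate_queue 4 120 25   # prints "warn"
```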

## Why It Matters
The job queue is the backbone of asynchronous processing in Stella Ops. It handles scan jobs, SBOM generation, vulnerability matching, evidence collection, notification delivery, and many other background tasks. If no workers are available, all background processing stops. A deep queue means jobs wait longer than expected, which cascades into delayed scan results, stale findings, and blocked release gates. A low processing rate points to a performance bottleneck, and the backlog keeps growing for as long as jobs arrive faster than they are processed.
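
For a rough sense of the delay a backlog implies, divide the pending depth by the processing rate. This is a back-of-the-envelope sketch that assumes a steady rate and no new submissions while the backlog drains; `drain_minutes` is a hypothetical helper, not a `stella` command.

```bash
# Hypothetical helper: minutes needed to drain a backlog at a steady rate.
# Arguments: queue_depth processing_rate (whole jobs/minute).
drain_minutes() {
  local depth=$1 rate=$2
  if (( rate <= 0 )); then
    echo "never"   # nothing is being processed, so the backlog cannot drain
    return
  fi
  # Integer ceiling of depth / rate.
  echo $(( (depth + rate - 1) / rate ))
}

drain_minutes 500 10   # prints 50
```

At the fail threshold of 500 pending jobs and the warning rate of 10 jobs/minute, the newest job waits about 50 minutes under these assumptions.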

## Common Causes
- Worker service not running (crashed, not started, configuration error)
- All workers crashed or became unhealthy simultaneously
- Job processing slower than submission rate during high-activity periods
- Workers overloaded or misconfigured (too few workers for the workload)
- Downstream service bottleneck (database slow, external API rate-limited)
- Database performance issues slowing job dequeue operations
- Higher than normal job submission rate (bulk scan, new integration)

## How to Fix

### Docker Compose
```bash
# Check orchestrator service status
docker compose -f docker-compose.stella-ops.yml ps orchestrator

# View worker logs
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator

# Restart the orchestrator service
docker compose -f docker-compose.stella-ops.yml restart orchestrator

# Scale out to 4 orchestrator containers
docker compose -f docker-compose.stella-ops.yml up -d --scale orchestrator=4
```

Set in `docker-compose.stella-ops.yml`:

```yaml
services:
  orchestrator:
    environment:
      Orchestrator__Workers__Count: "8"
      Orchestrator__Workers__MaxConcurrent: "4"
```

### Bare Metal / systemd
```bash
# Check orchestrator service
sudo systemctl status stellaops-orchestrator

# View logs for worker errors
sudo journalctl -u stellaops-orchestrator --since "1 hour ago" | grep -iE "worker|queue"

# Restart workers
stella orchestrator workers restart

# Scale workers
stella orchestrator workers scale --count 8

# Monitor queue depth trend
stella orchestrator queue watch
```

### Kubernetes / Helm
```bash
# Check orchestrator pods
kubectl get pods -l app=stellaops-orchestrator

# View worker logs
kubectl logs -l app=stellaops-orchestrator --tail=200

# Scale out to 4 orchestrator replicas
kubectl scale deployment stellaops-orchestrator --replicas=4

# Check for stuck jobs
kubectl exec -it <orchestrator-pod> -- stella orchestrator jobs list --status stuck
```

Set in Helm `values.yaml`:

```yaml
orchestrator:
  replicas: 4
  workers:
    count: 8
    maxConcurrent: 4
  resources:
    limits:
      memory: 2Gi
      cpu: "2"
```

## Verification
```bash
stella doctor run --check check.operations.job-queue
```
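
After a restart or scale-out, workers may need a moment to come up, so a single run can still fail. The sketch below re-runs a command until it succeeds or the attempts run out. It assumes that `stella doctor run` exits non-zero when the check does not pass, which is not confirmed here; `retry_until_ok` is a hypothetical helper.

```bash
# Hypothetical helper: retry a command with a fixed delay between attempts.
# Usage: retry_until_ok <max_attempts> <delay_seconds> <command...>
retry_until_ok() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final attempt.
    if (( attempt < max_attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (assumes a non-zero exit code on a failing check):
# retry_until_ok 5 30 stella doctor run --check check.operations.job-queue
```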

## Related Checks
- `check.operations.dead-letter` -- failed jobs end up in the dead letter queue
- `check.operations.scheduler` -- scheduler feeds jobs into the queue
- `check.scanner.queue` -- scanner-specific queue health
- `check.postgres.connectivity` -- database issues affect job dequeue performance