---
checkId: check.operations.job-queue
plugin: stellaops.doctor.operations
severity: fail
tags: [operations, queue, jobs, core]
---
# Job Queue Health

## What It Checks
Evaluates the platform job queue health across three dimensions:

- **Worker availability**: fail immediately if no workers are active (zero active workers).
- **Queue depth**: warn at 100+ pending jobs, fail at 500+ pending jobs.
- **Processing rate**: warn if processing rate drops below 10 jobs/minute.

Evidence collected: `QueueDepth`, `ActiveWorkers`, `TotalWorkers`, `ProcessingRate`, `OldestJobAge`, `CompletedLast24h`, `CriticalThreshold`, `WarningThreshold`, `RateStatus`.

This check always runs (no configuration prerequisites).

## Why It Matters
The job queue is the backbone of asynchronous processing in Stella Ops. It handles scan jobs, SBOM generation, vulnerability matching, evidence collection, notification delivery, and many other background tasks. If no workers are available, all background processing stops. A deep queue means jobs are waiting longer than expected, which cascades into delayed scan results, stale findings, and blocked release gates. A low processing rate indicates a performance bottleneck that will only get worse under load.
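
As a rough illustration of the thresholds listed under What It Checks, the sketch below classifies a queue snapshot. It is not the plugin's actual code: the function name `classify_job_queue`, the `pass` label, the integer-only rate, and the evaluation order (fail conditions before warn conditions) are all assumptions made for this example.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; names and evaluation order are assumptions.
# Arguments: active workers, pending jobs, processing rate (jobs/minute, integer).
classify_job_queue() {
  local active_workers="$1" queue_depth="$2" rate="$3"

  # Zero active workers fails immediately, regardless of depth or rate.
  if (( active_workers == 0 )); then
    echo "fail"
    return
  fi

  # 500+ pending jobs is the documented fail threshold.
  if (( queue_depth >= 500 )); then
    echo "fail"
    return
  fi

  # 100+ pending jobs, or fewer than 10 jobs/minute, yields a warning.
  if (( queue_depth >= 100 || rate < 10 )); then
    echo "warn"
    return
  fi

  echo "pass"
}
```

For example, `classify_job_queue 4 120 25` reports a warning because the queue depth has crossed 100 while workers are still active.
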

## Common Causes
- Worker service not running (crashed, not started, configuration error)
- All workers crashed or became unhealthy simultaneously
- Job processing slower than submission rate during high-activity periods
- Workers overloaded or misconfigured (too few workers for the workload)
- Downstream service bottleneck (database slow, external API rate-limited)
- Database performance issues slowing job dequeue operations
- Higher than normal job submission rate (bulk scan, new integration)

## How to Fix

### Docker Compose
```bash
# Check orchestrator service status
docker compose -f docker-compose.stella-ops.yml ps orchestrator

# View worker logs
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator

# Restart the orchestrator service
docker compose -f docker-compose.stella-ops.yml restart orchestrator

# Scale workers
docker compose -f docker-compose.stella-ops.yml up -d --scale orchestrator=4
```

```yaml
services:
  orchestrator:
    environment:
      Orchestrator__Workers__Count: "8"
      Orchestrator__Workers__MaxConcurrent: "4"
```

### Bare Metal / systemd
```bash
# Check orchestrator service
sudo systemctl status stellaops-orchestrator

# View logs for worker errors
sudo journalctl -u stellaops-orchestrator --since "1 hour ago" | grep -i "worker\|queue"

# Restart workers
stella orchestrator workers restart

# Scale workers
stella orchestrator workers scale --count 8

# Monitor queue depth trend
stella orchestrator queue watch
```

### Kubernetes / Helm
```bash
# Check orchestrator pods
kubectl get pods -l app=stellaops-orchestrator

# View worker logs
kubectl logs -l app=stellaops-orchestrator --tail=200

# Scale workers
kubectl scale deployment stellaops-orchestrator --replicas=4

# Check for stuck jobs (replace <pod-name> with an orchestrator pod from the first command)
kubectl exec -it <pod-name> -- stella orchestrator jobs list --status stuck
```

Set in Helm `values.yaml`:
```yaml
orchestrator:
  replicas: 4
  workers:
    count: 8
    maxConcurrent: 4
  resources:
    limits:
      memory: 2Gi
      cpu: "2"
```

## Verification
```
stella doctor run --check check.operations.job-queue
```

## Related Checks
- `check.operations.dead-letter` -- failed jobs end up in the dead letter queue
- `check.operations.scheduler` -- scheduler feeds jobs into the queue
- `check.scanner.queue` -- scanner-specific queue health
- `check.postgres.connectivity` -- database issues affect job dequeue performance
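
When a queue problem may originate upstream or downstream, it can help to run this check together with the related checks in one pass. The sketch below loops over the check IDs named on this page. It assumes `stella doctor run --check` accepts each related ID the same way it accepts `check.operations.job-queue`, which this page only shows for the job queue check. `STELLA_BIN` and `run_checks` are hypothetical names introduced so the loop can be dry-run.

```bash
#!/usr/bin/env bash
# Check IDs taken from this page: the job queue check plus its related checks.
CHECKS=(
  check.operations.job-queue
  check.operations.dead-letter
  check.operations.scheduler
  check.scanner.queue
  check.postgres.connectivity
)

# Runs each check in turn using the command shown under Verification.
# STELLA_BIN is a hypothetical override; set it to `echo` to print the
# commands instead of executing them.
run_checks() {
  local bin="${STELLA_BIN:-stella}"
  local id
  for id in "${CHECKS[@]}"; do
    "$bin" doctor run --check "$id"
  done
}
```

Calling `STELLA_BIN=echo run_checks` prints the five commands without running them. Calling `run_checks` on its own invokes `stella` for each ID.
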