| checkId | plugin | severity | tags |
|---|---|---|---|
| check.operations.job-queue | stellaops.doctor.operations | fail | |
# Job Queue Health

## What It Checks
Evaluates the platform job queue health across three dimensions:
- Worker availability: fails immediately when zero workers are active.
- Queue depth: warns at 100+ pending jobs, fails at 500+.
- Processing rate: warns when the processing rate drops below 10 jobs/minute.
Evidence collected: `QueueDepth`, `ActiveWorkers`, `TotalWorkers`, `ProcessingRate`, `OldestJobAge`, `CompletedLast24h`, `CriticalThreshold`, `WarningThreshold`, `RateStatus`.
This check always runs (no configuration prerequisites).
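The three dimensions combine into a single verdict. The following is a minimal sketch of that severity logic, assuming simplified inputs; the variable names mirror the evidence fields above, but the function itself is illustrative and not part of the Doctor plugin implementation.

```bash
#!/usr/bin/env bash
# Illustrative only: mirrors the documented thresholds, not the actual plugin code.
evaluate_job_queue() {
  local active_workers=$1 queue_depth=$2 rate_per_min=$3
  if (( active_workers == 0 )); then
    echo "fail: no active workers"
  elif (( queue_depth >= 500 )); then
    echo "fail: queue depth ${queue_depth} at or above critical threshold (500)"
  elif (( queue_depth >= 100 )); then
    echo "warn: queue depth ${queue_depth} at or above warning threshold (100)"
  elif (( rate_per_min < 10 )); then
    echo "warn: processing rate ${rate_per_min}/min below 10"
  else
    echo "pass"
  fi
}

evaluate_job_queue 4 120 25   # -> warn: queue depth 120 at or above warning threshold (100)
```

Note that worker availability is evaluated first: a fail on zero workers takes precedence over any queue-depth or rate signal.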
## Why It Matters
The job queue is the backbone of asynchronous processing in Stella Ops. It handles scan jobs, SBOM generation, vulnerability matching, evidence collection, notification delivery, and many other background tasks. If no workers are available, all background processing stops. A deep queue means jobs are waiting longer than expected, which cascades into delayed scan results, stale findings, and blocked release gates. A low processing rate indicates a performance bottleneck that will only get worse under load.
## Common Causes
- Worker service not running (crashed, not started, configuration error)
- All workers crashed or became unhealthy simultaneously
- Job processing slower than submission rate during high-activity periods
- Workers overloaded or misconfigured (too few workers for the workload)
- Downstream service bottleneck (database slow, external API rate-limited)
- Database performance issues slowing job dequeue operations
- Higher than normal job submission rate (bulk scan, new integration)
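Before applying a fix, it can help to narrow down which of these causes applies. A minimal triage sequence, assuming a bare-metal/systemd install (the `stella` subcommands are the same ones used in the fix steps below):

```bash
# Is the worker service running at all?
sudo systemctl is-active stellaops-orchestrator

# Any recent crash or unhealthy-worker messages in the last hour?
sudo journalctl -u stellaops-orchestrator --since "1 hour ago" | grep -icE "crash|unhealthy"

# Is the queue depth growing or draining?
stella orchestrator queue watch
```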
## How to Fix

### Docker Compose
```bash
# Check orchestrator service status
docker compose -f docker-compose.stella-ops.yml ps orchestrator

# View worker logs
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator

# Restart the orchestrator service
docker compose -f docker-compose.stella-ops.yml restart orchestrator

# Scale workers
docker compose -f docker-compose.stella-ops.yml up -d --scale orchestrator=4
```
To persist worker settings across restarts, set them in the compose file (or an override file):

```yaml
services:
  orchestrator:
    environment:
      Orchestrator__Workers__Count: "8"
      Orchestrator__Workers__MaxConcurrent: "4"
```
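After recreating the service with `docker compose -f docker-compose.stella-ops.yml up -d orchestrator`, a quick way to confirm the variables reached the container (standard compose tooling, nothing Stella Ops-specific):

```bash
docker compose -f docker-compose.stella-ops.yml exec orchestrator \
  env | grep Orchestrator__Workers
```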
### Bare Metal / systemd
```bash
# Check orchestrator service
sudo systemctl status stellaops-orchestrator

# View logs for worker errors
sudo journalctl -u stellaops-orchestrator --since "1 hour ago" | grep -i "worker\|queue"

# Restart workers
stella orchestrator workers restart

# Scale workers
stella orchestrator workers scale --count 8

# Monitor queue depth trend
stella orchestrator queue watch
```
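To make worker settings survive restarts on bare metal, one option is a systemd drop-in. This sketch assumes the orchestrator honours the same `Orchestrator__Workers__*` environment variables as the Docker Compose example; that carry-over is an assumption, not a confirmed configuration key for systemd installs.

```bash
# Assumption: the bare-metal orchestrator reads the same env vars as the container.
sudo mkdir -p /etc/systemd/system/stellaops-orchestrator.service.d
sudo tee /etc/systemd/system/stellaops-orchestrator.service.d/workers.conf >/dev/null <<'EOF'
[Service]
Environment="Orchestrator__Workers__Count=8"
Environment="Orchestrator__Workers__MaxConcurrent=4"
EOF

# Reload unit definitions and restart with the new settings
sudo systemctl daemon-reload
sudo systemctl restart stellaops-orchestrator
```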
### Kubernetes / Helm
```bash
# Check orchestrator pods
kubectl get pods -l app=stellaops-orchestrator

# View worker logs
kubectl logs -l app=stellaops-orchestrator --tail=200

# Scale workers
kubectl scale deployment stellaops-orchestrator --replicas=4

# Check for stuck jobs
kubectl exec -it <orchestrator-pod> -- stella orchestrator jobs list --status stuck
```
Set in Helm `values.yaml`:

```yaml
orchestrator:
  replicas: 4
  workers:
    count: 8
    maxConcurrent: 4
  resources:
    limits:
      memory: 2Gi
      cpu: "2"
```
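Then apply the values and wait for the rollout to complete. The release and chart names below are placeholders for your installation, not confirmed names:

```bash
# Apply the updated values (substitute your release and chart names)
helm upgrade <release> <chart> -f values.yaml

# Wait for the scaled deployment to become ready
kubectl rollout status deployment/stellaops-orchestrator
```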
## Verification

```bash
stella doctor run --check check.operations.job-queue
```
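A deep queue can take several minutes to drain after scaling, so the check may keep warning for a while. Polling with the standard `watch` utility is one way to track progress:

```bash
# Re-run the check every 30 seconds until it passes
watch -n 30 'stella doctor run --check check.operations.job-queue'
```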
## Related Checks

- `check.operations.dead-letter` -- failed jobs end up in the dead-letter queue
- `check.operations.scheduler` -- the scheduler feeds jobs into the queue
- `check.scanner.queue` -- scanner-specific queue health
- `check.postgres.connectivity` -- database issues affect job dequeue performance