| checkId | plugin | severity | tags |
|---|---|---|---|
| check.operations.job-queue | stellaops.doctor.operations | fail | |
# Job Queue Health

## What It Checks
Evaluates the platform job queue health across three dimensions:
- Worker availability: fails immediately when zero workers are active.
- Queue depth: warns at 100+ pending jobs, fails at 500+.
- Processing rate: warns when the processing rate drops below 10 jobs/minute.
Evidence collected: `QueueDepth`, `ActiveWorkers`, `TotalWorkers`, `ProcessingRate`, `OldestJobAge`, `CompletedLast24h`, `CriticalThreshold`, `WarningThreshold`, `RateStatus`.
This check always runs (no configuration prerequisites).
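The three dimensions combine into a single verdict. The following is a minimal sketch of that severity logic, assuming simplified inputs; the variable names mirror the evidence fields above, but the function itself is illustrative and not part of the Doctor plugin implementation.

```bash
#!/usr/bin/env bash
# Illustrative only: mirrors the documented thresholds, not the actual plugin code.
evaluate_job_queue() {
  local active_workers=$1 queue_depth=$2 rate_per_min=$3
  if (( active_workers == 0 )); then
    echo "fail: no active workers"
  elif (( queue_depth >= 500 )); then
    echo "fail: queue depth ${queue_depth} at or above critical threshold (500)"
  elif (( queue_depth >= 100 )); then
    echo "warn: queue depth ${queue_depth} at or above warning threshold (100)"
  elif (( rate_per_min < 10 )); then
    echo "warn: processing rate ${rate_per_min}/min below 10"
  else
    echo "pass"
  fi
}

evaluate_job_queue 4 120 25   # -> warn: queue depth 120 at or above warning threshold (100)
```

Note that worker availability is evaluated first: a fail on zero workers takes precedence over any queue-depth or rate signal.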
## Why It Matters
The job queue is the backbone of asynchronous processing in Stella Ops. It handles scan jobs, SBOM generation, vulnerability matching, evidence collection, notification delivery, and many other background tasks. If no workers are available, all background processing stops. A deep queue means jobs are waiting longer than expected, which cascades into delayed scan results, stale findings, and blocked release gates. A low processing rate indicates a performance bottleneck that will only get worse under load.
## Common Causes
- Worker service not running (crashed, not started, configuration error)
- All workers crashed or became unhealthy simultaneously
- Job processing slower than submission rate during high-activity periods
- Workers overloaded or misconfigured (too few workers for the workload)
- Downstream service bottleneck (database slow, external API rate-limited)
- Database performance issues slowing job dequeue operations
- Higher than normal job submission rate (bulk scan, new integration)
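Before applying a fix, it can help to narrow down which of these causes applies. A minimal triage sequence, assuming a bare-metal/systemd install (the `stella` subcommands are the same ones used in the fix steps below):

```bash
# Is the worker service running at all?
sudo systemctl is-active stellaops-orchestrator

# Any recent crash or unhealthy-worker messages in the last hour?
sudo journalctl -u stellaops-orchestrator --since "1 hour ago" | grep -icE "crash|unhealthy"

# Is the queue depth growing or draining?
stella orchestrator queue watch
```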
## How to Fix

### Docker Compose
```bash
# Check orchestrator service status
docker compose -f docker-compose.stella-ops.yml ps orchestrator

# View worker logs
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator

# Restart the orchestrator service
docker compose -f docker-compose.stella-ops.yml restart orchestrator

# Scale workers
docker compose -f docker-compose.stella-ops.yml up -d --scale orchestrator=4
```
To persist worker settings across restarts, set them in the compose file (or an override file):

```yaml
services:
  orchestrator:
    environment:
      Orchestrator__Workers__Count: "8"
      Orchestrator__Workers__MaxConcurrent: "4"
```
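After recreating the service with `docker compose -f docker-compose.stella-ops.yml up -d orchestrator`, a quick way to confirm the variables reached the container (standard compose tooling, nothing Stella Ops-specific):

```bash
docker compose -f docker-compose.stella-ops.yml exec orchestrator \
  env | grep Orchestrator__Workers
```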
### Bare Metal / systemd
```bash
# Check orchestrator service
sudo systemctl status stellaops-orchestrator

# View logs for worker errors
sudo journalctl -u stellaops-orchestrator --since "1 hour ago" | grep -i "worker\|queue"

# Restart workers
stella orchestrator workers restart

# Scale workers
stella orchestrator workers scale --count 8

# Monitor queue depth trend
stella orchestrator queue watch
```
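To make worker settings survive restarts on bare metal, one option is a systemd drop-in. This sketch assumes the orchestrator honours the same `Orchestrator__Workers__*` environment variables as the Docker Compose example; that carry-over is an assumption, not a confirmed configuration key for systemd installs.

```bash
# Assumption: the bare-metal orchestrator reads the same env vars as the container.
sudo mkdir -p /etc/systemd/system/stellaops-orchestrator.service.d
sudo tee /etc/systemd/system/stellaops-orchestrator.service.d/workers.conf >/dev/null <<'EOF'
[Service]
Environment="Orchestrator__Workers__Count=8"
Environment="Orchestrator__Workers__MaxConcurrent=4"
EOF

# Reload unit definitions and restart with the new settings
sudo systemctl daemon-reload
sudo systemctl restart stellaops-orchestrator
```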
### Kubernetes / Helm
```bash
# Check orchestrator pods
kubectl get pods -l app=stellaops-orchestrator

# View worker logs
kubectl logs -l app=stellaops-orchestrator --tail=200

# Scale workers
kubectl scale deployment stellaops-orchestrator --replicas=4

# Check for stuck jobs
kubectl exec -it <orchestrator-pod> -- stella orchestrator jobs list --status stuck
```
Set in Helm `values.yaml`:

```yaml
orchestrator:
  replicas: 4
  workers:
    count: 8
    maxConcurrent: 4
  resources:
    limits:
      memory: 2Gi
      cpu: "2"
```
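Then apply the values and wait for the rollout to complete. The release and chart names below are placeholders for your installation, not confirmed names:

```bash
# Apply the updated values (substitute your release and chart names)
helm upgrade <release> <chart> -f values.yaml

# Wait for the scaled deployment to become ready
kubectl rollout status deployment/stellaops-orchestrator
```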
## Verification

```bash
stella doctor run --check check.operations.job-queue
```
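A deep queue can take several minutes to drain after scaling, so the check may keep warning for a while. Polling with the standard `watch` utility is one way to track progress:

```bash
# Re-run the check every 30 seconds until it passes
watch -n 30 'stella doctor run --check check.operations.job-queue'
```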
## Related Checks

- `check.operations.dead-letter` -- failed jobs end up in the dead-letter queue
- `check.operations.scheduler` -- the scheduler feeds jobs into the queue
- `check.scanner.queue` -- scanner-specific queue health
- `check.postgres.connectivity` -- database issues affect job dequeue performance