| checkId | plugin | severity | tags |
|---|---|---|---|
| check.agent.task.backlog | stellaops.doctor.agent | warn | |
Task Queue Backlog
What It Checks
Monitors the pending task queue depth across the agent fleet to detect capacity issues. The check is designed to evaluate:
- Total queued tasks across the entire fleet
- Age of the oldest queued task (how long tasks wait before dispatch)
- Queue growth rate trend (growing, stable, or draining)
Current status: implementation pending -- the check always returns Pass with a placeholder message. The CanRun method always returns true.
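Until the check is implemented, the growth-rate signal listed above can be approximated by hand. The following is a minimal sketch, not the plugin's logic: it assumes queue depth samples have been collected at a fixed interval (for example from `stella agent tasks --status queued --count`, whose output is assumed here to be a bare integer). The function name `classify_trend` and the `TOLERANCE` variable are hypothetical.

```shell
# Hypothetical helper: classify queue-depth samples (oldest first) as
# growing, stable, or draining by comparing the first and last sample.
# TOLERANCE (default 0) is the absolute change treated as noise.
classify_trend() {
  local tolerance="${TOLERANCE:-0}"
  if [ "$#" -lt 2 ]; then
    echo "unknown"   # at least two samples are needed to see a trend
    return 0
  fi
  local first="$1"
  local last
  for last in "$@"; do :; done   # after the loop, last holds the final sample
  local delta=$(( last - first ))
  if [ "$delta" -gt "$tolerance" ]; then
    echo "growing"
  elif [ "$delta" -lt "$(( 0 - tolerance ))" ]; then
    echo "draining"
  else
    echo "stable"
  fi
}

# Example with hard-coded samples
classify_trend 12 18 31   # prints: growing
```

Comparing only the first and last sample is deliberately crude; a real implementation would likely smooth over more samples before deciding.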
Why It Matters
A growing task backlog means agents cannot keep up with incoming work. Tasks age in the queue, SLA timers expire, and users experience delayed deployments and scan results. If the backlog grows unchecked, it can cascade: delayed scans block policy gates, which block promotions, which block release trains. Detecting backlog growth early allows operators to scale the fleet or prioritize the queue.
Common Causes
- Insufficient agent count for current workload
- One or more agents offline, reducing effective fleet capacity
- Task burst from bulk operations (mass rescans, environment-wide deployments)
- Slow tasks monopolizing agent slots (large image scans, complex builds)
- Task dispatch paused due to configuration or freeze window
How to Fix
Docker Compose
# Check current queue depth
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent tasks --status queued --count
# Scale agents to reduce backlog
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3
# Increase concurrent task limit per agent
# Set environment variable in compose override:
# AGENT__MAXCONCURRENTTASKS=8
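The override mentioned in the last step could be generated as follows. This is a sketch under assumptions: the override file path is hypothetical, the helper name `write_agent_override` is invented for illustration, and the service name `agent` is taken from the commands above.

```shell
# Hypothetical helper: write a compose override file that sets the
# per-agent concurrency limit for the "agent" service.
write_agent_override() {
  local path="$1"
  local limit="$2"
  cat > "$path" <<EOF
services:
  agent:
    environment:
      AGENT__MAXCONCURRENTTASKS: "$limit"
EOF
}

# Usage (not run here; the override path is an assumption):
# write_agent_override devops/compose/docker-compose.agent-override.yml 8
# docker compose -f devops/compose/docker-compose.stella-ops.yml \
#   -f devops/compose/docker-compose.agent-override.yml up -d agent
```

Passing both files with `-f` lets Compose merge the override on top of the base file without editing it.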
Bare Metal / systemd
# Check queue depth and oldest task
stella agent tasks --status queued
# Increase concurrent task limit
stella agent config --agent-id <id> --set max_concurrent_tasks=8
# Add more agents to the fleet
stella agent bootstrap --name agent-03 --env production --platform linux
Kubernetes / Helm
# Check queue depth
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
stella agent tasks --status queued --count
# Scale agent deployment
kubectl scale deployment stellaops-agent --replicas=5 -n stellaops
# Or use HPA for auto-scaling: set the following in values.yaml, then upgrade
# agent:
#   autoscaling:
#     enabled: true
#     minReplicas: 2
#     maxReplicas: 10
#     targetCPUUtilizationPercentage: 70
helm upgrade stellaops stellaops/stellaops -f values.yaml
Verification
stella doctor run --check check.agent.task.backlog
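Because the check currently always returns Pass, it may also help to confirm directly that the backlog is draining after a fix. The sketch below polls a queue-depth command until it reports a value at or below a threshold. The helper name `wait_for_drain` is hypothetical, and it assumes the polled command prints a bare integer, which is not confirmed for `stella agent tasks --status queued --count`.

```shell
# Hypothetical helper: poll a queue-depth command until the reported depth
# is at or below a threshold, or give up after a number of attempts.
# Usage: wait_for_drain <threshold> <max_attempts> <sleep_seconds> <command...>
wait_for_drain() {
  local threshold="$1"
  local attempts="$2"
  local pause="$3"
  shift 3
  local i=1
  local depth=""
  while [ "$i" -le "$attempts" ]; do
    depth="$("$@")" || return 2   # the depth command itself failed
    if [ "$depth" -le "$threshold" ]; then
      echo "queue depth $depth <= $threshold"
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$pause"
  done
  echo "queue depth still $depth after $attempts attempts" >&2
  return 1
}

# Usage (not run here; assumes the command prints a bare integer):
# wait_for_drain 0 30 10 stella agent tasks --status queued --count
```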
Related Checks
- check.agent.capacity -- backlog grows when capacity is insufficient
- check.agent.task.failure.rate -- failed tasks may be re-queued, inflating the backlog
- check.agent.resource.utilization -- saturated agents process tasks slowly
- check.agent.heartbeat.freshness -- offline agents reduce dispatch targets