
master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00



Task Queue Backlog

What It Checks

Monitors the pending task queue depth across the agent fleet to detect capacity issues. The check is designed to evaluate:

Total queued tasks across the entire fleet
Age of the oldest queued task (how long tasks wait before dispatch)
Queue growth rate trend (growing, stable, or draining)

Current status: implementation pending. The check always returns Pass with a placeholder message, and its CanRun method always returns true, so a Pass result does not yet say anything about the actual state of the queue.
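Until the check is implemented, the growth-trend signal listed above can be approximated by hand from a few queue-depth samples. The following is a minimal sketch, not the check's actual logic: the function name classify_trend and the 10% tolerance band are illustrative assumptions, and the samples are queue depths collected by the operator, oldest first.

```shell
# Hypothetical helper: classify queue-depth samples as growing, stable, or draining.
# Arguments are queue depths, oldest first. The 10% tolerance is an assumed value.
classify_trend() {
  first=$1
  # The loop leaves $last holding the newest sample.
  for last in "$@"; do :; done
  tolerance=$(( first / 10 ))
  if [ "$last" -gt $(( first + tolerance )) ]; then
    echo growing
  elif [ "$last" -lt $(( first - tolerance )) ]; then
    echo draining
  else
    echo stable
  fi
}

# Example: three samples taken a few minutes apart.
classify_trend 120 135 160
```

Comparing only the first and last samples keeps the sketch short; a real implementation would likely smooth over more samples to avoid reacting to short bursts.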

Why It Matters

A growing task backlog means agents cannot keep up with incoming work. Tasks age in the queue, SLA timers expire, and users experience delayed deployments and scan results. If the backlog grows unchecked, it can cascade: delayed scans block policy gates, which block promotions, which block release trains. Detecting backlog growth early allows operators to scale the fleet or prioritize the queue.

Common Causes

Insufficient agent count for current workload
One or more agents offline, reducing effective fleet capacity
Task burst from bulk operations (mass rescans, environment-wide deployments)
Slow tasks monopolizing agent slots (large image scans, complex builds)
Task dispatch paused due to configuration or freeze window
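To judge whether the first cause (insufficient agent count) applies, a rough sizing estimate can help. The sketch below applies Little's law (tasks in flight = arrival rate x average task duration) and divides by the slots each agent offers. The function name agents_needed and the example figures are hypothetical, not values taken from a real deployment.

```shell
# Hypothetical helper: estimate the agent count needed to keep up with demand.
#   $1 = tasks arriving per hour
#   $2 = average task duration in minutes
#   $3 = concurrent task slots per agent
agents_needed() {
  in_flight_x60=$(( $1 * $2 ))    # tasks in flight, scaled by 60 to stay in integers
  per_agent_x60=$(( 60 * $3 ))    # slots per agent, scaled the same way
  # Integer ceiling division.
  echo $(( (in_flight_x60 + per_agent_x60 - 1) / per_agent_x60 ))
}

# Example: 240 tasks/hour, 6 minutes each, 8 slots per agent.
agents_needed 240 6 8
```

If the estimate is well above the number of healthy agents, scaling the fleet is likely to help more than reordering the queue. The estimate ignores bursts and slow outlier tasks, so treat it as a lower bound.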

How to Fix

Docker Compose

# Check current queue depth
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent tasks --status queued --count

# Scale agents to reduce backlog
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3

# Increase concurrent task limit per agent
# Set environment variable in compose override:
# AGENT__MAXCONCURRENTTASKS=8
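
One way to set that variable is a small override file layered on top of the main compose file. This is a sketch under assumptions: the override file name is made up for illustration, and the service is assumed to be named agent, as in the commands above.

```shell
# Write a compose override that raises the per-agent concurrent task limit.
# The file name is an assumption; adjust the service name if yours differs.
cat > docker-compose.agent-override.yml <<'EOF'
services:
  agent:
    environment:
      AGENT__MAXCONCURRENTTASKS: "8"
EOF

# Apply it alongside the main compose file (shown as a comment, not run here):
# docker compose -f devops/compose/docker-compose.stella-ops.yml \
#   -f docker-compose.agent-override.yml up -d agent
```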

Bare Metal / systemd

# Check queue depth and oldest task
stella agent tasks --status queued

# Increase concurrent task limit
stella agent config --agent-id <id> --set max_concurrent_tasks=8

# Add more agents to the fleet
stella agent bootstrap --name agent-03 --env production --platform linux

Kubernetes / Helm

# Check queue depth
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
  stella agent tasks --status queued --count

# Scale agent deployment
kubectl scale deployment stellaops-agent --replicas=5 -n stellaops

# Or use HPA for auto-scaling: set the following in values.yaml, then run helm upgrade
# agent:
#   autoscaling:
#     enabled: true
#     minReplicas: 2
#     maxReplicas: 10
#     targetCPUUtilizationPercentage: 70
helm upgrade stellaops stellaops/stellaops -f values.yaml

Verification

stella doctor run --check check.agent.task.backlog

Related Checks

check.agent.capacity -- backlog grows when capacity is insufficient
check.agent.task.failure.rate -- failed tasks may be re-queued, inflating the backlog
check.agent.resource.utilization -- saturated agents process tasks slowly
check.agent.heartbeat.freshness -- offline agents reduce dispatch targets
