git.stella-ops.org/docs/doctor/articles/agent/task-failure-rate.md
2026-03-27 12:28:00 +02:00

checkId: check.agent.task.failure.rate
plugin: stellaops.doctor.agent
severity: warn
tags: agent, task, failure, reliability

Task Failure Rate

What It Checks

Monitors the task failure rate across the agent fleet to detect systemic issues. The check is designed to evaluate:

  1. Overall task failure rate over the last hour
  2. Per-agent failure rate to isolate problematic agents
  3. Failure rate trend (increasing, decreasing, or stable)
  4. Common failure reasons to guide remediation
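As a rough illustration of the first three evaluations, the sketch below computes per-agent and overall failure rates from a plain task list and classifies the trend between two measurements. The input format (one task per line as `<agent-id> <status>`), the function names, and the one-point tolerance are assumptions made for this example; the article does not define an export format for `stella agent tasks` or the thresholds the check will use.

```shell
#!/usr/bin/env sh
# Hypothetical input on stdin: "<agent-id> <status>", one task per line.
# Any status other than "failed" counts as a non-failure.
# Output: "<agent-id> <failed> <total> <rate%>" sorted worst first,
# followed by a final "overall ..." line.
failure_rates() {
  awk '
    NF >= 2 {
      total[$1]++; all++
      if ($2 == "failed") { failed[$1]++; all_failed++ }
    }
    END {
      cmd = "sort -k4,4nr -k1,1"
      for (a in total)
        printf "%s %d %d %.1f\n", a, failed[a], total[a], 100 * failed[a] / total[a] | cmd
      close(cmd)   # wait for sort so the overall line is printed last
      rate = all ? 100 * all_failed / all : 0
      printf "overall %d %d %.1f\n", all_failed, all, rate
    }
  '
}

# Compares a previous and a current failure rate (percent).
# The third argument is an assumed dead band in percentage points (default 1).
failure_trend() {
  awk -v prev="$1" -v cur="$2" -v tol="${3:-1}" 'BEGIN {
    d = cur - prev
    t = (d > tol) ? "increasing" : ((d < -tol) ? "decreasing" : "stable")
    print t
  }'
}
```

Sorting agents by failure rate puts the likely culprits at the top, which matches the intent of the per-agent evaluation; comparing the overall rate of two consecutive one-hour windows with `failure_trend` gives a coarse trend signal.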

Current status: implementation pending. The check currently always returns Pass with a placeholder message, and its CanRun method always returns true, so none of the evaluations listed above are performed yet.

Why It Matters

A rising task failure rate is an early indicator of systemic problems: infrastructure issues, misconfigured environments, expired credentials, or agent software bugs. Catching a spike before it reaches 100% failure allows operators to intervene, roll back, or redirect tasks to healthy agents before an outage fully materializes.

Common Causes

  • Registry or artifact store unreachable (tasks cannot pull images)
  • Expired credentials used by tasks (registry tokens, cloud provider keys)
  • Agent software bug introduced by recent update
  • Target environment misconfigured (wrong endpoints, firewall rules)
  • Disk exhaustion on agent hosts preventing artifact staging
  • OOM kills during resource-intensive tasks (scans, builds)
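To get a first impression of which of these causes dominates, failure log lines can be bucketed by keyword. The sketch below is only a starting point: the function name and the patterns are illustrative guesses based on common Linux and container-registry error text, not the agent's actual log messages, and should be adjusted to what the logs really contain.

```shell
#!/usr/bin/env sh
# Reads log lines on stdin and prints a count per suspected cause.
# Patterns are assumptions for illustration; each line is counted once,
# under the first bucket that matches.
classify_failures() {
  awk '
    { line = tolower($0) }
    line ~ /no space left on device/                            { n["disk"]++; next }
    line ~ /out of memory|oom-kill|oomkilled/                   { n["memory"]++; next }
    line ~ /unauthorized|authentication required|token expired/ { n["credentials"]++; next }
    line ~ /connection refused|no such host|timed out/          { n["network"]++; next }
    { n["other"]++ }
    END {
      split("disk memory credentials network other", order, " ")
      for (i = 1; i <= 5; i++) printf "%s %d\n", order[i], n[order[i]]
    }
  '
}
```

For example, the output of any of the log commands in the next section can be piped into `classify_failures`; a large "other" count suggests the patterns need extending.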

How to Fix

Docker Compose

# Check agent logs for task failures
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 500 | \
  grep -i "task.*fail\|error\|exception"

# Review recent task history
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent tasks --status failed --last 1h

Bare Metal / systemd

# View failed tasks
stella agent tasks --status failed --last 1h

# Check per-agent failure rates
stella agent health <agent-id> --show-tasks

# Review agent logs for failure patterns
journalctl -u stella-agent --since '1 hour ago' | grep -i 'fail\|error'

Kubernetes / Helm

# Check agent pod logs for task errors
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=500 | \
  grep -i "task.*fail\|error"

# Check pod events for OOM or crash signals
kubectl get events -n stellaops --sort-by='.lastTimestamp' | grep -i agent
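Events expire after a while, so it can also help to ask the pods directly whether a container was last terminated by the OOM killer. The following is a minimal sketch, assuming the same namespace and label selector as the commands above; the two function names are hypothetical helpers, not part of the `stella` CLI.

```shell
#!/usr/bin/env sh
# Lists "<pod-name><TAB><last termination reasons>" for the agent pods.
# Namespace and label selector are taken from the commands above.
agent_termination_reasons() {
  kubectl get pods -n stellaops -l app.kubernetes.io/component=agent \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# Filters the listing above down to pods with an OOMKilled container.
oom_killed_pods() {
  awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

Running `agent_termination_reasons | oom_killed_pods` prints the affected pod names, or nothing if no agent container was OOM-killed on its last restart.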

Verification

stella doctor run --check check.agent.task.failure.rate

Related Checks

  • check.agent.resource.utilization -- resource exhaustion causes task failures
  • check.agent.task.backlog -- high failure rate combined with backlog indicates systemic issue
  • check.agent.heartbeat.freshness -- crashing agents fail tasks and go stale
  • check.agent.version.consistency -- version skew can cause task compatibility failures