---
checkId: check.agent.task.failure.rate
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, task, failure, reliability]
---
# Task Failure Rate

## What It Checks
Monitors the task failure rate across the agent fleet to detect systemic issues. The check is designed to evaluate:

1. Overall task failure rate over the last hour
2. Per-agent failure rate to isolate problematic agents
3. Failure rate trend (increasing, decreasing, or stable)
4. Common failure reasons to guide remediation

**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true.

## Why It Matters
A rising task failure rate is an early indicator of systemic problems: infrastructure issues, misconfigured environments, expired credentials, or agent software bugs. Catching a spike before it reaches 100% failure allows operators to intervene, roll back, or redirect tasks to healthy agents before an outage fully materializes.

## Common Causes
- Registry or artifact store unreachable (tasks cannot pull images)
- Expired credentials used by tasks (registry tokens, cloud provider keys)
- Agent software bug introduced by recent update
- Target environment misconfigured (wrong endpoints, firewall rules)
- Disk exhaustion on agent hosts preventing artifact staging
- OOM kills during resource-intensive tasks (scans, builds)

## How to Fix

### Docker Compose
```bash
# Check agent logs for task failures
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 500 | \
  grep -i "task.*fail\|error\|exception"

# Review recent task history
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent tasks --status failed --last 1h
```

### Bare Metal / systemd
```bash
# View failed tasks
stella agent tasks --status failed --last 1h

# Check per-agent failure rates
stella agent health --show-tasks

# Review agent logs for failure patterns
journalctl -u stella-agent --since '1 hour ago' | grep -i 'fail\|error'
```

### Kubernetes / Helm
```bash
# Check agent pod logs for task errors
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=500 | \
  grep -i "task.*fail\|error"

# Check pod events for OOM or crash signals
kubectl get events -n stellaops --sort-by='.lastTimestamp' | grep -i agent
```

## Verification
```
stella doctor run --check check.agent.task.failure.rate
```

## Related Checks
- `check.agent.resource.utilization` -- resource exhaustion causes task failures
- `check.agent.task.backlog` -- high failure rate combined with backlog indicates systemic issue
- `check.agent.heartbeat.freshness` -- crashing agents fail tasks and go stale
- `check.agent.version.consistency` -- version skew can cause task compatibility failures
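
Because the check is still pending implementation, the four evaluations listed under "What It Checks" can be illustrated with a minimal Python sketch. This is not the check's actual code: `TaskResult`, `failure_rate`, `summarize`, the 0.05 `tolerance`, and the approach of deriving the trend by comparing the two halves of the window are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of one finished task; not a Stella Ops type."""
    agent_id: str
    finished_at: datetime
    succeeded: bool
    failure_reason: Optional[str] = None


def failure_rate(results):
    """Return the fraction of failed tasks, or 0.0 when there are none."""
    results = list(results)
    if not results:
        return 0.0
    failed = sum(1 for r in results if not r.succeeded)
    return failed / len(results)


def summarize(results, now, window=timedelta(hours=1), tolerance=0.05):
    """Compute the four signals over the interval [now - window, now]."""
    start = now - window
    midpoint = now - window / 2
    recent = [r for r in results if start <= r.finished_at <= now]

    # 1. Overall failure rate over the window.
    overall = failure_rate(recent)

    # 2. Per-agent failure rate, to isolate problematic agents.
    by_agent = defaultdict(list)
    for r in recent:
        by_agent[r.agent_id].append(r)
    per_agent = {agent: failure_rate(rs) for agent, rs in by_agent.items()}

    # 3. Trend: compare the older half of the window with the newer half
    #    (an assumed heuristic; the real check may use something else).
    older = failure_rate(r for r in recent if r.finished_at < midpoint)
    newer = failure_rate(r for r in recent if r.finished_at >= midpoint)
    if newer - older > tolerance:
        trend = "increasing"
    elif older - newer > tolerance:
        trend = "decreasing"
    else:
        trend = "stable"

    # 4. Most common failure reasons, to guide remediation.
    reasons = Counter(
        r.failure_reason or "unknown" for r in recent if not r.succeeded
    )

    return {
        "overall": overall,
        "per_agent": per_agent,
        "trend": trend,
        "top_reasons": reasons.most_common(3),
    }
```

Splitting the window in half keeps the trend calculation simple; a real implementation might prefer a longer baseline or several buckets so that a handful of tasks does not flip the result.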