---
checkId: check.agent.task.failure.rate
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, task, failure, reliability]
---
# Task Failure Rate

## What It Checks

Monitors the task failure rate across the agent fleet to detect systemic issues. The check is designed to evaluate:

1. Overall task failure rate over the last hour
2. Per-agent failure rate to isolate problematic agents
3. Failure rate trend (increasing, decreasing, or stable)
4. Common failure reasons to guide remediation

**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true.

## Why It Matters

A rising task failure rate is an early indicator of systemic problems: infrastructure issues, misconfigured environments, expired credentials, or agent software bugs. Catching a spike before it reaches 100% failure allows operators to intervene, roll back, or redirect tasks to healthy agents before an outage fully materializes.

## Common Causes

- Registry or artifact store unreachable (tasks cannot pull images)
- Expired credentials used by tasks (registry tokens, cloud provider keys)
- Agent software bug introduced by recent update
- Target environment misconfigured (wrong endpoints, firewall rules)
- Disk exhaustion on agent hosts preventing artifact staging
- OOM kills during resource-intensive tasks (scans, builds)

## How to Fix

### Docker Compose

```bash
# Check agent logs for task failures
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 500 | \
  grep -i "task.*fail\|error\|exception"

# Review recent task history
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent tasks --status failed --last 1h
```

### Bare Metal / systemd

```bash
# View failed tasks
stella agent tasks --status failed --last 1h

# Check per-agent failure rates
stella agent health <agent-id> --show-tasks

# Review agent logs for failure patterns
journalctl -u stella-agent --since '1 hour ago' | grep -i 'fail\|error'
```

### Kubernetes / Helm

```bash
# Check agent pod logs for task errors
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=500 | \
  grep -i "task.*fail\|error"

# Check pod events for OOM or crash signals
kubectl get events -n stellaops --sort-by='.lastTimestamp' | grep -i agent
```

## Verification

```
stella doctor run --check check.agent.task.failure.rate
```

## Related Checks

- `check.agent.resource.utilization` -- resource exhaustion causes task failures
- `check.agent.task.backlog` -- high failure rate combined with backlog indicates systemic issue
- `check.agent.heartbeat.freshness` -- crashing agents fail tasks and go stale
- `check.agent.version.consistency` -- version skew can cause task compatibility failures