Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel. Also adds: - docs/doctor/articles/ knowledge base for check explanations - Advisory AI search seed and allowlist updates for doctor content - Sprint plan for doctor checks documentation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2.7 KiB
2.7 KiB
checkId, plugin, severity, tags
| checkId | plugin | severity | tags | ||||
|---|---|---|---|---|---|---|---|
| check.agent.task.failure.rate | stellaops.doctor.agent | warn |
|
Task Failure Rate
What It Checks
Monitors the task failure rate across the agent fleet to detect systemic issues. The check is designed to evaluate:
- Overall task failure rate over the last hour
- Per-agent failure rate to isolate problematic agents
- Failure rate trend (increasing, decreasing, or stable)
- Common failure reasons to guide remediation
Current status: implementation pending -- the check always returns Pass with a placeholder message. The CanRun method always returns true.
Why It Matters
A rising task failure rate is an early indicator of systemic problems: infrastructure issues, misconfigured environments, expired credentials, or agent software bugs. Catching a spike before it reaches 100% failure allows operators to intervene, roll back, or redirect tasks to healthy agents before an outage fully materializes.
Common Causes
- Registry or artifact store unreachable (tasks cannot pull images)
- Expired credentials used by tasks (registry tokens, cloud provider keys)
- Agent software bug introduced by recent update
- Target environment misconfigured (wrong endpoints, firewall rules)
- Disk exhaustion on agent hosts preventing artifact staging
- OOM kills during resource-intensive tasks (scans, builds)
How to Fix
Docker Compose
# Check agent logs for task failures
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 500 | \
grep -i "task.*fail\|error\|exception"
# Review recent task history
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent tasks --status failed --last 1h
Bare Metal / systemd
# View failed tasks
stella agent tasks --status failed --last 1h
# Check per-agent failure rates
stella agent health <agent-id> --show-tasks
# Review agent logs for failure patterns
journalctl -u stella-agent --since '1 hour ago' | grep -i 'fail\|error'
Kubernetes / Helm
# Check agent pod logs for task errors
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=500 | \
grep -i "task.*fail\|error"
# Check pod events for OOM or crash signals
kubectl get events -n stellaops --sort-by='.lastTimestamp' | grep -i agent
Verification
stella doctor run --check check.agent.task.failure.rate
Related Checks
check.agent.resource.utilization-- resource exhaustion causes task failurescheck.agent.task.backlog-- high failure rate combined with backlog indicates systemic issuecheck.agent.heartbeat.freshness-- crashing agents fail tasks and go stalecheck.agent.version.consistency-- version skew can cause task compatibility failures