Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel. Also adds: - docs/doctor/articles/ knowledge base for check explanations - Advisory AI search seed and allowlist updates for doctor content - Sprint plan for doctor checks documentation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/agent/task-failure-rate.md
+++ b/docs/doctor/articles/agent/task-failure-rate.md
@@ -0,0 +1,82 @@
+---
+checkId: check.agent.task.failure.rate
+plugin: stellaops.doctor.agent
+severity: warn
+tags: [agent, task, failure, reliability]
+---
+# Task Failure Rate
+
+## What It Checks
+
+Monitors the task failure rate across the agent fleet to detect systemic issues. The check is designed to evaluate:
+
+1. Overall task failure rate over the last hour
+2. Per-agent failure rate to isolate problematic agents
+3. Failure rate trend (increasing, decreasing, or stable)
+4. Common failure reasons to guide remediation
+
+**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true.
+
+## Why It Matters
+
+A rising task failure rate is an early indicator of systemic problems: infrastructure issues, misconfigured environments, expired credentials, or agent software bugs. Catching a spike before it reaches 100% failure allows operators to intervene, roll back, or redirect tasks to healthy agents before an outage fully materializes.
+
+## Common Causes
+
+- Registry or artifact store unreachable (tasks cannot pull images)
+- Expired credentials used by tasks (registry tokens, cloud provider keys)
+- Agent software bug introduced by recent update
+- Target environment misconfigured (wrong endpoints, firewall rules)
+- Disk exhaustion on agent hosts preventing artifact staging
+- OOM kills during resource-intensive tasks (scans, builds)
+
+## How to Fix
+
+### Docker Compose
+
+```bash
+# Check agent logs for task failures
+docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 500 | \
+  grep -i "task.*fail\|error\|exception"
+
+# Review recent task history
+docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
+  stella agent tasks --status failed --last 1h
+```
+
+### Bare Metal / systemd
+
+```bash
+# View failed tasks
+stella agent tasks --status failed --last 1h
+
+# Check per-agent failure rates
+stella agent health <agent-id> --show-tasks
+
+# Review agent logs for failure patterns
+journalctl -u stella-agent --since '1 hour ago' | grep -i 'fail\|error'
+```
+
+### Kubernetes / Helm
+
+```bash
+# Check agent pod logs for task errors
+kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=500 | \
+  grep -i "task.*fail\|error"
+
+# Check pod events for OOM or crash signals
+kubectl get events -n stellaops --sort-by='.lastTimestamp' | grep -i agent
+```
+
+## Verification
+
+```
+stella doctor run --check check.agent.task.failure.rate
+```
+
+## Related Checks
+
+- `check.agent.resource.utilization` -- resource exhaustion causes task failures
+- `check.agent.task.backlog` -- high failure rate combined with backlog indicates systemic issue
+- `check.agent.heartbeat.freshness` -- crashing agents fail tasks and go stale
+- `check.agent.version.consistency` -- version skew can cause task compatibility failures