Files

master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00

2.7 KiB

Raw Blame History

checkId, plugin, severity, tags

checkId

plugin

severity

Task Failure Rate

What It Checks

Monitors the task failure rate across the agent fleet to detect systemic issues. The check is designed to evaluate:

Overall task failure rate over the last hour
Per-agent failure rate to isolate problematic agents
Failure rate trend (increasing, decreasing, or stable)
Common failure reasons to guide remediation

Current status: implementation pending -- the check always returns Pass with a placeholder message. The CanRun method always returns true.

Why It Matters

A rising task failure rate is an early indicator of systemic problems: infrastructure issues, misconfigured environments, expired credentials, or agent software bugs. Catching a spike before it reaches 100% failure allows operators to intervene, roll back, or redirect tasks to healthy agents before an outage fully materializes.

Common Causes

Registry or artifact store unreachable (tasks cannot pull images)
Expired credentials used by tasks (registry tokens, cloud provider keys)
Agent software bug introduced by recent update
Target environment misconfigured (wrong endpoints, firewall rules)
Disk exhaustion on agent hosts preventing artifact staging
OOM kills during resource-intensive tasks (scans, builds)

How to Fix

Docker Compose

# Check agent logs for task failures
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 500 | \
  grep -i "task.*fail\|error\|exception"

# Review recent task history
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent tasks --status failed --last 1h

Bare Metal / systemd

# View failed tasks
stella agent tasks --status failed --last 1h

# Check per-agent failure rates
stella agent health <agent-id> --show-tasks

# Review agent logs for failure patterns
journalctl -u stella-agent --since '1 hour ago' | grep -i 'fail\|error'

Kubernetes / Helm

# Check agent pod logs for task errors
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=500 | \
  grep -i "task.*fail\|error"

# Check pod events for OOM or crash signals
kubectl get events -n stellaops --sort-by='.lastTimestamp' | grep -i agent

Verification

stella doctor run --check check.agent.task.failure.rate

check.agent.resource.utilization -- resource exhaustion causes task failures
check.agent.task.backlog -- high failure rate combined with backlog indicates systemic issue
check.agent.heartbeat.freshness -- crashing agents fail tasks and go stale
check.agent.version.consistency -- version skew can cause task compatibility failures

2.7 KiB Raw Blame History