Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.
Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| checkId | plugin | severity | tags |
|---|---|---|---|
| check.agent.version.consistency | stellaops.doctor.agent | warn | |
Agent Version Consistency
What It Checks
Groups all non-revoked, non-inactive agents by their reported Version field and evaluates version skew:
- Single version across all agents: Pass -- all agents are consistent.
- Two versions with skew affecting less than half the fleet: Pass (minor skew acceptable).
- Significant skew (more than 2 distinct versions, or outdated agents exceed half the fleet): Warn with evidence listing the version distribution and up to 10 outdated agent names.
- No active agents: Skip.
The "majority version" is the version running on the most agents. All other versions are considered outdated. Evidence collected: MajorityVersion, VersionDistribution (e.g., "1.5.0: 8, 1.4.2: 2"), OutdatedAgents (list of names with their versions).
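The evaluation rules above can be sketched as a small shell function. This is an illustrative approximation, not the check's actual implementation: the function name `evaluate_version_skew` and the input format (one whitespace-separated `agent-name version` pair per line, already filtered to active agents) are assumptions. The text does not say how a two-version fleet split exactly in half is classified, so the sketch assumes Warn for that case.

```shell
# Hypothetical helper: reads "agent-name version" pairs on stdin and
# prints Skip, Pass, or Warn according to the rules described above.
evaluate_version_skew() {
  awk '
    NF >= 2 { count[$2]++; total++ }          # tally agents per version
    END {
      if (total == 0) { print "Skip"; exit }  # no active agents
      distinct = 0; majority = 0
      for (v in count) {
        distinct++
        if (count[v] > majority) majority = count[v]
      }
      outdated = total - majority             # everything not on the majority version
      if (distinct == 1) { print "Pass"; exit }
      # Assumption: an exact 50/50 split is not "less than half", so it warns.
      if (distinct == 2 && outdated * 2 < total) { print "Pass"; exit }
      print "Warn"
    }
  '
}
```

For example, piping eight agents on 1.5.0 and two on 1.4.2 into the function prints Pass, while adding a third distinct version prints Warn.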
Why It Matters
Version skew across the agent fleet can cause subtle compatibility issues: newer agents may support task types that older agents reject, protocol changes may cause heartbeat or dispatch failures, and mixed versions make incident triage harder because behavior differs across agents. Keeping the fleet consistent reduces operational surprises.
Common Causes
- Auto-update is disabled on some agents
- Some agents failed to update (download failure, permission issue, disk full)
- Phased rollout in progress (expected, temporary skew)
- Agents on isolated networks that cannot reach the update server
How to Fix
Docker Compose
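# Check agent image versions
# (jq -s with flatten accepts both the JSON-array and the one-object-per-line output of `docker compose ps`)
docker compose -f devops/compose/docker-compose.stella-ops.yml ps agent --format json | \
  jq -s 'flatten | .[] | {name: .Name, image: .Image}'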
# Pull latest image and recreate
docker compose -f devops/compose/docker-compose.stella-ops.yml pull agent
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d agent
Bare Metal / systemd
# Update outdated agents to target version
stella agent update --version <target-version> --agent-id <id>
# Enable auto-update
stella agent config --agent-id <id> --set auto_update.enabled=true
# Batch update all agents
stella agent update --version <target-version> --all
Kubernetes / Helm
# Check running image versions across pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
# Update the image tag in Helm values and roll out
helm upgrade stellaops stellaops/stellaops -n stellaops --set agent.image.tag=<target-version>
# Monitor rollout
kubectl rollout status deployment/stellaops-agent -n stellaops
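To see the version distribution at a glance, the tab-separated `pod<TAB>image` lines printed by the jsonpath query above could be summarised per image tag. The helper below is a rough sketch: `tally_image_tags` is a hypothetical name, and it assumes every image reference ends in `:<tag>`. Digest-pinned or untagged images would need different parsing.

```shell
# Hypothetical helper: reads "pod<TAB>image" lines on stdin and prints
# "<tag> <count>" per image tag, most common tag first.
tally_image_tags() {
  awk -F'\t' '
    NF >= 2 {
      n = split($2, parts, ":")               # tag is assumed to follow the last colon
      tag = (n > 1) ? parts[n] : "<none>"
      count[tag]++
    }
    END { for (t in count) print t, count[t] }
  ' | sort -k2,2nr -k1,1
}
```

Piping the `kubectl get pods ... -o jsonpath=...` output into `tally_image_tags` gives a listing comparable to the check's VersionDistribution evidence.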
Verification
stella doctor run --check check.agent.version.consistency
Related Checks
- check.agent.heartbeat.freshness -- version mismatch can cause heartbeat protocol failures
- check.agent.capacity -- outdated agents may be unable to accept newer task types