Doctor plugin checks: implement health check classes and documentation
Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
81
docs/doctor/articles/agent/version-consistency.md
Normal file
@@ -0,0 +1,81 @@
---
checkId: check.agent.version.consistency
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, version, maintenance]
---
# Agent Version Consistency

## What It Checks

Groups all active agents (neither revoked nor inactive) by their reported `Version` field and evaluates version skew:

1. **Single version** across all agents: **Pass** -- all agents are consistent.
2. **Two versions**, with the minority version on less than half the fleet: **Pass** (minor skew is acceptable).
3. **Significant skew** (more than two distinct versions, or outdated agents exceed half the fleet): **Warn**, with evidence listing the version distribution and up to 10 outdated agent names.
4. **No active agents**: **Skip**.

The "majority version" is the version running on the most agents. All other versions are considered outdated. Evidence collected: `MajorityVersion`, `VersionDistribution` (e.g., "1.5.0: 8, 1.4.2: 2"), `OutdatedAgents` (list of names with their versions).

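The rules above can be illustrated with a minimal Python sketch. This is not the plugin's implementation: the function name `evaluate_version_skew`, the `(name, version)` input shape, and the result dictionary are assumptions made for illustration. The article does not say how an exact half-and-half split between two versions is classified; this sketch treats it as a pass, which is also an assumption.

```python
from collections import Counter


def evaluate_version_skew(agents, max_listed=10):
    """Classify version skew for a list of (name, version) pairs.

    Hypothetical helper: `agents` is assumed to contain only active
    agents (revoked and inactive ones already filtered out).
    """
    if not agents:
        return {"status": "skip"}

    counts = Counter(version for _, version in agents)
    # The majority version is the one reported by the most agents.
    majority_version, _ = counts.most_common(1)[0]
    outdated = [(name, ver) for name, ver in agents if ver != majority_version]

    # Warn on more than two distinct versions, or when outdated agents
    # exceed half the fleet. An exact 50/50 split passes here (assumption).
    significant = len(counts) > 2 or len(outdated) * 2 > len(agents)

    return {
        "status": "warn" if significant else "pass",
        "MajorityVersion": majority_version,
        "VersionDistribution": ", ".join(
            f"{ver}: {n}" for ver, n in counts.most_common()
        ),
        # Cap the evidence list, mirroring the "up to 10" rule.
        "OutdatedAgents": [f"{name} ({ver})" for name, ver in outdated[:max_listed]],
    }
```

With eight agents on 1.5.0 and two on 1.4.2, this sketch returns a pass with the distribution string `1.5.0: 8, 1.4.2: 2`; adding a third distinct version turns the result into a warn.
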
## Why It Matters

Version skew across the agent fleet can cause subtle compatibility issues: newer agents may support task types that older agents reject, protocol changes may cause heartbeat or dispatch failures, and mixed versions make incident triage harder because behavior differs across agents. Keeping the fleet consistent reduces operational surprises.

## Common Causes

- Auto-update is disabled on some agents
- Some agents failed to update (download failure, permission issue, disk full)
- Phased rollout in progress (expected, temporary skew)
- Agents on isolated networks that cannot reach the update server

## How to Fix

### Docker Compose

```bash
# Check agent image versions
# (-s with flatten accepts both array and line-delimited JSON, which differ across Compose versions)
docker compose -f devops/compose/docker-compose.stella-ops.yml ps agent --format json | \
  jq -s 'flatten | .[] | {name: .Name, image: .Image}'

# Pull latest image and recreate
docker compose -f devops/compose/docker-compose.stella-ops.yml pull agent
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d agent
```

### Bare Metal / systemd

```bash
# Update outdated agents to target version
stella agent update --version <target-version> --agent-id <id>

# Enable auto-update
stella agent config --agent-id <id> --set auto_update.enabled=true

# Batch update all agents
stella agent update --version <target-version> --all
```

### Kubernetes / Helm

```bash
# Check running image versions across pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'

# Update image tag in Helm values and rollout
helm upgrade stellaops stellaops/stellaops --set agent.image.tag=<target-version>

# Monitor rollout
kubectl rollout status deployment/stellaops-agent -n stellaops
```

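To compare the pod listing with the check's `VersionDistribution` evidence, the tab-separated `kubectl` output can be tallied by image tag. The following is a rough Python sketch, assuming the image tag matches the agent's reported version (the article does not confirm this); `tally_image_tags` is a hypothetical helper, not part of the `stella` CLI.

```python
from collections import Counter


def tally_image_tags(lines):
    """Count image tags from lines of the form '<pod-name>\\t<image>'.

    Hypothetical helper for illustration. Images without a tag are
    counted under '<none>'; any '@sha256:...' digest is ignored.
    """
    counts = Counter()
    for line in lines:
        parts = line.strip().split("\t", 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        image = parts[1].split("@", 1)[0]  # drop a digest suffix if present
        # Only look at the last path segment so a registry port
        # (e.g. host:5000/...) is not mistaken for a tag.
        last_segment = image.rsplit("/", 1)[-1]
        _name, sep, tag = last_segment.partition(":")
        counts[tag if sep else "<none>"] += 1
    return counts
```

Passing `sys.stdin` as `lines` lets the function consume the piped `kubectl get pods` output directly, and `counts.most_common()` then lists tags from most to least common.
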
## Verification

```bash
stella doctor run --check check.agent.version.consistency
```

## Related Checks

- `check.agent.heartbeat.freshness` -- version mismatch can cause heartbeat protocol failures
- `check.agent.capacity` -- outdated agents may be unable to accept newer task types