master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00

Agent Version Consistency

What It Checks

Groups all non-revoked, non-inactive agents by their reported Version field and evaluates version skew:

Single version across all agents: Pass -- all agents are consistent.
Exactly two versions, with outdated agents making up less than half the fleet: Pass (minor skew is acceptable).
Significant skew (more than two distinct versions, or outdated agents exceeding half the fleet): Warn, with evidence listing the version distribution and up to 10 outdated agent names.
No active agents: Skip.

The "majority version" is the version running on the most agents; when three or more versions are present it need not cover more than half the fleet. All other versions are considered outdated. Evidence collected: MajorityVersion, VersionDistribution (e.g., "1.5.0: 8, 1.4.2: 2"), and OutdatedAgents (a list of agent names with their versions).
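
The rules above can be sketched as a small shell function. This is an illustration only, not the Doctor plugin's implementation: the function name `evaluate_version_skew`, the `<name> <version>` input format, and the output layout are hypothetical. The source does not say how ties between versions are broken or how an exact 50/50 split between two versions is classified, so the sketch assumes the first-seen version wins a tie and that anything not clearly a Pass is a Warn.

```shell
# Illustrative sketch only; not the actual Doctor check implementation.
# stdin:  one active (non-revoked, non-inactive) agent per line: "<name> <version>"
# stdout: status on the first line (Skip, Pass, or Warn), then evidence lines.
evaluate_version_skew() {
  awk '
    NF >= 2 {
      total++
      name[total] = $1
      ver[total] = $2
      if (++count[$2] == 1) order[++distinct] = $2   # remember first-seen order
    }
    END {
      if (total == 0) { print "Skip"; exit }

      # Majority version: the version with the highest agent count.
      # Assumption: on a tie, the version seen first wins.
      best = 0; majority = ""; dist = ""; sep = ""
      for (i = 1; i <= distinct; i++) {
        v = order[i]
        if (count[v] > best) { best = count[v]; majority = v }
        dist = dist sep v ": " count[v]
        sep = ", "
      }
      outdated = total - best

      # Pass: one version, or two versions with outdated agents under half the fleet.
      # Assumption: everything else, including an exact 50/50 split, is Warn.
      if (distinct == 1 || (distinct == 2 && outdated * 2 < total)) {
        status = "Pass"
      } else {
        status = "Warn"
      }

      print status
      print "MajorityVersion: " majority
      print "VersionDistribution: " dist
      if (status == "Warn") {
        list = ""; sep = ""; shown = 0
        for (i = 1; i <= total && shown < 10; i++) {   # cap evidence at 10 names
          if (ver[i] != majority) {
            list = list sep name[i] " (" ver[i] ")"
            sep = ", "
            shown++
          }
        }
        print "OutdatedAgents: " list
      }
    }'
}

# Example: three distinct versions across four agents.
printf '%s\n' 'agent-a 1.5.0' 'agent-b 1.5.0' 'agent-c 1.4.2' 'agent-d 1.3.0' | evaluate_version_skew
```

The example reports Warn because three distinct versions are present, even though only half the fleet is outdated.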

Why It Matters

Version skew across the agent fleet can cause subtle compatibility issues: newer agents may support task types that older agents reject, protocol changes may cause heartbeat or dispatch failures, and mixed versions make incident triage harder because behavior differs across agents. Keeping the fleet consistent reduces operational surprises.

Common Causes

Auto-update is disabled on some agents
Some agents failed to update (download failure, permission issue, disk full)
Phased rollout in progress (expected, temporary skew)
Agents on isolated networks that cannot reach the update server

How to Fix

Docker Compose

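# Check agent image versions
# (jq -s with flatten accepts both a JSON array and line-delimited JSON objects,
# since the output shape of "ps --format json" differs between Compose versions)
docker compose -f devops/compose/docker-compose.stella-ops.yml ps agent --format json | \
  jq -s 'flatten | .[] | {name: .Name, image: .Image}'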

# Pull latest image and recreate
docker compose -f devops/compose/docker-compose.stella-ops.yml pull agent
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d agent

Bare Metal / systemd

# Update outdated agents to target version
stella agent update --version <target-version> --agent-id <id>

# Enable auto-update
stella agent config --agent-id <id> --set auto_update.enabled=true

# Batch update all agents
stella agent update --version <target-version> --all

Kubernetes / Helm

# Check running image versions across pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'

# Update the image tag via Helm and roll out (namespace matches the kubectl commands in this section)
helm upgrade stellaops stellaops/stellaops -n stellaops --set agent.image.tag=<target-version>

# Monitor rollout
kubectl rollout status deployment/stellaops-agent -n stellaops
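
To see the version distribution at a glance, the tab-separated pod listing produced by the first command in this section can be piped through a small tally helper. This is a sketch under assumptions: `tally_image_tags` is a hypothetical helper name, and it treats the image tag as a stand-in for the agent version, which only holds if images are tagged with the agent version.

```shell
# Hypothetical helper: counts pods per image tag.
# stdin:  "<pod-name><TAB><image>" lines (the format printed by the jsonpath query above)
# stdout: "<count> <tag>" per distinct tag, most common first
tally_image_tags() {
  awk -F'\t' 'NF >= 2 {
      ref = $2
      sub(/^.*\//, "", ref)                        # drop registry/repository path (may contain a port colon)
      if (ref ~ /@/)      { sub(/^.*@/, "", ref) } # digest-pinned image: keep the digest
      else if (ref ~ /:/) { sub(/^.*:/, "", ref) } # tagged image: keep the tag
      else                { ref = "(none)" }       # no explicit tag
      print ref
    }' | sort | uniq -c | sort -k1,1nr -k2,2 | awk '{ print $1, $2 }'
}

# Usage (requires cluster access):
#   kubectl get pods -l app.kubernetes.io/component=agent -n stellaops \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}' \
#     | tally_image_tags
```

More than one line of output means pods are running different image tags, which is worth checking again once the rollout has finished.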

Verification

stella doctor run --check check.agent.version.consistency

Related Checks

check.agent.heartbeat.freshness -- version mismatch can cause heartbeat protocol failures
check.agent.capacity -- outdated agents may be unable to accept newer task types
