| checkId | plugin | severity | tags |
|---|---|---|---|
| `check.agent.stale` | `stellaops.doctor.agent` | `warn` | |
# Stale Agent Detection

## What It Checks
Identifies agents that have been offline (no heartbeat) for extended periods and may need investigation or decommissioning. The check inspects all non-revoked, non-inactive agents and categorizes them:
- Decommission candidates -- offline for more than 7 days. Result: Warn listing each agent with days offline.
- Stale -- offline for more than 1 hour but less than 7 days. Result: Warn listing each agent with hours offline.
- All healthy -- no agents exceed the 1-hour stale threshold. Result: Pass.
The check uses `LastHeartbeatAt` from the agent store. Agents with no recorded heartbeat (`null`) are treated as having an offline duration of `TimeSpan.MaxValue`, so they always fall into the decommission-candidate bucket.
Evidence collected: `DecommissionCandidates` count, `StaleAgents` count, and per-agent names with offline durations.
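The categorization described above can be sketched as follows. This is a minimal illustration of the threshold logic, not the actual Doctor plugin code; the function name `categorize_agent` and the return labels are hypothetical, while the 1-hour and 7-day thresholds and the null-heartbeat handling come from the check description:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Thresholds as documented for check.agent.stale
STALE_THRESHOLD = timedelta(hours=1)
DECOMMISSION_THRESHOLD = timedelta(days=7)

def categorize_agent(last_heartbeat_at: Optional[datetime],
                     now: datetime) -> str:
    """Classify an agent by how long it has been offline.

    A null heartbeat is treated as an effectively infinite offline
    duration (mirroring TimeSpan.MaxValue in the check), so such
    agents always land in the decommission bucket.
    """
    if last_heartbeat_at is None:
        offline = timedelta.max
    else:
        offline = now - last_heartbeat_at

    if offline > DECOMMISSION_THRESHOLD:
        return "decommission-candidate"  # offline > 7 days -> Warn
    if offline > STALE_THRESHOLD:
        return "stale"                   # offline > 1 hour -> Warn
    return "healthy"                     # within threshold -> Pass

now = datetime.now(timezone.utc)
print(categorize_agent(now - timedelta(minutes=5), now))  # healthy
print(categorize_agent(now - timedelta(hours=6), now))    # stale
print(categorize_agent(None, now))                        # decommission-candidate
```

Note that both `Warn` outcomes share one threshold boundary: an agent crosses from `stale` to `decommission-candidate` the moment its offline duration exceeds seven days.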
## Why It Matters
Stale agents consume fleet management overhead, confuse capacity planning, and may hold allocated resources (IP addresses, certificates, license seats) that could be reclaimed. An agent that has been offline for 7+ days is unlikely to return without intervention and should be explicitly deactivated or investigated. Ignoring stale agents leads to a growing inventory of ghost entries that obscure the true fleet state.
## Common Causes
- Agent host has been permanently removed (decommissioned hardware, terminated cloud instance)
- Agent was replaced by a new instance but the old registration was not deactivated
- Infrastructure change (network re-architecture, datacenter migration) without cleanup
- Agent host is undergoing extended maintenance
- Network partition isolating the agent
- Agent process crash without auto-restart configured (systemd restart policy missing)
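For the last cause (no auto-restart policy), a systemd drop-in is the usual remedy. This is a generic sketch assuming the unit is named `stella-agent`; adjust the unit name and values to your deployment:

```ini
# /etc/systemd/system/stella-agent.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=10
```

Apply it with `sudo systemctl daemon-reload && sudo systemctl restart stella-agent`.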
## How to Fix

### Docker Compose

```bash
# List all agent registrations with status
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent list --all

# Deactivate a stale agent
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent deactivate --agent-id <agent-id>
```
### Bare Metal / systemd

```bash
# Review stale agents
stella agent list --status stale

# Deactivate agents that are no longer needed
stella agent deactivate --agent-id <agent-id>

# If the agent should still be active, investigate the host
ssh <agent-host> 'systemctl status stella-agent'

# Check network connectivity from the agent host
ssh <agent-host> 'curl -k https://orchestrator:8443/health'

# Restart the agent on the host
ssh <agent-host> 'sudo systemctl restart stella-agent'
```
### Kubernetes / Helm

```bash
# Check for terminated or evicted agent pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops \
  --field-selector=status.phase!=Running

# Remove stale agent registrations via API
stella agent deactivate --agent-id <agent-id>

# If a pod was evicted, check node status
kubectl get nodes
kubectl describe node <node-name> | grep -A5 Conditions
```
## Verification

```bash
stella doctor run --check check.agent.stale
```
## Related Checks

- `check.agent.heartbeat.freshness` -- short-term heartbeat staleness (minutes vs. hours/days)
- `check.agent.capacity` -- stale agents do not contribute to capacity
- `check.agent.certificate.expiry` -- long-offline agents likely have expired certificates