master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00

Agent Capacity

What It Checks

Verifies that agents have sufficient capacity to handle incoming tasks. The check queries the agent store for the current tenant and categorizes agents by status:

Fail if zero agents have AgentStatus.Active -- no agents are available to run tasks.
Pass if at least one active agent exists, reporting the active-vs-total count.

Evidence collected: ActiveAgents, TotalAgents.

The source also defines utilization thresholds, but the current simplified implementation does not evaluate them yet:

High utilization: >= 90%
Warning utilization: >= 75%
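As a rough illustration of how the two thresholds above could be applied once they are wired in, the following sketch maps a utilization percentage to a level. The function name, the integer-percentage input, and the "high" / "warning" / "normal" labels are assumptions made for this example; the source does not state how utilization is computed or how the levels are reported.

```shell
#!/bin/sh
# Hypothetical sketch: not the plugin's actual implementation.

# classify_utilization PERCENT
# PERCENT is an integer utilization percentage (assumed input format).
classify_utilization() {
  pct=$1
  if [ "$pct" -ge 90 ]; then
    echo "high"      # High utilization: >= 90%
  elif [ "$pct" -ge 75 ]; then
    echo "warning"   # Warning utilization: >= 75%
  else
    echo "normal"    # Label assumed for everything below 75%
  fi
}

classify_utilization 92
classify_utilization 80
classify_utilization 40
```

Testing the higher threshold first keeps the two ranges from overlapping: a value of 92 is reported only as high, not also as warning.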

The check skips with a warning if the tenant ID is missing or unparseable.
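The decision rule described in this section (skip, fail, or pass) can be sketched as a small shell function. This is a minimal illustration, not the plugin's code: the function name and output format are hypothetical, and it assumes tenant IDs are UUIDs, which the source does not confirm.

```shell
#!/bin/sh
# Hypothetical sketch of the decision rule; names and output format are assumed.

# evaluate_capacity TENANT_ID ACTIVE_AGENTS TOTAL_AGENTS
evaluate_capacity() {
  tenant_id=$1
  active=$2
  total=$3

  # Assumption: a parseable tenant ID is a UUID. An empty or malformed
  # value leads to a skip rather than a failure.
  if ! printf '%s' "$tenant_id" |
      grep -Eq '^[0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}$'; then
    echo "skip: tenant ID missing or unparseable"
    return 0
  fi

  # Fail when no agent is Active; otherwise pass and report the counts.
  if [ "$active" -eq 0 ]; then
    echo "fail: ActiveAgents=0 TotalAgents=$total"
  else
    echo "pass: ActiveAgents=$active TotalAgents=$total"
  fi
}

evaluate_capacity "3f2b8c1e-9d4a-4e7b-8a61-0c5d2e9f7a10" 2 5
evaluate_capacity "3f2b8c1e-9d4a-4e7b-8a61-0c5d2e9f7a10" 0 5
evaluate_capacity "" 0 0
```

The tenant check runs first so that a missing tenant is never reported as a capacity failure.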

Why It Matters

When no active agents are available, the platform cannot execute deployment tasks, scans, or any other agent-dispatched work. Releases stall, scan queues grow, and SLA timers expire silently. Detecting zero capacity before a promotion attempt prevents failed deployments and on-call pages.

Common Causes

All agents are offline (host crash, network partition, maintenance window)
No agents have been registered for this tenant
Agents exist but are in Revoked or Inactive status and none remain Active
Agent bootstrap was started but never completed
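To narrow down which of the causes above applies, it can help to tally agents by status. The sketch below assumes a hypothetical plain-text listing with one "name status" pair per line; the input format and the function name are invented for illustration, and only the status names Active, Inactive, and Revoked come from this article.

```shell
#!/bin/sh
# Hypothetical triage helper; the "name status" input format is assumed.

# diagnose_agents reads "name status" lines on stdin and prints one summary line.
diagnose_agents() {
  awk '
    { total++; if ($2 == "Active") active++ }
    END {
      if (total == 0)
        print "no agents registered for this tenant"
      else if (active == 0)
        print "agents exist but none are Active (" total " total)"
      else
        print active " of " total " agents Active"
    }'
}

printf 'agent-01 Revoked\nagent-02 Inactive\n' | diagnose_agents
printf 'agent-01 Active\nagent-02 Revoked\n' | diagnose_agents
```

An empty listing points at missing registration or an incomplete bootstrap, while a non-empty listing with no Active entries points at offline, Revoked, or Inactive agents.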

How to Fix

Docker Compose

# Check agent container health
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent

# View agent container logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 100

# Restart agent container
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent

Bare Metal / systemd

# Check agent service status
systemctl status stella-agent

# Restart agent service
sudo systemctl restart stella-agent

# Bootstrap a new agent if none registered
stella agent bootstrap --name agent-01 --env production --platform linux

Kubernetes / Helm

# Check agent pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops

# Describe agent deployment
kubectl describe deployment stellaops-agent -n stellaops

# Scale agent replicas
kubectl scale deployment stellaops-agent --replicas=2 -n stellaops

Verification

stella doctor run --check check.agent.capacity

Related Checks

check.agent.heartbeat.freshness -- agents may be registered but not sending heartbeats
check.agent.stale -- agents offline for extended periods may need decommissioning
check.agent.resource.utilization -- active agents may be resource-constrained
