Doctor plugin checks: implement health check classes and documentation
Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
86
docs/doctor/articles/agent/capacity.md
Normal file
@@ -0,0 +1,86 @@
---
checkId: check.agent.capacity
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, capacity, performance]
---
# Agent Capacity

## What It Checks

Verifies that agents have sufficient capacity to handle incoming tasks. The check queries the agent store for the current tenant and categorizes agents by status:

1. **Fail** if zero agents have `AgentStatus.Active` -- no agents are available to run tasks.
2. **Pass** if at least one active agent exists, reporting the active-vs-total count.

Evidence collected: `ActiveAgents`, `TotalAgents`.
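
The pass/fail rule above can be sketched as a small shell function. This is an illustration only, not the plugin's implementation: the function name `evaluate_capacity` and its input format (one agent status per line on stdin) are assumptions made for the example. Only the `Active` status and the `ActiveAgents` / `TotalAgents` evidence keys come from the description above.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the pass/fail rule; not the actual check code.
# Reads one agent status per line on stdin (assumed input format).
evaluate_capacity() {
  local active=0 total=0 status
  while IFS= read -r status; do
    if [ -z "$status" ]; then
      continue                      # ignore blank lines
    fi
    total=$((total + 1))
    if [ "$status" = "Active" ]; then
      active=$((active + 1))
    fi
  done
  if [ "$active" -eq 0 ]; then
    # No active agents: nothing can run tasks.
    echo "Fail ActiveAgents=$active TotalAgents=$total"
    return 1
  fi
  echo "Pass ActiveAgents=$active TotalAgents=$total"
}

# Example: two of three agents are active.
printf '%s\n' Active Revoked Active | evaluate_capacity
```

With the sample input, the function prints `Pass ActiveAgents=2 TotalAgents=3`. An empty agent list and a list with only non-active agents both take the fail branch.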

Thresholds defined in source (not yet wired to the simplified implementation):
- High utilization: >= 90%
- Warning utilization: >= 75%
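
Because these thresholds are not yet wired in, the following is only a sketch of how they might classify a utilization figure once they are. The function name `classify_utilization`, the integer-percent input, and the labels `high`, `warning`, and `ok` are all assumptions for illustration; only the two threshold values come from the list above.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; only the threshold values come from the article.
HIGH_UTILIZATION_PCT=90
WARNING_UTILIZATION_PCT=75

# Classify an integer utilization percentage (0-100).
classify_utilization() {
  local pct="$1"
  if [ "$pct" -ge "$HIGH_UTILIZATION_PCT" ]; then
    echo "high"
  elif [ "$pct" -ge "$WARNING_UTILIZATION_PCT" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

classify_utilization 80   # prints "warning"
```

Both comparisons are inclusive, matching the `>=` in the list, so exactly 75 is classified as `warning` and exactly 90 as `high`.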

The check skips with a warning if the tenant ID is missing or unparseable.

## Why It Matters

When no active agents are available, the platform cannot execute deployment tasks, scans, or any agent-dispatched work. Releases stall, scan queues grow, and SLA timers expire silently. Detecting zero-capacity before a promotion attempt prevents failed deployments and on-call pages.

## Common Causes

- All agents are offline (host crash, network partition, maintenance window)
- No agents have been registered for this tenant
- Agents exist but are in `Revoked` or `Inactive` status and none remain `Active`
- Agent bootstrap was started but never completed

## How to Fix

### Docker Compose

```bash
# Check agent container health
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent

# View agent container logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 100

# Restart agent container
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
```

### Bare Metal / systemd

```bash
# Check agent service status
systemctl status stella-agent

# Restart agent service
sudo systemctl restart stella-agent

# Bootstrap a new agent if none registered
stella agent bootstrap --name agent-01 --env production --platform linux
```

### Kubernetes / Helm

```bash
# Check agent pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops

# Describe agent deployment
kubectl describe deployment stellaops-agent -n stellaops

# Scale agent replicas
kubectl scale deployment stellaops-agent --replicas=2 -n stellaops
```

## Verification

```bash
stella doctor run --check check.agent.capacity
```
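
After a restart or scale-up, agents may take a little while to report as active, so it can be useful to re-run the check a few times before concluding that the fix did not work. The helper below is a sketch under stated assumptions: `retry_until_ok` is a hypothetical function, not part of the `stella` CLI, and the commented usage line assumes that `stella doctor run` exits with a non-zero code when the check does not pass, which this article does not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command up to N times, pausing between attempts.
# Usage: retry_until_ok <attempts> <delay-seconds> <command> [args...]
retry_until_ok() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0                      # command succeeded
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"                # wait before the next attempt
    fi
    i=$((i + 1))
  done
  return 1                          # every attempt failed
}

# Assumes a non-zero exit code when the check does not pass:
# retry_until_ok 5 10 stella doctor run --check check.agent.capacity
```

The helper returns as soon as the command succeeds and does not sleep after the final attempt, so a persistent failure is reported after roughly `(attempts - 1) * delay` seconds plus the run time of the command itself.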

## Related Checks

- `check.agent.heartbeat.freshness` -- agents may be registered but not sending heartbeats
- `check.agent.stale` -- agents offline for extended periods may need decommissioning
- `check.agent.resource.utilization` -- active agents may be resource-constrained