| checkId | plugin | severity | tags |
|---|---|---|---|
| check.environment.capacity | stellaops.doctor.environment | warn | |
Environment Capacity
What It Checks
Queries the Release Orchestrator API (/api/v1/environments/capacity) and evaluates CPU, memory, storage, and deployment slot usage for every configured environment. Each resource is compared against two thresholds:
- Warn when usage >= 75%
- Fail when usage >= 90%
Deployment slot utilization is calculated as activeDeployments / maxConcurrentDeployments * 100. If no environments exist, the check passes with a note. If the orchestrator is unreachable, the check returns warn.
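The threshold comparison and the slot utilization formula above can be sketched in shell as follows. This is a minimal illustration, not the plugin's actual implementation: the function names `classify_usage` and `slot_utilization` are hypothetical, and returning 0% for a zero `maxConcurrentDeployments` is an assumption made only to avoid dividing by zero.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; these function names are hypothetical and not part of the Doctor plugin.

WARN_THRESHOLD=75   # warn when usage >= 75%
FAIL_THRESHOLD=90   # fail when usage >= 90%

# classify_usage PERCENT: prints pass, warn, or fail for one resource dimension.
classify_usage() {
  awk -v u="$1" -v w="$WARN_THRESHOLD" -v f="$FAIL_THRESHOLD" 'BEGIN {
    if (u >= f)      print "fail"
    else if (u >= w) print "warn"
    else             print "pass"
  }'
}

# slot_utilization ACTIVE MAX: prints activeDeployments / maxConcurrentDeployments * 100.
slot_utilization() {
  awk -v a="$1" -v m="$2" 'BEGIN {
    # Assumption: report 0.0 when the limit is zero instead of dividing by zero.
    if (m <= 0) { print "0.0"; exit }
    printf "%.1f\n", a / m * 100
  }'
}

# Example: 15 active deployments out of a maximum of 20.
classify_usage "$(slot_utilization 15 20)"
```

Because both comparisons use `>=`, a resource sitting exactly on a threshold is reported at the more severe level.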
Why It Matters
Resource exhaustion in a target environment blocks deployments and can cause running services to crash or degrade. Detecting capacity pressure early gives operators time to scale up, clean up unused deployments, or redistribute workloads before an outage occurs. In production environments, exceeding 90% on any resource dimension is a leading indicator of imminent service disruption.
Common Causes
- Gradual organic growth without corresponding resource scaling
- Runaway or leaked processes consuming CPU/memory
- Accumulated old deployments that were never cleaned up
- Resource limits set too tightly relative to actual workload
- Unexpected traffic spike or batch job saturating storage
How to Fix
Docker Compose
# Check current resource usage on the host
docker stats --no-stream
# Increase resource limits in docker-compose.stella-ops.yml
# Edit the target service under deploy.resources.limits:
# cpus: '4.0'
# memory: 8G
# Remove stopped containers to free deployment slots
docker container prune -f
# Restart with updated limits
docker compose -f docker-compose.stella-ops.yml up -d
Bare Metal / systemd
# Check system resource usage
free -h && df -h && top -bn1 | head -20
# Increase memory/CPU limits in systemd unit overrides
sudo systemctl edit stellaops-environment-agent.service
# Add under [Service]:
# MemoryMax=8G
# CPUQuota=400%
sudo systemctl daemon-reload && sudo systemctl restart stellaops-environment-agent.service
# Clean up old deployments
stella env cleanup <environment-name>
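For quick host-level triage, the same 75% and 90% thresholds can be applied to `df` output. This is a rough sketch under stated assumptions and does not reproduce the Doctor check, which reads usage from the Release Orchestrator API rather than from the host. The function name `disk_usage_report` is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; disk_usage_report is a hypothetical helper, not a stella command.

# Reads `df -P` output on stdin and prints "<mount> <used%> <status>" per filesystem.
# Assumes filesystem names and mount points contain no spaces.
disk_usage_report() {
  awk 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)              # strip the trailing percent sign from the Capacity column
    status = "pass"
    if (pct + 0 >= 90)      status = "fail"
    else if (pct + 0 >= 75) status = "warn"
    print $6, pct "%", status
  }'
}

# -P selects the POSIX output format, which keeps each filesystem on a single line.
df -P | disk_usage_report
```

Any mount reported as warn or fail is a candidate for cleanup before raising limits.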
Kubernetes / Helm
# Check node resource usage
kubectl top nodes
kubectl top pods -n stellaops
# Scale up resources via Helm values
helm upgrade stellaops stellaops/stellaops \
  --set environments.resources.limits.cpu=4 \
  --set environments.resources.limits.memory=8Gi \
  --set environments.maxConcurrentDeployments=20
# Or add more nodes to the cluster for horizontal scaling
Verification
stella doctor run --check check.environment.capacity
Related Checks
- check.environment.deployments: checks deployed service health, which may degrade under capacity pressure
- check.environment.connectivity: verifies agents are reachable, which capacity exhaustion can prevent