git.stella-ops.org/docs/doctor/articles/environment/environment-capacity.md
master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00

checkId: check.environment.capacity
plugin: stellaops.doctor.environment
severity: warn
tags: environment, capacity, resources, cpu, memory, storage
Environment Capacity

What It Checks

Queries the Release Orchestrator API (/api/v1/environments/capacity) and evaluates CPU, memory, storage, and deployment slot usage for every configured environment. Each resource is compared against two thresholds:

  • Warn when usage >= 75%
  • Fail when usage >= 90%
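The two thresholds can be sketched as a small shell function. This is only an illustration: classify_usage is a hypothetical helper, not part of the stella CLI, and the pass/warn/fail labels are assumed to match the check's result names.

```shell
# Hypothetical helper (not part of the stella CLI).
# Thresholds are taken from this article: warn at >= 75%, fail at >= 90%.
classify_usage() {
  local pct="$1"   # integer usage percentage, e.g. 82
  if [ "$pct" -ge 90 ]; then
    echo fail
  elif [ "$pct" -ge 75 ]; then
    echo warn
  else
    echo pass
  fi
}

# Show the classification at each boundary.
for pct in 74 75 89 90; do
  printf '%s%% -> %s\n' "$pct" "$(classify_usage "$pct")"
done
```

Both comparisons are inclusive, so exactly 75% already warns and exactly 90% already fails.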

Deployment slot utilization is calculated as activeDeployments / maxConcurrentDeployments * 100. If no environments exist, the check passes with a note. If the orchestrator is unreachable, the check returns warn.
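As a rough sketch, the slot utilization formula can be reproduced with integer shell arithmetic. The function name slot_utilization is hypothetical, and rejecting a non-positive maximum is an assumption; the article does not say how the check itself handles that case.

```shell
# Hypothetical helper illustrating activeDeployments / maxConcurrentDeployments * 100.
# Multiplies before dividing so integer arithmetic does not truncate to zero.
slot_utilization() {
  local active="$1" max="$2"
  if [ "$max" -le 0 ]; then
    # Assumption: treat a non-positive maximum as an input error.
    echo "maxConcurrentDeployments must be positive" >&2
    return 1
  fi
  echo $(( active * 100 / max ))
}

slot_utilization 15 20   # prints 75
slot_utilization 18 20   # prints 90
```

Integer division rounds down, so the result can under-report by less than one percentage point. With 15 of 20 slots in use the environment sits exactly at the warn threshold, and with 18 of 20 it reaches the fail threshold.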

Why It Matters

Resource exhaustion in a target environment blocks deployments and can cause running services to crash or degrade. Detecting capacity pressure early gives operators time to scale up, clean up unused deployments, or redistribute workloads before an outage occurs. In production environments, exceeding 90% on any resource dimension is a leading indicator of imminent service disruption.

Common Causes

  • Gradual organic growth without corresponding resource scaling
  • Runaway or leaked processes consuming CPU/memory
  • Accumulated old deployments that were never cleaned up
  • Resource limits set too tightly relative to actual workload
  • Unexpected traffic spike or batch job saturating storage

How to Fix

Docker Compose

# Check current resource usage on the host
docker stats --no-stream

# Increase resource limits in docker-compose.stella-ops.yml
# Edit the target service under deploy.resources.limits:
#   cpus: '4.0'
#   memory: 8G

# Remove stopped containers to free deployment slots
docker container prune -f

# Restart with updated limits
docker compose -f docker-compose.stella-ops.yml up -d
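To spot which containers are close to their limits before editing the compose file, the docker stats output can be filtered against the same 75% and 90% thresholds. This is a local approximation under stated assumptions, not how the Doctor check computes usage: flag_memory_pressure is a hypothetical helper, and it expects one container per line in the form "name cpu% mem%".

```shell
# Hypothetical helper: reads "name cpu% mem%" lines on stdin and prints
# containers whose memory usage is at or above the article's thresholds.
flag_memory_pressure() {
  awk '{
    mem = $3; sub(/%$/, "", mem)          # strip the trailing percent sign
    if (mem + 0 >= 90)      print $1, "fail", $3
    else if (mem + 0 >= 75) print $1, "warn", $3
  }'
}

# Usage (requires a running Docker daemon):
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}' | flag_memory_pressure
```

The memory percentage reported by docker stats is relative to each container's memory limit, so a container flagged here is a candidate for a higher deploy.resources.limits.memory value.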

Bare Metal / systemd

# Check system resource usage
free -h && df -h && top -bn1 | head -20

# Increase memory/CPU limits in systemd unit overrides
sudo systemctl edit stellaops-environment-agent.service
# Add under [Service]:
#   MemoryMax=8G
#   CPUQuota=400%

sudo systemctl daemon-reload && sudo systemctl restart stellaops-environment-agent.service

# Clean up old deployments
stella env cleanup <environment-name>
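On a bare-metal host, storage pressure can be approximated by filtering POSIX-format df output against the same thresholds. The helper name flag_full_filesystems is hypothetical, and the sketch assumes mount points contain no spaces; the Doctor check itself reads capacity from the Release Orchestrator, so figures may differ.

```shell
# Hypothetical helper: reads `df -P` output on stdin, skips the header line,
# and prints mount points at or above the article's warn/fail thresholds.
flag_full_filesystems() {
  awk 'NR > 1 {
    use = $5; sub(/%$/, "", use)          # Capacity column, e.g. "80%"
    if (use + 0 >= 90)      print $6, "fail", $5
    else if (use + 0 >= 75) print $6, "warn", $5
  }'
}

# Usage:
#   df -P | flag_full_filesystems
```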

Kubernetes / Helm

# Check node resource usage
kubectl top nodes
kubectl top pods -n stellaops

# Scale up resources via Helm values
helm upgrade stellaops stellaops/stellaops \
  --set environments.resources.limits.cpu=4 \
  --set environments.resources.limits.memory=8Gi \
  --set environments.maxConcurrentDeployments=20

# Or add more nodes to the cluster for horizontal scaling
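To decide between raising limits and adding nodes, it can help to see which nodes are already under pressure. The sketch below filters kubectl top nodes output against the 75% and 90% thresholds. The helper name flag_node_pressure is hypothetical, and it assumes the usual five-column layout (name, CPU cores, CPU%, memory bytes, memory%) with headers suppressed.

```shell
# Hypothetical helper: reads `kubectl top nodes --no-headers` output on stdin
# and flags nodes whose CPU% or MEMORY% reaches the article's thresholds.
flag_node_pressure() {
  awk '{
    cpu = $3; mem = $5
    sub(/%$/, "", cpu); sub(/%$/, "", mem)
    # Classify each node by its most constrained resource.
    worst = (cpu + 0 > mem + 0) ? cpu + 0 : mem + 0
    if (worst >= 90)      print $1, "fail", "cpu=" $3, "mem=" $5
    else if (worst >= 75) print $1, "warn", "cpu=" $3, "mem=" $5
  }'
}

# Usage (requires metrics-server in the cluster):
#   kubectl top nodes --no-headers | flag_node_pressure
```

If most nodes are flagged, raising per-environment limits alone is unlikely to help, and adding nodes is probably the better option.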

Verification

stella doctor run --check check.environment.capacity

Related Checks

  • check.environment.deployments - checks deployed service health, which may degrade under capacity pressure
  • check.environment.connectivity - verifies agents are reachable, which capacity exhaustion can prevent