
master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00



Environment Capacity

What It Checks

Queries the Release Orchestrator API (/api/v1/environments/capacity) and evaluates CPU, memory, storage, and deployment slot usage for every configured environment. Each resource is compared against two thresholds:

Warn when usage >= 75%
Fail when usage >= 90%

Deployment slot utilization is calculated as activeDeployments / maxConcurrentDeployments * 100. If no environments exist, the check passes with a note. If the orchestrator is unreachable, the check returns warn.
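The threshold logic and the slot formula described above can be sketched in a few lines of shell. This is an illustration only, not the plugin's implementation: the function names `classify_usage` and `slot_utilization` are hypothetical, and reporting 0% when `maxConcurrentDeployments` is zero is an assumption, since the source does not say how that case is handled.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the Doctor plugin.

# Map an integer usage percentage to pass/warn/fail
# using the documented thresholds (warn >= 75, fail >= 90).
classify_usage() {
  local pct=$1
  if [ "$pct" -ge 90 ]; then
    echo fail
  elif [ "$pct" -ge 75 ]; then
    echo warn
  else
    echo pass
  fi
}

# Slot utilization: activeDeployments / maxConcurrentDeployments * 100.
# Assumption: a maximum of 0 is reported as 0% to avoid division by zero.
slot_utilization() {
  local active=$1 max=$2
  if [ "$max" -le 0 ]; then
    echo 0
  else
    # Multiply first so integer division does not lose the result.
    echo $(( active * 100 / max ))
  fi
}

# 8 active deployments out of 10 slots is 80%, which lands in the warn band.
classify_usage "$(slot_utilization 8 10)"   # prints "warn"
```

Integer division truncates the percentage, but because both thresholds are whole numbers compared with >=, the truncated value falls in the same band as the exact one.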

Why It Matters

Resource exhaustion in a target environment blocks deployments and can cause running services to crash or degrade. Detecting capacity pressure early gives operators time to scale up, clean up unused deployments, or redistribute workloads before an outage occurs. In production environments, exceeding 90% on any resource dimension is a leading indicator of imminent service disruption.

Common Causes

Gradual organic growth without corresponding resource scaling
Runaway or leaked processes consuming CPU/memory
Accumulated old deployments that were never cleaned up
Resource limits set too tightly relative to actual workload
Unexpected traffic spike or batch job saturating storage

How to Fix

Docker Compose

# Check current resource usage on the host
docker stats --no-stream

# Increase resource limits in docker-compose.stella-ops.yml
# Edit the target service under deploy.resources.limits:
#   cpus: '4.0'
#   memory: 8G

# Remove stopped containers to free deployment slots
docker container prune -f

# Restart with updated limits
docker compose -f docker-compose.stella-ops.yml up -d

Bare Metal / systemd

# Check system resource usage
free -h && df -h && top -bn1 | head -20

# Increase memory/CPU limits in systemd unit overrides
sudo systemctl edit stellaops-environment-agent.service
# Add under [Service]:
#   MemoryMax=8G
#   CPUQuota=400%

sudo systemctl daemon-reload && sudo systemctl restart stellaops-environment-agent.service

# Clean up old deployments
stella env cleanup <environment-name>

Kubernetes / Helm

# Check node resource usage
kubectl top nodes
kubectl top pods -n stellaops

# Scale up resources via Helm values
helm upgrade stellaops stellaops/stellaops \
  --set environments.resources.limits.cpu=4 \
  --set environments.resources.limits.memory=8Gi \
  --set environments.maxConcurrentDeployments=20

# Or add more nodes to the cluster for horizontal scaling
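When raising maxConcurrentDeployments, it can help to work out the smallest value that brings slot utilization back under the 75% warn threshold for the current number of active deployments. The sketch below derives it from the utilization formula in "What It Checks"; the function name `min_max_slots` is hypothetical and the calculation assumes the active deployment count stays constant.

```shell
#!/usr/bin/env bash
# Hypothetical sizing helper for illustration; not a stella or helm command.

# Smallest maxConcurrentDeployments such that
#   active / max * 100 < warn_threshold
# which rearranges to max > active * 100 / warn_threshold.
min_max_slots() {
  local active=$1
  local warn=${2:-75}   # documented warn threshold, in percent
  # floor(active * 100 / warn) + 1 is the smallest integer strictly above the bound.
  echo $(( active * 100 / warn + 1 ))
}

# With 15 active deployments, 20 slots is exactly 75% (still warn);
# 21 slots is the first value below the threshold.
min_max_slots 15   # prints "21"
```

The result could then be passed as the value of `--set environments.maxConcurrentDeployments=...` in the Helm command above, ideally with extra headroom for growth.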

Verification

stella doctor run --check check.environment.capacity

Related Checks

check.environment.deployments - checks deployed service health, which may degrade under capacity pressure
check.environment.connectivity - verifies that agents are reachable, which capacity exhaustion can prevent
