Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/environment/environment-capacity.md
+++ b/docs/doctor/articles/environment/environment-capacity.md
@@ -0,0 +1,84 @@
+---
+checkId: check.environment.capacity
+plugin: stellaops.doctor.environment
+severity: warn
+tags: [environment, capacity, resources, cpu, memory, storage]
+---
+# Environment Capacity
+
+## What It Checks
+Queries the Release Orchestrator API (`/api/v1/environments/capacity`) and evaluates CPU, memory, storage, and deployment slot usage for every configured environment. Each resource is compared against two thresholds:
+- **Warn** when usage >= 75% and < 90%
+- **Fail** when usage >= 90%
+
+Deployment slot utilization is calculated as `activeDeployments / maxConcurrentDeployments * 100`. If no environments are configured, the check passes with an informational note. If the orchestrator is unreachable, the check returns `warn`.
+
+## Why It Matters
+Resource exhaustion in a target environment blocks deployments and can cause running services to crash or degrade. Detecting capacity pressure early gives operators time to scale up, clean up unused deployments, or redistribute workloads before an outage occurs. In production environments, reaching 90% on any resource dimension (the fail threshold) is a leading indicator of imminent service disruption.
+
+## Common Causes
+- Gradual organic growth without corresponding resource scaling
+- Runaway or leaked processes consuming CPU/memory
+- Accumulated old deployments that were never cleaned up
+- Resource limits set too tightly relative to actual workload
+- Unexpected traffic spike or batch job saturating storage
+
+## How to Fix
+
+### Docker Compose
+```bash
+# Check current resource usage on the host
+docker stats --no-stream
+
+# Increase resource limits in docker-compose.stella-ops.yml
+# Edit the target service under deploy.resources.limits:
+#   cpus: '4.0'
+#   memory: 8G
+
+# Remove stopped containers to free deployment slots
+docker container prune -f
+
+# Restart with updated limits
+docker compose -f docker-compose.stella-ops.yml up -d
+```
+
+### Bare Metal / systemd
+```bash
+# Check system resource usage
+free -h && df -h && top -bn1 | head -20
+
+# Increase memory/CPU limits in systemd unit overrides
+sudo systemctl edit stellaops-environment-agent.service
+# Add under [Service]:
+#   MemoryMax=8G
+#   CPUQuota=400%
+
+sudo systemctl daemon-reload && sudo systemctl restart stellaops-environment-agent.service
+
+# Clean up old deployments
+stella env cleanup <environment-name>
+```
+
+### Kubernetes / Helm
+```bash
+# Check node resource usage
+kubectl top nodes
+kubectl top pods -n stellaops
+
+# Scale up resources via Helm values (add --reuse-values to keep overrides set in earlier releases)
+helm upgrade stellaops stellaops/stellaops \
+  --set environments.resources.limits.cpu=4 \
+  --set environments.resources.limits.memory=8Gi \
+  --set environments.maxConcurrentDeployments=20
+
+# Or add more nodes to the cluster for horizontal scaling
+```
+
+## Verification
+```bash
+stella doctor run --check check.environment.capacity
+```
+
+## Related Checks
+- `check.environment.deployments` - checks deployed service health, which may degrade under capacity pressure
+- `check.environment.connectivity` - verifies agents are reachable, which capacity exhaustion can prevent
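
The threshold rules described in the article's "What It Checks" section can be sketched as two small shell functions. This is an illustrative sketch only, not the check's actual implementation: the function names `slot_utilization` and `classify_usage` are hypothetical, percentages are assumed to be integers, and treating a non-positive slot maximum as 0% is an assumption the article does not state.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; not the Doctor plugin's actual implementation.
# Thresholds taken from the article: warn at >= 75%, fail at >= 90%.
WARN_THRESHOLD=75
FAIL_THRESHOLD=90

# slot_utilization ACTIVE MAX
# Prints activeDeployments / maxConcurrentDeployments * 100 as a truncated integer.
# Assumption: a non-positive MAX is reported as 0 to avoid division by zero.
slot_utilization() {
  local active=$1 max=$2
  if [ "$max" -le 0 ]; then
    echo 0
    return
  fi
  echo $(( active * 100 / max ))
}

# classify_usage PERCENT
# Prints fail, warn, or pass for an integer usage percentage.
classify_usage() {
  local pct=$1
  if [ "$pct" -ge "$FAIL_THRESHOLD" ]; then
    echo fail
  elif [ "$pct" -ge "$WARN_THRESHOLD" ]; then
    echo warn
  else
    echo pass
  fi
}

# Example: 16 active deployments out of a maximum of 20 slots.
pct=$(slot_utilization 16 20)
echo "slots: ${pct}% -> $(classify_usage "$pct")"
```

With 16 of 20 slots in use the sketch prints `slots: 80% -> warn`. Checking the fail threshold first keeps the two branches mutually exclusive, so a value of 90 or above is never reported as a mere warning.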