
checkId: check.environment.deployments
plugin: stellaops.doctor.environment
severity: warn
tags: environment, deployment, services, health

Environment Deployment Health

What It Checks

Queries the Release Orchestrator (/api/v1/environments/deployments) for all deployed services across all environments. Each service is evaluated for:

  • Status -- failed, stopped, degraded, or healthy
  • Replica health -- compares healthyReplicas against total replicas; partial health triggers degraded status

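The replica comparison can be sketched as a simple per-service rule. This is an illustration only; the variable names and sample values are assumptions, not the orchestrator's actual field names:

```shell
# Sketch of the per-service replica rule; values are illustrative.
replicas=3
healthy_replicas=2

if [ "$healthy_replicas" -lt "$replicas" ]; then
  service_status="degraded"   # partial replica health triggers degraded
else
  service_status="healthy"
fi
echo "$service_status"        # prints "degraded" for 2 of 3 healthy
```
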
Severity escalation:

  • Fail if any production service has status failed (production detected by environment name containing "prod")
  • Fail if any non-production service has status failed
  • Warn if services are degraded (partial replica health)
  • Warn if services are stopped
  • Pass if all services are healthy

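The escalation rules can be sketched in shell over sample data. The environment:service:status triple format and the service names below are illustrative assumptions, not the Release Orchestrator's wire format:

```shell
# Aggregate the worst observed severity across sampled deployments.
# Entry format: environment:service:status (illustrative only).
deployments=(
  "prod-eu:scanner:healthy"
  "staging:notify:degraded"
  "prod-us:attestor:failed"
)

result="pass"
for entry in "${deployments[@]}"; do
  IFS=':' read -r env svc status <<< "$entry"
  case "$status" in
    failed)
      result="fail"                    # any failed service fails the check
      ;;
    degraded|stopped)
      if [ "$result" != "fail" ]; then
        result="warn"                  # degraded or stopped services warn
      fi
      ;;
  esac
done
echo "$result"   # prod-us:attestor is failed, so this prints "fail"
```
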
Why It Matters

Failed services in production directly impact end users and violate SLA commitments. Degraded services with partial replica health reduce fault tolerance and can cascade into full outages under load. Stopped services may indicate incomplete deployments or maintenance windows that were never closed. This check provides the earliest signal that a deployment rollout needs intervention.

Common Causes

  • Service crashed due to unhandled exception or OOM kill
  • Deployment rolled out a bad image version
  • Dependency (database, cache, message broker) became unavailable
  • Resource exhaustion preventing replicas from starting
  • Health check endpoint misconfigured, causing false failures
  • Node failure taking down co-located replicas

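When triaging a crashed service, the exit code often narrows down the cause. By a common Unix convention (not specific to StellaOps), codes above 128 indicate the process died from a signal; 137 = 128 + 9 (SIGKILL) is a frequent OOM-kill signature:

```shell
# Decode a process exit code; the value here is an illustrative example.
exit_code=137

if [ "$exit_code" -gt 128 ]; then
  echo "killed by signal $((exit_code - 128))"   # prints "killed by signal 9"
else
  echo "exited normally with code $exit_code"
fi
```
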
How to Fix

Docker Compose

# Identify failed containers
docker ps -a --filter "status=exited" --filter "status=dead"

# View logs for the failed service
docker logs <container-name> --tail 200

# Restart the failed service
docker compose -f docker-compose.stella-ops.yml restart <service-name>

# If the image is bad, roll back to previous version
# Edit docker-compose.stella-ops.yml to pin the previous image tag
docker compose -f docker-compose.stella-ops.yml up -d <service-name>

Bare Metal / systemd

# Check service status
sudo systemctl status stellaops-<service-name>

# View logs for crash details
sudo journalctl -u stellaops-<service-name> --since "30 minutes ago" --no-pager

# Restart the service
sudo systemctl restart stellaops-<service-name>

# Roll back to previous binary
sudo cp /opt/stellaops/backup/<service-name> /opt/stellaops/bin/<service-name>
sudo systemctl restart stellaops-<service-name>

Kubernetes / Helm

# Check pod status across environments
kubectl get pods -n stellaops-<env> --field-selector=status.phase!=Running

# View events and logs for failing pods
kubectl describe pod <pod-name> -n stellaops-<env>
kubectl logs <pod-name> -n stellaops-<env> --previous

# Rollback a deployment
kubectl rollout undo deployment/<service-name> -n stellaops-<env>

# Or via Helm
helm rollback stellaops <previous-revision> -n stellaops-<env>

Verification

Re-run the check after remediation to confirm all services report healthy:

stella doctor run --check check.environment.deployments

Related Checks

  • check.environment.capacity -- resource exhaustion can cause deployment failures
  • check.environment.connectivity -- the agent must be reachable to report deployment health
  • check.environment.drift -- configuration drift can cause services to fail after redeployment