---
checkId: check.environment.deployments
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, deployment, services, health]
---
# Environment Deployment Health

## What It Checks
Queries the Release Orchestrator (`/api/v1/environments/deployments`) for all deployed services across all environments. Each service is evaluated for:

- **Status** -- `failed`, `stopped`, `degraded`, or healthy
- **Replica health** -- compares `healthyReplicas` against total `replicas`; partial health triggers degraded status

Severity escalation:
- **Fail** if any production service has status `failed` (production detected by environment name containing "prod")
- **Fail** if any non-production service has status `failed`
- **Warn** if services are `degraded` (partial replica health)
- **Warn** if services are `stopped`
- **Pass** if all services are healthy

## Why It Matters
Failed services in production directly impact end users and violate SLA commitments. Degraded services with partial replica health reduce fault tolerance and can cascade into full outages under load. Stopped services may indicate incomplete deployments or maintenance windows that were never closed. This check provides the earliest signal that a deployment rollout needs intervention.

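The severity escalation listed under "What It Checks" can be sketched as a small shell function. This is an illustration only, not the check's actual implementation: the function name `classify_deployments`, the whitespace-separated input format (`environment service status healthyReplicas replicas`, one record per line), the case-sensitive match on "prod", and the rule that fewer healthy replicas than total replicas counts as degraded are all assumptions made for the example.

```bash
#!/usr/bin/env bash
# Sketch only. Assumed input: one record per line on stdin, in the form
#   <environment> <service> <status> <healthyReplicas> <replicas>
# This is NOT the Release Orchestrator response schema.

# Print the overall result (pass, warn, or fail) for the records on stdin.
classify_deployments() {
  local result="pass"
  local env service status healthy total
  while read -r env service status healthy total; do
    # Skip blank lines.
    if [ -z "$env" ]; then
      continue
    fi
    case "$status" in
      failed)
        # A failed service fails the check in any environment.
        result="fail"
        # Assumption: production is flagged by a case-sensitive "prod" match.
        case "$env" in
          *prod*) echo "production failure: $service ($env)" >&2 ;;
        esac
        ;;
      stopped|degraded)
        # Stopped and degraded services warn unless something already failed.
        if [ "$result" != "fail" ]; then
          result="warn"
        fi
        ;;
      *)
        # Assumption: fewer healthy replicas than total means degraded.
        if [ "$healthy" -lt "$total" ] && [ "$result" != "fail" ]; then
          result="warn"
        fi
        ;;
    esac
  done
  echo "$result"
}
```

For example, piping the two records `prod-eu api failed 0 3` and `staging web healthy 2 2` into `classify_deployments` prints `fail` on standard output and names the production service on standard error.
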
## Common Causes
- Service crashed due to unhandled exception or OOM kill
- Deployment rolled out a bad image version
- Dependency (database, cache, message broker) became unavailable
- Resource exhaustion preventing replicas from starting
- Health check endpoint misconfigured, causing false failures
- Node failure taking down co-located replicas

## How to Fix
In the commands below, angle-bracket placeholders such as `<service>` stand for values from your own deployment.

### Docker Compose
```bash
# Identify failed containers
docker ps -a --filter "status=exited" --filter "status=dead"

# View logs for the failed service
docker logs --tail 200 <container>

# Restart the failed service
docker compose -f docker-compose.stella-ops.yml restart <service>

# If the image is bad, roll back to previous version
# Edit docker-compose.stella-ops.yml to pin the previous image tag
docker compose -f docker-compose.stella-ops.yml up -d
```

### Bare Metal / systemd
```bash
# Check service status
sudo systemctl status stellaops-<service>

# View logs for crash details
sudo journalctl -u stellaops-<service> --since "30 minutes ago" --no-pager

# Restart the service
sudo systemctl restart stellaops-<service>

# Roll back to previous binary
sudo cp /opt/stellaops/backup/<binary> /opt/stellaops/bin/
sudo systemctl restart stellaops-<service>
```

### Kubernetes / Helm
```bash
# Check pod status across environments
kubectl get pods -n stellaops-<env> --field-selector=status.phase!=Running

# View events and logs for failing pods
kubectl describe pod <pod> -n stellaops-<env>
kubectl logs <pod> -n stellaops-<env> --previous

# Rollback a deployment
kubectl rollout undo deployment/<service> -n stellaops-<env>

# Or via Helm
helm rollback stellaops -n stellaops-<env>
```

## Verification
```bash
stella doctor run --check check.environment.deployments
```

## Related Checks
- `check.environment.capacity` - resource exhaustion can cause deployment failures
- `check.environment.connectivity` - agent must be reachable to report deployment health
- `check.environment.drift` - configuration drift can cause services to fail after redeployment