| checkId | plugin | severity | tags |
|---|---|---|---|
| `check.environment.deployments` | `stellaops.doctor.environment` | warn | |
# Environment Deployment Health

## What It Checks

Queries the Release Orchestrator (`/api/v1/environments/deployments`) for all deployed services across all environments. Each service is evaluated for:

- Status -- `failed`, `stopped`, `degraded`, or `healthy`
- Replica health -- compares `healthyReplicas` against total `replicas`; partial health triggers degraded status
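The check consumes the JSON array returned by the deployments endpoint. You can eyeball the same data by hand with `jq`; the sketch below uses an inline sample payload so it is self-contained (the `environment` and `service` field names are assumptions for illustration; only `status`, `healthyReplicas`, and `replicas` are named by this check):

```bash
# In a live environment the payload would come from:
#   curl -s "$ORCHESTRATOR_URL/api/v1/environments/deployments"
# An inline sample is used here so the snippet runs standalone.
payload='[{"environment":"prod","service":"scanner","status":"degraded","healthyReplicas":1,"replicas":3}]'

# One line per service: environment/service: status (healthy/total replicas)
echo "$payload" | jq -r \
  '.[] | "\(.environment)/\(.service): \(.status) (\(.healthyReplicas)/\(.replicas) replicas)"'
# → prod/scanner: degraded (1/3 replicas)
```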
Severity escalation:

- Fail if any production service has status `failed` (production detected by environment name containing "prod")
- Fail if any non-production service has status `failed`
- Warn if services are `degraded` (partial replica health)
- Warn if services are `stopped`
- Pass if all services are healthy
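The escalation rules above can be sketched as a small shell function for a single service (illustrative only; the actual plugin is not implemented in shell, and since `failed` maps to fail in production and non-production alike, the environment name only affects the remediation message and is omitted here):

```bash
# Sketch of the escalation rules for one service; the real check
# aggregates results across all deployed services.
evaluate_service() {
  local status="$1" healthy="$2" total="$3"
  case "$status" in
    failed)           echo "fail" ;;  # fatal in any environment
    degraded|stopped) echo "warn" ;;
    healthy)
      # Partial replica health downgrades an otherwise healthy service.
      if [ "$healthy" -lt "$total" ]; then echo "warn"; else echo "pass"; fi
      ;;
    *)                echo "warn" ;;  # unknown status: surface it rather than hide it
  esac
}

evaluate_service failed 0 3   # → fail
evaluate_service healthy 3 3  # → pass
evaluate_service healthy 1 3  # → warn (partial replica health)
```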
## Why It Matters
Failed services in production directly impact end users and violate SLA commitments. Degraded services with partial replica health reduce fault tolerance and can cascade into full outages under load. Stopped services may indicate incomplete deployments or maintenance windows that were never closed. This check provides the earliest signal that a deployment rollout needs intervention.
## Common Causes
- Service crashed due to unhandled exception or OOM kill
- Deployment rolled out a bad image version
- Dependency (database, cache, message broker) became unavailable
- Resource exhaustion preventing replicas from starting
- Health check endpoint misconfigured, causing false failures
- Node failure taking down co-located replicas
## How to Fix

### Docker Compose

```bash
# Identify failed containers
docker ps -a --filter "status=exited" --filter "status=dead"

# View logs for the failed service
docker logs <container-name> --tail 200

# Restart the failed service
docker compose -f docker-compose.stella-ops.yml restart <service-name>

# If the image is bad, roll back to the previous version:
# edit docker-compose.stella-ops.yml to pin the previous image tag, then
docker compose -f docker-compose.stella-ops.yml up -d <service-name>
```
### Bare Metal / systemd

```bash
# Check service status
sudo systemctl status stellaops-<service-name>

# View logs for crash details
sudo journalctl -u stellaops-<service-name> --since "30 minutes ago" --no-pager

# Restart the service
sudo systemctl restart stellaops-<service-name>

# Roll back to the previous binary
sudo cp /opt/stellaops/backup/<service-name> /opt/stellaops/bin/<service-name>
sudo systemctl restart stellaops-<service-name>
```
### Kubernetes / Helm

```bash
# Check pod status across environments
kubectl get pods -n stellaops-<env> --field-selector=status.phase!=Running

# View events and logs for failing pods
kubectl describe pod <pod-name> -n stellaops-<env>
kubectl logs <pod-name> -n stellaops-<env> --previous

# Roll back a deployment
kubectl rollout undo deployment/<service-name> -n stellaops-<env>

# Or via Helm
helm rollback stellaops <previous-revision> -n stellaops-<env>
```
## Verification

```bash
stella doctor run --check check.environment.deployments
```
## Related Checks

- `check.environment.capacity` - resource exhaustion can cause deployment failures
- `check.environment.connectivity` - the agent must be reachable to report deployment health
- `check.environment.drift` - configuration drift can cause services to fail after redeployment