git.stella-ops.org/docs/doctor/articles/release/environment-readiness.md
master c58a236d70 Doctor plugin checks: implement health check classes and documentation
2026-03-27 12:28:00 +02:00

checkId: check.release.environment.readiness
plugin: stellaops.doctor.release
severity: warn
tags: release, environment, readiness, deployment

Environment Readiness

What It Checks

Queries the Release Orchestrator at /api/v1/environments and evaluates the health and readiness of all configured target environments:

  • Reachability: environments must respond to health checks.
  • Health status: environments must report as healthy.
  • Health check freshness: health check data older than 1 hour produces a warning.
  • Production priority: production environment issues escalate to fail severity; non-production issues are warnings.
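
The rules above can be sketched as a small shell function. This is a hypothetical illustration, not the plugin's actual code: the function name `classify_env`, its arguments, and the `prod` type label are assumptions, and it further assumes that stale health data stays a warning even for production, which the list above does not state explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the severity rules; not the Doctor plugin's implementation.

STALE_AFTER_SECONDS=3600  # 1 hour, per the freshness rule

# classify_env <type> <reachable> <healthy> <health_age_seconds>
# Prints one of: pass, warn, fail
classify_env() {
  local env_type="$1" reachable="$2" healthy="$3" age="$4"

  if [ "$reachable" != "true" ] || [ "$healthy" != "true" ]; then
    # Production problems escalate to fail; non-production ones are warnings.
    if [ "$env_type" = "prod" ]; then
      echo "fail"
    else
      echo "warn"
    fi
    return
  fi

  # Assumption: stale data is a warning regardless of environment type.
  if [ "$age" -gt "$STALE_AFTER_SECONDS" ]; then
    echo "warn"
    return
  fi

  echo "pass"
}

classify_env prod false true 120     # prints "fail"
classify_env staging true false 120  # prints "warn"
classify_env dev true true 7200      # prints "warn"
classify_env prod true true 60       # prints "pass"
```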

Evidence collected: environment_count, dev_environments, staging_environments, prod_environments, unreachable_count, unhealthy_count, unreachable_environments, unhealthy_environments, stale_health_check_count.

The check requires ReleaseOrchestrator:Url or Release:Orchestrator:Url to be configured.
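
As a rough sketch of that lookup, the function below resolves the orchestrator URL from environment variables. It assumes the keys are exposed using the .NET convention of replacing `:` with `__`, and that `ReleaseOrchestrator:Url` takes precedence when both are set. Neither assumption is confirmed by this article, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the first configured orchestrator URL, or returns 1.
# Assumes ":" in the configuration key maps to "__" in the environment variable name.
resolve_orchestrator_url() {
  local url="${ReleaseOrchestrator__Url:-${Release__Orchestrator__Url:-}}"
  if [ -z "$url" ]; then
    echo "ReleaseOrchestrator:Url or Release:Orchestrator:Url is not configured" >&2
    return 1
  fi
  # Drop a single trailing slash so the API path can be appended cleanly.
  echo "${url%/}"
}

# Example use (not executed here): query the endpoint the check reads.
#   curl -fsS "$(resolve_orchestrator_url)/api/v1/environments"
```

If the function returns 1, neither key is set and the check cannot reach the orchestrator.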

Why It Matters

Environments are the deployment targets in the release pipeline. An unreachable or unhealthy environment will cause any release targeting it to fail, blocking the promotion chain. Production environment issues are critical because they can indicate that the currently deployed version is also impacted. Stale health data means the system is operating on outdated information, which can lead to deploying to an environment that is actually down.

Common Causes

  • Environment agent not responding (crashed, network partition)
  • Network connectivity issue between the orchestrator and target environment
  • Container runtime issue in the target environment (Docker daemon down)
  • Resource exhaustion (disk full, memory pressure) on the target host
  • Dev/staging environment intentionally powered down
  • Health check scheduler not running, producing stale data
  • Intermittent environment agent connectivity causing stale health reports

How to Fix

Docker Compose

# Ping the unreachable environment
stella env ping <environment-name>

# View environment agent logs
stella env logs <environment-name>

# Check environment health details
stella env health <environment-name>

# Refresh health data for all environments
stella env health --refresh-all
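
When several environments appear in unreachable_environments, the ping command above can be wrapped in a loop. The helper below is a minimal sketch: `check_envs` is a hypothetical name, and it assumes `stella env ping` exits non-zero when the environment does not respond, which this article does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: ping each named environment and report which ones fail.
# Assumes `stella env ping <name>` returns a non-zero exit code on failure.
check_envs() {
  local failed=0 name
  for name in "$@"; do
    if stella env ping "$name"; then
      echo "ok: $name"
    else
      echo "unreachable: $name"
      failed=$((failed + 1))
    fi
  done
  # Succeed only if every environment answered.
  [ "$failed" -eq 0 ]
}

# Example use (environment names are placeholders):
#   check_envs dev staging prod || echo "at least one environment needs attention"
```

For each name reported as unreachable, the logs and health commands above are the natural next step.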

Bare Metal / systemd

# Check the environment agent service
ssh <environment-host> "systemctl status stellaops-agent"

# Test network connectivity
stella env ping <environment-name>

# View agent logs on the target host
ssh <environment-host> "journalctl -u stellaops-agent --since '1 hour ago'"

# Restart agent if needed
ssh <environment-host> "systemctl restart stellaops-agent"

Kubernetes / Helm

# Check agent pods in the target cluster
kubectl --context <target-cluster> get pods -l app=stellaops-agent

# View agent logs
kubectl --context <target-cluster> logs -l app=stellaops-agent --tail=200

# Check node resource availability
kubectl --context <target-cluster> top nodes

Verification

stella doctor run --check check.release.environment.readiness

Related Checks

  • check.release.active -- unreachable environments cause active releases to get stuck
  • check.release.rollback.readiness -- environment health affects rollback capability
  • check.release.promotion.gates -- environments must be reachable for gate checks to pass