Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/release/environment-readiness.md
+++ b/docs/doctor/articles/release/environment-readiness.md
@@ -0,0 +1,85 @@
+---
+checkId: check.release.environment.readiness
+plugin: stellaops.doctor.release
+severity: warn
+tags: [release, environment, readiness, deployment]
+---
+# Environment Readiness
+
+## What It Checks
+Queries the Release Orchestrator at `/api/v1/environments` and evaluates the health and readiness of all configured target environments:
+
+- **Reachability**: environments must respond to health checks.
+- **Health status**: environments must report as healthy.
+- **Health check freshness**: health check data older than 1 hour triggers a warning.
+- **Production priority**: issues in production environments escalate to fail severity; issues in non-production environments remain warnings.
+
+Evidence collected: `environment_count`, `dev_environments`, `staging_environments`, `prod_environments`, `unreachable_count`, `unhealthy_count`, `unreachable_environments`, `unhealthy_environments`, `stale_health_check_count`.
+
+The check requires `ReleaseOrchestrator:Url` or `Release:Orchestrator:Url` to be configured.
+
+## Why It Matters
+Environments are the deployment targets in the release pipeline. An unreachable or unhealthy environment will cause any release targeting it to fail, blocking the promotion chain. Production environment issues are critical because they can indicate that the currently deployed version is also impacted. Stale health data means the system is operating on outdated information, which can lead to deploying to an environment that is actually down.
+
+## Common Causes
+- Environment agent not responding (crashed, network partition)
+- Network connectivity issue between the orchestrator and target environment
+- Container runtime issue in the target environment (Docker daemon down)
+- Resource exhaustion (disk full, memory pressure) on the target host
+- Dev/staging environment intentionally powered down
+- Health check scheduler not running, producing stale data
+- Intermittent environment agent connectivity, producing stale health reports
+
+## How to Fix
+
+### Docker Compose
+```bash
+# Ping the unreachable environment
+stella env ping <environment-name>
+
+# View environment agent logs
+stella env logs <environment-name>
+
+# Check environment health details
+stella env health <environment-name>
+
+# Refresh health data for all environments
+stella env health --refresh-all
+```
+
+### Bare Metal / systemd
+```bash
+# Check the environment agent service
+ssh <environment-host> "systemctl status stellaops-agent"
+
+# Test network connectivity
+stella env ping <environment-name>
+
+# View agent logs on the target host
+ssh <environment-host> "journalctl -u stellaops-agent --since '1 hour ago'"
+
+# Restart agent if needed
+ssh <environment-host> "systemctl restart stellaops-agent"
+```
+
+### Kubernetes / Helm
+```bash
+# Check agent pods in the target cluster
+kubectl --context <target-cluster> get pods -l app=stellaops-agent
+
+# View agent logs
+kubectl --context <target-cluster> logs -l app=stellaops-agent --tail=200
+
+# Check node resource availability
+kubectl --context <target-cluster> top nodes
+```
+
+## Verification
+```bash
+stella doctor run --check check.release.environment.readiness
+```
+
+## Related Checks
+- `check.release.active` -- unreachable environments cause active releases to get stuck
+- `check.release.rollback.readiness` -- environment health affects rollback capability
+- `check.release.promotion.gates` -- environments must be reachable for gate checks to pass
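
The severity rules listed under "What It Checks" in the article above can be read as a small decision function. The following is a minimal sketch of that logic, not the plugin's actual implementation: the function name `classify_environment`, its arguments, and the `pass`/`warn`/`fail` outputs are hypothetical, and it assumes that stale health data on its own is a warning regardless of environment type, which the article does not state explicitly.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the severity rules; not the Doctor plugin's real code.

STALE_AFTER_SECONDS=3600  # 1 hour, per the freshness rule in the article

# classify_environment <type> <reachable> <healthy> <age_seconds>
#   type:        prod | staging | dev
#   reachable:   true | false
#   healthy:     true | false
#   age_seconds: seconds since the last health check
# Prints one of: pass, warn, fail
classify_environment() {
  local env_type="$1" reachable="$2" healthy="$3" age_seconds="$4"

  if [ "$reachable" != "true" ] || [ "$healthy" != "true" ]; then
    # Production problems escalate to fail; all others stay warnings.
    if [ "$env_type" = "prod" ]; then
      echo "fail"
    else
      echo "warn"
    fi
    return 0
  fi

  # Assumption: stale data alone is a warning for every environment type.
  if [ "$age_seconds" -gt "$STALE_AFTER_SECONDS" ]; then
    echo "warn"
    return 0
  fi

  echo "pass"
}
```

For example, `classify_environment prod false true 10` prints `fail`, while the same unreachable state on a staging environment prints `warn`.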