
master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00

3.3 KiB



Environment Readiness

What It Checks

Queries the Release Orchestrator at /api/v1/environments and evaluates the health and readiness of all configured target environments:

Reachability: environments must respond to health checks.
Health status: environments must report as healthy.
Health check freshness: a warning is raised if the most recent health check data is more than 1 hour old.
Production priority: production environment issues escalate to fail severity; non-production issues are warnings.
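The rules above can be sketched as a small shell function. This is an illustrative sketch only, not the Doctor plugin's implementation: the function name `classify_env`, its argument order, and the `pass`/`warn`/`fail` labels are hypothetical, and it assumes that stale health data stays a warning even for production, which the text does not state explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the severity rules described above.
# Not the actual check code; names and labels are assumptions.

STALE_AFTER_SECONDS=3600  # "older than 1 hour"

# classify_env TYPE REACHABLE HEALTHY AGE_SECONDS
#   TYPE:        dev | staging | prod
#   REACHABLE:   true | false
#   HEALTHY:     true | false
#   AGE_SECONDS: age of the last health check result, in seconds
# Prints one of: pass, warn, fail
classify_env() {
  local type="$1" reachable="$2" healthy="$3" age="$4"

  if [ "$reachable" != "true" ] || [ "$healthy" != "true" ]; then
    # Production issues escalate to fail; non-production issues are warnings.
    if [ "$type" = "prod" ]; then
      echo "fail"
    else
      echo "warn"
    fi
    return
  fi

  # Reachable and healthy, but the data may be stale.
  # Assumption: staleness is a warning regardless of environment type.
  if [ "$age" -gt "$STALE_AFTER_SECONDS" ]; then
    echo "warn"
    return
  fi

  echo "pass"
}
```

For example, `classify_env prod false true 120` models an unreachable production environment, while `classify_env dev true true 7200` models a healthy dev environment whose last health check is two hours old.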

Evidence collected: environment_count, dev_environments, staging_environments, prod_environments, unreachable_count, unhealthy_count, unreachable_environments, unhealthy_environments, stale_health_check_count.

The check requires ReleaseOrchestrator:Url or Release:Orchestrator:Url to be configured.
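A quick way to confirm that one of the two keys is set is sketched below. It assumes the keys are supplied as environment variables with `:` replaced by `__` (the usual .NET configuration convention); whether this deployment reads them that way, and which key takes precedence when both are set, are assumptions. The function name `resolve_orchestrator_url` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the configured Release Orchestrator URL.
# Assumes environment-variable configuration with "__" in place of ":".
# The precedence (ReleaseOrchestrator__Url first) is an assumption.

resolve_orchestrator_url() {
  local url="${ReleaseOrchestrator__Url:-${Release__Orchestrator__Url:-}}"
  if [ -z "$url" ]; then
    echo "Release Orchestrator URL is not configured" >&2
    return 1
  fi
  # Strip one trailing slash so a path can be appended safely.
  printf '%s\n' "${url%/}"
}
```

If a URL is resolved, the endpoint the check queries can be probed by hand with something like `curl -fsS "$(resolve_orchestrator_url)/api/v1/environments"`; any authentication the orchestrator requires is not covered here.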

Why It Matters

Environments are the deployment targets in the release pipeline. An unreachable or unhealthy environment will cause any release targeting it to fail, blocking the promotion chain. Production environment issues are critical because they can indicate that the currently deployed version is also impacted. Stale health data means the system is operating on outdated information, which can lead to deploying to an environment that is actually down.

Common Causes

Environment agent not responding (crashed, network partition)
Network connectivity issue between the orchestrator and target environment
Container runtime issue in the target environment (Docker daemon down)
Resource exhaustion (disk full, memory pressure) on the target host
Dev/staging environment intentionally powered down
Health check scheduler not running, producing stale data
Intermittent environment agent connectivity, causing stale health reports

How to Fix

Docker Compose

# Ping the unreachable environment
stella env ping <environment-name>

# View environment agent logs
stella env logs <environment-name>

# Check environment health details
stella env health <environment-name>

# Refresh health data for all environments
stella env health --refresh-all

Bare Metal / systemd

# Check the environment agent service
ssh <environment-host> "systemctl status stellaops-agent"

# Test network connectivity
stella env ping <environment-name>

# View agent logs on the target host
ssh <environment-host> "journalctl -u stellaops-agent --since '1 hour ago'"

# Restart agent if needed
ssh <environment-host> "systemctl restart stellaops-agent"

Kubernetes / Helm

# Check agent pods in the target cluster
kubectl --context <target-cluster> get pods -l app=stellaops-agent

# View agent logs
kubectl --context <target-cluster> logs -l app=stellaops-agent --tail=200

# Check node resource availability
kubectl --context <target-cluster> top nodes

Verification

stella doctor run --check check.release.environment.readiness

Related Checks

check.release.active -- unreachable environments cause active releases to get stuck
check.release.rollback.readiness -- environment health affects rollback capability
check.release.promotion.gates -- environments must be reachable for gate checks to pass
