master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00

3.1 KiB

Active Release Health

What It Checks

Queries the Release Orchestrator at /api/v1/releases?state=active and evaluates the health of all currently active releases:

Stuck releases: warn if an executing or pending release has been running for more than 1 hour, fail after 4 hours.
Failed releases: any release with an error triggers an immediate fail.
Pending approvals: warn if an approval has been pending for more than 4 hours, fail after 24 hours.
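The thresholds above can be sketched as a small classifier. This is an illustrative sketch only, not the Doctor plugin's actual code: the function names, the `pass`/`warn`/`fail` strings, and the `awaiting_approval` state used in the usage example are assumptions, as is the choice to treat every threshold as exclusive ("more than").

```python
from datetime import timedelta

# Thresholds taken from the list above.
STUCK_WARN = timedelta(hours=1)
STUCK_FAIL = timedelta(hours=4)
APPROVAL_WARN = timedelta(hours=4)
APPROVAL_FAIL = timedelta(hours=24)

# Severity order, used to pick the worst of several results.
SEVERITY_RANK = {"pass": 0, "warn": 1, "fail": 2}


def classify_age(age, warn_after, fail_after):
    """Return 'fail', 'warn', or 'pass' for an age against two thresholds.

    Assumption: both thresholds are exclusive, so an age exactly equal
    to a threshold does not trip it.
    """
    if age > fail_after:
        return "fail"
    if age > warn_after:
        return "warn"
    return "pass"


def classify_release(state, running_for, has_error=False, approval_pending_for=None):
    """Combine the three rules and return the most severe outcome."""
    results = ["pass"]
    if has_error:
        # Any release with an error fails immediately.
        results.append("fail")
    if state in ("executing", "pending"):
        results.append(classify_age(running_for, STUCK_WARN, STUCK_FAIL))
    if approval_pending_for is not None:
        results.append(
            classify_age(approval_pending_for, APPROVAL_WARN, APPROVAL_FAIL)
        )
    return max(results, key=SEVERITY_RANK.__getitem__)
```

For example, `classify_release("executing", timedelta(hours=2))` yields a warning under these assumptions, while the same release with `has_error=True` fails regardless of its age.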

Evidence collected: active_release_count, stuck_release_count, failed_release_count, pending_approval_count, oldest_active_release_age_minutes, stuck_releases, failed_releases, approval_pending_releases.
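To make the shape of the evidence concrete, the sketch below derives those fields from a list of release records. The output keys are the ones listed above; the input record keys (`id`, `state`, `age_minutes`, `error`, `approval_pending`) are hypothetical, and the 60-minute cut-off simply reuses the 1-hour stuck-release warning threshold.

```python
def summarize_evidence(releases):
    """Build the evidence fields from a list of release dicts.

    Assumption: each dict carries the hypothetical keys id, state,
    age_minutes, error, and approval_pending.
    """
    stuck = [
        r["id"]
        for r in releases
        if r["state"] in ("executing", "pending") and r["age_minutes"] > 60
    ]
    failed = [r["id"] for r in releases if r.get("error")]
    awaiting = [r["id"] for r in releases if r.get("approval_pending")]
    return {
        "active_release_count": len(releases),
        "stuck_release_count": len(stuck),
        "failed_release_count": len(failed),
        "pending_approval_count": len(awaiting),
        # Falls back to 0 when there are no active releases.
        "oldest_active_release_age_minutes": max(
            (r["age_minutes"] for r in releases), default=0
        ),
        "stuck_releases": stuck,
        "failed_releases": failed,
        "approval_pending_releases": awaiting,
    }
```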

The check requires ReleaseOrchestrator:Url or Release:Orchestrator:Url to be configured.
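A minimal sketch of how the two configuration keys might be resolved into the query URL is shown here. The lookup order (first key wins) and the behaviour when neither key is set are assumptions, the helper names are hypothetical, and the host in the usage note is a placeholder.

```python
# Keys named in the text above; the precedence order is an assumption.
CONFIG_KEYS = ("ReleaseOrchestrator:Url", "Release:Orchestrator:Url")


def resolve_orchestrator_url(config):
    """Return the first non-empty orchestrator URL, or None if unset."""
    for key in CONFIG_KEYS:
        value = config.get(key)
        if value:
            return value
    return None


def active_releases_url(config):
    """Build the active-releases query URL, or None when unconfigured."""
    base = resolve_orchestrator_url(config)
    if base is None:
        return None
    # Strip a trailing slash so the path is not doubled.
    return base.rstrip("/") + "/api/v1/releases?state=active"
```

With `{"Release:Orchestrator:Url": "http://orchestrator.example:8080/"}` this returns `http://orchestrator.example:8080/api/v1/releases?state=active`; with an empty configuration it returns `None`, leaving the caller to decide how to report the missing setting.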

Why It Matters

Active releases represent in-flight changes moving through the promotion pipeline. A stuck release blocks the target environment from receiving updates and can hold locks that prevent other releases. Failed releases indicate broken deployment workflows that need immediate attention. Stale approvals delay time-sensitive deployments and can indicate that approvers are unaware of pending requests or that notification delivery has failed.

Common Causes

Release workflow step failed (script error, timeout, integration failure)
Approval bottleneck -- approvers not notified or unavailable
Target environment became unreachable during deployment
Resource contention between concurrent releases
Release taking longer than expected due to large artifact size
Environment slow to respond to health probes after deployment

How to Fix

Docker Compose

# Inspect a failed or stuck release
stella release inspect <release-id>

# View release execution logs
stella release logs <release-id>

# Check Release Orchestrator service health
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator

# List pending approvals
stella release approvals list

Bare Metal / systemd

# Check Release Orchestrator service
sudo systemctl status stellaops-orchestrator

# Inspect the stuck release
stella release inspect <release-id>

# View release logs
stella release logs <release-id>

# Review and action pending approvals
stella release approvals list
stella release approve <release-id>

Kubernetes / Helm

# Check orchestrator pod status
kubectl get pods -l app=stellaops-orchestrator

# View orchestrator logs
kubectl logs -l app=stellaops-orchestrator --tail=200

# Inspect stuck release
kubectl exec -it <orchestrator-pod> -- stella release inspect <release-id>

Verification

stella doctor run --check check.release.active

Related Checks

check.release.environment.readiness -- environment issues cause releases to get stuck
check.release.promotion.gates -- misconfigured gates can block releases indefinitely
check.release.configuration -- workflow configuration errors cause release failures
check.release.schedule -- schedule conflicts can cause resource contention
