| checkId | plugin | severity | tags |
|---|---|---|---|
| check.release.active | stellaops.doctor.release | warn | |
Active Release Health
What It Checks
Queries the Release Orchestrator at /api/v1/releases?state=active and evaluates the health of all currently active releases:
- Stuck releases: warn if an executing or pending release has been running for more than 1 hour, fail after 4 hours.
- Failed releases: any release with an error triggers an immediate fail.
- Pending approvals: warn if an approval has been pending for more than 4 hours, fail after 24 hours.
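The thresholds above can be read as a simple age classification. The following is a minimal sketch of that logic, not the check's actual implementation: the function names are hypothetical, and treating the boundaries as strict "more than" comparisons is an assumption based on the wording of the rules.

```shell
# Hypothetical helper: classify an age in minutes against warn/fail thresholds.
# Strict "greater than" at both boundaries is an assumption.
classify_age() {
  age_min=$1; warn_min=$2; fail_min=$3
  if [ "$age_min" -gt "$fail_min" ]; then
    echo fail
  elif [ "$age_min" -gt "$warn_min" ]; then
    echo warn
  else
    echo pass
  fi
}

# Stuck releases: warn after 1 hour (60 min), fail after 4 hours (240 min).
classify_stuck_release() { classify_age "$1" 60 240; }

# Pending approvals: warn after 4 hours (240 min), fail after 24 hours (1440 min).
classify_pending_approval() { classify_age "$1" 240 1440; }

# A 90-minute-old item is a warning as a stuck release,
# but still within tolerance as a pending approval.
classify_stuck_release 90
classify_pending_approval 90
```

Note that the same age can map to different outcomes depending on whether it belongs to an executing release or a pending approval, which is why the evidence reports the two counts separately.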
Evidence collected: active_release_count, stuck_release_count, failed_release_count, pending_approval_count, oldest_active_release_age_minutes, stuck_releases, failed_releases, approval_pending_releases.
The check requires ReleaseOrchestrator:Url or Release:Orchestrator:Url to be configured.
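To confirm that the configured URL is reachable, it can help to issue the same query by hand. The sketch below only builds the request URL from a base address; `ORCHESTRATOR_URL`, the `active_releases_url` helper, and the `http://localhost:8080` default are illustrative assumptions, not names or values defined by Stella Ops.

```shell
# Hypothetical helper: build the active-releases URL from a base URL,
# trimming one trailing slash so the path is not doubled.
active_releases_url() {
  printf '%s/api/v1/releases?state=active\n' "${1%/}"
}

# ORCHESTRATOR_URL is an assumed variable name; set it to the value configured
# for ReleaseOrchestrator:Url (or Release:Orchestrator:Url).
# The localhost default is a placeholder only.
ORCHESTRATOR_URL="${ORCHESTRATOR_URL:-http://localhost:8080}"
active_releases_url "$ORCHESTRATOR_URL"

# To issue the request manually (authentication may be required):
#   curl -fsS "$(active_releases_url "$ORCHESTRATOR_URL")"
```

If the manual request fails to connect, the problem is more likely in the URL configuration or network path than in the releases themselves.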
Why It Matters
Active releases represent in-flight changes moving through the promotion pipeline. A stuck release blocks the target environment from receiving updates and can hold locks that prevent other releases. Failed releases indicate broken deployment workflows that need immediate attention. Stale approvals delay time-sensitive deployments and can indicate that approvers are unaware of pending requests or that notification delivery has failed.
Common Causes
- Release workflow step failed (script error, timeout, integration failure)
- Approval bottleneck -- approvers not notified or unavailable
- Target environment became unreachable during deployment
- Resource contention between concurrent releases
- Release taking longer than expected due to large artifact size
- Environment slow to respond to health probes after deployment
How to Fix
Docker Compose
# Inspect a failed or stuck release
stella release inspect <release-id>
# View release execution logs
stella release logs <release-id>
# Check Release Orchestrator service health
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator
# List pending approvals
stella release approvals list
Bare Metal / systemd
# Check Release Orchestrator service
sudo systemctl status stellaops-orchestrator
# Inspect the stuck release
stella release inspect <release-id>
# View release logs
stella release logs <release-id>
# Review and action pending approvals
stella release approvals list
stella release approve <release-id>
Kubernetes / Helm
# Check orchestrator pod status
kubectl get pods -l app=stellaops-orchestrator
# View orchestrator logs
kubectl logs -l app=stellaops-orchestrator --tail=200
# Inspect stuck release
kubectl exec -it <orchestrator-pod> -- stella release inspect <release-id>
Verification
stella doctor run --check check.release.active
Related Checks
- check.release.environment.readiness -- environment issues cause releases to get stuck
- check.release.promotion.gates -- misconfigured gates can block releases indefinitely
- check.release.configuration -- workflow configuration errors cause release failures
- check.release.schedule -- schedule conflicts can cause resource contention