Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

@@ -0,0 +1,85 @@
---
checkId: check.release.active
plugin: stellaops.doctor.release
severity: warn
tags: [release, pipeline, active, monitoring]
---
# Active Release Health
## What It Checks
Queries the Release Orchestrator at `/api/v1/releases?state=active` and evaluates the health of all currently active releases:
- **Stuck releases**: warn if a release has been executing or pending for more than 1 hour; fail after 4 hours.
- **Failed releases**: any release with an error triggers an immediate fail.
- **Pending approvals**: warn if an approval has been pending for more than 4 hours; fail after 24 hours.
Evidence collected: `active_release_count`, `stuck_release_count`, `failed_release_count`, `pending_approval_count`, `oldest_active_release_age_minutes`, `stuck_releases`, `failed_releases`, `approval_pending_releases`.
The check requires `ReleaseOrchestrator:Url` or `Release:Orchestrator:Url` to be configured.
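The thresholds above can be sketched as a small shell helper. This is only an illustration of the warn/fail boundaries described in this article, not the plugin's actual implementation: the function names and the `pass`/`warn`/`fail` strings are hypothetical, and treating each boundary as strictly "more than" is an assumption.

```bash
# Sketch only: names and output strings are hypothetical, not taken from the
# Doctor plugin source. Ages are whole minutes.

# classify_age AGE_MINUTES WARN_AFTER FAIL_AFTER
# Prints "fail" above FAIL_AFTER, "warn" above WARN_AFTER, otherwise "pass".
classify_age() {
  if [ "$1" -gt "$3" ]; then
    echo fail
  elif [ "$1" -gt "$2" ]; then
    echo warn
  else
    echo pass
  fi
}

# Stuck releases: warn after 1 hour (60 min), fail after 4 hours (240 min).
classify_stuck_release() { classify_age "$1" 60 240; }

# Pending approvals: warn after 4 hours (240 min), fail after 24 hours (1440 min).
classify_pending_approval() { classify_age "$1" 240 1440; }

classify_stuck_release 90       # prints: warn
classify_pending_approval 90    # prints: pass
classify_pending_approval 1500  # prints: fail
```

Note that a release that has been waiting 90 minutes is already a stuck-release warning but is still well inside the approval window, which is why the two thresholds are evaluated separately.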
## Why It Matters
Active releases represent in-flight changes moving through the promotion pipeline. A stuck release blocks the target environment from receiving updates and can hold locks that prevent other releases. Failed releases indicate broken deployment workflows that need immediate attention. Stale approvals delay time-sensitive deployments and can indicate that approvers are unaware of pending requests or that notification delivery has failed.
## Common Causes
- Release workflow step failed (script error, timeout, integration failure)
- Approval bottleneck -- approvers not notified or unavailable
- Target environment became unreachable during deployment
- Resource contention between concurrent releases
- Release taking longer than expected due to large artifact size
- Environment slow to respond to health probes after deployment
## How to Fix
### Docker Compose
```bash
# Inspect a failed or stuck release
stella release inspect <release-id>
# View release execution logs
stella release logs <release-id>
# Check Release Orchestrator service health
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator
# List pending approvals
stella release approvals list
```
### Bare Metal / systemd
```bash
# Check Release Orchestrator service
sudo systemctl status stellaops-orchestrator
# Inspect the stuck release
stella release inspect <release-id>
# View release logs
stella release logs <release-id>
# Review and action pending approvals
stella release approvals list
stella release approve <release-id>
```
### Kubernetes / Helm
```bash
# Check orchestrator pod status
kubectl get pods -l app=stellaops-orchestrator
# View orchestrator logs
kubectl logs -l app=stellaops-orchestrator --tail=200
# Inspect stuck release
kubectl exec -it <orchestrator-pod> -- stella release inspect <release-id>
```
## Verification
```bash
stella doctor run --check check.release.active
```
## Related Checks
- `check.release.environment.readiness` -- environment issues cause releases to get stuck
- `check.release.promotion.gates` -- misconfigured gates can block releases indefinitely
- `check.release.configuration` -- workflow configuration errors cause release failures
- `check.release.schedule` -- schedule conflicts can cause resource contention