Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions


@@ -0,0 +1,124 @@
---
checkId: check.release.rollback.readiness
plugin: stellaops.doctor.release
severity: warn
tags: [release, rollback, disaster-recovery, production]
---
# Rollback Readiness
## What It Checks
Queries the Release Orchestrator at `/api/v1/environments/rollback-status` (with fallback to `/api/v1/environments`) and evaluates rollback capability for production environments:
- **Cannot rollback**: fail if a production environment has a previous version but cannot roll back (e.g., irreversible migration, artifacts purged).
- **No previous version**: warn if a production environment has no previous deployment to roll back to.
- **Missing health probe**: warn if a production environment lacks a health probe (prevents auto-rollback on failure).
Only production environments (type "prod" or "production") are evaluated. Non-production environments are not checked.
Evidence collected: `prod_environment_count`, `rollback_ready_count`, `cannot_rollback_count`, `no_previous_version_count`, `no_health_probe_count`, `cannot_rollback_environments`, `rollback_blocker`.
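The rules above can be sketched roughly as follows. This is an illustrative Python model, not the plugin's actual implementation: the `Environment` fields, the case-insensitive type match, the "pass" result when nothing is wrong, and the choice to report the first blocker as `rollback_blocker` are all assumptions. Only the evidence key names and the fail/warn conditions come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

PROD_TYPES = {"prod", "production"}


@dataclass
class Environment:
    # Hypothetical shape; the real rollback-status payload may differ.
    name: str
    type: str
    previous_version: Optional[str] = None
    can_rollback: bool = False
    rollback_blocker: Optional[str] = None
    has_health_probe: bool = False


def evaluate(environments):
    """Return (severity, evidence) for a list of Environment records."""
    # Only production environments are evaluated.
    prod = [e for e in environments if e.type.lower() in PROD_TYPES]
    cannot = [e for e in prod if e.previous_version and not e.can_rollback]
    no_prev = [e for e in prod if not e.previous_version]
    no_probe = [e for e in prod if not e.has_health_probe]
    ready = [e for e in prod if e.previous_version and e.can_rollback]

    evidence = {
        "prod_environment_count": len(prod),
        "rollback_ready_count": len(ready),
        "cannot_rollback_count": len(cannot),
        "no_previous_version_count": len(no_prev),
        "no_health_probe_count": len(no_probe),
        "cannot_rollback_environments": [e.name for e in cannot],
        # Assumption: surface the first blocker found, if any.
        "rollback_blocker": cannot[0].rollback_blocker if cannot else None,
    }

    if cannot:
        severity = "fail"  # previous version exists but rollback is blocked
    elif no_prev or no_probe:
        severity = "warn"  # nothing to roll back to, or no health probe
    else:
        severity = "pass"  # assumption: also covers "no production environments"
    return severity, evidence
```

In this sketch a single blocked production environment is enough to fail the check, while missing previous versions or health probes only raise a warning.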
The check requires `ReleaseOrchestrator:Url` or `Release:Orchestrator:Url` to be configured.
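A minimal sketch of how the URL lookup and the endpoint fallback might fit together. The two configuration keys and the two endpoint paths come from this article; the precedence between the keys, the "fall back on any non-200 response" rule, and the injected `http_get` helper are assumptions made for illustration.

```python
from typing import Callable, Mapping, Optional, Tuple

# Configuration keys named in this article; the lookup order is an assumption.
URL_KEYS = ("ReleaseOrchestrator:Url", "Release:Orchestrator:Url")
# Primary endpoint first, then the documented fallback.
ENDPOINTS = ("/api/v1/environments/rollback-status", "/api/v1/environments")


def resolve_orchestrator_url(config: Mapping[str, str]) -> Optional[str]:
    """Return the first configured orchestrator URL, or None if neither key is set."""
    for key in URL_KEYS:
        value = config.get(key)
        if value:
            return value.rstrip("/")
    return None


def fetch_environments(
    base_url: str,
    http_get: Callable[[str], Tuple[int, object]],
):
    """Try each endpoint in order and return the first successful body.

    http_get is a hypothetical helper taking a URL and returning
    (status_code, parsed_body); it is injected so the sketch runs without a network.
    """
    for path in ENDPOINTS:
        status, body = http_get(base_url + path)
        if status == 200:
            return body
    return None
```

If `resolve_orchestrator_url` returns `None`, the check has nothing to query, which corresponds to the configuration requirement stated above.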
## Why It Matters
Rollback is the primary recovery mechanism when a production deployment introduces a critical issue. If rollback is unavailable, the only options are an emergency forward-fix or extended downtime. Missing health probes prevent automatic rollback on deployment failure, requiring manual intervention during incidents. In regulated environments, rollback readiness is often a compliance requirement for change management.
## Common Causes
- Previous deployment artifacts not retained (artifact retention policy too aggressive)
- Database migration not reversible (destructive schema change)
- Breaking API change deployed that prevents running the previous version
- Rollback manually disabled for the environment
- First deployment to environment (no previous version exists)
- Deployment history cleared during maintenance
- Health probe URL not configured for auto-rollback
- Auto-rollback on failure not enabled
## How to Fix
### Docker Compose
```bash
# Check rollback status for a specific environment
stella env rollback-status <environment-name>
# View deployment history
stella env history <environment-name>
```
Configure artifact retention to keep previous versions:
```yaml
services:
  orchestrator:
    environment:
      Release__ArtifactRetention__Count: "5"
      Release__ArtifactRetention__Days: "30"
```
Configure health probes:
```bash
# Set health probe for a production environment
stella env configure <environment-name> --health-probe-url "http://<app>:8080/health"
# Enable auto-rollback on failure
stella env configure <environment-name> --auto-rollback-on-failure
```
### Bare Metal / systemd
```bash
# Check rollback blockers
stella env rollback-status <environment-name>
# View deployment history
stella env history <environment-name>
# Configure health probe
stella env configure <environment-name> --health-probe-url "http://localhost:8080/health"
# Enable auto-rollback
stella env configure <environment-name> --auto-rollback-on-failure
```
Edit `/etc/stellaops/orchestrator/appsettings.json`:
```json
{
  "Release": {
    "ArtifactRetention": {
      "Count": 5,
      "Days": 30
    }
  }
}
```
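The `Release__ArtifactRetention__*` environment variables in the Docker Compose example and the nested JSON above appear to describe the same settings. Assuming the orchestrator uses the standard .NET configuration convention, where a double underscore in an environment variable name stands for a section separator, the mapping can be illustrated with a short Python sketch (the `env_to_nested` helper is hypothetical and exists only for this example):

```python
def env_to_nested(env):
    """Expand NAME__SUB__KEY style variables into a nested dict."""
    root = {}
    for name, value in env.items():
        parts = name.split("__")
        node = root
        # Walk or create one nested section per path segment.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return root


settings = env_to_nested({
    "Release__ArtifactRetention__Count": "5",
    "Release__ArtifactRetention__Days": "30",
})
```

Here `settings` has the same `Release` / `ArtifactRetention` nesting as the JSON file. The values stay strings in this sketch; a real configuration binder would normally convert them to numbers.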
### Kubernetes / Helm
```bash
# Check rollback status
kubectl exec -it <orchestrator-pod> -- stella env rollback-status <environment-name>
# View deployment history
kubectl exec -it <orchestrator-pod> -- stella env history <environment-name>
```
Set in Helm `values.yaml`:
```yaml
releaseOrchestrator:
  artifactRetention:
    count: 5
    days: 30
  environments:
    production:
      healthProbeUrl: "http://app:8080/health"
      autoRollbackOnFailure: true
```
## Verification
```bash
stella doctor run --check check.release.rollback.readiness
```
## Related Checks
- `check.release.active` -- failed releases may require rollback
- `check.release.environment.readiness` -- environment health affects rollback execution
- `check.release.configuration` -- workflow configuration defines rollback behavior