Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions
--- a/docs/doctor/articles/release/rollback-readiness.md
+++ b/docs/doctor/articles/release/rollback-readiness.md
@@ -0,0 +1,124 @@
+---
+checkId: check.release.rollback.readiness
+plugin: stellaops.doctor.release
+severity: warn
+tags: [release, rollback, disaster-recovery, production]
+---
+# Rollback Readiness
+
+## What It Checks
+Queries the Release Orchestrator at `/api/v1/environments/rollback-status` (with fallback to `/api/v1/environments`) and evaluates rollback capability for production environments:
+
+- **Cannot rollback**: fail if a production environment has a previous version but cannot roll back (e.g., irreversible migration, artifacts purged).
+- **No previous version**: warn if a production environment has no previous deployment to roll back to.
+- **Missing health probe**: warn if a production environment lacks a health probe; without one, auto-rollback on failure cannot trigger.
+
+Only production environments (type "prod" or "production") are evaluated. Non-production environments are not checked.
+
+Evidence collected: `prod_environment_count`, `rollback_ready_count`, `cannot_rollback_count`, `no_previous_version_count`, `no_health_probe_count`, `cannot_rollback_environments`, `rollback_blocker`.
+
+The check requires `ReleaseOrchestrator:Url` or `Release:Orchestrator:Url` to be configured.
+
+## Why It Matters
+Rollback is the primary recovery mechanism when a production deployment introduces a critical issue. If rollback is unavailable, the only options are an emergency forward-fix or extended downtime. Missing health probes prevent automatic rollback on deployment failure, requiring manual intervention during incidents. In regulated environments, rollback readiness is often a compliance requirement for change management.
+
+## Common Causes
+- Previous deployment artifacts not retained (artifact retention policy too aggressive)
+- Database migration not reversible (destructive schema change)
+- Breaking API change deployed that prevents running the previous version
+- Rollback manually disabled for the environment
+- First deployment to environment (no previous version exists)
+- Deployment history cleared during maintenance
+- Health probe URL not configured for auto-rollback
+- Auto-rollback on failure not enabled
+
+## How to Fix
+
+### Docker Compose
+```bash
+# Check rollback status for a specific environment
+stella env rollback-status <environment-name>
+
+# View deployment history
+stella env history <environment-name>
+```
+
+Configure artifact retention to keep previous versions:
+
+```yaml
+services:
+  orchestrator:
+    environment:
+      Release__ArtifactRetention__Count: "5"
+      Release__ArtifactRetention__Days: "30"
+```
+
+Configure health probes:
+
+```bash
+# Set health probe for a production environment
+stella env configure <environment-name> --health-probe-url "http://<app>:8080/health"
+
+# Enable auto-rollback on failure
+stella env configure <environment-name> --auto-rollback-on-failure
+```
+
+### Bare Metal / systemd
+```bash
+# Check rollback blockers
+stella env rollback-status <environment-name>
+
+# View deployment history
+stella env history <environment-name>
+
+# Configure health probe
+stella env configure <environment-name> --health-probe-url "http://localhost:8080/health"
+
+# Enable auto-rollback
+stella env configure <environment-name> --auto-rollback-on-failure
+```
+
+Edit `/etc/stellaops/orchestrator/appsettings.json`:
+
+```json
+{
+  "Release": {
+    "ArtifactRetention": {
+      "Count": 5,
+      "Days": 30
+    }
+  }
+}
+```
+
+### Kubernetes / Helm
+```bash
+# Check rollback status
+kubectl exec -it <orchestrator-pod> -- stella env rollback-status <environment-name>
+
+# View deployment history
+kubectl exec -it <orchestrator-pod> -- stella env history <environment-name>
+```
+
+Set in Helm `values.yaml`:
+
+```yaml
+releaseOrchestrator:
+  artifactRetention:
+    count: 5
+    days: 30
+  environments:
+    production:
+      healthProbeUrl: "http://app:8080/health"
+      autoRollbackOnFailure: true
+```
+
+## Verification
+```bash
+stella doctor run --check check.release.rollback.readiness
+```
+
+## Related Checks
+- `check.release.active` -- failed releases may require rollback
+- `check.release.environment.readiness` -- environment health affects rollback execution
+- `check.release.configuration` -- workflow configuration defines rollback behavior
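
The evaluation rules listed under "What It Checks" can be sketched in a few lines of Python. This is only an illustration of the documented fail/warn logic, not the plugin's actual implementation: the `Environment` record and its field names are hypothetical (the article does not show the API response schema), the exact-match comparison on the environment type is an assumption, and `rollback_ready_count` is assumed to mean production environments that have a previous version and can roll back to it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shape of one environment record; the real response
# schema of the Release Orchestrator is not shown in the article.
@dataclass
class Environment:
    name: str
    type: str
    has_previous_version: bool
    can_rollback: bool
    health_probe_url: Optional[str] = None
    rollback_blocker: Optional[str] = None

# Only these environment types are evaluated, per the article.
PROD_TYPES = {"prod", "production"}

def evaluate(environments):
    """Return (status, evidence) following the documented rules."""
    prod = [e for e in environments if e.type in PROD_TYPES]

    # Fail condition: a previous version exists but rollback is blocked.
    cannot = [e for e in prod if e.has_previous_version and not e.can_rollback]
    # Warn conditions: nothing to roll back to, or no health probe.
    no_previous = [e for e in prod if not e.has_previous_version]
    no_probe = [e for e in prod if not e.health_probe_url]
    # Assumed definition of "ready": previous version exists and rollback works.
    ready = [e for e in prod if e.has_previous_version and e.can_rollback]

    evidence = {
        "prod_environment_count": len(prod),
        "rollback_ready_count": len(ready),
        "cannot_rollback_count": len(cannot),
        "no_previous_version_count": len(no_previous),
        "no_health_probe_count": len(no_probe),
        "cannot_rollback_environments": [e.name for e in cannot],
    }

    if cannot:
        status = "fail"
    elif no_previous or no_probe:
        status = "warn"
    else:
        status = "pass"
    return status, evidence
```

In this sketch a single blocked production environment is enough to fail the whole check, while the two warn conditions only matter when nothing is blocked. Non-production environments are filtered out before any rule is applied, so they never influence the result.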