
master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00



Rollback Readiness

What It Checks

Queries the Release Orchestrator at /api/v1/environments/rollback-status (with fallback to /api/v1/environments) and evaluates rollback capability for production environments:

Cannot roll back: the check fails if a production environment has a previous version but cannot roll back to it (e.g., irreversible migration, artifacts purged).
No previous version: the check warns if a production environment has no previous deployment to roll back to.
Missing health probe: the check warns if a production environment lacks a health probe, because without one auto-rollback on failure cannot trigger.

Only production environments (type "prod" or "production") are evaluated. Non-production environments are not checked.
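
The rules above can be sketched as a small classifier. This is a minimal illustration, not the plugin's actual implementation: the field names `type`, `previous_version`, `can_rollback`, and `health_probe_url` are hypothetical stand-ins for whatever the Release Orchestrator response actually contains, and the case-insensitive type match is an assumption.

```python
from typing import Any

PROD_TYPES = {"prod", "production"}


def classify(env: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (severity, reason) findings for one environment.

    All field names are hypothetical and used for illustration only.
    """
    # Non-production environments are not checked at all.
    # Assumption: the type comparison ignores case.
    if str(env.get("type", "")).lower() not in PROD_TYPES:
        return []

    findings: list[tuple[str, str]] = []
    if not env.get("previous_version"):
        # Nothing to roll back to: warning only.
        findings.append(("warn", "no previous version"))
    elif not env.get("can_rollback", False):
        # A previous version exists but rollback is blocked: failure.
        findings.append(("fail", "cannot roll back"))
    if not env.get("health_probe_url"):
        # Reported independently of the rollback state.
        findings.append(("warn", "missing health probe"))
    return findings
```

An environment can produce more than one finding; a first deployment without a health probe, for example, yields two warnings.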

Evidence collected: prod_environment_count, rollback_ready_count, cannot_rollback_count, no_previous_version_count, no_health_probe_count, cannot_rollback_environments, rollback_blocker.
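
As a rough illustration of how these evidence keys might relate to one another, the sketch below derives them from a list of production environments. The input field names (`name`, `previous_version`, `can_rollback`, `blocker`, `health_probe_url`) are hypothetical, and so is the exact meaning given to each key: for instance, it assumes "rollback ready" means a previous version exists and rollback is allowed, and that `rollback_blocker` reports the first blocked environment's reason.

```python
from typing import Any


def collect_evidence(prod_envs: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarise production environments into the evidence keys listed above.

    Input field names and key semantics are assumptions for illustration.
    """
    with_previous = [e for e in prod_envs if e.get("previous_version")]
    ready = [e for e in with_previous if e.get("can_rollback", False)]
    blocked = [e for e in with_previous if not e.get("can_rollback", False)]
    no_probe = [e for e in prod_envs if not e.get("health_probe_url")]

    return {
        "prod_environment_count": len(prod_envs),
        "rollback_ready_count": len(ready),
        "cannot_rollback_count": len(blocked),
        "no_previous_version_count": len(prod_envs) - len(with_previous),
        "no_health_probe_count": len(no_probe),
        "cannot_rollback_environments": ", ".join(e.get("name", "?") for e in blocked),
        # Assumption: only the first blocked environment's reason is reported.
        "rollback_blocker": blocked[0].get("blocker", "unknown") if blocked else "",
    }
```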

The check requires ReleaseOrchestrator:Url or Release:Orchestrator:Url to be configured.
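
The configuration lookup and the endpoint fallback described in this section could look roughly like the following. It is a sketch under stated assumptions: the precedence between the two configuration keys and the condition that triggers the fallback (here, any failed request) are guesses, and `fetch` is a hypothetical caller-supplied function rather than part of any real client.

```python
from typing import Callable, Mapping, Optional

# Assumption: keys are tried in the order they are listed in the text.
CONFIG_KEYS = ("ReleaseOrchestrator:Url", "Release:Orchestrator:Url")
ENDPOINTS = ("/api/v1/environments/rollback-status", "/api/v1/environments")


def resolve_base_url(config: Mapping[str, str]) -> Optional[str]:
    """Return the first configured orchestrator URL, or None if neither key is set."""
    for key in CONFIG_KEYS:
        value = config.get(key, "").strip()
        if value:
            return value.rstrip("/")
    return None


def fetch_with_fallback(
    base_url: str, fetch: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Try the rollback-status endpoint first, then the generic environments list.

    `fetch` returns the response body, or None when the request fails.
    """
    for path in ENDPOINTS:
        body = fetch(base_url + path)
        if body is not None:
            return body
    return None
```

Passing the HTTP call in as a function keeps the fallback order easy to test without a running orchestrator.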

Why It Matters

Rollback is the primary recovery mechanism when a production deployment introduces a critical issue. If rollback is unavailable, the only options are an emergency forward-fix or extended downtime. Missing health probes prevent automatic rollback on deployment failure, requiring manual intervention during incidents. In regulated environments, rollback readiness is often a compliance requirement for change management.

Common Causes

Previous deployment artifacts not retained (artifact retention policy too aggressive)
Database migration not reversible (destructive schema change)
Breaking API change deployed that prevents running the previous version
Rollback manually disabled for the environment
First deployment to environment (no previous version exists)
Deployment history cleared during maintenance
Health probe URL not configured for auto-rollback
Auto-rollback on failure not enabled

How to Fix

Docker Compose

# Check rollback status for a specific environment
stella env rollback-status <environment-name>

# View deployment history
stella env history <environment-name>

# Configure artifact retention to keep previous versions

services:
  orchestrator:
    environment:
      Release__ArtifactRetention__Count: "5"
      Release__ArtifactRetention__Days: "30"

Configure health probes:

# Set health probe for a production environment
stella env configure <environment-name> --health-probe-url "http://<app>:8080/health"

# Enable auto-rollback on failure
stella env configure <environment-name> --auto-rollback-on-failure

Bare Metal / systemd

# Check rollback blockers
stella env rollback-status <environment-name>

# View deployment history
stella env history <environment-name>

# Configure health probe
stella env configure <environment-name> --health-probe-url "http://localhost:8080/health"

# Enable auto-rollback
stella env configure <environment-name> --auto-rollback-on-failure

Edit /etc/stellaops/orchestrator/appsettings.json:

{
  "Release": {
    "ArtifactRetention": {
      "Count": 5,
      "Days": 30
    }
  }
}
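
The JSON above and the `Release__ArtifactRetention__*` variables in the Docker Compose section describe the same two settings. Assuming the orchestrator uses the standard .NET configuration providers, a double underscore in an environment variable name stands for one level of nesting. The helper below is purely illustrative (`to_env_vars` is not part of any Stella tooling) and shows the mapping:

```python
import json
from typing import Any


def to_env_vars(settings: dict[str, Any], prefix: str = "") -> dict[str, str]:
    """Flatten nested settings into NAME__CHILD -> value pairs."""
    result: dict[str, str] = {}
    for key, value in settings.items():
        name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict):
            # Each nesting level adds one double-underscore separator.
            result.update(to_env_vars(value, name))
        else:
            result[name] = str(value)
    return result


appsettings = json.loads(
    '{"Release": {"ArtifactRetention": {"Count": 5, "Days": 30}}}'
)
for name, value in to_env_vars(appsettings).items():
    print(f"{name}={value}")
# Release__ArtifactRetention__Count=5
# Release__ArtifactRetention__Days=30
```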

Kubernetes / Helm

# Check rollback status
kubectl exec -it <orchestrator-pod> -- stella env rollback-status <environment-name>

# View deployment history
kubectl exec -it <orchestrator-pod> -- stella env history <environment-name>

Set in Helm values.yaml:

releaseOrchestrator:
  artifactRetention:
    count: 5
    days: 30
  environments:
    production:
      healthProbeUrl: "http://app:8080/health"
      autoRollbackOnFailure: true

Verification

stella doctor run --check check.release.rollback.readiness

Related Checks

check.release.active: failed releases may require rollback
check.release.environment.readiness: environment health affects rollback execution
check.release.configuration: workflow configuration defines rollback behavior
