
checkId: check.environment.deployments
plugin: stellaops.doctor.environment
severity: warn
tags: environment, deployment, services, health

Environment Deployment Health

What It Checks

Queries the Release Orchestrator (/api/v1/environments/deployments) for all deployed services across all environments. Each service is evaluated for:

  • Status -- failed, stopped, degraded, or healthy
  • Replica health -- compares healthyReplicas against total replicas; partial health triggers degraded status

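The replica comparison can be sketched as a simple per-service rule. This is an illustration only; the variable names and sample values are assumptions, not the orchestrator's actual field names:

```shell
# Sketch of the per-service replica rule; values are illustrative.
replicas=3
healthy_replicas=2

if [ "$healthy_replicas" -lt "$replicas" ]; then
  service_status="degraded"   # partial replica health triggers degraded
else
  service_status="healthy"
fi
echo "$service_status"        # prints "degraded" for 2 of 3 healthy
```
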
Severity escalation:

  • Fail if any production service has status failed (production detected by environment name containing "prod")
  • Fail if any non-production service has status failed
  • Warn if services are degraded (partial replica health)
  • Warn if services are stopped
  • Pass if all services are healthy

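The escalation rules can be sketched in shell over sample data. The environment:service:status triple format and the service names below are illustrative assumptions, not the Release Orchestrator's wire format:

```shell
# Aggregate the worst observed severity across sampled deployments.
# Entry format: environment:service:status (illustrative only).
deployments=(
  "prod-eu:scanner:healthy"
  "staging:notify:degraded"
  "prod-us:attestor:failed"
)

result="pass"
for entry in "${deployments[@]}"; do
  IFS=':' read -r env svc status <<< "$entry"
  case "$status" in
    failed)
      result="fail"                    # any failed service fails the check
      ;;
    degraded|stopped)
      if [ "$result" != "fail" ]; then
        result="warn"                  # degraded or stopped services warn
      fi
      ;;
  esac
done
echo "$result"   # prod-us:attestor is failed, so this prints "fail"
```
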
Why It Matters

Failed services in production directly impact end users and violate SLA commitments. Degraded services with partial replica health reduce fault tolerance and can cascade into full outages under load. Stopped services may indicate incomplete deployments or maintenance windows that were never closed. This check provides the earliest signal that a deployment rollout needs intervention.

Common Causes

  • Service crashed due to unhandled exception or OOM kill
  • Deployment rolled out a bad image version
  • Dependency (database, cache, message broker) became unavailable
  • Resource exhaustion preventing replicas from starting
  • Health check endpoint misconfigured, causing false failures
  • Node failure taking down co-located replicas

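When triaging a crashed service, the exit code often narrows down the cause. By a common Unix convention (not specific to StellaOps), codes above 128 indicate the process died from a signal; 137 = 128 + 9 (SIGKILL) is a frequent OOM-kill signature:

```shell
# Decode a process exit code; the value here is an illustrative example.
exit_code=137

if [ "$exit_code" -gt 128 ]; then
  echo "killed by signal $((exit_code - 128))"   # prints "killed by signal 9"
else
  echo "exited normally with code $exit_code"
fi
```
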
How to Fix

Docker Compose

# Identify failed containers
docker ps -a --filter "status=exited" --filter "status=dead"

# View logs for the failed service
docker logs <container-name> --tail 200

# Restart the failed service
docker compose -f docker-compose.stella-ops.yml restart <service-name>

# If the image is bad, roll back to previous version
# Edit docker-compose.stella-ops.yml to pin the previous image tag
docker compose -f docker-compose.stella-ops.yml up -d <service-name>

Bare Metal / systemd

# Check service status
sudo systemctl status stellaops-<service-name>

# View logs for crash details
sudo journalctl -u stellaops-<service-name> --since "30 minutes ago" --no-pager

# Restart the service
sudo systemctl restart stellaops-<service-name>

# Roll back to previous binary
sudo cp /opt/stellaops/backup/<service-name> /opt/stellaops/bin/<service-name>
sudo systemctl restart stellaops-<service-name>

Kubernetes / Helm

# Check pod status across environments
kubectl get pods -n stellaops-<env> --field-selector=status.phase!=Running

# View events and logs for failing pods
kubectl describe pod <pod-name> -n stellaops-<env>
kubectl logs <pod-name> -n stellaops-<env> --previous

# Rollback a deployment
kubectl rollout undo deployment/<service-name> -n stellaops-<env>

# Or via Helm
helm rollback stellaops <previous-revision> -n stellaops-<env>

Verification

Re-run the check after remediation to confirm all services report healthy:

stella doctor run --check check.environment.deployments

Related Checks

  • check.environment.capacity -- resource exhaustion can cause deployment failures
  • check.environment.connectivity -- the agent must be reachable to report deployment health
  • check.environment.drift -- configuration drift can cause services to fail after redeployment