
master c58a236d70 Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 12:28:00 +02:00

3.4 KiB



Dead Letter Queue

What It Checks

Counts the jobs in the dead letter queue, meaning failed jobs that have exhausted their retry attempts and require manual review, and compares the count against these thresholds:

Critical threshold: fail when more than 50 failed jobs accumulate in the dead letter queue.
Warning threshold: warn when more than 10 failed jobs are present.
Acceptable range: 1-10 failed jobs pass with an informational note.
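
The thresholds above can be sketched as a small bash function. This is an illustration of the documented boundaries only, not the check's actual implementation; the function name `classify_dead_letters` and the result labels are hypothetical, and treating an empty queue as a plain pass is an assumption, since the text does not state the zero case.

```shell
# Sketch only: maps a failed-job count to the outcome described above.
# The function name and the labels fail/warn/info/pass are hypothetical.
classify_dead_letters() {
  local count=${1:-}
  if ! [[ $count =~ ^[0-9]+$ ]]; then
    echo "usage: classify_dead_letters <non-negative integer>" >&2
    return 2
  fi
  count=$((10#$count))               # force base 10 so "08" is not read as octal
  if   (( count > 50 )); then echo fail   # more than 50: critical threshold
  elif (( count > 10 )); then echo warn   # more than 10: warning threshold
  elif (( count >= 1 )); then echo info   # 1-10: pass with an informational note
  else                        echo pass   # assumption: an empty queue simply passes
  fi
}
```

Note that both boundaries are exclusive: exactly 50 failed jobs still warns, and exactly 10 still passes with a note.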

Evidence collected: FailedJobs, OldestFailure, MostCommonError, RetryableCount.
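
To make the four evidence fields concrete, the sketch below derives them from a hypothetical tab-separated export with one failed job per line (ISO 8601 UTC timestamp, `true`/`false` retryable flag, error message). The input format and the function name `summarize_dead_letters` are assumptions made for illustration; the check itself may compute these values differently.

```shell
# Sketch only. Hypothetical input, one failed job per line, tab-separated:
#   <ISO 8601 UTC timestamp> <TAB> <retryable: true|false> <TAB> <error message>
summarize_dead_letters() {
  awk -F'\t' '
    NF >= 3 {
      failed++
      # ISO 8601 UTC timestamps sort correctly as plain strings
      if (oldest == "" || ($1 "") < (oldest "")) oldest = $1
      if ($2 == "true") retryable++
      errors[$3]++
    }
    END {
      top = ""; top_n = 0
      for (e in errors) {
        # highest count wins; ties go to the alphabetically first message
        if (errors[e] > top_n || (errors[e] == top_n && e < top)) {
          top = e; top_n = errors[e]
        }
      }
      printf "FailedJobs=%d\n", failed
      printf "OldestFailure=%s\n", oldest
      printf "MostCommonError=%s\n", top
      printf "RetryableCount=%d\n", retryable
    }
  ' "$@"
}
```

The function reads from the files given as arguments, or from standard input when none are given.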

This check always runs (no configuration prerequisites).

Why It Matters

Dead letter queue entries represent work that the system was unable to complete after all retry attempts. Each entry is a job that may have had side effects (partial writes, notifications sent, resources allocated) and now sits in an inconsistent state. A growing dead letter queue usually points to a systemic issue: a downstream service outage, a configuration error, or a bug that is causing repeated failures. Left unattended, dead letters accumulate and make it harder to isolate the root cause of operational issues.

Common Causes

Persistent downstream service failures (registry unavailable, external API down)
Configuration errors causing jobs to fail deterministically (wrong credentials, missing endpoints)
Resource exhaustion (out of memory, disk full) during job execution
Integration service outage (SCM, CI, secrets manager)
Transient failures accumulating faster than the retry mechanism can clear them
Jobs consistently failing on specific artifact types or inputs

How to Fix

Docker Compose

# List dead letter queue entries
stella orchestrator deadletter list --limit 20

# Analyze common failure patterns
stella orchestrator deadletter analyze

# Retry jobs that are eligible for retry
stella orchestrator deadletter retry --filter retryable

# Retry all failed jobs
stella orchestrator deadletter retry --all

# View orchestrator logs for root cause
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator | grep -i "error\|fail"

Bare Metal / systemd

# List recent failures
stella orchestrator deadletter list --since 1h

# Analyze failure patterns
stella orchestrator deadletter analyze

# Retry retryable jobs
stella orchestrator deadletter retry --filter retryable

# Check orchestrator service health
sudo systemctl status stellaops-orchestrator
sudo journalctl -u stellaops-orchestrator --since "4 hours ago" | grep -i "deadletter\|error"

Kubernetes / Helm

# List dead letter entries
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter list --limit 20

# Analyze failures
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter analyze

# Retry retryable jobs
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter retry --filter retryable

# Check orchestrator pod logs
kubectl logs -l app=stellaops-orchestrator --tail=200 | grep -i dead.letter
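
The `<orchestrator-pod>` placeholder in the commands above can be resolved from the same label selector that the log command uses. The following is a minimal bash sketch, assuming the orchestrator pods carry the `app=stellaops-orchestrator` label in the current namespace; the helper names `orchestrator_pod` and `stella_in_pod` are hypothetical.

```shell
# Sketch only: helper names are hypothetical, and the label selector is
# assumed to match the orchestrator pods in the current namespace.

# Print the name of the first pod matching the selector.
orchestrator_pod() {
  kubectl get pods -l app=stellaops-orchestrator \
    -o jsonpath='{.items[0].metadata.name}'
}

# Run a stella subcommand inside the resolved pod.
stella_in_pod() {
  local pod
  pod=$(orchestrator_pod) || return 1
  if [ -z "$pod" ]; then
    echo "no orchestrator pod found" >&2
    return 1
  fi
  kubectl exec "$pod" -- stella "$@"
}
```

For example, `stella_in_pod orchestrator deadletter list --limit 20` runs the listing command without looking up the pod name by hand. Add `-n <namespace>` to both `kubectl` calls if the orchestrator does not run in the current namespace.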

Verification

stella doctor run --check check.operations.dead-letter

Related Checks

check.operations.job-queue -- job queue backlog can indicate the same underlying issue
check.operations.scheduler -- scheduler failures may produce dead letter entries
check.postgres.connectivity -- database issues are a common root cause of job failures
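
Because these checks often fail together, it can be convenient to run the dead letter check and the three checks listed above in one pass. The loop below is a sketch that reuses the `stella doctor run --check` invocation from the Verification step; the function name `run_related_checks` is hypothetical, and it assumes the command exits non-zero when a check fails, which this article does not confirm.

```shell
# Sketch only: runs the dead letter check plus the three checks listed above.
# Assumption: `stella doctor run` exits non-zero when a check fails.
run_related_checks() {
  local check failed=0
  for check in \
    check.operations.dead-letter \
    check.operations.job-queue \
    check.operations.scheduler \
    check.postgres.connectivity
  do
    echo "== $check"
    # Keep going after a failure so every check is reported
    stella doctor run --check "$check" || failed=1
  done
  return "$failed"
}
```

The function returns 1 if any of the four checks reported a failure, so it can gate a follow-up step in a script.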
