---
checkId: check.operations.dead-letter
plugin: stellaops.doctor.operations
severity: warn
tags: [operations, queue, dead-letter]
---

# Dead Letter Queue

## What It Checks

Examines the dead letter queue for failed jobs that have exhausted their retry attempts and require manual review:

- **Critical threshold**: fail when more than 50 failed jobs accumulate in the dead letter queue.
- **Warning threshold**: warn when more than 10 failed jobs are present.
- **Acceptable range**: 1-10 failed jobs pass with an informational note.

Evidence collected: `FailedJobs`, `OldestFailure`, `MostCommonError`, `RetryableCount`.

This check always runs (no configuration prerequisites).

## Why It Matters

Dead letter queue entries represent work that the system was unable to complete after all retry attempts. Each entry is a job that may have had side effects (partial writes, notifications sent, resources allocated) and now sits in an inconsistent state. A growing dead letter queue indicates a systemic issue -- a downstream service outage, a configuration error, or a bug causing repeated failures. Left unattended, dead letters accumulate and can mask the root cause of operational issues.
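The thresholds described above can be sketched as a small decision function. This is illustrative only: `dead_letter_status` is not part of the stella CLI, just a restatement of the check's logic in shell.

```shell
# Map a dead letter queue depth to the status this check reports.
# Thresholds mirror the documented values: >50 fail, >10 warn,
# 1-10 pass with an informational note.
dead_letter_status() {
  if [ "$1" -gt 50 ]; then
    echo "fail"
  elif [ "$1" -gt 10 ]; then
    echo "warn"
  elif [ "$1" -ge 1 ]; then
    echo "pass (info)"
  else
    echo "pass"
  fi
}

dead_letter_status 7     # within the acceptable 1-10 range
dead_letter_status 23    # above the warning threshold
dead_letter_status 120   # above the critical threshold
```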
## Common Causes

- Persistent downstream service failures (registry unavailable, external API down)
- Configuration errors causing jobs to fail deterministically (wrong credentials, missing endpoints)
- Resource exhaustion (out of memory, disk full) during job execution
- Integration service outage (SCM, CI, secrets manager)
- Transient failures accumulating faster than the retry mechanism can clear them
- Jobs consistently failing on specific artifact types or inputs

## How to Fix

### Docker Compose

```bash
# List dead letter queue entries
stella orchestrator deadletter list --limit 20

# Analyze common failure patterns
stella orchestrator deadletter analyze

# Retry jobs that are eligible for retry
stella orchestrator deadletter retry --filter retryable

# Retry all failed jobs
stella orchestrator deadletter retry --all

# View orchestrator logs for root cause
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator | grep -i "error\|fail"
```

### Bare Metal / systemd

```bash
# List recent failures
stella orchestrator deadletter list --since 1h

# Analyze failure patterns
stella orchestrator deadletter analyze

# Retry retryable jobs
stella orchestrator deadletter retry --filter retryable

# Check orchestrator service health
sudo systemctl status stellaops-orchestrator
sudo journalctl -u stellaops-orchestrator --since "4 hours ago" | grep -i "deadletter\|error"
```

### Kubernetes / Helm

```bash
# List dead letter entries
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter list --limit 20

# Analyze failures
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter analyze

# Retry retryable jobs
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter retry --filter retryable

# Check orchestrator pod logs
kubectl logs -l app=stellaops-orchestrator --tail=200 | grep -i dead.letter
```

## Verification

```
stella doctor run --check check.operations.dead-letter
```

## Related Checks

- `check.operations.job-queue` -- job queue backlog can indicate the same underlying issue
- `check.operations.scheduler` -- scheduler failures may produce dead letter entries
- `check.postgres.connectivity` -- database issues are a common root cause of job failures
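When deciding between `retry --filter retryable` and `retry --all`, the evidence counts (`FailedJobs`, `RetryableCount`) can guide the choice. The heuristic below is an assumption for illustration, not product guidance, and `choose_action` is not part of the stella CLI:

```shell
# Pick a retry strategy from the dead letter evidence counts.
# $1 = FailedJobs, $2 = RetryableCount (illustrative heuristic).
choose_action() {
  if [ "$2" -eq "$1" ] && [ "$1" -gt 0 ]; then
    echo "retry --all"              # every entry is retryable
  elif [ "$2" -gt 0 ]; then
    echo "retry --filter retryable" # retry the eligible subset first
  else
    echo "analyze"                  # nothing retryable: find root cause
  fi
}

choose_action 23 17   # mixed queue: start with the retryable subset
```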