---
checkId: check.operations.dead-letter
plugin: stellaops.doctor.operations
severity: warn
tags: [operations, queue, dead-letter]
---

# Dead Letter Queue

## What It Checks

Examines the dead letter queue for failed jobs that have exhausted their retry attempts and require manual review:

- **Critical threshold**: fail when more than 50 failed jobs accumulate in the dead letter queue.
- **Warning threshold**: warn when more than 10 failed jobs are present.
- **Acceptable range**: 1-10 failed jobs pass with an informational note.

Evidence collected: `FailedJobs`, `OldestFailure`, `MostCommonError`, `RetryableCount`.

This check always runs (no configuration prerequisites).

## Why It Matters

Dead letter queue entries represent work that the system was unable to complete after all retry attempts. Each entry is a job that may have had side effects (partial writes, notifications sent, resources allocated) and now sits in an inconsistent state. A growing dead letter queue indicates a systemic issue -- a downstream service outage, a configuration error, or a bug causing repeated failures. Left unattended, dead letters accumulate and can mask the root cause of operational issues.
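The thresholds described above can be sketched as a small decision function. This is illustrative only: `dead_letter_status` is not part of the stella CLI, just a restatement of the check's logic in shell.

```shell
# Map a dead letter queue depth to the status this check reports.
# Thresholds mirror the documented values: >50 fail, >10 warn,
# 1-10 pass with an informational note.
dead_letter_status() {
  if [ "$1" -gt 50 ]; then
    echo "fail"
  elif [ "$1" -gt 10 ]; then
    echo "warn"
  elif [ "$1" -ge 1 ]; then
    echo "pass (info)"
  else
    echo "pass"
  fi
}

dead_letter_status 7     # within the acceptable 1-10 range
dead_letter_status 23    # above the warning threshold
dead_letter_status 120   # above the critical threshold
```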
## Common Causes

- Persistent downstream service failures (registry unavailable, external API down)
- Configuration errors causing jobs to fail deterministically (wrong credentials, missing endpoints)
- Resource exhaustion (out of memory, disk full) during job execution
- Integration service outage (SCM, CI, secrets manager)
- Transient failures accumulating faster than the retry mechanism can clear them
- Jobs consistently failing on specific artifact types or inputs

## How to Fix

### Docker Compose

```bash
# List dead letter queue entries
stella orchestrator deadletter list --limit 20

# Analyze common failure patterns
stella orchestrator deadletter analyze

# Retry jobs that are eligible for retry
stella orchestrator deadletter retry --filter retryable

# Retry all failed jobs
stella orchestrator deadletter retry --all

# View orchestrator logs for root cause
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator | grep -i "error\|fail"
```

### Bare Metal / systemd

```bash
# List recent failures
stella orchestrator deadletter list --since 1h

# Analyze failure patterns
stella orchestrator deadletter analyze

# Retry retryable jobs
stella orchestrator deadletter retry --filter retryable

# Check orchestrator service health
sudo systemctl status stellaops-orchestrator
sudo journalctl -u stellaops-orchestrator --since "4 hours ago" | grep -i "deadletter\|error"
```

### Kubernetes / Helm

```bash
# List dead letter entries
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter list --limit 20

# Analyze failures
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter analyze

# Retry retryable jobs
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter retry --filter retryable

# Check orchestrator pod logs
kubectl logs -l app=stellaops-orchestrator --tail=200 | grep -i dead.letter
```

## Verification

```
stella doctor run --check check.operations.dead-letter
```

## Related Checks

- `check.operations.job-queue` -- job queue backlog can indicate the same underlying issue
- `check.operations.scheduler` -- scheduler failures may produce dead letter entries
- `check.postgres.connectivity` -- database issues are a common root cause of job failures
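When deciding between `retry --filter retryable` and `retry --all`, the evidence counts (`FailedJobs`, `RetryableCount`) can guide the choice. The heuristic below is an assumption for illustration, not product guidance, and `choose_action` is not part of the stella CLI:

```shell
# Pick a retry strategy from the dead letter evidence counts.
# $1 = FailedJobs, $2 = RetryableCount (illustrative heuristic).
choose_action() {
  if [ "$2" -eq "$1" ] && [ "$1" -gt 0 ]; then
    echo "retry --all"              # every entry is retryable
  elif [ "$2" -gt 0 ]; then
    echo "retry --filter retryable" # retry the eligible subset first
  else
    echo "analyze"                  # nothing retryable: find root cause
  fi
}

choose_action 23 17   # mixed queue: start with the retryable subset
```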