| checkId | plugin | severity | tags |
|---|---|---|---|
| check.operations.dead-letter | stellaops.doctor.operations | warn | |
Dead Letter Queue
What It Checks
Examines the dead letter queue for failed jobs that have exhausted their retry attempts and require manual review:
- Critical threshold: fail when more than 50 failed jobs accumulate in the dead letter queue.
- Warning threshold: warn when more than 10 (and no more than 50) failed jobs are present.
- Acceptable range: 1-10 failed jobs pass with an informational note.
Evidence collected: FailedJobs, OldestFailure, MostCommonError, RetryableCount.
This check always runs (no configuration prerequisites).
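The thresholds above can be sketched as a small shell function. This is an illustration only, not the check's actual implementation: the function name `classify_dead_letters` is hypothetical, and treating an empty queue (0 failed jobs) as a pass is an assumption, since the article only describes the 1-10 range.

```shell
# Hypothetical helper (not part of the stella CLI): map a failed-job count
# to the outcome described by the documented thresholds.
classify_dead_letters() {
  local count="$1"
  if [ "$count" -gt 50 ]; then
    # More than 50 failed jobs: critical threshold.
    echo "fail"
  elif [ "$count" -gt 10 ]; then
    # 11 to 50 failed jobs: warning threshold.
    echo "warn"
  else
    # 1 to 10 failed jobs pass; 0 is assumed to pass as well.
    echo "pass"
  fi
}
```

For example, `classify_dead_letters 12` prints `warn`, because 12 is above the warning threshold but not above the critical one.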
Why It Matters
Dead letter queue entries represent work that the system was unable to complete after all retry attempts. Each entry is a job that may have had side effects (partial writes, notifications sent, resources allocated) and now sits in an inconsistent state. A growing dead letter queue indicates a systemic issue -- a downstream service outage, a configuration error, or a bug that is causing repeated failures. Left unattended, dead letters accumulate and can mask the root cause of operational issues.
Common Causes
- Persistent downstream service failures (registry unavailable, external API down)
- Configuration errors causing jobs to fail deterministically (wrong credentials, missing endpoints)
- Resource exhaustion (out of memory, disk full) during job execution
- Integration service outage (SCM, CI, secrets manager)
- Transient failures accumulating faster than the retry mechanism can clear them
- Jobs consistently failing on specific artifact types or inputs
How to Fix
Docker Compose
# List dead letter queue entries
stella orchestrator deadletter list --limit 20
# Analyze common failure patterns
stella orchestrator deadletter analyze
# Retry jobs that are eligible for retry
stella orchestrator deadletter retry --filter retryable
# Retry all failed jobs
stella orchestrator deadletter retry --all
# View orchestrator logs for root cause
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator | grep -i "error\|fail"
Bare Metal / systemd
# List recent failures
stella orchestrator deadletter list --since 1h
# Analyze failure patterns
stella orchestrator deadletter analyze
# Retry retryable jobs
stella orchestrator deadletter retry --filter retryable
# Check orchestrator service health
sudo systemctl status stellaops-orchestrator
sudo journalctl -u stellaops-orchestrator --since "4 hours ago" | grep -i "deadletter\|error"
Kubernetes / Helm
# List dead letter entries
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter list --limit 20
# Analyze failures
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter analyze
# Retry retryable jobs
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter retry --filter retryable
# Check orchestrator pod logs
kubectl logs -l app=stellaops-orchestrator --tail=200 | grep -i dead.letter
Verification
stella doctor run --check check.operations.dead-letter
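The remediation and verification steps can also be chained into one routine. The sketch below uses only commands shown in this article. The function name `remediate_dead_letters` and the `STELLA_CMD` override are hypothetical conveniences, not part of the stella CLI, and the sketch assumes each command exits non-zero on failure.

```shell
# Hypothetical wrapper around commands documented above.
# STELLA_CMD is an assumed override for dry runs, e.g. STELLA_CMD="echo stella".
remediate_dead_letters() {
  local stella="${STELLA_CMD:-stella}"
  # Inspect failure patterns before retrying anything.
  $stella orchestrator deadletter analyze || return 1
  # Retry only the jobs that are eligible for retry.
  $stella orchestrator deadletter retry --filter retryable || return 1
  # Re-run the Doctor check to see whether the queue is back under threshold.
  $stella doctor run --check check.operations.dead-letter
}
```

Running it with `STELLA_CMD="echo stella"` prints the three commands instead of executing them, which is a cheap way to review the sequence first. The sketch deliberately omits `retry --all`, since retrying jobs that fail deterministically only refills the queue.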
Related Checks
- check.operations.job-queue -- job queue backlog can indicate the same underlying issue
- check.operations.scheduler -- scheduler failures may produce dead letter entries
- check.postgres.connectivity -- database issues are a common root cause of job failures