Queue Retries Without Chaos: Backoff + Dead-Letter Design
Retrying failed jobs blindly can amplify outages. A stable queue system needs bounded retries, exponential backoff, and dead-letter isolation.
Step 1: classify retryable vs terminal failures
function isRetryable(err: Error): boolean {
return /timeout|503|rate limit/i.test(err.message)
}
Step 2: apply capped exponential delay
function nextDelay(attempt: number) {
return Math.min(300, 2 ** attempt) // seconds
}
Step 3: route exhausted jobs to DLQ with context
{
"job_id": "job-81",
"attempts": 6,
"last_error": "503 upstream timeout"
}
Pitfall
Infinite retries on non-retryable payload errors. The queue never drains and hides true incidents.
Verification
- Retry count caps are enforced.
- DLQ contains enough context for manual replay.
- Queue lag recovers after upstream incidents.