Introduction
Modern Event-Driven Architectures (EDA) are asynchronous and distributed by nature. That gives us scalability and resilience but it also means we must design carefully for failure, because when things break, debugging or replaying events can be painful if the foundations are weak.
I focus on the following reliability patterns in EDA systems:
- Retries
- Idempotency
- Dead-Letter Queues
- Parking Lot Queues
- Poison Message handling
- Soft vs Hard failure logic
1. Failure Happens. Expect It.
In synchronous request-response systems, if something fails, the caller usually gets an error and can try again. In event-driven systems, the producer has already moved on, so the consumer owns the retry behavior. The system must recover without anyone manually “fixing” anything.
Consumers should therefore implement retries, idempotency, and DLQs.
2. Build your system for Idempotency
Not all AWS services retry the same way. Understanding their retry model helps prevent surprises.
| Service | Retry Behavior | Notes |
|---|---|---|
| SNS | Retries delivery to endpoints (e.g. Lambda, HTTP) with exponential backoff. | If retries keep failing, message is dropped unless a DLQ is configured. |
| SQS | Consumer controls retries by not deleting the message. It becomes visible again after the Visibility Timeout. | You choose how many times to attempt before DLQ. |
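| Lambda Triggers (SNS/SQS/EventBridge) | For asynchronous invocations (e.g. from SNS or EventBridge), Lambda retries a failed invocation 2 times by default, then sends the event to a DLQ or on-failure destination if one is configured. SQS triggers are polled through an event source mapping, so their retries follow the queue's visibility timeout and redrive policy instead. | Configurable via retry settings and destinations. |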
| EventBridge | Retries for up to 24 hours with exponential backoff by default. | If the target keeps failing, the event is dropped unless a DLQ is configured for that target. |
Your consumer must be safe to retry. Always assume a message may be processed more than once.
Why we need it:
- Retries happen everywhere
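- Slow invocations (for example after a Lambda cold start) can outlast the visibility timeout, so the message is delivered again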
- Duplicate events can be sent by producer
These are common approaches:
| Approach | Example | When to Use |
|---|---|---|
| Idempotency Table | Store processed message IDs in DynamoDB | Most common; easy |
| Upsert / Merge Writes | SQL INSERT ... ON CONFLICT DO NOTHING | When DB supports it |
| Event Sourcing State | Skip if state already applied | CQRS/event sourcing systems |
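
To make the idempotency-table approach concrete, here is a minimal sketch. It uses an in-memory dictionary as a stand-in for the table so it runs anywhere; with DynamoDB the `claim` step would typically be a conditional write that only succeeds if the message ID does not exist yet. The names `IdempotencyStore`, `claim`, `release`, and `handle_once` are made up for this illustration.

```python
import time
from typing import Callable, Dict


class IdempotencyStore:
    """In-memory stand-in for an idempotency table (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self._seen: Dict[str, float] = {}  # message_id -> expiry timestamp
        self._ttl = ttl_seconds

    def claim(self, message_id: str) -> bool:
        """Return True only if this message_id has no unexpired claim."""
        now = time.time()
        expiry = self._seen.get(message_id)
        if expiry is not None and expiry > now:
            return False  # already processed, or being processed right now
        self._seen[message_id] = now + self._ttl
        return True

    def release(self, message_id: str) -> None:
        """Drop the claim so a later retry can process the message."""
        self._seen.pop(message_id, None)


def handle_once(store: IdempotencyStore, message_id: str, payload: dict,
                process: Callable[[dict], None]) -> bool:
    """Run process(payload) at most once per message_id.

    Returns True if the message was processed, False if it was a duplicate.
    """
    if not store.claim(message_id):
        return False
    try:
        process(payload)
    except Exception:
        # Processing failed: release the claim so the retry is not skipped.
        store.release(message_id)
        raise
    return True
```

Releasing the claim on failure is a design choice: without it, a retry of a message that failed halfway would be silently treated as a duplicate.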
3. Soft Failures vs Hard Failures
Soft Failure
Temporary issue → retry should fix it. Examples:
- Database connection timeout
- Third-party API slow for 10 seconds
- Network congestion
Action: Retry with backoff (with jitter)
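
A rough sketch of retry with backoff and jitter, assuming the "full jitter" variant (a random delay between zero and an exponentially growing cap). The function names, the default values, and the choice to treat `TimeoutError` and `ConnectionError` as soft failures are assumptions for this example only.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def retry_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation(), retrying soft failures with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # retries exhausted; let the caller dead-letter it
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

The jitter matters because many consumers that fail at the same moment would otherwise retry at the same moment too, and hit the recovering dependency all at once.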
Hard Failure
The message will never succeed. Examples:
- Message is malformed
- Required data missing
- Business rule violation (e.g., “Account is closed”)
Action: Move to DLQ immediately (retrying is pointless here)
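
One way to express the soft vs hard split in a consumer is with two exception types. This is only a sketch: `HardFailure`, `SoftFailure`, `validate`, `consume`, and the message fields are hypothetical, and the DLQ is a plain list so the example stays self-contained.

```python
from typing import Callable, List


class HardFailure(Exception):
    """The message can never succeed (malformed, missing data, rule violation)."""


class SoftFailure(Exception):
    """A temporary problem; retrying later may succeed."""


def validate(message: dict) -> None:
    """Reject messages that no amount of retrying will fix."""
    if "account_id" not in message:
        raise HardFailure("required field 'account_id' is missing")
    if message.get("account_status") == "closed":
        raise HardFailure("account is closed")


def consume(message: dict, process: Callable[[dict], None],
            dlq: List[dict]) -> str:
    """Return 'done', 'retry', or 'dead-lettered' for one message."""
    try:
        validate(message)
        process(message)
        return "done"
    except HardFailure as exc:
        # Retrying is pointless: park the message with the reason attached.
        dlq.append({"message": message, "reason": str(exc)})
        return "dead-lettered"
    except SoftFailure:
        # Leave the message for redelivery (e.g. do not delete it from the queue).
        return "retry"
```

Storing the failure reason next to the message makes later inspection of the DLQ much easier.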
4. Dead-Letter Queues (DLQ)
A Dead-Letter Queue is where messages go when retries are exhausted.
Why DLQs matter
- They prevent your system from getting stuck
- They let ops teams inspect / fix / reprocess failed events
- They isolate poison messages
Where to use DLQs
- SNS: DLQ per subscription
- Lambda: Configure async DLQ
- SQS: Built-in redrive policy that moves a message to the DLQ after maxReceiveCount receives
- EventBridge: DLQs for rule targets
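
For the SQS case, the DLQ is attached through the queue's `RedrivePolicy` attribute, which is a JSON string. The helper below only builds that attribute; the function name and the ARN used in the usage note are placeholders, and the resulting dictionary is what you would pass as queue attributes (for example to SetQueueAttributes) in your own setup.

```python
import json


def redrive_policy_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the SQS queue attributes that attach a DLQ to a source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        # Number of receives before SQS moves the message to the DLQ.
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy itself serialized as a JSON string.
    return {"RedrivePolicy": json.dumps(policy)}
```

For example, `redrive_policy_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq", 3)` produces the attributes for a hypothetical `orders-dlq` queue with three delivery attempts.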
The DLQ is not the end of the story. What you do with it afterward matters!
5. The Parking Lot Pattern (for Poison Messages)
A Poison Message is a message that always fails, no matter how many times you retry it. The DLQ becomes a Parking Lot for such messages.
Flow:
- Normal Queue → Consumer fails → Retries → DLQ
- Later → Human or automated process reviews → Fix → Replay back to main queue
This keeps your main system running without blocking.
Parking Lot Reprocessing Options
- Automatic scheduled Lambda to retry older DLQ messages
- AWS Console-driven re-drive
- S3 archival and batch replay later
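
The replay step could look roughly like this. The sketch uses in-memory deques instead of real queues, and `replay_parked`, the `fix` callback, the `replay_count` field, and the `max_replays` cap are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, Optional


def replay_parked(dlq: Deque[dict], main_queue: Deque[dict],
                  fix: Callable[[dict], Optional[dict]],
                  max_replays: int = 3) -> int:
    """Move repairable messages from the parking lot back to the main queue.

    fix(message) returns a repaired copy, or None if it cannot be repaired.
    Returns the number of messages replayed.
    """
    replayed = 0
    for _ in range(len(dlq)):  # visit each parked message exactly once
        message = dlq.popleft()
        replay_count = message.get("replay_count", 0)
        if replay_count >= max_replays:
            dlq.append(message)  # replayed too often: keep for human review
            continue
        repaired = fix(message)
        if repaired is None:
            dlq.append(message)  # still broken: stays parked
            continue
        repaired["replay_count"] = replay_count + 1
        main_queue.append(repaired)
        replayed += 1
    return replayed
```

Capping the number of replays avoids an endless loop where a true poison message keeps bouncing between the main queue and the DLQ.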
Wrap-up
In summary, try the best practices below. Reliability in EDA doesn’t come from preventing failure; it comes from designing for failure.
| Component | Best Practice |
|---|---|
| Producer | Generate unique message IDs |
| Queue / Bus | Set DLQ with maxReceiveCount = 3–5 |
| Consumer | Implement idempotency check & semantic retries (retry soft failures, dead-letter hard ones) |
| DLQ Handler | Build Parking Lot replay workflow |
| Observability | Log trace_id, message_id, retry_count |
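
For the observability row, one simple option is a single JSON log line per processing attempt. This is a sketch with made-up names (`log_attempt`, the `outcome` field, the `consumer` logger); where the values come from depends on your setup. With SQS, for instance, the retry count can be derived from the message's `ApproximateReceiveCount` attribute.

```python
import json
import logging

logger = logging.getLogger("consumer")


def log_attempt(trace_id: str, message_id: str, retry_count: int,
                outcome: str) -> str:
    """Emit one structured log line per attempt and return it."""
    record = {
        "trace_id": trace_id,
        "message_id": message_id,
        "retry_count": retry_count,
        "outcome": outcome,  # e.g. "done", "retry", "dead-lettered"
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

Keeping the same three fields on every line lets you follow one message across retries and into the DLQ with a single log query.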