Designing Effective AWS EDA Architecture

Introduction

Modern Event-Driven Architectures (EDA) are asynchronous and distributed by nature. That gives us scalability and resilience, but it also means we must design carefully for failure: when things break, debugging or replaying events can be painful if the foundations are weak.

In this article I focus on reliability patterns for EDA systems:

Retries
Idempotency
Dead-Letter Queues
Parking Lot Queues
Poison Message handling
Soft vs Hard failure logic

1. Failure Happens. Expect It.

In synchronous request-response systems, if something fails, the caller usually gets an error and can try again. In event-driven systems, the producer has already moved on, so the consumer owns the retry behavior, and the system must recover without anyone manually “fixing” anything.

Consumers should therefore implement retries, idempotency, and DLQs.

2. Understand Retries and Build Your System for Idempotency

Not all AWS services retry the same way. Understanding each service's retry model helps prevent surprises.

Service	Retry Behavior	Notes
SNS	Retries delivery to endpoints (e.g. Lambda, HTTP) with exponential backoff.	If retries keep failing, message is dropped unless a DLQ is configured.
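SQS	Consumer controls retries by not deleting the message. It becomes visible again after the Visibility Timeout.	The queue's redrive policy (maxReceiveCount) sets how many receives are allowed before the message moves to the DLQ.
Lambda (async invocations, e.g. from SNS or EventBridge)	Lambda retries a failed async invocation 2 times by default, then sends the event to the DLQ or on-failure destination if one is configured.	Configurable via retry settings and destinations. SQS triggers work differently: Lambda polls the queue, so retries follow the queue's visibility timeout and redrive policy.
EventBridge	Retries delivery to a target for up to 24 hours by default, with exponential backoff.	If the target keeps failing, the event goes to the target's DLQ when one is configured; otherwise it is dropped.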

Your consumer must be safe to retry. Always assume a message may be processed more than once.

Why we need it:

Retries happen everywhere
Lambda timeouts (sometimes made worse by cold starts) can cause a message to be redelivered
Duplicate events can be sent by producer

These are common approaches:

Approach	Example	When to Use
Idempotency Table	Store processed message IDs in DynamoDB	Most common; easy
Upsert / Merge Writes	SQL `INSERT ... ON CONFLICT DO NOTHING`	When DB supports it
Event Sourcing State	Skip if state already applied	CQRS/event sourcing systems
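
The idempotency-table approach can be sketched in a few lines. This is a minimal illustration, not a production implementation: `IdempotencyStore` and `handle` are hypothetical names, and an in-memory dict stands in for the table. With DynamoDB, the `claim` step would typically be a conditional write that only succeeds when the message ID is not yet present.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for an idempotency table (hypothetical helper)."""

    def __init__(self, ttl_seconds=3600):
        self._seen = {}  # message_id -> expiry timestamp
        self._ttl = ttl_seconds

    def claim(self, message_id, now=None):
        """Return True only the first time a message_id is seen within the TTL."""
        now = time.time() if now is None else now
        expiry = self._seen.get(message_id)
        if expiry is not None and expiry > now:
            return False  # duplicate delivery
        self._seen[message_id] = now + self._ttl
        return True

    def release(self, message_id):
        """Forget a claim so that a later retry can process the message."""
        self._seen.pop(message_id, None)


def handle(message, store, apply_side_effect):
    """Run the side effect at most once per message_id."""
    message_id = message["message_id"]
    if not store.claim(message_id):
        return "skipped"
    try:
        apply_side_effect(message)
    except Exception:
        store.release(message_id)  # processing failed, allow a retry
        raise
    return "processed"
```

Releasing the claim on failure matters: without it, a message that failed halfway would be skipped on its next delivery instead of retried.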

3. Soft Failures vs Hard Failures

Soft Failure

Temporary issue → retry should fix it. Examples:

Database connection timeout
Third-party API slow for 10 seconds
Network congestion

Action: Retry with backoff (with jitter)
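
As a rough sketch of what retry with backoff and jitter can look like, the helper below uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function names, the `base` and `cap` values, and the choice to treat `TimeoutError` as the soft failure are all assumptions made for illustration.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter delay in seconds for a 0-based attempt number."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


def call_with_retries(fn, max_attempts=4, sleep=time.sleep):
    """Call fn, retrying on TimeoutError with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted, surface the failure
            sleep(backoff_delay(attempt))
```

The jitter spreads retries from many consumers over time, so a recovering dependency is not hit by all of them at the same instant.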

Hard Failure

The message will never succeed. Examples:

Message is malformed
Required data missing
Business rule violation (e.g., “Account is closed”)

Action: Move to DLQ immediately (retrying is pointless here)
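
One way to encode the soft/hard distinction in a consumer is with two exception types, sketched below. Everything here is hypothetical: the `SoftFailure` and `HardFailure` classes, the `account_id` and `account_status` fields, and the `send_to_dlq` callback are illustrative assumptions, not part of any AWS SDK.

```python
class SoftFailure(Exception):
    """Temporary problem; retrying later may succeed."""


class HardFailure(Exception):
    """Permanent problem; retrying will never succeed."""


def validate(message):
    """Raise HardFailure for messages that can never be processed."""
    if "account_id" not in message:
        raise HardFailure("required data missing")
    if message.get("account_status") == "closed":
        raise HardFailure("account is closed")


def process(message, do_work, send_to_dlq):
    """Dead-letter hard failures at once; re-raise soft ones for redelivery."""
    try:
        validate(message)
        do_work(message)
        return "ok"
    except HardFailure as exc:
        send_to_dlq(message, reason=str(exc))
        return "dead-lettered"
    except (SoftFailure, TimeoutError, ConnectionError):
        raise  # leave the message on the queue so it is retried
```

Returning normally after dead-lettering tells the queue that the message is handled, so it is not redelivered.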

4. Dead-Letter Queues (DLQ)

A Dead-Letter Queue is where messages go when retries are exhausted.

Why DLQs matter

They prevent your system from getting stuck
They let ops teams inspect / fix / reprocess failed events
They isolate poison messages

Where to use DLQs

SNS: DLQ per subscription
Lambda: Configure a DLQ or on-failure destination for async invocations
SQS: Built-in redrive to a DLQ via the queue's redrive policy
EventBridge: DLQs for rule targets
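
For SQS, the DLQ is attached through the source queue's `RedrivePolicy` attribute, a JSON string containing `deadLetterTargetArn` and `maxReceiveCount`. The sketch below only builds that attribute. The helper name and the ARN in the usage note are placeholders, and the result is assumed to be passed as `Attributes` to something like boto3's `set_queue_attributes`.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the SQS queue attributes that attach a DLQ to a source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the policy as a JSON-encoded string, not a nested dict.
    return {"RedrivePolicy": json.dumps(policy)}
```

For example, `redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq", 3)` describes a queue whose messages move to the DLQ after three failed receives.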

The DLQ is not the end of the story. What you do with it afterward matters!

5. The Parking Lot Pattern (for Poison Messages)

A Poison Message is a message that always fails, no matter how many times you retry it. The DLQ becomes a Parking Lot for such messages.

Flow:

Normal Queue → Consumer fails → Retries → DLQ
Later → Human or automated process reviews → Fix → Replay back to main queue

This keeps your main system running without blocking.

Parking Lot Reprocessing Options

Automatic scheduled Lambda to retry older DLQ messages
AWS Console-driven re-drive
S3 archival and batch replay later
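
The replay step can be sketched with two in-memory queues standing in for the DLQ and the main queue. The `replay` function and the `can_replay` predicate are hypothetical. In a real system the predicate might check that a fix has been deployed or that the message has been repaired.

```python
from collections import deque


def replay(dlq, main_queue, can_replay, max_messages=10):
    """Move up to max_messages replayable messages from dlq to main_queue."""
    moved = 0
    # Inspect each parked message at most once per run.
    for _ in range(min(max_messages, len(dlq))):
        message = dlq.popleft()
        if can_replay(message):
            main_queue.append(message)
            moved += 1
        else:
            dlq.append(message)  # still poisoned, keep it parked
    return moved
```

Capping each run with `max_messages` keeps a large parking lot from flooding the main queue in one burst.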

Wrap-up

In summary, reliability in EDA doesn’t come from preventing failure; it comes from designing for failure. The best practices below are a good place to start.

Component	Best Practice
Producer	Generate unique message IDs
Queue / Bus	Set DLQ with maxReceiveCount = 3–5
Consumer	Implement idempotency check & failure-aware retries (retry soft failures, dead-letter hard ones)
DLQ Handler	Build Parking Lot replay workflow
Observability	Log `trace_id`, `message_id`, `retry_count`
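
A small sketch of the observability row: emitting one JSON log line per event so that `trace_id`, `message_id`, and `retry_count` can be searched and correlated. The `log_event` helper and the `consumer` logger name are assumptions made for illustration.

```python
import json
import logging

logger = logging.getLogger("consumer")


def log_event(level, event, *, trace_id, message_id, retry_count, **extra):
    """Log one structured line carrying the correlation fields."""
    record = {
        "event": event,
        "trace_id": trace_id,
        "message_id": message_id,
        "retry_count": retry_count,
        **extra,
    }
    line = json.dumps(record, sort_keys=True)
    logger.log(level, line)
    return line
```

Making the three fields keyword-only and required means a log call cannot omit them by accident.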