Designing Effective AWS EDA Architecture

Introduction

Modern Event-Driven Architectures (EDA) are asynchronous and distributed by nature. That gives us scalability and resilience, but it also means we must design carefully for failure: when things break, debugging or replaying events can be painful if the foundations are weak.

Here I focus on reliability patterns for EDA systems:

1. Failure Happens. Expect It.

In synchronous request-response systems, if something fails, the caller usually gets an error and can try again. In event-driven systems, the producer has already moved on and the consumer owns the retry behavior, so the system must recover without anyone manually “fixing” anything.

Consumers should implement retries, idempotency, and DLQs.

2. Build your system for Idempotency

Not all AWS services retry the same way. Understanding each service's retry model helps prevent surprises.

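| Service | Retry Behavior | Notes |
| --- | --- | --- |
| SNS | Retries delivery to endpoints (e.g. Lambda, HTTP) with exponential backoff. | If retries keep failing, the message is dropped unless a DLQ is configured. |
| SQS | Consumer controls retries by not deleting the message. It becomes visible again after the Visibility Timeout. | You choose how many times to attempt before the DLQ. |
| Lambda Triggers (SNS/SQS/EventBridge) | Lambda retries failed async invocations (e.g. from SNS or EventBridge) 2 times by default, then sends the event to a DLQ or on-failure destination if one is configured. SQS triggers are polled rather than invoked asynchronously, so failed messages return to the queue and follow its redrive policy. | Configurable via retry settings and destinations. |
| EventBridge | By default, retries for up to 24 hours with exponential backoff. | If the target keeps failing, the event goes to a DLQ if one is configured; otherwise it is dropped. |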

Your consumer must be safe to retry. Always assume a message may be processed more than once.

Why we need it:

These are common approaches:

| Approach | Example | When to Use |
| --- | --- | --- |
| Idempotency Table | Store processed message IDs in DynamoDB | Most common; easy |
| Upsert / Merge Writes | SQL INSERT ... ON CONFLICT DO NOTHING | When the DB supports it |
| Event Sourcing State | Skip if state already applied | CQRS/event sourcing systems |
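To make the idempotency table approach concrete, here is a minimal sketch. It uses an in-memory set as a stand-in for the table, and the names `IdempotencyStore`, `handle_once`, and the `message_id` field are assumptions for illustration, not part of any AWS SDK. With DynamoDB, the `claim` step would typically be a conditional write (for example a `put_item` guarded by an `attribute_not_exists` condition) so that two concurrent consumers cannot both claim the same ID.

```python
class IdempotencyStore:
    """In-memory stand-in for an idempotency table keyed by message ID."""

    def __init__(self):
        self._seen = set()

    def claim(self, message_id):
        """Record the ID; return False if it was already recorded."""
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        return True

    def release(self, message_id):
        """Forget the ID so a later retry can claim it again."""
        self._seen.discard(message_id)


def handle_once(store, message, process):
    """Run process(message) at most once per message ID.

    Returns True if the message was processed, False if it was a duplicate.
    """
    message_id = message["message_id"]
    if not store.claim(message_id):
        return False  # duplicate delivery: skip the side effects
    try:
        process(message)
    except Exception:
        # Processing failed, so release the claim and let the retry run it.
        store.release(message_id)
        raise
    return True
```

Releasing the claim on failure is a design choice: it favors retrying over risking a lost message, which is why the processing step itself should still tolerate being partially repeated.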

3. Soft Failures vs Hard Failures

Soft Failure

Temporary issue → retry should fix it. Examples:

Action: Retry with backoff (with jitter)
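A retry loop with exponential backoff and jitter could look roughly like the sketch below. The `TransientError` class, the function names, and the default values for `base`, `cap`, and `max_attempts` are all assumptions chosen for illustration; tune them to your own workload.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that a retry may fix."""


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Return a 'full jitter' delay in seconds for a 0-based attempt number."""
    ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
    return random.uniform(0.0, ceiling)        # jitter spreads retries apart


def retry_with_backoff(operation, max_attempts=5, sleep=time.sleep):
    """Call operation(); on TransientError wait and retry, up to max_attempts calls."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; let the caller (or the queue) take over
            sleep(backoff_delay(attempt))
```

The jitter matters because many consumers failing at the same moment would otherwise retry at the same moment too, hitting the recovering dependency in synchronized waves.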

Hard Failure

The message will never succeed. Examples:

Action: Move to DLQ immediately (retrying is pointless here)
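One way to express the soft/hard distinction in a consumer is to map each kind of failure to its own exception type and route on it. The sketch below is an assumption-laden illustration: `SoftFailure`, `HardFailure`, and `consume` are hypothetical names, the DLQ is modeled as a plain list, and the backoff between attempts is omitted for brevity.

```python
class SoftFailure(Exception):
    """Temporary problem; a retry may succeed."""


class HardFailure(Exception):
    """The message can never succeed; retrying is pointless."""


def consume(message, process, dlq, max_attempts=3):
    """Process a message, retrying soft failures and dead-lettering hard ones."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            process(message)
            return "processed"
        except HardFailure as exc:
            # No retry: park the message immediately with the reason attached.
            dlq.append({"message": message, "reason": str(exc), "attempts": attempt})
            return "dead-lettered"
        except SoftFailure as exc:
            # A real consumer would back off here before the next attempt.
            last_error = str(exc)
    # Soft failures that never cleared also end up in the DLQ.
    dlq.append({"message": message, "reason": last_error, "attempts": max_attempts})
    return "dead-lettered"
```

Recording the reason and attempt count alongside the message makes later triage of the DLQ much easier.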

4. Dead-Letter Queues (DLQ)

A Dead-Letter Queue is where messages go when retries are exhausted.
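For SQS, "retries are exhausted" is defined by the source queue's redrive policy: once a message has been received `maxReceiveCount` times without being deleted, SQS moves it to the queue named by `deadLetterTargetArn`. The sketch below only builds the attribute payload; the helper name `redrive_attributes` and the ARN are hypothetical, and the actual boto3 call is left as a comment so the snippet runs without AWS credentials.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the SQS queue attributes that attach a DLQ to a source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,        # where exhausted messages go
        "maxReceiveCount": max_receive_count,  # receives allowed before redrive
    }
    # SQS expects the redrive policy as a JSON-encoded string.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical ARN, for illustration only.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq", 5)

# With boto3, this dict would be applied to the source queue roughly like:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=source_queue_url, Attributes=attrs)
```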

Why DLQs matter

Where to use DLQs

The DLQ is not the end of the story. What you do with it afterward matters!

5. The Parking Lot Pattern (for Poison Messages)

A Poison Message is a message that always fails, no matter how many times you retry it. The DLQ becomes a Parking Lot for such messages.

Flow:

This keeps your main system running without blocking.

Parking Lot Reprocessing Options
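As one possible shape for a replay step, the sketch below walks the parked messages, attempts a repair, resubmits the ones that could be fixed, and leaves the rest parked. Everything here is an assumption for illustration: the function name `replay_parking_lot`, the `repair` callback (which returns a corrected message or `None`), and the `resubmit` callback (which would publish back to the main queue or bus).

```python
def replay_parking_lot(parked, repair, resubmit):
    """Replay repairable messages; return (replayed_count, still_parked)."""
    still_parked = []
    replayed = 0
    for message in parked:
        fixed = repair(message)  # None means "cannot be fixed yet"
        if fixed is None:
            still_parked.append(message)
        else:
            resubmit(fixed)      # send back through the normal consumer path
            replayed += 1
    return replayed, still_parked
```

Because replayed messages go back through the normal path, the consumer's idempotency check still protects against processing the same message twice.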

Wrap-up

In summary, try the best practices below. Reliability in EDA doesn’t come from preventing failure; it comes from designing for failure.

| Component | Best Practice |
| --- | --- |
| Producer | Generate unique message IDs |
| Queue / Bus | Set DLQ with maxReceiveCount = 3–5 |
| Consumer | Implement idempotency check & semantic retries |
| DLQ Handler | Build Parking Lot replay workflow |
| Observability | Log trace_id, message_id, retry_count |
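The producer and observability rows can be sketched together: the producer stamps each event with a unique ID, and every log line carries the same correlation fields. The envelope layout and the helper names `new_envelope` and `log_line` are assumptions for illustration, not a standard format.

```python
import json
import uuid


def new_envelope(payload, trace_id=None):
    """Wrap a payload with a unique message ID and a trace ID."""
    return {
        "message_id": str(uuid.uuid4()),            # unique per message
        "trace_id": trace_id or str(uuid.uuid4()),  # shared across a workflow
        "payload": payload,
    }


def log_line(envelope, retry_count, event):
    """Render a structured (JSON) log line with the correlation fields."""
    return json.dumps(
        {
            "event": event,
            "trace_id": envelope["trace_id"],
            "message_id": envelope["message_id"],
            "retry_count": retry_count,
        },
        sort_keys=True,
    )
```

Passing an existing `trace_id` into `new_envelope` lets follow-up events stay correlated with the request that triggered them.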
