Rishi · 13 min read

Event-Driven Architecture: Patterns, Pitfalls, and Practical Guidance

Event-driven architecture (EDA) is how modern distributed systems communicate at scale. Netflix processes billions of events daily. LinkedIn's event pipeline handles trillions of messages per day through Kafka. But EDA is not a silver bullet — it introduces complexity that can bite you if you do not understand the patterns and pitfalls.

This post covers the core concepts, the major patterns, the broker trade-offs, and the hard-won lessons from real production systems.

What Does "Event-Driven" Mean?

An event-driven system is one where state changes are communicated as events rather than through direct calls between services. Instead of Service A calling Service B's API, Service A publishes an event ("OrderPlaced") and any interested service reacts to it.

This fundamental shift changes how services relate to each other:

  • Synchronous (request-driven): Service A knows about Service B and calls it directly. Tight coupling.
  • Asynchronous (event-driven): Service A publishes events. It does not know or care who consumes them. Loose coupling.
Request-Driven:
  OrderService --calls--> PaymentService --calls--> InventoryService
  (if PaymentService is down, the whole chain breaks)

Event-Driven:
  OrderService --publishes "OrderPlaced"--> Event Broker
  PaymentService --subscribes--> processes payment
  InventoryService --subscribes--> reserves stock
  NotificationService --subscribes--> sends confirmation email
  (services fail independently)

Events vs Commands vs Queries

These three message types serve fundamentally different purposes. Confusing them leads to architectural problems.

Events

An event is a notification that something happened. It is past tense, immutable, and the publisher does not care who consumes it.

{
  "type": "OrderPlaced",
  "timestamp": "2026-04-05T14:30:00Z",
  "data": {
    "orderId": "ord-123",
    "customerId": "cust-456",
    "items": [{"sku": "WIDGET-A", "qty": 2}],
    "total": 49.98
  }
}

Commands

A command is a request to do something. It is imperative, directed at a specific service, and the sender expects it to be handled.

{
  "type": "ProcessPayment",
  "target": "payment-service",
  "data": {
    "orderId": "ord-123",
    "amount": 49.98,
    "method": "credit_card"
  }
}

Queries

A query is a request for information. It does not change state.

{
  "type": "GetOrderStatus",
  "data": { "orderId": "ord-123" }
}

Key distinction: Events have zero or many consumers. Commands have exactly one handler. Design your system around events whenever possible — they give you the most flexibility to add new consumers without changing existing services.

The Pub/Sub Pattern

Publish-subscribe is the foundation of event-driven architecture.

# Producer: publishes events without knowing consumers
class OrderService:
    def place_order(self, order_data):
        order = self.repository.create(order_data)
        # Publish event — no idea who's listening
        self.event_bus.publish("orders", {
            "type": "OrderPlaced",
            "data": order.to_dict(),
            "timestamp": datetime.utcnow().isoformat()
        })
        return order

# Consumer 1: Payment processing
class PaymentConsumer:
    @subscribe("orders", filter="OrderPlaced")
    def handle_order_placed(self, event):
        order = event["data"]
        self.payment_gateway.charge(
            order["customerId"],
            order["total"]
        )

# Consumer 2: Inventory reservation
class InventoryConsumer:
    @subscribe("orders", filter="OrderPlaced")
    def handle_order_placed(self, event):
        for item in event["data"]["items"]:
            self.inventory.reserve(item["sku"], item["qty"])

# Consumer 3: Added 6 months later — zero changes to OrderService
class AnalyticsConsumer:
    @subscribe("orders", filter="OrderPlaced")
    def handle_order_placed(self, event):
        self.metrics.increment("orders.placed")
        self.metrics.record("orders.revenue", event["data"]["total"])

The beauty of pub/sub is that Consumer 3 was added months later without touching a single line in OrderService or the other consumers.
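The `event_bus` and `@subscribe` decorator above are pseudocode. A minimal in-memory bus (a sketch only, not a real broker — names like `EventBus` are made up for illustration) makes the mechanics concrete:

```python
from collections import defaultdict

class EventBus:
    """Toy in-memory pub/sub: topic -> list of (event_type, handler) pairs."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, event_type, handler):
        self.handlers[topic].append((event_type, handler))

    def publish(self, topic, event):
        # Deliver to every handler whose type filter matches; the
        # publisher never learns who (if anyone) is listening
        for event_type, handler in self.handlers[topic]:
            if event["type"] == event_type:
                handler(event)

bus = EventBus()
received = []
bus.subscribe("orders", "OrderPlaced", lambda e: received.append(("payment", e)))
bus.subscribe("orders", "OrderPlaced", lambda e: received.append(("inventory", e)))
bus.publish("orders", {"type": "OrderPlaced", "data": {"orderId": "ord-123"}})
# Both subscribers see the event; adding a third requires no producer change
```

A real broker adds durability, delivery guarantees, and network transport on top of exactly this dispatch shape.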

Event Sourcing

Instead of storing the current state of an entity, event sourcing stores every event that led to the current state. The current state is derived by replaying events.

Traditional CRUD vs Event Sourcing

CRUD approach (stores current state):
┌──────────────────────────────────┐
│ accounts table                   │
│ id: acc-123  balance: $750.00    │
└──────────────────────────────────┘
  (How did we get to $750? No idea.)

Event Sourcing (stores all events):
┌──────────────────────────────────┐
│ AccountCreated    balance: $0    │
│ MoneyDeposited    amount: $1000  │
│ MoneyWithdrawn    amount: $200   │
│ MoneyWithdrawn    amount: $50    │
│ Current balance:  $750           │
└──────────────────────────────────┘
  (Full audit trail. Can replay to any point in time.)

Implementation

class BankAccount:
    def __init__(self, account_id: str):
        self.id = account_id
        self.balance = 0
        self.events = []

    def deposit(self, amount: float):
        if amount <= 0:
            raise ValueError("Deposit must be positive")
        event = {
            "type": "MoneyDeposited",
            "account_id": self.id,
            "amount": amount,
            "timestamp": datetime.utcnow().isoformat()
        }
        self._apply(event)
        self.events.append(event)

    def withdraw(self, amount: float):
        if amount <= 0:
            raise ValueError("Withdrawal must be positive")
        if amount > self.balance:
            raise ValueError("Insufficient funds")
        event = {
            "type": "MoneyWithdrawn",
            "account_id": self.id,
            "amount": amount,
            "timestamp": datetime.utcnow().isoformat()
        }
        self._apply(event)
        self.events.append(event)

    def _apply(self, event):
        if event["type"] == "MoneyDeposited":
            self.balance += event["amount"]
        elif event["type"] == "MoneyWithdrawn":
            self.balance -= event["amount"]

    @classmethod
    def from_events(cls, account_id: str, events: list):
        """Rebuild state by replaying events."""
        account = cls(account_id)
        for event in events:
            account._apply(event)
        return account
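Replaying the ledger from the diagram above can be sketched as a plain fold (a standalone function rather than the class, so it runs on its own):

```python
def replay_balance(events) -> float:
    """Fold an event list into the current balance."""
    balance = 0.0
    for e in events:
        if e["type"] == "MoneyDeposited":
            balance += e["amount"]
        elif e["type"] == "MoneyWithdrawn":
            balance -= e["amount"]
    return balance

history = [
    {"type": "AccountCreated"},
    {"type": "MoneyDeposited", "amount": 1000},
    {"type": "MoneyWithdrawn", "amount": 200},
    {"type": "MoneyWithdrawn", "amount": 50},
]
# Full history replays to $750, matching the ledger above;
# truncating the list replays to any earlier point in time
```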

When Event Sourcing Makes Sense

  • Audit requirements: Finance, healthcare, compliance — anywhere you need a complete history
  • Temporal queries: "What was the account balance on March 15th?"
  • Event replay: Rebuild read models, fix bugs by replaying with corrected logic
  • Complex domains: Where the history of changes matters as much as current state

When It Does Not

  • Simple CRUD applications with no audit needs
  • When storage costs of storing every event outweigh the benefits
  • When the team lacks experience with eventual consistency

CQRS (Command Query Responsibility Segregation)

CQRS separates the write model (commands) from the read model (queries). Writes go to an event store, and one or more read-optimized projections are built from those events.

                ┌─────────────────┐
  Commands ──►  │  Write Model    │ ──► Event Store
                │  (validates,    │         │
                │   enforces      │         │ events flow
                │   business      │         │ to projections
                │   rules)        │         ▼
                └─────────────────┘  ┌───────────────┐
                                     │ Read Model 1  │ ◄── Queries (API)
                                     │ (optimized    │
                                     │  for listing) │
                                     └───────────────┘
                                     ┌───────────────┐
                                     │ Read Model 2  │ ◄── Queries (search)
                                     │ (optimized    │
                                     │  for search)  │
                                     └───────────────┘

Why CQRS? Because reads and writes have fundamentally different performance characteristics and access patterns. Your write model can use a normalized schema for consistency, while your read models can be denormalized for speed.

# Write side: validates business rules, appends events
class OrderCommandHandler:
    def handle_place_order(self, cmd):
        order = Order.create(cmd.customer_id, cmd.items)
        order.validate_inventory()
        order.calculate_total()
        self.event_store.append(order.uncommitted_events)

# Read side: optimized projection for order listing
class OrderListProjection:
    def on_order_placed(self, event):
        self.db.execute("""
            INSERT INTO order_list_view
            (order_id, customer_name, total, status, placed_at)
            VALUES (%s, %s, %s, 'placed', %s)
        """, event.order_id, event.customer_name,
             event.total, event.timestamp)

    def on_order_shipped(self, event):
        self.db.execute("""
            UPDATE order_list_view
            SET status = 'shipped', shipped_at = %s
            WHERE order_id = %s
        """, event.timestamp, event.order_id)

Message Brokers: Kafka vs RabbitMQ vs SQS

Apache Kafka

Kafka is a distributed commit log designed for high-throughput, ordered, durable event streaming.

  • Throughput: Millions of messages per second
  • Ordering: Guaranteed within a partition
  • Retention: Messages persist for a configurable time (days, weeks, forever)
  • Consumer groups: Multiple consumer groups read independently from the same topic
  • Replay: Consumers can re-read messages from any offset
# Kafka producer
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",           # Wait for all replicas to confirm
    retries=3,            # Retry on transient failures
    enable_idempotence=True  # Prevent duplicate writes
)

# Partition by order_id for ordering guarantee
producer.send(
    "orders",
    key=b"ord-123",       # Same key = same partition = ordered
    value={"type": "OrderPlaced", "orderId": "ord-123"}
)

RabbitMQ

RabbitMQ is a traditional message broker that implements AMQP. It excels at complex routing and task distribution.

  • Routing: Flexible exchange types (direct, topic, fanout, headers)
  • Acknowledgment: Consumer confirms processing, broker redelivers on failure
  • Priority queues: Messages can have priority levels
  • Lower latency: Better for low-latency, request-reply patterns

Choose Kafka for high-throughput event streaming where you need replay and ordering. Choose RabbitMQ for task queues, complex routing, and request-reply patterns.

Amazon SQS

SQS is a fully managed message queue — zero infrastructure to manage.

  • Standard queues: At-least-once delivery, best-effort ordering
  • FIFO queues: Exactly-once processing, strict ordering (but lower throughput)
  • Dead letter queues: Built-in DLQ support
  • Scaling: Virtually unlimited throughput for standard queues

Choose SQS when you want simplicity and are already on AWS. The operational overhead of managing Kafka or RabbitMQ clusters is significant.

Ordering Guarantees

Ordering is one of the most misunderstood aspects of event-driven systems.

The Problem

In a distributed system, events can arrive out of order due to network delays, retries, and parallel processing. Consider:

Event 1: OrderPlaced    (produced at T=0)
Event 2: OrderCancelled (produced at T=1)

If Event 2 arrives before Event 1, the consumer might:
1. Process cancellation (no-op, order doesn't exist yet)
2. Process order placement (order is now active, but should be cancelled!)

Solutions

Partition by entity ID — Kafka guarantees ordering within a partition. Use the entity ID as the partition key.

# All events for order "ord-123" go to the same partition
producer.send("orders", key=b"ord-123", value=event)

Sequence numbers — Include a monotonically increasing sequence number. Consumers reject or buffer out-of-order events.

class SequenceCheckingConsumer:
    def handle_event(self, event):
        expected_seq = self.get_last_sequence(event["entity_id"]) + 1
        if event["sequence"] < expected_seq:
            return  # Already processed (duplicate)
        if event["sequence"] > expected_seq:
            self.buffer_event(event)  # Out of order — buffer it
            return
        self.process(event)
        self.process_buffered(event["entity_id"])  # Check buffer
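A self-contained version of that buffering logic (in-memory state, with a stub `_process` that just records sequence numbers) shows the drain step working:

```python
class SequencedConsumer:
    """Processes events in sequence order, buffering gaps and dropping dupes."""
    def __init__(self):
        self.last_seq = {}   # entity_id -> last processed sequence
        self.buffer = {}     # entity_id -> {sequence: event}
        self.processed = []  # stand-in for real side effects

    def handle(self, event):
        entity, seq = event["entity_id"], event["sequence"]
        expected = self.last_seq.get(entity, 0) + 1
        if seq < expected:
            return                                           # duplicate, drop
        if seq > expected:
            self.buffer.setdefault(entity, {})[seq] = event  # gap, hold it
            return
        self._process(event)
        # Drain any buffered successors that are now in order
        pending = self.buffer.get(entity, {})
        while self.last_seq[entity] + 1 in pending:
            self._process(pending.pop(self.last_seq[entity] + 1))

    def _process(self, event):
        self.processed.append(event["sequence"])
        self.last_seq[event["entity_id"]] = event["sequence"]

c = SequencedConsumer()
for seq in [1, 3, 2, 2]:  # out-of-order arrival plus a duplicate
    c.handle({"entity_id": "ord-123", "sequence": seq})
# processed order is [1, 2, 3]: 3 was buffered until 2 arrived, second 2 dropped
```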

Idempotency

In distributed systems, messages can be delivered more than once. Every consumer must be idempotent — processing the same message twice should produce the same result as processing it once.

class IdempotentPaymentProcessor:
    def process_payment(self, event):
        idempotency_key = event["event_id"]

        # Check if we already processed this event
        if self.processed_events.exists(idempotency_key):
            logger.info(f"Skipping duplicate event {idempotency_key}")
            return

        # Process the payment
        result = self.payment_gateway.charge(
            event["data"]["customer_id"],
            event["data"]["amount"]
        )

        # Record that we processed this event (atomically with the work)
        self.processed_events.insert(idempotency_key, {
            "processed_at": datetime.utcnow(),
            "result": result
        })

Making Operations Idempotent

  • Database inserts: Use INSERT ... ON CONFLICT DO NOTHING with a unique event ID
  • Counter increments: Use SET counter = X instead of INCREMENT counter (set absolute value, not relative)
  • External API calls: Pass an idempotency key to the external service (Stripe, etc.)
  • State machines: Validate that the transition is valid from the current state before applying
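The first bullet can be demonstrated with SQLite's equivalent, `INSERT OR IGNORE` (Postgres would use `INSERT ... ON CONFLICT DO NOTHING`); the charge runs only when the event ID is new:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")

charges = []  # stand-in for the real payment gateway

def process_payment(event):
    # Atomically claim the event ID; rowcount 0 means we have seen it before
    cur = db.execute("INSERT OR IGNORE INTO processed_events VALUES (?)",
                     (event["event_id"],))
    if cur.rowcount == 0:
        return  # duplicate delivery, skip
    charges.append(event["amount"])

evt = {"event_id": "evt-9", "amount": 49.98}
process_payment(evt)
process_payment(evt)  # redelivered by the broker
# Only one charge despite two deliveries
```

Note that the claim and the side effect here are not in one transaction; in production you would wrap both in the same database transaction, as the earlier snippet's comment says.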

Delivery Semantics

At-Most-Once

The message is delivered zero or one times. If the consumer crashes, the message is lost. Fastest but least reliable.

At-Least-Once

The message is delivered one or more times. The broker retries until the consumer acknowledges. Most common in practice. Requires idempotent consumers.

Exactly-Once

The message is delivered exactly one time. This is extremely hard to achieve in distributed systems. Kafka supports it within its own ecosystem using transactions and idempotent producers, but true exactly-once across system boundaries is effectively impossible.

In practice: Design for at-least-once delivery with idempotent consumers. Do not chase exactly-once semantics — the operational complexity is rarely worth it.

Dead Letter Queues (DLQ)

When a consumer cannot process a message after multiple retries, it goes to a dead letter queue for manual inspection and reprocessing.

class ResilientConsumer:
    MAX_RETRIES = 3

    def consume(self, message):
        retry_count = message.headers.get("retry_count", 0)

        try:
            self.process(message)
            self.ack(message)
        except TransientError:
            if retry_count < self.MAX_RETRIES:
                # Retry with exponential backoff
                message.headers["retry_count"] = retry_count + 1
                delay = 2 ** retry_count  # 1s, 2s, 4s
                self.requeue(message, delay_seconds=delay)
            else:
                # Max retries exceeded — send to DLQ
                self.send_to_dlq(message, reason="max_retries_exceeded")
                self.ack(message)
        except PermanentError as e:
            # No point retrying — send to DLQ immediately
            self.send_to_dlq(message, reason=str(e))
            self.ack(message)

DLQ best practices:

  • Monitor DLQ depth — a growing DLQ means something is broken
  • Include the original message, the error reason, and a timestamp
  • Build tooling to inspect and replay DLQ messages
  • Set alerts when the DLQ receives messages — it should normally be empty

The Saga Pattern

Distributed transactions across multiple services cannot use traditional ACID transactions. The saga pattern breaks a transaction into a sequence of local transactions, each with a compensating action for rollback.

Choreography-Based Saga

Each service listens for events and decides what to do next. No central coordinator.

OrderPlaced ──► PaymentService charges card
                    │
                    ├── PaymentSucceeded ──► InventoryService reserves stock
                    │                              │
                    │                              ├── StockReserved ──► ShippingService ships
                    │                              │
                    │                              └── OutOfStock ──► PaymentService refunds
                    │
                    └── PaymentFailed ──► OrderService cancels order
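The happy path of that flow can be traced end to end with an in-memory bus (all service logic is stubbed; `EventBus` is a made-up helper, not a real broker):

```python
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subs = defaultdict(list)
    def subscribe(self, event_type, handler):
        self.subs[event_type].append(handler)
    def publish(self, event_type, data):
        for handler in self.subs[event_type]:
            handler(data)

bus = EventBus()
log = []

# Each service only knows which events it reacts to and which it emits;
# no component holds the whole flow
bus.subscribe("OrderPlaced",
              lambda o: (log.append("charged"),
                         bus.publish("PaymentSucceeded", o)))
bus.subscribe("PaymentSucceeded",
              lambda o: (log.append("stock_reserved"),
                         bus.publish("StockReserved", o)))
bus.subscribe("StockReserved", lambda o: log.append("shipped"))

bus.publish("OrderPlaced", {"orderId": "ord-123"})
# log reads ["charged", "stock_reserved", "shipped"]: the saga completed
# without any coordinator
```

The failure branches would be wired the same way: a `PaymentFailed` subscriber cancels the order, an `OutOfStock` subscriber triggers a refund.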

Orchestration-Based Saga

A central orchestrator coordinates the steps and handles compensations.

class OrderSaga:
    def execute(self, order):
        try:
            # Step 1: Reserve inventory
            reservation = self.inventory.reserve(order.items)

            # Step 2: Process payment
            payment = self.payments.charge(order.customer_id, order.total)

            # Step 3: Confirm order
            self.orders.confirm(order.id)

        except PaymentError:
            # Compensate: release inventory
            self.inventory.release(reservation.id)
            self.orders.fail(order.id, "Payment failed")

        except InventoryError:
            self.orders.fail(order.id, "Insufficient stock")

Choreography is simpler for 2-3 services but becomes hard to reason about with more. Orchestration adds a coordinator but makes the flow explicit and easier to debug.

Event Schema Evolution

Events are contracts between services. When you change an event schema, you risk breaking every consumer.

Rules for Safe Evolution

  1. Add fields, never remove or rename them. New fields with defaults are backward compatible.
  2. Use schema registries (Confluent Schema Registry for Kafka) to validate compatibility.
  3. Version your events if breaking changes are unavoidable.
# Version 1
{"type": "OrderPlaced", "version": 1, "data": {"orderId": "123", "total": 49.98}}

# Version 2 — added "currency" field, old consumers ignore it
{"type": "OrderPlaced", "version": 2, "data": {"orderId": "123", "total": 49.98, "currency": "USD"}}

# Consumer handles both versions
def handle_order_placed(event):
    currency = event["data"].get("currency", "USD")  # Default for v1
    total = event["data"]["total"]
  4. Use a consumer-driven contract testing approach — consumers define what fields they need, and the producer's CI pipeline verifies compatibility.
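The version-tolerant handler sketched above is runnable as-is; a quick check of both payload shapes:

```python
def handle_order_placed(event):
    # Default covers v1 events that predate the currency field
    currency = event["data"].get("currency", "USD")
    return (event["data"]["total"], currency)

v1 = {"type": "OrderPlaced", "version": 1,
      "data": {"orderId": "123", "total": 49.98}}
v2 = {"type": "OrderPlaced", "version": 2,
      "data": {"orderId": "123", "total": 49.98, "currency": "EUR"}}
# v1 falls back to the default; v2 carries its own currency
```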

When NOT to Use Event-Driven Architecture

EDA is powerful but not universal. Avoid it when:

  • You need synchronous responses. If the client must wait for a result (checkout confirmation, login), a direct API call is simpler and more reliable.
  • Your team is small. EDA adds operational complexity — message brokers, event schemas, idempotency, monitoring. For a team of 3, the overhead may not be worth it.
  • The domain is simple. A monolithic CRUD app with 5 entities does not need an event bus.
  • You cannot tolerate eventual consistency. EDA is inherently eventually consistent. If you need strong consistency, use synchronous transactions.
  • You do not have observability. Debugging event-driven systems without distributed tracing, centralized logging, and event lineage tooling is a nightmare.

Signs You Should Consider EDA

  • Multiple teams or services need to react to the same business events
  • You need to scale producers and consumers independently
  • You want to decouple deployment cycles between services
  • Your system has clear domain events (OrderPlaced, UserRegistered, PaymentProcessed)
  • You need audit trails or event replay capability

Key Takeaways

  • Events, commands, and queries serve different purposes — use events for loose coupling, commands for directed actions
  • Pub/sub is the foundation — producers do not know about consumers, enabling independent evolution
  • Event sourcing stores history, not just current state — powerful for audit and replay but adds complexity
  • CQRS separates reads from writes, optimizing each independently
  • Choose your broker based on your needs: Kafka for streaming, RabbitMQ for routing, SQS for simplicity
  • Design for at-least-once delivery with idempotent consumers — do not chase exactly-once
  • Dead letter queues are essential — always have a plan for messages that cannot be processed
  • Sagas replace distributed transactions — use choreography for simple flows, orchestration for complex ones
  • Evolve schemas carefully — add fields, never remove them, and use a schema registry
  • EDA is not always the answer — evaluate the trade-offs honestly before committing

Event-driven architecture is a tool, not a religion. Use it where it genuinely solves problems — decoupling, scaling, extensibility — and reach for simpler approaches everywhere else.
