Rishi · 13 min read

Event-Driven Architecture: Patterns, Pitfalls, and Practical Guidance

Event-driven architecture (EDA) is how modern distributed systems communicate at scale. Netflix processes billions of events daily. LinkedIn's event pipeline handles trillions of messages per day through Kafka. But EDA is not a silver bullet — it introduces complexity that can bite you if you do not understand the patterns and pitfalls.

This post covers the core concepts, the major patterns, the broker trade-offs, and the hard-won lessons from real production systems.

What Does "Event-Driven" Mean?

An event-driven system is one where state changes are communicated as events rather than through direct calls between services. Instead of Service A calling Service B's API, Service A publishes an event ("OrderPlaced") and any interested service reacts to it.

This fundamental shift changes how services relate to each other:

  • Synchronous (request-driven): Service A knows about Service B and calls it directly. Tight coupling.
  • Asynchronous (event-driven): Service A publishes events. It does not know or care who consumes them. Loose coupling.
Request-Driven:
  OrderService --calls--> PaymentService --calls--> InventoryService
  (if PaymentService is down, the whole chain breaks)

Event-Driven:
  OrderService --publishes "OrderPlaced"--> Event Broker
  PaymentService --subscribes--> processes payment
  InventoryService --subscribes--> reserves stock
  NotificationService --subscribes--> sends confirmation email
  (services fail independently)

Events vs Commands vs Queries

These three message types serve fundamentally different purposes. Confusing them leads to architectural problems.

Events

An event is a notification that something happened. It is past tense, immutable, and the publisher does not care who consumes it.

{
  "type": "OrderPlaced",
  "timestamp": "2026-04-05T14:30:00Z",
  "data": {
    "orderId": "ord-123",
    "customerId": "cust-456",
    "items": [{"sku": "WIDGET-A", "qty": 2}],
    "total": 49.98
  }
}

Commands

A command is a request to do something. It is imperative, directed at a specific service, and the sender expects it to be handled.

{
  "type": "ProcessPayment",
  "target": "payment-service",
  "data": {
    "orderId": "ord-123",
    "amount": 49.98,
    "method": "credit_card"
  }
}

Queries

A query is a request for information. It does not change state.

{
  "type": "GetOrderStatus",
  "data": { "orderId": "ord-123" }
}

Key distinction: Events have zero or many consumers. Commands have exactly one handler. Design your system around events whenever possible — they give you the most flexibility to add new consumers without changing existing services.

The Pub/Sub Pattern

Publish-subscribe is the foundation of event-driven architecture.

# Producer: publishes events without knowing consumers
class OrderService:
    def place_order(self, order_data):
        order = self.repository.create(order_data)
        # Publish event — no idea who's listening
        self.event_bus.publish("orders", {
            "type": "OrderPlaced",
            "data": order.to_dict(),
            "timestamp": datetime.utcnow().isoformat()
        })
        return order

# Consumer 1: Payment processing
class PaymentConsumer:
    @subscribe("orders", filter="OrderPlaced")
    def handle_order_placed(self, event):
        order = event["data"]
        self.payment_gateway.charge(
            order["customerId"],
            order["total"]
        )

# Consumer 2: Inventory reservation
class InventoryConsumer:
    @subscribe("orders", filter="OrderPlaced")
    def handle_order_placed(self, event):
        for item in event["data"]["items"]:
            self.inventory.reserve(item["sku"], item["qty"])

# Consumer 3: Added 6 months later — zero changes to OrderService
class AnalyticsConsumer:
    @subscribe("orders", filter="OrderPlaced")
    def handle_order_placed(self, event):
        self.metrics.increment("orders.placed")
        self.metrics.record("orders.revenue", event["data"]["total"])

The beauty of pub/sub is that Consumer 3 was added months later without touching a single line in OrderService or the other consumers.
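The `event_bus` and `@subscribe` decorator above are pseudocode. A minimal in-memory bus (a sketch only, not a real broker — names like `EventBus` are made up for illustration) makes the mechanics concrete:

```python
from collections import defaultdict

class EventBus:
    """Toy in-memory pub/sub: topic -> list of (event_type, handler) pairs."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, event_type, handler):
        self.handlers[topic].append((event_type, handler))

    def publish(self, topic, event):
        # Deliver to every handler whose type filter matches; the
        # publisher never learns who (if anyone) is listening
        for event_type, handler in self.handlers[topic]:
            if event["type"] == event_type:
                handler(event)

bus = EventBus()
received = []
bus.subscribe("orders", "OrderPlaced", lambda e: received.append(("payment", e)))
bus.subscribe("orders", "OrderPlaced", lambda e: received.append(("inventory", e)))
bus.publish("orders", {"type": "OrderPlaced", "data": {"orderId": "ord-123"}})
# Both subscribers see the event; adding a third requires no producer change
```

A real broker adds durability, delivery guarantees, and network transport on top of exactly this dispatch shape.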

Event Sourcing

Instead of storing the current state of an entity, event sourcing stores every event that led to the current state. The current state is derived by replaying events.

Traditional CRUD vs Event Sourcing

CRUD approach (stores current state):
┌──────────────────────────────────┐
│ accounts table                   │
│ id: acc-123  balance: $750.00    │
└──────────────────────────────────┘
  (How did we get to $750? No idea.)

Event Sourcing (stores all events):
┌──────────────────────────────────┐
│ AccountCreated    balance: $0    │
│ MoneyDeposited    amount: $1000  │
│ MoneyWithdrawn    amount: $200   │
│ MoneyWithdrawn    amount: $50    │
│ Current balance:  $750           │
└──────────────────────────────────┘
  (Full audit trail. Can replay to any point in time.)

Implementation

class BankAccount:
    def __init__(self, account_id: str):
        self.id = account_id
        self.balance = 0
        self.events = []

    def deposit(self, amount: float):
        if amount <= 0:
            raise ValueError("Deposit must be positive")
        event = {
            "type": "MoneyDeposited",
            "account_id": self.id,
            "amount": amount,
            "timestamp": datetime.utcnow().isoformat()
        }
        self._apply(event)
        self.events.append(event)

    def withdraw(self, amount: float):
        if amount <= 0:
            raise ValueError("Withdrawal must be positive")
        if amount > self.balance:
            raise ValueError("Insufficient funds")
        event = {
            "type": "MoneyWithdrawn",
            "account_id": self.id,
            "amount": amount,
            "timestamp": datetime.utcnow().isoformat()
        }
        self._apply(event)
        self.events.append(event)

    def _apply(self, event):
        if event["type"] == "MoneyDeposited":
            self.balance += event["amount"]
        elif event["type"] == "MoneyWithdrawn":
            self.balance -= event["amount"]

    @classmethod
    def from_events(cls, account_id: str, events: list):
        """Rebuild state by replaying events."""
        account = cls(account_id)
        for event in events:
            account._apply(event)
        return account
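Replaying the ledger from the diagram above can be sketched as a plain fold (a standalone function rather than the class, so it runs on its own):

```python
def replay_balance(events) -> float:
    """Fold an event list into the current balance."""
    balance = 0.0
    for e in events:
        if e["type"] == "MoneyDeposited":
            balance += e["amount"]
        elif e["type"] == "MoneyWithdrawn":
            balance -= e["amount"]
    return balance

history = [
    {"type": "AccountCreated"},
    {"type": "MoneyDeposited", "amount": 1000},
    {"type": "MoneyWithdrawn", "amount": 200},
    {"type": "MoneyWithdrawn", "amount": 50},
]
# Full history replays to $750, matching the ledger above;
# truncating the list replays to any earlier point in time
```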

When Event Sourcing Makes Sense

  • Audit requirements: Finance, healthcare, compliance — anywhere you need a complete history
  • Temporal queries: "What was the account balance on March 15th?"
  • Event replay: Rebuild read models, fix bugs by replaying with corrected logic
  • Complex domains: Where the history of changes matters as much as current state

When It Does Not

  • Simple CRUD applications with no audit needs
  • When storage costs of storing every event outweigh the benefits
  • When the team lacks experience with eventual consistency

CQRS (Command Query Responsibility Segregation)

CQRS separates the write model (commands) from the read model (queries). Writes go to an event store, and one or more read-optimized projections are built from those events.

                ┌─────────────────┐
  Commands ──►  │  Write Model    │ ──► Event Store
                │  (validates,    │         │
                │   enforces      │         │ events flow
                │   business      │         │ to projections
                │   rules)        │         ▼
                └─────────────────┘  ┌───────────────┐
                                     │ Read Model 1  │ ◄── Queries (API)
                                     │ (optimized    │
                                     │  for listing) │
                                     └───────────────┘
                                     ┌───────────────┐
                                     │ Read Model 2  │ ◄── Queries (search)
                                     │ (optimized    │
                                     │  for search)  │
                                     └───────────────┘

Why CQRS? Because reads and writes have fundamentally different performance characteristics and access patterns. Your write model can use a normalized schema for consistency, while your read models can be denormalized for speed.

# Write side: validates business rules, appends events
class OrderCommandHandler:
    def handle_place_order(self, cmd):
        order = Order.create(cmd.customer_id, cmd.items)
        order.validate_inventory()
        order.calculate_total()
        self.event_store.append(order.uncommitted_events)

# Read side: optimized projection for order listing
class OrderListProjection:
    def on_order_placed(self, event):
        self.db.execute("""
            INSERT INTO order_list_view
            (order_id, customer_name, total, status, placed_at)
            VALUES (%s, %s, %s, 'placed', %s)
        """, event.order_id, event.customer_name,
             event.total, event.timestamp)

    def on_order_shipped(self, event):
        self.db.execute("""
            UPDATE order_list_view
            SET status = 'shipped', shipped_at = %s
            WHERE order_id = %s
        """, event.timestamp, event.order_id)

Message Brokers: Kafka vs RabbitMQ vs SQS

Apache Kafka

Kafka is a distributed commit log designed for high-throughput, ordered, durable event streaming.

  • Throughput: Millions of messages per second
  • Ordering: Guaranteed within a partition
  • Retention: Messages persist for a configurable time (days, weeks, forever)
  • Consumer groups: Multiple consumer groups read independently from the same topic
  • Replay: Consumers can re-read messages from any offset
# Kafka producer
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",           # Wait for all replicas to confirm
    retries=3,            # Retry on transient failures
    enable_idempotence=True  # Prevent duplicate writes
)

# Partition by order_id for ordering guarantee
producer.send(
    "orders",
    key=b"ord-123",       # Same key = same partition = ordered
    value={"type": "OrderPlaced", "orderId": "ord-123"}
)

RabbitMQ

RabbitMQ is a traditional message broker that implements AMQP. It excels at complex routing and task distribution.

  • Routing: Flexible exchange types (direct, topic, fanout, headers)
  • Acknowledgment: Consumer confirms processing, broker redelivers on failure
  • Priority queues: Messages can have priority levels
  • Lower latency: Better for low-latency, request-reply patterns

Choose Kafka for high-throughput event streaming where you need replay and ordering. Choose RabbitMQ for task queues, complex routing, and request-reply patterns.

Amazon SQS

SQS is a fully managed message queue — zero infrastructure to manage.

  • Standard queues: At-least-once delivery, best-effort ordering
  • FIFO queues: Exactly-once processing, strict ordering (but lower throughput)
  • Dead letter queues: Built-in DLQ support
  • Scaling: Virtually unlimited throughput for standard queues

Choose SQS when you want simplicity and are already on AWS. The operational overhead of managing Kafka or RabbitMQ clusters is significant.

Ordering Guarantees

Ordering is one of the most misunderstood aspects of event-driven systems.

The Problem

In a distributed system, events can arrive out of order due to network delays, retries, and parallel processing. Consider:

Event 1: OrderPlaced    (produced at T=0)
Event 2: OrderCancelled (produced at T=1)

If Event 2 arrives before Event 1, the consumer might:
1. Process cancellation (no-op, order doesn't exist yet)
2. Process order placement (order is now active, but should be cancelled!)

Solutions

Partition by entity ID — Kafka guarantees ordering within a partition. Use the entity ID as the partition key.

# All events for order "ord-123" go to the same partition
producer.send("orders", key=b"ord-123", value=event)

Sequence numbers — Include a monotonically increasing sequence number. Consumers reject or buffer out-of-order events.

class SequenceCheckingConsumer:
    def handle_event(self, event):
        expected_seq = self.get_last_sequence(event["entity_id"]) + 1
        if event["sequence"] < expected_seq:
            return  # Already processed (duplicate)
        if event["sequence"] > expected_seq:
            self.buffer_event(event)  # Out of order — buffer it
            return
        self.process(event)
        self.process_buffered(event["entity_id"])  # Check buffer
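A self-contained version of that buffering logic (in-memory state, with a stub `_process` that just records sequence numbers) shows the drain step working:

```python
class SequencedConsumer:
    """Processes events in sequence order, buffering gaps and dropping dupes."""
    def __init__(self):
        self.last_seq = {}   # entity_id -> last processed sequence
        self.buffer = {}     # entity_id -> {sequence: event}
        self.processed = []  # stand-in for real side effects

    def handle(self, event):
        entity, seq = event["entity_id"], event["sequence"]
        expected = self.last_seq.get(entity, 0) + 1
        if seq < expected:
            return                                           # duplicate, drop
        if seq > expected:
            self.buffer.setdefault(entity, {})[seq] = event  # gap, hold it
            return
        self._process(event)
        # Drain any buffered successors that are now in order
        pending = self.buffer.get(entity, {})
        while self.last_seq[entity] + 1 in pending:
            self._process(pending.pop(self.last_seq[entity] + 1))

    def _process(self, event):
        self.processed.append(event["sequence"])
        self.last_seq[event["entity_id"]] = event["sequence"]

c = SequencedConsumer()
for seq in [1, 3, 2, 2]:  # out-of-order arrival plus a duplicate
    c.handle({"entity_id": "ord-123", "sequence": seq})
# processed order is [1, 2, 3]: 3 was buffered until 2 arrived, second 2 dropped
```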

Idempotency

In distributed systems, messages can be delivered more than once. Every consumer must be idempotent — processing the same message twice should produce the same result as processing it once.

class IdempotentPaymentProcessor:
    def process_payment(self, event):
        idempotency_key = event["event_id"]

        # Check if we already processed this event
        if self.processed_events.exists(idempotency_key):
            logger.info(f"Skipping duplicate event {idempotency_key}")
            return

        # Process the payment
        result = self.payment_gateway.charge(
            event["data"]["customer_id"],
            event["data"]["amount"]
        )

        # Record that we processed this event (atomically with the work)
        self.processed_events.insert(idempotency_key, {
            "processed_at": datetime.utcnow(),
            "result": result
        })

Making Operations Idempotent

  • Database inserts: Use INSERT ... ON CONFLICT DO NOTHING with a unique event ID
  • Counter increments: Use SET counter = X instead of INCREMENT counter (set absolute value, not relative)
  • External API calls: Pass an idempotency key to the external service (Stripe, etc.)
  • State machines: Validate that the transition is valid from the current state before applying
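The first bullet can be demonstrated with SQLite's equivalent, `INSERT OR IGNORE` (Postgres would use `INSERT ... ON CONFLICT DO NOTHING`); the charge runs only when the event ID is new:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")

charges = []  # stand-in for the real payment gateway

def process_payment(event):
    # Atomically claim the event ID; rowcount 0 means we have seen it before
    cur = db.execute("INSERT OR IGNORE INTO processed_events VALUES (?)",
                     (event["event_id"],))
    if cur.rowcount == 0:
        return  # duplicate delivery, skip
    charges.append(event["amount"])

evt = {"event_id": "evt-9", "amount": 49.98}
process_payment(evt)
process_payment(evt)  # redelivered by the broker
# Only one charge despite two deliveries
```

Note that the claim and the side effect here are not in one transaction; in production you would wrap both in the same database transaction, as the earlier snippet's comment says.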

Delivery Semantics

At-Most-Once

The message is delivered zero or one times. If the consumer crashes, the message is lost. Fastest but least reliable.

At-Least-Once

The message is delivered one or more times. The broker retries until the consumer acknowledges. Most common in practice. Requires idempotent consumers.

Exactly-Once

The message is delivered exactly one time. This is extremely hard to achieve in distributed systems. Kafka supports it within its own ecosystem using transactions and idempotent producers, but true exactly-once across system boundaries is effectively impossible.

In practice: Design for at-least-once delivery with idempotent consumers. Do not chase exactly-once semantics — the operational complexity is rarely worth it.

Dead Letter Queues (DLQ)

When a consumer cannot process a message after multiple retries, it goes to a dead letter queue for manual inspection and reprocessing.

class ResilientConsumer:
    MAX_RETRIES = 3

    def consume(self, message):
        retry_count = message.headers.get("retry_count", 0)

        try:
            self.process(message)
            self.ack(message)
        except TransientError:
            if retry_count < self.MAX_RETRIES:
                # Retry with exponential backoff
                message.headers["retry_count"] = retry_count + 1
                delay = 2 ** retry_count  # 1s, 2s, 4s
                self.requeue(message, delay_seconds=delay)
            else:
                # Max retries exceeded — send to DLQ
                self.send_to_dlq(message, reason="max_retries_exceeded")
                self.ack(message)
        except PermanentError as e:
            # No point retrying — send to DLQ immediately
            self.send_to_dlq(message, reason=str(e))
            self.ack(message)

DLQ best practices:

  • Monitor DLQ depth — a growing DLQ means something is broken
  • Include the original message, the error reason, and a timestamp
  • Build tooling to inspect and replay DLQ messages
  • Set alerts when the DLQ receives messages — it should normally be empty

The Saga Pattern

Distributed transactions across multiple services cannot use traditional ACID transactions. The saga pattern breaks a transaction into a sequence of local transactions, each with a compensating action for rollback.

Choreography-Based Saga

Each service listens for events and decides what to do next. No central coordinator.

OrderPlaced ──► PaymentService charges card
                    │
                    ├── PaymentSucceeded ──► InventoryService reserves stock
                    │                              │
                    │                              ├── StockReserved ──► ShippingService ships
                    │                              │
                    │                              └── OutOfStock ──► PaymentService refunds
                    │
                    └── PaymentFailed ──► OrderService cancels order
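The happy path of that flow can be traced end to end with an in-memory bus (all service logic is stubbed; `EventBus` is a made-up helper, not a real broker):

```python
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subs = defaultdict(list)
    def subscribe(self, event_type, handler):
        self.subs[event_type].append(handler)
    def publish(self, event_type, data):
        for handler in self.subs[event_type]:
            handler(data)

bus = EventBus()
log = []

# Each service only knows which events it reacts to and which it emits;
# no component holds the whole flow
bus.subscribe("OrderPlaced",
              lambda o: (log.append("charged"),
                         bus.publish("PaymentSucceeded", o)))
bus.subscribe("PaymentSucceeded",
              lambda o: (log.append("stock_reserved"),
                         bus.publish("StockReserved", o)))
bus.subscribe("StockReserved", lambda o: log.append("shipped"))

bus.publish("OrderPlaced", {"orderId": "ord-123"})
# log reads ["charged", "stock_reserved", "shipped"]: the saga completed
# without any coordinator
```

The failure branches would be wired the same way: a `PaymentFailed` subscriber cancels the order, an `OutOfStock` subscriber triggers a refund.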

Orchestration-Based Saga

A central orchestrator coordinates the steps and handles compensations.

class OrderSaga:
    def execute(self, order):
        try:
            # Step 1: Reserve inventory
            reservation = self.inventory.reserve(order.items)

            # Step 2: Process payment
            payment = self.payments.charge(order.customer_id, order.total)

            # Step 3: Confirm order
            self.orders.confirm(order.id)

        except PaymentError:
            # Compensate: release inventory
            self.inventory.release(reservation.id)
            self.orders.fail(order.id, "Payment failed")

        except InventoryError:
            self.orders.fail(order.id, "Insufficient stock")

Choreography is simpler for 2-3 services but becomes hard to reason about with more. Orchestration adds a coordinator but makes the flow explicit and easier to debug.

Event Schema Evolution

Events are contracts between services. When you change an event schema, you risk breaking every consumer.

Rules for Safe Evolution

  1. Add fields, never remove or rename them. New fields with defaults are backward compatible.
  2. Use schema registries (Confluent Schema Registry for Kafka) to validate compatibility.
  3. Version your events if breaking changes are unavoidable.
# Version 1
{"type": "OrderPlaced", "version": 1, "data": {"orderId": "123", "total": 49.98}}

# Version 2 — added "currency" field, old consumers ignore it
{"type": "OrderPlaced", "version": 2, "data": {"orderId": "123", "total": 49.98, "currency": "USD"}}

# Consumer handles both versions
def handle_order_placed(event):
    currency = event["data"].get("currency", "USD")  # Default for v1
    total = event["data"]["total"]
  4. Use a consumer-driven contract testing approach — consumers define what fields they need, and the producer's CI pipeline verifies compatibility.
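The version-tolerant handler sketched above is runnable as-is; a quick check of both payload shapes:

```python
def handle_order_placed(event):
    # Default covers v1 events that predate the currency field
    currency = event["data"].get("currency", "USD")
    return (event["data"]["total"], currency)

v1 = {"type": "OrderPlaced", "version": 1,
      "data": {"orderId": "123", "total": 49.98}}
v2 = {"type": "OrderPlaced", "version": 2,
      "data": {"orderId": "123", "total": 49.98, "currency": "EUR"}}
# v1 falls back to the default; v2 carries its own currency
```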

When NOT to Use Event-Driven Architecture

EDA is powerful but not universal. Avoid it when:

  • You need synchronous responses. If the client must wait for a result (checkout confirmation, login), a direct API call is simpler and more reliable.
  • Your team is small. EDA adds operational complexity — message brokers, event schemas, idempotency, monitoring. For a team of 3, the overhead may not be worth it.
  • The domain is simple. A monolithic CRUD app with 5 entities does not need an event bus.
  • You cannot tolerate eventual consistency. EDA is inherently eventually consistent. If you need strong consistency, use synchronous transactions.
  • You do not have observability. Debugging event-driven systems without distributed tracing, centralized logging, and event lineage tooling is a nightmare.

Signs You Should Consider EDA

  • Multiple teams or services need to react to the same business events
  • You need to scale producers and consumers independently
  • You want to decouple deployment cycles between services
  • Your system has clear domain events (OrderPlaced, UserRegistered, PaymentProcessed)
  • You need audit trails or event replay capability

Key Takeaways

  • Events, commands, and queries serve different purposes — use events for loose coupling, commands for directed actions
  • Pub/sub is the foundation — producers do not know about consumers, enabling independent evolution
  • Event sourcing stores history, not just current state — powerful for audit and replay but adds complexity
  • CQRS separates reads from writes, optimizing each independently
  • Choose your broker based on your needs: Kafka for streaming, RabbitMQ for routing, SQS for simplicity
  • Design for at-least-once delivery with idempotent consumers — do not chase exactly-once
  • Dead letter queues are essential — always have a plan for messages that cannot be processed
  • Sagas replace distributed transactions — use choreography for simple flows, orchestration for complex ones
  • Evolve schemas carefully — add fields, never remove them, and use a schema registry
  • EDA is not always the answer — evaluate the trade-offs honestly before committing

Event-driven architecture is a tool, not a religion. Use it where it genuinely solves problems — decoupling, scaling, extensibility — and reach for simpler approaches everywhere else.
