Rishi · 4 min read

Designing Webhook Delivery That Survives Flaky Consumers

Webhooks look like the easy half of an API: you just POST JSON at a URL the customer gave you. Then the customer's endpoint times out during their deploy, returns 200 while dropping the payload, or gets called twice and double-fulfills an order. Every mature webhook system — Stripe, GitHub, Shopify — converges on the same handful of design decisions, because the naive version fails the same ways every time.

Deliver from a queue, never from the request path

The event producer should never wait on a customer's server. When the triggering action happens, write the event and enqueue a delivery job; a separate dispatcher pool makes the HTTP calls. A consumer that takes 30 seconds to respond should cost you one worker slot, not a database transaction or a user-facing request.

action → event store → delivery queue → dispatcher → consumer endpoint
                          ↑ retries re-enqueue here

Give the call a tight timeout (5 to 10 seconds is typical) and treat a timeout exactly like a 5xx. On the consumer side, the matching advice is to acknowledge fast and process asynchronously. An endpoint that does real work inline will eventually exceed the timeout and get retried into double-processing.
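As a rough illustration of that split, here is a minimal sketch using only the Python standard library. The names `DeliveryJob`, `record_event`, `http_post`, and `dispatch_once` are hypothetical, the in-process `queue.Queue` stands in for whatever durable queue you actually run, and the 10-second timeout is just one value from the range above.

```python
import queue
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional

TIMEOUT_SECONDS = 10  # assumed: upper end of the typical 5-10 s range


@dataclass
class DeliveryJob:
    event_id: str
    url: str
    body: bytes
    attempt: int = 1


# Stand-in for a durable queue (hypothetical; use a real broker in production).
delivery_queue: "queue.Queue[DeliveryJob]" = queue.Queue()


def record_event(event_id: str, url: str, body: bytes) -> None:
    """Runs in the request path: enqueue only, never call the consumer."""
    # A real system would also persist the event to its event store here.
    delivery_queue.put(DeliveryJob(event_id, url, body))


def http_post(job: DeliveryJob) -> Optional[int]:
    """POST the body; return the status code, or None on timeout/network error."""
    req = urllib.request.Request(
        job.url,
        data=job.body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # non-2xx responses arrive as HTTPError
    except OSError:
        return None  # covers URLError, timeouts, and connection failures


def dispatch_once(
    send: Callable[[DeliveryJob], Optional[int]] = http_post,
) -> "tuple[DeliveryJob, Optional[int]]":
    """Runs in a dispatcher worker: take one job and report the outcome."""
    job = delivery_queue.get()
    # A slow consumer blocks only this worker; the producer finished long ago.
    return job, send(job)
```

The `send` parameter exists so the dispatcher can be exercised without a network. Deciding whether and when to re-enqueue a failed job is left to the retry policy.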

Retry with backoff, then dead-letter

Failures are normal, so the schedule matters more than the intent. Retry on 5xx, 429, and timeouts with exponential backoff and jitter, spread over hours to ride out deploys and incidents. Do not retry 4xx responses other than 429 — a 404 or 401 will not fix itself, and hammering it just fills your logs.

Attempt   Delay       Cumulative
1         immediate   0
2         1 min       1 min
3         15 min      ~16 min
4         2 h         ~2 h
5         12 h        ~14 h
6         24 h        ~38 h
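One way to encode that policy is sketched below. The base delays come from the table; the function names `should_retry` and `next_delay`, the use of `None` for a timeout, and the ±20% jitter are assumptions made for illustration.

```python
import random
from typing import Optional

# Base delays in seconds before attempts 2..6, mirroring the table above.
BASE_DELAYS = [60, 15 * 60, 2 * 3600, 12 * 3600, 24 * 3600]
MAX_ATTEMPTS = len(BASE_DELAYS) + 1
JITTER = 0.2  # assumed: spread each delay by +/-20%


def should_retry(status: Optional[int]) -> bool:
    """status is None when the request timed out or the connection failed."""
    if status is None:
        return True
    # Retry 429 and 5xx; other 4xx responses will not fix themselves.
    return status == 429 or 500 <= status <= 599


def next_delay(attempts_made: int) -> Optional[float]:
    """Seconds to wait before the next attempt, or None to dead-letter."""
    if attempts_made < 1:
        raise ValueError("attempts_made counts completed attempts, starting at 1")
    if attempts_made >= MAX_ATTEMPTS:
        return None
    base = BASE_DELAYS[attempts_made - 1]
    # Jitter keeps many failed deliveries from retrying in the same instant.
    return base * random.uniform(1 - JITTER, 1 + JITTER)
```

After the first failed attempt, `next_delay(1)` returns roughly a minute; after the sixth, it returns `None`, which is the signal to dead-letter the delivery.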

After the last attempt, dead-letter — do not delete. Keep exhausted deliveries queryable, expose them in a dashboard, and offer manual or API-driven redelivery. Automatically disabling an endpoint that has failed for days (with an email to the owner) protects your dispatchers and is a feature customers thank you for, provided reactivation is self-service.

Sign every delivery and make replays detectable

A webhook endpoint is a public URL that accepts POSTs, so authenticity is your problem. Sign the payload with a per-endpoint secret and put the signature and a timestamp in headers. Consumers verify with a constant-time comparison and reject stale timestamps to block replay.

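import hashlib, hmac, time

def verify(payload: bytes, timestamp: str, signature: str, secret: str) -> bool:
    # Reject malformed or stale timestamps (5-minute window) to limit replay.
    try:
        age = abs(time.time() - int(timestamp))
    except ValueError:
        return False
    if age > 300:
        return False
    # Sign "<timestamp>.<raw body>" so the timestamp is covered by the signature.
    signed = f"{timestamp}.".encode() + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes, so non-ASCII input cannot raise.
    return hmac.compare_digest(expected.encode(), signature.encode())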

Sign the raw bytes you send, and tell consumers to verify the raw bytes they receive — JSON re-serialization is the classic verification bug. Support two active secrets per endpoint so customers can rotate without downtime.
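The sender side and the two-secret rotation could look roughly like this. It is a sketch under assumptions: the header names `X-Webhook-Timestamp` and `X-Webhook-Signature` are hypothetical, as are the helper names `sign`, `build_headers`, and `verify_any`.

```python
import hashlib
import hmac
import time
from typing import Iterable


def sign(raw_body: bytes, secret: str, timestamp: str) -> str:
    """HMAC-SHA256 over "<timestamp>.<raw body>", hex encoded."""
    signed = f"{timestamp}.".encode() + raw_body
    return hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()


def build_headers(raw_body: bytes, secret: str) -> "dict[str, str]":
    """Sender side: sign the exact bytes that go on the wire."""
    timestamp = str(int(time.time()))
    return {
        "X-Webhook-Timestamp": timestamp,  # hypothetical header name
        "X-Webhook-Signature": sign(raw_body, secret, timestamp),  # hypothetical
    }


def verify_any(
    raw_body: bytes,
    timestamp: str,
    signature: str,
    secrets: Iterable[str],
    tolerance: int = 300,
) -> bool:
    """Consumer side: accept a signature made with any currently active secret."""
    try:
        age = abs(time.time() - int(timestamp))
    except ValueError:
        return False
    if age > tolerance:
        return False
    matched = False
    for secret in secrets:
        expected = sign(raw_body, secret, timestamp)
        # Check every secret rather than returning early on the first match.
        if hmac.compare_digest(expected.encode(), signature.encode()):
            matched = True
    return matched
```

During a rotation the consumer passes both the old and the new secret to `verify_any`, then drops the old one once the sender has switched over.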

Duplicates and ordering are the consumer's contract

At-least-once delivery means duplicates, so say it out loud. Give every event a unique ID, put it in the payload and headers, and document that consumers must deduplicate. A consumer that records processed event IDs handles both your retries and their own replays.
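On the consumer side, deduplication can be as small as a table with a unique key on the event ID. The sketch below assumes SQLite and a hypothetical `handle_once` wrapper; the same pattern applies to any database with unique constraints and transactions.

```python
import sqlite3
from typing import Callable

# In-memory database for illustration; a real consumer would use its main store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")


def handle_once(event_id: str, handler: Callable[[], None]) -> bool:
    """Run handler at most once per event ID; return False for duplicates."""
    with db:  # commits on success, rolls back if the handler raises
        try:
            db.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,),
            )
        except sqlite3.IntegrityError:
            return False  # this event ID was already processed
        handler()
    return True
```

Because the ID insert sits in the same transaction as the handler, a handler that raises rolls the ID back, so a later retry of the same event is processed rather than silently dropped. That guarantee only covers work the handler does in the same database; external side effects need their own idempotency.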

Do not promise ordering. Retries alone destroy it: event 42 fails and retries in an hour, event 43 delivers now. Instead, put a sequence number or updated timestamp on the resource so consumers can discard stale events, and design payloads so the latest event wins.
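A consumer can apply the "latest wins" rule with a few lines of bookkeeping. This sketch assumes each event carries a per-resource `sequence` number; the in-memory dict and the name `apply_if_newer` are illustrative only.

```python
from typing import Callable

# Highest sequence number applied so far, per resource (illustrative storage).
latest_seq: "dict[str, int]" = {}


def apply_if_newer(resource_id: str, sequence: int, apply: Callable[[], None]) -> bool:
    """Apply the event only if it is newer than what we already have."""
    if sequence <= latest_seq.get(resource_id, -1):
        return False  # stale or duplicate: a newer event already landed
    apply()
    latest_seq[resource_id] = sequence
    return True
```

With this in place, event 43 arriving first is applied, and the delayed retry of event 42 is discarded an hour later instead of overwriting newer state.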

There is a related choice in payload design: thin events (order.updated plus an ID, with the rest fetched via API) versus fat events (the full resource embedded). Thin payloads sidestep stale data, because the consumer always reads current state, and they shrink the security surface; fat payloads save the read call. Stripe has traditionally shipped fat events, and many systems ship thin. Either works if you are consistent, but thin payloads make the ordering problem mostly disappear.

Observability turns webhook support tickets into dashboards

Every delivery attempt should be a queryable record: event ID, endpoint, attempt number, status code, latency, response snippet. When a customer says "we never got the event," the answer should be a lookup, not an investigation. Per-endpoint success-rate metrics also tell you which consumers are struggling before they notice themselves.
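The record itself can be very plain. Below is a sketch with the fields listed above; the class name `DeliveryAttempt`, the helper names, and the list-based log are assumptions, and a real system would back this with a database table.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeliveryAttempt:
    event_id: str
    endpoint: str
    attempt: int
    status_code: Optional[int]  # None when the request timed out
    latency_ms: float
    response_snippet: str


def attempts_for_event(
    log: "list[DeliveryAttempt]", event_id: str
) -> "list[DeliveryAttempt]":
    """The lookup behind "we never got the event": every attempt, in order."""
    return sorted((a for a in log if a.event_id == event_id), key=lambda a: a.attempt)


def success_rate_by_endpoint(log: "list[DeliveryAttempt]") -> "dict[str, float]":
    """Fraction of attempts per endpoint that returned a 2xx."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for a in log:
        totals[a.endpoint] += 1
        if a.status_code is not None and 200 <= a.status_code < 300:
            successes[a.endpoint] += 1
    return {endpoint: successes[endpoint] / totals[endpoint] for endpoint in totals}
```

An endpoint whose rate drops sharply is a candidate for an alert or a proactive note to its owner.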

Webhook delivery is a small distributed system you run on behalf of people who did not design for it. Queue the sends, retry with patience, sign everything, hand consumers the tools to deduplicate — and the flaky endpoint on the other end stops being your outage.
