Designing Webhook Delivery That Survives Flaky Consumers
Webhooks look like the easy half of an API: you just POST JSON at a URL the customer gave you. Then the customer's endpoint times out during their deploy, returns 200 while dropping the payload, or gets called twice and double-fulfills an order. Every mature webhook system — Stripe, GitHub, Shopify — converges on the same handful of design decisions, because the naive version fails the same ways every time.
Deliver from a queue, never from the request path
The event producer should never wait on a customer's server. When the triggering action happens, write the event and enqueue a delivery job; a separate dispatcher pool makes the HTTP calls. A consumer that takes 30 seconds to respond should cost you one worker slot, not a database transaction or a user-facing request.
action → event store → delivery queue → dispatcher → consumer endpoint
                         ↑ retries re-enqueue at the delivery queue
Give the call a tight timeout (5 to 10 seconds is typical) and treat a timeout exactly like a 5xx. The matching advice on the consumer side is to acknowledge fast and process asynchronously. An endpoint that does real work inline will eventually exceed the timeout and get retried into double-processing.
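The split between producer and dispatcher can be sketched as follows. This is a minimal illustration, not a production design: `DeliveryJob`, `record_event`, `classify`, `http_send`, and `dispatch_one` are hypothetical names, and the in-process `queue.Queue` stands in for whatever durable queue you actually run.

```python
import queue
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DeliveryJob:
    event_id: str
    url: str
    body: bytes
    attempt: int = 1


# Assumption: an in-memory queue for illustration; real systems need a durable one.
delivery_queue: "queue.Queue[DeliveryJob]" = queue.Queue()


def record_event(event_id: str, url: str, body: bytes) -> None:
    """Producer side: persist the event (omitted here), enqueue, and return at once."""
    delivery_queue.put(DeliveryJob(event_id, url, body))


def classify(status: Optional[int]) -> str:
    """Map a result to an outcome; None stands for a timeout or connection error."""
    if status is None or status == 429 or status >= 500:
        return "retry"
    if 200 <= status < 300:
        return "delivered"
    return "failed"


def http_send(job: DeliveryJob, timeout: float) -> Optional[int]:
    """POST the raw body and return the status code, or None if no response arrived."""
    request = urllib.request.Request(
        job.url, data=job.body, method="POST",
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:  # non-2xx responses still carry a status
        return exc.code
    except OSError:  # covers URLError, socket timeouts, and connection resets
        return None


def dispatch_one(send: Callable[[DeliveryJob, float], Optional[int]],
                 timeout: float = 10.0) -> str:
    """Dispatcher side: take one job, attempt delivery, and report the outcome."""
    job = delivery_queue.get()
    try:
        status = send(job, timeout)
    except TimeoutError:
        status = None
    return classify(status)
```

Passing `send` in as a parameter keeps the dispatcher testable without a network; in this sketch a worker would call `dispatch_one(http_send)` in a loop.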
Retry with backoff, then dead-letter
Failures are normal, so the schedule matters more than the intent. Retry on 5xx, 429, and timeouts with exponential backoff and jitter, spread over a day or more to ride out deploys and incidents. Do not retry 4xx responses other than 429: a 404 or 401 will not fix itself, and hammering it just fills your logs.
| Attempt | Delay | Cumulative |
|---|---|---|
| 1 | immediate | 0 |
| 2 | 1 min | 1 min |
| 3 | 15 min | ~16 min |
| 4 | 2 h | ~2 h |
| 5 | 12 h | ~14 h |
| 6 | 24 h | ~38 h |
After the last attempt, dead-letter — do not delete. Keep exhausted deliveries queryable, expose them in a dashboard, and offer manual or API-driven redelivery. Automatically disabling an endpoint that has failed for days (with an email to the owner) protects your dispatchers and is a feature customers thank you for, provided reactivation is self-service.
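One way to encode that policy is a fixed schedule plus a retry predicate. The sketch below uses the delays from the table; the names `RETRY_DELAYS`, `should_retry`, and `next_delay` and the 20% jitter factor are assumptions made for illustration.

```python
import random
from typing import Optional

# Base delays in seconds before attempts 2 through 6, taken from the table above.
RETRY_DELAYS = [60, 15 * 60, 2 * 3600, 12 * 3600, 24 * 3600]


def should_retry(status: Optional[int]) -> bool:
    """Retry on timeouts (None), 429, and 5xx; other 4xx responses are final."""
    return status is None or status == 429 or status >= 500


def next_delay(failed_attempt: int, jitter: float = 0.2, rng=random) -> Optional[float]:
    """Seconds to wait after `failed_attempt` (1-based), or None to dead-letter."""
    if failed_attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if failed_attempt > len(RETRY_DELAYS):
        return None  # schedule exhausted: move the delivery to the dead-letter store
    base = RETRY_DELAYS[failed_attempt - 1]
    # Assumed jitter of +/-20% so retries to one endpoint do not arrive in lockstep.
    return base * (1 + rng.uniform(-jitter, jitter))
```

A dispatcher would call `should_retry` on each result and, when it returns true, re-enqueue the job with the delay from `next_delay`, dead-lettering when that returns `None`.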
Sign every delivery and make replays detectable
A webhook endpoint is a public URL that accepts POSTs, so authenticity is your problem. Sign the payload with a per-endpoint secret and put the signature and a timestamp in headers. Consumers verify with a constant-time comparison and reject stale timestamps to block replay.
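import hashlib, hmac, time

def verify(payload: bytes, timestamp: str, signature: str, secret: str) -> bool:
    # Reject malformed or stale timestamps (5-minute window) to block replays.
    try:
        age = abs(time.time() - int(timestamp))
    except ValueError:
        return False
    if age > 300:
        return False
    signed = f"{timestamp}.".encode() + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison; bytes avoid a TypeError on non-ASCII input.
    return hmac.compare_digest(expected.encode(), signature.encode())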
Sign the raw bytes you send, and tell consumers to verify the raw bytes they receive — JSON re-serialization is the classic verification bug. Support two active secrets per endpoint so customers can rotate without downtime.
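For completeness, here is a sketch of the sender side and of verification against two active secrets. It reuses the `timestamp.payload` scheme from `verify` above; the header names `X-Webhook-Timestamp` and `X-Webhook-Signature` and the functions `sign`, `build_headers`, and `verify_any` are hypothetical, not any particular provider's API.

```python
import hashlib
import hmac
import time
from typing import Dict, Iterable, Optional


def sign(payload: bytes, secret: str, timestamp: str) -> str:
    """HMAC-SHA256 over "<timestamp>.<raw body>", hex-encoded."""
    signed = f"{timestamp}.".encode() + payload
    return hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()


def build_headers(payload: bytes, secret: str, now: Optional[float] = None) -> Dict[str, str]:
    """Sender side: headers to attach to the exact bytes being sent."""
    timestamp = str(int(time.time() if now is None else now))
    return {
        "X-Webhook-Timestamp": timestamp,  # assumed header name
        "X-Webhook-Signature": sign(payload, secret, timestamp),  # assumed header name
    }


def verify_any(payload: bytes, timestamp: str, signature: str,
               secrets: Iterable[str], tolerance: int = 300,
               now: Optional[float] = None) -> bool:
    """Consumer side: accept a fresh signature made with any active secret."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    current = time.time() if now is None else now
    if abs(current - sent_at) > tolerance:
        return False
    received = signature.encode()
    matched = False
    for secret in secrets:  # check every secret so timing does not reveal which matched
        expected = sign(payload, secret, timestamp).encode()
        if hmac.compare_digest(expected, received):
            matched = True
    return matched
```

During a rotation the consumer passes both the old and the new secret to `verify_any`, then drops the old one once the sender has switched.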
Duplicates and ordering are the consumer's contract
At-least-once delivery means duplicates, so say it out loud. Give every event a unique ID, put it in the payload and headers, and document that consumers must deduplicate. A consumer that records processed event IDs handles both your retries and their own replays.
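On the consumer side, deduplication can be as small as a table of processed event IDs with a uniqueness constraint. The following is a sketch using SQLite; the table name `processed_events` and the functions `open_store` and `handle_once` are assumptions for illustration.

```python
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store of event IDs that have already been handled."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    return conn


def handle_once(conn: sqlite3.Connection, event_id: str,
                handler: Callable[[], None]) -> bool:
    """Run `handler` at most once per event ID; return True if it ran."""
    # One transaction: if the handler raises, the ID is rolled back with it,
    # so a later retry of the same event is processed rather than skipped.
    with conn:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cursor.rowcount == 0:
            return False  # duplicate delivery: already processed
        handler()
    return True
```

Recording the ID and doing the work in one transaction only helps when the handler's side effects live in the same database; work that touches other systems needs its own idempotency key.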
Do not promise ordering. Retries alone destroy it: event 42 fails and retries in an hour, event 43 delivers now. Instead, put a sequence number or updated timestamp on the resource so consumers can discard stale events, and design payloads so the latest event wins.
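A consumer can enforce "latest wins" by comparing sequence numbers before applying an event. This sketch assumes each event carries a per-resource `sequence` value; `apply_if_newer` and the two dictionaries are hypothetical stand-ins for the consumer's own storage.

```python
from typing import Dict


def apply_if_newer(resource_id: str, sequence: int, state: dict,
                   latest_seq: Dict[str, int], store: Dict[str, dict]) -> bool:
    """Apply `state` only if `sequence` is newer than what was already applied."""
    if sequence <= latest_seq.get(resource_id, -1):
        return False  # stale or duplicate event: discard
    latest_seq[resource_id] = sequence
    store[resource_id] = state
    return True
```

With this check, event 43 arriving before a retried event 42 leaves the resource at the state from 43, and the late 42 is dropped.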
There is a related choice in payload design: thin events (order.updated, an ID, fetch the rest via API) versus fat events (full resource embedded). Thin payloads force the consumer to read current state, so a late or reordered event cannot overwrite newer data, and they keep resource fields out of the request body, which shrinks the security surface; fat payloads save the read call. Stripe ships fat, many systems ship thin. Either works if you are consistent, but thin payloads make the ordering problem mostly disappear.
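The two shapes might look like this. The field names and the `handle_thin` helper are purely illustrative, not taken from any real provider's schema.

```python
from typing import Callable

# Fat event: the full resource snapshot travels with the notification.
fat_event = {
    "id": "evt_1001",
    "type": "order.updated",
    "data": {"id": "ord_42", "status": "shipped", "total_cents": 1999},
}

# Thin event: only the type and a reference; the consumer fetches the rest.
thin_event = {"id": "evt_1001", "type": "order.updated", "resource_id": "ord_42"}


def handle_thin(event: dict, fetch_order: Callable[[str], dict]) -> dict:
    """Read the current resource via the API instead of trusting event contents."""
    return fetch_order(event["resource_id"])
```

Because `handle_thin` always reads current state, it returns the same result whether events arrive in order or not.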
Observability turns webhook support tickets into dashboards
Every delivery attempt should be a queryable record: event ID, endpoint, attempt number, status code, latency, response snippet. When a customer says "we never got the event," the answer should be a lookup, not an investigation. Per-endpoint success-rate metrics also tell you which consumers are struggling before they notice themselves.
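A delivery log with those fields, and the two queries support teams reach for most, could be sketched like this. `DeliveryAttempt`, `attempts_for_event`, and `success_rate_by_endpoint` are hypothetical names, and an in-memory list stands in for a real table.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class DeliveryAttempt:
    event_id: str
    endpoint: str
    attempt: int
    status_code: Optional[int]  # None for timeouts and connection errors
    latency_ms: float
    response_snippet: str = ""


def attempts_for_event(log: Sequence[DeliveryAttempt], event_id: str) -> List[DeliveryAttempt]:
    """Answer "we never got the event": every attempt for one event, in order."""
    matches = [a for a in log if a.event_id == event_id]
    return sorted(matches, key=lambda a: (a.endpoint, a.attempt))


def success_rate_by_endpoint(log: Sequence[DeliveryAttempt]) -> Dict[str, float]:
    """Fraction of attempts per endpoint that returned a 2xx status."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for a in log:
        totals[a.endpoint] += 1
        if a.status_code is not None and 200 <= a.status_code < 300:
            successes[a.endpoint] += 1
    return {endpoint: successes[endpoint] / totals[endpoint] for endpoint in totals}
```

In practice these would be indexed database queries rather than list scans, but the record shape is the part worth settling early.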
Webhook delivery is a small distributed system you run on behalf of people who did not design for it. Queue the sends, retry with patience, sign everything, hand consumers the tools to deduplicate — and the flaky endpoint on the other end stops being your outage.