How do you handle lost webhooks in production?
4 points | 2 hours ago | 3 comments | HN
I've worked at several companies where we'd discover hours later that critical webhooks from Stripe/Shopify never arrived (deployment, timeout, bug, etc.).

Every team ended up building the same solution: retry logic, dead letter queue, monitoring.
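
For readers who have not built one, the retry-plus-dead-letter-queue pattern can be sketched roughly as below. This is a minimal in-memory illustration, not anyone's production code: `deliver_with_retry`, `handler`, and `dead_letters` are hypothetical names, and a real system would likely use a durable queue rather than a `deque`.

```python
import time
from collections import deque
from typing import Callable


def deliver_with_retry(
    event: dict,
    handler: Callable[[dict], None],
    dead_letters: deque,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to process an event; park it in the dead letter queue if every attempt fails."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception as exc:  # assumption: any exception means "retry later"
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff between attempts: base, 2*base, 4*base, ...
                sleep(base_delay * 2 ** attempt)
    # All attempts failed: keep the event and the last error for later inspection or replay.
    dead_letters.append({"event": event, "error": repr(last_error)})
    return False
```

Monitoring then mostly comes down to alerting on the size and age of whatever ends up in `dead_letters`.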

Curious how others handle this:
- Do you rely on the provider's retry policy?
- Built your own reliability layer?
- Use a service?
- Just manually reconcile when it happens?

(Context: Building https://relaehook.com to solve this, but genuinely curious what the norm is)

super256
52 minutes ago
Ofc I rely on the retry policy. Stripe retries with exponential backoff for three days. If Stripe can't reach our endpoint in 3 days we probably went bankrupt or a solar flare ate IT.
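
To make "exponential backoff for three days" concrete, here is a rough sketch of how such a schedule could be computed. The base delay, growth factor, and cap are assumptions picked for illustration, not Stripe's actual values, and `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(
    base_s: float = 60.0,              # assumed first retry delay
    factor: float = 2.0,               # assumed growth factor
    cap_s: float = 12 * 3600.0,        # assumed maximum gap between retries
    window_s: float = 3 * 24 * 3600.0  # give up after three days
) -> list[float]:
    """Return the retry delays (in seconds) that fit inside the retry window."""
    delays = []
    elapsed = 0.0
    delay = base_s
    while elapsed + delay <= window_s:
        delays.append(delay)
        elapsed += delay
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * factor, cap_s)
    return delays
```

Under these assumed numbers the first retries come minutes apart and the later ones are half a day apart, so an endpoint that is down for a few hours still gets the event well within the window.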
everydaydev
36 minutes ago
Stripe does retries right, no argument there.

Where things get messy is when you have a mix of providers with wildly different retry behaviors, or internal services that have their own rate limits or downtime windows. A relay layer keeps the intake consistent even when the rest of the system isn’t.
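
As a rough illustration of what "keeping the intake consistent" could mean, the sketch below stores every incoming webhook before acknowledging it and deduplicates on a provider plus event ID key, so provider-side retries are harmless. It is an assumption-laden example using SQLite from the Python standard library, not a description of any particular product; `open_inbox`, `accept`, and `pending` are hypothetical names.

```python
import sqlite3


def open_inbox(path: str = ":memory:") -> sqlite3.Connection:
    """Create (if needed) a durable inbox table keyed by provider and event ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        " provider TEXT NOT NULL,"
        " event_id TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " processed INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (provider, event_id))"
    )
    return conn


def accept(conn: sqlite3.Connection, provider: str, event_id: str, payload: str) -> bool:
    """Persist the webhook before acknowledging it. Returns False for a duplicate delivery."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox (provider, event_id, payload) VALUES (?, ?, ?)",
            (provider, event_id, payload),
        )
    return cur.rowcount == 1


def pending(conn: sqlite3.Connection) -> list[tuple[str, str, str]]:
    """List stored events that downstream services have not processed yet."""
    rows = conn.execute(
        "SELECT provider, event_id, payload FROM inbox"
        " WHERE processed = 0 ORDER BY rowid"
    )
    return rows.fetchall()
```

The HTTP handler would call `accept` and return a 2xx right away; a separate worker drains `pending` at whatever rate the internal services can tolerate.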

samarthr1
1 hour ago
Wait, so your product moves the point of failure from my infra to your infra?

Plus it trusts y'all with the contents of said webhook?

everydaydev
51 minutes ago
Fair question. We’re not eliminating failure so much as isolating it behind a system that’s purpose-built for durability. Our infra is built with redundant queues, retry pipelines, and observability you typically wouldn’t stand up for a single product team.

And on the data side, we don’t use webhook payloads for anything other than delivery. They’re encrypted at rest and in transit, and automatically purged based on retention settings.

nickphx
4 minutes ago
Yeaaaaaaaaaaaaah.. I am not sure adding an additional third party and point of potential failure would help mitigate the issue of receiving data from third parties... but good luck.