Every team ended up building the same solution: retry logic, dead letter queue, monitoring.
Curious how others handle this:
- Do you rely on the provider's retry policy?
- Built your own reliability layer?
- Use a service?
- Just manually reconcile when it happens?
(Context: Building https://relaehook.com to solve this, but genuinely curious what the norm is)
Where things get messy is when you have a mix of providers with wildly different retry behaviors, or internal services that have their own rate limits or downtime windows. A relay layer keeps the intake consistent even when the rest of the system isn’t.
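For anyone who wants something concrete, here is a rough sketch of the retry plus dead letter queue pattern mentioned in this thread. It is not how any particular product works. `Relay`, `deliver`, and every default value are made-up names and numbers for illustration, and the in-memory list stands in for a real durable queue.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Relay:
    # Hypothetical downstream call: returns True when the event was delivered.
    deliver: Callable[[dict], bool]
    max_attempts: int = 5        # assumed limit, tune per destination
    base_delay: float = 0.5      # seconds, assumed starting delay
    max_delay: float = 30.0      # seconds, cap on any single wait
    sleep: Callable[[float], None] = time.sleep
    # In-memory stand-in for a dead letter queue.
    dead_letters: List[dict] = field(default_factory=list)

    def backoff(self, attempt: int) -> float:
        # Exponential backoff with full jitter, capped at max_delay.
        cap = min(self.max_delay, self.base_delay * 2 ** attempt)
        return random.uniform(0, cap)

    def handle(self, event: dict) -> bool:
        for attempt in range(self.max_attempts):
            try:
                if self.deliver(event):
                    return True
            except Exception:
                pass  # treat an exception the same as a failed delivery
            # Wait before the next attempt, but not after the last one.
            if attempt < self.max_attempts - 1:
                self.sleep(self.backoff(attempt))
        # Out of attempts: park the event for inspection or manual replay.
        self.dead_letters.append(event)
        return False
```

Usage would look like `Relay(deliver=send_to_internal_service).handle(event)`, where `send_to_internal_service` is whatever function posts to your own endpoint. Monitoring then mostly comes down to alerting on the size of `dead_letters`.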
Plus, who trusts y'all with the contents of said webhook?
And on the data side, we don’t use webhook payloads for anything other than delivery. They’re encrypted at rest and in transit, and automatically purged based on retention settings.