Ask HN: Is anyone losing sleep over retry storms or partial API outages?
2 points | 11 hours ago | 2 comments
I’m working on infrastructure to solve retry storms and partial outages. Before I go further, I want to understand what people are actually doing today, compare notes on solutions, and maybe help someone spot an approach they hadn’t considered. The problems:

Retry storms - an API fails, your entire fleet retries independently, and the thundering herd makes it worse.

Partial outages - the API is “up” but degraded (slow, intermittent 500s). Health checks pass, but requests suffer.

What I’m curious about:

∙ What’s your current solution? (circuit breakers, queues, custom coordination, service mesh, something else?)
∙ How well does it work? What are the gaps?
∙ What scale are you at? (company size, # of instances, requests/sec)

I’d love to hear what’s working, what isn’t, and what you wish existed.

toast0
8 hours ago
Retry storms are "easy": exponential backoff with jitter, like what Ethernet on shared media has been doing since the 80s.
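Something like this is usually enough on the client side (Python sketch; the base/cap numbers are made up, tune them for your API):

    import random
    import time

    def call_with_backoff(request, max_attempts=6, base=0.5, cap=30.0):
        # Capped exponential backoff with full jitter: each client sleeps a
        # random amount up to the exponential bound, so the fleet doesn't
        # wake up in lockstep after an outage.
        for attempt in range(max_attempts):
            try:
                return request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))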

If that's not enough to come back from an outage, you need to put in load shedding and/or back pressure. There's no sense accepting all the requests and then not servicing any in time.

You want to be able to accept and do work on requests that are likely to succeed within reasonable latency bounds, and drop the rest. But be careful: an instant error can feed back into the retry storm, so sometimes it's better if such errors come after a delay and the client is stuck waiting (back pressure).
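A crude way to sketch that admission decision (the thresholds and the service-time estimate are illustrative; a real system would measure them):

    import threading

    class LoadShedder:
        # Admit a request only while we can plausibly finish it within the
        # latency budget; otherwise shed it (and consider delaying the error
        # so callers don't hammer back immediately).
        def __init__(self, max_inflight=200, budget_s=1.0, avg_service_s=0.02):
            self._lock = threading.Lock()
            self._inflight = 0
            self._max_inflight = max_inflight
            self._budget_s = budget_s
            self._avg_service_s = avg_service_s

        def try_acquire(self):
            with self._lock:
                est_wait = self._inflight * self._avg_service_s
                if self._inflight >= self._max_inflight or est_wait > self._budget_s:
                    return False  # shed; caller returns an error, ideally after a delay
                self._inflight += 1
                return True

        def release(self):
            with self._lock:
                self._inflight -= 1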

rjpruitt16
10 minutes ago
Agree backoff+jitter is table stakes, and load shedding/backpressure is necessary under sustained overload. The tricky cases I’m digging into are shared rate limits (429s) and many concurrent clients/agents where local backoff isn’t coordinated and you still get herds after partial outages. Curious what patterns you’ve seen work well for coordinating retries/fairness across tenants or API keys?
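For context, the uncoordinated baseline I keep seeing looks roughly like this (Python sketch; send() is a stand-in for whatever HTTP client is in use). It honors Retry-After with jitter, but without shared state it still can't give fairness across tenants or API keys:

    import random
    import time

    def call_rate_limited(send, max_attempts=5):
        # Honor the server's Retry-After on 429s and add jitter on top so
        # clients don't all come back at the same instant. Purely local:
        # no coordination, no per-key fairness.
        resp = None
        for attempt in range(max_attempts):
            resp = send()
            if resp.status_code != 429:
                return resp
            retry_after = float(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(retry_after + random.uniform(0, retry_after))
        return resp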
HelloNurse
6 hours ago
A worrying choice of words.

"Losing sleep" implies an actual problem, which in turn implies that the mentioned mitigations and similar ones have not been applied (at least not properly) for dire reasons that are likely to be a more important problem than bad QoS.

"Infrastructure" implies an expectation that you deploy something external to the troubled application: there is a defective, presumably simplistic application architecture, and fixing it is not an option. This puts you in an awkward position: someone else is incompetent or unreasonable, but the responsibility for keeping their dumpster fire running falls on you.
