https://how.complexsystems.fail/
You can literally check off the things from Cook's piece that apply directly here. Also: when I wrote this comment, most of the thread was about root-causing the DNS thing that happened, which I don't think is the big story behind this outage. (Cook rejects the whole idea of a "root cause", and I'm pretty sure he's dead on right about why.)
That piece by Cook is ok, but largely just a list of assertions (true or not, most do feel intuitive, though). I suppose one should delve into all those references at the end for details? Anyway, this is an ancient topic, and I doubt we have all the answers on those root whys. The MIT course on systems, 6.033, used to assign a paper that has come up on HN only a few times in its history: https://news.ycombinator.com/item?id=10082625 and https://news.ycombinator.com/item?id=16392223 It's from 1962, over 60 years ago, but it is also probably more illuminating/thought-provoking than the post-mortem. Personally, I suspect it's probably an instance of a https://en.wikipedia.org/wiki/Wicked_problem , but only past a certain scale.
You might have to bring personal trauma to this piece to get the full effect.
In engineered systems, there is just a disconnect between KISS at the personal/small scale and what happens in large organizations, and then what happens over time. This is the real root cause/why, but I'm not sure it's fixable. Maybe partly addressable, tho'.
One thing that might give you a moment of worry: both in that Simon paper and, far more broadly, all over academia both long before and ever since, biological systems like our bodies are an archetypal example of "complex". Besides medical failures, life mostly has this one main trick -- make many copies, and if they don't all fail before they, too, can copy, then a stable-ish pattern emerges.
Stable populations + "litter size/replication factor" largely imply average failure rates. For most species it is horrific. On the David Attenborough specials they'll play the sad music and tell you X% of these offspring never make it to mating age. The alternative is not the https://en.wikipedia.org/wiki/Gray_goo apocalypse, but the "whatever-that-species-is-biopocalypse". Sorry - it's late and my joke circuits are maybe fritzing. So, both big 'L' and little 'l' life, too, "is on the edge", just structurally.
https://en.wikipedia.org/wiki/Self-organized_criticality (with sand piles and whatnot) used to be a kind of statistical physics hope for a theory of everything of these kinds of phenomena, but it just doesn't get deployed. Things will seem "shallowly critical" but not so upon deeper inspection. So, maybe it's just not a useful enough approximation.
Anyway, good luck with your housing meetup!
The problem is, oncall is a full-time business. It takes the full attention of the oncall engineer, whether there is an issue or not. Both companies simply treat oncall as a by-product: we just have to do it, so let's stuff it into the sprint. The first company was slightly more serious, as we were asked to put up a 2-3 point oncall task in JIRA. The second one doesn't even do this.
Neither company really encourages engineers to read through complex code written by others, even if we do oncall for those products. Again, the first company did better, and we were supposed to create a channel and pull people in, so it’s OKish to not know anything about the code. The second company simply leaves oncall to do whatever they can. Neither company allocates enough time for engineers to read the source code thoroughly. And neither has good documentation for oncall.
I don’t know the culture of AWS. I’d very much want to work in an oncall environment that is serious and encourages learning.
We sent a test page periodically to make sure the pager actually beeped. We got paid extra for being in the rotation. The leadership knew this was a critical step. Unfortunately, much of our tooling was terrible, which would cause false pages, or failed critical operations, all too frequently.
I later worked on SWE teams that didn't take dev oncall very seriously. At my current job, we have an oncall, but it's best effort business hours only.
Is that really uncommon? I've been on call for many companies and many types of institutions, and I can't recall ever being told I couldn't do something to bring a system up. It's kinda the job?
On call seriousness should be directly proportional to pay. Google pays. If smallcorp wants to pay me COL, I'll be looking at that 2AM ticket at 9AM when I get to work.
12-12 rotation in SRE is a lot more reasonable for humans
It was a good lesson in what a manicured lower environment can do for you.
Pointing out that "complex systems" have "layers of defense" is neither insightful nor useful; it's obvious. Saying that any and all failures in a given complex system lack a root cause is wrong.
Cook uses a lot of words to say not much at all. There's no concrete advice to be taken from How Complex Systems Fail, nothing to change. There's no casualty procedure or post-mortem investigation which would change a single letter of a single word in response to it. It's hot air.
Then I realized: the internet, the power grid (at least in most developed countries) -- there are things that don't actually fail catastrophically, even though they are extremely complex and not always built by efficient organizations. What's the retort to this argument?
I think you could argue AWS is more complex than the electrical grid, but even if it's not, the grid has had several decades to iron out kinks and AWS hasn't. AWS also adds a ton of completely new services each year in addition to adding more capacity. E.g. I bet these DNS Enactors have become more numerous and their plans became much larger than when they were first developed, which has greatly increased the odds of experiencing this issue.
This has gotten significantly better in recent years, but it used to be possible and common for a single misbehaving AS to cause global issues.
Texas nearly ran into this during their blackout a few years ago -- their grid got within a few minutes of complete failure that would have required a black start which IIRC has never been done.
Grady has a good explanation and the writeup is interesting reading too.
https://www.kentik.com/blog/a-brief-history-of-the-internets...
> power grid
https://www.entsoe.eu/publications/blackout/28-april-2025-ib...
>>> Views of ‘cause’ limit the effectiveness of defenses against future events.
>>> Post-accident remedies for “human error” are usually predicated on obstructing activities that can “cause” accidents. These end-of-the-chain measures do little to reduce the likelihood of further accidents. In fact that likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly. Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult.
For example, take airline safety -- are we to believe based on the quoted assertion that every airline accident and resulting remedy that mitigated the causes have made air travel LESS safe? That sounds objectively, demonstrably false.
Truly complex systems like ecosystems and climate might qualify for this assertion where humans have interfered, often with the best intentions, but caused unexpected effects that may be beyond human capacity to control.
But I can think of lots of examples where the response to an unfortunate, but very rare, incident can make us less safe overall. The response to rare vaccine side effects comes immediately to mind.
One could make a similar argument in sports that no one person ever scores a point because they are only put into scoring position by a complex series of actions which preceded the actual point. I think that's technically true but practically useless. It's good to have a wide perspective of an issue but I see nothing wrong with identifying the crux of a failure like this one.
But: given finite resources, should you respond to this incident by auditing your DNS management systems (or all your systems) for race conditions? Or should you instead figure out how to make the Droplet Manager survive (in some degraded state) a partition from DynamoDB without entering congestive collapse? Is the right response an identification of the "most faulty components" and a project plan to improve them? Or is it closing the human expertise/process gap that prevented them from throttling DWFM for 4.5 hours?
Cook isn't telling you how to solve problems; he's asking you to change how you think about problems, so you don't rathole in obvious local extrema instead of being guided by the bigger picture.
And since this network is privileged, observability tools, debugging support, and even maybe access to it are more complicated. Even just the set of engineers who have access is likely more limited, especially at 2AM.
Should AWS relax these controls to make recovery easier? But then it will also result in a less secure system. It's again a trade-off.
Even you can't help it - "enumerating a list of questions" is a very engineering thing to do.
Normal people don't talk or think like that. The way Cook is asking us to "think about problems" is kind of the opposite of what good leadership looks like. Thinking about thinking about problems is like, 200% wrong. On the contrary, be way more emotional and way simpler.
– It identifies problems (complexity, latent failures, hindsight bias, etc.) more than it offers solutions. Readers must seek outside methods to act on these insights.
– It feels abstract, describing general truths applicable to many domains, but requiring translation into domain-specific practices (be it software, aviation, medicine, etc.).
– It leaves out discussion on managing complexity – e.g. principles of simplification, modular design, or quantitative risk assessment – which would help prevent some of the failures it warns about.
– It assumes well-intentioned actors and does not grapple with scenarios where business or political pressures undermine safety – an increasingly pertinent issue in modern industries.
– It does not explicitly warn against misusing its principles (e.g. becoming fatalistic or overconfident in defenses). The nuance that «failures are inevitable but we still must diligently work to minimize them» must come from the reader’s interpretation.
«How Complex Systems Fail» is highly valuable for its conceptual clarity and timeless truths about complex system behavior. Its direction is one of realism – accepting that no complex system is ever 100% safe – and of placing trust in human skill and systemic defenses over simplistic fixes. The rational critique is that this direction, whilst insightful, needs to be paired with concrete strategies and a proactive mindset to be practically useful. The treatise by itself won’t tell you how to design the next aircraft or run a data center more safely, but it will shape your thinking so you avoid common pitfalls (such as chasing singular root causes or blaming operators). To truly «preclude» failures or mitigate them, one must extend Cook’s ideas with detailed engineering and organizational practices. In other words, Cook teaches us why things fail in complex ways; it is up to us – engineers, managers, regulators, and front-line practitioners – to apply those lessons in how we build and operate the systems under our care.
To be fair, at the time of writing (late 1990's), Cook’s treatise was breaking ground by succinctly articulating these concepts for a broad audience. Its objective was likely to provoke thought and shift paradigms, rather than serve as a handbook.
Today, we have the benefit of two more decades of research and practice in resilience engineering, which builds on Cook’s points. Practitioners now emphasise building resilient systems, not just trying to prevent failure outright. They use Cook’s insights as rationale for things such as chaos engineering, better incident response, and continuous learning cultures.
But the stale read didn't scare me nearly as much as this quote:
> Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues
Everyone can make a distributed system mistake (these things are hard). But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have a recovery procedure. Maybe I am reading too much into it; maybe what they meant was that they didn't have a recovery procedure for "this exact" set of circumstances, but it is a little worrying even if that were the case. EC2 is one of the original services in AWS. At this point I expect it to be so battle-hardened that very few edge cases would not have been identified. It seems that the EC2 failure was more impactful in a way, as it cascaded to more and more services (like the NLB and Lambda) and took more time to fully recover. I'd be interested to know what gets put in place there to make it even more resilient.
I wouldn't want to, like, make a company out of it (I assume the foundational model companies will eat all these businesses) but you could probably do some really interesting stuff with an agent that consumes telemetry and failure model information and uses it to surface hypos about what to look at or what interventions to consider.
All of this is besides my original point, though: I'm saying, you can't runbook your way to having a system as complex as AWS run safely. Safety in a system like that is a much more complicated process, unavoidably. Like: I don't think an LLM can solve the "fractal runbook requirement" problem!
I bet the original engineers planned for, and designed the system to be resilient to, this cold start situation. But over time those engineers left, and new people took over -- people who didn't fully understand and appreciate the complexity, and probably didn't care that much about all the edge cases. Then, pushed by management to pursue goals that are antithetical to reliability, such as cost optimization, the new failure case was introduced through lots of suboptimal changes. The result is as we see it -- a catastrophic failure which caught everyone by surprise.
It's the kind of thing that happens over and over again when the accountants are in charge.
I guess they don't have a recovery procedure for the "congestive collapse" edge case. I have seen something similar, so I wouldn't be frowning at this.
A couple of red flags though:
1. Apparent lack of load-shedding support by this DWFM, such that a server reboot had to be performed. Need to learn from https://aws.amazon.com/builders-library/using-load-shedding-... (a rough sketch of the idea follows after this list)
2. Having DynamoDB as a dependency of this DWFM service, instead of something more primitive like Chubby. Need to learn more about distributed systems primitives from https://www.youtube.com/watch?v=QVvFVwyElLY
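On point 1, here is a minimal load-shedding sketch. It is illustrative only, under my own assumption that DWFM's problem was an unbounded backlog of lease work; the names and the threshold are made up, not anything from the report.

    # Minimal load-shedding sketch (illustrative, not DWFM's actual design):
    # once the backlog exceeds what can be worked through before callers
    # time out, reject new work cheaply instead of queueing it and
    # entering congestive collapse.
    import queue

    MAX_QUEUE_DEPTH = 1000   # hypothetical capacity; derive from measured throughput

    work_queue: "queue.Queue[str]" = queue.Queue()

    class Overloaded(Exception):
        """Tells the caller to back off and retry later / elsewhere."""

    def submit(job: str) -> None:
        if work_queue.qsize() >= MAX_QUEUE_DEPTH:
            # Shedding keeps latency bounded for the work we do accept,
            # which lets the system drain and recover on its own instead
            # of needing host restarts.
            raise Overloaded("queue full, shedding load")
        work_queue.put(job)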
>[...] Right before this event started, one DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening. First, the DNS Planner continued to run and produced many newer generations of plans. Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints. The timing of these events triggered the latent race condition. When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan. The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing. [...]
It outlines some of the mechanics but some might think it still isn't a "Root Cause Analysis" because there's no satisfying explanation of _why_ there were "unusually high delays in Enactor processing". Hardware problem?!? Human error misconfiguration causing unintended delays in Enactor behavior?!? Either the previous sequence of events leading up to that is considered unimportant, or Amazon is still investigating what made Enactor behave in an unpredictable way.
Before the active incident is “resolved” there's an evaluation of probable/plausible recurrence. Usually we/they would have potential mitigations and recovery runbooks prepared as well, to quickly react to any recurrence. Any likely open risks are actively worked to mitigate before the immediate issue is considered resolved. That includes around-the-clock dev team work if it's the best known path to mitigation.
Next, any plausible paths to “risk of recurrence” would be top dev team priority (business hours) until those action items are completed and in deployment. That might include other teams with similar DIY DNS management, other teams who had less impactful queue depth problems, or other similar “near miss” findings. Service team tech & business owners (PE, Sr PE, GM, VP) would be tracking progress daily until resolved.
Then in the next few weeks, at org- and AWS-level “ops meetings”, there are going to be in-depth discussions of the incident, response, underlying problems, etc., the goal there being organizational learning and broader dissemination of lessons learned, action items, best practices, etc.
Can't speak for the current incident but a similar "slow machine" issue once bit our BigCloud service (not as big an incident, thankfully) due to loooong JVM GC pauses on failing hardware.
Often network engineers are unaware of some of the tricky problems that DS research has addressed/solved in the last 50 years because the algorithms are arcane and heuristics often work pretty well, until they don’t. But my guess is that AWS will invest in some serious redesign of the system, hopefully with some rigorous algorithms underpinning the updates.
Consider this a nudge for all you engineers that are designing fault tolerant distributed systems at scale to investigate the problem spaces and know which algorithms solve what problems.
Reading these words makes me break out in cold sweat :-) I really hope they don't
- "Rapidly updatable" depends on the specific implementation, but the design allows for 2 billion changesets in flight before mirrors fall irreparably out of sync with the master database, and the DNS specs include all components necessary for rapid updates: push-based notifications and incremental transfers.
- DNS is designed to be eventually consistent, and each replica is expected to always offer internally consistent data. It's certainly possible for two mirrors to respond with different responses to the same query, but eventual consistency does not preclude that.
- Distributed: the DNS system certainly is a distributed database, in fact it was specifically designed to allow for replication across organization boundaries -- something that very few other distributed systems offer. What DNS does not offer is multi-master operation, but neither do e.g. Postgres or MSSQL.
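The "2 billion changesets" figure maps to 32-bit SOA serials and RFC 1982 serial arithmetic, where "newer than" is only defined while two serials are less than 2^31 apart. A minimal sketch of the comparison, my illustration rather than anything from the comments above:

    # RFC 1982 serial-number arithmetic: serials are 32-bit, and "a is newer
    # than b" is only defined while they are less than 2^31 apart, which is
    # where the ~2 billion in-flight changeset limit comes from.
    SERIAL_BITS = 32
    HALF = 2 ** (SERIAL_BITS - 1)       # 2_147_483_648
    MOD = 2 ** SERIAL_BITS

    def serial_gt(a: int, b: int) -> bool:
        """True if serial a is 'newer than' serial b under RFC 1982 rules."""
        a, b = a % MOD, b % MOD
        return a != b and ((a > b and a - b < HALF) or (a < b and b - a > HALF))

    assert serial_gt(2025090612, 2025090611)   # plain increment: newer
    assert serial_gt(5, MOD - 5)               # wraparound still compares correctly
    # Once two copies drift 2^31 or more apart, neither compares as newer --
    # i.e. the mirrors are irreparably out of sync:
    assert not serial_gt(0, HALF) and not serial_gt(HALF, 0)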
for a large system, it's in practice very nice to split up things like that - you have one bit of software that just reads a bunch of data and then emits a plan, and then another thing that just gets given a plan and executes it.
this is easier to test (you're just dealing with producing one data structure and consuming one data structure, the planner doesn't even try to mutate anything), it's easier to restrict permissions (one side only needs read access to the world!), it's easier to do upgrades (neither side depends on the other existing or even being in the same language), it's safer to operate (the planner is disposable, it can crash or be killed at any time with no problem except update latency), it's easier to comprehend (humans can examine the planner output which contains the entire state of the plan), it's easier to recover from weird states (you can in extremis hack the plan) etc etc. these are all things you appreciate more and more as your system gets bigger and more complicated.
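As a toy illustration of that split (the names and plan shape are mine, nothing to do with the actual Enactor code): the planner is a pure, read-only function that emits a data structure, and the executor only ever consumes one.

    import json
    from dataclasses import dataclass, asdict

    @dataclass
    class Plan:
        generation: int   # monotonically increasing plan id
        records: dict     # desired state: name -> list of IPs

    def make_plan(world_view: dict, generation: int) -> Plan:
        """Pure function: read-only view of the world in, plan out."""
        desired = {name: sorted(h["healthy_ips"]) for name, h in world_view.items()}
        return Plan(generation=generation, records=desired)

    def execute(plan: Plan, apply_record) -> None:
        """Knows nothing about how the plan was made; it just applies it."""
        for name, ips in plan.records.items():
            apply_record(name, ips)

    # Because the plan is plain data, a human can inspect it, diff it against
    # the previous generation, or (in extremis) hand-edit it before execution.
    p = make_plan({"svc.internal": {"healthy_ips": ["10.0.0.2", "10.0.0.1"]}}, generation=42)
    print(json.dumps(asdict(p), indent=2))
    execute(p, apply_record=lambda name, ips: print(f"set {name} -> {ips}"))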
> If it was one thing, wouldn't this race condition have been much more clear to the people working on it?
no
> Is this caused by the explosion of complexity due to the over use of the microservice architecture?
no
it's extremely easy to second-guess the way other people decompose their services since randoms online can't see any of the actual complexity or any of the details and so can easily suggest it would be better if it was different, without having to worry about any of the downsides of the imagined alternative solution.
The Oxide and Friends folks covered an update system they built that is similarly split and they cite a number of the same benefits as you: https://oxide-and-friends.transistor.fm/episodes/systems-sof...
Distributed systems with files as a communication medium are much more complex than programmers think with far more failure modes than they can imagine.
Like… this one, that took out a cloud for hours!
I think the communications piece depends on what other systems you have around you to build on; it's unlikely this planner/executor is completely freestanding. Some companies have large distributed filesystems with well known/tested semantics, schedulers that launch jobs when files appear, they might have ~free access to a database with strict serializability where they can store a serialized version of the plan, etc.
interesting take, in light of all the brain drain that AWS has experienced over the last few years. some outside opinions might be useful - but perhaps the brain drain is so extreme that those remaining don't realize it's occurring?
The two DNS components comprise a monolith: neither is useful without the other and there is one arrow on the design coupling them together.
If they were a single component then none of this would have happened.
Also, version checks? Really?
Why not compare the current state against the desired state and take the necessary actions to bring them in line? (A rough sketch of that reconcile approach is below.)
Last but not least, deleting old config files so aggressively is a “penny wise pound foolish” design. I would keep these forever or at least a month! Certainly much, much longer than any possible time taken through the sequence of provisioning steps.
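Here is a rough sketch of that "reconcile current vs desired" idea, with hypothetical record shapes rather than any claim about AWS's design. Note that a stale run can still carry a stale desired state, so a freshness check at apply time is still needed; what the diff approach buys you is that a full wipe can only happen if the desired state really is empty.

    def reconcile(current: dict, desired: dict):
        """Yield the changes needed to move `current` toward `desired`."""
        for name, ips in desired.items():
            if current.get(name) != ips:
                yield ("UPSERT", name, sorted(ips))
        for name in current.keys() - desired.keys():
            yield ("DELETE", name)

    current = {"dynamodb.example": {"10.0.0.1", "10.0.0.2"}}
    desired = {"dynamodb.example": {"10.0.0.2", "10.0.0.3"}, "other.example": {"10.0.1.9"}}
    for change in reconcile(current, desired):
        print(change)
    # Emits only the needed UPSERTs; deletions only happen for names that
    # are genuinely absent from the desired state, never as a side effect
    # of cleaning up old plan files.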
The post-mortem is specific that they won't turn it back on without resolving this, but I feel like the default assumption for any halfway competent entity would be that they'd fix the known issue that caused them to disable the thing in the first place.
The internet was born out of the need for distributed networks during the Cold War - to reduce central points of failure - a hedging mechanism, if you will.
Now it has consolidated into ever smaller mono nets. A simple mistake in one deployment could bring banking, shopping and travel to a halt globally. This can only get much worse when cyber warfare gets involved.
Personally, I think the cloud metaphor has been overstretched and has long since burst.
For R&D, early stage start-ups and occasional/seasonal computing, cloud works perfectly (similar to how time-sharing systems used to work).
For well established/growth businesses and gov, you better become self-reliant and tech independent: own physical servers + own cloud + own essential services (db, messaging, payment).
There's no shortage of affordable tech, know-how or workforce.
I don't think the idea was that in the event of catastrophe, up to and including nuclear attack, the system would continue working normally, but that it would keep working. And the internet -- as a system -- certainly kept working during this AWS outage. In a degraded state, yes, but it was working, and recovered.
I'm more concerned with the way the early public internet promised a different kind of decentralization -- of economics, power, and ideas -- and how _that_ has become heavily centralized. In which case, AWS, and Amazon, indeed do make a good example. The internet, as a system, is certainly working today, but arguably in a degraded state.
In its conception, the internet (not the www) was not envisaged as an economic medium - its success was a lovely side effect.
I don't see that this is the case; it's just that more people want services over the internet from the same 3 places, which break irregularly.
Internet infrastructure is as far as I can tell, getting better all the time.
The last big BGP bug had 1/10th the comments of the AWS one. And had much less scary naming (ooooh routing instability)
https://news.ycombinator.com/item?id=44105796
>The internet was born out of the need for distributed networks during the Cold War - to reduce central points of failure - a hedging mechanism, if you will.
Instead of arguing about the need that birthed the internet, I will simply say that the internet still works in the same largely distributed fashion. Maybe you mean Web instead of Internet?
The issue here is that "Internet" isn't the same as "Things you might access on the Internet". The Internet held up great during this adventure. As far as I can tell it was returning 404's and 502's without incident. The distributed networks were networking distributedly. If you wanted to send and receive packets with any internet-joined human in a way that didn't rely on some AWS-hosted application, that was still very possible.
>A simple mistake in one deployment could bring banking, shopping and travel to a halt globally.
Yeah but for how long and for how many people? The last 20 years have been a burn in test for a lot of big industries on crappy infrastructure. It looks like near everyone has been dragged kicking and screaming into the future.
I mean the entire shipping industry got done over the last decade.
https://www.zdnet.com/article/all-four-of-the-worlds-largest...
>Personally, I think the cloud metaphor has been overstretched and has long since burst.
It was never very useful.
>For well established/growth businesses and gov, you better become self-reliant and tech independent
For these businesses, they just go out and get themselves some region/vendor redundancy. Lots of applications fell over during this outage, but lots of teams are also getting internal praise for designing their systems robustly and avoiding its fallout.
>There's no shortage of affordable tech, know-how or workforce.
Yes, and these people often know how to design cloud infrastructure to avoid these issues, or are smart enough to warn people that if their region or its dependencies fail without redundancy, they are taking a nose dive. Businesses will make business decisions and review those decisions after getting publicly burnt.
the centralization of computing is distorting the Internet's core strength, the distributed nets (not aws/azure/gcloud zones).
since covid, if anything is telling, it's that politics, the economy and warfare have shifted into a new era, pretty much globally.
So which nets failed here? The write-up doesn't mention any network layer issues, and I am not aware of any large-scale network layer fallout.
I'm guessing PT was chosen because the people writing this report are in PT (where Amazon headquarters is).
(I don't know anything here, just spitballing why that choice would be made)
Definitely a painful one with good learnings and kudos to AWS for being so transparent and detailed :hugops:
I'm guessing the "plans" aspect skipped that, and they were just applying intended state without trying to serialize it. And last-write-wins, until it doesn't.
But that's too complicated and results in more code. So they likely just used an SQS queue with consumers reading from it.
776 words in a single paragraph
If we assume that the system will fail, I think the logical thing to think about is how to limit the effects of that failure. In practice this means cell based architecture, phased rollouts, and isolated zones.
To my knowledge AWS does attempt to implement cell based architecture, but there are some cross region dependencies specifically with us-east-1 due to legacy. The real long term fix for this is designing regions to be independent of each other.
This is a hard thing to do, but it is possible. I have personally been involved in disaster testing where a region was purposely firewalled off from the rest of the infrastructure. You find out very quick where those cross region dependencies lie, and many of them are in unexpected places.
Usually this work is not done due to lack of upper-level VP support and funding, and it is easier to stick your head in the sand and hope bad things don't happen. The strongest supporters of this work are going to be the shareholders who are in it for the long run. If the company goes poof due to improper disaster testing, the shareholders are going to be the main bag holders. Making the board aware of the risks and the estimated probability of fundamentally company-ending events can help get this work funded.
The region model is a lot less robust if core things in other regions require US-East-1 to operate. This has been an issue in previous outages and appears to have struck again this week.
It is what it is, but AWS consistently oversells the robustness of regions as fully separate when events like Monday reveal they’re really not.
In general, when you find one you work to fix it, and one of the most common ways to find more is when one of them fails. Having single points of failure and letting them live isn't the standard practice at this scale.
interesting.
Does that mean a DNS query for dynamodb.us-east-1.amazonaws.com can resolve to one of a hundred thousand IP address?
That's insane!
And also well beyond the limits of route53.
I'm wondering if they're constantly updating route53 with a smaller subset of records and using a low ttl to somewhat work around this.
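For anyone trying to picture the "smaller subset of records with a low TTL" speculation, here's a purely illustrative sketch; it is my own guess at the pattern, not a claim about how Route 53 or DynamoDB's DNS automation actually works.

    import random

    TTL_SECONDS = 5     # low TTL so clients re-resolve often
    SUBSET_SIZE = 8     # well under any per-response answer limit

    def publish_subset(healthy_ips: list) -> list:
        """Pick this refresh cycle's answer set from a huge healthy fleet."""
        return random.sample(healthy_ips, k=min(SUBSET_SIZE, len(healthy_ips)))

    fleet = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]  # stand-in fleet
    print(f"TTL={TTL_SECONDS}s ->", publish_subset(fleet))
    # Each refresh publishes a different handful of IPs; across many
    # resolvers and short TTLs, load spreads over the whole fleet without
    # any single response carrying a hundred thousand records.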
Unfortunately hard documentation is difficult to provide but that’s how a CDN worked at a place I used to work for, there’s also another CDN[1] which talks about the same thing in fancier terms.
> And also well beyond the limits of route53
Ipso facto, R53 can do this just fine. Where do you think all of your public EC2, ELB, RDS, API Gateway, etc etc records are managed and served?
One thing is the internal limit, another thing is the customer-facing limit.
Some hard limits are softer than they appear.
Today is when the Amazon brain drain sent AWS down the spout (644 comments)
The fault was two different clients with divergent goal states:
- one ("old") DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints
- the DNS Planner continued to run and produced many newer generations of plans [Ed: this is key: it's producing "plans" of desired state; these do not include a complete transaction like a log or chain with previous state + mutations]
- one of the other ("new") DNS Enactors then began applying one of the newer plans
- then ("new") invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them [Ed: the key race is implied here. The "old" Enactor is reading _current state_, which was the output of "new", and applying its desired "old" state on top. The discrepency is because apparently Planer and Enactor aren't working with a chain/vector clock/serialized change set numbers/etc]
- At the same time the first ("old") Enactor ... applied its much older plan to the regional DDB endpoint, overwriting the newer plan. [Ed: and here is where "old" Enactor creates the valid ChangeRRSets call, replacing "new" with "old"]
- The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time [Ed: Whoops! See the conditional-apply sketch after this comment]
- The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied.
Ironically Route 53 does have strong transactions of API changes _and_ serializes them _and_ has closed-loop observers to validate change sets globally on every dataplane host. So do other AWS services. And there are even some internal primitives for building replication or change set chains like this. But it's also a PITA and takes a bunch of work, and when it _does_ fail you end up with global deadlock and customers who are really grumpy that they don't see their DNS changes going into effect.
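To make the "check ... was stale by this time" failure concrete, here's a sketch of moving the newer-than check into the apply step itself, as an atomic conditional write on the plan generation. This is a hypothetical in-memory store, not the actual Enactor or Route 53 API.

    import threading

    class EndpointState:
        def __init__(self):
            self._lock = threading.Lock()
            self.applied_generation = 0
            self.records = {}

        def apply_if_newer(self, generation: int, records: dict) -> bool:
            """Atomically apply a plan only if it is newer than what is live."""
            with self._lock:
                if generation <= self.applied_generation:
                    return False   # stale plan: reject instead of overwriting
                self.applied_generation = generation
                self.records = dict(records)
                return True

    state = EndpointState()
    assert state.apply_if_newer(100, {"svc": ["10.0.0.5"]})     # the "new" Enactor's plan lands
    assert not state.apply_if_newer(97, {"svc": ["10.0.0.1"]})  # the delayed "old" plan is rejected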
I feel like I am missing something here... They make it sound like the DNS enactor basically diffs the current state of DNS with the desired state, and then submits the adds/deletes needed to make the DNS go to the desired state.
With the racing writers, wouldn't that have just made the DNS go back to an older state? Why did it remove all the IPs entirely?
Process 2: reads state -- "oh, I need to write all this."
Process 2: writes.
Process 1: deletes.
Or some variant of that anyway. It happens in any system that has concurrent readers/writers and no locks.
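A deterministic toy replay of that interleaving (my own model, not the actual Enactor code), which also answers the "why did the IPs vanish entirely" question: the stale write lands after the check, and the cleanup then deletes the record it just clobbered.

    store = {}   # endpoint name -> (plan_generation, ips)

    def newest_generation() -> int:
        return max((gen for gen, _ in store.values()), default=0)

    # Process 2 (delayed) does its check early: nothing newer than gen 7 is visible.
    check_seen_by_2 = newest_generation()                 # 0

    # Process 1 applies a newer plan (gen 9).
    store["dynamodb.example"] = (9, ["10.0.0.5"])

    # Process 2 finally writes, trusting its earlier, now-stale check.
    if check_seen_by_2 < 7:
        store["dynamodb.example"] = (7, ["10.0.0.1"])     # overwrites the newer plan

    # Process 1's clean-up then deletes anything much older than gen 9...
    store = {k: v for k, v in store.items() if v[0] >= 8}
    print(store)   # {} -- the live record is gone, not merely rolled back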
Anyway appreciate that this seems pretty honest and descriptive.
Looks like Amazon is starting to show cracks in the foundation.
I put more effort into my internet comments that won't be read by millions of people.
It isn't explicitly stated in the RCA, but it is likely these new endpoints were the straw that broke the camel's back for the DynamoDB load balancer DNS automation.
Correct?
The precipitating event was a race condition with the DynamoDB planner/enactor system.
Importantly: the DNS problem was resolved (to degraded state) in 1hr15, and fully resolved in 2hr30. The Droplet Manager problem took much longer!
This is the point of complex failure analysis, and why that school of thought says "root causing" is counterproductive. There will always be other precipitating events!
† which itself could very well be a second-order effect of some even deeper and more latent issue that would be more useful to address!
The initial DynamoDB DNS outage was much worse. A bog-standard TOCTTOU for scheduled tasks that are assumed to be "instant". And the lack of controls that allowed one task to just blow up everything in one of the foundational services.
When I was at AWS some years ago, there were calls to limit the blast radius by using cell architecture to create vertical slices of the infrastructure for critical services. I guess that got completely sidelined.
1. How did it break?
2. Why did it collapse?
A1: Race condition
A2: What you said.
Nobody is saying that locks aren't interesting or important.
The race condition was necessary and sufficient for collapse. Absent corrective action it always leads to AWS going down. In the presence of corrective actions the severity of the failure would have been minor without other aggravating factors, but the race condition is always the cause of this failure.
Could we have had an offsite location to fail over to? From a technical perspective, sure. Same as you could go multi-region or multi-cloud or turn on some servers at hetzner or whatever. There's nothing better or worse about the cloud here - you always have the ability to design with resilience for whatever happens short of the internet on the whole breaking somehow.
Short of publicly releasing all internal documentation, there's not much that can make the AWS infrastructure reasonably clear to an outsider. Reading and understanding all of this also would be rather futile without actual access to source code and observability.
> Many of the largest AWS services rely extensively on DNS to provide seamless scale, fault isolation and recovery, low latency, and locality...
Also, there really is no one AWS; each region is its own (now more than ever before -- some systems weren't originally built to support this).
I asked Claude to reformat it for readability for me: https://claude.ai/public/artifacts/958c4039-d2f1-45eb-9dfe-b...
Obvs do your own cross-checking with the original if 100% accuracy is required.
So if you made a change you had to increase the number, usually a timestamp like 20250906114509, which would be older/lower-numbered than 20250906114702, making it easier to determine which zone file had the newest data.
Seems like they sort of had the same setup but with less rigidity in terms of refusing to load older files.
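A toy version of that scheme (my sketch, not the commenter's actual tooling): the serial is just a UTC timestamp, so "newest data" reduces to a numeric compare and the loader simply refuses to go backwards.

    from datetime import datetime, timezone

    def new_serial() -> int:
        """e.g. 20250906114702 -- strictly increasing as long as the clock is."""
        return int(datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S"))

    def should_load(current_serial: int, candidate_serial: int) -> bool:
        """Refuse zone files that are older than (or equal to) what is live."""
        return candidate_serial > current_serial

    assert should_load(20250906114509, 20250906114702)       # newer file: load it
    assert not should_load(20250906114702, 20250906114509)   # older file: refuse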