https://how.complexsystems.fail/
You can literally check off the things from Cook's piece that apply directly here. Also: when I wrote this comment, most of the thread was about root-causing the DNS thing that happened, which I don't think is the big story behind this outage. (Cook rejects the whole idea of a "root cause", and I'm pretty sure he's dead on right about why.)
That piece by Cook is ok, but largely just a list of assertions (true or not, most do feel intuitive, though). I suppose one should delve into all those references at the end for details? Anyway, this is an ancient topic, and I doubt we have all the answers on those root whys. The MIT course on systems, 6.033, used to assign a paper that has come up on HN only a few times in its history: https://news.ycombinator.com/item?id=10082625 and https://news.ycombinator.com/item?id=16392223 It's from 1962, over 60 years ago, but it is also probably more illuminating/thought-provoking than the post-mortem. Personally, I suspect it's probably an instance of a https://en.wikipedia.org/wiki/Wicked_problem , but only past a certain scale.
You might have to bring personal trauma to this piece to get the full effect.
In engineered systems, there is just a disconnect between KISS at the personal/small scale and what happens in large organizations, and then what happens over time. This is the real root cause/why, but I'm not sure it's fixable. Maybe partly addressable, tho'.
One thing that might give you a moment of worry: both in that Simon paper and, far more broadly, all over academia both long before and ever since, biological systems like our bodies are an archetypal example of "complex". Besides medical failures, life mostly has this one main trick -- make many copies, and if they don't all fail before they, too, can copy, then a stable-ish pattern emerges.
Stable populations + "litter size/replication factor" largely imply average failure rates. For most species it is horrific. On the David Attenborough specials they'll play the sad music and tell you X% of these offspring never make it to mating age. The alternative is not the https://en.wikipedia.org/wiki/Gray_goo apocalypse, but the "whatever-that-species-is-biopocalypse". Sorry - it's late and my joke circuits are maybe fritzing. So, both big 'L' and little 'l' life, too, "is on the edge", just structurally.
https://en.wikipedia.org/wiki/Self-organized_criticality (with sand piles and whatnot) used to be a kind of statistical physics hope for a theory of everything of these kinds of phenomena, but it just doesn't get deployed. Things will seem "shallowly critical" but not so upon deeper inspection. So, maybe it's just not a useful enough approximation.
Anyway, good luck with your housing meetup!
The problem is, oncall is a full-time business. It takes the full attention of the oncall engineer, whether there is an issue or not. Both companies simply treat oncall as a by-product: we just have to do it, so let's stuff it into the sprint. The first company was slightly more serious, as we were asked to put up a 2-3 point oncall task in JIRA. The second one doesn't even do this.
Neither company really encourages engineers to read through complex code written by others, even if we do oncall for those products. Again, the first company did better, and we were supposed to create a channel and pull people in, so it’s OKish to not know anything about the code. The second company simply leaves oncall to do whatever they can. Neither company allocates enough time for engineers to read the source code thoroughly. And neither has good documentation for oncall.
I don’t know the culture of AWS. I’d very much want to work in an oncall environment that is serious and encourages learning.
We sent a test page periodically to make sure the pager actually beeped. We got paid extra for being in the rotation. The leadership knew this was a critical step. Unfortunately, much of our tooling was terrible, which would cause false pages, or failed critical operations, all too frequently.
I later worked on SWE teams that didn't take dev oncall very seriously. At my current job, we have an oncall, but it's best effort business hours only.
Is that really uncommon? I've been on call for many companies and many types of institutions, and I can't recall ever being told I couldn't do something to bring a system up. It's kinda the job?
On call seriousness should be directly proportional to pay. Google pays. If smallcorp wants to pay me COL, I'll be looking at that 2AM ticket at 9AM when I get to work.
12-12 rotation in SRE is a lot more reasonable for humans
It was a good lesson in what a manicured lower environment can do for you.
Pointing out that "complex systems" have "layers of defense" is neither insightful nor useful; it's obvious. Saying that any and all failures in a given complex system lack a root cause is wrong.
Cook uses a lot of words to say not much at all. There's no concrete advice to be taken from How Complex Systems Fail, nothing to change. There's no casualty procedure or post-mortem investigation which would change a single letter of a single word in response to it. It's hot air.
Then I realized: the internet, the power grid (at least in most developed countries) -- there are things that don't actually fail catastrophically, even though they are extremely complex and not always built by efficient organizations. What's the retort to this argument?
I think you could argue AWS is more complex than the electrical grid, but even if it's not, the grid has had several decades to iron out kinks and AWS hasn't. AWS also adds a ton of completely new services each year in addition to adding more capacity. E.g. I bet these DNS Enactors have become more numerous and their plans became much larger than when they were first developed, which has greatly increased the odds of experiencing this issue.
This has gotten significantly better in recent years, but it used to be possible and common for a single misbehaving AS to cause global issues.
Texas nearly ran into this during their blackout a few years ago -- their grid got within a few minutes of complete failure that would have required a black start which IIRC has never been done.
Grady has a good explanation and the writeup is interesting reading too.
https://www.kentik.com/blog/a-brief-history-of-the-internets...
> power grid
https://www.entsoe.eu/publications/blackout/28-april-2025-ib...
>>> Views of ‘cause’ limit the effectiveness of defenses against future events.
>>> Post-accident remedies for “human error” are usually predicated on obstructing activities that can “cause” accidents. These end-of-the-chain measures do little to reduce the likelihood of further accidents. In fact that likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly. Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult.
For example, take airline safety -- are we to believe based on the quoted assertion that every airline accident and resulting remedy that mitigated the causes have made air travel LESS safe? That sounds objectively, demonstrably false.
Truly complex systems like ecosystems and climate might qualify for this assertion where humans have interfered, often with the best intentions, but caused unexpected effects that may be beyond human capacity to control.
But I can think of lots of examples where the response to an unfortunate, but very rare, incident can make us less safe overall. The response to rare vaccine side effects comes immediately to mind.
One could make a similar argument in sports that no one person ever scores a point because they are only put into scoring position by a complex series of actions which preceded the actual point. I think that's technically true but practically useless. It's good to have a wide perspective of an issue but I see nothing wrong with identifying the crux of a failure like this one.
But: given finite resources, should you respond to this incident by auditing your DNS management systems (or all your systems) for race conditions? Or should you instead figure out how to make the Droplet Manager survive (in some degraded state) a partition from DynamoDB without entering congestive collapse? Is the right response an identification of the "most faulty components" and a project plan to improve them? Or is it closing the human expertise/process gap that prevented them from throttling DWFM for 4.5 hours?
Cook isn't telling you how to solve problems; he's asking you to change how you think about problems, so you don't rathole in obvious local extrema instead of being guided by the bigger picture.
And since this network is privileged, observability tools, debugging support, and even maybe access to it are more complicated. Even just the set of engineers who have access is likely more limited, especially at 2AM.
Should AWS relax these controls to make recovery easier? But then it will also result in a less secure system. It's again a trade-off.
Even you can't help it - "enumerating a list of questions" is a very engineering thing to do.
Normal people don't talk or think like that. The way Cook is asking us to "think about problems" is kind of the opposite of what good leadership looks like. Thinking about thinking about problems is like, 200% wrong. On the contrary, be way more emotional and way simpler.
– It identifies problems (complexity, latent failures, hindsight bias, etc.) more than it offers solutions. Readers must seek outside methods to act on these insights.
– It feels abstract, describing general truths applicable to many domains, but requiring translation into domain-specific practices (be it software, aviation, medicine, etc.).
– It leaves out discussion on managing complexity – e.g. principles of simplification, modular design, or quantitative risk assessment – which would help prevent some of the failures it warns about.
– It assumes well-intentioned actors and does not grapple with scenarios where business or political pressures undermine safety – an increasingly pertinent issue in modern industries.
– It does not explicitly warn against misusing its principles (e.g. becoming fatalistic or overconfident in defenses). The nuance that «failures are inevitable but we still must diligently work to minimize them» must come from the reader’s interpretation.
«How Complex Systems Fail» is highly valuable for its conceptual clarity and timeless truths about complex system behavior. Its direction is one of realism – accepting that no complex system is ever 100% safe – and of placing trust in human skill and systemic defenses over simplistic fixes. The rational critique is that this direction, whilst insightful, needs to be paired with concrete strategies and a proactive mindset to be practically useful. The treatise by itself won’t tell you how to design the next aircraft or run a data center more safely, but it will shape your thinking so you avoid common pitfalls (such as chasing singular root causes or blaming operators). To truly «preclude» failures or mitigate them, one must extend Cook’s ideas with detailed engineering and organizational practices. In other words, Cook teaches us why things fail in complex ways; it is up to us – engineers, managers, regulators, and front-line practitioners – to apply those lessons in how we build and operate the systems under our care.
To be fair, at the time of writing (late 1990's), Cook’s treatise was breaking ground by succinctly articulating these concepts for a broad audience. Its objective was likely to provoke thought and shift paradigms, rather than serve as a handbook.
Today, we have the benefit of two more decades of research and practice in resilience engineering, which builds on Cook’s points. Practitioners now emphasise building resilient systems, not just trying to prevent failure outright. They use Cook’s insights as rationale for things such as chaos engineering, better incident response, and continuous learning cultures.
But the stale read didn't scare me nearly as much as this quote:
> Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues
Everyone can make a distributed system mistake (these things are hard). But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have a recovery procedure. Maybe I am reading too much into it; maybe what they meant was that they didn't have a recovery procedure for "this exact" set of circumstances, but it is a little worrying even if that were the case. EC2 is one of the original services in AWS. At this point I expect it to be so battle-hardened that very few edge cases would not have been identified. It seems that the EC2 failure was more impactful in a way, as it cascaded to more and more services (like the NLB and Lambda) and took more time to fully recover. I'd be interested to know what gets put in place there to make it even more resilient.
I wouldn't want to, like, make a company out of it (I assume the foundational model companies will eat all these businesses) but you could probably do some really interesting stuff with an agent that consumes telemetry and failure model information and uses it to surface hypos about what to look at or what interventions to consider.
All of this is besides my original point, though: I'm saying, you can't runbook your way to having a system as complex as AWS run safely. Safety in a system like that is a much more complicated process, unavoidably. Like: I don't think an LLM can solve the "fractal runbook requirement" problem!
I bet the original engineers planned for, and designed the system to be resilient to, this cold start situation. But over time those engineers left, and new people took over -- people who didn't fully understand and appreciate the complexity, and probably didn't care that much about all the edge cases. Then, pushed by management to pursue goals that are antithetical to reliability, such as cost optimization, the new failure case was introduced through lots of suboptimal changes. The result is as we see it -- a catastrophic failure which caught everyone by surprise.
It's the kind of thing that happens over and over again when the accountants are in charge.
I guess they don't have a recovery procedure for the "congestive collapse" edge case. I have seen something similar, so I wouldn't be frowning at this.
A couple of red flags though:
1. Apparent lack of load-shedding support by this DWFM, such that a server reboot had to be performed. Need to learn from https://aws.amazon.com/builders-library/using-load-shedding-... (a rough sketch of the idea follows after this list)
2. Having DynamoDB as a dependency of this DWFM service, instead of something more primitive like Chubby. Need to learn more about distributed systems primitives from https://www.youtube.com/watch?v=QVvFVwyElLY
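On point 1, here is a minimal load-shedding sketch. It is illustrative only, under my own assumption that DWFM's problem was an unbounded backlog of lease work; the names and the threshold are made up, not anything from the report.

    # Minimal load-shedding sketch (illustrative, not DWFM's actual design):
    # once the backlog exceeds what can be worked through before callers
    # time out, reject new work cheaply instead of queueing it and
    # entering congestive collapse.
    import queue

    MAX_QUEUE_DEPTH = 1000   # hypothetical capacity; derive from measured throughput

    work_queue: "queue.Queue[str]" = queue.Queue()

    class Overloaded(Exception):
        """Tells the caller to back off and retry later / elsewhere."""

    def submit(job: str) -> None:
        if work_queue.qsize() >= MAX_QUEUE_DEPTH:
            # Shedding keeps latency bounded for the work we do accept,
            # which lets the system drain and recover on its own instead
            # of needing host restarts.
            raise Overloaded("queue full, shedding load")
        work_queue.put(job)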
>[...] Right before this event started, one DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening. First, the DNS Planner continued to run and produced many newer generations of plans. Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints. The timing of these events triggered the latent race condition. When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan. The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing. [...]
It outlines some of the mechanics but some might think it still isn't a "Root Cause Analysis" because there's no satisfying explanation of _why_ there were "unusually high delays in Enactor processing". Hardware problem?!? Human error misconfiguration causing unintended delays in Enactor behavior?!? Either the previous sequence of events leading up to that is considered unimportant, or Amazon is still investigating what made Enactor behave in an unpredictable way.
Before the active incident is “resolved” there's an evaluation of probable/plausible recurrence. Usually we/they would have potential mitigations and recovery runbooks prepared as well, to quickly react to any recurrence. Any likely open risks are actively worked to mitigate before the immediate issue is considered resolved. That includes around-the-clock dev team work if it's the best known path to mitigation.
Next, any plausible paths to “risk of recurrence” would be top dev team priority (business hours) until those action items are completed and in deployment. That might include other teams with similar DIY DNS management, other teams who had less impactful queue depth problems, or other similar “near miss” findings. Service team tech & business owners (PE, Sr PE, GM, VP) would be tracking progress daily until resolved.
Then in the next few weeks, at org- and AWS-level “ops meetings”, there are going to be in-depth discussions of the incident, response, underlying problems, etc., the goal there being organizational learning and broader dissemination of lessons learned, action items, best practices, etc.
Can't speak for the current incident but a similar "slow machine" issue once bit our BigCloud service (not as big an incident, thankfully) due to loooong JVM GC pauses on failing hardware.
Often network engineers are unaware of some of the tricky problems that DS research has addressed/solved in the last 50 years because the algorithms are arcane and heuristics often work pretty well, until they don’t. But my guess is that AWS will invest in some serious redesign of the system, hopefully with some rigorous algorithms underpinning the updates.
Consider this a nudge for all you engineers that are designing fault tolerant distributed systems at scale to investigate the problem spaces and know which algorithms solve what problems.
Reading these words makes me break out in cold sweat :-) I really hope they don't
- "Rapidly updatable" depends on the specific implementation, but the design allows for 2 billion changesets in flight before mirrors fall irreparably out of sync with the master database, and the DNS specs include all components necessary for rapid updates: push-based notifications and incremental transfers.
- DNS is designed to be eventually consistent, and each replica is expected to always offer internally consistent data. It's certainly possible for two mirrors to respond with different responses to the same query, but eventual consistency does not preclude that.
- Distributed: the DNS system certainly is a distributed database, in fact it was specifically designed to allow for replication across organization boundaries -- something that very few other distributed systems offer. What DNS does not offer is multi-master operation, but neither do e.g. Postgres or MSSQL.
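The "2 billion changesets" figure maps to 32-bit SOA serials and RFC 1982 serial arithmetic, where "newer than" is only defined while two serials are less than 2^31 apart. A minimal sketch of the comparison, my illustration rather than anything from the comments above:

    # RFC 1982 serial-number arithmetic: serials are 32-bit, and "a is newer
    # than b" is only defined while they are less than 2^31 apart, which is
    # where the ~2 billion in-flight changeset limit comes from.
    SERIAL_BITS = 32
    HALF = 2 ** (SERIAL_BITS - 1)       # 2_147_483_648
    MOD = 2 ** SERIAL_BITS

    def serial_gt(a: int, b: int) -> bool:
        """True if serial a is 'newer than' serial b under RFC 1982 rules."""
        a, b = a % MOD, b % MOD
        return a != b and ((a > b and a - b < HALF) or (a < b and b - a > HALF))

    assert serial_gt(2025090612, 2025090611)   # plain increment: newer
    assert serial_gt(5, MOD - 5)               # wraparound still compares correctly
    # Once two copies drift 2^31 or more apart, neither compares as newer --
    # i.e. the mirrors are irreparably out of sync:
    assert not serial_gt(0, HALF) and not serial_gt(HALF, 0)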
for a large system, it's in practice very nice to split up things like that - you have one bit of software that just reads a bunch of data and then emits a plan, and then another thing that just gets given a plan and executes it.
this is easier to test (you're just dealing with producing one data structure and consuming one data structure, the planner doesn't even try to mutate anything), it's easier to restrict permissions (one side only needs read access to the world!), it's easier to do upgrades (neither side depends on the other existing or even being in the same language), it's safer to operate (the planner is disposable, it can crash or be killed at any time with no problem except update latency), it's easier to comprehend (humans can examine the planner output which contains the entire state of the plan), it's easier to recover from weird states (you can in extremis hack the plan) etc etc. these are all things you appreciate more and more as your system gets bigger and more complicated.
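As a toy illustration of that split (the names and plan shape are mine, nothing to do with the actual Enactor code): the planner is a pure, read-only function that emits a data structure, and the executor only ever consumes one.

    import json
    from dataclasses import dataclass, asdict

    @dataclass
    class Plan:
        generation: int   # monotonically increasing plan id
        records: dict     # desired state: name -> list of IPs

    def make_plan(world_view: dict, generation: int) -> Plan:
        """Pure function: read-only view of the world in, plan out."""
        desired = {name: sorted(h["healthy_ips"]) for name, h in world_view.items()}
        return Plan(generation=generation, records=desired)

    def execute(plan: Plan, apply_record) -> None:
        """Knows nothing about how the plan was made; it just applies it."""
        for name, ips in plan.records.items():
            apply_record(name, ips)

    # Because the plan is plain data, a human can inspect it, diff it against
    # the previous generation, or (in extremis) hand-edit it before execution.
    p = make_plan({"svc.internal": {"healthy_ips": ["10.0.0.2", "10.0.0.1"]}}, generation=42)
    print(json.dumps(asdict(p), indent=2))
    execute(p, apply_record=lambda name, ips: print(f"set {name} -> {ips}"))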
> If it was one thing, wouldn't this race condition have been much more clear to the people working on it?
no
> Is this caused by the explosion of complexity due to the over use of the microservice architecture?
no
it's extremely easy to second-guess the way other people decompose their services since randoms online can't see any of the actual complexity or any of the details and so can easily suggest it would be better if it was different, without having to worry about any of the downsides of the imagined alternative solution.
The Oxide and Friends folks covered an update system they built that is similarly split and they cite a number of the same benefits as you: https://oxide-and-friends.transistor.fm/episodes/systems-sof...
Distributed systems with files as a communication medium are much more complex than programmers think with far more failure modes than they can imagine.
Like… this one, that took out a cloud for hours!
I think the communications piece depends on what other systems you have around you to build on; it's unlikely this planner/executor is completely freestanding. Some companies have large distributed filesystems with well known/tested semantics, schedulers that launch jobs when files appear, they might have ~free access to a database with strict serializability where they can store a serialized version of the plan, etc.
interesting take, in light of all the brain drain that AWS has experienced over the last few years. some outside opinions might be useful - but perhaps the brain drain is so extreme that those remaining don't realize it's occurring?
The two DNS components comprise a monolith: neither is useful without the other and there is one arrow on the design coupling them together.
If they were a single component then none of this would have happened.
Also, version checks? Really?
Why not compare the current state against the desired state and take the necessary actions to bring them in line? (A rough sketch of that reconcile approach is below.)
Last but not least, deleting old config files so aggressively is a “penny wise pound foolish” design. I would keep these forever or at least a month! Certainly much, much longer than any possible time taken through the sequence of provisioning steps.
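Here is a rough sketch of that "reconcile current vs desired" idea, with hypothetical record shapes rather than any claim about AWS's design. Note that a stale run can still carry a stale desired state, so a freshness check at apply time is still needed; what the diff approach buys you is that a full wipe can only happen if the desired state really is empty.

    def reconcile(current: dict, desired: dict):
        """Yield the changes needed to move `current` toward `desired`."""
        for name, ips in desired.items():
            if current.get(name) != ips:
                yield ("UPSERT", name, sorted(ips))
        for name in current.keys() - desired.keys():
            yield ("DELETE", name)

    current = {"dynamodb.example": {"10.0.0.1", "10.0.0.2"}}
    desired = {"dynamodb.example": {"10.0.0.2", "10.0.0.3"}, "other.example": {"10.0.1.9"}}
    for change in reconcile(current, desired):
        print(change)
    # Emits only the needed UPSERTs; deletions only happen for names that
    # are genuinely absent from the desired state, never as a side effect
    # of cleaning up old plan files.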
The post-mortem is specific that they won't turn it back on without resolving this, but I feel like the default assumption for any halfway competent entity would be that they'd fix the known issue that caused them to disable the thing in the first place.
The internet was born out of the need for distributed networks during the Cold War - to reduce central points of failure - a hedging mechanism, if you will.
Now it has consolidated into ever smaller mono nets. A simple mistake in one deployment could bring banking, shopping and travel to a halt globally. This can only get much worse when cyber warfare gets involved.
Personally, I think the cloud metaphor has been overstretched and has long since burst.
For R&D, early stage start-ups and occasional/seasonal computing, cloud works perfectly (similar to how time-sharing systems used to work).
For well established/growth businesses and gov, you better become self-reliant and tech independent: own physical servers + own cloud + own essential services (db, messaging, payment).
There's no shortage of affordable tech, know-how or workforce.
I don't think the idea was that in the event of catastrophe, up to and including nuclear attack, the system would continue working normally, but that it would keep working. And the internet -- as a system -- certainly kept working during this AWS outage. In a degraded state, yes, but it was working, and recovered.
I'm more concerned with the way the early public internet promised a different kind of decentralization -- of economics, power, and ideas -- and how _that_ has become heavily centralized. In which case, AWS, and Amazon, indeed do make a good example. The internet, as a system, is certainly working today, but arguably in a degraded state.
In its conception, the internet (not the www) was not envisaged as an economic medium - its success was a lovely side effect.
I don't see that this is the case; it's just that more people want services over the internet from the same 3 places, which break irregularly.
Internet infrastructure is as far as I can tell, getting better all the time.
The last big BGP bug had 1/10th the comments of the AWS one. And had much less scary naming (ooooh routing instability)
https://news.ycombinator.com/item?id=44105796
>The internet was born out of the need for distributed networks during the Cold War - to reduce central points of failure - a hedging mechanism, if you will.
Instead of arguing about the need that birthed the internet, I will simply say that the internet still works in the same largely distributed fashion. Maybe you mean Web instead of Internet?
The issue here is that "Internet" isn't the same as "Things you might access on the Internet". The Internet held up great during this adventure. As far as I can tell it was returning 404's and 502's without incident. The distributed networks were networking distributedly. If you wanted to send and receive packets with any internet-joined human in a way that didn't rely on some AWS-hosted application, that was still very possible.
>A simple mistake in one deployment could bring banking, shopping and travel to a halt globally.
Yeah but for how long and for how many people? The last 20 years have been a burn in test for a lot of big industries on crappy infrastructure. It looks like near everyone has been dragged kicking and screaming into the future.
I mean the entire shipping industry got done over the last decade.
https://www.zdnet.com/article/all-four-of-the-worlds-largest...
>Personally, I think the cloud metaphor has been overstretched and has long since burst.
It was never very useful.
>For well established/growth businesses and gov, you better become self-reliant and tech independent
For these businesses, they just go out and get themselves some region/vendor redundancy. Lots of applications fell over during this outage, but lots of teams are also getting internal praise for designing their systems robustly and avoiding its fallout.
>There's no shortage of affordable tech, know-how or workforce.
Yes, and these people often know how to design cloud infrastructure to avoid these issues, or are smart enough to warn people that if their region or its dependencies fail without redundancy, they are taking a nose dive. Businesses will make business decisions and review those decisions after getting publicly burnt.
the centralization of computing is distorting the Internet's core strength, the distributed nets (not aws/azure/gcloud zones).
since covid, if anything is telling, it's that politics, the economy and warfare have shifted into a new era, pretty much globally.
So which nets failed here? The write-up doesn't mention any network layer issues, and I am not aware of any large-scale network layer fallout.
I'm guessing PT was chosen because the people writing this report are in PT (where Amazon headquarters is).
(I don't know anything here, just spitballing why that choice would be made)
Definitely a painful one with good learnings and kudos to AWS for being so transparent and detailed :hugops:
I'm guessing the "plans" aspect skipped that, and they were just applying intended state without trying to serialize it. And last-write-wins, until it doesn't.
But that's too complicated and results in more code. So they likely just used an SQS queue with consumers reading from it.
776 words in a single paragraph
If we assume that the system will fail, I think the logical thing to think about is how to limit the effects of that failure. In practice this means cell based architecture, phased rollouts, and isolated zones.
To my knowledge AWS does attempt to implement cell based architecture, but there are some cross region dependencies specifically with us-east-1 due to legacy. The real long term fix for this is designing regions to be independent of each other.
This is a hard thing to do, but it is possible. I have personally been involved in disaster testing where a region was purposely firewalled off from the rest of the infrastructure. You find out very quick where those cross region dependencies lie, and many of them are in unexpected places.
Usually this work is not done due to lack of upper-level VP support and funding, and it is easier to stick your head in the sand and hope bad things don't happen. The strongest supporters of this work are going to be the shareholders who are in it for the long run. If the company goes poof due to improper disaster testing, the shareholders are going to be the main bag holders. Making the board aware of the risks and the estimated probability of fundamentally company-ending events can help get this work funded.
The region model is a lot less robust if core things in other regions require US-East-1 to operate. This has been an issue in previous outages and appears to have struck again this week.
It is what it is, but AWS consistently oversells the robustness of regions as fully separate when events like Monday reveal they’re really not.
In general, when you find one you work to fix it, and one of the most common ways to find more is when one of them fails. Having single points of failure and letting them live isn't the standard practice at this scale.
interesting.
Does that mean a DNS query for dynamodb.us-east-1.amazonaws.com can resolve to one of a hundred thousand IP address?
That's insane!
And also well beyond the limits of route53.
I'm wondering if they're constantly updating route53 with a smaller subset of records and using a low ttl to somewhat work around this.
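For anyone trying to picture the "smaller subset of records with a low TTL" speculation, here's a purely illustrative sketch; it is my own guess at the pattern, not a claim about how Route 53 or DynamoDB's DNS automation actually works.

    import random

    TTL_SECONDS = 5     # low TTL so clients re-resolve often
    SUBSET_SIZE = 8     # well under any per-response answer limit

    def publish_subset(healthy_ips: list) -> list:
        """Pick this refresh cycle's answer set from a huge healthy fleet."""
        return random.sample(healthy_ips, k=min(SUBSET_SIZE, len(healthy_ips)))

    fleet = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]  # stand-in fleet
    print(f"TTL={TTL_SECONDS}s ->", publish_subset(fleet))
    # Each refresh publishes a different handful of IPs; across many
    # resolvers and short TTLs, load spreads over the whole fleet without
    # any single response carrying a hundred thousand records.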
Unfortunately hard documentation is difficult to provide but that’s how a CDN worked at a place I used to work for, there’s also another CDN[1] which talks about the same thing in fancier terms.
> And also well beyond the limits of route53
Ipso facto, R53 can do this just fine. Where do you think all of your public EC2, ELB, RDS, API Gateway, etc etc records are managed and served?
One thing is the internal limit, another thing is the customer-facing limit.
Some hard limits are softer than they appear.
Today is when the Amazon brain drain sent AWS down the spout (644 comments)
The fault was two different clients with divergent goal states:
- one ("old") DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints
- the DNS Planner continued to run and produced many newer generations of plans [Ed: this is key: it's producing "plans" of desired state; these do not include a complete transaction like a log or chain with previous state + mutations]
- one of the other ("new") DNS Enactors then began applying one of the newer plans
- then ("new") invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them [Ed: the key race is implied here. The "old" Enactor is reading _current state_, which was the output of "new", and applying its desired "old" state on top. The discrepency is because apparently Planer and Enactor aren't working with a chain/vector clock/serialized change set numbers/etc]
- At the same time the first ("old") Enactor ... applied its much older plan to the regional DDB endpoint, overwriting the newer plan. [Ed: and here is where "old" Enactor creates the valid ChangeRRSets call, replacing "new" with "old"]
- The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time [Ed: Whoops! See the conditional-apply sketch after this comment]
- The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied.
Ironically Route 53 does have strong transactions of API changes _and_ serializes them _and_ has closed-loop observers to validate change sets globally on every dataplane host. So do other AWS services. And there are even some internal primitives for building replication or change set chains like this. But it's also a PITA and takes a bunch of work, and when it _does_ fail you end up with global deadlock and customers who are really grumpy that they don't see their DNS changes going into effect.
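To make the "check ... was stale by this time" failure concrete, here's a sketch of moving the newer-than check into the apply step itself, as an atomic conditional write on the plan generation. This is a hypothetical in-memory store, not the actual Enactor or Route 53 API.

    import threading

    class EndpointState:
        def __init__(self):
            self._lock = threading.Lock()
            self.applied_generation = 0
            self.records = {}

        def apply_if_newer(self, generation: int, records: dict) -> bool:
            """Atomically apply a plan only if it is newer than what is live."""
            with self._lock:
                if generation <= self.applied_generation:
                    return False   # stale plan: reject instead of overwriting
                self.applied_generation = generation
                self.records = dict(records)
                return True

    state = EndpointState()
    assert state.apply_if_newer(100, {"svc": ["10.0.0.5"]})     # the "new" Enactor's plan lands
    assert not state.apply_if_newer(97, {"svc": ["10.0.0.1"]})  # the delayed "old" plan is rejected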
I feel like I am missing something here... They make it sound like the DNS enactor basically diffs the current state of DNS with the desired state, and then submits the adds/deletes needed to make the DNS go to the desired state.
With the racing writers, wouldn't that have just made the DNS go back to an older state? Why did it remove all the IPs entirely?
Process 2: reads state -- "oh, I need to write all this."
Process 2: writes.
Process 1: deletes.
Or some variant of that anyway. It happens in any system that has concurrent readers/writers and no locks.
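A deterministic toy replay of that interleaving (my own model, not the actual Enactor code), which also answers the "why did the IPs vanish entirely" question: the stale write lands after the check, and the cleanup then deletes the record it just clobbered.

    store = {}   # endpoint name -> (plan_generation, ips)

    def newest_generation() -> int:
        return max((gen for gen, _ in store.values()), default=0)

    # Process 2 (delayed) does its check early: nothing newer than gen 7 is visible.
    check_seen_by_2 = newest_generation()                 # 0

    # Process 1 applies a newer plan (gen 9).
    store["dynamodb.example"] = (9, ["10.0.0.5"])

    # Process 2 finally writes, trusting its earlier, now-stale check.
    if check_seen_by_2 < 7:
        store["dynamodb.example"] = (7, ["10.0.0.1"])     # overwrites the newer plan

    # Process 1's clean-up then deletes anything much older than gen 9...
    store = {k: v for k, v in store.items() if v[0] >= 8}
    print(store)   # {} -- the live record is gone, not merely rolled back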
Anyway appreciate that this seems pretty honest and descriptive.
Looks like Amazon is starting to show cracks in the foundation.
I put more effort into my internet comments that won't be read by millions of people.
It isn't explicitly stated in the RCA, but it is likely these new endpoints were the straw that broke the camel's back for the DynamoDB load balancer DNS automation.
Correct?
The precipitating event was a race condition with the DynamoDB planner/enactor system.
Importantly: the DNS problem was resolved (to degraded state) in 1hr15, and fully resolved in 2hr30. The Droplet Manager problem took much longer!
This is the point of complex failure analysis, and why that school of thought says "root causing" is counterproductive. There will always be other precipitating events!
† which itself could very well be a second-order effect of some even deeper and more latent issue that would be more useful to address!
The initial DynamoDB DNS outage was much worse. A bog-standard TOCTTOU for scheduled tasks that are assumed to be "instant". And the lack of controls that allowed one task to just blow up everything in one of the foundational services.
When I was at AWS some years ago, there were calls to limit the blast radius by using cell architecture to create vertical slices of the infrastructure for critical services. I guess that got completely sidelined.
1. How did it break?
2. Why did it collapse?
A1: Race condition
A2: What you said.
Nobody is saying that locks aren't interesting or important.
The race condition was necessary and sufficient for collapse. Absent corrective action it always leads to AWS going down. In the presence of corrective actions the severity of the failure would have been minor without other aggravating factors, but the race condition is always the cause of this failure.
Could we have had an offsite location to fail over to? From a technical perspective, sure. Same as you could go multi-region or multi-cloud or turn on some servers at hetzner or whatever. There's nothing better or worse about the cloud here - you always have the ability to design with resilience for whatever happens short of the internet on the whole breaking somehow.
Short of publicly releasing all internal documentation, there's not much that can make the AWS infrastructure reasonably clear to an outsider. Reading and understanding all of this also would be rather futile without actual access to source code and observability.
> Many of the largest AWS services rely extensively on DNS to provide seamless scale, fault isolation and recovery, low latency, and locality...
Also, there really is no one AWS; each region is its own (now more than ever before -- some systems weren't originally built to support this).
I asked Claude to reformat it for readability for me: https://claude.ai/public/artifacts/958c4039-d2f1-45eb-9dfe-b...
Obvs do your own cross-checking with the original if 100% accuracy is required.
So if you made a change you had to increase the number, usually a timestamp like 20250906114509, which would be older/lower-numbered than 20250906114702, making it easier to determine which zone file had the newest data.
Seems like they sort of had the same setup but with less rigidity in terms of refusing to load older files.
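A toy version of that scheme (my sketch, not the commenter's actual tooling): the serial is just a UTC timestamp, so "newest data" reduces to a numeric compare and the loader simply refuses to go backwards.

    from datetime import datetime, timezone

    def new_serial() -> int:
        """e.g. 20250906114702 -- strictly increasing as long as the clock is."""
        return int(datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S"))

    def should_load(current_serial: int, candidate_serial: int) -> bool:
        """Refuse zone files that are older than (or equal to) what is live."""
        return candidate_serial > current_serial

    assert should_load(20250906114509, 20250906114702)       # newer file: load it
    assert not should_load(20250906114702, 20250906114509)   # older file: refuse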