Not that I'm discounting the author's experience, but something doesn't quite add up:
- How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?
- If they know, how is this not an urgent P0 issue for AWS? This seems like the most basic of basic usability features is 100% broken.
- Is there something more nuanced to the failure case here such as does this depend on transactions in-progress? I can see how maybe the failover is waiting for in-flight transactions to close and then maybe hits a timeout where it proceeds with the other part of the failover by accident. That could explain why it doesn't seem like the issue is more widespread.
If it's anything like how Azure handles this kind of issue, it's likely "lots of people have experienced it, a restart fixes it so no one cares that much, few have any idea how to figure out a root cause on their own, and the process to find a root cause with the vendor is so painful that no one ever sees it through"
That was when I scripted away a test that ran hundreds of times a day on a lower environment, attempting repro. As they say, at scale, even insignificant issues become significant. I don't remember clearly, I think it was a 5-10% chance that the issue triggered.
At least confirming the fix, which we did eventually receive, was mostly a breeze. Had to provide an inordinate amount of captures, logs, and data to get there though. Was quite the grueling few weeks, especially all the office politics laden calls.
With all the enterprise solutions being distributed, loosely coupled, self-healing, redundant, and fault-tolerant, issues like this essentially just slot in perfectly. Compound this with man-hours (especially expert ones) being a lot harder to justify for any one particular bump in tail latency, and the equation is just really not there for all this.
What gets us specifically to look into things is either the issue being operationally gnarly (e.g. frequent, impacting, or both), or management being swayed enough by principled thinking (or at least pretending to be). I'd imagine it's the same elsewhere. The latter would mostly happen if fixing a given thing becomes an office political concern, or a corporate reputation one. You might wonder if those individual issues ever snowballed into a big one, but turns out human nature takes care of that just "sufficiently enough" before it would manifest "too severely". [0]
Otherwise, you're looking at fixing / RCA'ing / working around someone else's product defect on their behalf, and giving your engineers a "fun challenge". Fun doesn't pay the bills, and we rarely saw much in return from the vendor in exchange for our research. I'd love to entertain the idea that maybe behind closed doors the negotiations went a little better because of these, but for various reasons, I really doubt so in hindsight.
[0] as delightfully subjective as those get of course
Amazon is mature enough for processes to reflect this, so my guess for why something like this could slip through is either too many new feature requests or many more critical issues to resolve.
AWS this is surprising
I know that there is no comparison in the user base, but a few years ago I ran into a massive Python + MySQL bug that:
1. made SELECT ... FOR UPDATE fail silenty 2. aborted the transaction and set the connection into autocommit mode
This basically a worst case scenario in a transactional system.
I was basically screaming like a mad man in the corner but no one seemed to care.
Someone contacted me months later telling me that they experienced the same problem with "interesting" consequences in their system.
The bug was eventually fixed but at that point I wasn't tracking it anymore, I provided a patch when I created the issue and moved on.
https://stackoverflow.com/questions/945482/why-doesnt-anyone...
If so, that part is all totally normal and expected. It's just that due to a bug in the Python client library (16 years ago), the rollback was happening silently because the error was not surfaced properly by the client library.
When you're in autocommit mode, BEGIN starts an explicit transaction, but after that transaction (either COMMIT or ROLLBACK), you return to autocommit mode.
The situation being described upthread is a case where a transaction was started, and then rolled back by the server due to deadlock error. So it's totally normal that you're back in autocommit mode after the rollback. Most DBMS handle this identically.
The bug described was entirely in the client library failing to surface the deadlock error. There's simply no autocommit-related bug as it was described.
In a sane world, statements outside `BEGIN` would be an unconditional error.
In any case, that's a subjective opinion on database design, not a bug. Anyway it's fairly tangential to the client library bug described up-thread.
I've had AWS engineers confirm very detailed and specific technical implementation details many many times. But these were at companies that happily spent over a $1M/year with AWS.
Their docs are not good and things frequently don't behave how you expect them to.
It's at least possible that this sort of aborted failover happens a fair amount, but if there's no downtime then users just try again and it succeeds, so they never bother complaining to AWS. Unless AWS is specifically monitoring for it, they might be blind to it happening.
This AWS documentation section: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQ...
“Amazon Aurora PostgreSQL updates”: under Aurora PostgreSQL 17.5.3, September 16, 2025 – Critical stability enhancements includes a potential match:
“...Fixed a race condition where an old writer instance may not step down after a new writer instance is promoted and continues to write…”
If that is the underlying issue, it would be serious, but without more specifics we can’t draw conclusions.
For context: I do not work for AWS, but I do run several production systems on Aurora PostgreSQL. I will try to reproduce this using the latest versions over the next few hours. If I do not post an update within 24 hours, assume my tests did not surface anything.
That would not rule out a real issue in certain edge cases, configurations, or version combinations but it would at least suggest it is not broadly reproducible.
We have done a similar operation routinely on databases under pretty write intensive workloads (like 10s of thousands of inserts per second). It is so routine we have automation to adjust to planned changes in volume and do so a dozen times a month or so. It has been very robust for us. Our apps are designed for it and use AWS’s JDBC wrapper.
Just one more thing to worry about I guess…
I noticed another interesting (and still unconfirmed) bug with Aurora PostgreSQL around their Zero Downtime Patching.
During an Aurora minor version upgrade, Aurora preserves sessions across the engine restart, but it appears to also preserve stale per-session execution state (including the internal statement timer). After ZDP, I’ve seen very simple queries (e.g. a single-row lookup via Rails/ActiveRecord) fail with `PG::QueryCanceled: ERROR: canceling statement due to statement timeout` in far less than the configured statement_timeout (GUC), and only in the brief window right after ZDP completes.
My working theory is that when the client reconnects (e.g. via PG::Connection#reset), Aurora routes the new TCP connection back to a preserved session whose “statement start time” wasn’t properly reset, so the new query inherits an old timer and gets canceled almost immediately even though it’s not long-running at all.
I think the promotion fails to happen and then an external watchdog notices that it didn’t, and kills everything ASAP as it’s a cluster state mismatch.
The message about the storage subsystem going away is after the other Postgres process was kill -9’d.
If you have a PG cluster with 1 writer, 2 readers, 10Ti of storage and 16k provision IOPs (io1/2 has better latency than gp3), you pay for 30Ti and 48k PIOPS without redundancy or 60Ti and 96k PIOPS with multi-AZ.
The same Aurora setup you pay for 10Ti and get multi-AZ for free (assuming the same cluster setup and that you've stuck the instances in different AZs).
I don't want to figure the exact numbers but iirc if you have enough storage--especially io1/2--you can end up saving money and getting better performance. For smaller amounts of storage, the numbers don't necessarily work out.
There's also 2 IO billing modes to be aware of. There's the default pay-per-IO which is really only helpful for extreme spikes and generally low IO usage. The other mode is "provisioned" or "storage optimized" or something where you pay a flat 30% of the instance cost (in addition to the instance cost) for unlimited IO--you can get a lot more IO and end up cheaper in this mode if you had an IO heavy workload before
I'd also say Serverless is almost never worth it. Iirc provisioning instances was ~17% of the cost of serverless. Serverless only works out if you have ~ <4 hours of heavy usage followed by almost all idle. You can add instances fairly quickly and failover for minimal downtime (of course barring running into the bug the article describes...) to handle workload spikes using fixed instance sizes without serverless
[0] https://dev.to/aws-heroes/100k-write-iops-in-aurora-t3medium...
We have some clusters with very high write IOPS on Aurora.
When looking at costs we modelled running MySQL and regular RDS MySQL.
We found for the IOPS capacity of Aurora we wouldn't be able to match it on AWS without paying a stupid amount more.
Blatant plug time:
I'm actually working for a company right now ( https://pgdog.dev/ ) that is working on proper sharding and failovers from a connection pooler standpoint. We handle failovers like this by pausing write traffic for up to 60 seconds by default at the connection pooler and swapping which backend instance is getting traffic.
Max throughput on gp3 was recently increased to 2GB/s, is there some way I don't know about of getting 3.125?
> General Purpose SSD (gp3) - Throughput > gp3 supports a max of 4000 MiBps per volume
But the docs say 2000. Then there's IOPS... The calculator allows up to 64.000 but on [0], if you expand "Higher performance and throughout" it says
> Customers looking for higher performance can scale up to 80,000 IOPS and 2,000 MiBps for an additional fee.
I think 80k IOPs on gp3 is a newer release so presumably AWS hasn't updated RDS from the old max of 64k. iirc it took a while before gp3 and io2 were even available for RDS after they were released as EBS options
Edit: Presumably it takes some time to do testing/optimizations to make sure their RDS config can achieve the same performance as EBS. Sometimes there are limitations with instance generations/types that also impact whether you can hit maximum advertised throughput
Do you have a problem believing these claims on equivalent hardware?: https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmo...
Or do your own performance assessments, following the published document and templates available so you can find the facts on your own?
For Aurora MySql:
"Amazon Aurora Performance Assessment Technical Guide" - https://d1.awsstatic.com/product-marketing/Aurora/RDS_Aurora...
For Aurora Postgres:
"...Steps to benchmark the performance of the PostgreSQL-compatible edition of Amazon Aurora using the pgbench and sysbench benchmarking tools..." - https://d1.awsstatic.com/product-marketing/Aurora/RDS_Aurora...
"Automate benchmark tests for Amazon Aurora PostgreSQL" - https://aws.amazon.com/blogs/database/automate-benchmark-tes...
"Benchmarking Amazon Aurora Limitless with pgbench" - https://aws.amazon.com/blogs/database/benchmarking-amazon-au...
They made the storage layer in total isolation, and they made sure that it guaranteed correctness for exclusive writer access. When the upstream service failed to also make its own guarantees, the data layer was still protected.
Good job AWS engineering!
Is there a case number where we can reach out to AWS regarding this recommendation?
We use Aurora MySQL but I would like to be able to point to that and ask if it applies to us.
And then I also encountered errors just like op in my app layer about trying to execute a write query via read-only transaction.
The workaround so far is to invalidate connection on error. When app reconnects the cluster write endpoint correctly leads to current primary.
I find this approach very compelling. MSSQL has a similar thing with their hyperscale offering. It's probably the only service in Azure that I would actually use.
1. I need to grab some rows from a table
2. Eventual consistency is good enough
And that's a lot of workloads.
More broadly a distributed system problem
A lot of people seem to have the misconception that "Aurora" is its own unique database system, with different front-ends "pretending" to be Postgres or MySQL, but that isn't the case at all.