FilterHN

szastamasta

1 hour ago

[-]

Article has so much fluff and only some very coarse information like (we sharded writes, yay!). Almost no detail just keywords for SEO, or whatever they’re aiming for.

There’s also a lot of repetition. Maybe it was AI generated…?

PUSH_AX

19 minutes ago

[-]

Could even be seen as a disguised ad for their infrastructure partner too.

lighthouse1212

18 minutes ago

[-]

The 'single primary with read replicas' pattern scaling to 800M users is the real insight here. Most startups reach for sharding or distributed databases way too early, adding complexity for scale they don't have. If OpenAI can serve hundreds of millions from one Postgres primary by offloading reads and pushing new write-heavy features elsewhere, that's a strong argument for simplicity.

dbuser99

7 hours ago

[-]

I don’t get it. This whole thing says single writer does not scale, so we stopped writing as much and removed reads away from it, so it works ok and we decided that’s enough. I guess thats great.

sbstp

4 hours ago

[-]

This article has very little useful information...

There's nothing novel about optimizing queries, sharding and using read replicas.

ramraj07

2 hours ago

[-]

It has one piece of useful info: their main data store even for 800M users is a single instance of postgres (for writes) without sharding.

spullara

2 hours ago

[-]

when I joined twitter in 2011 there was a single mysql master user (not tweets) database and a few dozen read replicas. it was writing about 7000 updates per second and during bursts it would go too high for the single-threaded replication in mysql at the time to keep up with the master which would cause replication lag and all kinds of annoying things in the app. you just have to pick the right time to make the switch before it is an emergency.

AlisdairO

4 hours ago

[-]

Regarding schema changes and timeouts - while having timeouts in place is good advice, you can go further. While running the schema rollout, run a script alongside it that kills any workload conflicting with the aggressive locks the schema change is trying to take. This will greatly reduce the pain caused by lock contention, and prevent you from needing to repeatedly rerun statements on high-throughput tables.

This would be a particularly nice-to-have feature for Postgres - the option to have heavyweight locks just proactively cancel any conflicting workload. For any case where you have a high-throughput table, the damage of the heavyweight lock sitting there waiting (and blocking all new traffic) is generally much larger than just cancelling some running transactions.

6 minutes ago

[-]

Doesn't Postgres support transactional schema changes already? Why would you want to proactively kill work that's just going to complete after the schema change is done? Load balancing, throttling etc. is a different matter that has little to do with what you're proposing.

_ink_

51 minutes ago

[-]

Did I miss it, or did they not say why they picked CosmoDB? Postgres has also sharding, so instead of moving to a different DB they could have added a new postgres instance with sharding for the new requests.

everfrustrated

11 hours ago

[-]

This is why I love Postgres. It can get you to being one of the largest websites before you need to reconsider your architecture just by throwing CPU and disk at it. At that point you can well afford to hire people who are deep experts at sharding etc.

8 hours ago

[-]

PostgreSQL actually supports sharding out of the box, it's just a matter of setting up the right table partitioning and using Foreign Data Wrapper (FDW) to forward queries to remote databases. I'm not sure what the post is referencing when they say that sharding requires leaving Postgres altogether.

dmix

8 hours ago

[-]

This is specifically what they said about sharding

> The primary rationale is that sharding existing application workloads would be highly complex and time-consuming, requiring changes to hundreds of application endpoints and potentially taking months or even years

manquer

4 hours ago

[-]

> potentially taking months or even years

On one hand OAI sell coding agents and constantly hype how easy it will replace developers and most of the code written is by agents, on the other hand they claim it will take years to refactor

Both cannot be true at the same time.

simonw

7 hours ago

[-]

Genuinely sounds like the kind of challenge that could be solved with a swarm of Codex coding agents. I'm surprised they aren't treating this as an ideal use-case to show off their stack!

gloflo

2 hours ago

[-]

Oh snap! Maybe it's all a great deception for making money?

csto12

6 hours ago

[-]

I read your message, guessed the author, and I’m happy to announce I guessed correctly.

Ozzie_osman

4 hours ago

[-]

Getting the sharing in-place, yes, but maintaining it operationally would still be a headache. Things like schema migrations across shards, resharding, and even observability.

aisuxmorethanhn

6 hours ago

[-]

It wouldn’t work.

8 hours ago

[-]

I know they said that, but in fact sharding is entirely a database-level concern. The application need not be aware of it at all.

EB66

7 hours ago

[-]

Sharding can be made mostly transparent, but it's not purely a DB-level concern in practice. Once data is split across nodes, join patterns, cross-shard transactions, global uniqueness, certain keys hit with a lot of traffic, etc matter a lot. Even if partitioning handles routing, the application's query patterns and its consistency/latency requirements can still force application-level changes.

1 hour ago

[-]

> mostly transparent, but it's not purely a DB-level concern in practice ...

But how would any of that change by going outside Postgres itself to begin with? That's the part that doesn't make much sense to me.

9 minutes ago

[-]

When sharded, anything crossing a shard boundary becomes non-transactional.

Ie. if you shard by userId, then a "share" feature which allows a user to share data with another user by having a "SharedDocuments" table cannot be consistent.

That in turn means you're probably going to have to rewrite the application to handle cases like a shared document having one or other user attached to it disappear or reappear. There are loads of bugs that can happen with weak consistency like this, and at scale every very rare bug is going to happen and need dealing with.

3 minutes ago

[-]

> When sharded, anything crossing a shard boundary becomes non-transactional.

Not necessarily? You can have two-phase commit for cross-shard writes, which ought to be rare anyway.

awesome_dude

3 hours ago

[-]

> Once data is split across nodes, join patterns, cross-shard transactions, global uniqueness, certain keys hit with a lot of traffic

If you're having trouble there then a proxy "layer" between your application and the sharded database makes sense, meaning your application still keeps its naieve understanding of the data (as it should) and the proxy/database access layer handles that messiness... shirley

9rx

2 hours ago

[-]

> At that point you can well afford to hire people who are deep experts at sharding etc.

Can you, though? OpenAI is haemorrhaging money like it is going out of style and, according to the news cycle over the last couple of days, will likely to be bankrupt by 2027.

5 minutes ago

[-]

And typically the bigger the company gets, the harder it is to migrate to a new data model.

You suddenly have literally thousands of internal users of a datastore, and "We want to shard by userId, nobody please don't do joins on user Id anymore" becomes an impossible ask.

mrweasel

1 hour ago

[-]

Running this on Azure Postgresql, even migrating to CosmosDB, cannot be cheap. I know that OpenAI have to deal/relationship with Microsoft, but still, this has to be expensive.

This is however the most down to earth: How we scale Postgresql I've read in a long time. No weird hacker, no messing around with the source code or tweaking the Linux kernel. Running on Azure Postgresql it's not like OpenAI have those options anyway, but still it seems a lot more relatable than: We wrote our own drive/filesystem/database-hack in Javascript.

nasretdinov

57 minutes ago

[-]

I honestly don't understand such negative response tone from the comments. Yes, it does promote Azure, but that's to be expected from a company with is part owned by Microsoft :).

The main point of the article is that it's actually not that hard to live with a single primary Postgres for your transactional workloads (emphasis on _transactional_), and if OpenAI with their 800M+ users can still survive on a single primary (with 50(!) read replicas), so could you, especially before you've reached your first 100M users.

Any non-distributed database or setup is orders of magnitude easier to design for, and it's also typically much more cost efficient too, both in terms of hardware and software too.

There are some curious details, e.g.:

- you can ship WAL to 50 read replicas simultaneously from a single primary and be fine - you can even be using an ORM and still get decent performance - schema changes are possible, and you can just cancel a slow ALTER to prevent production impact - pgbouncer is ok even for OpenAI scale

There are so many things that contradict current "conventional wisdom" based on the experience from what was possible with the hardware 10+ (or even 20+) years ago. Times finally changed and I really welcome articles like these that show how you can greatly simplify your production setup by leveraging the modern hardware.

n_u

6 hours ago

[-]

Cool! I'd love to know a bit more about the replication setup. I'm guessing they are doing async replication.

> We added nearly 50 read replicas, while keeping replication lag near zero

I wonder what those replication lag numbers are exactly and how they deal with stragglers. It seems likely that at any given moment at least one of the 50 read replicas may be lagging cuz CPU/mem usage spike. Then presumably that would slow down the primary since it has to wait for the TCP acks before sending more of the WAL.

tomnipotent

5 hours ago

[-]

> would slow down the primary since it has to wait for the TCP acks

Other than keeping around more WAL segments not sure why it would slow down the primary?

bostik

2 hours ago

[-]

If you use streaming replication (ie. WAL shipping over the replication connection), a single replica getting really far behind can eventually cause the primary to block writes. Some time back I commented on the behaviour: https://news.ycombinator.com/item?id=45758543

You could use asynchronous WAL shipping, where the WAL files are uploaded to an object store (S3 / Azure Blob) and the streaming connections are only used to signal the position of WAL head to the replicas. The replicas will then fetch the WAL files from the object store and replay them independently. This is what wall-g does, for a real life example.

The tradeoffs when using that mechanism are pretty funky, though. For one, the strategy imposes a hard lower bound to replication delay because even the happy path is now "primary writes WAL file; primary updates WAL head position; primary uploads WAL file to object store; replica downloads WAL file from object store; replica replays WAL file". In case of unhappy write bursts the delay can go up significantly. You are also subject to any object store and/or API rate limits. The setup makes replication delays slightly more complex to monitor for, but for a competent engineering team that shouldn't be an issue.

But it is rather hilarious (in retrospect only) when an object store performance degdaration takes all your replicas effectively offline and the readers fail over to getting their up-to-date data from the single primary.

stemchar

16 minutes ago

[-]

> If you use streaming replication (ie. WAL shipping over the replication connection), a single replica getting really far behind can eventually cause the primary to block writes. Some time back I commented on the behaviour: https://news.ycombinator.com/item?id=45758543

I'd like to know more, since I don't understand how this could happen. When you say "block", what do you mean exactly?

winterrx

7 hours ago

[-]

First OpenAI Engineering blog? I'm definitely interested in seeing more and how they handled the rapid growth.

3 minutes ago

[-]

There was a lot of downtime...

I think they handled the massive growth by a lot of 2am emergencies and editing config files directly in production in the hope of fixing fires.

huksley

3 hours ago

[-]

"... If a new feature requires additional tables, they must be in alternative sharded systems such as Azure CosmosDB rather than PostgreSQL...."

So it is not really scaling too much now, rather maintaining current state of things and new features go to a different DB?

CodeCompost

3 hours ago

[-]

Azure CosmosDB is insanely expensive. I can't imagine anybody using it unless you have OpenAI money.

2 minutes ago

[-]

We don't know the profit margins on it... Might not be very expensive if you're an internal user as openAI effectively is for microsoft.

freakynit

2 hours ago

[-]

*Microsoft's money

ggregoire

5 hours ago

[-]

> scaled up by increasing the instance size

I always wondered what kind of instance companies at that level of scalability are using. Anyone here have some ideas? How much cpu/ram? Do they use the same instance types available to everyone, or does AWS and co offer custom hardware for these big customers?

[1] https://learn.microsoft.com/en-us/azure/virtual-machines/siz...

4 hours ago

[-]

The major hyperscalers all offer a plethora of virtual machines SKUs that are essentially one entire two-socket box with many-core CPUs.

For example, Azure Standard_E192ibds_v6 is 96 cores with 1.8 TB of memory and 10 TB of local SSD storage with 3 million IOPS.

Past those "general purpose" VMs you get the enormous machines with 8, 16, or even 32 sockets.[1] These are almost exclusively used for SAP HANA in-memory databases or similar ERP workloads.

Azure Standard_M896ixds_24_v3 provides 896 cores, 32 TB of memory, and 185 Gbps Ethernet networking. This is generally available, but you have to allocate the quota through a support ticket and you may have to wait and/or get your finances "approved" by Microsoft. Something like this will set you back [edited] $175K per month[/edited]. (I suspect OpenAI is getting a huge effective discount.)

Personally, I'm a fan of "off label" use of the High Performance Compute (HPC) sizes[2] for database servers.

The Standard_HX176rs HPC VM size gives you 176 cores and 1.4 TB of memory. That's similar to the E-series VM above, but with a higher compute-to-memory ratio. The memory throughput is also way better because it has some HBM chips for L3 (or L4?) cache. In my benchmarks it absolutely smoked the general-purpose VMs at a similar price point.

[2] https://learn.microsoft.com/en-us/azure/virtual-machines/siz...

antonkochubey

4 hours ago

[-]

> Something like this will set you back $30K-$60K per year

lol, no, cloud is nowhere near that good value. It’s $3.5M annually.

> The Standard_HX176rs HPC VM size gives you 176 cores and 1.4 TB of memory

This one is $124k per year.

3 hours ago

[-]

Thanks for the correction, fixed.

I noticed that the M896i is so obscure and rarely used that there are typos associated with it everywhere including the official docs! In once place is says it has 23 TB of memory when it actually has 32 TB.

https://docs.aws.amazon.com/sap/latest/general/sap-hana-aws-...

manquer

4 hours ago

[-]

On the AWS side there are "HANA certified" instances that max out at 1920 cores and 32 TB RAM - u7inh-32tb.480xlarge

3 hours ago

[-]

I'm pretty sure both Azure and AWS are merely reselling the same HPE Compute Scale-up Server 3200 chassis with some variations. Azure seems to have only the 16-socket model, but AWS has the 32-socket model.

That AWS instance uses these 60-core processors: https://www.intel.com/content/www/us/en/products/sku/231747/...

To anyone wondering about these huge memory systems: avoid them if at all possible! Only ever use these if you absolutely must.

For one, these systems have specialised parts that are more expensive per unit compute: $283 per CPU core instead of something like $85 for a current-gen AMD EPYC, which are also about 2x as fast as the older Intel Scalable Xeons that need to go into this chassis! So the cost efficiency ratio is something like 6:1 in favour of AMD processors. (The cost of the single large host system vs multiple smaller ones can get complicated.)

The second effect is that 32-way systems have huge inter-processor cache synchronisation overheads. Only very carefully coded software can scale to use thousands of cores without absolutely drowning in cache line invalidations.

At these scales you're almost always better off scaling out "medium" sized boxes. A single writer and multiple read-only secondary replicas will take you very far, up to hundreds of gigabits of aggregate database traffic.

cuu508

4 hours ago

[-]

Are there any pictures around of these 8, 16, 32 socket boards? Just curious how they look like.

1 hour ago

[-]

The individual motherboards have only four sockets: https://assets.ext.hpe.com/is/image/hpedam/s00012647?$zoom$#...

Multiple of these can be linked together with “NUMALink” cables, which carry the same protocol as the traces that go between sockets on the motherboard. You end up with a single kernel running across multiple chassis.

neya

1 hour ago

[-]

Out of pure boredom and tired of all these Chat websites selling my data and with ChatGPT's new update on ads - I decided enough was enough and created my own Chat application for privacy. Like every other architect, I searched for a good database and eventually gave up on specialized ones for chat because they were either too expensive to host or too complex to deal with. So, I simply just used PostgreSQL. My chat app has basic RAG, not ground breaking or anything - but the most important feature I made was ability to add different chat models into one group chat. So, when you ask for opinions on something - you are not relying on just a single model and you can get a multi-model view of all the possible answers. Each model can have its own unique prompt within the group chat. So basically, a join table.

Months passed by since this application was developed (a simple Phoenix/Elixir backend), and yesterday I was casually checking my database to see how many rows it had - about 500,000+ roughly. I didn't notice a single hint of the volume the Postgres was handling, granted - I'm the only user, but there's always a lot going on - RAG, mostly that requires searching of the database for context before multiple agents send you a response (and respond amongst themselves). Absolutely zero performance degradation.

I'm convinced that Postgres is a killer database that doesn't get the attention it deserves over the others (for chat). Already managing some high traffic websites (with over 500M+ requests) with no issues, so I am extremely unsurprised that it works really well for chat apps at scale too.

KellyCriterion

2 hours ago

[-]

I would be super curious about:

How do they store all the other stuff related to operating the service? This must be a combination of several components? (yes, including some massdata storage, Id guess?)

This would be cool to understand, as Ive absolutely no idea how this is done (and could be done :-)

PunchyHamster

1 hour ago

[-]

Weird, I'd imagine for kind of use they are getting it would be easy to shard the infrastructure to entirely separate instances

bzmrgonz

9 hours ago

[-]

Someone ask Microsoft what does it feel to be bested by an open source project on their very own cloud platform!!! Lol.

doodlesdev

6 hours ago

[-]

They don't care. Azure has a revenue higher than GCP, losing only to AWS. It's Microsoft's new baby, and they love it, no matter what you want to run there. Also, they're still the 4th largest company by market cap.

Honestly, only us nerds in Hacker News care about this kind of stuff :) (and that's why I love it here).

edit: also, the article cites OpenAI did adopt Azure Cosmos DB for new stuff they want to shard. Still shows how far you can take PostgreSQL though.

bdcravens

5 hours ago

[-]

That ship sailed a long time ago, as Microsoft has offered Linux VMs in Azure for 14 years, and today, about 2/3 of VMs running there are Linux. In the public cloud era, owning the infrastructure and customer base is far more important than licenses.

beoberha

6 hours ago

[-]

Are you saying this because OpenAI didnt choose SQL Server?

csto12

6 hours ago

[-]

In 2026 is SQL Server ever the answer?

Tostino

6 hours ago

[-]

It really is a good database. Give it lots of room. If you can distribute your workload on multiple machines though, you can't beat Postgres' licencing terms vs SQL Server.

tormeh

3 hours ago

[-]

Why is it a good database? Integration with Entra? I've heard arguments in favor of Oracle DB, but I've never heard anything good about MSSQL besides integration with the MS ecosystem.

beoberha

5 hours ago

[-]

That’s kind of my point. They’re not really in competition. I bet they’d have an easier time with this scale if they were on SQL Server, but obviously that migration isn’t happening and startups don’t reach for it for many reasons.

everfrustrated

57 seconds ago

[-]

The software licencing of 50 read replicas alone would make sqlserver a non-starter

DLA

7 hours ago

[-]

And the same for Linux boxes on Azure - they dominate Windows servers by a huge margin.

esjeon

8 hours ago

[-]

Azure offers Postgres “DBaaS”, so I’m pretty sure they are no where near that stage. It’s more likely that we should watch out for the Microsoft E-E-E strategy.

hu3

11 hours ago

[-]

From what I understand they basically couldn't scale writes in PostgreSQL to their needs and had to offload what they could to Azure's NoSQL database.

I wonder, is there another popular OLTP database solution that does this better?

> For write traffic, we’ve migrated shardable, write-heavy workloads to sharded systems such as Azure CosmosDB.

> Although PostgreSQL scales well for our read-heavy workloads, we still encounter challenges during periods of high write traffic. This is largely due to PostgreSQL’s multiversion concurrency control (MVCC) implementation, which makes it less efficient for write-heavy workloads. For example, when a query updates a tuple or even a single field, the entire row is copied to create a new version. Under heavy write loads, this results in significant write amplification. It also increases read amplification, since queries must scan through multiple tuple versions (dead tuples) to retrieve the latest one. MVCC introduces additional challenges such as table and index bloat, increased index maintenance overhead, and complex autovacuum tuning.

menaerus

25 minutes ago

[-]

I was thinking about the same paragraph because write-amplification is exactly the problem solved by LSM trees _and_ they already have a solution for that in-house - one of the first acquisitions that OpenAI made is Rockset - a company that actually built the RocksDb at scale.

So, this is the part that actually made me left wondering why.

0xdeafbeef

10 hours ago

[-]

Tidb should handle it nice. I've wrote 200к inserts / sec for hour in peak. Underlying lsm works better for writes

anonzzzies

10 hours ago

[-]

That would mean it improved somewhat. We always got better write performance from mysql vs postgres, however that is a while ago; we then tried tidb to go further but it was basically rather slow. Again, a while ago.

When did you get your results, might be time to re-evaluate.

0xdeafbeef

2 hours ago

[-]

It was 1 year ago. Around 15 tikv serves, 32 cpu, 128 ram each, 4 tb nvme. In this case latency matters a lot. When i had load server in different region with ping of 3ms I got 70k inserts, when moved to the same region with sub ms ping it went to thousands

ph4evers

5 hours ago

[-]

Nice write up! It is cool to see that PostgreSQL is still standing. Adyen has some nice blog posts about squeezing the max out of PostgreSQL https://medium.com/adyen/all?topic=postgres

kachapopopow

1 hour ago

[-]

Why does everyone make a "how we scaled PostgreSQL" article.

killingtime74

3 hours ago

[-]

Why a single postgres? Why not shard by users?

ramraj07

2 hours ago

[-]

They literally answer that in the post. They started with a single instance and realized sharding the existing tables will be too much work (they'll slowly migrate to new tables instead).

QuiCasseRien

8 hours ago

[-]

I like the way of thinking. Instead of migrating to another database, they keep that awesome one running and found smart workaround to push limits.

hahahahhaah

6 hours ago

[-]

It is what mature engineering does. Migrations are not fun.

mannyv

5 hours ago

[-]

Uh, they scaled PostgreSQL by offloading a lot of it to Azure CosmosDB.

I'm not sure that's the answer people are looking for.

ed_mercer

10 hours ago

[-]

Why does the [Azure PostgreSQL flexible server instance] link point to Chinese Azure?

noxs

8 hours ago

[-]

All names are Asian and mostly Chinese

> Author Bohan Zhang

> Acknowledgements Special thanks to Jon Lee, Sicheng Liu, Chaomin Yu, and Chenglong Hao, who contributed to this post, and to the entire team that helped scale PostgreSQL. We’d also like to thank the Azure PostgreSQL team for their strong partnership.

Natfan

10 hours ago

[-]

Bohan Zhang, the article's author, is likely Chinese.

e: and the link points to en-us at time of writing. I frankly don't see the value in your comment.

DeathArrow

1 hour ago

[-]

"This effort demonstrates that with the right design and optimizations, Azure PostgreSQL can be scaled to handle the largest production workloads."

Sure, but choosing from the start a DB that can scale with ease would have taken far less time and effort.

You can bend any software into doing anything, but is it worth it?

PunchyHamster

1 hour ago

[-]

They could've just sharded it; their users are not interconnected, it would be easy to just have 128 shards and then assign user to one by org/user hash

qaq

3 hours ago

[-]

for people not burning billions of VC $ sharding Postgres is not a bad option.

ahmetozer

6 hours ago

[-]

ai written blog, its very generic and same context is repated many times

trhway

2 hours ago

[-]

"However, some read queries must remain on the primary because they’re part of write transactions. "

if there is a read replica that has reached required snapshot - it is usually enough (depends on your task of course) for it to be the snapshot that was at the start of your transaction - and if the read query doesn't need to read your transaction uncommitted data, then that replica can serve the read query.

LudwigNagasena

12 minutes ago

[-]

TL;DR There is no secret sauce, it's the same set of techniques you’ve seen in most PostgreSQL scaling guides. Those techniques do work.