The challenges of supporting foreign key constraints
118 points | 1 year ago | planetscale.com
leetrout
1 year ago
[-]
Every time people would champion Vitess / PS as a drop-in fix, this was the caveat that would slow them down or stop them.

This feature truly is a game-changer for adoption IMO.

Congrats on shipping to all involved!

reply
irq-1
1 year ago
[-]
> At this time, foreign key constraint support is limited to unsharded/single shard databases.
reply
AYBABTME
1 year ago
[-]
The Venn diagram of people needing sharding and still using foreign keys is probably empty.
reply
leetrout
1 year ago
[-]
Worth noting, the competing products from Neon and Cockroach Labs support FKs.
reply
AYBABTME
1 year ago
[-]
Cockroach supports it, but like for MySQL, PG, PlanetScale, or most DBs at scale, you shouldn't use them.

Neon isn't in the same league at all so I think what they support doesn't matter.

reply
abraae
1 year ago
[-]
> but like for MySQL, PG, PlanetScale, or most DBs at scale, you shouldn't use them.

Tell that to the people who have been successfully building massive databases in RDBMS for decades making extensive use of FKs and FK constraints.

reply
nikita
1 year ago
[-]
Neon is not a shared-nothing architecture but shared storage. That makes it 100% compatible with Postgres, which has supported FK constraints since 1989.
reply
nikita
1 year ago
[-]
Also would be useful to add a disclaimer that you work for PlanetScale.
reply
adam_gyroscope
1 year ago
[-]
(And you work for neon)
reply
shlomi-noach
1 year ago
[-]
Thank you!
reply
shlomi-noach
1 year ago
[-]
Post author here, happy to answer technical questions.
reply
guptamanan100
1 year ago
[-]
I am the other post-author, and I am available too.
reply
padre24
1 year ago
[-]
This was a great read, thanks.

Are there any plans to support recursive CTEs? What are the technical challenges there?

reply
guptamanan100
1 year ago
[-]
Thank you for the compliment!

We recently started adding support for CTEs in Vitess! You can check out https://github.com/vitessio/vitess/pull/14321 if you want to see some technical details of the implementation.

For now, we have added preliminary support by converting them to derived tables internally, but we believe we need to make CTEs first-class citizens of query planning, specifically because recursive CTEs are very hard to model as derived tables. Once we make that change, we can look toward supporting recursive CTEs.

This however will take some time, but then, all good things do!

reply
padre24
1 year ago
[-]
Awesome news! We work with hierarchical data, so it was a non-starter for us.
reply
derekperkins
1 year ago
[-]
If you don't need recursive CTEs on sharded databases, they'll work today. We're actively using them
reply
guptamanan100
1 year ago
[-]
Oh, I see, that's unfortunate! Hopefully we can remedy that as soon as possible, though.
reply
jtriangle
1 year ago
[-]
So how do you deal with orphaned child rows when reverting? I assume it's up to your users to deal with them or not? This very much seems like a clever automation for choosing when to care about foreign key constraints rather than outright enforcement.
reply
shlomi-noach
1 year ago
[-]
Yes, you got that right! If you drop a foreign key constraint from a child table, then follow up with INSERT/DELETE on parent and child in a way that is incompatible with foreign key constraints, and then revert, the child, now again with the foreign key constraint, can have orphaned rows. It's as if you did `SET FOREIGN_KEY_CHECKS=0` and manipulated the data unobstructed. The schema itself remains valid, and some rows do not comply.

It's worth noting that MySQL has no problem with this kind of situation. It never cares about the existence of orphaned rows. It only cares about not letting you create them in the first place, and about cleaning up. But it doesn't blow up if orphaned rows do exist. They just become ghosts.
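
A minimal MySQL sketch of how such orphans come to exist under a still-valid schema (hypothetical parent/child tables):

    SET FOREIGN_KEY_CHECKS=0;
    -- child rows referencing parent id 1 become orphans; no error is raised
    DELETE FROM parent WHERE id = 1;
    SET FOREIGN_KEY_CHECKS=1;
    -- existing orphans are not flagged; only new writes to child are checked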

reply
saltcured
1 year ago
[-]
Coming from a PostgreSQL worldview, I find this confusing. To me a foreign key constraint is about the referential integrity of a table, not an insert-time rule that can have loopholes to leave invalid data in the table. If the constraint is in place, I should be able to trust that any queryable data satisfies the constraint.

Also from my PostgreSQL-infused worldview, it seems to me you are making your life too difficult by requiring a migration to make "one change" to a table or view. The brute force idiom I've seen for schema migrations is to break it into phases:

1. drop departing foreign key constraints

2. restructure table columns/types and values

3. add new foreign key constraints
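
In PostgreSQL terms, a minimal sketch of those phases as ordered statements (hypothetical parent/child tables and constraint name):

    BEGIN;
    -- 1. drop departing foreign key constraints
    ALTER TABLE child DROP CONSTRAINT child_parent_fk;
    -- 2. restructure table columns/types and values
    ALTER TABLE parent ALTER COLUMN id TYPE bigint;
    ALTER TABLE child ALTER COLUMN parent_id TYPE bigint;
    -- 3. add new foreign key constraints
    ALTER TABLE child ADD CONSTRAINT child_parent_fk
        FOREIGN KEY (parent_id) REFERENCES parent (id);
    COMMIT;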

This is a bit like running with constraints deferred while making data changes that might look invalid until all data changes are done. But, it defers expression of the new constraints until the table structures are in place to support their definitions too, so it isn't just about deferring enforcement.

The same strategy can be used for import scenarios to support schemas where there are circular foreign key reference constraints. I.e. tables are not in a strict parent-child hierarchy.

reply
shlomi-noach
1 year ago
[-]
Valid points! In my experience, when someone has foreign key constraints in their database, they tend to develop their apps in a "trusting" way. Meaning, the app trusts the DB to maintain referential integrity. When you do the three-step breakdown, you remove that integrity, and the app doesn't know any better: it keeps feeding the database with data, the database says "fine", and the app assumes referential integrity is preserved.

This is why in our design PlanetScale will not take upon itself to do this three step change. The user is more than welcome to break this into three different (likely they'll be able to make it in just two) schema changes. But then the user takes ownership of handling unchecked references.

> making data changes that might look invalid until all data changes are done

In effect, the data _will be_ invalid, and potentially for many hours.

Now, it's true that if the user messed up the data in between, then adding the foreign key constraint will fail, in both PostgreSQL and in MySQL. To me, this signals more bad news, because now the user has to scramble to clean up whatever incorrect data they have, before they're able to complete their schema change and unblock anyone else who might be interested in modifying the table.

Personally, my take is to not use foreign key constraints on large scale databases. It's nice to have, but comes at a great cost. IMHO referential data integrity should be handled, gracefully, by the app. Moreover, referential integrity is but one aspect of data integrity/consistency. There are many other forms of data integrity, which are commonly managed by the app, due to specific business logic. I think the app should own the data as much as it can. My 2c.

reply
saltcured
1 year ago
[-]
Ah, in my worldview those steps are still in a single transaction. It's just formulated as several ordered statements.

Is that mechanism unique to PostgreSQL? Or is it just that this transaction is impractical for you, due to the wall clock duration and how it impacts other use of the DB?

reply
aidos
1 year ago
[-]
I have a rule not to revert migrations in production. If things have gone wrong enough that you need to roll back, then there are all sorts of ways that trying to go back again could be bad news.

My main worry is that I'd end up with three classes of data: pre-migration, changed/added post-migration, and post-revert. It's probably fine in most cases, but that could take quite some unpicking.

reply
shlomi-noach
1 year ago
[-]
We've all been there and have been hit hard by rolling back data as well as by not rolling back data. What we do with PlanetScale Reverts, though, is to preserve your data through the rollback. There aren't three classes of data, just one (or, it's nuanced, let's call it 1.5).

As you complete the migration, PlanetScale keeps your old table and continues to sync any further incoming changes to the new table back to the old one. They're kept in sync, apart, of course, from whatever incompatible schema changes they may have. As you revert, we flip the two, placing your old table back, now not only with the old data but also with all the newly accumulated data.

I completely appreciate the roll-forward approach. That's something we do for code deployments. Except when we don't, when there's that particular change where the best approach is to revert a commit, revert a PR. I think of schema deployments in the same way. Reverts will be rare, hopefully, but they can save the day.

reply
_a_a_a_
1 year ago
[-]
I've not met PlanetScale before and didn't understand why you're doing this. From the site:

"Changing a table's schema is one of the most challenging problems in relational databases, and in MySQL in particular. In today's accelerated and rapid development cycles, engineers find that they need to make schema changes sometimes on a daily basis."

Okay, but is this really true? Is it really necessary to have no downtime except in a very few cases? It's honestly rare to have systems that absolutely must stay up all the time, but allowing fiddling with the schema as well just seems excessive.

I'd also say, as a software dev, I'd be very uncomfortable with a system like yours that claims to maintain consistency while running and changing the schema, short of a formal proof.

reply
rdoherty
1 year ago
[-]
Once your schema changes take more than a few minutes, yes. There's a lot of toil and burden if you need to take down your application every time you need a schema change: announcements, coordination with internal teams and customers, and then coordination with other engineers.

We aren't talking about zero downtime here, but continual, recurring downtime due to schema changes. Once you have beyond a few million rows in a normal RDBMS, schema changes can take minutes to hours depending on the type. Do this a few times per month and you now have 'lots' of downtime and you are blocking other engineering work from happening. It eventually becomes so much of a hassle that engineers don't want to do schema changes, blocking feature work. The more seamless and painless you can make them, the better.

reply
_a_a_a_
1 year ago
[-]
I'm going to have to be a bit contrary here.

How often do you expect to make the schema changes? I mean I quoted this bit "...make schema changes sometimes on a daily basis" – is this realistic, or a kind of business insanity typically caused by bad management? Ditto "...but continual, recurring downtime due to schema changes". This really looks like a failure of management rather than a technical problem to be solved.

Also, aren't you often likely to be doing something larger than just a schema change, in which case that would necessitate updating your application as well, so changes are not restricted to just the database? You now have a bigger problem of coordinating app and DB changes.

I also asked whether you need permanent uptime because in a lot of systems, especially smaller ones (and by the long tail most systems are going to be smallish), the users are very tolerant of an hour's downtime a month, for example.

"Once you have beyond a few million rows in a normal RDBMS, schema changes can take minutes to hours depending on the type"

That's a pretty strong claim; what kind of thing is going to take hours that your database can do consistently? Does it even take hours? I had a 100,000,000 row table of unique ints lying around so I put a foreign key from itself to itself (a bit daft, but just for timing purposes. DB is MS SQL, table is fully hot in memory)

alter table [tmp_ints_clustered] add constraint ffffkkkkk foreign key (x) references [tmp_ints_clustered](x);

21 seconds.

What you're doing (if you can get it correct! Which I have to wonder at) is doubtless excellent for some very large companies, but in general... I'm afraid I'm not so sure.

Edit: I feel I'm perhaps missing your bigger picture.

reply
dylan604
1 year ago
[-]
I agree with your pushback. Even in DEV, I'm not making daily schema changes. In fact, I hate schema changes and go back and forth on whether a change is really necessary. Sure, changes do become necessary, but sheesh, daily is a sign to me that something else needs to be looked at in the dev cycle. Like, is nobody forward-thinking enough to come up with a workable schema? Are the requirements truly being made up by the seat of the pants? Also, are the new schema requests really necessary on existing tables, or can we hang a new table and extend the joins? It seems like taking a bit of time to do some forward thinking on the initial schema should keep daily changes from being needed.
reply
saltcured
1 year ago
[-]
On the other hand, a reluctance to do schema changes often just leads to a de facto schemaless system if new concepts start getting hacked into generic fields instead of properly modeled.

When you make schema evolution really easy to do with no downtime, you can also start doing phased deployments where you think in terms of backwards-compatibility. First you add some new/optional parts to the schema. Then you update the applications to use it, where the application has to tolerate data with or without the new bits.

Eventually, you might convert something from optional to required, first for new data entry and later as a conversion of existing data, where that makes sense. Then finally you might deprecate the application ability to handle the old missing bits, and remove that from your code maintenance problems.

reply
_a_a_a_
1 year ago
[-]
Maybe proper releases, running on different machines, might be a better option most of the time.
reply
salawat
1 year ago
[-]
Perhaps we shouldn't be collecting and retaining such large datasets that these issues become such a pressing problem?
reply
klooney
1 year ago
[-]
I feel like you're implying that this is caused by personal data collection and tracking, but it's not- you can get there pretty easily, in a small to medium sized app, with just user tables, or users + things configured.

The giant data lakes for vacuuming up tracking data generally never do schema migrations at all.

reply
paulddraper
1 year ago
[-]
Interesting, elaborate.
reply
phkahler
1 year ago
[-]
>> In today's accelerated and rapid development cycles, engineers find that they need to make schema changes sometimes on a daily basis.

Surely that's a bad design, and capability to support it is enabling continued bad design.

reply
Exuma
1 year ago
[-]
Can someone help me with a suggestion?

I've been researching databases for several days straight. The choices are overwhelming, but I've pretty much narrowed my use case down to an RDBMS.

I essentially need to handle hundreds of millions of "leads" (tens of millions per day, over 1B total), which can be made up of any number of user fields.

I need to resolve duplicate leads in realtime or near-realtime. A duplicate can occur across a combination of one or more fields, so it's basically OLTP-type operations (select, update, delete on single rows).

I do need to run large OLAP queries as well across all data

I've looked at things like Scylla and whatnot, but they seem too heavy-duty for my volume. It's not like I need to store trillions of messages like Discord in some huge event log.

I was considering these 3 options...

1. planetscale

2. citus

3. cockroachdb

I haven't really narrowed it down further than this, but I liked the idea of still having RDBMS features without needing to worry about storage and scaling under sheer write volume.

It seemed I could then do the basic OLTP stuff I need, and Citus had a cool demo where an OLAP query on 1B rows ran in 20s with 10 nodes, which is also a reasonable query time (BI tools will be used for that).

reply
financltravsty
1 year ago
[-]
Postgres with upserts (and triggers if your de-duplication logic can't be handled on the backend)?
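
For example, a minimal upsert sketch for de-duplicating on a single field (hypothetical leads table and columns; assumes a unique index on email):

    INSERT INTO leads (email, source, payload)
    VALUES ($1, $2, $3)
    ON CONFLICT (email) DO UPDATE
        SET source     = EXCLUDED.source,
            payload    = EXCLUDED.payload,
            updated_at = now();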

OLAP works here, but it's not great depending on how fast you need info to be available. If you're generating reports, Postgres is fine as long as your queries are properly optimized, and you can get results within minutes for massive workloads. If you need near-instant (sub-1s) results, I would recommend you sync your RDBMS to a columnar database like ClickHouse and let the better data layout work in your favor, rather than trying to constrain a row-based DB to act like something it's not.

Otherwise, both are rock solid and simple to use. I've dealt with more intensive workloads than you mentioned, with the same use-case and Postgres worked very well. ClickHouse never had a problem.

reply
Exuma
1 year ago
[-]
Awesome, thank you. That's kind of what I was thinking; I'm glad you confirmed it. How exactly did you sync your data from PG -> ClickHouse?

I was considering using something like Airbyte, but then I thought this may actually be complex: if PG rows are updated/deleted, I also need to sync single rows (or groups of rows) to ClickHouse, and I wasn't sure how well that was supported.

reply
financltravsty
1 year ago
[-]
What I did in my case was set up streaming replication to send the Postgres WAL over to another service that would update a ClickHouse cluster. Essentially, every time a WAL file is closed, a batch of all the SQL commands that were committed is sent over the wire.

It might be easier to find some "change data capture" product that will do that for you though (like Airbyte). I can't give any recommendations here, however.
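
CDC tools for Postgres typically sit on top of logical decoding rather than raw WAL shipping; the Postgres side of one common setup looks roughly like this (hypothetical names; a sketch, not necessarily what I used):

    -- postgresql.conf: wal_level = logical (requires a restart)
    CREATE PUBLICATION leads_pub FOR TABLE leads;
    -- a consumer (e.g. a CDC service) then streams row-level changes from a slot
    SELECT pg_create_logical_replication_slot('leads_slot', 'pgoutput');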

reply
tw1ser
1 year ago
[-]
I can recommend a fellow YC member, PeerDB [0], for this, but I don't know if they support ClickHouse as a destination.

[0] - https://www.peerdb.io/

reply
winrid
1 year ago
[-]
Pretty much anything will work at that scale depending on your SLAs. You could use Mongo with the higher compression option, add a couple indexes, and be golden. Just do the reporting off a live secondary. I store billions of documents in Mongo on unimpressive hardware (64gb ram, GP3 EBS), adding millions a day. Mongo isn't super fast at aggregations, though...

What kind of aggregation queries? Can you do pre-aggregation?

Citus would probably be my pick if you want SQL.

Feel free to email.

reply
nick-sta
1 year ago
[-]
Have a look into SingleStore - it seems like a nice fit for this use case.
reply
bob1029
1 year ago
[-]
Every time I turn on FK constraints I wind up regretting it and ripping them out of the schema. Not one time have I ever run a non-query, received a constraint violation, and thought to myself "thank god, what a relief".

Any time I am doing something that could violate some logical constraint, I am probably trying to fix something that is already fucked. The safety nannies simply add insult to injury at this stage. I will delete broken rows in whatever order I please.

If constraint violations are adding value to your solution, you probably have a far more severe problem repeatedly slamming into the wall somewhere.

reply
Knufferlbert
1 year ago
[-]
Without knowing details, I can only assume you are misunderstanding something. Everyone I've worked with, myself included, has had bugs prevented by FK constraints. They prevent data from getting into a bad state, instead of it piling up and having to be fixed expensively afterwards. Not once have I thought "I wouldn't have had this problem without FKs", and every time I thought "oh yeah, I forgot this path, that would have been a problem".

Having to write code that handles foreign key violations because the DB doesn't check them is a major pain. (We use Cassandra, for example, so there is often a "foreign key" from a PG row to a Cassandra row; obviously that can't be enforced at the DB level, so application code has to do the work.)

As for deleting/updating data, FKs can be a bit annoying, but PostgreSQL, for example, has (at least) two options:

1) Cascade delete (possibly dangerous), which traverses the FKs for you and deletes the dependent rows.

2) Checking FKs (and other constraints) at commit time. Instead of checking whether each delete/update statement violates an FK, it checks at the end, after all the delete/update statements have run. These are called deferrable constraints.
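
A minimal PostgreSQL sketch of option 2 (hypothetical table and constraint names):

    ALTER TABLE child
        ADD CONSTRAINT child_parent_fk
        FOREIGN KEY (parent_id) REFERENCES parent (id)
        DEFERRABLE INITIALLY IMMEDIATE;

    BEGIN;
    SET CONSTRAINTS child_parent_fk DEFERRED;
    DELETE FROM parent WHERE id = 1;        -- temporarily violates the FK
    DELETE FROM child WHERE parent_id = 1;  -- violation resolved before commit
    COMMIT;                                 -- the FK is checked here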

reply
perrygeo
1 year ago
[-]
Strange, I was just reflecting this morning on how grateful I am that FK constraints are respected. I got an FK constraint violation, and without it I would have introduced a bug in my app. Specifically, deleting X was going to delete many child Ys.

If there are any downstream queries that assume the relationship between X and Y, and you accidentally violated that contract, wouldn't you WANT the database to tell you? Without a FK constraint in place, I would just have to know that I needed to update hundreds of lines of (incidentally) dependent code!

It's roughly analogous to the static vs dynamic typing debate. Do you want to discover FK violations at runtime in the future, or head them off at insert time? Either way, you cannot sweep referential integrity under the rug... you only shift the burden from one place to another.

So I thank my database daily for its work enforcing FK constraints. Otherwise I'd have to write that code! You can't ignore the lifecycle of references.

reply
falserum
1 year ago
[-]
The benefit is the 100% guarantee that queried child has a parent.

If code has many maintainers, pinky promises to keep referential integrity are bound to be broken eventually. Humans like shortcuts.

reply
ranting-moth
1 year ago
[-]
You should definitely put that on your CV because it is incredibly important information.
reply
obviyus
1 year ago
[-]
The PlanetScale blog consistently puts out such high-quality posts. I can highly recommend following their YouTube channel as well. I’ve learned a ton from the incredibly well-made videos Aaron puts out!
reply
guptamanan100
1 year ago
[-]
Yep! Aaron is great, isn't he!!? I make it a point to watch all the videos he puts out. We appreciate your support!
reply
chrisjc
1 year ago
[-]
Thanks for the recommendation.

https://www.youtube.com/@PlanetScale

Subscribed.

reply