A SQL Heuristic: Ors Are Expensive
36 points
8 hours ago
| 9 comments
| ethanseal.com
| HN
ape4
11 minutes ago
[-]
The OR query certainly states user intent better then the rewrite. So its better at communicating with a future programmer.
reply
ntonozzi
1 hour ago
[-]
I really like the extension pattern. I wish more of the tables at my company used it.

Another huge benefit that we're realizing as we move some of our heaviest tables to this pattern is that it makes it really easy to index a core set of fields in Elasticsearch, along with a tag to associate it with a particular domain model. This has drastically cut down on our search-specific denormlization and let us avoid expensive index schema updates.

reply
Glyptodon
23 minutes ago
[-]
If optimization is really as simple as applying De Morgan's laws, surely it could be done within the query planner if that really is the main optimization switch? Or am I misreading this somehow?

Edit: I guess the main difference is that it's just calculating separate sets and then merging them, which isn't really DeMorgan's, but a calculation approach.

reply
Sesse__
11 minutes ago
[-]
In most cases, the end result is a lot messier than it looks in this minimal example. Consider a case where the query is a 12-way join with a large IN list somewhere (which is essentially an OR); would it still look as attractive to duplicate that 12-way join a thousand times?

There _are_ optimizers that try to represent this kind of AND/OR expression as sets of distinct ranges in order to at least be able to consider whether a thousand different index scans are worth it or not, with rather limited success (it doesn't integrate all that well when you get more than one table, and getting cardinalities for that many index scans can be pretty expensive).

reply
Animats
26 minutes ago
[-]
The query optimizer knows how many items are in each index, but has no advance idea how many items will be in the result of a JOIN. An "a OR b" query on a table with millions of rows might have three hits on A, or millions of hits. The optimal query strategy for the two cases is very different.

Has anyone put machine learning in an SQL query optimizer yet?

reply
throwaway81523
20 minutes ago
[-]
Some traditional planners try some different plans on random subsets of the data to see which plan works best. Don't need machine learning, just Bayes' rule.
reply
Sesse__
10 minutes ago
[-]
> The query optimizer knows how many items are in each index, but has no advance idea how many items will be in the result of a JOIN.

Query optimizers definitely try to estimate cardinalities of joins. It's a really, really hard problem, but the typical estimate is _much_ better than “eh, no idea”.

reply
galkk
42 minutes ago
[-]
I strongly dislike the way the polroblem is presented and the “solution” is promoted. Author mentions merge join with count of top, and if th database supports index merges, it can be extremely efficient in described scenario. There are a lot of real optimizations that can be baked in such merges that author chooses to ignore.

The generalized guidance without even mentioning database server as a baseline, without showing plans and/or checking if the database can be guided is just a bad teaching.

It looks like author discovered a trick and tries to hammer everything into it by overgeneralizing.

reply
ethanseal
15 minutes ago
[-]
For sure, there's definitely a lot of cool techniques (and I'm not aware of all of them)! And the first example is very much contrived to show a small example.

I'm not super familiar with the term index merge - this seems to be the term for a BitmapOr/BitmapAnd?

Is there another optimization I'm missing?

The article links to my code for my timings here: https://github.com/ethan-seal/ors_expensive

There is an optimization that went in the new release of PostgreSQL I'm excited about that may affect this - I'm not sure. See [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...

reply
Sesse__
6 minutes ago
[-]
> I'm not super familiar with the term index merge - this seems to be the term for a BitmapOr/BitmapAnd?

Different databases will use similar terms for different operations, but I would guess that the comment refers to something similar to MySQL's index merge (which is essentially reading the row IDs of all the relevant ranges, then deduplicating them, then doing the final scan; it's similar to but less flexible than Postgres' BitmapOr).

reply
prein
1 hour ago
[-]
This sort of thing is why looking at generated SQL while developing instead of just trusting the ORM to write good queries is so important.

I find query planning (and databases in general) to be very difficult to reason about, basically magic. Does anyone have some recommended reading or advice?

reply
parshimers
1 hour ago
[-]
https://pages.cs.wisc.edu/~dbbook/ is a great overview, specifically chapters 13 and 14 on query optimization. it is difficult to reason about though, and every compiler is different. it takes time and enough examples to look at a query and have an intuition for what the plan should look like, and if there's something the compiler is not handling well.
reply
jalk
59 minutes ago
[-]
It's a big help if you know how to retrieve and interpret execution plans for the database you use.
reply
jalk
1 hour ago
[-]
Can we agree that this is only applies to queries where all the filter conditions use cols with indexes? If no indexes can be used, a single full table scan with OR surely is faster than multiple full table scans.
reply
ethanseal
1 hour ago
[-]
Absolutely. Though I don't recall seeing multiple sequential scans without a self-join or subquery. A basic filter within a sequential scan/loop is the most naive/simplest way of performing queries like these, so postgres falls back to that. Also, fwiw, BitmapOr is only used with indexes: https://pganalyze.com/docs/explain/other-nodes/bitmap-or.
reply
jalk
43 minutes ago
[-]
That was the extreme case - the multi-scan would be gotten if a casual reader tried your (neat by all means) AND-only query on a non-indexed table (or partially indexed for that matter).
reply
ethanseal
34 minutes ago
[-]
Gotcha, I misunderstood your comment. The multiple counts is a definitely very contrived example to demonstrate the overhead of BitmapOr and general risk of sequential scans.
reply
cyanydeez
43 minutes ago
[-]
Yeah, just watch community. Every OR splits the universe, duh.
reply
to11mtm
30 minutes ago
[-]
If ORs are expensive you've probably got a poor table design or the ORs are for some form of nasty query that is more UI/Dashboard based.

A good table has the right indexes, and a good API to deal with the table is using the RIGHT indexes in it's criteria to get a good result.

reply