FilterHN

A SQL Heuristic: Ors Are Expensive

36 points

by ethanseal

8 hours ago

| past

| 9 comments

| ethanseal.com

| HN

▲

ape4

11 minutes ago

[-]

The OR query certainly states user intent better then the rewrite. So its better at communicating with a future programmer.

▲

ntonozzi

1 hour ago

[-]

I really like the extension pattern. I wish more of the tables at my company used it.

Another huge benefit that we're realizing as we move some of our heaviest tables to this pattern is that it makes it really easy to index a core set of fields in Elasticsearch, along with a tag to associate it with a particular domain model. This has drastically cut down on our search-specific denormlization and let us avoid expensive index schema updates.

▲

Glyptodon

23 minutes ago

[-]

If optimization is really as simple as applying De Morgan's laws, surely it could be done within the query planner if that really is the main optimization switch? Or am I misreading this somehow?

Edit: I guess the main difference is that it's just calculating separate sets and then merging them, which isn't really DeMorgan's, but a calculation approach.

▲

Sesse__

11 minutes ago

[-]

In most cases, the end result is a lot messier than it looks in this minimal example. Consider a case where the query is a 12-way join with a large IN list somewhere (which is essentially an OR); would it still look as attractive to duplicate that 12-way join a thousand times?

There _are_ optimizers that try to represent this kind of AND/OR expression as sets of distinct ranges in order to at least be able to consider whether a thousand different index scans are worth it or not, with rather limited success (it doesn't integrate all that well when you get more than one table, and getting cardinalities for that many index scans can be pretty expensive).

▲

Animats

26 minutes ago

[-]

The query optimizer knows how many items are in each index, but has no advance idea how many items will be in the result of a JOIN. An "a OR b" query on a table with millions of rows might have three hits on A, or millions of hits. The optimal query strategy for the two cases is very different.

Has anyone put machine learning in an SQL query optimizer yet?

▲

throwaway81523

20 minutes ago

[-]

Some traditional planners try some different plans on random subsets of the data to see which plan works best. Don't need machine learning, just Bayes' rule.

▲

Sesse__

10 minutes ago

[-]

> The query optimizer knows how many items are in each index, but has no advance idea how many items will be in the result of a JOIN.

Query optimizers definitely try to estimate cardinalities of joins. It's a really, really hard problem, but the typical estimate is _much_ better than “eh, no idea”.

▲

galkk

42 minutes ago

[-]

I strongly dislike the way the polroblem is presented and the “solution” is promoted. Author mentions merge join with count of top, and if th database supports index merges, it can be extremely efficient in described scenario. There are a lot of real optimizations that can be baked in such merges that author chooses to ignore.

The generalized guidance without even mentioning database server as a baseline, without showing plans and/or checking if the database can be guided is just a bad teaching.

It looks like author discovered a trick and tries to hammer everything into it by overgeneralizing.

▲

ethanseal

15 minutes ago

[-]

For sure, there's definitely a lot of cool techniques (and I'm not aware of all of them)! And the first example is very much contrived to show a small example.

I'm not super familiar with the term index merge - this seems to be the term for a BitmapOr/BitmapAnd?

Is there another optimization I'm missing?

The article links to my code for my timings here: https://github.com/ethan-seal/ors_expensive

There is an optimization that went in the new release of PostgreSQL I'm excited about that may affect this - I'm not sure. See [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...

▲

Sesse__

6 minutes ago

[-]

> I'm not super familiar with the term index merge - this seems to be the term for a BitmapOr/BitmapAnd?

Different databases will use similar terms for different operations, but I would guess that the comment refers to something similar to MySQL's index merge (which is essentially reading the row IDs of all the relevant ranges, then deduplicating them, then doing the final scan; it's similar to but less flexible than Postgres' BitmapOr).

▲

prein

1 hour ago

[-]

This sort of thing is why looking at generated SQL while developing instead of just trusting the ORM to write good queries is so important.

I find query planning (and databases in general) to be very difficult to reason about, basically magic. Does anyone have some recommended reading or advice?

▲

parshimers

1 hour ago

[-]

https://pages.cs.wisc.edu/~dbbook/ is a great overview, specifically chapters 13 and 14 on query optimization. it is difficult to reason about though, and every compiler is different. it takes time and enough examples to look at a query and have an intuition for what the plan should look like, and if there's something the compiler is not handling well.

▲

jalk

59 minutes ago

[-]

It's a big help if you know how to retrieve and interpret execution plans for the database you use.

▲

jalk

1 hour ago

[-]

Can we agree that this is only applies to queries where all the filter conditions use cols with indexes? If no indexes can be used, a single full table scan with OR surely is faster than multiple full table scans.

▲

ethanseal

1 hour ago

[-]

Absolutely. Though I don't recall seeing multiple sequential scans without a self-join or subquery. A basic filter within a sequential scan/loop is the most naive/simplest way of performing queries like these, so postgres falls back to that. Also, fwiw, BitmapOr is only used with indexes: https://pganalyze.com/docs/explain/other-nodes/bitmap-or.

▲

jalk

43 minutes ago

[-]

That was the extreme case - the multi-scan would be gotten if a casual reader tried your (neat by all means) AND-only query on a non-indexed table (or partially indexed for that matter).

▲

ethanseal

34 minutes ago

[-]

Gotcha, I misunderstood your comment. The multiple counts is a definitely very contrived example to demonstrate the overhead of BitmapOr and general risk of sequential scans.

▲

cyanydeez

43 minutes ago

[-]

Yeah, just watch community. Every OR splits the universe, duh.

▲

to11mtm

30 minutes ago

[-]

If ORs are expensive you've probably got a poor table design or the ORs are for some form of nasty query that is more UI/Dashboard based.

A good table has the right indexes, and a good API to deal with the table is using the RIGHT indexes in it's criteria to get a good result.