The curious case of the aggregation query
76 points
1 year ago
| 4 comments
| modern-sql.com
| HN
crdrost
1 year ago
[-]
For a while now, whenever I have taught SQL to people (which was only I think 3 times over like 8 years) I ask them to imagine that the word SELECT can be replaced with AGGREGATE which unlocks these magical words GROUP BY and SUM and so forth, and they understand me perfectly well: and then I drop on them that I was lying and the magic word is still SELECT but you have to know which kind of SELECT it is from context clues and they are like “What???? Whyyyyyy.”
reply
derefr
1 year ago
[-]
To be honest, teaching SQL "syntax in" seems like a really backwards way of getting an understanding of what SQL is doing.

I run the tech side of a company almost entirely built on complex SQL queries. If I were to take a fresh college student and "teach them SQL"... well, I wouldn't teach them SQL. I'd teach them relational algebra: projection, selection, join, etc. as mathematical operators over bags of tuples. I'd make sure that they could show me what the bag-of-tuples looks like at each step, as each operator is applied. I'd get them to write code in some programming language with first-class Set and Tuple datatypes, that does these operations as in-memory functional transforms over bag-of-tuples -typed variables.

And then, once they have that solid understanding, I'd show them, in order:

1. that there exist these systems called RDBMSes that maintain durable, mutable bags-of-tuples they call tables (which is weird when you think about it; coming from a math world, a relation should be an immutable mathematical object, a value-object defined by its members. But RDBMS tables aren't relations, not exactly — and this is where you can go into detail about SQL `NULL`, "nullable columns", and how in an RDBMS, the absence of an asserted row-tuple in a table doesn't necessarily mean the implicit assertion of its logical negation as a logical predicate; it rather means "haven't been fed the truth-value of that relation yet." RDBMS query engines aren't relational-algebra computers; they're relational-algebra provers — and NULL is their "not proven"!);

2. that these systems accept a COBOL-like declarative syntax for either

• referencing their tables (`TABLE foo`);

• expressing relational algebra upon their tables (`SELECT ... FROM foo ...`);

• defining arbitrary literal bags-of-tuples (e.g. `VALUES (1, 2)` — bet you didn't know that was toplevel-grammatical in SQL!);

• expressing relational algebra upon arbitrary literal bags-of-tuples (e.g. `SELECT a FROM (VALUES (1, 2)) AS x (a, b)`);

• or any combinations of the above — using UNION, INTERSECT, or introducing any of the above as a common table expression and then referencing it by name.

3. that these systems also accept COBOL-like imperative syntax — one which really has not-much-at-all to do with relational algebra — for modifying the durable, mutable bag-of-tuples asserted by a given table (INSERT, UPDATE, DELETE, TRUNCATE);

4. and that, completely unnecessarily, these systems also have a COBOL-like imperative syntax for modifying the definitions of the tables themselves (unnecessary because the data-definition schema could have just been, itself, a set of tables that you modify through DML statements. INFORMATION_SCHEMA exists, but it's read-only in every RDBMS I've ever seen. Wacky, isn't it?)

reply
QuercusMax
1 year ago
[-]
I took a class on Databases back during my undergrad in the early 2000s, and it covered pretty much all of this stuff. I thought this was standard stuff people learned as part of a CS degree.
reply
refulgentis
1 year ago
[-]
There’s certainly merit in learning the mathematical foundations of SQL, like relational algebra.

Beginning with SQL syntax allows students to quickly create something tangible, which can be incredibly gratifying and motivating.

It’s a bit like learning to play music; while music theory is important, there's something to be said for picking up an instrument and playing a few notes to spark a passion.

The intricacies of relational algebra can be woven into the curriculum once students have a handle on the basics and can appreciate the deeper structures at play.

reply
derefr
1 year ago
[-]
My point — that I perhaps didn't articulate directly — was that SQL is a bad "instrument" for noodling around with data in a way that actually gives you "a handle on the basics."

Implementing the sorts of things you'd use SQL to do, but using an in-memory bag-of-tuples data structure in a functional language, is a much better way to learn those basics. There's far less "grammatical magic" involved (where one extra word in the statement changes the way every step does its work); instead, each relational-algebra operation — each of what an RDBMS would call a "query-plan node" — is something you're directly describing with a line of code.

Leaning on your analogy of music:

Building an intuition for relational-algebra by using functions of Set and Tuple ADTs to transform bag-of-tuples-typed data in a programming language, is like building an intuition for music by trying to hit the keys on a piano according to a score.

Building an intuition for relational-algebra through SQL, meanwhile, is like building an intuition for music by wiring together VST modules in a DAW, to automatically generate sequences of notes, apply effects to them, level them, combine them in various ways, etc. And not even directly/interactively/visually like you'd usually use a DAW, but instead, by describing the wiring diagram using some sort of purely-textual high-level declarative VST-wiring DSL. (You only even get the visual equivalent of your VST-wiring-DSL code, if you use SQL EXPLAIN — which only works on 100%-valid SQL, so isn't great as a debugging tool for confused newbies! — and then feed the EXPLAINation into a visualizer like https://explain.dalibo.com/.)

Once you have a really solid intuition for music, you might be able to sit down and write a VST-wiring-DSL query that outputs exactly the notes you want on your first try. Just like someone who has built a really solid intuition for finite-state automata, can get a regular expression right on the first go. But neither regexps nor SQL are a good way of developing the intuition for the way the underlying formalism is "being driven" by the abstract program.

If you want to understand regexps well, you should first do what the regexp would be doing for you. Write your own string parser, as a finite-state automata module, consuming characters (and potentially keeping a lookbehind buffer), and outputting matches and captures.

If you want to understand SQL well, you should first do what the SQL query planner would be doing for you. Implement your own query plan, using a chain of function calls to something like Clojure's lazy-sequence functions, or to Elixir's Enum and Stream modules.

reply
uticus
1 year ago
[-]
> I'd teach them relational algebra...

Agreed. Curious what materials you've found that really drive this home?

Like 90% of googlesearch results simplify things to the point where this is missed.

reply
Tallain
1 year ago
[-]
SQL Server Central has a "Stairway to T-SQL" series that starts off with some light history and then has two fairly decent chapters diving into the mathematics of SQL. It's not intensive by any means, but more than most other resources (and even some textbooks I've read).

[0]: https://www.sqlservercentral.com/stairways/stairway-to-t-sql...

reply
formerly_proven
1 year ago
[-]
A lot of the Relational Databases course materials in CS programs is built from Oracle educational material; it's probably freely available for many universities. I always assumed the approach outlined by GP is the default way to teach SQL and relational databases.
reply
crazygringo
1 year ago
[-]
> Now the mysterious query: SELECT (SELECT sum(a) FROM xx LIMIT 1) FROM aa

To be clear, that's just an absolutely pathological query, since column "a" doesn't exist in table "xx" -- it only exists in table "aa".

You can use values from outer queries inside of inner queries, which are called "correlated subqueries". Which is not recommended unless you're already a real SQL expert, because they're so confusing (and their performance is also pretty absymal).

But this isn't even that -- this is an aggregate function across an outer variable. I consider myself a strong SQL expert, but looking at this query, it just feels nonsensical. Not only could I not tell you what it would produce -- I couldn't even guess at what its author intended it to mean in the first place.

> Later in the article we will see that SQL implementations don’t even agree about the correct result.

I'm not surprised at all. I'm actually more surprised they produce results at all, rather than just errors about it being an invalid query.

reply
indigo945
1 year ago
[-]
I don't find it that confusing, to be honest. When the article presented the "mysterious" query, I guessed the result correctly. A correlated subquery will always execute the entire expression for each row of the outer query. Since the value of `sum(a)` is known in the context of the outer query, it doesn't really matter where you select it from in the inner query. Therefore, the following queries are all identical:

    SELECT sum(a) FROM aa;
    SELECT (SELECT sum(a) FROM xx LIMIT 1) FROM aa;
    SELECT (SELECT sum(a) FROM generate_series(1, 1)) FROM aa;
To maybe explain why this behaviour makes sense, stop thinking about aggregate queries for a second, and just consider normal expressions. What would you expect the result of the following query to be?

    SELECT (SELECT a + x FROM xx LIMIT 1) FROM aa;
This is obviously a bad query because it uses `LIMIT` without `ORDER BY`, but let's just put that aside. This query should make some sense to pretty much any SQL developer: you take each `a` value from `aa`, and then add the value from the "first" (whatever that means) `x` value from `xx` to it. (In my test on Postgres, this returns [11, 12, 13], but there's obviously no guarantee because we neglected to sort `xx`.)

Now, since that was easy, consider this simplification of that query:

    SELECT (SELECT a FROM xx LIMIT 1) FROM aa;
I think that if the previous query made sense, this one should also make sense, as the expression simple has one less term. The result is obviously [1, 2, 3].

So, what should the aggregate return:

    SELECT (SELECT sum(a) FROM xx LIMIT 1) FROM aa;
The only real difference is that the outer query is now aggregated, and therefore it is implicitly in one big grouping, and therefore returns only one row. So the correlated subquery gets run once, for that one row - what's the value for `sum(a)` in that row, the one row that the outer query returns? Well, 6. Duh.
reply
foldU
1 year ago
[-]
(I'm the author of the original post [1]) And while I think the rules, in the end, make sense, I think it's not quite as clear cut as you describe, consider:

    SELECT (SELECT sum(1) FROM xx LIMIT 1) FROM aa;
which returns

    3
    3
    3
Personally, I think having the inner aggregation always attach to the nearest `SELECT` would have been an equally valid way of defining how this works, but it just so happens it is not defined that way.

[1]: https://buttondown.email/jaffray/archive/sql-scoping-is-surp...

reply
indigo945
1 year ago
[-]
Well, the "fun" one to consider is IMO not that, but this:

    SELECT (SELECT sum(a + x) FROM xx LIMIT 1) FROM aa;
Making a pretty diagram where all the behaviours from various DBMSs are listed is left as an exercise to the author. :)
reply
foldU
1 year ago
[-]
I think this one is obvious actually! There's no choice but to aggregate it at the level of the `xx` `SELECT`, no other level has access to the `x` column.
reply
MarkusWinand
1 year ago
[-]
> Making a pretty diagram where all the behaviours from various DBMSs are listed is left as an exercise to the author. :)

It's there already!

It's in the chart as footnote "b": Outer reference must be the only argument (doesn’t support F441)

That's the one thing SQL Server doesn't eat. Those that are green in the chart work fine in this case.

reply
da_chicken
1 year ago
[-]
> I don't find it that confusing, to be honest.

I would say it's not all that complicated what is going on. I would say that you can figure it out if you have experience reading queries.

I would never say it's not confusing. Just because the query engine can parse and generate an execution plan and you can understand how and why it's doing what it decides to do doesn't mean it's not confusing. "Confusing" doesn't mean "incomprehensible." It is confusing. It's confusing in almost the same way that "6÷2(1+2) = ?" viral equation is confusing [0] or the Fast Inverse Square Root is confusing. This query is unconventional.

Not "unconventional" in the sense that it's uncommon (although it is also that). It's unconventional in the sense that it goes against established conventions. Humans don't write queries that way. This query violates the collective query conventions of the SQL community. People don't write queries like this or teach writing queries like this because it's not a particularly useful or portable pattern to put the aggregate value of the outside query in the SELECT clause of an inside correlated subquery. There are better ways to write this query. Because of that, it's not clear what the author's intent was or why they would choose to write it thus instead of, well, nearly any other way that a human would actually choose beforehand. It's so unconventional that the intent is uncertain, and "uncertainty of intent" is another way to say "confusing."

Code is not valuable because it executes. Code is valuable because the programmer, the computer, and any future programmer agree on the intent with no uncertainty. That's why over the long term clarity is often more potent than cleverness.

[0]: https://youtu.be/4x-BcYCiKCk

reply
tsimionescu
1 year ago
[-]
You say it's obvious, but it apparently wasn't obvious to the developers of Oracle DB, one of the most used databases.

I think the most unexpected thing is that the query engine looks into the inner queries to decide whether to return all rows of the outer table or some aggregation of them. If there had been a group by clause, it would have been much clearer that an aggregation must be happening.

Edit: also, the "obvious" logic actually changes completely if you look at the related query, SELECT (SELECT sum(a+x) FROM xx LIMIT 1) FROM aa;

This returns 63, 66, 69. Similarly, sum(1) will return 3, 3, 3.

So, we could very well expect that sum (a) would work like that - it would treat each value of `a` as a constant that it aggregates over all rows of xx, so it would return 3, 6, 9 (the limit is anyway ignored if you are aggregating).

reply
jhoechtl
1 year ago
[-]
> Oracle DB, one of the most used databases.

Is it still? Maybe 20 years ago but it's been a long time that I met an It guy who talked about himself having worked with an Oracle DB recently.

It's all PostgreSQL, MS SQL and MySQL/MariaDB where I work.

reply
tsimionescu
1 year ago
[-]
Here is an article on Statista claiming it is in fact the most used by quite some margin:

https://www.statista.com/statistics/809750/worldwide-popular...

I have no idea if that is even close to correct, but I am sure it is still one of the most used RDBMS out there, out of pure inertia if nothing else. Plus, from what I've heard from a friend who worked with a few commercial. DBs, it's actually pretty good software, it's dealing with Oracle licensing and outrageous audit practices that is killing it.

reply
nimish
1 year ago
[-]
Correlated subqueries are absolutely fine and are equivalent to (possibly lateral) joins. They aren't as easy to identify and optimize as such, so I wouldn't necessarily recommend them, but sometimes they make sense. Not an issue with a good query planner and if they clarify a query.
reply
da_chicken
1 year ago
[-]
Yeah, correlated subqueries are fine. Unless you're on MySQL, which (historically, at least) was infamously terrible at executing them.

The reason this query works is because correlated subqueries... need something to correlate with! Usually the column from the outside query is only referenced in the WHERE clause, but there isn't really a reason you can't reference it in the inside query's SELECT clause. After all, what if you needed it for a CASE expression there?

There are certainly better ways to write what this query is accomplishing -- at the very least there are ways to write it that make it much more clear what your intentions are -- but just because it's weird doesn't mean it's invalid. The developers know they cannot anticipate all possible operations that might be desired, and there are already countless ways to write queries incorrectly.

reply
HDThoreaun
1 year ago
[-]
As an employee at a database company working on the plan optimizer I would like to ask you not to use correlated subqueries for my own sanity. Close to 100% of the mind blowing bugs are correlated subqueries. Kidding of course but holy shit my life would be so much easier without them.
reply
nimish
1 year ago
[-]
I am curious as to why. They should be represented like joins, right?
reply
HDThoreaun
1 year ago
[-]
They’re slow as fuck and most of them can be rewritten to not be correlated, so we attempt to heavily rewrite them. There’s basically an entire subfield of cs research that focuses on strategies for rewriting CSQs and the papers have gotten progressively more and more complicated making them very difficult to reproduce without bugs. I’ve spent easily weeks just reading these papers let alone implementing them.
reply
nimish
1 year ago
[-]
I'd love to read a deep dive on this. Database implementation internals are always very interesting.
reply
MarkusWinand
1 year ago
[-]
No question, such a query should not be written. That's probably the reason why this odd behavior, which is even different in various DBMSes, is not causing everyday problems.
reply
marcosdumay
1 year ago
[-]
The reason the behavior doesn't cause problems is because everybody treats automatic aggregation as a voodoo where they know one recipe that works and anything different is domain of the Devil.

And, IMO, that's a very sane and reasonable way to treat it. The entire idea of automatic aggregation is flawed, and those queries should just have a `group by ()` explicit at the right place.

reply
foldU
1 year ago
[-]
I agree, I think the original sin here is the fact that whether a `SELECT` is an aggregation is determined by the contents of the scalar expressions at all. I think most of this weirdness comes directly out of wanting to be able to write both `SELECT sum(x) FROM xx` and `SELECT x FROM xx` and have them work.

Not that I have a better solution offhand, in SQL grouping by a constant value is not actually the same as not writing `GROUP BY` at all since the behaviour on empty tables is different.

reply
derefr
1 year ago
[-]
What's an aggregation per se? A SQL query is best thought of an arbitrary generator function. You can, using functions like UNNEST, end up emitting multiple row-tuples for each processed input row-tuple. An aggregation is just a generator function that happens to reduce over all the input rows and then emit one row-tuple when there's no more input. Query-planning engines do not special-case this. It's just a generator node like any other generator node.

Consider: using window functions, you can do partial aggregations over subsets of the input — without even necessarily partitioning the input (i.e. you can compute "running totals" and other wacky state-machine-like outputs.)

reply
marcosdumay
1 year ago
[-]
Not writing `group by` is the same as writing `group by ()`.

And yeah, the difference between that and a value is one of those really surprising things on SQL that actually make sense and should be this way. Unfortunately, there are many of those.

reply
foldU
1 year ago
[-]
Ah my apologies, I wasn't familiar with that syntax.
reply
marcosdumay
1 year ago
[-]
No need to apologize. My post didn't make that clear at all.
reply
indigo945
1 year ago
[-]
Would you argue that automatic scalar-ism is also flawed, and the query

    SELECT a FROM aa;
should have an explicit grouping, like

    SELECT a FROM aa GROUP BY id;
?

After all, when you think about it, it's really not the aggregate functions that break expectations, it's the scalars. Of the four combinations of having or not having an explicit `GROUP BY` and having or not having an aggregate function, three of them have the unsurprising behavior of returning exactly one row per grouping.

    -- aggregate function and group by - one single result per grouping
    SELECT sum(a) FROM aa GROUP BY a; --or
    SELECT sum(a) FROM aa GROUP BY ();


    -- no aggregate function, but a group by - one single result per grouping
    SELECT a FROM aa GROUP BY a;

    -- just aggregate function, but no group by - one single result per (single, implicit) grouping
    SELECT sum(a) FROM aa;

But then, when you have neither an aggregate function nor an explicit `GROUP BY`, it breaks expectations:

    -- no aggregate function, no group by - one result row per row 
    -- in the source set, even though there should be only one big implicit group
    SELECT a FROM aa;
Therefore, I propose that the next SQL standard should introduce a new `GROUP BY INDIVIDUAL ROW` keyword, that henceforth all "scalar" queries MUST use in order to have consistent behaviour with the rest of the language.
reply
marcosdumay
1 year ago
[-]
What is flawed is not that there is an implicit grouping on all queries. It's that the implicit grouping changes, depending on a set of rules that consider stuff written in several places that are not explicitly related to it.

You are asking if it makes sense to have an implicit grouping at all; it very obviously absolutely does. And grouping by individual row is the very obvious default. But I do totally support adding that keyword expressing the default. All defaults should be expressible.

reply
gigatexal
1 year ago
[-]
Please don’t use this as an interview question. It is absolutely diabolical.
reply
random3
1 year ago
[-]
So approximately a SQL standard (aggregation "tree-climb") combined with a non-standard (LIMIT) which prevents an error yield an unexpected behavior.
reply