I run the tech side of a company almost entirely built on complex SQL queries. If I were to take a fresh college student and "teach them SQL"... well, I wouldn't teach them SQL. I'd teach them relational algebra: projection, selection, join, etc. as mathematical operators over bags of tuples. I'd make sure that they could show me what the bag-of-tuples looks like at each step, as each operator is applied. I'd get them to write code in some programming language with first-class Set and Tuple datatypes, that does these operations as in-memory functional transforms over bag-of-tuples -typed variables.
And then, once they have that solid understanding, I'd show them, in order:
1. that there exist these systems called RDBMSes that maintain durable, mutable bags-of-tuples they call tables (which is weird when you think about it; coming from a math world, a relation should be an immutable mathematical object, a value-object defined by its members. But RDBMS tables aren't relations, not exactly — and this is where you can go into detail about SQL `NULL`, "nullable columns", and how in an RDBMS, the absence of an asserted row-tuple in a table doesn't necessarily mean the implicit assertion of its logical negation as a logical predicate; it rather means "haven't been fed the truth-value of that relation yet." RDBMS query engines aren't relational-algebra computers; they're relational-algebra provers — and NULL is their "not proven"!);
2. that these systems accept a COBOL-like declarative syntax for either
• referencing their tables (`TABLE foo`);
• expressing relational algebra upon their tables (`SELECT ... FROM foo ...`);
• defining arbitrary literal bags-of-tuples (e.g. `VALUES (1, 2)` — bet you didn't know that was toplevel-grammatical in SQL!);
• expressing relational algebra upon arbitrary literal bags-of-tuples (e.g. `SELECT a FROM (VALUES (1, 2)) AS x (a, b)`);
• or any combinations of the above — using UNION, INTERSECT, or introducing any of the above as a common table expression and then referencing it by name.
3. that these systems also accept COBOL-like imperative syntax — one which really has not-much-at-all to do with relational algebra — for modifying the durable, mutable bag-of-tuples asserted by a given table (INSERT, UPDATE, DELETE, TRUNCATE);
4. and that, completely unnecessarily, these systems also have a COBOL-like imperative syntax for modifying the definitions of the tables themselves (unnecessary because the data-definition schema could have just been, itself, a set of tables that you modify through DML statements. INFORMATION_SCHEMA exists, but it's read-only in every RDBMS I've ever seen. Wacky, isn't it?)
Beginning with SQL syntax allows students to quickly create something tangible, which can be incredibly gratifying and motivating.
It’s a bit like learning to play music; while music theory is important, there's something to be said for picking up an instrument and playing a few notes to spark a passion.
The intricacies of relational algebra can be woven into the curriculum once students have a handle on the basics and can appreciate the deeper structures at play.
Implementing the sorts of things you'd use SQL to do, but using an in-memory bag-of-tuples data structure in a functional language, is a much better way to learn those basics. There's far less "grammatical magic" involved (where one extra word in the statement changes the way every step does its work); instead, each relational-algebra operation — each of what an RDBMS would call a "query-plan node" — is something you're directly describing with a line of code.
Leaning on your analogy of music:
Building an intuition for relational-algebra by using functions of Set and Tuple ADTs to transform bag-of-tuples-typed data in a programming language, is like building an intuition for music by trying to hit the keys on a piano according to a score.
Building an intuition for relational-algebra through SQL, meanwhile, is like building an intuition for music by wiring together VST modules in a DAW, to automatically generate sequences of notes, apply effects to them, level them, combine them in various ways, etc. And not even directly/interactively/visually like you'd usually use a DAW, but instead, by describing the wiring diagram using some sort of purely-textual high-level declarative VST-wiring DSL. (You only even get the visual equivalent of your VST-wiring-DSL code, if you use SQL EXPLAIN — which only works on 100%-valid SQL, so isn't great as a debugging tool for confused newbies! — and then feed the EXPLAINation into a visualizer like https://explain.dalibo.com/.)
Once you have a really solid intuition for music, you might be able to sit down and write a VST-wiring-DSL query that outputs exactly the notes you want on your first try. Just like someone who has built a really solid intuition for finite-state automata, can get a regular expression right on the first go. But neither regexps nor SQL are a good way of developing the intuition for the way the underlying formalism is "being driven" by the abstract program.
If you want to understand regexps well, you should first do what the regexp would be doing for you. Write your own string parser, as a finite-state automata module, consuming characters (and potentially keeping a lookbehind buffer), and outputting matches and captures.
If you want to understand SQL well, you should first do what the SQL query planner would be doing for you. Implement your own query plan, using a chain of function calls to something like Clojure's lazy-sequence functions, or to Elixir's Enum and Stream modules.
Agreed. Curious what materials you've found that really drive this home?
Like 90% of googlesearch results simplify things to the point where this is missed.
[0]: https://www.sqlservercentral.com/stairways/stairway-to-t-sql...
To be clear, that's just an absolutely pathological query, since column "a" doesn't exist in table "xx" -- it only exists in table "aa".
You can use values from outer queries inside of inner queries, which are called "correlated subqueries". Which is not recommended unless you're already a real SQL expert, because they're so confusing (and their performance is also pretty absymal).
But this isn't even that -- this is an aggregate function across an outer variable. I consider myself a strong SQL expert, but looking at this query, it just feels nonsensical. Not only could I not tell you what it would produce -- I couldn't even guess at what its author intended it to mean in the first place.
> Later in the article we will see that SQL implementations don’t even agree about the correct result.
I'm not surprised at all. I'm actually more surprised they produce results at all, rather than just errors about it being an invalid query.
SELECT sum(a) FROM aa;
SELECT (SELECT sum(a) FROM xx LIMIT 1) FROM aa;
SELECT (SELECT sum(a) FROM generate_series(1, 1)) FROM aa;
To maybe explain why this behaviour makes sense, stop thinking about aggregate queries for a second, and just consider normal expressions. What would you expect the result of the following query to be? SELECT (SELECT a + x FROM xx LIMIT 1) FROM aa;
This is obviously a bad query because it uses `LIMIT` without `ORDER BY`, but let's just put that aside. This query should make some sense to pretty much any SQL developer: you take each `a` value from `aa`, and then add the value from the "first" (whatever that means) `x` value from `xx` to it. (In my test on Postgres, this returns [11, 12, 13], but there's obviously no guarantee because we neglected to sort `xx`.)Now, since that was easy, consider this simplification of that query:
SELECT (SELECT a FROM xx LIMIT 1) FROM aa;
I think that if the previous query made sense, this one should also make sense, as the expression simple has one less term. The result is obviously [1, 2, 3].So, what should the aggregate return:
SELECT (SELECT sum(a) FROM xx LIMIT 1) FROM aa;
The only real difference is that the outer query is now aggregated, and therefore it is implicitly in one big grouping, and therefore returns only one row. So the correlated subquery gets run once, for that one row - what's the value for `sum(a)` in that row, the one row that the outer query returns? Well, 6. Duh. SELECT (SELECT sum(1) FROM xx LIMIT 1) FROM aa;
which returns 3
3
3
Personally, I think having the inner aggregation always attach to the nearest `SELECT` would have been an equally valid way of defining how this works, but it just so happens it is not defined that way.[1]: https://buttondown.email/jaffray/archive/sql-scoping-is-surp...
SELECT (SELECT sum(a + x) FROM xx LIMIT 1) FROM aa;
Making a pretty diagram where all the behaviours from various DBMSs are listed is left as an exercise to the author. :)It's there already!
It's in the chart as footnote "b": Outer reference must be the only argument (doesn’t support F441)
That's the one thing SQL Server doesn't eat. Those that are green in the chart work fine in this case.
I would say it's not all that complicated what is going on. I would say that you can figure it out if you have experience reading queries.
I would never say it's not confusing. Just because the query engine can parse and generate an execution plan and you can understand how and why it's doing what it decides to do doesn't mean it's not confusing. "Confusing" doesn't mean "incomprehensible." It is confusing. It's confusing in almost the same way that "6÷2(1+2) = ?" viral equation is confusing [0] or the Fast Inverse Square Root is confusing. This query is unconventional.
Not "unconventional" in the sense that it's uncommon (although it is also that). It's unconventional in the sense that it goes against established conventions. Humans don't write queries that way. This query violates the collective query conventions of the SQL community. People don't write queries like this or teach writing queries like this because it's not a particularly useful or portable pattern to put the aggregate value of the outside query in the SELECT clause of an inside correlated subquery. There are better ways to write this query. Because of that, it's not clear what the author's intent was or why they would choose to write it thus instead of, well, nearly any other way that a human would actually choose beforehand. It's so unconventional that the intent is uncertain, and "uncertainty of intent" is another way to say "confusing."
Code is not valuable because it executes. Code is valuable because the programmer, the computer, and any future programmer agree on the intent with no uncertainty. That's why over the long term clarity is often more potent than cleverness.
I think the most unexpected thing is that the query engine looks into the inner queries to decide whether to return all rows of the outer table or some aggregation of them. If there had been a group by clause, it would have been much clearer that an aggregation must be happening.
Edit: also, the "obvious" logic actually changes completely if you look at the related query, SELECT (SELECT sum(a+x) FROM xx LIMIT 1) FROM aa;
This returns 63, 66, 69. Similarly, sum(1) will return 3, 3, 3.
So, we could very well expect that sum (a) would work like that - it would treat each value of `a` as a constant that it aggregates over all rows of xx, so it would return 3, 6, 9 (the limit is anyway ignored if you are aggregating).
Is it still? Maybe 20 years ago but it's been a long time that I met an It guy who talked about himself having worked with an Oracle DB recently.
It's all PostgreSQL, MS SQL and MySQL/MariaDB where I work.
https://www.statista.com/statistics/809750/worldwide-popular...
I have no idea if that is even close to correct, but I am sure it is still one of the most used RDBMS out there, out of pure inertia if nothing else. Plus, from what I've heard from a friend who worked with a few commercial. DBs, it's actually pretty good software, it's dealing with Oracle licensing and outrageous audit practices that is killing it.
The reason this query works is because correlated subqueries... need something to correlate with! Usually the column from the outside query is only referenced in the WHERE clause, but there isn't really a reason you can't reference it in the inside query's SELECT clause. After all, what if you needed it for a CASE expression there?
There are certainly better ways to write what this query is accomplishing -- at the very least there are ways to write it that make it much more clear what your intentions are -- but just because it's weird doesn't mean it's invalid. The developers know they cannot anticipate all possible operations that might be desired, and there are already countless ways to write queries incorrectly.
And, IMO, that's a very sane and reasonable way to treat it. The entire idea of automatic aggregation is flawed, and those queries should just have a `group by ()` explicit at the right place.
Not that I have a better solution offhand, in SQL grouping by a constant value is not actually the same as not writing `GROUP BY` at all since the behaviour on empty tables is different.
Consider: using window functions, you can do partial aggregations over subsets of the input — without even necessarily partitioning the input (i.e. you can compute "running totals" and other wacky state-machine-like outputs.)
And yeah, the difference between that and a value is one of those really surprising things on SQL that actually make sense and should be this way. Unfortunately, there are many of those.
SELECT a FROM aa;
should have an explicit grouping, like SELECT a FROM aa GROUP BY id;
?After all, when you think about it, it's really not the aggregate functions that break expectations, it's the scalars. Of the four combinations of having or not having an explicit `GROUP BY` and having or not having an aggregate function, three of them have the unsurprising behavior of returning exactly one row per grouping.
-- aggregate function and group by - one single result per grouping
SELECT sum(a) FROM aa GROUP BY a; --or
SELECT sum(a) FROM aa GROUP BY ();
-- no aggregate function, but a group by - one single result per grouping
SELECT a FROM aa GROUP BY a;
-- just aggregate function, but no group by - one single result per (single, implicit) grouping
SELECT sum(a) FROM aa;
But then, when you have neither an aggregate function nor an explicit `GROUP BY`, it breaks expectations: -- no aggregate function, no group by - one result row per row
-- in the source set, even though there should be only one big implicit group
SELECT a FROM aa;
Therefore, I propose that the next SQL standard should introduce a new `GROUP BY INDIVIDUAL ROW` keyword, that henceforth all "scalar" queries MUST use in order to have consistent behaviour with the rest of the language.You are asking if it makes sense to have an implicit grouping at all; it very obviously absolutely does. And grouping by individual row is the very obvious default. But I do totally support adding that keyword expressing the default. All defaults should be expressible.