SQL, Homomorphisms and Constraint Satisfaction Problems
125 points
13 hours ago
| 3 comments
| philipzucker.com
| HN
mbid
11 hours ago
[-]
The post mentions the idea that querying a database D can be understood algebraically as enumerating all morphisms Q -> D, where Q is the "classifying" database of the query, i.e. a minimal database instance that admits a single "generic" result of the query. You can use this to give a neat formulation of Datalog evaluation. A Datalog rule then corresponds a morphism P -> H, where P is the classifying database instance of the rule body and H is the classifying database instance for matches of both body and head. For example, for the the transitivity rule

  edge(x, z) :- edge(x, y), edge(y, z).
you'd take for P the database instance containing two rows (a_1, a_2) and (a_2, a_3), and the database instance H contains additionally (a_1, a_3). Now saying that a Database D satisfies this rule means that every morphism P -> D (i.e., every match of the premise of the rule) can be completed to a commuting diagram

  P --> D
  |    ^
  |   /
  ⌄  /
  Q 
where the additional map is the arrow Q -> D, which corresponds to a match of both body and head.

This kind of phenomenon is known in category theory as a "lifting property", and there's rich theory around it. For example, you can show in great generality that there's always a "free" way to add data to a database D so that it satisfies the lifting property (the orthogonal reflection construction/the small object argument). Those are the theoretical underpinnings of the Datalog engine I'm sometimes working on [1], and there they allow you to prove that Datalog evaluation is also well-defined if you allow adjoining new elements during evaluation in a controlled way. I believe the author of this post is involved in the egglog project [2], which might have similar features as well.

[1] https://github.com/eqlog/eqlog [2] https://github.com/egraphs-good/egglog

reply
snthpy
1 hour ago
[-]
Thank you @xlinux and @mbid. Very interesting and not something I knew much about before.

I had a look at eglog and egglog and if I'm understanding things correctly then one possible use case is type inference and optimization. I'm particular I looked at the example in [1].

I'm thinking that this could be useful in the PRQL [2] compiler, in particular for: a) inference of type restrictions on input relations and resultant output relation types, b) optimization of resultant SQL queries.

Would you be able to comment on whether that's correct?

Any links to related examples, papers, or work would be appreciated. Thanks!

1: https://egglog-python.readthedocs.io/latest/tutorials/sklear...

2: https://www.prql-lang.org/

reply
bubblyworld
1 hour ago
[-]
Very interesting perspective I hadn't heard before on datalog, thanks. How far does it go - can you interpret extensions of datalog (say negation or constrained existentials) in a nice categorical way, for instance? I've given this very little thought but I imagine you'd have issues with uniqueness of these "minimal" database instances, and I'm not sure what that means for these lifting properties.

(if my question even makes sense, pardon the ignorance)

reply
babel_
11 hours ago
[-]
For anyone curious: the performance difference between Clang and GCC on the example C solution for verbal arithmetic comes down to Clang's auto-vectorisation (deducing SIMD) whilst GCC here sticks with scalar, which is why the counter brings Clang closer in line to GCC (https://godbolt.org/z/xfdxGvMYP), and it's actually a pretty nice example of auto-vectorisation (and its limitations) in action, which is a fun tangent from this article (given its relevance to high-performance SMT/SAT solving for CSP)
reply
lovasoa
10 hours ago
[-]
The topic of huge queries on tiny databases makes me think of this recent discussion on the SQLite forum: https://sqlite.org/forum/forumpost/0d18320369

Someone had an issue because SQLite failed to optimize the following query

    select * from t where x = 'x' or '' = 'x'
Someone said that SQLite could not optimize out the "or '' = 'x'" because it would be too expensive to compute. Which is obviously true only for huge queries on tiny datasets.
reply
recursive
10 hours ago
[-]
It's not obviously true at all. Optimizing out `'' = 'x'` can be done for a fixed cost regardless of record count.
reply
lovasoa
8 hours ago
[-]
Optimizing out static expressions can be done in linear time at best. So if the number of clauses in WHERE is huge and the size of the underlying table is tiny (such as in the examples shown in the article we are commenting on), it will be better not to run the optimization.

But of course, in normal life, outside of the world of people having fun with Homomorphisms, queries are much smaller than databases.

reply
recursive
8 hours ago
[-]
Parsing the expression in the first place is already linear time.
reply
hinkley
10 hours ago
[-]
Why would it be too expensive to optimize out static subexpressions?
reply
jjice
8 hours ago
[-]
My guess is that the expense can be tricky to calculate since the additional optimization prior to executing the query may take longer than if the query was just able to run (depending on the dataset, of course). I wonder if it's too expensive to calculate a heuristic as well, so it just allows it to execute.

Just a guess.

reply
jiggawatts
9 hours ago
[-]
> SQLite

Well... there's your problem. SQLite is not a general-purpose RDBMS, it is marketed as a replacement for "fopen()", a purpose for which it excels.

A similar product is the Microsoft Jet database engine, used in products such as Microsoft Exchange and Active Directory. Queries have to be more-or-less manually optimised by the developer, but they run faster and more consistently than they would with a general-purpose query engine designed for ad-hoc queries.

reply
cerved
5 hours ago
[-]
I hate Jet with a vengeance
reply