SQLite has no compression support, MySQL/MariaDB have page-level compression which doesn't work great and I've never seen anyone enable in production, and Postgres has per-value compression which is good for extremely long strings, but useless for short ones.
There are just so many string columns where values and substrings get repeated so much, whether you're storing names, URLs, or just regular text. And I have databases I know would be reduced in size by at least half.
Is it just really really hard to maintain a shared dictionary when constantly adding and deleting values? Is there just no established reference algorithm for it?
It still seems like it would be worth it even if it were something you had to manually set. E.g. wait until your table has 100,000 values, build a dictionary from those, and the dictionary is set in stone and used for the next 10,000,000 rows too unless you rebuild it in the future (which would be an expensive operation).
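To make the "build once, then freeze" idea concrete, here is a minimal Python sketch. It is not how any particular database does it; the function names, the `("ref", code)` / `("raw", value)` cell layout, and the 65,535-entry cap are all assumptions made up for illustration. The point is only that values missing from the frozen dictionary fall back to being stored raw, so inserts never have to modify the dictionary.

```python
from collections import Counter

def build_frozen_dictionary(sample, max_entries=65535):
    """Pick the most frequent values in the sample; never modified afterwards."""
    counts = Counter(sample)
    # Assumption: a value seen only once is not worth a dictionary slot.
    frequent = [v for v, c in counts.most_common(max_entries) if c > 1]
    return {value: code for code, value in enumerate(frequent)}

def encode(value, dictionary):
    """Return ('ref', code) for dictionary hits, ('raw', value) otherwise."""
    code = dictionary.get(value)
    return ("ref", code) if code is not None else ("raw", value)

def decode(cell, reverse):
    """Turn a stored cell back into the original string."""
    kind, payload = cell
    return reverse[payload] if kind == "ref" else payload

# Stand-in for "the first 100,000 values" of the column.
sample = ["example.com", "example.org", "example.com", "foo.net", "example.org"]
dictionary = build_frozen_dictionary(sample)
reverse = list(dictionary)  # code -> value (dicts preserve insertion order)

# Rows arriving after the dictionary was frozen.
later_rows = ["example.com", "brand-new.io", "example.org"]
encoded = [encode(v, dictionary) for v in later_rows]
decoded = [decode(c, reverse) for c in encoded]
```

The cost of freezing is that the compression ratio drifts as the data changes, which is where the expensive rebuild comes in.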
A global column dictionary adds more complexity than usual. Now you are touching more pages than just the index pages and the data page. The dictionary entries are sorted, so you need to worry about page expansion and contraction. They sidestep the problems by making it immutable, presumably building it up front by scanning all the data.
Not sure why using FSST is better than using a standard compression algorithm to compress the dictionary entries.
Storing the strings themselves as dictionary IDs is a good idea, as they can be processed quickly with SIMD.
I believe the reason is that FSST allows access to individual strings in the compressed corpus, which is required for fast random access. This is more important for OLTP than OLAP, I assume. More standard compression algorithms, such as zstd, might decompress very fast, but I don't think they allow that.
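A toy Python sketch of why a static symbol table gives per-string random access. This is not the real FSST algorithm: the symbol table here is hand-picked and hypothetical (FSST learns its table from the data), and the escape-byte layout is just an assumption for illustration. What it does show is that every string is encoded on its own against a shared table, so any one of them can be decoded without touching its neighbours.

```python
ESCAPE = 255  # marker byte: the next byte is a literal, not a symbol code

# Hypothetical hand-picked symbol table; a real system would learn it from data.
SYMBOLS = [b"https://", b"www.", b".com", b"example", b"/"]

def compress(data: bytes) -> bytes:
    """Encode one string independently of all others."""
    out = bytearray()
    i = 0
    while i < len(data):
        # Greedy: choose the longest symbol that matches at position i.
        best = -1
        for code, sym in enumerate(SYMBOLS):
            if data.startswith(sym, i) and (best < 0 or len(sym) > len(SYMBOLS[best])):
                best = code
        if best >= 0:
            out.append(best)
            i += len(SYMBOLS[best])
        else:
            # No symbol matches: emit the byte as an escaped literal.
            out += bytes([ESCAPE, data[i]])
            i += 1
    return bytes(out)

def decompress(blob: bytes) -> bytes:
    """Decode one string using only the shared symbol table."""
    out = bytearray()
    i = 0
    while i < len(blob):
        if blob[i] == ESCAPE:
            out.append(blob[i + 1])
            i += 2
        else:
            out += SYMBOLS[blob[i]]
            i += 1
    return bytes(out)

urls = [b"https://www.example.com/a", b"https://example.com/", b"ftp://other.org"]
compressed = [compress(u) for u in urls]

# Random access: decode only the second string, without touching the others.
second = decompress(compressed[1])
```

With a general-purpose block compressor you would typically have to decompress the whole block (or at least everything before the string you want) to get the same result.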
1. It complicates and slows down updates, which typically matter more in OLTP than in OLAP.
2. It is generally bad for high-cardinality columns, so you have to track cardinality to decide when to use it, which complicates things further.
Lastly, the additional operational complexity (like the table maintenance system you described in your last paragraph) could reduce system reliability, and they might decide it's not worth the price or that it goes against their philosophy.
But string interning is what they're doing, isn't it?
> Dictionary compression is a well-known and widely applied technique. The basic idea is that you store all the unique input values within a dictionary, and then you compress the input by substituting the input values with smaller fixed-size integer keys that act as offsets into the dictionary. Building a CedarDB dictionary on our input data and compressing the data would look like this:
That's string interning!!
Is interning just too old a concept now and it has to be rediscovered/reinvented and renamed?
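For reference, the scheme in the quoted passage can be sketched in a few lines of Python. This is only an illustration of the general technique, not CedarDB's implementation; `dictionary_encode` and the sample column are made up. It uses a sorted dictionary, which lets both equality and range predicates run on the integer codes instead of the strings.

```python
from bisect import bisect_left

def dictionary_encode(values):
    """Return (sorted dictionary, list of integer codes) for a column of strings."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]

column = ["berlin", "munich", "berlin", "hamburg", "munich", "berlin"]
dictionary, codes = dictionary_encode(column)

# Decoding is a plain array lookup.
decoded = [dictionary[c] for c in codes]

# Equality predicate: look the constant up once, then compare integers only.
target = bisect_left(dictionary, "munich")
is_present = target < len(dictionary) and dictionary[target] == "munich"
matches = [i for i, c in enumerate(codes) if is_present and c == target]

# Because the dictionary is sorted, a range predicate on strings
# becomes a range predicate on codes: value < "c"  <=>  code < bound.
bound = bisect_left(dictionary, "c")
below_c = [i for i, c in enumerate(codes) if c < bound]
```

The scan loops only ever touch small fixed-size integers, which is the part a native engine can vectorize.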
Interning:
1: "foo"
2: "bar"
my_string = "foo" // stored as ref->1
my_other_string = "foobarbaz" // not found & too long to get interned, stored as "foobarbaz"
Dictionary compression:
1: "foo"
2: "bar"
my_string = "foo" // stored as ref->1
my_other_string = "foobarbaz" // stored as ref->1,ref->2,"baz" (or ref->1,ref->2,ref->3 and "baz" is added to the dict)
It's not old yet; it'll be old once it's been renamed two or three times...
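The distinction drawn in the interning versus dictionary compression listing above can be written out as a runnable Python sketch. The two functions and the token format are hypothetical, and the second one implements the substring-level scheme as that comment describes it, which is not necessarily what any given database means by "dictionary compression".

```python
DICTIONARY = ["foo", "bar"]  # index + 1 is the reference number used in the listing

def intern_value(value):
    """Whole-value lookup: the entire string is a reference, or it is stored raw."""
    if value in DICTIONARY:
        return [("ref", DICTIONARY.index(value) + 1)]
    return [("raw", value)]

def substring_encode(value):
    """Substring lookup: replace each dictionary entry found inside the string."""
    tokens, i, literal = [], 0, ""
    while i < len(value):
        # First dictionary entry that matches at position i, if any.
        hit = next((e for e in DICTIONARY if value.startswith(e, i)), None)
        if hit is None:
            literal += value[i]
            i += 1
            continue
        if literal:
            tokens.append(("raw", literal))
            literal = ""
        tokens.append(("ref", DICTIONARY.index(hit) + 1))
        i += len(hit)
    if literal:
        tokens.append(("raw", literal))
    return tokens

interned_short = intern_value("foo")
interned_long = intern_value("foobarbaz")
encoded_short = substring_encode("foo")
encoded_long = substring_encode("foobarbaz")
```

Both schemes treat "foo" identically; they only diverge on "foobarbaz", where interning stores the whole string raw and the substring scheme stores two references plus "baz".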
The paper suggests that you could rework string matching to work on the compressed data but they haven't done it.
Seems to be another commercial cloud-hosted thing offering a Postgres API? https://dbdb.io/db/cedardb
Successor to Umbra, I believe.
I know somebody (quite talented) working there. It's likely to kick ass in terms of performance.
But it's hard to get people to pay for a DB these days.
CedarDB is the commercialization of Umbra, the TUM group's in-memory database led by Professor Thomas Neumann. Umbra is a successor to HyPer, so this is the third generation of the system Neumann came up with.
Umbra/CedarDB isn't a completely new way of doing database stuff, but basically a combination of several things that rearchitect the query engine from the ground up for modern systems: A query compiler that generates native code, a buffer pool manager optimized for multi core, push-based DAG execution that divides work into batches ("morsels"), and in-memory Adaptive Radix Tries (never used in a database before, I think).
It also has an advanced query planner that embraces the latest theoretical advances in query optimization, particularly techniques for unnesting complex query plans, which matter most for queries with a ton of joins. The TUM group has published some great papers on this.
I always wondered how good these planners are in practice. The Neumann/Moerkotte papers are top notch (I've implemented several of them myself), but a planner is much more than its theoretical capabilities; you need so much tweaking and tuning to make anything work well, especially in the cost model. Does anyone have any Umbra experience and can say how well it works for things that are not DBT-3?
The part of Umbra I found interesting was the buffer pool, so that's where I focused most of my attention when reading through it.