What is better: a lookup table or an enum type?
43 points | 15 hours ago | 9 comments | cybertec-postgresql.com
Backslasher
10 minutes ago
To me, since the DB is there to serve the app (which is there to serve the user), the lookup/enum decision mostly depends on whether the list is defined before build time (> enum) or after (> lookup). US states are probably a solid "before", so you get the added value of easily materializing a validator in the app code. Children IDs sound a bit more dynamic.
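A minimal sketch of what "materializing a validator in the app code" could look like when the list is fixed before build time. Python is used purely for illustration; the `USState` enum and the `parse_state` helper are hypothetical names, and only three states are listed to keep it short.

```python
from enum import Enum


class USState(Enum):
    """Hypothetical build-time list; a real one would cover every state."""
    CA = "California"
    NY = "New York"
    TX = "Texas"


def parse_state(code: str) -> USState:
    """Return the matching member, or raise ValueError for unknown codes."""
    try:
        return USState[code.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown state code: {code!r}") from None


if __name__ == "__main__":
    print(parse_state("ca"))
```

Adding a state here means shipping a new build, which is the trade-off the "before or after build time" question is about.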

unwind
6 hours ago
I don't database, but I like to think I have some kind of intuition for storage space requirements, and this article was very confusing.

Ignoring the indexes and just focusing on the main table sizes reported, we have:

- String ("The frequent repetition of these names inflates the size of the table"): 392 MB

- Enum data type ("Internally, an enum type is stored as four-byte floating point number. So it saves space in the table [...]"): 338 MB

- Lookup table ("Also, since a smallint only occupies two bytes, the person_l table can potentially use less storage space than the other solutions"): 338 MB.

I just can't make sense of the numbers, especially given the author's comments that I've quoted.

Is this some kind of typo/editing fail?

leononame
5 hours ago
I'm also wondering about that. But maybe this could be it?

> Surprisingly, the table is just as big as with the enum type above, even though an enum uses four bytes. The reason is that each table row is aligned at a memory address divisible by eight, so PostgreSQL will add six padding bytes after the smallint. If we had more columns and could arrange them carefully, we could see a difference.

This could be the explanation. If each row is padded to a multiple of 8 bytes and the bigint takes 8, then the smallint or enum column effectively takes 8 as well. The entries in the string table will take 8 or 16, depending on the string length. So one row in person_e and person_l is 16, while one row in person_s could be about 20 on average. That is a bit closer to reality than my intuition, although the storage savings are still smaller than I would have expected.
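The arithmetic above can be sketched as follows. This is a simplified model, not PostgreSQL's actual layout code: it assumes a two-column row (an 8-byte bigint followed by the state column), rounds the column data up to a multiple of eight, and ignores the per-row header, which is the same for all three tables. The one-byte length prefix for short strings is also an assumption of this sketch.

```python
def align_up(n: int, boundary: int = 8) -> int:
    """Round n up to the next multiple of boundary."""
    return (n + boundary - 1) // boundary * boundary


BIGINT_BYTES = 8  # the id column


def row_data_bytes(state_bytes: int) -> int:
    """Bytes of column data per row after padding to an 8-byte boundary."""
    return align_up(BIGINT_BYTES + state_bytes)


def text_bytes(value: str) -> int:
    """Assumed size of a short string: 1-byte length prefix plus payload."""
    return 1 + len(value.encode("utf-8"))


if __name__ == "__main__":
    print("smallint:", row_data_bytes(2))  # 8 + 2 = 10, padded to 16
    print("enum:    ", row_data_bytes(4))  # 8 + 4 = 12, padded to 16
    for name in ("Ohio", "California", "North Carolina"):
        print(f"{name!r}:", row_data_bytes(text_bytes(name)))
```

Under these assumptions the smallint and enum rows both come out at 16 bytes of column data, while string rows land on 16 or 24 depending on the length of the name, so the string table ends up somewhat larger rather than several times larger.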

edit:

I did also try out the test and dropped the primary key on the table to compare only enum and string size:

  SELECT pg_size_pretty(pg_relation_size('person_e')), pg_size_pretty(pg_relation_size('person_s'));

  277 MB,330 MB

Does not look like an amazing saving either.

gdevenyi
3 hours ago
> Enum type 4-byte floating point number

This is why the storage is weird. Why would you use a float for distinct number storage!

systems
13 hours ago
well, uniformity and homoiconicity are very important. in an ideal db management system (a.k.a. a true rdbms), everything should be represented as a relation and manipulated with the same set of operators

separation of types and relations should be limited to core atomic types: string, int, date, etc. (although date is debatable, as it is usually not atomic, and many dbs end up with one or more date relations)

anyway, always use a table .. when it's a choice

netcraft
12 hours ago
couldn't have said it better myself.

Data should be data, queryable, relational. So often I have had to change enums into lookup tables - or worse, duplicate them into lookup tables - because now we need other information attached to the values. Labels, descriptions, colors, etc.

My biggest recommendation though is that if you have a lookup table like this, make the value you would have made an enum not just unique, but _the primary key_. Now all the places where you would be putting an ID have the value, just like they would with an enum, and oftentimes you won't need to join. The FK makes sure it's valid. The other information is a join away if you need it.
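A rough sketch of that layout, with the enum-like value as the lookup table's primary key. It uses Python's built-in `sqlite3` module instead of PostgreSQL so it runs without a server; the `state` and `person` tables and their columns are made-up names for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    CREATE TABLE state (
        code  TEXT PRIMARY KEY,   -- the value you would have made an enum
        label TEXT NOT NULL       -- extra information lives here
    );
    CREATE TABLE person (
        id    INTEGER PRIMARY KEY,
        state TEXT NOT NULL REFERENCES state(code)
    );
    INSERT INTO state VALUES ('CA', 'California'), ('NY', 'New York');
    INSERT INTO person VALUES (1, 'CA');
""")

# No join needed to read the value itself.
plain = conn.execute("SELECT state FROM person WHERE id = 1").fetchone()[0]

# The label is one join away when it is wanted.
label = conn.execute(
    "SELECT s.label FROM person p JOIN state s ON s.code = p.state WHERE p.id = 1"
).fetchone()[0]

# The foreign key rejects values that are not in the lookup table.
try:
    conn.execute("INSERT INTO person VALUES (2, 'ZZ')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(plain, label, rejected)
```

Attaching a color or description later is then just another column on `state`, with no change to `person`.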

I do wish though that there were more ways to denote certain tables as configuration data vs domain data, besides naming conventions or schemas.

Edit to add: I will say there is one place where I have begrudgingly used enums, and that's where we have used something like Prisma to get TypeScript types from the schema. It is useful to have types generated for these values. Of course you can do your own generation of those values based on data, but there is a fundamental difference there between "schema" and "data".

systems
11 hours ago
well, if DDL (data definition language) and DML (data manipulation language) were unified and both operated on relations, manipulating metadata would have been a lot simpler and more dynamic

you can always create a data dictionary relation where you store the code for table creation, add metadata, and use dynamic sql to execute the DML code stored in the DB. i worked somewhere where they did this ... sort of

mamcx
12 hours ago
Yeah, that is my thinking with https://tablam.org, where I consider that everything could be a relation, like so:

    "hello world" ? where #chars != " " == ["h", "e", ...]

9rx
10 hours ago
> everything should be represented as a relation

> always use a table .. when it's a choice

Everything should be represented as relations (sets of tuples) but you should always use tables (multisets of tuples) when possible? That seems a little contradictory.

systems
10 hours ago
how do you want to represent relations in a DBMS, an enum or a table ?

9rx
2 hours ago
If said DBMS is relational, with relations.

If said DBMS is tablational, like SQL, then you would have to approximate them using tables and constraints.

If said DBMS is of another paradigm, like a document database, there may be no way to represent relations within the DBMS.

An enum is a construct that numbers things. There is no way to represent a set of tuples with an integer[1]. I'm not sure where you are trying to go with that one. Inversely, you could hold an enum-generated value within a relation. Is that what you mean?

[1] Yes, technically you could break up the individual bits such that they form a set of tuples, but that wouldn't be useful beyond a very narrow use-case and doesn't generalize the way relation implies.

psychoslave
6 hours ago
with foreign keys?

Joker_vD
2 hours ago
Honestly, storage use would probably be the last thing on my mind when deciding what a state/region/district/Bundesland/etc. should be modelled as. Sometimes those things get renamed, sometimes they are merged, and sometimes they are split. That means you may end up in an awkward state when, e.g., Mecklenburg-Vorpommern gets split back into Mecklenburg and Western Pomerania, and some of your customers have updated their addresses and some haven't. You have to store all of that anyway, because remember: your DB doesn't represent the current state of the world, it represents your knowledge about the current state of the world. That is where the whole impetus for NULL originated ("I know that the customer has an address, I just don't know what it is"), along with all the related problems: compare "I know that the customer actually does not have any address at all" and "I know that this address can no longer be correct, but I have no new knowledge about what it could be".

nlitened
7 hours ago
I also love the approach of ClickHouse with LowCardinality(String). Flexible, clear semantics, high performance

CuriouslyC
13 hours ago
From a maintainability standpoint lookup tables are miles ahead, but from a DX perspective there are a few cases where enums are nice. Honestly, I would probably never use enums again; I feel like it has caused pain every time I've used them.

tucnak
8 hours ago
Enums are great if you're into json/jsonb custom logic and aggregates. It's quite cool to use the constraint system to impose checks on various JSON fields, especially if you're doing extension development, or packaging up procedures for downstream consumption.

sublinear
13 hours ago
Basically ugly no matter what.

In a lot of web apps this need tends to be related to validation, so many just do these lookups and simple comparisons in their app logic, based on static values from config files, long before any DB query is made. Sometimes you just don't need to involve the database, and performance would be better for it anyway.
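As a hedged illustration of that approach: the allowed values are loaded once from config and checked in app code before any query runs. The JSON content and the `is_valid_state` helper below are hypothetical; a real app would read the file from disk.

```python
import json

# Stand-in for the contents of a static config file (hypothetical values).
CONFIG_TEXT = '{"allowed_states": ["CA", "NY", "TX"]}'

# Parsed once at startup; a frozenset gives fast membership checks.
ALLOWED_STATES = frozenset(json.loads(CONFIG_TEXT)["allowed_states"])


def is_valid_state(code: str) -> bool:
    """Validate user input without touching the database."""
    return code in ALLOWED_STATES


if __name__ == "__main__":
    print(is_valid_state("CA"), is_valid_state("ZZ"))
```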

aksss
9 hours ago
Table with a thread-safe read-through cache in code, imo. But there are places where enums make sense. For instance, things that are specifically in the code's domain.
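One way such a read-through cache might look, as a sketch only. The `ReadThroughCache` class and the `load_label` loader are hypothetical names; the loader stands in for a query against the lookup table.

```python
import threading
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """Caches lookup-table rows; misses fall through to the loader."""

    def __init__(self, loader: Callable[[str], Optional[str]]) -> None:
        self._loader = loader  # e.g. a function that queries the table
        self._values: Dict[str, Optional[str]] = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> Optional[str]:
        with self._lock:  # one lock keeps reads and fills consistent
            if key not in self._values:
                self._values[key] = self._loader(key)
            return self._values[key]


# Hypothetical loader standing in for a SELECT against the lookup table.
_TABLE = {"CA": "California", "NY": "New York"}
calls = []


def load_label(code: str) -> Optional[str]:
    calls.append(code)  # record each trip to the "database"
    return _TABLE.get(code)


cache = ReadThroughCache(load_label)
first = cache.get("CA")
second = cache.get("CA")  # served from the cache, loader not called again
```

Holding the lock while loading keeps the sketch simple but serializes cache misses, and there is no invalidation, so this only suits lookup data that rarely changes.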

veltas
7 hours ago
Who was child 12

teddyh
6 hours ago
Who was child 12̣