Ignoring the indexes and just focusing on the main table sizes reported, we have:
- String ("The frequent repetition of these names inflates the size of the table"): 392 MB
- Enum data type ("Internally, an enum type is stored as four-byte floating point number. So it saves space in the table [...]"): 338 MB
- Lookup table ("Also, since a smallint only occupies two bytes, the person_l table can potentially use less storage space than the other solutions"): 338 MB.
I just can't make sense of the numbers, especially given the authors comments that I've quoted.
Is this some kind of typo/editing fail?
> Surprisingly, the table is just as big as with the enum type above, even though an enum uses four bytes. The reason is that each table row is aligned at a memory address divisible by eight, so PostgreSQL will add six padding bytes after the smallint. If we had more columns and could arrange them carefully, we could see a difference.
This could be the explanation. If the row is padded to 8, bigint is 8, then smallint or enum also use 8. The entries in the string table will be 8 or 16 due to the string length. So one row in person_e and person_l is 16, one row in person_s could be about 20 on average, that is a bit closer to the reality than my intuition, although the storage savings are still less than what I would have expected.
edit:
I did also try out the test and dropped the primary key on the table to compare only enum and string size:
SELECT PG_SIZE_PRETTY(PG_RELATION_SIZE('person_e')), PG_SIZE_PRETTY(PG_RELATION_SIZE('person_s'))
277 MB,330 MB
Does not look like an amazing saving either.This is why the storage is weird. Why would you use a float for distinct number storage!
separations of types and relations should be limited to core atomic type, string, int , date etc ... (althought date is debatable as is not usually atomic in most cases, and many dbs end up with one more date relations)
anyway, always use a table .. when its a choice
Data should be data, queryable, relational. So often I have had to change enums into lookup tables - or worse, duplicate them into lookup tables - because now we need other information attached to the values. Labels, descriptions, colors, etc.
My biggest recommendation though is that if you have a lookup table like this, make the value you would have made an enum not just unique, but _the primary key_. Now all the places that you would be putting an ID have the value just like they would with an enum, and oftentimes you wont need to join. The FK makes sure its valid. The other information is a join away if you need it.
I do wish though that there were more ways to denote certain tables as configuration data vs domain data, besides naming conventions or schemas.
Edit to add: I will say there is one places where I have begrudgingly used enums and thats where we have used something like prisma to get typescript types from the schema. It is useful to have types generated for these values. Of course you can do your own generation of those values based on data, but there is a fundamental difference there between "schema" and "data".
you can always created data dictionary relation, where you stored the code for table creation, add meta data, and use dynamic sql to execute the DML code stored in the DB, i worked somewhere where they did this ... sort of
"hello world" ? where #chars != " " == ["h", "e", ...]> always use a table .. when its a choice
Everything should be represented as relations (sets of tuples) but you should always use tables (multisets of tuples) when possible? That seems a little contradictory.
If said DBMS is tablational, like SQL, then you would have to approximate them using tables and constraints.
If said DBMS is of an another paradigm, like a document database, there may be no way to represent relations within the DBMS.
An enum is a construct that numbers things. There is no way to represent a set of tuples with an integer[1]. I'm not sure where you are trying to go with that one. Inversely, you could hold an enum generated value within a relation. Is that what you mean?
[1] Yes, technically you could break up the individual bits such that they form a set of tuples, but that wouldn't be useful beyond a very narrow use-case and doesn't generalize the way relation implies.
In a lot of web apps this need tends to be related to validation, so many just do these lookups and simple comparisons in their app logic and based on static values from config files long before any db query is made. Sometimes you just don't need to involve the database and the performance would be better for it anyway.