A nice thing about that point of view is that it fits with your point; redundancy is redundancy whether you look at it with a column-based view or a row-based view.
Definitely agree with what you said - if we treat them as the same thing, that's going to mislead some folks.
There are columnar storage engine extensions for many of the popular databases, though.
Looking forward to checking out `pg_duckdb`, yes.
There's always a reason for a dev to ship something shitty, but when you can show that the same operation uses 80% less storage, you can make the accountants your lever.
But I don't think that's among the top 5 reasons for normalization.
1NF removes repeating groups: for example, putting the data for each month in its own row rather than an array of 12 months in one row.
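A minimal sketch of that 1NF transformation in Python (the field names and values are made up for illustration):

```python
# Denormalized: a repeating group of 12 monthly values in one record.
denormalized = {
    "account_id": 42,
    "monthly_sales": [100, 120, 90, 110, 95, 130, 140, 105, 115, 125, 135, 150],
}

# 1NF: one row per (account, month) fact, no repeating group.
normalized = [
    {"account_id": denormalized["account_id"], "month": m + 1, "sales": s}
    for m, s in enumerate(denormalized["monthly_sales"])
]

print(len(normalized))  # 12 rows, one per month
print(normalized[0])    # {'account_id': 42, 'month': 1, 'sales': 100}
```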
Storage efficiency was never the point. IMS had that locked down. Succinctness of expression and accuracy of results was the point. And is: normalization prevents anomalous results.
Normalization ultimately boils down to breaking your data down into the most elemental and simplest "facts" about them, and its greatest value is how it allows and encourages more flexible and disparate ways of looking at the same information.
Classic example being something like a “users” table that tracks account id, display name (mutable), and profile picture (mutable). And then a “posts” table that has post id, account id, and message text. This allows you to change the display name/picture in one place and it can be used across all posts
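That users/posts example can be sketched with SQLite from the Python standard library (the table and column names are illustrative, not from any particular app):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    account_id      INTEGER PRIMARY KEY,
    display_name    TEXT,
    profile_picture TEXT
);
CREATE TABLE posts (
    post_id    INTEGER PRIMARY KEY,
    account_id INTEGER REFERENCES users(account_id),
    message    TEXT
);
""")
conn.execute("INSERT INTO users VALUES (1, 'alice', 'alice_v1.png')")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                 [(10, 1, 'first post'), (11, 1, 'second post')])

# Change the display name in one place...
conn.execute("UPDATE users SET display_name = 'alice2' WHERE account_id = 1")

# ...and every post sees the new name via the join.
rows = conn.execute("""
    SELECT p.post_id, u.display_name, p.message
    FROM posts p JOIN users u ON u.account_id = p.account_id
    ORDER BY p.post_id
""").fetchall()
print(rows)  # both posts show 'alice2'
```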
The main cost is on the join when you need to access several columns, it's flexible but expensive.
To take full advantage of columnar, that join usually has to be made implicitly through data alignment, so no actual join is performed.
For example, segment the table into chunks of up to N records, and keep each column of that chunk contiguous so the columns can be accessed independently:

r0, r1 ... rm; f0, f0 ... f0; f1, f1 ... f1; ...; fn, fn ... fn

(the record keys first, then all the values of field f0 for those records, then all the values of f1, and so on)
That balances pointer chasing and joining: you can avoid the I/O by loading only the needed columns from the segment, and skip the join because the data is trivially aligned.

Especially with arrays, what could be one line of JSON gets awkward in a CSV: either you store the non-normalized array as a string in a single cell, or you expand the array to one value per cell, creating $array_size rows.
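One way to sketch that segmented layout in plain Python (the segment size, field names, and data are made up):

```python
# Store a table as fixed-size segments; within each segment, each column
# is its own contiguous list, so columns can be loaded independently.
SEGMENT_SIZE = 4  # up to N records per segment (illustrative)

rows = [{"id": i, "name": f"user{i}", "score": i * 10} for i in range(10)]

# Build segments: one dict of column lists per chunk of N rows.
segments = []
for start in range(0, len(rows), SEGMENT_SIZE):
    chunk = rows[start:start + SEGMENT_SIZE]
    segments.append({
        "id":    [r["id"] for r in chunk],
        "name":  [r["name"] for r in chunk],
        "score": [r["score"] for r in chunk],
    })

# Scanning only 'score' touches one array per segment, and no join is
# needed: position i in every column of a segment is the same record.
scores = [s for seg in segments for s in seg["score"]]
print(len(segments))  # 3 segments (4 + 4 + 2 records)
print(scores[:3])     # [0, 10, 20]
```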
You can normalize data in just about any structured format, but columns aren't the be-all-end-all normalization format. I think pandas uses "frames".
Pandas is column-oriented, as are basically all high-performance data libraries. Each column is a separate array of data. To get a "row" you take the n-th item from each of the arrays.
And FWIW, column-oriented isn't considered normalization. It's a physical optimization that can yield enormous performance advantages for some classes of problems, but can cause a performance nightmare for other problems.
Data analytics loves column-oriented. CRUD type stuff does not. And in the programming realm there are several options to have Structures of Arrays (SoA) instead of the classic Arrays of Structures (AoS).
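The AoS vs SoA distinction above can be sketched in plain Python (field names and values are illustrative):

```python
# Array of Structures: one record object per row (CRUD-friendly).
aos = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}, {"x": 5.0, "y": 6.0}]

# Structure of Arrays: one contiguous array per field (analytics-friendly).
soa = {"x": [1.0, 3.0, 5.0], "y": [2.0, 4.0, 6.0]}

# Column scan: SoA reads a single array end to end.
total_x = sum(soa["x"])

# Row access: SoA must gather the n-th item from each field's array.
row_1 = {field: values[1] for field, values in soa.items()}

print(total_x)  # 9.0
print(row_1)    # {'x': 3.0, 'y': 4.0}
```

The trade-off in the comments is the same one the thread describes: columnar/SoA makes whole-column scans cheap, while row-at-a-time access has to touch every array.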