I needed to look through the latest census data, but it was exported as multiple multi-gigabyte, bespoke, latin1-encoded CSV files. Pandas, Polars, and SQLite's CSV import tool weren't much help, so I shelved the project until recently, when I started taking a SQLite course online.
I picked it up again, normalized the data, and now there's a database that can be queried through a SQL view matching the headings in the original CSVs. I'm proud of the script I wrote to export the data, automatically compress the artifact, generate the diagrams and checksums, and so on.
This is my first time building a big database; does my schema seem sane? I've been considering switching the counts from REAL to TEXT, since SQLite's decimal extension can then do exact calculations, but given there are only one or two places after the decimal point in the data, I'm not sure it's worth the space.
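For illustration, a minimal sketch of the trade-off, assuming the decimal functions are available (they're built into recent sqlite3 shells; elsewhere the extension has to be loaded first):

-- Binary REALs can't represent these values exactly, so this returns 0 (false):
SELECT 0.1 + 0.2 = 0.3;
-- The decimal extension does exact arithmetic on TEXT, so this returns 0 (equal):
SELECT decimal_cmp(decimal_add('0.1', '0.2'), '0.3');

The cost is that every comparison and aggregate has to go through the decimal_* functions instead of plain operators.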
It would also be very helpful, imho, to indicate keys and indexes, perhaps by modifying your schema diagram, or simply (and maybe better) just dumping the actual SQL schema definition (i.e. the output of sqlite3's ".schema" command).
I had a clear end goal in mind: a smallish database I could easily query for municipality data. But in the middle of the project I started to miss the forest for the trees, thinking about every theoretical use case. I think I achieved the goal, but it would've been better to keep it in sight the whole way.
The schema diagram is limited, I agree. The tool I was using has no options, so I might switch to another library and do more manual work, or otherwise indicate it in text as you mentioned. The diagram does show primary- and foreign-key relations; it just might not be clear.
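For the text route, an annotated ".schema" excerpt would probably do the job. A hypothetical sketch of what that might look like for the characteristic table (the parent_id column and index name here are my assumptions, not the actual schema):

CREATE TABLE characteristic (
    id          INTEGER PRIMARY KEY,                    -- key is explicit here
    parent_id   INTEGER REFERENCES characteristic(id),  -- parent/child relation
    description TEXT NOT NULL
);
CREATE INDEX characteristic_parent_idx ON characteristic (parent_id);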
First, I have to find the characteristic I'm looking for:
SELECT id, description FROM characteristic WHERE description LIKE '%Inuktitut%';
I can see what I'm looking for, id 455, which is "Total - Mother tongue for the total population excluding institutional residents - 100% data; Single responses; Non-official languages; Indigenous languages; Inuktut (Inuit) languages; Inuktitut". Since these characteristics have parents and children, I can build up a graph with a recursive CTE, which could be useful.
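A sketch of what that CTE could look like, assuming the parent/child link is stored as a parent_id column on characteristic (the column name is my guess):

WITH RECURSIVE ancestry AS (
    -- Start from the characteristic found above...
    SELECT id, description, parent_id
    FROM characteristic
    WHERE id = 455
    UNION ALL
    -- ...then follow the parent links up to the root.
    SELECT c.id, c.description, c.parent_id
    FROM characteristic c
    JOIN ancestry a ON c.id = a.parent_id
)
SELECT id, description FROM ancestry;

To use the characteristic, I query the view with: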
SELECT geo_name, c1_count_total, c2_count_men, c3_count_women FROM cview
WHERE geo_level_name = 'Census subdivision' -- Roughly a municipality
AND characteristic_id = 455
AND c1_count_total > 1500 -- Total count (speakers, in this case)
ORDER BY c1_count_total DESC;
Which gives:
┌─────────────────────────────────────┬────────────────┬──────────────┬────────────────┐
│ geo_name │ c1_count_total │ c2_count_men │ c3_count_women │
├─────────────────────────────────────┼────────────────┼──────────────┼────────────────┤
│ 'Arviat, Hamlet (HAM)' │ 2465.0 │ 1235.0 │ 1235.0 │
│ 'Iqaluit, City (CY)' │ 2195.0 │ 985.0 │ 1215.0 │
│ 'Puvirnituq, Village nordique (VN)' │ 1965.0 │ 940.0 │ 1030.0 │
│ 'Igloolik, Hamlet (HAM)' │ 1820.0 │ 975.0 │ 845.0 │
│ 'Inukjuak, Village nordique (VN)' │ 1745.0 │ 910.0 │ 835.0 │
│ 'Kuujjuaq, Village nordique (VN)' │ 1745.0 │ 795.0 │ 945.0 │
└─────────────────────────────────────┴────────────────┴──────────────┴────────────────┘
So it appears to work as intended (although the men+women count isn't exactly the same as the total count; I suspect that's some StatCan weirdness I'll have to look into). It's slow, but I'm not too worried about that: it's not meant for fast real-time queries, and I can always add indexes later, especially since it's now a read-only database in practice.

My main concerns are the schema and the use of REALs and NULLs. Where there are empty strings in the CSV, I have NULLs in the database. I suspect that's the best move, but I also suspect that REALs instead of DECIMALs (i.e. TEXT processed with the SQLite decimal extension) may be the wrong abstraction.

For my use case, I think the database covers all of my needs except associating the areas with a province/territory, although I know which dataset I'd need for that information, and that will likely land in a future iteration of the database.
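If speed ever does matter, an index matching the view's filters would be the first step; something along these lines, where census_data stands in for whatever base table the view actually reads from (an illustrative name only):

-- Index the columns the WHERE clause above filters on.
CREATE INDEX census_data_char_geo_idx
    ON census_data (characteristic_id, geo_level_name);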
I say "it appears to work" because I'm not sure the best way to test the data in the database without reimplementing the parsing and rechecking every row in each CSV against the database. I'm wary of blindly trusting the data is correct and all there.