Show HN: The Canada census data in a SQLite file; advice appreciated
6 points
21 days ago
| 1 comment
| github.com
| HN
This is niche, I'll admit.

I needed to look through the latest census data, but it was exported as multiple multi-gigabyte bespoke latin1-encoded CSV files. Pandas, Polars, and SQLite's CSV import tool weren't much help, so I shelved the project until recently, when I started taking a SQLite course online.

I picked it up again, normalized the data, and now there's a database that can be queried through a SQL view that matches the headings in the original CSVs. I'm proud of the script I created to export the data, as well as automatically compress the artifact, make the diagrams and checksums, etc.

This is my first time building up a big database, does my schema seem sane? I've been considering switching the counts from REALs to TEXT, since then SQLite's decimal extension can do exact calculations, but considering there's only one or two places after the decimal points in the data, I'm not sure if it's worth it space-wise.

mattewong
21 days ago
[-]
I am always a proponent of starting with the end goal and then working backward. What are the end results you are aiming to achieve (or aiming to allow your audience to achieve)? Is marginal precision more important than the speed impact? The optimal database design will depend on that (i.e., on what you are optimizing for...).

It would also be very helpful, imho, to indicate keys and indexes, perhaps by modifying your schema diagram, or simply (and maybe better), just dump the actual SQL schema definition (i.e. the output from sqlite3's ".schema" command)

reply
Rendello
20 days ago
[-]
Thank you for taking the time, this is exactly the sort of discussion I was looking for :)

I had a clear end goal in mind, a smallish database I can query for municipality data easily, but in the middle of a project I started to miss the forest for the trees thinking about every theoretical use case. I think I achieved my goal, but it would've been better to keep my end goal in mind.

The schema diagram is limited, I agree. The tool I was using has no options, so I might switch to another library and do more manual work, or otherwise indicate it in text as you mentioned. The diagram does show primary and foreign key relations, it just might not be clear.

reply
mattewong
20 days ago
[-]
Glad to be helpful-- I'm in the business of data process automation, so I appreciate the opportunity to learn about new use cases. If you are willing to share what your end goal was in more detail (even as simple as an SQL query that you would now run want to run against your current schema), I'd be interested to see how an optimal process could be designed to easily generate that, and possibly suggest some tooling you could find useful. You may also want to try posting questions like this in forums such as the Seattle Data Guy's discord channel, and I'm suspect you will get lots of suggestions and advice.
reply
Rendello
20 days ago
[-]
Here's an example case: getting all municipalities in Canada with more than 1500 native speakers of Inuktitut (the language of the Inuit):

First, I have to find the characteristic I'm looking for:

   SELECT id, description FROM characteristic WHERE description LIKE "%Inuktitut%";
I can see what I'm looking for, id 455, which is "Total - Mother tongue for the total population excluding institutional residents - 100% data; Single responses; Non-official languages; Indigenous languages; Inuktut (Inuit) languages; Inuktitut" (since these characteristics have parents and children, I can build up a graph with a CTE, which could be useful).

To use it, I query the view with:

  SELECT geo_name, c1_count_total, c2_count_men, c3_count_women FROM cview
  WHERE geo_level_name = 'Census subdivision' -- Roughly a municipality
  AND  characteristic_id = 455
  AND c1_count_total > 1500 -- Total count (speakers, in this case)
  ORDER BY c1_count_total DESC;
Which gives:

  ┌─────────────────────────────────────┬────────────────┬──────────────┬────────────────┐
  │              geo_name               │ c1_count_total │ c2_count_men │ c3_count_women │
  ├─────────────────────────────────────┼────────────────┼──────────────┼────────────────┤
  │ 'Arviat, Hamlet (HAM)'              │ 2465.0         │ 1235.0       │ 1235.0         │
  │ 'Iqaluit, City (CY)'                │ 2195.0         │ 985.0        │ 1215.0         │
  │ 'Puvirnituq, Village nordique (VN)' │ 1965.0         │ 940.0        │ 1030.0         │
  │ 'Igloolik, Hamlet (HAM)'            │ 1820.0         │ 975.0        │ 845.0          │
  │ 'Inukjuak, Village nordique (VN)'   │ 1745.0         │ 910.0        │ 835.0          │
  │ 'Kuujjuaq, Village nordique (VN)'   │ 1745.0         │ 795.0        │ 945.0          │
  └─────────────────────────────────────┴────────────────┴──────────────┴────────────────┘
So it appears to work as intended (although the men+women count isn't exactly the same as the total count, I suspect that's some StatCan wierdness that I'll have to look at). It's slow, but I'm not too worried about that, it's not meant for fast real time queries, and I can always add indexes later especially since it's now a read-only database in practice.

My main concerns are regarding the schema and use of REALs and NULLs. Where there are empty strings in the CSV, I have NULLs in the database. I suspect that's the best move, but I suspect having REALs instead of DECIMALs (ie. TEXT processed with the SQLite extension) may be the wrong abstraction. For my usecase, I think the database covers all of my needs except having the areas associated with a province/territory, although I know which dataset I would need for that information and that will likely be in a future iteration of the database.

I say "it appears to work" because I'm not sure the best way to test the data in the database without reimplementing the parsing and rechecking every row in each CSV against the database. I'm wary of blindly trusting the data is correct and all there.

reply
mattewong
16 days ago
[-]
That is great, thank you. I'd love to continue the conversation-- maybe easier in a separate forum. Can I follow-up via the email address on your profile (gaven...)?
reply
Rendello
16 days ago
[-]
Of course! :)
reply