Research team digitizes more than 100 years of Canadian infectious disease data
115 points
6 days ago
| 5 comments
| news.mcmaster.ca
| HN
arjie
1 hour ago
This is very cool. I wonder whether they stored the raw scans as well, or whether the transcriptions were made directly from the source material. The former would be fantastic if possible, since present or future OCR technology could then be cross-referenced against the transcriptions, both to catch human error in the dataset and to improve OCR by using the transcriptions as a labeled dataset.

It also seems like a huge effort to try to come up with a data model here for the normalized dataset. They mentioned it in the article as an aside but it seems like a pretty tough task.

And my last thought is that perhaps there is a sadness in our loss of population-level insight into health with the advent of modern privacy concerns. A big source for genomics data is the UK Biobank, which ties all sorts of information to a genome. I’m sure that someone could come up with dangers that this presents, but it’s been a gift to so many researchers over time, and to so many people who suffer from genetic disease. I hope that in time people will be willing to volunteer sufficient information to be able to do population-level science, even knowing the dangers inherent.

If you’re in the US, All of Us accepts participation, and I found that doing so was quite easy. They will give you all the scary warnings, which are good to consider, but I hope many will find it worthwhile even knowing the risks.

akudha
7 hours ago
What useful tools can be made from such a dataset?

The other day I came across this pricing dataset https://oria-data.trillianthealth.com/ (this is just for pricing though)

There must be some gem datasets like these. I wish I had the time (and expertise) to explore them.

rumplecat
2 hours ago
Interesting that they manually transcribed the data to Excel. It would also be interesting to know how they mapped from the Excel files to the final dataset. I wonder if LLMs could make the switch from scans to structured data more efficiently, and how much of a hit to accuracy would be involved.
toomuchtodo
10 hours ago
water-your-self
7 hours ago
Second link is the database
tim-tday
6 hours ago
Do you want computer viruses? Because that’s how you get computer viruses.