The problem is the non-consumptive load where they just flat-out DDoS the site for no actual reason. They should be criminally charged for that.
Late edit: Individual page loads to answer specific questions aren't a problem either. DDoS is the problem.
Compare:
https://en.wikipedia.org/wiki/Transistor
with the raw markup seen in
https://en.wikipedia.org/w/index.php?title=Transistor&action...
That markup format is very hard to parse/render because it evolved organically to mean "whatever Wikipedia software does." I haven't found an independent renderer that handles all of its edge cases correctly. The new Kaggle/Wikimedia collaboration seems to solve that problem for many use cases, since the announcement says
This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines.

The dataset upload, as of 15 April 2025, includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections (excluding references and other non-prose elements).
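As a rough illustration of what working with that JSON could look like, here's a minimal sketch. The file name and field names ("name", "abstract", "sections") are my guesses from the announcement's description, so check the actual schema on the Kaggle dataset page.

    import json

    # Minimal sketch of reading the structured dump as JSON Lines.
    # Field names ("name", "abstract", "sections") are assumptions based on
    # the announcement; check the real schema on the Kaggle dataset page.
    def iter_articles(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    for article in iter_articles("enwiki_structured_contents.jsonl"):
        title = article.get("name")
        abstract = article.get("abstract") or ""
        sections = article.get("sections", [])
        print(title, "-", abstract[:80], f"({len(sections)} sections)")
        break  # just peek at the first record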
I was at an interview for a tier-one AI lab, and the PM I was talking to refused to believe that the torrent dumps from Wikipedia were fresh and usable for training.
When you spend all your time fighting bot-detection measures, it's hard to imagine someone willingly putting their data out there for free.
The correct way to do this is to stand up a copy of MediaWiki on your own infra and then scrape that. That will give you shittons of HTML to parse and tokenize. If you can't work with that, then you're not qualified to do this kind of thing, sorry.
[0] If you're wondering, I was scraping Wikimedia Commons directly from their public API, from my residential IP with my e-mail address in the UA. This was primarily out of laziness, but I believe this is the way you're "supposed" to use the API.
Yes, I did try to work with Wikitext directly, and yes that is a terrible idea.
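To make the "scrape your own copy" point concrete, here's a rough sketch: hit the action API's parse endpoint on a self-hosted MediaWiki mirror and let MediaWiki itself expand the wikitext into HTML. The hostname and contact address below are placeholders.

    import requests

    # Sketch: let MediaWiki's own parser expand the wikitext and return
    # rendered HTML via the action API, so you never reimplement wikitext.
    # The host is a placeholder for a self-hosted mirror loaded from the dumps.
    API = "http://mediawiki.local/w/api.php"
    HEADERS = {"User-Agent": "dataset-builder/0.1 (contact: you@example.com)"}

    def rendered_html(title):
        resp = requests.get(API, params={
            "action": "parse",
            "page": title,
            "prop": "text",
            "format": "json",
            "formatversion": 2,
        }, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        return resp.json()["parse"]["text"]  # full article body as HTML

    html = rendered_html("Transistor")

From there it's ordinary HTML parsing and tokenization.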
In the same set of interviews I made the point that the only way to meaningfully extract the semantics of a page meant for human consumption is to use a vision model that treats the typesetting as a guide to structure.
The perfect example was the contract they sent, which looked completely fine but was a Word document with only WYSIWYG formatting: headings were just extra-large bold text rather than being marked up as headings. If you used the programmatically extracted text as training data, you'd be in trouble.
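For illustration, a sketch of why that kind of document defeats structure-aware extraction (using python-docx; the file name and the 14pt threshold are made up): a real heading reports a "Heading N" style, while a heading faked with big bold text reports "Normal".

    from docx import Document  # pip install python-docx

    # Sketch: WYSIWYG "headings" are invisible to structure-aware extraction.
    # The file name and the 14pt threshold are illustration values only.
    doc = Document("contract.docx")
    for p in doc.paragraphs:
        if not p.text.strip():
            continue
        big_and_bold = any(
            r.bold and r.font.size is not None and r.font.size.pt >= 14
            for r in p.runs
        )
        if p.style.name.startswith("Heading"):
            print("real heading:   ", p.text)
        elif big_and_bold:
            print("fake heading(?):", p.text)  # only the layout reveals it

A vision model sees the typesetting; the text extractor only sees "Normal" paragraphs.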
Is there a source on this?
Bot traffic is notoriously stupid: reloading the same pages over and over, surging one hour and gone the next, getting stuck in loops, not understanding HTTP response codes... It's only gotten worse with all the AI scrapers. Somehow, they seem even more poorly written than the search-engine bots.
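For contrast, a well-behaved fetcher isn't complicated. A rough sketch (the User-Agent, cache, and backoff values are illustrative, not anyone's actual crawler):

    import time
    import requests

    # Sketch of a minimally polite fetcher: identify yourself, honor HTTP
    # status codes, back off on 429/503, and don't re-fetch what you have.
    HEADERS = {"User-Agent": "example-crawler/0.1 (contact: you@example.com)"}
    seen = {}  # url -> body, so the same page isn't reloaded over and over

    def polite_get(url, max_retries=5):
        if url in seen:
            return seen[url]
        for attempt in range(max_retries):
            resp = requests.get(url, headers=HEADERS, timeout=30)
            if resp.status_code == 200:
                seen[url] = resp.text
                return resp.text
            if resp.status_code in (429, 503):  # rate limited / overloaded
                time.sleep(int(resp.headers.get("Retry-After", 2 ** attempt)))
                continue
            resp.raise_for_status()  # other errors: stop, don't loop forever
        raise RuntimeError("gave up on " + url)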
- English: "Size of uncompressed dataset: 79.57 GB chunked by max 2.15GB."
- French: "Size of the uncompressed dataset: 34.01 GB chunked by max 2.15GB."
In 2025, the standards for ML datasets are quite high. As for the size, it's small compared to the training data of most LLMs, but large relative to their context size. Probably best used for retrieval-augmented generation or similar information-retrieval applications.
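A minimal retrieval sketch of that kind of use (the hard-coded "sections" are stand-ins for text pulled from the dump, and TF-IDF stands in for a real embedding model):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy retrieval over article sections, as a stand-in for RAG-style lookup.
    # Replace the hard-coded sections with text from the dump, and TF-IDF with
    # a proper embedding model, for anything real.
    sections = [
        "A transistor is a semiconductor device used to amplify or switch signals.",
        "The first working transistor was demonstrated at Bell Labs in 1947.",
        "Field-effect transistors control current with an electric field.",
    ]
    vectorizer = TfidfVectorizer().fit(sections)
    index = vectorizer.transform(sections)

    query = "who built the first transistor"
    scores = cosine_similarity(vectorizer.transform([query]), index)[0]
    print(sections[scores.argmax()])  # retrieved context to hand to the model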
('Issues: Other' https://www.kaggle.com/contact#/other/issue)
they might do something about it.
(Also, FYI, I've previously posted feedback pieces in Kaggle forums that got a very warm direct response from the executives, although that was before the acquisition.)
So, for the average website, you'd be right, but not for Google Cloud/Colab's showcase property.
Anyway, just a story about small-time closed data, for reference.