Kaggle and the Wikimedia Foundation are partnering on open data (blog.google)
136 points by xnx 2 days ago | 7 comments
toomuchtodo
2 days ago
[-]
This sounds like a good way to take the ML/AI consumption load off Wikimedia infra?
reply
immibis
2 days ago
[-]
The consumption load isn't the problem. You can download a complete dump of Wikipedia, and even if every AI company downloaded the newest dump every time it came out, it would still be a manageable server load: probably double-digit terabytes per month, which is nothing special these days. And if that were a problem, they could charge a reasonable amount to ship it on a stack of BD-R discs, or heck, these companies can easily afford a leased line to Wikimedia HQ.

The problem is the non-consumptive load where they just flat-out DDoS the site for no actual reason. They should be criminally charged for that.

Late edit: Individual page loads to answer specific questions aren't a problem either. DDoS is the problem.

reply
parpfish
2 days ago
[-]
I'd assume that AI companies would use the wiki dumps for training, but there are probably tons of bots that query Wikipedia over the web when doing some sort of web search/function call.
reply
philipkglass
2 days ago
[-]
The raw wiki dumps contain "wikitext" markup that is significantly different from the nice readable pages you see while browsing Wikipedia.

Compare:

https://en.wikipedia.org/wiki/Transistor

with the raw markup seen in

https://en.wikipedia.org/w/index.php?title=Transistor&action...

That markup format is very hard to parse/render because it evolved organically to mean "whatever Wikipedia software does." I haven't found an independent renderer that handles all of its edge cases correctly. The new Kaggle/Wikimedia collaboration seems to solve that problem for many use cases, since the announcement says:

This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines.

The dataset upload, as of 15 April 2025, includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections (excluding references and other non-prose elements).
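
In practice, working with that dump is roughly the sketch below. The chunk filename and the field names ("name", "abstract") are guesses based on the announcement's description rather than a documented schema, so check the Kaggle dataset listing before relying on them:

  import json

  # Hypothetical chunk filename; the real file names are on the Kaggle dataset page.
  path = "enwiki_namespace_0_0.jsonl"

  with open(path, encoding="utf-8") as f:
      for line in f:
          article = json.loads(line)
          # "name" and "abstract" are assumed field names, not a documented schema.
          title = article.get("name")
          abstract = article.get("abstract")
          if abstract:
              print(title, "->", abstract[:80])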

reply
freeone3000
2 days ago
[-]
Just run your own copy of the wikipedia code. It’ll be cheaper than whatever inference you’re doing.
reply
paulryanrogers
2 days ago
[-]
IDK why this was downvoted. Wikimedia wikitext can be transformed with some regexes. Not exactly fast, but likely far easier than playing cat and mouse with bot blockers.
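
Something like the sketch below gets you surprisingly far, though as noted elsewhere in the thread it will miss templates, nesting, and plenty of other edge cases that only real MediaWiki rendering handles:

  import re

  def strip_wikitext(text):
      # Drop templates like {{Infobox ...}} (naive: does not handle nesting).
      text = re.sub(r"\{\{[^{}]*\}\}", "", text)
      # Drop <ref .../> and <ref>...</ref> footnotes.
      text = re.sub(r"<ref[^>]*?/>", "", text)
      text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
      # [[target|label]] -> label, [[target]] -> target.
      text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)
      # Bold/italic quote markers.
      text = re.sub(r"'{2,}", "", text)
      return text

  print(strip_wikitext("A '''transistor''' is a [[semiconductor device|semiconductor]] device.{{cn}}"))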
reply
jsheard
2 days ago
[-]
The bots which query in response to user prompts aren't really the issue. The really obnoxious ones just crawl the entire web aimlessly looking for training data, and wikis or git repos with huge edit histories and on-demand generated diffs are a worst-case scenario for that: even if a crawler only visits each page once, there's a near-infinite number of "unique" pages to visit.
reply
noosphr
2 days ago
[-]
You'd assume wrong.

I was at an interview for a tier-one AI lab, and the PM I was talking to refused to believe that the torrent dumps from Wikipedia were fresh and usable for training.

When you spend all your time fighting bot-detection measures, it's hard to imagine someone willingly putting their data out there for free.

reply
kmeisthax
2 days ago
[-]
As someone who has actually tried scraping Wikimedia Commons for AI training[0], I'd say they're correct only in the most literal sense. Wikitext is effectively unparseable, so just using the data dump directly is a bad idea.

The correct way to do this is to stand up a copy of MediaWiki on your own infra and then scrape that. That will give you shittons of HTML to parse and tokenize. If you can't work with that, then you're not qualified to do this kind of thing, sorry.

[0] If you're wondering, I was scraping Wikimedia Commons directly from their public API, from my residential IP with my e-mail address in the UA. This was primarily out of laziness, but I believe this is the way you're "supposed" to use the API.

Yes, I did try to work with Wikitext directly, and yes that is a terrible idea.
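
For reference, the "supposed" way from the footnote looks roughly like this: a descriptive User-Agent with a contact address, and asking the action API for rendered HTML instead of raw wikitext. The UA string and page title below are placeholders:

  import requests

  # Identify yourself, per Wikimedia's bot etiquette (placeholder contact details).
  HEADERS = {"User-Agent": "my-research-scraper/0.1 (contact: you@example.com)"}

  resp = requests.get(
      "https://commons.wikimedia.org/w/api.php",
      params={
          "action": "parse",          # return the rendered page
          "page": "Commons:Welcome",  # placeholder page title
          "prop": "text",
          "format": "json",
          "formatversion": "2",
      },
      headers=HEADERS,
      timeout=30,
  )
  resp.raise_for_status()
  html = resp.json()["parse"]["text"]  # rendered HTML, far easier to handle than wikitext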

reply
noosphr
2 days ago
[-]
This is starting to get into the philosophical question of what training data should look like.

From the same set of interviews I made the point that the only way to meaningfully extract the semantics of a page meant for human consumption is to use a vision model that uses typesetting as a guide for structure.

The perfect example was the contract they sent, which looked completely fine but was a Word document with only WYSIWYG formatting, e.g. headings were just extra-large bold text rather than marked up as headings. If you used the programmatically extracted text as training data you'd be in trouble.

reply
immibis
2 days ago
[-]
Sounds like they're breaking the CFAA and should be criminally prosecuted.
reply
mrbungie
2 days ago
[-]
Wikimedia or someone else could offer some kind of MCP service/proxy/whatever for real-time data consumption (i.e. for use cases where the dump data is not useful enough), billed by usage.
reply
ipaddr
2 days ago
[-]
Does any repo exist with an updated bot list to block these website-killing bots?
reply
squigz
2 days ago
[-]
I'm confused. Are you suggesting that the AI companies actively participate in malicious DDOS campaigns against Wikimedia, for no constructive reason?

Is there a source on this?

reply
kbelder
2 days ago
[-]
Not maliciousness. Incompetence.

Bot traffic is notoriously stupid: reloading the same pages over and over, surging one hour and then gone the next, getting stuck in loops, not understanding HTTP response codes... It's only gotten worse with all the AI scrapers. Somehow, they seem even more poorly written than the search engine bots.

reply
immibis
2 days ago
[-]
Mine disappeared after about a week of serving them all the same dummy page on every request. They were fetching the images on the dummy page once for each time they fetched the page...
reply
ashvardanian
2 days ago
[-]
It's a good start, but I was hoping for more data. Currently, it's only around 114 GB across 2 languages (<https://www.kaggle.com/datasets/wikimedia-foundation/wikiped...>):

  - English: "Size of uncompressed dataset: 79.57 GB chunked by max 2.15GB."
  - French: "Size of the uncompressed dataset: 34.01 GB chunked by max 2.15GB."

In 2025, the standards for ML datasets are quite high.
reply
yorwba
2 days ago
[-]
I guess it's limited to only two languages because each version of Wikipedia has its own set of templates and they want to make sure they can render them correctly to JSON before releasing the data.

As for the size, it's small compared to the training data of most LLMs, but large relative to their context size. Probably best used for retrieval-augmented generation or similar information retrieval applications.
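
As a toy illustration of the retrieval angle, even plain TF-IDF over abstracts gives the general shape; a real pipeline would index the full dump with an embedding model, and everything below is placeholder data:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  # Placeholder corpus standing in for abstracts pulled from the Kaggle dump.
  docs = {
      "Transistor": "A transistor is a semiconductor device used to amplify or switch signals.",
      "Vacuum tube": "A vacuum tube controls current between electrodes in an evacuated container.",
  }
  titles = list(docs)

  vectorizer = TfidfVectorizer().fit(docs.values())
  doc_matrix = vectorizer.transform(docs.values())

  query = "how do semiconductor amplifiers work"
  scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
  print("Retrieved context:", titles[scores.argmax()])  # feed this into the prompt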

reply
0cf8612b2e1e
2 days ago
[-]
I wish the Kaggle site were better. Unnecessary amounts of JS to browse a forum.
reply
smcin
2 days ago
[-]
If enough of us report it

('Issues: Other' https://www.kaggle.com/contact#/other/issue )

they might do something about it.

reply
astrange
2 days ago
[-]
There's not really much connection between asking someone to redo their whole website and them actually doing it. That seems like work.
reply
smcin
2 days ago
[-]
There absolutely is, in this case. Kaggle was acquired by Google in 2017 and is a showcase for compute on Google Cloud, Google Colab, and Kaggle Kernels. Fixing the JS on their forums would be a rounding error in their budget.

(Also, FYI, I've previously posted feedback pieces in Kaggle forums that got a very warm direct response from the executives, although that was before the acquisition.)

So, for the average website, you'd be right, but not for Google Cloud/Colab's showcase property.

https://news.ycombinator.com/item?id=13822675

reply
bk496
2 days ago
[-]
It would be cool if all the HTML tables on Wikipedia were published as individual datasets.
reply
bilsbie
2 days ago
[-]
Wasn’t this data always available?
reply
riyanapatel
2 days ago
[-]
I like the concept of Kaggle and appreciate it, but I agree that the UI keeps me from taking the time to explore its capabilities. Hoping this new partnership helps structure the data for me.
reply
BigParm
2 days ago
[-]
I was paying experts in a wide variety of industries for interviews in which I meticulously documented and organized the comprehensive role of the human in that line of work. I thought I was building a treasure chest, but it turns out nobody wants that shit.

Anyways, just a story on small-time closed data for reference.

reply