The dataset: 16K human posts from Reddit, Hacker News, and Yelp, each paired with AI generations from 6 models across two providers (Anthropic and OpenAI) at three capability tiers. Same prompt, length-matched, no adversarial coaching — just the model’s natural voice with platform context. Every vote is logged with model, tier, source, response time, and position.
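For anyone curious what a logged vote looks like, here's a rough sketch. The field names are my illustration of the fields listed above (model, tier, source, response time, position), not the dataset's exact schema:

```python
from dataclasses import dataclass, asdict

# Hypothetical sketch of one logged vote; field names are
# illustrative, not the published dataset's exact schema.
@dataclass
class VoteRecord:
    model: str            # e.g. a specific Anthropic or OpenAI model ID
    provider: str         # "anthropic" or "openai"
    tier: str             # one of the three capability tiers
    source: str           # "reddit", "hackernews", or "yelp"
    guessed_ai: bool      # which side the player picked
    correct: bool         # whether the guess was right
    response_time_ms: int # how long the player took to decide
    position: str         # where the AI text appeared on screen

vote = VoteRecord(
    model="example-model", provider="openai", tier="high",
    source="hackernews", guessed_ai=True, correct=False,
    response_time_ms=7400, position="left",
)
print(asdict(vote)["source"])  # -> hackernews
```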
Early findings from testing: Reddit posts are easy to spot (humans there are too casual for AI to mimic convincingly), while HN is significantly harder.
I'll be releasing the full dataset on HuggingFace, and if this crowdsourced study collects enough votes I'll publish a paper.
If you play the HN-only mode, you’re helping calibrate how detectable AI is on here specifically.
Would love feedback on the pairs — are any trivially obvious? Are some genuinely hard?
Some were hard but spottable after re-reading the answers a good 10 times... ahah.
Some were hard though, yeah (at least when not looking for longer than 5-10 seconds). Btw, it seemed more logical to me to just see a green/red card when you click, i.e. right choice or wrong choice. Getting red for the correct answer confused me a bit (but this might just be me).
This time around I didn't prompt the models to be adversarial — I didn't ask them to try and fool the reader. But I gave them contextual info, something to the effect of "you're a user posting on Hacker News".
Yeah, there are some very obvious tells, but the most capable models are very good at writing like a human.
Especially since the human responses to the Reddit or HN prompts were presumably written after reading the content of the article or post, while the model is simply going off the title.