LLMs are much better at plausibly summarizing content than they are at doing long sequences of reasoning, so they're much better at generating believable reviews than believable papers. Plus reviews are pretty tedious to do, giving an incentive to half-ass it with an LLM. Plus reviews are usually not shared publicly, taking away some of the potential embarrassment.
On the one hand, if you submit to a conference, you are forced to "volunteer" for that cycle. Which is a good idea from a "justice" point of view, but it's also a sure way of generating unmotivated reviewers. Not only because a person might be unmotivated in general, but because the (rather short) reviewing period may coincide with your vacation (this happened to many people with EMNLP, whose reviewing period was in the summer), and you're not given any alternative but to "volunteer" and deal with it.
On the other hand, even regular reviewers aren't treated too well. Recently they implemented a minimum max load of 4 (which can push people towards choosing uncomfortable loads; in fact, that seems to be the purpose), and loads aren't even respected (IIRC there have been mails to the tune of "some people set a max load but we got a lot of submissions so you may get more submissions than your load, lololol").
While I don't condone using LLMs for reviewing and I would never do such a thing, I am not too surprised that these things happen given that ARR makes the already often thankless job of reviewing even more annoying.
To be honest, lately, I have gotten better quality reviews from the supposedly second-tier conferences that haven't joined ARR (e.g. this year's LREC-COLING) than from ARR. Although sample size is very small, of course.
A consequence of that is that there are not sufficient numbers of reviewers available who are qualified to review these manuscripts.
Conference organizers might be keen to accept many or most of those who offer to volunteer, but clearly there is now a large pool of people who have never done this before and were never taught how to do it. Add some time pressure, and people will try out some tool, just because it exists.
GPT-generated docs have a particular tone that you can detect if you've played a bit with ChatGPT and if you have a feel for language. Such reviews should be kicked out. I would be interested to view this review (anonymized if you like - by taking out bits that reveal too narrowly what it's about).
The "rolling" model of ARR is a pain, though, because instead of slaving for a month you feel like slaving (conducting scientific peer review free of charge = slave labor) all year round. Last month, I got contacted by a book editor to review a scientific book for $100. I told her I'm not going to read 350 pages, to write two pages worth of book review; to do this properly one would need two days, and I quoted my consulting day rate. On top of that, this email came in the vacation month of August. Of course, said person was never heard of again.
Pretty hard to combat. We just rebutted as if it were a real review - maybe it was - and hope that the chairs see it. Speaking to other folks, opinions are split over whether this sort of review should be flagged. I know some people who tried to query a review and it didn't help.
There were other small cues - the English was perfect, while other reviewers made small slips indicative of non-native speakers. Another was simply the discrepancy between the tone of the review (generally very positive) and the middle-of-the-road rating and confidence. The structure of the review was very "The authors do X, Y, Z. This is important because A, B, C.", and the reviewer didn't bother to fill out any of the other review sections (they just wrote single-word answers to all of them).
The kicker was actually putting our paper into 4o, asking it to write a review, and seeing the same keywords pop up.
I see "dogfooding" has now been taken to its natural conclusion.
Not defending LLM papers at all, but these people can go to hell. If "scientometrics" was ever a good idea, it for sure isn't anymore now that the measure has become the target. A longer, carefully written, comprehensive paper is rated worse than many short, incremental, hastily written papers.
Right now there is no incentive to do a high-quality review unless the reviewer is already motivated.
One example was that GPT took an explicit geographic location from a figure caption and used it as a reference point when suggesting improvements (along the lines of "location X is under-represented on this map"), I assume because it assigns a high degree of relevance to figures and the abstract when summarising papers. I think you might be able to combat this by writing defensively; in our case we might have avoided that by saying "more information about geographic diversity may be found in X and the supplementary information".
If you can, please publish it and maybe post it here on HN or on Reddit.
Researchers get massive CVs, reviewers and editors get off easy, admins get to show great output numbers from their institutions, and of course the publishers continue making money hand over fist.
It's a rather broken system.
They just need to mimic the appearance of reason, follow the same pattern of progression. Ingesting enough of what amounts to executed templates will teach it to generate its own results as if output from the same template.
For example, I recently experimented with using ChatGPT to translate a Wikipedia article, on the grounds that it might maintain all the formatting and that Transformer models are also used by Google Translate.
As it was an experiment, I did actually check the results before submitting the translated article.
Roughly the first 3/4 was fine. The final quarter was completely invented but plausible, including references.
LLMs are very useful tools; I'll gladly use them to help with various tasks, and they can (with low reliability, but it has happened) even manage a whole project. Right now, though, they should be treated with caution and not left unsupervised: the Peter principle, being promoted beyond one's competence, still applies even though they're not human employees.
By writing it down (yourself) you understand what claims each piece of related work discussed has made (and can realistically make - as there sometimes are inflationary lists of claims in papers), and this helps you formulate your own claim as it relates to them (new task, novel method for a known task, like an older method but works better, nearly as good as a past method but runs faster, etc.).
If you outsource it to a machine you no longer see it through, and the result will be poor unless you are a very bad writer.
I can, however, see a role for LLMs in an electronic "learn how to write better" tutoring system.
The bug happens if the ‘bib’ key doesn’t exist in the API response. That leads to the urls array having more rows than the paper_data array, so the columns could become mismatched in the final data frame. It seems they made a third array called flag that could be used to detect and remove the bad results, but it’s not used anywhere in the posted code.
It’s not clear to me how this would affect their analysis; it does seem like something they would catch when manually reviewing the papers. But perhaps the bibliographic data wasn’t reviewed and was only used to calculate the summary stats etc.
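To make the mechanism concrete, here is a minimal sketch of how such a mismatch can arise, assuming a scraping loop roughly like the one described above. The array names (urls, paper_data, flag) come from the comment; everything else is hypothetical and may differ from the authors' actual script.

    import pandas as pd

    # Stand-in for the scraped results: the second hit has no 'bib' key.
    results = [
        {"pub_url": "https://example.org/a", "bib": {"title": "Paper A"}},
        {"pub_url": "https://example.org/b"},  # 'bib' missing in the API response
        {"pub_url": "https://example.org/c", "bib": {"title": "Paper C"}},
    ]

    urls, paper_data, flag = [], [], []
    for result in results:
        urls.append(result["pub_url"])                 # appended for every result
        if "bib" in result:
            paper_data.append(result["bib"]["title"])  # appended only when 'bib' exists
            flag.append(False)
        else:
            flag.append(True)                          # recorded, but never consulted later

    # pandas pads the shorter column with NaN, so every URL after the first
    # missing-'bib' hit is silently paired with the bib data of a later paper.
    df = pd.DataFrame({"url": pd.Series(urls), "bib": pd.Series(paper_data)})
    print(df)

The unused flag list would be enough to drop or re-fetch the bad rows, which is presumably what it was intended for.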
""" Dear XXXX,
My name is Kristofer, I’m one of the co-authors for the GPT paper. I also wrote the script for the data collection. Jutta forwarded your email regarding the possible bug.
First of all, let me apologise for the late response. Apparently your email made its way to the spam folder, which of course is regrettable. I would also like to thank you for reaching out to us. We are pleased to see the interest of the HN community in transparent and reliable research.
We looked at the comment and the concern around the bug. We’d like to point out that the original commenter was right in saying “it does seem like something they would catch when manually reviewing the papers”. We in fact reviewed the output manually and carefully for any potential errors. In other words, we opened and searched for the query string manually, which also helped determine whether the use of LLMs was declared in some form or other. This is of course a sensitive topic and we took great care to be thorough.
Nevertheless, we once more did a manual review of the code and the data, in light of this potential bug, and we’re glad to say no row-column mismatch is present. You can find the data here: https://doi.org/10.7910/DVN/WUVD8X
Please don’t hesitate if you have any more questions.
All the best, Kristofer """
Contact info for the first author
With meticulous version records it should at least be possible to ascertain what the code did by reconstructing that exact version (assuming stored back versions exist).
For any who haven't seen/heard, this makes for some entertaining and eye-opening viewing!
Now, technology is a human artifact and always ends up resembling its creators or financiers or both: I’d most likely have nice fonts on my computer in 2024 either way, but it’s directly because of Jobs that they were available to a household budget in 1984.
If someone other than Altman, or some insight other than “this thing can lie in a newly scalable way”, had been the escape-velocity moment for LLMs, then we’d still have test sets and metrics and just plain science going on in the Commanding Heights of the S&P 500. But these people are a symptom of our apathy around any noble instinct. If we had stuck firm to our values, no effective-altruism cult-leader type would even make the press.
Now this sounds like a story worth hearing!
Academia is a lot about barriers, which, while sometimes unpleasant and malfunctioning, nevertheless serve a purpose (unfortunately, it is impossible to evaluate everything fully on a per-case basis, so humans need shortcuts to filter out noise and determine more quickly whether something is worth spending attention on). One of those barriers is the form of the paper itself. The fall of this barrier (notably through often unauthorised use of others’ IP) would likely bring about not a sudden idyllic meritocracy but increased noise and/or the strengthening of other barriers.
If there are instances where the ability to make such distinctions is lost, it is most likely to be so because the content lacks novelty, i.e. it simply regurgitates known and established facts. In which case it is a pointless effort, even if it might inflate the supposed author's list of publications.
As to the integrity of researchers, this is a known issue. The temptation to fabricate data existed long before the latest innovations in AI, and is very easy to do in most fields, particularly in medicine or biosciences which constitute the bulk of irreproducible research. Policing this kind of behavior is not altered by GPT or similar.
The bigger problem, however, is when non-experts attempt to become informed and are unable to distinguish between plausible and implausible sources of information. This is already a problem even without AI; consider the debates over the origins of SARS-CoV-2, for example. The solution to this is the cultivation and funding of sources of expertise, e.g. in universities and similar institutions.
It seems to be kind of a new thing for laymen to be reading scientific papers. 20 years ago, they just weren't accessible. You had to physically go to a local university library and work out how to use the arcane search tools, which wouldn't really find what you wanted anyway. And even then, you couldn't take it home and half the time you couldn't even photocopy it because you needed a student ID card to use the photocopier.
For example, the authors’ statement that “[GPT’s] undeclared use—beyond proofreading—has potentially far-reaching implications for both science and society” suggests that, for them, using LLMs for “proofreading” is okay. But “proofreading” is understood in various ways. For some people, it would include only correcting spelling and grammatical mistakes. For others, especially for people who are not native speakers of English, it can also include changing the wording and even rewriting entire sentences and paragraphs to make the meaning clearer. To what extent can one use an LLM for such revision without declaring that one has done so?
Hope this research is better.
A third risk: ChatGPT has no understanding of "truth" in the sense of facts reported by established, trusted sources. I'm doing a research project related to use of data lakes and tried using ChatGPT to search for original sources. It's a shitshow of fabricated links and pedestrian summaries of marketing materials.
This feels like an evolutionary dead end.
That's a bad idea, do not do that. Regardless of the knowledge contained in ChatGPT, it's completely the wrong tool/tech for this - like using a jackhammer as a screwdriver. If you want original sources, then services like https://perplexity.ai can do it. It's not even an issue with ChatGPT as such; it was never intended for that - that's why they're also trying to build search: https://openai.com/index/searchgpt-prototype/
(edited: typo)
The human effort idea is even a bit morally objectionable. You can feel that you're worth more than others because more of the lives of others were consumed to create your possessions. It's a zero sum game where poor people can never afford high-care art because their time is worth less than the artist's.
> create a picture of scrabble pieces strewn on a table, with a closeup of a line of scrabble letters spelling "CHATGPT" on top of them. photographic, realistic quality, maintain realism and believability
https://ideogram.ai/assets/image/lossless/response/vF81gKjHS...
https://ideogram.ai/assets/image/lossless/response/EcRpDLumS...
Almost the same prompt.
https://replicate.com/p/xm41nvz05drm00chsywb6am7f0
https://replicate.com/p/kdw8bnkj39rm40chsyzbyg5e04
But of course anyone who has even a passing familiarity with scrabble is going to be able to tell that something's off.
AI-generated images are clearly identifiable as such, and it just gets annoying to continually see those desultory fabrications.
Would be better in those cases for people to write their paper in their native language and let readers translate it for themselves.
> “as of my last knowledge update” and/or “I don’t have access to real-time data”
which suggests no human (they don’t even need to be a researcher) read every sentence of these damn “papers”. That’s a pretty low bar to clear; if you can’t even be bothered to read the generated crap before including it in your paper, your academic integrity is negative and not a word from you can carry any weight.
Which also suggests none of the so-called reviewers or editors read the entire paper before including it in their journal...
If the papers are incorrect, then the reviewers should catch them.
I wonder what they were thinking submitting the paper.
I am also hearing that a lot of reviewers and readers use it, though. So we often joke that PhD students (in CS) nowadays only write bullet points from their research: they generate prose from the bullet points, and that prose is then used to generate bullet points again.
> We searched and scraped Google Scholar using the Python library Scholarly (Cholewiak et al., 2023) for papers that included specific phrases known to be common responses from ChatGPT and similar applications with the same underlying model (GPT3.5 or GPT4): “as of my last knowledge update” and/or “I don’t have access to real-time data” (see Appendix A).
If no one bothered to even spot and remove these, you can be pretty sure that no human ever read the whole paper before publication.
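For reference, the phrase search quoted above can be sketched roughly like this, assuming scholarly's search_pubs interface; it is only an illustration and may differ from the authors' actual collection script.

    from itertools import islice
    from scholarly import scholarly

    TELLTALE_PHRASES = [
        '"as of my last knowledge update"',
        '"I don\'t have access to real-time data"',
    ]

    hits = []
    for phrase in TELLTALE_PHRASES:
        # Take only a handful of results per phrase; Google Scholar rate-limits scrapers hard.
        for result in islice(scholarly.search_pubs(phrase), 20):
            hits.append({
                "phrase": phrase,
                "title": result.get("bib", {}).get("title"),  # 'bib' can be absent, as noted earlier in the thread
                "url": result.get("pub_url"),
            })

    print(f"{len(hits)} candidate papers to check by hand")

Each hit then still has to be opened and read by a human to confirm the phrase is an undeclared LLM response rather than, say, a quotation in a paper about chatbots.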
Perhaps "unreviewed scholarship" would be a more concerning claim, but I don't yet see the evidence for it being a major concern.
For example, suppose you wish to back up switch configs or dump a file or whatever, and tftp is so easy and simple to set up. You'll tear it down later or firewall it or whatever.
So a quick search for "linux tftp server" gets you to, say: https://thelinuxcode.com/install_tftp_server_ubuntu/
All good until you try to use the --create flag, which should allow you to upload to the server. That flag is not valid for tftp-hpa; it is valid on tftpd (another tftp daemon).
That's a hallucination. Hallucinations are fucking annoying and increasingly prevalent. In Windows land the humans hallucinate: C:\ SFC /SCANNOW does not fix anything except for something really madly self-imposed.
Seems valid upstream too https://github.com/Distrotech/tftp-hpa/blob/5e95f248e8435eb3...
Much more work is needed to show that this means anything.
The authors use fraud in a specific sense here: "using ChatGPT fraudulently or undeclared" where they proved that the produced text was included without proper review. They also never accused those papers of misinformation, so they don't need to show evidence of that.
In a sense we need to go back two steps and websites need to be much stronger curators of knowledge again, and we need some reliable ways to sign and attribute real authorship to publications. So that when someone publishes a fake paper there is always a human being who signed it and can be held accountable. There's a practically unlimited number of automated systems, but only a limited number of people trying to benefit from it.
In the same way https went from being rare to being the norm, because the assumption that things are default-authentic doesn't hold, the same just needs to happen to publishing. If you have a functioning reputation system and you can put a price on fake information, 99% of it is disincentivized.
(And if that doesn't work, how is what you're suggesting meaningfully different?)
Having a machine-verifiable, cryptographic identity system that renders these kinds of things transparent (basically the equivalent of a ledger, but used for identity instead of get-rich schemes) would probably make verification enforceable.
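As a very rough sketch of what "signing" a publication could look like (the workflow here is hypothetical, not an existing system): an author key signs a hash of the manuscript, and anyone can later verify that a specific, accountable identity vouched for that exact file.

    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.exceptions import InvalidSignature

    # In a real system the author's key would be bound to a verified identity
    # (e.g. via an institution or an ORCID-like registry); here we just generate one.
    author_key = Ed25519PrivateKey.generate()
    public_key = author_key.public_key()

    manuscript = b"... full text or PDF bytes of the paper ..."
    digest = hashlib.sha256(manuscript).digest()

    # The author signs the digest; the signature is published alongside the paper.
    signature = author_key.sign(digest)

    # Anyone can check that this exact file was vouched for by that identity.
    try:
        public_key.verify(signature, digest)
        print("Signature valid: the listed author signed this exact manuscript.")
    except InvalidSignature:
        print("Signature invalid: the file was altered or not signed by this author.")

The cryptography is the easy part; the hard part the thread is pointing at is binding keys to accountable humans and getting publishers to require and check the signatures.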
Ouch!