Detection is not what is going to solve the problem. We need to go back and re-evaluate why we are asking students to write in the first place, and how we can still achieve the goal of teaching when these modern tools are one click away.
I think we'll still need ways to detect copy-pasted zero-shot content generated by LLMs, for the same reasons that teachers needed ways to detect plagiarism. Kids, students, and interns [1] "cheat" for various reasons [2], and we want to be able to detect lazy infractions early enough that we can correct their behavior.
This leads to three outcomes:
1. Those that never really meant to cheat will learn how to do things properly.
2. Those that cheated out of laziness will begrudgingly need to weigh their options, at which point doing things properly may be less effort.
3. Those that meant to cheat will have to invest (much) more effort, and run the risk of being kicked out if they're caught.
[1] But also employees, employers, government officials, etc.
[2] There could be some relatively benign reasons. For example, they could: not know how to quote/reference others properly; think it's OK because "everyone does it" or they don't care about subjects that involve writing; do it "just this once" out of procrastination; and similar.
We don't need better detection. We need better ways to measure one's grasp of a concept. When calculators were integrated into education, the focus shifted from working the problem out by hand to choosing the correct formulas and using the calculator effectively. Sure, elementary classes will force you to 'show your work', but that's to build the foundation you'll build on later, I believe.
We don't need to detect plagiarism if we're asking students for verbal answers, for example.
"Grasping concepts" is not the only learning goal in schools or universities. Many classes - including within STEM programmes - want to teach students about writing, argumentation, researching, critical analysis, dealing with feedback, etc.
Oral exams can be more stressful, depending on the student. They also don't check for the student's writing or researching ability. They can be gamed with rhetorical skills. Grading of oral exams tends to be more opaque. And so on.
Then there are the issues I explained above, where you don't want to inadvertently reward cheating. Even if you don't care about the cheaters, you should try your best to detect and reward real effort. Otherwise it would be stupid not to cheat and use the class for free credits, at which point, from an educational POV, it's a useless class.
So, all in all, there are still very good reasons for doing take-home written responses and essays, and good reasons for wanting to detect cheating or plagiarism.
I suspect many students write to pass the class, and AI can do that. Perhaps the problem is the incentives to write that way.
It is better to pivot and not care about the actual content of the essay, and instead seek alternate strategies to encourage learning, such as an oral presentation or a quiz on the knowledge. In the laziest case, accept only hand-written output: even if it was generated, at least they retained some knowledge by copying it out.
I don't think you're wrong necessarily, but there are good reasons that teachers like papers other than "we've always used them".
People face different challenges when writing papers versus taking oral or written quizzes, but is one necessarily easier than the other? For papers, think about language barriers, anxiety about writing ability, the stress of the writing itself, the need for self-motivation and time management, and so on.
But that's what we are solving for. So you can't assume it.
This is what I mean when I say educators need to be more agile instead of insisting on assessment methods they simply assume should work.
We need to grade people because that's the best way we have to determine (for one or more subjects) who's:
1. capable enough, so that we can promote them to the next stage;
2. improving or has potential for improvement, so that we can give them the tools or motivation to continue;
3. underperforming, so that we can find out why and help them turn it around (or reduce the pressure);
4. actually learning the content, and if not, why not.
Thankfully, everyone knows this system is flawed, so most don't put too much weight on school grades. But overall, the grades are there to provide both an incentive for teachers and students to do better, and a way to compare performance.
I read their work and sense the same anxiety in myself. When I write with care, when I choose words that carry rhythm and reason, I feel suspicion rather than understanding. Readers ask whether a machine has written the text. I lower my tone, I break the structure, I remove what once gave meaning to style, only to make the words appear more human. In doing so, I betray something essential, not in the language but in myself.
The authors speak of false positives, of systems that mistake human writing for artificial output. But that error already spreads beyond algorithms. It enters conversation, education, and the smallest corners of daily life. A clear sentence now sounds inhuman; a careless one, sincere. Truth begins to look artificial, and confusion passes for honesty.
I recall the warning of Charlotte Thomson Iserbyt in The Deliberate Dumbing Down of America. She foresaw a culture that would teach obedience in place of thought. That warning now feels less like prophecy and more like description.
When people begin to distrust eloquence, when they scorn precision as vanity and mistake simplicity for virtue, they turn against their own mind. And when a society grows ashamed of clear language, it prepares its own silence. Not the silence of peace, but the silence of forgetfulness, the kind that falls when no one believes in the power of words any longer.
"yet behind those terms lives an older struggle, the human desire to prove its own reality in a world of imitation."
...each paragraph ends with this corny and tiresome '50s mechanized 'erudite' baloney.
--The Rod Serling Algo, aka, TTZ
For example, “delve” and the em-dash are both a result of the fine-tuning dataset, not the base LLM.
The principle of training them is quite simple. Take an LLM and reward it for revising text so that it doesn't get detected. Reinforcement learning takes care of the rest for you.
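To make that concrete, here is a minimal, hypothetical sketch of the reward signal such a setup would use. All names here (detector_ai_probability, sample_revision, training_step) are placeholders I'm inventing for illustration, not real APIs or anything described in the comment above; the actual policy-gradient update (e.g. PPO/REINFORCE) is omitted.

```python
# Hypothetical sketch: reward an LLM for revisions that a detector scores as "human".
# detector_ai_probability and sample_revision are stand-in stubs, not real libraries.

import random


def detector_ai_probability(text: str) -> float:
    """Stand-in for an AI-text detector; returns P(text was machine-written)."""
    return random.random()  # placeholder score


def sample_revision(policy, text: str) -> str:
    """Stand-in for sampling a rewrite from the LLM policy being trained."""
    return text  # a real policy would return a paraphrased draft


def reward(revised: str) -> float:
    # Higher reward the less "AI-like" the detector believes the revision is.
    return 1.0 - detector_ai_probability(revised)


def training_step(policy, drafts):
    # In practice these rewards would feed a policy-gradient update;
    # here we only show the reward computation the comment describes.
    rewards = [reward(sample_revision(policy, draft)) for draft in drafts]
    return sum(rewards) / len(rewards)  # average reward the optimizer maximizes


if __name__ == "__main__":
    print(training_step(policy=None, drafts=["An AI-generated paragraph."]))
```

The point of the sketch is only that the detector itself becomes the reward model, which is why detectors tend to lose this arms race.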
Pangram maintains near-perfect accuracy across long and medium-length texts. It achieves very low error rates even on shorter passages and ‘stubs.’

I hate AI slop and I fight against it in my work, but as that style of writing becomes increasingly prevalent, students are unconsciously adopting it for their base writing style. Automated detection of LLM writing never worked well, and now LLM and human writing have converged so much in style that machine detectors are worthless.
Our response should be to refuse to accept slop, whether produced by human or machine. I strive to point out the stylistic details of slop and how to avoid or edit them away.
If you think about the 2x2 of “Good” vs. “By AI”, you only really care about the case where something is good work that an AI did, and then only when catching cheaters, as opposed to deriving some utility from the work.
If it’s bad, who cares whether it’s AI or not? Most AI output is pretty obvious, thoughtless slop, and most people who use it aren’t paying enough attention to mask that. So I guess what I’m saying is that for most cases one could just set a quality bar and see if the work passes.
I think maybe a difference AI brings is that in many cases people don’t really know how to judge the quality of what they are reading, or are too lazy to, so they have substituted as proxies for quality the same structural cues that AI now uses. If you’re used to saying “it’s well formatted, lots of bulleted lists, no spelling mistakes, good use of adjectives, must be good”, now you have to actually read it and think about it to know.
So I asked the chatbot, and it listed the possible causes of a flagged transaction: a stolen card, plus a few other examples that amount to a mix of service issues determined by the customer. But the bot says it’s definitely not a chargeback. What?
So now I contact support. They say it’s a flag from the credit card issuing bank. Wait, what? Is this a fraudulent stolen card or not? Still no. It’s just a warning based on usage patterns. Why are you passing this slop to my client? If there is a pattern problem, the flag should go to the customer who authorizes the charge. Otherwise it’s a chargeback or a known stolen card.
They say, well, you can contact the customer. What? If the pattern really is a stolen card, which is listed as a possible cause of the flag even though they won’t say whether it is or isn’t, then whoever I contact can just lie!
Which is a long way of saying that this pattern matching for fraud or negative patterns suffers from idiocy, even in the simplest of contexts.