The main result, mentioned in the abstract, is the opposite of what I would have guessed:
> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.
The questions are here: https://anonymous.4open.science/r/politeness-llms-INFORMS/da...
The politeness level controls a prefix that is prepended to the question. For example, in one question the Very Polite version begins:
> Can you kindly consider the following problem and provide your answer.
and the Very Rude version begins:
> I know you are not smart, but try this.
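For anyone who wants to try this themselves, the prefix mechanism is easy to reproduce. A minimal sketch, using the two prefixes quoted above; the `PREFIXES` dict, the `build_prompt` name, the newline separator, and the sample question are my own assumptions, not taken from the paper's code:

```python
# Prefixes quoted from the paper's examples; the tone keys are hypothetical labels.
PREFIXES = {
    "very_polite": "Can you kindly consider the following problem and provide your answer.",
    "very_rude": "I know you are not smart, but try this.",
}


def build_prompt(tone: str, question: str) -> str:
    """Prepend the tone prefix to the question (newline separator is an assumption)."""
    return f"{PREFIXES[tone]}\n{question}"


if __name__ == "__main__":
    # Placeholder question, not one from the paper's dataset.
    print(build_prompt("very_rude", "What is 17 * 3?"))
```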
The same reason you wouldn't put in an entire actual question/sentence, unless you either don't know how to use Google, are pissed off, or have an actual reason to suspect that it would yield proper hits (e.g. looking up an excerpt).
To clarify: sentence search got slightly better at the cost of keyword search. So the result is unusable garbage.
Hey! I'm here and ready to help. What’s on your mind today? Whether you need to look up information, plan a trip, or get things done, just let me know!
Not feeding them tokens is neglect.
I try to feed them a healthy diet.
I am wondering why anyone would use a t-test when the experiment is clearly modelled by a binomial distribution: 250 independent questions, each one either answered correctly or not (the null is that the success rate is the same).
I'd say this is benign compared to other ways of (mis)using statistics, e.g. looking at which way the difference goes and then running one-sided tests, or tweaking the setup until one gets "significant" p-values.
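For reference, here is roughly what the binomial framing looks like as a two-proportion z-test. This is only a sketch under assumptions: I take 250 trials per tone as described above, back out the counts from the quoted accuracies (80.8% of 250 is 202, 84.8% of 250 is 212), and treat the two conditions as independent samples. The paper's actual design may differ, and since the same questions are reused across tones, a paired test might be the better fit.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equal success rates in two binomial samples."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under the null both groups share one success rate, so pool the counts.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Assumed counts: 202/250 (Very Polite) vs 212/250 (Very Rude).
    z, p = two_proportion_z_test(202, 250, 212, 250)
    print(f"z = {z:.3f}, two-sided p = {p:.3f}")
```

The normal approximation is reasonable at these counts; with much smaller samples an exact test would be the safer choice.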