LOL. Finally the Techbro-CEOs succeeded in creating an AI in their own image.
Which models? The last ones came out this week.
I asked GPT to compute some hard multiplications; the reasoning trace seems valid and it gets the answer right.
https://chatgpt.com/share/6999b72a-3a18-800b-856a-0d5da45b94...
IMO, symbolic AI is way too brittle and case-by-case to drive useful AI, but as a memory and reasoning system for more dynamic and flexible LLMs to call out to, it's a good idea.
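As a rough sketch of that division of labor, assuming SymPy as the symbolic backend (the step where the model extracts the equation from the prompt is hand-waved here; this only shows the symbolic side):

# Minimal sketch: an LLM hands an equation string off to a symbolic engine
# (SymPy here) instead of "predicting" the solution token by token.
# Assumption: the LLM has already pulled the equation out of the user's prompt.
import sympy as sp

def solve_symbolically(equation_text: str, var_name: str = "x"):
    """Exactly solve an equation like 'x**2 - 5*x + 6 = 0'."""
    x = sp.Symbol(var_name)
    lhs, rhs = equation_text.split("=")
    return sp.solve(sp.Eq(sp.sympify(lhs), sp.sympify(rhs)), x)

print(solve_symbolically("x**2 - 5*x + 6 = 0"))  # -> [2, 3]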
Skimming through the conclusions and results: the authors conclude that LLMs exhibit failures across many axes we'd consider demonstrative of AGI. Moral reasoning, simple things like counting that a toddler can do, etc. They're just not human, and you can reasonably hypothesize that most of these failures stem from their nature as next-token predictors that happen to usually do what you want.
So. If you've got OpenClaw running and think you've got Jarvis from Iron Man, this is probably a good read to ground yourself.
Note there's a GitHub repo compiling these failures from the authors: https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failur...
Which LLMs? There's tons of them and more powerful ones appear every month.
Interacting with an LLM is more akin to interacting with a quirky human who has anterograde amnesia: it can't form long-term memories anymore, so it can only follow you over a long-ish conversation.
I'm not arguing that LLMs are human here, just that your reasoning doesn't make sense.
They're sold as AGI by the cloud providers and the whole stock market scam will collapse if normies are allowed to peek behind the curtain.
Specifically, the idea that LLMs fail to solve some tasks due to fundamental limitations, when humans also periodically fail at those same tasks, may well be an instance of the fundamental attribution error.
>Basic Arithmetic. Another fundamental failure is that LLMs quickly fail in arithmetic as operands increase (Yuan et al., 2023; Testolin, 2024), especially in multiplication. Research shows models rely on superficial pattern-matching rather than arithmetic algorithms, thus struggling notably in middle-digits (Deng et al., 2024). Surprisingly, LLMs fail at simpler tasks (determining the last digit) but succeed in harder ones (first digit identification) (Gambardella et al., 2024). Those fundamental inconsistencies lead to failures for practical tasks like temporal reasoning (Su et al., 2024).
This is very misleading and I think flat out wrong. What's the best way to falsify this claim?
Edit: I tried falsifying it.
https://chatgpt.com/share/6999b72a-3a18-800b-856a-0d5da45b94...
https://chatgpt.com/share/6999b755-62f4-800b-912e-d015f9afc8...
I gave it some really hard 20-digit multiplications without tools. If you look at the reasoning trace, it does what you'd normally expect and gets the answer right. I think this is enough to suggest that the claims made in the paper are not valid and that LLMs do reason well.
To anyone who would disagree, can you provide a counter example that can't be solved using GPT 5 pro but that a normal student could do without mistakes?
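For anyone who wants to check a trace like that themselves: Python's integers are exact at any size, so you can verify the final answer and see which digit positions (first, middle, last) a model actually got wrong. The operands and the model_answer string below are made-up placeholders, not the values from the linked chats:

# Verify a model's long-multiplication answer against Python's exact integers
# and report which digit positions differ. Operands and model_answer are
# placeholders; paste in the real values from the chat transcript.
a = 73914628503916284751          # made-up 20-digit operand
b = 58203917460285731946          # made-up 20-digit operand
model_answer = str(a * b)         # stand-in for the model's final answer string

truth = str(a * b)
if len(model_answer) != len(truth):
    print(f"digit-count mismatch: model {len(model_answer)} vs exact {len(truth)}")
else:
    wrong = [i for i, (m, t) in enumerate(zip(model_answer, truth)) if m != t]
    print("exact product:", truth)
    print("wrong digit positions (0 = leading digit):", wrong or "none")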
This is not a valid experiment, because GPT models always have access to certain tools and will use them even if you tell them not to. They will fib the chain of thought after the fact to make it look like they didn't use a tool.
https://www.anthropic.com/research/alignment-faking
It's also well established that all the frontier models use python for math problems, not just the GPT family of models.
Is that enough to falsify?
This isn't an experiment a consumer of the models can actually run. If you have a chance to read the article I linked, it is difficult even for the model maintainers (OpenAI, Anthropic, etc.) to look into the model and see what it actually used in its reasoning process. The models will purposefully hide information about how they reasoned. And they will ignore instructions without telling you.
The problem really isn't that LLMs can't get math/arithmetic right sometimes. They certainly can. The problem is that there's a very high probability that they will get the math wrong. Python and similar tools were the answer to that inconsistency.
If you ask ChatGPT, it will confirm that it uses the python interpreter to do arithmetic on large numbers. To you, that should be convincing.
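That's the whole point of the code-interpreter pattern: the model emits an expression and something exact evaluates it, rather than the model generating the digits itself. A toy illustration of the delegation; the regex routing below is a stand-in, not how any provider actually wires up tool calls:

# Toy illustration of "delegate arithmetic to a tool": extract the expression
# and evaluate it with exact integer arithmetic instead of letting the model
# generate the digits. The regex routing is a stand-in for real tool-calling.
import re

def answer_arithmetic(prompt: str) -> str:
    match = re.search(r"(\d+)\s*[x*×]\s*(\d+)", prompt)
    if match:
        a, b = (int(g) for g in match.groups())
        return str(a * b)              # exact, via Python's big integers
    return "no arithmetic found"       # a real system would fall back to the model

print(answer_arithmetic("What is 63157997633 * 6311490009?"))
# -> 398621071049125148697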
This is what Mistral outputs:
The result of multiplying 63,157,997,633 by 63,114,90,009 is:
3,965,689,999,999,999,999,999 (approximately 3.966 × 10²⁴).
That's off by several orders of magnitude, the scientific notation doesn't even match the full integer, and the mantissa is also slightly wrong.
GPT-5 pro without tools can easily solve your question and much harder ones.
Rather: asking whether there exists a model that can perform these calculations reliably is a better way to test the claim.
Otherwise we can always find the worst 1B-parameter model around to falsify any claim made about LLMs.
If you try using ChatGPT, you'll see it goes into reasoning mode (it flashes the text "Working") and takes a while when asked to do arithmetic. Chatbots that don't hide the reasoning chain of thought from the user, such as Deepseek or Mistral, reveal that they work exactly as I described:
====
Thought for 1s
Thinking
Okay, the user wants to multiply two large numbers: 63,157,997,633 and 6,311,490,009. Since these numbers are quite large, I should use the code interpreter to handle this calculation accurately. Let's write a simple piece of code to compute this product.
Code (python):
# Calculating the product of the two numbers
number1 = 63157997633
number2 = 6311490009
product = number1 * number2
product
Result:
398621071049125148697
Thinking
Great, the code interpreter has given us the product of the two numbers. Now, let's present the result to the user.
====
Falsifying my claim is simple: Find a model that verifiably does basic arithmetic correctly without relying on any external tools.
The only way to be sure a model calls no tool is to run it locally and control the network.
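For example, against a locally served model you can compare the raw completion to the exact product with no interpreter anywhere in the loop. The Ollama-style endpoint and the model name below are assumptions; substitute whatever you actually run:

# Sketch: ask a locally served model (no tools, network under your control)
# to multiply, then compare its raw text answer with Python's exact product.
# Assumes an Ollama-style endpoint on localhost and a model named "llama3";
# both are assumptions, swap in whatever you actually run.
import json
import re
import urllib.request

A, B = 63157997633, 6311490009
prompt = f"Compute {A} * {B}. Reply with only the final integer."

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({"model": "llama3", "prompt": prompt, "stream": False}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read().decode())["response"]

digits = re.sub(r"\D", "", reply)   # strip commas and other formatting
print("model said:", digits or reply.strip())
print("exact:     ", A * B)
print("correct?   ", digits == str(A * B))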
How are you able to use GPT-5 with tools turned off? Do you mean external tools (like searching the web)?
My understanding is that GPT models always have access to python, and it isn't something you can turn off.
I'm willing to bet this is no longer true as well. We have models that are doing better than humans at IMO.
Not really. From my brief experience they can guess the final answer, but the intermediate justifications and proofs are completely hallucinated bullshit.
(Possibly because the final answer is usually some sort of neat and beautiful expression, and human evaluators don't care about the final answer anyway; in any olympiad you're graded on the soundness of your reasoning.)