They did all that work to figure out that learning "base conversion" is the difficult thing for transformers. Great! But then why not take that last remaining step to investigate why that specifically is hard for transformers? And how to modify the transformer architecture so that this becomes less hard / more natural / "intuitive" for the network to learn?
A more serious answer might be that it was simply out of scope of what they set out to do, and they didn't want to fall into scope creep, which is easier said than done.
> An investigation of model errors (Section 5) reveals that, whereas large language models commonly “hallucinate” random solutions, our models fail in principled ways. In almost all cases, the models perform the correct calculations for the long Collatz step, but use the wrong loop lengths, by setting them to the longest loop lengths they have learned so far.
The article is saying the model struggles to learn a particular integer function. https://en.wikipedia.org/wiki/Collatz_conjecture
In this case, they prove that the model works by categorising inputs into a number of binary classes that happen to be very good predictors for this otherwise random-seeming sequence. I don't know whether any of these binary classes are new to mathematics, but either way, their technique does show that transformer models can be helpful in uncovering mathematical patterns even in functions that are not continuous.
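For anyone who wants the concrete objects, here is a minimal sketch of the standard Collatz map and of the odd-to-odd "shortcut" step, where the inner loop's iteration count is what I take the quoted "loop length" to mean. This is my reading of the excerpt, not code from the paper.

    def collatz_step(n: int) -> int:
        """One step of the standard Collatz map: n/2 if n is even, 3n+1 if n is odd."""
        return n // 2 if n % 2 == 0 else 3 * n + 1

    def odd_to_odd_step(n: int) -> tuple[int, int]:
        """For odd n, jump straight to the next odd number in the trajectory.

        Returns (next_odd, loop_length), where loop_length counts the
        halvings needed after computing 3n + 1.
        """
        assert n % 2 == 1
        m = 3 * n + 1
        halvings = 0
        while m % 2 == 0:
            m //= 2
            halvings += 1
        return m, halvings

    # odd_to_odd_step(7) == (11, 1); odd_to_odd_step(5) == (1, 4)

Whenever the loop length is small, it is fully determined by the last few bits of n, which is exactly the kind of "binary class" shortcut a sequence model can latch onto.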
But now imagine that, instead of it being a valid reject 0.3% of the time, it also rejected valid primes. Then it would be instantly useless.
Now, I get your point that a function that is 99.7% accurate will, given enough uses, eventually be incorrect, but that's not what the comment said.
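I don't know exactly which test or error model the earlier comments have in mind, but the canonical example of a small error rate that is still perfectly usable is a probabilistic primality test like Miller-Rabin: a "composite" verdict is always correct, only the "probably prime" verdict can be wrong, and that error shrinks exponentially with extra rounds. A minimal sketch (not tied to the 0.3% figure above):

    import random

    def is_probable_prime(n: int, rounds: int = 8) -> bool:
        """Miller-Rabin test: never rejects a prime; may accept a composite
        with probability at most 4**-rounds."""
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13):
            if n % p == 0:
                return n == p
        # Write n - 1 as d * 2**s with d odd.
        d, s = n - 1, 0
        while d % 2 == 0:
            d //= 2
            s += 1
        for _ in range(rounds):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(s - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False  # witness found: definitely composite
        return True  # prime, or an astronomically unlucky composite

The useful property is that the error is one-sided and has a known bound, which is exactly what an LLM's occasional wrong answer lacks.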
LLMs are not calculators. If you want a calculator use a calculator. Hell, have your LLM use a calculator.
>That's precisely why digital computers won out over analog ones, the fact that they are deterministic.
I mean, no, not really: digital computers are far easier to build and far more multi-purpose (and technically the underlying signals are analog).
Again, if you have a deterministic solution that is 100% correct all the time, use it; it will be cheaper than an LLM. People use LLMs because there are problems that either are not deterministic or whose deterministic solution would use more energy than will ever be available in our local part of the universe. Furthermore, a lot of AI (not even LLMs) uses random noise at particular steps as a means to escape local maxima.
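On the "random noise to escape local maxima" point, simulated annealing is probably the cleanest illustration (my own toy example, nothing from the thread): the algorithm deliberately accepts some worse moves at random so it does not get stuck on the first local peak it finds.

    import math
    import random

    def bumpy(x: float) -> float:
        """A toy objective with many local maxima."""
        return math.sin(5 * x) + 0.5 * math.sin(17 * x) - 0.1 * x * x

    def anneal(steps: int = 10_000, temp: float = 2.0, cooling: float = 0.999) -> float:
        x = best = random.uniform(-3, 3)
        for _ in range(steps):
            candidate = x + random.gauss(0, 0.3)  # random proposal
            delta = bumpy(candidate) - bumpy(x)
            # Accept some downhill moves with a probability that shrinks as
            # the "temperature" cools -- the noise that escapes local maxima.
            if delta > 0 or random.random() < math.exp(delta / temp):
                x = candidate
            if bumpy(x) > bumpy(best):
                best = x
            temp *= cooling
        return best

SGD's minibatch noise and an LLM's sampling temperature lean on the same idea: a bit of randomness is traded against strict determinism on purpose.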
I think they keep coming back to this because a good command of math underlies a vast domain of applications, and without a way to do it as part of the reasoning process, the reasoning itself becomes susceptible to corruption.
> LLMs are not calculators. If you want a calculator use a calculator. Hell, have your LLM use a calculator.
If only it were that simple.
> I mean, no, not really: digital computers are far easier to build and far more multi-purpose (and technically the underlying signals are analog).
Try building a practical analog computer for a non-trivial problem.
> Again, if you have a deterministic solution that is 100% correct all the time, use it; it will be cheaper than an LLM. People use LLMs because there are problems that either are not deterministic or whose deterministic solution would use more energy than will ever be available in our local part of the universe. Furthermore, a lot of AI (not even LLMs) uses random noise at particular steps as a means to escape local maxima.
No, people use LLMs for anything, and one of the weak points is that as soon as a task requires slightly more complex computation there is a fair chance the output is nonsense. I've seen this myself in a number of non-trivial trials involving aerodynamic calculations, specifically the rotation of airfoils relative to the direction of travel (the kind of step sketched below). It tends to go completely off the rails if the problem is non-trivial and the user does not break it down into roughly the same steps you would use to work the problem out by hand (and even then it may subtly mess up).
This is not even to mention that, for closed algorithms like this, asking a GPU to think about the problem will always be less efficient than asking that same GPU to compute the result directly.
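To make "rotation of airfoils relative to the direction of travel" slightly more concrete, here is the flavour of hand calculation I mean: a generic blade-element-style step with made-up numbers, not the actual problem from my trials. The effective angle of attack is the geometric pitch minus the inflow angle given by the local velocity components.

    import math

    def effective_angle_of_attack(pitch_deg: float,
                                  axial_velocity: float,
                                  tangential_velocity: float) -> float:
        """Pitch angle minus the inflow angle seen by the airfoil section."""
        inflow_deg = math.degrees(math.atan2(axial_velocity, tangential_velocity))
        return pitch_deg - inflow_deg

    # A section pitched 15 degrees seeing 10 m/s axial and 50 m/s tangential
    # flow has an effective angle of attack of roughly 3.7 degrees.
    print(effective_angle_of_attack(15.0, 10.0, 50.0))

In my experience an LLM will often get a single step like this right; it is chaining several of them, with sign conventions and reference frames, where things quietly fall apart.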
Good 99.7% of the time and noise 0.3% of the time is not very useful, especially when there is no confidence signal to indicate which answers are probably incorrect.
The Experiment: Researchers trained AI models (Transformers) to solve a complex arithmetic problem called the "long Collatz step".
The "Language" Matters: The AI's ability to solve the problem depended entirely on how the numbers were written. Models using bases divisible by 8 (like 16 or 24) achieved nearly 100% accuracy, while those using odd bases struggled significantly.
Pattern Matching, Not Math: The AI did not learn the actual arithmetic rules. Instead, it learned to recognize specific patterns in the binary endings of numbers (zeros and ones) to predict the answer (see the sketch after this summary).
Principled Errors: When the AI failed, it didn't hallucinate random answers. It usually performed the correct calculation but misjudged the length of the sequence, defaulting to the longest pattern it had already memorized.
Conclusion: These models solve complex math by acting as pattern recognizers rather than calculators. They struggle with the "control structure" (loops) of algorithms unless the input format reveals the answer through shortcuts.
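A concrete way to see the "binary endings" shortcut: under one common formulation, the long step repeatedly applies u -> (3u+1)/2 while u is odd, and the number of times that rule fires is exactly the number of trailing 1-bits in u's binary representation, so the loop length really can be read off the last few digits of the input. A quick self-contained check (my formulation, which may differ in detail from the paper's):

    def long_step_loop_length(u: int) -> int:
        """Apply u -> (3u + 1) // 2 while u is odd; return how many times the rule fired."""
        assert u % 2 == 1
        count = 0
        while u % 2 == 1:
            u = (3 * u + 1) // 2
            count += 1
        return count

    def trailing_ones(u: int) -> int:
        """Number of consecutive 1-bits at the low end of u's binary representation."""
        count = 0
        while u & 1:
            count += 1
            u >>= 1
        return count

    # The loop length equals the number of trailing ones for every odd u tested.
    assert all(long_step_loop_length(u) == trailing_ones(u) for u in range(1, 10_001, 2))

This would also fit the quoted failure mode: a model that has only ever seen loop lengths up to some k only needs the last k or so bits of the input, so it caps longer loops at the longest length it has learned.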
Otherwise I'd just be sitting chatting with ChatGPT all day instead of wast...spending all day on HN.