Transformers know more than they can tell: Learning the Collatz sequence
54 points | 5 days ago | 3 comments | arxiv.org
rikimaru0345
1 hour ago
OK, I've read the paper, and now I wonder: why did they stop at the most interesting part?

They did all that work to figure out that learning "base conversion" is the difficult thing for transformers. Great! But then why not take that last remaining step to investigate why that specifically is hard for transformers? And how to modify the transformer architecture so that this becomes less hard / more natural / "intuitive" for the network to learn?
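For reference, "base conversion" is just the usual repeated-division routine, and the awkward part is that every output digit depends on the whole input through the chain of quotients and remainders. A rough Python sketch (mine, not the paper's) of what the network has to emulate:

    def to_base(n: int, b: int) -> list[int]:
        # Digits of n in base b, least-significant first. Each digit comes
        # from a division whose quotient feeds the next step, so the high
        # digits depend on all of n, not just a local window of it.
        digits = []
        while n:
            n, r = divmod(n, b)
            digits.append(r)
        return digits or [0]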

embedding-shape
59 minutes ago
Why release one paper when you can release two? Easier to get citations if you spread your efforts, and if you're lucky, someone needs to reference both of them.

A more serious answer might be that it was simply out of scope of what they set out to do, and they didn't want to fall into scope creep, which is easier said than done.

Y_Y
56 minutes ago
For interest, this popular pastime goes by several delicious names: https://en.wikipedia.org/wiki/Least_publishable_unit
niek_pas
2 hours ago
Can someone ELI5 this for a non-mathematician?
esafak
1 hour ago
The model partially solves the problem but fails to learn the correct loop length:

> An investigation of model errors (Section 5) reveals that, whereas large language models commonly “hallucinate” random solutions, our models fail in principled ways. In almost all cases, the models perform the correct calculations for the long Collatz step, but use the wrong loop lengths, by setting them to the longest loop lengths they have learned so far.

The article is saying the model struggles to learn a particular integer function. https://en.wikipedia.org/wiki/Collatz_conjecture
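For concreteness, a rough sketch of the function involved (the paper's "long Collatz step" jumps straight from one odd number to the next, and the "loop length" appears to be the number of halvings that takes):

    def collatz_step(n: int) -> int:
        # One step of the usual Collatz map: 3n+1 for odd n, n/2 for even n.
        return 3 * n + 1 if n % 2 else n // 2

    def long_step(n: int) -> tuple[int, int]:
        # From an odd n, jump to the next odd number in the sequence.
        # Returns (next_odd, k), where k counts how many times 3n+1 gets
        # halved (presumably the "loop length" the quoted errors are about).
        m, k = 3 * n + 1, 0
        while m % 2 == 0:
            m //= 2
            k += 1
        return m, k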

spuz
1 hour ago
That's a bit of an uncharitable summary. In bases 8, 12, 16, 24 and 32 their model achieved 99.7% accuracy. They would never expect it to achieve 100% accuracy. It would be as if you trained a model to predict whether a given number is prime: a model that was 100% accurate would defy mathematical knowledge, but one that was 99.7% accurate would certainly be impressive.

In this case, they show that the model works by categorising inputs into a number of binary classes which just happen to be very good predictors for this otherwise random-seeming sequence. I don't know whether any of these binary classes are new to mathematics, but either way their technique does show that transformer models can be helpful in uncovering mathematical patterns, even in functions that are not continuous.
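To make that concrete, here's a toy check (my own sketch, not from the paper) that the number of halvings after an odd step really is readable off the low-order bits of n, which is the kind of binary class in question:

    from collections import defaultdict

    def halvings(n: int) -> int:
        # For odd n, count how many times 3n+1 can be halved.
        m, k = 3 * n + 1, 0
        while m % 2 == 0:
            m //= 2
            k += 1
        return k

    BITS = 10
    by_suffix = defaultdict(set)
    for n in range(1, 200_001, 2):                 # odd n only
        by_suffix[n % (1 << BITS)].add(min(halvings(n), BITS))
    # Each 10-bit suffix of n corresponds to exactly one (capped) count.
    print(all(len(v) == 1 for v in by_suffix.values()))   # prints True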

jacquesm
1 hour ago
A pocket calculator that gave the right numbers 99.7% of the time would be fairly useless. The lack of determinism is a problem, and there is nothing 'uncharitable' about that interpretation. It is definitely impressive, but it is fundamentally broken, because when you start chaining things that are each 99.7% correct you end up with garbage after very few iterations. That's precisely why digital computers won out over analog ones: the fact that they are deterministic.
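Back-of-the-envelope, assuming independent errors per step:

    # Chance that a chain of steps, each 99.7% reliable, is still correct.
    p = 0.997
    for steps in (10, 100, 1000):
        print(steps, round(p ** steps, 2))
    # -> 10 0.97, 100 0.74, 1000 0.05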
beambot
39 minutes ago
Most primality tests aren't 100% accurate either (e.g. Miller-Rabin); they're just "reasonably accurate" while being very fast to compute. You can use them in conjunction to improve your confidence in the result.
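For reference, roughly what a Miller-Rabin test looks like; note the asymmetry: "composite" is certain, "prime" only gets more likely with each extra round:

    import random

    def is_probably_prime(n: int, rounds: int = 20) -> bool:
        # Miller-Rabin: each round with a random base catches a composite
        # with probability >= 3/4, so the chance of wrongly reporting
        # "prime" after `rounds` rounds is at most 4**-rounds.
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13):
            if n % p == 0:
                return n == p
        d, r = n - 1, 0
        while d % 2 == 0:        # write n - 1 as d * 2**r with d odd
            d //= 2
            r += 1
        for _ in range(rounds):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False     # definitely composite
        return True              # probably prime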
jacquesm
13 minutes ago
Yes, and we know they are inaccurate, and we know that a test like that can only reject, not confirm: if it says something is prime, you still need to check it.

But now imagine that, on top of occasionally passing a composite, it also rejected valid primes 0.3% of the time. Then it would be instantly useless.

spuz
57 minutes ago
It's uncharitable because the comment purports to summarise the entire paper while cherry-picking the worst result. It would be as if I asked how I did on my test and you said "well, you got question 1 wrong" and then didn't elaborate.

Now, I do get your point that a function that is 99.7% accurate will, when iterated, eventually produce an incorrect result, but that's not what the comment said.

esafak
16 minutes ago
I just tried to get to the heart of the claim based on a skim. Please feel free to refine my summary.
pixl97
1 hour ago
Why do people keep using LLMs as algorithms?

LLMs are not calculators. If you want a calculator use a calculator. Hell, have your LLM use a calculator.

>That's precisely why digital computers won out over analog ones, the fact that they are deterministic.

I mean, no, not really: digital computers are far easier to build and far more multi-purpose (and technically the underlying signals are analog).

Again, if you have a deterministic solution that is 100% correct all the time, use it; it will be cheaper than an LLM. People use LLMs because there are problems that are either not deterministic, or where the deterministic solution uses more energy than will ever be available in our local part of the universe. Furthermore, a lot of AI systems (not just LLMs) use random noise at particular steps as a way to escape local maxima.
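(The classic version of that last trick is a Metropolis / simulated-annealing style accept rule, something like:)

    import math, random

    def noisy_accept(delta: float, temperature: float) -> bool:
        # delta = new_score - old_score (maximizing). Always accept
        # improvements; sometimes accept a worse move so the search can
        # climb back out of a local maximum.
        if delta >= 0:
            return True
        return random.random() < math.exp(delta / temperature)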

jacquesm
52 minutes ago
> Why do people keep using LLMs as algorithms?

I think they keep coming back to this because a good command of math underlies a vast range of applications, and without a way to do math as part of the reasoning process, the reasoning itself becomes susceptible to corruption.

> LLMs are not calculators. If you want a calculator use a calculator. Hell, have your LLM use a calculator.

If only it were that simple.

> I mean, no not really, digital computers are far easier to build and far more multi-purpose (and technically the underlying signals are analog).

Try building a practical analog computer for a non-trivial problem.

> Again, if you have a deterministic solution that is 100% correct all the time, use it, it will be cheaper than an LLM. People use LLMs because there are problems that are either not deterministic or the deterministic solution uses more energy than will ever be available in the local part of our universe. Furthermore a lot of AI (not even LLMs) use random noise at particular steps as a means to escape local maxima.

No, people use LLMs for anything, and one of the weak points is that as soon as a task requires slightly more complex computation there is a fair chance the output is nonsense. I've seen this myself in a bunch of non-trivial trials involving aerodynamic calculations, specifically the rotation of airfoils relative to the direction of travel. It goes completely off the rails if the problem is non-trivial and the user does not break it down into roughly the same steps you would use to work the problem out by hand (and even then it may subtly mess up).

fkarg
1 hour ago
Yeah, it's only correct in 99.7% of all cases, but what if it's also 10,000 times faster? There are a bunch of scenarios where that combination provides a lot of value.
lkey
45 minutes ago
Ridiculous counterfactual. The LLM started failing 100% of the time 60! orders of magnitude sooner than the point at which we have checked literally every number.

This is not even to mention that, for closed algorithms like this, asking a GPU to think about the problem will always be less efficient than just asking that GPU to compute the result directly.

jacquesm
57 minutes ago
Correctness in software is the first rung of the ladder; optimizing before you have correct output is in almost all cases a complete waste of time. Yes, there are some scenarios where having a ballpark figure quickly can be useful, provided you can also produce the actual result and the other answers are not complete nonsense but something that approaches the final value. There are a lot of algorithms that work this way (for instance, Newton's method for finding square roots).
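A quick sketch of the square-root case (mine, just to illustrate): every intermediate guess is already a usable approximation, and each iteration refines it.

    def sqrt_newton(x: float, iters: int = 6) -> float:
        # Newton's method for sqrt(x), x > 0: stopping early still gives
        # a sensible ballpark answer rather than noise.
        guess = x if x > 1 else 1.0
        for _ in range(iters):
            guess = 0.5 * (guess + x / guess)
        return guess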

Good 99.7% of the time and noise 0.3% of the time is not very useful, especially when there is no confidence signal indicating that the bad answers are probably incorrect.

poszlem
2 hours ago
A transformer can. Here's Gemini:

The Experiment: Researchers trained AI models (Transformers) to solve a complex arithmetic problem called the "long Collatz step".

The "Language" Matters: The AI's ability to solve the problem depended entirely on how the numbers were written. Models using bases divisible by 8 (like 16 or 24) achieved nearly 100% accuracy, while those using odd bases struggled significantly.

Pattern Matching, Not Math: The AI did not learn the actual arithmetic rules. Instead, it learned to recognize specific patterns in the binary endings of numbers (zeros and ones) to predict the answer.

Principled Errors: When the AI failed, it didn't hallucinate random answers. It usually performed the correct calculation but misjudged the length of the sequence, defaulting to the longest pattern it had already memorized.

Conclusion: These models solve complex math by acting as pattern recognizers rather than calculators. They struggle with the "control structure" (loops) of algorithms unless the input format reveals the answer through shortcuts.

embedding-shape
1 hour ago
Do you think maybe OP would have asked a language model for the answer if they had wanted a language model to give an answer? Or, in your mind, does the parent not know about LLMs, and this is your way of introducing them to this completely new concept?
NitpickLawyer
1 hour ago
Funny that the "human" answer above took 2 people to be "complete" (i.e. an initial answer, followed by a correction and expansion of concepts), while the LLM one had mostly the same explanation, but complete and in one answer.
embedding-shape
1 hour ago
Maybe most of us here aren't just after whatever answer to whatever question; the human-connection part of it matters too, the fact that we're speaking with real humans who have real experience with real situations.

Otherwise I'd just be sitting chatting with ChatGPT all day instead of wast...spending all day on HN.

pixl97
57 minutes ago
If life is a jobs program, why don't we dig ditches with spoons?
NitpickLawyer
47 minutes ago
Oh, I agree. What I found funny is the gut reaction of the many readers who downvoted the message (it's greyed out for me at the time of writing this comment), especially given that the user clearly mentioned it was LLM-generated, while also being cheeky with the "transformer" pun on a ... transformer topic.
Onavo
1 hour ago
Interesting, what about the old proof that neural networks can't model arbitrary length sine waves?