Does this result suggest that if we had N clever humans manually building an LLM, they might come up with something as smart as a frontier model, but potentially 45 times smaller? (1644 / 36 ~= 45, N = very large, time not specified)
True, but with even smarter humans, you could exploit the interactions for additional calculations.
While it sounds a bit silly, it is one of the hypotheses behind a fast takeoff. An AI that is sufficiently smart could design a network better than a trained one and could make something much smarter than itself on the same hardware. The question then becomes if that new smarter one can do an even better job. I suspect diminishing returns, but then again I am insufficiently smart.
(I see the Trained Weights results now, thanks.)
I wonder why they don't just write the code themselves, so by design the focus can be on the model.
[A B]
times [1]
[1]
is [A+B]I guess the analogy there is that a 74ls283 never really has a number either and just manipulates a series of logic levels.
Does this boil down to a condemnation of all scientific endeavours if they use resources?
Would it change things if the people who did it enjoyed themselves? Would they have spent more energy playing a first person shooter to get the same degree of enjoyment?
How do you make the calculation of the worth of a human endeavour? Perhaps the greater question is why are you making a calculation of the worth of a human endeavour.
Now if you said this proof of addition opens up some other interesting avenue of research, sure.
Well for starters, it puts the lie to the argument that a transformer can only output examples it has seen before. Performing the calculation on examples that haven't been seen demonstrates generalisation of the principles and not regurgitation.
While this misconception persists in a large number of people, counterexamples can always serve a useful purpose.
not any more, eh?
"So that the room will be empty."
I was initially excited until i saw that, because it would reveal some sort of required local min capacity, and then further revelation that this was all vibe coded and no arXiv, makes me feel I should save my attn for another article.
When you hand-code the weights, you're essentially implementing a known algorithm (carry-propagation) directly into the network topology. But trained networks often discover distributed representations that spread the computation across more parameters in ways that are harder to interpret but more robust to input distribution shifts.
I'd be curious whether the 311-param trained model generalizes better to bases other than 10, or to addition with different digit counts than it was trained on. In my experience, the 'messier' learned solutions sometimes capture more structural regularity than the clean engineered ones, precisely because they aren't locked into a single algorithmic strategy.