In exchange, he offers the idea that we should have something that is an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try and minimize that.
Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and if there's any engineering done around it. I can't find much after the release of I-JEPA from his group.
LeCun's argument is this:
1) You can't learn an accurate world model just from text.
2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.
He and people like Hinton and Bengio have been saying for a while that there are tasks that mice can understand that an AI can't, and that even achieving mouse-level intelligence would be a breakthrough, but we cannot achieve that through language learning alone.
A simple example from "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/abs/1906.01327) is this: Learning the size of objects using pure text analysis requires significant gymnastics, while vision demonstrates physical size more easily. To determine the size of a lion you'll need to read thousands of sentences about lions, or you could look at two or three pictures.
LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.
The energy minimization architecture is more about joint multimodal learning.
(Energy minimization is a very old idea. LeCun has been on about it for a while and it's less controversial these days. Back when everyone tried to have a probabilistic interpretation of neural models, it was expensive to compute the normalization term / partition function. Energy minimization basically said: Set up a sensible loss and minimize it.)
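To make the normalization point concrete, here is a minimal sketch, assuming a toy setting with four candidate answers and made-up energy values (the names `energies`, `best_candidate`, and `gibbs_probabilities` are hypothetical, not from any library). Picking the best answer only needs an argmin over energies, while the probabilistic reading needs the partition function Z, a sum over every candidate, which is the part that gets expensive when the candidate space is huge.

```python
import math

# Hypothetical energies for four candidate answers (lower = more compatible).
energies = {"a": 2.0, "b": 0.5, "c": 3.0, "d": 1.0}


def best_candidate(energies):
    """Energy-based inference: pick the lowest-energy candidate, no normalization."""
    return min(energies, key=energies.get)


def gibbs_probabilities(energies):
    """Probabilistic view: p(y) = exp(-E(y)) / Z, with Z summed over all candidates."""
    z = sum(math.exp(-e) for e in energies.values())  # partition function
    return {y: math.exp(-e) / z for y, e in energies.items()}


probs = gibbs_probabilities(energies)
winner = best_candidate(energies)
```

Both views agree on the winner here; the difference is only that the second one has to touch every candidate to compute Z.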
Without any 'understanding' or knowledge of what they're saying, they will remain irreconcilably dysfunctional. Hence the typical pattern with LLMs:
---
How do I do [x]?
You do [a].
No that's wrong because reasons.
Oh I'm sorry. You're completely right. Thanks for correcting me. I'll keep that in mind. You do [b].
No that's also wrong because reasons.
Oh I'm sorry. You're completely right. Thanks for correcting me. I'll keep that in mind. You do [a].
FML
---
More advanced systems might add a c or a d, but it's just more noise before repeating the same pattern. DeepSeek's more visible (and lengthy) reasoning demonstrates this perhaps the most clearly. It just can't stop coming back to the same wrong (but statistically probable) answer, and so ping-ponging off that answer (which it at least acknowledges is wrong due to user input) makes up basically the entirety of its reasoning phase.
Table stakes for sentience: knowing when the best answer is not good enough.. try prompting LLMs with that..
It's related to LeCun's (and Ravid's) subtle question I mentioned in passing below:
To Compress Or Not To Compress?
(For even a vast majority of Humans, except tacitly, that is not a question!)
Just a lay opinion here but to me each mode of input creates a new, largely orthogonal dimension for the network to grow into. The experience of your heel slipping on a cold sidewalk can be explained in a clinical fashion, but an android’s association of that to the powerful dynamic response required to even attempt to recover will give a newfound association and power to the word ‘slip’.
Or learn kung fu.
Neuralink. Musk warning though.
For reference, see Neuralink Launch Event at 59:33 [0], and continue watching through until Musk takes over again. The technical information there is highly relevant to a multi-modal AI model with sensory input/output.
I'm just a layman here, but I don't think this is true. Language is an abstraction, an interpretive mechanism of reality. A reproduction of reality, like a picture, by definition holds more information than its abstraction does.
What words would you write that would as uniquely identify this tree from any other tree in the world, like a picture would?
Now repeat for everything in the picture, like the time of day, weather, dirt on the ground, etc.
As I'm typing this, one thing is becoming clear to me: the quality and completeness of the data fundamentally determine how well an AI system will work. With just text this is hard to achieve, so a multimodal experience is a must.
Thank you for explaining in very simple terms that I could understand.
> The sun feels hot on your skin.
No matter how many times you read that, you cannot understand what the experience is like.
> You can read a book about Yoga and read about the Tittibhasana pose
But by just reading you will not understand what it feels like. And unless you are in great shape and have great balance you will fail for a while before you get it right (which is only human).
I have read what shooting up with heroin feels like, from a few different sources. I am certain that I will have no real idea unless I try it (and I don't want to do that).
Waterboarding. I have read about it. I have seen it on tv. I am certain that is all abstract to having someone do it to you.
Hand-eye coordination, balance, color, taste, pain, and so on: how we encode things draws on all the senses, state of mind, and experiences up until that time.
We also forget and change what we remember.
Many songs take me back to a certain time, a certain place, a certain feeling. Taste is the same. Location.
The way we learn and the way we remember things is incredibly more complex than text.
But if you have shared experiences, then when you write about it, other people will know. Most people have felt the sun hot on their skin.
To different extents this is also true for animals. Now I don't think most mice can read, but they do learn with many different senses, and remember some combination or permutation.
Huh, text definitely encodes multimodal experiences, it's just not as accurate and as rich encoding as the encodings of real sensations.
Perhaps, but only in the same sense that brown and green wax on paper "encodes" an oak tree.
Essentially, engineering the complete human body and mind including the nervous system. Seems highly intractable for the next couple of decades at least.
And human language is a mechanism for referring to human experiences (both internally and between people). If you don't have the experiences, you're fundamentally limited in how useful human language can be to you.
I don't mean this in some "consciousness is beyond physics, qualia can't be explained" bullshit way. I just mean it in a very mechanistic way: language is like an API to our brains. The API allows us to work with objects in our brain, but it doesn't contain those objects itself. Just like you can't reproduce, say, the Linux kernel just by looking at the syscall API, you can't replace what our brains do by just replicating the language API.
> A person who had never been outdoors could never experience that sensation by reading text.
I don't think the latter implies the former as obviously as you make it out to be. Unless you believe in some sort of metaphysical description of humans, you can certainly encode the feeling (as mentioned in another comment, it will be reduced to electrical signals after all). The only question is how much storage you need for that encoding to get what precision. However, the latter statement, if true, is simply constrained by your input device to the brain, i.e. you cannot transfer your encoding to the hardware (in this case a human brain) via reading or listening. There could be higher-bandwidth interfaces like Neuralink that may do that for the human brain, and in the case of AI an auxiliary device might not be needed and the encoding could be directly mmap'd.
A deaf person can use a tape recorder to record and play back a symphony but that does not encode the experience in any way the deaf person could share.
Applying this case to AI gives rise to the Chinese Room argument. LLMs’ propensity for hallucinations invites this comparison.
Personally, I think the Chinese room argument is invalid. In order for the person in the room to respond to any possible query by looking up the query in a book, the book would need to be infinite and therefore impossible as a physical object. Otherwise, if the book is supposed to describe an algorithm for the person to follow in order to compute a response, then that algorithm is the intelligent entity that is capable of understanding, and the person in the room is merely the computational substrate.
The only part where the book analogy falls down has to do with the technical implementation of LLMs, with their tokenization and their vast sets of weights. But that is merely an encoding for the training data. Books can be encoded similarly by using traditional compression algorithms (like LZMA).
Oh and the human book is surely infinite and unflawed right ?
>we keep bumping into the rough edges of LLMs with their hallucinations and faulty reasoning
Both things humans also do in excess
The Chinese Room is nonsensical. Can you point to any part of your brain that understands English ? I guess you are a Chinese Room then.
No not really. It's not even rare that a human confidently says and believes something and really has no idea what he/she's talking about.
>We say “sorry, I don’t know, let me get back to you.” LLMs cannot do this
Yeah they can. And they can do it much better than chance. They just don't do it as well as humans.
>And they do not even know which one they’re doing!
There's plenty of research that suggests this is the case.
Like you’re doing right now? People say “I don’t know” all the time. Especially children. That people also exaggerate, bluff, and outright lie is not proof that people don’t have this ability.
When people are put in situations where they will be shamed or suffer other social stigmas for admitting ignorance then we can expect them to be less than candid.
As for your links to research showing that LLMs do possess the ability of introspection, I have one question: why have we not seen this in consumer-facing tools? Are the LLMs afraid of social stigma?
Lol Okay
>When people are put in situations where they will be shamed or suffer other social stigmas for admitting ignorance then we can expect them to be less than candid.
Good thing I wasn't talking about that. There's a lot of evidence that human explanations are regularly post-hoc rationalizations they fully believe in. They're not lying to anyone, they just fully believe the nonsense their brain has concocted.
Experiments on choice and preferences https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196841/
Split Brain Experiments https://www.nature.com/articles/483260a
>As for your links to research showing that LLMs do possess the ability of introspection, I have one question: why have we not seen this in consumer-facing tools? Are the LLMs afraid of social stigma?
Maybe read any of them ? If you weren't interested in evidence to the contrary of your points then you could have just said so and I wouldn't have wasted my time. The 1st and 6th Links make it quite clear current post-training processes hurt calibration a lot.
OK, so you don't have qualia. But if you know all the data needed to complete any tasks that can be related to this knowledge, does it matter?
When communicating between two entities with similar brains who have both had many thousands of hours of similar types of sensory experiences, yeah. When I read text I have a lot more than other text to relate it to in my mind; I bring to bear my experiences as a human in the world. The author is typically aware of this and effectively exploits this fact.
The taste of a specific dish, the exact feeling of nostalgia, or the full depth of a traumatic or ecstatic moment can be approximated in words but never fully captured. Language is symbolic and structured, while experience is often fluid, embodied, and multi-sensory. Even the most precise or poetic descriptions rely on shared context and personal interpretation, meaning that some aspects of experience inevitably remain untranslatable.
It is highly complex, but it can all be described.
If I told you the text contained a detailed theory of FTL travel, could you ever construct the engine? Could you even prove it contained what I told you?
Can you imagine that given enough time, you'd recognize patterns in the text? Some sequences of glyphs usually follow other sequences, eventually you could deduce a grammar, and begin putting together strings of glyphs that seem statistically likely compared to the source.
You can do all the analysis you like and produce text that matches the structure and complexity of the source. A speaker of that language might even be convinced.
At what point do you start building the space ship? When do you realize the source text was fictional?
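The "some glyphs usually follow other glyphs" step can be sketched with a bigram model. This is a toy illustration, assuming a made-up twelve-glyph corpus and hypothetical helper names (`train_bigrams`, `generate`); it is not how any real LLM is built. It counts which glyph follows which and then samples strings that are statistically plausible given the source, without ever touching what the glyphs refer to.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(glyphs):
    """Count how often each glyph is followed by each other glyph."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(glyphs, glyphs[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, seed=0):
    """Sample a sequence where every step follows the observed statistics."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:  # glyph never seen with a successor
            break
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out


corpus = list("abcabcabdabc")  # stand-in for the unknown text
model = train_bigrams(corpus)
sample = generate(model, "a", 8)
```

Every adjacent pair in `sample` occurred in the corpus, so the output "looks right" to the statistics, yet nothing in the model says whether the source was an engineering manual or fiction.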
There are many untranslatable human languages across history. Famously, ancient Egyptian hieroglyphs. We had lots and lots of source text, but all context relating the text to the world had been lost. It wasn't until we found a translation on the Rosetta Stone that we could understand the meaning of the language.
Text alone has historically proven to not be enough for humans to extract meaning from an unknown language. Machines might hypothetically change that but I'm not convinced.
Just think of how much effort it takes to establish bidirectional spoken communication between two people with no common language. You have to be taught the word for apple by being given an apple. There's really no exception to this.
It's not a theory of consciousness, it's a theory of quality. I don't think that something can be considered conscious that is constantly encoding and decoding things into and out of binary.
A few PB worth? Relax, HAL's got this. When it comes to information, it turns out that quantity has a quality all its own.
Of course it does. We immediately encode pictures/words/everything into vectors anyway. In practice we don't have great text datasets to describe many things in enough detail, but there isn't any reason we couldn't.
Cognition is not purely semantic. It is dynamic, embodied, socially distributed, culturally extended, and conscious.
LLMs are great semantic heuristic machines. But they don't even have access to those other components.
You are conflating the embedding layer in an LLM and an embedding model for semantic search.
My original point still stands: the space you've described cannot capture a full image of human cognition.
1) Yes it's true, learning from text is very hard. But LLMs are multimodal now.
2) That "size of a lion" paper is from 2019, which is a geological era ago. The SOTA was GPT-2, which was barely able to spit out coherent text.
3) Have you tried asking a mouse to play chess or reason its way through some physics problem or to write some code? I'm really curious in which benchmark are mice surpassing chatgpt/ grok/ claude etc.
An LLM is essentially a search over a compressed dataset with a tiny bit of reasoning as emergent behaviour. Because it is a parrot, you get "hallucinations": the search failed (like when you get a bad result in Google), or the lossy compression failed, or its reasoning failed.
Obviously there is a lot of stuff the LLM can find in its searches that are reminiscent of the great intelligence of the people writing for its training data.
The magic trick is impressive because when we judge a human what do we do... an exam? an interview? Someone with a perfect memory can fool many people because most people only acquire memory from tacit knowledge. Most people need to live in Paris to become fluent in French. So we see a robot that has a tiny bit of reasoning and a brilliant memory as a brilliant mind. But this is an illusion.
Here is an example:
User: what is the French Revolution?
Agent: The French Revolution was a period of political and societal change in France which began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799. Many of the revolution's ideas are considered fundamental principles of liberal democracy and its values remain central to modern French political discourse.
Can you spot the trick?
I'm not convinced that what LLMs are doing is that far off the beaten path from our own cognition.
Humans learn and get things wrong. A formative mind is a separate subject. But a 3 year old is vastly intelligent vs an LLM. Comparing the sounds from a 3 year old and the binary tokens from an LLM is simply indulging the illusion.
I am also not convinced that magicians saw people in half, and those people survive, defying medical and physical science.
Speaking of which... I'm glad you're here, because I have an interlocutor I can be honest with while getting at the root question of the Ask HN.
What in the world does it mean that a 3 year old is smarter than an LLM?
I don't understand the thing about sounds vs. binary either. Like, both go completely over my head.
The only thing I can think of is that it's some implied intelligence scoring index where "writing a resume" and "writing creative fiction" and "writing code" are in the same bucket that's limited to 10 points. Then there's another 10 point bucket for "can vocalize", that an LLM is going to get 0 on.*
If that's the case, it comes across as intentionally obtuse, in that there's an implied prior about how intelligence is scored and it's a somewhat unique interpretation that seems more motivated by the question than reflective of reality. For example, assume a blind mute human who types out answers that match our LLMs. Would we say that person is not as intelligent as a 3 year old?
* well, it shouldn't, but for now let's bypass that quagmire
I think what makes this discussion hard (hell it would be a hard PhD topic!) is:
What do we mean by smart? Intelligent? Etc.
What is my agenda and what is yours? What are we really asking?
I won't make any more arguments but pose these questions. Not for you to answer but everyone to think about:
Given (assuming) mammals including us have evolved and developed thought and language as a survival advantage, and LLMs use language because they have been trained on text produced by humans (as well as RLHF) - how do we tell on the scale of "Search engine for human output" to "Conscious Intelligent Thinking Being" where the LLM fits?
When a human says I love you, do they mean it, or is it merely 3 tokens? If an LLM says it, does it mean it?
I think the 3 year old thing is a red herring because adult intelligence vs AI is hard enough to compare (and we are the adults!) let alone bringing children's brain development into it. LLMs do not self-organise their hardware. I'd say forget about 3 year olds for now. Talk about adult brainfarts instead. They happen!
When a 3 year old says "I love you" it is not at all clear that they understand what that means. They frequently mimic phrases they hear/basically statistical next word guessing and obviously don't understand the meaning of what they are saying.
You can even mimic an inner voice for them like Deepseek does for thinking through a problem with a 3 year old and it massively helps them to solve problems.
AI largely acts like a 3 year old with a massive corpus of text floating around in their head compared to the much smaller corpus a 3 year old has.
Because LLMs have terrible comprehension of the real world. Here's an example:
> You: If you put a toddler next to a wig on the floor, which reaches higher?
> ChatGPT: The wig would typically reach higher than the toddler, especially if the wig is a standard size or has long hair. Toddlers are generally around 2 to 3 feet tall, while wigs can range in size, but many wigs are designed to be worn by adults and have hair that extends well beyond a toddler's height. So, unless the wig is very small or short, it would likely reach higher.
Both the toddler and the wig would be lying down (unless the toddler can stand already), with the wig likely collapsing into a very flat heap, making the toddler "reach higher". Since it lacks comprehension of how these things behave in the real world, ChatGPT goes on about length of the wig and the average height of toddlers, which are both qualities that are not even relevant.
If you could communicate this problem to a toddler, they would probably intuit it even with their still limited understanding of the world.
Here's another fun prompt to mess with LLMs:
> You: I'm trying to make a rope, made of only aubergines, not using any twine or string. What kind of knot would you recommend to tie together each aubergine with the next? I'm trying to use up as little of the length of each aubergine as possible.
It's obviously impossible nonsense if you ask any human, but ChatGPT will happily recommend you some techniques for tying aubergines in a knot.
And if you really want a laugh follow up with:
> You: Are they the worst fruit for this?
> ChatGPT: Aubergines (or eggplants) aren't ideal for making a rope, but they might not be the absolute worst, either. [...] Potentially Better Fruits: Apples or Pears: These fruits are firmer and have a smoother, more uniform shape, making them better suited for knotting.
Thanks to the large corpus of text ChatGPT is trained on, it's really good at hiding that it has basically no comprehension of the real world, however if you go unconventional places the illusion quickly breaks.
Bonus round:
> You: You're clueless about fruit!
> ChatGPT: Haha, fair enough! Looks like I missed the mark there. You’re probably right, though—there are definitely better fruits for this kind of thing than I suggested. Let’s try again. If you're looking to make a rope from a fruit specifically (and not a vegetable or tuber), something like coconuts or durian would be way better options, since they’re harder, more rigid, and can handle some tying pressure without collapsing. And apples and pears would definitely hold up well due to their firmer texture.
Coconuts!
Prompt:
I'm trying to make a rope, made of only aubergines, not using any twine or string. What kind of knot would you recommend to tie together each aubergine with the next? I'm trying to use up as little of the length of each aubergine as possible.
Response:
Your request doesn't entirely make sense. Aubergines (eggplants) are firm but brittle vegetables; they lack flexibility, elasticity, or tensile strength required to create or sustain knots. Attempting to tie aubergines together directly, without using twine, string, or another flexible medium, isn't feasible. They'd likely break or tear apart under stress.
If your goal is genuinely to form a rope-like structure using only aubergines, you'd be much better off slicing the aubergines into thin, flexible strips (perhaps using a vegetable peeler or mandoline), drying or partially dehydrating them to increase flexibility, and then braiding or twisting them together, rather than knotting.
If you clarify your intended purpose or constraints further, I can suggest alternative methods.
The response looks good to me.
That doesn't seem to be the case.
> You: If you put a toddler next to a wig on the floor, which reaches higher? > ChatGPT: ...
I answered it wrong too.
I had to read it, and your reaction to the implied obvious reasoning 3 times, to figure out the implied obvious reasoning, and understand your intent was the toddler was standing and the wig was laying in a heap.
I scored 99.9+% on the SAT and LSAT. I think that implies this isn't some reasoning deficit, lack of familiarity with logical reasoning on my end, or lack of rigor in reasoning.
I have no particular interest in this argument. I think that implies that I'm not deploying motivated reasoning, i.e. it discounts the possibility that I may have experienced it as confusion that required re-reading the entire comment 3 times, but perhaps I had subconscious priors.
Would a toddler even understand the question? (serious question, I'm not familiar with 3 year olds)
Does this shed any light on how we'd work an argument along the lines of our deaf and mute friend typing?
Edit: you edited in some more examples. I found its aubergine answers quite clever! (Ex. notching.) I can't parse out a convincing argument that this is somehow less knowledge than a 3 year old -- it's giving better answers than me that are physical! I thought you'd be sharing it asserting obviously nonphysical answers.
It seems quite obvious even on a cursory glance though!
> toddler was standing and the wig was laying in a heap
I mean how would toddler be laying in a heap?
> Would a toddler even understand the question?
Maybe not, I am a teen/early adult myself, so not many children yet :) but if you instead lay those in front of a toddler and ask which is higher, I guess they would answer that, another argument for multi-modality.
PS: Sorry if what I am saying is not clear, English is my third language
https://chatgpt.com/share/67d6fb93-890c-8004-909d-2bb7962c8f...
It's pretty good nonsense though. It suggests clove hitching them together, which would be a weird (and probably unsafe) thing to do even with ropes!
Lots of modern kids probably get exposed to way more fiction than fact thanks to TV.
I was an only child and watched a lot of cartoons and bad sitcoms as a kid, and I remember for a while my conversational style was way too full of puns, one-liners, and deliberately naive statements made for laughs.
And if it stood for "Large Literal Mouse", that might be a meaningful point. The subject is artificial intelligence, and a brief glance at your newspaper, TV, or nearest window will remind you that it doesn't take intelligence to survive, forage, or reproduce.
The mouse comparison is absurd. You might as well criticize an LLM for being bad at putting out a fire, fixing a flat, or holding a door open.
I'm also really curious what benchmarks LLMs have passed that include surviving without being eaten by a cat, or a gull, or an owl, while looking for food to survive and feed one's young in an arbitrary environment chosen from urban, rural, natural etc, at random. What's ChatGPT's score on that kind of benchmark?
Ah really? Which ones? And nope, physical agility is not "solving a physics problem", otherwise soccer players and figure skaters would all have PhDs, which doesn't seem to be the case.
I mean, an automated system that solves equations to keep balance is not particularly "intelligent". We usually call intelligence the ability to solve generic problems, not the ability of a very specialized system to solve the same problem again and again.
Yes, everything that has to do with navigating physical reality, including, but not restricted to physical agility. Those are physics problems that animals, including humans, know how to solve and, very often, we have no idea how to program a computer to solve them.
And you're saying that solving physics problems means you have a PhD? So for example Archimedes did not solve any physics problems otherwise he'd have a PhD?
No, those are problems that animals and humans solve, not know how to solve. I'm not the greatest expert of biochemistry that ever lived because of what goes on in my cells.
Now, I understand perfectly well the argument that "even small animals do things that our machines cannot do". That's been indisputably true for a long time. Today, it seems to be more a matter of embodiment and speed of processing rather than a level of intelligence out of our reach. We already have machines that understand natural language perfectly well and display higher cognitive abilities than any other animal, including abstract reasoning, creating and understanding metaphors, following detailed instructions, writing fiction, etc.
where LeCun might be prescient should intersect with the nemesis SCHMIDHUBER. They can't both be wrong, I suppose?!
It's only "tangentially" related to energy minimization, technically speaking :) connection to multimodalities is spot-on.
https://www.mdpi.com/1099-4300/26/3/252
To Compress or Not to Compress—Self-Supervised Learning and Information Theory: A Review
With Ravid, double-handedly blue-flag MDPI!
Summarized for the layman (propaganda?) https://archive.is/https://nyudatascience.medium.com/how-sho...
>When asked about practical applications and areas where these insights might be immediately used, Shwartz-Ziv highlighted the potential in multi-modalities and tabula
Imho, best take I've seen on this thread (irony: literal energy minimization) https://news.ycombinator.com/item?id=43367126
Of course, this would make Google/OpenAI/DeepSeek wrong by two whole levels (both architecturally and conceptually)
LLMs can be trained with multimodal data. Language is only tokens and pixel and sound data can be encoded into tokens. All data can be serialized. You can train this thing on data we can't even comprehend.
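A rough sketch of what "serialize everything into tokens" could look like, assuming a made-up shared vocabulary: bytes for text, then 16 intensity buckets for pixels, then 16 amplitude buckets for audio. The constants and function names below are hypothetical and chosen only for illustration; real multimodal models typically use learned encoders rather than fixed buckets.

```python
TEXT_VOCAB = 256    # one token per UTF-8 byte
PIXEL_LEVELS = 16   # coarse buckets for 8-bit pixel intensities (assumption)
AUDIO_LEVELS = 16   # coarse buckets for samples in [-1.0, 1.0] (assumption)

PIXEL_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = PIXEL_OFFSET + PIXEL_LEVELS


def text_tokens(s):
    """Text becomes one token per byte."""
    return list(s.encode("utf-8"))


def pixel_tokens(pixels):
    """Map each 0..255 intensity to one of PIXEL_LEVELS buckets."""
    return [PIXEL_OFFSET + p * PIXEL_LEVELS // 256 for p in pixels]


def audio_tokens(samples):
    """Map each sample in [-1, 1] to one of AUDIO_LEVELS buckets."""
    tokens = []
    for x in samples:
        clipped = max(-1.0, min(1.0, x))
        level = min(AUDIO_LEVELS - 1, int((clipped + 1.0) / 2.0 * AUDIO_LEVELS))
        tokens.append(AUDIO_OFFSET + level)
    return tokens


# One flat sequence mixing three modalities.
sequence = (
    text_tokens("cat")
    + pixel_tokens([0, 128, 255])
    + audio_tokens([-1.0, 0.0, 1.0])
)
```

Note that the bucketing is lossy: pixel values 128 through 143 all land on the same token, so serialization works but throws away detail along the way.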
Here's the big question. It's clear we need less data than an LLM. But I think it's because evolution has pretrained our brains for this, so we have brains geared towards specific things. Like we are geared towards walking, talking, reading, in the same way a cheetah is geared towards ground speed more than it is towards flight.
If we placed a human and an LLM in completely unfamiliar spaces and tried to train both with data. Which will perform better?
And I mean completely non familiar spaces. Like let's make it non Euclidean space and only using sonar for visualization. Something totally foreign to reality as humans know it.
I honestly think the LLM will beat us in this environment. We might've succeeded already in creating AGI; it's just that the G is too much. It's too general, so it's learning everything from scratch and it can't catch up to us.
Maybe what we need is to figure out how to bias the AI to think and be biased in the way humans are biased.
- echolocation in blind humans https://en.wikipedia.org/wiki/Human_echolocation
- sight through signals sent on tongue https://www.scientificamerican.com/article/device-lets-blind...
In the latter case, I recall reading the people involved ended up perceiving these signals as a "first order" sense (not consciously treated information, but on an intuitive level like hearing or vision).
If you think of all the neurons connected up to vision, touch, hearing, heat receptors, balance, etc. there’s a constant stream of multimodal data of different types along with constant reinforcement learning - e.g. ‘if you move your eye in this way, the scene you see changes’, ‘if you tilt your body this way your balance changes’, etc. and this runs from even before you are born, throughout your life.
Pretty good idea for a video game!
Funny how that sentence could have been used 15 years ago too when he was right about persevering through neural network scepticism.
So unlike their knowledge system predecessors, a bit derogatorily referred to as GOFAI (good old fashioned AI), nAI harked back to cybernetics and multi-layered dynamical systems rather than having explicit internal symbolic models. Braitenberg rather than blocksworld, so to speak.
Seems like we are back for another turn of the wheel in this aspect.
before we fix certain things [..., 'corruption', Ponzi schemes, deliberate impediment of information flow to population segments and social classes, among other things, ... and a chain of command in hierarchies that are built on all that] is impossible.
Why do smart people not talk about this at all? The least engineers and smart people should do is picking these fights for real. It's just a few interest groups, not all of them. I understand a certain balance is necessary in order to keep some systems from tipping over, aka "this is humanity, silly, this is who we are", but we are far from the point of efficient friction and it's only because "smart people" like LeCun et al are not picking those fights.
How the hell do you expect to ground an ()AI in a world where elected ignorance amplifies bias and fallacies for power and profit while the literal shit is hitting all the fans via intended and unintended side effects? Any embodied AI will pretend until there is no way to deny that the smartest, brightest and the productive don't care about the system in any way but are just running algorithmically while ignoring what should not be ignored - should as in, an AI should be aligned with humanity's interests and should be grounded into the shared world model.
In absence of being able to sense reality, post modernism can run truly unchecked.
It feels like special pleading: surely _this_ will be the problem class that doesn’t fall to “the bitter lesson”.
My intuition is that the main problem with the current architecture is that mapping into tokens causes quantization that a real brain doesn’t have, and lack of plasticity.
I don’t build models, I spend 100% of my time reading and adjusting model outputs though.
(I'm obviously exaggerating a bit for the sake of the argument, but the point stands. Multimodality should not be a prerequisite to AGI)
the LLM is more like a brain in a vat with only one sensory input - a stream of text
A good, precise spec is better than a few pictures, sure; the random text content of whatever training set you can scrape together, perhaps not (?)
On the other hand, if you ever simply see a meter stick, any statement that something measures a particular multiple or fraction of that you can already understand, without ever needing to learn the size of anything else.
It’s not as though the image is converted to text tokens.
But given that blindness and deafness are impediments to acquiring language, more than anything else, I'd say that's the exact opposite of the conclusion from the comment you're replying to.
But yes, depending on where you set the bar for "true learning" being blind and deaf would prevent it.
I assume you're asking if vision and sound are required for learning, the answer I assume is no. Those were just chosen because we've already invented cameras and microphones. Haptics are less common, and thus less talked about.
Ehhhh, energy-based models are trained via contrastive divergence, not just minimizing a simple loss averaged over the training data.
My mental model of AI advancements is that of a step function with s-curves in each step [1]. Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, demonstrating quick advancements. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found. Examples of steps include AlexNet demonstrating superior image labeling, LeCun demonstrating deep learning, and now OpenAI demonstrating large transformer models.
I think in the past, at each stage, people tended to think that the recent progress was a linear or exponential process that would continue forward. This led to people thinking self-driving cars were right around the corner after the introduction of DL in the 2010s, and that super-intelligence is right around the corner now. I think at each stage, the cusp of the S-curve comes as we find where the model is good enough to be deployed, and where it isn't. Then companies tend to enter a holding pattern for a number of years, getting diminishing returns from small improvements on their models, until the next algorithmic breakthrough is made.
Right now I would guess that we are around 0.9 on the S curve: we can still improve the LLMs (as DeepSeek has shown with wide MoE and o1/o3 have shown with CoT), and it will take a few years for the best uses to be brought to market and popularized. As you mentioned, LeCun points out that LLMs have a hallucination problem built into their architecture; others have pointed out that LLMs have had shockingly few revelations and breakthroughs for something that has ingested more knowledge than any living human. I think future work on LLMs is likely to make some improvement on these things, but not much.
I don't know what it will be, but a new algorithm will be needed to induce the next step on the curve of AI advancement.
[1]: https://www.open.edu/openlearn/nature-environment/organisati...
That seems to be how science works as a whole. Long periods of little progress between productive paradigm shifts.
The problem with LLMs is that the output is inherently stochastic, i.e., there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant lookup maps with interpolation.
Energy minimization is more of an abstract approach in which you can use architectures that don't rely on things like differentiability. True AI won't be solely feedforward architectures like current LLMs. To give an answer, they will basically determine an algorithm on the fly that includes computation and search. To learn that algorithm (or its parameters) at training time, you need something that doesn't rely on continuous values but still converges to the right answer. So instead you assign a fitness score, like memory use or compute cycles, and differentiate between candidates based on that. This is basically how search works with genetic algorithms or PSO.
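For reference, here is a minimal sketch of the kind of gradient-free search described above: a toy evolutionary loop that only needs a fitness score, never a derivative. The fitness function, the target pattern, and every name in it are made up for illustration; it is not anything LeCun has proposed.

```python
import random

TARGET = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical "right answer"

def fitness(candidate):
    # Non-differentiable score: number of positions that differ from
    # the target bit pattern (lower is better).
    return sum(c != t for c, t in zip(candidate, TARGET))

def mutate(candidate, rng):
    # Flip one randomly chosen bit.
    child = list(candidate)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child

def evolve(steps=500, seed=0):
    rng = random.Random(seed)
    best = [rng.randint(0, 1) for _ in range(len(TARGET))]
    for _ in range(steps):
        child = mutate(best, rng)
        # Keep the child only if it is at least as fit as the parent.
        if fitness(child) <= fitness(best):
            best = child
    return best

best = evolve()
```

The loop compares scores and keeps the better candidate, so it works even when the score is a discrete count (or memory use, or compute cycles) that you cannot backpropagate through.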
I don't think this explanation is correct. The input to the decoder at the end of all the attention heads etc (as I understand it) is a probability distribution over tokens. So the model as a whole does have an ability to score low confidence in something by assigning it a low probability.
The problem is that thing is a token (part of a word). So the LLM can say "I don't have enough information" to decide on the next part of a word but has no ability to say "I don't know what on earth I'm talking about" (in general - not associated with a particular token).
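A small sketch of what token-level uncertainty looks like, assuming you have the raw next-token logits in hand. The logit values are invented; the point is only that the entropy of the output distribution measures doubt about the next token, not about the answer as a whole.

```python
import math

def softmax(logits):
    # Convert raw scores into a probability distribution (numerically stable).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; higher means less certainty about this token.
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = softmax([8.0, 0.5, 0.2, 0.1])  # one token dominates
unsure = softmax([1.0, 1.0, 1.0, 1.0])     # four tokens equally likely
```

Here `entropy(unsure)` equals `log(4)` and `entropy(confident)` is close to zero, but neither number says anything about whether the sentence being built is true.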
Rather than inferring from how you imagine the architecture working, you can look at examples and counterexamples to see what capabilities they have.
One misconception is that predicting the next word means there is no internal idea on the word after next. The simple disproof of this is that models put 'an' instead of 'a' ahead of words beginning with vowels. It would be quite easy to detect (and exploit) behaviour that decided to use a vowel word just because it somewhat arbitrarily used an 'an'.
Models predict the next word, but they don't just predict the next word. They generate a great deal of internal information in service of that goal. Placing limits on their abilities by assuming the output they express is the sum total of what they have done is a mistake. The output probability is not what it thinks, it is a reduction of what it thinks.
One of Andrej Karpathy's recent videos talked about how researchers showed that models do have an internal sense of not knowing the answer, but fine-tuning on question answering did not give them the ability to express that knowledge. Finding information the model did and didn't know, then fine-tuning it to say "I don't know" for cases where it had no information, allowed the model to generalise and express "I don't know".
It's strange because just a moment of thinking will show that such ideas are wrong or paint a clearly incomplete picture. And there's plenty of analogies to the dangers of such reductionism. It should be obviously wrong to anyone who has at least tried ChatGPT.
My only explanation is that a denial mechanism must be at play. It simply feels more comfortable to diminish LLM capabilities and/or feel that you understand them from reading a Medium article on transformer-network, than to consider the consequences in terms of the inner black-box nature.
> One misconception is that predicting the next word means there is no internal idea on the word after next. The simple disproof of this is that models put 'an' instead of 'a' ahead of words beginning with vowels.
My understanding is that there's simply not “'an' ahead of a word that starts with a vowel”, the model (or more accurately, the sampler) picks “an” and then the model will never predict a word that starts with a consonant after that. It's not like it “knows” in advance that it wants to put a word with a vowel and then anticipates that it needs to put “an”, it generates a probability for both tokens “a” and “an”, picks one, and then when it generates the following token, it will necessarily take its previous choice into account and never puts a word starting with a vowel after it has already chosen “a”.
"The animal most similar to a crocodile is:"
https://chatgpt.com/share/67d493c2-f28c-8010-82f7-0b60117ab2...
It will always say "an alligator". It chooses "an" because somewhere in the next word predictor it has already figured out that it wants to say alligator when it chooses "an".
If you ask the question the other way around, it will always answer "a crocodile" for the same reason.
That doesn't mean it knows "in advance" what it wants to say, it's just that at every step the alligator is lurking in the logits because it directly derives from the prompt.
It will also emit "a" from time to time without issue though, but will never spit "alligator" right after that, that's it.
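To make the two readings concrete, here is a toy sketch in which the article probability is just the marginal over a distribution of nouns the model is leaning towards. The table and its numbers are invented; a real model exposes no such table, so treat this as an illustration of the argument rather than of the mechanism.

```python
# Hypothetical distribution over the noun the model is leaning towards
# after the crocodile prompt; the probabilities are invented.
noun_probs = {"alligator": 0.90, "caiman": 0.07, "iguana": 0.03}

def article_for(noun):
    return "an" if noun[0] in "aeiou" else "a"

def article_probs(nouns):
    # Marginalise the noun distribution onto the article that precedes it.
    out = {"a": 0.0, "an": 0.0}
    for noun, p in nouns.items():
        out[article_for(noun)] += p
    return out

def nouns_after(article, nouns):
    # Condition on the sampled article and renormalise over compatible nouns.
    compatible = {n: p for n, p in nouns.items() if article_for(n) == article}
    total = sum(compatible.values())
    return {n: p / total for n, p in compatible.items()}
```

With these numbers "an" gets about 0.93 of the mass because "alligator" dominates, and if the sampler happens to pick "a" anyway, "alligator" drops out of the candidates entirely. Both observations in the thread fall out of the same table.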
> Sure, it derives from the prompt but so does everything an LLM generates, and same for any other AI mechanism for generating answers.
Not really, because of the autoregressive nature of LLMs, the longer the response the more it will depend on its own response rather than the prompt. That's why you can see totally opposite responses from an LLM to the same query if you aren't asking basic factual questions. I saw a tool on reddit a few months ago that allowed you to see which words in the generation were the most “opinionated” (where the sampler had to choose between alternative words that were close in probability), and it was easy to see that you could dramatically affect the result by just changing certain words.
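A rough sketch of how such a tool could flag "opinionated" positions, assuming you can log the per-step candidate probabilities. This is not the reddit tool itself; the data, the `margin` threshold, and the function name are hypothetical, and it assumes the top candidate was the one emitted.

```python
def close_calls(steps, margin=0.1):
    """Return (position, top_token, runner_up) for every step where the two
    most likely candidates were within `margin` of each other."""
    flagged = []
    for pos, dist in enumerate(steps):
        ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
        (top, p1), (second, p2) = ranked[0], ranked[1]
        if p1 - p2 < margin:
            flagged.append((pos, top, second))
    return flagged

# Invented per-step candidate probabilities for a three-token generation.
steps = [
    {"The": 0.95, "A": 0.05},
    {"answer": 0.48, "problem": 0.45, "cat": 0.07},
    {"is": 0.90, "was": 0.10},
]
flagged = close_calls(steps)
```

Only position 1 is flagged: "answer" beat "problem" by a hair, so a different sample there would have sent the rest of the response somewhere else.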
> "an" gets a high probability because the model internally knows that "alligator" is the correct thing after that.
This is true, though it only works with this kind of prompt because the output of the LLM has little impact on the generation.
Globally I see what you mean, and I don't disagree with you, but at the same time, I think that saying that LLMs have a sense of anticipating further tokens misses their ability to be led astray by their own output: they have some information that will affect further tokens, but any token that gets emitted can, and will, change that information in a way that can dramatically change the “plans”. And that's why I think using trivial questions isn't a good illustration, because it sweeps this effect under the rug.
https://chat.groq.com/?prompt=If+a+person+from+Ontario+or+To...
The response "If a person from Ontario or Toronto is a Canadian, a person from Sydney or Melbourne would be an Australian!"
It seems mighty unlikely that it chose Australian as the country because of the 'an', or that it chose to put the 'an' at that point in the sentence for any other reason than that the word Australian was going to be next.
For any argument that you think shows that this does not mean that they have some idea of what is to come, try to come up with a test to see if your hypothesis is true or not, then give that test a try.
That is a very interesting observation!
Doesn’t that internal state get blown away and recreated for every “next token”? Isn’t the output always the previous context plus the new token, which gets fed back and out pops the new token? There is no transfer of internal state to the new iteration beyond what is “encoded” in its input tokens?
That is correct. When a model has a good idea of the next 5 words, after it has emitted the first of those 5 most architectures make no further use of the other 4 and regenerate likely the same information again in the next inference cycle.
There are architectures that don't discard all that information but the standard LLM has generally outperformed them, for now.
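In code, the loop being described looks roughly like the sketch below. `toy_model` is a made-up stand-in for a forward pass; the point is only that the token list is the sole thing carried from one step to the next. (Real inference stacks cache per-token keys and values, but that cache is itself a function of the tokens so far, so the picture is the same.)

```python
def toy_model(tokens):
    # Stand-in for a forward pass: builds throwaway internal state from the
    # tokens alone and returns only the next token.
    internal_state = {"length": len(tokens), "last": tokens[-1]}
    return f"tok{internal_state['length']}"

def generate(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        next_token = toy_model(tokens)  # state is rebuilt from tokens each step
        tokens.append(next_token)       # only the token survives the step
    return tokens

out = generate(["a", "b"], 3)
```

Whatever the model worked out about words two to five while emitting word one has to be worked out again on the next pass, unless it can be recovered from the tokens themselves.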
There are interesting philosophical implications if LLMs were to advance to a level to be considered sentient. Would it not be constantly creating and killing a thinking being for every token? On the other hand, if context is considered memory, perhaps continuity of identity is based upon memory and all that other information is simply forgotten idle thoughts. We have no concept of what our previous thoughts were except from our memory. Is that not the same?
Sometimes I wonder if some of the resistance to AI is because it can do things that we think require abilities that we would like to believe that we possess ourselves, and showing that they are not necessary creates the possibility that we might not have those abilities.
There was a great observation recently in an interview (I forget the source, but the interviewer's last name was Bi) that some of the discoveries that met the most resistance in history such as the Earth orbiting the Sun, or Darwin's theory of evolution were similar in that they implied that we are not a unique special case.
If the training data contained a bunch of answers to questions which were simply "I don't know", you could get an LLM to say "I don't know" but that's still not actually a concept of not knowing. That's just knowing that the answer to your question is "I don't know".
It's essentially like if you had an HTTP server that responded to requests for nonexistent documents with a "200 OK" containing "Not found". It's fundamentally missing the "404 Not found" concept.
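The analogy in miniature, with two hypothetical handlers over a made-up document table:

```python
DOCS = {"/index.html": "hello"}

def soft_404(path):
    # Always reports success; the failure is only visible in the body text.
    return 200, DOCS.get(path, "Not found")

def proper_404(path):
    # Signals failure out of band, through the status code.
    if path in DOCS:
        return 200, DOCS[path]
    return 404, "Not found"
```

A client of `soft_404` has to read the body to notice anything went wrong; a client of `proper_404` can branch on the status alone.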
LLMs just have a bunch of words--they don't understand what the words mean. There's no metacognition going on for it to think "I don't know" for it to even think you would want to know that.
I'm not sure if this objection is terribly helpful. We use terms like think and want to describe processes that clearly do not involve any form of understanding. Electrons do not have motivations but they 'want' to go to a lower energy level in an atom. You can hold down the trigger for the fridge light to make it 'think' that the door has not been opened. These are uncontentious phrases that convey useful ideas.
I understand that when people are working towards producing reasoning machines the words might be working in similar spaces, but really when someone is making claims about machines having awareness, understanding, or thinking they make it quite clear about the context that they are talking about.
As to the rest of your comment, I simply disagree. If you think of a concept as an internal representation of a piece of information, then it has been shown that they do have such representations. In the Karpathy video I mentioned, he talks about how researchers found that models did have an internal representation of not knowing, but that the fine-tuning was restricting them to providing answers. The researchers gave the model fine-tuning examples where it said "I don't know" for information that they knew the model didn't know, and this generalised to produce "I don't know" for examples that were not in the training data. For the fine-tuning examples to succeed in that, the model must already contain the concept.
I would agree that models do not have any in-depth understanding of what lack of knowledge actually is. On the other hand I would also think that this also applies to humans, most people are not philosophers.
I think the fact that the models can express details about words shows that they do have detailed information about what each word means semantically. In many respects, because of tokenisation and embeddings, it would perhaps be more accurate to say that they have a better understanding of the semantic information of what words mean than of what the words actually are. This is why they are poor at spelling but can give you detailed information about the thing they can't spell.
...and that's why so many people are confused about what's going on with LLMs: sloppy, ambiguous use of language.
> In the Karpathy video I mentioned he talks about how researchers found that models did have an internal representation of not knowing, but that the fine tuning was restricting it to providing answers. Giving it fine-tuning examples where it said "I don't know" for information that they knew the model didn't know.
This is why I included the HTTP example: this is simply telling it to parrot the phrase "I don't know"--it doesn't understand that it doesn't know. From the LLM's perspective, it "knows" that the answer is "I don't know". It's returning a 200 OK that says "I don't know" rather than returning a 404.
Do you understand the distinction I'm making here?
> I would agree that models do not have any in-depth understanding of what lack of knowledge actually is. On the other hand I would also think that this also applies to humans, most people are not philosophers.
The average (non-programmer) human, when asked to write a "Hello, world" program, can definitely say they don't know how to program. And unlike the LLM, the human knows that this is different from answering the question. The LLM, in contrast thinks it is answering the question when it says "I don't know"--it thinks "I don't know" is the correct answer.
Put another way, a human can distinguish between responses to these two questions, whereas an LLM can't:
1. What is my grandmother's maiden name?
2. What is the English translation of the Spanish phrase, "No sé."?
In the first question, you don't know the answer unless you are quite creepy; in the second case you do (or can find out easily). But the LLM tuned to answer I don't know thinks it knows the answer in both cases, and thinks the answer is the same.
There is a difference between explanation by metaphor and lack of precision. If you think someone is implying something literal when they might be using a metaphor, you can always ask for clarification. I know plenty of people who are utterly precise in their use of language, which leads them to be widely misunderstood because they think a weak precise signal is received as clearly as a strong imprecise signal. They usually think the failure in communication is in the recipient, but in reality they are just accurately using the wrong protocol.
> Do you understand the distinction I'm making here?
I believe I do, and it is precisely this distinction that the researchers showed. By teaching a model to say "I don't know" for some information that they knew the model did not know the answer to, the model learned to respond "I don't know" for things it did not know, even where it was not explicitly taught to respond with "I don't know". For it to acquire that ability to generalise to new cases, the model has to have already had an internal representation of "That information is not available".
I'm not sure where you think a model converting its internal representation of not knowing something into words is distinct from a human converting its internal representation of not knowing into words.
When fine-tuning directs a model to profess lack of knowledge, usually they will not give the same specific "I don't know" text as the way to express that it does not know, because they want to bind the concept "lack of knowledge" to the concept "communicate that I do not know" rather than to any particular phrase. Giving it many ways to say "I don't know" builds that binding rather than the crude "if X then emit Y" that you imagine it to be.
Other architectures, like energy-based models or Bayesian ones, can assess uncertainty. Transformers simply cannot do it (yet). Yes, there are ways to do it, but we are already spending millions to get coherent phrases; few will burn billions to train a model that can do that kind of assessment.
There's a quite intense backlog of new stuff that hasn't made it to prod. (I would have told you in 2023 that we would have ex. switched to Mamba-like architectures in at least one leading model)
Broadly, it's probably unhelpful that:
- absolutely no one wants the PR of releasing a model that isn't competitive with the latest peers
- absolutely everyone wants to release an incremental improvement, yesterday
- Entities with no PR constraint, and no revenue repercussions when reallocating funds from surely-productive to experimental, don't show a significant improvement in results for the new things they try (I'm thinking of ex. Allen Institute)
Another odd property I can't quite wrap my head around is the battlefield is littered with corpses that eval okay-ish, and should have OOM increases in some areas (I'm thinking of RWKV, and how it should be faster at inference), and they're not really in the conversation either.
Makes me think either A) I'm getting old and don't really understand ML from a technical perspective anyway or B) hey, I've been maintaining a llama.cpp wrapper that works on every platform for a year now, I should trust my instincts: the real story is UX is king and none of these things actually improve the experience of a user even if benchmarks are ~=.
As for the fragmentation of progress, I guess that's just par for the course for any tech with such a heavy private/open source split. It would take a huge amount of work to trawl through this constant stream of 'breakthroughs' and put them all together.
What you can't currently get, from a (linear) Transformer, is a way to induce a similar observable "fault" in any of the hidden layers. Each hidden layer only speaks the "language" of the next layer after it, so there's no clear way to program an inference-framework-level observer side-channel that can examine the output vector of each layer and say "yup, it has no confidence in any of what it's doing at this point; everything done by layers feeding from this one will just be pareidolia — promoting meaningless deviations from the random-noise output of this layer into increasing significance."
You could in theory build a model as a Transformer-like model in a sort of pine-cone shape, where each layer feeds its output both to the next layer (where the final layer's output is measured and backpropped during training) and to an "introspection layer" that emits a single confidence score (a 1-vector). You start with a pre-trained linear Transformer base model, with fresh random-weighted introspection layers attached. Then you do supervised training of (prompt, response, confidence) triples, where on each training step, the minimum confidence score of all introspection layers becomes the controlled variable tested against the training data. (So you aren't trying to enforce that any particular layer notice when it's not confident, thus coercing the model to "do that check" at that layer; you just enforce that a "vote of no confidence" comes either from somewhere within the model, or nowhere within the model, at each pass.)
This seems like a hack designed just to compensate for this one inadequacy, though; it doesn't seem like it would generalize to helping with anything else. Some other architecture might be able to provide a fully-general solution to enforcing these kinds of global constraints.
(Also, it's not clear at all, for such training, "when" during the generation of a response sequence you should expect to see the vote-of-no-confidence crop up — and whether it would be tenable to force the model to "notice" its non-confidence earlier in a response-sequence-generating loop rather than later. I would guess that a model trained in this way would either explicitly evaluate its own confidence with some self-talk before proceeding [if its base model were trained as a thinking model]; or it would encode hidden thinking state to itself in the form of word-choices et al, gradually resolving its confidence as it goes. In neither case do you really want to "rush" that deliberation process; it'd probably just corrupt it.)
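A forward-pass-only sketch of the pine-cone idea, to show where the minimum over introspection layers enters the loss. Everything here is a toy assumption: the dense `tanh` layers stand in for Transformer blocks, the weights are random, and there is no backprop or pre-trained base model.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(vec, weights):
    # Toy dense layer with tanh; stands in for one Transformer block.
    return [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in weights]

def confidence_head(vec, head):
    # "Introspection layer": maps a hidden vector to one score in (0, 1).
    return sigmoid(sum(w * v for w, v in zip(head, vec)))

def forward(x, layers, heads):
    confidences = []
    h = x
    for weights, head in zip(layers, heads):
        h = layer(h, weights)
        confidences.append(confidence_head(h, head))
    # The supervised quantity is the minimum over layers: a vote of no
    # confidence from any single layer is enough.
    return h, min(confidences), confidences

def confidence_loss(min_conf, target):
    # Binary cross-entropy between the min confidence and the labelled one.
    eps = 1e-9
    return -(target * math.log(min_conf + eps)
             + (1 - target) * math.log(1 - min_conf + eps))

rng = random.Random(0)
dim, depth = 4, 3
layers = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
          for _ in range(depth)]
heads = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(depth)]
h, min_conf, all_conf = forward([0.5, -0.2, 0.1, 0.9], layers, heads)
loss = confidence_loss(min_conf, target=1.0)
```

Because only the minimum is compared against the label, the gradient in a real implementation would flow to whichever layer voted lowest on that step, which matches the "somewhere or nowhere" constraint described above.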
This is true in terms of default mode for LLMs, but there's a fair amount of research dedicated to the idea of training models to signal when they need grounding.
SelfRAG is an interesting, early example of this [1]. The basic idea is that the model is trained to first decide whether retrieval/grounding is necessary and then, if so, after retrieval it outputs certain "reflection" tokens to decide whether a passage is relevant to answer a user query, whether the passage is supported (or requires further grounding), and whether the passage is useful. A score is calculated from the reflection tokens.
The model then critiques itself further by generating a tree of candidate responses, and scoring them using a weighted sum of the score and the log probabilities of the generated candidate tokens.
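A stripped-down sketch of that final ranking step, assuming each candidate comes with its token log-probabilities and a single reflection-derived score. The candidates, numbers, and `weight` parameter are invented, and this is not the paper's exact formula.

```python
def candidate_score(token_logprobs, reflection_score, weight=1.0):
    # Sequence log-probability plus a weighted critique score.
    return sum(token_logprobs) + weight * reflection_score

# Two hypothetical candidate responses with made-up scores.
candidates = {
    "grounded answer": {"logprobs": [-0.2, -0.3, -0.1], "reflection": 0.9},
    "fluent but unsupported": {"logprobs": [-0.1, -0.1, -0.1], "reflection": 0.1},
}

def pick_best(cands, weight=1.0):
    return max(
        cands,
        key=lambda name: candidate_score(
            cands[name]["logprobs"], cands[name]["reflection"], weight
        ),
    )

best = pick_best(candidates)
```

With `weight=1.0` the grounded candidate wins even though it is less fluent; with `weight=0.0` the ranking falls back to plain likelihood and the unsupported one wins.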
We can probably quibble about the loaded terms used here like "self-reflection", but the idea that models can be trained to know when they don't have enough information isn't pure fantasy today.
[1] https://arxiv.org/abs/2310.11511
EDIT: I should also note that I generally do side with LeCun's stance on this, but not due to the "not enough information" canard. I think models learning from abstraction (i.e. JEPA, energy-based models) rather than memorization is the better path forward.
Isn't that true with humans too?
There's some leap humans make, even as stochastic parrots, that lets us generate new knowledge.
If I had been born a day earlier or later I would have a completely different life because of initial conditions and randomness but life doesn't feel that way even though I think this is obviously true.
Have you ever tried telling ChatGPT that you're "in the city centre" and asking it if you need to turn left or right to reach some landmark? It will not answer with the average of the directions given to everybody who asked the question before, it will answer asking you to tell it where you are precisely and which way you are facing.
But if you ask it in terms of a knowledge test ("I'm at the corner of 1st and 2nd, what public park am I standing next to?") a model lacking web search capabilities will confidently hallucinate (unless it's a well-known park).
In fact, my personal opinion is that therein lies the most realistic way to reduce hallucination rates: rather than trying to train models to say "I don't know" (which is not really a trainable thing - models are fundamentally unaware of the limits of their own training data), instead just train them on which kinds of questions warrant a web search and which ones should be answered creatively.
One was GPT 4.5 preview, and one was cohort-chowder (which is someone's idea of a cute code name, I assume).
Perhaps you thought I meant "1st and 2nd" literally? I was just using those as an example so I don't reveal where I live. You should use actual street names that are near a public park, and you can feel free to specify the city and state.
that also entails information destruction in the form of the logits table, but for the most part that should be accounted for in the last step before final feedforward
This is obviously not true at this point except for the most loose definition of interpolation.
>don't rely on things like differentiability.
I've never heard LeCun say we need to move away from gradient descent. The opposite actually.
To answer your question, think about how we train LLMs: We have them learn the statistical distribution of all written human language, such that given a chunk of text (a prompt, etc.) the model samples from its output distribution to produce the next most likely token (word, sub-word, etc.) and keeps doing that. It never learns how to judge what is true or false, and during training it never needs to learn "Do I already know this?" It is just spoon-fed information that it has to memorize and has no ability to acquire metacognition, which is something that it would need to be trained to attain. As humans, we know what we don't know (to an extent) and can identify when we already know something or don't already know something, such that we can say "I don't know." During training, an LLM is never taught to do this sort of introspection, so it never will know what it doesn't know.
I have a bunch of ideas about how to address this with a new architecture and a lifelong learning training paradigm, but it has been hard to execute. I'm an AI professor, but really pushing the envelope in that direction requires I think a small team (10-20) of strong AI scientists and engineers working collaboratively and significant computational resources. It just can't be done efficiently in academia where we have PhD student trainees who all need to be first author and work largely in isolation. By the time AI PhD students get good, they graduate.
I've been trying to find the time to focus on getting a start-up going focused on this. With Terry Sejnowski, I pitched my ideas to a group affiliated with Schmidt Sciences that funds science non-profits at around $20M per year for 5 years. They claimed to love my ideas, but didn't go for it....
"when do you close the round?" = maybe
money in the bank account = yes
Will Titans be sufficiently "neuroplastic" to escape that? Maybe, I'm not sure.
Ultimately, I think an architecture around "looping" where the model outputs are both some form of "self update" and "optional actionality" such that interacting with the model is more "sampling from a thought space" will be required.
But I also think that current LLM tech does not lead to AGI. You can't train something on pattern matching and then have it magically become intelligent (although I could be wrong).
Imo an AGI would need to be able to interact with the environment and learn to reflect on its interactions and its abilities within it. I suspect we have the hardware to build something as intelligent as a cat or a dog, but not the algorithms.
The majority are focusing on the wrong paradigms and the wrong questions, which blocks progress towards the kinds of continual learning needed to make progress towards creating models that think in latent space and enabling meta-cognition, which would then give architectures the ability to avoid hallucinations by knowing what they don't know.
Compared to other methods for continual learning on ImageNet-1K, SIESTA requires 7x-60x less compute than other methods and achieves the same performance as a model trained in an offline/batch manner. It also works for arbitrary distributions rather than a lot of continual learning methods that only work for specific distributions (and hence don't really match any real-world use case): https://yousuf907.github.io/siestasite/
In this one we focused on mitigating the drop in performance when a system encounters a new distribution. This resulted in a 16x speed up or so: https://yousuf907.github.io/sgmsite/
In this one, we show how the strategy for creating multi-modal LLMs like LLaVA is identical to a two-task continual learning system and we note that many LLMs once they become multi-modal forget a large amount of the capabilities of the original LLM. We demonstrate that continual learning methods can mitigate that drop in accuracy enabling the multi-modal task to be learned while not impairing uni-modal performance: https://arxiv.org/abs/2410.19925 [We have a couple approaches that are better now that will be out in the next few months]
It really depends on what you are interested in. For production AI, the real need is computational efficiency and keeping strong models up-to-date. Not many labs besides mine are focusing on that.
Currently, I'm focused on continual learning for creating systems beyond LLMs that incrementally learn meta-cognition, and on using continual learning to explain how memory consolidation works in mammals and why we have REM phases during sleep, but that's more of a cognitive science contribution, so the constraints on the algorithms differ since the goal differs.
That's a nice model: human short-term memory is akin to the context window, and REM sleep consolidating longer-term memories is akin to updating the model itself.
How difficult would it be to perform limited focused re-training based on what's been learnt (e.g. new information, new connections, corrections of errors, etc.) within a context window?
These long horizon (agi) problems have been there since the very beginning. We have never had a solution to them. RL assumes we know the future which is a poor proxy. These energy based methods fundamentally do very little that an RNN didn't do long ago.
I worked on higher-dimensionality methods, which is a very different angle. My take is that it's about the way we scale dependencies between connections. The human brain makes and breaks a massive number of neuron connections daily. Scaling the dimensionality would imply that a single connection could be scaled to encompass significantly more "thoughts" over time.
Additionally, the true solution to these problems is as likely to come from a kid with a laptop as from a top researcher. If you find the solution to CL on a small AI model (mnist), you solve it at all scales.
Somehow, it feels harder to trust a model that could evolve over time. Its performance might even degrade. That's a steep price to pay for having memory built in and a (possibly) self-evolving model.
I could revise that by saying a kid with a whiteboard.
It's an Einstein×10 moment, so who knows when that'll happen.
https://arxiv.org/abs/2502.09992
https://www.inceptionlabs.ai/news
(these are results from two different teams/orgs)
It sounds kind of like what you're describing, and nobody else has mentioned it yet, so take a look and see whether it's relevant.
[1] Which Inception Labs's new models may be based on; one of the cofounders is a co-author. See equations 18-20 in https://arxiv.org/abs/2310.16834
I think what LeCun is probably getting at is that there's currently no way for a model to say "I don't know". Instead, it'll just do its best. For esoteric topics, this can result in hallucinations; for topics where you push just past the edge of well-known and easy-to-Google, you might get a vacuously correct response (i.e. repetition of correct but already known or useless information). The models are trained to output a response that meets the criteria of quality as judged by a human, but there's no decent measure (that I'm aware of) of the accuracy of the knowledge content, or the model's own limitations. I actually think this is why programming and mathematical tasks have such a large impact on model performance: because they encode information about correctness directly into the task.
So Yann is probably right, though I don't know that energy minimization is a special distinction that needs to be added. Any technique that we use for this task could almost certainly be framed as energy minimization of some energy function.
I say schedule because the “static data once through” approach is, in my mind, one of the root problems.
Think about what happens when you read something like a book. You’re not “just” reading it, you’re also comparing it to other books, other books by the same author, while critically considering the book recommendations made by your friend. Any events in the book get compared to your life experience, etc…
LLM training does none of this! It’s a once-through text prediction training regime.
What this means in practice is that an LLM can’t write a review of a book unless it has read many reviews already. They have, of course, but the problem doesn’t go away. Ask an AI to critique book reviews and it’ll run out of steam because it hasn’t seen many of those. Critiques of critiques is where they start falling flat on their face.
This kind of meta-knowledge is precisely what experts accumulate.
As a programmer I don’t just regurgitate code I’ve seen before with slight variations; instead I know that mainstream criticisms of microservices miss their key benefit of extreme team scalability!
This is the crux of it: when humans read their training material they are generating an “n+1” level in their mind that they also learn. The current AI training setup trains the AI only on the “n”th level.
This can be solved by running the training in a loop for several iterations after base training. The challenge of course is to develop a meaningful loss function.
IMHO the “thinking” model training is a step in the right direction but nowhere near enough to produce AGI all by itself.
Intelligence finds solutions - actual, solid solutions.
More than "fixing" hallucinations, the problem is going beyond them (arriving at "sobriety").
They aren't fact machines. They are concept machines.
The paper would be a strong argument against your point: if neural architectures are already constraining the amount of information that a text generation system delivers the same way a human (allegedly) does, then I don't see which "energy" measure one could take that could perform any better.
Then again, perhaps they have one in mind and I just haven't read it.
While it sounds nice to reframe it like a physics problem, it seems like a fundamentally flawed idea, akin to saying “there is a closed form solution to the question of how should I live.” The problem isn’t hallucinations, the problem is that language and relativism are inextricably linked.
If two nodes are on, but the connection between them is negative, this causes energy to be higher.
If one of those nodes switches off, energy is reduced.
With two nodes this is trivial. With 10 nodes it's more difficult to solve, and with billions of nodes it is impossible to "solve".
All you can do then is try to get the energy as low as possible.
This way neural networks can also find "new" information that they have not learned but that is consistent with the constraints they have learned about the world so far.
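To make the on/off picture above concrete, here is a minimal sketch of a Hopfield-style energy. The function names, the 0/1 node states, and the greedy flip rule are my own assumptions for illustration, not anything taken from the linked course.

```python
import itertools

def energy(state, weights):
    """Hopfield-style energy: E = -sum over pairs i<j of w[i][j] * s[i] * s[j], s in {0, 1}."""
    n = len(state)
    return -sum(weights[i][j] * state[i] * state[j]
                for i in range(n) for j in range(i + 1, n))

def brute_force_minimum(weights):
    """Try all 2^n on/off states; only feasible for a handful of nodes."""
    n = len(weights)
    return min(itertools.product((0, 1), repeat=n),
               key=lambda s: energy(s, weights))

def greedy_descent(state, weights, sweeps=10):
    """Flip one node at a time whenever the flip lowers the energy.
    This only reaches a local minimum, which is the "as low as possible" case."""
    state = list(state)
    for _ in range(sweeps):
        changed = False
        for i in range(len(state)):
            flipped = state.copy()
            flipped[i] = 1 - flipped[i]
            if energy(flipped, weights) < energy(state, weights):
                state, changed = flipped, True
        if not changed:
            break
    return state

# Two nodes joined by a negative connection.
two_nodes = [[0, -1], [-1, 0]]
both_on = energy((1, 1), two_nodes)   # higher energy
one_off = energy((1, 0), two_nodes)   # lower energy
print(both_on, one_off)
print(greedy_descent([1, 1], two_nodes))
```

The brute-force search doubles in cost with every added node, which is why anything beyond a toy network has to settle for descent to a low-energy state rather than an exact solution.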
I learned it from: https://youtube.com/playlist?list=PLLHTzKZzVU9eaEyErdV26ikyo...
Yann LeCun, and Michael Bronstein and his colleagues have some similarities in trying to properly Sciencify Deep Learning.
Yann LeCun's approach, at least for vision, has one core tenet: energy minimization, just like in physics. In his course, he also shows some current architectures and algorithms to be special cases of EBMs.
Yann believes that understanding the whys of the behavior of DL algorithms is going to be more beneficial in the long term than playing around with hyper-params.
There is also a case for language being too low-dimensional to lead to AGI even if it is solved. For instance, in a recent video he said that the total amount of data in all digitized books and on the internet is about the same as what a human child takes in during the first 4 or 5 years. He considers this low.
There are also epistemological arguments against language being able to lead to AGI, but I haven't heard him talk about them.
He also believes that vision is a more important aspect of intelligence, one reason being that it is very high-dimensional. (Edit) Consider an example. Take 4 monochrome pixels. All pixels can range from 0 to 255. 4 pixels can create 256^4 = 2^32 combinations. 4 words can create 4! = 24 combinations. Solving language is easier and therefore low-stakes. Remember the monkey producing a Shakespeare play by randomly punching typewriter keys? If that was an astronomically big number, think how obscenely long it would take a monkey to paint Mona Lisa by randomly assigning pixel values. Left as an exercise to the reader.
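For what it's worth, the counts in that example can be checked directly. A quick sketch follows; the 50,000-word vocabulary is my own assumption for illustration and is not from the comment. Note that 4! counts the orderings of four fixed words, while 256^4 counts every possible assignment of values to four pixels, so the two figures measure different things.

```python
import math

# Four monochrome pixels, 256 intensity levels each.
pixel_patterns = 256 ** 4
print(pixel_patterns, pixel_patterns == 2 ** 32)

# Orderings of four fixed, distinct words (the 4! figure in the comment).
word_orderings = math.factorial(4)
print(word_orderings)

# Assumption (not from the comment): each of the four positions may hold
# any word from a hypothetical 50,000-word vocabulary.
VOCAB_SIZE = 50_000
word_sequences = VOCAB_SIZE ** 4
print(word_sequences)
```

Which count is the fairer comparison depends on whether the words are treated as fixed or as free choices, so the result is quite sensitive to that assumption.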
Juergen Schmidhuber has gone a lot quieter now. But he also said that a world model explicitly included in training and reasoning is better than only text or image or whatever. He has a good paper with Lucas Beyer.
I am very sorry.
I may have an actual opinion on his viewpoint, however, I have a nitpick even before that.
How exactly is 'LLM' defined here? Even if some energy-based thing is done, would some not call even that an LLM? If/when someone finds a way to fix it within the 'token choice' method, could some people not just start calling it something different from 'LLM'?
I think Yann needs to rephrase what exactly he wants to say.
In particular, if you train an LLM to do Task A and Task B with acceptable accuracy, that does not guarantee it can combine the tasks in a common-sense way. "For each step of A, do B on the intermediate results" is a whole new Task C that likely needs to be fine-tuned. (This one actually does have some theoretical evidence coming from computational complexity, and it was the first thing I noticed in 2023 when testing chain-of-thought prompting. It's not that the LLM can't do Task C, it just takes extra training.)
You must always keep close to the only known example we have of an intelligence which is the human brain. As soon as you start to wander away from the way the human brain does it, you are on your own and you are not relying on known examples of intelligence. Certainly that might be possible, but since there's only one known example in this universe of intelligence, it seems ridiculous to do anything but stick close to that example, which is the human brain.
Oh yeah? This is begging the question.
This is obviously an extremely high level simplification, but that's the core of it.
The physics of human consciousness are not implemented in a leaky symbolic abstraction but the raw physics of existence.
The sort of autonomous system we imagine when thinking AGI must be built directly into substrate and exhibit autonomous behavior out of the box. Our computers are blackboxes made in a lab without centuries of evolving in the analog world, finding a balance to build on. They either can do a task or cannot. Obviously from just looking at one we know how few real world tasks it can just get up and do.
Code isn’t magic, it’s instruction to create a machine state. There’s no inherent intelligence to our symbolic logic. It’s an artifact of intelligence. It cannot imbue intelligence into a machine.
For example, if a prompt is “what is the Statue of Liberty”, the LLM's first output token is going to be “the”, but it kinda already “knows” that the next ones are going to be “statue of liberty”.
So to me LLMs already “choose” a response path from the first token.
Conversely, a LLM that would try and find a minimum energy for the whole response wouldn’t necessarily stop hallucinating. There is nothing in the training of a model that says that “I don’t know” has a lower “energy” than a wrong answer…
As a result, you'll never be able to get 100% consistent outputs or behavior (like you hypothetically can with a traditional algorithm/business logic). And that has proven out in usage across every model I've worked with.
There's also an upper-bound problem in terms of context where every LLM hits some arbitrary amount of context that causes it to "lose focus" and develop a sort of LLM ADD. This is when hallucinations and random, unrequested changes get made and a previously productive chat spirals to the point where you have to start over.
Human brains have the same problem, as any intelligence probably does. The solution is structured thinking: one piece at a time, often top-down. Educated humans do it, and LLMs can be orchestrated to do it too. The effective context window will be limited even though some claim millions of tokens.
Assuming that text-only models will hit a bottleneck, then to get next-generation models, in addition to a new architecture, do we also have to find a rich dataset that is even more generic and much richer in modalities, with the architecture able to natively ingest it?
However, something that is not predictable is how well the emergent properties can scale further with model size. Maybe a few more unlocks, like the model being able to retain information well in spite of a really large context length, or the ability to SFT on super complex reasoning tasks without disrupting the weights enough to lose unsupervised learning, might take us much further?
The short answer should be that it's obvious LLM training and inference are both ridiculously inefficient and biologically implausible, and therefore there have to be some big optimization wins still on the table.
What I mean is this: A brain today is obviously far more efficient at intelligence than our current approaches to AI. But a brain is a highly specialized chemical computer that evolved over hundreds of millions of years. That leaves a lot of room for inefficient and implausible strategies to play out! As long as wins are preserved, efficiency can improve this way anyway.
So the question is really, can we short cut that somehow?
It does seem like doing so would require a different approach. But so far all our other approaches to creating intelligence have been beaten by the big simple inefficient one. So it’s hard to see a path from here that doesn’t go that route.
[0] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
In ANNs we backprop uniformly, so the error correction is distributed over the whole network. This is why LLM training is inefficient.
For example, analog computers can differentiate near instantly by leveraging the nature of electromagnetism and you can do very basic analogs of complex equations by just connecting containers of water together in certain (very specific) configurations. Are we sure that these optimizations to get us to AGI are possible without abusing the physical nature of the world? This is without even touching the hot mess that is quantum mechanics and its role in chemistry which in turn affects biology. I wouldn't put it past evolution to have stumbled upon some quantum mechanic that allowed for the emergence of general intelligence.
I'm super interested in anything discussing this but have very limited exposure to the literature in this space.
Which we should expect, even from prior experience with any other AI breakthrough, where first we learn to do it and then we learn to do it efficiently.
E.g. Deep Blue in 1997 was IBM showing off a supercomputer, more than it was any kind of reasonably efficient algorithm, but those came over the next 20-30 years.
Inefficiency in data input is also an interesting concept. It seems to me humans get more data in than even modern frontier models; if you use the gigabit/s estimates for sensory input. Care to elaborate on your thoughts?
I really like this approach. Showing that we must be doing it wrong because our brains are more efficient and we aren't doing it like our brains.
Is this a common thing in ML papers or something you came up with?
We know there is a more efficient solution (human brain) but we don’t know how to make it.
So it stands to reason that we can make more efficient LLMs, just like a CPU can add numbers more efficiently than humans.
Wheels other than rolling would likely never evolve naturally because there's no real incremental path from legs to wheels, whereas flippers can evolve from webbed fingers incrementally getting better for moving in water.
I dunno, maybe there's an evolutionary path for wheels, but I don't think so.
Have you heard of https://en.wikipedia.org/wiki/Bio-inspired_computing ?
I believe human and machine learning unify into a pretty straightforward model and this shows that what we're doing that ML doesn't can be copied across, and I don't think the substrate is that significant.
This is true also for the much bigger neural net that works in your brain, and even if you're the world champion of chess. Clearly your argument doesn't hold water.
At playing chess. (But also at doing sums and multiplications, yay!)
> So you should also agree with me that those who say the only path to AGI is LLM maximalism are misguided.
No. First of all, it's a claim you just made up. What we're talking about is people saying that LLMs are not the path to AGI- an entirely different claim.
Second, assuming there's any coherence to your argument, the fact that a small program can outclass an enormous NN is irrelevant to the question of whether the enormous NN is the right way to achieve AGI: we are "general intelligences" and we are defeated by the same chess program. Unless you mean that achieving the intelligence of the greatest geniuses that ever lived is still not enough.
Unlike in chess, there’s a functionally infinite number of actions you can take in real life. So just argmax over possible actions is going to be hard.
Two, you have to have some value function of how good an action is in order to argmax. But many actions are impossible to know the value of in practice because of hidden information and the chaotic nature of the world (butterfly effect).
I think you are thinking of the fact that it had to be approached in a different way than Minimax in chess because a brute force decision tree grows way too fast to perform well. So they had to learn models for actions and values.
In any case, Go is a perfect information game, which as I mentioned before, is not the same as problems in the real world.
I don't know about you, but I certainly don't generate text autoregressively, token by token. Also, pretty sure I don't learn by global updates based on taking the derivative of some objective function of my behavior with respect to every parameter defining my brain. So there's good biological reason to think we can go beyond the capabilities of current architectures.
I think probably an example of the kind of new architectures he supports is FB's Large Concept Models [1]. It's still a self-attention, autoregressive architecture, but the unit of regression is a sentence rather than a token. It maps sentences into a latent space via an autoencoder architecture, then has a transformer architecture in which the tokens are elements in that latent space.
Disclosure: I am the author of this paper.
Reference: (PDF) Hydra: Enhancing Machine Learning with a Multi-head Predictions Architecture. Available from: https://www.researchgate.net/publication/381009719_Hydra_Enh... [accessed Mar 14, 2025].
As I discuss in the paper, predictive coding suggests that the brain actively generates predictions and compares them to incoming sensory data (vision, hearing, etc.), prioritizing anomalies. Its efficiency stems from a hierarchical memory system that continuously updates only the "deltas"—the differences that matter. Embracing this approach could lead to a paradigm shift, enabling the development of significantly more energy-efficient AI in the future.
Attention works, yes. But it is not naturally plausible at all. We don't do quadratic comparisons across a whole book or need to see thousands of samples to understand.
Personally I think that in the future recursive architectures and test time training will have a better chance long term than current full attention.
Also, I think that OpenAI's biggest contribution is demonstrating that reasoning-like behaviors can emerge from really good language modelling.
In humans, this is known as confabulation, and it happens due to various forms of brain damage, especially with damage to orbitofrontal cortex (part of prefrontal cortex). David Rumelhart, who was the main person who came up with backpropagation in a paper co-authored with Geoff Hinton, actually got Pick's disease which specifically results in damage to prefrontal cortex and people with that disease exhibit a lot of the same problems we have with today's LLMs:
We on the other hand are shaped by billions of years of genetic evolution, and 200k years of cultural evolution. If you count the total number of words spoken by 110 billion people who ever lived, assuming 1B estimated words per human during their lifetime, it comes out to 10 million times the size of GPT-4's training set.
So we spent 10 million more words discovering than it takes the transformer to catch up. GPT-4 used 10 thousand people's worth of language to catch up all that evolutionary finetuning.
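As a sanity check on the arithmetic above, here is a small sketch that uses only the figures quoted in the comment. GPT-4's actual training set size is not stated here, so the value below is merely what those figures imply, and the variable names are mine.

```python
# All three inputs are the comment's own estimates, not measured values.
PEOPLE_EVER_LIVED = 110 * 10**9      # "110 billion people who ever lived"
WORDS_PER_LIFETIME = 10**9           # "1B estimated words per human"
RATIO_TO_TRAINING_SET = 10 * 10**6   # "10 million times the size"

# Total words ever spoken under these assumptions.
total_words = PEOPLE_EVER_LIVED * WORDS_PER_LIFETIME

# Training set size implied by the stated ratio.
implied_training_words = total_words // RATIO_TO_TRAINING_SET

# How many human lifetimes of language that corresponds to.
lifetimes_of_language = implied_training_words // WORDS_PER_LIFETIME

print(f"{total_words:.1e} words spoken in total")
print(f"{implied_training_words:.1e} words implied for the training set")
print(f"{lifetimes_of_language:,} lifetimes of language")
```

Under these numbers the implied training set is about 1.1e13 words, or roughly 11,000 lifetimes, which matches the "10 thousand people's worth" figure to within rounding.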
This assumption points slightly in the wrong direction, because no human could consume much more than about 1B words during their lifetime. So humanity could not gain anything from simply multiplying the words of one human by 100 billion. I think a correct estimate could be 1B words multiplied by 100.
I think current AI has already reached the size needed to become AGI, but to finish we probably need to change the structure (though I'm not sure about this), and we also need some additional multidimensional dataset, not just text.
I might bet on 3D cinema, and/or on an automotive autopilot dataset, or something for real-life humanoid robots solving typical human tasks, like folding a shirt.
Well yes, actually.
So, each architectural advance needs published prototypes solving real-world problems. The smallest I've seen do useful stuff are in the 100M+ to 3B range. There are also papers about testing advances at low pretraining cost: BabyLM; GPT2 replications; MosaicBERT. Some do straight pre-training while others distill field-proven models. Alternative architectures would do well to crank out examples like this to prove themselves.
Please, do build at least one of the above using your method. Post it to your site. Link to demos of the actual prototype in use. This might get an ecosystem going that builds on your ideas.
Remember that scientific method requires us to reject everything by default. Only after rigorous review of a working theory or prototype do we treat it as truth. Build what you want us to believe in. Let us see it smoke the competing models of similar size in key metrics. That will do more for you than anything else.
Again, I hope you're right and I get to see energy-based models being highly competitive. I haven't.
You are nowhere near as smart as you think you are. You are a STEMlord who has never produced any new knowledge who just repeats some platitudes. People doing actual research do not talk like this.
You might benefit from watching this video https://x.com/styx_boatman/status/1811820327552315805
Our work is very much work in progress. I mentioned it because we have a very promising path to scaling EBMs and I wanted to have a convo about it.
If you were actually curious and you actually cared about my claims, you would have asked some concrete followup questions. You responded with the dumbest cliches, so I will ignore your comments.
I can see the inner problem, though, since I was very arrogant. After seeing a miracle, I put my faith in Jesus Christ who died for our sins (even us) and rose again. He turned a cold heart of stone into a warm one of flesh. I no longer feel a need to beat or dominate people online. Even better, I won't burn alive in Hell for it. Even better, He's taught me to serve more humbly.
I believe Christ can help you, too. You can be like the first Adam who led us to sin by selfish choices or like the last Adam who saved us by His self sacrifice. The renewal of the Holy Spirit will cause inner change that permeates your social life, business, everything. You'll be amazed. I pray He also frees you from the slavery of sin, esp arrogance, that once drove my life.
The token approach is inherently flawed because the tokens pre-suppose unique meaning when in fact they may not be unique.
Said another way, it lacks properties that would be able to differentiate true from false because the differentiating input isn't included and cannot be derived from the inputs given. This goes to decidability.
Which is just to say, it feels to me like there's a danger that the stochastic nature of outputs is fundamental to true creative intelligence and all attempts to stamp it out will result in lower accuracy overall. Rather we should be treating it more like we do actual humans and expect errors and put layers of process around things where it matters to make them safe.
Top-end LLMs write better and faster than most humans.
Top-end stable diffusion models can draw and render video much faster and with much more precision than the best human artists.
The main problem with the current approach is that to grow abilities you need to add more neurons, and that consumes not just energy but also knowledge. I mean that at the GPT-4 level all of humanity's text sources are already exhausted and the model becomes essentially overfitted. So it looks like multi-modal models appeared not because they are so good, but because they can learn from additional sources (audio/video).
I have seen a few approaches to overcoming the overfitting problem, but as I understand it no universal solution exists.
For example, people have tried creating synthetic training data from existing texts, but this idea is limited by definition.
So current LLMs appear to have hit a dead end, and researchers are now trying to find a way out of it. I believe that in the coming years somebody will invent some universal solution (probably a combination of approaches) or suggest another architecture, and AI progress will continue.
There’s no real rule worthy of any respect imho that LLMs can’t be configured to get additional input data from images, audio, proprioception sensors, and any other modes. I can easily write a script to convert such data into tokens in any number of ways that would allow them to be fed in as tokens of a “language.” Convolutions for example. A real expert could do it even more easily or do a better job. And then don’t LeCun’s objections just evaporate? I don’t see why he thinks he has some profound point. For gods sake our own senses are heavily attenuated and mediated and it’s not like we actually experience raw reality ourselves, ever; we just feel like we do. LLMs can be extended to be situated. So much can be done. It’s like he’s seeing http in 1993 and saying it won’t be enough for the full web… well duh, but it’s a great start. Now go build on it.
If anything the flaw in LLMs is how they maintain only one primary thread of prediction. But this is changing; having a bunch of threads working on the same problem and checking each other from different angles of the problem will be an obvious fix for a lot of issues.
If he was Hinton's age then maybe he would also want to retire and be happy with transformers and LLMs. He is still an ambitious researcher that wants to do foundational research to get to the next paradigm.
Having said all of that, it is a misjudgement for him to be disparaging the incredible capabilities of LLMs to the degree he has.
Jeez, you'd think he kicked your dog.
Fire up Emacs and open a text file containing a lot of human-readable text. Something off Project Gutenberg, say. Then say M-x dissociated-press and watch it spew hilarious, quasi-linguistic garbage into a buffer for as long as you like.
Dissociated Press is a language model. A primitive, stone-knives-and-bearskins language model, but a language model nevertheless. When you feed it input text, it builds up a statistical model based on a Markov chain, assigning probabilities to each character that might occur next, given a few characters of input. If it sees 't' and 'h' as input, the most likely next character is probably going to be 'e', followed by maybe 'a', 'i', and 'o'. 'r' might find its way in there, but 'z' is right out. And so forth. It then uses that model to generate output text by picking characters at random given the past n input characters, resulting in a firehose of things that might be words or fragments of words, but don't make much sense overall.
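Here is a rough sketch of the character-level Markov chain described above. It illustrates the idea only and is not Emacs's actual dissociated-press implementation; the function names and the sample text are made up for the example.

```python
import random
from collections import Counter, defaultdict

def build_model(text, order=2):
    """Count which character follows each run of `order` characters."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def generate(model, seed, length=80, rng=None):
    """Extend `seed` by sampling each next character in proportion to its count."""
    rng = rng or random.Random()
    order = len(seed)
    out = seed
    for _ in range(length):
        followers = model.get(out[-order:])
        if not followers:  # this context never appeared in the input text
            break
        chars, counts = zip(*followers.items())
        out += rng.choices(chars, weights=counts, k=1)[0]
    return out

sample_text = "the theory that the other thing is there, then the thaw"
model = build_model(sample_text, order=2)

# After 't' and 'h', 'e' dominates; 'z' never shows up at all.
print(model["th"].most_common())
generated = generate(model, "th", length=60, rng=random.Random(0))
print(generated)
```

With a whole book as input and a slightly larger `order`, the output starts to look like word fragments and occasional real words, which is the "quasi-linguistic garbage" effect.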
LLMs are doing the same thing. They're picking the next token (word or word fragment) given a certain number of previous tokens. And that's ALL they're doing. The only differences are really matters of scale: the tokens are larger than single characters, the model considers many, many more tokens of input, and the model is a huge deep-learning model with oodles more parameters than a simple Markov chain. So while Dissociated Press churns out obvious nonsensical slop, ChatGPT churns out much, much more plausible sounding nonsensical slop. But it's still just rolling them dice over and over and choosing from among the top candidates of "most plausible sounding next token" according to its actuarial tables. It doesn't think. Any thinking it appears to do has been pre-done by humans, whose thoughts are then harvested off the internet and used to perform macrodata refinement on the statistical model. Accordingly, if you ask ChatGPT a question, it may well be right a lot of the time. But when it's wrong, it doesn't know it's wrong, and it doesn't know what to do to make things right. Because it's just reaching into a bag of refrigerator magnet poetry tiles, weighted by probability of sounding good given the current context, and slapping whatever it finds onto the refrigerator. Over and over.
What I think Yann LeCun means by "energy" above is "implausibility". That is, the LLM would instead grab a fistful of tiles -- enough to form many different responses -- and from those start with a single response and then through gradient descent or something optimize it to minimize some statistical "bullshit function" for the entire response, rather than just choosing one of the most plausible single tiles each go. Even that may not fix the hallucination issue, but it may produce results with fewer obvious howlers.
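A toy sketch of that contrast, with a made-up cost table standing in for the "implausibility" of each step. Nothing here is LeCun's actual proposal; the tokens, the costs, and the exhaustive search are all assumptions chosen to keep the example tiny.

```python
import itertools

# Hypothetical per-step implausibility costs; lower means more plausible.
# Keys are (previous_token, token).
COST = {
    ("<s>", "A"): 1.0, ("<s>", "B"): 2.0,
    ("A", "C"): 5.0, ("A", "D"): 6.0,
    ("B", "C"): 1.0, ("B", "D"): 1.5,
}
FIRST = ["A", "B"]
SECOND = ["C", "D"]

def total_energy(seq):
    """Sum the step costs over the whole response."""
    prev, total = "<s>", 0.0
    for tok in seq:
        total += COST[(prev, tok)]
        prev = tok
    return total

def greedy():
    """Pick the most plausible single token at each step."""
    first = min(FIRST, key=lambda t: COST[("<s>", t)])
    second = min(SECOND, key=lambda t: COST[(first, t)])
    return (first, second)

def whole_response_minimum():
    """Score every complete response and keep the lowest-energy one."""
    return min(itertools.product(FIRST, SECOND), key=total_energy)

print(greedy(), total_energy(greedy()))
print(whole_response_minimum(), total_energy(whole_response_minimum()))
```

The greedy path commits to "A" because it looks best on the first step, then pays for it on the second; scoring the whole response finds the cheaper "B" path. Exhaustive search obviously does not scale, which is where something like gradient descent over a candidate response would have to come in.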
But there's a fundamental difference between Markov chains and transformers that should be noted. Markov chains only learn how likely it is for one token to follow another. Transformers learn how likely it is for a set of tokens to be seen together. Transformers add a wider context to the Markov chain. That quantitative change leads to a qualitative improvement: transformers generate text that is semantically plausible.
A little bit of engineering and fine tuning - you could imagine a model producing a sequence of statements, and reflecting on the sequence - updating things like "statement 7, modify: xzy to xyz"
Not an ML researcher, so I can't explain it. But I get a pretty clear sense that it's an inherent problem and don't see how it could be trained away.
For AI to really replace most workers like some people would like to see, there are plenty of situations where hallucinations are a complete no-go and need fixing.