Sometimes a sufficiently good model of a surface is completely identical to a model of the volume.
This is a false dichotomy. Functionally the reality is in the middle. LLMs "memorize" training data in the sense that the model is fit to these points, but at test time they are asked to interpolate (and extrapolate) to new points. How well they generalize depends on how well an interpolation between training points works. If it reliably works then you could say that interpolation is a good approximation of some grammar rule, say. It's all about the data.
Since an LLM does not change in response to changes in the meaning of terms (e.g., consider how the meaning of "the war in Ukraine" has changed over the last 10 years), it isn't reliable in the scientific sense. Explaining why it isn't valid would take much longer, but it's not valid either.
In any case: the notion of 'generalisation' used in ML just means we assume there is a single stationary distribution of words, and we want to randomly sample from that distribution without bias to oversampling from points identical to the data.
Not only is this assumption false (there is no stationary distribution), it is also irrelevant to generalisation in the traditional sense, since whether we are biased towards the data or not isn't what we're interested in. We want output to be valid (the system uses words to mean what they mean) and reliable (it does so across all environments in which they mean something).
This does not follow from, nor is it even related to, this ML sense of generalisation. Indeed, if LLMs generalised in this sense, they would be very bad at usefully generalising -- since the assumptions here are false.
Intra-distribution generalization seems like the only rigorously defined kind of generalization we have. Can you provide any references that describe this other kind of generalization? I'd love to learn more.
1. In practical scenarios, how do you know if x' is really drawn from p(x)? Even if you could compute log p(x') under the true data distribution, you can only verify that x' lies within the support (i.e., that p(x') is non-zero). One sample is not enough to tell you whether x' was drawn from p(x).
2. In high-dimensional settings, an x' that is not exactly equal to an example in the training set can have arbitrarily high generalization error. Here's a criminally under-cited paper discussing this: https://arxiv.org/abs/1801.02774
What we mean by x ~ p(x), y ~ p(y|x) is not x -> y s.t. y = f(x)
Reality itself has no probability distributions. Reality follows a causal model, where a causal relation is given in terms of necessity and possibility.
E.g., there is no such thing as Photo ~ P(Photo|PhotoOfCat) to be learned, only (All Causes) -> PhotoOfCat. Thus the setup of ML as y = f(x) is incorrect: there is no `f` which satisfies this formula (in almost all cases).
Consider the LLM case: reality has no P("The War in Ukraine"| TheWarIn2022) -- either the speaker meant TheWarIn2022, or they didn't. There's no sense in which reality has it that the utterance is intrinsically ambiguous (necessarily, for communication to be possible, pragmatics+semantics has to be able to fully resolve meaning).
So what are LLMs learning? Just an implied empirical distribution which is "smoothed over" the data just enough that it "hangs on to it, without repeating it" -- and this is vital, since if it were to try to generalise in the scientific sense, it would cease to be meaningful, since no algorithm which computes P(y|x) in this manner could capture the necessary relata that fully resolves meaning. Any system capable of modelling meaning would be probabilistic only in the sense of having a prior over such causal models: P("TheWarInUkraine"|TheWarIn2022, CausalModel) = 1, but P(CausalModel) < 1
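One way to write down that last sentence, as a sketch: let u be the utterance, w the intended referent and c a candidate causal model, with c* the correct one (these symbols are shorthand introduced here, not notation from the comment). The first identity is just the law of total probability. The claim is that all the uncertainty sits in the prior over causal models, with none left in the conditional once the right model is fixed:

```latex
P(u \mid w) \;=\; \sum_{c} P(u \mid w, c)\, P(c \mid w),
\qquad
P(u \mid w, c^{*}) = 1,
\qquad
P(c^{*}) < 1 .
```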
So it's always undefined what it means to "generalise" with respect to an empirical distribution -- there aren't any.
When we say scientific theories generalise, we mean their posited necessary causal relations are maintained across irrelevant interventions. E.g., Newton's theory of gravity generalises in that each term (F, M, m, r) is a valid measure of some property, and it remains a valid measure across a very large number of environments.
It fails to generalise for extreme values of M, m, etc.
In the ML sense, all intra-distributional generalisation fails for trivial perturbations of any causal property, e.g., m+dm -- because this induces an entirely new distribution. The "generalisation error" depends on what m+dm does within our model, but regardless, generalisation fails.
Scientific theories do not fail to generalise in this way: irrelevant causal interventions make no difference to the explanatory adequacy (or predictive power) of the theory.
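A toy sketch of the m+dm point, under loud assumptions: the "measurements" are generated from Newton's law itself, so the law's zero error is true by construction, and the nearest-neighbour table is a deliberately crude stand-in for a model fit to data (the names `surface_model`, `TRAIN_MASS` etc. are made up here). The only thing it shows is that a model which never represents m does fine on held-out radii and then breaks when the mass is perturbed:

```python
G = 6.674e-11        # gravitational constant, N m^2 / kg^2
M_EARTH = 5.972e24   # kg

def newton_force(M, m, r):
    """Newton's law of gravitation: every causal term is an explicit argument."""
    return G * M * m / r**2

# Training data: force vs distance, all measured with a single test mass.
TRAIN_MASS = 1.0
train_radii = [6.4e6 + 1.0e5 * k for k in range(50)]
table = [(r, newton_force(M_EARTH, TRAIN_MASS, r)) for r in train_radii]

def surface_model(r):
    """Nearest-neighbour lookup over the training measurements; mass never appears."""
    return min(table, key=lambda row: abs(row[0] - r))[1]

def mean_relative_error(predict, m, radii):
    """Average relative error of predict(r) against the force on a mass m."""
    errs = [abs(predict(r) - newton_force(M_EARTH, m, r)) / newton_force(M_EARTH, m, r)
            for r in radii]
    return sum(errs) / len(errs)

# Held-out radii that fall between the training radii.
test_radii = [r + 5.0e4 for r in train_radii[:-1]]

in_dist_error = mean_relative_error(surface_model, TRAIN_MASS, test_radii)
shifted_error = mean_relative_error(surface_model, 2 * TRAIN_MASS, test_radii)
law_error = mean_relative_error(
    lambda r: newton_force(M_EARTH, 2 * TRAIN_MASS, r), 2 * TRAIN_MASS, test_radii)

print(in_dist_error, shifted_error, law_error)
```

Whether this says anything about LLMs is exactly what the thread is arguing about; the sketch only makes the distinction concrete.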
If you were only modelling conditional probability, trying to model meaning this way would make your solution worse.
I.e., if LLMs really generalised in the ML sense, that is, sampled randomly and without bias from some hypothetical "Meaning Distribution", they'd perform terribly -- since there is no such distribution to choose from.
By hijacking an empirical distribution and "replaying it back", it's actually possible to generate useful output.
Think about it this way: probability distributions are just measures of subjective confidence. Each person has their own subjective confidence distribution P("some written words"|WhatTheyMean). If you could actually model this -- which one would you model? If you modelled any of them, you'd not be able to understand a great deal, since each person's confidence is poorly calibrated and missing meanings (e.g., "acetylcholine").
So the LLM models some half-baked average of the subjective distributions of all speakers on the internet (/ in the training data) with respect to next word expectations.
This is not what we're modelling when we mean things (eg., when I say, "pass the pen", the cause of my saying it is: 1) need for a pen; 2) you having a pen; etc. -- these reasons are unavailable to the LLM, so it cannot model meaning). But as stated, it would be useless if it actually tried to -- because these methods are incapable of saying, "pass me a pen" and meaning it.
Do Large Language Models learn world models or just surface statistics? - https://news.ycombinator.com/item?id=34474043 - Jan 2023 (174 comments)
Copilot fails to cleanly refactor complex Java methods, to the point that I'm better off writing that stuff on my own, as I have to understand it anyway.
And the news that they don't scale as predicted is bad, given how weakly they currently perform…
Personally, I use them for the things they can do, and for the things they can't, I just don't, exactly as I would for any other tool.
People assuming they can do more than they are actually capable of is a problem (compounded by our tendency to attribute intelligence to entities with eloquent language, which might be more of a surface level thing than we used to believe), but that's literally been one for as long as we had proverbial hammers and nails.
If
((time to craft the prompt) + (time required to fix LLM output)) ~ (time to achieve the task on my own)
it's not hard to see that working on my own is a very attractive proposition. It drives down complexity, does not require me to acquire new skills (i.e., prompt engineering), does not require me to provide data to a third party nor to set up an expensive rig to run a model locally, etc.
I'm just a little bit tired of sweeping generalizations like "LLMs are completely broken". You can easily use them as a tool within a process that then ends up being broken (because it's the wrong tool!), yet that doesn't disqualify them for all tool use.
My point being: Why would anyone have to find a use for a new tool? Why wouldn't "it doesn't help me with what I'm trying to do" be an acceptable answer in many cases?
Surface statistics based interfaces have an internal database of what is expected, and when asked, they give a conformist output.
Not in the sense used in the article: «memorizing “surface statistics”, i.e., a long list of correlations that do not reflect a causal model of the process generating the sequence».
A very basic example: when asked "two plus two", would the interface reply "four" because it memorized a correlation of the two ideas, or because it counted at some point (many points in its development) and in that way assessed reality? That is a dramatic difference.
so humans don't typically have world models then. you ask most people how they arrived at their conclusions (outside of very technical fields) and they will confabulate just like an LLM.
the best example is phenomenology, where people will grant themselves skills that they don't have, to reach conclusions. see also heterophenomenology, aimed at working around that: https://en.wikipedia.org/wiki/Heterophenomenology
That random people will largely have suboptimal skills should not be a surprise.
Yes, many people can't think properly. Proper thinking remains there as a potential.
that's a matter of faith, not evidence. by that reasoning, the same can be said about LLMs. after all, they do occasionally get it right.
To transpose that to LLMs, you should present one that systematically gets it right, not occasionally.
And anyway, the point was about two different processes before statement formulation: some output the strongest correlated idea ("2+2" → "4"); some look at the internal model and check its contents ("2, 2" → "1 and 1, 1 and 1: 4").
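A toy sketch of the two processes, purely illustrative: the table contents and function names are made up for this example, and nothing here claims to describe what an actual LLM does internally. One function returns the memorised association, the other counts:

```python
# "Strongest correlated idea": a memorised table of prompt -> answer pairs.
MEMORISED = {"2+2": "4", "1+1": "2", "3+5": "8"}

def lookup_answer(prompt):
    """Return the memorised association, or "?" if the prompt was never seen."""
    return MEMORISED.get(prompt, "?")

def counting_answer(prompt):
    """Consult an internal model: count out each operand, one unit at a time."""
    left, right = prompt.split("+")
    total = 0
    for _ in range(int(left)):
        total += 1
    for _ in range(int(right)):
        total += 1
    return str(total)

for prompt in ["2+2", "17+26"]:
    print(prompt, lookup_answer(prompt), counting_answer(prompt))
```

Both agree on "2+2"; only the counting version has anything to say about a sum it has not seen before.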
could Einstein systematically get new symphonies right? could Feynman create tasty new dishes every single time? Could ......
Did (could) Einstein think about things long and hard? Yes - that is how he explained having solved problems ("How did you do it?" // "I thought about it long and hard").
The artificial system in question should (1) be able to do it, and (2) do it systematically, because it is artificial.
To utter conformist statements spawned from surface statistics would be "doxa" - repeating "opinions".
LLMs are very firmly stuck inside the Cave Allegory.
But we could argue it might not be impossible to create an ontology (a very descriptive ontology - "this is said to be that, and that, and that...") from language alone. Hence the question whether the ontology is there. (Actually, the question at this stage remains: "How do they work - in sufficient detail? Why the appearance of some understanding?")
It's just that it's a kind of a useless ontology, because the reality it's describing is language. Well, only "kind of useless" because it should be very useful to parse, synthesize and transform language. But it doesn't have the kind of "knowledge" that most people expect an intelligence to have.
Also, its world isn't only composed of words. All of them got a very strong "Am I fooling somebody?" signal during training.
They don't do science or causality; they're just working with the shadows on the wall, not the actual objects casting them. So yeah, they're impressive, but let's not overhype what they're doing. It's pattern matching at scale, not magic. Correct me if I am wrong.
It's similar to asking a model to only produce outputs corresponding to a regular expression, given a very large number of inputs that match that regular expression. The RE is the most compact representation that matches them all and it can figure this out.
But we aren't building a "world model", we're building a model of the training data. In artificial problems with simple rules, the model might be essentially perfect, never producing an invalid Othello move, because the problem is so limited.
I'd be cautious about generalizing from this work to a more open-ended situation.
The point is that they aren't directly training the model to output the grid state, like you would an autoencoder. It's trained to predict the next action and learning the state of the 'world' happens incidentally.
It's like how LLMs learn to build world models without directly being trained to do so, just in order to predict the next token.
And as I said in my original comment they are probably not even able to extract the board state very well, otherwise they would depict some kind of direct representation of the state, not all of the other figures of board move causality etc.
Note also that the board state is not directly encoded in the neural network: they train another neural network to find weights to approximate the board state if given the internal weights of the Othello network. It's a bit of fishing for the answer you want.
They do measure and report on this, both in summary in the blog post and in more detail in the paper.
> otherwise they would depict some kind of direct representation of the state
If you can perfectly accurately extract the state the result would be pretty boring to show right? It'd just be a picture of a board state and next to it the same board state with "these are the same".
> Note also that the board state is not directly encoded in the neural network: they train another neural network to find weights to approximate the board state if given the internal weights of the Othello network.
If you can extract them, they are encoded in the activations. That's pretty much by definition surely.
> It's a bit of fishing for the answer you want.
How so?
Given a sequence of moves, they can accurately identify which state most of the positions of the board are in just by looking at the network. In order for that to work, the network must be turning a sequence of moves into some representation of a current board state. Assume for the moment they can accurately identify them: do you agree with that conclusion?
That's the whole point under contention, but you're stating it as fact.
I'm very confused by this, because they do. Then they manipulate the internal board state and see what move it makes. That's the entire point of the paper. Figure 4 is literally displaying the reconstructed board state.
This is literally figure 4
This also re-constructs the board state of a chess-playing LLM
https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
I haven’t read the paper in some time so it’s possible I’m forgetting something but I don’t think so.
The only other issue you raised doesn't make any sense. A world model is a representation/model of your environment you use for predictions. Yes, an auto-encoder learns to model that data to some degree. To what degree is not well known. If we found out that it learned things like 'city x in country a is approximately distance b from city y, so let's just learn where y is and unpack everything else when the need arises', then that would certainly qualify as a world model.
Besides that and the big red flag of not directly analyzing the performance of the predicted board state, I also said training a neural network to return a specific result is fishy, but that is a more minor point than the other two.
>the big red flag of not directly analyzing the performance of the predicted board state I also said training a neural network to return a specific result is fishy
The idea that probes are some red flag is ridiculous. There are some things to take into account but statistics is not magic. There's nothing fishy about training probes to inspect a model's internals. If the internals don't represent the state of the board then the probe won't be able to learn to reconstruct the state of the board. The probe only has access to internals. You can't squeeze blood out of a rock.
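A toy sketch of that argument, with everything in it an assumption made for illustration: the "network" is just a fixed random linear map from an 8-cell board to 64 activations, the control "network" emits noise unrelated to the board, and the probe is a simple difference-of-means linear classifier (not the probes used in the Othello paper). The probe is fit on one set of boards and scored on held-out ones:

```python
import random

rng = random.Random(0)
N_CELLS, N_UNITS, TARGET = 8, 64, 3   # toy sizes; TARGET is the probed cell

# Hypothetical "network": a fixed random linear map from board state to activations.
W = [[rng.gauss(0, 1) for _ in range(N_CELLS)] for _ in range(N_UNITS)]

def random_board():
    return [rng.randint(0, 1) for _ in range(N_CELLS)]

def encoding_activations(board):
    """Activations that do carry the board state (linearly mixed)."""
    return [sum(w * b for w, b in zip(row, board)) for row in W]

def unrelated_activations(board):
    """Control: activations that ignore the board entirely."""
    return [rng.gauss(0, 1) for _ in range(N_UNITS)]

def fit_probe(acts, labels):
    """Difference-of-means linear probe: returns (direction, threshold)."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    mu1 = mean([a for a, y in zip(acts, labels) if y == 1])
    mu0 = mean([a for a, y in zip(acts, labels) if y == 0])
    direction = [p - q for p, q in zip(mu1, mu0)]
    midpoint = [(p + q) / 2 for p, q in zip(mu1, mu0)]
    threshold = sum(d * c for d, c in zip(direction, midpoint))
    return direction, threshold

def probe_accuracy(activation_fn, n_train=2000, n_test=500):
    """Fit the probe on n_train boards, report accuracy on n_test held-out boards."""
    boards = [random_board() for _ in range(n_train + n_test)]
    acts = [activation_fn(b) for b in boards]
    labels = [b[TARGET] for b in boards]
    direction, threshold = fit_probe(acts[:n_train], labels[:n_train])
    hits = 0
    for a, y in zip(acts[n_train:], labels[n_train:]):
        pred = 1 if sum(d * v for d, v in zip(direction, a)) > threshold else 0
        hits += (pred == y)
    return hits / n_test

encoded_acc = probe_accuracy(encoding_activations)
control_acc = probe_accuracy(unrelated_activations)
print(encoded_acc, control_acc)
```

In this toy setup the probe should decode the cell well from the encoding activations and sit near chance on the control. It doesn't settle the dispute about what counts as a world model, only the narrower point that a probe scored on held-out data can't recover information that isn't in the activations.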
In this case specifically “the degree” is pretty low since predicting moves is very close to predicting board state (because for one you have to assign zero probability to moves to occupied positions). That’s even if you accept that world models are just states, which as mtburgess explained is not reasonable.
Further if you read what I wrote I didn’t say internal probes are a big red flag (I explicitly called it the minor problem). I said not directly evaluating how well the putative internal state matches the actual state is. And you can “squeeze blood out of a rock”: it’s the multiple comparison problem and it happens in science all the time and it is what you are doing by training a neural network and fishing for the answer you want to see. This is a very basic problem in statistics and has nothing to do with “magic”. But again all this is the minor problem.
The depth/degree or whatever is not about what is close to the problem space. The blog above spells out the distinction between a 'world model' and 'surface statistics'. The point is that Othello GPT is not in fact playing Othello by 'memorizing a long list of correlations' but by modelling the rules and states of Othello and using that model to make a good prediction of the next move.
>I said not directly evaluating how well the putative internal state matches the actual state is.
This is evaluated in the actual paper with the error rates using the linear and non linear probes. It's not a red flag that a precursor blog wouldn't have such things.
>And you can “squeeze blood out of a rock”: it’s the multiple comparison problem and it happens in science all the time and it is what you are doing by training a neural network and fishing for the answer you want to see.
The multiple comparison problem is only a problem when you're trying to run multiple tests on the same sample. Obviously don't test your probe on states you fed it during training and you're good.
If you give a universal function approximator the task of approximating an abstract function, you will get an approximation.
Eg.,
def circle(radius): ... return points()
approx_circle = neuralnetwork(sample(circle()))
if is_model_of(samples(approx_circle), circle): print("OF COURSE!")
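For anyone who wants to run it, a minimal sketch of the pseudocode above. The neural network is swapped for a much simpler stand-in (linear interpolation between neighbouring samples), and `fit_interpolator` and the tolerance in `is_model_of` are choices made here, not anything from the comment:

```python
import math

def circle(radius, n):
    """Sample n evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def fit_interpolator(points):
    """Stand-in for a learned approximator: interpolate between neighbouring samples."""
    n = len(points)
    def model(t):
        # t in [0, 1) is the fraction of a full turn around the circle
        pos = (t % 1.0) * n
        i = int(pos) % n
        frac = pos - int(pos)
        (x0, y0), (x1, y1) = points[i], points[(i + 1) % n]
        return (x0 + frac * (x1 - x0), y0 + frac * (y1 - y0))
    return model

def is_model_of(model, radius, tol=1e-2, checks=1000):
    """True if every sampled output of the model lies within tol of the circle."""
    return all(abs(math.hypot(*model(k / checks)) - radius) < tol
               for k in range(checks))

approx_circle = fit_interpolator(circle(radius=1.0, n=200))
if is_model_of(approx_circle, radius=1.0):
    print("OF COURSE!")
```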
This is irrelevant: games, rules, shapes, etc. are all abstract, so any model of samples of these is a model of them. The "world model" in question is a model of the world. Here "data" is not computer science data, i.e., numbers; it's measurements of the world, i.e., the state of a measuring device causally induced by the target of measurement.
Here there is no "world" in the data, you have to make strong causal assumptions about what properties of the target cause the measures. This is not in the data. There is no "world model" in measurement data. Hence the entirety of experimental science.
No result based on one mathematical function succeeding in approximating another is relevant to whether measurement data "contains" a theory of the world which generates it: it does not. And of course if your data is abstract, and hence constitutes the target of modelling (only applies to pure math), then there is no gap -- a model of "measures" (i.e., the points on a circle) is the target.
No model of actual measurement data, ie., no model in the whole family we call "machine learning", is a model of its generating process. It contains no "world model".
Photographs of the night sky are compatible with all theories of the solar system in human history (including, eg., stars are angels). There is no summary of these photographs which gives information about the world over and above just summarising patterns in the night sky.
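A toy sketch of the underdetermination being claimed, under strong simplifying assumptions (two bodies, circular motion, a flat plane, and the only "measurement" is the Sun's direction as seen from Earth; the function names are made up). Two different generating processes produce the same observations:

```python
import math

def heliocentric_view(t):
    """Sun fixed at the origin, Earth on a unit circle; Sun's position relative to Earth."""
    earth = (math.cos(t), math.sin(t))
    sun = (0.0, 0.0)
    return (sun[0] - earth[0], sun[1] - earth[1])

def geocentric_view(t):
    """Earth fixed at the origin, Sun on a unit circle; Sun's position relative to Earth."""
    earth = (0.0, 0.0)
    sun = (math.cos(t + math.pi), math.sin(t + math.pi))
    return (sun[0] - earth[0], sun[1] - earth[1])

# One observation per "day" over a full cycle.
times = [2 * math.pi * k / 365 for k in range(365)]
max_gap = max(math.dist(heliocentric_view(t), geocentric_view(t)) for t in times)
print(max_gap)
```

This only covers the two-body case. Richer observations can and do discriminate between theories, which is the objection raised further down the thread.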
The sense in which any model of measurement data is "surface statistics" is the same. Consider Plato's cave: pots, swords, etc. on the outside project shadows inside. Modelling the measurement data is taking cardboard and cutting it out so it matches the shadows. Modelling the world means creating clay pots to match the ones passing by.
The latter is science: you build models of the world and compare them to data, using the data to decide between them.
The former is engineering (or pseudoscience): you take models of measures and replay these models to "predict" the next shadow.
If you claim such a model of measures is just a "surface shortcut", you're an engineer. If you claim it's a world model, you're a pseudoscientist.
You're stating this as fact but it seems to be the very hypothesis the authors (and related papers) are exploring. To my mind, the OthelloGPT papers are plainly evidence against what you've written - summarising patterns in the sky really does seem to give you information about the world above and beyond the patterns themselves.
(to a scientist this is obvious, no? the precession of Mercury, a pattern observable in these photographs, was famously not compatible with known theories until fairly recently)
> Modelling the measurement data is taking cardboard and cutting it out so it matches the shadows. Modelling the world means creating clay pots to match the ones passing by.
I think these are matters of degree. The former is simply a worse model than the latter of the "reality" in this case. Note that our human impressions of what a pot "is" are shadows too, on a higher-dimensional stage, and from a deeper viewpoint any pot we build to "match" reality will likely be just as flawed. Turtles all the way down.
It is exactly this non-sequitur which I'm pointing out.
Approximating an abstract discrete function (a game), with a function approximator has literally nothing to do with whether you can infer the causal properties of the data generating process from measurement data.
To equate the two is just rank pseudoscience. The world is not made of measurements. Summaries of measurement data aren't properties in the world, they're just the state of the measuring device.
If you sample all game states from a game, you define the game. This is the nature of abstract mathematical objects, they are defined by their "data".
Actual physical objects are not defined by how we measure them: the solar system isn't made of photographs. This is astrology: to attribute to the patterns of light hitting the eye some actual physical property in the universe which corresponds to those patterns. No such exists.
It is impossible, and always has been, to treat patterns in measurements as properties of objects. This is maybe one of the most prominent characteristics of pseudoscience.
Yes, the one is formally derivable from the other, but the reduction costs compute, and to a fixed epsilon of accuracy this is the situation with everything we interact with on the day to day.
The idea that you can learn underlying mechanics from observation and refutation is central to formal models of inductive reasoning like Solomonoff induction (and idealised reasoners like AIXI, if you want the AI spin). At best this is well established scientific method, at worst a pretty decent epistemology.
Talking about sampling all of the game states is irrelevant here; that wouldn't be possible even in principle for many games and in this case they certainly didn't train the LLM on every possible Othello position.
> This is astrology: to attribute to the patterns of light hitting the eye some actual physical property in the universe which corresponds to those patterns. No such exists.
Of course not - but they are highly correlated in functional human beings. What do you think our perception of the world grounds out in, if not something like the discrepancies between (our brain's) observed data and its predictions? There's even evidence in neuroscience that this is literally what certain neuronal circuits in the cortex are doing (the hypothesis being that so-called "predictive processing" is more energy efficient than alternative architectures).
Patterns in measurements absolutely reflect properties of the objects being measured, for the simple reason that the measurements are causally linked to the object itself in controlled ways. To think otherwise is frankly insane - this is why we call them measurements, and not noise.
The "Ladder of Causation" proposed by Judea Pearl covers similar ground - "Rung 1" reasoning is the purely predictive work of ML models, "Rung 2" is the interactive optimization of reinforcement learning, and "Rung 3" is the counterfactual and causal reasoning / DGP construction and work of science. LLMs can parrot Rung 3 understanding from ingested texts but they can't generate it.
That's wrong. Whatever your measuring device, it is fundamentally a projection of some underlying reality, eg. a function m in m(r(x)) mapping real values to real values, where r is the function governing reality.
As you've acknowledged that neural networks can learn functions, the neural network here is learning m(r(x)). Clearly the world is in the model here, and if m is invertible, then we can directly extract r.
Yes, the domain of x and range of m(r(x)) is limited, so the inference will be limited for any given dataset, but it's wrong to say the world is not there at all.
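A toy sketch of what the invertibility condition buys you. All the functions here are invented for illustration: two candidate "realities" r, one invertible measuring device m and one lossy one. With the invertible device r can be read back out of m(r(x)); with the lossy one the two realities produce identical measurements:

```python
def reality_a(x):
    return x

def reality_b(x):
    return -x

def invertible_sensor(v):
    """m(v) = 2v + 1: different values always give different readings."""
    return 2.0 * v + 1.0

def invertible_sensor_inverse(s):
    return (s - 1.0) / 2.0

def lossy_sensor(v):
    """m(v) = v^2: the sign of v is lost in the reading."""
    return v * v

xs = [k / 10 for k in range(-20, 21)]

# Invertible m: applying the inverse to the measurements recovers r(x).
recovered = [invertible_sensor_inverse(invertible_sensor(reality_a(x))) for x in xs]
recovers_reality = all(abs(rec - reality_a(x)) < 1e-12 for rec, x in zip(recovered, xs))

# Invertible m tells the two realities apart; lossy m cannot.
distinguishable = any(invertible_sensor(reality_a(x)) != invertible_sensor(reality_b(x))
                      for x in xs)
indistinguishable = all(lossy_sensor(reality_a(x)) == lossy_sensor(reality_b(x))
                        for x in xs)

print(recovers_reality, distinguishable, indistinguishable)
```

Which of the two cases real sensors resemble is the open question; the sketch takes no side on that.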
For animals, we are born with primitive causal models of our bodies we can recurse on to build models of the world in this sense. So as toddlers we learn perception by having an internal 3d model of our bodies -- so we can ascribe distances to our optical measures.
Without such assumptions there really isn't any world at all in this data. A grid of pixel patterns has no meaning as a grid of numbers. NNs are just mapping this grid to a "summary space" under supervision of how to place the points. This supervision enables a useful encoding of the data, but does not provide the kind of assumptions needed to work backwards to properties of its generation.
In the case of photos, there is no such `m` -- the state of a sensor is not uniquely caused by any catness or dogness properties. Almost no photographs acquire their state from a function X -> Y, because the sensor state is "radically uncontrolled" in a causal sense. Thus the common premise of ML, that y = f(x), is false from the start -- the relevant causal graph has a near infinite number of causes that are unspecified, so f does not exist.
In the example, the 'world' is the grid state. Obviously that's much simpler than the real world but the point is to show that even when the model is not directly trained to input/output this world state it is still learned as a side effect of predicting the next token.
The whole debate is about whether surface patterns in measurement data can be reversed by NNs to describe their generating process, i.e., the world. If the "data" isn't actual measurements of the world, no one is arguing about it.
If there is no gap between the generating algorithm and the samples, e.g., between a "circle" and "the points on a circle" -- then there is no "world model" to learn. The world is the data. To learn "the points on a circle" is to learn the circle.
By taking cases where "the world" and "the data" are the same object (in the limit of all samples), you're just showing that NNs model data. That's already obvious, no one's arguing about it.
That a NN can approximate a discrete function does not mean it can do science.
The whole issue is that the cause of pixel distributions is not in those distributions. A model of pixel patterns is just a model of pixel patterns, not of the objects which cause those patterns. A TV is not made out of pixels.
The "debate" insofar as there is one, is just some researchers being profoundly confused about what measurement data is: measurements are not their targets, and so no model of data is a model of the target. A model of data is just "surface statistics" in the sense that these statistics describe measurements, not what caused them.
This is blatantly incorrect. Keep in mind that much of modern physics has been invented via observation. Kepler's laws, and ultimately the law of gravitation and General Relativity, came from these "photographs" of the night sky.
If you are talking about the fact that these theories only ever summarize what we see and maybe there's something else behind the scenes that's going on, then this becomes a different discussion.