Checking one token at a time means evaluating whether the text so far is on its way to a correct final answer. The intermediate text can be whatever it needs to be to arrive at that answer, but training at the per-token level means training exactly those tokens you wanted to give the model the leeway to consider. So it needs another model to adjudicate how well things are going from incomplete answers.
I'm not sure how much the adjudicator evaluates based on knowing the final answer versus based on the quality of the reasoning of the model being trained. I'd be inclined to train two adjudicators: one that knows the answers and one that doesn't. I'm sure there would be interesting things to see in their differential signal.
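To make that concrete, here is a toy sketch of the two-adjudicator idea. The judges below are stand-in functions with invented scoring rules, not trained models; in practice each adjudicator would be a learned reward/value model scoring partial completions.

```python
# Toy sketch: score a partial answer with a judge that knows the gold answer
# and a judge that only sees the text, then look at the gap between them.
# Both judges are stand-ins for trained reward/value models.

def judge_with_answer(partial: str, gold: str) -> float:
    # Stand-in: reward convergence on the known final answer.
    return 1.0 if gold in partial else 0.0

def judge_blind(partial: str) -> float:
    # Stand-in: reward superficial signs of careful reasoning (length here).
    return min(len(partial.split()) / 50.0, 1.0)

def differential_signal(partial: str, gold: str) -> float:
    # A large gap flags reasoning that "looks good" to the blind judge but
    # isn't converging on the answer, or vice versa.
    return judge_with_answer(partial, gold) - judge_blind(partial)

steps = ["Let x be the unknown.", "Then 2x + 3 = 11, so x = 4."]
partial = ""
for step in steps:
    partial = (partial + " " + step).strip()
    print(round(differential_signal(partial, gold="x = 4"), 2))
```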
This is the most important sentence describing the fundamental issue that LLMs have. This severely limits the technology's useful applications. Yet OpenAI and others constantly lie about it.
The article very clearly explains why models won't be able to generalise unless RL is performed constantly. But that's not scalable and has other problems of its own. For example, it still runs into paradoxes where the training mechanism has to know the answer in order to formulate the question. (This is precisely where the concept of World Models comes in, or why symbolism becomes more important.)
LLMs perform well in highly specialised scenarios with a well-defined and well-known problem space. It's probably possible to increase accuracy and correctness by using lots of interconnected models that can perform RL with each other. Again, this raises questions of scale and feasibility. But I think our brains (together with the other organs) work this way.
I thought world models like Genie 3 would be the training mechanism, but I likely misunderstand.
Yes, you can use Genie 3 to train other models, but it's far from perfect. You still need to train Genie 3, and its training and outputs must be useful in the context of what you want to train other models on. That's a paradox: the feedback loop needs to produce useful results, and Genie 3 can still hallucinate or produce implausible responses. Symbolism is a wide term, but a "World Model" needs it to make sense of the relations between concepts (e.g. ontologies, or the relation between movement and gravity).
The solution to this is giving the model a physical body and actually letting it interact with the real world and learn from it. But no lab dares to try this because allowing a model to learn from experience would mean allowing it to potentially change its views/alignment.
Multiple teams have already baked memory into their designs, some in typical ML style and some biologically inspired. Hallucination mitigation needs a ton more research. My proposal was to study the part of the brain that, when damaged, causes hallucinations, in case it exists to mitigate them. Then imitate it until we have something better.
> ... SFT is a subset of RL.
> The first thing to note about traditional SFT is that the responses in the examples are typically human written. ... But it is also possible to build the dataset using responses from the model we’re about to train. ... This is called Rejection Sampling.
I can see why someone might say there's overlap between RL and SFT (or semi-supervised FT), but how is "traditional" SFT considered RL? What is not RL then? Are they saying all supervised learning is a subset of RL, or only if it's fine tuning?
Sutton and Barto define reinforcement learning as "learning what to do -- how to map situations to actions -- so as to maximize a numerical reward signal". This is from their textbook on the topic.
That's a pretty broad definition. But the general formulation of RL involves a state of the world and the ability to take different actions given that state. In the context of an LLM, the state could be what has been said so far, and the action could be what token to produce next.
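As a rough illustration of that framing (toy vocabulary, a random stand-in policy, and a made-up grader; nothing here is a real model):

```python
# Generation framed as an RL episode: the state is the text produced so far,
# the action is the next token, and (in outcome-based RL) the reward only
# arrives when the episode ends.

import random

VOCAB = ["2", "3", "4", "<eos>"]

def sample_next_token(state: str) -> str:          # the "policy"
    return random.choice(VOCAB)

def grade_final_answer(state: str) -> float:       # the reward signal
    return 1.0 if "4" in state else 0.0

def rollout(prompt: str, max_steps: int = 10):
    state, trajectory = prompt, []
    for _ in range(max_steps):
        action = sample_next_token(state)          # action given current state
        trajectory.append((state, action))
        state = state + " " + action               # environment transition
        if action == "<eos>":
            break
    return trajectory, grade_final_answer(state)   # reward only at the end

trajectory, reward = rollout("What is 2 + 2?")
print(len(trajectory), "steps, reward =", reward)
```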
But as you noted, if you take such a broad definition of RL, tons of machine learning is also RL. When people talk about RL they usually mean the more specific thing of letting a model go try things and then be corrected based on the observations of how that turned out.
Supervised learning defines success by matching the labels. Unsupervised learning is about optimizing a known mathematical function (for example, predicting the likelihood that words appear near each other). Reinforcement learning maximizes a reward function that may not be directly known to the model; the model learns to optimize it by trying things, observing the results, and getting a reward or penalty.
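A toy way to see where the training signal comes from in each case, using one small linear model in PyTorch. The labels, the "known function", and the black-box reward below are all invented for illustration.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)

# Supervised: success is matching provided labels.
labels = torch.randint(0, 2, (8,))
loss_supervised = F.cross_entropy(model(x), labels)

# Unsupervised/self-supervised: optimize a known function of the data itself
# (a made-up reconstruction-style objective here; no labels involved).
loss_unsupervised = ((model(x).sum(dim=1) - x.sum(dim=1)) ** 2).mean()

# RL-style: sample an action, get a reward from a black box we cannot
# differentiate through, and reinforce actions in proportion to reward
# (the REINFORCE / score-function estimator).
def black_box_reward(actions):                     # unknown to the model
    return (actions == 1).float()

dist = Categorical(torch.softmax(model(x), dim=1))
actions = dist.sample()
loss_rl = -(dist.log_prob(actions) * black_box_reward(actions)).mean()

print(loss_supervised.item(), loss_unsupervised.item(), loss_rl.item())
```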
It’s not defined until the 13th paragraph of the linked article.
The confusion is understandable. The definition of RL in the Sutton/Barto book extends over two chapters iirc, and after reading it I did not see how it differed from other learning methods. Studying some of the academic papers cleared things up.
Admittedly, most interesting cases do have delays.
In its most general form, RL is about learning a policy (a state -> action mapping), which often requires inferring value, etc.
But copying a strong reference policy ... is still learning a policy, whether by SFT or not.
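A toy version of that point in PyTorch, with a made-up five-token vocabulary and a deliberately crude "state = last token" policy, just to show that the SFT cross-entropy loss is a policy-cloning objective:

```python
import torch
import torch.nn.functional as F

vocab = {"<bos>": 0, "the": 1, "answer": 2, "is": 3, "4": 4}
reference = ["the", "answer", "is", "4"]       # actions of the reference policy

policy = torch.nn.Sequential(
    torch.nn.Embedding(len(vocab), 16),
    torch.nn.Linear(16, len(vocab)),
)

# State here is just the previous token; a real LLM conditions on the prefix.
states = torch.tensor([vocab["<bos>"]] + [vocab[t] for t in reference[:-1]])
actions = torch.tensor([vocab[t] for t in reference])

logits = policy(states)                        # policy's action scores per state
sft_loss = F.cross_entropy(logits, actions)    # = -log pi(reference action | state)
print(sft_loss.item())
```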
It can be easier to recognize good responses than generate them.
Then feed it queries so it generates both responses and judgements of those responses. Instead of training the responses to match reference response data, train the model to output a high positive judgement, while holding its "judgement" weight values constant. Since the judgement weights are frozen, the only way to raise the judgement value is to give better answers: backpropagating through the fixed judgement weights distributes information from the judgement back into how the responses should change to improve.
Learn to predict/judge what is good or bad. Then learn to maximize good and minimize bad using the judgment/prediction as a proxy for actual feedback.
This technique is closer to traditional human/animal reinforcement learning.
We learn to predict which situations will cause us pain or positive affect, then learn to choose actions that minimize our predictions of bad and maximize our predictions of good. That is a much more efficient way to learn than having to actually experience everything and always get explicit external feedback.
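Here is a toy version of the freeze-the-judge scheme described a few comments up, with continuous vectors standing in for responses. (For discrete tokens you can't backpropagate through sampling, which is why RLHF-style setups reach for policy gradients; this sketch only shows the gradient flow through a frozen judge.)

```python
import torch

responder = torch.nn.Linear(8, 8)    # maps a query vector to a "response"
judge = torch.nn.Linear(8, 1)        # scores a response; assumed pre-trained

for p in judge.parameters():         # hold the judgement weights constant
    p.requires_grad_(False)

opt = torch.optim.Adam(responder.parameters(), lr=1e-2)
queries = torch.randn(32, 8)

for step in range(100):
    responses = responder(queries)
    score = judge(responses).mean()  # judged quality of the responses
    loss = -score                    # train the responder to maximize the score
    opt.zero_grad()
    loss.backward()                  # gradients pass through the frozen judge
    opt.step()
```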
There are many, many ways to do reinforcement learning.
Interactive stuff within content: a mini-game inside a game, school homework of course, or "whichever text box the viewer looks at longest, by WorldCoin Eyeball Tracker for Democracy x Samsung" for an interstitial turned captcha.
Better hope your taste isn't too bland and derivative!
Amazon and Ali soon lap the field by allowing coupon farming, but somehow eventually end up where they started.
Without knowing who or what the experts are, how they are used, what they are judging, what structure and mitigations are in place around their use, and what degree of neutrality is required (along with all the other factors and techniques being used), you can't make any such claim.
It's so easy to dismiss something.
A general algorithm isn't a claim that its practical use won't require accommodating the specific complications of each context.
Very much like how data scientists don't expect their best algorithms to operate well without also resolving a stream of practical issues, in standard and ad hoc ways, as needed.
Rather than jump through more hoops, I'm just going to give up on reading this one.