Happy to answer any questions about these methods.
Also, how well do these models work for extracting structured output? E.g., perform OCR on some handwritten text with math, convert it to HTML, and format the formulas correctly. Single-shot prompting doesn't work well on such problems, but splitting the steps into consecutive API calls works well.
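For what it's worth, the "consecutive API calls" pattern I mean is roughly the following sketch. `call_llm` is a hypothetical wrapper around whatever chat/vision API you're using; the point is just that each call gets one narrow job.

```python
def handwritten_math_to_html(page_image):
    # Step 1: transcription only - no formatting requirements yet.
    transcript = call_llm(
        "Transcribe this handwritten page verbatim. Write any math as plain LaTeX.",
        image=page_image,
    )
    # Step 2: formatting only - operates on clean text, no OCR burden.
    html = call_llm(
        "Convert this transcription to HTML. Render every formula as MathML:\n\n"
        + transcript
    )
    return html
```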
Yes, the search process (beam search or best-of-N) does produce verbose traces, because there is branching involved when sampling "thoughts" from the base model. These branched traces (including incomplete "abandoned" branches) can be shown to the user or hidden if the approach is deployed as-is.
In other words: 1) sample step-by-step solutions from the "base" model; 2) do it at non-zero temperature so that you can get multiple continuations from each solution prefix; 3) use the MATH ground-truth labels to decide whether a full solution (leaf/terminal node in the MC rollout) gets reward `1` or `0`; 4) roll up these rewards to calculate the reward-to-go for each intermediate step.
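A minimal sketch of that labeling procedure, assuming two hypothetical helpers: `sample_continuations`, which samples completions from the base model given a solution prefix, and `is_correct`, which checks a finished solution against the MATH ground-truth answer.

```python
def label_prefix(prefix, n_rollouts=8):
    """Estimate the reward-to-go of a solution prefix via Monte Carlo rollouts."""
    rewards = []
    # non-zero temperature so each rollout can take a different continuation
    for completion in sample_continuations(prefix, n=n_rollouts, temperature=0.8):
        full_solution = prefix + completion          # leaf/terminal node
        rewards.append(1.0 if is_correct(full_solution) else 0.0)
    # roll up leaf rewards into a per-step target for training the verifier
    return sum(rewards) / len(rewards)
```

Each (prefix, label) pair then becomes a training example for the process verifier.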
Yes, a verifier trained in this manner can be used to score solution prefixes (as a process verifier) or full solutions (as an outcome verifier).
In the original paper (https://arxiv.org/abs/2408.03314) they fine-tune a fresh verifier. HF's replication uses an off-the-shelf verifier based on another paper: https://arxiv.org/abs/2312.08935
Minor gripe - the best-of-N | beam search illustration is not compatible with red-green color blindness. I literally cannot see the difference between the Rejected and the Selected dots, even if I zoom in.
In contrast, in the original paper, the verifier is a fine-tune of the exact same base model that is used to sample step-by-step solutions (= the "solver").
Using a 3B model with an 8B verifier against a 70B model would make sense too. That being said, their performance barely crossed the 70B line with 256 samples. That is 256*(8+3)/70 ≈ 40 times more computationally expensive than running the 70B model as-is.
"1B solver + 8B verifier + search" beating 1B-0-shot or 1B-majority as baselines isn't illustrative imo. In other words, by using larger verifier, HF's replication fails to establish a "fair" baseline. Still an awesome blog and release/repository from HF's group - I love it!
> To guide our search strategies, we used RLHFlow/Llama3.1-8B-PRM-Deepseek-Data, an 8B reward model that has been trained using process supervision
See https://github.com/huggingface/search-and-learn/blob/b3375f8... and https://github.com/huggingface/search-and-learn/blob/b3375f8...
In the original paper, they use PaLM 2-S* as "solver" and its fine-tune as "verifier".
1) make the model output a full solution, step by step, then induce it to revise that solution - repeat this as many times as your token budget allows. You can do this via prompting alone (see Reflexion, for example), or you can fine-tune the model to do it. The paper explores fine-tuning the base model to turn it into a self-revision model (see the first sketch after this list).
2) sample step-by-step solutions (one "thought" sentence per line) from the model, and do it at non-zero temperature to be able to sample multiple next steps. Then use a verifier model to choose between next-step candidates and preferentially continue the rollout of the more promising branches of "thoughts" (see the second sketch after this list). There are many, many methods for exploring such a tree once you can score intermediate nodes (beam search is an almost 50-year-old algorithm!).
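First sketch, for (1): the prompting-only version of the revision loop. `generate` is a hypothetical call into the model; the paper instead fine-tunes the model so that it produces revisions natively.

```python
def solve_with_revisions(problem, budget=4):
    solution = generate(f"Solve step by step:\n{problem}")
    for _ in range(budget - 1):   # spend the remaining budget on revisions
        solution = generate(
            f"Problem:\n{problem}\n\nPrevious attempt:\n{solution}\n\n"
            "Point out any mistakes in the attempt, then write a fully corrected solution."
        )
    return solution
```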
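Second sketch, for (2): step-level beam search guided by the verifier. This assumes hypothetical helpers `sample_next_steps` (draws k candidate next "thoughts" from the solver at non-zero temperature), `is_terminal` (checks whether a prefix already ends with a final answer), and `score` (the process verifier's rating of a prefix).

```python
import heapq

def verifier_guided_beam_search(problem, beam_width=4, expand_k=4, max_depth=12):
    beams = [problem]                     # each beam is a solution prefix
    for _ in range(max_depth):
        candidates = []
        for prefix in beams:
            if is_terminal(prefix):       # keep finished solutions as-is
                candidates.append(prefix)
                continue
            # branch: sample several candidate next "thoughts"
            for step in sample_next_steps(prefix, k=expand_k):
                candidates.append(prefix + "\n" + step)
        # keep only the most promising prefixes according to the process verifier
        beams = heapq.nlargest(beam_width, candidates, key=score)
        if all(is_terminal(b) for b in beams):
            break
    return max(beams, key=score)          # best complete trace found
```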
Normally when you run an LLM, you set your prompt and whatever tunable parameters, and the LLM software (e.g. llama.cpp) spits out tokens at whatever rate it can. If you want higher quality, you run a bigger model (though you're limited by the amount of memory you have available). If you want higher speed, you run a smaller model. Hugging Face seems to be looking at ways to make this tradeoff without switching between different models.
I think it *is* an unlock.
1. the reason for generalizations like 'long enough' and 'think more' is apparently that the methods are somewhat obscure
2. those methods are being explored by Hugging Face to make them less obscure
Am I getting that right? I have been struggling to see past the metaphors and understand exactly what additional computation is being done - and here I read it's something like multiple guesses being fed back in and chosen among, which means it's just multiple inferences in series that are all related to solving one problem.