But my real concern is with the results. The "13 parameters" number looks like bait, because it comes from finetuning a model on a very simple math benchmark, grade-school math (GSM8K), which is already heavily saturated across models. Besides, the effect seems to appear only for the Qwen family of models... It looks like GSM8K was part of Qwen's training set, and this TinyLoRA finetuning just made the last adjustments needed to perfectly reflect that overtraining.
I haven't done much research lately, but when I was still working on this, I had substantial success training an adapter of the form U_k @ P @ A, where U_k was the matrix of the top-k left singular vectors of the underlying weight, and P and A were your typical LoRA projection matrices.
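For what it's worth, a minimal NumPy sketch of that adapter shape (dimensions and initializations are my own illustrative choices, not from any paper): U_k is frozen, while P and A are the small trainable factors, so the update is always confined to the span of the base weight's top singular directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k, r = 64, 32, 8, 4   # illustrative sizes

# Frozen base weight and its top-k left singular vectors.
W = rng.standard_normal((d_out, d_in))
U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_k = U[:, :k]                                # (d_out, k), frozen

# Trainable LoRA-style factors (small random init here for a
# non-trivial demo; LoRA typically zero-inits one of the two).
P = rng.standard_normal((k, r)) * 0.01        # (k, r)
A = rng.standard_normal((r, d_in)) * 0.01     # (r, d_in)

delta_W = U_k @ P @ A     # rank <= r, and lies in span(U_k)
x = rng.standard_normal(d_in)
y = (W + delta_W) @ x     # adapted forward pass
```

The point of the U_k factor is that the update can't introduce directions outside the base weight's dominant subspace, which is one way to sidestep the intruder-dimension issue mentioned below.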
The 13 parameters are kind of misleading here; the real juice is in the P_i fixed random matrices. My suspicion is that they are overfitting to the benchmark, but they are almost certainly also observing a real gain in model capacity, largely due to avoiding the intruder-dimension problem.
Can you elaborate a bit on what you mean with the gap?
Now divide the average SOTA LLM's training cost (or a guess at it, since these numbers aren't always published, as far as I'm aware) by the number of users, or, to be stricter, by the number of people it's proven to be useful for (what else would training be for?), and it might not be so far off anymore.
Of course, whether it makes sense to divide and spread out the LLMs' costs across users in order to calculate an "average utility" is debatable.
[1] https://www.publicschoolreview.com/average-spending-student-...
I’m glad the rest of the anchor text gave some context.
> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
I’m sorry if that reads like a complaint.
It's just an unfortunate name collision: disambiguating by capitalization only works for computers, not for people.
> In particular, learning to generate longer outputs may be possible in few parameters
Reminded me of: https://arxiv.org/abs/2501.19393
> we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps
Maybe, indeed, the model simply learns to emit the EOS token (or similar) later, and the underlying capability is already present in the base model.
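The budget-forcing idea from the quoted paper is simple enough to sketch as a decode loop. This is a toy illustration only: `next_token` below is a hypothetical stand-in for a real model's sampling step, not any actual API.

```python
EOS, WAIT = "<eos>", "Wait"

def next_token(tokens):
    # Placeholder "model": tries to stop after 3 tokens. A real model
    # would sample from its output distribution here.
    return EOS if len(tokens) >= 3 else f"tok{len(tokens)}"

def generate(min_thinking_tokens, max_tokens=20):
    tokens = []
    while len(tokens) < max_tokens:
        t = next_token(tokens)
        if t == EOS:
            if len(tokens) < min_thinking_tokens:
                # Budget forcing: suppress EOS and append "Wait",
                # pushing the model to keep reasoning.
                tokens.append(WAIT)
                continue
            break
        tokens.append(t)
    return tokens
```

If the base model's reasoning ability is already there, all this loop changes is *when* EOS is accepted, which is consistent with the "EOS learned later" reading above.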
Let's say we have an expert low-level programmer and we try to teach him algebra. Either we:
- (SFT): give him an algebra book with new nomenclature, definitions, and syntax, or
- (RL): let him learn algebra using C syntax.

Fine-tuning works on an input/output basis. You are rewarded for producing a plausible output _now_.
RL rewards you later for producing the right output now. So you have to learn to generate a lot of activity but you are only rewarded if you end up at the right place.
In SFT you are rewarded for generating tokens that plausibly continue the proof, one token at a time. In RL you are expected to generate an entire proof, and you are rewarded or punished only once the proof is done.
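To make the contrast concrete, here is a toy NumPy sketch (the vocabulary, probabilities, and "proof" are all made up for illustration): SFT gets a dense per-token cross-entropy signal under teacher forcing, while RL gets a single sparse reward only after the whole sequence is sampled.

```python
import numpy as np

reference = [0, 2, 1]              # the "proof", token by token
probs = np.array([[0.7, 0.2, 0.1], # model's distribution at each step
                  [0.1, 0.2, 0.7], # under teacher forcing
                  [0.2, 0.6, 0.2]])

# SFT: dense signal -- cross-entropy against each reference token,
# credit assigned immediately at every step.
sft_loss = -np.log(probs[np.arange(3), reference]).mean()

# RL: sparse signal -- sample an entire sequence, then score it 1
# only if the finished "proof" checks out. Reward arrives at the end.
rng = np.random.default_rng(0)
sample = [int(rng.choice(3, p=p)) for p in probs]
reward = 1.0 if sample == reference else 0.0
```

The credit-assignment problem in the RL case (which of the sampled tokens earned the reward?) is exactly why "generate a lot of activity, get rewarded only at the right place" is the harder learning setting.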
[0]: cartesien.io or Salesforce's WebscaleRL
For some use cases it ranges from parity performance at 1/20th the cost to exceeding it at 1/10th the cost. The trade-off is, of course, narrow applicability.
even some advanced math usually involves applying patterns found elsewhere to new topics
*At least up to 300B parameters, based on the models we’ve tested.
refs: https://arxiv.org/abs/2412.17819 https://arxiv.org/abs/2412.06769
The real unlock isn’t TinyLoRA, it’s what this implies: ultra-cheap, continuous adaptation. The bottleneck shifts from compute to having a good reward signal.