Interestingly, both seem to modify the optimisation process indirectly, in my opinion effectively trying to fix a bad optimiser. It seems we still have a long way to go after Adam...
A preprint on arXiv suggests that Adam works better than SGD for training LLMs because of class imbalance [0]. It appears that scaling the gradient step helps with training; for another approach along those lines, see [1].
[0] https://arxiv.org/pdf/2402.19449
[1] https://arxiv.org/pdf/2402.02347
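To make "scaling the gradient step" concrete, here is a minimal NumPy sketch (not taken from either paper; the toy quadratic loss and hyperparameters are my own illustration) contrasting plain SGD with an Adam-style update on a loss where one coordinate's gradient is 100x larger than the other's:

```python
import numpy as np

def sgd_step(w, grad, lr=1e-2):
    # Plain SGD: every coordinate moves by lr * its raw gradient.
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: exponential moving averages of the gradient (m) and its square (v),
    # then a per-coordinate step scaled by 1/sqrt(v), i.e. the gradient step is
    # rescaled so large- and small-gradient directions move at similar rates.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy loss 0.5 * (100 * w0^2 + w1^2): the first coordinate's gradient is 100x
# larger, a crude stand-in for a few frequent tokens/classes dominating the loss.
def grad_fn(w):
    return np.array([100.0 * w[0], w[1]])

w_sgd = np.array([1.0, 1.0])
w_adam = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 201):
    w_sgd = sgd_step(w_sgd, grad_fn(w_sgd), lr=1e-2)
    w_adam, m, v = adam_step(w_adam, grad_fn(w_adam), m, v, t, lr=1e-2)

print("SGD :", w_sgd)   # w0 is done in one step, w1 only shrinks ~1% per step
print("Adam:", w_adam)  # both coordinates make comparable progress toward 0
```

The point is only the rescaling: with one global learning rate, SGD is forced to crawl on the small-gradient coordinate to avoid blowing up on the large one, while the Adam-style per-coordinate scaling moves both at a similar pace.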
If it's the former, this could effectively halve finetuning costs overnight, which would go a significant way towards enabling a wider array of use cases for LoRA.
It's a bit like someone reading a bicycling article and getting annoyed that FTP means Functional Threshold Power instead of File Transfer Protocol, or reading about machine learning and getting confused because MLP doesn't mean My Little Pony.
"computer science" and "tv shows" aren't the same domain, it's fine to have the same acronym.
"computer science" and "computer science" are the same domain, it's not a good idea to use the same acronym.
But "radio communication" is not "computer science", even though people sometimes plug radio transceivers into computers, just like "tv shows" aren't "computer science" just because people sometimes view or store their shows on a computer, and "bicycles" aren't "computer science" just because people sometimes mount computers on their bikes.
Also, both use a non-standard case mix.
Exactly. Like WiFi: from ancient times it has meant "Wife's Fidelity".
I wonder why!
If the latter, please point out the alternatives.