SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs
34 points | 1 year ago | 2 comments | arxiv.org | HN
imtringued
1 year ago
Extremely low-bit quantization makes me curious why it is so effective.

Why is it better to run a bigger model with more parameters at lower numerical precision?

Obviously more parameters are better, but why is that the case exactly? To see why, you need to understand that a transformer layer consists of the self-attention mechanism followed by a bog-standard feedforward network (usually multiple MLP layers). Most of the parameters live here.

My personal theory is based on the fact that ReLU is the simplest possible activation function that works, yet all it does is clamp negative values to zero. How could a network use that for learning?

The answer to the question is quite simple. If you have negative weights w_i, compute the sum s = sum_i w_i * x_i + b with a positive bias b, and feed s into ReLU, you get a boolean-like function that switches off as soon as the weighted sum drops below -b. This means you can build a comparison operator out of ReLU. Take it a few steps further and you can probably implement any arbitrary boolean function directly in each row of your MLP.
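A minimal sketch of this idea (my own illustration with hypothetical weights, not anything from the paper): a single ReLU unit with negative weights and a positive bias acts as a threshold comparator, and on {0, 1} inputs the same unit behaves like a NAND gate, which is universal for boolean logic.

```python
def relu(z):
    return max(0.0, z)

def unit_fires(x, weights, bias):
    """True when ReLU(sum_i w_i * x_i + bias) is positive."""
    return relu(sum(w * xi for w, xi in zip(weights, x)) + bias) > 0

# Hypothetical values: two negative weights, positive bias (the threshold).
w, b = [-1.0, -1.0], 1.5

# Comparator: fires only while the weighted input stays below the bias.
print(unit_fires([0.5, 0.5], w, b))  # True:  0.5 + 0.5 < 1.5
print(unit_fires([1.0, 1.0], w, b))  # False: 1.0 + 1.0 > 1.5

# On {0, 1} inputs the same unit is a NAND gate.
for a, c in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert unit_fires([a, c], w, b) == (not (a and c))
```

Since NAND alone can express any boolean function, a wide enough MLP row has everything it needs to act as fuzzy boolean circuitry.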

This means that most of the precision is only really needed during training, because you want a nicely continuously differentiable function for gradient descent; the model itself is mostly operating on a form of fuzzy boolean logic.

This means that the embedding length, basically the width of a token's vector representation, plays a key role in the ability to encode these mostly binary concepts.

Bigger models have wider tokens. That's why bigger models with low bit quantization outperform smaller models with high bit quantization.
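To make the precision trade-off concrete, here is a sketch of plain symmetric round-to-nearest quantization (my own illustration; the SplitQuantV2 method itself is more sophisticated than this). Mean reconstruction error grows as the bit width shrinks:

```python
import numpy as np

def quantize_dequantize(w, bits):
    # Symmetric round-to-nearest quantization (illustrative only).
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale  # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)  # stand-in for one weight row

for bits in (8, 4, 2):
    err = float(np.abs(quantize_dequantize(w, bits) - w).mean())
    print(f"{bits}-bit: mean abs error {err:.4f}")
```

The argument above is that a bigger model can absorb this per-weight error because each concept is spread across a wider vector, whereas a small model at high precision has fewer "boolean slots" to begin with.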

Ey7NFZ3P0nzAe
1 year ago
> all it does is clamp negative values to zero. How could a network use that for learning?

This "simple" effect is actually huge, because it allows a non-linear mapping between input and output. That completely changes the scope of what is learnable.

mentalgear
1 year ago
I feel like for many tasks there's a certain "good enough" threshold where local small LMs can do just as well, but privately, so no cloud LLM is needed. I think the future is mostly on-device SLMs and their agentic coordination.

In that sense, a local agentic framework (js/ts based) would soon be very relevant.

digdugdirk
1 year ago
Any reason why you're calling out a need for the framework to be js/ts based? There are plenty of Python frameworks in active development, some of which have js bindings/libraries.
mentalgear
1 year ago
Minimal setup. Anything Python requires the host machine to have the correct Python version, package manager, and libraries installed (which is far more than normal users can manage), or to have it bundled as a standalone Python executable (big!).

Web-native technology requires minimal setup, as it can basically run as-is in any browser (or Electron).

PaulHoule
1 year ago
uv has revolutionized the Python situation, mostly.

I recently updated YOShInOn's Python environment to be repeatable. The Python packaging part was pretty simple, but getting CUDA running in WSL2 was a little tricky. It turns out my "game ready" NVIDIA drivers in Windows somehow install a certain version of the base CUDA libs into WSL. You then have to install another five libraries with deb packaging inside WSL, which is not too hard, but about a third of the way through I realized I had a version mismatch and decided to barrel ahead anyway. I installed most of the debs manually, got fatigued, and installed the last one automatically. Somehow it all works, so I'm not messing with it, but I am a bit intimidated about what to do if I have problems and need to tear it down.

mentalgear
1 year ago
uv is still a CLI tool, maybe OK for tech enthusiasts, but no normal user will ever install a CLI tool. And even for knowledgeable users like yourself, there are a multitude of different problems, as you described.
PaulHoule
1 year ago
It can be called from another application and hidden; a mostly-Java application could probably "uv up" a Python environment for one or more subsystems.
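As a sketch of that idea (in Python rather than Java; the project path and package list are hypothetical, while `uv venv` and `uv pip install` with its `--python` flag are real uv subcommands): the host application builds the uv commands and runs them behind the scenes, so the end user never touches a CLI.

```python
import shutil
import subprocess

def uv_commands(project_dir, packages):
    # Commands a host application would run to provision a hidden Python env.
    venv = f"{project_dir}/.venv"
    return [
        ["uv", "venv", venv],
        ["uv", "pip", "install", "--python", venv, *packages],
    ]

def provision(project_dir, packages):
    # Assumption: a uv binary is available; a real app would bundle one.
    if shutil.which("uv") is None:
        raise RuntimeError("uv not found; bundle or download a uv binary first")
    for cmd in uv_commands(project_dir, packages):
        subprocess.run(cmd, check=True)

# e.g. provision("/opt/myapp", ["numpy", "onnxruntime"])
```

Since uv is a single static binary, an installer could ship it alongside the app, which sidesteps the "no normal user installs a CLI tool" objection.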

My beef is with CUDA being not just one thing but about 5 libraries if you want to use PyTorch. In the old days I figured out how to make conda packages for all that NVIDIA stuff, so I could install any version of Tensorflow in a single step and it was 100% correct. Then they changed Tensorflow so it was 95% correct, worked most of the time for most people, and my system didn't work anymore.

mentalgear
1 year ago
This again requires the Java Virtual Machine to be installed on the system, which most normal users don't have and won't install.