https://github.com/microsoft/BitNet
"...It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption..."
Am I missing something, or is this regressing to GPT-1 level?
> "...It matches the full-precision (i.e., FP16 or BF16)
Wait... WHAT?! When did //HALF PRECISION// become //FULL PRECISION//?
FWIW, I cannot find where you're quoting from. I cannot find "matches" in TFA or at the GitHub link. And in the paper I see:
3.2 Inference Accuracy
The bitnet.cpp framework enables lossless inference for ternary BitNet b1.58 LLMs. To evaluate inference accuracy, we randomly selected 1,000 prompts from WildChat [ ZRH+24 ] and compared the outputs generated by bitnet.cpp and llama.cpp to those produced by an FP32 kernel. The evaluation was conducted on a token-by-token basis, with a maximum of 100 tokens per model output, considering an inference sample lossless only if it exactly matched the full-precision output.
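In code terms, the check they describe is just an exact token-by-token comparison against an FP32 reference. A minimal sketch, assuming hypothetical model objects with deterministic (greedy) generation:

```python
# Sketch of the "lossless" criterion described above. `quantized_model` and
# `fp32_model` are hypothetical stand-ins; the paper compares bitnet.cpp and
# llama.cpp outputs against an FP32 kernel, up to 100 tokens per output.
def is_lossless(prompt: str, quantized_model, fp32_model, max_tokens: int = 100) -> bool:
    q_tokens = quantized_model.generate(prompt, max_new_tokens=max_tokens)
    ref_tokens = fp32_model.generate(prompt, max_new_tokens=max_tokens)
    # A sample counts as lossless only if every token matches the FP32 output.
    return q_tokens == ref_tokens

# e.g. lossless_rate = sum(is_lossless(p, bitnet_cpp, fp32_ref) for p in prompts) / len(prompts)
```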
In reality, the demand for highly parallelized compute simply blindsided OEMs. AMD, Intel, and Apple were all laser-focused on raster efficiency; none of them has a GPU architecture optimized for GPGPU workloads. AMD and Intel don't have competitive fab access, and Apple can't sell datacenter hardware to save their life; Nvidia's monopoly on attractive TSMC hardware isn't going anywhere.
There are really only three competitors in this market, plus one also-ran.
Obviously it's Nvidia, Google, and Tenstorrent.
The also-ran is AMD, whose products are only bought as a hedge against Nvidia. Even though the hardware is better on paper, the software is so bad that you get worse performance than on Nvidia. Hence "also-ran".
Tenstorrent isn't there yet, but it's just a matter of time. They are improving with every generation of hardware and their software stack is 100% open source.
At least not for these reasons. If it does, it'll be because of a consistent pattern of overhyping and underdelivering on real-world applications of generative AI, like what's going on with Apple right now.
Was the crazy revaluation of Nvidia wild? Yes.
Will others start taking contracts away with their fast custom inference solutions? Yes, of course, but I'm sure everyone is aware of that.
What is very unclear is how strong Nvidia is with their robotics platform.
This is true for post-training quantization, but not for quantization-aware training, and not for something like BitNet. Here they claim performance per parameter count comparable to normal models; that's the entire point.
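For anyone unfamiliar with the distinction: post-training quantization rounds an already-trained model's weights, while quantization-aware training quantizes in the forward pass and lets gradients flow to the full-precision shadow weights via a straight-through estimator. A rough PyTorch-style sketch of the latter; illustrative only, not the exact BitNet recipe:

```python
import torch

def ternarize_ste(w: torch.Tensor) -> torch.Tensor:
    """Quantization-aware training step for ternary weights: the forward pass
    sees weights rounded to {-1, 0, +1} (times a scale), while the backward
    pass updates the full-precision weights as if no rounding happened."""
    scale = w.abs().mean().clamp(min=1e-5)           # absmean-style scale
    w_q = (w / scale).round().clamp(-1, 1) * scale   # ternary forward weights
    # Straight-through estimator: forward value is w_q, gradient flows to w.
    return w + (w_q - w).detach()
```

Post-training quantization, by contrast, would apply that rounding once to a finished checkpoint, with no chance for training to compensate.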
The claim here seems to be that it runs usefully fast on CPU.
We're not sure how accurate this claim is, because we don't know how fast this model runs on a GPU:
> Absent from the list of supported chips are GPUs [...]
And TFA doesn't really quantify anything, it just offers:
> Perhaps more impressively, BitNet b1.58 2B4T is speedier than other models of its size — in some cases, twice the speed — while using a fraction of the memory.
The model they link to is just over 1GB in size, and there are plenty of existing 1-2GB models that are quite serviceable on even a mildly modern CPU-only rig.
The original paper, "BitNet: Scaling 1-bit Transformers for Large Language Models", was actually binary (weights of -1 or 1), but then in the follow-up paper (https://arxiv.org/pdf/2402.17764) they started using 1.58-bit weights:
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
This seems to be the first source of the conflation of "1-bit LLM" and ternary weights that I could find:
> In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}.
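For what it's worth, the "1.58" is just log2(3), the information content of a single ternary weight; quick arithmetic, nothing more:

```python
import math

# A ternary weight {-1, 0, +1} carries log2(3) bits of information.
print(math.log2(3))   # 1.584962500721156

# A practical packing stores 5 ternary weights per byte, since 3**5 = 243 <= 256,
# i.e. 8/5 = 1.6 bits per weight, close to the 1.58-bit lower bound.
print(3 ** 5, 8 / 5)  # 243 1.6
```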
Lookup tables can be fast, but they're not simpler; see T-MAC (https://arxiv.org/abs/2407.00088). (Note that all comparisons with `llama.cpp` were made before I introduced the types from https://github.com/ggml-org/llama.cpp/pull/8151, where the 1.6-bit type uses the techniques described in the aforementioned blog post.)
I wanted to try without lookup tables to at least have a baseline, and also because the fixed-point packing idea lent itself naturally to using multiplications by powers of 3 when unpacking.
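Roughly, the idea is to store five ternary digits as a base-3 fraction of a byte (q ~ value * 256/243), so each digit can be pulled back out with a multiply by a power of 3 and a high-byte shift, no division or table needed. A simplified Python sketch; the actual packed layout and block structure in the PR may differ:

```python
from itertools import product

def pack5(trits):
    """Pack five weights in {-1, 0, +1} (most significant first) into one byte."""
    v = 0
    for t in trits:
        v = v * 3 + (t + 1)          # base-3 value in [0, 242]
    return -(-v * 256 // 243)        # ceil(v * 256 / 243): a base-3 fraction of 256

def unpack5(q):
    """Recover the five weights using only multiplies by powers of 3 and shifts."""
    out = []
    for n in range(5):
        shifted = (q * 3 ** n) & 0xFF           # digits already consumed overflow away
        out.append(((shifted * 3) >> 8) - 1)    # next base-3 digit lands in the top byte
    return out

# Exhaustive round-trip check over all 3**5 = 243 combinations.
assert all(unpack5(pack5(t)) == list(t) for t in product((-1, 0, 1), repeat=5))
```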
They would keep re-compressing the model in its entirety, recursively until the whole thing was a single bit, but the unpacking and repacking during inference is a bitch.
Edit: Oh I just saw the Git repo:
> exo: Run your own AI cluster at home with everyday devices.
So the "distributed problem" is in the process of being solved. Impressive.
Initially it was more of a general 'architecture astronautics' exercise in the context of dynamic reconfigurability/systolic arrays/transport-triggered architectures/VLIW, which got me some nice results.
Having read and thought much about balanced ternary hardware, 'playing' with that, and reading about how it could be favourably applied to ML led to that 'pivot'.
A few years before 'this', I might add.
Now I can 'play' much more relaxed and carefree, to see what else I can get out of this :-)
Weird comparison? The M2 already runs 7 GB or 13 GB Llama and Mistral models with relative ease.
The M-series and MacBooks are so ubiquitous that perhaps we're forgetting how weak the average CPU (think i3 or i5) can be.
If a model can be "run on a CPU", it should run acceptably on a general-purpose 8-core CPU, like an i7 or i9, a Ryzen 7, or even an ARM design like a Snapdragon.