Microsoft researchers developed a hyper-efficient AI model that can run on CPUs
143 points
3 days ago
| 12 comments
| techcrunch.com
hu3
3 days ago
[-]
Repo with demo video and benchmark:

https://github.com/microsoft/BitNet

"...It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption..."

https://arxiv.org/abs/2402.17764

reply
Animats
3 days ago
[-]
That essay on the water cycle makes no sense. Some sentences are repeated three times. The conclusion about the water cycle and energy appears wrong. And what paper is "Jenkins (2010)"?

Am I missing something, or is this regressing to GPT-1 level?

reply
yorwba
2 days ago
[-]
They should probably redo the demo with their latest model. I tried the same prompt on https://bitnet-demo.azurewebsites.net/ and it looked significantly more coherent. At least it didn't get stuck in a loop.
reply
int_19h
2 days ago
[-]
2B parameters should be in the ballpark of GPT-2, no?
reply
godelski
2 days ago
[-]

  > "...It matches the full-precision (i.e., FP16 or BF16)
Wait... WHAT?!

When did //HALF PRECISION// become //FULL PRECISION//?

FWIW, I cannot find where you're quoting from. I cannot find "matches" in either TFA or the GitHub link. And in the paper I see

  3.2 Inference Accuracy
  
  The bitnet.cpp framework enables lossless inference for ternary BitNet b1.58 LLMs. To evaluate inference accuracy, we randomly selected 1,000 prompts from WildChat [ZRH+24] and compared the outputs generated by bitnet.cpp and llama.cpp to those produced by an FP32 kernel. The evaluation was conducted on a token-by-token basis, with a maximum of 100 tokens per model output, considering an inference sample lossless only if it exactly matched the full-precision output.
reply
ilrwbwrkhv
3 days ago
[-]
This will happen more and more. This is why Nvidia is rushing to make CUDA a software-level lock-in; otherwise their stock will go the way of Zoom.
reply
soup10
3 days ago
[-]
i agree, no matter how much wishful thinking jensen sells to investors about paradigm shifts, the days of everyone rushing out to get 6-figure tensor core clusters for their data center probably won't last forever.
reply
bigyabai
3 days ago
[-]
If Nvidia were at all in a hurry to lock out third parties, then I don't think they would support OpenCL and Vulkan compute, or allow customers to write PTX compilers that interface with Nvidia hardware.

In reality, the demand for highly parallelized compute simply blindsided OEMs. AMD, Intel and Apple were all laser-focused on raster efficiency; none of them has a GPU architecture optimized for GPGPU workloads. AMD and Intel don't have competitive fab access, and Apple can't sell datacenter hardware to save their life; Nvidia's monopoly on attractive TSMC hardware isn't going anywhere.

reply
mlinhares
3 days ago
[-]
The profit margins on Macs must be insane, because it just doesn't make sense at all that Apple doesn't give a fuck about data center workloads when they have some of the best ARM CPUs and whole packages on the market.
reply
bigyabai
3 days ago
[-]
If Xserve is any basis of comparison, Apple struggles to sell datacenter hardware in the best of markets. The competition is too hot nowadays, and Apple likely knows the investment wouldn't be worth it. ARM CPUs are available from Ampere and Nvidia now; Apple Silicon would have to differentiate itself more than it does on mobile. After a certain point, it probably does come down to the size of the margins on consumer hardware.
reply
ahmeni
2 days ago
[-]
I will forever be saddened by the fact that Apple killed their Xserve line shortly before the App Store got big. We all ended up having to do dumb things like rack-mounting Mac Minis for app CI builds for years, and it was such a pain.
reply
pzo
3 days ago
[-]
there was news that they recently bought a lot of Nvidia GPUs, since their own chips were progressing too slowly to use even in their own data centers for their own purposes
reply
imtringued
2 days ago
[-]
I don't know how it happened, but Intel completely dropped out of the AI accelerator market.

There are really only three competitors in this market, with one also-ran company.

Obviously it's Nvidia, Google, and Tenstorrent.

The also-ran company is AMD, whose products are only bought as a hedge against Nvidia. Even though the hardware is better on paper, the software is so bad that you get worse performance than on Nvidia. Hence "also-ran".

Tenstorrent isn't there yet, but it's just a matter of time. They are improving with every generation of hardware and their software stack is 100% open source.

reply
int_19h
2 days ago
[-]
Even if you can squeeze an existing model into smaller hardware, that means you can squeeze a larger (and hence smarter) model into that 6-figure cluster. And these models aren't anywhere near smart enough for many things people attempt to use them for, so I don't see the hardware demand for inference subsiding substantially anytime soon.

At least not for these reasons - if it does, it'll be because of a consistent pattern of overhyping and underdelivering on real-world applications of generative AI, like what's going on with Apple right now.

reply
layoric
3 days ago
[-]
He is fully aware; that is why he is selling his stock on the daily.
reply
Sonnigeszeug
2 days ago
[-]
Comparing Zoom and Nvidia is just not valid at all.

Was the crazy revaluation of Nvidia wild? Yes.

Will others start taking contracts away with their fast custom inference solutions? Yes, of course, but I'm sure everyone is aware of it.

What is very unclear is how strong Nvidia is with their robot platform.

reply
jcadam
3 days ago
[-]
So Microsoft is about to do to Nvidia what Nvidia did to SGI.
reply
PaulDavisThe1st
3 days ago
[-]
still, better than the way of Skype.
reply
zamadatix
3 days ago
[-]
"Parameter count" is the "GHz" of AI models: the number you're most likely to see but least likely to need. All of the models compared (in the table on the huggingface link) are 1-2 billion parameters but the models range in actual size by more than a factor of 10.
reply
int_19h
2 days ago
[-]
Because of different quantization. However, parameter count is generally the more interesting number so long as quantization isn't too extreme (as it is here). E.g. FP32 is 4x the size of an 8-bit quant, but the difference in quality is close to non-existent in most cases.
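
As a rough back-of-the-envelope sketch (assuming weight storage dominates and ignoring embeddings, activations, and file overhead):

  # approximate weight storage for a 2B-parameter model
  params = 2_000_000_000
  for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("ternary", 1.58)]:
      print(f"{name}: ~{params * bits / 8 / 1e9:.1f} GB")
  # FP32: ~8.0 GB, FP16: ~4.0 GB, INT8: ~2.0 GB, ternary: ~0.4 GB

which lines up with the 10x+ size spread mentioned above.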
reply
orbital-decay
2 days ago
[-]
>so long as quantization isn't too extreme (as it is here)

This is true for post-training quantization, not for quantization-aware training, and not for something like BitNet. Here they claim performance comparable to normal models at the same parameter count; that's the entire point.

reply
charcircuit
2 days ago
[-]
TPS is the GHz of AI models. Both are related to the propagation time of data.
reply
idonotknowwhy
2 days ago
[-]
Then I guess vocab is the IPC. 10k Mistral tokens are about 8k Llama 3 tokens.
reply
Jedd
3 days ago
[-]
I think almost all the free LLMs (not AI) that you find on hf can 'run on CPUs'.

The claim here seems to be that it runs usefully fast on CPU.

We're not sure how accurate this claim is, because we don't know how fast this model runs on a GPU, because:

  > Absent from the list of supported chips are GPUs [...]

And TFA doesn't really quantify anything, just offers:

  > Perhaps more impressively, BitNet b1.58 2B4T is speedier than other models of its size — in some cases, twice the speed — while using a fraction of the memory.
The model they link to is just over 1GB in size, and there are plenty of existing 1-2GB models that are quite serviceable on even a mildly modern CPU-only rig.
reply
sheepscreek
3 days ago
[-]
If you click the demo link, you can type a live prompt and see it run on CPU or GPU (A100). From my test, the CPU was laughably slower. To my eyes, it seems comparable to the models I can run with llama.cpp today. Perhaps I am completely missing the point of this.
reply
ein0p
3 days ago
[-]
This is over a year old. The sky did not fall, and everyone did not switch to this in spite of the "advantages". If you look into why, you'll see that it does, in fact, affect the metrics, some more than others, and there is no silver bullet.
reply
yorwba
2 days ago
[-]
The 2B4T model was literally released yesterday, and it's both smaller and better than what they had a year ago. Presumably the next step is that they get more funding for a larger model trained on even more data to see whether performance keeps improving. Of course the extreme quantization is always going to impact scores a bit, but if it lets you run models that otherwise wouldn't even fit into RAM, it's still worth it.
reply
justanotheratom
3 days ago
[-]
are you predicting, or is there already a documented finding somewhere?
reply
ein0p
3 days ago
[-]
Take a look at their own paper, or at the many attempts to train something large with this. There's no replacement for displacement. If this actually worked without quality degradation, literally everyone would be using it.
reply
imtringued
2 days ago
[-]
AQLM, EfficientQAT and ParetoQ get reasonable benchmark scores at 2-bit quantization. At least 90% of the original unquantized scores.
reply
stogot
3 days ago
[-]
The pricing war will continue down to rock bottom.
reply
falcor84
3 days ago
[-]
Why do they call it "1-bit" if it uses ternary {-1, 0, 1}? Am I missing something?
reply
Maxious
3 days ago
[-]
reply
falcor84
3 days ago
[-]
Thanks, but I've skimmed through both and couldn't find an answer on why they call it "1-bit".
reply
AzN1337c0d3r
3 days ago
[-]
The original BitNet paper (https://arxiv.org/pdf/2310.11453)

  BitNet: Scaling 1-bit Transformers for Large Language Models
was actually binary (weights of -1 or 1),

but then in the follow-up paper they started using 1.58-bit weights (https://arxiv.org/pdf/2402.17764)

  The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
This seems to be the first source of the conflation of "1-bit LLM" and ternary weights that I could find.

  In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}.
reply
LeonB
3 days ago
[-]
It’s “1-bit, for particularly large values of ‘bit’”
reply
biomcgary
2 days ago
[-]
Should be 1-trit.
reply
taneq
2 days ago
[-]
That’s pretty cool. :) One thing I don’t get is why do multiple operations when a 243-entry lookup table would be simpler and hopefully faster?
reply
compilade
2 days ago
[-]
Because lookup tables are not necessarily faster compared to 8-bit SIMD operations, at least when implemented naïvely.

Lookup tables can be fast, but they're not simpler; see T-MAC https://arxiv.org/abs/2407.00088 (note that all comparisons with `llama.cpp` were made before I introduced the types from https://github.com/ggml-org/llama.cpp/pull/8151, where the 1.6-bit type uses the techniques described in the aforementioned blog post).

I wanted to try without lookup tables to at least have a baseline, and also because the fixed point packing idea lent itself naturally to using multiplications by powers of 3 when unpacking.
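
For anyone curious, here is a toy sketch of the fixed-point packing idea (not the actual llama.cpp kernel): five ternary weights fit in one byte because 3^5 = 243 <= 256, which is also where a 243-entry table would come from.

  # Pack 5 ternary weights {-1, 0, 1} into one byte as base-3 digits.
  def pack(trits):
      byte = 0
      for t in trits:
          byte = byte * 3 + (t + 1)  # map {-1, 0, 1} -> {0, 1, 2}
      return byte

  # Unpack by repeated division by 3; the real kernel instead uses
  # multiplications by powers of 3, as described above.
  def unpack(byte):
      trits = []
      for _ in range(5):
          trits.append(byte % 3 - 1)
          byte //= 3
      return trits[::-1]

  assert unpack(pack([1, -1, 0, 1, -1])) == [1, -1, 0, 1, -1]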

reply
taneq
1 day ago
[-]
Thanks for taking the time to reply! I haven’t done any serious low level optimisation on modern CPUs so most of my intuitions are probably way out of date.
reply
sambeau
3 days ago
[-]
Maybe they are rounding down from 1.5-bit :)
reply
BuyMyBitcoins
3 days ago
[-]
Classic Microsoft naming shenanigans.
reply
1970-01-01
3 days ago
[-]
It's not too late to claim 1bitdotnet.net before they do.
reply
DecentShoes
3 days ago
[-]
LLM Series One S and X
reply
Nevermark
2 days ago
[-]
Once you know how to compress 32-bit parameters to ternary, compressing ternary to binary is the easy part. :)

They could keep re-compressing the model in its entirety, recursively, until the whole thing was a single bit, but the unpacking and repacking during inference would be a bitch.

reply
prvc
2 days ago
[-]
There are about 1.58 (i.e. log_2(3)) bits per digit, so they just applied the constant function that maps the reals to 1 to it.
reply
ufocia
2 days ago
[-]
1.58 is still more than 1 in general, unless the parameters are correlated. At 1 bit it seems unlikely that you could pack/unpack independent parameters reliably without additional data.
reply
falcor84
2 days ago
[-]
I like that as an explanation, but then every system is 1-bit, right? It definitely would simplify things.
reply
mistrial9
3 days ago
[-]
reply
nodesocket
3 days ago
[-]
There are projects working on distributed LLMs, such as exo[1]. If they can crack the distributed problem fully and get good performance, it's a game changer. Instead of spending insane amounts on Nvidia GPUs, you could just deploy commodity clusters of AMD EPYC servers with tons of memory, NVMe disks, and 40G or 100G networking, which is vastly less expensive. Goodbye Nvidia AI moat.

[1] https://github.com/exo-explore/exo

reply
lioeters
2 days ago
[-]
Do you think this is inevitable? It sounds like, if distributed LLMs are technically feasible, it will eventually happen. Maybe it's unknown whether the problem can be solved at all, but I imagine enough people are working on it that they will find a breakthrough one way or another. LLMs themselves could participate in solving it.

Edit: Oh I just saw the Git repo:

> exo: Run your own AI cluster at home with everyday devices.

So the "distributed problem" is in the process of being solved. Impressive.

reply
esafak
3 days ago
[-]
Is there a library to distill bigger models into BitNet?
reply
timschmidt
3 days ago
[-]
I could be wrong, but my understanding is that bitnet models have to be trained that way.
reply
babelfish
3 days ago
[-]
They don't have to be trained that way! The training data for 1-bit LLMs is the same as for any other LLM. A common way to generate this data is called 'model distillation', where you take completions from a teacher model and use them to train the child model (what you're describing)!
reply
timschmidt
3 days ago
[-]
Maybe I wasn't clear, I think you've misunderstood me. I understand that all sorts of LLMs can be trained using a common corpus of data. But my understanding is that the choice of creating a bitnet LLM must be made at training time, as modifications to the training algorithms are required. In other words, an existing FP16 model cannot be quantized to bitnet.
reply
babelfish
1 day ago
[-]
Ah yes, definitely misunderstood you, my bad
reply
justanotheratom
3 days ago
[-]
Super cool. Imagine specialized hardware for running these.
reply
llama_drama
3 days ago
[-]
I wonder if instructions like VPTERNLOGQ would help speed these up
reply
LargoLasskhyfv
3 days ago
[-]
It already exists. Dynamically reconfigurable. Some smartass designed it alone on ridiculously EOL'd FPGAs. Meanwhile ASICs in small batches without FPGA baggage were produced. Unfortunately said smartass is under heavy NDA. Or luckily, because said NDA paid very well for him.
reply
djmips
2 days ago
[-]
Nicely done!
reply
LargoLasskhyfv
2 days ago
[-]
Was actually sort of a sideways pivot, and hard for me to do, because of the involved mathematics.

Initially it was more of a general 'architecture astronautics' exercise in the context of dynamic reconfigurability/systolic arrays/transport triggered architectures/VLIW, which got me some nice results.

Having read and thought much about balanced ternary hardware, and 'playing' with that, while also reading about how this could be favourably applicable to ML, led to that 'pivot'.

A few years before 'this', I might add.

Now I can 'play' much more relaxed and carefree, to see what else I can get out of this :-)

reply
instagraham
3 days ago
[-]
> it’s openly available under an MIT license and can run on CPUs, including Apple’s M2.

Weird comparison? The M2 already runs 7GB or 13GB Llama and Mistral models with relative ease.

The M-series and MacBooks are so ubiquitous that perhaps we're forgetting how weak the average CPU (think i3 or i5) can be.

reply
nine_k
2 days ago
[-]
The M-series have a built-in GPU and unified RAM accessible to both. Running a model on an M-series chip without using the GPU is, imho, pointless. (That said, it's still a far cry from an H100 with a ton of VRAM, or from a Google TPU.)

If a model can be "run on a CPU", it should run acceptably on a general-purpose 8-core CPU, like an i7, an i9, a Ryzen 7, or even an ARM design like a Snapdragon.

reply
1970-01-01
3 days ago
[-]
...and eventually the Skynet Funding Bill was passed.
reply