I tried a few things with it. Got it driving Cursor, which in itself was impressive; it handled some tool usage. Via Cursor I had it generate a few web page tests.
On a Monte Carlo simulation of pi, it got the logic correct but failed to build an interface to start the test. Requesting changes mostly worked, but left behind some stray symbols that caused things to fail, requiring a bit of manual editing.
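For reference, the Monte Carlo logic it needed to get right is only a few lines. A minimal sketch in Python (my own, not the code it generated):

```python
import random

def estimate_pi(samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points in the unit square and counting
    how many land inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # (area of quarter circle) / (area of square) = pi / 4
    return 4.0 * inside / samples

print(estimate_pi(100_000))
```

The interface it failed to build is just whatever button or form calls a function like this and displays the result.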
Tried a Simon Willison pelican as well - very abstract, not recognizable at all as a bird or a bicycle.
Pictures of the results here: https://x.com/pwnies/status/2039122871604441213
There doesn't seem to be a demo link on their webpage, so here's a llama.cpp running on my local desktop if people want to try it out. I'll keep this running for a couple hours past this post: https://unfarmable-overaffirmatively-euclid.ngrok-free.dev
Strongly agree that the knowledge density is impressive for a 1-bit model of such a small size, with blazing-fast responses.
I should note this is running on an RTX 6000 pro, so it's probably at the max speed you'll get for "consumer" hardware.
That... pft. Nevermind, I'm just jealous
The question is about the best mode of transportation to a car wash located 100 meters away. Since the user is asking for a recommendation, it's important to consider practical factors like distance, time, and convenience.
Walking is the most convenient and eco-friendly option, especially if the car wash is within a short distance. It avoids the need for any transportation and is ideal for quick errands.
Driving is also an option, but it involves the time and effort of starting and stopping the car, parking, and navigating to the location.
Given the proximity of the car wash (100 meters), walking is the most practical and efficient choice. If the user has a preference or if the distance is longer, they can adjust accordingly.

We're only a couple of years into optimization tech for LLMs. How many other optimizations are we yet to find? Just how small can you make a working LLM that doesn't emit nonsense? With the right math, could we have been running LLMs in the 1990s?
Sample output below (the model's response to "hi" in the forked llama-cli):
X ( Altern as the from (.. Each. ( the or,./, and, can the Altern for few the as ( (. . ( the You theb,’s, Switch, You entire as other, You can the similar is the, can the You other on, and. Altern. . That, on, and similar, and, similar,, and, or in
1. Clone their forked repo: `git clone https://github.com/PrismML-Eng/llama.cpp.git`
2. Then (assuming you already have the Xcode build tools installed):
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)
3. Finally, run it with (you can adjust arguments): ./build/bin/llama-server -m ~/Downloads/Bonsai-8B.gguf --port 80 --host 0.0.0.0 --ctx-size 0 --parallel 4 --flash-attn on --no-perf --log-colors on --api-key some_api_key_string
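Once the server is up, you can talk to llama-server's OpenAI-compatible `/v1/chat/completions` endpoint. A minimal client sketch in Python (host, port, and API key match the command above; the `model` field is a placeholder, since llama-server serves whatever model it loaded):

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:80",
                       api_key: str = "some_api_key_string") -> urllib.request.Request:
    """Build a POST request for llama-server's OpenAI-compatible
    /v1/chat/completions endpoint."""
    body = json.dumps({
        "model": "Bonsai-8B",  # placeholder; the server ignores/echoes it
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

# To actually send it (requires the server to be running):
# with urllib.request.urlopen(build_chat_request("hi")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```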
Model was first downloaded from: https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main

And this is when I'm serving zero prompts, just having loaded the model (using llama-server).
Literally just downloaded the model into a folder, opened cursor in that folder, and told it to get it running.
Prompt: The gguf for bonsai 8b are in this local project. Get it up and running so I can chat with it. I don't care through what interface. Just get things going quickly. Run it locally - I have plenty of vram. https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main
I had to ask it to increase the context window size to 64k, but other than that it got it running just fine. After that I just told ngrok the port I was serving it on and voila.
Results: 8 passed, 0 failed, 17 errored out of 25
That puts it right between Qwen3.5-4B (7/25) and Nanbeige4.1-3B (9/25), for example, but it took only 200 seconds for the whole test. Qwen3.5 took 976 seconds and Nanbeige over 2000 (although both of those were on my 1070, so not quite the same hardware).
Granite 7B 4bit does the test in 199 seconds but only gets 4/25 correct.
See https://sql-benchmark.nicklothian.com/#all-data (click on the cells for the trace of each question)
Errors are bad tool calls (vs. failures, which are incorrect SQL).
I used @freakynit's runpod (thanks!)
I imagine this being used for sub-agents with some SOTA models directing them, but I wasn't really able to replicate that personally (I asked Claude to create a detailed plan for a pomodoro app and then passed it to Bonsai).
I also tried its writing skills and they're actually kind of decent; I also found that this model uses comparatively very few em dashes.

Its fine-tunes are gonna be some really amazing things to come out of this. I hope someone makes a fine-tune for website/Tampermonkey extensions ;)
I remember using GPT-3 with Svelte/SvelteKit to change a green button to a blue button and have the text inside the buttons change; that was my personal wow moment from GPT-3. (This model wasn't really able to replicate that accurately, even in plain JS.) Maybe the current model isn't good at writing HTML, but the possibilities of custom-training these models, and the idea of a 1-bit model, feel really great to me.
Especially with the idea of Ngram-embedding[0] (Meituanlongcat/LongCatFlashLite). When I imagine a 1-bit model combined with the Ngram-embedding idea, I feel it could have endless possibilities.
[0]: https://news.ycombinator.com/item?id=46803687 (I had submitted this but it seems to have had no attention during that time)
Maybe a 1-bit model like this and diffusion models for coding purposes might also go hand in hand; there are many experiments that can be done with this! (Also, many thanks to @freakynit for running the runpod; I think I learnt a lot about this model in particular because of it.)
TL;DR: I feel this model is good at writing, or at least better than usual, and it handles general-purpose questions well by default, but it's not good at producing HTML, which is fair. Good to see it's good at SQL, though I'm not sure how it would fare on ordinary coding tasks. Either way, it's an extremely fun model to play with!
(Edit: After some more tries, I was able to get one prototype out of it after Gemini held its hand and fed it the code/errors. It's not the best at this, but it works, just barely: https://gist.github.com/SerJaimeLannister/e90e8a134e4163f205...)
Yes, I think very constrained tasks (known data universe, well-known language, etc.) should be the best possible place for small language models to play.
On another point, I feel we'll see some really good fine-tuned models come out of this one; the community feels excited about the 1-bit LLM architecture. We're going to see some good innovation in this space in the near future.
For its size (1.2GB download) it's very impressive.
Here's a pelican it drew me running on my phone - the SVG comments are good, the image not so much: https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...
I tested the following locally using almost all available models, and not a single one got the right answer.
"What is 9:30 am (Taiwan Standard Time, TST) in US Pacific?"
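For reference, the correct conversion is easy to check with Python's `zoneinfo`. TST is UTC+8 with no DST, so the Pacific answer shifts by an hour depending on US daylight saving; a January date is used here:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# 9:30 am Taiwan Standard Time (UTC+8) on an arbitrary winter date
taipei = datetime(2025, 1, 15, 9, 30, tzinfo=ZoneInfo("Asia/Taipei"))
pacific = taipei.astimezone(ZoneInfo("America/Los_Angeles"))
print(pacific)  # 2025-01-14 17:30:00-08:00, i.e. 5:30 pm the previous day
```

In summer (PDT, UTC-7) the same clock time in Taiwan lands at 6:30 pm the previous day instead.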
<!-- Bicycle wheels -->
<circle cx="285" cy="130" r="5" fill="#81c784" />
<circle cx="315" cy="130" r="5" fill="#81c784" />
<circle cx="285" cy="160" r="5" fill="#81c784" />
<circle cx="315" cy="160" r="5" fill="#81c784" />
Did you ask for a pelican with a bicycle, or was that just an added bonus?

> Generate an SVG of a pelican riding a bicycle

https://ofo1j9j6qh20a8-80.proxy.runpod.net
./build/bin/llama-server \
-m ../Bonsai-8B.gguf \
-ngl 999 \
--flash-attn on \
--host 0.0.0.0 \
--port 80 \
--ctx-size 65500 \
--batch-size 512 \
--ubatch-size 512 \
--parallel 5 \
--cont-batching \
--threads 8 \
--threads-batch 8 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--log-colors on
The server can handle 5 parallel requests, with each request capped at around `13K` tokens.

A few benchmarks I did:
1. Input: 700 tokens, TTFT: ~0 seconds, output: 1822 tokens at ~190 t/s
2. Input: 6400+ tokens, TTFT: ~2 seconds, output: 2012 tokens at ~135 t/s
VRAM usage was consistently at ~4GiB.
https://ofo1j9j6qh20a8-80.proxy.runpod.net/#/chat/5554e479-0...
I'm contemplating whether I should drive or walk to the car wash (I just thought of that one HN post) and this is what it said after a few back-and-forths:
- Drive to the car (5 minutes), then park and wash.
- If you have a car wash nearby, you can walk there (2 minutes) and do the washing before driving to your car.
- If you're in a car wash location, drive to it and wash there.
Technically the last point was fine, but I like the creativity.
Then I found out they didn't implement AVX2 for their Q1_0_g128 CPU kernel. Added that and I'm getting ~12 t/s, which isn't shabby for this old machine.
Cool model.
$ ./build/bin/llama-cli -hf prism-ml/Bonsai-8B-gguf -p "Explain quantum computing in simple terms." -n 256 --temp 0.5 --top-p 0.85 --top-k 20 -ngl 99
> Explain quantum computing in simple terms.
\( ,
None ( no for the. (,./. all.2... the ..... by/
EDIT: It runs fine in their Colab notebook. Looking at that, you have to run `git checkout prism` (in the llama.cpp repo) before you build. That's a missing instruction if you're going straight to their fork of llama.cpp. Works fine now.

That trade loses outside tight edge deployments. Float formats stuck around for boring reasons: they handle ugly value ranges and they fit the GPU stack people already own.
This paper uses binary numbers only, even for training, with a solid theoretical foundation: https://proceedings.neurips.cc/paper_files/paper/2024/file/7...
TL;DR: They invent a concept called "Boolean variation" which is the binary analog to the Newton/Leibniz derivative. They are then able to do backpropagation directly in binary.
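This isn't the paper's Boolean-variation machinery, but a toy illustration of why the binary setting is so attractive at inference time: a dot product between two {-1, +1} vectors reduces to XOR plus popcount. The packing convention and function name here are my own:

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {-1, +1} vectors of length n, each packed
    into an integer (bit = 1 means +1, bit = 0 means -1, LSB first).

    Matching bits contribute +1 and differing bits -1, so:
        dot = n - 2 * popcount(a XOR b)
    """
    diff = (a_bits ^ b_bits) & ((1 << n) - 1)
    return n - 2 * bin(diff).count("1")

# a = 0b1011 encodes [+1, +1, -1, +1] (LSB first)
# b = 0b1101 encodes [+1, -1, +1, +1]
# elementwise products: +1, -1, -1, +1  ->  dot = 0
print(binary_dot(0b1011, 0b1101, 4))
```

One XOR and one popcount replace n multiply-adds, which is where the speed of 1-bit inference kernels comes from.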
> *Fathers of Harry and James Potter*: - Sirius Black is the *father* of *James Potter* (the older brother of Harry).
> - James Potter is *Harry's uncle* and the *older brother* of *Luna Lovegood*.
> - This means *Sirius and James are Harry's uncles*, though they are *father and brother*.
Especially considering that these models seem to be more or less just quantized variants of Qwen3 with custom kernels and other inference optimizations (?), rather than fine-tuned or trained from scratch with a new architecture, I am very surprised (or rather, suspicious) that they didn't do the obvious comparison with a quantized Qwen3.
Their (to my knowledge) new measure/definition of intelligence seems reasonable, but introducing something like this without thorough benchmarking + model comparison is even more of a red flag to me.
[0] https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-b...
(math: -log(error) / model_size = 1 <-> error = e^(-model_size))
These models are quite impressive for their size: even an older Raspberry Pi would be able to handle them.

There are still lots of uses for this kind of model.
The average of MMLU Redux, MuSR, GSM8K, HumanEval+, IFEval, and BFCLv3 for this model is 70.5, compared to 79.3 for Qwen3. That said, the model is also 16x smaller and 6x faster on a 4090, so it's a pretty respectable tradeoff.
I'd be interested in fine tuning code here personally
Large models do not require that kind of exactness. They are somewhat like a "field" or a "probability cloud": as long as the main directional tendency is correct, a few individual deviations—or even a whole cluster of them—make almost no difference.
Do I need to build their llama.cpp fork from source?
Looks like they only offer CUDA options in the release page, which I think might support CPU mode but refuses to even run without CUDA installed. Seems a bit odd to me, I thought the whole point was supporting low end devices!
Edit: 30 minutes of C++ compile time later, I got it running, although it uses 7GB of RAM and then hangs at "Loading model". I thought this thing was less memory-hungry than 4-bit quants?
Edit 2: Got the 4B version running, but at 0.1 tok/s, and the output seemed nonsensical. For comparison, on the same machine I can run the Qwen 3.5 4B model (at 4-bit quant) correctly and about 50x faster.
Though I couldn't get actual English output from it, so maybe something went wrong while running it.
You need to count the information contained in the non-DNA part of the cell too.
Just in case it's not obvious, you can't take human DNA and put it in a cat cell, it won't work, that cell won't replicate.
For now, the DNA replication and the synthesis of RNA and proteins using the information stored in DNA are the best understood parts about how a cell grows and divides, but how other complex cellular structures, e.g. membranes or non-ribosomal peptides, are assembled and replicated is much less understood.
We need more years of research, perhaps up to a decade or two, until we will be able to know the entire amount of information describing a simple bacterial cell, and perhaps more than that for a much more complex eukaryotic cell.
Second, what's even crazier is that roughly 98% of that DNA is non-coding, just junk.
So, we are talking about encoding the entirety of the logic to construct a human body in just around 16MB of data (roughly 3 billion base pairs x ~2% coding x 2 bits per base pair ~ 15MB)!
That's some crazy level of recursive compression; maybe it's embedding "varying" parsing logic, mixed with data, along the chain.
I think it's a myth that non-coding DNA is junk. See:
https://www.nature.com/articles/444130a
>'Non-coding' DNA may organize brain cell connections.
The parts of the DNA with known functions encode either proteins or RNA molecules, being templates for their synthesis.
The parts with unknown functions include some amount of true junk caused by various historical accidents that have been replicated continuously until now, but they also include a lot of DNA that seems to have a role in controlling how the protein or RNA genes are expressed (i.e. turning off or on the synthesis of specific proteins or RNAs), by mechanisms not well understood yet.
Can't wait to give it a spin with Ollama; if Ollama could list it as a model, that would be helpful.
The amount of progress they've been making is incredible.
Is anyone following this space more closely? Is anyone predicting performance at certain parameter sizes will plateau soon?
Unlike the frontier models, these don't seem to be showing many signs of slowing down.
I strongly think smaller models will end up being able to do most coding tasks in the future, once they are reined in properly.
It seems PrismML has implemented a better version of an idea I had a while back: what if we had a 1-bit model where the scale of the weight is determined by its position. The model would have to be trained from the ground up for this though, which is why I never tried it. The interleaved scale factor approach of Bonsai is a much more flexible approach at almost the same cost.
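A minimal sketch of what group-wise 1-bit quantization looks like, mirroring the "g128" in the fork's Q1_0_g128 kernel name but with a toy group size. This is my own reconstruction of the general technique, not their actual storage format:

```python
GROUP = 4  # the real kernel uses groups of 128 weights

def quantize_1bit(w):
    """For each group, keep one float scale (mean |w|) plus one sign
    bit per weight; everything else about the weight is discarded."""
    signs, scales = [], []
    for i in range(0, len(w), GROUP):
        group = w[i:i + GROUP]
        scales.append(sum(abs(x) for x in group) / len(group))
        signs.append([1 if x >= 0 else -1 for x in group])
    return signs, scales

def dequantize_1bit(signs, scales):
    """Reconstruct each weight as (sign) * (its group's scale)."""
    return [s * scale for group, scale in zip(signs, scales) for s in group]

w = [0.4, -0.2, 0.1, -0.3, 1.0, -2.0, 1.5, 0.5]
w_hat = dequantize_1bit(*quantize_1bit(w))
print(w_hat)  # every weight becomes +/- its group's mean magnitude
```

The interleaved per-group scales are what let a single-bit weight still represent wildly different magnitudes in different parts of the tensor, at a cost of one float per 128 weights.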
Got this nonsense answer: Interesting Scientific Discovery: The Discovery of the "Boson" in 1924 by Charles K. Alton ("Chuck") Alton
What it is: In 1924, American physicist Charles K. Alton discovered the "boson", a particle that would later be named after the Italian physicist Enrico Fermi. This discovery was made while he was working at the University of Chicago, where he was studying the properties of neutrons and protons.
Why it's significant: The boson Alton discovered was the meson, a subatomic particle that plays a crucial role in the strong nuclear force that holds atomic nuclei together. However, Alton's original work was more focused on the neutron, and he mistakenly identified the boson as a neutron—a common mistake at the time. This was not a scientific error, but rather a misunderstanding of the data. His work, however, laid the groundwork for the later discovery of mesons by Enrico Fermi and Richard Feynman, which was a major breakthrough in the field of particle physics.
Impact on Understanding:
• Mesons are now known to be the carriers of the strong force, which is one of the four fundamental forces of nature. • This discovery helped physicists understand the structure of the nucleus and the nature of nuclear binding. • It was a key step in the development of quantum chromodynamics (QCD), which describes the strong force at the subatomic level. • Alton's work is often overlooked in the history of particle physics, yet it was a critical early step in the field.
Why it's interesting: This story highlights how even early scientific discoveries can be misinterpreted or overlooked—sometimes not because of a lack of intelligence, but due to the complexity of the data and the evolving nature of scientific understanding. It also shows how one scientist's work can be built upon by others, leading to major breakthroughs.
Bonus: Alton was not the only person involved. His work was built upon earlier research by Edward Teller and Leopold Rabi, and it was ultimately Fermi who discovered the meson in 1932 using the "storm cloud" method, where he used a cloud chamber to observe the effects of high-energy particles.
So, while the boson Alton discovered was initially misidentified, it was a crucial piece of the puzzle in the development of modern particle physics.
Rather than learn about President Lincoln, the model can learn to look that info up with a search tool and use it to get better answers.
Just like a human does. I don't learn what 76x35 is... I learn that a calculator can give me that answer so I don't need to memorize it.
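A toy sketch of that pattern (a hypothetical tool registry of my own, not any real framework's API): the model emits a structured tool call instead of recalling the fact, and the runtime fills in the answer:

```python
# Hypothetical tool registry. The model is trained to emit
# {"tool": ..., "args": ...} instead of answering from memorized facts.
TOOLS = {
    # toy only; never eval untrusted input in real code
    "calculator": lambda expr: eval(expr, {"__builtins__": {}}),
    # stub standing in for a real search backend
    "search": lambda query: f"[top result for {query!r}]",
}

def run_tool_call(call: dict):
    """Dispatch a model-emitted tool call to the matching tool."""
    return TOOLS[call["tool"]](call["args"])

# The model doesn't need to know 76 x 35; it needs to know to ask:
print(run_tool_call({"tool": "calculator", "args": "76*35"}))  # 2660
```

A small model only has to learn *when* to emit which call, which is a much easier target than memorizing the facts themselves.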
I'm currently setting this one up; if it works well with a custom LoRA on top, I'll be able to run two at once for my custom memory management system :D
Their main contribution seems to be hyperparameter tuning, and they don't compare against other quantization techniques of any sort.