Does that make sense?
> As inhibitory synapses account for 20%-30%, this could map well to how biological brains are structured.
No concise HN comment will give you a complete picture of what's currently known about the human brain, so a platitude necessarily follows:
We call the nearly touching interfaces between neurons synapses; small packets (droplets) of neurotransmitter are sent across this interface from the source neuron to the target. Such signals can be excitatory (raising the probability that the target fires soon) or inhibitory (lowering it). There are two types of sensitive areas on your average neuron: the dendrites (long branching tentacles that receive mostly excitatory signals) and the cell body, where all the signals are accumulated into a local, instantaneous "sum". The cell body is also sensitive to synaptic activation, but the synapses there are inhibitory: when sufficiently inhibited, the neuron will refuse to fire its axon. So the inhibitory synapses on the cell body can gate the cumulative signal and temporarily prevent it from triggering the neuron. If the neuron does fire, the signal propagates along the axon (another type of branching tentacle) to yet other neurons, sometimes making an excitatory contact at their dendrites, sometimes an inhibitory contact at their cell body.
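As a toy illustration of that gating (a made-up point-neuron sketch, not anything physiologically accurate; the function and parameter names are mine):

```python
def neuron_fires(dendritic_inputs, soma_inhibition, threshold=1.0):
    """Toy point neuron: excitatory inputs sum on the dendrites,
    but inhibitory synapses on the cell body can veto the result."""
    drive = sum(dendritic_inputs)
    gated = drive - soma_inhibition  # cell-body inhibition gates the sum
    return gated >= threshold

# Enough excitation fires the neuron...
assert neuron_fires([0.6, 0.7], soma_inhibition=0.0)
# ...but cell-body inhibition can block the very same input.
assert not neuron_fires([0.6, 0.7], soma_inhibition=0.5)
```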
I hope that helped?
Also this is the direction the small LLMs are moving in already. They are too small for general knowledge, but getting quite good at tool use (incl. Googling).
Now we just need them to be very strict about what they know and don't know! (I think this is still an open problem, even with big ones.)
And I don't think such an LLM could just Google or check Wikipedia.
But I do agree that this architecture makes a lot of sense. I assume it will become the norm to use such edge LLMs.
While I understand some of the fundamental thoughts behind that comparison, it's slightly wonky... I'm not asking "compress wikipedia really well", but instead "can a 'model' reason its way through wikipedia" (and what does that reasoning look like?).
Theoretically, with wikipedia-multi-lang you should be able to nail machine translation reasonably well, but if everyone starts with "only Wikipedia", how well can they keep up with the wild-web-trained models on similar per-task benchmark charts?
If your particular training technique (using only wikipedia) can go from 60% of SOTA to 80% of SOTA on "Explain why 6-degrees of Kevin Bacon is relevant for tensor operations" (which is interesting to plug into Google's AI => Dive Deeper...), then that's a clue that it's not just throwing piles of data at the problem, but instead getting closer to extracting the deeper meaning (and/or reasoning!) that the data enables.
Maybe not crawl the web, but hit a service with pre-hosted, pre-curated content it can digest (and cache) that doesn't change often. You aren't necessarily using it for the latest news; programming is mostly static knowledge, as a good example.
> I often wonder is "what will be the minimally viable LLM" that can work from just enough information that if it googles the rest it can provide reasonable answers?
It depends what that word "reasonable" means for your specific use-case ;)
How? They can validate thousands if not millions of queries, but nothing prevents the million-and-first from being a hallucination. People who paid extra for an "Encyclopaedia Britannica validated LLM" would then, rightfully so IMHO, complain that "it" suggested they cook with a dangerous mushroom.
One bit or one trit? I am confused!
For something more practical, you can pack five three-state values within a byte because 3^5 = 243, which is smaller than 256. To unpack, you divide and modulo by 3 five separate times. This encodes data in bytes at 1.6 bits per symbol.
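A minimal sketch of that pack/unpack scheme (function names are mine):

```python
def pack5(trits):
    """Pack five base-3 digits (each 0..2) into one byte; 3**5 = 243 <= 256."""
    assert len(trits) == 5 and all(0 <= t <= 2 for t in trits)
    b = 0
    for t in reversed(trits):
        b = b * 3 + t
    return b

def unpack5(b):
    """Recover the five trits by dividing and taking modulo 3 five times."""
    out = []
    for _ in range(5):
        b, t = divmod(b, 3)
        out.append(t)
    return out

assert unpack5(pack5([2, 0, 1, 1, 2])) == [2, 0, 1, 1, 2]
```

8 bits / 5 symbols is where the 1.6 bits per symbol comes from.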
But the packing of 5 symbols into a byte was not done here. Instead, they packed 4 symbols into a byte to reduce computational complexity (no unpacking needed).
>packed 4 symbols into a byte
microslop, typical bunch of two-bit frauds!
So it's not an inference framework for 1-bit models (two states per parameter) but for 1.58-bit models (three states per parameter). Annoying that they try to mix up the two.
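The 1.58 figure is just the information content of a three-state value:

```python
import math

# A parameter with three possible states carries log2(3) bits of information.
bits_per_trit = math.log2(3)
assert round(bits_per_trit, 2) == 1.58
```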
I had some AI courses in uni in early 2010s and we were the last cohort that had Prolog and Logic Based AI.
Most interesting project was the final semester where we competed in teams to create the best team of bots for UE3 CTF.
``` Ecosystem Services and their impact on the Ecosystem
Ecosystem services refer to the services provided by ecosystems to the human society. These services include water, air, energy, nutrients, and soil (Jenkins, 2010). For instance, water is the most important service provided by an ecosystem and it helps in the conservation of water, irrigation and sanitation (Jenkins, 2010). On the other hand, air provides the oxygen needed for life.
The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans.
The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans. ```
Except on GSM8K and math...
If Microslop aren't gonna train the model themselves to prove their own thesis, why would others? They've had 2 years (I think?) to prove BitNet in at least some way, are you really saying they haven't tried so far?
Personally that makes it slightly worrisome to just take what they say at face value, why wouldn't they train and publish a model themselves if this actually led to worthwhile results?
But that doesn't mean the idea is worthless.
You could have said the same about Transformers: Google released them but didn't push forward, and it turned out to be a great idea.
I don't think you can. Google looked at the research results and continued researching Transformers and related technologies, because they saw the value, particularly in translation. Possible directions are part of the original paper; give it a read, it's relatively approachable for a machine learning paper :)
Sure, it took OpenAI to make it into an "assistant" that answered questions, but it's not like Google was completely sleeping on the Transformer, they just had other research directions to go into first.
> But that doesn't mean the idea is worthless.
I agree, they aren't, hope that wasn't what my message read as :) But, ideas that don't actually pan out in reality are slightly less useful than ideas that do pan out once put to practice. Root commentator seems to try to say "This is a great idea, it's all ready, only missing piece is for someone to do the training and it'll pan out!" which I'm a bit skeptical about, since it's been two years since they introduced the idea.
The core insight necessary for chatgpt was not scaling (that was already widely accepted): the insight was that instead of finetuning for each individual task, you can finetune once for the meta-task of instruction following, which brings a problem specification directly into the data stream.
It was fun to come up with creative ways to get it to answer your question or generate data by setting up a completion scenario.
I guess "chat" became the universal completion scenario. But I still feel like it could be "smarter" without the RLHF layer of distortion.
Google released transformers as research because they invented them while improving Google Translate. They had been running them for customers for years.
Beyond that, they had publicly used transformer-based LMs ("MUM") integrated into search before GPT-3 (pre-chat mode) was even trained. They were shipping transformer models generating text for years before the ChatGPT moment. Being literally available on the Google SERP is probably the widest deployment a technology can have today.
Transformers are also used widely in ASR technologies, like Google Assistant, which of course was available to hundreds of millions of users.
Finally, they had private-to-employees experimental LLMs, as well as various released research initiatives (Meena, LaMDA, PaLM, BERT, etc.) and other experiments; they just didn't productize everything (but see the earlier points). They even experimented with scaling (see the "Chinchilla scaling laws").
A successful ternary model would basically erase all that value overnight. In fact, the entire stock market could crash!
Think about it: This is Microsoft we're talking about! They're a convicted monopolist that has a history of manipulating the market for IT goods and services. I wouldn't put it past them to refuse to invest in training a ternary model or going so far as to buy up ternary startups just to shut them down.
Want to make some easy money: Start a business training a ternary model and make an offer to Microsoft. I bet they'll buy you out for at least a few million even if you don't have a product yet!
Occam’s Razor suggests this simply doesn’t yield as good results as the status quo
GLM 5, for example, runs 16-bit weights natively. That makes their 755B-parameter model ~1.5 TB in size, and its ~40B active parameters ~80 GB.
Compare this to Kimi K2.5: a 1T-parameter model, but with 4-bit weights (int4), which makes the model ~560 GB. Its 32B active parameters are ~16 GB.
Sure, GLM 5 is the stronger model, but is that worth paying for with 2-3x longer generation times? What about 2-3x more memory required?
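The back-of-the-envelope math behind those sizes (weights only, ignoring KV cache and runtime overhead):

```python
def weight_gb(params_billion, bits_per_weight):
    """Size of the weights alone: parameters * bits / 8, expressed in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

assert round(weight_gb(755, 16)) == 1510   # GLM 5 total: ~1.5 TB
assert round(weight_gb(40, 16)) == 80      # GLM 5 active: ~80 GB
assert round(weight_gb(1000, 4)) == 500    # Kimi K2.5 total, before overhead
assert round(weight_gb(32, 4)) == 16       # Kimi K2.5 active: ~16 GB
```

The quoted ~560 GB for Kimi K2.5 is plausibly the 500 GB of int4 weights plus embeddings and layers kept at higher precision.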
I think this barrel's bottom really hasn't been scraped.
The "new" banner on Hugging Face is for weights that were uploaded 11 months ago, and it's 2B params. Work on this in the repo is 2 years old.
The amount of publicity compared to the anemic delivery for BitNet is impressive.
You'd still need full GPUs for training, but for inference the hardware would be orders of magnitude simpler than what Nvidia is making
Interestingly, a trit x float multiplier is cheaper than a trit x integer multiplier in hardware if you're willing to ignore things like NaNs.
0 and 1 are trivial: just a mux for identity and zero. But because floats are sign-magnitude, multiplying by -1 is just an inverter for the sign bit, whereas for integers you need a bitwise inverter and a full incrementer.
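You can see the sign-magnitude point in software too: negating an IEEE-754 float only touches one bit, while two's-complement negation is invert-then-add-one. A sketch (ignoring NaN quirks, as noted above):

```python
import struct

def fneg_via_sign_bit(x):
    """Negate a float32 by XORing the IEEE-754 sign bit: one wire inverted."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits ^ 0x8000_0000))
    return y

def ineg_twos_complement(x, width=32):
    """Two's-complement negate: invert all bits, then a full increment."""
    return ((~x) + 1) & ((1 << width) - 1)

assert fneg_via_sign_bit(3.5) == -3.5
assert ineg_twos_complement(5) == 2**32 - 5
```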
The relevant trit arithmetic should be on display in the linked repo (I haven't checked). Or try working it out for the uncompressed 2-bit form with pen and paper; it's quite simple. Start with a couple of bitfields (inputs and weights) and a couple of masks, and see if you can figure it out without any help.
I happened to "live" at 7.0-7.5 tok/sec output speed for a while, and it is an annoying experience: the equivalent of walking behind someone slightly slower on a sidewalk. I dealt with this by deliberately looking away for a minute until output was "buffered" and only then starting to read.
For any local setup I'd try to reach for 10 tok/sec. Sacrifice some kv cache and shove a few more layers on your GPU, it's worth it.
In what way? On modern processors, a fused multiply-add (FMA) instruction generally has the exact same execution throughput as a basic addition instruction.
Typically, for 1-bit matmul you can get away with XORs and popcounts, which should have a better throughput profile than FMA when taking into account the SIMD nature of the inputs/outputs.
actv = A[_:1] & B[_:1]
sign = A[_:0] ^ B[_:0]
dot = pop_count(actv & !sign) - pop_count(actv & sign)
It can probably be made more efficient by taking a column-first format.
Since we are in CPU land, we mostly deal with dot products sized to the cache; I don't assume we have a tiled matmul instruction, which would be unlikely to support this weird 1-bit format anyway.
l1 = dot(A[:11000000], B[:11000000])
l2 = dot(A[:00110000], B[:00110000])
l3 = dot(A[:00001100], B[:00001100])
l4 = dot(A[:00000011], B[:00000011])
result = l1 + l2 * 4 + l3 * 16 + l4 * 64
which is 8 bit ops and 4x8 bit dots, which is likely 8 clocks with less serial dependence
So it's not that individual ops are faster — it's that the packed representation lets each instruction do more useful work, and you're moving far less data from memory to do it.
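The actv/sign trick from the pseudocode above can be sketched in plain Python on packed bitmasks (my naming convention: bit i of each mask is element i, an "active" bit set means the element is nonzero, a "sign" bit set means it is negative):

```python
def packed_dot(a_act, a_sign, b_act, b_sign):
    """Dot product of two vectors with entries in {-1, 0, +1}, each packed
    as an 'active' bitmask and a 'sign' bitmask."""
    actv = a_act & b_act    # elementwise product is nonzero only where both are
    sign = a_sign ^ b_sign  # sign of each elementwise product
    pos = bin(actv & ~sign).count("1")  # count of +1 products
    neg = bin(actv & sign).count("1")   # count of -1 products
    return pos - neg

# a = [+1, -1, 0, +1], b = [+1, +1, +1, -1]  ->  dot = 1 - 1 + 0 - 1 = -1
assert packed_dot(0b1011, 0b0010, 0b1111, 0b1000) == -1
```

On real SIMD hardware the two popcounts become vector popcount instructions over whole registers, which is where the throughput win over FMA comes from.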
I'm hoping that today's complaints are tomorrow's innovations. Back when a 1 MB hard drive cost $100,000, or when Gates allegedly said 640 KB was enough.
Perhaps someone 'in the (chip) industry' can comment on what RAM manufacturers are doing at the moment - better, faster, larger? Or is there not much headroom left, and it's down to MOBO manufacturers and volume?
AMD actually used HBM2 memory in their Radeon VII card back in 2019 (!!) for $700. It had 16 GB of HBM2 memory with 1 TB/s throughput.
The RTX 5080, in comparison, also has 16 GB of VRAM, but was released in 2025 and has 960 GB/s throughput. The RTX 5090 does have an edge at 1.8 TB/s bandwidth and 32 GB of VRAM, but it also costs several times more. Imagine if GPUs had gone down the path of the Radeon VII.
That being said, the data center cards from both are monstrous.
The Nvidia B200 has 180 GB of VRAM (2x90GB) offering 8.2 TB/s bandwidth (4.1 TB/s x2) released in 2024. It just costs as much as a car, but that doesn't matter, because afaik you can't even buy them individually. I think you need to buy a server system from Nvidia or Dell that will come with like 8 of these and cost you like $600k.
AMD has the Mi series. Eg AMD MI325x. 288 GB of VRAM doing 10 TB/s bandwidth and released in 2024. Same story as Nvidia: buy from an OEM that will sell you a full system with 8x of these (and if you do get your hands on one of these you need a special motherboard for them since they don't do PCIe). Supposedly a lot cheaper than Nvidia, but still probably $250k.
These are not even the latest and greatest for either company. The B300 and Mi355x are even better.
It's a shame about the socket for the Mi series GPUs (and the Nvidia ones too). The Mi200 and Mi250x would be pretty cool to get second-hand. They are 64 GB and 128 GB VRAM GPUs, but since they use the OAM socket you need the special motherboard to run them. They're from 2021, so in a few years' time they will likely be replaced, but as a regular joe you likely can't use them.
The systems exist, you just can't have them, but you can rent them in the cloud at about $2-4 per hour per GPU.
The last logical step of this process would be figuring out how to mix the CPU transistors with the RAM capacitors on the same chip as opposed to merely stacking separate chips on the same package.
A related stopgap is the AI startup (I forget which) making accelerators on giant chips full of SRAM. Not a cost-effective approach outside of ML.
But it seems the biggest model available is 10B? Somewhat unusual and does make me wonder just how challenging it will be to train any model in the 100B order of magnitude.
The key insight of the BitNet paper was that using their custom BitLinear layer instead of normal Linear layers (along with some training and architecture changes) leads to much, much better results than quantizing an existing model down to 1.58 bits. So you end up doing a full training run in bf16 precision using the specially adapted model architecture.
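The weight-quantization half of that idea can be sketched as absmean ternarization, my paraphrase of the BitNet b1.58 paper rather than the repo's actual code (during training, the forward pass uses these ternary weights while gradients flow through to bf16 master weights via a straight-through estimator):

```python
def absmean_ternarize(w):
    """Map a weight matrix to {-1, 0, +1} * scale: divide by the mean
    absolute value, round, and clip (sketch of BitNet b1.58's absmean)."""
    flat = [abs(x) for row in w for x in row]
    scale = sum(flat) / len(flat) + 1e-8  # epsilon avoids divide-by-zero
    q = [[max(-1, min(1, round(x / scale))) for x in row] for row in w]
    return q, scale

w = [[0.4, -1.2, 0.05], [0.9, -0.1, 2.0]]
q, s = absmean_ternarize(w)
assert q == [[1, -1, 0], [1, 0, 1]]  # only -1 / 0 / +1 survive
```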
(only suggesting that it's intentional because it's been there so long)
> We evaluated bitnet.cpp in terms of both inference speed and energy cost. Comprehensive tests were conducted on models with various parameter sizes, ranging from 125M to 100B. Specific configurations for each model are detailed in Appendix A.
Boy would I love to give my agent access to my Quickbooks. They pushed out an incomplete MCP and haven't touched it since.
However, this user uses — in almost all their posts, and they were posting at a rate of about 1 comment per minute across multiple different topics.
Edit: oh, just recalled dang restricted Show HNs the other day to only non-new users (possibly with some other thresholds). I wonder if word got out and some are filling accounts with activity.
It feels like it'd take someone superhuman to come across different stories, have such opinions and type and submit both of these in that timeframe or queuing up comments to post rapid-fire.
Conspicuous too, as another commenter pointed out, is that every single comment of yours uses an em dash, which, despite occasionally using them myself (hey look, they're in this reply), does not happen in every single comment. Idk, if I were being seriously accused of botting I'd put more reasoning into my response about it.
COVID was ridiculous as I presume a lot of anxious people were stuck at home able to do nothing but post.
They read this article, called out a specific discrepancy, then commented on an arXiv paper in a 70-odd-word post, then the same minute posted another 70-odd words on a different technical story. Maybe, like you suggest, they're just wired differently.
I suspect that they are trying to fake engagement prior to making their first "show" post as well.
It's not a question of if there are other bots out there, but only what % of comments on HN right now and elsewhere are bot generated. That number is only going to increase if nothing is done.
And I did not speak out
Because I was not using em dashes
Then they claimed that if you're crammar is to gud you r not hmuan
And I did not spek aut
Because mi gramar sukcs
Then they claimed that if you actually read the article that you are trying to discuss you are not human...
The confusion and frustration created will make it much harder for most people to separate signal from noise.
Not only are we losing the ability to communicate clearly without the assistance of computers, those who can are being punished for it.
Residential Treatment Facility for Adults? Red Tail Flight Academy?
Can we stop already with these decimals and just call it "1 trit", which is exactly what it is?
Also, aren't there different affinities for 8-bit vs 4-bit inference?
I imagine you got 96gb because you thought you'd be running models locally? Did you not know the phrase Unified Memory is marketing speak?
Also as far as I know, this is more of a research curiosity - BitNet really doesn't perform that well on evals.
I think Qwen3.5 2B is the best you can get in the ~1GB class.
Furthermore, it was published 11 months ago, it's not a new release.
A few months ago I used Whisper from OpenAI, an automatic speech recognition system released in 2022, on my modern 20-core Intel CPU to convert audio from a video file to text. It worked fine. It took a while, and the machine got hot and the fans kicked in. I then found Intel's optimized version of Whisper that used the NPU. It required a lot more steps to get working, but in the end it did work and was about 6x faster. And the machine remained cool and silent in the process. Since then I have become a fan of NPUs. They are no NVIDIA GeForce RTX 5090, but they are significantly better than a modern CPU.
A claude 4.6 they are most certainly not, but if you get through the janky AF software ecosystem they can run small LLMs reasonably well with basically zero CPU/GPU usage
I was under the impression that they were primarily designed for low power use.
I had the same question. After some debates with ChatGPT: it's not the post-training quantization we often see these days; you have to use ternary weights from the start of pre-training.
Seems like that could end up as a situation where a fractional number of bits or bytes per parameter might make sense. Particularly with adverbs and adjectives, negators.
With how much RAM? How much storage does it require?
The engineering/optimization work is nice, but it's not what people have been waiting for, which is: can the BitNet idea that seemed so promising really deliver in a competitive way?
It seems to keep repeating that the water cycle is the main source of energy for all living things on the planet, and then citing Jenkins 2010. There are also a ton of sentences beginning with "It also…"
I don’t even think it’s correct. The sun is the main source of energy for most living things but there’s also life near hydrothermal vents etc.
I don’t know who Jenkins is, but this model appears to be very fond of them and the particular fact about water.
I suppose fast and inaccurate is better than slow and inaccurate.
If you have an existing network, making an int4 quant is the better tradeoff. 1.58b quants only become interesting when you train the model specifically for it
On the other hand maybe it works much better than expected because llama3 is just a terrible baseline
There's a lot that you can do when the model size is that small, yet still powerful.
Our next step is that we want to put up a content distribution network for it where people can also share their diffs for their own fine-tuned model. I'll post the project if we finish all the parts.
[1] https://www.youtube.com/live/x791YvPIhFo?is=NfuDFTm9HjvA3nzN
demo shows a huge love for water, this AI knows its home
My disappointment is immeasurable and my day is ruined.