Unsloth Dynamic 2.0 GGUFs
42 points by tosh 2 hours ago | 7 comments | unsloth.ai | HN
Maxious
1 hour ago
ICYMI, Unsloth has had some major breakthroughs today with the Qwen3.5 local models: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

With the Qwen3.5 35B A3B at Q4, I've got 200k context running at 62.98 tokens per second on a local RTX 5080 16GB.

mirekrusin
22 minutes ago
2x RTX 4090, Q8, 256k context, 110 t/s
Kayou
1 hour ago
Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn't know that was possible. I was always restricting myself to models smaller than the VRAM I had.
Maxious
56 minutes ago
Yep. These Mixture of Experts models are well suited for paging in only the relevant data for a certain task https://huggingface.co/blog/moe

There are also experiments with simply removing or merging experts after training to shrink models even more: https://bknyaz.github.io/blog/2026/moe/
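For the "20GB on a 16GB GPU" question, here is a rough back-of-envelope sketch, not a measurement. It assumes "35B" means about 35e9 total weights, that "A3B" means about 3e9 weights are active per token, and that a Q4-class file averages 4.8 bits per weight. That last figure is a guess picked to land near the "more than 20GB" mentioned above, and `weight_bytes` is a made-up helper.

```python
# Back-of-envelope only: every number here is an assumption, not a benchmark.
GB = 1e9  # decimal gigabytes, to match how file sizes are usually quoted

def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate storage for `params` weights at an average bit width."""
    return params * bits_per_weight / 8

total_params = 35e9   # "35B" in the model name
active_params = 3e9   # "A3B": assumed ~3B weights used per token
bpw = 4.8             # assumed average bits per weight for a Q4-class GGUF

full = weight_bytes(total_params, bpw)
active = weight_bytes(active_params, bpw)

print(f"all weights:        {full / GB:.1f} GB")
print(f"touched per token:  {active / GB:.1f} GB")
```

Under those assumptions the whole file is bigger than 16GB of VRAM, but the slice of weights actually read for any one token is a small fraction of it, which is why keeping part of the model in system RAM can still be reasonably fast.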

segmondy
1 hour ago
llama.cpp is designed for partial offloading: the most important parts of the model are loaded onto the GPU and the rest stays in system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU VRAM.
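A minimal sketch of the partial-offloading arithmetic, assuming all layers are the same size and that a fixed chunk of VRAM is set aside for the KV cache and scratch buffers. Neither assumption holds exactly in practice, `layers_on_gpu` is a hypothetical helper, and the layer count and sizes below are made up for illustration. It does not model how llama.cpp decides which tensors matter most.

```python
def layers_on_gpu(n_layers: int, layer_bytes: int,
                  vram_bytes: int, reserved_bytes: int) -> int:
    """How many equal-sized layers fit in VRAM after a fixed reservation."""
    usable = max(0, vram_bytes - reserved_bytes)
    return min(n_layers, usable // layer_bytes)

# Hypothetical numbers: a 21 GB file split evenly over 40 layers,
# a 16 GB card, and 4 GB held back for KV cache and buffers.
n_layers = 40
layer_bytes = 21_000_000_000 // n_layers
gpu_layers = layers_on_gpu(n_layers, layer_bytes,
                           vram_bytes=16_000_000_000,
                           reserved_bytes=4_000_000_000)

print(f"{gpu_layers} of {n_layers} layers on GPU, "
      f"{n_layers - gpu_layers} in system RAM")
```

In llama.cpp, a count like this is what you would pass as `-ngl` / `--n-gpu-layers`. In practice people tend to find the right value by trial, since context length changes how much VRAM has to be reserved.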
jychang
1 hour ago
Not really breakthroughs, more like bugfixes for their broken first batch.
qskousen
23 minutes ago
This is pretty interesting. Based on the blog post, it seems they are using a technique similar to the one I have been using to generate "layer sensitivity" data in my (still pretty beta) ggufy project, which is aimed more at diffusion (image) models. https://github.com/qskousen/ggufy
tenpa0000
54 minutes ago
I run Llama 3.2 3B locally for latency-sensitive classification (sub-50ms, so no room for bigger models). At that scale Q2_K vs Q4_K_M isn't just smaller — Q2 starts flipping yes/no answers that Q4 gets right. Not often, but enough to notice in production.

So the KL divergence numbers here are more useful to me than the MMLU tables, honestly. I've had MMLU hold steady while the output distribution drifted enough to break things downstream.
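A toy sketch of how that can happen, using made-up probabilities for a yes/no classifier rather than real model outputs. An accuracy-style score only looks at the top answer, while KL divergence compares the whole distribution, so the "mild drift" case below scores identically on accuracy but still shows up in KL. The helper names are hypothetical, and the function assumes the quantized distribution has no zero entries.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def argmax(dist):
    """Index of the most likely outcome."""
    return max(range(len(dist)), key=dist.__getitem__)

labels = ["yes", "no"]
reference = [0.90, 0.10]   # hypothetical full-precision output
mild_drift = [0.60, 0.40]  # hypothetical quant: same answer, less confident
flipped = [0.45, 0.55]     # hypothetical quant: the answer flips

for name, quant in [("mild drift", mild_drift), ("flipped", flipped)]:
    same = argmax(quant) == argmax(reference)
    print(f"{name}: answer={labels[argmax(quant)]}, "
          f"matches reference={same}, "
          f"KL={kl_divergence(reference, quant):.3f}")
```

If anything downstream thresholds on the probability (say, only acting on "yes" above 0.8), the mild-drift case already changes behavior even though the top answer never moved.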

Does the calibration dataset make much difference at 3B though? There's so little redundancy that I'd expect it to hit a floor pretty fast regardless of how good the calibration data is.

zozbot234
39 minutes ago
For a simple classification task you generally want to prioritize regularization over more sophisticated behavior, so fewer parameters at a higher-bit quantization makes sense. For more generic chat-like purposes, Q2 of a larger model may often be preferable to Q4 of a smaller one.
am17an
24 minutes ago
What do you use for sub-50ms inference?
Havoc
1 hour ago
Advances in this space are always welcome.

I see the change in KLD values is pretty modest versus the prior version. Does anyone know how that translates to real-world use? Is it more of a linear relationship, or exponential, etc.?

electroglyph
59 minutes ago
Cheers Daniel and Mike and team, keep up the good work!
dyl000
52 minutes ago
So Q6 is practically perfect, and Q3 is meaningfully decent. Very impressive!
jychang
1 hour ago
What's up with this post? It's a link to something which has existed for a long time, and there's a bunch of dead comments below. Some weird SEO campaign thing?
tosh
1 hour ago
Unsloth have just released benchmarks on how their dynamic quants perform for Qwen 3.5

https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

jychang
1 hour ago
I'm aware of that, but that's not the link of the post. The post is linking to their UD 2.0 quants from a few months back.

Also, the benchmarks exist because they messed up the first version of their quants for Qwen 3.5 by quantizing some tensors to MXFP4 that should have been kept at higher precision, and this is their bugfix. The post literally starts out with "We updated Qwen3.5-35B Unsloth Dynamic quants being SOTA on nearly all bits" without explaining WHY they needed to update from the original version.

lostmsu
54 minutes ago
Looking at their benchmarks, there doesn't appear to be a meaningful difference between their quants and bartowski's quants.