I ran it on an M5 Pro with 128GB of RAM, but it only needs ~20GB of that. I expect it will run OK on a 32GB machine.
Performance numbers:
Reading: 20 tokens, 0.4s, 54.32 tokens/s
Generation: 4,444 tokens, 2min 53s, 25.57 tokens/s
I like it better than the pelican I got from Opus 4.7 the other day: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/Can you run your other tests and see the difference?
https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...
But GLM 5.1 is a 1.51TB model, the Qwen 3.6 I used here was 17GB - that's 1/88 the size.
And by the way: Thanks for relentlessly holding new models’ feet to the pelican SVG fire.
1. You can run this on a Mac using llama-server and a 17GB downloaded file
2. That version does indeed produce output (for one specific task) that's of a good enough quality to be worth spending more time checking out this model
3. It generated 4,444 tokens in 2min 53s, which is 25.57 tokens/s
* er, that probably sounds strange, but I did just spend 6 weeks working on integrating the Willison Trifecta for my app I've been building for 2.5 years, and I considered it a release blocker. It's a simple mental model that is a significant UX accomplishment IMHO.
Missing an opportunity here, lol.
The trend went to MoE model for some times and this time around is dense model again. I wonder if closed models are also following this trend: MoE for faster ones and dense for pro model.
Can you replace Claude Code Opus or Codex with this?
Does it feel >80% as good on "real world" tasks you do on a day to day basis.
But every time a local model gets me by - I feel closer to where I should be; writing code should still be free. Both free as in free beer, and free as in freedom.
My setup is a seperate dedicated Ubuntu machine with RTX 5090. Qwen 3.6:27b uses 29/32gb of vram when its working right this minute. I use Ollama in a non root podman instance. And I use OpenCode as ACP Service for my editor, which I highly recommend. ACP (Agent Client Protocol) is how the world should be in case you were asking, which you didnt :)
Exciting times and thank you Qwen team for making the world a better place in a world of Sam Altmans.
This is why they don’t advertise which consumer hardware it can run on: Their direct release that delivers these results cannot fit on your average consumer system.
Most consumers don’t run the model they release directly. They run a quantized model that uses a lower number of bits per weight.
The quantizations come with tradeoffs. You will not get the exact results they advertise using a quantized version, but you can fit it on smaller hardware.
The previous 27B Qwen3.5 model had reasonable performance down to Q5 or Q4 depending on your threshold for quality loss. This was usable on a unified memory system (Mac, Strix Halo) with 32GB of extra RAM, so generally a 64GB Mac. They could also be run on an nVidia 5090 with 32GB RAM or a pair of 16GB or 24GB GPUs, which would not run as fast due to the split.
Watch out for some of the claims about running these models on iPhones or smaller systems. You can use a lot of tricks and heavy quantization to run it on very small systems but the quality of output will not be usable. There is a trend of posting “I ran this model and this small hardware” repos for social media bragging rights but the output isn’t actually good.
Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL. Will they differ significantly? What are each of them good at? The 4-bit quantizations will be a "tight squeeze" on your 20GB GPU. Again, Unsloth steps up to the plate with seven(!!) choices: IQ4_XS, Q4_K_S, IQ4_NL, Q4_0, Q4_1, Q4_K_M, UD-Q4_K_XL. Holy shit where do I even begin? You can try each of them to see what fits on your GPU, but that's a lot of downloading, and then...
Once you [guess and] commit to one of the quantizations and do a gigantic download, you're not done fiddling. You need to decide at the very least how big a context window you need, and this is going to be trial and error. Choose a value, try to load the model, if it fails, you chose too large. Rinse and repeat.
Then finally, you're still not done. Don't forget the parameters: temperature, top_p, top_k, and so on. It's bewildering!
1. Auto best official parameters set for all models
2. Auto determines the largest quant that can fit on your PC / Mac etc
3. Auto determines max context length
4. Auto heals tool calls, provides python & bash + web search :)
There are actually two problems with this:
First, the 3-bit quants are where the quality loss really becomes obvious. You can get it to run, but you’re not getting the quality you expected. The errors compound over longer sessions.
Second, you need room for context. If you have become familiar with the long 200K contexts you get with SOTA models, you will not be happy with the minimal context you can fit into a card with 16-20GB of RAM.
The challenge for newbies is learning to identify the difference between being able to get a model to run, and being able to run it with useful quality and context.
My R9700 does seem to have an annoying firmware or driver bug[0] that causes the fan to usually be spinning at 100% regardless of temperature, which is very noisy and wastes like 20+ W, but I just moved my main desktop to my basement and use an almost silent N150 minipc as my daily driver now.
[0] Or manufacturing defect? I haven't seen anyone discussing it online, but I don't know how many owners are out there. It's a Sapphire fwiw. It does sometimes spin down, the reported temperatures are fine, and IIRC it reports the fan speed as maxed out, so I assume software bug where it's just not obeying the fan curve
typically those dense models are too slow on Strix Halo to be practical, expect 5-7 tps
you can get an idea by looking at other dense benchmarks here: https://strixhalo.zurkowski.net/experiments - i'd expect this model to be tested here soon, i don't think i will personally bother
EDIT: I'm running the Unsloth Qwen3.6-27B-Q6_K GGUF on a Corsair Strix Halo 128GB I bought summer 2025.
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/blob/main/Qw...
GTR 9 Pro, "performance" profile in BIOS, GTT instead of GART, Fedora 44
(Intel Core i7 4790K @ 4 Ghz, nVidia GTX Titan Black, 32 GB 2400 MHz DDR3 memory)
Edit: Just tested the new Qwen3.6-27B-Q5_K_M. Got 1.4 tokens per second on "Create an SVG of a pellican riding a bicycle." https://gist.github.com/Wowfunhappy/53a7fd64a855da492f65b4ca...
Making the the right pick for model is one of the key problems as a local user. Do you have any references where one can see a mapping of problem query to model response quality?
Otherwise no need for full fp16, int8 works 99% as well for half the mem, and the lower you go the more you start to pay for the quants. But int8 is super safe imo.
You absolutely do NOT need a $3000 Strix Halo rig or a $4000 Mac or a $9000 RTX 6000 or "multiple high memory consumer GPUs" to run this model at extremely high accuracy. I say this as a huge Strix Halo fanboy (Beelink GTR 9 Pro), mind you. Where Strix Halo is more necessary (and actually offers much better performance) are larger but sparse MoE models - think Qwen 3.5 122B A10B - which offers the total knowledge (and memory requirements) of a 122B model, with processing and generation speed more akin to a 10B dense model, which is a big deal with the limited MBW we get in the land of Strix Halo (256 GB/s theoretical, ~220 GB/s real-world) and DGX Spark (273 GB/s theoretical - not familiar with real-world numbers specifically off the top of my head).
I would make the argument, as a Strix Halo owner, that 27B dense models are actually not particularly pleasant or snappy to run on Strix Halo, and you're much better off with those larger but sparse MoE models with far fewer active parameters on such systems. I'd much rather have an RTX 5090, an Arc B70 Pro, or an AMD AI PRO R9700 (dGPUs with 32GB of GDDR6/7) for 27B dense models specifically.
That said, my Strix Halo rig only has PCIe 4.0 for my NVMe, and I'm using a 990 Evo that had poor sustained random read, being DRAM-less. My effective read speeds from disk were averaging around 1.6-2.0 GB/s, and with unsloth's K2.5, even in IQ2_XXS at "just" 326 GB, with ~64 GB worth of layers in iGPU and the rest free for KV cache + checkpoints. Even still, that was over 250 GB of weights streaming at ~2 GB/s, so I was getting 0.35 PP tok/s and 0.22 TG tok/s.
I could go a little faster with a better drive, or a little faster still if I dropping in two of em in raid0, but it would still be on the order of magnitude of sub-1 tok/s PP (compute limited) and TG (bandwidth limited).
This is not a little faster, but 10 times faster than on your system. So a couple of tokens per second generation speed should be achievable.
Nowadays even many NUCs or NUC-like mini-PCs have such SSD slots.
I have actually started working at optimizing such an inference system, so your data is helpful for comparison.
While many other NUCs may support them, what most of them lack compared to Strix Halo is a 128 GB pool of unified LPDDR5x-8000 on a 256 bit bus and the Radeon 8060S iGPU with 40 CU of RDNA 3.5, which is roughly equivalent in processing power to a laptop 4060 or desktop 3060.
The Radeon 780M and Radeon 890M integrated graphics that come on most AMD NUCs don't hold a candle to Strix Halo's 8060S, and what little you'd gain in this narrow use case with PCIe gen 5, you'd lose a lot in the more common use cases of models that can fit into a 128 GB pool of unified memory, and there are some really nice ones.
Also, the speeds you're suggesting seem rather optimistic. Gen 5 drives, as I understand, hit peak speeds of about 28-30 GB/s (with two in RAID0, at 14-15 GB/s each), but that's peak sequential reads, which is neither reflective of sustained reads, nor the random read workloads that dominate reading model weights.
Maybe there are some Intel NUCs that compete in this space that I'm less up to speed with which do support PCIe 5. I know Panther Lake costs about as much to manufacture as Strix Halo, and while it's much more power efficient and achieves a lot more compute per Xe3 graphics core than Strix Halo achieves per RDNA 3.5 CU, they Panther Lake that's actually shipping ships with so many fewer Xe3 cores that it's still a weaker system overall.
Maybe DGX Spark supports PCIe 5.0, I don't own one and am admittedly not as familiar with that platform either, though it's worth mentioning that the price gap between Strix Halo and DGX Spark at launch ($2000 vs $4000) has closed a bit (many Strix Halo run $3000 now, vs $4700 for DGX Spark, and I think some non-Nvidia GB10 systems are a bit cheaper still)
Seems like nobody wants to admit they exclude working class from the ride.
llama-server \
-hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
--no-mmproj \
--fit on \
-np 1 \
-c 65536 \
--cache-ram 4096 -ctxcp 2 \
--jinja \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking": true}'
35B-A3B model is at ~25 t/s. For comparison, on an A100 (~RTX 3090 with more memory) they fare respectively at 41 t/s and 97 t/s.I haven't tested the 27B model yet, but 35B-A3B often gets off rails after 15k-20k tokens of context. You can have it to do basic things reliably, but certainly not at the level of "frontier" models.
(Btw I believe the "--jinja" flag is by default true since sometime late 2025, so not needed anymore)
| model | size | params | backend | threads | test | t/s |
| ------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q4_K_M | 15.65 GiB | 26.90 B | BLAS,MTL | 4 | pp512 | 61.31 ± 0.79 |
| qwen35 27B Q4_K_M | 15.65 GiB | 26.90 B | BLAS,MTL | 4 | tg128 | 5.52 ± 0.08 |
| qwen35moe 35B.A3B Q3_K_M | 15.45 GiB | 34.66 B | BLAS,MTL | 4 | pp512 | 385.54 ± 2.70 |
| qwen35moe 35B.A3B Q3_K_M | 15.45 GiB | 34.66 B | BLAS,MTL | 4 | tg128 | 26.75 ± 0.02 |
So ~60 for prefill and ~5 for output on 27B and about 5x on 35B-A3B.Sure it's order of magnitude faster (10x on Apple Metal?) but there's also order of magnitude more tokens to process, especially for tasks involving summarization of some sort.
But point taken that the parent numbers are probably decode
* Specifically, Mac metal, which is what parent numbers are about
It's frustrating when trying to find benchmarks because almost everyone gives decode speed without mentioning prefill speed.
I am wondering how to measure that anyway.
I tried the other qwen models and the reasoning stuff seems to do more harm than good.
For more a detailed analysis, there are several online VRAM calculators. Here's one: https://smcleod.net/vram-estimator/
If you have a huggingface account, you can set your system configuration and then you get little icons next to each quant in the sidebar. (Green: will likely fit, Yellow: Tight fit, Red: will not fit)
Further, t/s depends greatly on a lot of different factors, the best you might get is a guess based on context size.
One thing about running local LLMs right now, is that there are tradeoffs literally everywhere and you have to choose what to optimize for down to the individual task.
For example, the one you linked, when I provide a Qwen3.5 27B Q_4_M GGUF [0], says that it will require 338 GB of memory with 16-bit kv cache. That is wrong by over an order of magnitude.
[0] https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF/resol...
It's a shame that search is so polluted these days that it's impossible to find good tools like yours.
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/discussions/...
The higher quantization - the better results, but more memory is needed. Q8 is the best.
"--tensor-parallel-size", "2" - spread the LLM weights over 2 GPU's available
"--max-model-len", "90000" - I've capped context window from ~256k to 90k. It allows us to have more concurrency and for our use cases it is enough.
"--kv-cache-dtype", "fp8_e4m3", - On an L4 cuts KV cache size in half without a noticeable drop in quality, does not work on a5000, as it has no support for native FP8. Use "auto" to see what works for your gpu or try "tq3" once vllm people merge into the nightly.
"--enable-prefix-caching" - Improves time to first output.
"--speculative-config", "{\"method\":\"qwen3_next_mtp\",\"num_speculative_tokens\":2}", - Speculative mutli-token prediction. Qwen3.5 specific feature. In some cases provides a speedup of up to 40%.
"--language-model-only" - does not load vision encoder. Since we are using just the LLM part of the model. Frees up some VRAM.
I don't use any non-FLOSS dev tools; why would I suddenly pay for a subscription to a single SaaS provider with a proprietary client that acts in opaque and user hostile ways?
But further, seeing with Claude, your workflow, or backend or both, arn't going anywhere if you're building on local models. They don't suddenly become dumb; stop responding, claim censorship, etc. Things are non-determinant enough that exposing yourself to the business decisions of cloud providers is just a risk-reward nightmare.
So yeah, privacy, but also, knowing you don't have to constantly upgrade to another model forced by a provider when whatever you're doing is perfectly suitable, that's untolds amount of value. Imagine the early npm ecosystem, but driven now by AI model FOMO.
And the other thing is that i want people to be able to experiment and get familiar with LLM's without being concerned about security, price or any other factor.
It that with some kind of speculative decoding? Or total throughput for parallel requests?
The 4-bit quants are far from lossless. The effects show up more on longer context problems.
> You can probably even go FP8 with 5090 (though there will be tradeoffs)
You cannot run these models at 8-bit on a 32GB card because you need space for context. Typically it would be Q5 on a 32GB card to fit context lengths needed for anything other than short answers.
build/bin/llama-server \
-m ~/models/llm/qwen3.6-27b/qwen3.6-27B-q8_0.gguf \
--no-mmap \
--n-gpu-layers all \
--ctx-size 131072 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--jinja \
--no-mmproj \
--parallel 1 \
--cache-ram 4096 -ctxcp 2 \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking": true}'
Should fit nicely in a single 5090: self model context compute
30968 = 25972 + 4501 + 495
Even bumping up to 16-bit K cache should fit comfortably by dropping down to 64K context, which is still a pretty decent amount. I would try both. I'm not sure how tolerant Qwen3.5 series is of dropping K cache to 8 bits.You probably can actually. Not saying that it would be ideal but it can fit entirely in VRAM (if you make sure to quantize the attention layers). KV cache quantization and not loading the vision tower would help quite a bit. Not ideal for long context, but it should be very much possible.
I addressed the lossless claim in another reply but I guess it really depends on what the model is used for. For my usecases, it's nearly lossless I'd say.
This isn't the first open-weight LLM to be released. People tend to get a feel for this stuff over time.
Let me give you some more baseless speculation: Based on the quality of the 3.5 27B and the 3.6 35B models, this model is going to absolutely crush it.
TLDR: If you have 14GB of VRAM, you can try out this model with a 4-bit quant.
Tokens per second is an unreasonable ask since every card is different, are you using GGUF or not, CUDA or ROCm or Vulkan or MLX, what optimizations are in your version of your inference software, flags are you running, etc.
Note that it's a dense model (the Qwen models have another value at the end of the MoE model names, e.g. A3B) so it will not run very well in RAM, whereas with a MoE model, you can spill over into RAM if you don't have enough VRAM, and still have reasonable performance.
Using these models requires some technical know-how, and there's no getting around that.
They almost certainly run these benchmarks on their own cloud infrastructure (Alibaba afaik), which is typically not hardware that even the most enthusiastic homelab hobbyist can afford.
This will only run on server hardware, some workstation GPUs, or some 128GB unified memory systems.
It’s a situation where if you have to ask, you can’t run the exact model they released. You have to wait for quantizations to smaller sizes, which come in a lot of varieties and have quality tradeoffs.
Quantizations are already out: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
An example of a query and a response is below. It was done with 9.09 tok/s, 893 tokens, 7.56 seconds to first token, thought for 1 minute and 9 seconds.
Question: I gave the model this truth table and asked to tell me what boolean function would produce the output given A, B as inputs.
| A | B | Output |
| :--- | --- | ---: |
| (T) | (F) | (T) |
| (T) | (T) | (F) |
| (F) | (F) | (F) |
| (F) | (T) | (F) |
Answer: (note this is part of a larger conversation, where I was getting it to help me with some boolean logic): The boolean condition for your table is simply:
```rust
A && !B
```
### Why?
Your table shows the output is `true` **only** when `A` is true and `B` is false. That's exactly what `A && !B` expresses.
| A | B | `A && !B` | Output |
|---|---|-----------|--------|
| (T) | (F) | `T && T = T` | |
| (T) | (T) | `T && F = F` | |
| (F) | (F) | `F && T = F` | |
| (F) | (T) | `F && F = F` | |
### In Rust:
```rust
if A && !B {
// output is true
} else {
// output is false
}
```
This is the most direct and idiomatic way to express that logic. Let me know if you need it adapted back to your `live_ticks` / `on_disconnect` context!The 3.5 27B model was a strong and capable reasoner, so I have high hopes for this one. Thanks to the team at Qwen for keeping competition in this space alive.
It's also a section that, with hope, becomes obsolete sometime semi soon-ish.
- What kind of tasks/work?
- How is either Qwen/Gemma wired up (e.g. which harness/how are they accessed)?
Or to phase another way; what does your workflow/software stack look like?
2. Lmstudio on my MacBook mainly. You can turn on an OpenAI API compatible endpoint in the settings. Lmstudio also has a headless server called lms. Personally, I find it way better than Ollama since lmstudio uses llama cpp as the backend. With an OpenAI API compatible endpoint, you can use any tool/agent that supports openAI. Lmstudio/lms is Linux compatible too so you can run it on a strix halo desktop and the like.
Very excited for the 122b version as the throughput is significantly better for that vs the dense 27b on my m4.
There are 2 aspects I am interested in:
1. accuracy - is it 95% accuracy of Opus in terms of output quality (4.5 or 4.6)?
2. capability-wise - 95% accuracy when calling your tools and perform agentic work compared to Opus - e.g. trip planning?
2. 3.6 is noticeably better than 3.5 for agentic uses (I have yet to use the dense model). The downside is that there’s so little personality, you’ll find more entertainment talking to a wall. Anything for creative use like writing or talking, I use Gemma 4. I also use Gemma 4 as a “chat” bot only, no agents. One amazing thing about the Gemma models is the vision capabilities. I was able to pipe in some handwritten notes and it converted into markdown flawlessly. But my handwriting is much better than the typical engineer’s chicken scratch.
Or if you want to put it differently, if your prompt is super clear about the actions you want it to do, is it following it exactly as you said or going off the rails occasionally
Also, the token prices of these open source models are at a fraction of Anthropic's Opus 4.6[1]
I’d also say it keeps the frontier shops competitive while costing R&D in the present is beneficial to them in forcing them to make a better and better product especially in value add space.
Finally, particularly for Anthropic, they are going for the more trustworthy shop. Even ali is hosting pay frontier models for service revenue, but if you’re not a Chinese shop, would you really host your production code development workload on a Chinese hosted provider? OpenAI is sketchy enough but even there I have a marginal confidence they aren’t just wholesale mining data for trade secrets - even if they are using it for model training. Anthropic I slightly trust more. Hence the premium. No one really believes at face value a Chinese hosted firm isn’t mass trolling every competitive advantage possible and handing back to the government and other cross competitive firms - even if they aren’t the historical precedent is so well established and known that everyone prices it in.
Everything they have done so far indicates this.
Running your own is the only option unless you really trust them or unless you have the option to sue them like some big companies can.
Or if you don't really care then you can use the chineese one since it is cheaper.
What makes you trust Anthropic more than Alibaba?
That's a cryptic way to say "Only for vibe-coding quality at the margin matters". Obviously, quality is determined first and foremost by the skills of the human operating the LLM.
> No one really believes at face value a Chinese hosted firm isn’t mass trolling every competitive advantage possible
That's much easier to believe than the same but applied to a huge global corp that operates in your own market and has both the power and the desire to eat your market share for breakfast, before the markets open, so "growth" can be reported the same day.
Besides, open models are hosted by many small providers in the US too, you don't have to use foreign providers per se.
2) I think there is a special case for Chinese providers due to the philosophical differences in what constitutes fair markets and the regulatory and civil legal structure outside China generally makes such things existentially dangerous to do; hence while it might happen it is extraordinarily ill advised, while in China is implicitly the way things work. However my point is Ali has their own hosted version of Qwen models operating on the frontier that are at minimum hosted exclusively before released. Theres no reason to believe they won’t at some point exclusively host some frontier or fine tuned variants for purposes for commercial reasons. This is part of why they had recent turnover.
Also, have you considered that your trust in Anthropic and distrust in China may not be shared by many outside the US? There's a reason why Huawei is the largest supplier of 5G hardware globally.
Most code is not P99, but companies pay a premium to produce code that is. That’s my point.
For some problems, sure, and when you are stuck, throwing tokens at Opus is worthwhile.
On the other hand, a $10/month minimax 2.7 coding subscription that literally never runs out of tokens will happily perform most day-to-day coding tasks
Claude also has other models which use less tokens.
If I build a super high quality context for something I'm really good at, I can get great results. If I'm trying to learn something new and have it help me, it's very hit and miss. I can see where the frontier models would be useful for the latter, but they don't seem to make as much difference for the former, at least in my experience.
The biggest issue I have is that if I don't know a topic, my inquiries seem to poison the context. For some reason, my questions are treated like fact. I've also seen the same behavior with Claude getting information from the web. Specifically, I had it take a question about a possible workaround from a bug report and present it as a de-facto solution to my problem. I'm talking disconnect a remote site from the internet levels of wrong.
From what I've seen, I think the future value is in context engineering. I think the value is going to come from systems and tools that let experts "train" a context, which is really just a search problem IMO, and a marketplace or standard for sharing that context building knowledge.
The cynic in me thinks that things like cornering the RAM market are more about depriving everyone else than needing the resources. Whoever usurps the most high quality context from those P99 engineers is going to have a better product because they have better inputs. They don't want to let anyone catch up because the whole thing has properties similar to network effects. The "best" model, even if it's really just the best tooling and context engineering, is going to attract the best users which will improve the model.
It makes me wonder of the self reinforced learning is really just context theft.
That's an interesting analogy.
OpenAI & Anthropic are just lying to everyone right now because if they can't raise enough money they are dead. Intelligence is a commodity, the semiconductor supply chain is not.
Slower and worse is still useful, but not as good in two important dimensions.
It’s ludicrous to believe a small parameter count model will out perform a well made high parameter count model. That’s just magical thinking. We’ve not empirically observed any flattening of the scaling laws, and there’s no reason to believe the scrappy and smart qwen team has discovered P=NP, FTL, or the magical non linear parameter count scaling model.
For coding, quality is not measurable and is based entirely on feels (er, sorry, "vibes").
Employers paying for SOTA models is nothing but a lifestyle status perk for employees, like ping-pong tables or fancy lunch snacks.
Now there's a word I haven't heard in a long, long time.
As opposed to an US-american shop? Yup, sure, why not? It's the same ballpark.
If you want to compare to a hosted model, look toward the GLM hosted model. It’s closest to the big players right now. They were selling it at very low prices but have started raising the price recently.
For coding $200 month plan is such a good value from anthropic it’s not even worth considering anything else except for up time issues
But competition is great. I hope to see Anthropic put out a competitor in the 1/3 to 1/5 of haiku pricing range and bump haiku’s performance should be closer to sonnet level and close the gap here.
Also, they are not exactly as good when you use them in your daily flow; maybe for shallow reasoning but not for coding and more difficult stuff. Or at least I haven't found an open one as good as closed ones; I would love to, if you have some cool settings, please share
The thing is the new OpenAI/Anthropic models are noticeably better than open source. Open source is not unusable, but the frontier is definitely better and likely will remain so. With SWE time costing over $1/min, if a convo costs me $10 but saves me 10 minutes it's probably worth it. And with code, often the time saved by marginally better quality is significant.
This is the competitive advantage. Being better.
Need to check out other harnesses for this besides claude code, but the local models are just painfully slow.
Part of its reply was: Quick clarification: As of early 2025, "Qwen 3.6" hasn't been released yet. You are likely looking for Qwen2.5, specifically the Qwen2.5-32B-Instruct model, which is the 30B-class model closest to your 27B reference. The instructions below will use this model.
Weird.
Generate an SVG of a dragon eating a hotdog while driving a car: https://codepen.io/chdskndyq11546/pen/xbENmgK
Far from perfect, but it really shows how powerful these models can get
Seems like a case of overfitting with regard to the thousands of pelican bike SVG samples on the internet already.
That doesn't make it any less of an achievement given the model size or the time it took to get the results
If anything, it shows there's still much to discover in this field and things to improve upon, which is really interesting to watch unfold
Can we stop both? its so boring
It's disruptive to the commons, doesn't add anything to knowledge of a model at this point, and it's way out of hand when people are not only engaging with the original and creating screenfuls to wade through before on-topic content, but now people are creating the thread before it exists to pattern-match on the engagement they see for the real thing. So now we have 2x.
Something seems off when I combine those premises.
You also make a key observation here: the root comment is fine and on-topic. The the replies spin off into nothing to do with the headline, but the example in the comment. Makes it really hard to critique with coming across as fun police.
Also, worth noting there's a distinction here, we're not in simonw's thread: we're in a brand new account's imitation of it.
I also asked Claude Code (Opus 4.7) and Codex (GPT-5.4) to review both qwen's output and that of opus 4.5, and both agents concluded qwen's was better.
Minesweeper is simple but nontrivial - 600-800 lines of code that need to be internally consistent. At that complexity level, this model is definitely a viable alternative.
(haven't tested with planning, debugging and more complex problems yet)
Interesting pros/cons vs the new Macbook Pros depending on your prefs.
And Linux runs better than ever on such machines.
Then again, I was looking in the UK, maybe prices are extra inflated there.
The 5090RTX mobile sits at 896GB/s, as opposed to the 1.8TB/s of the 5090 desktop and most mobile chips have way smaller bandwith than that, so speeds won't be incredible across the board like with Desktop computers.
I’m excited to try out the MLX version to see if 32GB of memory from a Pro M-series Mac can get some acceptable tok/s with longer context. HuggingFace has uploaded some MLX versions already.
It's been a while since I tried it, but I think I was getting around 12-15 tokens per second an that feels slow when you're used to the big commercial models. Whenever I actually want to do stuff with the open source models, I always find myself falling back to OpenRouter.
I tried Intel/Qwen3.6-35B-A3B-int4-AutoRound on a DGX Spark a couple days ago and that felt usable speed wise. I don't know about quality, but that's like running a 3B parameter model. 27B is a lot slower.
I'm not sure if I "get" the local AI stuff everyone is selling. I love the idea of it, but what's the point of 128GB of shared memory on a DGX Spark if I can only run a 20-30GB model before the slow speed makes it unusable?
ollama launch claude --model qwen3.6:35b-a3b-nvfp4
This has been optimized for Apple Silicon and runs well on a 32G ram system. Local models are getting better!
Friendly reminder: wait a couple weeks to judge the ”final” quality of these free models. Many of them suffer from hidden bugs when connected to an inference backend or bad configs that slow them down. The dev community usually takes a week or two to find the most glaring issues. Some of them may require patches to tools like llama.cpp, and some require users to avoid specific default options.
Gemma 4 had some issues that were ironed out within a week or two. This model is likely no different. Take initial impressions with a grain of salt.
The bugs come from the downstream implementations and quantizations (which inherit bugs in the tools).
Expect to update your tools and redownload the quants multiple times over 2-4 weeks. There is a mad rush to be first to release quants and first to submit PRs to the popular tools, but the output is often not tested much before uploading.
If you experiment with these on launch week, you are the tester. :)
It’s not a surprise that models are leapfrogging each other when the engineers are able to incorporate better code examples and reasoning traces, which in turn bring higher quality outputs.
That's just, like, your opinion, man.
> You really can't compare a model that's got trillions of parameters to a 27B one.
Parameter count doesn't matter much when coding. You don't need in-depth general knowledge or multilingual support in a coding model.
Every release is accompanied by claims of being as good as Sonnet or Opus, but when I try them (even hosted full weights) they’re far from it.
Impressive for the size, though!
if you can't afford to do that, look at a lot of them, eg. on artificialanalysis.com they merge multiple benchmarks across weighted categories and build an Intelligence Score, Coding Score and Agentic score.
GLM 5 scores 5% on the semi-private set, compared to SOTA models which hover around 80%.
Gemini flash was just as good as pro for most tasks with good prompts, tools, and context. Gemma 4 was nearly as good as flash and Qwen 3.6 appears to be even better.
What matters is the motion in the tokens
But when actually employed to write code they will fall over when they leave that specific domain.
Basically they might have skill but lack wisdom. Certainly at this size they will lack anywhere close to the same contextual knowledge.
Still these things could be useful in the context of more specialized tooling, or in a harness that heavily prompts in the right direction, or as a subagent for a "wiser" larger model that directs all the planning and reviews results.
The issue with C# specifically is dataset availability. Open source C# code on GitHub is a fraction of Python/JS, and Microsoft hasn't released a public corpus the way Meta has for their code models. You'd probably get further fine-tuning Qwen3-Coder (or a similar base) on your specific codebase with LoRA than waiting for a dedicated C#-only model to appear.
Fine-tuning / LoRA on basis the org code base would be make it more useful.
Even if they don't run super fast, I can let them work overnight and get comprehensive reports in the morning.
I used Qwen3.6-27B on an M5 (oq8, using omlx) and Swival (https://swival.dev) /audit command on small code bases I use for benchmarking models for security audits.
It found 8 out of 10, which is excellent for a local model, produced valid patches, and didn't report any false positives. which is even better.
For anyone invested in running LLMs at home or on a much more modest budget rig for corporate purposes, Gemma 4 and Qwen 3.6 are some of the most promising models available.
$ llama-server --version
version: 8851 (e365e658f)
$ llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 1.529 | 654.11 | 3.470 | 36.89 | 4.999 | 225.67 |
| 2000 | 128 | 1 | 2128 | 3.064 | 652.75 | 3.498 | 36.59 | 6.562 | 324.30 |
| 4000 | 128 | 1 | 4128 | 6.180 | 647.29 | 3.535 | 36.21 | 9.715 | 424.92 |
| 8000 | 128 | 1 | 8128 | 12.477 | 641.16 | 3.582 | 35.73 | 16.059 | 506.12 |
| 16000 | 128 | 1 | 16128 | 25.849 | 618.98 | 3.667 | 34.91 | 29.516 | 546.42 |
| 32000 | 128 | 1 | 32128 | 57.201 | 559.43 | 3.825 | 33.47 | 61.026 | 526.47 | | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 0.684 | 1462.61 | 2.869 | 44.61 | 3.553 | 317.47 |
| 2000 | 128 | 1 | 2128 | 1.390 | 1438.84 | 2.868 | 44.64 | 4.258 | 499.80 |
| 4000 | 128 | 1 | 4128 | 2.791 | 1433.18 | 2.886 | 44.35 | 5.677 | 727.11 |
| 8000 | 128 | 1 | 8128 | 5.646 | 1416.98 | 2.922 | 43.80 | 8.568 | 948.65 |
| 16000 | 128 | 1 | 16128 | 11.851 | 1350.10 | 3.007 | 42.57 | 14.857 | 1085.51 |
| 32000 | 128 | 1 | 32128 | 25.855 | 1237.66 | 3.168 | 40.40 | 29.024 | 1106.96 |
Edit: Model gets stuck in infinite loops at this quantization level. I've also tried Q5_K_M quantization (fits up to 51968 context length), which seems more robust. $ llama-batched-bench -dev ROCm1 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 1.034 | 966.90 | 4.851 | 26.39 | 5.885 | 191.67 |
| 2000 | 128 | 1 | 2128 | 2.104 | 950.38 | 4.853 | 26.38 | 6.957 | 305.86 |
| 4000 | 128 | 1 | 4128 | 4.269 | 937.00 | 4.876 | 26.25 | 9.145 | 451.40 |
| 8000 | 128 | 1 | 8128 | 8.962 | 892.69 | 4.912 | 26.06 | 13.873 | 585.88 |
| 16000 | 128 | 1 | 16128 | 19.673 | 813.31 | 4.996 | 25.62 | 24.669 | 653.78 |
| 32000 | 128 | 1 | 32128 | 46.304 | 691.09 | 5.122 | 24.99 | 51.426 | 624.75 |M2 Ultra, Q8_0
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 1.307 | 391.69 | 6.209 | 20.61 | 7.516 | 85.15 |
| 1024 | 128 | 1 | 1152 | 2.534 | 404.16 | 6.227 | 20.56 | 8.760 | 131.50 |
| 2048 | 128 | 1 | 2176 | 5.029 | 407.26 | 6.229 | 20.55 | 11.258 | 193.29 |
| 4096 | 128 | 1 | 4224 | 10.176 | 402.52 | 6.278 | 20.39 | 16.454 | 256.72 |
| 8192 | 128 | 1 | 8320 | 20.784 | 394.14 | 6.376 | 20.08 | 27.160 | 306.33 |
| 16384 | 128 | 1 | 16512 | 43.513 | 376.53 | 6.532 | 19.59 | 50.046 | 329.94 |
| 32768 | 128 | 1 | 32896 | 99.137 | 330.53 | 7.081 | 18.08 | 106.218 | 309.70 |
DGX Spark, Q8_0 | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 0.881 | 580.98 | 16.122 | 7.94 | 17.003 | 37.64 |
| 1024 | 128 | 1 | 1152 | 1.749 | 585.43 | 16.131 | 7.93 | 17.880 | 64.43 |
| 2048 | 128 | 1 | 2176 | 3.486 | 587.54 | 16.169 | 7.92 | 19.655 | 110.71 |
| 4096 | 128 | 1 | 4224 | 7.018 | 583.64 | 16.245 | 7.88 | 23.263 | 181.58 |
| 8192 | 128 | 1 | 8320 | 14.189 | 577.33 | 16.427 | 7.79 | 30.617 | 271.75 |
| 16384 | 128 | 1 | 16512 | 29.015 | 564.68 | 16.749 | 7.64 | 45.763 | 360.81 |
| 32768 | 128 | 1 | 32896 | 60.413 | 542.40 | 17.359 | 7.37 | 77.772 | 422.98 |llama-* version 8889 w/ rocm support ; nightly rocm
llama.cpp/build/bin/llama-batched-bench --version unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 2.776 | 360.22 | 20.192 | 6.34 | 22.968 | 49.11 |
| 2000 | 128 | 1 | 2128 | 5.778 | 346.12 | 20.211 | 6.33 | 25.990 | 81.88 |
| 4000 | 128 | 1 | 4128 | 11.723 | 341.22 | 20.291 | 6.31 | 32.013 | 128.95 |
| 8000 | 128 | 1 | 8128 | 24.223 | 330.26 | 20.399 | 6.27 | 44.622 | 182.15 |
| 16000 | 128 | 1 | 16128 | 52.521 | 304.64 | 20.669 | 6.19 | 73.190 | 220.36 |
| 32000 | 128 | 1 | 32128 | 120.333 | 265.93 | 21.244 | 6.03 | 141.577 | 226.93 |
More directly comparable to the results posted by genpfault (IQ4_XS):llama.cpp/build/bin/llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 2.543 | 393.23 | 9.829 | 13.02 | 12.372 | 91.17 |
| 2000 | 128 | 1 | 2128 | 5.400 | 370.36 | 9.891 | 12.94 | 15.291 | 139.17 |
| 4000 | 128 | 1 | 4128 | 10.950 | 365.30 | 9.972 | 12.84 | 20.922 | 197.31 |
| 8000 | 128 | 1 | 8128 | 22.762 | 351.46 | 10.118 | 12.65 | 32.880 | 247.20 |
| 16000 | 128 | 1 | 16128 | 49.386 | 323.98 | 10.387 | 12.32 | 59.773 | 269.82 |
| 32000 | 128 | 1 | 32128 | 114.218 | 280.16 | 10.950 | 11.69 | 125.169 | 256.68 | $ llama-batched-bench -dev Vulkan2 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 3.288 | 304.15 | 9.873 | 12.96 | 13.161 | 85.71 |
| 2000 | 128 | 1 | 2128 | 6.415 | 311.79 | 9.883 | 12.95 | 16.297 | 130.57 |
| 4000 | 128 | 1 | 4128 | 13.113 | 305.04 | 9.979 | 12.83 | 23.092 | 178.76 |
| 8000 | 128 | 1 | 8128 | 27.491 | 291.01 | 10.155 | 12.61 | 37.645 | 215.91 |
| 16000 | 128 | 1 | 16128 | 59.079 | 270.83 | 10.476 | 12.22 | 69.555 | 231.87 |
| 32000 | 128 | 1 | 32128 | 148.625 | 215.31 | 11.084 | 11.55 | 159.709 | 201.17 |