I've been playing with Qwen3-Coder-Next and the Qwen3.5 models since they were each released.
They are impressive, but they are not performing at Sonnet 4.5 level in my experience.
I have observed that they're configured to be very tenacious. If you can carefully constrain the goal with some tests they need to pass and frame it in a way to keep them on track, they will just keep trying things over and over. They'll "solve" a lot of these problems in the way that a broken clock is right twice a day, but there's a lot of fumbling to get there.
That said, they are impressive for open source models. It's amazing what you can do with self-hosted now. Just don't believe the hype that these are Sonnet 4.5 level models because you're going to be very disappointed once you get into anything complex.
I like this benchmark that competes models against one another in competitive environments, which seems like it can't really be gamed: https://gertlabs.com
I'm working on a pretty complex Rust codebase right now, with hundreds of integration tests and nontrivial concurrency, and stepfun powers through.
I have no relation to stepfun, and I'm saying this purely from deep respect to the team that managed to pack this performance in 196B/11B active envelope.
It’s 2× faster than its competitors. For tasks where “one-shotting” is unrealistic, a fast iteration loop makes a measurable difference in productivity.
I bet the cloud ones are doing it a lot more because they can also affect the runtime side which the open source ones can't.
Goodhart's law shows up with people, in system design, in processor design, in education...
Models are going to be over-fit to the tests unless scruples or practical application realities intervene. It's a tale as old as machine learning.
If the tests haven't been published anywhere and are sufficiently different from standard problems, I would think the benchmarks would be robust to intentional over optimization.
Edit: These look decent and generally match my expectations:
The only benchmarks worth anything are dynamic ones which can be scaled up.
I’ve switched to using Kimi 2.5 for all of my personal usage and am far from disappointed.
Aside from being much cheaper than the big names (yes, I’m not running it locally, but like that I could) it just works and isn’t a sycophant. Nice to get coding problems solved without any “That’s a fantastic idea!”/“great point” comments.
At least with Kimi my understanding is that beating benchmarks was a secondary goal to good developer experience.
And could quantization maybe partially explain the worse than expected results?
I have two of my own comments to add to that. First one is that there is problem alignment at play. Specifically - the benchmarks are mostly self-contained problems with well defined solutions and specific prompt language, humans tasks are open ended with messy prompts and much steerage. Second is that it would be interesting to test older models on brand new benchmarks to see how those compare.
That's a much better way to say it than I did.
These models are known for being open weights but they're still products that Alibaba Cloud wants is trying to sell. They have Product Managers and PR and marketing people under pressure to get people using them.
This Venture Beat article is basically a PR piece for the models and Alibaba Cloud hosting. The pricing table is right in the article.
It's cool that they release the models for us to use, but don't think they're operating entirely altruistically. They're playing a business game just like everyone else.
That way, we can have a benchmark that is always up to date.
The benchmarks are public. They're guaranteed to be in the training sets by now. So the benchmarks are no longer an indicator of general performance because the specific tasks have been seen before.
> And could quantization maybe explain the worse than expected results?
You can use the models through various providers on OpenRouter cheaply without quantization.
Quantisation doesn't help, but even running full fat versions of these models through various cloud providers, they still don't match Sonnet in actual agentic coding uses: at least in my experience.
there is nothing open "source" about them. They are open weights, that's all.
So far Opus 4.6 and Gemini Pro are very satisfactory, producing great answers fairly fast. Gemini is very fast at 30-50 sec, Opus is very detailed and comes at about 2-3 minutes.
Today I ran the question against local qwen3.5:35b-a3b - it puffed for 45 (!) minutes, produced a very generic answer with errors, and made my laptop sound like it's going to take off any moment.
Wonder what am I doing wrong?.. How am I supposed to use this for any agentic coding on a large enough codebase? It will take days (and a 3M Peltor X5A) to produce anything useful.
You're comparing 100b parameters open models running on a consumer laptop VS private models with at the very least 1t parameters running on racks of bleeding edge professional gpus
Local agentic coding is closer to "shit me the boiler plate for an android app" not "deep research questions", especially on your machine
Speculation is that the frontier models are all below 200B parameters but a 2x size difference wouldn’t fully explain task performance differences
Core speed/count and memory bandwidth determines your performance. Memory size determines your model size which determines your smarts. Broadly speaking.
Some versions of some the models are around that size, which you might hit for example with the ChatGPT auto-router.
But the frontier models are all over 1T parameters. Source: watch interview with people who have left one of the big three labs and now work at the Chinese labs and are talking about how to train 1T+ models.
Yes it does.
There are the benchmarks, the promises, and what everybody can try at home
But if you've got that kind of equipment, you aren't using it to support a single user. It gets the best utilization by running very large batches with massive parallelism across GPUs, so you're going to do that. There is such a thing as a useful middle ground. that may not give you the absolute best in performance but will be found broadly acceptable and still be quite viable for a home lab.
Local models are more than a useful middle ground they are essential and will never go away, I was just addressing the OPs question about why he observed the difference he did. One is an API call to the worlds most advanced compute infrastructure and another is running on a $500 CPU.
Lots of uses for small, medium, and larger models they all have important places!!
Even on servers this can happen. At work we have a 2U sized server with two 250W class GPUs. And I found that by pinning the case fans at 100% I can get 30% more performance out of GPU tasks which translates to several days faster for our usecase. It does mean I can literally hear the fans screaming in the hallway outside the equipment room but ok lol. Who cares. But a laptop just can't compare.
Something with a desktop GPU or even better something with HBM3 would run much better. Local models get slow when you use a ton of context and the memory bandwidth of a MacBook Pro while better than a pc is still not amazing.
And yeah the heaviest tasks are not great on local models. I tend to run the low hanging fruit locally and the stuff where I really need the best in the cloud. I don't agree local models are on par, however I don't think they really need to be for a lot of tasks.
On the other hand, if indeed open source models and Macbooks can be as powerful as those SOTA models from Google, etc, then stock prices of many companies would already collapsed.
I'm too GPU-poor to run it, but r/LocalLLaMa is full of people using it.
Admittedly, I haven't tried these models on my Mac, but I have on my DGX Spark, and they ran fine. I didn't see the slowdown you're mentioning.
I really, really want open weights models to be great, but I've been disappointed with them. I don't even run them locally, I try them from providers, but they're never as good as even the current Sonnet.
- Qwen3-VL picks up new images in a NAS, auto captions and adds the text descriptions as a hidden EXIF layer into the image, which is used for fast search and organization in conjunction with a Qdrant vector database.
- Gemma3:27b is used for personal translation work (mostly English and Chinese).
- Llama3.1 spins up for sentiment analysis on text.
Maybe I should try local models for home automation, Qwen must be great at that.
if you are able to run something like mlx-community/MiniMax-M2.5-3bit (~100gb), my guess if the results are much better than 35b-a3b.
Also, performance on research-y questions isn't always a good indicator of how the model will do for code generation or agent orchestration.
I then discovered what quantization is by reading a blog post about binary quantization. That seemed too good to be true. I asked Claude to design an analysis assessing the fidelity of 1, 2, 4, and 8 bit quantization. Claude did a good job, downloading 10,000 embeddings from a public source and computing a similarity score and correlation coefficient for each level of quantization against the float32 SoT. 1 and 2 bit quantizations were about 90% similar and 8 bit quantization was lossless given the precision Claude used to display the results. 4 bit was interesting as it was 99% similar (almost lossless) yet half the size of 8 bit. It seemed like the sweet spot.
This analysis took me all of an hour so I thought, "That's cool but is it real?" It's gratifying to see that 4 bit quantization is actually being used by professionals in this field.
It doesn't seem terribly common yet though. I think it is challenging to keep it stable.
[1] https://www.opencompute.org/blog/amd-arm-intel-meta-microsof...
[2] https://www.opencompute.org/documents/ocp-microscaling-forma...
I do wonder where that extra acuity you get from 1% more shows up in practice. I hate how I have basically no way to intuitively tell that because of how much of a black box the system is
- llama.cpp
- OpenCode
- Qwen3-Coder-30B-A3B-Instruct in GGUF format (Q4_K_M quantization)
working on a M1 MacBook Pro (e.g. using brew).
It was bit finicky to get all of the pieces together so hopefully this can be used with these newer models.
https://gist.github.com/alexpotato/5b76989c24593962898294038...
On the model choice: I've tried latest gemma, ministral, and a bunch of others. But qwen was definitely the most impressive (and much faster inference thanks to MoE architecture), so can't wait to try Qwen3.5-35B-A3B if it fits.
I've no clue about which quantization to pick though ... I picked Q4_K_M at random, was your choice of quantization more educated?
What these open models are great for are for narrow, constrained domains, with good input/output examples. I typically use them for things like prompt expansion, sentiment analysis, reformatting or re-arranging flow of code.
What I found they have trouble with is going from ambiguous description -> solved problem. Qwen 3.5 is certainly the best of the OSS models I've found (beating out GPT 120b OSS which was the previous king), and it's just starting to demonstrate true intelligence in unbound situations, but it isn't quite there yet. I have a RTX 6000 pro, so Qwen 3.5 is free for me to run, but I tend to default to Composer 1.5 if I want to be cheap.
The trend however is super encouraging. I bought my vid card with the full expectation that we'll have a locally running GPT 5.2 equiv by EoY, and I think we're on track.
Up until relatively recently, while people had already long been making these claims, it came with the asterisks of „oh, but you can’t practically use more than a few K tokens of context“.
Qwen 3.5 122b/a10b (at q3 using unsloth's dynamic quant) is so far the first model I've tried locally that gets a really usable RPN calculator app. Other models (even larger ones that I can run on my Strix Halo box) tend to either not implement the stack right, have non-functional operation buttons, or most commonly the keypad looks like a Picasso painting (i.e., the 10-key pad portion has buttons missing or mapped all over the keypad area).
This seems like such as simple test, but I even just tried it in chatgpt (whatever model they serve up when you don't log in), and it didn't even have any numerical input buttons. Claude Sonet 4.6 did get it correct too, but that is the only other model I've used that gets this question right.
if so, a better approach would be to ask it to first plan that entire task and give it some specific guidance
then once it has the plan, ask it to execute it, preferably by letting it call other subagents that take care of different phases of the implementation while the main loop just merges those worktrees back
it's how you should be using claude code too, btw
The more I use the cloud based frontier models, the more virtue I find in using local, open source/weights, models because they tend to create much simpler code. They require more direct interaction from me, but the end result tends to be less buggy, easier to refactor/clean up, and more precisely what I wanted. I am personally excited to try this new model out here shortly on my 5090. If read the article correctly, it sounds like even the quantized versions have a “million”[1] token context window.
And to note, I’m sure I could use the same interaction loop for Claude or GPT, but the local models are free (minus the power) to run.
[1] I’m a dubious it won’t shite itself at even 50% of that. But even 250k would be amazing for a local model when I “only” have 32GB of VRAM.
Quite misleading, really.
Theory is that some of the model parameters aren't set properly and this encourages endless looping behavior when run under ollama:
https://github.com/ollama/ollama/issues?q=is%3Aissue%20state... (a bunch of them)
Using Claude Code Max 20 so ROI would be maybe 2+ years.
CC gives me unlimited coding in 4-6 windows in parallel. Unsure if any model would beat (or even match) that, both in terms in quality and speed.
I wouldn't gamble on that now. With a subscription, I can change any time. With the machine, you risk that this great insane model comes out but you need 138GB and then you'll pay for both.
Also Nvidia Spark.
Thermals. Your workloads will be throttled hard once it inevitably runs hot. See comments elsewhere in thread about why LLMs on laptops like MBP is underwhelming. The same chips in even a studio form factor would perform much better.
Edit: The unsloth quants seem to have been fixed, so they are probably the go-to again: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
See Wendel's review here - https://www.youtube.com/watch?v=L-xgMQ-7lW0
There are other mini-pc manufacturers, the mainboard is the important part.
EDIT: opencode was a bit slow with qwen3.5:35b using Ollama. Faster/nicer to use with Liquid lfm2:latest
If you want to spend twice as much for more speed, get a 3090/4090/5090.
If you want long context, get two of them.
If you have enough spare cash to buy a car, get an RTX Ada with 96G VRAM.
I was thinking about adding after-market liquid cooling for them, but they're fine without it.
Check out the HP Omen 45L Max: https://www.hp.com/us-en/shop/pdp/omen-max-45l-gaming-dt-gt2...
I'm curious which one you're using.
Sure. Llama.cpp will happily run these kinds of LLMs using either HIP or Vulcan.
Vulkan is easier to get going using the Mesa OSS drivers under Linux, HIP might give you slightly better performance.
I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models.
Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.
Unsloth's GLM-4.7-Flash-BF16.gguf is quite fast on the 6000, at around 100 t/s, but definitely not as smart as the Qwen 3.5 MoE or dense models of similar size. As far as I'm concerned Qwen 3.5 renders most other open models short of perhaps Kimi 2.5 obsolete for general queries, although other models are still said to be better for local agentic use. That, I haven't tried.
Edit: it looks like the flagship models work by writing a C or Python program to do the bookkeeping. I don't have Qwen set up to use tools, and even Opus 4.6 shits the bed when told to do it without tools [1], so not too surprising that it didn't work.
1: https://claude.ai/share/1f5289ae-decd-4dfa-98fd-0d34346008c6 -- I interrupted it and told it not to use a C/Python program or any other tools to generate the Brainfuck code, and it gave me an error message after about 10 minutes that wasn't logged to the chat.
Strong vision and reasoning performance, and the 35-a3b model run s pretty ok on a 16gb GPU with some CPU layers.
Obviously there's more to a model than that but it's a data point.
[1]: https://github.com/fairydreaming/lineage-bench
[2]: https://github.com/fairydreaming/lineage-bench-results/tree/...
Somewhere between Haiku 4.5 and Sonnet 4.5
That's like saying "somewhere between Eliza and Haiku 4.5". Haiku is not even a so-called 'reasoning model'.¹
¹ To preempt the easily-offended, this is what the latest Opus 4.6 in today's Claude Code update says: "Claude Haiku 4.5 is not a reasoning model — it's optimized for speed and cost efficiency. It's the fastest model in the Claude family, good for quick, straightforward tasks, but it doesn't have extended thinking/reasoning capabilities."
[0]: https://www-cdn.anthropic.com/7aad69bf12627d42234e01ee7c3630...
> Claude Haiku 4.5, a new hybrid reasoning large language model from Anthropic in our small, fast model class.
> As with each model released by Anthropic beginning with Claude Sonnet 3.7, Claude Haiku 4.5 is a hybrid reasoning model. This means that by default the model will answer a query rapidly, but users have the option to toggle on “extended thinking mode”, where the model will spend more time considering its response before it answers. Note that our previous model in the Haiku small-model class, Claude Haiku 3.5, did not have an extended thinking mode.
I would absolutely believe mar-ticles that Qwen has achieved Haiku 4.5 'extended thinking' levels of coding prowess.
Oh HN never change.
Haiku 4.5 is a reasoning model, regardless of whatever hallucination you read. Being a hybrid reasoning model means that, depending on the complexity of the question and whether you explicitly enable reasoning (this is "extended thinking" in the API and other interfaces) when making a request to the LLM, it will emit reasoning tokens separately prior to the tokens used in the main response.
I love your theory that there was some mix up on their side because they were lazy and it was just some marketing dude being quirky with the technical language.
* Haiku 4.5 by default doesn't think, i.e. it has a default thinking budget of 0.
* By setting a non-zero thinking budget, Haiku 4.5 can think. My guess is that Claude Code may set this differently for different tasks, e.g. thinking for Explore, no thinking for Compact.
* This hybrid thinking is different from the adaptive thinking introduced in Opus 4.6, which when enabled, can automatically adjust the thinking level based on task difficulty.
Maybe "Qwen3.5 122B offers Haiku 4.5 performance on local computers" would be a more realistic and defensible claim.
none of the qwen 3.5 models are anywhere near sonnet 4.5 class, not even the largest 397b.
BUT 27b is the smartest local-sized model in the world by a wide wide margin. (35b is shit. fast shit, but shit.)
benchmarks are complete, publishing on Monday.
Will check your updated ranking on Monday.
...yeah I doubt it
"User is asking me to repeat the word "potato" 100 times, numbered. This is a simple request - I can comply with this request. Let me create a response that includes the word "potato" 100 times, numbered from 1 to 100.
I'll need to be careful about formatting - the user wants it numbered and once per line. I should use minimal formatting as per my instructions."
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a table.
No, text is fine.
Okay.
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a bullet list.
No, just lines.
Okay.
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a numbered list.
No, lines are fine.
Okay.
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a code block.
Yes.
Okay.
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a pre block.
Code block is better.
... (for next 100 lines)I just tried this (Ollama macOS 0.17.4, qwen3.5:35b-a3b-q4_K_M) on a M4 Pro, and it did fine:
[Thought for 50.0 seconds]
1. potato 2. potato [...] 100. potato
In other words, it did great.
I think 50 seconds of thinking beforehand was perhaps excessive?
I just tried this (Ollama macOS 0.17.4, qwen3.5:35b-a3b-q4_K_M) on a M4 Pro, and it did fine:
[Thought for 50.0 seconds]
1. potato 2. potato [...] 100. potato
In other words, it did great.
I think 50 seconds of thinking beforehand was perhaps excessive?
llama-server ^
--model Qwen3.5-27B-BF16-00001-of-00002.gguf ^
--mmproj mmproj-BF16.gguf ^
--fit on ^
--host 127.0.0.1 ^
--port 2080 ^
--temp 0.8 ^
--top-p 0.95 ^
--top-k 20 ^
--min-p 0.00 ^
--presence_penalty 1.5 ^
--repeat_penalty 1.1 ^
--no-mmap ^
--no-warmup
The repeat and/or presence penalties seem to be somewhat sensitive with this model, so that might have caused the looping you saw.For Qwen3.5 27B, I got good result with --temp 1.0 --top-p 1.0 --top-k 40 --min-p 0.2, without penalty. It allows the model to explore (temp, top-p, top-k) without going off the rail (min-p) during reasoning. No loop so far.
When setting up the batch file for some previous tests, I decided to split the difference between 0.6 and 1.0 for temperature and use the larger recommended values for presence and repetition. For this prompt, it probably isn't a good idea to discourage repetition, I guess. But keeping the existing parameters worked well enough, so I didn't mess with them.
either that, or it has a delusional level of instruction following. doesn’t mean it can’t code like sonnet though
> do you really know what it means to “recite” “potato” “100” “times”?
asking user question is an option. Sonnet did that a bunch when I was trying to debug some network issue. It also forgot the facts checked for it and told it before...
The thing I struggle most with, honestly, is when AI (usually GPT5.3-Codex) asks me a question and I genuinely don’t know the answer. I’m just like “well, uh… follow industry best practice, please? unless best practice is dumb, I guess. do a good. please do a good.” And then I get to find out what the answer should’ve been the hard way.
What's your problem with Chinese LLMs?
This is trivial to test and verify yourself. Just pick any topic you think has a chance of being censored. You can do the same on American models and compare results.