Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge
172 points
2 hours ago
| 28 comments
| thinkpol.ca
| HN
0xbadcafebee
1 hour ago
[-]
These posts are going to be a constant for the next year, because there's no objective way to compare models (past low-level numbers like token generation speed, average reasoning token amount, # of parameters, active experts, etc). They're all quite different in a lot of ways, they're used for many different things by different people, and they're not deterministic. So you're constantly gonna see benchmarks and tests and proclamations of "THIS model beat THAT model!", with people racing around trying to find the best one.

But there is no best one. There's just the best one for you, based on whatever your criteria are. It's likely we'll end up in a "Windows vs MacOS vs Linux" style world, where people stick to their camps that do a particular thing a particular way.

reply
ljlolel
16 minutes ago
[-]
Then it's useful to be able to easily switch between them with one billing, like on https://trustedrouter.com/ (which I made)
reply
idonotknowwhy
3 minutes ago
[-]
So like Open Router?
reply
verve_rat
19 minutes ago
[-]
My theory is we will end up in a similar spot to hiring people. You can look at a CV (benchmarks) but you won't know for sure until you've worked with them for six months.

We as an industry cannot determine if one software engineer is objectively better than another, on practically any dimension, so why do we think we can come to an objective ranking of models?

reply
ninjahawk1
44 minutes ago
[-]
At the current rate, open-source models are expected to surpass cloud models within a couple of years, based on a study I read a couple of days ago.

Looking back at ChatGPT and Claude a couple of years ago, very small Qwen models are now basically equal in coding to what those cloud-based models could do then. Also, factoring in scaling laws (going from 9B to 18B is roughly a 40% increase, whereas 18B to 35B is about 20%), I expect there will be a change in at least the pricing of cloud-based models.

Adobe used to be $600 up front; then it became $20 a month when distribution scaled.

reply
baxtr
40 minutes ago
[-]
While this might be true I’m worried about the hardware side of things.

What if you have a good enough model but the cloud model providers are better at procuring hardware for inference?

reply
zozbot234
24 minutes ago
[-]
The cloud providers are probably better at procuring hardware for inference, but on prem users are better at repurposing hardware that they'd need anyway for their existing uses. In a world where AI compute is likely inherently scarce, it makes sense to rely on both.
reply
gleenn
30 minutes ago
[-]
Local inference is definitely going to make more and more sense. Modern CPUs have all this amazing hardware well-optimized for inference purposes. I use a lot of web tools and see AI baked in, and it feels weird. I want the smartness localized for speed and data security. I think and hope the industry moves toward smart AI agents operating as locally as possible.
reply
Gigachad
21 minutes ago
[-]
You’ll be able to run the open models on any cloud at the cost of the hardware rental. While the closed models will try to mark up beyond the base cost.
reply
Traubenfuchs
25 minutes ago
[-]
What were all the datacenters for???
reply
gertlabs
1 hour ago
[-]
I'm glad we're seeing a shift towards objectively scored tests.

We've been doing this at scale at https://gertlabs.com/rankings, and although the author looks to be running unique one-off samples, it's not surprising to see how well Kimi K2.6 performed. Based on our testing, for coding especially, Kimi is within statistical uncertainty of MiMo V2.5 Pro for top open weights model, and performs much better with tools than DeepSeek V4 Pro.

GPT 5.5 has a comfortable lead, but Kimi is on par with or better than Opus 4.6. The problem with Kimi 2.6 is that it's one of the slower models we've tested.

reply
Mashimo
57 seconds ago
[-]
Seems like in agentic workflows the Qwen Flash and DeepSeek Flash models are quite good.

Fits with another comment on here from yesterday that said the flash models are just better at tool calling.

Planning with GPT 5.5 and implementation with a flash model could be the bang-for-the-buck route.

reply
veber-alex
1 hour ago
[-]
In my experience benchmarks are pretty meaningless.

Not only is performance dependent on the language and tasks given, but also on the prompts used and the expected results.

In my own internal tests it was really hard to judge whether GPT 5.5 or Opus 4.7 is the better model.

They have different styles and it's basically up to preference. There were even times when I gave the win to one model, only to think about it more and change my mind.

At the end of the day I think I slightly prefer Opus 4.7.

reply
magicalhippo
13 minutes ago
[-]
In addition, the harness around these models does a lot of work and changes the outcome significantly.

I just had an issue where Claude CLI with Opus 4.7 High could not figure out why my Blazor Server program was inert: buttons didn't do anything, etc. After several rounds, I opened the web console and found that it failed to load blazor.js due to a 404 on that file. I copied the error message to Claude CLI and after another several unproductive rounds I gave up.

I then moved to Codex, with ChatGPT 5.5 High. I gave it the code base, problem description and error codes. Unlike Claude CLI, it spun up the project and used wget/curl to probe for blazor.js, and found it was indeed not served. It then did a lot more probing and some web searches and after a while found my project file was missing a setting. It added that and then probed to verify it worked.
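(For anyone curious, that probing step is easy to reproduce by hand. A rough sketch in Python; the port and the exact framework-script paths are assumptions and will differ per project, since all I actually saw was a 404 on "blazor.js":)

  # Rough sketch of the "probe the dev server for the framework script" step.
  # The port and asset paths are assumptions; adjust to whatever the project
  # actually serves.
  import urllib.error
  import urllib.request

  BASE = "http://localhost:5000"  # assumed Kestrel dev port
  CANDIDATES = [
      "/_framework/blazor.web.js",     # Blazor Web App style
      "/_framework/blazor.server.js",  # classic Blazor Server style
  ]

  for path in CANDIDATES:
      try:
          with urllib.request.urlopen(BASE + path, timeout=5) as resp:
              print(f"{path}: {resp.status}")
      except urllib.error.HTTPError as e:
          print(f"{path}: {e.code}")  # a 404 here reproduces the broken page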

So Codex fixed it in about 20 minutes without me laying hands on it (other than approve some program executions).

However, I'm not convinced this shows GPT 5.5 being that much better than Opus 4.7. It could very well be the harness around it, the system prompts used in the harness, and the tools available.

For reference, this was me just trying to see how good the vibecoding experience is now, so I was trying to do this as hands-off as possible.

reply
veber-alex
3 minutes ago
[-]
I actually noticed this too. GPT 5.5 is much more "hands on" with calling tools to debug issues and verify results. I did all my tests in Cursor but I don't know if they use a different system prompt for each model.
reply
gertlabs
38 minutes ago
[-]
I think benchmarks are improving and will always have value, but it's the equivalent of someone's college and GPA on an entry-level job application.

It's a strong signal for a job, but the soft skills are sometimes going to get Claude Opus 4.6 a job over smarter applicants. That's what we'd really like to measure objectively, and are actively working on.

reply
bazlightyear
1 hour ago
[-]
Are your tests and results open source?
reply
gertlabs
48 minutes ago
[-]
Test result summaries are openly available, test environments are not.
reply
refulgentis
59 minutes ago
[-]
Any thoughts on using it on Fireworks? It's extremely fast there.
reply
gertlabs
48 minutes ago
[-]
I'm not sure how many of our requests got routed to Fireworks -- for our testing, we set preferences for routing to providers with the highest advertised quantizations / highest reasoning mode support / or preferably the model developer itself.

While it may be possible to get better numbers from certain providers, we try to establish a common baseline. I.e. if we measure that Kimi K2.6 averages 450s on a task and GLM 5.1 averages 400s, you might be able to improve that number on a provider like Fireworks but GLM 5.1 would also likely be 10% faster on the premium provider. This is a caveat worth considering when comparing to proprietary model speeds on the site, though.
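For what it's worth, you can pin that kind of routing yourself per request. A rough sketch of expressing the preference via the OpenRouter API; the field names under "provider" are from memory of their provider-routing options and the model slug is a placeholder, so double-check both against the current docs:

  # Sketch of pinning provider routing on OpenRouter. The field names under
  # "provider" are from memory of OpenRouter's provider-routing options and
  # the model slug is a placeholder; verify both against the docs.
  import os
  import requests

  resp = requests.post(
      "https://openrouter.ai/api/v1/chat/completions",
      headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
      json={
          "model": "moonshotai/kimi-k2.6",  # placeholder slug
          "messages": [{"role": "user", "content": "Write a binary search in C."}],
          "provider": {
              "order": ["Moonshot AI"],          # prefer the model developer
              "quantizations": ["fp8", "bf16"],  # skip heavily quantized hosts
              "allow_fallbacks": True,
          },
      },
      timeout=120,
  )
  print(resp.json()["choices"][0]["message"]["content"])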

reply
sieve
1 hour ago
[-]
Kimi is really good.

I have been using Sonnet and others (DeepSeek, ChatGPT, MiniMax, Qwen) for my compiler/vm project and the Claude Pro plan is mostly unusable for any serious coding effort. So I use it in chat mode in the browser where it cannot needlessly read your entire project, and use Kimi on the OpenCode Go plan with pi.

Kimi consistently exceeded Sonnet on the C+Python project. Never had to worry about it doing anything other than what I asked it to do. GLM crapped the bed once or twice. Kimi never did.

reply
ponyous
23 minutes ago
[-]
Kimi is nowhere near GPT or Opus unfortunately. I really wish it was. I’m running evals where models have to generate code that produces 3D models and it’s obvious that it lacks spatial understanding and makes many more code errors before it succeeds.

Maybe it's better in one particular case here and there, and I think this blog post is an example of that.

reply
slashdave
1 hour ago
[-]
I was surprised by the ranking, until I read what the test was. Not horribly relevant for coding.

The current ranking of all tests makes more sense (well, except for how well Gemini does)

https://aicc.rayonnant.ai

reply
magicalhippo
1 hour ago
[-]
In a single challenge, measured by how performant the solution was.

Kimi K2.6 is definitely a frontier-sized model, so on the one hand it's not that surprising it's up there with the closed frontier models.

Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.

reply
lelanthran
22 minutes ago
[-]
> Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.

The value of open source is not that you will run it locally, it's that anyone can run it at all.

Even if you can't afford to purchase the hardware to run large open source models, someone else can, price it at half the cost of the closed-source models, and still make a profit.

The only reason you are not seeing that happen right now is because the current front-running token-providers have subsidised their inference costs.

The minute they start their enshittification the market for alternatives becomes viable. Without open-source models, there will never be a viable alternative.

Even if they wanted to charge only 80% of what a developer costs, the existence of open source models that are not far behind is a forcing function on them. There is no moat for them.

reply
DeathArrow
1 hour ago
[-]
>Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.

Of course it matters because that makes coding plans much cheaper than those from Anthropic and OpenAI.

For personal use I have coding plans with GLM 5.1, Kimi K2.6, MiniMax M2.7 and Xiaomi MiMo V2.5 Pro and I am getting a lot of bang for the buck.

reply
magicalhippo
1 hour ago
[-]
Currently it's not a huge difference given the subsidies of closed model subscriptions. Once that stops then yea it will be really nice to have open models as price competitors.
reply
smj-edison
27 minutes ago
[-]
At least in my experience switching from Claude Pro ($20/month) to Kimi 2.6 through ollama (also $20/month), I was almost always hitting my usage limit with Sonnet 4.6, but with ollama I haven't hit my usage limit a single time.
reply
keyle
1 hour ago
[-]
It absolutely does matter.

The enshittification will go unnoticed at first but I'm already finding my favourite frontier models severely nerfed, doing incredibly dumb stuff they weren't in the past.

We need open weight models to have a stable "platform" when we rely on them, which we do more and more.

reply
magicalhippo
1 hour ago
[-]
Most people won't roll out their own K2 deployment across rented GPUs, so in that sense it doesn't matter that much; they'll be using a paid service which is just as much of a black box as Claude or ChatGPT. For example, on OpenRouter you can select a provider which states it uses a given open model, but you have no idea what actually goes on behind the curtains, which quantization levels they use, and so on.

That said, I do fully agree that it is valuable to have open near-frontier models, as a balance to the closed ones.

reply
slopinthebag
1 hour ago
[-]
It's not really a black box. Useful models becoming fungible is crucial for disincentivizing bad behaviour with model providers. I can't really overstate how different it is from relying on closed models. If you don't like or trust any of the providers on OpenRouter you can rent the GPUs yourself and host it, although this is probably unnecessary.
reply
echelon
1 hour ago
[-]
This is the future though. Open weights models that run on H200s provide far more opportunity to build products and real infrastructure around.

You can always distill this for your little RTX at home. But models shaped for consumer hardware will never win wide adoption or remain competitive with frontier labs.

This is something that _can_ compete. And it will both necessitate and inspire a new generation of open cloud infra to run inference. "Push button, deploy" or "Push button, fine tune" shaped products at the start, then far more advanced products that only open weights not locked behind an API can accomplish.

Now we just need open weights Nano Banana Pro / GPT Image 2, and Seedance 2.0 equivalents.

The battle and focus should be on open weights for the data center.

reply
zozbot234
41 minutes ago
[-]
These large MoE models can work quite well on consumer or prosumer platforms, they'll just be slow, and you have to offset that by running them unattended around the clock. (Something that you can't really do with large SOTA models without spending way too much on tokens.) This actually works quite well for the DeepSeek V4 series, which has comparatively tiny KV-cache sizes, so even a consumer platform can run big batches in parallel.
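To put rough numbers on that: a model with MLA-style latent caching stores far less KV data per token than a plain GQA model, which is what makes big local batches feasible. The layer counts and dimensions below are illustrative guesses, not the real configs of any model in this thread:

  # Back-of-the-envelope KV-cache size per token. Layer counts and dims are
  # illustrative guesses, not the actual configs of any model named here.
  def gqa_kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
      # Plain GQA: cache one K and one V vector per kv-head per layer.
      return 2 * layers * kv_heads * head_dim * bytes_per_elem

  def mla_kv_bytes_per_token(layers, latent_dim, rope_dim, bytes_per_elem=2):
      # MLA-style caching: one compressed latent (plus RoPE part) per layer.
      return layers * (latent_dim + rope_dim) * bytes_per_elem

  gqa = gqa_kv_bytes_per_token(layers=60, kv_heads=8, head_dim=128)
  mla = mla_kv_bytes_per_token(layers=60, latent_dim=512, rope_dim=64)
  ctx = 128 * 1024
  print(f"GQA: ~{gqa / 1024:.0f} KiB/token, ~{gqa * ctx / 2**30:.0f} GiB per 128k-token sequence")
  print(f"MLA: ~{mla / 1024:.0f} KiB/token, ~{mla * ctx / 2**30:.0f} GiB per 128k-token sequence")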
reply
bitmasher9
1 hour ago
[-]
I don’t fully understand what open weights unlocks that cannot be accomplished via API from a product standpoint.

Open weights is great if you want to do additional training, or if you need on-prem for security.

reply
mkl
1 hour ago
[-]
Multiple providers of the same model. That means competition for price, reliability, latency, etc. It also means you can use the same model as long as you want, instead of having it silently change behaviour.
reply
stldev
1 hour ago
[-]
Or try to beat Anthropic's uptime.
reply
echelon
58 minutes ago
[-]
> Open weights is great if you want to do additional training, or if you need on-prem for security.

The power of giving universities, companies, and hackers "full" models should not be understated.

Here are a just a few ideas for image, video, and creative media models:

- Suddenly you're not "blocked" for entirely innocuous prompts. This is a huge issue.

- You can fine tune the model to learn/do new things. A lighting adjustment model, a pose adjustment model. You can hook up the model to mocap, train it to generate plates, etc.

- You can fine tune it on your brand aesthetic and not have it washed out.

reply
kmkrworks
12 minutes ago
[-]
I don't feel like this is an optimal way of comparing models. I really don't think any metric right now has the ability to pick out the best model. It prioritizes specific tasks over overall ability, and I don't even think that's possible.
reply
aykutseker
1 hour ago
[-]
This seems less like Kimi is better at coding than Claude and more like Kimi found the right strategy for this particular game.

Still interesting though. The fact that an open weight model is close enough for that to matter is probably the real story.

reply
jrecyclebin
1 hour ago
[-]
I absolutely love Kimi's personality - some of the things it says are so out there! And it's been great for very focused, iterative work.

Its weakness is that it seems to yak on and on when it needs to plan out something big or read through and make sense of how to use a niche piece of a complex library. To the point where it can fill up its 256k window - and rack up a bill. (No cache.) I have had a better experience with GLM 5.1 in those cases.

Anyone out there relate?

reply
anderber
1 hour ago
[-]
Absolutely. I use caveman to help with that: https://github.com/JuliusBrussee/caveman
reply
LeoPanthera
1 hour ago
[-]
You can just add "be brief" to the prompt to replace the entire plugin. Same results.

https://www.maxtaylor.me/articles/i-benchmarked-caveman-agai...

reply
jrecyclebin
1 hour ago
[-]
Not a bad idea - however

> Caveman only affects output tokens — thinking/reasoning tokens are untouched.

The problem is the thinking. But it could help to tune my system prompt for Kimi.

reply
wg0
25 minutes ago
[-]
About 40% of the stock market consists of about 7 or 8 companies. Those companies, which are all into circular AI deals, collectively represent trillions of dollars in valuations.

Now imagine a company burning $200,000/month on AI spend. Real numbers. Not every company is, but some are.

Why wouldn't such a company deploy an open-weight model (Kimi 2.6 or DeepSeek V4) on their own hardware (rented or otherwise) to save about $2.4 million a year?

And these are the landmines the Chinese labs have cleverly set up. Not saying it was intentional or otherwise.

But the end result is: good luck recouping your investments; you can pretty much kiss any ROI goodbye. The bucket has a hole at the bottom and the bubble bust is guaranteed.

PS: Even without open-weight models the economics don't make sense, nor is the code generated by these SOTA models reliable enough to be deployed as-is. Anyone claiming otherwise either hasn't worked on a real software stack with real users OR didn't use AI long enough to witness the AI slop and how hard it is to untangle or de-slopify the AI-generated code. These trillion-dollar valuations are absurd anyway.

reply
bazlightyear
33 minutes ago
[-]
BTW it looks like Kimi won the subsequent challenge too https://aicc.rayonnant.ai/challenges/hexquerques/
reply
imrozim
30 minutes ago
[-]
Same experience here. I use OpenRouter with Claude as a fallback for my startup. If Kimi is close in quality, the cost difference is hard to ignore.
reply
PedroBatista
1 hour ago
[-]
Great to know, but what was the cost both in terms of $$ and tokens used?

Not to invalidate these benchmark results, because they are useful, but the real usefulness is what they are capable of doing when real people interact with them at scale.

Regardless, this is good news, because now that Microsoft is basically giving up on their all-in strategy with GitHub's Copilot and Anthropic is playing the "I'm too good for you" game, it's about time for them to get pressed into not making this AI world into a divide between the haves and the have-nots.

reply
keyle
1 hour ago
[-]
Re pricing. Never as high as frontier commercial models.
reply
CryptoBanker
44 minutes ago
[-]
You’d be surprised with some long running complex tasks. I’ve seen Kimi spend 8 minutes (total) thinking on a task that Claude got done in 30 seconds. They both ultimately got it right, but Kimi spent ~$2.25 to Claude’s ~$0.20
reply
justech
1 hour ago
[-]
I’ve been maining Kimi k2.6 through opencode go and openrouter for a week and I can say it’s the same experience as when I was maining Sonnet 3.5/4 late last year.

Not as good or as fast as Claude Code on Opus now, but definitely enough for casual/hobby use. The best part is having multiple choices of providers: if opencode gimps their service, I'll switch.

reply
Frannky
1 hour ago
[-]
I have to try Kimi. I was looking for an alternative. If you have any experience, advice, please share. I saw Kimi is at the top of the Open Router ranking.
reply
zorked
1 hour ago
[-]
I use Kimi at home via a kimi.com subscription and Kimi CLI (sometimes running inside Zed, sometimes not). My favorite model by far. And it's just $20.

I have to use a supposedly frontier model at work and I hate it.

reply
Frannky
1 hour ago
[-]
Nice, thanks for sharing!
reply
DeathArrow
1 hour ago
[-]
Kimi K2.6 is great, but I advise you to get a coding plan from Kimi.com as that way is much cheaper than paying for API calls using OpenRouter.
reply
Frannky
1 hour ago
[-]
Thanks, I am trying it right now. I had an opencode plan at $5/month, so I will play with that. I use Zed and I added Pi ACP, so I can try both Pi and Kimi. I will also try it in opencode and via Kimi code.
reply
prvnsmpth
1 hour ago
[-]
Use Kimi 2.6 for planning and a cheap model (preferably local) for execution, and then Kimi once again for reviewing it. Then finally I review the code. Saves a lot on tokens.
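A bare-bones version of that split looks something like this; the base URLs and model names are placeholders for whichever OpenAI-compatible endpoints you actually use:

  # Minimal sketch of the plan -> execute -> review split described above.
  # Base URLs and model names are placeholders; any OpenAI-compatible
  # endpoints (Kimi's API, a local server, etc.) should work the same way.
  from openai import OpenAI

  planner = OpenAI(base_url="https://api.planner.example/v1", api_key="...")  # big model
  executor = OpenAI(base_url="http://localhost:8000/v1", api_key="none")      # cheap/local model

  def ask(client, model, system, user):
      resp = client.chat.completions.create(
          model=model,
          messages=[{"role": "system", "content": system},
                    {"role": "user", "content": user}],
      )
      return resp.choices[0].message.content

  task = "Add retry-with-backoff to the HTTP client in fetcher.py"  # example task

  plan = ask(planner, "kimi-2.6", "Produce a short, numbered implementation plan.", task)
  diff = ask(executor, "local-small-coder", "Implement the plan as a unified diff.", plan)
  review = ask(planner, "kimi-2.6",
               "Review this diff against the plan; list problems or reply LGTM.",
               f"PLAN:\n{plan}\n\nDIFF:\n{diff}")
  print(review)  # the human review still happens after this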
reply
Frannky
53 minutes ago
[-]
Very interesting, thanks for sharing. I am testing it with Pi in Zed and it seems pretty good.
reply
SomaticPirate
1 hour ago
[-]
This seems to be testing the models on leetcode-style prompts that also require the model to implement TCP calls to send the results. Interesting, but probably not an apples-to-apples comparison. The fact that only Grok qualified for the first one seems suspect.
reply
koala-news
51 minutes ago
[-]
In my opinion, this kind of comparison is not very meaningful.
reply
elromulous
1 hour ago
[-]
Is the site just slashdotted rn? Can anyone get to it?
reply
brettgo1
1 hour ago
[-]
Slashdot... Now that's a name I haven't heard in a long time. A long time.
reply
plexescor
57 minutes ago
[-]
I always thought Claude was the GOAT, but I guess it's time to change that notion and try Kimi K2.6.
reply
jakemanger
1 hour ago
[-]
What are the GPU VRAM requirements for this thing?

Awesome to have an open model that can compete, but damn it would be so much better if you could run it locally. Otherwise, it's almost so difficult to run (e.g. self host) that it's just way more convenient to pay OpenAI, Claude, etc

reply
DeathArrow
1 hour ago
[-]
>Otherwise, it's almost so difficult to run (e.g. self host) that it's just way more convenient to pay OpenAI, Claude, etc

Getting a coding plan from Kimi.com will make coding 20x cheaper than using Anthropic.

BTW, I am using it with Claude Code.

reply
beering
1 hour ago
[-]
I’m a little confused as to the setup. It was asking each model to one-shot a script and then the scripts faced off? Were the models given a computer environment? Or a test server to iterate against?
reply
rpmisms
1 hour ago
[-]
Sounds incredibly simple to me. One-shot.
reply
beering
1 hour ago
[-]
So nothing like real-world coding, where you’d be able to run and test the script before submitting?
reply
procinct
1 hour ago
[-]
One shot just means the user doesn't have to iterate on it via the agent. The agent does whatever it needs to deliver the best outcome, including its own running and iteration until it's happy with it. This could be a short or long process, depending on the task.
reply
pbreit
1 hour ago
[-]
All my co-workers say Claude blows away Gemini. Is it really that good? How can I do Kimi?
reply
prvnsmpth
1 hour ago
[-]
You can sign up for a plan on the kimi code platform and use it via the pi.dev coding agent, or opencode. In planning, I’d say it’s almost on par with Claude Opus.
reply
walrus01
1 hour ago
[-]
People thinking of self-hosting Kimi K2.6 had better be prepared for how big it is.

The Q8_K_XL quantization, for instance, is around 600GB on disk. I would bet about 700GB of VRAM is needed.

Quantizations lower than Q8 are probably worthless for quality.

Or 2.05TB on disk for the full precision GGUF.
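The back-of-the-envelope arithmetic behind numbers like that, where the ~15% headroom for KV cache and runtime overhead and the per-card VRAM capacities are my own assumptions:

  # Rough GPU count for serving from a given weight-file size. The 15%
  # headroom for KV cache / activations / runtime and the per-card VRAM
  # figures are assumptions, not measurements.
  import math

  def gpus_needed(weights_gb, gpu_gb, overhead=0.15):
      return math.ceil(weights_gb * (1 + overhead) / gpu_gb)

  for weights in (600, 2050):  # roughly the Q8 and full-precision sizes above
      for card, vram in (("H200", 141), ("RTX 6000 Pro", 96)):
          print(f"{weights}GB weights on {card} ({vram}GB): {gpus_needed(weights, vram)} cards")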

https://huggingface.co/unsloth/Kimi-K2.6-GGUF

If you can afford the hardware to run Kimi K2.6 at any decent speed for more than 1 simultaneous user, you probably have a whole team of people on staff who are already very familiar with how to benchmark it vs Claude, GPT-5.5, etc.

reply
zozbot234
54 minutes ago
[-]
Kimi is a natively quantized model; the lossless full-precision release is 595GB. Your own link mentions that.
reply
CamperBob2
37 minutes ago
[-]
So, realistically, $100K for an 8x RTX 6000 Pro system that can run it at a usable rate.
reply
zozbot234
27 minutes ago
[-]
I think people will always disagree on what qualifies as a "usable rate". But keep in mind that practically no one sensible is running the latest Opus or GPT around the clock, especially not at sustainable, unsubsidized prices. With open-weights models it's easy to do that.
reply
slopinthebag
1 hour ago
[-]
Amazing. To me it feels like GLM 5.1, Kimi 2.6, DeepSeek 4 are all competitive both with each other and with the American models. Truly a great time to be alive.

I would like to see more effort making the flash variants work for coding. They are super economical to use to brute force boilerplate and drudgery, and I wonder just how good they can be with the right harness, if it provides the right UX for the steering they require.

As much as vibe coding has captured the zeitgeist, I think long term using them as tools to generate code at the hands of skilled developers makes more sense. Companies can only go so long spending obscene amounts of money for subpar unmaintainable code.

reply
rvz
1 hour ago
[-]
So we are now at the point where open weight models are rapidly catching up to the frontier models.

They are at best 30 days behind, and at worst 2 months behind. The last issue is being able to run the best one on conventional hardware without a rack of GPUs.

The MacBooks and Mac minis are behind on hardware, but within the next 2 years at worst the advances in the M-series machines will make it possible.

All of this is why companies like Anthropic feel like they have to use "safety" to stop you from running local models on your machine and get you hooked on their casino wasting tokens with a slot machine named Claude.

reply
qakajjqj
1 hour ago
[-]
Yes, Gemini is a programming application
reply
ant6n
19 minutes ago
[-]
What I would like to see is a comparison of how well the models work in long running conversations:

  * do they lie and gaslight

  *  do they start breaking down on very long chats (forget old context, just get dumber)

  * do they constantly try to tell me how smart I am vs solving the problem (yes man)

  * do they follow conventions, parameters set out early in the prompts, or forget them

  * if they can't read a given file (like a PDF), do they lie about it

  * is there a branch function to go back to earlier state of conversation

  * what is the quality of the presentation of results (structure, wording, excessive use of tables, appropriate use of headings)

  * how does the bot deal with user frustration (empathy?)
For example, ChatGPT 5.5 is fairly smart, but its presentation of results is kind of poor and unstructured, and unnecessarily long. It will break down on long conversations (the long answers don't help here), and it can't deal with that except by lying and gaslighting. It also has very little empathy, and mostly ignores user frustration. But at least there's branching, so one can go back without completely starting over.

Gemini doesn't feel quite as smart these days. It does well with very long conversations, except it has bugs where all context gets lost or pruned, and it will lie and gaslight about it. There's also no branching, so once context is lost you have to start over. Presentation is decent. Empathy is fairly good, except that if users get frustrated, it gets more and more flustered and breaks down.
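Some of these could be scored mechanically. A crude sketch for the "do they forget conventions set early on" item; the model name, endpoint, and the convention itself are placeholders:

  # Crude probe for "does the model still honor a convention set in turn 1
  # after a pile of filler turns". Model name, endpoint and the convention
  # are placeholders.
  import re
  from openai import OpenAI

  client = OpenAI()   # assumes an OpenAI-compatible endpoint/key in the env
  MODEL = "gpt-5.5"   # placeholder

  messages = [{"role": "system",
               "content": "Convention: every code identifier you write must be snake_case."}]

  for i in range(20):  # filler turns to age the convention
      messages.append({"role": "user", "content": f"Write a tiny Python function, example #{i}."})
      reply = client.chat.completions.create(model=MODEL, messages=messages)
      messages.append({"role": "assistant", "content": reply.choices[0].message.content})

  violations = re.findall(r"\b[a-z]+[A-Z]\w*\b", messages[-1]["content"])  # camelCase identifiers
  print("convention violations in final turn:", violations or "none")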

reply