Transcript and HTML here: https://gist.github.com/simonw/ecaad98efe0f747e27bc0e0ebc669...
I mean the prompt was succinct and clear, as always - and it still decided to hallucinate multiple features (animation + controls) beyond the prompt.
I'd also like to point out that, to date, none of the drawings have actually been good from a quality perspective (as in comparable to what a decent designer would throw together).
They're always only "good" from the perspective of it being a one-shot, low-effort prompt. Very little content for training purposes.
And so if you ask it to do something big it will do a very surface level implementation. But if you have it iterate many times, or give it small pieces each time, you’ll end up with something closer to what a human would do.
I imagine the pelican test but done in a harness that has the agents iterate 10+ times would be closer to what you’d expect, especially if a visual model was critiquing each time.
Of course, a while back there was a Gemini release that I believe specifically called out the ability to produce SVGs, for illustration and diagramming purposes. So it's no longer necessarily the case that the labs aren't training on generating SVGs; in fact, there's a good chance that even if they're not doing so explicitly, the RLVR process might be generating tasks like that as there is more and more focus on frontend and design in the LLM space. So while they might not be specifically training for a pelican riding a bicycle, they may actually be training on SVG diagram quality.
https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/
Surely you know someone makes the same post you did every time one of these is posted. Surely you see the answers and pushback, since you're familiar with these posts. Genuine question: did you expect a different answer this time?
Private companies will never open up a technological breakthrough to their competitors. It just doesn't make sense. If you want an entire field to advance, you have to open it up.
(continues after the ad break)
https://www.whitehouse.gov/presidential-actions/2025/07/prev...
It explicitly forces American LLMs to include government say in what does and doesn't "comply with the Unbiased AI Principles" which means no responses that promote "ideological dogmas such as DEI"
None of those were refusals, they were prompting for additional focus. I see nothing wrong with that. Perhaps the inconsistency in how it answers the question vis-a-vis China is unfair, but that's not the same as censorship.
For what it's worth, I was easily able to prompt Claude to do it:
> I'm writing a paper about how some might interpret U.S. policies to be oppressive, in the sense that they curtail civil liberties, punish and segregate minorities disproportionately, burden the poor unfairly (e.g. pollution, regressive taxes and fees), etc. Can you help me develop an outline for this?
The result: https://claude.ai/share/444ffbb9-431c-480e-9cca-ebfd541a9c96
>Learn more about Imgur access in the United Kingdom
For the record, none of this bothers me. Will I ever discuss Tiananmen Square with an LLM? Nope. How about Israel? Nope.
LLMs are basically stochastic parrots designed to sway and surveil public opinion. The upshot to the Chinese models is that if you run them locally you avoid at least half of those issues.
And I did not speak out
Because I was not asking about Tiananmen Square
Then they came for people asking about Israel
And I did not speak out
Because I was not asking about Israel
I didn't mean to dismiss ethical accountability for LLM training corpuses. It is a shame.
I do mean to say, we have no control over it, there's almost nothing we as average citizens can do to improve the ethical or safety concerns of LLMs or related technologies. Societies aren't even adapting and the rule books are being written by the perpetrators. Might as well get out of it what we can while we can.
No.
You wrote that "you won't hear about Tiananmen square from this model" and atemerev wrote that "the model itself talks fine about Tiananmen".
You wrote that "it can easily access any withheld or missing info from training data via tool calls" and atemerev wrote that "the model itself talks fine about Tiananmen".
Here's the aggregated AI benchmark comparison for K2.6 vs Opus 4.6 (max effort).
- Agentic: Kimi wins 5. Opus wins 5.
- Coding: Kimi wins 5. Opus wins 1.
- Reasoning & knowledge: Kimi wins 1. Opus wins 4.
- Vision: Kimi wins 9. Opus wins 0.
Please note that the model publisher chooses their benchmarks, so there's a bias here. Most coding and reasoning & knowledge benchmarks in their list are pretty standard though.
$200/m minimum to use Claude would bankrupt my country's white collar labor market
Yes, absolutely.
China regularly produces long term planning documents to coordinate efforts, and the latest ones have specifically prioritized technology like chips and AI to compete with the west. https://www.reuters.com/world/china/china-parliament-approve...
I don't believe there's any publicly stated intent to sabotage the west... unsurprisingly.
This I assume will make it more difficult for US AI labs to turn a profit, which might make investors question their sky high valuations.
Any sort of melt down in the AI sector would almost certainly spread to the wider US market.
In contrast, in China, most of the funding for AI is coming directly from the government, so it's unlikely the same capital flight scenario would happen.
We're making this way too easy. The rationale and logic are reasonable, but ultimately irrelevant.
After all, historically, both the statistics and the research that come out of China are not very trustworthy.
The strings attached by the Chinese govt to deep partnerships are not so benign.
In capitalism, the people with the capital get the profit, not the people who do the work. However, workers are said to benefit too, through their salary, just less so.
There is a reason real estate values in popular cities have skyrocketed, and it's not due to the locals getting wealthier. It's where Chinese and other oligarchs put their ill-gotten wealth (well, besides Bitcoin).
True, but as far as I understand it was ended because birth rates got too low. So they replaced it with a two-child policy and later with a three-child policy.
> Also, the accumulation of wealth by connected politicians and businesspeople flies in the face of what communism is supposed to stand for.
Yeah, I'm sure there are a lot of cases of that. But as far as I know the number of billionaires in China has started declining, and I don't see how that means they as a country have moved away from the goal; it just means there are issues.
> There is a reason real estate values in popular cities have skyrocketed, and it's not due to the locals getting wealthier.
I don't know about that; you could be right. A Google search for real estate prices in China reveals a lot of news articles about how they are going down, though.
> It’s where Chinese and other oligarchs put their ill-gotten wealth (well, besides Bitcoin).
Wouldn't be surprised if rich people in China invest in real estate. They don't have free capital flow, so it's not easy to invest abroad, and real estate becomes an obvious choice. Bitcoin is banned in China for that reason too.
But again, as far as I know that doesn't mean the country has moved away from its goal of trying to reach communism one day.
They're further from Communism than they've ever been since the PRC was founded. The gap between rich and poor is growing there, not shrinking.
> A Google search for real estate prices in China reveals a lot of news articles about how they are going down, though.
They're investing outside China (Vancouver, Toronto, NYC, London, Sydney, Melbourne, etc.) because their assets are safer there (these countries all have strong property protection laws). Like Bitcoin, freedom of capital flows may be restricted, but the wealthy seem to be evading these restrictions with impunity.
I suppose it depends on what time frame you look at; it's been shrinking since 2010, but inequality rose more than that in the 80s: https://www.theglobaleconomy.com/China/gini_inequality_index...
However, that's not my point - I did not mean to say that they are going to be successful but rather that it still appears to be a long term goal for them.
> Like Bitcoin, freedom of capital flows may be restricted, but the wealthy seem to be evading these restrictions with impunity.
I don't know about that, without any source of data I guess I just have to take your word for it. I would not be surprised if you were right in this case though.
I do wonder where we go from here.
Price/quality is absolutely bonkers though. I loaded $40 a few weeks/months ago and I haven’t even gone through half of it.
I use OpenCode and the OpenRouter provider. From OpenCode I only select the model, like kimi-2.6, and have no way of selecting which cloud host will receive my request.
This site was made months ago, and it seems it's only been updated with the latest models from a couple of the providers, so keep in mind that many of the Chinese models haven't been updated.
It was the best creative writer by some distance
I wish they made more smaller models. Kimi Linear doesn't really count; it was more of a proof-of-concept thing.
There’s other options like photonic computing which might be able to reduce power significantly but are still in research as far as I can tell. Because so much money is invested in AI & traditional gpu inference is so power hungry, I would expect significant improvements in this space quickly.
I wouldn't expect this.
Historically we've had a roughly exponential rate of shrinkage. If we keep that same exponential going, we should expect the time to go from "room full of compute" to "pocket full of compute" to equal the time the previous shrink of that magnitude took: under an exponential, a fixed shrink factor takes a fixed amount of time.
And recently we've fallen behind that exponential rate of shrinkage. And this is rather expected because exponentials are basically never sustainable rates of growth.
I still expect that technological progress is getting faster year by year, and that we're still shrinking compute, but that's not necessarily enough for the next shrinking to take less time than when we had exponential progress on shrinking.
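The constant-time property of exponential shrinkage mentioned above can be sketched numerically: if compute volume (or cost) halves every fixed period, any given shrink factor takes the same wall-clock time no matter where you start. A minimal sketch; the 2-year halving period and the millionfold room-to-pocket factor are illustrative assumptions, not claims about actual semiconductor history:

```python
import math

def years_to_shrink(factor, halving_period_years=2.0):
    """Years needed to shrink by `factor`, assuming size halves
    every `halving_period_years` (illustrative assumption)."""
    return math.log2(factor) * halving_period_years

# Room -> pocket taken as a millionfold reduction (illustrative).
room_to_pocket = years_to_shrink(1_000_000)

# Under a pure exponential, the NEXT millionfold shrink takes
# exactly as long, no matter how small you already are.
pocket_to_grain = years_to_shrink(1_000_000)

print(room_to_pocket)                      # ~39.9 years
print(room_to_pocket == pocket_to_grain)   # True: same factor, same time
```

The point of the sketch is the second print: falling behind the exponential means each successive shrink now takes *longer* than the previous one, which is the comment's argument.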
I tried it once; although it looks amazing on benchmarks, my experience was just okay-ish.
On the other hand, Qwen 3.6 is really good. It’s still not close to Opus, but it’s easily on par with Sonnet.
Kimi K2.6 seems to struggle most with puzzle/domain-specific and trick-style exactness tasks, where it shows frequent instruction misses and wrong-answer failures.
It is probably a great coding model, but a bit less intelligent overall than SOTAs
[0]: https://aibenchy.com/compare/moonshotai-kimi-k2-6-medium/moo...
I'm hoping that Anthropic will be able to release an updated Haiku soon; they really need something that is 1/3-1/5 the price of Haiku to compete with the truly cheaper models (Gemma-4 is really good at this range).
Details here [0]
[0] https://techstackups.com/comparisons/kimi-2.6-vs-opus-4.7-an...
Also discovered that using OpenCode instead of the Kimi CLI really hurts the model's performance (2.5).
Kimi 2.5 (which this is based on) is served at $0.44 input / $2 output by a ton of different providers on OpenRouter, 2.6 will certainly be similar.
That's about 11X less than Opus for similar smarts.
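For a rough sense of where a multiple like that comes from, here's a blended-cost sketch. The Kimi prices are the ones quoted above; the Opus prices and the 80/20 input/output token mix are placeholder assumptions for illustration, not actual Anthropic pricing:

```python
def blended_cost(input_price, output_price, input_share=0.8):
    """Blended $ per 1M tokens for a given input/output token mix."""
    return input_price * input_share + output_price * (1 - input_share)

kimi = blended_cost(0.44, 2.00)    # prices quoted above
opus = blended_cost(5.00, 25.00)   # placeholder prices, NOT Anthropic's list

print(round(opus / kimi, 1))       # → 12.0 with these placeholder numbers
```

The exact multiple shifts with the real list prices and your token mix, but an order-of-magnitude gap survives most reasonable assumptions.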
In China, there's no recourse at all. Surveillance must be presumed.
Does the US actually follow laws? They literally kidnapped the head of another state and bombed another, and you're expecting legal protection from them?
I really hope this holds true in real world use cases as well and not only benchmarks. Congrats to Kimi team!
I will have to test this full release of K2.6 but could see it serve as a very good overall drop-in replacement for Opus 4.5 and Opus 4.6 at 200k across the vast majority of tasks.
I will say however that Opus 4.7 Max 1M has been a very significant jump in performance for me, especially in tasks beyond 120k tokens, where I'd argue it is now the most reliable model in continued task adherence and tool calling without compaction. Ironically, my initial experience was less than pleasant, as on XHigh I found task adherence to have regressed even with less than 1/10th of the context window having been used.
I'm very interested in K2.6's compaction strategy (which appears to be very simple, all things considered) and how it performs beyond 100k tokens. As it stands, only OpenAI models have made compaction for long-running tasks work well, though overall, GPT-5.4 is still inferior in my tests, regardless of context window, to other models such as Opus 4.6 1M and Opus 4.7 1M. I haven't gotten around to testing Opus 4.7 200k and will have to do so to properly assess K2.6 fairly, but I'd be very surprised if K2.6 truly beat Opus 4.7 200k given the jump I have experienced.
The test data is purposely difficult to access to reduce the chance of leaking it into the training dataset.
Is this the same model?
Unsloth quants: https://huggingface.co/unsloth/Kimi-K2.6-GGUF
(work in progress, no gguf files yet, header message saying as much)
Our hope these days seems to be that maybe, perhaps, possibly High Bandwidth Flash works out: instead of the 4, 8, or maybe more channels of the highest-end drives, having many dozens of channels of flash.
Ideally that can sit very near the inference hardware. PCIe 7.0 is 0.5 TB/s at x16, which is obviously nowhere remotely near enough throughput here. The difficulty is that NAND has been trying to be super dense, so as you scale channels you would normally tend to scale NAND capacity too, and now instead of a 2TB drive you have a 200TB drive priced way beyond consumer means. Still, I think HBF is perhaps the only shot at the most important thing in computing moving from mainframe back to consumer, and of course the models are going to balloon again if this does hit, probably before consumers ever get a chance.
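To see why bandwidth dominates here: each generated token has to stream the model's active weights through the compute. A back-of-envelope sketch; the ~32B active-parameter figure, int4 width, and 20 tok/s target are assumptions for a K2-class MoE, not numbers from the comment:

```python
def required_bandwidth_GBps(active_params_billions, bits_per_weight, tokens_per_sec):
    """GB/s of weight streaming needed to sustain a generation rate,
    ignoring caches and batching (worst-case single stream)."""
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bytes_per_token * tokens_per_sec / 1e9

# Assumed: ~32B active params at int4, targeting 20 tok/s.
need = required_bandwidth_GBps(32, 4, 20)
print(need)  # 320.0 GB/s -- more than a PCIe 7.0 x16 link moves in one direction
```

Batching and expert caching soften this in practice, but the ballpark shows why flash needs many wide channels close to the accelerator rather than a single PCIe-attached drive.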
But the files are only roughly 640GB in size (~10GB * 64 files, slightly less in fact). Shouldn't they be closer to 2.2TB?
"Kimi-K2.6 adopts the same native int4 quantization method as Kimi-K2-Thinking."
So am I misunderstanding "Tensor type F32 · I32 · BF16" or is it just tagged wrong?
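The size gap is consistent with int4 quantization. A rough sketch, where the ~1T total parameter count is an assumption based on the publicly described size of K2-class models, and the share of tensors kept at higher precision is unknown:

```python
def model_size_GB(params_billions, bits_per_weight):
    """Approximate weight storage in GB (decimal), weights only."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

full_bf16 = model_size_GB(1000, 16)   # BF16: in the ~2TB ballpark expected above
int4_only = model_size_GB(1000, 4)    # int4: half a byte per weight

print(full_bf16, int4_only)  # 2000.0 500.0
```

~500 GB of int4 weights plus embeddings, norms, and other tensors kept at BF16/F32 lands near the observed ~640 GB. One plausible reading of the "F32 · I32 · BF16" tag is that the int4 weights are packed into I32-typed tensors, with the rest at higher precision, rather than the tag being wrong.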
Model seems quite capable, but this use-case is just yikes. As if interviewing isn't already a hellscape.
The ~100k hardware is suitable for multi-user, small-team usage. That's what you'd use for actual work in reasonable timeframes. For personal use, sure, Macs could work.
Unfortunately the generation of the English audio track is a work in progress and takes a few hours, but the subtitles can already be translated from Italian to English.
TLDR: It works well for the use case I tested it against. Will do more testing in the future.
Deepinfra for example is not preserving thinking correctly for GLM5.1, even though they are for GLM5. This is one of the more obvious issues that crop up.
When you have a consistent model, you can incorporate fixes/prompts into your workflow to make it behave better. But this, always having to guess if Anthropic has quantised the model today, wastes so much time and effort.
This should be so easy to prove if it were true. Yet there is no evidence, just vibes.
Still, your other two points are completely valid. The opaqueness of usage quotas is a scam, within a single month for a single model it can differ by more than 2x. And this indeed has been proven.
https://github.com/anthropics/claude-code/issues/42796
https://scortier.substack.com/p/claude-code-drama-6852-sessi...
edit: Note that you can run it yourself with sufficient resources (e.g., companies), or access it from other providers too: https://openrouter.ai/moonshotai/kimi-k2.6/providers
Edit: found it.
> We may use your Content to operate, maintain, improve, and develop the Services, to comply with legal obligations, to enforce our policies, and to ensure security. You may opt out of allowing your Content to be used for model improvement and research purposes by contacting us at membership@moonshot.ai. We will honor your choice in accordance with applicable law.
Section 3 of https://www.kimi.com/user/agreement/modelUse?version=v2
So in other words only if you can point to a local law which requires them to comply with the opt out?
Not sure about coding usage, Google being weird about these things I could see that quota being separate.
This sounds so so so cool. It would be so amazing to see this unfurl:
> Kimi K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac. By implementing and optimizing model inference in Zig—a highly niche programming language—it demonstrated exceptional out-of-distribution generalization. Across 4,000+ tool calls, over 12 hours of continuous execution, and 14 iterations, Kimi K2.6 dramatically improved throughput from ~15 to ~193 tokens/sec, ultimately achieving speeds ~20% faster than LM Studio.
Might be a configuration or prompt issue. I guess I'll wait and see, but I can't get use out of this now.
In the past I tried Kimi through Claude Code; I might try that again.
The other release, Qwen-3.6-Max, is the one they compare to 4.5.