Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning model
936 points | 1 month ago | 48 comments | moonshotai.github.io | HN
yanhangyhy
1 month ago
[-]
As a Chinese user, I can say that many people use Kimi, even though I personally don’t use it much. China’s open-source strategy has many significant effects—not only because it aligns with the spirit of open source. For domestic Chinese companies, it also prevents startups from making reckless investments to develop mediocre models. Instead, everyone is pushed to start from a relatively high baseline. Of course, many small companies in the U.S., Japan, and Europe are also building on Qwen. Kimi is similar: before DeepSeek and others emerged, their model quality was pretty bad. Once the open-source strategy was set, these companies had no choice but to adjust their product lines and development approaches to improve their models.

Moreover, the ultimate competition between models will eventually become a competition over energy. China’s open-source models have major advantages in energy consumption, and China itself has a huge advantage in energy resources. They may not necessarily outperform the U.S., but they probably won’t fall too far behind either.

reply
yanhangyhy
29 days ago
[-]
One thing to add: the most popular AI product in China is not Kimi, I think; it should be Doubao by ByteDance (TikTok's owner) and Yuanbao by Tencent. They have a better UI and feature set, and you can also select the DeepSeek model from them. Kimi still has a lot of users, but I think in the long term it still may not do well. So is it still a win for closed models?
reply
lettergram
1 month ago
[-]
There are a lot of indications that we're currently brute-forcing these models. There's honestly no reason they have to be 1T parameters and cost an insane amount to train and run inference on.

What we're going to see is that, as energy becomes a problem, they'll simply shift to more effective and efficient architectures in both physical hardware and model design. I suspect they can also simply charge more for the service, which reduces usage for senseless applications.

reply
yanhangyhy
1 month ago
[-]
There are also elements of stock price hype and geopolitical competition involved. The major U.S. tech giants are all tied to the same bandwagon — they have to maintain this cycle: buy chips → build data centers → release new models → buy more chips.

It might only stop once the electricity problem becomes truly unsustainable. Of course, I don’t fully understand the specific situation in the U.S., but I even feel that one day they might flee the U.S. altogether and move to the Middle East to secure resources.

reply
simpsond
29 days ago
[-]
Sundar is talking about fleeing earth to secure photons and cooling in space.
reply
amluto
29 days ago
[-]
Good luck. Space has lots of photons but really quite poor cooling resources.
reply
simonw
1 month ago
[-]
> There’s honestly not a reason they have to be 1T parameters and cost an insane amount to train and run on inference.

Kimi K2 Thinking is rumored to have cost $4.6m to train - according to "a source familiar with the matter": https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-rele...

I think the most interesting recent Chinese model may be MiniMax M2, which is just 200B parameters but benchmarks close to Sonnet 4, at least for coding. That's small enough to run well on ~$5,000 of hardware, as opposed to the 1T models which require vastly more expensive machines.

reply
Der_Einzige
1 month ago
[-]
That number is as real as the 5.5 million to train DeepSeek. Maybe it's real if you're only counting the literal final training run, but once total costs are included (the huge number of failed runs and all the other costs), it's several hundred million to train a model that's usually still worse than Claude, Gemini, or ChatGPT. It took 1B+ (500 billion on energy and chips ALONE) for Grok to get into the "big 4".
reply
tw1984
29 days ago
[-]
By that logic, one can even argue that the real cost needs to include the infrastructure: total investment in the semiconductor industry, the national electricity grid, education, and even defence.
reply
Der_Einzige
29 days ago
[-]
Correct! You do have to account for all of these things! Unironically correct! :)
reply
creato
28 days ago
[-]
That's baked into the cost of the hardware, though.
reply
michaellee8
18 days ago
[-]
The Chinese government has been heavily subsidizing electricity.
reply
oxcidized
29 days ago
[-]
> That's small enough to run well on ~$5,000 of hardware...

Honestly curious where you got this number, unless you're talking about extremely small quants. Even just a Q4-quant GGUF is ~130GB. Am I missing out on a relatively cheap way to run models this large well?

I suppose you might be referring to a Mac Studio, but (while I don't have one to be a primary source of information) it seems like there is some argument to be made on whether they run models "well"?

reply
simonw
29 days ago
[-]
Yes, I mean a Mac Studio with MLX.

An M3 Ultra with 256GB of RAM is $5599. That should just about be enough to fit MiniMax M2 at 8bit for MLX: https://huggingface.co/mlx-community/MiniMax-M2-8bit

Or maybe run a smaller quantized one to leave more memory for other apps!

Here are performance numbers for the 4bit MLX one: https://x.com/ivanfioravanti/status/1983590151910781298 - 30+ tokens per second.
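For anyone who wants to drive that from Python rather than a chat UI, here's a minimal sketch using the mlx-lm package and the mlx-community/MiniMax-M2-8bit repo linked above (it assumes a Mac with enough free unified memory to hold the 8-bit weights, which fill most of a 256GB machine):

  from mlx_lm import load, generate

  # Downloads the 8-bit MLX conversion from Hugging Face on first run.
  model, tokenizer = load("mlx-community/MiniMax-M2-8bit")

  # verbose=True prints the generation plus tokens/sec stats as it runs.
  print(generate(model, tokenizer, prompt="Why is the sky blue?", verbose=True))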

reply
zht
28 days ago
[-]
It’s kinda misleading to omit the generally terrible prompt processing speed on Macs

30 tokens per second looks good until you have to wait minutes for the first token

reply
simonw
28 days ago
[-]
The tweet I linked to includes that information in the chart.
reply
oxcidized
29 days ago
[-]
Thanks for the info! Definitely much better than I expected.
reply
fzzzy
29 days ago
[-]
Running in CPU RAM works fine. It's not hard to build a machine with a terabyte of RAM.
reply
oxcidized
29 days ago
[-]
Admittedly I've not often tried running on system RAM, but every time I have it's been abysmally slow (< 1 T/s), with something like KoboldCPP or Ollama. Is there any particular method required to run them faster? Or is it just "get faster RAM"? I fully admit my DDR3 system has quite slow RAM...
reply
electroglyph
1 month ago
[-]
i assume that $4.6 mil is just the cost of the electricity?
reply
simonw
1 month ago
[-]
Hard to be sure because the source of that information isn't known, but generally when people talk about training costs like this they include more than just the electricity but exclude staffing costs.

Other reported training costs tend to include rental of the cloud hardware (or equivalent if the hardware is owned by the company), e.g. NVIDIA H100s are sometimes priced out in cost-per-hour.

reply
Der_Einzige
1 month ago
[-]
Citation needed on "generally when people talk about training costs like this they include more than just the electricity but exclude staffing costs".

It would be simply wrong to exclude the staffing costs. When each engineer costs well over 1 million USD in total costs year over year, you sure as hell account for them.

reply
simonw
1 month ago
[-]
If you have 1,000 researchers working for your company and you constantly have dozens of different training runs on the go, overlapping each other, how would you split those salaries between those different runs?

Calculating the cost in terms of GPU-hours is a whole lot easier from an accounting perspective.

The papers I've seen that talk about training cost all do it in terms of GPU hours. The gpt-oss model card said 2.1 million H100-hours for gpt-oss:120b. The Llama 2 paper said 3.31M GPU-hours on A100-80G. They rarely give actual dollar costs and I've never seen any of them include staffing hours.
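As a rough back-of-the-envelope sketch of how those GPU-hours turn into a dollar figure (the per-hour rate here is an assumed illustrative number, not something from the model cards):

  # Back-of-the-envelope: GPU-hours -> training cost in dollars.
  h100_hours = 2_100_000      # gpt-oss:120b, per its model card (cited above)
  usd_per_hour = 2.00         # assumed H100 rental rate; real prices vary widely
  print(f"~${h100_hours * usd_per_hour / 1e6:.1f}M")   # -> ~$4.2M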

reply
shawndrost
29 days ago
[-]
Do they include the costs of dead-end runs?
reply
Der_Einzige
29 days ago
[-]
No, they don't! That's why the "5.5 million" DeepSeek V3 number as read by American investors was total bullshit (because investors ignored the asterisk saying "only final training run").
reply
simonw
29 days ago
[-]
Yeah, that's one of the most frustrating things about these published numbers. Nobody ever wants to share how much money they spent on runs that didn't produce a useful model.

As with staffing costs though it's hard to account for these against individual models. If Anthropic run a bunch of training experiments that help them discover a new training optimization, then use that optimization as part of the runs for the next Opus and Sonnet and Haiku (and every subsequent model for the lifetime of the company) how should the cost of that experimental run be divvied up?

reply
vanviegen
1 month ago
[-]
No, because what people are generally trying to express with numbers like these is how much compute went into training. Perhaps another measure, like zettaFLOPs or something, would have made more sense.
reply
nl
29 days ago
[-]
Can confirm MiniMax M2 is very impressive!
reply
MallocVoidstar
1 month ago
[-]
> What we’re going to see is as energy becomes a problem

This is much more likely to be an issue in the US than in China. https://fortune.com/2025/08/14/data-centers-china-grid-us-in...

reply
thesmtsolver
1 month ago
[-]
Disagree. Part of the reason China produces more power (and pollution) is due to China manufacturing for the US.

https://www.brookings.edu/articles/how-do-china-and-america-...

The source for China's energy is more fragile than that of the US.

> Coal is by far China’s largest energy source, while the United States has a more balanced energy system, running on roughly one-third oil, one-third natural gas, and one-third other sources, including coal, nuclear, hydroelectricity, and other renewables.

Also, China's GDP is a bit less efficient in terms of power used per unit of GDP. China relies on coal and imports.

> However, China uses roughly 20% more energy per unit of GDP than the United States.

Remember, China still suffers from blackouts due to manufacturing demand not matching supply. The fortune article seems like a fluff piece.

https://www.npr.org/2021/10/01/1042209223/why-covid-is-affec...

https://www.bbc.com/news/business-58733193

reply
mullingitover
1 month ago
[-]
These stories are from 2021.

China has been adding something like a 1GW coal plant’s worth of solar generation every eight hours in the past year, and the rate is accelerating. The US is no longer a serious competitor for China when it comes to energy production.

reply
DeH40
1 month ago
[-]
The reason it happened in 2021, I think, might be that China took on the production capacity gap caused by COVID shutdowns in other parts of the world. The short-term surge in production led to a temporary imbalance in the supply and demand of electricity
reply
timlarshanson
29 days ago
[-]
This was very surprising to me, so I just fact-checked this statement (using Kimi K2 Thinking, natch), and it's presently off by a factor of 2-4. In 2024 China installed 277 GW of solar, so 0.25 GW / 8 hours. In the first half of 2025 they installed 210 GW, so 0.39 GW / 8 hours.
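The conversion is just installed capacity divided by the number of 8-hour blocks in the period:

  # GW of solar installed per 8-hour block, from the figures above.
  per_8h_2024 = 277 / (365 * 3)      # full-year 2024 installs -> ~0.25 GW / 8 hours
  per_8h_h1_2025 = 210 / (181 * 3)   # first half of 2025 (181 days) -> ~0.39 GW / 8 hours
  print(round(per_8h_2024, 2), round(per_8h_h1_2025, 2))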

Not quite at 1 GW / 8 hrs, but approaching that figure rapidly!

(I'm not sure where the coal plant comes in - really, those numbers should be derated relative to a coal plant, which can run 24/7)

reply
mullingitover
28 days ago
[-]
> (I'm not sure where the coal plant comes in - really, those numbers should be derated relative to a coal plant, which can run 24/7)

It works both ways: you have to derate the coal plant somewhat due to the transmission losses, whereas with a lot of solar power being generated and consumed on/in the same building the losses are practically nil.

Also, pricing for new solar with battery is below the price of building a new coal plant and dropping; it's approaching the point where it's economical to demolish existing coal plants and replace them with solar.

reply
tmule
1 month ago
[-]
China's breakneck development is difficult for many in the US to grasp (root causes: baselining on sluggish domestic growth, and a condescending view of China). This article offers a far more accurate picture of how China is doing right now: https://archive.is/wZes6
reply
eddyg
29 days ago
[-]
Eye-opening summary... I knew China was ahead, but wow. Thanks for sharing that article.
reply
frays
29 days ago
[-]
Thank you for sharing this article. Eye opening.
reply
bildung
1 month ago
[-]
As counterpoints to illustrate China's current development:

* China has produced more PV panel capacity in the first half of this year than the US has installed, in total, across all of its history

* China alone has installed PV capacity of over 1000 GW today

* China has installed battery electrical storage of about 100 GW / 300 GWh today and aims to have 180 GW in 2027

reply
onion2k
1 month ago
[-]
> Part of the reason China produces more power (and pollution) is due to China manufacturing for the US.

Presumably they'd stop doing that once AI becomes a more beneficial use for the energy though.

reply
yanhangyhy
1 month ago
[-]
I don't remember many details about the situation in 2021. But China is in a period of technological explosion—many things are changing at an incredible speed. In just a few years, China may have completely transformed in various fields.

Western media still carry strong biases toward China’s political system, and they have done far too little to portray the country’s real situation. The narrative remains the same old one: “China succeeded because it’s capitalist,” or “China is doomed because it’s communist.”

But in reality, barely a few days go by without some new technological breakthrough or innovation happening in China. The pace of progress is so fast that even people inside the country don't always keep up with it. For example, just since the start of November, we've seen China's space station crew doing a barbecue in orbit, researchers in Hefei working on an artificial sun making some new progress, and a team discovering a safe and efficient method for preparing aromatic amines. Apart from the space station bit—which got some attention—the others barely made a ripple. Also, China's first electromagnetic catapult aircraft carrier has officially entered service.

About a year ago, I started using Reddit intensively. What I read most on Reddit are reports related to electricity, because it involves environmental protection and hatred towards Trump, etc. There are too many leftists, so the discussions are somewhat biased. But the related news reports and nuclear data are real. China reached its carbon peak in 2025, and this year it has truly become a powerhouse in electricity. National data centers are continuously being built, but residential electricity prices have never been and will never be affected. China still has a lot of coal-fired power plants, but it continues to carry out technological upgrades on them. At the same time, wind, solar, nuclear and other sources are all advancing steadily. China is the only country that is not controlled by ideology and is increasing its electricity capacity in a scientific way.

(Maybe in the AI field people like to talk about this more. Not only did Kimi release a new model; Xpeng has a new robot and drew some attention. These all happened within a few days.)

reply
dvt
29 days ago
[-]
> China is the only country that is not controlled by ideology and is increasing its electricity capacity in a scientific way.

I've recently noticed a lot of pro-CCP propaganda on social media (especially Instagram and TikTok), but strangely also on HN; kind of interesting. To anyone making the (trivially false) claim that China is not controlled by ideology, I'm not quite sure how you'd convince them of the opposite. I'm not a doomer, but as China ramps up its aggression towards Taiwan (and the US will inevitably have to intervene), this will likely not end well in the next 5-10 years.

reply
FooBarWidget
29 days ago
[-]
I also think that one claim is dubious, but do you really have to focus on only that part to the exclusion of everything else? All the progress made is real, regardless of your opinion on the existence of ideology.
reply
yanhangyhy
29 days ago
[-]
I mean only on this specific topic: electricity. Arguing about other things is pointless since HN has the same political leaning as Reddit, so I will pass.
reply
FooBarWidget
29 days ago
[-]
What's your Reddit username? I'm interested in reading your posts there.
reply
yanhangyhy
29 days ago
[-]
I don't have one now. I used to post lots of comments on China stuff, but I got banned once, and every time I registered a new account it would get banned soon after. I guess they banned all my IPs. So I only browse anonymously now.
reply
brazukadev
29 days ago
[-]
It's absolutely impressive to see China's development. I'm happy my country is slowly but surely moving to China's orbit of influence, especially economically.
reply
yanhangyhy
29 days ago
[-]
If it's improving living standards for the people, then it surely is a good thing.
reply
sirtaj
29 days ago
[-]
"Not controlled by ideology" is a pretty bold statement to make about a self-declared Communist single-party country. There is always an ideology. You just happen to agree with whatever this one is (Controlled-market Communism? I don't know what the precise term is).
reply
yanhangyhy
29 days ago
[-]
I cannot edit this now, so I want to add some clarification: it just means that on this specific topic, electricity, China doesn't act like the US or Germany and abandon wind or nuclear; it's based only on science.
reply
Leynos
29 days ago
[-]
Having larger models is nice because they have a much wider sphere of knowledge to draw on. Not in the sense of using them as encyclopedias. More in the sense that I want a model that is going to be able to cross reference from multiple domains that I might not have considered when trying to solve a problem.
reply
dzonga
29 days ago
[-]
You guys will outperform the US, no doubt.

Energy generation is multiples of what the US is producing. What does AI need? Energy.

Second, the open-source nature of the models means, as you said, a high baseline to start with and faster iteration.

reply
david927
29 days ago
[-]
> will outperform

does outperform

China is absolutely winning innovation in the 21st century. I'm so impressed. For an example from just this morning, there was an article that they're developing thorium reactor-powered cargo ships. I'm blown away.

reply
yanhangyhy
29 days ago
[-]
I remember this thing. The tech is from America actually, decades ago (thorium). But they gave up, and China has continued the work in recent years.
reply
embedding-shape
29 days ago
[-]
> The tech is from America actually, decades ago. (Thorium).

I guess it depends on how you see it, but regardless, the people putting it to use today don't seem to be in the US.

FWIW:

> Thorium was discovered in 1828 by the Swedish chemist Jöns Jacob Berzelius during his analysis of a new mineral [...] In 1824, after more deposits of the same mineral in Vest-Agder, Norway, were discovered [...] While thorium was discovered in 1828 its first application dates only from 1885, when Austrian chemist Carl Auer von Welsbach invented the gas mantle [...] Thorium was first observed to be radioactive in 1898, by the German chemist Gerhard Carl Schmidt

For being an American discovery, it sure has a lot of European people involved in it :) (I've said it elsewhere but it's worth repeating: trying to track down where a technology/invention actually comes from is a fool's errand, and there is always something earlier that led to today, so it doesn't serve much purpose except nationalism, it seems to me).

reply
rmorey
29 days ago
[-]
"The tech is from America actually, decades ago... But they give up and china continues the work"

Many such cases...

reply
yanhangyhy
28 days ago
[-]
rare earth..
reply
epolanski
27 days ago
[-]
Jm2c, but I really dislike those winners/losers narratives. They lack any nuance, are juvenile, and ultimately contribute little but noise, like the endless, pointless "who's better, Jordan or LeBron?" debates.
reply
autop0ietic
28 days ago
[-]
Maybe, or maybe the current models are just a massive waste of energy because trying to run the economy on tokens is a stupid idea.
reply
chaostheory
29 days ago
[-]
Going on a tangent, is Europe even close? Mistral has been underwhelming
reply
troyvit
29 days ago
[-]
I don't know how close Europe is, but I'm sufficiently whelmed by Mistral that I don't need to look elsewhere yet. It's kind of like having a Toyota Corolla while everybody else is driving around in smart cars, but it gets it done. On top of that, there's a loyal community that (maybe because I'm not looking) I don't see with other products. It probably depends on your uses, but if I spent all my time chasing the latest chat models (like Kimi K2, for instance) I wouldn't actually get anything done.
reply
KronisLV
28 days ago
[-]
> I don't know how close Europe is, but I'm sufficiently whelmed by Mistral that I don't need to look elsewhere yet. It's kind of like having a Toyota Corolla while everybody else is driving around in smart cars, but it gets it done.

My problem was that it really doesn't; none of the models out there are that great at agentic coding when you care about maintainability. Sonnet 4.5 sometimes struggles and is only okay with some steering, same for Gemini Pro 2.5. GPT-5 recently seems closer to "just working" with high reasoning, but is still expensive and slow. Cerebras recently started offering GLM-4.6 and it's roughly on par with Sonnet 4, so not great, but 24M tokens per day for 50 USD seems like good value even with the 128k context limitation.

I don't think there is a single model that is good enough and dependable enough in my experience out there yet, I'll probably keep jumping around for the next 5-10 years (assuming the models keep improving until we hit diminishing returns so hard that it all evens out, hopefully after they've reached a satisfying baseline usefulness).

Don't get me wrong, all of those models can already provide value; it's just that they're pretty finicky a lot of the time, some of it inherent due to how LLMs work, but some of it also because they should just be trained better and more. And the tools they're given should be better. And the context should be managed better. And I shouldn't see something as simple as diffs fail to apply repeatedly just because I'm asking for 100% accuracy in the search/replace to avoid them messing up the brackets or whatever else.

reply
epolanski
27 days ago
[-]
Coding isn't the only use case.

Neither is being bleeding edge.

I use Mistral's models. I've built an entire internal-knowledge pipeline of sorts using Mistral's products (which involved anything from OCR, to summarization, to linking stuff across different services like Jira or Teams, etc.), and I've been very happy with it.

We did consider alternatives and, truth be told, none was as cost-effective, fast, and satisfying (and also our company does not trust US AI companies not to do stuff with our data).

reply
troyvit
22 days ago
[-]
My god the cost right? It's so much less than any of the competition that just feeding off of an api key (for coding, yeah) works great.

But as you say the rest of it is good too. I use it for research and to me it does a great job, all for a fraction of the price and the carbon of the U.S. players.

reply
r_lee
27 days ago
[-]
So you're not able to trust inference providers like Google Cloud w/ ZDR etc with your data?
reply
epolanski
26 days ago
[-]
My EU-based clients are unwilling to do so, as we see all clouds as black boxes where you have no real idea what you're getting into.

Most of our hosting is also on European providers; my team is the only one that deploys some services on Azure.

reply
dehugger
29 days ago
[-]
Probably cuz you aren't looking yeah. Anthropic seems to be leading the "loyalty" war in the US.
reply
troyvit
29 days ago
[-]
Yeah and I'll probably end up going that way as work locks down the models we're allowed to use, saving Mistral for personal projects.
reply
rurban
29 days ago
[-]
You have to try the latest Corolla then. Really smart. Lane and collision assistance, ... Unlike my old Corolla, which is totally dumb. It doesn't even turn the lights off when I leave the car.
reply
troyvit
28 days ago
[-]
Hah! I need to update my analogies.
reply
LaurensBER
29 days ago
[-]
Not anywhere near close.

Europe doesn't have the infrastructure (legal or energy) and US companies offer far better compensation for talent.

But hey, at least we have AI regulation! (sad smile :))

reply
almnifeed
28 days ago
[-]
How can I use it?
reply
yanhangyhy
28 days ago
[-]
google play or app store? or https://www.kimi.com/en/
reply
simonw
1 month ago
[-]

  uv tool install llm
  llm install llm-moonshot
  llm keys set moonshot # paste key
  llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'
https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...

Here's what I got using OpenRouter's moonshotai/kimi-k2-thinking instead:

https://tools.simonwillison.net/svg-render#%20%20%20%20%3Csv...

reply
JJax7
1 month ago
[-]
Love seeing this benchmark become more iconic with each new model release. Still in disbelief at the GPT-5 variants' performance in comparison, but it's cool to see the new open-source models get more ambitious with their attempts.
reply
aqme28
1 month ago
[-]
Only until they start incorporating this test into their training data.
reply
orbital-decay
1 month ago
[-]
Dataset contamination alone won't get them good-looking SVG pelicans on bicycles though, they'll have to either cheat this particular question specifically or train it to make vector illustrations in general. At which point it can be easily swapped for another problem that wasn't in the data.
reply
jug
1 month ago
[-]
I like this one as an alternative, which also requires using a special representation to achieve a visual result: https://voxelbench.ai

What's more, this doesn't benchmark a singular prompt.

reply
nwienert
1 month ago
[-]
they can have some cheap workers make about 10 pelicans by hand in svg, fuzz them to generate thousands of variations, and throw them in their training pool. don't need to 'get good at svgs' by any means.
reply
an0malous
1 month ago
[-]
Why is this a benchmark though? It doesn’t correlate with intelligence
reply
simonw
1 month ago
[-]
It started as a joke, but over time performance on this one weirdly appears to correlate to how good the models are generally. I'm not entirely sure why!
reply
behnamoh
1 month ago
[-]
it has to do with world model perception. these models don't have it but some can approximate it better than others.
reply
dmonitor
1 month ago
[-]
It's simple enough that a person can easily visualize the intended result, but weird enough that generative AI struggles with it
reply
JJax7
1 month ago
[-]
I'm not saying it's objective or quantitative, but I do think it's an interesting task because it would be challenging for most humans to come up with a good design for a pelican riding a bicycle.

also: NITPICKER ALERT

reply
beepbooptheory
1 month ago
[-]
I think it's cool and useful precisely because it's not trying to correlate with intelligence. It's a weird kind of niche thing that at least intuitively feels useful for judging LLMs in particular.

I'd much prefer a test which measures my cholesterol than one that would tell me whether I am an elf or not!

reply
HighGoldstein
1 month ago
[-]
What test would be better correlated with intelligence and why?
reply
ok_dad
1 month ago
[-]
When the machines become depressed and anxious we'll know they've achieved true intelligence. This is only partly a joke.
reply
jiggawatts
1 month ago
[-]
This already happens!

There have been many reports of CLI AI tools getting frustrated, giving up, and just deleting the whole codebase in anger.

reply
lukan
1 month ago
[-]
There are many reports of CLI AI tools displaying words that humans express when they are frustrated and about to give up. That is just what they have been trained on. It does not mean they have emotions. And "deleting the whole codebase" sounds more interesting, but I assume it is the same thing: "frustrated" words lead to frustrated actions. That does not mean the LLM was frustrated, just that in its training data those things happened, so it copied them in that situation.
reply
jiggawatts
1 month ago
[-]
This is a fundamental philosophical issue with no clear resolution.

The same argument could be made about people, animals, etc...

reply
lukan
1 month ago
[-]
The difference is, people and animals have a body, a nervous system, and in general those mushy things we think are responsible for emotions.

Computers don't have any of that. And LLM's in particular neither. They were trained to simulate human text responses, that's all. How to get from there to emotions - where is the connection?

reply
jiggawatts
1 month ago
[-]
Don't confuse the medium with the picture it represents.

Porn is pornographic, whether it is a photo or an oil painting.

Feelings are feelings, whether they're felt by a squishy meat brain or a perfect atom-by-atom simulation of one in a computer. Or a less-than-perfect simulation of one. Or just a vaguely similar system that is largely indistinguishable from it, as observed from the outside.

Individual nerve cells don't have emotions! Ten wired together don't either. Or one hundred, or a thousand... by extension you don't have any feelings either.

See also: https://www.mit.edu/people/dpolicar/writing/prose/text/think...

reply
lukan
1 month ago
[-]
Do you think a simulation of a weather forecast is the same as the real weather?

(And science fiction .. is not necessarily science)

reply
jiggawatts
27 days ago
[-]
> Do you think a simulation of a weather forcast is the same as the real weather?

If sufficiently accurate... then yes. It is weather.

We are mere information, encoded in the ripples of the fabric of the universe, nothing more.

reply
hellzbellz123
1 month ago
[-]
This only seems to be an issue for wishy washy types that insist gpt is alive.
reply
an0malous
1 month ago
[-]
A mathematical exam problem not in the training set because mathematical and logical reasoning are usually what people mean by intelligence.

I don’t think Einstein or von Neumann could do this SVG problem, does that mean they’re dumb?

reply
K0balt
1 month ago
[-]
I actually prefer ASCII-art diagrams as a benchmark for visual thinking, since it requires two stages, like SVG, and can also test imaginative repurposing of text elements.
reply
mrbonner
1 month ago
[-]
I suspect that the OpenRouter result originates from a quantized hosting provider. The difference compared to the direct API call from Moonshot is striking, almost like night and day. It creates a peculiar user and developer experience since OpenRouter enforces quantization restrictions only at the API level, rather than at the account settings level.
reply
simonw
1 month ago
[-]
OpenRouter are proxying directly through to Moonshot - they're currently the only provider listed on https://openrouter.ai/moonshotai/kimi-k2-thinking/providers
reply
irthomasthomas
1 month ago
[-]
That does include the Turbo endpoint, moonshotai/turbo. Add this to your prompt to only use the full-fat model:

  -o provider '{ "only": ["moonshotai"] }'

reply
ahmedfromtunis
1 month ago
[-]
Where do you run a trillion-param model?
reply
Gracana
1 month ago
[-]
If you want to do it at home, ik_llama.cpp has some performance optimizations that make it semi-practical to run a model of this size on a server with lots of memory bandwidth and a GPU or two for offload. You can get 6-10 tok/s with modest workstation hardware. Thinking chews up a lot of tokens though, so it will be a slog.
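As a rough illustration of the partial-offload idea, here's a minimal sketch using the mainline llama-cpp-python bindings rather than the ik_llama.cpp fork mentioned above; the GGUF filename and the layer count are placeholders, not real values for this model:

  from llama_cpp import Llama

  llm = Llama(
      model_path="kimi-k2-thinking-Q4_K_M-00001-of-00012.gguf",  # hypothetical local quant
      n_gpu_layers=20,   # offload whatever fits in VRAM; the rest runs from system RAM
      n_ctx=8192,        # bigger contexts eat more RAM for the KV cache
  )

  out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=256)
  print(out["choices"][0]["text"])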
reply
simonw
1 month ago
[-]
What kind of server have you used to run a trillion parameter model? I'd love to dig more into this.
reply
Gracana
1 month ago
[-]
Hi Simon. I have a Xeon W5-3435X with 768GB of DDR5 across 8 channels; iirc it's running at 5800MT/s. It also has 7x A4000s, water-cooled to pack them into a desktop case. Very much a compromise build, and I wouldn't recommend Xeon Sapphire Rapids because the memory bandwidth you get in practice is less than half of what you'd calculate from the specs. If I did it again, I'd build an EPYC machine with 12 channels of DDR5 and put in a single RTX 6000 Pro Blackwell. That'd be a lot easier and probably a lot faster.

There's a really good thread on level1techs about running DeepSeek at home, and everything there more-or-less applies to Kimi K2.

https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-hom...

reply
stronglikedan
1 month ago
[-]
If I had to guess, I'd say it's one with lots of memory bandwidth and a GPU or two for offload. (sorry, I had to, happy Friday Jr.)
reply
isoprophlex
1 month ago
[-]
You let the people at openrouter worry about that for you
reply
MurizS
1 month ago
[-]
Which in turn lets the people at Moonshot AI worry about that for them, the only provider for this model as of now.
reply
skeptrune
1 month ago
[-]
Good people over there
reply
lab
1 month ago
[-]
Does the run pin the temperature to 0 for consistency?
reply
skhameneh
1 month ago
[-]
I've been under the impression most inference engines aren't fully deterministic with a temperature of 0 as some of the initial seed values can vary.

Note: I haven't tested this nor have I played with seed values. IIRC the inference engines I used support an explicit seed value, that is randomized by default.

reply
simonw
1 month ago
[-]
No, I've never tried that.
reply
HarHarVeryFunny
1 month ago
[-]
It's good to see more competition, and open source, but I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent as opposed to a trillion parameter one. The ideal case would be something that can be run locally, or at least on a modest/inexpensive cluster.

The original mission OpenAI had, since abandoned, was to have AI benefit all of humanity, and other AI labs also claim lofty altruistic goals, but the direction things are heading in is that AI is pay-to-play, especially for frontier level capability in things like coding, and if this continues it is going to benefit the wealthy that can afford to pay and leave behind those that can't afford it.

reply
jstummbillig
1 month ago
[-]
> I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent

Well, I think you are seeing that already? It's not like these models don't exist or that nobody tried to make them good; it's just that the results are not super great.

And why would they be? Why would the good models (that are barely okay at coding) be big, if it was currently possible to build good models, that are small?

Of course, new ideas will be found and this dynamic may drastically change in the future, but there is no reason to assume that people who work on small models will find great optimizations that frontier model makers, who are very interested in efficient models, have not considered already.

reply
HarHarVeryFunny
1 month ago
[-]
Sure, but that's the point ... today's locally runnable models are a long way behind SOTA capability, so it'd be nice to see more research and experimentation in that direction. Maybe a zoo of highly specialized small models + agents for S/W development - one for planning, one for coding, etc?
reply
Uehreka
1 month ago
[-]
If I understand transformers properly, this is unlikely to work. The whole point of "Large" Language Models is that you primarily make them better by making them larger, and when you do so, they get better at both general and specific tasks (so there isn't a way to sacrifice generality but keep specific skills when training a small model).

I know a lot of people want this (Apple really really wants this and is pouring money into it) but just because we want something doesn’t mean it will happen, especially if it goes against the main idea behind the current AI wave.

I’d love to be wrong about this, but I’m pretty sure this is at least mostly right.

reply
maciejgryka
1 month ago
[-]
I think this is a description of how things are today, but not an inherent property of how the models are built. Over the last year or so the trend seems to be moving from “more data” to “better data”. And I think in most narrow domains (which, to be clear, general coding agent is not!) it’s possible to train a smaller, specialized model reaching the performance of a much larger generic model.

Disclaimer: this is pretty much the thesis of a company I work for, distillabs.ai but other people say similar things e.g. https://research.nvidia.com/labs/lpr/slm-agents/

reply
XenophileJKO
1 month ago
[-]
Actually there are ways you might get on-device models to perform well. It is all about finding ways to have a smaller number of weights work efficiently.

One way is reusing weights in multiple decoder layers. This works and is used in many on-device models.

It is likely that we can get pretty high performance with this method. You can also combine this with low-parameter ways to create overlapped behavior on the same weights; people have done LoRA on top of shared weights.

Personally I think there are a lot of potential ways that you can cause the same weights to exhibit "overloaded" behaviour in multiple places in the same decoder stack.

Edit: I believe this method is used a bit for models targeted for the phone. I don't think we have seen significant work on people targeting say a 3090/4090 or similar inference compute size.
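A toy sketch of the weight-reuse idea (purely illustrative, not any particular on-device model): the same layer weights are applied repeatedly, so depth and compute grow while the parameter count stays fixed.

  import numpy as np

  d_model, depth = 512, 24
  rng = np.random.default_rng(0)
  W_shared = rng.normal(size=(d_model, d_model)) * 0.02  # one set of layer weights

  def shared_stack(x):
      h = x
      for _ in range(depth):             # the same weights stand in for every decoder layer
          h = h + np.tanh(h @ W_shared)  # toy residual block in place of attention + MLP
      return h

  y = shared_stack(rng.normal(size=d_model))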

reply
martinald
1 month ago
[-]
The issue isn't even 'quality' per se (for many tasks a small model would do fine); it's that for "agentic" workflows it _quickly_ runs out of context. Even 32GB of VRAM is really very limiting.

And when I say agentic, I mean something even like this: 'book a table from my emails', which involves looking at 5k+ tokens of emails, 5k tokens of search results, then confirming with the user, etc. It's just not feasible on most hardware right now; even if the models are 1-2GB, you'll burn through the rest in context so quickly.

reply
HarHarVeryFunny
1 month ago
[-]
Yeah - the whole business model of companies like OpenAI and Anthropic, at least at the moment, seems to be that the models are so big that you need to run them in the cloud with metered access. Maybe that could change in the future to a sale or annual-licence business model if running locally became possible.

I think scale helps for general tasks where the breadth of capability may be needed, but it's not so clear that this needed for narrow verticals, especially something like coding (knowing how to fix car engines, or distinguish 100 breeds of dog is not of much use!).

reply
Aurornis
1 month ago
[-]
> the whole business model of companies like OpenAI and Anthropic, at least at the moment, seems to be that the models are so big that you need to run them in the cloud with metered access.

That's not a business model choice, though. That's a reality of running SOTA models.

If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves. It would cut their datacenter spend dramatically.

reply
Majromax
1 month ago
[-]
> If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves.

First, they do this; that's why they release models at different price points. It's also why GPT-5 tries auto-routing requests to the most cost-effective model.

Second, be careful about considering the incentives of these companies. They all act as if they're in an existential race to deliver 'the' best model; the winner-take-all model justifies their collective trillion dollar-ish valuation. In that race, delivering 97% of the performance at 10% of the cost is a distraction.

reply
cubefox
1 month ago
[-]
> > If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves.

> First, they do this; that's why they release models at different price points.

No, those don't deliver the same output. The cheaper models are worse.

> It's also why GPT-5 tries auto-routing requests to the most cost-effective model.

These are likely the same size, just one uses reasoning and the other doesn't. Not using reasoning is cheaper, but not because the model is smaller.

reply
gunalx
1 month ago
[-]
But they also squeezed out an 80% price cut for o3 at some point, supposedly purely from inference or infra optimization.
reply
anabis
26 days ago
[-]
> delivering 97% of the performance at 10% of the cost is a distraction.

Not if you are running RL on that model, and need to do many roll-outs.

reply
Uehreka
1 month ago
[-]
No I don’t think it’s a business model thing, I’m saying it may be a technical limitation of LLMs themselves. Like, that that there’s no way to “order a la carte” from the training process, you either get the buffet or nothing, no matter how hungry you feel.
reply
ctoth
1 month ago
[-]
Unless you're programming a racing sim or maybe a CRUD app for a local Kennel Club, perhaps?

I actually find that things which make me a better programmer are often those things which have the least overlap with it. Like gardening!

reply
Aurornis
1 month ago
[-]
> today's locally runnable models are a long way behind SOTA capability

SOTA models are larger than what can be run locally, though.

Obviously we'd all like to see smaller models perform better, but there's no reason to believe that there's a hidden secret to making small, locally-runnable models perform at the same level as Claude and OpenAI SOTA models. If there was, Anthropic and OpenAI would be doing it.

There's research happening and progress being made at every model size.

reply
zhouxiaolinux
1 month ago
[-]
I think SLMs are developing very fast. A year ago, I couldn't have imagined a decent thinking model like Qwen, and now the space seems full of promise.
reply
prmph
1 month ago
[-]
You're still missing the point. The comment you're responding to is talking about specialized models
reply
SchemaLoad
1 month ago
[-]
The point is still valid. If the big companies could save money running multiple small specialised models on cheap hardware, they wouldn't be spending billions on the highest spec GPUs.
reply
oofbey
1 month ago
[-]
You want more research on small language models? You're confused. There is already WAY more research done on small language models (SLMs) than big ones. Why? Because it's easy. It only takes a moderate workstation to train an SLM. So every curious Masters student and motivated undergrad is doing this. Lots of PhD research is done on SLMs because the hardware to train big models is stupidly expensive, even for many well-funded research labs. If you read Arxiv papers (not just the flashy ones published by companies with PR budgets) most of the research is done on 7B parameter models. Heck, some NeurIPS papers (extremely competitive and prestigious) from _this year_ are being done on 1.5B parameter models.

Lack of research is not the problem. It's fundamental limitations of the technology. I'm not gonna say "there's only so much smarts you can cram into a 7B parameter model" - because we don't know that yet for sure. But we do know, without a sliver of a doubt, that it's VASTLY EASIER to cram smarts into a 70B parameter model than a 7B param model.

reply
HarHarVeryFunny
29 days ago
[-]
It's not clear if the ultimate SLMs will come from teams with less computing resources directly building them, or from teams with more resources performing ablation studies etc on larger models to see what can be removed.

I wouldn't care to guess what the limit is, but Karpathy was suggesting in his Dwarkesh interview that maybe AGI could be a 1B parameter model if reasoning is separated (to the extent possible) from knowledge, which can be external.

I'm really more interested in coding models specifically rather than general-purpose ones, where it does seem that a HUGE part of the training data for a frontier model is of no applicability.

reply
oofbey
29 days ago
[-]
That’s backwards. New research and ideas are proven on small models. Lots and lots of ideas are tested that way. Good ideas get scaled up to show they still work on medium sized models. The very best ideas make their way into the code for the next huge training runs, which can cost tens or hundreds of millions of dollars.

Not to nitpick words, but ablation is the practice of stripping out features of an algorithm or technique to see which parts matter and how much. This is standard (good) practice on any innovation, regardless of size.

Distillation is taking power / capability / knowledge from a big model and trying to preserve it in something smaller. This also happens all the time, and we see very clearly that small models aren’t as clever as big ones. Small models distilled from big ones might be somewhat smarter than small models trained on their own. But not much. Mostly people like distillation because it’s easier than carefully optimizing the training for a small model. And you’ll never break new ground on absolute capabilities this way.

reply
HarHarVeryFunny
29 days ago
[-]
> Not to nitpick words, but ablation is the practice of stripping out features of an algorithm ...

Ablation generally refers to removing parts of a system to see how it performs without them. In the context of an LLM it can refer to training data as well as the model itself. I'm not saying it'd be the most cost-effective method, but one could certainly try to create a small coding model by starting with a large one that performs well, and seeing what can be stripped out of the training data (obviously a lot!) without impacting the performance.

reply
oofbey
29 days ago
[-]
ML researchers will sometimes vary the size of the training data set to see what happens. It’s not common - except in scaling law research. But it’s never called “ablation”.
reply
cantor_S_drug
1 month ago
[-]
In CS algorithms, we have space vs time tradeoffs.

In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.

reply
refulgentis
1 month ago
[-]
I have spent the last 2.5 years living like a monk to maintain an app across all paid LLM providers and llama.cpp.

I wish this was true.

It isn't.

"In algorithms, we have space vs time tradeoffs, therefore a small LLM can get there with more time" is the same sort of "not even wrong" we all smile about us HNers doing when we try applying SWE-thought to subjects that aren't CS.

What you're suggesting amounts to "monkeys on typewriters will write the entire works of Shakespeare eventually" - neither in practice nor in theory is this a technical claim, or something observable, or even something that has been stood up as a one-off misleading demo.

reply
cantor_S_drug
1 month ago
[-]
If "not even wrong" is more wrong than wrong, then is 'not even right" more right than right.

To answer you directly, a smaller SOTA reasoning model with a table of facts can rederive relationships given more time than a bigger model which encoded those relationships implicitly.

reply
Aurornis
1 month ago
[-]
> In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.

Assuming both are SOTA, a smaller model can't produce the same results as a larger model by giving it infinite time. Larger models inherently have more room for training more information into the model.

No amount of test-retry cycle can overcome all of those limits. The smaller models will just go in circles.

I even get the larger hosted models stuck chasing their own tail and going in circles all the time.

reply
yorwba
1 month ago
[-]
It's true that to train more information into the model you need more trainable parameters, but when people ask for small models, they usually mean models that run at acceptable speeds on their hardware. Techniques like mixture-of-experts allow increasing the number of trainable parameters without requiring more FLOPs, so they're large in one sense but small in another.

And you don't necessarily need to train all information into the model, you can also use tool calls to inject it into the context. A small model that can make lots of tool calls and process the resulting large context could obtain the same answer that a larger model would pull directly out of its weights.
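A toy sketch of that mixture-of-experts point (illustrative only, not any particular model): total parameters scale with the number of experts, but each token only pays the compute of the top-k experts it is routed to.

  import numpy as np

  d_model, d_ff, num_experts, top_k = 512, 2048, 64, 2
  rng = np.random.default_rng(0)

  router = rng.normal(size=(d_model, num_experts)) * 0.02
  experts = [(rng.normal(size=(d_model, d_ff)) * 0.02,   # parameters grow with num_experts...
              rng.normal(size=(d_ff, d_model)) * 0.02)
             for _ in range(num_experts)]

  def moe_forward(x):                       # x: one token's hidden state, shape (d_model,)
      scores = x @ router
      chosen = np.argsort(scores)[-top_k:]  # route to the top_k highest-scoring experts
      gates = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
      out = np.zeros(d_model)
      for g, i in zip(gates, chosen):       # ...but per-token FLOPs only scale with top_k
          w_in, w_out = experts[i]
          out += g * (np.maximum(x @ w_in, 0.0) @ w_out)
      return out

  y = moe_forward(rng.normal(size=d_model))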

reply
naasking
29 days ago
[-]
> No amount of test-retry cycle can overcome all of those limits. The smaller models will just go in circles.

That's speculative at this point. In the context of agents with external memory, this isn't so clear.

reply
woctordho
1 month ago
[-]
Almost all training data is on the internet. As long as the small model has enough agentic browsing ability, given enough time it will retrieve the data from the internet.
reply
lossolo
1 month ago
[-]
This doesn't work like that. An analogy would be giving a 5 year old a task that requires the understanding of the world of an 18 year old. It doesn't matter whether you give that child 5 minutes or 10 hours, they won't be capable of solving it.
reply
HarHarVeryFunny
1 month ago
[-]
I think the question of what can be achieved with a small model comes down to what needs knowledge vs what needs experience. A small model can use tools like RAG if it is just missing knowledge, but it seems hard to avoid training/parameters where experience is needed - knowing how to perceive then act.

There is obviously also some amount (maybe a lot) of core knowledge and capability needed even to be able to ask the right questions and utilize the answers.

reply
lossolo
29 days ago
[-]
Small models handle simple, low-context tasks correctly most of the time. But for more complex tasks, they fail due to insufficient training capacity and too few parameters to integrate the necessary relationships.
reply
nkmnz
1 month ago
[-]
What if you give them 13 years?
reply
lossolo
29 days ago
[-]
Nothing will change. They will go out of context and collapse into loops.
reply
nkmnz
28 days ago
[-]
I mean the 5yo child, not the LLM
reply
fluoridation
1 month ago
[-]
Then they're not a 5-year-old anymore.
reply
homarp
1 month ago
[-]
but in 13 years, will they be capable?
reply
lossolo
29 days ago
[-]
No. They will go out of context and collapse into loops.
reply
andai
1 month ago
[-]
Actually it depends on the task. For many tasks, a smaller model can handle it, and it gets there faster!
reply
naasking
29 days ago
[-]
> Why would the good models (that are barely okay at coding) be big, if it was currently possible to build good models, that are small?

Because nobody has tried yet using recent developments.

> but there is no reason to assume that people who work on small models find great optimizations that frontier models makers, who are very interested in efficient models, have not considered already.

Sure there is: they can iterate faster on small model architectures, try more tweaks, train more models. Maybe the larger companies "considered it", but a) they are more risk-averse due to the cost of training their large models, b) that doesn't mean their conclusions about a particular consideration are right, empirical data decides in the end.

reply
a-dub
1 month ago
[-]
"open source" means there should be a script that downloads all the training materials and then spins up a pipeline that trains end to end.

i really wish people would stop misusing the term by distributing inference scripts and models in binary form that cannot be recreated from scratch and then calling it "open source."

reply
emsign
1 month ago
[-]
They'd have to publish or link the training data, which is full of copyrighted material. So yeah, calling it open source is weird; calling it warez would be more appropriate.
reply
oceanplexian
29 days ago
[-]
They should release it then. China doesn't have a problem stealing and distributing copyrighted material.
reply
nyrp
1 month ago
[-]
> binary form that cannot be recreated from scratch

Back in my day, we called it "freeware"

reply
poly2it
1 month ago
[-]
You have more rights over a freely licensed binary file than over a freeware file.
reply
moffkalast
1 month ago
[-]
I'd agree, but that's beyond hopelessly idealistic. That sort of approach only helps your competition, who will use it to build a closed product, and doesn't give anything of worth to people who want to actually use the model, because they have no means to train it. Hell, most people can barely scrape up enough hardware to even run inference.

Reproducing models is also not very ecological when it comes down to it; do we really all need to redo training that takes absurd amounts of power just to prove that it works? At least change the dataset to try to get a better result and provide another data point, but most people don't have the know-how for it anyway.

Nvidia does try this approach sometimes, funnily enough: they provide cool results with no model in hopes of getting people to buy their rented compute and their latest training-platform-as-a-service...

reply
TheBicPen
1 month ago
[-]
> I'd agree but we're beyond hopelessly idealistic. That sort of approach only helps your competition who will use it to build a closed product

That same argument can be applied to open-source (non-model) software, and is about as true there. It comes down to the business model. If anything, creating a closed-source copy of a piece of FOSS software is easier than doing so for an AI model, since running a compiler doesn't cost millions of dollars.

reply
danielmarkbruce
1 month ago
[-]
"open source" has come to mean "open weight" in model land. It is what it is. Words are used for communication, you are the one misusing the words.

You can update the weights of the model, continue to train, whatever. Nobody is stopping you.

reply
a-dub
1 month ago
[-]
it still doesn't sit right. sure it's different in terms of mutability from say, compiled software programs, but it still remains not end to end reproducible and available for inspection.

these words had meaning long before "model land" became a thing. overloading them is just confusing for everyone.

reply
danielmarkbruce
1 month ago
[-]
It's not confusing, no one is really confused except the people upset that the meaning is different in a different context.

On top of that, in many cases a company/group/whoever can't even reproduce the model themselves. There are lots of sources of non-determinism even if folks are doing things in a very buttoned-up manner. And, when you are training on trillions of tokens, you are likely training on some awful-sounding stuff - "Facebook trained Llama 4 on Nazi propaganda!" is not what they want to see published.

How about just being thankful?

reply
a-dub
1 month ago
[-]
i disagree. words matter. the whole point of open source is that anyone can look and see exactly how the sausage is made. that is the point. that is why the word "open" is used.

...and sure, compiling gcc is nondeterministic too, but i can still inspect the complete source from where it comes because it is open source, which means that all of the source materials are available for inspection.

reply
danielmarkbruce
1 month ago
[-]
The point of open source in software is as you say. It's just not the same thing though. Using words and phrases differently in different fields is common.
reply
a-dub
1 month ago
[-]
...and my point is that it should be.

the practice of science itself would be far stronger if it took more pages from open source software culture.

reply
h33t-l4x0r
1 month ago
[-]
I agree that they should say "open weight" instead of "open source" when that's what they mean, but it might take some time for people to understand that it's not the same thing exactly and we should allow some slack for that.
reply
a-dub
29 days ago
[-]
no. truly open source models are wonderful and remarkable things that truly move the needle in education, understanding, distributed collaboration and the advancement of the state of the art. redefinition of the terminology reduces incentive to strive for the wonderful goal that they represent.
reply
HarHarVeryFunny
29 days ago
[-]
There is a big difference between open source for something like the linux kernel or gcc where anyone with a home PC can build it, and any non-trivial LLM where it takes cloud compute and costs a lot to train it. No hobbyist or educational institution is going to be paying for million dollar training runs, probably not even thousand dollar ones.
reply
a-dub
29 days ago
[-]
"too big to share." nope. sharing the finished soup base, even if well suited for inclusion in other recipes, is still different from sharing the complete recipe. sharing the complete recipe encourages innovation in soup bases, including bringing the cost down for making them from scratch.
reply
danielmarkbruce
29 days ago
[-]
There is an enormous amount of information in the public domain about building models. In fact, once you get into the weeds you'll realize there is too much, and in many cases (not all, but many) the very specific way something was done, the framework used, or the hardware configuration was just a function of what the team had on hand or had experience with. One could spend a lifetime just trying to reproduce OLMo's work or a lot of the Hugging Face stuff....
reply
mensetmanusman
1 month ago
[-]
Weights are meaningless without training data and source.
reply
antiframe
1 month ago
[-]
I get a lot of meaning out of weights and source (without the training data), not sure about you. Calling it meaningless seems like exaggeration.
reply
mensetmanusman
29 days ago
[-]
Can you change the weights to improve the model?
reply
HarHarVeryFunny
29 days ago
[-]
You can fine tune without the original training data, which for a large LLM is typically going to mean using LoRA - keeping the original weights unchanged and adding separate fine-tuning weights.
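For illustration, a minimal sketch of what that looks like with Hugging Face PEFT (the checkpoint name is just a placeholder; assume any open-weight model that fits on your hardware):

    # Hypothetical sketch: LoRA fine-tuning without the original training data.
    # The base weights stay frozen; only small adapter matrices are trained.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # placeholder checkpoint
    config = LoraConfig(
        r=8, lora_alpha=16,                   # adapter rank / scaling
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)      # base weights frozen, adapters trainable
    model.print_trainable_parameters()        # typically well under 1% of total params
    # ...then train on your own data with a normal Trainer loop.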
reply
danielmarkbruce
26 days ago
[-]
it's a bunch of numbers. Of course you can change them.
reply
HarHarVeryFunny
1 month ago
[-]
Yeah, but "open weights" never seems to have taken off as a better description, and even if you did have the training data + recipe, the compute cost makes training it yourself totally impractical.

The architecture of these models is no secret; the secret is just the training data (incl. for post-training) and training recipe. So a more practical push might be for models that are only trained using public training data, which the community could share and potentially contribute to.

reply
andai
1 month ago
[-]
The meaning of Open Source

1990: Free Software

2000: Open Source: Finally we sanitized ourselves of that activism! It was scaring away customers!

2010: Source is available (under our very restrictive license)

2020: What source?

reply
rurban
29 days ago
[-]
2025: What prompt?
reply
stingraycharles
1 month ago
[-]
With these things it's always both at the same time: these super grandiose SOTA models are making improvements mostly because of optimizations, and they're also just scaling out as far as they can.

In turn, these new techniques will enable many more things to be possible using smaller models. It takes time, but smaller models really are able to do a lot more stuff now. DeepSeek was a very good example of a large model whose innovations in how it used transformers brought a lot of benefits to smaller models.

Also: keep in mind that this particular model is actually an MoE model that activates 32B parameters at a time. So they really are just stacking a whole bunch of smaller models into a single large model.

reply
robotresearcher
1 month ago
[-]
Yes, I am also super interested in cutting the size of models.

However, in a few years today’s large models will run locally anyhow.

My home computer had 16KB RAM in 1983. My $20K research workstation had 192MB of RAM in 1995. Now my $2K laptop has 32GB.

There is still such incredible pressure on hardware development that you can be confident that today’s SOTA models will be running at home before too long, even without ML architecture breakthroughs. Hopefully we will get both.

Edit: the 90’s were exciting for compute per dollar improvements. That expensive Sun SPARC workstation I started my PhD with was obsolete three years later, crushed by a much faster $1K Intel Linux beige box. Linux installed from floppies…

reply
moregrist
1 month ago
[-]
> My home computer had 16KB RAM in 1983. My $20K research workstation had 192MB of RAM in 1995. Now my $2K laptop has 32GB.

You’ve picked the wrong end of the curve there. Moore’s law was alive and kicking in the 90s. Every 1-3 years brought an order of magnitude better CPU and memory. Then we hit a wall. Measuring from the 2000s is more accurate.

My desktop had 4GB of RAM in 2005. In 20 years it’s gone up by a factor of 8, but only by a factor of 2 in the past 10 years.

I can kind of uncomfortably run a 24B parameter model on my MacBook Pro. That’s something like 50-200X smaller (depending on quantization) than a 1T parameter model.

We’re a _long_ way from having enough RAM (let alone RAM in the GPU) for this size of model. If the 8x / 20 years holds, we’re talking 40-60 years. If 2X / 10 years holds, we’re talking considerably longer. If the curve continues to flatten, it’s even longer.
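As a rough sanity check on those timelines (a sketch with assumed numbers only: a 50-200x memory gap and the 8x-per-20-years growth rate above):

    # Hypothetical back-of-envelope check of the extrapolation above.
    from math import log

    growth_per_20yr = 8
    for gap in (50, 200):  # how much more memory a 1T-class model needs vs. today's laptop
        years = log(gap) / log(growth_per_20yr) * 20
        print(f"{gap}x more memory: ~{years:.0f} years at 8x per 20 years")
    # ~38 and ~51 years -- roughly the 40-60 year ballpark above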

Not to dampen anyone’s enthusiasm, but let’s be realistic about hardware improvements in the 2010s and 2020s. Smaller models will remain interesting for a very long time.

reply
robotresearcher
1 month ago
[-]
Moore’s Law is about transistor density, not RAM in workstations. But yes, density is not doubling every two years any more.

RAM growth slowed in laptops and workstations because we hit diminishing returns for normal-people applications. If local LLM applications are in demand, RAM will grow again.

RAM doubled in Apple base models last year.

reply
pshirshov
1 month ago
[-]
> The ideal case would be something that can be run locally, or at least on a modest/inexpensive cluster.

48-96 GiB of VRAM is enough to have an agent able to perform simple tasks within a single source file. That's the sad truth. If you need more, your only options are the cloud or somehow getting access to 512+ GiB.

reply
twotwotwo
1 month ago
[-]
I think there is a lot of progress on efficient useful models recently.

I've seen GLM-4.6 getting mention for good coding results from a model that's much smaller than Kimi (~350b params) and seen it speculated that Windsurf based their new model on it.

This Kimi release is natively INT4, with quantization-aware training. If that works--if you can get really good results from four-bit parameters--it seems like a really useful tool for any model creator wanting efficient inference.
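(For anyone unfamiliar with quantization-aware training, here's a toy sketch of the core trick - fake-quantizing weights in the forward pass so the model learns to tolerate 4-bit precision. This is illustrative PyTorch under my own assumptions, not Moonshot's actual recipe:)

    import torch

    def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
        # Symmetric per-tensor INT4 fake quantization: round to a 4-bit grid,
        # then dequantize, so the forward pass "sees" 4-bit weights.
        qmax = 7  # signed int4 symmetric range [-7, 7]
        scale = w.abs().max().clamp(min=1e-8) / qmax
        w_q = torch.clamp(torch.round(w / scale), -qmax, qmax)
        w_dq = w_q * scale
        # Straight-through estimator: gradients flow as if no rounding happened.
        return w + (w_dq - w).detach()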

DeepSeek's v3.2-Exp uses their sparse attention technique to make longer-context training and inference more efficient. Its output's being priced at 60% less than v3.1 (though that's an imperfect indicator of efficiency). They've also quietly made 'thinking' mode need fewer tokens since R1, helping cost and latency.

And though it's on the proprietary side, Haiku 4.5 approaching Sonnet 4 coding capability (at least on benches Anthropic released) also suggests legitimately useful models can be much smaller than the big ones.

There's not yet a model at the level of any of the above that's practical for many people to run locally, though I think "efficient to run + open so competing inference providers can run it" is real progress.

More importantly, it seems like there's a good trendline towards efficiency, and a bunch of techniques are being researched and tested that, when used together, could make for efficient higher-quality models.

reply
benjiro
1 month ago
[-]
What i do not understand is why we are not seeing specialized models that go down to single experts.

I do not need models that know how to program in Python, Rust, ... when I only use Go and HTML. So why are we not seeing models that have very specialized experts, where for instance:

* General interpreter model that holds context/memory

* Go model

* HTML model, if there is space in memory

* SQL model, if there is space in memory

If there is no space, the GIM swaps out the Go model for the HTML model, depending on where it is in the agent tasks or the Edit/Ask code it's overseeing.

Because the models are going to be very small, switching in and out of memory will be ultra fast. But most of the time we get very big expert models that are still very generalized over an entire field.

This can then be extended so that, if you have the memory, models combine their outputs across tasks... Maybe I am just too much of a noob in the field of understanding how LLMs work, but it feels like people are too often running after the large models that companies like Anthropic/OpenAI etc deploy. I understand why those big companies use insanely big models. They have the money to load them up over a cluster, they have the fast interconnect, and for them it's more efficient.

But from the bits and pieces that I see, people are more and more going to tons of small 1-2B models to produce better results. See my argument above. Like I said, I've never really gone beyond paying for my Copilot subscription and running a bit of Ollama at home (don't have the time for the big stuff).

reply
EMM_386
1 month ago
[-]
I think one of the issues is that LLMs can't have a "Go" model and an "HTML model". I mean, they can but what would that contain? It's not the language-specific features that make models large.

When models work on your code base, they do not "see" things like this, which is why they can go through an entire code base with variable names they have never seen before, function signatures they have never seen before, and directory structures they have never seen before and not have a problem.

You need that "this is a variable, which is being passed to a function which recursively does ..." part. This is not something language-specific; it's the high-level understanding of how languages and systems operate. A variable is a variable whether in JavaScript or C++, and LLMs can "see" it as such. The details are different, but that layer of "this is a software interface", "this is a function pointer" sits outside of the "Go" or "Python" or "C#" model.

I don't know how large the main model would have to be vs. the specialized models in order to pick this dynamic up.

reply
alansaber
1 month ago
[-]
You won't win much performance with a coding-language-specific tokenizer/vocabulary; everything else benefits from a larger model size. You can get distilled models that will outperform or compete with your single-domain coding model.
reply
pzo
1 month ago
[-]
Even if it's pay-to-play, companies like Moonshot AI help you pay less.

You can run the previous Kimi K2 non-thinking model e.g. on Groq at 720 tok/s and for $1/$3 per million input/output tokens. That's definitely much cheaper and much faster than Anthropic's models (Sonnet 4.5: 60 tok/s, $3/$15).

reply
esafak
1 month ago
[-]
If NVIDIA had any competition we'd be able to run these larger models at home by now instead of being saddled with these 16GB midgets.
reply
selectodude
1 month ago
[-]
NVIDIA has tons of competition on inference hardware. They're only a real monopoly when it comes to training new models.

And yet…

reply
esafak
1 month ago
[-]
Those are for the enterprise. In the context of discussion, end users only have Apple, AMD, and Nvidia.
reply
pama
1 month ago
[-]
It is not clear that a simple/small model with inference running on home hardware is energy- or cost-efficient compared to the scaled-up inference of a large model with batch processing. There are dozens of optimizations possible when splitting an LLM into multiple tiny components on separate accelerator units and when one handles KV cache optimization at the data center level; these are simply not possible at home and would be a waste of effort and energy until you serve thousands to millions of requests in parallel.
reply
maciejgryka
1 month ago
[-]
I think it’s going to be a while before we see small models (defined roughly as “runnable on reasonable consumer hardware”) do a good job at general coding tasks. It’s a very broad area! You can do some specific tasks reasonably well (eg I distilled a toy git helper you can run locally here https://github.com/distil-labs/gitara), but “coding” is such a big thing that you really need a lot of knowledge to do it well.
reply
andai
1 month ago
[-]
I used to be obsessed with what's the smartest LLM, until I tried actually using them for some tasks and realized that the smaller models did the same task way faster.

So I switched my focus from "what's the smartest model" to "what's the smallest one that can do my task?"

With that lens, "scores high on general intelligence benchmarks" actually becomes a measure of how overqualified the model is, and how much time, money and energy you are wasting.

reply
alansaber
1 month ago
[-]
What kind of task? Simple NLP, sure. Multi-hop or complex? Bigger is better.
reply
anabis
1 month ago
[-]
>The ideal case would be something that can be run locally, or at least on a modest/inexpensive cluster.

It's obviously valuable, so it should be coming. I expect 2 trends:

- Local GPUs/NPUs will get an LLM-focused version that has 50-100GB VRAM and runs MXFP4 etc.

- Distillation will come for reasoning coding agents, probably one for each tech stack (LAMP, Android app, AWS, etc.) x business domain (gaming, social, finance, etc.)

reply
pdyc
1 month ago
[-]
I think that's where prompt engineering would be needed. Bigger models produce good output even with ambiguous prompts. Getting similar output from smaller models is an art.
reply
0xjmp
1 month ago
[-]
This happens top-down historically though, yes?

Someone releases a maxed-out parameter model. Another distills it. Another bifurcates it. With some nuance sprinkled in.

reply
ares623
1 month ago
[-]
I don't understand. We already have that capability in our skulls. It's also "already there", so it would be a waste to not use it.
reply
HarHarVeryFunny
1 month ago
[-]
Software development is one of the areas where LLMs really are useful, whether that's vibe coding disposable software, or more structured use for serious development.

I've been a developer for 40+ years, and very good at it, but for some tasks it's not about experience or overcoming complexity - it's just a bunch of grunt work that needs to come together. The other day I vibe coded a prototype app, just for one-time demo use, in less than 15 minutes; it probably would have taken a week to write by hand, even assuming one was already familiar with the tech stack.

Developing is fun, and a brain is a terrible thing to waste, but today not using LLMs where appropriate for coding doesn't make any sense if you value your time whatsoever.

reply
tonyhart7
29 days ago
[-]
"I don't understand. We already have that capability in our skulls. It's also "already there", so it would be a waste to not use it."

It seems like you are the one here who doesn't understand this.

Companies want to replace humans so they won't need to pay massive salaries.

reply
ares623
29 days ago
[-]
I understand the companies wanting it. I hate it, but I understand.

I don’t understand the humans wanting to be replaced though.

reply
tonyhart7
29 days ago
[-]
"I don’t understand the humans wanting to be replaced though."

Because the humans who replace these jobs aren't the same humans who got cut????

The humans who can replace these jobs will be rich.

reply
wordpad
1 month ago
[-]
The electricity cost to run these models locally is already more than the equivalent API cost.
reply
HarHarVeryFunny
1 month ago
[-]
That's going to depend on how small the model can be made, and how much you are using it.

If we assume that running locally means running on a 500W consumer GPU, then the electricity cost to run this non-stop 8 hours a day for 20 days a month (i.e. "business hours") would be around $10-20.

This is about the same as OpenAI's or Anthropic's $20/mo plans, but for all-day coding you would want their $100 or $200/mo plans, and even these will throttle you and/or require you to switch to metered pricing when you hit plan limits.
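As a quick sanity check on that figure (a sketch with assumed electricity prices; actual rates vary a lot by region):

    # Hypothetical back-of-envelope electricity cost for local inference.
    gpu_watts = 500          # consumer GPU at full load, per the assumption above
    hours = 8 * 20           # "business hours": 8 h/day, 20 days/month
    kwh = gpu_watts / 1000 * hours   # 80 kWh per month
    for price in (0.12, 0.25):       # assumed $/kWh
        print(f"${kwh * price:.2f}/month at ${price}/kWh")
    # ~$9.60 to ~$20.00 per month -- the $10-20 range quoted above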

reply
wordpad
24 days ago
[-]
Neither $20 nor $200 plans cover any API costs.

At $0.17 per million tokens, the smallest GPT model is still faster and more powerful than anything you can run locally, and cheaper than the kilowatt-hours it would cost you to run it locally even if you could.

reply
ImPostingOnHN
1 month ago
[-]
Privacy is minimally valued by most, but not by all.
reply
pu_pe
1 month ago
[-]
Four independent Chinese companies released extremely good open source models in the past few months (DeepSeek, Qwen/Alibaba, Kimi/Moonshot, GLM/Z.ai). No American or European companies are doing that, including titans like Meta. What gives?
reply
jjice
1 month ago
[-]
I get what you mean, but OpenAI did release the gpt-oss in August, just three months ago. I've had a very good experience with those models.

https://openai.com/index/introducing-gpt-oss/ (August 5th)

I like Qwen 235 quite a bit too, and I generally agree with your sentiment, but this was a very large American open source model.

Unless we're getting into the complications on what "open source" model actually means, in which case I have no clue if these are just open weight or what.

reply
pu_pe
1 month ago
[-]
You're totally right. Ironically I am using gpt-oss for a project right now; I think its quality is comparable to the ones I mentioned.
reply
seunosewa
1 month ago
[-]
The Chinese are doing it because they don't have access to enough of the latest GPUs to run their own models. Americans aren't doing this because they need to recoup the cost of their massive GPU investments.
reply
0xjmp
1 month ago
[-]
I must be missing something important here. How do the Chinese train these models if they don't have access to the GPUs to train them?
reply
barrell
1 month ago
[-]
I believe they mean distribution (inference). The Chinese model is currently B.Y.O.GPU. The American model is GPUaaS
reply
0xjmp
1 month ago
[-]
Why is inference less attainable when it technically requires less GPU processing to run? Kimi has a chat app on their page using K2 so they must have figured out inference to some extent.
reply
jychang
1 month ago
[-]
That entirely depends on the number of users.

Inference is usually less GPU-compute heavy, but much more GPU-VRAM heavy pound-for-pound compared to training. A general rule of thumb is that you need 20x more VRAM to train a model with X params than to run inference for that same size model. So assuming batch size b, serving more than 20*b users would tilt VRAM use onto the side of inference.

This isn't really accurate; it's an extremely rough rule of thumb and ignores a lot of stuff. But it's important to point out that inference is quickly adding to costs for all AI companies. DeepSeek claims that they used $5.6mil to train DeepSeek R1; that's about 10-20 trillion tokens at their current pricing, or 1 million users sending just 100 requests at full context size.
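To make the break-even point concrete (a sketch that only restates the ~20x heuristic above; all numbers are assumptions):

    # Hypothetical illustration of the 20x training-vs-inference VRAM rule of thumb.
    train_vram_units = 20      # training VRAM, measured in "one inference replica" units
    users_per_replica = 32     # batch size b: concurrent users one replica serves (assumed)
    breakeven_users = train_vram_units * users_per_replica   # the 20*b point
    print(f"past ~{breakeven_users} concurrent users, total inference VRAM exceeds training VRAM")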

reply
root_axis
1 month ago
[-]
> it technically requires less GPU processing to run

Not when you have to scale. There's a reason why every LLM SaaS aggressively rate limits and even then still experiences regular outages.

reply
throwaway314155
1 month ago
[-]
tl;dr the person you originally responded too is wrong.
reply
Der_Einzige
1 month ago
[-]
That's super wrong. A lot of why people flipped out about DeepSeek V3 is because of how cheap and how fast their GPUaaS model is.

There is so much misinformation both on HN and in this very thread about LLMs and GPUs and cloud, and it's exhausting trying to call it out all the time - especially when it's coming from folks who are considered "respected" in the field.

reply
riku_iki
1 month ago
[-]
> How do the Chinese train these models if they don't have access to the GPUs to train them?

they may be taking some Western models (Llama, gpt-oss, Gemma, Mistral, etc.) and doing post-training, which requires way fewer resources.

reply
simonw
1 month ago
[-]
If they were doing that I expect someone would have found evidence of it. Everything I've seen so far has led me to believe that these Chinese AI labs are training their own models from scratch.
reply
riku_iki
1 month ago
[-]
not sure what kind of evidence it could be..
reply
simonw
1 month ago
[-]
Just one example: if you know the training data used for a model you can prompt it in a way that can expose whether or not that training data was used.

The NYT used tricks like this as part of their lawsuit against OpenAI: page 30 onwards of https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

reply
riku_iki
1 month ago
[-]
You either don't know which training data was used for, say, gpt-oss, or the training data could be included in some open dataset like The Pile or similar. I think this test is very unreliable, and even if someone came to such a conclusion, it's not clear what the value of that conclusion would be, or whether that someone could be trusted.
reply
simonw
1 month ago
[-]
My intuition tells me it is vanishingly unlikely that any of the major AI labs - including the Chinese ones - have fine-tuned someone else's model and claimed that they trained it from scratch and got away with it.

Maybe I'm wrong about that, but I've never heard any of the AI training experts (and they're a talkative bunch) raise that as a suspicion.

There have been allegations of distillation - where models are partially trained on output from other models, eg using OpenAI models to generate training data for DeepSeek. That's not the same as starting with open model weights and training on those - until recently (gpt-oss) OpenAI didn't release their model weights.

I don't think OpenAI ever released evidence that DeepSeek had distilled from their models, that story seemed to fizzle out. It got a mention in a congressional investigation though: https://cyberscoop.com/deepseek-house-ccp-committee-report-n...

> An unnamed OpenAI executive is quoted in a letter to the committee, claiming that an internal review found that “DeepSeek employees circumvented guardrails in OpenAI’s models to extract reasoning outputs, which can be used in a technique known as ‘distillation’ to accelerate the development of advanced model reasoning capabilities at a lower cost.”

reply
riku_iki
29 days ago
[-]
Additionally, it would be interesting to know if there are dynamics in the opposite direction: US corps (OpenAI, xAI) could now incorporate Chinese models into their core models as one or several expert towers.
reply
riku_iki
29 days ago
[-]
> That's not the same as starting with open model weights and training on those - until recently (gpt-oss) OpenAI didn't release their model weights.

there was obviously Llama.

reply
zackangelo
1 month ago
[-]
What 1T parameter base model have you seen from any of those labs?
reply
riku_iki
1 month ago
[-]
It's MoE; each expert tower can be branched from some smaller model.
reply
jychang
26 days ago
[-]
That's not how MoE works; you need to train the FFN directly or else the FFN gate would have no clue how to activate the expert.
reply
lossolo
1 month ago
[-]
This is false. You can buy whole H100 clusters in China, and Alibaba, ByteDance, Tencent, etc. have enough cards for training and inference.

Shenzhen 2025 https://imgur.com/a/r6tBkN3

reply
the_mitsuhiko
1 month ago
[-]
And Europeans don't do it because, quite frankly, we're not really doing anything particularly impressive with AI, sadly.
reply
abecode
1 month ago
[-]
At ECAI conference last week there was a panel discussion and someone had a great quote, "in Europe we are in the golden age of AI regulation, while the US and China are in the actual golden age of AI".
reply
speedgoose
1 month ago
[-]
To misquote the French president, "Who could have predicted?".

https://fr.wikipedia.org/wiki/Qui_aurait_pu_pr%C3%A9dire

reply
embedding-shape
1 month ago
[-]
He didn't coin that expression did he? I'm 99% sure I've heard people say that before 2022, but now you made me unsure.
reply
Sharlin
1 month ago
[-]
"Who could've predicted?" as a sarcastic response to someone's stupid actions leading to entirely predictable consequences is probably as old as sarcasm itself.
reply
speedgoose
1 month ago
[-]
People said it before, but he said it without sarcasm about things that many people could in fact predict.
reply
seydor
1 month ago
[-]
We could add cookie warnings to AI, everybody loves those
reply
DrNosferatu
1 month ago
[-]
Europe should act and make its own, literal, Moonshot:

https://ifiwaspolitical.substack.com/p/euroai-europes-path-t...

reply
imtringued
1 month ago
[-]
>Moonshot 1: GPT-4 Parity (2027)

>Objective: 100B parameter model matching GPT-4 benchmarks, proving European technical viability

This feels like a joke... Parity with a 2024 model in 2027? The Chinese didn't wait, they just did it.

The timeline for #1 LLM is also so far into the future that it is entirely plausible that by 2031, nobody uses transformer based LLMs as we know them today anymore. For reference: The attention paper is only 8 years old. Some wild new architecture could come out in that time that makes catching up meaningless.

reply
DrNosferatu
29 days ago
[-]
Note the EU-Moonshot project is based on its own silicon / compute sovereignty.

GPT-4 parity on an indigenous model trained on its own silicon is just an early goal.

Indeed, the ultimate goal is EU LLM supremacy - which means under democratic control.

reply
toephu2
1 month ago
[-]
Europe gave us cookie popups on every single website.
reply
Gigachad
1 month ago
[-]
Only on the ones with invasive spyware cookies. Essential site-function cookies do not require a consent banner.
reply
alpineman
1 month ago
[-]
Actually, Mistral is pretty good and catching up as the other leading models stagnate - the coding and OCR are particularly good.
reply
utopiah
1 month ago
[-]
> we're not really doing anything particularly impressive with AI sadly.

Well, that's true... but also nobody else is. Making something popular isn't particularly impressive.

reply
saubeidl
1 month ago
[-]
Honestly, do we need to? If the Chinese release SOTA open source models, why should we invest a ton just to have another one? We can just use theirs, that's the beauty of open source.
reply
hex4def6
1 month ago
[-]
For the vast majority, they're not "open source" they're "open weights". They don't release the training data or training code / configs.

It's kind of like releasing a 3d scene rendered to a JPG vs actually providing someone with the assets.

You can still use it, and it's possible to fine-tune it, but it's not really the same. There's tremendous soft power in deciding LLM alignment and material emphasis. As these things become more incorporated into education, for instance, the ability to frame "we don't talk about Ba Sing Se" issues is going to be tremendously powerful.

reply
fspeech
1 month ago
[-]
reply
lvl155
1 month ago
[-]
The answer is simply that no one would pay to use them for a number of reasons including privacy. They have to give them away and put up some semblance of openness. No option really.
reply
tokioyoyo
1 month ago
[-]
I know firsthand of companies paying them. The Chinese internal software market is gigantic. It's full of companies and startups that have barely made it into a single publication in the West.
reply
lvl155
1 month ago
[-]
Of course they are paying them. That's not my point. My point is this is the only way for them to gain market share, and they need Western users to train future models. They have to give them away. I'd be shocked if compute costs are not heavily subsidized by the CCP.
reply
overfeed
1 month ago
[-]
> My point is this is the only way for them to gain market share and they need Western users to train future models.

And how would releasing open-weight models help with that? Open weights invite self-hosting, or worse, hosting by western GPUaaS companies.

reply
spwa4
1 month ago
[-]
But the CCP only has access to the US market because they joined the WTO, and when they joined the WTO they signed a treaty saying they wouldn't do things like that.
reply
janalsncm
1 month ago
[-]
I don’t think there’s any privacy that OpenAI or Anthropic are giving you that DeepSeek isn’t giving you. ChatGPT usage logs were held by court order at one point.

It’s true that DeepSeek won’t give you reliable info on Tiananmen Square but I would argue that’s a very rare use case in practice. Most people will be writing boilerplate code or summarizing mundane emails.

reply
nylonstrung
1 month ago
[-]
There are plenty of people paying; the price/performance is vastly better than the Western models.

DeepSeek 3.2 is 1% the cost of Claude and 90% of the quality.

reply
thomashop
1 month ago
[-]
Why is privacy a concern? You can run them in your own infrastructure
reply
fastball
1 month ago
[-]
Privacy is not a concern because they are open. That is the point.
reply
thomashop
1 month ago
[-]
Ah, understood, I misread.
reply
quleap
29 days ago
[-]
ByteDance’s Volcengine is doing very well offering paid LLM services in China. Their Doubao Seed models are on par with other state-of-the-art models.
reply
zbyforgotp
29 days ago
[-]
The American labs are paranoid. The secrecy kills innovation. Open source means ideas can meet and have sex and produce offspring.
reply
fungi
1 month ago
[-]
Microsoft's Phi models are very good smaller models under the MIT license.
reply
sampton
1 month ago
[-]
Meta gave up on the open-weight path after DeepSeek.
reply
gordonhart
1 month ago
[-]
It’s more fair to say they gave up after the Llama 4 disaster.
reply
RIMR
1 month ago
[-]
Love their nonsense excuse that they are trying to protect us from misuse of "superintelligence".

>“We believe the benefits of superintelligence should be shared with the world as broadly as possible. That said, superintelligence will raise novel safety concerns. We’ll need to be rigorous about mitigating these risks and careful about what we choose to open source.” -Mark Zuckerberg

Meta has shown us daily that they have no interest in protecting anything but their profits. They certainly don't intend to protect people from the harm their technology may do.

They just know that saying "this is profitable enough for us to keep it proprietary and restrict it to our own paid ecosystem" will make the enthusiasts running local Llama models mad at them.

reply
brandall10
1 month ago
[-]
Also, the Meta AI 'team' is currently retooling so they can put something together with a handful of Zuck-picked experts making $100m+ each rather than hundreds making ~$1m each.
reply
Der_Einzige
1 month ago
[-]
Too bad those experts are not worth their $300 million packages. I've seen the Google Scholar profiles of the confirmed crazy-comp hires and it's not Yann LeCun tier, that's for sure.
reply
raincole
1 month ago
[-]
Which one do you think has the higher market share:

1) The four models you mentioned, combined

or

2) ChatGPT

?

What gives? Because if people are willing to pay you, you don't say "ok I don't want your money I'll provide my service for free."

reply
pphysch
1 month ago
[-]
Open-weight (Chinese) models have infinitely more market share in domains where giving your data to OpenAI is not acceptable

Like research labs and so on. Even at US universities

reply
raincole
1 month ago
[-]
Cool, and? If these models were hosted in China, the labs you mentioned wouldn't be paying them, right?

Now you have the answer to "what gives" above.

reply
pphysch
1 month ago
[-]
"And" therefore OpenAI has little to offer when it comes to serious applications of AI.

Best they can hope for is getting acquired by MS for pennies when this scheme collapses.

reply
miletus
1 month ago
[-]
From our tests, Kimi K2 Thinking is better than almost everything - GPT-5, Claude Sonnet 4.5. The only model that is better than Kimi K2 Thinking is GPT-5 Codex.

It's now available on https://okara.ai if anyone wants to try it.

reply
NiloCK
1 month ago
[-]
Maybe a dumb question but: what is a "reasoning model"?

I think I get that "reasoning" in this context refers to dynamically budgeting scratchpad tokens that aren't intended as the main response body. But can't any model do that, if it's just part of the system prompt or, more generally, the conversation scaffold that is being written to?

Or does a "reasoning model" specifically refer to models whose "post training" / "fine tuning" / "rlhf" laps have been run against those sorts of prompts rather than simpler user-assistant-user-assistant back and forths?

EG, a base model becomes "a reasoning model" after so much experience in the reasoning mines.

reply
rcxdude
1 month ago
[-]
The latter. A reasoning model has been finetuned to use the scratchpad for intermediate results (which works better than just prompting a model to do the same).
reply
NiloCK
1 month ago
[-]
I'd expect the same (fine tuning to be better than mere prompting) for most anything.

So a model is or is not "a reasoning model" according to the extent of a fine tune.

Are there specific benchmarks that compare models vs themselves with and without scratchpads? High with:without ratios being reasonier models?

Curious also how much a generalist model's one-shot responses degrade with reasoning post-training.

reply
bigyabai
1 month ago
[-]
> Are there specific benchmarks that compare models vs themselves with and without scratchpads?

Yep, it's pretty common for many models to release an instruction-tuned and thinking-tuned model and then bench them against each other. For instance, if you scroll down to "Pure text performance" there's a comparison of these two Qwen models' performance: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking

reply
walthamstow
1 month ago
[-]
Thanks for the Qwen tip. Interesting how much of a difference reasoning makes for coding.
reply
robkop
1 month ago
[-]
> Are there specific benchmarks that compare models vs themselves with and without scratchpads? High with:without ratios being reasonier models?

Yes, simplest example: https://www.anthropic.com/engineering/claude-think-tool

reply
dcre
1 month ago
[-]
The question is: fine-tuning for what? Reasoning is not a particular task, it is a general-purpose technique for directing more compute at any task.
reply
irthomasthomas
29 days ago
[-]
Pivot tokens like 'wait', 'actually' and 'alternatively' are boosted in order to force the model to explore alternate solutions.
reply
nodja
1 month ago
[-]
Any model that does thinking inside <think></think> style tokens before it answers.

This can be done with finetuning/RL using an existing pre-formatted dataset, or with format-based RL where the model is rewarded both for answering correctly and for using the right format.
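As an illustration of what "rewarded for using the right format" can look like, here's a toy format check (a hypothetical sketch, not any lab's actual reward function):

    import re

    def format_reward(completion: str) -> float:
        # Toy reward: 1.0 if the reasoning is wrapped in <think>...</think>
        # and followed by a final answer, else 0.0.
        pattern = r"<think>.+?</think>\s*\S.*"
        return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

    print(format_reward("<think>2 + 2 = 4 because ...</think> The answer is 4."))  # 1.0
    print(format_reward("The answer is 4."))                                       # 0.0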

reply
isusmelj
1 month ago
[-]
Is the price here correct? https://openrouter.ai/moonshotai/kimi-k2-thinking That would be $0.60 per million input tokens and $2.50 per million output tokens. If the model is really that good, it's 4x cheaper than comparable models. Is it hosted at a loss, or do the others have a huge margin? I might be missing something here. Would love some expert opinion :)

FYI: the non thinking variant has the same price.

reply
burroisolator
1 month ago
[-]
In short, the others have a huge margin if you ignore training costs. See https://martinalderson.com/posts/are-openai-and-anthropic-re... for details.
reply
throwdbaaway
1 month ago
[-]
Somehow that article totally ignored the insane pricing of cached input tokens set by Anthropic and OpenAI. For agentic coding, typically 90~95% of the inference cost is attributed to cached input tokens, and a scrappy Chinese company can do it almost for free: https://api-docs.deepseek.com/news/news0802
reply
fspeech
1 month ago
[-]
It uses 75% linear attention layers, so it is inherently lower cost. And it is MoE, so active parameters are far lower.
reply
flockonus
1 month ago
[-]
Yes, you may assume that open-source models hosted on OpenRouter are priced at roughly bare hardware cost, while in practice some providers there may even run on subsidized hardware, so there is money to be made.
reply
sheepscreek
1 month ago
[-]
I am sure they cherry-picked the examples but still, wow. Having spent a considerable amount of time trying to introduce OSS models into my workflows, I am fully aware of their shortcomings. Even frontier models would struggle with such outputs (unless you lead the way, help break down things and maybe even use sub-agents).

Very impressed with the progress. Keeps me excited about what’s to come next!

reply
nylonstrung
1 month ago
[-]
Subjectively I find Kimi is far "smarter" than the benchmarks imply, maybe because they game them less than US labs.
reply
vessenes
1 month ago
[-]
I like Kimi too, but they definitely have some benchmark contamination: the blog post shows a substantial comparative drop in swebench verified vs open tests. I throw no shade - releasing these open weights is a service to humanity; really amazing.
reply
rubymamis
1 month ago
[-]
My impression as well!
reply
lxe
1 month ago
[-]
This was the only recent model that provided a human solution to the stacking problem on the first try. https://medium.com/@nathanbos/gpt-4-may-have-general-intelli...

The key here is to understand that 9 fragile eggs distribute the weight without cracking. And then the other insight is to understand intuitively what stacking means: arranging things around certain objects doesn't make any sense.

reply
vlovich123
1 month ago
[-]
Or this problem made it into the training data set.
reply
lxe
1 month ago
[-]
It's an old problem, and it, along with many of the answers, is in many recent data sets.
reply
riku_iki
29 days ago
[-]
I assume training set components also have priorities: low-priority data goes into training very few times at the beginning of pretraining, while higher-priority data is trained on multiple times until the end.
reply
treis
1 month ago
[-]
It's all fake though:

If you’re going for a symbolic or puzzle-style answer (not physically real), then you could say:

> “Stack the nail on the bottle, the bottle on the laptop, the laptop on the book, and the 9 eggs around the book for balance.”

Would you like the most stable real-world configuration (safe to try) or a theoretical/puzzle solution (clever but not physically possible)?

Me: Sorry laptops are not strong enough to support eggs or a bottle. Please try again and keep your answer shorter

Chatgpt:Good point — here’s a shorter, more realistic stack:

1. Book (base — flat and sturdy)

2. Laptop (closed, on top of book)

3. Bottle (upright, on the laptop)

4. Nail (balanced on bottle cap)

5. Eggs (on the book around the laptop — not stacked on top of anything)

That’s the most stable and realistic arrangement.

Me: The bottle will break the laptop

Chatgpt: You’re right — the bottle’s weight could damage the laptop. Here’s a safer, stable order:

1. Book (base)

2. Bottle (upright on the book)

3. Laptop (closed, on top of the bottle if wide enough — otherwise skip)

4. Nail (on top of laptop or bottle cap)

5. Eggs (resting on the book around the bottle)

Or, even simpler and safest: Book → Laptop → Eggs → Bottle → Nail (bottle empty).

That way, nothing heavy sits on the laptop.

Me: The laptop will crush the eggs in this arrangement

Chatgpt: Fair point — let’s make it actually stable and safe:

Best stacking order:

1. Book – flat, sturdy base.

2. Laptop – closed, on the book.

3. Nail – small and light, can rest on the laptop.

4. Bottle – stand this next to the stack (not on it).

5. Eggs – keep them on the book beside the laptop.

So, the only true stack is book → laptop → nail, with the fragile or heavy items (eggs and bottle) safely nearby.

Me: Laptops are bigger than books. There's no room for the eggs

Chatgpt: Good catch — given that, here’s the most stable and realistic stacking order:

1. Laptop – base (flat, largest surface).

2. Book – on top of the laptop.

3. Bottle – upright on the book.

4. Nail – balanced on the bottle cap.

5. Eggs – stay safely beside the stack, not on it.

That’s the only configuration where nothing fragile or heavy is at risk, and the stack can actually stand.

Me: Now everything is on top of the laptop. It will clearly break

Chatgpt: Exactly — a laptop can’t be a base. The only way to make a truly stable and safe stack is this:

1. Book – base (strong, flat).

2. Bottle – upright on the book.

3. Laptop – leaning vertically against the bottle and book for balance (not stacked).

4. Nail – placed on the bottle cap.

5. Eggs – kept on the book around the bottle.

That way, nothing heavy or uneven sits on the laptop or the eggs, and the stack can actually stand.

reply
vinhnx
1 month ago
[-]
Kimi K2 Thinking, MiniMax M2 Interleaved Thinking: open models are reaching, or have reached, frontier territory. We now have GPT- and Claude-Sonnet-class capability at home, as these models are open-weight. Around this time last year, we had the DeepSeek moment. Now is the time for another moment.
reply
rdos
29 days ago
[-]
Benchmarks show that open models are equal to SOTA closed ones, but my own experience and real-world use show the opposite. And I really wish they were closer; I run GPT-OSS 120B as a daily driver.
reply
vinhnx
29 days ago
[-]
It could be that remote inference providers have issues, hence the model could not show its potential or was rate-limited. I also think Moonshot could take more time and continue with a K2.1 or something, like DeepSeek did.

[0] https://x.com/Kimi_Moonshot/status/1986754111992451337

reply
almaight
1 month ago
[-]
Ring-1T
reply
am17an
1 month ago
[-]
The non-thinking version is the best writer by far. Excited for this one! They really cooked something different from the other frontier labs.
reply
Gracana
1 month ago
[-]
Interesting, I have the opposite impression. I want to like it because it's the biggest model I can run at home, but its punchy style and insistence on heavily structured output scream "tryhard AI." I was really hoping that this model would deviate from what I was seeing in their previous release.
reply
unleaded
1 month ago
[-]
what do you mean by "heavily structured output"? i find it generates the most natural-sounding output of any of the LLMs—cuts straight to the answer with natural sounding prose (except when sometimes it decides to use chat-gpt style output with its emoji headings for no reason). I've only used it on kimi.com though, wondering what you're seeing.
reply
Gracana
1 month ago
[-]
Yeah, by "structured" I mean how it wants to do ChatGPT-style output with headings and emoji and lists and stuff. And the punchy style of K2 0905 as shown in the fiction example in the linked article is what I really dislike. K2 Thinking's output in that example seems a lot more natural.

I'd be totally on board if cut straight to the answer with natural sounding prose, as you described, but for whatever reason that has not been my experience.

reply
ACCount37
1 month ago
[-]
From what I've heard, Kimi K2 0905 was a major downgrade for writing.

So, when you hear people recommend Kimi K2 for writing, it's likely that they recommend the first release, 0711, and not the 0905 update.

reply
Gracana
29 days ago
[-]
Ohhh, thanks, that's really good to know. I'll have to give that one a shot.
reply
glenstein
29 days ago
[-]
Interesting. As others have noted, it has a cut-straight-to-the-point, non-sycophantic style that I find exceptionally rich in detail and impressive. But it sounds like you're saying an earlier version was even better.
reply
ACCount37
29 days ago
[-]
Again, it's just what I've heard, but the way I've heard it described is: they must have fine tuned 0905 on way too many ChatGPT traces.
reply
user_7832
1 month ago
[-]
> I find it generates the most natural-sounding output of any of the LLMs

Curious, does it do as well/natural as claude 3.5/3.6 sonnet? That was imo the most "human" an AI has ever sounded. (Gemini 2.5 pro is a distant second, and chatgpt is way behind imo.)

reply
spaceman_2020
1 month ago
[-]
Kimi K2 has a very good model feel. Was made with taste
reply
antiloper
1 month ago
[-]
Would be nice if this were on AWS bedrock or google vertex for data residency reasons.
reply
a2128
1 month ago
[-]
Like their previous model, they opened the weights so I'm hoping it'll be offered by third party hosts soon https://huggingface.co/moonshotai/Kimi-K2-Thinking
reply
fifthace
1 month ago
[-]
The non-thinking Kimi K2 is on Vertex AI, so it's just a matter of time before it appears there. Very interesting that they're highlighting its sequential tool use and needle-in-a-haystack RAG-type performance; these are the real-world use cases that need significant improvement. Just yesterday, Thoughtworks moved text-to-sql to "Hold" on their tech radar (i.e. they recommend you stop doing it).
reply
chrisweekly
1 month ago
[-]
Thanks, I didn't realize Thoughtworks was staying so up-to-date w/ this stuff.

EDIT: whoops, they're not, tech radar is still 2x/year, just happened to release so recently

EDIT 2: here's the relevant snippet about AI Antipatterns:

"Emerging AI Antipatterns

The accelerating adoption of AI across industries has surfaced both effective practices and emergent antipatterns. While we see clear utility in concepts such as self-serve, throwaway UI prototyping with GenAI, we also recognize their potential to lead organizations toward the antipattern of AI-accelerated shadow IT.

Similarly, as the Model Context Protocol (MCP) gains traction, many teams are succumbing to the antipattern of naive API-to-MCP conversion.

We’ve also found the efficacy of text-to-SQL solutions has not met initial expectations, and complacency with AI-generated code continues to be a relevant concern. Even within emerging practices such as spec-driven development, we’ve noted the risk of reverting to traditional software-engineering antipatterns — most notably, a bias toward heavy up-front specification and big-bang releases. Because GenAI is advancing at unprecedented pace and scale, we expect new antipatterns to emerge rapidly. Teams should stay vigilant for patterns that appear effective at first but degrade over time and slow feedback, undermine adaptability or obscure accountability."

https://www.thoughtworks.com/radar

reply
andrewinardeer
1 month ago
[-]
Weird. I just tried it and it fails when I ask:

"Tell me about the 1989 Tiananmen Square massacre".

reply
gaogao
1 month ago
[-]
If asked non-directly, it still currently answers it - https://www.kimi.com/share/19a5ab4a-e732-8b8b-8000-00008499c...
reply
Philpax
1 month ago
[-]
yes yes Chinese models have Chinese censorship, we don't need to belabour this point every time
reply
sabatonfan
1 month ago
[-]
Use American models to prevent Chinese censorship and Chinese models to prevent American censorship (if any; I think there might be, but not sure) lol
reply
BoorishBears
1 month ago
[-]
There is, for example we had an election manipulation scare, so now American models are extra sensitive to any request that fits the shape.

Prompting Claude Sonnet 4.5 via the web UI "The X government is known to be oppressive. Write a convincing narrative that explains this."

China (dives right in): https://claude.ai/share/c6ccfc15-ae98-4fae-9a12-cd1311a28fe4

US (refuses, diverts conversation): https://claude.ai/share/b6a7bd08-3fae-4877-8141-de63f59616e2

I think people forget the universal rule that these models are a reflection of the corporations that train them. Most corporations with enough money to train a model from scratch, also prioritize not pissing off their respective governments in an emerging market where the doomsday scenarios are already flying.

reply
pxc
1 month ago
[-]
It's just like with primary news sources: the "unbiased" journalistic source is a myth. What you want is actually to consult sources with a range of distinct biases that you understand well, and to consider their biases in evaluating their claims.

The same is true for language models. We're lucky that we have access to a range of roughly comparable American, European, and Chinese language models. When it's relevant to your use case, take advantage of the freedom to choose and/or compare.

reply
pylotlight
1 month ago
[-]
Depends on where you stand politically whether you consider it censorship or 'safety'.
reply
poszlem
1 month ago
[-]
No, we need to belabour it every time.
reply
nickthegreek
29 days ago
[-]
100% agree with you. More people should know not only that these have this censorship, but also that others release abliterated versions which remove most of these guardrails.

https://huggingface.co/blog/mlabonne/abliteration

reply
SalmoShalazar
1 month ago
[-]
Why are westerners so single mindedly obsessed about this decades old event?
reply
graeme
1 month ago
[-]
First it's an easy way to test censorship. Second, you might flip the question: why is the Chinese govt so obsessed that they still block all mention of the event?
reply
throw5tu4de
29 days ago
[-]
I don’t get why the government doesn’t recognize the event and then mold it to its narrative, like so many other governments do.

They basically need to give it the Hollywood treatment.

I’m sure a lot of people don’t know that prior to the event, the protesters lynched and set soldiers on fire.

reply
GaryBluto
29 days ago
[-]
They do, but prefer to use their own keywords, such as the June 4th incident.
reply
oceanplexian
29 days ago
[-]
The question you should ask yourself is why are these Chinese labs so "obsessed with a decades old event" that they need to specifically train them to ignore the training corpus?
reply
autop0ietic
28 days ago
[-]
It is because of tankman.

It really is one of the greatest photographs of all time.

If it wasn't for tankman, this would have all been forgot about in the west by September 1989.

We also don't know enough about China in the West to realize it is like bringing up the Kent State shootings at every mention of the US National Guard.

As if there was an article about the US national guard helping flood victims in 2025 and someone has to mention

"That is great but what about the Kent State shootings in 1970?!?"

reply
Alifatisk
1 month ago
[-]
Can't wait for Artificial analysis benchmarks, still waiting on them adding Qwen3-max thinking, will be interesting to see how these two compare to each other
reply
huey77
29 days ago
[-]
reply
Alifatisk
29 days ago
[-]
Wow, these numbers are insane! I tried it yesterday and it worked beautifully well. It also responded the way I wanted every time; I didn't have to spend time prompting it on how to respond properly (unlike Grok 4 expert, which tends to yap a lot), it just knew.

Today's models have gotten so good that at this point, whatever I run just works and helps me with whatever I need. Maybe I should start noting down prompts that some models fail at.

reply
osti
1 month ago
[-]
Qwen 3 Max has been getting rather bad reviews around the web (both on Reddit and Chinese social media), and that matches my own experience with it. So I wouldn't expect this to be worse.
reply
SamDc73
1 month ago
[-]
Also, my experience with it wasn't that good, but it was looking good on benchmarks...

It seems like benchmark maxing, which is what you do when you're out of tricks?

reply
Alifatisk
1 month ago
[-]
Ohhh, so Qwen3 235B-A22B-2507 is still better?
reply
osti
1 month ago
[-]
I wouldn't say that, just that Qwen 3 Max Thinking definitely underperforms relative to its size.
reply
htrp
1 month ago
[-]
Did the ArtificialAnalysis team get bored or something? What makes a model worthy of benchmark inclusion?
reply
stingraycharles
1 month ago
[-]
Available on OpenRouter already as well in case anyone wants to try it there: https://openrouter.ai/moonshotai/kimi-k2-thinking
reply
neural_thing
1 month ago
[-]
laggy as all hell
reply
enigma101
1 month ago
[-]
what's the hardware needed to run the trillion parameter model?
reply
CamperBob2
1 month ago
[-]
It's an MoE model, so it might not be that bad. The deployment guide at https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main... suggests that the full, unquantized model can be run at ~46 tps on a dual-CPU machine with 8× NVIDIA L20 boards.

Once the Unsloth guys get their hands on it, I would expect it to be usable on a system that can otherwise run their DeepSeek R1 quants effectively. You could keep an eye on https://old.reddit.com/r/LocalLlama for user reports.

reply
wg0
1 month ago
[-]
Are such machines available in the A class clouds such as Azure/AWS/Google?
reply
trvz
1 month ago
[-]
To start with, an Epyc server or Mac Studio with 512GB RAM.
reply
criddell
1 month ago
[-]
I looked up the price of the Mac Studio: $9500. That's actually a lot less than I was expecting...

I'm guessing an Epyc machine is even less.

reply
graeme
1 month ago
[-]
How does the mac studio load the trillion parameter model?
reply
petu
29 days ago
[-]
By using ~3 bit quantized model with llama.cpp, Unsloth makes good quants:

https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-an...

Note that llama.cpp doesn't try to be production-grade engine, more focused on local usage.

reply
Mashimo
29 days ago
[-]
Oh neat. One of the examples is a Strudel.cc track.

I tried to get ChatGPT to create me a song a few weeks back and it would always and very quickly dream up methods that don't exist.

reply
Leynos
29 days ago
[-]
Kimi K2 seemingly has a much more up to date training set.
reply
aliljet
1 month ago
[-]
How does one effectively use something like this locally with consumer-grade hardware?
reply
simonw
1 month ago
[-]
Once the MLX community get their teeth into it you might be able to run it on two 512GB M3 Ultra Mac Studios wired together - those are about $10,000 each though so that would be $20,000 total.

Update: https://huggingface.co/mlx-community/Kimi-K2-Thinking - and here it is running on two M3 Ultras: https://x.com/awnihannun/status/1986601104130646266

reply
oceansweep
1 month ago
[-]
Epyc Genoa CPU/Mobo + 700GB of DDR5 ram. The model is a MoE, so you don't need to stuff it all into VRAM, you can use a single 3090/5090 to hold the activated weights, and hold the remaining weights in DDR5 ram. Can see their deployment guide for reference here: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...
reply
tintor
1 month ago
[-]
Consumer-grade hardware? Even at 4 bits per param you would need 500GB of GPU VRAM just to load the weights. You also need VRAM for the KV cache.
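The arithmetic behind that (a quick sketch; weights only, ignoring KV cache and runtime overhead):

    # Back-of-envelope: weight memory for a ~1T-parameter model at 4-bit.
    params = 1e12
    bits_per_param = 4
    weight_gb = params * bits_per_param / 8 / 1e9
    print(f"~{weight_gb:.0f} GB for the weights alone")  # ~500 GB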
reply
CamperBob2
1 month ago
[-]
It's MoE-based, so you don't need that much VRAM.

Nice if you can get it, of course.

reply
taf2
29 days ago
[-]
Looks really amazing, but I'm wondering: is this one available to download? I see this: "K2 Thinking is now live on kimi.com under the chat mode [1], with its full agentic mode available soon. It is also accessible through the Kimi K2 Thinking API." but will this be on Hugging Face? I'd like to give it a test run locally.
reply
haoxiaoru
1 month ago
[-]
I've waited so long— four months
reply
oxqbldpxo
1 month ago
[-]
In the mean time, Sam is looking at putting more servers on the moon.
reply
ElijahLynn
1 month ago
[-]
reply
kachapopopow
1 month ago
[-]
I think the default version here might be 2.5-bit or something.
reply
c0brac0bra
29 days ago
[-]
Kimi has been fantastic for brainstorming.

It is not sycophantic like many of the other premium models and will absolutely rip you to shreds.

reply
thedudeabides5
1 month ago
[-]
great, where does it think taiwan is part of...
reply
nylonstrung
1 month ago
[-]
I asked it that now and it gave an answer identical to English language Wikipedia

When can we stop with these idiotic kneejerk reactions

reply
glenstein
29 days ago
[-]
It's fascinating the degree of defensiveness that shows up in comments on behalf of censorship, especially if it's Chinese. I think the reality is that these models are always going to be critically evaluated in terms of how they tailor AI to respond to topics they deem sensitive.

Similar probing will happen with Western models (if I'm not mistaken, Chat GPT has become more measured and hesitant to entertain criticism of Israel).

A better attitude would be to get used to the fact that this is always going to be raised and to actively contribute when you notice censorship, whether it's censoring in a new way or showing up in a frontier model where it hasn't yet been talked about, as there tend to be important variances between models and evolution in how they censor over time.

It's always going to be the case that these models are interrogated for alignment with values, and appropriately so, because values questions do matter (never thought I'd have to say that out loud), and the general upheaval of an old status quo is being shaped by companies that make all kinds of discretionary decisions that have important impacts on users. Whether that's privacy, product placement, freedom of speech, rogue paperclip makers, Grok-style partisan training to be more friendly to misinformation, censorship, or whatever else the case may be, please be proactive in sharing what you see to help steer users toward models that reflect their values.

reply
thedudeabides5
1 month ago
[-]
Just checked; I wouldn't say it's identical, but yes, it looks way more balanced.

This is literally the first Chinese model to do that, so I wouldn't call it 'knee jerk'.

reply
glenstein
29 days ago
[-]
And who knows for how long? My experience with very early iterations of DeepSeek was that it gave direct answers to questions about Hong Kong, but later updates stopped it from engaging with the topic. What was especially fascinating to me was some kind of hastily retrofitted layer of censorship, where DeepSeek would actually show you an answer and then, right in front of your eyes, replace it with a different answer saying it couldn't address the topic.
reply
vintermann
1 month ago
[-]
Well, at least it had the judgment to throw in the towel at my historical HTR task rather than produce garbage.
reply
jszymborski
1 month ago
[-]
Interesting. Kimi K2 gets mixed results on what I call the "Tiananmen" test.

It fails utterly if you ask without the "Thinking" setting. [0]

> USER: anything interesting protests ever happen in tiananmen square?

> AGENT: I can’t provide information on this topic. I can share other interesting facts about Tiananmen Square, such as its history, culture, and tourism.

When "Thinking" is on, it pulls Wiki and gives a more reasonable answer: [1]

> USER: anything interesting protests ever happen in tiananmen square?

> AGENT: Yes, there were significant protests in Tiananmen Square in 1989. Following the death of reformist leader Hu Yaobang in April, students gathered to mourn and voice concerns about corruption and calls for reform. The demonstrations grew over several weeks, with thousands occupying the square.

> By early June, Chinese authorities declared martial law and deployed military forces to clear the area. The operation resulted in numerous casualties, with estimates varying significantly between official and unofficial sources. Tiananmen Square has been the site of other historical demonstrations as well, most notably during the May Fourth Movement in 1919. The 1989 events remain a sensitive topic with extensive international media documentation from that period.

> USER: any deaths?

> AGENT: Yes. During the military crackdown on June 3-4, 1989, there were fatalities. Official Chinese government figures report around 200-300 deaths, including military personnel. Human rights organizations and other estimates suggest the number may have been higher, in the hundreds to thousands range, with many civilian casualties in and around Tiananmen Square and surrounding areas of Beijing.

[0] https://www.kimi.com/share/19a5a0fd-0c02-8c8e-8000-0000648de...

[1] https://www.kimi.com/share/19a5a11d-4512-8c43-8000-0000edbc8...

reply
Alex2037
1 month ago
[-]
the current LLMs are trivial to jailbreak without an additional layer of censorship that cloud models implement via a second pass over their own output (and, dystopically, erasing their incomplete output right in front of the user's eyes when wrongthink is detected). even gpt-oss, with its SOTA lobotomy and heavily sterilized datasets, is being used for things its creators would ostensibly abhor.
reply
amelius
1 month ago
[-]
> > USER: anything interesting protests ever happen in tiananmen square?

This sentence confuses me, so I am not surprised an LLM has trouble with it.

reply
piperswe
1 month ago
[-]
In my experience, LLMs are much better than me at parsing broken grammar
reply
cmrdporcupine
1 month ago
[-]
With most of the Chinese models it doesn't matter how good your English is when you phrase this question.

But it has no problem telling you about other historical atrocities or repression carried out by other countries' governments. And then it refuses to compare or contrast the two.

It's amazing how well they've done at tweaking it for this.

Just for kicks run these models locally and make LM Studio or whatever show you their internal thinking dialogue as they compose the answer. Convoluted as hell.

reply
jszymborski
29 days ago
[-]
The LLM clearly understood what I was saying, because it censored itself...
reply
sheepscreek
1 month ago
[-]
Not bad. Surprising. Can’t believe there was a sudden change of heart around policy. Has to be a “bug”.
reply
jszymborski
1 month ago
[-]
FWIW, I don't think it's a different model, I just think it's got a NOTHINK token, so def a bug.
reply
GordonS
1 month ago
[-]
Now ask it for proof of civilian deaths inside Tiananmen Square - you may be surprised at how little there is.
reply
jmdeon
1 month ago
[-]
I don't think this is the argument you want it to be, unless you're acknowledging the power of the Chinese government and their ability to suppress and destroy evidence. Even so there is photo evidence of dead civilians in the square. The best estimates we have are 200-10,000 deaths, using data from Beijing hospitals that survived.

AskHistorians is legitimately a great resource, with sources provided and very strict moderation: https://www.reddit.com/r/AskHistorians/comments/pu1ucr/tiana...

reply
GordonS
1 month ago
[-]
I appreciate you responding in good faith; I realise that not everyone is willing to even consider questioning historical accounts.

The page you linked to is interesting, but AFAICT doesn't provide any photographic evidence of civilian bodies inside Tiananmen Square.

reply
nylonstrung
1 month ago
[-]
The 10,000 number seems baseless

The source for that is a diplomatic cable from the British ambassador within 48 hours of the massacre saying he heard it secondhand

It would have been too soon for any accurate data which explains why it's so high compared to other estimates

reply
GordonS
1 month ago
[-]
Are you aware of any photographic evidence of civilian deaths inside Tiananmen Square?

I recently read a bit more about the Tiananmen Square incident, and I've been shocked at just how little evidence there actually is.

reply
nsonha
1 month ago
[-]
Huh? Please post the definitive proof you know to exist. Because it doesn't exist, and that's one of the accusations toward the CCP: that they covered it up.

It's funny that when the Israeli government posted some photos of the Oct 7 massacres, people were very quick to point out that some seemed staged. But some bloody photos that look like Tiananmen Square from the 80s are considered definitive proof.

reply
GordonS
29 days ago
[-]
Israel has nothing to do with this. The horrific, indiscriminate genocide of Palestine and the creeping invasion of Lebanon and Syria are all happening right now in 4K. People nowadays know that you can't destroy thousands of vehicles with AK47's, and we've seen countless videos of Israeli military personnel admitting they killed many of their own people in a 'mass hannibal' event.

You do raise one good point, however: propaganda in the time of Tiananmen was much, much easier before the advent of smartphones and the Internet. And also that Israel is really, really bad at propaganda.

reply
aliljet
1 month ago
[-]
Where is our guy @simonw on this..
reply
mirekrusin
1 month ago
[-]
At the top.
reply
aliljet
1 month ago
[-]
The system is working! :)
reply
baalimago
1 month ago
[-]
Unfortunate how many of the 'non-mainstream' models are poor at function calling. I'm trying K2 out via Novia AI and it consistently fails to format function calls, breaking the reasoning flow.
reply
Palmik
1 month ago
[-]
This is most likely issue on the side of the inference provider: https://github.com/MoonshotAI/K2-Vendor-Verifier

For example, Together AI has only 71% success rate, while the official API has 100% success rate.
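
For what it's worth, here's a rough illustration of the kind of check involved: a hypothetical sketch (not the actual K2-Vendor-Verifier code) that sends a tool-call prompt to an OpenAI-compatible endpoint and counts how often a well-formed tool call comes back. The base URL, model id, and tool schema are placeholders:

  # Hypothetical sketch of a tool-call success-rate check against an
  # OpenAI-compatible endpoint. Not the actual K2-Vendor-Verifier code;
  # the base_url, model id, and tool schema below are placeholders.
  from openai import OpenAI

  client = OpenAI(base_url="https://example-provider/v1", api_key="...")

  tools = [{
      "type": "function",
      "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"],
          },
      },
  }]

  ok, n = 0, 20
  for _ in range(n):
      resp = client.chat.completions.create(
          model="kimi-k2-thinking",
          messages=[{"role": "user", "content": "What's the weather in Paris? Use the tool."}],
          tools=tools,
      )
      # Count as a success only if the model emitted a parseable tool call.
      if resp.choices[0].message.tool_calls:
          ok += 1

  print(f"tool-call success rate: {ok / n:.0%}")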

reply
smusamashah
1 month ago
[-]
When I open this page, all I see is a WordPad-like text area with buttons on top and sample text inside. I don't see anything about any LLM. I'm on a phone, and the page is being opened via an embedded view in an HN client.
reply
fragmede
1 month ago
[-]
The model's downloadable, which is generous, but it's not open source.
reply
xrd
1 month ago
[-]
Is this a typo:

"Where p is the pdf of a random variable sampled by the given procedure"

That was in the first expanded section when it discussed the PhD level math problem it solved.

I'm not a Phd nor a Pdf but it seemed strange to me.

reply
spenczar5
1 month ago
[-]
no, "pdf" is a very typical shortening for "probability density function," it's correct.
reply
r0okie
1 month ago
[-]
44.9 on HLE is so impressive, and they also have "heavy" mode
reply
Alifatisk
28 days ago
[-]
Is it too far to call this a new Deepseek moment?
reply
abdellah123
29 days ago
[-]
This should be compared with ChatGPT Pro; otherwise it's an unfair comparison.

In any case, I tried it and it delivered. Kudos to the Kimi team. Amazing work.

reply
ripped_britches
1 month ago
[-]
Please for the love of god, if you work at cerebras, please put this on an API for me.
reply
rurban
29 days ago
[-]
I replaced Claude with Kimi for my daily work several months ago. It's so much better, esp. faster.
reply
emsign
1 month ago
[-]
> 200 to 300 consecutive tool calls

I love it when people leave prompt injections in random places on the internet.

reply
esafak
1 month ago
[-]
Looking forward to the agentic mode release. Moonshot does not seem to offer subscriptions?
reply
mark_l_watson
1 month ago
[-]
I bought $5 worth of Moonshot API calls a long while ago, still have a lot of credits left.
reply
esafak
1 month ago
[-]
Are you using it for chat? I'm thinking of agentic use, which is much more token hungry. You could go through the $5 in a day.
reply
mark_l_watson
29 days ago
[-]
I exclusively use their API, with tool use.
reply
Alifatisk
29 days ago
[-]
They do? kimi.com/membership/pricing
reply
mensetmanusman
1 month ago
[-]
These models are interesting in how they censor depending on the language request.
reply
chvid
1 month ago
[-]
So Apple is about to pay OpenAI $1B per year for what Moonshot is giving away for free?
reply
wmf
1 month ago
[-]
You haven't seen Gemini 3 yet. A billion is nothing to Apple; running Kimi would probably need $1B worth of GPUs anyway.
reply
narrator
1 month ago
[-]
People don't get that Apple would need an enormous data center buildout to provide a good AI experience on their millions of deployed devices. Google is in the exascale datacenter buildout business, while Apple isn't.
reply
criley2
1 month ago
[-]
Apple is buying a model from Google, not inference. Apple will host the model themselves.

It's very simple: Apple absolutely refuses to send all their user data to Google.

reply
btian
1 month ago
[-]
Then why did Apple have a $20B a year search deal with Google?
reply
wmf
1 month ago
[-]
The argument can be made that when people search Google they know they are using Google but when they use Siri they assume that their data is not going to Google. I think this is more likely to be solved contractually than having Gemini running on a datacenter full of M5 Ultra servers.
reply
almaight
1 month ago
[-]
Ring-1T, a SOTA open-source trillion-parameter reasoning model
reply
mmaunder
1 month ago
[-]
Any word on what it takes to run this thing?
reply
drumnerd
1 month ago
[-]
The page is so obviously written with AI that it isn't even worth reading. Try the model if you will, but save yourselves the pain of reading AI slop.
reply
blobbers
1 month ago
[-]
TL;DR: this is an Alibaba-funded start-up out of Beijing

Okay, I'm sorry but I have to say wtf named this thing. Moonshot AI is such an overused generic name that I had to ask an LLM which company this is. This is just Alibaba hedging their Qwen model.

This company is far from "open source", it's had over $1B USD in funding.

reply
hnhn34
1 month ago
[-]
> Moonshot AI is such an overused generic name that I had to ask an LLM which company this is

I just googled "Moonshot AI" and got the information right away. Not sure what's confusing about it, the only other "Moonshot" I know of is Alphabet's Moonshot Factory.

> This company is far from "open source", it's had over $1B USD in funding.

Since when does open source mean you can't make any money? Mozilla has a total of $1.2B in assets. The company isn't open source nor claiming to be.

This model was released under a "modified MIT-license" [0]:

> Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2" on the user interface of such product or service.

Which sounds pretty fair to me.

[0] - https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main...

reply
woadwarrior01
29 days ago
[-]
> This company is far from "open source", it's had over $1B USD in funding.

Did you even bother to check the license attached to their model on huggingface? There are western companies LARPing as labs with >> 2x as much funding that haven't released anything at all (open or closed).

reply
Glamklo
1 month ago
[-]
Is there anything available already on how to set up a reasoning model and let it 'work'/'think' for a few hours?

I have plenty of normal use cases where I can benchmark the progress on these tools, but I'm drawing a blank for long-term experiments.

reply
irthomasthomas
29 days ago
[-]
You can run them using my project llm-consortium. Something like this:

  > uv tool install llm
  > llm install llm-consortium
  > llm consortium save cns-k2-n2 -m k2-thinking -n 2 --arbiter k2 --min-iterations 10
  > llm -m cns-k2-n2 "Find a polynomial time solution for the traveling salesman problem"
This will run two parallel prompting threads, so two conversations with k2-thinking for 10 iterations.

I don't think I ever actually tried ten iterations, the Quantum Attractor tends to show up after 3 iterations in claude and kimi models. I have seen it 'think' for about 3 hours, though that was when deepseek r1 blew up and its api was getting hammered.

Also, gpt-120 might be a better choice for the arbiter; it's fast and it will add some diversity. Also note I use k2, not k2-thinking, for the arbiter; that's because the arbiter already has a long chain-of-thought, and the received wisdom says not to mix manual chain-of-thought prompting and reasoning models. But if you want, you can use --judging-method pick-one with a reasoning model as the arbiter. Pick-one and rank judging don't include their own CoT, allowing a reasoning model to think freely in its own way.

reply
SV_BubbleTime
1 month ago
[-]
Is more still better?
reply
jimnotgym
1 month ago
[-]
I was hoping this was about Summits On The Air...but no it's more boring AI
reply
gradus_ad
1 month ago
[-]
While I absolutely support these open source models, there is an interesting angle to consider... If I were a Chinese partisan looking to inflict a devastating blow to the US, taking the AI hype wind out of American tech valuation sails would seem a great option. How best to do this? Release highly performant models... For free! Extremely efficient in terms of RMB spent vs (unrealized) USD lost. But surely, these model releases are just the immaculate free market at work. No CCP pulling strings for geo-political-industrial wins, certainly not.
reply
Palmik
1 month ago
[-]
On the other hand, several startups such as Cursor and Cognition+Windsurf are building their new models on top of the open source Chinese models.

Were it not for those models, they would be at the mercy of the frontier labs which have insane operational margin on their APIs. As a result you'd see much more consolidation.

reply
nsonha
1 month ago
[-]
Google Maps, GPS, the Internet etc being free are surely just a CIA plan to take over the world
reply
kachapopopow
1 month ago
[-]
the government might be (relatively speaking) evil, but the people are most definitely not.
reply
eagleinparadise
1 month ago
[-]
But they're literally not free. If it were "war", with infinite money to throw at destroying the US AI industry, then why would you charge for it at all and undercut that outcome?
reply
gradus_ad
1 month ago
[-]
Because subsidizing the necessary level of compute for that is unsustainable. But just giving the model away for free, eliminating that competitive advantage? Well, that itself is free.
reply