1T parameters, 32B active parameters.
License: MIT with the following modification:
Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2.5" on the user interface of such product or service.
Technical awe aside at this marvel that cracks the 50th percentile on HLE, the snarky part of me says there’s only half the danger in giving away something nobody can run at home anyway…
The cheapest way is to stream it from a fast SSD, but it will be quite slow (one token every few seconds).
The next step up is an old server with lots of RAM and many memory channels, with maybe a GPU thrown in for faster prompt processing (low double-digit tokens/second).
At the high end, there are servers with multiple GPUs with lots of VRAM or multiple chained Macs or Strix Halo mini PCs.
The key enabler here is that the models are MoE (Mixture of Experts), which means that only a small(ish) part of the model is required to compute the next token. In this case, there are 32B active parameters, which is about 16 GB at 4 bits per parameter. That only leaves the question of how to get those 16 GB to the processor as fast as possible.
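As a back-of-envelope sketch (the bandwidth figures below are illustrative round numbers, not measurements), decode speed is roughly memory bandwidth divided by the bytes of active weights streamed per token:

    # Rough decode-speed estimate for a MoE model, assuming ~16 GB of active
    # weights must be streamed from memory for each generated token.
    # Bandwidth figures are illustrative round numbers, not measurements.
    ACTIVE_PARAMS = 32e9                # active parameters per token
    BYTES_PER_PARAM = 0.5               # 4-bit weights ~ 0.5 bytes per parameter
    active_bytes = ACTIVE_PARAMS * BYTES_PER_PARAM   # ~16 GB per token

    bandwidth_gbps = {                  # hypothetical example systems
        "NVMe SSD": 7,
        "dual-channel DDR5 desktop": 80,
        "12-channel DDR5 server": 500,
        "multi-GPU HBM": 3000,
    }
    for name, bw in bandwidth_gbps.items():
        print(f"{name:26s} ~{bw * 1e9 / active_bytes:6.2f} tok/s")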
Back when 4k movies needed expensive hardware, no one was saying they could play 4k on a home system, then later mentioning they actually scaled down the resolution to make it possible.
The degree of quality loss is not often characterized. Which makes sense because it’s not easy to fully quantify quality loss with a few simple benchmarks.
By the time it’s quantized to 4 bits, 2 bits or whatever, does anyone really have an idea of how much they’ve gained vs just running a model that is sized more appropriately for their hardware, but not lobotomized?
int4 quantization is the original release in this case; it hasn't been quantized after the fact. It's a bit of a nuisance when running on hardware that doesn't natively support the format (it might waste some fraction of memory throughput on padding, specifically on NPU hardware that can't do the unpacking on its own), but no one here is reducing quality to make the model fit.
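For anyone wondering what the "unpacking" means, this is the basic idea (a minimal sketch, not Kimi's or any particular runtime's actual layout): two 4-bit values share one byte, and hardware without native int4 support has to unpack or pad before computing.

    import numpy as np

    # Minimal illustration of 4-bit weight packing: two int4 values per byte.
    def pack_int4(vals: np.ndarray) -> np.ndarray:
        """Pack an even-length array of values in [0, 15] into bytes, two per byte."""
        v = vals.astype(np.uint8)
        return (v[0::2] | (v[1::2] << 4)).astype(np.uint8)

    def unpack_int4(packed: np.ndarray) -> np.ndarray:
        """Recover the original 4-bit values from the packed bytes."""
        lo = packed & 0x0F
        hi = packed >> 4
        return np.stack([lo, hi], axis=1).reshape(-1)

    weights = np.array([3, 12, 7, 0, 15, 9], dtype=np.uint8)
    packed = pack_int4(weights)                       # 3 bytes instead of 6
    assert np.array_equal(unpack_int4(packed), weights)
    print(len(weights), "values ->", packed.nbytes, "bytes")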
The broader point remains, though: "you can run this model at home…" when actually the caveats are potentially substantial.
It would be so incredibly slow…
Any model that I can run in 128 GB at full precision is far inferior, for actually useful work, to the models I can just barely get to run after REAP + quantization.
I also read a paper a while back about improvements to model performance in contrastive learning when quantization was included during training as a form of perturbation, to force the model toward a smoother loss landscape. It made me wonder if something similar might work for LLMs, which I think might be what the people over at MiniMax are doing with M2.1, since they released it in FP8.
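The paper isn't named here, so purely as a generic illustration: "fake quantization" during training rounds weights onto a low-bit grid in the forward pass (the perturbation), while updates still flow to the full-precision weights. Whether MiniMax does anything like this for M2.1 is speculation on my part.

    import numpy as np

    # Generic quantization-aware-training style "fake quantization": round the
    # weights to a symmetric low-bit grid so the forward pass sees perturbed
    # values. A sketch of the idea only, not the method from any specific paper.
    def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
        qmax = 2 ** (bits - 1) - 1                    # e.g. +/-7 levels for 4-bit
        scale = np.abs(w).max() / qmax + 1e-12
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        return q * scale                              # dequantized weights for the forward pass

    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 4)).astype(np.float32)
    w_q = fake_quantize(w, bits=4)
    print("max rounding error:", np.abs(w - w_q).max())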
In principle, if the model has been effective during its learning at separating and compressing concepts into approximately orthogonal subspaces (and assuming the white box transformer architecture approximates what typical transformers do), quantization should really only impact outliers which are not well characterized during learning.
If this were the case, however, why would labs go to the trouble of distilling their smaller models rather than releasing quantized versions of the flagships?
IMO 1T parameters and 32B active is a different scale from what most people mean when they say local LLMs. Totally agree there will be people messing with this, but the real value in local LLMs is that you can actually use them and get value from them on standard consumer hardware. I don't think that's really possible with this model.
LLMs whose weights aren't available are an example of when it's not a local LLM, not when the model happens to be large.
I agree. My point was that most people aren't thinking of models this large when they're talking about local LLMs. That's what I said, right? This is supported by the download counts on HF: the most downloaded local models are significantly smaller than 1T, normally 1–12B.
I'm not sure I understand what point you're trying to make here?
Over any task that has enough prefill input diversity and a decode phase that's more than a few tokens, it's at least intuitive that experts activate nearly uniformly in the aggregate, since they're activated per token. This is why, when you do anything more than bs=1, you see forward passes light up the whole network.
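A toy simulation makes the intuition concrete; the expert count and top-k below are placeholders, not necessarily K2.5's actual configuration:

    import random

    # Toy simulation: if each token routes to k of E experts roughly uniformly,
    # how many distinct experts does a batch of tokens touch per layer?
    E, k = 384, 8          # hypothetical expert count and top-k per token
    random.seed(0)

    for batch_tokens in (1, 8, 64, 512):
        touched = set()
        for _ in range(batch_tokens):
            touched.update(random.sample(range(E), k))
        print(f"{batch_tokens:4d} tokens -> {len(touched):3d}/{E} experts active")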
Thing is, people in the local llm community are already doing that to run the largest MoE models, using mmap such that spare-RAM-as-cache is managed automatically by the OS. It's a drag on performance to be sure but still somewhat usable, if you're willing to wait for results. And it unlocks these larger models on what's effectively semi-pro if not true consumer hardware. On the enterprise side, high bandwidth NAND Flash is just around the corner and perfectly suited for storing these large read-only model parameters (no wear and tear issues with the NAND storage) while preserving RAM-like throughput.
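A rough sketch of the mmap idea in Python (llama.cpp does this in C, and the file name here is hypothetical): map the weights read-only and let the OS page cache decide what stays resident, so only the pages you actually touch ever come off the disk.

    import mmap, os

    # Map a (hypothetical) weights file into the address space; untouched expert
    # pages never need to be read, and hot pages stay cached in spare RAM.
    path = "model-weights.gguf"   # hypothetical file name
    size = os.path.getsize(path)

    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        block = mm[:4096]         # touching a slice faults in only those pages
        print(f"mapped {size / 1e9:.1f} GB, read {len(block)} bytes on demand")
        mm.close()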
I was trying to correct the record that a lot of people will be using models of this size locally because of the local LLM community.
The most commonly downloaded local LLMs are normally <30b (e.g. https://huggingface.co/unsloth/models?sort=downloads). The things you're saying, especially when combined together, make it not usable by a lot of people in the local LLM community at the moment.
But to answer your question directly, tensor parallelism. https://github.com/ggml-org/llama.cpp/discussions/8735 https://docs.vllm.ai/en/latest/configuration/conserving_memo...
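As a concrete illustration (the model id and GPU count are placeholders, and a 1T-parameter model needs far more than two GPUs), vLLM exposes tensor parallelism through a single argument:

    from vllm import LLM, SamplingParams

    # Tensor parallelism: each weight matrix is sharded across GPUs and partial
    # results are combined with all-reduce at every layer.
    llm = LLM(
        model="moonshotai/Kimi-K2-Instruct",   # placeholder model id
        tensor_parallel_size=2,                # shard layers across 2 GPUs
    )
    outputs = llm.generate(["1+1="], SamplingParams(max_tokens=8))
    print(outputs[0].outputs[0].text)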
There is a huge difference between "look I got it to answer the prompt: '1+1='"
and actually using it for anything of value.
I remember early on people bought Macs (or some marketing team was shoveling it), proposing that people could reasonably run the 70B+ models on them.
They were talking about 'look it gave an answer', not 'look this is useful'.
While it was a bit obvious that 'integrated GPU' is not Nvidia VRAM, we did have one Mac laptop at work that validated this.
It's cool these models are out in the open, but it's going to be a decade before people are running them at a useful level locally.
If it were 2016 and this technology existed but only at 1 t/s, every company would find a way to extract the most leverage out of it.
Because I feel like they mentioned that Agent Swarm is available via their API, and that made me feel as if it wasn't open (weights)? Please let me know whether all of it is open source or not.
Why not just say "you shall pay us 1 million dollars"?
Coincidence or not, let's just marvel for a second at this amount of magic/technology being given away for free... and how liberating and different this is from OpenAI and the others that were closed to "protect us all".
Also, their license says that if you have a big product you need to promote them. Remember how Google "gave away" site search widgets, and that was perhaps one of the major ways they gained recognition as the search leader.
OpenAI/Nvidia are the Pets.com/Sun of our generation: insane valuations, stupid spend, expensive options, expensive hardware, and so on.
Sun hardware bought for 50k USD to run websites in 2000 is less capable than perhaps a 5 dollar/month VPS today?
"Scaling to AGI/ASI" was always a fool's errand; best case, OpenAI should've squirreled away money to keep a solid engineering department that could focus on algorithmic innovations, but considering that Anthropic, Google, and Chinese firms have caught up or surpassed them, it seems they didn't.
Once things blow up, what will be left are the closed options that had somewhat sane/solid model research and handle things better, plus a ton of new competitors running modern/cheaper hardware and just using models as building blocks.
You can't with Tiananmen Square in China
So they are on the same page as the UN and US?
The One China policy refers to a United States policy of strategic ambiguity regarding Taiwan.[1] In a 1972 joint communiqué with the PRC, the United States "acknowledges that all Chinese on either side of the Taiwan Strait maintain there is but one China and that Taiwan is a part of China" and "does not challenge that position."
https://en.wikipedia.org/wiki/One_China https://en.wikipedia.org/wiki/Taiwan_and_the_United_Nations
Scaling depends on hardware, so cheaper hardware on a compute-per-watt basis only makes scaling easier. There is no clear definition of AGI/ASI but AI has already scaled to be quite useful.
Given the shallowness of moats in the LLM market, optimizing for mindshare would not be the worst move.
If you look at past state projects, profitability wasn't really considered much. They are notorious for a "Money hose until a diamond is found in the mountains of waste"
The biggest source of my conspiracy theory is that I made a Reddit thread asking a question, "Why all the DeepSeek hype" or something like that. And to this day, I get odd 'pro-DeepSeek' comments from accounts only used every few months. It's not like this was some highly upvoted topic that sits in the 'Top'.
I'd put that DeepSeek marketing on par with an Apple marketing campaign.
In a few years, or less, biological attacks and other sorts of attacks will be plausible with the help of these agents.
Chinese companies aren't humanitarian endeavors.
> K2.5 Agent Swarm improves performance on complex tasks through parallel, specialized execution [..] leads to an 80% reduction in end-to-end runtime
Not just RL on tool calling, but RL on agent orchestration, neat!
Is this within the model? Or within the IDE/service that runs the model?
Because tool calling is mostly just the agent outputting "call tool X", and the IDE executes it and returns the data back to the AI's context
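A minimal sketch of that loop, with a canned fake model standing in for the real API and a single illustrative tool:

    import json

    # The model only *asks* for a tool; the client (IDE, CLI, etc.) runs it and
    # feeds the result back into the conversation. The scripted replies and the
    # "add" tool are purely illustrative stand-ins.
    TOOLS = {"add": lambda a, b: a + b}

    _scripted = iter([
        {"tool_call": {"name": "add", "arguments": json.dumps({"a": 1, "b": 1})}},
        {"tool_call": None, "content": "1+1=2"},
    ])

    def fake_model(messages):
        return next(_scripted)          # a real client would call the model API here

    def run_agent(messages):
        while True:
            reply = fake_model(messages)
            if reply.get("tool_call") is None:
                return reply["content"]                 # final answer for the user
            call = reply["tool_call"]
            result = TOOLS[call["name"]](**json.loads(call["arguments"]))
            messages.append({"role": "tool", "name": call["name"], "content": str(result)})

    print(run_agent([{"role": "user", "content": "1+1="}]))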
What is this?
That's cool. It also has a zsh hook, allowing you to switch to agent mode wherever you are.
K2 is one of the only models to nail the clock face test as well. It’s a great model.
They are not just leeching here; they took this innovation, refined it, and improved it further. This is what the Chinese are good at.
Interested in the dedicated Agent and Agent Swarm releases, especially in how that could affect third party hosting of the models.
Why is it that Claude is still at the top in coding? Are they heavily focused on training for coding, or is their general training so good that it performs well in coding?
Someone please beat Opus 4.5 in coding, I want to replace it.
Also consider that they are all overfitting on the benchmark itself, so there might be that as well (which can go in either direction).
I consider the top models practically identical for coding applications (just personal experience with heavy use of both GPT5.2 and Opus 4.5).
Excited to see how this model compares in real applications. It's 1/5th of the price of top models!!
You can of course "run" this on cheaper hardware, but the speeds will not be suitable for actual use (i.e. minutes for a simple prompt, tens of minutes for high context sessions per turn).
Not an "I actually use this"
The difference between waiting 20 minutes to answer the prompt '1+1='
and actually using it for something useful is massive here. I wonder where this idea of running AI on CPU comes from. Was it Apple astroturfing? Was it Apple fanboys? I don't see people wasting time on non-Apple CPUs. (Although, I did do this for a 7B model)
e.g. in an office or coworking space
800–1000 GB of RAM, perhaps?
https://openrouter.ai/moonshotai/kimi-k2-thinking https://openrouter.ai/moonshotai/kimi-k2-0905 https://openrouter.ai/moonshotai/kimi-k2-0905:exacto https://openrouter.ai/moonshotai/kimi-k2
Generally it seems to be in the neighborhood of $0.50/1M for input and $2.50/1M for output
Keep in mind that most people posting speed benchmarks try them with basically 0 context. Those speeds will not hold at 32/64/128k context length.
Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.
These are order-of-magnitude numbers, but the takeaway is that multi H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.
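For what it's worth, a spec-sheet comparison roughly supports the order of magnitude; the figures below are approximate public numbers, not measurements, and real gaps depend heavily on batch size, context length, and software stack.

    # Rough spec-sheet comparison behind the "~100x" intuition (approximate
    # public figures, not benchmarks).
    systems = {
        #                    mem BW (GB/s)   BF16 compute (TFLOPS, approx.)
        "M2 Ultra Mac":      (800,           30),
        "8x H100 server":    (8 * 3350,      8 * 990),
    }

    mac_bw, mac_tf = systems["M2 Ultra Mac"]
    gpu_bw, gpu_tf = systems["8x H100 server"]
    print(f"decode (bandwidth-bound): ~{gpu_bw / mac_bw:.0f}x")
    print(f"prefill (compute-bound):  ~{gpu_tf / mac_tf:.0f}x")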
Could be true, could be fake - the only thing we can be sure of is that it's made up with no basis in reality.
This is not how you use LLMs effectively; that's how you give everyone who uses them a bad name by association.
Sure, it's SOTA at standard vision benchmarks. But on tasks that require proper image understanding (see, for example, BabyVision[0]), it appears very much lacking compared to Gemini 3 Pro.
If you’re really lazy - the quick summary is that you can benefit from the sweet spot of context length and reduce instruction overload while getting some parallelism benefits from farming tasks out to LLMs with different instructions. The way this is generally implemented today is through tool calling, although Claude also has a skills interface it has been trained against.
So the idea would be for software development, why not have a project/product manager spin out tasks to a bunch of agents that are primed to be good at different things? E.g. an architect, a designer, and so on. Then you just need something that can rectify GitHub PRs and bob’s your uncle.
Gas town takes a different approach and parallelizes on coding tasks of any sort at the base layer, and uses the orchestration infrastructure to keep those coders working constantly, optimizing for minimal human input.
I think the best way to think about it is that it's an engineering hack to deal with a shortcoming of LLMs: for complex queries, LLMs are unable to directly compute a SOLUTION given a PROMPT, but they are able to break the prompt down into intermediate solutions and eventually solve the original prompt. These "orchestrator" / "swarm" agents add some formalism to this and allow you to distribute compute, and then also use specialized models for some of the sub-problems.
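A sketch of that orchestrator pattern (the `call_llm` stub and the role prompts are purely illustrative): split the prompt into sub-tasks, run specialized workers in parallel, then merge.

    import concurrent.futures as cf

    # Decompose the task, run specialized sub-agents in parallel, then merge.
    # `call_llm` is a stand-in for a real model API call.
    def call_llm(system_prompt: str, task: str) -> str:
        return f"[{system_prompt}] result for: {task}"      # placeholder response

    SUBTASKS = {
        "architect": "Propose a module layout for the pomodoro web app.",
        "designer":  "Describe the timer UI states.",
        "tester":    "List edge cases for pause/resume.",
    }

    def run_swarm(subtasks: dict[str, str]) -> str:
        with cf.ThreadPoolExecutor() as pool:
            futures = {role: pool.submit(call_llm, f"You are the {role}.", task)
                       for role, task in subtasks.items()}
            partials = {role: f.result() for role, f in futures.items()}
        # An orchestrator model would normally merge these; here we just concatenate.
        return "\n".join(f"{role}: {out}" for role, out in partials.items())

    print(run_swarm(SUBTASKS))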
then it creates a list of employees, each specialized for a task, and they work in parallel.
Essentially hiring a team of people who get specialized on one problem.
Do one thing and do it well.
Where we have more specialized "jobs", which the model is actually trained for.
I think the main difference with agent swarms is the ability to run them in parallel. I don't see how this adds much compared to simply sending multiple API calls in parallel with your desired tasks. I guess the only difference is that you let the AI decide how to split those requests and what each task should be.
One positive side effect of this is that if subagent tasks can be dispatched to cheaper and more efficient edge-inference hardware that can be deployed at scale (think Nvidia Jetsons, or even Apple Macs or AMD APUs), then, even though a single node might be highly limited in what it can fit, complex coding tasks ultimately become a lot cheaper per token than generic chat.
My point was that this is just a different way of creating specialised task solvers, the same as with MoE.
And, as you said, with MoE it's about the model itself, and it's done at training level so that's not something we can easily do ourselves.
But with agent swarm, isn't it simply splitting a task in multiple sub-tasks and sending each one in a different API call? So this can be done with any of the previous models too, only that the user has to manually define those tasks/contexts for each query.
Or is this at a much more granular level than this, which would not be feasible to be done by hand?
I was already doing this in n8n, creating different agents with different system prompts for different tasks. I am not sure automating this (with a swarm) would work well in most of my cases; I don't see how this fully complements Tools or Skills.
Or did I misunderstand the concept of MoE, and it's not about having specific parts of the model (parameters) do better on specific input contexts?
I guess that after Kimi K2.5, other vendors are going to go the same route?
Can't wait to see how this model performs on computer automation use cases like VITA AI Coworker.
But ultimately, you need to try them yourself on the tasks you care about and just see. My personal experience is that right now, Gemini Pro performs the best at everything I throw at it. I think it's superior to Claude and all of the OSS models by a small margin, even for things like coding.
Me too!
> I like Gemini Pro's UI over Claude so much
This I don't understand. I mean, I don't see a lot of difference in both UIs. Quite the opposite, apart from some animations, round corners and color gradings, they seem to look very alike, no?
I got it to reduce the price of the first month to $1.49 (it could go to $0.99 and my frugal mind wanted it, haha, but I just couldn't get it to do that, lol).
Anyways, afterwards, for privacy purposes (I am a minor, so I don't have a card), I ended up going to G2A to essentially get a $10 Visa gift card and used it. (I had to pay $1 extra, but sure.)
Installed Kimi Code on my Mac and am trying it out. Honestly, I am kind of liking it.
My internal benchmark is creating Pomodoro apps as Go web apps... Gemini 3 Pro has nailed it. I just tried the Kimi version; it does have some bugs, but it feels like it added more features.
Gonna have to try it out for a month.
I mean, I just wish it was this cheap for the whole year :< (as I could then move away from, say, using the completely free models)
Gonna have to try it out more!
I was thinking that maybe it's better to make my own benchmarks with the questions/things I'm interested in, and whenever a new model comes out, run those tests against it using OpenRouter.
Maybe we can get away with something cheaper than Claude for coding.
Depending on how well you bargain with the robot, you can go as low as $0.99 (difficult). Either way, their moderate plan doesn't have to be $20. The agent wants a good reason for why it should lower the price for you.
Here’s the direct link to Kimmmmy:
https://www.kimi.com/kimiplus/sale
I’ll send an invite link too if you don’t mind:
https://www.kimi.com/kimiplus/sale?activity_enter_method=h5_...
URL is down so cannot tell.