Step 2: Build on someone else's infrastructure innovations with zero acknowledgement.
Step 3: Write a blog post with "unprecedented" and "100x" and "trillions" in the first paragraph.
Seriously, this seems like cool work and I enjoyed the post. But my basic trust in them has completely tanked.
For the gossipy part, I love Kimi, but find it hard to get worked up about them not labelling their model Kimi when Kimi was the base. Especially because Kimi…has had…some issues…being able to distinguish itself from Claude…
The capabilities of the top labs’ models have improved so much in just the last few releases, and I definitely foresee a world where they gate those models away behind 1st-party harnesses/tooling.
I feel like the v5.0 preview did ok, but it's slid all the way down the hill to GPT-2 or GPT-3 levels for me.
I mean, sure, the techniques are probably the same in 2, but it's not like they're exactly advertising Composer 2 here lol
However, I have edited my other claims for now and you can consider them provisionally retracted. My original advice about turning off data sharing stands. You are right to ask for more evidence given the severity of the claims. I think this merits a deeper dive, and a throwaway hacker news comment might not be the best channel for it. Stay tuned ;)
I also wonder: since they're doing constant RL on model weights against today's Cursor design, does that mean they can never change their system prompt and other parts of the harness?
1) Comparisons against past trajectory data would be meaningless if those trajectories were collected under different instructions.
2) Performance will be terrible the next time they change their tool design, since the model is now "opinionated" based on how a previous version of Cursor was designed (see the sketch below).
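To make that concrete, here's a toy sketch (all names are mine, nothing from Cursor's post) of why untagged rollouts cause both problems: a reward delta only means something between trajectories that ran under the same instructions and tool schema.

    from dataclasses import dataclass

    @dataclass
    class Trajectory:
        harness_version: str  # system prompt + tool schema the rollout ran under
        prompt_id: str        # identifies the user task
        reward: float         # e.g. tests passed, edit accepted

    def comparable_groups(trajectories):
        """Group rollouts by (harness_version, prompt_id); only rollouts that
        saw identical instructions and tools yield a meaningful reward delta."""
        groups = {}
        for t in trajectories:
            groups.setdefault((t.harness_version, t.prompt_id), []).append(t)
        # Groups with a single rollout have nothing to compare against.
        return {k: v for k, v in groups.items() if len(v) > 1}

Change the system prompt and harness_version changes with it, so every previously collected trajectory drops out of the comparison set, which is exactly the staleness problem above.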
Anthropic is more sensible with their “constitution” approach to safety. The behaviors (and ultimately the values) you want your model to follow should be a document, not a lobotomy.
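As a toy contrast (my sketch, not Anthropic's actual constitutional-AI pipeline, which also uses the document during training as critique criteria): values kept as a document can be edited and versioned like any other text, with no weight update involved.

    CONSTITUTION = """\
    - Prefer honest, harmless answers.
    - Refuse to produce malware.
    """

    def build_messages(constitution: str, user_msg: str) -> list[dict]:
        # The values live in data, so changing them is a text edit
        # and a redeploy, not a retraining run.
        return [
            {"role": "system", "content": constitution},
            {"role": "user", "content": user_msg},
        ]

    messages = build_messages(CONSTITUTION, "Summarize this diff for me.")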
And still no mention of Kimi in a new blog post :)
Also, apparently the inference provider they use, Fireworks AI, already has a built-in API for RL-tuning Kimi [1]. So I wonder which parts are Cursor's own effort and where Fireworks AI actually deserves credit, especially since they repeatedly brag about being able to create a new checkpoint every 5 hours, which would be largely thanks to Fireworks AI's API/training infrastructure.
I mean, I'm genuinely curious how much effort it would actually take me to go from "here, lots of user data" to "the model gains +1% on benchmarks" with my own finetune, assuming I already use a good existing foundation model, my inference provider already handles all the tuning infrastructure/logic, and I already have a lot of usage logs.
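Back-of-envelope, that pipeline might be as thin as this sketch (every name below is hypothetical, not Fireworks' actual API): reduce logs to (prompt, completion, reward) records and hand them to the provider.

    import json

    def logs_to_records(usage_logs):
        """Crude reward proxy: did the user keep the suggested edit?"""
        for event in usage_logs:
            if event["type"] != "completion":
                continue
            yield {
                "prompt": event["prompt"],
                "completion": event["completion"],
                "reward": 1.0 if event["edit_accepted"] else 0.0,
            }

    def submit_tuning_job(client, base_model, records):
        # client.create_rl_job is invented for illustration; the real
        # provider call, dataset format, and hyperparameters would differ.
        dataset = [json.dumps(r) for r in records]
        return client.create_rl_job(base_model=base_model, dataset=dataset)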
They used Kimi and failed to acknowledge it in the original Composer announcement. The Kimi team probably reached out and asked WTF. Their only recourse was to publicly disclose their whitepaper with Kimi mentioned, winning brownie points for being open about their training pipeline while placating the Kimi team.
The engineering challenge here is on a different level, though: an LLM is orders of magnitude bigger than a recommender system model. Kudos.
also curious whether they see different convergence patterns across languages. my gut says something like python, where there's more stylistic variation, would make it harder to get a clean reward signal than something like rust, where there are fewer idiomatic ways to do things.
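one way that noise could show up, as a toy example (my construction, not from the post): if the reward mixes test results with similarity to a single reference solution, two equally correct python answers get unequal rewards purely because of style.

    import difflib

    REFERENCE = "def squares(xs):\n    return [x * x for x in xs]\n"

    CANDIDATES = [
        # Idiomatic comprehension: textually close to the reference.
        "def squares(xs):\n    return [x * x for x in xs]\n",
        # Equally correct explicit loop: same behavior, low similarity.
        "def squares(xs):\n    out = []\n    for x in xs:\n"
        "        out.append(x * x)\n    return out\n",
    ]

    def reward(candidate: str, passed_tests: bool) -> float:
        sim = difflib.SequenceMatcher(None, candidate, REFERENCE).ratio()
        return 0.5 * float(passed_tests) + 0.5 * sim  # mixed signal

    for c in CANDIDATES:
        print(round(reward(c, passed_tests=True), 2))
    # Both pass the tests, but the loop version scores lower: pure style
    # noise, which a language with fewer idioms would exhibit less.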
Credit to the team for taking this on, but I’d be skeptical of announcements like this without at least 3–6 months of proven production deployments. Definitely curious how this plays out.