Artificial Analysis puts MiniMax 2.1 at 33 on its Coding Index, far behind frontier models, and I feel that's about right. [1]
It wrote an extensive test suite against nothing but fake data and then declared the app was working perfectly because all the tests passed.
This is a model that was supposed to match Sonnet 4.5 on benchmarks. I don't think Sonnet would be that dumb.
I use LLMs a lot to code, but these Chinese models don't match Anthropic and OpenAI in being able to decide things for themselves. They work well if you give them explicit instructions that leave little room to mess up, but we're slowly approaching the point where OpenAI and Anthropic models will make the right decisions on their own.
Just now it added some code to a file starting at L30, and when I said "that one line at L30 will do, remove the rest", it interpreted 'the rest' as the whole file, not what it had just added.
Instead, this one works surprisingly well for the cost: https://openrouter.ai/xiaomi/mimo-v2-flash
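For anyone who wants to try it quickly: OpenRouter exposes an OpenAI-compatible API, so a minimal sketch looks like the snippet below (the model slug is taken from the link above; it assumes the openai Python package and an OPENROUTER_API_KEY environment variable).

    # Minimal sketch: calling the model above through OpenRouter's
    # OpenAI-compatible endpoint. Assumes `pip install openai` and an
    # OPENROUTER_API_KEY environment variable; the model slug comes
    # from the link above.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    resp = client.chat.completions.create(
        model="xiaomi/mimo-v2-flash",
        messages=[{"role": "user", "content": "Write a one-line hello world in Python."}],
    )
    print(resp.choices[0].message.content)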
Not self-hosting yet, but I prefer using Chinese OSS models for AI workflows because of the option to self-host in the future if needed. I'm also using it to power my openclaw assistant since, IMO, it has the best balance of speed, quality, and cost:
> It costs just $1 to run the model continuously for an hour at 100 tokens/sec. At 50 tokens/sec, the cost drops to $0.30.
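Back-of-the-envelope, those figures imply roughly $2.78 and $1.67 per million generated tokens. A quick sketch of the arithmetic (illustrative only; it ignores input tokens and any plan details):

    # Convert the quoted continuous-run costs into an implied price per
    # million generated tokens. Ignores input tokens, batching, and plan
    # details; purely illustrative.
    for rate_tok_per_s, dollars_per_hour in [(100, 1.00), (50, 0.30)]:
        tokens_per_hour = rate_tok_per_s * 3600
        per_million = dollars_per_hour / tokens_per_hour * 1_000_000
        print(f"{rate_tok_per_s} tok/s: {tokens_per_hour:,} tok/hr -> ${per_million:.2f}/M tokens")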
It's good to have these models around to keep the frontier labs honest! Can I ask whether you use the API or a monthly plan? Do the monthly plans throttle/reset?
edit: I agree that MM2.1 is the most economical, and K2.5 generally the strongest.
- $10/mo: 100 prompts / 5 hours
- $20/mo: 300 prompts / 5 hours
- $50/mo: 1000 prompts / 5 hours
[1] https://platform.minimax.io/docs/guides/pricing-coding-plan
I'll have to look for it on OpenRouter.
/imagine an svg of an octopus riding a bike. 1 arm shading its eyes from the sun, another waving a cute white flag, 2 driving the bike, 2 pedaling the wheels, and 2 drifting behind in the wind
For instance:
I'm generally inclined to believe Kimi K2.5's benchmarks, because I've found their models tend to be extremely good qualitatively and feel genuinely well-rounded and intelligent rather than brittle and bench-maxed.
I'm inclined to give GLM 5 some benefit of the doubt. While I think their past benchmarks have overstated their models' capabilities, I've also found their models relatively competent, and they 2x'd the model size, introduced a new architecture, and raised the number of active parameters, which makes me feel there's a real possibility they could actually meet the benchmarks they're claiming.
Meanwhile, I've never found MiniMax remotely competent. It has always been extremely brittle: it tends to screw up edits, misformat even simple JavaScript code, fall into error loops, and succumb to context rot quickly. It's also simply too small, in my opinion, to deliver the kind of performance they're claiming.
Huge, if not groundbreaking, if the benchmark stats are true.
Anthropic's Claude Code and OpenAI's Codex plans are subsidised.
Chinese open-weight models hosted in the US or Europe make more sense when you want to stay model-agnostic and less dependent on a single AI company with relatively expensive APIs.
So I do believe that if something comes up that is literally continuous, it would be interesting, but I'm not sure about it right now. I'd be curious whether anyone has anything they would actually run 24/7.
Like an LLM trained only on Python 3+, certain frameworks, and certain code repos. Then you could use a different model for searching the internet when implementing different things, to cut down on costs.
Maybe I have no idea what I'm talking about lol
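To make the idea concrete, here's a purely hypothetical routing sketch: a cheap code-only model for pure coding tasks and a pricier general model for anything that needs web search. The model names and the keyword heuristic are made up for illustration.

    # Hypothetical cost-saving router: send pure coding tasks to a small,
    # code-only model and anything needing web search to a larger general
    # model. Model names and the keyword heuristic are illustrative only.
    CODE_MODEL = "small-python-only-model"      # hypothetical specialist
    SEARCH_MODEL = "general-model-with-search"  # hypothetical generalist

    def pick_model(prompt: str) -> str:
        needs_search = any(w in prompt.lower() for w in ("latest", "docs", "search", "current version"))
        return SEARCH_MODEL if needs_search else CODE_MODEL

    print(pick_model("Refactor this Python 3 function to use dataclasses"))   # -> small-python-only-model
    print(pick_model("Search the latest FastAPI docs for lifespan events"))   # -> general-model-with-search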
We've done some vibe checks on it with OpenHands, and it indeed performs roughly as well as Sonnet 4.5.
OSS models are catching up
Maybe an 8x node assuming batching >= 8 users per node.