Opus 4.6 is available on the $20 plan too
But I run an AI SaaS and we do offer Opus 4.6, too. Our use case is not nearly as token-intensive as something like coding, so we are still able to offer it with a good profit margin.
Sorry, how do these two things go together?
If people pay for it, it has economic utility, doesn't it? I mean, people pay to watch movies or play video games, too.
It does have some kind of horrible context-consistency problem, though: if you ask it to rewrite something verbatim, it'll inject tiny random changes everywhere and potentially break it. That's something other SOTA models haven't done for at least two years now, and it's a real problem. I can't trust it to do a full rewrite, just diffs.
https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...
I'm not saying it's bad, but it's definitely different than the others.
Yuck. At that point, don't publish a benchmark. It also explains why their results are useless.
Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chatbots, article writing, tool usage, calling external APIs, parsing documents, etc.
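One way to get both, for what it's worth, is to let the schema itself carry the formatting contract. A minimal sketch using Pydantic; the Answer model and every field in it are invented for illustration, not anyone's actual schema:

    from pydantic import BaseModel, Field

    class Answer(BaseModel):
        # Field descriptions double as formatting instructions to the model.
        headline: str = Field(description="Plain text, max 80 chars, no markdown")
        body_markdown: str = Field(
            description="Markdown: short paragraphs, bullet lists for steps, "
                        "fenced blocks for code"
        )
        citations: list[str] = Field(
            default_factory=list,
            description="Bare URLs, one per source actually used",
        )

    # Hand the JSON schema to whatever structured-output API you're using.
    schema = Answer.model_json_schema()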
Most models get this right. Also, this is just one failure mode of Claude.
Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)
The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really costly.
They're all slop when the complexity is higher than a mid-tech intermediate engineer though.
This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models.
They are the equivalent of frontier models from 8+ months ago.
> DeepSeek V3.2 Reasoning | 86.2% | ~$0.002 | API, single-shot
> ATLAS V3 (pass@1-v(k=3)) | 74.6% | ~$0.004 | Local electricity only, best-of-3 + repair pipeline
(And ideally you'd probably test first, or at least try to feed compiler errors back etc?)
Overall, I mostly agree.
1) That is very slow by comparison.
2) That can also be done, even more simply, with SoTA models over API.
Being reliant on a service means you have to share whatever you're working on with the service, and the service provider decides what you can do and can change their terms of service on a whim.
If locally running models can get to the point where they can be used as a daily driver, that solves the problem.
Can you explain what that means?
Local model enthusiasts often assume that running locally is more energy efficient than running in a data center, but fail to take the economies of scale into account.
Note that while a local chatbot user will mostly be running at batch size 1, that's not going to be true if they are running an agentic framework, so the gap is going to narrow or even reverse.
Our peak import rate is 3x higher than our solar export rate. In other words, we’d need to sell 3 kWh of energy to offset the cost of using 1 kWh at peak.
We’re currently in the process of accepting a quote for home batteries. The rates here highly incentivise maximising self-use.
And this is with no income tax or VAT on sold electricity.
Cool work though, really excited for the potential of slimming down models.
Perhaps these things aren't well represented in the training data for these open models? Every local model I've tried (minimax2.5, GLM-4.7, Qwen3, 3.5 and -coder variants) spends so much time trying to get something syntactically sensible and accepted by the compiler that by the time it finishes, it barely seems to have any "momentum" left to actually solve the problem. Pretty much anything but the most trivial change ends up in another loop of trying to get it working again, often losing the intent of that change in the process.
My fear is that the solution here, having multiple instances all making the same changes for later comparison, would spend a huge amount of time beating its head against compiler errors, types, and memory allocation (NO, DON'T JUST SPRINKLE IN A FEW MORE RAW "new" KEYWORDS, DAMMIT) before it even gets to the "logic".
Having plenty of local GPU power, I'd love to be able to actually use it, and I'm already wary about some of the training data use and its interactions with the license of the code I'm "sending" to the cloud models...
But this can be pretty slow, since you have to run the code in an isolated environment, check the outputs, and wait for it to finish. Doing that for every candidate quickly adds up. So ATLAS has another shortcut for avoiding unnecessary testing: instead of simply generating solutions and testing all of them, it tries to predict which one is most likely correct before running any tests.
ATLAS also asks the model for an embedding of what it just wrote, which acts as a fingerprint. Two similar pieces of code will produce similar fingerprints. A well-written, confident solution will produce a different fingerprint than a confused, buggy one.
These fingerprints get fed into a separate, much smaller neural network called the Cost Field. This little network was trained ahead of time on examples where they already knew which solutions were correct and which were wrong. It learned to assign a score to each fingerprint. Correct solutions get a low score and incorrect ones get a high one.
So the process is to generate multiple solutions, get their fingerprints, score each one, and pick the lowest. Only that one gets tested. The Cost Field picks correctly about 88% of the time according to the repo.
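To make that flow concrete, here's a toy, self-contained sketch. Everything in it is a stand-in invented for illustration (a hash-based pseudo-embedding, a linear cost field, a trivial test runner), not ATLAS's actual code:

    import hashlib
    import numpy as np

    def embed(code: str, dim: int = 64) -> np.ndarray:
        # Stand-in fingerprint: a deterministic pseudo-embedding of the code.
        seed = int.from_bytes(hashlib.sha256(code.encode()).digest()[:4], "big")
        return np.random.default_rng(seed).standard_normal(dim)

    # Pretend this tiny linear model was trained offline on labeled solutions.
    W = np.random.default_rng(0).standard_normal(64)

    def cost_field(fp: np.ndarray) -> float:
        # Lower score = predicted more likely to be correct.
        return float(W @ fp)

    def run_tests(code: str) -> bool:
        # Stand-in for the expensive sandboxed test run.
        return "return a + b" in code

    candidates = [
        "def add(a, b):\n    return a + b",
        "def add(a, b):\n    print(a + b)",
        "def add(a, b):\n    return a - b",
    ]

    scores = [cost_field(embed(c)) for c in candidates]
    best = candidates[int(np.argmin(scores))]  # only this one gets tested
    print("passed:", run_tests(best))

The point is the shape of the loop: k generations, k cheap scores, one expensive test.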
Another interesting approach could be to use this set up with a language like Clojure or Common Lisp which facilitates interactive development. If you could hook up the agent directly to a REPL in a running program, then it could run tests with a lot less overhead.
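As a rough Python stand-in for that idea (the comment is about Lisp REPLs; this uses the stdlib's InteractiveConsole, and the add function is made up):

    from code import InteractiveConsole

    repl = InteractiveConsole()

    # Load the code under development once, line by line, as at a prompt...
    repl.push("def add(a, b):")
    repl.push("    return a + b")
    repl.push("")  # blank line closes the block

    # ...then an agent can run micro-checks against the live session,
    # with no process startup or import cost per check.
    repl.push("assert add(2, 2) == 4")
    repl.push("print(add(-1, 1))")  # -> 0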
So it seems like it's a difficulty classifier for task descriptions written in English.
This is then used to score embeddings of Python code, which is a completely different distribution.
Presumably it's going to look at a simple solution, figure out it lands kinda close to simple problems in embedding space and pass it.
But none of this helps you solve harder problems, or distinguish between a simple solution which is wrong, and a more complex solution which is correct.
I think the notion of a one-size-fits-all model is a bit like a sports car: just getting the biggest/fastest/best one is overkill. You use bigger models when needed, but they use a lot of resources and cost you a lot. A lot of AI work isn't solving important math or algorithm problems, or leet-coding exercises. Most AI work is mundane plumbing work: summarizing, a bit of light scripting/programming, tool calling, etc. With skills and guard rails, you actually want agents to follow those rather than get too creative. And you want them to work relatively quickly and not overthink things. Latency is important. You can actually use guard rails to decide when to escalate to bigger models and when not to.
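A hedged sketch of that escalation idea; the model names, signals, and thresholds are all placeholders, not a recommendation:

    SMALL_MODEL = "small-local-model"
    BIG_MODEL = "big-frontier-model"

    def pick_model(task: str, failed_attempts: int = 0) -> str:
        # Crude guard rails: escalate on repeated failure, very long briefs,
        # or keywords that tend to signal hard work. Tune for your domain.
        hard_signals = ("prove", "optimize", "concurrency", "migrate")
        looks_hard = any(s in task.lower() for s in hard_signals)
        if failed_attempts >= 2 or len(task) > 4000 or looks_hard:
            return BIG_MODEL
        return SMALL_MODEL  # mundane plumbing stays cheap and fast

    pick_model("summarize this changelog")   # -> small-local-model
    pick_model("optimize the hot loop", 2)   # -> big-frontier-model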
This will crush OpenAI.
Note: I am not talking about coding here. That will take a while longer, but when it is optimized to the bone and LLM output has stabilized, you will be running that on local hardware too. Cost will come down for Claude and friends as well, but why pay 5 when you can have it for free?
In this theory, can you explain why Apple has announced it’s paying Google for Gemini too?
Eventually, this may be true. This autumn? Highly unlikely.
I hope you are not going to say, "to avoid a global recession or depression caused by the popping of the AI bubble". That would be unnecessary and harmful (in its second-order effects), and governments do have advisors who are competent enough in economics to advise against such a move.
If the AI labs become very influential and powerful, Washington might nationalize them, but that would be very different from bailing them out because they have become unprofitable and cannot attract additional investment from the private sector.
With the recent OpenAI deal with the government, I am certain they would throw tons of money at OpenAI if it got real bad. But with the upcoming IPO, where they are expected to be valued at $840b, we would be a LONG way from them needing a bailout. Well past this current admin.
GM on the other hand should have been left to die.
However, I was obliquely referring to the open transactionality and patronage encouraged by the current administration, and how the AI / big tech players have, with few exceptions, gleefully joined in.
Unless they run out of money for bribes, I think it's inevitable that the current government will bend over backwards to prop them up.
I, too, was interested because I am always eager to use local models in my claw-like. It looks like this could be useful for an async portion of the harness but it wouldn’t work in interactive contexts.
Very cool ensemble of techniques, particularly because they’re so accessible. I think I will use this for the reusable portions of web browsing functionality in my personal agent.
There seems to be at least some detail on that point.
It looks like your card has 16GB VRAM? Start with Qwen 3.5 9B Unsloth GGUFs (UD-Q6_K_XL) and branch out from there.
I can't imagine trying to use this model on either GPU for real work. I can use much bigger and faster models on the $3 Chutes subscription or $10 OpenCode Go subscription.
Even so, I am still excited. I don't feel like there was even a model worth using with a tool like OpenCode 6 to 9 months ago. I like the way things are heading, and I am looking forward to seeing how capable coding models of this size are in another 6 to 9 months!
That doesn’t mean the 9070XT can’t do AI stuff, quite the opposite. ROCm gets better all the time. There are many AI workloads you can do on AMD cards.
Is it a card I would choose if I was primarily working on AI? Absolutely not. But it is the card I own and it’s been a great value for gaming.
It’s absurd I have to use open source programs to get INT8 FSR4 support.