And how inference prices have come down a lot, despite increasing pressure to make money. Opus 4.6 is $25/MTok, Opus 4.1 was $75/MTok, the same as Opus 4 and Opus 3. OpenAI's o1 was $60/MTok, o1 pro $600/MTok, gpt-5.2 is $14/MTok and 5.2-pro is $168/MTok.
Also note how GPT-4 was rumored to be in the 1.8T realm, and now Chinese models in the 1T realm can match or surpass it. And I doubt the Chinese have a monopoly on those efficiency improvements.
I doubt frontier models have actually grown substantially in size in the last 1.5 years, and they potentially have a lot fewer parameters than the frontier models of old.
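To put numbers on the price drop, here is a minimal sketch using only the per-MTok prices quoted above. Pairing each older model with a newer one (Opus 4.1 with Opus 4.6, o1 with gpt-5.2, o1 pro with 5.2-pro) is my own assumption about which models are comparable, and `price_drop` is just an illustrative helper name.

```python
# Prices in USD per million tokens, exactly as quoted in the comment above.
PRICES = {
    "Opus 4.1": 75.0,
    "Opus 4.6": 25.0,
    "o1": 60.0,
    "gpt-5.2": 14.0,
    "o1 pro": 600.0,
    "5.2-pro": 168.0,
}

# Assumed old -> new pairings (not stated in the thread, chosen for illustration).
PAIRS = [("Opus 4.1", "Opus 4.6"), ("o1", "gpt-5.2"), ("o1 pro", "5.2-pro")]


def price_drop(old: str, new: str) -> float:
    """Return how many times cheaper `new` is than `old`."""
    return PRICES[old] / PRICES[new]


if __name__ == "__main__":
    for old, new in PAIRS:
        print(f"{old} -> {new}: {price_drop(old, new):.2f}x cheaper")
```

On these figures, each pairing comes out somewhere between 3x and a bit over 4x cheaper.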
It was the very first thing I noticed: it looks suspiciously like they just rebranded Sonnet as Opus and raised the price.
I don't know why more people aren't talking about this. Even on X, where the owner directly competes in this market, it's rarely brought up. I strongly suspect there is a sort of tacit collusion between competitors in this space. They all share a strong motivation to kill any deep discussion of token economics, even about each other because transparency only arms the customers. By keeping the underlying mechanics nebulous, they can all justify higher prices. Just look at the subscription tiers: every single major player has settled on the exact same pricing model, a $20 floor and a $200 cap, no exceptions.
I'm convinced they're all doing everything they can in the background to cut costs and increase profits.
I can't prove that Gemini 3 is dumber than when it came out because of the non-deterministic nature of this technology, but it sure feels like it.
... and you'd be most likely very correct with your doubt, given the evidence we have.
What improved disproportionately more than the software or hardware side is density[1] per parameter, indicating that there's a "Moore's Law"-esque trend behind parameter count, density per parameter, and compute requirements. As long as more and more information and abilities can be squeezed into the same number of parameters, inference will become cheaper and cheaper, quicker and quicker.
I write "quicker and quicker" because, next to improvements in density, there will still be additional architectural, software, and hardware improvements. It's almost as if it's going exponential and we're heading for a so-called Singularity.
Since it's far more efficient and "intelligent" to have many small models competing with and correcting each other for the best possible answer, in parallel, there simply is no need for giant, inefficient, monolithic monsters.
They ain't gonna tell us that, though, because then we'd know that we don't need them anymore.
[1] For lack of a better term; there may be one that I am not aware of.
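One possible reading of "many small models competing with and correcting each other, in parallel" is plain majority voting over independent answers. The sketch below is only that reading, not a claim about how any lab actually does it. The three `model_*` functions are hypothetical stand-ins for real small-model calls, and `majority_vote` is an illustrative name.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority_vote(models, prompt):
    """Query every model in parallel and return (winning answer, vote share)."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        answers = list(pool.map(lambda model: model(prompt), models))
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical stand-ins for small models; a real setup would call inference here.
def model_a(prompt):
    return "4"


def model_b(prompt):
    return "4"


def model_c(prompt):
    return "5"


if __name__ == "__main__":
    print(majority_vote([model_a, model_b, model_c], "What is 2 + 2?"))
```

Voting only works when answers can be compared for equality; free-form text would need a judge or a scoring step instead.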
There is no gain for anyone anywhere by reducing parameter count overall, if that's what you mean. That sounds more like you don't like transformer models than a real performance desire.
Scaling laws are real! But they don't preclude faster processing.
A lot of inference code is set up for autoregressive decoding now. Diffusion is less mature. Not sure if Ollama or llama.cpp support it.
But I wish there were more "let's scale this thing to the skies" experiments from those who actually can afford to scale things to the skies.
It would certainly be nice though if this kind of negative result was published more often instead of leaving people to guess why a seemingly useful innovation wasn't adopted in the end.
I wonder how far down they can scale a diffusion LM? I've been playing with in-browser models, and the speed is painful.
But I wonder how Taalas' product can scale. Making a custom chip for one single tiny model is different than running any model trillions of parameters in size for a billion users.
Roughly, 53B transistors for every 8B params. For a 2T param model, you'd need about 13 trillion transistors, assuming scale is linear. One chip uses 2.5 kW of power? That's 4x H100 GPUs. How does it draw so much power?
If you assume that the frontier model is 1.5 trillion parameters, you'd need an entire N5 wafer chip to run it. And then if you need to change something in the model, you can't, since it's physically printed on the chip. So this is something you do if you know you're going to use this exact model without changing anything for years.
Very interesting tech for edge inference though. Robots and self-driving can make use of these in the distant future if power draw comes down drastically. A 2.4 kW chip running inside a robot is not realistic. Maybe a 150 W chip.
> The first generation HC1 chip is implemented in the 6 nanometer N6 process from TSMC. ... Each HC1 chip has 53 billion transistors on the package, most of it very likely for ROM and SRAM memory. The HC1 card burns about 200 watts, says Bajic, and a two-socket X86 server with ten HC1 cards in it runs 2,500 watts.
https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
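The back-of-envelope scaling discussed above can be written out as a small sketch. It uses only figures from the thread and the quoted article (53 billion transistors and about 200 W per HC1 card, with an 8B-parameter model per chip) and assumes everything scales linearly with parameter count, which is a simplification. `scale_linear` is an illustrative helper name.

```python
HC1_TRANSISTORS = 53e9  # transistors per HC1 chip (quoted article)
HC1_PARAMS = 8e9        # parameters served by one chip (figure used in the thread)
HC1_CARD_WATTS = 200    # approximate power per HC1 card (quoted article)


def scale_linear(target_params: float) -> dict:
    """Extrapolate chip count, transistors, and card power, assuming linear scaling."""
    chips = target_params / HC1_PARAMS
    return {
        "chip_equivalents": chips,
        "transistors": chips * HC1_TRANSISTORS,
        "card_watts": chips * HC1_CARD_WATTS,
    }


if __name__ == "__main__":
    for params in (1.5e12, 2e12):
        est = scale_linear(params)
        print(
            f"{params / 1e12:.1f}T params: "
            f"{est['chip_equivalents']:.1f} HC1 equivalents, "
            f"{est['transistors'] / 1e12:.2f}T transistors, "
            f"{est['card_watts'] / 1e3:.1f} kW for the cards alone"
        )
```

Under these assumptions a 2T-parameter model needs about 13.25 trillion transistors, in line with the "13 trillion" estimate. The power figure covers cards only, since the quoted 2,500 W is for a whole ten-card server rather than one chip.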
I'd take an army of high-school graduate LLMs to build my agentic applications over a couple of genius LLMs any day.
This is a whole new paradigm of AI.
If people can make RL scalable, so that RL isn't just a final phase but something as big as the supervised stuff, then diffusion models are going to have an advantage.
If not, I think autoregressive models will still be preferred. Diffusion models become fixed very fast, they can't actually refine their outputs, so we're not talking about some kind of refinement along the lines of: initial idea -> better idea -> something actually sound.
Although the lab that did this research (Chris Re and Tri Dao are involved) is run by the world's experts in squeezing CUDA and Nvidia hardware for every last drop of performance.
At the API level, the primary differences will be the addition of text infill capabilities for language generation. I also somewhat expect certain types of generation to be more cohesive (e.g. comedy or stories where you need to think of the punchline or ending first!)
https://huggingface.co/tencent/WeDLM-8B-Instruct
Diffusion isn’t natively supported in the transformers library yet so you have to use their custom inference code.