It's not quite that simple. Gemini 2.5 Flash previously had two prices, depending on whether you enabled "thinking" mode. The new 2.5 Flash has a single price, which is a lot more if you were using the non-thinking mode and may be slightly less if you were using thinking mode.
Another way to think about this is that they retired their Gemini 2.5 Flash non-thinking model entirely, and changed the price of their Gemini 2.5 Flash thinking model from $0.15/m input, $3.50/m output to $0.30/m input (more expensive) and $2.50/m output (less expensive).
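To make that concrete, here's a quick back-of-the-envelope in Python; the per-million prices are the figures above, while the workload split (input-heavy, short outputs) is just an assumption for illustration:

```python
# Old vs. new Gemini 2.5 Flash (thinking) pricing, per million tokens.
# The workload below is a made-up, input-heavy example.

def cost_usd(input_tokens, output_tokens, in_price, out_price):
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

workload = dict(input_tokens=50_000_000, output_tokens=5_000_000)

old = cost_usd(**workload, in_price=0.15, out_price=3.50)
new = cost_usd(**workload, in_price=0.30, out_price=2.50)
print(f"old: ${old:.2f}, new: ${new:.2f}")  # old: $25.00, new: $27.50
```

Flip the ratio toward long outputs and the new pricing comes out cheaper.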
Another minor nit-pick:
> For LLM providers, API calls cost them quadratically in throughput as sequence length increases. However, API providers price their services linearly, meaning that there is a fixed cost to the end consumer for every unit of input or output token they use.
That's mostly true, but not entirely: Gemini 2.5 Pro (but oddly not Gemini 2.5 Flash) charges a higher rate for inputs over 200,000 tokens. Gemini 1.5 also had a higher rate for >128,000 tokens. As a result I treat those as separate models on my pricing table on https://www.llm-prices.com
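For anyone estimating costs against those tiers, a minimal sketch; the threshold and rates are meant to resemble Gemini 2.5 Pro's published tiers but should be treated as assumptions (check the current pricing page), and note that some providers bill the whole prompt at the higher rate once it crosses the threshold rather than just the marginal tokens:

```python
# Tiered input pricing: tokens beyond a threshold are billed at a higher
# rate. Rates/threshold below are assumptions for illustration.

def tiered_input_cost(input_tokens, base_rate=1.25, high_rate=2.50,
                      threshold=200_000):
    """USD cost of one request's input tokens under marginal tiering."""
    billable_low = min(input_tokens, threshold)
    billable_high = max(0, input_tokens - threshold)
    return billable_low / 1e6 * base_rate + billable_high / 1e6 * high_rate

print(tiered_input_cost(300_000))  # 0.25 + 0.25 = 0.50
```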
One last one:
> o3 is a completely different class of model. It is at the frontier of intelligence, whereas Flash is meant to be a workhorse. Consequently, there is more room for optimization that isn’t available in Flash’s case, such as more room for pruning, distillation, etc.
OpenAI are on the record that the o3 optimizations were not through model changes such as pruning or distillation. This is backed up by independent benchmarks that find the performance of the new o3 matches the previous one: https://twitter.com/arcprize/status/1932836756791177316
That 80% drop in o3 was only a few weeks ago!
ChatGPT is simply what Google should've been 5-7 years ago, but Google was more interested in presenting me with ads to click on instead of helping me find what I was looking for. ChatGPT is at least 50% of my searches now. And they're losing revenue because of that.
I don't have any reason to doubt the reasoning this article is doing or the conclusions it reaches, but it's important to recognize that this article is part of a sales pitch.
> API Price ≈ (Hourly Hardware Cost / Throughput in Tokens per Hour) + Margin
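Taking that quoted formula at face value, a toy example shows how it maps to a per-million-token price; every number here (hardware cost, throughput, margin) is an assumption, not any provider's real economics:

```python
# Toy application of the quoted formula; every number here is assumed.
hourly_hardware_cost = 3.00              # USD per GPU-hour
throughput_tokens_per_hour = 1_000_000   # tokens served per GPU-hour
margin = 0.30                            # treated as a fractional markup on cost

cost_per_token = hourly_hardware_cost / throughput_tokens_per_hour
price_per_million = cost_per_token * 1e6 * (1 + margin)
print(f"~${price_per_million:.2f} per million tokens")  # ~$3.90
```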
But the thrust of the article is that, contrary to conventional wisdom, we shouldn't expect LLMs to keep getting more efficient, and so it's worthwhile to explore other options for cost savings in inference, such as batch processing.
The conclusion they reach is one which directly serves what they're selling.
I'll repeat: I'm not disputing anything in this article. I really am not; I'm not even trying to be coy and make allusions without directly saying anything. If I thought this was bullshit, I wouldn't be afraid to semi-anonymously post a comment saying so.
But this is advertising, just like Backblaze's hard drive reliability blog posts are advertising.
For large models, compute is very rarely dominated by attention. Take, for example, this FLOPs calculation from https://www.adamcasson.com/posts/transformer-flops
Compute per token = 2(P + L × W × D)
P: total parameters, L: number of layers, W: context size, D: embedding dimension
For Llama 8b, the window size starts dominating compute cost per token only at 61k tokens.
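A quick sanity check of that 61k figure, using assumed Llama 3 8B-ish shape values (32 layers, 4096 embedding dimension):

```python
# Per-token FLOPs from the formula above: 2 * (P + L * W * D).
P = 8e9    # total parameters
L = 32     # layers (assumed)
D = 4096   # embedding dimension (assumed)

def flops_per_token(W):
    return 2 * (P + L * W * D)

for W in (8_000, 61_000, 200_000):
    print(f"W={W:>7,}: {flops_per_token(W) / 1e9:.1f} GFLOPs per token")

# The context-dependent term L*W*D matches the parameter term P when:
crossover = P / (L * D)
print(f"context starts to dominate around W ≈ {crossover:,.0f} tokens")  # ~61,000
```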
If these corporations had to build a car they would make the largest possible engine, because "MORE ENGINE MORE SPEED", just like they think that bigger models mean bigger intelligence, but forget to add steering, or even a chassis.
I'll take a model specialized in web scraping. Give me one trained on generating report and documentation templates (I'd commit felonies for one that could spit out a near-complete report for SSRS).
Models trained for specific helpdesk tasks ("install a printer", "grant this user access to these services with this permission level").
A model for analyzing network traffic and identifying specific patterns.
None of these things should require titanic models nearing trillions of parameters.
I suspect a large part of the reason we've had many decades of exponential improvements in compute is the general purpose nature of computers. It's a narrow set of technologies that are universally applicable and each time they get better/cheaper they find more demand, so we've put an exponentially increasing amount of economical force behind it to match. There needed to be "plenty of room at the bottom" in terms of physics and plenty of room at the top in terms of software eating the world, but if we'd built special purpose hardware for each application I don't think we'd have seen such incredible sustained growth.
I see neural networks and even LLMs as being potentially similar. They're general purpose, a small set of technologies that are broadly applicable and, as long as we can keep making them better/faster/cheaper, they will find more demand, and so benefit from concentrated economic investment.
Arguably that was Haiku 3.5 in October 2024.
I think the same hypothesis could apply though, that you price your model expecting a certain average input size, and then adjust price up to accommodate the reality that people use that cheapest model when they want to throw as much as they can into the context.
Then there is Poe with its pricing games. Prices at Poe have been going up over time: they were extremely aggressive early on to gain market share, presumably under the assumption that LLM pricing would keep falling, and that reduced pricing never materialized.
Gemini Flash 2.5 and Gemini 2.5 Flash Preview were presumably a whole lot more similar to each other.
Engineers who work with LLM APIs are hopefully paying enough attention that they understand the difference between Claude 3, Claude 3.5 and Claude 4.
One addition: the O(n^2) compute cost is most acute during the one-time prefill of the input prompt. I think the real bottleneck, however, is the KV cache during the decode phase.
For each new token generated, the model must access the intermediate state of all previous tokens. This state is held in the KV Cache, which grows linearly with sequence length and consumes an enormous amount of expensive GPU VRAM. The speed of generating a response is therefore more limited by memory bandwidth.
Viewed this way, Google's 2x price hike on input tokens is probably related to the KV Cache, which supports the article’s “workload shape” hypothesis. A long input prompt creates a huge memory footprint that must be held for the entire generation, even if the output is short.
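For a sense of the scale involved, a rough KV-cache sizing sketch; the model shape below (layers, KV heads, head dimension, fp16 values) is an assumption, not any specific production model:

```python
# Approximate KV-cache memory per sequence for an assumed model shape.
layers = 32
kv_heads = 8          # grouped-query attention (assumed)
head_dim = 128
bytes_per_value = 2   # fp16/bf16

def kv_cache_bytes(seq_len):
    # 2x for keys and values, per layer, per KV head, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len

for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {kv_cache_bytes(n) / 2**30:5.1f} GiB")
```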
>For each new token generated, the model must access the intermediate state of all previous tokens.
Not all the previous tokens are equal; not all deserve the same attention, so to speak. The farther back the tokens, the more opportunity for many of them to be pruned and/or collapsed with other similarly distant and less meaningful tokens in a given context. So instead of O(n^2) it would be more like O(n log n).
I mean, you'd expect that, for example, "knowledge worker" models (vs., say, "poetry" models) would possess some perturbative stability w.r.t. changes to/pruning of the remote previous tokens, at least for those tokens which are less meaningful in the current context.
Personally, I feel the situation is good: performance engineering work becomes somewhat valuable again as we're reaching the N where O(n^2) forces management to throw some money at engineers instead of at the hardware :)
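One way to picture the O(n log n) idea above (purely illustrative; not how any production model actually prunes): attend to a fixed recent window plus exponentially spaced "landmark" tokens further back, so each position only looks at O(log n) distant tokens.

```python
# Illustrative token-selection pattern giving ~O(n log n) total attention work.
def attended_positions(i, window=128):
    recent = set(range(max(0, i - window), i))
    landmarks = set()
    d = window
    while i - d >= 0:
        landmarks.add(i - d)   # exponentially spaced distant tokens
        d *= 2
    return recent | landmarks

n = 10_000
sparse = sum(len(attended_positions(i)) for i in range(n))
full = n * (n - 1) // 2
print(f"sparse: {sparse:,} vs full attention: {full:,}")
```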
FWIW Gemini 2.5 Flash Lite is still very good; I used it in my latest side project to generate entire web sites and it outputs great content and markup every single time.
Personally, I'm rooting for RWKV / Mamba2 to pull through, somehow. There's been some work done to increase their reasoning depths, but transformers still beat them without much effort.
In biological terms, the Transformer architecture is more in line with the highly interconnected, global receptive field of neurons.
Oh, I noticed. I've also complained how Gemini 2.0 Flash is 50% more expensive than Gemini 1.5 Flash for small requests.
Also I'm sure if Google wanted to price Gemini 2.5 Flash cheaper they could. The reason they won't is because there is almost zero competition at the <10 cents per million input token area. Google's answer to the 10 cents per million input token area is 2.5 Flash Lite which they say is equivalent to 2.0 Flash at the same cost. Might be a bit cheaper if you factor in automatic context caching.
Also, the quadratic increase is valid but it's not as simple as the article states, due to caching. And if it were a big issue Google would impose tiered pricing like they do for Gemini 2.5 Pro.
And for what it's worth I've been playing around with Gemma E4B on together.ai. It takes 10x as long as Gemini 2.5 Flash Lite and it sucks at multilingual. But other than that it seems to produce acceptable results and is way cheaper.
Since Gemini CLI was recently released, many people on the "free" tier noticed that their sessions immediately devolved from Gemini 2.5 Pro to Flash "due to high utilization". I asked Gemini itself about this and it reported that the finite GPU/TPU resources in Google's cloud infrastructure can get oversubscribed for Pro usage. Google (no secret here) has a subscription option for higher-tier customers to request guaranteed provisioning for the Pro model. Once their capacity gets approached, they must throttle down the lower-tier (including free) sessions to the less resource-intensive models.
Price is one lever to pull once capacity becomes constrained. Yet, as the top voted comment of this post explains, it's not honest to simply label this as a price increase. They raised Flash pricing on input tokens but lowered pricing on output tokens up to certain limits -- which gives credence to the theory that they are trying to shape demand so that it better matches their capacity.
It could just as well have been Google reducing subsidisation. From the outside that would look exactly the same.
DRAM scaling + interconnect bandwidth stagnation
I'm not sure where you're getting an exponential from.
Because they were the underdog. Everyone was talking about ChatGPT, or maybe Anthropic. Then Deepseek. Google were the afterthought that was renowned for that ridiculous image generator that envisioned 17th century European scientists as full-headdress North American natives.
There has been an absolute 180 since then, and Google now has the ability to set their pricing similar to the others. Indeed, Google's pricing still carries a pretty large discount relative to similarly capable models, even after they raised prices.
The warning is that there is no free lunch, and when someone is basically subsidizing usage to get noticed, they don't have to do that once their offering is good.
Stopped reading here. If you're positioning yourself as if you have some kind of unique insight when there is none, in order to boost your credentials and sell your product, there's little chance you have anything actually insightful to offer. Might sound like an overreaction/nitpicking, but it's entirely needless LinkedIn-style "thought leader" nonsense.
In reality it was immediately noticed by anyone using these models, have a look at the HN threads at the time, or even on Reddit, let alone the actual spaces dedicated to AI builders.
They likely lose money when you take into account the capital cost of training the model itself, but that cost is at least fixed: once it's trained you can serve traffic from it for as long as you choose to keep the model running in production.
One of the clearest examples is DeepSeek v3. DeepSeek has mentioned that its price of $0.27/$1.10 carries an 80% profit margin, so serving costs them roughly 90% less than the price of Gemini Flash. And Gemini Flash is very likely a smaller model than DeepSeek v3.
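Back-of-the-envelope on that claim, taking the quoted prices and margin at face value (these are the comment's figures, not verified numbers):

```python
# If $0.27/M input and $1.10/M output really carry an 80% margin,
# the implied serving cost is ~20% of the price.
margin = 0.80
for name, price in (("input", 0.27), ("output", 1.10)):
    implied_cost = price * (1 - margin)
    print(f"{name}: price ${price:.2f}/M -> implied cost ~${implied_cost:.3f}/M")
```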
We don't have accurate price signals externally because Google, in particular, had been very aggressive about treating pricing as a competitive exercise rather than anything that seemed tethered to costs.
For quite some time, their pricing updates would be, across the board, exactly 2/3 of the cost of OpenAI's equivalent model.
[^1] "If you’re building batch tasks with LLMs and are looking to navigate this new cost landscape, feel free to reach out to see how Sutro can help."
[^2] "Google's decision to raise the price of Gemini 2.5 Flash wasn't just a business decision; it was a signal to the entire market." is by far the biggest giveaway. The other tells are repeated fanciful descriptions of things that could be real, which, when stacked up, indicate a surreal, artificial understanding of what they're being asked to write about, i.e. "In a move that at first went unnoticed,"
Context size is the real killer when you look at running open source alternatives on your own hardware. Has anything even come close to the 100k+ range yet?
Llama 4 Maverick is 16x 17B, so 67 GB in size. The equivalency is 400 billion.
Llama 4 Behemoth is 128x 17B, 245 GB in size. The equivalency is 2 trillion.
I don't have the resources to be able to test these, unfortunately; but they are claiming Behemoth is superior to the best SaaS options via internal benchmarking.
Comparatively, DeepSeek R1 671B is 404 GB in size, with pretty similar benchmarks.
But you compare DeepSeek R1 32B to any model from 2021 and it's going to be significantly superior.
So we have quality of models increasing, resources needed decreasing. In 5-10 years, do we have an LLM that loads up on a 16-32GB video card that is simply capable of doing it all?
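For a rough sense of what fits on that kind of card, here's weights-only memory at a few quantization levels (ignoring KV cache and runtime overhead; the figures are straightforward arithmetic, not benchmarks):

```python
# Approximate VRAM for model weights alone at different quantization levels.
def weights_gib(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for params in (8, 32, 70):
    for bits in (16, 8, 4):
        print(f"{params:>3}B @ {bits:>2}-bit: {weights_gib(params, bits):6.1f} GiB")
```

At 4-bit, a 32B model is around 15 GiB of weights, so a 16-32 GB card is already in range; the open question is how much quality survives at that size.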
I think the best of both worlds is a sufficiently capable reasoning model with access to external tools and data that can perform CPU-based lookups for information that it doesn't possess.
"Sir, I'm delighted to report that the productivity and insights gained outclass anything available from four years ago. We are clearly winning."