They produce a drastically lower number of tokens to solve a problem, but they don't seem to have put enough effort into refining their reasoning and execution: they produce broken tool calls and generally struggle with 'agentic' tasks. For raw problem solving without tools or search, though, they match Opus and GPT while presumably being a fraction of the size.
I feel like google will surprise everyone with a model that will be an entire generation beyond SOTA at some point in time once they go from prototyping to making a model that's not a preview model anymore. All models up till now feel like they're just prototypes that were pushed to GA just so they have something to show to investors and to integrate into their suite as a proof of concept.
I really doubt it, especially Pro. If anything I wouldn't be surprised if their hardware lets them run bigger models more cheaply and quickly than the others. Pro is probably smaller than GPT 5.4 and Opus 4.6 (looks like 4.7 decreased in size), but 5x seems way too much. IMO Gemini 3 Pro is the most "intelligent" in an all-round human way. Especially in the humanities. It's highly knowledgeable and undeniably the number one model at producing natural text in a large number of (human!) languages. The difference becomes especially large for more niche languages. That does not suggest a smaller model, more the opposite. The top 4 models at multilinguality are all Google: 1. 3 Pro, 2. 3 Flash, 3. 2.5 Pro, 4. 2.5 Flash. Even the biggest OpenAI and Anthropic models can't compete in that dimension.
It's definitely weaker at math and much worse at agentic things. Gemini chat as an app is also lightyears behind; it's barely different from ChatGPT at release over 3 years ago. These things make it feel much weaker than it is.
GPT, on the other hand, was always terrible at languages, except for the short-lived gpt-4.5-preview.
All modern models including Gemini have bugs in basic language coherency - random language switching, self-correction attempts resulting in hallucinations etc. I speculate it's a problem with heavy RL with rewards and policies not optimized for creative writing.
Maybe it's just something that people aren't bothered with?
ultra ~ mythos ~ gpt-4.5 ~ 4x behemoth
pro ~ opus ~ 2x maverick
flash ~ sonnet ~ scout ~ other 20-30b active Chinese models
Agreed, Gemini-cli is terrible compared to CC and even Codex.
But Google is clearly prioritizing to have the best AI to augment and/or replace traditional search. That's their bread and butter. They'll be in a far better place to monetize that than anyone else. They've got a 1B+ user lead on anyone - and even adding in all LLMs together, they still probably have more query volume than everyone else put together.
I hope they start prioritizing Gemini-cli, as I think they'd force a lot more competition into the space.
Using it with opencode I don't find the actual model to cause worse results with tool calling versus Opus/GPT. This could be a harness problem more than a model problem?
I do prefer the overall results with GPT 5.4, which seems to catch more bugs in reviews that Gemini misses and produce cleaner code overall.
(And no, I can't quantify any of that, just "vibes" based)
Speed? The pro models are slow for me
The 3.1 Pro model is good and I don't recognise the GP's complaint of broken tool calls, but I'm only using it via the Gemini CLI harness; it sounds like they might be hosting their own agentic loop?
But I picked it up again about a month ago and I have been quite impressed. I haven't yet hit any of those frustrating QoL issues it was famous for, and I've been using it a few hours a day.
Maybe it will let me down sooner or later but so far it has been working really well for me and is pretty snappy with the auto model selection.
After cancelling my Claude Pro plan months ago due to Anthropic enshittification I’ve been nervous relying solely on Codex in case they do the same, so I’ve been glad to have it available on my Google One plan.
Gemini will be dead in 2 years and there'll be something else, but the ad and search company will remain given that they basically own the world wide web.
Except now, so much of the WWW is filled with AI slop that it breaks the system.
When a lot of people ask the same thing, they can just index the questions, like results on a search engine, and recalculate them only so often.
Where they excel is just total holistic _knowledge_ about the world. I don't like "talking" to it, because I kind of hate its tone, but I find Gemini generally extremely useful for research and analysis tasks and looking up information.
You can put a whole 50,000-70,000 LOC codebase into Gemini 3.1 Pro's context, making it 800,000+ tokens, give it a detailed task, and ask for the whole changed files back, and it will execute it sometimes in one shot, sometimes in two. E.g., depending on the stack you work with, you can show it all the errors at once so it can fix everything in a single reply.
Yes, it will give you back 5-15 files, up to 4000 LOC total, with only the relevant parts changed.
This is a terribly inefficient way to burn $10 of tokens in 20 minutes, but the attention and 1:1 context retention are truly amazing.
PS: At the same time it is bad at tool use, but this has nothing to do with context.
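The "$10 of tokens in 20 minutes" figure above is roughly consistent with per-token pricing. A minimal sketch, assuming $2/Mtok input (a figure mentioned elsewhere in this thread for 3.1 Pro) and a hypothetical $12/Mtok output rate; the output size is an assumption (~4000 LOC at a guessed ~12 tokens per line):

```python
# Rough cost sketch for the whole-codebase one-shot workflow described above.
# Both prices and the tokens-per-line figure are assumptions for illustration.
INPUT_PRICE = 2.0 / 1_000_000    # $ per input token (quoted in this thread)
OUTPUT_PRICE = 12.0 / 1_000_000  # $ per output token (assumed)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the assumed per-token prices."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# ~800k tokens of context in, ~4000 LOC back (~50k tokens, assumed)
cost = request_cost(800_000, 50_000)
print(f"${cost:.2f} per one-shot request")  # ~$2.20
```

A couple of shots plus retries at a few dollars each gets you into the $10 range quickly.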
Gemini is just not trained for autonomy/tool use/agentic behavior to the same degree as the other frontier models. Goog seems to emphasize video/images/scientific+world knowledge.
e.g. it sucks at general tool use but sucks even more at it after a chunk of time in a session. One frustrating situation is to watch it go into a loop trying and failing to edit source files.
I often wonder how my old coworkers from Google get by, if this is the agentic coding they have available to them for working on projects in Google3. But I suspect the models they work with have been fine-tuned on Google's custom tooling and perform better?
> They produce drastically lower amount of tokens to solve a problem
I LOLed at this because of the constant death loops that don't even solve the problem at all.
That model would then be SOTA.
Tautologically you can't be better than SOTA
(disclosure: I am long GOOG, for this and a few other reasons)
And the converse is true also. I mean, look at NVIDIA. For the longest time they were just a gaming card company, competing with AMD. I remember alternating between the two companies for my custom builds in the 90s and it basically came down to rendering speed and frame rate.
But Jensen bet on the "compute engine" horse and pushed CUDA out, which became the defacto standard for doing fast, parallel arithmetic on a GPU. He was able to ride the BitCoin wave and then the big one, DNNs. AMD still hasn't caught on yet (despite 15 years having gone by).
3.1 pro is just fundamentally not on the same level. In any context I've tried it in, for code review it acts like a model from 1yr ago in that it's all hallucinated superficial bullshit.
Claude code is significantly less likely to produce the same (yet still does a decent amount). Gpt 5.4 high/xhigh is on another level altogether - truly not comparable to Gemini.
Buying from nvidia is the only real option and even that is not optimal.
In fact I hold the opposite of this hypothesis, for two reasons. First, Google has artificially limited production, and TSMC favours whoever can pay for the most capacity (as incremental capacity is very cheap for them), so Nvidia gets the first slot on a new process.
The second reason is that GCP's operating margin is very high compared to, say, Hetzner or Lambda Labs, and you can get GPUs much cheaper there than on GCP. So students/small researchers are stuck on GPUs.
Cook did very well in all areas as well as in not trying to create a cult.
Honestly I'm rather impressed with how they handled it; they had enough of the infra and org in place to jump at it once the cat was out of the bag.
Sundar declared a code red or whatever and they made it happen. But that could ONLY happen if they had the bedrock of that ability already built.
No one really remembers now that google was a year behind.
The Google Antigravity subreddit is a shitshow though.
Like Apple Intelligence? Which was quite crap
Anthropic and OpenAI are having to fight like hell to secure market share. Google just gets to sit back and relax with its browser and android monopolies.
Why did our regulators fall asleep at the wheel? Google owns 92% of "URL bar" surface area and turned it into a Google search trademark dragnet. Now Anthropic has to bid for its own products against its competitors and inject a 15+% CAC which is just a Google tax.
Now consider all the bullshit Google gets to do with android and owning that with an iron fist. Every piece of software has a 30% tax, has to jump through hoops, and even finding it is subject to the same bidding process.
These companies need to be broken up.
Google would be healthier for the economy and its own investors as six different companies. And they shouldn't be allowed to set the rules for mobile apps or tax other people's IP and trademarks.
Of course they should have to fight with the inventors of the technology they’re using.
Source?
Attention e.g. was developed by Dzmitry Bahdanau et al. (those being Kyunghyun Cho and Yoshua Bengio) in 2014 while interning at the University of Montreal.
The insight of the paper you point to was that with attention you could dispense with the RNN that attention was initially developed to support.
Sam Altman's honesty problems, and Elon buying a VS code fork for $60 billion isn't a sign of moral uprightness or wisdom.
There's a lot to be said for grinding away at a problem. Being on your eighth generation AI chip and seventh generation of autonomous driving hardware is how you build value. Not by hobnobbing with fascists and building an army of stock pumping retail investors.
They're helping close the distance to realistic-quality inference on phones and other smaller devices.
If someone monopolized OS marketshare for mid- to low-priced devices, that does seem like it would be a useful research focus.
Whereas offering the same with compute-inefficient cloud inference would be economically unviable at scale.
Free on-device Google premium closed-source models* = free Google Maps 2.0
* As long as you ship Google Apps and Play Services
that said, I actually agree: google IMHO silently dominates the 'normie business' chatbot area. gemini is low key great for day to day stuff.
It's hard to reconcile this because Google likely has the most compute and at the lowest cost, so why aren't they gassing the hell out of inference compute like the other two? Maybe all the other services they provide are too heavy? Maybe they are trying to be more training heavy? I don't know, but it's interesting to see.
I was planning on comparing them on coding but I didn't get the Gemini VSCode add-in to work so yeah, no dice.
The Android and web apps are also riddled with bugs, including ones that make you lose your chat history from threads if you switch between them. Not cool.
I'll be cancelling my Google One subscription this month.
I see it like going to the doctor and asking them to cite sources for everything they tell me. It would be ridiculous and totally make a mess of the visit. I much prefer just taking what the doctor said on the whole, and then verifying it myself afterwards.
Obviously there is a lot of nuance here, areas with sparse information and certainly things that exist post knowledge cut-off. But if I am researching cell structure, I'm not going to muck up my context making it dig for sources for things that are certainly already optimal in the latent space.
GPT (codex) was accurate on the first run and took 12 minutes
Gemini (antigravity) missed 1 value because it didn't load the full 1099 pdf (the laziness), but corrected it when prompted. However it only spent 2 minutes on the task.
Claude (CC) made all manner of mistakes after I waited overnight for it to finish because it hit my limit before doing so. However, Claude did the best on the next step of actually filling out the PDF forms, but it ended up not mattering.
Ultimately I used gemini in chrome to fill out the forms (freefillableforms.com), but frankly it would have been faster to manually do it copying from the spreadsheets GPT and Gemini output.
I also use Antigravity a lot for small greenfield projects (<5k LOC). I don't notice a difference between Gemini and Claude outside usage limits. Besides that I mostly use Gemini for its math and engineering capabilities.
Gemini missed some nuances about the paperwork processes of Delaware. It repeatedly assumed I could do something instantly via an online portal that actually required either snail mail or an intermediary that actually had API access to Delaware's systems. In the end, these processes took a couple of days, and while I got things done in time, I wish I had not taken questions of process at face value, and instead had kicked off the taxes at the end of February rather than the week before they were due.
Being thrifty can be good! But it also can mean your system is not reflecting sufficiently, is not considering enough factors, isn't reading enough source code.
We are still firmly in "who really knows" territory. I have mixed feelings about token spendiness vs thrift, is all.
Interesting that there's separate inference and training focused hardware. Do companies using NV hardware also use different hardware for each task or is their compute more fungible?
One reason is that most clouds/neoclouds don't own workloads, and want fungibility. Given that you're spending a lot on H200s and what not it's good to also spend on the networking to make sure you can sell them to all kinds of customers. The Grok LPU in Vera Rubin is an inference-specific accelerator, and Cerebras is also inference-optimized so specialization is starting to happen.
Dedicated hardware will usually be faster, which is why as certain things mature, they go from being complicated and expensive to being cheap and plentiful in $1 chips. This tells me Google has a much better grasp on their stack than people building on NVidia, because Google owns everything from the keyboard to the silicon. They've iterated so much they understand how to separate out different functions that compete with each other for resources.
This seems impressive. I don't know much about the space, so maybe it's not actually that great, but from my POV it looks like a competitive advantage for Google.
The cost can also change dramatically: on top of the higher token costs for Gemini Pro ($1.25/mtok input for 2.5 versus $2/mtok input for 3.1), the newer release also tokenizes images and PDF pages less efficiently by default (>2x token usage per image/page) so you end up paying much much more per request on the newer model.
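To make that concrete, here's a quick sketch of how the two factors compound for a PDF-heavy request. The per-Mtok prices are the ones quoted above; the tokens-per-page figures are illustrative assumptions (the newer model at a bit over 2x the older one):

```python
# Sketch of compounding per-request cost for PDF-heavy workloads.
# Tokens-per-page values are assumptions chosen to reflect the ">2x" claim.
def request_cost(pages: int, tokens_per_page: int, price_per_mtok: float) -> float:
    """Input-token cost in dollars for a request containing `pages` PDF pages."""
    return pages * tokens_per_page * price_per_mtok / 1_000_000

old = request_cost(pages=100, tokens_per_page=258, price_per_mtok=1.25)  # 2.5 Pro
new = request_cost(pages=100, tokens_per_page=560, price_per_mtok=2.00)  # 3.1 Pro
print(f"old=${old:.3f} new=${new:.3f} -> {new / old:.1f}x per request")  # ~3.5x
```

Under these assumptions the 1.6x price increase and the >2x tokenization change multiply into roughly 3.5x the input cost per request, which is why this niche matters to the people it hits.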
These are somewhat niche concerns that don't apply to most chat or agentic coding use cases, but they're very real and account for some portion of the traffic that still flows to older Gemini releases.
Junie tooling excels when you are more involved. Like, look in these two files, add this specific functionality, in this specific way. Junie is usually a lot faster and to the point. Very simple tooling; it just works for this workflow. But it breaks for the "code the whole thing for me" workflow.
Wow. Just Wow. I presume that's for each chip, and there are 1152 chips in a pod so that's 331TB HBM and 442TB SRAM per pod. Just wow.
https://wccftech.com/google-splits-tpuv8-strategy-two-chips-...
Owning your hardware and your entire stack is huge, especially these days with so much demand. Long term, I think they end up doing very well. People clowned so hard on Google for the first two years (until Gemini 2.5 or 3) because it wasn't as good as OpenAI or Anthropic's models, but Google just looked so good for the long game.
Another benefit for them: if LLMs end up being a huge bubble that doesn't pay the absurd returns the industry expects, they're not kaput. They already own so many markets that this is just an additional thing for them, whereas the AI-only big labs are probably fucked.
All that said: what the hell do I know? Who knows how all of this will play out. I just think Google has a great foundation underneath them that'll help them build and not topple over.
> One pod of TPU 8t is 121 ExaFlops; or 121,000 PetaFlops.
Meanwhile, the compute capacity of the top 10 supercomputers in the entire world is 11,487 Petaflops.[1]
I know, I know, not the same flops, yada yada, but still. Just 1 pod alone is quite a beast.
Edit: [1] https://top500.org/lists/top500/2025/11/
Otherwise bitcoin mining rigs dwarf everything, if you just want to count raw operations per second.
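For what it's worth, the ratio in the comparison above works out like this (with the stated caveat that the pod figure is low-precision FP while the TOP500 figure is FP64, so they aren't directly comparable):

```python
# Back-of-envelope ratio using the two figures quoted in this thread.
pod_pflops = 121_000    # one TPU pod, ~121 ExaFLOPS expressed in PetaFLOPS
top10_pflops = 11_487   # summed capacity of the top 10 TOP500 systems
ratio = pod_pflops / top10_pflops
print(f"one pod ~ {ratio:.1f}x the top-10 supercomputers combined")  # ~10.5x
```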
It is about agents in the sense that the design is for long context: many requests where the initial "chunk" is cached but spread across many requests.
They don't call this out specifically, but in the technical details, like the bit about the SRAM and how it's all interconnected nodes in a pod, it's "designed" for it.
https://www.janestreet.com/join-jane-street/machine-learning...
> We build on the latest papers in LLMs, computer vision, RL, training libraries, cuda kernels, or whatever else we need to train good models.
> We invent our own set of architectures and optimizations that work for trading.
New pods use 10x as much RAM as previous generation.
IMHO that happy medium is Google. Not having to pay the NVidia tax will likely be a huge competitive advantage. And nobody builds data centers as cost-effectively as Google. It's kind of crazy to be talking ExaFLOPS and Tb/s here. From some quick Googling:
- The first MegaFLOPS CPU was in 1964
- A Cray supercomputer hit GigaFLOPS in 1988 with workstations hitting it in the 1990s. Consumer CPUs I think hit this around 1999 with the Pentium 3 at 1GHz+;
- It was the 2010s before we saw off-the-shelf TFLOPS;
- It was only last year when a single chip hit PetaFLOPS. I see the IBM Roadrunner hit this in 2008 but that was ~13,000 CPUs so...
Obviously this is near 10,000 TPUs to get to ~121 EFLOPS (FP4 admittedly) but that's still an astounding number. It means each one is doing ~12 PFLOPS (FP4).
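Spelling out that per-chip arithmetic, using the same figures as above (~121 EFLOPS FP4 for the pod, ~10,000 chips):

```python
# Per-chip throughput implied by the pod-level figure quoted above.
pod_eflops = 121          # whole pod, FP4
chips = 10_000            # approximate chip count used in the estimate above
per_chip_pflops = pod_eflops * 1_000 / chips  # 1 EFLOPS = 1,000 PFLOPS
print(f"~{per_chip_pflops:.1f} PFLOPS (FP4) per chip")  # ~12.1
```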
I saw a claim that Claude Mythos cost ~$10B to train. I personally believe Google can (or soon will be able to) do this for an order of magnitude less at least.
I would love to know the true cost/token of Claude, ChatGPT and Gemini. I think you'll find Google has a massive cost advantage here.
Can you cite this? That seems absurd.
I've seen figures that suggest GPT-4 was 1.8T parameters and cost upwards of $100 million to train (also unsubstantiated), in which case the Mythos figure might be inflated and also include development costs.
So who really knows?
[1]: https://www.softwarereviews.com/research/claude-mythos-previ...
[2]: https://x.com/duttasomrattwt/status/2041903600516133016
[3]: https://www.forrester.com/blogs/project-glasswing-the-10-con...
Google could probably train models for orders of magnitude less money as you say, but they aren't. They are not capable of creating high quality models like OpenAI and Anthropic are. Their company is just too disorganized and chaotic.
Anecdotally, I don't know a single person who uses Gemini on purpose.
How does that make any sense?
iPhones may be able to run local model inference, but Apple still can't train anything if they don't have any data.
This is such revisionist history. They were not strategically waiting. They tried, really, really hard. The entire iPhone 16 Pro was built on AI. Heck, they even (re)named it as Apple Intelligence.
Remember, this is the same time when Microsoft launched Copilot (RIP), Google launched Gemini, OpenAI with ChatGPT etc.
They had to walk back hard because it was a flop. They might be accidentally successful because they are a company with multiple strengths, but don't think of it as them sitting AI out.
Is that why they rushed out AI summaries to play catch-up, and then backpedaled when they blew up in customers' faces and individuals implicated in false headlines threatened to sue?
I use Gemini on purpose all the time. It can start timers for me, add calendar entries without having to type it out, convert email to calendar or reminders etc. I'd use it even more if it had more access to other bits of my phone.
What if somebody cracks the problem of splitting inference between local and remote? What if someone else manages to modularize learning so your local LLM doesn't need to have been trained on how to compute integrals? Obviously we can't dissect a current LLM and say "we can remove these weights because they do math", but there's no guarantee there isn't an architecture that will allow for that.
Apple could also be training an LLM Siri 2.0 that knows enough to do the things you want. Setting alarms, sending messages, etc. Apple would have all the information on what the major use cases are and where Siri is currently failing. They can increase Siri's capabilities as local LLM inference improves.
As for Google creating high quality models, I personally believe the models are going to be commoditized. I don't believe a single company is going to have a model "moat" to sustain itself as a trillion dollar company. I base this on two reasons:
1. At the end of the day, it's just software and software is infinitely reproducible and distributable. I mean we already saw one significant Anthropic leak this year; and
2. China is going to make sure we're not all dependent on one US tech company who "owns" AI. DeepSeek was just the first shot across the bow for that. It's going to be too important to China's national security for that not to happen.
And OpenAI's entire funding is predicated on that happening and OpenAI "winning".
If the whole AI bubble spectacularly collapses, at least we got a lot of cool pics of custom hardware!
Every other news story for the past month has been about lacking capacity. Everyone is having scaling issues, with more demand than they can cover. Anthropic has been struggling for a few months, especially visibly when the EU tz is still up and the US east coast comes online: everything grinds to a halt. MS has been pausing new subscriptions for GH Copilot, also because of a lack of capacity. And yet people are still on bubble this, collapse that? I don't get it. Is it becoming a meme? Are people seriously seeing something I don't? For the past 3 years models have kept on improving, capabilities have gone from toy to actually working, and there's no sign of stopping. It's weird.
The way this could happen is if model commoditization increases - e.g. some AI labs keep publishing large open models that increasingly close the gap to the closed frontier models.
Also, if consumer hardware keeps getting better and models get so good that most people can have most of their usage satisfied by smaller models running on their laptop, they won't pay a ton for large frontier models.
Well.. we won't have to as we'll have models to do it for us!
Though nowadays it feels like the bubble is going to end up being mainly an OpenAI issue. The others are at least vaguely trying to balance expansion with revenue, without counting on inventing a computer god.
Demand for internet and web services is significantly higher today than in 2000 but a bubble still popped. Heck a regular old recession or depression, completely unrelated to AI could happen next year and could collapse the industry. I mean housing is more expensive than ever nearly 20 years after collapsing in the Great Recession.
If we apply the same logic, any of oAI, xAI, Anthropic might pop, but realistically they won't, and even if they do, some other players will take their spots, and the tech will survive, and more importantly the demand will still be there. This cat isn't going back into the bag. People want this now. More than all the providers can give them. Today. The demand won't suddenly disappear now that "we got a hit" like someone put it recently.
In 2008 there was a subprime mortgage crisis that caused the housing market to crash. Nearly all banks who participated in this survived. There was and still is significant demand for houses, financed through mortgages.
The bubble can burst, most if not all the big players still survive 20 years later and yet significant value and capital can still be destroyed in the process.
Same for the dot com. There was demand for the internet, it couldn't meet the expectations of the day, and yet here we are with like 100x more internet services than before, all these years later. Saying the AI bubble will pop is not a prediction that all AI companies will cease to exist immediately. Amazon lost 80% of their stock price in 2000. Is Amazon bigger or smaller today than they were in 2000?
Trainium3 and Maia 200 are 2.5 and 2.8Tb/s vs this 1.2Tb/s. Maia is 6 stacks of HBMe3, so ratio of mem:interconnect bandwidth is really falling behind here. Notably Maia is also, like TPU, high radix.
Thanks for posting otherwise.
Edit: actually, it looks like the header got captured as a figure caption by accident.