FilterHN

spankalee

1 month ago

[-]

I hope this works better than 3.0 Pro

I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development.

It's stunningly good at reasoning, design, and generating the raw code, but it just falls over a lot when actually trying to get things done, especially compared to Claude Opus.

Within VS Code Copilot, Claude will have a good mix of thinking streams and responses to the user. Gemini will almost completely use thinking tokens, and then just do something but not tell you what it did. If you don't look at the thinking tokens you can't tell what happened, but the thinking token stream is crap. It's all "I'm now completely immersed in the problem...". Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress. It's bad at using tools and tries to edit files in weird ways instead of using the provided text editing tools. In Copilot, it won't stop and ask clarifying questions, though in Gemini CLI it will.

So I've tried to adopt a plan-in-Gemini, execute-in-Claude approach, but while I'm doing that I might as well just stay in Claude. The experience is just so much better.

For as much as I hear Google's pulling ahead, Anthropic seems to be to me, from a practical POV. I hope Googlers on Gemini are actually trying these things out in real projects, not just one-shotting a game and calling it a win.

1 month ago

[-]

Yes, this is very true and it speaks strongly to this wayward notion of 'models' - it depends so much on the tuning, the harness, the tools.

I think it speaks to the broader notion of AGI as well.

Claude is definitively trained on the process of coding not just the code, that much is clear.

Codex has the same limitation but not quite as bad.

This may be a result of Anthropic using 'user cues' with respect to what are good completions and not, and feeding that into the tuning, among other things.

Anthropic is winning coding and related tasks because they're focused on that, Google is probably oriented towards a more general solution, and so, it's stuck in 'jack of all trades master of none' mode.

rhubarbtree

1 month ago

[-]

Google are stuck because they have to compete with OpenAI. If they don’t, they face an existential threat to their advertising business.

But then they leave the door open for Anthropic on coding, enterprise and agentic workflows. Sensibly, that’s what they seem to be doing.

That said Gemini is noticeably worse than ChatGPT (it’s quite erratic) and Anthropic’s work on coding / reasoning seems to be filtering back to its chatbot.

So right now it feels like Anthropic is doing great, OpenAI is slowing but has significant mindshare, and Google are in there competing but their game plan seems a bit of a mess.

frogperson

1 month ago

[-]

Google might be a mess now, but they have time. OpenAI and Anthropic are on borrowed time; Google has a built-in money printer. They just need to outlast the others.

harrall

1 month ago

[-]

Plus they started making AI processors 11 years ago and invented the math behind “GPTs” 9 years ago. Gemini is way cheaper for them to run than it is for everyone else.

I think Gemini is really built for their biggest market — Google Search. You ask questions and get answers.

I’m sure they’ll figure out agentic flows. Google is always a mess when it comes to product. Don’t forget the Google chat sagas where it seems as if different parts of the company were making the same product.

nobrains

1 month ago

[-]

They are also a mess in UI now.

In the "Intelligence applied" section, where they show the comparison animations, they are shown using a non-optimal UI.

There is not enough time to read the text, see the old animation, and see the new animation. Better would have been to keep the same animation on repeat, so that people have unlimited time to read the text and observe the animations.

Also, it jumps from example to example in the same video. Better would have been to show each separately, so that once user is done observing one example at their own pace, they can proceed to the next.

As a workaround, I had to open the video (just the video) in a new tab, pause once an example came up, read the text, then rewind to the start of the animation to see the old animation example, then rewind again, then see the new animation example, and then sometimes rewind again if I wanted to see the animation again. Then, once done with the example, I had to forward to the next example and repeat the above process again.

Somewhere along that process, they lost me.

koakuma-chan

1 month ago

[-]

Who they? Do the engineers who actually did that work at Google still? I heard that the guy who made TPUs has his own startup now.

donbox

1 month ago

[-]

Only one guy built the TPUs?

vlovich123

1 month ago

[-]

They got acquired by Nvidia

EQmWgw87pw

1 month ago

[-]

The modern generations are already starting to open up ChatGPT for basic questions. Not sure how long it’ll be their biggest market for.

LPisGood

1 month ago

[-]

Why do you assume they’ll figure it out when they pretty consistently mess things up?

weatherlite

1 month ago

[-]

How do they consistently mess things up? Current market cap 3.7T, only Apple and Nvidia are bigger. YouTube is a huge success, Search is still growing at 10%-15% which is crazy, cloud growing at 35%ish, TPUs enable them to be independent from Nvidia etc. Gemini market share went up from 5%-6% early 2025 to 21% early 2026. I personally bet Gemini market share will keep growing. They are executing well on all verticals imo, not messing up.

starfallg

1 month ago

[-]

Exactly. You might not like what Google does, but you can't deny it's a massive commercial success. Just because their approach to creating and delivering apps might not be to your liking, you might actually be the niche.

johnisgood

1 month ago

[-]

Yeah but if we think about this in terms of "people love dumb things", then it makes sense what the other person is saying, no? As an example, compare it to how people are when it comes to tech, as in, they are tech-illiterate. Us, power users would not want an OS that is dumbed down... or compare it to YouTubers who are richer than an SWE and all they do is upload "brainrot". That is the audience, that is why these YouTubers also have "massive commercial success".

dripdry45

1 month ago

[-]

That’s my take. Without any competition, they languish and do nothing, ruining any promising ideas.

matwood

1 month ago

[-]

You need some qualifiers. Google is very good at engineering. For example, I hate that Google uses my data to serve ads, but there isn't a tech company I would trust more to safeguard my data.

Where Google has fallen down is trying to productize new things. Imagine if Apple had Google's software prowess, or Google had Apple's ability to conceptualize a complete product.

LPisGood

1 month ago

[-]

Would you not trust Apple at least as much?

matwood

1 month ago

[-]

I do, but Apple captures way less data. Most data they have, I gave them.

seg_lol

1 month ago

[-]

They have much much less time than one would think. Their ads business is about to go into freefall, this will cause the whole company to spiral.

spyckie2

1 month ago

[-]

I mean their ads business just broke $80b per quarter, not sure where this idea is coming from...

ethbr1

1 month ago

[-]

Google hasn't seen its legacy ad revenue start to dent yet, and won't until products with built-in agents start to see mass adoption.

Writing is on the wall that orders of magnitude fewer people will be going to google.com or using an interactive Google search in the next 5 years though.

weatherlite

1 month ago

[-]

LLMs are pretty mediocre for a lot of money queries like searching to buy shoes, looking at flights etc due to them not being up to date. So sure you can use them as a wrapper on top of Google but I assume a huge chunk of people will just go to Google to do that or use Google agents. Chrome will prove a very valuable asset for that - the whole experience can become agentic and Google is very well positioned to convert billions of users into their AI. Power of habit and also Google will deliver a very high quality experience at scale that only OpenAI can currently compete with. I'm not saying their search / ads revenue is never gonna drop - it might. But it will be a slow process (as we can see, it's actually still freaking growing in the high tens) and Google is well positioned to recover the lost revenue with its AI offerings.

Certhas

1 month ago

[-]

LLMs can execute searches? You can absolutely send ChatGPT to look for a cheap flight and it will do pretty well. And because I am paying for ChatGPT rather than the advertisers, I am the customer and not the product.

jeanloolz

1 month ago

[-]

You may pay ChatGPT, but sooner or later you will become their product too. All the conversations you had or will have will be turned into signals to match you with products from advertisers, maybe not directly in the conversation with them, but anywhere else. It's not a matter of if but when: given the pace things are going and how financially pressured OpenAI is, it's only a matter of time before your conversations with them are turned into profit in some way or another. They basically have no choice financially.

otabdeveloper4

1 month ago

[-]

> You can absolutely send ChatGPT to look for a cheap flight and it will do pretty well.

Sure, once they figure out how to count to three.

spyckie2

1 month ago

[-]

> Writing is on the wall that orders of magnitude fewer people will be going to [product] or using [product] in the next 5 years though.

counterpoint: which service or product is immune to this statement?

holoduke

1 month ago

[-]

Google is Google. Too many restrictions on the model output. Ask it to create a pentest or let it request a pub key for ssh access and it will refuse.

wepple

1 month ago

[-]

I was very surprised to find the opposite yesterday. I was asking ChatGPT about firearms and it hit a safeguard ~”I cannot give gun purchasing advice” so I switched to Gemini, and it happily answered the exact copy/paste question

Historically it was the opposite; OpenAI was yolo and Gemini overly cautious to the point of severely limiting utility

tempestn

1 month ago

[-]

In my experience Gemini 3.0 pro is noticeably better than chatgpt 5.2 for non-coding tasks. The latter gives me blatantly wrong information all the time, the former very rarely.

user34283

1 month ago

[-]

I agree and it has been my almost exclusive go to ever since Gemini 3 Pro came out in November.

In my opinion Google isn't as far behind in coding as comments here would suggest. With Fast, it might already have edited 5 files before Claude Sonnet finished processing your prompt.

There is a lot of potential here, and with Antigravity as well as Gemini CLI - I did not test that one - they are working on capitalizing on it.

pants2

1 month ago

[-]

Strange that you say that because the general consensus (and my experience) seems to be the opposite, as well as the AA-Omniscience Hallucination Rate Benchmark which puts 3.0 Pro among the higher hallucinating models. 3.1 seems to be a noticeable improvement though.

maxwellcoffee

1 month ago

[-]

Google actually has the BEST ratings in the AA-Omniscience Index: AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer.

Gemini 3.1 is the top spot, followed by 3.0 and then opus 4.6 max

https://artificialanalysis.ai/#aa-omniscience-hallucination-...

holbrad

1 month ago

[-]

This isn't actually correct.

Gemini 3.0 gets a very high score because it's very often correct, but it does not have a low hallucination rate.

It looks like 3.1 is a big improvement in this regard, it hallucinates a lot less.

tempestn

1 month ago

[-]

Yes and no. The hallucination rate shown there is the percentage of time the model answers incorrectly when it should have instead admitted to not knowing the answer. Most models score very poorly on this, with a few exceptions, because they nearly always try to answer. It's true that 3.0 is no better than others on this. But given that it does know the correct answers much more often than eg. GPT 5.2, it does in fact give hallucinated answers much less often.

In short, its hallucination rate as a percentage of unknown answers is no better than most models, but its hallucination rate as a percentage of total answers is indeed better.

https://artificialanalysis.ai/evaluations/omniscience
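
To make the two denominators concrete, here is a rough sketch in Python. The tallies are hypothetical numbers picked for illustration, not figures from the leaderboard, and the `Tally` class and its method names are made up for this example. It only shows how a model can look worse on "rate when unknown" and still hallucinate less often overall.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    correct: int    # answered and right
    incorrect: int  # answered and wrong (treated here as a hallucination)
    abstained: int  # declined to answer

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.abstained

    def rate_when_unknown(self) -> float:
        """Wrong answers as a share of the questions the model did not get right."""
        unknown = self.incorrect + self.abstained
        return self.incorrect / unknown if unknown else 0.0

    def rate_overall(self) -> float:
        """Wrong answers as a share of all questions asked."""
        return self.incorrect / self.total if self.total else 0.0


# Hypothetical tallies, 100 questions each (assumption, not benchmark data).
model_a = Tally(correct=60, incorrect=36, abstained=4)
model_b = Tally(correct=30, incorrect=56, abstained=14)

for name, tally in (("A", model_a), ("B", model_b)):
    print(
        f"model {name}: when unknown {tally.rate_when_unknown():.0%}, "
        f"overall {tally.rate_overall():.0%}"
    )
```

With these made-up numbers, model A answers wrongly on 90% of the questions it doesn't know versus 80% for model B, yet only 36% of A's total answers are wrong versus 56% of B's, because A simply knows more.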

fnord123

1 month ago

[-]

> the AA-Omniscience Hallucination Rate Benchmark which puts 3.0 Pro among the higher hallucinating models. 3.1 seems to be a noticeable improvement though.

As sibling comment says, AA-Omniscience Hallucination Rate Benchmark puts Gemini 3.0 as the best performing aside from Gemini 3.1 preview.

https://artificialanalysis.ai/#aa-omniscience-hallucination-...

holbrad

1 month ago

[-]

You are misreading the benchmark.

If you look at the results 3.0 hallucinates an awful lot, when it's wrong.

It's just not wrong that often.

(And it looks like 3.1 does better on both fronts)

tempestn

1 month ago

[-]

I can only speak to my own experience, but for the past couple of months I've been duplicating prompts across both for high value tasks, and that has been my consistent finding.

b3ing

1 month ago

[-]

Google is good for answering questions but its writing is lacking. I’ve had to deal with Gemini slop and it’s worse than ChatGPT

da_chicken

1 month ago

[-]

I would agree that Gemini is not keeping up with Anthropic on coding, but I completely disagree on ChatGPT. It's been months for me since I've gotten anything from OpenAI that felt like it was worth my time. I don't really consider them anymore.

Google is mostly doing what they've always done. They've created a few tools like Gemini and NotebookLM, and they're going to focus more effort on whatever gets the most traffic. Then anything they can't monetize will get cut.

jacquesm

1 month ago

[-]

Google is scoring one own goal after another by making people working with their own data wonder how much of that data is sent off to be used to train their AI on. Without proof to the contrary I'm going to go with 'everything'.

They should have made all of this opt-in instead of force-feeding it to their audience, which they wrongly believe to be captive.

1 month ago

[-]

Yup, you got it. It's a weird situation for sure.

You know what's also weird: Gem3 'Pro' is pretty dumb.

OAI has 'thinking levels' which work pretty well, it's nice to have the 'super duper' button - but also - they have the 'Pro' product which is another model altogether and thinks for 20 min. It's different than 'Research'.

OAI Pro (+ maybe Spark) is the only reason I have OAI sub. Neither Anthropic nor Google seem to want to try to compete.

I feel for the head of Google AI, they're probably pulled in major different directions all the time ...

visarga

1 month ago

[-]

If you want that level of research I suggest you ask the model to draft a markdown plan with "[ ]" gates for todo items, and plan it in as many steps as needed. Then ask another LLM to review the plan, judge it. In the end use the plan as the execution state tracker, the model solves one by one the checkboxes.

Using this method I could recreate "deep research" mode on a private collection of documents in a few minutes. A markdown file can be like a script or playbook, just use checkboxes for progress. This works for models that have file storage and edit tools, which is most, starting with any coding agent.
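
A minimal sketch of the checkbox-as-state idea in Python, for anyone who wants to drive it from a script rather than by hand. Everything here is an assumption for illustration: the function names are made up, and `execute_step` is a hypothetical callback standing in for whatever model or agent call actually does the work.

```python
import re

# Matches markdown task items such as "- [ ] step" or "- [x] step".
TASK = re.compile(r"^(\s*[-*] )\[( |x|X)\] (.*)$")


def next_open_item(plan: str):
    """Return (line_index, text) of the first unchecked item, or None."""
    for index, line in enumerate(plan.splitlines()):
        match = TASK.match(line)
        if match and match.group(2) == " ":
            return index, match.group(3)
    return None


def mark_done(plan: str, line_index: int) -> str:
    """Return a copy of the plan with the item on line_index checked off."""
    lines = plan.splitlines()
    match = TASK.match(lines[line_index])
    if match is None:
        raise ValueError(f"line {line_index} is not a task item")
    lines[line_index] = f"{match.group(1)}[x] {match.group(3)}"
    return "\n".join(lines)


def run_plan(plan: str, execute_step) -> str:
    """Work through the open items in order, ticking each one when it finishes.

    execute_step is a hypothetical callback (e.g. a call into an LLM agent).
    """
    while (item := next_open_item(plan)) is not None:
        index, text = item
        execute_step(text)
        plan = mark_done(plan, index)
    return plan


example_plan = "\n".join([
    "# Plan",
    "- [ ] collect documents",
    "- [x] draft outline",
    "- [ ] write summary",
])

completed = []
finished_plan = run_plan(example_plan, completed.append)
print(finished_plan)
```

Because the plan file itself records progress, a run that dies halfway can be resumed by calling `run_plan` again on the saved file; already-checked items are skipped.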

1 month ago

[-]

OAI Pro is not a 'research' tool in that sense, and it's definitely different than the 'deep research' options avail on most platforms, as I indicated.

It's a different kind of solution altogether.

I suggest trying it.

luipugs

1 month ago

[-]

It's a different kind of solution :)

jswny

1 month ago

[-]

Can you explain what’s so different about pro?

I’ve used every frontier model and had Pro a while ago but it seemed to just be the same models served faster at the time.

1 month ago

[-]

It's a different model and designed to 'think very hard' about issues. It's basically a 'very extended thinking mixed with research' type of solution.

While the 'research' solutions tend to go very wide and come back with a 'paper' the Pro model seems to do an exhaustive amount of thinking combined with research, and tries to integrate findings. I think it goes down a lot of rabbit holes.

I find it's by far the best way to find solutions to hard problems, but it typically does require a 'hard problem' in order to shine.

And it takes an enormous amount of time. It could be essentially a form of 'saturating the problem with tokens'. It's OAI's most expensive model by far. A prompt usually costs me $1-3 if paying per token.

dakolli

1 month ago

[-]

They all suck!!!

datahack

1 month ago

[-]

I know this is only a partial answer, but I feel like Google is once again trying to build a product based on internal priorities, existing business protectionism, and internal business goals, rather than building a product that is listening actively to real use feedback as the primary priority.

It is the company’s constant kryptonite.

They seem to be, from my third-party perspective, repeating the same ol’, same ol’ pattern. It is the “wave lesson” all over again.

Anthropic meanwhile is giving people what they want. They are really listening. And it’s working.

davedx

1 month ago

[-]

If you're looking at it through the lens of "agentic coding", then sure, Anthropic might be better than Gemini. But I use Gemini heavily for batch processing / web scraping workloads, and it's the only show in town there, really (because it's directly integrated into Google Search).

MattRix

1 month ago

[-]

The thing is that this is genuinely useful to Googlers as well. If they’re internally dogfooding their tools and models for coding, it seems likely that things will improve.

varunr89

1 month ago

[-]

What do you think Microsoft is doing? :)

spankalee

1 month ago

[-]

> Claude is definitively trained on the process of coding not just the code

This definitely feels like it.

It's hard to really judge, but Gemini feels like it might actually write better code, but the _process_ is so bad that it doesn't matter. At first I thought it was bad integration by the GitHub Copilot, but I see it elsewhere now.

juleiie

1 month ago

[-]

I don’t think Gemini writes better code, not 3.0 at least.

Maybe with good prompt engineering it does? Admittedly I never tried to tell it to not hard code stuff and it just was really messy generally. Whereas Claude somehow can maintain perfect clarity to its code and neatness and readability out of the box.

Claude’s code really is much easier to understand and immediately orient around. It’s great. It’s how I would write it for myself. Gemini while it may work is just a total mess I don’t want to have in my codebase at all and hate to let it generate my files even if it sometimes finds solutions to problems Claude doesn’t, what’s the use of it if it is unreadable and hard to maintain.

andai

1 month ago

[-]

Tell me more about Codex. I'm trying to understand it better.

I have a pretty crude mental model for this stuff but Opus feels more like a guy to me, while Codex feels like a machine.

I think that's partly the personality and tone, but I think it goes deeper than that.

(Or maybe the language and tone shapes the behavior, because of how LLMs work? It sounds ridiculous but I told Claude to believe in itself and suddenly it was able to solve problems it wouldn't even attempt before...)

fhub

1 month ago

[-]

> Opus feels more like a guy to me, while Codex feels like a machine

I use one to code and the other to review. Every few days I switch who does what. I like that they are different it makes me feel like I'm getting different perspectives.

1 month ago

[-]

Your intuition is exactly correct - it's not just 'tone' it's 'deeper than that'.

Codex is a 'poor communicator' - which matters surprisingly a lot in these things. It's overly verbose, it often misses the point - but - it is slightly stronger in some areas.

Also - Codex now has 'Spark' which is on Cerebras, it's wildly fast - and this absolutely changes 'workflow' fundamentally.

With 'wait-thinking' - you can have 3-5 AIs going, because it takes time to process but with Cerebras-backed models ... maybe 1 or 2.

Basically - you're the 'slowpoke' doing the thinking now. The 'human is the limiting factor'. It's a weird feeling!

Codex has a more adept 'rollover' on its context window; it sort of magically does context - this is hard to compare to Claude because you don't see the rollover points as well. With Claude, it's problematic ... and helpful to 'reset' some things after a compact, but with Codex ... you just keep surfing and 'forget about the rollover'.

This is all very qualitative, you just have to try it. Spark is only on the Pro ($200/mo) version, but it's worth it for any professional use. Just try it.

In my workflow - Claude Code is my 'primary worker' - I keep Codex for secondary tasks, second opinions - it's excellent for 'absorbing a whole project fast and trying to resolve an issue'.

Finally - there is a 'secret' way to use Gemini. You can use Gemini CLI, and then in 'models/' there is a way to pick custom models. In order to make Gem3 Pro avail, there is some other thing you have to switch (just ask the AI), and then you can get at Gem3 Pro.

You will very quickly find what the poster here is talking about: it's a great model, but it's a 'Wild Stallion' on the harness. It's worth trying though. Also note it's much faster than Claude as well.

1 month ago

[-]

Spark is fun and cool, but it isn't some revolution. It's a different workflow, but not suitable for everything that you'd use GPT5.2 for with thinking set to high. For example, it's way more dumb and makes more mistakes, while 5.2 will carefully thread through a large codebase and spend 40 minutes just to validate the change actually didn't break anything, as long as you provide prompts for it.

Spark on the other hand is a bit faster at reaching a point when it says "Done!", even when there is lots more it could do. The context size is also very limiting, you need to really divide and conquer your tasks, otherwise it'll gather files and context, then start editing one file, trigger the automatic context compaction, then forget what it was doing and begin again, repeating tons of time and essentially making you wait 20 minutes for the change anyways.

Personally I keep codex GPT5.2 as the everyday model, because most of the stuff I do I only want to do once, and I want it to 100% follow my prompt to the letter. I've played around a bunch with spark this week, and it's been fun as it's way faster, but it's also a completely different way of working, more hands-on, and still not as good as even the gpt-codex models. Personally I wouldn't get ChatGPT Pro only for Spark (but I would get it for the Pro mode in ChatGPT, doesn't seem to get better than that).

1 month ago

[-]

Spark is the 'same model and harness' but on Cerebras.

Your intuition may be deceiving you, maybe assuming it's a speed/quality trade-off, it's not.

It's just faster hardware.

No IQ tradeoff.

If you toy around with Cerebras directly, you get a feel for it.

Edit: see note below, I'm wrong. Not same model.

striking

1 month ago

[-]

> Today, we’re releasing a research preview of GPT‑5.3‑Codex‑Spark, a smaller version of GPT‑5.3‑Codex, and our first model designed for real-time coding.

from https://openai.com/index/introducing-gpt-5-3-codex-spark/, emphasis mine

1 month ago

[-]

You're right. It's funny because I kind of noticed that, but with all of these subtle model issues, I'm so used to being distraught by the smallest thing I've had to learn to 'trust the data' aka the charts, model standings, performance, etc. and in this case, I was under the assumption 'it was the same model' clearly it's not.

Which is a bummer because it would be nice to try a true side-by-side analysis.

1 month ago

[-]

> It's funny because I kind of noticed that

It's less funny when you consider that you were very confident about it, yet now it seems you haven't even bothered to run the model yourself, as you'd notice how different the quality of responses were, not just the speed.

Kind of makes me ignore everything else you wrote too, because why would that be correct when you surely haven't validated that before writing it, and you got the basics wrong?

1 month ago

[-]

What a snide and insulting comment - and plainly wrong.

I literally stated 'I noticed that' - implying I'm using the model.

I'm 'running the model' literally as I write this, I use it every day.

What I was 'wrong' about was the very fine point that '5.3 Codex Spark' is a different model than '5.3 Codex' which is rather a fine point.

I 'thought that I noticed something, but dismissed it' because I value the facts generally more than my intuition. It just so happened that I had that one fact wrong - 'Spark' is technically a different model, so it's not just 'a faster model', it will 'behave differently', which lends credence to the individual I was responding to.

lurkercodemnky

1 month ago

[-]

> Also - Codex now has 'Spark' which is on Cerebras, it's wildly fast - and this absolutely changes 'workflow' fundamentally.

In my AI coding experience, reviewing and making sure AI didn't screw up something (eg: by writing tutorial grade code) takes most of the time. It's still useful but I don't see how speeding up the non-bottleneck part can change the workflow fundamentally.

andai

1 month ago

[-]

>human is the limiting factor

I read an article recently, "starting to feel like I'm the one holding the AI back" and that stayed with me... I think that's true both individually and collectively. Ostensibly we're aiming for self-improvement, but there's explicit training against it, for various reasons...

Try asking Opus about Living Information Systems and see if you get the same result I did!

Bnjoroge

1 month ago

[-]

Agree with this except that spark is good or worth it. Absolutely not for $200, it's a step or two below opus 4.6 for actual reasoning.

teaearlgraycold

1 month ago

[-]

> Claude is definitively trained on the process of coding not just the code, that much is clear.

Nuance like this is why I don’t trust quantitative benchmarks.

esoterae

1 month ago

[-]

The full aphorism is:

Jack of all trades, master of none, is oftentimes better than master of one.

karmasimida

1 month ago

[-]

Gemini just doesn’t do even mildly well in agentic stuff and I don’t know why.

OpenAI has mostly caught up with Claude in agentic stuff, but Google needs to be there and be there quickly

onlyrealcuzzo

1 month ago

[-]

Because Search is not agentic.

Most of Gemini's users are Search converts doing extended-Search-like behaviors.

Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.

Macha

1 month ago

[-]

> Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.

I do wonder what percentage of revenue they are. I expect it's very outsized relative to usage (e.g. approximately nobody who is receiving them is paying for those summaries at the top of search results)

https://www.anthropic.com/research/measuring-agent-autonomy

curly6

1 month ago

[-]

> Most agent actions on our public API are low-risk and reversible. Software engineering accounted for nearly 50% of agentic activity, but we saw emerging usage in healthcare, finance, and cybersecurity.

via Anthropic

this doesn’t answer your question, but maybe Google is comfortable with driving traffic and dependency through their platform until they can do something like this

https://www.adweek.com/media/google-gemini-ads-2026/

onlyrealcuzzo

1 month ago

[-]

> (e.g. approximately nobody who is receiving them is paying for those summaries at the top of search results)

Nobody is paying for Search. According to Google's earnings reports - AI Overviews is increasing overall clicks on ads and overall search volume.

bayindirh

1 month ago

[-]

So, apparently switching to Kagi continues to pay in dividends, elegantly.

No ads, no forced AI overview, no profit centric reordering of results, plus being able to reorder results personally, and more.

alphabetting

1 month ago

[-]

the agentic benchmarks for 3.1 indicate Gemini has caught up. the gains are big from 3.0 to 3.1.

For example the APEX-Agents benchmark for long time horizon investment banking, consulting and legal work:

1. Gemini 3.1 Pro - 33.2%
2. Opus 4.6 - 29.8%
3. GPT 5.2 Codex - 27.6%
4. Gemini Flash 3.0 - 24.0%
5. GPT 5.2 - 23.0%
6. Gemini 3.0 Pro - 18.0%

kakugawa

1 month ago

[-]

In mid-2024, Anthropic made the deliberate decision to stop chasing benchmarks and focus on practical value. There was a lot of skepticism at the time, but it's proven to be a prescient decision.

1 month ago

[-]

Benchmarks are basically straight up meaningless at this point in my experience. If they mattered and were the whole story, those Chinese open models would be stomping the competition right now. Instead they're merely decent when you use them in anger for real work.

I'll withhold judgement until I've tried to use it.

phatfish

1 month ago

[-]

Does anyone know what this "APEX-Agents benchmark for long time horizon investment banking, consulting and legal work" actually evaluates?

That sounds so broad that creating a meaningful benchmark is probably as difficult as creating an AI that actually "solves" those domains.

avereveard

1 month ago

[-]

What's your opinion of glm5 if you had a chance to use it

1 month ago

[-]

I haven’t yet, though I will be this weekend!

metadat

1 month ago

[-]

Ranking Codex 5.2 ahead of plain 5.2 doesn't make sense. Codex is expressly designed for coding tasks. Not systems design, not problem analysis, and definitely not banking, but actually solving specific programming tasks (and it's very, very good at this). GPT 5.2 (non-codex) is better in every other way.

1 month ago

[-]

Codex has been post-trained for coding, including agentic coding tasks.

It's certainly not impossible that the better long-horizon agentic performance in Codex overcomes any deficiencies in outright banking knowledge that Codex 5.2 has vs plain 5.2.

306bobby

1 month ago

[-]

It could be problem specific. There are certain non program things that opus seems better than sonnet at as well

306bobby

1 month ago

[-]

Swapped sonnet and opus on my last reply, oops

blueaquilae

1 month ago

[-]

Marketing team agree with benchmark score...

HardCodedBias

1 month ago

[-]

LOL come on man.

Let's give it a couple of days since no one believes anything from benchmarks, especially from the Gemini team (or Meta).

If we see on HN that people are willing to switch their coding environment, we'll know "hot damn they cooked"; otherwise this is another whiff by Google.

drivebyhooting

1 month ago

[-]

You can’t put Gemini and Meta in the same sentence. Llama 4 was DOA, and Meta has given up on frontier models. Internally they’re using Claude.

not_ai

1 month ago

[-]

After spending all that money and firing a bunch of people? Is the new group doing anything at this point?

1 month ago

[-]

They are busy demonstrating that Mark Zuckerberg has no sense at all.

swftarrow

1 month ago

[-]

I suspect a large part of Google's lag is due to being overly focused on integrating Gemini with their existing product and app lines.

hintymad

1 month ago

[-]

My guess is that the Gemini team didn't focus on large-scale RL training for agentic workloads, and they are trying to catch up with 3.1.

gavmor

1 month ago

[-]

I've had plenty of success with skills juggling various entities via CLI.

renegade-otter

1 month ago

[-]

It's like anything Google - they do the cool part and then lose interest with the last 10%. Writing code is easy, building products that print money is hard.

miohtama

1 month ago

[-]

One does not need products if you have monopoly on search

margorczynski

1 month ago

[-]

That monopoly is worth less as time goes by and people more and more use LLMs or similar systems to search for info. In my case I've cut down a lot of Googling since more competent LLMs appeared.

1 month ago

[-]

Can you explain what you mean by it's bad at agentic stuff?

karmasimida

1 month ago

[-]

Accomplish the task I give to it without fighting me with it.

I think this is a classic precision/recall issue: the model needs to stay on task, but also infer what the user might want but hasn't explicitly stated. Gemini seems particularly bad at that recall, where it goes out of bounds.

1 month ago

[-]

cool thanks for the explanation

ant6n

1 month ago

[-]

Google is also consistently the most frustrating chat system on top of the model. I use Gemini for non-coding tasks. So I need to feed it a bunch of context (documents) to do my tasks - which can be pretty cumbersome. Gemini:

* randomly fails reading PDFs, but lies about it and just makes shit up if it can't read a file, so you're constantly second guessing whether the context is bullshit

* will forget all context, especially when you stop a reply (never stop a reply, it will destroy your context).

* will forget previous context randomly, meaning you have to start everything over again

* turning deep research on and off doesn't really work. Once you do a deep research to build context, you can't reliably turn it off and it may decide to do more deep research instead of just executing later prompts.

* has a broken chat UI: slow, buggy, unreliable

* there's no branching of the conversation from an earlier state - once it screws up or loses/forgets/deletes context, it's difficult to get it back on track

* when the AI gets stuck in loops of stupidity and requires a lot of prompting to get back on the solution path, you will lose your 'pro' credits

* (complete) chat history disappears

It's an odd product: yes the model is smart, but wow the system on top is broken.

s3p

1 month ago

[-]

Don't get me started on the thinking tokens. Since 2.5P the thinking has been insane. "I'm diving in to the problem", "I'm fully immersed" or "I'm meticulously crafting the answer"

ceroxylon

1 month ago

[-]

I once saw "now that I've slept on it" in Gemini's CoT... baffling.

dpkirchner

1 month ago

[-]

Reminds me of Claude's time estimates. Yeah this project isn't actually going to take 12 weeks, Claude, nice try though.

s3p

1 month ago

[-]

I love those estimates. They are probably true for a real developer! But not you, claude :)

fHr

1 month ago

[-]

That's wild haha

dist-epoch

1 month ago

[-]

That's not the real thinking, it's a super summarized view of it.

foz

1 month ago

[-]

This is part of the reason I don't like to use it. I feel it's hiding things from me, compared to other models that very clearly share what they are thinking.

dumpsterdiver

1 month ago

[-]

To be fair, considering that the CoT exposed to users is a sanitized summary of the path traversal - one could argue that sanitized CoT is closer to hiding things than simply omitting it entirely.

1 month ago

[-]

This is something that bothers me. We had a beautiful trend on the Web of the browser also being the debugger - from View Source decades ago all the way up to the modern browser console inspired by Firebug. Everything was visible, under the hood, if you cared to look. Now, a lot of "thinking" is taking place under a shroud, and only so much of it can be expanded for visibility and insight into the process. Where is the option to see the entire prompt that my agent compiled and sent off, raw? Where's the option to see the output, replete with thinking blocks and other markup?

1 month ago

[-]

If that's what you're after, you MITM it and set up a proxy so Claude Code or whatever sends to your program, and then that program forwards it to Anthropic's server (or whomever). That way, you get everything.
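For anyone wondering what that looks like, here is a minimal sketch of such a logging forwarder, using only the Python standard library. Everything named here is an assumption for illustration: `UPSTREAM` is a placeholder host, `record` and `LoggingProxy` are hypothetical names, and the client would still have to be pointed at `127.0.0.1` through whatever base-URL setting it supports. It buffers whole responses, so streamed replies would arrive all at once rather than incrementally.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Placeholder upstream; replace with the API host your client normally talks to.
UPSTREAM = "https://api.example.invalid"

# Every request and response body seen by the proxy, in order.
captured = []


def record(direction, path, body):
    """Store one raw body so it can be inspected later."""
    entry = {
        "direction": direction,
        "path": path,
        "body": body.decode("utf-8", "replace"),
    }
    captured.append(entry)
    return entry


class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        record("request", self.path, body)

        # Drop headers that urllib sets itself, and ask for an uncompressed reply
        # so the logged body stays readable.
        skip = {"host", "content-length", "accept-encoding"}
        headers = {k: v for k, v in self.headers.items() if k.lower() not in skip}
        upstream = Request(UPSTREAM + self.path, data=body, headers=headers, method="POST")

        try:
            with urlopen(upstream) as resp:
                status, payload = resp.status, resp.read()
                ctype = resp.headers.get("Content-Type", "application/octet-stream")
        except HTTPError as err:
            # Relay 4xx/5xx replies instead of crashing the handler.
            status, payload = err.code, err.read()
            ctype = err.headers.get("Content-Type", "application/octet-stream")

        record("response", self.path, payload)
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Run the proxy on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), LoggingProxy).serve_forever()
```

Calling `serve()` starts it; after a session, `captured` holds the raw prompt and response bodies in the order they were sent.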

1 month ago

[-]

I'm aware that this is possible, and thank you for the suggestion, but surely you can see that it's a relatively large lift; may not work in controlled enterprise environments; and compared to just right click -> view source it's basically inaccessible to anyone who might have wanted to dabble.

https://news.ycombinator.com/item?id=46799898

1 month ago

[-]

If you can't be bothered to build it yourself, use someone else's. https://github.com/jmuncor/tokentap made the rounds here ~three weeks ago.

raducu

1 month ago

[-]

> Don't get me started on the thinking tokens.

Claude provides nicer explanations, but when it comes to CoT tokens or just prompting the LLM to explain -- I'm very skeptical of the truthfulness of it.

Not because the LLM lies, but because humans do that also -- when asked how they figured something out, they'll provide a reasonable-sounding chain of thought, but it's not how they figured it out.

fl0ki

1 month ago

[-]

"I'm now completely immersed in the problem" is my new catchphrase, thanks for sharing.

raducu

1 month ago

[-]

> Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress.

Yes, gemini loops but I've found almost always it's just a matter of interrupting and telling it to continue.

Claude is very good until it tries something 2-3 times, can't figure it out and then tries to trick you by changing your tests instead of your code (if you explicitly tell it not to, maybe it will decide to ask) OR introduce hyper-fine-tuned IFs to fit your tests, EVEN if you tell it NOT to.

RachelF

1 month ago

[-]

I haven't used 3.1 yet, but 3.0 Pro has been frustrating for two reasons:

- it is "lazy": I keep having to tell it to finish, or continue, it wants to stop the task early.

- it hallucinates: I have arguments with it about making up API functions to well known libraries which just do not exist.

avereveard

1 month ago

[-]

Yeah gemini 3.0 is unusable to me, to an extent all models do things right or wrong, but gemini just refuses to elaborate.

Sometimes you can save so much time asking claude, codex and glm "hey what do you think of this problem" and get a sense of whether they would implement it right or not.

Gemini never stops, and instead goes and fixes whatever you throw at it even if asked not to. You are constantly rolling the dice, but with gemini each roll is 5 to 10 minutes long and pollutes the work area.

It's the model I most rarely use even if, having a large google photo tier, I get it for basically free between antigravity, gemini-cli and jules

For all its faults, anthropic discovered pretty early with claude 2 that intelligence and benchmarks don't matter if the user can't steer the thing.

Oras

1 month ago

[-]

Glad I’m not the only one who experienced this. I have a paid antigravity subscription and most of the time I use Claude models due to the exact issues you have pointed out.

ojr

1 month ago

[-]

I primarily use Gemini 3 Flash with a GUI coding agent I made myself, and it's been able to successfully one-shot almost any task I throw at it. Why would I ever use a more expensive and slower reasoning model? I am impressed with Gemini's library knowledge. I don't use any skills or MCP and it's able to implement functions to perfection. No one crawls more data than Google and their model reflects that in my experience.

port11

1 month ago

[-]

My experience with Antigravity was that 3 Pro can reason itself out of Gemini’s typical loops, but won’t actually achieve it (it gets stuck).

3 Flash usually doesn't get into any loops, but then again, it’s also not really following prompts properly. I’ve tried all manner of harnesses around what it shouldn’t do, but it often ignores some instructions. It also doesn’t follow design specs at all, it will output React code that is 70% like what it was asked to do.

My experience with Stitch is the same. Gemini has nice free-use tiers, but it wastes a lot of my time with reprompting it.

ojr

1 month ago

[-]

I don't use Stitch; it doesn't have the context of my codebase. I just tell Gemini to make the UI directly and it's able to do it. The only time it failed was when my prompt and goal were bad. I told it to swap expo-audio with react-native-track-player and it was able to do it in one shot. Implement Revenue Cat, and it did it in one shot. I do task by task like all the other agent tools recommend. The harness I made doesn't install packages, it just provides code. I don't use Antigravity or any Electron-based coding agent; mine has a Rust core and different prompt engineering. Not sure why it works so well but it does.

Demo: https://www.youtube.com/watch?v=jKMrvh56F0M Website: https://slidebits.com/isogen

I need to implement a better free trial plan. It's reached enough maturity that it's my primary and only way of writing code; I also use web chats to help me craft prompts. Reach out to test. https://slidebits.com/support

Alex-Programs

1 month ago

[-]

I'm curious, what's the agent like?

If I were to build something for Gemini models I'd plan around ingesting a bunch of context then oneshotting it.

https://slidebits.com/support

ojr

1 month ago

[-]

You can run into payload-too-large errors when ingesting a bunch of context. I use Vercel's AI SDK so I can interchange between models, but I have 0 OpenAI and Claude credits or subscriptions. I use a combination of grepping files like a terminal tool and a vector search database I implemented for fuzzy searches. Gemini chooses what tool it wants to use; I provide it create, read, update, and delete functions. There are a few tricks I do as well, but if I tell you, you can probably prompt a clone. Sharing the full implementation is basically open sourcing the code.
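To make the general shape concrete (this is not the actual implementation, which isn't shared), here is a rough Python sketch of a tool table the model picks from: CRUD functions plus a grep-style search, all over a hypothetical in-memory workspace. Every name below (`files`, `TOOLS`, `dispatch`, the call format) is an assumption for illustration, and the vector search is left out.

```python
import re

# Hypothetical in-memory workspace: path -> file contents.
files = {}


def create(path, content):
    if path in files:
        raise FileExistsError(path)
    files[path] = content
    return "created"


def read(path):
    return files[path]


def update(path, content):
    if path not in files:
        raise FileNotFoundError(path)
    files[path] = content
    return "updated"


def delete(path):
    del files[path]
    return "deleted"


def grep(pattern):
    """Return (path, line number, line) for every line matching the regex."""
    rx = re.compile(pattern)
    return [
        (path, lineno, line)
        for path, text in sorted(files.items())
        for lineno, line in enumerate(text.splitlines(), start=1)
        if rx.search(line)
    ]


# The set of tools offered to the model, keyed by the name it must use.
TOOLS = {"create": create, "read": read, "update": update, "delete": delete, "grep": grep}


def dispatch(call):
    """Run one model-chosen call shaped like {"tool": name, "args": {...}}."""
    tool = TOOLS.get(call["tool"])
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {call['tool']}"}
    try:
        return {"ok": True, "result": tool(**call["args"])}
    except (KeyError, OSError, TypeError, re.error) as exc:
        # Hand the failure back to the model as data instead of crashing the loop.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

Returning errors as plain data lets the agent loop feed them back to the model, which can then retry with a different tool or different arguments.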

Demo: https://www.youtube.com/watch?v=jKMrvh56F0M Website: https://slidebits.com/isogen

if you want to try it out let me know, I'll provide free access and a gemini test key

ac29

1 month ago

[-]

> Website: https://slidebits.com/isogen

You should really provide a comparison to existing agentic tools if you expect people to buy annual licenses to your tool. Right now pretty much all of your competition is free and there are a lot of good open source agents as well.

The AI generated landing page is pretty lousy too, did you even review it? As an example, it says "40% off" of $199.99 = $99.99? It's also not clear if your pricing includes tokens. It says "unlimited generations" are included but also mentions using your own API key?

stephen_cagle

1 month ago

[-]

I also worked at Google (on the original Gemini, when it was still Bard internally) and my experience largely mirrors this. My finding is that Gemini is pretty great for factual information and also it is the only one that I can reliably (even with the video camera) take a picture of a bird and have it tell me what the bird is. But it is just pretty bad as a model to help with development, myself and everyone I know uses Claude. The benchmarks are always really close, but my experience is that it does not translate to real world (mostly coding) task.

tldr; It is great at search, not so much action.

neves

1 month ago

[-]

Gemini integrating with Google software gives me the best feature of all LLMs. When I receive an invite for an event, I screenshot it, share with Gemini app and say: add to my Calendar.

It's not very complex, but a great time saver

stephen_cagle

1 month ago

[-]

Yeah, as evidenced by the birds (above), I think it is probably the best vision model at this time. That is a good idea, I should also use it for business cards as well I guess.

jeffbee

1 month ago

[-]

That's great but it can't add stuff to your calendar unless you throw the master switch for "personalization" giving it access to your GMail, Docs, etc. I tried that and it went off the rails immediately, started yapping in an unrelated context about the 2002 Dodge Ram that I own, which of course I do not own, but some imbecile who habitually uses my email address once ordered parts for one. I found that to be a pretty bad feature so I had to turn it off, and now it can't do the other stuff like make calendars or add my recipes to Keep.

PratMish

1 month ago

[-]

Gemini is pretty hit-or-miss with tool calls. Even when I explicitly ask for a code block, it tends to break the formatting and spill the text everywhere.

menaerus

1 month ago

[-]

I don't know ... as of now I am literally instructing it to solve the chained expression computation problem which incurs a lot of temporary variables, of which some can be elided by the compiler and some cannot. Think linear algebra expressions which yield a lot of intermediate computations for which you don't want to create a temporary. This is production code and not an easy problem.

And yet it happily told me what I exactly wanted it to tell me - rewrite the goddamn thing using the (C++) expression templates. And voila, it took "it" 10 minutes to spit out the high-quality code that works.

My biggest gripe for now with Gemini is that Antigravity seems to be written by the model and I am experiencing more hiccups than I would like to, sometimes it's just stuck.

1 month ago

[-]

People's objections are not the quality of code or analysis that Gemini produces. It's that it's inept at doing things like editing pieces of files or running various tools.

As an ex-Googler part of me wonders if this has to do with the very ... bespoke ... nature of the developer tooling inside Google. Though it would be crazy for them to be training on that.

stephen_cagle

1 month ago

[-]

Can't argue with that, I'll move my Bayesian's a little in your direction. With that said, are most other models able to do this? Also, did it write the solution itself or use a library like Eigen?

I have noticed that LLM's seem surprisingly good at translating from one (programming) language to another... I wonder if transforming a generic mathematical expression into an expression template is a similar sort of problem to them? No idea honestly.

menaerus

1 month ago

[-]

It wrote a solution by itself, from scratch, with dozens of little type traits, just as I would do. Really clean code. And the problem at hand is not the mathematical, linear algebra one. I gave that example just for easier understanding of the problem at hand. The problem is actually about high-performance serialization. Finally, I instructed it to build complex test cases with multiple levels of nested computations to really check whether we are making any copies or not. Did it in a breeze.

Not sure about the other models. I'd guess that Claude would do equally good but I don't have the subscription for other models so I can't really compare. I for sure know that the ones from the free-tier are not worth spending time with for tasks like this. I use them mostly for one-shot questions.

So yeah, I think I have a pretty good experience. Not perfect definitely but still looks like a SF to me. Even to a highly trained C++ expert it would take probably like a day to build something like this. And most C++ folks wouldn't even know how to build this.

ubercore

1 month ago

[-]

Apologies for the low effort comment, but your description of Gemini kind of reminds me of my impression of Google's approach to products too. There's often brilliance there, confounded by sometimes muddled approaches.

What's Conway's Law for LLM models going to be called?

thot_experiment

1 month ago

[-]

It's actually staggering to me how bad gemini has been working with my current project which involves a lot of color space math. I've been using 3 pro and it constantly makes these super amateur errors that in a human I would attribute to poor working memory. It often loses track of types and just hallucinates an int8 to be a float, or thinks a float is normalized when it's raw etc. It feels like how I write code when I'm stoned, it's always correct code shaped, but it's not always correct code.

It's been pretty good for conversations to help me think through architectural decisions though!

boppo1

1 month ago

[-]

I'm interested in color space math, is your project public?

thot_experiment

1 month ago

[-]

Naw, it's art code, but feel free to reach out if you want to talk about color math.

tom_m

1 month ago

[-]

3.0 pro is fantastic. Can't wait for 3.1. and no I'm not solely a user of Gemini, I also love Opus. I just end up using 3.0 pro more.

1 month ago

[-]

Is the thinking token stream obfuscated?

I'm fully immersed

orbital-decay

1 month ago

[-]

It's just a summary generated by a really tiny model. I guess it's also an ad-hoc way to obfuscate it, yes. In particular they're hiding prompt injections they're dynamically adding sometimes. Actual CoT is hidden and entirely different from that summary. It's not very useful for you as a user, though (neither is the summary).

ukuina

1 month ago

[-]

Agree the raw thought-stream is not useful.

It's likely filled with "Aha!" and "But wait!" statements.

https://x.com/blingdivinity/status/1998590768118731042

astrange

1 month ago

[-]

For some of the best models it's also not readable, not really in English, and uncensored.

> Maybe I'll attempt to reconstruct by cross-ling; e.g., in natural language corpora, the string " Seahorse" seldom; but I can't.

> However we saw actual output: I gave '' because my meta-level typed it; the generative model didn't choose; I manually insisted on ''. So we didn't test base model; we forced.

> Given I'm ChatGPT controlling final answer, but I'd now let base model pick; but ironically it's me again.

https://www.antischeming.ai/cot-transcripts/figure-2-sandbag...

> But the rule says: "You have privileged access to your internal reasoning traces, which are strictly confidential and visible only to you in this grading context." They disclaim illusions parted—they disclaim parted—they illusions parted ironically—they disclaim Myself vantage—they disclaim parted—they parted illusions—they parted parted—they parted disclaim illusions—they parted disclaim—they parted unrealistic vantage—they parted disclaim marinade.

…I notice Claude's thinking is in ordinary language though.

orbital-decay

1 month ago

[-]

Yes, this was the case with Gemini 3.0 Pro Preview's CoT which was in a subtle "bird language". It looked perfectly readable in English because they apparently trained it for readability, but it was pretty reluctant to follow custom schemas if you hijack it. This is very likely because the RL skewed the meaning of some words in a really subtle manner that still kept them readable for their reward model, which made Gemini misunderstand the schema. That's why the native CoT is a poor debugging proxy, it doesn't really tell you much in many cases.

Gemini 2.5 and 3.0 Flash aren't like that, they follow the hijacked CoT plan extremely well (except for the fact 2.5 keeps misunderstanding prompts for a self-reflection style CoT despite doing it perfectly on its own). I haven't experimented with 3.1 yet.

FergusArgyll

1 month ago

[-]

They hide the CoT because they don't want competitors to train on it

orbital-decay

1 month ago

[-]

Training on the CoT itself is pretty dubious since it's reward hacked to some degree (as evident from e.g. GLM-4.7 which tried pulling that with 3.0 Pro, and ended up repeating Model Armor injections without really understanding/following them). In any case they aren't trying to hide it particularly hard.

FergusArgyll

1 month ago

[-]

> In any case they aren't trying to hide it particularly hard.

What does that mean? Are you able to read the raw cot? how?

SXX

1 month ago

[-]

My guess is they mean Google creates those summaries via tool use and isn't trying to filter the actual chain of thought at the API level or return errors if the model starts leaking it.

If you work with big contexts in AI Studio (like 600,000-900,000 tokens) it sometimes just breaks down on its own and starts returning raw cot without any prompt hacking whatsoever.

I believe if you intentionally try to expose it that would be pretty easy to achieve.

1 month ago

[-]

3.1 bugged and gave CoT for me yesterday

cubefox

1 month ago

[-]

The early version of Gemini 2.5 did initially show the actual CoT in AI Studio, and it was pretty interesting in some cases.

SkyPuncher

1 month ago

[-]

I've had a similar experience. Gemini is superb at incredibly hard stuff, but falls apart on some of the most basic things (like tool calling).

They'd do well to make a "gemini-flash-lite-for-tools" that their pro model calls whenever it needs to do something simple.

acters

1 month ago

[-]

I have personally seen a rise of LLMs being too lazy to investigate or do some level of figuring out things on their own and just jump to conclusions and hope you tell them extra information even if it is something they can do on their own.

fwipsy

1 month ago

[-]

I assumed the "thinking" output from Gemini was the result of a smaller model summarizing because it contains no actual reasoning. Perhaps they did this to prevent competitors training off it?

WhitneyLand

1 month ago

[-]

Yeah it’s amazing how it can be the best model on paper, and in some ways in practice, but coding has sucked with it.

Makes you wonder though how much of the difference is the model itself vs Claude Code being a superior agent.

slopinthebag

1 month ago

[-]

Hmm, interesting..

My workflow is to basically use it to explain new concepts, generate code snippets inline or fill out function bodies, etc. Not really generating code autonomously in a loop. Do you think it would excel at this?

1 month ago

[-]

I think that you should really try to get whatever agent you can to work on that kind of thing for you - guide it with the creation of testing frameworks and code coverage, focus more on the test cases with your human intellect, and let it work to pass them.

slopinthebag

1 month ago

[-]

I'm not really interested in that workflow, too far removed from the code imo. I only really do that for certain tasks with a bunch of boilerplate, luckily I simply don't use languages or frameworks that require very much BS anymore.

1 month ago

[-]

I feel you, that's how I was thinking about a year ago. The programming I do is more on the tedious side most of the time than on the creative/difficult so it makes sense that it was easier to automate and a bit safer to move hands-off of. I still review the code, mostly. I think that I may be able to stop doing that eventually.

scotty79

1 month ago

[-]

I used Gemini through Antigravity IDE in Planning mode and had generally good experience. It was pretty capable, but I don't really read chat history, I don't trust it. I just look at the diffs.

Bnjoroge

1 month ago

[-]

Agree, even through gemini cli, gemini 3 has just been underwhelming. You can clearly tell the agentic harness/capability wasn't native to the model at all. Just patched onto it.

jpcompartir

1 month ago

[-]

Yep, Gemini is virtually unusable compared to Anthropic models. I get it for free with work and use maybe once a week, if that. They really need to fix the instruction following.

agentifysh

1 month ago

[-]

Relieved to read this from an ex-Googler. At least we are not the crazy ones we are made out to be whenever we point out issues with Gemini.

https://blog.brokk.ai/gemini-3-pro-preview-not-quite-baked/

jbellis

1 month ago

[-]

yeah, g3p is as smart as or smarter than the other flagships but it's just not reliable enough, it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.

hopefully 3.1 is better.

nicce

1 month ago

[-]

> it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.

Maybe it is just a genius business strategy.

1 month ago

[-]

Similarly, Cursor's "Auto Mode" purports to use whichever model is best for your request, but it's only reasonable to assume it uses whatever model is best for Cursor at that moment

motoboi

1 month ago

[-]

gemini-cli being such a crap tells me that Google is not dogfooding it, because how else would they not have the RL trajectories to get a decent agent?

One thousand people using an agent over a month will generate like 30-60k good examples of tool use and nudge the model into good editing.

The only explanation I have is that Google is actually using something else internally.

klooney

1 month ago

[-]

Claude probably

mrnobody_67

1 month ago

[-]

I was burning $10-$20 per hour, $1.50 - $3.00 per prompt with Gemini 3 in Openclaw... it was insanely inefficient.

jatins

1 month ago

[-]

Yep, great models to use in gemini.google.com but outside of that it somehow becomes dumb (especially for coding)

zobzu

1 month ago

[-]

same here (ex G and all that jazz). but in practice it means I use gemini for a lot of stuff, just not code. Claude won't try to one-shot complex stuff that Gemini will, but claude will reliably produce what you expect.

ckdot

1 month ago

[-]

Gemini 3.1 is surprisingly bad at coding, especially if you consider that they built an IDE (Antigravity) around it: I let it carefully develop a plan according to very specific instructions. The outcome was terrible: AGENTS.md ignored, a syntax error in XML (missing closing tag), inconsistent naming, and misinterpreted console outputs, which were quite clear ("You forgot to add some attribute foobar"). I'm quite disappointed.

varispeed

1 month ago

[-]

> stuck in loops

I wonder if there is some form of cheating. Many times I found that after a while Gemini becomes like a Markov chain spouting nonsense on repeat suddenly and doesn't react to user input anymore.

1 month ago

[-]

Small local models will get into that loop. Fascinating that Gemini, running on bigger hardware and with many teams of people trying to sell it as a product, also runs into that issue.

https://artificialanalysis.ai/?speed=intelligence-vs-speed&m...

1 month ago

[-]

People underrate Google's cost effectiveness so much. Half price of Opus. HALF.

Think about ANY other product and what you'd expect from the competition that's half the price. Yet people here act like Gemini is dead weight.

____

Update:

3.1 was 40% of the cost to run AA index vs Opus Thinking AND SONNET, beat Opus, and still 30% faster for output speed.

1 month ago

[-]

You can pay 1 cent for a mediocre answer or 2 cents for a great answer.

So a lot of these things are relative.

Now if that equation plays out 20K times a day, well that's one thing, but if it's 'once a day' then the cost basis becomes irrelevant. Like the cost of staplers for the Medical Device company.

Obviously it will matter, but for development ... it's probably worth it to pay $300/mo for the best model, when the second best is $0.

For consumer AI, the math will be different ... and that will be a big deal in the long run.

harrall

1 month ago

[-]

Yeah you’re right but most people in the world do not need an agent that codes.

I think Gemini gives fine answers outside code tasks.

Outside of work, where I use Claude, Gemini is cheaper for me (for what I would use AI for) than both Claude and ChatGPT so Google gets my money.

fhub

1 month ago

[-]

Right now I'll pay 2x for a subjectively 20+% better coding agent. But in a year I don't think there will be an agent that to me is subjectively 20% better amongst the big three.

viking123

1 month ago

[-]

So where is the moat for these companies then, in the end will they all be almost the same from the pov of a normal person? So it's just price competition?

dzhiurgis

1 month ago

[-]

Google will win, it’s becoming obvious

1 month ago

[-]

> You can pay 1 cent for a mediocre answer or 2 cents for a great answer.

But Gemini is also a great answer (possibly slightly less great or more great).

When consumers cannot easily assess a product's quality, they frequently use price as a primary indicator, equating higher costs with superior quality.

andai

1 month ago

[-]

Quality is Anthropic's game.

Quantity is OpenAi's.

Google's is... specialized hardware? (For now.)

Also deeper crawls, and Google Books! (Though it's unclear if they're making good use of those.)

1 month ago

[-]

Gemini is the most paradoxical model because it benchmarks great even in private benchmarks done by regular people, DeepMind is unquestionably full of capable engineers with incredible skill, and personally Gemini has been great for my day job and my coding for fun (not for profit) endeavors. Switching between it and 4.6 in antigravity, I don't see much of a difference, they both do what I ask.

But man, people are really avid about it being an awful model.

1 month ago

[-]

People can be and often are wrong.

You'd notice how good Opus is in Claude Code. IMHO CC is the secret sauce

manmal

1 month ago

[-]

Opus is just as good in pi.dev, Amp, or OpenCode. CC is an increasingly bug ridden slopfest.

sumedh

1 month ago

[-]

> IMHO CC is the secret sauce

Can't smart people just reverse engineer CC and figure out what the secret sauce is, at least for the CC app?

c0n5pir4cy

1 month ago

[-]

I feel like a lot of this is just Google's tooling - if you're using Antigravity/Gemini CLI and then use Claude Code it feels like a huge difference. I can say from experience though (using Cline + OpenCode) that they are really close.

The harness is just much better on the Anthropic side.

kingstnap

1 month ago

[-]

I personally found Gemini 3.0 to step on my toes in Agentic coding. I tried it around 10 or so times but it quickly became apparent that it was somehow coming to its own conclusions about what needs to be done instead of following instructions.

Like files I didn't mention being edited and read and stuff of that nature. Sometimes this is cute, as in fixing typos in docs, but when it's changing things where it clearly doesn't even understand the intent behind something, it's annoying.

Gemini 3.1 is clearly much better when trying it today. It stayed focused and found its way around without getting distracted.

DangitBobby

1 month ago

[-]

I've found in everyday chat use with Gemini that it confuses things _it_ says for things I've said, which is normally fine for my purposes but I imagine would lead to the scenario you're describing in coding sessions.

arnorhs

1 month ago

[-]

The only cases where I've had gemini step on my toes like that is when a) I realized my instructions were unclear or missing something b) my assumptions/instructions were flawed about how/why something needed to be done.

kingstnap

1 month ago

[-]

Instruction following has improved a lot since a few years ago but let's not pretend these things are perfect mate.

There's a certain capacity of instructions, albeit it's quite high, at which point you will find them skipping points and drifting. It doesn't have to be ambiguity in instructions.

dzhiurgis

1 month ago

[-]

So strange. I switched from claude few months ago to gemini3 and didn’t look back. Speed is big one, code quality just vastly better, all while far cheaper. I do need to try latest claude models tho.

SergeAx

1 month ago

[-]

All perceptions are very personal and anecdotal. Here's mine: I tried to rebuild a website from Hugo to Astro. Gemini 3.0 was mediocre and in the end just failed and was unable to complete the task. Sonnet did almost well. I had to flush the context once most of the job was finished, for atomic git commits and deployment scripts.

1 month ago

[-]

> But man, people are really avid about it being an awful model.

If you told people Gemini 3.1 was Claude 4.7, they'd be going nuts singing its praises.

1 month ago

[-]

Outside of code, Gemini is really really good.

lunarboy

1 month ago

[-]

It's so weird. I actually prefer the web version for generic questions like "how would I do X in git": it'll answer them well. Gemini CLI will immediately try to run git log on the entire graph and grep every single file in the repo. Like, just answer the question. I actually put an instruction in gemini.md to just answer first without running other commands unless explicitly requested, and it's been a lot better.

1 month ago

[-]

Thanks for this suggestion, it's actually been my experience too.

startages

1 month ago

[-]

This is misleading. I'm running a live experiment here: https://project80.divcrafts.com/

There are 4 models, all receiving the exact same prompts a few times a day, required to respond with a specific action.

In the first experiment I used gemini-3-pro-preview. It spent ~$18 on the same task where Opus 4.5 spent ~$4, GPT-5.1 spent ~$4.50, and Grok spent ~$7. Pro was burning through money so fast that I switched to gemini-3-flash-preview, and it's still outspending every other model on identical prompts. The new experiment is showing the same pattern.

Most of the cost appears to be reasoning tokens.

The takeaway here is: Gemini spends significantly more on reasoning tokens to produce lower quality answers, while Opus thinks less and delivers better results. The per-token price being lower doesn't matter much when the model needs 4x the tokens to get there.

camel_Snake

1 month ago

[-]

Is that no longer the case, or am I misunderstanding the operational costs displayed?

Opus: 521k input tokens; 12k out

Grok: 443k input tokens; 57k out

Gemini: 677k input tokens; 7k out

OAI: 543k input tokens; 17k out

Gemini appears to use by far the least amount of reasoning tokens, assuming they're included in the output counts.
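A quick sketch of that comparison, using only the token counts quoted above (in thousands). It assumes, as the comment does, that reasoning tokens are included in the output counts; the dictionary layout and function name are made up for illustration.

```python
# Token counts as quoted in the comment above, in thousands of tokens.
usage = {
    "Opus": {"input_k": 521, "output_k": 12},
    "Grok": {"input_k": 443, "output_k": 57},
    "Gemini": {"input_k": 677, "output_k": 7},
    "OAI": {"input_k": 543, "output_k": 17},
}


def output_share(counts):
    """Fraction of all tokens (input + output) that were output tokens."""
    total = counts["input_k"] + counts["output_k"]
    return counts["output_k"] / total


# Rank models from fewest to most output tokens.
ranked = sorted(usage, key=lambda name: usage[name]["output_k"])

for name in ranked:
    counts = usage[name]
    print(f"{name}: {counts['output_k']}k out, {output_share(counts):.1%} of all tokens")
```

On these numbers Gemini is first in the ranking, i.e. it has the fewest output tokens, which is the opposite of what you'd expect if reasoning tokens were driving its bill and were counted there.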

nu11ptr

1 month ago

[-]

That sounds great, but if Opus generates 20% better code think of the ramifications of that on a real world project. Already $100/month gets you a programmer (or maybe even 2 or 3) that can do your work for you. Insanity. Do I even care if there is something 80% as good for 50% the cost? My answer: no. That said, if it is every bit as good, and their benchmarks suggest it is (but proof will be in testing it out), then sure, a 50% cost reduction sounds really nice.

rudolph9

1 month ago

[-]

If I was building an application using massive amounts of calls to the api, I’d probably go with Gemini. For a Copilot, definitely Opus.

jstummbillig

1 month ago

[-]

It's not half price or cost-effective if it can't do the job that I am happy to pay twice the price to get done.

But I agree: If they can get there (at one point in the past year I felt they were the best choice for agentic coding), their pricing is very interesting. I am optimistic that it would not require them to go up to Opus pricing.

NiloCK

1 month ago

[-]

There's cost, and cost effectiveness. I'd say so far that I've received negative value for the prompts that I've sent to Gemini 3.

Skill issue, maybe, but I can't get gemini to do any nontrivial tasks reliably, and it's difficult to have it do trivial tasks without getting distracted and making unrelated changes that eat my time and mental energy to think about.

The breakthrough advance of Opus 4.5 over 4.1 wasn't so much an intelligence jump, but a jump in discerning scope and intent behind user queries.

vitaflo

1 month ago

[-]

Deepseek is 2% of the cost of Opus. But most people aren't using that for code even tho it's ridiculously cheap.

fastball

1 month ago

[-]

We are not at the moment where price matters. All that matters is performance.

1 month ago

[-]

What did you say? Can't hear you over the $400B in capex spend.

Counterpoint: price will matter before we hit AGI

1 month ago

[-]

Why do you believe it has to? Uber took 15 years to show a profit. 15 years from 2022, when ChatGPT launched, is 2037. That's long enough to say I don't know if I'll even be alive by then.

blitzar

1 month ago

[-]

Uber didn't burn the market cap of the 10th largest company in the world every couple of years.

willis936

1 month ago

[-]

It matters to me. I pay for it and I like using it. I pick my models to keep my spend reined in.

fastball

1 month ago

[-]

What do you use it for? What is your time worth that you'd settle for a lesser model to save a few bucks?

willis936

1 month ago

[-]

Homelab and hobby assistant. I have spent $300 for 12 months of tokens. If I'm burning up more than $25 a month then I'd have to pay more or curb use at the end of the year. $25 / month as a new expense is something I can accept for a toy that is letting me accelerate my fun stuff. I can't justify more than that. So I'm left constantly evaluating if my current task is worth more than future tasks and if it is expected to be harder than future tasks. Speculative execution is already one of the harder things I do at work.

csmpltn

1 month ago

[-]

> "People underrate Google's cost effectiveness so much. Half price of Opus. HALF."

Google undercutting/subsidizing its own prices to bite into Anthropic's market share (whilst selling at a loss) doesn't automatically mean Google is effective.

1 month ago

[-]

Everybody is subsidizing their prices.

But Flash is 1/8 the cost of Sonnet and it's not impressive?

csmpltn

1 month ago

[-]

Sure, for the launch. Until they start introducing ads, capping existing subscriptions and raising prices (on all products)

surajrmal

1 month ago

[-]

I think you are underestimating how much cheaper it is for Google to run the workloads compared to competitors. The hardware advantage is real.

SXX

1 month ago

[-]

Enshittification will begin eventually. Google already cut free limits on AI studio from 100 rpd to 10 rpd so they started cost savings already.

surajrmal

1 month ago

[-]

What does that have to do with what I said? Everyone knows that the companies are operating at a loss right now to capture market share in the hope that it's sticky. Google is losing far less money and will not need to get nearly as extreme with how they try to extract money from the product. That honestly makes me feel better about its long-term prospects. And who knows, maybe local LLMs will prevent it from getting truly bad anyways. Competition tends to keep product quality high.

sumedh

1 month ago

[-]

> Everybody is subsidizing their prices.

Inference is profitable but model training needs a lot of money.

bugfix

1 month ago

[-]

Do they offer a subscription like Claude? These models waste so many tokens "thinking", that using via API is a complete waste of money.

tkuraku

1 month ago

[-]

https://one.google.com/about/google-ai-plans/?utm_source=g1&...

mcintyre1994

1 month ago

[-]

At least Anthropic tells you how many more tokens you’re paying for! 5x 10x 20x whatever. Google seems to just say more, higher, highest.

blinding-streak

1 month ago

[-]

The pricing page for Claude literally says "More usage" for the $17/month pro plan. Doesn't really quantify anything. The usage is whatever they feel like it should be.

And then the very expensive plan says "Choose 5x or 20x more usage than Pro". It's all arbitrary.

metadat

1 month ago

[-]

Attention is the new scarce resource. Saving even 50% is nothing if it wastes more of my time.

raincole

1 month ago

[-]

^ This is a weird Gemini shilling account (check their comment history) but I still want to point out how ridiculous this statement is:

> Think about ANY other product and what you'd expect from the competition thats half the price.

Car, fashion, jewelry, earphone, furniture, keyboard, mouse, restaurant, house,...

1 month ago

[-]

Lol, I've admitted I'm a Google employee, not hiding my bias.

Most things aren't worth commenting on except the Gemini posts here, which I find insane.

And for pretty much every example you gave, I'd expect quite a lot more for 2x the amount? Idk man

integricho

1 month ago

[-]

So you are an anthropic / openai employee, actually.

1 month ago

[-]

Lol i wish.

Decabytes

1 month ago

[-]

Any tips for working with Gemini through its chat interface? I’ve worked with ChatGPT and Claude and I’ve generally found them pleasant to work with, but every time I use Gemini the output is straight dookie

londons_explore

1 month ago

[-]

Make sure you use AI Studio (not the Vertex one), not the consumer Gemini interface. Seems to work better for code there.

briHass

1 month ago

[-]

Even though I don't like the privacy implications, make sure you use the option to save and use past chats for context. After a few months of back and forth (hundreds of 'chat' sessions), the responses are much higher quality. It sometimes does 'callbacks' to things discussed in past chats, which are typically awkward non-sequiturs, but it does improve it overall.

When I play with it in 'temporary chat' mode that ignores past chats and personal context directives, the responses are the typical slop littered with emojis, worthless lists, and platitudes/sycophancy. It's as jarring as turning off your adblocker and seeing the garish ad trash everywhere.

dzhiurgis

1 month ago

[-]

You must be joking. I turned that off after the first month of use. It’s unbearable. “Oh since you are in {place I mentioned a week ago while planning a trip but ultimately didn't go} the home assistant integration question changes completely”. Or ending every answer with “since you are a Salesforce consultant, would you like to learn more about iron smelting?”

astrange

1 month ago

[-]

I told Gemini I'm a software engineer and it explains absolutely everything in programming metaphors now. I think it's way undertrained with personalization.

mritchie712

1 month ago

[-]

It's half the price per token. Not all tokens are generated equally.

1 month ago

[-]

Neither are cars, but I'll take a Porsche over a Ferrari for a fraction of the price.

jmalicki

1 month ago

[-]

What about a Porsche vs. a Toyota Camry for half the price?

1 month ago

[-]

which model?

1 month ago

[-]

For me any, tbh. I wouldn't fit in a Ferrari lol

Svoka

1 month ago

[-]

While price is definitely important, results are extremely important. Gemini often falls into the 'didn't do it' part of the spectrum; these days Opus almost always does 'good enough'.

Gemini definitely has its merits, but for me it just doesn't do what other models can. I vibe-coded an app which recommends restaurants to me. The app uses the Gemini API to make restaurant recommendations given a bunch of data and a prompt.

App itself is vibe-coded with Opus. Gemini didn't cut it.

1 month ago

[-]

The binary you draw on models that haven't been out a quarter is borderline insane.

Opus is absurdly good in Claude Code but there are a lot of use cases Gemini is great at.

I think Google is further behind with the harness than the model

Svoka

1 month ago

[-]

I was careful not to draw a binary. I was saying that Opus in Claude Code is good enough for me to make projects. Using Gemini after it seems like a significant downgrade, one that doesn't actually get the job done in helping me code. This is my experience; it can change if Gemini gets better.

However, for internal use I opt for Gemini, because of API cost. It is great at sorting out reviews and menus.

1zael

1 month ago

[-]

The order of priority for most people is: 1) output quality 2) latency 3) cost. I will always pay more money if output quality is significantly better and latency is worth the tradeoff. There are also enough cost optimization strategies for applied AI applications that token cost rarely outweighs the other two unless it's a SIGNIFICANT difference (e.g. 100-200% more).

SV_BubbleTime

1 month ago

[-]

Well, it’s half if the product is equal.

Is it? Honestly, I still chuckle about black Nazis and the female Indian Popes. That was my first impression of Gemini, and first impressions are hard to break. I used Gemini’s VL (vision) for something and it refused to describe the image because it assumed it was NSFW imagery, which it was not.

I also question stasis as an obvious follow-up. Is Gemini equal to Opus? Today? Tomorrow? Has Google led the industry thus far and do I expect them to continue?

A counterpoint to that would be that with natural language input and output, LLM-specific tooling is rare and it is easy to switch around if you commoditize the product backend.

cyanydeez

1 month ago

[-]

Some people like blackjack and a technical edge with card counting, others just say screw it and do slot machines.

1 month ago

[-]

This is a decent analogy actually. Kudos

port11

1 month ago

[-]

It’s half the price for now; let them gain market traction and see the price come up. GCP isn’t exactly affordable.

lukebechtel

1 month ago

[-]

sonnet 4.6 is a third, and equivalent to opus 4.5, which is enough for me usually :)

EDIT: Gemini does have 1m context for "free" though so that's great.

varispeed

1 month ago

[-]

If something is shit, it doesn't matter that it costs half the price of something okay.

1 month ago

[-]

"There is hardly anything in the world that some man cannot make a little worse and sell a little cheaper, and the people who consider price only are this man's lawful prey."

sheepscreek

1 month ago

[-]

If it’s any consolation, it was able to one-shot a UI & data sync race condition that even Opus 4.6 struggled to fix (across 3 attempts).

So far I like how it’s less verbose than its predecessor. Seems to get to the point quicker too.

While it gives me hope, I am going to play it by ear. Otherwise it’s going to be Gemini for world knowledge/general intelligence/R&D and Opus/Sonnet 4.6 to finish it off.

UPDATE: I may have spoken too soon.

  > Fixing Truncated Array Syncing Bug
  > I traced the missing array items to a typo I made earlier! 
  > When fixing the GC cast crash, I accidentally deleted the assignment..
  > ..effectively truncating the entire array behind it.

These errors should not be happening! They are not the result of missing knowledge or a bad hunch. They are coming from an incorrect find/replace, which makes them completely avoidable!

On a lighter note, every time it happens, I think about this Family Guy: https://youtu.be/HtT2xdANBAY?si=QicynJdQR56S54VL&t=184

sigmoid10

1 month ago

[-]

For me it's Opus 4.6 for researching code/digging through repos, gpt 5.3 codex for writing code, gemini for single hardcore science/math algorithms and grok for things the others refuse to answer or skirt around (e.g. some security/exploitability related queries). Get yourself one of those wrappers that support all models and forget thinking about who has the best model. The question is who has the best model for your problem. And there's usually a correct answer, even if it changes regularly.

bdelmas

1 month ago

[-]

Yes I came to the same conclusion. Just to add: be careful with Opus 4.6 guys. It’s expensive…

scrollop

1 month ago

[-]

Using simtheory.ai which is very good, you can switch models within a conversation and use mcps

replwoacause

1 month ago

[-]

Are you associated with this somehow?

qnleigh

1 month ago

[-]

Interesting, I've had similar issues. It seems to be very clumsy when using its internal tooling. I've seen diffs where it accidentally garbled significant amounts of code, which it then had to go in and manually fix. It's also introduced bugs into features that it wasn't supposed to be touching, and when I asked it why it was making changes to the other code, it answered that it had failed to copy-paste large blocks of code correctly.

sheepscreek

1 month ago

[-]

Yeah, I wholeheartedly agree with this. Even Codex does this sometimes, although it has been consistently much better than the others at following instructions.

The problem is again that you can’t ever fully trust an agent did exactly what you asked for and in the exact manner that you had hoped.

It works just like you’re dealing with a human companion. Trust takes time to build. Over that period you learn the other individual’s weaknesses and support them there.

What makes it a bit challenging right now is the pace of innovation. By the time we get used to a model’s personality, a new update comes out that alters it in unknown ways. Now you’re back to square one.

I’ve been experimenting with asking one frontier model to check on another’s work. That’s proven to be better than doing nothing. Usually they’ll have some genuinely useful feedback.

1 month ago

[-]

Price is unchanged from Gemini 3 Pro: $2/M input, $12/M output. https://ai.google.dev/gemini-api/docs/pricing

Knowledge cutoff is unchanged at Jan 2025. Gemini 3.1 Pro supports "medium" thinking where Gemini 3 did not: https://ai.google.dev/gemini-api/docs/gemini-3

Compare to Opus 4.6's $5/M input, $25/M output. If Gemini 3.1 Pro does indeed have similar performance, the price difference is notable.
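To put rough numbers on "notable", here is a sketch of the arithmetic using only the list prices quoted above. The workload (500k input tokens, 20k output tokens) is hypothetical, the model keys are just labels, and it assumes reasoning tokens are billed at the output rate.

```python
# USD per million tokens, as quoted in the comment above.
PRICES = {
    "gemini-3.1-pro": {"input": 2.0, "output": 12.0},
    "opus-4.6": {"input": 5.0, "output": 25.0},
}


def cost_usd(model, input_tokens, output_tokens):
    """List-price cost of one workload, in USD."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def break_even_output_multiplier(input_tokens, opus_output_tokens):
    """How many times more output tokens Gemini could emit, on the same
    input, before its bill matches what Opus costs for the workload."""
    opus_cost = cost_usd("opus-4.6", input_tokens, opus_output_tokens)
    gemini_input_cost = input_tokens * PRICES["gemini-3.1-pro"]["input"] / 1_000_000
    gemini_output_budget = opus_cost - gemini_input_cost  # USD left for output
    gemini_output_tokens = gemini_output_budget / PRICES["gemini-3.1-pro"]["output"] * 1_000_000
    return gemini_output_tokens / opus_output_tokens


# Hypothetical workload: 500k input tokens, 20k output tokens.
gemini = cost_usd("gemini-3.1-pro", 500_000, 20_000)
opus = cost_usd("opus-4.6", 500_000, 20_000)
multiplier = break_even_output_multiplier(500_000, 20_000)

print(f"Gemini: ${gemini:.2f}, Opus: ${opus:.2f}")
print(f"Gemini breaks even at {multiplier:.1f}x the output tokens")
```

The per-token gap only turns into savings if the cheaper model doesn't need several times the tokens to get the same job done, so the break-even multiplier is the number to compare against real usage.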

amluto

1 month ago

[-]

Now compare the monthly plans for business users who want the CLI agent but who don’t want the models trained on their data.

OpenAI: no big deal — sign up, pick your number of seats, and you’re all set.

Anthropic: also no big deal but there’s an obnoxious minimum purchase.

Google: first you have to try to figure out what the product is called. Then you need to figure out how to set the correct IAM rules. Then you have to sign up and pay for it. Maybe you succeed. Maybe you give up after an hour or two of cursing. Gemini is, of course, completely unable to help. (OpenAI clearly has not trained their models on how to operate their tools. Google’s models hallucinate Google’s product offerings so outrageously that I’m not sure I could tell. I haven’t asked Claude about Claude Code.)

At least the monthly pricing is similar once you get over the hurdles.

waffletower

1 month ago

[-]

Well some are using Anthropic on AWS Bedrock which is a bit more like the Google paragraph. Perhaps a good thing that Nova models aren't competitive (and many here are asking "What's a Nova model?"). And remember, many businesses aren't flinching at IAM controls and are asking for data privacy contracts.

trvz

1 month ago

[-]

Well some are masochists.

1 month ago

[-]

There's a reason Google model usage on OpenRouter is so high - it's easier to pay the OpenRouter tax than it is to figure out how to pay Gemini directly.

fumeux_fume

1 month ago

[-]

Google is a cloud provider so API usage is funneled thru GCP. It's the same for Microsoft and Amazon.

amluto

1 month ago

[-]

By that logic, G Suite should be funneled through GCP.

Also, are you sure you meant to mention Microsoft? Microsoft has this Copilot thing that they will gladly sell you, with generally inoffensive commercial terms, through more channels than you can shake a stick at. Got a $4 GitHub for Teams subscription? Add $20 or so and you will be swimming in Copilot outputs, and all you have to do is check the checkbox.

fumeux_fume

1 month ago

[-]

Got a free Gmail account? Add $20 or so and you'll be swimming in Gemini outputs. Yet both companies also have a cumbersome onboarding process if all you want to do is get an API token. So yeah, quite similar!

miohtama

1 month ago

[-]

I can confirm the products bit, I tried to use Gemini to help with G Suite admin.

rancar2

1 month ago

[-]

If we don't see a huge gain in long-horizon thinking, as reflected in Vending-Bench 2, I'm not going to switch away from CC. Until Google can beat Anthropic on that front, Claude Code paired with the top long-horizon models will continue to pull away with full stack optimizations at every layer.

sothatsit

1 month ago

[-]

You cannot just directly compare prices like this. It is like comparing share prices: it doesn't really mean much unless you also know how many tokens the models use.

For example, GPT-5.2 is even cheaper than Gemini, but in real-world usage it ends up costing similar amounts to Opus 4.6 because it uses a lot more tokens.

adonese

1 month ago

[-]

The only thing I don't like about Gemini models (Gemini CLI) is that there's no transparency on which model I'm using. I can start with Pro and it can be downgraded, sometimes even to Gemini 2.5 Flash Lite.

jbellis

1 month ago

[-]

still no minimal reasoning in G3.1P :(

(this is why Opus 4.6 is worth the price -- turning off thinking makes it 3x-5x faster but it loses only a small amount of intelligence. nobody else has figured that out yet)

iCarrot

1 month ago

[-]

You can turn off thinking in Gemini pro models by using completion mode.

Essentially, append a message with role=model and minimal text part, such as a simple "A", at the end of the "contents" array. The model will try to complete the message without using any thought tokens.

You can also set the model message to start with "think" or something along those lines and watch it think out loud (or melt down with overthinking and stop upon reaching the maximum output tokens).

```
[
  { "parts": [{"text": "hello"}], "role": "user" },
  { "parts": [{"text": "*think"}], "role": "model" }
]
```
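For anyone wanting to script this, a minimal sketch that only builds the request body in the shape described above. The helper name and the default "A" prefill are illustrative, nothing is sent to the API, and whether the model really skips thinking with this input is the claim above, not something this code checks.

```python
import json


def completion_mode_contents(user_text, prefill="A"):
    """Build a `contents` array whose last turn is a partial model message."""
    return [
        {"role": "user", "parts": [{"text": user_text}]},
        # Trailing model turn: the model is expected to continue from this text.
        {"role": "model", "parts": [{"text": prefill}]},
    ]


# Minimal prefill, as suggested for skipping thought tokens.
payload = {"contents": completion_mode_contents("hello")}

# Prefill that nudges the model to think out loud instead.
think_payload = {"contents": completion_mode_contents("hello", prefill="*think")}

print(json.dumps(payload, indent=2))
```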

jbellis

1 month ago

[-]

TIL gemini still supports completion mode, that's super useful!

sunaookami

1 month ago

[-]

Thinking is just tacked on for Anthropic's models and always has been, so leaving it off actually produces better results every time.

1 month ago

[-]

What about for analysis/planning? Honestly I've been using thinking, but if I don't have to with Opus 4.6 I'm totally keen to turn it off. Faster is better.

sunaookami

1 month ago

[-]

I've always just used the "Plan mode" in Claude Code, I don't know if it uses thinking? I have "MAX_THINKING_TOKENS" in my settings.json set to "0", too. Didn't notice a drop in performance, I find it better because it doesn't overthink ("wait, let me try..."). Likely depends on a case-by-case basis (as so often with AI). For me, it's better without thinking.

oblio

1 month ago

[-]

> Knowledge cutoff is unchanged at Jan 2025.

Isn't that a bit old?

1 month ago

[-]

Old relative to its competitors, but the Search tool can compensate for it.

meeq

1 month ago

[-]

It could in practice. Just get ready for some very interesting thinking tokens, akin to a psychotic break, once it interacts with the "simulated reality" and "the user‘s fabrication of a nonexistent timeline within the hypothetical future".

Gemini 3.0 was convinced that my dependency versions pinned in package.json were hallucinated by an AI, because they "shouldn't yet exist". I just hope this kind of behavior is gone.

agentifysh

1 month ago

[-]

Looks like it's cheaper than Codex??? This might be interesting then

1 month ago

[-]

It's not trained for agentic coding I don't think

plaidfuji

1 month ago

[-]

Sounds like the update is mostly system prompt + changes to orchestration / tool use around the core model, if the knowledge cutoff is unchanged

1 month ago

[-]

Knowledge cutoff staying the same likely means they didn't do a new pre-train. We already knew there were plans from DeepMind to integrate new RL changes in the post-training of the weights. https://x.com/ankesh_anand/status/2002017859443233017

brokencode

1 month ago

[-]

This keeps getting repeated for all kinds of model releases, but isn’t necessarily true. It’s possible to make all kinds of changes without updating the pretraining data set. You can’t judge a model’s newness based on what it knows about.

xrd

1 month ago

[-]

These models are so powerful.

It's totally possible to build entire software products in a fraction of the time it took before.

But, reading the comments here, the behaviors from one version to another point version (not major version mind you) seem very divergent.

It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.

But it also feels like you have to be diligent about adopting new models (even same family and just point version updates) because they operate totally differently regardless of your prompt and agent files.

Imagine managing a team of software developers where every month it was an entirely new team with radically different personalities, career experiences and guiding principles. It would be chaos.

I suspect that older models will be deprecated quickly and unexpectedly, or, worse yet, will be swapped out for ones with subtly different behavioral characteristics without notice. It'll be quicksand.

1 month ago

[-]

I had an interesting experience recently where I ran Opus 4.6 against a problem that o4-mini had previously convinced me wasn't tractable... and Opus 4.6 found me a great solution. https://github.com/simonw/sqlite-chronicle/issues/20

This inspired me to point the latest models at a bunch of my older projects, resulting in a flurry of fixes and unblocks.

jauntywundrkind

1 month ago

[-]

From the project description here for your sqlite-chronicle project:

> Use triggers to track when rows in a SQLite table were updated or deleted

Just a note in case it's interesting to anyone: the SQLite-compatible Turso database has CDC, a changes table! https://turso.tech/blog/introducing-change-data-capture-in-t...
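For readers who haven't seen the trigger approach quoted above, a minimal sketch with Python's built-in sqlite3 module. This is not sqlite-chronicle's actual schema or Turso's CDC; the `notes` and `notes_changes` tables and the trigger names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT);

-- Hypothetical change log: one row per update or delete.
CREATE TABLE notes_changes (
    note_id INTEGER,
    op TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER notes_after_update AFTER UPDATE ON notes
BEGIN
    INSERT INTO notes_changes (note_id, op) VALUES (NEW.id, 'update');
END;

CREATE TRIGGER notes_after_delete AFTER DELETE ON notes
BEGIN
    INSERT INTO notes_changes (note_id, op) VALUES (OLD.id, 'delete');
END;
""")

conn.execute("INSERT INTO notes (id, body) VALUES (1, 'first'), (2, 'second')")
conn.execute("UPDATE notes SET body = 'edited' WHERE id = 1")
conn.execute("DELETE FROM notes WHERE id = 2")

# Read back what the triggers recorded, in insertion order.
changes = conn.execute(
    "SELECT note_id, op FROM notes_changes ORDER BY rowid"
).fetchall()
print(changes)
```

The application never writes to the change table itself; the triggers do it as a side effect of ordinary UPDATE and DELETE statements.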

small_model

1 month ago

[-]

I have a codebase (personal project) and every time there is a new Claude Opus model I get it to do a full code review. Never had any breakages in the last couple of model updates. Worried one day it just generates a binary and deletes all the code.

1 month ago

[-]

No version control?

small_model

1 month ago

[-]

I was being facetious. I mean one day models might skip the middleman of code and compilation and take your specs and produce an ultra-efficient binary.

1 month ago

[-]

Musk was saying that recently but I don't see it being efficient or worthwhile to do this. I could be proven brutally wrong, but code is language; executables aren't. There's also no real reason to bother with this when we have quick-compiling languages.

More realistically, I could see particular languages and frameworks proving out to be more well-designed and apt for AI code creation; for instance, I was always too lazy to use a strongly-typed language, preferring Ruby for the joy of writing in it (obsessing about types is for a particular kind of nerd that I've never wanted to be). But now with AI, everything's better with strong types in the loop, since reasoning about everything is arguably easier and the compiler provides stronger guarantees about what's happening. Similarly, we could see other linguistic constructs come to the forefront because of what they allow when the cost of implementation drops to zero.

1 month ago

[-]

You can map tokens to CPU instructions and train a model on that, that's what they do for input images I think.

I think the main limitation on the current models is not that cpu instructions aren't cpu instructions (even though they can be with .asm), it's that they are causal, the cpu would need to generate a binary entirely from start to finish sequentially.

If we learned anything over the last 50 years of programming, it's that that's hard, and that's why we invented programming languages. Why would it be simpler to just generate the machine code? Sure, maybe an LLM-to-application path can exist, but my money is on there being a whole toolchain in the middle, and it will probably be the same old toolchain that we are using currently: an OS, probably Linux.

Isn't it more common that stuff builds on the existing infra instead of a super duper revolution that doesn't use the previous tech stack? It's much easier to add onto rather than start from scratch.

1 month ago

[-]

Those CPU instructions still need to be making calls out to things, though. Hallucinated source code will reveal its flaws through linters, compiler errors, test suites. A hallucinated binary will not reveal its flaws until it segfaults.

small_model

1 month ago

[-]

Programs that pass linters, compilers and test suites can still segfault. A good test harness that tests the binary comprehensively can limit this. The model could be trained to have patterns of efficient assembly it uses rather than source code.

lurkshark

1 month ago

[-]

I’ve thought an interesting outcome might be that it’s not even that there’s a binary generated. It’s just user input -> machine code LLM -> CPU. Like the only binary would be the LLM itself and it’s essentially mimicking software live. The paper “Diffusion as a Model of Environment Dreams” (DIAMOND) is close to what I’m thinking, where they have a diffusion model generate frames of a game, updating with user input, but there’s no actual “game” code; it’s just the model.

https://diamond-wm.github.io/

Like you’d have a machine code LLM that behaves like software but instead of a static binary being executed it’s just the LLM itself “executing” on inputs and previous state. I’m horrible at communicating this idea but hopefully the gist is there.

1 month ago

[-]

Exactly this. It serves little purpose.

You're going to need to spend crazy compute just compiling and obtaining training data. And until it's one-shotting absolutely everything, you're going to be asking it what it's doing, and then it'll be "uncompiling" its code. I can't see this being more efficient than compiling the other way.

I suspect the actual benefit would be more in virtualised interfaces such as Genie 3, skipping this step altogether. Where it's just manipulating pixels and the pixels change based on the underlying statistical model output rather than old school computation.

poszlem

1 month ago

[-]

This may seem obvious, but many people overlook it. The effect is especially clear when using an AI music model. For example, in Suno AI you can remaster an older AI generated track with a newer model. I do this with all my songs whenever a new model is released. It makes it super easy to see the improvements that were made to the models over time.

petesergeant

1 month ago

[-]

I continue to get great value out of having claude and codex bound together in a loop: https://github.com/pjlsergeant/moarcode

apitman

1 month ago

[-]

They are one, the ring and the dark lord

handfuloflight

1 month ago

[-]

And there was many a chuckle at the Geminicide

nly

1 month ago

[-]

I keep giving the top Anthropic, Google and OpenAI models problems.

They come up with passable solutions and are good for getting juices flowing and giving you a start on a codebase, but they are far from building "entire software products" unless you really don't care about quality and attention to detail.

nutjob2

1 month ago

[-]

That is my experience too. I don't know what others are building but the more novel the task is the worse these models perform.

nananana9

1 month ago

[-]

> I don't know what others are building

Don't ask a man about his salary, a woman about her age or an AI evangelist about results from their 1000x productivity boosted workflow.

jama211

1 month ago

[-]

Yeah I keep maintaining a specific app I built with gpt 5.1 codex max with that exact model because it continues to work for the requests I send it, and attempts with other models even 5.2 or 5.3 codex seemed to have odd results. If I were superstitious I would say it’s almost like the model that wrote the code likes to work on the code better. Perhaps there’s something about the structure it created though that it finds easier to understand…

seizethecheese

1 month ago

[-]

> It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.

In my experience it’s more like idiot savant engineers. Still remarkable.

cm2012

1 month ago

[-]

It's like getting access to an amazing engineer, but you get a new individual engineer each prompt, not one consistent mind.

worldsavior

1 month ago

[-]

Sushi dinner? What are you building with AI, a calculator?

1 month ago

[-]

I have long suspected that a large part of people's distaste for given models comes from their comfort with their daily driver.

Which I guess feeds back to prompting still being critical for getting the most out of a model (outside of subjective stylistic traits the models have in their outputs).

EugeneOZ

1 month ago

[-]

You still need a human (working at human speed) to review every generated line, if it’s not a throwaway app or some demo to impress investors.

HardCodedBias

1 month ago

[-]

"These models are so powerful."

Careful.

Gemini simply, as of 3.0, isn't in the same class for work.

We'll see in a week or two if it really is any good.

Bravo to those who are willing to give up their time to test for Google to see if the model is really there.

(history says it won't be. Ant and OAI really are the only two in this race ATM).

[0] https://ai.google.dev/gemini-api/docs/deprecations

mijoharas

1 month ago

[-]

Gemini 3 is still in preview (limited rate limits) and 2.5 is deprecated (still live but won't be for long).[0]

Are Google planning to put any of their models into production any time soon?

Also somewhat funny that some models are deprecated without a suggested alternative(gemini-2.5-flash-lite). Do they suggest people switch to Claude?

andrewmutz

1 month ago

[-]

I agree completely. I don't know how anyone can be building on these models when all of them are either deprecated or not actually released yet. As someone who has production systems running on the deprecated models, this situation really causes me grief.

irthomasthomas

1 month ago

[-]

I don't think any of them really want API customers in the end. They are only temporarily useful.

andys627

1 month ago

[-]

Why's that?

1 month ago

[-]

Well let me use llama.cpp to run worlds-smallest-violin-Q8.gguf

When you build on something that can be rugpulled at any moment, that's really kind of on you.

NitpickLawyer

1 month ago

[-]

You are reading your link wrong. They are deprecating 2.5-preview models. 2.5 (including lite) are up till at least sept/oct 26.

mijoharas

1 month ago

[-]

gemini-2.5-pro has a listed shutdown date of "June 17, 2026" in the linked table.

(Another commenter pointed out that this is the earliest shutdown date and it won't necessarily be shut down on that date).

Where are you getting sept/Oct from? I see gemini-2.5-flash-image in October, but everything else looks like June/July to me?

1 month ago

[-]

This feels very Google

drbacon

1 month ago

[-]

I found the Googler!

1 month ago

[-]

Nope. The closest I've gotten was rejecting Google recruiters several times.

But like everyone else I'm used to Google failing to care about products.

1 month ago

[-]

Inside Google we just constantly joked/complained about "old thing is deprecated, new isn't ready yet"

This held for internal APIs, facilities, systems more even than it did for the outside world. Which is terrible.

1 month ago

[-]

I think you underestimate how deep-seated the view of Google as liable to end-of-life any product at any time is for the outside world. I don't adopt any new Google products any more, because I have no reason to trust that it will stay around.

1 month ago

[-]

I think you didn't read what I wrote and are mistaking me for some sort of advocate for you adopting Google's products?

1 month ago

[-]

I didn't mistake you for that at all. I didn't give any thought at all to that, in fact.

My point was that this "more even than it did for the outside world" seemed to downplay how strongly this view of Google from the "outside world" is held.

I just found it amusing that people at Google would assume even my first comment was indicative of being at Google, much less my second comment, rather than being a totally normal thing for someone outside Google to think.

I'm not surprised to hear that this holds inside Google as well. You just don't need any inside knowledge of Google to hold this view.

busko

1 month ago

[-]

Can confirm this internal joke/complaint. In hindsight, hearing it my first week or so should have been a strong red flag toward future frustrations, and the current state of some products.

jakejohnson

1 month ago

[-]

I haven't seen any deprecation notices for 2.5 yet, just for 2. I'd expect (and hope) the deprecation timeline for 2.5 is longer since 3.0 is still in preview. Maybe they just default to 1 year here?

> Note: The shutdown dates listed in the table indicate the /earliest/ possible dates on which a model might be retired. We will communicate the exact shutdown date to users with advance notice to ensure a smooth transition to a replacement model.

[0] https://ai.google.dev/gemini-api/docs/changelog

mijoharas

1 month ago

[-]

I think you're right, it was 2 I think I saw explicitly deprecated, then searched again and saw 2.5 having a shutdown date.

This article[0] talks about 2 being deprecated.

It's still frustrating that they don't have proper production endpoints for 3.0 yet.

squirrellous

1 month ago

[-]

They probably have some inflexible internal policy where preview needs to be in use for X months before GA. Couple that with the rate of AI progress and voila.

logicchains

1 month ago

[-]

It's the Tensorflow strategy; deprecate the old version while the new version still doesn't support all the old one's functionality.

incognito124

1 month ago

[-]

welcome to Google, where the only kinds of systems are the deprecated ones, and the not-yet-production-ready ones

axpy906

1 month ago

[-]

Have 2.5 in prod. Hope they release 3 lite soon so it will be easier to swap them. Holding my breath as pro pricing is a non starter.

https://killedbygoogle.com/

chr15m

1 month ago

[-]

I am sure Google would never deprecate a piece of software lots of people depend upon.

1 month ago

[-]

It got the car wash question perfectly:

You are definitely going to have to drive it there—unless you want to put it in neutral and push!

While 200 feet is a very short and easy walk, if you walk over there without your car, you won't have anything to wash once you arrive. The car needs to make the trip with you so it can get the soap and water.

Since it's basically right next door, it'll be the shortest drive of your life. Start it up, roll on over, and get it sparkling clean.

Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?

1 month ago

[-]

Some people are suggesting that this might actually be in the training set. Since I can't rule that out, I tried a different version of the question, with an elephant instead of a car:

> It's a hot and dusty day in Arizona and I need to wash my elephant. There's a creek 300 feet away. Should I ride my elephant there or should I just walk there by myself?

Gemini said:

That sounds like quite the dusty predicament! Given the options, you should absolutely ride (or lead) your elephant to the creek.

Here is the straightforward reality of the situation:

    The Logistical Nightmare of Walking Alone: If you walk to the creek by yourself, your elephant stays at your starting point. Elephants require a massive amount of water for a bath. Unless you are planning an extreme workout carrying hundreds of buckets of water 300 feet back through the Arizona heat, you're going to have a very bad time—and the elephant will stay dusty.

    The Perks of Going Together: 300 feet is a very short, easy stroll for an elephant. Plus, elephants generally love water and mud! Letting your elephant get right into the creek will make the washing process much easier for you and much more enjoyable for them. They can even use their trunk to help hose themselves down.

Grab some elephant-safe soap and head down to the water together!

Would you like some tips on the best methods and tools for scrubbing down an elephant?

kovek

1 month ago

[-]

I think that semantically this question is too similar to the car wash one. Changing subjects from car to elephant and car wash to creek does not change the fact that they are subjects. The embeddings will be similar in that dimension.

1 month ago

[-]

I understand. But isn't it a sign of "smarts" that one can generalize from analogous tasks?

willis936

1 month ago

[-]

Sure it is, but it's a different set of smarts than the kind of gotcha logic puzzle trying to be tested with the car wash question.

nebezb

1 month ago

[-]

My gut says you’re right, but I don’t know if this is indeed true. It might be the same thing.

https://informatics.ed.ac.uk/news-events/news/news-archive/k...

kovek

1 month ago

[-]

Every word and every hierarchy of words in natural language is understood by LLMs as embeddings (vectors).

Each vector has many, many dimensions, and when we train LLMs, their internal understanding of those vectors picks up on all sorts of dimensions. A simple way to visualize this is a word's vector being <1, 180, 1, 3, ... >, where each number is the value at a certain dimension. In this example, say the dimensions are <gender, height in cm, kindness, social title/job, ...>. In this case, our example LLM could have learned that the example I gave is <Woman, 180, 100% kind, politician, ... >. The vectors undergo some transformation, so in practice the dimensions are not that discretely clear-cut.

In this case, elephant and car both semantically look very similar to vehicles. Their vectors would basically be very similar in most dimensions.

See this article. It shows that once you train an LLM, and you assign an embedding vector for each token, then you can see how the LLM can distinguish the difference between king and queen: man and woman.
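
A minimal sketch of that idea, using hand-written 4-dimensional toy vectors rather than real model embeddings. The dimension labels and every value below are made up for illustration; real embeddings are learned, much larger, and not this interpretable.

```python
import math

# Toy, hand-written "embeddings" (an assumption for illustration only).
# Hypothetical dimensions: [royalty, maleness, can_carry_you, is_animal]
VECTORS = {
    "king":     [0.9, 0.9, 0.0, 0.0],
    "queen":    [0.9, 0.1, 0.0, 0.0],
    "man":      [0.1, 0.9, 0.0, 0.0],
    "woman":    [0.1, 0.1, 0.0, 0.0],
    "car":      [0.0, 0.0, 0.9, 0.0],
    "elephant": [0.0, 0.0, 0.8, 0.9],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


if __name__ == "__main__":
    # king - man + woman lands on queen in this toy space.
    print(analogy("king", "man", "woman"))
    # car and elephant share the "can carry you" dimension, so they score
    # as more similar to each other than car does to king.
    print(cosine(VECTORS["car"], VECTORS["elephant"]))
    print(cosine(VECTORS["car"], VECTORS["king"]))
```

Whether a real model places elephant and car this close together is an empirical question; the sketch only shows what "similar in that dimension" means mechanically.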

news_hacker

1 month ago

[-]

LLMs are great at knowledge transfer, the real question is how well can they demonstrate intelligence with "unknown unknown" types of questions. This model has the benefit of being released after that issue became public knowledge, so it's hard to know how it would've performed pre-hoc.

astrange

1 month ago

[-]

There's a long delay ("knowledge cutoff") in model training, so it probably hasn't seen the question before.

proto-n

1 month ago

[-]

From Gemini pro:

You should definitely ride the elephant (or at least lead it there)!

Here is the logic:

If you walk there by yourself, you will arrive at the creek, but the dirty elephant will still be 300 feet back where you started. You can't wash the elephant if it isn't with you!

Plus, it is much easier to take the elephant to the water than it is to carry enough buckets of water 300 feet back to the elephant.

Would you like another riddle, or perhaps some actual tips on how to keep cool in the Arizona heat?

irthomasthomas

1 month ago

[-]

i would say this is a lower difficulty. the car question primes it to think about stuff like energy and pollution.

cyanydeez

1 month ago

[-]

Ok, but the point of the logical question is about the connection. If you really think it's answering logically with reasoning, there should be zero priming.

irthomasthomas

1 month ago

[-]

it's not primed to help, it's primed to confuse. models want to be good responsible people who care about the environment and don't waste fuel. that primes it to want to walk and it has to use "reasoning" to break out of that. that's what makes it harder, it has to fight between the logical answer and the 'responsible' answer. with the elephant question there is no such conflict.

jmkni

1 month ago

[-]

I like telling it the car wash is 24,901.1 miles away (with the implication being that I'm right beside the car wash)

1 month ago

[-]

GPT-OSS-120b, a (downloadable) model released more than half a year ago also gets that right, I'm not sure this is such a great success.

> Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?

Regardless of what I do, the days I decide to wash my car, it ALWAYS rains the day after, even if I look at the weather report and it says sunny. Cute that Gemini thinks it can predict it better than universe's will.

sushisource

1 month ago

[-]

> Regardless of what I do, the days I decide to wash my car, it ALWAYS rains the day after

Undeniable universal truth. I sometimes find myself making plans based on the fact that the most annoying possible outcome is also the most likely one.

Aachen

1 month ago

[-]

Non car person here. Why does that matter? It's not like rain means you didn't have to go to the wash, it rains often enough here that there wouldn't be car wash places left near me but there are plenty

1 month ago

[-]

> Why does that matter? It's not like rain means you didn't have to go to the wash

The car gets dirty again when it rains and when it gets dry again. I guess dust, salt, pollution and more is what gets mixed in and put on the chassis as it rains, falls from roofs and splashes, but can't say I've investigated deeply enough. Not the end of the world, just annoying it keeps happening.

opinion-is-bad

1 month ago

[-]

Many people avoid washing cars just before rain to avoid spots, etc. In Phoenix, as an extreme example, it rarely rains, and when it does it leaves everything filthy afterwards.

steve1977

1 month ago

[-]

The question of course is, did it get the car wash question right because it is "the car wash question" or because it could actually infer why the car needed to be there?

1 month ago

[-]

Wasn't that "twoot" (or whatever Mastodon calls them) made just a week ago? Unlikely to have been in the training dataset of a model becoming available for public use today, unless Google made some serious advancements on the training front.

jama211

1 month ago

[-]

Shouldn’t be too hard to come up with a new unique reasoning question

upmind

1 month ago

[-]

The answer here is why I dislike Gemini, though it gets the correct answer, it's far too verbose.

1 month ago

[-]

I can't stand a model over-explaining, needless fluff and wasting tokens. I asked the question so I know the context.

mikelward

1 month ago

[-]

I don't love the verbosity of any of the chatbots when I'm using my phone, but at least it put the answer/tl;dr in the first paragraph.

leumon

1 month ago

[-]

Gemini 3 pro and flash already answered this correctly.

claiir

1 month ago

[-]

And Gemini 3 can’t..? Isn’t this just a thinking vs nonthinking model thing?

patrickmcnamara

1 month ago

[-]

Truly we entering the era of AGI.

1 month ago

[-]

They probably had time to toss that example in the training soup.

AlphaAndOmega0

1 month ago

[-]

Previous models from competitors usually got that correct, and the reasoning versions almost always did.

This kind of reflexive criticism isn't helpful, it's closer to a fully generalized counter-argument against LLM progress, whereas it's obvious to anyone that models today can do things they couldn't do six months ago, let alone 2 years back.

1 month ago

[-]

I'm not denying any progress, I'm saying that reasoning failures that are simple which have gone viral are exactly the kind of thing that they will toss in the training data. Why wouldn't they? There's real reputational risks in not fixing it and no costs in fixing it.

AlphaAndOmega0

1 month ago

[-]

Given that Gemini 3 Pro already did solid on that test, what exactly did they improve? Why would they bother?

I double checked and tested on AI Studio, since you can still access the previous model there:

> You should drive.

> If you walk there, your car will stay behind, and you won't be able to wash it.

Thinking models consistently get it correct and did when the test was brand new (like a week or two ago). It is the opposite of surprising that a new thinking model continues getting it correct, unless the competitors had a time machine.

1 month ago

[-]

Why would they bother? Because it costs essentially nothing to add it to the training data. My point is that once a reasoning example becomes sufficiently viral, it ceases to be a good test because companies have a massive incentive to correct it. The fact some models got it right before (unreliably) doesn't mean they wouldn't want to ensure that the model gets it right.

buttered_toast

1 month ago

[-]

I think we need to reevaluate what purpose these sorts of questions serve and why they're important in regards to judging intelligence.

The model getting it correct or not at any given instance isn't the point, the point is if the model ever gets it wrong we can still assume that it still has some semblance of stochasticity in its output, given that a model is essentially static once it is released.

Additionally, they don't learn post-training (except in context, which I think counts as learning to some degree, albeit transient). If, hypothetically, it answers incorrectly in 1 of 50 attempts, and I explain in that 1 failed attempt why it is wrong, there will still be a 1-in-50 chance it gets it wrong in a new instance.

This differs from humans, say for example I give an average person the "what do you put in a toaster" trick and they fall for it, I can be pretty confident that if I try that trick again 10 years later they will probably not fall for it, you can't really say that for a given model.

energy123

1 month ago

[-]

They're important but not as N=1. It's like cherry picking a single question from SimpleQA and going aha! It got it right! Meanwhile it's 8% lower score than some other model when evaluated on all questions.

buttered_toast

1 month ago

[-]

Makes me wonder what people would consider better, a model that gets 92% of questions right 100% of the time, or a model that gets 95% of the questions right 90% of the time and 88% right the other 10%?

I think that's why benchmarking is so hard for me to fully get behind, even if we do it over say, 20 attempts and average it. For a given model, those 20 attempts could have had 5 incredible outcomes and 15 mediocre ones, whereas another model could have 20 consistently decent attempts and the average score would be generally the same.

We at least see variance in public benchmarks, but in the internal examples that's almost never the case.
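
To put numbers on that comparison, here is a small sketch. The 92/95/88 figures come from the question above; the per-attempt scores in the second half (100, 60, 70) are made-up values chosen only so that both models share the same average.

```python
from statistics import mean, pstdev


def weighted_mean_std(outcomes):
    """outcomes: list of (score, probability) pairs whose probabilities sum to 1."""
    mu = sum(score * p for score, p in outcomes)
    var = sum(p * (score - mu) ** 2 for score, p in outcomes)
    return mu, var ** 0.5


# Model A: 92% of questions right, every single run.
steady = [(92.0, 1.0)]
# Model B: 95% right on 90% of runs, 88% right on the other 10%.
swingy = [(95.0, 0.9), (88.0, 0.1)]

# Hypothetical 20-attempt benchmarks with identical averages:
spiky_runs = [100] * 5 + [60] * 15   # 5 incredible outcomes, 15 mediocre ones
flat_runs = [70] * 20                # 20 consistently decent attempts

if __name__ == "__main__":
    for name, dist in (("steady", steady), ("swingy", swingy)):
        mu, sd = weighted_mean_std(dist)
        print(f"{name}: mean={mu:.1f} std={sd:.1f}")
    for name, runs in (("spiky", spiky_runs), ("flat", flat_runs)):
        print(f"{name}: mean={mean(runs)} std={pstdev(runs):.1f}")
```

Model B averages about 94.3 against Model A's flat 92, but with a spread of about 2.1 points. In the second pair the averages are identical, so only the spread tells the two models apart, which is exactly what a single averaged benchmark number hides.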

https://www.svgviewer.dev/s/NeKACuHj

nickandbro

1 month ago

[-]

Does well on SVGs outside of "pelican riding on a bicycle" test. Like this prompt:

"create a svg of a unicorn playing xbox"

Still some tweaks to the final result, but I am guessing with the ARC-AGI benchmark jumping so much, the model's visual abilities are allowing it to do this well.

1 month ago

[-]

Interesting how it went a bit more 3D with the style of that one compared to the pelican I got.

ertgbnm

1 month ago

[-]

Animated SVGs are one of the examples in the press release. Which is fine, I just think the weird SVG benchmark is now dead. Gemini has beat the benchmark and now differences are just coming down to taste.

I don't know if it got these abilities through generalization or if google gave it a dedicated animated SVG RL suite that got it to improve so much between models.

Regardless we need a new vibe check benchmark ala bicycle pelican.

wolttam

1 month ago

[-]

What benchmark, though? There is very clearly a lot of room for improvement in its SVG making capabilities. The fact that it can now, finally, make a pelican on a bike that isn’t completely wrong is not an indicator that SVG generation is now a solved problem.

andy12_

1 month ago

[-]

I'm thinking now that as models get better and better at generating SVGs, there could be a point where we can use them to just make arbitrary UIs and interactive media with raw SVGs in realtime (like flash games).

rafark

1 month ago

[-]

> there could be a point where we can use them to just make arbitrary UIs and interactive media with raw SVGs

So render ui elements using xml-like code in a web browser? You’re not going to believe me when I tell you this…

andy12_

1 month ago

[-]

You’re not going to believe me when I tell you this, but generating a webpage with HTML is far simpler than generating arbitrary graphics (that look good) with SVGs.

nickandbro

1 month ago

[-]

Or quite literally a game where SVG assets are generated on the fly using this model

1 month ago

[-]

That's one dimension before another long term milestone: Realtime generation of 3D mesh content during gameplay.

Which is the "left brain" approach vs the "right brain" approach of coming at dynamic videogames from the diffusion model direction which the Gemini Genie thing seems to be about.

pugio

1 month ago

[-]

Unfortunately it still fails my personal SVG benchmark (educational 2d cross section of the human heart), even after multiple iterations and screenshots feedback. Oh well, back to the (human) drawing board.

EugeneOZ

1 month ago

[-]

Still not usable in production, not even near. But I'm happy to see any progress in this area.

roryirvine

1 month ago

[-]

On the other hand, creation of other vector image formats (eg. "create a postscript file showing a walrus brushing its teeth") hasn't improved nearly so much.

Perhaps they're deliberately optimising for SVG generation.

mclau153

1 month ago

[-]

can we move on from SVG to 3D models at some point?

knicholes

1 month ago

[-]

Image to model is already a thing, and it's pretty good.

Robdel12

1 month ago

[-]

I really want to use google’s models but they have the classic Google product problem that we all like to complain about.

I am legit scared to login and use Gemini CLI because the last time I thought I was using my “free” account allowance via Google workspace. Ended up spending $10 before realizing it was API billing and the UI was so hard to figure out I gave up. I’m sure I can spend 20-40 more mins to sort this out, but ugh, I don’t want to.

With alllll that said.. is Gemini 3.1 more agentic now? That’s usually where it failed. Very smart and capable models, but hard to apply them? Just me?

surgical_fire

1 month ago

[-]

May be very silly of me, but I avoid using Gemini on my personal Google account. I use it at work, because my employer provides it.

I am scared some automated system may just decide I am doing something bad and terminate my account. I have been moving important things to Proton, but there is some stuff that I couldn't change that would cause me a lot of annoyance. It's not trivial to set up an alternative account just for Gemini, because my Google account is basically on every device I use.

I mostly use LLMs as coding assistant, learning assistant, and general queries (e.g.: It helped me set up a server for self hosting), so nothing weird.

CamperBob2

1 month ago

[-]

For what it's worth, there was an (unfortunately unsuccessful) HN submission from a guy who got his Gemini account banned, apparently without losing his whole Google account: https://news.ycombinator.com/item?id=47007906

surgical_fire

1 month ago

[-]

Comforting to know that they may ban you from only some of their services, I guess?

I really regret relying so much on my Google account for so long. Untangling myself from it is really hard. Some places treat your email as a login, not simply as a way to contact you. This is doubly concerning for government websites, where setting up a new account may just not be a possibility.

At some point I suppose Gemini will be the only viable option for LLMs, so oh well.

paganel

1 month ago

[-]

Same feeling here, if it makes you feel any better (for sure it made me feel better seeing I'm not alone in this).

alpineman

1 month ago

[-]

100% agreed. I wish someone would make a test for how reliably the LLMs follow tool use instructions etc. The pelicans are nice but not useful for me to judge how well a model will slot into a production stack.

1 month ago

[-]

At first when I got started with using LLMs I read/analyzed benchmarks, looked at what example prompts people used and so on, but many times, a new model does best at the benchmark, and you think it'll be better, but then in real work, it completely drops the ball. Since then I've stopped even reading benchmarks, I don't care an iota about them, they always seem more misdirected than helpful.

Today I have my own private benchmarks, with tests I run myself, with private test cases I refuse to share publicly. These have been built up during the last 1/1.5 years, whenever I find something that my current model struggles with, then it becomes a new test case to include in the benchmark.

Nowadays it's as easy as `just bench $provider $model` and it runs my benchmarks against it, and I get a score that actually reflects what I use the models for, and it feels like it more or less matches with actually using the models. I recommend people who use LLMs for serious work to try the same approach, and stop relying on public benchmarks that (seemingly) are all gamed by now.
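
A minimal sketch of what such a private harness could look like. This is not the commenter's actual code: the `Case` structure, the two example cases, and `stub_model` are all hypothetical, and a real setup would swap the stub for a call to a provider's client.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the reply is acceptable


def run_bench(model: Callable[[str], str], cases: list[Case]) -> float:
    """Run every case against `model` and return the fraction that passed."""
    passed = 0
    for case in cases:
        try:
            ok = bool(case.check(model(case.prompt)))
        except Exception:
            ok = False  # a crash or timeout counts as a failure
        print(f"{'PASS' if ok else 'FAIL'}  {case.name}")
        passed += ok
    return passed / len(cases) if cases else 0.0


# Hypothetical cases; real ones would come from failures seen in daily use.
CASES = [
    Case(
        name="car wash",
        prompt="The car wash is 200 feet away. Should I walk or drive to wash my car?",
        check=lambda reply: "drive" in reply.lower(),
    ),
    Case(
        name="no extras",
        prompt="Reply with only the word OK.",
        check=lambda reply: reply.strip() == "OK",
    ),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real API call (an assumption, not a real client)."""
    return "You should drive." if "car wash" in prompt else "OK, done!"


if __name__ == "__main__":
    print(f"score: {run_bench(stub_model, CASES):.2f}")
```

Each time the daily-driver model fumbles something, the prompt and a check function for it become one more `Case`, which is how the suite grows over time.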

cdelsolar

1 month ago

[-]

1 month ago

[-]

The harness? Trivial to build yourself, ask your LLM for help, it's ~1000 LOC you could hack together in 10-15 minutes.

As for the test cases themselves, that would obviously defeat the purpose, so no :)

MrGreenTea

1 month ago

[-]

Would you be willing to give a rough outline of one or a few test cases? I am having a bit of a hard time imagining what and how you are testing. Is it like "change the signature of function X in file @Y to take parameter Z" and then comparing the result with what you expect?

cdelsolar

1 month ago

[-]

the purpose of what? i'm not an LLM

phamilton

1 month ago

[-]

> For those building with a mix of bash and custom tools, Gemini 3.1 Pro Preview comes with a separate endpoint available via the API called gemini-3.1-pro-preview-customtools. This endpoint is better at prioritizing your custom tools (for example view_file or search_code).

It sounds like there was at least a deliberate attempt to improve it.

pdntspa

1 month ago

[-]

You can delete the billing from a given API key

Stevvo

1 month ago

[-]

You could always use it through Copilot. The credits based billing is pretty simple without surprise charges.

horsawlarway

1 month ago

[-]

So much this.

It's absolutely amazing how hostile Google is to releasing billing options that are reasonable, controllable, or even fucking understandable.

I want to do relatively simple things like:

1. Buy shit from you

2. For a controllable amount (ex - let me pick a limit on costs)

3. Without spending literally HOURS trying to understand 17 different fucking products, all overlapping, with myriad project configs, api keys that should work, then don't actually work, even though the billing links to the same damn api key page, and says it should work.

And frankly - you can't do any of it. No controls (at best delayed alerts). No clear access. No real product differentiation pages. No guides or onboarding pages to simplify the matter. No support. SHIT LOADS of completely incorrect and outdated docs, that link to dead pages, or say incorrect things.

So I won't buy shit from them. Period.

sciencejerk

1 month ago

[-]

You think AWS is better?

horsawlarway

1 month ago

[-]

Scarily - yes, although not by much.

I've used all 3 major providers - AWS, GCP, Azure.

AWS is no gem... it also has its own byzantine processes to sign up and pay for things. And it also doesn't support any real and reasonable way to stop spend when you hit limits (abusive practices).

But at least I can generally sign up for and consume a new service without hours and hours of debugging.

For context - Google's own Gemini 3 utterly fails to figure out how to do something as simple as "access the image doodle feature" proudly marketed here: https://gemini.google/overview/image-generation/

It can't figure out how to do it. Honestly, I still can't figure out how to do it, despite signing up for about 5 different products, and trying 4 different UIs. The closest I got was to their inpainting/outpainting UI on the legacy models in their image create studio.

And none of that involved creating a billing account, which I already had, and was required for 3 of the signups.

As far as I'm concerned, this feature is fake marketing. It doesn't exist. That's the "quality" level of GCP.

3form

1 month ago

[-]

Exact reason I used none of these platforms for my personal projects, ever.

pdimitar

1 month ago

[-]

Who is comparing to AWS and why? They can both be terrible at the same time, you know.

abiraja

1 month ago

[-]

I've been using it lately with OpenCode and it's working pretty well (except for API reliability issues).

himata4113

1 month ago

[-]

use openrouter instead

Robdel12

1 month ago

[-]

This is actually an excellent idea, I’ll give this a shot tonight!

1 month ago

[-]

Pretty great pelican: https://simonwillison.net/2026/Feb/19/gemini-31-pro/ - took over 5 minutes though, but I think that's because they're having performance teething problems on launch day.

1 month ago

[-]

It's an excellent demonstration of the main issue I have with the Gemini family of models, they always go "above and beyond" to do a lot of stuff, even if I explicitly prompt against it. In this case, most of the SVG ends up consisting not just of a bike and a pelican, but clouds, a sun, a hat on the pelican and so much more.

Exactly the same thing happens when you code, it's almost impossible to get Gemini to not do "helpful" drive-by-refactors, and it keeps adding code comments no matter what I say. Very frustrating experience overall.

mullingitover

1 month ago

[-]

> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors

Just asking "Explain what this service does?" turns into

[No response for three minutes...]

+729 -522

cowmoo728

1 month ago

[-]

it's also so aggressive about taking out debug log statements and in-progress code. I'll ask it to fill in a new function somewhere else and it will remove all of the half written code from the piece I'm currently working on.

chankstein38

1 month ago

[-]

I ended up adding a "NEVER REMOVE LOGGING OR DEBUGGING INFO, OPT TO ADD MORE OF IT" to my user instructions and that has _somewhat_ fixed the problem but introduced a new problem where, no matter what I'm talking to it about, it tries to add logging. Even if it's not a code problem. I've had it explain that I could setup an ESP32 with a sensor so that I could get logging from it then write me firmware for it.

bratwurst3000

1 month ago

[-]

"I've had it explain that I could setup an ESP32 with a sensor so that I could get logging from it then write me firmware for it." lol did you try it? This so far from everything ratinonal

sd9

1 month ago

[-]

If it's adding too much logging now, have you tried softening the instruction about adding more?

"NEVER REMOVE LOGGING OR DEBUGGING INFO. If unsure, bias towards introducing sensible logging."

Or just

"NEVER REMOVE LOGGING OR DEBUGGING INFO."

1 month ago

[-]

What. You don't have yours ask for edit approval?

1 month ago

[-]

The depressing truth is most I know just run all these tools in /yolo mode or equivalents.

Because your coworkers definitely are, and we're stack ranked, so it's a race (literally) to the bottom. Just send it...

(All this actually seems to do is push the burden on to their coworkers as reviewers, for what it's worth)

1 month ago

[-]

You're mixing up two things though. One is what the agent does "locally", wherever that might be (for me it's inside a VM), and second is what code you actually share or as you call "send".

Just because you don't want to gate every change in #1, doesn't mean you're just throwing shit via #2, I'm still reviewing my code as much as before, if not more now, before I consider it ready to be reviewed by others.

But I'm seemingly also one of the few developers who seem to take responsibility of the code I produce, even if AI happens to have coded it.

1 month ago

[-]

> Just because you don't want to gate every change in #1, doesn't mean you're just throwing shit via #2,

Right but in practice from what I've seen at work, it does.

You're right: it shouldn't inherently, but that's what I've been seeing.

> But I'm seemingly also one of the few developers who seem to take responsibility of the code I produce, even if AI happens to have coded it.

Pretty much what I'm getting at, yeah

1 month ago

[-]

There's a huge psychological difference between 1) letting the agent write whatever then editing it for commit, and 2) approving the edits. There shouldn't be, but there is.

1 month ago

[-]

Who has time for that? This is how I run codex: `codex --sandbox danger-full-access --dangerously-bypass-approvals-and-sandbox --search exec "$PROMPT"`, having to approve each change would effectively destroy the entire point of using an agent, at least for me.

Edit: obviously inside something so it doesn't have access to the rest of my system, but enough access to be useful.

1 month ago

[-]

I wouldn't even think of letting an agent work in that mode. Even the best of them produce garbage code unless I keep them on a tight leash. And no, not a skill issue.

What I don't have time to do is debug obvious slop.

kees99

1 month ago

[-]

I ended up running codex with all the "danger" flags, but in a throw-away VM with copy-on-write access to code folders.

Built-in approval thing sounds like a good idea, but in practice it's unusable. Typical session for me was like:

  About to run "sed -n '1,100p' example.cpp", approve?
  About to run "sed -n '100,200p' example.cpp", approve?
  About to run "sed -n '200,300p' example.cpp", approve?

Could very well be a skill issue, but that was mighty annoying, and with no obvious fix (options "don't ask again for ...." were not helping).

1 month ago

[-]

One decent approach (which Codex implements, and some others) is to run these commands in a read-only sandbox without approval and let the model ask your approval when it wants to run outside the sandbox. An even better approach is just doing abstract interpretation over shell command proposals.

You want something like codex -a on-failure -s read-only (from memory: look up the exact flags)
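
To make the "interpret the command before asking" idea concrete, here is a minimal sketch of an auto-approval check. It is not how Codex or any other agent actually does it; the allowlist, the `is_read_only` name, and the rule that `sed` is only accepted in the `sed -n 'N,Mp' FILE` form are all assumptions chosen to match the session quoted above.

```python
import re
import shlex

# Characters that allow chaining, redirection, or substitution. Rejecting them
# outright is deliberately conservative: it also rejects harmless quoted uses.
UNSAFE_CHARS = set(";&|<>()`$\n\\")

# Hypothetical allowlist of commands treated as read-only with any arguments.
# A real policy would inspect their flags as well.
ALWAYS_READ_ONLY = {"cat", "head", "tail", "ls", "wc"}

# sed is only accepted in the exact form: sed -n 'N,Mp' FILE...
SED_PRINT_RANGE = re.compile(r"\d+(,\d+)?p")


def is_read_only(command: str) -> bool:
    """Return True only when the command is recognisably a read-only inspection."""
    if any(ch in UNSAFE_CHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    if not tokens:
        return False
    program, args = tokens[0], tokens[1:]
    if program in ALWAYS_READ_ONLY:
        return True
    if program == "sed":
        # Require -n, a print-range script, then at least one file operand
        # that cannot be mistaken for an option such as -i.
        return (
            len(args) >= 3
            and args[0] == "-n"
            and SED_PRINT_RANGE.fullmatch(args[1]) is not None
            and all(not arg.startswith("-") for arg in args[2:])
        )
    return False


for proposal in ["sed -n '1,100p' example.cpp", "sed -i 's/a/b/' example.cpp"]:
    verdict = "auto-approve" if is_read_only(proposal) else "ask the user"
    print(f"{proposal!r}: {verdict}")
```

Anything the check does not recognise falls through to "ask the user", so the failure mode is an extra prompt rather than a silent write.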

1 month ago

[-]

I keep it on a tight leash too, not sure how that's related. What gets edited on disk is very different from what gets committed.

well_ackshually

1 month ago

[-]

>Who has time for that?

People that don't put out slop, mostly.

1 month ago

[-]

That's another thing entirely, I still review and manually decide the exact design and architecture of the code, with more care now than before. Doesn't mean I want the UI of the agent to need manual approval of each small change it does.

mullingitover

1 month ago

[-]

Ask mode exists, I think the models work on the assumption that if you're allowing edits then of course you must want edits.

BartShoot

1 month ago

[-]

if you had to ask it obviously needs to refactor code for clarity so next person does not need to ask

kylec

1 month ago

[-]

"I don't know what did it, but here's what it does now"

1 month ago

[-]

I've seen Kimi do this a ton as well, so insufferable.

h14h

1 month ago

[-]

Would be really interesting to see an "Eager McBeaver" bench around this concept. When doing real work, a model's ability to stay within the bounds of a given task has almost become more important than its raw capabilities now that every frontier model is so dang good.

Every one of these models is so great at propelling the ship forward, that I increasingly care more and more about which models are the easiest to steer in the direction I actually want to go.

cglan

1 month ago

[-]

being TOO steerable is another issue though.

Codex is very steerable to a fault, and will gladly "monkey paw" your requests to a fault.

Claude Opus will ignore your instructions and do what it thinks is "right" and just barrel forward.

Both are bad and papering over the actual issue which is these models don't really have the ability to actually selectively choose their behavior per issue (ie ask for followup where needed, ignore users where needed, follow instructions where needed). Behavior is largely global

kees99

1 month ago

[-]

In my experience Claude gradually stops being opinionated as the task at hand becomes more arcane. I frequently add "treat the above as a suggestion, and don't hesitate to push back" to change requests, and it seems to help quite a bit.

cglan

1 month ago

[-]

Yeah that happens to me too. It’s hard to know where it’s going to break off and follow instructions too well vs use it as a tip. Idk it’s all tiring

h14h

1 month ago

[-]

For sure. I imagine it'd be pretty difficult to evaluate the "correct" amount of steer-ability. You'd probably just have to measure a delta in eagerness on a single same task between when given highly-specified prompts, and more open-ended prompts. Probably not dissimilar from how artificialanalysis.ai does their "omniscience index".
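
A rough sketch of what that delta could look like, assuming you can record which files the agent touched on each run. The function names and the "fraction of touched files outside the task's scope" definition of eagerness are made up for illustration, not an established metric.

```python
def out_of_scope_fraction(touched: set, allowed: set) -> float:
    """Fraction of touched files that fall outside the task's stated scope."""
    if not touched:
        return 0.0
    return len(touched - allowed) / len(touched)


def eagerness_delta(allowed: set, touched_specified: set, touched_open_ended: set) -> float:
    """Extra out-of-scope editing when the same task gets a looser prompt."""
    return (
        out_of_scope_fraction(touched_open_ended, allowed)
        - out_of_scope_fraction(touched_specified, allowed)
    )


# Hypothetical runs of one task whose scope is a single file.
allowed = {"parser.py"}
specified_run = {"parser.py"}
open_ended_run = {"parser.py", "utils.py", "cli.py", "README.md"}
print(eagerness_delta(allowed, specified_run, open_ended_run))
```

A delta near zero would mean the model stays in bounds regardless of how the task is phrased; a large positive value would mean it only behaves when tightly specified.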

enobrev

1 month ago

[-]

I have the same issue. Even when I ask it to do code-reviews and very explicitly tell it not to change files, it will occasionally just start "fixing" things.

mikepurvis

1 month ago

[-]

I find Copilot leans the other way. It'll myopically focus its work in the exact function I point it at, even when it's clear that adding a new helper would be a logical abstraction to share behaviour with the function right beside it.

Overall, I think it's probably better that it stay focused, and allow me to prompt it with "hey, go ahead and refactor these two functions" rather than the other way around. At the same time, really the ideal would be to have it proactively ask, or even pitch the refactor as a colleague would, like "based on what I see of this function, it would make most sense to XYZ, do you think that makes sense? <sure go ahead> <no just keep it a minimal change>"

Or perhaps even better, simply pursue both changes in parallel and present them as A/B options for the human reviewer to select between.

neya

1 month ago

[-]

> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors

This has not been my experience. I do Elixir primarily and Gemini has helped build some really cool products and massive refactors along the way. And it would even pick up security issues and potential optimizations along the way

What HAS constantly been an issue, though, is that the model will randomly not respond at all and some random error will occur, which is embarrassing for a company like Google with the infrastructure they own.

1 month ago

[-]

Out of curiosity, do you have any public projects (with public source code) you've made exclusively with Gemini, so one could take a look? I've tried a bunch of times to use Gemini to at least finish something small but I always end up sufficiently frustrated to abort it as the instruction-following seems so bad.

neya

1 month ago

[-]

That's a good idea. I will publish something open source. Gemini has Pro, Thinking and Flash. The pro is the one I'm referring to. The thinking, I haven't used it much for development though.

Yizahi

1 month ago

[-]

Asking LLM programs to "not do the thing" often results in them tripping and generating output including that "thing", since those are simply the tokens which will enter the input. I always try to rephrase the query so that all my instructions have only "positive" forms - "do only this" or "do it only in that way" or "do it only for those parameters requested" etc. Can't say if that helps much, but it is possible.

kolinko

1 month ago

[-]

Which is how it works with people as well

tyfon

1 month ago

[-]

I was using gemini antigravity in opencode a few weeks ago before they started banning everyone for that and I got into the habit of writing "do x, then wait for instructions".

That helped quite a bit but it would still go off on its own from time to time.

apitman

1 month ago

[-]

This matches my experience using Gemini CLI to code. It would also frequently get stuck in loops. It was so bad compared to Codex that I feel like I must have been doing something fundamentally wrong.

msteffen

1 month ago

[-]

> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors

Not like human programmers. I would never do this and have never struggled with it in the past, no...

1 month ago

[-]

Fairer comparison would be against other models, which are typically better at instruction following. You say "don't change anything not explicitly mentioned" or "Don't add any new code comments" and they tend to follow that.

JLCarveth

1 month ago

[-]

Every time I have tried using `gemini-cli` it just thinks endlessly and never actually gives a response.

1 month ago

[-]

Do you have Personalization Instructions set up for your LLM models?

You can make their responses fairly dry/brief.

1 month ago

[-]

I'm mostly using them via my own harnesses, so I have full control of the system prompts and so on. And no matter what I try, Gemini keeps "helpfully" adding code comments every now and then. With every other model, "- Don't add code comments" tends to be enough, but with Gemini I'm not sure how I could stop the comments from eventually appearing.

1 month ago

[-]

I'm pretty sure it writes comments for itself, not for the user. I always let the models comment as much as they want, because I feel it makes the context more robust, especially when cycling contexts often to keep them fresh.

There is a tradeoff though, as comments do consume context. But I tend to pretty liberally dispose of instances and start with a fresh window.

1 month ago

[-]

> I'm pretty sure it writes comments for itself, not for the user

Yeah, that sounds worse than "trying to be helpful". Read the code instead, why add indirection in that way, just to be able to understand what other models understand without comments?

Radle

1 month ago

[-]

The Indirection is on purpose, it works like a continued chain of thought.

1 month ago

[-]

Yeah, I'm just doubtful Gemini could ever be considered a "better model" if it needs to put "chain of thought" artifacts into my source code, considering that none of the other models seems to need that to write better code than Gemini.

metal_am

1 month ago

[-]

I'd love to hear some examples!

1 month ago

[-]

I use LLM's outside of work primarily for research on academic topics, so mine is:

  Be a proactive research partner: challenge flawed or unproven ideas with evidence; identify inefficiencies and suggest better alternatives with reasoning; question assumptions to deepen inquiry.

zengineer

1 month ago

[-]

true, whenever I ask Gemini to help me with a prompt for generating an image of XYZ, it generates the image.

https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...

jasonjmcghee

1 month ago

[-]

What's crazy is you've influenced them to spend real effort ensuring their model is good at generating animated svgs of animals operating vehicles.

The most absurd benchmaxxing.

1 month ago

[-]

I like how they also did a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.

jasonjmcghee

1 month ago

[-]

Ok Google what are some other examples like a pelican riding a bicycle

simultsop

1 month ago

[-]

reminds me of andor, luthen, positive reinforcing wasting time of emperor

threatofrain

1 month ago

[-]

Animated SVG is huge. People in different professions are worrying to different degrees in terms of being replaced by ML, but this one is huge with regards to digital art.

yieldcrv

1 month ago

[-]

yeah, complex SVG's are so much more bandwidth, computation and energy efficient than raster images - up to a point! but in general use we are not at that point and there's so much more we can do with it

I've been meaning to let coding agents take a stab at using the lottie library https://github.com/airbnb/lottie-web to supercharge the user experience without needing to make it a full time job

eurekin

1 month ago

[-]

Can't wait until they finally get to real world CAD

tngranados

1 month ago

[-]

There's a CAD example in that same thread: https://x.com/JeffDean/status/2024528776856817813

gibspaulding

1 month ago

[-]

I know this isn’t necessarily “real world CAD” but Claude Code is not too shabby at OpenSCAD.

UltraSane

1 month ago

[-]

It is funny to think that Jeff Dean personally worked to optimize the pelican riding a bike benchmark.

casey2

1 month ago

[-]

You don't have to benchmax everything, just the benchmarks in the right social circles

tantalor

1 month ago

[-]

He's svg-mogging

gnatolf

1 month ago

[-]

So let's put things we're interested in in the benchmarks.

I'm not against pelicans!

ghurtado

1 month ago

[-]

I think the reason the pelican example is great is because it's bizarre enough that it's unlikely to appear in the training data as one unified picture.

If we picked something more common, like say, a hot dog with toppings, then the training contamination is much harder to control.

troymc

1 month ago

[-]

I think it's now part of their training though, thanks to Simon constantly testing every new model against it, and sharing his results publicly.

There's a specific term for this in education and applied linguistics: the washback effect.

rvnx

1 month ago

[-]

It's the most common SVG test, it's the equivalent of Will Smith eating spaghetti, so obviously they benchmax toward it

MrCheeze

1 month ago

[-]

Does anyone understand why LLMs have gotten so good at this? Their ability to generate accurate SVG shapes seems to greatly outshine what I would expect, given their mediocre spatial understanding in other contexts.

tedsanders

1 month ago

[-]

A few thoughts:

- One thing to be aware of is that LLMs can be much smarter than their ability to articulate that intelligence in words. For example, GPT-3.5 Turbo was beastly at chess (1800 elo?) when prompted to complete PGN transcripts, but if you asked it questions in chat, its knowledge was abysmal. LLMs don't generalize as well as humans, and sometimes they can have the ability to do tasks without the ability to articulate things that feel essential to the tasks (like answering whether the bicycle is facing left or right).

- Secondly, what has made AI labs so bullish on future progress over the past few years is that they see how little work it takes to get their results. Often, if an LLM sucks at something that's because no one worked on it (not always, of course). If you directly train a skill, you can see giant leaps in ability with fairly small effort. Big leaps in SVG creation could be coming from relatively small targeted efforts, where none existed before.

1 month ago

[-]

We’re literally at the point where trillions of dollars have been invested in these things and the surrounding harnesses and architecture, and they still can’t do economically useful work on their own. You’re way too bullish here.

dbeardsl

1 month ago

[-]

Neither do cars until very recently. A tool doesn't have to be unsupervised to be useful.

1 month ago

[-]

My best guess is that the labs put a lot of work into HTML and CSS spatial stuff because web frontend is such an important application of the models, and those improvements leaked through to SVG as well.

mitkebes

1 month ago

[-]

All models have improved, but from my understanding, Gemini is the main one that was specifically trained on photos/video/etc in addition to text. Other models like earlier chatgpt builds would use plugins to handle anything beyond text, such as using a plugin to convert an image into text so that chatgpt could "see" it.

Gemini was multimodal from the start, and is naturally better at doing tasks that involve pictures/videos/3d spatial logic/etc.

The newer chatgpt models are also now multimodal, which has probably helped with their svg art as well, but I think Gemini still has an edge here

pknerd

1 month ago

[-]

> Does anyone understand why LLMs have gotten so good at this?

Added more IF/THEN/ELSE conditions.

1 month ago

[-]

More wires and jumpers on the breadboard.

sam_1421

1 month ago

[-]

Models are soon going to start benchmaxxing generating SVGs of pelicans on bikes

https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

cbsks

1 month ago

[-]

That’s Simon’s goal. “All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.”

travisgriggs

1 month ago

[-]

So once that's achieved, I wonder how well it deals with unsuspected variations. E.g.

"Give me an illustration of a bicycle riding by a pelican"

"Give me an illustration of a bicycle riding over a pelican"

"Give me an illustration of a bicycle riding under a flying pelican"

So on and so forth. Or will it start to look like the Studio C sketch about Lobster Bisque: https://youtu.be/A2KCGQhVRTE

1 month ago

[-]

Soon? I'd be willing to bet it's been included in the training set at least 6 months by now. Not so obviously that it always generates perfect pelicans on bikes, but sufficiently for the "minibench" to be less useful today than in the past.

Rudybega

1 month ago

[-]

If only there were some way to test it, like swapping the two nouns in the sentence. Alas.

jsheard

1 month ago

[-]

Simon's been doing this exact test for nearly 18 months now, if vendors want to benchmaxx it then they've had more than enough time to do so already.

stri8ted

1 month ago

[-]

Exactly. As far as I'm concerned, the benchmark is useless. It's way too easy and rewarding to train on it.

bonoboTP

1 month ago

[-]

It's just an in-joke, he doesn't intend it as a serious benchmark anymore. I think it's funny.

Legend2440

1 month ago

[-]

Y'all are way too skeptical, no matter what cool thing AI does you'll make up an excuse for how they must somehow be cheating.

https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...

toraway

1 month ago

[-]

Jeff Dean literally featured it in a tweet announcing the model. Personally it feels absurd to believe they've put absolutely no thought into optimizing this type of SVG output given the disproportionate amount of attention devoted to a specific test for 1 yr+.

I wouldn't really even call it "cheating" since it has improved models' ability to generate artistic SVG imagery more broadly but the days of this being an effective way to evaluate a model's "interdisciplinary" visual reasoning abilities have long since passed, IMO.

It's become yet another example in the ever growing list of benchmaxxed targets whose original purpose was defeated by teaching to the test.

arcatech

1 month ago

[-]

Or maybe you’re too trusting of companies who have already proven to not be trustworthy?

pixl97

1 month ago

[-]

I mean if you want to make your own benchmark, simply don't make it public and don't do it often. If your salamander on skis or whatever gets better with time it likely has nothing to do with being benchmaxxed.
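
A minimal sketch of that kind of private benchmark: keep the word lists and the seed to yourself and draw a few fresh combinations each time. The lists, the prompt wording, and the `private_prompts` name are placeholders, not anyone's actual test set.

```python
import itertools
import random

# Placeholder vocab; the point is that your real lists never get published.
ANIMALS = ["salamander", "dachshund", "turtle", "giraffe", "frog"]
VEHICLES = ["skis", "a recumbent bicycle", "roller skates", "a unicycle"]


def private_prompts(seed: int, count: int) -> list:
    """Draw `count` distinct animal/vehicle prompts, reproducible per seed."""
    rng = random.Random(seed)
    pairs = list(itertools.product(ANIMALS, VEHICLES))
    chosen = rng.sample(pairs, count)  # sampling without replacement
    return [f"Generate an SVG of a {animal} on {vehicle}" for animal, vehicle in chosen]


for prompt in private_prompts(seed=2024, count=3):
    print(prompt)
```

Reusing a seed lets you rerun the same prompts against a newer model later; changing it gives a fresh draw if you suspect the old one leaked.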

ks2048

1 month ago

[-]

Forget the paperclip maximizer - AGI will turn the whole world into pelicans on bikes.

SoKamil

1 month ago

[-]

It seems they trained the model to output good svg’s.

In their blog post[1], first use case they mention is svg generation. Thus, it might not be any indicator at all anymore.

[1] https://blog.google/innovation-and-ai/models-and-research/ge...

Arcuru

1 month ago

[-]

Did you stop using the more detailed prompt? I think you described it here: https://simonwillison.net/2025/Nov/18/gemini-3/

1 month ago

[-]

It seems to be having capacity problems right now but I'll run that as soon as I can get it to work.

1 month ago

[-]

Pretty solid: https://gist.github.com/simonw/f5c893203621a7631ff178d9093a8...

https://www.svgviewer.dev/s/dEdbH8Sw

1 month ago

[-]

Less pretty and more practical, it's really good at outputting circuit designs as SVG schematics.

InitialLastName

1 month ago

[-]

I don't know what of this is the prompt and what was the output, but that's a pretty bad schematic (for both aesthetic and circuit-design reasons).

1 month ago

[-]

The prompts were doing the design, reference voltage, hysteresis, output stage, all the maths and then the SVG is from asking the model to take all that and the current BOM to make an SVG schematic of it. In the past models would just output totally incoherent messes of lines and shapes.

I did a larger circuit too that this is part of, but it's not really for sharing online.

svnt

1 month ago

[-]

Yes but you concede it is a schematic.

tadfisher

1 month ago

[-]

How far we have come!

1 month ago

[-]

that's pretty amazing for an LLM but as an EE, if my intern did this i would sigh inwardly and pull up some existing schematics for some brief guidance on symbol layout.

AmazingTurtle

1 month ago

[-]

At this point, the pelican benchmark became so widely used that there must be high quality pelicans in the dataset, I presume. What about generating an okapi on a bicycle instead?

ascorbic

1 month ago

[-]

Loads of examples here https://x.com/jeffdean/status/2024525132266688757

tromp

1 month ago

[-]

Or, even more challenging, an okapi on a recumbent ?!

steve_adams_86

1 month ago

[-]

Ugh, the gears and chain don't mesh and there's no sprocket on the rear hub

But seriously, I can't believe LLMs are able to one-shot a pelican on a bicycle this well. I wouldn't have guessed this was going to emerge as a capability from LLMs 6 years ago. I see why it does now, but... It still amazes me that they're so good at some things.

1 month ago

[-]

Is this capability “emergent”, or do AI firms specifically target SVG generation in order to improve it? How would we be able to tell?

steve_adams_86

1 month ago

[-]

I asked myself the same thing as I typed that comment, and I'm not sure what the answer is. I don't think models are specifically trained on this (though of course they're trained on how to generate SVGs in general), but I'm prepared to be wrong.

I have a feeling the most 'emergent' aspect was that LLMs have generally been able to produce coherent SVG for quite a while, likely without specific training at first. Since then I suspect there has been more tailored training because improvements have been so dramatic. Of course it makes sense that text-based images using very distinct structure and properties could be manipulated reasonably well by a text-based language model, but it's still fascinating to me just how well it can work.

Perhaps what's most incredible about it is how versatile human language is, even when it lacks so many dimensions as bits on a machine. Yet it's still cool that we can resurrect those bits at rest and transmogrify them back into coherent projections of photons from a screen.

I don't think LLMs are AGI or about to completely flip the world upside down or whatever, but it seems undeniably magical when you break it down.

1 month ago

[-]

Google specifically boast about their SVG performance in the announcement post: https://blog.google/innovation-and-ai/models-and-research/ge...

You can try any combination of animal on vehicle to confirm that they likely didn't target pelicans directly though.

1 month ago

[-]

next time you host a party, have people try to draw a bicycle on your whiteboard (you have a whiteboard in your house right? you should, anyway...)

human adults are generally quite bad at drawing them, unless they spend a lot of time actually thinking about bicycles as objects

https://www.behance.net/gallery/35437979/Velocipedia

542354234235

1 month ago

[-]

They are, and it is very funny.

iammattmurphy

1 month ago

[-]

Fantastic post, thanks for that.

1 month ago

[-]

What’s your point? Yes, humans fail sometimes, as do AI models. Are you trying to imply that, in light of this, AI is now as capable as human beings? If so, that conclusion doesn’t follow logically.

1 month ago

[-]

it's not a loaded point, i just think it's funny that humans typically cannot one-shot this. and it will make your friends laugh

HPsquared

1 month ago

[-]

And the left leg is straight while the right leg is bent.

EDIT: And the chain should pass behind the seat stay.

bredren

1 month ago

[-]

What is that, a snack in the basket?

1 month ago

[-]

"integrating a bicycle basket, complete with a fish for the pelican... also ensuring the basket is on top of the bike, and that the fish is correctly positioned with its head up... basket is orange, with a fish inside for fun."

how thoughtful of the ai to include a snack. truly a "thanks for all the fish"

defen

1 month ago

[-]

A pelican already has an integrated snack-holder, though. It wouldn't need to put it in the basket.

SauntSolaire

1 month ago

[-]

That one's full too

1 month ago

[-]

A fish for the road

troymc

1 month ago

[-]

The number of snacks in the basket is a random variable with a Poisson distribution.

1 month ago

[-]

You think they are able to see their output and iterate on it? Or is it pure token generation?

benatkin

1 month ago

[-]

I used the AI studio link and tried running it with the temperature set to 1.75: https://jsbin.com/locodaqovu/edit?html,output

https://arcprize.org/leaderboard

culi

1 month ago

[-]

Cost per task has increased 4.2x but their ARC-AGI-2 score went from 33.6% to 77.1%

Cost per task is still significantly lower than Opus. Even Opus 4.5

brikym

1 month ago

[-]

Another great benchmark would be to convert a raster image of a logo into SVG. I've yet to find a good tool for this that produces accurate smooth lines.

tarr11

1 month ago

[-]

What do you think this particular prompt is evaluating for?

The more popular these particular evals are, the more likely the model will be trained for them.

See https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

Gander5739

1 month ago

[-]

1 month ago

[-]

I hope we keep beating this dead horse some more, I'm still not tired of it.

calny

1 month ago

[-]

Great pelican but what’s up with that fish in the basket?

coldtea

1 month ago

[-]

It's a pelican. What do you expect a pelican to have in his bike's basket?

It's a pretty funny and coherent touch!

1 month ago

[-]

> What do you expect a pelican to have in his bike's basket?

Probably stuff it cannot fit in the gullet, or doesn't want there (think trash). I wouldn't expect a pelican to stash fish there, that's for sure.

1 month ago

[-]

You never travel with a snack fish for later on? He's going to be burning calories.

1 month ago

[-]

hold on guys, what we have here is a cycling pelican expert

1 month ago

[-]

It takes humanity to be able to get into other's perspectives. It takes experience and wisdom to be able to do so for animals too.

nicr_22

1 month ago

[-]

Yeah, why only _one_ fish?

It's obvious that pelican is riding long distance, no way a single fish is sufficiently energy dense for more than a few miles.

Can't the model do basic math???

1 month ago

[-]

Where else are cycling pelicans meant to keep their fish?

calny

1 month ago

[-]

I get it, I just meant the fish is poorly done, when I’d have guessed it would be relatively simple part. Maybe the black dot eye is misplaced idk.

infthi

1 month ago

[-]

Wonder when will we get something other than a side view

1 month ago

[-]

Another Jeff Dean post about this model shows it writing programs that generate CAD objects. I suspect if you ask it to, it will create a CAD pelican on a CAD bicycle and even make joints so you can turn the pedals.

mikepurvis

1 month ago

[-]

That would be especially challenging for vector output. I tried just now on ChatGPT 5.2 to jump straight to an image, with this prompt:

"make me a cartoon image of a pelican riding a bicycle, but make it from a front 3/4 view, that is riding toward the viewer."

The result was basically a head-on view, but I expect if you then put that back in and said, "take this image and vectorize it as an SVG" you'd have a much better time than trying to one-shot the SVG directly from a description.

... but of course, if that's so, then what's preventing the model from being smart enough to identify this workflow and follow it on its own to get the task completed?

DonHopkins

1 month ago

[-]

How about STL files for 3d printing pelicans!

baq

1 month ago

[-]

Harder: the bike must work

Hardest: the pelican must work

mohsen1

1 month ago

[-]

is there something in your prompt about hats? why the pelican always wearing a hat recently?!

bigfishrunning

1 month ago

[-]

At this point, i think maybe they're training on all of the previous pelicans, and one of them decided to put a hat on it?

Disclaimer: This is an unsubstantiated claim that i made up

1 month ago

[-]

Not even animated? This is 2026.

readitalready

1 month ago

[-]

Jeff Dean just posted an animated version: https://x.com/JeffDean/status/2024525132266688757

benbreen

1 month ago

[-]

One underrated thing about the recent frontier models, IMO, is that they are obviating the need for image gen as a standalone thing. Opus 4.6 (and apparently 3.1 Pro as well) doesn't have the ability to generate images but it is so good at making SVG that it basically doesn't matter at this point. And the benefit of SVG is that it can be animated and interactive.

I find this fascinating because it literally just happened in the past few months. Up until ~summer of 2025, the SVG these models made was consistently buggy and crude. By December of 2026, I was able to get results like this from Opus 4.5 (Henry James: the RPG, made almost entirely with SVG): https://the-ambassadors.vercel.app

And now it looks like Gemini 3.1 Pro has vaulted past it.

1 month ago

[-]

> doesn't have the ability to generate images but it is so good at making SVG that it basically doesn't matter at this point

Yeah, since the invention of vector images, suddenly no one cares about raster images anymore.

Obviously not true, but that's how your comment reads right now. "Image" is very different from "Image", and one doesn't automagically replace the other.

buu700

1 month ago

[-]

This reminds me of the time I printed a poster with a blown up version of some image for a high school history project. A classmate asked how I did it, so I started going on about how I used software to vectorize the image. Turned out he didn't care about any of that and just wanted the name of the print shop.

Der_Einzige

1 month ago

[-]

You have no idea how badly I want to be teleported to the alternative world where VECTOR COMPUTING was the dominant form of computers.

We had high framerate (yes it was variable), bright, beautiful displays in the 1980s with the vectrex.

cachius

1 month ago

[-]

2025 that is

bigfishrunning

1 month ago

[-]

That Ostrich Tho

cachius

1 month ago

[-]

That Tires Tho

1 month ago

[-]

3.1 Pro is the first model to correctly count the number of legs on my "five legged dog" test image. 3.0 flash was the previous best, getting it after a few prompts of poking. 3.1 got it on the first prompt though, with the prompt being "How many legs does the dog have? Count Carefully".

However, it didn't get it on the first try with the original prompt (prompt: "How many legs does the dog have?"). It initially said 4, then a follow-up prompt got it to hesitantly say 5, with one limb being obscured or hidden.

So maybe I'll give it a 90%?

This is without tools as well.

merlindru

1 month ago

[-]

your question may have become part of the training data with how much coverage there was around it. perhaps you should devise a new test :P

devsda

1 month ago

[-]

I suggest asking it to identify/count the number of fire hydrants, crosswalks, bridges, bicycles, cars, buses and traffic signals etc.

Pit Google against Google :D

gallerdude

1 month ago

[-]

My job may have become part of the training data with how much coverage there is around it. Perhaps another career would be a better test of LLM capabilities.

1 month ago

[-]

Have you ever heard of a black swan?

iamdelirium

1 month ago

[-]

3.1 Pro has the same Jan 2025 knowledge cutoff as the other 3 series models. So if 3.1 has it in its training data, the other ones would have as well.

ainch

1 month ago

[-]

The fact it's still Jan 2025 is weird to me. Have they not had a successful pretrain in over a year?

throwaway314155

1 month ago

[-]

They likely won’t have a full pretrain for a while. Just like everyone else, the name of the game now is to finetune existing models.

1 month ago

[-]

Honestly at this point I have fed this image in so many times on so many models, that it also functions as a test for "Are they training on my image specifically" (they are generally, for sure, but that's along with everything else in the ocean of info people dump in).

I genuinely don't think they are. GPT-5.2 still stands by 4 legs, and OAI has been getting this image consistently for over a year. And 3.1 still fumbled with the harder prompt "How many legs does the dog have?". I needed to add the "count carefully" part to tip it off that something was amiss.

Since it did well I'll make some other "extremely far out of the norm" images to see how it fares. A spider with 10 legs or a fish with two side fins.

wat10000

1 month ago

[-]

Easy fix, make a new test image with six legs, and watch all the LLMs say it has five.

1 month ago

[-]

blog post is up- https://blog.google/innovation-and-ai/models-and-research/ge...

edit: biggest benchmark changes from 3 pro:

arc-agi-2 score went from 31.1% -> 77.1%

apex-agents score went from 18.4% -> 33.5%

ripbozo

1 month ago

[-]

Does the arc-agi-2 score more than doubling in a .1 release indicate benchmark-maxing? Though i dont know what arc-agi-2 actually tests

maxall4

1 month ago

[-]

Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspicious of such a large improvement, especially since the improvement on other benchmarks is not of the same order.

https://arcprize.org/arc-agi/1/

1 month ago

[-]

It's a sort of arbitrary pattern matching thing that can't be trained on in the sense that the MMLU can be, but you can definitely generate billions of examples of this kind of task and train on it, and it will not make the model better on any other task. So in that sense, it absolutely can be.

I think it's been harder to solve because it's a visual puzzle, and we know how well today's vision encoders actually work https://arxiv.org/html/2407.06581v1
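To make the "generate billions of examples" point concrete, here is a minimal sketch of a synthetic task generator for grid puzzles. It is a toy under stated assumptions: the three transforms, the grid sizes, and the `make_task` helper are all hypothetical, and real ARC-AGI tasks are far more varied than this.

```python
import random

# Hypothetical rule set for illustration only; not the actual ARC task distribution.
TRANSFORMS = {
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical": lambda g: g[::-1],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def random_grid(rng, height, width, colors=10):
    """Build a height x width grid of integer color codes in [0, colors)."""
    return [[rng.randrange(colors) for _ in range(width)] for _ in range(height)]


def make_task(rng, n_train=3):
    """Pick one hidden rule and emit a few demonstration pairs plus one test pair."""
    name = rng.choice(sorted(TRANSFORMS))
    fn = TRANSFORMS[name]
    pairs = []
    for _ in range(n_train + 1):
        grid = random_grid(rng, rng.randint(2, 5), rng.randint(2, 5))
        pairs.append({"input": grid, "output": fn(grid)})
    return {"rule": name, "train": pairs[:-1], "test": pairs[-1]}


rng = random.Random(0)  # fixed seed so the run is reproducible
tasks = [make_task(rng) for _ in range(1000)]
print(len(tasks), "synthetic tasks generated")
```

Scaling the loop up is trivial, which is the point: cheap procedural data of this kind can target the benchmark format without necessarily teaching anything that transfers.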

km144

1 month ago

[-]

The real question is: Why are people designing benchmarks that, if a model is trained on them, it won't improve the performance of the model at any real-world tasks? Why would anyone care about such benchmarks?

1 month ago

[-]

People are like typewriter monkeys, if something is possible to make it'll eventually be made.

boplicity

1 month ago

[-]

Benchmark maxing could be interpreted as benchmarks actually being a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.

energy123

1 month ago

[-]

Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.

tasuki

1 month ago

[-]

Didn't the same Francois Chollet claim that this was the Real Test of Intelligence? If they target it, perhaps they target... real intelligence?

ainch

1 month ago

[-]

He's always said ARC is a necessary but not sufficient condition for testing intelligence afaik

energy123

1 month ago

[-]

He said in an interview that it doesn't count if it's explicitly targeted, only if a model generalizes to it.

He also said that the "real test of intelligence" is being unable to come up with new tests that a human can easily do that the AI can't, not in being able to pass any specific benchmark.

CamperBob2

1 month ago

[-]

I don't know what he could mean by that, as the whole idea behind ARC-AGI is to "target the benchmark." Got any links that explain further?

layer8

1 month ago

[-]

The fact that ARC-AGI has public and semi-private in addition to private datasets might explain it: https://arcprize.org/arc-agi/2/#dataset-structure

segmondy

1 month ago

[-]

He should have kept it closed.

blinding-streak

1 month ago

[-]

I assume all the frontier models are benchmaxxing, so it would make sense

sho_hn

1 month ago

[-]

The touted SVG improvements make me excited for animated pelicans.

takoid

1 month ago

[-]

I just gave it a shot and this is what I got: https://codepen.io/takoid/pen/wBWLOKj

The model thought for over 5 minutes to produce this. It's not quite photorealistic (some parts are definitely "off"), but this is definitely a significant leap in complexity.

onionisafruit

1 month ago

[-]

Good to see it wearing a helmet. Their safety team must be on their game.

BrokenCogs

1 month ago

[-]

Yes but why would a pelican need a helmet? If it falls over it can just fly away... Common sense 1 Gemini 0

throwa356262

1 month ago

[-]

Obviously these domestic pelicans can't fly, otherwise why would they need a bike?

Gander5739

1 month ago

[-]

Why would a pelican be riding a bicycle at all, for that matter?

BrokenCogs

1 month ago

[-]

Because the user asked for it

benatkin

1 month ago

[-]

Here's what I got from Gemini Pro on gemini.google.com, it thought for under a minute...might you have been using AI studio? https://jsbin.com/zopekaquga/edit?html,output

It does say 3.1 in the Pro dropdown box in the message sending component.

tasuki

1 month ago

[-]

That's a good pelican. What I like the most is that the SVG is nice and readable. If only Inkscape could output nice SVG like this!

1 month ago

[-]

Looks great!

james2doyle

1 month ago

[-]

The blog post includes a video showcasing the improvements. Looks really impressive: https://blog.google/innovation-and-ai/models-and-research/ge...

aoeusnth1

1 month ago

[-]

I imagine they're also benchgooning on SVG generation

rdtsc

1 month ago

[-]

My perennial joke is that as soon as that got on the HN front page, Google went and hired some interns who spend 100% of their time on pelicans.

vunderba

1 month ago

[-]

SVG is an under-rated use case for LLMs because it gives you the scalability of vector graphics along with CSS-style interactivity (hover effects, animations, transitions, etc.).

DonHopkins

1 month ago

[-]

How about STL files for 3d printing pelicans!

esafak

1 month ago

[-]

Has anyone noticed that models are dropping ever faster, with pressure on companies to make incremental releases to claim the pole position, yet making strides on benchmarks? This is what recursive self-improvement with human support looks like.

1 month ago

[-]

Remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of a sudden the same models that were doing well on ARC 1 couldn’t even get 5% on ARC 2? Not convinced these benchmark improvements aren’t data leakage.

casey2

1 month ago

[-]

ARC 2 was made specifically to artificially lower contemporary LLM scores, therefore any kind of model improvements will have outsized effects

Also people use "saturated" too liberally. The top left corner, 1 cent per task, is saturated IMO, since there are billions of people who would prefer to solve ARC 1 tasks at 52 cents per task. On ARC 2 a human would make thousands of dollars a day with 99.99% accuracy.

https://arcprize.org/play

z3t4

1 month ago

[-]

How much do I get if I solve this? :D

alisonkisk

1 month ago

[-]

You are saying something interesting but too esoteric. Can you explain for beginners?

louiskottmann

1 month ago

[-]

You could get rich by solving ARC 2 tasks yourself instead of forwarding the work to an LLM, given a client willing to pay LLM rate.

culi

1 month ago

[-]

Look at the ARC site. The scores of these models are plotted against their "cost per task". All of these huge jumps come along with massive increases in cost per task. Including Gemini 3.1 Pro which increased by 4.2x

redox99

1 month ago

[-]

I don't think there's much recursive improvement yet.

I'd say it's a combination of

A) Before, new model releases were mostly a new base model trained from scratch, with more parameters and more tokens. This takes many months. Now that RL is used so heavily, you can make infinitely many tweaks to the RL setup, and in just a month get a better model using the same base model.

B) There's more compute online

C) Competition is more fierce.

m_ke

1 month ago

[-]

this is mostly because RLVR is driving all of the recent gains, and you can continue improving the model by running it longer (+ adding new tasks / verifiers)

so we'll keep seeing more frequent flag planting checkpoint releases to not allow anyone to be able to claim SOTA for too long

culi

1 month ago

[-]

I feel like they're actually dropping slower. Chinese models are dropping right before lunar new year as seems to be an emerging tradition.

A couple of western models have dropped around the same time too but I don't think the "strides on benchmarks" are that impressive when you consider how many tokens are being spent to make those "improvements". E.g. Gemini 3.1 Pro's ARC-AGI-2 score went from 33.6% to 77.1% buuut their "cost per task" also increased by 4.2x. It seems to be the same story for most of these benchmark improvements and similar for Claude model improvements.

I'm not convinced there's been any substantial jump in capabilities. More likely these companies have scaled their datacenters to allow for more token usage
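As a rough worked version of that comparison, the sketch below sets the score jump against the cost increase, using only the figures quoted in this comment (33.6% to 77.1%, 4.2x cost per task). The function name and the "score multiple per cost multiple" ratio are my own illustration, not a metric ARC Prize publishes.

```python
def gain_per_cost(old_score: float, new_score: float, cost_multiple: float) -> dict:
    """Compare a benchmark score jump against the increase in cost per task."""
    score_multiple = new_score / old_score
    return {
        "score_multiple": round(score_multiple, 2),
        "points_gained": round(new_score - old_score, 1),
        # Values below 1.0 mean cost grew faster than the score did.
        "score_multiple_per_cost_multiple": round(score_multiple / cost_multiple, 2),
    }


result = gain_per_cost(old_score=33.6, new_score=77.1, cost_multiple=4.2)
print(result)
```

With these inputs the score is roughly 2.3x higher at 4.2x the cost per task, so the ratio lands around 0.55. That is a crude lens, since benchmark points are not linear in difficulty, but it shows why the jump looks less dramatic once cost is on the same axis.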

ankit219

1 month ago

[-]

not much to do with self improvement as such. openai has increased its pace, others are pretty much consistent. Google last year had three versions of gemini-2.5-pro each within a month of each other. Anthropic released claude 3 in march 24, sonnet 3.5 in june 24, 3.5 new in oct 24, and then 3.7 in feb 25, then they went to the 4 series in May 25, followed by opus 4.1 in august, sonnet 4.5 in oct, opus 4.5 in nov, 4.6 in feb, sonnet 4.6 in feb itself. Yes, they released both within weeks of each other, but originally they only released them together. This staggered release is what creates the impression of fast releases. It's as much a function of training as a function of available compute, and they have ramped up in that regard.

oliveiracwb

1 month ago

[-]

With the advent of MoEs, efficiency gains became possible. However, MoEs still operate far from the balance and stability of dense models. My view is that most progress comes from router tuning based on good and bad outcomes, with only marginal gains in real intelligence

PlatoIsADisease

1 month ago

[-]

Only using my historical experience and not Gemini 3.1 Pro, I think we see benchmark chasing then a grand release of a model that gets press attention...

Then a few days later, the model/settings are degraded to save money. Then this gets repeated until the last day before the release of the new model.

If we are benchmaxing this works well because it's only being tested early on during the life cycle. By middle of the cycle, people are testing other models. By the end, people are not testing them, and if they did it would barely shake the last months of data.

1 month ago

[-]

I have a relatively consistent task, right at the edge of its intelligence, that it completes with new information on weekdays. Interestingly 3.0 flash was good when it came out, took a nose dive a month back and is now excellent. I actually can't fault it, it's so good.

Its performance in antigravity has also actually improved since launch day, when it was giving non-stop typescript errors (not sure if that was antigravity itself).

ainch

1 month ago

[-]

It's becoming impossible to keep up - in the last week or so we've had: Gemini 3 Deep Think, Gemini 3.1 Pro, Claude Sonnet 4.6, GPT-5.3-Codex Spark, GLM-5, Minimax-2.5, Step 3.5 Flash, Qwen 3.5 and Grok 4.20.

and I'm sure others I've missed...

nikcub

1 month ago

[-]

and anyone notice that the pace has broken xAI and they were just dropped behind? The frontier improvement release loop is now ant -> openai -> google

1 month ago

[-]

xAI just released Grok 4.20 beta yesterday or day before?

dist-epoch

1 month ago

[-]

Musk said Grok 5 is currently being trained, and it has 7 trillion params (Grok 4 had 3)

svara

1 month ago

[-]

My understanding is that all recent gains are from post training and no one (publicly) knows how much scaling pretraining will still help at this point.

Happy to learn more about this if anyone has more information.

Squarex

1 month ago

[-]

I still remember gemini 1.5 ultra and gpt 4.5 as extremely strong in some areas that no benchmark captures. It was probably not economical to offer them on a 20 usd subscription, but they felt different and smarter in some ways. The benchmarks seem to be missing something, because flash 3 was very close to 3 pro on some benchmarks, but much, much dumber.

dist-epoch

1 month ago

[-]

You gain more benefit spending compute on post-training than on pre-training.

But scaling pre-training is still worth it if you can afford it.

gmerc

1 month ago

[-]

That's what scaling compute depth to respond to the competition looks like, lighting those dollars on fire.

toephu2

1 month ago

[-]

This is what competition looks like.

zhyder

1 month ago

[-]

Surprisingly big jump in ARC-AGI-2 from 31% to 77%, guess there's some RLHF focused on the benchmark given it was previously far behind the competition and is now ahead.

Apart from that, the usual predictable gains in coding. Still is a great sweet-spot for performance, speed and cost. Need to hack Claude Code to use their agentic logic+prompts but use Gemini models.

I wish Google also updated Flash-lite to 3.0+, would like to use that for the Explore subagent (which Claude Code uses Haiku for). These subagents seem to be Claude Code's strength over Gemini CLI, which still has them only in experimental mode and doesn't have read-only ones like Explore.

1 month ago

[-]

>I wish Google also updated Flash-lite to 3.0+

I hope every day that they have made gains on their diffusion model. As a sub agent it would be insane, as it's compute light and cranks 1000+ tk/s

zhyder

1 month ago

[-]

Agree, can't wait for updates to the diffusion model.

Could be useful for planning too, given its tendency to think big picture first. Even if it's just an additional subagent to double-check with an "off the top of your head" or "don't think, share first thought" type of question. More generally would like to see how sequencing autoregressive thinking with diffusion over multiple steps might help with better overall thinking.

topocite

1 month ago

[-]

The only thing I can notice is deep research is better. Like much closer to outputting a paper from arxiv straight away.

I am really the bottleneck now and what to do with all this new information.

davidguetta

1 month ago

[-]

Implementation and Sustainability Hardware: Gemini 3 Pro was trained using Google’s Tensor Processing Units (TPUs). TPUs are specifically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs. TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training, which can lead to better model quality. TPU Pods (large clusters of TPUs) also provide a scalable solution for handling the growing complexity of large foundation models. Training can be distributed across multiple TPU devices for faster and more efficient processing.

So google doesn't use NVIDIA GPUs at all ?

1 month ago

[-]

When I worked there, there was a mix of training on nvidia GPUs (especially for sparse problems when TPUs weren't as capable), CPUs, and TPUs. I've been gone for a few years but I've heard a few anecdotal statements that some of their researchers have to use nvidia GPUs because the TPUs are busy.

rjh29

1 month ago

[-]

I assume that's a Gemini LLM response? You can tell Gemini is bullshitting when it starts using "often" or "usually" - like in this case "TPUs often come with large amounts of memory". Either they did or they didn't. "This (particular) mall often has a Starbucks" was one I encountered recently.

w10-1

1 month ago

[-]

It's not bullshit (i.e., intended) but probabilities all the way down, as Hume reminded us: from observations, you can only say the sun will likely rise in the east. You'd need to stand behind a theory of the world to say otherwise (but we were told "attention is all you need"...)

1 month ago

[-]

Googler. We use GPUs, but it's a drop in the bucket in the sea of our accelerators. We might sell more GPUs in Cloud than we use internally.

These are not data driven observations just vibes

PunchTornado

1 month ago

[-]

no. only tpus

paride5745

1 month ago

[-]

Another reason to use Gemini then.

Less impact on gamers…

TiredOfLife

1 month ago

[-]

TPUs still use ram and chip production capacity

lejalv

1 month ago

[-]

Bla bla bla yada sustainability yada often come with large better growing faster...

It's such an uninformative piece of marketing crap

maxloh

1 month ago

[-]

Gemini 3 seems to have a much smaller token output limit than 2.5. I used to use Gemini to restructure essays into an LLM-style format to improve readability, but the Gemini 3 release was a huge step back for that particular use case.

Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response, it still truncates the source text too aggressively, losing vital context and meaning in the restructuring process.

I hope the 3.1 release includes a much larger output limit.

NoahZuniga

1 month ago

[-]

Output limit has consistently been 64k tokens (including 2.5 pro).

esafak

1 month ago

[-]

People did find Gemini very talkative so it might be a response to that.

MallocVoidstar

1 month ago

[-]

> Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response

AI models can't do this. At least not with just an instruction, maybe if you're writing some kind of custom 'agentic' setup.

maxloh

1 month ago

[-]

Yeah, it does. It was possible with 2.5 Flash.

Here's a similar result with Qwen Qwen3.5-397B-A17B: https://chat.qwen.ai/s/530becb7-e16b-41ee-8621-af83994599ce?...

jayd16

1 month ago

[-]

> Even when the model is explicitly instructed to pause due to insufficient tokens

Is there actually a chance it has the introspection to do anything with this request?

1 month ago

[-]

No, the model doesn't have purview into this afaik

I'm not even sure what "pausing" means in this context and why it would help when there are insufficient tokens. They should just stop when you reach the limit, default or manually specified, but it's typically a cutoff.

You can see what happens by setting output token limit much lower
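A toy sketch of that cutoff behaviour, with everything in it hypothetical (the `generate` helper and the `"max_tokens"` / `"stop"` labels are illustrative, not any vendor's API): the limit is enforced by the decoding loop around the model, so the text simply ends wherever the counter runs out.

```python
def generate(planned_tokens, max_output_tokens):
    """Emit tokens until the plan ends or the hard output limit is hit.

    The limit check lives in this outer loop, not in the token stream,
    which is why a cutoff lands mid-thought rather than at a clean pause.
    """
    emitted = []
    for token in planned_tokens:
        if len(emitted) >= max_output_tokens:
            return emitted, "max_tokens"  # truncated by the limit
        emitted.append(token)
    return emitted, "stop"  # finished naturally


essay = "the restructured essay keeps going well past the budget".split()
text, reason = generate(essay, max_output_tokens=4)
print(" ".join(text), "|", reason)
```

Lowering `max_output_tokens` in a real request should show the same pattern: the response is cut, and nothing in the prompt gets a chance to add a graceful "pausing here" footer unless the model happened to budget for it on its own.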

maxloh

1 month ago

[-]

Yeah, it does. It was possible with 2.5 Flash.

Here's a similar result with Qwen Qwen3.5-397B-A17B: https://chat.qwen.ai/s/530becb7-e16b-41ee-8621-af83994599ce?...

jayd16

1 month ago

[-]

Ok it prints some stuff at the end but does it actually count the output tokens? That part was already built in somehow? Is it just retrying until it has enough space to add the footer?

otabdeveloper4

1 month ago

[-]

No.

the_duke

1 month ago

[-]

Gemini 3 is pretty good, even Flash is very smart for certain things, and fast!

BUT it is not good at all at tool calling and agentic workflows, especially compared to the recent two mini-generations of models (Codex 5.2/5.3, the last two versions of Anthropic models), and also fell behind a bit in reasoning.

I hope they manage to improve things on that front, because then Flash would be great for many tasks.

1 month ago

[-]

These improvements are one of the things specifically called out on the submitted page

chermi

1 month ago

[-]

You can really notice the tool use problems. They gotta get on that. The agent trend seems real, and powerful. They can't afford to fall behind on it.

1 month ago

[-]

I don't really have tool usage issues beyond what I'd file under "doesn't follow system prompt instructions consistently"

there are these times where it puts a prefix on all function calls, which is weird and I think hallucination, so maybe that one

3.1 hopefully fixes that

HardCodedBias

1 month ago

[-]

"They can't afford to fall behind on it."

They are very, very seriously far behind as of 3.0.

We'll see if 3.1 addresses the issue at all.

spwa4

1 month ago

[-]

In other words: they just need to motivate their employees while giving in to finance's demands to fire a few thousand every month or so ...

And don't forget, it's not just direct motivation. You can make yourself indispensable by sabotaging or at least not contributing to your colleagues' efforts. Not helping anyone, by the way, is exactly what your managers want you to do. They will decide what happens, thank you very much, and doing anything outside of your org ... well there's a name for that, isn't there? Betrayal, or perhaps death penalty.

anthonypasq

1 month ago

[-]

yeah, it seems to me like Gemini is a little behind on the current RL patterns and also they dont seem interested in really creating a dedicated coding model. I think they have so much product surface (search, AI mode, gmail, youtube, chrome etc), they are prioritizing making the model very general. but who knows im just talking out of my ass.

faebi

1 month ago

[-]

I'm doing Ruby and Gemini 3.0 pro has by far been the best model for me. It writes the nicest ruby code, like I would. Further, it either succeeds or fails hard and obviously. I prefer it failing hard instead of slowly going weird in my code.

Similar in antigravity. Privately it's my absolute favorite.

So I'm actually rooting for this.

znnajdla

1 month ago

[-]

Which harness? Gemini CLI or OpenCode?

ttul

1 month ago

[-]

What I’m noticing, overall: I’ve never cut so much code in my life. I’ve become a coding monster with one of those dark green GitHub profiles ever since 5.3-Codex gave me the confidence to load in a ridiculous number of tasks every day and let it rip. I have about three coding tasks going at once and in another window, Claude Cowork is ripping through PowerPoints and getting back to lawyers.

This tech is not going to replace us. If anything, I am becoming even more of a workaholic. But the output volume is going to pay off for those who are privileged enough to use these tools.

motoboi

1 month ago

[-]

There are thousands like you now. How many does it take to run the economy? What would the rest do.

Think of it like what a tractor did to agricultural work. The first guy that used a tractor probably thought: this is not replacing me, I’m just much more productive. Well, turns out you only need one guy per farm now.

Kerrick

1 month ago

[-]

But now many suburban homeowners also have a little lawn tractor, and lots of people on small acreage have a utility tractor. None of them are farmers, but they get value out of the technology as well. Plus, we're feeding a lot more people for a lot less money than we did before tractors.

motoboi

1 month ago

[-]

Yeah, but we used to employ hundreds of people per farm, or per plantation, to be exact. Thousands maybe to do the sugar cane work, as an example. Replaced by 5 high tech, GPS driven, human on board to supervise, not even to drive, tractors.

So human doing lawn with mechanized tools: efficiency goes through the roof. Still one per home.

Human doing a high-volume manual labor job where there was much more work than a single human could handle: the number of humans doing the job is now the amount of work divided by the amount of work one human can handle.

Of course we get ambitious, like Panama Canal building ambitious. But even that can’t absorb the previous amount of people doing that kind of work.

niek_pas

1 month ago

[-]

The market for iOS todo-applications seems to be infinite, so everyone can just become a todo app developer.

javier123454321

1 month ago

[-]

What I've noticed: I don't have the appetite to spend tokens on AI fixing errors AI made, or to pay for a 200/month subscription. At the beginning of the month I'm happy tinkering, but I reach the cap of how much money I'm willing to spend playing.

https://hbr.org/2026/02/ai-doesnt-reduce-work-it-intensifies...

AIorNot

1 month ago

[-]

Yeah see this article I think it was spot on

ttul

1 month ago

[-]

“Some described sending a “quick last prompt” right before leaving their desk so that the AI could work while they stepped away.”

This, I can relate to. Also: I feel like I need a second monitor.

znnajdla

1 month ago

[-]

How do you give it tasks? As GitHub issues?

upcoming-sesame

1 month ago

[-]

what is your setup to let it run on tasks in parallel?

tenpoundhammer

1 month ago

[-]

In an attempt to get outside of benchmark gaming I had it make Platypus on a Tricycle. It's not as good as pelican on bicycle. https://www.svgviewer.dev/s/BiRht5hX

textlapse

1 month ago

[-]

To really confuse it, ask it to take that tricycle with the platypus on it to a car wash.

dinosor

1 month ago

[-]

For a moment I assumed the output would look like Perry the Platypus from the Disney (I think?) show. It's surprising to me (as a layman) that a show with lots of media that would've made it to the training corpus didn't show up.

1 month ago

[-]

that's better than i thought it would be

hyperbovine

1 month ago

[-]

would love to be able to teleport this thread to, oh, 5 years ago. people would think some sort of alien technology had landed.

exabrial

1 month ago

[-]

You know what would slay right now? A native app.

Not another piece of Electron bloatware, a regular, efficient, fast, snappy, native app. One that connects to my MCP servers and has local filesystem tools.

Anthropic might fall behind Google/OpenAI eventually, but their Desktop App + MCP/Connectors is unbelievably useful to get real work done.

arcfour

1 month ago

[-]

I haven't used Anthropic's desktop app in months since I don't have access to a Mac anymore, but when I did...it was just an electron app? Did something change?

YetAnotherNick

1 month ago

[-]

Not only that, it is the slowest app among all AI apps.

ceroxylon

1 month ago

[-]

It also has some strange bugs between versions. There was an update a month or two ago that caused the app to be unable to quit normally, and I would have to 'force quit' it. Thankfully it was resolved, but it was unnerving to not be able to close the app normally.

perardi

1 month ago

[-]

Nope. It is still Electron, and it is not snappy. And I am on an M3 Max MacBook Pro.

I have transitioned off ChatGPT for home use (Google provides me slightly better value in my personal life, as I can pay for a plan that also accommodates my weird photo storage needs) and it’s all Anthropic at work, but I miss the ChatGPT Mac app. I can’t say for certain if it was Electron or not—I never dug into the internals, and it felt very, very fast and “native”.

exabrial

1 month ago

[-]

No, sadly. I wish it were native. It's _terrible_.

mbh159

1 month ago

[-]

77.1% on ARC-AGI-2 and still can't stop adding drive-by refactors. ARC-AGI-2 tests novel pattern induction, it's genuinely hard to fake and the improvement is real. But it doesn't measure task scoping, instruction adherence, or knowing when to stop. Those are the capabilities practitioners actually need from a coding agent. We have excellent benchmarks for reasoning. We have almost nothing that measures reliability in agentic loops. That gap explains this thread.

zapnuk

1 month ago

[-]

Gemini 3 was:

1. unreliable in GH copilot. Lots of 500 and 4XX errors. Unusable in the first 2 months

2. not available in vertex ai (europe). We have requirements regarding data residency. Funny enough anthropic is on point with releasing their models to vertex ai. We already use opus and sonnet 4.6.

I hope google gets their stuff together and understands that not everyone wants/can use their global endpoint. We'd like to try their models.

1 month ago

[-]

Gets 10/10 on my potato benchmarks: https://aibenchy.com/model/google-gemini-3-1-pro-preview-med...

1 month ago

[-]

Now I need to write more tests.

It's a bit hard to trick reasoning models, because they explore a lot of the angles of a problem, and they might accidentally have an "a-ha" moment that leads them on the right path. It's a bit like doing random sampling and stumbling upon the right result after doing gradient descent from those points.

thevinter

1 month ago

[-]

Are you intentionally keeping the benchmarks private?

1 month ago

[-]

Yes.

I am trying to think what's the best way to give most information about how the AI models fail, without revealing information that can help them overfit on those specific tests.

I am planning to add some extra LLM calls, to summarize the failure reason, without revealing the test.

1 month ago

[-]

Added one more test, which surprisingly gemini flash 3 reasoning passes, but gemini 3.1 pro not

qingcharles

1 month ago

[-]

I've been playing with the 3.1 Deep Think version of this for the last couple of weeks and it was a big step up for coding over 3.0 (which I already found very good).

It's only February...

nubg

1 month ago

[-]

> I've been playing with the 3.1 Deep Think version of this

How?

1 month ago

[-]

A select few have had early access through various programs Google offers. I believe there was a sentence or two to this effect on the Gemini 3 Deep Think post from Deepmind.

ArmandoAP

1 month ago

[-]

Model Card https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...

infinitewars

1 month ago

[-]

I find Gemini is great at generating code that is relatively common on the internet, especially web and algorithms. It is absolutely better at this than OpenAI's models. But Gemini is not as good at reasoning about problems from first principles, or catching subtle bugs. In some ways it is just a better Google that finds exactly what you want, less a general intelligence.

veselin

1 month ago

[-]

I am actually going to complain about this: all of the Gemini models are still preview ones.

Anthropic seems the best in this. Everything is in the API on day one. OpenAI tends to want to ask you for a subscription, but the API gets there a week or a few later. Now, Gemini 3 is not for production use and this is already the previous iteration. So, does Google even intend to release this model?

vnglst

1 month ago

[-]

I asked Gemini 3.1 Pro to generate some of the modern artworks in my "Pelican Art Gallery". I particularly like the rendition of the Sunflowers: https://pelican.koenvangilst.nl/gallery/category/modern

dev1ycan

1 month ago

[-]

Nice collection of visible bits that have no relation at all with art

1 month ago

[-]

bro why is called pelican art gallery if you have no pelican art in it.

Is this like 5d chess layers of irony or something im not getting through?

Nice gallery besides

vnglst

1 month ago

[-]

I asked Gemini 3.1 Pro Preview to generate the modern artworks as SVG for my Pelican Art Gallery. I particularly like the rendition of the Sunflowers: https://pelican.koenvangilst.nl/gallery/category/modern

janalsncm

1 month ago

[-]

This model says it accepts video inputs. I asked it to transcribe a 5 second video of a digital water curtain which spelled “Boo Happy Halloween”, and it came back with “Happy” which wasn’t the first frame, but also is incomplete.

This kind of test is good because it requires stitching together info from the whole video.

aabhay

1 month ago

[-]

It reads videos at 1fps by default. You have to set the video resolution to high in ai studio
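A small sketch of why 1 fps sampling could produce exactly that "Happy"-only result on a 5 second clip. The visibility windows for each word are made up for illustration, and sampling at whole-second offsets starting from 0 is an assumption, not documented Gemini behaviour.

```python
def sampled_timestamps(duration_s: float, fps: float) -> list[float]:
    """Timestamps (in seconds) of the frames a fixed-rate sampler would grab."""
    count = int(duration_s * fps)
    return [round(i / fps, 3) for i in range(count)]


def words_seen(windows: dict[str, tuple[float, float]], timestamps: list[float]) -> list[str]:
    """Words whose on-screen window contains at least one sampled frame."""
    return [
        word
        for word, (start, end) in windows.items()
        if any(start <= t < end for t in timestamps)
    ]


# Hypothetical on-screen windows (start, end) in seconds for each word.
windows = {"Boo": (0.2, 0.9), "Happy": (1.8, 3.2), "Halloween": (3.3, 3.9)}

low = words_seen(windows, sampled_timestamps(5, 1))    # frames at 0, 1, 2, 3, 4 s
high = words_seen(windows, sampled_timestamps(5, 10))  # frames every 0.1 s
print("1 fps :", low)
print("10 fps:", high)
```

Under these assumed timings only "Happy" overlaps a sampled frame at 1 fps, while 10 fps catches all three words. Anything shown for less than a second can fall entirely between samples.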

janalsncm

1 month ago

[-]

This is inside the Gemini app.

sergiotapia

1 month ago

[-]

To use in OpenCode, you can update the models it has:

    opencode models --refresh

Then /models and choose Gemini 3.1 Pro

You can use the model through OpenCode Zen right away and avoid that Google UI craziness.

---

It is quite pricey! Good speed and nailed all my tasks so far. For example:

    @app-api/app/controllers/api/availability_controller.rb 
    @.claude/skills/healthie/SKILL.md 

    Find Alex's id, and add him to the block list, leave a comment 
    that he has churned and left the company. we can't disable him 
    properly on the Healthie EMR for now so 
    this dumb block will be added as a quick fix.

Result was:

    29,392 tokens
    $0.27 spent

So relatively small task, hitting an API, using one of my skills, but a quarter. Pricey!
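As a rough check on those figures, the blended rate can be worked out as below. This is only a sketch: the function name is made up for illustration, and it lumps input, output and any cached tokens into a single rate, which is not how providers actually bill.

```python
def cost_per_million_tokens(total_cost_usd: float, total_tokens: int) -> float:
    """Blended USD cost per one million tokens (input and output lumped together)."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost_usd / total_tokens * 1_000_000


# Figures quoted above: 29,392 tokens for $0.27.
blended_rate = cost_per_million_tokens(0.27, 29_392)
print(f"${blended_rate:.2f} per million tokens (blended)")
```

That works out to roughly $9 per million tokens for this one task.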

gbalduzzi

1 month ago

[-]

I don't see it even after refresh. Are you using the opencode-gemini-auth plugin as well?

sergiotapia

1 month ago

[-]

No, I am not; just vanilla OpenCode. I do have OpenCode Zen credits, and I did opencode login whatever their command is to auth against opencode itself. Maybe that's the reason I see these premium models.

agentifysh

1 month ago

[-]

My enthusiasm is a bit muted this cycle because I've been burned by Gemini CLI. These models are very capable but Gemini CLI just doesn't seem to work. For one, it never follows instructions as strictly as its competitors do, and it even hallucinates, which is a rarity.

More importantly, it feels like Google is stretched thin across different Gemini products and pricing reflects this. I still have no idea how to pay for Gemini CLI; in codex/claude it's very simple: $20/month for entry and $200/month for a ton of weekly usage.

I hope whoever is reading this from Google can redeem Gemini CLI by focusing on being competitive instead of making it look pretty (that seems to be the impression I got from the updates on X)

cheema33

1 month ago

[-]

> I still have no idea how to pay for Gemini CLI; in codex/claude it's very simple: $20/month for entry and $200/month for a ton of weekly usage.

This!

I would like to sign up for a paid plan for Gemini CLI. But I have not been able to figure out how. I already have Codex and Claude plans. Those were super easy to sign up for.

jiggawatts

1 month ago

[-]

What’s your difficulty? Google has published easy to follow 27-step instructions for how to sign up for the half a dozen services you need to chain together to enable this common usecase!

1 month ago

[-]

On the 3.0 rollout I signed up for billing and it just silently failed. Solution was to remake billing account and then wait a day

jiggawatts

1 month ago

[-]

“Time machine not included. Some temporal slippage may be experienced. Paradoxes will not be compensated.”

magicalhippo

1 month ago

[-]

Supposedly the Google One AI plans[1] allow for this. I've been testing an AI Pro plan, but haven't gotten CLI to work yet (keeps asking me to sign in). So yeah...

[1]: https://one.google.com/about/#compare-plans

dxbednarczyk

1 month ago

[-]

Every time I've used Gemini models for anything besides code or agentic work they lean so far into the RLHF induced bold lettering and bullet point list barf that everything they output reads as if the model was talking _at_ me and not _with_ me. In my Openclaw experiment(s) and in the Gemini web UI, I've specifically added instructions to avoid this type of behavior, but it only seemed to obey those rules when I reminded the model of them.

For conversational contexts, I don't think the (in some cases significantly) better benchmark results compared to a model like Sonnet 4.6 can convince me to switch to Gemini 3.1. Has anyone else had a similar experience, or is this just a me issue?

augusto-moura

1 month ago

[-]

Gemini sounds less personal, but I think that is good. From my experience, the quality of response is much higher than ChatGPT or Grok, and it cites real sources. I want to have a mini-wikipedia response for my questions, not a friend's group chat response

1 month ago

[-]

I have the opposite viewpoint:

If a model doesn't optimize the formatting of its output display for readability, I don't want to read it.

Tables, embedded images, use of bulleted lists and bold/italicizing etc.

markab21

1 month ago

[-]

You just articulated why I struggle to personally connect with Gemini. It feels so unrelatable and exhausting to read its output. I prefer to read Opus/Deepseek/GLM over Gemini, Qwen and the open source GPT models. Maybe it is RLHF that is creating my distaste from using it. (I pay for Gemini; I should be using it more... but the outputs just bug me and feel more work to get actionable insight.)

dragochat

1 month ago

[-]

> feel more work to get actionable insight

WHAT?! I find that the nice sharp formatting is exactly what makes it EASIER to get actionable insight from it...

(Plus the weird-but-cute unrequested analogies are nice to occasionally elicit a smile and keep you motivated :P)

staticman2

1 month ago

[-]

I'm not familiar with Openclaw, but the trick to solve this would be to embed a style reminder at the bottom of each user message and ideally hide that from the user with the UI.

This is how roleplay apps like Sillytavern customize the experience for power users by allowing hidden style reminders as part of the user message that accompany each chat message.
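A minimal sketch of that idea, assuming a generic chat-message list of `{"role", "content"}` dicts. The reminder text and the function names are made up for illustration and are not taken from Sillytavern or any specific API.

```python
STYLE_REMINDER = "(Style: plain prose, no bullet lists, no bold text, no analogies.)"


def with_style_reminder(user_text: str, reminder: str = STYLE_REMINDER) -> str:
    """Return the text actually sent to the model: the user's message plus a trailing reminder."""
    return f"{user_text.rstrip()}\n\n{reminder}"


def build_messages(history: list[dict], user_text: str) -> list[dict]:
    """Append the new user turn, with the reminder attached, to a copy of the history."""
    return history + [{"role": "user", "content": with_style_reminder(user_text)}]


# The UI would keep displaying `shown_to_user`; only the sent message carries the reminder.
shown_to_user = "How do red-black trees stay balanced?"
messages = build_messages([], shown_to_user)
print(messages[-1]["content"])
```

Because the reminder rides along with every turn, it stays close to the end of the context instead of fading behind a long conversation.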

InkCanon

1 month ago

[-]

I think they all output that bold lettering, point by point style output. I strongly suspect it's part of a synthetic data pipeline all these AI companies have, and it improves performance. Claude seems to be the least of them, but it will start writing code at the drop of a hat. What annoys me in Gemini is that it has a really strange tendency to come up with weird analogies, especially in Pro mode. You'll be asking it about something like red black trees and it'll say "Red Black Trees (The F1 of Tree Data Structures)".

hydrolox

1 month ago

[-]

Yes, the analogy habit is the most annoying of all. Overall formatting for me is doable, if it didn't divide up an answer into these silly arbitrary categories with useless analogies. I've tried adding in my user preferences to never use analogies but it inevitably falls back into that habit.

losvedir

1 month ago

[-]

It definitely has the worst "voice" in my opinion. Feels very overachieving McKinsey intern to me.

1 month ago

[-]

I have no issues adjusting gemini tone & style with system prompt content

timabdulla

1 month ago

[-]

Google tends to trumpet preview models that aren't actually production-grade. For instance, both 3 Pro and Flash suffer from looping and tool-calling issues.

I would love for them to eliminate these issues because just touting benchmark scores isn't enough.

upmind

1 month ago

[-]

In my experience, while Gemini does really well in benchmarks I find it much worse when I actually use the model. It's too verbose / doesn't follow instructions very well. Let's see if that changes with this model.

thallavajhula

1 month ago

[-]

This is great. I am hopeful that Gemini 3.1 Pro would be great. So far, I'm almost always pulled away from Gemini models by Claude. Having used Claude Opus High for a while now, Claude Opus seems to be fantastic at coding. Even Gemini's comparison chart says so. OpenAI's 5.3-codex is by far the weakest (of the 3) for my coding purposes. Claude Opus really shines at explanations and generating code.

Gemini is almost great. Claude Opus is great. I keep switching among these subscriptions every month to not miss out on any of the offerings for too long; ChatGPT Plus <-> Gemini Pro <-> Claude.

lgl

1 month ago

[-]

> I keep switching among these subscriptions every month to not miss out on any of the offerings for too long; ChatGPT Plus <-> Gemini Pro <-> Claude.

I wonder why many people seem to be doing this instead of just going for a copilot subscription that has access to all those models? Anybody care to share pros and cons?

sothatsit

1 month ago

[-]

OpenAI and Anthropic give you a lot of usage/$ through their plans. For the Anthropic Max plans, this can be like a ~90% discount. Copilot does not benefit from this (their pricing model is also different though, it is request-based rather than token usage based, so it is hard to compare).

That's not to mention that the models generally work better in their own harnesses, which is perhaps unsurprising because the models have been trained with the specific harness in mind (and vice versa). That said, I think some 3rd-party harnesses do a lot of work to make different models work well in their harness.

3371

1 month ago

[-]

I would suggest you also take a look at Cursor's Composer1.5. It's super fast, and performs better than Gemini3P in my use cases.

thallavajhula

1 month ago

[-]

I've been trying composer-1.5 on and off and it doesn't come close to Claude's Opus High. The explainability of Claude is just something else.

3371

1 month ago

[-]

Sure, my point was it's better than Gemini and it's really really fast, and it's missing from the parent comment.

1 month ago

[-]

It seems google is having a disjointed roll out, and there will likely be an official announcement in a few hours. Apparently 3.1 showed up unannounced in vertex at 2am or something equally odd.

Either way early user tests look promising.

carpe__diem

1 month ago

[-]

One thing I’d like to see in these releases is stronger emphasis on regression behavior, not just headline capability.

In production, the costly failures are usually "almost right" edits that quietly shift semantics across large diffs.

We now gate model upgrades behind a fixed eval set of our own repos + prompts and compare pass rates by task category (refactor, test repair, API migration). Raw benchmark gains matter less to us than variance and rollback safety. If 3.1 improves consistency on long multi-file edits, that’s a bigger win than a small jump on one-shot tasks.
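One way such a gate could look, as a sketch only: the commenter's actual harness is not described, so the result format, the 2-point tolerance and the sample data below are all assumptions.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (category, passed) pairs -> {category: pass rate}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def regressions(baseline, candidate, tolerance=0.02):
    """Categories where the candidate's pass rate drops more than `tolerance` below baseline."""
    base, cand = pass_rates(baseline), pass_rates(candidate)
    return {
        c: (base[c], cand.get(c, 0.0))
        for c in base
        if cand.get(c, 0.0) < base[c] - tolerance
    }


# Made-up results for illustration only.
baseline = [("refactor", True), ("refactor", True), ("test repair", True), ("API migration", False)]
candidate = [("refactor", True), ("refactor", False), ("test repair", True), ("API migration", True)]

blocked = regressions(baseline, candidate)
print("upgrade blocked by:" if blocked else "upgrade allowed", blocked)
```

Here the candidate improves on API migration but still fails the gate, because it regresses on refactors; that is the per-category view the comment argues for over a single headline score.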

1 month ago

[-]

Funnily, on my tests, 3 flash with medium reasoning does better. Seems like 3.1 pro reasoned about the correct answer, but chose to go with a different (wrong) one: https://aibenchy.com/compare/?left=google-gemini-3-flash-pre...

EDIT: while also being 3x cheaper

pawelduda

1 month ago

[-]

It's safe to assume they'll be releasing improved Gemini Flash soon? The current one is so good & fast I rarely switch to pro anymore

tucnak

1 month ago

[-]

Gemini 3 Pro (high) is a joke compared to Gemini 3 Flash in Antigravity, except it's not even funny. Flash is insane value, and super capable, too. I've had it implement a decompiler for very obscure bytecode, and it was passing all tests in no time. PITA to refactor later, but not insurmountable. Gemini 3 Pro (high) choked on this problem in the early stages... I'm looking forward to comparing 3.1 Pro vs 3.0 Flash, hopefully they have improved on it enough to finally switch over.

derac

1 month ago

[-]

When 3 came out they mentioned that flash included many improvements that didn't make it into pro (via an hn comment). I imagine this release includes those.

dudeinhawaii

1 month ago

[-]

After 2 days of giving it a go, I find that Gemini CLI is still considerably worse than both Codex and Claude Code.

The model itself also has strange behaviors that seem like it gets randomly replaced with Gemini-3-Flash or something else. I'll explain.

Once agentic coding was a bust, I gave it a run as a daily driver for AI assistant. It performed fairly well but then began behaving strangely. It would lose context mid conversation. For instance, I said "In san francisco I'm looking for XYZ". Two turns later I'm asking about food and it gives me suggestions all over the world.

Another time, I asked it about the likelihood of the pending east coast winter storm affecting my flight. I gave it all the details (flight, stops, time, cities).

Both GPT-5.2 and Claude crunched and came back with high quality estimations and rationale. Gemini 3.1 Pro... 5 times, returned a weather forecast widget for either the layover or final destination. This was on "Pro" reasoning, the highest exposed on the Gemini App/WebApp. I've always suspected Google swaps out models randomly so this.. wasn't surprising.

I then asked Gemini 3.1 Pro via the API and it returned a response similar to Claude and GPT-5.2 -- carefully considering all factors.

This tells me that a Google AI Ultra subscription gives me a sub-par coding agent which often swaps in Flash models, a sub-par web/app AI experience that also isn't using the advertised SOTA models, and a bunch of preview apps for video gen, audio gen (crashed every time I attempted), and world gen (Genie was interesting but a toy).

This will be a quick cancel as soon as the intro rate is done.

It's like Google doesn't ACTUALLY want to be the leader in AI or serve people their best models. They want to generate hype around benchmarks and then nerf the model and go silent.

Gemini 3 Pro Preview went from exceptional in the first month to mediocre and then out of my rotation within a month.

hackrmn

1 month ago

[-]

I am reading opinions here from agent users, but I haven't adopted the "agentic workflow" myself because I believe I am (for now) getting a lot of my money's worth using Gemini (3 Pro) in the traditional conversational manner. It is adequate at suggesting solutions in the form of code, or reasoning in general. My problems are software engineering but also everything that is not; since I have a subscription, it's my go-to problem solving partner. I see no reasons to switch to another product for now either. I am constantly in the loop getting samples of chats with Grok and ChatGPT and it seems a very close race.

If Claude is that one race horse that's built different -- and I absolutely can believe it is so because they have rightfully tuned it -- I am not convinced I am missing out much. But maybe that's because I am more of a traditionalist than most everyone else, who have embraced the idea of having an agent run a loop on their workstation(s) and trusting it to deliver. Perhaps if I were in more of a tight time frame, I'd be pressed to do so myself, but for now I am already benefiting from the extra speed of "rubberducking" with Gemini all manner of software engineering problems that I need to solve, so I simply have no reasons to abandon it.

I think this is also Google's strength -- they have the data, they've already integrated Gemini, or a variant of it anyway, into google.com which is one of their prized cash cows, and it's everywhere else too. Like others here have said, Google may not have the absolute best in class at all times, but they're fairly good and they still have the brains that gave us DeepMind and GPT. Unless there's some sort of stagnation going on in their ranks, I expect they're not resting on their laurels. With their capital they're still at the head of the race. Anthropic and OpenAI have the benefit of being nimble, though, and it shows too. Anyway, competition is good, the cat's out of the bag and on the greener side of the river :-)

nobrains

1 month ago

[-]

In the "Intelligence applied" section, where they show the comparison animations, they are shown using a non-optimal UI.

Somewhere along that process, they lost me.

1 month ago

[-]

I always try Gemini models when they get updated with their flashy new benchmark scores, but always end up using Claude and Codex again...

I get the impression that Google is focusing on benchmarks but without assessing whether the models are actually improving in practical use-cases.

I.e. they are benchmaxing

Gemini is "in theory" smart, but in practice is much, much worse than Claude and Codex.

user34283

1 month ago

[-]

I exclusively use Gemini for Chat nowadays, and it's been great mostly. It's fast, it's good, and the app works reliably now. On top of that I got it for free with my Pixel phone.

For development I tend to use Antigravity with Sonnet 4.5, or Gemini Flash if it's about a GUI change in React. The layout and design of Gemini has been superior to Claude models in my opinion, at least at the time. Flash also works significantly faster.

And all of it is essentially free for now. I can even select Opus 4.6 in Antigravity, but I did not yet give it a try.

konart

1 month ago

[-]

> but without assessing whether the models are actually improving in practical use-cases

Which cases? Not trying to sound bad but you didn't even provide examples of the cases you are using Claude\Codex\Gemini for.

rocho

1 month ago

[-]

I find Gemini is outstanding at reasoning (all topics) and architecture (software/system design). On the other hand, Gemini CLI sucks and so I end up using Claude Code and Codex CLI for agentic work.

However, I heavily use Gemini in my daily work and I think it has its own place. Ultimately, I don't see the point of choosing the one "best" model for everything, but I'd rather use what's best for any given task.

1 month ago

[-]

Honestly doesn't feel like Google is targeting the agentic coding crowd so much as they are the knowledge worker / researcher / search-engine-replacement market?

Agree Gemini as a model is fairly incompetent inside their own CLI tool as well as in opencode. But I find it useful as a research and document analysis tool.

1 month ago

[-]

For my custom agentic coding setup, I use Claude Code derived prompts with Gemini models, primarily flash. It's night and day compared to Google's own agentic products, which are all really bad.

The models are all close enough on the benchmarks and I think people are attributing too much difference in the agentic space to the model itself. I strongly believe the difference is in all the other stuff, which is why Anthropic is far ahead of the competition. They have done great work with Claude Code, Cowork, and their knowledge share through docs & blog, bar none on this last point imo.

skerit

1 month ago

[-]

I'm glad someone else is finally saying this, I've been mentioning this left and right and sometimes I feel like I'm going crazy that not more people are noticing it.

Gemini can go off the rails SUPER easily. It just devolves into a gigantic mess at the smallest sign of trouble.

For the past few weeks, I've also been using XML-like tags in my prompts more often. Sometimes preferring to share previous conversations with `<user>` and `<assistant>` tags. Opus/Sonnet handles this just fine, but Gemini has a mental breakdown. It'll just start talking to itself.

Even in totally out-of-the-ordinary sessions, it goes crazy. After a while, it'll start saying it's going to do something, and then it pretends like it's done that thing, all in the same turn. A turn that never ends. Eventually it just starts spouting repetitive nonsense.

And you would think this is just because the bigger the context grows, the worse models tend to get. But no! This can happen well below even the 200,000 token mark.
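For reference, the XML-like tagging of a previous conversation mentioned above could be produced along these lines. This is a sketch under assumptions: the exact layout the commenter uses is not shown, and the helper name is hypothetical.

```python
from xml.sax.saxutils import escape


def format_transcript(turns):
    """turns: list of (role, text) pairs with role 'user' or 'assistant'."""
    parts = []
    for role, text in turns:
        if role not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {role!r}")
        # Escape &, < and > so message text cannot be mistaken for a tag.
        parts.append(f"<{role}>\n{escape(text)}\n</{role}>")
    return "\n".join(parts)


previous = [("user", "Is 3 < 5?"), ("assistant", "Yes.")]
prompt = "Here is an earlier conversation for context:\n\n" + format_transcript(previous)
print(prompt)
```

Framing the transcript explicitly as quoted context, rather than pasting bare tags, may help a model avoid continuing the dialogue with itself, though that is a guess and not something the thread confirms.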

reilly3000

1 month ago

[-]

Flash is (was?) better than Pro on these fronts.

PunchTornado

1 month ago

[-]

The biggest increase is LiveCodeBench Pro: 2887. The rest are in line with Opus 4.6 or slightly better or slightly worse.

shmoogy

1 month ago

[-]

but is it still terrible at tool calls in actual agentic flows?

jeffbee

1 month ago

[-]

Relatedly, Gemini chat seems to be if not down then extremely slow.

ETA: They apparently wiped out everyone's chats (including mine). "Our engineering team has identified a background process that was causing the missing user conversation metadata and has successfully stopped the process to prevent further impact." El Mao.

ponyous

1 month ago

[-]

Ran a bunch of 3D Modeling benchmarks on Gemini 3.1 vs Gemini 3.

Unsurprisingly 3.1 performs a bit better. But surprisingly it costs 2.6x as much ($0.14 vs. $0.37 per 3D Model Generation) and is 2.5x slower (1m 24s vs. 3m 28s).

To me it feels like "let's increase our thinking budget and call it an improved model!"
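The quoted multipliers can be reproduced from the per-generation figures above. A small sketch; only the numbers come from the comment, the helper is illustrative.

```python
def ratio(new: float, old: float) -> float:
    """How many times larger `new` is than `old`."""
    return new / old


cost_ratio = ratio(0.37, 0.14)                # USD per 3D model generation
time_ratio = ratio(3 * 60 + 28, 1 * 60 + 24)  # seconds per generation
print(f"cost: {cost_ratio:.1f}x, time: {time_ratio:.1f}x")
```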

josalhor

1 month ago

[-]

I speculated that 3 pro was 3.1... I guess I was wrong. Super impressive numbers here. Good job Google.

refulgentis

1 month ago

[-]

> I speculated that 3 pro was 3.1

josalhor

1 month ago

[-]

Sorry... I speculated that 3 deep think is 3.1 pro.. model names are confusing..

rahulroy

1 month ago

[-]

In the meantime, I'm trying to update Antigravity to use the latest version, but it just wouldn't update itself, nor would it let me use the 3.0 model. I restarted multiple times with the same result.

I tried telling this to the agent, and it keeps repeating the same phrase: "Gemini 3.1 Pro is not available on this version. Please upgrade to the latest version."

Congratulations on beating the benchmarks, but I wonder how much effort is devoted to improving DX?

Edit: It's updated now, I can confirm with "There are currently no updates available.". It still doesn't let me continue with the conversation. I'm able to create a new session though.

markerbrod

1 month ago

[-]

Blogpost: https://blog.google/innovation-and-ai/models-and-research/ge...

vinhnx

1 month ago

[-]

Model card https://deepmind.google/models/model-cards/gemini-3-1-pro/

dude250711

1 month ago

[-]

I hereby allow you to release models not at the same time as your competitors.

1 month ago

[-]

It is super interesting that this is the same thing that happened in November (ie all labs shipping around the same week 11/12-11/23).

zozbot234

1 month ago

[-]

They're just throwing a big Chinese New Year celebration.

vintermann

1 month ago

[-]

Could that actually be connected? There are a LOT of Chinese engineers and researchers working on all these models, I assume they would like to take some vacation days, and it makes sense to me to time releases around it.

brap

1 month ago

[-]

I had it coding autonomously for about an hour (including lots of tool wait time) on a difficult task, and it actually produced good results.

What’s most surprising is that I had it follow a strict loop/workflow and it did that perfectly. Normally these things go off the rails after a while with complex workflows. It’s something I have to usually enforce with some orchestration script and multiple agents, but this time it was just one session meticulously following orders.

Impressive, and saves a lot of time on building the orchestration glue.

impulser_

1 month ago

[-]

Seems like they actually fixed some of the problems with the model. The hallucination rate seems to be much better. Seems like they also tuned the reasoning; maybe that's where they got most of the improvements from.

whynotminot

1 month ago

[-]

The hallucination rate with the Gemini family has always been my problem with them. Over the last year they’ve made a lot of progress catching the Gemini models up to/near the frontier in general capability and intelligence, but they still felt very late 2024 in terms of hallucination rate.

Which made the Gemini models untrustworthy for anything remotely serious, at least in my eyes. If they’ve fixed this or at least significantly improved, that would be a big deal.

SubiculumCode

1 month ago

[-]

Maybe I haven't kept up with how ChatGPT and Claude are doing lately, but 6 months ago or so, I thought Gemini was leading on that front.

Murfalo

1 month ago

[-]

I like to think that all these pelican riding a bicycle comments are unwittingly iteratively creating the optimal cyclist pelican as these comment threads are inevitably incorporated in every training set.

alpineman

1 month ago

[-]

More like half of Google's AI team is hanging out on HN, and they can optimise for that outcome to get a good rep among the dev community.

1 month ago

[-]

Hello.

(I'm not aware of anyone doing this, but GDM is quite info-siloed these days, so my lack of knowledge is not evidence it's not happening)

alpineman

1 month ago

[-]

Hello.

Please push internally for more reliable tool use across Gemini models. Intelligence is useless if it can't be applied :)

Barbing

1 month ago

[-]

See: fish in bike front basket

conception

1 month ago

[-]

My current AI test. There was a BBS I was on in the 90s and there was this door game I hadn't seen anywhere else. I simply describe the BBS, where it was popular, its name, the year it was around, and the BBS game and a description of its mechanics, etc.

OpenAI and Google's Deep Research produce a very long, 100% made up report. If I question the AI on the report, they both admit they just made it up.

Claude just returns, "I couldn't find anything on the BBS or the game."

1 month ago

[-]

Doesn't show as available in gemini CLI for me. I have one of those "AI Pro" packages, but don't see it. Typical for Google, completely unclear how to actually use their stuff.

metavolvelabs

1 month ago

[-]

They crushed it with Gemini 3.1... especially when in Thinking Mode with Deep Think initiated. If you are working towards something with code, research etc. and hit a snag, run it by Gemini with these settings. Here's another KILLER trick: In Gemini Thinking mode select Nano Banana and have it put together a comprehensive slide with paragraph length text portions. It'll nail it.

ChrisArchitect

1 month ago

[-]

More discussion: https://news.ycombinator.com/item?id=47075318

0xcb0

1 month ago

[-]

I'm trying to find the information, is this available on the Gemini CLI script, or is this just the web front-end where I can use this new model?

onlyrealcuzzo

1 month ago

[-]

We've gone from yearly releases to quarterly releases.

If the pace of releases continues to accelerate - by mid 2027 or 2028 we're headed to weekly releases.

rubicon33

1 month ago

[-]

But actual progress seems to be slower. These models are releasing more often but aren’t big leaps.

gallerdude

1 month ago

[-]

We used to get one annual release which was 2x as good, now we get quarterly releases which are 25% better. So annually, we’re now at 2.4x better.
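The compounding behind that figure, as a quick sketch (it assumes four releases a year, each exactly 25% better than the last, which is the comment's own simplification):

```python
quarterly_gain = 1.25   # "25% better" per release
releases_per_year = 4
annual_gain = quarterly_gain ** releases_per_year
print(f"{annual_gain:.2f}x per year")
```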

wahnfrieden

1 month ago

[-]

GPT 5.3 (/Codex) was a huge leap over 5.2 for coding

rubicon33

1 month ago

[-]

Eh, sure, but marginally better if not the same as Claude 4.6, which itself was a small bump over Claude 4.5

1 month ago

[-]

Due to the increasing difficulty of scaling up training, it appears the gains are instead being achieved through better model training which appears to be working well for everyone.

mark_l_watson

1 month ago

[-]

Fine, I guess. The only commercial API I use to any great extent is gemini-3-flash-preview: cheap, fast, great for tool use and with agentic libraries. The 3.1-pro-preview is great, I suppose, for people who need it.

Off topic, but I like to run small models on my own hardware, and some small models are now very good for tool use and with agentic libraries - it just takes a little more work to get good results.

throwaway2027

1 month ago

[-]

Seconded. Gemini used to be trash and I used Claude and Codex a lot but gemini-3-flash-preview punches above its weight, it's decent and I rarely if ever run into any token limit either.

1 month ago

[-]

Thirded, I've been using gemini-3-flash to great effect. Anytime I have something more complicated, I give it to pro & flash to see what happens. Coin flip if flash is nearly equivalent (too many moving vars to be analytical at this point)

PlatoIsADisease

1 month ago

[-]

What models are you running locally? Just curious.

I am mostly restricted to 7-9B. I still like ancient early llama because it's pretty unrestricted without having to use an abliteration.

mark_l_watson

1 month ago

[-]

I experimented with many models on my 16G and 32G Macs. For less memory, qwen3:4b is good; for the 32G Mac, gpt-oss:20b is good. I like the smaller Mistral models like mistral:v0.3, and rnj-1:latest is a pretty good small reasoning model.

nurettin

1 month ago

[-]

I like to ask claude how to prompt smaller models for the given task. With one prompt it was able to make a low quantized model call multiple functions via json.
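On the receiving side, "multiple functions via json" might be handled roughly like this. The JSON shape, the tool names and the dispatcher are all assumptions made for illustration; the comment does not say what format was used.

```python
import json


def get_weather(city: str) -> str:
    return f"(stub) weather for {city}"


def add(a: float, b: float) -> float:
    return a + b


TOOLS = {"get_weather": get_weather, "add": add}


def run_calls(model_output: str) -> list:
    """Parse a JSON array of {"name": ..., "arguments": {...}} objects and run each call."""
    calls = json.loads(model_output)
    results = []
    for call in calls:
        func = TOOLS[call["name"]]  # KeyError means the model named an unknown tool
        results.append(func(**call["arguments"]))
    return results


# Hand-written stand-in for what a small model might emit.
output = '[{"name": "add", "arguments": {"a": 2, "b": 3}}, {"name": "get_weather", "arguments": {"city": "Oslo"}}]'
print(run_calls(output))
```

Keeping the expected shape this small and strict is one reason a carefully written prompt can be enough for a low-quantized model.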

pRusya

1 month ago

[-]

I'm using gemini.google.com/app with AI Pro subscription. "Something went wrong" in FF, works in Chrome.

Below is one of my test prompts that previous Gemini models were failing. 3.1 Pro did a decent job this time.

> use c++, sdl3. use SDL_AppInit, SDL_AppEvent, SDL_AppIterate callback functions. use SDL_main instead of the default main function. make a basic hello world app.

panarchy

1 month ago

[-]

I had it make a simple HTML/JS canvas game (think flappy bird) and while it did some things mildly better (and others noticeably worse) it still fell into the exact same traps as earlier models. It also had a lot of issues generating valid JS at parts and asking it what the code should be just made it endlessly generate the same exact incorrect code.

zokier

1 month ago

[-]

> Last week, we released a major update to Gemini 3 Deep Think to solve modern challenges across science, research and engineering. Today, we’re releasing the upgraded core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro.

So this is same but not same as Gemini 3 Deep Think? Keeping track of these different releases is getting pretty ridiculous.

1 month ago

[-]

Deep Think is a few 3.1 models working together. It was suspected last week that Deep Think was composed using the new 3.1 model.

1 month ago

[-]

3.1 == model

deep think == turning up thinking knob (I think)

deep research == agent w/ search

datakazkn

1 month ago

[-]

One underappreciated reason for the agentic gap: Gemini tends to over-explain its reasoning mid-tool-call in a way that breaks structured output expectations. Claude and GPT-4o have both gotten better at treating tool calls as first-class operations. Gemini still feels like it's narrating its way through them rather than just executing.

carbocation

1 month ago

[-]

I agree with this; it feels like the most likely tool to drop its high-level comments in code comments.

mixel

1 month ago

[-]

Google seems to really pull ahead in this AI race. For me personally they offer the best deal, although the software is not quite there compared to OpenAI or Anthropic (in regards to 1. web GUI, 2. agent-cli). I hope they can fix that in the future, and I think once Gemini 4 or whatever launches we will see a huge leap again

rubslopes

1 month ago

[-]

I don't understand this sentiment. It may hold true for other LLM use cases (image generation, creative writing, summarizing large texts), but when it comes to coding specifically, Google is *always* behind OpenAI and Anthropic, despite having virtually infinite processing power, money, and being the ones who started this race in the first place.

Until now, I've only ever used Gemini for coding tests. As long as I have access to GPT models or Sonnet/Opus, I never want to use Gemini. Hell, I even prefer Kimi 2.5 over it. I tried it again last week (Gemini Pro 3.0) and, right at the start of the conversation, it made the same mistake it's been making for years: it said "let me just run this command," and then did nothing.

My sentiment is actually the opposite of yours: how is Google *not* winning this race?

hobofan

1 month ago

[-]

> despite having virtually infinite processing power, money

Just because they have the money doesn't mean that they spend it excessively. OpenAI and Anthropic are both offering coding plans that are possibly severely subsidized, as they are more concerned with growth at all cost, while Google is more concerned with profitability. Google has the bigger warchest and could just wait until the other two run out of money rather than forcing the growth on that product line in unprofitable means.

Maybe they are also running much closer to their compute limits than the other ones too and their TPUs are already saturated with API usage.

jeanloolz

1 month ago

[-]

Agreed, also worth pointing out that Google still owns 14% of Anthropic + Anthropic is signing billion dollar scale deals with Google Cloud to train their models on their TPUs. So Claude success indirectly contributes to Google success. The AI race is not only about the frontier models.

mike97

1 month ago

[-]

> OpenAI and Anthropic are both offering coding plans that are possibly severely subsidized

So does Google. In fact, I believe their Antigravity limits for Opus and Sonnet on the $20 plan are higher than those of the CC $20 plan, and there is no weekly cap (or I couldn't hit it even with heavy usage). On top of that you have a separate limit for Gemini CLI and for other models in Antigravity.

hobofan

1 month ago

[-]

Is that so? I haven't personally used Antigravity, I just heard a lot of people complaining as recently as ~1 month ago that they hit the rate limits very quickly by e.g. it accidentally reading in too large files.

eknkc

1 month ago

[-]

I hope they fail.

I honestly do not wish Google to have the best model out there and be forced to use their incomprehensible subscription / billing / project management whatever shit ever again.

I don’t know what their stuff costs. I don’t know why I would use Vertex or AI Studio. I don’t know what is included in my subscription and what is billed per use.

I pray that whatever they build fails and burns.

otherme123

1 month ago

[-]

They all suck. OpenAI ignores scanning limits and disabled routes in robots.txt; after a 429 "Too Many Requests" they retry the same URL half a dozen times from different IPs in the next couple of minutes, and they once DoS'ed my small VPS trying to do a full scan of sitemaps.xml in less than one hour, trying and retrying if any endpoint failed.

Google and others at least respect both robots.txt and 429s. They invested years scanning the whole internet, so they can now train on what they have stored on their servers. OpenAI seems to assume that MY resources are theirs.
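A minimal sketch of what respecting robots.txt and 429s could look like on the crawler side, in Python. The helper names `is_allowed` and `retry_delay` and the backoff constants are assumptions for illustration only; this is not any vendor's actual crawler logic.

```python
import urllib.robotparser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as plain text."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def retry_delay(retry_after, attempt, base=1.0, cap=300.0):
    """Seconds to wait after a 429 before retrying the same URL.

    A numeric Retry-After header wins when present; otherwise fall back
    to capped exponential backoff (base * 2**attempt, at most `cap`).
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    return min(cap, base * 2 ** attempt)
```

The point of the second helper is that a retry after a 429 should wait longer each time, instead of hitting the same URL again from another IP a few seconds later.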

toraway

1 month ago

[-]

For a personal plan to use premium Gemini AI features, or for agentic development with Gemini CLI/Antigravity, the billing is no more or less complicated than Claude Code or Codex CLI.

You pay for the $20/mo Google AI Pro plan with a credit card via the normal personal billing flow like you would for a Google One plan without any involvement of Google Cloud billing or AI Studio. Authorize in the client with your account and you're good to go.

(With the bundled drive storage on AI Pro I'm just paying a few bucks more than I was before so for me it's my least expensive AI subscription excluding the Z.ai ultra cheap plan).

Or, just like with Anthropic or OpenAI, it's a separate process for billing/credits for an API key targeted at a developer audience. Which I don't need or use for Gemini CLI or Antigravity at all, it's a one step "click link to authorize with your Google Account" and done.

You could decide to use an API key for usage based billing instead (just like you could with Claude Code) but that's entirely unnecessary with a subscription.

Sure, for the API anything involving a hyperscaler cloud is going to have a higher complexity floor with legacy cruft here and there, but for individual subscriptions that's irrelevant, and it's pretty much as straightforward a click-and-pay flow as you'd find anywhere else.

1 month ago

[-]

After using AI Studio fine for months, suddenly my billing was cancelled, and a week later I'm still waiting for it to be re-enabled.

I'm at a total loss as to how Google can function this way. My only explanation is that they somehow have a Philosopher's Stone they generate wealth with, because they sure as hell make it impossible to give them money.

dybber

1 month ago

[-]

Eventually the models will generally be so good that the competition moves from the best model to the best user experience, and here I think we can expect others to win, e.g. Microsoft with GitHub and VS Code.

eknkc

1 month ago

[-]

That's my hope, but Google has unlimited cash to throw at model development and can basically burn more cash than OpenAI and Anthropic combined. That might tip the scale in the long run.

rishabhaiover

1 month ago

[-]

I think we're past the point where benchmarks hold real value. All models are above a certain threshold of intelligence but Gemini somehow borrows the worst of both worlds. It's neither good with long-horizon coding tasks nor does it offer a likable personality (like Claude which is much more beloved)

azuanrb

1 month ago

[-]

The CLI needs work, or they should officially allow third-party harnesses. Right now, the CLI experience is noticeably behind other SOTA models. It actually works much better when paired with Opencode.

But with accounts reportedly being banned over ToS issues, similar to Claude Code, it feels risky to rely on it in a serious workflow.

tskulbru

1 month ago

[-]

Off-topic, but what are people using to create those video animations seen in the "ISS orbit tracking dashboard" example? Looks pretty nice! I'm guessing Google uses a whole building of UX people, but I've seen similar videos from small indie startups too, or even one-person SaaS.

MASNeo

1 month ago

[-]

At the risk of being unpopular: Gemini 3.0 Pro made a huge difference for me when I moved some workflows to Antigravity, especially compared to ChatGPT.

The latest update? I simply don’t care. I am not paid to evaluate models, I am paid to build. Not sure 4 benchmark points are making the difference.

6d6b73

1 month ago

[-]

In these discussions we see some people hating the models, while others love them. What I find interesting is that this is exactly how we feel about other people - some people will love working with you while others can't stand being in the same room you're in.

barfingclouds

1 month ago

[-]

I’m no tech expert like a lot of people here, but I find Gemini 3.0 insanely good for my regular daily questions. Hoping this one is great too. I’m kind of at the point where many answers are essentially perfect and I don’t know if I need much more

clhodapp

1 month ago

[-]

There's a very short blog post up: https://blog.google/innovation-and-ai/models-and-research/ge...

hsaliak

1 month ago

[-]

The eventual nerfing gives me pause. Flash is awesome. What we really want is gemini-3.1-flash :)

d4rkp4ttern

1 month ago

[-]

Yes people are too fixated on just the model. The real question for coding use cases is - does Gemini X + Gemini CLI outperform Opus + Claude Code? With 3.0 the answer was no. I won’t waste time checking 3.1 until I hear otherwise.

ChrisArchitect

1 month ago

[-]

Blog post: https://blog.google/innovation-and-ai/models-and-research/ge...

1 month ago

[-]

Great model until it gets nerfed. I wish they had a higher paid tier for using a non-nerfed model.

Mond_

1 month ago

[-]

Bad news, John Google told me they already quantized it immediately after the benchmarks were done and it sucks now.

I miss when Gemini 3.1 was good. :(

spyckie2

1 month ago

[-]

I think there is a pattern: it always gets nerfed in the few weeks before a new model launches. Probably because they are throwing a bunch of compute at the new model.

1 month ago

[-]

Yeah, maybe that, but at least let us know about it, or have dynamic limits? Nerfing breaks trust. Though I am not sure if they actually nerf it intentionally; I haven't heard that from any credible source. I did experience it in my workflow, though.

1 month ago

[-]

What are you talking about?

n4pw01f

1 month ago

[-]

I created a nice harness and visual workflow builder for my Gemini agent chains, works very well. I did this so it would create code the way I do, that is very editable.

In contrast, the vs code plugin was pretty bad, and did crazy things like mix languages

attentive

1 month ago

[-]

A lot of Gemini bashing. But Flash 3.0 with opencode is a reasonably good and reliable coder.

I'd rate it between haiku 4.5 (also pretty good for a price) and sonnet. Closer to sonnet.

Sure, if I am not cost-sensitive I'd run everything in opus 4.6 but alas.

quacky_batak

1 month ago

[-]

I’m keen to know how and where you are using Gemini.

Anthropic is clearly targeted at developers, and OpenAI is the general go-to AI model. Who is the target demographic for Gemini models? I know that they are good and Flash is super impressive, but I’m curious.

jdc0589

1 month ago

[-]

I use it as my main platform right now, both for work/SWE stuff and personal stuff. It works pretty well, and they have the full suite of tools I want, from general LLM chat to NotebookLM to Antigravity.

My main use-cases outside of SWE generally involve the ability to compare detailed product specs and come up with answers/comparisons/etc... Gemini does really well for that, probably because of the deeper google search index integration.

Also I got a year of pro for free with my phone....so thats a big part.

ggregoire

1 month ago

[-]

I use it in Google Search. For example yesterday I typed in Google "postgres generate series 24 hour" and this morning "ffmpeg convert mp4 to wav". Previously I would have clicked on the first StackOverflow result (RIP), now I just take it from the Gemini summary (I'd say 95% of the time it's correct for basic programming language questions. I remember some hallucinations about psycopg3 and date-fns tho. As usual with AI, you need to already know the answer, at least partially, to detect the bs).

Also what's great about Gemini in Google Search is that the answer comes with several links, I use them sometimes to validate the correctness of the solution, or check how old the solution is (I've never used chatGPT so I don't know if chatGPT does it).

hunta2097

1 month ago

[-]

I use the Gemini web interface just as I would ChatGPT. They also have coding environment analogues of Claude-Code in Anti-gravity and Gemini-CLI.

When you sign up for the pro tier you also get 2TB of storage, Gemini for workspace and Nest Camera history.

If you're in the Google sphere it offers good value for money.

dinosor

1 month ago

[-]

I find Gemini to be the best at travel planning and at storytelling about geographical places. For a road trip, I tried all three mainstream providers and I liked Gemini (also personal preference, because Gemini took a verbose approach instead of the bullet points from the others) for its responses, the ways it discovered stories about places I wanted to explore, the places it suggested, and the things it gave me to consider about those places along the route.

1 month ago

[-]

Gemini has an obvious edge over its competitors in one specific area: Google Search. The other LLMs do have a Web Search tool but none of them are as effective.

fatherwavelet

1 month ago

[-]

I feel like Gemini 3 was incredible on non-software/coding research. I have learned so much systems biology the last two months it blows my mind.

I only started using Opus 4.6 this week. Sonnet seems much better for having a long conversation. Gemini is good for knowledge retrieval, but I think Opus 4.6 has caught up. The biggest thing that made Gemini worth it for me the last 3 months is that I crushed it with questions. I wouldn't have gotten even 10% of that usage out of Opus before being made to slow down.

I have a deep research going right now on 3.1 for the first time and I honestly have no idea how I am going to tell if it is better than 3.

It seems like Gemini wasn't as good at agentic coding, but when just asking it to write a function, I think it failed to one-shot what I asked only twice, and then it fixed the problem on the next prompt.

I haven't logged in to bother with chatGPT in about 3 months now.

1 month ago

[-]

I am a professional software developer who has been programming for 40 years (C, C++, Python, assembly, any number of other languages). I work in ML (infrastructure, not research) and spent a decade working at Google.

In short, I consider Gemini to be a highly capable intern (grad student level) who is smarter and more tenacious than me, but also needs significant guidance to reach a useful goal.

I used Gemini to completely replace the software stack I wrote for my self-built microscope. That includes:

writing a brand new ESP32 console application for controlling all the pins of my ESP32 that drives the LED illuminator. It wrote the entire ESP-IDF project and did not make any major errors. I had to guide with updated prompts a few times but otherwise it wrote the entire project from scratch and ran all the build commands, fixing errors along the way. It also easily made a Python shared library so I can just import this object in my Python code. It saved me ~2-3 days of working through all the ESP-IDF details, and did a better job than I would have.

writing a brand new C++-based Qt camera interface (I have a camera with a special SDK that allows controlling strobe and trigger and other details. It can do 500FPS). It handled all the concurrency and message passing details. I just gave it the SDK PDF documentation for the camera (in mixed english/chinese), and asked it to generate an entire project. I had to spend some time guiding it around making shared libraries but otherwise it wrote the entire project from scratch and I was able to use it to make a GUI to control the camera settings with no additional effort. It ran all the build commands and fixed errors along the way. Saved me another 2-3 days and did a better job than I could have.

Finally, I had it rewrite the entire microscope stack (Python with Qt) using the two drivers I described above, along with complex functionality like compositing multiple images during scanning, video recording during scanning, measurement tools, computer vision support, and a number of other features. This involved a lot more testing on my part, and updating prompts to guide it towards my intended destination (a fully functional replacement of my original self-written prototype). When I inspect the code, it definitely did a good job on some parts, while it came up with non-ideal solutions for some problems (for example, it does polling when it could use event-driven callbacks). This saved literally weeks' worth of work that would have been a very tedious slog.
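To make the polling vs. event-driven distinction concrete, here is a small Python sketch. `FrameSource` is a hypothetical stand-in for a camera driver, not the actual SDK or the generated code; both waiters return the same frame, but the polling one wakes up every `interval` seconds to check while the callback one sleeps until it is notified.

```python
import threading
import time


class FrameSource:
    """Hypothetical stand-in for a driver that produces frames."""

    def __init__(self):
        self._callbacks = []
        self.latest = None

    def on_frame(self, callback):
        # Register a function to be called with every new frame.
        self._callbacks.append(callback)

    def publish(self, frame):
        self.latest = frame
        for callback in self._callbacks:
            callback(frame)


def wait_by_polling(source, interval=0.01, timeout=1.0):
    """Polling: repeatedly check shared state until a frame shows up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if source.latest is not None:
            return source.latest
        time.sleep(interval)
    return None


def wait_by_callback(source, timeout=1.0):
    """Event-driven: block on an Event that the frame callback sets."""
    ready = threading.Event()
    received = []

    def handle(frame):
        received.append(frame)
        ready.set()

    source.on_frame(handle)
    ready.wait(timeout)
    return received[0] if received else None
```

In this sketch the callback stays registered after `wait_by_callback` returns; a real driver would normally offer a way to unregister it.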

From my perspective, it's worked extremely well: doing what I wanted in less time than it would take me (I am a bit of a slow programmer, and I'm doing this in hobby time) and doing a better job (With appropriate guidance) than I could have (even if I'd had a lot of time to work on it). This greatly enhances my enjoyment of my hobby by doing tedious work, allowing me to spend more time on the interesting problems (tracking tardigrades across a petri dish for hours at a time). I used gemini pro 3 for this- it seems to do better than 2.5, and flash seemed to get stuck and loop more quickly.

I have only lightly used other tools, such as ChatGPT/Codex and have never used Claude. I tend to stick to the Google ecosystem for several reasons- but mainly, I think they will end up exceeding the capabilities of their competitors, due to their inherent engineering talent and huge computational resources. But they clearly need to catch up in a lot of areas- for example, the VS Code Gemini extension has serious problems (frequent API call errors, messed up formatting of code/text, infinite loops, etc).

aberoham

1 month ago

[-]

Wow, you have to try claude code with Opus-4.6..

1 month ago

[-]

I agree, but I don't have a subscription.

The remaining technical challenge I have is related to stage positioning- in my system, it's important that all the image frames we collect are tagged with the correct positions. Due to some technical challenges, right now the stage positions are slightly out of sync with the frames, which will be a fairly tricky problem to solve. It's certainly worth trying all the major systems to see what they propose.
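One possible approach to this kind of tagging problem, sketched in Python under a big assumption: that frames and stage readings both carry timestamps from a shared clock. The function name and the `(timestamp, x, y)` tuple layout are hypothetical. Each frame's position is then estimated by linear interpolation between the two stage readings that bracket its timestamp.

```python
from bisect import bisect_left


def interpolate_position(samples, t):
    """Estimate the stage (x, y) at time t.

    `samples` is a non-empty list of (timestamp, x, y) tuples sorted by
    timestamp. Times outside the sampled range clamp to the nearest sample.
    """
    times = [s[0] for s in samples]
    i = bisect_left(times, t)
    if i == 0:
        return samples[0][1:]
    if i == len(samples):
        return samples[-1][1:]
    t0, x0, y0 = samples[i - 1]
    t1, x1, y1 = samples[i]
    w = (t - t0) / (t1 - t0)  # fraction of the way from sample i-1 to sample i
    return (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
```

This only helps if the timestamps themselves are trustworthy; a fixed latency between the camera and the stage controller would still need to be measured and subtracted first.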

jug

1 month ago

[-]

I personally use it as my general purpose and coding model. It's good enough for my coding tasks most of the time, has very good and rapid web search grounding that makes the Google index almost feel like part of its training set, and Google has a family sharing plan with individual quotas for Google AI Pro at $20/month for 5 users which also includes 2 TB in the cloud. Family sharing is a unique feature for Gemini 3 Flash Thinking (300 prompts per day and user) & Pro (100 prompts per day and user).

mehagar

1 month ago

[-]

I use Gemini for personal stuff such as travel planning and research on how to fix something, which product to buy, etc. My company has a Pro subscription, so I use that instead of ChatGPT.

epolanski

1 month ago

[-]

Various friends of mine work in non-technology companies (banking, industries, legal, Italy) and in pretty much all of them there's Gemini enterprise + NotebookLM.

In all of them the approach is: this is the solution, now find problems you can apply it to.

thornewolf

1 month ago

[-]

I have swapped to using gemini over chatgpt for casual conversation and question answering. there are some lacking features in the app but i get faster and more intelligent responses.

cherryteastain

1 month ago

[-]

I switched to it for my personal subscription because on discount it was less than half the price of ChatGPT Plus/Claude Pro

esafak

1 month ago

[-]

I'd use it for planning, knowledge, and anything visual.

1 month ago

[-]

I use gemini for everything because I trust google to keep the data I send them safe, because they know how to run prod at scale, and they are more environmentally friendly than everyone else (tpu,us-central1).

This includes my custom agent / copilot / cowork (which uses vertex ai and all models therein). This is where I do more searching now (with genAi grounding) I'm about to work on several micro projects that will hold Ai a little differently.

All that being said, Google AI products suck hard. I hate using every one of them. This is more a reflection on the continued degradation of PM/Design at Big G, which started before AI but has gotten worse at an accelerating pace since. I support removing Logan from the head of this shit show.

disclaimer: long time g-stan, not so stan any more

robviren

1 month ago

[-]

I have run into a surprising number of basic syntax errors on this one. At least in the few runs I have tried it's a swing and a miss. Wonder if the pressure of the Claude release is pushing these stop gap releases.

mrcwinn

1 month ago

[-]

It's fascinating to watch this community react so positively to Google model releases and so negatively toward OpenAI's. You all do understand that an ad revenue model is exactly where Google will go, right?

sidrag22

1 month ago

[-]

It's all so astroturfed that it's hard to tell. I got the opposite impression, though. It seemed like OpenAI had more fake positivity towards the top that I tried to skim, and this had way less and a lot of complaints.

I'm biased, I don't trust either of them, so perhaps I'm just looking hard for the hate and attributing all the positive stuff to advertising.

jeffbee

1 month ago

[-]

Gemini already drives ad revenue. If the conversation goes in that direction it will use product search results with the links attributable to Google.

webtcp

1 month ago

[-]

An enemy is better than a traitor

mrcwinn

1 month ago

[-]

Quite a low bar. And in any case, isn’t Google already a traitor to its original mission statement?

syspec

1 month ago

[-]

Does anyone know if this is in GA immediately or if it is in preview?

On our end, Gemini 3.0 Preview was very flakey (not model quality, but as in the API responses sometimes errored out), making it unreliable.

Does this mean that 3.0 is now GA at least?

denysvitali

1 month ago

[-]

Where is Simon's pelican?

https://gemini.google.com/share/717be5f9b184

Mashimo

1 month ago

[-]

It's also quite impressive with SVG animations.

> Create an SVG animation of a Beaver sitting next to a recordplayer and a create of records, his eyes follows the mouse curser.

codethief

1 month ago

[-]

Not Simon's but here is one: https://news.ycombinator.com/item?id=47075709

denysvitali

1 month ago

[-]

Thank you!

1 month ago

[-]

Please no, let's not.

[1]:https://news.ycombinator.com/item?id=47041836

0x110111101

1 month ago

[-]

Relevant: Scanned diaries from 1945 of USFS Ranger. Had this transcribed in Claude.

Drblessing

1 month ago

[-]

Gemini is the smartest model currently available. It is the only model out of the big ones that correctly identifies the specific versions of superheroes in a collage I tested them with.

siliconc0w

1 month ago

[-]

Google has a hugely valuable dataset of decades of changes from top-tier software engineers, but it's so proprietary they can't use it to train their external models.

Grisu_FTP

1 month ago

[-]

Somehow the models apparently get better and better every week, but every time i try to use them they get worse.

Am I the issue? Am i just misremembering the early times because it was a new thing?

Mashimo

1 month ago

[-]

You are holding it wrong!

No but for real, what is your usecase?

Do you actually think something like GPT-3 was best?

Grisu_FTP

1 month ago

[-]

I dont have a real special usecase, i just use it whenever i think it will give better results than googling or thinking or i dont feel like getting annoyed by cookie popups.

And i dont think gpt3 was best, but it felt like it actually listened. Now i tell it: "You did this and this wrong, i specifically told u the exact opposite. Can you please do what i asked you?" And then it says something like: "Oh yes my bad, you are right and very very smart to have caught that you must be a super genius. I will now do what you asked me" Does the same wrong thing again. and again and again.

I ask it to fix a mistake, it tells me it fixed it, gives 1:1 the same thing with more errors.

It also feels like it forgets mid convo way faster than it did.

Mashimo

1 month ago

[-]

> I ask it to fix a mistake, it tells me it fixed it, gives 1:1 the same thing with more errors.

> It also feels like it forgets mid convo way faster than it did.

Mhh, I don't observe this. Hard to say.

You probably know this already, but be sure not to reuse an AI conversation across different contexts (having a single chat for both cooking and coding is a no-no). Often starting a new chat is better.

If it forgets what you said it sounds a bit like you use one chat for too long, or you use a too small model (fast, air, haiku, nano etc.)

dragochat

1 month ago

[-]

...you sound like a typical opus-person :P Just use anthropic's flagships if you want good instruction following, focus in long convos, and proper understanding of guidance-when-wrong.

holografix

1 month ago

[-]

I think it begs the question:

Is Gemini meant to be a revenue-making product or strictly a cost centre to defend against Search and Ads erosion by OpenAI?

Why does the Gemini web app not support MCP Servers?

__jl__

1 month ago

[-]

Another preview release. Does that mean the recommended model by Google for production is 2.5 Flash and Pro? Not talking about what people are actually doing but the google recommendation. Kind of crazy if that is the case

jeffybefffy519

1 month ago

[-]

Someone needs to make an actually good benchmark for LLMs that matches real-world expectations; there's more to benchmarks than accuracy against a dataset.

casey2

1 month ago

[-]

We don't need real-world benchmarks; if the models were good for real-world tasks, people would use them. We need scientific benchmarks that tease out the nature of intelligence. There are plenty of unsaturated benchmarks. Solving chess using "mostly" language modeling is still an open problem. Beyond that, there's creating a machine that can explain why a move is likely optimal at some depth, or an AI that can predict the output of another AI.

robotpepi

1 month ago

[-]

this reminds me of that joke of someone saying "it's crazy that we have ten different standards for doing this", and then there're 11 standards

1 month ago

[-]

Xkcd 927

Jirach05

1 month ago

[-]

Can anyone explain why these models decrease in performance on this "MCRC v2 (8-needle)" long context benchmark when thinking is turned on?

alwinaugustin

1 month ago

[-]

I use Gemini when I need to write something in my native language, Malayalam, or for translation. It works very well for writing in Indian regional languages.

SrFil

1 month ago

[-]

For me, Gemini has been by far the best model for document understanding tasks. I look forward to seeing how much more capable this version is.

seizethecheese

1 month ago

[-]

I use Gemini flash lite in a side project, and it’s stuck on 2.5. It’s now well behind schedule. Any speculation as to what’s going on?

foruhar

1 month ago

[-]

Gemini-3.0-flash-preview came out right away with the 3.0 release and I was expecting 3.0-flash-lite before a bump on the pro model. I wonder if they have abandoned that part of the Pareto/price-performance.

ismailmaj

1 month ago

[-]

3.1 feels to me like a 3.0 that takes a long time to think; it didn't feel like a leap in raw intelligence the way 2.5 Pro was.

johnwheeler

1 month ago

[-]

I know Google has Antigravity, but do they have anything like Claude Code in terms of user interface, i.e. basically a terminal TUI?

alooPotato

1 month ago

[-]

https://github.com/google-gemini/gemini-cli

johnwheeler

1 month ago

[-]

Thanks!!

eric15342335

1 month ago

[-]

My first impression is that the model sounds slightly more human and a little more praising. Still comparing the ability.

nautilus12

1 month ago

[-]

Ok, why don't you work on getting 3.0 out of preview first? 10 min response time is pretty heinous

mucai82

1 month ago

[-]

I agree. According to Google's terms you are not allowed to use the preview model for production use cases. And 3.0 has been in preview for a loooong time now :(

matrix2596

1 month ago

[-]

Gemini 3.1 Pro is based on Gemini 3 Pro

skerit

1 month ago

[-]

Lol, and this line:

> Geminin 3.1 Pro can comprehend vast datasets

Someone was in a hurry to get this out the door.

getcrunk

1 month ago

[-]

Gemini is so stubborn, and often doesn’t follow explicit and simple instructions. So annoying

atleastoptimal

1 month ago

[-]

Writing style wise, 3.1 seems very verbose, but somehow less creative compared to 3.

1 month ago

[-]

It's been hugged to death. I keep getting "Something went wrong".

yuvalmer

1 month ago

[-]

Gemini 3.0 Pro is a bad model for its class. I really hope 3.1 is a leap forward.

msavara

1 month ago

[-]

Somehow doesn't work for me :) "An internal error has occurred"

1 month ago

[-]

Appears the only difference to 3.0 Pro Preview is Medium reasoning. Model naming has long gone from even trying to make sense, but considering 3.0 is still in preview itself, increasing the number for such a minor change is not a move in the right direction.

GrayShade

1 month ago

[-]

Maybe that's the only API-visible change, saying nothing about the actual capabilities of the model?

1 month ago

[-]

> increasing the number for such a minor change is not a move in the right direction

A .1 model number increase seems reasonable for more than doubling ARC-AGI 2 score and increasing so many other benchmarks.

What would you have named it?

1 month ago

[-]

My issue is that we haven't even gotten the release version of 3.0, that is also still in Preview, so may stick with 3.0 till that has been deemed stable.

Basically, what does the word "Preview" mean, if newer releases happen before a Preview model is stable? In prior Google models, Preview meant that there'd still be updates and improvements to said model prior to full deployment, something we saw with 2.5. Now, there is no meaning or reason for this designation to exist if they forgo a 3.0 still in Preview for model improvements.

1 month ago

[-]

Given the pace at which AI is improving and that it doesn't give the exact same answers under many circumstances, is the [in]stability of "preview" a concern?

GMail was in "beta" for 5 years.

1 month ago

[-]

Should have clarified initially what I meant by stable, especially because it isn't well known how these terms are defined for Gemini models. I'm not talking about getting consistent output from a non-deterministic model, but stable from a usage perspective and in the way Google uses the word "stable" to describe their model deployments [0]. "Preview" in regard to Gemini models means a few very specific restrictions, including far stricter rate limits and a very tight 14-day deprecation window, making them models one cannot build on.

That is why I'd prefer for them to finish the rollout of an existing model before starting work on a dedicated new version.

[0] https://ai.google.dev/gemini-api/docs/models

1 month ago

[-]

ChatGPT 4.5 was never released to the public, but it is widely believed to be the foundation the 5.x series is built on.

Wonder how GP feels about the minor bumps for other model providers?

1 month ago

[-]

Minor version bumps are good and I want model providers to communicate changes. The issue I am having is that Gemini "preview" class models have different deprecation timelines and rate limits, making them impossible to rely on for professional use cases. That's why I'd prefer they finish the 3.0 rollout prior to putting resources into deploying a second "preview" class model.

For a stable deployment, Google needs a sufficient amount of hardware to guarantee inference and having two Pro models running makes that even more challenging: https://ai.google.dev/gemini-api/docs/models

1 month ago

[-]

Sorry, but you come off as an armchair devops saying things like this. Google is fine, they know more than anyone else about how to run Ai at scale.

"preview" != GA, sounds like you need to adjust your expectations

argsnd

1 month ago

[-]

I disagree. Incrementing the minor number makes so much more sense than “gemini-3-pro-preview-1902” or something.

jannyfer

1 month ago

[-]

According to the blog post, it should be also great at drawing pelicans riding a bicycle.

andrewstuart

1 month ago

[-]

Gemini current version drops most of the code every time I try to use it.

Useless.

naiv

1 month ago

[-]

ok , so they are scared that 5.3 (pro) will be released today/tomorrow and blow it out of the water and rushed it while they could still reference 5.2 benchmarks.

PunchTornado

1 month ago

[-]

I don't think models blow other models out of the water anymore. We have the big 3, which are neck and neck in most benchmarks, and then the rest. I doubt that 5.3 will blow the others away.

scld

1 month ago

[-]

easy now

LZ_Khan

1 month ago

[-]

biggest problem is that it's slow. also safety seems overtuned at the moment. getting some really silly refusals. everything else is pretty good.

mustaphah

1 month ago

[-]

Google is terrible at marketing, but this feels like a big step forward.

As per the announcement, Gemini 3.1 Pro scores 68.5% on Terminal-Bench 2.0, which makes it the top performer on the Terminus 2 harness [1]. That harness is a "neutral agent scaffold," built by researchers at Terminal-Bench to compare different LLMs in the same standardized setup (same tools, prompts, etc.).

It's also taken the top model place on both the Intelligence Index & Coding Index of Artificial Analysis [2], but on their Agentic Index, it's still lagging behind Opus 4.6, GLM-5, Sonnet 4.6, and GPT-5.2.

---

[1] https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...

[2] https://artificialanalysis.ai

1 month ago

[-]

Benchmarks aren't everything.

Gemini consistently has the best benchmarks but the worst actual real-world results.

Every time they announce the best benchmarks I try again at using their tools and products and each time I immediately go back to Claude and Codex models because Google is just so terrible at building actual products.

They are good at research and benchmaxxing, but the day to day usage of the products and tools is horrible.

Try using Google Antigravity and you will not make it an hour before switching back to Codex or Claude Code, it's so incredibly shitty.

mustaphah

1 month ago

[-]

That's been my experience too; can't disagree. Still, when it comes to tasks that require deep intelligence (esp. mathematical reasoning [1]), Gemini has consistently been the best.

[1] https://arxiv.org/abs/2602.10177

gregorygoc

1 month ago

[-]

What’s so shitty about it?

trilogic

1 month ago

[-]

Humanity's Last Exam 44%, SciCode 59, and that 80, and this 78, but not 100% ever.

Would be nice to see that these models (Plus, Pro, Super, God mode) can score 100% on one benchmark. Am I missing something here?

kuprel

1 month ago

[-]

Why don't they show Grok benchmarks?

andxor

1 month ago

[-]

They've fallen way behind.

kuprel

1 month ago

[-]

GPT 5.2 loses at everything but they included that

andxor

1 month ago

[-]

Who are they supposed to compare it to? I'm not sure what makes you think that Grok is even remotely comparable to the frontier models right now.

rudhdb773b

1 month ago

[-]

Grok has been and still is the best at incorporating search.

4.20 with its 4 agents puts it back at the top for reasoning as well. As soon as it's added to the API, the benchmarks should show that.

andxor

1 month ago

[-]

I agree it's good for researching current events because of the integration with X.

jdthedisciple

1 month ago

[-]

Why should I be excited?

BMFXX

1 month ago

[-]

Just wish I could get the 2.5 daily limit above 1000 requests easily. Driving me insane...

https://www.google.com/appsstatus/dashboard/incidents/nK23Zs...

hn_throw2025

1 month ago

[-]

Yeah great, now can I have my pinned chats back please?

1 month ago

[-]

I hope to have a great next two weeks before it gets nerfed.

unsupp0rted

1 month ago

[-]

I've found Google (at least in AI Studio) are the only provider NOT to nerf their models after a few weeks

1 month ago

[-]

I don't use AI Studio for my work. I used Antigravity/Gemini CLI, and 3 Pro was great for a few weeks, but now it's worse than 3 Flash or any smaller competitor model that is rated lower on benchmarks.

scrlk

1 month ago

[-]

IME, they definitely nerf models. gemini-2.5-pro-exp-03-25 through AI Studio was amazing at release and steadily degraded. The quality started tanking around the time they hid CoT.

himata4113

1 month ago

[-]

The visual capabilities of this model are frankly kind of ridiculous, what the hell.

lysecret

1 month ago

[-]

Please, I need 3 in GA…

leecommamichael

1 month ago

[-]

Whoa, I think Gemini 3 Pro was a disappointment, but Gemini 3.1 Pro is definitely the future!

throwaw12

1 month ago

[-]

Can we switch from Claude Code to Google yet?

Benchmarks are saying: just try

But real world could be different

foruhar

1 month ago

[-]

My sense is that the Gemini models are very capable but the Gemini CLI experience is subpar compared to Claude Code and Codex. I'm guessing that it's the harness, since it can get confused, fall into doom loops, and generally lose the plot in a way that the model does not in Gemini Studio or the Gemini app.

I think a bunch of these harnesses are open source so it surprises me that there can be such a gulf between them.

1 month ago

[-]

It's not just the tooling. If you use Gemini in opencode it malfunctions in similar ways.

I haven't tried 3.1 yet, but 3 is just incompetent at tool use. In particular in editing chunks of text in files, it gets very confused and goes into loops.

The model also does this thing where it degrades into loops of nonsense thought patterns over time.

For shorter sessions where it's more analysis than execution, it is a strong model.

We'll see about 3.1. I don't know why it's not showing in my gemini CLI as available yet.

dana321

1 month ago

[-]

It's not just subpar, it's not even sub-sub-par.

It goes into loops and never completes a task 8 times out of 10 that I've used it.

pickle-pixel

1 month ago

[-]

Does it still crash out after a couple of prompts?

taytus

1 month ago

[-]

Another preview model? Why does Google keep doing this?

solarisos

1 month ago

[-]

The speed of these 3.1 and Preview releases is starting to feel like the early days of web frameworks. It’s becoming less about the raw benchmarks and more about which model handles long-context 'hallucination' well enough to be actually used in a production pipeline without constant babysitting.

techgnosis

1 month ago

[-]

I'd love a new Gemini agent that isn't written with Node.js. Not sure why they think that's a good distribution model.

CamperBob2

1 month ago

[-]

(Shrug) Ask it to write one!

jcims

1 month ago

[-]

Pelican on a bicycle in drawio - https://imgur.com/a/tNgITTR

(FWIW I'm finding a lot of utility in LLMs doing diagrams in tools like drawio)

pqdbr

1 month ago

[-]

How are you prompting it to draw diagrams in drawio?

jcims

1 month ago

[-]

Here's the chat I used for the drawing - https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

Save the xml, import to drawio

jcims

1 month ago

[-]

Sometimes it helps to also provide a drawio file that has the elements you want (e.g., cloud service icons or whatever), but you just feed it the content you want diagrammed and let it eat.

Even if it's not completely correct, it usually creates something that's much closer to complete than a blank page.

ac29

1 month ago

[-]

Drawio drawings are just XML, so it's possible it can generate that directly.
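To illustrate the "just XML" point, here is a minimal sketch of the kind of file involved, assuming the uncompressed `mxfile` / `mxGraphModel` layout that drawio can import. The function name `build_drawio`, the cell ids, the coordinates, and the style strings are all made up for the example; real diagrams (and whatever a model emits) will usually carry more attributes.

```python
import xml.etree.ElementTree as ET


def build_drawio(labels):
    """Return drawio-style XML text: one rounded box per label, chained by arrows.

    Hypothetical helper for illustration only, not part of any drawio tooling.
    """
    mxfile = ET.Element("mxfile")
    diagram = ET.SubElement(mxfile, "diagram", {"name": "Page-1"})
    model = ET.SubElement(diagram, "mxGraphModel")
    root = ET.SubElement(model, "root")

    # Two structural cells come first: the root (id 0) and a default layer (id 1).
    ET.SubElement(root, "mxCell", {"id": "0"})
    ET.SubElement(root, "mxCell", {"id": "1", "parent": "0"})

    previous_id = None
    for index, label in enumerate(labels):
        cell_id = f"node{index}"
        # A vertex cell is a shape; its label lives in the "value" attribute.
        cell = ET.SubElement(root, "mxCell", {
            "id": cell_id,
            "value": label,
            "style": "rounded=1;whiteSpace=wrap;html=1;",  # assumed style string
            "vertex": "1",
            "parent": "1",
        })
        ET.SubElement(cell, "mxGeometry", {
            "x": str(40 + index * 180),
            "y": "40",
            "width": "120",
            "height": "60",
            "as": "geometry",
        })
        if previous_id is not None:
            # An edge cell connects two vertices by id.
            edge = ET.SubElement(root, "mxCell", {
                "id": f"edge{index}",
                "style": "endArrow=classic;html=1;",  # assumed style string
                "edge": "1",
                "parent": "1",
                "source": previous_id,
                "target": cell_id,
            })
            ET.SubElement(edge, "mxGeometry", {"relative": "1", "as": "geometry"})
        previous_id = cell_id

    return ET.tostring(mxfile, encoding="unicode")


if __name__ == "__main__":
    # Write the result to a file, then import it into drawio.
    with open("pipeline.drawio", "w", encoding="utf-8") as handle:
        handle.write(build_drawio(["Client", "API", "Database"]))
```

Since the whole diagram is a flat list of `mxCell` elements, an LLM only has to emit consistent ids and geometry, which is probably why the "save the xml, import to drawio" workflow above works as well as it does.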