AlphaEvolve: Gemini-powered coding agent scaling impact across fields
175 points
4 hours ago
| 14 comments
| deepmind.google
| HN
momojo
2 hours ago
[-]
This reminds me of Antirez's "Don't fall into the anti-AI hype" [0]

In a sentence: these foundation models are really good at optimizing extremely high-level, extremely well-defined problem spaces (i.e. multiplying matrices faster). In Antirez's case, it's "make Redis faster".

There have been two reactions: "Oh, it would never work for me" and "I have seen months of my life accomplished in an hour", and I think they're both right. I think we should be excited for Antirez (who has since been popping off [1]), and I think the rest of us should rest easy knowing that LLMs can't (and maybe were never meant to) tackle the tacit-knowledge-filled, human-system-centric, ambiguously-defined-problem-space jobs most mortals work.

[0] https://antirez.com/news/158 [1] https://antirez.com/news/164

reply
poisonfountain
2 hours ago
[-]
>I think the rest of us should rest easy knowing that LLM's can't (and maybe were never meant to) tackle the tacit-knowledge-filled, human-system-centric, ambiguously-defined-problem-space jobs most mortals work

I don't believe that anymore, to be honest. Models are starting to get good at ambiguity. Claude Code now asks me when something is ambiguous. Soon, all meetings will be recorded, transcribed and stored in a well-indexed place for the agents to search when faced with ambiguity (free startup idea here!). If they can ask you now, they'll be able to search for the answers themselves once that's possible. In fact, they already do it now if you have a well-documented Notion/Confluence; it's just that almost nobody does.

It's probably harder to RL for "identify ambiguity" than RL'ing for performance algorithms, sure, but it's not impossible and it's in the works. It's just a matter of time now.

reply
TranquilMarmot
34 minutes ago
[-]
> Soon, all meetings will be recorded, transcribed and stored in a well-indexed place for the agents to search when faced with ambiguity (free startup idea here!)

We were doing that over at Vowel a few years back, unfortunately it didn't pan out because you're competing directly against Zoom, Google Meet, Microsoft Teams, etc. They are all (slowly) catching up to where we were as a scrappy startup 4 years ago.

It was truly game-changing to have all of your meetings in an easily searchable database. Even as a human.

reply
Yokohiii
31 minutes ago
[-]
So self-chosen total surveillance and transparency, so your fav LLM can be better?
reply
risyachka
49 minutes ago
[-]
In coding, the ambiguity is very, very limited and constrained compared to any non-dev job that involves decision making.
reply
dinfinity
2 hours ago
[-]
> I think the rest of us should rest easy knowing that LLM's can't [...]

What if (when?) (AI-assisted) research moves AI beyond LLMs? Do you think that can't happen?

reply
kubb
1 hour ago
[-]
Not in the next decade. Won't get funded.
reply
dinfinity
49 minutes ago
[-]
Private AI investment in the US has grown from USD 100 billion in 2024 to almost USD 300 billion in 2025 [0]. Add public investments worldwide and private investments in at least China and Europe.

I'm pretty sure money is not going to be the blocker.

[0] https://hai.stanford.edu/ai-index/2026-ai-index-report

reply
kubb
43 minutes ago
[-]
The money will go to LLMs.
reply
dgellow
37 minutes ago
[-]
Why not both? You don’t need 1trillion allocated before you have a proof of concept to demonstrate your non-LLM model, and once you have a PoC you will definitely have the larger investors interested
reply
kubb
17 minutes ago
[-]
You will need 100s of billions to make a viable POC.
reply
drysine
34 minutes ago
[-]
Advanced Machine Intelligence (AMI), a new Paris-based startup cofounded by Meta’s former chief AI scientist Yann LeCun, announced Monday it has raised more than $1 billion to develop AI world models.

LeCun argues that most human reasoning is grounded in the physical world, not language, and that AI world models are necessary to develop true human-level intelligence. “The idea that you’re going to extend the capabilities of LLMs [large language models] to the point that they’re going to have human-level intelligence is complete nonsense,” he said. [0]

[0] https://www.wired.com/story/yann-lecun-raises-dollar1-billio...

reply
kubb
18 minutes ago
[-]
Now check how much OpenAI got in their last funding round, and you have your answer.
reply
vonneumannstan
1 hour ago
[-]
>I think the rest of us should rest easy knowing that LLM's can't (and maybe were never meant to) tackle the tacit-knowledge-filled, human-system-centric, ambiguously-defined-problem-space jobs most mortals work.

A statement all but guaranteed to look incredibly short-sighted by 2030.

reply
cyanydeez
2 hours ago
[-]
I'd say it's a mix of:

1. Amazing, you just tweaked 1% efficiency.

2. You idiot, you just spent an hour trying to troubleshoot a hallucinated API.

On average, it's really hard to tell which one's going to win here.

reply
dakolli
1 hour ago
[-]
It's not hard to tell at all: just look at how much it costs to run a 10T-param model (especially with parallelized agents). Those costs are not worth the occasional slot-machine-esque jackpot you get. For an entity like Google it might be worth it, but that's it. They definitely aren't going to let us use these things at the cost they are now for much longer.

Imagine going back to 2020 and telling people that in 6 years you'd be able to spend $200 a month and spin up $2MM worth of GPUs at full throttle to respond to your emails. None of this makes sense.

reply
Leynos
1 hour ago
[-]
You don't pay for a £200-a-month account to respond to your emails, and if you do, I would tell you that you're wasting your money.
reply
throw310822
1 hour ago
[-]
I don't know; I guess it depends on a) how many hours per month you spend answering emails, and b) how much more revenue you could get in that same time. $200 should reasonably be 2-3 hours of work? So that's about the amount of saved time per month needed to break even on your subscription. It's a steal.
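The back-of-the-envelope math is simple enough to sketch (the hourly rates below are my illustrative assumptions, not the commenter's):

```python
# How many saved hours per month does the subscription need to pay for itself?
def break_even_hours(monthly_cost: float, hourly_rate: float) -> float:
    """Hours of work the subscription must save per month to break even."""
    return monthly_cost / hourly_rate

# At an assumed $70-100/hour effective rate, $200/month breaks even
# after roughly 2-3 saved hours.
low_rate_hours = break_even_hours(200, 70)    # ~2.9 hours
high_rate_hours = break_even_hours(200, 100)  # 2.0 hours
```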
reply
ogogmad
25 minutes ago
[-]
Whenever you solve any hard problem, you start off by finding a complicated solution, which you then scale down to a simpler solution.

LLMs are a "complicated solution" in the sense that they're expensive. Once you know what they're capable of, you can scale them down to something less expensive. There's usually a way.

Also, an important advantage of LLMs over other approaches is that it's easy to improve them by finding better ways of prompting them. Those prompting strategies can then get hard-coded into the models to make them more efficient. Rinse and repeat. Similarly, you can produce curated data to make them better in certain areas like programming or mathematics.

reply
snapcaster
47 minutes ago
[-]
Do you realize you're fighting a strawman or do you actually think this is a compelling argument?
reply
pingou
3 hours ago
[-]
AI improving itself (or at least the architecture it runs on): the singularity is near, as they say.

Do we have other examples of AI being used to improve LLMs, apart from the creation of synthetic data and the testing of models?

reply
HarHarVeryFunny
14 minutes ago
[-]
There is an apples and oranges difference between AI improving itself (becoming more capable) and AI optimizing software that happens to be used for AI training or inference.

A more efficient transformer just costs less to run.

"AI improving AI" would be if one generation of AI designed a next-gen AI that was fundamentally more capable (not just faster/cheaper) than itself. A reptilian brain that could autonomously design a mammalian brain.

Even when hooked up into a smart harness like AlphaEvolve, I don't think LLMs have the creativity to do this, unless the next-gen architecture is hiding in plain sight as an assemblage of parts that an LLM can be coaxed into predicting.

More likely it'll take a few more steps of human innovation, steps towards AGI, before we have an AI capable of autonomous innovation rather than just prompted mashup generation.

reply
NitpickLawyer
2 hours ago
[-]
> Do we have other examples of AI being used to improve the LLMs

Yes: last year when they revealed AlphaEvolve, they used a previous Gemini model to improve kernels that were used in training this generation's models, netting them a 1% faster training run. Not much, but still.

reply
mkw5053
3 hours ago
[-]
I feel like the most viral lately is https://github.com/karpathy/autoresearch
reply
aleksiy123
1 hour ago
[-]
Self-improving doesn't necessarily imply singularity, right?

There still could be hard constraints that make a singularity intractable, or just such a long time horizon that it's not practical, right?

reply
dinfinity
2 hours ago
[-]
> AI improving itself

This is the thing to look for in 2027, imho. All the big AI labs have big projects working on research agents, including specifically on improving AI (duh), and I expect a lot of that to get out of the experimental phase this year.

Next year they actually get to do a lot of work and I think we will see the first big effective architectural change co-invented by AI.

reply
pjmlp
2 hours ago
[-]
And then in 2028 we will be selling ice cream at the beach.
reply
lewtun
2 hours ago
[-]
Shameless plug: https://huggingface.co/spaces/smolagents/ml-intern

It’s a simple harness around Opus, but with tight integration with Hugging Face infra, so the agent can read papers, test code and launch experiments.

reply
westurner
1 minute ago
[-]
What are the benchmarks for this, in terms of costs of computation and error; cost to converge?

Re: hyperparameter tuning and autoresearch: https://news.ycombinator.com/item?id=47444581

Parameter-free LLMs would be cool

reply
cyanydeez
2 hours ago
[-]
The hard part about this is that for every few 'WOW's, there's a lineage of 'you dumbass'.

I mean, if you can create a harness to filter these two, sure, singularity away; it's really hard to see how someone's gonna do that.

reply
stijntonk
40 minutes ago
[-]
I wish Google would focus on bringing their Gemini 3.x models to GA, and on providing enough capacity so that one doesn't constantly have to fight 429 errors.

It often feels like they do not want me to develop applications for corporate clients using their Vertex API. It is just such a shame, given that their models were so great for document analysis etc.

reply
VladVladikoff
27 minutes ago
[-]
Are you doing it on a free plan? I noticed they serve way more 429s on the free plan.
reply
stijntonk
16 minutes ago
[-]
No, for clients we use paid Vertex AI accounts. We often need to host workloads in an EU region, which rules out “global” models (and probably better capacity).

In the past, we used a wrapper that round-robined across multiple projects to get enough quota. Luckily, many of our workloads are workflow-style tasks, so we can simply keep retrying on 429s.
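A minimal sketch of that kind of wrapper: rotate requests across several projects' quotas and retry on 429s. Everything here (the class, the `send` callback) is hypothetical illustration, not actual Vertex AI client code:

```python
import itertools
import time

class RoundRobinClient:
    """Spread requests across several project quotas, retrying on HTTP 429."""

    def __init__(self, projects, max_retries=5, backoff=1.0):
        self._cycle = itertools.cycle(projects)
        self.max_retries = max_retries
        self.backoff = backoff

    def request(self, send, payload):
        # `send(project, payload)` is a caller-supplied function returning
        # (status_code, body); it stands in for the real model call.
        for attempt in range(self.max_retries):
            project = next(self._cycle)
            status, body = send(project, payload)
            if status != 429:
                return body
            time.sleep(self.backoff * attempt)  # simple linear backoff
        raise RuntimeError("all projects rate-limited")
```

For workflow-style tasks, as the comment notes, blocking and retrying like this is usually acceptable; interactive workloads would need something smarter.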

Fun fact: for one of their services (I think it was Stitch), I noticed that my paid key kept hitting quota while the free one worked fine. That blew my mind.

reply
alecco
3 hours ago
[-]
Are Googlers themselves happy using Gemini coding agent instead of Claude Code or Codex? (no snark, I'm really asking)
reply
jensensbutton
2 hours ago
[-]
Note that coding is not the only use of Gemini or any of these models. It's also not what this article is talking about. Gemini may not be the best coding agent while still being very good at other things.
reply
dekhn
2 hours ago
[-]
If you mean specifically the Gemini VS Code extension: it's terrible compared to Claude Code or Codex. I don't know how they get away with it. Just constant timeouts, weird failure modes, having to start a new chat to switch modes... but I don't think any of that is specific to Gemini the model; it seems to be the extension.

As for actual solutions to problems ignoring the VS Code extension aspect, I find all three premiere models to be excellent coding agents for my purposes.

reply
Groxx
45 minutes ago
[-]
The overall quality of LLM coding tools is shockingly bad. I haven't found a single one without major issues, and many have the same problems reappear every few months, sometimes bad enough to almost break the entire thing (e.g. 100% failure rate in editing files, broken for weeks, with the same cause each time, multiple times in a year).

I'd say I'm surprised by it, but uh

reply
nine_k
2 hours ago
[-]
The point of dogfooding is exactly that: if we're unhappy, we're the ones who can improve it.
reply
anthonypasq
1 hour ago
[-]
the engineers using Gemini have no control over DeepMind
reply
robohoe
1 hour ago
[-]
Antigravity comes to mind
reply
j2kun
2 hours ago
[-]
I for one can't tell the difference between Claude and Gemini for coding. And the internal agent tooling is many times faster than Claude Code in my experience.
reply
llmslave
1 hour ago
[-]
they use Claude Code at DeepMind
reply
PunchTornado
2 hours ago
[-]
Codex?
reply
carbocation
3 hours ago
[-]
Last month, Steve Yegge suggested that they are not: https://xcancel.com/Steve_Yegge/status/2043747998740689171
reply
NitpickLawyer
2 hours ago
[-]
> He says the problem is that they can't use Claude Code because it's the enemy, and Gemini has never been good enough to capture people's workflows like Claude has, so basically agentic coding just never really took off inside Google. They're all just plodding along, completely oblivious to what's happening out there right now.

This is a bunch of gabagoo. Wrong on so many layers, it's not even worth reading further.

a) goog has agentic coding in both Antigravity & CLI forms. While it is not at the level of CC + Opus, it's still decent.

b) goog has their own versions of models trained on internal code.

c) goog has Claude in Vertex, and most definitely can set it up in secure zones (like they can for their clients), so they'd be able to use Claude (at cost) within their own projects.

reply
aleksiy123
57 minutes ago
[-]
Agreed, however imo there are def some problems unique to Google which make the internal experience less than ideal.

Hoping they can figure it out sooner rather than later.

reply
stormbeard
2 hours ago
[-]
Demis Hassabis chimed in on that thread and called it what it is: clickbait.
reply
typs
2 hours ago
[-]
I’m not so sure. From talking to some of my own friends at Google, they feel that the Antigravity/Gemini models are handicapping them and would much rather be using Claude Code (which only DeepMind gets to use).
reply
beanard
2 hours ago
[-]
Sure, but there's cavernous distance between "google = john deere" and "darn I have to use Gemini"
reply
PunchTornado
2 hours ago
[-]
This couldn't be further from the truth
reply
FrustratedMonky
2 hours ago
[-]
There is value in the "eating your own dog food".

If internal staff aren't happy with the tools they build, typically that should drive improvements to their own tools.

reply
dandaka
2 hours ago
[-]
How many times do we have to hear about Erdős problems? :) It sounds like a great achievement for humanity at first, but after a while they keep coming back!
reply
nightski
25 minutes ago
[-]
I'm not a fan of this approach where DeepMind keeps announcing these advances with vague promises, but they're never available to anyone. Trust-me-bro style tactics. That's great, but if you really want impact, it needs to be accessible. The truth is they want a monopoly on the future in the darkest of ways, and the evil is really bleeding through. They want to control the market and pick and choose winners.

Before, research was publicly funded and accessible to all, even if flawed. These corporate labs are not serving anyone except those seeking extreme power and control.

reply
nmitchko
35 minutes ago
[-]
A fantastically simple solution to improving algorithms, I wish I had this years ago in activation engineering: https://blog.n.ichol.ai/llm-activation-engineering-an-easy-f...

How do I access AlphaEvolve?

reply
Yokohiii
32 minutes ago
[-]
This is just a flex post. Be a billion dollar company or get out.
reply
arian_
33 minutes ago
[-]
We went from 'AI will replace programmers' to 'AI will help programmers' to 'AI writes code while other AI reviews it' in about 18 months. At this rate the humans are just providing the electricity.
reply
brkn
2 hours ago
[-]
I would be interested to see how exactly the agent helped: how it was used, where it led to the given improvement, and how long it would have taken a human to come to the same solution.
reply
j2kun
2 hours ago
[-]
The blog post has many links to papers and preprints discussing this exact question.
reply
Lt_Riza_Hawkeye
22 minutes ago
[-]
The CANOS arxiv link says absolutely nothing about AlphaEvolve, Gemini, or LLMs. It seems to use purely traditional ML models. If AE did in fact write a quick script to test different configurations in order to optimize the results, they don't seem to have bothered to write about it.

I can't read the Nature paper about DeepConsensus, but from the summary, it doesn't really explain what role AE had in improving DC. It would be nice to be able to read about what role it actually played, and whether it used traditional or novel methods of performing it

reply
pilooch
1 hour ago
[-]
AlphaEvolve couples MAP-Elites with LLMs. It's a key step in machine learning, in the vein of DQN for reinforcement learning.

AE brings diversity from the genetic-algorithms community to large-scale optimized deep learning and RL models.

It is a mandatory step for moving forward. The approach is clean and simple, while generic.

The only caveat is the per-optimization-problem definition of the MAP-Elites dimensions. But surely this will get tackled somehow over the next few years.

If you don't know about MAP-Elites, go look up Jean-Baptiste Mouret's work and talks; it's both very interesting and universal.
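For readers who haven't met it: MAP-Elites keeps an archive with one elite per cell of a behavior-descriptor grid, repeatedly mutating a random elite and keeping the child only where it beats that cell's incumbent. A toy sketch (the 1-D demo problem is mine; in AlphaEvolve the mutation operator is an LLM editing code):

```python
import random

def map_elites(init, mutate, fitness, descriptor, iterations=1000, seed=0):
    """Toy MAP-Elites: archive maps behavior-descriptor cells to elites."""
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, solution)
    x = init(rng)
    archive[descriptor(x)] = (fitness(x), x)
    for _ in range(iterations):
        # pick a random elite and mutate it
        _, parent = archive[rng.choice(list(archive))]
        child = mutate(parent, rng)
        cell, f = descriptor(child), fitness(child)
        # keep the child only if its cell is empty or it beats the incumbent
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, child)
    return archive

# Tiny demo: maximize -(x^2) over 1-D solutions, one cell per integer bin.
demo = map_elites(
    init=lambda rng: rng.uniform(-5, 5),
    mutate=lambda x, rng: x + rng.gauss(0, 0.5),
    fitness=lambda x: -x * x,
    descriptor=lambda x: int(x),
)
```

The per-problem choice of `descriptor` is exactly the caveat about defining the dimensions: someone has to decide which behavioral axes are worth keeping diversity along.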

reply
AndrewKemendo
40 minutes ago
[-]
From the comments, it seems that this community (mostly career software people) is starting to move into a new phase of grief about the median software engineer losing their hoped-for permanent place in society.

-2021-2024 was Denial

-2024-2025 was Anger and Bargaining

-2026 seems to be some combo of anger, bargaining and acceptance depending mostly on your class/age

reply
baq
3 hours ago
[-]
RSI is here on the hardware level and on the software level. Sprinkle in a couple of algorithmic breakthroughs and the results are nigh unimaginable.
reply
maxothex
3 hours ago
[-]
What I'm most curious about is how this translates to messy, real-world codebases without well-defined metrics. Most production software isn't chip design or kernel optimization - it's business logic with unclear success criteria. The infrastructure story is impressive, but I'd love to see how they handle domains where the evaluation function itself is ambiguous.
reply
svieira
16 minutes ago
[-]
> In advertising and marketing, WPP used AlphaEvolve to refine AI model components, navigating complex, high-dimensional campaign data and achieving 10% accuracy gains over their competitive manual model optimizations.

Ah good, we're getting closer and closer to Venus, Inc. every day. /s

reply