Scaling long-running autonomous coding
94 points
3 hours ago
| 17 comments
| cursor.com
| HN
micimize
1 hour ago
[-]
> While it might seem like a simple screenshot, building a browser from scratch is extremely difficult.

> Another experiment was doing an in-place migration of Solid to React in the Cursor codebase. It took over 3 weeks with +266K/-193K edits. As we've started to test the changes, we do believe it's possible to merge this change.

In my view, this post does not go into sufficient detail or nuance to warrant any serious discussion, and the sparseness of info mostly implies failure, especially in the browser case.

It _is_ impressive that the browser repo can do _anything at all_, but if there was anything more noteworthy than that, I feel they'd go into more detail than volume metrics like 30K commits, 1M LoC. For instance, the entire capability on display could be constrained to a handful of lines that delegate to other libs.

And, it "is possible" to merge any change that avoids regressions, but the majority of our craft asks the question "Is it possible to merge _the next_ change? And the next, and the 100th?"

If they merge the MR they're walking the walk.

If they present more analysis of the browser it's worth the talk (not that useful a test if they didn't scrutinize it beyond "it renders")

Until then, it's a mountain of inscrutable agent output that manages to compile, and that contains an execution pathway which can screenshot apple.com by some undiscovered mechanism.

reply
embedding-shape
1 hour ago
[-]
> it's a mountain of inscrutable agent output that manages to compile

But is this actually true? They don't say that as far as I can tell, and it doesn't compile for me, nor in their own CI, it seems.

reply
micimize
1 hour ago
[-]
Hah I don't know actually! I was assuming it must if they were able to get that screenshot video.
reply
Snuggly73
1 hour ago
[-]
error: could not compile `fastrender` (lib) due to 34 previous errors; 94 warnings emitted

I guess probably at some point, something compiled, but cba to try to find that commit. I guess they should've left it in a better state before doing that blog post.

reply
jaggederest
1 hour ago
[-]
I find it very interesting the degree to which coding agents completely ignore warnings. When I program I generally target warning-free code, and even with significant effort in prompting, I haven't found a model that treats warnings as errors, and they almost all love the "ignore this warning" pragmas or comments over actually fixing them.
reply
suriya-ganesh
23 minutes ago
[-]
Unfortunately this is not the most common practice. I've worked on Rust codebases with 10K+ warnings, and Rust was supposed to help you.

It is also close to impossible to run anything in the Node ecosystem without getting a wall of warnings.

You are an extreme outlier for putting in the work to fix all warnings.

reply
ianbutler
23 minutes ago
[-]
Yeah I've had problems with this recently. "Oh those are just warnings." Yes but leaving them will make this codebase shit in short time.

I do use AI heavily so I resorted to actually turning on warnings as errors in the rust codebases I work in.
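
For anyone who hasn't done this, here is a minimal sketch of what "warnings as errors" can look like in Rust. The `strict` module and `rgba8_len` function are made up for illustration; the `deny(warnings)` lint attribute and the `-D warnings` compiler flag are the real parts.

```rust
// Hypothetical module, not from any real codebase: the point is the attribute.
// `#[deny(warnings)]` promotes every warning inside this module to a hard
// compile error. At the crate root the inner form `#![deny(warnings)]` does the
// same for the whole crate, and `RUSTFLAGS="-D warnings" cargo build` does it
// from the outside (handy in CI) without touching the source.
#[deny(warnings)]
pub mod strict {
    /// Bytes needed for an RGBA8 buffer, or `None` if the size overflows.
    pub fn rgba8_len(width: u32, height: u32) -> Option<usize> {
        // Adding `let unused = width;` here would normally only warn
        // (`unused_variables`); under `deny(warnings)` it fails the build.
        (width as usize)
            .checked_mul(height as usize)?
            .checked_mul(4)
    }
}
```

Scoping the attribute to one module keeps the example small. The flag form is usually easier to enforce in CI because it lives outside the source files.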

reply
sashank_1509
1 hour ago
[-]
Oh, it doesn’t compile? That’s very revealing.
reply
rvz
1 hour ago
[-]
Some people just believe anything said on X these days. No timeline from start to finish, just "trust me bro".

If you can't reproduce or compile the experiment, then it really doesn't work at all and is nothing but a hype piece.

reply
simonw
2 hours ago
[-]
"To test this system, we pointed it at an ambitious goal: building a web browser from scratch."

I shared my LLM predictions last week, and one of them was that by 2029 "Someone will build a new browser using mainly AI-assisted coding and it won’t even be a surprise" https://simonwillison.net/2026/Jan/8/llm-predictions-for-202... and https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3913s

This project from Cursor is the second attempt I've seen at this now! The other is this one: https://www.reddit.com/r/Anthropic/comments/1q4xfm0/over_chr...

reply
mrefish
2 hours ago
[-]
Time to raise the bar. By 2029 someone will build a new browser using mainly AI-assisted coding and the surprise is that it was designed to be used by pelicans.
reply
bob1029
2 hours ago
[-]
The goal I am currently using for long horizon coding experiments is implementation of a PDF rasterizer given an ISO32000 specification document.
reply
xenni
58 minutes ago
[-]
We're almost there, I've been working on something similar using a markdown'd version of the ISO32000 spec
reply
leptons
46 minutes ago
[-]
Great, they can call it "artificial Internet Explorer", or aIE for short.
reply
cheevly
2 hours ago
[-]
2029? I have no idea why you would think this is so far off. More like Q2 2026.
reply
xmprt
2 hours ago
[-]
You're either overestimating the capabilities of current AI models or underestimating the complexity of building a web browser. There are tons of tiny edge cases and standards to comply with where implementing one standard will break 3 others if not done carefully. AI can't do that right now.
reply
rvz
43 minutes ago
[-]
It's most likely both.

> There are tons of tiny edge cases and standards to comply with where implementing one standard will break 3 others if not done carefully. AI can't do that right now.

Firstly, the CI is completely broken on every commit and all tests have failed. And looking closely at the code, it is exactly what you'd expect from unmaintainable slop.

Having more lines of code is not a good measure of robust software, especially if it does not work.

reply
gordonhart
1 hour ago
[-]
Web browsers are insanely hard to get right, that’s why there are only ~3 decent implementations out there currently.
reply
mkoubaa
1 hour ago
[-]
Yeah if you let them index chromium I'm sure it could do it next week. It just won't be original or interesting.
reply
geeunits
2 hours ago
[-]
because it makes him look smart when inevitably he's 'right'
reply
embedding-shape
2 hours ago
[-]
Did anyone manage to run the tests from the repository itself? The code seems filled with errors and warnings, and as far as I can tell none of them are because of the platform I'm on (Linux). I went through a few pages of the Actions workflow history, and it seems CI has been failing for a while; PRs have all been failing CI too, but got merged anyway. How exactly was this verified to be something that can be used as a successful example, or am I misunderstanding what point they are trying to make? They mention a screenshot, but they never actually say whether their goal was successfully met, do they?

I'm not sure the approach of "completely autonomous coding" is the right way to go. I feel like we'll be able to use agents more effectively if we think of them as tools used by a human to accomplish something, and lean into letting the human drive, because quality spirals out of control so quickly.

reply
trjordan
2 hours ago
[-]
This is going to sound sarcastic, but I mean this fully: why haven't they merged that PR.

The implied future here is _unreal cool_. Swarms of coding agents that can build anything, with little oversight. Long-running projects that converge on high-quality, complex projects.

But the examples feel thin. Web browsers, Excel, and Windows 7 exist, and they specifically exist in the LLM's training sets. The closest to real code is what they've done with Cursor's codebase .... but it's not merged yet.

I don't want to say, call me when it's merged. But I'm not worried about agents' ability to produce millions of lines of code. I'm worried about their ability to intersect with the humans in the real world, both as users of that code and developers who want to build on top of it.

reply
risyachka
1 hour ago
[-]
>> why haven't they merged that PR.

because it is absolutely impossible to review that code and there are a gazillion issues in there.

The only way it can get merged is YOLO and then fixing issues for months in prod, which kinda defeats the purpose and brings the gains close to zero.

reply
mkoubaa
1 hour ago
[-]
On the other hand, finding and fixing issues for months is still training data
reply
dist-epoch
2 hours ago
[-]
Pretty much everything exists in the training sets. All non-research software is just a mishmash of various standard modules and algorithms.
reply
galaxyLogic
1 hour ago
[-]
Not everything, only code-bases of existing (open-source?) applications.

But what would be the point of re-creating existing applications? It would be useful if you can produce a better version of those applications. But the point in this experiment was to produce something "from scratch" I think. Impressive yes, but is it useful?

A more practically useful task would be for Mozilla Foundation and others to ask AI to fix all bugs in their application(s). And perhaps they are trying to do that, let's wait and see.

reply
mkoubaa
1 hour ago
[-]
You have to be careful which codebase to try this on. I have a feeling if someone unleashed agents on the Linux kernel to fix bugs it'd lead to a ban on agents there
reply
jphelan
2 hours ago
[-]
This looks like extremely brittle code to my eyes. Look at https://github.com/wilsonzlin/fastrender/blob/main/crates/fa...

What is `FrameState::render_placeholder`?

```rust
  pub fn render_placeholder(&self, frame_id: FrameId) -> Result<FrameBuffer, String> {
    let (width, height) = self.viewport_css;
    let len = (width as usize)
      .checked_mul(height as usize)
      .and_then(|px| px.checked_mul(4))
      .ok_or_else(|| "viewport size overflow".to_string())?;

    if len > MAX_FRAME_BYTES {
      return Err(format!(
        "requested frame buffer too large: {width}x{height} => {len} bytes"
      ));
    }

    // Deterministic per-frame fill color to help catch cross-talk in tests/debugging.
    let id = frame_id.0;
    let url_hash = match self.navigation.as_ref() {
      Some(IframeNavigation::Url(url)) => Self::url_hash(url),
      Some(IframeNavigation::AboutBlank) => Self::url_hash("about:blank"),
      Some(IframeNavigation::Srcdoc { content_hash }) => {
        let folded = (*content_hash as u32) ^ ((*content_hash >> 32) as u32);
        Self::url_hash("about:srcdoc") ^ folded
      }
      None => 0,
    };
    let r = (id as u8) ^ (url_hash as u8);
    let g = ((id >> 8) as u8) ^ ((url_hash >> 8) as u8);
    let b = ((id >> 16) as u8) ^ ((url_hash >> 16) as u8);
    let a = 0xFF;

    let mut rgba8 = vec![0u8; len];
    for px in rgba8.chunks_exact_mut(4) {
      px[0] = r;
      px[1] = g;
      px[2] = b;
      px[3] = a;
    }

    Ok(FrameBuffer {
      width,
      height,
      rgba8,
    })
  }
}
```
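
Read in isolation, the function above seems to fill the frame with one solid color derived from the frame id and a hash of the navigation target, rather than rendering anything. Here is a stripped-down sketch of just that logic. `placeholder_rgba` and `solid_frame` are names I made up, and I'm assuming both the id and the hash are `u32`, which the snippet doesn't confirm.

```rust
/// Hypothetical stand-in for the per-frame color logic in the snippet above.
/// `id` plays the role of `frame_id.0` and `url_hash` of the navigation hash;
/// both are assumed to be `u32` here.
pub fn placeholder_rgba(id: u32, url_hash: u32) -> [u8; 4] {
    // `as u8` keeps only the low 8 bits, so each channel mixes one byte of
    // the id with the matching byte of the hash. Alpha is always opaque.
    let r = (id as u8) ^ (url_hash as u8);
    let g = ((id >> 8) as u8) ^ ((url_hash >> 8) as u8);
    let b = ((id >> 16) as u8) ^ ((url_hash >> 16) as u8);
    [r, g, b, 0xFF]
}

/// Fill a `width * height` RGBA8 buffer with one color, refusing sizes whose
/// byte length overflows `usize` or exceeds `max_bytes`.
pub fn solid_frame(
    width: u32,
    height: u32,
    rgba: [u8; 4],
    max_bytes: usize,
) -> Result<Vec<u8>, String> {
    let len = (width as usize)
        .checked_mul(height as usize)
        .and_then(|px| px.checked_mul(4))
        .ok_or_else(|| "viewport size overflow".to_string())?;
    if len > max_bytes {
        return Err(format!("frame too large: {width}x{height} => {len} bytes"));
    }
    let mut buf = vec![0u8; len];
    for px in buf.chunks_exact_mut(4) {
        px.copy_from_slice(&rgba);
    }
    Ok(buf)
}
```

If that reading is right, two frames with different ids or URLs get visibly different flat colors, which matches the "catch cross-talk" comment in the original.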

What is it doing in these diffs?

https://github.com/wilsonzlin/fastrender/commit/f4a0974594e3...

I'd be really curious to see the amount of work/rework over time, and the token/time cost for each additional actual completed test case.

reply
blibble
2 hours ago
[-]
this is certainly an interesting way to pull out an attribute from a tag: https://github.com/wilsonzlin/fastrender/blob/main/crates/fa...
reply
blamestross
1 hour ago
[-]
I suppose brittle code is fine if you have cursor to update and fix it. Ideal really, keeps you dependent.
reply
ZitchDog
2 hours ago
[-]
I used similar techniques to build tjs [1] - the world's fastest and most accurate JSON Schema validator, with magical TypeScript types. I learned a lot about autonomous programming. I found a similar "planner/delegate" pattern to work really well, with the use of git subtrees to fan out work [2].

I think any large piece of software with well established standards and test suites will be able to be quickly rewritten and optimized by coding agents.

[1] https://github.com/sberan/tjs

[2] /spawn-perf-agents claude command: https://github.com/sberan/tjs/blob/main/.claude/commands/spa...

reply
luhego
38 minutes ago
[-]
> We initially built an integrator role for quality control and conflict resolution, but found it created more bottlenecks than it solved

Of course it creates bottlenecks, since code quality takes time and people don’t get it right on the first try when the changes are complex. I could also be faster if I pushed directly to prod!

Don’t get me wrong. I use these tools, and I can see the productivity gains. But I also believe the only way to achieve the results they show is to sacrifice quality, because no software engineer can review the changes at the same speed the agent generates code. They may solve that problem, or maybe the industry will change so only output and LOC matter, but until then I will keep cursing the agent until I get the result I want.

reply
mdswanson
36 minutes ago
[-]
Over the past year or so, I've built my own system of agents that behaves almost exactly like this. I can describe what I'd like built before I go to bed and have a fantastic foundation in place by the next day. For simpler projects, they'll be complete. Because of the reviews, the code continually improves until the agents are satisfied. I'm impressed every time.
reply
nl
39 minutes ago
[-]
Remember when 3D printers meant the death of factories? Everyone would just print what they wanted at home.

I'm very bullish on LLMs building software, but this doesn't mean the death of software products any more than 3D printers meant the death of factories.

reply
matthewfcarlson
1 hour ago
[-]
It’s fascinating that many of the issues they faced I’ve seen in human software engineering teams.

Things like integration creating bottlenecks, or a lack of consistent top-down direction leading to small risk-averse changes instead of bold redesigns. All things I’ve seen before.

reply
2001zhaozhao
1 hour ago
[-]
At least the AI teams aren't politically competing against each other unlike human teams.

(Or are they?)

reply
jphoward
2 hours ago
[-]
With the browser it built, the context needed for the entire project is obviously huge. They mention loads of parallel agents in the blog post, so I guess each agent is given a module to work on, and some tests? And then a 'manager' agent plugs this in without reading the code? Otherwise I can't see how you could do this, even with ChatGPT 5.2/Gemini 3. In retrospect it seems an obvious approach and akin to how humans work in teams, but it's still interesting.
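
To make that guess concrete, here is a toy sketch of a "manager" that accepts modules purely on test results. This is not what the blog post describes; `Task` and `integrate` are invented names, and plain functions stand in for agent-written modules.

```rust
use std::thread;

/// A unit of delegated work: a module name, a candidate implementation, and
/// the acceptance cases (input, expected output) that come with the task.
pub struct Task {
    pub module: &'static str,
    pub candidate: fn(i64) -> i64,
    pub cases: Vec<(i64, i64)>,
}

/// "Manager" step: check every task in parallel and accept a module purely
/// on whether its cases pass, never looking at the implementation itself.
pub fn integrate(tasks: Vec<Task>) -> Vec<(&'static str, bool)> {
    thread::scope(|s| {
        // One scoped thread per task; scoped threads may borrow `tasks`.
        let handles: Vec<_> = tasks
            .iter()
            .map(|t| {
                s.spawn(move || {
                    let ok = t.cases.iter().all(|&(input, want)| (t.candidate)(input) == want);
                    (t.module, ok)
                })
            })
            .collect();
        // Join in spawn order so results line up with the input order.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

The obvious weakness is that the verdict is only as good as the cases: a module that passes thin tests gets plugged in regardless of what it actually does.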
reply
simonw
2 hours ago
[-]
GPT-5.2-Codex has a 400,000 token window. Claude 4.5 Opus is half of that, 200,000 tokens.

It turns out to matter a whole lot less than you would expect. Coding Agents are really good at using grep and writing out plans to files, which means they can operate successfully against way more code than fits in their context at a single time.
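
A rough sketch of why search beats loading everything: pull only the matching lines, and stop once a size budget is hit. `Hit` and `grep_within_budget` are hypothetical names, the files are in-memory strings, and the byte budget is a crude stand-in for a token budget, not how any particular agent does it.

```rust
/// One hit from a plain substring search: file name, 1-based line number, text.
#[derive(Debug, PartialEq)]
pub struct Hit {
    pub file: String,
    pub line_no: usize,
    pub text: String,
}

/// Scan in-memory `(name, contents)` pairs for `needle` and stop as soon as
/// the collected lines would exceed `budget_bytes`.
pub fn grep_within_budget(
    files: &[(&str, &str)],
    needle: &str,
    budget_bytes: usize,
) -> Vec<Hit> {
    let mut hits = Vec::new();
    let mut used = 0usize;
    for (name, contents) in files {
        for (idx, line) in contents.lines().enumerate() {
            if !line.contains(needle) {
                continue;
            }
            if used + line.len() > budget_bytes {
                return hits; // budget exhausted: return what fits
            }
            used += line.len();
            hits.push(Hit {
                file: name.to_string(),
                line_no: idx + 1,
                text: line.to_string(),
            });
        }
    }
    hits
}
```

The context cost scales with the number of matches, not with the size of the repository, which is the whole trick.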

reply
jaggederest
50 minutes ago
[-]
The other issue with "a huge token window" is that if you fill it, it seems like relevance for any specific part of the window is diminished - which makes it hard to override default model behavior.

Interestingly, recently it seems to me like codex is actually compressing early and often so that it stays in the smarter-feeling reasoning zone of the first 1/3rd of the window, which is a neat solution for this, albeit with the caveat of post-compression behavior differences cropping up more often.

reply
nl
41 minutes ago
[-]
Generally they only load a bit of the project into the context at a time. Grep works really well for working out what.
reply
observationist
2 hours ago
[-]
Get a good "project manager" agents.md and it changes the whole approach of vibe coding. For a professional environment, with each person given a little domain, arranged in the usual hierarchy of your coding team, truly amazing things can get done.

Presumably the security and validation of code still needs work, I haven't read anything that indicates those are solved yet, so people still need to read and understand the code, but we're at the "can do massive projects that work" stage.

Division of labor and planning and hierarchy are all rapidly advancing, the orchestration and coordination capabilities are going to explode in '26.

reply
galaxyLogic
1 hour ago
[-]
> so I guess each agent is given a module to work on, and some tests?

Who created those agents and gives them the tasks to work on? Who created the tests? AI, or the humans?

reply
WOTERMEON
37 minutes ago
[-]
Weird twist: the hiring call at the end, for a company that says

> Our mission is to automate coding

reply
mccoyb
2 hours ago
[-]
Supposing agents and their organization improve, it seems like we’re approaching a point where the cost of a piece of software will be driven down to the cost of running the hardware, and the cost of the tokens required to replicate it.

The tokens were “expensive” from the minds of humans …

reply
Daishiman
2 hours ago
[-]
It will be driven down to the cost of having a good project and product manager effectively understanding what the customer wants, which has been the main barrier to excellent software for a good long time.
reply
galaxyLogic
1 hour ago
[-]
And not only understanding what the customer wants, but communicating that unambiguously to the AI. And note who is the "customer" here? Is it the end-users, or is it a client-company which contracts the project-manager for this task? But then the issue is still there, who in the client-company decides exactly what is needed and what the (potential) users want?

I think this situation emphasizes the importance of (something like) Agile. To produce something useful can only happen via experimentation and getting feedback from actual users, and re-iterating relentlessly.

reply
mk599
2 hours ago
[-]
Define "from scratch" in "building a web browser from scratch". This thing has over 100 crates as dependencies... To implement css layouting, it uses Taffy, a crate used by existing browser implementations...
reply
rvz
59 minutes ago
[-]
When I see hundreds of crates being used in a project, I have to just scratch my head and ask: What the f___?

If one vulnerability exists in those crates, well, that's that.

reply
tired_and_awake
1 hour ago
[-]
The moment all code is interacted with through agents I cease to care about code quality. The only thing that matters is the quality of the product, cost of maintenance etc., exactly the things we measure software development orgs against. It could be handy to have these projects deployed to demonstrate their utility and efficacy? Looking at PRs of agents feels wrong-headed, like who cares if agents' code is hard to read if agents are managing the code base?
reply
visarga
1 hour ago
[-]
> Looking at PRs of agents feels wrong-headed

It would be walking the motorcycle.

reply
icedchai
1 hour ago
[-]
This is how we wound up with non-technical "engineering managers." Looks good to me.
reply
tired_and_awake
40 minutes ago
[-]
I think this misses the point, see the other comments. Fully scaled agentic coding replaces managers too :) cause for celebration all around
reply
flyinglizard
55 minutes ago
[-]
You could look at agents as meta-compilers. The problem is that, unlike real compilers, they aren't verified in any way (neither formally nor informally); in fact you never know which particular agent you're running against when you ask for something. And unlike with compilers, you don't just throw everything away and start afresh on each run. I don't think you could test a reasonably complex system to a degree where it really wouldn't matter what runs underneath, and as you're (probably) going to use other agents to write THOSE tests, what makes you certain they offer real coverage? It's turtles all the way down.
reply
tired_and_awake
41 minutes ago
[-]
Completely agree, and great points. The conclusion of "agents are writing the tests" etc. is where I'm at as well. Moreover, the code quality itself is also an agentic problem, as is compile time, reliability, portability... Turtles all the way down, as you say.

All code interactions happen through agents.

I suppose the question is whether the agents only produce Swiss cheese solutions at scale and there's no way to fill in those gaps (at scale). Then yeah, fully agentic coding is probably a pipe dream.

On the other hand, if you can stand up a code generation machine where it's watts + GPUs + time => software products, then well... It's only a matter of time until app stores entirely disappear or get really weird. It's hard to fathom the change that's coming to our profession in this world.

reply
sashank_1509
2 hours ago
[-]
Can a browser expert please go through the code the agent wrote (skim it) and let us know how it is? Is it comparable to Ladybird or Servo, and can it ever reach that capability soon?
reply
dist-epoch
2 hours ago
[-]
So, who is going to compile the browser and post the binaries so we can check it out? (in a sandbox/VM obviously)
reply