> I don’t recall what happened next. I think I slipped into a malaise of models. 4-way split-paned worktrees, experiments with cloud agents, competing model runs and combative prompting.
You’re trying to have the LLM solve some problem that you don’t really know how to solve yourself, and then you devolve into semi-random prompting in the hope that it’ll succeed. This approach has two problems:
1. It’s not systematic. There’s no way to tell if you’re getting any closer to success. You’re just trying to get the magic to work.
2. When you eventually give up after however many hours, you haven’t succeeded, you haven’t got anything to build on, and you haven’t learned anything. Those hours were completely wasted.
Contrast this with you beginning to do the work yourself. You might give up, but you’d understand the source code base better, perhaps the relationship between Perl and Typescript, and perhaps you’d have some basics ported over that you could build on later.
This feels like the LLM-enabled version of this behavior (except that in the former case, students will quickly realize that what they’re doing is pointless and ask a peer or teacher for help; whereas maybe the LLM is a little too good at hijacking that and making its user feel like things are still on track).
The most important thing to teach is how to build an internal model of what is happening, identify which assumptions in your model are most likely to be faulty/improperly captured by the model, what experiments to carry out to test those assumptions…
In essence, what we call an “engineering mindset” and what good education should strive to teach.
One difference between this story and the various success stories is that the latter all had comprehensive test suites as part of the source material that agents could use to gain feedback without human intervention. This doesn’t seem to exist in this case, which may simply be the deal breaker.
I'm currently torn on whether to actually release it - it's in a private GitHub repository at the moment. It's super-interesting and I think complies just fine with the MIT licenses on MicroQuickJS so I'm leaning towards yes.
It's got to 402 tests with 2 failing - the big unlock was the test suite from MicroQuickJS: https://github.com/bellard/mquickjs/tree/main/tests
It's been spitting out lines like this as it works:
> I see the issue - toFixed is using Python’s default formatting which uses round-half-to-even rounding, but JavaScript uses round-half-away-from-zero.
Or why not run MicroQuickJS under Fil-C? It's ideal since it has no dependencies.
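For anyone curious about the toFixed rounding mismatch quoted above, here is a minimal sketch of the difference. The helper name `to_fixed` is made up for illustration and is not taken from the library being discussed; it leans on the standard `decimal` module and ignores the edge cases a real `toFixed` has to handle (very large numbers, NaN, infinities).

```python
from decimal import Decimal, ROUND_HALF_UP


def to_fixed(x: float, digits: int = 0) -> str:
    """Format x with `digits` decimals, rounding ties away from zero.

    Hypothetical helper for illustration; not the library's actual code.
    """
    # Decimal(x) holds the exact binary value of the float, so a tie is
    # detected on the stored value rather than on its printed form.
    step = Decimal(1).scaleb(-digits)  # digits=2 -> Decimal("0.01")
    rounded = Decimal(x).quantize(step, rounding=ROUND_HALF_UP)
    return format(rounded, "f")  # fixed-point, never scientific notation


print(format(2.5, ".0f"))  # Python's default formatting: ties go to even
print(to_fixed(2.5))       # ties go away from zero
```

Note that `ROUND_HALF_UP` in the `decimal` module means "ties away from zero", so negative inputs behave the same way.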
Though if you look in those files some of them run a ton of test functions and assertions.
My new Python library executes copies of the tests from that mquickjs repo - but those only count as 7 of the 400+ other tests.
Here's the transcript showing how I built it: https://static.simonwillison.net/static/2025/claude-code-mic...
I'm generating a lot of PDFs* in Claude, so it does ASCII diagrams for those, and it's generally very good at it, but it likely has a lot of such diagrams in its training set. What it doesn't do very well is keep them aligned under modification: it can one-shot the diagram, but it can't update it very well.
The euphoric breakthrough into frustration of so-called vibe-coding is well recognised at this point. Sometimes you just have to step back and break the task down smaller. Sometimes you just have to wait a few months for an even better model which can now do what the previous one struggled at.
* Well, generating Typst mark-up, anyway.
Especially considering that the output would be essentially the same: a bunch of code that doesn't work.
Edit: I totally agree with your point about not wanting to learn a language. That's definitely a situation where LLMs can excel and almost an ideal use case for them. I just think that Perl, in particular, will be hard to work with, given the current capabilities of LLM coding tools and models. It might be necessary to actually learn the language, and even that might not be enough.
"I took a long-overdue peek at the source codebase. Over 30,000 lines of battle-tested Perl across 28 modules. A* pathfinding for edge routing, hierarchical group rendering, port configurations for node connections, bidirectional edges, collapsing multi-edges. I hadn’t expected the sheer interwoven complexity."
The AIs are super capable now, but still need a lot of guiding towards the right workflow for the project. They're like a sports team, but you still need to be a good coach.
I found Google Antigravity (with the current Gemini models) to be fairly capable. If I had to guess, it seems like they set up their system to get that divide-and-conquer going. As you suggest, it's not that hard: they just have to put the instructions in their equivalent of the system prompt.
Well, when I say 'not that hard', I mean it's an engineering problem to get the system and tooling working together nicely, not really an AI problem.
> I spent weeks casually trying to replicate what took years to build. My inability to assess the complexity of the source material was matched by the inability of the models to understand what it was generating.
When the trough of disillusionment hits, I anticipate this will become collective wisdom, and we'll tailor LLMs to the subset of uses where they can be more helpful than hurtful. Until then, we'll try to use AI to replace in weeks what took us years to build.
I don’t see a particularly good reason why LLMs wouldn’t be able to do most programming tasks, with the limitation being our ability to specify the problem sufficiently well.
I feel we were hearing very similar claims 40 years ago, about how the next version of "Fourth Generation Languages" were going to enable business people and managers to write their own software without needing pesky programmers to do it for them. They'll "just" need to learn how to specify the problem sufficiently well.
(Where "just" is used in its "I don't understand the problem well enough to know how complicated or difficult what I'm about to say next is" sense. "Just stop buying cigarettes, smoker!", "Just eat less and exercise more, fat person!", "Just get a better paying job, poor person!", "Just cheer up, depressed person!")
I disagree. This is almost entirely down to model capability increases. I've stated this elsewhere: https://news.ycombinator.com/item?id=46362342
Improved tooling/agent scaffolds, whatever, are symptoms of improved model capabilities, not the cause of better capabilities. If you put a 2023-era model such as GPT-4, or even a 2024-era model such as Sonnet 3.5, in today's tooling, it would crash and burn.
The scaffolding and tooling for these models have been tried ever since GPT-3 came out in 2020 in different forms and prototypes. The only reason they're taking off in 2025 is that models are finally capable enough to use them.
Unless you work at one of the big players or run your own models, it appears that even the APIs these days are wrapped in layers of tooling that abstract away raw model access more than ever.
No, the APIs for these models haven't really changed all that much since 2023. The de facto standard for the field is still the chat completions API that was released in early 2023. It is almost entirely model improvements, not tooling improvements that are driving things forward. Tooling improvements are basically entirely dependent on model improvements (if you were to stick GPT-4, Sonnet 3.5, or any other pre-2025 model in today's tooling, things would suck horribly).
> able to do most programming tasks, with the limitation being our ability to specify the problem sufficiently well
We've spent 80 years trying to figure that out. I'm not sure why anyone would think we're going to crack this one anytime in the next few years.
Incremental gains are fine. I suspect capability of models scales roughly as the logarithm of their training effort.
> (read: drinking water and energy)
Water is not much of a concern in most of the world. And you can cool without using water, if you need to. (And it doesn't have to be drinking water anyway.)
Yes, energy is a limiting factor. But the big sink is in training. And we are still getting more energy efficient. At least to reach any given capability level; of course in total we will be spending more and more energy to reach ever higher levels.
Such has always been the largest issue with software development projects, IMO.
I consider myself a bit of an expert vibe engineer and the challenge is alluring :D
This is easy work, made hard by the "allure" of LLMs, which go from emphatic to emetic in the blink of an eye.
If you don't know what you are doing, you should stay away from LLMs if there is anything at all at stake.
The actual goal is to faithfully replicate the functionality and solve the same use cases with a different set of base technologies.
You're describing similar but different instrumental goals, which may help in reaching the real goal.
Cheekiness aside, your framing is helpful!
I use this LLM called git clone.
I'm sure the MS plan is not just asking Claude "port this code to rust: <paste>", but it's just fun to think it is :)
0: https://www.theregister.com/2025/12/24/microsoft_rust_codeba...
I simply cannot come up with tasks the LLMs can't do, when running in agent mode, with a feedback loop available to them. Giving a clear goal, and giving the agent a way to measure its progress towards that goal is incredibly powerful.
With the problem in the original article, I might have asked it to generate 100 test cases, and run them with the original Perl. Then I'd tell it, "ok, now port that to TypeScript, make sure these test cases pass".
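The "record outputs from the original, then make the port match" loop described above can be sketched roughly like this. Everything here is a stand-in: `reference` and `buggy_port` are toy functions, whereas in practice the reference would shell out to the original Perl program and the candidate to the port.

```python
import json
from typing import Callable, Dict, Iterable, List


def record_golden(reference: Callable[[str], str], inputs: Iterable[str]) -> Dict[str, str]:
    """Run the trusted implementation once and keep its outputs as the oracle."""
    return {text: reference(text) for text in inputs}


def check_port(candidate: Callable[[str], str], golden: Dict[str, str]) -> List[str]:
    """Return the inputs on which the port disagrees with the recorded outputs."""
    return [text for text, expected in golden.items() if candidate(text) != expected]


# Toy stand-ins (assumptions for illustration only).
def reference(text: str) -> str:
    return text.strip().upper()


def buggy_port(text: str) -> str:
    return text.upper()  # forgets to strip whitespace


golden = record_golden(reference, ["a -> b", "  a -> b  ", ""])
failures = check_port(buggy_port, golden)
print(f"{len(golden) - len(failures)}/{len(golden)} cases pass")
print(json.dumps(failures))
```

The point of the design is that the agent gets a pass/fail count it can drive up without a human in the loop; how good that signal is depends entirely on how well the recorded inputs cover the original's behavior.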
It's really easy to come up with plenty of algorithmic tasks that they can't do.
Like: implement an algorithm / data structure that takes a sequence of priority queue instructions (insert element, delete smallest element) in the comparison model, and return the elements that would be left in the priority queue at the end.
This is trivial to do in O(n log n). The challenge is doing this in linear time, or proving that it's not possible.
(Spoiler: it's possible, but it's far from trivial.)
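To make the problem statement concrete, here is a sketch of the trivial O(n log n) baseline only, using a binary heap. It is not the linear-time solution. The instruction encoding and the choice to treat a delete on an empty queue as a no-op are assumptions made for the example.

```python
import heapq
from typing import List, Optional, Sequence, Tuple

# Hypothetical encoding: ("insert", value) or ("delete_min", None).
Instruction = Tuple[str, Optional[int]]


def remaining_elements(instructions: Sequence[Instruction]) -> List[int]:
    """Replay the instructions on a binary heap: O(n log n) comparisons."""
    heap: List[int] = []
    for op, value in instructions:
        if op == "insert":
            heapq.heappush(heap, value)
        elif op == "delete_min":
            if heap:  # assumption: deleting from an empty queue does nothing
                heapq.heappop(heap)
        else:
            raise ValueError(f"unknown instruction: {op!r}")
    return sorted(heap)  # sorted only to make the output deterministic


ops = [
    ("insert", 5), ("insert", 1), ("insert", 3),
    ("delete_min", None),
    ("insert", 2),
    ("delete_min", None),
]
print(remaining_elements(ops))
```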
Our team used claude to help port a bunch of python code to java for a critical service rewrite.
As a "skeptic", I found this to demonstrate both strengths and weaknesses of these tools.
It was pretty good at taking raw python functions and turning them into equivalent looking java methods. It was even able to "intuit" that a python list of strings called "active_set" was a list of functions that it should care about and discard other top level, unused functions. It gave the functions reasonable names and picked usable data types for every parameter, as the python code was untyped.
That is, uh, the extent of the good.
The bad: It didn't "one-shot" this task. The very first attempt, it generated everything, and then replaced the generated code with an "I'm sorry, I can't do that"! After trying a slightly different prompt it of course worked, but it silently dropped the code that caused the previous problem! There was a function that looked up some strings in the data, and the lookup map included swear words, and apparently real companies aren't allowed to write code that includes "shit" or "f you" or "drug", so claude will be no help writing swear filters!
It picked usable types but I don't think I know Java well enough to understand the ramifications of choosing Integer instead of int as a parameter type. I'll have to look into it.
It always writes a bunch of utility functions. It refactored simple and direct conditionals into calls to utility functions, which might not make the code very easy to read. These utility functions are often unused or outright redundant. We have one file with like 5 different date parsing functions, and they were all wrong except for the one we quickly and hackily changed to try different date formats (because I suck so the calling service sometimes slightly changes the timestamp format). So now we have 4 broken date parsing functions and 1 working one and that will be a pain that we have to fix in the new year.
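For what it's worth, the "try different date formats" hack can live in one function instead of five. A rough sketch in Python (the service above is Java, so this is only the shape of the idea); the format list is an assumption, since the actual timestamp formats aren't given.

```python
from datetime import datetime
from typing import Sequence

# Assumed candidate formats; the real service's formats are not known here.
FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%d %H:%M:%S",
)


def parse_timestamp(text: str, formats: Sequence[str] = FORMATS) -> datetime:
    """Try each known format in order and return the first successful parse."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError(f"unrecognised timestamp format: {text!r}")


print(parse_timestamp("2025-12-24T10:30:00+0000").isoformat())
print(parse_timestamp("2025-12-24 10:30:00").isoformat())
```

Raising on an unknown format, rather than returning a default, at least makes a new upstream format show up as an error instead of a silently wrong date.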
The functions looked right at first glance but often had subtle errors. Other times the ported functions had parts where it just gave up and ignored things? These caused outright bugs for our rewrite. Enough to be annoying.
At first it didn't want to give me the file it generated? Also the code output window in the Copilot online interface doesn't always have all the code it generated!
It didn't help at all with the hard part: Actual engineering. I had about 8 hours and needed to find a way to dispatch parameters to all 50ish of these functions and I needed to do it in a way that didn't involve rebuilding the entire dispatch infrastructure from the python code or the dispatch systems we had in the rest of the service already, and I did not succeed. I hand wrote manual calls to all the functions, filling in the parameters, which the autocomplete LLM in intellij kept trying to ruin. It would constantly put the wrong parameters in places and get in my way, which was stupid.
Our use case was extremely laser focused. We were working from python functions that were designed to be self contained and fairly trivial, doing just a few simple conditionals and returning some value. Simple translation. To that end it worked well. However, we were only able to focus the tool on this use case because we already had 8 years of experience developing and engineering this service, and had already built out the engineering of the new service: lots of "infrastructure" that these simple functions could be dropped into, plus easy tooling to debug the outcomes and logic bugs in the functions using tens of thousands of production requests. And that still wasn't enough to kill all errors.
All the times I turned to claude for help on a topic, it let me down. When I thought java reflection was wildly more complicated than it actually is, it provided the exact code I had already started writing, which was trivial. When I turned to it for profiling our spring boot app, it told me to write log statements everywhere. To be fair, that is how I ended up tracking down the slowdown I was experiencing, but that's because I'm an idiot and didn't intuit that hitting a database on the other side of the country takes a long time and I should probably not do that in local testing.
I would pay as much for this tool per year as I pay for Intellij. Unfortunately, last I looked, Jetbrains wasn't a trillion dollar business.
Property based testing can be really useful here.
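A hand-rolled sketch of the idea, using only the standard library: generate random inputs and check that the port agrees with the original. The two functions are toy stand-ins chosen to mimic a classic Python-to-Java porting slip (Python's `//` floors, while Java's integer `/` truncates toward zero). A dedicated library such as Hypothesis would add shrinking and richer input generators.

```python
import random
from typing import Callable, Optional


def find_counterexample(
    reference: Callable[[int], int],
    candidate: Callable[[int], int],
    trials: int = 1000,
    seed: int = 0,
) -> Optional[int]:
    """Return a random input where the two functions disagree, or None."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(trials):
        x = rng.randint(-10**6, 10**6)
        if reference(x) != candidate(x):
            return x
    return None


def original(x: int) -> int:
    return x // 7  # Python floor division


def naive_port(x: int) -> int:
    return int(x / 7)  # truncates toward zero, like Java's int division


bad_input = find_counterexample(original, naive_port)
print(bad_input, original(bad_input), naive_port(bad_input))
```

Fixed example-based tests tend to miss this kind of bug because nobody thinks to write the negative case; random generation stumbles on it almost immediately.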