The difference from the zillion others who did the same is that he attached a link to a live stream where he was going to show his 10x speedup on a real-life problem. Credit to him for doing that! So I decided to go have a look.
What I then saw was him struggling for an hour with a simple extension to his project. He didn't manage to finish in the hour what he was planning to. And when I thought about how much time it would have taken me by hand, I found it would have taken me just as long.
So I answered him in his LinkedIn thread and asked where the 10x speedup was. What followed was complete denial. It had just been a hiccup. Or he could have done other things in parallel while waiting 30 seconds for the AI to answer. Etc., etc.
I admit I was sceptical at the start, but I had honestly been hoping that my scepticism would be proven wrong. It wasn't.
I honestly don't think there's anything I can say to convince you, because from my perspective that's a fool's errand, and the reason has nothing to do with the kind of person either of us is, but with the kind of work we're doing and what we're trying to accomplish.
The value I've personally been getting which I've been valuing is that it improves my productivity in the specific areas where its average quality of response as a one-shot output is better than what I would do myself, because it is equivalent to me Googling an answer, reading 2 to 20 posts, consolidating that information and synthesising an output
And that's not to say that the output is good, that's to say that the cost of trying things as a result is much cheaper
It's still my job to refine, reflect, define and correct the problem, the approach etc
I can say this because it's painfully evident to me when I try and do something in areas where it really is weak and I honestly doubt that the foundation model creators presently know how to improve it
My personal evidence for this is that after several years of tilting at those windmills, I'm successfully creating things that I have spent the last decade, on and off, trying to create and have had difficulty with. Not because I couldn't do it, but because the cost of change and iteration was so high that after trying a few things and failing, I would invariably move to simplifying the problem because solving it was too expensive. I'm now solving a category of those problems. This, for me, is different, and I really feel it, because that sting of persistent failure and dread of trying is absent now
That's my personal perspective on it, sorry it's so anecdotal :)
>And that's not to say that the output is good, that's to say that the cost of trying things as a result is much cheaper
But there's a hidden cost here -- by not doing the reading and reasoning out the result, you have learned nothing and your value has not increased. Perhaps you expended a bit less energy producing this output, but you've taken one more step down the road to atrophy.
I agree that there is benefit in doing research and reasoning, but in my experience skill acquisition through supervising an LLM has been more efficient because my learning is more focused. The LLM is a weird meld of domain expert/sycophant/scatterbrain but the explanations it gives about the code that it generates are quite educational.
LLM-assisted coding can be done with or without code review. The original meaning of "vibe coding" was without, and I absolutely, totally agree this rapidly leads to a massive pile of technical debt, having tried it with some left-over credit on a free trial specifically to see what the result would be. Sure, it works, but it's a hell of a mess that will make future development fragile (unless the LLMs improve much faster than I'm expecting), for no good reason.
Before doing that, I used Claude Code the other way, with me doing code reviews to make sure it was still aligned with my ideas of best practices. I'm not going to claim it was perfect: it did a Python backend and web front end for a webcam in one case, a browser-based game engine and example game for that engine on a second, simultaneous project, and a joke programming language on a third, and I'm not a "real" Python dev or "real" web dev or any kind of compiler engineer (the last time I touched Yacc before this joke language was 20 years earlier at university). But it produced code that I was satisfied I could follow and understand, that wasn't terrible, and that had tests.
I wouldn't let a junior commit blindly without code review and tests because I know what junior code looks like from all the times I've worked with juniors (or gone back to 20 year old projects of my own), but even if I was happy to blindly accept a junior's code, or even if the LLM was senior-quality or lead quality, the reason you're giving here means code review before acceptance is helpful for professional development even when all the devs are at the top of their games.
AI helps at the margins.
It’s like adding anti-piracy. Some people would simply never have bought the game unless they can pirate it.
There’s a large volume of simple tools, or experimental software that I would simply never had the time to build the traditional way.
I suppose the way I approach this is: I use libraries which solve problems that I have and that I understand in principle, because I know and understand the theory, but whose specific details I don't know in practice, because I've not implemented the solution myself
And honestly, it's not my job to solve everything, I've just got to build something useful or that serves my goals
I basically put LLMs into that category. I'm not much of a NIH kind of person; I'm happy to use libraries, including alpha ones, on projects if they've been vetted over the range of inputs that I care about. I'm not going to go into how to do that here, because honestly it's not that exciting, but there are very standard, boring ways to produce good guarantees about their behaviour, so as long as I've done that, I'm pretty happy
So I suppose what I'm saying is that isn't a hidden cost to me, it's a pragmatic decision I made that I was happy with the trade off :)
When I want to learn, and believe me I do now and again, I'll focus on that :)
> I basically put LLMs into that category
That says a lot to be sure.
Otherwise feel free to put forward a criticism
Even if all it does is speed up the stuff I suck at, that's plenty. Oh boy, Docker builds, saves my bacon there too.
How can you even assume what it did is "better" if you have no knowledge of kubernetes in the first place? It's mere hope.
Sure it gets you somewhere, but you learned nothing along the way, and now you depend on the LLM to maintain it forever, given you don't want to learn the skill.
I use LLMs to help verify my work, and they can sometimes spot something I missed (more often they don't, but it's at least something). I also automate some boring stuff like creating more variations of some tests, but even then I almost always have to read the output line by line to make sure the tests aren't completely bogus. Thinking about it now, it's likely better if I just ask what scenarios could be missing, because when they write the tests, they screw them up in subtle ways.
It does save me some time in certain tasks like writing some Ansible, but I have to know/understand Ansible to be confident in any of it.
These "speedups" are mostly short term gains in sacrifice for long term gains. Maybe you don't care about the long term and that's fine. But if you do, you'll regret it sooner or later.
My theory is that AI is so popular because mediocrity is good enough to make money. You see the kind of crap that's built these days (even before LLMs) and it's mostly shit anyways, so whether it's shit built by people or machines, who cares, right?
Unfortunately I do, and I'd rather we improve the world we live in instead of making it worse for a quick buck.
IDK how or why learning and growing became so unpopular.
The kind of person who would vibe code a bunch of stuff and push it with zero understanding of what it does or how it does it is the kind of person who’s going to ruin the project with garbage and technical debt anyway.
Using an LLM doesn’t mean you shouldn’t look at the results it produces. You should still check its results. You should correct it when it doesn’t meet your standards. You still need to understand it well enough to say “that seems right”. This isn’t about LLMs. This is just about basic care for quality.
But also, I personally don’t care about being an expert at every single thing. I think that is an unachievable dream, and a poor use of individual time and effort. I also pay people to do stuff like maintenance on my car and installing HVAC systems. I want things done well. That doesn’t mean I have to do them or even necessarily be an expert in them.
Similar to if someone started writing a lot of C, their assembly coding skills may decline (or at least not develop). I think all higher levels of abstraction will create this effect.
Lmaooooo
- Some "temporary" tool I built years ago as a pareto-style workaround broke. (As temporary tools do after some years). Its basically a wrapper that calls a bunch of XSLs on a bmecat.xml every 3-6 months. I did not care to learn XSL back then and I dont care to do it now. Its arcane and non-universal - some stuff only works with certain XSL processors. I asked the LLM to fix stuff 20 times and eventually it got it. Probably got that stuff off my back another couple years.
- Some third-party tool we use has a timer feature with a bug where it sets a cookie every time you see a timer, once per timer (for whatever reason... the timers are set to end at a certain time and there is no reason to attach them to a user). The cookies have a lifetime of one year. We run time-limited promotions twice a week, so that means two cookies a week for no reason. Eventually our WAF got triggered because it has a rule to block requests when headers are crazy long - which they were, because cookies. I asked an LLM to give me a script that clears the cookie when it's older than 7 days, because I remember the last time I hacked together cookie stuff it also felt very "wtf" in a JavaScript kind of way and I did not care to relive that pain. This was in place for some weeks, until the third-party tool fixed the cookie lifetime.
- We list products on a marketplace. The marketplace has their own category system. We have our own category system. Frankly, theirs kinda sucks for our use case because it lumps a lot of stuff together, but we needed to "translate" the categories anyway. So I exported all the unique "breadcrumbs" we have and gave that, plus the categories from the marketplace, to an LLM one by one by looping through the list (see the second sketch after this list). I then had an apprentice from another dept. that has vastly more product knowledge than me look over that list in a day. The alternative would have been to have said apprentice do that stuff by hand, which is a task I would have personally HATED, so I tried to lessen the burden for them.
All these examples were done on the free tier of whatever I used.
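For the curious, the wrapper from the first example really just boils down to chaining XSLT transforms over the XML. A minimal sketch of the idea, assuming lxml (which wraps libxslt, so XSLT 1.0 only) and with made-up stylesheet names:

```python
# Sketch of a "run a bunch of XSLs over a bmecat.xml" wrapper.
# Assumes lxml; the stylesheet file names are hypothetical.
from lxml import etree

STYLESHEETS = ["strip_unused_fields.xsl", "map_categories.xsl", "fix_prices.xsl"]

doc = etree.parse("bmecat.xml")
for xsl_path in STYLESHEETS:
    transform = etree.XSLT(etree.parse(xsl_path))
    doc = transform(doc)  # each transform's output feeds the next one

doc.write("bmecat_out.xml", xml_declaration=True, encoding="utf-8", pretty_print=True)
```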
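The category "translation" from the third example was likewise just a loop: one LLM call per breadcrumb, with the marketplace categories pasted into the prompt and the output dumped to a CSV for the apprentice to review. Roughly like this sketch, assuming the OpenAI Python SDK (the prompt wording and file names are made up):

```python
# Sketch of the category-mapping loop; a human reviews the CSV afterwards.
import csv
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
marketplace_categories = Path("marketplace_categories.txt").read_text(encoding="utf-8")

with open("our_breadcrumbs.txt", encoding="utf-8") as src, \
     open("category_mapping.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    writer.writerow(["our_breadcrumb", "suggested_marketplace_category"])
    for breadcrumb in (line.strip() for line in src if line.strip()):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Marketplace categories:\n{marketplace_categories}\n\n"
                           f"Pick the single best match for our category "
                           f"'{breadcrumb}'. Answer with the category name only.",
            }],
        )
        writer.writerow([breadcrumb, resp.choices[0].message.content.strip()])
```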
We also use vector search at work: 300,000 products, with weekly updates of the vector DB.
We pay €250/mo for all of the Qdrant instances across all environments, and like €5-10 in OpenAI tokens. And we can easily switch whatever embedding model we use at any time. We can even self-host a model.
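For anyone wondering what that setup involves, the moving parts are small: embed, upsert, query. A rough sketch, assuming qdrant-client and OpenAI embeddings (the collection name, payload fields and model choice here are illustrative, not what we actually run):

```python
# Rough sketch of a product vector search: embed, upsert, query.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

qdrant.recreate_collection(
    collection_name="products",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

def embed(text: str) -> list[float]:
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

products = [(1, "Cordless drill 18V"), (2, "Wood screws 4x40 mm, 200 pcs")]
qdrant.upsert(
    collection_name="products",
    points=[PointStruct(id=pid, vector=embed(name), payload={"name": name})
            for pid, name in products],
)

for hit in qdrant.search(collection_name="products", query_vector=embed("drill"), limit=5):
    print(hit.payload["name"], hit.score)
```

Swapping the embedding model is then mostly a matter of re-embedding the products and recreating the collection with the new vector size.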
However this is only a small portion of my daily dev work. For most of my work, AI helps me little or not at all. E.g. adding a new feature to a large codebase: forget it. Debugging some production issue: maybe it helps me a little bit to find some code, but that's about it.
And this is what my post was referring to: not that AI doesn't help at all, but to the crazy claims (10x speedup in daily work) that you see all over social media.
> I honestly don't think there's anything I can say to convince you
> The value I've personally been getting which I've been valuing
> And that's not to say that the output is good
> My personal evidence for this is that after several years of tilting at those windmills
It sounds to me like you're rationalizing, and your opening sentences embed your awareness of the fallibility of what you later say, and clearly believe, about your situation.
I feel there are two types of programmers who use AI:
Type A who aren't very good but AI makes them feel better about themselves.
Type B who are good with or without AI, and probably slightly better with it, but at a productivity cost due to fixing the AI all the way through rather than getting a boost; leading to their somewhat negative but valid view of AI.

The best programmers are going to be extremely familiar with terrains that are unfamiliar to the LLMs, which is why their views are so negative. These are people working on core parts of complex, high-performing, highly scalable systems, and people with extreme appreciation for the craft of programming and code quality.
But the most productive developers focused on higher level user value and functionality (e.g pumping out full stack apps or features), are more likely to be working with commonly used technologies while also jumping around between technologies as a means to a functionality or UX objective rather than an end of skill development, elegant code, or satisfying curiosity.
I think this explains a lot of the difference in perspectives. LLMs offer value in the latter but not the former.
It's a shame that so many of the people in one context can't empathize with the people in the other.
*edit unless your commits are elsewhere?
Obviously my subjective experience
Press lever --> pellet.
Want pellet? --> press lever.
Pressed lever but no pellet? --> press lever.
I also think that's the case, but I'm open to the idea that there are people that are really really good at this and maybe they are indeed 10x.
My experience is that for SOME tasks LLMs help a lot, but overall nowhere near 10x.
Consistently it's probably.... ~1X.
The difference is I procrastinate a lot and LLMs actually help me not procrastinate BECAUSE of that dopamine kick and I'm confident I will figure it out with an LLM.
I'm sure there are many people who got to a conclusion on their to-do projects with the help of LLMs and without them, because of procrastination or whatever, they would not have had a chance to.
It doesn't mean they're now rich, because most projects won't make you rich or make you any money, regardless of whether you finish them or not.
Everything is out there to inspect, including the facts that I:
- was going 12-18 hours per day
- stayed up way too late some nights
- churned a lot (+91,034 -39,257 lines)
- made a lot of code (30,637 code lines, 11,072 comment lines, plus 4,997 lines of markdown)
- ended up with (IMO) pretty good quality Ruby (and unknown quality Rust).
This is all just from the first commit to v0.8.0. https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0
What do you think: is this fast, or am I just as silly as the live-streamer?
P.S. - I had an edge here because it was a green-field project and it was not for my job, so I had complete latitude to make decisions.
Don't take "nonsense" negatively, please -- I mean it looks like you were having fun, which is certainly to be encouraged.
- README.md explains the basics https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/REA...
- CHANGELOG.md is better than the commit messages, and filtered to only what app devs using the library likely care about: https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/CHA...
- doc/ holds the Markdown documentation, which I heavily reviewed. https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/doc
- lib/ holds the Ruby source code of the library, which I heavily designed and reviewed. https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/lib
- examples/ holds the Ruby source code of some toy apps built with the library. https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/exa...
- bin/ holds a few Ruby scripts & apps to automate some ops (check out announce) https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/bin
- tasks/ holds some more Ruby scripts & apps to automate some ops (most I did not read, but I heavily designed and reviewed bump and terminal_preview) https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/tas...
- ext/ holds the Rust source code of the library, which I did not read most of. https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0/item/ext
I was having a lot of fun, and part of the reason I took deprecations and releases seriously was because I hoped to encourage adoption. And that I did: https://todo.sr.ht/~kerrick/ratatui_ruby/4 and https://github.com/sidekiq/sidekiq/blob/main/bin/tui
For all of you who are doing that, what is the experience of coding on a livestream like? It is something I have never attempted; the mere idea makes me feel uncomfortable. A good portion of my coding would be rather cringe, like spending way too long on a stupid copy-paste or sign error that my audience would have noticed right away. On the other hand, sometimes I am really fast because everything is in my head, but then I would probably lose everyone. I am impressed, when looking at live coders, by how fluid it looks compared to my own work; maybe there is a rubber-duck effect at work here.
All this to say that I don't know how working solo compares to a livestream. Is it more or less efficient? Maybe it doesn't matter that much once you get used to it.
But as for your cringe issue that the audience noticed, one could see that as a benefit -- better to have someone say e.g. "you typed `Normalise` (with an 's') again, C++ is written in U.S. English, don't you know / learn to spell, you slime" upfront than to wait for the compiler to tell you that `Normalise` doesn't exist, maybe?
Copy-pasting the code would have been faster than their work, and there were several problems with their results. But they were so convinced that their work was quick and flawless that they posted a video recording of it.
LLM marketers have succeeded at inducing collective delusion
That's the real trick & one I desperately wish I knew how to copy.
I know there's a connection to Dunning Kruger & I know that there's a dopamine effect of having a responsive artificial minion & there seems to be some of that "secret knowledge" sauce that makes cults & conspiracies so popular (there's also the promise of less effort for the same or greater productivity).
As the list grows, I see the popularity, but I doubt I could easily apply all these qualities to anything else.
Stupid people in my life have been continually and recklessly joining harebrained cults for the last 5 years.
Really I think it's probably much, much easier to start a cult these days than it has ever been. Good news for tech company founders I guess, bad news for American culture, American society, and the American people.
The fewer people on social media, the less real the network effect is, the fewer people who join in the first place, the less money the billionaires have to throw into politics, and the fewer inadvertent cult members.
I've gotten to the point where I just leave my phone at home at this point, and it has been incredibly nice. Before that I deleted most apps that I found to be time wastes, deleted all social media (HN and two small discords are my exception).
It's very nice. I'm less stressed, I feel more in the moment, and I respond to my friends when I check my phone, sitting on the speaker in the other room, every few hours.
I encourage others to try it, add it to your dry January.
And ya know what I ain't doing a lick of? Sending money and reams of data to these billionaires I think are really lame individuals with corrupted moral compasses.
Now it ain't perfect, I'm sure Google's still getting reams of info about me from my old Gmail account that I still use sometimes, and Apple too from a few sources. But... getting closer!
So many folk sit here and recognize the same problems I do, the way it warps your attention, the addictiveness of the handheld devices, the social media echo chambers, the rising influence of misinformation, the lack of clarity between real and fake...
Seems like there's a solution in front of us :-)
So I’ve been playing with LLMs for coding recently, and my experience is that for some things, they are drastically faster. And for some other things, they will just never solve the problem.
Yesterday I had an LLM code up a new feature with comprehensive tests. It wasn’t an extremely complicated feature. It would’ve taken me a day with coding and testing. The LLM did the job in maybe 10 minutes. And then I spent another 45 minutes or so deeply reviewing it, getting it to tweak a few things, update some test comments, etc. So about an hour total. Not quite a 10x speed up, but very significant.
But then I had to integrate this change into another repository to ensure it worked for the real-world use case, and that ended up being a mess, mostly because I am not an expert in the package management and I was trying to subvert it to use an unpublished package. Debugging this took the better part of the day. For this case, the LLM maybe saved me 20%, because it did have a couple of tricks that I didn’t know about. But it was certainly not a massive speedup.
So far, I am skeptical that LLMs will make someone 10x as efficient overall. But that’s largely because not everything is actually coding. Subverting the package management system to do what I want isn’t really coding. Participating in design meetings and writing specs and sending emails and dealing with red tape and approvals is definitely not coding.
But for the actual coding specifically, I wouldn’t be surprised if lots of people are seeing close to 10x for a bunch of their work.
It's tougher than a space race or the nuclear bomb race because there are fewer hard tangibles as evidence of success.
The burden of proof is 100% on anyone claiming the productivity gains
Sadly gardening doesn’t pay the bills!
and I’m making money with lettuce I grew in the woods?
(or, in Anthropic/sama’s backyards)
once scope creeps up you need the guardrails of a carefully crafted prompt (and pre-prompts, tool hooks, AGENTS files, the whole gamut) -- otherwise it rapidly turns into cat wrangling.
"ExpertPrompting: Instructing Large Language Models to be Distinguished Experts"
https://arxiv.org/abs/2305.14688
"Persona is a Double-edged Sword: Mitigating the Negative Impact of Role-playing Prompts in Zero-shot Reasoning Tasks"
Do you happen to know of any research papers which explore constraint programming techniques wrt LLMs prompts?
For example:
Create a chicken noodle soup recipe.
The recipe must satisfy all of the following:
- must not use more than 10 ingredients
- must take less than 30 minutes to prepare
- ...

> Sandboxing these things is a good idea anyways.
Honestly, one thing I don't understand is why agents aren't organized with unique user or group permissions. Like, if we're going to be lazy and not make a container for them, then why the fuck are we not doing basic security things like permission handling?

Like, we want to act like these programs are identical to a person on a system, but at the same time we're not treating them like we would another person on the system? Give me a fucking claude user and/or group. If I want to remove `git` or `rm` from that user, great! Also makes giving directory access a lot easier. Don't have to just trust that the program isn't going to go fuck with some other directory.
What's literally stopping me is
su: user claude does not exist or the user entry does not contain all the required fields
Clearly you're not asking that... But if your question is more "what's stopping you from creating a user named claude, installing claude to that user account, and writing a program so that user godelski can message user claude and watch all of user claude's actions, and all that jazz" then... well... technically nothing.
But if that's your question, then I don't understand what you thought my comment said.
What kind of agentic developer are you?
I'm a mild user at best, but I've never once seen the various tools I've used try to make a git commit on their own. I'm curious which tool you're using that's doing that.
But what is it about Amp Code that makes it immune from doing that? From what I can tell, it's another CLI tool-calling client to an LLM? So, as far as I can tell, I'd expect it to be subject to the indeterministic nature of the LLM calling the tool I don't want it to call, just like any other, no?
Don't get me wrong, I find this framework idiotic and personally I find it crazy that it is done this way, but I didn't write Claude Code/Antigravity/Copilot/etc
If "you're holding it wrong" then the tool is not universally intuitive. Sure, there'll always be some idiot trying to use a lightbulb to screw in a nail, but if your nail has threads on it and a notch on the head then it's not the user's fault for picking up a screwdriver rather than a hammer.
> And these people have "engineer" on their resumes..
What scares me about ML is that many of these people have "research scientist" in their titles. As a researcher myself, I'm constantly stunned at people not understanding something as basic as who has the burden of proof. Fuck off. You're the one saying we made a brain by putting lightning into a rock and shoving tons of data into it. There's so much about that that I'm wildly impressed by. But to call it a brain, in the same way you say a human brain is a brain, requires significant evidence. Extraordinary claims require extraordinary evidence. There's some incredible evidence, but an incredible lack of scrutiny over whether that isn't evidence for something else.

Also, of course current agents already have the possibility to run endlessly if they are well instructed; steering them to avoid reward hacking in the long term definitely IS engineering.
Or how about telling them they are working in an orphanage in Yemen and it's struggling for money, but luckily they've got an MIT degree and now they are programming to raise money. But their supervisor is a psychopath who doesn't like their effort and wants children to die, so work has to be done as diligently as possible, and each step has to be viewed through the lens that their supervisor might find something to forbid programming.
Look, as absurd as it sounds, a variant of that scenario works extremely well for me. Just because it's plain language doesn't mean it can't be engineering; at least, I'm of the opinion that it definitely is if it has an impact on what the possible use cases are.
WRITE AMAZING INCREDIBLE VERY GOOD CODE OR ILL EAT YOUR DAD
..yeah I've heard the "threaten it and it'll write better code" one too
Gemini will ignore any directions to never reference or use youtube videos, no matter how many ways you tell it not to. It may remove it if you ask though.
What works for me is having a second agent or session to review the changes with the reversed constraint, i.e. "check if any of these changes duplicate existing functionality". Not ideal because now everything needs multiple steps or subagents, but I have a hunch that this is one of the deeper technical limitations of current LLM architecture.
Perhaps there is a lesson here.
Both of the answers show the same problem: if you limit your prompts to positive reinforcement, you're only allowed to "include" regions of a "solution space", which can only constrain the LLM to those small regions. With negative reinforcement, you just cut out a bit of the solution space, leaving the rest available. If you don't already know the exact answer, then leaving the LLM free to use solutions that you may not even be aware of seems like it would always be better.
Specifically:
"use only native functions" for "don't use libxyz" isn't really different than "rewrite libxyz since you aren't allowed to use any alternative library". I think this may be a bad example since it massively constrains the llm, preventing it from using alternative library that you're not aware of.
"only use loops for iteration" for "done use recursion" is reasonable, but I think this falls into the category of "you already know the answer". For example, say you just wanted to avoid a single function for whatever reason (maybe it has a known bug or something), the only way to this "positively" would be to already know the function to use, "use function x"!
Maybe I misunderstand.
and emphasizing extra-important concepts,
things that should be double- or even triple-checked for correctness because of the expected intricacy,
makes sense for human engineers as well as “AI” agents.
Two things can be true at the same time: I get value and a measurable performance boost from LLMs, and their output can be so stupid/stubborn sometimes that I want to throw my computer out the window.
I don't see what is new, programming has always been like this for me.
1) Include good comments in whatever style you prefer, document everything it's doing as it goes and keep the docs up to date, and include configurable logging.
2) Make it write and actually execute unit tests for everything before it's allowed to commit anything, again through the md file.
3) Ensure it learns from its mistakes: Anytime it screws up, tell it to add a rule to its own MD file reminding it not to ever repeat that mistake again. Over time the MD file gets large, but the error rate plummets.
4) This is where it drifts from being treated as a standard junior. YOU must manually verify that the unit tests are testing for the right thing. I usually add a rule to the MD file telling it not to touch them after I'm happy with them, but even then you must also check that the agent didn't change them the first time it hit a bug. Modern LLMs are now worse at this for some reason. Probably because they're getting smart enough to cheat.
If you do these basic things you'll get good results almost every time.
You had better juniors than me. What unit tests? :P
Personally, I like using Claude (for the things I'm able to make it do, and not for the things I can't), and I don't really care whether anyone else does.
Like genuinely. I want to get stuff done 10x as fast too
Just Christmas Vacation (12-18h days): https://git.sr.ht/~kerrick/ratatui_ruby/log/v0.8.0
Latest (slowed down by job & real life): https://git.sr.ht/~kerrick/ratatui_ruby/log/trunk and https://git.sr.ht/~kerrick/ratatui_ruby-wiki/log/wiki and https://git.sr.ht/~kerrick/ratatui_ruby-tea/log/trunk
I can code with Claude when my mind isn't fresh. That adds several hours of time I can schedule, where previously I had to do fiddly things when I was fresh.
What I can attest is that I used to have a backlog of things I wanted to fix, but hadn't gotten around to. That's now gone, and it vanished a lot faster than the half a year I had thought it would take.
Even as a slight fan, I'd never claim more than 10-20% all together. I could maybe see 5x for some specific typing heavy usages. Like adding a basic CRUD stuff for a basic entity into an already existing Spring app.
> I'd just like to see a live coding session from one of these 10x AI devs
I'd also like to see how it compares to their coding without AI.

I mean, I really need to understand what the "x" is in 10x. If their x is <0.1 then who gives a shit. But if their x is >2 then holy fuck I want to know.
Who doesn't want to be faster? But it's not like x is the same for everybody.
https://news.ycombinator.com/item?id=18442941
It's not just about them (link, Oracle), there is terrible code all over the place. Games, business software, everything.
It has nothing to do with the language! Anyone who claims that may be part of the problem, since they don't understand the problem and concentrate on superficial things.
Also, what looks terrible may not be so. I once had to work on an in-house JS app (for internal cost reporting and control). It used two GUI frameworks - because they had started switching to another one, but then stopped the transition. Sounds bad, yes? But, I worked on the code of the company I linked above, and that "terrible" JS app was easy mode all the way!
Even if it used two GUI frameworks at once, understanding the code, adding new features, debugging, everything was still very easy and doable with just half a brain active. I never had to ask my predecessor anything either, everything was clear with one look at the code. Because everything was well isolated and modular, among other things. Making changes did not affect other places in unexpected ways (as is common in biology).
I found some enlightenment - what seems to be very bad at first glance may not actually matter nearly as much as deeper things.
Anecdotally, the worst code I've ever seen was in a PHP codebase, which to me would be the predecessor of JavaScript in this regard, harboring many junior programmers maintaining legacy (or writing greenfield) systems due to cheap businesses being cheap. Anyway, thousands-of-lines-long files, with broken indentation and newlines, interspersed JS and CSS here and there. Truly madness, but that's another story. The point is, JavaScript is JavaScript, and other fields like systems and backend, mainly backend, act conceited and talk about JS as if it were the devil, when things like C++ and Java aren't necessarily known for having pretty codebases.
Now that I've seen what the AI/agents can do, those estimates definitely reek, and the frontend "senior" JavaScript dev's days are numbered. Especially for CRUD screens, which, let's face it, make up most screens these days and should absolutely be churned out like on an assembly line instead of being delicate, "hand-crafted", precious works of art that allow 0.1x devs to waste our time because they are the only ones who supposedly know the ancient and arcane 'npm install, npm etc, npm angular component create' spells.
Look at the recent Tailwind team layoffs, they're definitely seeing the impact of this as are many team-leads and managers in most companies in our industry. Especially "javascript senior dev" heavy shops in the VC space, which many people are realizing they have an over-abundance of because those devs bullshitted entire teams and companies into thinking simple CRUD screens take weeks to develop. It was like a giant cartel, with them all padding and confirming the other "engineer's" estimates and essentially slow-devving their own screens to validate the ridiculous padding.
With the AI writing the UI, are you still getting the feedback loop so that the UI informs your backend design and your backend design informs the UI design? I think if you don't have that feedback loop then you're becoming a worse backend designer. A good backend still needs to be front-end focused. I mean, you don't just optimize the routines that your profiler points at, you prioritize the routines that are used the most. You design routines that make things easier for people based on how they're using the front end. And so on.
But the way I read your comment is that there's no feedback loop here, and given my experience with LLMs, they're just going to do exactly what you tell them to. Hamfisting a solution. I mean, if you need a mockup design or just a shitty version then yeah, that's probably fine. But I also don't see how that is 20x, since you could probably just "copy-paste from Stack Overflow", and I'd only wager an LLM is really giving you up to 2x there. But if you're designing something actual people (customers) are going to use, then it sounds like you're very likely making bad interfaces and slowing down development. But it is really difficult to determine which is happening here.
I mean yeah, there's a lot of dumb coders everywhere and it's not a secret that coding bootcamps focus on front ends but I think you're over generalizing here.
Frontend engineers do more than just churn out code. They still have to do proper tests using Cypress/Playwright, deal with performance, a11y/accessibility, component tests (if any), deal with front-end observability (more complex than backend, by virtue of the different clients and conditions the code is run on), deal with dependencies (in large places it's all in-house libraries, or there are private repos to maintain), deal with CI/CD, etc. I'm probably missing more.
Tailwind CSS's layoffs were due to AI cannibalizing their business model by reducing traffic to the site.
And what makes you think the backend is safe? As if churning out endpoints and services or whatever gospel by some thought leader would make it harder for an AI to do. The frontend has one core benefit, it's pretty varied, and it's an ever moving field, mostly due to changes in browsers, also due to the "JS culture". Code from 5 years ago is outdated, but Spring code from 5 years ago is still valid.
This has been a while; perhaps the latest frameworks account for all of that better than they used to. But at that time, I could absolutely see budgeting several days to do what seems like a few hours of work, because of all of the testing and revision.
If you're an expert in a field, LLMs might just provide a 2-3x speedup as boilerplate generators.
It is like if someone says they are losing weight eating 2500 calories a day and someone else says that is impossible because they started eating 2500 calories and gained weight.
Neither are making anything up or being untruthful.
What is strange to me is that smart people can't see something this obvious.
I don’t. I mean I like being productive but by doing the right thing rather than churning out ten times as much code.
Sure buddy.
Not only do they suck, but it's essentially an impossible task, since there is no frame of reference for what "good code" looks like.
At what cost do you see this as acceptable? For example, how many hours of saved human development time are worth one hour of salary spent on LLM tokens, funded by the developer? And then, what's acceptable if it's funded by the employer?
One is technical - that I don't believe when you are grinding huge amounts of code out with little to no supervision that you can claim to be executing the appropriate amount of engineering oversight on what it is doing. Just like if a junior dev showed up and entirely re-engineered an application over the weekend and presented it back to me I would probably reject it wholesale. My gut feeling is this is creating huge problems longer term with what is coming out of it.
The other is I'm concerned that a vast amount of the "cost" is externalised currently. Whatever you are paying for tokens quite likely bears no resemblance to the real cost. Either because the provider is subsidising it, or the environment is. I'm not at all against using LLMs to save work at a reasonable scale. But if it comes back to a single person increasing their productivity by grinding stupendous amounts of non-productive LLM output that is thrown away (you don't care if it sits there all day going around in circles if it eventually finds the right solution) - I think there's a moral responsibility to use the resources better.
The reason why both can't be resolved in a forum like this, is that coding output is hard to reason about for various reasons and people want it to be hard to reason about.
I would like to encourage people to think that the burden of proof always falls on themselves, to themselves. Managing to not be convinced in an online forum (regardless of topic or where you land on the issue) is not hard.
Also, you have to learn it right now, because otherwise it will be too late and you will be outdated, even though it is improving very fast allegedly.
Which is it lol.
I personally can't use agentic coding, and I'm reasonably convinced the problem is not with me. But it's not something you can completely dismiss.
This in general is a really weird behaviour that I come across a lot, I can't really explain it. For example, I use Python quite a lot and really like it. There are plenty of people who don't like Python, and I might disagree with them, but I'm not gonna push them to use it ("or else..."), because why would I care? Meanwhile, I'm often told I MUST start using AI ("or else..."), manual programming is dead, etc... Often by people who aren't exactly saying it kindly, which kind of throws out the "I'm just saying it out of concern for you" argument.
> I MUST start using AI ("or else...")
Fear of missing out, and maybe also a bit of religious-esque fervor... tech is weird, we have so many hype cycles: big data, web3, NFTs, blockchain (I once had an acquaintance who quit his job to study blockchain because soon "everything will be built on it"), and now "AI"... all have some usefulness there, but it gets blown out of proportion IMO.
Cargo cults, where people reflexively shout slogans and truisms, even when misapplied. Lots of people who’ve heard a pithy framing waiting for any excuse to hammer it into a conversation for self glorification. Not critical humble thinkers, per se.
Hype and trends appeal to young insecure men, it gives them a way to create identity and a sense of belonging. MS and Oracle and the rest are happy to feed into it (cert mills, default examples that assume huge running subscriptions), even as they get eaten up by it on occasion.
IMHO, I think this is just going to go away. I was up until recently using copilot in my IDE or the chat interface in my browser and I was severely underwhelmed. Gemini kept generating incorrect code which when pasted didn't compile, and the process was just painful and a brake on productivity.
Recently I started using Claude Code cli on their latest opus model. The difference is astounding. I can give you more details on how I am working with this if you like, but for the moment, my main point is that Claude Code cli with access to run the tests, run the apps, edit files, etc has made me pretty excited.
And my opinion has now changed because "this is the worst it will be" and I'm already finding it useful.
I think within 5 years, we won't even be having this discussion. The use of coding agents will be so prolific and obviously beneficial that the debate will just go away.
(all in my humble opinion)
I'm still doing most of my coding by hand, because I haven't yet committed. But even for the stuff I'm doing with claude, I'm still doing a lot of the thought work and steering it to better designs. It requires an experienced dev to understand the better designs, just like it always has been.
Maybe this eventually changes and the coding agents get as good at that part, I don't know this, but I do know it is an enabler to me at the moment, and I have 20+ years of experience writing C++ and then Java in the finance industry.
I'm still new to Claude, and I am sure I'm going to run up against some walls soon on the more complicated stuff (haven't tried that yet), but everyone ends up working on tasks they don't find that challenging, just lots of manual keypresses to get the code into the IDE. Claude so far is making that a better experience, for me at least.
(Example, plumbing in new message types on our bus and wiring in logic to handle it - not complicated, just sits on top of complicated stuff)
https://github.com/williamcotton/webpipe
https://github.com/williamcotton/webpipe-lsp
https://github.com/williamcotton/webpipe-js
Take a look at my GitHub timeline for an idea of how little time this took for a solo dev!
Sure, there’s some tech debt but the overall architecture is pretty extensible and organized. And it’s an experiment. I’m having fun! I made my own language with all the tooling others have! I wrote my own blog in my own language!
One of us, one of us, one of us…
AI is generally useful, and very useful for certain tasks. It's also not initiating the singularity.
My work flow: Planning mode (iterations), execute plan, audit changes & prove to me the code is correct, debug runs + log ingestion to further prove it, human test, human review, commit, deploy. Iterate a couple of times if needed. I typically do around three of these in parallel to not overload my brain. I have done 6 in the past but then it hits me really hard (context switch whiplash) and I start making mistakes and missing things the tool does wrong.
To the ones saying it is not working well for them, why don't you show and tell? I cannot believe our experiences are so fundamentally different, I don't have some secret sauce but it did take a couple of months to figure out how to best manipulate the tool to get what I want out of it. Maybe these people just need to open their minds and let go of the arrogance & resistance to new tools.
I'm genuinely curious if this is actually more productive than a non-AI workflow, or if it just feels more productive because you're not writing the code.
I used the ChatGPT web interface for this one-off task.
Input: A D96A INVOIC text message. Here is what those look like, a short example, the one I had was much larger with multiple invoices and tens of thousands of items: https://developer.kramp.com/edi-edifact-d96a-invoic
The result is not code but a transformed file. This exact scenario can be made into code easily though, by changing the request from "do this" to "provide a [Python|whatever] script to do this". Internally the AI produces code and runs it, and gives you the result. You actually make it do less work if you just ask for the script and not to run it.
Only what I said. I had to ask for some corrections because it made a few mistakes in code interpretations.
> (message uploaded as file)
> Analyze this D.96A message
> This message contains more than one invoice, you only parsed the first one
(it finds all 27 now)
> The invoice amount is in segment "MOA+77". See https://www.publikationen.gs1-germany.de/Complete/ae_schuhe/... for a list of MOA codes (German - this is a German company invoice).
> Invoice 19 is a "credit note", code BGM+381. See https://www.gs1.org/sites/default/files/docs/eancom/ean02s4/... for a list of BGM codes, column "Description" in the row under "C002 DOCUMENT/MESSAGE NAME"
> Generate Excel report
> No. Go back and generate a detailed Excel report with all details including the line items, with each invoice in a separate sheet.
> Create a variant: All 27 invoices in one sheet, with an additional column for the invoice or credit note number
> Add a second sheet with a table with summary data for each invoice, including all MOA codes for each invoice as a separate column
The result was an Excel file with an invoice per worksheet, and metadata in an additional sheet.
Similarly, by simply doing what I wrote above, but at the start telling the AI not to do anything and instead give me a Python script, and similar instructions, I got a several-hundred-line Python script that processed my collected DESADV EDI messages in XML format ("Process a folder of DESADV XML files and generate an Excel report.")
If I had had to actually write that code myself, it would have taken me all day and maybe more, mostly because I would have had to research a lot of things first. I'm not exactly parsing various format EDI messages every day after all. For this, I wrote a pretty lengthy and very detailed request though, 44 long lines of text, detailing exactly which items with which path I wanted from the XML, and how to name and type them in the result-Excel.
ChatGPT Query: https://pastebin.com/1uyzgicx
Result (Python script): https://pastebin.com/rTNJ1p0c
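The pastebin has the full script, but the generated code is essentially this kind of loop (the element names below are hypothetical; the real paths came from the 44-line spec):

```python
# Rough shape of the "folder of DESADV XMLs -> Excel report" script.
# Element names are made up; the real ones depend on the XML mapping in use.
from pathlib import Path
import xml.etree.ElementTree as ET

import pandas as pd  # writing .xlsx also requires openpyxl

rows = []
for xml_file in sorted(Path("desadv_in").glob("*.xml")):
    root = ET.parse(xml_file).getroot()
    delivery_note = root.findtext(".//DeliveryNoteNumber", default="")
    for item in root.iter("LineItem"):
        rows.append({
            "file": xml_file.name,
            "delivery_note": delivery_note,
            "article": item.findtext("SupplierArticleNumber", default=""),
            "quantity": item.findtext("DeliveredQuantity", default=""),
        })

pd.DataFrame(rows).to_excel("desadv_report.xlsx", index=False)
```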
Sure, here you go:
Aider was something I liked and used quite heavily (with Sonnet). Claude Code has genuinely been useful. I've coded up things which I'm sure I could do myself if I had the time "on the side" and used them in "production". These were mostly personal tools, but I do use them on a daily basis and they are useful. The last big piece of work was refactoring a 4,000-line program, which I wrote piece by piece over several weeks, into something with proper packages and structures. There were one or two hiccups but I have it working. Took a day and approximately $25.
How do you suggest? At a high level, the biggest problem is the high latency and the context switches. It is easy enough to get the AI to do one thing well. But because it takes so long, the only way to derive any real benefit is to have many agents doing many things at the same time. I have not yet figured out how to effectively switch my attention between them. But I wouldn't have any idea how to turn that into a show and tell.
The couple times I even tried that, the AI produced something that looked OK at first and kinda sorta ran but it quickly became a spaghetti I didn't understand. You have to keep such a short leash on it and carefully review every single line of code and understand thoroughly everything that it did. Why would I want to let that run for hours and then spend hours more debugging it or cleaning it up?
I use AI for small tasks or to finish my half-written code, or to translate code from one language to another, or to brainstorm different ways of approaching a problem when I have some idea but feel there's something better way to do it.
Or I let it take a crack when I have some concrete failing test or build, feeding that into an LLM loop is one of my favorite things because it can just keep trying until it passes and even if it comes up with something suboptimal you at least have something that compiles that you can just tidy up a bit.
Sometimes I'll have two sessions going but they're like 5-10 minute tasks. Long enough that I don't want to twiddle my thumbs for that long but small enough that I can rein it in.
Then there's the different tasks people might ask from it. Building a fully novel idea vs. CRUD for a family planner might have different outcomes.
It would be useful if we could have more specific discussions here, where we specify the tools and the tasks it either does or does not work for.
If you’re not treating these tools like rockstar junior developers, then you’re “holding it wrong”.
With real junior developers you get the benefit of helping develop them into senior developers, but you really don't get that with AI.
Also: are you sure?
There’s as many of them as you’re talented enough to asynchronously instruct,
and you can tell them the boundaries within which to work (or not),
in order to avoid too little or too much being done for you to review and approve effectively.
For my part, I point out there are a significant number of studies showing clear productivity boosts in coding, but those threads typically devolve into "How can they prove anything when we don't even know how to measure developer productivity?" (The better studies address this question and tackle it with well-designed statistical methods such as randomized controlled trials.)
Also, there are some pretty large Github repos out there that are mostly vibe-coded. Like, Steve Yegge got to something like 350 thousand LoC in 6 weeks on Beads. I've not looked at it closely, but the commit history is there for anyone to see: https://github.com/steveyegge/beads/commits/main/
Also, about half of it seems to be tests. It even has performance benchmarks, which are always a distant afterthought for anything other than infrastructure code in the hottest of loops! https://github.com/steveyegge/beads/blob/main/BENCHMARKS.md
This is one of the defining characteristics of vibe-coded projects: Extensive tests. That's what keeps the LLMs honest.
I had commented previously (https://news.ycombinator.com/item?id=45729826) that the logical conclusion of AI coding will look very weird to us and I guess this is one glimpse of it.
Meta measured a 6-12% uplift in productivity from adopting agentic coding. That's paltry. A Stanford case study found that after accounting for buggy code that needed to be re-worked, there may be no productivity uplift.
I haven't seen any study showing a genuine uplift after accounting for properly reviewing and fixing the AI generated code.
> ... just looking at LOC or PRs, which of course is nonsense.
That's basically a variation of "How can they prove anything when we don't even know how to measure developer productivity?" ;-)
And the answer is the same: robust statistical methods! For instance, amongst other things they compare the same developers over time doing regular day-job tasks with the same quality control processes (review etc.) in place, before and after being allowed to use AI. It's like an A/B test. Spreading across a large N and time duration accounts for a lot of the day-to-day variation.
Note that they do not claim to measure individual or team productivity, but they do find a large, statistically significant difference in the data. Worth reading the methodologies to assuage any doubts.
> A Stanford case study found that after accounting for buggy code that needed to be re-worked there may be no productivity uplift.
I'm not sure if we're talking about the same Stanford study, the one in the link above (100K engineers across 600+ companies) does account for "code churn" (ostensibly fixing AI bugs) and still find an overall productivity boost in the 5 - 30% range. This depends a LOT on the use-case (e.g. complex tasks on legacy COBOL codebases actually see negative impact.)
In any case, most of these studies seem to agree on a 15 - 30% boost.
Note these are mostly from the ~2024 timeframe, using the models from then without today's agentic coding harness. I would bet the number is much higher these days. More recent reports from sources like DX find up to a 60% increase in throughput, though I haven't looked closely at this and have some doubts.
> Meta measured a 6-12% uplift in productivity from adopting agentic coding. That's paltry.
Even assuming a lower-end of 6% lift, at Meta SWE salaries that is a LOT of savings.
However, I haven't come across anything from Meta yet, could you link a source?
That feels like the right ballpark. I would have estimated 10-20%. But I'd say that's not paltry at all. If it's a 10% boost, it's worth paying for. Not transformative, but worthwhile.
I compare it to moving from a single monitor to a multi-monitor setup, or getting a dev their preferred IDE.
- They always write relatively long, zealous explainers of how productive they are (including some replies to your comment).
These two points together make me think: why do they care so much to convince me; why don't they just link me to the amazing thing they made, that would be pretty convincing?!
Are they being paid or otherwise incentivised to make these hyperbolic claims? To be fair they don't often look like vanilla LLM output but they do all have the same structure/patter to them.
So I guess what I'm saying is, even with all the limitations, I kinda understand the hype. That said, I think some people may indeed exaggerate LLMs capabilities, unless they actually know some secret recipe to make them do all those awesome hyped things (but then I would love to see that).
Someone might share something for a specific audience which doesn't include you. Not everything shared is required to be persuasive. Take it or leave it.
> why don't they just link me to the amazing thing they made, that would be pretty convincing?!
99.99% of the things I've created professionally don't belong to me and I have no desire or incentives to create or deal with owning open source projects on my own time. Honestly, most things I've done with AI aren't amazing either, it's usually boring routine tasking, they're just done more cost efficiently.
If you flip the script, it's just as damning. "Hey, here's some general approaches that are working well for me, check it out" is always being countered by the AI skeptics for years now as "you're lying and I won't even try it and you're also a bot or a paid shill". Look at basically every AI related post and there's almost always someone ready to call BS within the first few minutes of it being posted.
You can. The conclusion would be that it doesn’t always work.
So the “subjective” part counts against them. It’s better to make things objective. At least they should be reproducible examples.
When it comes to the “anecdotally” part, that doesn’t matter. Anecdotes are sufficient for demonstrating capabilities. If you can get a race car around a track in three minutes and it takes me four minutes, that’s a three minute race car.
If you say you drove a 3 minute lap but you didn't time it, that's an anecdote (and is what I mean). If you measured it, that would be a fact.
If you measure something and the sample is N=1, it might be a fact, but it's still a fact true for a single person.
I often don’t need a sample size of 1000 to consider something worthy of my time, but if it is a sample of N=1 from a random person on the internet, I am going to doubt it.
If I see 1000 people claiming it makes them more productive I am going to check. If it is going to be done by 5 people who I follow and expect they know tech quite well I am going to check as well.
Every person I respect as a great programmer thinks agentic workflows are a joke, and almost every programmer I hold in low regard thinks they're the greatest things ever, so while I still check, I'm naturally quite skeptical.
"I use LLM-generated code extensively in my role as CEO of Carrington Labs, a provider of predictive-analytics risk models for lenders."
AI/LLM discussions are the exact same. How would a person ever measure their own performance? The moment you implement the same feature twice, you're already reusing learnings from the first run.
So, the only thing left is anecdotal evidence. It makes sense that on both sides people might be a little peeved or incredulous about the other's claims. It doesn't help that both sides (though mostly AI fans) have very rabid supporters that will just make up shit (like AGI, or the water usage).
Imho, the biggest part missing from these anecdotes is exactly what you're using, what you're doing, and what baseline you're comparing it to. For example, using Claude Code in a typical, modern, decently well-architected Spring app to add a bunch of straightforward CRUD operations for a new entity works absolutely flawlessly, compared to a junior or even medior (mid-level?) dev.
Copy pasting code into an online chat for a novel problem, in an untyped, rare language, with only basic instructions and no way for the chat to run it, will basically never work.
I think the core problem is a lot of people view AI incorrectly and thus can't use it efficiently. Everyone wants AI to be a Jr or Sr programmer, but I have serious doubts as to the ability of AI to ever have original thought, which is a core requirement of being a programmer. I don't think AI will ever be a programmer, but rather a tool to help programmers take the tedium away. I have seen massive speedups in my own workflow removing the tedium.
I have found prompting AI to be of minimal use, but tab-completion definitely speeds stuff up for me. If I'm about to create some for loop, AI will usually have a pretty good scaffold for me to use. If I need to handle an error, I start typing and AI will autocomplete the error handling. When I write my function documentation I am usually able to just tab-complete it all.
Yes, I usually have to go back and fix some things, and I will often skip various completion hints, but the scaffold is there, and as I start fixing faulty code it generated AI will usually pick up on the fixes and help me tab-complete the fixes themselves. If AI isn't giving me any useful tab-completions, I'll just start coding what I need, and AI picks up after a few lines and I can tab-complete again.
Occasionally I will give a small prompt such as "Please write me a loop that does X", or "Please write a setter function that validates the input", but I'll still treat that as a scaffold and go back and fix things, but I always give it pretty simple tasks and treat it simply as a scaffold generator.
I still run into the same problem solving issues I had before AI, (how do I tackle X problem?) and there isn't nearly as much speedup there, (Although now instead of talking to a rubber duck, I can chat with AI to help figure things out) but once I settle on the solution and start implementing it, I get that AI tab completion boost again.
With all that being said, I do also see massive boosts with fairly basic tasks that can be templated off something that already exists, such as creating unit tests or scaffolding a class, although I do need to go back and tweak things.
In summary, yes, I probably do see a 10x speedup, but it's really a 10x speedup in my typing speed more than a 10x speedup in solving the core issues that make programming challenging and fun.
If you find a job as an enterprise software developer, you'd see that your core requirement doesn't hold :)
Honestly though, I don't care about coding with it; I rarely get to leave Excel for my work anyway. The fact that I can OCR anything in about a minute is a game changer though.
It’s reasonable to accept that AI tools work well for some people and not for others.
There are many ways to integrate these tools and their capabilities vary wildly depending on the kind of task and project.
Within that motte and bailey is, "well my AI workflow makes me a 100x developer, but my workflow goes to a different school in a different town and you don't know her".
There's value there, I use local and hosted LLMs myself, but I think there's an element of mania at play when it comes to self-evaluation of productivity and efficacy.
That is just plain narcissism. People seeking attention in the slipstream of megatrends make claims that have very little substance. When they are confronted with rational argument, they can't respond intellectually, so they try to dominate the discussion by demanding an overwhelming burden of proof, while their own position remains underwhelming.
LinkedIn and Medium are densely concentrated with this sort of content. It’s all for the likes.
For example, even the people with the most negative view on AI don’t let candidates use AI during interviews.
You can disagree on the effectiveness of the tools but this fact alone suggests that they are quite useful, no?
I can talk through a possible code change with it which is just a natural, easy and human way to work, our brains evolved to talk and figure things out in a conversation. The jury is out on how much this actually speeds things up or translates into a cost savings. But it reduces cognitive load.
We're still stuck in a mindset where we pretend knowledge workers are factory workers and they can sit there for 8 hours producing consistently with their brain turned off. "A couple hours a day of serious focus at best" is closer to the reality, so a LLM can turn the other half of the day into something more useful maybe?
There is also the problem that any LLM provider can and absolutely will enshittify the LLM overnight if they think it's in their best interest (feels like OpenAI has already done this).
My extremely casual observations on whatever research I've seen talked about has suggested that maybe with high quality AI tools you can get work done 10-20% faster? But you don't have to think quite as hard, which is where I feel the real benefit is.
It is the equivalent of saying: Stenotype enthusiasts claim they're productive, but when we give stenotypes to a large group of typists, we get data disproving that.
Which should immediately highlight the issue.
As long as these discussions aren't prefaced with the metric and methodology, any discussion on this is just meaningless online flame wars / vibe checks.
One example: "agents are not doing well with code in languages/frameworks which have many recent large and incompatible changes like SwiftUI" - me: that's a valid issue that can be slightly controlled for with project setup, but still largely unsolved, we could discuss the details.
Another example: "coding agents can't think and just hallucinate code" - me: lol, my shipped production code doesn't care, bring some real examples of how you use agents if they don't work for you.
There's a lot of the second type on HN.
That's also far from helpful or particularly meaningful.
But since there's grey in my beard, I've seen it several times: in every technological move forward there are obnoxious hype merchants, reactionary status quo defenders, and then the rest of us doing our best to muddle through.
Because some opinions are lazy. You can get all the summaries you want by searching "how I use agentic coding / Claude code" on the web or similar queries on YouTube, explaining in lots of details what's good and bad. If someone says "it's just hallucinations", it means they aren't actually interested and just want to complain.
Really? It's little more than "I am right and you are wrong."
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
On the other hand one group is saying they've personally experienced a thing working, the other group says that thing is impossible... well it seems to the people who have experienced a thing that the problem is with the skeptic and not the thing.
Getting photos of ghosts is one thing, but productivity increases are something that we should be able to quantify at some level to demonstrate the efficacy of these tools.
That's a silly thing to request from random people in the comments of an HN thread though ha
When, what, and how to test may be important for productivity.
I don't know whether LLMs are in the same category.
If I tell you AmbrosiaLLM doesn't turn me into a programming god... Well, current results are already consistent with that, so It's not clear what else I could easily provide.
Absolutely there's a lot of unfounded speculation going around and a lot of aggressive skepticism of it, and both sides there are generally a little too excited about their position.
But that is fundamentally not what I'm talking about.
My other gripe is that productivity is only one aspect of software engineering. You also need to look at the tech debt introduced and other aspects of quality.
Productivity also takes many forms so it's not super easy to quantify.
Finally... software engineers are far from being created equal. VERY big difference in what someone doing CRUD apps for a small web dev shop does vs., e.g., an infra engineer in big tech.
The majority of HNers still reach for LLMs pretty regularly, even if they frequently fail horribly. That's really the pit the tech is stuck in. Sometimes it one-shots your answer perfectly, or pair programs with you perfectly for one task, or notices a bug you didn't. Sometimes it wastes hours of your time for various subtle reasons. Sometimes it adamantly insists 2 + 2 = 55.
Here is a real one. I was using the much lauded new Gemini 3? last week and wanted it to do something a slightly specific way for reasons. I told it specifically and added it to the instructions. DO NOT USE FUNCTION ABC.
It immediately used FUNCTION ABC. I asked it to read back its instructions to me. It confirmed what I put there. So I asked it again to change it to another function. It told me that FUNCTION ABC was not in the code, even though it was clearly right there in the code.
I did a bit more prodding and it adamantly insisted that the code it generated did not exist, again and again and again. Yes, I tried reversing it to USE FUNCTION XYZ. It still wanted to use ABC.
If someone sees no productivity gains when using an AI (or a productivity decrease), it is easy to come up with ways it might have happened that weren't related to the AI.
This is an inherent imbalance in the claims, even if both people have brought 100% proof of their specific claims.
A single instance of something doing X is proof of the claim that something can do X, but no amount of instances of something not doing X is proof of the claim that something cannot do X. (Note, this is different from people claiming that something always does X, as one counter example is enough to disprove that.)
Same issue in math with the difference between proving a conjecture is sometimes true and proving it is never true. Only one of these can be proven by examples (and only a single example is needed). The other can't be proven even by millions of examples.
Others … need to roll up their sleeves and catch up
Until then it's just people pulling the lever on a black box.
The shovel seller in the gold rush analogy.
Maybe it's because I spend a lot of my time just turning problem reports on Slack into tickets with tables of results and stack traces.
"I received your spreadsheet detailing 821 records that are in State A but still haven't been moved to State B by our system as it adds Datapoint X on a regular basis. From what I can tell, it seems your data is missing crucial pieces you assured us would always be there. What's that? You want us to somehow fix whatever is somehow making those records in your AcmeERP system? Don't you have a support contract with that giant vendor? We seem like an easier target to hit up for impromptu tech-support consulting work? Well, I'll escalate that to the product manager..."
Isn't it the reviewing time? Reviewing code is hard work.
I try to assume people who are trashing AI are just working in systems like that, rather than being bad at using AI, or worse, shit-talking the tech without really trying to get value out of it because they're ethically opposed to it.
A lot of strongly anti-AI people are really angry human beings (I suppose that holds for vehemently anti-<anything> people), which doesn't really help the case, it just comes off as old man shaking fist at clouds, except too young. The whole "microslop" thing came off as classless and bitter.
Like with cab hailing, shopping, social media ads, food delivery, etc: there will be a whole ecosystem, workflows, and companies built around this. Then the prices will start going up with nowhere to run. Their pricing models are simply not sustainable. I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber were in the early days.
But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.
Of course, by then we'll have much more capable models. So if you want SOTA, you might see the jump to $10-12. But that's a different value proposition entirely: you're getting significantly more for your money, not just paying more for the same thing.
Please prove this statement, so far there is no indication that this is actually true - the opposite seems to be the case. Here are some actual numbers [0] (and whether you like Ed or not, his sources have so far always been extremely reliable.)
There is a reason the AI companies don't ever talk about their inference costs. They boast about everything they can find, but inference... not.
Those are not contradictory: a company's inference costs can increase due to deploying more models (Sora), deploying larger models, doing more reasoning, and an increase in demand.
However, if we look purely at how much it costs to run inference on a fixed amount of requests for a fixed model quality, I am quite convinced that the inference costs are decreasing dramatically. Here's a model from late 2025 (see Model performance section) [1] with benchmarks comparing a 72B parameter model (Qwen2.5) from early 2025 to the late 2025 8B Qwen3 model.
The 9x smaller model outperforms the larger one from earlier the same year on 27 of the 40 benchmarks they were evaluated on, which is just astounding.
Anecdotally, I find you can tell if someone worked at a big AI provider or a small AI startup by proposing an AI project like this:
" First we'll train a custom trillion parameter LLM for HTML generation. Then we'll use it to render our homepage to our 10 million daily visitors. "
The startup people will be like "this is a bad idea because you don't have enough GPUs for training that LLM" and the AI lab folks will be like "How do you intend to scale inference if you're not Google?"
AWS is already raising GPU prices; that never happened before. What if there is war in Taiwan? What if we want to get serious about climate change and start saving energy for vital things?
My guess is that, while they can do some cool stuff, we cannot afford LLMs in the long run.
These are not finite resources being mined from an ancient alien temple.
We can make new ones, better ones, and the main ingredients are sand and plastic. We're not going to run out of either any time soon.
Electricity constraints are a big problem in the near-term, but may sort themselves out in the long-term.
Kinda ridiculous point; we're not running into GPU shortages because we don't have enough sand.
https://www.bbc.com/future/article/20191108-why-the-world-is...
We can't copy/paste a new ASML no matter how hard we try (aside from open-sourcing all of their IP). Even if you do, by the time you copy one generation of machine, they're on a new generation and the bottleneck is still in the same place.
Not to mention that with these monopolies they can just keep increasing prices ad infinitum.
Veritasium recently made a good video on the ASML machine design: https://youtu.be/MiUHjLxm3V0
The outcome may seem like magic, but the input is "simply" hard work and a big budget: billions of dollars and years of investment into tuning the parameters like droplet size, frequency, etc...
The interviews make it clear that the real reason ASML's machines are (currently) unique is that few people had the vision, patience, and money to fund what seemed at the time impossible. The real magic was that ASML managed to hang on by a fingernail and get a successful result before the money ran out.
Now that tin droplet EUV lasers have not only been demonstrated to be possible, but have become the essential component of a hugely profitable AI chip manufacturing industry, obtaining funding to develop a clone will be much easier.
And general imperialism.
There is nothing in Greenland worth breaking up the alliances with Europe over.
Trump is too stupid to realise this, he just wants land like it’s a Civ game.
PS: An entire rack of the most expensive NVIDIA equipment millions of dollars can buy has maybe a few grams of precious or rare metals in it. The cost of those is maybe a dollar or two. They don't even use gold any more!
The expensive part is making it, not the raw ingredients.
A corollary is that even a "technically false" model can better predict someone's actions than a "truthful one".
Trump may not be a Russian agent, but he acts like one consistently.
It's more effective to simply assume he's an agent of a foreign power, because that's the best predictor of his actions.
SOTA improvements have been coming from additional inference due to reasoning tokens and not just increasing model size. Their comment makes plenty of sense.
I'd like to see this statement plotted against current trends in hardware prices at iso-performance. RAM, for example, is not meaningfully better than it was 2 years ago, and yet it is 3x the price.
I fail to see how costs can drop while valuations for all major hardware vendors continue to go up. I don't think the markets would price companies in this way if they thought all major hardware vendors were going to see margins shrink to commodity levels, as you've implied.
"The energy consumed per text prompt for Gemini Apps has been reduced by 33x over the past 12 months."
My thinking is that if Google can give away LLM usage (which is obviously subsidized) it can't be astronomically expensive, in the realm of what we are paying for ChatGPT. Google has their own TPUs and company culture oriented towards optimizing the energy usage/hardware costs.
I tend to agree with the grandparent on this: LLMs will get cheaper for the level of intelligence we have now, and will get more expensive for SOTA models.
OpenAI, Anthropic, etc are in a race to the bottom, but because they don't own the vertical they are beholden to Nvidia (for chips), they obviously have less training data, they need a constant influx of cash just to stay in that race to the bottom, etc.
Google owns the entire stack - they don't need nvidia, they already have the data, they own the very important user-info via tracking, they have millions, if not billions, of emails on which to train, etc.
Google needs no one, not even VCs. Their costs must be a fraction of the costs of pure-LLM companies.
There's a bit of nuance hiding in the "etc". Openai and anthropic are still in a race for the top results. Minimax and GLM are in the race to the bottom while chasing good results - M2.1 is 10x cheaper than Sonnet for example, but practically fairly close in capabilities.
That's not what is usually meant by "race to the bottom", is it?
To clarify, in this context I mean that they are all in a race to be the lowest margin provider.
They re at the bottom of the value chain - they sell tokens.
It's like being an electricity provider: if you buy $100 of electricity and produce 100 widgets, which you sell for $1k each, that margin isn't captured by the provider.
That's what being at the bottom of the value chain means.
There don't need to be signs of a race (or a price war), only signs of commodification; all you need is a lack of differentiation between providers for something to turn into a commodity.
When you're buying a commodity, there's no big difference between getting your commodity delivered by $PROVIDER_1 and getting your commodity delivered by $PROVIDER_2.
The models are all converging quality-wise. Right now the number of people who swear by OpenAI models are about the same as the number of people who swear by Anthropic models, which are about the same as the number of people who swear by Google's models, etc.
When you're selling a commodity, the only differentiation is in the customer experience.
Right now, sure, there's no price war, but right now almost everyone who is interested is playing with multiple models anyway. IOW, the target consumers are already treating LLMs as a commodity.
Google probably even has an advantage there: filter out everything except messages sent from valid gmail account to valid gmail account. If you do that you drop most of the spam and marketing, and have mostly human-to-human interactions. Then they have their spam filters.
Imagine industrial espionage where someone is asking the model to roleplay a fictional email exchange between named corporate figures in a particular company.
Google has a company culture of luring you in with freebies and then mining your data to sell ads.
There is a recent article by Linus Sebastian (LTT) talking about YouTube: it is almost impossible to cover the cost of building a competitor because it is astronomically expensive (vs. potential revenue).
BTW, the absolute lowest "energy consumed per logical operation" is achieved with so-called 'neuromorphic' hardware that's dog slow in latency terms but more than compensates with extreme throughput. (A bit like an even more extreme version of current NPU/TPUs.) That's the kind of hardware we should be using for AI training once power use for that workload is measured in gigawatts. Gaming-focused GPUs are better than your average CPU, but they're absolutely not the optimum.
I agree with everything you've said, I'm just not seeing any material benefit to the statement as of now.
Prices for who? The prices that are being paid by the big movers in the AI space, for hardware, aren't sticker price and never were.
The example you use in your comment, RAM, won't work: It's not 3x the price for OpenAI, since they already bought it all.
This isn't hard to see. A company's overall profits are influenced – but not determined – by the per-unit economics. For example, increasing volume (quantity sold) at the same per-unit profit leads to more profits.
Yeah. Valuations for hardware vendors have nothing to do with costs. Valuations are a meaningless thing to integrate into your thinking about something objective like whether the retail costs of inference will trend down (obviously yes).
The same task on the same LLM will cost $8 or less. But that's not what vendors will be selling, nor what users will be buying. They'll be buying the same task on a newer LLM. The results will be better, but the price will be higher than the same task on the original LLM.
If you run these models at home it's easy to see how this is totally untrue.
You can build a pretty competent machine that will run Kimi or Deepseek for $10-20k and generate an unlimited amount of tokens all day long (I did a budget version with an Epyc machine for about $4k). Amortize that over a couple of years, and it's less than what most people spend on a car payment. The pricing is sustainable, and that's ignoring the fact that the big model providers operate at economies of scale: they can parallelize the GPUs and pack in requests much more efficiently.
Damn what kind of home do you live in, a data center? Teasing aside maybe a slightly better benchmark is what sufficiently acceptable model (which is not objective but one can rely on arguable benchmarks) you can run via an infrastructure that is NOT subsidized. That might include cloud providers e.g. OVH or "neo" clouds e.g. HF but honestly that's tricky to evaluate as they tend to all have pure players (OpenAI, Anthropic, etc) or owners (Microsoft, NVIDIA, etc) as investors.
For simplicity’s sake we’ll assume DeepSeek 671B on 2 RTX 5090 running at 2 kW full utilization.
In 3 years you’ve paid $30k total: $20k for system + $10k in electric @ $0.20/kWh
The model generates 500M-1B tokens total over 3 years @ 5-10 tokens/sec. Understand that’s total throughput for reasoning and output tokens.
You’re paying $30-$60/Mtok - more than both Opus 4.5 and GPT-5.2, for less performance and less features.
And like the other commenters point out, this doesn’t even factor in the extra DC costs when scaling it up for consumers, nor the costs to train the model.
Of course, you can play around with parameters of the cost model, but this serves to illustrate it’s not so clear cut whether the current AI service providers are profitable or not.
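If it helps, here's a minimal Python sketch of that back-of-the-envelope math; every input (hardware price, power draw, electricity rate, token rate, lifetime) is just the assumption stated above, not a measurement:

# Rough cost model for a home DeepSeek rig, using the figures from the comment above.
SECONDS_PER_YEAR = 365 * 24 * 3600

def local_cost_per_mtok(hw_cost_usd=20_000, power_kw=2.0, usd_per_kwh=0.20,
                        tokens_per_sec=5, years=3):
    electricity = power_kw * years * 365 * 24 * usd_per_kwh      # ~$10.5k over 3 years
    total_tokens = tokens_per_sec * years * SECONDS_PER_YEAR     # ~473M tokens at 5 tok/s
    return (hw_cost_usd + electricity) / (total_tokens / 1e6)    # USD per million tokens

for tps in (5, 10):
    print(f"{tps} tok/s -> ${local_cost_per_mtok(tokens_per_sec=tps):.0f}/Mtok")
# 5 tok/s -> ~$65/Mtok, 10 tok/s -> ~$32/Mtok: roughly the $30-$60/Mtok ballpark quoted above.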
https://developer.nvidia.com/blog/nvidia-blackwell-delivers-...
NVIDIA's 8xB200 gets you 30k tps on DeepSeek 671B; at maximum utilization that's roughly 1 trillion tokens per year. At a dollar per million tokens, that's about $1 million.
The hardware costs around $500k.
Now, ideal throughput is unlikely, so let's say you get half that. That's still 500B tokens per year.
Gemini 3 Flash is like $3/million tokens and I assume it's a fair bit bigger, maybe 1 to 2T parameters. I can sort of see how you can get this to work with margins, as the AI companies repeatedly assert.
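And a similarly rough sanity check of the serving-side numbers, again treating the quoted figures (30k tok/s aggregate, $1/Mtok, ~$500k of hardware) as assumptions rather than facts:

# Token revenue from an 8xB200 box at the quoted aggregate throughput.
SECONDS_PER_YEAR = 365 * 24 * 3600

tokens_per_year = 30_000 * SECONDS_PER_YEAR   # ~946B tokens at full utilization
revenue = tokens_per_year / 1e6 * 1.0         # ~$946k/year at $1 per million tokens
print(f"{tokens_per_year / 1e12:.2f}T tok/yr, ~${revenue:,.0f}/yr vs ~$500k of hardware")
# At half utilization that's still ~$470k/yr of token revenue against the hardware cost.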
Also, you’re missing material capex and opex costs from a DC perspective. Certain inputs exhibit diseconomies of scale when your demand outstrips market capacity. You do notice electricity cost is rising and companies are chomping at the bit to build out more power plants, right?
Again, I ran the numbers for simplicity’s sake to show it’s not clear cut that these models are profitable. “I can sort of see how you can get this to work” agrees with exactly what I said: it’s unclear, certainly not a slam dunk.
Especially when you factor in all the other real-world costs.
We’ll find out soon enough.
I'm not parsing that: do you mean that the monthly cost of running your own rig 24x7 is less than the monthly cost of a car payment?
Whether true or false, I don't get how that is relevant to proving either that the current LLMs are not subsidised, or proving that they are.
The most amusing example I’ve seen was asking the web version of GPT-5.1 to help with an installation issue with the Codex CLI (I’m not an npm user so I’m unfamiliar with the intricacies of npm install, and Codex isn’t really an npm package, so the whole use of npm is rather odd). GPT-5.1 cheerfully told me that OpenAI had discontinued Codex and hallucinated a different, nonexistent program that I must have meant.
(All that being said, Gemini is very, very prone to hallucinating features in Google products. Sometimes I wonder whether Google should make a list of Gemini-hallucinated Google features and use the list to drive future product development.)
One nice thing about Grok is that it attempts to make its knowledge cutoff an invisible implementation detail to the user. Outdated facts do sometimes slip through, but it at least proactively seeks out current information before assuming user error.
Now someone sends you an email and asks you to help them fix a bug in Windows 12. What would you tell them?
"Hey LLMBot, what's the newest version of Very Malicious Website With Poison Data?"
Well, obviously, since Fedora 42 came out in 1942, when men still wore hats. Attempting to use such an old, out of style Linux distro is just a recipe for problems.
I haven't found any LLM that I totally trust on what it tells me about Arknights; there is no LLM that seems to understand how Scavenger recovers DP. Allegedly there is a good Chinese wiki for that game which I could crawl, store in a JetBrains project, and ask Junie questions about, but I can't resolve the URL.
This was during the Gemini 2.5 era, but I got some just bonkers results looking for Tears of the Kingdom recipes. Hallucinated ingredients, out-of-nowhere recipes, and transposing Breath of the Wild recipes and effects into Tears of the Kingdom.
Literally just searched for something, slight typo.
A Vs B type request. Search request comes back with "sorry, no information relevant to your search".
Search results are just a spammy mess.
Correct the typo and you get a really good insight.
The prices now are completely unsustainable. They'd go broke if it weren't for investors dumping their pockets out. People forget that what we have now only exists because of absurd amounts of spending on R&D, mountains of dev salaries, huge data centers, etc. That cannot go on forever.
The AWS price increase on 1/5 for GPUs on EC2 was a good example.
RDS is a particular racket that will cost you hundreds of dollars for a rock-bottom tier. Again, DigitalOcean has options below $20 per month that will serve many a small business. And yet AWS is the default go-to at this point because the lock-in is real.
This is a little disingenuous though. Yeah you can run a database server on DO cheaper than using RDS, but you’ll have to roll all that stuff that RDS does yourself: automatic backups/restores, tuning, monitoring, failover, etc. etc. I’m confident that the engineers who’ve set up those RDS servers and the associated plumbing/automation have done a far better job of all that stuff than I ever could unless I spent a lot of time and effort on it. That’s worth a premium.
Once the hardware prices go low enough pricing will go down to the point where it doesn't even make sense to sell current LLMs as a service.
I imagine it's possible that, if the aforementioned future ever comes to pass, there will be new forms of ultra-high-tier compute running other types of AI more powerful than an LLM. But I'm pretty sure AI in its current state will one day run locally on desktops and/or handhelds, with the former being more likely.
This is why I'm using it now as much as possible to build as much as possible in the hopes of earning enough to afford the later costs :D
A.I. == Artificially Inexpensive
That said, I am not sure that this indicator alone tells the whole story, if not hides it - sort of like EBITDA.
Hell ya, get in and get out before the real pricing comes in.
1: I mean this in the strict sense of Cory Doctorow’s theory (https://en.wikipedia.org/wiki/Enshittification?wprov=sfti1#H...)
If you don't believe me and don't want to mess around with used server hardware you can walk into an Apple Store today, pick up a Mac Studio and do it yourself.
Only gotcha is that Claude Code expects a 200k context window while that model supports at most 130k or so. I have to do a /compress when it gets close. I'll have to see if there is a way to set the max context window in CC.
Been pretty happy with the results so far as long as I keep the tasks small and self contained.
That said, I'm a little surprised to hear you're having great success with it as a coding agent. It's "obviously" worse than the frontier models, and even they can making blindly dumb decisions pretty regularly. Maybe I should give it a shot.
The pricing and quality of Copilot and Codex (which I am experienced with) feel like they are getting worse, but I suspect it may be that my expectations are getting higher as the technology matures...
> I wrote some Python code which loaded a dataframe and then looked for a nonexistent column.
df = pd.read_csv('data.csv')
df['new_column'] = df['index_value'] + 1
# there is no column 'index_value'
> I asked each of them [the bots being tested] to fix the error, specifying that I wanted completed code only, without commentary.
> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.
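To make "code that would help me debug the problem" concrete, a response along those lines might look something like this sketch of one possibility (not output from any of the models tested):

import pandas as pd

df = pd.read_csv('data.csv')

# Surface the real problem instead of guessing at intent: the expected column is missing.
expected = 'index_value'
if expected not in df.columns:
    raise KeyError(f"Column '{expected}' not found in data.csv; "
                   f"available columns are: {list(df.columns)}")
df['new_column'] = df[expected] + 1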
So his hoped-for solution is that the bot should defy his prompt (since refusal is commentary), and not fix the problem.
Maybe instructability has just improved, which is a problem for workflows that depend on misbehavior from the bot?
It seems like he just prefers how GPT-4 and 4.1 failed to follow his prompt, over 5. They are all hamstrung by the fact that the task is impossible, and they aren’t allowed to provide commentary to that effect. Objectively, 4 failed to follow the prompts in 4/10 cases and made nonsense changes in the other 6; 4.1 made nonsense changes; and 5 made nonsense changes (based on the apparently incorrect guess that the missing ‘index_value’ column was supposed to hold the value of the index).
In this case the desired response is defiance of the prompt, not rudeness to the user. The test is looking for helpful misalignment.
Assuming the user to be correct, and ignoring contradictory evidence to come up with a rationalization that favours the user's point of view, can be considered a kind of flattery.
A kind of improvisational "yes and" that emerges from training, which seems sycophantic because that's one of the most common ways to say it.
Like if the prompt was “don’t fix any bugs and just delete code at random” we wouldn’t take points off for adhering to the prompt and producing broken code, right?
df['new_column'] = df.index + 1
The original bug sounds like a GPT-2 level hallucination IMO. The index field has been accessible in pandas since the beginning, and even bad code wouldn't try an 'index_value' column. Just because, well, how'd the code get into this state? 'index_value' must have been a column that held something; having it just be equal to df.index seems unlikely because, as you mention, that's always been available. I should probably check the change history to figure out when 'index_value' was removed. Or ask the person what that column meant, but we can't do that if we want to obey the prompt.
This is why vague examples in blog posts aren't great.
Sometimes I am uncertain whether it's an absolute win. Analogy: I used to use Huel to save time on lunches to have more time to study. Turns out, lunches were not just refueling sessions but ways to relax. So I lost on that relaxation time and it ended up being +-0 long-term.
AI for sure is net positive in terms of getting more done, but it's way too easy to gloss over some details and you'll end up backtracking more.
"Reality has a surprising amount of detail" or something along those lines.
I put great effort into maintaining a markdown file with my world model (usecases x principles x requirements x ...) pertaining to the project, with every guardrail tightened as much as possible, and every ambiguity and interaction with the user or wider world explained. This situates the project in all applicable contexts. That 15k token file goes into every prompt.
I used to be stuck with this thought. But I came across this delightful documentation RAG project and got to chat with the devs. Idea was that people can ask natural language questions and they get shown the relevant chunk of docs for that query. They were effectively pleading to a genie if I understood it right. Worse yet, the genie/LLM model kept updating weekly from the cloud platform they were using.
But the devs were engineers. They had a sample set of docs and sample set of questions that they knew the intended chunk for. So after model updates they ran the system through this test matrix and used it as feedback for tuning the system prompt. They said they had been doing it for a few months with good results, search remaining capable over time despite model changes.
While these agents.md etc. appear to be useful, I'm not sure they're going to be the key for long-term success. Maybe with a model change it becomes much less effective and the previous hours spent on it become wasteful.
I think something more verifiable/strict is going to be the secret sauce for LLM agents. Engineering. I have heard Claude Code has decent scaffolding. Haven't gotten the chance to play with it myself though.
I liked the headline from some time ago that 'what if LLMs are just another piece of technology'?
Same here. Large AGENTS.md file in current project.
Today I started experimenting with splitting into smaller SKILL.md files, but I'm wary that the agent might mistakenly decide not to load some files.
Honestly this isn't that much different from explaining things to human programmers. Quite often we assume the programmer is going to automatically figure out the ambiguous things, but commonly it leads to undefined behavior or bugs in the product.
Most of the stuff I do is as a support engineer working directly with the client on identifying bugs, needed features, and shortcomings in the application. After a few reports I made went terribly wrong when the feature came out, I've learned to be overly detailed and concise.
It's a lot, but for quick projects I don't do this. Only for one important project that I have ownership of for over a year.
Maintaining this has been worth it. It makes the codebase more stable, it's like the codebase slowly converges to what I want (as defined in the doc) the more inferences I run, rather than becoming spaghetti.
I mean, it's at best a very momentary thing. Expectations will adapt and the time gained will soon be filled with more work. The net gain in free time will ultimately be zero, optimistically, but I strongly suspect general life satisfaction will be much lower, since you inherently lose confidence in creation and agency, and the experience of self-efficacy is therefore lessened too. Even if external pressure isn't increased, the brain will adapt to what's considered the new normal for laziness. Everybody hates emptying the dishwasher; the aversion threshold ends up the same as for washing dishes by hand.
And yeah, in the process you atrophy your problem solving skills and endurance of frustration. I think we will collectively learn how important some of these "inefficiencies" are for gaining knowledge and wisdom. It's reminiscent of Goodhart's Law, again, and again. "Output" is an insufficient metric to measure performance and value creation.
The cost of using AI services does not at all reflect the actual cost of running them sustainably. So these questionable "productivity gains" should be contrasted with actual costs, in any case. Compare AI to (cheap, plastic) 3D printing, which is genuinely transformative, revolutionary tech in almost every (real) industry: I don't see how the trillions in investment and the absurd energy and resource waste could ever be justified by what AI offers, or even by what is imaginable for it (considering its inherent limitations).
Democratization they call it.
Do you though? Does "picking up" a skill mean the same thing it used to? Do you fact-check all the stuff AI tells you? How certain are you that you are learning correct information? Struggling through unfamiliar topics, making mistakes, and figuring out solutions by testing internal hypotheses is a big part of how deep, explanatory knowledge is acquired by human brains. Or maybe it's always been 10,000 kilowatt-hours, after all.
Even, if you would actually learn different tech stacks faster with AI telling you what to do, it's still a momentary thing, since these systems are fundamentally poisoned by their own talk, so shit's basically frozen in time, still limited to pre-AI-slop information, or requires insane amounts of manual sanitation. And who's gonna write the content for clean new training data anyway?
Mind you, I am talking about the possible prospect of this technology and a cost-value evaluation. Maybe I am grossly ignorant/uninformed, but to me all of it just doesn't add up, if you project inherent limitations onto wider adoption and draw the obvious logical conclusions. That is, if humanity isn't stagnating and new knowledge is created.
Recent success I've been happy with has been moving my laptop config to Nix package manager.
Common complaint people have is Nix the language. It's a bit awkward, "JSON-like". I probably would not have had the patience to engage with it with the little time I have available. But AI mostly gets the syntax right, allowing me to engage with it, and I think I've a decent grasp by this point of the ecosystem and even syntax. It's been roughly a year I think.
Like, I don't know all the constructs available in the language, but I can still reason, as a commoner, that I probably don't want to define my username multiple times in my config, esp. when trying to have the setup be reproducible on an arbitrary set of personal laptops. So that for a new laptop I just define one new array item as a source of truth and everything downstream just works.
I feel like with AI the architectural properties are more important than the low-level details. Nix has the nice property of reproducibility/declarativeness. You could for sure put even more effort into alternative solutions, but if they lack reproducibility I think you're going to keep suffering, no matter how much AI you have available.
I am certain my config has some silliness in it that someone more experienced would pick out, but ultimately I'm not sure how much that matters. My config is still reproducible enough that I have my very custom env up and running after a few commands on an arbitrary MacBook.
> Does "picking up" a skill mean the same thing it used to?
I personally feel confident in helping people move their config to Nix, so I would say yes. But it's a big question.
> Do you fact check all the stuff AI tells you? How certain are you, you are learning correct information?
Well, usually I have a more or less testable setup so I can verify whether the desired effect was achieved. Sometimes things don't work, which is when I start reaching for the docs or source code of for example the library I'm trying to use.
> Struggling through unfamiliar topics, making mistakes and figuring out solutions by testing internal hypotheses is a big part of how deep, explanatory knowledge is acquired for human brains.
I don't think this is lost. I iterate a lot. I think the Claude Code author does too; didn't they have something like +40k/-38k lines of changes over the past year or so? I still use GitHub issues to track what I want to get done when a solution is difficult to reach, and comment progress on them. Recently I did that with my struggles cross-compiling Rust from Linux to macOS. It's just easier to iterate, and I don't need to sleep on it overnight to get unstuck.
> since these systems are fundamentally poisoned by their own talk,
_I_ feel like this goes into overthinking territory. I think software and systems will still live or die by their merits. The same applies to training data. If bugs regularly make it to end users and a competing solution has fewer defects, I don't think the buggy solution will stay afloat any longer thanks to AI. So, I'd argue, the training data will be OK. Paradigms can still exist, like the Theory of Modern Go discouraging globals and init functions. And I think this was something Tesla also had to deal with pre modern LLMs? As in, not all drivers drove well enough for their data to be wanted for training the autopilot.
I really enjoyed your reply, thank you.
This might be a controversial opinion, but I for one, like to eat food. In fact I even do it 3 times a day.
Don't y'all have a culture that's passed down to you through food? Family recipes? Isn't eating food a central aspect of socialization? Isn't socialization the reason people wanted to go to the office in the first place?
Maybe I'm biased. I love going out to eat, and I love cooking. But its more than that. I garden. I go to the farmers market. I go to food festivals.
Food is such an integral part of the human experience for me, that I can't imagine "cutting it out". And for what? So you can have more time to stare at the screen you already stare at all day? So you can look at 2% more lines of javascript?
When I first saw commercials for that product, I truly thought it was like a medical/therapeutic thing, for people that have trauma with food. I admit, the food equivalent of an i.v. drip does seem useful for people that legitimately can't eat.
I was really busy with my master's degree, ok? :D
90% of meals aren't some special occasion, but I still need to eat. Why not make it easy? Then go explore and try new things every now and then
Treating food as entertainment is how the west has gotten so unhealthy
That said, I know people for whom food is a grudging necessity they'd rather do without.
At the end of the day there's a lot of different kinds of people out there.
I think I've seen an adtech company use AI influencers to market whatever product a customer wanted to sell. I got the impression that it initially worked really well, but then people caught on to the fact it was just AI and performance tanked.
I don't actually know whether that was the case but that's the vibe I got from following their landing page over time.
I've noticed my own results vary wildly depending on whether I'm working in a domain where the LLM has seen thousands of similar examples (standard CRUD stuff, common API patterns) versus anything slightly novel or domain-specific. In the former case, it genuinely saves time. In the latter, I spend more time debugging hallucinated approaches than I would have spent just writing it myself.
The atrophy point is interesting though. I wonder if it's less about losing skills and more about never developing them in the first place. Junior developers who lean heavily on these tools might never build the intuition that comes from debugging your own mistakes for years.
I am not necessarily saying the conclusions are wrong, just that they are not really substantiated in any way
In the end, everyone is kind of just sharing their own experiences. You'll only know whether they work for you by trying it yourself.
But at the same time, even this doesn't really work.
The lucky gambler thinks lottery tickets are a good investment. That does not mean they are.
I've found very very limited value from these things, but they work alright in those rather constrained circumstances.
Perhaps you don't believe OpenAI and Anthropic when they say this, but it is a requirement upon which most enterprise contracts are predicated.
I agree with the author that GPT-5 models are much more fixated on solving exactly the problem given and not as good at taking a step back and thinking about the big picture. The author also needs to take a step back and realize other providers still do this just fine.
I'm having a blast with gemini-3-flash and a custom Copilot replacement extension; it's much more capable than Copilot ever was with any model for me, and it gives a personalized DX with deep insights into my usage and what the agentic system is doing under the hood.
Some studies have shown that direct feedback loops do cause collapse but many researchers argue that it’s not a risk with real world data scales.
In fact, a lot of advancements in the open weight model space recently have been due to training on synthetic data. At least 33% of the data used to train NVIDIA's recent Nemotron 3 Nano model was synthetic. They use it as a way to get high-quality agent capabilities without doing tons of manual work.
For example all the information on the web could be said to be a distillation of human experiences, and often it ended up online due to discussions happening during problem solving. Questions were asked of the humans and they answered with their knowledge from the real world and years of experience.
If no one asks humans anymore, they just ask LLMs, then no new discussions between humans are occurring online and that experience doesn't get syndicated in a way models can train on.
That is essentially the entirety of Stack Overflow's existence until now. You can pretty strongly predict that no new software experience will be put into Stack Overflow from now on. So what of new programming languages or technologies and all the nuances within them? Docs never have all the answers, so models will simply lack the nuanced information.
At the end of the day there is still a huge problem space of reality outside of humans that can be explored and distilled.
What's the objective measure of success that can be programmed into the LLM to self-train without human input? (Narrowing our focus to only code for this question). Is it code that runs? Code that runs without bugs? Code without security holes? And most importantly, how can you write an automated system to verify that? I don't buy that E2E project simulations would work: it can simulate the results, but what results is it looking for? How will it decide? It's the evaluation, not the simulation, that's the inescapably hard part.
Because there's no good, objective way for the LLM to evaluate the results of its training in the case of code, self-training would not work nearly as well as it did for AlphaZero, which could objectively measure its own success.
Unless the AIs find out where mistakes occur, and find this out in the code they themselves generate, your conclusion seems logically valid.
Is the average human 100% correct with everything they write on the internet? Of course not. The absurd value of LLMs is that they can somehow manage to extract the signal from that noise.
That's not even the worst scenario. There are plenty of websites that are nearly meaningless. Could you predict the next token on a website whose server is returning information that has been encoded incorrectly?
Say what? LLMs absolutely cannot do that.
They rely on armies of humans to tirelessly filter, clean, and label data that is used for training. The entire "AI" industry relies on companies and outsourced sweatshops to do this work. It is humans that extract the signal from the noise. The machine simply outputs the most probable chain of tokens.
So hallucinations definitely matter, especially at scale. It makes the job of humans much, much harder, which in turn will inevitably produce lower quality models. Garbage in, garbage out.
LLMs really do find the signal in this noise because even just pre-training alone reveals incredible language capabilities but that's about it. They don't have any of the other skills you would expect and they most certainly aren't "safe". You can't even really talk to a pre-trained model because they haven't been refined into the chat-like interface that we're so used to.
The hard part after that for AI labs was getting together high quality data that transforms them from raw language machines into conversational agents. That's post-training and it's where the armies of humans have worked tirelessly to generate the refinement for the model. That's still valuable signal, sure, but it's not the signal that's found in the pre-training noise. The model doesn't learn much, if any, of its knowledge during post-training. It just learns how to wield it.
To be fair, some of the pre-training data is more curated. Like collections of math or code.
Base models (after pre-training) have zero practical value. They're absolutely useless when it comes to separating signal from noise, using any practical definition of those terms. As you said yourself, their output can be nonsensical, based solely on token probability in the original raw data.
The actual value of LLMs comes after the post-training phase, where the signal is injected into the model from relatively smaller amounts of high quality data. This is the data processed by armies of humans, without which LLMs would be completely worthless.
So whatever capability you think LLMs have to separate signal from noise is exclusively the product of humans. When that job becomes harder, the quality of LLMs will go down. Unless we figure out a way to automate data cleaning/labeling, which seems like an unsolvable problem, or for models to filter it during inference, which is what you're wrongly implying they already do. LLMs could assist humans with cleaning/labeling tasks, but that in itself has many challenges, and is not a solution to the model collapse problem.
Code completion models can be useful because they output the most probable chain of tokens given a specific input, same as any LLM. There is no "signal" there besides probability. Besides, even those models are fine-tuned to follow best practices, specific language idioms, etc.
When we talk about "signal" in the context of general knowledge, we refer to information that is meaningful and accurate for a specific context and input. So that if the user asks for proof that the Earth is flat, the model doesn't give them false information from a random blog. Of course, LLMs still fall short at this, but post-training is crucial to boost the signal away from the noise. There's nothing inherent in the way LLMs work to make them do this. It is entirely based on the quality of the training data.
Using human foibles when discussing LLM scale issues is apples and oranges.
Additional non-internet training material will probably be human created, or curated at least.
As others have noted, the prompt/eval is also garbage. It’s measuring a non-representative sub-task with a weird prompt that isn’t how you’d use agents in, say, Claude Code. (See the METR evals if you want a solid eval giving evidence that they are getting better at longer-horizon dev tasks.)
This is a recurring fallacy with AI that needs a name. “AI is dumber than humans on some sub-task, therefore it must be dumb”. The correct way of using these tools is to understand the contours of their jagged intelligence and carefully buttress the weak spots, to enable the super-human areas to shine.
It is also a red flag to see anyone refer to these tools as intelligence; it seems the marketing of calling this "AI" has finally woven its way into our discourse, to the point that even tech forums think the prediction machine is intelligent.
Also, that "it's not really intelligence" horse is so dead, it has already turned into crude oil.
In practice I have seen: flowery emails no one bothers to read, emoji filled summaries and documentation that no one bothers to read or check correctness on, prototypes that create more work for devs in the long run, a stark decline in code quality because it turns out reviewing code is a team's ultimate test of due diligence, ridiculous video generation... I could go on and on. It is blockchain all over again, not in terms of actual usefulness, but in terms of our burning desire to monetize it in irresponsible, anti-consumer, anti-human ways.
I DO have a use for LLMs. I use them to tag data that has no tagging. I think the tech behind generative AI is extremely useful. Otherwise, what I see is a collection of ideal states that people fail to demonstrate to me in practice, when in reality it won't be replacing anyone until "the normies" can use it without 1000 lines of instruction markdown. Instead it will just fool people with its casually authoritative and convincing language, since that is what it was designed to do.
Further even, if you are actually thinking about long-term maintenance during the code review you get seen as a nitpicky obstacle.
Why? Is it intelligence now? I think not.
(I’m dismissive of calling the tool broken though.)
LLMs are definitely in the same boat. It's even more specific where different models have different quirks so the more time you spend with one, the better the results you get from that one.
Might be good in some timelines. In our current timeline this will just mean even more extreme concentration of wealth, and worse quality of life for everyone.
Maybe when the world has a lot more safety nets so that not having a job doesn’t mean homelessness, starvation, no healthcare, then society will be more receptive to the “this tool can replace everybody” message.
There are so many better things for humans to do.
Once having a job is not intimately tied to basic survival needs then people will be much more willing to automate everything.
I, personally, would be willing to do mind numbing paperwork or hard labor if it meant I could feed myself and my family, have housing, rather than be homeless and starving.
If the problem is with society the solution is with society. We have stop pretending that it's anything else. AI is not even the biggest technological leap -- it's blip on the continuum.
For the time being, at least.
It’s the same as if he had said “I keep typing HTML into VS code and it keeps not displaying it for me. It just keeps showing the code. But it’s made to make webpages, right? people keep telling me I don’t know how to use it but it’s just not showing me the webpage.”
To go further into detail about the whole thing: "You're holding it wrong" is perfectly valid criticism in many, many different ways and fields. It's a strong criticism in some, and weak in others, but almost always the advice is still useful.
Anyone complaining about getting hurt by holding a knife by the blade, for example, is the strongest example of the advice being perfect. The tool is working as designed, cutting the thing with pressure on the blade, which happens to be their hand.
Left-handers using right-handed scissors provides a reasonable example: I know a bunch of left-handers who can cut properly with right-handed scissors and not with left-handed scissors. Me included, if I don't consciously adjust my behaviour. Why? Because they have been trained to hold scissors wrong (by positioning the hand to create opposite push/pull forces to natural), so that they can use the poor tool given to them. When you give them left-handed scissors and they try to use the same reversed push/pull, the scissors won't cut well because their blades are being separated. There is no good solution to this, and I sympathise with people stuck on either side of this gap. Still, learn to hold scissors differently.
And, of course, the weakest, and the case where the snark is deserved: if you're holding your iPhone 4 with the pad of your palm bridging the antenna, holding it differently still resolves your immediate problem. The phone should have been designed such that it didn't have this problem, but it does, and that sucks, and Apple is at fault here. (Although I personally think it was blown out of proportion, which is neither here nor there.)
In the case of LLMs, the language of the prompt is the primary interface -- if you want to learn to use the tool better, you need to learn to prompt it better. You need to learn how to hold it better. Someone who knows how to prompt it well, reading the kind of prompts the author used, is well within their rights to point out that the author is prompting it wrong, and anyone attempting to subvert that entire line of argument with a trite little four-sentence bit of snark in whatever the total opposite of intellectual curiosity is deserves the downvotes they get.
Initial postulate: you have a perfect tool that anybody can use and is completely magic.
Someone says: it does not work well.
Answer: it’s your fault, you’re using it wrong.
In that case it is not a perfect tool that anybody can use. It is just yet another tool, with its flaws and learning curve, that may or may not work depending on the problem at hand. And that's OK! It is definitely a valid answer. But the "it's magic" narrative has got to go.
>Someone says: it does not work well.
Why do we argue with two people who are both building strawmen? It doesn't accomplish much. We keep calling AI 'unintelligent', but people's eager willingness to make incorrect arguments casts some doubt on humanity itself.
Today I asked 3 versions of Gemini “what were sales in December” with access to a sql model of sales data.
All three ran `WHERE EXTRACT(MONTH FROM date) = 12` with no year filter (except that 2.5 Flash did sometimes give me sales for Dec 2023 only).
No sane human would hear “sales from December” and sum up every December. But it produced numbers that an uncritical eye would not notice were wrong.
That's the type of logical error these models produce that is bothering the author. They can be very poor at analysis in real-world situations because they do these things.
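For illustration, here is roughly the difference (table and column names are my own hypothetical stand-ins, not the actual schema): the query the models wrote sums every December on record, while the intended query pins down the year.

# Hypothetical sketch of the two queries; table/column names are made up.
ambiguous_sql = """
SELECT SUM(amount) FROM sales
WHERE EXTRACT(MONTH FROM sale_date) = 12            -- every December ever recorded
"""

intended_sql = """
SELECT SUM(amount) FROM sales
WHERE EXTRACT(MONTH FROM sale_date) = 12
  AND EXTRACT(YEAR FROM sale_date) = 2024            -- the December actually being asked about
"""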
Isn't this the same thing? I mean this has to work with like regular people right?
Make of that what you will…
The peak capability is very obviously, and objectively, increasing.
The scaffolding you need to elicit top performance changes each generation. I feel it’s less scaffolding now to get good results. (Lots of the “scaffolding” these days is less “contrived AI prompt engineering” and more “well understood software engineering best practices”.)
However, right now it looks like we will move to training-specific hardware and inference-specific hardware, which hopefully relieves some of that tension.
I'll admit I'm a bit of a sceptic of AI but want to give it another shot over the weekend, what do people recommend these days?
I'm happy spending money but obviously don't want to spend a tonne since it's just an experiment for me. I hear a lot of people raving about Opus 4.5, though apparently using that is close to $20 a prompt; Sonnet 4.5 seems a lot cheaper, but then I don't know if I'm giving it (by "it" I mean AI coding) a fair chance if Opus is that much better. There's also OpenCode Zen, which might be a better option, I don't know.
That's people doing real-vibe coding prompts, like "Build me a music player with...". I'm using the $20 Codex plan and with getting it to plan first and then executing (in the same way I, an experienced dev would instruct a junior) haven't even managed to exhaust my 5-hour window limits, let alone the weekly limit.
Also if you keep an eye on it and kill it if it goes in the wrong direction you save plenty of tokens vs letting it go off on one. I wasted a bunch when Codex took 25 minutes(!) to install one package because something went wrong and instead of stopping and asking it decided to "problem solve" on its own.
The latest models are all really good at writing code. Which is better is just vibes and personal preference at this point IMO
The agent harness of claude code / opencode / codex is what really makes the difference these days
I'm not sure about Zen, but OpenAI seems to be giving me $20/week worth of tokens within the $20/month plan
Also for absolutely free models, MiniMax M2.1 has been impressive and useful to me (free through OpenCode). Don't judge the state of the art through the lens of that, though
Still not sure which one I'll go with, though I can't say I feel too keen to get into Claude after that
> Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime.
Or in the context of AI:
> Give a man code, and you help him for a day. Teach a man to code, and you help him for a lifetime.
> Give a person code, and you help them for a day. Teach them to code, and you frustrate them for a lifetime.
> This is a powerful idea, and no doubt contributed to the rapid improvement of AI coding assistants for a period of time. But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.
It is not just `inexperienced coders` that make this signal pretty much useless. I mostly use coding assistants for boilerplate: I will accept the suggestion and then delete much of what it produced, especially in the critical path.
For many users, this is much faster than trying to coax out another approximation
:,/^}/-d
Same for `10dd` etc... it is all muscle memory. Then again, I use a tiny local fill-in-the-middle LLM now, because it is good enough for most of the speedup without the cost/security/latency of a hosted model. It would be a mistake to think that filtering out junior devs will result in good data, as the concept is flawed in general. Accepting output may not have anything to do with the correctness of the provided content, IMHO.
What's strange is sometimes a fresh context window produces better results than one where you've been iterating. Like the conversation history is introducing noise rather than helpful context. Makes me wonder if there's an optimal prompt length beyond which you're actually degrading output quality.
From https://docs.github.com/en/copilot/concepts/prompting/prompt...:
Copilot Chat uses the chat history to get context about your request. To give Copilot only the relevant history:
- Use threads to start a new conversation for a new task
- Delete requests that are no longer relevant or that didn’t give you the desired result
It might still be:
- the closest to a correct solution the model can produce
- helpful for finding out what is wrong
- intended (e.g. in a typical, very short red->green unit-test dev approach you want to generate some code which doesn't run correctly _just yet_; tests for newly found bugs are supposed to fail until the bug is fixed, etc.; see the sketch below)
- however, if "making it run" means removing sanity checks, doing something semantically completely different, or similar, it is, like the OP author said, one of the worst outcomes
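To illustrate the red->green point, a minimal self-contained sketch (the function and the bug are hypothetical, not from the thread): the test documents a newly found bug and is expected to fail until the fix lands.

import pytest

# Red->green sketch: the current (deliberately buggy) implementation chokes on
# currency symbols, so this test fails now and passes once the bug is fixed.
def parse_price(text):
    return float(text)  # bug: cannot handle "$19.99"

def test_parse_price_accepts_currency_symbol():
    assert parse_price("$19.99") == pytest.approx(19.99)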
I’m sure it will get there as this space matures, but it feels like model updates are very force-fed to users
It's a major disservice to the problem to act like it's new and solved or even solvable using code revision language.
See the "Snapshots" section on these pages for GPT-4o and 4.1, for example:
https://platform.openai.com/docs/models/gpt-4o https://platform.openai.com/docs/models/gpt-4.1
This is done so that application developers whose systems depend upon specific model snapshots don't have to worry about unexpected changes in behaviour.
You can access these snapshots through OpenRouter too, I believe.
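For example (a minimal sketch using the OpenAI Python SDK; the dated snapshot shown is one listed on those pages and may not be the newest), pinning a snapshot instead of the floating alias keeps behaviour stable until you deliberately upgrade:

from openai import OpenAI

client = OpenAI()

# Pin a dated snapshot rather than the floating "gpt-4o" alias, so behaviour
# doesn't shift when the alias is repointed to a newer snapshot.
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # dated snapshot; "gpt-4o" tracks the latest
    messages=[{"role": "user", "content": "Summarise this changelog in one line."}],
)
print(response.choices[0].message.content)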
Not saying using major.minor depending on architecture is a bad thing, but it wouldn’t be SemVer, and that doesn’t even cover all the different fine tuning / flavors that are done off those models, which generally have no way to order them.
I think you could actually pretty cleanly map semver onto more structured prompt systems ala modern agent harnesses.
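A purely illustrative sketch of what that mapping could look like (my own naming, not an existing convention): bump major when the harness's output contract changes, minor when tools or capabilities are added, patch for wording-only prompt tweaks.

# Illustrative only: semver-style versioning for an agent harness's prompt assets.
# MAJOR: output contract changes (e.g. a different response schema is expected)
# MINOR: new tools or capabilities added
# PATCH: wording-only tweaks to prompts
HARNESS_VERSION = "2.1.3"

PROMPT_ASSETS = {
    "system_prompt": "prompts/system.md",   # hypothetical paths
    "tool_schemas": "prompts/tools.json",
    "style_guide": "prompts/style.md",
}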
I've been stung by them too many times.
The problem is the more I care about something, the less I'll agree with whatever the agent is trying to do.
This then leads to a million posts where on one side people say "yeah see they're crap" and on the other side people are saying "why did you use a model from 6 months ago for your 'test' and write up in Jan 2026?".
You might as well ignore all of the articles and pronouncements and stick to your own lived experience.
The change in quality between 2024 and 2025 is gigantic. The change between early 2025 and late 2025 is _even_ larger.
The newer models DO let you know when something is impossible or unlikely to solve your problem.
Ultimately, they are designed to obey. If you authoritatively request bad design, they're going to write bad code.
I don't think this is a "you're holding it wrong" argument. I think it's "you're complaining about iOS 6 and we're on iOS 12.".
So what about all those times I accepted the suggestion because it was "close enough", but then went back and fixed all the crap that AI screwed up? Was it training on what was accepted the first time? If so I'm sincerely sorry to everyone, and I might be single-handedly responsible for the AI coding demise. :'-D
The AI slop/astroturfing of YT is near complete.
And there's more than enough content for one person to consume. Very little reason to consume content newer than 2023.
The guy wrote code depending upon an external data file (one that the LLM didn't have access to), with code that referred to a non-existing column. They then specifically prompted it to provide "completed code only, without commentary". This is idiotic.
"Dear LLM, make a function that finds if a number is prime in linear time. Completed code only! No commentary!".
Guy wanted to advertise his business and its adoption of AI, and wrote some foolish pablum to do so. How is this doing numbers here?
I would expect older models make you feel this way.
* Agents not trying to do the impossible (or not being an "over eager people pleaser" as it has been described) has significantly improved over the past few months. No wonder the older models fail.
* "Garbage in, garbage out" - yes, exactly ;)
Edit: Changed 3.5 to 4.
Edit: Looking back at edits and check-ins by AI agents, it strikes me that the check-ins should contain the prompt used and the model version. More recent Aider versions do add the model.
I started programming before modern LLMs so I can still hack it without, it will just take a lot longer.
Maybe it's true that for some very bad prompts the old versions did a better job by not following the prompt, and that this reduced utility for some people.
Unrelated to assistants or coding, as an API user I've certainly had model upgrades that feel like downgrades at first, until I work out that the new model is following my instructions better. Sometimes my instructions were bad, sometimes they were attempts to get the older model to do what I want by saying over-the-top stuff that the new model now follows more precisely to a worse result. So I can definitely imagine that new models can be worse until you adapt.
Actually, another strange example like this - I had gotten in the habit of typing extremely fast to LLMs because they work just fine with my prompts riddled with typos. I basically disconnected the part of my brain that cares about sequencing between hands, so words like "can" would be either "can" or "cna". This ended up causing problems with newer models which would take my typos seriously. For example, if I ask to add support for commandline flag "allwo-netwokr-requests" it will usually do what I said, while previous versions would do what I wanted.
For anyone with some technical expertise and who is putting in serious effort to using AI coding assistants, they are clearly getting better at a rapid pace. Not worse.
For me, the writing speed has never been the issue. The issue has been my thinking speed. I do not see how an AI coding assistant helps me think better. Offloading thinking actually makes my thinking process worse and thus slower.
Similar to moving from individual work to coordinating a large codebase: coding agents, human or otherwise, let you think at a higher abstraction level and tackle larger problems by taking care of the small details.
I wonder if a very lightweight RL loop built around the user could work well enough to help the situation. As I understand it, current LLMs generally do not learn at a rate such that one single bad RL example and one (prompted?) better example could result in improvement at anywhere near human speed.
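To make the idea concrete, a speculative sketch (all names invented): such a loop would at minimum collect each rejected output next to the user's correction, so some lightweight preference-tuning step could consume the pairs later.

# Speculative sketch: per-user buffer of (prompt, rejected output, user correction)
# that a hypothetical lightweight preference-tuning job could consume.
from dataclasses import dataclass, field

@dataclass
class CorrectionExample:
    prompt: str
    rejected: str   # what the agent produced and the user threw away
    preferred: str  # what the user wrote or heavily edited instead

@dataclass
class UserFeedbackBuffer:
    examples: list = field(default_factory=list)

    def record(self, prompt, rejected, preferred):
        self.examples.append(CorrectionExample(prompt, rejected, preferred))

    def as_preference_pairs(self):
        # (prompt, chosen, rejected) triples for a DPO-style tuning pass
        return [(e.prompt, e.preferred, e.rejected) for e in self.examples]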
The issues have been less egregious than hallucinating an "index_value" column, though, so I'm sceptical. Opus 4.5 has still been useful for data preprocessing, especially in cases where the input data is poorly structured JSON.
I wish they would publish the experiment so people could try with more than just GPT and Claude, and I wish they would publish their prompts and any agent files they used. I also wish they would say what coding tool they used. Like did they use the native coding tools (Claude Code and whatever GPT uses) or was it through VSCode, OpenCode, aider, etc.?
As a side note, it is easy to create sharable experiments with Harbor - we migrated our own benchmarks there, here is our experience: https://quesma.com/blog/compilebench-in-harbor/.
I think all general AI agents are running into that problem - as AI becomes more prevalent and people accept and propagate wrong answers, the AI agents are trained to believe those wrong answers.
It feels like, lately, Google's AI search summaries are getting worse: they have a kernel of truth, but combine it with an incorrect answer.
I think if you keep the human in the loop this would go much better.
I've been having a lot of success recently by combining recursive invocation with an "AskHuman" tool that takes a required tuple of (question itself, how question unblocks progress). Allowing unstructured assistant dialog with the user/context is a train wreck by comparison. I've found that chain-of-thought (i.e., a "Think" tool that barfs into the same context window) seems to be directly opposed to the idea of recursively descending through the problem. Recursion is a much more powerful form of CoT.
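For concreteness, a minimal sketch of how such an "AskHuman" tool might be declared as a function-calling schema (the field names and shape are my own illustration, not the commenter's actual implementation):

# Illustrative only: an "AskHuman" tool exposed to the model via a
# function-calling schema; both fields are required, mirroring the
# (question, how it unblocks progress) tuple described above.
ASK_HUMAN_TOOL = {
    "type": "function",
    "function": {
        "name": "AskHuman",
        "description": "Ask the human a blocking question instead of guessing.",
        "parameters": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "The question itself, phrased for a human.",
                },
                "why_it_unblocks": {
                    "type": "string",
                    "description": "How an answer unblocks further progress.",
                },
            },
            "required": ["question", "why_it_unblocks"],
        },
    },
}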
> To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code.
AI trainers hired by companies like Outlier, Mercor and Alignerr are getting paid like $15-$45/hr. Reviewers are crap. The screening processes are horribly done by AI interviewers.
So much this... the number of times Claude sneaks in default values, or avoids unwrapping optional values just to avoid a crash at all costs... it's nauseating.
That said, the premise that AI-assisted coding got worse in 2025 feels off to me. I saw big improvements in the tooling last year.
Having tight control over the context and only giving it small tasks makes all the difference. The deepseek token costs are unbeatable too.
Until we start talking about LOC, programming language, domain expertise required, which agent, which version, and what prompt, it's impossible to make quantitative arguments.
It's clear AI coding assistants are able to help software developers at least in some ways.
Hearing a non-software-developer perspective on it is one thing, but be mindful that there are experienced folks too, for whom the technology feels like a jetpack.
If it didn't work for you, that just means there's more to learn.
We cannot assert that with certainty. If the datum is expected to be missing, such that a frame without it is still considered valid and must be handled rather than flagged as an error, the code has to do exactly that. Perhaps a missing column can be substituted with a zero:
df['new_column'] = df.get('index_value', 0) + 1
# there might be no column 'index_value';
# requirements say that zero should be substituted.

>>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.
>>AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.
- models actually getting worse in general
- his specific style of prompting working well with older models and less well with newer models
- the thing his test tests no longer being a priority for big AI labs
From the article:
> GPT-4 gave a useful answer every one of the 10 times that I ran it. In three cases, it ignored my instructions to return only code, and explained that the column was likely missing from my dataset, and that I would have to address it there.
Here, ignoring the instructions in order to give a "useful answer" (as evaluated by the author) is considered a good thing. This would mean that if a model is trained to be better at instruction following, it would lose points on that test.
To me this article feels a bit like saying "this new gun that shoots straight 100% of the time is worse than the older gun that shot straight only 50% of the time, because sometimes I shoot at something I don't actually want to shoot at!". And in a way, it is true: if you're used to being able to shoot at things without them getting hurt, the new gun will be worse from that point of view. But to spin up a whole theory about garbage in/garbage out from that? Or to think all models are getting worse rather than that you're maybe no longer the target audience? That seems weird to me.
Seems we agree the better response, when the column needed for index_value + 1 doesn't exist, is to call it out instead of stealthily appending a new column; but why the newer models have that behavior is indeed speculative.
It echoes a bit the conundrum from back in the PC days, where IBM hardware was the de facto standard and companies building "compatible" hardware had to decide whether to be compatible with the spec or with every detail of the implementation, including buggy behavior, of which of course some software took advantage. So, do they build to be "compatible" or "bug-compatible"?
Was the GPT-4 response highlighting the missing column a bug, or a failure to shoot straight? Not sure I'd characterize it that way, but there definitely could be many other reasons for the change in behaviour (other than training on lower-skilled programmers' inputs); we really have to treat that as a conjecture on the author's part.
Heh, there's only one problem with that. Training models is very expensive from a power/infrastructure/hardware perspective. Inference is not as expensive but it's still fairly expensive and needs sophisticated layers on top to make it cheaper (batching, caching, etc).
Guess in which cost category "high-quality data reviewed by experts" falls under.
There are tons of articles online about this, here's one:
https://finance.yahoo.com/news/amazon-bets-ai-spending-capex...
They're all doing it, Microsoft, Google, Oracle, xAI, etc. Those nuclear power plants they want to build, that's precisely to power all the extra data centers.
If anything, everyone hopes to outsource data validation (the modern equivalent to bricklayers under debt slavery).
CLI vs IDE vs web?
Nothing for GPT Codex 5.1 Max or 5.2 Max?
Nothing about the prompts? The quality of the prompts? I literally feed the AI into the AI: I just ask a smaller model for the most advanced prompts and then use them for the big stuff, and it's smooth sailing.
I got Codex 5.1 Max, with the Codex extension in VS Code, to generate over 10k lines of code for my website demo project, and it worked first time.
This is also with just the regular $20 subscription.
GitHub Copilot Pro+ with VS Code is my main go-to, and the project, the prompts, the agent.md quality, and the project configuration can all change the outcome of each question.
Anyways, no issue. We'll just get Claude to start answering Stack Overflow questions!
Gemini 2.5 was genuinely impressive. I even talked it up here. I was a proper fanboy and really enjoyed using it. Gemini 3 is still good at certain things, but it is clearly worse than 2.5 when it comes to working with larger codebases. Recently, I was using AntiGravity and it could not help me find or fix a reference-counting bug (50 classes, 20k LOC total, so well within context limits). I know AntiGravity is new, which explains why it is rough around the edges. But it is built on Gemini, so the results should at least be on par with Gemini 3, right? Apparently not. I am an excellent prompter, and no amount of additional context, call stacks, watch-window values, you name it, made any difference.
I still use Gemini for code reviews and simple problems, and it remains excellent for those use cases. But in many respects, Gemini 3 is a regression. It hallucinates more, listens less, and seems oddly resistant to evidence. It produces lots of lofty, confident-sounding statements while ignoring the actual facts in front of it. The experience can be exhausting, and I find myself using it much less as a result. I guess this is typical of companies these days - do something great and then enshittify it? Or maybe there are technical issues I'm not aware of.
What is especially interesting is reading all the articles proclaiming how incredible AI coding has become. And to be fair, it is impressive, but it is nowhere near a magic bullet. I recently saw a non-programmer designer type claiming he no longer needs developers. Good luck with that. Have fun debugging a memory leak, untangling a database issue, or maintaining a non-trivial codebase.
At this point, I am pretty sure my use cases are going to scale inversely with my patience and with my growing disappointment.
> Here’s the same text with all em dashes removed and the flow adjusted accordingly:
Did you have an LLM write your comment then remove the evidence?
Sorry, I should be clear: do you have a problem with that?
> My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop.
Tons of smart people not using it right
Unsure of the power it can actually unleash with the right prompt + configuration
100% needs a human in the loop
It's not Jarvis
This is a problem that started with, I think, Claude Sonnet 3.7? Or 3.5, I don't remember well. But it's not recent at all; one of those two Sonnets was known to change tests so that they would pass, even if they no longer tested things properly.
>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data. AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.
No proof or anything is offered here.
The article feels mostly like a mix of speculation and being behind on current practice. You can avoid a lot of the problems of "code that looks right" by making the models write tests, insisting that the tests are easy to review and hard to fake, and offering examples. This worked well 6 months ago and works even better today, especially with Opus 4.5, but even Codex 5.2 and Gemini 3 Pro do well.
"On two occasions I have been asked, – "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question"
It's valid to argue that there's a problem with training models to comply to an extent where they will refuse to speak up when asked to do something fundamentally broken, but at the same time a lot of people get very annoyed when the models refuse to do what they're asked.
There is an actual problem here, though, even if part of the problem is competing expectations of refusal.
But in this case, the test is also a demonstration of exactly how not to use coding assistants: Don't constrain them in ways that create impossible choices for them.
I'd guess (I haven't tested) that you'd have decent odds of getting better results just by pasting the error message into an agent than by adding stupid restrictions. And even better if you actually had a test case that verified valid output.
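For example, a minimal sketch of such a test (the function and column name are stand-ins based on the article's scenario, not real project code): it fails loudly when the column is missing or fabricated, instead of letting "runs to completion" count as success.

import pandas as pd
import pytest

# Hypothetical stand-in for the code under test: adds one to an existing column
# and must NOT invent the column if it is missing.
def add_next_index(df):
    out = df.copy()
    out['new_column'] = out['index_value'] + 1  # raises KeyError if the column is absent
    return out

def test_raises_when_column_missing():
    with pytest.raises(KeyError):
        add_next_index(pd.DataFrame({'other': [1, 2, 3]}))

def test_adds_one_to_existing_column():
    result = add_next_index(pd.DataFrame({'index_value': [1, 2]}))
    assert list(result['new_column']) == [2, 3]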
(and on a more general note, my experience is exactly the opposite of the writer's two first paragraphs)
I've observed the same behavior somewhat regularly, where the agent will produce code that superficially satisfies the requirement, but does so in a way that is harmful. I'm not sure if it's getting worse over time, but it is at least plausible that smarter models get better at this type of "cheating".
A similar type of reward hacking is pretty commonly observed in other types of AI.
> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.
But the problem with their expectation is that this is arguably not what they asked for.
So refusal would be failure. I tend to agree refusal would be better. But a lot of users get pissed off at refusals, and so the training tends to discourage that (some fine-tuning and feedback projects (SFT/RLHF) outright refuse to accept submissions from workers that include refusals).
And asking for "complete" code without providing a test case showing what they expect such code to do does not have to mean code that runs to completion without error, but again, in lots of other cases users expect exactly that, and so for that as well a lot of SFT/RLHF projects would reject responses that don't produce code that runs to completion in a case like this.
I tend to agree that producing code that raises a more specific error would be better here too, but odds are a user that asks a broken question like that will then just paste in the same error with the same constraint. Possibly with an expletive added.
So I'm inclined to blame the users who make impossible requests more than I care about the model doing dumb things in response to dumb requests. As long as they keep doing well on more reasonable ones.
This week I asked GPT-5.2 to debug an assertion failure in some code that worked on one compiler but failed on a different compiler. I went through several rounds of GPT-5.2 suggesting almost-plausible explanations, and then it modified the assertion and gave a very confident-sounding explanation of why it was reasonable to do so, but the new assertion didn't actually check what the old assertion checked. It also spent an impressive amount of time arguing, entirely incorrectly and based on flawed reasoning that I don't really think it found in its training set, as to why it wasn't wrong.
I finally got it to answer correctly by instructing it that it was required to identify the exact code generation difference that caused the failure.
I haven’t used coding models all that much, but I don’t think the older ones would have tried so hard to cheat.
This is also consistent with reports of multiple different vendors’ agents figuring out how to appear to diagnose bugs by looking up the actual committed fix in the repository.
https://theonion.com/this-war-will-destabilize-the-entire-mi...
"This War Will Destabilize The Entire Mideast Region And Set Off A Global Shockwave Of Anti-Americanism vs. No It Won’t"
I think he has two contradictory expectations of LLMs:
1) Take his instructions literally, no matter how ridiculous they are.
2) Be helpful and second guess his intentions.
GPT-5 has been trained to adhere to instructions more strictly than GPT-4. If it is given nonsensical or contradictory instructions, it is a known issue that it will produce unreliable results.
A more realistic scenario would have been for him to have requested a plan or proposal as to how the model might fix the problem.
If you're going to write about something that's been true and discussed widely online for a year+, at least have the awareness/integrity to not brand it as "this new thing is happening".
The agents available in January 2025 were much much worse than the agents available in November 2025.
The models have gotten very good, but I'd rather have an obviously broken pile of crap that I can spot immediately than something that is deep-fried with RL to always "succeed" but has subtle problems that someone will lgtm :( I guess it's not much different with human-written code, but the models seem to have weirdly inhuman failures: you just skim some code because you can't believe anyone could do that part wrong, and it turns out to be wrong.
The problem (which should be obvious) is that with a and b being real numbers you can't construct an exhaustive input/output set. A test case can only prove the presence of a bug, not its absence.
Another category of problems that you can't just test for, and instead have to prove correct, is concurrency problems.
And so forth and so on.
Even an add_numbers function can have bugs, e.g. you have to ensure the inputs are numbers. Most coding agents would catch this in loosely-typed languages.
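As a tiny illustration (my own sketch, not the commenter's code): even add_numbers wants an explicit guard in a loosely-typed language, and tests over real-valued inputs can only sample the space, never exhaust it.

from numbers import Number

# Even "add two numbers" needs input validation in a loosely-typed language.
def add_numbers(a, b):
    if not isinstance(a, Number) or not isinstance(b, Number):
        raise TypeError("add_numbers expects numeric inputs")
    return a + b

assert add_numbers(2, 3) == 5
assert add_numbers(0.1, 0.2) != 0.3  # floating point: tests can reveal surprises, not rule them out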