Every one of these discussions boils down to the following:
- LLMs are not good at writing code on their own unless it's extremely simple or boilerplate
- LLMs can be good at helping you debug existing code
- LLMs can be good at brainstorming solutions to new problems
- The code that is written by LLMs always needs to be heavily monitored for correctness, style, and design, and then typically edited down, often to at least half its original size
- LLMs' utility is high enough that they are now going to be a standard tool in the toolbox of every software engineer, but they are definitely not replacing anyone at current capability.
- New software engineers are going to suffer the most because they know how to edit the responses the least, but this was true when they wrote their own code with stack overflow.
- At senior level, sometimes using LLMs is going to save you a ton of time and sometimes it's going to waste your time. Net-net, it's probably positive, but there are definitely some horrible days where you spend too long going back and forth, when you should have just tried to solve the problem yourself.
Searching for solutions and integrating examples found requires effort that develops into a skill. You would rarely get solutions that would just fit into the codebase from SO. If I give a task to you and you produce a correct solution on the initial review I now know I can trust you to deal with this kind of problem in the future. Especially after a few reviews.
If you just vibed through the problem the LLM might have given you the correct solution - but there is no guarantee that it will do it again in the future. Just because you spent less effort on search/official docs/integration into the codebase you learned less about everything surrounding it.
So using LLMs as a junior you are just breaking my trust, and we both know you are not a competent reviewer of LLM code - why am I even dealing with you when I'll get LLM outputs faster myself? This was my experience so far.
So much this. I see a 1000-line, super complicated PR that was whipped up in less than a day and I know they didn't read all of it, let alone understand it.
Ultra short cycle: Pairing with a senior, solid manual and automated testing during development.
Reasonably short cycle: Code review by a senior within hours and for small subsets of the work ideally, QA testing by a separate person within hours.
Borderline too long cycle: Code review of larger chunks of code by a senior with days of delay, QA testing by a separate person days or weeks after implementation.
Terminally long feedback cycle: Critical bug in production, data loss, negative career consequences.
I'm confident that juniors will still learn, eventually. Seniors can help them learn a whole lot faster though, if both sides want that, and if the organisation lets them. And yeah, that's even more the case than in the pre LLM world.
> you learned less about everything surrounding it.
I think one of the big acceleration points in my skills as a developer was when I moved from searching SO and other similar sources to reading the docs and reading the code. At first, this was much slower. I was usually looking for a more specific thing and didn't usually need the surrounding context. But then as I continued, that surrounding context became important. That stuff I was reading compounded and helped me see much more. These gains were completely invisible and sometimes even looked like losses. In reality, that context was always important, I just wasn't skilled enough to understand why. Those "losses" are more akin to a loss you have when you make an investment. You lost money, but gained a stock. I mean I still use SO, medium articles, LLMs, and tons of sources. But I find myself just turning to the docs as my first choice now. At worst I get better questions to pay attention to with the other sources.
I think there's this terrible belief that's developed in CS and the LLM crowd targets. The idea that everything is simple. There's truth to this, but there's a lot of complexity to simplicity. The defining characteristic between an expert and a novice is their knowledge of nuance. The expert knows what nuances matter and what don't. Sometimes a small issue compounds and turns into a large one, sometimes it disappears. The junior can't tell the difference, but the expert can. Unfortunately, this can sound like bikeshedding and quibbling over nothings (sometimes it is). But only experts can tell the difference ¯\_(ツ)_/¯
The thing about AI is that when it started out (coding models), it was kinda bad. But I feel any tool that saves time or effort is a useful tool. I use AI now mostly to add some methods, ask questions about the code base and brainstorm ideas against that code base. There are levels to how you use this tool (AI).
1. Complete trust (if it's an easy task and you can verify quickly). 2. Medium trust (you ask questions back to the AI to critically understand why it did what it did). 3. Zero trust (this is very important for learning fast, not coding. You need to push the AI to give you lots of information, right or wrong, cross-check it manually and soak it into your brain carefully. Here you will know whether that AI is good or bad.)
Conclusion: We are human beings. Any tool must be used with caution, especially AI that is capable of playing tricks with your precious brain. Build razor sharp instincts and trust them ONLY.
It's exhausting to hear about AI all the time but it's fun to watch history happen. In a decade we'll look back at all these convos and remember how wild of a time it was to be a programmer.
I stopped coming here for a year or two, now I visit once a day or so and mostly just skim a couple threads.
Eventually, this entire field... just starts to feel pretty cyclical.
A bit of proof: https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-s...
1. Do you use source control? - I haven't seen a software company without source control in... 2 decades?
2. Can you make a build in one step? - This is a bit tricky, but it's super widespread. Maybe not universal, but very widespread.
3. Do you make daily builds? - Same as #2.
4. Do you have a bug database? - Same as #2.
5. Do you fix bugs before writing new code? - This is a debatable topic but you could argue that modern bugs are more complex and we are fixing them.
6. Do you have an up-to-date schedule? - Heh, some things you just can't win :-p
7. Do you have a spec? - Similar to #6.
8. Do programmers have quiet working conditions? - This one is the biggest modern failure, but it's not one of tech.
9. Do you use the best tools money can buy? - Similar to #2.
10. Do you have testers? - We've moved to automated testing. We've lost some flair but we've gained a lot in day-to-day quality.
11. Do new candidates write code during their interview? - Really widespread now, but not universal. Less widespread than having proper build systems.
12. Do you do hallway usability testing? - Varies a lot by field, it used to vary even back in the day.
makes me think the bots are providing these conversations
That's what programming with LLMs is, it's just project management: You split the tasks into manageable chunks (ones that can be completed in a single context window), you need to have good onboarding documentation (CLAUDE.md or the equivalent) and good easy to access documentation (docs/ with markdown files).
Exactly what you use to manage a team of actual human programmers.
Right! Problem: billions of dollars have been poured into this with respect to infrastructure, datacenters, compute and salaries. LLMs need to be at the level of replacing vast swathes of us to be worth it. LLMs are not going to be doing that.
This is a colossal malinvestment.
Nobody knows when. But it will. TBH the biggest danger is that all the hopes and dreams aren't materialised and the appetite for high-risk investments dissipates.
We've had this period in which you can be money-losing and it's OK. But I believe we have passed the peak on that - and this is destined to blow up.
I’ve wasted so much time trying to make it write actual production quality code. The consistency and over-verbose nature kill it for me.
If you have a sophisticated agent system that uses multiple forward and backward passes, the quality improves tremendously.
Based on my set up as of today, I’d imagine by sometime next year that will be normal and then the conversation will be very different; mostly around cost control. I wouldn’t be surprised if there is a break out popular agent control flow language by next year as well.
The net is that unsupervised AI engineering isn’t really cheaper, better, or faster than human engineering right now. Does that mean in two years it will be? Possibly.
There will be a lot of optimizations in the message traffic, token uses, foundational models, and also just the Moore’s law of the hardware and energy costs.
But really it’s the sophistication of the agent systems that controls quality more than anything. Simply following waterfall (I know, right? Yuck… but it worked) increased code quality tremendously.
I also gave it the SelfDocumentingCode pattern language that I wrote (on WikiWikiWeb) as a code review agent and quality improved tremendously again.
Currently it's just VC funded. The $20 packages they're selling are in no way cost-effective (for them).
That's why I'm driving all available models like I stole them, building every tool I can think of before they start charging actual money again.
By then local models will most likely be at a "good enough" level especially when combined with MCPs and tool use so I don't need to pay per token for APIs except for special cases.
Now they have 5 hour buckets of limited use.
Groq most likely stays afloat because they're a bit player - and propped by VC money.
With a local system I can run it at full blast all the time, nobody can suddenly make it stupid by reallocating resources to training their new model, nobody can censor it or do stealth updates that make it perform worse.
I had 3 projects running today. I hit my Claude Max Pro session limits twice today in about 90 minutes. I'm now keeping it down to 1 project, and I may interrupt it until the evening when I don't need Claude Web. If I could run it passively on my laptop, I would.
Just an hour ago I asked Claude to find bugs in a function and it found 1 real bug and 6 hallucinated bugs.
One of the "bugs" it wanted to "fix" was to revert a change that I had made previously to fix a bug in code it had written.
I just don't understand how people burning tokens on sophisticated multi-agent systems are getting any value from that. These LLMs don't know when they are doing something wrong, and throwing more money at the problem won't make them any smarter. It's like trying to build Einstein by hiring more and more schoolkids.
Don't get me wrong, Claude is a fantastic productivity boost but letting it run around unsupervised would slow me down rather than speed me up.
What Moore's law?
For this, language matters a lot: if whatever you're using has robust tools for linting and style checks, it makes the LLM's job a lot easier. Give it a rule (or a forced hook) to always run tests and linters before claiming a job is done and it'll iterate until what it produces matches the rules.
But LLM code has a habit of being very verbose and covers every situation no matter how minuscule.
This is especially grating when you're doing a simple project for local use and it's bootstrapping something that's enterprise-ready :D
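For what it's worth, the "always run tests and linters before claiming a job is done" rule can be a very small script. This is only a rough sketch: the commands in `CHECKS` (pytest and ruff) are placeholders I picked for illustration, and how you wire the script into your agent's hook mechanism depends on the tool and is not shown here.

```python
import subprocess
import sys

# Hypothetical check commands; swap in whatever your project actually uses.
CHECKS = [
    [sys.executable, "-m", "pytest", "-q"],
    [sys.executable, "-m", "ruff", "check", "."],
]

def run_checks(checks):
    """Run each command in order; return (command, output) for every failure."""
    failed = []
    for cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append((cmd, result.stdout + result.stderr))
    return failed

def gate(checks):
    """Return 0 if every check passed, 1 otherwise (printing failing output)."""
    failed = run_checks(checks)
    for cmd, output in failed:
        print("FAILED:", " ".join(cmd))
        print(output)
    return 1 if failed else 0

# Typical use at the end of a script: sys.exit(gate(CHECKS))
```

A non-zero exit code plus the printed output gives the model something concrete to iterate against, which is the whole point of the rule.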
1) I broke the tests, guess I should delete them.
2) I broke the tests, guess the code I wrote was wrong, guess I should delete all of that code I wrote.
3) I broke the tests, guess I should keep adding more code and scaffolding. Another abstraction layer might work? What if I just add skeleton code randomly, does this add random code whack-a-mole work?
That last one can be particularly "fun" because already verbose LLM code skyrockets into baroque million line PRs when left truly unsupervised, and that PR still won't build or pass tests.
There's no true understanding by an LLM. Forcing it to lint/build can be important/useful, but it's still not a cure-all, and it leads to such fun, even more degenerate cases than hand-holding it does.
I also think there's some big variance in each of the "sides" (I think it is more a bimodal spectrum really) with a lot to your last point. Sometimes they save you lots of time, sometimes they waste a lot of time. I expect more senior people are going to get less benefits from them because they've already spent lots of time developing time saving strategies. Plus, writing lines is only a small part of the job. The planning and debugging stages are much more time intensive and can be much more difficult to wrangle an LLM with. Honestly I think it is a lot about trust. Forgetting "speed", do I trust myself to be more likely to catch errors in code that I write or code that I review?
Personally, I find that most of the time I end up arguing with the LLM over some critical detail and I've found Claude code will sometimes revert things that I asked it to change (these can be time consuming errors because they are often invisible). It gives the appearance of being productive (even feeling that way) but I think it is a lot more like if you spent time in a meeting vs time coding. Meetings can help and are very time consuming, but can also be a big waste of time when over used. Sometimes it is better to have two engineers go try out their methods independently and see what works out within the larger scope. Something is always learned too.
Small price to pay for shuffling Agile Manifesto off the stage.
1) Don't ask for large / complex change. Ask for a plan but ask it to implement the plan in small steps and ask the model to test each step before starting the next.
2) For really complex steps, ask the model to write code to visualize the problem and solution.
3) If the model fails on a given step, ask it to add logging to the code, save the logs, run the tests and then review the logs to determine what went wrong. Do this repeatedly until the step works well.
4) Ask the model to look at your existing code and determine how it was designed to implement a task. Some times the model will put all of the changes in one file but your code has a cleaner design the model doesn't take into account.
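For tip 3, the "save the logs, run the tests, review the logs" loop is easier if every test run lands in its own file. A minimal sketch, assuming the tests are started from a command line; `run_and_log` and the `step_logs` directory are names made up for this example.

```python
import subprocess
import sys
import time
from pathlib import Path

def run_and_log(cmd, log_dir="step_logs"):
    """Run a test command, save its combined stdout/stderr to a new log file,
    and return (exit_code, log_path) so the log can be handed back for review."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    # Nanosecond timestamp keeps successive runs from overwriting each other.
    log_path = Path(log_dir) / f"run-{time.time_ns()}.log"
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    log_path.write_text(result.stdout)
    return result.returncode, log_path

# Example (hypothetical command): run_and_log([sys.executable, "-m", "pytest", "-q"])
```

On a non-zero exit code you point the model at the returned log path and ask it what went wrong, then repeat until the step passes.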
I've seen other people blog about their tricks and tips. I do still see garbage results but not as high as 95%.
That's been my experience.
I've been working on a 100% vibe-coded app for a few weeks. API, React-Native frontend, marketing website, CMS, CI/CD - all of it without changing a single line of code myself. Overall, the resulting codebase has been better than I expected before I started. But I would have accomplished everything it has (except for the detailed specs, detailed commit log, and thousands of tests), in about 1/3 of the time.
The commits - some would be detailed, plenty would have been "typo" or "same as last commit, but works this time"
The tests - Probably would have been decent for the API, but not as thorough. Likely non-existent for the UI.
As for time - I agree with the other response - I wouldn't have taken the time.
If you really can write a full-ass system like that faster than an LLM, you're either REALLY fucking good at what you do (and an amazing typer), or you're holding the LLM wrong as they say.
The issue is getting the LLM to write _reasonably decent_ code without having to read every line and make sure it's not doing anything insane. I've tried a few different methods of prompting, but setting up a claude sub-agent that's doing TDD very explicitly and ensuring that all tests pass after every iteration has been most effective.
My first attempt was so fast, it was mind-bending. I had a "working" App and API running in about a day and a half. And then when I tried to adjust features, it would start changing things all over the place, LOTS of tests were failing, and after a couple prompts, I got to a point where the app was terribly broken. I spent about a day trying to prompt my way out of a seemingly infinite hole. I did a more thorough code review and it was a disaster: Random code styles, tons of half-written and abandoned code, tests that did nothing, //TODOs everywhere, and so, so many tweaks for backwards compatibility - which I did NOT need for a greenfield project
At that point I scrapped the project and adjusted my approach. I broke down the PRD into more thorough documentation for reference. I streamlined the CLAUDE.md files. I compiled a standard method of planning / documenting work to be done. I created sub-agents for planning and implementation. I set up the primary implementation sub-agents to split up the spec into bite-sized portions of work ("30-45 minute tasks").
Now I'm at the opposite side of the spectrum - implementation is dog slow, but I rarely have to read what was actually written. I still review the code at large after the primary tasks are finished (comparing the feature branch against main in my IDE), but for the most part I've been able to ignore the output and rely on my manual app tests and then occasionally switch models (or LLMs) and prompt for a thorough code-review.
I'm at the point now where I have to yell at the AI once in a while, but I touch essentially zero code manually, and it's acceptable quality. Once I stopped and tried to fully refactor a commit that CC had created, but I was only able to make marginal improvements in return for an enormous time commitment. If I had spent that time improving my prompts and running refactoring/cleanup passes in CC, I suspect I would have come out ahead. So I'm deliberately trying not to do that.
I expect at some point on a Friday (last Friday was close) I will get frustrated and go build things manually. But for now it's a cognitive and effort reduction for similar quality. It helps to use the most standard libraries and languages possible, and great tests are a must.
Edit: Also, use the "thinking" commands. think / think hard / think harder / ultrathink are your best friend when attempting complicated changes (of course, if you're attempting complicated changes, don't.)
I have been doing my best to give these tools a fair shake, because I want to have an informed opinion (and certainly some fear of being left behind). I find that their utility in a given area is inversely proportional to my skill level. I have rewritten or fixed most of the backend business logic that AI spits out. Even if it’s mostly ok on a first pass, I’ve been doing this gig for decades now and I am pretty good at spotting future technical debt.
On the other hand, I’m consistently impressed by its ability to save me time with UI code. Or maybe it’s not that it saves me time, but it gets me to do more ambitious things. I’d typically just throw stuff on the page with the excuse that I’m not a designer, and hope that eventually I can bring in someone else to make it look better. Now I can tell the robot I want to have drag and drop here and autocomplete there, and a share to flooberflop button, and it’ll do enough of the implementation that even if I have to fix it up, I’m not as intimidated to start.
It even discovered that we have some internal components and used them for it.
Got me from 0-MVP in less than an hour. Would've easily taken me a full day.
I have loved GPT5 but the other day I was trying to implement a rather novel idea that would be a rather small function and GPT5 goes from a genius to an idiot.
I think HN has devolved into random conversations based on a random % of problems being in the training data or not. People really are having such different experiences with the models based on the novelty of the problems that are being solved.
At this point it is getting boring to read.
In theory. In practice, it's not a very secure sandbox and Claude will happily go around updating files if you insist / the prompt is bad / it goes off on a tangent.
I really should just set up a completely sandboxed VM for it so that I don't care if it goes rm-rf happy.
A sandboxed devcontainer is worth setting up though. Lets me run it with --dangerously-skip-permissions
Nothing dangerous, but the limits are more like suggestions, as the Pirate code says.
But here's the fine print: it has an "exit plan mode" tool, documented here: https://minusx.ai/blog/decoding-claude-code/#appendix
So it can exit plan mode on its own and you wouldn't know!
Permission limitations on the root agent have, in many cases, not been propagated to child agents, and they’ve been able to execute different commands. The documentation is incomplete and unclear, and even to the extent that it is clear it has a different syntax with different limitations than are used to configure permissions for the root agent. When you ask Claude itself to generate agent configurations, as is recommended, it will generate permissions that do not exist anywhere in the documentation and may or may not be valid, but there’s no error emitted if an invalid permission is set. If you ask it to explain, it gets confused by their own documentation and tells you it doesn’t know why it did that. I’m not sure if it’s hallucinating or if the agent-generating agent has access to internal details that are not documented anywhere and which the normal agent can’t see.
Anthropic is pretty consistently the best in this space in terms of security and product quality. They seem to actually care about doing software engineering properly. (I’ve personally discovered security bugs in several competing products that are more severe and exploitable than what I’m talking about here.) I have a ton of respect for Anthropic. Unfortunately, when it comes to sub agents in Claude Code, they are not living up to the standard they have set.
I.e. not its own tools, but command-line executables.
Its assumptions about these commands, and specifically the way it ran them, were correct.
But I have seen it run commands in plan mode.
The softmax activation function picks the most promising activations for a given output token.
The V(=value) matrix forms another neural network where each token is turned into a tiny regressor neural network that accepts the activation as an input and produces multiple outputs that are summed up to produce an intermediate token which is then fed into the MLP layer.
From this perspective the transformer architecture is building neural networks at runtime.
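To make the softmax and value steps concrete, here is a toy single-head attention for one query in plain Python. It is a sketch under standard textbook assumptions (scaled dot-product scores, no learned projections, no masking), not any particular model's code, and the names `softmax` and `attend` are just illustrative. Note that softmax does not strictly pick one activation; it weights all of them, with the largest scores dominating.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head attention for one query over the tokens in context.

    Each key is scored against the query, softmax turns the scores into
    weights, and the value vectors are summed using those weights.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    out = [0.0] * len(values[0])
    for w, value in zip(weights, values):
        for i, x in enumerate(value):
            out[i] += w * x
    return weights, out
```

The output vector is what would be passed on to the MLP layer. Only the keys and values passed in can receive any weight, so anything outside the context never enters the sum.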
But there are some pretty obvious limitations here: The LLM operates on tokens, which means it can only operate on what is in the KV-cache/context window. If the candidates are not in the context window, it can't score them.
My question to the post I replied to was basically: given a coding problem, and a list of possible solutions (candidates), how can a LLM generate a meaningful numerical score for each candidate to then say this one is a better solution than that one?
In order for it not to do useless stuff I need to expend more energy on prompting than writing stuff myself. I find myself getting paranoid about minutiae in the prompt, turns of phrase, unintended associations in case it gives shit-tier code because my prompt looked too much like something off experts-exchange or whatever.
What I really want is something like a front-end framework but for LLM prompting, that takes away a lot of the fucking about with generalised stuff like prompt structure, default to best practices for finding something in code, or designing a new feature, or writing tests..
It's not simple to even imagine an ideal solution. The more you think about it the more complicated your solution becomes. A simple solution will be restricted to your use cases. A generic one is either visual or a programming language. I'd like to have a visual constructor, a graph of actions, but it's complicated. A language is more powerful.
Writing the code is the fast and easy part once you know what you want to do. I use AI as a rubber duck to shorten that cycle, then write it myself.
How do you verify it is teaching you the correct thing if you don't have any baseline to compare it to?
Doesn't that sound ridiculous to you?
Admittedly, part of it is my own desire for code that looks a certain way, not just that which solves the problem.
Choosing the battles to pick is part of the skill at the moment.
I use AI for a lot of boiler plate, tedious tasks I can’t quite do a vim recording for, small targeted scripts.
The boilerplate argument is becoming quite old.
It’s basically just a translation, but with dozens of tables, each with dozens of columns it gets tedious pretty fast.
If given other files from the project as context it’s also pretty good at generating the table and column descriptions for documentation, which I would probably just not write at all if doing it by hand.
I think you need to imagine all the things you could be doing with LLMs.
For me the biggest thing is so many tedious things are now unlocked. Refactors that are just slightly beyond the IDE, checking your config (the number of typos it’s picked up that could take me hours because eyes can be stupid), data processing that’s similar to what you have done before but different enough to be annoying.
It's not AI, there is no intelligence. A language model as the name says deals with language. Current ones are surprisingly good at it but it's still not more than that.
I’ve noticed colleagues who enjoy Claude code are more interested in “just ship it!” (and anecdotally are more extroverted than myself).
I find Claude code to be oddly unsatisfying. Still trying to put my finger on it, but I think it’s that I quickly lose context. Even if I understand the changes CC makes, it’s not the same as wrestling with a problem and hitting roadblocks and overcoming. With CC I have no bearing on whether I’m in an area of code with lots of room for error, or if I’m standing in the edge of a cliff and can’t cross some line in the design.
I’m way more concerned with understanding the design and avoiding future pain than my “ship it” colleagues (and anecdotally am way more introverted). I see what they build and, yes, it’s working, for now, but the table relationships aren’t right and this is going to have to be rebuilt later, except now it’s feeding a downstream report that’s being consumed by the business, so the beta version is now production. But the 20 other things this app touches indirectly weren’t part of the vibe coding context, so the design obviously doesn’t account for that. It could, but of course the “ship it” folks aren’t the ones that are going to build out lengthy requirements and scopes of work and document how a dozen systems relate to and interact with each other.
I guess I’m seeing that the speed limit of quality is still the speed of my understanding, and (maybe more importantly) that my weaponizing of my own obsession only works when I’m wrestling and overcoming, not just generating code as fast as possible.
I do wonder about the weaponized obsession. People will draw or play music obsessively, something about the intrinsic motivation of mastery, and having AI create the same drawing, or music, isn’t the same in terms of interest or engagement.
I just don't enjoy the work as much as I did when I was younger. Now I want to get things done and then spend the day on other more enjoyable (to me) stuff.
But I can’t tell you any useful tips or tricks to be honest. It’s like trying to teach a new driver the intuition of knowing when to brake or go when a traffic light turns yellow. There’s like nothing you can really say that will be that helpful.
Sure, some skills are more about practice, not rules, but hopefully you're not a driving instructor.
The funny thing is - we need less. Less of everything. But an up-tick in quality.
This seems to happen with humans with everything - the gates get opened, enabling a flood of producers to come in. But this causes a mountain of slop to form, and over time the tastes of folks get eroded away.
Engineers don't need to write more lines of code / faster - they need to get better at interfacing with other folks in the business organisation and get better at project selection and making better choices over how to allocate their time. Writing lines of code is a tiny part of what it takes to get great products to market and to grow/sustain market share etc.
But hey, good luck with that - one's thinking power is diminished over time by interacting with LLMs etc.
Sometimes I reflect on how much more efficiently I can learn (and thus create) new things because of these technologies, then get anxiety when I project that to everyone else being similarly more capable.
Then I read comments like this and remember that most people don't even want to try.
Come back and post here when you have built something that has commercial success.
Show us all how it's done.
Until then go away - more noise doesn't help.
I'm still the one doing the doing after the learning is complete.
They are the single closest thing we've ever had to objective evaluation on if an engineering practice is better or worse. Simply because just about every single engineering practice that I see that makes coding agents work well also makes humans work well.
And so many of these circular debates and other best practices (TDD, static typing, keeping todo lists, working in smaller pieces, testing independently before testing together, clearly defined codebase practices, ...) have all been settled in my mind.
The most controversial take, and the one I dislike but may reluctantly have to agree with, is "Is it better for a business to use a popular language less suited for the task than a less popular language more suited for it?" While obviously it's a sliding scale, coding agents clearly weigh in on one side of this debate... as little as I like seeing it.
The best way is to create tests yourself, and block any attempts to modify them
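One way to enforce "block any attempts to modify them" is to fingerprint the test files before the agent starts and refuse its work if any fingerprint changes. A minimal sketch, assuming the tests are `.py` files under one directory; `snapshot` and `changed_tests` are hypothetical helper names, and read-only file permissions or a CI check would be other ways to get the same effect.

```python
import hashlib
from pathlib import Path

def snapshot(test_dir):
    """Map each test file's relative path to the SHA-256 of its contents."""
    root = Path(test_dir)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*.py"))
    }

def changed_tests(baseline, test_dir):
    """Return paths added, removed, or modified since the baseline snapshot."""
    current = snapshot(test_dir)
    paths = set(baseline) | set(current)
    return sorted(p for p in paths if baseline.get(p) != current.get(p))
```

Take the snapshot yourself before handing over the task, then reject the change if `changed_tests` returns anything. That catches deleted and weakened tests as well as edited ones.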
I've interviewed with three tier one AI labs and _no-one_ I talked to had any idea where the business value of their models came in.
Meanwhile Chinese labs are releasing open source models that do what you need. At this point I've built local agentic tools that are better than anything Claude and OAI have as paid offerings, including the $2,000 tier.
Of course they cost between a few dollars to a few hundred dollars per query so until hardware gets better they will stay happily behind corporate moats and be used by the people blessed to burn money like paper.
This doesn't match the sentiment on hackernews and elsewhere that claude code is the superior agentic coding tool, as it's developed by one of the AI labs, instead of a developer tool company.
You don't see better ones from code tooling companies because the economics don't work out. No one is going to pay $1,000 for a two line change on a 500,000 line code base after waiting four hours.
LLMs today are the equivalent of a 4bit ALU without memory being sold as a fully functional personal computer. And like ALUs today, you will need _thousands_ of LLMs to get anything useful done; also, like ALUs in 1950, we're a long way off from a personal computer being possible.
Doesn't specifically seem to jibe with the claim Anthropic made where they were worried about Claude Code being their secret sauce, leaving them unsure whether to publicly release it. (I know some are skeptical about that claim.)
I asked Claude Code to read a variable from a .env file.
It proceeded to write a .env parser from scratch.
I then asked it to just use Node's built in .env file parsing....
This was the 2nd time in the same session that it wrote a .env file parser from scratch. :/
Claude Code is amazing, but it'll go off and do stupid things even for simple requests.
If you ignore that I had to pay for its initial failure...
For me it built a full-ass YAML parser when it couldn't use Viper to parse the configuration correctly :)
It was a fully vibe-coded project (I like playing stupid and seeing what the LLM does), but it got caught when the config got a bit more complex and its shitty regex-yaml-parser didn't work anymore. :)
One option is to write "Please implement this change in small steps?" more-or-less exactly
Another option is to figure out the steps and then ask it "Please figure this out in small steps. The first step is to add code to the parser so that it handles the first new XML element I'm interested in, please do this by making the change X, we'll get to Y and Z later"
I'm sure there are other options, too.
I give an outline of what I want to do, and give some breadcrumbs for any relevant existing files that are related in some way, ask it to figure out context for my change and to write up a summary of the full scope of the change we're making, including an index of file paths to all relevant files with a very concise blurb about what each file does/contains, and then also to produce a step-by-step plan at the end. I generally always have to tell it to NOT think about this like a traditional engineering team plan, this is a senior engineer and LLM code agent working together, think only about technical architecture, otherwise you get "phase 1 (1-2 weeks), phase 2 (2-4 weeks), step a (4-8 hours)" sort of nonsense timelines in your plan. Then I review the steps myself to make sure they are coherent and make sense, and I poke and prod the LLM to fix anything that seems weird, either fixing context or directions or whatever. Then I feed the entire document to another clean context window (or two or three) and ask it to "evaluate this plan for cohesiveness and coherency, tell me if it's ready for engineering or if there's anything underspecified or unclear" and iterate on that like 1-3 times until I run a fresh context window and it says "This plan looks great, it's well crafted, organized, etc...." and doesn't give feedback. Then I go to a fresh context window and tell it "Review the document @MY_PLAN.md thoroughly and begin implementation of step 1, stop after step 1 before doing step 2" and I start working through the steps with it.
As an engineer, especially as you get more experience, you can kind of visualize the plan for a change very quickly and flesh out the next step while implementing the current step
All you have really accomplished with the kind of process described is to make the world's least precise, most verbose programming language
I can say the right precise wording in my prompt to guide it to a good plan very quickly. As the other commenter mentioned, the entire above process only takes something like 30-120 minutes depending on scope, and then I can generate code in a few minutes that would take 2-6 weeks to write myself, working 8-hour days. Then it takes something like 0.5-1.5 days to work out all the bugs, clean up the weird AI quirks, and maybe have the LLM write some Playwright tests (or whatever testing framework you use for integration tests) to verify its own work.
So yes, it takes significant time to plan things well for good results, and yes, the results are often sloppy in some parts and have weird quirks that no human engineer would produce on purpose. But if you stick to working on prompt/context engineering and getting better and faster at the above process, the key unlock is not that it does the same coding for you. It's that you can work as a solo developer at the abstraction level of a small startup company. I can design and implement an enterprise-grade SSO auth system over a weekend that integrates with Okta and passes security testing. I can take a library written in one language and fully re-implement it in another language in a matter of hours. I recently took the native libraries for Android and iOS for a fairly large, non-trivial SDK, and had Claude build me a React Native wrapper library with native modules that integrates both native libraries and presents a clean, unified interface and TypeScript types to the React Native layer. This took me about two days, plus one more for validation testing. I have never done this before. I have no idea how "Nitro Modules" works, or how to configure a React Native library from scratch. But given the immense scaffolding abilities of LLMs, plus my debugging/hacking skills, I can get to a really confident place really quickly and ship production code at work with this process, regularly.
It takes 30-40 minutes to generate a plan and it generates code that would have taken 20-30 minutes to write.
When it’s generating “weeks” worth of code, it inevitably goes off the rails and the crap you get goes in the garbage.
This isn’t to say agents don’t have their uses, but I have not seen this specific approach actually work. They’re great for refactoring (usually), for crapping out proofs of concept, and for debugging specific problems. They’re also great for exploring a new code base where you have little prior knowledge.
It makes sense that it sucks at generating large amounts of code that fits cohesively into the project. The context is too small. My code base is millions of lines of code. My brain has a shitload more of that in context than any of the models. So they have to guess and check, and they end up with incorrect and poor results, and I don’t. I know which abstractions exist that I can use. It doesn’t. Sometimes it guesses right. Oftentimes it doesn’t. And once it’s wrong, it’s fucked for the entire rest of the session, so you just have to start over
Take this for example: https://www.reddit.com/r/ClaudeAI/comments/1m7zlot/how_planm...
This trick is just the basic stuff, but it works really well. You can add on and customize from there. I have a “/task” slash command that will run a full development cycle with agents generating code, many more (12-20) agent critics analyzing the unstaged work, all orchestrated by a planning agent that breaks the complex task into small atomic steps.
The first stage of this project (generating the plan) is interactive. It can then go off and make 10kLOC code spread over a dozen commits and the quality is good enough to ship, most of the time. If it goes off the rails, keep the plan document but nuke the commits and restart. On the Claude MAX plan this costs nothing.
This is how I do all my development now. I spend my time diagnosing agent failures and fixing my workflows, not guiding the agent anymore (other than the initial plan document).
I still review every line of code before pushing changes.
So I'll say something like "evaluate the URL fetcher library for best practices, security, performance, and test coverage. Write this up in a markdown file. Add a design for single flighting and retry policy. Break this down into steps so simple even the dumbest LLM won't get confused."
Then I clear the context window and spawn workers to do the implementation.
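For anyone unfamiliar with the two things that prompt asks for, here is a minimal Python sketch of single flighting (concurrent requests for the same key share one real call) and a retry policy with exponential backoff. This is only an illustration and not the commenter's library: `SingleFlight`, `with_retry`, and their parameters are made-up names, and the retry-on-any-exception rule is an assumption.

```python
import threading
import time
from concurrent.futures import Future


class SingleFlight:
    """Collapse concurrent calls for the same key into one real call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> Future shared by all waiting callers

    def do(self, key, fn):
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._inflight[key] = fut
        if not leader:
            # Someone else is already fetching this key: wait for their outcome.
            return fut.result()
        try:
            result = fn()
        except BaseException as exc:
            fut.set_exception(exc)  # followers see the same failure
            raise
        else:
            fut.set_result(result)
            return result
        finally:
            with self._lock:
                del self._inflight[key]


def with_retry(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

The two compose as `flight.do(url, lambda: with_retry(lambda: fetch(url)))`, where `fetch` stands in for whatever function actually performs the request.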
Right now it's not easy prompting Claude Code (for example) to keep fixing until a test suite passes. It always does some fixed amount of work until it feels it's most of the way there and stops. So I have to babysit it and keep telling it that yes, I really mean for it to make the tests pass.
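One way around the babysitting is to put the loop outside the model. A rough Python sketch, assuming you have some scriptable way to hand failure output back to the agent: `ask_agent_to_fix` is a hypothetical callback, not a real Claude Code API, and `fix_until_green` is a made-up name.

```python
import subprocess
import sys


def fix_until_green(test_cmd, ask_agent_to_fix, max_rounds=5):
    """Re-run test_cmd, handing failures to the agent, until it exits 0.

    Returns the number of fix rounds that were needed, or raises
    RuntimeError if the tests still fail after max_rounds rounds.
    """
    for round_no in range(max_rounds + 1):
        proc = subprocess.run(test_cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return round_no
        if round_no == max_rounds:
            break
        # Hypothetical hook: pass the failure output back to the agent.
        ask_agent_to_fix(proc.stdout + proc.stderr)
    raise RuntimeError(f"tests still failing after {max_rounds} fix rounds")
```

The exit code of the test command decides when to stop, not the model's own sense of being "most of the way there".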
Tried this on a developer I worked with once and he just scoffed at me and pushed to prod on a Friday.
that's the --yolo flag in cc :D
Most users will just give vague tasks like "write a clone of Steam" or "create a rocket", and then they blame Claude Code.
If you want AI to code for you, you have to decompose your problem like a product owner would. You can get help from AI with that as well, but you should have a plan and specifications.
Once your plan is ready, you have to decompose the problem into different modules, then make sure each module is tested.
The issue is often with the user, not the tool, as they have to learn how to use the tool first.
This seems like half of HN with how much HN hates AI. Those who hate it or say it’s not useful to them seem to be fighting against it and not wanting to learn how to use it. I still haven’t seen good examples of it not working even with obscure languages or proprietary stuff.
The main difference is that with the current batch of genai tools, the AI's context resets after use, whereas a (good) intern truly learns from prior behavior.
Additionally, as you point out, the language and frameworks need to be part of the training set, since AI isn't really "learning"; it's just prepopulating a context window for its pre-existing knowledge (token prediction). So ymmv depending on hidden variables from the secret (to you, the consumer) training data and weights. I use Ruby primarily these days, which is solidly in the "boring tech" camp, and most AIs fail to produce useful output that isn't Rails boilerplate.
If I did all my IC contributions via directed intern commits I'd leave the industry out of frustration. Using only AI outputs for producing code changes would be akin to torture (personally.)
Edit: To clarify, I'm not against AI use. I'm just stating that with the current generation of tools it is a pretty lackluster experience when it comes to net new code generation. It excels at one-off throwaway scripts and at making large, tedious refactors less of a drudgery. I wouldn't pivot to it being my primary method of code generation until some of the more blatant productivity losses are addressed.
Now, it's not always useless. It's GREAT at adding debugging output and knowing which variables I just added and thus want to add to the debugging output. And that does save me time.
And it does surprise me sometimes with how well it picks up on my thinking and makes a good suggestion.
But I can honestly only accept maybe 15-20% of the suggestions it makes - the rest are often totally different from what I'm working on / trying to do.
And it's C++. But we have a very custom library to do user-space context switching, and everything is built on that.
I kind of feel this. I’ll code for days and forget to eat or shower. I love it. Using Claude code is oddly unsatisfying to me. Probably a different skillset, one that doesn’t hit my obsessive tendencies for whatever reason.
I could see being obsessed with some future flavor of it, and I think it would be some change with the interface, something more visual (gamified?). Not low-code per se, but some kind of mashup of current functionality with graph database visualization (not just node force graphs, something more functional but more ergonomic). I haven’t seen anything that does this well, yet.
I’ve seen incredible improvements just by doing this and using precise prompting to get Claude to implement full services by itself, tests included. Of course it requires manual correction later but just telling Claude to check the development documentation before starting work on a feature prevents most hallucinations (that and telling it to use the Context7 MCP for external documentation), at least in my experience.
The downside to this is that 30% of your context window will be filled with documentation but hey, at least it won’t hallucinate API methods or completely forget that it shouldn’t reimplement something.
Just my 2 cents.
I want the code to have subsequently been deployed in production and demonstrably robust, without additional work outside of the livestream.
The livestream should include code review, test creation, testing, PR creation.
It should not be on a greenfield project, because nearly all coding is not.
I want to use Claude and I want to be more productive, but my experience to date is that for writing code beyond autocomplete AI is not good enough and leads to low quality code that can’t be maintained, or else requires so much hand holding that it is actually less efficient than a good programmer.
There are lots of incentives for marketing at the grassroots level. I am totally open to changing my mind but I need evidence.
Mind you, I've never written a non-trivial game before in my life. It would take me weeks to do this on my own without any AI assistance.
Right now I'm working on a 3d world map editor for Final Fantasy VII that was also almost exclusively vibe-coded. It's almost finished and I plan a write up and a video about it when I'm done.
Now of course you've made so many qualifiers in your post that you'll probably dismiss this as "not production", "not robust enough", "not clean" etc. But this doesn't matter to me. What matters is I manage to finish projects that I would not otherwise if not for the AI coding tools, so having them is a huge win for me.
I think the problem is in your definition of finishing a project.
Can you support said code, can you extend it, are you able to figure out where bugs are when they show up? In a professional setting, the answer to all of those should likely be yes. That's what production code is.
The difference isn't in what finishing a project means; it's the dissonance between what M4v3R and rhubarbtree understand by "nontrivial production" software.
When you're working in enterprise, you usually have multiple stakeholders, each defining sometimes even conflicting requirements for the behavior of your software. And you're required to adhere to these requirements stringently.
That's an environment that's inherently a bad fit for vibe coding.
It can still be used there, too, but you will not get a 2-3x speed up, because the LLM will always introduce minor behavioral changes - which aren't important in M4v3R's scenario, but are a complete deal breaker for rhubarbtree.
From my own experience, I don't get a speed up at all via Copilot agentic mode (Claude Code is banned at my workplace). But I have had a significant boost in productivity in projects that don't need to adhere to any specific spec - namely projects I do on my own time (with Claude Code right now).
I still use Copilot agentic mode though. While I haven't timed myself, I don't think I'm faster with it whatsoever. It's just less mentally involved in a lot of scenarios, so it's less exhausting - leaving more energy for side projects.
In a few thousand lines of code you can get away with a massive amount of code bloat, quick hacks and inconsistent APIs. In a program that's anything more than a few thousand lines, you can't. It just becomes too confusing. You have to be deliberate. Code has to follow patterns so the cognitive load is lowered. Stuff has to be split up in a predictable manner.
And there's another problem: sensible and predictable maintenance. Changes and fixes have to be targeted and specific. They have to be written to avoid side effects.
For organisation, it's been a huge effort on everyone's part these last 30 years to achieve that. Make code understandable, by organising it better. From one direction, languages have improved, with authors reducing boilerplate + cross-pollination of ideas between languages like anonymous methods. On the other, it's developers inventing + describing patterns or KISS or the single responsibility principle. Or even seemingly trivial things like choosing predictable folder structures and enforcing indentation rules[1]. I'm starting to feel that's often the main skill a senior dev brings to the table, organising code well.
Better code organization has made it possible for developers to make larger programs. Code organisation is a need that becomes a big problem if you're not doing it well in large projects, but not really a problem if you're not doing it well in small projects.
And right now, AI isn't very good at code organisation. We might believe that you have to have a mental model of the whole program in your head, something an LLM is just not capable of right now. And I don't know if that's going to turn out to be a solvable problem as it seems like a huge context problem.
For maintenance, I'm not sure. AI seems pretty terrible at it. It often rewrites everything and throws the baby out with the bathwater. Again, it's a context problem.
Both could turn out to be easy to solve for this generation of AI, in the end.
[1] Younger programmers will not believe that even 15/20 years ago it was still a common problem that developers did not bother to indent their code consistently. In my first two jobs I'd regularly hit inconsistently indented code.
I think the code organization isn't amazing, but it's fine and frankly not that much of a concern to me usually as I'm usually just reading diffs and not digging around in the code much myself.
I haven't tried it yet, but I thought Elixir's easily implementable static analysis could be highly useful for enforcement whenever the LLM goes off the rails, and an umbrella architecture would keep modularity well established.
Modules could all define their own contexts via nested CLAUDE.md and subagents could be used to give it explicit implementation details.
Did you try something like that before MGriisser? (successfully or not?)
I mostly use Claude in that repo for controllers, DB access, and front end via heex templates, often with LiveView. I find it can get a bit mixed up with heex stuff occasionally given the weirdness of nested code into the HTML and all that but I think on pure Elixir it usually does a good job.
Sure, my interest is whether it’s suitable for production use on an existing codebase, ie for what constitutes most of software engineering.
But - thanks for sharing, I will take a look and watch some of the stream.
I think he had a positive experience overall, but it was clear throughout the stream that he was not yielding control to a pure-agent workflow soon.
And your starry-eyed CEO is asking the same old question: How come everything takes so long when a 2-person team over two days was able to produce a shiny new thing?! Sigh.
Could be used for early prototyping, though, before you hire your first engineers just to fire them 6 months later.
And I highly doubt you spend months, as in 5+ weeks at the least making it production ready.
What even is "production readiness?" 100% fully unit tested and ready for planetary hyper scale or something? 95% of the human generated software I work on is awful but somehow makes people money.
Claude Code for example is also not that quick at all. It produces some code quickly, but even scaffolding three hello world level example projects together definitely takes more than an hour. And that’s with zero novelty. The first version of code is done quickly, but the continuous loop of self corrections after that takes a long time. Even with Serena, Context7, and other MCPs.
And, of course, without real code review. That’s easily hours even with just a few thousand lines of code, if it uses something you don’t know. But I know that almost everybody has given up on understanding “their” “own” code during vibe coding. Even before AI, it was well known that real code review is hard and that people rarely did it.
AI can make you quicker in certain situations, but these “15 minutes” claims are totally baseless. This is one reason why many people are against AI, vibe coding, etc.: these stupid claims cannot withstand even the slightest scrutiny.
I suspect videos meeting your criteria are rare because most AI coding demos either cherry-pick simple problems or skip the messy reality of maintaining real codebases.
First off, Rust represents quite a small part of the training dataset (last I checked it was under 1% of the code in most public sets), so it's got waaay less training than other languages like TS or Java. You added 2 solid features, backed with tests and documentation and nice commit messages. 80% of devs would not deliver this in 2.5 hours.
Second, there was a lot of time/token waste messing around with git and git messages. Few tips I noticed that could help you in the workflow:
#1: Add a subagent for git that knows your style, so you don't poison direct claude context and spend less tokens/time fighting it.
#2: Claude has hooks. If your favorite language has a formatter like rustfmt, just use hooks to run rustfmt and similar.
#3: Limit what they test, as most LLM models tend to write overeager tests, including testing if "the field you set as null is null", wasting tokens.
#5: Saying "max 50 characters title" doesn't really mean anything to the LLM. They have no inherent ability to count, so you are relying on probability, which is quite low since your context is quite filled at this point. If they want to count the line length, they also have to use external tools. This is an inherent LLM design issue and discussing it with an LLM doesn't get you anywhere really.
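As an example of the "external tools" route, a deterministic check can enforce the title limit instead of relying on the model to count. A minimal Python sketch; `check_commit_title` is a made-up helper, and the blank-second-line rule is an added assumption based on common git convention rather than anything from the stream:

```python
def check_commit_title(message, max_len=50):
    """Return a list of problems with the commit message's first line."""
    lines = message.splitlines()
    title = lines[0] if lines else ""
    problems = []
    if not title.strip():
        problems.append("title is empty")
    if len(title) > max_len:
        problems.append(f"title is {len(title)} characters, limit is {max_len}")
    # Assumed convention: a blank line separates the title from the body.
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")
    return problems
```

Something like this could run from git's `commit-msg` hook, which receives the path of the message file as its argument, so an over-long title gets rejected with a concrete character count the agent can act on.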
Heh, I write this for some production code too (python). I guess because python is not typed, I'm testing if my pydantic implementation works.
I've not heard of this before. What does this mean practically? Some kind of invocation in claude? Opening another claude window?
So the main claude can tell the test-runner agent "Run tests using `task test` and return the results"
Then the test-runner agent runs the tests, "wastes" its context by reading 500 lines of test results, sees that it's ok, and returns "tests ok" to the main context.
This way the main context is spared from the useless chatter and can go on for longer.
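The same idea can be shown without any agent machinery: run the noisy command somewhere else and pass back only a short verdict. A minimal Python sketch of that summarizing step; `run_tests_summarized` and `tail_lines` are hypothetical names, and this is not how Claude Code implements subagents:

```python
import subprocess


def run_tests_summarized(cmd, tail_lines=20):
    """Run the test command and return a short summary, not the full log."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        # Hundreds of lines of passing output collapse into two words.
        return "tests ok"
    output = (proc.stdout + proc.stderr).splitlines()
    shown = output[-tail_lines:]
    header = f"tests failed (exit {proc.returncode}); last {len(shown)} lines:"
    return "\n".join([header] + shown)
```

With the command from the example above, that would be `run_tests_summarized(["task", "test"])`.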
Then claude can delegate the work to them when appropriate, or you can tell it directly to use the subagent, i.e. a subagent for your frontend, backend, specific microservice, database, etc etc.
Quite depends on your workflow which ones you create/need, but they are a really nice quality of life change.
Or we’re just having too much fun making stuff to make videos to convince people that are never going to be convinced.
We agree most problems stem from:
1. Getting lazy and auto-accepting edits. Always review changes and make sure you understand everything.
2. Not writing clear specification documents before starting complex work items.
3. Not breaking down tasks into manageable chunks of scope.
4. Lacking a clean, digestible code architecture. If it's hard for a human to understand (e.g. poor separation of concerns), it will be hard for the LLM too.
But yeah I would never waste my time making that video. Having too much fun turning ideas into products to care about proving a point.
This is a strange response to me. Perhaps you and others aren’t aware that there’s a subculture of folks who livestream coding in general? Nothing to do with proving a point.
My interest in finding such examples is exactly due to the posting of comments like yours - strong claims of AI success - that don’t reflect my experience. I want to see videos that show what I’m doing wrong, and why that gives very different results.
I don’t have an agenda or point to prove, I just want to understand. That is the hacker way!
I'm kinda hoping that this LLM craze will force people to be better at it. Having documentation up to date and easily accessible is good for everyone.
Like we're (over here) better at marking lines in the road, because the EU mandated lane keeping assist needs the road markings to be there or it won't work.
For me LLM coding is 90% going from "hey this kind of tool would be cool" to a workable MVP in an evening.
The 10% is me using it at work to debug issues or create boilerplate crap.
If you want me to show an example of vibe coding, I bet I can migrate someone's blog to Astro with Claude Code faster than a frontend engineer.
> It should not be on a greenfield project, because nearly all coding is not.
Well, Claude Code does not work the best for existing projects. (With some exceptions.)
But if you do professional development and use something like Claude Code (the current standard, IMO) you'll quickly get a handle on what it's good at and what it isn't. I think it took me about 3-4 weeks of working with it at an overall 0x gain to realize what it's going to help me with and what it will make take longer.
To summarize, the authors enlisted a panel of expert developers to review the quality of various pull requests in terms of architecture, readability, maintainability, etc. (see 8:27 in the video for a partial list of criteria), and then somehow aggregated these criteria into an overall "productivity score." They then trained a model on the judgments of the expert developers, and found that their model had a high correlation with the experts' judgment. Finally, they applied this model to PRs across thousands of codebases, with knowledge of whether the PR was AI-assisted or not.
They found a 35-40% productivity gain for easy/greenfield tasks, 10-15% for hard/greenfield tasks, 15-20% for easy/brownfield tasks, and 0-10% for hard/brownfield tasks. Most productivity gains went towards "reworked" code, i.e. refactoring of recent code.
All in all, this is a great attempt at rigorously quantifying AI impact. However, I do take one major issue with it. Let's assume that their "productivity score" does indeed capture the overall quality of a PR (big assumption). I'm not sure this measures the overall net positive/negative impact to the codebase. Just because a PR is well-written according to a panel of expert engineers doesn't mean it's valuable to the project as a whole. Plenty of well-written code is utterly superfluous (trivial object setters/getters come to mind). Conversely, code that might appear poorly written to an outsider expert engineer could be essential to the project (the highly optimized C/assembly code of ffmpeg comes to mind, or to use an extreme example, anything from Arthur Whitney). "Reworking" that code to be "better written" would be hugely detrimental, even though the judgment of an outside observer (and an AI trained on it) might conclude that said code is terrible.
AI coding should be transforming OSS, and we should be able to get a rough idea of the scale of the speed up in development. It’s an ideal application area.
I would estimate the majority of developers spend most of their time on problems encompassing all three of these, even if their software is not as meaningful/widely used as the previous examples. Everyone knows that LLMs are fantastic at generating greenfield boilerplate very quickly. They are an invaluable rapid prototyping/MVP generation tool, and that in itself is hugely useful.
But that's not where developers spend most of their time. They spend it maintaining complicated, mature codebases, and the utility of LLMs is much less proven for that use case. This utility would be most easily measured in contributions to open-source projects, since all commits are public and maintainers have no monetary incentive to misrepresent the impact of AI [0, 1, 2, ...].
[0] https://www.businessinsider.com/anthropic-ceo-ai-90-percent-...
[1] https://www.cnbc.com/2025/06/26/ai-salesforce-benioff.html
[2] https://www.cnbc.com/2025/04/29/satya-nadella-says-as-much-a...
Just reinforces my biases that LLMs are currently garbage for anything new and complicated. But they are a great interactive note taker and brainstorming tool.
I've considered live-streaming my work a few times, but all my work is on closed-source backend applications with sensitive code and data. If I ever get to work on an open-source product, I'll ask about live-streaming it. I think it would be a fun experience.
Although I cannot show the live stream or the code, I am writing and deploying production code for a brownfield project.
Two recent production features:
1. Quota crossing detection system for billable metrics
   - Complex business logic for billing infrastructure
   - Detects when usage crosses configurable thresholds across multiple metric types
   - Time: 4 days while working on other smaller tasks in parallel vs probably 10 days focused without AI
2. Sentry monitoring wrapper for metering cron jobs
   - Reusable component wrapping all cron jobs with Sentry monitoring capabilities
   - Time: 1 day in parallel with other tasks vs 2 days focused
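To give a sense of what "crossing detection" means at its core (the real system is closed-source and far more involved), here is a toy Python sketch. The function name, the strict-then-inclusive comparison, and the sample numbers are all assumptions for illustration:

```python
def crossed_thresholds(previous, current, thresholds):
    """Return the thresholds t with previous < t <= current, ascending.

    A threshold counts as crossed only on the reading that first reaches
    it, so a usage value that was already at or above t does not fire again.
    """
    return sorted(t for t in thresholds if previous < t <= current)


# Hypothetical quota of 1000 units with alerts at 50%, 80% and 100%.
alerts = crossed_thresholds(previous=700, current=1050, thresholds=[500, 800, 1000])
print(alerts)  # [800, 1000]
```

Comparing against the previous reading, not just the current one, is what keeps the same alert from firing on every subsequent check.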
As you can probably tell, my work is not glamorous :D. It's all the head-scratching backend work, extending the existing system with more capabilities or to make it more robust.
I agree there is a lot of hand-holding required, but I'm betting on the systems getting better as time goes on. We are only two years into this AI journey, and the capabilities will most likely improve over the next few years.
And because this "consensus" adopted it, you know what it's good for and what kind of problems it's good at solving and whether it's a good option for what you specifically are doing?
Using LLMs is a skill that's (currently) a bit hard to teach, it's a ball of math and vectors that doesn't work in a deterministic way. Some magic words in the prompt will try to make it do something, but not always.
You really need to use one, preferably a few different ones, and get a feel for how they operate. Like driving a car. You can watch 420 hours of videos of people driving cars, but you really need to sit in one to get comfortable doing it.
If everyone’s using it I will certainly learn it, yes.
I think it misses a feedback loop. Something that evaluates what went wrong, what works, what won't, and remembers that and then can use that to make better plans. From making sure it runs the tests correctly (instead of trying 5 different methods each time) to how to do TDD and what comments to add.
A common thread in articles about developers using AI is that they're not impressed at first but then write more precise instructions and provide context in a more intuitive manner for the AI to read and that's the point at which they start to see results.
Would these principles not apply to regular developers as well? I suspect that most of my disappointment with these tools is that I haven't spent enough time learning how to use them correctly.
With Claude Code you can tell it what it did wrong. It's a bit hit-or-miss as to whether it will take your comments on board (or take them too literally) but I do think it's too powerful a tool to just ignore.
I don't want someone to just come and eat my cake because they've figured out how to make themselves productive with it.
If I were a non-tech, non-specialist and/or had no business skills/experience and my job was mostly office admin I would be retraining however, because those jobs may be over except as vanity positions.
I personally have not watched much, but it sounds just like what you are looking for!
The quality is much better but it is much slower than a human engineer. However that’s irrelevant to me. If I can build two projects a day I am more productive than if I can build one. And more importantly I can build projects that increase my velocity and capability.
The difference is I run my own business so that matters to me more than my value or aptitude as an engineer.
I think not.
The reason is about missing context. Such non-trivial problems have a lot of specific unwritten context. It takes a lot of effort to share that context. Often more than doing anything one self.
We are already only talking about the subset that writes AI blog posts, not about all of humanity.
Which isn’t entirely unreasonable; AI is not really there yet. If you took this moment and said AI will never get better, and tools and processes will never improve to better accommodate AI, and the only fair comparison is a top-tier developer, and the only legitimate scenario is high quality human-maintainable code at scale… then yes, AI coding is a lot of hype with little value.
But that’s not what’s going on, is it? The trajectory here is breathtaking. A year ago you could have set a much lower bar and AI still would have failed. And the tooling to automate PRs and documentation was rough.
AI is already providing massive leverage to both amateur and professional developers. They use the tools differently (in my world the serious developers mostly use it for boilerplate and tests).
I don’t think you’ll be convinced of the value until the revolution is in the past. Which is fine! For many of us (me being in the amateur but lifelong programmer camp) it’s already delivering value that makes its imperfections worthwhile.
Is the code I’m generating world class, ready to be handed over to humans at enterprise scale? No, definitely not. But it exists, and the scale of my amateur projects has gone through the roof, while quality is also up because tests take near zero effort.
I know it won’t convince you, and you have every right to be skeptical and dismiss the whole thing as marketing. But IMO rejecting this new tech in the short term means you’re in for a pretty rough time when the evidence is so insurmountable. Which might be a year or two. Or even three!
I've been building commercial codebases with Claude for the last few months and almost all of my input is on taste and what defines success. The code itself is basically disposable.
I'm finding this is the case for my work as well. The spec is the secret sauce, the code (and its many drafts) are disposable. Eventually I land on something serviceable, but until I do, I will easily drop a draft and start on a new one with a spec that is a little more refined.
This is key. We’re in the mass-production-of-software era. It’s easier and cheaper to replace a broken thing/part than to fix it, the things being units of code.
Yes it knows a lot and can regurgitate things and create plausible code (if I have it run builds and fix errors every time it changes a file - which of course eats tokens) but having absolutely no understanding of how time or space works leads to 90% of its great ideas being nonsensical for UI tasks. Everything is needing very careful guidance and supervision otherwise it decides to do something different instead. For back end stuff, maybe it's better.
I'm on the fence regarding overall utility but $20/month could almost be worth it for a tool that can add a ton of debug logging in seconds, some months.
I find it difficult to include examples because a lot of my work is boring backend work on existing closed-source applications. It's hard to share, but I'll give it a go with a few examples :)
----
First example: Our quota detection system (shipped last month) handles configurable threshold detection across billing metrics. The business logic is non-trivial, distinguishing counter vs gauge metrics, handling multiple consumers, and efficient SQL queries across time windows.
Claude's evolution:
- First pass: Completely wrong approach (DB triggers)
- Second pass: Right direction, wrong abstraction
- Third pass: Working implementation, we could iterate on
----
Second example: Sentry monitoring wrapper for cron jobs, a reusable component to help us observe our cronjob usage
Claude's evolution:
- First pass: Hard-coded the integration into each cron job, a maintainability nightmare.
- Second pass: Using a wrapper, but the config is all wrong
- Third pass: Again, OK implementation, we can iterate on it
----
The "80%" isn't about line count; it's about Claude handling the exploration space while I focus on architectural decisions. I still own every line that ships, but I'm reviewing and directing rather than typing.
This isn't writing boilerplate, it's core billing infrastructure. The difference is that Claude is treated like a very fast junior who needs clear boundaries rather than expecting senior-level architecture decisions.
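To make the counter vs gauge distinction concrete, here is a minimal Python sketch of threshold-crossing detection. It is not the production system described above: the `Sample` type, the function names, and the reading of "counter" as per-sample increments that accumulate over a window are all assumptions made for illustration, and the real thing reportedly runs as SQL over time windows.

```python
from dataclasses import dataclass
from typing import Iterable, List, Literal, Tuple

MetricKind = Literal["counter", "gauge"]


@dataclass(frozen=True)
class Sample:
    """One stored data point (hypothetical shape, not the real schema)."""
    timestamp: int  # e.g. seconds since epoch
    value: float    # counter: increment since last sample; gauge: current level


def usage_series(samples: Iterable[Sample], kind: MetricKind) -> List[Tuple[int, float]]:
    """Turn raw samples into (timestamp, usage) pairs.

    Counters are summed cumulatively; gauges are taken at face value.
    """
    series: List[Tuple[int, float]] = []
    running = 0.0
    for s in sorted(samples, key=lambda s: s.timestamp):
        if kind == "counter":
            running += s.value
            series.append((s.timestamp, running))
        else:
            series.append((s.timestamp, s.value))
    return series


def crossings(samples: Iterable[Sample], kind: MetricKind, threshold: float) -> List[int]:
    """Timestamps where usage goes from below the threshold to at or above it."""
    hits: List[int] = []
    previous = 0.0  # assume usage starts at zero before the first sample
    for ts, usage in usage_series(samples, kind):
        if previous < threshold <= usage:
            hits.append(ts)
        previous = usage
    return hits


# The same raw values give different answers depending on the metric kind.
counter_samples = [Sample(1, 40), Sample(2, 40), Sample(3, 40)]
gauge_samples = [Sample(1, 40), Sample(2, 120), Sample(3, 90), Sample(4, 130)]
counter_hits = crossings(counter_samples, "counter", threshold=100)
gauge_hits = crossings(gauge_samples, "gauge", threshold=100)
```

A counter can only cross a threshold once per window under this reading, while a gauge can dip and cross again, which is the kind of detail that is easy to get wrong without being told how values are stored.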
Things that make you go "Hmmmmmm."
It’s a very different discussion when you’re building a product to sell.
We'll just keep getting submission after submission talking about how amazing Claude Code is with zero real world examples.
Two recent production features:
1. *Quota crossing detection system*
- Complex business logic for billing infrastructure
- Detects when usage crosses configurable thresholds across multiple metric types
- Time: 4 days parallel work vs ~10 days focused without AI
The 3-attempt pattern was clear here:
- Attempt 1: DB trigger approach - wouldn't scale for our requirements
- Attempt 2: SQL detection but wrong interfaces, misunderstood counter vs gauge metrics
- Attempt 3: Correct abstraction after explaining how values are stored and consumed
2. *Sentry monitoring wrapper for cron jobs*
- Reusable component wrapping all cron jobs with monitoring
- Time: 1 day parallel vs 2 days focused
Nothing glamorous, but they are real-world examples of changes I've deployed to production quicker because of Claude.
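For readers wondering what "a reusable component wrapping all cron jobs" might look like, here is a rough Python sketch of the wrapper idea (as opposed to hard-coding monitoring into each job). It deliberately does not call the Sentry SDK: `monitored`, the `report` callback, and the event fields are hypothetical stand-ins, and a real version would forward these events to whatever monitoring backend is in use.

```python
import functools
import time
from typing import Callable, Dict, List

# Hypothetical reporter signature: receives one event dict per state change.
Reporter = Callable[[Dict[str, object]], None]


def monitored(job_name: str, report: Reporter):
    """Decorator that reports start, success, and failure of a cron job."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.monotonic()
            report({"job": job_name, "status": "in_progress"})
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                report({
                    "job": job_name,
                    "status": "error",
                    "error": repr(exc),
                    "duration": time.monotonic() - started,
                })
                raise  # never swallow the failure; the scheduler should still see it
            report({
                "job": job_name,
                "status": "ok",
                "duration": time.monotonic() - started,
            })
            return result
        return wrapper
    return decorator


# Usage: collect events in a list instead of sending them anywhere.
events: List[Dict[str, object]] = []


@monitored("nightly-cleanup", events.append)
def nightly_cleanup() -> int:
    return 3  # pretend three rows were cleaned up


@monitored("broken-job", events.append)
def broken_job() -> None:
    raise RuntimeError("boom")


cleaned = nightly_cleanup()
try:
    broken_job()
except RuntimeError:
    pass
statuses = [(e["job"], e["status"]) for e in events]
```

The point of the wrapper shape is that each job stays a plain function, and any change to how monitoring works happens in one place.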
it's funny because as I have gotten better as a dev I've gone backwards through his progression. when I was less experienced I relied on Google; now, just read the docs
https://www-cdn.anthropic.com/58284b19e702b49db9302d5b6f135a...
Abstracting the boilerplate is how you make things easier for future you.
Giving it to an AI to generate just makes the boilerplate more of a problem when there's a change that needs to be made to _all_ the instances of it. Even worse if the boilerplate isn't consistent between copies in the codebase.
I'm lazy af. I have not been manually typing up boilerplate for the past 15 years. I use computers to do repetitive tasks. LLMs are good at some of them, but it's just another tool in the box for me. For some it seems like their first and only one.
What I can't understand is how people are ok with all that typing that you still have to do just going into /dev/null while only some translation of what you wrote ends up in the codebase. That one makes me even less likely to want to type. At least if I'm writing source code I know it's going into the repository directly.
Not to mention - while I know many don't like it, they may be able to achieve enough of a productivity boost to not require hiring as many of those crazy salaried devs.
It's literally a no-brainer. Thinking about it from just the individual cost factor is too simplified a view.
If the average US salaried developer is 10-15% more productive for just 1k more a month it is literally a no-brainer for companies to invest in that.
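The arithmetic behind that claim can be sketched in a few lines of Python. The $1k/month tool cost and the 10-15% gain come from the comment; the $15,000/month fully loaded developer cost is purely a placeholder assumption, so plug in your own number.

```python
def break_even_gain(tool_cost: float, loaded_cost: float) -> float:
    """Productivity gain (as a fraction) at which the tool pays for itself."""
    return tool_cost / loaded_cost


def net_monthly_value(gain: float, tool_cost: float, loaded_cost: float) -> float:
    """Value of the extra output minus what the tool costs, per month."""
    return gain * loaded_cost - tool_cost


TOOL_COST = 1_000     # USD per month, figure from the comment
LOADED_COST = 15_000  # USD per month, hypothetical fully loaded cost

threshold = break_even_gain(TOOL_COST, LOADED_COST)
print(f"break-even productivity gain: {threshold:.1%}")
for gain in (0.10, 0.15):
    value = net_monthly_value(gain, TOOL_COST, LOADED_COST)
    print(f"gain {gain:.0%}: net {value:,.0f} USD per month")
```

This only holds if the productivity gain is real and measured, and it treats a percentage of output as directly convertible into money, which is a simplification.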
Of course on the other side of the coin there are many companies that are very stingy with paying for literally anything for their employees that could measurably improve productivity, and hamper their ability to be productive by intentionally paying for cheap shitty tools. They will just lose out.
Having said the above some level of AI spending is the new reality. Your workplace pays for internet right? Probably a really expensive fast corporate grade connection? Well they now also need to pay for an AI subscription. That's just the current reality.
Aider felt similar when I tried it in architect mode; my prompt would be very short and then I'd chew through thousands of tokens while it planned and thought and found relevant code snippets and etc.
What happens if you don't pay $1k/mo for Claude? Do you get an appreciable drop in productivity and output?
Genuinely asking.
Here's what works for me:
- Detailed claude.md containing overall information about the project.
- Anytime Claude chooses a different route that's not my preferred route - ask my preference to be saved in global memory.
- Detailed planning documentation for each feature - Describe high-level functionality.
- As I develop the feature, add documentation with database schema, sample records, sample JSON responses, API endpoints used, test scripts.
- MCP, MCP, MCP! Playwright is a game changer
The more context you give upfront, the less back-and-forth you need. It's been absolutely transformative for my productivity.
Thank you Claude Code team!
We write C++ code in a very customized internal idiom to drive our hardware. Claude is great at filling in debugging statements / iterating over standard data structures to dump their contents, but not much else.
Claude code is amazing at producing code for this stack. It does an excellent job at outputting ffmpeg, curl commands, linux shell scripts etc.
I have written a detailed project plan and feature plan in Markdown - and Claude has no trouble understanding the instructions.
I am curious - what is your usecase?
Interestingly, this guy has been making pretty much the same app as you, and live-streamed making it on youtube:
https://www.youtube.com/@RayFernando1337
Looks like he's now pivoted to selling access to his discord server for vibe coding tips as I can't find a link to his product.
But if we're honest here, it's not going to take a ton of code to make that. All the functionality to do it is well documented.
Many people here could make a competitor in a week, without agentic AI, just using AI as a super-charged SO. The limiter pre-AI (aside from AI transcribing it) would have been reading and implementing/debugging all the documentation of the libraries you're using, which AI is great at circumventing.
Your product looks really good, and is an excellent example of what vibe coded AI is great at. I hope you're getting good traction.
Really simple workflow!
EDIT: I see, you're asking Claude to modify claude.md to track your preference there, right?
Ask Claude to update the preference and document the moment you realize that claude has deviated away from the path.
First I know my problem space better than the LLM.
Second, the best way to express coding intention is with code. The models often have excellent suggestions on improvements I wouldn’t have thought of. I suspect the probability of providing a good answer has been increased significantly by narrowing the scope.
Another technique is to say “do this like <some good project> does it” but I suspect that might be close to copyright theft.
“The future of agentic coding with Claude Code”
Is this another case of someone using API keys and not knowing about the claude MAX plans? It's $100 or $200 a month; if you're not pure yolo brute-force vibe coding, the $100 plan works.
For context: that's 1-2% of a senior engineer's fully loaded cost. The ROI is clear if it delivers even 10% productivity gain (we're seeing 2-3x on specific tasks).
You're right that many devs can start with MAX plans. The higher tier becomes necessary when running multiple parallel contexts and doing systematic exploration (the "3-attempt pattern" burns tokens fast).
I wouldn't be doing it if I didn't think it was value for money. I've always been a cost-conscious engineer who weighs cost/value, and with Claude, I am seeing the return.
What if what feels like a productivity gain is actually a productivity loss?
https://mikelovesrobots.substack.com/p/wheres-the-shovelware...
(see link in the article to a study showing developers thought AI gave them a 20% gain in productivity, but measuring this showed they instead had a 20% loss)
I refine that spec and then give that to planning mode and then go from there.
I’ve found if I jump straight into planning mode I miss some critical aspects of whatever it is I am building.
Claude code can access pretty much all those third party services in the shell, using curl or gh and so on. And in at least one case using MCP can cause trouble: the linear MCP server truncates long issues, in my experience, whereas curling the API does not.
What am I missing?
I just haven't heard others express the same over-engineering problem and wonder if this is a general observation or only shows up b/c my requests are quite simple.
(I have found that prompting it for the simplest or most efficient solution seems to help - sometimes taking 20+ lines down to 2-3, often more understandable.)
P.S. I tend to work with data and a web app for processes related to a small business, while not a formally trained developer.
For me, stuff like that is the same weird uncanny valley that you used to see in AI text, and see now in AI video. It just does such inhuman things. A senior developer would NEVER think to manually mutate the cache, because it's such a desperate hack. A junior dev wouldn't even realize it's an option.
I agree it's wasteful, but from a long-form view of what spending looks like (or at least should/used to look like). Those who see 1.5k/month as "saving" money typically only care about next quarter.
As the old adage goes: a thousand dollars saved this month is 100 thousand spent next year.
Also, there may be selfish reasons to do this as well: (1) "Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance" https://arxiv.org/abs/2402.14531 (2) "Three Things to Know About Prompting LLMs" https://sloanreview.mit.edu/article/three-things-to-know-abo...
1. I don't see the output of the compiler, as in, all I get is an executable blob. It could be inspected, but I don't think that I ever have in my 20+ year career. Maybe I lie and I've rocked up with a Hex editor once or twice, out of pure curiosity, but I've never got past looking for strings that I recognise.
2. When I use Claude, I am using it to do things that I can do, by hand, myself. I am reviewing the code as I go along, and I know what I want it to do because it's what I would be writing myself if I didn't have Claude (or Gemini for that matter).
So, no, I have never congratulated the compiler (or interpreter, linker, assembler, or even the CPU).
Finally, I view the AI as a pairing partner, sometimes it's better than me, sometimes it's not, and I have to be "in the game" in order to make sure I don't end up with a vibe coded mess.
edit: This is from yesterday (Claude had just fixed a bug for me - all I did was paste the block of code that the bug was in, and say "x behaviour but getting y behaviour instead")
perfect, thanks
You're welcome! That was a tricky bug - using rowCount instead of colCount in the index calculation is the kind of subtle error that can be really hard to spot. It's especially sneaky because row 0 worked correctly by accident, making it seem like the logic was mostly right. Glad we got it sorted out! Your Gaps redeal should now work properly with all the 2s (and other correctly placed cards) staying in their proper positions across all rows.
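The actual code isn't shown, but the class of bug is easy to reproduce. Below is a small Python sketch assuming the cards live in a flat list in row-major order; the function names and the 4 x 13 grid size are assumptions for illustration, not taken from the original program.

```python
ROWS, COLS = 4, 13  # assumed grid size for the illustration


def index_buggy(row: int, col: int, row_count: int = ROWS, col_count: int = COLS) -> int:
    """Wrong: multiplies the row by the number of rows."""
    return row * row_count + col


def index_fixed(row: int, col: int, row_count: int = ROWS, col_count: int = COLS) -> int:
    """Right: in row-major order each full row spans col_count cells."""
    return row * col_count + col


# Row 0 is multiplied by zero either way, so the bug is invisible there.
row0_matches = all(index_buggy(0, c) == index_fixed(0, c) for c in range(COLS))

# From row 1 onward the buggy version points into the wrong cells,
# and different (row, col) pairs collide on the same index.
first_cell_row1 = (index_buggy(1, 0), index_fixed(1, 0))
buggy_indices = {index_buggy(r, c) for r in range(ROWS) for c in range(COLS)}
fixed_indices = {index_fixed(r, c) for r in range(ROWS) for c in range(COLS)}

print("row 0 matches:", row0_matches)
print("row 1, col 0 (buggy, fixed):", first_cell_row1)
print("distinct cells reached (buggy, fixed):", len(buggy_indices), len(fixed_indices))
```

This is why a test that only exercises the first row passes: the wrong factor never gets a chance to matter.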
In my opinion this should be the default config. Increasing the quality of the plans gives you a much better experience using Claude Code.
Detachment from the code has been excellent for me. Just started a v2 rewrite of something I’d never have done in the past. Mostly because it would have taken me too much time to try it out if I wrote it all by hand.
I fed Claude a copy of everything I've ever written on Hacker News. Then I asked it to generate an essay that sounds like me.
Out of five paragraphs I had to change one sentence. Everything else sounded exactly as I would have written it.
It was scary good.
I'm not comfortable using it to generate code for this project, but I can absolutely see using it to generate code for a project I'm familiar with in a language I know well.
https://www.linkedin.com/posts/reidhoffman_can-talking-with-...
I've watched a handful of videos with this "digital twin", and I don't know how much post-processing has gone into them, but it is scary accurate. And this was a year+ ago.
Personally I'm a Neovim addict, so you can pry TUIs out of my cold dead hands (although I recognize that's not a preference everyone shares). I'm also not purely vibecoding; I just use it to speed up annoying tasks, especially UI work.
Claude code is more user friendly than cursor, with its CLI-like interface. The file modifications are easy to view and it automatically runs psql, cd, ls, and grep commands. Output of the commands is shown in a more user-friendly fashion. Agents and MCPs are easy to organize and use.
how long until he falls from staff engineer back down to senior or something less?
It’s way easier to let the agent code the whole thing if your prompt is good enough than to give instructions bit by bit only because your colleagues cannot review a PR with 50 file changes.
"Ask the LLM" is a good enough solution to an absurd number of situations. Being open to questioning your approach - or even asking the LLM (with the right context) to question your approach has been valuable in my experience.
But from a more general POV, it's something we'll have to spend the next decade figuring out. 'Agile'/scrum & friends is a sort of industry-wide standard approach, and all of that should be rethought - once a bit of the dust settles.
We're so early in the change that I haven't even seen anybody get it wrong, let alone right.
The 50 file changes are most likely unsafe to deploy and unmaintainable.
So far, nothing I've seen convinces me that machines can (yet) write or review code autonomously (although they can certainly be useful as assistants). Maybe some day.
I am sorry, but this is so out of touch with reality. Maybe in the US most companies are willing to allocate you 1000 or 1500 USD/month/engineer, but I am sure that in many countries outside of the US not even a single line (or other type of) manager will allocate you such a budget.
I know for a fact that in countries like Japan you even need to present your arguments for a pizza party :D So that's all you need to know about AI adoption and what's driving it
Edit: Why is this downvoted? Different corp cultures have different ideas about what is worthwhile. Some places value innovation and experimentation and some places don't.
1) Summarize what I think my project currently does
2) Summarize what I think it should do
3) Give a couple of hints about how to do it
4) Watch it iterate a write-compile-test loop until it thinks it's ready
I haven't added any files or instructions anywhere, I just do that loop above. I know of people who put their Claude in YOLO mode on multiple sessions, but for the moment I'm just sitting there watching it.
Example:
"So at the moment, we're connecting to a websocket and subscribing to data, and it works fine, all the parsing tests are working, all good. But I want to connect over multiple sockets and just take whichever one receives the message first, and discard subsequent copies. Maybe you need a module that remembers what sequence number it has seen?"
Claude will then praise my insightful guidance and start making edits.
At some point, it will do something silly, and I will say:
"Why are you doing this with a bunch of Arc<RwLock> things? Let's share state by sharing messages!"
Claude will then apologize profusely and give reasons why I'm so wise, and then build the module in an async way.
I just keep an eye on what it tries, and it's completely changed how I code. For instance, I don't need to be fully concentrated anymore. I can be sitting in a meeting while I tell Claude what to do. Or I can be close to falling asleep, but still be productive.
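The "remember which sequence numbers you've seen" idea, done by passing messages rather than locking shared state, can be sketched as follows. The original is evidently Rust; this is a Python asyncio analogue, with in-memory lists standing in for websocket connections, and `feed`, `dedupe`, and the `(sequence, payload)` message shape are all made up for the example.

```python
import asyncio
from typing import List, Optional, Set, Tuple

Message = Tuple[int, str]  # (sequence number, payload), hypothetical shape


async def feed(messages: List[Message], out: "asyncio.Queue[Optional[Message]]") -> None:
    """Stand-in for one socket: forwards every message to the shared queue."""
    for msg in messages:
        await out.put(msg)
        await asyncio.sleep(0)  # yield so the feeds interleave
    await out.put(None)  # sentinel: this feed is done


async def dedupe(inbox: "asyncio.Queue[Optional[Message]]", feeds: int) -> List[Message]:
    """Single owner of the 'seen' set: first copy wins, later copies are dropped."""
    seen: Set[int] = set()  # grows without bound; a real version would prune it
    accepted: List[Message] = []
    finished = 0
    while finished < feeds:
        item = await inbox.get()
        if item is None:
            finished += 1
            continue
        seq, _payload = item
        if seq in seen:
            continue  # a faster socket already delivered this one
        seen.add(seq)
        accepted.append(item)
    return accepted


async def main() -> List[Message]:
    inbox: "asyncio.Queue[Optional[Message]]" = asyncio.Queue()
    socket_a = [(1, "a"), (2, "b"), (3, "c")]
    socket_b = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
    tasks = [
        asyncio.create_task(feed(socket_a, inbox)),
        asyncio.create_task(feed(socket_b, inbox)),
    ]
    accepted = await dedupe(inbox, feeds=len(tasks))
    await asyncio.gather(*tasks)
    return accepted


result = asyncio.run(main())
print(result)
```

Because only `dedupe` ever touches the set of seen sequence numbers, no lock is needed; the sockets just send what they receive.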
I don't know if this is a question of the language or what but I just have no good luck with its consistency. And I did invest time into defining various CLAUDE.md files. To no avail.
Does it end in a forever loop for you? I used to have this problem with other models.
But yeah, strongly typed languages, test driven development, and good high quality compiler errors are real game changers for LLM performance. I use Rust for everything now.
Typescript on the other hand, seems to do much better on first pass. Still not always beautiful code, but much more application ready.
My hypothesis is that this is due to the billions LOC of Jupyter Notebook it was probably trained on :/
It will fix those if you catch them, but I haven't been able to figure out a prompt that prevents this in the first place.
I notice what worked and what didn't, what was good and what was garbage -- and also how my own opinion of what should be done changed. I have Claude Code help me update the initial prompt, help me update what should have been in the initial context, maybe add some of the bits that looked good to the initial context as well, and then write it all to a file.
Then I revert everything else and start with a totally blank context, except that file. In this session I care about the code, I review it, I am vigilant to not let any slop through. I've been trying for the second session to be the one that's gonna work -- but I'm open to another round or two of this iteration.
OK I made up the statistic, but the core idea is true, and it's something that is rarely considered in this debate. At least with code you wrote, you can probably recognize it later when you need to maintain it or just figure out what it does.
I think I can also end up with a better result, and having learned more myself. It's just better in a whole host of directions all at once.
I don't end up intimately familiar with the solution however. Which I think is still a major cost.
> This isn't failure; it's the process!
> The biggest challenge? AI can't retain learning between sessions
ai slop
for the record, I've been bullish on the tooling from the beginning
My dev-tooling AI journey has been chatGPT -> vscode + copilot -> early cursor adopter -> early claude + cursor adopter -> cursor agent with claude -> and now claude code
I've also spent a lot of time trying out self-hosted LLMs such as a couple of versions of Qwen coder 2.5/3 32B, as well as deepseek 30B - and talking to them through the vscode continue.dev extension
My personal feelings are that the AI coding/tooling industry has seen a major plateau in usefulness as soon as agents became a part of the tooling. The reality is coding is a highly precise task, and LLMs down to the very core of the model architecture are not precise in the way coding needs them to be. and it's not that I think we won't one day see coding agents, but I think it will take a deep and complete bottom up kind of change and possibly an entirely new model architecture to get us to what people imagine a coding agent is
I've settled on just using claude w/ cursor and being done with experimenting. the agent tooling just slows my engineering team down
I think the worst part about this dev tooling space is that the comment sections on these kinds of articles are completely useless. it's either AI hype bots just saying nonsense, or the most mid and obvious takes that you hear everywhere else. I've genuinely become frustrated with all this vague advice and how the AI dev community talks about this domain space. there is no science, data, or reason as to why these things fail or how to improve it
I think anyone who tries to take this domain space seriously knows that there's a limit to all this tooling, we're probably not going to see anything groundbreaking for a while, and there doesn't exist a person, outside the AI researchers at the big AI companies, that could tell ya how to actually improve the performance of a coding agent
I think that famous vibe-code reddit post said it best
"what's the point of using these tools if I still need a software engineer to actually build it when I'm done prototyping"
I haven't put a huge effort into learning to write prompts but in short, it seems easier to write the code myself than determine prompts. If you don't know every detail ahead of time and ask a slightly off question, the entire result will be garbage.