I'm a CS teacher, so this is where I see a huge danger right now and I'm explicit with my students about it: you HAVE to write the code. You CAN'T let the machines write the code. Yes, they can write the code: you are a student, the code isn't hard yet. But you HAVE to write the code.
This is the ultimate problem with AI in academia. We all inherently know that “no pain no gain” is true for physical tasks, but the same is true for learning.
Of course this becomes a different thing outside of learning, where delivering results is more important in a workplace context. But even then you still need someone who does the high level thinking.
It's not a perfect analogy though because in this case it's more like automated driving - you should still learn to drive because the autodriver isn't perfect and you need to be ready to take the wheel, but that means deliberate, separate practice at learning to drive.
My favorite historic example of typical modern hypertrophy-specific training is the training of Milo of Croton [1]. By legend, his father gifted him a calf and asked daily, "How is your calf, how is it doing? Bring it here so I can look at it," which Milo did. As the calf's weight grew, so did Milo's strength.
This is the application of external resistance (the calf) and progressive overload (the growing calf) at work.
[1] https://en.wikipedia.org/wiki/Milo_of_Croton
Milo lived before Archimedes.
I think forklifts probably carry more weight over longer distances than people do (though I could be wrong, 8 billion humans carrying small weights might add up).
Certainly forklifts have more weight * distance when you restrict to objects that are over 100 pounds, and that seems like a reasonable cutoff.
So the idea is that you should learn to do things by hand first, and then use the powerful tools once you're knowledgeable enough to know when they make sense. If you start out with the powerful tools, then you'll never learn enough to take over when they fail.
Indeed, usually after doing weightlifting, you return the weights to the place you originally took them from, so I suppose that means you did no work at all in the first place.
Here's the thing -- I don't care about "getting stronger." I want to make things, and now I can make bigger things WAY faster because I have a mech suit.
edit: and to stretch the analogy, I don't believe much is lost "intellectually" by my use of a mech suit, as long as I observe carefully. Me doing things by hand is probably overrated.
The activity would train something, but it sure wouldn't be your ability to lift.
Unfortunately, many software devs don't understand it.
I do Windows development and GDI stuff still confuses me. I'm talking about memory DCs, compatible DCs, DIB, DDB, DIBSECTION, BitBlt, SetDIBits, etc... AIs also suck at this stuff. I'll ask for help with a relatively straightforward task and it almost always produces code that falls apart under scrutiny: ask it to defend the choices it made and it finds problems, apologizes, and goes in circles. One AI (I forget which) actually told me I should refer to Petzold's Programming Windows book because it was unable to help me further.
A good analogy here is programming in assembler. Manually crafting programs at the machine code level was very common when I got my first computer in the 1980s, especially for games. By the late 90s that had mostly disappeared. Games like RollerCoaster Tycoon were among the last ones with huge commercial success that were coded like that. C/C++ took over, and these days most game studios license an engine and then do a lot of work in languages like C# or Lua.
I never did any meaningful amount of assembler programming. It was mostly no longer a relevant skill by the time I studied computer science (94-99). I built an interpreter for an imaginary CPU at some point in my second year, using a functional programming language. Our compiler course was taught by people like Erik Meijer (who later worked on things like F# at MS), who just saw it as a great excuse to teach people functional programming instead. In hindsight, that was actually a good skill to have, as interest in functional programming heated up a lot about 10 years later.
The point of this analogy: compilers are important tools. It's more important to understand how they work than it is to be able to build one in assembler. You'll probably never do that. Most people never work on compilers. Nor do they build their own operating systems, databases, etc. But it helps to understand how they work. The point of teaching how compilers work is understanding how programming languages are created and what their limitations are.
I don't know that it's all these things at once, but most people I know who are good have done a bunch of spikes / side projects that go a level lower than they have to. Intense curiosity is good, and to the point you're making, most people don't really learn this stuff just by reading or doing flash cards. If you want to really learn how a compiler works, you probably do have to write a compiler. Not a full-on production-ready compiler, but hands on keyboard, typing and interacting with and troubleshooting code.
Or maybe to put it another way, it's probably the "easiest" way, even though it's the "hardest" way. Or maybe it's the only way. Everything I know how to do well, I know how to do well from practice and repetition.
Your curriculum may be different than it is around here, but here it's frankly the same stuff I was taught 30 years ago. Except most of the actual computer science parts are gone, replaced with even more OOP, design pattern bullshit.
That being said, I have no idea how you'd actually go about teaching students CS these days, considering a lot of them will probably use ChatGPT or Claude regardless of what you do. That is what I see in the grade statistics around here. For the first 9 years I was a well-calibrated grader, but these past 1.5-ish years it's usually either top marks or bottom marks with nothing in between. Which puts me outside where I should be, but it matches the statistical calibration for everyone here. I obviously only see the product of CS educations, but even though I'm old, I can imagine how many corners I would have cut myself if I had LLMs available back then. Not to mention all the distractions the internet has brought.
In my experience, people who talk about business value expect people to code like they work at the assembly line. Churn out features, no disturbances, no worrying about code quality, abstractions, bla bla.
To me, your comment reads contradictory. You want initiative, and you also don't want initiative. I presume you want it when it's good and don't want it when it's bad, and if possible the people should be clairvoyant and see the future so they can tell which is which.
What I read from GP is that they’re looking for engineering innovation, not new science. I don’t see it as contradictory at all.
> Hell, I'd even like developers who will know when the code quality doesn't matter because shitty code will cost $2 a year but every hour they spend on it is $100-200.
> Except most of the actual computer science parts are gone, replaced with even more OOP, design pattern bullshit.
Maybe you should consider a different career; you sound pretty burnt out. These are terrible takes, especially for someone who is supposed to be fostering the next generation of developers.
That's your job.
The great thing about coding agents is that you can tell them "change of design: all API interactions need to go through a new single class that does authentication and retries and rate-limit throttling" and... they'll track down dozens or even hundreds of places that need updating and fix them all.
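To make that concrete, here's a minimal sketch (hypothetical names, assuming Python and httpx) of the kind of single class such an instruction asks the agent to funnel everything through - authentication, retries, and rate-limit throttling in one place:

    import time

    import httpx


    class ApiClient:
        """Every API interaction in the codebase goes through this one class."""

        def __init__(self, base_url, token, max_retries=3, min_interval=0.2):
            # Authentication: one place to attach the bearer token.
            self._client = httpx.Client(
                base_url=base_url,
                headers={"Authorization": f"Bearer {token}"},
            )
            self._max_retries = max_retries
            self._min_interval = min_interval  # crude rate-limit throttle
            self._last_call = 0.0

        def request(self, method, path, **kwargs):
            for attempt in range(self._max_retries):
                # Throttling: keep calls at least min_interval seconds apart.
                wait = self._min_interval - (time.monotonic() - self._last_call)
                if wait > 0:
                    time.sleep(wait)
                self._last_call = time.monotonic()

                response = self._client.request(method, path, **kwargs)
                # Retries: back off on rate limits and transient server errors.
                if response.status_code not in (429, 500, 502, 503):
                    return response
                time.sleep(2 ** attempt)
            return response

Once a class like that exists, the rest of the refactor is the mechanical part - rewiring every call site to go through it - which is exactly the kind of sweep agents are good at.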
(And the automated test suite will help them confirm that the refactoring worked properly, because naturally you had them construct an automated test suite when they built those original features, right?)
Going back to typing all of the code yourself (my interpretation of "writing by hand") because you don't have the agent-managerial skills to tell the coding agents how to clean up the mess they made feels short-sighted to me.
I dunno, maybe I have high standards, but I generally find that the test suites generated by LLMs are both over- and under-determined. Over-determined in the sense that some of the tests are focused on implementation details, and under-determined in the sense that they don't test the conceptual things that a human might.
That being said, I've come across loads of human written tests that are very similar, so I can see where the agents are coming from.
You often mention that this is why you are getting good results from LLMs so it would be great if you could expand on how you do this at some point in the future.
Or I can say "use pytest-httpx to mock the endpoints" and Claude knows what I mean.
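For anyone who hasn't used it, that one sentence expands into something like this (hypothetical endpoint and payload, just to show the shape - pytest-httpx supplies the httpx_mock fixture):

    import httpx


    def test_fetch_user(httpx_mock):
        # pytest-httpx intercepts the request and returns this canned response.
        httpx_mock.add_response(
            url="https://api.example.com/users/1",
            json={"id": 1, "name": "Alice"},
        )
        response = httpx.get("https://api.example.com/users/1")
        assert response.json() == {"id": 1, "name": "Alice"}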
Keeping an eye on the tests is important. The most common anti-pattern I see is large amounts of duplicated test setup code - which isn't a huge deal, I'm much more tolerant of duplicated logic in tests than I am in implementation, but it's still worth pushing back on.
"Refactor those tests to use pytest.mark.parametrize" and "extract the common setup into a pytest fixture" work really well there.
Generally though the best way to get good tests out of a coding agent is to make sure it's working in a project with an existing test suite that uses good patterns. Coding agents pick the existing patterns up without needing any extra prompting at all.
I find that once a project has clean basic tests, the new tests added by the agents tend to match them in quality. It's similar to how working on large projects with a team of other developers works - keeping the code clean means that when people look for examples of how to write a test, they'll be pointed in the right direction.
One last tip I use a lot is this:
Clone datasette/datasette-enrichments
from GitHub to /tmp and imitate the
testing patterns it uses
I do this all the time with different existing projects I've written - the quickest way to show an agent how you like something to be done is to have it look at an example.
Yeah, this is where I too have seen better results. The worse ones have been in greenfield projects where I didn't have a clear idea of how to write tests (I'm a data person working on a Django app).
Thanks for the information, that's super helpful!
If you start with an example file of tests that follow a pattern you like, along with the code the tests are for, it's pretty good at following along. Even adding a sentence to the prompt about avoiding tautological tests and focusing on the seams of functions/objects/whatever (integration tests) can get you pretty far to a solid test suite.
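To illustrate the difference with a toy example (hypothetical slugify/make_post functions): the first test below is tautological - it only checks that a mock returns what we told it to - while the second exercises the seam between two real pieces of code.

    from unittest.mock import patch


    def slugify(title):
        return title.lower().replace(" ", "-")


    def make_post(title):
        return {"title": title, "slug": slugify(title)}


    def test_tautological():
        # Passes, but only restates the mock's canned return value.
        with patch(__name__ + ".slugify", return_value="hello-world"):
            assert slugify("Hello World") == "hello-world"


    def test_seam():
        # Exercises make_post together with the real slugify it depends on.
        assert make_post("Hello World") == {"title": "Hello World", "slug": "hello-world"}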
I increasingly feel a sort of "guilt" when going back and forth between agent-coding and writing it myself. When the agent didn't structure the code the way I wanted, or it just needs overall cleanup, my frustration will get the best of me and I will spend too much time writing code manually or refactoring using traditional tools (IntelliJ). It's clear to me that with current tooling some of this type of work is still necessary, but I'm trying to check myself about whether a certain task really requires my manual intervention, or whether the agent could manage it faster.
Knowing how to manage this back and forth reinforces a view I've seen you espouse: we have to practice and really understand agentic coding tools to get good at working with them, and it's a complete error to just complain and wait until they get "good enough" - they're already really good right now if you know how to manage them.
Or those skills are a temporary side effect of the current SOTA and will be useless in the future, so honing them is pointless right now.
Agents shouldn't make messes in the first place, at least if they did what it says on the tin; and if folks are wasting considerable time cleaning up after them, they should've just written the code themselves.
> So I’m back to writing by hand for most things. Amazingly, I’m faster, more accurate, more creative, more productive, and more efficient than AI, when you price everything in, and not just code tokens per hour
At least he said "most things". I also did "most things" by hand, until Opus 4.5 came out. Now it's doing things in hours I would have worked an entire week on. But it's not a prompt-and-forget kind of thing, it needs hand holding.
Also, I have no idea _what_ agent he was using. OpenAI, Gemini, Claude, something local? And with a subscription, or paying by the token?
Because the way I'm using it, this only pays off because it's the $200 Claude Max subscription. If I had to pay per token (which, once again, is hugely marked up), I would have been bankrupt.
Exactly.
AI assisted development isn't all or nothing.
We as a group and as individuals need to figure out the right blend of AI and human.
Vibe coding is the extreme end of using AI, while handwriting everything is the extreme end of not using AI. The optimal spot is somewhere in the middle. Where exactly that spot is, I think, is still up for debate. But the debate is not advanced in any way by latching on to the extremes and assuming that they are the only options.
It is hands-down good for code which is laborious or tedious to write but, once done, obviously correct or incorrect with low-effort inspection. Tests help, but only if the code comes out nicely structured.
I made plenty of tools like this: a replacement REPL for MS-SQL, a caching tool in Python, a matplotlib helper. Things that I know 90% of how to write anyway but don't have the time for, and that, once in front of me, are obviously correct or incorrect. NP-style code, I suppose: hard to produce, easy to verify.
But business-critical stuff is rarely like this, for me anyway. It is complex, has to deal with various subtle edge cases, has to be written defensively (so it fails predictably and gracefully), be well structured, etc., and try as I might, I can't get Claude to write stuff that's up to scratch in this department.
I'll give it instructions on how to write some specific function, it will write this code but not use it, and use something else instead. It will pepper the code with rookie mistakes like writing the same logic N times in different places instead of factoring it out. It will miss key parts of the spec and insist it did it, or tell me "Yea you are right! Let me rewrite it" and not actually fix the issue.
I also have a sense that it got a lot dumber over time. My expectations may have changed of course too, but still. I suspect even within a model, there is some variability of how much compute is used (eg how deep the beam search is) and supply/demand means this knob is continuously tuned down.
I still try to use Claude for tasks like this, but increasingly find my hit rate so low that the whole "don't write any code yet, let's build a spec" exercise is a waste of time.
I still find Claude good as a rubber duck or to discuss design or errors - a better Stack Exchange.
But you can't split your software spec into a set of SE questions then paste the code from top answers.
This is the bit I think enthusiasts need to argue doesn't apply.
Have you ever read a 200-page vibewritten novel and found it satisfying?
So why do you think a 10 kLoC vibecoded codebase will be any good engineering-wise?
I've been coding a side-project for a year with full LLM assistance (the project is quite a bit older than that).
Basically I spent over a decade developing CAD software at Trimble and now have pivoted to a different role and different company. So like an addict, I of course wanted to continue developing CAD technology.
I pretty much know how CAD software is supposed to work. But it's _a lot of work_ to put together. With LLMs I can basically speedrun through my requirements that require tons of boilerplate.
The velocity is incredible compared to if I would be doing this by hand.
Sometimes the LLM outputs total garbage. Then you don't accept the output, and start again.
The hardest parts are never coding but design. The engineer does the design. Sometimes I agonize for weeks or months over a difficult detail (it's a side project, I have a family, etc). Once the design is crystal clear, it's fairly obvious whether the LLM output is aligned with the design or not. Once I have a good design, I can just start the feature / boilerplate speedrun.
If you have a Windows box you can try my current public alpha. The bugs are on me, not on the LLM:
https://github.com/AdaShape/adashape-open-testing/releases/t...
I suspect part of the reason we see such a wide range of testimonies about vibe-coding is some people are actually better at it, and it would be useful to have some way of measuring that effectiveness.
—
I would never use, let alone pay for, a fully vibe-coded app whose implementation no human understands.
Whether you’re reading a book or using an app, you’re communicating with the author by way of your shared humanity in how they anticipate what you’re thinking as you explore the work. The author incorporates and plans for those predicted reactions and thoughts where it makes sense. Ultimately the author is conveying an implicit mental model (or even evoking emotional states or sensations) to the reader.
The first problem is that many of these pathways and edge cases aren’t apparent until the actual implementation, and sometimes in the process the author realizes that the overall product would work better if it were re-specified from the start. This opportunity is lost without a hands on approach.
The second problem is that, the less human touch is there, the less consistent the mental model conveyed to the user is going to be, because a specification and collection of prompts does not constitute a mental model. This can create subconscious confusion and cognitive friction when interacting with the work.
Well yea, but you can guard against this in several ways. My way is to understand my own codebase and look at the output of the LLM.
LLMs allow me to write code faster and also give me a lot of discoverability of programming concepts I didn't know much about. For example, it plugged in a lot of Tailwind CSS, which I've never used before. With that said, it does not excuse me from knowing my own codebase, unless I'm (temporarily) fine with my codebase being fractured conceptually in wonky ways.
I think vibecoding is amazing for creating quick high fidelity prototypes for a green field project. You create it, you vibe code it all the way until your app is just how you want it to feel. Then you refactor it and scale it.
I'm currently looking at 4009 lines of JS/JSX combined. I'm still vibecoding my prototype. I recently looked at the codebase and saw some ready-made improvements, so I made them. But I think I'll actually need to start engineering things once I reach the 10K line mark.
Then you are not vibe coding. The core, almost exclusive requirement for "vibe coding" is that you DON'T look at the code. Only the product outcome.
Is it a skill for the layman?
Or does it only work if you have the understanding you would need to manage a team of junior devs to build a project.
I feel like we need a different term for those two things.
Programming together with AI, however, is a skill, mostly based on how well you can communicate (with machines or other humans) and how good your high-level software engineering skills are. You need to learn what it can and cannot do before you can be effective with it.
I call the act of using AI to help write code that you review, or managing a team of coding agents "AI-assisted programming", but that's not a snappy name at all. I've also skirted around the idea of calling it "vibe engineering" but I can't quite bring myself to commit to that: https://simonwillison.net/2025/Oct/7/vibe-engineering/
> It’s not until I opened up the full codebase and read its latest state cover to cover that I began to see what we theorized and hoped was only a diminishing artifact of earlier models: slop.
This is true vibe coding, they exclusively interacted with the project through the LLM, and only looked at its proposed diffs in a vacuum.
If they had been monitoring the code in aggregate the entire time they likely would have seen this duplicative property immediately.
> What’s worse is code that agents write looks plausible and impressive while it’s being written and presented to you. It even looks good in pull requests (as both you and the agent are well trained in what a “good” pull request looks like).
Which made me think that they were indeed reading at least some of the code - classic vibe coding doesn't involve pull requests! - but weren't paying attention to the bigger picture / architecture until later on.
Normally I'd know 100% of my codebase, now I understand 5% of it truly. The other 95% I'd need to read it more carefully before I daresay I understand it.
I agree there is a spectrum, and all the way to the left you have "vibe coding" and all the way to the right you have "manual programming without AI", of course it's fine to be somewhere in the middle, but you're not doing "vibe coding" in the way Karpathy first meant it.
Might be my skills, but I can tell you right now I will not be as fast as the AI, especially in new codebases, other languages, or different environments, even with all the debugging and the hell that is AI pull request review.
I think the answer here is fast AI for things it can do on its own, and slow, composed, human in the loop AI for the bigger things to make sure it gets it right. (At least until it gets most things right through innovative orchestration and model improvement moving forward.)
My habit now: always get a 2nd or 3rd opinion before assuming one LLM is correct.
I have AI build self-contained, smallish tasks and I check everything it does to keep the result consistent with global patterns and vision.
I stay in the loop and commit often.
Looks to me like the problem a lot of people are having is that they have AI do the whole thing.
If you ask it to "refactor code to be more modern", it might guess what you mean and do it in a way you like, or it might not - most likely it won't.
If you keep tasks small and clearly specced out it works just fine. A lot better than doing it by hand in many cases, specially for prototyping.
It's worth mentioning that even today, Copilot is an underwhelming-to-the-point-of-obstructive kind of product. Microsoft sent salespeople and instructors to my job, all for naught. Copilot is a great example of how product > everything, and if you don't have a good product... well...
All under one subscription.
Does not support upload / reading of PDF files :(
While this is likely feasible, I imagine it is also an instant fireable offense at these sites if not already explicitly directed by management. Also not sure how Microsoft would react upon finding out (I've never seen the enterprise licensing agreement paperwork for these setups). Someone's account driving Claude Code via GitHub Copilot will also become a far outlier in token consumption, by an order of magnitude or more, making them easy to spot compared to their coworkers who are limited to the conventional chat and code-completion interfaces.
If someone has gotten the enterprise Github Copilot integration to work with something like Claude Code though (simply to gain access to the models Copilot makes available under the enterprise agreement, in a blessed golden path by the enterprise), then I'd really like to know how that was done on both the non-technical and technical angles, because when I briefly looked into it all I saw were very thorny, time-consuming issues to untangle.
Outside those environments, there are lots of options to consume Claude Code via GitHub Copilot, like with Visual Studio Code extensions. So much smaller companies and individuals seem to be at the forefront of adoption for now. I'm sure this picture will improve, but the rapid rate of change in the field means those whose work environment is like the enterprise-constrained ones I described, but who also don't experiment on their own, will be quite behind the industry's leading edge by the time it is all sorted out in the enterprise context.
"AI can be good -- very good -- at building parts. For now, it's very bad at the big picture."
"Amazingly, I’m faster, more accurate, more creative, more productive, and more efficient than AI, when you price everything in, and not just code tokens per hour."
For 99.99% of developers this just won't be true.
This is such an individualized technology that two people at the same starting point two years ago could've developed wildly different workflows.
His points about why he stopped using AI: these are the things we reluctant AI adopters have been saying since this all started.
It requires refactoring at scale, but GenAI is fast so hitting the same code 25 times isn’t a dealbreaker.
Eventually the refactoring is targeted at smaller and smaller bits until the entire project is in excellent shape.
I’m still working on Sharpee, an interactive fiction authoring platform, but it’s fairly well-baked at this point and 99% coded by Claude and 100% managed by me.
Sharpee is a complex system and a lot of the inner-workings (stdlib) were like coats of paint. It didn’t shine until it was refactored at least a dozen times.
It has over a thousand unit tests, which I’ve read through and refactored by hand in some cases.
The results speak for themselves.
https://sharpee.net/ https://github.com/chicagodave/sharpee/
It’s still in beta, but not far from release status.
Sharpee's success is rooted in this, and it's recorded:
https://github.com/ChicagoDave/sharpee/tree/main/docs/archit...