If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.
If you tell them the code is slow, they'll try to add optimized fast paths (more code), specialized routines (more code), custom data structures (even more code). And then add fractally more code to patch up all the problems that code has created.
If you complain it's buggy, you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.
If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."
But I can see the carnage with offshoring+LLM, or "most employees", including so-called software engineers + LLM.
LLM code is still mostly absurdly bad, unless you tell it in painstaking detail what to do and what to avoid, and never ask it to do a bigger job at a time than a single function or very small class.
Edit: I'll admit though that the detailed explanation is often still much less work than typing everything yourself. But it is a showstopper for autonomous "agentic coding".
> LLM code is higher quality than any codes I have seen in my 20 years in F500.
"Any codes"? And in my French brain, code or codebase is countable and not uncountable.
There is a countable "code" (just like "un café" is either a place, or a cup of coffee, or a type of coffee), and "un code" would be the one used as a password or secret, as in "j'ai utilisé tous les codes de récupération et perdu mon accès Gmail" (I used all the recovery codes and lost Gmail access).
Now I can't get the Pulp Fiction dialog out of my head.
- Do you know what they call code in France?
- No
- Le code
But what set me off is a universal quantifier: there was no code seen by you that is of equal quality or better than what LLMs generate.
https://www.neatorama.com/2007/01/22/a-mathematical-cow-joke...
You can for example have two different organizations with different codes of conduct.
There is, though, nothing technically wrong with seeing each line of code as a complete individual code and then referring to multiple of them as codes.
If that's obvious to you, then you're just being rude. If it's not obvious to you, then you'll also find this is a common deviance (plural 'code') from those who come from a particular primary language's region.
Edit: This got me thinking - what is the grammar/rule around what gets pluralized and what doesn't? How does one know that "code" can refer to a single line of code, a whole file of code, a project, or even the entirety of all code your eyes have ever seen without having to have an s tacked on to the end of it?
As for the grammar rule, it's the question of whether a word is countable or uncountable. In common industry usage, "code" is an uncountable noun, just like "flour" in cooking (you say 2 lines of code, 1 pound of flour).
It's actually pretty common for the same word to have both countable and uncountable versions, with different, though related, meanings. Typically the uncountable version is used with a measure of quantity, while the countable version denotes different kinds (flours - different types of flour; peoples - different groups of people).
This was very helpful, thank you! (I had just gotten off the phone with Claude learning about countable and uncountable nouns but those additional details you provided should prove quite valuable)
As if the author of the comment had not seen any code that is better than, or equal in quality to, code generated by LLMs.
Well, the grammar is that English has two different classes of noun, and any given noun belongs to one class or the other. Standard terminology calls them "mass nouns" and "count nouns".
The distinction is so deeply embedded in the language that it requires agreement from surrounding words; you might compare many [which can only apply to count nouns] vs much [only to mass nouns], or observe that there are separate generic nouns for each class [thing is the generic count noun; stuff is the generic mass noun].
For "how does one know", the general concept is that count nouns refer to things that occur discretely, and mass nouns refer to things that are indivisible or continuous, most prototypically materials like water, mud, paper, or steel.
Where the class of a noun is not fixed by common use (for example, if you're making it up, or if it's very rare), a speaker will assign it to one class or the other based on how they internally conceive of whatever they're referring to.
You need code to get it to generate proper code.
I certainly read it as one and found it funny.
Nevermind the fact that it only migrated 3 out of 5 duplicated sections, and hasn’t deleted any now-dead code.
Can you imagine working with someone who produces 100k lines of unmaintainable code in a single sprint?
This is your future.
I had a coworker that more or less exactly did that. You left a comment in a ticket about something extra to be done, he answered "yes sure" and after a few days proceeded to close the ticket without doing the thing you asked. Depending on the quantity of work you had at the moment, you might not notice that until after a few months, when the missing thing would bite you back in bitter revenge.
Tool works as expected? It's superintelligence. Programming is dead.
Tool makes dumb mistake? So do humans.
I admire your experience with people.
You need to do this when coding manually as well, but the speed at which AI tools can output bad code means it's so much more important.
I think programming is giving people a false impression of how intelligent the models are: programmers are meant to be smart, right, so being able to code means the AI must be super smart. But programmers also put a huge amount of their output online for free, unlike most disciplines, and it's all text based. When it comes to problem solving I still see them regularly confused by simple stuff, having to reset context to try and straighten it out. It's not a general purpose human replacement just yet.
Set the boundaries and guidelines before it starts working. Don't leave it space to do things you don't understand.
ie: enforce conventions, set specific and measurable/verifiable goals, define skeletons of the resulting solutions if you want/can.
To give an example. I do a lot of image similarity stuff and I wanted to test the Redis VectorSet stuff when it was still in beta and the PHP extension for redis (the fastest one, which is written in C and is a proper language extension not a runtime lib) didn't support the new commands. I cloned the repo, fired up claude code and pointed it to a local copy of the Redis VectorSet documentation I put in the directory root telling it I wanted it to update the extension to provide support for the new commands I would want/need to handle VectorSets. This was, idk, maybe a year ago. So not even Opus. It nailed it. But I chickened out about pushing that into a production environment, so I then told it to just write me a PHP run time client that mirrors the functionality of Predis (pure-php implementation of redis client) but does so via shell commands executed by php (lmao, I know).
Define the boundaries, give it guard rails, use design patterns and examples (where possible) that can be used as reference.
You say "Do this thing".
- It does the thing (takes 15 min). Looks incredibly fast. I couldn't code that fast. It's inhuman. So far all the fantastical claims hold up.
But still. You ask "Did you do the thing?"
- it says oops I forgot to do that sub-thing. (+5m)
- it fixes the sub-thing (+10m)
You say is the change well integrated with the system?
- It says not really, let me rehash this a bit. (+5m)
- It irons out the wrinkles (+10m)
You say does this follow best engineering practices, is it good code, something we can be proud of?
- It says not really, here are some improvements. (+5m)
- It implements the best practices (+15m)
You say to look carefully at the change set and see if it can spot any potential bugs or issues.
- It says oh, I've introduced a race condition at line 35 in file foo and a null-correctness bug at line 180 of file bar. Fixing. (+15m)
You ask if there's test coverage for these latest fixes?
- It says "i forgor" and adds them. (+15m)
Now the change set has shrunk a bit and is superficially looking good. Still, you must read the code line by line, and with an experienced eye you will still find weird stuff happening in several of the functions: there are redundant operations, and resources aren't always freed up. (60m)
You ask why it's implemented in such a roundabout way and how it intends for the resources to be freed up?
- It says "you're absolutely right" and rewrites the functions. (+15m)
You ask if there's test coverage for these latest fixes?
- It says "i forgor" and adds them. (+15m)
Now the 15 minutes of amazingly fast AI code gen has ballooned into taking most of the afternoon.
Telling Claude to be diligent, not write bugs, or to write high quality code flat out does not work. And even if such prompting can reduce the odds of omissions or lapses, you still always always always have to check the output. It can not find all the bugs and mistakes on its own. If there are bugs in its training data, you can assume there will be bugs in its output.
(You can make it run through much of this Socratic checklist on its own, but this doesn't really save wall clock time, and doesn't remove the need for manual checking.)
I've definitely built the same thing a few times, getting incrementally better designs each time.
Perform regular sessions dedicated to cleaning up tech debt (including docs).
It's a tool. It's a wildly effective and capable tool. I don't know how or why I have such a wildly different experience than so many that describe their experiences in a similar manner... but... nearly every time I come to the same conclusion that the input determines the output.
> If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.
Yes, when the prompt/instructions are overly broad and there's no set of guardrails or guidelines that indicate how things should be done... this will happen. If you're not using planning mode, skill issue. You have to get all this stuff wrapped up and sorted before the implementation begins. If the implementation ends up being done in a "not-so-great" approach - that's on you.
> If you tell them the code is slow
Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code and then you ask it how it might be optimized. Then you discuss those options with it (this is where you do the part from the previous paragraph, where you direct the approach so it doesn't do a "not-so-great approach") until you get to a point where you like the approach and the model has shown it understands what's going on.
Then you accept the plan and let the model start work. At this point you should have essentially directed the approach and ensured that it's not doing anything stupid. It will then just execute, it'll stay within the parameters/bounds of the plan you established (unless you take it off the rails with a bunch of open ended feedback like telling it that it's buggy instead of being specific about bugs and how you expect them to be resolved).
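To make the "benchmark first, then discuss options" step concrete, here is a minimal sketch using Python's standard `timeit` module. The two functions are hypothetical stand-ins (a loop-based baseline and a closed-form candidate), not code from the thread. The point is only that you check equivalence and get a measured number before accepting any plan.

```python
import timeit


def sum_squares_loop(n):
    """Hypothetical baseline: sum of i*i for i in 0..n-1, using a plain loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_formula(n):
    """Hypothetical candidate: closed form for the same sum."""
    return (n - 1) * n * (2 * n - 1) // 6


def benchmark(fn, arg, repeat=5, number=20):
    """Best observed per-call time in seconds (min over `repeat` runs)."""
    timings = timeit.repeat(lambda: fn(arg), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    n = 100_000
    # Verify the candidate is equivalent before comparing speed.
    assert sum_squares_loop(n) == sum_squares_formula(n)
    for fn in (sum_squares_loop, sum_squares_formula):
        print(f"{fn.__name__}: {benchmark(fn, n):.6f} s per call")
```

Having a number like this in the conversation gives the model (or a coworker) something specific to optimize against, instead of "it's slow".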
> you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.
This is an area where I will agree that the models are wildly inept. Someone needs to study what it is about tests and testing environments and mocking things that just makes these things go off the rails. The solution to this is the same as the solution to the issue of it continuing to dig or chasing its tail in circles... Early in the prompt/conversation/message that sets the approach/intent/task you state your expectations for the final result. Define the output early, then describe/provide context/etc. The earlier in the prompt/conversation the "requirements" are set, the more sticky they'll be.
And this is exactly the same for the tests. Either write your own tests and have the models build the feature from the test, or have the model build the tests first as part of the planned output and then fill in the functionality from the pre-defined test. Be very specific about how your testing system/environment is set up, and any time you run into a testing-related issue, have the model make a note about that and the solution in a TESTING.md document. In your AGENTS.md or CLAUDE.md or whatever, indicate that if the model is working with tests it should refer to the TESTING.md document for notes about the testing setup.
Personally, I focus on the functionality, get things integrated and working to the point I'm ready to push it to a staging or production (yolo) environment and _then_ have the model analyze that working system/solution/feature/whatever and write tests. Generally my notes on the testing environment to the model are something along the lines of a paragraph describing the basic testing flow/process/framework in use and how I'd like things to work.
The more you stick to convention the better off you'll be. And use planning mode.
Yes? Why don't you?
They are capable people who just didn't notice something; if I notice some telemetry and tell them "hey this is slow", they are expected to understand the reason(s).
"Hey, I saw that metric A was reporting 40% slower, are you aware already or have any ideas as to what might be causing that?"
Those two approaches are going to produce rather distinctly different results whether you're speaking to a human or typing to a GPU.
The suggestion to tell the agent to do performance analysis of the part of the code you think is problematic, and offer suggestions for improvements seems like the proper way to talk to a machine, whereas "hey your code is slow" feels like the proper way to talk to a human.
Right, I'm sure there are all sorts of scenarios where that is the case, and probably the phrasing would be something like "that seems slow" or "it seems to be taking longer than expected", or some other phrasing that is actually synonymous with "the code is slow". On the other hand, there are also people that you can say the code is slow to, and they won't worry about it.
>So no that’s not the proper way to talk to humans
In my experience there are lots of proper ways to talk to humans, and part of the propriety is involved with what your relationship with them is. So it may be the proper way to talk to a subset of humans, which is generally the only kind of humans one talks to - a subset. I certainly have friends that I have worked with for a long time who can say "what the fuck were you thinking here" or all sorts of things that would not be nice if they came from other people, but that are in fact a signifier of our closeness that we can talk in such a way. Evidently you have never led a team with people who enjoyed that relationship between them, which I think is a shame.
Finally, I'll note that when I hear a generalized description of a form of interaction I tend to give what used to be called "the benefit of a doubt" and assume that, because of the vagaries of human language and the necessity of keeping things not a big long harangue as every communication must otherwise become in order to make sure all bases of potential speech are covered, that the generalized description may in fact cover all potential forms of polite interaction in that kind of interaction, otherwise I should have to spend an inordinate amount of my time lecturing people I don't know on what moral probity in communication requires.
But hey, to each their own.
on edit: "the what the fuck were you thinking here" quote is also an example of a generalized form of communication that would be rude coming from other people but was absolutely fine given the source, and not an exact quote despite the use of quotation marks in the example.
It can be a tool, for specific niche problems: summarization, extraction, source-to-source translation -- if post-trained properly.
But that isn't what y'all are doing, you're engaging in "replace all the meatsacks AGI ftw" nonsense.
It's a tool. It's good for some things, not for others. Use the right tool for the job and know the job well enough to know which tools apply to which tasks.
More than anything it's a learning tool. It's also wildly effective at writing code, too. But, man... the things that it makes available to the curious mind are rather unreal.
I used it to help me turn a cat exercise wheel (think huge hamster wheel) into a generator that produces enough power to charge a battery that powers an ESP32 powered "CYD" touchscreen LCD that also utilizes a hall effect sensor to monitor, log and display the RPMs and "speed" (given we know the wheel circumference) in real time as well as historically.
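For anyone wondering what the RPM and "speed" part boils down to, the arithmetic is small. A rough sketch in Python, assuming a hypothetical setup with one magnet per revolution and pulses counted over a fixed time window (the actual wiring and firmware of the build above are not described, so the names and parameters here are illustrative only):

```python
import math


def rpm_from_pulses(pulse_count, window_seconds, magnets_per_rev=1):
    """Convert hall-sensor pulses counted in a time window to revolutions per minute."""
    revolutions = pulse_count / magnets_per_rev
    return revolutions * 60.0 / window_seconds


def speed_kmh(rpm, wheel_diameter_m):
    """Surface speed of the wheel rim: one revolution covers one circumference."""
    circumference_m = math.pi * wheel_diameter_m
    metres_per_minute = rpm * circumference_m
    return metres_per_minute * 60.0 / 1000.0


if __name__ == "__main__":
    rpm = rpm_from_pulses(pulse_count=30, window_seconds=10)
    print(f"{rpm:.0f} RPM, {speed_kmh(rpm, wheel_diameter_m=1.0):.1f} km/h")
```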
I didn't know anything about all this stuff before I started. I didn't AGI myself here. I used a learning tool.
But keep up with your schtick if that's what you want to do.
P.S. The real big deal is the democratization of oracles. Back in the day building an oracle was a megaproject accessible only to megacorps like Google. Today you can build one for nothing if you have a gaming GPU and use it for powering your kobold text adventure session.
So what? That's honestly amateur hour. And the LLM derived all of it from things that have been done and posted about a thousand times before.
You could have achieved the same thing with a few google searches 15 years ago (obviously not with ESP32, but other microcontrollers).
Obviously there is a "just keep generating more tokens" bias in software management, since so many developer metrics over the years have done various lines-of-code-style analyses.
But just as experience and managerial programs have over time developed to say this is a bad bias for ranking devs, it should be clear it is a bad bias for LLMs to have.
Generative AI.
I wouldn't be surprised if over half my prompts start with "Why ...?", usually followed by "Nope, ... instead"
Maybe the occasional "Fuck that you idiot, throw the whole thing out"
Are you using plan mode? I used to experience the "take a poor approach and keep digging" issue, but with planning that seems to have gone away?
I find LLMs at present work best as autocomplete -
The chunks of code are small and can be carefully reviewed at the point of writing
Claude normally gets it right (though sometimes horribly wrong) - this is easier to catch in autocomplete
That way they mostly work as designed and the burden on humans is completely manageable, plus you end up with a good understanding of the code generated. They make mistakes I'd say 30% of the time or so when autocompleting, which is significant (mistakes not necessarily being bugs but ugly code, slow code, duplicate code or incorrect code).
Having the AI produce the majority of the code (in chats or with agents) takes lots of time to plan and babysit, and is harder to review, maintain and diagnose; it doesn't seem like much of a performance boost, unless you're producing code that is already in the training data and just want to ignore the licensing of the original code.
Pretty much. I've been advocating this for a while. For automation you need intent, and for comparison you need measurement. Blast radius/risk profile is also important to understand how much you need to cover upfront.
The Author mentions evaluations, which in this context are often called AI evals [1] and one thing I'd love to see is those evals become a common language of actually provable user stories instead of there being a disconnect between different types of roles, e.g. a scientist, a business guy and a software developer.
The more we can speak a common language and easily write and maintain these no matter which background we have, the easier it'll be to collaborate and empower people and to move fast without losing control.
- [1] https://ai-evals.io/ (or the practical repo: https://github.com/Alexhans/eval-ception )
They just write code that is (semantically) similar to code (clusters) seen in their training data, and which hasn't been fenced off by RLHF / RLVR.
This isn't that hard to remember, and is a correct enough simplification of what generative LLMs actually do, without resorting to simplistic or incorrect metaphors.
We stubbornly use the same language to refer to all software development, regardless of the task being solved. This lets us all be a part of the same community, but is also a source of misunderstanding.
Some of us are prone to not thinking about things in terms of what they are, and taking the shortcut of looking at industry leaders to tell us what we should think.
These guys consistently, in lockstep, talk about intelligent agents solving development tasks. Predominately using the same abstract language that gives us an illusion of unity. This is bound to make those of us solving the common problems believe that the industry is done.
If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?
An example I keep hearing is "LLMs have no memory/understanding of time so ___" - but, agents have various levels of memory.
I keep trying to explain this in meetings, and in rando comments. If I am not way off-base here, then what should the term, or terms, be? LLM-based agents?
You always need a harness of some kind to interact with an LLM. Normal web APIs (especially for hosted commercial systems) wrapped around LLMs are non-minimal harnesses, that have built in tools, interpretation of tool calls, application of what is exposed in local toolchains as “prompt templates” to transform the context structure in the API call into a prompt (in some cases even supporting managing some of the conversation state that is used to construct the prompt on the backend.)
> If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?
You are essentially always using something more than an LLM (unless “you” are the person writing the whole software stack, and the only thing you are consuming is the model weights, or arguably a truly minimal harness that just takes settings and a prompt that is not transformed in any way before tokenization, and returns the result after no transformations or filtering other than mapping back from tokens to text.)
But, yes, if you are using an elaborate frontend of the type you enumerate (whether web or CLI or something else), you are probably using substantially more stuff on top of the LLM than if you are using the provider's web API.
However, they just look at the whole thing as "the LLM," which carries specific baggage. If we could all spread the knowledge of what is actually going on to the wider public, it would make my meetings easier, and prevent many very smart folks who are not practitioners from saying inaccurate stuff.
This is an example of why LLMs won't displace engineers as severely as many think. There are very old solved processes and hyper-efficient ways of building things in the real world that still require a level of understanding many simply don't care or want to achieve.
- LLM = the model itself (stateless, no tools, just text in/text out)
- LLM + system prompt + conversation history = chatbot (what most people interact with via ChatGPT, Claude, etc.)
- LLM + tools + memory + orchestration = agent (can take actions, persist state, use APIs)
When someone says "LLMs have no memory" they're correct about the raw model, but Claude Code or Cursor are agents - they have context, tool access, and can maintain state across interactions.
The industry seems to be settling on "agentic system" or just "agent" for that last category, and "chatbot" or "assistant" for the middle one. The confusion comes from product names (ChatGPT, Claude) blurring these boundaries - people say "LLM" when they mean the whole stack.
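The three layers can be shown in a few lines of Python. This is a toy sketch only: `fake_llm` is a hypothetical stand-in for a real model (a pure function of its prompt), and the `CALL`/`TOOL_RESULT` text protocol is invented for the example, not any vendor's API. What it illustrates is that memory and tool use live in the wrapper code, not in the model.

```python
import re


def fake_llm(prompt):
    """Stand-in for a stateless model: output depends only on the prompt text."""
    lines = prompt.splitlines()
    last = lines[-1]
    if last.startswith("user: TOOL_RESULT:"):
        return "The answer is " + last.split(":", 2)[2].strip()
    numbers = re.findall(r"-?\d+", last)
    if "add" in last.lower() and len(numbers) >= 2:
        return f"CALL add {numbers[0]} {numbers[1]}"
    return f"I can see {len(lines)} line(s) of context"


class Chatbot:
    """LLM + system prompt + conversation history."""

    def __init__(self, llm, system_prompt):
        self.llm = llm
        self.history = [system_prompt]

    def send(self, message):
        self.history.append("user: " + message)
        # The wrapper, not the model, replays the whole history every turn.
        reply = self.llm("\n".join(self.history))
        self.history.append("assistant: " + reply)
        return reply


class Agent(Chatbot):
    """Chatbot + tools + a (single-step) orchestration loop."""

    def __init__(self, llm, system_prompt, tools):
        super().__init__(llm, system_prompt)
        self.tools = tools

    def run(self, task):
        reply = self.send(task)
        if reply.startswith("CALL "):
            name, *args = reply[len("CALL "):].split()
            result = self.tools[name](*(int(a) for a in args))
            reply = self.send(f"TOOL_RESULT: {result}")
        return reply


if __name__ == "__main__":
    print(fake_llm("hello"))
    bot = Chatbot(fake_llm, "system: be brief")
    print(bot.send("hello"))
    print(bot.send("and again"))
    agent = Agent(fake_llm, "system: use tools", {"add": lambda a, b: a + b})
    print(agent.run("please add 2 and 3"))
```

The raw function sees one line and forgets it; the chatbot sees a growing context because the wrapper keeps the history; the agent additionally executes a tool and feeds the result back in.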
My own experience using Claude Code and similar tools tells me that "hidden requirements" could include:
* Make sure DESIGN.md is up to date
* Write/update tests after changing source, and make sure they pass
* Add integration test, not only unit tests that mock everything
* Don't refactor code that is unrelated to the current task
...
These are not even project/language specific instructions. They are usually considered common sense/good practice in software engineering, yet I sometimes had to almost beg coding agents to follow them. (You want to know how many times I have to emphasize don't use "any" in a TypeScript codebase?)
People should just admit it's a limitation of these coding tools, and we can still have a meaningful discussion.
An interesting example of the training data overriding the context.
Humans would execute that code and validate it. From plausible it becomes "hey, it does this, and this is what I want". LLMs skip that part; they really have no understanding other than the statistical patterns they infer from their training, and they really don't need any for what they are.
It's better to describe what you can do that LLMs currently can't.
If they'd bother to see how modern neuroscience tries to explain human cognition they'd see it explained in terms that parallel modern ML. https://en.wikipedia.org/wiki/Predictive_coding
We only have theories for what intelligence even means. I wouldn't be surprised if there are more similarities than differences between human minds and LLMs, fundamentally (prediction and error minimization).
What a shame your human reasoning and "true understanding" led you astray here.
I don't use a planner though, I have my own workflow setup to do this (since it requires context isolated agents to fix tests and fix code during differential testing). If the planner somehow added broad test coverage and a performance feedback loop (or even just very aggressive well known optimizations), it might work.
I don't always write correct code, either. My code sure as hell is plausible but it might still contain subtle bugs every now and then.
In other words: 100% correctness was never the bar LLMs need to pass. They just need to come close enough.
Someone (with deep pockets to bear the token costs) should let Claude run for 26 months to have it optimize its Rust code base iteratively towards equal benchmarks. Would be an interesting experiment.
The article points out the general issue when discussing LLMs: audience and subject matter. We mostly discuss interactions and results anecdotally. We really need much more data, more projects to succeed with LLMs or to fail with them - or to linger in a state of ignorance, sunk-cost fallacy and suppressed resignation. I expect the latter will remain the standard case that we do not hear about - the part of the iceberg that is underwater, mostly existing within the corporate world or in private GitHubs, a case that is true with LLMs and without them.
In my experience, 'Senior Software Engineer' has NO general meaning. It's a title to be awarded for each participation in a project/product over and over again. The same goes for the claim: "Me, Senior SWE treat LLMs as Junior SWE, and I am 10x more productive." Imagine me facepalming every time.
https://github.com/fugue-labs/gollem/blob/main/ext/codetool/...
They are buying a service. As long as the service 'works' they do not care about the other stuff. But they will hold you liable when things go wrong.
The only caveat is highly regulated stuff, where they actually care very much.
Anything they happen to get "correct" is the result of probability applied to their large training database.
Being wrong will always be not only possible but also likely any time you ask for something that is not well represented in its training data. The user has no way to know if this is the case, so they are basically flying blind and hoping for the best.
Relying on an LLM for anything "serious" is a liability issue waiting to happen.
For example, let's try a simple experiment. I'll generate a random UUID:
> uuidgen
44cac250-2a76-41d2-bbed-f0513f2cbece
Now it is extremely unlikely that such a UUID is in the training set.
Now I'll use OpenCode with "Qwen3 Coder 480B A35B Instruct" with this prompt: "Generate a single Python file that prints out the following UUID: "44cac250-2a76-41d2-bbed-f0513f2cbece". Just generate one file."
It generates a Python file containing 'print("44cac250-2a76-41d2-bbed-f0513f2cbece")'. Now this is a very simple task (with a 480B model), but it solves a problem that is not in the training data, because it is a generalisation over similar but different problems in the training data.
Almost every programming task is, at some level of abstraction, and with different levels of complexity, an instance of solving a more general type of problem, where there will be multiple examples of different solutions to that same general type of problem in the training set. So you can get a very long way with Transformer model generalisations.
If you've made a significant investment in human capital, you're even more likely to protect it now and prevent posting valuable stuff on the web.
This means an LLM can autogenerate millions of code problem prompts, attempt millions of solutions (both working and non-working), and from the working solutions, penalize answers that have poor performance. The resulting synthetic dataset can then be used as a finetuning dataset.
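A toy version of that filter-then-rank loop, to show the shape of it. Everything here is an assumption for illustration: the three candidate functions are hand-written stand-ins for model samples, the "problem" is Fibonacci, and wall-clock time on one input is a deliberately crude performance signal. A real pipeline would sandbox execution and use far more robust measurement.

```python
import time


def fib_recursive(n):
    """Correct but slow candidate."""
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    """Correct and fast candidate."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_wrong(n):
    """Plausible-looking but incorrect candidate."""
    return n * 2


TEST_CASES = [(0, 0), (1, 1), (2, 1), (10, 55), (20, 6765)]


def passes(fn):
    """Verification step: a candidate is 'working' only if every test passes."""
    try:
        return all(fn(x) == expected for x, expected in TEST_CASES)
    except Exception:
        return False


def runtime(fn, arg=20, repeats=3):
    """Best-of-N wall-clock time for one call, used as the performance penalty."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best


def build_dataset(candidates):
    """Split candidates into rejected (failing) and accepted (ranked fastest first)."""
    working = [fn for fn in candidates if passes(fn)]
    ranked = sorted(working, key=runtime)
    return {
        "accepted": [fn.__name__ for fn in ranked],
        "rejected": [fn.__name__ for fn in candidates if fn not in working],
    }


if __name__ == "__main__":
    print(build_dataset([fib_recursive, fib_wrong, fib_iterative]))
```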
There are now reinforcement finetuning techniques that have not been incorporated into the existing slate of LLMs that will enable finetuning them for both plausibility AND performance with a lot of gray area (like readability, conciseness, etc) in between.
What we are observing now is just the tip of a very large iceberg.
If I'm the govt, I'd be foaming at the mouth - those projects that used to require enormous funding now will supposedly require much less.
Hmmm, what to do? Oh I know. Let's invest in Digital ID-like projects. Fun.
I don't think you grasp my statement. LLMs will exceed humans greatly for any domain that is easy to computationally verify, such as math and code. For areas not amenable to deterministic computations, such as human biology or experimental particle physics, progress will be slower.
No exaggeration it floundered for an hour before it started to look right.
It's really not good at tasks it has not seen before.
Break the job into microtasks: ask for one petal as a pair of cubic Beziers with explicit numeric control points, render that snippet locally with a simple rasterizer, then iterate on the numbers. If determinism matters, accept the tradeoff of writing a small generator using a geometry library like Cairo or a bezier solver, so you get reproducible coordinates instead of watching the model flounder for an hour.
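As a rough illustration of the "small generator" idea, here is a dependency-free Python sketch: one petal edge is a cubic Bezier with explicit control points, the other edge is its mirror image, and the output is SVG path data you can paste into any viewer. The control-point numbers are made up for the example and are not a real fleur-de-lis; the point is that the coordinates are explicit and reproducible, so you iterate on numbers rather than on prose.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    x = u**3 * p0[0] + 3 * u * u * t * p1[0] + 3 * u * t * t * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u * u * t * p1[1] + 3 * u * t * t * p2[1] + t**3 * p3[1]
    return (x, y)


def sample(curve, steps=8):
    """Points along the curve, e.g. for a quick plot or a sanity check."""
    return [cubic_bezier(*curve, i / steps) for i in range(steps + 1)]


def petal_path(left):
    """SVG path data for a symmetric petal: left edge base->tip, mirrored edge tip->base."""
    # `0 - x` (rather than `-x`) avoids a negative zero when coordinates are floats.
    mirrored = [(0 - x, y) for x, y in left]

    def fmt(p):
        return f"{p[0]:g} {p[1]:g}"

    p0, p1, p2, p3 = left
    # The return edge uses the mirrored control points in reverse order.
    return (
        f"M {fmt(p0)} C {fmt(p1)} {fmt(p2)} {fmt(p3)} "
        f"C {fmt(mirrored[2])} {fmt(mirrored[1])} {fmt(mirrored[0])} Z"
    )


# Illustrative control points only: base, two handles, tip.
LEFT_EDGE = [(0, 0), (-40, 30), (-30, 80), (0, 100)]

if __name__ == "__main__":
    print(petal_path(LEFT_EDGE))
    print(sample(LEFT_EDGE, steps=4))
```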
I think some industries with mostly proprietary code will be a bit disappointing to use AI within.
Given a harness that allows the model to validate the result of its program visually, and given the models are capable of using this harness to self correct (which isn't yet consistently true), then you're in a situation where in that hour you are free to do some other work.
A dishwasher might take 3 hours to do what a human could do in 30 minutes, but it's still very useful because the machine's labor is cheaper than human labor.
TBH I would have just rendered a font glyph, or failing that, grabbed an image.
Drawing it with vector graphics programmatically is very hard, but a decent programmer would and should push back on that.
If an LLM did that, people would be all up in arms about it cheating. :-)
For all its flaws, we seem to hold LLMs up to an unreasonably high bar.
Just about anyone can eventually come up with a hideously convoluted HeraldicImageryEngineImplFactory<FleurDeLis>.
Opus would probably do better though.
Whatever the cause, LLMs have gotten significantly better over time at generating SVGs of pelicans riding bicycles:
https://simonwillison.net/tags/pelican-riding-a-bicycle/
But they're still not very good.
It basically just re-created the Wikipedia article on the fleur-de-lis, which I'm not sure proves anything beyond "you have to know how to use LLMs".
Beyond the fact that it was "correct" in the same way the author of the article talked about, there was absolutely bizarre shit in there. As an example, multiple times it tried to import modules that didn't exist. It noticed this when tests failed, and instead of figuring out the import problem it added a fucking try/except around the import and did some goofy Python shenanigans to make it "work".
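To be clear about the kind of thing I mean, here is a hypothetical reconstruction of the pattern, not the actual generated code. The function names and the module name are invented. The first function is the "make the test go green" move; the second is what you'd want instead, which fails loudly at the import.

```python
import importlib

def load_swallowing_errors(module_name):
    """Anti-pattern: hide a broken import so the failure shows up somewhere else, later."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None  # the real problem (the module does not exist) is now invisible

def load_or_fail(module_name):
    """Preferred: surface the import problem immediately, with a clear message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"required module {module_name!r} is missing; fix the dependency "
            "instead of guarding the import"
        ) from exc

if __name__ == "__main__":
    missing = "module_that_does_not_exist_xyz"  # invented name, assumed not installed
    print(load_swallowing_errors(missing))      # None: the bug is papered over
    try:
        load_or_fail(missing)
    except ImportError as err:
        print(f"failed loudly: {err}")
```

A guarded import is fine for a genuinely optional dependency. It is not fine as a response to "this module doesn't exist".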
Interesting shortcoming, really shows how weak the reasoning is.
Write a lambda that takes an S3 PUT event and inserts the rows of a comma separated file into a Postgres database.
Naive implementation: download the file from S3 and do a bulk insert. It would have taken 20 minutes, and it's what Claude did at first.
I had to tell it to use the AWS SQL extension for Postgres that loads a file directly from S3 into a table. That took 20 seconds.
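Roughly what the second version looks like, as a sketch under assumptions rather than my production code. I'm assuming RDS/Aurora Postgres with the `aws_s3` extension installed and an IAM role that lets the database read the bucket, a CSV with a header row, and a hypothetical target table `staging_rows`. The `connect` argument is any function returning a DB-API connection (for example a wrapper around `psycopg2.connect`), injected so the handler can be exercised without a database.

```python
from urllib.parse import unquote_plus

# aws_s3.table_import_from_s3(table_name, column_list, options, s3_info)
# An empty column_list means "all columns"; the options string is passed to COPY.
IMPORT_SQL = (
    "SELECT aws_s3.table_import_from_s3("
    "%s, '', '(format csv, header true)', "
    "aws_commons.create_s3_uri(%s, %s, %s))"
)

def import_params(record, table):
    """Extract (table, bucket, key, region) from one S3 event record."""
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
    return (table, bucket, key, record["awsRegion"])

def make_handler(connect, table="staging_rows"):
    """Build a Lambda handler; `connect` and `table` are illustrative assumptions."""
    def handler(event, context):
        conn = connect()
        try:
            cur = conn.cursor()
            for record in event.get("Records", []):
                # Postgres pulls the file from S3 itself; no rows pass through Lambda.
                cur.execute(IMPORT_SQL, import_params(record, table))
            conn.commit()
        finally:
            conn.close()
        return {"imported": len(event.get("Records", []))}
    return handler
```

The speedup comes from the database doing the load itself; the Lambda only sends one statement per file instead of streaming rows.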
I treat coding agents like junior developers.
The deeper issue is that "efficient ingest" depends heavily on context that's implicit in your setup: file sizes, partitioning, schema evolution expectations, downstream consumers. A Lambda doing direct S3-to-Postgres import is fine for small/occasional files, but if you're dealing with high-volume event-driven ingestion you'll hit connection pool pressure fast on RDS. At that point the conversation shifts to something like a queue buffer or moving toward a proper staging layer (S3 → Redshift/Snowflake/Databricks with native COPY or autoloader). The LLM won't surface that tradeoff unless you explicitly bring it up. It optimizes for the stated task, not for the unstated architectural constraints.
> Now 2 case studies are not proof. I hear you! When two projects from the same methodology show the same gap, the next step is to test whether similar effects appear in the broader population. The studies below use mixed methods to reduce our single-sample bias.
Cherry-picked AI fail for upvotes. Which you’ll get plenty of here and on Reddit from those too lazy to go and take a look for themselves.
Using Codex or Claude to write and optimize high performance code is a game changer. Try optimizing CUDA using nsys, for example. It’ll blow your lazy little brain.
You're glossing over so much stuff. Moreover, how does the Junior grow and become the senior with those characteristics, if their starting point is LLMs?
This series of articles is gold.
Unsurprisingly, writing good software with AI follows the same principles as writing it without AI. Keep scopes small. Ship, refactor, optimize, and write tests as you go.
Electronic synthesisers went from "it's a piano, but expensive and sounds worse" to every weird preset creating a whole new genre of electronic music.
So it seems plausible that, as with Claude's code, our complaints about unmaintainable code come from trying to use it like a piano, and the rave kids will find a better use for it.
A few tips for a quickstart:
Give yourself permission to play.
Understand basic concepts like context window, compaction, tokens, chain of thought and reasoning, and so on. Use AI to teach you this stuff, and read every blog post OpenAI and Anthropic put out and research what you don't understand.
Pick a hard coding problem in Python or TypeScript and take a leap of faith and ask the agent to code it for you.
My favorite phrase when planning is: "Don't change anything. Just tell me." Save this as a tmux shortcut and use it at the end of every prompt when planning something out.
Use markdown .md docs to create a planning doc and keep chatting to the agent about it and have it update the plan until you're super happy, always using the magic phrase "Don't change anything. Just tell me." (I should get myself a patent on that little number. Best trick I know)
Every time you see an anti-AI post, just move on. It's lazy people making lazy assumptions. Approach agentic coding with a sense of love, excitement, optimism, and take massive leaps of faith and you'll be very very surprised at what you find.
Best of luck Serious Angel.
Your answer is to play with it. Cool. But why can't you and others put together a proper guide lol? It can't be that hard.
Go ahead and do it - it'll challenge the Anti-AI posters you are referencing. I and others want to see that debate.
One of the rare resources I found recently was the OpenClaw guy's interview on Lex. He drops a few bangers that are really valuable and will save you having to spend a long time figuring it out.
Also there's a very strong disincentive for anyone to write right now because we're competing against the noise and the slop in the space. So best to just shut the fuck up and create as fast as we can, and let the outcome speak for itself. You're going to see a lot more products like OpenClaw where the pace of innovation is rapid, and the author freely admits that they're coding agentically and not writing a single line.
I think the advantage that Peter (the OpenClaw author) has is that he has enough money and success to not give a fuck about what people say about him writing purely agentically, so he's been very open about it, which has been great for others who are considering doing the same.
But if you have a software engineering career or are a public figure with something to lose, you tend to STFU if you're doing pure agentic coding on a project.
But that'll change. Probably over the next few months. OpenClaw broke the ice.
Start small. Figure out what it (whatever tool you’re using) can do reliably at a quality level you’re comfortable with. Try other tools. There are tons. If it doesn’t get it right with the first prompt, iterate. Refine. Keep at it until you get there.
When you have seen some pattern work, do that a bunch. It won’t always work. Write rules / prompts / skills to try to get it to avoid making the mistakes you see. Keep doing this for a while and you’ll get into a groove.
Then try taking on bigger chunks of work at a time. Break apart a problem the same way you’d do it yourself first. Write a framework first. Build hello world. Write tests. Build the happy path. Add features. Don’t forget to make it write lots of tests. And run them. It’ll be lazy if you let it, so don’t let it. Each architectural step is not just a single prompt but a conversation with the output being a commit or a PR.
Also, use specs or plans heavily. Have a conversation with it about what you’re trying to do and different ways to do it. Their bias is to just code first and ask questions later. Fight that. Make it write a spec doc first and read it carefully. Tell it “don’t code anything but first ask me clarifying questions about the problem.” Works wonders.
As for convincing the AI haters they’re wrong? I seriously do. Not. Care. They’ll catch up. Or be out of a job. Not my problem.
https://news.ycombinator.com/item?id=47280645
It is more about LLMs helping me understand the problem than giving me over engineered cookie cutter solutions.
idk what to say, just because it's Rust doesn't mean it's performant, or that you asked for it to be performant.
yes, llms can produce bad code, they can also produce good code, just like people
Over time, you develop a feel for which human coders tend to be consistently "good" or "bad". And you can eliminate the "bad".
With an LLM, output quality is like a box of chocolates, you never know what you're going to get. It varies based on what you ask and what is in its training data --- which you have no way to examine in advance.
You can't fire an LLM for producing bad code. If you could, you would have to fire them all because they all do it in an unpredictable manner.
It's probably a good idea to improve your test suite first, to preserve correctness.
Just copy and paste from an open source relational db repo
Easy. And more accurate!
Just like you can't develop musical taste without writing and listening to a lot of music, you can't teach your gut how to architect good code without putting in the effort.
Want to learn how to 10x your coding? Read design patterns, read and write a lot of code by hand, review PRs, hit stumbling blocks and learn.
I noticed the other day how I review AI code in literally seconds. You just develop a knack for filtering out the noise and zooming in on the complex parts.
There are no shortcuts to developing skill and taste.
Same experience, but the hype bros only need a shiny screengrab to proclaim that the age of "gatekeeping" SWE is over and get their click fix from the unknowing masses.
Related:
- <http://archive.today/2026.03.07-020941/https://lr0.org/blog/...> (I'm not consulting an LLM...)
- <https://web.archive.org/web/20241021113145/https://slopwatch...>