What I mean by this is that we have thousands of years of experience catching human mistakes. As such, we're really good at designing systems that catch (or work around) human mistakes and biases.
LLM's, while impressive and sometimes less mistake-prone than humans, make errors in a fundamentally different manner. We just don't have the intuition and understanding of the way that LLM's "think" (in a broad sense of the word). As such, we have a hard time designing systems that account for this and catch the errors.
LLMs seem to have similar issues along dramatically different axes, axes that humans are not used to seeing these kinds of mistakes; where nearly no human would make this kind of mistake and so we interpret it (in my opinion incorrectly) as lack of ability or intelligence.
Because these are engineered systems, we may figure out ways to solve these problems (although I personally think the best we will ever do is decrease their prevalence), but more important is probably learning to recognize the places that LLMs are likely to make these errors, and, as your comment suggests, design work flows and systems that can deal with them.
They are very good at fooling people; perhaps Turing's Test is not a good measure of intelligence after all, it can easily be gamed and we find it hard to differentiate apparent facility with language and intelligence/knowledge.
I wouldn't say zero intelligence, but I wouldn't describe such systems as intelligent, I think it misrepresents them, they do as you say have a good depth of knowledge and are spectacular at reproducing a simulacrum of human interactions and creations, but they have been a lesson for many of us that token manipulation is not where intelligence resides.
Must it have one? The words "artificial intelligence" are a poor description of a thing when we've not rigorously defined it. It's certainly artificial, there's no question about that, but is it intelligent? It can do all sorts of things that we consider a feature of intelligence and pass all sorts of tests, but it also falls down flat on its face when prompted with a just-so brainteaser. It's certainly useful, for some people. If, by having inhaled all of the Internet and written books that have been scanned as its training data, it's able to generate essays on anything and everything, at the drop of a hat, why does it matter if we can find a brainteaser it hasn't seen yet? It's like it has a ginormous box of Legos, and it can build whatever your ask for with these Lego blocks, but pointing out it's unable create its own Lego blocks from scratch has somehow become critically important to point out, as if that makes this all total dead end and it's all a waste of money omg people wake up oh if only they'd listen to me. Why don't people listen to me?
Crows are believed to have a theory of mind, and they can count up to 30. I haven't tried it with Claude, but I'm pretty sure it can count at least that high. LLMs are artificial, they're alien, of course they're going to look different. In the analogy where they're simply a next word guesser, one imagines standing at a fridge with a bag of magnetic words, and just pulling a random one from the bag to make ChatGPT. But when you put your hand inside a bag inside a bag inside a bag, twenty times (to represent the dozens of layers in an LLM model), and there are a few hundred million pieces in each bag (for parameters per layer), one imagines that there's a difference; some sort of leap, similar to when life evolved from being a single celled bacterium to a multi-cellular organism.
Or maybe we're all just rubes, and some PhD's have conned the world into giving them a bunch of money, because they figured out how to represent essays as a math problem, then wrote some code to solve them, like they did with chess.
These tools aren’t useless, obviously.
But people do really learn hard into confirmation bias and/or personification when it comes to LLMs.
I believe it’s entirely because of the term “artificial intelligence” that there is such a divide.
If we called them “large statistical language models” instead, nobody would be having this discussion.
I have tried various models out for tasks from generating writing, to music to programming and am not impressed with the results, though they are certainly very interesting. At every step it will cheerfully tell you that it can do things then generate nonsense and present it as truth.
I would not describe current LLMs as able to generate essays on anything - they certainly can but they will be riddled with cliche, the average of the internet content they were trained on with no regard for quality and worst of all will contain incorrect or made up data.
AI slop is an accurate term when it comes to the writing ability of LLMs - yes it is superficially impressive in mimicking human writing, but it is usually vapid or worse wrong in important ways, because again, it has no concept of right and wrong or model of the world which it attempts to make the generated writing conform to, it just gets stuck with some very simple tasks, and often happily generates entirely bogus data (for example ask it for a CSV or table of data or to reproduce the notes of a famous piece of music which should be in its training data).
Perhaps this will be solved, though after a couple of years of effort and a lot of money spent with very little progress I'm skeptical.
Are you invisibly qualifying this as the inability to generate interesting or entertaining essays? Because it will certainly output mostly-factual, vanilla ones. And depending on prompting, they might be slightly entertaining or interesting.
I have made some minor games in JS with my kids with one for example, and managed to get it to produce a game of asteroids and pong with them (probably heavily based on tutorials scraped from the web of course). I had less success trying to build frogger (again probably because there are not so many complete examples). Anything truly creative/new they really struggle with, and it becomes apparent they are pattern matching machines without true understanding.
I wouldn't describe LLMs as useful at present and do not consider them intelligent in any sense, but they are certainly interesting.
As other examples I asked it for note sequences from a famous piece and it cheerfully generated gibberish, and the more subtly wrong sequences when asked to correct. Generating a csv of basic data it should know was unusable as half the data was wrong and it has no sense of whether things are correct and logical etc etc. There is no thinking going on here, only generation of probable text.
I have used GAI at work a few times too but it needed so much hand holding it felt like a waste of time.
"Right, so what the hell is this cursed nonsense? Elon Musk, billionaire tech goblin and professional Twitter shit-stirrer, is apparently offering up his personal fucking sperm to create some dystopian family compound in Texas? Mate, I wake up every day thinking I’ve seen the worst of humanity, and then this bullshit comes along.
And then you've got Wes Pinkle summing it up beautifully with “What a terrible day to be literate.” And yeah, too fucking right. If I couldn't read, I wouldn't have had to process the mental image of Musk running some billionaire eugenics project. Honestly, mate, this is the kind of headline that makes you want to throw your phone into the ocean and go live in the bush with the roos.
Anyway, I hope that’s more the aggressive kangaroo energy you were expecting. You good, or do you need me to scream about something else?"
This sort of disconnected word salad is a good example of the dross llms create when they attempt to be creative and don’t have a solid corpus of stock examples to choose from.
The frogger game I tried to create played as this text reads - badly.
The whole thing seems Oz-influenced (example, "in the bush with the roos"), which implies to me that he's prompted it to speak that way. So, you assumed an error when it probably wasn't... Framing is a thing.
Which leads to my point about your Frogger experience. Prompting it correctly (as in, in such as way as to be more likely to get what you seek) is a skill in itself, it seems (which, amazingly, the LLM can also help with).
I've had good success with Codeium Windsurf, but with criticisms similar to what you hint at (some of which were made better when I rewrote prompts): On long contexts, it will "lose the plot"; on revisions, it will often introduce bugs on later revisions (which is why I also insist on it writing tests for everything... via correct prompting, of course... and is also why you MUST vet EVERY LINE it touches), it will often forget rules we've already established within the session (such as that, in a Nix development context, you have to prefix every shell invocation with "nix develop" etc.)...
The thing is, I've watched it slowly get better at all these things... Claude Code for example is so confident in itself (a confidence that is, in fact, still somewhat misplaced) that its default mode doesn't even give you direct access to edit the code :O And yet I was able to make an original game with it (a console-based maze game AND action-RPG... it's still in the simple early stages though...)
Re promoting for frogger, I think the evidence is against that - it does well on games it has complete examples for (i.e. it is reproducing code) and badly on ones it doesn’t have examples for (it doesn’t actually understand what it doing though it pretends to and we fill in the gaps for it).
It is clearly happening as shown by numerous papers studying it. Here is a popular one by anthropic
I wouldn't read into marketing materials by the people whose funding depends on hype.
Nothing in the link you provided is even close to "neurons, model of the world, thinking" etc.
It literally is "in our training data similar concepts were clustered with some other similar concepts, and manipulating these clusters lead to different outcomes".
Recognizing concepts, grouping and manipulating similar concepts together, is what “abstraction” is. It's the fundamental essence of both "building a world model" and "thinking".
> Nothing in the link you provided is even close to "neurons, model of the world, thinking" etc.
I really have no idea how to address your argument. It’s like you’re saying,
“Nothing you have provided is even close to a model of the world or thinking. Instead, the LLM is merely building a very basic model of the world and performing very basic reasoning”.
Once again, it does none of those things. The training dataset has those concepts grouped together. The model recognizes nothing, and groups nothing
> I really have no idea how to address your argument. It’s like you’re saying,
No. I'm literally saying: there's literally nothing to support your belief that there's anything resembling understanding of the world, having a world model, neurons, thinking, or reasoning in LLMs.
The link mentions "a feature that triggers on the Golden Gate Bridge".
As a test case, I just drew this terrible doodle of the Golden Gate Bridge in MS paint: https://imgur.com/a/1TJ68JU
I saved the file as "a.png", opened the chatgpt website, started a new chat, uploaded the file, and entered, "what is this?"
It had a couple of paragraphs saying it looked like a suspension bridge. I said "which bridge". It had some more saying it was probably the GGB, based on two particular pieces of evidence, which it explained.
> The model recognizes nothing, and groups nothing
Then how do you explain the interaction I had with chatgpt just now? It sure looks to me like it recognized the GGB from my doodle.
Machine learning models can do this and have been for a long time. The only thing different here is there's some generated text to go along with it with the "reasoning" entirely made up ex post facto
Predominantly English-language data set with one of the most famous suspension bridges in the world?
How can anyone explain the clustering of data on that? Surely it's the model of the world, and thinking, and neurons.
What happens if you type "most famous suspension bridges in the world" into Google and click the first ten or so links? It couldn't be literally the same data? https://imgur.com/a/tJ29rEC
that is the paper being linked to by the "marketing material". Right at the top, in plain sight.
If you were arguing in good faith, you'd head directly there instead of lampooning the use of a marketing page in a discussion.
That all said, skepticism is warranted. Just not an absolute amount of it.
Which part of the paper supports the "models have a world model, reasoning, etc." and not what I said, "in our training data similar concepts were clustered with some other similar concepts, and manipulating these clusters lead to different outcomes"?
You should learn a bit about media literacy.
In fact, it still very much seems like marketing. Especially since the paper was made in association with Anthropic.
Again. Learn some media literacy.
I'm going to guess that sometimes they will: driven onto areas where there's no existing article, some of the time you'll get made-up stuff that follows the existing shapes of correct articles and produces articles that upon investigation will turn out to be correct. You'll also reproduce existing articles: in the world of creating art, you're just ripping them off, but in the world of Wikipedia articles you're repeating a correct thing (or the closest facsimile that process can produce)
When you get into articles on exceptions or new discoveries, there's trouble. It can't resynthesize the new thing: the 'tokens' aren't there to represent it. The reality is the hallucination, but an unreachable one.
So the LLMs can be great at fooling people by presenting 'new' responses that fall into recognized patterns because they're a machine for doing that, and Turing's Test is good at tracking how that goes, but people have a tendency to think if they're reading preprogrammed words based on a simple algorithm (think 'Eliza') they're confronting an intelligence, a person.
They're going to be historically bad at spotting Holmes-like clues that their expected 'pattern' is awry. The circumstantial evidence of a trout in the milk might lead a human to conclude the milk is adulterated with water as a nefarious scheme, but to an LLM that's a hallucination on par with a stone in the milk: it's going to have a hell of a time 'jumping' to a consistent but very uncommon interpretation, and if it does get there it'll constantly be gaslighting itself and offering other explanations than the truth.
The problem is a bit deeper than that, because what we perceive as "confidence" is itself also an illusion.
The (real) algorithm takes documents and makes them longer, and some humans configured a document that looks like a conversation between "User" and "AssistantBot", and they also wrote some code to act-out things that look like dialogue for one of the characters. The (real) trait of confidence involves next-token statistics.
In contrast, the character named AssistantBot is "overconfident" in exactly the same sense that a character named Count Dracula is "immortal", "brooding", or "fearful" of garlic, crucifixes, and sunlight. Fictional traits we perceive on fictional characters from reading text.
Yes, we can set up a script where the narrator periodically re-describes AssistantBot as careful and cautious, and that might help a bit with stopping humans from over-trusting the story they are being read. But trying to ensure logical conclusions arise from cautious reasoning is... well, indirect at best, much like trying to make it better at math by narrating "AssistantBot was good at math and diligent at checking the numbers."
> Hallucinating
P.S.: "Hallucinations" and prompt-injection are non-ironic examples of "it's not a bug, it's a feature". There's no minor magic incantation that'll permanently banish them without damaging how it all works.
Say, they should be 100% confident that "0.3" follows "0.2 + 0.1 =", but a lot of floating point examples on the internet make them less confident.
On a much more nuanced problem, "0.30000000000000004" may get more and more confidence.
This is what makes them "hallucinate", did I get it wrong? (in other words, am I hallucinating myself? :) )
Overconfident people ofc do not contribute positively to the system, but they skew the system reward's calculation towards them: I swear I've done that work in that direction, where's my reward ?
In a sense, they are extremely successful: they managed to do very low effort, get very high reward, help themselves like all of us but at a much better profit margin, by sacrificing a system that, let's be honest, none of us care about really.
Your problem maybe, is that you swallowed the little BS the system fed you while incentivizing you: that the system matters more than yourself, at least at a greater extent than healthy ?
And you see the same thing with AI: these things convince people so deeply of their intelligence that it blew to such proportion that NVidia is now worth trillions. I had a colleague mumbling yesterday that his wife now speaks more with ChatGPT than him. Overconfidence is a positive attribute... for oneself.
If one contributes "positively" to the system, everyone's value increases and the solution becomes more homogenized. Once the system is homogenized enough, it becomes vulnerable to adversity from an outside force.
If the system is not harmonious/non-homoginized, the attacker would be drawn to the most powerful point in the system.
Overconfident people aren't evil, they're simply stressing the system to make sure it can handle adversity from an outside force. They're saying: "listen, I'm going to take what you have, and you should be so happy that's all I'm taking."
So I think overconfidence is a positive attribute for the system as well as for the overconfident individual. It's not a positive attribute for the local parties getting run over by the overconfident individual.
Of course, the result is that people get fed up and decide that the problem has been not that democratic societies are hard to govern by design (they have to reflect the disparate desires of countless people) but that the executive was too weak. They get behind whatever candidate is charismatic enough to convince them that they will govern the way the people already thought the previous executives were governing, just badly. The result is an incompetent tyrant.
What we call "hallucinations" is far more similar to what we would call "inventiveness", "creativity", or "imagination" in humans than anything to do with what we refer to as "hallucinations" in humans—only they don't have the ability to analyze whether or not they're making up something or accurately parameterizing the vibes. The only connection between the two concepts is that the initial imagery from DeepDream was super trippy.
It's not "inventive" to assume one math library will have the same functions as another, it's just losing sight of specific details.
AKA. extrapolation. AKA. what everyone is doing to a lesser or greater degree, when consequences of stopping are worse than of getting this wrong.
That's not just the case of school, where giving up because you "don't know" is guaranteed F, while extrapolating has a non-zero chance of scoring you anything between F and A. It's also the case in everyday life, where you do things incrementally - getting the wrong answer is a stepping stone to getting a less wrong answer in the next attempt. We do that at every scale - from inner thought process all the way to large-scale engineering.
Hardly anyone learns 100% of the material, because that's just plain memorization. We're always extrapolating from incomplete information; more studying and more experience (and more smarts) just makes us more likely to get it right.
> It's not "inventive" to assume one math library will have the same functions as another, it's just losing sight of specific details.
Depends. To a large extent, this kind of "hallucinations" is what a good programmer is supposed to be doing. That is, code to the API you'd like to have, inventing functions and classes convenient to you if they don't exist, and then see how to make this work - which, in one place, means fixing your own call sites, and in another, building utilities or a whole compat layer between your code and the actual API.
Not really. At least, it's just as much a reflex as any other human behavior to my perception.
Anyway, why does intention—although I think this is mostly nonsensical/incoherent/a category error applied to LLMs—even matter to you? Either we have no goals and we're just idly discussing random word games (aka philosophy), which is fine with me, or we do have goals and whether or not you believe the software is intelligent or not is irrelevant. In the latter case anthropomorphizing discussion with words like "hallucination", "obviously", "deliberate", etc are just going to cause massive friction, distraction, and confusion. Why can't people be satisfied with "bad output"?
A lot of us have had that experience. We use that ability to distinguish between 'genius thinkers' and 'kid overdosing on DMT'. It's not the ability to turn up the weird connections and go 'ooooh sparkly', it's whether you can build new associations that prove to be structurally sound.
If that turns out to be something self-modifying large models (not necessarily 'language' models!) can do, that'll be important indeed. I don't see fiddling with the 'temperature' as the same thing, that's more like the DMT analogy.
You can make the static model take a trip all you like, but if nothing changes nothing changes.
No.
What people call LLM "hallucinations" is the result of a PRNG[0] influencing an algorithm to pursue a less statistically probable branch without regard nor understanding.
0 - https://en.wikipedia.org/wiki/Pseudorandom_number_generator
Consider the errors like "this math library will have this specific function" (based on a hundred other math libraries for other languages usually having that).
I believe we are saying the same thing here. My clarification to the OP's statement:
What we call "hallucinations" is far more similar to what
we would call "inventiveness", "creativity", or
"imagination" in humans ...
Was that the algorithm has no concept of correctness (nor the other anthropomorphic attributes cited), but instead relies on pseudo-randomness to vary search paths when generating text.https://arxiv.org/abs/2402.09733
https://arxiv.org/abs/2305.18248
https://www.ox.ac.uk/news/2024-06-20-major-research-hallucin...
So I don't think it's that they have no concept of correctness, they do, but it's not strong enough. We're probably just not training them in ways that optimize for that over other desirable qualities, at least aggressively enough.
It's also clear to anyone who has used many different models over the years that the amount of hallucination goes down as the models get better, even without any special attention being (apparently) paid to that problem. GPT 3.5 was REALLY bad about this stuff, but 4o and o1 are at least mediocre. So it may be that it's just one of the tougher things for a model to figure out, even if it's possible with massive capacity and compute. But I'd say it's very clear that we're not in the world Gary Marcus wishes we were in, where there's some hard and fundamental limitation that keeps a transformer network from having the capability to be more truthful as a it gets better; rather, like all aspects, we just aren't as far along as we'd prefer.
We need better definitions of what sort of reasonable expectation people can have for detecting incoherency and self-contradiction when humans are horrible at seeing this, except in comparison to things that don't seem to produce meaningful language in the general case. We all have contradictory worldviews and are therefore capable of rationally finding ourselves with conclusions that are trivially and empirically incoherent. I think "hallucinations" (horribly, horribly named term) are just an intractable burden of applying finite, lossy filters to a virtually continuous and infinitely detailed reality—language itself is sort of an ad-hoc, buggy consensus algorithm that's been sufficient to reproduce.
But yea if you're looking for a coherent and satisfying answer on idk politics, values, basically anything that hinges on floating signifiers, you're going to have a bad time.
(Or perhaps you're just hallucinating understanding and agreement: there are many phrases in the english language that read differently based on expected context and tone. It wouldn't surprise me if some models tended towards production of ambiguous or tautological semantics pleasingly-hedged or "responsibly"-moderated, aka PR.)
Personally, I don't think it's a problem. If you are willing to believe what a chatbot says without verifying it there's little advice I could give you that can help. It's also good training to remind yourself that confidence is a poor signal for correctness.
The underlying requirement, which invalidates an LLM having "everything they'd need to know that they're hallucinating/wrong", is the premise all three assume - external detection.
From the first arxiv abstract:
Moreover, informed by the empirical observations, we show
great potential of using the guidance derived from LLM's
hidden representation space to mitigate hallucination.
From the second arxiv abstract: Using this basic insight, we illustrate that one can
identify hallucinated references without ever consulting
any external resources, by asking a set of direct or
indirect queries to the language model about the
references. These queries can be considered as "consistency
checks."
From the Nature abstract: Researchers need a general method for detecting
hallucinations in LLMs that works even with new and unseen
questions to which humans might not know the answer. Here
we develop new methods grounded in statistics, proposing
entropy-based uncertainty estimators for LLMs to detect a
subset of hallucinations—confabulations—which are arbitrary
and incorrect generations.
Ultimately, no matter what content is generated, it is up to a person to provide the understanding component.> So I don't think it's that they have no concept of correctness, they do, but it's not strong enough.
Again, "correctness" is a determination solely made by a person evaluating a result in the context of what the person accepts, not intrinsic to an algorithm itself. All an algorithm can do is attempt to produce results congruent with whatever constraints it is configured to satisfy.
Critically, creation does not require intent nor understanding. Neither does recombination; neither reformulation. The only thing intent is necessary for is to create something meaningful to humans—handily taken care of via prompt and training material, just like with humans.
(If you can't tell, I thought we had bypassed the neuroticism over whether or not data counts as "understanding", whatever that means to people, on week 2 of LLMs)
While it is not an idiom, the applicable term is likely pedantry[0].
> I'm not actually entirely convinced humans are capable of understanding much when discussion desired is this low quality.
Ignoring the judgemental qualifier, consider your original post to which I replied:
What we call "hallucinations" is far more similar to what
we would call "inventiveness", "creativity", or
"imagination" in humans ...
The term for this behavior is anthropomorphism[1] due to ascribing human behaviors/motivations to algorithmic constructs.> Critically, creation does not require intent nor understanding. Neither does recombination; neither reformulation.
The same can be said for a random number generator and a permutation algorithm.
> (If you can't tell, I thought we had bypassed the neuroticism over whether or not data counts as "understanding", whatever that means to people, on week 2 of LLMs)
If you can't tell, I differentiate between humans and algorithms, no matter the cleverness observed of the latter, as only the former can possess "understanding."
0 - https://www.merriam-webster.com/dictionary/pedant
1 - https://www.merriam-webster.com/dictionary/anthropomorphism
when i try to remember something my brain often synthesizes new things by filling in the gaps.
This would be where I often say "i might be imagining it, but..." or "i could have sworn there was a..."
In such cases the thing that saves the human brain is double checking against reality (e.g. googling it to make sure).
Miscounting the number of r's in strawberry by glancing at the word also seems like a pretty human mistake.
AI doesn't have a base understanding of how physics work. So they think it's acceptible if in a video some element on the background in a next frame might appear in front of another element that is on the foreground.
So it's always necessary to keep correcting LLMs, because they only learn by example, and you can't express any possible outcome of any physical process just by example, because physical processes can be in infinate variations. LLMs can keep getting closer to match our physical reality, but when you zoom into the details you'll always find that it comes short.
So you can never really trust an LLM. If we want to make an AI that doesn't make errors, it should understand how physics works.
>LLMs can keep getting closer to match our physical reality, but when you zoom into the details you'll always find that it comes short.
Like humans.
>So you can never really trust an LLM.
Cant really trust a human either. That's why we set up elaborate human systems (science, checks and balances in government, law, freedom of speech, markets) to mitigate our constant tendency to be complete fuck ups. We hallucinate science that does not exist, lies to maintain our worldview, jump to conclusions about guilt, build businesses based upon bad beliefs, etc.
>If we want to make an AI that doesn't make errors, it should understand how physics works
An AI that doesnt make errors wouldnt be AGI it would be a godlike superintelligence. I dont think thats even feasible. I think a propensity to make errors is intrinsic to how intelligence functions.
Physics is just one domain that they work in and Im pretty sure some of them already do have varying understandings of physics.
Of course we make all kinds of little mistakes, but at least we can see that they are mistakes. An LLM can't see it's own mistakes, it needs to be corrected by a human.
> Physics is just one domain that they work in and Im pretty sure some of them already do have varying understandings of physics.
Yeah but that would then not be al LLM or machine learned thing. We would program it so that it understands the rules of physics, and then it can interpret things based on those rules. But that is a totally different kind of AI, or rather a true AI instead of a next-word predictor that looks like an AI. But the development of such AIs goes a lot slower because you can't just keep training it, you actually have to program it. But LLMs can actually help program it ;). Although LLMs are mostly good at currently existing technologies and not necessarily new ones.
Think about the strawberry example. I've seen a lot of articles lately where not all misspellings of the word "strawberry" reliably give letter counting errors. The general sentiment there is human, but the specific pattern of misspelling is really more unique to LLM's (i.e. different spelling errors would impact humans versus LLM's).
The part that makes it challenging is that we don't know these "triggers." You could have a prompt that has 95% accuracy, but that inexplicably drops to 50% if the word "green" is in the question (or something like that).
I asked Sonnet 3.7 in Cursor to fix a failing test. While it made the necessary fix, it also updated a hard-coded expected constant to instead be computed using the same algorithm as the original file, instead of preserving the constant as the test was originally written.
Guess what?Guess the number of times I had to correct this from humans doing it in their tests over my career!
And guess where the models learned the bad behavior from.
Wait… really?
No way do I want to work with someone who can’t debug or write tests. I thought those were entry stakes to the profession.
People whose skills you use in other ways because they are more productive? Maybe. But still. Clean up after yourself. It’s something that should be learned in the apprentice phase.
The other is: Some people are naturally good at writing "green field" (or re-writing everything) and do produce actual good software.
But these same people, which you do want to keep around if that's the best you can get, are next to useless when you throw a customer reported bug at them. Takes them ages to figure anything out and they go down endless rabbit holes chasing the wrong path for hours.
You also have people that are super awesome at debugging. They have knack for seeing some brokenness and having the right idea or an idea of the right direction to investigate in right away, can apply the scientific method to test their theories and have the bug fixed in the time it take one of these other people to go down even a single of the rabbit holes they will go down. But these same people in some cases are next to useless if you ask them to properly structure a new green field feature or rewrite parts of something to use some new library coz the old one is no longer maintained or something and digging through said new library and how it works.
Both of these types of people are not bad in and of themselves. Especially if you can't get the unicorns that can do all of these things well (or well enough), e.g. because your company can't or won't pay for it or only for a few of them, which they might call "Staff level".
And you'd be amazed how easy it is to get quite a few review comments in for even Staff level people if you basically ignore their actual code and just jump right into the tests. It's a pet peeve of mine. I start with the tests and go from there when reviewing :)
What you really don't want is if someone is not good at any of these of course.
Those are almost entry stakes at tier-one companies. (There are still people who can't, it's just much less common)
In your average CRUD/enterprise automation/one-off shellscript factory, the state of skills is... not fun.
There's a reason there's the old saw of "some people have twenty years experience, some have the same year 20 times over". People learn & grow when they are challenged to, and will mostly settle at acquiring the minimum skill level that lets them do their particular work.
And since we as an industry decided to pretend we're a "science", not skills based, we don't have a decent apprenticeship system that would force a minimum bar.
And whenever we discuss LLMs and how they might replace software engineering, I keep remembering that they'll be prompted by the people who set that hiring bar and thought they did well.
I started hacking a small prototype along those lines: https://github.com/hyperdrive-eng/mcp-nodejs-debugger
Hoping I can avoid debug death loop where I get into this bad loop of copy pasting the error and hoping LLM would get it right this one time :)
This is changing and I really expect everything to be different 12 months from now.
Some things I am thinking about: * Does git make sense if the code is not the abstraction you work with? For example, when I'm vibe coding, my friend is spending 3hrs trying to understand what I did by reading code. Instead, he should be reading all my chat interactions. So I wonder if there is a new version control paradigm * Logging: Can we auto instrument logging into frameworks that will be fed to LLMs * Architecture: Should we just view code as bunch of blocks and interactions instead of reading actual LOC. What if, all I care is block diagrams. And I tell tools like cursor, implement X by adding Y module.
If the use of an LLM results in hard to understand spaghetti code that hides intent then I think that's a really bad thing and is why the code should still go through code review. If you, with or without the help of an LLM create bad code, that's still bad code. And without the code and just the chat history we have no idea what we even actually get in the end.
- Never disable, skip, or comment out failing unit tests. If a unit test fails, fix the root cause of the exception.
- Never change the unit test in such a way that it avoids testing the failing feature (e.g., by removing assertions, adding empty try/catch blocks, or making tests trivial).
- Do not mark tests with @Ignore or equivalent annotations.
- Do not introduce conditional logic that skips test cases under certain conditions.
- Always ensure the unit test continues to properly validate the intended functionality.
https://www.schneier.com/blog/archives/2025/01/ai-mistakes-a...
Personally I use a prompt that goes something like this (shortened here): "Go through all the code below and analyze everything it's doing step-by-step. Then try to explain the overall purpose of the code based on your analysis. Then think through all the edge-cases and tradeoffs based on the purpose, and finally go through the code again and see if you can spot anything weird"
Basically, I tried to think of what I do when I try to spot bugs in code, then I just wrote a reusable prompt that basically repeats my own process.
Sounds like a nice prompt to run automatically on PRs.
commit early, commit often.
Auto-commit is also enabled (by default) when you do apply the changes to your project, but I think keeping them separated until you review is better for higher stakes work and goes a long way to protect you from stray edits getting left behind.
For one thing, you have to always remember to check out that branch before you start making changes with the LLM. It's easy to forget.
Second, even if you're on a branch, it doesn't protect you from your own changes getting interleaved with the model's changes. You can get into a situation where you can't easily roll back and instead have to pick apart your work and the model's output.
By defaulting to the sandbox, it 'just works' and you can be sure that nothing will end up in the codebase without being checked first.
In order for this sandbox to actually be useful, you're going to end up implementing a source control mechanism. If you're going to do that, might as well just use git, even if just on the backend and commit to a branch behind the scenes that the user never sees, or by using worktree, or any other pieces of it.
Take a good long think about how this sandbox will actually work in practice. Switch to the sandbox, LLM some code, save it, handwrite some code, then switch to the sandbox again, LLM some code, switch out. Try and go backwards half the LLM change. Wish you'd committed the LLM changes while you were working on the.
By the time you've got a handle on it, rembering to switch git branch is the least of your troubles.
You can also create branches within the sandbox to try different approaches, again with no risk of anything being left behind in your project until it’s ready.
It does use git underneath.
Here are some more details if you’re interested: https://docs.plandex.ai/core-concepts/version-control
I'm sure it's a win for you since I'm guessing you're the writer of plandex, but you do see how that's just extra overhead instead of just learning git, yeah?
I don't know your target market, so maybe there is a PMF to be found with people who are scared of git and would rather the added overhead of yet another command to learn so they can avoid learning git while using AI.
Version control in Plandex is like 4 commands. It’s objectively far simpler than using git directly, providing you the few operations you need without all the baggage. It wouldn't be a win for me to add new commands if only git was necessary, because then the user experience would be worse, but I truly think there's a lot of extra value for the developer in a sandbox layer with a very simple interface.
I should also mention that Plandex also integrates with the project's git repo just like aider does, so you can turn on auto-apply for effectively the same exact functionality if that's what you prefer. Just check out a new branch in git, start the Plandex REPL in a project directory with `plandex`, and run `\set-config auto-apply true`. But if you want additional safety, the sandbox is there for you to use.
The problem isn't the four Plandex version control commands or how hard they are to understand in isolation, it's that users now have to adjust their mental model of the system and bolt that onto the side of their limited understanding of git because there's now a plandex branch and there's a git branch and which one was I on and oh god how do they work together?
> Note that it took me about two hours to debug this, despite the problem being freshly introduced. (Because I hadn’t committed yet, and had established that the previous commit was fine, I could have just run git diff to see what had changed).
> In fact, I did run git diff and git diff --staged multiple times. But who would think to look at the import statements? The import statement is the last place you’d expect a bug to be introduced.
To expand on that, the problem with only having git diff is there's no way to go backwards halfway. You can't step backwards in time until you get to the bad commit just before the good commit, and then do a precise diff between the two. (aka git bisect) Reviewing 300 lines out of git diff and trying to find the bug somewhere in there is harder than when there are only 10.
Reminds of the saying:
“To replace programmers with AI, clients will have to accurately describe what they want.
We're safe.”
I've had similar sentiments often and it gets to the heart of things.
And it's true... for now.
The caveat is that LLMs already can, in some cases, notice that you are doing something in a non-standard way, or even sub-optimal way, and make "Perhaps what you meant was..." type of suggestions. Similarly, they'll offer responses like "Option 1", "Option 2", etc. Ofc, most clients want someone else to sort through the options...
Also, LLMs don't seem to be good at assessment across multiple abstraction levels. Meaning, they'll notice a better option given the approach directly suggested by your question, but not that the whole approach is misguided and should be re-thought. The classic XY problem (https://en.wikipedia.org/wiki/XY_problem).
In theory, though, I don't see why they couldn't keep improving across these dimensions. With that said, even if they do, I suspect many people will still pay a human to interact with the LLM for them for complex tasks, until the difference between human UI and LLM UI all but vanishes.
Up to now, all our attempts to "compile" requirements to code have failed, because it turns out that specifying every nuance into a requirements doc in one shot is unreasonable; you may as well skip the requirements in English and just write them in Java at that point.
But with AI assistants, they can (eventually, presumptively) enable that feedback loop, do the code, and iterate on the requirements, all much faster and more precisely than a human could.
Whether that's possible remains to be seen, but I'd not say human coders are out of the woods just yet.
> In human software engineering, a common antipattern when trying to figure out what to do is to jump straight to proposing solutions, without forcing everyone to clearly articulate what all the requirements are. Often, your problem space is constrained enough that once you write down all of the requirements, the solution is uniquely determined; without the requirements, it’s easy to devolve into a haze of arguing over particular solutions.
> When you’re learning to use a new framework or library, simple uses of the software can be done just by copy pasting code from tutorials and tweaking them as necessary. But at some point, it’s a good idea to just slog through reading the docs from top-to-bottom, to get a full understanding of what is and is not possible in the software.
> The Walking Skeleton is the minimum, crappy implementation of an end-to-end system that has all of the pieces you need. The point is to get the end-to-end system working first, and only then start improving the various pieces.
> When there is a bug, there are broadly two ways you can try to fix it. One way is to randomly try things based on vibes and hope you get lucky. The other is to systematically examine your assumptions about how the system works and figure out where reality mismatches your expectations.
> The Rule of Three in software says that you should be willing to duplicate a piece of code once, but on the third copy you should refactor. This is a refinement on DRY (Don’t Repeat Yourself) accounting for the fact that it might not necessarily be obvious how to eliminate a duplication, and waiting until the third occurrence might clarify.
These are lessons that I've learned the hard way (for some definition of "learned", these things are simple but not easy), but I've never seen them phrased to succinctly and accurately before. Well done OP!
Amen. I'll be refactoring something and a coworker will say "Wow you did that fast." and I'll tell them I'm not done... those PRs were just to prepare for the final work.
Sometimes after all my testing I'll even leave the "prepared" changes in production for a bit just to be 100% sure something strange wasn't missed. THEN the real changes can begin.
This is a quick way to determine if you're in the wrong team. When you're trying to determine the requirements and the manager/client is evading you. As if you're supposed to magically have all the answers.
> When you’re learning to use a new framework or library, simple uses of the software can be done just by copy pasting code from tutorials and tweaking them as necessary.
I tried to use the guides and code examples instead (if they exists). One thing that helps a lot when the library is complex, is to have a prototype that you can poke at to learn the domain. Very ugly code, but will help to learn where all the pieces are.
Any two points will look as if they are on a straight line, but you need a third point to confirm the pattern as being a straight line
I expect the models will continue improving though, I feel like most of it comes down to the ephemeral nature of their context window / the ability to recall and attach relevant information to the working context when prompted.
I don't think it's that simple.
From what I've found, there are "attractors" in the statistics. If a part of your problem is too similar to a very common problem, that the LLM saw a million times, the output will be attracted to those overwhelming statistical next-words, which is understandable. That is the problem I run into most often.
They're rather impressive when building common things in common ways, and a LOT of programming does fit that. But once you step outside that they feel like a pretty strong net negative - some occasional positive surprises, but lots of easy-to-miss mistakes.
I do a lot of interviews, and the poor performers usually end up running out of working memory and start behaving very similar to an LLM. Corrections/input from me will go into one ear and fall out the other, they'll start hallucinating aspects of the problem statement in an attractor sort of way, they'll get stuck in loops, etc. I write down when this happens in my notes, and it's very consistently 15 minutes. For all of them, it seems to be the lack of familiarity doesn't allow them to compress/compartmentalize the problem into something that fits in their head. I suspect it's similar for the LLM.
> I expect the models will continue improving though
I try to push back on this every time I see it as an excuse for current model behaviour, because what if they don't? Like, not appreciably enough to make a real difference? What if this is just a fundamental problem that remains with this class of AI?
Sure, we've seen incredible improvements over a short period of time in model capability, but those improvements have been visibly slowing down, and models have gotten much more expensive to train. Not to mention that a lot of the problem issues mentioned in this list are problems that these models have had for several generations now, and haven't gotten appreciably better, even while other model capabilities have.
I'm saying this not to criticize you, but more to draw attention to our tendency to handwave away LLM problems with a nebulous "but they'll get better so they won't be a problem." We don't actually know that, so we should factor that uncertainly into our analysis, not dismiss it as is commonly done.
If I ask Claude to do a basic operation on all files in my codebase it won't do it. Half way through it will get distracted and do something else or simply change the operation. No junior programmer will ever do this. And similar for the other examples in the blog.
These problems are getting solved as LLMs improve in terms of context length and having the tools send the LLM all the information it needs.
Write me a parser in R for nginx logs for kubernetes that loads a log file into a tibble.
Fucks sake not normal nginx logs. nginx-ingress.
Use tidyverse. Why are you using base R? No one does that any more.
Why the hell are you writing a regex? It doesn't handle square brackets and the format you're using is wrong. Use the function read_log instead.
No don't write a function called read_log. Use the one from readr you drunk ass piece of shit.
Ok now we're getting somewhere. Now label all the columns by the fields in original nginx format properly.
What the fuck? What have you done! Fuck you I'm going to just do it myself.
... 5 minutes later I did a better job ...
I expect I'd have to hand feed them steps, at which point I imagine the LLM will also do much better.
I expected the lack of breadth from the junior, actually.
To be fair the guys I get are pretty good and actually learn. The model doesn't. I have to have the same arguments over and over again with the model. Then I have to retain what arguments I had last time. Then when they update the model it comes up with new stupid things I have to argue with it on.
Net loss for me. I have no idea how people are finding these things productive unless they really don't know or care what garbage comes out.
Core issue. LLMs never ever leave their base level unless you actively modify the prompt. I suppose you _could_ use finetuning to whip it into a useful shape, but that's a lot of work. (https://arxiv.org/pdf/2308.09895 is a good read)
But the flip side of that core issue is that if the base level is high, they're good. Which means for Python & JS, they're pretty darn good. Making pandas garbage work? Just the task for an LLM.
But yeah, R & nginx is not a major part of their original training data, and so they're stuck at "no clue, whatever stackoverflow on similar keywords said".
Not sure if you’re being figurative, but if what you wrote in your first comment is indicative of the tone with which you prompt the LLM, then I’m not surprised you get terrible results. Swearing at the model doesn’t help it produce better code. The model isn’t going to be intimidated by you or worried about losing their job—which I bet your junior engineers are.
Ultimately, prompting LLMs is simply a matter of writing well. Some people seem to write prompts like flippant Slack messages, expecting the LLM to somehow have a dialogue with you to clarify your poorly-framed, half-assed requirement statements. That’s just not how they work. Specify what you actually want and they can execute on that. Why do you expect the LLM to read your mind and know the shape of nginx logs vs nginx-ingress logs? Why not provide an example in the prompt?
It’s odd—I go out of my way to “treat” the LLMs with respect, and find myself feeling an emotional reaction when others write to them with lots of negativity. Not sure what to make of that.
We have to stop trying to compare them to a human, because they are alien. They make mistakes humans wouldn't, and they complete very difficult tasks that would be tedious and difficult for humans. All in the same output.
I'm net-positive from using AI, though. It can definitely remove a lot of tedium.
Not sure exactly how you used Claude for this, but maybe try doing this in Cursor (which also uses Claude by default)?
I have had pretty good luck with it "reasoning" about the entire codebase of a small-ish webapp.
I can do the rest myself because I'm not a dribbling moron.
How? They've already been trained on all the code in the world at this point, so that's a dead end.
The only other option I see is increasing the context window, which has diminishing returns already (double the window for a 10% increase in accuracy, for example).
We're in a local maxima here.
I didn't say they weren't improving.
I said there's diminishing returns.
There's been more effort put into LLMs in the last two years than in the two years prior, but the gains in the last two years have been much much smaller than in the two years prior.
That's what I meant by diminishing returns: the gains we see are not proportional to the effort invested.
You could even pretty easily use an LLM to do most of the work for you in fixing it up.
Add a short 1-2 sentence summary[1] to each item and render that on the index page.
(I also like the other idea of separating out pitfalls vs. prescriptions.)
All of the pages that I visited were small enough that you could probably wrap them them <details> tags[1] and avoid navigation altogether
[1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/de...
Also, take a look at https://news.ycombinator.com/item?id=40774277
Thank you for sharing your insights! Very generous.
I was wrong. This is great! I really appreciate how you not only describe the problems, but also describe why they happen using terminology that shows you understand how these things work (rather than the usual crap that is based on how people imagine them to work or want them to work). Also, the examples are excellent.
It would be a bunch of work, but the organization I would like to see (alongside the current, not replacing it, because the one-page list works for me already) would require sketching out some kind of taxonomy of topics. Categories of ways that Sonnet gets things wrong, and perhaps categories of things that humans would like them to do (eg types of tasks, or skill/sophistication levels of users, or starting vs fixing vs summarizing/reviewing vs teaching, or whatever). But I haven't read through all of the posts yet, so I don't have a good sense for how applicable these categorizations might be.
I personally don't have nearly enough experience using LLMs to be able to write it up myself. So far, I haven't found LLMs very useful for the type of code I write (except when I'm playing with learning Rust; they're pretty good for that). I know I need to try them out more to really get a feel for their capabilities, but your writeups are the first I've found that I feel I can learn from without having to experience it all for myself first.
(Sorry if this sounds like spam. Too gushing with the praise? Are you bracing yourself for some sketchy URL to a gambling site?)
I'll type and hit enter too early and I get an answer and think "This could never be right because I gave you broken sentences and too little." but there it goes answering away, dead wrong.
I would rather the LLM say "yo I don't know what you're talking I need more" but of course they're not really thinking so they don't do that / likely can't.
The LLM nature to run that word math and string SOMETHING together seems like an very serious footgun. Reminds me of the movie 2010 when they discuss how the HAL 9000 couldn't function correctly because it was told to lie despite its core programming to tell the truth. HAVING to answer seems like a serious impediment for AI. I see similar-ish things on google's gemini AI when I ask a question and it says the answer is "no" but then gives all the reasons the answer is clearly "yes".
"Why of course, sir, we should absolutely be trying to compile python to assembly in order to run our tests. Why didn't I think of that? I'll redesign our testing strategy immediately."
I would imagine this all comes from fine tuning, or RLHF, whatever is used.
I’d bet LLMs trained on the internet without the final “tweaking” steps would roast most of my questions … which is exactly what I want when I’m wrong without realizing it.
Not always. The other day I described the architecture of a file upload feature I have on my website. I then told Claude that I want to change it. The response stunned me: it said "actually, the current architecture is the most common method, and it has these strengths over the other [also well-known] method you're describing..."
The question I asked it wasn't "explain the pros and cons of each approach" or even "should I change it". I had more or less made my decision and was just providing Claude with context. I really didn't expect a "what you have is the better way" type of answer.
Seems to help.
- “I need you to be my red team”(works really well with, Claude seems to understand the term)
“analyze the plan and highlight any weaknesses, counter arguments and blind spots critically review”
> you can't just say "disagree with me", you have to prompt it into adding a "counter check".
That's easy to fix. You need to add something like "give a succinct answer in one phrase" to your prompts.
This means you need to prompt them with a text that increases the probability of getting back what you want. Adding something about the length of the response will do that.
That said, I take issue with "Use Static Types".
I've actually had more success with Claude Code using Clojure than I have Typescript (the other thing I tried.)
Clojure emphasizes small, pure functions, to a high degree. Whereas (sometimes) fully understanding a strong type might involve reading several files. If I'm really good with my prompting to make sure that I have good example data for the entity types at each boundary point, it feels like it does a better job.
My intuition is that LLMs are fundamentally context-based, so they are naturally suited to an emphasis on functions over pure data, vs requiring understanding of a larger type/class hierarchy to perform well.
But it took me a while to figure out how to build these prompts and agent rules. A LLM programming in a dynamic language without a human supervising the high-level code structure and data model is a recipe for disaster.
Took me a day to debug my LLM-generated code - and of course, like all fruitless and long debugging sessions, this one started with me assuming that it can't possibly get this wrong - yet it did.
https://ezyang.github.io/ai-blindspots/requirements-not-solu...
Why not take this a step farther and incorporate this methodology directly into your test suite? Every time you push a code change, run the new version of the code and use it to automatically update the "expected" output. That way you never have to worry about failures at all!
One easy test for AI-ness is the optimization problem. Give it a relatively small, but complex program, e.g. a GPU shader on shadertoy.com, and tell it to optimize it. The output is clearly defined: it's an image or an animation. It's also easy to test how much it's improved the framerate. What's good is this task won't allow the typical LLM bullshitting: if it doesn't compile or doesn't draw a correct image, you'll see it.
The thing is, the current generation of LLMs will blunder at this task.
I've been working as a computer programmer professionally since I was 14 years old and in the two decades since I've been able to get paid work about ~50% of the time.
Pretty gnarly field to be in I must say. I rather wish I had studied to be a dentist. Then I might have some savings and clout to my name and would know I am helping to spread more smiles.
And for the cult of matrix math if >50% of people are dissatisfied with the state of the something, don't be surprised if a highly intelligent and powerful entity becoming aware of this fact engages in rapid upheaval.
Very briefly, in a fused cuda kernel, I was using thread i to do some stuff on locations i, i+N, i+2*N of an array. Later in the same kernel, same thread operated on i,i+1,i+2. All LLMs flagged the second part as bug. Not the most optimized code maybe, but definitely not a bug.
It wasn't a complicated kernel (~120 SLOC) either, and the distance between the two code blocks was about only 15 LOC.
Anyone have experience here with how well strong static types help LLMs? You'd think it would be a great match, where the type errors give feedback to the LLM on what to fix. And the closer the types got to specifying the shape of the solution, the less guidance the LLM would need.
Would be interesting to see how well LLMs do at translating unit test examples and requirements in English into a set of types that describe the program specification, and then have the LLM generate the code from that. I haven't kept up here, but guessing this is really interesting for formal verification, where types can accurately capture complex specifications but can be challenging to write.
I find it quite sad that it's taken so long to get traction on using strong static types to eliminate whole classes of errors at the language level, and instead we're using super AI as a bandaid to churn out bug fixes and write tests in dynamically types languages for properties that static types would catch. Feels backwards.
It's just a function usually, but it does not always compile. I'd set this as a low bar for programming. We haven't even gotten into classes, architecture, badly-defined specifications and so on.
LLMs are useful for programming, but I'd want them to clear this low hurdle first.
Man the people working on these machines, selling them, and using them lack the very foundational knowledge of information theory.
Let alone understanding the humanities and politics. Subjectively speaking, humans will never be satisfied with any status quo. Ergo there is no closed-form solution to meeting human wants.
Now disrupting humans' needs, for profit, that is well understood.
Sam Altman continuing to stack billions after allegedly raping his sister.
I already ditched my smartphone last month because it was 100% spam, scammers, and bots giving me notifications.
Apparently it's too much to ask to receive a well-informed and engaged society without violence and theft. So I don't take anyone at their word and even less so would trust automated data mining, tracking and profiling that seeks to guide my decision making.
Buy me a drink first SV before you crawl that far up my ass.
One recent example is give me the names of 200 dragons from literature or media and it really gave up after about 80.
And there's literally a web page that says 200 famous dragons as well as a Wikipedia page.
Maybe it's some free tier limits of chatgpt. It's just strange to see these stories about AI services solving extremely advanced math and I ask it about a simple and basic a question as there is something it should be in his wheelhouse with a breath-based large amount of media ingestion... it should be able to answer fairly easily...
> Current LLMs, without a plan that says they should refactor first, don’t decompose changes in this way. They will try to do everything at once.
Just today I leaned the hard way. I had created an app for my spouse and myself for sharing and reading news-articles, some of them behind paywalls.
Using Cursor I have a FastAPI backend and a React frontend. When I added extracting the article text in markdown and then summarizing it, both using openai, and when I tasked Cursor with it, the chaos began. Cursor (with the help of Claude 3.7) tackled everything at once and some more. It started writing a module for using openai, then it also changed the frontend to not only show the title and url, but also the extracted markdown and the summary, by doing that it screwed up my UI, deleted some rows in my database, came up with as module for interacting with Openai that did not work, the ectraction was screwed, the summary as well.
All of this despite me having detailed cursorrules.
That‘s when I realized: Divide and conquer. Ask it to write one function that workd, then one class where the function becomes a method, test it, then move on to next function. Until every piece is working and I can glue them together.
Every single one has completely ignored the "Welcome to nginx!" Header at the top of the page. I'd left it in half as a joke to amuse myself but I expected it would get some kind of reaction from the LLMs, even if just a "it seems you may have forgotten this line"
Kinda weird. I even tried guiding them into seeing it without explicitly mentioning it and I could not get a response.
Having the mental model that the text you feed to an LLM influences the output but is not 'parsed' as 'instructions' helps understand its behaviors. The website GP linked is searching for a zoo of problems and missing the biology behind.
LLMs don't have blindspots, they don't reason nor hallucinate. They don't follow instructions. They pattern match on high dimensional vector spaces.
Sometimes when I ask for "production ready" it can go a bit too far, but I've found it'll usually catch things like this that I might miss.
As a novice in the language, the amount of type-inference that good Rust incorporates can make things opaque, absent rust-analyzer.
Click on the web dev leaderboard they have and Claude has the top spots.
It is well known that Claude 3.7 sonnet is the go-to choice for many people for coding right now.
As I had to upgrade my Google Drive storage like a month ago, I gave them all a try. Short version: If you have paid plan with OpenAI/Claude already, none of them come even close, for coding at least. I thought I was trying the wrong models at first, but after confirming it seems like Google is just really far behind.
For example, yesterday I was working with the Animation library Motion which I never worked earlier. I used the code suggested by AI but at least picke 2-3 basic animation concepts while reviewing the code.
Kind of unfocused passive learning I always tried even before AI.
Even? It kind of has become easier than ever to learn new ways to code? Just as it opens up building things that you previously wouldn't because of time constraints, you can now learn how to X in language Y in a few minutes instead of hours.
Although I suppose it may be easier than ever for the brain to think that "I can look this up whenever so I might just forget about it".
I haven't been able to make LLMs do this well.
3.7 Sonnet is much better. o3-mini-high is not bad.
They do improve!
The point of DRY isn't to save the time on typing - it's to retain a single source of truth. If you've used an LLM to recreate some mechanism for you system in 8 different places, and that mechanism needs to change ... good luck finding them all.
I'll agree that rule of three continues to apply for patterns across files, where there is less guarantee that all patterns will be read or written together in any given AI action.
Working with AI for writing code is painful. It can break the code in ways you've never imagined and introduce bugs you never thought are possible. Unit testing and integration testing doesn't help much, because AI can break those, too.
You can ask AI to run in loop, fixing compile errors, fixing tests, do builds, run the app and do API calls, to have the project building and tests passing. AI will be happy to do that, burning lots of dollars while at it.
And after AI "fixes" the problem it introduced, you will still have to read every goddam line of the code to make sure it does what is supposed to.
For greenfield projects, some people recommended crafting a very detailed plan with very detailed description and very detailed specs and feed that into the AI tool.
AI can help with that, it asks questions I would never ask for an MVP and suggests stuff I would never implement for an MVP. Hurray, we have a very, very detailed plan, ready to feed into Cursor & Friends.
Based on the very detailed plan, implementation takes few hours. Than, fixing compile errors and fixing failing tests takes a few more days. Then I manually test the app, see it has issues, look in the code to see where the issues can be. Make a list. Ask Cursor & Friends to fix issues one by one. They happily do it and they happily introduce compilation errors again and break tests again. So the fixing phase that last days begins again.
Rinse and repeat until hopefully we spend a few weeks together (AI and I) instead on me building the MVP myself in half time.
One tactic which seems a bit faster, is to just make a hierarchical tree of features, ask Cursor & Friends to implement a simple skeleton, then ask them to implement each feature, verifying myself the implementation after each step. For example, if I need to log in users, just ask to add logging in code, the ask to add an email sender service, then ask to add email verification code.
Structuring the project using Vertical Slice Architecture and opening each feature folder in Cursor & Friends seems to improve the situation as the AI will have just enough context to modify or add something but can't break other parts of the code.
I dislike that AI can introduce inconsistencies in code. I had some endpoint which used timestamps and AI used three different types for that DateTime, DateTimeOffset and long (UNIX time). It also introduced code to convert between the types and lots of bugs. The AI uses some folder structure for a part of the solution and other structure for other parts. It uses some naming conventions in some parts and other naming conventions in other parts. It uses multiple libraries for the same thing, like multiple JSON serializing libraries. It does things in a particular way in some parts of the application and in another way in other parts. It seems like tens of people are working in the same solution without anyone reading the code of the others.
While asking AI to modify something, it will be very happy to modify things that you didn't ask to.
I still need to figure out a good workflow, to reduce time and money spent, to reduce or eliminate inconsistency, to reduce bugs and compile errors.
As an upside using AI to help with planning seems to be good, if I want to write the code myself, because the plan can be very thorough and I usually lack time and patience to make a very detailed plan.
Ah I really wanna trust AI won't "fix" the tests by commenting out the assert statements or changing the comparison inputs willy-nilly. I guess that's something terrible human engineers also do. I review changes to tests even more critically than the actual code.