FilterHN

pornel

28 days ago

Their default solution is to keep digging. It has a compounding effect of generating more and more code.

If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

If you tell them the code is slow, they'll try to add optimized fast paths (more code), specialized routines (more code), custom data structures (even more code). And then add fractally more code to patch up all the problems that code has created.

If you complain it's buggy, you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

unlikelytomato

28 days ago

This is why I'm confused when people say it isn't ready to replace most of the programmer workforce.

sonofhans

27 days ago

I love that you’re getting straightforward replies to this absolutely sick burn. The blade is so sharp that some people aren’t even feeling it.

28 days ago

LLM code is higher quality than any codes I have seen in my 20 years in F500. So yeah you need to "guide" it, and ensure that it will not bypass all the security guidance for ex...But at least you are in control, although the cognitive load is much higher as well than just "blind trust of what is delivered".

But I can see the carnage with offshoring+LLM, or "most employees", including so-called software engineers + LLM.

28 days ago

Huh, that explains a lot about the F500, and their buzzword slogans like "culture of excellence".

LLM code is still mostly absurdly bad, unless you tell it in painstaking detail what to do and what to avoid, and never ask it to do a bigger job at a time than a single function or very small class.

Edit: I'll admit though that the detailed explanation is often still much less work than typing everything yourself. But it is a showstopper for autonomous "agentic coding".

28 days ago

> unless you tell it in painstaking detail what to do and what to avoid, and never ask it to do a bigger job at a time than a single function or very small class.

This is hyperbolic, but the general sentiment is accurate enough, at least for now. I've noticed a bimodal distribution of quality when using these tools. The people who approach the LLM from the lens of a combo architect & PM, do all the leg work, set up the guard rails, define the acceptance criteria, these are the people who get great results. The people who walk up and say "sudo make me a sandwich" do not.

Also the latter group complains that they don't see the point of the first group. Why would they put in all the work when they could just code? But what they don't see is that *someone* was always doing that work, it just wasn't them in the past. We're moving to a world where the mechanical part of grinding the code is not worth much, people who defined their existence as avoiding all the legwork will be left in the cold.

27 days ago

> This is hyperbolic

Maybe a bit, but unfortunately sometimes not so much. I recently had an LLM write a couple of transforms on a tree in Python. The node class just had "kind" and "children" defined, nothing else. The LLM added new attributes to use in the new node kinds (Python allows to just do "foo.bar=baz" to add one). Apparently it saw a lot of code doing that during training.

I corrected the code by hand and modified the Node class to raise an error when new attributes are added, with an emphatic source code comment to not add new attributes.

A couple of sessions later it did it again, even adding its own comment about circumventing the restriction! X-|
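For anyone wondering what such a guard could look like, here is a minimal sketch. It is not the commenter's actual code: only the `kind` and `children` attributes come from the description above, while the `_ALLOWED` set, the error message, and the choice to override `__setattr__` are assumptions made for illustration.

```python
class Node:
    """Tree node limited to the two attributes described above."""

    # DO NOT add new attributes to Node instances.
    # (Hypothetical allow-list; the original class is not shown in the thread.)
    _ALLOWED = frozenset({"kind", "children"})

    def __init__(self, kind, children=None):
        self.kind = kind
        self.children = list(children) if children is not None else []

    def __setattr__(self, name, value):
        # Reject any attribute that is not on the allow-list.
        if name not in self._ALLOWED:
            raise AttributeError(
                f"Node does not allow new attribute {name!r}"
            )
        super().__setattr__(name, value)
```

With this in place, `node.bar = baz` raises `AttributeError`. Note that a guard like this is only a speed bump: calling `object.__setattr__(node, "bar", baz)` skips the override entirely, which is one way a restriction of this kind can be circumvented.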

Anyways, I think I mostly agree with your assessment. I might be dating myself here, but I'm not even sure what happened that made "coding" grunt work. It used to be every "coder" was an "architect" as well, and did their own legwork as needed. Maybe labor shortages changed that.

27 days ago

> It used to be every "coder" was an "architect" as well, and did their own legwork as needed.

I disagree. I remember in the days before "software engineer" became the rage that the standard job titles had a clear delineation between the people who thought the big thoughts with titles like "analyst" and the people who did the grunt work of coding who were "programmers". You'd also see roles in between like "programmer/analyst"

27 days ago

Might be a big company thing then, but I'm not wholly convinced. There's a big gap between designing the outline of a big system and coding instructions that can be followed without having to make your own decisions. The question of how much of that gap is filled by the "design" vs "coding" levels is a spectrum.

27 days ago

I think I see what you're saying and if so we're talking past each other a bit and I agree with what you're saying as well.

The point I was raising is by the time an IC developer sees something, there's already been a process of curation that happens that frames the possible solutions & constrains branch points. This is different from saying that an IC makes 0 implementation decisions. The C-suite has set a direction. A product manager has defined the shape of the solution. A tech lead, architect, or whatever may have further limited scope. And any of these could just already be in effect at a global scale or on the specific problem at hand. Then the IC picks up the work and proceeds to make the last mile decisions. And it's turtles all the way up. At almost all levels on the career ladder, there are people above and/or upstream of you who are pre-curating your potential decision tree.

As an analogy, I once had a fresh tech lead under me who didn't understand this. Their team became a mess. They'd introduce raw tickets straight from the PM to their team without having thought about them at all and things ground to a halt due to decision paralysis. From their perspective that's how it was always done when they were an IC in that group. The team tackled the tickets together to work out how to accomplish their goals. It took a lot of effort to convince them that what they *didn't see* was their prior tech lead narrowing down the search space a bit, and then framing the problem in a way that made it easier for the team to move forward.

27 days ago

I'm on board with that framing of the process, and I see how my original formulation was too rough.

I was reacting to "We're moving to a world where the mechanical part of grinding the code is not worth much". I have the impression that in the past just mechanically grinding the code was less of a thing than it apparently is today. Guidance, sure, but not as much as seems to be common (often necessarily so) today. But I'm sure that varies with a lot of factors, not just the calendar year.

27 days ago

Exactly. I was channeling the stereotypical dev that says they "just want to write code". To your point they're not literally *only* writing code, but this was the sort of person/mentality I was calling out.

What it says to me is they've actively avoided what appears to be becoming the most important skills in the new world. They're likely to find themselves on the short end of the stick.

27 days ago

I'm with you, it's constantly doing stupid shit and ignoring instructions, and I've always been responsible for determining architecture and doing the "legwork." Unless the task is so small and well defined that it's less typing to tell the LLM (and clean up its output), I may as well just do it myself.

aryehof

27 days ago

> what happened that made "coding" grunt work

Modern human programming has devolved to nothing more than modeling problems and systems using lines of code, procedures, sub-routines and modules, utilizing a “hack it till it works”(tm) methodology.

froggit

26 days ago

> utilizing a “hack it till it works”(tm) methodology.

Your post describes my coding perfectly. I don't have CS training of any type, never been formally involved in software development (recently started dabbling in OSS) and never used an LLM/agent for help (do use a local SLM for autocomplete and suggestions only).

Yet I can "code." I suspect a (pre-2023ish) software developer would likely tell me "go learn to code" if I asked for review. I don't know the formal syntax people expect to see and it has organization more typical of raging dumpster fires. Doesn't mean it's not code.

gedy

27 days ago

> The people who walk up and say "sudo make me a sandwich" do not.

My personal beef is the human devs get "make me a sandwich", and the LLM superfans now suddenly know how to specify requirements. That's fine but don't look down your nose at people for not getting the same info.

This is happening now at my company where leadership won't explain what they want, won't answer questions, but now type all day into Claude and ChatGPT. Like you could have Slacked me the same info last year knuckleheads...

27 days ago

Absolutely. Merely being a member of the business class does not magically mean one has the ability to specify business requirements much less product specifications. These are *not* the people I'm talking about now having superpowers.

I am picturing people who blend high level engineering and product skills, ideally with business sense.

godelski

27 days ago

  > Merely being a member of the business class does not magically mean one has the ability to specify business requirements much less product specifications

Is this not why COBOL failed? Common Business-Oriented Language sure does look much more like natural language than a lot of other code, but it could never solve the abstraction needed to do the complex things.

I don't think LLMs will ever get rid of coders. Business people can no more tell an LLM what to build than they can a team of programmers. I've long argued that the contention between the "business monkeys" and "coding monkeys" is a good one. That the former focuses on making money and the latter focuses on making a better product. The contention is good because they need each other (though I do not think the dependence is symmetric).

Maybe one day AI will get there, but I don't see how it does without achieving AGI. To break down intent. To differentiate what was asked from what was intended. To understand the depth and all the context surrounding the many little parts. To understand the needs of the culture. The needs of the users. The needs of the business. This is all quite complex and it's why the number of employees typically grows quite rapidly.

How do we move forward without asking how we got here? Why we got here? How optimizing for decades (or much longer) led us to these patterns. Under what conditions are these patterns (near) optimal? I've yet to see a good answer to how LLMs actually address this. If typing was the bottleneck I think we would have optimized in very different ways.

khy34

26 days ago

My intuition tells me that LLMs combined with SWEs with really amazing fundamentals will kill the code monkeys.

And frankly? That’s the best outcome. Code monkeys (in my view that’s an individual who writes out code just to complete a jira ticket) are a liability. Not only that but each additional person you have in an org means more noise creation.

If this forces the code monkeys to level up to compete… again a good thing.

The code base should not be elongated nor complicated. I’m not even a SWE by trade, rather a CEO, and this is my preferred outcome.

valicord

27 days ago

I agree with your first paragraph but not the second one. In many cases it's easier for me to directly write the code that satisfies the unwritten acceptance criteria I have in my head than to write those criteria down in English, have an LLM turn them into code, and then have to carefully review that code to see if I forgot some detail that changes everything.

Terretta

27 days ago

> easier for me to directly write the code that satisfies the unwritten acceptance criteria I have in my head than to write those criteria down in English

Yes, and for team or company code, "there's the problem".

Those acceptance criteria are guardrails for the change that comes after, and getting those out of your head into English is more important over the long haul than your undocumented short-term solution to the criteria.

Virtually all teams — because virtually all PgMs, PjMs, TLs, and Devs — miscalculate this.

Easier for you, not better for team or firm.

• • •

FWIW, perpetuation of this problem isn't really a fault of culture or skill or education. It's largely thanks to "leadership" having no idea how to correctly incentivize what the outcome should holistically be, as they don't know enough to know what long-haul good looks like.

FWIW, you can make that easier for them by having the LLM derive your acceptance criteria into English (based not only on code but on your entire conversation+iteration history) and write that up, which you can read and correct, after the countless little iterations you made since your head-spec wasn't as concrete as you imagined before you started iterating.

Even if you refuse to do spec driven development, LLMs can do development-driven spec. You can review that, you must correct it, and then ... Change can come after more easily — thanks to that context.

valicord

27 days ago

> Those acceptance criteria are guardrails for the change that comes after, and getting those out of your head into English is more important over the long haul than your undocumented short-term solution to the criteria.

I have a lot of context about the system/codebase inside my head. 99.9% of it is not relevant to the specific task I need to do this week. The 0.1% that is relevant to this task is not relevant to other tasks that I or my teammates will need to do next week.

You're suggesting that I write down this particular 0.1% in some markdown file so that LLM can write the code for me, instead of writing the code myself (which would have been faster). Chances are, nobody is going to touch that particular piece of code again for a long time. By the time they do, whatever I have written down is likely out of date, so the long term benefit of writing everything down disappears.

> after the countless little iterations you made since your head-spec wasn't as concrete as you imagined before you started iterating.

That's exactly the point. If I need to iterate on the spec anyway, why would I use an intermediary (LLM) instead of just writing the code myself?

26 days ago

You also write it down so that the bus factor goes down.

And if nobody touches the code for a long time, how can the documentation be out of date?

godelski

27 days ago

  > getting those out of your head into English is more important over the long haul than your undocumented short-term solution to the criteria.

I think there may be miscommunication going on, or I may be misreading the conversation. What I do not know is what valicord means by "satisfies the unwritten acceptance criteria".

In one interpretation, I think they make a ton of sense. We invented formal languages to solve precisely this problem. The precision and pedantic nature of formal languages (like math and code[0]) is to solve ambiguity. If this is the meaning, then yes, code is far more concise and clear[1] than a natural language. That's why we invented formal languages after all. So they may be having trouble converting it to English because they are unsatisfied with the (lack of) precision and verbosity. That when they are more concise that people are interpreting it incorrectly, which is only natural. Natural languages' advantage is their flexibility, but that's their greatest disadvantage too. Everything is overloaded.

But on the other hand, if they are saying that they are unable to communicate the basics (it seems you have read in this way) then I agree with you. Being able to communicate your work is extremely important. I am unsure if it is more important than ever, but it is certainly a critical skill. But then we still have the ambiguous question of "to who?" The type of writing one does significantly differs depending on the audience.

Only valicord can tell us[edit], but I think we're just experiencing the ambiguity that makes natural languages so great and so terrible. I think maybe more important than getting the words out of one's head is to recognize the ambiguity in our language. As programmers this should be apparent, as we often communicate in extremely precise languages. But why I'd say it is more important than ever is because the audience is more diverse than ever. I'd wager a large number of arguments on the internet occur simply due to how we interpret one another's words. The obvious interpretation for one is different for another.

[0] Obviously there's a spectrum with code. C is certainly more formal than Python and thus less ambiguous.

[1] Clear != easy to understand. Or at least not easy to understand by everyone. This is a skill that needs training.

[edit] Reading their response, I think it is the first interpretation.

27 days ago

This is the point I'm raising. I agree with you, but what I'm saying is I think the skillset you describe is the next on the chopping block.

The acquaintances of mine who are absolutely *killing* it with these tools are very experienced, technically minded, product managers. They have an intimate knowledge of how to develop business requirements and how to convert them into high level technical specifications. They have enough technical knowledge to understand when someone is bullshitting them, and what the search space for the problem should be. Historically these people would lead teams of engineers to develop for them, and now they're sitting down and having LLMs crank out what they want in an afternoon. They no longer need engineers at all.

My contention is that people with that sort of skillset will have an advantage due to their experience with skills like finding product fit, identifying user needs, and defining business requirements.

Of course, the people I'm talking about were already killing it in the old paradigm too. I'll admit it's a bit of a unicorn skillset I'm describing.

27 days ago

It's almost as if architecture and code quality mattered just as before and that those who don't know proper engineering principles and problem decomposition will not succeed with these new tools.

26 days ago

If you're using words in English to "tell it in painstaking detail", you're doing it the hard way.

You can provide it with a tool to do that. Agents run tools in a loop, give it good tools. We have linters, code analysers, fuzzers and everything else.

Configure them correctly, tell the agent to use them (in painstaking detail) and it can't mess things up.
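To make the "tools in a loop" idea concrete, here is a rough sketch of the feedback half of such a loop. Everything in it is an assumption for illustration: the function names `run_check`, `collect_feedback`, and `syntax_check_command` are hypothetical, and the stand-in checker is just Python's own `py_compile` module rather than any particular linter or fuzzer. A real setup would put its configured tools in the `checks` mapping and hand the returned text back to the agent.

```python
import subprocess
import sys


def run_check(command):
    """Run one checker command and return (passed, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, (result.stdout + result.stderr).strip()


def collect_feedback(checks):
    """Run every named check and keep the output of the ones that fail.

    `checks` maps a label to an argument list, e.g. {"syntax": [...]}.
    An empty result means every check passed.
    """
    feedback = {}
    for name, command in checks.items():
        passed, output = run_check(command)
        if not passed:
            feedback[name] = output
    return feedback


def syntax_check_command(path):
    # Stand-in checker: byte-compile the file with the current interpreter.
    return [sys.executable, "-m", "py_compile", path]
```

The point of the design is that the agent never gets to declare success on its own: the loop only ends when `collect_feedback` comes back empty. How well that works depends entirely on how much the configured checks actually cover.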

112233

27 days ago

Uhuh. Let me present you Rudolph. For the next 15 minutes, he will paste pieces of top rated SO answers and top starred GH repos. Then he will suffer complete amnesia. He might not understand your question or remember what he just did, but the code he pastes is higher quality than any codes you have seen in your 20 years in F500! For 20$ a month, he's all yours, he just needs a 4 hour break every 5 hours. But he runs on money, like gumball machine, so you can wake him with a donation. Oh, you are responsible for giving him precise instructions, that he often ignores in favour of other instructions from uncle Sam. No, you can't see them.

28 days ago

  > LLM code is higher quality than any codes I have seen in my 20 years in F500.

"Any codes"?

28 days ago

At least my comment hasn't been reviewed or written by a LLM.

And in my French brain, code or codebase is countable and not uncountable.

sebastiennight

28 days ago

As far as I've ever heard, "le code" used in a codebase is uncountable, like "le café" you'd put in a cup, so we would still say "meilleur que tout le code que j'ai vu en 20 ans" and not "meilleur que tous les codes que j'ai vus en 20 ans".

There is a countable "code" (just like "un café" is either a place, or a cup of coffee, or a type of coffee), and "un code" would be the one used as a password or secret, as in "j'ai utilisé tous les codes de récupération et perdu mon accès Gmail" (I used all the recovery codes and lost Gmail access).

troupo

28 days ago

> As far as I've ever heard, "le code" used in a codebase is uncountable

Now I can't get the Pulp Fiction dialog out of my head.

- Do you know what they call code in France?

- No

- Le code

ahartmetz

28 days ago

As an additional wrinkle, the word seems quite French in origin in this case.

28 days ago

You are correct, we generally say le code. To be exact at that time, I was more thinking toutes les lignes de code.

28 days ago

I guess you can guide it to write in any style.

But what set me off is a universal quantifier: there was no code seen by you that is of equal or better quality than what LLMs generate.

https://www.neatorama.com/2007/01/22/a-mathematical-cow-joke...

mejutoco

28 days ago

cows are brown, from one side.

28 days ago

I got curious and had to fire up the ol LLM to find out what the story is about the words that aren't pluralized - TIL about countable and uncountable nouns. I wonder if the guy giving you trouble about your English speaks French.

28 days ago

I speak Russian and some English, but the question was about universal quantification: the author declares that LLMs generate code of better quality than "any codes" he has seen in his career.

dahart

27 days ago

LLMs got their training data from somewhere. But maybe they’re good at percolating the good code to the top and filtering the bad code.

iLoveOncall

28 days ago

I'm native French and nobody would consider code countable. "codes" makes no sense. We'd talk about "lines of code" as a countable in French just like in English.

true_religion

28 days ago

Codes is a proper grammatical word in English, but we don’t use it in reference to general computer programming.

You can for example have two different organizations with different codes of conduct.

There is, though, nothing technically wrong with seeing each line of code as a complete individual code and then referring to multiple of them as codes.

dahart

27 days ago

Codes can be synonymous with codebases and is grammatically just fine, though probably not the most common usage.

raincole

27 days ago

Quite sure they're not criticizing your grammar, but your substance.

28 days ago

You'll find, at times, that those communicating in a language that's not their primary language will tend to deviate from what a native speaker might expect.

If that's obvious to you, then you're just being rude. If it's not obvious to you, then you'll also find this is a common deviation (plural 'code') among those who come from a particular primary language's region.

Edit; This got me thinking - what is the grammar/rule around what gets pluralized and what doesn't? How does one know that "code" can refer to a single line of code, a whole file of code, a project, or even the entirety of all code your eyes have ever seen without having to have an s tacked on to the end of it?

tsimionescu

28 days ago

"Codes" as a way to refer to programs/libraries is actually common usage in academia and scientific programming, even by native English speakers. I believe, but am not sure, that it may just be relatively old jargon, before the use of "programs" became more common in the industry.

As for the grammar rule, it's the question of whether a word is countable or uncountable. In common industry usage, "code" is an uncountable noun, just like "flour" in cooking (you say 2 lines of code, 1 pound of flour).

It's actually pretty common for the same word to have both countable and uncountable versions, with different, though related, meanings. Typically the uncountable version is used with a measure of quantity, while the countable version denotes different kinds (flours - different types of flour; peoples - different groups of people).

28 days ago

> Typically the uncountable version is used with a measure of quantity, while the countable version denotes different kinds (flours - different types of flour; peoples - different groups of people).

This was very helpful, thank you! (I had just gotten off the phone with Claude learning about countable and uncountable nouns but those additional details you provided should prove quite valuable)

thaumasiotes

28 days ago

> what is the grammar/rule around what gets pluralized and what doesn't? How does one know that "code" can refer to a single line of code, a whole file of code, a project, or even the entirety of all code your eyes have ever seen without having to have an s tacked on to the end of it?

Well, the grammar is that English has two different classes of noun, and any given noun belongs to one class or the other. Standard terminology calls them "mass nouns" and "count nouns".

The distinction is so deeply embedded in the language that it requires agreement from surrounding words; you might compare many [which can only apply to count nouns] vs much [only to mass nouns], or observe that there are separate generic nouns for each class [thing is the generic count noun; stuff is the generic mass noun].

For "how does one know", the general concept is that count nouns refer to things that occur discretely, and mass nouns refer to things that are indivisible or continuous, most prototypically materials like water, mud, paper, or steel.

Where the class of a noun is not fixed by common use (for example, if you're making it up, or if it's very rare), a speaker will assign it to one class or the other based on how they internally conceive of whatever they're referring to.

thaumasiotes

26 days ago

For the sake of completeness, I should mention that mass nouns, as a matter of grammar, do not and cannot have plural forms.

28 days ago

The question was about universal quantification, not grammar error.

As if the author of the comment had not seen any code that is better than, or of equal quality to, code generated by LLMs.

28 days ago

Well now I look like an idiot. But I did learn some things! :D My apologies.

ben_w

28 days ago

FWIW, I've noticed that scientists (native English speakers at least) will say "codes" rather than "code". I don't know if this is universal or just specific domains (physics) nor if this is common or rare, but I've noticed it.

m3kw9

27 days ago

Offshoring pretty much guarantees a couple vibe coders will be there to operate

mettamage

28 days ago

Giving it prompts of the Shannon project helps for security

27 days ago

You've worked at some shitty places. Nothing I've seen from Claude matches even my worst coworker (and my last job was an F500)

empath75

28 days ago

If you a) know what you are doing and b) know what an llm is capable of doing, c) can manage multiple llm agents at a time, you can be unbelievably productive. Those skills I think are less common than people assume.

You need to be technical, have good communication skills, have big picture vision, be organized, etc. If you are a staff level engineer, you basically feel like you don’t need anyone else.

OTOH I have been seeing even fairly technical engineering managers struggle because they can’t get the LLMs to execute because they don’t know how to ask it what to do.

jurgenburgen

27 days ago

> can manage multiple llm agents at a time

How is that supposed to work? Humans are notoriously poor at multi-tasking. If you spend all day context switching between agents you’re going to have a bad time.

empath75

26 days ago

I deployed a whole fleet of agents working Jira tickets and PRs a couple of weeks ago. You manage them the way you manage people.

(https://okbjgm.weebly.com/uploads/3/1/5/0/31506003/11_laws_o...)

awinter-py

27 days ago

it's like that '11 rules for showrunning' doc where you need to operate at a level where you understand the product being made, and the people making it, and their capabilities, in order to make things come out well without touching them directly.

if you can do every job + parallelize + read fast, and you are only limited by the time it takes to type, claude is remarkable. I'm not superhuman in those ways but in the small domains where I am it has helped a lot; in other domains it has ramped me to 'working prototype' 10x faster than I could have alone, but the quality of output seems questionable and I'm not smart enough to improve it

danparsonson

28 days ago

Yeah that describes most legacy codebases I've worked on XD

lwansbrough

28 days ago

For me, I'll do the engineering work of designing a system, then give it the specific designs and constraints. I'll let it plan out the implementation, then I give it notes if it varies in ways I didn't expect. Once we agree on a solution, that's when I set it free. The frontier models usually do a pretty good job with this work flow at this point.

m3kw9

27 days ago

That’s vibe coding and you won’t read more than 20% of the code written that way. You really can’t build complex software that way

allajfjwbwkwja

27 days ago

At least he gets to enjoy doing the remaining 80% of the work in a lovely codebase with foundations written by an LLM.

iLoveOncall

28 days ago

Really? Because this perfectly explains why it will never replace them: it needs an exact language listing everything required to function as you expect it.

You need code to get it to generate proper code.

abm53

28 days ago

I think GP was a joke about the ability of a typical programmer.

I certainly read it as one and found it funny.

YesBox

28 days ago

Heh, people like to have someone else to blame.

jaredklewis

26 days ago

This comment is why I read HN

stingraycharles

28 days ago

> If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

Never mind the fact that it only migrated 3 out of 5 duplicated sections, and hasn’t deleted any now-dead code.

Mavvie

28 days ago

Sounds like my coworkers.

GeoAtreides

28 days ago

people also piss in rivers, yet dumping raw sewage by million m^3 in the same rivers is generally (less so in uk) frowned upon...

lelanthran

28 days ago

Maybe, but I'd bet a large sum of money that each of your coworkers aren't turning out this drivel at a rate of 3kLoC per hour.

Can you imagine working with someone who produces 100k lines of unmaintainable code in a single sprint?

This is your future.

28 days ago

That's the reality nobody really wants to say.

Jweb_Guru

28 days ago

It's not reality. I'm really not a fan of the way that people excuse the really terrible code LLMs write by claiming that people write code just as bad. Even if that were true, it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later.

imiric

28 days ago

It's an easy copout.

Tool works as expected? It's superintelligence. Programming is dead.

Tool makes dumb mistake? So do humans.

brabel

28 days ago

Yes and both are right. It’s a matter of which is working as expected and making fewer mistakes more often. And as someone using Claude Code heavily now, I would say we’re already at a point where AI wins.

darkwater

28 days ago

[-]

> it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later.

I had a coworker that more or less exactly did that. You left a comment in a ticket about something extra to be done, he answered "yes sure" and after a few days proceeded to close the ticket without doing the thing you asked. Depending on the quantity of work you had at the moment, you might not notice that until after a few months, when the missing thing would bite you back in bitter revenge.

Jweb_Guru

27 days ago

[-]

You may have had one. It clearly made a pretty negative impression on you because you are still complaining about them years later. I find it pretty misanthropic when people ascribe this kind of antisocial behavior to all of their coworkers.

darkwater

27 days ago

[-]

It's still relatively recent. Anyway I'm not saying everyone is like this, absolutely (not even an important chunk), but they do exist. At the same time it's not true that current LLMs only write terrible code.

lukan

28 days ago

[-]

"Even if that were true, it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later."

I admire your experience with people.

dns_snek

28 days ago

[-]

The point is, that's not the typical experience and people like that can be replaced. We don't willingly bring people like that on our teams, and we certainly don't aim to replace entire teams with clones of this terrible coworker prototype.

27 days ago

[-]

Not only have I never had a coworker as bad as these people describe, the point is as you say: why would I want an LLM that works like these people's shitty coworkers?

My worst coworkers right now are the ones who use Claude to write every word of code and don't test it. These are people who never produced such bad code on their own.

So the LLMs aren't just as bad as the bad coworkers, they're turning good coworkers into bad ones!

lukan

27 days ago

[-]

Couple of reasons, but mainly speed and availability.

I can give Claude a job anytime and it will do it immediately.

And yes, I will have to double check anything important, but I am way better and faster at checking, than doing it myself.

So obviously I don't want a shitty LLM as coworker, but a competent one. But the progress they made is pretty astonishing and they are good enough now that I started really integrating them.

ttoinou

28 days ago

[-]

No but they will despise you for bringing the problem up

Jweb_Guru

27 days ago

[-]

In the long run, good code makes everyone much happier than code that is bad because people are being "nice" and letting things slide in code review to avoid confrontation.

duskdozer

28 days ago

[-]

Maybe, but it lets them pump out much, much more code than they otherwise would have been able to. That's the "100x" in their AI productivity multipliers.

27 days ago

[-]

Sounds like you just work at a shitty company

0x008

23 days ago

[-]

The problem is that you are looking at the code. /s

28 days ago

[-]

My sense is that the code generation is fast, but then you always need to spend several hours making sure the implementation is appropriate, correct, well tested, based on correct assumptions, and doesn't introduce technical debt.

You need to do this when coding manually as well, but the speed at which AI tools can output bad code means it's so much more important.

ehnto

28 days ago

[-]

Well when you write it manually you are doing the review and sanity checking in real time. For some tasks, not all but definitely difficult tasks, the sanity checking is actually the whole task. The code was never the hard part, so I am much more interested in the evolution of AI's real-world problem-solving skills over code problems.

I think programming is giving people a false impression of how intelligent the models are: programmers are meant to be smart, right, so being able to code means the AI must be super smart. But programmers also put a huge amount of their output online for free, unlike most disciplines, and it's all text based. When it comes to problem solving I still see them regularly confused by simple stuff, having to reset context to try and straighten it out. It's not a general purpose human replacement just yet.

LPisGood

28 days ago

[-]

And it’s slower to review because you didn’t do the hard part of understanding the code as it was being written.

28 days ago

[-]

You're holding it wrong.

Set the boundaries and guidelines before it starts working. Don't leave it space to do things you don't understand.

ie: enforce conventions, set specific and measurable/verifiable goals, define skeletons of the resulting solutions if you want/can.

To give an example. I do a lot of image similarity stuff and I wanted to test the Redis VectorSet stuff when it was still in beta and the PHP extension for Redis (the fastest one, which is written in C and is a proper language extension, not a runtime lib) didn't support the new commands. I cloned the repo, fired up Claude Code and pointed it to a local copy of the Redis VectorSet documentation I put in the directory root, telling it I wanted it to update the extension to provide support for the new commands I would want/need to handle VectorSets. This was, idk, maybe a year ago. So not even Opus. It nailed it. But I chickened out about pushing that into a production environment, so I then told it to just write me a PHP runtime client that mirrors the functionality of Predis (pure-PHP implementation of a Redis client) but does so via shell commands executed by PHP (lmao, I know).

Define the boundaries, give it guard rails, use design patterns and examples (where possible) that can be used as reference.

slopinthebag

28 days ago

[-]

They aren't holding it wrong, it's a fundamental limitation of not writing the code yourself. You can make it easier to understand later when you review it, but you still need to put in that effort.

nemo44x

27 days ago

[-]

Work in smaller parts then. You should have a mental model of what the code is doing. If the LLM is generating too much you’re being too broad. Break the problem down. Solve smaller problems.

All the old techniques and concepts still apply.

27 days ago

[-]

This

ModernMech

28 days ago

[-]

Enforce conventions, be specific, and define boundaries… in English?!

27 days ago

[-]

Can you not? If not, learn how to. You'll find it helps immensely.

philipp-gayret

28 days ago

[-]

You are correct but developers are not yet ready to face it. The argument you'll always get is the flawed premise that it's less effort to write it yourself (While the same people work in teams that have others writing code for them every day of the week).

28 days ago

[-]

So in my experience, evaluating Opus 4.6 in an existing code base has gone like this.

You say "Do this thing".

- It does the thing (takes 15 min). Looks incredibly fast. I couldn't code that fast. It's inhuman. So far all the fantastical claims hold up.

But still. You ask "Did you do the thing?"

- it says oops I forgot to do that sub-thing. (+5m)

- it fixes the sub-thing (+10m)

You say is the change well integrated with the system?

- It says not really, let me rehash this a bit. (+5m)

- It irons out the wrinkles (+10m)

You say does this follow best engineering practices, is it good code, something we can be proud of?

- It says not really, here are some improvements. (+5m)

- It implements the best practices (+15m)

You say to look carefully at the change set and see if it can spot any potential bugs or issues.

- It says oh, I've introduced a race condition at line 35 in file foo and a null correctness bug at line 180 of file bar. Fixing. (+15m)

You ask if there's test coverage for these latest fixes?

- It says "i forgor" and adds them. (+15m)

Now the change set has shrunk a bit and is superficially looking good. Still, you must read the code line by line, and with an experienced eye you will still find weird stuff happening in several of the functions: there are redundant operations, and resources aren't always freed up. (60m)

You ask why it's implemented in such a roundabout way and how it intends for the resources to be freed up?

- It says "you're absolutely right" and rewrites the functions. (+15m)

You ask if there's test coverage for these latest fixes?

- It says "i forgor" and adds them. (+15m)

Now the 15 minutes of amazingly fast AI code gen has ballooned into taking most of the afternoon.
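For what it's worth, the figures quoted in the walkthrough above can be tallied directly. This is only a rough sketch: it takes every quoted duration at face value as wall-clock minutes, and the step labels are paraphrases added for readability.

```python
# Durations in minutes, copied from the steps listed above.
steps = [
    ("initial implementation", 15),
    ("admits the forgotten sub-thing", 5),
    ("fixes the sub-thing", 10),
    ("admits poor integration", 5),
    ("irons out the wrinkles", 10),
    ("proposes improvements", 5),
    ("implements best practices", 15),
    ("fixes race condition and null bug", 15),
    ("adds missing tests", 15),
    ("manual line-by-line review", 60),
    ("rewrites the roundabout functions", 15),
    ("adds missing tests again", 15),
]

total = sum(duration for _, duration in steps)
hours, minutes = divmod(total, 60)
ratio = total / steps[0][1]

print(f"{total} min = {hours} h {minutes:02d} min")
print(f"{ratio:.1f}x the initial 15 minutes")
```

That comes to 185 minutes, a little over three hours, or roughly twelve times the headline 15 minutes.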

Telling Claude to be diligent, not write bugs, or to write high quality code flat out does not work. And even if such prompting can reduce the odds of omissions or lapses, you still always always always have to check the output. It can not find all the bugs and mistakes on its own. If there are bugs in its training data, you can assume there will be bugs in its output.

(You can make it run through much of this Socratic checklist on its own, but this doesn't really save wall clock time, and doesn't remove the need for manual checking.)

27 days ago

[-]

You didn't use plan mode.

27 days ago

[-]

I did use plan mode. Plan looked great. Code left something else to be desired.

27 days ago

[-]

I've had very consistent success with plan mode, but when I haven't, I've noticed many times it's been working with code/features/things that aren't well defined. ie: not using a well defined design pattern, maybe some variability in the application on how something could be done - these are the things I notice it really trips up on. What helps is well defined interfaces, or even specifically telling it to identify and apply design principles where it seems logical.

When I've had repeated issues with a feature/task on existing code often times it really helps to first have the model analyze the code and recommend 'optimizations' - whether or not you agree/accept, it'll give you some insight on the approach it _wants_ to take. Adjust from there.

27 days ago

[-]

Ok so here are the actual course corrections I had to make to push through a replacement implementation of a btree.

Note that almost all of the problems aren't with the implementation, it basically one shot that. Almost all the issues are with integrating the change with the wider system.

"The btree library is buggy, and inefficient (using mmap, a poor design idea). Can you extract it to an interface, and then implement a clean new version of the interface that does not use mmap? It should be a balanced btree. Don't copy the old design in anything other than the interface. Look at how SkipListReader and SkipListWriter use a BufferPool class and use that paradigm. The new code should be written from scratch and does not need to be binary compatible with the old implementation. It also needs extremely high test coverage, as this is notoriously finicky programming."

"Let's move the old implementation to a separate package called legacy and give them a name like LegacyBTree... "

"Let's add a factory method to the interfaces for creating an appropriate implementation, for the writer based on a system property (\"index.useLegacyBTree\"), and for the reader, based on whether the destination file has the magic word for the new implementation. The old one has no magic word."

"Are these changes good, high quality, good engineering practices, in line with known best practices and the style guide?"

"Yeah the existing code owns the lifetime of the LongArray, so I think we'd need larger changes there to do this cleanly. "

"What does WordLexicon do? If it's small, perhaps having multiple implementations is better"

"Yes that seems better. Do we use BTrees anywhere else still?"

"There should be an integration test that exercises the whole index construction code and performs lookups on the constructed index. Find and run that."

"That's the wrong test. It may just be in a class called IntegrationTest, and may not be in the index module."

"Look at the entire change set, all unstaged changes, are these changes good, high quality, good engineering practices, in line with known best practices and the style guide?"

"Remove the dead class. By the way, the size estimator for the new btree, does it return a size that is strictly greater than the largest possible size? "

"But yeah, the pool size is very small. It should be configurable as a system property. index.wordLexiconPoolSize maybe. Something like 1 GB is probably good."

"Can we change the code to make BufferPool optional? To have a version that uses buffered reads instead?"

"The new page source should probably return buffers to a (bounded) free list when they are closed, so we can limit allocation churn."

"Are these latest changes good, high quality, good engineering practices, in line with known best practices and the style guide?"

"Yes, all this is concurrent code so it needs to be safe."

"Scan the rest of the change set for concurrency issues too."

"Do we have test coverage for both of the btree reader modes (bufferpool, direct)?"

"Neat. Think carefully, are there any edge cases our testing might have missed? This is notoriously finicky programming, DBMSes often have hundreds if not thousands of tests for their btrees..."

"Any other edge cases? Are the binary search functions tested for all corner cases?"

"Can you run coverage for the tests to see if there are any notable missing branches?"

"Nice. Let's lower the default pool size to 64 MB by the way, so we don't blow up the Xmx when we run tests in a suite."

"I notice we're pretty inconsistent in calling the new B+-tree a B-Tree in various places. Can you clean that up?"

"Do you think we should rename these to reflect their actual implementation? Seems confusing the way it is right now."

"Can you amend the readme for the module to describe the new situation, that the legacy modules are on the way out, and information about the new design?"

"Add a note about the old implementation being not very performant, and known to have correctness issues."

"Fix the guice/zookeeper issues before proceeding. This is a broken window."

"It is pre-existing, let's ignore it for now. It seems like a much deeper issue, and might inflate this change scope."

"Let's disable the broken test, and add a comment explaining when and any information we have on what may or may not cause the issue."

"What do you think about making the caller (IndexFactory) decide which WordLexicon backing implementation to use, with maybe different factory methods in WordLexicon to facilitate?"

"I'm looking at PagedBTreeReader. We're sometimes constructing it with a factory method, and sometimes directly. Would it make sense to have a named factory method for the \"PagedBTreeReader(Path filePath, int poolSize)\" case as well, so it's clearer just what that does?"

"There's a class called LinuxSystemCalls. This lets us do preads on file descriptors directly, and (appropriately) set fadviseRandom(). Let's change the channel backed code to use that instead of FileChannels, and rename it to something more appropriate. This is a somewhat big change. Plan carefully."

"Let's not support the case when LinuxSystemCalls.isAvailable() is false, the rest of the index fails in that scenario as well. I think good names are \"direct\" (for buffer pool) and \"buffered\" (for os cached), to align with standard open() nomenclature."

"I'm not a huge fan of PreadPageSource. It's first of all named based on who uses it, not what it does. It's also very long lived, and leaking memory whenever the free list is full. Let's use Arena.ofAuto() to fix the latter, and come up with a better name. I also don't know if we'll ever do unaligned reads in this? Can we verify whether that's ever actually necessary?"

"How do we decide whether to open a direct or buffered word lexicon?"

"I think this should be a system property. \"index.wordLexicon.useBuffered\", along with \"index.wordLexicon.poolSizeBytes\" maybe?"

"Is the BufferPoolPageSource really consistent with the rest of the nomenclature?"

"Are there other inconsistencies in naming or nomenclature?"

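The factory idea from the prompts above (writer chosen by a system property, reader chosen by whether the file starts with a magic word) could look roughly like this. It is a sketch under stated assumptions, not the code from that change set: the real project appears to be Java, the magic bytes and the environment variable name are invented here, and the two reader classes are empty stand-ins that only borrow names mentioned in the prompts.

```python
import os
import tempfile

# Hypothetical 8-byte marker; the thread never states the real magic word.
NEW_FORMAT_MAGIC = b"BTREEv2\x00"


class LegacyBTreeReader:
    """Stand-in for the old implementation, whose files carry no magic word."""

    def __init__(self, path):
        self.path = path


class PagedBTreeReader:
    """Stand-in for the new implementation, whose files start with the magic word."""

    def __init__(self, path):
        self.path = path


def open_reader(path):
    """Pick a reader by sniffing the first bytes of the file."""
    with open(path, "rb") as f:
        header = f.read(len(NEW_FORMAT_MAGIC))
    if header == NEW_FORMAT_MAGIC:
        return PagedBTreeReader(path)
    return LegacyBTreeReader(path)


def writer_kind(env=os.environ):
    """Writers have no file to sniff yet, so a configuration switch decides.

    The (assumed) environment variable stands in for the JVM system
    property "index.useLegacyBTree" mentioned in the prompt.
    """
    return "legacy" if env.get("INDEX_USE_LEGACY_BTREE") == "true" else "paged"


# Small demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as d:
    new_file = os.path.join(d, "new.dat")
    old_file = os.path.join(d, "old.dat")
    with open(new_file, "wb") as f:
        f.write(NEW_FORMAT_MAGIC + b"payload")
    with open(old_file, "wb") as f:
        f.write(b"some legacy bytes")
    new_reader = open_reader(new_file)
    old_reader = open_reader(old_file)

print(type(new_reader).__name__, type(old_reader).__name__)
```

One caveat with this scheme: because the old format has no marker of its own, a legacy file whose first bytes happen to equal the magic word would be misdetected, so the marker should be chosen to be implausible as legacy data.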
dimaggiosghost

24 days ago

[-]

One thing I'm finding early success with is to define how the system can know if this statement is being met. Frequently I will include in the prompt e.g. "research what makes good high quality engineering practices and derive how to tell if those practices are being followed".

Directly telling it my team's values would be better, if we have it developed (like the style guide you mentioned) ... but that's a lot of work, the reasons that hasn't happened before are just as true now, and honestly - there's a lot of overlap with the generic research result.

> Are these changes good, high quality, good engineering practices, in line with known best practices and the style guide

xeromal

27 days ago

[-]

The same as asking one of your JRs to do something except now it follows instructions a little bit better. Coding has never been about line generation and now you can POC something in a few hours instead of a few days / weeks to see if an idea is dumb.

oblio

27 days ago

[-]

LLMs can easily output overwhelming quantities of code. Junior devs couldn't really do that, not consistently.

Scale/quantity matter.

This industry is not mature enough for 1000x the bad code we have now. It was barely hanging on with 1x bad code.

27 days ago

[-]

Yeah. Due diligence is exponentially more important with something like Claude because it is so fast. Get lazy for a few hours and you've easily added 20K LOC worth of technical debt to your code base, and short of reverting the commits and starting over, it'll not be easy to get it to fix the problems after the fact.

It's still pretty fast even considering all the coaxing needed, but holy crap will it rapidly deteriorate the quality of a code base if you just let it make changes as it pleases.

It very much feels like how the most vexing enemy of The Flash is like just some random ass banana peel on the road. Raw speed isn't always an asset.

LPisGood

27 days ago

[-]

The cost of reverting the commits and starting over is not so high though. I find it is really good for prototyping ideas that you might not have tried to do previously.

27 days ago

[-]

It's cheap only if this happens shortly after the bad design mistakes, and there aren't other changes on top of them. Bad design decisions ossify fairly quickly in larger projects with multiple contributors outputting large volumes of code. Claude Code's own "game engine" rendering pipeline[1] is a good example of an almost comically inappropriate design that's likely to be some work to undo now that it's set.

[1] https://spader.zone/engine/

26 days ago

[-]

"Several hours"? How big are your change sets?

If a human dropped a PR on me that took "several hours" to go through (10k+ lines of non-trivial changes), I'd jump in my car and drive to the office just to specifically slap them on the back of the head ffs.

25 days ago

[-]

This was like 1K LOC? It's not the review that was slow, but the wrestling with the model to get the code to not suck.

vannevar

28 days ago

[-]

I'd highly recommend working top down, getting it to outline a sane architecture before it starts coding. Then if one of the modules starts getting fouled up, start with a clean sheet context (for that module) incorporating any cautions or lessons learned from the bad experience. LLMs are not yet good at working and reworking the same code, for the reasons you outline. But they are pretty good at a "Groundhog Day" approach of going through the implementation process over and over until they get it right.

coolius

28 days ago

[-]

+1 if you are vibe coding projects from scratch. if the architecture you specify doesn't make sense, the llm will start struggling, the only way out of their misery is mocking tests. the good thing is that a complete rewrite with proper architecture and lessons learned is now totally affordable.

disgruntledphd2

28 days ago

[-]

I think the best thing about LLMs is how incredibly easy they make it to build one to throw away.

I've definitely built the same thing a few times, getting incrementally better designs each time.

28 days ago

[-]

Not trying to be snarky, with all due respect... this is a skill issue.

It's a tool. It's a wildly effective and capable tool. I don't know how or why I have such a wildly different experience than so many that describe their experiences in a similar manner... but... nearly every time I come to the same conclusion that the input determines the output.

> If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

Yes, when the prompt/instructions are overly broad and there's no set of guardrails or guidelines that indicate how things should be done... this will happen. If you're not using planning mode, skill issue. You have to get all this stuff wrapped up and sorted before the implementation begins. If the implementation ends up being done in a "not-so-great" approach - that's on you.

> If you tell them the code is slow

Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code and then you ask it how it might be optimized. Then you discuss those options with it (this is where you do the part from the previous paragraph, where you direct the approach so it doesn't do "not-so-great approach") until you get to a point where you like the approach and the model has shown it understands what's going on.

Then you accept the plan and let the model start work. At this point you should have essentially directed the approach and ensured that it's not doing anything stupid. It will then just execute, it'll stay within the parameters/bounds of the plan you established (unless you take it off the rails with a bunch of open ended feedback like telling it that it's buggy instead of being specific about bugs and how you expect them to be resolved).
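As a rough illustration of "benchmark first, then discuss options", here is the kind of measurement that turns "it's slow" into a number. Both functions are invented for this sketch and have nothing to do with any code discussed in the thread; the point is only that the candidates are checked for equal output and then timed under the same input.

```python
import timeit


def dedupe_list(items):
    """Quadratic: each membership test scans the output list."""
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen


def dedupe_set(items):
    """Linear on average: membership tests hit a hash set."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


data = [i % 500 for i in range(5_000)]

# Confirm both versions agree before comparing their speed.
assert dedupe_list(data) == dedupe_set(data)

CALLS = 5
for fn in (dedupe_list, dedupe_set):
    # Best of three runs, each run making CALLS calls.
    best = min(timeit.repeat(lambda: fn(data), number=CALLS, repeat=3))
    print(f"{fn.__name__}: {best / CALLS * 1000:.2f} ms per call")
```

The absolute timings depend on the machine, so the useful output is the ratio between the two lines, which is something concrete to put in front of the model (or a coworker).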

> you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

This is an area I will agree that the models are wildly inept. Someone needs to study what it is about tests and testing environments and mocking things that just makes these things go off the rails. The solution to this is the same as the solution to the issue of it keeping digging or chasing its tail in circles... Early in the prompt/conversation/message that sets the approach/intent/task you state your expectations for the final result. Define the output early, then describe/provide context/etc. The earlier in the prompt/conversation the "requirements" are set the more sticky they'll be.

And this is exactly the same for the tests. Either write your own tests and have the models build the feature from the test or have the model build the tests first as part of the planned output and then fill in the functionality from the pre-defined test. Be very specific about how your testing system/environment is set up, and any time you run into a testing-related issue have the model make a note about that and the solution in a TESTING.md document. In your AGENTS.md or CLAUDE.md or whatever indicate that if the model is working with tests it should refer to the TESTING.md document for notes about the testing setup.
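A tiny sketch of the tests-first flow described above, assuming a made-up `parse_minutes` helper (nothing in the thread defines it). The test is written first and acts as the agreed spec; the implementation is whatever makes it pass.

```python
import re


# Step 1: the test is written first and pins down the expected behaviour.
# parse_minutes is a hypothetical function invented for this sketch.
def test_parse_minutes():
    assert parse_minutes("15m") == 15
    assert parse_minutes("1h") == 60
    assert parse_minutes("1h05m") == 65
    for bad in ("", "soon", "5x"):
        try:
            parse_minutes(bad)
        except ValueError:
            continue
        raise AssertionError(f"{bad!r} should be rejected")


# Step 2: the implementation is filled in afterwards, until the test passes.
_PATTERN = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?")


def parse_minutes(text):
    match = _PATTERN.fullmatch(text)
    # An empty string matches the pattern, so also require at least one part.
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


test_parse_minutes()
print("all tests pass")
```

The design choice being illustrated is that the test is the fixed part: if the model wants to change it, that is a conversation, not a silent edit.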

Personally, I focus on the functionality, get things integrated and working to the point I'm ready to push it to a staging or production (yolo) environment and _then_ have the model analyze that working system/solution/feature/whatever and write tests. Generally my notes on the testing environment to the model are something along the lines of a paragraph describing the basic testing flow/process/framework in use and how I'd like things to work.

The more you stick to convention the better off you'll be. And use planning mode.

riffraff

28 days ago

[-]

> Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results?

Yes? Why don't you?

They are capable people that just didn't notice something; if I notice some telemetry and tell them "hey this is slow" they are expected to understand the reason(s).

28 days ago

[-]

So, you observed some telemetry - which would have been some sort of specific metric, right? Wouldn't you communicate that to them as well, not just "it's slow"?

"Hey, I saw that metric A was reporting 40% slower, are you aware already or have any ideas as to what might be causing that?"

Those two approaches are going to produce rather distinctly different results whether you're speaking to a human or typing to a GPU.

28 days ago

[-]

Yeah if my co-worker can't start figuring out why the code is slow, with a reasonable reference to what the code in question is, that is a knock against their skills. I would actually expect some ideas as to what the problem is just off the top of their heads, but that the coding agent can't do that isn't a hit against it specifically, this is now a good part of what needs to be done differently.

The suggestion to tell the agent to do performance analysis of the part of the code you think is problematic, and offer suggestions for improvements seems like the proper way to talk to a machine, whereas "hey your code is slow" feels like the proper way to talk to a human.

brabel

28 days ago

[-]

As someone who leads a team of engineers, telling someone their code is slow is not nice, helpful or something a good team member should do. It’s like telling them there’s a bug and not explaining what the bug is. Code can be slow for infinite reasons, maybe the input you gave is never expected and it’s plenty fast otherwise. Or the other dev is not senior enough to know where problems may be. It can be you when I tell you your OOP code is super slow, but you've only ever done OOP and have no idea how to put data in a memory layout that avoids CPU cache misses or whatever. So no that’s not the proper way to talk to humans. And AI is only as good as the quality of what you’re asking. It’s a bit like a genie, it will give you what you asked, not what you actually wanted. Are you prepared for the AI to rewrite your Python code in C to speed it up? Can it just add fast libraries to replace the slow ones you had selected? Can it write advanced optimization techniques it learned about from PhD theses you would never even understand?

28 days ago

[-]

>As someone who leads a team of engineers, telling someone their code is slow is not nice, helpful or something a good team member should do

right, I'm sure there are all sorts of scenarios where that is the case and probably the phrasing would be something like that seems slow, or it seems to be taking longer than expected or some other phrasing that is actually synonymous with the code is slow. On the other hand there are also people that you can say the code is slow to, and they won't worry about it.

>So no that’s not the proper way to talk to humans

In my experience there are lots of proper ways to talk to humans, and part of the propriety is involved with what your relationship with them is. So it may be the proper way to talk to a subset of humans, which is generally the only kinds of humans one talks to - a subset. I certainly have friends that I have worked with for a long time who can say "what the fuck were you thinking here" or all sorts of things that would not be nice if it came from other people but is in fact a signifier of our closeness that we can talk in such a way. Evidently you have never led a team with people who enjoyed that relationship between them, which I think is a shame.

Finally, I'll note that when I hear a generalized description of a form of interaction I tend to give what used to be called "the benefit of a doubt" and assume that, because of the vagaries of human language and the necessity of keeping things not a big long harangue as every communication must otherwise become in order to make sure all bases of potential speech are covered, that the generalized description may in fact cover all potential forms of polite interaction in that kind of interaction, otherwise I should have to spend an inordinate amount of my time lecturing people I don't know on what moral probity in communication requires.

But hey, to each their own.

on edit: "the what the fuck were you thinking here" quote is also an example of a generalized form of communication that would be rude coming from other people but was absolutely fine given the source, and not an exact quote despite the use of quotation marks in the example.

crazygringo

27 days ago

[-]

...no?

"Your code is slow" is essentially meaningless.

A normal human conversation would specify which code/tasks/etc., how long it's currently taking, how much faster it needs to be, and why. And then potentially a much longer conversation about the tradeoffs involved in making it faster. E.g. a new index on the database that will make it gigabytes larger, a lookup table that will take up a ton more memory, etc. Does the feature itself need to be changed to be less capable in order to achieve the speed requirements?

If someone told me "hey your code is slow" and walked away, I'd just laugh, I think. It's not a serious or actionable statement.

27 days ago

[-]

Thank you.

zabzonk

28 days ago

[-]

Well, I would say something like "We seem to be having some performance issues the business has noticed in the XYZ stuff. Shall we sit down together and see if we can work out if we can improve things?"

scotty79

25 days ago

[-]

There was a 20+ person team of well paid, smart (mostly Java) programmers that dealt for months with a slow application they were building, that everyone knew was slow. I nagged them for weeks to set up indexes even for small, 100 row tables. Once they did, things started running orders of magnitude faster.

Your expectations for people (and LLMs) are way too high.
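To make the index point concrete, here is a minimal sketch using SQLite from Python's standard library. The thread does not say which database or schema that team used, so the `orders` table, its columns, and the index name are all invented; the sketch only shows how the query plan changes from a full scan to an index lookup once the index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(100_000)],
)

QUERY = "SELECT total FROM orders WHERE customer_id = ?"


def plan(sql):
    """Return SQLite's query plan as one string (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (42,)).fetchall()
    return "; ".join(row[-1] for row in rows)


before = plan(QUERY)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(QUERY)

print("before:", before)
print("after: ", after)
```

The exact wording of the plan text varies between SQLite versions, but the second line should name `idx_orders_customer` while the first does not.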

pornel

28 days ago

[-]

My comment was a summary of the situation, not literal prompts I use. I absolutely realize the work needs to be adequately described and agents must be steered in the right direction. The results also vary greatly depending on the task and the model, so devs see different rates of success.

On non-trivial tasks (like adding a new index type to a db engine, not oneshotting a landing page) I find that the time and effort required to guide an LLM and review its work can exceed the effort of implementing the code myself. Figuring out exactly what to do and how to do it is the hard part of the task. I don't find LLMs helpful in that phase - their assessments and plans are shallow and naive. They can create todo lists that seemingly check off every box, but miss the forest for the trees (and it's extra work for me to spot these problems).

Sometimes the obvious algorithm isn't the right one, or it turns out that the requirements were wrong. When I implement it myself, I have all the details in my head, so I can discover dead-ends and immediately backtrack. But when an LLM is doing the implementation, it takes much more time to spot problems in the mountains of code, and even more effort to tell when it's genuinely a wrong approach or merely poor execution.

If I feed it what I know before solving the problem myself, I just won't know all the gotchas yet myself. I can research the problem and think about it really hard in detail to give bulletproof guidance, but that's just programming without the typing.

And that's when the models actually behave sensibly. A lot of the time they go off the rails and I feel like a babysitter instructing them "no, don't eat the crayons!", and it's my skill issue for not knowing I must have "NO eating crayons" in AGENTS.md.

27 days ago

[-]

Don't worry, Claude ignores my CLAUDE.md and eats crayons anyway

brabel

28 days ago

[-]

Great answer, and the reason some people have bad experiences is actually patently clear: they don’t work with the AI as a partner, but as a slave. But even for them, AI is getting better at automatically entering planning mode, asking for clarification (what exactly is slow, can you elaborate?), saying some idea is actually bad (I got that a few times), and so on… essentially, the AI is starting to force people to work as a partner and give it proper information, not just tell them “it’s broken, fix it” like they used to do on StackOverflow.

girvo

28 days ago

[-]

I absolutely tell a coworker their code is slow and expect them to fix it…

Bayko

28 days ago

[-]

I too tell my boss to promote me and expect him to do so.

otabdeveloper4

28 days ago

[-]

It is not a tool. It is an oracle.

It can be a tool, for specific niche problems: summarization, extraction, source-to-source translation -- if post-trained properly.

But that isn't what y'all are doing, you're engaging in "replace all the meatsacks AGI ftw" nonsense.

28 days ago

[-]

If I was on the "replace all the meatsacks AGI ftw" team then I would have referred to it as an oracle, by your own logic, wouldn't I have?

It's a tool. It's good for some things, not for others. Use the right tool for the job and know the job well enough to know which tools apply to which tasks.

More than anything it's a learning tool. It's also wildly effective at writing code, too. But, man... the things that it makes available to the curious mind are rather unreal.

I used it to help me turn a cat exercise wheel (think huge hamster wheel) into a generator that produces enough power to charge a battery that powers an ESP32 powered "CYD" touchscreen LCD that also utilizes a hall effect sensor to monitor, log and display the RPMs and "speed" (given we know the wheel circumference) in real time as well as historically.
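The RPM and speed arithmetic behind that display is short. A rough sketch in Python rather than ESP32 firmware, assuming one magnet (one hall-sensor pulse) per revolution by default; the function names are invented for the example:

```python
def rpm_from_pulses(pulse_count: int, elapsed_s: float, pulses_per_rev: int = 1) -> float:
    """Revolutions per minute from hall-sensor pulses counted over elapsed_s seconds."""
    if elapsed_s <= 0 or pulses_per_rev <= 0:
        raise ValueError("elapsed_s and pulses_per_rev must be positive")
    revolutions = pulse_count / pulses_per_rev
    return revolutions * 60.0 / elapsed_s

def speed_kmh(rpm: float, circumference_m: float) -> float:
    """Surface speed of the wheel rim in km/h, given the wheel circumference in metres."""
    metres_per_minute = rpm * circumference_m
    return metres_per_minute * 60.0 / 1000.0

# Example: 30 pulses in 60 seconds on a wheel with a 3 m circumference.
rpm = rpm_from_pulses(30, 60.0)
print(rpm, speed_kmh(rpm, 3.0))
```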

I didn't know anything about all this stuff before I started. I didn't AGI myself here. I used a learning tool.

But keep up with your schtick if that's what you want to do.

otabdeveloper4

28 days ago

[-]

Oracles have their use too, but as long as you keep confusing "oracle" and "tool" you will get nowhere.

P.S. The real big deal is the democratization of oracles. Back in the day building an oracle was a megaproject accessible only to megacorps like Google. Today you can build one for nothing if you have a gaming GPU and use it for powering your kobold text adventure session.

27 days ago

[-]

> Oracles have their use too, but as long as you keep confusing "oracle" and "tool" you will get nowhere.

Arguably, I'm getting somewhere.. ;)

leptons

28 days ago

[-]

>I used it to help me turn a cat exercise wheel (think huge hamster wheel) into a generator that produces enough power to charge a battery that powers an ESP32 powered "CYD" touchscreen LCD that also utilizes a hall effect sensor to monitor, log and display the RPMs and "speed" (given we know the wheel circumference) in real time as well as historically.

So what? That's honestly amateur hour. And the LLM derived all of it from things that have been done and posted about a thousand times before.

You could have achieved the same thing with a few google searches 15 years ago (obviously not with ESP32, but other microcontrollers).

27 days ago

[-]

Right - it's not a big deal and it LITERALLY is amateur hour. But I did it. I wouldn't have done it prior, sure I could have done a bunch of google searches but the time investment it would have taken to sift through all that information and distill it into actionable chunks would have far exceeded the benefit of doing so, in this case.

The whole point is that it is amateur hour and it's wildly effective as a learning tool.

The fact it derived everything from things that have been done... yea, that's also the point? What point are you trying to make here? I'm well aware it's not a great tool if you're trying to use it to create novel things... but I'm not a nuclear physicist. I'm a builder, fixer, tinkerer who happens to make a living writing code. I use it to teach me how to do things, I use it to analyze problems and recommend approaches that I can then delve into myself.

I'm not asking it to fold proteins. (I guess that's been done quite a bit too, so would be amateur as well)

leptons

27 days ago

[-]

>The whole point is that it is amateur hour and it's wildly effective as a learning tool.

You sound so proud of your accomplishment, and I question if there's really nothing to be proud of here. I doubt you really learned anything, a machine told you what to do and you did it, like coloring by numbers - it doesn't make you an artist. You won't be able to build upon it, without asking the machine to do more of the thinking for you. And I think that's kind of sad.

>I'm a builder, fixer, tinkerer who happens to make a living writing code

I have to doubt that. If you were all those things, you would have been able to complete that project with very little effort, and without a machine telling you what to do.

26 days ago

[-]

What would be the appropriate way to learn then?

If a human gave me the same "amateur hour" instructions, would that be bad?

If I follow a "make exercise wheel display RPM" tutorial on a website, will I learn?

If it's in a book (distilled information is bad, right?), will I learn then?

leptons

25 days ago

[-]

OP was writing how great the LLM is, and that he couldn't do this stuff as easily before LLMs. And that simply isn't true.

Instead of breaking down the task himself into achievable steps, the LLM did that "thinking" for him. This will inevitably lead to atrophy of the brain. If you don't exercise your brain, and let the tin-can tell you what to do, you're going to get pretty dull. It's well known that keeping your brain active, solving problems, will keep your mental abilities strong. Using LLMs is the opposite of that.

26 days ago

[-]

lmao - I'm not at all proud of what you called an accomplishment. I literally said it _is_ amateur hour, it's hacked together, not safe, not stylish, not well engineered. But it does work. And despite your assumption about me learning anything - I had _no idea_ how generators worked. The realization that spinning an electric motor would result in electricity being produced blew my mind and got me asking Claude things related to that, then I wanted to interface a wheel against my wheel to spin a stepper motor to get a charge and had the harebrained idea to just make the whole thing the generator instead. None of this was stuff I knew.

Despite this thing I made being rather useless in the grand scheme of things it was _wildly_ illuminating in terms of my understanding of electricity and the various objects around me and how they function. Which has spurred another rabbit hole that is having _real measurable effect_ for a host of feral cats to live a more comfortable life. (Not the wheel generator thing)

> a machine told you what to do and you did it, like coloring by numbers - it doesn't make you an artist.

I never claimed to be an artist ;) And, maybe it's different for you, but someone or something showing me how to do something is quite literally the best way for me to learn. /shrug

> I have to doubt that. If you were all those things, you would have been able to complete that project with very little effort, and without a machine telling you what to do.

I love that for you.

raincole

27 days ago

[-]

> Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code and then you ask it how it might be optimized.

...Really? I think 'hey we have a lot of customers reporting the app is laggy when they do X, could you take a look' is a very reasonable thing to tell your coworker who implemented X.

joquarky

28 days ago

[-]

Don't let it deteriorate so far that it can't recover in one session.

Perform regular sessions dedicated to cleaning up tech debt (including docs).

MattGaiser

28 days ago

[-]

> If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

Are you using plan mode? I used to run into the "take a poor approach and keep digging" issue, but with planning that seems to have gone away?

28 days ago

[-]

maybe there should be an LLM trained on a corpus of code deletions and cleanups.

krackers

28 days ago

[-]

I'm guessing there's a very strong prior to "just keep generating more tokens" as opposed to deleting code that needs to be overcome. Maybe this is done already but since every git project comes with its own history, you could take a notable open-source project (like LLVM) and then do RL training against each individual patch committed.

movedx01

28 days ago

[-]

Perhaps the problem is that you RL on one patch at a time, failing to capture the overarching long term theme, an architecture change being introduced gradually over many months, that exists in the maintainer’s mental model but not really explicitly in diffs.

28 days ago

[-]

right, it would have to be a specialized tool that you use to analyze the codebase every now and then, or the parts that you think should be cleaned up.

Obviously there is a "just keep generating more tokens" bias in software management, since so many developer metrics over the years have relied on some form of lines-of-code analysis.

But just as experience and managerial programs have over time developed to say this is a bad bias for ranking devs, it should be clear it is a bad bias for LLMs to have.

ashdksnndck

28 days ago

[-]

I think this is in the training data since they use commit data from repos, but I imagine code deletions are rarer than they should be in the real data as well.

28 days ago

[-]

deleting and code cleanup is perhaps more an expression of seniority, and personal preferences. Maybe there should be the same kind of style transfer with code that you see with graphical generative AI, "rewrite this code path in the style of Donald Knuth"

ashdksnndck

27 days ago

[-]

I imagine there would be value in not just throwing all of GitHub commits in as training data, but also rating the quality.

enraged_camel

27 days ago

[-]

I have no idea what I'm doing differently because I haven't experienced this since Opus 4.5. Even with Sonnet 4.5, providing explicit instructions along the lines of "reuse code where sensible, then run static analysis tools at the end and delete unused code it flags" worked really well.

I always watch Opus work, and it is pretty good with "add code, re-read the module, realize some pre-existing code (either it wrote, or was already there) is no longer needed and delete it", even without my explicit prompts.

carlosjobim

28 days ago

[-]

Yes, this is exactly the experience I have had with LLMs as a non-programmer trying to make code. When it gets too deep into the weeds I have to ask it to get back a few steps.

m3kw9

27 days ago

[-]

Yes that’s my observation too. I have to be double careful the longer they run a task. They like to hack and patch stuff even when I tell it I don’t prefer it.

karussell

27 days ago

[-]

The solution is to know when to use an existing solution like sqlite and when to create your own. So the biggest problem with LLMs is that they don't repel or remind you about possible consequences (too often). But if they would, I would find it even more awkward... and this is one of the reasons I prefer Claude Code over Codex.

bluepoint

26 days ago

[-]

Actually I get improvements when I ask two llms to simplify each other’s work repeatedly.

codebolt

28 days ago

[-]

I use the restore checkpoint/fork conversation feature in GitHub Copilot heavily because of this. Most of the time it's better to just rewind than to salvage something that's gone off track.

disgruntledphd2

28 days ago

[-]

Yeah I'm a big fan of branching for basically every change, as it provides a known good checkpoint.

ThrowawayTestr

28 days ago

[-]

I feel like there are two types of LLM users: those who understand its limitations, and those who ask it to solve a millennium problem on the first try.

esafak

28 days ago

[-]

I have run into this too. Some of it is because models lack the big picture; so called agentic search (aka grep) is myopic.

cyanydeez

27 days ago

[-]

The reason they're not intelligent is because they want to predict the next token, so verbosity is baked in.

bgitarts

27 days ago

[-]

have you tried adding to your agents file: "Prefer solutions that reduce lines of code over adding lines of code"?

leke

28 days ago

[-]

i wonder if the solution is to just ask it to refactor its code once it's working.

mirsadm

28 days ago

[-]

I do this all the time but then you end up with really over engineered code that has way more issues than before. Then you're back to prompting to fix a bunch of issues. If you didn't write the initial code sometimes it's difficult to know the best way to refactor it. The answer people will say is to prompt it to give you ideas. Well then you're back to it generating more and more code and every time it does a refactor it introduces more issues. These issues aren't obvious though. They're really hard to spot.

MadnessASAP

28 days ago

[-]

You can, and it might make things a bit better. The only real way I've found so far is to start going through file by file, picking it apart.

I wouldn't be surprised if over half my prompts start with "Why ...?", usually followed by "Nope, ... instead”

Maybe the occasional "Fuck that you idiot, throw the whole thing out"

fmbb

28 days ago

[-]

It’s in the name, isn’t it?

Generative AI.

treetalker

27 days ago

[-]

This is my experience with how LLMs "draft" legal arguments: at first glance, it's plausible — but may be, and often is, invalid, unsound, and/or ill-advised.

The catch is that many judges lack the time, energy, or willingness to not only read the documents in detail, but also roll up their sleeves and dig into the arguments and cited authorities. (Some lack the skills, but those are extreme cases.) So the plausible argument (improperly and unfortunately) carries the day.

LLM use in litigation drafting is thus akin to insurgent/guerilla warfare: it takes little time, energy, or thinking to create, yet orders of magnitude more to analyze and refute. (It's a species of Brandolini's Law / The Bullshit Asymmetry Principle.) Thus justice suffers.

I imagine that this is analogous to the cognitive, technical, and "sub-optimal code" debt that LLM-produced code is generating and foisting upon future developers who will have to unravel it.

roarcher

27 days ago

[-]

> LLM use in litigation drafting is thus akin to insurgent/guerilla warfare: it takes little time, energy, or thinking to create, yet orders of magnitude more to analyze and refute.

The same goes for coding. I have coworkers who use it to generate entire PRs. They can crank out two thousand lines of code that includes tests "proving" that it works, but may or may not actually be nonsense, in minutes. And then some poor bastard like me has to spend half a day reviewing it.

When code is written by a human that I know and trust, I can assume that they at least made reasonable, if not always correct, decisions. I can't assume that with AI, so I have to scrutinize every single line. And when it inevitably turns out that the AI has come up with some ass-backwards architecture, the burden is on me to understand it and explain why it's wrong and how to fix it to the "developer" who hasn't bothered to even read his own PR.

I'm seriously considering proposing that if you use AI to generate a PR at my company, the story points get credited to the reviewer.

patrakov

27 days ago

[-]

Evil voice: "I don't mind not getting credits for the story points. The story was AI-generated anyway."

26 days ago

[-]

If code smells like LLM, then you walk to said coworker and ask them to explain it for you. Play dumb if necessary.

Or you use YOUR LLM to review the PR :D

...and wtf, you get "credited" story points for finishing tasks? That sounds completely insane.

roarcher

26 days ago

[-]

> you get "credited" story points for finishing tasks? That sounds completely insane.

Developers' names are attached to stories, and stories have points on them. Why is that insane, and how does your company track who did what?

I propose that the name on the story should be that of the reviewer since they did the work.

26 days ago

[-]

We don't really track individual features to people in a way I could call "crediting" - as in nobody really checks afterwards who did how many story points in a sprint.

As long as the team as a whole gets stuff done, everything is good.

scotty79

25 days ago

[-]

Because story points are a tool for the business to know when, optimistically, a thing could be done. Or, more realistically, to get a decent "no sooner than" estimate of the task.

Using them for anything else, or by anyone else, like scoring the team or like here, individual contributors, is idiotic.

deaux

27 days ago

[-]

> This is my experience with how LLMs "draft" legal arguments: at first glance, it's plausible — but may be, and often is, invalid, unsound, and/or ill-advised.

Correct, and this of course extends past just laws, into the whole scope of rules and regulations described in human languages. It will by its nature imply things that aren't explicitly stated nor can be derived with certainty, just because they're very plausible. And those implications can be wrong.

Now I've had decent success with having LLMs then review these LLM-generated texts to flag such occurrences where things aren't directly supported by the source material. But human review is still necessary.

The cases I've been dealing with are also based on relatively small sets of regulations compared to the scope of the law involved with many legal cases. So I imagine that in the domain you're working on, much more needs flagging.

FpUser

27 days ago

[-]

>" justice suffers"

Possible. It also suffers when the majority simply cannot afford proper representation.

basch

27 days ago

[-]

"Reasoning" needs to go back to the drawing board.

Reasonable tasks need to be converted into formal logic, calculated and computed like a standard evaluation, and then translated back into english or language of choice.

LLMs are being used to think when really they should be the interpret and render steps with something more deterministic in the middle.

Translate -> Reason -> Store to Database. Rinse Repeat. Now the context can call from the database of facts.

otterley

27 days ago

[-]

As an attorney, I’m interested in this theory. Do you have any examples that illustrate the phenomenon you describe?

treetalker

26 days ago

[-]

Since it's too late to edit my other reply — here is a description of a recent case involving several of the categories I mentioned: https://reason.com/volokh/2026/03/06/california-appeals-cour...

And you will find many more by reading (or subscribing to the RSS feed of) the Volokh Conspiracy blog's "AI in Court" tag: https://reason.com/category/law/ai-in-court/

treetalker

26 days ago

[-]

Sure, but which part: opposing counsel using LLMs; opposing counsel simply using bullshit asymmetry to befuddle (nothing new); or judges not always reading and looking deeply into the arguments and authorities (also nothing new)?

If the first category, there have been plenty of examples that have even made their way onto the HN front page in the last half year or so. There have even been instances of judges using LLMs to draft orders containing confabulated authorities.

otterley

26 days ago

[-]

> opposing counsel simply using bullshit asymmetry to befuddle (nothing new)

This one in particular.

grey-area

28 days ago

[-]

This is a fascinating look into code generated by an LLM that is correct in one sense (passes tests) but doesn't meet requirements (painfully slow). Doesn't use is_ipk to identify primary keys, uses fsync on every statement. The problem with larger projects like this even if you are competent is that there are just too many lines of code to read it properly and understand it all. Bravo to the author for taking the time to read this project, most people never will (clearly including the author of it).
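To make the fsync point concrete, here is a rough sketch with Python's built-in `sqlite3`. It is not the reimplementation's code, and the file names and row count are arbitrary. Under SQLite's default settings, committing after every statement pays the durability cost once per row, while wrapping the batch in a single transaction pays it once at the end:

```python
import os
import sqlite3
import tempfile
import time

def insert_rows(path, n, commit_each):
    """Insert n rows; return (row_count, seconds spent inserting)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)")
    conn.commit()
    start = time.perf_counter()
    for i in range(n):
        conn.execute("INSERT INTO t (v) VALUES (?)", (f"row {i}",))
        if commit_each:
            conn.commit()  # one durable commit per statement
    conn.commit()  # the single commit for the batched case
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return count, elapsed

with tempfile.TemporaryDirectory() as tmp:
    per_stmt = insert_rows(os.path.join(tmp, "per_stmt.db"), 200, commit_each=True)
    batched = insert_rows(os.path.join(tmp, "batched.db"), 200, commit_each=False)

print(f"per-statement commits: {per_stmt[1]:.4f}s, one transaction: {batched[1]:.4f}s")
```

Both variants produce the same rows and would pass the same correctness test; only the timing differs, and how much depends on the disk and filesystem.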

I find LLMs at present work best as autocomplete -

The chunks of code are small and can be carefully reviewed at the point of writing

Claude normally gets it right (though sometimes horribly wrong) - this is easier to catch in autocomplete

That way they mostly work as designed and the burden on humans is completely manageable, plus you end up with a good understanding of the code generated. They make mistakes I'd say 30% of the time or so when autocompleting, which is significant (mistakes not necessarily being bugs but ugly code, slow code, duplicate code or incorrect code).

Having the AI produce the majority of the code (in chats or with agents) takes lots of time to plan and babysit, and is harder to review, maintain and diagnose; it doesn't seem like much of a performance boost, unless you're producing code that is already in the training data and just want to ignore the licensing of the original code.

26 days ago

[-]

> This is a fascinating look into code generated by an LLM that is correct in one sense (passes tests) but doesn't meet requirements (painfully slow).

Why isn't requirements testing automated? Benchmarking the speed isn't rocket science. At worst a nightly build should run a benchmark and log it so you can find any anomalies.
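A nightly check along those lines does not need much. A minimal sketch, where the workload, the JSON-lines log format and the 1.5x threshold are placeholders I picked, not anything from the article:

```python
import json
import time
from pathlib import Path

def best_of(fn, repeats=5):
    """Return the fastest wall-clock time of several runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

def is_regression(baseline_s, current_s, tolerance=1.5):
    """Flag a run that is more than `tolerance` times slower than the baseline."""
    return current_s > baseline_s * tolerance

def record_and_check(log_path, name, current_s, tolerance=1.5):
    """Append the result to a JSON-lines log and compare it with the first logged run."""
    path = Path(log_path)
    history = []
    if path.exists():
        history = [json.loads(line) for line in path.read_text().splitlines() if line]
    same = [h for h in history if h["name"] == name]
    regressed = bool(same) and is_regression(same[0]["seconds"], current_s, tolerance)
    with path.open("a") as f:
        f.write(json.dumps({"name": name, "seconds": current_s}) + "\n")
    return regressed

def workload():
    # Placeholder for the operation under test, e.g. a loop of primary-key lookups.
    sum(i * i for i in range(10_000))

if __name__ == "__main__":
    if record_and_check("bench.jsonl", "workload", best_of(workload)):
        raise SystemExit("benchmark regression detected")
```

Comparing against the first logged run keeps the sketch simple; a real setup would more likely compare against a rolling median on fixed hardware.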

consumer451

28 days ago

[-]

Nitpick/question: the "LLM" is what you get via raw API call, correct?

If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?

An example I keep hearing is "LLMs have no memory/understanding of time so ___" - but, agents have various levels of memory.

I keep trying to explain this in meetings, and in rando comments. If I am not way off-base here, then what should the term, or terms, be? LLM-based agents?

dragonwriter

28 days ago

[-]

> Nit pick/question: The LLM is what you get via raw API call, correct?

You always need a harness of some kind to interact with an LLM. Normal web APIs (especially for hosted commercial systems) wrapped around LLMs are non-minimal harnesses, that have built in tools, interpretation of tool calls, application of what is exposed in local toolchains as “prompt templates” to transform the context structure in the API call into a prompt (in some cases even supporting managing some of the conversation state that is used to construct the prompt on the backend.)

> If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?

You are essentially always using something more than an LLM (unless “you” are the person writing the whole software stack, and the only thing you are consuming is the model weights, or arguably a truly minimal harness that just takes setting and a prompt that is not transformed in any way before tokenization, and returns the result after no transformations or filtering other than mapping back from tokens to text.)

But, yes, if you are using an elaborate frontend of the type you enumerate (whether web or CLI or something else), you are probably using substantially more stuff on top of the LLM than if you are using the provider's web API.

consumer451

28 days ago

[-]

In meetings, I try to explain the roles of system prompts, agentic loops, tool calls, etc in the products I create, to the stakeholders.

However, they just look at the whole thing as "the LLM," which carries specific baggage. If we could all spread the knowledge of what is actually going on to the wider public, it would make my meetings easier, and prevent many very smart folks who are not practitioners from saying inaccurate stuff.

staplers

28 days ago

[-]

> If we could all spread the knowledge of what is actually going on to the wider public, it would make my meetings easier, and prevent very smart folks from outside the field from saying dumb-sounding stuff.

This is an example of why LLMs won't displace engineers as severely as many think. There are very old solved processes and hyper-efficient ways of building things in the real world that still require a level of understanding many simply don't care or want to achieve.

xlth

28 days ago

[-]

You're not off-base at all. The way I think about it:

- LLM = the model itself (stateless, no tools, just text in/text out)

- LLM + system prompt + conversation history = chatbot (what most people interact with via ChatGPT, Claude, etc.)

- LLM + tools + memory + orchestration = agent (can take actions, persist state, use APIs)

When someone says "LLMs have no memory" they're correct about the raw model, but Claude Code or Cursor are agents - they have context, tool access, and can maintain state across interactions.

The industry seems to be settling on "agentic system" or just "agent" for that last category, and "chatbot" or "assistant" for the middle one. The confusion comes from product names (ChatGPT, Claude) blurring these boundaries - people say "LLM" when they mean the whole stack.
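The three layers can be sketched in a few lines of Python. Everything here is hypothetical: `fake_model` stands in for a real model call, and the `CALL tool arg` convention is invented for the example, not any vendor's tool-calling format.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]

def fake_model(prompt: str) -> str:
    """Stand-in for a raw LLM: stateless, text in, text out."""
    last_line = prompt.strip().splitlines()[-1]
    if last_line.startswith("user: what time is it"):
        return "CALL clock now"
    if last_line.startswith("tool:"):
        return "The time is " + last_line[len("tool:"):].strip()
    return "echo: " + last_line

class Chatbot:
    """LLM + system prompt + conversation history."""
    def __init__(self, model: Model, system_prompt: str):
        self.model = model
        self.history: List[str] = ["system: " + system_prompt]

    def send(self, message: str) -> str:
        self.history.append("user: " + message)
        reply = self.model("\n".join(self.history))
        self.history.append("assistant: " + reply)
        return reply

class Agent(Chatbot):
    """Chatbot + tools + a loop that feeds tool results back to the model."""
    def __init__(self, model: Model, system_prompt: str,
                 tools: Dict[str, Callable[[str], str]]):
        super().__init__(model, system_prompt)
        self.tools = tools

    def send(self, message: str, max_steps: int = 3) -> str:
        reply = super().send(message)
        for _ in range(max_steps):
            if not reply.startswith("CALL "):
                break
            _, tool_name, arg = reply.split(" ", 2)
            result = self.tools[tool_name](arg)
            self.history.append("tool: " + result)
            reply = self.model("\n".join(self.history))
            self.history.append("assistant: " + reply)
        return reply

agent = Agent(fake_model, "be brief", {"clock": lambda arg: "12:00"})
print(agent.send("what time is it"))
```

The model function never changes between the three; the memory and the ability to act live entirely in the wrapper, which is why "LLMs have no memory" is true of the first layer and misleading for the last.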

simonw

28 days ago

[-]

I like to use the term "coding agents" for LLM harnesses that have the ability to directly execute code.

This is an important distinction because if they can execute the code they can test it themselves and iterate on it until it works.

The ChatGPT and Claude chatbot consumer apps do actually have this ability now so they technically class as "coding agents", but Claude Code and Codex CLI are more obvious examples as that's their key defining feature, not a hidden capability that many people haven't spotted yet.

alexhans

28 days ago

[-]

> The vibes are not enough. Define what correct means. Then measure.

Pretty much. I've been advocating this for a while. For automation you need intent, and for comparison you need measurement. Blast radius/risk profile is also important to understand how much you need to cover upfront.

The Author mentions evaluations, which in this context are often called AI evals [1] and one thing I'd love to see is those evals become a common language of actually provable user stories instead of there being a disconnect between different types of roles, e.g. a scientist, a business guy and a software developer.

The more we can speak a common language and easily write and maintain these no matter which background we have, the easier it'll be to collaborate and empower people and to move fast without losing control.

- [1] https://ai-evals.io/ (or the practical repo: https://github.com/Alexhans/eval-ception )

D-Machine

28 days ago

[-]

This article is great. And the blog-article headline is interesting, but wrong. LLMs don't in general write plausible code (as a rule) either.

They just write code that is (semantically) similar to code (clusters) seen in its training data, and which haven't been fenced off by RLHF / RLVR.

This isn't that hard to remember, and is a correct enough simplification of what generative LLMs actually do, without resorting to simplistic or incorrect metaphors.

ozozozd

28 days ago

[-]

Exactly. It’s also easy to find yourself in the out-of-distribution territory. Just ask for some tree-sitter queries and watch Gemini 3, Opus 4.5 and GLM 5 hallucinate new directives.

ehnto

28 days ago

[-]

I think this could be the key difference in how people are experiencing the tools. Using Claude in industries full of proprietary code is a totally different experience to writing some React components, or framework code in C#, PHP or Java. It's shockingly good at the latter, but as you get into proprietary frameworks or newer problem domains it feels like AI in 2023 again, even with the benefit of the agentic harnesses and context augments like memory etc.

28 days ago

[-]

You’ve hit the nail on the head.

I characterise LLMs as being black boxes that are filled with a dense pool of digital resources - that with the correct prompt you can draw out a mix of resources to produce an output.

But if the mix of resources you need isn’t there - it won’t work. This isn’t limited to just text. This also applies with video models - llms work better for prompts in which you are trying to get material that is widely available on the internet.

empath75

28 days ago

[-]

I think in the long term, if an LLM can’t use a tool, people won’t stop using LLMs, they’ll stop using the tool.

We are building everything right now with LLM agents as a primary user in mind and one of our principles is “hallucination driven development”. If LLMs hallucinate an interface to your product regularly, that is a desire path and you should create that interface.

simianwords

28 days ago

[-]

Any example of how I can get it to hallucinate?

kubb

28 days ago

[-]

IIRC, the most code in its training data is Python. Closely followed by Web technologies (HTML, JS/TS, CSS). This corresponds to the most abundant developers. Many of them dedicated their entire careers to one technology.

We stubbornly use the same language to refer to all software development, regardless of the task being solved. This lets us all be a part of the same community, but is also a source of misunderstanding.

Some of us are prone to not thinking about things in terms of what they are, and taking the shortcut of looking at industry leaders to tell us what we should think.

These guys consistently, in lockstep, talk about intelligent agents solving development tasks. Predominately using the same abstract language that gives us an illusion of unity. This is bound to make those of us solving the common problems believe that the industry is done.

jmull

28 days ago

[-]

> They just write code that is (semantically) similar to code (clusters) seen in its training data, and which haven't been fenced off by RLHF / RLVR.

"Plausible" sounds like the right word to me. (It would be a mistake to digress into these features of LLMs in an article where it isn't needed.)

HarHarVeryFunny

27 days ago

[-]

I agree - I took "plausible" here to mean plausible-looking, no different than similar-looking.

The trouble of course is that similar/plausible isn't good enough unless the LLM has seen enough similar-but-different training samples to refine its notion of similarity to the point where it captures the differences that are critical in a given case.

I'd rather just characterize it as a lack of reasoning, since "add more data" can't be the solution to a world full of infinite variety. You can keep playing whack a mole to add more data to fix each failure, and I suppose it's an interesting experiment to see how far that will get you, but in the end the LLM is always going to be brittle and susceptible to stupid failure cases if it doesn't have the reasoning capability to fully analyze problems it was not trained on.

https://news.ycombinator.com/item?id=47176209

comex

28 days ago

[-]

Based on a search, the SQLite reimplementation in question is Frankensqlite, featured on Hacker News a few days ago (but flagged):

scotty79

25 days ago

[-]

Flagging on HN is getting insane.

flerchin

28 days ago

[-]

Yes plausible text prediction is exactly what it is. However, I wonder if the author included benchmarking in their prompt. It's not exactly fair to keep hidden requirements.

g947o

28 days ago

[-]

Attributing these to "hidden requirements" is a slippery slope.

My own experience using Claude Code and similar tools tells me that "hidden requirements" could include:

* Make sure DESIGN.md is up to date

* Write/update tests after changing source, and make sure they pass

* Add integration tests, not only unit tests that mock everything

* Don't refactor code that is unrelated to the current task

...

These are not even project/language specific instructions. They are usually considered common sense/good practice in software engineering, yet I sometimes had to almost beg coding agents to follow them. (You want to know how many times I have to emphasize don't use "any" in a TypeScript codebase?)

People should just admit it's a limitation of these coding tools, and we can still have a meaningful discussion.

grey-area

28 days ago

[-]

The training data is full of ‘any’ so you will keep getting ‘any’ because that is the code the models have seen.

An interesting example of the training data overriding the context.

26 days ago

[-]

Then you add a Biome rule to say "no any ever" and the LLM will fix it before claiming the job is done.

flerchin

28 days ago

[-]

Yeah I agree generally that the most banal things must be specified, but I do think that a single sentence in the prompt "Performance should be equivalent" would likely have yielded better results.

27 days ago

[-]

Ok, I’ll bite: how is that different from humans?

strken

27 days ago

[-]

Human behaviour is goal-directed because humans have executive function. When you turn off executive function by going to sleep, your brain will spit out dreams. Dream logic is famous for being plausible but unhinged.

I have the feeling that LLMs are effectively running on dream logic, and everything we've done to make them reason properly is insufficient to bring them up to human level.

27 days ago

[-]

Isn’t a modern LLM with thinking tokens fairly goal directed? But yes, we hallucinate in our sleep while LLMs will hallucinate details if the prompt isn’t grounded enough.

zarzavat

27 days ago

[-]

The thing about dream logic is that it can be a completely rational series of steps, but there's usually a giant plot hole which you only realise the second you wake up.

This definitely matches my experience of talking to AI agents and chatbots. They can be extremely knowledgeable on arcane matters yet need to have obvious (to humans) assumptions pointed out to them, since they only have book smarts and not street smarts.

27 days ago

[-]

Assuming this is not a rhetorical question: no, it is not. The only "goal" is to maximize plausibility.

27 days ago

[-]

Again, how is that different from humans? I’m not going around trying to prove my code correct when I write it manually.

27 days ago

[-]

I write code to solve a problem. Not code that looks like it solves the problem if a non-technical client squints at it.

And if you don't prove your code, do you not design at all then? Do you never draw state diagrams?

Every design is an informal proof of the solution. I rarely write formal proofs. Most of the time I write down enough for myself to be convinced that the design solves the problem.

27 days ago

[-]

Yes, you can dedicate extra tokens to drawing state diagrams; the LLM can actually do that. If you don't have it generate one or more design documents before writing code, you are doing it wrong. I still don't get how that is different from what humans are doing.

> Most of the time I write down enough for myself to be convinced that the design solves the problem.

Again, why do you assume we aren't doing the same thing with LLMs?

1. Spec given

2. Ask LLM to write a bunch of design documents based off of spec

3. Ask LLM to identify edge cases

4. Ask LLM to divide edge cases into a test plan involving N tests

5. Ask LLM to write tests

6. Ask LLM to write commented code

7. Ask LLM to run tests on code and, for each failing test, determine whether the test or the code is wrong; go back to the appropriate step to fix the test and/or the code.
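Steps 3 to 7 above can be sketched in a few lines. This is only an illustration under stated assumptions: `parse_port`, `EdgeCase`, `TEST_PLAN`, and `run_plan` are hypothetical names invented here, not anything from the thread. The point is that the edge cases exist as data before the code does, so a failing case can be traced back to either the plan or the implementation.

```python
# Hypothetical example: every name here is illustrative, not from the thread.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EdgeCase:
    description: str
    raw: str
    expected: Optional[int]  # None means "must be rejected"


# Steps 3-4: edge cases written down as a test plan before any code exists.
TEST_PLAN = [
    EdgeCase("typical value", "8080", 8080),
    EdgeCase("surrounding whitespace", " 443 ", 443),
    EdgeCase("lowest valid port", "1", 1),
    EdgeCase("highest valid port", "65535", 65535),
    EdgeCase("zero is out of range", "0", None),
    EdgeCase("above range", "65536", None),
    EdgeCase("not a number", "http", None),
    EdgeCase("empty string", "", None),
]


# Step 6: the code under test (what the LLM would be asked to write).
def parse_port(raw: str) -> Optional[int]:
    """Return the port number, or None if `raw` is not a valid TCP port."""
    text = raw.strip()
    # isascii() guards against Unicode digits that int() cannot parse.
    if not (text.isascii() and text.isdigit()):
        return None
    port = int(text)
    return port if 1 <= port <= 65535 else None


# Step 7: run the plan and report failures, so a human or an agent can decide
# whether the test or the code is wrong.
def run_plan(func: Callable[[str], Optional[int]]) -> list[str]:
    failures = []
    for case in TEST_PLAN:
        actual = func(case.raw)
        if actual != case.expected:
            failures.append(
                f"{case.description}: expected {case.expected!r}, got {actual!r}"
            )
    return failures


if __name__ == "__main__":
    problems = run_plan(parse_port)
    print("all cases pass" if not problems else "\n".join(problems))
```

The failure messages name the edge case rather than just the input, which is what makes the "is the test or the code wrong?" decision in step 7 tractable.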

Whenever I hear someone here on HN imply that the only way to code with an AI is via vibe coding I just die a bit more inside.

27 days ago

[-]

You completely misunderstood what I wrote.

It was a response to you saying: "I’m not going around trying to prove my code correct when I write it manually."

How did you manage to forget what you wrote previously?

Also, in this post you are now suddenly taking the exact opposite position, contradicting your previous point.

26 days ago

[-]

I did not contradict my previous point. But now I’m confused about how you think we use LLMs to write code. You made it sound like we just get it to dump out code without any process in between.

25 days ago

[-]

You most definitely did contradict yourself. First you said you don't prove anything about the code you write, then you said you do. But that's fine. We can agree to disagree.

And I have not made any statements about how you use LLMs, only about how the LLMs produce code. All statements about how you use LLMs have been made by you, not me. I haven't discussed it since it is not related to the arguments, which are: 1) whether LLMs are goal-oriented and 2) whether humans and LLMs both merely maximize plausibility when writing/generating code.

Both claims that you made. Note, however, that if you are correct in your own points, then you should indeed be able to "just dump out code without any process in between". So if anyone is claiming this, it's you.

abdullahkhalids

27 days ago

[-]

You are correct. However, humans sometimes do write stuff that "looks like it solves the problem". A prime example of this is a student who doesn't know how to answer a question. So they make up a plausible sounding answer.

As an exam grader, you can easily tell when a student has the mindset of "solving a problem" but made a mistake, and when they had the mindset of "looks like it solves the problem" and just wrote some stuff.

tsunamifury

27 days ago

[-]

It’s amazing how much you get wrong here. As LLM attention layers are stacked goal functions.

What they lack is multi turn long walk goal functions — which is being solved to some degree by agents.

strken

27 days ago

[-]

I don't argue that thinking and attention are missing. I argue that they are trying to do the job of human executive function but aren't as good at it.

nemo44x

27 days ago

[-]

LLMs are literally goal machines. It’s all they do. So it’s important that you input specific goals for them to work towards. It’s also why logically you want to break the problem into many small problems with concrete goals.

27 days ago

[-]

Do you only mean instruct-tuned LLMs? Or the base (pretrained) model too?

nemo44x

27 days ago

[-]

The entire system and the agent loop allow for more complex goal resolution. The LLM models language (obviously) and language is goal oriented so it models goal oriented language. It’s an emergent feature of the system.

27 days ago

[-]

A prompt for an LLM is also a goal direction and it'll produce code towards that goal. In the end, it's the human directing it, and the AI is a tool whose code needs review, same as it always has been.

basch

27 days ago

[-]

I'd argue humans have some sort of parallelness going on that machines don't yet. Thoughts happening at multiple abstraction levels simultaneously. As I am doing something, I am also running the continuous improvement cycle in my head, at all four steps concurrently. Is this working, is this the right direction, does this validate?

You could build layers and layers of LLMs watching the output of each other's thoughts and offering different commentary as they go, folding all the thoughts back together at the end. Currently, a group of agents acts more like a discussion than something somewhat omnipotent or omnitemporal.

whoamii

27 days ago

[-]

Some of my best code comes from my dreams though.

spiderfarmer

27 days ago

[-]

And yet LLMs are incredibly useful as they are right now.

strken

27 days ago

[-]

And yet they're going to be better in a decade, which will require understanding why they aren't perfect today.

apical_dendrite

27 days ago

[-]

The volume is different. Someone submitted a PR this week that was 3800 lines of shell script. Most of it was crap and none of it should have been in shell script. He's submitting PRs with thousands of lines of code every day. He has no idea how any of it actually works, and it completely overwhelms my ability to review.

Sure, he could have submitted an ill-considered 3800-line PR five years ago, but it would have taken him at least a week and there probably would have been opportunities to submit smaller chunks along the way or discuss the approach.

switchbak

27 days ago

[-]

It’s harder when the person doing what you describe has the ability to have you fired. Power asymmetry + irresponsible AI use + no accountability = a recipe for a code base going right to hell in a few months.

I think we’re going to see a lot of the systems we depend on fail a lot more often. You’d often see an ATM or flight status screen have a BSOD - I think we’re going to see that kind of thing everywhere soon.

27 days ago

[-]

Just block that user, that seems to be the way.

somewhereoutth

27 days ago

[-]

Humans have a 'world model' beyond the syntax - for code, an idea of what the code should do and how it does it. Of course, some humans are better than others at this, they are recognized as good programmers.

27 days ago

[-]

Papers show that AI also has a world model, so I don't think that's the right distinction.

27 days ago

[-]

Could you please cite these papers? If by AI you mean LLMs, that is not supported by what I know. If you mean a theoretical world-model-based AI, that's just a tautological statement.

27 days ago

[-]

https://arxiv.org/abs/2305.11169

https://arxiv.org/abs/2506.02996

salawat

27 days ago

[-]

Their world model is completely a byproduct of language though, not experience. Furthermore, they by deliberate design do not maintain any form of self-recognition or narrative tracking, which is the necessary substrate for developing validating experience. The world model of an LLM is still a map. Not the territory. Even though ours has some of the same qualities arguably, the identity we carry with us and our self-narrative are incredibly powerful in terms of allowing us to maintain alignment with the world as she is without munging it up quite as badly as LLMs seem prone to.

27 days ago

[-]

How do you know ours is any different, that we are not in a simulation or a solipsistic scenario? The truth is that one cannot know, it's a philosophical quandary that's been debated for millennia.

27 days ago

[-]

It is absolutely obvious how different it is from interacting with any LLM about the ways that it is wrong.

27 days ago

[-]

Nope, appeal to obviousness is not a sound argument. There are many things people thought were obvious that were wrong.

27 days ago

[-]

It wasn't an argument. There isn't much point in going to a lot of trouble to make an argument to someone so clearly determined to ignore the truth. It is nevertheless true.

27 days ago

[-]

Just saying something is true doesn't make it so. Truth requires justification, and if you can't provide that, then there's no reason to believe it's true. For someone making a claim, the onus is on them to provide evidence.

Otherwise I'll just say I'm right and you're wrong, after all, that's what you're saying.

salawat

26 days ago

[-]

Simple. I have two sets of data I can pull from to validate a claim an LLM makes. I have the linguistic corpora we produce (artificial memory, analogous to the latent space built by an LLM). You are correct in that this modality is shared. I also, however, have internal self-narrative and experiential state that is non-linguistic, but sensory/perception driven. An LLM can try to convince me that a bunch of mathematicians would come up with a system that requires one to make many copies of the same bitwise representation of a block for loading by the execution framework due to munging of the latent space via quantization. However, I have recollections of my time amongst mathematicians and theorists. I can replay my lived perceptions of those times, and analyze and extract new meaning from them as my neural hardware evolves. Therefore, when that claim is made, my validation of the world as she is comes to a screeching halt to the tune of a recollection of a calculus class where the entire point is to pound into you the utility of fungibility of mathematical representations (substitution), and a further connection to optimization (replace an entire cluster of an equation with a letter to process other things first and deal with the internal details later). That also synthesizes into the principle that mathematicians are both lazy and clever. Alias that bitch, and move right along. LLMs don't have that without you deliberately injecting that mechanism into their context. They'll in fact just run off the rails.

Now, could an equivalent process be modelled at some point? Probably. It'd be a conscious decision to do so on our part, and given fears over the AI Alignment quandary, it seems a rather fraught direction to carelessly proceed.

27 days ago

[-]

One conference proceeding paper and one preprint, about LLMs encoding either relative geometric information of objects or simple 2D paths.

One of the papers calls this "programming language semantics", but it is using a 2D grid navigation DSL. The semantics of that language are nothing like actual programming language semantics.

These are not the same as the concept being discussed here, a human "world model" of a computer system, through which to interpret the semantics of a program.

27 days ago

[-]

Well I didn't find any papers off the bat for code world models but if they can create a world model for the task given, such as geometric manipulation, I don't see why they wouldn't in terms of code.

27 days ago

[-]

Because a "world model" for relative positions in space is just a partial ordering of points.

That's not really a world model.

detourdog

27 days ago

[-]

What surprises me about the current development environment is the acceleration of technical debt. When I was developing my skills, the nagging feeling that I didn't quite understand the technology was a big dark cloud. I felt this cloud was technical debt. This was always what I was working against.

I see current expectations that technical debt doesn't matter. The current tools embrace superficial understanding. These tools paper over the debt. There is no need for deeper understanding of the problem or solution. The tools take care of it behind the scenes.

27 days ago

[-]

It’s not. LLMs are just averaging their internet snapshot, after all.

But people want an AI that is objective and right. HN is where people who know the distinction hang out, but it’s not what the layperson thinks they are getting when they use this miraculous super hyped tool that everybody is raving about.

mrwh

27 days ago

[-]

The etiquette, even at the bigtech place I work, has changed so quickly. The idea that it would be _embarrassing_ to send a code review with obvious or even subtle errors is disappearing. More work is being put on the reviewer. Which might even be fine if we made the further change that _credit goes to the reviewer_. But if anything we're heading in the opposite direction, lines of code pumped out as the criterion of success. It's like a car company that touts how _much_ gas its cars use, not how little.

27 days ago

[-]

Review is usually delegated to an AI too

27 days ago

[-]

By now, a few years after ChatGPT released, I don't think anyone is thinking AI is objective and right, all users have seen at least one instance of hallucination and simply being wrong.

27 days ago

[-]

Sorry I can think of so many counter examples. I also detect a lot of “well it hallucinates about subject X (that the person knows well, so can spot the hallucination)” but continue to trust it on subjects Y and Z (which the person knows less well so can’t spot the hallucinations).

YMMV.

27 days ago

[-]

> Briefly stated, the Gell-Mann Amnesia effect works as follows. You open the newspaper to an article on some subject you know well. In Murray's case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward - reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them. In any case, you read with exasperation or amusement the multiple errors in a story - and then turn the page to national or international affairs, and read with renewed interest as if the rest of the newspaper was somehow more accurate about far-off Palestine than it was about the story you just read. You turn the page, and forget what you know.

-Michael Crichton

27 days ago

[-]

Sure, Gell-Mann amnesia exists, but remember that its origin is actually human, in the form of newspaper writers. So, how can we trust humans the same way? In just the same way, AI cannot also be fully trusted.

27 days ago

[-]

The current way of doing AI cannot be trusted.

That doesn’t mean the future won’t herald a way of using what a transformer is good at - interfacing with humans - to translate to and interact with something that can be a lot more sound and objective.

27 days ago

[-]

You're falling into the extrapolation fallacy; there is no reason to think that the future won't have the same issues as today in terms of hallucinations.

And even if they were solved, how would that even work? The world is not sound and objective.

27 days ago

[-]

It’s a thought experiment. I am not saying I believe it will happen.

But right now there are lots of domains where current lauded success is in treating something objective - like code - as tokens for an LLM.

We could instead explore using transformers to translate human languages to a symbology that can be reasoned about and applied, e.g. to code.

It’s the talk of conferences. But whether it works better than what we have today, or whether it aligns with the incentives of the big players, is another matter.

27 days ago

[-]

There are a lot of binary thinkers on HN, but they shouldn’t make up a majority.

rDr4g0n

27 days ago

[-]

It's much easier to fire an employee who produces low quality/effort work than to convince leadership to fire Claude.

27 days ago

[-]

You can fire employees who don't review code generated though, because ultimately it's their responsibility to own their code, whether they hand wrote it or an LLM did.

It seems to me that it's all a matter of company culture, as it has always been, not AI. Those that tolerate bad code will continue to tolerate it, at their peril.

27 days ago

[-]

It writes statistically represented code, which is why (unless instructed otherwise) everything defaults to enterprisey, OOP, "I installed 10 trendy dependencies, please hire me" type code.

Shyaamal11

20 days ago

[-]

One thing I’ve noticed while working with data/AI workflows is that the “acceptance criteria first” idea applies even more strongly once you move beyond code generation into data pipelines and analytics.

LLMs can generate queries, transformations, or even Spark jobs that look reasonable but if the underlying data contracts, schema expectations, or evaluation criteria aren’t defined, you end up with something that looks correct but is semantically wrong.

In practice, the teams that get the most value from AI-assisted development tend to have:

* clearly defined datasets

* reproducible data pipelines

* well-defined outputs / metrics

Once those pieces are in place, AI becomes much more useful because it’s operating inside a structured system instead of guessing context. That’s also why there’s been a lot of interest lately in lakehouse-style platforms that combine data engineering, analytics, and AI workflows in one place (e.g. platforms like IOMETE).

When the data layer is structured and reproducible, AI tooling becomes far more reliable. Curious if others here have seen the same pattern when using LLMs for data engineering or analytics work.

siliconc0w

27 days ago

[-]

Just a recent anecdote, I asked the newest Codex to create a UI element that would persist its value on change. I'm using Datastar and have the manual saved on-disk and linked from the AGENTS.md. It's a simple html element with an annotation, a new backend route, and updating a data model. And there are even examples of this elsewhere in the page/app.

I've asked it to do way harder things so I thought it'd easily one-shot this but for some reason it absolutely ate it on this task. I tried to re-prompt it several times but it kept digging a hole for itself, adding more and more in-line javascript and backend code (and not even cleaning up the old code).

It's hard to appreciate how unintuitive the failure modes are. It can do things probably only a handful of specialists can do but it can also critically fail on what is a straightforward junior programming task.

ollybrinkman

28 days ago

[-]

This maps directly to the shift happening in API design for agent-to-agent communication.

Traditional API contracts assume a human reads docs and writes code once. But when agents are calling agents, the "contract" needs to be machine-verifiable in real-time.

The pattern I've seen work: explicit acceptance criteria in API responses themselves. Not just status codes, but structured metadata: "This response meets JSON Schema v2.1, latency was 180ms, data freshness is 3 seconds."

Lets the calling agent programmatically verify "did I get what I paid for?" without human intervention. The measurement problem becomes the automation problem.

Similar to how distributed systems moved from "hope it works" to explicit SLOs and circuit breakers. Agents need that, but at the individual request level.
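A per-request check of that kind might look like the sketch below. Everything in it is an assumption made for illustration: the `meta` envelope, the field names `schema_version`, `latency_ms`, and `data_freshness_s`, and the `AcceptanceCriteria` type are invented here and do not come from any real API. Note also that it only checks what the callee reports about itself; verifying those claims independently is a separate problem.

```python
# Hypothetical response envelope and field names, chosen for illustration only.
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriteria:
    schema_version: str
    max_latency_ms: float
    max_staleness_s: float


def violations(response: dict, criteria: AcceptanceCriteria) -> list[str]:
    """Return the criteria the response fails; an empty list means 'accept'."""
    meta = response.get("meta") or {}
    problems = []

    if meta.get("schema_version") != criteria.schema_version:
        problems.append(
            f"schema_version is {meta.get('schema_version')!r}, "
            f"wanted {criteria.schema_version!r}"
        )

    limits = (
        ("latency_ms", criteria.max_latency_ms),
        ("data_freshness_s", criteria.max_staleness_s),
    )
    for field, limit in limits:
        value = meta.get(field)
        # A missing or non-numeric measurement counts as a violation:
        # the caller cannot verify what was never reported.
        if not isinstance(value, (int, float)) or value > limit:
            problems.append(f"{field} is {value!r}, limit is {limit}")

    return problems


if __name__ == "__main__":
    criteria = AcceptanceCriteria("2.1", max_latency_ms=250, max_staleness_s=5)
    response = {
        "data": {"price": 42},
        "meta": {"schema_version": "2.1", "latency_ms": 180, "data_freshness_s": 3},
    }
    print(violations(response, criteria))
```

Treating missing metadata as a failure rather than a pass is the design choice that matters here: a calling agent with no human in the loop should reject what it cannot check.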

jt2190

27 days ago

[-]

Interesting, but couldn’t the agent be given access to tools that allow it to make those evaluations without having to modify the API responses? (Maybe I’m not visualizing “API” the same way you are.)

swiftcoder

28 days ago

[-]

What's up with the (somewhat odd) title HN has gone with for this article? It's implying a very different article than the one I just read.

lukeify

28 days ago

[-]

Most humans also write plausible code.

28 days ago

[-]

LLMs piggyback on human knowledge encoded in all the texts they were trained on without understanding what they're doing.

Humans would execute that code and validate it. From plausible it becomes "hey, it does this, and this is what I want". LLMs skip that part; they really have no understanding other than the statistical patterns they infer from their training, and they really don't need any for what they are.

red75prime

28 days ago

[-]

Could we stop using vague terms like “understanding” when talking about LLMs and machine learning? You don't know what understanding is. You only know how it feels to understand something.

It's better to describe what you can do that LLMs currently can't.

stevenhuang

28 days ago

[-]

At least it's an easy way for those who don't know what they're talking about to out themselves.

If they'd bother to see how modern neuroscience tries to explain human cognition they'd see it explained in terms that parallel modern ML. https://en.wikipedia.org/wiki/Predictive_coding

We only have theories for what intelligence even means; I wouldn't be surprised if there are more similarities than differences between human minds and LLMs, fundamentally (prediction and error minimization).

calf

27 days ago

[-]

We can't use understanding but we can use learning, got it.

owlninja

28 days ago

[-]

They probably at least look at the docs?

stevenhuang

28 days ago

[-]

LLMs can execute code and validate it too so the assertions you've made in your argument are incorrect.

What a shame your human reasoning and "true understanding" led you astray here.

gitaarik

28 days ago

[-]

All code is plausible by design

jqpabc123

28 days ago

[-]

LLMs have no idea what "correct" means.

Anything they happen to get "correct" is the result of probability applied to their large training database.

Being wrong will always be not only possible but also likely any time you ask for something that is not well represented in its training data. The user has no way to know if this is the case so they are basically flying blind and hoping for the best.

Relying on an LLM for anything "serious" is a liability issue waiting to happen.

A1kmm

28 days ago

[-]

Yes, Transformer models are non-deterministic, but it is absolutely not true that they can't generalise (the equivalent of interpolation and extrapolation in linear regression, just with a lot more parameters and training).

For example, let's try a simple experiment. I'll generate a random UUID:

> uuidgen
44cac250-2a76-41d2-bbed-f0513f2cbece

Now it is extremely unlikely that such a UUID is in the training set.

Now I'll use OpenCode with "Qwen3 Coder 480B A35B Instruct" with this prompt: "Generate a single Python file that prints out the following UUID: "44cac250-2a76-41d2-bbed-f0513f2cbece". Just generate one file."

It generates a Python file containing 'print("44cac250-2a76-41d2-bbed-f0513f2cbece")'. Now this is a very simple task (with a 480B model), but it solves a problem that is not in the training data, because it is a generalisation over similar but different problems in the training data.

Almost every programming task is, at some level of abstraction, and with different levels of complexity, an instance of solving a more general type of problem, where there will be multiple examples of different solutions to that same general type of problem in the training set. So you can get a very long way with Transformer model generalisations.
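The experiment above is easy to repeat with a fresh UUID each time, which rules out memorisation by construction. A minimal sketch of the checking side follows; `output_of` and `passes` are hypothetical helper names, and the `generated` string is a stand-in for whatever source the model actually returns (no model is called here). Running model output through `exec` is only reasonable in a throwaway sandbox.

```python
# Hypothetical harness: `generated` stands in for source text returned by a model.
import contextlib
import io
import uuid


def output_of(source: str) -> str:
    """Execute Python source and return everything it printed to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Caution: exec runs arbitrary code; only do this in a sandbox.
        exec(source, {"__name__": "__main__"})
    return buffer.getvalue()


def passes(source: str, expected_uuid: str) -> bool:
    """True if the program prints exactly the expected UUID."""
    return output_of(source).strip() == expected_uuid


if __name__ == "__main__":
    fresh = str(uuid.uuid4())          # new on every run, so not in any training set
    generated = f'print("{fresh}")'    # stand-in for the model's answer
    print(passes(generated, fresh))
```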

tonypapousek

28 days ago

[-]

It’s a shame the bulk of that training data is likely 2010s blogspam that was poor quality to begin with.

28 days ago

[-]

But isn't that a reflection of reality?

If you've made a significant investment in human capital, you're even more likely to protect it now and prevent posting valuable stuff on the web.

27 days ago

[-]

No?

27 days ago

[-]

Yes it is. There’s a reason why university knowledge is gated. And was gated for centuries.

Can’t believe I have to explain simple stuff.

28 days ago

[-]

Aye. I wish more conversations would be more of this nature - in that we should start with basic propositions - e.g. the thing does not 'know' or 'understand' what correct is.

LarsDu88

28 days ago

[-]

This is about to change very soon. Unlike many other domains (such as greenfield scientific discovery), most coding problems for which we can write tests and benchmarks are "verifiable domains".

This means an LLM can autogenerate millions of code problem prompts, attempt millions of solutions (both working and non-working), and from the working solutions, penalize answers that have poor performance. The resulting synthetic dataset can then be used as a finetuning dataset.

There are now reinforcement finetuning techniques that have not been incorporated into the existing slate of LLMs that will enable finetuning them for both plausibility AND performance with a lot of gray area (like readability, conciseness, etc) in between.

What we are observing now is just the tip of a very large iceberg.
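To make "verifiable" concrete, here is a toy scoring function of the kind such a pipeline would need: zero reward for any wrong answer, and a performance bonus only among correct ones. It is a sketch under assumptions, not the method of any particular paper; `reward`, the 0.5/0.5 weighting, the time budget, and the toy task (counting distinct values) are all invented for illustration.

```python
# Toy reward for a "verifiable domain"; the weights and names are assumptions.
import time
from typing import Callable

TestCase = tuple[list[int], int]


def reward(
    candidate: Callable[[list[int]], int],
    tests: list[TestCase],
    time_budget_s: float = 0.5,
) -> float:
    """0.0 if any test fails; otherwise 0.5 plus up to 0.5 for finishing quickly."""
    start = time.perf_counter()
    for values, expected in tests:
        try:
            if candidate(list(values)) != expected:
                return 0.0
        except Exception:
            return 0.0  # crashing counts as a wrong answer
    elapsed = time.perf_counter() - start
    speed_bonus = max(0.0, 1.0 - elapsed / time_budget_s)
    return 0.5 + 0.5 * speed_bonus


# Three candidate "solutions" to the toy task of counting distinct values.
def wrong(values: list[int]) -> int:
    return len(values)  # plausible-looking, but ignores duplicates


def slow(values: list[int]) -> int:
    seen: list[int] = []
    for v in values:
        if v not in seen:  # linear scan makes this quadratic overall
            seen.append(v)
    return len(seen)


def fast(values: list[int]) -> int:
    return len(set(values))


TESTS: list[TestCase] = [
    ([], 0),
    ([1, 1, 2], 2),
    (list(range(2000)) * 2, 2000),
]

if __name__ == "__main__":
    for fn in (wrong, slow, fast):
        print(fn.__name__, round(reward(fn, TESTS), 3))
```

Correctness gates the score, so a fast wrong answer can never outrank a slow right one; wall-clock timing is noisy, and a real pipeline would presumably use something more stable.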

28 days ago

[-]

Let's suppose whatever you say is true.

If I'm the govt, I'd be foaming at the mouth - those projects that used to require enormous funding now will supposedly require much less.

Hmmm, what to do? Oh I know. Let's invest in Digital ID-like projects. Fun.

LarsDu88

28 days ago

[-]

It is true. Here is the publication going over how to generate this type of dataset and finetune: https://arxiv.org/pdf/2506.14245

I don't think you grasp my statement. LLMs will exceed humans greatly for any domain that is easy to computationally verify, such as math and code. For areas not amenable to deterministic computations, such as human biology or experimental particle physics, progress will be slower.

27 days ago

[-]

lol did you even read my post, dude?

simianwords

28 days ago

[-]

This is easily proven incorrect. Just go to ChatGPT and say something incorrect and ask it to verify. Why do people still believe this type of thing?

27 days ago

[-]

I did this yesterday and it was happy to provide me with an incorrect explanation. Not just that, but incorrect thermodynamic data supporting its claims, despite readily available published values to the contrary.

girvo

28 days ago

[-]

And yet models get things wrong all the time, too.

simianwords

28 days ago

[-]

That’s what I would expect even if it can have the concept of truth. Like humans.

28 days ago

[-]

I'm using an LLM to write queries ATM. I have it write lots of tests, do some differential testing to get the code and the tests correct, and then have it optimize the query so that it can run on our backend (and optimization isn't really optional since we are processing a lot of rows in big tables). Without the tests this wouldn't work at all, and not just tests, we need pretty good coverage since if some edge case isn't covered, it likely will wash out during optimization (if the code is ever correct about it in the first place). I've had to add edge cases manually in the past, although my workflow has gotten better about this over time.

I don't use a planner though, I have my own workflow setup to do this (since it requires context isolated agents to fix tests and fix code during differential testing). If the planner somehow added broad test coverage and a performance feedback loop (or even just very aggressive well known optimizations), it might work.
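For readers who have not used differential testing on queries, a minimal sketch with the standard-library `sqlite3` module is below. The `orders` table, the three queries, and the helper names are hypothetical; the idea is simply to run a plain reference query and a rewritten candidate against the same generated data and compare results, with hand-added rows for the edge cases random data is unlikely to hit.

```python
# Illustrative only: the schema, queries, and helper names are invented.
import random
import sqlite3
from typing import Optional

# Reference: written for clarity, assumed correct.
REFERENCE_SQL = """
SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE amount IS NOT NULL
GROUP BY customer_id
HAVING SUM(amount) > 100
ORDER BY customer_id
"""

# Candidate: a rewrite that should return exactly the same rows.
CANDIDATE_SQL = """
WITH totals AS (
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
)
SELECT customer_id, total FROM totals WHERE total > 100 ORDER BY customer_id
"""

# A subtly wrong rewrite: the boundary changed from > to >=.
BUGGY_SQL = CANDIDATE_SQL.replace("total > 100", "total >= 100")


def make_db(seed: int) -> sqlite3.Connection:
    """Build an in-memory table with seeded random rows plus fixed edge cases."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount INTEGER)")
    rows = [
        (rng.randint(1, 20), rng.choice([None, rng.randint(0, 60)]))
        for _ in range(200)
    ]
    # Hand-added edge cases: a total exactly on the boundary, an all-NULL customer.
    rows += [(999, 100), (998, None)]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


def first_mismatch(reference_sql: str, candidate_sql: str, seeds=range(20)) -> Optional[int]:
    """Return the first seed where the two queries disagree, or None if they agree."""
    for seed in seeds:
        conn = make_db(seed)
        try:
            expected = conn.execute(reference_sql).fetchall()
            actual = conn.execute(candidate_sql).fetchall()
        finally:
            conn.close()
        if expected != actual:
            return seed
    return None


if __name__ == "__main__":
    print("candidate:", first_mismatch(REFERENCE_SQL, CANDIDATE_SQL))
    print("buggy:", first_mismatch(REFERENCE_SQL, BUGGY_SQL))
```

The boundary bug is caught only because of the hand-added `(999, 100)` row, which matches the point above about edge cases that wash out during optimization unless a test covers them.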

88j88

28 days ago

[-]

100%. I've found that you may think you are smarter than the LLM and know what you want, but this is not the case. Give the LLM some leeway to come up with a solution based on what you are looking to achieve: give requirements, but don't ask it to produce the solution that you would have, because then the response is forced and it is lower quality.

mirsadm

28 days ago

[-]

100% dependent on the person driving it

cadamsdotcom

27 days ago

[-]

This is a bit unfair - to generate a bunch of code but not give the model data/tools and direct it to optimize it; then compare it to the optimized work of thousands over decades.

Feels like an extremely high effort hit piece, even though I know it’s not.

bitwize

27 days ago

[-]

You: Claude, do you know how to program?

Claude: No, but if you hum a few bars I can fake it!

Except "faking it" turns out to be good enough, especially if you can fake it at speed and get feedback as to whether it works. You can then just hillclimb your way to an acceptable solution.

27 days ago

[-]

Iterative Faking™ — now with plausible-looking test suite!

plandis

27 days ago

[-]

I’ve found this to be critical for having any chance of getting agents to generate code that is actually usable.

The more frequently you can verify correctness in some automated way the more likely the overall solution will be correct.

I’ve found that with good enough acceptance criteria (both positive and negative) it’s usually sufficient for agents to complete one off tasks without a human making a lot of changes. Essentially, if you’re willing to give up maintainability and other related properties, this works fairly well.

I’ve yet to find agents good enough to generate code that needs to be maintained long term without a ton of human feedback or manual code changes.

nprateem

28 days ago

[-]

In the last month I've done 4 months of work. My output is what a team of 4 would have produced pre-AI (5 with scrum master).

Just like you can't develop musical taste without writing and listening to a lot of music, you can't teach your gut how to architect good code without putting in the effort.

Want to learn how to 10x your coding? Read design patterns, read and write a lot of code by hand, review PRs, hit stumbling blocks and learn.

I noticed the other day how I review AI code in literally seconds. You just develop a knack for filtering out the noise and zooming in on the complex parts.

There are no shortcuts to developing skill and taste.

allajfjwbwkwja

27 days ago

[-]

> I review AI code in literally seconds

You've just settled for hackathon standards and told yourself it's okay because you're using AI.

Everyone with experience should know that even thorough code reviews only catch stylistic issues, glaring errors, and the most obvious design deficiencies. The only time new code is truly thought about is as it's being written.

gormen

28 days ago

[-]

Excellent article. But to be fair, many of these effects disappear when the model is given strict invariants, constraints, and built-in checks that are applied not only at the beginning but at every stage of generation.

giancarlostoro

27 days ago

[-]

This is why I used to use Beads and now GuardRails (shameless plug[0]). You brain dump to the model what you want, it breaks it down into discrete tasks, you have it refine them with you. By the time you have the model work on everything it can spawn workers in parallel that know what to do. In hindsight I should have called it BrainDump.

[0]: https://giancarlostoro.com/introducing-guardrails-a-new-codi...

jbergqvist

27 days ago

[-]

Producing the most plausible code is literally encoded into the cross entropy loss function and is fundamental to the pre-training. I suppose post training methods like RLVR are supposed to correct for this by optimizing correctness instead of plausibility, but there are probably many artifacts like these still lurking in the model's reasoning and outputs. To me it seems at least possible that the AI labs will find ways to improve the reward engineering to encourage better solutions in the coming years though.
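For reference, the pre-training objective mentioned above is usually written as the next-token cross-entropy below. This is the standard textbook form, not something taken from the article, with $x_t$ the token at position $t$ and $p_\theta$ the model's predicted distribution:

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Minimizing this rewards assigning high probability to the continuation that actually appeared in the training text. Nothing in the expression checks whether the resulting code runs or is correct.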

jamesblonde

26 days ago

[-]

The reference in the text to Anthropic’s “Towards Understanding Sycophancy in Language Models” is related to RLHF (reinforcement learning from human feedback).

Claude Code primarily uses different "pathways" in Anthropic LLMs that were not post-trained with RLHF, but rather with RLVR (reinforcement learning with verifiable rewards).

So, his point about code being produced to please the user isn't valid from where I am sitting.

28 days ago

[-]

I tried to make Claude Code, Sonnet 4.6, write a program that draws a fleur-de-lis.

No exaggeration it floundered for an hour before it started to look right.

It's really not good at tasks it has not seen before.

jshmrsn

28 days ago

[-]

Considering that a fleur-de-lis involves somewhat intricate curves, I think I'd be pretty happy with myself if I could get that task done in an hour.

Given a harness that allows the model to validate the result of its program visually, and given the models are capable of using this harness to self correct (which isn't yet consistently true), then you're in a situation where in that hour you are free to do some other work.

A dishwasher might take 3 hours to do what a human could do in 30 minutes, but they're still very useful because the machine's labor is cheaper than human labor.

28 days ago

[-]

I didn't provide any constraints on how to draw it.

TBH I would have just rendered a font glyph, or failing that, grabbed an image.

Drawing it with vector graphics programmatically is very hard, but a decent programmer would and should push back on that.

zeroxfe

28 days ago

[-]

> TBH I would have just rendered a font glyph, or failing that, grabbed an image.

If an LLM did that, people would be all up in arms about it cheating. :-)

For all its flaws, we seem to hold LLMs up to an unreasonably high bar.

28 days ago

[-]

That's the job description for a good programmer though. Question assumptions and requirements, and then find the simplest solution that does the job.

Just about anyone can eventually come up with a hideously convoluted HeraldicImageryEngineImplFactory<FleurDeLis>.

ehnto

28 days ago

[-]

Even with well understood languages, if there isn't much in the public domain for the framework you're using it's not really that helpful. You know you're at the edges of its knowledge when you can see the exact forum posts you are looking at showing up verbatim in its responses.

I think some industries with mostly proprietary code will be a bit disappointing to use AI within.

comex

28 days ago

[-]

LLMs are really bad at anything visual, as demonstrated by pelicans riding bicycles, or Claude Plays Pokémon.

Opus would probably do better though.

28 days ago

[-]

How could they be any good at visuals? They are trained on text after all.

comex

28 days ago

[-]

Supposedly the frontier LLMs are multimodal and trained on images as well, though I don't know how much that helps for tasks that don't use the native image input/output support.

Whatever the cause, LLMs have gotten significantly better over time at generating SVGs of pelicans riding bicycles:

https://simonwillison.net/tags/pelican-riding-a-bicycle/

But they're still not very good.

28 days ago

[-]

I have to admit I'm seeing this for the first time and am somewhat impressed by the results and even think they will get better with more training, why not... But are these multimodal LLMs still LLMs though? I mean, they're still LLMs but with a sidecar that does other things and the training of the image takes place outside the LLMs so in a way the LLMs still don't "know" anything about these images, they're just generating them on the fly upon request.

simonw

28 days ago

[-]

Some of the LLMs that can draw (bad) pelicans on bicycles are text-input-only LLMs.

The ones that have image input do tend to do better though, which I assume is because they have better "spatial awareness" as part of having been trained on images in addition to text.

I use the term vLLMs or vision LLMs to define LLMs that are multimodal for image and text input. I still don't have a great name for the ones that can also accept audio.

The pelican test requires SVG output because asking a multimodal output model like Gemini Flash Image (aka Nano Banana) to create an image is a different test entirely.

boxedemp

28 days ago

[-]

Maybe we should drop one of the L's

astrange

28 days ago

[-]

Claude is multimodal and can see images, though it's not good at thinking in them.

msephton

28 days ago

[-]

Shapes can be described as text or mathematical formulas.

tempest_

28 days ago

[-]

An SVG is just text.

internet2000

28 days ago

[-]

I got Opus 4.6 to one shot it, took 5-ish mins. "Write me a python program that outputs an svg of a fleur-de-lis. Use freely available images to double check your work."

It basically just re-created the wikipedia article fleur-de-lis, which I'm not sure proves anything beyond "you have to know how to use LLMs"

64738

28 days ago

[-]

Just for reference, Codex using GPT-5.4 and that exact prompt was a 4-shot that took ten minutes. The first result was a horrific caricature. After a slight rebuke ("That looks terrible. Read https://en.wikipedia.org/wiki/Fleur-de-lis for a better understanding of what it should look like."), it produced a very good result but it then took two more prompts about the right side of the image being clipped off before it got it right.

robertcope

28 days ago

[-]

Same, I used Sonnet 4.6 with the prompt, "Write a simple program that displays a fleur-de-lis. Python is a good language for this." Took five or six minutes, but it wrote a nice Python Tk app that did exactly what it was supposed to.

hrmtst93837

28 days ago

[-]

The model stumbles when asked to invent procedural geometry it has rarely tokenized because LLMs predict tokens, not precise coordinate math. For reliable output define acceptance criteria up front and require a strict format such as an SVG path with absolute coordinates and explicit cubic Bezier control points, plus a tiny rendering test that checks a couple of landmark pixels.

Break the job into microtasks, ask for one petal as a pair of cubic Beziers with explicit numeric control points, render that snippet locally with a simple rasterizer, then iterate on the numbers. If determinism matters accept the tradeoff of writing a small generator using a geometry library like Cairo or a bezier solver so you get reproducible coordinates instead of watching the model flounder for an hour.
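A minimal sketch of the "one petal as a pair of cubic Beziers" microtask, in Python. The function names, the `pull` parameter and the 200x200 coordinates are illustrative assumptions, and the landmark check evaluates the curve mathematically instead of rasterizing it and sampling pixels:

```python
import re

def petal_curves(cx, top, base, half_width, pull=40):
    """Two cubic Beziers (four absolute points each) outlining one symmetric petal.

    All parameters are hypothetical knobs for this sketch, in SVG user units.
    """
    left = [(cx, top), (cx - half_width, top + pull),
            (cx - half_width, base - pull), (cx, base)]
    right = [(cx, base), (cx + half_width, base - pull),
             (cx + half_width, top + pull), (cx, top)]
    return left, right

def path_d(curves):
    """Serialize the curves as an SVG path using only absolute M, C and Z commands."""
    start = curves[0][0]
    parts = [f"M {start[0]} {start[1]}"]
    for _, c1, c2, end in curves:
        parts.append(f"C {c1[0]} {c1[1]}, {c2[0]} {c2[1]}, {end[0]} {end[1]}")
    parts.append("Z")
    return " ".join(parts)

def bezier_point(curve, t):
    """Evaluate a cubic Bezier at parameter t (0 <= t <= 1)."""
    p0, p1, p2, p3 = curve
    u = 1 - t
    return tuple(
        u**3 * a + 3 * u * u * t * b + 3 * u * t * t * c + t**3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )

def format_ok(d):
    """Acceptance check: exactly one move, two cubic segments and a close."""
    return re.findall(r"[A-Za-z]", d) == ["M", "C", "C", "Z"]

left, right = petal_curves(cx=100, top=20, base=180, half_width=40)
d = path_d([left, right])
svg = (f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 200">'
       f'<path d="{d}" fill="gold" stroke="black"/></svg>')
```

Because the control points are plain numbers, iterating "on the numbers" means changing four arguments and re-running the checks, not re-prompting for a whole drawing.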

scuff3d

28 days ago

[-]

I tried to use Codex to write a simple TCP to QUIC proxy. I intentionally kept the request fairly simple, take one TCP connection and map it to a QUIC connection. Gave a detailed spec, went through plan mode, clarified all the misunderstandings, let it write it in Python, had it research the API, had it write a detailed step by step roadmap... The result was a fucking mess.

Beyond the fact that it was "correct" in the same way the author of the article talked about, there was absolutely bizarre shit in there. As an example, multiple times it tried to import modules that didn't exist. It noticed this when tests failed, and instead of figuring out the import problem it added a fucking try/except around the import and did some goofy Python shenanigans to make it "work".

28 days ago

[-]

Have you tried describing to Claude what it is? The more the detail the better the result. At some point it does become easier to just do it yourself.

28 days ago

[-]

It knows what it is, it's a very well known symbol. But translating that knowledge to code is something else.

Interesting shortcoming, really shows how weak the reasoning is.

cat_plus_plus

28 days ago

[-]

Try writing code from description without looking at the picture or generated graphics. Visual LLM with a suggestion to find coordinates of different features and use lines/curves to match them might do better.

parvardegr

28 days ago

[-]

agreed with part that at some point it's better to just do it yourself but for sure they will get better and better

vdfs

28 days ago

[-]

Most people just forget to tell it "make it quick" and "make no mistake"

mekael

28 days ago

[-]

I’m unable to determine if you’re missing /s or not.

28 days ago

[-]

That's kind of foolish IMO. How can an open ended generic and terse request satisfy something users have in mind?

msvana

27 days ago

[-]

I think there is one problem with defining acceptance criteria first: sometimes you don't know ahead of time what those criteria are. You need to poke around first to figure out what's possible and what matters. And sometimes the criteria are subjective, abstract, and cannot be formally specified.

Of course, this problem is more general than just improving the output of LLM coding tools

plandis

27 days ago

[-]

Yeah it’s extremely helpful to clarify your thoughts before starting work with LLM agents.

I find Claude Code style plan mode to be a bit restrictive for me personally, but I’ve found creating a plan doc and then collaboratively iterating on it with an LLM to be helpful here.

I don’t really find it much different than the scoping I’d need to do before handing off some work to a more junior engineer.

ramoz

27 days ago

[-]

> Claude Code style plan mode to be a bit restrictive

Hey, that's why I built plannotator: https://github.com/backnotprop/plannotator

I like staying within Claude Code for orchestrating its plan mode, but I needed a better way to actually review the plan, address certain parts, see plan diffs, etc all in a better visual way. The hooks system through permissionrequest:exitplanmode keep this fairly ergonomic.

see it in action: https://www.youtube.com/watch?v=a_AT7cEN_9I

dillonsmartdev

28 days ago

[-]

Humans work best like this too

vicchenai

27 days ago

[-]

Been building a fintech data pipeline with Claude Code lately and yeah this tracks. The moment I started writing actual test cases before letting it loose the quality jumped massively. Before that it was generating stuff that looked right but would silently drop edge cases in the data parsing. Treating it like a junior dev who needs a clear spec is exactly right imo.
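A rough illustration of what "test cases before letting it loose" can look like for data parsing. The `SYMBOL,PRICE` format and the function name are made up for this sketch, not taken from any real pipeline. The point is that the acceptance checks cover both the positive case and the negative one: malformed rows must be reported, never silently dropped.

```python
from decimal import Decimal, InvalidOperation

def parse_price_rows(lines):
    """Parse 'SYMBOL,PRICE' lines and return (rows, rejects).

    Every non-blank input line ends up in exactly one of the two lists,
    so nothing can disappear without a trace.
    """
    rows, rejects = [], []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are the only input skipped on purpose
        parts = line.split(",")
        try:
            if len(parts) != 2 or not parts[0].strip():
                raise ValueError("expected SYMBOL,PRICE")
            rows.append((parts[0].strip(), Decimal(parts[1].strip())))
        except (ValueError, InvalidOperation) as exc:
            rejects.append((lineno, raw, str(exc)))
    return rows, rejects

def check_acceptance():
    """Acceptance criteria written before the parser: positive and negative."""
    sample = ["AAPL,189.95", "", "MSFT,", "bad line", " GOOG , 140.10 "]
    rows, rejects = parse_price_rows(sample)
    # Positive: well-formed rows are parsed exactly, whitespace tolerated.
    assert rows == [("AAPL", Decimal("189.95")), ("GOOG", Decimal("140.10"))]
    # Negative: malformed rows are reported with their line numbers.
    assert [lineno for lineno, _, _ in rejects] == [3, 4]
    # Invariant: no non-blank line is lost.
    assert len(rows) + len(rejects) == sum(1 for s in sample if s.strip())
    return True
```

Handing the agent `check_acceptance` first gives it something concrete to fail against, which is roughly the spec a junior dev would get.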

28 days ago

[-]

The difference for me recently:

Write a lambda that takes an S3 PUT event and inserts the rows of a comma separated file into a Postgres database.

Naive implementation: download the file from S3 and do a bulk insert. It would have taken 20 minutes, and it's what Claude did at first.

I had to tell it to use the AWS sql extension to Postgres that will load a file directly from S3 into a table. It took 20 seconds.
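For anyone curious, a sketch of the server-side import path. It assumes an RDS or Aurora PostgreSQL instance with the `aws_s3` extension enabled and an IAM role that lets the database read the bucket. The table name `staging_rows` and the helper name are hypothetical, and actually executing the statement with a driver such as psycopg is left out:

```python
from urllib.parse import unquote_plus

# One server-side import per uploaded object; the database reads the file
# from S3 itself instead of the Lambda downloading and re-inserting rows.
IMPORT_SQL = (
    "SELECT aws_s3.table_import_from_s3("
    "%s, '', '(format csv)', "
    "aws_commons.create_s3_uri(%s, %s, %s))"
)

def import_statements(event, table):
    """Turn an S3 PUT event into (sql, params) pairs for a DB-API driver."""
    statements = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        region = record["awsRegion"]
        statements.append((IMPORT_SQL, (table, bucket, key, region)))
    return statements

# Hypothetical event, trimmed to the fields used above.
example_event = {
    "Records": [{
        "awsRegion": "us-east-1",
        "s3": {"bucket": {"name": "my-bucket"},
               "object": {"key": "incoming/data+file.csv"}},
    }]
}
statements = import_statements(example_event, "staging_rows")
```

The empty second argument means "all columns in file order"; a real table would likely need an explicit column list and CSV options such as a header flag.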

I treat coding agents like junior developers.

svpyk

28 days ago

[-]

Unlike junior developers, llms can take detailed instructions and produce outstanding results at first shot a good number of times.

27 days ago

[-]

While I’m pro LLMs over junior developers, the other issue with LLMs is that even the most junior developer will learn your business context over time.

In my case, in consulting (cloud + app dev), I just start the AGENTS.md file with a summary of the contract (the SOW), my architectural diagram and the transcript of my design review with the customer.

conception

28 days ago

[-]

Did you ask it to research best practices for this method, have an adversarial performance based agent review their approach or search for performant examples of the task first? Relying on training data only will always get you subpar results. Using “What is the most performant way to load a CSV from S3 into PostgreSQL on RDS? Compare all viable and research approaches before recommending one.” gave me the extension as the top option.

28 days ago

[-]

I knew the best way. I was just surprised that Claude got it wrong. As soon as I told it to use the s3 extension, it knew to add the appropriate permissions, to update my sql unit script to enable the extension and how to write the code

conception

27 days ago

[-]

Yeah, give them a research project first and they do pretty well. Off the cuff, it's usually trash. I think that's the biggest disconnect between people who think AI is good and those who think it's bad: relying on training data memory will usually lead to subpar results.

datagobes

28 days ago

[-]

Same pattern in data engineering generally. LLMs default to the obvious row-by-row or download-then-insert approach and you have to steer them toward the efficient path (COPY, bulk loaders, server-side imports). Once you name the right primitive, they execute it correctly, permissions and all, as you found.

The deeper issue is that "efficient ingest" depends heavily on context that's implicit in your setup: file sizes, partitioning, schema evolution expectations, downstream consumers. A Lambda doing direct S3-to-Postgres import is fine for small/occasional files, but if you're dealing with high-volume event-driven ingestion you'll hit connection pool pressure fast on RDS. At that point the conversation shifts to something like a queue buffer or moving toward a proper staging layer (S3 → Redshift/Snowflake/Databricks with native COPY or autoloader). The LLM won't surface that tradeoff unless you explicitly bring it up. It optimizes for the stated task, not for the unstated architectural constraints.

28 days ago

[-]

Also with Redshift - split the file up before ingestion to equal the number of nodes or combine a lot of small files into larger files before putting them into S3 and/or use an Athena CTAS command to combine a lot of small files into one big file.

So in my other case, the whole thing was

Web crawler (internal customer website) using Playwright -> S3 -> SNS -> SQS -> Lambda (embed with Bedrock) -> S3 Vector Store.

Similar to what you said, I ran into Bedrock embedding service limits. Then once I told it that, it knew how to adjust the lambda concurrency limits. Of course I had to tell it to also adjust the sqs poller so messages wouldn’t be backed up in flight, then go to the DLQ without ever being processed.

ZeroGravitas

27 days ago

[-]

Does it work if you get the agent to throw away all of its actual implementation and start again from scratch, keeping all the learning and tests and feedback?

Gemini seems to try to get a lot of information upfront with questions and plans but people are famously bad at knowing what they want.

Maybe it should build a series of prototypes and spikes to check? If making code is cheap then why not?

freedomben

27 days ago

[-]

This does work but it requires prompts to instruct on it. It's also not perfect, though it is pretty good.

What I've found when doing exactly this, is that the cost of the initial code makes me hesitant to throw it away. A better workflow I've been using is instead to iterate on very detailed planning documents written in markdown and repeatedly iterating on that instead (like, sometimes 50+ times for a complex app). It's really quite amazing how much that helps. It can lead to a design doc that is good enough that I can turn the agent loose on implementation and get decent results. Best results are still with guidance throughout, but I have never once regretted hammering out a very detailed planning document. I have many times regretted keeping code (or throwing code away).

nickcoffee

25 days ago

[-]

The acceptance criteria point translates directly outside of coding too. Using Claude Code for sales and operational workflows, having acceptance criteria upfront (along with some manual checks along the way depending on the task) definitely helps the output.

geysersam

27 days ago

[-]

There's also such a thing as being too ambitious. 99% of developers cannot rewrite SQLite in Rust even if they spent the rest of their lifetime doing it.

Expecting an AI to do a good job vibe-coding a SQLite clone over a few weekends just isn't realistic. Despite that, it's useful technology.

akoboldfrying

28 days ago

[-]

The following paragraph appears twice:

> Now 2 case studies are not proof. I hear you! When two projects from the same methodology show the same gap, the next step is to test whether similar effects appear in the broader population. The studies below use mixed methods to reduce our single-sample bias.

teucris

27 days ago

[-]

This article hits on an important point not easily discerned from the title:

Sometimes good software is good due to a long history of hard-earned wins.

AI can help you get to an implementation faster. But it cannot magically summon up a battle-hardened solution. That requires going through some battles.

Great software takes time.

skybrian

28 days ago

[-]

You can ask an LLM to write benchmarks and to make the code faster. It will find and fix simple performance issues - the low-hanging fruit. If you want it to do better, you can give it better tools and more guidance.

It's probably a good idea to improve your test suite first, to preserve correctness.
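Something like the harness below is what I mean: a benchmark that also asserts the result, so a "faster" version cannot win by being wrong. It uses Python's built-in `sqlite3` module. The table layout and function names are just for illustration, and on an in-memory database the timing gap is small compared with what per-row commits cost on disk.

```python
import sqlite3
import time

def insert_per_row_commit(conn, rows):
    """Baseline: commit after every single row."""
    for row in rows:
        conn.execute("INSERT INTO t(id, name) VALUES (?, ?)", row)
        conn.commit()

def insert_single_transaction(conn, rows):
    """Candidate optimization: one transaction for the whole batch."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany("INSERT INTO t(id, name) VALUES (?, ?)", rows)

def run(insert_fn, rows):
    """Time one strategy on a fresh database and return what it stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t(id INTEGER PRIMARY KEY, name TEXT)")
    start = time.perf_counter()
    insert_fn(conn, rows)
    elapsed = time.perf_counter() - start
    stored = conn.execute("SELECT id, name FROM t ORDER BY id").fetchall()
    conn.close()
    return elapsed, stored

rows = [(i, f"row{i}") for i in range(100)]
baseline_time, baseline_rows = run(insert_per_row_commit, rows)
candidate_time, candidate_rows = run(insert_single_transaction, rows)

# Correctness gate first, speed second.
assert baseline_rows == candidate_rows == rows
print(f"per-row commit: {baseline_time:.6f}s, single transaction: {candidate_time:.6f}s")
```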

codethief

28 days ago

[-]

> Your LLM Doesn't Write Correct Code. It Writes Plausible Code.

I don't always write correct code, either. My code sure as hell is plausible but it might still contain subtle bugs every now and then.

In other words: 100% correctness was never the bar LLMs need to pass. They just need to come close enough.

ontouchstart

28 days ago

[-]

I made a comment in another thread about my acceptance criteria: https://news.ycombinator.com/item?id=47280645

It is more about LLMs helping me understand the problem than giving me over engineered cookie cutter solutions.

arikrahman

27 days ago

[-]

Uncle Bob made this concept clear to me when he introduced to me that code itself IS requirements specification. LLMs are the new intermediary, but the necessity of the word and the machine persists.

einrealist

28 days ago

[-]

> SQLite is not primarily fast because it is written in C. Well.. that too, but it is fast because 26 years of profiling have identified which tradeoffs matter.

Someone (with deep pockets to bear the token costs) should let Claude run for 26 months to have it optimize its Rust code base iteratively towards equal benchmarks. Would be an interesting experiment.

The article points out the general issue when discussing LLMs: audience and subject matter. We mostly discuss anecdotally about interactions and results. We really need much more data, more projects to succeed with LLMs or to fail with them - or to linger in a state of ignorance, sunk-cost fallacy and suppressed resignation. I expect the latter will remain the standard case that we do not hear about - the part of the iceberg that is underwater, mostly existing within the corporate world or in private GitHubs, a case that is true with LLMs and without them.

In my experience, 'Senior Software Engineer' has NO general meaning. It's a title to be awarded for each participation in a project/product over and over again. The same goes for the claim: "Me, Senior SWE treat LLMs as Junior SWE, and I am 10x more productive." Imagine me facepalming every time.

grey-area

28 days ago

[-]

This would be a really interesting experiment.

I suspect performance is not the only problem with the codebase though.

graphememes

28 days ago

[-]

bad input > bad output

idk what to say, just because it's rust doesn't mean it's performant, or that you asked for it to be performant.

yes, llms can produce bad code, they can also produce good code, just like people

jqpabc123

28 days ago

[-]

> yes, llms can produce bad code, they can also produce good code, just like people

Over time, you develop a feel for which human coders tend to be consistently "good" or "bad". And you can eliminate the "bad".

With an LLM, output quality is like a box of chocolates, you never know what you're going to get. It varies based on what you ask and what is in its training data --- which you have no way to examine in advance.

You can't fire an LLM for producing bad code. If you could, you would have to fire them all because they all do it in an unpredictable manner.

graphememes

28 days ago

[-]

no but you're a human and you're responsible for it, so it's on you

you can make horrible images with photoshop that doesn't make photoshop bad

jqpabc123

27 days ago

[-]

The key word here is *you*.

Photoshop doesn't make anything --- *you* make the image horrible --- or not. Any results relate directly to *your* skill.

A direct comparison to agentic AI is less than equitable. AI is supposedly able to provide skill --- which it often fails to do.

graphememes

27 days ago

[-]

you talk to the llm bro, you are responsible for the outcome

FrankWilhoit

28 days ago

[-]

Enterprise customers don't buy correct code, they buy plausible code.

28 days ago

[-]

They're not buying code.

They are buying a service. As long as the service 'works' they do not care about the other stuff. But they will hold you liable when things go wrong.

The only caveat is highly regulated stuff, where they actually care very much.

kibwen

28 days ago

[-]

Enterprise customers don't buy plausible code, they buy the promise of plausible code as sold by the hucksters in the sales department.

28 days ago

[-]

I think SolarWinds would have preferred correct code back in 2020.

qup

28 days ago

[-]

Okay, but what did they buy?

28 days ago

[-]

Code, from their employees.

qup

25 days ago

[-]

Plausible code from their employees.

bamboozled

28 days ago

[-]

I'm sure this is because they are pattern matching masters, if you program them to find something, they are good at that. But you have to know what you're looking for.

namuol

27 days ago

[-]

These LLM prompting tip articles write themselves if you just take the last decade of project management articles and replace “IC” with “agent”.

thrill

27 days ago

[-]

Increasing plausibility tends towards correctness.

pmarreck

28 days ago

[-]

Yes, which is why TDD is finally necessary

sim04ful

28 days ago

[-]

I've noticed a key quality signal with LLM coding is an LOC growth rate that tapers off or even turns negative.

cat_plus_plus

28 days ago

[-]

That's very impressive. Your LLM actually wrote correct code for a full relational database on the first try, like it takes 2.5 seconds to insert 100 rows but it stores them correctly and select is pretty fast. How many humans can do this without a week of debugging? I would suggest you install some profiling tools and ask it to find and address hotspots. SQLite had how long and how many people to get to where it is?

bluefirebrand

28 days ago

[-]

I could "write" this code the same way, it's easy

Just copy and paste from an open source relational db repo

Easy. And more accurate!

cat_plus_plus

28 days ago

[-]

The actual task is usually to mix something that looks like a dozen different open source repos combined but to take just the necessary parts for the task at hand and add glue / custom code for the exact thing being built. While I could do it, an LLM is much faster at it, and most importantly I would not enjoy the task.

snoob2021

28 days ago

[-]

It is a Rust reimplementation of SQLite. Not exactly just "copy and paste"

gzread

28 days ago

[-]

Early LLMs would do better at a task if you prefixed the task with "You are an expert [task doer]"

worik

27 days ago

[-]

This is becoming clear, now?

I have had similar experiences, and I read over and over others' experiences like this.

A powerful tool...

helsinki

28 days ago

[-]

That's why I added an invariant tool to my Go agent framework, fugue-labs/gollem:

https://github.com/fugue-labs/gollem/blob/main/ext/codetool/...

spullara

28 days ago

[-]

human developers work best when the user defines their acceptance criteria first.

JasonHEIN

28 days ago

[-]

Bro, you are basically saying "Oh, an LLM can't do in 10 days what a few people spent decades on." Get a life, bro: applaud it and change the title to "it can do xyz" instead of adding the "critical and critical" ...

jswelker

27 days ago

[-]

I also write plausible code. Not much of a moat.

seba_dos1

27 days ago

[-]

s/code/stuff/

maremmano

27 days ago

[-]

this won't age well.

riffraff

28 days ago

[-]

To be fair, people do too.

mentalgear

28 days ago

[-]

> I write this as a practitioner, not as a critic. After more than 10 years of professional dev work, I’ve spent the past 6 months integrating LLMs into my daily workflow across multiple projects. LLMs have made it possible for anyone with curiosity and ingenuity to bring their ideas to life quickly, and I really like that! But the number of screenshots of silently wrong output, confidently broken logic, and correct-looking code that fails under scrutiny I have amassed on my disk shows that things are not always as they seem.

Same experience, but the hype bros only need a shiny screengrab to proclaim that the age of "gatekeeping" SWE is over and get their click fix from the unknowing masses.

malkia

27 days ago

[-]

Are we now at the bottom of the Uncanny Valley of AI?

user3939382

28 days ago

[-]

I have great techniques to fix this issue but not sure how it behooves me to explain it.

27 days ago

[-]

Oftentimes, plausible code is good enough, hence why people keep using AI to generate code. This is a distinction without a difference.

27 days ago

[-]

There appears to be a similar approach in UX... plausible user experience is close enough.

27 days ago

[-]

Yes, especially because in UX there is no "correct" approach to it, it's all relative.

27 days ago

[-]

2 seconds to insert 100 rows in an empty database table is not "good enough" if you are doing anything that is worth doing.

27 days ago

[-]

Who said anything about this? I never did.

27 days ago

[-]

Tfa did

bluetomcat

27 days ago

[-]

No. Plausible code is syntactically-correct BS disguised as a solution, hiding a countless amount of weird semantic behaviours, invariants and edge cases. It doesn't reflect a natural and common-sense thought process that a human may follow. It's a jumble of badly-joined patterns with no integral sense of how they fit together in the larger conceptual picture.

27 days ago

[-]

Why do people keep insisting that LLMs don't follow a chain of reasoning process? Using the latest LLMs you can see exactly what they "think" and see the resultant output. Plausible code does not mean random code as you seem to imply, it means...code that could work for this particular situation.

27 days ago

[-]

Because they don't. The chain-of-reasoning feature is really just a way to get the LLM to prompt more.

The fact that it generates these "thinking" steps does not mean it is using them for reasoning. Its most useful effect is making it seem to a human that there is a reasoning process.

seba_dos1

27 days ago

[-]

I love how generating strings like "let me check my notes" is effective at ending up with somewhat better end results - it pushes the weights towards outputting text that appears to be written by someone who did check their notes :D

27 days ago

[-]

I can't remember which lecture it was, but a guy said "they don't think, they only seem to think, and they won't replace a substantial portion of human labor, they will only seem to do so" ;)

seba_dos1

27 days ago

[-]

Joking aside, this is exactly what happens with companies announcing "AI" replacing human labor when what they actually do is correcting for COVID-time overhiring while trying to make it appear in a way that won't make the stocks go too red.

27 days ago

[-]

Is this position axiomatic or falsifiable? What would it take to change your mind?

27 days ago

[-]

It doesn't have to be either because the burden of proof is not on me. It's on whoever claims that chaining multiple prompts together produces thinking, even though a single prompt is just predicting n-grams.

The chain does not change the token generation process, it just artificially lengthens it.

27 days ago

[-]

How would you determine humans have reasoning then, in a way that LLMs do not?

25 days ago

[-]

Easy, humans can synthesize new facts using logic and context. And the conclusions can be checked against the real world to check that the reasoning was correct.

Or another way: reasoning is a socially constructed concept, developed by humans. Humans therefore have defined reasoning, and must therefore know how to reason.

Or a third way: I experience reasoning, you experience reasoning. I am currently reasoning. You are currently reasoning. I am human, as are you. Therefore humans reason.

27 days ago

[-]

Or — here's a fun one — subjective experience.

25 days ago

[-]

This one is even easier. LLMs record objective data about n-gram distribution, there is no room for any "subjective state" in their working set.

Or another way: an LLM will respond the same way if you wait for one second or a decade between prompts. The only way it is interacted with is through a stream of tokens. There is exactly one stream at all times, and each time the stream is input again, it is barely different from the previous input. The LLM does not behave differently depending on the contents of the stream. It may produce the exact same token for two different streams. It may also encounter the same stream twice, and will act the same in both cases. If it were a "subject" "experiencing" say, a discussion, it would use its past "experience". But it does not.

27 days ago

[-]

They haven't replied to my comment but have to yours, so I can only assume they actually cannot point out the difference, which makes sense as the philosophy of mind is a very old subject and there is no way a threaded conversation like this would produce any concrete answers.

28 days ago

[-]

But my AI didn't do what your AI did.

Cherry picked AI fail for upvotes. Which you’ll get plenty of here and on Reddit from those too lazy to go and take a look for themselves.

Using Codex or Claude to write and optimize high performance code is a game changer. Try optimizing cuda using nsys, for example. It’ll blow your lazy little brain.

kccqzy

28 days ago

[-]

Yeah right. An LLM in the hands of a junior engineer produces a lot of code that looks like it was written by a junior. An LLM in the hands of a senior engineer produces code that looks like it was written by a senior. The difference is the quality of the prompt, as well as the human judgement to reject the LLM code and the follow-up prompts that tell the LLM what to write instead.

28 days ago

[-]

Lol what. The difference is that the senior... is a senior. Ask yourself what characteristics distinguish a senior from a junior...

You're glossing over so much stuff. Moreover, how does the Junior grow and become the senior with those characteristics, if their starting point is LLMs?

kccqzy

27 days ago

[-]

I’m not glossing over anything. You and I are talking about the exact same thing phrased differently. How does a senior know when to reject some LLM code and start over? Experience. I don’t disagree with you but your tone is aggravating.

G3rn0ti

28 days ago

[-]

This. I really wonder how trainees are supposed to grow in an age where they are asked not to code themselves but guide a machine doing so.

jonnycoder

28 days ago

[-]

Prompting is just step 1. Creating and reviewing a plan is step 2. Step 0 was iterating and getting the right skills in place. Step 3 is a command/skill that decomposes the problem into small implementation steps, each with its dependencies and a way to verify/test it. Step 4 is executing the implementation plan using sub agents and ensuring validation/testing passes. Step 5 is a code review using Codex (since I use Claude for implementation).
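A rough sketch of what the step 3 decomposition could look like as data, in case it helps make it concrete. Everything here is hypothetical: the `Step` class, the step names, and the `pytest` commands are made up for illustration and are not from any particular tool. The only real API used is `graphlib.TopologicalSorter` from the Python standard library, which orders the steps so each one runs after the steps it depends on.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Step:
    """One small implementation step (hypothetical structure)."""
    name: str
    verify: str  # how to check the step, e.g. a test command (illustrative only)
    depends_on: list[str] = field(default_factory=list)


def execution_order(steps: list[Step]) -> list[str]:
    """Return step names ordered so that dependencies come first."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {step.name: set(step.depends_on) for step in steps}
    # static_order() raises graphlib.CycleError if the plan is circular.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical plan, deliberately listed out of order.
plan = [
    Step("api", "pytest tests/test_api.py", depends_on=["schema"]),
    Step("schema", "pytest tests/test_schema.py"),
    Step("ui", "pytest tests/test_ui.py", depends_on=["api"]),
]

for name in execution_order(plan):
    print(name)
```

With this chain of dependencies the order is schema, api, ui. In a real setup each step's `verify` command would presumably be run before moving on to the next one, which is the "ensuring validation/testing passes" part of step 4.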

28 days ago

[-]

I kind of agree. But I'd adjust that to say that in both cases you get good-looking code. In the hands of a junior you get crappy architecture decisions and complete failure to manage complexity, which results in the inevitable Reddit "they degraded the model" post. In the hands of seniors you get well managed complexity, targeted features, scalable high performance architecture, and good base technology choices.

28 days ago

[-]

It’s easy to get AI to write bad code. Turns out you still need coding skills to get AI to write good code. But those who have figured it out can crank out working systems at a shocking pace.

28 days ago

[-]

Agreed 100%. I'd add that it's the knowledge of architecture and scaling that you got from writing all that good code, shipping it, and then having to scale it. It gives you the vocabulary and broad and deep knowledge base to innovate at lightning speeds and shocking levels of complexity.

serious_angel

28 days ago

[-]

I am sorry for asking, but... is there even a guide on how to "figure it out"? Otherwise, how are you so sure about it?

wmeredith

28 days ago

[-]

Right here: https://codemanship.wordpress.com/2025/10/30/the-ai-ready-so...

This series of articles is gold.

Unsurprisingly, writing good software with AI follows the same principles as writing it without AI. Keep scopes small. Ship, refactor, optimize, and write tests as you go.

pornel

28 days ago

[-]

When a new technology emerges we typically see some people who embrace it and "figure it out".

Electronic synthesisers went from "it's a piano, but expensive and sounds worse" to every weird preset creating a whole new genre of electronic music.

So it seems plausible, like Claude's code, that our complaints about unmaintainable code are from trying to use it like a piano, and the rave kids will find a better use for it.

simonw

28 days ago

[-]

I'm working on one here: https://simonwillison.net/guides/agentic-engineering-pattern...

28 days ago

[-]

That's actually a great question. Truth be told the best way right now is to grab Codex CLI or Claude CLI (I strongly prefer Codex, but Claude has its fans), and just start. Immediately. Then go hard for a few months and you'll develop the skills you need.

A few tips for a quickstart:

Give yourself permission to play.

Understand basic concepts like context window, compaction, tokens, chain of thought and reasoning, and so on. Use AI to teach you this stuff, and read every blog post OpenAI and Anthropic put out and research what you don't understand.

Pick a hard coding problem in Python or TypeScript and take a leap of faith and ask the agent to code it for you.

My favorite phrase when planning is: "Don't change anything. Just tell me." Save this as a tmux shortcut and use it at the end of every prompt when planning something out.

Use markdown .md docs to create a planning doc and keep chatting to the agent about it and have it update the plan until you're super happy, always using the magic phrase "Don't change anything. Just tell me." (I should get myself a patent on that little number. Best trick I know)

Every time you see an anti-AI post, just move on. It's lazy people making lazy assumptions. Approach agentic coding with a sense of love, excitement, optimism, and take massive leaps of faith and you'll be very very surprised at what you find.

Best of luck Serious Angel.

28 days ago

[-]

You're not really answering the question, are you?

Your answer is to play with it. Cool. But why can't you and others put together a proper guide lol? It can't be that hard.

Go ahead and do it - it'll challenge the Anti-AI posters you are referencing. I and others want to see that debate.

appcustodian2

28 days ago

[-]

Don't worry, we'll all be taking the Claude certification courses soon enough

28 days ago

[-]

Ah - I know! Seriously I know. There's such a bad need for this right now. The problem is that the folks who are great at agentic coding are coding their asses off 16 to 20 hours a day and don't have a minute they want to spend on writing guides because of the opportunity cost.

One of the rare resources I found recently was the OpenClaw guy's interview on Lex. He drops a few bangers that are really valuable and will save you having to spend a long time figuring it out.

Also there's a very strong disincentive for anyone to write right now because we're competing against the noise and the slop in the space. So best to just shut the fuck up and create as fast as we can, and let the outcome speak for itself. You're going to see a lot more products like OpenClaw where the pace of innovation is rapid, and the author freely admits that they're coding agentically and not writing a single line.

I think the advantage that Peter has (OpenClaw author) is that he has enough money and success to not give a fuck about what people say re him writing purely agentically, so he's been very open about it, which has been great for others who are considering doing the same.

But if you have a software engineering career or are a public figure with something to lose, you tend to STFU if you're doing pure agentic coding on a project.

But that'll change. Probably over the next few months. OpenClaw broke the ice.

28 days ago

[-]

Here are some practical tips:

Start small. Figure out what it (whatever tool you’re using) can do reliably at a quality level you’re comfortable with. Try other tools. There are tons. If it doesn’t get it right with the first prompt, iterate. Refine. Keep at it until you get there.

When you have seen some pattern work, do that a bunch. It won’t always work. Write rules / prompts / skills to try to get it to avoid making the mistakes you see. Keep doing this for a while and you’ll get into a groove.

Then try taking on bigger chunks of work at a time. Break apart a problem the same way you’d do it yourself first. Write a framework first. Build hello world. Write tests. Build the happy path. Add features. Don’t forget to make it write lots of tests. And run them. It’ll be lazy if you let it, so don’t let it. Each architectural step is not just a single prompt but a conversation with the output being a commit or a PR.

Also, use specs or plans heavily. Have a conversation with it about what you’re trying to do and different ways to do it. Their bias is to just code first and ask questions later. Fight that. Make it write a spec doc first and read it carefully. Tell it “don’t code anything but first ask me clarifying questions about the problem.” Works wonders.

As for convincing the AI haters they’re wrong? I seriously do. Not. Care. They’ll catch up. Or be out of a job. Not my problem.

28 days ago

[-]

I’m not a SWE by trade so I couldn’t care less about your last comment.

But again this is all… vague. I’m personally not convinced at all.

I’ll be hiring for a large project soon, so I’ll see for myself what benefits (well I care about net benefits) these tools are providing in the workplace.

27 days ago

[-]

If it wasn’t clear, I don’t have any desire to convince anybody of anything. You don’t believe the future is here yet? Good luck holding on to that position. Not my problem. I was taking time to try to help somebody who sounded genuinely curious and seeking help. That I’m happy to do.

27 days ago

[-]

You’re writing novels when if you had something compelling to show it’d be simple and easy.

If you can’t make it simple and easy… then you haven’t understood it at all. All geniuses refer to this as the standard by which one understands something. Whether it’s Steve Jobs or Einstein. So don’t get mad. Show us all how simple and easy it is. If you can’t… then accept you’re full of it and don’t quite get it as well as you claim. Not rocket science, is it?

But here we are. And actually my project is going to create the future. You’re a bozo programmer who creates the future that others already see. Know your role and don’t speak for others like me who are in the position of choosing who gets hired.

23 days ago

[-]

You’re not going to create any future if you insult people trying to offer friendly advice, or think of the talent you rely on to create your vision as “bozo programmers”. I’d wish you good luck, but you have convinced me you don’t deserve it.

appcustodian2

28 days ago

[-]

How do you figure anything out? You go use it, a lot.