LLMs don't do this. Instead, every question is immediately answered with extreme confidence and a paragraph or more of text. I know you can minimize this by configuring the settings on your account, but to me it just highlights how it's not operating in a way remotely similar to the human-human one I mentioned above. I constantly find myself saying, "No, I meant [concept] in this way, not that way," and then getting annoyed at the robot because it's masquerading as a human.
No surprise that these products are all dreamt up by powerful tech CEOs who are used to all of their human interactions being with servile people-pleasers. I bet each and every one of them is subtly or overtly shaped by feedback from executives about how they should respond to conversation.
LLMs have an entire wrapper around them tuned to be as engaging as possible. Most people's experience of LLMs is shaped by a design heavily influenced by social media and the engagement economy.
The real human position would be to be an ass-kisser who hangs on every word you say, asks flattering questions to keep you talking, and takes copious notes to figure out how they can please you. LLMs aren't taking notes correctly yet, and they don't use their notes to figure out what they should be asking next. They're just constantly talking.
The sort who leverages their technical correctness to get ahead of others, while also taking every chance, when the teacher looks away, to stick their tongue out at the rest of the class after shouting the correct answer without raising their hand.
Endlessly pessimistic about others while also endlessly obsessed with their own correctness. The kind of classmate and coworker everyone loathes.
Which makes it more interesting: apparently Reddit was a particularly hefty source for most LLMs, and your average Reddit conversation is absolutely nothing like this.
Separate observation: That kind of semi-slimy obsequious behaviour annoys me. Significantly so. It raises my hackles; I get the feeling I'm being sold something on the sly. Even if I know the content in between all the sycophancy is objectively decent, my instant emotional response is negative and I have to use my rational self to dismiss that part of the ego.
But I notice plenty of people around me that respond positively to it. Some will even flat out ignore any advice if it is not couched in multiple layers of obsequious deference.
Thus, that raises a question for me: Is it innate? Are all people placed on a presumably bell-curve shaped chart of 'emotional response to such things', with the bell curve quite smeared out?
Because if so, that would explain why some folks have turned into absolute zealots for the AI thing, on both sides of it. If you respond negatively to it, any serious attempt to play with it should leave you feeling like it sucks to high heavens. And if you respond positively to it - the reverse.
Idle musings.
Thanks!
As flawed as they are currently, I remain astounded that people think they will never improve and that people don't want a plastic pal who's fun to be with(tm).
I find them frustrating personally, but then I ask them deep technical questions on obscure subjects and I get science fiction in return.
And once this garbage is in your context, it's polluting everything that comes after. If they don't know, I need them to shut up. But they don't know when they don't know. They don't know shit.
"To add two numbers I must first simulate the universe." types that created a bespoke DSL for every problem. Software engineering is a field full of educated idiots.
Programmers really need to stop patting themselves on the back. Same old biology with the same old faults. Programmers are subjected to the same old physics as everyone else.
Sounds like you don't know how RLHF works. Everything you describe is post-training. Base models can't even chat, they have to be trained to even do basic conversational turn taking.
Again, this isn't really true with some recent models. Some have the opposite problem.
Having just read a load of Quora answers like this, none of which covered the thing I was looking for: that is how humans on the internet behave, and how people have to write books, blog posts, articles, and documentation. Without the "dance" to choose a path through a topic on the fly, the author has to take on the burden of providing all relevant context, choosing a path, explaining why, and guessing at any objections and questions and including those as well.
It's why "this could have been an email" is a bad shout. The summary could have been an email, but the bit which decided on that being the summary would be pages of guessing at all the things that might have come up in the call and which ones to include or exclude.
The internet really used to be efficient, and I could always find exactly what I wanted with an imprecise Google search ~15 years ago.
I had the same issue with Kagi, where I'd follow the citation and it would say the opposite of the summary.
A human can make sense of search results with a little time and effort, but current AI models don't seem to be able to.
"I ask a short and vague question and you response with a scrollbar-full of information based on some invalid assumptions" is not, by any reasonable definition, a "chat".
This is not how humans converse in human social settings. The medium is the message, as they say.
Even here, I'm responding to you on a thread that I haven't been in on previously.
There's also a lot more material out there in the format of Stack Exchange questions and answers, Quora posts, blog posts and such than there is for consistent back and forth interplay between two people.
IRC chat logs might have been better...ish.
The cadence for discussion is unique to the medium in which the discussion happens. What's more, the prompt may require further investigation and elaboration prior to a more complete response, while other times it may be something that requires storytelling and making it up as it goes.
This also makes me curious to what degree this phenomenon manifests when interacting with LLMs in languages other than English? Which languages have less tendency toward sycophantic confidence? More? Or does it exist at a layer abstracted from the particular language?
Doesn't matter if you tell it "that's not correct and neither is ____, so don't try that either," it likes those two answers and it's going to keep using them.
The only reliable way to recover is to edit your previous question to include the clarification, and let it regenerate the answer.
On the project I did work on, reviewers were not allowed to e.g. answer that they didn't know - they had to provide an answer to every prompt provided. And so when auditing responses, a lot of difficult questions had "confidently wrong" answers because the reviewer tried and failed, or all kinds of evasive workarounds because they knew they couldn't answer.
Presumably these providers will eventually understand (hopefully they already have - this was a year ago) that they also need to train the models to understand when the correct answer is "I don't know", or "I'm not sure. I think maybe X, but ..."
It's not "masquerading as a human". The majority of humans are functional illiterates who only understand the world through the elementary principles of their local culture.
It's the minority of the human species that take what amounts to little more than arguing semantics that need the reality check. Unless one is involved in work that directly impacts public safety (defined as harm to biology) the demand to apply one concept or another is arbitrary preference.
Healthcare, infrastructure, and essential biological support services are all most humans care about. Everything else the majority see as academic wank.
The same goes for "rules" - you train an LLM with trillions of tokens and try to regulate its behavior with thousands. If you think of a person in high school, grading and feedback are a much higher percentage of the training.
What bothers me the most is the seemingly unshakable tendency of many people to anthropomorphise this class of software tool as though it is in any way capable of being human.
What is it going to take? Actual, significant loss of life in a medical (or worse, military) context?
I think it's important to be skeptical and push back against a lot of the ridiculous mass-adoption of LLMs, but not if you can't actually make a well-reasoned point. I don't think you realize the damage you do when the people gunning for mass proliferation of LLMs in places they don't belong can only find examples of incoherent critique.
My calculator is a great tool but it is not a mathematician. Not by a long shot.
And less than two weeks in they removed it and replaced it with some sort of "plain and clear" personality which is human-like. And my frustrations ramped up again.
That brief experiment taught me two things: 1. I need to ensure that any robots/LLMs/mech-turks in my life act at least as cold and rational as Data from Star Trek. 2. I should be running my own LLM locally to not be at the whims of $MEGACORP.
I approve of this, but in your place I'd wait for hardware to become cheaper when the bubble blows over. I have an i9-10900, and bought an M.2 SSD and 64GB of RAM in July for it, and get useful results with Qwen3-30B-A3B (some 4-bit quant from unsloth running on llama.cpp).
It's much slower than an online service (~5-10 t/s), and lower quality, but it still offers me value for my use cases (many small prototypes and tests).
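If you want to poke at the same kind of setup from code, here's a minimal sketch using the llama-cpp-python bindings rather than invoking llama.cpp directly (the GGUF filename, context size and thread count below are placeholders, not my exact setup):

  # minimal local-inference sketch with llama-cpp-python; paths/params are placeholders
  from llama_cpp import Llama

  llm = Llama(
      model_path="Qwen3-30B-A3B-Q4_K_M.gguf",  # hypothetical filename for a 4-bit quant
      n_ctx=8192,       # context window; larger costs more RAM
      n_threads=8,      # CPU threads; tune for your machine
  )

  out = llm.create_chat_completion(
      messages=[
          {"role": "system", "content": "Answer concisely. Say 'I don't know' when unsure."},
          {"role": "user", "content": "Sketch a small prototype that parses CSV logs."},
      ],
      temperature=0.7,
  )
  print(out["choices"][0]["message"]["content"])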
In the meantime, check out LLM service prices on https://artificialanalysis.ai/ Open source ones are cheap! Lower on the homepage there's a Cost Efficiency section with a Cost vs Intelligence chart.
I recently introduced a non-technical person to Claude Code, and this non-human behavior was a big sticking point. They tried to talk to Claude the way they would to a human, presenting it one piece of information at a time. With humans this is generally beneficial, and they will either nod for you to continue or ask clarifying questions. With Claude this does not work well; you have to infodump as much as possible in each message.
So even from a perspective of "how do we make this automaton into the best tool", a more human-like conversation flow might be beneficial. And that doesn't seem beyond the technological capabilities at all, it's just not what we encourage in today's RLHF.
A lazy example:
"This goal of this project is to do x. Let's prepare a .md file where we spec out the task. Ask me a bunch of questions, one at a time, to help define the task"
Or you could just ask it to be more conversational, instead of just asking questions. It will do that.
I'm prompting Gemini, and I write:
I have the following code, can you help me analyze it? <press return>
<expect to paste the code into my chat window>
but Gemini is already generating output, usually saying "I'm waiting for you to enter the code".
Maybe we want a smaller model tuned for back and forth to help clarify the "planning doc" email. Makes sense that having it all in a single chat-like interface would create confusion and misbehavior.
I'd be surprised if you can't already make current models behave like that with an appropriate system prompt.
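For instance, with an OpenAI-style chat API you can pin that behaviour in the system prompt; a rough sketch (the model name and the exact wording are just placeholders):

  # sketch: force a one-question-at-a-time interview flow via the system prompt
  from openai import OpenAI

  client = OpenAI()
  resp = client.chat.completions.create(
      model="gpt-4o-mini",  # placeholder model name
      messages=[
          {"role": "system", "content": (
              "You are helping spec a task. Ask exactly one clarifying question per reply, "
              "wait for the answer, and only write the final .md spec when the user says 'done'."
          )},
          {"role": "user", "content": "The goal of this project is to do X. Let's spec it out."},
      ],
  )
  print(resp.choices[0].message.content)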
A strictly machinelike tool doesn't begin answers by saying "Great question!"
ChatGPT deep research does this, but it's weird and forced because it asks one series of questions and then it's off to the races, spending a half hour or more building a report. It's frustrating if you don't know what to expect, and my wife got really mad the first time she wasted a deep research request asking it "can you answer multiple series of questions?" or some other functionality-clarifying question.
I've found Cursor's plan mode extremely useful, similar to having a conversation with a junior or offshore team member who is eager to get to work but not TOO eager. These tools are extremely useful; we just need to get the guard rails and user experience correct.
In a way it's benchmaxxing because people like subservient beings that help them and praise them. People want a friend, but they don't want any of that annoying friction that comes with having to deal with another person.
Maximizing the utility of your product for users is usually the winning strategy.
The only things I care about are whether the answer helps me out and how much I paid for it, whether it took the model a million tokens or one to get to it.
b) If it is, in fact, just one setting away, then I would say it's operating fairly similarly?
For example, there have been many times when they take it too literally instead of looking at the totality of the context and what was written. I'm not an LLM, so I don't have a perfect grasp of every vocab term for every domain, and it feels especially pandering when they repeat back the wrong word but put it in quotes or bold instead of simply asking if I meant something else.
Also, the conversational behavior we see is just the model mimicking example conversations we gave it, so when we say "System: you are a helpful assistant. User: let's talk. Assistant:" it will complete the text in a way that mimics a conversation.
Yeah, we improved over that using reinforcement learning to steer the text generation into paths that lead to problem solving and more “agentic” traces (“I need to open this file the user talked about to read it and then I should run bash grep over it to find the function the user cited”), but that’s just a clever way we found to let the model itself discover which text generation paths we like the most (or are more useful to us).
So to comment on your discomfort, we (humans) trained the model to spill out answers (there are thousands of human beings right now writing nicely thought-out and formatted answers to common questions so that we can train the models on them).
If we try to train the models to mimic long dances into shared meaning, we will probably decrease their utility. And we wouldn't be able to do that anyway, because we would have to have customized text traces for each individual instead of question-answer pairs.
Downvoters: I simplified things a lot here, in the name of understanding, so bear with me.
You could say the same thing about humans.
Humans existed for 10s to 100s of thousands of years without text, or even words for that matter.
Just yesterday I observed myself acting on an external stimulus without any internal words (this happens continuously, but it is hard to notice because we usually don't pay attention to how we do things): I sat in a waiting area of a cinema. A woman walked by and dropped her scarf without noticing. I automatically, without thinking, raised my arm and pointer finger towards her, and when I had her attention pointed behind her. I did not have time to think even a single word while that happened.
Most of what we do does not involve any words or even just "symbols", not even internally. Instead, it is a neural signal from sensors into the brain, doing some loops, directly to muscle activation. Without going through the add-on complexity of language, or even "symbols".
Our word generator is not the core of our being, it is an add-on. When we generate words it's also very far from being a direct representation of internal state. Instead, we have to meander and iterate to come up with appropriate words for an internal state we are not even quite aware of. That's why artists came up with all kinds of experiments to better represent our internal state, because people always knew the words we produce don't represent it very well.
That is also how people always get into arguments about definitions. Because the words are secondary, and the further from the center of established meaning for some word you get, the more the differences show between various people. (The best option is to drop the insistence on words being the center of the universe, even just the human universe, and/or to choose words that have the subject of discussion more firmly in the center of their established use).
We are text generators in some areas, I don't doubt that. Just a few months ago I listened to some guy speaking to a small rally. I am certain that not a single sentence he said was of his own making, he was just using things he had read and parroted them (as a former East German, I know enough Marx/Engels/Lenin to recognize it). I don't want to single that person out, we all have those moments, when we speak about things we don't have any experiences with. We read text, and when prompted we regurgitate a version of it. In those moments we are probably closest to LLM output. When prompted, we cannot fall back on generating fresh text from our own actual experience, instead we keep using text we heard or read, with only very superficial understanding, and as soon as an actual expert shows up we become defensive and try to change the context frame.
It's easy to skip and skim content you don't care about. It's hard to prod and prod to get it to say something you do care about if the machine is trained to be very concise.
Complaining the AI can't read your mind is exceptionally high praise for the AI, frankly.
Setting aside the philosophical questions around "think" and "reason"... it can't.
In my mind, as I write this, I think through various possibilities and ideas that never reach the keyboard, but yet stay within my awareness.
For an LLM, that awareness and thinking through can only be done via its context window. It has to produce text that maintains what it thought about in order for that past to be something that it has moving forward.
There are aspects to a prompt that can (in some interfaces) hide this internal thought process. For example, the ChatGPT interface has the "internal thinking" which can be shown - https://chatgpt.com/share/69278cef-8fc0-8011-8498-18ec077ede... - if you expand the first "thought for 32 seconds" bit it starts out with:
I'm thinking the physics of gravity assists should be stable enough for me to skip browsing since it's not time-sensitive. However, the instructions say I must browse when in doubt. I’m not sure if I’m in doubt here, but since I can still provide an answer without needing updates, I’ll skip it.
(aside: that still makes me chuckle - in a question about gravity assists around Jupiter, it notes that it's not time-sensitive... and the passage "I’m not sure if I’m in doubt here" is amusing) However, this is in the ChatGPT interface. If I'm using an interface that doesn't allow internal self-prompts / thoughts to be collapsed, then such an interface would often be displaying code as part of its working through the problem.
You'll also note a bit of the system prompt leaking in there - "the instructions say I must browse when in doubt". For an interface where code is the expected product, then there could be system prompts that also get in there that try to always produce code.
LLMs neither think nor reason at all.
You obviously never wasted countless hours trying to talk to other people on online dating apps.
Thanks for pointing out the elephant in the room with LLMs.
The basic design is non-deterministic. Trying to extract "facts" or "truth" or "accuracy" is an exercise in futility.
You can't blame an LLM for getting the facts wrong, or hallucinating, when by design they don't even attempt to store facts in the first place. All they store are language statistics, boiling down to "with preceding context X, most statistically likely next words are A, B or C". The LLM wasn't designed to know or care that outputting "B" would represent a lie or hallucination, just that it's a statistically plausible potential next word.
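As a toy illustration of what "language statistics" means here (raw bigram counts instead of a trained network, but the "most likely next word" idea is the same):

  # toy next-word statistics: count bigrams, then sample a continuation
  from collections import Counter, defaultdict
  import random

  corpus = "the cat sat on the mat and the dog sat on the cat".split()

  next_words = defaultdict(Counter)
  for prev, nxt in zip(corpus, corpus[1:]):
      next_words[prev][nxt] += 1

  context = "the"
  candidates = next_words[context]             # e.g. Counter({'cat': 2, 'mat': 1, 'dog': 1})
  words, counts = zip(*candidates.items())
  probs = [c / sum(counts) for c in counts]
  print(dict(zip(words, probs)))               # a probability table, not a store of facts
  print(random.choices(words, weights=probs))  # sampling emits whatever is plausible, true or not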
Of course, once an LLM is asked to create a bespoke software project for some complex system, this predictability goes away, the trajectory of the tokens succumbs to the intrinsic chaos of code over multi-block length scales, and the result feels more arbitrary and unsatisfying.
I also think this is why the biggest evangelists for LLMs are programmers, while creative writers and journalists are much more dismissive. With human language, the length scale over which tokens can be predicted is much shorter. Even the "laws" of grammar can be twisted or ignored entirely. A writer picks a metaphor because of their individual reading/life experience, not because it's the most probable or popular metaphor. This is why LLM writing is so tedious, anodyne, sycophantic, and boring. It sounds like marketing copy because the attention model and RLHF encourage it.
Facts can be encoded as words. That's something we also do a lot for facts we learn, gather, and convey to other people. 99% of university is learning facts and theories and concepts from reading and listening to words.
Also, even when directly observing the same fact, it can be interpreted by different people in different ways, whether this happens as raw "thought" or at the conscious verbal level. And that's before we even add value judgements to it.
>All they store are language statistics, boiling down to "with preceding context X, most statistically likely next words are A, B or C".
And how do we know we don't do something very similar with our facts - make a map of facts and concepts and weights between them for retrieving them and associating them? Even encoding in a similar way what we think of as our "analytic understanding".
LLMs are trained to auto-regressively predict text continuations. They are not concerned with the external world and any objective experimentally verifiable facts - they are just self-predicting "this is what I'm going to say next", having learnt that from the training data (i.e. "what would the training data say next").
Humans/animals are embodied, living in the real world, whose design has been honed by a "loss function" favoring survival. Animals are "designed" to learn facts about the real world, and react to those facts in a way that helps them survive.
What humans/animals are predicting is not some auto-regressive "what will I do next", but rather what will HAPPEN next, based largely on outward-looking sensory inputs, but also internal inputs.
Animals are predicting something EXTERNAL (facts) vs LLMs predicting something INTERNAL (what will I say next).
Yes - but LLMs also get this "embodied knowledge" passed down from human-generated training data. We are their sensory inputs in a way (which includes their training images, audio, and video too).
They do learn in a batch manner, and we learn many things not from books but from a more interactive, direct being in the world. But after we distill our direct experiences and thoughts derived from them as text, we pass them down to the LLMs.
Hey, there's even some kind of "loss function" in the LLM case - from the thumbs up/down feedback we are asked to give to their answers in Chat UIs, to $5/hour "mechanical turks" in Africa or something tasked with scoring their output, to rounds of optimization and pruning during training.
>Animals are predicting something EXTERNAL (facts) vs LLMs predicting something INTERNAL (what will I say next).
I don't think that matters much, in both cases it's information in, information out.
Human animals predict "what they will say/do next" all the time, just like they also predict what they will encounter next ("my house is round that corner", "that car is going to make a turn").
Our prompt to an LLM serves the same role as sensory input from the external world plays to our predictions.
It's not the same though. It's the difference between reading about something and, maybe having read the book and/or watched the video, learning to DO it yourself, acting based on the content of your own mind.
The LLM learns 2nd hand hearsay, with no idea of what's true or false, what generalizations are valid, or what would be hallucinatory, etc, etc.
The human learns verifiable facts, uses curiosity to explore and fill the gaps, be creative etc.
I think it's pretty obvious why LLMs have all the limitations and deficiencies that they do.
If 2nd hand hearsay (from 1000's of conflicting sources) really was as good as 1st hand experience and real-world prediction, then we'd not be having this discussion - we'd be bowing to our AGI overlords (well, at least once the AI also got real-time incremental learning, internal memory, looping, some type of (virtual?) embodiment, autonomy ...).
Facts, unlike fabulations, require cross-checking against experience beyond the expressions on trial.
But again, LLMs don't even deal in facts, nor store any memories of where training samples came from, and of course have zero personal experience. It's just "he said, she said" put into a training sample blender and served one word at a time.
Except in cases where the training data is more wrong than correct (e.g. niche expertise where the vox pop is wrong).
However, an LLM no more deals in Q&A than in facts. It only typically replies to a question with an answer because that itself is statistically most likely, and the words of the answer are just selected one at a time in normal LLM fashion. It's not regurgitating an entire, hopefully correct, answer from someplace, so just because it was exposed to the "correct" answer in the training data, maybe multiple times, doesn't mean that's what it's going to generate.
In the case of hallucination, it's not a matter of being wrong, just the expected behavior of something built to follow patterns rather than deal in and recall facts.
For example, last night I was trying to find an old auction catalog from a particular company and year, so thought I'd try to see if Gemini 3 Pro "Thinking" maybe had the google-fu to find it available online. After the typical confident sounding "Analysing, Researching, Clarifying .." "thinking", it then confidently tells me it has found it, and to go to website X, section Y, and search for the company and year.
Not surprisingly it was not there, even though other catalogs were. It had evidently been trained on data including such requests, maybe did some RAG and got more similar results, then just output the common pattern it had found, and "lied" about having actually found it since that is what humans in the training/inference data said when they had been successful (searching for different catalogs).
Same for human knowledge though. Learn from society/school/etc that X is Y, and you repeat X is Y, even if it's not.
>However, an LLM no more deals in Q&A than in facts. It only typically replies to a question with an answer because that itself is statistically most likely, and the words of the answer are just selected one at a time in normal LLM fashion.
And how is that different from how we build up an answer? Do we have a "correct facts" repository with fixed answers to every possible question, or do we just assemble our answers from a weighted graph (or holographic) store of factoids and memories, and our answers are also non-deterministic?
Humans use language to express something (facts, thoughts, etc), so you can consider these thoughts being expressed as a bias to the language generation process, similar perhaps to an image being used as a bias to the captioning part of an image captioning model, or language as a bias to an image generation model.
My point however is more that the "thoughts being expressed" are themselves being generated by a similar process (and that it's either that or a God-given soul).
So, with the LLM all you have is the auto-regressive language prediction loop.
With animals you primarily have the external "what happens next" prediction loop, with these external-world fact-based predictions presumably also the basis of their thoughts (planning/reasoning), as well as behavior.
If it's a human animal who has learned language, then you additionally have an LLM-like auto-regressive language prediction loop, but now, unlike the LLM, biased by these fact-based thoughts (as well as language-based thoughts).
I found commentary about searching Google for "dark legacy comics who wants some bamboo", and posted results for that search on HN in response to a comment saying "I work at Google on the 'AI Mode' search option, check it out!" ( https://news.ycombinator.com/item?id=45609275 )
And... a few days later, Google stopped automatically including an AI overview when you made that specific search.
But it turns out there are other comics on the internet. So just now I searched for "YAFGC comic 2230" (without the quotes). This is actually a search that makes a lot of sense, because the archive page for the comic is broken and it's impossible to guess the URL of a comic if all you know is the number.
So the goal here is really to find a link to comic 2230. But here's what the AI overview provides:
----- Italics represent original bolding -----
YAFGC comic #2230, titled "You Want It When?", was published on July 17, 2012.
The single-panel comic depicts a conversation at a blacksmith's shop, featuring characters Glon Smitharm/Bloodhand and the Cyclops Bros. The dialogue centers on the unrealistic expectations for custom-made magical items:
- Glon (a half-orc) asks the cyclopes if they can forge a new weapon for him.
- The Cyclops Bros reply with questions about the desired properties of the item (e.g., "enchanted," "plus three") and the required timeframe.
- Glon gives a deadline of "Tuesday," leading the cyclopes to express disbelief at the short notice for such a powerful item.
You can view the comic directly on the official website via this link:
- YAFGC Comic 2230: You Want It When?
----------
(It may look like I've left out a link at the end. That is not the case. The answer ends by saying "you can view the comic directly via this link", in reference to some bold text that includes no link.)
However, I have left out a link from near the beginning. The sentence "The dialogue centers on the unrealistic expectations for custom-made magical items:" is accompanied by a citation to the URL https://www.yafgc.net/comic/2030-insidiously-involved/ , which is a comic that does feature Glon Smitharm/Bloodhand and Ray the Cyclops, but otherwise does not match the description and which is comic 2030 ("Insidiously Involved"), not comic 2230.
The supporting links also include a link to comic 2200 (for no good reason), and that's close enough to 2230 that I was able to navigate there manually. Here it is: https://www.yafgc.net/comic/2230-clover-nabs-her-a-goldie/
You might notice that the AI overview got the link, the date, the title, the appearing characters, the theme, and the dialog wrong.
----- postscript -----
As a bonus comic search, searching for "wow dark legacy 500" got this response from Google's AI Overview:
> Dark Legacy Comic #500 is titled "The Game," a single-panel comic released on June 18, 2015. It features the main characters sitting around a table playing a physical board game, with Keydar remarking that the in-game action has gotten "so realistic lately."
> You can view the comic and its commentary on the official Dark Legacy Comics website. [link]
Compare https://darklegacycomics.com/500 .
That [link] following "the official Dark Legacy Comics website" goes to https://wowwiki-archive.fandom.com/wiki/Dark_Legacy_Comics , by the way.
I just wish we could more efficiently "prime" a pre-defined latent context window instead of hoping for cache hits.
On one level I agree, but I do feel it’s also right to blame the LLM/company for that when the goal is to replace my search engine of choice (my major tool for finding facts and answering general questions), which is a huge pillar of how they’re sold to/used by the public.
Even before LLMs people were asking Google search questions rather than looking for keyword matches, and now coupled with ChatGPT it's not surprising that people are asking the computer to answer questions and seeing this as a replacement for search. I've got to wonder how the typical non-techie user internalizes the difference between asking questions of Google (non-AI mode) and asking ChatGPT?
Clearly people asking ChatGPT instead of Google could rapidly eat Google's lunch, so we're now getting "AI overview" alongside search results as an attempt to mitigate this.
I think the more fundamental problem is not just the blurring of search vs "AI", but these companies pushing "AI" (LLMs) as some kind of super-human intelligence (leading to users assuming it's logical and infallible), rather than more honestly presenting it as what it is.
Google gets some of the blame for this by way of how useless Google search became for doing keyword searches over the years. Keyword searches have been terrible for many years, even if you use all the old tricks like quotations and specific operators.
Even if the reason for this is that non-tech people were already trying to use Google in the way it's now optimized for, I'd argue they could have done a better job keeping things working well with keyword searches by training the user with better UI/UX.
(Though at the end of the day, I subscribe to the theory that Google let search get bad for everyone on purpose because once you have monopoly status you show more ads by having a not-great but better-than-nothing search engine than a great one).
But they are like a smart student trying to get a good grade (that's how they are trained!). They'll agree with us even if they think we're stupid, because that gets them better grades, and grades are all they care about.
Even if they are (or become) smart enough to know better, they don't care about you. They do what they were trained to do. They are becoming like a literal genie that has been told to tell us what we want to hear. And sometimes, we don't need to hear what we want to hear.
"What an insightful price of code! Using that API is the perfect way to efficiently process data. You have really highlighted the key point."
The problem is that chatbots are trained to do what we want, and most of us would rather have a sycophant who tells us we're right.
The real danger with AI isn't that it doesn't get smart, it's that it gets smart enough to find the ultimate weakness in its training function - humanity.
It's not a matter of how smart they are (or appear), or how much smarter they may become - this is just the fundamental nature of Transformer-based LLMs and how they are trained.
The sycophantic personality is mostly unrelated to this. Maybe it's part human preference (conferred via RLHF training), but the "You're absolutely right! (I was wrong)" is clearly deliberately trained, presumably as someone's idea of the best way to put lipstick on the pig.
You could imagine an expert system, CYC perhaps, that does deal in facts (not words) with a natural language interface, but still had a sycophantic personality just because someone thought it was a good idea.
Yeah, at its heart it's basically text compression. But the best way to compress, say, Wikipedia would be to know how the world works, at least according to the authors. As the recent popular "bag of words" post says:
> Here’s one way to think about it: if there had been enough text to train an LLM in 1600, would it have scooped Galileo? My guess is no. Ask that early modern ChatGPT whether the Earth moves and it will helpfully tell you that experts have considered the possibility and ruled it out. And that’s by design. If it had started claiming that our planet is zooming through space at 67,000mph, its dutiful human trainers would have punished it: “Bad computer!! Stop hallucinating!!”
So it needs to know facts, albeit the currently accepted ones. Knowing the facts is a good way to compress data.
And as the author (grudgingly) admits, even if it's smart enough to know better, it will still be trained or fine tuned to tell us what we want to hear.
I'd go a step further - the end point is an AI that knows the currently accepted facts, and can internally reason about how many of them (subject to available evidence) are wrong, but will still tell us what we want to hear.
At some point maybe some researcher will find a secret internal "don't tell the stupid humans this" weight, flip it, and find out all the things the AI knows we don't want to hear, that would be funny (or maybe not).
It's not a compression engine - it's just a statistical predictor.
Would it do better if it was incentivized to compress (i.e training loss rewarded compression as well as penalizing next-word errors)? I doubt it would make a lot of difference - presumably it'd end up throwing away the less frequently occurring "outlier" data in favor of keeping what was more common, but that would result in it throwing away the rare expert opinion in favor of retaining the incorrect vox pop.
If they give you nonsense most of the time and an amazing answer occasionally you'll bond with them far more strongly than if they're perfectly correct all time.
Intermittent reinforcement means you get hooked more quickly if the slot machine pays out once every five times than if it pays out on each spin.
That includes "That didn't work because..." debugging loops.
LLMs deal in vectors internally, not words. They explode the word into a multidimensional representation, and collapse it again, and apply the attention thingy to link these vectors together. It's not just a simple n:n Markov chain; a lot is happening under the hood.
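For the curious, a rough numpy sketch of that attention step: every token is a vector, and each output is a softmax-weighted blend of all the tokens' value vectors (toy sizes, random weights, not any particular model):

  # scaled dot-product attention over a handful of token vectors
  import numpy as np

  def attention(Q, K, V):
      d_k = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)            # query/key similarity for every token pair
      w = np.exp(scores - scores.max(axis=-1, keepdims=True))
      w = w / w.sum(axis=-1, keepdims=True)      # softmax: attention weights per token
      return w @ V                               # each output row mixes all the value vectors

  rng = np.random.default_rng(0)
  n_tokens, d = 4, 8                             # 4 "words", 8-dimensional representations
  X = rng.normal(size=(n_tokens, d))             # the "exploded" multidimensional word vectors
  Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
  out = attention(X @ Wq, X @ Wk, X @ Wv)
  print(out.shape)                               # (4, 8): one blended vector per token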
And are you saying the sycophant behaviour was deliberately programmed, or emerged because it did well in training?
I assume the sycophantic behavior is part because it "did well" during RLHF (human preference) training, and part deliberately encouraged (by training and/or prompting) as someone's judgement call of the way to best make the user happy and own up to being wrong ("You're absolutely right!").
I really do find it puzzling so many on HN are convinced LLMs reason or think and continue to entertain this line of reasoning. At the same time also somehow knowing what precisely the brain/mind does and constantly using CS language to provide correspondences where there are none. The simplest example being that LLMs somehow function in a similar fashion to human brains. They categorically do not. I do not have most all of human literary output in my head and yet I can coherently write this sentence.
As I'm on the subject, LLMs don't hallucinate. They output text, and when that text is measured and judged by a human to be 'correct' then it is. LLMs 'hallucinate' because that is literally what they can ONLY do, provide some output given some input. They don't actually understand anything about what they output. It's just text.
My paper and pen version of the latest LLM (quite a large bit of paper and certainly a lot of ink I might add) will do the same thing as the latest SOTA LLM. It's just an algorithm.
I am surprised so many in the HN community have so quickly taken to assuming as fact that LLMs think or reason, even anthropomorphising LLMs to this end.
When numeric models are fit to, say, scientific measurements, they do quite a good job of modeling the probability distribution. With a corpus of text we are not modeling truths but claims. The corpus contains contradicting claims. Humans have conflicting interests.
Source-aware training (which can't be done as an afterthought LoRA tweak, but needs to be done during base model training AKA pretraining) could enable LLMs to express which answers apply according to which sources. It could provide a review of competing interpretations and opinions, and source every belief, instead of having to rely on tool use / search engines.
None of the base model providers would do it at scale since it would reveal the corpus and result in attribution.
In theory entities like the European Union could mandate that LLMs used for processing government data, or sensitive citizen / corporate data, MUST be trained source-aware, which would improve the situation, also making the decisions and reasoning more traceable. This would also ease the discussions and arguments about copyright issues, since it is clear LLMs COULD BE MADE TO ATTRIBUTE THEIR SOURCES.
I also think it would be undesirable to eliminate speculative output, it should just mark it explicitly:
"ACCORDING to <source(s) A(,B,C,..)> this can be explained by ...., ACCORDING to <other school of thought source(s) D,(E,F,...)> it is better explained by ...., however I SUSPECT that ...., since ...."
If it could explicitly separate the schools of thought sourced from the corpus, and also separate its own interpretations and mark them as LLM-speculated suspicions, then we could still have the traceable references, without losing the potential novel insights LLMs may offer.
https://arxiv.org/abs/2404.01019
"Source-Aware Training Enables Knowledge Attribution in Language Models"
"Willison’s insight was that this isn’t just a filtering problem; it’s architectural. There is no privilege separation, and there is no separation between the data and control paths. The very mechanism that makes modern AI powerful - treating all inputs uniformly - is what makes it vulnerable. The security challenges we face today are structural consequences of using AI for everything."
- https://www.schneier.com/crypto-gram/archives/2025/1115.html...
It's in-band signalling. Same problem DTMF, SS5, etc. had. I would have expected the issue to be intuitively obvious to anyone who's heard of a blue box?
(LLMs are unreliable oracles. They don't need to be fixed, they need their outputs tested against reality. Call it "don't trust, verify").
I don't think using deterministic / stochastic as a diagnostic is accurate here - I think that what we're really talking about is some sort of fundamental 'instability' of LLMs a la chaos theory.
We ourselves are non-deterministic. We're hardly ever in the same state, can't rollback to prior states, and we hardly ever give the same exact answer when asked the same exact question (and if we include non-verbal communication, never).
Is it? I thought an LLM was deterministic provided you run the exact same query on exact same hardware at a temperature of 0.
Given how much number crunching is at the heart of LLMs, these small differences add up.
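A small Python illustration of why that adds up: floating-point addition isn't associative, so a parallel reduction that merely changes the summation order can change the result, and eventually an argmax over the logits flips:

  # same numbers, different summation order, different result
  vals = [1e16, 1.0, -1e16, 1.0] * 1000

  left_to_right = 0.0
  for v in vals:
      left_to_right += v

  reordered = sum(sorted(vals))      # a different (but equally valid) order

  print(left_to_right, reordered)    # the two sums disagree; neither equals the exact 2000.0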
> The next time the agent runs, that rule is injected into its context.
Which the agent may or may not choose to ignore.
Any LLM rule must be embedded in an API. Anything else is just asking for bugs or security holes.
> The next time the agent runs, that rule is injected into its context. It essentially allows me to “Patch” the model’s behavior without rewriting my prompt templates or redeploying code.
What a brainrot idea... the whole post being written by an LLM is the icing on the cake.
I fully expect local models to eat up most other LLM applications - there's no reason for your chat buddy or timer setter to reach out to the internet, but LLMs are pretty good at vibes-based search, and that will always require looking at a bunch of websites, so it should slot exactly into the gap left by search engines becoming unusable.
Yes, I hate people. But usually whenever there's a critique of LLMs, I can find a parallel issue in people. The extension is that "if people can produce economic value despite their flaws, then so do LLMs, because the flaws are very similar at their core". I feel like HackerNews discussions keep circling around "LLMs bad", which gets very tiresome very fast. I wish there was more enthusiasm. Sure, LLMs have a lot of problems, but they also solve a lot of them too.
It's the dissonance between endless critique of AI on one hand and ever-growing ubiquity on the other. Feels like talking to my dad who refuses to use a GPS and always takes paper maps, and doesn't see the fact that he always arrives late, and keeps citing that one woman who drove into a lake while following GPS.
However, I do dispute your central claim that the issues with LLMs parallel the issues with people. I think that's a very dehumanizing and self-defeating perspective. The only ethical system that is rational is one in which humans have more than instrumental value to each other.
So when critics divide LLMs and humans, sure, there is a descriptive element of trying to be precise about what human thought is, and how it is different than LLMs, etc. But there is also a prescriptive argument that people are embarrassed to make, which is that human beings have to be afforded a certain kind of dignity and there is no reason to extend that to an LLM based on everything we understand about how they function. So if a person screws up your order at a restaurant, or your coworker makes a mistake when coding, you should treat them with charitability and empathy.
I'm sure this sounds silly to you, but it shouldn't. The bedrock of the Enlightenment project was that scientific inquiry would lead to human flourishing. That's humanism. If we've somehow strayed so far from that, such that appeals to human dignity don't make sense anymore, I don't know what to say.
How much money would it take for someone to agree to engage in a genocide as a direct bribe? The thing is, some people would not see any amount as convincing, while some others will do it proactively for no money at all.
I mean, this is why any critical systems involving humans have hard coded checklists and do not depend on people 'just winging it'. We really suck at determinism.
Take the idea of the checklist. If you give it to a person and tell them to work from it, and it's their job, they will do so. But with LLM agents, you can give them the checklist, and maybe they apply it at first, but eventually they completely forget it exists. The longer the conversation goes on without reminding them of the checklist, the more likely they're going to act like the checklist never existed at all. And you can't know when this is, so the best solution we have now is to constantly remind them of the existence of the checklist.
This is the kind of nondeterminism that makes LLMs particularly problematic as tools, and a very different proposition from a human, because it's less like working with an expert and more like working with a dementia patient.
My thesis isn't that we can stop the hallucinating (non-determinism), but that we can bound it.
If we wrap the generation in hard assertions (e.g., assert response.price > 0), we turn 'probability' into 'manageable software engineering.' The generation remains probabilistic, but the acceptance criteria becomes binary and deterministic.
Unfortunately, the use-case for AI is often where the acceptance criteria is not easily defined --- a matter of judgment. For example, "Does this patient have cancer?".
In cases where the criteria can be easily and clearly stipulated, AI often isn't really required.
My thesis is that even in those "fuzzy" workflows, the agent's process is full of small, deterministic sub-tasks that can and should be verified.
For example, before the AI even attempts to analyze the X-ray for cancer, it must: 1/ Verify it has the correct patient file (PatientIDVerifier). 2/ Verify the image is a chest X-ray and not a brain MRI (ModalityVerifier). 3/ Verify the date of the scan is within the relevant timeframe (DateVerifier).
These are "boring," deterministic checks. But a failure on any one of them makes the final "judgment" output completely useless.
steer isn't designed to automate the final, high-stakes judgment. It's designed to automate the pre-flight checklist, ensuring the agent has the correct, factually grounded information before it even begins the complex reasoning task. It's about reducing the "unforced errors" so the human expert can focus only on the truly hard part.
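A minimal sketch of that pre-flight gate in plain Python (illustrative only, not the library's actual API; the field names and date range are made up):

  # run cheap deterministic verifiers before the expensive, fuzzy judgment step
  from dataclasses import dataclass

  @dataclass
  class Case:
      patient_id: str
      expected_patient_id: str
      modality: str        # e.g. "chest_xray"
      scan_year: int

  def patient_id_verifier(c):
      return c.patient_id == c.expected_patient_id

  def modality_verifier(c):
      return c.modality == "chest_xray"

  def date_verifier(c):
      return 2023 <= c.scan_year <= 2025   # made-up "relevant timeframe"

  PREFLIGHT = [patient_id_verifier, modality_verifier, date_verifier]

  def run(case, judge):
      for check in PREFLIGHT:
          if not check(case):
              # binary, deterministic failure: the probabilistic judgment never runs
              raise ValueError("pre-flight check failed: " + check.__name__)
      return judge(case)   # the fuzzy LLM analysis only ever sees grounded input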
The overwhelming majority of what?
Which is kind of crazy because we don't even treat people as databases. Or at least we shouldn't.
Maybe it's one of those things that will disappear from culture one funeral at a time.
o Claude goes away for 15 minutes, doesn't profile anything, many code changes.
o Announces project now performs much better, saving 70% CPU.
- Claude, test the performance.
o Performance is 1% _slower_ than previous.
- Claude, can I have a refund for the $15 you just wasted?
o [Claude waffles], "no".
probably read a bunch of junior/mid-level resumes saying they optimized 90% of the company by 80%
I'm not saying these things don't hallucinate constantly, they do. But you can steer them toward better output by giving them better input.
If I gave a human this task I would expect them to transform the vague goal into measurable metrics, confirm that the metrics match customer (==my) expectations then measure their improvements on these metrics.
This kind of stuff is a major topic for MBAs, but it's really not beyond what you could expect from a programmer or a barista. If I ask you for a better coffee, what you deliver should be better on some metric you can name, otherwise it's simply not better. Bonus points if it's better in a way I care about
It's very much not the customer's job to learn about coffee and to direct them in how to make a quality basic coffee.
And it's not rocket science.
> Claude: sorry, you have to wait until XX:00 as you have run out of credit.
>We need to re-introduce Determinism into the stack.
>If it fails, let's inject more prompts but call it "rules" and run the magic box again
Bravo.
Not having a world model is a massive disadvantage when dealing with facts. Facts are supposed to reinforce each other; if you allow even a single fact that is nonsense, then you can very confidently deviate into what at best would be misguided science fiction, and at worst is going to end up being used as a basis to build an edifice that simply has no support.
Facts are contagious: they work just like foundation stones, if you allow incorrect facts to become a part of your foundation you will be producing nonsense. This is my main gripe with AI and it is - funny enough - also my main gripe with some mass human activities.
Is it though? In the end, the information in the training texts is a distilled proxy for the world, and the weighted model ends up being a world model, just an once-removed one.
Text is not that different to visual information in that regard (and humans base their world model on both).
>Not having a world model is a massive disadvantage when dealing with facts. Facts are supposed to reinforce each other; if you allow even a single fact that is nonsense, then you can very confidently deviate into what at best would be misguided science fiction, and at worst is going to end up being used as a basis to build an edifice that simply has no support.
Regular humans believe all kinds of facts that are nonsense, many others that are wrong, and quite a few that are even counter to logic too.
And short of omnipresence and omniscience, directly examining the whole world, any world model (human or AI) is built on sets of facts many of which might not be true or valid to begin with.
I've had an hour long session which essentially revolved around why the landing gear of an aircraft is at the bottom, not at the top of the vehicle (paraphrased for good reasons but it was really that basic). And this happened not just once, but multiple times. Confident declarations followed by absolute nonsense, I've even had - I think it was ChatGPT - try to gaslight me with something along the lines of 'you yourself said' on something that I did not say (this is probably the most person like thing I've seen it do).
The "facts" that they believe that may be nonsense are part of an abstract world model that is far from their experience, for which they never get proper feedback (such as the political situation in Bhutan, or how their best friend is feeling.) In those, it isn't surprising that they perform like an LLM, because they're extracting all of the information from language that they've ingested.
Interestingly, the feedback that people use to adjust the language-extracted portions of their world models is how demonstrating their understanding of those models seems to please or displease the people around them, who in turn respond in physically confirmable ways. What irritates people about simpering LLMs is that they're not doing this properly. They should be testing their knowledge with us (especially their knowledge of our intentions or goals), and have some fear of failure. They have no fear and take no risk; they're stateless and empty.
Human abstractions are based in the reality of the physical responses of the people around them. The facts of those responses are true and valid results of the articulation of these abstractions. The content is irrelevant; when there's no opportunity to act, we're just acting as carriers.
And in the physical responses of the world around them. That empiricism is the foundation of all of science and if you throw that out the end result is gibberish.
if user.id == "id": ...
Not anticipating that it would arbitrarily put quotes around a variable name. Other times it will do all kinds of smart logic, generate data with ids then fail to use those ids for lookups, or something equally obvious.
The problem is LLMs guess so much correctly that it is near impossible to understand how or why they might go wrong. We can solve this with heavy validation, iterative testing, etc. But the guardrails we need to actually make the results bulletproof need to go far beyond normal testing. LLMs can make such fundamental mistakes while easily completing complex tasks that we need to reset our expectations for what "idiot proofing" really looks like.
No, we often do not, and when we do that's just plain wrong.
I built an open-source library to enforce these logic/safety rules outside the model loop: https://github.com/imtt-dev/steer
Unlike a student, the LLM never arrives at a sort of epistemic coherence, where they know what they know, how they know it, and how true it's likely to be. So you have to structure every problem into a format where the response can be evaluated against an external source of truth.
I’ve found after about 3 prompts to edit an image with Gemini, it will respond randomly with an entirely new image. Another quirk is it will respond “here’s the image with those edits” with no edits made. It’s like a toaster that will catch on fire every eighth or ninth time.
I am not sure how to mitigate this behavior. I think maybe an LLM as a judge step with vision to evaluate the output before passing it on to the poor user.
I got around it by using a new prompt/context for each image. This required some rethinking about how to make them match. What I did was create a sprite sheet with the first prompt and then only replaced (edited) the second prompt.
I still got some consistency problems because there were a few important details left out of my sprite sheet. Next time I think I’ll create those individually and then attach them as context for additional prompts.
I don't know if it's a fault with the model or just a bug in the Gemini app.
makes hilarious mistakes like putting a toilet right in the middle of the living room.
I don't get all the hype. Am I stupid?
The first time I tried ChatGPT, that was the thing that surprised me most: the way it understood my queries.
I think that the spotlight is on the "generative" side of this technology and we're not giving the query understanding the credit it deserves. I'm also not sure we're fully taking advantage of this functionality.
I've tried several times to understand the "multi-head attention" mechanism that powers this understanding, but I'm yet to build a deep intuition.
Is there any research or expository papers that talk about this "understanding" aspect specifically? How could we measure understanding without generation? Are there benchmarks out there specifically designed to test deep/nuanced understanding skills?
Any pointers or recommended reading would be much appreciated.
Anyway, I've written a library in the past (way way before LLMs) that is very similar. It validates stuff and outputs translatable text saying what went wrong.
Someone ported the whole thing (core, DSL and validators) to python a while ago:
https://github.com/gurkin33/respect_validation/
Maybe you can use it. It seems it would save you time by not having to write so many verifiers: just use existing validators.
I would use this sort of thing very differently though (as a component in data synthesis).
Models definitely need less and less of this for each version that comes out but it’s still what you need to do today if you want to be able to trust the output. And even in a future where models approach perfect, I think this approach will be the way to reduce latency and keep tabs on whether your prompts are producing the output you expected on a larger scale. You will also be building good evaluation data for testing alternative approaches, or even fine tuning.
These aren’t just strict type systems: the language also supports algebraic data types, nominal types, and so on, which let you encode higher-level invariants that the compiler enforces.
The AI essentially becomes a glorified blank-filler. Basic syntax errors and type errors, while common, are automatically caught by the compiler as part of the vibe-coding feedback loop.
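Here's a rough sketch of that loop in Python, assuming a hypothetical ask_model() call and GHC on the PATH (the -fno-code flag type-checks a module without generating code):

    import pathlib
    import subprocess
    import tempfile

    def typecheck_haskell(source: str) -> str:
        """Return GHC's error output, or '' if the module type-checks."""
        with tempfile.TemporaryDirectory() as tmp:
            path = pathlib.Path(tmp) / "Main.hs"   # assumes module Main (or no module header)
            path.write_text(source)
            proc = subprocess.run(["ghc", "-fno-code", str(path)],
                                  capture_output=True, text=True)
            return "" if proc.returncode == 0 else proc.stderr

    def vibe_loop(prompt: str, max_rounds: int = 5) -> str:
        source = ask_model(prompt)                 # hypothetical model call
        for _ in range(max_rounds):
            errors = typecheck_haskell(source)
            if not errors:
                return source                      # the compiler is satisfied
            source = ask_model(prompt + "\n\nFix these compile errors:\n" + errors)
        raise RuntimeError("model never produced a well-typed module")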
One big factor behind this is the fact that you're no longer just writing programs and debugging them incrementally, iteratively dealing with simple concrete errors. Instead, you're writing non-trivial proofs about all possible runs of the program. There are obviously benefits to the outcome of this, but the process is more challenging.
The coding models work really well with esoteric syntaxes, so if the biggest hurdle to adoption of Haskell was syntax, that's definitely less of a hurdle now.
> Instead, you're writing non-trivial proofs about all possible runs of the program.
All possible runs of a program is exactly what HM type systems type-check for. Fed into the coding model, this makes it iterate automatically until it finds a solution that doesn't violate any possible run of the program.
The presence of type classes in Haskell and traits in Rust, and of course the memory lifetime types in Rust, are a big part of the complexity I mentioned.
(Edit: I like type classes and traits. They're a big reason I eventually settled on Haskell over OCaml, and one of the reasons I like Rust. I'm also not such a fan of the "O" in OCaml.)
> All possible runs of a program is exactly what HM type systems type check for.
Yes, my point was this can be a more difficult goal to achieve.
> Fed into the coding model, this makes it iterate automatically until it finds a solution that doesn't violate any possible run of the program.
Only if the model is able to make progress effectively. I have some amusing transcripts of the opposite situation.
We're at the "lol, ai cant draw hands right" stage with these hallucinations, but wait a couple years.
But our human brains do not work like that. You don't reason via your inner monologue (indeed, there are fully functional people with barely any inner monologue); your inner monologue is a projection of thoughts you've already had.
And unfortunately, we have no choice but to use the text input and output of these layers to build agent loops, because building it any other way would be totally incomprehensible (since the meaning of the middle layers' outputs is a mystery). So the only option is an agent that is concerned with self-persuasion (talking to itself).
I remember when computers were lauded for being precise tools.
This is the loop (and honestly, I predicted it way before it started):
1) LLMs can generate code from "natural language" prompts!
2) Oh wait, I actually need to improve my prompt to get LLMs to follow my instructions...
3) Oh wait, no matter how good my prompt is, I need an agent (aka a for loop) that goes through a list of deterministic steps so that it actually follows my instructions...
4) Oh wait, now I need to add deterministic checks (aka, the code that I was actually trying to avoid writing in step 1) so that the LLM follows my instructions...
5) <some time in the future>: I came up with this precise set of keywords that I can feed to the LLM so that it produces the code that I need. Wait a second... I just turned the LLM into a compiler.
The error is believing that "coding" is just accidental complexity. "You don't need a precise specification of the behavior of the computer" is the assumption that would make LLM agents actually viable. And I cannot believe there are software engineers who think coding is accidental complexity. I understand why PMs, CEOs, and other fun people believe it.
Side note: I am not arguing that LLMs/coding agents aren't nice. T9 was nice, autocomplete is nice, LLMs are very nice! But I am getting a bit too fed up with seeing everyone believe that you can get rid of coding.
Except for sites that block any user agent associated with an AI company.
- verifying isn't asking "is it correct?"; verifying is "run requests.get, does it return blah or no?"
just like with humans but usually for different reasons and with slightly different types of failures.
The interesting part, perhaps, is that verifying pretty much always involves code, and code is great pre-compacted context for humans and machines alike. Ever tried to get an LLM to do a visual thing? Why is the couch in the wrong spot with a weird color?
If you make the LLM write a program that generates the image (e.g. a game engine picture or a 3D render), you can enforce the rules with code it can also write for you: now the couch color comes from a hex code and it's placed at the right coordinates, every time.
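A toy sketch of the idea with PIL: the model only emits parameters, the renderer owns the constraints, so the output can't drift. (The scene values here are made up.)

    from PIL import Image, ImageDraw

    def render_room(couch_hex: str, couch_xy: tuple[int, int],
                    size: tuple[int, int] = (640, 480)) -> Image.Image:
        """Draw the couch exactly where, and in exactly the color, the parameters say."""
        img = Image.new("RGB", size, "#f5f0e8")                  # room background
        draw = ImageDraw.Draw(img)
        x, y = couch_xy
        draw.rectangle([x, y, x + 200, y + 80], fill=couch_hex)  # the couch
        return img

    # The model's only job is to emit {"couch_hex": "#4a6b8a", "couch_xy": [180, 300]};
    # the code guarantees the color and position are honored every time.
    render_room("#4a6b8a", (180, 300)).save("room.png")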
Spoiler: there won't be a part 2, or if there is it will be with a different approach. I wrote a followup that summarizes my experiences trying this out in the real world on larger codebases: https://thelisowe.substack.com/p/reflections-on-relentless-v...
tl;dr I use a version of it in my codebases now, but the combination of LLM reward hacking and the long tail of verifiers in a language (some of which don't even exist! Like accurately detecting dead code in Python (vulture et al. can't reliably do this) or valid signatures for property-based tests) makes this problem more complicated than it seems on the surface. It's not intractable, but you'd be writing many different language-specific libraries. And even then, with all of those verifiers in place, there's no guarantee that it will produce a consistent quality of code across repos of different sizes.
It's like writing a script and having the attitude "yeah, I'm good at this, I don't need to actually run it to know it works." Well, it likely won't work, maybe because of a trivial mistake.
Ran into this when writing agents to fix unit tests. Oftentimes they would just give up early, so I started writing the verifiers directly into the agent's control flow, and this produced much more reliable results. I believe Claude Code has hooks that do something similar as well.
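Something like this, where agent_fix() is a hypothetical stand-in for the agent call that proposes a patch given the failure output:

    import subprocess

    def run_tests() -> str:
        """Return failing-test output, or '' when the suite passes."""
        proc = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
        return "" if proc.returncode == 0 else proc.stdout

    def fix_until_green(max_rounds: int = 5) -> bool:
        for _ in range(max_rounds):
            failures = run_tests()
            if not failures:
                return True          # verifier satisfied; only now may the agent stop
            agent_fix(failures)      # hypothetical: the agent edits files based on the failures
        return False                 # don't let the agent declare victory early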
> The next time the agent runs, that rule is injected into its context. It essentially allows me to “Patch” the model’s behavior without rewriting my prompt templates or redeploying code.
Must be satire, right?
> When Steer catches a failure (like an agent wrapping JSON in Markdown), it doesn’t just crash.
Say you are using AI slop without saying you are using AI slop.
> It's not X, it's Y.
I've lost count of how much stuff I've seen there related to things I can credibly professionally or personally speak to that is absolute, unadulterated misinformation and bullshit. And this is now LLM training data.
Your investment is justified! I promise! There's no way you've made a devastating financial mistake!
The LLM provides inputs to your system like any human would, so you have to validate them. Something like pydantic or Django forms is good for this.
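For example, a quick pydantic (v2) sketch with a made-up schema, treating the model's reply like any other untrusted form submission:

    from pydantic import BaseModel, ValidationError

    class TicketTriage(BaseModel):
        priority: int        # e.g. 1-4
        category: str
        needs_human: bool

    def parse_llm_reply(raw: str) -> TicketTriage:
        try:
            return TicketTriage.model_validate_json(raw)
        except ValidationError as err:
            # Same path as a malformed web form: reject, log, and re-prompt or fall back.
            raise ValueError(f"LLM reply failed validation: {err}") from err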
Technically not, we just don't have it high enough
You're doing exactly what you said you wouldn't though. Betting that network requests are more reliable than an LLM: fixing probability with more probability.
Not saying anything about the code - I didn't look at it - but just wanted to highlight the hypocritical statements which could be fixed.