FilterHN

Apple study proves LLM-based AI models are flawed because they cannot reason - https://news.ycombinator.com/item?id=41823822 - Oct 2024 (19 comments)

▲

wkat4242

4 days ago

[-]

LLMs were never designed for this. In Apple's language: "you're holding it wrong".

It's an impressive technology but its limits are highly overlooked in the current hype cycle.

AI researchers have known this from the start and won't be surprised by this because it was never intended to be able to do this.

The problem is the customers who are impressed by the human-sounding bot (sounding human is exactly what an LLM is for) and mentally ascribe human skills and thought processes to it. And start using it for things it's not, like an oracle of knowledge, a reasoning engine or a mathematics expert.

If you want to have knowledge, go to a search engine (a good one like kagi) which can be ai assisted like perplexity. If you want maths, go to Wolfram Alpha. For real reasoning we need a few more steps on the road to general AI.

This is the problem with hypes. People think a tech is the be all end all for everything and no longer regard its limitations. The metaverse hype saw the same problem even though there's some niche usecases where it really shines.

But now it's labelled as a flop because the overblown expectation of all the overhyped investors couldn't be met.

What an LLM is great at is the human interaction part. But it needs to be backed by other types of AI that can actually handle the request and for many usecases this tech still needs to be invented. What we have here is a toy dashboard that looks like one of a real car, except it's not connected to one. The rest will come but it'll take a lot more time. Meanwhile making LLMs smarter will not really solve the problem that they're inherently not the tool for the job they're being used for.

▲

labcomputer

3 days ago

[-]

> LLMs were never designed for this. […] AI researchers have known this from the start and won't be surprised by this because it was never intended to be able to do this.

Sure, but it doesn’t help when well-respected researchers, like Ilya Sutskever, go around talking about how OpenAI’s LLMs have intelligence. There have been plenty of commenters on HN who, without a hint of irony, talk about how “well maybe self-attention is the mechanism of consciousness”. And all the scaling papers suggest (suggest in the sense that they seem to want to draw that inference, not that I am endorsing it) that LLMs have no limit to scaling with more parameters and training tokens. Still other (serious, and well-cited) papers run benchmarks that include math and logic tests… why would anyone do that if they genuinely believed that LLMs are just stochastic next-token predictors?

So this is a little more than “you’re holding it wrong”. The entire AI/ML industry is telling you to hold it that way, then acting shocked when they discovered gambling in the establishment.

▲

insane_dreamer

2 days ago

[-]

> The problem is the customers who are impressed by the human-sounding bot (sounding human is exactly what an LLM is for) and mentally ascribe human skills and thought processes to it.

This is the key issue, I think.

> it was never intended to be able to do this

But to be fair, plenty of technologies start off that way until someone finds a way to make the technology do something it was never intended to do

▲

hackinthebochs

3 days ago

[-]

What does it matter what it was designed for? What matters is what they can do. We still don't know what this new class of computing substrate is capable of. It is premature to definitively say what they can't do.

▲

namaria

4 days ago

[-]

The fatal mistake of this AI cycle was calling LLMs AIs, and publishing impressive chatbots before they were wired up with more useful stuff than passing the Turing test.

It made some billionaires who will argue it was a tremendous idea. But in the long term I think it will cause another AI winter that will dry up funding for useful research that would take longer to mature. Or maybe it's just like fusion... promising on paper but so incredibly expensive to handle as to render it useless in practice.

▲

insane_dreamer

2 days ago

[-]

> The fatal mistake of this AI cycle was calling LLMs AIs, and publishing impressive chatbots before they were wired up with more useful stuff than passing the Turing test

LLMs are extremely useful but it's important to recognize the limited use cases:

- summarizing information

- faster search (summarizes search results for you)

- writing in fluent English

- translating between common languages

- realistic sounding chatbots, customer support

- some lower level coding tasks

- technical answers from published documentation - RTFM for you

None of these are any more "AI" than the previous cycle of "AI" technology

▲

NemoNobody

3 days ago

[-]

Fusion has made leaps recently and is no longer looking as unviable as you suggest.

▲

namaria

3 days ago

[-]

I'm aware, I follow it closely. It's still not viable tho. I'll be thrilled if it ever becomes viable.

▲

vacuity

4 days ago

[-]

I don't know that an AI winter is bad. Frankly, I don't think society is ready for "actual" AI. It'll probably be used to entrench corporations and increase exploitation if we don't fix our current problems.

▲

namaria

4 days ago

[-]

That's assuming 'real AI' is even possible. On a different note, there's an argument dating back to Hobbes' Leviathan that the State and corporations are in fact a form of AI.

▲

tech_ken

3 days ago

[-]

> On a different note, there's an argument dating back to Hobbes' Leviathan that the State and corporations are in fact a form of AI

Woah neat I've been wondering about this myself, any reading you'd recommend?

▲

namaria

3 days ago

[-]

The Cybernetic Hypothesis https://mitpress.mit.edu/9781635900927/the-cybernetic-hypoth...

The Stack: On Software and Sovereignty https://direct.mit.edu/books/monograph/3504/The-StackOn-Soft...

The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power https://www.hbs.edu/faculty/Pages/item.aspx?num=56791

Platform Capitalism https://www.wiley.com/en-in/Platform+Capitalism-p-9781509504...

▲

HKH2

3 days ago

[-]

If you can assume you're talking to a human when talking to an AI then it is real enough.

▲

namaria

3 days ago

[-]

I strongly disagree. Chat gpt seems to be a great argument against that.

▲

HKH2

3 days ago

[-]

So you get tricked into thinking it's a real person, but it's not real enough?

▲

namaria

3 days ago

[-]

Don't play sophistry. It's debasing for both of us.

▲

HKH2

3 days ago

[-]

If an AI can pass a Turing test, does that make people stupid? I don't think that's necessarily so. Either way, I'm glad chatbots will force people to use more discernment.

▲

wkat4242

1 day ago

[-]

No but an LLM can pass a Turing test just because it's echoing humans that can. It's like that coder that copy pastes bits from stack overflow all day but doesn't really know what they're doing.

▲

HKH2

1 hour ago

[-]

Sounds like a toddler.

▲

wkat4242

3 days ago

[-]

I think it is. I'm surprised also we need such extreme resources to do it.

After all, nature can do it with 4kg of brain mass and less than 100 Watts. There must be a way we can do this in tech too.

But yeah I doubt it would be a benefit for society too.

▲

namaria

3 days ago

[-]

Yeah the thing is, nature can't do it with 4 kg 100 watt human brains. Nature can do it with planetary ecosystems running on supernova detritus bathed in just the right amount of star light. It took hundreds of millions of years just on brain development to get here.

The same way that openai did what they did using the whole of common crawl, god knows how many phds, decades of development of basic concepts and several years of datacenters worth of compute. And they got a chatbot.

▲

nomel

2 days ago

[-]

The AI is the head of that causal chain. It required all of that, including human brains, to come into existence.

▲

sim7c00

4 days ago

[-]

llms arent a flop. they make for great chatbots when adequatly trained with billions of dollars. fish clearly go m000! chatgpt even has it in the name yet people are blinded by the greed of others...

▲

n_ary

4 days ago

[-]

Agreed. We actually had a company wide meeting where our senior level management and directors are very happy to announce that they will roll out ChatGPT access to all employees starting with Engineering and HR. Apparently, some random director is buddy with some IBM big honcho and was sold about the immense productivity gains seen in IBM plus how certain IBM office managed to lay off most expensive super senior engineers and had seen only productivity and margin gains.

ChatGPT has the branding and first mover moat. Outside of tech folks, Claude/Gemini/Mistral/Phind/Perplexity/Midjourney do no exist, only ChatGPT+CoPilot are real and AI.

▲

gota

4 days ago

[-]

This seems to be a comprehensive repeat of the "Rot13" and "Mystery Blocks world" experiments as described by Prof. Subbarao Kambhampati

Rot13 meaning that LLMs can't do Rot 3, 4, ..., n except for Rot13 (because that' in the training data)

Mystery Blocks World being a trivial "translation" (by direct replacement of terms) of a simple Blocks World. The LLMs can solve the original, but not the "translation" - susprisingly, even when provided with the term replacements!

Both are discussed in Prof. Subbarao's Machine Learning Street Talks episode

▲

TexanFeller

3 days ago

[-]

A couple years ago I heard an interview with someone that has stuck with me. He said that human “reasoning” didn’t evolve to reason in the logical sense, but to _provide reasons_ likely to be accepted by other humans, allowing better survival by manipulating other humans. This matches my perception of most people’s reasoning.

What’s funny is that AI is now being trained by a human accepting or rejecting its answers, probably not on the basis of the rigor of the answer since the temp worker hired to do it is probably not a logician, mathematician, or scientist. I suspect most people’s reasoning is closer to an LLM’s than we would be comfortable admitting.

▲

kelseyfrog

3 days ago

[-]

I'm of this camp. I believe we rationalize our decisions rather than decide rationally [1]. The two are so intertwined in our minds that we confuse even ourselves. Outside of specific rationality frameworks, we can't help but do it. The smarter someone is, the more likely they'll be able to better rationalize and more clearly articulate their rationalizations, so intelligence isn't even an escape hatch.

1. The vast vast majority of the time

▲

randcraw

3 days ago

[-]

Yeah, that kind of reasoning (abduction) tends to be included with deduction and induction as the third form of reasoning. But IMO, it's really not. It's essentially confabulation, or legalistic argument -- where you're not attempting to discover an outcome (entailment) but you're trying to convince others that a desired outcome is plausible because it's possible to get there by cherry-picking a line of reasons.

If abduction has an Achilles Heel, it's that the sequence of reasons proposed is only ONE OF MANY possible chains of reasons that could have led to the current outcome. That tells you nothing about one possible chain's correctness and little about its likelihood to be correct (since you need to understand the degrees of freedom in each scenario to do that). That makes abduction more useful as a means to express personal bias and manipulate the beliefs of others than an attempt to fair-mindedly understand how something came to be.

It's hardly surprising that LLMs employ abduction rather than deduction or induction. Abduction is really just storytelling, where you recount a similar tale that resembles the current scenario (and ends where you want it to). It doesn't require any ability to generalize similar events into precise logical rules based on instances that share the same dependencies, mechanism of action, and outcome. Only by creating such rules using induction can deductive reasoning then take place using them. But LLMs associate, they don't equate. To date, I see no path to that changing.

▲

AnimalMuppet

3 days ago

[-]

If you believe Darwin, then human reasoning evolved to give a right enough answer fast enough for the individual to have higher survival odds in the most common situations that it ran into. Even persuading other humans is an extra, not the real "purpose".

▲

llamaimperative

3 days ago

[-]

If you believe Darwin, there's no distinction between primary, "extra", or "real" purposes. All of these forces are in play all the time.

▲

airstrike

4 days ago

[-]

> OpenAI's ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic.

In other words, ChatGPT continues to dominate. A 0.3% drop might as well be noise.

Also the original, allegedly more expensive GPT-4 (can we call it ChatGPT-4og??) is conspicuously missing from the report...

▲

osigurdson

3 days ago

[-]

Companies have been salivating at the possibility of firing everybody and paying ChatGPT $20 per month instead to run the entire business. I don't have any moral objections to it but find it incredibly naive. ChatGPT / LLMs help a bit - that's it.

▲

nomel

2 days ago

[-]

$20/month to help a bit seems incredibly cheap.

▲

randcraw

3 days ago

[-]

One think I like about this effort is their attempt to factor out the cacheing of prior answers due to having asked a similar question before. Due to the nearly eidetic memoization ability of LLMs, no cognitive benchmark can be meaningful unless the LLM's question history can somehow be voided after each query. I think this is especially true when measuring reasoning which will surely benefit greatly from the cacheing of answers from earlier questions into a working set that will enhance its associations on future similar questions -- which only looks like reasoning.

▲

rotexo

3 days ago

[-]

This talk touches on similar points, I haven’t finished listening to it though https://youtu.be/s7_NlkBwdj8

▲

cyanydeez

4 days ago

[-]

Best thing LLMs do is add to the theory of p-zombies among the population.

Instead of the dead Internet theory, we should start finding what percent of the population is no better than a LLM.

▲

jnwatson

4 days ago

[-]

The next question is what percent of jobs can be performed by such p-zombies? The answer is probably pretty high.

▲

Aeglaecia

4 days ago

[-]

its not a theory anymore, starting to notice reddit threads entirely ai generated from the content down to the comments

▲

n_ary

4 days ago

[-]

What beneficial outcomes are gained from such behavior by the contributors?

The real fun is in intellectual engagement, but if thread is generated by bots and commented by bots as well, all I can see is a fake depiction of activity.

However, I understand that, my perspective of beneficial activity could be limited.

▲

Aeglaecia

3 days ago

[-]

jack up engagement by humans , promote products , use your imagination or wait a year and find out lol

▲

cyanydeez

2 days ago

[-]

All thwse bots ate poltical engagement vampires and plat forms are incemtivized to ignore them because they make sitws seem popular.

And dont forget, reddit started life as a bunch of owner sock puppets.

▲

Aeglaecia

2 days ago

[-]

its far too late to malform your drop in the 0cean of online comments. it wont do much but make it harder for some people to read. but i ddo find it amusing. have you seen the paper where deanonymysing requires like 7 data points? if someone wants to u they will make a digital copy of u. the physical copy already existed from your 3rd cousins doing 23 and me. stop worrying and start loving the bomb mate. anyway re the point, reddit started jacking up ai comments around the ipo, the ai content is probably sole traders farming for views symbiotically pairing with the ai comments, or perhaps it is the next step in jacking engagement to make site metrics look better.

▲

jprete

4 days ago

[-]

I believe this is true, but I'm curious to see an example, if you have one.

▲

Aeglaecia

3 days ago

[-]

the two I recollect have been deleted by mods, the overarching theme was ai generated video used to promote an ai generated product , 30+ top level comments with only one pointing out the insanity , obviously this does not necessarily indicate all comments as ai they could simply not have noticed , a worse possibility to me ...

▲

shermantanktop

4 days ago

[-]

Well, now your comment is part of the corpus that will get sucked up and trained on. So retroactively your comment won’t qualify as better, I guess.

▲

Aeglaecia

4 days ago

[-]

all the more incentive to behave like an illitierate dick

▲

jokoon

4 days ago

[-]

Finally, some people are using basic cognition science to evaluate AI

Also they mapped an insect brain

Seems like my several comments suggesting AI scientists should peek other fields, did get some attention.

That probably makes me the most talented and insightful AI scientist on the planet.

▲

bubble12345

4 days ago

[-]

I mean so far LLMs can't even do addition and multiplication of integers accurately. So we can't really expect too much in terms of logical reasoning.

▲

boroboro4

4 days ago

[-]

Can you multiply 1682671 and 168363 without pen and paper? I can’t. LLMs can if you force them do it step by step, but can’t in one shot.

▲

janalsncm

4 days ago

[-]

For logical reasoning tasks you should use pen and paper if necessary, not just say the first thing that comes to mind.

Comparing one-shot LLM responses with what a human can do in their head doesn’t make much sense. If you ask a person, they would try to work out the answer using a logical process but fail due to a shortage of working memory.

An LLM will fail at the task because it is trying to generate a response token by token, which doesn’t make any sense. The next digit in the number can only be determined by following a sequence of logical steps, not by sampling from a probability distribution of next tokens. If the model was really reasoning the probability for each incorrect digit would be zero.

▲

boroboro4

3 days ago

[-]

And that's why OpenAI o1 will use chain of thoughts for this particular question rather than hallucinate approximate answer. And it does work just like before by generating token by token.

▲

janalsncm

3 days ago

[-]

Here are some actual performance metrics:

https://x.com/yuntiandeng/status/1836114401213989366

If chain of thought really worked we should see no difference between 1 digit and 20 digit multiplication.

▲

Tainnor

4 days ago

[-]

No, but you can say "I don't know", "I can't do this in my head", "Why is this important?", "Let me get my calculator" or any other thing that is categorically more useful than just making up a result.

▲

solveit

4 days ago

[-]

It's relatively trivial to get an LLM that does that and every big lab has one, even if they're not selling them.

ChatGPT 4o as of right now just runs python code, which I guess is "Let me get my calculator", see https://chatgpt.com/share/670df313-9f88-8004-a137-22c302f8bf...).

Claude 3.5 just... does the multiplication correctly by independently deciding to go step-by-step (don't see a convenient way to share conversations, but the prompt was just "What is 1682671* 168363?").

▲

serf

4 days ago

[-]

it's a weird differentiation , part of how they do that is by reading back what they said - someone trained in doing so could essentially abuse this characteristic themselves to do the math in a simplified step by step way if they had perfect recall of what they said or wrote..

in other words, for the LLMs that do that kind of thing well, like gpt-o1, don't they essentially also use 'a pen and paper'?

▲

boroboro4

4 days ago

[-]

And this is very good comparison, because o1 indeed does multiply these numbers correctly...

Ask LLMs without chain of thought built-in is the same as to ask people to multiply these numbers without pen and paper. And LLMs with chain of thought actually are capable of doing this math.

▲

akomtu

4 days ago

[-]

LLMs have pen and paper: it's their output buffer, capped to a few KBs, which is far longer than necessary to multiply the two numbers.

If you tell an LLM to explain how to multiply two numbers it will give a flawless textbook answer. However when you ask it to actually multiply the numbers it will fail. LLMs have all the knowledge in the world in their memory, but they can't connect that knowledge into a coherent picture.

▲

namaria

4 days ago

[-]

They have codified human knowledge in human language, represented by arrays of numbers. They can't access that knowledge in any meaningful way, they can just shuffle numbers to give the illusion of cogency.

▲

auggierose

4 days ago

[-]

Does that make an LLM the perfect academic?

▲

myflash13

4 days ago

[-]

Pen and paper? LLMs are literally a computer program that cannot compute.

▲

moi2388

4 days ago

[-]

But it can call into systems that can do compute.

Do you think your inner monologue is any different? Because it sure as hell isn’t the same system as the one doing math, or recognising faces, or storing or retrieving memories, to name a few

▲

carlmr

4 days ago

[-]

The comparison makes sense though. We're trying to build an simulated brain. We want to create a brain that can think about math.

And chain of thought is kind of like giving that brain some scratch space to figure out the problem.

This simulated brain can't access multiplication instructions on the CPU directly. It has to do the computation via it's simulated neurons interacting.

This is why it's not so surprising that this is an issue.

▲

namaria

4 days ago

[-]

LLMs are not simulating brains in any capacity. The words 'neural network' shouldn't be taken at face value. A single human neuron can take quite a few 'neurons' and layers to simulate as a 'neural network'.

▲

carlmr

3 days ago

[-]

Sure, but the basic idea of firing neurons is there, and the connection of these "neurons" to a neural network like an LLM does not allow the network to perform computations directly.

The level of detail of the simulation has little bearing on this. And in fact whether you call it a simulation or something else doesn't matter either. Understanding that the LLM does not compute by using the CPU or GPU directly is what's necessary to understand why computation is hard for LLMs.

▲

ulbu

4 days ago

[-]

Does it have an understanding of the strict rules that govern the problem and that it needs to produce a result that is in total accordance to them? (In accordance which is not 100%, but boolean) i.e., can it apply a function over a sentence?

I don’t know, that’s why I ask.

▲

ThunderSizzle

4 days ago

[-]

The answer is sometimes. Typically it'll forget rules you've given it by the time it might be useful because of the memory limit of LLMs. Either way, you basically need to know it's hallucinating to you so you can keep applying more rules.

▲

blitzar

4 days ago

[-]

282399355737 - My answer is not wrong, I was hallucinating.

▲

tanduv

4 days ago

[-]

yea, but I'm able to count the number of r's in 'strawberry' without second guessing myself

▲

mewpmewp2

4 days ago

[-]

Except o1 can do that and previously gpt could also do it if you asked it to count character by character while keeping count.

▲

krick

3 days ago

[-]

First off, I want to say this is kinda baffling to me, that this is some kind of novel "research", and it's published by Apple of all companies in the field. I could be more forgiving that some journalists try to sell it as "look, LLMs are incapable of logical reasoning!", because journalists always shout loud stupid stuff, otherwise they don't get paid, apparently. But still, it's kind of hard to justify the nature of this "advancement".

I mean, what is being described seems like super basic debug step for any real world system. This is kind of stuff not very advanced QA teams in boring banks do to test your super-boring not very advanced back-office bookkeeping systems. After this kind of testing reveals a number of bugs, you don't erase this bookkeeping system and conclude banking should be done manually on paper only, since computers are obviously incapable of making correct decisions, you fix these problems one by one, which sometimes means not just fixing a software bug, but revisioning the whole business-logic of the process. But this is, you know, routine.

So, not being aware of what are these benchmarks everyone uses to test LLM-products (please note, they are not testing LLMs as some kind of concept here, they are testing products), I would assume that OpenAI in particular, and any major company that released their own LLM product in the last couple of years in general, already does this super-obvious thing. But why this huge discovery happens now, then?

Well, obviously, there are 2 possibilities. Either none of them really do this, which sounds unbelievable - what all these high-paid genius researchers even do then? Or, more plausibly, they do, but not publish that. This one sounds reasonable, given there's no OpenAI, but AltmanAI, and all that stuff. Like, they compete to make a better general reasoning system, of course they don't want to reveal all their research.

But this doesn't really look reasonable to me (at least, at this very moment) given how basic the problem being discussed is. I mean, every school kid knows you shouldn't test on data you use for learning, so to be "peeking into answers when writing a test" only to make your product to perform slightly better on popular benchmarks seems super-cheap. I can understand when Qualcomm tweaks processors specifically to beat AnTuTu, but trying to beat problem-solving by improving your crawler to grab all tests on the internet is pointless. It seems, they should actively try to not contaminate their learning step by training on popular benchmarks. So what's going on? Are people working on these systems really that uncreative?

This said, all of it only applies to general approach, which is to say it's about what article claims, not what it shows. I personally am not convinced.

Let's take kiwi example. The whole argument is framed as if it's obvious that the model shouldn't have substracted these 5 kiwies. I don't know about that. Let's imagine, this is a real test, done by real kids. I guarantee you, the most (all?) of them would be rather confused by the wording. Like, what should we do with this information? Why was it included? Then, they will decide if they should or shouldn't substract the 5. I won't try to guess how many of them will, but the important thing is, they'll have to make this decision, and (hopefully) nobody will suddenly multiply the answer by 5 or do some meaningless shit like that.

And neither did LLMs in question, apparently.

In the end, these students will get the wrong answer, sure. But who decides if it's wrong? Well, of course, the teacher does. Why it's wrong? Well, "because it wasn't said you should discard small kiwies!" Great, man, you also didn't tell us we shouldn't discard them. This isn't a formal algebra problem, we are trying to use some common sense here.

In the end, it doesn't really matter, what teacher thinks the correct answer is, because it was just a stupid test. You may never really agree with him on this one, and it won't affect your life. Probably, you'll end up making more than him anyway, so here's your consolation.

So framing situations like this as a proof that LLM gets things objectively wrong just isn't right. It did subjectively wrong, judged by opinion of Apple researchers in question, and some other folks. Of course, this is what LLM development essentially is: doing whatever magic you deem necessary, to get it give more subjectively correct answers. And this returns it's to my first point: what is OpenAI's (Anthropic's, Meta's, etc) subjectively correct answer here? What is the end goal anyway? Why this "research" comes from "Apple researchers", not from one of these compenies' tech blogs?

▲

fungiblecog

4 days ago

[-]

No shit!