FilterHN

2 days ago

[-]

I very much disagree. To attempt a proof by contradiction:

Let us assume that the author's premise is correct, and LLMs are plenty powerful given the right context. Can an LLM recognize the context deficit and frame the right questions to ask?

They can not: LLMs have no ability to understand when to stop and ask for directions. They routinely produce contradictions, fail simple tasks like counting the letters in a word etc. etc. They can not even reliably execute my "ok modify this text in canvas" vs "leave canvas alone, provide suggestions in chat, apply an edit once approved" instructions.

bubblyworld

2 days ago

[-]

This is not a proof by contradiction - you have stated an assumption followed by a bunch of non-sequitors about what LLMs can and can't do, also known as begging the question. Under the conditions of your assumption (namely that LLMs are plenty powerful with the right context) why would you believe anything in your last paragraph? That's how a proof by contradiction works.

(not saying you are wrong, necessarily, but I don't think this argument holds water)

1 day ago

[-]

> you have stated an assumption

I don't think I stated an assumption, this is an assertion, worded rhetorically. You are welcome to disagree with it and refute it, but its structural role is not that of an assumption.

"Can an LLM recognize the context deficit and frame the right questions to ask?"

> a bunch of non-sequitors

I'm guessing you're referring to the "canvas or not" bit? The sequitir there was that LLMs routinely fail to execute simple instructions for which they have all the context.

> not saying you are wrong

Happy to hear counterarguments of course, but I do not yet see an argument for why what I said was not structurally coherent as counterexamples, nor anything that weakens the specifics of what I said.

therein

2 days ago

[-]

I agree it isn't really proof by contradiction. It is more like proof by demonstration of concrete failures in real life demonstrations, which is stronger.

It is like the author is saying 12 is a prime number and I am like but I divided it by 2 just the other day.

OrderlyTiamat

2 days ago

[-]

Nit pick, but proof by contradiction is necessarily stronger as it is deductive reasoning, and this kind of "proof" by anecdotal evidence doesn't rise above abductive reasoning. Still useful, very much not a proof.

woooooo

2 days ago

[-]

We don't have a formal model of how/why any given LLM works, and incidentally we're also short on proofs for real-world software and organizations.

Empirical facts are the strongest thing we have in this domain.

giantg2

2 days ago

[-]

You don't need a full model. You can build deductive arguments using empirical facts to support the premises.

bubblyworld

2 days ago

[-]

True, but in this case these are hardly globally applicable facts about LLM-based systems (not nearly to the same degree as "12 divides 2" anyway). Different systems have different properties on all those fronts.

I don't think no argument is the right substitute for a bad one!

bobbylarrybobby

2 days ago

[-]

Claude routinely stops and asks me clarifying questions before continuing, especially when the given extended thinking or doing research.

diggan

2 days ago

[-]

Indeed, the ability to do so seems to depend more on how well your system prompt is laying out that workflow, than how "intelligent" the model is.

JumpCrisscross

1 day ago

[-]

Haven’t we all met a smart person who never learned to think critically or in structured ways?

tharant

1 day ago

[-]

Hi, it’s me.

2 days ago

[-]

Prompting it to ask clarifying questions will make it ask questions it has seen before, not ask questions it needs you to clarify. So that doesn't solve the problem, it just causes other problems.

If it actually did solve the problem then they would train the models to act that way by default, so anything that you need to make smart prompts for has to be dumb.

beering

2 days ago

[-]

It feels crazy to keep arguing about LLMs being able to do this or that, but not mention the specific model? The post author only mentions the IMO gold-medal model. And your post could be about anything. Am I to believe that the two of you are talking about the same thing? This discussion is not useful if that’s not the case.

themanmaran

2 days ago

[-]

This depends on whether you mean LLMs in the sense of single shot, or LLMs + software built around it. I think a lot of people conflate the two.

In our application e use a multi-step check_knowledge_base workflow before and after each LLM request. Pretty much, make a separate LLM request to check the query against the existing context to see if more info is needed, and a second check after generation to see if output text exceeded it's knowledge base.

And the results are really good. Now coding agents in your example are definitely stepwise more complex, but the same guardrails can apply.

visarga

2 days ago

[-]

> Pretty much, make a separate LLM request to check the query against the existing context to see if more info is needed, and a second check after generation to see if output text exceeded it's knowledge base.

They are unreliable at that. They can't reliably judge LLM outputs without access to the environment where those actions are executed and sufficient time to actually get to the outcomes that provide feedback signal.

For example I was working on evaluation for an AI agent. The agent was about 80% correct, and the LLM judge about 80% accurate in assessing the agent. How can we have self correcting AI when it can't reliably self correct? Hence my idea - only the environment outcomes over a sufficient time span can validate work. But that is also expensive and risky.

2 days ago

[-]

are the different LLMs correlated in what they get wrong? I suspect they are, given how much incest there's been in their training, but if they each have some edge in one particular area, you could use a committee. would cost that much more tokens, obviously.

patrickhogan1

2 days ago

[-]

Do you have a concrete example of what you mean?

For example, the article above was insightful. But the authors pointing to 1,000s of disparate workflows that could be solved with the right context, without actually providing 1 concrete example of how he accomplishes this makes the post weaker.

themanmaran

1 day ago

[-]

Sure, concrete example. We do conversational AI for banks, and spend a lot of time on the compliance side. Biggest thing is we don't want the LLM to ever give back an answer that could violate something like ECOA.

So every message that gets generated by the first LLM is then passed to a second series of LLM requests + a distilled version of the legislation. ex: "Does this message imply likelihood of credit approval (True/False)". Then we can score the original LLM response based on that rubric.

All of the compliance checks are very standardized, and have very little reasoning requirements, since they can mostly be distilled into a series of ~20 booleans.

patrickhogan1

18 hours ago

[-]

Thank you! Great example!

vrighter

2 days ago

[-]

if an llm is unreliable, then why would another just-as-unreliable llm make it any better?

ordersofmag

2 days ago

[-]

If a hard drive sometimes fails, why would a raid with multiple hard drives be any more reliable?

"Do task x" and "Is this answer to task x correct?" are two very different prompts and aren't guaranteed to have the same failure modes. They might, but they might not.

citrin_ru

1 day ago

[-]

RAID only works when failures are independent. E. g. if you bought two drivers from the same faulty batch which die after 1000 power-on hours RAID would not help. With LLM it’s not obvious that errors are not correlated.

giantrobot

2 days ago

[-]

> If a hard drive sometimes fails, why would a raid with multiple hard drives be any more reliable?

This is not quite the same situation. It's also the core conceit of self-healing file systems like ZFS. In the case of ZFS it not only stores redundant data but redundant error correction. It allows failures to not only be detected but corrected based on the ground truth (the original data).

In the case of an LLM backstopping an LLM, they both have similar probabilities for errors and no inherent ground truth. They don't necessarily memorize facts in their training data. Even with a RAG the embeddings still aren't memorized.

It gives you a constant probability for uncorrectable bullshit. One of the biggest problems with LLMs is the opportunity for subtle bullshit. People can also introduce subtle errors recalling things but they can be held accountable when that happens. An LLM might be correct nine out of ten times with the same context or only incorrect given a particular context. Even two releases of the same model might not introduce the error the same way. People can even prompt a model to error in a particular way.

potsandpans

1 day ago

[-]

If one person is unreliable, why would a group of people make it any better.

ainllauh

1 day ago

[-]

Yeah 15 random guys ought to do surgery just as well as one surgeon right?

ramesh31

2 days ago

[-]

>They routinely produce contradictions, fail simple tasks like counting the letters in a word etc. etc

It's all about tools. Given sufficient tooling, the model's inherent abilities become irrelevant. Give a model a tool that counts characters and it will get this question right 100% of the time. Copy and paste to your domain. And what are tools but a means of providing context from the real world? People seem blinded by focusing on the raw abilities of models, missing the fact that these things should be seen simply as reasoning engines for tool usage.

antithesizer

2 days ago

[-]

The word you're looking for is "rebuttal" since this is neither proof nor refutation of anything, but merely an argument against the thesis.

1 day ago

[-]

A rebuttal is just an alias for "counterargument", it does not define the structure of the counterargument.

However flawed, what I said did have a structure (please refer to my other response in this thread for why).

Wowfunhappy

2 days ago

[-]

> LLMs have no ability to understand when to stop and ask for directions.

I haven't read TFA so I may be missing the point. However, I have had success getting Claude to stop and ask for directions by specifically prompting it to do so. "If you're stuck or the task seems impossible, please stop and explain the problem to me so I can help you."

1 day ago

[-]

Ok I think the confusion arises because of the probabilistic nature of LLM responses that blurs the line between "intelligent vs not".

Let's take driving a car as an example, and a random decision generator as a lower bound on the intelligence of the driver.

- A professionally trained human, who is not fatigued or unhealthy or substance-impaired, rarely makes a mistake, and when they do, there are reasonable mitigating factors.

- ML models, OTOH, are very brittle and probabilistic. A model trained on blue tinted windshields may suffer a dramatic drop in performance if ran on yellow-tinted windshields.

Models are unpredictably probabilistic. They do not learn a complete world model, but the very specific conditions and circumstances of their training dataset.

They continue to get better, and you are able to induce a behavior similar to true intelligence more and more often. In your case, you are able to get them to stop and ask, but if they had the ability to do this reliably, they would not make mistakes as agents at all. Right now they resemble intelligence under a very specific light, and as the regimes under which they resemble one get bigger, they will get to AGIs. But we're not there yet.

thorum

2 days ago

[-]

This article is insightful, but I blinked when I saw the headline “Reducing the human bottleneck” used without any apparent irony.

At some point we should probably take a step back and ask “Why do we want to solve this problem?” Is a world where AI systems are highly intelligent tools, but humans are needed to manage the high level complexity of the real world… supposed to be a disappointing outcome?

Davidzheng

2 days ago

[-]

it actually doesn't matter what we want. Because eliminating it will in long run increase yield, economic forces will automate humans away by capitalistic forces.

swivelmaster

2 days ago

[-]

We should stop considering it a given that capitalistic forces will do this and start considering how we build systems that optimize for the maximum amount of human good rather than the maximum amount of abstract economic good (which nowadays usually means an increase in wealth disparity).

CamperBob2

1 day ago

[-]

Because no one has come up with a system that's better than capitalism at accomplishing that.

itsalotoffun

2 days ago

[-]

This is correct. It will require non-market forces to regulate soft-landings for humans. We may see a wave of "job-preserving" legislation in the coming years but these will eventually be washed away in favor of taxing the AI economy.

LunaSea

2 days ago

[-]

If you don't have customers anymore, who are you selling your products to?

fnordpiglet

2 days ago

[-]

Assuming you buy the idea of a post scarcity society and assuming we can separate our long ingrained notion that spending your existence in toil to survive is a moral imperative and not working is deserving of punishment if not death, I personally look forward to a time we can get off the hamster wheel. Most buttons that get pushed by people are buttons not worth spending your existence pushing. This includes an awful lot of “knowledge work,” which is often better paid but more insidious in that it requires not just your presence but capturing your entire attention and mind inside and outside work. I would also be hopeful that fertility rates would decline and there would simply be far fewer humans.

In Asimov’s robots stories the spacers are long lived and low population because robots do most everything. He presents this as a dead end, that stops us from conquering the galaxy. This to me sounds like a feature not a bug. I think human existence could be quite good with large scale automation, fewer people, and less suffering due to the necessity for everyone to be employed.

Note I recognize you’re not saying exactly the same thing as I’m saying. I think humans will never cede full executive control by choice at some level. But I suspect, sadly, power will be confined to those few who do get to manage the high level complexity of the real world.

nradov

2 days ago

[-]

We will never have a post scarcity society. Automation can make certain foodstuffs and manufactured goods somewhat cheaper but the things that people really want will always be in short supply, for example real estate in geographically favorable areas.

lll-o-lll

2 days ago

[-]

With a stable population, post scarcity is surely possible technically. Just invest resource into improving everything that already exists.

I also agree that we will never have a post scarcity society; but this is more about humanity than technology.

parineum

2 days ago

[-]

There will always be scarcity for goods whose value is derived from their scarcity.

Maybe food won't be scarce (we wre actually very close to that) and shelter may not be scarce but, even if you invent the replicator, there will still be things that are bespoke.

2 days ago

[-]

there are levels of post scarcity. if food, shelter, medicine and leisure are available to all for almost no toil, then we're in post-scarcity. You'll (probably) never have your own planet. You might never be able to convince a certain artist to produce something for you personally.

alanbernstein

2 days ago

[-]

I have never understood "post scarcity" to mean the end of ALL scarcity, which is essentially impossible by definition.

Relative to 500 years ago, we have already nearly achieved post-scarcity for a few types of items, like basic clothing.

It seems this is yet another concept for which we need to adjust our understanding from binary to a spectrum, as we find our society advancing along the spectrum, in at least some aspects.

nasmorn

2 days ago

[-]

Also for basic food. You can get all the rice and beans you really need for basically no money. That means actual starvation is nowadays a political not a resource issue

aswegs8

1 day ago

[-]

But what is scarce now, really? We are just moving up Maslow's hierarchy and strive for more abstract concepts like power, self-actualization or recognition. How would AI provide those to us? Our self-image is linked to other's perception of us. It feels like we would have to radically re-think what it means to be human. As someone above wrote, we need to overcome equating existence with toil, however, it seems so ingrained into human nature to compete that we might all have to hop on antidepressants at some point.

boredhedgehog

2 days ago

[-]

When the celibate classes have been able to sublimate what is arguably the strongest of all wants for as long as they have, I doubt there is any desire that could not be redirected with similar techniques.

nine_k

2 days ago

[-]

This assumes that the celibate was actually maintained, not pretended and secretly violated. There is plenty of evidence that those who were intended to preserve celibate in medieval times actually did not.

nradov

2 days ago

[-]

Lol buddy if you really believe there are any "celibate classes" then you've fallen for the oldest con in the world.

addaon

20 hours ago

[-]

Such cynicism! I’m proudly celibate, like my father before me, and my grandfather before him!

hgomersall

2 days ago

[-]

Indeed, scarcity will be artificially created if it's not naturally present. The human need to have something that others do not is strong.

variadix

2 days ago

[-]

You’re not imagining what post scarcity can really look like. If you have abundant energy, automation, etc. you could manipulate geography and climate, you could build artificial land mass, and so on. It really depends on what people mean by post scarcity.

GauntletWizard

2 days ago

[-]

We can automate plenty in physiological needs, and in fact have already. There's plenty of food and housing for everyone to have them, but a bunch of people will immediately destroy them if provided with such. I don't think "Dispose of a full house every 3 months" will ever be practical, but we might be able to "solve" physiological needs.

Safety needs might be possible to solve. Totalitarian states with ubiquitous panopticons can leave you "safe" in a crime sense, and AI gaslighting and happy pills will make you "feel" safe.

Love and belonging we have "Plenty" of already - If you're looking for your people, you can find them. Plenty aren't willing to look.

But once you get up to Esteem, it all falls apart. Reputation and Respect are not scalable. There will always be a limited quantity of being "The Best" at anything, and many are not willing to be "The Best" within tight constraints; There's always competition. You can plausibly say that this category is inherently competitive. There's no respect without disrespect. There's no best if there's no second best, and second best is first loser. So long as humans interact with each other - So long as we're not each locked in our own private shards of reality - There will be competition, and there will be those that fall short.

Self Actualization is almost irrelevant at this point. It falls into exactly the same as the above. You can simulate a reality where someone is always the best at whatever they decide to so, but I think it will inherently feel hollow. Agent Smith said it best: https://youtu.be/9Qs3GlNZMhY?t=23

vladms

2 days ago

[-]

> There will always be a limited quantity of being "The Best" at anything

Still, to pick a simple example, we do have different sports at which different people are "The Best". One solution would be to multiple the categories, which I feel is already happening to some extent with all the computer games or niche artistical trends.

And I would claim that very few people are "The Best", it's mostly about not being "the worst" at everything you are involved in.

GauntletWizard

2 days ago

[-]

You would think, but you've never seen drama like single-speedrunner games. They know they're unfulfilled and kings of a molehill, and as soon as there's the slightest competition - a single other "run" from someone who bothers with a little practice - there's a blowup. Super-niche-ing is not the solution you think it is.

djrj477dhsnv

2 days ago

[-]

Do you really want to live in this "post scarcity" world? With no effort required to meet your needs and desires, what motivation will you have to do anything?

Kaczynski's warnings seem more apt with every year that passes.

javcasas

2 days ago

[-]

I want to live in the post scarcity world. Given that we are headed into an ultra-productive world, I prefer by miles a world without scarcity over a world full of scarcity because the elites are hoarding the resources, and the only way to provide for oneself is by outcompeting the machines that already produce at zero marginal price, but only for the elites.

oinfoalgo

1 day ago

[-]

It doesn't solve the problem that people want what other people want.

In a "post scarcity" world we will figure out how to make certain things scarce and more desirable. Then people will start gaming the system to try to acquire the more expensive/scarce items. Some will even make it their life mission to acquire the intentionally scarce items/experiences.

Basically, the same situation we have now.

hdgvhicv

2 days ago

[-]

Plenty of weathy people do things, not because they have to, but because they want to

Plenty of retired people carry on doing things too

kbrkbr

2 days ago

[-]

Here is another view: some of them maybe do things to perform richness. And others are probably so bored that they just try new extreme things, but nothing fills that inner void. I can't get no satisfaction.

Or maybe not. I'll never know.

1 day ago

[-]

so the only way to get that inner satisfaction is to do work that you hate? is there another option I'm missing here?

djrj477dhsnv

1 day ago

[-]

Most people find work that directly addresses their basic needs fulfilling to some degree. Things like growing your own food, hunting, building and maintaining your home, etc.

1 day ago

[-]

and they can still do that if food is available for free. in fact people today put enormous amounts of effort into growing their own "organic" food instead of buying the cheap stuff from supermarkets. I don't see why that wouldn't still do it when it's essentially free.

IanCal

2 days ago

[-]

People dedicate their lives to making realistic paintings despite being able to buy a far more accurate camera for a few hours of work. I’m not hugely convinced that we should worry about work to stay alive and sheltered.

vkou

2 days ago

[-]

Look, even a stopped clock is right twice a day.

Kaczynski didn't invent any of these ideas, or even develop them, instead of citing him, why not cite... Literally any other person with them whose mind wasn't blown out by LSD and a desire to commit random political murder.

You're doing your point a disservice by bringing in all of that baggage.

djrj477dhsnv

2 days ago

[-]

Perhaps there are more original or precise sources for the ideas. I've read Jacques Ellul, for example, but for someone not well versed in philosophy like myself, Kaczynski is more accessible and well known.

I don't agree with many of his conclusions or actions, but I have no problem judging the good ideas he advocated on their own merit.

frozenseven

2 days ago

[-]

>Do you really want to live in this "post scarcity" world?

Yes.

>Kaczynski

You're citing a psychopathic terrorist who murdered 3 people and injured a further 23.

>what motivation will you have to do anything?

For one thing, freedom from self-appointed taskmasters who view Kaczynski as a source of inspiration.

1 day ago

[-]

I'm practically living in a post scarcity situation - my work is stuff I'd do for fun anyway, other than a bit of paperwork now and again. nothing is compulsory if you want to do it anyway. even then I only need to work part time to survive.

the rest of the time I spend studying and doing sports. I've tried doing nothing - but boredom is actually worse than work.

what I really want is for other people to also be in a similar situation. I also want to be able to afford to just not work for 6 months and travel the world - but I've got a mortgage to pay. so I think further reductions in scarcity in my life would not reduce my drive to do, learn, experience one bit.

I suspect that most people would be the same if they weren't accustomed to not having the energy to look after themselves and growing their mind.

aswegs8

1 day ago

[-]

Can you think of any more literature on the topic? Preferably non-fiction?

fmbb

2 days ago

[-]

> It’s because the bottleneck isn’t in intelligence, but in human tasks: specifying intent and context engineering.

So the bottleneck is intelligence.

Junior engineers are intelligent enough to understand when they don't understand. They interrogate the intent and context of the tasks they are given. This is intelligence.

Solving math questions is not intelligence, computers have been better than humans at that for like 100 years, as long as you first do the intelligent part as a human: specifying the task formally.

Now we just have computer programs with another kind of input in natural language, and which require dozens of gigabytes of video ram and millions of cores to execute. And we still have to have humans to the intelligent part, figure out how to describe the problem so the dumb but very very fast machine can answer the question.

agentcoops

2 days ago

[-]

I'm not sure your argument applies only to AI. Intelligence is certainly not knowing through, say, divine inspiration what another person wants you to do. This bottleneck of "describing the problem" is the same bottleneck faced when working with junior (or senior) engineers, especially in a team. One need only consider the classic of our field, Mythical Man-Month, which is really dedicated to this precise and, in some sense, irresolvable problem -- often it's best to just have one person who understands and ideally first posed the problem do the work, rather than introduce this bottleneck of communication.

It's a difficult and crucial problem, we all agree, but it's a stretch to define intelligence as such to be "describing the problem." Choosing the right problem in the first place (i.e. not just telling person B to do X but selecting the X that in fact is worth pursuing), perhaps, but I don't think that's right either as a definition of intelligence. Indeed, even the best scientists often speak of an "intuition" that drives their choice of problems.

More classical definitions place intelligence in the domain of "means-ends rationality", i.e. given an end to pursue being capable of determining the correct way to do so and carrying it out until completion. A calculator like a hammer is certainly not intelligent in that sense, but I would struggle to see how even an AI skeptic could maintain that state-of-the-art LLMs today are not a qualitative step above calculators according to this measure.

fmbb

2 days ago

[-]

All living things have means and ends and pursue goals to completion. That does not make us call them intelligent.

Whenever the LLM fails to act intelligently, we blame the person who gave it the task. So we don't expect them to be able to figure anything out, we are just treating them as easily reconfigurable Skinner boxes.

I'm not an expert or even very interested in the field so I cannot judge what you propose, only intuit from the word "intelligence" and how these machines are described to work and how I observe them working. Reading a bit of https://en.wikipedia.org/wiki/Intelligence leads me to believe these machines have even less to do with any classical definition of intelligence, but I did notice that

> Scholars studying artificial intelligence have proposed definitions of intelligence that include the intelligence demonstrated by machines

which seems rather relevant. Yeah when the AI researchers describe intelligence the machines are intelligent.

pvtmert

2 days ago

[-]

I truly love this comment, which essentially says: LLMs are glorified calculators, with ambiguous grammar. :)

bamboozled

2 days ago

[-]

That’s really what they feel like to me, a type of word / number hybrid calculator. Like a probability machine…you attempt to give it the right input and you hopefully and non demonically get some output.

handzhiev

2 days ago

[-]

Computers are glorified calculators, yet they power most of our lives

giantrobot

1 day ago

[-]

Many computers and interfaces are deterministic. LLMs are by nature not deterministic and not even non-deterministic the same way on any two invocations given the same prompt and context. Natural language is ambiguous and for many languages very context dependent. It's not the greatest interface for a calculator from which we're expecting deterministic accurate answers.

WolframAlpha is a more impressive front end to a calculator than I've seen out of LLMs. Not only does it show me how it translated my natural-ish language query but it shows me potential alternative interpretations to my question. LLMs by the nature of how training works can't necessarily tell me why and how they interpreted my prompt. The thinking models are better but still not great.

2 days ago

[-]

> Junior engineers are intelligent enough to understand when they don't understand. They interrogate the intent and context of the tasks they are given

Eh, I wouldn't apply that as if it's a general thing. yes, the really good ones do. many will equally plough through into the mud with albeit admirable determination.

fmbb

23 hours ago

[-]

yeah so the LLMs are like a stupid junior

only they work for free and produce megabytes of stupid code per hour

21 hours ago

[-]

a junior that can instantly fix a problem spotted by a senior even if it involves touching 10 files. no meat junior can do that.

elevatortrim

1 day ago

[-]

Yes but for the purposes of these conversations, we do not need to say "good juniors" "good engineers" every time. After all, for every task, it is possible to find someone who is really bad at it, and we should not need to keep repeating we are not talking about _that_ person.

1 day ago

[-]

ah you're a mathematician I see. without loss of generality, let's assume the bottom half doesn't exist. therefore only the top half needs to be bested and the task is now impossible. qed.

neom

2 days ago

[-]

Same same human problems. Regardless of their inherent intelligence...humans perform well only when given decent context and clear specifications/data. If you place a brilliant executive into a scenario without meaningful context.... an unfamiliar board meeting where they have no idea of the company’s history, prior strategic discussions, current issues, personel dynamics...expectations..etc etc, they will struggle just as a model does surly. They may still manage something reasonably insightful, leveraging general priors, common sense, and inferential reasoning... their performance will never match their potential had they been fully informed of all context and clearly data/objectives. I think context is the primary primitive property of intelligent systems in general?

2 days ago

[-]

> they will struggle just as a model does surly

A human will struggle, but they will recognize the things they need to know, and seek out people who may have the relevant information. If asked "how are things going" they will reliably be able to say "badly, I don't have anything I need".

2 days ago

[-]

That's just additional context.

2 days ago

[-]

That the person go and get themselves. If a model could to that we wouldn't need you to drive them. Basically every human is self going that way, you don't need to go and pick them up since they got stuck in a loop of unknowns at a grocery store etc.

2 days ago

[-]

Yes, like search or operator. It's at the early stages but they are doing this. Currently counting in minutes rather than hours or days.

whattheheckheck

2 days ago

[-]

An intelligent system would know how to get that information without getting spoon fed it

Greenpants

2 days ago

[-]

I really like this analogy! Many real-world tasks that we'd like to use AI for seem infinitely more complex than can be captured in a simple question/prompt. The main challenge going forward, in my opinion, is how to let LLMs ask the right questions – query for the right information – given a task to perform. Tool use with MCPs might be a good start, though it still feels hacky to have to define custom tools for LLMs first, as opposed to how humans effectively browse and skim lots of documentation to find actually relevant bits.

2 days ago

[-]

> I think context is the primary primitive property of intelligent systems in general?

What do you mean by 'context' in this context? As written, I believe that I could knock down your claim by pointing out that there exist humans who would do catastrophically poorly at a task that other humans would excel at, even if both humans have been fully informed of all of the same context.

2 days ago

[-]

To clarify what I'm thinking here by analogy...

Imagine that someone said:

> I think wood is the primary primitive property of sawmills in general.

An obvious observation would be that it is dreadfully difficult to produce the expected product of a sawmill without tools to cut or sand or otherwise shape the wood into the desired shapes.

One might also notice that while a sawmill with no wood to work on will not produce any output, a sawmill with wood but without woodworking tools is vanishingly unlikely to produce any output... and any it does manage to produce is not going to be good enough for any real industrial purpose.

neom

2 days ago

[-]

My perspective ("context as primary primitive") was about context as the foundational prerequisite of intelligent performance. I'm discussing a scenario with the minimum conditions for any intelligent action, whether small scale or large scale. At risk of talking past each other due to nuance methinks and I'm a bit lazy to think it through properly but... I think there is something in saw vs sawmill? Like a scale thing? Either way I wasn't trying to be profound or anything, I was just saying I think context abilities is likely the first prerequisite for any minimally intelligent thing (maybe I shouldn't have used the word system in my original comment).

getnormality

2 days ago

[-]

This comparison may make sense on short-horizon tasks for which there is no possibility of preparation. Given some weeks to prepare, a good human executive will get the context, while today's best AI systems will completely fail to do so.

https://dwyer.co.za/static/claude-code-is-all-you-need.html

crazylogger

2 days ago

[-]

Today’s AI systems probably won’t excel, but they won’t completely fail either.

Basically give the LLM a computer to do all kinds of stuff against the real world, kick it off with a high level goal like “build a startup”.

The key is to instruct it to manage its own memory in its computer, and when context limit inevitably approaches, programmatically interrupt the LLM loop and instruct it to jot down everything it has for its future self.

It already kinda works today, and I believe AI systems a year from now will excel at this:

https://www.anthropic.com/research/project-vend-1

threecheese

2 days ago

[-]

Author IMO correctly recognizes that access to context needs to scale (“latent intent” which I love), but I’m not sure I’m convinced that current models will be effective even if given access to all priors needed for a complex task. The ability to discriminate valuable from extraneous context will need to scale with size of available context, it will be pulling needles from haystacks that aren’t straightforward similarity. I think we will need to steer these things.

jondwillis

2 days ago

[-]

We’re already steering, during pre-training (e.g. reasoning RLHF), as well as test-time (structured outputs, tool calls, agents…)

etler

2 days ago

[-]

I think the framing of these models are being "intelligent" is not the right way to go. They've gotten better at recall and association.

They can recall prior reasoning from text they are trained on which allows them to handle complex tasks that have been solved before, but when working on complex, novel, or nuanced tasks there is no high quality relevant training data to recall.

Intelligence has always been a fraught word to define and I don't think what LLMs do is the right attribute for defining it.

I agree with a good deal of the article but because it keeps using loaded works like "intelligent" and "smarter", it has a hard time explaining what's missing.

2 days ago

[-]

It's specific model that run for maths. GPT-5 and Gemini 2.5 still cannot compute an arbitrary length sum of whole number without a calculator. I have a proceduraly generated benchmark of basic operations, LLMs gets better at it with time, but they cant still solve basic maths or logic problems.

BTW I'm open to selling it, my email is on my hn profile.

HappMacDonald

2 days ago

[-]

Have you ever seen what these arbitrary length whole numbers look like once they are tokenized? They don't break down to one-digit-per-token, and the same long number has no guarantee of breaking down into tokens the same way every time it is encountered.

But the algorithms they teach humans in school to do long-hand arithmetic (which are liable to be the only algorithms demonstrated in the training data) require a single unique numeral for every digit.

This is the same source as the problem of counting "R"'s in "Strawberry".

2 days ago

[-]

That's was the initial thinking of anyone which I explained this, it was also my speculation, but when you look in it's reasoning where it do the mistake, it correctly extract the digits out of the input token. As I say in another comments, most of the mistakes her happen when it recopy the answer it calculated from the summation table. You can avoid tokenization issue when it extract the answer by making it output an array of digits of the answer, it will still fail at simply recopying the correct digit.

Fade_Dance

2 days ago

[-]

I recently saw someone that posted a leaked system prompt for GPT5 (and regardless of the truth of the matter since I can't confirm the authenticity of the claim, the point I'm making stands alone to some degree).

A portion of the system prompt was specifically instructing the LLM that math problems are, essentially, "special", and that there is zero tolerance for approximation or imprecision with these queries.

To some degree I get the issue here. Most queries are full of imprecision and generalization, and the same type of question may even get a different output if asked in a different context, but when it comes to math problems, we have absolutely zero tolerance for that. To us this is obvious, but when looking from the outside, it is a bit odd that we are so loose and sloppy with, well basically everything we do, but then we put certain characters in a math format, and we are hyper obsessed with ultra precision.

The actual system prompt section for this was funny though. It essentially said "you suck at math, you have a long history of sucking at math in all contexts, never attempt to do it yourself, always use the calculation tools you are provided."

HappMacDonald

22 hours ago

[-]

o/~ Mathematics keeps your intellect intact / many answers should be carefully exact

But for daily application, use a close approximation, round it off.. o/~

2 days ago

[-]

> But the algorithms they teach humans in school to do long-hand arithmetic (which are liable to be the only algorithms demonstrated in the training data) require a single unique numeral for every digit.

But humans don't see single digits, we learn to parse noisy visual data into single digits and then use those single digits to do the math.

It is much easier for these models to understand what the number is based on the tokens and parse that than it is for a visual model to do it based on an image, so getting those tokens streamed straight into its system makes its problem to solve much much simpler than what humans do. We weren't born able to read numbers, we learn that.

bt1a

2 days ago

[-]

i'd wager your benchmark problems require cumbersome arithmetic or are poorly worded / inadequately described. or, you're mislabeling them as basic math and logic (a domain within which LLMs have proven their strengths!)

i only call this out because you're selling it and don't hypothesize* on why they fail your simple problems. i suppose an easily aced bench wouldn't be very marketable

2 days ago

[-]

This is a simple sum of 2 whole number, the number are simply big.

Most of the time they make a correct summation table but fail to copy correctly the sum result into a final result. That is not a tokenisation problem (you can change the output format to make sure of it). I have a separated benchmark that test specifically this, when the input is too large, the LLMs fails to accuratly copy the correct token. I suppose the positional embedding, are not perfectly learned and it sometimes cause a mistake.

The prompt is quite short, it use structured output, and I can generate a nice graph of % of good response accross difficulity of the question (which is just the total digit count of the input numbers.

LLMs have 100% success rate on theses sum until they reach a frontier, past that their accuracy collapse at various speed depending of the model.

[1] - https://machinelearning.apple.com/research/illusion-of-think...

bwfan123

2 days ago

[-]

This is close to what the apple paper [1] also found on constraint satisfaction problems. As an example, on towers of hanoi, past a frontier, accuracy collapses.

Even when the algorithm steps are laid out precisely, they cannot be followed. Perhaps, LLMs should be trained on turing machine specs and be given a tape lol.

Constraint satisfaction and combinatorics are where the search space is exponential, and the techniques are not formalized (not enough data in training set), and remain hard for machines as seen in the Problem 6 of IMO which could not be solved by LLMs. I suspect, there is this aspect of human intelligence which is not yet captured in LLMs.

energy123

2 days ago

[-]

Have you tried greedy decoding (temp 0) in aistudio?

The temp 0.7-1.0 defaults are not designed for reconstructing context with perfect accuracy.

2 days ago

[-]

I always use the lowest temperature that I can input. But GPT-5 doesn't support a temperature setting. You'll get something like:

{ "error": { "message": "Unsupported value: 'temperature' does not support 0.0 with this model. Only the default (1) value is supported.", "type": "invalid_request_error", "param": "temperature", "code": "unsupported_value" } }

2 days ago

[-]

I can't see why that's necessary, when it can call a tool. Everyone uses a calculator. A logic problem, it can solve with reasoning, perhaps it's not the smartest but it can solve logic problems. All indications are that it will continue to become smarter.

2 days ago

[-]

Simple maths problems are simple logic problem. Here it doesn't even have to come up with a reasoning, it probably already memorised how to solve sums. Yet it fails at that, it shows it cannot solve logic problems if there are too much steps.

> All indications are that it will continue to become smarter.

I'm not disputing that, every new model score better at my benchmark, but right now, none truly "solve" one of these small logic problem.

2 days ago

[-]

If it can frame the question for the tool, it therefore has the logic (whether that was static recall or deductive).

LLM's struggle with simple maths by nature of their architecture not due to a lack of logic. Yes it struggles with logic questions too but they're not directly related here.

2 days ago

[-]

Most of the failures for theses simple logic question come from the inability to simply copy data accuratly. Logic is too abstract to be measured, but this single bench show something getting in it's way. I got another bench that show that the LLMs do basic mistakes that can be easily avoided with minimum logic and observation.

2 days ago

[-]

> LLM's struggle with simple maths by nature of their architecture not due to a lack of logic.

No, if it was good at logic it would have overcame that tiny architectural hurdle, its such a trivial process to convert tokens to numbers that it is ridiculous for you to suggest that is the reason it fails at math.

The reason it fails at math is because it fails at logic, and math is the most direct set of logic we have. It doesn't fail at converting between formats, it can convert strawberry to correct Base64 encoding, meaning it does know exactly what letters are there, it just lacks to logic to actually understand what "count letters" means.

2 days ago

[-]

It can't see that data so how can it convert it? It can only see the token input.

An analogy (probably poor) is like asking a human to see UV light. We can do so but only with tools or by removing our lense.

The fact that SOTA models (not yet publicly available) can achieve gold at IOM implies otherwise.

2 days ago

[-]

It's because math problems allow to easily check that the solution is correct, it allow to do a lot of 'search': https://yellow-apartment-148.notion.site/AI-Search-The-Bitte...

2 days ago

[-]

> GPT-5 and Gemini 2.5 still cannot compute an arbitrary length sum of whole number without a calculator.

Neither can many humans, including some very smart ones. Even those who can will usually choose to use a calculator (or spreadsheet or whatever) rather than doing the arithmetic themselves.

2 days ago

[-]

> Neither can many humans...

1) GPT-5 is advertised as "PhD-level intelligence". So, I take OpenAI (and anyone else who advertises their bots with language like this) at their word about the bot's capabilities and constrain the set of humans I use for comparison to those who also have PhD-level intelligence.

2) Any human who has been introduced to long addition will absolutely be able to compute the sum of two whole numbers of arbitrary length. You may have to provide them a sufficiently strong incentive to actually do it long-hand, but they absolutely are capable because the method is not difficult. I'm fairly certain that most adult humans [0] (regardless of whether or not they have PhD-level intelligence) find the method to be trivial, if tedious.

[0] And many human children!

2 days ago

[-]

I have a PhD, in mathematics, from a top university. If you give me, say, 100 10-digit numbers to add up and tell me to do the job in my head then I will probably get the answer wrong.

Of course, if you give me 100 10-digit numbers to add up and let me use a calculator, or pencil and paper, then I will probably get it right.

Same for, say, two 100-digit numbers. (I can probably get that one right without tools if you obligingly print them monospaced and put one of them immediately above the other where I can look at them.)

Anyway, the premise here seems to be simply false. I just gave ChatGPT and Claude (free versions of both; ChatGPT5, whatever specific model it routed my query to, and Sonnet 4) a list of 100 random 10-digit numbers to add up, with a prompt encouraging them to be careful about it but nothing beyond that (e.g., no specific strategies or tools to use), and both of them got the right total. Then I did the same with two 100-digit numbers and both of them got that right too.

https://i.imgur.com/l2elIAv.png

2 days ago

[-]

Difficulty is the amount of digits, small models struggle with 10 digits numbers, gemini and gpt-5 are very good recent models, gemini start failing before 40 digits, GPT-5 (the one by api, the online chat version is worse and I didn't tested it) can do more than 120 digits (at this point it's pointless to test for more).

2 days ago

[-]

My tests of GPT-5 were using the online chat version.

Of course, I only ran it once; I can't at all rule out the possibility that sometimes it gets it wrong. But, again, the same is true of humans.

1 day ago

[-]

The online version is way worse, it also have a router that could route it to a random model.

mathiaspoint

2 days ago

[-]

Right but most (competent) humans will reliably use a calculator. It's difficult to get these to reliably make lots of tool calls like that.

2 days ago

[-]

I do think that competent humans can solve any arbitrary sum of 2 whole number with a pen, paper and time. LLMs can't do that.

rileymat2

2 days ago

[-]

That’s interesting, you added a tool. You did not just leave it to the human alone.

2 days ago

[-]

I'm not the fellow you replied to, but I felt like stepping in.

> That’s interesting, you added a tool.

The "tool" in this case, is a memory aid. Because they are computer programs running inside a fairly-ordinary computer, the LLMs have exactly the same sort of tool available to them. I would find a claim that LLMs don't have a free MB or so of RAM to use as scratch space for long addition to be unbelievable.

2 days ago

[-]

The fact that an LLM is running inside an ordinary computer does not mean that it gets to use all the abilities of that computer. They do not have megabytes of scratch space merely because the computer has a lot of memory.

They do have something a bit like it: their "context window", the amount of input and recently-generated output they get to look at while generating the next token. Claude Sonnet 4 has 1M tokens of context, but e.g. Opus 4.1 has only 200k and I think GPT-5 has 256k. And it doesn't really behave like "scratch space" in any useful sense; e.g., the models can't modify anything once it's there.

1 day ago

[-]

Well, the GPT-5 context windows offer roughly a little more than a MB

2 days ago

[-]

LLMs already get enough working memory, they do not fail because of lack of working space.

visarga

2 days ago

[-]

Verification is the bottleneck, not ideation. LLMs can generate anything on tap, but solving any non-trivial problem requires iteration between thinking, doing and observing outcomes. The real world is too complex to be simulated by AI or humans. The scientific method works the same way, we are not exempt from having to validate our ideas. But as humans we have better feedback and access to context and we can assume risks on our own. AI has no skin and bears no responsibility.

So the missing ingredient for AI is access to environment for feedback learning. It has little to do with AI architecture or datasets. I think a huge source of such data is our human-LLM chat logs. We act as LLM eyes, hands and feed on the ground. We carry the tacit knowledge and social context. OpenAI reports billions of tasks per day, probably trillions of tokens of interactive language combining human, AI and feedback from the environment. Maybe this is how AI can inch towards learning how to solve real world problems, it is part of the loop of problem solving, and benefits from having this data for training.

bwfan123

2 days ago

[-]

> Verification is the bottleneck

In my use of cursor as a coding assistant, this is the primary problem. The code is 90% on the mark, but still buggy, and needs verification, and the feedback it gets from me is not with full fidelity as something is lost in translation.

But, a bigger issue is that AI has only some solution templates for problems that it is trained on, and being able to generate new templates is beyond its capability as that requires training on datasets of higher levels of abstration.

stephc_int13

2 days ago

[-]

This is because we tend to use a human-centric reference to evaluate the difficulty of a task : playing chess at grand master level is a lot harder than folding laundry, except that it is the opposite, and this weird bias is well known as Moravec’s Paradox.

Intelligence is the bottleneck, but not the kind of intelligence you need to solve puzzles.

mdaniel

2 days ago

[-]

For others who also hadn't heard of that: https://en.wikipedia.org/wiki/Moravec%27s_paradox

ankit219

2 days ago

[-]

The bottleneck for automation is verification. With human work, verification was fast(er) because you know where to look with certain assumptions that your upstream tasker would not have made trivial mistakes. For automation, AI needs to verify it's own work, review, and self correct to be able to automate any given work. Where this works, it will also change the abstraction layer compared to what it is today. The problem is same with every automation promise - it needs to work reliably at say 95% or 99% times and when it doesn't, there should be human contingency in terms of what to look for. Considering coding as the first example: it's already underway. AI generates the code, the test cases, and then verifies if the code works as intended. Code has a built in verification layer (both compiler and unit tests). High probablity the other domains move towards something similar too. I would also say the model needs to be intelligent to course correct when the output isn't validated[1].

Verification solves the human in the loop dependency both for AI and human tasks. All the places where we could automate in the past, there were clearly quality checks which ensured the machinery were working as expected. Same thing will be replicated with AI too.

Disclaimer: I have been working on building a universal verifier for AI tasks. The way it works is you give it a set of rules (policy) + AI output (could be human output too) and it outputs a scalar score + clause level citations. So I have been thinking about the problem space and might be over rating this. Would welcome contrarian ideas. (no, it's not llm as a judge)

[1]: Some people may call it environment based learning, but in ML terms i feel it's different. That woudl be another example of sv startups using technical terms to market themselves when they dont do what they say.

saint_yossarian

2 days ago

[-]

One thing that comes to mind: You still have to verify that the tests are exhaustive, and that the code isn't just gaming specific test scenarios.

I guess fuzzing and property-based testing could mitigate this to some extent.

ankit219

2 days ago

[-]

Yes, we are getting there. I think compiler is a bigger problem than unit tests given most verticals don't even have that. With unit tests, there would be some reward hacking but would be controlled at the model level + tests. (this is one of the reason i dont believe in transformer based llm as a judge for a verifier)

aledalgrande

1 day ago

[-]

I don't agree with the author. Where is the part about nondeterminism and hallucinations? Drawing a pretty chart doesn't make the argument true. All these benchmarks and competitions are on problems that have a _right answer_. I write most my code entirely through Claude at work and have Claude Max for personal, and I can see every day that even with the right context, it's not certain that the model is going to converge to a decent answer on complex real life issues. At least one thing I do agree on: model growth is not an exponential, like everyone thought when we were on the first leg of it, but a logarithmic.

dash2

2 days ago

[-]

I'm not sure about the assumption that science is context-free. Maths maybe, but a lot of practical science has tons of unformalized contextual knowledge that is "handed down" by practitioners. It's one reason why replication can be so hard.

OTOH, I also think a lot of science is like 1% inspiration, 99% very mundane tasks like data cleaning. So no reason the AI can't help with that. And scientists write terrible code, so the bar is low :-)

jefftitan

2 days ago

[-]

Providing more context is difficult for a number of reasons. If you do it RAG style you need to know which context is relevant. LLMs are notorious for knowing that a factor is relevant if directly asked about that factor, but not bringing it up if it's implicit. In business things like people's feelings on things, historical business dealings, relevance to trending news can all be factors. If you fine tune... well... there have been articles recently about fine tuning on specific domains causing overall misalignment. The more you fine tune, the riskier.

Paratoner

2 days ago

[-]

I'm endlessly fascinated by the way these Humans (probably) speak of their follow peers as though they are a problem to solve for.

> Longer term, we can reduce the human bottleneck by

Thank God we have ways to remove the thorn in our(?) side for good. The world can finally heal when the pursuit of fulfillment becomes inaccessible to the masses.

2 days ago

[-]

It 100% is still intelligence. GPT-5 with Thinking still can't win at tic-tac-toe.

storus

2 days ago

[-]

What if it's the desired outcome? Become more human-like (i.e. dumb) to make us feel better about ourselves? NI beats AI again!

2 days ago

[-]

> What if it's the desired outcome?

To be able to reason about the rules of a game so trivial that it has been solved for ages, so that it can figure out enough strategy to never not bring the game to a draw (if played against one who is playing to not lose), or a win (if played against someone who is leaving the bot an opening to win), as mentioned in [0] and probably a squillion other places?

Duh?

[0] <https://news.ycombinator.com/item?id=44919138>

2 days ago

[-]

Speaking of human-level capabilities, it looks like I totally failed to correctly read the section of your comment that I quoted. Shame on me.

However, I'd expect that "Appearing to fail to reason well enough to know how to always fail to lose, and -if the opportunity presents itself- win at one of the simplest games there is." is absolutely not a desired outcome for OpenAI, or any other company that's burning billions of dollars producing LLMs.

If their robot was currently reliably capable of adequate performance at Tic Tac Toe, it absolutely would be exhibiting that behavior.

dismalaf

2 days ago

[-]

Tic-tac-toe is solved and a draw can be forced 100% of the time...

2 days ago

[-]

That's exactly why it's so crazy that GPT-5 with Thinking still loses...

dismalaf

2 days ago

[-]

Ah, your first comment said "can't win". Which is different than "always loses".

2 days ago

[-]

Ah okay, well it will still lose some of the time, which is surprising. And it will lose in surprising way, e.g., thinking for 14 seconds and then making an extremely basic mistake like not seeing it already have two on a row and could just win.

HappMacDonald

2 days ago

[-]

.. and you can "program" a neural network — so simple it can be implemented by boxes full of marbles and simple rules about how to interact with the boxes — to learn by playing tictactoe until it always plays perfect games. This is frequently chosen as a lesson in how neural network training even works.

But I have a different challenge for you: train a human to play tictactoe, but never allow them to see the game visually, even in examples. You have to train them to play only by spoken words.

Point being that tictactoe is a visual game and when you're only teaching a model to learn from the vast sea of stream-of-tokens (similar to stream-of-phonemes) language, visual games like this aren't going to be well covered in the training set, nor is it going to be easy to generalize to playing them.

2 days ago

[-]

Well whatever your story is, I know with near certainty that no amount of scaffolding is going to get you from an LLM that can't figure out tic-tac-toe (but will confidently make bad moves) to something that can replace a human in an economically important job.

bwfan123

2 days ago

[-]

llm maximalists' apologies:

- but tokens are not letters - but humans fail too - just wait, we are on an S curve to AGI - but your prompt was incorrect - but I tried and here it works

Meanwhile, their claims:

- LLMs are performing at PhD levels. - AGI is around the corner - humanity will be wiped out - situational awareness report

fifteen1506

2 days ago

[-]

As a human, I'd also appreciate the specifications, documentation and meetings were not inaccessible to me.

darepublic

2 days ago

[-]

Model capability is absolutely the main constraint.