My main critique is that I don't think there's evidence that this issue would persist after continuing to scale models to be larger and doing more RL. With a harness like what coding agents do these days and with sufficient tool use, I bet models could go much further on that reasoning benchmark. Otherwise, if the reasoning problem were entirely done within a single context window, it's expected that a complex enough reasoning problem would be too difficult for the model to solve.
This is the same as the critiques of the LLM paper by Apple, where they showed that LLMs fail to solve the Tower of Hanoi problem beyond a certain number of disks. The test was to see how well these models can reason through a long task. People online argued that the models could solve that problem if they had access to a coding environment. Again, the test was to check reasoning capability, not whether the model knew how to code an algorithm to solve the problem.
If model performance degrades a lot after a number of reasoning steps, it's good to know where the limits are. Whether the model had access to tools or not is orthogonal to this problem.
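To make the "long task" part concrete: the textbook recursive solution needs 2^n - 1 moves for n disks, so the move list a model has to write out without a single slip roughly doubles with every extra disk. A minimal sketch (the function name and peg labels are mine, not from the Apple paper):

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the list of (from_peg, to_peg) moves that solves n disks."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest disk, then restack.
    return (
        hanoi_moves(n - 1, source, spare, target)
        + [(source, target)]
        + hanoi_moves(n - 1, spare, target, source)
    )


for disks in (3, 10, 15):
    # The optimal solution always has 2**disks - 1 moves.
    print(disks, len(hanoi_moves(disks)))
```

The algorithm itself is a few lines; what grows is the number of steps that must all be executed correctly, which is the thing the test was probing.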
And how much larger do we need to make the models? 2x? 3x? 10x? 100x? How large do they need to get before scaling-up somehow solves everything?
Because: 2x larger, means 2x more memory and compute required. Double the cost or half the capacity. Would people still pay for this tech if it doubles in price? Bear in mind, much of it is already running at a loss even now.
And what if 2x isn't good enough? Would anyone pay for a 10x larger model? Can we even realistically run such models as anything other than a very expensive PoC and for a very short time? And who's to say that even 10x will finally solve things? What if we need 40x? Or 100x?
Oh, and of course: Larger models also require more data to train them on. And while the Internet is huge, it's still finite. And when things grow geometrically, even `sizeof(internet)` eventually runs out ... and, in fact, may have done so already [1] [2]
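As a back-of-the-envelope illustration of how fast geometric growth hits a finite ceiling, here is a small sketch. The function name and the numbers are made-up placeholders, not measurements of any real dataset or of the Internet:

```python
def doublings_until_exhausted(current_need, supply):
    """Count how many times current_need can double and still fit within supply."""
    doublings = 0
    while current_need * 2 <= supply:
        current_need *= 2
        doublings += 1
    return doublings


# Hypothetical units: today's training set is 1, the whole pool is 1000.
print(doublings_until_exhausted(1, 1000))
```

Even with a pool a thousand times larger than what is used today, fewer than ten doublings are available under these assumed numbers.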
What if we actually discover that scaling up doesn't even work at all, because of diminishing returns? Oh wait, looks like we did that already: [3]
[1]: https://observer.com/2024/12/openai-cofounder-ilya-sutskever...
[2]: https://biztechweekly.com/ai-training-data-crisis-how-synthe...
[3]: https://garymarcus.substack.com/p/confirmed-llms-have-indeed...
The best solution in the meantime is giving the LLM a harness that allows tool use, like what coding agents have. I suspect current models are fully capable of solving artificial reasoning problems of arbitrary complexity here, provided that they’re used in the context of a coding agent tool.
Then the first step would be to prove that this works WITHOUT needing to burn through the trillions to do so.
The problem, I find, is that they then don't stop or say they don't know (unless explicitly prompted to do so); they just make stuff up and express it with just as much confidence.
I like to think that AI are the great apes of the digital world.
They don't have the dexterity to really sign properly
At the very least, more than one researcher was involved and more than one ape was alleged to have learned ASL. There is a better discussion about what our threshold is for speech, along with our threshold for saying that research is fraud vs. mistaken, but we don’t fix sloppiness by engaging in more of it.
Weirder still was this lawsuit against Patterson:
> The lawsuit alleged that in response to signing from Koko, Patterson pressured Keller and Alperin (two of the female staff) to flash the ape. "Oh, yes, Koko, Nancy has nipples. Nancy can show you her nipples," Patterson reportedly said on one occasion. And on another: "Koko, you see my nipples all the time. You are probably bored with my nipples. You need to see new nipples. I will turn my back so Kendra can show you her nipples."[47] Shortly thereafter, a third woman filed suit, alleging that upon being first introduced to Koko, Patterson told her that Koko was communicating that she wanted to see the woman's nipples
There was a bonobo named Kanzi who learned hundreds of lexigrams. The main criticism here seems to be that while Kanzi truly did know the symbol for “strawberry”, he “used the symbol for ‘strawberry’ as the name for the object, as a request to go where the strawberries are, as a request to eat some strawberries”. So no object-verb sentences, and so no grammar, which means no true language according to linguists.
This seems like a rather awkward way of putting it. They may just lack conceptualization or abstraction, making the above statement meaningless.
Which shouldn't come as a surprise, considering that this is, at the core of things, what language models do: Generate sequences that are statistically likely according to their training data.
Using this image - https://www.whimsicalwidgets.com/wp-content/uploads/2023/07/... and the prompt: "Generate a video demonstrating what will happen when a ball rolls down the top left ramp in this scene."
You'll see it struggles - https://streamable.com/5doxh2 , which is often the case with video gen. You have to describe carefully and orchestrate natural feeling motion and interactions.
You're welcome to try with any other models but I suspect very similar results.
But this is unlikely, because they still can fall over pretty badly on things that are definitely in the training set, and still can have success with things that definitely are not in the training set.
The hard problem then is not to eliminate non-deterministic behavior, but to find a way to control it so that it produces what you want.
Heisenberg would disagree.
"I wasn’t able to finish; no changes were shipped."
And it's not the first time.
  Rather than check in something half-broken, I’m pausing here. Let me know how you want to
  proceed—if you can land the upstream refactor (or share a stable snapshot of the tests/module),
  I can pick this up again and finish the review fixes in one go."
> some even claiming they are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show existing benchmarks actually have limited complexity
Can someone ELI5 what the definitions of reasoning and complexity are here?
I see they seem to focus on graph problems and representing problems as graph problems. But I didn't completely read the paper or understand it in depth. I skimmed some parts that seem to address this question (e.g. section 5 and the Introduction), but maybe there are simpler definitions that elude me.
Surely they don't mean "computational complexity"?
And what exactly is "reasoning"?
I'm aware of philosophical logic and strict logic that can be applied to natural language arguments.
But have we already agreed on a universal scale that grades answers to questions about the physical world? Or is this about mathematical reasoning?
Mixing all of this together always irks me when it comes to these AI "benchmarks". But apparently people see value in these?
I know my question isn't new.
To me it seems that when we leave the mathematical realms, it quickly becomes fuzzy what correct "reasoning" should be.
People can be convincing and avoid obvious logical fallacies, and still reach wrong conclusions... or conclusions that run counter to assumed goals.
But yes, I assume you mean they abort their loop after a while, which they do.
This whole idea of a "reasoning benchmark" doesn't sit well with me. It seems still not well-defined to me.
Maybe it's just bias I have or my own lack of intelligence, but it seems to me that using language models for "reasoning" is still more or less a gimmick and convenience feature (to automate re-prompts, clarifications etc, as far as possible).
But reading this pop-sci article from summer 2022, it seems like this definition problem hasn't changed very much since then.
Although it's about AI progress before ChatGPT and it doesn't even mention the GPT base models. Sure, some of the tasks mentioned in the article seem dated today.
But IMO, there is still no AI model that can be trusted to, for example, accurately summarize a Wikipedia article.
Not all humans can do that either, sure. But humans are better at knowing what they don't know, and deciding what other humans can be trusted. And of course, none of this is an arithmetic or calculation task.
https://www.science.org/content/article/computers-ace-iq-tes...
I feel like if LLMs "knew" when they're out of their depth, they could be much more useful. The question is whether knowing when to stop can be meaningfully learned from examples with RL. From all we've seen, the hallucination problem and this stopping problem both boil down to the same issue: you could teach the model to say "I don't know", but if that's part of the training dataset it might just spit out "I don't know" to random questions because it's a likely response in the realm of possible responses, rather than because it actually doesn't know.
SocratesAI is still unsolved, and LLMs are probably not the path to knowing that you know nothing.
I used to think this, but no longer sure.
Large-scale tasks just grind to a halt with more modern LLMs because of this perception of impassable complexity.
And it's not that they need extensive planning, the LLM knows what needs to be done (it'll even tell you!), it's just more work than will fit within a "session" (arbitrary) and so it would rather refuse than get started.
So you're now looking at TODOs, and hierarchical plans, and all this unnecessary pre-work even when the task scales horizontally very well (if it just jumped into it).
I don’t find all these claims that models are somehow worse than humans in such areas convincing. Yes, they’re worse in some respects. But when you’re talking about things related to failures and accuracy, they’re mostly superhuman.
For example, how many humans can write hundreds of lines of code (in seconds, mind you) and regularly not have any syntax errors or bugs?
Ez, just use codegen.
Also the second part (not having bugs) is unlikely to be true for the LLM generated code, whereas traditional codegen will actually generate code with pretty much no bugs.
I have too, and I sense that this is something that has been engineered in rather than coming up naturally. I like it very much and they should do it a lot more often. They're allergic to "I can't figure this out" but hearing "I can't figure this out" gives me the alert to help it over the hump.
> But when you’re talking about things related to failures and accuracy, they’re mostly superhuman.
Only if you consider speed to failure and inaccuracy. They're very much subhuman in output, but you can make them retry a lot in a short time, and refine what you're asking them each time to avoid the mistakes they're repeatedly making. But that's you doing the work.
Preferably like not at the start and best not to do more than 40KB at a time at all.
That's how I learned to deal with nftables' 120KB parser_bison.y file: by breaking it up into clean sections.
All of a sudden, a fully-deterministic LL(1) full semantic pathway of nftables' CLI syntax appeared before my very eyes (and I spent hours validating it): 100%, and test generators can now permute crazy test cases with relative ease.
Cue Joe Walsh's "Life's Been Good".
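For anyone wanting to try the same approach, here is a rough sketch of the size-capped splitting, assuming you only cut at line boundaries. The function name is hypothetical, and the 40KB default just mirrors the figure above; finding the actual "clean sections" of a grammar file still takes manual judgment that this does not attempt:

```python
def split_into_chunks(text, max_bytes=40 * 1024):
    """Split text at line boundaries into chunks of at most max_bytes each.

    A single line longer than max_bytes is kept whole in its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        # Start a new chunk if adding this line would exceed the cap.
        if current and size + line_size > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append("".join(current))
    return chunks


sample = ("x" * 99 + "\n") * 1000  # 100,000 bytes of dummy input
print([len(chunk) for chunk in split_into_chunks(sample)])
```

Joining the chunks back together reproduces the input exactly, so nothing is lost by feeding the pieces one at a time.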
The CogniLoad benchmark does this as well (in addition to scaling reasoning length and distractor ratio). Requiring the LLM to reason purely based on what is in the context (i.e. not based on the information it was pretrained on), it finds that reasoning performance decreases significantly as problems get harder (i.e. require the LLM to hold more information in its hidden state simultaneously), but the bigger challenge for them is length.
https://arxiv.org/abs/2509.18458
Disclaimer: I'm the primary author of CogniLoad so feel free to ask me any questions.
So it's simpler than "reasoning". This is not necessarily a bad thing as it boils down the reasoning to a simpler, more controlled sub problem.
Nobody knows.
Moreover, nobody talks about that because it's boring and non-polarizing. Instead, supposedly smart people post stupid comments that prevent anyone from understanding this paper is worthless.
The paper is worthless because it has a click-bait title. Blog posts get voted down for that, why not this?
The implicit claim is worthless. Failure to navigate a synthetic graph == failure to solve real world problems. False.
Absolutely no connection to real world examples. Just losing the model in endless graphs.
This statement is the dictionary definition of attacking a strawman.
Every new model that is sold to us, is sold on the basis that it performs better than the old model on synthetic benchmarks. This paper presents a different benchmark that those same LLMs perform much worse on.
You can certainly criticize the methodology if the authors have erred in some way, but I'm not sure why it's hard to understand the relevance of the topic itself. If benchmarks are so worthless then go tell that to the LLM companies.
I also believe the problem is we don't know what we want: https://news.ycombinator.com/item?id=45509015
If we could make LLMs apply a modest set of logic rules consistently, it would be a win.
When I prompt an RLM, I can see it spits out reasoning steps. But I don't find that evidence RLMs are capable of reasoning.
There's no evidence to be had when we only know the inputs and outputs of a black box.
Don't ask how it works cuz its called a "Mind reading language model" duh.
Imo the paper itself should have touched on the lack of papers discussing what's in the black box that makes them Reasoning LMs. It does mention some tree algorithm supposedly key to reasoning capabilities.
I'm by no means attacking the paper, as its intent is to demonstrate the failure to solve even simple-to-formulate, complex puzzles.
I was not making a point; I was genuinely asking in case someone knows of papers I could read that make claims, with evidence, that those RLMs actually reason, and how.
Pattern matching is a component of reason. Not === reason.
I actually believe it is technically possible, but is going to be very hard.
ChatGPT knows WebPPL really well for example.
Take this statement for example:
>ChatGPT knows WebPPL really well
What formal language can express this statement? What will the text be parsed into? Which transformations can you use to produce other truthful (and interesting) statements from it? Is this flexible enough to capture everything that can be expressed in English?
The closest that comes to mind is Prolog, but it doesn’t really come close.
Up next: "Lawn mowers are good at cutting grass until they aren't"
We (developers) do this because it's what we've always done with our own code. Everyone's encountered a bug that they just couldn't figure out. So they search the Internet, try different implementations of the same thing, etc but nothing works. Usually, we finally solve such problems when we take a step back and look at it with a different lens.
For example, just the other day—after spending far too long trying to get something working—I realized, "Fuck it! The users don't really need this feature." :thumbsup:
The extent to which this is true is a rough measure of how derivative your work is, no?
I was interviewed about this recently, and mentioned the great work of a professor of CS and Law who has been building the foundations for this approach. My own article about it was recently un-linked due to a Notion mishap (but available if anyone is interested - I have to publish it again)
https://www.forbes.com/sites/hessiejones/2025/09/30/llms-are...
They simulate reasoning through matching patterns.
Also bottom 10% feels like a bad comparison, median human would be better. And unlike "specialized" things like programming, game playing is something almost all of us have done.
It is. It's very common for socially apt people to bullshit through things they don't know, or outright want to hide.
No, it’s not - you don’t even need to be literate to count symbols - but also consider the complexity of the second task and how many skills each requires: unlike counting letters, lying isn’t simple confabulation and requires a theory of mind and some kind of goal. A child who lies to avoid trouble is doing that because they have enough of a world model to know they are going to get in trouble for something even if they haven’t worked out yet that this is unlikely to work.
The Pirahã language doesn't even have numerals - that's an extreme case, but there are quite a few languages where people stop counting beyond a certain small number and just say "a lot". The same people, though, don't have issues lying to one another. Let that sink in for a while - fully grown-ass adults, fully capable of functioning in their society, not capable of counting one-two-three because the concept is beyond them.
What I'm trying to say is that all of those "requires theory of mind" statements are probably true but completely irrelevant, because humans (and LLMs) have "hardware acceleration" of whatever it takes to lie, whereas counting is an abstract idea that requires using the brain in a way it didn't evolve to be used. Similarly, LLMs cannot count if they aren't connected to a math engine - not because they're stupid, but because counting is really difficult.
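For contrast, this is roughly what the "math engine" side of that looks like: a deterministic counting tool that a harness could expose to the model. The function name is hypothetical and not any particular vendor's tool API:

```python
from collections import Counter


def count_symbol(text, symbol):
    """Return how many times a single character occurs in text."""
    # Counter returns 0 for characters that never appear.
    return Counter(text)[symbol]


print(count_symbol("strawberry", "r"))
```

Trivial for a computer, which is exactly the point: the difficulty sits in doing it without such a tool.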
This was the obvious outcome of the study (don't get me wrong, obvious outcomes are still worth having research on).
"LRMs" *are* just LLMs. There's no such thing as a reasoning model, it's just having an LLM write a better prompt than the human would and then sending it to the LLM again.
Despite what Amodei and Altman want Wall Street to believe, they did not suddenly unlock reasoning capabilities in LLMs by essentially just running two different prompts in sequence to answer the user's question.
The truly amazing thing is that reasoning models show ANY improvement at all compared to non-reasoning models, when they're the same exact thing.
If you mean solving logic problems, then reasoning LLMs seem to pass that bar, as they do very well in programming and maths competitions. Reasoning LLMs can also complete problems like multiplying large numbers, which requires applying some sort of algorithm where the results cannot just be memorised. They also do this much better than standard pre-trained LLMs with no RL.
So, that makes me come back to this question of what definition of reasoning do people use that reasoning models do not meet? They're not perfect, obviously, but that is not a requirement of reasoning if you agree that humans can reason. We make mistakes as well, and we also suffer under higher complexity. Perhaps they are less reliable in knowing when they have made mistakes or not than trained humans, but I wouldn't personally include reliability in my definition for reasoning (just look at how often humans make mistakes in tests).
I am yet to see any serious, reasoned arguments that suggest why the amazing achievements of reasoning LLMs in maths and programming competitions, on novel problems, do not count as "real reasoning". It seems much more that people just don't like the idea of LLMs reasoning, and so reject the idea without giving an actual reason themselves, which seems somewhat ironic to me.
In that I guess the model does not need to be the most reasonable interpreter of vague and poorly formulated user inputs, but I think it needs to improve at least a bit for these models to become useful general appliances and not just test-scoring automatons.
The key differentiator here is that tests generally _are made to be unambiguously scoreable_. Real world problems are often more vague from the point of view of optimal outcome.
Although, I would argue that this is not reasoning at all, but rather "common sense" or the ability to have a broader perspective or think of the future. These are tasks that come with experience. That is why these do not seem like reasoning tasks to me, but rather soft skills that LLMs lack. In my mind these are pretty separate concerns to whether LLMs can logically step through problems or apply algorithms, which is what I would call reasoning.
They say LLMs are PhD-level. Despite billions of dollars, PhD-LLMs sure are not contributing a lot to solving known problems. Except, of course, for a few limited marketing stunts.
You can give a human PhD an _unsolved problem_ in a field adjacent to their expertise and expect some reasonable resolution. LLM PhDs solve only known problems.
That said humans can also be really bad problem solvers.
If you don't care about solving the problem and only want to create paperwork for bureaucracy, I guess you don't care either way ("My team's on it!"), but companies that don't go out of business generally recognize pretty soon a lack of outcomes where it matters.
Terry Tao would disagree: https://mathstodon.xyz/@tao/114508029896631083
https://deepmind.google/discover/blog/alphaevolve-a-gemini-p...
If this is the maximum an AGI-PhD-LRM can do, that'll be disappointing compared to the investments. Curious to see what all this will become in a few years.
I sometimes do on problems where I have particular insight, but I mostly find it is far more effective to give it test cases and give it instructions on how to approach a task, and then let it iterate with little to no oversight.
I'm letting Claude Code run for longer and longer with --dangerously-skip-permissions, to the point I'm pondering rigging up something to just keep feeding it "continue" and run it in parallel on multiple problems.
Because at least when you have a good way of measuring success, it works.
I am just saying that LLMs have demonstrated they can reason, at least a little bit. Whereas it seems other people are saying that LLM reasoning is flawed, which does not negate the fact that they can reason, at least some of the time.
Maybe generalisation is one area where LLMs' reasoning is weakest though. They can reach near-elite performance on nicely boxed up competition math problems, but their performance drops dramatically on real-world problems where things aren't so neat. We see similar problems in programming as well. I'd argue the progress on this has been promising, but other people would probably vehemently disagree with that. Time will tell.
A lot of people appear to be - often not consciously or intentionally - setting the bar for "reasoning" at a level many or most people would not meet.
Sometimes that is just a reaction to wanting an LLM that produces results good enough for their own level. Sometimes it reveals a view of fellow humans that would be quite elitist if stated outright. Sometimes it's a kneejerk attempt at setting the bar at a point that would justify a claim that LLMs aren't reasoning.
Whatever the reason, it's a massive pet peeve of mine that it is rarely made explicit in these conversations, and it makes a lot of these conversations pointless because people keep talking past each other.
For my part a lot of these models often clearly reason by my standard, even if poorly. People also often reason poorly, even when they demonstrably attempt to reason step by step. Either because they have motivations to skip over uncomfortable steps, or because they don't know how to do it right. But we still would rarely claim they are not capable of reasoning.
I wish more evaluations of LLMs would establish a human baseline to test them against, for much the same reason. It would be illuminating in terms of actually telling us more about how LLMs match up to humans in different areas.
The real question is how useful this tool is and if this is as transformative as investors expect. Understanding its limits is crucial.
The models can learn reasoning rules, but they are not able to apply them consistently or recognize the rules they have learned are inconsistent. (See also my other comment which references comments I made earlier.)
And I think they can't without a tradeoff, as I commented https://news.ycombinator.com/item?id=45717855 ; the consistency requires a certain level of close-mindedness.
I would argue that humans are not 100% reliable in their reasoning, and yet we still claim that they can reason. So, even though I would agree that the reasoning of LLMs is much less reliable, careful, and thoughtful than smart humans, that does not mean that they are not reasoning. Rather, it means that their reasoning is more unreliable and less well-applied than people. But they are still performing reasoning tasks (even if their application of reasoning can be flawed).
Maybe the problem is that I am holding out a minimum bar for LLMs to jump to count as reasoning (demonstrated application of logical algorithms to solve novel problems in any domain), whereas other people are holding the bar higher (consistent and logical application of rules in all/most domains).
You can argue that a damaged toaster is still a toaster, conceptually. But if it doesn't work, then it's useless. As it stands, models lack the ability to reason because they can fail to reason and you can't do anything about it. In the case of humans, it's valid to say they can reason, because humans can at least fix themselves; models can't.
The best example of this is Sean Heelan, who used o3 to find a real security vulnerability in the Linux kernel: https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...
Sean Heelan ran o3 100 times, and it found a known vulnerability in 8% of runs. For a security audit, that is immensely useful, since an expert can spend the time to look at the results from a dozen runs and quickly decide if there is anything real. Even more remarkably though, this same testing exposed a zero-day that they were not even looking for. That is pretty incredible for a system that makes mistakes.
This is why LLM reasoning absolutely does not need to be perfect to be useful. Human reasoning is inherently flawed as well, and yet through systems like peer review and reproducing results, we can still make tremendous progress over time. It is just about figuring out systems of verification and review so that we don't need to trust any LLM output blindly. That said, greater reliability would be massively beneficial to how easy it is to get good results from LLMs. But it's not required.
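To put a rough number on the "dozen runs" point: if you assume each run is independent and hits with the 8% rate quoted above (an assumption; repeated runs on the same prompt may well be correlated), the chance that at least one of k runs finds the bug is 1 - (1 - 0.08)^k. A small sketch with a hypothetical function name:

```python
def p_at_least_one_hit(per_run_rate, runs):
    """Probability that at least one of `runs` independent attempts succeeds."""
    return 1.0 - (1.0 - per_run_rate) ** runs


for runs in (1, 12, 100):
    print(runs, round(p_at_least_one_hit(0.08, runs), 3))
```

Under that independence assumption, a low per-run hit rate still adds up quickly once a reviewer is willing to skim a batch of outputs.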
It could be that this is just the result of good stochastic parroting and not reasoning. Both of those niches are narrow, with large amounts of training data (e.g. corps buying solutions from leetcode and training LLMs on them).
On the other hand, we see that LLMs fail in more complex environments, e.g. when asked to build some new feature in the Postgres database.
It's because they do more compute. The more tokens "spent" the better the accuracy. Same reason they spit out a paragraph of text instead of just giving a straight answer in non-reasoning mode.
Which isn't particularly amazing, as # of tokens generated is basically a synonym in this case for computation.
We spend more computation, we tend towards better answers.
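A rough way to see the tokens-as-compute point, using the common rule of thumb that a dense transformer spends about 2 FLOPs per parameter per generated token (this ignores attention over the growing context, so treat it as a lower bound). The model size and token counts below are made-up placeholders:

```python
def approx_generation_flops(n_params, n_tokens):
    """Rule-of-thumb cost: about 2 FLOPs per parameter per generated token."""
    return 2 * n_params * n_tokens


params = 70_000_000_000  # hypothetical 70B-parameter model
direct = approx_generation_flops(params, 50)       # short direct answer
reasoned = approx_generation_flops(params, 5_000)  # long reasoning trace
print(reasoned // direct)
```

Under this approximation the cost scales linearly with the number of tokens generated, so a trace 100 times longer is about 100 times the computation for the same question.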
A simplified way of thinking about it is: pretraining gives LLMs useful features, SFT arranges them into useful configurations, RLVR glues them together and makes them work together well, especially in long reasoning traces. Makes sense to combine it all in practice.
How much pretraining gives an LLM depends on the scale of that LLM, among other things. But raw scale is bounded by the hardware capabilities and the economics - of training and especially of inference.
Scale is still quite desirable - GPT-4.5 scale models are going to become the norm for high end LLMs quite soon.
I'm doubtful you'd have useful LLMs today if labs hadn't scaled in post-training.
Why is that amazing? It seems expected. Use a tool differently, get different results.
(Slams the door angrily)
(stomps out angrily)
(touches the grass angrily)
That said, the input space of supported problems is quite large and you can configure the problem parameters quite flexibly.
I guess the issue is that what the model _actually_ provides you is this idiot savant who has pre-memorized everything without offering a clear index that would disambiguate well-supported problems from "too difficult" (i.e. novel) ones
(tips fedora)
(does something)