I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?
Or does this simply show that esolangs are hard to reason in by design? A more honest approach would use a "real", but relatively unpopular, language. Make them use CoffeeScript or Ada or PL/I or Odin or that other systems programming language that that very opinionated guy is implementing on top of QBE.
AI's capacity for memorisation is unrivaled, I find it mind blowing that you can download a tiny ~4gb model and it will have vastly more general knowledge than an average human (considering that the human is more likely to be wrong if you ask it trivia about e.g. the spanish civil war).
But the average human still has actual reasoning capabilities, which is still (I think?) a debated point with AI.
It's not, people misread an Apple study and it became a meme. It lost currency as a meme because it is impossible to use a model in 2026 and come away with the idea it cannot reason, for any reasonable definition of the word reason (pun intended). Most of the debate from there is just people misreading each-other and imagining incentive structures at play. (ex. I am not claiming they are never stupid, ex. the car wash dilemma, but I am claiming its gee-whiz enough at enough that it's become de facto beyond honest debate)
> AI's capacity for memorisation is unrivaled,
Much like "it just memorizes training data", "memorization" has a kernel of truth to it. Memorizing does not imply "it has 100% "learned", for some definition of learned similar to "guaranteed 100% reproducible translatable computation", brainfuck to the point it's just as easy as writing any other program, and thus if it hasn't, it cannot reason"
At the end of the day these are just mathematical objects. And while it's not discourse-contributing, the mundane truth is, those matmuls born from boring curve-fitting at scale know/memorized/can reason about/can parrot/have adjusted the float32s in such a way that it produces C a lot better than Brainfuck. Much like us. But they're just matmuls curve-fitting at scale.
The singularity crowd has never listened to reason and never will.
It doesn't even prove the models do that. The RLVR environments being mostly Python isn't "training data memorization". That's just the kind of dumb thing people say to sound savvy.
Esolang vs mainstream paradigm.
Popular vs scarce training data.
So you'd want to control for training data (e.g. brainfuck vs Odin?)
And ideally you'd control by getting it down to 0, i.e. inventing new programming languages with various properties and testing the LLMs on those.
I think that would be a useful benchmark for other reasons. It would measure the LLMs' ability to "learn" on the spot. From what I understand, this remains an underdeveloped area of their intelligence. (And may not be solvable with current architectures.)
Setting aside whether this benchmark is meaningful or not - the argument you're making is faulty. There are indeed humans who can write complete programs in Brainfuck and these other esolangs. The fact that you personally can't is not logically relevant.
Palindrome:
https://chatgpt.com/s/t_69bc8d8c116c8191a339a33f0fbcc935
This is a noticeable improvement from a year ago.
I wish it would use Return instead of Quit but that’s a stochastic parrot for you.
Before looking at the results my guess was that scores would be higher for Unlambda than any of the others, because humans that learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.
But the model that did the best, Qwen-235B, got virtually every problem wrong.
It's bad enough that I've considered writing some sort of cursed bash->posh translation layer
Yet it has no issues at all implementing and then writing slopjective-c 3.0
It’s the first model where I didn’t have to ask, repeatedly, that it use Powershell 5, and never use emojis or other invalid characters, like Gemini and those non-ASCII spaces.
If the llm has “skills” for that language, it will definitely increase accuracy.
Would love to see how the benchmarks results change if the esoteric languages are changed a bit to make them have 1-token keywords only.
Reasoning is hard, reasoning about colors while wearing glasses that obfuscate the real colors... even harder... but not the core issue if your brain not wired correctly to reason.
I suspect the way out of this is to separate knowledge from reason: to train reasoning with zero knowledge and zero language... and then to train language on top of a pre-trained-for-reasoning model.
Current frontier models are really good at generating boiler plate, and really good at summarizing, but really lack the ability to actually comprehend and reason about what’s going on. I think this sort of test really highlights that. And is a nice reminder that, the LLMs, are only as good as their training data.
When an LLM or some other kind of model does start to score well on tests like this, I’d expect to see better them discovering new results, solutions, and approaches to questions/problems. Compared to how they work now, where they generally only seem to uncover answers that have been obfuscated but are present.