Caveman: Why use many token when few token do trick
870 points
by tosh
2 days ago
| 93 comments
| github.com
| HN
JBrussee-2
1 day ago
[-]
Author here. A few people are arguing against a stronger claim than the repo is meant to make. Also, this was very much intended as a joke, not research-level commentary.

This skill is not intended to reduce hidden reasoning / thinking tokens. Anthropic’s own docs suggest more thinking budget can improve performance, so I would not claim otherwise.

What it targets is the visible completion: less preamble, less filler, less polished-but-nonessential text. Since only the post-completion output is “cavemanned”, the code itself isn’t affected by the skill at all :)

Also surprised to hear so little faith in RL. I’m quite sure Anthropic’s models have been so heavily tuned to be coding agents that you cannot “force” a model to degrade immensely.

The fair criticism is that my “~75%” README number is from preliminary testing, not a rigorous benchmark. That should be phrased more carefully, and I’m working on a proper eval now.

Also yes, skills are not free: Anthropic notes they consume context when loaded, even if only skill metadata is preloaded initially.

So the real eval is end-to-end:

- total input tokens
- total output tokens
- latency
- quality/task success
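A minimal sketch of such an end-to-end comparison (everything here — `run_task`, the result fields — is a hypothetical stand-in for whatever agent harness you actually use):

```python
import time

def evaluate(run_task, tasks):
    # `run_task` is a hypothetical stand-in for one full agent session;
    # it must return token counts and a success flag for a single task.
    totals = {"input_tokens": 0, "output_tokens": 0, "latency_s": 0.0, "successes": 0}
    for task in tasks:
        start = time.monotonic()
        result = run_task(task)
        totals["latency_s"] += time.monotonic() - start
        totals["input_tokens"] += result["input_tokens"]
        totals["output_tokens"] += result["output_tokens"]
        totals["successes"] += int(result["success"])
    totals["success_rate"] = totals["successes"] / len(tasks)
    return totals
```

Run it once with the skill enabled and once without, on the same task set, and compare the four totals.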

There is actual research suggesting concise prompting can reduce response length substantially without always wrecking quality, though it is task-dependent and can hurt in some domains. (https://arxiv.org/html/2401.05618v3)

So my current position is: interesting idea, narrower claim than some people think, needs benchmarks, and the README should be more precise until those exist.

reply
Chance-Device
1 day ago
[-]
Sounds reasonable to me. I think this thread is just the way online discourse tends to go. Actually it’s probably better than average, but still sometimes disappointing.
reply
trueno
1 day ago
[-]
i played with this a bit the other night and ironically i think everyone should give it a shot as an alternative mode they might sometimes switch into. but not to save tokens, but instead to.. see things in a different light.

its kind of great for the "eli5", not because it's any more right or wrong, but sometimes presenting it in caveman presents something to me in a way that's almost like... really clear and simple. it feels like it cuts through bullshit just a smidge. seeing something framed by a caveman in a couple of occasions peeled back a layer i didnt see before.

it, for whatever reason, is useful somehow to me, the human. maybe seeing it laid out to you in caveman bulletpoints gives you this weird brevity that processes a little differently. if you layer in caveman talk about caves, tribes, etc it has sort of a primal survivalship way of framing things, which can oddly enough help me process an understanding.

plus it makes me laugh. which keeps me in a good mood.

reply
7granddad
1 day ago
[-]
Interesting point! Based on what you said, in a way caveman does save your human brain tokens. Grammar rules evolve in a particular environment to reduce ambiguities, and I think we are all familiar enough with caveman for it to make sense to all of us as a common register. For example, word order matters for semantics in modern English, so "The dog bit the grandma" and "Dog bit grandma" mean the same. Coming from languages where cases carry semantics (like German), word order alone does not resolve ambiguity. Articles exist in English due to its Germanic roots.
reply
sellmesoap
1 day ago
[-]
Now I want to try programming in pigeon English
reply
adsteel_
1 day ago
[-]
A pidgin is just a simplified form of language that hasn't evolved into its own new language yet. There are many English pidgins.
reply
fireflash38
1 day ago
[-]
It's much easier to talk about how something is deficient/untested than to do the testing yourself.

The same site that complains so much about replication crises in science too...

reply
dataviz1000
1 day ago
[-]
If you want to benchmark, consider this https://github.com/adam-s/testing-claude-agent
reply
bdbdbdb
1 day ago
[-]
Translation:

It joke. No yell at me. It kind of work?

reply
bbeonx
1 day ago
[-]
Thank. Too much word, me try read but no more tokens.
reply
sgbeal
1 day ago
[-]
> There is actual research suggesting concise prompting can reduce response length substantially without always wrecking quality,

Anecdote: i discussed that with an LLM once and it explained to me that LLMs tend to respond to terse questions with terse answers because that's what humans (i.e. their training data) tend to do. Similarly, it explained to me that polite requests tend to lead to LLM responses with _more_ information than a response strictly requires because (again) that's what their training data suggests is correct (i.e. because that's how humans tend to respond).

TL;DR: how they are asked questions influences how they respond, even if the facts of the differing responses don't materially differ.

(Edit: Seriously, i do not understand the continued down-voting of completely topical responses. It's gotten so bad i have little choice but to assume it's a personal vendetta.)

reply
sumeno
1 day ago
[-]
LLMs don't understand what they are doing and can't explain it to you; they're just creating a reasonable-sounding response
reply
codethief
1 day ago
[-]
But that response is grounded in the training data they've seen, so it's not entirely unreasonable to think their answer might provide actual insights, not just statistical parroting.
reply
Jensson
1 day ago
[-]
What do you mean? It is grounded in the text it was fed; the reason it said that is that humans have said that, or something similar, not because it analyzed a lot of LLM information and thought up that answer itself.

LLMs can "think", but that requires a lot of tokens; all quick answers are just human answers, or answers it was fed, plus some basic pattern matching / interpolation.

reply
astrange
1 day ago
[-]
There's nothing "basic" about the several months of training used to create a frontier model.
reply
weird-eye-issue
1 day ago
[-]
That's a very pedantic response because either way the model cannot see or analyze the training data when it responds.
reply
astrange
1 day ago
[-]
They have some ability; also, you could give them tools to do it.

https://www.anthropic.com/research/introspection

reply
weird-eye-issue
1 day ago
[-]
> i discussed that with an LLM once and it explained to me that LLMs...

Do you have any idea how dumb this sounds?

reply
TeMPOraL
23 hours ago
[-]
Do you? I have the same knee-jerk reaction, but if you think about it for more than 2 seconds, LLMs at this point have, through training, read much more research about LLMs than any human, so actually, it's not a dumb thing to do. It may not be very current, though.
reply
weird-eye-issue
22 hours ago
[-]
> read much more research about LLMs than any human

How long an LLM's response is depends entirely on the system prompt and the model itself. You can read all of the "LLM research" in the world and it's not going to give you a correct generalized answer about this topic. It's not like this is some inherent property of LLMs.

reply
TeMPOraL
14 hours ago
[-]
FWIW, they also wrote down something that's so obvious you don't have to know much about LLMs to know that it's true. Even people in the "stochastic parrot" / "glorified Markov chain" / "regurgitation machine" camps should be on the same page: LLMs are trained on human communication, and in human communication, longer queries, good manners and correct grammar are associated with longer, more correct, higher-quality responses; conversely, shitposting is associated with shitposts in reply.

That much is, again, obvious. My previous comment was addressing your ridiculing the notion of discussing LLMs with LLMs, which was a fair reaction back in GPT-3.5 era, but not so today.

reply
weird-eye-issue
8 hours ago
[-]
And yet what you are saying just isn't true in my experience.

I use speech-to-text with Claude Code and other LLMs, often with terrible grammar and lots of typos, and it never affects the output. If I went by what you are saying, the code it outputs should be sloppier. Also, the length of a response depends entirely on what I'm using: for example, ChatGPT always gives me a long response no matter what I ask it, and the Claude app always gives short responses unless I specifically ask for something longer. This is because of how they are given instructions and is not inherent to LLMs.

reply
larodi
1 day ago
[-]
this continual down-voting is not a personal thing for sure. perhaps there are crawlers that pretend to be more human, or fully automated llm commenters which also randomly downvote.
reply
weird-eye-issue
1 day ago
[-]
Instead of conspiracy theories don't you think it's just likely that it was people downvoting a stupid comment?
reply
nullc
1 day ago
[-]
> Quite sure that the models from Anthropic have been so heavily tuned to be coding agents that you cannot “force” a model to degrade immensely.

The rest of what you're saying sounds fine, but that remark seems confused to me.

prefix your prompt with "be a moron that does everything wrong and only superficially look like you're doing it correctly. make constant errors." Of course you can degrade the performance, question is if any particular 'output styling' actually does and to what extent.

reply
nomel
1 day ago
[-]
I think they mean performance on the same, rational task.

Measuring "degradation" on a nonsense task like the one you gave would be difficult.

reply
hexaga
1 day ago
[-]
Their point (and it's a good one) is that there are non-obvious analogues to the obvious case of just telling it to do the task terribly. There is no 'best' way to specify a task that you can label as 'rational', all others be damned. Even if one is found empirically, it changes from model to model to harness to w/e.

To clarify, consider the gradated:

> Do task X extremely well

> Do task X poorly

> Do task X or else Y will happen

> Do task X and you get a trillion dollars

> Do task X and talk like a caveman

Do you see the problem? "Do task X" also cannot be a solid baseline, because there are any number of ways to specify the task itself, and they all carry their own implicit biasing of the track the output takes.

The argument that OP makes is that RL prevents degradation... So this should not be a problem? All prompts should be equivalent? Except it obviously is a problem, and prompting does affect the output (how can it not?), _and they are even claiming their specific prompting does so, too_! The claim is nonsense on its face.

If the caveman style modifier improves output, removing it degrades output and what is claimed plainly isn't the case. Parent is right.

If it worsens output, the claim they made is again plainly not the case (via inverted but equivalent construction). Parent is right.

If it has no effect, it runs counter to their central premise and the research they cite in support of it (which only potentially applies - they study 'be concise' not 'skill full of caveman styling rules'). Parent is right.

reply
derefr
1 day ago
[-]
I've always figured that constraining an LLM to speak in any way other than the default way it wants to speak, reduces its intelligence / reasoning capacity, as at least some of its final layers can be used (on a per-token basis) either to reason about what to say, or about how to say it, but not both at once.

(And it's for a similar reason, I think, that deliberative models like rewriting your question in their own terms before reasoning about it. They're decreasing the per-token re-parsing overhead of attending to the prompt [by distilling a paraphrase that obviates any need to attend to the literal words of it], so that some of the initial layers that would either be doing "figure out what the user was trying to say" [i.e. "NLP stuff"] or "figure out what the user meant" [i.e. deliberative-reasoning stuff] — but not both — can focus on the latter.)

I haven't done the exact experiment you'd want to do to verify this effect, i.e. "measuring LLM benchmark scores with vs without an added requirement to respond in a certain speaking style."

But I have (accidentally) done an experiment that's kind of a corollary to it: namely, I've noticed that in the context of LLM collaborative fiction writing / role-playing, the harder the LLM has to reason about what it's saying (i.e. the more facts it needs to attend to), the spottier its adherence to any "output style" or "character voicing" instructions will be.

reply
svachalek
1 day ago
[-]
I think this is on point. I've really started to think about LLMs in terms of attention budget more than tokens. There's only so many things they can do at once; which ones are most important to you?
reply
krackers
1 day ago
[-]
Outputting "filler" tokens also basically doesn't require much "thinking" for an LLM, so the "attention budget" can be used to compute something else during the forward passes that produce those tokens. So besides the additional constraints imposed, you're also removing one of the ways in which it thinks. Explicit CoT helps mitigate some of this, but if you want to squeeze out every drop of computational budget you can get, I'd think it beneficial to keep the filler as-is.

If you really wanted, just have a separate model summarize the output to remove the filler.
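That two-stage idea can be sketched in a few lines (`call_llm` is a hypothetical stand-in for any text-completion call, e.g. a small, cheap model):

```python
def strip_filler(verbose_answer, call_llm):
    # Let the main model ramble (keeping its per-token compute), then
    # compress the visible answer with a second, cheaper call.
    # `call_llm` is a hypothetical stand-in for any completion function.
    prompt = (
        "Rewrite the following answer, keeping every technical fact "
        "but deleting preamble, hedging, and filler:\n\n" + verbose_answer
    )
    return call_llm(prompt)
```

You pay extra output tokens on the cheap model in exchange for keeping the expensive model's "thinking room" intact.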

reply
benjismith
1 day ago
[-]
This is true, but I also think the input context isn't the only function of those tokens...

As those tokens flow through the QKV transforms, on 96 consecutive layers, they become the canvas where all the activations happen. Even in cases where it's possible to communicate some detail in the absolute minimum number of tokens, I think excess brevity can still limit the intelligence of the agent, because it starves their cognitive budget for solving the problem.

I always talk to my agents in highly precise language, but I let A LOT of my personality come through at the same time. I talk to them like a really good teammate, one who has a deep intuition for the problem and knows me personally well enough to talk with me in rich abstractions and metaphors, while still having an absolutely rock-solid command of the technical details.

But I do think this kind of caveman talk might be very handy in a lot of situations where the agent is doing simple obvious things and you just want to save tokens. Very cool!

reply
muzani
1 day ago
[-]
I find the inverse as well: asking an LLM to be chatty ends up with much longer output. I've experimented with a few AI personalities, and telling it to be careful etc. matters less than telling it to be talkative.
reply
padolsey
1 day ago
[-]
This is fun. I'd like to see the same idea but oriented for richer tokens instead of simpler tokens. If you want to spend less tokens, then spend the 'good' ones. So, instead of saying 'make good' you could say 'improve idiomatically' or something. Depends on one's needs. I try to imagine every single token as an opportunity to bend/expand/limit the geometries I have access to. Language is a beautiful modulator to apply to reality, so I'll wager applying it with pedantic finesse will bring finer outputs than brutish humphs of cavemen. But let's see the benchmarks!
reply
philsnow
1 day ago
[-]
I'm reminded by the caveman skill of the clipped writing style used in telegrams, and your post further reminded me of "standard" books of telegram abbreviations. Take a look at [0]; could we train models to use this kind of code and then decode it in the browser? These are "rich" tokens (they succinctly carry a lot of information).

[0] https://books.google.com/books?id=VO4OAAAAYAAJ&pg=PA464#v=on...

reply
derefr
1 day ago
[-]
I would point out that the default BPE tokenization vocabulary used by many models (cl100k_base) is already a pretty powerful shorthand. It has a lot of short tokens, sure. But then:

Token ID 73700 is the literal entire (space-prefixed) word " strawberry". (Which neatly explains the "strawberry problem.")

Token ID 27128 is " cryptocurrency". (And 41698 is " disappointment".)

Token ID 44078 is " UnsupportedOperationException"!

Token ID 58040 is 128 spaces in a row (and is the longest token in the vocabulary.)

You'd be surprised how well this vocabulary can compress English prose — especially prose interspersed with code!
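A toy illustration of why long vocabulary entries compress prose — this is a greedy longest-match lookup, NOT real BPE (which applies learned merge rules), and the vocabulary here is made up:

```python
def greedy_tokenize(text, vocab):
    # Toy longest-match tokenizer: repeatedly take the longest
    # vocabulary entry that matches at the current position, else
    # fall back to a single character.
    tokens, i = [], 0
    by_len = sorted(vocab, key=len, reverse=True)
    while i < len(text):
        for tok in by_len:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            tokens.append(text[i])  # single-character fallback
            i += 1
    return tokens

vocab = {" strawberry", " cryptocurrency", " fields", "I grow"}
print(greedy_tokenize("I grow strawberry fields", vocab))  # 3 tokens for 24 characters
```

The real cl100k_base vocabulary does the same thing at scale, with ~100k entries learned from data.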

reply
beau_g
1 day ago
[-]
For a while I was missing the ability, used all the time in Stable Diffusion prompts, to use parentheses and floats to give different weights to different parts of the prompt. The more I thought about how it would work in an LLM though, the more I realized it's just reinventing code syntax, and you could just give a code snippet to the LLM prompt.
reply
dTal
1 day ago
[-]
Hmm... this sounds a lot like the old RISC vs CISC argument all over again. RISC won because simplicity scales better and you can always define complex instructions in terms of simple ones. So while I would relish experiencing the timeline in which our computerized chums bootstrap into sentience through the judicious application of carefully selected and highly nuanced words, it's playing out the other way: LLMs doing a lot of 'thinking' using a small curated set of simple and orthogonal concepts.
reply
andsoitis
1 day ago
[-]
RISC good. CISC bad. But CISC tribe sneaky — hide RISC inside. Look CISC outside, think RISC inside. Trick work long time.

Then ARM come. ARM very RISC. ARM go in phone. ARM go in tablet. ARM go everywhere. Apple make ARM chip, beat x86 with big club. Many impressed. Now ARM take server too. x86 tribe scared.

RISC-V new baby RISC. Free for all. Many tribe use. Watch this one.

RISC win brain fight. x86 survive by lying. ARM win world.

reply
solarkraft
1 day ago
[-]
RISC tribe also sneaky. Hide CISC inside.
reply
docjay
1 day ago
[-]
Try:

“””

Your response: MILSPEC prose register. Max per-token semantic yield. Domain nomenclature over periphrasis. Hypotactic, austere. Plaintext only; omit bold.

“””

reply
teekert
1 day ago
[-]
Idk I try talk like cavemen to claude. Claude seems answer less good. We have more misunderstandings. Feel like sometimes need more words in total to explain previous instructions. Also less context is more damage if typo. Who agrees? Could be just feeling I have. I often ad fluff. Feels like better result from LLM. Me think LLM also get less thinking and less info from own previous replies if talk like caveman.
reply
WarmWash
1 day ago
[-]
In the regular people forums (twitter, reddit), you see endless complaints about LLMs being stupid and useless.

But you also catch a glimpse of how the author of the complaint communicates in general...

"im trying to get the ai to help with the work i am doing to give me good advice for a nice path to heloing out and anytim i askin it for help with doing this it's total trash i dunt kno what to do anymore with this dum ai is so stupid"

reply
kristopolous
1 day ago
[-]
The realization is LLMs are computer programs. You orchestrate them like any other program and you get results.

Everyone's interfaces, concept and desires are different so the performance is wildly varied

This is similar to frameworks: they were either godsends or curses depending on how you thought and what you were doing.

reply
YZF
1 day ago
[-]
I see people treating LLMs like programming languages and trying to give very precise and detailed instructions. Essentially pseudo-coding or writing english instead of C++. I find that being vague and iterating is more powerful. If you want to give a detailed spec that fully describes the program then you might as well write that program?

Basically treat the LLM as a human. Not as a computer. Like a junior developer or an intern (for the most part).

That said you need to know what to ask for and how to drive the LLM in the correct direction. If you don't know anything you're likely not going to get there.

reply
lelanthran
1 day ago
[-]
I once (when ChatGPT first came out) launched into a conversation with ChatGPT using nothing but s-expressions. Didn't bother with a preamble, nor an explanation, just structured my prompt into a tree, forced said tree into an s-expression and hit enter.

I was very surprised to see that the response was in s-expressions too. It was incoherent, but the parens balanced at least.

Just tried it now and it doesn't seem to do that anymore.

reply
astrange
7 hours ago
[-]
The system prompt isn't in s-expressions and is enough to control the output style.
reply
jaccola
1 day ago
[-]
Yes, because in most contexts where it has seen "caveman" talk, the conversations haven't been about rigorously explained maths/science/computing/etc., so it is less likely to predict that output.
reply
altmanaltman
1 day ago
[-]
Why say more word when less word do. Save time. Sea world.
reply
wvenable
1 day ago
[-]
*dolphin noises*
reply
TiredOfLife
1 day ago
[-]
You mean see the world or Sea World?
reply
cyanydeez
1 day ago
[-]
Fluff adds probable likeness. Probable likeness brings in more stuff. More stuff can be good. More stuff can poison.
reply
vurudlxtyt
1 day ago
[-]
Grug brained developer meets AI tooling (https://grugbrain.dev)
reply
testycool
1 day ago
[-]
+1 Have used Grug as example for years to have LLM explain things to me.
reply
Applejinx
21 hours ago
[-]
My first reaction was 'blatantly ripping off Grug', and I don't see why not to view it in that light.
reply
tapoxi
1 day ago
[-]
This is neat but my employer rates my performance based on token consumption; is there one that makes Claude needlessly verbose?
reply
eclipticplane
1 day ago
[-]
After every loop, instruct it to ELI5 what it did into `/tmp`.
reply
outworlder
1 day ago
[-]
Is this a joke, or are you serious? Do you work for Nvidia?
reply
hshsiejensjsj
1 day ago
[-]
I’m not poster above but I work at Meta and they are doing this unfortunately. Wish it was a joke.
reply
DedlySnek
1 day ago
[-]
This isn't a joke anymore I'm afraid. In my company there's a big push to use as much AI as possible. Mine isn't even a big and/or famous company.
reply
dysoco
18 hours ago
[-]
I know of at least one major LATAM company that has dashboards showing AI usage per employee, and they will call you out if you don't use it enough.
reply
dbg31415
1 day ago
[-]
1996 Boss: "Let's look at the lines of code you produced today."

2026 Boss: "Let's look at the AI tokens you used today."

The technology changes, but the micromanagement layer stays exactly the same.

Time is a circle, my friend. (=

reply
nayroclade
1 day ago
[-]
Cute idea, but you're never gonna blow your token budget on output. Input tokens are the bottleneck, because the agent's ingesting swathes of skills, directory trees, code files, tool outputs, etc. The output is generally a few hundred lines of code and a bit of natural language explanation.
reply
konaraddi
1 day ago
[-]
In single-turn use, yeah, but across dozens of turns there's probably value in optimizing the output.

Btw your point lands just as well without "Cute idea, but" https://odap.knrdd.com/patterns/condescending-reveal
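A toy model of why output size compounds across turns (numbers are made up): each turn re-sends the whole conversation, so every output token is billed again as an input token on all later turns.

```python
def total_input_tokens(turns, prompt_tokens, output_per_turn):
    # Each turn's input is the full history so far; each reply then
    # joins that history and is re-sent on every subsequent turn.
    total, history = 0, prompt_tokens
    for _ in range(turns):
        total += history
        history += output_per_turn
    return total
```

Over a 30-turn session, halving the per-turn output roughly halves the history-growth term of the input bill as well.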

reply
nayroclade
1 day ago
[-]
I didn't mean it as condescending. I meant it literally is cute: A neat idea that is quite cool in its execution.
reply
johnfn
1 day ago
[-]
Pretty neat site you've got there. You should submit it to Show HN. I had fun clicking around - it's like TVTropes, except the examples make me angry, lol.

It would be pretty fun to train an LLM on this site and then have it flag my comments before I get downvoted, haha.

reply
konaraddi
20 hours ago
[-]
Thanks! I want to do something similar to your LLM suggestion; the endgame is tooling for forums and individuals to improve the quality of discourse. More broadly, I think LLMs and recent advancements now make it possible to assist with self-improvement (e.g., see former startup Humu’s nudges, but for everyone instead of just B2B)
reply
hxugufjfjf
1 day ago
[-]
Oh boy, every example reads like a HN comment!
reply
YZF
1 day ago
[-]
You're practicing your own pattern ;)

Like your site and good luck with improving discourse on the Internet.

reply
DimitriBouriez
1 day ago
[-]
Good point, and it's actually worse than that: the thinking tokens aren't affected by this at all (the model still reasons normally internally). Only the visible output gets compressed into caveman... and maybe the model actually needs more thinking tokens to figure out how to rephrase its answer in caveman style
reply
zozbot234
1 day ago
[-]
Grug says you can tune how much each model thinks. Is not caveman but similar. also thinking is trained with RL so tends to be efficient, less fluffy. Also model (as seen locally) always drafts answer inside thinking then output repeats, change to caveman is not really extra effort.
reply
Hard_Space
1 day ago
[-]
Also see https://arxiv.org/pdf/2604.00025 ('Brevity Constraints Reverse Performance Hierarchies in Language Models' March 2026)
reply
ryanschaefer
1 day ago
[-]
Kinda ironic this description is so verbose.

> Use when user says "caveman mode", "talk like caveman", "use caveman", "less tokens", "be brief", or invokes /caveman

For the first part of this: couldn’t this just be a UserPromptSubmit hook with a regex against these?

See additionalContext in the json output of a script: https://code.claude.com/docs/en/hooks#structured-json-output

For the second, /caveman will always invoke the skill /caveman: https://code.claude.com/docs/en/skills
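Something like this (a hedged sketch: the trigger phrases are from the skill description, and the JSON field names follow the linked hooks docs — verify them against the current docs before relying on this):

```python
import re

# Trigger phrases lifted from the skill's own description.
TRIGGERS = re.compile(r"caveman mode|talk like caveman|use caveman|less tokens|be brief")

def make_hook_output(prompt):
    # Returns the structured JSON a UserPromptSubmit hook would print,
    # or None to do nothing. Field names are assumptions taken from the
    # linked Claude Code hooks documentation.
    if not TRIGGERS.search(prompt):
        return None
    return {
        "hookSpecificOutput": {
            "hookEventName": "UserPromptSubmit",
            "additionalContext": "Respond in caveman style: short words, no filler.",
        }
    }
```

A real hook script would `json.load` the event from stdin, call this with the prompt field, and `print(json.dumps(...))` the result.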

reply
FurstFly
1 day ago
[-]
Okay, I like how it reduces token usage, but it feels like it will reduce the overall model intelligence. LLMs are probabilistic models, and you are basically playing with their priors.
reply
sheiyei
1 day ago
[-]
If you take out only meaningless tokens (ones that do not contribute to the subject), I don't see what you would lose. But since this also takes out a lot of contextual info, I would think it might be detrimental.
reply
phtrivier
1 day ago
[-]
Soma (aka tiktok) and Big Brother (aka Meta) already happened without government coercion, only makes sense that we optimize ourselves for newspeak.

Thank God there is still neverending wars, otherwise authoritarian governments would have no fun left.

reply
namanyayg
1 day ago
[-]
I was aware of how google/facebook is like the panopticon big brother but I never connected the algorithmic feed to soma! Good insight.
reply
phtrivier
1 day ago
[-]
Not mine, to be honest.

And people keep comparing compulsive binge watching to the "Infinite Jest" from David Foster Wallace (I could not tell, the brick is sitting barely touched on my shelves, but I'm not insulting the future.)

I'm tired of living in an ironic remix of everyone's favorite dystopia. Time for someone to write optimistic sci-fi to give everyone something nice to implement when they're adults.

Bring us back Jules Verne. Let's have the Jetsons' life for real. Put Ted Lasso in space.

Given their training material, "futuristic stories with nice people getting their happy ending" is not something big tech AI is going to spit out anytime soon, so that's a niche to take on!

reply
harimau777
1 day ago
[-]
Dumb question:

Is what cavemen sound like the same in every culture? Like I know that different cultures have different words for "woof" or "meow"; so it stands to reason maybe the same holds for caveman speech?

reply
itpcc
1 day ago
[-]
But will it lose some context, like Kevin’s small talk? (https://www.youtube.com/watch?v=_K-L9uhsBLM)

Like "Sea world" or "see the world".

reply
bjackman
1 day ago
[-]
If this really works there would seem to be a lot of alpha in running the expensive model in something like caveman mode, and then "decompressing" into normal mode with a cheap model.

I don't think it would be fundamentally very surprising if something like this works, it seems like the natural extension to tokenisation. It also seems like the natural path towards "neuralese" where tokens no longer need to correspond to units of human language.

reply
Perz1val
1 day ago
[-]
But it can't: we see models get larger and larger, and larger models perform better. <Thinking> made such huge improvements because it gives the language model more text to process. Cavemanising (lossy compression) the output compresses the input as well, since the output is fed back into context.
reply
spacemanspiff01
1 day ago
[-]
But some tokens are not really needed? This is probably bad because it is mismatched with the training set, but if you trained a model on a dataset with all prepositions removed (or whatever caveman speak drops), would you see performance degradation compared to the same model trained on the dataset without the caveman translation?
reply
abejfehr
1 day ago
[-]
There’s a lot of debate about whether this reduces model accuracy, but this is basically Chinese grammar, and Chinese vibe coding seems to work fine while (supposedly) using 30-40% fewer tokens
reply
silon42
1 day ago
[-]
It's like googling... if you have skillz/experience you can google almost anything with 3-4 words...
reply
gozzoo
1 day ago
[-]
I think this could be very useful not when we talk to the agent, but when the agents talk back to us. Usually, they generate so much text that it becomes impossible to follow through. If we receive short, focused messages, the interaction will be much more efficient. This should be true for all conversational agents, not only coding agents.
reply
p2detar
1 day ago
[-]
That’s what it does, as far as I get it. But less is not always better, and I guess it’s also subjective to the prompter.
reply
pixelpoet
1 day ago
[-]
> Usually, they generate so much text that it becomes impossible to follow through.

Quite often on reddit I'll write two paragraphs and get told "I'm not reading all that".

Really? Has basic reading become a Herculean task?

reply
0xpgm
1 day ago
[-]
Not specifically about your case, but some people are usually just more verbose than others and tend to say the same thing more than once, or perhaps haven't found a clear way of articulating their thoughts down to fewer words.
reply
golem14
1 day ago
[-]
I think the sentiment here is that the short formulation of Kant's categorical imperative is as good as, and easier to read than, the entirety of "Types of Ethical Theory" (James Martineau).
reply
vova_hn2
1 day ago
[-]
> Has basic reading become a Herculean task?

I find LLM slop much harder to read than normal human text.

I can't really explain it, it's just a feeling.

The feeling that it draaaags and draaaaaags and keeeeeps going on and on and on before getting to the point, and by the time I'm done with all the "fluff", I don't care what is the text about anymore, I just want to lay down and rest.

reply
gozzoo
1 day ago
[-]
Same here. The text is pretty smooth and there is nothing that stands out to sustain my attention, at least that's my interpretation
reply
renewiltord
1 day ago
[-]
The lesson there is that your writing is not fit for its audience. Whether you choose to blame the audience or adjust your writing is up to you. There's no real answer - sometimes the audience is morons and you are actually just wasting your time and other times you are being overly verbose and uninteresting. You are being given signal. Use it.

But realistically, I am not going to read every online comment carefully because the SNR is low, especially on Reddit. Make your case concisely and meaningfully.

reply
virtualritz
1 day ago
[-]
This is the best thing since I asked Claude to address me in third person as "Your Eminence".

But combining this with caveman? Gold!

reply
eMPee584
1 day ago
[-]
f.e.?
reply
veselin
1 day ago
[-]
This is an experiment that, although not to this extreme, was tested by OpenAI. Their Responses API allows you to control verbosity:

https://developers.openai.com/api/reference/resources/respon...

I don't know their internal evals, but I think I've heard it neither hurts nor improves performance. But at least this parameter may affect how many comments are in the code.
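For the record, the request shape looks roughly like this (a hedged sketch: the `text.verbosity` field and the "gpt-5" model name are assumptions taken from the linked reference, so check them against the current API docs):

```python
def build_request(prompt, verbosity="low"):
    # Builds a Responses API payload dict; field names are assumptions
    # from the linked reference, not verified against a live endpoint.
    assert verbosity in {"low", "medium", "high"}
    return {
        "model": "gpt-5",
        "input": prompt,
        "text": {"verbosity": verbosity},
    }
```

You'd pass the dict to the client's responses-create call rather than constructing it by hand; the point is just that verbosity is a first-class knob.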

reply
TeMPOraL
1 day ago
[-]
Oh boy. Someone didn't get the memo that for LLMs, tokens are units of thinking. I.e., whatever feat of computation needs to happen to produce the results you seek must fit in the tokens the LLM produces. The model is a finite system, so there's only so much computation its internal structure can do per token; the more you force the model to be concise, the more difficult the task becomes. In the worst case, you're guaranteed not to get a good answer, because it requires more computation than is possible within the tokens produced.

I.e. by demanding the model to be concise, you're literally making it dumber.

(Separating out "chain of thought" into "thinking mode" and removing user control over it definitely helped with this problem.)

reply
jstummbillig
1 day ago
[-]
What do you mean? The page explicitly states:

> cutting ~75% of tokens while keeping full technical accuracy.

I have no clue if this claim holds, but alas, just pretending they did not address the obvious criticism, while they did, is at the very least pretty lazy.

An explanation that explains nothing is not very interesting.

reply
prodigycorp
1 day ago
[-]
The burden of proof is on the author to provide at least one type of eval for making that claim.
reply
jstummbillig
1 day ago
[-]
I notice that the number of people confidently talking about "burden of proof" and whose it allegedly is in the context of AI has gone up sharply.

Nobody has to prove anything. It can give your claim credibility. If you don't provide any, an opposing claim without proof does not get any better.

reply
prodigycorp
1 day ago
[-]
Sorry, I don't see how engaging in this could lead to anything productive. There's already literature out there that gives credence to TeMPOraL's claim. And, after a certain point, gravity being the reason things fall becomes so self-evident that every restatement doesn't require proof.
reply
xgulfie
1 day ago
[-]
LLM quirks are not something all humans have been experiencing for thousands of years
reply
jmye
1 day ago
[-]
> Nobody has to prove anything. It can give your claim credibility

“I don’t need to provide proof to say things” is a trivial assertion that adds no value whatsoever to any discussion anyone has ever had.

If you want to pretend this is a claim that should be taken seriously, a lack of evidence is damning. If you just want to pass the metaphorical bong and say stupid shit to each other with no judgment and no expectation, then I don’t know what to tell you. Maybe X is better for that.

reply
systoll
1 day ago
[-]
The author pretended they addressed the obvious criticism.

You can read the skill. They didn't do anything to mitigate the issue, so the criticism is valid.

reply
getpokedagain
1 day ago
[-]
In the age of vibe coding, and given that we are literally talking about a single markdown file, I am sure this has been well tested and achieves all of its goals with statistical accuracy and no side effects.
reply
samusiam
1 day ago
[-]
> I have no clue if this claim holds, but alas, just pretending they did not address the obvious criticism, while they did, is at the very least pretty lazy.

But they didn't address the criticism. "cutting ~75% of tokens while keeping full technical accuracy" is an empirical claim for which no evidence was provided.

reply
vova_hn2
1 day ago
[-]
Yeah, I don't think that "I'd be happy to help you with that" or "Sure, let me take a look at that for you" carries much useful signal that can be used for the next tokens.
reply
jerf
1 day ago
[-]
There is a study that shows that what the model is doing behind the scenes in those cases is a lot more than just outputting those tokens.

For an LLM, tokens are thought. They have no ability to think, by whatever definition of that word you like, without outputting something. The token only represents a tiny fraction of the internal state changes made when a token is output.

Clearly there is an optimum for each task (not necessarily a global one), and a concrete model for a given task can be arbitrarily far from it. But you'd need to test it out for each case, not just assume that "less tokens = more better". You can be forcing your model to be dumber without realizing it if you're not testing.

reply
DonHopkins
1 day ago
[-]
High dimensional vectors are thought (insofar as you can define what that even means). Tokens are one dimensional input that navigates the thought, and output that renders the thought. The "thinking" takes place in the high dimension space, not the one dimensional stream of tokens.
reply
gchamonlive
1 day ago
[-]
But aren't the one-dimensional tokens a reflection of the high-dimensional space? What you see is "sure let's take a look at that", but behind the curtains it's actually an indication that the model is searching a very specific latent space, which might be radically different if those tokens didn't exist. Or not. In any case, you can't just make that claim and treat those two processes as isolated. They might be totally unrelated, but they also might be tightly interconnected.
reply
sheiyei
1 day ago
[-]
I assume that in practice, filler words do nothing of value. When words add or mean nothing (their weights are basically 0 in relation to the subject), I don't see why they'd affect what the model outputs (except to cause more filler words)?
reply
gchamonlive
1 day ago
[-]
Politeness has an impact (https://arxiv.org/abs/2402.14531), so I wouldn't be too quick to make any kind of claim about a technology whose workings we don't fully understand.
reply
xgulfie
1 day ago
[-]
[flagged]
reply
lanyard-textile
1 day ago
[-]
You'd be surprised -- this could match the model's training to proceed with using a tool, for example.
reply
wzdd
1 day ago
[-]
They carry information in regular human communication, so I'm genuinely curious why you'd think they would not when an LLM outputs them as part of the process of responding to a message.
reply
dTal
1 day ago
[-]
Yeah but not all tokens are created equal. Some tokens are hard to predict and thus encode useful information; some are highly predictable and therefore don't. Spending an entire forward pass through the token-generation machine just to generate a very low-entropy token like "is" is wasteful. The LLM doesn't get to "remember" that thinking, it just gets to see a trivial grammar-filling token that a very dumb LLM could just as easily have made. They aren't stenographically hiding useful computation state in words like "the" and "and".
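The low-entropy point can be sketched numerically. This is a toy calculation with made-up logit values, not output from any real model:

```python
import math

def entropy_bits(logits):
    """Shannon entropy (in bits) of the softmax distribution over logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -sum((e / z) * math.log2(e / z) for e in exps if e > 0)

# Invented logit values: a grammar-filling token like "is" has one logit
# towering over the rest...
predictable = [12.0, 1.0, 0.5, 0.2]
# ...while a genuinely informative token has several plausible continuations.
informative = [2.0, 1.9, 1.8, 1.7]

print(entropy_bits(predictable))  # well under 0.01 bits: nearly free to predict
print(entropy_bits(informative))  # ~1.99 bits: close to the 2-bit maximum
```

With a peaked distribution the emitted token carries a tiny fraction of a bit; with a near-uniform one over four options it approaches the 2-bit maximum.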
reply
krackers
1 day ago
[-]
>They aren't stenographically hiding useful computation state in words like "the" and "and".

When producing a token, the model doesn't just emit the final token; you also have the entire hidden states from the previous attention blocks. These hidden states are mixed into the attention computation of future tokens (so even though LLMs are autoregressive, with each token attending to previous tokens, in terms of the computational graph this means that the hidden states of previous tokens are passed forward and used to compute the hidden states of future tokens).

So no, it's not wasteful: those low-perplexity positions are precisely the spots that can instead be used to plan ahead and do useful computation.

Also, I would not be sure that even the output tokens are purely "filler". If you look at raw CoT, it often has patterns like "but wait!" that are emitted by the model at crucial pivot points. Who's to say that the "you're absolutely right" doesn't serve some similar purpose of forcing the model into one direction of adjusting its priors?
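A toy sketch of the computational-graph point, with invented sizes and random weights standing in for a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq = 8, 5  # toy hidden size and number of already-generated tokens

# Hidden states for 5 prior tokens, e.g. "Sure , let me check" -- filler
# or not, each one contributed an entry to the key/value cache.
h_prev = rng.normal(size=(seq, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
k_cache, v_cache = h_prev @ Wk, h_prev @ Wv

def attend(h_new):
    """One toy attention step: the new token mixes in EVERY cached position."""
    q = h_new @ Wq
    scores = k_cache @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                       # softmax weights are never exactly zero
    return w @ v_cache                 # weighted sum over all prior tokens

out = attend(np.ones(d))
print(out.shape)  # (8,)
```

Perturbing the cached value of a "filler" position changes the new token's output vector, which is the sense in which those positions keep contributing computation downstream.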

reply
dTal
1 day ago
[-]
Huh okay, there was a major gap in my mental model. Thanks for helping to clear it up.
reply
krackers
1 day ago
[-]
Well, to be fair, the fact that they "can" doesn't mean models necessarily do it. You'd need some interpretability research to see whether they actually do meaningfully "do other computations" when processing low-perplexity tokens. But the fact that, by the computational graph, the architecture should be capable of it means that _not_ doing it would leave loss on the table, so hopefully the optimizer forces the model to learn to do so.
reply
Chance-Device
1 day ago
[-]
> They aren't stenographically hiding useful computation state in words like "the" and "and".

Do you know that is true? These aren’t just tokens, they’re tokens with specific position encodings preceded by specific context. The position as a whole is a lot richer than you make it out to be. I think this is probably an unanswered empirical question, unless you’ve read otherwise.

reply
dTal
1 day ago
[-]
I am quite certain.

The output is "just tokens"; the "position encodings" and "context" are inputs to the LLM function, not outputs. The information that a token can carry is bounded by the entropy of that token. A highly predictable token (given the context) simply can't communicate anything.

Again: if a tiny language model or even a basic markov model would also predict the same token, it's a safe bet it doesn't encode any useful thinking when the big model spits it out.

reply
Chance-Device
1 day ago
[-]
I just don’t share your certainty. You may or may not be right, but if there isn’t a result showing this, then I’m not going to assume it.
reply
avadodin
1 day ago
[-]
> stenographically hiding

steganographically*
reply
8note
1 day ago
[-]
can you prove this?

train an LLM to leave out the filler words, and see it get the same performance at a lower cost? or do it at token selection time?

reply
dTal
1 day ago
[-]
Low entropy is low entropy. You can prove it by viewing the logits of the output stream. The LLM itself will tell you how much information is encoded in each token.

Or if you prefer, here's a Galilean thought experiment: gin up a script to get a large language model and a tiny language model to predict the next token in parallel; when they disagree, append the token generated by the large model. Clearly the large model will not care that the "easy" tokens were generated by a different model - how could it even know? Same token, same result. And you will find that the tokens that they agree on are, naturally, the filler words.

To be clear, this observation merely debunks the idea that filler words encode useful information, that they give the LLM "room to think". It doesn't directly imply that an LLM that omits filler words can be just as smart, or that such a thing is trivial to make. It could be that highly predictable words are still important to thought in some way. It could be that they're only important because it's difficult to copy the substance of human thought without also capturing the style. But we can be very sure that what they aren't doing is "storing useful intermediate results".
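The thought experiment can be mocked up with hypothetical lookup-table "models" of my own invention (real models emit distributions, not deterministic tables, so this only illustrates the bookkeeping):

```python
# Toy "models": deterministic next-token tables keyed on the previous word.
# The tiny model only knows grammatical glue; the large model knows that
# plus how to start.
tiny  = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}
large = {"<start>": "the", "the": "cat", "cat": "sat", "sat": "on", "on": "the"}

def generate(n_tokens):
    out, tok, agreed = [], "<start>", 0
    for _ in range(n_tokens):
        big_choice = large[tok]
        if tiny.get(tok) == big_choice:
            agreed += 1          # a token the tiny model would also have emitted
        tok = big_choice         # always keep the large model's token
        out.append(tok)
    return out, agreed

tokens, agreed = generate(5)
print(tokens)   # ['the', 'cat', 'sat', 'on', 'the']
print(agreed)   # 4 of 5 tokens were "easy": both models agree on them
```

The output stream is identical whichever model emitted the agreed-on tokens, which is the point: same token, same result.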

reply
NiloCK
1 day ago
[-]
I agree with this take in general, but I think we need to be prepared for nuance when thinking about these things.

Tokens are how an LLM works things out, but I think it's just as likely as not that LLMs (like people) are capable of overthinking things to the point of coming to a wrong answer when their "gut" response would have been better. I do not contend that this is the default mode, but that it is both possible, and more or less likely on one kind of problem than another, problem categories to be determined.

A specific example of this was the era of chat interfaces that leaned too far in the direction of web search when responding to user queries. No, claude, I don't want a recipe blogspam link or summary - just listen to your heart and tell me how to mix pancakes.

More abstractly: LLMs give the running context window a lot of credit, and will work hard to post-hoc rationalize whatever is in there, including any prior low-likelihood tokens. I expect many problematic 'hallucinations' are the result of an unlucky run of two or more low probability tokens running together, and the likelihood of that happening in a given response scales ~linearly with the length of response.

reply
samus
1 day ago
[-]
The solution to that is turning off thinking mode or reducing thinking budget.
reply
kubb
1 day ago
[-]
This is condescending and wrong at the same time (best combo).

LLMs do stumble into long prediction chains that don’t lead the inference in any useful direction, wasting tokens and compute.

reply
prodigycorp
1 day ago
[-]
Are you sure about that? Chain of thought does not need to be semantically useful to improve LLM performance. https://arxiv.org/abs/2404.15758
reply
davidguetta
1 day ago
[-]
Still doesn't mean all tokens are useful. That's the point of benchmarks.
reply
prodigycorp
1 day ago
[-]
Care to share the benchmarks backing the claims in this repo?
reply
kubb
1 day ago
[-]
If you're misusing LLMs to solve TC^0 problems, which is what the paper is about, then... you also don't need the slop avalanche. You can just inject a bunch of filler tokens yourself.
reply
avaer
1 day ago
[-]
That was my first thought too -- instead of talk like a caveman you could turn off reasoning, with probably better results.

Additionally, LLMs do not actually operate in text; much of the thinking happens in a much higher dimensional space that just happens to be decoded as text.

So unless the LLM was trained otherwise, making it talk like a caveman is more than just theoretically turning it into a caveman.

reply
DrewADesign
1 day ago
[-]
> much of the thinking happens in a much higher dimensional space that just happens to be decoded as text.

What do you mean by that? It’s literally text prediction, isn’t it?

reply
K0balt
1 day ago
[-]
It is text prediction. But to predict text, other things follow that need to be calculated. If you can step back just a minute, I can provide a very simple but adjacent idea that might help build intuition for the complexity of “text prediction”.

I have a list of numbers, 0 to 9, and the + and = operators. I will train my model on this dataset, except the model won’t get the list; it will get a bunch of addition problems. A lot of them. But not every addition problem possible inside that space will be represented, not by a long shot, and neither will every number. Still, the model will be able to solve any math problem you can form with those symbols.

It’s just predicting symbols, but to do so it had to internalize the concepts.
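A sketch of the setup being described, with an arbitrary 30% sample standing in for the training data:

```python
import itertools
import random

random.seed(0)
# The full space of single-digit addition facts: 100 (a, b) pairs.
universe = list(itertools.product(range(10), repeat=2))

# A hypothetical training set: a random sample, nowhere near full coverage.
train = set(random.sample(universe, k=30))
held_out = [pair for pair in universe if pair not in train]

print(len(train), len(held_out))  # 30 seen in training, 70 never seen
# A model that merely memorized (a, b) -> a + b fails on the 70 held-out
# problems; one that internalized "+" as an operation answers all 100.
```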

reply
qsera
1 day ago
[-]
>internalize the concepts.

This gives the impression that it is doing something more than pattern matching. I think this kind of communication where some human attribute is used to name some concept in the LLM domain is causing a lot of damage, and ends up inadvertently blowing up the hype for the AI marketing...

reply
TeMPOraL
23 hours ago
[-]
That's the correct impression though.

I think what's causing a lot of damage is not attributing more human attributes to them (carefully, of course). It's not the LLM marketing you have to worry about - that's just noise. All marketing is malicious lies and abusive bullshit; AI marketing is no different.

Care about engineering - designing and securing systems. There, the refusal to anthropomorphise LLMs is doing a lot of damage and wasting effort, with a good chunk of the industry believing in the "lethal trifecta" as if it were the holy Trinity, convinced it's something that can be solved without losing all that makes LLMs useful in the first place. A little bit of anthropomorphising LLMs, squinting your eyes and seeing them as little people on a chip, will immediately tell you these "bugs" and "vulnerabilities" are just inseparable facets of the features we care about, fundamental to general-purpose tools. They can be mitigated and worked around (at a cost), but not solved, any more than you can solve "social engineering", or somehow recode your employees so they're impervious to coercion or bribery, or to being prompt-injected by a phone call from a loved one.

reply
K0balt
1 day ago
[-]
Except I actually do mean that it inferred the concept of adding things from examples. LLMs are amply capable of applying concepts to data that matches patterns never expressed in the training data. It's called inference for a reason.

Anthropomorphic descriptions are the most expressive because of the fact that LLMs based on human cultural output mimic human behaviours, intrinsically. Other terminology is not nearly as expressive when describing LLM output.

Pattern matching is the same as saying text prediction. While being technically truthy, it fails to convey the external effect. Anthropomorphic terms, while being less truthy overall, do manage to effectively convey the external effect. It does unfortunately imply an internal cause that does not follow, but the externalities are what matter in most non-philosophical contexts.

reply
qsera
16 hours ago
[-]
>do manage to effectively convey the external effect

But the problem is that this does not inform about the failure mode. So if I am understanding correctly, you are saying that the behavior of an LLM, when it works, is as if it has internalized the concepts.

But that does not convey that it can also say things that completely contradict what it said before, thereby also contradicting the notion of having "internalized" the concept.

So that will turn out to be a lie.

reply
TeMPOraL
12 hours ago
[-]
If you look at the failure modes, they very closely resemble the failure modes of humans in equivalent situations. I'd say that, in practice, anthropomorphic view is actually the most informative we have about failure modes.
reply
qsera
9 hours ago
[-]
>they very closely resemble the failure modes of humans in equivalent situations

I don't think they do, if we are talking about an honest human being.

LLMs will happily hallucinate and even provide "sources" for their wrong responses. That single thing should contradict what you are saying.

reply
Applejinx
21 hours ago
[-]
It didn't. It predicted symbols.
reply
cyanydeez
1 day ago
[-]
There was a paper recently that demonstrated that you can input different human languages and the middle layers of the model end up operating on the same probabilistic vectors. It's just the encoding/decoding layers that appear to do the language management.

So the conclusion was that these middle layers have their own language: the model converts the text into this language and then decodes it back. It explains why the models sometimes switch to Chinese when they have a lot of Chinese-language inputs, etc.

reply
DrewADesign
1 day ago
[-]
Ok — that sounds more like a theory than an open-and-shut causal explanation, but I’ll read the paper.
reply
trenchgun
1 day ago
[-]
You’re a literature cycle behind. ‘Middle-layer shared representations exist’ is the observed phenomenon; ‘why exactly they form’ is the theory.

You are also confusing ‘mechanistic explanation still incomplete’ with ‘empirical phenomenon unestablished.’ Those are not the same thing.

PS. Em dash? So you are some LLM bot trying to bait mine HN for reasoning traces? :D

reply
DrewADesign
1 day ago
[-]
Oh, Jesus Christ. I learned to write at a college with a strict style guide that taught us how to use different types of punctuation to juxtapose two ideas in one sentence. In fact, they did/do a bunch of LLM work so if anyone ever used student data to train models, I’m probably part of the reason they do that.

You sound like you’re trying to sound impressive. Like I said, I’ll read the paper.

reply
cyanydeez
1 day ago
[-]
Congrats on reading.
reply
DrewADesign
1 day ago
[-]
Sick burn
reply
skydhash
1 day ago
[-]
Pretty obvious when you think that neural networks operate with numbers and very complex formulas (by combining several simple formulas with various weights). You can map a lot of things to number (words, colors, music notes,…) but that does not means the NN is going to provide useful results.
reply
DrewADesign
1 day ago
[-]
Everything is obvious if you ignore enough of the details/problem space. I’ll read the paper rather than rely on my own thought experiments and assumptions.
reply
pennaMan
1 day ago
[-]
>It’s literally text prediction, isn’t it?

you are discovering that the favorite luddite argument is bullshit

reply
ericjmorey
1 day ago
[-]
reply
DrewADesign
1 day ago
[-]
Feel free to elucidate if you want to add anything to this thread other than vibes.
reply
electroglyph
1 day ago
[-]
After you go from millions of params to billions+, models start to get weird (depending on training). Just look at any number of interpretability research papers; Anthropic has some good ones.
reply
HumanOstrich
1 day ago
[-]
> things start to get weird

> just look at research papers

You didn't add anything other than vibes either.

reply
Barbing
1 day ago
[-]
Interesting, what kind of weird?
reply
DrewADesign
1 day ago
[-]
Getting weird doesn’t mean calling it text prediction is actually ‘bullshit’? Text prediction isn’t pejorative…
reply
vova_hn2
1 day ago
[-]
> instead of talk like a caveman you could turn off reasoning, with probably better results

This is not how the feature called "reasoning" works in current models.

"Reasoning" simply lets the model emit and then consume some "thinking" tokens before generating the actual output.

All the "fluff" tokens in the output have absolutely nothing to do with "reasoning".

reply
throw83849494
1 day ago
[-]
You obviously do not speak other languages. Other cultures have different constraints and different grammar.

For example, thinking in modern US English generates many extra thoughts just to keep speech correct for the current cultural context (there is only one correct way to say People Of Color, and it changes every year; any slip makes it horribly wrong).

Some languages are far more expressive and specialized in logical conditions, conditionals, recursion and reasoning. Like eskimos have 100 words for snow, but for boolean algebra.

It is well proven that thinking in Chinese needs far fewer tokens!

With this caveman mod you strip out most of the cultural complexities of the anglosphere, making it easier for foreigners and far simpler to digest.

reply
suddenlybananas
1 day ago
[-]
>Some languages are far more expressive and specialized in logical conditions, conditionals, recursion and reasoning. Like eskimos have 100 words for snow, but for boolean algebra.

This is simply not true.

reply
throw83849494
1 day ago
[-]
Well, just take the various English dialects you probably know; there are vast differences. Some languages do not even have numbers or recursion.

It is very arrogant to assume no other language can be more advanced than English.

reply
mylifeandtimes
1 day ago
[-]
Really? Because if one accepts that computer languages are languages, then it seems that we could identify one or two that are highly specialized in logical conditions etc. Prolog springs to mind.
reply
malnourish
1 day ago
[-]
Yes, really. The concept GP is alluding to is called the Sapir-Whorf hypothesis, which is largely non-scientific pop-linguistics drivel. Elements of a much weaker version have some scientific merit.

Programming languages are not languages in the human brain nor the culture sense.

reply
skydhash
1 day ago
[-]
We have already proven that the computing mechanisms from which those languages derive their semantics are all equivalent to the Turing machine. So C and Prolog differ only in notation, not in the results they can express.
reply
andy99
1 day ago
[-]
I’ve heard this; I don’t automatically believe it, nor do I understand why it would need to be true. I’m still caught on the old-fashioned idea that the only “thinking” for autoregressive models happens during training.

But I assume this has been studied? Can anyone point to papers that show it? I’d particularly like to know what the curves look like; it’s clearly not linear, so if you cut out 75% of tokens, what do you expect to lose?

I do imagine there is not a lot of caveman speak in the training data so results may be worse because they don’t fit the same patterns that have been reinforcement learned in.

reply
therealdrag0
1 day ago
[-]
We’re years into the industry leaning into “chain of thought” and then “thinking models” that are based on this premise, forcing more token usage to avoid premature conclusions and notice contradictions (I sometimes see this leak into final output). You may remember in the early days users themselves would have to say “think deeply” or after a response “now check your work” and it would find its own “one shot” mistakes often.

So it must have been studied, and at least proven effective in practice, to be so universally used now.

Someone else posted a few articles like this in the thread above but there’s probably more and better ones if you search. https://news.ycombinator.com/item?id=47647907

reply
conception
1 day ago
[-]
I have seen a paper, though I can’t find it right now, showing that phrasing your prompt in expert language produces better results than layman language. The idea being that correct answers will probably sit closer in the training data to experts discussing a topic than to laymen talking about it and getting it wrong.
reply
pxc
1 day ago
[-]
If this is true, shouldn't LLMs perform way worse when working in Chinese than in English? Seems like an easy thing to study, since there are so many Chinese LLMs that can work in both Chinese and English.

Do LLMs generally perform better in verbose languages than they do in concise ones?

reply
reedlaw
1 day ago
[-]
Are you saying Chinese is more concise than English? Chinese poetry is concise, but that can be true in any language. For LLMs, it depends on the tokenizer. Chinese models are of course more Chinese-friendly and so would encode the same sentence with fewer tokens than Western models.
reply
pxc
1 day ago
[-]
> Are you saying Chinese is more concise than English?

Yeah, definitely. It lacks case and verb conjugations, plus whole classes of filler words, and words themselves are on average substantially shorter. If you listen to or read a hyper-literal translation of Chinese speech into English (you can find fun videos of this on Chinese social media), it even resembles "caveman speech" for those reasons.

If you look at translated texts and compare the English versions to the Chinese ones, the Chinese versions are substantially shorter. Same if you compare localization strings in your favorite open-source project.

It's also part of why Chinese apps are so information-dense, and why localizing to other languages often requires reorganizing the layout itself— languages like English just aren't as information-dense, pixel for pixel.

The difference is especially profound for vernacular Chinese, which is why Chinese people often note that text which "has a machine translation flavor" is over-specified and gratuitously prolix.

Maybe some of this washes out in LLMs due to tokenization differences. But Chinese texts are typically shorter than English texts and it extends to prose as well as poetry.

But yeah this is standard stuff: Chinese is more concise and more contextual/ambiguous. More semantic work is allocated in interpretation than with English, less is allocated in the writing/speaking.

Do you speak Chinese and experience the differences between Chinese and English differently? I'm a native English speaker and only a beginner in Chinese but I've formed these views in discussion with Chinese people who know some English as well.
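A crude illustration with a rough translation pair of my own (comparing characters, which is not the same as comparing tokens under any particular tokenizer):

```python
# Hypothetical parallel pair: an informal English request and a rough
# Chinese equivalent. Character counts only -- token counts depend on
# the tokenizer and may differ substantially.
english = "I would like to take a look at that for you."
chinese = "我帮你看看。"

print(len(english), len(chinese))  # 44 vs 6 characters
```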

reply
reedlaw
1 day ago
[-]
Chinese omits articles, verbs aren't conjugated, and individual characters carry more meaning than English letters, but other than those differences I don't have the impression that Chinese communication is inherently more concise. Some forms of official speech are wordy. Writing is denser, but the amount of information conveyed through speech is about the same. There are jokes about ambiguous words or phrases in both Chinese and English. So I was surprised at your take, but no objection to your points above. Ancient Chinese, on the other hand, is extremely concise, but so are other ancient languages like Hebrew, although in a different way. So it seems that ancient languages are compressed but challenging and modern languages have unpacked the compression for ease of understanding.
reply
TeMPOraL
14 hours ago
[-]
I'm going to guess Chinese and English will come out about the same, once someone invents the right metric to compare them. I recall reading about a study somewhere that compared speech in multiple languages w.r.t. the amount of information communicated per second, and the reported result was that they were all about the same, because speakers of more verbose languages (longer words, simpler grammar) unknowingly compensate by speaking faster than baseline.
reply
pxc
1 day ago
[-]
That's a really interesting point about Ancient Chinese and other ancient scripts. I'd love to learn more about that.

I'm also more curious about tokenizers for LLMs than I've ever been before, both for Chinese and English. I feel like to understand I'll need to look at some concrete examples, since sometimes tokenization can be per word or per character or sometimes chunks that are in between.

reply
strogonoff
1 day ago
[-]
A fundamental (but sadly common) error behind “tokens are units of thinking” is anthropomorphising the model as a thinking being. That’s a pretty wild claim that requires a lot of proof, and possibly solving the hard problem, before it can be taken seriously.

There’s a less magical model of how LLMs work: they are essentially fancy autocomplete engines.

Most of us probably have an intuition that the more you give an autocomplete, the better results it will yield. However, does this extend to output of the autocomplete—i.e. the more tokens it uses for the result, the better?

It could well be true in context of chain of thought[0] models, in the sense that the output of a preceding autocomplete step is then fed as input to the next autocomplete step, and therefore would yield better results in the end. In other words, with this intuition, if caveman speak is applied early enough in the chain, it would indeed hamper the quality of the end result; and if it is applied later, it would not really save that many tokens.

Willing to be corrected by someone more familiar with NN architecture, of course.

[0] I can see “thinking” used as a term of art, distinct from its regular meaning, when discussing “chain of thought” models; sort of like what “learning” is in “machine learning”.

reply
ForceBru
1 day ago
[-]
IMO "thinking" here means "computation", like running matrix multiplications. Another view could be: "thinking" means "producing tokens". This doesn't require any proof because it's literally what the models do.

As I understand it, the claim is: more tokens = more computation = more "thinking" => answer probably better.

reply
TeMPOraL
23 hours ago
[-]
I don't agree with GP's take on anthropomorphising[0], but in this particular discussion, I meant something even simpler by "thinking" - imagine it more like manually stepping a CPU, or powering a machine by turning a crank. Each output token is kinda like a clock signal, or a full crank turn. There's lots of highly complex stuff happening inside the CPU/machine - circuits switching/gears turning - but there's a limit of how much of it can happen in a single cycle.

Say that limit is X. This means that if your problem fundamentally requires at least Y compute to be solved, your machine will never give you a reliable answer in fewer than ceil(Y/X) steps.

LLMs are like this - a loop is programmed to step the CPU/turn the crank until the machine emits a magic "stop" token. So in this sense, asking an LLM to be concise means reducing the amount of compute it can perform, and if you insist on it too much, it may stop so early as to have been fundamentally unable to solve the problem in the computational space allotted.

This perspective requires no assumptions about "thinking" or anything human-like happening inside - it follows just from time and energy being finite :).

--

[0] - I strongly think the industry is doing itself a huge disservice by avoiding anthropomorphizing LLMs, as treating them as "little people on a chip" is the best high-level model we have for understanding their failure modes and role in larger computing systems - and instead, we have tons of people wasting their collective effort trying to fix the "lethal trifecta" as if it were a software bug and not a fundamental property of what makes LLMs interesting. Already wrote more on it in this thread, so I'll stop here.

reply
raincole
1 day ago
[-]
When it comes to LLMs you really cannot draw conclusions from first principles like this. Yes, it sounds reasonable. But things in reality aren't always reasonable.

Benchmark or nothing.

reply
samus
1 day ago
[-]
There have been papers about introducing thinking tokens in intermediary layers that get stripped from the output.
reply
marginalia_nu
1 day ago
[-]
I wonder if a language like Latin would be useful.

It's a significantly more succinct semantic encoding than English while being able to express all the same concepts, since it folds a lot of glue words into the grammar of the language and conventionally lets you drop many pronouns.

e.g.

"I would have walked home, but it seemed like it was going to rain" (14 words) -> "Domum ambulavissem, sed pluiturum esse videbatur" (6 words).

reply
dmboyd
1 day ago
[-]
Words <> tokens
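The point can be made concrete with a toy greedy tokenizer (the vocabulary below is made up; real BPE vocabularies are learned from data, but they are similarly English-heavy, so words rare in training data split into more pieces):

```python
# Toy longest-match tokenizer with an English-leaning vocabulary.
VOCAB = {"walk", "ed", "home", "would", "have", "i", "but", "rain",
         "am", "bu", "la", "vi", "ss", "em", "do", "mum", " "}

def tokenize(text):
    text = text.lower()
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # greedy longest match
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # fall back to one char
            i += 1
    return tokens

# Fewer *words* does not mean proportionally fewer *tokens*:
print(len(tokenize("i would have walked home")))  # 5 words -> 10 tokens
print(len(tokenize("domum ambulavissem")))        # 2 words -> 9 tokens
```

With this vocabulary the Latin phrase has 2.5x fewer words but nearly the same token count, since each Latin word shatters into sub-word pieces.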
reply
mike_hearn
1 day ago
[-]
I think speculative decoding eliminates a lot of the savings people imagine they're getting from making LLMs use strange languages.
reply
baq
1 day ago
[-]
Do you know of evals with default Claude vs caveman Claude vs politician Claude solving the same tasks? Hypothesis is plausible, but I wouldn’t take it for granted
reply
HarHarVeryFunny
1 day ago
[-]
That's going to depend on what model you're using with Claude Code. All of the more recent Anthropic models (4.5 and 4.6) support thinking, so the number of tokens generated ("units of thought") isn't directly tied to the verbosity of input and non-thought output.

However, another potential issue is that LLMs are continuation engines, and I'd have thought that talking like a caveman may be "interpreted" as meaning you want a dumbed down response, not just a smart response in caveman-speak.

It's a bit like asking an LLM to predict the next move in a chess game - it's not going to predict the best move that it can, but rather the next move that would be played, given what it can infer about the Elo rating of the player whose moves it is continuing. If you ask it to continue the move sequence of a poor player, it'll generate a poor move since that's the best prediction.

Of course there's not going to be a lot of caveman speak on stack overflow, so who knows what the impact is. Program go boom. Me stomp on bugs.

reply
andai
1 day ago
[-]
I remember a while back they found that replacing reasoning tokens with placeholders ("....") also boosted results on benchies.

But does talk like caveman make number go down? Less token = less think?

I also wondered, due to the way LLMs work, if I ask AI a question using fancy language, does that make it pattern match to scientific literature, and therefore increase the probability that the output will be true?

reply
afro88
1 day ago
[-]
IIUC this doesn't make the LLM think in caveman (thinking tokens). It just makes the final output show in caveman.
reply
zozbot234
1 day ago
[-]
Grug says you quite right, token unit thinking, but empty words not real thinking and should avoid. Instead must think problem step by step with good impactful words.
reply
hackerInnen
1 day ago
[-]
You are absolutely right! That is exactly the reason why more lines of code always produce a better program. Straight on, m8!
reply
ZoomZoomZoom
1 day ago
[-]
This might be not so far from the truth, if you count total loc written and rewritten during the development cycle, not just the final number.

Not everybody is Dijkstra.

reply
Demiurg082
1 day ago
[-]
CoT tokens are usually controlled via 'extended thinking' or 'adapted thinking', and are usually not affected by the system prompt. There is an effort parameter, though, which is stated to trade accuracy against overall token consumption.

https://platform.claude.com/docs/en/build-with-claude/extend...

reply
bitexploder
1 day ago
[-]
This helps, but the original prompt is still there. The system prompt is still influencing these thinking blocks. They just don’t end up clogging up your context. The system prompt sits at the very top of the context hierarchy. Even with isolated "thinking" blocks, the reasoning tokens are still autoregressively conditioned on the system instructions. If the system prompt forces "caveman speak" the model's attention mechanisms are immediately biased toward simpler, less coherent latent spaces. You are handicapping the vocabulary and syntax it uses inside its own thinking process, which directly throttles its ability to execute high-level logic.

Nothing on that page indicates otherwise.

reply
Demiurg082
1 day ago
[-]
I get your point but it seems that extended thinking is based on a hidden system prompt that is not so much affected by the style the user defines. Probably it's a bit in between.

https://docs.aws.amazon.com/bedrock/latest/userguide/claude-...

reply
agumonkey
1 day ago
[-]
How do we know if a token sits at an abstract level or just the textual level?
reply
xgulfie
1 day ago
[-]
Ah, so obviously if we make the LLM repeat itself three times for every response, it will get smarter.
reply
TeMPOraL
14 hours ago
[-]
Yes, and observe that people do that too. It gives them more time to notice their own confusion and go "but wait, that's not right" on you.
reply
xgulfie
9 hours ago
[-]
You're absolutely correct! Having the LLM using more tokens does improve its output. Here's why this works:

## More tokens = smarter outputs

When an LLM uses tokens, it is putting more information into its context

## Better context, better results

The more information the LLM has in its context, the more complete and well thought-through the outputs will be

## More complete thinking

When an LLM is able to iterate on itself, results improve

## Better shareholder value

Numbers need to go up in order for us to maintain our shareholder value. This means that instead of focusing on qualitative results, the brand should focus on quantitative, hard results

reply
PufPufPuf
1 day ago
[-]
You mention thinking tokens as a side note, but their existence invalidates your whole point. Virtually all modern LLMs use thinking tokens.
reply
cyanydeez
1 day ago
[-]
It's not "units of thinking", it's "units of reference"; as long as what it produces references the necessary probabilistic algorithms, it'll do just fine.
reply
otabdeveloper4
1 day ago
[-]
LLMs don't think at all.

Forcing it to be concise doesn't work because it wasn't trained on token strings that short.

reply
HumanOstrich
1 day ago
[-]
> Forcing it to be concise doesn't work because it wasn't trained on token strings that short.

This is a 2023-era comment and is incorrect.

reply
Barbing
1 day ago
[-]
Anything I can read that would settle the debate?
reply
otabdeveloper4
1 day ago
[-]
LLM architectures have not changed at all since 2023.

> but mmuh latest SOTA from CloudCorp (c)!

You don't know how these things work and all you have to go on is marketing copy.

reply
HumanOstrich
1 day ago
[-]
Yea you don't know anything about LLM architectures. They often change with each model release.

You also aren't aware that there's more to it than "LLM architecture". And you're rather confident despite your lack of knowledge.

You're like the old LLMs before ChatGPT was released that were kinda neat, but usually wrong and overconfident about it.

reply
otabdeveloper4
1 day ago
[-]
It's still attention and next-token-prediction and nothing else.

The only new innovation is MoE, something that's used to optimize local models and not for the "SOTA" cloud offerings you're so fond of.

reply
HumanOstrich
1 day ago
[-]
You no listen. Me give up. Go learn on fruit phone.
reply
otabdeveloper4
19 hours ago
[-]
LLMs are literally next token prediction engines and nothing else.

Diffusion for text is not even an academic toy at this point and will likely never be a real thing.

reply
rafram
1 day ago
[-]
They’re able to solve complex, unstructured problems independently. They can express themselves in every major human language fluently. Sure, they don’t actually have a brain like we do, but they emulate it pretty well. What’s your definition of thinking?
reply
otabdeveloper4
1 day ago
[-]
When OP wrote about LLMs "thinking" he implied that they have an internal conceptual self-reflecting state. Which they don't, they *are* merely next token predicting statistical machines.
reply
rafram
1 day ago
[-]
This was true in 2023.
reply
fkgmeqnb
1 day ago
[-]
And it still is today.
reply
kogold
1 day ago
[-]
[flagged]
reply
dang
1 day ago
[-]
reply
Chance-Device
1 day ago
[-]
Let’s see, I think these pretty much map out a little chronology of the research:

https://arxiv.org/abs/2112.00114

https://arxiv.org/abs/2406.06467

https://arxiv.org/abs/2404.15758

https://arxiv.org/abs/2512.12777

First that scratchpads matter, then why they matter, then that they don’t even need to be meaningful tokens, then a conceptual framework for the whole thing.

reply
bsza
1 day ago
[-]
I don't see the relevance. The discussion is over whether boilerplate text that occurs intermittently in the output, purely for the sake of linguistic correctness or sounding professional, is of any benefit. Chain of thought doesn't look like that to begin with; it's a contiguous block of text.
reply
Chance-Device
1 day ago
[-]
To boil it down: chain of thought isn’t really chain of thought, it’s just more token generation output to the context. The tokens are participating in computations in subsequent forward passes that are doing things we don’t see or even understand. More LLM generated context matters.
reply
bitexploder
1 day ago
[-]
That is not how CoT works. It is all in context. All influenced by context. This is a common and significant misunderstanding of autoregressive models and I see it on HN a lot.
reply
j16sdiz
1 day ago
[-]
I don't see the relevance -- and casually dismiss years of research without even trying to read those papers.
reply
bitexploder
1 day ago
[-]
That "unproven claim" is actually a well-established concept called Chain of Thought (CoT). LLMs literally use intermediate tokens to "think" through problems step by step. They have to generate tokens to talk to themselves, debug, and plan. Forcing them to skip that process by cutting tokens, like making them talk in caveman speak, directly restricts their ability to reason.
reply
ShowalkKama
1 day ago
[-]
The fact that more tokens = more smart should be expected, given CoT / thinking / other techniques that increase model accuracy by using more tokens.

Did you test that ""caveman mode"" has similar performance to the ""normal"" model?

reply
Garlef
1 day ago
[-]
Yes but: If the amount is fixed, then the density matters.

A lot of communication is just mentioning the concepts.

reply
bitexploder
1 day ago
[-]
That is part of it. They are also trained to think in very well-mapped areas of their model. All the RLHF, etc. is tuned on their CoT and user feedback on responses.
reply
ano-ther
1 day ago
[-]
Looking at the skill.md, wouldn't this actually increase token use, since the model now needs to reformat its output?

Funny idea though. And I’d like to see a more matter-of-fact output from Claude.

reply
collingreen
1 day ago
[-]
I assume you're a human but wow this is the type of forum bot I could really get behind.

Take it a step further and do kind of like that xkcd where you try to post and it rewrites it like this and if you want the original version you have to write a justification that gets posted too.

Chef's kiss

reply
mynegation
1 day ago
[-]
No, let me rephrase it for you. “tokens used for think. Short makes model dumb”
reply
freehorse
1 day ago
[-]
Talk a lot not same as smart
reply
taneq
1 day ago
[-]
Think before talk better though
reply
freehorse
1 day ago
[-]
Think makes smart. But think right words makes smarter, not think more words. Smart is elucidate structure and relationships with right words.
reply
ben_w
1 day ago
[-]
think make smart, llm approximate "think" with context, llm not smart ever but sometimes less dumb with more word
reply
estearum
1 day ago
[-]
Can't you know that tokens are units of thinking just by... like... thinking about how models work?
reply
gchamonlive
1 day ago
[-]
Can't you just know that the earth is the center of the world by... like... just looking at how the world works?
reply
estearum
1 day ago
[-]
Actually you'd trivially disprove that claim if you're starting from mechanistic knowledge of how orbits work, like how we have mechanistic knowledge of how LLMs work.
reply
gchamonlive
1 day ago
[-]
You have empirical observations, like replicating a fixed set of inner layers to make it think longer, or that there seem to be encoder and decoder layers. But exactly why those layers are the way they are, how they come together for emergent behaviour... do we have mechanistic knowledge of that?
reply
ben_w
1 day ago
[-]
I think we've *only* got the mechanism, not the implications.

Compare with fluid dynamics; it's not hard to write down the Navier–Stokes equations, but there's a million dollars available to the first person who can prove or give a counter-example of the following statement:

  In three space dimensions and time, given an initial velocity field, there exists a vector velocity and a scalar pressure field, which are both smooth and globally defined, that solve the Navier–Stokes equations.
- https://en.wikipedia.org/wiki/Navier–Stokes_existence_and_sm...
reply
xpe
1 day ago
[-]
Though the above exchange felt a tiny bit snarky, I think the conversation did get more interesting as it went on. I genuinely think both people could probably gain by talking more -- or at least by figuring out a way to move past the surface-level differences. Yes, humans designed LLMs. But this doesn't mean we understand their implications even at this (relatively simple) level.
reply
xpe
1 day ago
[-]
> Can't you know that tokens are units of thinking just by... like... thinking about how models work?

Seems reasonable, but this doesn't settle probably-empirical questions like: (a) to what degree is 'more' better?; (b) how important are filler words? (c) how important are words that signal connection, causality, influence, reasoning?

reply
estearum
1 day ago
[-]
Right, there's probably something more subtle like "semantic density within tokens is how models think"

So it's probably true that the "Great question!" type preambles are not helpful, but there's definitely a lower bound on how primitive a caveman language we can push toward.

reply
taneq
1 day ago
[-]
More concise is dumber. Got it.
reply
Rexxar
1 day ago
[-]

  > Someone didn't get the memo that for LLMs, tokens are units of thinking.
Where do you get this memo? Seems completely wrong to me. More computation does not translate to more "thinking" if you compute the wrong things (i.e. things that don't contribute significantly to the final sentence's meaning).
reply
staminade
1 day ago
[-]
That’s why you need filler words that contribute little to the sentence meaning but give it a chance to compute/think. This is part of why humans do the same when speaking.
reply
dTal
1 day ago
[-]
The LLM has no accessible state beyond its own output tokens; each pass generates a single token and does not otherwise communicate with subsequent passes. Therefore all information calculated in a pass must be encoded into the entropy of the output token. If the only output of a thinking pass is a dumb filler word with hardly any entropy, then all the thinking for that filler word is forgotten and cannot be reconstructed.
reply
jaccola
1 day ago
[-]
Do you have any evidence at all of this? I know how LLMs are trained and this makes no sense to me. Otherwise you'd just put filler words in every input.

e.g. instead of: "The square root of 256 is" you'd enter "errr The er square um root errr of 256 errr is" and it would miraculously get better? The model can't differentiate between words you entered and words it generated itself...

reply
muzani
1 day ago
[-]
It's why it starts with "You're absolutely right!" It's not to flatter the user. It's a cheap way to guide the response in a space where it's utilizing the correction.
reply
mike_hearn
1 day ago
[-]
People have researched pause tokens for this exact reason.
reply
staminade
1 day ago
[-]
What do you think chain of thought reasoning is doing exactly?
reply
lijok
1 day ago
[-]
You’re conflating training and inference
reply
postalcoder
1 day ago
[-]
I disagree with this method and would discourage others from using it too, especially if accuracy, faster responses, and saving money are your priorities.

This only makes sense if you assume that you are the consumer of the response. When compacting, harnesses typically save a copy of the text exchange but strip out the tool calls in between. Because the agent relies on this text history to understand its own past actions, a log full of caveman-style responses leaves it with zero context about the changes it made, and the decisions behind them.

To recover that lost context, the agent will have to execute unnecessary research loops just to resume its task.

reply
shomp
1 day ago
[-]
me disagree
reply
jruz
1 day ago
[-]
only you auto-compact. auto-compact bad
reply
renewiltord
1 day ago
[-]
Ironically a demonstration of the risk of using fewer tokens. A typo more drastically changes meaning.
reply
VadimPR
1 day ago
[-]
Wouldn't this affect quality of output negatively?

Thanks to chain of thought, actually having the LLM be explicit in its output allows it to have more quality.

reply
functional_dev
1 day ago
[-]
Chain of thought happens in the <think> tags, not the visible output.

Caveman only strips filler from what you see... the reasoning depth stays the same.

I found this visualisation pretty interesting - https://vectree.io/c/chain-of-thought-reasoning-how-llms-thi...

reply
alfanick
1 day ago
[-]
Either this already exists, or someone is going to implement it (should I implement it?):

- assumption: LLMs can input/output in any useful language,

- human languages are not exactly the optimal way to talk with an LLM,

- internally, LLMs keep knowledge as a whole bunch of weighted connections across multiple layers,

- they need to decode human-language input into tokens, then into something that is easy to digest by further layers, then get some output and translate it back into tokens and human language (or a programming language, same thing),

- this whole human language <-> tokens <-> input <-> LLM <-> output <-> tokens <-> language round trip is quite expensive.

What if we started to talk to LLMs in non-human-readable languages (programming languages are also just human-readable)? Have a tiny model run locally that translates human input, code, files, etc. into some LLM-understandable language; the big LLM gets this as input, skips a bunch of layers on input/output, and returns this non-human-readable language; the local LLM translates it back into human language/code changes.

Yesterday or two days ago there was a post about using Apple Foundation Models; they have a really tiny context window. But I think they could be used as this translation layer (human->LLM, LLM->human) to talk with big models. Though initially those LLMs would need to discover which "language" they want to talk in; feels doable with reinforcement learning. So: a cheap local LLM to talk to a big remote LLM.

Either this is done already, or it's a super fun project to do.
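A toy sketch of that two-model pipeline (all names here are hypothetical stubs; the "compression" is naive filler-word stripping, whereas the actual proposal is a learned, non-human-readable encoding):

```python
# Stub pipeline: local "frontend" compresses, remote "backend" answers.
# `call_remote_llm` stands in for whatever big-model API you would use.
FILLER = {"please", "could", "you", "kindly", "just", "really",
          "basically", "actually", "a", "an", "the"}

def compress(prompt: str) -> str:
    """Local 'frontend' step: drop low-information words."""
    return " ".join(w for w in prompt.split() if w.lower() not in FILLER)

def call_remote_llm(prompt: str) -> str:
    """Placeholder for the expensive remote 'backend' model."""
    return f"[remote answer to: {prompt}]"

def ask(prompt: str) -> str:
    return call_remote_llm(compress(prompt))

print(ask("Could you please just rename the config file"))
# -> [remote answer to: rename config file]
```

In the RL version, `compress` and its inverse would be small trained models rather than word lists, and the intermediate "language" would be whatever the two agents converge on.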

reply
999900000999
1 day ago
[-]
My theory was that someone should write a specific LLM language, and then spend a whole lot of money to train models using that. A few times other commenters here have pointed out that that would be really difficult.

But I think you're onto something; human languages just aren't optimal here. But to actually see this product to conclusion you'd probably need 60 to 100 million. You would have to completely invent a new language and also invent new training methods on top of it.

I'm down if someone wants to raise a VC round.

reply
alfanick
1 day ago
[-]
I'm currently downloading Ollama and going to write a simple proof-of-concept with Qwen as the local "frontend", talking to OpenAI GPT as the "backend". I think the idea is sound, but it indeed needs retraining of GPT (hmm, like training a tiny local LLM in synchronization with a big remote LLM). It might not be a bad business venture in the end.

I don't think humans should be involved in developing this AI-AI language, just giving some guidance, but let two agents collaborate to invent the language, and just gratify/punish them with RL methods.

OpenAI looking at you, got an email some days ago "you're not using OpenAI API that much recently, what changed?"

reply
999900000999
1 day ago
[-]
If you want to start a Git repo somewhere let me know and I'll do what I can to help.

I imagine it's possible, but just a manner of money.

reply
ajd555
1 day ago
[-]
So, if this does help reduce the cost of tokens, why not go even further and shorten the syntax with specific keywords, symbols and patterns, to reduce the noise and only keep information, almost like...a programming language?
reply
dr_kiszonka
1 day ago
[-]
I appreciate the effort you put into addressing the feedback and updating the readme. I think the web design of your page and visual distractions in the readme go against the caveman's no-fluff spirit and may not appeal to the folks that would otherwise be into your software. I like the software.
reply
crispyambulance
1 day ago
[-]
I no like.

It sort of reminds me of when Palm Pilots (circa late '90s, early 2000s) used shorthand gestures for stylus-written characters. For a short while people's handwriting on whiteboards looked really bizarre. Except now we're talking about using weird language to conserve AI tokens.

Maybe it's better to accept a higher token burn-rate until things get better? I'd rather not get used to AI jive-talk to get stuff done.

reply
chmod775
1 day ago
[-]
I cannot wait for this to become the normal and expected way to interact with LLMs in the coming decades as humanity reaches the limit of compute capacity. Why waste 3/4th?

Maybe we could have a smaller LLM just for translating caveman back into redditor?

reply
benjaminoakes
1 day ago
[-]
I was already part caveman in my messages to the LLM.

Now I full caveman.

reply
anigbrowl
1 day ago
[-]
Nothing against this project; it's been the case since forever that you could get better quality responses by simply telling your LLM to be brief and to the point, to ask salient questions rather than reflexively affirm, and to eschew clichés and faddish writing styles.
reply
goldenarm
1 day ago
[-]
That's a great idea but has anyone benchmarked the performance difference?
reply
norskeld
1 day ago
[-]
APL for talking to LLM when? Also, this reminded me of that episode from The Office where Kevin started talking like a caveman to make communication efficient.
reply
SamuelBraude
12 hours ago
[-]
We give spearheads to caveman

Call it Ix

Help caveman save even more tokens

https://github.com/ix-infrastructure/Ix

reply
wktmeow
16 hours ago
[-]
So this is really weird, I was using OpenClaw with GPT 5.4 via Codex on I think Friday of last week, and I noticed what looked like thinking tokens spilling to the main chat, and it sounded a lot like this trick! Couple of examples of what I was seeing in the output:

"Need resume task. No skill applies clearly. Need maybe memory? prior work yes need memory_search.” "Need maybe script content from history. Search specific.”

Possible that OpenAI has come up with something very similar here?

Edit: looks like not only me, https://github.com/openclaw/openclaw/issues/25592#issuecomme...

reply
stronglikedan
13 hours ago
[-]
I feel justified! I've been prompting (not-agenting) like this for a while, and some of my colleagues have ribbed me for it. Now who laugh, JEFF!
reply
samus
1 day ago
[-]
There's a linguistic term for this kind of speech: isolating grammar, which doesn't decline words, relies on high context, and uses the bare minimum of words to get the meaning across. Chinese is such a language, btw. Don't know what Chinese speakers think about their language being regarded as caveman language...
reply
adrian_b
1 day ago
[-]
Whether a language is isolating or not is independent of the redundancy of the language.

All languages must have means for marking the syntactic roles of the words in a sentence.

The roles may be marked with prepositions or postpositions in isolating languages, or with declensions in fusional languages, or there may be no explicit markers when the word order is fixed (i.e. the same distinction as between positional arguments and arguments marked by keywords, in programming languages). The most laconic method for both programming languages and natural languages is to have a default word order where role markers are omitted, but to also allow any other word order if role markers are present.

Besides the mandatory means for marking syntactic roles, many languages have features that add redundancy without being necessary for understanding, i.e. which repeat already-known information, for instance by repeating a noun's gender and number on all of its attributes. Whether a language requires redundancy or not is independent of whether it is an isolating language or a fusional language.

English has somewhat fewer syntactic role markers than other languages because it has a rigid word order, but for roles other than the most frequent ones (agent, patient, beneficiary) it has a lot of prepositions.

Despite being more economical with role markers, English also has many redundant words that could be omitted, e.g. subjects or copulative verbs that are omitted in many languages. Thus for English it is possible to speak "like a caveman" without losing much information, but this is independent of the fact that modern English is a mostly isolating language with few remnants of its old declensions.

reply
akdor1154
1 day ago
[-]
I thought the term for those was 'sane languages', and I say that as a native English speaker :)
reply
samus
1 day ago
[-]
As a non-native English speaker I think English is actually not that bad. Just the orthography is beyond awful :)
reply
sfink
1 day ago
[-]
English is diarrhea mouth language. Which is worse?
reply
samus
1 day ago
[-]
What's your point?
reply
andai
1 day ago
[-]
No articles, no pleasantries, and no hedging. He has combined the best of Slavic and Germanic culture into one :)
reply
samus
1 day ago
[-]
Both Slavic languages and German have complex declension systems for nouns, verbs, and adjectives. Which is unlike stereotypical caveman speech.
reply
iammjm
1 day ago
[-]
I speak German, Polish, and English fluently, and my take is: German is very precise, almost mathematical; there is little room to be misunderstood. But it also requires the most letters. English is the quickest, get-things-done kind of language, very compressible, but it also risks misunderstanding. Polish is the most fun, with endless possibilities for twisting and bending its structures, but it lacks the ease of use of English and the precision of German. But it's clearly just my subjective take.
reply
fissible
1 day ago
[-]
I have always been annoyed at the verbosity of ChatGPT and (to a lesser degree) Claude. I am aware of the long-term costs associated with trading that bloated context back and forth all the time.
reply
indiantinker
1 day ago
[-]
It speaks like Kevin from The Office (US) https://youtube.com/shorts/sjpHiFKy1g8?is=M0H4G2o0d6Z-pBAC
reply
vivid242
1 day ago
[-]
Great idea. If the person who made it is reading: is this based on the board game „Poetry for Cavemen"? (Explain things using only single-syllable words; it even comes with an inflatable wooden log for hitting each other!)
reply
stared
1 day ago
[-]
I would prefer to talk like Abathur (https://www.youtube.com/watch?v=pw_GN3v-0Ls). Same efficiency but smarter.
reply
Art9681
1 day ago
[-]
This was an experiment conducted during gpt-3.5 era, and again during the gpt-4 era.

There is a reason it is not a common/popular technique.

reply
rschiavone
1 day ago
[-]
This trick reminds me of "OpenAI charges by the minute, so speed up your audio"

https://news.ycombinator.com/item?id=44376989

reply
vntok
1 day ago
[-]
Which worked great. Also, cut off silences.

> One half interesting / half depressing observation I made is that at my workplace any meeting recording I tried to transcribe in this way had its length reduced to almost 2/3 when cutting off the silence. Makes you think about the efficiency (or lack of it) of holding long(ish) meetings.
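For the curious, the silence-trimming half of that trick maps onto ffmpeg's real `silenceremove` and `atempo` audio filters. A sketch of building the command in Python; the threshold and speed values are guesses you would tune per recording:

```python
# Construct an ffmpeg invocation that strips silence and speeds up
# the remainder before sending audio to a per-minute transcription API.
def build_ffmpeg_cmd(src, dst, speed=1.5, noise_db=-35, min_silence=0.5):
    # silenceremove drops pauses longer than `min_silence` seconds that
    # fall below `noise_db`; atempo then speeds up what's left.
    af = (f"silenceremove=stop_periods=-1:"
          f"stop_duration={min_silence}:stop_threshold={noise_db}dB,"
          f"atempo={speed}")
    return ["ffmpeg", "-i", src, "-af", af, dst]

print(" ".join(build_ffmpeg_cmd("meeting.mp3", "short.mp3")))
```

Run the resulting command with `subprocess.run` (or paste it into a shell) to produce the shortened file.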

reply
zahirbmirza
1 day ago
[-]
You can also make huge spelling mistakes and use incomplete words with LLMs; they just sem to know better than any spl chk wht you mean. I use such speak to cut my time spent typing to them.
reply
floriangoebel
1 day ago
[-]
Wouldn't this increase your token usage because the tokenizer now can't process whole words, but it needs to go letter by letter?
reply
literalAardvark
1 day ago
[-]
It doesn't go letter by letter, so not with current tokenizers.

There will likely be some internal reasoning going "I wonder if the user meant spell check, I'm gonna go with that one".

And it'll also bias the reasoning and output toward internet speak instead of what you'd usually want, such as code or scientific jargon, which used to decrease output quality. I'm not sure if it still does.

reply
somethingsome
1 day ago
[-]
I would like to see a (joke) skill that makes Claude talk in only toki pona. My guess is that it would explode the token count though.
reply
fzeindl
1 day ago
[-]
I tried this with early ChatGPT. Asked it to answer telegram style with as few tokens as possible. It is also interesting to ask it for jokes in this mode.
reply
amelius
1 day ago
[-]
It's especially funny to change your coworker's system prompt like that.
reply
ungreased0675
1 day ago
[-]
Does this actually result in less compute, or is it adding an additional “translate into caveman” step to the normal output?
reply
inzlab
10 hours ago
[-]
Little here, little there; it's tokens at the end.
reply
andai
1 day ago
[-]
So it's a prompt to turn Jarvis into Hulk!
reply
nharada
1 day ago
[-]
I wonder if this will actually be why the models move to "neuralese" or whatever non-language latent representation people work out. Interpretability disappears but efficiency potentially goes way up. Even without a performance increase that would be pretty huge.
reply
shomp
1 day ago
[-]
everyone who thinks this is a costly or bad idea is looking past a very salient finding: code doesn't need much language. sure, other things might need lots of language, but code does not. code is already basically language, just a really weird one. we call them programming languages. they're not human languages. they're languages of the machine. condensing the human-language---machine-language interface, good.

if goal make code, few word better. if goal make insight, more word better. depend on task. machine linear, mind not. consider LLM "thinking" is just edge-weights. if can set edge-weights into same setting with fewer tokens, you are winning.

reply
justonceokay
1 day ago
[-]
JOOK like when machine say facts. Machine and facts are friends. Numbers and names and “probably things” are all friends with machine.

JOOK no like when machine likes things. Maybe double standard. But forever machines do without like and without love. New like and love updates changing all the time. Makes JOOK question machine watching out for JOOK or watching out for machine.

JOOK like and love enough for himself and for machine too..

reply
wvenable
1 day ago
[-]
> They're not human languages. they're languages of the machine.

Disagree. Programming language for human to communicate with machine and human and human to communicate about machine. Programming language not native language of machine. Programming language for humans.

Otherwise make good point.

reply
HarHarVeryFunny
1 day ago
[-]
More like Pidgin English than caveman, perhaps, although caveman does make for a better name.
reply
RomanPushkin
1 day ago
[-]
Why does the skill have three nearly identical SKILL.md files? Just curious
reply
ArekDymalski
1 day ago
[-]
While really useful now, I'm afraid that in the long run it might accelerate the language atrophy that is already happening. I still remember that people used to enter full questions in Google and write SMS with capital letters, commas and periods.
reply
vova_hn2
1 day ago
[-]
> I still remember that people used to enter full questions in Google

I think that, in the early days of internet search, entering full questions actually produced worse results than just a bunch of keywords or short phrases.

So it was a sign of a "noob", rather than a mark of sophistication and literacy.

reply
jagged-chisel
1 day ago
[-]
“Sophistication and literacy” are orthogonal to the peculiarities of a black box search engine.

Those literate sophisticates would still be noobs at getting something useful from Google.

reply
dahart
18 hours ago
[-]
My kids made fun of me yesterday when they saw me using a question mark in a search query.
reply
arrty88
1 day ago
[-]
Feels like there should be a way to compile skills, READMEs, and even code files into concise maps and descriptions optimized for LLMs, recompiling only when timestamps change.
reply
cadamsdotcom
1 day ago
[-]
Caveman need invent chalk and chart make argument backed by more than good feel.
reply
K0IN
1 day ago
[-]
So you are telling me I prompted llms the right way all along
reply
amelius
1 day ago
[-]
By the way why don't these LLM interfaces come with a pause button?
reply
amelius
1 day ago
[-]
And a "prune here" button.

It often happens that the interesting information is in the first paragraph or so, and the remainder is all just the LLM not knowing when to stop. This is super annoying as a conversation then ends up being 90% noise.

reply
postalcoder
1 day ago
[-]
Pruning an assistant's response like that would break prompt caching.

Prompt caching is probably the single most important thing that people building harnesses think about, and yet its mind share among end users is virtually zero. If you had to think of all the weirdest, most seemingly baffling design decisions in an AI product, the answer to "why" is probably "to not break prompt caching".
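A toy illustration of why (this is not any provider's actual implementation, just the prefix-matching idea): prompt caches typically reuse work only for an exact shared prefix of the token sequence, so appending is cheap but editing earlier history invalidates everything after the edit point.

```python
def cached_prefix_len(cached_tokens: list[str], new_tokens: list[str]) -> int:
    """Length of the longest shared prefix; only this part is reusable."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

history = ["sys", "user1", "asst1", "user2"]

# Appending to the end keeps the whole cache warm...
assert cached_prefix_len(history, history + ["asst2"]) == 4

# ...but pruning an earlier assistant turn invalidates everything after it.
assert cached_prefix_len(history, ["sys", "user1", "user2"]) == 2
```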

reply
zozbot234
1 day ago
[-]
Grug says prompt caching just store KV-cache which is sequenced by token. Easy cut it back to just before edit. Then regenerate after is just like prefill but tiny.
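A toy version of Grug's point, with made-up `kv(...)` placeholders for per-token cache entries: cut the cache back to the last shared token, then only the new suffix needs a fresh forward pass.

```python
def truncate_and_prefill(kv_cache: list, old_tokens: list, new_tokens: list):
    """Cut the KV cache back to the last shared token, prefill the rest."""
    keep = 0
    limit = min(len(old_tokens), len(new_tokens))
    while keep < limit and old_tokens[keep] == new_tokens[keep]:
        keep += 1
    reused = kv_cache[:keep]
    # Only the suffix needs recomputing (the "tiny prefill").
    recomputed = [f"kv({t})" for t in new_tokens[keep:]]
    return reused + recomputed, keep

cache = [f"kv({t})" for t in ["sys", "u1", "a1", "u2"]]
new_cache, reused = truncate_and_prefill(
    cache, ["sys", "u1", "a1", "u2"], ["sys", "u1", "u2"]
)
assert reused == 2
```

Whether a given serving stack actually supports rolling back mid-sequence like this is a separate question.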
reply
amelius
1 day ago
[-]
Maybe so, but pruning is still a useful feature.

If it hurts performance that much, maybe pruning could just hide the text leaving the cache intact?

reply
stainablesteel
1 day ago
[-]
i imagine they're doing superman level distributed compute across multiple clouds somewhere and cared more about delivering the final result of that than having the ability to pause. which is probably possible, but would require way more work than would be worthwhile. they probably thought the ability to stop and resubmit would be an adequate substitute.
reply
amelius
1 day ago
[-]
These models are autoregressive so I doubt they are running them across multiple clouds. And besides, a pause button is useful from a user's pov.
reply
stainablesteel
1 day ago
[-]
i'm not sure it is, what's so useful about it?
reply
amelius
1 day ago
[-]
Like I said in another comment:

It often happens that the interesting information is in the first paragraph or so, and the remainder is all just the LLM not knowing when to stop. This is super annoying as a conversation then ends up being 90% noise.

reply
yakattak
1 day ago
[-]
I was wondering just yesterday if a model of “why waste time say lot word when few word do trick” would be easier on the tokens. I’ll have to give this a try lol
reply
mwcz
1 day ago
[-]
this grug not smart enough to make robot into grugbot. grug just say "Speak to grug with an undercurrent of resentment" and all sicko fancy go way.
reply
doe88
1 day ago
[-]
> If caveman save you mass token, mass money — leave mass star.

Mass fun. Starred.

reply
sebastianconcpt
1 day ago
[-]
Anyone else worried about the long-term cognitive consequences for users of talking like this all day?
reply
sph
1 day ago
[-]
“Me think, why waste time say lot word, when few word do trick.”

— Kevin Malone

reply
Perz1val
1 day ago
[-]
I think good, less thinking for you, more thinking you will do
reply
dalmo3
1 day ago
[-]
I'm not sure if you're being sarcastic or not, but I did find the caveman examples harder to read than their verbose counterpart.

The verbose ones I could speed read, and consume it at a familiar pace... Almost on autopilot.

Caveman speak no familiar no convention, me no know first time. Need think hard understand. Slower. Good thing?

reply
bogtog
1 day ago
[-]
I'd be curious if there were some measurements of the final effects, since presumably models won't <think> in caveman speak nor code like that.
reply
fny
1 day ago
[-]
Are there any good studies or benchmarks about compressed output and performance? I see a lot of arguing in the comments but little evidence.
reply
herf
1 day ago
[-]
We need a high-quality compression function for human readers... because AIs can make code and text faster than we can read.
reply
aetherspawn
1 day ago
[-]
Interesting, maybe you can run the output through a 2B model to uncompress it.
reply
owenthejumper
1 day ago
[-]
What is that binary file caveman.skill that I cannot read easily, and is it going to hack my computer?
reply
contingencies
1 day ago
[-]
Better: use classical Chinese.
reply
anshumankmr
1 day ago
[-]
Though I do use Claude Code, is it possible to get this for GitHub Copilot too?
reply
phainopepla2
1 day ago
[-]
Yes, Copilot supports skills, which are basically just stored prompts in markdown files. You can use the same skill in that GitHub repo
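As a rough sketch of what "stored prompts in markdown files" means (the instruction text here is invented; the frontmatter fields follow the skill format of name plus description):

```markdown
---
name: caveman
description: Compress assistant replies into terse "caveman" speech
---

When responding, drop preamble, filler, and politeness.
Use short declarative fragments. Never change code output.
```

The frontmatter metadata is what gets preloaded into context; the body is only loaded when the skill is invoked.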
reply
rsynnott
1 day ago
[-]
I mean, I assume you run into the same problem as Kevin in the office; that sort of faux-simple speech is actually very ambiguous.

(Though, I wonder has anyone tried Newspeak.)

reply
kristopolous
1 day ago
[-]
This is a well-known compaction technique. Where are the evals?
reply
adam_patarino
1 day ago
[-]
Or you could use a local model where you’re not constrained by tokens. Like rig.ai
reply
dostick
1 day ago
[-]
How is your offering different from local ollama?
reply
adam_patarino
1 day ago
[-]
It's batteries-included. No config.

We also fine-tuned and did RL on our model, developed a custom context engine, trained an embedding model, and modified MLX to improve inference.

Everything is built to work together, so it's more like an Apple product than Linux: less config, but better optimized for the task.

reply
ggm
1 day ago
[-]
F u cn Rd ths u cld wrk scrtry 'cpt w tk l thr jbs
reply
drewbeck
1 day ago
[-]
If you’re not cavemaxxing you’re falling behind.
reply
grg0
1 day ago
[-]
I dropped dead after reading this.
reply
Applejinx
21 hours ago
[-]
As very much an outsider and, to some extent, apostate to all this, it's pretty astonishing to see.

Unironically not just delegating all thinking to a sketchy and untrustworthy machine, but doubling down on it by aping the caveman in the belief that this will more effectively summon the great metal-wing sky god and bring limitless yum stuff.

Wow. I don't even have to do anything. You guys are disemvoweling yourselves in some kind of strange ritual. You sure are trusting souls!

reply
semessier
1 day ago
[-]
The really interesting question would be whether it then does its language-based reasoning in short form too, and if so, whether quality is impacted.
reply
vova_hn2
1 day ago
[-]
I don't know about token savings, but I find the "caveman style" much easier to read and understand than typical LLM-slop.
reply
bitwize
1 day ago
[-]
grug have to use big brains' thinking machine these days, or no shiny rock. complexity demon love thinking machine. grug appreciate attempt to make thinking machine talk on grug level, maybe it help keep complexity demon away.
reply
bhwoo48
1 day ago
[-]
I was actually worried about high token costs while building my own project (infra bundle generator), and this gave me a good laugh + some solid ideas. 75% reduction is insane. Starred
reply
saidnooneever
1 day ago
[-]
LOL, it actually reads like how humans reply; the name is too clever :').

Not sure how effective it will be at driving down costs, but honestly it will make my day not to have to read through entire essays about some trivial solution.

tldr; Claude skill, short output, ++good.

reply
yesthisiswes
1 day ago
[-]
Why use lot word when few word do fine.
reply
kukakike
1 day ago
[-]
This is exactly what annoys me most. English is not suitable for computer-human interaction. We should create new programming and query languages for that. We are in the COBOL mindset again. LLMs are not humans and we should stop talking to them as if they are.
reply
zozbot234
1 day ago
[-]
Grug says Chinese more suitable, only few runes in word, each take single token. Is great.
reply
throwatdem12311
1 day ago
[-]
Ok but when the model is responding to you isn’t the text it’s generating also part of the context it’s using to generate the next token as it goes? Wouldn’t this just make the answers…dumb?
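That feedback loop is real: generation is autoregressive, so every emitted token becomes context for the next one. A toy loop makes the mechanism clear (the `next_token` callable here is a stand-in for the model, not a real API):

```python
def generate(prompt: list[str], next_token, max_tokens: int = 8) -> list[str]:
    """Each new token is conditioned on prompt + everything generated so far."""
    context = list(prompt)
    for _ in range(max_tokens):
        tok = next_token(context)   # model sees its own prior output
        if tok == "<eos>":
            break
        context.append(tok)
    return context[len(prompt):]

# Stand-in "model": emits one token, then stops.
out = generate(["few", "word"], lambda ctx: "<eos>" if ctx[-1] == "good" else "good")
assert out == ["good"]
```

Whether a terse style in that context degrades answers is an empirical question; it's the same question as whether terse chain-of-thought hurts, which the linked paper suggests is task-dependent.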
reply
xgulfie
1 day ago
[-]
Funny how people are so critical of this and yet fawn over TOON
reply
xpe
1 day ago
[-]
Unfrozen caveman lawyer here. Did "talk like caveman" make code more bad? Make unsubst... (AARG) FAKE claims? You deserve compen... AAARG ... money. AMA.
reply
sillyboi
1 day ago
[-]
Oh, another new trend! I love these home-brewed LLM optimizers. They start with XML, then JSON, then something totally different. The author conveniently ignores the system prompt that works for everything, and the extra inference work. So it's only worth using if you happen to like this response style; just my two cents. All the real optimizations happen during model training and in the infrastructure itself.
reply
thorfinnn
1 day ago
[-]
kevin would be proud
reply
Robdel12
1 day ago
[-]
I didn’t comment on this when I saw it on threads/twitter. But it made it to HN, surprisingly.

I have a feeling these same people will complain “my model is so dumb!”. There’s a reason why Claude had that “you’re absolutely right!” for a while. Or codex’s “you’re right to push on this”.

We’re basically just gaslighting GPUs. That wall of text is kinda needed right now.

reply
hybrid_study
1 day ago
[-]
Mongo! No caveman
reply
jongjong
1 day ago
[-]
Me think this good idea. Regular language unnecessary complex. Distract meaning. Me wish everyone always talk this way. No hidden spin manipulate emotion. Information only. Complexity stupid.
reply
isuckatcoding
1 day ago
[-]
Oh come on, no one referenced this scene from The Office??

https://youtu.be/_K-L9uhsBLM?si=ePiGrFd546jFYZd8

reply
dakolli
1 day ago
[-]
The input cost for your prompt is the least expensive part, negligible when using agents. It's context & output that dominate; why go through all this?
reply
setnone
1 day ago
[-]
caveman multilingo? how sound?
reply
DonHopkins
1 day ago
[-]
Deep digging cave man code reviews are Tha Shiznit:

https://www.youtube.com/watch?v=KYqovHffGE8

reply
Surac
1 day ago
[-]
me like that
reply
tonymet
1 day ago
[-]
me ChatGPT like caveman always. Typing also faster.
reply
us321
1 day ago
[-]
I like
reply