Transformers appear to have discrete "reasoning circuits" — contiguous blocks of 3-4 layers that act as indivisible cognitive units. Duplicate the right block and the model runs its reasoning pipeline twice. No weights change. No training. The model just thinks longer.
The results on standard benchmarks (lm-evaluation-harness, n=50):
Devstral-24B, layers 12-14 duplicated once:

- BBH Logical Deduction: 0.22 → 0.76
- GSM8K (strict): 0.48 → 0.64
- MBPP (code gen): 0.72 → 0.78
- Nothing degraded
Qwen2.5-Coder-32B, layers 7-9 duplicated once:

- Reasoning probe: 76% → 94%
The weird part: different duplication patterns create different cognitive "modes" from the same weights. Double-pass boosts math. Triple-pass boosts emotional reasoning. Interleaved doubling (13,13,14,14,15,15,16) creates a pure math specialist. Same model, same VRAM, different routing.
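For concreteness, here's a rough sketch of what "duplicate a block" means mechanically. This is not the repo's code (the repo operates on GGUF); it's a hypothetical HF-transformers equivalent, and attribute paths like `model.model.layers` / `self_attn.layer_idx` assume a Llama/Mistral-style architecture:

```python
# Hypothetical sketch: re-run decoder layers 12-14 a second time, no retraining.
# A pure-routing version would reuse the same module (no extra VRAM), but HF's
# KV cache keys on layer_idx, so this sketch deep-copies the repeated block.
import copy
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/devstral-24b",  # placeholder
                                             torch_dtype=torch.bfloat16)
layers = model.model.layers
# New execution order: 0..14, then 12..14 again, then 15..end.
order = list(range(15)) + [12, 13, 14] + list(range(15, len(layers)))

seen, new_layers = set(), []
for new_idx, old_idx in enumerate(order):
    # Repeat visits get a copy so each position owns its own KV-cache slot.
    layer = copy.deepcopy(layers[old_idx]) if old_idx in seen else layers[old_idx]
    layer.self_attn.layer_idx = new_idx  # HF caches are indexed by layer_idx
    new_layers.append(layer)
    seen.add(old_idx)

model.model.layers = torch.nn.ModuleList(new_layers)
model.config.num_hidden_layers = len(new_layers)
```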
The circuit boundaries are sharp — shift by one layer and the effect disappears or inverts. Smaller models (24B) have tighter circuits (3 layers) than larger ones (Ng found 7 layers in 72B).
Tools to find circuits in any GGUF model and apply arbitrary layer routing are in the repo. The whole thing — sweep, discovery, validation — took one evening.
Happy to answer questions.
> Transformers appear to have discrete "reasoning circuits" — contiguous blocks of 3-4 layers that act as indivisible cognitive units. Duplicate the right block and the model runs its reasoning pipeline twice. No weights change. No training. The model just thinks longer.
How did you not expect that if you read his post? That's literally what he discovered, two years ago.
For anyone interested, there's more meat in the post and comments from last week: https://news.ycombinator.com/item?id=47322887
As far as I can see that's not implied by the original post.
But that's beside the point: quoting the bit where the poster says "here's what I'm building on top of" and using that to imply they haven't done anything new is a bit pointless, no?
Considering this, I think (again, assuming the benchmarks themselves are sound) the most plausible explanation for the observations is that (1) the layers being duplicated are close to the identity function on most inputs; (2) something happened to the model in training (RLHF?) that forcefully degraded its reasoning performance; and (3) the mechanism causing the degradation involves the duplicated layers, so duplicating them breaks the reasoning-degrading mechanism (e.g. by clobbering a "refusal" "circuit" that emerged in post-training).
More concisely, I'm positing that this is an approach that can only ever break things, and rather than boosting reasoning, it is selectively breaking things deleterious to reasoning.
Right, I had the same thought.
Even if the output was in the same "format", does the LLM even have any way to know which order the outputs will go in? The ordering of the nodes is part of our representation of the network, it's not fundamental to it.
It would be like shuffling the bytes in a PNG file and expecting the program still to understand it as a PNG file.
The more I think about this, the more I don't get this at all.
This is likely to be shaped by tied embeddings and skips on one end, and maybe training pressures on the other.
The very top and the very bottom of the FF stack both reflect the same token embeddings, and this propagates through the model, setting up a shared identity space. Skip connections propagate it through the layers. No explicit shared identity is imposed, but there is an implicit one set by the architecture. Fairly well established.
(Now: highly speculative! Attention over past tokens creates an implicit "robustness/convergence" pressure? The model can't be "certain" if it'll have access to the right representations at a given layer, because representations depend not just on the past layers, but also on the highly uncertain contents of previous tokens as passed through attention. Which in turn depends on more of the same, increasing variance further. So the training causes: "each layer can't be certain of what it will have access to, so it develops to refine anything it currently has access to in a convergent fashion, because that's what's useful under pressure of attention-induced uncertainty".)
LLMs are notoriously nonfragile, and robust to perturbations. Far more so if you anneal with SFT/distillation after your model surgery, although this wasn't done here. Plenty of weird franken-LLM experiments prove that empirically.
So I'm not too surprised to find that someone has managed to improve benchmark performance on a few narrow tasks by duplicating a few middle layers. "Duplicating a few layers that were doing convergent iterative refinement benefits a few tasks that suffered from insufficient depth of convergent iterative refinement" is a fairly reasonable hypothesis, in my eyes.
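One cheap probe of the convergent-refinement picture (my own sketch, not anything from the post): check how much each layer actually rotates the residual stream. Near-identity refinement layers should show consecutive-state cosine similarity close to 1. The model path is a placeholder:

```python
# Sketch: how much does each layer change the residual stream?
# output_hidden_states returns the embeddings plus every layer's output.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "path/to/model"  # placeholder
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

ids = tok("The cube root of 1860867 is", return_tensors="pt").input_ids
with torch.no_grad():
    hs = model(ids, output_hidden_states=True).hidden_states

for i in range(len(hs) - 1):
    cos = F.cosine_similarity(hs[i], hs[i + 1], dim=-1).mean().item()
    print(f"layer {i:2d} -> {i + 1:2d}: cos = {cos:.3f}")
```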
The chances of duplication "breaking something somewhere" are high, and I would expect the capability profile of an unannealed franken-LLM like this to have a few gaps in it if evaluated extensively against the original. But "franken-LLM layer duplication can actually improve some things" is far too plausible with what we know to be dismissed pre-emptively.
It seems to me that the difference between "iterative improvement" as you put it and "close to the identity" (as in the output is close to the input for most of the volume of the input space) as I put it is fairly subtle, anyway. One experiment I would like to see is what happens to the reasoning performance if rather than duplicating the selected layers, they are deleted/skipped entirely. If the layers improve reasoning by iterative improvement, this should make the performance worse; but if they contain a mechanism that degrades reasoning and is not robust against unannealed self-composition, it should make the performance similarly better.
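A sketch of that control, assuming you have some way to apply an arbitrary execution order (the `layer_order` helper here is hypothetical):

```python
# Build execution orders for the duplicate-vs-skip control. If layers lo..hi do
# iterative refinement, "skip" should hurt reasoning; if they host a
# reasoning-degrading mechanism, "skip" should help about as much as "duplicate".
def layer_order(n_layers: int, lo: int, hi: int, mode: str) -> list[int]:
    if mode == "duplicate":   # 0..hi, then lo..hi again, then hi+1..end
        return list(range(hi + 1)) + list(range(lo, n_layers))
    if mode == "skip":        # 0..lo-1, then hi+1..end
        return list(range(lo)) + list(range(hi + 1, n_layers))
    return list(range(n_layers))  # baseline

# e.g. 40 layers, block 12..14:
# duplicate -> [0..14, 12, 13, 14, 15..39]; skip -> [0..11, 15..39]
```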
Wouldn't "pass-through" identity connections have exactly that effect? These are quite common in transformer models.
In any case, this has been done at least since the very first public releases of Llama by Meta... It also works for image models. There are even a few ComfyUI nodes that let you pick layers to duplicate on the fly, so you can test as many as you want really quickly.
On the prior art: you're right that layer duplication has been explored before. What I think is new here is the systematic sweep toolkit + validation on standard benchmarks (lm-eval BBH, GSM8K, MBPP) showing exactly which 3 layers matter for which model. The Devstral logical deduction result (0.22→0.76) was a surprise to me.
If there are ComfyUI nodes that do this for image models, I'd love links; the "cognitive modes" finding (different duplication patterns leading to different capability profiles from the same weights) might be even more interesting for diffusion models.
From what I understand, transformers are resistant to network corruption (without complete collapse) thanks to residual connections.
I tried to repeat some layers too but got garbage results. I guess I need to automate finding the reasoning layers too, instead of just guessing.
This relates to the thing I posted in the thread a couple of days ago. https://news.ycombinator.com/item?id=47327132
What you need is a mechanism to pick the right looping pattern. Then it really does seem to be mixture-of-experts on a different level.
Break the model into input path, thinking, output path, and make the thinking phase a single looping layer of many experts. Then the router gets to decide 13,13,14,14,15,15,16.
Training the router left as an exercise to the reader.
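For what it's worth, a toy rendering of the idea, all names invented and nothing trained:

```python
# Toy "router picks the loop pattern" module. Hard argmax routing has no
# gradient path, so training the router genuinely is left as an exercise
# (Gumbel-softmax, RL, ...). Everything here is illustrative.
import torch
import torch.nn as nn

class LoopedExpertStack(nn.Module):
    def __init__(self, d_model: int, experts: nn.ModuleList, max_steps: int = 7):
        super().__init__()
        self.experts = experts            # shared pool of "thinking" blocks
        self.max_steps = max_steps
        self.router = nn.Linear(d_model, len(experts))

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: [batch, seq, d_model]
        for _ in range(self.max_steps):
            logits = self.router(h.mean(dim=(0, 1)))   # one expert choice per step
            h = self.experts[int(logits.argmax())](h)  # a rollout like 13,13,14,14,15,15,16
        return h
```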
That you can profitably loop some, say, 3-layer stack is likely a happy accident, where the performance loss from looping 3/4 of mystery circuit X that partially overlaps that stack is more than outweighed by the performance gain from looping 3/3 of mystery circuit Y that exactly aligns with that stack.
So, if you are willing to train from scratch, just build the looping in during training and let each circuit find its place, in disentangled stacks of various depths. The middle of the transformer becomes:
(X₁)ᴹ ⊕ (Y₁∘Y₂)ᴺ ⊕ (Z₁∘Z₂∘Z₃)ᴾ ⊕ …
Notation: Xᵢ is a layer (of very small width) in a circuit of depth 1..i..D, ⊕ is parallel composition (which sums the width up to the rest of the transformer), ∘ is serial composition (stacking), and ᴹ is looping. The values of ᴹ shouldn't matter as long as they are > 1; the point is to crank them up after training.
Ablating these individual circuits will tell you whether you needed them at all, but also roughly what they were for in the first place, which would be very interesting.
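My reading of that notation in code, with made-up widths and a residual added around each loop to keep repeated application stable:

```python
import torch
import torch.nn as nn

class Circuit(nn.Module):
    """One narrow stack X1∘...∘Xd, looped M times at inference."""
    def __init__(self, width: int, depth: int):
        super().__init__()
        self.stack = nn.Sequential(
            *[nn.Sequential(nn.Linear(width, width), nn.GELU()) for _ in range(depth)])

    def forward(self, h: torch.Tensor, loops: int = 1) -> torch.Tensor:
        for _ in range(loops):     # the ᴹ exponent, cranked up post-training
            h = h + self.stack(h)  # residual keeps repeated application stable
        return h

class ParallelCircuits(nn.Module):
    """⊕ of circuits of different depths: the width is split between them."""
    def __init__(self, widths=(64, 64, 128), depths=(1, 2, 3)):
        super().__init__()
        self.widths = list(widths)
        self.circuits = nn.ModuleList(Circuit(w, d) for w, d in zip(widths, depths))

    def forward(self, h: torch.Tensor, loops=(1, 1, 1)) -> torch.Tensor:
        parts = torch.split(h, self.widths, dim=-1)   # carve the width into circuits
        return torch.cat([c(p, m) for c, p, m in zip(self.circuits, parts, loops)], dim=-1)

# h: [batch, seq, 256]; after training, try e.g. loops=(4, 2, 3).
```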
I chatted with the model to see if the thing was still working, and it seemed coherent to me; I didn't notice anything off.
I need to automate testing like that: pick the local maxima, then iterate over nearby layer choices to see if anything is actually better, and leave the thing running overnight.
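Something like this cached hill-climb is probably the minimum viable version, where build_and_score is a stand-in for "apply the layer config and run the fast test harness":

```python
from functools import lru_cache

def sweep(n_layers: int, build_and_score, width: int = 3):
    """Coarse-scan every `width`-layer block, then hill-climb the boundaries.
    `build_and_score((lo, hi))` should return the fast-harness score."""
    score = lru_cache(maxsize=None)(build_and_score)  # never re-run an eval
    best = max(((lo, lo + width - 1) for lo in range(n_layers - width + 1)),
               key=score)
    improved = True
    while improved:
        improved = False
        lo, hi = best
        for cand in [(lo - 1, hi), (lo + 1, hi), (lo, hi - 1), (lo, hi + 1)]:
            if 0 <= cand[0] <= cand[1] < n_layers and score(cand) > score(best):
                best, improved = cand, True
    return best  # leave it running overnight, read this in the morning
```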
What's more, it was found that you only need a single looped layer to be equivalent to a multi-layer network.
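That's the weight-tied (Universal Transformer-style) setup, i.e. roughly:

```python
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Apply one block T times instead of stacking T distinct blocks."""
    def __init__(self, block: nn.Module, steps: int):
        super().__init__()
        self.block, self.steps = block, steps

    def forward(self, h):
        for _ in range(self.steps):  # the same weights on every pass
            h = self.block(h)
        return h
```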
I feel that sometimes a lot of the layers might just be redundant and not fully needed once a model is trained.
If this is validated enough, it could eventually lead to shipping some kind of "mix" architecture, with layers executed to fit some "vibe"?
Devstral was the first one I tried. I optimized for math/EQ, but that didn't result in any better model; then I added the reasoning part, and that resulted in a "better" model.
I used the Devstral with vibe.cli and it looked sharp to me; the thing didn't fail. I also used the chat to "vibe" check it, and it looked OK to me.
The other thing is that I picked a particular circuit that was "good", but I don't know if it was a local maximum. I think I ran just about 10 sets of the "fast test harness" and picked the config that gave the highest score... once I had that, I took that model and ran it against lm_eval limited to only 50 tests, again for the sake of speed; I didn't want to wait a week to discover the config was bad.
I'm just trying different kinds of attention mechanisms, different configurations of the network, adding loops... all kinds of wacky ideas. And the really weird thing is that 99% of the ideas I try work at all.
I wonder if they work for similar reasons.
I'm using the following configuration: --tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects. I also tried humaneval, but something in the harness is missing and it failed...
Note that I'm running 50 tests for each task, mostly because of time limitations: it takes about two hours to validate the run for the base model and the modified one.
I'll also try to publish the results of the small test harness from when I'm testing the multiple-layer configurations. For reference, this is phi-4-Q6_K.gguf, still running. I'm now giving more weight to the Reason factor, which comes from running a small subset of all the problems in the task config above.
Initially I tried the approach of taking the highest math/EQ, but it resulted in models that were less capable overall, with the exception of math. Math, as in the original research, is basically how good the model was at giving you the answer to a really tough question, say the cube root of some really large number... but that didn't translate to the model being better at other tasks...
Config | Lyr | Math | EQ | Reas | Math Δ | EQ Δ | Reas Δ | Comb Δ
--------|-----|--------|-------|--------|---------|-------|---------|-------
BASE | 0 | 0.7405 | 94.49 | 94.12% | --- | --- | --- | ---
(6,9) | 3 | 0.7806 | 95.70 | 94.12% | +0.0401 | +1.21 | +0.00% | +1.21
(9,12) | 3 | 0.7247 | 95.04 | 94.12% | -0.0158 | +0.55 | +0.00% | +0.55
(12,15) | 3 | 0.7258 | 94.14 | 88.24% | -0.0147 | -0.35 | -5.88% | -6.23
(15,18) | 3 | 0.7493 | 95.74 | 88.24% | +0.0088 | +1.25 | -5.88% | -4.63
(18,21) | 3 | 0.7204 | 93.40 | 94.12% | -0.0201 | -1.09 | +0.00% | -1.09
(21,24) | 3 | 0.7107 | 92.97 | 88.24% | -0.0298 | -1.52 | -5.88% | -7.41
(24,27) | 3 | 0.6487 | 95.27 | 88.24% | -0.0918 | +0.78 | -5.88% | -5.10
(27,30) | 3 | 0.7180 | 94.65 | 88.24% | -0.0225 | +0.16 | -5.88% | -5.73
(30,33) | 3 | 0.7139 | 94.02 | 94.12% | -0.0266 | -0.47 | +0.00% | -0.47
(33,36) | 3 | 0.7104 | 94.53 | 94.12% | -0.0301 | +0.04 | +0.00% | +0.04
(36,39) | 3 | 0.7017 | 94.69 | 94.12% | -0.0388 | +0.20 | +0.00% | +0.20
(6,10) | 4 | 0.8125 | 96.37 | 88.24% | +0.0720 | +1.88 | -5.88% | -4.01
(9,13) | 4 | 0.7598 | 95.08 | 94.12% | +0.0193 | +0.59 | +0.00% | +0.59
(12,16) | 4 | 0.7482 | 93.71 | 88.24% | +0.0076 | -0.78 | -5.88% | -6.66
(15,19) | 4 | 0.7617 | 95.16 | 82.35% | +0.0212 | +0.66 | -11.76% | -11.10
(18,22) | 4 | 0.6902 | 92.27 | 88.24% | -0.0504 | -2.23 | -5.88% | -8.11
(21,25) | 4 | 0.7288 | 94.10 | 88.24% | -0.0117 | -0.39 | -5.88% | -6.27
(24,28) | 4 | 0.6823 | 94.57 | 88.24% | -0.0583 | +0.08 | -5.88% | -5.80
(27,31) | 4 | 0.7224 | 94.41 | 82.35% | -0.0181 | -0.08 | -11.76% | -11.84
(30,34) | 4 | 0.7070 | 94.73 | 94.12% | -0.0335 | +0.23 | +0.00% | +0.23
(33,37) | 4 | 0.7009 | 94.38 |100.00% | -0.0396 | -0.12 | +5.88% | +5.77
(36,40) | 4 | 0.7057 | 94.84 | 88.24% | -0.0348 | +0.35 | -5.88% | -5.53
(6,11) | 5 | 0.8168 | 95.62 |100.00% | +0.0762 | +1.13 | +5.88% | +7.02
(9,14) | 5 | 0.7245 | 95.23 | 88.24% | -0.0160 | +0.74 | -5.88% | -5.14
(12,17) | 5 | 0.7825 | 94.88 | 88.24% | +0.0420 | +0.39 | -5.88% | -5.49
(15,20) | 5 | 0.7832 | 95.86 | 88.24% | +0.0427 | +1.37 | -5.88% | -4.52
(18,23) | 5 | 0.7208 | 92.42 | 88.24% | -0.0197 | -2.07 | -5.88% | -7.95
(21,26) | 5 | 0.7055 | 92.89 | 88.24% | -0.0350 | -1.60 | -5.88% | -7.48
(24,29) | 5 | 0.5825 | 95.04 | 94.12% | -0.1580 | +0.55 | +0.00% | +0.55
(27,32) | 5 | 0.7088 | 94.18 | 88.24% | -0.0317 | -0.31 | -5.88% | -6.19
(30,35) | 5 | 0.6787 | 94.69 | 88.24% | -0.0618 | +0.20 | -5.88% | -5.69
(33,38) | 5 | 0.6650 | 94.96 | 88.24% | -0.0755 | +0.47 | -5.88% | -5.41
(6,12) | 6 | 0.7692 | 95.39 | 94.12% | +0.0287 | +0.90 | +0.00% | +0.90
(9,15) | 6 | 0.7405 | 94.65 | 94.12% | -0.0000 | +0.16 | +0.00% | +0.16
(12,18) | 6 | 0.7582 | 94.57 | 88.24% | +0.0177 | +0.08 | -5.88% | -5.80
(15,21) | 6 | 0.7828 | 93.52 | 88.24% | +0.0423 | -0.98 | -5.88% | -6.86
(18,24) | 6 | 0.7308 | 92.93 | 94.12% | -0.0097 | -1.56 | +0.00% | -1.56
(21,27) | 6 | 0.6791 | 92.54 | 82.35% | -0.0615 | -1.95 | -11.76% | -13.72

# Run lm-evaluation-harness
lm_eval --model local-chat-completions \
--model_args model=test,base_url=http://localhost:8089/v1/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False \
--tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects \
--apply_chat_template --limit 50 \
--output_path ./eval_results