> Generation tasks. Method applies to classification only. Preliminary decoder experiments show perplexity increases.
The distillation of a student that predicts "anchor layers" and then acts as a backbone for classification is perfectly cool on its own; no need to stretch the title/abstract so much.
The core result: a frozen Llama-3.3-70B can be distilled into a 256-dimensional field representation, giving 224× compression and slightly higher accuracy on several benchmarks. A small student model then learns to directly generate these fields from text, removing the transformer from the inference path.
The Zenodo link contains the full paper, statistical results, and methodology. A reference implementation (non-optimized) is here: https://github.com/Anima-Core/an1-core
Production variants (AN1-Turbo, FPU work, etc.) are not included.
I’m an outsider to academia so I’m posting this openly to get technical feedback, replication attempts, and critique from people who understand this space.
If I were a paper reviewer, here are a couple of red flags that would stand out to me. I'd suggest starting here if you want to rework this for an academic submission:
1. Your LaTeX citations in the related work are broken; I see [?] everywhere. To a reviewer, this is often a strong sign of an AI-hallucinated bibliography, though many of your references actually do exist and are contextually relevant, so I'm not quite sure what's going on here. Similarly, figure references need to be fixed; I see references to "Figure ?" throughout.
2. Bluntly, "Exact architecture details remain proprietary for production deployments" and "Production systems use architecture search tailored to target latency and accuracy constraints" are not how IP protection works in this field. Do your experiments use the "MLP baselines" or your proprietary architecture? Since you say the code "Achieves 80-90% of paper performance using baseline heuristics," the approach as published effectively isn't reproducible. As a reviewer, this really worries me. I strongly recommend benchmarking only the system you're able to open-source. I say this because I suspect there's a lot of "secret sauce" in how you approximate the anchor layers and how that is transferred back to your student transformer model. That's the part that deserves the most time, effort, and writing, but the manuscript glosses over it as an implementation detail.
3. I'm glad you ablate over hyperparameters of your system, but how does it compare to (a) an ordinary smaller model of identical size trained end-to-end, and (b) distilling from a single layer's activations? E.g., a reviewer might consider this work to be a novel method of model distillation, so what makes it better than previous distillation methods?
4. I found the paper fairly hard to read because it's full of sentence fragments rather than full thoughts. A little background on the benchmarks, failure cases, etc. would go a long way, and some discussion of why you think your approach improves on similar distillation methods would also be welcome here.
5. "Compression" is overloaded. Does 224x compression refer to a parameter-count ratio, i.e. nparams(original model) / (nparams(field transfer) + nparams(student model)), or does it refer to reducing the representation dimensionality, 7 × 8192 / 256 (which does work out to 224)?
6. [nitpick] I'd suggest changing the name "meaning field" to something a little more digestible, like "compressed representation" or "latent activation distillation".
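
To make point 5 concrete, here is a rough Python sketch of the two readings of "224x". The 7, 8192, and 256 come from the expression in point 5; the function names are mine, and I don't know the real student or field-transfer parameter counts, so the parameter-ratio function is left without example numbers rather than guessing.

```python
# Two possible meanings of "224x compression" (illustrative sketch only).
# ANCHOR_LAYERS, HIDDEN_DIM and FIELD_DIM are taken from the
# "7*8192/256" expression in point 5; the function names are hypothetical.

ANCHOR_LAYERS = 7     # number of anchor layers assumed in point 5
HIDDEN_DIM = 8192     # per-layer activation width assumed in point 5
FIELD_DIM = 256       # size of the distilled "field" representation


def dimensionality_ratio(layers: int, hidden: int, field: int) -> float:
    """Reading 1: how much smaller the representation is than the
    concatenated anchor-layer activations it replaces."""
    return layers * hidden / field


def parameter_ratio(teacher_params: float,
                    student_params: float,
                    transfer_params: float = 0.0) -> float:
    """Reading 2: how many times fewer parameters the deployed system has.
    The student and transfer sizes are not stated in this thread, so the
    caller has to supply them."""
    return teacher_params / (student_params + transfer_params)


dim_ratio = dimensionality_ratio(ANCHOR_LAYERS, HIDDEN_DIM, FIELD_DIM)
print(dim_ratio)  # 224.0
```

The first reading reproduces 224 exactly, which makes me suspect the headline number is a representation-size ratio rather than a model-size ratio. If so, the paper should say that explicitly and report the parameter ratio separately.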
Sorry for being so critical. Iron sharpens iron, though. Hopefully these thoughts are helpful to get you started; excited to see where this work leads.
Then the kitschy paper titles could follow from that, e.g. "extreme llama compression: when classification is all you need", or "Encoder-only models: a lightweight alternative to decoder-only GPT world models", etc.
just spitballing
At the same time, it's possible, since it's only classification tasks. I mean, the method as explained is technically plausible; a lot of people have thought about it, we just weren't able to find a way to make it work.
Very unlikely to be true, unfortunately.