Zebra-Llama: Towards Efficient Hybrid Models
34 points | 2 hours ago | 2 comments | arxiv.org
adityashankar
40 minutes ago
[-]
Due to perverse incentives and the long history of models over-claiming accuracy, it's very hard to believe anything until it is open source and can be tested out

that being said, I do very much believe that the computational efficiency of models is going to go up [correction] drastically over the coming months, which raises interesting questions about Nvidia's throne

*previously miswrote and said computational efficiency would go down

reply
credit_guy
32 minutes ago
[-]
reply
adityashankar
19 minutes ago
[-]
yes, thanks for the link!
reply
danielbln
36 minutes ago
[-]
I think you mean computational efficiency will go _up_ in the future. To your last point: Jevons paradox might apply.
reply
adityashankar
26 minutes ago
[-]
yup, that's what I meant! Jevons' paradox applies to resource usage in general, not to a specific company's dominance

if computational efficiency goes up (thanks for the correction) and CPU inference becomes viable for most practical applications, GPUs (or dedicated accelerators) may become unnecessary for most workloads

reply
mason_mpls
41 minutes ago
[-]
> Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7–11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size—down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively—while preserving 100%, 100%, and 97% of average zero-shot performance on LM Harness tasks.

This is an extraordinary claim. Is there a catch I'm missing? Am I misreading?
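
For scale, here's a rough back-of-envelope sketch (not from the paper) of what a 2.73% KV cache would mean for an 8B Llama-style model. The layer count, KV-head count, and head dimension below are the publicly reported Llama 3 8B values, so treat them as assumptions:

    # Rough sketch: KV cache size for a Llama-3-8B-style transformer
    # (assumed config: 32 layers, 8 KV heads, head dim 128, fp16 cache),
    # and what the quoted 2.73% of that would leave.
    n_layers, n_kv_heads, head_dim, bytes_per = 32, 8, 128, 2

    def kv_cache_bytes(seq_len, batch=1):
        # keys + values, cached for every layer and every token
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per
        return batch * seq_len * per_token

    full = kv_cache_bytes(seq_len=128_000)            # one long-context request
    print(f"full KV cache: {full / 2**30:.1f} GiB")   # ~15.6 GiB
    print(f"2.73% of it:   {0.0273 * full / 2**30:.2f} GiB")  # ~0.43 GiB

If the numbers hold up with only ~3% average zero-shot loss on the 8B variant, a 128k-token context would fit comfortably in ordinary CPU RAM, which is exactly why the claim sounds so surprising.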

reply