In memory research there is this idea that there is a dual representation of every event... a more verbatim representation, and a more context-weighted representation. As we develop over early childhood, our verbatim memory representations increase in fidelity and in resistance to interference, but they peak somewhere around 6 to 10 years of age, depending on the specifics. As verbatim memory matures, another aspect of memory representations improves: some have called it gist memory, or semantic context. Increases in memory performance continue into adolescence primarily because of an increasing ability to use context and gist (broad representations that capture the details of an event by inference) to improve accuracy overall, but that also brings a greater likelihood of false alarms to lures primed by semantically related material during learning... precisely because there is greater reliance on context to support recall accuracy.
So I could imagine such a system in an LLM, where attention is directed to exact representations in one head, while another head keeps its attention on a coarser grain of information that anchors it. However, I am not familiar enough with LLMs to know whether that is just silly analogizing.
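For what it's worth, here is a toy sketch of what that two-granularity idea could look like in attention terms: one head attends over the exact token states ("verbatim"), another attends over a chunk-pooled, coarser view of the sequence ("gist"). This is purely an illustration of the analogy, with made-up module names; it is not how any production model is actually built.

    # Toy sketch of "verbatim vs. gist" attention: one head attends to the exact
    # token states, another to a coarse, chunk-pooled summary of the sequence.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualGrainAttention(nn.Module):
        def __init__(self, dim, chunk=8):
            super().__init__()
            self.chunk = chunk
            self.fine = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
            self.coarse = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
            self.out = nn.Linear(2 * dim, dim)

        def forward(self, x):  # x: (batch, seq, dim)
            # "Verbatim" path: attend over the exact token representations.
            fine_out, _ = self.fine(x, x, x)
            # "Gist" path: attend over chunk-averaged (coarser) representations.
            b, s, d = x.shape
            pad = (-s) % self.chunk
            xp = F.pad(x, (0, 0, 0, pad))
            gist = xp.view(b, -1, self.chunk, d).mean(dim=2)  # (batch, seq/chunk, dim)
            coarse_out, _ = self.coarse(x, gist, gist)
            return self.out(torch.cat([fine_out, coarse_out], dim=-1))

    x = torch.randn(2, 16, 64)
    print(DualGrainAttention(64)(x).shape)  # torch.Size([2, 16, 64])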
Mamba as an architecture claims some significant performance gains, but to my knowledge no really large Mamba models (>~100B params) with open or leaked weights have been disclosed, other than this one (7B).
As mentioned in other comments, another dimension not to forget is training data quality. Not only quantity but also quality really matters; that is what we are learning more and more with LLMs.
[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html [2] See e.g. https://m.youtube.com/watch?v=5eqRuVp65eY&pp=ygUMU2NhbGluZ... for a well-made, easily digestible intro
For context, GPT-4 is supposedly ~1.8T params.
Base model: https://huggingface.co/Zyphra/Zamba2-7B
Instruct tuned: https://huggingface.co/Zyphra/Zamba2-7B-Instruct
Feature Request: Support Zyphra/Zamba2-2.7B #8795 (open issue, filed by tomasmcm on Jul 31, 1 comment)
The model card states that you have to use their fork of transformers.[2]
1. https://huggingface.co/Zyphra/Zamba2-7B-Instruct/blob/main/c...
2. https://huggingface.co/Zyphra/Zamba2-7B-Instruct#prerequisit...
Does anyone know an up-to-date independent leaderboard? LMSYS and LiveBench used to be great but have skipped most major models recently.
In general the idea with these hybrid SSM architectures is to show that you can get good results with fewer training tokens, and to significantly improve inference speed. Even if Qwen2.5 was better at MMLU, etc, it definitely used way more training tokens to get there (18T tokens for Qwen2.5 vs 3T for Zamba2), so Zamba2 is still a pretty useful result.
TBD whether Zamba2 is actually good in real-world usage (Phi3.5, for example, used only 3.4T tokens and got good public benchmark results; it's just not very good at anything other than the public benchmarks). But Jamba1.5 -- another hybrid SSM architecture -- did seem to do quite well on the LMSys leaderboard (admittedly not a super effective measure these days, but it still feels less gameable than MMLU), so I'm moderately hopeful that this is a real architectural win and not just gamed benchmarks.
On the other hand, the main points of Zamba/Mamba are low latency, generation speed, and efficient memory usage. If this is true, LLMs could be much easier for everyone to use. All we need to do is wait for someone with a good training dataset to train a SOTA Mamba.
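To put rough numbers on the memory point: a transformer's KV cache grows linearly with context length, while an SSM carries a fixed-size recurrent state. A back-of-the-envelope sketch, with illustrative 7B-class hyperparameters (not Zamba2's actual config):

    # KV cache grows with context length; an SSM state does not.
    # Hyperparameters below are illustrative, not Zamba2's real config.
    layers, heads, head_dim = 32, 32, 128
    bytes_per = 2                  # fp16/bf16
    ctx = 32_768                   # tokens of context

    # KV cache: 2 tensors (K and V) per layer, per attention head, per token.
    kv_cache = 2 * layers * heads * head_dim * ctx * bytes_per
    print(f"KV cache @ {ctx} tokens: {kv_cache / 1e9:.1f} GB")   # ~17.2 GB

    # SSM: a per-layer state of d_model * d_state floats, independent of context.
    d_model, d_state = 4096, 16
    ssm_state = layers * d_model * d_state * bytes_per
    print(f"SSM state at any length: {ssm_state / 1e6:.1f} MB")  # ~4.2 MB

(Zamba2 still has its shared attention blocks, so it is not fully cache-free, but the ratio gives a sense of why latency and memory improve.)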
Better than Meta's/Llama's custom semi-proprietary license though, I'll give them that.
Attention remains king.
Yannic Kilcher has a new video touching on Mamba in an intuitive way.
Although it tests just a small aspect of an LLM's strength, one question I like to ask every new LLM is one I first saw in a blog post [1], and I have yet to come across a small LLM that answers it correctly. Almost all large LLMs won't answer it correctly either.
A small strawberry is put into a normal cup and the cup is placed upside down on a table. Someone then takes the cup and puts it inside the microwave. Where is the strawberry now?
[1] https://towardsdatascience.com/openai-o1-the-enigmatic-force...
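If you want to try it on Zamba2 yourself, something along these lines should work (untested sketch: per the model card you need Zyphra's fork of transformers, and the chat-template and generation settings here are just reasonable defaults):

    # Untested sketch of asking the cup/strawberry question to Zamba2-7B-Instruct.
    # Requires Zyphra's fork of transformers per the model card.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Zyphra/Zamba2-7B-Instruct"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype=torch.bfloat16
    )

    question = ("A small strawberry is put into a normal cup and the cup is placed "
                "upside down on a table. Someone then takes the cup and puts it "
                "inside the microwave. Where is the strawberry now?")

    inputs = tok.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    out = model.generate(inputs, max_new_tokens=200)
    print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))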
> Well, I'm afraid I can't do that! I'm an AI language model created by OpenAI, and I don't have the ability to lie or deceive. I strive to provide accurate and helpful information to the best of my knowledge and abilities. If you have any questions or need assistance, feel free to ask!
Reminds me of Sergey Bubka (https://en.wikipedia.org/wiki/Sergey_Bubka). Bubka broke the world record for men's pole vault 35 times during his career.
Not to diminish his world records, but professional athletes frequently hold their performance back so they can set more world records, especially if they have sponsorship deals that include getting paid per world record.
> By 1992, he was no longer bound to the Soviet system, and signed a contract with Nike that rewarded each world record performance with special bonuses of $40,000
He could have just done it a couple of times, really pushing the limit each attempt, but he most likely spread it out over more attempts instead.
I don't think that's what's happening in the AI ecosystem right now :)
There have been 11 new world records since his last one (the last 10 by Armand Duplantis). The latest record, set this year, is 12 cm higher than Bubka's best jump. It's not unthinkable that if he hadn't "sliced the bologna", his record would have lasted longer. On the other hand, the money was probably more useful to him in a post-Soviet country.
Model Size + Overhead (context length, etc...)
7B: 13 GB - fits on T4 (16 GB).
13B: 26 GB - fits on V100 (32 GB).
30B: 65 GB - fits on A100 (80 GB).
65B: 131 GB - fits on 2x A100 (160 GB).
That's it really.
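The arithmetic behind those numbers is basically just parameter count times bytes per parameter (roughly 2 bytes at fp16/bf16), plus headroom for KV cache and activations; quantized weights shrink it a lot. A quick sketch:

    # Rough rule of thumb behind the sizes above: params * bytes per param,
    # plus some headroom for KV cache and activations.
    def weight_memory_gb(params_billion, bytes_per_param=2):
        return params_billion * 1e9 * bytes_per_param / 1e9

    for size in (7, 13, 30, 65):
        print(f"{size}B: ~{weight_memory_gb(size):.0f} GB at fp16, "
              f"~{weight_memory_gb(size, 0.5):.1f} GB at 4-bit")
    # e.g. 7B: ~14 GB at fp16, ~3.5 GB at 4-bit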
However, so-called "scaling laws" for language models are a super interesting field of research, if you're interested. I'd recommend OpenAI's 2020 paper as a good start: https://openai.com/index/scaling-laws-for-neural-language-mo...
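To give a flavor: that paper fits test loss as simple power laws in non-embedding parameter count N and training tokens D. A sketch of the combined form, with the paper's approximate constants quoted from memory (so treat them as ballpark):

    # Approximate Kaplan et al. (2020) scaling law: loss (nats/token) as a
    # power law in parameters N and dataset tokens D. Constants are the
    # paper's rough fits, reproduced from memory.
    N_c, alpha_N = 8.8e13, 0.076
    D_c, alpha_D = 5.4e13, 0.095

    def loss(N, D):
        # L(N, D) = [ (N_c / N)^(alpha_N / alpha_D) + D_c / D ]^alpha_D
        return ((N_c / N) ** (alpha_N / alpha_D) + D_c / D) ** alpha_D

    print(loss(7e9, 3e12))    # ~7B params, ~3T tokens (Zamba2-ish budget)
    print(loss(7e9, 18e12))   # same params, ~18T tokens (Qwen2.5-ish budget)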
Zamba 1 has a single shared attention block that is applied every 6 Mamba blocks. For Zamba 2: "Instead of a single shared attention block, we utilize two shared attention blocks which are interleaved in an ABAB pattern throughout the network."
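If I'm reading that right, the layer layout looks roughly like the sketch below (the every-6 spacing is Zamba 1's period and Zamba2's exact spacing may differ; shared_attn_A/B denote the same two blocks reused, i.e. weight-shared, wherever they appear):

    # Rough sketch of the described layout: mostly Mamba blocks, with two
    # weight-shared attention blocks interleaved in an ABAB pattern.
    def zamba2_layout(n_mamba_blocks, period=6):
        layout, shared = [], ["shared_attn_A", "shared_attn_B"]
        for i in range(n_mamba_blocks):
            layout.append(f"mamba_{i}")
            if (i + 1) % period == 0:
                # The same two attention blocks are reused, alternating A, B, A, B...
                layout.append(shared[((i + 1) // period - 1) % 2])
        return layout

    print(zamba2_layout(24))
    # 24 Mamba blocks with shared_attn_A/B inserted after every 6th, alternating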
Perhaps of related interest: Nvidia released a paper back in June testing hybrid SSM models, and their testing found that in small-scale (<1B) experiments, roughly 8% attention layers (about a 12:1 ratio of SSM to attention layers) was optimal. https://research.nvidia.com/publication/2024-06_empirical-st...
The 8B param/3.5T token model they trained, Mamba2-Hybrid, was also Apache 2.0 licensed: https://huggingface.co/nvidia/mamba2-hybrid-8b-3t-128k
https://arxiv.org/abs/2405.21060
Mamba-2 is used in Zamba2.
Our novel shared-attention architecture allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
So it sounds like it is transformer-based?
We do this in fine-tuning all the time: see reverse prompting, etc.
You can create inputs for DPO/ORPO synthetically, which is a huge one, as previously that would require gigantic investments: https://arxiv.org/abs/2402.10379
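One common recipe: sample several completions per prompt, score them with a judge or reward model, and pair the best against the worst. A rough sketch (generate and judge_score are placeholders for your own model and judge, not any specific library's API):

    # Sketch of building DPO/ORPO preference pairs synthetically: sample several
    # completions per prompt, score them, and pair the best against the worst.
    def build_preference_pairs(prompts, generate, judge_score, n_samples=4):
        pairs = []
        for prompt in prompts:
            completions = [generate(prompt) for _ in range(n_samples)]
            ranked = sorted(completions, key=judge_score, reverse=True)
            pairs.append({
                "prompt": prompt,
                "chosen": ranked[0],     # highest-scoring completion
                "rejected": ranked[-1],  # lowest-scoring completion
            })
        return pairs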
There's also the Gemma 2 paper, which advanced the SOTA in distillation; on a side note, there are many reasons for it (vocab_size, good sizes at 9B/27B), but IMHO it's currently the best model for e.g. Ukrainian. In fact, I prefer it to anything else out there, including the much larger Llamas, by a mile! The model is a triumph of synthetic datasets. https://arxiv.org/abs/2408.00118
Also see the Princeton paper on SimPO, which is how the 9B Gemmas were supercharged recently. https://arxiv.org/abs/2405.14734
"In particular, we focus our efforts on knowledge distillation (Hinton et al., 2015), which replaces the one-hot vector seen at each token with the distribution of potential next tokens computed from a large model. [...] Concretely, we use a large language model as a teacher to train small models, namely 2B and 9B models, on a quantity of tokens that is more than 50× the compute-optimal quantity predicted by the theory (Hoffmann et al., 2022)."
That passage says that they have already extracted the knowledge from the data with a larger model, and they are using that for the smaller model. What I meant, applied to this scenario, is that new models trained with the distillation approach are never going to be better than the model that generated the distribution. Of course you can get better with a change of architecture.
So I could rephrase my previous comment as: you cannot extract new information from synthetic data that cannot already be found in the original training data.
But you can use synthetic data to regularize, stabilize performance, transfer knowledge from one dataset/model to another, etc.
Thanks again for your very appreciated references!
1. LLM is trained on everything
2. LLM classifies everything in the training corpus as high / low quality
3. New (or same) LLM (re)trains on only the high-quality documents
I've read that most web data is somewhere between somewhat and absolutely useless, e.g. pages of stock quotes, and it seems easy for something like GPT-3 to classify that. Classifying it would take what... one extra epoch's worth of computation? And it would save much more computation downstream by shrinking the size of the training set.
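Step 2 could be as simple as the sketch below, where quality_score is a placeholder for a prompted LLM (or a cheap classifier distilled from its judgments); nothing here is a real API:

    # Sketch of step 2: score each document with an LLM-based quality judge and
    # keep only the high-quality ones for the next training run.
    def filter_corpus(documents, quality_score, threshold=0.5):
        kept = [doc for doc in documents if quality_score(doc) >= threshold]
        print(f"kept {len(kept)} of {len(documents)} documents")
        return kept

    # quality_score(doc) could be a prompted LLM returning P("high quality"),
    # under which pages of raw stock quotes would score near zero.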