Thank you for all the thoughtful and curious comments.
By the way, for 72B models, around *36GB of memory works fine*. I ran the benchmark and shared the results on the website: https://opengraviton.github.io/index.html
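For anyone wondering how a 72B model fits in ~36GB: that figure is consistent with 4-bit weights, though the exact quantization scheme is my assumption, not something stated above. A quick back-of-envelope check:

```python
# Back-of-envelope: 72B parameters at 4 bits per weight
# (assumed quantization, not confirmed by the project docs).
params = 72e9
bits_per_weight = 4

bytes_total = params * bits_per_weight / 8
gb = bytes_total / 1e9  # decimal gigabytes

print(gb)  # 36.0
```

Activations, KV cache, and runtime overhead come on top of this, so the real footprint is somewhat higher than the raw weight size.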
While working on this research, I realized something important: the way most current models are trained is extremely inefficient. That realization led me to start developing *graviton-native*, which trains AI models from scratch using more efficient architectures.
The idea is to design models that are optimized for efficiency from the beginning. My expectation is that this approach could yield roughly a *70% efficiency improvement*. Combined with OpenGraviton, I believe this could eventually make it possible to run *trillion-parameter-scale models locally*.
You can find the paper here: https://opengraviton.github.io/paper.html
And the repository here: https://github.com/opengraviton/graviton-native
Right now I’m training a *72B model* using this approach. I’ll share the results soon and update the website.
For context, I run a Mac Mini M4 as a homelab server, and the memory pressure from even 7B models is noticeable. I'm curious how this handles sustained inference without thermal throttling.
https://github.com/opengraviton/graviton?tab=readme-ov-file#...
The benchmarks don't show any results for actually running these larger-than-memory models, only the difference in model size.
It all smells quite sloppy.
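For what it's worth, larger-than-memory inference is usually done by memory-mapping the weight file so the OS pages tensors in on demand rather than loading everything into RAM. Whether OpenGraviton does this is not stated anywhere I can see; the function name below is purely illustrative:

```python
import mmap
import os
import tempfile

def load_weights_mmap(path):
    """Memory-map a weight file read-only.

    The OS pages bytes in lazily when they are accessed, so the file
    can be far larger than physical RAM. (Hypothetical helper, not
    OpenGraviton's actual API.)
    """
    f = open(path, "rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Demo with a small stand-in "weight file".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 1024)
    path = tmp.name

weights = load_weights_mmap(path)
chunk = weights[512:640]  # only the pages backing these bytes get touched
print(len(chunk))  # 128
os.remove(path)
```

The catch is that a benchmark has to actually stream through the weights to prove this works; reporting only on-disk size, as noted above, doesn't demonstrate it.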
~19 tok/s on an Apple M1 Max (64 GB) with TinyLlama-1.1B-Chat-v1.0.
I have a MacBook Pro M1 Max with 64 GB RAM and a Mac Studio M3 Ultra with 96 GB RAM. What do you think it's possible to run on these? Just curious before I really try it out.