By affordable I mean no greater than the cost of a current-gen base model Mac Mini ($599), but ideally around the price of a Raspberry Pi 5 ($79), which comes up when searching for budget PCs[2]. Both devices have the same amount of RAM in my case (8GB), but the performance I observed differs, given the importance of memory bandwidth. I mention these two because I've had success running Llama 3 via Ollama on both, albeit at slower speeds than a full workstation with a commodity GPU, e.g. an RTX 4090, which starts at $1599. I'm interested in learning what other devices are out there that people consider cheap and use for running LLMs locally.
[1]: https://news.ycombinator.com/item?id=38646969
[2]: https://www.pcmag.com/picks/the-best-budget-desktop-computers
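For anyone curious how little code this takes: a minimal sketch using the Ollama Python client (pip install ollama), assuming the Ollama server is running and llama3 has already been pulled. The same script should run unchanged on the Pi and the Mac Mini; only the tokens/s differ.

    # Minimal sketch: talk to a locally running Ollama server.
    # Assumes `ollama serve` is running and `ollama pull llama3` was done.
    import ollama

    response = ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": "What is a Raspberry Pi 5?"}],
    )
    print(response["message"]["content"])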
You can experiment with a lot of models; it's just going to be slow.
With DDR5 you can go even higher, with 48GB modules.
Otherwise I got a 3060 12GB, which can be had for €200 used.
It's a very affordable setup.
Consider making friends with people who have a good desktop or laptop, and see if you can use it for a little while when visiting, in exchange for making them a meal or a coffee.
If you give up on running locally, you can reduce the cost by using rented servers.
Giving up on performance and tolerating hallucinations is, for me, the only budget-friendly, local option for an introduction to LLMs. A very narrow LLM, e.g. one built for spellchecking or similar, would be feasible on limited hardware.
IIRC, there is a publication on 1.3-bit or 1.4-bit quantization that someone implemented on GitHub.
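To make the idea concrete, here is a toy ternary quantizer (~1.58 bits per weight) in numpy. This only illustrates the general flavor of sub-2-bit quantization, not the specific method from that publication.

    # Toy illustration of extreme low-bit quantization: round each weight
    # to {-1, 0, +1} and keep one float scale per tensor (~1.58 bits/weight).
    import numpy as np

    def ternary_quantize(w):
        scale = np.abs(w).mean()  # per-tensor scale
        q = np.clip(np.round(w / (scale + 1e-8)), -1, 1).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = ternary_quantize(w)
    print("max reconstruction error:", np.abs(w - dequantize(q, s)).max())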
These cards can easily be found at MSRP because they’re not a great improvement over the 3060/4060 8GB for gaming, but the added memory makes them excellent for AI.
I looked today on Newegg and similar PC components would cost $220-230.
From a performance perspective, I get about 9 tokens/s from mistral-7b-instruct-v0.2.Q4_K_M.gguf with a 1024 context size. This is with overclocked RAM, which added 15-20% more speed.
The Mac Mini is probably faster than this. However, the custom-built PC route gives you the option to add more RAM later on to try bigger models. It also lets you add a decent GPU, something like a used 3060, as one of the other comments suggests.
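If you want to reproduce a tokens/s number like the one above on your own hardware, here is a rough sketch with llama-cpp-python (pip install llama-cpp-python); the model path is just whatever GGUF file you downloaded, e.g. the Mistral 7B Q4_K_M build mentioned earlier.

    # Rough tokens/s measurement with llama-cpp-python.
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf", n_ctx=1024)

    start = time.time()
    out = llm("Explain memory bandwidth in one paragraph.", max_tokens=128)
    elapsed = time.time() - start

    n_tokens = out["usage"]["completion_tokens"]
    print(f"{n_tokens / elapsed:.1f} tokens/s")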
Depending on your answer, I suspect you might want to use Google Colab Pro/Paperspace/AWS/vast.ai instead of building your own hardware.
If you have access to the dataset you want, you could train a model yourself to fit your target hardware. You could also look at FPGA solutions if you are comfortable working with those. Training locally might take some time, but you could use Google Colab to train it.
For larger models, or GPU accelerated inference, there is no “cheap” solution.
Why do you think everyone is so in love with the 7B models?
It’s not because they’re good. They’re just ok, and it’s expensive to run larger models.
AMD Ryzen 7 5800H
I’d say “under $20k” is considered affordable, at least if you want to run decent models (>70B) at bearable speeds (>1 t/s). In comparison, a single H100 server is $250k.
Your optimal choice today is a Mac Studio with 192GB of unified memory (~$7k), but it will be too slow to run something like Llama 400B.
If you just want to play with 13B parameter models or smaller, an RTX 4060 Ti 16GB is a great option at $450 or less.
If you want the ability to use larger models, RTX 3090s are a pretty good value. They can be had on the secondary market for $700ish, are quite fast, and have 24GB each. For 70B models, you’ll want to use 4-5 bit quantization and two 3090s. You could probably run larger models on 4 or 6 of them.
Both of these options require a PC to install them into, but are nowhere near the cost of a Mac Studio with 192GB of RAM. Yes, the Mac will give you more memory in total, but it won’t be as fast at inference and costs several multiples of a dual-3090 setup.
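For the dual-3090 route, this is roughly what loading a ~70B model in 4-bit across both cards looks like with transformers + bitsandbytes. A sketch, not a tuned setup: the checkpoint name is just an example, and it assumes accelerate is installed so device_map="auto" can shard layers across the GPUs.

    # Sketch: 4-bit loading of a ~70B model sharded across two 24GB GPUs.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-70b-chat-hf"  # example checkpoint
    bnb = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_compute_dtype=torch.float16)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb,
        device_map="auto",  # splits layers across all visible GPUs
    )

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))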