Ask HN: Affordable hardware for running local large language models?
31 points
13 days ago
| 12 comments
A while back there was a post about running Stable Diffusion on a Raspberry Pi Zero 2 [1], which was slow but incredibly impressive! That sparked my curiosity about this question: what is considered affordable hardware for running large language models locally today? I'm aware there's a lot of work underway to make inference cheap to run at the edge, but I'm curious to understand the landscape as it stands today, i.e. hardware anyone could purchase right now. I've seen people running models on flagship smartphones, but those are more expensive than a Mac mini with worse performance.

By affordable I mean no greater than the cost of a current-gen base model Mac mini ($599), but ideally around the price of a Raspberry Pi 5 ($79), which comes up when searching for budget PCs [2]. Both devices have the same amount of RAM in my case (8GB) but very different observed performance, given the importance of memory bandwidth. I mention these two because I've had success running Llama 3 via ollama on both, albeit at slower speeds compared to a full workstation with a commodity GPU, e.g. an RTX 4090, which starts at $1,599. I'm interested in learning about what other devices are out there that people consider cheap and use for local LLMs.
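
For concreteness, the workload I have in mind is single-user chat against a quantized model, roughly like this (a sketch using the ollama Python client; the model tag and prompt are just placeholders):

    import ollama  # pip install ollama; assumes the ollama server is already running

    response = ollama.chat(
        model="llama3",  # the 8B model I tried on both the Pi 5 and the Mac mini
        messages=[{"role": "user", "content": "Explain memory bandwidth in one sentence."}],
    )
    print(response["message"]["content"])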

[1]: https://news.ycombinator.com/item?id=38646969

[2]: https://www.pcmag.com/picks/the-best-budget-desktop-computers

ysleepy
13 days ago
[-]
I simply bought 4x32GB of DDR4 memory (~200 bucks) for a normal desktop mainboard and a high-thread-count CPU.

You can experiment with a lot of models, it's just going to be slow.

With DDR5 you can go even higher, with 48GB modules.

Otherwise I got a 3060 12G, which can be had for 200€ used.

It's a very affordable setup.
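
Something like this llama-cpp-python sketch shows the idea of splitting a GGUF between the 12GB card and system RAM (the model path and layer count are placeholders to tune for your model):

    from llama_cpp import Llama  # pip install llama-cpp-python, built with CUDA support

    llm = Llama(
        model_path="models/some-model.Q4_K_M.gguf",  # any GGUF that fits in system RAM
        n_ctx=2048,
        n_threads=16,     # the high-thread-count CPU handles the layers left in RAM
        n_gpu_layers=20,  # offload as many layers as fit in the 12GB of VRAM
    )

    out = llm("Q: Why does memory bandwidth matter for LLMs? A:", max_tokens=128)
    print(out["choices"][0]["text"])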

reply
instagib
13 days ago
[-]
The energy and compute cost per unit of performance is not a good ratio given the current state of optimization, and old hardware makes it worse.

Consider making friends with people who have a good desktop or laptop, and see if you can borrow it for a little while when visiting, in exchange for making them a meal or coffee.

If you give up on local, you can reduce the cost by using rented servers instead.

Giving up on performance and tolerating hallucination, as an introduction to LLMs, is my only suggestion for budget and local. A very specific LLM for spellcheck or a similarly narrow task would be possible on limited hardware.

IIRC, there is a publication on 1.3-bit or 1.4-bit quantization that someone has implemented on GitHub.

reply
angoragoats
13 days ago
[-]
This might technically be outside your budget, but if you happen to have a PC, I highly recommend the RTX 4060 Ti 16GB ($450, or less if on sale). It can easily handle 13B models and is quite fast. You don’t need a fancy PC to put it into; anything with a spare PCIe slot and a reasonably sized power supply will work.

These cards can easily be found at MSRP because they’re not a great improvement over the 3060/4060 8GB for gaming, but the added memory makes them excellent for AI.

reply
roosgit
12 days ago
[-]
About a year ago I bought some parts to build a Linux PC for testing LLMs with llama.cpp. I paid less than $200 for: a B550MH motherboard, AMD Ryzen 3 4100, 16GB DDR4, 256GB NVMe SSD. I already had an old PC case with a 350W PSU and a 256MB video card because the PC wouldn’t boot without one.

I looked today on Newegg and similar PC components would cost $220-230.

From a performance perspective, I get about 9 tokens/s from mistral-7b-instruct-v0.2.Q4_K_M.gguf with a 1024 context size. This is with overclocked RAM which added 15-20% more speed.

The Mac Mini is probably faster than this. However, the custom-built PC route gives you the option to add more RAM later on to try bigger models. It also lets you add a decent GPU, something like a used 3060, as one of the other comments suggests.
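
For reference, something like this is enough to measure tokens/s (a rough sketch with llama-cpp-python rather than the llama.cpp binary, same GGUF and context size):

    import time
    from llama_cpp import Llama

    llm = Llama(model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf", n_ctx=1024, n_threads=4)

    start = time.time()
    out = llm("Write a short note on RAM overclocking.", max_tokens=256)
    elapsed = time.time() - start

    generated = out["usage"]["completion_tokens"]
    print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")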

reply
duffyjp
12 days ago
[-]
FYI for the Mac Mini idea: I have an M1 MacBook Pro with 32GB. There's some sort of limitation on how much RAM can be allocated to the GPU. Trying to run even a 22GB model will fail. The best I've gotten is Code Llama 34B 3-bit at 18.8GB. There can be tons of RAM still free, but the LLM will just loop endlessly, dropping a chunk of RAM and reloading it from disk.
reply
s1gsegv
12 days ago
[-]
Yes, Metal seems to allow a maximum of 1/2 of the RAM for one process, and 3/4 of the RAM allocated to the GPU overall. There’s a kernel hack to fix it, but that comes with the usual system integrity caveats. https://github.com/ggerganov/llama.cpp/discussions/2182
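
Those fractions are approximate and seem to vary by macOS version, but as a rough sanity check against the numbers in the parent comment:

    ram_gb = 32
    overall_gpu = ram_gb * 3 / 4  # ~24 GB can be wired to the GPU in total
    per_process = ram_gb * 1 / 2  # ~16 GB nominal ceiling for a single process

    # The parent's 22 GB model fails while the 18.8 GB one loads, so the effective
    # per-process limit on that machine sits somewhere between these two bounds.
    for model_gb in (18.8, 22.0):
        print(f"{model_gb} GB: under overall limit: {model_gb < overall_gpu}, "
              f"under per-process limit: {model_gb < per_process}")
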
reply
pquki4
13 days ago
[-]
I think you need to be very clear about what your goal is -- just playing with different "real" hardware, running some small models as experiments, or trying to do semi-serious work with LLMs? How much do you want to spend on hardware and electricity in the long run, and how much are you willing to "lose"? E.g. if a setup turns out to be not very useful and hard to repurpose because you already have too many computers, and you need to either sell it or throw it away, what's your limit?

Depending on your answer, I suspect you might want to use Google Colab Pro/Paperspace/AWS/vast.ai instead of building your own hardware.

reply
lemonlime0x3C33
13 days ago
[-]
I have used a Raspberry Pi for running image classification CNNs; it really depends on the model you are using. Edge/IoT AI is making a lot of progress in running models on resource-constrained devices.

If you have access to the dataset you want, you could train a model yourself, sized to fit your target hardware. You could also look at FPGA solutions if you are comfortable working with those. Training locally might take some time, but you could use Google Colab to train it.
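
A minimal sketch of the inference side on a Pi with tflite-runtime (the model file and input handling are placeholders):

    import numpy as np
    from tflite_runtime.interpreter import Interpreter  # pip install tflite-runtime

    interpreter = Interpreter(model_path="mobilenet_v2_quant.tflite")  # any quantized classifier
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    # Dummy frame; in practice load a real image and resize it to the model's input shape.
    image = np.zeros(inp["shape"], dtype=inp["dtype"])
    interpreter.set_tensor(inp["index"], image)
    interpreter.invoke()
    scores = interpreter.get_tensor(out["index"])
    print("top class index:", int(scores.argmax()))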

reply
wokwokwok
13 days ago
[-]
You can use a Raspberry Pi with 8GB of RAM, or any cheap stick PC, to run a quantised 7B model (e.g. https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF).

For larger models, or GPU accelerated inference, there is no “cheap” solution.

Why do you think everyone is so in love with the 7B models?

It’s not because they’re good. They’re just ok, and it’s expensive to run larger models.

reply
ilaksh
13 days ago
[-]
What kind of computer do you currently have? Try Phi-3; it's amazing and under 4GB. If you want something affordable, then the one rule is to stay away from Apple products. Maybe a Ryzen mini PC: https://www.geekompc.com/geekom-a7-mini-pc-ryzen-7000/

AMD Ryzen 7 5800H

reply
nubinetwork
13 days ago
[-]
I'm happy with a 2950x and a Radeon VII... but that costs more than your example Mac mini.
reply
cjbprime
12 days ago
[-]
There are some ARM SBCs with e.g. 32GB RAM and an NPU for under $300, such as the Orange Pi 5 Plus, but I'm guessing refurbished Apple Silicon hardware is the best answer for the price.
reply
p1esk
13 days ago
[-]
> what is considered affordable hardware for running large language models locally today?

I’d say “under $20k” is considered affordable, at least if you want to run decent models (>70B) at bearable speeds (>1 t/s). In comparison, a single H100 server is $250k.

Your optimal choice today is a Mac Studio with 192GB of unified memory (~$7k), but it will be too slow to run something like Llama 400B.

reply
angoragoats
13 days ago
[-]
What criteria are you using to define “optimal”? And what is your use case (e.g. how large of a model would you like to run)? If you want to maximize the amount of high-bandwidth memory available, Macs are decent. But they are not the best value IMHO.

If you just want to play with 13B parameter models or smaller, an RTX 4060 Ti 16GB is a great option at $450 or less.

If you want the ability to use larger models, RTX 3090s are a pretty good value. They can be had on the secondary market for $700ish, and are quite fast and have 24GB each. For 70B models, you’ll want to use 4-5 bit quantization and have two 3090s. You could probably run larger models on 4 or 6 of them.
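
The sizing rule of thumb here is weights-only memory; the KV cache and activations add a few GB on top, so treat these as lower bounds:

    def weights_gb(params_billion, bits_per_weight):
        # parameters * bits per weight / 8 bits per byte, expressed in GB
        return params_billion * bits_per_weight / 8

    for params, bits in [(13, 4), (13, 5), (70, 4), (70, 5)]:
        print(f"{params}B @ {bits}-bit ~ {weights_gb(params, bits):.1f} GB of weights")
    # 13B @ 4-5 bit ~ 6.5-8 GB  -> fits on a 16GB 4060 Ti with room for context
    # 70B @ 4-5 bit ~ 35-44 GB  -> needs two 24GB 3090s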

Both of these options require a PC to install them into, but cost nowhere near as much as a Mac Studio with 192GB of RAM. Yes, the Mac will give you more memory in total, but it won't be as fast at inference and costs several multiples of a dual-3090 setup.

reply
p1esk
13 days ago
[-]
I want to play with the best model I can get my hands on. In the near future, that will probably be Llama 400B. Even that model will probably be dumber than GPT-4, and GPT-4 feels pretty dumb sometimes.
reply