I went with this because a) the models I wanted to use are a little too big to fit comfortably in 24gb, plus I need room for a few additional small models for autocomplete and speech recognition, and b) I already had a cheap server to use and dual gpus would've required upgrading the mobo and power supply and probably the case as well.
It was definitely a little tricky to set up. The Intel line requires a driver package called "level zero" to support something called SYCL (Intel's version of CUDA basically, AFAICT) that was tricky to get working. I am running llama.cpp in a docker container, which also required some fiddling to get the container to see the card. You also need a kernel from the last few months.
Once I got it working though, the results are very impressive for a $1k investment. Qwen 3.6 35B at q4 quantization takes about 3/4 of the ram and delivers like 88 tokens/sec. So, if you want a decent-sized model for cheap, this is one way to go.
The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like $50-55K.
Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware. You will read a lot of claims that 4-bit quantization is lossless, but those claims come from KL divergence measurements on a small corpus. Use one of these 4-bit models on long context coding tasks and the quality will be noticeably worse. Even for non-coding tasks like dataset analysis, I can measure a substantial quality difference between 4-bit models, 8-bit quants, and sometimes even the full 16-bit source.
This article is also encouraging the use of a REAP model, which means someone has cut out some of the weights to make it smaller. The idea is to remove weights that are less useful for certain tasks, but again this is going to reduce the overall quality of the output.
The trap is that people say "I'm running GLM-5.2 locally!" and it sounds amazing when you look at the GLM-5.2 benchmarks. However they're not actually running GLM-5.2, they're running a model derived from GLM-5.2 that discards most of the bits and drops some of the experts. It does not perform the same as what you see in the benchmarks. In my experience, the divergence between a quantized/REAP model and the parent model is unnoticeable when you try it on very small tasks or chat, but becomes painful when you start trying to use it on long-horizon tasks where little errors start compounding.
Then you get into the slippery slope of thinking you're $50K deep into this project, but what you really need is just one or two more of those $12K GPUs to use the next level of quantization that might improve the quality a little more and make your investment worthwhile...
> Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware.
This seems to ignore the very real possibility of running SOTA models at full precision on ordinary local hardware using SSD offload. Yes this will be slow and usually have very low throughput (even batched decode can only achieve so much before power and thermal limits become important, and that still leaves you with slow prefill as a major bottleneck) but that's OK if you aren't expecting a real-time response to begin with and your volumes as a single user are low enough.
GLM-5.2 has 40B active parameters at a time. At Q4 that's 20GB. The best PCIe 5 SSDs can get 15GB/sec when everything goes well. Every expert load would take more than a second.
If you had enough RAM and enough SSDs in parallel you might get a couple tokens per second on a good day. If you left this machine running 24 hours straight, you might be able to get 200,000 tokens generated.
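A rough sketch of the arithmetic behind those numbers, in Python. It assumes the worst case, where every active weight has to be read from SSD for each token with nothing cached in RAM, and it reads "a couple tokens per second" as exactly 2 tok/s. Both are simplifying assumptions, and the function names are just illustrative.

```python
def seconds_per_token(active_params: float, bits_per_weight: float,
                      read_gb_per_s: float) -> float:
    """Time to stream every active weight from storage once (no caching)."""
    bytes_needed = active_params * bits_per_weight / 8
    return bytes_needed / (read_gb_per_s * 1e9)


def tokens_per_day(tokens_per_second: float) -> float:
    """Tokens generated in 24 hours of uninterrupted decoding."""
    return tokens_per_second * 24 * 60 * 60


# 40B active parameters at 4 bits each, read from one 15 GB/s SSD.
one_ssd = seconds_per_token(40e9, 4, 15)
print(f"{one_ssd:.2f} s per token from a single SSD")

# Assumed: "a couple tokens per second" taken as 2 tok/s.
print(f"{tokens_per_day(2.0):,.0f} tokens per day at 2 tok/s")
```

That gives about 1.33 seconds per token from a single SSD and roughly 170,000 tokens per day at 2 tok/s, which is in line with the 200,000 ballpark above.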
So it can be done, but only if you interact with your LLM like you're e-mailing someone back and forth and you're okay waiting until tomorrow for a response.
You would spend $50K to buy a machine that consumes 2000W and takes all day to produce as many tokens as I could buy on OpenRouter for $0.60. You would spend $5-15 on electricity depending on where you live.
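To make the comparison concrete, here is a small Python sketch of the daily running cost. The 2000W draw, the 200,000 tokens per day, and the $0.60 API bill come from the comments above. The electricity rates of $0.10 and $0.30 per kWh are my own assumptions, chosen to bracket the $5-15 range.

```python
def daily_kwh(watts: float, hours: float = 24.0) -> float:
    """Energy used by a machine drawing `watts` continuously."""
    return watts / 1000 * hours


def usd_per_million(total_usd: float, tokens: float) -> float:
    """Convert a total cost for `tokens` tokens into dollars per million."""
    return total_usd / tokens * 1_000_000


TOKENS_PER_DAY = 200_000
kwh = daily_kwh(2000)  # 48 kWh per day

for rate in (0.10, 0.30):  # assumed electricity prices in $/kWh
    power_cost = kwh * rate
    print(f"${rate:.2f}/kWh: ${power_cost:.2f}/day, "
          f"${usd_per_million(power_cost, TOKENS_PER_DAY):.0f} per million tokens")

api = usd_per_million(0.60, TOKENS_PER_DAY)
print(f"API price implied by $0.60 for the same tokens: ${api:.2f} per million")
```

Under those assumptions the electricity alone works out to about $24-72 per million tokens, against roughly $3 per million from the API, before counting the hardware at all.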
If you have no other option but to process data locally and you must use a very large model and you aren't in a rush, this can do it. I would not recommend it unless you're desperate and operating inside of rigid constraints.
(If you’re curious, it was running in Pi, but somehow convinced itself it was running in Claude instead and started trying to call Claude tools that didn’t exist)
The best NVFP4 quant for 5.2 appears to be lukealonso's at https://huggingface.co/lukealonso/GLM-5.2-NVFP4, and it is capable of good throughput (75-100 tps) without losing much reasoning performance. Allowing for overhead for the KV cache and other requirements, this quant will (barely) run in 8-way tensor-parallel mode on 8x RTX 6000 cards. Not too long ago it was possible to put an 8x machine together for less than $100K USD, but that's probably not true now, assuming you buy all-new components.
It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers. If I hadn't already put a similar rig together, I'd be kicking myself. But getting it running well is by no means as simple as buying a bunch of RTX6K cards and calling it a day, and people need to know what they're getting into.
Local AI is in its Altair and IMSAI days. There's no turnkey Apple II or C64 on the market yet, much less an IBM PC. Hardware, yes -- you can buy a capable box off the shelf from various vendors -- but you have to be prepared to take up a whole new hobby when it comes to getting a complete system working well.
The proper financial comparison for GLM-5.2 would be one of the providers on OpenRouter or renting a server as needed. Compare apples to apples.
You will almost certainly never break even compared to paying per token.
Local LLMs at this scale are only worth it if you have extremely strict requirements that data not leave the premises.
They could 10X the prices and you’d still be better off. It’s also unlikely that prices go up enough to warrant a $100K local investment to prevent paying a couple bucks per million tokens.
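A back-of-the-envelope break-even sketch in Python. It takes "a couple bucks" as $2 per million tokens and assumes a single 100 tok/s stream (the top of the 75-100 tps range quoted elsewhere in the thread). It ignores electricity, maintenance, depreciation, and batching, so treat it as an illustration rather than a real cost model.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def break_even_tokens(hardware_usd: float, api_usd_per_million: float) -> float:
    """Tokens you must generate before the hardware price equals the API bill."""
    return hardware_usd / api_usd_per_million * 1_000_000


def years_to_generate(tokens: float, tokens_per_second: float) -> float:
    """Wall-clock years of nonstop decoding needed to produce `tokens`."""
    return tokens / tokens_per_second / SECONDS_PER_YEAR


for price in (2.0, 20.0):  # assumed API price, then the same price at 10X
    tokens = break_even_tokens(100_000, price)
    years = years_to_generate(tokens, 100)
    print(f"${price:.0f}/M tokens: break even after {tokens / 1e9:.0f}B tokens, "
          f"about {years:.1f} years at 100 tok/s")
```

At $2 per million that is 50B tokens, or close to 16 years of nonstop single-stream decoding. At 10X the price it is still 5B tokens and about a year and a half. Batched serving would shorten both, but only if you actually have that much work to feed it.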
> or denying you service
I guess you’re not familiar with OpenRouter? There are many providers there. There are providers outside of OpenRouter. There will always be someone to take your business.
> or somehow abusing your data...
If data security is your concern then you’re still better off renting a server as needed.
If you cannot tolerate any data leaving, then local models are the only way. You pay a high premium for it!
Still... if it's not your weights, running on your box, you're always going to be behind somebody else's 8-ball. Everybody has to decide for themselves where their priorities lie.
The models are so powerful and consequently so expensive and confusing to use, I don't get all of it.
Why ask FABLE 5000 to "summarize this email thread" when a tiny model can do the job?
Sure, Codex3000 can oneshot your backlog, but why not use a subsidized subscription to do it for now? We're clearly not at the peak of these models' capabilities yet.
Just want to note that for $3k you can get an M5 macbook pro with 48gb of shared memory, and it will not be a giant box. Also, consider committing to spending that money on a cloud hosting provider, which will be at least somewhat cheaper if not significantly cheaper. It is awesome being able to run models locally though.
So, I always thought local LLMs were toys not worth pursuing.
Only once I tried something decent like Gemma 4 31B and Qwen 3.6 27B did I realize how incredibly useful they are.
You stop fearing you are sharing sensitive information.
You stop fearing you will run out of tokens.
You stop fearing the remote AI will be unavailable.
Local LLMs are extremely valuable.
This translates to qwen 27b actually working fast enough for useful work on dual 3090s and being painfully slow on Macbook Pros. Also if you're running a big model on a macbook pro the UI gets laggy and the keyboard gets hot. Much better to run dual 3090s in your basement and connect to them from your Macbook.
Even a 128GB is $6.8k today. Still only 2/3 your quote.
Bandwidth is relevant (I have both a 5090 and an M4 Max 128GB Studio, so have direct comparison right here), but quote the cost appropriately!
There are other arguments for running an ssh-able box in a closet somewhere too. With KVMs you can give an agent remote control over the machine itself, so it has vastly more capabilities than if it were controlling the machine it's running on. You also don't need to keep the MacBook open all the time just to let the agent finish running.
The M5 hardware is amazing for what it is, but GPUs are still so much faster.
Running the models on the GPU box also means I can use the laptop on my lap instead of turning it into a hot plate.
Get a regular laptop and use the network to access the LLM
That is equivalent to 16.8 years of Claude Opus 4.8 or Codex GPT 5.5 at $200/mo.
I'm a huge fan of running local models, but they're still wildly expensive, lower quality, and possibly dangerous (if backdoored). I sincerely wish this wasn't the case.
(I'd be surprised if that local rig really can drive the equivalent of $4,000/month of API spend though, given that a local rig can run prompts in parallel a lot less effectively than Anthropic's many data centers.)
I went through 1B tokens my first month with an OEM spark. That's more than $1k of opus. Not a fair comparison, because token patterns are different, but since that time I have also seen a 2-3x improvement in the speeds from improvements in vllm (mainly MTP). DiffusionGemma is around 4x regular gemma.
You don't own your fiber connection. So why try to own another rapidly depreciating, expensive, and annoying asset?
Rent cloud GPUs!
You get to participate in the ownership, data control, price control, and hacking culture without having to Frankenstein some hobbyist box that costs a ton, is distilled down to functional uselessness, and is a PITA to maintain.
GLM 5.2 is "almost Opus," and it needs at least 8xH200s for comfortable inference (so it's closer to $400k than $40k).
They suggest using this modified model:
>A REAP-pruned (≈22% of experts removed), Int8-mix NVFP4 quantized version of GLM-5.2, ≈594B parameters.
I wonder how it behaves in practice outside of benchmarks. Qwen3.6, even at 6-bit quantization, often gets stuck in loops while reasoning. And here they've also removed some experts. I mean, sometimes an 8-bit or 16-bit small model can be smarter than a lobotomized large model. I heard the consensus is you shouldn't go below 8 bit for coding.
Also, it's not clear what is left of the available context when you try to fit a lobotomized model into 4 RTX 6000s. Anything below 100k is barely usable because it often hits compaction before it's able to gather the necessary context.
P.S. Found in the repos: 240k context.
What is the behavior if one were to run GLM 5.2 with only a single H200?
Would it fail to run at all, or would it just run so slowly as to be unusable?
I would like to prove out the build, and concept, of a SOTA model locally, but then backfill the rest of the GPUs in 18-24 months when they cost significantly less ...
I assume you can then somehow run several hundreds of prompts concurrently?
Seemingly every available option has some subtle gotchas about how easy it is to blow off your foot and effectively have no security at all. I use VMs because I actually trust that security is a foundational principle of the technology, not a well-if-you-use-these-20-flags-and-squint kind of deal.
personally, i think either a VM or microVM is the way to go. these things are actually designed as security boundaries, as opposed to containers. and as compared to bubblewrap, you can just give the agent a whole FS to work with and run it in yolo mode, whereas with bubblewrap you have to manually bootstrap the availability of each individual dev tool and make sure its config dirs and package caches and etc are mounted in a secure way and still will probably hit perm errors all the time. and there's just way less isolation.
also, something that has limited support in harnesses but IMO would make a lot of sense is running the harness process in the host, but having all the tool calls and file system interactions delegated to the VM. that way you keep all your session data and auth keys on the main machine where it can never get into context. otoh it makes your harness part of the security boundary, so that's the trade-off.
there's also all the usability questions around how to actually get data in/out of the VM. i have a script which can push local git repos into the VM and then pull from them as a remote, so the VM can't initiate any connection with the host and doesn't need to hold git credentials. but ig for someone who wants their agent to push straight to GitHub that's a waste of effort.
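For anyone who wants to try that host-initiated workflow, here is a minimal Python sketch of the idea. It is not the commenter's actual script. The SSH URL is hypothetical, and it assumes the VM side is a bare repository, since git refuses pushes into a checked-out branch by default. The host runs both commands, so the VM never opens a connection outward and never holds credentials.

```python
import subprocess
from pathlib import Path


def push_cmd(repo: Path, vm_url: str, branch: str) -> list[str]:
    """Command the host runs to copy a branch into the VM's bare repo."""
    return ["git", "-C", str(repo), "push", vm_url, f"{branch}:{branch}"]


def fetch_cmd(repo: Path, vm_url: str, branch: str) -> list[str]:
    """Command the host runs to pull the agent's work back for review.

    The result lands in refs/remotes/vm/<branch>, so nothing is merged
    into local branches until a human has looked at it.
    """
    return ["git", "-C", str(repo), "fetch", vm_url,
            f"{branch}:refs/remotes/vm/{branch}"]


def run(cmd: list[str]) -> None:
    """Run a git command, raising CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)


# Example usage (hypothetical host name and paths):
#   vm = "ssh://agent@sandbox-vm/srv/git/project.git"
#   run(push_cmd(Path("~/src/project").expanduser(), vm, "main"))
#   ... the agent works inside the VM ...
#   run(fetch_cmd(Path("~/src/project").expanduser(), vm, "main"))
```

Fetching into a separate `vm/` namespace rather than straight into a local branch is a design choice of this sketch: it keeps the agent's commits quarantined until you diff and merge them yourself.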
options i've tried or seen for the VM itself:
- qemu + libvirt: takes some doing to wrangle it together, but very battle tested and configurable
- crun-vm is a PoC of higher level integration layer between podman and qemu, which is a really cool way to go about it. seems maybe abandoned but i just think it's neat and very existing tools/standards oriented rather than starting a new project and brand so i mention
- libkrun is a newer entrant, and several ppl have built wrappers around it:
  - microsandbox
  - smolvm (posted/discussed on here recently)
  - krunvm
this is all Linux oriented, it's all i know.
The risky part is in the agent/harness and what tools it has access to.
You don't need to give GPU passthrough to the VM running the agent/harness.
There is still a risk of a prompt messing with the inference server, but I think that's a much lower risk compared to an agent doing whatever on its own.
This approach requires that you trust the llama.cpp codebase, essentially. It might be reasonable not to.
I suppose in principle there is the risk of a prompt exploit corrupting the inference server.
But the ecosystem isn't as mature, so Whisper is still a valid option, even now. For example, Parakeet uses the NeMo framework (made by Nvidia), which normally needs CUDA, so on AMD you need to use an ONNX version instead. Meanwhile Whisper has vLLM support and desktop apps like Buzz.
There aren't many benchmarks and they often don't have all the models, since STT doesn't get nearly as much attention as normal LLMs, but this is one of the more complete ones: https://artificialanalysis.ai/speech-to-text/non-streaming
I'm curious if GMKtec's EVO-X2, with ~96GB of usable VRAM, is still a good solution for something like this for $3,399.
The caveat is that if you try to use multiple models on the same device at the same time, you thrash and destroy tok/s
Buying four $13,000 GPUs and several thousand dollars worth of supporting hardware seems crazy. This supply shortage has to end eventually, and for the difference in price once that happens, I can buy billions of DeepSeek, MiMo, and GLM tokens and use $100 or $200 a month subscriptions for the big guys in the meantime. And, you can't even run the full-sized GLM on that hardware, it is quantized and so is your KV cache; the degradation is small, but not non-existent. You're not running a model that's equal to what you get when you buy GLM tokens from Z.ai.
My recommendation for self-hosting is this: If you already have a 24GB or 32GB GPU, or two, or a recent Mac with 32GB or more, run the appropriate quantization of Qwen 3.6 27B or Gemma 4 31B. If your hardware is older and too slow for that, use the MoE, but know it'll be dumber. Use the tiny model for the stuff that doesn't need deep smarts: Research (give it a Brave or Exa MCP for web search), summarization, simple Python scripts for basic tasks, simple websites or web apps, categorization of stuff (I used Gemma 4 to review my past writing for friendliness and helpfulness), etc. It can also be a sub-agent for bigger agents (for those same kinds of tasks). Gemma 4 12B is an incredibly good model for its size, particularly for vision tasks, and in the 4-bit quantization (7GB on disk) it runs on anything, even a modern tablet or phone.
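A quick way to sanity-check which quantization fits which card, sketched in Python. It only counts the raw weights at a nominal bits-per-weight. Real quantized files come out somewhat larger (the 12B model above is 7GB on disk at 4-bit, not 6GB), and the 4GB of headroom for KV cache and runtime overhead is my own assumption, not a measured figure.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight size in GB at a nominal bits-per-weight."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, vram_gb: float,
         headroom_gb: float = 4.0) -> bool:
    """True if the weights plus an assumed headroom fit in `vram_gb`."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


for name, size in (("Gemma 4 12B", 12), ("Qwen 3.6 27B", 27), ("Gemma 4 31B", 31)):
    for bits in (4, 8):
        print(f"{name} at {bits}-bit: {weights_gb(size, bits):.1f} GB of weights, "
              f"fits 24GB: {fits(size, bits, 24)}, fits 32GB: {fits(size, bits, 32)}")
```

By this estimate the 27B and 31B models fit a 24GB card at 4-bit (13.5 and 15.5 GB of weights) but not at 8-bit, which matches the advice to pick the quantization that suits the card you already own.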
And, if you don't already have a big GPU or unified memory Mac, just wait. Use the cheap tokens every AI company wants to sell you, for now. A Claude or Codex or Gemini subscription is a good deal. Tokens from DeepSeek are a good deal, especially with the Reasonix agent (which maximizes caching, which DeepSeek is uniquely good at, and cached tokens are uniquely cheap at DeepSeek). GLM is Good Enough and has a cheap coding plan. MiMo has the cheapest tokens for a 1T+ model in the game. DeepSeek and GLM are better models, but MiMo is fine.
When prices come down, I'll be speccing out a beast to run the big models, too. But, I'm not paying 4x for RAM and GPU and storage, and y'all shouldn't either. That's crazy. Computer prices go down over time. It is the law.
I’m pretty bullish that Apple will deliver something very competitive for the average consumer in the next couple years.
They have unified memory and respectable inference performance, and for some variations can be cheaper than video cards, especially if you get an older-gen high-end M series with a lot of RAM used or refurbished.
I've read that Apple plans to offer more RAM in all their models once the RAM bottleneck passes, and that future M series GPUs and NPUs will be even better for local inference, so in the future I expect Apple to be a serious offering for local inference and AI research workstations.
And what about AMD and Intel Arc GPUs? They don't get as much love but I've heard they can be compelling for certain shapes of a local LLM configuration.
At this point though, I think we may be in a "renter's market" for LLM compute. If you want privacy it might be better to rent GPU time in raw form or use spot pricing at various providers. It probably only makes sense to build if you have extreme privacy/security needs or just want to do it because it's cool.
Do we have evidence that this will actually happen? Maybe the belief that it won't pass is what requires evidence, but I think there's a widespread feeling right now that things are just getting permanently worse and this is one example.