Comparing it against the RTX 4000 SFF Ada (20GB), which is around $1.2k (if you believe the original price on the Nvidia website https://marketplace.nvidia.com/en-us/enterprise/laptops-work...) and which I have access to on a Hetzner GEX44.
I'd ballpark it at 2.5-3x faster than the desktop, except for the tg128 test, where the difference is "minimal" (but I didn't do the math).
See my AI cluster automation setup here: https://github.com/geerlingguy/beowulf-ai-cluster
I was building that over the course of making this video, because it's insane how much manual labor people put into building home AI clusters :D
Obviously an RTX 5090 with 32GB of VRAM is even better, but it costs around $2000, if you can find one.
What's interesting about this Strix Halo system is that it has 128GB of RAM that is accessible (or mostly accessible) to the CPU/GPU/APU. This means that you can run much larger models on this system than you possibly could on a 3090, or even a 5090. The performance tests tend to show that the Strix Halo's memory bandwidth is a significant bottleneck though. This system might be the most affordable way of running 100GB+ models, but it won't be fast.
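One way to put rough numbers on that bottleneck: if token generation has to stream the active weights from memory once per token, bandwidth alone caps tokens/sec. A minimal sketch (the ~256 GB/s figure is the quoted theoretical peak for Strix Halo's LPDDR5X; the working-set sizes are illustrative assumptions, not benchmarks):

    # Rough ceiling on decode speed if each token must stream the active
    # weights from memory (ignores KV cache, compute, and overlap).
    # Assumed numbers: ~256 GB/s theoretical peak for Strix Halo
    # (LPDDR5X-8000 on a 256-bit bus); sustained bandwidth will be lower.

    def decode_ceiling_tok_s(active_weights_gb, bandwidth_gb_s=256.0):
        """Upper bound on tokens/sec for a bandwidth-bound decode."""
        return bandwidth_gb_s / active_weights_gb

    print(decode_ceiling_tok_s(100))  # dense ~100 GB model: ~2.6 tok/s ceiling
    print(decode_ceiling_tok_s(5))    # MoE with ~5 GB active per token: ~51 tok/s

That's presumably why MoE models are the sweet spot for this kind of box: they fit in the big unified memory but only touch a fraction of it per token.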
The BIOS allows pre-allocating 96 GB max, and I'm not sure if that's also the limit under Windows, but under Linux you can use `amdttm.pages_limit` and `amdttm.page_pool_size` [1].
[1] https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...
    htpc@htpc:~% free -h
                   total        used        free      shared  buff/cache   available
    Mem:           125Gi       123Gi       920Mi        66Mi       1.6Gi       1.4Gi
    Swap:           19Gi       4.0Ki        19Gi
[1] https://bpa.st/LZZQ
In 4K pages, for example:
options ttm pages_limit=31457280
options ttm page_pool_size=15728640
This will allow up to 120 GB to be allocated (31457280 × 4 KiB pages) and pre-allocate 60 GB (15728640 × 4 KiB); you could preallocate none or all of it, depending on your needs and fragmentation size. I believe `amdgpu.vm_fragment_size=9` (2 MiB) is optimal.
I keep thinking that the bottleneck has to be CPU RAM, and that for a large model the difference would be minor. For example, with a 100 GByte model such as quantised gpt-oss-120B, I imagine that going from 10 GB to 24 GB of VRAM would scale my tk/s like 1/90 -> 1/76 (the share of the model left in system RAM), so a ~20% advantage? But I can't find much on the high-level scaling math. People seem to either create calculators that oversimplify, or go too deep into the weeds.
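For what it's worth, a back-of-envelope version of that scaling math, assuming decode is purely memory-bandwidth-bound and the GPU- and CPU-resident parts of the model are read sequentially each token (the bandwidth figures below are placeholder assumptions, not measurements):

    # Rough partial-offload model: time per token = VRAM-resident bytes read
    # at GPU bandwidth + system-RAM-resident bytes read at CPU bandwidth.
    # All numbers are assumptions for illustration only.

    def tok_s(model_gb, vram_gb, gpu_bw_gb_s=900.0, cpu_bw_gb_s=60.0):
        gpu_part = min(model_gb, vram_gb)   # weights resident in VRAM
        cpu_part = model_gb - gpu_part      # weights left in system RAM
        time_per_token = gpu_part / gpu_bw_gb_s + cpu_part / cpu_bw_gb_s
        return 1.0 / time_per_token

    # 100 GB model, 10 GB vs 24 GB of VRAM:
    print(tok_s(100, 10))  # ~0.66 tok/s
    print(tok_s(100, 24))  # ~0.77 tok/s, roughly the 1/90 vs 1/76 intuition

Because the system-RAM term dominates, the gain from a bigger card stays modest (~17% in this toy case) until the whole model fits in VRAM; the sketch ignores compute, KV cache, and overlap, so real numbers will differ.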
I'd like a new anandtech please.
I’m struggling to justify the cost of a Threadripper (let alone a Pro!) for a AAA game studio, though.
I wonder who can justify these machines. High-frequency trading? Data science? Shouldn’t that be done on servers?
I think the Xeon systems should have worked and that it was actually a motherboard BIOS issue, but I had seen a photo of it running in a Threadripper and prayed I wasn’t digging an even deeper hole.
If you need that in a single system, you gotta pay up. Lower-tier SP6 processors are actually pretty reasonably priced; boards are still spendy, though.
This is pure market segmentation. If you need that little bit extra, you’re forced to compromise or to open your wallet big time, and AMD is betting that the people who “really” need that extra oomph will pay.
I found it difficult to install ROCm on Fedora 42, but after upgrading to Rawhide it was easy, so I re-tested everything with ROCm vs Vulkan.
Ollama, for some silly reason, doesn't support Vulkan, even though I've used a fork many times to get full GPU acceleration with it on Pi, Ampere, and even this AMD system... (moral of the story: just stick with llama.cpp).
https://x.com/ollama/status/1952783981000446029
No experimental flag option, no "you can use the fork that works fine but we don't have the capacity to support it", just a hard "no, we think it's unreliable". I guess they just want you to drop them and use llama.cpp.
ROCm support is not wonderful. It's certainly worse for an end user to deal with than Vulkan, which usually 'just works'.
Considering they created Mantle, you’d think it would be the obvious move, too.
There was a period in between when AMD had basically EOL’d Mantle and Vulkan wasn’t even in the works yet.
My conspiracy theory is that it would help if contributors kept the proposed Vulkan Compute support up to date with new Ollama versions; no maintainer wants to deal with out-of-date pull requests.
Great work on this!
I saw mixed results, but the comments suggest very good performance relative to other at-home setups. Can someone summarize?