Key corrections:
Ollama GPU usage - I was wrong. It IS using GPU (verified 96% utilization). My "CPU-optimized backend" claim was incorrect.
FP16 vs BF16 - enum caught the critical gap: I trained with BF16, tested inference with FP16 (broken), but never tested BF16 inference. "GPU inference fundamentally broken" was overclaimed. Should be "FP16 has issues, BF16 untested (likely works)."
llama.cpp - veber-alex's official benchmark link proves it works. My issues were likely version-specific, not representative.
ARM64+CUDA maturity - bradfa was right about Jetson history. ARM64+CUDA is mature. The new combination is Blackwell+ARM64, not ARM64+CUDA itself.
The HN community caught my incomplete testing, overclaimed conclusions, and factual errors.
Ship early, iterate publicly, accept criticism gracefully.
Thanks especially to enum, veber-alex, bradfa, furyofantares, stuckinhell, jasonjmcghee, eadwu, and renaudr. The article is significantly better now.
But if that's not the case, then yeah, it's a crappy practice and I'd hate to see it spread any further than it already has.
Is that version correct?
Asking because (in Ollama terms) it's positively ancient; 0.12.6 is currently the most recent release.
I'm guessing it _might_ make a difference, as the Ollama crowd do seem to be changing things, adding new features and optimisations (etc) quite often.
For example, that 0.12.6 version is where initial experimental support for Vulkan (ie Intel Xe gpus) was added, and in my testing that worked. Not that Vulkan support would do anything in your case. ;)
https://www.anaconda.com/blog/python-nvidia-dgx-spark-first-...
> If you start Python and ask it how many CPU cores you have, it will count both kinds of cores and report 20
> Note that because of the speed difference between the cores, you will want to ensure there is some form of dynamic scheduling in your application that can load balance between the different core types.
Sounds like a new type of hell where I now not only need to manage the threads themselves, but also take into account what type of core they run on, and Python straight up reports them as the same.
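A minimal sketch of the "dynamic scheduling" the quoted post recommends, assuming a workload that can be cut into many small chunks. The helper names (`chunked`, `sum_squares`, `dynamic_sum_squares`) and the chunk size are hypothetical, chosen only for illustration. Because workers pull the next chunk from a shared queue whenever they finish one, a faster core simply ends up processing more chunks than a slower one, without the code knowing which core type it runs on.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunked(n_items, chunk_size):
    """Split range(n_items) into (start, stop) pairs of at most chunk_size items."""
    return [
        (start, min(start + chunk_size, n_items))
        for start in range(0, n_items, chunk_size)
    ]


def sum_squares(bounds):
    """Toy unit of work: sum of squares over a half-open range."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))


def dynamic_sum_squares(n_items, chunk_size=1000, workers=None):
    """Run many small chunks through a pool; idle workers pull the next chunk."""
    # os.cpu_count() reports all logical cores without distinguishing core types.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_squares, chunked(n_items, chunk_size)))
```

Threads are used here only to keep the sketch portable; for CPU-bound pure-Python work, `ProcessPoolExecutor` exposes the same `map` interface and avoids the GIL. A static split into one large slice per core would instead leave the fast cores idle while the slow ones finish.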
Isn't this the same architecture that the Mx from Apple implements from a memory perspective?
I have H100s to myself, and access to more GPUs than I know what to do with in national clusters.
The Spark is much more fun. And I’m more productive. With two of them, you can debug shallow NCCL/MPI problems before hitting a real cluster. I sincerely love Slurm, but there's nothing like a personal computer.
As for debugging, that's where you should be allowed to spin up a small testing cluster on-demand. Why can't you do that with your slurm access?
It's remarkable what can now be done on a whisper-quiet little box. I hope the Strix Halos will be just as much fun, and they should be, so long as Flash Attention works.
Fair, thanks for the answer.
The bane of my existence...
salloc: Granted job allocation 1978
salloc: Waiting for resource configuration
Even ignoring GPU details spark is an awesome little quiet powerhouse arm64 workstation that is 100% Linux first
Curious though how you offer idrac to customer, do you have another OOB BMC for the idrac? Or is this internal engineering context
We rent bare metal on-demand and our whole business is to be able to offer compute that you probably wouldn't be able to host in your house $, as if you own it yourself.
So, we made it so that users can get access into the BMC and modify the box however they want. When they are done, we've automated the reset as well. Fully self-service.
$ These boxes are very expensive, weigh 350lbs, sound like a jet engine and consume ~10kW.
That’s a big part of why we hot stage things for customers.
But it’s still not quite like exclusive access to resources when you want them. So I can see it from both ways.
Nah. Do you have 1st hand experience with Strix Halo? At less than 1600€ for a 128GB configuration it manages >45 tokens/s with gpt-oss 120b. Which is faster than DGX Spark at a fraction of the cost.
But please have your LLM post writer be less verbose and repetitive. This is like the stock output from any LLM, where it describes in detail and then summarizes back and forth over multiple useless sections. Please consider a smarter prompt and post-editing…
I was not familiar with the hardware, so I was disappointed there wasn't a picture of the device. Tried to skim the article and it's a mess. Inconsistent formatting and emoji without a single graph to visualize benchmarks.
I bet the input to the LLM would have been more interesting.
It looks like it worked? Why's it say this?
> Verdict: Inference speed scales proportionally with model size.
Author only tried one model size and it's faster than NVIDIA's reported speed at a larger model. Not really a "Verdict".
> Verdict: 4-bit quantization is production-viable.
That's not really something you can conclude from messing around with it and saying you like the outputs.
> GPU Inference is Fundamentally Broken
Probably not? It probably just doesn't work in llama.cpp right now? Takes a while reading this to work out they tried ollama and then later llama.cpp, which I'd guess is basically testing llama.cpp twice. Actually I don't even believe that, I'm sure author ran into errors that might be a pain to figure out, but there's no evidence it's worse than that.
But then it says this is the "root cause":
ARM64 + Blackwell + CUDA 13.0 = Bleeding Edge
↓
Limited production testing
↓
Edge cases in numerical precision (inference)
↓
Memory management issues (training)
Am I to believe GPU inference is really fundamentally broken? I'm not seeing the case made here, just claims. At this point the LLM seems to have gotten confused about whether it's talking about the memory fragmentation issue or the GPU inference issue. But it's hard to believe anything from this point on in the post.
ARM64 Architecture:
- Not x86_64 (limited ML ecosystem maturity)
- No PyTorch wheels for ARM64+CUDA (must use Docker)
- Most ML tools optimized for x86
No evidence for any of this whatsoever. The author just asked Claude/claude code to write their article and it just plain hallucinated some rubbish.
Like in Upstream Color: https://www.youtube.com/watch?v=zfDyEr8Ykcg
I haven't exactly bisected the issue but I'm pretty sure convolutions are broken on sm_121 after a certain size, getting 20x memory blowup from a convolution from a 2x batch size increase _only_ on the DGX Spark.
I haven't had any problems with inference, but I also don't use the transformers library that much.
llama.cpp was working for openai-oss last time I checked and on release, not sure if something broke along the way.
I don't know exactly whether memory fragmentation is fixable on the driver side. This might just be a problem with the kernel's policy and the GPL, which prevents them from automatically interfering with the memory subsystem at the granularity they'd like (see ZFS and its page table antics), or so my thinking goes.
If you've done stuff on WSL, you have similar issues, and you can fix them by running a service that regularly compacts and cleans memory; I have it run every hour. Note that this does impact, at the very least, CPU performance and memory allocation speeds, but I have not had any issues with long training runs with it (24hr+, assuming that is the issue; I have never tried without it, and I put that service in place as soon as I got the machine because of my experience on WSL).
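The comment does not say what the service actually runs, so the following is only a guess at one common way to do it on Linux: flush dirty pages, drop the page cache via `/proc/sys/vm/drop_caches`, then request compaction via `/proc/sys/vm/compact_memory`. Both files are standard kernel interfaces (the latter requires a kernel built with compaction support), and writing to them requires root. The function name and the `proc_sys_vm` parameter are hypothetical; the parameter exists only so the sketch can be dry-run against a scratch directory.

```python
import os
from pathlib import Path


def compact_and_drop_caches(proc_sys_vm=Path("/proc/sys/vm")):
    """Drop clean caches, then ask the kernel to compact free memory.

    Assumption: this mirrors what an hourly 'compact and clean' service
    might do; the original commenter's service is not described.
    """
    # Flush dirty pages first so more of the page cache is clean and droppable.
    os.sync()
    # "3" frees the page cache plus dentries and inodes.
    (proc_sys_vm / "drop_caches").write_text("3\n")
    # "1" triggers compaction across all memory zones.
    (proc_sys_vm / "compact_memory").write_text("1\n")
```

Run hourly from a systemd timer or cron job, this would match the cadence described above, at the cost the commenter mentions: a colder cache and slower allocations right after each run.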
It is also a standard UEFI+ACPI system; one Reddit user even reported that they were able to boot up Fedora 42 and install the open kernel modules no problem. The overall delta/number of specific patches for the Canonical 6.17-nvidia tree was pretty small when I looked (the current kernel is 6.11). That, and the likelihood that the consumer variant will support Windows, hopefully bodes well for its upstream Linux compatibility.
To be fair, most of this is also true of Strix Halo from what I can tell (most benchmarks put the DGX furthest ahead at prompt processing and a bit ahead at raw token output, but the software is still buggy and Blackwell is still a bumpy ride overall, so it might get better). But I think it's mostly the pricing that is holding it back. I'm curious what the consumer variant will be priced at.
There are official benchmarks of the Spark running multiple models just fine on llama.cpp
Practically, if the goal is 100% about AI and cloud isn't an option for some reason, both options are likely "a great way to waste a couple grand trying to save a couple grand" as you'd get 7x the performance and likely still feel it's a bit slow on larger models using an RTX Pro 6000. I say this as a Ryzen AI Max+ 395 owner, though I got mine because it's the closest thing to an x86 Apple Silicon laptop one can get at the moment.
SHOULDA
the reason we use ryzens is because we run linux with almost no problems on them.
The userspace side is where AI is difficult with AMD. Almost all of the community is built around Nvidia tooling first, others second (if at all).
Does it work if you change to torch.bfloat16?
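One plausible reason a BF16-trained model could misbehave under FP16 inference is dynamic range: FP16 tops out at 65504, while BF16 keeps the 8-bit exponent of float32. Whether that is what happened in the article is an assumption, not something the thread establishes. The sketch below uses only the standard library to round a value through each format; `to_fp16` and `to_bf16` are hypothetical helper names, and the BF16 emulation truncates rather than rounds to nearest.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x through IEEE half precision; return +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    activation = 70000.0  # illustrative magnitude, not measured from any model
    print("fp16:", to_fp16(activation))
    print("bf16:", to_bf16(activation))
```

A value of this size stays finite (though coarsely rounded) in BF16 and overflows in FP16, which is why testing BF16 inference is the natural next step before calling GPU inference broken.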
- https://publish.obsidian.md/aixplore/Practical+Applications/...
The PyTorch 2.9 wheels do work. You can pip install torch --index-url <whatever-it-is> and it just works. You do need to build flash attention from source, which takes an hour or so.
Ryzen Max 395+ gets you 55 tok/s [1]
[1] https://www.reddit.com/r/LocalLLaMA/comments/1nabcek/anyone_...
My job just got me and our entire team a DGX spark. I'm impressed at the ease of use for ollama models I couldn't run on my laptop. gpt-oss:120b is shockingly better than what I thought it would be from running the 20b model on my laptop.
The DGX has changed my mind about the future being small specialized models.
Are you shocked because that isn't your experience?
From the article it sounds like ollama runs cpu inference not GPU inference. Is that the case for you?
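A quick way to settle the CPU-versus-GPU question is to sample GPU utilization while a prompt is generating. The sketch below shells out to `nvidia-smi` with its CSV query flags and parses the result; the function names are hypothetical, and it assumes `nvidia-smi` is on the PATH. Non-numeric fields (for example a device that reports `[N/A]`) are skipped rather than guessed at.

```python
import subprocess


def parse_gpu_utilization(csv_text: str) -> list[int]:
    """Parse one utilization percentage per line, ignoring non-numeric fields."""
    values = []
    for line in csv_text.splitlines():
        field = line.strip()
        if field.isdigit():
            values.append(int(field))
    return values


def query_gpu_utilization() -> list[int]:
    """Ask nvidia-smi for the current utilization of each GPU, in percent."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_utilization(result.stdout)
```

Calling `query_gpu_utilization()` a few times during generation and seeing values near zero would support the CPU-inference reading; sustained high values would contradict it.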
Two things need to happen for me to get excited about this:
1. It stimulates other manufacturers into building their own DGX-class workstations.
2. This all eventually gets shipped in a decent laptop product.
As much as it pains me, until that happens, it still seems like Apple Silicon is the more viable option, if not the most ethical.
Besides that though, I don't see how Nvidia is particularly non-ethical. They cooperate with Khronos, provide high-quality Linux and BSD drivers free of charge, and don't deliberately block third parties from writing drivers to support new standards. From a relativist standpoint that's as sanctimonious as server hardware gets.
Specifically WRT Mellanox, Nvidia's behavior was more petty than callous.
And yes... yes it is.
Really? Less RAM bw than an Epyc CPU? And 4x to 8x less than a consumer GPU?
How come this doesn’t massively limit LLM inference speeds?
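A rough way to reason about this question: during token-by-token decoding, each new token requires reading roughly all *active* weights from memory once, so bandwidth divided by bytes read per token gives a ceiling on tokens per second. The sketch below encodes that back-of-envelope bound. It ignores KV-cache traffic, compute limits, and batching, and the numbers in the usage line are illustrative placeholders, not measurements of any device.

```python
def decode_tokens_per_second_bound(
    bandwidth_gb_s: float,
    active_params_billion: float,
    bytes_per_param: float,
) -> float:
    """Upper bound on single-stream decode speed from memory bandwidth alone.

    bandwidth_gb_s:        memory bandwidth in GB/s
    active_params_billion: parameters touched per token, in billions
                           (for a mixture-of-experts model this is far
                           smaller than the total parameter count)
    bytes_per_param:       e.g. 2.0 for 16-bit weights, 0.5 for 4-bit
    """
    gigabytes_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gigabytes_per_token


if __name__ == "__main__":
    # Placeholder inputs: 100 GB/s, 10B active parameters, 1 byte each.
    print(decode_tokens_per_second_bound(100, 10, 1.0))
```

The bound shows why quantization and sparse (mixture-of-experts) models soften the bandwidth limit: halving bytes per parameter or the active parameter count doubles the ceiling, even though the limit itself remains real for dense models.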
Wow. Where do I sign up?
It would be cheaper to buy up a dozen 3060s and build a custom PC around them than to buy the Spark.
Given the extreme advantage they have with CUDA and the whole AI/ML ecosystem, barely matching Apple’s M-ultra speeds is a choice…
Apple benchmarks: https://github.com/ggml-org/llama.cpp/discussions/4167
(cited from https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/)