I'm running vLLM on it now and it was as simple as:
docker run --gpus all -it --rm \
--ipc=host --ulimit memlock=-1 \
--ulimit stack=67108864 \
nvcr.io/nvidia/vllm:25.09-py3
(That recipe from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?v... )

And then in the Docker container:
vllm serve &
vllm chat
The default model it loads is Qwen/Qwen3-0.6B, which is tiny and fast to load.

I am curious where you find its main value, how it would fit within your tooling, and what use cases it covers compared to other hardware?
From the inference benchmarks I've seen, an M3 Ultra always comes out on top.
Installation instructions: https://github.com/comfyanonymous/ComfyUI#nvidia
It's a webUI that'll let you try a bunch of different, super powerful things, including easily doing image and video generation in lots of different ways.
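If it helps anyone getting started, the install is roughly this (a sketch from memory; check the README above for the exact PyTorch/CUDA index URL it currently recommends, cu128 below is just an example):

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
# NVIDIA build of PyTorch first
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
python main.py    # then open the localhost URL it prints in a browser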
It was really useful to me when benching stuff at work on various gear, e.g. L4 vs A40 vs H100 vs 5th-gen EPYC CPUs, etc.
Also, the other reviews I’ve seen point out that inference speed is slower than a 5090 (or on par with a 4090 with some tailwind), so the big difference here (other than core counts) is the large chunk of “unified” memory. Still seems like a tricky investment in an age where a Mac will outlive everything else you care to put on a desk and AMD has semi-viable APUs with equivalent memory architectures (even if ROCm is… well… not all there yet).
Curious to compare this with cloud-based GPU costs, or (if you really want on-prem and fully private) the returns from a more conventional rig.
It's not comparable to 4090 inference speed. It's significantly slower, partly because of the lack of MXFP4 models out there. Even compared to a Ryzen AI 395 (ROCm / Vulkan) on gpt-oss-120B mxfp4, the DGX somehow manages to lose on token generation (prompt processing is faster, though).
> Still seems like a tricky investment in an age where a Mac will outlive everything else you care to put on a desk and AMD has semi-viable APUs with equivalent memory architectures (even if ROCm is… well… not all there yet).
ROCm (v7) for APUs has actually come a long way, mostly thanks to community effort; it's quite competitive and much more mature. It's still not totally user friendly, but it doesn't break between updates (I know the bar is low, but that was the status a year ago). So in comparison, Strix Halo offers a lot of value for your money if you need a cheap, compact inference box.
Haven't tested fine-tuning / training yet, but in theory it's supported. Not to forget that the APU is extremely performant for "normal" tasks (Threadripper level) compared to the CPU of the DGX Spark.
I have no immediate numbers for prefill, but the memory bandwidth is ~4x greater on a 4090 which will lead to ~4x faster decode.
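A quick sanity check on that ratio, assuming decode at batch size 1 is purely bandwidth-bound, approximate published figures of ~1008 GB/s on a 4090 vs ~273 GB/s on the Spark, and a ~20 GB quantized model whose weights are all read once per generated token:

# tokens/s ≈ bandwidth (GB/s) / weights read per token (GB)
echo "1008 / 20" | bc -l   # 4090:  ~50 tok/s
echo "273 / 20" | bc -l    # Spark: ~13 tok/s

Same ~3.7x gap whatever the model size, as long as it fits on both.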
What I found to be a good solution was using Spack: https://spack.io/

It lets you download/build the full toolchain of stuff you need for whatever architecture you are on (all dependencies; compilers like GCC, CUDA, MPI, etc.; compiled Python packages), and if you need to add a new recipe for something it is really easy.
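A minimal sketch of the workflow (the compiler and package versions here are just examples):

git clone --depth=1 https://github.com/spack/spack.git
. spack/share/spack/setup-env.sh
spack install gcc@13                 # build/install a compiler
spack install python@3.12 %gcc@13    # build Python with that compiler
spack load python@3.12               # put it on PATH for the current shell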
For the fellow Brits - you can tell this was named by Americans!!!
This is a high-level overview by one of the Spack authors from the HN post back in 2023 (top comment out of 100), including the Spack original paper link [1]:
At a very high level, Spack has:
* Nix's installation model and configuration hashing
* Homebrew-like packages, but in a more expressive Python DSL, and with more versions/options
* A very powerful dependency resolver that doesn't just pick from a set of available configurations -- it configures your build according to possible configurations.
You could think of it like Nix with dependency resolution, but with a nice Python DSL. There is more on the "concretizer" (resolver) and how we've used ASP for it here:
* "Using Answer Set Programming for HPC Dependency Solving", https://arxiv.org/abs/2210.08404
[1] Spack – scientific software package manager for supercomputers, Linux, and macOS (100 comments):
I, for example, have some healthcare research projects with personally identifiable data, and in these times it's simpler for the users to trust my company than my company plus some overseas company and its associated government.
Can people please not listen to this terrible advice that gets repeated so often, especially (somehow) in Australian IT circles, by young, naive folks.
You really need to talk to your accountant here.
It's probably under 25% in deduction at double the median wage, and a little over at triple. That's *only* if you are using the device entirely for work, as in it sits in an office and nowhere else. If you are using it personally, you open yourself up to all sorts of drama if and when the ATO ever decides to audit you for making a $6k AUD claim for a computing device beyond what you normally use to do your job.
Even if what you are saying is correct, the discount is just lower. This is compared to no discount on compute/GPU rental unless your company purchases it.
I'm sure I'll get downvoted for this, but this common misunderstanding about tax deductions does remind me of a certain Seinfeld episode :)
Kramer: It's just a write off for them
Jerry: How is it a write off?
Kramer: They just write it off
Jerry: Write it off what?
Kramer: Jerry all these big companies they write off everything
Jerry: You don't even know what a write off is
Kramer: Do you?
Jerry: No. I don't
Kramer: But they do and they are the ones writing it off
For inference decode, memory bandwidth is the main limitation, so if running LLMs is your use case you should probably get a Mac instead.
At list price, it's 1,000 USD cheaper: 3,699 vs 4,699. I know a lot can be relative, but that's a lot for me, for sure.
The Mac Studio is a more appropriate comparison. There is not yet a DGX laptop, though.
I can do that with a laptop too. And with a dedicated GPU. Or a blade in a data center. I thought the feature of the DGX was that you can throw it in a backpack.
Why not?
Now that you bring it up, the M3 Ultra Mac Studio goes up to 512GB for about a $10k config with around 850 GB/s bandwidth, for those who "need" a near frontier large model. I think 4x the RAM is not quite worth more than doubling the price, especially if MoE support gets better, but it's interesting that you can get a Deepseek R1 quant running on prosumer hardware.
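Back-of-envelope on why the 512GB matters, assuming DeepSeek R1's ~671B total parameters and a ~4.5 bit/weight quant (roughly Q4_K_M); KV cache and runtime overhead come on top:

# weights only: params (billions) x bits per weight / 8 ≈ GB
echo "671 * 4.5 / 8" | bc -l   # ≈ 377 GB: fits in 512GB, nowhere near 128GB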
Running some other distro on this device is likely to require quite some effort.
You CAN build - but for people wanting to get started this could be a real viable option.
Perhaps less so though with Apple's M5? Let's see...
P.S. exploded view from the horse's mouth: https://www.nvidia.com/pt-br/products/workstations/dgx-spark...
I'm looking forward to GLM 4.6 Air - I expect that one should be pretty excellent, based on experiments with a quantized version of its predecessor on my Mac. https://simonwillison.net/2025/Jul/29/space-invaders/
The 120B model is better but too slow since I only have 16GB of VRAM. That model runs decently[1] on the Spark.
Beyond that, it seems like the 395 in practice smashes the DGX Spark in inference speed for most models. I haven't seen NVFP4 comparisons yet and would be very interested to.
I don't think there are any models supporting NVFP4 yet, but we shall probably start seeing them.
Can anyone explain this? Does this machine have multiple CPU architectures?
Is that true? nvidia Jetson is quite mature now, and runs on ARM.
Management becomes layers upon layers of bash scripts, which end up calling a final batch script written by Mellanox.
They'll catch up soon, but you end up having to stay strictly on their release cycle always.
Lots of effort.
And of course there's the totally random and inconsistent support outside of the few dedicated cards, which is honestly why CUDA is the de facto standard everyone measures against: you could run CUDA applications, if slowly, even on the lowest-end NVIDIA cards, like the Quadro NVS series (think lowest-end GeForce chip, but often paired with more displays and support aimed at business users who didn't need fast 3D). And you generally still can run core CUDA code, within the last few generations, on everything from the smallest mobile chip to the biggest datacenter behemoth.
I kinda lost track; this whole thread reminded me how hopeful I was to play with GPGPU on my then-new X1600.
But maybe this will change? Software issues somehow?
It also runs CUDA, which is useful
Plus, apparently some of the early benchmarks were made with Ollama and should be disregarded.
● Bash(free -h)
⎿ total used free shared buff/cache available
Mem: 119Gi 7.5Gi 100Gi 17Mi 12Gi 112Gi
Swap: 0B 0B 0B
That 119Gi is indeed gibibytes, and 119 GiB is 128 GB (119 × 2^30 bytes ≈ 127.8 × 10^9 bytes).

I should be allowed to do stupid things when I want. Give me an override!
IS_SANDBOX=0 claude --dangerously-skip-permissions
You can run that as root and Claude won't complain. (Because Docker doesn't do this by default, best practice is to create a non-root user in your Dockerfile and run as that.)
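If you'd rather not rebuild the image, a rough alternative is running the container as your own UID straight from the CLI (the image name here is a placeholder):

docker run --gpus all -it --rm \
--user "$(id -u):$(id -g)" \
my-claude-image claude --dangerously-skip-permissions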
I'd be pissed if I paid this much for hardware and the performance was this lacklustre while also being kneecapped for training
Obviously, even with ConnectX, it's only 240Gi of VRAM, so no big models can be trained.
The DGX Spark is completely overpriced for its performance compared to a single RTX 5090.
That's the use case, not running LLMs efficiently, and you can't do that with an RTX 5090.
I don't think the 5090 could do that with only 32GB of VRAM, could it?