For example, Qwen3.5 27B and Qwen3.5 122B A10B have similar average performance across benchmarks. The 122B is much faster to run than the 27B (generates more tokens at the same compute). The 27B, on the other hand, uses ~4x less VRAM at low context lengths (less difference at high context lengths).
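As a rough back-of-the-envelope for why that tradeoff exists (a sketch with assumed quant width and card bandwidth, not measured numbers): weight VRAM scales with total parameters, while single-stream generation speed is roughly memory bandwidth divided by the bytes of weights touched per token, i.e. the active parameters of a MoE.

```python
# Rough dense-vs-MoE tradeoff sketch. All numbers are illustrative
# assumptions (quant width, card bandwidth), not measurements.

def weight_vram_gb(total_params_b, bits_per_weight):
    """Approximate VRAM for the weights alone (ignores KV cache and overhead)."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

def tokens_per_sec(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Bandwidth-bound upper estimate: one pass over the active weights per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

BANDWIDTH_GB_S = 450  # assumed VRAM bandwidth of the card

# Dense 27B vs a 122B MoE with ~10B active parameters, both at a ~6-bit quant
for name, total_b, active_b in [("dense 27B", 27, 27), ("MoE 122B-A10B", 122, 10)]:
    print(f"{name}: ~{weight_vram_gb(total_b, 6):.0f} GB weights, "
          f"~{tokens_per_sec(active_b, 6, BANDWIDTH_GB_S):.0f} tok/s upper bound")
```

With those assumptions the dense model needs roughly a quarter of the weight VRAM, while the MoE generates roughly 2-3x the tokens per second, which matches the shape of the tradeoff above.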
Right now, different hardware seems to be suited to different points in the dense vs. MoE balance. On one extreme is hardware like the DGX Spark and Strix Halo, which have a lot of memory relative to their compute performance and memory bandwidth, and are best suited for MoE workflows. On the other extreme you have cards like the RTX 5090, which have very high performance for the price but rather little memory, and are best suited for dense models.
The Arc Pro B70 seems to sit in the awkward middle. With 1-2 of these, you can run a ~30B dense model slowly, probably not fast enough to be useful interactively (you'd probably need a 5090 or 2x 3090 for that). Or you can run a MoE model at high throughput, but probably not at enough quality to support agentic workflows that would actually use that throughput.
Why can't Intel look beyond this nonsense state of affairs and build something with 1TB of RAM or more?
What I am trying to say is that I have yet to see anything competitive in the market. Cards have very much stalled in the sub-100GB region, and the best these corporations can do is throw out something that runs toy models and forget about it after a week.
I think AMD, for example, missed the boat with the 9950x3d2 by limiting the memory controller. If it were possible to hook it up to 1TB of consumer DDR5 RAM, that would be something to write home about.
Whatever the hell you name it doesn't matter to me; I just want a workstation with one of them bad boys attached to 160GB of RAM for legit inference power!
I've been saving my money not paying for Claude Code so I can run my own agentic coding setup at home on your hardware. Please don't charge too much for the workstation-class card if you can at all manage it. Maybe give us a discount to preorder? Please don't price a regular consumer like me out of the market!
Also, I am speculating that integer-based models will become hot due to lower memory and power requirements. Will the Xe3P be able to do integer-based math inference to use all that RAM to even greater effect?
Intel wouldn’t decide to do this even to save their own life
But 32GB for a TDP of 230W is perhaps not super interesting, especially because you probably want to have more than one card. It's a lot of heat. You could use the cards for heating up a building, but heat pumps exist.
Prompt processing or parallel token generation can do a bit more work per memory transfer, since you can reuse the same weights for several different calculations in parallel. But even so, memory bandwidth is a huge factor.
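A minimal sketch of that point, with assumed numbers (weight size, bandwidth, and compute ceiling are all illustrative): each sweep over the weights serves every row in the batch, so batched work scales until compute becomes the limit, while batch-1 generation stays pinned near bandwidth divided by weight size.

```python
# Why batching helps: each weight fetched from VRAM is reused for every
# row in the batch, so throughput grows with batch size until compute,
# not bandwidth, becomes the limit. All numbers below are assumptions.

WEIGHT_BYTES = 12e9   # ~12 GB of active weights (assumed)
BANDWIDTH    = 450e9  # bytes/s of VRAM bandwidth (assumed)
COMPUTE_TPS  = 4000   # tokens/s if compute were the only limit (assumed)

def throughput(batch):
    # One sweep over the weights serves `batch` tokens' worth of matmuls.
    bandwidth_bound = batch * BANDWIDTH / WEIGHT_BYTES
    return min(bandwidth_bound, COMPUTE_TPS)

for batch in (1, 8, 64, 512):
    print(f"batch {batch:>3}: ~{throughput(batch):.0f} tok/s")
# batch 1 is stuck near BANDWIDTH / WEIGHT_BYTES (~37 tok/s here);
# large batches (prompt processing) approach the compute limit.
```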
The B70 runs at 1/3 the token output rate of the RTX PRO 4500 and consumes 3x the idle power when doing nothing.
It lacked software support for the primary target application: running LLMs. The officially supported vLLM fork is 6 versions behind mainline. It did not run the latest hot new open models on Hugging Face. Running two B70s in parallel reduced the token rate rather than improving it. So the software behind the B70 is basically just far behind.
The parent article shows that B70 is faster than RTX 4000.
RTX 4500 is faster than RTX 4000, but it cannot be more than 3 times faster, not even more than 2 times faster.
The parent article is consistent with RTX 4500 being faster than B70 for ML inference, but by a much smaller ratio, e.g. less than 50% faster.
If you know otherwise, please point to the source.
If you have run a benchmark yourself, please describe the exact conditions.
In the benchmarks shown at Phoronix for llama.cpp, the relative performance was extremely variable for different LLMs, i.e. for some LLMs a B70 was faster than RTX 4000, but for others it was significantly slower.
Your 3x performance ratio may be true for a particular LLM with a certain quantization, but false for other LLMs or other quantizations.
This performance variability may be caused by immature software for the B70. For instance, instead of using matrix operations (XMX engines), non-optimized software might use traditional vector operations, which are slower.
It is also possible that for optimum performance with a certain LLM one may need to choose a different quantization for B70 than for NVIDIA, because for sub-16-bit number formats Intel supports only integer numbers.
At that power consumption, you also end up being more expensive than API calls and many times slower. It starts to feel very stupid to run local inference.
If the client is very keen on privacy, then they can pay for the NVIDIA.
I ended up returning my B70s and bought an RTX PRO 6000.
Hardware-wise, a B70 should be significantly faster than any of the available CPUs at ML inference. If that was not the case in your tests, it must really be a software problem, so you should identify the software so that others know what does not work.
Tried to use the same model as the article:
llama-bench -m gpt-oss-20b-Q8_0.gguf -ngl 999 -p 2048 -n 128
AMD R9700 pp2048=3867 tg128=175
And a bigger model, because testing a tiny model with a 32GB card feels like a waste:
llama-bench -m Qwen3.6-27B-UD-Q6_K_XL.gguf -ngl 999 -p 2048 -n 128
AMD R9700 pp2048=917 tg128=22
| model | size | params | backend | ngl | test | t/s |
| --------------------- | --------: | ------: | ------- | --: | -----: | -------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | SYCL | 999 | pp2048 | 851.81 ± 6.50 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | SYCL | 999 | tg128 | 42.05 ± 1.99 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 999 | pp2048 | 2022.28 ± 4.82 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 999 | tg128 | 114.15 ± 0.23 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | SYCL | 999 | pp2048 | 299.93 ± 0.40 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | SYCL | 999 | tg128 | 14.58 ± 0.06 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | Vulkan | 999 | pp2048 | 581.99 ± 0.86 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | Vulkan | 999 | tg128 | 10.64 ± 0.12 |
Edit: I've no idea why one would use gpt-oss-20b at Q8, but the result is basically the same:

| model | size | params | backend | ngl | test | t/s |
| --------------------- | --------: | ------: | ------- | --: | -----: | -------------: |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | SYCL | 999 | pp2048 | 854.16 ± 6.06 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | SYCL | 999 | tg128 | 44.02 ± 0.05 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 999 | pp2048 | 2022.24 ± 6.97 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 999 | tg128 | 114.02 ± 0.13 |
Hopefully, support for the B70 will continue to improve. In retrospect, I probably should have bought an R9700 instead...

In that particular model family, the choices are 20B and 120B, so the 20B at a higher quant fits in VRAM, while with the 120B you'd be settling for a lower quant. Is it that 20B MXFP4 is comparable in performance, so there's no need for Q8?
Or is the insight simply that there are better models available now and the emphasis is on gpt-oss-20b, not Q8?
Though, looking inside my "gpt-oss 20B MXFP4 MoE" model, it looks to also be quantized the same way as the Q8, so that was probably an overstatement on my part.
Still, the Q8 is 12.1 GB and the FP16 is 13.8 GB. Not the ~1:2 ratio you might expect.
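One possible explanation (an inference on my part, not something I've checked against the tensor dump): if the bulk of the file is MXFP4 expert weights in both cases, only the remaining tensors shrink when going from FP16 to Q8, so the overall ratio stays far from 1:2. Back-of-the-envelope:

```python
# Back-of-the-envelope: how a 13.8 GB (FP16-ish) and a 12.1 GB (Q8) file
# could both exist if a fixed chunk of MXFP4 expert weights is shared.
# Purely illustrative; the file sizes are the ones quoted above, the split
# between "shared" and "rest" is inferred, not measured.

FP16_FILE = 13.8  # GB
Q8_FILE   = 12.1  # GB

# Let `shared` be the MXFP4 portion and `rest16` the non-MXFP4 tensors at 16-bit.
# Then: shared + rest16 = 13.8 and shared + rest16/2 ~= 12.1 (Q8 roughly halves them).
rest16 = 2 * (FP16_FILE - Q8_FILE)   # ~3.4 GB of 16-bit tensors
shared = FP16_FILE - rest16          # ~10.4 GB already MXFP4 in both files
print(f"~{shared:.1f} GB shared MXFP4, ~{rest16:.1f} GB of FP16 tensors")
```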
| model | size | params | backend | ngl | test | t/s |
| --------------------- | ---------: |--------: | -------- | --: |------: |----------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 999 | pp2048 | 10179.12 ± 52.86 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 999 | tg128 | 326.82 ± 7.82 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | CUDA | 999 | pp2048 | 3129.92 ± 5.12 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | CUDA | 999 | tg128 | 53.45 ± 0.15 |
build: 9d34231bb (8929)
gpt-oss-20b-MXFP4.gguf
Qwen3.6-27B-UD-Q6_K_XL.gguf
Using MXFP4 of GPT-OSS because it was trained quantization-aware for this quantization type, and it's native to the 50xx series. A 5090 gets maybe 100 TPS with MTP.
Which might not sound like much, but 2 months in LLM time is a long time, especially regarding support for new hardware like the R9700.
Yes, that's what the G in GPU stands for. It's great to see that there are still manufacturers that understand this.
I have a pair of them with a 9480 and the only thing I have to do is keep the cache happy.
As for software, anything that has a SYCL or Vulkan backend, and/or can be Intel optimized (especially to the same degree as llama.cpp) can run well.
> We hope that, in the future, there will be real options other than NVIDIA for GPU-based rendering, as it is an area where competition is nearly non-existent.
Checking opendata.blender.org, an NVIDIA GeForce RTX 4080 Laptop GPU scores 5301.8, while the Intel Arc Pro B70 is still at 3824.64.
So there is still a bit more to go before Intel GPUs perform close to NVIDIA's.
> Over the last year or two, Intel has worked to deliver serious optimizations for and compatibility with Blender GPU rendering on its Arc GPUs. Although NVIDIA has long held an advantage in the application, our last time looking at Intel’s cards indicated ongoing improvements. This round of testing is no different. We found that the Arc Pro B70 provided more than twice the performance of the B50, also beating the R9700 by 9%.
For 8K HDR10 media or 3+ screens, the RTX 5090 32GB model is going to be the minimum card people should buy. Just because you see 4 DP ports doesn't mean the card can push the bit rates needed to drive an HDR10 display above 60Hz.
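Rough arithmetic behind the bit-rate point (uncompressed figures, so treat them as order-of-magnitude; real DisplayPort links add blanking overhead and usually lean on DSC):

```python
# Uncompressed video bit rate: width * height * refresh * bits-per-pixel.
# HDR10 is 10 bits per channel, 3 channels = 30 bpp. Order-of-magnitude only.

def gbps(w, h, hz, bpp=30):
    return w * h * hz * bpp / 1e9

print(f"8K @ 60 Hz HDR10:  ~{gbps(7680, 4320, 60):.0f} Gbit/s")   # ~60
print(f"8K @ 120 Hz HDR10: ~{gbps(7680, 4320, 120):.0f} Gbit/s")  # ~119
# For comparison: DP 1.4a payload is ~26 Gbit/s and DP 2.1 UHBR20 ~77 Gbit/s,
# so 8K HDR10 above 60 Hz needs DSC and/or the newest link rates.
```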
The Mac Studio Pro with >512GB of unified RAM/VRAM is a better LLM lab solution (Apple recently nerfed it to 256GB). Who cares if a task completes a bit slower; it doesn't matter given the lower error rates... and it doesn't cost $14k like an RTX 6000. =3
Great tutorial on getting Blender to behave on mid-grade PCs and laptops, etc.:
There was recent talk of them pulling back from the consumer segment, though obviously the leaks have also predicted Battlemage not being a thing so go figure: https://youtu.be/NYd2meJumyE?t=638 (timestamped)
That said, them not releasing a B770 in the consumer segment also sucks, since there are games and use cases that the B580 comes in a bit short for.
Since they will have both of those big and small "bookends" of GPU architectures, it is a question of whether they see benefits in maintaining an accessible foothold in the midmarket ecosystem. I could make an argument for both sides of that, but obviously the decision is not up to me.
https://www.tomshardware.com/news/lightweight-windows-11-run...
I cannot understand why a tech reviewer would do that.
Edit: Here is a simple llama.cpp comparison where the token gen results match the rule of thumb.
https://www.reddit.com/r/LocalLLaMA/comments/1st6lp6/nvidia_...
Or do the makers intentionally nerf them in order to better segment the markets/product lines?
But I hope to somehow have 48GB or 64GB of VRAM in a GPU that's also gaming-ready.
I was looking at maybe getting a Mac Studio for this reason, but I don't think a Mac is really good for gaming.
Noob question.
Bandwidth on that memory interface, set up for dual channel, would be significantly worse than Strix Halo, which already exists and could be an entire compute setup with no need for an ASIC.
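For a rough sense of that gap, here are theoretical peak numbers (the Strix Halo figure assumes the commonly cited 256-bit LPDDR5X-8000 configuration; sustained bandwidth is lower on both):

```python
# Theoretical peak memory bandwidth = bus width (bytes) * transfer rate.
# Peak figures only; sustained bandwidth is lower in practice.

def peak_gb_s(bus_bits, mt_s):
    return bus_bits / 8 * mt_s * 1e6 / 1e9

print(f"Dual-channel DDR5-6000:           ~{peak_gb_s(128, 6000):.0f} GB/s")  # ~96
print(f"Strix Halo LPDDR5X-8000, 256-bit: ~{peak_gb_s(256, 8000):.0f} GB/s")  # ~256
```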
I read that Intel is getting out of the dGPU space, but then again, their iGPUs are really getting good. I can't understand why they'd give up the space when the AI market is so insane.
The team working on drivers is doing a good job playing catch-up, and I hope Intel will continue to invest in cards that focus on graphics workloads and not just on AI inference.
NVIDIA has zero incentive to play open on Linux: they release binary blobs, next to zero docs and support, and you deal with it. The last NVIDIA card I bought was 20 years ago, and it was so bad on Linux (low performance and freezes with the open drivers, manual reinstall hell and praying on each kernel update with the binaries) that I switched to ATI. Since then, ATI or Intel have always been decent with zero headaches.
Intel looks like they'll leave the dedicated GPU space, so it's a bit doubtful if the drivers will ever catch up.
I've seen several stories like this, which is a shame since Intel offers the best-value GPUs on the market.
I guess it's possible they'll still make workstation GPUs while skipping the consumer market.
If you find a friendlier way to phrase it, you may find more people willing to discuss it.