DeepSeek-v3.2: Pushing the frontier of open large language models [pdf]
213 points | 4 hours ago | 11 comments | huggingface.co
https://huggingface.co/deepseek-ai/DeepSeek-V3.2

https://api-docs.deepseek.com/news/news251201

zug_zug
1 hour ago
[-]
Well props to them for continuing to improve, winning on cost-effectiveness, and continuing to publicly share their improvements. Hard not to root for them as a force to prevent an AI corporate monopoly/duopoly.
reply
jstummbillig
37 minutes ago
[-]
How could we judge whether anyone is "winning" on cost-effectiveness when we don't know what everyone's profits/losses are?
reply
ericskiff
13 minutes ago
[-]
I believe this was a statement about cost per token to us as consumers of the service.
reply
srameshc
54 minutes ago
[-]
As much as I agree with your sentiment, I doubt the intention is singular.
reply
red2awn
1 hour ago
[-]
Worth noting this is not only good on benchmarks, but also significantly more efficient at inference: https://x.com/_thomasip/status/1995489087386771851
reply
twistedcheeslet
54 seconds ago
[-]
How capable are these models at tool calling?
reply
lalassu
15 minutes ago
[-]
Disclaimer: I did not test this yet.

I don't want to make big generalizations, but one thing I noticed with Chinese models, especially Kimi, is that they do very well on benchmarks but fail on vibe testing. They feel a bit overfit to the benchmarks and less tuned to real use cases.

I hope it's not the same here.

reply
vorticalbox
7 minutes ago
[-]
This used to happen with benchmarks on phones: manufacturers would tweak Android so benchmarks ran faster.

I guess that's kinda how it is for any system that's trained to do well on benchmarks: it does well on them but is rubbish at everything else.

reply
make3
4 minutes ago
[-]
Yes, they turned off all energy-saving measures when benchmarking software was detected, which completely defeated the point of the benchmarks: your phone is useless if it's very fast but the battery lasts one hour.
reply
not_that_d
1 minute ago
[-]
What is "Vibe testing"?
reply
make3
6 minutes ago
[-]
I would assume that a huge amount is spent in frontier models just on making them nicer to interact with, as that is likely one of the main things that drives user engagement.
reply
spullara
3 minutes ago
[-]
I hate that their model IDs don't change as they change the underlying model. I'm not sure how you can build on that.

  % curl https://api.deepseek.com/models \
    -H "Authorization: Bearer ${DEEPSEEK_API_KEY}"
  {"object":"list","data":[{"id":"deepseek-chat","object":"model","owned_by":"deepseek"},{"id":"deepseek-reasoner","object":"model","owned_by":"deepseek"}]}
reply
jodleif
4 hours ago
[-]
I genuinely do not understand the valuations of the US AI industry. The Chinese models are so close and far cheaper.
reply
espadrine
1 hour ago
[-]
Two aspects to consider:

1. Chinese models typically focus on text. US and EU models also bear the cross of handling images, and often voice and video. Supporting all of those is additional training cost not spent on further reasoning, tying one hand behind your back in order to be more generally useful.

2. The gap seems small because so many benchmarks get saturated so fast. But towards the top, every additional 1% on a benchmark represents a significantly bigger capability difference.

On the second point, I worked on a leaderboard that both normalizes scores, and predicts unknown scores to help improve comparisons between models on various criteria: https://metabench.organisons.com/

You can notice that, while Chinese models are quite good, the gap to the top is still significant.

However, the US models are typically much more expensive for inference, and Chinese models do have a niche on the Pareto frontier on cheaper but serviceable models (even though US models also eat up the frontier there).

reply
coliveira
8 minutes ago
[-]
Nothing you said helps with the issue of valuation. Yes, the US models may be better by a few percentage points, but how can they justify being so costly, both operationally and in investment? Over the long run, this is a business, and you don't make money by being first; you have to be more profitable overall.
reply
jodleif
1 hour ago
[-]
1. Have you seen the Qwen offerings? They have great multi-modality, some even SOTA.
reply
brabel
36 minutes ago
[-]
Qwen Image and Image Edit were among the best image models until Nano Banana Pro came along. I have tried some open image models and can confirm: the Chinese models are easily the best or very close to it, but right now the Google model is even better... we'll see if the Chinese catch up again.
reply
raincole
11 minutes ago
[-]
> video

Most of the AI-generated videos we see on social media now are made with Chinese models.

reply
torginus
36 minutes ago
[-]
Thanks for sharing that!

The scales are a bit murky here, but if we look at the 'Coding' metric, we see that Kimi K2 outperforms Sonnet 4.5, which I think is still considered the price-performance darling even today?

I haven't tried these models, but in general there have been lots of cases where a model performs much worse IRL than the benchmarks would suggest (certain Chinese models and GPT-OSS have been guilty of this in the past).

reply
agumonkey
17 minutes ago
[-]
Forgive me for bringing politics into it, but are Chinese LLMs more prone to censorship bias than US ones?
reply
coliveira
6 minutes ago
[-]
Being open source, Chinese models are, I believe, less prone to censorship, since US corporations can add censorship in several ways simply by virtue of controlling a closed model.
reply
jasonsb
1 hour ago
[-]
It's all about the hardware and infrastructure. If you check OpenRouter, no provider offers a SOTA Chinese model matching the speed of Claude, GPT or Gemini. The Chinese models may benchmark close on paper, but real-world deployment is different. So you either buy your own hardware in order to run a Chinese model at 150-200 tps, or give up and use one of the Big 3.

The US labs aren't just selling models, they're selling globally distributed, low-latency infrastructure at massive scale. That's what justifies the valuation gap.

Edit: It looks like Cerebras is offering a very fast GLM 4.6

reply
observationist
51 minutes ago
[-]
The network effects of consistently behaving models and maintained API coverage between updates are valuable, too. Presumably the big labs include their own domains of competence in the training, so Claude is likely to remain very good at coding and to behave in similar ways, informed and constrained by their prompt frameworks, so that interactions keep working in predictable ways even after major new releases, and upgrades can be clean.

It'll probably be a few years before all that stuff becomes as smooth as people need, but OAI and Anthropic are already doing a good job on that front.

Each new Chinese model requires a lot of testing and bespoke conformance work for every task you want to use it for. There's a lot of activity and shared prompt engineering, and some really competent people doing things out in the open, but it's generally going to take a lot more expert work to get the new Chinese models up to snuff than to work with the big US labs. Their product and testing teams do a lot of valuable work.

reply
DeathArrow
23 minutes ago
[-]
> If you check OpenRouter, no provider offers a SOTA chinese model matching the speed of Claude, GPT or Gemini.

I think GLM 4.6 offered by Cerebras is much faster than any US model.

reply
jasonsb
15 minutes ago
[-]
You're right, I forgot about that one.
reply
jodleif
1 hour ago
[-]
Assuming your hardware premise is right (and let's be honest, nobody really wants to send their data to Chinese providers), you can use a provider like Cerebras or Groq?
reply
kachapopopow
47 minutes ago
[-]
Cerebras offers models at 50x the speed of Sonnet?
reply
csomar
1 hour ago
[-]
According to OpenRouter, z.ai is 50% faster than Anthropic; which matches my experience. z.ai does have frequent downtimes but so does Claude.
reply
jazzyjackson
1 hour ago
[-]
Valuation is not based on what they have done but on what they might do. I agree, though, that it's an investment made with very little insight into Chinese research. I guess it's counting on DeepSeek being banned and all computers in America refusing to run open software by the year 2030 /snark
reply
jodleif
1 hour ago
[-]
> Valuation is not based on what they have done but what they might do

Exactly what I'm thinking. Chinese models are catching up rapidly. Soon they'll be on par with the big dogs.

reply
ksynwa
1 hour ago
[-]
Even if they do continue to lag behind they are a good bet against monopolisation by proprietary vendors.
reply
coliveira
1 minute ago
[-]
They would be, if corporations were allowed to run these models. I fully expect the US government to prohibit corporations from doing anything useful with Chinese models (full censorship). It's the same game they play with chips.
reply
bilbo0s
1 hour ago
[-]
>I guess it's counting on deepseek being banned

And the people making the bets are in a position to make sure the banning happens. The US government system being what it is.

Not that our leaders need any incentive to ban Chinese tech in this space. Just pointing out that it's not necessarily a "bet".

"Bet" imply you don't know the outcome and you have no influence over the outcome. Even "investment" implies you don't know the outcome. I'm not sure that's the case with these people?

reply
newyankee
1 hour ago
[-]
Yet, tbh, if the US industry had not moved ahead and created the race with FOMO, it would not have been as easy for the Chinese strategy to work either.

The nature of the race may yet change, though, and I am unsure whether the devil is in the details, as in very specific edge cases that will only work with frontier models?

reply
isamuel
1 hour ago
[-]
There is a great deal of orientalism --- it is genuinely unthinkable to a lot of American tech dullards that the Chinese could be better at anything requiring what they think of as "intelligence." Aren't they Communist? Backward? Don't they eat weird stuff at wet markets?

It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; their Bolshevism dooms them; we have the will to power; we will succeed. Even now, when you ask questions like what you ask of that era, the answers you get are genuinely not better than "yes, this should have been obvious at the time if you were not completely blinded by ethnic and especially ideological prejudice."

reply
mosselman
1 hour ago
[-]
Back when DeepSeek came out and people were tripping over themselves shouting that it was so much better than what was out there, it just wasn't good.

It might be that this model is super good, I haven't tried it, but to say the Chinese models are better is just not true.

What I really love though is that I can run them (open models) on my own machine. The other day I categorised images locally using Qwen, what a time to be alive.

Further even than local hardware, open models make it possible to run on providers of choice, such as European ones. Which is great!

So I love everything about the competitive nature of this.

reply
CamperBob2
39 minutes ago
[-]
If you thought DeepSeek "just wasn't good," there's a good chance you were running it wrong.

For instance, a lot of people thought they were running "DeepSeek" when they were really running some random distillation on Ollama.

reply
bjourne
22 minutes ago
[-]
WDYM? Isn't https://chat.deepseek.com/ the real DeepSeek?
reply
breppp
40 minutes ago
[-]
Not sure how the entire Nazi comparison plays out, but at the time there were good reasons to imagine the Soviets would fall apart (as they initially did).

Stalin had just finished purging his entire officer corps, which is not a good omen for war, and the USSR had failed miserably against the Finns, who were not the strongest of nations, while Germany had just steamrolled France, a country that was much more impressive in WW1 than the Russians (who collapsed against Germany).

reply
newyankee
1 hour ago
[-]
But didn't the Chinese already surpass the rest of the world in solar, batteries, and EVs, among other things?
reply
cyberlimerence
1 hour ago
[-]
They did, but the goalposts keep moving, so to speak. We're approximately here: advanced semiconductors, artificial intelligence, reusable rockets, quantum computing, etc. The Chinese will never catch up. /s
reply
lukan
48 minutes ago
[-]
"It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; ..."

Ideology played a role, but the data they worked with was the Finnish war, which was disastrous for the Soviet side. Hitler later famously said it was all an intentional distraction to make them believe the Soviet army was worth nothing. (The real reasons were more complex, like the earlier purges.)

reply
littlestymaar
45 minutes ago
[-]
> It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; their Bolshevism dooms them; we have the will to power; we will succeed

Though, because Stalin had decimated the Red Army leadership (including most of the veteran officers with Russian Civil War experience) during the Moscow-trial purges, the Germans almost succeeded.

reply
TIPSIO
1 hour ago
[-]
It's awesome that stuff like this is open source, but even if you have a basement rig with 4 NVIDIA GeForce RTX 5090 graphics cards (a $15-20k machine), can it even run with any reasonable context window that isn't crawling along at 10 tps?

Frontier models far exceed even the most hardcore consumer hobbyist's hardware. This one even more so.
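A quick back-of-the-envelope sketch, assuming 32 GB of VRAM per RTX 5090 and the ~685 GB of FP8 weights mentioned elsewhere in the thread (both figures assumed here, not taken from the model card):

  # Does a 4x RTX 5090 rig fit the weights? (assumed figures)
  vram_gb = 4 * 32              # ~128 GB of VRAM across the rig (32 GB per 5090)
  weights_gb = 685              # FP8 weights alone, before any KV cache
  print(weights_gb / vram_gb)   # ~5.4x more than the rig holds, hence CPU/RAM offload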

reply
tarruda
11 minutes ago
[-]
You can run it at ~20 tokens/second on a 512GB Mac Studio M3 Ultra: https://youtu.be/ufXZI6aqOU8?si=YGowQ3cSzHDpgv4z&t=197

IIRC the 512GB Mac Studio is about $10k.

reply
noosphr
1 hour ago
[-]
Home rigs like that are no longer cost effective. You're better off buying an RTX Pro 6000 outright. This holds for the sticker price, the supporting hardware, the electricity to run it, and the cooling for the room you use it in.
reply
torginus
47 minutes ago
[-]
I was just watching this video about a Chinese piece of industrial equipment, designed for replacing BGA chips such as flash or RAM with a good deal of precision:

https://www.youtube.com/watch?v=zwHqO1mnMsA

I wonder how well the aftermarket memory surgery business on consumer GPUs is doing.

reply
throw4039
29 minutes ago
[-]
Yeah, the pricing for the RTX Pro 6000 is surprisingly competitive with the gamer cards (at actual prices, not MSRP). A 3x5090 rig will require significant tuning/downclocking to run from a single North American 15A plug, and the cost of the higher-powered supporting equipment (cooling, PSU, UPS, etc.) will pay for the price difference, not to mention future expansion possibilities.
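As a rough sketch of the power math, assuming ~575 W per 5090 and the usual 80% continuous-load rule for a 15 A / 120 V circuit (both assumptions, check the actual spec sheet and electrical code):

  # Rough power check for a 3x5090 rig on one North American 15 A circuit (assumed numbers)
  usable_w = 15 * 120 * 0.8     # ~1440 W continuous on a 15 A / 120 V circuit
  gpu_w = 3 * 575               # ~1725 W for the GPUs alone, before CPU/fans/PSU losses
  print(gpu_w > usable_w)       # True: power limiting or a bigger circuit is needed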
reply
mikae1
45 minutes ago
[-]
Or perhaps a 512GB Mac Studio. The 671B R1 at Q4 runs on it.
reply
redrove
26 minutes ago
[-]
I wouldn’t say runs. More of a gentle stroll.
reply
storus
6 minutes ago
[-]
I run it all the time; token generation is pretty good. Just large contexts are slow, but you can hook up a DGX Spark via the Exo Labs stack and outsource token prefill to it. The upcoming M5 Ultra should be faster than the Spark at token prefill as well.
reply
halyconWays
26 minutes ago
[-]
As someone with a basement rig of 6x 3090s: not really. It's quite slow, since with that many params (685B) it's offloading basically all of them into system RAM. I limit myself to models with <144B params; then it's quite an enjoyable experience. GLM 4.5 Air has been great in particular.
reply
bigyabai
1 hour ago
[-]
People with basement rigs generally aren't the target audience for these gigantic models. You'd get much better results out of an MoE model like Qwen3's A3B/A22B weights, if you're running a homelab setup.
reply
Spivak
1 hour ago
[-]
Yeah I think the advantage of OSS models is that you can get your pick of providers and aren't locked into just Anthropic or just OpenAI.
reply
zparky
5 hours ago
[-]
Benchmarks are super impressive, as usual. Interesting to note in table 3 of the paper (p. 15) that DS-Speciale is 1st or 2nd in accuracy in all tests, but has much higher token output (50% more, or 3.5x vs Gemini 3 on the Codeforces test!).
reply
futureshock
2 hours ago
[-]
The higher token output is not by accident. Certain kinds of logical reasoning problems are solved by longer thinking output. Thinking chain output is usually kept to a reasonable length to limit latency and cost, but if pure benchmark performance is the goal you can crank that up to the max until the point of diminishing returns. DeepSeek being 30x cheaper than Gemini means there's little downside to maxing out the thinking time. It's been shown that you can further scale this by running many solution attempts in parallel with max thinking and then using a model to choose a final answer, so increasing reasoning performance by increasing inference compute has a pretty high ceiling.
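For illustration, here is a minimal best-of-N sketch against an OpenAI-compatible endpoint; the judge prompt and N=4 are assumptions made for the example (the model ids come from the API listing earlier in the thread), not DeepSeek's published method:

  # Hypothetical best-of-N sampling with a judge model (sketch, assumed setup).
  import os
  from concurrent.futures import ThreadPoolExecutor
  from openai import OpenAI

  client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                  base_url="https://api.deepseek.com")

  def attempt(problem: str) -> str:
      # One independent reasoning attempt with the thinking model.
      r = client.chat.completions.create(
          model="deepseek-reasoner",
          messages=[{"role": "user", "content": problem}])
      return r.choices[0].message.content

  def best_of_n(problem: str, n: int = 4) -> str:
      # Run n attempts in parallel, then ask a cheaper model to pick the best one.
      with ThreadPoolExecutor(max_workers=n) as pool:
          candidates = list(pool.map(attempt, [problem] * n))
      numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
      judge = client.chat.completions.create(
          model="deepseek-chat",
          messages=[{"role": "user", "content":
                     f"Problem:\n{problem}\n\nCandidates:\n{numbered}\n\n"
                     "Reply with only the index of the best candidate."}])
      return candidates[int(judge.choices[0].message.content.strip())]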
reply
htrp
1 hour ago
[-]
What is the ballpark VRAM / GPU requirement to run this?
reply
rhdunn
1 hour ago
[-]
For just the model itself: 4 bytes per parameter at F32, 2 bytes per parameter at F16/BF16, or 1 byte per parameter at F8, e.g. ~685GB at F8. It will be smaller for quantizations, but I'm not sure how to estimate those.

For a Mixture of Experts (MoE) model you only need the memory for a given expert. There will be some swapping as it figures out which expert to use, or to change experts, but once that expert is loaded it won't be swapping memory to perform the calculations.

You'll also need space for the context window; I'm not sure how to calculate that either.
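A rough sketch of the weight-memory arithmetic; the 685B figure comes from this thread, while the KV-cache dimensions below are placeholders rather than the real config (and DeepSeek's MLA attention compresses the cache well below this generic formula):

  # Rough serving-memory estimate (sketch; layer/head numbers are placeholders).
  def weight_gb(params_billion: float, bytes_per_param: float) -> float:
      # params_billion * 1e9 params * bytes each / 1e9 bytes per GB
      return params_billion * bytes_per_param

  print(weight_gb(685, 1))   # ~685 GB at F8
  print(weight_gb(685, 2))   # ~1370 GB at F16/BF16

  def kv_cache_gb(tokens, layers=60, kv_heads=8, head_dim=128, bytes_per=2):
      # Generic keys+values cache: 2 * layers * kv_heads * head_dim bytes per token.
      return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

  print(kv_cache_gb(128_000))  # grows linearly with context length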

reply
anvuong
32 minutes ago
[-]
I think your understanding of MoE is wrong. Depending on the settings, each token can actually be routed to multiple experts (in what is called an expert-choice architecture). This makes it easier to parallelize inference (each expert on a different device, for example), but it's not simply keeping one expert in memory.
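A minimal sketch of token-to-top-k-expert routing to illustrate the idea; the sizes and the plain softmax router here are generic assumptions, not DeepSeek's actual router:

  # Toy top-k MoE router: each token picks k experts via a softmax over router logits.
  import numpy as np

  def route(token_hidden, router_w, k=2):
      logits = token_hidden @ router_w             # one logit per expert
      probs = np.exp(logits - logits.max())
      probs /= probs.sum()
      top = np.argsort(probs)[-k:]                 # indices of the k chosen experts
      return top, probs[top] / probs[top].sum()    # renormalized mixing weights

  rng = np.random.default_rng(0)
  experts, weights = route(rng.normal(size=64), rng.normal(size=(64, 8)))
  print(experts, weights)  # different tokens generally hit different experts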
reply
petu
47 minutes ago
[-]
I think your idea of MoE is incorrect. Despite the name, they're not "expert" at anything in particular, and the experts used change more or less on each token, so swapping them into VRAM is not viable; they just get executed on the CPU (llama.cpp).
reply
BoorishBears
6 hours ago
[-]
3.2-Exp came out in September: this is 3.2, along with a special checkpoint (DeepSeek-V3.2-Speciale) for deep reasoning that they're claiming surpasses GPT-5 and matches Gemini 3.0

https://x.com/deepseek_ai/status/1995452641430651132

reply
nimchimpsky
8 hours ago
[-]
Pretty amazing that a relatively small Chinese hedge fund can build AI better than almost anyone.
reply