FilterHN

VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

42 points

by timhigins

2 hours ago

| 5 comments

| arxiv.org

▲

deftio

6 minutes ago

[-]

There is some base level of intelligence any model needs to be useful, even in narrow tasks.

Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? Driving a car requires being able to read, to have judgement about icy or rainy conditions, and to anticipate a child running after a ball. By the time a human is in their mid-teens they have acquired the base knowledge...

Small models need to have enough base knowledge to be able to be good enough -- even in a seemingly narrow regime. Where is that? Obviously they don't need all the obscure knowledge of a frontier model but there is some base level which is probably more than it would first seem.

▲

gslepak

19 minutes ago

[-]

Note that these are Python-only results; the model will not do as well with other languages.

I'm glad to see more domain-focused SLMs. We need more of them! A programming-focused MoE should work well across many languages.

▲

aero2146

1 hour ago

[-]

I tried generating the classic pelican SVG, but it failed horribly, just showing me a rectangle and a black circle...

▲

fwipsy

36 minutes ago

[-]

I think this is predicted? Part of the story is how they were able to preserve core reasoning ability while cutting knowledge like "pelicans have wings."

> these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios.

▲

pylotlight

22 minutes ago

[-]

The only really essential item here is tool-calling capability, is it not? So I assume they tested for strong read/write/edit tool consistency?

▲

realitysballs

53 minutes ago

[-]

That’s all I needed to hear

▲

pylotlight

23 minutes ago

[-]

As in, you learnt that a useless test that no one should be using was tested here. That's what you meant, right?

▲

physPop

52 minutes ago

[-]

It's for reasoning, not generating art?

▲

websap

47 minutes ago

[-]

Can you explain this a bit more?

▲

tyre

28 minutes ago

[-]

Imagine you want to make a smaller model that is really good at one thing, say, driving a car. You could remove the parameters that lead it to correctly answer, "What is the powerhouse of the cell?" or, "Who was the first president of the United States?"

It would look really dumb if someone asked it that, but that's fine. You're trying to make a model that is optimized for efficiency for a specific task. As much as possible, you should prune uncorrelated things.

▲

pylotlight

22 minutes ago

[-]

SVG generation is a useless test. What more is there to know?

▲

steve_adams_86

1 minute ago

[-]

What if you're reasoning about how to generate SVG correctly?

▲

noperator

40 minutes ago

[-]

Having some success while testing this model out as a replacement for GPT-5 nano in source code security review. Running on RTX 3090 (24 GB VRAM) via vLLM. It's not great on structured output (as noted in the model card) but I'm working around that in my harness.

▲

dummydummy1234

18 minutes ago

[-]

Can't you just force it to do structured output via constrained generation?
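
For anyone unfamiliar with the idea, here is a toy sketch of constrained generation: at each step, tokens the grammar does not allow are masked out before picking the next one, so the output is valid by construction. Everything here is an assumption for illustration: the vocabulary, the `GRAMMAR` state machine, the `"vulnerable"` field name, and `fake_logits` (a stand-in for a real model forward pass) are all hypothetical, and real inference servers implement this against the full tokenizer rather than a hand-written table.

```python
import json
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ['{', '}', '"vulnerable"', ':', 'true', 'false', 'maybe', 'hello']

# Hypothetical finite-state grammar for the JSON object {"vulnerable": true|false}.
# Maps state -> {allowed token: next state}.
GRAMMAR = {
    0: {'{': 1},
    1: {'"vulnerable"': 2},
    2: {':': 3},
    3: {'true': 4, 'false': 4},
    4: {'}': 5},
}
FINAL_STATE = 5


def fake_logits(step):
    """Stand-in for a model forward pass: arbitrary scores that ignore the grammar."""
    return [math.sin(step * 7 + i * 3) for i in range(len(VOCAB))]


def constrained_greedy_decode(logits_fn):
    """Greedy decoding where only grammar-allowed tokens are candidates."""
    state, out, step = 0, [], 0
    while state != FINAL_STATE:
        logits = logits_fn(step)
        allowed = GRAMMAR[state]
        # Masking step: tokens outside `allowed` are never considered,
        # no matter how high the model scores them.
        best = max(allowed, key=lambda tok: logits[VOCAB.index(tok)])
        out.append(best)
        state = allowed[best]
        step += 1
    return ''.join(out)


if __name__ == "__main__":
    text = constrained_greedy_decode(fake_logits)
    print(json.loads(text))  # always parses, because the grammar forbids anything else
```

The trade-off is that masking guarantees the shape of the output, not its correctness: the model can still pick the wrong value inside a valid structure.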