Gemma 4 12B: A unified, encoder-free multimodal model
410 points
by rvz
3 hours ago
| 32 comments
| blog.google
| HN
senko
52 minutes ago
[-]
I ran the Q4 quant (used with llama.cpp) though my "minesweeper" vibe-coding benchmark: https://senko.net/vibecode-bench/2026/minesweeper-gamma-4-12...

The result is decent, but it had a few bizzare/trivial syntax errors I had to fix manually: it would do an extra closing bracket or paren a few times, and wanted to separate function definitions with comma. Not sure what that was about, but otherwise the output run just fine.

So, with those qualifiers, I think it's a decent local coding model. It roughly compares with GPT-4.1 (!!), released 14 months ago, on the output: https://senko.net/vibecode-bench/2025/minesweeper-gpt-4.1.ht... (actually I'd call it better, but those syntax errors...)

I ran the quantized version (4-bit GGUF) on my consumer-grade card with 12G of VRAM and got 5t/s for output. Not for interactive use for coding, but fairly capable model.

To me, it's fascinating how much progress we got in over a year. GPT-4.1 was considered an extremely capable coding model. Now we got something with 12B of params performing roughly the same (in this specific benchmark, disclaimers, etc).

Lists of various models I tested: https://senko.net/vibecode-bench/

reply
frikk
7 minutes ago
[-]
Thank you for sharing this. Do you think the syntactical issues could be addressed with fine tuning or some other kind of parameter tweaking? That's frustrating hah.
reply
thomasjb
2 minutes ago
[-]
Unfortunately there's no gguf quants of the assistant model yet: https://huggingface.co/models?other=base_model:quantized:goo...
reply
minimaxir
2 hours ago
[-]
The big story here is the encoder-free part, which I still don't fully understand.

> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.

That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates, it's still a 35M layer which I am curious is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...

> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.

I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.

reply
georgehm
2 hours ago
[-]
Embedded within that developer page is a good explainer of the encoder free architecture . https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
reply
asim
19 minutes ago
[-]
That's a great explainer, thanks for sharing it.
reply
spott
1 hour ago
[-]
This is just early fusion basically.

FAIR did this 2 years ago now: https://arxiv.org/abs/2405.09818

I've been waiting for something like this to be released since then.

The annoying thing is that chameleon was multi-modal out based on the same principles, but this model is just inputs... (I'm curious how they did pre-training without having multi-modal outputs as well. I wonder if they just chopped them off rather than support image output).

reply
mchinen
1 hour ago
[-]
The audio side is even more interesting, as it seems they totally got rid of positional embedding are just doing a single linear transform to match the LLM input dimension and that's it.

> Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.

reply
make3
1 hour ago
[-]
I guarantee you there's positional information one way or another. they just don't mention it because positional embeddings are extremely cheap computationally, not worth mentioning
reply
neosat
1 hour ago
[-]
Agree. Audio has strongly temporal so there is almost certainly some positional encoding one way or another.
reply
mchinen
1 hour ago
[-]
Ah yeah, thinking further it's probably just using some positioning embedding based on sequence numbering added in the LLM layers. For vision it needs the patch location as well.
reply
jszymborski
2 hours ago
[-]
Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.
reply
minimaxir
2 hours ago
[-]
In hindsight I may have been pedantic.
reply
wilkystyle
2 hours ago
[-]
I had a similar thought to you, and found your question and the resulting discussion helpful!
reply
alberto467
2 hours ago
[-]
Not at all. Getting really pedantic, tokenization is also a form of encoding, so it doesn't matter the modality you're using, you'll end up doing some type of encoding in some way.
reply
altruios
1 hour ago
[-]
Tokens are such a strange base unit. Couldn't we do something that naturally conforms better to reality than such choppy units that cause all sorts of artifacts? making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language, but its far more like multiple waves of data coalescing into an 'idea', internal... subjectively (n=1) at least. I think wave/signal based transformers are the next jump.

After that a s1/s2 system: fast generation, slow wave correction / observation operating over the fast generation seems like the next leap forward.

Tokens create and hide too many problems to be the 'optimal' solution.

reply
kristjansson
2 hours ago
[-]
> quantization

12b means 12G @ 8 bits/param (basically lossless) and 6G at 4 b/p (generally accepted 'pretty close' level). Not too bad?

But TBD how well the base model performs before thinking too much about quantization

reply
matja
2 hours ago
[-]
One side-effect, is that the separate .mmproj file (Multi-Modal Projection encoder) is no longer needed, when using the model with llama.cpp etc.
reply
pferdone
1 hour ago
[-]
But do I have the option to run it 'text only'?
reply
mips_avatar
1 hour ago
[-]
I don't think we've bottomed out on what we can do with embedding models. They're these tiny models that absolutely rip on modern cpus with 8 bit int optimizations. Like in my app we can say pretty definitive things about hundreds of millions of places in the world on retrieval tasks on regular hardware.
reply
dofm
47 minutes ago
[-]
I would contend that the actual big story is the gallery app:

https://developers.google.com/edge/gallery

Anyone with a 16GB Mac — that is quite a lot of journalists, surely — can download that, install a model into it, and play.

Surely journalists have to start asking questions at least about OpenAI's consumer revenue projections now.

I am a major, major AI cynic, but I decided to be an informed cynic so I've been playing with local models for agentic work and a bit of CAD-to-image generation. I really quite like the 26B Gemma model — I've been using it to teach myself some fundamental things and learn OpenCode without developing a cloud dependency. It writes fairly good code and it is helping me learn the things I want to learn at a pace that I prefer.

But if this 12B model is even half as close as they say it is, this casts some doubt on the consumer end of the cloud business model, at least in the short term.

(Not clear if this app is using the MTP drafters; I've still not got them working with Gemma myself, though the Qwen 3.6 built-in MTP support is super in LM Studio)

reply
minimaxir
19 minutes ago
[-]
I had discounted Edge Gallery because it didn't support system prompts, but now it does so I will give it another go. I believe the implementation does use MTP since I got an update to Gemma-4-E4B on iOS indicating such, and on macOS it's very speedy.

However, on my 18GB RAM MacBook Pro, selecting Gemma-4-12B-it results in this error:

> The model "Gemma-4-12B-it' requires more memory (RAM) than is available on your device.

So yeah, my questions about the 16GB marketing copy are fair.

reply
dofm
5 minutes ago
[-]
Interesting; they may have fluffed up somewhere then.

(Though perhaps it'll squeeze in with a small context window? Not sure I understand that aspect yet)

It does seem to use MTP, yes, and it is quite quick — seemingly the underlying LiteRT stuff can do MTP with Gemma 4 and presumably MTP is a big part of the practicality picture here.

The system prompt thing was a surprise when I poked around.

reply
madduci
19 minutes ago
[-]
VRAM, not RAM. I wish it was light enough for iGPUs too
reply
rao-v
1 hour ago
[-]
Encoder free is huge for running on SBCs etc. often the encoding time is a significant fraction of generation time if you are using a VLM as a all purpose vision model
reply
wolttam
2 hours ago
[-]
I think the idea is that the model is seeing embeddings that map directly to underlying pixel data, rather than being fed semantically rich embeddings from an encoder model which itself had seen the raw pixel data.
reply
woadwarrior01
1 hour ago
[-]
There are many priors to encoder-free VLMs. I specifically remember the EVE series of models from ~2 years.

https://github.com/baaivision/EVE

reply
reactordev
2 hours ago
[-]
It actually works well because unlike encoders, the latent space is trained on that initial layer so it “knows” what to do with that sparse density. I’ve been using gemma4-12b with Flux2 and its ability to reason on visual input is pretty good. That said, each model is good in their own ways so YMMV but overall, it’s about as solid as Qwen just with a more advanced architecture.
reply
GaggiX
2 hours ago
[-]
> That's technically encoding

Isn't that just projecting the patches into the d_model size vectors that the models takes?

>I am assuming that involves of quantization

12B model in 16GB seems very reasonable to me, int8 is top quality for running models.

reply
WhitneyLand
55 minutes ago
[-]
I don’t think so, the HF weights are bf16 which means 24GB + cache/overhead.

It sounds like marketing spin where the performance claims are based on BF16 and the “runs in 16GB” claim is on a totally different quantized version.

reply
minimaxir
2 hours ago
[-]
The guide describes it as projection although there is apparently an extra step: "A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input."

12B at int8 would take up 12G memory, or 75% of the system memory which technically fits within 16GB but the OS will not like that. EDIT: On my 18G memory MacBook Pro, LM Studio reports a "partial GPU offload" for the int8 MLX weights. Can't test because the `gemma_unified" architecture is NYI.

reply
WhitneyLand
48 minutes ago
[-]
Yeah and it’s pretty memory efficient with only 8 attention layers so at int8 in 16GB ram maybe you still get 64k-128k context.

The part I hate though is that I’d bet none of the performance claims are based on int8.

Why do we care about bf16 benchmarks when no one will be using that with this model.

reply
LarsDu88
2 hours ago
[-]
Well its a real simple encoder I guess
reply
__natty__
2 minutes ago
[-]
It’s fascinating for me to see how small language models grow recently in capabilities while still consumer friendly in size to run on their machines
reply
asim
27 minutes ago
[-]
We are now entering the closed loop game. Google doesn't need anyone else to accelerate their models. This is their bread and butter.

I'm both shocked but also not surprised that they continue to develop such efficiencies. Honestly it's like silicon and CPU architecture advancement. We kept shrinking it and shrinking it and it kept getting more and more powerful and here we are with AI and it's only going to be 100x more efficient with time. Maybe there's some point of decay but essentially the next 30 years will be more advanced than the last 30 and were going to be living in some sort of futurist blade runner scenario where gene editing is repairing ageing cells, organs and curing all sorts of cancers that haven't even appeared yet. Beyond our lifetimes people will live to 125 quite steadily and with great mobility and then obviously people will look to how do we get to living 1000 years, which of anyone is religious knows Noah and others lived to that age in a totally different era.

Anyway I'm going off on some tangent but look back 30 years. Now look forward 30 years. It's going to be insane. May God protect us.

reply
petercooper
37 minutes ago
[-]
Its image processing is terrible. I ran several tests against it against Qwen 3.5 0.8b (yes, 7% the size) and Qwen beat it every time with Gemma often getting things entirely wrong. I even gave it a plain image saying "This is a test" and it thought for 6 minutes trying to analyze it and failed. Qwen 3.5 0.8b confidently got it in under a second.

It may be that the Q6 quant I got is borked (or my LM Studio is), but either way, the 0.8b's performance is mind boggling in comparison.

reply
ethanpil
2 hours ago
[-]
What's Google's business case for releasing open models? Don't get me wrong, I am grateful and appreciative of these releases. I'm trying to understand how it fits into their bigger picture as a for profit company? Are they not helping competitors build on the novel technology they have developed?

Is it simply goodwill and/or marketing? Or am I missing something strategic?

reply
browningstreet
2 hours ago
[-]
This won't replace commercially viable, revenue generating alternatives of their own devising, but it does enable development activity and initiate conversations with enterprises who start with this model but want to do slightly more.

That's my experience right now... my company is all in on a plethora of platform products. Also, Microsoft just yesterday said their goal was "Unmetered intelligence". There's a lot of things that can be enabled by small local models, and those things are part of stacks that can generate revenue in other layers.

reply
johnnyApplePRNG
1 hour ago
[-]
re "Unmetered intelligence" goal of Microshaft.

Of course it is...

This is Windows-Licensing-Level Money Opportunity 2.0.

reply
gen220
1 hour ago
[-]
A big part of the frontier labs abilities to charge 80% gross margins on inference is having the cornered resource of frontier models.

If that inference becomes popular and valuable enough that those companies make billions of dollars in profit, those companies could use that profit to fund the building of alternative products and platforms that dis-intermediate google's relationship with the customer.

Google already has an 80% gross margin business, the biggest one in the world. Everybody wants a slice of it.

By offering frontier inference closer to cost and open-sourcing everything that's sub-frontier, they're commoditizing frontier labs' models, which inhibits their ability to durably make high gross margins on inference.

It's a strategic play.

reply
zozbot234
1 hour ago
[-]
A 12B-sized model is a far cry from "frontier inference". That's more like DeepSeek V4 Pro territory which is a 1.6T model. Or for multi-modal models, Kimi 2.6 which is 1T.
reply
gen220
1 hour ago
[-]
at risk of quoting myself... :)

> By offering frontier inference closer to cost *and* open-sourcing everything that's sub-frontier

It's two prongs! One prong is that their frontier inference pricing is significantly cheaper/closer-to-at-cost as Anthropic's.

The subject of this thread is the other prong: offering compelling models that are sub-frontier and self-hostable.

Self-hosting models and at-cost frontier models are the high-end and low-end disruptions, respectively, to Ant/OAI/etc.'s business models.

reply
echelon
1 hour ago
[-]
Google needs an anti-trust breakup about 10 years ago.

They need one more than ever now.

This is ridiculously anti-competitive.

reply
airstrike
56 minutes ago
[-]
This is literally competition
reply
boutell
1 hour ago
[-]
You're right that it's not literally frontier. But like recent Qwen releases, it is a lot more capable than anybody thought models of this size could be a year ago, like capable enough to set a ceiling on what you can charge for AI for certain applications. Others still clearly justify a stronger model, but this trend may continue, etc.
reply
Mr_P
2 hours ago
[-]
Android and Chrome need on-device AI capabilities. Google can't lock down those weights like it can with server-side ML.

So it's easier to just release those models as open source and make it official, since someone would inevitably hack the weights out anyway.

reply
Aachen
2 hours ago
[-]
Could say the same for camera processing in the Pixel Camera app or any other binary someone wants to re-use that comes included in a software distribution (seemingly for 'free'). They can't lock the instructions up on the server so they might as well make the binary be freely distributable?

Companies don't commonly give away executable binaries "just because", why'd they start now for these binary blobs that are the models?

Not that I'm unhappy about it! Yay for open data any day, I'm just not understanding why, at least beyond PR in nerd circles

reply
lukeschlather
58 minutes ago
[-]
Binaries are source code outputs, they are copyrightable and patentable. Weights are not copyrightable so people can freely extract the weights and run them. If Google patents any of the novel algorithms here releasing it all freely isn't an impediment to making people license it.
reply
jack_pp
1 hour ago
[-]
Because a model like this can't be as easily obfuscated as image processing. Image processing is a bundle of many moving parts, a lot of functions each with it's own inputs and outputs. A model is a single function which can be easily extracted and reused, in comparison
reply
panarky
1 hour ago
[-]
> can't lock down those weights

They could lock them down legally which would prevent commercial use, but they choose not to, and they boast about how many tens of millions of times Gemma models have been downloaded by developers.

So there must be more to the rationale than just local model weights getting hacked out of devices.

reply
beambot
2 hours ago
[-]
Google is one of the few verticalized options in AI: Data, models, cloud services, low-level silicon (TPUs), internal use cases, retail use cases, B2B uses, distribution (browser & mobile), etc.

They rise with the tide of AI adoption. But they gain ground if people opt into Google solutions. And any token sent to a Google model (free or paid) actively punishes their competitors that are then required to spend vast sums to remain bleeding edge.

reply
rootusrootus
2 hours ago
[-]
Neutering OpenAI and Anthropic would be my guess. Commoditized LLMs won't hurt Google nearly as much as it hurts the LLM-only companies, and so accelerating the inevitable just helps knock out potential future competition in areas where Google -does- make a lot of money now.
reply
literalAardvark
1 hour ago
[-]
I think this plays a part, but the truth is that Google doesn't need to do that, Chinese open models are already doing that by themselves.

So perhaps another part is just Google showing that they can indeed play at the big boys table.

reply
gdiamos
1 hour ago
[-]
There is demand for US open models.
reply
literalAardvark
2 minutes ago
[-]
I sincerely wonder why. Chinese censorship is only really relevant if you're doing anti China stuff, which is to say never, while the Western kind of model censorship ( a combination of copyrights and general fairness ) are something everyone's had to work around at least once, even if just for writing an interesting story.
reply
baq
55 minutes ago
[-]
Demis is on record saying they need models on the edge and if they’ll be there they might as well be properly open as they’ll be dumped anyway.
reply
onlyrealcuzzo
2 hours ago
[-]
If you're an AI lab, you definitely want research teams in this space - as this is where you can most easily iterate and make improvements which you'll then bake into larger, frontier models.

The question is: do you want to release your models, or use them purely for R&D?

Since everyone else is already releasing models of similar qualities, it's hard to say you're shooting yourself in the foot if you join the chorus.

The added cannibalization of releasing them is effectively zero, so the reputational benefits are likely to be worth it.

reply
hadlock
1 hour ago
[-]
>The added cannibalization of releasing them is effectively zero, so the reputational benefits are likely to be worth it.

Nobody would be looking at Qwen if their ~30b class models weren't fantastically good, it's great advertising and builds significant goodwill with developers, who are going to be your biggest advocates.

The other thing is, all these models are already disposable grade, and in a year they'll all be outclassed by The Next Big Thing. "Open" models are less than 18 months behind SOTA right now and I can't imagine that will slow down much over the next two years, they may even begin to close the gap. Nobody even talks about llama 4 anymore despite only being a year old.

reply
mchusma
20 minutes ago
[-]
I think its even more puzzling because you can't even run Gemma 31b on google cloud, they only let you test it with a rate limit. No way (I can find) to actually pay them to use it.

We saw great results in our usecase using google direct. Moved to Openrouter because google wouldn't let us use it beyond a test.

Then Openrouters performance looked worse, not sure if there was a quantized version or something. So we instead looked at Deepseek v4 Flash, and opted to go for that.

This model would probably be great for a super low cost cloud model, would love to use it in the cloud, Google makes you go elsewhere.

reply
staticman2
1 hour ago
[-]
As long as Chinese firms are releasing good open models I imagine there isn't a huge downside for Google to release state of the art small models to compete in the "free" space.
reply
estearum
2 hours ago
[-]
It's to destroy possible footholds for competitors and prevent them from making money in segments that Google doesn't care too much about, but can trivially commoditize.
reply
ismailmaj
51 minutes ago
[-]
Gemini is a huge team while Gemma is relatively small. They can totally do this at a loss with no ulterior motive.

They remind me a bit of HuggingFace, create something great then make money … maybe.

reply
verdverm
8 minutes ago
[-]
Competition from Chinese alternatives hopefully forces more openness and efficient models. DeepSeek for example is nearly on par and far more resource efficient, good for the planet imo
reply
XzAeRosho
2 hours ago
[-]
Google's MO since always has been to release great products or services for free, position themselves high and then abandon them or just find uses for Enterprise sales.

I'm pretty sure they are doing it because they get some research experience by shrinking and improving these models, and because they know that by doing this they get some good PR among the dev community.

reply
Aachen
2 hours ago
[-]
Google's "free" is and was ad-supported, even if some products now have a paid tier. These models don't include ads. Doesn't seem like the same underlying reason
reply
theturtletalks
2 hours ago
[-]
Maybe they are hedging against a future where local models are just as good as cloud models? Or maybe they can go the Taalas route and start hardcoding Gemma on a chip and hardware manufacturers can use it for local private AI.
reply
ppeetteerr
2 hours ago
[-]
Isn't Apple about to license some variation of this from google for on-device AI? Maybe it’s their sales pitch to Apple and then they will lock it down.
reply
CuriouslyC
2 hours ago
[-]
They're trying to capture the segment of the market that wants to control the model, with the intent of getting you to run them on Vertex.
reply
stevenhubertron
2 hours ago
[-]
My guess is testing for Apple’s Siri replacement and partnership but that’s a total SWAG
reply
mmarian
2 hours ago
[-]
Marketing + Pro Serv if I had to take a guess.
reply
re-thc
1 hour ago
[-]
On-device, e.g. Android.
reply
dist-epoch
2 hours ago
[-]
Evangelism for AI. Google is one of the big AI providers.

Eventually the local model is not enough, and you'll upgrade to the big ones.

reply
accountrequired
2 hours ago
[-]
edge compute
reply
superchicken099
2 hours ago
[-]
Gemma overtakes and kills real open-source AI projects, pushing people who would support them towards enterprises like Google
reply
ComputerGuru
2 hours ago
[-]
Quite aside from the architectural changes, I suppose this is the answer to why Google had such a glaring hole in the (pretrained) Gemma4 model lineup between the Gemma4 4b and Gemma4 26b models!

A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.

reply
djyde
2 hours ago
[-]
What are the use cases for these small models? Is there anyone using models of this scale in their daily life who could share their experience?
reply
philipkglass
1 hour ago
[-]
I have vLLM running on a Linux machine in my basement, connected with Tailscale, and I use small models as part of tasks like this:

- Transcribing scanned documents into formatted text

- Captioning/describing images and classifying them for audience suitability (includes anti-spam)

- Matching documents with relevant Wikipedia pages for tagging

I don't use them like frontier models. I break the work down into micro-tasks with one clear goal for each prompt. I write a lot of glue software to make the complete flow work. I was working on all of these tasks before LLMs appeared on the scene. The LLMs have allowed me to replace a lot of complicated code with less code plus a model, while achieving better results.

I use local models for reasons of cost and control. I already had the workstation and GPU. The only running cost is electricity. I have used proprietary models from OpenAI and Google for some of these tasks, but I also encountered churn when the models I built my tools around were retired. I don't worry about that when I have the weights saved locally.

reply
quickthoughts
40 minutes ago
[-]
I use small models like Gemma to improve transcriptions from ASR models amongst other micro-tasks. I actually built out a fine-tuning whisper pipeline with all local (smaller) models meaning no cloud/big-tech co is able to train/sell my (private) data.

Repo is https://github.com/Rebreda/listenr - mainly geared toward Whisper fine-tuning, AMD hardware and local inference

reply
robgough
1 hour ago
[-]
I've got a home-built dictation app that uses a local model to clear up the text and fix grammar. It was super easy to build. I’m extending it to capture meeting notes and summarise too. All on-device.

I saw a little app the other day, I think someone posted on here, that looks at your screenshot and renames the file based off the contents of the file.

There's tons of little examples like that. For a lot of use cases, you really don't need the frontier models.

reply
mhitza
1 hour ago
[-]
In theory, locally you'd use these where lossiness is acceptable for audio transcription and image labeling (as simple examples).

In practice I haven't got around to building something around multimodality since I'm primarily using their text generation capabilities.

reply
properbrew
1 hour ago
[-]
I think small models have a very good niche for specific tasks. I utilise a fine tuned Phi-4 model (smaller than this one) that fits in about 3.5gb of RAM (not vram) for the document processing side of things for the desktop app I develop (a bit of a shameless plug - whistle-enterprise.com).

If you have a very specific idea for local model use you can find a way to make it work very well, you don't even need to have a graphics card or NPU chip. You just have to be extremely constrained in how it's used. I think as a generic chatbot they're not great, I'd use a hosted SOTA model and I'm a big fan of local LLMs myself.

reply
SeriousM
38 minutes ago
[-]
Thank you for sharing your usecase! I like your product very much!

Could you talk a bit how you did the finetuning? Did you use unsloth or any other tool and how went the verification to proof the outcome?

reply
Aachen
2 hours ago
[-]
"Small" models are the ones I can run myself on my own terms. LLMs aren't useful enough for me to justify spending hundreds of euros on a GPU with 16GB VRAM or something, and that's assuming I have the rest of the desktop just laying around. Back when I checked (before the RAM price hike), these models weren't meaningfully better than 4-8GB ones anyway, you'd have to go for the top tier cards at 24 or 32 GB iirc to get something vaguely in the direction of the SaaS versions, and that was absolutely out of my budget. Even if that changed, so have hardware prices so it'd probably still work out the same
reply
airstrike
53 minutes ago
[-]
This is one https://post.bot/
reply
bensyverson
16 minutes ago
[-]
What model is it using?
reply
Xiol
2 hours ago
[-]
I've yet to see someone answer a question like this with a decent, useful answer.
reply
julianlam
46 minutes ago
[-]
Last time I tried Gemma 4 (26B-A4B) its memory usage would balloon and consume all of my swap until my machine died.

Qwen 3.6 on the other hand barely uses any memory at all for its KV cache.

reply
verdverm
6 minutes ago
[-]
Turns out when you block people from the best and biggest hardware, they get innovative. It reminds me of the Pentium days when everyone was shipping inefficient programs because the processor would be better next year.
reply
SuperV1234
10 minutes ago
[-]
How does this compare to frontier models?
reply
anonova
33 minutes ago
[-]
Do Gemma 4 models compete with Gemini 3.1 Flash-Lite? I would assume even the smallest Gemini model would outperform even Gemma 4 31B, but I can't really get a sense of performance or output quality difference.
reply
mchusma
18 minutes ago
[-]
Gemma 4 31b outperformed Gemini 3.1 Flash-Lite in our app benchmarks (agentic tool use via api in our application as a part of various workflows). But google won't let you pay to use Gemma models, you have to go elsewhere, I think this may be because it would cannabilize Flash-lite.
reply
verdverm
4 minutes ago
[-]
You can actually get the gemma-4 models on a per-token API basis, you just have to click some extra buttons (in GCP). Not the same for other open weight models. For those they make you run your own hardware.

Use OpenCode Go instead: https://opencode.ai/go

reply
dwa3592
2 hours ago
[-]
This is a pretty good update. The demo video is a bit funny though - the tester asks to turn the release into bullet points. okay, the model obliges. then the tester says draft an email with this content. BAM! the LLM turns the content from bullets to passages even though it was not asked and it undid the last good thing that it did. i am not sure if it's an email etiquette to not put bullets in the email.
reply
nickandbro
2 hours ago
[-]
Wow Google is becoming the new pre Llama 4 Meta when it comes to releasing open weights models.
reply
embedding-shape
2 hours ago
[-]
I dunno, feels a bit unfair to companies that actually do FOSS releases (Gemma 4 being released under Apache 2.0 license) to compare them to a company that never done any FOSS releases, and mostly done proprietary "available to download" releases.
reply
seba_dos1
2 hours ago
[-]
Note that a binary released under Apache 2.0 license does not yet make it FOSS.
reply
embedding-shape
2 hours ago
[-]
Agreed, miles ahead though from "proprietary" which is what Meta been using for most model releases.

Ideally companies would share the fucking datasets and training code already, but no, no one wants to talk about the source of those or even share the ones they have as then who knows what comes out of Pandora's box...

reply
redman25
2 hours ago
[-]
IDK this model release is a bit disappointing considering the community has been chomping at the bit for the 124ba4b model. There was some leaked info about it but people suspect it was not released because it was too close to gemini flash in performance.
reply
brianwawok
2 hours ago
[-]
Every other Google model I have tried felt very weak compared to qwen models. I dont have a ton of use case for multimodal though, so its very possible this is a fantastic multimodal model.
reply
verdverm
40 seconds ago
[-]
qwen3.6 was my favorite, then I tried the deepseek-v4-{flash,pro}

still making my way through deep dives on the chinese open weights, they are all pretty good and way more cost / resource effective

reply
wongarsu
2 hours ago
[-]
Gemma 4 27b and 32b feel pretty capable for text and visionn. Comparable with qwen, maybe a bit better on tool calling heavy tasks

I am not overly impressed with the smaller gemma models. And gemma 3 was a bit of a mixed bag, great at some things, bad at most others

reply
powera
7 minutes ago
[-]
I'm seeing very low quality results on LMStudio with this model. Worse than Gemma 3 12B.

It is getting questions like "David has 18 apples and Ivan has 7 apples. How many apples do they have together?" wrong half the time, while Gemma3 12B could very consistently answer that. Other smoke tests (like Chinese translation, and the infamous "Rs in Strawberry" test) also show poor results.

I don't know if it is a quantization/release issue, if the parameters needed for accurate responses have changed (i.e. it needs "thinking" tokens to handle its base error rate), or if the model has been so focused on audio/video that the text processing is bad.

reply
semiinfinitely
37 minutes ago
[-]
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away
reply
lxgr
2 hours ago
[-]
Am I missing something or are the Ollama versions of this (https://ollama.com/library/gemma4/tags) text-only for now?
reply
philipkglass
2 hours ago
[-]
Since ollama has diverged from llama.cpp, it will take a bit of time for ollama to support multi-modality. If you're using plain llama.cpp it looks like a PR has already merged for this model with vision and audio support:

https://github.com/ggml-org/llama.cpp/pull/24077

reply
zozbot234
1 hour ago
[-]
They've actually gone back to (a lightly patched) llama.cpp with the 0.30 release a few weeks ago, and have now vendored-in an up to date release. Needless to say this is great news for both projects!
reply
satvikpendem
1 hour ago
[-]
Just use llama.cpp or Unsloth Studio which wraps it, I don't know why anyone use Ollama anymore.
reply
RandyOrion
1 hour ago
[-]
A small dense multimodal model with audio support, interesting.

Wait, *Excluding Chinese language.

This is ... curious.

P.S. Where is gemma 4 124b?

reply
kylehotchkiss
50 minutes ago
[-]
Where are the computers we could purchase to run 124b models :’(
reply
zkmon
46 minutes ago
[-]
I't quite interesting to see the quants pour into the HF page. I keep refreshing it and see many new quants every few mins.
reply
Zambyte
2 hours ago
[-]
Is this Mac only? Or is that an Ollama issue that it only supports this release of models on Mac? It seems like every tag with the MLX badge is only supported on Mac[0], and that includes all of the tags in this release.

[0] https://ollama.com/library/gemma4/tags

Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.

reply
embedding-shape
2 hours ago
[-]
MLX is quite literally macOS-specific technology, for other platforms you want non-MLX.

I was sure "MLX" stood for "Metal-something-something" but can't find any reference to that somehow, anywho, "Metal" is hardware-accelerated graphics on Apple platforms FWIW.

Edit: about the actual release on Ollama, if you're on non-Apple hardware you probably want the NVFP4 variant ("gemma4:12b-nvfp4") which was uploaded 45 minutes ago, especially if you're with a recent nvidia GPU.

reply
sambaumann
1 hour ago
[-]
I still get "this model requires macOS" when trying to pull that one
reply
embedding-shape
1 hour ago
[-]
I don't use Ollama myself anymore, but seems others been having similar issues for quite some time, maybe one of these fit your environment exactly? https://github.com/ollama/ollama/issues?q=is%3Aissue%20state...
reply
jw1224
2 hours ago
[-]
MLX is Apple’s own machine learning framework, designed for Apple Silicon: https://opensource.apple.com/projects/mlx/
reply
accountrequired
50 minutes ago
[-]
reply
jasonjmcghee
1 hour ago
[-]
There's a CUDA backend for MLX now. Not sure about the maturity.
reply
comma_at
1 hour ago
[-]
Are there qwen or minimax or other open weight models of same hardware requirements that outperform this?
reply
spott
1 hour ago
[-]
Is there a paper on this?

I'm curious how they pre-trained it... I feel like it must have had audio/image output that they chopped off.

I wonder how hard it would be to add it back on.

reply
joaogui1
1 hour ago
[-]
I mean Claude is multimodal on input but not output, why couldn't this also be?
reply
randomNumber7
2 hours ago
[-]
> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.

I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog)

reply
spott
1 hour ago
[-]
https://newsletter.maartengrootendorst.com/p/a-visual-guide-... (in a link from here: https://developers.googleblog.com/gemma-4-12b-the-developer-..., which was linked in the text of the post, but not the linkdump at the end).
reply
toldnotmywrath
1 hour ago
[-]
My understanding is that early (and most extant) visual language models have a component module (called the image encoder) that transforms images into representations (called embeddings) the model's inner layers can process.

This is often a separate module grafted onto the main model, and further pre-trained (e.g. OpenAI's CLIP, SigLIP used in the Gemma 3 and PaliGemma series).

The image encoder approach has a few problems.

One problem is that many like Gemma 3's encoder have fixed image resolution constraints and inputs must be resized with all the attendant distortions that causes with spatial understanding. However, the Gemma 4 series image encoders overcame this and can handle variable-dimension inputs.

Two, these image encoders are somewhat large (ranging from 300-500M parameters) requiring extra memory and FLOPs to run.

Three, say we need to fine-tune a vision language model, updates to its weights, may affect its understanding of the representations generated by the image encoder if we don't fine-tune both together.

The new Gemma-4-12B replaces the encoder (with its many attention layers and large parameter count) with a simple linear projection to generate the embeddings for images. That reduces the computational requirements and simplifies the input pipelines for image processing.

I don't have any expertise on the topic though and might very well be wrong on some details.

reply
Havoc
2 hours ago
[-]
Quite a niche release. The MoE outperforms it on score and will likely be faster thanks to lower active weights. So this really only makes sense for specific ram constrained applications that can’t fit a quantized MoE
reply
dist-epoch
2 hours ago
[-]
The un-quantized MoE outperforms it.

But between same (V)RAM requirement 4 bit 26B-A3B and 8 bit 12B it's unclear which one will win, especially given one is MoE and the other dense.

All the launch benchmarks are at 16 bit.

reply
mlmonkey
1 hour ago
[-]
Is there some place where we can try it before downloading the gigabytes of weights?
reply
zuminator
2 hours ago
[-]
How does it compare with e4b, aside from being larger?
reply
anonova
2 hours ago
[-]
There's a comparison of all the Gemma 4 models (+ Gemma 3 27B) on the Huggingface model card: https://huggingface.co/google/gemma-4-12B-it#benchmark-resul...
reply
thomasjb
2 hours ago
[-]
That's what I want to know too. A smarter E4B that's happy in opencode would be a good selfhosted model for me
reply
zkmon
1 hour ago
[-]
I'm waiting for FP8 quant, preferably from Google.
reply
digdugdirk
2 hours ago
[-]
I do enjoy the immediate out of touch signaling with the "runs on your 16gb vram laptop" line. Because everyone has a laptop with 16gb vram, or can just pop out and buy a new one, right?
reply
vehemenz
1 hour ago
[-]
This comment has me a bit confused.

Consumers were complaining about the standard 8GB with the early 2020 refresh of MacBook Pros, many OSes ago. Sure, it might be workable for many tasks (as evidenced by the recent sales of the MacBook Neo), but users with a mere 8GB shouldn't have expectations of LLM performance. Even 16GB feels like a stretch.

reply
utternerd
1 hour ago
[-]
Unified Memory or VRAM, not just RAM.
reply
NekkoDroid
1 hour ago
[-]
I think you are mixing up RAM and VRAM.
reply
Schiendelman
1 hour ago
[-]
On a Mac they are the same thing; they're shared. Of course you need some amount for the OS, but if you have an Apple Silicon Mac with 24GB of RAM, you can likely run a 16GB model.
reply
crims0n
1 hour ago
[-]
They are effectively one and the same on Apple Silicon.
reply
NekkoDroid
1 hour ago
[-]
Which most people as a matter of fact don't use. A majority of people with laptop have separate memory pools and the VRAM of them is nowhere near that and even on most gaming laptops you aren't getting 16GB VRAM.
reply
mrkstu
47 minutes ago
[-]
I would say on this forum it wouldn’t be suprising for commentors to be near or above 50% that have access to an M Series Mac…
reply
claysmithr
1 hour ago
[-]
I have 24 gb unified memory so it’s a good model for me
reply
BiraIgnacio
2 hours ago
[-]
using an embedder instead of a decoder is quite clever. Not sure who came up with that first but it's a cool idea.
reply
claysmithr
2 hours ago
[-]
I don’t see the download in lm studio
reply
deckar01
1 hour ago
[-]
It also says it is supposed to be available in their own Edge Gallery app and it’s not there (on iOS).
reply
corgihamlet
1 hour ago
[-]
reply
claysmithr
36 minutes ago
[-]
Thanks looks like they just added it 1 hour ago
reply
kordlessagain
51 minutes ago
[-]
Cool!
reply
jdelman
2 hours ago
[-]
I can’t help but wonder if this is the basis of the model they’ve helped tune for Apple.
reply