FilterHN

Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code

191 points

by vbtechguy

8 hours ago

| past

| 9 comments

| ai.georgeliu.com

| HN

▲

trvz

6 hours ago

[-]

  ollama launch claude --model gemma4:26b

▲

gcampos

1 hour ago

[-]

You need to increase the context window size or the tool calling feature wont work

▲

mil22

14 minutes ago

[-]

For those wondering how to do this:

  OLLAMA_CONTEXT_LENGTH=64000 ollama serve

or if you're using the app, open the Ollama app's Settings dialog and adjust there.

Codex also works:

  ollama launch codex --model gemma4:26b

▲

datadrivenangel

6 hours ago

[-]

It's amazing how simple this is, and it just works if you have ollama and claude installed!

▲

pshirshov

5 hours ago

[-]

For some reason, that doesn't work for me, claude never returns from some ill loop. Nemotron, glm and qwen 3.5 work just fine, gemma - doesn't.

▲

trvz

4 hours ago

[-]

Since that defaults to the q4 variant, try the q8 one:

  ollama launch claude --model gemma4:26b-a4b-it-q8_0

▲

pshirshov

2 hours ago

[-]

Even tried gemma4:31b and gemma4:31b with 128k context (I have 72GiB VRAM). Nothing. I'm cursed I guess. That's ollama-rocm if that matters (I had weird bugs on Vulkan, maybe gemma misbehaves on radeons somehow?..).

UPD: tried ollama-vulkan. It works, gemma4:31b-it-q8_0 with 64k context!

▲

vbtechguy

8 hours ago

[-]

Here is how I set up Gemma 4 26B for local inference on macOS that can be used with Claude Code.

▲

canyon289

7 hours ago

[-]

This is a nice writeup!

▲

martinald

6 hours ago

[-]

Just FYI, MoE doesn't really save (V)RAM. You still need all weights loaded in memory, it just means you consult less per forward pass. So it improves tok/s but not vram usage.

▲

IceWreck

5 hours ago

[-]

It does if you use an inference engine where you can offload some of the experts from VRAM to CPU RAM. That means I can fit a 35 billion param MoE in let's say 12 GB VRAM GPU + 16 gigs of memory.

▲

Yukonv

4 hours ago

[-]

With that you are taking a significant performance penalty and become severely I/O bottlenecked. I've been able to stream Qwen3.5-397B-A17B from my M5 Max (12 GB/s SSD Read) using the Flash MoE technique at the brisk pace of 10 tokens per second. As tokens are generated different experts need to be consulted resulting in a lot of I/O churn. So while feasible it's only great for batch jobs not interactive usage.

▲

zozbot234

3 hours ago

[-]

10 tok/s is quite fine for chatting, though less so for interaction with agentic workloads. So the technique itself is still worthwhile for running a huge model locally.

▲

IceWreck

3 hours ago

[-]

> So while feasible it's only great for batch jobs not interactive usage.

I mean yeah true but depends on how big the model is. The example I gave (Qwen 3.5 35BA3B) was fitting a 35B Q4 K_M (say 20 GB in size) model in 12 GB VRAM. With a 4070Ti + high speed 32 GB DDR5 ram you can easily get 700 token/sec prompt processing and 55-60 token/sec generation which is quite fast.

On the other hand if I try to fit a 120B model in 96 GB of DDR5 + the same 12 GB VRAM I get 2-5 token/sec generation.

▲

zozbot234

2 hours ago

[-]

Your 120B model likely has way more active parameters, so it can probably only fit a few shared layers in the VRAM for your dGPU. You might be better off running that model on a unified memory platform, slower VRAM but a lot more of it.

▲

charcircuit

5 hours ago

[-]

You never need to have all weights in memory. You can swap them in from RAM, disk, the network, etc. MOE reduces the amount of data that will need to be swapped in for the next forward pass.

▲

martinald

3 hours ago

[-]

Yes you're right technically, but in reality you'd be swapping them the (vast?) majority in and out per inference request so would create an enormous bottleneck for the use case the author is using for.

▲

zozbot234

3 hours ago

[-]

With unified memory, reading from RAM to GPU compute buffer is not that painful, and you can use partial RAM caching to minimize the impact of other kinds of swapping.

▲

charcircuit

1 hour ago

[-]

You don't have to only have the experts being actively used in VRAM. You can load as many weights as will fit. If there is a "cache miss" you have to pay the price to swap in the weights, but if there is a hit you don't.

▲

jonplackett

6 hours ago

[-]

So wait what is the interaction between Gemma and Claude?

▲

unsnap_biceps

6 hours ago

[-]

lm studio offers an Anthropic compatible local endpoint, so you can point Claude code at it and it'll use your local model for it's requests, however, I've had a lot of problems with LM Studio and Claude code losing it's place. It'll think for awhile, come up with a plan, start to do it and then just halt in the middle. I'll ask it to continue and it'll do a small change and get stuck again.

Using ollama's api doesn't have the same issue, so I've stuck to using ollama for local development work.

▲

keerthiko

6 hours ago

[-]

Claude Code is fairly notoriously token inefficient as far as coding agent/harnesses go (i come from aider pre-CC). It's only viable because the Max subscriptions give you approximately unlimited token budget, which resets in a few hours even if you hit the limit. But this also only works because cloud models have massive token windows (1M tokens on opus right now) which is a bit difficult to make happen locally with the VRAM needed.

And if you somehow managed to open up a big enough VRAM playground, the open weights models are not quite as good at wrangling such large context windows (even opus is hardly capable) without basically getting confused about what they were doing before they finish parsing it.

▲

unsnap_biceps

5 hours ago

[-]

I use CC at work, so I haven't explored other options. Is there a better one to use locally? I presumed they were all going to be pretty similar.

▲

jaggederest

4 hours ago

[-]

If you want to experiment with same-harness-different-models Opencode is classically the one to use. After their recent kerfluffle with Anthropic you'll have to use API pricing for opus/sonnet/haiku which makes it kind of a non-starter, but it lets you swap out any number of cloud or local models using e.g. ollama or z.ai or whatever backend provider you like.

I'd rate their coding agent harness as slightly to significantly less capable than claude code, but it also plays better with alternate models.

▲

blitzar

3 hours ago

[-]

I am hopeful the leaked claude code narrows the capability, perhaps even googles offering will be viable once they borrow some ideas from claude.

▲

satvikpendem

9 minutes ago

[-]

OpenCode

▲

storus

6 hours ago

[-]

Can't you use Claude caveman mode?

https://github.com/JuliusBrussee/caveman

▲

mbesto

4 hours ago

[-]

I don't get why I would use Claude Code when OpenCode, Cursor, Zed, etc. all exist, are "free" and work with virtually any llm. Seems like a weird use case unless I'm missing something.

▲

superb_dev

3 hours ago

[-]

From my experience, Claude Code is just better. Although I recently started using Zed and it’s pretty good

▲

blitzar

3 hours ago

[-]

previously I have found claude code to be just better than the alternatives, using large models or local. It is, however, closer now and not much excuse for the competition after the claude code leak. Personally, I will be giving this a go with OpenCode.

▲

bdangubic

4 hours ago

[-]

this is like asking why use intellij or vscode or … when there is vim and emacs

▲

NamlchakKhandro

1 hour ago

[-]

No it's more like, why use a Microsoft paid for distro of nvim when lazyvim, astronvim exist

▲

aetherspawn

1 hour ago

[-]

Can you use the smaller Gemma 4B model as speculative decoding for the larger 31B model?

Why/why not?

▲

asymmetric

4 hours ago

[-]

Is a framework desktop with >48GB of RAM a good machine to try this out?

▲

pshirshov

2 hours ago

[-]

Only for chat sessions, not for agentic coding. It's just too slow to be practical (10 minutes to answer a simple question about a 2k LoC project - and that's with a 5070 addon card).

▲

Someone1234

6 hours ago

[-]

Using Claude Code seems like a popular frontend currently, I wonder how long until Anthropic releases an update to make it a little to a lot less turn-key? They've been very clear that they aren't exactly champions of this stuff being used outside of very specific ways.

▲

nerdix

5 hours ago

[-]

I don't think there is any incentive to do so right now because the open models aren't as good. The vast majority of businesses are going to just pay the extra cost for access to a frontier model. The model is what gives them a competitive advantage, not the harness. The harness is a lot easier to replicate than Opus.

There are benefits too. Some developers might learn to use Claude Code outside of work with cheaper models and then advocate for using Claude Code at work (where their companies will just buy access from Anthropic, Bedrock, etc). Similar to how free ESXi licenses for personal use helped infrastructure folks gain skills with that product which created a healthy supply of labor and VMware evangelists that were eager to spread the gospel. Anthropic can't just give away access to Claude models because of cost so there is use in allowing alternative ways for developers to learn how to use Claude Code and develop a workflow with it.

▲

deskamess

2 hours ago

[-]

Are the Claude Code (desktop) models very different from what Bedrock has? I thought you could hook up VSCode (not Claude Desktop) to Bedrock Anthropic models. Are there features in Claude Desktop that are not in VSCode/cli?

▲

chvid

5 hours ago

[-]

Is it not about the same as using OpenCode?

And is running a local model with Claude Code actually usable for any practical work compared to the hosted Anthropic models?

▲

falcor84

3 hours ago

[-]

Well, if they did, it would probably be shooting themselves in the foot, seeing that the Claude Code source is out there now, and people are waiting for an excuse to "clean-room" reimplement and fork it

▲

moomin

6 hours ago

[-]

Right now it suits them down to the ground. You pay for the product and you don’t cost their servers anything.

▲

phainopepla2

6 hours ago

[-]

You don't pay anything to use Claude Code as a front end to non-Anthropic models

▲

quinnjh

5 hours ago

[-]

so no subscription is needed?

▲

kenmacd

2 hours ago

[-]

not to use the cli tool. You can install it and change the settings to point to pretty much any other model.

It's an okay-enough tool, but I don't see a lot of point in using it when open sources tools like Pi and OpenCode exist (or octofriend, or forge, or droid, etc).

▲

wyre

5 hours ago

[-]

I think CC is popular because they are catering to the common denominator programmer and are going to continue to do that, not because CC is particularly turn-key.

▲

NamlchakKhandro

1 hour ago

[-]

I don't know why people bother with Claude code.

It's so jank, there are far superior cli coding harness out there

▲

dimgl

1 hour ago

[-]

Vagueposting in Hacker News?

▲

loveparade

1 hour ago

[-]

What do you recommend? I've tried both pi and opencode and both are better than claude imo, but I wonder if there are others.

▲

tarruda

1 hour ago

[-]

Codex is the best out-of-box experience, especially due to its builtin sandboxing. Only drawback is that its edit tool requires the LLM to output a diff which only GPTs are trained to do correctly.

▲

prettyblocks

7 minutes ago

[-]

how is codex sandbox different from /sandbox on claude code?

▲

loveparade

1 hour ago

[-]

Interesting, I don't like codex exactly because of its built-in sandboxing. If I need a sandbox I rather do a simple bwrap myself around the agent process, I prefer that over the agent cli doing a bunch of sandboxing magic that gets in my way.

▲

z0mghii

1 hour ago

[-]

Can you elaborate what is jank about it?

▲

inzlab

1 hour ago

[-]

awesome, the lighter the hardware running big softwares the more novelty.