GLM-4.7-Flash (huggingface.co)
194 points | 2 hours ago | 15 comments
dajonker
1 hour ago
[-]
Great, I've been experimenting with OpenCode and running local 30B-A3B models on llama.cpp (4 bit) on a 32 GB GPU, so there's plenty of VRAM left for 128k context. So far Qwen3-coder gives me the best results. Nemotron 3 Nano is supposed to benchmark better, but it doesn't really show for the kind of work I throw at it, mostly "write tests for this and that method which are not covered yet". Will give this a try once someone has quantized it in ~4 bit GGUF.

Codex is notably higher quality but also has me waiting forever. Hopefully these small models get better and better, not just at benchmarks.
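For anyone curious, a minimal sketch of that kind of setup - the model file, context size, and port are placeholders rather than a tested command:

  # serve a ~4-bit MoE quant locally with 128k context, all layers on the GPU,
  # and the model's chat template enabled (useful for tool calls from OpenCode)
  llama-server \
  -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  -c 131072 \
  -ngl 99 \
  --jinja \
  --port 8080

OpenCode (or any OpenAI-compatible client) can then be pointed at http://localhost:8080/v1.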

reply
behnamoh
45 minutes ago
[-]
> Codex is notably higher quality but also has me waiting forever.

And while it usually leads to higher-quality output, sometimes it doesn't, and I'm left with BS AI slop that would have taken Opus just a couple of minutes to generate anyway.

reply
latchkey
1 hour ago
[-]
reply
WanderPanda
35 minutes ago
[-]
I find it hard to trust post-training quantizations. Why don't they run benchmarks to see the degradation in performance? It sketches me out, because it should be the easiest thing to automatically run a suite of benchmarks.
reply
dajonker
1 hour ago
[-]
Yes, I usually run Unsloth models; however, you are now linking to the big model (355B-A32B), which I can't run on my consumer hardware.

The flash model in this thread is more than 10x smaller (30B).

reply
a_e_k
23 minutes ago
[-]
When the Unsloth quant of the flash model does appear, it should show up as unsloth/... on this page:

https://huggingface.co/models?other=base_model:quantized:zai...

Probably as:

https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
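Once it exists, something like this should grab just the ~4-bit files (repo name as guessed above; the include pattern assumes Unsloth's usual naming):

  huggingface-cli download unsloth/GLM-4.7-Flash-GGUF \
  --include "*Q4_K_M*" \
  --local-dir GLM-4.7-Flash-GGUF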

reply
homarp
16 minutes ago
[-]
It's a new architecture, not yet implemented in llama.cpp.

issue to follow: https://github.com/ggml-org/llama.cpp/issues/18931

reply
dumbmrblah
16 minutes ago
[-]
One thing to consider is that this version is a new architecture, so it’ll take time for Llama CPP to get updated. Similar to how it was with Qwen Next.
reply
latchkey
1 hour ago
[-]
There are a bunch of 4-bit quants at the GGUF link, and 0xSero has some smaller stuff too. Might still be too big, though, and you'll need to un-GPU-poor yourself.
reply
disiplus
58 minutes ago
[-]
Yeah, there is no way to run 4.7 on 32 GB of VRAM. This Flash version is something I'm also waiting to try later tonight.
reply
infocollector
9 minutes ago
[-]
Maybe someone here has tackled this before. I’m trying to connect Antigravity or Cursor with GLM/Qwen coding models, but haven’t had any luck so far. I can easily run Open-WebUI + LLaMA on my 5090 Ubuntu box without issues. However, when I try to point Antigravity or Cursor to those models, they don’t seem to recognize or access them. Has anyone successfully set this up?
reply
vessenes
2 hours ago
[-]
Looks like solid incremental improvements. The UI oneshot demos are a big improvement over 4.6. Open models continue to lag roughly a year on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 32B active - so it's a tough one to self-host. It's a good candidate for a Cerebras endpoint in my mind - getting Sonnet 4.x (x<5) quality with ultra-low latency seems appealing.
reply
Workaccount2
35 minutes ago
[-]
Unless one of the open model labs has a breakthrough, they will always lag. Their main trick is distilling the SOTA models.

People talk about these models like they are "catching up"; they don't see that they are just trailers hooked up to a truck that is pulling them along.

reply
HumanOstrich
1 hour ago
[-]
I tried Cerebras with GLM-4.7 (not Flash) yesterday using paid API credits ($10). They have per-minute rate limits, and cached tokens count against them, so you get limited in the first few seconds of every minute and then have to wait out the rest of it. So they're "fast" at 1000 tok/sec - but not really for practical usage. Between the rate limits and the cached-token penalty, you effectively get <50 tok/sec.
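Back-of-envelope on that effective rate, with a made-up per-minute cap just to illustrate (not Cerebras's actual number):

  # a hypothetical 3,000-token/min cap is burned in ~3 s at 1000 tok/s,
  # then you wait out the remaining ~57 s of the minute
  echo "$((3000 / 60)) tok/s average"   # prints: 50 tok/s average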

They also charge full price for the same cached tokens on every request/response, so I burned through $4 on one relatively simple coding task - it would've cost <$0.50 using GPT-5.2-Codex or any other model that supports caching (besides Opus and maybe Sonnet). And it would've been much faster.

reply
twalla
39 minutes ago
[-]
I hope Cerebras figures out a way to be worth the premium - seeing two pages of written content output in the literal blink of an eye is magical.
reply
behnamoh
43 minutes ago
[-]
> The UI oneshot demos are a big improvement over 4.6.

This is a terrible "test" of model quality. All these models fail when your UI is out of distribution; Codex gets close but still fails.

reply
ttoinou
38 minutes ago
[-]
Sonnet was already very good a year ago; are open-weight models right now as good?
reply
jasonjmcghee
34 minutes ago
[-]
FWIW, Sonnet 4.5 is very far ahead of where Sonnet was a year ago.
reply
mckirk
1 hour ago
[-]
Note that this is the Flash variant, which is only 31B parameters in total.

And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive if it translated to real-world usage, for something you can realistically run at home.

reply
montroser
31 minutes ago
[-]
This is their blurb about the release:

    We’ve launched GLM-4.7-Flash, a lightweight and efficient model designed as the free-tier version of GLM-4.7, delivering strong performance across coding, reasoning, and generative tasks with low latency and high throughput.

    The update brings competitive coding capabilities at its scale, offering best-in-class general abilities in writing, translation, long-form content, role play, and aesthetic outputs for high-frequency and real-time use cases.
https://docs.z.ai/release-notes/new-released
reply
bilsbie
1 hour ago
[-]
What’s the significance of this for someone out of the loop?
reply
epolanski
58 minutes ago
[-]
You can run GPT-5-mini-level AI on your MacBook with 32 GB of RAM.

You can get LLMs as a service for cheaper.

E.g., this model costs less than a tenth of what Haiku 4.5 costs.

reply
montroser
16 minutes ago
[-]
> SWE-bench Verified 59.2

This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.

reply
esafak
32 minutes ago
[-]
When I want fast I reach for Gemini, or Cerebras: https://www.cerebras.ai/blog/glm-4-7

GLM 4.7 is good enough to be a daily driver but it does frustrate me at times with poor instruction following.

reply
baranmelik
49 minutes ago
[-]
For anyone who’s already running this locally: what’s the simplest setup right now (tooling + quant format)? If you have a working command, would love to see it.
reply
pixelmelt
23 minutes ago
[-]
I would look into running a 4-bit quant using llama.cpp (or any of its wrappers).
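Once the GGUFs exist and llama.cpp support lands (see the other subthread about the new architecture), a one-liner along these lines should do it - repo name and quant tag are guesses:

  llama-server -hf unsloth/GLM-4.7-Flash-GGUF:Q4_K_M -c 32768 -ngl 99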
reply
dfajgljsldkjag
1 hour ago
[-]
Interesting they are releasing a tiny (30B) variant, unlike the 4.5-air distill which was 106B parameters. It must be competing with gpt mini and nano models, which personally I have found to be pretty weak. But this could be perfect for local LLM use cases.

In my experience, small-tier models are good for simple tasks like translation and trivia answering, but useless for anything more complex. The 70B class and above is where models really start to shine.

reply
pixelmelt
24 minutes ago
[-]
I'm glad they're still releasing models despite going public.
reply
eurekin
1 hour ago
[-]
I'm trying to run it, but getting odd errors. Has anybody managed to run it locally and can share the command?
reply
karmakaze
2 hours ago
[-]
Not much info beyond it being a 31B model. Here's info on GLM-4.7[0] in general.

I suppose Flash is merely a distillation of that. Filed under mildly interesting for now.

[0] https://z.ai/blog/glm-4.7

reply
lordofgibbons
1 hour ago
[-]
How interesting it is depends purely on your use-case. For me this is the perfect size for running fine-tuning experiments.
reply
redrove
1 hour ago
[-]
A3.9B MoE apparently
reply
twelvechess
1 hour ago
[-]
Excited to test this out. We need a SOTA 8B model bad though!
reply
piyh
15 minutes ago
[-]
reply
cipehr
1 hour ago
[-]
Is essentialai/rnj-1 not the latest attempt at that?

https://huggingface.co/EssentialAI/rnj-1

reply
XCSme
2 hours ago
[-]
Seems to be marginally better than gpt-20b, but this is 30b?
reply
strangescript
1 hour ago
[-]
I find gpt-oss 20b very benchmaxxed and as soon as a solution isn't clear it will hallucinate.
reply
blurbleblurble
1 hour ago
[-]
Every time I've tried to actually use gpt-oss 20b, it's just gotten stuck in weird feedback loops reminiscent of the time HAL got shut down back in the year 2001. And these are very simple tests, e.g. I try to get it to check today's date from the time tool to get more recent search results from the arxiv tool.
reply
lostmsu
1 hour ago
[-]
It actually seems worse. gpt-20b is only 11 GB because it is prequantized in mxfp4, while GLM-4.7-Flash is 62 GB. In that sense GLM is closer to - and actually slightly larger than - gpt-120b, which is 59 GB.

Also, according to the gpt-oss model card, 20b scores 60.7 on SWE-Bench Verified (GLM claims they measured 34 for that model) and 120b scores 62.7, vs. the 59.7 GLM reports.
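Rough math behind the size gap, with approximate bytes-per-parameter figures:

  # bf16 is ~2 bytes/param, mxfp4 is ~0.5 bytes/param
  echo "~31B params * 2 B = $((31 * 2)) GB bf16 vs ~20B params * ~0.5 B = ~11 GB mxfp4"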

reply
epolanski
2 hours ago
[-]
Any cloud vendor offering this model? I would like to try it.
reply
PhilippGille
2 hours ago
[-]
z.ai itself, or Novita for now; others will probably follow soon.

https://openrouter.ai/z-ai/glm-4.7-flash/providers

reply
epolanski
1 hour ago
[-]
Interesting, it costs less than a tenth of what Haiku costs.
reply
saratogacx
1 hour ago
[-]
GLM itself is quite inexpensive. A year's sub to their coding plan is only $29 and works with a bunch of different tools. I use it heavily as an "I don't want to spend my Anthropic credits" day-to-day model (mostly using Crush).
reply
dvs13
2 hours ago
[-]
reply
latchkey
1 hour ago
[-]
We don't have a lot of GPUs available right now, but it is not crazy hard to get it running on our MI300x. Depending on your quant, you probably want a 4x.

ssh admin.hotaisle.app

Yes, this should be made easier - you should just be able to get a VM with it pre-installed. Working on that.
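Rough sizing behind "you probably want a 4x", using my own assumptions (FP8 at ~1 byte/param, 192 GB HBM per MI300x):

  echo "weights: ~355 GB FP8 vs $((4 * 192)) GB HBM on a 4x"   # headroom left for KV cache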

reply
omneity
1 hour ago
[-]
Unless you're using Docker, if vLLM isn't provided pre-built against the ROCm dependencies, it's going to be time-consuming.

It took me quite some time to figure out the magic combination of versions and commits, and to build each dependency successfully to run on an MI325x.

reply
latchkey
1 hour ago
[-]
Agreed, the OOB experience kind of sucks.

Here is the magic (assuming a 4x)...

  # start the ROCm vLLM nightly image with the GPU devices, host network,
  # and model cache passed through
  docker run -it --rm \
  --pull=always \
  --ipc=host \
  --network=host \
  --privileged \
  --cap-add=CAP_SYS_ADMIN \
  --device=/dev/kfd \
  --device=/dev/dri \
  --device=/dev/mem \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v /home/hotaisle:/mnt/data \
  -v /root/.cache:/mnt/model \
  rocm/vllm-dev:nightly
  
  # inside the container: point the HF cache at the host-mounted model dir
  mv /root/.cache /root/.cache.foo
  ln -s /mnt/model /root/.cache
  
  # serve the FP8 weights across all 4 GPUs (tensor + expert parallel), with an
  # FP8 KV cache, GLM tool/reasoning parsers, and MTP speculative decoding
  VLLM_ROCM_USE_AITER=1 vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --quantization fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --load-format fastsafetensors \
  --enable-expert-parallel \
  --allowed-local-media-path / \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --mm-encoder-tp-mode data
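If that comes up cleanly, vLLM serves the usual OpenAI-compatible API (default port 8000), so anything that speaks /v1/chat/completions can point at it.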
reply
xena
2 hours ago
[-]
The model literally came out less than a couple of hours ago; it's going to take people a while to tool it up for their inference platforms.
reply
idiliv
2 hours ago
[-]
Sometimes model developers coordinate with inference platforms to time releases in sync.
reply