FilterHN

1 hour ago

[-]

I run Q4_K_XL. All it takes to get about 6tk/sec is 512gb of ram and 2 3090 GPUs with llama.cpp -cmoe. I also have crappy DDR4 at 2400mhz; 3200mhz would bring that speed up to about 9tk/sec. I also have an ok 32core epyc CPU; a better 64core would bring it up to about 11tk/sec. I did a budget build before the crazy hardware cost and I regret it every day. Nevertheless, it's fantastic being able to run this model at home. It's great for planning, and for one shot prompting once you have a plan or all the context you need. This entire hardware cost $2400 when it was built. If you're willing to be resourceful, you can find ways to run these models at home. I often get the silly question of why, and suggestions about how much I can save using cloud API, but the Fable drama has opened up eyes on why it's good for us to be independent. Thanks team unsloth, Q4_K_XL is solid. If you are going to grab a quant, make sure to get the K_XL variant if it can fit.

redox99

1 hour ago

[-]

That's crazy good for $2400.

Frannky

7 minutes ago

[-]

There is a push from multiple directions at the same time:

- new AI desktops with GB10s. They are relatively cheap and you can cluster them and load 1TB of VRAM

- Nvidia, amd, intel, Cerebras etc pushing new hardware

- oss models getting crazy good, like glm 5.2

- flash models getting very good like deepseek V4 flash

- quantizations

- harnesses being able to use different models (big for difficult stuff, small for grunt work)

So hopefully soon for the ones who want to break free from APIs, we will be able to host at home a cluster of AI desktops at a reasonable price with Opus-level capabilities, can't wait!!

https://unsloth.ai/docs/models/glm-5.2#usage-guide

xrd

5 hours ago

[-]

So close! My machine with 192GB RAM + RTX 3090 24GB can almost run this. It says it needs 24GB of VRAM and 256GB of RAM for MoE offloading.

In a prior thread, someone said it would take $500k in hardware:

https://news.ycombinator.com/item?id=48629970

elliotbnvl

4 hours ago

[-]

$500k is a vast overestimation. For massive concurrency at FP8 or even BF16 maybe.

NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible at a $80/90k figure with today's prices, maybe even less. That buys you 6 RTX 6000 PRO Blackwells, a decent CPU and motherboard, power supply. 576gb of VRAM.

You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.

hbbio

2 hours ago

[-]

Yes, a single GB300 workstation also does it, probably even more than 120tok/s.

Official price 85k...

__m

3 hours ago

[-]

How fast will the hardware become outdated? Are there big improvements expected in the next 3 years?

easygenes

2 hours ago

[-]

M5 Ultra will ship before end of year, likely. Though with current RAM shortage, likely max spec will be 256GB and in short supply.

In late 2027 or early 2028, Nvidia will release Vera Rubin DGX Spark, likely with double or better the performance of current Blackwell, though unclear if memory capacity will go up much from current 128GB. Two to four of those will run models like this decently.

In 2028 we should expect Vera Rubin RTX discrete lineup, including the replacement to the RTX PRO 6000. Likely memory spec will be minimum 128GB. Good chance of up to 200GB. Two to four of those will run NVFP4 models in this class very well.

digitaltrees

1 hour ago

[-]

I feel like the models are good enough for a decade of future work. So once you have a working setup you can keep using it to do the work at the same level. There will be better stuff, and it may make that type of work obsolete, but if you can do useful things it won’t be worth less.

1 hour ago

[-]

P40 was released in 2016 and is still selling like hotcakes!

mgambati

5 hours ago

[-]

With 2 wouldn’t have good results. Ideal range for coding is at least Q8.

kibibu

5 hours ago

[-]

According to this very article, 4-bit dynamic is essentially lossless

Aurornis

3 hours ago

[-]

Watch out. Those claims are often made based on KL-divergence over some arbitrary corpus, not performance in the real world or benchmarks.

I’ve found that I need to go a couple steps past whatever quantizations are good enough in the KL-divergence testing to get good performance in real tasks with long context. So when Q4 is claimed to be lossless I end up with Q5 or Q6 for actual long-context tasks.

cheema33

4 hours ago

[-]

I have the RAM, but not the VRAM. What kind of speed/tps could you expect from a 3090 with 24GBs of RAM? I am somewhat tempted to pick a GPU with 24GBs of RAM.

phamilton

3 hours ago

[-]

Generation is basically just memory bandwidth math.

Each token has to read all the active weights. I think that's around 40B parameters active. At a 4-bit quant that's 20GB. With 100GB/s (replace with whatever your bandwidth is) you get 5 tokens per second.
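The arithmetic above can be written out as a minimal sketch. It assumes decode is purely bandwidth-bound and ignores KV cache reads, compute, and the split between GPU and system memory, so treat the result as a ceiling rather than a prediction. The function name is made up for illustration.

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_per_s):
    """Rough upper bound on decode speed when each token streams all active weights once."""
    # GB read per token: active parameters (in billions) times bytes per parameter.
    active_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_per_s / active_gb


# Figures from the comment: ~40B active parameters, 4-bit quant, 100 GB/s.
print(decode_tokens_per_second(40, 4, 100))  # 5.0
```

Swap in your own bandwidth figure as the third argument to get the ceiling for your machine.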

uberex

3 hours ago

[-]

Funny I casually asked Gemini and it said 500k for unquantized with decent throughput.

stymaar

2 hours ago

[-]

This is why you shouldn't uncritically believe an answer from an LLM (nor any answer from a human, for that matter).

colinsane

1 hour ago

[-]

i asked gemini and it replied with "Error: 400 Your prompt was blocked by safety filters. Please revise and try again."

j45

2 hours ago

[-]

LLMs aren't discrete calculators or estimators of things unless framed and guided to do so.

ijidak

2 hours ago

[-]

Crossing my fingers that this boom jumpstarts '90s-like improvements in computing hardware.

I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.

Most of the money and energy went to mobile for the last fifteen years.

Affordable local inference might be the gravy train the server, desktop, and laptop manufacturers need to get back in gear.

0xbadcafebee

46 minutes ago

[-]

Definitely the stagnation was due to a lack of use cases, but this isn't a bad thing. We don't need most of the hardware advancement we got.

Business hardware got beefier because businesses demanded more data (or more specifically: the industry told businesses they needed more data), with no idea of what to actually do with it once they got it. To get all that data, bandwidth needed to be increased, with more iops to read/write it, more storage to keep it, and more memory and cpu to process it. But 99% of the data is junk. Companies have "data lakes" so big they need to come up with excuses to use the data, or risk somebody pointing out that they're spending a fortune hoarding bits.

Consumer hardware hasn't had a new use case since like 2012. Faster wifi for broadband & local file transfers, and higher-resolution video, are the only reasons one needed newer hardware. We actually got a resolution so high it makes no perceivable difference. And yeah we got faster CPUs and memory, but as soon as we did it got all eaten up by the most inefficient, wasteful software conceivable. Same use cases as 13 years ago, just more expensive, harder to use, and buggier. We should've gotten a new CPU architecture that was faster and more energy efficient. Finally it was delivered, but with a moat around the golden Apple.

Here we are two and a half decades into the Internet era, and my damn bluetooth earbuds and webcam microphone don't work half the time that I open a video conferencing app. Hardware can stay exactly like it is for the next few decades and I'd be happy. I just want software that works, and doesn't get continuously slower, forcing me to buy bigger hardware; or more draconian, locking me out of being able to use it how I want.

gruez

1 hour ago

[-]

>I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.

No, we're running into limits of moore's law, and it's showing in prices for new nodes, where they're getting denser but not cheaper.

horsawlarway

31 minutes ago

[-]

It's true we hit limits, but I feel like a lot of it was "limits" in the sense that the tradeoff stopped being worth the cost, so we optimized in other areas.

So we hit limits on clock speed in the early 2000s (ex - the 4ghz wall) but it also turned out that mobile as the driver for sales meant no one really cared much about clock speed compared to performance/watt.

Clock speed mattered, but only relative to how many watts it took to get it (and above 4ghz... too many watts).

But we've seen a 15x improvement over the last 20 years. Performance/Watt is WAY up.

My guess is that LLMs are going to drive another "improvement cycle" in areas that we didn't care much about before.

I've built about 10 personal desktop machines (1 every ~4 years) and I can honestly say that I didn't care much about memory bandwidth prior to 2021.

In the same way that I didn't care much about how many watts my pentium 4 was using in 2005.

But now... now I care a lot about memory bandwidth. I care about memory speeds and total system ram in a manner I really, really didn't before.

So I think we're going to see a big shift to machines built on unified ram with a crazy focus on squeezing memory bandwidth and total ram capacity as far as we can.

My bet is that we'll get a similar 10-15x improvement by 2040 in unified system ram designs.

I fully expect to see 2tb unified ram desktops and 200gb unified ram phones be relatively common on a 20 year timeline, assuming we see similar levels of geopolitical stability (ex - world war 3 throws a wrench into things).

linzhangrun

1 hour ago

[-]

Physical limitations of the manufacturing process may be a more significant factor, starting from TSMC 10nm ten years ago.

skiing_crawling

4 hours ago

[-]

"it can fit" on 256GB of RAM, but it will be heavily quantized and still run very slowly. The headline number is not token generation, it's prompt processing. So if you get 10 tok/s and an API gives you 20-30 tok/s, it doesn't seem that bad on its face, but a mac studio or any other machine that's not loading all of it into GPU will do PP 20-50X slower than a purely GPU based setup, which is what actually makes this unusable without $50k in GPUs.

On top of that, you will still be heavily quantized.

gerdesj

3 hours ago

[-]

A nvidia spark thingie has 128GB unified RAM. They also have a dual port version of one of these things: https://www.nvidia.com/content/dam/en-zz/Solutions/networkin.... ie 2 x 100GB/s ports, they may even be 2 x 200GB/s. Once I've got my paws on one, I'll know more.

You can cluster these beasts too. Two and three (with two IP subnets) is fairly obvious. Four or more might need a switch depending on how much network latency affects things.

Apple seem to have forgotten about M series with gobs of RAM. I can't get the Apple shop to show more than 96GB of unified RAM and that costs a kidney.

mapontosevenths

3 hours ago

[-]

I have one, and I love it. That said, my buddy's Mac smokes it for inference workloads in terms of tokens per second AND it's more usable for other things.

If you are training and doing research it's great, and if you want to cluster them it can't be beat, but if you just want local inference on a single box buy a mac or even a strix halo device.

colinsane

1 hour ago

[-]

can those macs boot linux? i've heard about Asahi but have no idea how far along they are. i've got my fleet configured with nix and sure, nix can target darwin, but there's a _lot_ of sharp edges there: i don't really want to pull that thread unless i have to...

https://build.nvidia.com/spark

mapontosevenths

1 hour ago

[-]

I don't know. I think he just uses LMStudio most of the time on his, but that's one place I can say the spark really shines for me.

I'm a Linux guy, but also don't always have a lot of time. The Spark comes out of the box with a nice Linux distro that's pre-configured to be easy to set up, and the guides and online resources make getting up and running trivial, even for some complex tasks. You would have to do a LOT of tinkering just to figure out some of the things the nvidia resources walk you through natively. They have guides for a ton of stuff that include the optimal settings so you don't have to figure it all out through trial and error.

Check out these "playbooks" for some examples. [0] There's a lot to be said for not having to piece all that together yourself.

I think going from unboxing mine, to setting it up to run headless, to generating tokens took like 20 minutes total for me.

Fizz43

2 hours ago

[-]

which mac is smoking the spark?

pmarreck

2 hours ago

[-]

pretty much any of them, dude, as long as you have enough RAM, since it uses unified RAM and a powerful SoC CPU/GPU. Literally any M-class model, but the M5 is currently top tier.

10 minutes ago

[-]

The DGX Spark has basically the same memory bandwidth as a M5 Pro, and far more than a M5.

Only the M3 Ultra really beats it, and once you start scoping out the cost of a M3 Ultra with 128GB or 256GB, the DGX Spark doesn’t look bad after all.

mapontosevenths

1 hour ago

[-]

Yep. Memory bandwidth is what decides how fast LLMs generate tokens (mostly). The DGX Spark has something like 270 GB/s of memory bandwidth, and the m5 ultra is ~615 GB/s. Theoretically DOUBLE the speed. In practice he only generates like 25% more tok/s, but that's still very impressive.

The spark can fine tune models in 1/4 the time and excels at other compute tasks in ways that Mac never can. Plus the high bandwidth ConnectX-7 ports would be like $1700 to buy on a card just for the network adapters... But for generating tokens, it just plain loses.

jauntywundrkind

2 hours ago

[-]

200 Gb / s (not GB/s)!

(Still potentially very useful! But not magically ultra fast.)

Computer0

3 hours ago

[-]

128 gb of much slower ram than Apple.

8 minutes ago

[-]

DGX Spark is ~273GB/s. That’s about M5 Pro territory, and twice as fast as the M5. You’d have to go to the M5 Max, or M3 Ultra, to get higher memory bandwidth than the Spark.

pheggs

4 hours ago

[-]

I feel like the gap is closing to be able to run good enough models locally even for coding and I would assume it could make some companies a bit nervous. Am I wrong about that?

4 hours ago

[-]

If we didn't have a RAM/GPU shortage right now they would be more nervous than they are. But as it is very few people are going to be able to afford a rig that can run this model effectively. That's probably not going to change for several more years yet. I think if the Z.ai folks decide to come out with a flash version of GLM-5.2 specialized for coding that came in at about 80B params, then the US frontier labs would probably be more worried. Overall, the Chinese AI companies have been showing the way to do the same amount with less (sometimes much less) and as that trend continues it's going to make the frontier labs worried - but even the Chinese AI companies are going to want to protect their moat by not releasing capable models that are significantly smaller than their current flagship models. Alibaba Qwen seems to be there now - it's gotten mighty quiet from them lately - their latest 395B model is just too large for most folks to run at home and they don't seem to be making any noises about releasing smaller ones this time around.

7 minutes ago

[-]

When a large open weight model is released, a lab, startup, or a rich hobbyist can easily do logit-level distillation and create a XXb param model or whatever, and in theory you should get a really good distill.

gpm

3 hours ago

[-]

The ram/gpu shortage won't last forever though. Moreover we can be pretty confident that long-term the prices will obey Wright's law and come down significantly (from the pre-shortage prices) as we learn to produce them more efficiently.

LLM companies are valued as if they're going to have some enduring monopoly that they can extract money from... GLM-5.2 and similar models make that valuation very very questionable.

3 hours ago

[-]

> The ram/gpu shortage won't last forever though.

No disagreement there, but it could easily last another 3 to 5 years which is a long time in tech terms.

DougN7

24 minutes ago

[-]

Long enough for them to IPO and all the execs to retire. I doubt they care beyond the IPO.

mannanj

3 hours ago

[-]

> The ram/gpu shortage won't last forever though

Don't underestimate the market's ability to remain irrational

colinsane

1 hour ago

[-]

the companies which have the power to alleviate these shortages are the same companies who are profiting most from the shortage. scarcity is an asset, it's not irrational that a concentrated market will produce more of that asset.

selectodude

1 hour ago

[-]

The solution for high prices is high prices.

If making RAM and SSDs is now cause for a 10 figure valuation, after enough time somebody will dive in.

elorant

3 hours ago

[-]

Very few people, but quite a lot of companies especially after per token pricing took effect and companies see their invoices skyrocketing. You pay an upfront cost once and you’re done.

verdverm

3 hours ago

[-]

I suspect the time horizon is shorter because of software advances. We are getting more capability out of smaller models

Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)

1 hour ago

[-]

> Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)

True, Qwen3.6-27B is amazing for its size. However, it seems likely that we're not going to see any more of these smaller models from Alibaba/Qwen since several key players exited that organization a few months back.

verdverm

18 minutes ago

[-]

Good to know. I think the trend is clear based on the models coming out of China, and we'll see more capabilities in smaller, more efficient models.

Infernal

57 minutes ago

[-]

Do we know where those key players went?

simplyluke

2 hours ago

[-]

You don't even need to run them locally for them to be a threat. Plenty of companies are looking at paying third party companies to host these models and they come in at fractions of the price of the frontier labs.

4 hours ago

[-]

I don't think so. I could easily see a company deciding to host and run these models for their own development. If you have a dev team of about 10 people, a one time $50k investment in an LLM server has to be pretty tempting. Unlimited tokens, decent performance, upgrade options, and potential product integrations.

For companies wanting LLMs in their products in general, I have to think going the local llm route is even more tempting. Somewhat dumb models are more than good enough for a lot of the things people are integrating LLMs into their products for.

twelvechairs

4 hours ago

[-]

Surely for most the desire is just an LLM provider that doesn't store or sell their queries (including to national actors). As long as that is allowed to happen, surely it's the answer for the vast majority.

eventualcomp

4 hours ago

[-]

Where is $50k coming from again?

stingraycharles

4 hours ago

[-]

That’s less than the monthly salary of 10 software engineers, and assuming they pay API prices, probably earns itself back in about a year.

Having said that, I don’t think it’s all that tempting for companies at all, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be at in a year or two.

4 hours ago

[-]

The hardware requirements aren't evolving and the local models have only been improving.

It's not like you'd lose capabilities, if anything this solution just gets better with time.

chatmasta

3 hours ago

[-]

If the newer models require more/better hardware then you’ll lose capabilities.

I think you’re better off renting GPU instances and running all the software on those. It’ll be cheaper than Anthropic and OpenRouter but slightly more expensive than electricity and depreciation of hardware.

2 hours ago

[-]

The newer models don't require more/better hardware. There's a small army of local llm enthusiasts who are running LLMs using 3090s and H100s because they have lots of memory. Them being old isn't really that big of an issue as the compute power needed is relatively low all things considered.

The number of parameters needed for these open weight models has mostly stabilized so the actual memory requirements aren't likely to change all that much.

5 minutes ago

[-]

Correct. The main bottleneck with LLM inference is, and has always been, memory bandwidth.

TPS = your memory bandwidth (GB/s) / active weights in GB.

That’s it for decode. That’s all.

4 hours ago

[-]

As in who pays for it or how did I arrive at that number?

For who pays for it, obviously the employer would.

For "how did I arrive at this number": a ballpark estimate from what I know about part costs. Most of that money would go towards AI cards: about $5k for the mb, cpu, power supply, etc., and $45k for as much ram and as big/expensive nVidia cards as you can get your hands on. The B300 has 288GB of VRAM in it. Probably what you'd be after.

fny

4 hours ago

[-]

The RAM requirements are still pretty painful.

yieldcrv

4 hours ago

[-]

equilibrium in one or two more years on the consumer/prosumer side

think Apple M6 or M7 with a currently unforeseen denser memory style, 256gb RAM

a couple inference or cache improvements on the algorithmic side, using less ram for context windows and doubling token speed again

denser open source models, packing more experts for smaller active layers

it'll still be expensive but like $8,000 - $13,000 instead of $450,000 worth of B200s

stingraycharles

4 hours ago

[-]

Fairly certain that model sizes and computational requirements will grow as the price for LLM compute drops.

3stacks

3 hours ago

[-]

Maybe there's a conversation to be had about how much is enough... Unless something beyond my imagination happened, I would be happy enough with Opus 4.5 levels of productivity

yieldcrv

3 hours ago

[-]

have you seen the open source LLM space? people fulfill all niches and there are active communities at every range of RAM and all are looking for the most capable in their respective range

a lot of innovation occurring

scosman

1 hour ago

[-]

It's not economic to run them locally. It's amazing for privacy, and a fun hobby. But you're either looking at super slow CPU builds with $10k in RAM, $90k worth of GPUs, or a really quantized model that doesn't compare in quality.

I might build one for fun, but it's not going to change the economics alone. Still exciting it's possible.

CamouflagedKiwi

4 hours ago

[-]

The hardware requirements to run this locally are still very high. Seems far enough off mainstream for those companies not to be too worried yet.

notatoad

3 hours ago

[-]

locally on what hardware? something like the new dgx spark, ryzen halo, or mac studio will cost you ~ $4k plus whatever you pay for power. at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.

for $4k, you can get 20 months of claude max 200. i'd take claude over the hardware.
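A minimal sketch of the break-even arithmetic behind that comparison. The $4k and $200/month figures come from the comment; the power cost in the second call is a made-up placeholder, and the function name is hypothetical.

```python
def breakeven_months(hardware_cost, monthly_subscription, monthly_power_cost=0.0):
    """Months until buying hardware costs the same as paying the subscription."""
    # Each month of local use saves the subscription fee minus what you spend on power.
    saving = monthly_subscription - monthly_power_cost
    if saving <= 0:
        return float("inf")  # the hardware never pays for itself
    return hardware_cost / saving


print(breakeven_months(4000, 200))                         # 20.0
print(breakeven_months(4000, 200, monthly_power_cost=40))  # 25.0
```

This leaves out resale value and any quality gap between the local model and the hosted one, which is the part people actually argue about.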

anthropic will have something to worry about when you can run a local model on your macbook that can code. but i think we're quite a ways off from that.

chatmasta

3 hours ago

[-]

Just a hunch, but I think the most cost effective “local” deployment method right now is renting GPU clusters by the hour and running all the inference software on them yourself. This will be cheaper than capital expenditure on hardware that will depreciate and become last-gen, and cheaper than OpenRouter pay per token.

tomr75

3 hours ago

[-]

people who can't afford Claude max 200 are using qwen 3.6 27b for local coding assistance already

stymaar

2 hours ago

[-]

Honestly, Qwen3.6 is already what you need for the large majority of tasks.

(I only ask Opus every 5 to 10 requests, when my local Qwen fails or when I encounter a situation that is too world-knowledge specific to be worth asking, but that way you can live easily with Claude's cheapest plan without ever facing usage limit).

CGamesPlay

3 hours ago

[-]

Can somebody help me understand the Quantization Analysis? It says "dynamic 4-bit UD-Q4_K_XL and dynamic 5-bit UD-Q5_K_XL are generally lossless" while showing a top-1% token agreement on the chart of 97.5%. Not what I would consider "generally lossless". Is this implying that some post-processing is going to account for the 2.5% loss? Beam search?

3 minutes ago

[-]

Generally 97.5% token agreement is very positive. Like the article explains, the difference isn’t the model thinking the capital of France isn’t Paris, but rather maybe saying “The capital of France is Paris” instead of “Paris is the capital of France”.
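For anyone unsure what the two metrics in this thread measure, here is a toy sketch. It assumes "top-1 token agreement" means the fraction of positions where the quantized model's most likely token matches the reference model's, and that the KL figure is the mean KL divergence between the two next-token distributions. The logits are made-up numbers, not measurements from the article.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_agreement(ref_logits, quant_logits):
    """Fraction of positions where both models rank the same token first."""
    same = sum(
        1 for r, q in zip(ref_logits, quant_logits)
        if r.index(max(r)) == q.index(max(q))
    )
    return same / len(ref_logits)


def mean_kl(ref_logits, quant_logits):
    """Mean KL(reference || quantized) over positions, in nats."""
    total = 0.0
    for r, q in zip(ref_logits, quant_logits):
        p, s = softmax(r), softmax(q)
        total += sum(pi * math.log(pi / si) for pi, si in zip(p, s) if pi > 0)
    return total / len(ref_logits)


# Toy 3-token vocabulary, 4 positions.
ref = [[2.0, 1.0, 0.1], [0.2, 3.0, 0.5], [1.0, 1.1, 0.9], [0.0, 0.5, 2.5]]
quant = [[1.9, 1.1, 0.1], [0.3, 2.8, 0.6], [1.1, 1.0, 0.9], [0.1, 0.4, 2.4]]

print(top1_agreement(ref, quant))  # 0.75
print(mean_kl(ref, quant))
```

The one disagreement in the toy data is the third position, where the reference model was nearly tied between two tokens. That is the kind of flip the "Paris" example describes: a small KL divergence can still change the top-1 token when the original distribution was close to a coin toss.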

andai

4 hours ago

[-]

How is this model half the size of DeepSeek V4 Pro? Is it because DeepSeek did more aggressive cost cutting on the attention mechanism?

ramgine

3 hours ago

[-]

I have up to 1tb of ddr4 in my server but it only has a 12gb vram 3060. Would getting a 24gb vram make this a viable system or am I throwing money away?

1 hour ago

[-]

You can run it today with that 12gb vram 3060, but I would suggest getting 2 3090s. Use the -cmoe option. This will keep the attention/router tensors on the GPU and offload the expert weights to system memory. Try it now and see the performance.
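A sketch of what such an invocation could look like, built as an argument list in Python so it is easy to tweak. The model path, context size, and layer count are placeholders I made up; `--cpu-moe` is the long form of `-cmoe` in llama.cpp. Nothing here launches the server, it only prints the command.

```python
import shlex


def llama_server_command(model_path, ctx_size=8192, gpu_layers=99):
    """Build a llama-server argv that keeps MoE expert weights in system RAM."""
    return [
        "llama-server",
        "--model", model_path,              # hypothetical path, adjust to your download
        "--n-gpu-layers", str(gpu_layers),  # put every layer that fits on the GPU...
        "--cpu-moe",                        # ...but keep the MoE expert tensors on the CPU
        "--ctx-size", str(ctx_size),        # placeholder context length
    ]


cmd = llama_server_command("/models/GLM-5.2-UD-Q4_K_XL.gguf")
print(shlex.join(cmd))
```

Check the usage guide for the recommended context size and sampling settings before copying this as-is.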

rnewme

1 hour ago

[-]

Should work yes.

jonathanhefner

2 hours ago

[-]

> Runing GLM-5.2 on local hardware

Do the runes make it smarter or just run faster (or both)?

snootypoot

2 hours ago

[-]

if sam altman didn't exist i could afford to run this

dofm

2 hours ago

[-]

Can't run this myself.

But I do like Unsloth Studio, quite a lot. It's nicely designed.

Wowfunhappy

3 hours ago

[-]

> The full model requires 1.51TB of disk space

...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?

I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have.

But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!

gcr

3 hours ago

[-]

There are two forms of compression relevant to LLMs:

1. Reduce the number of parameters

2. Reduce the resolution of each parameter (quantization)

For 1, changing the architecture is typically only possible by the labs producing the models, which is why each OSS model release tends to feature a small number of carefully chosen model sizes (for example, Gemma4 comes in e2B, e4B, 12B, 26Ba4B, and 31B sizes).

Generally, models with higher parameter counts have more world knowledge. For coding models, this shows up as a stronger command of uncommon libraries/languages. Very small models (<20B) also lack “smarts.”

Reducing the resolution of each parameter is easier which is why lots of practitioners have their own quantizations, but this makes it harder for a model to “think” fluently. Interacting with heavily quantized models feels like interacting with someone who didn’t get any sleep the night before.

Models that have higher-fidelity quantization take more RAM and have higher “smarts,” but don’t necessarily have more world knowledge. Models with aggressive quantization tend to be more likely to make rookie mistakes, emit malformed tool calls, get stuck in loops, or even exhibit signs of “neuroticism” / “distress” in their thinking tokens.

Parameter counts = world knowledge, quantization = “smarts.”

This is a soft rule of thumb, the difference isn’t very strong.

SirMadam

3 hours ago

[-]

SOTA LLM specific compression achieves around ~54%! https://arxiv.org/abs/2505.06252v3

redox99

3 hours ago

[-]

Probably not at all, considering weights are randomly initialized.

hxii

3 hours ago

[-]

Any time I see one of these posts about models of this size a quote comes to mind – "Your Scientists Were So Preoccupied With Whether Or Not They Could, They Didn’t Stop To Think If They Should".

Only a select few have the hardware required to run this to begin with, and even then the forecasted performance makes me wonder if it’s worth it at all.

1 hour ago

[-]

Completely worth it. At 6tk a second. If I can get 2 hrs of token generation. That's 2hrs * 3600secs * 6tk = 43200 tokens, at about 10tk to a line of code, that's about 4320 lines. Let's even trim it more and slice it by half. That's 2160 lines of code a day. Most professional programmers can't deliver that much consistently in a day.
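The same estimate as a small sketch, so the assumptions are easy to change. The 10 tokens per line and the "slice it by half" factor are the rough guesses from the paragraph above, not measured values.

```python
def lines_per_day(tokens_per_second, hours, tokens_per_line=10, keep_fraction=0.5):
    """Return (tokens generated, usable lines of code) for one generation session."""
    tokens = tokens_per_second * hours * 3600
    lines = tokens / tokens_per_line
    # keep_fraction discounts output that is thinking, retries, or otherwise thrown away.
    return tokens, lines * keep_fraction


tokens, lines = lines_per_day(6, 2)
print(tokens, lines)  # 43200 2160.0
```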

The key to a model this large is (1) Use it to plan: generate lots of plans and farm them out to a smaller model. (2) Then, for very specific and complicated portions, precisely prompt for what you need.

nullc

4 hours ago

[-]

Just running cpu only w/ Q6 on 9684X I get about 1tok/s ... also still get about 1tok/s/stream when running 16 in parallel.

zuzululu

4 hours ago

[-]

wonder if AMD's new ai chip can run this with ease? I'm seriously considering buying it. GLM 5.2 is just shy of GPT 5.4 so I would welcome offloading any grunt work locally

I am very excited for local LLMs I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR

This should put more pressure on the frontier models to avoid sitting on any fancy stuff and lower token prices as a whole.

Nothing beats a local LLM disconnected from the cloud.

4 hours ago

[-]

Are you talking about Medusa Halo? It's going to support up to 256GB unified memory (up from 128GB for Strix Halo and 192GB for Gorgon Halo). That might just be barely enough to run a 2-bit quant GLM-5.2. It will expand the memory bus to 384 bits, vs. 256 bits for Strix Halo, which will help with bandwidth (projected to be around 500 GB/sec). But don't expect Medusa Halo-based machines to appear until sometime in 2028.
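For a feel of how bus width maps to bandwidth, here is a rough sketch. Peak bandwidth is bus width in bytes times the transfer rate. The 8000 MT/s figure is my assumption (LPDDR5X-8000, as used with current Strix Halo), not something from this thread, and the function names are made up.

```python
def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_s):
    """Theoretical peak memory bandwidth in GB/s."""
    return bus_width_bits / 8 * mega_transfers_per_s / 1000


def required_mt_s(bus_width_bits, target_gb_s):
    """Transfer rate (MT/s) needed to hit a target bandwidth on a given bus width."""
    return target_gb_s * 1000 / (bus_width_bits / 8)


print(peak_bandwidth_gb_s(256, 8000))   # 256.0
print(peak_bandwidth_gb_s(384, 8000))   # 384.0
print(round(required_mt_s(384, 500)))   # 10417
```

So under these assumptions the wider bus alone gets you to roughly 384 GB/s, and a projection around 500 GB/s would also need faster memory.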

The other way this could go is that Z.ai could decide to release a smaller model targeted towards coding. They've done that before (GLM-4.7-Flash had 30B params). It would be great if they decided to release something in the 80B-100B param range. Something that size would easily run in a current Strix Halo system.

monksy

39 minutes ago

[-]

Strix Halo only supports 96gb of video memory then it goes to 32gb to the host system.

zuzululu

2 hours ago

[-]

yeah you are correct 2 bit quant won't be enough

guess we'll be paying $200/month for a while

4 hours ago

[-]

> I am very excited for local LLMs I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR

We are maybe 10 years off that.

RAM prices are going to continue to increase for the next 2 years at least.

Even putting that aside it's currently around 40-70,000 EUR to run this with a FP8 quantization (which you need to get close to maximum performance).

To actually get GPT 5.5-xhigh performance in the real world you need more headroom to support things like subagents (which will fill up your KV cache).

I like local models but realism is important. The sweet spot for the next 3 years will continue to be ~35B MoE models. They might match GPT 5.5-xhigh for chat-style problems but not for coding.

hsuduebc2

4 hours ago

[-]

I wonder if, in the near future, AI companies will acquire some RAM producers just to keep RAM prices up. It could seriously hurt their business if companies become able to host their own AI before long.

3 hours ago

[-]

I think AI companies have enough things to spend capital on already.

Iolaum

4 hours ago

[-]

At full precision GLM 5.2 may be close to GPT 5.4. But at Q2 or whatever one needs in order to run it on a prosumer device it will be worse.

Also I'm not sure where you are getting the under 2k value. I bought a Framework desktop 128GB last year and my setup was around 2.7k. The same setup now sells for around 4.7k.

nh43215rgb

4 hours ago

[-]

Even with upcoming AI Max+ PRO 495 we are capped with 192GB, so no...

kccqzy

4 hours ago

[-]

The AMD 395 supports up to 128GB unified RAM. So still not enough even at 1-bit quant unfortunately.

monksy

38 minutes ago

[-]

96gb vram is the max it supports.

benjiro29

4 hours ago

[-]

"GLM 5.2 is just shy of GPT 5.4"... If you're running the full model. As in, you have 750GB (FP8) to 1.5TB (FP16) of memory available.

Do not mix the benchmark results of GLM 5.2 FP16/FP8 with FP4 or FP2.
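A back-of-the-envelope sketch of how those sizes scale with bit width. The 750B parameter count is my inference from the "750 (FP8)" figure at one byte per weight, not an official number. Real dynamic quants mix bit widths per tensor and you still need room for the KV cache, so actual files will differ.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB."""
    return params_billion * bits_per_weight / 8


PARAMS_B = 750  # assumption: implied by 750GB at 8 bits per weight

for bits in (16, 8, 4, 2):
    print(bits, weights_gb(PARAMS_B, bits))
# 16 1500.0
# 8 750.0
# 4 375.0
# 2 187.5
```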

* FP4 will mean an accuracy loss of about 3%. Not noticeable but more chance for mistakes.

* FP2 ... which is what most people are able to run at home, for a "reasonable" price. You're looking at over 17% loss in accuracy.

At that point, you're running at less than claude-sonnet-4.6, as the issues compound with accuracy losses. And reasonably priced is still in the ~ $5000 range (192GB + GPU 32GB active/kv cache system).

For that price you're using a Codex / Claude Pro subscription for the next 4+ years with better models (by default), let alone compared with a FP2 GLM 5.2 version. And you're looking at < 10 tok/s. A Mac Studio with 512GB will net you 18 to 20+ tok/s with FP4, but ... i mean, those used to be $10,000.

Unfortunately the local hardware cost is a major issue for running large models like that.

Edit: It's funny, whenever the issue of cost and what you need to give up vs the subscription services comes up, there are always people who downvote in bad faith.

kgeist

1 hour ago

[-]

The cost of local hardware is amortized if a whole team uses it instead of just 1 dev (GPUs are extremely underutilized if you launch just 1 generation stream). I'm not sure why everyone always assumes solo devs with Macs. We've just ordered a large datacenter-grade node for use by the whole dev team, and the calculations show that it's going to cost the same amount of money as if we kept using AWS Bedrock (infosec reasons) for a couple of years, but... it gives us 100% privacy, we're immune to all the AI regulation dramas in the US/EU and all the random outages, and the developers won't have to think about token limits/weekly caps etc. ever again. And all that with a model which is Opus-grade.

(it's not our first AI server, we already have experience deploying LLMs for our clients, so the numbers look solid)