Z-Image: Powerful and highly efficient image generation model with 6B parameters
249 points
6 days ago
| 19 comments
| github.com
| HN
vunderba
8 hours ago
[-]
I've done some preliminary testing with Z-Image Turbo in the past week.

Thoughts

- It's fast (~3 seconds on my RTX 4090)

- Surprisingly capable of maintaining image integrity even at high resolutions (1536x1024, sometimes 2048x2048)

- The adherence is impressive for a 6B parameter model

Some tests (2 / 4 passed):

https://imgpb.com/exMoQ

Personally I find it works better as a refiner model downstream of Qwen-Image 20b which has significantly better prompt understanding but has an unnatural "smoothness" to its generated images.

reply
tarruda
4 hours ago
[-]
> It's fast (~3 seconds on my RTX 4090)

It is amazing how far behind Apple Silicon is when it comes to running non-language models.

Using the reference code from Z-Image on my M1 Ultra, it takes 8 seconds per step. That's over a minute for the default of 9 steps.

reply
p-e-w
2 hours ago
[-]
The diffusion process is usually compute-bound, while transformer inference is memory-bound.

Apple Silicon is comparable in memory bandwidth to mid-range GPUs, but it’s light years behind on compute.
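
One way to make the compute-bound vs. memory-bound distinction concrete is a roofline-style estimate: a workload takes roughly as long as the slower of "do the arithmetic" and "move the bytes". The sketch below is only an illustration; the function name is made up, and any numbers you feed it are your own assumptions, not measurements of any real chip.

```python
def roofline_seconds(flops, bytes_moved, peak_flops, bandwidth):
    """Return (seconds, limiting_factor) under a simple roofline model.

    flops:       arithmetic operations the workload performs
    bytes_moved: bytes read from / written to memory
    peak_flops:  hardware peak throughput in operations per second
    bandwidth:   hardware memory bandwidth in bytes per second
    """
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth
    # Whichever resource takes longer is the bottleneck.
    if compute_time >= memory_time:
        return compute_time, "compute"
    return memory_time, "memory"
```

With a model like this, two machines with similar bandwidth but very different peak throughput would look similar on memory-bound work and far apart on compute-bound work, which is the argument being made above.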

reply
tarruda
1 hour ago
[-]
> but it’s light years behind on compute.

Is that the only factor though? I wonder if PyTorch is lacking optimizations for the MPS backend.
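
A first thing worth ruling out is whether PyTorch is actually using MPS at all. This is a minimal sketch using PyTorch's own availability checks; `pick_device` is a hypothetical helper name, and it falls back to "cpu" if PyTorch is not installed. It does not say anything about how well optimized the MPS kernels are.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed, nothing to probe
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists on builds that know about MPS.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

Note that if `PYTORCH_ENABLE_MPS_FALLBACK=1` is set, operators without an MPS implementation run on the CPU instead, which could also contribute to slow steps.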

reply
nialv7
4 hours ago
[-]
China really is keeping the open weight/source AI scene alive. If in five years a consumer GPU market still exists it would be because of them.
reply
p-e-w
2 hours ago
[-]
Pretty sure the consumer GPU market mostly exists because of games, which has nothing to do with China or AI.
reply
soontimes
2 hours ago
[-]
If that’s your website please check GitHub link - it has a typo (gitub) and goes to a malicious site
reply
vunderba
2 hours ago
[-]
Thanks for the heads up. I just checked the site through several browsers and proxying through a VPN. There's no typo and it properly links to:

https://github.com/Tongyi-MAI/Z-Image

Screenshot of site with network tools open to indicate link

https://imgur.com/a/FZDz0K2

EDIT: It's possible that this issue might have existed in an old cached version. I'll purge the cache just to make sure.

reply
rprwhite
2 hours ago
[-]
The link with the typo is in the footer.
reply
vunderba
2 hours ago
[-]
Well holy crap - that's been there for about forever! I need a "domain name" spellchecker built into my Gulp CI/CD flow.

EDIT: Fixed! Thanks soontimes and rprwhite!
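
For what it's worth, a "domain name spellchecker" can be sketched in a few lines: pull the hosts out of `href` attributes and flag any that are within a small edit distance of a known-good domain without matching it exactly. This is Python rather than a Gulp plugin, and the allowlist, threshold, and function names are all assumptions for illustration.

```python
import re
from urllib.parse import urlparse

KNOWN_DOMAINS = {"github.com", "imgur.com"}  # illustrative allowlist (assumption)


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suspicious_links(html: str, known=KNOWN_DOMAINS, max_distance: int = 2):
    """Return (url, lookalike_of) pairs for hosts close to, but not in, `known`."""
    findings = []
    for url in re.findall(r'href=["\'](https?://[^"\']+)["\']', html):
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host in known:
            continue  # exact match, nothing to report
        for good in sorted(known):
            if 0 < edit_distance(host, good) <= max_distance:
                findings.append((url, good))
                break
    return findings
```

Run against a page whose footer links to `https://gitub.com/...`, this would report that link as a lookalike of `github.com`; a CI step could fail the build whenever the returned list is non-empty.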

reply
amrrs
8 hours ago
[-]
On fal, it often takes less than a second.

https://fal.ai/models/fal-ai/z-image/turbo/api

Couple that with the LoRA, in about 3 seconds you can generate completely personalized images.

The speed alone is a big factor, but if you put the model side by side with seedream, nanobanana, and other models, it's definitely in the top 5, and that's a killer combo imho.

reply
venusenvy47
4 hours ago
[-]
I don't know anything about paying for these services, and as a beginner, I worry about running up a huge bill. Do they let you set a limit on how much you pay? I see their pricing examples, but I've never tried one of these.

https://fal.ai/pricing

reply
tethys
3 hours ago
[-]
It works with prepaid credits, so there should be no risk. Minimum credit amount is $10, though.
reply
vunderba
2 hours ago
[-]
This. You can also run most (if not all) of the models that Fal.ai hosts directly from the playground tab, including Z-Image Turbo.

https://fal.ai/models/fal-ai/z-image/turbo

reply
echelon
8 hours ago
[-]
So does this finally replace SDXL?

Is Flux 1/2/Kontext left in the dust by the Z Image and Qwen combo?

reply
mythz
1 hour ago
[-]
SDXL has long been surpassed; its primary redeeming feature is fine-tuned variants for different focus and image styles.

IMO HiDream had the best quality OSS generations, Flux Schnell is decent as well. Will try out Z-Image soon.

reply
vunderba
7 hours ago
[-]
Yeah, I've definitely switched largely away from Flux. Much as I do like Flux (for prompt adherence), BFL's baffling licensing structure along with its excessive censorship makes it a noop.

For ref, the Porcupine-cone creature that ZiT couldn't handle by itself in my aforementioned test was easily handled using a Qwen20b + ZiT refiner workflow and even with two separate models STILL runs faster than Flux2 [dev].

https://imgur.com/a/5qYP0Vc

reply
tripplyons
8 hours ago
[-]
SDXL has been outclassed for a while, especially since Flux came out.
reply
aeon_ai
8 hours ago
[-]
Subjective. Most in creative industries regularly still use SDXL.

Once Z-Image base comes out and some real tuning can be done, I think it has a chance of replacing it for the function SDXL has.

reply
Scrapemist
6 hours ago
[-]
Source?
reply
echelon
3 hours ago
[-]
Most of the people I know doing local AI prefer SDXL to Flux. Lots of people are still using SDXL, even today.

Flux has largely been met with a collective yawn.

The only thing Flux had going for it was photorealism and prompt adherence. But the skin and jaws of the humans it generated looked weird, it was difficult to fine tune, and the licensing was weird. Furthermore, Flux never had good aesthetics. It always felt plain.

Nobody doing anime or cartoons used Flux. SDXL continues to shine here. People doing photoreal kept using Midjourney.

reply
muglug
8 hours ago
[-]
The [demo PDF](https://github.com/Tongyi-MAI/Z-Image/blob/main/assets/Z-Ima...) has ~50 photos of attractive young women sitting/standing alone, and exactly two photos featuring young attractive men on their own.

It's incredibly clear who the devs assume the target market is.

reply
abbycurtis33
7 hours ago
[-]
They're correct. This tech, like much before it, is being driven by the base desires of extremely smart young men.
reply
cma
6 hours ago
[-]
They maybe have an RLHF phase, but I mean there is also just the shape of the distribution of images on the internet and, since this is from Alibaba, their part of the internet/social media (Weibo) to consider.
reply
AuryGlenz
5 hours ago
[-]
Considering how gaga r/stablediffusion is about it, they weren’t wrong. Apparently Flux 2 is dead in the water even though the knowledge it has contained in the model is way, way higher than Z-Image (unsurprisingly).
reply
BoorishBears
3 hours ago
[-]
Flux 2[dev] is awful.

Z-Image is getting traction because it fits on their tiny GPUs and does porn sure, but even with more compute Flux 2[dev] has no place.

Weak world knowledge, worse licensing, and it ruins the #1 benefit of a larger LLM backbone with post-training for JSON prompts.

LLMs already understand JSON, so additional training for JSON feels like a cheaper way to juice prompt adherence than more robust post-training.

And honestly even "full fat" Flux 2 has no great spot: Nano Banana Pro is better if you need strong editing, Seedream 4.5 is better if you need strong generation.

reply
killingtime74
6 hours ago
[-]
It's interesting the handsome guy is literally Tony Leung Chiu-wai, https://www.imdb.com/name/nm0504897/, not even modified
reply
iamflimflam1
7 hours ago
[-]
The model is uncensored, so will probably suit that target market admirably.
reply
mhb
3 hours ago
[-]
Maybe both women and men prefer looking at attractive women.
reply
nubg
1 hour ago
[-]
Pray tell? I hope you didn't just post a sexist dogwhistle?
reply
bobsmooth
7 hours ago
[-]
The ratio of naked female loras compared to naked male loras, or even non-porn loras, on civitai is at least 20 to 1. This shouldn't be surprising.
reply
cess11
5 hours ago
[-]
"The Internet is really, really great..."

https://www.youtube.com/watch?v=LTJvdGcb7Fs

reply
thih9
6 hours ago
[-]
Please write what you mean instead of making veiled implications. What is the point of beating around the bush here?

It's not clear to me what you mean either, especially since female models are overwhelmingly more popular in general[1].

[1]: "Female models make up about 70% of the modeling industry workforce worldwide" https://zipdo.co/modeling-industry-statistics/

reply
muglug
5 hours ago
[-]
> Female models make up about 70% of the modeling industry workforce worldwide

Ok so a ~2:1 ratio. Those examples have a 25:1 ratio.

reply
danielbln
8 hours ago
[-]
We've come a long way with these image models, and the things you can do with a paltry 6B are super impressive. The community has adopted this model wholesale, and left Flux(2) by the wayside. It helps that Z-Image isn't censored, whereas BFL (makers of Flux 2) dedicated like a fifth of their press release to talking about how "safe" (read: censored and lobotomized) their model is.
reply
AuryGlenz
5 hours ago
[-]
To be fair, a lot of that was about their online service and not the model itself. It can definitely generate breasts.

That said I do find the focus on “safety” tiring.

reply
rfoo
6 hours ago
[-]
But this is a CCP model, would it refuse to generate Xi?
reply
vunderba
5 hours ago
[-]
reply
CamperBob2
3 hours ago
[-]
It will generate anything. Xi/Pooh porn, Taylor Swift getting squashed by a tank at Tiananmen Square, whatever, no censorship at all.

With simplistic prompts, you quickly conclude that the small model size is the only limitation. Once you realize how good it is with detailed prompts, though, you find that you can get a lot more diversity out of it than you initially thought you could.

Absolute game-changer of a model IMO. It is competitive with Nano Banana Pro in some respects, and that's saying something.

reply
cubefox
3 hours ago
[-]
I could imagine the Chinese government is not terribly interested in enforcing its censorship laws when this would conflict with boosting Chinese AI. Overregulation can be a significant inhibitor to innovation and competitiveness, as we often see in Europe.
reply
ForOldHack
2 hours ago
[-]
Explain lobotomizing an image generator? Modern problems require modern terms.
reply
xnx
8 hours ago
[-]
Z-Image seems to be the first successor to Stable Diffusion 1.5 that delivers better quality, capability, and extensibility across the board in an open model that can feasibly run locally. Excitement is high and an ecosystem is forming fast.
reply
khimaros
8 hours ago
[-]
i have been testing this on my Framework Desktop. ComfyUI generally causes an amdgpu kernel fault after about 40 steps (across multiple prompts), so i spent a few hours building a workaround here https://github.com/comfyanonymous/ComfyUI/pull/11143

overall it's fun and impressive. decent results using LoRA. you can achieve good looking results with as few as 8 inference steps, which takes 15-20 seconds on a Strix Halo. i also created a llama.cpp inference custom node for prompt enhancement which has been helping with overall output quality.

reply
thot_experiment
1 hour ago
[-]
I've messed with this a bit and the distill is incredibly overbaked. Curious to see the capabilities of the full model but I suspect even the base model is quite collapsed.
reply
nine_k
8 hours ago
[-]
It's amazing how much knowledge about the world fits into 16 GiB of the distilled model.
reply
echelon
8 hours ago
[-]
This is early days, too. We're probably going to get better at this across more domains.

Local AI will eventually be booming. It'll be more configurable, adaptable, hackable. "Free". And private.

Crude APIs can only get you so far.

I'm in favor of intelligent models like Nano Banana over ComfyUI messes (the future is the model, not the node graph).

I still think we need the ability to inject control layers and have full access to the model, because we lose too much utility by not having it.

I think we'll eventually get Nano Banana Pro smarts slimmed down and running on a local machine.

reply
bobsmooth
7 hours ago
[-]
>Local AI will eventually be booming.

With how expensive RAM currently is, I doubt it.

reply
echelon
2 minutes ago
[-]
It's temporary. Sam Altman booked all the supply for a year. Give it time to unwind.
reply
api
1 hour ago
[-]
I’m old enough to remember many memory price spikes.
reply
xfalcox
7 hours ago
[-]
We have vLLM for running text LLMs in production. What is the equivalent for this model?
reply
mh-
6 hours ago
[-]
I would say there isn't an equivalent. Some people will probably tell you ComfyUI - you can expose workflows via API endpoints and parameterize them. This is how e.g. Krita AI Diffusion uses a ComfyUI backend.

For various reasons, I doubt there are any large scale SaaS-style providers operating this in production today.
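
To illustrate the "expose workflows via API endpoints and parameterize them" part: ComfyUI accepts a workflow, exported in its API format, as JSON posted to its `/prompt` endpoint. The sketch below only builds that request and swaps one prompt string. The address is ComfyUI's usual local default, and the node id `"6"` plus the helper names are assumptions; a real workflow export will have its own node ids.

```python
import json
import urllib.request

COMFYUI_URL = "http://127.0.0.1:8188/prompt"  # usual local default; adjust for your setup


def set_prompt_text(workflow: dict, node_id: str, text: str) -> dict:
    """Return a copy of the workflow with one text node's prompt replaced."""
    updated = json.loads(json.dumps(workflow))  # cheap deep copy of JSON-like data
    updated[node_id]["inputs"]["text"] = text
    return updated


def build_prompt_request(workflow: dict, url: str = COMFYUI_URL) -> urllib.request.Request:
    """Wrap an API-format workflow in the JSON body that /prompt expects."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def submit(workflow: dict) -> str:
    """Queue the workflow; requires a running ComfyUI instance."""
    with urllib.request.urlopen(build_prompt_request(workflow)) as resp:
        return resp.read().decode("utf-8")
```

A caller would load the exported workflow JSON, call `set_prompt_text` for each parameter it wants to vary, and pass the result to `submit`.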

reply
reactordev
4 hours ago
[-]
My issue with this model is it keeps producing Chinese people and Chinese text. I have to very specifically go out of my way to say what kind of race they are.

If I say “A man”, it’s fine. A black man, no problem. It’s when I add context and instructions that it just seems to want to go with some Chinese man. Which is fine, but I would like to see more variety in the people it’s trained on, to create more diverse images. For non-people it’s amazingly good.

reply
orbital-decay
2 hours ago
[-]
All modern models have their default looks. Meaningful variety of outputs for the same inputs in finetuned models is still an open technical problem. It's not impossible, but not solved either.
reply
thih9
6 hours ago
[-]
As an AI outsider with a recent 24GB macbook, can I follow the quick start[1] steps from the repo and expect decent results? How much time would it take to generate a single medium quality image?

[1]: https://github.com/Tongyi-MAI/Z-Image?tab=readme-ov-file#-qu...

reply
aleyan
4 hours ago
[-]
I have a 24GB M5 macbook pro. In ComfyUI using default z-image workflow, generating a single image just took me 399 seconds, during which the computer froze and my airpods lost audio.

On replicate.com a single image takes 1.5s at a price of 1000 images per $1. Would be interesting to see how quick it is on ComfyUI Cloud.

Overall, running generative models locally on Macs seems like a very poor time investment.
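
Putting the figures above into numbers (399 s locally vs. 1.5 s hosted, at 1000 images per $1), here is a small sketch of the arithmetic. The helper names are made up, and it ignores queueing, upload time, and any minimum top-up a provider may require.

```python
def cost_per_image(images_per_dollar: float) -> float:
    """Dollars per image at a flat rate."""
    return 1.0 / images_per_dollar


def images_for_budget(dollars: float, images_per_dollar: float) -> int:
    """How many whole images a fixed budget buys at a flat rate."""
    return int(dollars * images_per_dollar)


def speedup(local_seconds: float, hosted_seconds: float) -> float:
    """How many times faster the hosted run is than the local one."""
    return local_seconds / hosted_seconds


print(cost_per_image(1000))         # 0.001
print(images_for_budget(10, 1000))  # 10000
print(speedup(399, 1.5))            # 266.0
```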

reply
altmanaltman
6 hours ago
[-]
If you don't know anything about how these models are run, ComfyUI's macOS version is probably the easiest to use. There is already a Z-Image workflow that you can get, and ComfyUI will fetch all the models you need and get them working together. You can expect decent speed.
reply
thih9
5 hours ago
[-]
I'm fine with the quick start steps and I prefer CLI to GUI anyway. But if I try it and find it too complex, I now know what to try instead - thanks.

I'm still curious whether this would run on a MacBook and how long it would take to generate an image. What machine are you using?

reply
egeozcan
5 hours ago
[-]
Have a 48GB M4 Pro and every inference step takes like 10 seconds on a 1024x1024 image. so six steps and you need a minute. Not terrible, not great.
reply
Eisenstein
1 hour ago
[-]
Try koboldcpp with the kcppt config file. The easiest way by far.

Download the release here

* https://github.com/LostRuins/koboldcpp/releases/tag/v1.103

Download the config file here

* https://huggingface.co/koboldcpp/kcppt/resolve/main/z-image-...

Set +x to the koboldcpp executable and launch it, select 'Load config' and point at the config file, then hit 'launch'.

Wait until the model weights are downloaded and launched, then open a browser and go to:

* http://localhost:5001/sdui

EDIT: This will work for Linux, Windows and Mac
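
Since the first launch has to download the weights, the UI may not respond for a while. A minimal sketch of polling the address above until it answers; the helper names, attempt count, and delay are arbitrary assumptions, not anything koboldcpp provides.

```python
import time
import urllib.error
import urllib.request

SDUI_URL = "http://localhost:5001/sdui"  # address from the steps above


def is_up(url: str, timeout: float = 2.0) -> bool:
    """True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False  # not listening yet, or still starting up


def wait_until_up(url=SDUI_URL, attempts=60, delay=5.0, probe=is_up, sleep=time.sleep) -> bool:
    """Probe the URL up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

Calling `wait_until_up()` after launching koboldcpp returns `True` once the page is reachable, or `False` if it never comes up within the attempts given.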

reply
zkmon
8 hours ago
[-]
Just want to learn - who actually needs or buys up generated images?
reply
wongarsu
7 hours ago
[-]
I follow an author who publishes online on places like Scribblehub and has a modestly successful Patreon. Over the years he has spent probably tens of thousands of dollars on commissioned art for his stories, and he's still spending heavily on that. But as image models have gotten better this has increasingly been supplemented with AI-images for things that are worth a couple dollars to get right with AI, but not a couple hundred to get a human artist to do them

Roughly speaking the art seems to have three main functions:

1. promote the story to outsiders: this only works with human-made art

2. enhance the story for existing readers: AI helps here, but is contentious

3. motivate and inspire the author: works great with AI. The ease of exploration and pseudo-random permutations in the results are very useful properties here that you don't get from regular art

By now the author even has an agreement with an artist he frequently commissions that he can use his style in AI art in return for a small "royalty" payment for every such image that gets published in one of his stories. A solution driven both by the author's conscience and by the demands of the readers

reply
nine_k
8 hours ago
[-]
Some ideas for your consideration:

- Illustrating blog posts, articles, etc.

- A creativity tool for kids (and adults; consider memes).

- Generating ads. (Consider artisan production and specialized venues.)

- Generating assets for games and similar, such as backdrops and textures.

Like any tool, it takes certain skill to use, and the ability to understand the results.

reply
zkmon
7 hours ago
[-]
Except for gaming, that doesn't sound like a huge market worthy of pouring millions into training these high-quality models. And there is a lot of competition too. I suspect there are some other deep-pocketed customers for these images. Probably animations? movies? TV ads?
reply
nine_k
2 hours ago
[-]
I'd say that picture ad market alone would suffice.

OTOH these are open-weight models released to the public. We don't get to use more advanced models for free; the free models are likely a byproduct of producing more advanced models anyway. These models can be the freemium tier, or gateway drugs, or a way of torpedoing the competition, if you don't want to believe in the goodwill of their producers.

reply
pixl97
6 hours ago
[-]
Propaganda?
reply
Youden
5 hours ago
[-]
During the holiday season I've been noticing AI-generated assets on tons of meatspace ads and cheap, themed products.
reply
leobg
8 hours ago
[-]
Dying businesses like newspapers and local banks, who use it to save the money they used to spend on shutterstock images? That’s where I’ve seen it at least. Replacing one useless filler with another.
reply
Copenjin
9 hours ago
[-]
Very good, not always perfect with text or with following exactly the prompt, but 6B so... impressive.
reply
accrual
58 minutes ago
[-]
I have had good textual results with the Turbo version so far. Sometimes it drops a letter in the output, but most of the time it adheres well to both the text requested and the style.

I tried this prompt on my username: "A painted UFO abducts the graffiti text "Accrual" painted on the side of a rusty bridge."

Results: https://imgur.com/a/z-image-test-hL1ACLd

reply
bilsbie
4 hours ago
[-]
What kind of rig is required to run this?
reply
CamperBob2
3 hours ago
[-]
The simple Python example program runs great on almost any GPU with 8 GB or more memory. Takes about 1.5 seconds per iteration on a 4090.

The bang:buck ratio of Z-Image Turbo is just bonkers.

reply
gatane
56 minutes ago
[-]
Dude, please give money to artists instead of using genAI
reply
ForOldHack
2 hours ago
[-]
It would be more useful to have some standards on what one could expect in terms of hardware requirements and expected performance.
reply
pawelduda
9 hours ago
[-]
Did anyone test it on 5090? I saw some 30xx reports and it seemed very fast
reply
egeres
3 hours ago
[-]
Incredibly fast. On my 5090 with CUDA 13 (& the latest diffusers, xformers, transformers, etc...), 9 sampling steps and the "Tongyi-MAI/Z-Image-Turbo" model I get:

- 1.5s to generate an image at 512x512

- 3.5s to generate an image at 1024x1024

- 26.s to generate an image at 2048x2048

It uses almost all of the 32 GB of VRAM and maxes out GPU usage. I'm using the script from the HF post: https://huggingface.co/Tongyi-MAI/Z-Image-Turbo
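
To compare those timings across resolutions, it can help to normalize them to megapixels per second. A small sketch using only the 512x512 and 1024x1024 figures above; the function name is made up for illustration.

```python
def megapixels_per_second(width: int, height: int, seconds: float) -> float:
    """Generated pixels per second of wall-clock time, in megapixels."""
    return (width * height) / 1_000_000 / seconds


small = megapixels_per_second(512, 512, 1.5)
large = megapixels_per_second(1024, 1024, 3.5)
print(f"512x512:   {small:.3f} MP/s")
print(f"1024x1024: {large:.3f} MP/s")
```

By this measure the 1024x1024 run gets more pixels out per second than the 512x512 one, which may just mean fixed per-image overhead matters more at small sizes.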

reply
Wowfunhappy
7 hours ago
[-]
Even on my 4080 it's extremely fast, it takes ~15 seconds per image.
reply
accrual
2 hours ago
[-]
Did you use PyTorch Native or Diffusers Inference? I couldn't get the former working yet so I used Diffusers, but it's terribly slow on my 4080 (4 min/image). Trying again with PyTorch now, seems like Diffusers is expected to be slow.
reply
Wowfunhappy
2 hours ago
[-]
Uh, not sure? I downloaded the portable build of ComfyUI and ran the CUDA-specific batch file it comes with.

(I'm not used to using Windows and I don't know how to do anything complicated on that OS. Unfortunately, the computer with the big GPU also runs Windows.)

reply
accrual
1 hour ago
[-]
Haha, I know how it goes. Thanks, I'll give that a try!

Update: works great and much faster via ComfyUI + the provided workflow file.

reply
cubefox
4 hours ago
[-]
I'm particularly impressed by the fact that they seem to aim for photorealism rather than the semi-realistic AI-look that is common in many text-to-image models.
reply
CamperBob2
3 hours ago
[-]
Exactly, and at the same time, if you want an affected style, all you have to do is ask for it.
reply
idontwantthis
7 hours ago
[-]
Does it run on apple silicon?
reply
sheepscreek
7 hours ago
[-]
Apparently - https://github.com/ivanfioravanti/z-image-mps

Supports MPS (Metal Performance Shaders). Using something that skips Python entirely along with a mlx or gguf converted model file (if one exists) will likely be even faster.

reply
opensandwich
3 hours ago
[-]
(Not tested) though apparently it already exists: https://github.com/leejet/stable-diffusion.cpp/wiki/How-to-U...
reply
iamflimflam1
7 hours ago
[-]
It's working for me - it does max out my 64GB though.
reply
sheepscreek
7 hours ago
[-]
Wow. I always forget that, unlike autoregressive models, diffusion models are heavier on resources (for the same number of parameters).
reply
BoredPositron
6 hours ago
[-]
I wish they had used the WAN VAE.
reply