Thoughts
- It's fast (~3 seconds on my RTX 4090)
- Surprisingly capable of maintaining image integrity even at high resolutions (1536x1024, sometimes 2048x2048)
- The prompt adherence is impressive for a 6B parameter model
Some tests (2 / 4 passed):
Personally I find it works better as a refiner model downstream of Qwen-Image 20B, which has significantly better prompt understanding but an unnatural "smoothness" to its generated images.
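Roughly what such a base-plus-refiner chain could look like with diffusers' auto pipelines; the model IDs, image-to-image support for Z-Image-Turbo, and the low refine strength are assumptions rather than a tested recipe:

```python
# Rough sketch of a base + refiner chain, NOT a tested recipe: it assumes both
# models load through diffusers' auto pipelines and that Z-Image-Turbo accepts
# image-to-image input. Model IDs, step counts, and strength are guesses.
import torch
from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image

prompt = "a porcupine-pinecone hybrid creature on a mossy forest floor"

# Stage 1: Qwen-Image for its stronger prompt understanding / composition.
base = AutoPipelineForText2Image.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")
draft = base(prompt=prompt, num_inference_steps=30).images[0]

# Stage 2: Z-Image Turbo as a light refiner to replace the "smooth" look
# with more natural texture while keeping the composition.
refiner = AutoPipelineForImage2Image.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16
).to("cuda")
final = refiner(
    prompt=prompt,
    image=draft,
    strength=0.3,            # low strength: keep layout, add detail
    num_inference_steps=8,
).images[0]
final.save("refined.png")
```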
It is amazing how far behind Apple Silicon is when it comes to running non-language models.
Using the reference code from Z-image on my M1 ultra, it takes 8 seconds per step. Over a minute for the default of 9 steps.
Apple Silicon is comparable in memory bandwidth to mid-range GPUs, but it’s light years behind on compute.
Is that the only factor, though? I wonder if PyTorch is lacking optimization for the MPS backend.
https://github.com/Tongyi-MAI/Z-Image
Screenshot of site with network tools open to indicate link
EDIT: It's possible that this issue might have existed in an old cached version. I'll purge the cache just to make sure.
EDIT: Fixed! Thanks soontimes and rprwhite!
https://fal.ai/models/fal-ai/z-image/turbo/api
Couple that with a LoRA and you can generate completely personalized images in about 3 seconds.
The speed alone is a big factor, but if you put the model side by side with Seedream, Nano Banana, and other models, it's definitely in the top 5, and that's a killer combo imho.
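For the curious, attaching a personalization LoRA to the turbo checkpoint could look roughly like this with diffusers; whether the Z-Image pipeline supports the standard LoRA loader is an assumption, and the file name and trigger token are placeholders:

```python
# Sketch only: assumes the Z-Image Turbo pipeline supports diffusers' standard
# LoRA loader. The LoRA file and trigger token below are placeholders.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # assumption: the repo may ship custom pipeline code
).to("cuda")
pipe.load_lora_weights("./my_subject_lora.safetensors")  # hypothetical file

image = pipe(
    prompt="portrait photo of sks person hiking at sunrise",  # "sks" = example trigger word
    num_inference_steps=8,
).images[0]
image.save("personalized.png")
```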
Is Flux 1/2/Kontext left in the dust by the Z Image and Qwen combo?
IMO HiDream had the best quality OSS generations, Flux Schnell is decent as well. Will try out Z-Image soon.
For ref, the porcupine-cone creature that ZiT couldn't handle by itself in my aforementioned test was easily handled with a Qwen20b + ZiT refiner workflow, and even with two separate models it STILL runs faster than Flux 2 [dev].
Once the Z-Image base model comes out and some real tuning can be done, I think it has a chance of taking over the role SDXL currently fills.
Flux has largely been met with a collective yawn.
The only thing Flux had going for it was photorealism and prompt adherence. But the skin and jaws of the humans it generated looked weird, it was difficult to fine tune, and the licensing was weird. Furthermore, Flux never had good aesthetics. It always felt plain.
Nobody doing anime or cartoons used Flux. SDXL continues to shine here. People doing photoreal kept using Midjourney.
It's incredibly clear who the devs assume the target market is.
Z-Image is getting traction because it fits on their tiny GPUs and does porn, sure, but even with more compute Flux 2 [dev] has no place.
Weak world knowledge, worse licensing, and its post-training for JSON prompts undermines the #1 benefit of having a larger LLM backbone.
LLMs already understand JSON, so additional training for JSON feels like a cheaper way to juice prompt adherence than more robust post-training.
And honestly even "full fat" Flux 2 has no great spot: Nano Banana Pro is better if you need strong editing, Seedream 4.5 is better if you need strong generation.
It's not clear to me what you mean either, especially since female models are overwhelmingly more popular in general[1].
[1]: "Female models make up about 70% of the modeling industry workforce worldwide" https://zipdo.co/modeling-industry-statistics/
Ok so a ~2:1 ratio. Those examples have a 25:1 ratio.
That said I do find the focus on “safety” tiring.
With simplistic prompts, you quickly conclude that the small model size is the only limitation. Once you realize how good it is with detailed prompts, though, you find that you can get a lot more diversity out of it than you initially thought you could.
Absolute game-changer of a model IMO. It is competitive with Nano Banana Pro in some respects, and that's saying something.
overall it's fun and impressive. decent results using LoRA. you can achieve good-looking results with as few as 8 inference steps, which takes 15-20 seconds on a Strix Halo. i also created a llama.cpp inference custom node for prompt enhancement, which has been helping with overall output quality.
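The custom node itself isn't shown here, but the general idea of llama.cpp-based prompt enhancement is simple: send the short prompt to a local llama-server and use its reply as the image prompt. A minimal sketch against the OpenAI-compatible chat endpoint (the system prompt and port are illustrative, not the actual node):

```python
# Minimal sketch of prompt enhancement against a local llama.cpp server
# (e.g. `llama-server -m model.gguf --port 8080`), via its OpenAI-compatible
# chat endpoint. The system prompt is illustrative, not the actual custom node.
import requests

def enhance_prompt(short_prompt: str) -> str:
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local",  # llama-server serves whatever model it was started with
            "messages": [
                {
                    "role": "system",
                    "content": "Expand the user's image prompt into one detailed, "
                               "comma-separated description. Reply with the prompt only.",
                },
                {"role": "user", "content": short_prompt},
            ],
            "temperature": 0.7,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

print(enhance_prompt("a lighthouse in a storm"))
```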
Local AI will eventually be booming. It'll be more configurable, adaptable, hackable. "Free". And private.
Crude APIs can only get you so far.
I'm in favor of intelligent models like Nano Banana over ComfyUI messes (the future is the model, not the node graph).
I still think we need the ability to inject control layers and have full access to the model, because we lose too much utility by not having it.
I think we'll eventually get Nano Banana Pro smarts slimmed down and running on a local machine.
With how expensive RAM currently is, I doubt it.
For various reasons, I doubt there are any large scale SaaS-style providers operating this in production today.
If I say “A man”, it’s fine. A black man, no problem. It’s when I add context and instructions that it just seems to want to go with some Chinese man. Which is fine, but I would like to see more variety in the people it’s trained on, to create more diverse images. For non-people it’s amazingly good.
[1]: https://github.com/Tongyi-MAI/Z-Image?tab=readme-ov-file#-qu...
On replicate.com a single image takes 1.5s at a price of 1000 images per $1. Would be interesting to see how quick it is on ComfyUI Cloud.
Overall, running generative models locally on Macs seems like a very poor investment of time.
I'm still curious whether this would run on a MacBook and how long would it take to generate an image. What machine are you using?
Download the release here
* https://github.com/LostRuins/koboldcpp/releases/tag/v1.103
Download the config file here
* https://huggingface.co/koboldcpp/kcppt/resolve/main/z-image-...
Set +x on the koboldcpp executable and launch it, select 'Load config' and point it at the config file, then hit 'Launch'.
Wait until the model weights are downloaded and loaded, then open a browser and go to:
* http://localhost:5001/sdui
EDIT: This will work for Linux, Windows and Mac
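If you'd rather script it than use the /sdui page, KoboldCpp also exposes (as far as I know) an A1111-style image API on the same port; a rough sketch, with field names following the A1111 convention, so double-check against the KoboldCpp docs:

```python
# Rough sketch: as far as I know KoboldCpp also exposes an A1111-style image API
# alongside the /sdui page. Field names follow the A1111 convention; double-check
# the KoboldCpp docs if the endpoint or payload differs.
import base64
import requests

payload = {
    "prompt": "a watercolor fox in a snowy forest",
    "width": 1024,
    "height": 1024,
    "steps": 8,
}
resp = requests.post("http://localhost:5001/sdapi/v1/txt2img", json=payload, timeout=300)
resp.raise_for_status()
with open("out.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["images"][0]))
```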
Roughly speaking the art seems to have three main functions:
1. promote the story to outsiders: this only works with human-made art
2. enhance the story for existing readers: AI helps here, but is contentious
3. motivate and inspire the author: works great with AI. The ease of exploration and pseudo-random permutations in the results are very useful properties here that you don't get from regular art
By now the author even has an agreement with an artist he frequently commissions: he can use the artist's style in AI art in return for a small "royalty" payment for every such image that gets published in one of his stories. A solution driven both by the author's conscience and by the demands of the readers.
- Illustrating blog posts, articles, etc.
- A creativity tool for kids (and adults; consider memes).
- Generating ads. (Consider artisan production and specialized venues.)
- Generating assets for games and similar, such as backdrops and textures.
Like any tool, it takes a certain skill to use, and the ability to understand the results.
OTOH these are open-weight models released to the public. We don't get to use more advanced models for free; the free models are likely a byproduct of producing more advanced models anyway. These models can be the freemium tier, or gateway drugs, or a way of torpedoing the competition, if you don't want to believe in the goodwill of their producers.
I tried this prompt on my username: "A painted UFO abducts the graffiti text "Accrual" painted on the side of a rusty bridge."
The bang:buck ratio of Z-Image Turbo is just bonkers.
- 1.5s to generate an image at 512x512
- 3.5s to generate an image at 1024x1024
- 26s to generate an image at 2048x2048
It uses almost all of the 32 GB of VRAM and nearly all of the GPU. I'm using the script from the HF post: https://huggingface.co/Tongyi-MAI/Z-Image-Turbo
(I'm not used to using Windows and I don't know how to do anything complicated on that OS. Unfortunately, the computer with the big GPU also runs Windows.)
Update: works great and much faster via ComfyUI + the provided workflow file.
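For reference, the model-card script boils down to something like the following diffusers-style call; the exact pipeline class and arguments on the HF page may differ slightly, so treat this as a sketch:

```python
# Approximation of the model-card script: a diffusers-style pipeline in bfloat16
# on CUDA. The exact pipeline class and arguments on the HF page may differ.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # assumption: custom pipeline code in the repo
).to("cuda")

image = pipe(
    prompt="a rusty bridge at dawn, graffiti on the pylons",
    num_inference_steps=9,    # the default step count mentioned upthread
    height=1024,
    width=1024,
).images[0]
image.save("z_image_turbo.png")
```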
It supports MPS (Metal Performance Shaders). Using something that skips Python entirely, along with an MLX- or GGUF-converted model file (if one exists), would likely be even faster.
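A minimal sketch of pointing the same kind of diffusers call at MPS instead of CUDA; whether every op in the Z-Image pipeline has an MPS kernel is an assumption:

```python
# Sketch of running the pipeline on Apple Silicon via MPS. Whether every op in
# the Z-Image pipeline has an MPS kernel is an assumption; if some are missing,
# launch with PYTORCH_ENABLE_MPS_FALLBACK=1 so they fall back to the CPU.
import torch
from diffusers import DiffusionPipeline

device = "mps" if torch.backends.mps.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.float16,   # float16 is the safer dtype on MPS
    trust_remote_code=True,
).to(device)

image = pipe(prompt="a tabby cat reading a newspaper", num_inference_steps=9).images[0]
image.save("mps_test.png")
```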