For reference, Z-Image Turbo scored 4 out of 15 points on GenAI Showdown. I’m aware that doesn’t sound like much, but given that one of the largest models, Flux.2 (32b), only managed to outscore ZiT (a 6b model) by a single point and is significantly heavier-weight, that’s still damn impressive.
Local model comparisons only:
If you get a chance, could you list your mobile device specs? That way I can at least try it on Browserstack and see if I can figure out a fix.
Update: Huh, now it's working
An older thread on this has a lot of comments: https://news.ycombinator.com/item?id=46046916
And small models are also much easier to fine tune than large ones.
I was trying to get it to create an image of a tiger jumping on a pogo stick, which is way beyond its capabilities; it turns out it cannot even create an image of a pogo stick in isolation.
Z-Image / Flux 2 / Hidream / Omnigen2 / Qwen Samples:
This is where smaller models are just going to be more constrained and will require additional prompting to coax out the physical description of a "pogo stick". I had similar issues when generating Alexander the Great leading a charge on a hippity-hop / space hopper.
Because in theory, I would say that knowledge is something that doesn't have to be baked into the model; it could be added via reference images, if the model is capable enough to reason about them.
Tiger on pogo stick: https://i.imgur.com/lnGfbjy.jpeg
Dunno what this is, but it's not a pogo stick: https://i.imgur.com/OmMiLzQ.jpeg
Nano Banana Pro FTW: https://i.imgur.com/6B7VBR9.jpeg
I wonder what kinds of use cases actually count as "latency-critical production use cases"?
Is that actually true? I'm not sure it's fair to compare lossless compression ratios of text (abstract, noiseless) to images and video that innately have random sampling noise. If you look at humanly indistinguishable compression, I'd expect that you'd see far better compression ratios for lossy image and video compression than lossless text.
1: Although it looks like the current Hutter competition leader is closer to 9:1, which I didn't realize. Pretty awesome by historical standards.
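The lossless-text side of this comparison is easy to sanity-check locally. A minimal sketch using Python's stdlib `zlib` (note this is only illustrative: a general-purpose compressor like zlib lands well below the specialized context-mixing entries in the Hutter competition, and the repetitive sample text here compresses far better than natural prose would):

```python
import zlib

# Compress a block of highly repetitive English text with zlib
# and report the lossless compression ratio. Real corpora like
# enwik9 compress much less well than repeated sentences do.
text = ("The quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")
compressed = zlib.compress(text, level=9)
ratio = len(text) / len(compressed)
print(f"original: {len(text)} bytes, "
      f"compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.1f}:1")
```

Lossy image/video codecs aren't directly comparable this way, since their "ratio" depends on an acceptable-distortion threshold rather than exact reconstruction, which is the point of the comment above.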
I will wait for Invoke to add Flux 2 Klein.
What will be really interesting to me is the full Z-Image release; if that goes the way it's looking, it'll be a natural-language SDXL 2.0, which seems to be what people really want.
Releasing the Turbo distilled finetune months ago was a genius move, really. It hurt the Flux and Qwen releases on the implication of a possible future release alone.
If this was intentional, I can’t think of the last time I saw such shrewd marketing.
I think that information still hasn't gotten through to most users.
"Notably, the resulting distilled model not only matches the original multi-step teacher but even surpasses it in terms of photorealism and visual impact."
"It achieves 8-step inference that is not only indistinguishable from the 100-step teacher but frequently surpasses it in perceived quality and aesthetic appeal"
However, I wonder what has been the source of the delay with its release and if there were problems with that approach.
Your frame of it is speculative, i.e. it is forthcoming. Theirs is present tense. Could I trouble you to give us plebes some more context? :)
e.g., parsed as-is, and given the general confusion if you're unfamiliar, it is unclear how one can observe "the way it is looking", especially if Turbo was released months ago and there is some other model that is unreleased. I chose to bother you because the other's comment was less focused on lab-on-lab strategy.
[1] https://tongyi-mai.github.io/Z-Image-blog/
[2] https://www.reddit.com/r/StableDiffusion/comments/1p9uu69/no...
Z-Image got popular because people stuck with 12GB video cards could still use it, and hell - probably train on it, at least once the base version comes out. I think most people disparaging Flux 2 never tried it, since they wouldn't want to deal with how slowly it would run on their system, if they even realize they could run it.
It’ll be interesting to see how the NSFW catering plays out for the Chinese labs. I was joking a couple months ago to someone that Seedream 4’s talents at undressing were an attempt to sow discord, and it was interesting that it flew under the radar.
Post-Grok going full gooner pedo, I wonder if Grok will take the heat alone moving forward.
ZIT is not far short of revolutionary. It is kind of surreal to contemplate how much high-quality imagery can be extracted from a model that fits on a single DVD and runs extremely quickly on consumer-grade GPUs.
It is, however, small and quick.
Look at the images I posted elsewhere in this section. They are crappy excuses for pogo sticks, but they absolutely do NOT look like they came from a cell phone.
Also see vunderba's page at https://genai-showdown.specr.net/ . Even when Z-Image Turbo fails a test, it still looks great most of the time.
Edit re: your other comment -- don't make the mistake of confusing censorship with lack of training data. Z-Image will try to render whatever you ask for, but at the end of the day it's a very small model that will fail once you start asking for things it simply wasn't trained on. They didn't train it with much NSFW material, so it has some rather... unorthodox anatomical ideas.
However, I’m already expecting the blowback when a Z-Image release doesn’t wow people the way the Turbo finetune does. SDXL hasn’t been out two years yet; it seems like a decade.
We’ll see. I’m hopeful that Z-Image works as expected and sets a new high-water mark; I’m just not sure it does so right out of the gate.
Almost afraid to ask, but anytime grok or x or musk comes up I am never sure if there is some reality based thing, or some “I just need to hate this” thing. Sometimes they’re the same thing, other times they aren’t.
I can guess that, because Grok likely uses WAN, someone wrote some gross prompts and then pretended this is an issue unique to Grok, for effect?
Personally, I go between “I don’t care at all” and “well, it’s not ideal” on AI generations. It’s already too late, but the barrier to entry is a lot lower than it was.
But I’m applying a good faith argument where GP does not seem to have intended one.
You may note I am no shrinking violet, nor do I lack perspective, as evidenced by my notes on Seedream. And fortuitously, I mentioned it before being dismissed as bad faith: I could not have foreseen needing to call it out as credentials until now.
I don't think it's kind to accuse others of bad faith, as evidenced by my not passing judgement on the description given by the person you are replying to.
I do admit it made my stomach churn a little bit to see how quickly people will other. Not on you, I'm sure I've done this too. It's stark when you're on the other side of it.
Good competition breeds innovation.