Qwen-Image-2.0: Professional infographics, exquisite photorealism (qwen.ai)
101 points | 2 hours ago | 14 comments
tianqi
44 minutes ago
[-]
I've seen many comments describing the "horse riding man" example as extremely bizarre (which it actually is), so I'd like to provide some background context here. The "horse riding man" is a Chinese internet meme originating from an entertainment awards ceremony at which the renowned host Tsai Kang-yong wore an elaborate outfit featuring a horse riding on his back [1]. At the time, he was embroiled in a rumor about an unpublicized homosexual partner, whose name sounded like "Ma Qi Ren", which coincidentally translates to "horse riding man" in Mandarin. This incident spread widely across the Chinese internet and turned into a meme. So their use of "horse riding man" as an example isn't entirely nonsensical, though the image per se is undeniably bizarre and carries an unsettling vibe.

[1] The photo of the outfit: https://share.google/mHJbchlsTNJ771yBa

reply
badhorseman
21 minutes ago
[-]
Why not ask for simply a man, or even a Han man, given the race of Tsai Kang-yong? Why a white man, and why a man wearing medieval clothing? Give your head a wobble.
reply
raincole
1 hour ago
[-]
It's crazy to think there was a fleeting sliver of time during which Midjourney felt like the pinnacle of image generation.
reply
Mashimo
1 hour ago
[-]
Whatever happened to Midjourney?
reply
wongarsu
1 hour ago
[-]
They have image and video models that are nowhere near SOTA on prompt adherence or image editing, but pretty good on the artistic side. They lean into features like reference images (so objects or characters keep a consistent look), biasing the model towards your style preferences, and moodboards for generating a consistent style.
reply
raincole
1 hour ago
[-]
Not much, while everything was happening at OpenAI, Google, and Chinese companies. And that's the problem.
reply
KeplerBoy
1 hour ago
[-]
How is it a problem? There simply doesn't seem to be a moat or secret sauce. Who cares which of these models is SOTA? In two months there will be a new model.
reply
waldarbeiter
50 minutes ago
[-]
There does seem to be a moat: infrastructure/GPUs and talent. The best models right now come from companies with considerable resources/funding.
reply
fguerraz
1 hour ago
[-]
I found the horse revenge-porn image at the end quite disturbing.
reply
embedding-shape
56 minutes ago
[-]
I think they call it "horse riding a human", which could have been taken in two very different directions, and the direction the model seems to have taken was the less bad of the two.
reply
wongarsu
48 minutes ago
[-]
At first I thought it was a clever prompt, because you see which direction the model takes it, and whether it "corrects" it to the more common "human riding a horse", similar to the full-wine-glass test.

But if you translate the actual prompt, the term "riding" doesn't even appear. The prompt describes the exact thing you see in excruciating detail.

"... A muscular, robust adult brown horse standing proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man ... and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat ... his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight ..."

reply
blitzar
31 minutes ago
[-]
Won't someone think of the horses?
reply
inanothertime
1 hour ago
[-]
I recently tried out LMStudio on Linux for local models. So easy to use!

What Linux tools are you guys using for image generation models like Qwen's diffusion models, since LMStudio only supports text gen?

reply
PaulKeeble
6 minutes ago
[-]
Ollama is working on adding image generation, but it's not here yet. We really do need something that can run a variety of image models.
reply
embedding-shape
58 minutes ago
[-]
Everything keeps changing so quickly that I basically have my own Python HTTP server with a unified JSON interface, which routes to any of the impls/*.py files for the actual generation; I have one of those per implementation/architecture. Mostly using `diffusers` for the inference, which isn't the fastest, but it tends to get new model architectures much sooner than everyone else.
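
A minimal sketch of that pattern (the model ID, JSON fields, and single-model setup here are illustrative placeholders, not my actual impls/ layout):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    import torch
    from diffusers import DiffusionPipeline

    # One pipeline standing in for the per-architecture impls/*.py modules
    pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    class GenHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Unified JSON interface: {"prompt": ..., "steps": ...}
            length = int(self.headers["Content-Length"])
            req = json.loads(self.rfile.read(length))
            image = pipe(
                req["prompt"], num_inference_steps=req.get("steps", 30)
            ).images[0]
            image.save("out.png")  # a real server would stream the PNG bytes back
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'{"file": "out.png"}')

    HTTPServer(("127.0.0.1", 8000), GenHandler).serve_forever()

The nice part of keeping the JSON interface stable is that swapping in a new architecture only means adding another impls/*.py file behind the router.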
reply
ilaksh
1 hour ago
[-]
I have my own MIT licensed framework/UI: https://github.com/runvnc/mindroot. With Nano Banana via runvnc/googleimageedit
reply
guai888
1 hour ago
[-]
ComfyUI is the best for Stable Diffusion.
reply
sandbach
59 minutes ago
[-]
The Chinese vertical typography is sadly a bit off. If punctuation marks are used at all, they should be the characters specifically designed for vertical text, like ︒(U+FE12 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP).
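
A quick way to check the vertical presentation forms (Python's unicodedata ships the character names):

    import unicodedata
    print(unicodedata.name("\uFE12"))
    # PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP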
reply
dsrtslnd23
2 hours ago
[-]
Unfortunately no open weights, it seems.
reply
cocodill
1 hour ago
[-]
interesting riding application picture
reply
rwmj
1 hour ago
[-]
"Guy being humped by a horse" wouldn't have been my first choice for demoing the capabilities of the model, but each to their own I guess.
reply
viraptor
11 minutes ago
[-]
It looks like a marketing move. It's a good quality, detailed picture. It's going to get shared a lot. I would assume they knew exactly what they were doing. Nothing like a bit of controversy for extra clicks.
reply
skerit
1 hour ago
[-]
> Qwen-Image-2.0 not only accurately models the “riding” action but also meticulously renders the horse’s musculature and hair

> https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwe...

What the actual fuck

reply
wongarsu
59 minutes ago
[-]
For reference, below is the prompt translated (with my highlighting of the part that matters). They did very much ask for this version of "horse riding a man", not the "horse sitting upright on a crawling human" version

---

A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky.

Mid-ground, eye-level composition: A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat, his short, messy dark brown hair plastered to his forehead, his thick beard slightly damp; he wears a badly worn, grey-green medieval-style robe, the fabric torn and stained with mud in several places, a thick hemp rope tied around his waist, and scratched ankle-high leather boots; his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight.

The background is a range of undulating grey-blue mountains, their outlines stark, their peaks hidden beneath a low-hanging, leaden-grey, cloudy sky. The thick clouds diffuse a soft, diffused light, which pours down naturally from the left front at a 45-degree angle, casting clear and voluminous shadows on the horse's belly, the back of the man's hands, and the cracked ground.

The overall color scheme is strictly controlled within the earth tones: the horsehair is warm brown, the robe is a gradient of gray-green-brown, the soil is a mixture of ochre, dry yellow earth, and charcoal gray, the dust is light brownish-gray, and the sky is a transition from matte lead gray to cool gray with a faint glow at the bottom of the clouds.

The image has a realistic, high-definition photographic quality, with extremely fine textures—you can see the sweat on the horse's neck, the wear and tear on the robe's warp and weft threads, the skin pores and stubble, the edges of the cracked soil, and the dust particles. The atmosphere is tense, primitive, and full of suffocating tension from a struggle of biological forces.

reply
wiether
1 hour ago
[-]
I use gen-AI to produce images daily, but honestly the infographics are 99% terrible.

LinkedIn is filled with them now.

reply
smcleod
1 hour ago
[-]
To be fair it hasn't made LinkedIn any worse than it already was.
reply
nurettin
1 hour ago
[-]
To be fair, it is hard to make LinkedIn any worse.
reply
viraptor
30 minutes ago
[-]
Infographics are only as bad as the author allows, though. There are few people who could make or even describe a good infographic, so that's what we see in the results too.
reply
usefulposter
1 hour ago
[-]
Correct.

Much like the pointless ASCII diagrams in GitHub readmes (big rectangle with bullet points flows to another...), the diagrams are cognitive slurry.

See Gas Town for non-Qwen examples of how bad it can get:

https://news.ycombinator.com/item?id=46746045

(Not commenting on the other results of this model outside of diagramming.)

reply
viraptor
27 minutes ago
[-]
> cognitive slurry

Thank you for this phrase. I don't think bad diagrams are limited to AI in any way, and this perfectly describes every "this didn't make things any clearer" case.

reply
goga-piven
1 hour ago
[-]
Why is the only image featuring non-Asian men the one under the horse?
reply
andruby
1 hour ago
[-]
Is the problem the position/horse, or that Qwen mostly shows Asian people?

Do western AI models mostly default to white people?

reply
goga-piven
54 minutes ago
[-]
Well, what if some western models showcased white people in all the good-looking images and the only embarrassing image featured Asian people? Wouldn't that be considered racism?
reply
wtcactus
57 minutes ago
[-]
> Do western AI models mostly default to white people?

No, they mostly default to black people even in historical contexts where they are completely out of place, actually. [1]

"Google paused its AI image-generator after Gemini depicted America's founding fathers and Nazi soldiers as Black. The images went viral, embarrassing Google."

[1] https://www.npr.org/2024/03/18/1239107313/google-races-to-fi...

reply
viraptor
52 minutes ago
[-]
> they mostly default to black people

You're referring to a case of one version of one model. That's not "mostly" or "default to".

reply
raincole
35 minutes ago
[-]
Out of curiosity I just tried this prompt:

> Generate a photo of the founding fathers of a future, non-existing country. Five people in total.

with Nano Banana Pro (the SOTA). I tried the same prompt 5 times, and every time black people were the majority. So yeah, I think the parent comment is not that far off.

reply
viraptor
20 minutes ago
[-]
Luck? One black person and three South Asians in total for me.

But for an out-of-context imaginary future... why would you choose non-black people? There's about the same reason to go with any random look.

reply
wtcactus
17 minutes ago
[-]
So, the answer to the question "Do western AI models mostly default to white people?" is clearly a resounding: no, they don't.
reply
viraptor
13 minutes ago
[-]
No. But they don't default to black people either, or anyone specifically. So it seems we got to a nice balance.
reply
KingMob
23 minutes ago
[-]
I mean it's still far off, because they said "historical context", i.e., the actual past, but your prompt is about a hypothetical future.

(I suspect you tried a prompt about the original founding fathers, and found it didn't make that mistake any more.)

reply
wtcactus
18 minutes ago
[-]
The question was "Do western AI models mostly default to white people?" and the answer is no, they don't; they mostly default to black people. And some examples are so egregious that, even in historical settings, they replace white people with black people.
reply
z3dd
1 hour ago
[-]
They explicitly called for that in the prompt.
reply
goga-piven
52 minutes ago
[-]
Exactly. Why did they choose this prompt with a white person and not an Asian person, as in all the other examples?
reply
wtcactus
55 minutes ago
[-]
But why? That image actually puzzled me. Does it have some background context? Some historical legend or something of the like?
reply
joeycodes
42 minutes ago
[-]
It is Lunar New Year season right now, and 2026 is the Year of the Horse, so there is celebratory horse imagery everywhere in many Asian countries; this image could be interpreted as East trampling West. I have no way to know the intention of the person at Qwen who wrote this, but you can form your own conclusions from the prompt:

A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male...

reply
wtcactus
35 minutes ago
[-]
So, it’s just racism, pure and simple.
reply
badhorseman
30 minutes ago
[-]
Racism, yes; simple, no. https://live2makan.com/2024/08/07/treasures-statue-of-horse-...

symbolizing the triumph of the imperial race over the inferior but troublesome barbarians.

reply
singularfutur
1 hour ago
[-]
Another closed model dressed up as "coming soon" open source. The pattern is obvious: generate hype with a polished demo, lock the weights, then quietly move on. Real open source doesn't need a press release countdown.
reply
cubefox
1 hour ago
[-]
The complex prompt-following ability and editing are seriously impressive here. They don't seem to be far behind OpenAI and Google, which is backed up by the AI Arena ranking.
reply
Deukhoofd
2 hours ago
[-]
The text rendering is quite impressive, but is it just me, or do all these generated 'realistic' images have a distinctly uncanny feel to them? I can't quite put my finger on what it is, but they just feel off to me.
reply
finnjohnsen2
1 hour ago
[-]
I agree. They make me nauseous. The same kind of light nausea as car sickness.

I assume our brains are used to things we don't notice consciously, and reject very mild errors. I've stared at the picture a bit now, and the finger holding the balloon is weird. The out-of-place snowman feels weird. If you follow the background blur around, it isn't at the same depth everywhere. Everything that reflects has reflections of things I can't see in the scene.

I don't feel good staring at it now, so I had to stop.

reply
jbl0ndie
1 hour ago
[-]
Sounds like you're describing the uncanny valley https://en.wikipedia.org/wiki/Uncanny_valley
reply
elorant
1 hour ago
[-]
The lighting is wrong; that's the tell for me. The images look too crisp. No proper shadows; everything looks crystal clear.
reply
techpression
1 hour ago
[-]
It’s the HDR era all over again, where people edited their photos to lack all contrast and just be ultra flat.
reply
likium
1 hour ago
[-]
At least for the real life pictures, there’s no depth of field. Everything is crystal clear like it’s composited.
reply
derefr
1 hour ago
[-]
> like it's composited

Like focus stacking, specifically.

I’m always surprised when people bother to point out more-subtle flaws in AI images as “tells”, when the “depth-of-field problem” is so easily spotted, and has been there in every AI image ever since the earliest models.

reply
Mashimo
1 hour ago
[-]
I had no problems getting images with blurry backgrounds with the appropriate prompts. Something like "shallow depth of field, bokeh, DSLR" can lead to good results. https://cdn.discordapp.com/attachments/1180506623475720222/1... [0]

But I found that that results in more professional-looking images, not more realistic photos.

Adding something like "selfie, Instagram, low resolution, flash" can lead to a... worse image that looks more realistic.

[0] I think I did this one with Z-Image Turbo on my 4060 Ti

reply
afro88
1 hour ago
[-]
The blur isn't correct, though. The amount of blur is wrong for the distance, zoom amount, etc. So the depth of field is really wrong, even if it conforms to "subject crisp, background blurred".
reply
albumen
1 hour ago
[-]
Every photoreal image on the demo page has depth of field, it’s just subtle.
reply
BoredPositron
2 hours ago
[-]
Qwen has always suffered from its subpar RoPE implementation, and Qwen 2 seems to suffer from it as well. The uncanny feel comes down to the sparsity of text-to-image tokens, and the higher in resolution you go, the worse it gets. It's why you can't take the higher end of the MP numbers seriously, no matter the model. At the moment there is no model that can do 4K without problems; you will always get high-frequency artifacts.
reply
belter
2 hours ago
[-]
Agree, looks like the same effect they are applying on YouTube Shorts...
reply
GaggiX
1 hour ago
[-]
For me the only model that can really generate realistic images is Nano Banana Pro (also known as gemini-3-pro-image). Other models are closing the gap, but this one is pretty meh at realistic images, in my opinion.
reply
Mashimo
1 hour ago
[-]
You can get Flux and maybe Z-Image to do so, but you have to experiment with the prompt a bit. Or maybe get a LoRA to help.
reply
cubefox
1 hour ago
[-]
The examples I saw of Z-Image look much more realistic than Nano Banana Pro, which likely uses Imagen 4 (plus editing) internally and isn't very realistic. But Nano Banana Pro obviously has much better prompt alignment than something like Z-Image.
reply
GaggiX
1 hour ago
[-]
Are you sure you're not confusing Nano Banana Pro with Nano Banana? Z-Image still has a bit of an AI look that I don't find with Nano Banana Pro. Example for comparison: https://i.ibb.co/YFtxs4hv/594068364-25101056889517041-340369...

Also, Imagen 4 and Nano Banana Pro are very different models.

reply
yieldcrv
1 hour ago
[-]
when the horsey tranq hits
reply