I'd say the example actually does (vaguely) suggest that Qwen might be overfitting to the pelican.
But in terms of making something physically plausible, Opus certainly got a lot closer.
For a delightful moment this morning I thought I might have finally caught a model provider cheating by training for the pelican, but the flamingo convinced me that wasn't the case.