Qwen3-TTS Family Is Now Open Sourced: Voice Design, Clone, and Generation
254 points
5 hours ago
| 16 comments
| qwen.ai
simonw
1 hour ago
[-]
If you want to try out the voice cloning yourself you can do that at this Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS - switch to the "Voice Clone" tab, paste in some example text and use the microphone option to record yourself reading that text - then paste in other text and have it generate a version of that text read in your voice.

I shared a recording of audio I generated with that here: https://simonwillison.net/2026/Jan/22/qwen3-tts/

reply
javier123454321
1 hour ago
[-]
This is terrifying. With this and z-image-turbo, we've crossed a chasm. And a very deep one. We are currently protected by screens: we can, and should, assume everything behind a screen is fake unless rigorously (and systematically, i.e. cryptographically) proven otherwise. We're sleepwalking into this; not enough people know about it.
reply
rdtsc
1 hour ago
[-]
That was my thought too. You’d have “loved ones” calling with their faces and voices asking for money in some emergency. But you’d also have plausible deniability as anything digital can be brushed off as “that’s not evidence, it could be AI generated”.
reply
neevans
49 minutes ago
[-]
This was already possible with Chatterbox for a long while.
reply
freedomben
30 minutes ago
[-]
Yep, this has been the reality for years now. Scammers have already had access to it. I remember an article years ago about a grandma who wired her life savings to a scammer who claimed to have her granddaughter held hostage in a foreign country. Turns out they had just cloned the granddaughter's voice from Facebook data and knew her schedule, so they timed the call for when she would be unreachable by phone.
reply
DANmode
11 minutes ago
[-]
or anyone who refuses to use hearing aids.
reply
echelon
25 minutes ago
[-]
We're going to be okay.

There are far more good and interesting use cases for this technology. Games will let users clone their voices and create virtual avatars and heroes. People will have access to creative tools that let them make movies and shows with their likeness. People that couldn't sing will make music.

Nothing was scarier than the invention of the nuclear weapon. And we're all still here.

Life will go on. And there will be incredible benefits that come out of this.

reply
supern0va
21 minutes ago
[-]
We'll be okay eventually, when society adapts to this and becomes fully aware of the capabilities and the use cases for abuse. But, that may take some time. The parent is right to be concerned about the interim, at the very least.

That said, I am likewise looking forward to the cool things to come out of this.

reply
DANmode
11 minutes ago
[-]
> People that couldn't sing will make music.

I was with you, until

But, yeah. Life will go on.

reply
echelon
9 minutes ago
[-]
There are plenty of electronic artists who can't sing. Right now they have to hire someone else to do the singing for them, but I'd wager a lot of them would like to own their music end-to-end. I would.

I'm a filmmaker. I've done photons-on-glass production for fifteen years. Meisner trained, I have performed every role from cast to crew. I'm elated that these tools are going to enable me to do more with a smaller budget. To have more autonomy and creative control.

reply
genewitch
2 hours ago
[-]
It isn't often that technology gives me chills, but this did it. I've used "AI" TTS tools since 2018 or so, and I thought the stuff from two years ago was about the best we were going to get. I don't know the size of these; I scrolled straight to the samples. I am going to get the models set up somewhere and test them out.

Now, maybe the results were cherrypicked. I know everyone else who has released one of these cherrypicks which to publish. However, this is the first time I've considered it plausible to use AI TTS to remaster old radio plays and the like, where a section of audio is unintelligible but can be deduced from context, like a tape glitch where someone says "HEY [...]LAR!" and it's an episode of Yours Truly, Johnny Dollar...

I have dozens of hours of audio of Bob Bailey and other people of that era.
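
The splice itself would be the easy part once the replacement line exists; a rough sketch with pydub (file names and timestamps are made up, and the patched line would come from the voice clone):

    from pydub import AudioSegment  # pip install pydub (needs ffmpeg installed)

    # Hypothetical files: the original episode and the TTS-generated replacement line.
    episode = AudioSegment.from_file("johnny_dollar_episode.mp3")
    patch = AudioSegment.from_file("cloned_line.wav")

    glitch_start_ms, glitch_end_ms = 12_340, 13_920  # region where the tape drops out

    # Keep everything before the glitch, insert the synthesized line, keep the rest.
    repaired = episode[:glitch_start_ms] + patch + episode[glitch_end_ms:]
    repaired.export("johnny_dollar_episode_repaired.mp3", format="mp3")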

reply
freedomben
28 minutes ago
[-]
Indeed, I have a future project/goal of "restoring" Have Gun - Will Travel radio episodes to listenable quality using tech like this. There are so many lines where sound effects and tape rot and other "bad recording" problems make it very difficult to understand what was said. It will be amazing, but as with all tech, the potential for abuse is very real.
reply
kamranjon
1 hour ago
[-]
I wonder if it was trained on anime dubs, because all of the examples I listened to sounded very similar to a Miyazaki-style dub.
reply
throwaw12
2 hours ago
[-]
Qwen team, please please please, release something to outperform and surpass the coding abilities of Opus 4.5.

Although I like the model, I don't like the leadership of that company, how closed it is, and how divisive they are in terms of politics.

reply
mortsnort
2 hours ago
[-]
They were just waiting for someone in the comments to ask!
reply
mhuffman
1 hour ago
[-]
It really is the best way to incentivize politeness!
reply
WarmWash
2 hours ago
[-]
The Chinese labs distill the SOTA models to boost the performance of theirs. They are a trailer hooked up (with a 3-6 month long chain) to the trucks pushing the technology forwards. I've yet to see a trailer overtake its truck.

China would need an architectural breakthrough to leapfrog American labs given the huge compute disparity.

reply
overfeed
53 minutes ago
[-]
Care to explain how the volume of AI research papers authored by Chinese researchers[1] has exceeded US-published ones? Time-traveling plagiarism perhaps, since you believe the US is destined to lead always.

1. Chinese researchers in China, to be more specific.

reply
bfeynman
33 minutes ago
[-]
Not a great metric; research in academia doesn't necessarily translate to value. In the US, labs have poached so many academics precisely because of how much value they directly translate into.
reply
jacquesm
38 minutes ago
[-]
Volume is easy: they have far more people, it is quality that counts.
reply
miklosz
1 hour ago
[-]
I have indeed seen a trailer overtake its truck. Not a pretty sight.
reply
digdugdirk
40 minutes ago
[-]
Agreed. I do think the metaphor still holds though.

A financial jackknifing of the AI industry seems to be one very plausible outcome as the promises/expectations of the AI companies start meeting reality.

reply
aaa_aaa
2 hours ago
[-]
No, all they need is time. I am awaiting the downfall of the AI hegemony and hype with popcorn at hand.
reply
mhuffman
1 hour ago
[-]
I would be happy with an open-weight 3-month-old Claude.
reply
cmrdporcupine
1 hour ago
[-]
DeepSeek 3.2 is frankly fairly close to that. GLM 4.7 as well. They're basically around Sonnet 4 level.
reply
TylerLives
2 hours ago
[-]
>how divisive they're in terms of politics

What do you mean by this?

reply
throwaw12
2 hours ago
[-]
Dario has said some not-so-nice things about China and open models in general:

https://www.bloomberg.com/news/articles/2026-01-20/anthropic...

reply
vlovich123
2 hours ago
[-]
I think the least politically divisive issue within the US is concern about China’s growth as it directly threatens the US’s ability to set the world’s agenda. It may be politically divisive if you are aligned with Chinese interests but I don’t see anything politically divisive for a US audience. I expect Chinese CEOs speak in similar terms to a Chinese audience in terms of making sure they’re decoupled from the now unstable US political machine.
reply
cmrdporcupine
1 hour ago
[-]
"... for a US audience"

And that's the rub.

Many of us are not.

reply
giancarlostoro
1 hour ago
[-]
From the perspective of competing against China in terms of AI, the argument against open models makes sense to me. It's a terrible problem to have, really. Ideally we should all be able to work together in the sandbox towards a better tomorrow, but that's not reality.

I prefer to have more open models. On the other hand China closes up their open models once they start to show a competitive edge.

reply
Levitz
1 hour ago
[-]
I mean, there's no way it's about this right?

Being critical of favorable actions towards a rival country shouldn't be divisive, and if it is, well, I don't think the problem is in the criticism.

Also the link doesn't mention open source? From a google search, he doesn't seem to care much for it.

reply
Balinares
1 hour ago
[-]
They're supporters of the Trump administration's military, a view which is not universally lauded.
reply
pseudony
1 hour ago
[-]
Same issue (I am Danish).

Have you tested alternatives? I grabbed Open Code and a MiniMax M2.1 subscription, even just the $10/mo one to test with.

Result? We designed, from scratch, a spec for a slight variation of a tool I had previously spec'd with Claude - same problem (a process supervisor tool).

Honestly, it worked great. I have played a little further with generating code (this time Go), and again, I am happy.

Beyond that, GLM 4.7 should also be great.

See https://dev.to/kilocode/open-weight-models-are-getting-serio...

It is a recent case study of vibe-coding a smaller tool with Kilo Code, comparing output from MiniMax M2.1 and GLM 4.7.

Honestly, just give it a whirl - no need to send money to companies/nations you disagree with.

reply
nunodonato
49 minutes ago
[-]
I've been using GLM 4.7 with Claude Code - best of both worlds. Canceled my Anthropic subscription due to the US politics as well. I'd already started my "withdrawal" in Jan 2025; Anthropic was one of the few that were left.
reply
bigyabai
46 minutes ago
[-]
I'm in the same boat. Sonnet was overkill for me, and GLM is cheap and smart enough to spit out boilerplate and FFMPEG commands whenever it's asked.

$20/month is a bit of an insane ask when the most valuable thing Anthropic makes is the free Claude Code CLI.

reply
amrrs
2 hours ago
[-]
Have you tried the new GLM 4.7?
reply
davely
1 hour ago
[-]
I've been using GLM 4.7 alongside Opus 4.5 and I can't believe how bad GLM is. Seriously.

I spent 20 minutes yesterday trying to get GLM 4.7 to understand that a simple modal on a web page (vanilla JS and HTML!) wasn't displaying when a certain button was clicked. I hooked it up to Chrome MCP in Open Code as well.

It constantly told me that it fixed the problem. In frustration, I opened Claude Code and just typed "Why won't the button with ID 'edit' work???!"

It fixed the problem in one shot. This isn't even a hard problem (and I could have just fixed it myself but I guess sunk cost fallacy).

reply
bityard
1 hour ago
[-]
I've used a bunch of the SOTA models (via my work's Windsurf subscription) for HTML/CSS/JS stuff over the past few months. Mind you, I am not a web developer; these are just internal and personal projects.

My experience is that all of the models seem to do a decent job of writing a whole application from scratch, up to a certain point of complexity. But as soon as you ask them for non-trivial modifications and bugfixes, they _usually_ go deep into rationalized rabbit holes to nowhere.

I burned through a lot of credits to try them all and Gemini tended to work the best for the things I was doing. But as always, YMMV.

reply
KolmogorovComp
1 hour ago
[-]
Exactly the same feedback
reply
Balinares
37 minutes ago
[-]
Amazingly, just yesterday, I had Opus 4.5 crap itself extensively on a fairly simple problem -- it was trying to override a column with an aggregation function while also using it in a group-by without referring to the original column by its fully qualified name, prefixed with the table -- and in typical Claude fashion it assembled an entire abstraction layer to try and hide the problem under, before finally giving up, deleting the column, and smugly informing me I didn't need it anyway.

That evening, for kicks, I brought the problem to GLM 4.7 Flash (Flash!) and it one-shot the right solution.

It's not apples to apples, because when it comes down to it LLMs are statistical token extruders, and it's a lot easier to extrude the likely tokens from an isolated query than from a whole workspace that's already been messed up somewhat by said LLM. That, and data is not the plural of anecdote. But still, I'm easily amused, and this amused me. (I haven't otherwise pushed GLM 4.7 much and I don't have a strong opinion about it.)

But seriously, given the consistent pattern of knitting ever larger carpets to sweep errors under that Claude seems to exhibit over and over instead of identifying and addressing root causes, I'm curious what the codebases of people who use it a lot look like.

reply
throwaw12
2 hours ago
[-]
Yes I did; it's not on par with Opus 4.5.

I use Opus 4.5 for planning; when I reach my usage limits I fall back to GLM 4.7 only for implementing the plan. It still struggles, even though I configure GLM 4.7 as both the smaller model and the heavier model in Claude Code.

reply
Onavo
1 hour ago
[-]
Well DeepSeek V4 is rumored to be in that range and will be released in 3 weeks.
reply
sampton
2 hours ago
[-]
Every time Dario opens his mouth it's something weird.
reply
TheAceOfHearts
43 minutes ago
[-]
Interesting model. I've managed to get the 0.6B param model running on my old 1080 and I can generate 200-character chunks safely without going OOM, so I thought that making an audiobook of the Tao Te Ching would be a good test. Unfortunately each snippet varies drastically in quality: sometimes the speaker is clear and coherent, but other times it bursts out laughing or moaning. In a way it feels a bit like magical roulette, never being quite certain of what you're going to get. It does have a bit of charm; when you chain the various snippets together you really don't know what direction it's gonna go.
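
The chunking itself is just a text splitter along these lines (a rough sketch; synthesize() is a placeholder for however you actually invoke the model, not a real Qwen3-TTS API):

    import re

    def chunk_text(text, max_len=200):
        # Split into sentences, then pack them into chunks of at most max_len characters.
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        chunks, current = [], ""
        for sentence in sentences:
            if current and len(current) + 1 + len(sentence) > max_len:
                chunks.append(current)
                current = sentence  # a single over-long sentence stays its own chunk
            else:
                current = (current + " " + sentence).strip()
        if current:
            chunks.append(current)
        return chunks

    # synthesize() is hypothetical -- swap in the real model call from the model card.
    # audio = [synthesize(c, speaker="Ryan") for c in chunk_text(tao_te_ching_text)]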

Using speaker Ryan seems to be the most consistent, I tried speaker Eric and it sounded like someone putting on a fake exaggerated Chinese accent to mock speakers.

If it wasn't for the unpredictable level of emotions from each chunk, I'd say this is easily the highest quality TTS model I've tried.

reply
KaoruAoiShiho
15 minutes ago
[-]
Have you tried specifying the emotion? There's an option to do so and if it's left empty it wouldn't surprise me if it defaulted to rng instead of bland.
reply
whinvik
46 minutes ago
[-]
Haha, something I want to try out. I have started using voice input more and more instead of typing, and I am now on my second app and second STT model, namely Handy and Parakeet V3.

Parakeet is pretty good, but there are times it struggles. Would be interesting to see how Qwen compares once Handy has it in.

reply
PunchyHamster
1 hour ago
[-]
Looking forward to my grandma being scammed by one!
reply
jacquesm
37 minutes ago
[-]
So far that seems to be the main use case.
reply
satvikpendem
1 hour ago
[-]
This would be great for audiobooks; some of the current AI TTS models still struggle.
reply
rahimnathwani
1 hour ago
[-]
Has anyone successfully run this on a Mac? The installation instructions appear to assume an NVIDIA GPU (CUDA, FlashAttention), and I’m not sure whether it works with PyTorch’s Metal/MPS backend.
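
For what it's worth, checking what PyTorch itself can see is easy enough; the snippet below only confirms the MPS backend is available, not that the model's attention kernels actually run on it:

    import torch

    # Pick the best available backend; this says nothing about FlashAttention support.
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"  # Apple Metal backend
    else:
        device = "cpu"
    print(f"PyTorch sees: {device}")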
reply
magicalhippo
10 minutes ago
[-]
FWIW you can run the demo without FlashAttention using the --no-flash-attn command-line parameter; I do that since I'm on Windows and haven't gotten FlashAttention 2 to work.
reply
turnsout
6 minutes ago
[-]
It seems to depend on FlashAttention, so the short answer is no. Hopefully someone does the work of porting the inference code over!
reply
javier123454321
1 hour ago
[-]
I recommend using Modal for renting the metal.
reply
JonChesterfield
1 hour ago
[-]
I see a lot of references to `device_map="cuda:0"` but no CUDA in the GitHub repo. Is the complete stack FlashAttention plus this Python code plus the weights file, or does one need vLLM running as well?
reply
thedangler
2 hours ago
[-]
Kind of a noob here: how would I run this locally? How do I pass it audio to process? I'm assuming it's in the API spec?
reply
dust42
2 hours ago
[-]
Scroll down on the Hugging Face page; there are code examples and also a link to GitHub: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base
reply
indigodaddy
2 hours ago
[-]
How does the cloning compare to pocket TTS?
reply
albertwang
3 hours ago
[-]
Great news, this looks great! Is it just me, or do most of the English audio samples sound like anime voices?
reply
bityard
1 hour ago
[-]
Well, if you look at the prompts, they are basically told to sound like that.

And if you ask me, I think these models were trained on tween fiction podcasts. (My kids listen to a lot of these and dramatic over-acting seems to be the industry standard.)

Also, their middle-aged adult with an "American English" accent doesn't sound like any American I've ever met. More like a bad Sean Connery impersonator.

reply
reactordev
2 hours ago
[-]
The real value I see is being able to clone a voice and change timbre and characteristics of the voice to be able to quickly generate voice overs, narrations, voice acting, etc. It's superb!
reply
rapind
3 hours ago
[-]
> do most of the English audio samples sound like anime voices?

100%. I was thinking the same thing.

reply
devttyeu
3 hours ago
[-]
Also like some popular YouTubers and popular speakers.
reply
pixl97
2 hours ago
[-]
Hmm, wonder where they got their training data from?
reply
thehamkercat
2 hours ago
[-]
Even the Japanese audio samples sound like anime.
reply
htrp
2 hours ago
[-]
Subbed audio training data is better (much better than CC data).
reply
ideashower
2 hours ago
[-]
Huh. One of the English Voice Clone examples features Obama.
reply
salzig
49 minutes ago
[-]
So now we're getting every movie in its "original voice" but in the local language? Can't wait to watch anime or Bollywood :D
reply
wahnfrieden
2 hours ago
[-]
How is it for Japanese?
reply
salzig
52 minutes ago
[-]
there is a sample clone -> Trump speaks Japanese.

Edit: "Cross-lingual Voice Clone" https://qwen.ai/blog?id=qwen3tts-0115#voice-clone

reply
lostmsu
3 hours ago
[-]
I still don't know anyone who has managed to get Qwen3-Omni to work properly on a local machine.
reply