Play 3.0 mini – A lightweight, reliable, cost-efficient Multilingual TTS model
256 points | 3 days ago | 30 comments | play.ht
mlboss
3 days ago
[-]
On a related note, a very good open-source TTS model was released two days ago: https://github.com/SWivid/F5-TTS

Very good voice cloning capability. Runs in under 10 GB of VRAM on an Nvidia GPU.
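
Rough sketch of running it locally, in case anyone wants to try the voice cloning outside the Gradio demo. Treat the import path and method names below as assumptions rather than the actual API; the repo's README and CLI are the authority here.

    # Hypothetical sketch only -- class/method names are assumed, check the
    # F5-TTS README for the real entry point (there is also a CLI).
    from f5_tts.api import F5TTS  # assumed import path

    tts = F5TTS()  # loads/downloads the checkpoint
    tts.infer(
        ref_file="reference.wav",                      # a few seconds of the voice to clone
        ref_text="Transcript of the reference clip.",  # what the reference clip says
        gen_text="Text to speak in the cloned voice.",
        file_wave="output.wav",                        # where to write the result
    )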

reply
stavros
3 days ago
[-]
Thanks! Would "under 10G" also include 8 GB, by any chance? Although I do die inside a little every time I see "install Torch for your CUDA version", because I never managed to get that working on Linux.
reply
lelag
3 days ago
[-]
It actually uses less than 3 GB of VRAM. One issue is that the research code loads multiple models instead of one, which is why it was initially reported that you need 8 GB of VRAM.

However, it can't serve the same use case because it's currently very slow: real-time usage is not yet possible with the current release code, in spite of the 0.15 RTF claimed in the paper.
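
If you want to verify the footprint yourself, a minimal sketch using PyTorch's memory counters; the loader and input below are placeholders for whatever model you're testing.

    import torch

    # Reset the peak-memory counter, run one inference, then read back the peak.
    torch.cuda.reset_peak_memory_stats()
    model = load_model().cuda()   # placeholder: load the TTS model under test
    audio = model(sample_input)   # placeholder: one synthesis pass
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"Peak VRAM: {peak_gb:.2f} GB")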

reply
linotype
3 days ago
[-]
Try out Pop!_OS. They make it really easy. Though it's named Tensorman, it helps with Torch as well.

https://support.system76.com/articles/tensorman/

reply
stavros
3 days ago
[-]
Thanks, but I don't think I'm going to reinstall my entire OS to run these. I'll see if I can get Docker working, it's been more reliable with CUDA for me.
reply
__MatrixMan__
3 days ago
[-]
I haven't tried it, but I notice that it's also in nixpkgs: https://search.nixos.org/packages?channel=24.05&show=tensorm... That might be a less invasive way to use it, though you'd still have to install Nix.
reply
stavros
3 days ago
[-]
That's easier, thank you!
reply
mlboss
3 days ago
[-]
I bought a 10 TB drive just for these kinds of experiments.
reply
nickthegreek
3 days ago
[-]
The live test on https://play.ai/ didn't work for me in Firefox. I swapped to Chrome and it worked quickly. I cloned my voice in 30s and was instantly talking to myself. This would easily fool most people who know me. Wild stuff.
reply
legofan94
3 days ago
[-]
Firefox is a known issue; we're working on that :x
reply
ktosobcy
3 days ago
[-]
Uhm... was it a known issue when you released it, or did you not even try it on Firefox before release? :(
reply
joeross
3 days ago
[-]
I still use FF, for now anyway, so I'm not trying to be a dick here, but we're talking less than 4% market share, so it's hard to fault a small team for prioritizing the 82% they get with Chrome + Safari.

Source: https://en.wikipedia.org/wiki/File:StatCounter-browser-ww-mo...

reply
ktosobcy
3 days ago
[-]
I quite often wonder about those stats... I mean, most Firefox users are quite conscious about privacy/tracking, so most likely they have it blocked, which would "disappear" them from the stats. Chrome/Safari users mostly don't give a darn (and blocking is getting more difficult), so they would balloon the stats. Not to mention that sites usually work just fine in Firefox but do dumb detection, so users often hide their user agent.
reply
wkat4242
3 days ago
[-]
Yes, Firefox user here. I hide my user agent too, because of stupid sites like Microsoft 365 that disable a lot of functionality for Firefox even though everything works totally fine if they think I'm using Edge. The same skulduggery Google used on Gmail to make Chrome big.
reply
drcongo
3 days ago
[-]
Side note, Safari ad-blocking is in a perfectly fine state and I haven't seen an ad online in years.
reply
ktosobcy
3 days ago
[-]
Last time I tried using Safari (~2 years ago) I was mildly annoyed at seeing ads and at Safari "constantly" removing uBlock Origin, so meh...
reply
drcongo
3 days ago
[-]
Cool.
reply
Palmik
3 days ago
[-]
This is still four times more expensive than Cartesia (https://cartesia.ai/) and three times more expensive than OpenAI's TTS API.

In general, TTS APIs seem to run at much higher margins than LLM APIs, from what I know.

reply
jnsaff2
3 days ago
[-]
They are all expensive, but I'm not so sure about the margins.

Them being VC-funded makes me wonder how much of a loss they're eating even at these prices, hoping to recoup it with some future improvement/home run.

reply
Mizza
3 days ago
[-]
What's SOTA for open source or on-device right now?

I tried building a babelfish with o1, but the transcription in languages other than English is useless. When it gets it right, the translations are pretty much perfect and the voice responses are super fast, but without good transcription it's kind of useless. So close!

reply
kabirgoel
3 days ago
[-]
I work at Cartesia, which operates a TTS API similar to Play [1]. I’d be willing to venture a guess and say that our TTS model, Sonic, is probably SoTA for on-device, but don't quote me on that claim. It's the same model that powers our API.

Sonic can be run on a MacBook Pro. Our API sounds better, of course, since that's running the model on GPUs without any special tricks like quantization. But subjectively the on-device version is good quality and real-time, and it possesses all the capabilities of the larger model, such as voice cloning.

Our co-founders did a demo of the on-device capabilities on the No Priors podcast [2], if you're interested in checking it out for yourself. (I will caveat that this sounds quite a bit worse than if you heard it in person today, since this was an early alpha + it's a recording of the output from a MacBook Pro speaker.)

[1] https://cartesia.ai/sonic [2] https://youtu.be/neQbqOhp8w0?si=2n1i432r5fDG2tPO&t=1886

reply
pietz
1 day ago
[-]
Is your model really open source or did you misunderstand the question?
reply
diggan
3 days ago
[-]
I was literally just looking at that today, and the best one I came across was F5-TTS: https://swivid.github.io/F5-TTS/

The only thing missing (for me) is "emotion tokens" instead of forcing the entire generation to use a single emotion; the generated voice is a bit too robotic otherwise.

reply
moffkalast
3 days ago
[-]
> based on flow matching with Diffusion Transformer

Yeah, that's not gonna be realtime. It's really odd that we currently have two options: VITS/Piper, which runs at a ludicrous speed on a CPU and is kinda OK, and the slightly more natural versions à la StyleTTS2 that take 2 minutes to generate a sentence with CUDA acceleration.

Like, is there a middle ground? Maybe inverting one of the smaller Whispers or something.
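
For comparing these, a rough real-time-factor check is easy to hack together; `synthesize` below is a placeholder for whichever model you're timing (Piper, StyleTTS2, ...).

    import time
    import soundfile as sf

    start = time.perf_counter()
    synthesize("This is a test sentence.", "out.wav")  # placeholder TTS call
    elapsed = time.perf_counter() - start

    audio, sr = sf.read("out.wav")
    rtf = elapsed / (len(audio) / sr)  # wall time divided by audio duration
    print(f"RTF = {rtf:.2f} (below 1.0 means faster than realtime)")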

reply
modeless
3 days ago
[-]
StyleTTS2 is faster than realtime
reply
moffkalast
3 days ago
[-]
To be clear, what I mean by realtime is the full generation taking at most 200 ms, so the audio can be sent to the sound card and start playing; not merely generating in less time than the clip takes to play back, which in practice adds an unusably long delay.

I suppose it might be possible by streaming very short segments, but I haven't seen any implementation that allows for that, and with diffusion-based models it doesn't even work conceptually.

reply
gunalx
3 days ago
[-]
Bark?
reply
refulgentis
3 days ago
[-]
I'm not fully sure what you mean: this is TTS, but it sounds like you're expecting an answer about transcription.

So it's hard to know both which category you'd like to hear about and, if you do mean transcription, what your baseline is.

Whisper is widely regarded as the best in the free camp, though I wouldn't be surprised to see a paper, or a much bigger model, claiming better WER.

If you meant that you tried realtime 4o from OpenAI, and not o1*, it uses Whisper for transcription on the server, so I don't think you'll see much gain from trying Whisper yourself. My next try would be the Google Cloud APIs, but they're paid, and with regard to your question about open-source SOTA, the underlying model isn't open.

But also, if you did mean 4o, the transcription shouldn't matter for output quality: the model takes in voice directly (I verified their claim by noticing that when there are errors in the transcription, it still answers correctly).

* I keep mixing these two up, and it seems unlikely you meant o1, because it has a long synchronous delay before any part of the answer is available and doesn't take in audio.

If you did mean o1, then I'd use realtime 4o for TTS and have it do the translation natively, as it will be unaffected by the transcription errors you're facing now.
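
All that said, if you do want a local open-source baseline for the transcription side, vanilla Whisper with the source language forced (instead of auto-detected) is a quick experiment. A minimal sketch; the file name and "es" are just example values.

    import whisper

    # Forcing the source language often helps non-English audio compared to
    # auto-detection; task="translate" returns English text directly.
    model = whisper.load_model("medium")
    result = model.transcribe("clip.mp3", language="es", task="translate")
    print(result["text"])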

reply
krageon
3 days ago
[-]
GP said local / on-device. Most of what you mentioned is cloud shit.
reply
refulgentis
3 days ago
[-]
Yeah, I covered on-device. Okay, let's call the rest cloud shit. Like I said, confusing comment: they said open source and on-device, then talked about quality issues with the cloud shit they're using that certainly won't be resolved by switching to on-device models. shrug
reply
jankovicsandras
3 days ago
[-]
Hi, I don't know what's SOTA, but I got good results with these (open source, on-device):

https://github.com/SYSTRAN/faster-whisper (speech-to-text)
https://github.com/rhasspy/piper (text-to-speech)
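
The basic faster-whisper usage looks roughly like this (model size and file name are just examples); Piper can then read the resulting text back out via its CLI, as described in its README.

    from faster_whisper import WhisperModel

    # "medium" fits comfortably on a consumer GPU; device="cpu" with
    # compute_type="int8" also works, just slower.
    model = WhisperModel("medium", device="cuda", compute_type="float16")
    segments, info = model.transcribe("speech.wav", beam_size=5)

    print("Detected language:", info.language)
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")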

reply
gyre007
3 days ago
[-]
This is awesome! Over the summer I wrote API clients for both Go [1] and Rust [2] as we were using Play in my job at the time but there was only Python and Node SDK.

[1] https://github.com/milosgajdos/go-playht [2] https://github.com/milosgajdos/playht_rs

reply
Yenrabbit
3 days ago
[-]
Quite disconcerting to have a low-latency chat with something that sounds like you! Can recommend the experience, very thought-provoking.
reply
DevX101
3 days ago
[-]
Has anyone done a comparison of combined speech-to-text and TTS vs speech-to-speech for creating audio-only interfaces? Particularly curious about latency and quality of the audio output.
reply
amrrs
3 days ago
[-]
Hugging Face has a TTS leaderboard (an arena, like LMSYS): https://huggingface.co/spaces/TTS-AGI/TTS-Arena
reply
nutanc
3 days ago
[-]
This is really good. I tried out the cloning and it sounded very similar to my voice. But then I did a blind test with 5 people, and none of them recognised it as my voice. So is there a bias when we listen to our own voice?
reply
tkgally
3 days ago
[-]
I wondered about the same thing. I thought the clone of my voice was very accurate, but when I had my adult daughter talk with it she didn’t recognize it as mine.
reply
lynx23
3 days ago
[-]
First question: does it pronounce numbers > 9 correctly? At least OpenAI's model doesn't perform at all, making garbage out of almost every number it finds. I actually don't remember if I checked with ElevenLabs... But I was shocked that in 2024 someone could release a TTS model that doesn't do numbers correctly. As if the AI industry were approaching Xerox levels of failure. However, the TTS models are way worse than the Xerox compression algo ever was.

I believe verifying numbers up to at least 100000 should be a requirement for new TTS models.

reply
emursebrian
3 days ago
[-]
I didn't check to see if Thai was supported, but it hangs when I try to perform TTS on the text "ฉันพูดภาษาไทย" and then comes back with an error message several minutes later.
reply
BoppreH
3 days ago
[-]
In the video demo, Play 3.0 mini (on the left) incorrectly claims that the other AI missed a word.

How does that end up in an announcement? Do people not notice, or not care? Or are they trying to show realistic mistakes?

reply
wavemode
3 days ago
[-]
Maybe its prompt was "gaslight the person you're talking to into thinking they made a mistake." In which case it did an impressive job!
reply
stuxyz
2 days ago
[-]
lol
reply
lyjackal
3 days ago
[-]
Is there any way to use the TTS on its own? I maintain an Obsidian TTS plug-in and am starting to add new TTS providers (it's just been OpenAI thus far). From the documentation at https://docs.play.ai/documentation/get-started/introduction, it looks like their API couples the TTS to an LLM for building conversational agents. It seems like it would be nice to use it standalone, as just TTS.
reply
amrrs
3 days ago
[-]
You can use Play HT (the TTS powering Play AI) on its own: https://docs.play.ht/reference/api-getting-started

Do you have a link to your Obsidian TTS plugin?

reply
CommanderData
3 days ago
[-]
Is there a way to train this on common AI voices from video games/movies? I'd very much like a voice assistant that sounds like Father/Mother from Alien or Dead Space.
reply
phkahler
3 days ago
[-]
Sounds quite good, but this prompt is NOT what I'd expect an automated system to feed into it:

“I’ve successfully processed your order and I’d like to confirm your product ID. It is A as in Alpha, 1, 2, 3, B as in Bravo, 5, 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray.“

Phone numbers and other things were read nicely, but apparently a string of alphanumerics for an order number isn't handled well yet.

reply
BoorishBears
3 days ago
[-]
Most of these prompts come from LLMs, so it's trivial to instruct them to provide a string that's broken out like that.

It's also not the end of the world to process stuff like this with a regex.

Most of these newer TTS models require this type of formatting to reliably read out long strings of numbers and IDs.
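
A minimal sketch of that kind of preprocessing, matching the example prompt above. It's deliberately naive: it spells out any long run of uppercase letters/digits it finds.

    import re

    # Expand an alphanumeric ID into a TTS-friendly string, e.g.
    # "A123B567Z890X" -> "A as in Alpha, 1, 2, 3, B as in Bravo, ..., X as in X-ray"
    NATO = {
        "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
        "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliett",
        "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
        "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
        "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray", "Y": "Yankee",
        "Z": "Zulu",
    }

    def spell_out(order_id: str) -> str:
        parts = []
        for ch in order_id:
            if ch.isdigit():
                parts.append(ch)                        # digits read one by one
            else:
                parts.append(f"{ch} as in {NATO[ch]}")  # letters get a NATO word
        return ", ".join(parts)

    def expand_ids(text: str) -> str:
        # Any run of 6+ uppercase letters/digits containing at least one digit.
        return re.sub(r"\b(?=[A-Z0-9]*\d)[A-Z0-9]{6,}\b",
                      lambda m: spell_out(m.group()), text)

    print(expand_ids("I'd like to confirm your product ID. It is A123B567Z890X."))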

reply
amrrs
3 days ago
[-]
Sorry, do you mean the audio for this text is not good?

“I’ve successfully processed your order and I’d like to confirm your product ID. It is A as in Alpha, 1, 2, 3, B as in Bravo, 5, 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray.“

I thought this was included in the demo; it seemed okay!

reply
phkahler
3 days ago
[-]
>> Sorry, do you mean the audio for this text is not good?

No, the audio was OK, even good. The example seems to be an automated response from some system where a human has just placed an order. The order number is A123B567Z890X, but if we want our system to "read back" the order number, we apparently have to specially format the text. I suppose the clarifying stuff ("Alpha", "Bravo") is a good idea, but separating the digits and all those commas?

reply
mrkstu
3 days ago
[-]
'Alpha' is kind of swallowed and Bravo is mispronounced.
reply
diggan
3 days ago
[-]
> Phone numbers and others were read nicely

The phone numbers were not read naturally at all. A human would have read a grouping of 123-456-789 like "123", "456", "789", but instead the model generated something like "123", "45", "6789". Listen to the RSVP example again and you'll hear what I mean. The pacing is generally off for normal text too, but it's extra noticeable for the numbers.

My hunch would be that it's because of tokenization, but I wouldn't be able to say for sure that that's the issue. It sounds like it, though :)

reply
bryananderson
2 days ago
[-]
In this case it’s not tokenization. I wrote the text preprocessing code that deals with spacing these numbers. This is good feedback. It’s optimized for US-style 10-digit phone numbers, and it should be more flexible than that. For example, if I was reading a US phone number such as (123) 456-7890 over the phone and wanted to make sure it was heard correctly, I’d say “123”, “456”, “78”, “90”. But a 9-digit phone number should be spaced as you said.
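
For anyone curious, the grouping itself is simple; a toy sketch (not the actual preprocessing code) of the spacing described above:

    def space_phone_number(number: str) -> str:
        # Toy sketch, not the production preprocessing: 10 digits -> 3-3-2-2
        # ("123", "456", "78", "90"), anything else -> plain groups of three.
        digits = "".join(ch for ch in number if ch.isdigit())
        if len(digits) == 10:
            groups = [digits[:3], digits[3:6], digits[6:8], digits[8:]]
        else:
            groups = [digits[i:i + 3] for i in range(0, len(digits), 3)]
        return ", ".join(" ".join(g) for g in groups)

    print(space_phone_number("(123) 456-7890"))  # 1 2 3, 4 5 6, 7 8, 9 0
    print(space_phone_number("123-456-789"))     # 1 2 3, 4 5 6, 7 8 9
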
reply
Asjad
3 days ago
[-]
Play 3.0 mini sounds like a game-changer for real-time multilingual TTS, with its speed and voice cloning capabilities.
reply
nature556
3 days ago
[-]
I think it's important to have high-quality TTS on arbitrary web articles.
reply
dulldata
3 days ago
[-]
Demo video, if you don't want to go through the announcement: https://www.youtube.com/watch?v=DusTj5NLC9w

Mostly good with numbers!

reply
lostmsu
3 days ago
[-]
Is this one open in any way? If not, why would anyone use it over OpenAI?
reply
wkat4242
3 days ago
[-]
I have to say even something really low-resource like Piper (pure CPU) sounds amazing these days. TTS really appears to be a solved problem now.
reply
antman
3 days ago
[-]
Does anyone know of a TTS model that can convey feeling? E.g. for ebook reading of novels? Or can one request feeling in any of the models in this discussion?
reply
bkitano19
3 days ago
[-]
hume.ai specializes in expressive prosody for TTS (disclaimer - I work here)
reply
codeful
3 days ago
[-]
Azure speech services?
reply
Aeolun
3 days ago
[-]
That's 12 times cheaper than the OpenAI models, though. Those are already very good, so I can't really see myself using this.

I really want a good on-device model, though.

reply
treesciencebot
3 days ago
[-]
Much faster than OpenAI's real-time mode, wow! Quality seems to be on par if not better as well.
reply
samsepi0l121
3 days ago
[-]
Did we watch the same video? OpenAI's model is faster, and the quality is far better.
reply
steego
3 days ago
[-]
Forget the video. Try it.

I use OpenAI's voice models a lot, I have access to them all, and I'm honestly more impressed with the ease with which one can conduct a conversation with this voice model.

Honestly, this feels like the first voice model I would pilot as a customer service rep in a hospitality setting.

reply
throwaway48476
3 days ago
[-]
I would love a browser extension that does high quality TTS on arbitrary web articles.
reply
jasonjmcghee
3 days ago
[-]
There's "Reader" by ElevenLabs. Now with iPhone mirroring, you can use it without much trouble from your laptop too.
reply
KaoruAoiShiho
3 days ago
[-]
Is this better than 11labs?
reply
scotty79
3 days ago
[-]
If you go by the Hugging Face leaderboard, then no.

https://huggingface.co/spaces/TTS-AGI/TTS-Arena

reply
sigmar
3 days ago
[-]
I don't see "playht 3.0 mini" on there, only play.ht 2.0 (released in August 2023).
reply
yavorgiv
3 days ago
[-]
It depends on the use case. If you are looking for a stable model with great voices and very low latency, Play 3.0 mini is as good as or better than 11labs. https://x.com/_mfelfel/status/1846025183993511965/photo/1
reply
ks2048
3 days ago
[-]
Do any good TTS models (open or not) allow fine-tuning for a new language?
reply
siscia
3 days ago
[-]
I honestly wanted to try to use it, but their pricing was quite off-putting.
reply
c0brac0bra
3 days ago
[-]
Yes. I think $0.05/min is a high multiple of what other agent-oriented realtime TTS products charge.
reply
ilrwbwrkhv
2 days ago
[-]
TTS is a loss-making business... Somehow the VC funding has to be returned.
reply
causal
3 days ago
[-]
It's always a little disappointing when someone announces they're releasing a model when they mean they're releasing an API.
reply
gorkemyurt
3 days ago
[-]
wow! latency is insane
reply
codetrotter
3 days ago
[-]
Hey Alexa, Google “Play”!
reply
Zababa
3 days ago
[-]
What's the state of the art for voice cloning in another language (here French)?
reply
yavorgiv
3 days ago
[-]
It should do pretty well. You can try it here: https://play.ht/playground/
reply