Pocket TTS: A high quality TTS that gives your CPU a voice
267 points
23 hours ago
| 16 comments
| kyutai.org
derHackerman
35 minutes ago
[-]
I read this, then realized I needed a browser extension to read my long case study aloud, so I made a browser interface for it and put this together:

https://github.com/lukasmwerner/pocket-reader

reply
lukebechtel
5 hours ago
[-]
Nice!

Just made it an MCP server so Claude can tell me when it's done with something :)

https://github.com/Marviel/speak_when_done

reply
tylerdavis
2 hours ago
[-]
Funny! I made one recently too using piper-tts! https://github.com/tylerdavis/speak-mcp
reply
codepoet80
2 hours ago
[-]
I just set up Pushover to send a message to my phone for this exact reason! Trying out your server next!
reply
singpolyma3
5 hours ago
[-]
Love this.

It says MIT license, but the README has a separate section on prohibited use that may add restrictions making it nonfree. Not sure about the legal implications here.

reply
CGamesPlay
4 hours ago
[-]
For reference, the MIT license contains this text: "Permission is hereby granted... to deal in the Software without restriction, including without limitation the rights to use". So the README containing a "Prohibited Use" section definitely creates a conflicting statement.
reply
jandrese
4 hours ago
[-]
The "prohibited uses" section seems to be basically "not to be used for crime", which probably doesn't have much legal weight one way or another.
reply
WhyNotHugo
1 hour ago
[-]
You might use it for something illegal in one country, and then leave for another country with no extradition… but you’ve lost the license to use the software and can be sued for copyright infringement.
reply
Buttons840
4 hours ago
[-]
Good question.

If a license says "you may use this, you are prohibited from using this", and I use it, did I break the license?

reply
ethin
3 hours ago
[-]
If memory serves, the license text is the ultimate source of truth on what is allowed or not. You cannot add a section that isn't in the text of the license (at least in the US and other countries with similar legal systems) on some website and expect it to hold up in court, because the license doesn't include that text. I know of a few other bigger-name projects that try to pull these kinds of stunts because they don't believe anyone will actually read the text of the license.
reply
iamrobertismo
4 hours ago
[-]
Yeah, I don't understand the point of the prohibited use section at all, seems like unnecessary fluff.
reply
armcat
5 hours ago
[-]
Oh this is sweet, thanks for sharing! I've been a huge fan of Kokoro and even set up my own fully-local voice assistant [1]. Will definitely give Pocket TTS a go!

[1] https://github.com/acatovic/ova

reply
gropo
5 hours ago
[-]
Kokoro is better for TTS by far.

For voice cloning, Pocket TTS is gated, so I can't tell.

reply
seunosewa
3 hours ago
[-]
Chatterbox-turbo is really good too. It has a version that uses Apple's GPU.
reply
echelon
4 hours ago
[-]
What are the advantages of PocketTTS over Kokoro?

It seems like Kokoro is the smaller model, also runs on CPU in real time, and is more open and fine-tunable, with more scripts, extensions, etc., whereas this is new and doesn't have any fine-tuning code yet.

I couldn't tell an audio quality difference.

reply
jamilton
3 hours ago
[-]
Being able to voice clone with PocketTTS seems major; it doesn't look like there's any support for that in Kokoro.
reply
echelon
3 hours ago
[-]
Zero-shot voice clones have never been very good. Fine-tuned models hit natural speaker similarity and prosody in a way zero-shot models can't match.

If it were a big model, trained on a diverse set of speakers and able to remember how to replicate them all, then zero-shot would be a potentially bigger deal. But this is a tiny model.

I'll try out the zero shot functionality of Pocket TTS and report back.

reply
jhatemyjob
1 hour ago
[-]
Less licensing headache, it seems. Kokoro says it's Apache-licensed, but it has eSpeak-NG as a dependency, which is GPL, which raises the question of whether Kokoro is effectively GPL. PocketTTS doesn't have eSpeak-NG as a dependency, so you don't need to worry about all that BS.

Btw, I would love to hear from someone (who knows what they're talking about) to clear this up for me. Dealing with potential GPL contamination is a nightmare.

reply
amrrs
5 hours ago
[-]
Thanks for sharing your repo, looks super cool. I'm planning to try it out. Is it based on MLX or just HF Transformers?
reply
armcat
5 hours ago
[-]
Thank you, just transformers.
reply
Imustaskforhelp
2 hours ago
[-]
Perhaps I haven't talked to voice models much, or the ChatGPT voice always felt weird and off because I knew it was going to a cloud server. But through Pocket TTS I discovered unmute.sh, which is open source, is from the same company as Pocket TTS I think, and can use Pocket TTS as well.

I've seen agentic models at 4B or so that punch above their weight, and even some decent basic models. I can definitely see them in the context of a home lab without costing too much money.

I think unmute.sh, at least, is comparable to and competes with ChatGPT's voice mode. It's crazy how good (and effective) open source models are from top to bottom. There's basically something for almost everyone.

I feel like the only true moat might be in coding models. Some open ones are pretty good, but it's the only segment where people might pay 10x-20x more for the best (MiniMax/z.ai subscription fees vs Claude Code).

It will be interesting to see whether there's another DeepSeek moment that beats Claude Sonnet or similar. I think DeepSeek has DeepSeek 4 coming, so we'll see whether it can beat Sonnet.

(Sorry for going offtopic)

reply
mgaudet
3 hours ago
[-]
Eep.

So, on my M1 mac, did `uvx pocket-tts serve`. Plugged in

> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only

(the opening of A Tale of Two Cities)

but the problem is Javert skips over parts of sentences! Eg, it starts:

> "It was the best of times, it was the worst of times, it was the age of wisdom, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the spring of hope, it was the winter of despair, we had everything before us, ..."

Notice how it skips "it was the age of foolishness," and "it was the season of Darkness,".

Which... Doesn't exactly inspire faith in a TTS system.

(Marius seems better; posted https://github.com/kyutai-labs/pocket-tts/issues/38)
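
In case it helps anyone else hitting this: a common mitigation for small TTS models that drop clauses on long inputs is to split the text into clause-sized chunks before synthesis and concatenate the audio afterwards. It won't fix the model, but it bounds how much context each call sees. A minimal sketch of such a chunker in plain Python; the function and the character limit are my own invention, not part of pocket-tts:

```python
import re

def chunk_text(text: str, max_chars: int = 150) -> list[str]:
    """Split text into chunks no longer than max_chars, breaking at
    clause boundaries so each chunk can be synthesized independently."""
    # Split after clause-ending punctuation, keeping it with the clause.
    clauses = re.split(r"(?<=[,;.!?])\s+", text.strip())
    chunks, current = [], ""
    for clause in clauses:
        if current and len(current) + len(clause) + 1 > max_chars:
            chunks.append(current)
            current = clause
        else:
            current = f"{current} {clause}".strip()
    if current:
        chunks.append(current)
    return chunks

dickens = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom, it was the age of foolishness, "
           "it was the epoch of belief, it was the epoch of incredulity.")
for chunk in chunk_text(dickens, max_chars=80):
    print(chunk)
```

Each chunk then gets its own synthesis request, so no single call sees the whole clause-heavy sentence.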

reply
small_scombrus
48 minutes ago
[-]
Using your first text block 'Eponine' skips "we had nothing before us" and doesn't speak the final "that some of its noisiest"

I wonder what's going wrong in there

reply
sbarre
1 hour ago
[-]
Yeah, Javert mangled those sentences for me as well; it skipped whole parts and also moved words around:

- "its noisiest superlative insisted on its being received"

Win10 RTX 5070 Ti

reply
OfflineSergio
1 hour ago
[-]
This is amazing. The audio feels very natural and it's fairly good at handling complex text-to-speech tasks. I've been working on WithAudio (https://with.audio). Currently it only uses Kokoro. I need to test this a bit more, but I might actually add it to the app. It's too good to ignore.
reply
dust42
5 hours ago
[-]
Good quality, but unfortunately it is English-only.
reply
phoronixrly
5 hours ago
[-]
I echo this. For a TTS system to be in any way useful outside the tiny population of the world that speaks exclusively English, it must be multilingual and dynamically switch between languages pretty much per word.

Cool tech demo though!
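
To illustrate what "switch per word" means in practice, here's a toy sketch that tags each word by script so a hypothetical multi-voice pipeline could route it to the right engine. (Real screen readers use language tags and dictionaries, not just script detection; this is only the crudest possible version, and all the names here are made up.)

```python
import unicodedata

def script_of(word: str) -> str:
    """Guess a word's script from the Unicode names of its letters.
    Very rough: real language ID needs more than the script alone."""
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("CYRILLIC"):
                return "cyrillic"
            if name.startswith("GREEK"):
                return "greek"
    return "latin"

def route(text: str) -> list[tuple[str, str]]:
    """Tag each word with the voice/engine it should be routed to."""
    return [(word, script_of(word)) for word in text.split()]

# Mixed-script sentence: German with Greek and Russian place names.
print(route("Der Zug nach Αθήνα fährt über Москва"))
```

A mono-lingual model gets no such routing point at all, which is the core of the complaint.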

reply
bingaweek
2 hours ago
[-]
This is a great illustration that nothing you ever do will be good enough without people whining.
reply
kamranjon
4 hours ago
[-]
That's a pretty strong requirement for something to be "useful", especially for something that runs this efficiently on CPU. Many content creators from non-English-speaking countries can benefit from this type of release: translate transcripts of their content to English, then run them through a model like this to dub their videos in a language that reaches many more people.
reply
ethin
3 hours ago
[-]
Uh, no? This is not at all an absurd requirement. Screen readers literally do this all the time, with voices built the classic speech-synthesizer way, no AI required; eSpeak is an example, as is MS OneCore. The NVDA screen reader has an option for automatic language switching, as does pretty much every other modern screen reader in existence. And absolutely none of these use AI models to do that switching, either.
reply
phoronixrly
4 hours ago
[-]
You mean YouTubers? Who would then have to (manually) synchronise the text to their video, especially when YouTube apparently already offers voice-to-voice translation out of the box, to my and many others' annoyance?
reply
Levitz
4 hours ago
[-]
But it wouldn't only be for those who "speak exclusively English"; rather, for anyone who speaks English. It's also common to have the system language set to English even if one's native language is different.

There are about 1.5B English speakers on the planet.

reply
phoronixrly
4 hours ago
[-]
Let's indeed limit the use case to the system language, say of a mobile phone.

You pull up a map and start navigation. All the street names are in the local language, and no, transliterating local names into the English alphabet does not make them understandable when spoken by TTS, not to mention localised foreign names, which get completely mangled by transliteration.

You pull up a browser and open a news article in your local language to read during your commute. You now have to reach for a translation model before passing the text to the English-only TTS.

You're driving, and one of your friends Signals you. Your phone UI is in English, so you get a notification (interrupting your Spotify) saying 'Signal message', followed by 5 minutes of gibberish.

But let's say you have a TTS model that supports your local language natively. Because '1.5B English speakers' apparently exist on the planet, many texts in other languages include English or Latin names and words. Now you have the opposite issue: your TTS needs to switch to English to pronounce those correctly...

And mind you, these are just very simple use cases for TTS. If you consider people with limited sight who experience the entire Internet, and all mobile and desktop applications (often poorly localised), via TTS, you see how mono-lingual TTS is mostly useless and would be swapped for a robotic old-school TTS in a flash...

> only that but it's also common to have system language set to English

Ask a German whether their system language is English. Ask a French person. I can go on.

reply
echelon
4 hours ago
[-]
English has more users than all but a few products.
reply
knowitnone3
3 hours ago
[-]
I'm Martian so everything you create better support my language on day 1
reply
_ache_
1 hour ago
[-]
It's very impressive! I mean, it's better than other <200M TTS models I've encountered.

In English it's perfect, and it's so funny in other languages. It sounds exactly like someone who doesn't actually speak the language but gets through it anyway.

I don't know why Fantine is just better than the others in other languages. Javert seems to be the worst.

Try Jean in Spanish: « ¡Es lo suficientemente pequeño como para caber en tu bolsillo! » ("It's small enough to fit in your pocket!") sounds a lot like someone who doesn't understand the language.

Azelma in French, « C'est suffisamment petit pour tenir dans ta poche. » (the same sentence), is very good. I mean, half the words have a Québécois accent and half a French one, but hey, it's correct French.

Però non capisce l'italiano. ("But it doesn't understand Italian.")

reply
tschellenbach
5 hours ago
[-]
It's cool how lightweight it is. Recently added support to Vision Agents for Pocket. https://github.com/GetStream/Vision-Agents/tree/main/plugins...
reply
grahamrr
1 hour ago
[-]
Voices sound great! I see the sample rate can be adjusted; is there any way to adjust the actual speed of the voice?
reply
indigodaddy
3 hours ago
[-]
Perfect timing: this is exactly what I'm looking for for a fun little thing I'm working on. The voices sound good!
reply
GaggiX
6 hours ago
[-]
I love that everyone is making their own TTS model, as they are not as expensive to train as many other models. Also, there are plenty of different architectures.

Another recent example: https://github.com/supertone-inc/supertonic

reply
andai
5 hours ago
[-]
In-browser demo of Supertonic with WASM:

https://huggingface.co/spaces/Supertone/supertonic-2

reply
coder543
5 hours ago
[-]
Another one is Soprano-1.1.

It seems like it is being trained by one person, and it is surprisingly natural for such a small model.

I remember when TTS always meant the most robotic, barely comprehensible voices.

https://www.reddit.com/r/LocalLLaMA/comments/1qcusnt/soprano...

https://huggingface.co/ekwek/Soprano-1.1-80M

reply
nunobrito
6 hours ago
[-]
Thank you. Very good suggestion with code available and bindings for so many languages.
reply
syntaxing
5 hours ago
[-]
Is there something similar for STT? I'm using distilled Whisper models and they work OK. Sometimes they get what I say completely wrong.
reply
daemonologist
4 hours ago
[-]
Parakeet is not really more accurate than Whisper, but it's much faster; faster than realtime even on CPU: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3 . You have to use NeMo though, or mess around with third-party conversions. (It also has a big brother, Canary: https://huggingface.co/nvidia/canary-1b-v2. There's also the confusingly named/positioned Nemotron speech: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0...)
reply
satvikpendem
4 hours ago
[-]
Keep in mind Parakeet is pretty limited in the number of languages it supports compared to Whisper.
reply
oybng
4 hours ago
[-]
>If you want access to the model with voice cloning, go to https://huggingface.co/kyutai/pocket-tts and accept the terms, then make sure you're logged in locally with `uvx hf auth login` lol
reply
snvzz
5 hours ago
[-]
Relative to AmigaOS translator.device + narrator.device, this sure seems bloated.
reply