Kokoro was posted and it works on WebGPU; absolutely incredible quality for where it can run.
I can go back and try to repro and get a recording....
Interesting, because the hero image is a Mac App screenshot.
Umm, it does.
> We don't currently support Apple Silicon, as there is not yet a Kokoro implementation in MLX. As soon as it will be available, we will support it.
I thought that meant that it didn't support Apple Silicon in general, but they were just talking about GPU support.
I might try using F5-TTS-MLX instead actually (https://github.com/lucasnewman/f5-tts-mlx) and see how that does.
Companies won't stop pulling this garbage unless we stop supporting them.
I know the dev team's decisions were disappointing, but it's also worth pointing out that the site was kept up until around last month, despite the warning stating it'd be down in November.
Omnivore could have closed up their code base and prevented self-hosting entirely. I'm glad they didn't.
As for contribution model, it’s still something I’m trying to figure out. For the moment, it was just trying to get a self host build ready and working.
But I have admin rights to the repo, and I am not working for ElevenLabs, nor officially for Omnivore. I was just a contributor before.
Open source models drive proprietary foundation models' margin to zero.
The only reason elevenlabs became a unicorn was their margin. If they became a commodity, they'd find themselves in a deep pit.
I was really hoping they would have fixed these issues by now, because it was promising. This app truly does feel like a portfolio demo app for a text-to-speech engine company rather than an actual reader app.
UPDATE: yes, I have actually used the app, no it does not work well. See replies for details.
Fwiw, I would use their app way more if it were better. Right now I use it for 1-2 long-form articles at a time. I'm sometimes willing to push buttons in order to stay focused, but I'll bail out to e.g. my podcast app if that becomes untenable.
Some of their voices sound very artificial, some very real. I've been slowly making a list of the good ones.
I use it to convert long articles into audio, and have a script to add it to my podcast feed to listen to while driving:
https://blog.nawaz.org/posts/2024/Apr/reading-articles-via-p...
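The feed-update step of a pipeline like this can be quite small. Here is a minimal sketch of appending a generated MP3 as a new episode to an RSS feed; it is not taken from the linked post, and the file names, URLs, and function names are all illustrative:

```python
import xml.etree.ElementTree as ET

# Append a generated MP3 as a new <item> (episode) to a podcast RSS feed.
# The feed string and MP3 URL below are placeholders, not real resources.
def add_episode(feed_xml: str, title: str, mp3_url: str, size: int) -> str:
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = title
    enc = ET.SubElement(item, "enclosure")  # podcast apps read the enclosure tag
    enc.set("url", mp3_url)
    enc.set("length", str(size))
    enc.set("type", "audio/mpeg")
    return ET.tostring(root, encoding="unicode")

feed = "<rss version='2.0'><channel><title>Articles</title></channel></rss>"
updated = add_episode(feed, "Some long-form article",
                      "https://example.com/ep1.mp3", 12345)
print(updated)
```

A real feed would also want `pubDate`, `guid`, and a description, but most podcast clients will pick up an item with just a title and an enclosure.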
One other feature I'd really like: Having the AI figure out who is saying what and use different voices (e.g. one voice for overall narrator, and separate voices for each person who is quoted in the article).
Not sure if any of the solutions out there do that automatically without my guidance.
(Still probably wouldn't pay more than $2/mo for it - I just don't use it often enough to justify paying much).
And, you know, this is not a service I'd provide others. Just for my own use running from my PC. Audible won't know or care, just as no one cares if you borrow a book from the library and photocopy it for your own use.
It's unfortunate that I can't export audio clips locally; otherwise I would immediately look into using this for generating my Finnish flashcard decks from the same material [2]. I've thought about doing the same with the audio and video feeds included with this news broadcast, but getting Whisper to sync up properly with what's written down and cutting up the raw audio in that way still seems like more effort than I'm willing to invest right now.
elevenlabs has an API which seemed quite reasonable when I looked into it. A bit of python should get you what you want pretty quickly.
If you are looking to convert very short texts or words into speech, I had the best results with eleven_multilingual_v2 and the following TTS input: "Hän sanoo rauhallisesti ja hitaasti: <break time=\"1.0s\" /> '${text}'" ("He says calmly and slowly: ..."). Then I use post-processing to split at the silence.
This was necessary because you cannot set the language explicitly; it is detected from the input.
With eleven_turbo_v2_5 you can set the language, but the results are not as good.
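The split-at-silence post-processing mentioned above can be done on the raw PCM samples once the audio is decoded. This is a rough sketch with illustrative thresholds that would need tuning against real output:

```python
# Find loud segments in a list of PCM samples, separated by runs of at
# least `min_gap` consecutive quiet windows. Threshold values here are
# illustrative; real TTS output needs tuning.
def split_at_silence(samples, window=100, threshold=500, min_gap=3):
    """Return (start, end) sample ranges of the non-silent segments."""
    segments, start, quiet_run = [], None, 0
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        peak = max(abs(s) for s in chunk)
        if peak >= threshold:
            if start is None:
                start = i          # a loud segment begins here
            quiet_run = 0
        else:
            quiet_run += 1
            if start is not None and quiet_run >= min_gap:
                # close the segment where the silence started
                segments.append((start, i - (min_gap - 1) * window))
                start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments

# Synthetic check: loud / long silence / loud should give two segments.
audio = [1000] * 1000 + [0] * 1000 + [1000] * 1000
print(split_at_silence(audio))
```

In practice the injected `<break time="1.0s" />` produces a silence long enough that a generous `min_gap` cleanly separates the framing sentence from the target word.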
E.g. saying "1963" when the actual year in the text was 1967. Yeah, the voices sound very realistic, but I'm not sure how useful that is if you can't trust the spoken words.
Does anyone know if it got better in the last weeks?
So much of my time for "reading" is in a context where I can't physically read, so audiobooks are incredibly useful. But being limited to the set of books that gets recorded by the publisher is a real shame.
Haven't tried it yet, but AI TTS seems basically perfect now, so I'm very optimistic this will work great.
Since then, they’ve released a few cheaper models, but the quality suffers greatly (they still have the old models though so it’s not an issue). They’ve also been releasing a ton of different products around TTS.
I don’t mean this as a criticism — I just am curious why SOTA TTS has not improved from one model by one company several years ago, and why even said company isn’t able to improve on that model.
Which is why it's especially ridiculous ElevenLabs allows professionals to upload their voices, charges users of those voices a minimum of $50 per million characters, likely pays under $1 for the compute... and then passes on a whopping $2 back to the professional.
I think the next disruptive TTS competitor is going to form out of just offering to pay better rates than ElevenLabs to their PVC users.
Finetuning established architectures on cleaner synthetic data is already getting open source models increasingly competitive, so getting top PVC samples from the source would likely put you right about where they are today.
Edit: And since you're concerned we might not be aware of Elevenlabs' generous terms... why is your documentation so cagey about them? https://elevenlabs.io/docs/product-guides/voices/payouts#thi...
I see users need to keep paying you a subscription fee in order to even get their payouts... but "up to 20%" isn't saying particularly much without the kind of details that should probably be on that page.
-
Considering how much your company owes to an open source model, it's also impressive how little you've returned to the commons.
But no worries, the top comment under this post is an open source model that was finetuned for a couple of thousand dollars by a single dude soliciting the public for random voice samples.
If Google has no moat, you're out to sea.
Maybe you're in a bubble devoid of that kind of thinking, so it seems very foreign or quaint.
Even then it's short-sighted thinking at best: the "market rate" is not some magic self-optimizing number.
Underpaying their creators is just creating the opportunity for someone to take the best of them on better terms.
-
Elevenlabs is also able to raise trivially in this environment: you'd think while they're still floating out here without a moat other than high quality data, they'd overpay if anything and make narrators feel like royalty until they're replaced.
This isn't unlike Uber initially paying drivers massive bonuses and undercharging riders until they were able to leverage their massive network to increase prices past what the taxi providers they had decimated were charging. But in this case the marginal cost of providing the service is so low they don't even have to lose money to run a similar play, just take less of it. (in other words, even ruthless greed is not antithetical to paying these folks better)
I hire someone to paint a fence. We agree on $200 for the job and I pay them $200. We both know that it’s undifferentiated work and I could find a dozen other people who would do it for $200.
Where’s the lack of integrity? Or does it just appear if I know that I actually could afford to pay them $10,000, but chose not to?
You're building a fence-painting robot and need someone to teach it how to paint fences by example.
You decide you won't pay the fence painters to teach the robot upfront.
Instead, painters will pay you $20 to even visit the factory.
Then, if a particular painter's fence painting is especially highly rated, you'll pay them a small royalty.
So you send your fence painting robot to compete with fence painters for $20 a fence, passing on a tiny slice of the $20 to the ones who helped teach it.
-
We can consider the creation of fence robot and the competition with the existing market just another piece of the steady march of progress, but there's still obvious room to act with more integrity in this situation.
There was no established rate for fair wages to teach the robot to paint, and you can't pay $200 because you don't charge $200... but it's also probably not "-$22/month + 1% royalties".
Does anyone here actually have positive results doing this? It seems to me listening to anything that's even remotely complex with the intent of learning it just isn't something that's feasible.
But just to clarify, do you listen to dense material on commutes or while on walks or are you listening and taking notes at a table or something?
That's my main point. As someone who doesn't have dyslexia, sitting at a table and taking notes while reading dense material is already quite difficult when I'm trying to actually learn the material. I couldn't imagine learning effectively by just listening on an early morning commute or something like the promotional video shows.
I get easily distracted and lose attention while listening to an audiobook. This is usually problematic with fiction, because suddenly I don't know who this new character is or what's happening. And rewinding to the precise position where I stopped paying attention is of course much more difficult than in written text.
I found that non-fiction books work great for me, because even if you ignore a page or two it makes no difference, the author keeps repeating their point and propping it up with many arguments anyway.
No audiobook exists? Drop the epub into ElevenReader and have Burt Reynolds read it to you; honestly better than some human narrators.
Ideally, I'd be able to strip out the text content and send it to my kindle in readable form. Since apparently that's science fiction, this looks like a really good plan B! Will definitely give it a go.
By automating it, it lowers the barrier to accessing this type of audio content for the masses. If you want to pay someone to read something for you, the market still allows that. This feels like a net gain.
I can't even remotely agree.
Narrating a book is absolutely an art. Listen to a book narrated by Stephen Fry, and all other books will sound awful. Considerable care and craft goes into a well-read book.
But this is why I'm actually excited about good TTS tools. Not because I want to displace Stephen Fry, but because there are so many books read by awful narrators and something like ElevenReader would be a huge step up in quality.
I share the parent commenter's concerns about the displacement of artists, but I'm less convinced that TTS tools are a net negative.
We already lived this with social networks. Initially, we tech enthusiasts were all like "it will democratize access to news, it will democratize producing the news! Curated work will still be there; it's a net gain." And we all saw how that actually developed. As someone on the Internet said: I want AI to do my laundry and repetitive tasks so I can do art and other more interesting things; I don't want AI to do art and force me to do laundry by hand, because AI took my job and now I don't have money to pay for a washing machine.
So I guess in your worldview a concert violinist also doesn’t make art, when they are playing a Mozart composition?
> can't find better things to do, such that it makes them poorer, or anti-social its a loss
I feel like this misses the point a bit - lost income/sustainability for artists is obviously a big issue we'll be facing, but looking for a performance indicator in an artistic endeavour doesn't really get you anywhere. There's more ways to value a painting than "what the market would pay" and "potential heat output as firewood", right?
Do you people ever step out of the abstract and think about the actual context you're living in?
I agree with your criticism, just not sure you understand who you were criticizing. But I hope you can think about actual context and see if that tempers what seems like a pretty emotional take on AI.
Having natural sounding TTS enhances accessibility for blind users, enables language localizations, etc. It's 100% a win even though there will be (and already is) disruption in the VA community.
My main use case is comp. sci and philosophy books. I download PDFs of varying quality off the internet onto my phone and import them into this app. The text translation is always solid but for the former, graphs and diagrams really break it. It’s a tricky problem because these often are important to the text so skipping them (for the app) isn’t ideal but the current solution just makes the reader goof up. I think it would be cool if the model could identify these objects and maybe generate some text describing the object and TTSing that. Minor gripe and for the latter, it’s perfect.
I’ve probably used this app for 70 reading hours at 1.5x speed across long road trips and walking my dog at the park. I’ve gotten through numerous books I wouldn’t have and for free. I’m happy!
(annoying bug I find often: it seems certain characters or tokens just break it and it freezes. I need to manually skip ahead hoping it doesn’t get stuck again. Really detracts from the hands free nature and is difficult to manage while driving)
The text to speech is alright, but it lacks almost any emotion, and it reads everything literally, which when the article/pdf has a weird layout, or has figures, doesn't sound natural. Though I expect they're just not using their top-of-the-line models for this - I've had much more luck pushing a pdf through Claude to generate the "verbal version" (which is mostly literal, but also describes the layout and figures) and then the result through the top-of-the-line ElevenLabs model.
Now, I've also checked out the podcast feature, and it's pretty clear they first do a textual generation, and then a simple text to speech. Again, lack of emotion, very mechanical flow.
I made a podcast of a technical article[0] in both ElevenLabs reader and Google's NotebookLM, and the NotebookLM podcast is a night-and-day improvement - maybe they use a better model, maybe they use straight "article to podcast" end-to-end multimodal generation, I don't know, but the quality, flow, emotion, is just on a completely different level. I had to quickly turn off the ElevenLabs-generated podcast cause I couldn't keep listening to it, while NotebookLM's one is legitimately enjoyable.
Now to finish on a more positive note, fingers crossed for the ElevenLabs team improving this, and us getting some competition in the area of article-to-audio, both podcast-style, and direct! I think, in general, it's a very promising product direction. Feature-wise, I would also love to get a daily overview podcast based on all my RSS feed articles for a given day.
That said, even in their cherry-picked examples, the emphasis still isn't quite right in the Tolkien one.
For a long time I wanted to make a game - think The Stanley Parable or Thomas Was Alone - that would be narrated by the voice of either David Attenborough or Morgan Freeman. You know, it's a low hanging fruit, you can have a two hours long footage of zebras running around narrated by either of these and it's suddenly eerily fascinating.
So far I'm an AI skeptic, but this voice thing really makes me think about an actual shift in how certain jobs can become irrelevant in the foreseeable future.
Services are expensive and in most cases the voices are easily detectable as not human. I would find it very hard to listen to such voices for a long period of time.
Even ElevenLabs voices which seem to be known as the best have only a few that are really good quality but even then they're very, very far from the capabilities of a human.
Edit: I think the effect of the invention of vinyl on live performers is more akin to how the commoditisation of HQ TTS will be detrimental to audiobook narrators.
Two audiobooks that come to my mind:
- The Lord of the Rings series read by Andy Serkis; not only does he switch perfectly between each character's voice, but the feeling of listening to "Gollum" for hours is something else altogether
- David Goggins' books; the audiobook version is completely different from the book, since he's not just reading it, and overall it makes the content easier to digest
https://chasingperfection.co.uk/post/2013/01/14/text-to-spee...
Can anyone else confirm?
[1] - https://unicornriot.ninja/2024/sextortion-coms-inside-a-vile...
I use it to listen to PDFs. It works, but has plenty of hiccups with headers, footers and colons.
The first impression is not that great. There's nothing natural about the voice. While individual words and phrases sound good, there's still no decent cadence and intonation. Feels flat and robotic.
However, I will definitely experiment some more.
Here is an example: https://youtube.com/shorts/UKjqrydITLA?si=iC7ehp6LmlLH0M-U
Probably because I have WebGL disabled in this browser. Not exactly sure what they're doing with it on the landing page, maybe the fluffy effects.
Could you expand upon this? Any milestones towards that which we should be mindful of?
With LLMs, "knowing things" is already starting to feel like a thing of the past; not to me, but for a lot of others there's no longer an incentive to "switch on".
Why should a kid learn anything if a robot is instantly better at everything? Maths got replaced by calculators, and deep critical thinking will get replaced by LLMs a lot of the time, which are word calculators, the closest thing we have to a logic calculator.
This is more passive autopilot software, which further promotes learning as something you 'consume' rather than something you seek.
The public consciousness has absolutely taken a September 11-tier nosedive since social media; we're approaching what I term cultural schizophrenia, which I posted about on my blog. I had deleted it, but I've re-added it if you're interested [https://substack.com/home/post/p-156983317]. There are no contextualisers in the media to give the right emphasis to the right things.
This is just my perspective, from what I've seen from other people my age. We are heading into extremely interesting times; every profoundly destabilising thing we've speculated about is happening at the exact same time. We desperately need visionaries in politics.
Basically I'm not doing too hot
The world is changing, but then again it always has been. IMO some things will get better, some will get worse, but the overall arc of human health and prosperity will continue upwards. There is less poverty, less starvation, more opportunity today than ever… even though some aspects of the world are bad and getting worse. That's the way it's always been.
I found that it’s my preferred way to use their reader, as it makes the reading more neutral and transparent for my brain.
I need to find some AI-assisted OCR to fix the tons of mistakes like "186o" for 1860 or "gla)" for glad.
Also check out my open source project for that:
Hyphenated words, page numbers and chapter titles seem to be my main issue. I can easily do search and replace on chapter titles though.
Before, we were hiring people to translate, and then hiring others to dub the audio. Now, our files are automatically translated and spoken in the voice of the actual speaker, and we just have a small Quality Control team of native speakers quickly verify the results are accurate. We've reduced costs and increased the quality of our translated media.
> Is the app free?
> Yes. The app is completely free to download and use today. Listening to content on the app will not consume credits from your monthly web plan. We do plan to eventually launch some premium version of the app, but even then we will maintain a generous free plan.
you could build your local TTS using kokoro browser though — https://huggingface.co/spaces/webml-community/kokoro-webgpu
Ingests URLs in a variety of ways, converts to natural language audio, puts it in your podcast feed.
Free to use.
1.2 million people die in road accidents every year, many of them children and young people. Even more are seriously injured.
If that's the case, maybe a driver's license isn't your thing?
If you are reading for information, I guess if this helps, sure, go ahead.
When reading for pleasure, though, this is not it.