The state of modern AI text to speech systems for screen reader users
37 points | 4 hours ago | 7 comments | stuff.interfree.ca | HN
cachius
1 hour ago
Gloomy bottom line:

So what's the way forward for blind screen reader users? Sadly, I don't know.

Modern text to speech research has little overlap with our requirements. Using Eloquence [32-bit voice last compiled in 2003], the system that many blind people find best, is becoming increasingly untenable. ESpeak uses an odd architecture originally designed for computers in 1995, and has few maintainers. Blastbay Studios [...] is a closed-source product with a single maintainer, that also suffers from a lack of pronunciation accuracy.

In an ideal world, someone would re-implement Eloquence as a set of open source libraries. However, doing so would require expertise in linguistics, digital signal processing, and audiology, as well as excellent programming abilities. My suspicion is that modernizing the text to speech stack that is preferred by blind power-users is an effort that would require several million dollars of funding at minimum.

Instead, we'll probably wind up having to settle for text to speech voices that are "good enough", while being nowhere near as fast and efficient [800 to 900 words per minute] as what we have currently.

Jeff_Brown
1 hour ago
This surprises me: "These modern systems are developed to sound human, natural, and conversational. Unfortunately this seems to come at the expense of accuracy. In my testing, both models had a tendency to skip words, read numbers incorrectly, chop off short utterances, and ignore prosody hints from text punctuation. "
rhdunn
15 minutes ago
It's not just screen reader users. I use TTS to listen to text content, and the AI TTS voices I've tried have the same issues with skipping words or generating garbled output in sections.

I don't know if this is a data/transcription issue, an issue with noisy audio, or what.

nuc1e0n
3 hours ago
Has anyone considered decompiling Eloquence with something like Ghidra or IDA Pro? Super Mario 64 was turned back into high-level source code this way.
aaronbrethorst
55 minutes ago
Who owns Eloquence and why hasn’t a new version been released since 2003?

I feel like there’s a lot of backstory I’m missing.

46493168
16 minutes ago
Microsoft. A new version hasn’t been released because Microsoft, like most companies, doesn’t take accessibility seriously.

The original Eloquence TTS was developed as ETI-Eloquence. ScanSoft acquired the speech recognition company SpeechWorks in 2003, and in October 2005 ScanSoft merged with Nuance Communications, with the combined company adopting the Nuance name. Currently, Code Factory distributes ETI-Eloquence for Windows as a SAPI 5 TTS synthesizer, though I can’t figure out the exact licensing relationship between Code Factory and Nuance, which was acquired by Microsoft in 2022.

superkuh
44 minutes ago
What use is human-sounding TTS when your desktop cannot read the contents of its windows?

As someone with progressive retinal tearing who's used the Linux desktop for 20 years, I'm terrified. The forcing of the various incompatible Waylands by the big Linux corps has meant the end of support for screen readers.

The only Wayland compositor that supports screen readers on Linux is GNOME's Mutter, and they literally only added that support last year (after 15 years of Wayland). And instead of supporting the standard AT-SPI and the existing protocols that Orca and the like use, GNOME decided to come up with two new in-house GNOME protocols for doing it (which themselves don't send the full window tree or anything on request, but instead push only info about single windows, etc.). No other Wayland compositor supports screen readers, and without any standardization no developers will ever support screen readers on Wayland. Basically only GNOME's userspace will sort of support it. There's no hope for non-X11-based screen readers, and all the megacorps say they're dropping X11 support.

The only option I have is to use and maintain old X11 Linux distros myself. But eventually things like TLS CA certificates and browsers just won't be feasible for me to backport and compile myself. Eventually I'm going to have to switch to using Windows. It's a sad, sad state of things.

And regarding AI-based text to speech: almost all of it kind of sucks for screen readers. Particularly the random garbled AI noises that happen between and at the end of utterances, the inaccurate readings, etc. in many models. Not to mention requiring a GPU and lots of system resources. The old Festival 1.96 Nitech HTS voices from the early 2000s, running on a Core 2 Duo CPU, are incomparably faster, more accurate, and sound decent enough to understand.

blabla_bla
59 minutes ago
The author reviews recent AI-based text-to-speech (TTS) systems and finds that despite big advances in AI voices, they still *don't meet the needs of screen reader users*. The traditional voices used by blind users are fast, clear, and predictable — qualities not matched by newer AI models, which tend to be slower, less accurate in pronunciation, and lacking in customization. Issues include heavy software dependencies, slower startup, misread words and numbers, and poor control over voice characteristics. The result is that modern AI TTS isn't yet suitable for everyday screen reader use, and the good legacy systems remain hard to replace.