https://opensource.builders/alternatives/superwhisper
Just added Ghost Pepper, and you can actually create a skill.md with the features you need to build your own
>“Compare” - This is the most important part. Apps in the most saturated categories (whisper dictation, clipboard managers, wallpaper apps, etc.) must clearly explain their differentiation from existing solutions.
https://www.reddit.com/r/macapps/comments/1r6d06r/new_post_r...
Windows (Kotlin Multiplatform) => https://github.com/maceip/daydream
parakeet-tdt-0.6b-v2
When I most recently abandoned it, the trigger word would fire one time in five.
I built a cross-platform one using parakeet-mlx or faster-whisper. :)
But I did it because I wanted it to work exactly the way I wanted it.
Also, for kicks, I (codex) ported it to Linux. But because my Linux laptop isn't as fast, I've had to use a few tricks to make it fast. https://github.com/obra/pepper-x
My 2021 Google Pixel 6, when offline, can transcribe speech to text and correct things contextually. It can make a mistake and, as I continue to speak, go back and fix something earlier in the sentence. What tech does Google have shoved in there that predates Whisper and Qwen by five years? And why do we now need 1 GB of transformer weights to do it on a more powerful platform?
I was actually on the OneNote team when they were transitioning to an online-only transcription model because there was no one left to maintain the legacy on-device system.
It wasn't any sort of planned technical direction, just a lack of anyone wanting to maintain the old system.
I've switched away from Gboard to Futo on Android and exclusively use MacWhisper on macOS instead of the default Apple transcription model.
The latest open source local STT models people are running on devices are significantly more robust (e.g. whisper models, parakeet models, etc.). So background noise, mumbling, and/or just not having a perfect audio environment doesn't trip up the SoTA models as much (all of them still do get tripped up).
I work in voice AI and am using these models (both proprietary and local open source) every day. Night and day different for me.
1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.
2. It doesn't actually replace a USB drive. Most people I know e-mail files to themselves or host them somewhere online to be able to perform presentations, but they still carry a USB drive in case there are connectivity problems. This does not solve the connectivity issue.
3. It does not seem very "viral" or income-generating. I know this is premature at this point, but without charging users for the service, is it reasonable to expect to make money off of this?
I have collected the best open-source voice typing tools categorized by platform in this awesome-style GitHub repo. Hope you all find this useful!
On Linux, there's access to the latest Cohere Transcribe model and it works very, very well. Requires a GPU though. Larger local models generally shouldn't require a subordinate model for clean up.
Have you compared WhisperKit to faster-whisper or similar? You might be able to run large-v3-turbo successfully and negate the need for cleanup.
Incidentally, waiting for Apple to blow this all up with native STT any day now. :)
But in this case I built hyprwhspr for Linux (Arch at first).
The goal was (is) the absolute best performance, in both accuracy & speed.
Python, via CUDA, on an NVIDIA GPU, is where that exists.
For example:
The #1 model on the ASR (automatic speech recognition) hugging face board is Cohere Transcribe and it is not yet 2 weeks old.
The ecosystem choices allowed me to hook it up in a night.
Other accelerator hardware also works well on Linux, thanks to its broad driver support.
In short, the local STT peak is Linux/Wayland.
If this needs NVIDIA GPU acceleration for good performance, it is not useful to me; I have Intel graphics and Handy works fine.
That said: If handy works, no need whatsoever to change.
Not sure how you're running it, via whichever "app thing", but...
On resource-limited machines: "Continuous recording" mode outputs when silence is detected via a configurable threshold.
This outputs as you speak in more reasonable chunks; in aggregate "the same output" just chunked efficiently.
Maybe you can try hackin' that up?
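A rough sketch of that silence-chunking idea, assuming raw int16 mono samples and an arbitrary illustrative RMS threshold (the function name and defaults are hypothetical; a real build would use a proper VAD rather than plain RMS):

```python
import math

def chunk_on_silence(samples, rate=16000, threshold=500,
                     min_silence_s=0.5, frame_s=0.05):
    """Split int16 samples into speech chunks at long silent gaps.

    A frame counts as silent when its RMS falls below `threshold`
    (an illustrative value); a silent run longer than `min_silence_s`
    flushes the current chunk, ready to hand to the STT model.
    """
    frame_len = int(rate * frame_s)
    needed = int(min_silence_s / frame_s)
    chunks, current, silent_run, had_speech = [], [], 0, False
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / max(len(frame), 1))
        if rms < threshold:
            silent_run += 1
        else:
            silent_run, had_speech = 0, True
        current.extend(frame)
        if had_speech and silent_run >= needed:
            chunks.append(current)      # flush one "reasonable chunk"
            current, silent_run, had_speech = [], 0, False
    if had_speech:
        chunks.append(current)          # flush whatever speech remains
    return chunks

# One second of tone, one of silence, one of tone → two chunks.
demo = [1000] * 16000 + [0] * 16000 + [1000] * 16000
chunks = chunk_on_silence(demo)
```

In aggregate the transcript is the same; the model just sees bounded chunks instead of one long recording.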
Have you ever considered using a foot-pedal for PTT?
Apple incidentally already has native STT, but for some reason they just don't use a decent model yet.
Apparently they do have a better model, they just haven't exposed it in their own OS yet!
https://developer.apple.com/documentation/speech/bringing-ad...
Wonder what's the hold up...
For footpedal:
Yes, conceptually it’s just another evdev-trigger source, assuming the pedal exposes usable key/button events.
Otherwise we’d bridge it into the existing external control interface. Either way, hooks are there. :)
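For illustration, the dispatch side of treating a pedal as just another key source can be sketched like this. The key code, class, and hook names are hypothetical; a real build would feed events in from python-evdev's read loop rather than a list:

```python
# Assumption: the pedal enumerates as a keyboard-like device and maps
# to an otherwise-unused key code such as KEY_F13.
PEDAL_KEY = "KEY_F13"

class PushToTalk:
    """Treats any key source, foot pedal included, as a PTT trigger."""

    def __init__(self):
        self.recording = False
        self.log = []

    def handle(self, key_code, pressed):
        # Ignore every event that isn't the pedal's key.
        if key_code != PEDAL_KEY:
            return
        if pressed and not self.recording:
            self.recording = True
            self.log.append("start")    # real code: start audio capture
        elif not pressed and self.recording:
            self.recording = False
            self.log.append("stop")     # real code: stop and transcribe

ptt = PushToTalk()
for code, down in [("KEY_F13", True), ("KEY_A", True), ("KEY_F13", False)]:
    ptt.handle(code, down)
```

Press starts capture, release stops it; unrelated keys pass through untouched.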
Parakeet does both just fine.
Also, wish it was on nixpkgs, where at least it will be almost guaranteed to build forever =)
I've been using Parakeet v3, which is fantastic (and tiny). Confused why we're still seeing Whisper out there; there's been a lot of development.
It's also in many flavours, from tiny to turbo, and so can fit many system profiles.
That's what makes it unique and hard to replace.
Also vibe coded a way to use parakeet from the same parakeet piper server on my grapheneos phone https://zach.codes/p/vibe-coding-a-wispr-clone-in-20-minutes
E.g. if your name is `Donold` (pronounced like Donald), there is not a transcription model in existence that will transcribe it correctly. That means forget ever dictating your name or email; it will never come out right.
Combine that with any subtleties of speech you have, or industry jargon you frequently use and you will have a much more useful tool.
We have a ton of options for "predict the most common word that matches this audio data" but I haven't found any "predict MY most common word" setups.
https://developers.openai.com/cookbook/examples/whisper_prom...
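The cookbook linked above works by conditioning Whisper on a text prompt. As a sketch, here is a stdlib helper (hypothetical name) that packs personal vocabulary such as `Donold` into a prompt string you could then pass via the `initial_prompt` parameter exposed by openai-whisper and faster-whisper:

```python
def build_vocab_prompt(terms, max_chars=800):
    """Join personal vocabulary into a short priming prompt.

    Whisper conditions on roughly the last 224 tokens of the prompt,
    so cap the string rather than passing an unbounded glossary.
    Duplicates are dropped while preserving order.
    """
    prompt = "Glossary: " + ", ".join(dict.fromkeys(terms)) + "."
    return prompt[:max_chars]

prompt = build_vocab_prompt(["Donold", "hyprwhspr", "Parakeet"])
# Then, assuming faster-whisper is installed:
#   model.transcribe("audio.wav", initial_prompt=prompt)
```

This biases decoding toward your spellings rather than guaranteeing them, but in practice it fixes most name and jargon misses.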
I'll give a shoutout as well to Glimpse: https://github.com/LegendarySpy/Glimpse
An extra bonus is that Handy lets you add an automatic LLM post-processor. This is very handy for the Parakeet v3 model, which can sometimes repeat words or make recognition errors, for example duplicating a single recognized word a dozen dozen dozen dozen dozen times.
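That particular repetition artifact can also be cleaned up deterministically, without an LLM. A small sketch (not Handy's actual post-processor) that collapses immediately repeated word n-grams:

```python
def collapse_repeats(text, max_ngram=3):
    """Collapse back-to-back repeated word n-grams.

    E.g. 'a dozen dozen dozen times' -> 'a dozen times', and repeated
    phrases up to `max_ngram` words ('very good very good') collapse too.
    """
    words = text.split()
    out, i = [], 0
    while i < len(words):
        collapsed = False
        for n in range(max_ngram, 0, -1):
            gram = words[i:i + n]
            # Skip this gram if it exactly repeats what we just emitted.
            if len(gram) == n and out[-n:] == gram:
                i += n
                collapsed = True
                break
        if not collapsed:
            out.append(words[i])
            i += 1
    return " ".join(out)
```

A rule like this handles the common stutter case cheaply, leaving the LLM pass for genuine recognition errors.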
Once in a while it will only output a literal space instead of the actual translation, but if I go into the 'history' page the translation is there for me to copy and paste manually. Maybe some pasting bug.
"You know what would be useful?" followed by asking your LLM of choice to implement it.
Then again for a lot of scenarios it's your slop or someone else's slop.
I think the only difference is that I keep my own slop tools private.
I did that so that I could record my own inputs and finetune parakeet to make it accurate enough to skip post-processing.
It's used by this dictation app: https://github.com/altic-dev/FluidVoice/
I like that openwhisper lets me do on device and set a remote provider.
The button next to it pastes when I press it. If I press it again, it hits the enter command.
You can get a lot done with two buttons.
Project repo: https://github.com/finnvoor/yap
What makes the others vastly better?
I barely bother checking Whisper models anymore.
You could hook it up to some workflow over the local API depending on how you want to dump the text, but the web UI is good too.
The Show HN by the author was at: https://news.ycombinator.com/item?id=44145564
EDIT: I see there is an open issue for that on github
Would you consider making available a video showing someone using the app?
The macOS built-in STT (dictation) seems better than all the 3rd-party local apps I tried in the past that people raved about. I have tried several.
Is this better somehow?
If the 3rd party apps did streaming with typing in place and corrections within a reasonable window when they understand things better given more context, that would be cool. Theoretically, a custom model or UX could be "better" than what comes free built into macOS (more accurate or customizable).
But when I contacted the developer of my favorite one they said that would be pretty hard to implement due to having to go back and make corrections in the active field, etc.
I assume streaming STT in these utilities for Mac will get better at some point, but I haven't seen it yet (been waiting). It seems these tools generally are not streaming, e.g. they want you to finish speaking first before showing you anything. Which doesn't work for me when I'm dictating. I want to see what I've been saying lately, to jog my memory about what I've just said and help guide the next thing I'm about to say. I certainly don't want to split my attention by manually toggling the control (whether PTT or not) periodically to indicate "ok, you can render what I just said now".
I guess "hold-to-talk" tools are for delivering discrete, fully formed messages, not for longer, running dictation.
AFAICT, TFA is focused on hold-to-talk as the differentiator, over double-tap to begin speaking and double-tap to end speaking?
https://news.ycombinator.com/item?id=45930659
Works great for me!
Also, on a Mac with 32GB of RAM, 24GB of that (75%) is available to the GPU, and that makes the models run much faster. On my 64GB MacBook Pro, 48GB is available to the GPU. Have you priced an NVIDIA GPU with 48GB of RAM? It's simply cheaper to do this on Macs.
Macs are just better for getting started with this kind of thing.
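The 75% figure in both examples above matches the default GPU working-set limit on Apple Silicon's unified memory; a trivial sketch of the arithmetic (the sysctl named in the comment is the commonly cited knob for raising the cap):

```python
def default_gpu_budget_gb(ram_gb, fraction=0.75):
    """Approximate unified memory visible to the GPU on Apple Silicon.

    0.75 is the commonly observed default working-set fraction; it can
    be raised with the iogpu.wired_limit_mb sysctl if a model almost fits.
    """
    return ram_gb * fraction

# 32GB Mac → 24.0GB for the GPU; 64GB Mac → 48.0GB.
```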