https://opensource.builders/alternatives/superwhisper
Just added Ghost Pepper, and you can actually create a skill.md with the features you need to build your own
>“Compare” - This is the most important part. Apps in the most saturated categories (whisper dictation, clipboard managers, wallpaper apps, etc.) must clearly explain their differentiation from existing solutions.
https://www.reddit.com/r/macapps/comments/1r6d06r/new_post_r...
Windows (Kotlin Multiplatform) => https://github.com/maceip/daydream
parakeet-tdt-0.6b-v2
When I most recently abandoned it, the trigger word would fire one time in five.
I built a cross-platform one using parakeet-mlx or faster-whisper. :)
But I did it because I wanted it to work exactly the way I wanted it.
Also, for kicks, I (codex) ported it to Linux. But because my Linux laptop isn't as fast, I've had to use a few tricks to make it fast. https://github.com/obra/pepper-x
My 2021 Google Pixel 6, when offline, can transcribe speech to text and correct things contextually. It can make a mistake and, as I continue to speak, go back and fix something earlier in the sentence. What tech does Google have shoved in there that predates Whisper and Qwen by five years? And why do we now need 1 GB of transformer weights to do it on a more powerful platform?
I was actually on the OneNote team when they were transitioning to an online-only transcription model because there was no one left to maintain the legacy on-device system.
It wasn't any sort of planned technical direction, just a lack of anyone wanting to maintain the old system.
I've switched away from Gboard to Futo on Android and exclusively use MacWhisper on macOS instead of the default Apple transcription model.
The latest open source local STT models people are running on devices are significantly more robust (e.g. whisper models, parakeet models, etc.). So background noise, mumbling, and/or just not having a perfect audio environment doesn't trip up the SoTA models as much (all of them still do get tripped up).
I work in voice AI and am using these models (both proprietary and local open source) every day. Night and day different for me.
1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.
2. It doesn't actually replace a USB drive. Most people I know e-mail files to themselves or host them somewhere online to be able to perform presentations, but they still carry a USB drive in case there are connectivity problems. This does not solve the connectivity issue.
3. It does not seem very "viral" or income-generating. I know this is premature at this point, but without charging users for the service, is it reasonable to expect to make money off of this?
I have collected the best open-source voice typing tools categorized by platform in this awesome-style GitHub repo. Hope you all find this useful!
On Linux, there's access to the latest Cohere Transcribe model and it works very, very well. Requires a GPU though. Larger local models generally shouldn't require a subordinate model for clean up.
Have you compared WhisperKit to faster-whisper or similar? You might be able to run large-v3-turbo successfully and negate the need for cleanup.
Incidentally, waiting for Apple to blow this all up with native STT any day now. :)
But in this case I built hyprwhspr for Linux (Arch at first).
The goal was (is) the absolute best performance, in both accuracy & speed.
Python, via CUDA, on an NVIDIA GPU, is where that exists.
For example:
The #1 model on the ASR (automatic speech recognition) hugging face board is Cohere Transcribe and it is not yet 2 weeks old.
The ecosystem choices allowed me to hook it up in a night.
Other accelerator hardware also works well on Linux, thanks to its broad driver support.
In short, the local STT peak is Linux/Wayland.
If this needs NVIDIA GPU acceleration for good performance, it is not useful to me; I have Intel graphics and Handy works fine.
That said: If handy works, no need whatsoever to change.
Not sure how you're running it, via whichever "app thing", but...
On resource-limited machines: "Continuous recording" mode outputs when silence is detected via a configurable threshold.
This outputs as you speak in more reasonable chunks; in aggregate "the same output" just chunked efficiently.
Maybe you can try hackin' that up?
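A rough sketch of that silence-chunking idea, assuming raw int16 mono samples and an arbitrary illustrative RMS threshold (the function name and defaults are hypothetical; a real build would use a proper VAD rather than plain RMS):

```python
import math

def chunk_on_silence(samples, rate=16000, threshold=500,
                     min_silence_s=0.5, frame_s=0.05):
    """Split int16 samples into speech chunks at long silent gaps.

    A frame counts as silent when its RMS falls below `threshold`
    (an illustrative value); a silent run longer than `min_silence_s`
    flushes the current chunk, ready to hand to the STT model.
    """
    frame_len = int(rate * frame_s)
    needed = int(min_silence_s / frame_s)
    chunks, current, silent_run, had_speech = [], [], 0, False
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / max(len(frame), 1))
        if rms < threshold:
            silent_run += 1
        else:
            silent_run, had_speech = 0, True
        current.extend(frame)
        if had_speech and silent_run >= needed:
            chunks.append(current)      # flush one "reasonable chunk"
            current, silent_run, had_speech = [], 0, False
    if had_speech:
        chunks.append(current)          # flush whatever speech remains
    return chunks

# One second of tone, one of silence, one of tone → two chunks.
demo = [1000] * 16000 + [0] * 16000 + [1000] * 16000
chunks = chunk_on_silence(demo)
```

In aggregate the transcript is the same; the model just sees bounded chunks instead of one long recording.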
Have you ever considered using a foot-pedal for PTT?
Apple incidentally already has native STT, but for some reason they just don't use a decent model yet.
Apparently they do have a better model, they just haven't exposed it in their own OS yet!
https://developer.apple.com/documentation/speech/bringing-ad...
Wonder what's the hold up...
For footpedal:
Yes, conceptually it’s just another evdev-trigger source, assuming the pedal exposes usable key/button events.
Otherwise we’d bridge it into the existing external control interface. Either way, hooks are there. :)
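For illustration, the dispatch side of treating a pedal as just another key source can be sketched like this. The key code, class, and hook names are hypothetical; a real build would feed events in from python-evdev's read loop rather than a list:

```python
# Assumption: the pedal enumerates as a keyboard-like device and maps
# to an otherwise-unused key code such as KEY_F13.
PEDAL_KEY = "KEY_F13"

class PushToTalk:
    """Treats any key source, foot pedal included, as a PTT trigger."""

    def __init__(self):
        self.recording = False
        self.log = []

    def handle(self, key_code, pressed):
        # Ignore every event that isn't the pedal's key.
        if key_code != PEDAL_KEY:
            return
        if pressed and not self.recording:
            self.recording = True
            self.log.append("start")    # real code: start audio capture
        elif not pressed and self.recording:
            self.recording = False
            self.log.append("stop")     # real code: stop and transcribe

ptt = PushToTalk()
for code, down in [("KEY_F13", True), ("KEY_A", True), ("KEY_F13", False)]:
    ptt.handle(code, down)
```

Press starts capture, release stops it; unrelated keys pass through untouched.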
Parakeet does both just fine.
Also, wish it was on nixpkgs, where at least it will be almost guaranteed to build forever =)
I've been using Parakeet v3, which is fantastic (and tiny). Confused why we're still seeing Whisper out there; there's been a lot of development.
It's also in many flavours, from tiny to turbo, and so can fit many system profiles.
That's what makes it unique and hard to replace.
Also vibe coded a way to use parakeet from the same parakeet piper server on my grapheneos phone https://zach.codes/p/vibe-coding-a-wispr-clone-in-20-minutes
E.g. if your name is `Donold` (pronounced like Donald), there is not a transcription model in existence that will transcribe it correctly. That means forget ever dictating your name or email; it will never come out right.
Combine that with any subtleties of speech you have, or industry jargon you frequently use and you will have a much more useful tool.
We have a ton of options for "predict the most common word that matches this audio data" but I haven't found any "predict MY most common word" setups.
https://developers.openai.com/cookbook/examples/whisper_prom...
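The cookbook linked above works by conditioning Whisper on a text prompt. As a sketch, here is a stdlib helper (hypothetical name) that packs personal vocabulary such as `Donold` into a prompt string you could then pass via the `initial_prompt` parameter exposed by openai-whisper and faster-whisper:

```python
def build_vocab_prompt(terms, max_chars=800):
    """Join personal vocabulary into a short priming prompt.

    Whisper conditions on roughly the last 224 tokens of the prompt,
    so cap the string rather than passing an unbounded glossary.
    Duplicates are dropped while preserving order.
    """
    prompt = "Glossary: " + ", ".join(dict.fromkeys(terms)) + "."
    return prompt[:max_chars]

prompt = build_vocab_prompt(["Donold", "hyprwhspr", "Parakeet"])
# Then, assuming faster-whisper is installed:
#   model.transcribe("audio.wav", initial_prompt=prompt)
```

This biases decoding toward your spellings rather than guaranteeing them, but in practice it fixes most name and jargon misses.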
I'll give a shoutout as well to Glimpse: https://github.com/LegendarySpy/Glimpse
An extra bonus is that Handy lets you add an automatic LLM post-processor. This is very handy for the Parakeet v3 model, which can sometimes repeat words or make recognition errors, for example duplicating a single recognized word a dozen dozen dozen dozen dozen times.
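That particular repetition artifact can also be cleaned up deterministically, without an LLM. A small sketch (not Handy's actual post-processor) that collapses immediately repeated word n-grams:

```python
def collapse_repeats(text, max_ngram=3):
    """Collapse back-to-back repeated word n-grams.

    E.g. 'a dozen dozen dozen times' -> 'a dozen times', and repeated
    phrases up to `max_ngram` words ('very good very good') collapse too.
    """
    words = text.split()
    out, i = [], 0
    while i < len(words):
        collapsed = False
        for n in range(max_ngram, 0, -1):
            gram = words[i:i + n]
            # Skip this gram if it exactly repeats what we just emitted.
            if len(gram) == n and out[-n:] == gram:
                i += n
                collapsed = True
                break
        if not collapsed:
            out.append(words[i])
            i += 1
    return " ".join(out)
```

A rule like this handles the common stutter case cheaply, leaving the LLM pass for genuine recognition errors.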
Once in a while it will only output a literal space instead of the actual translation, but if I go into the 'history' page the translation is there for me to copy and paste manually. Maybe some pasting bug.
"You know what would be useful?" followed by asking your LLM of choice to implement it.
Then again for a lot of scenarios it's your slop or someone else's slop.
I think the only difference is that I keep my own slop tools private.
I did that so that I could record my own inputs and finetune parakeet to make it accurate enough to skip post-processing.
It's used by this dictation app: https://github.com/altic-dev/FluidVoice/
I like that openwhisper lets me do on device and set a remote provider.
The button next to it pastes when I press it. If I press it again, it hits the enter command.
You can get a lot done with two buttons.
Project repo: https://github.com/finnvoor/yap
What makes the others vastly better?
I barely bother checking Whisper models anymore.
You could hook it up to some workflow over the local API depending on how you want to dump the text, but the web UI is good too.
The Show HN by the author was at: https://news.ycombinator.com/item?id=44145564
EDIT: I see there is an open issue for that on github
Would you consider making available a video showing someone using the app?
The macOS built-in STT (dictation) seems better than all the 3rd-party local apps I tried in the past that people raved about. I have tried several.
Is this better somehow?
If the 3rd party apps did streaming with typing in place and corrections within a reasonable window when they understand things better given more context, that would be cool. Theoretically, a custom model or UX could be "better" than what comes free built into macOS (more accurate or customizable).
But when I contacted the developer of my favorite one they said that would be pretty hard to implement due to having to go back and make corrections in the active field, etc.
I assume streaming STT in these utilities for Mac will get better at some point, but I haven't seen it yet (been waiting). It seems these tools generally are not streaming, e.g. they want you to finish speaking first before showing you anything. Which doesn't work for me when I'm dictating. I want to see what I've been saying lately, to jog my memory about what I've just said and help guide the next thing I'm about to say. I certainly don't want to split my attention by manually toggling the control (whether PTT or not) periodically to indicate "ok, you can render what I just said now".
I guess "hold-to-talk" tools are for delivering discrete, fully formed messages, not for longer, running dictation.
AFAICT, TFA is focused on hold-to-talk as the differentiator, over double-tap to begin speaking and double-tap to end speaking?
https://news.ycombinator.com/item?id=45930659
Works great for me!
Also, on a Mac with 32GB of RAM, 24GB of that (75%) is available to the GPU, and that makes the models run much faster. On my 64GB MacBook Pro, 48GB is available to the GPU. Have you priced an NVIDIA GPU with 48GB of RAM? It's simply cheaper to do this on Macs.
Macs are just better for getting started with this kind of thing.
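The 75% figure in both examples above matches the default GPU working-set limit on Apple Silicon's unified memory; a trivial sketch of the arithmetic (the sysctl named in the comment is the commonly cited knob for raising the cap):

```python
def default_gpu_budget_gb(ram_gb, fraction=0.75):
    """Approximate unified memory visible to the GPU on Apple Silicon.

    0.75 is the commonly observed default working-set fraction; it can
    be raised with the iogpu.wired_limit_mb sysctl if a model almost fits.
    """
    return ram_gb * fraction

# 32GB Mac → 24.0GB for the GPU; 64GB Mac → 48.0GB.
```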