A lighter approach: use the accessibility API (NSAccessibility on macOS) to grab the focused app's text content — window title, selected text, nearby field labels, recipient names in mail composers. That gives you ~90% of the useful context as a small text prompt that a 1-3B parameter local model (like Qwen2.5-1.5B or Phi-3-mini) can process in under 500ms on Apple Silicon's Neural Engine.
The screenshot path is only really needed for non-standard UIs where text isn't programmatically accessible. Splitting the pipeline into a fast text-context path (common case) and a fallback vision path would get you sub-2s end-to-end locally, while still handling edge cases gracefully.
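A minimal sketch of that text-context path in Python, assuming pyobjc (the ApplicationServices bindings) and that the process has been granted Accessibility permission; this is purely illustrative, not code from any of the apps mentioned here:

```python
# Illustrative only: pull a small amount of text context from the focused
# UI element via the macOS Accessibility API (pyobjc bindings assumed).
from ApplicationServices import (
    AXUIElementCreateSystemWide,
    AXUIElementCopyAttributeValue,
    kAXFocusedUIElementAttribute,
    kAXTitleAttribute,
    kAXValueAttribute,
    kAXSelectedTextAttribute,
)

def focused_text_context() -> dict:
    """Return title / value / selected text of the focused element, if readable."""
    system_wide = AXUIElementCreateSystemWide()
    err, focused = AXUIElementCopyAttributeValue(
        system_wide, kAXFocusedUIElementAttribute, None
    )
    if err != 0 or focused is None:
        return {}  # nothing focused, or no Accessibility permission

    context = {}
    for key, attr in (
        ("title", kAXTitleAttribute),
        ("value", kAXValueAttribute),
        ("selected_text", kAXSelectedTextAttribute),
    ):
        err, value = AXUIElementCopyAttributeValue(focused, attr, None)
        if err == 0 and value:
            context[key] = str(value)
    return context

if __name__ == "__main__":
    # The resulting dict is small enough to drop straight into an LLM prompt.
    print(focused_text_context())
```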
This is essentially the same pattern used in assistive technology — screen readers have solved the "what's on screen" problem without vision models for decades.
But a few weeks ago someone on HN pointed me to Hex, which also supports Parakeet-V3 and, incredibly enough, is even faster than Handy because it’s a native macOS-only app that leverages CoreML/Neural Engine for extremely quick transcriptions. Long ramblings transcribed in under a second!
It’s now my favorite fully local STT for macOS.
I think the biggest difference between FreeFlow and Handy is that FreeFlow implements what Monologue calls "deep context", where it post-processes the raw transcription with context from your currently open window.
This fixes misspelled names if you're replying to an email, makes sure technical terms are spelled correctly, and so on.
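To make that concrete, here is a rough sketch of what a post-processing step like that could look like. This is illustrative only, not FreeFlow's actual pipeline; the model name is a placeholder and the Groq OpenAI-compatible endpoint is just an assumption:

```python
# Illustrative sketch of "deep context" post-processing, not FreeFlow's code.
# Assumes Groq's OpenAI-compatible endpoint; any fast instruct model works.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

def post_process(raw_transcript: str, window_context: str) -> str:
    """Ask an LLM to fix names and terms in the transcript using on-screen context."""
    prompt = (
        "You clean up dictated text. Using the on-screen context below, "
        "fix misheard names and technical terms. Do not change the meaning.\n\n"
        f"On-screen context:\n{window_context}\n\n"
        f"Raw transcript:\n{raw_transcript}\n\n"
        "Corrected transcript:"
    )
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```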
The original hope for FreeFlow was to use all-local models like Handy does, but with the post-processing step the local pipeline took 5-10 seconds, versus under 1 second with Groq.
Thank you for making Handy! It looks amazing and I wish I’d found it before making FreeFlow.
You can go to Settings > Run Logs in FreeFlow to see the full pipeline that ran on each request, with the exact prompt and LLM response, so you can see exactly what was sent and returned.
Surprisingly, it produced a better output (at least I liked its version) than the recommended but heavy model (Parakeet V3 @ 478 MB).
F12 → sox for recording → temp.wav → faster-whisper → pbcopy → notify-send to know what’s happening
https://github.com/sathish316/soupawhisper
I found a Linux version with a similar workflow and forked it to build the Mac version. It took less than 15 minutes to ask Claude to modify it to fit my needs.
F12 Press → arecord (ALSA) → temp.wav → faster-whisper → xclip + xdotool
https://github.com/ksred/soupawhisper
Thanks to faster-whisper and quantized local models, I now use it everywhere I was previously using Superwhisper: Docs, Terminal, etc.
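For anyone curious what the core loop looks like, here is a rough Python sketch of the macOS variant (record with sox, transcribe with faster-whisper, copy with pbcopy). It's a toy version of the idea, not the repo's actual code:

```python
# Toy version of the hotkey pipeline: record -> transcribe -> clipboard -> notify.
# Assumes `sox` is installed (brew install sox) and faster-whisper can download
# a quantized Whisper model on first run.
import subprocess
from faster_whisper import WhisperModel

WAV = "/tmp/temp.wav"

# 1. Record from the default input device (fixed 5 s here; the real tool
#    toggles recording on F12 instead).
subprocess.run(["sox", "-d", WAV, "trim", "0", "5"], check=True)

# 2. Transcribe locally with a small quantized model.
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, _info = model.transcribe(WAV)
text = " ".join(seg.text.strip() for seg in segments)

# 3. Copy to the clipboard and pop a notification (macOS equivalents of
#    xclip and notify-send).
subprocess.run(["pbcopy"], input=text.encode(), check=True)
subprocess.run(
    ["osascript", "-e",
     f'display notification "{text[:80]}" with title "Transcribed"'],
    check=False,
)
print(text)
```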
Chatterbox TTS (from Resemble AI) does the voice generation, WhisperX gives word-level timestamps so you can click any word to jump, and FastAPI ties it all together with SSE streaming so audio starts playing before the whole thing is done generating.
There's a ~5s buffer up front while the first chunk generates, but after that each chunk streams in faster than realtime. So playback rarely stalls.
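The streaming part is roughly this shape (illustrative only, not the project's actual code; `synthesize_chunks` stands in for the real Chatterbox call):

```python
# Rough shape of the SSE streaming endpoint: push each audio chunk to the
# client as soon as it's generated, so playback can start before synthesis ends.
import base64
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def synthesize_chunks(text: str):
    """Stand-in for the real TTS call: one chunk per sentence.
    Yields silent 16-bit PCM so the sketch runs without a TTS model."""
    for _sentence in text.split(". "):
        yield b"\x00\x00" * 8000  # ~0.5 s of silence at 16 kHz mono

@app.get("/speak")
def speak(text: str):
    def event_stream():
        for chunk in synthesize_chunks(text):
            payload = base64.b64encode(chunk).decode()
            yield f"data: {payload}\n\n"       # one SSE event per audio chunk
        yield "event: done\ndata: end\n\n"     # tell the client we're finished
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```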
It took about 4 hours today... wild.
I built https://github.com/bwarzecha/Axii to keep EVERYTHING local and fully open source - it can easily be used at any company. No data is sent anywhere.
Edit: Ah, but I think Parakeet isn’t available for free. Still a very worthwhile single-purchase app nonetheless!
And then I set the button right below that as the enter key, so it feels mostly hands-off from the keyboard.
https://github.com/kitlangton/Hex
For me it strikes the balance of good, fast, and cheap for everyday transcription. MacWhisper is overkill, Superwhisper too clever, and Handy too buggy. Hex fits just right for me (so far).
https://github.com/corlinp/voibe
I do see the name has since been taken by a paid service... shame.
My take for X11 Linux systems. Small and low-dependency, apart from the model download.
Just use Handy: https://github.com/cjpais/Handy
If you do that, the total pipeline takes too long for the UX to be good (5-10 seconds per transcription instead of <1s). I also had concerns around battery life.
Some day!
It’s free and offline
Won't be free when xAI starts charging.