Pure C, CPU-only inference with Mistral Voxtral Realtime 4B speech to text model
71 points
6 hours ago
| 1 comment
| github.com
| HN
Curiositry
3 hours ago
[-]
This was a breeze to install on Linux. However, I haven't managed to get realtime transcription working yet, ala Whisper.cpp stream or Moonshine.

--from-mic only supports Mac. I'm able to capture audio with ffmpeg, but adapting the ffmpeg example to use mic capture hasn't worked yet:

ffmpeg -f pulse -channels 1 -i 1 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin

It's possible my system is simply under spec for the default model.

I'd like to be able to use this with the voxtral-q4.gguf quantized model from here: https://huggingface.co/TrevorJS/voxtral-mini-realtime-gguf

reply
jwrallie
1 hour ago
[-]
I am interested in a way to capture audio not only from the mic, but also from one of the monitor ports so you could pipe the audio you are hearing from the web directly for real-time transcription with one of these solutions. Did anyone manage to do that?

I can, for example, capture audio from that with Audacity or OBS Studio and do it later, so it should be possible to do it in real time too assuming my machine can keep up.

reply
yjftsjthsd-h
2 hours ago
[-]
Does it work if you use ffmpeg to feed it audio from a file? I personally would try file->ffmpeg->voxtral then mic->ffmpeg->file, and then try to glue together mic->ffmpeg->voxtral.

(But take with grain of salt; I haven't tried yet)

reply
Curiositry
1 minute ago
[-]
Recording audio with FFMPEG, and transcribing a file that’s piped from FFMPEG both work.

Given that it took 19.64 mins to transcribe the 11 second sample wav, it’s possible I just didn’t wait long enough :)

reply