Don't be confused if it says "no microphone", the moment you click the record button it will request browser permission and then start working.
I spoke fast and dropped in some jargon, and it still got everything right - it transcribed the following exactly, WebAssembly spelling included:
> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?
And open weight too! So grateful for this.
I tried speaking in 2 languages at once, and it picked it up correctly. Truly impressive for real-time.
But I'm definitely going to keep an eye on this for local-only STT for Home Assistant.
The model is around 7.5 GB - once they get above 4 GB, running them in a browser gets quite difficult, I believe.
The dataset is ~100 8 kHz call recordings with gnarly UK accents (which I consider to be the final boss of English-language ASR). It seems like it's SOTA.
Where it does fall down is the latency distribution, but I'm testing against the API; running it locally would presumably improve that.
I tried English + Polish:
> All right, I'm not really sure if transcribing this makes a lot of sense. Maybe not. A цьому nie mówisz po polsku. A цьому nie mówisz po polsku, nie po ukrańsku.

(The Polish part roughly means "And why aren't you speaking Polish? ... not Ukrainian either", with "czemu" mis-rendered as the Cyrillic "цьому".)
> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. With a 4B parameter footprint, it runs efficiently on edge devices, ensuring privacy and security for sensitive deployments.
I wonder how much having languages with the same roots (e.g. the romance languages in the list above or multiple Slavic languages) affects the parameter count and the training set. Do you need more training data to differentiate between multiple similar languages? How would swapping, for example, Hindi (fairly distinct from the other 12 supported languages) for Ukrainian and Polish (both share some roots with Russian) affect the parameter count?
39 million people speak Polish, and most of those also speak English or another more common language.
Try sticking to the supported languages
The base model was likely pretrained on data that included Polish and Ukrainian. You shouldn't be surprised that it doesn't perform great on languages it wasn't explicitly trained on, or that made up only a small share of the training data.
Amazon's transcription service is $0.024 per minute, a pretty big difference: https://aws.amazon.com/transcribe/pricing/
For example, fal.ai has a Whisper API endpoint priced at "$0.00125 per compute second", which (at 10-25x realtime) is far cheaper than the competitors.
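Back-of-the-envelope, using only the numbers quoted above (the 10-25x realtime figure is the commenter's estimate, not a benchmark):

```python
# Rough cost per audio minute for a per-compute-second price, vs AWS Transcribe's per-minute price.
AWS_TRANSCRIBE_PER_MIN = 0.024      # USD per audio minute (AWS pricing page above)
FAL_PER_COMPUTE_SECOND = 0.00125    # USD per compute second (fal.ai Whisper endpoint)

for speedup in (10, 25):            # assumed realtime factor
    compute_seconds_per_audio_min = 60 / speedup
    cost_per_min = FAL_PER_COMPUTE_SECOND * compute_seconds_per_audio_min
    print(f"{speedup}x realtime: ${cost_per_min:.4f}/audio-min "
          f"({AWS_TRANSCRIBE_PER_MIN / cost_per_min:.1f}x cheaper than AWS)")
```

At 25x realtime that works out to about $0.003 per audio minute, i.e. roughly 8x cheaper than AWS Transcribe.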
https://huggingface.co/nvidia/nemotron-speech-streaming-en-0...
https://github.com/m1el/nemotron-asr.cpp https://huggingface.co/m1el/nemotron-speech-streaming-0.6B-g...
I used to use Dragon Dictation to draft my first novel; I had to learn a 'language' to tell the rudimentary engine how to recognize my speech.
And then I discovered [1] and have been using it for some basic speech recognition, amazed at what a local model can do.
But it can't transcribe any text until I finish recording a file and it starts processing, so the feedback loop is slow and batch-like.
And now you've posted this cool solution that streams audio to a model as a continuous series of small chunks - amazing, just amazing.
Now if only I can figure out how to contribute to Handy or similar to do that Speech To Text in a streaming mode, STT locally will be a solved problem for me.
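For illustration, the core of a streaming setup is just capturing small fixed-size chunks and handing each one to the model as it arrives. A minimal sketch, assuming the sounddevice library for capture; `transcribe_chunk` is a hypothetical stand-in for whatever local streaming ASR backend gets plugged in:

```python
# Minimal chunked-capture loop: record short audio blocks and hand each one
# to a (hypothetical) streaming transcriber as soon as it is available.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000        # Hz; most ASR models expect 16 kHz mono
CHUNK_SECONDS = 0.5         # smaller chunks = lower latency, more per-chunk overhead
CHUNK_FRAMES = int(SAMPLE_RATE * CHUNK_SECONDS)

def transcribe_chunk(audio: np.ndarray) -> str:
    """Hypothetical placeholder: feed this chunk to your streaming ASR model."""
    raise NotImplementedError

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as stream:
    while True:
        chunk, overflowed = stream.read(CHUNK_FRAMES)  # blocks until CHUNK_FRAMES are captured
        if overflowed:
            print("warning: audio buffer overflowed, some samples were dropped")
        print(transcribe_chunk(chunk[:, 0]), end=" ", flush=True)  # mono channel
```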
https://github.com/pipecat-ai/nemotron-january-2026/
discovered through this twitter post:
For example, "here it is, voila!" "turn left on el camino real"
I think it's nice to have specialized models for specific tasks that don't try to be generalists. Voxtral Transcript 2 is already extremely impressive, so imagine how much better it could be if it specialized in specific languages rather than cramming 14 languages into one model.
That said, generalist models definitely have their uses. I do want multilingual transcribing models to exist, I just also think that monolingual models could potentially achieve even better results for that specific language.
https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
~9GB model.
No, I just heard about it this morning.
But whatever I tried, it could not recognise my Ukrainian and would default to Russian, producing absolutely ridiculous transcriptions. Other STT models recognise Ukrainian consistently, so I assume there is a lot of Russian in the training material and zero Ukrainian. Made me really sad.
We need better independent comparison to see how it performs against the latest Qwen3-ASR, and so on.
I can no longer take at face value the cherry-picked comparisons from companies showing off their new models.
For now, NVIDIA Parakeet v3 is the best for my use case, and runs very fast on my laptop or my phone.
Is it better? Worse? Why do they only compare to GPT-4o mini Transcribe?
What makes it particularly misleading is that models which transcribe to lowercase and then use inverse text normalization to restore structure and grammar make a very different class of mistakes than Whisper, which goes directly to final-form text including punctuation, quotes, and tone.
But nonetheless, they're claiming such a lower error rate than Whisper that it's almost not in the same bucket.
There's a reason that quite a lot of good transcribers still use V2, not V3.
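That's also why published error rates are only comparable if both systems' outputs are normalized the same way before scoring. A minimal sketch of the idea - the sample strings are made up, and real benchmarks use more elaborate normalizers such as Whisper's own English text normalizer:

```python
import re
import string

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level edit distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def normalize(text: str) -> str:
    """Crude normalizer: lowercase and strip punctuation so formatting differences don't count as errors."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

reference = 'He said, "It costs $5."'                 # made-up ground truth
whisper_style = 'He said "it costs $5".'              # direct-to-formatted output
lowercase_style = "he said it costs 5 dollars"        # lowercase output, before ITN

for name, hyp in [("formatted", whisper_style), ("lowercase", lowercase_style)]:
    print(name, "raw WER:", round(wer(reference, hyp), 2),
          "normalized WER:", round(wer(normalize(reference), normalize(hyp)), 2))
```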
For a hosted Whisper API (large-v3) I've found "$0.00125 per compute second", which is the absolute cheapest I've seen.
Why should it be Whisper v3? They even released an open model: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
If you transcribe a minute of conversation, you'll have something like 5 words transcribed wrongly; in an hour-long podcast, that's 300 wrongly transcribed words (at a typical ~150 spoken words per minute, roughly a 3% word error rate).
[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...
- familiarity with the accent and/or speaker;
- speed and style/cadence of the speech;
- any other audio that is happening that can muffle or distort the audio;
- etc.
It can also take multiple passes to get a decent transcription.
"Click me to try now!" banners that lead to a warning screen that says "Oh, only paying members, whoops!"
So, you don't mean 'try this out', you mean 'buy this product'.
Let's not act like it's a free sampler.
I can't comment on the model: I'm not giving them money.
> We've worked hand-in-hand with the vLLM team to have production-grade support for Voxtral Mini 4B Realtime 2602 with vLLM. Special thanks goes out to Joshua Deng, Yu Luo, Chen Zhang, Nick Hill, Nicolò Lucchesi, Roger Wang, and Cyrus Leung for the amazing work and help on building a production-ready audio streaming and realtime system in vLLM.
https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
https://docs.vllm.ai/en/latest/serving/openai_compatible_ser...
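As a rough sketch of what that could look like once the model is served locally (assuming something like `vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602`, and that vLLM's standard OpenAI-compatible transcription endpoint applies here - the realtime/streaming interface may use a different protocol, so check the docs linked above):

```python
# Hypothetical sketch: query a locally served model through vLLM's OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

with open("your-file.m4a", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="mistralai/Voxtral-Mini-4B-Realtime-2602",
        file=audio,
    )
print(result.text)
```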
On the information density of languages: it is true that some languages have a more information-dense textual representation. But all spoken languages convey about the same information in the same amount of time, which is not all that surprising: it just means that human brains have an optimal range at which they process information.
Further reading: Coupé, Christophe, et al. "Different Languages, Similar Encoding Efficiency: Comparable Information Rates across the Human Communicative Niche." Science Advances. https://doi.org/10.1126/sciadv.aaw2594
It seems like the best tradeoff between information density and understandability actually comes from the deep Latin roots of the language.
I agree with your belief; other languages have either lower density (e.g. German) or lower understandability (e.g. English).
Italian has one official Italian (two, if you count it_CH, but the difference is minor), doesn't pay much attention to stress or vowel length, and only has a few "confusable" sounds (gl/l, gn/n, double consonants - stuff you get wrong in primary school). Italian dialects would be a disaster tho :)
That's interesting. As a linguist, I have to say that Haskell is the most computationally advanced programming language, having the best balance of clear syntax and expressiveness. I am qualified to say this because I once used Haskell to make a web site, and I also tried C++ but I kept on getting errors.
/s obviously.
Tldr: computer scientists feel unjustifiably entitled to make scientific-sounding but meaningless pronouncements on topics outside their field of expertise.
I don't know how widely accepted that conclusion is, what exceptions there may be, etc.
You could use their api (they have this snippet):
```
curl -X POST "https://api.mistral.ai/v1/audio/transcriptions" \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -F model="voxtral-mini-latest" \
  -F file=@"your-file.m4a" \
  -F diarize=true \
  -F timestamp_granularities="segment"
```
Via the API, it took 18 s to do a 20-minute audio file I had lying around in which someone is reviewing a product.
There will, I'm sure, be ways of running this locally available soon (if they aren't on Hugging Face right now), but the API is $0.003/min. If it's something like 120 meetings (10 years of monthly ones) at 1 hour each, that's 120 × 60 min × $0.003 ≈ $22. Depending on whether they're 1 or 10 hours each (or whether they're weekly rather than monthly, or 10 parallel sessions, or something), this might be a price you're willing to pay to get the results back in an afternoon.
edit - their realtime model can be run with vLLM; the batch model is not open
- make sure you have a list of all these YouTube meeting URLs somewhere
- ask your preferred coding assistant to write you up a script that downloads the audio for these videos with yt-dlp & calls Mistral's API (see the sketch after this list)
- ????
- profit
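A rough sketch of what such a script might look like - the URL list and file names are placeholders, and the endpoint and parameters are the ones from the curl snippet earlier in the thread:

```python
# Rough sketch: download audio for each meeting with yt-dlp, then send it to
# Mistral's transcription endpoint (same endpoint/params as the curl example above).
import os
import subprocess
import requests

MEETING_URLS = ["https://www.youtube.com/watch?v=PLACEHOLDER"]  # your list of meeting videos
API_KEY = os.environ["MISTRAL_API_KEY"]

for i, url in enumerate(MEETING_URLS):
    audio_path = f"meeting_{i}.m4a"
    # -x extracts audio only; --audio-format picks the container yt-dlp converts to.
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "m4a", "-o", f"meeting_{i}.%(ext)s", url],
        check=True,
    )

    with open(audio_path, "rb") as f:
        resp = requests.post(
            "https://api.mistral.ai/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"model": "voxtral-mini-latest", "diarize": "true",
                  "timestamp_granularities": "segment"},
        )
    resp.raise_for_status()
    with open(f"meeting_{i}.json", "w") as out:
        out.write(resp.text)
```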
What estimates do others use?
[^1]: https://www.wired.com/story/mistral-voxtral-real-time-ai-tra...
Handy – Free open source speech-to-text app https://github.com/cjpais/Handy
This combo has almost unbeatable accuracy and it rejects noises in the background really well. It can even reject people talking in the background.
The only better thing I've seen is Ursa model from Speechmatics. Not open weights unfortunately.
Depending on the permissions granted to apps on your mobile device, it can even be passively exfiltrated without you ever noticing - and that's ignoring the video clips people take and put online, like your grandma uploading a short moment from a Christmas get-together to Facebook.
There have already been successful scams - e.g. AI-generated calls from "relatives" telling family members they urgently need money and convincing them to send it...