Cohere Transcribe: Speech Recognition
101 points | 3 hours ago | 12 comments | cohere.com | HN
kieloo
55 seconds ago
The problem with many STT models is that they seem to mostly be trained on perfectly-accented speech and struggle a lot with foreign accents, so I’m curious to try this one.

So far, the best I have found while testing models for my language learning app (Copycat Cafe) is Soniox. All the others performed badly on non-native accents. The worst were Whisper-based models, because they hallucinate when they misunderstand and tend to come up with random phrases that have nothing to do with the topic.

dinakernel
2 hours ago
My worry is that ASR will end up like OCR. If a large multimodal AI system is good enough (latency-wise), the advantage of domain understanding eats the other technologies alive.

In OCR, even when the characters are poorly scanned, the deep domain understanding these large multimodal models have lets them work out what the document actually meant: "this must be the order ID, because in the million invoices I have seen before, the order ID normally sits below the order date", and so on. My worry is that the same thing is going to happen in ASR.

progbits
1 hour ago
This is both good and bad. Good ASR can often understand low quality / garbled speech that I could not figure out, but it also "over corrects" sometimes and replaces correct but low prior words with incorrect but much more common ones.
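
A toy sketch of that over-correction, assuming a recognizer that picks the word maximizing acoustic score times language-model prior. The words, the numbers and the `prior_weight` knob are all invented for illustration and do not come from any real system.

```python
# Hypothetical scores for one ambiguous audio segment; every number is made up.
acoustic = {"Kriesel": 0.60, "crystal": 0.40}  # how well each word fits the audio
prior = {"Kriesel": 0.001, "crystal": 0.05}    # how common each word is in text


def decode(acoustic, prior, prior_weight=1.0):
    """Pick the word maximizing acoustic * prior ** prior_weight."""
    return max(acoustic, key=lambda w: acoustic[w] * prior[w] ** prior_weight)


print(decode(acoustic, prior, prior_weight=0.0))  # acoustics only
print(decode(acoustic, prior, prior_weight=1.0))  # prior included
```

With the prior switched off the rarer word wins on acoustics alone; with the prior switched on the common word wins even though it fits the audio worse.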

With OCR the risk is that you get another Xerox[1] incident, where all your data looks plausible but is incorrect. Hope you kept the originals!

(This is why for my personal doc scans, I use OCR only for full text search, but retain the original raw scans forever)

[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

corlinp
39 minutes ago
This is exactly the case today. Multimodal LLMs like gpt-4o-transcribe are way better than traditional ASR, not only because of deeper understanding but because of the ability to actually prompt it with your company's specific terminology, org chart, etc.

For example, if the prompt includes that Caitlin is an accountant and Kaitlyn is an engineer, if you transcribe "Tell Kaitlyn to review my PR" it will know who you're referring to. That's something WER doesn't really capture.
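
A rough sketch of what that prompting could look like with the OpenAI Python SDK. `build_prompt` and `transcribe` are hypothetical helpers written for this example (they are not taken from any tool mentioned here), and the prompt wording is just one plausible way to phrase the context.

```python
def build_prompt(people, terms):
    """Build a plain-text context prompt from a name -> role mapping and a term list."""
    lines = [f"{name} is {role}." for name, role in people.items()]
    if terms:
        lines.append("Terminology: " + ", ".join(terms) + ".")
    return " ".join(lines)


def transcribe(path, prompt, model="gpt-4o-transcribe"):
    """Send an audio file plus a context prompt to the transcription endpoint."""
    from openai import OpenAI  # imported lazily; needs the `openai` package

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            model=model, file=audio, prompt=prompt
        )
    return result.text


prompt = build_prompt(
    {"Caitlin": "an accountant", "Kaitlyn": "an engineer"},
    ["PR", "code review"],
)
print(prompt)
```

Calling `transcribe("meeting.wav", prompt)` (the file name is a placeholder) would then pass the names and roles along with the audio, so the model has a chance to pick the right spelling.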

BTW, I built an open-source Mac tool for using gpt-4o-transcribe with an OpenAI API key and custom prompts: https://github.com/corlinp/voibe

nkzd
1 hour ago
Why are you 'worried' about it? Shouldn't we strive for better technology even if it means some will 'lose'?
yorwba
1 hour ago
"Better" isn't just about increasing benchmark numbers. Often, it's more important that a system fails safely than how often it fails. Automatic speech recognition that guesses when the input is unclear will occasionally be right and therefore have a lower word error rate, but if it's important that the output be correct, it might be better to insert "[unintelligible]" and have a human double-check.
IshKebab
15 minutes ago
It's better in terms of WER. It's not better in terms of not making shit up that sounds plausible.

Probably the answer is simply to tweak the metric so it's a bit smarter than WER: allow an "unclear" output that is penalised less than actually incorrect answers. I'd be surprised if nobody has done that.
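
A minimal sketch of such a tweaked metric, assuming a word-level edit distance where substituting a placeholder token for a word costs less than substituting a wrong word. The `[unclear]` token and the 0.5 cost are arbitrary choices for illustration, not an established metric.

```python
UNCLEAR = "[unclear]"  # hypothetical abstention token


def wer(reference, hypothesis, unclear_cost=1.0):
    """Word error rate where substituting UNCLEAR for a word costs `unclear_cost`.

    With unclear_cost=1.0 this is the ordinary WER.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: cost of aligning zero reference words with the first j hypothesis words
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        cur = [float(i)]
        for j, h in enumerate(hyp, start=1):
            if h == r:
                sub = 0.0           # exact match
            elif h == UNCLEAR:
                sub = unclear_cost  # model admitted it could not tell
            else:
                sub = 1.0           # confidently wrong word
            cur.append(min(prev[j] + 1.0,       # deletion
                           cur[j - 1] + 1.0,    # insertion
                           prev[j - 1] + sub))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


reference = "tell kaitlyn to review my pr"
wrong_guess = "tell caitlin to review my pr"
abstain = "tell [unclear] to review my pr"

print(wer(reference, wrong_guess))                # plain WER
print(wer(reference, abstain))                    # plain WER, same score
print(wer(reference, abstain, unclear_cost=0.5))  # abstaining is penalised less
```

Under plain WER the wrong guess and the abstention score the same, so a model loses nothing by guessing; with the discounted cost, abstaining beats a wrong guess but still loses to a correct one.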

_medihack_
23 minutes ago
Unfortunately, this model does not seem to support a custom vocabulary, word boosting or an additional prompt.
gruez
1 hour ago
> Limitations

>Timestamps/Speaker diarization. The model does not feature either of these.

What a shame. Is WhisperX still the best choice if you want timestamps/diarization?

bartman
1 hour ago
Even in the commercial space, there’s a lack of production grade ASR APIs that support diarization and word level timestamps.

My experiences with Google’s Chirp have been horrendous, with it sometimes skipping sections of speech entirely, hallucinating speech where the audio contains noise, and producing unreliable word-level timestamps. And all of this is even with their new audio prefiltering feature enabled.

AWS works slightly better, but also has trouble with keeping word level timestamps in sync.

Whisper is nice but hallucinates regularly.

OpenAI’s new transcription models are delivering accurate output but do not support word level timestamps…

A lot of this could be worked around by sending the resulting transcripts through a few layers of post processing, but… I just want to pay for an API that is reliable and saves me from doing all that work.

stavros
51 minutes ago
Isn't ElevenLabs the best at this?
akreal
1 hour ago
WhisperX is not a model but a software package built around Whisper and some other models, including diarization and alignment ones. Something similar will be built around the Cohere Transcribe model, maybe even just as an integration into WhisperX itself.
atoav
43 minutes ago
I would try Qwen-ASR: https://qwen.ai/blog?id=qwen3asr

See the very bottom of the page for a transcription with timestamps.

GaggiX
1 hour ago
There is also: https://github.com/linto-ai/whisper-timestamped

It doesn't use an extra model (so it supports every language that works with Whisper out of the box and uses less memory); it works by applying Dynamic Time Warping to cross-attention weights.
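
To give a feel for the idea, here is a toy Dynamic Time Warping sketch over a made-up token-by-frame attention matrix. It is not whisper-timestamped's actual code: the matrix values are invented, the cost is simply `1 - attention`, and the real project does considerably more (choosing attention heads, filtering, converting frames to seconds).

```python
def dtw_path(cost):
    """Cheapest monotonic path through cost[token][frame], top-left to bottom-right."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance token and frame
                acc[i - 1][j],      # advance token only
                acc[i][j - 1],      # advance frame only
            )
    # Backtrack from the end, always stepping to the cheapest predecessor.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    return path[::-1]


# Invented attention weights: 3 text tokens (rows) over 6 audio frames (columns).
attention = [
    [0.8, 0.7, 0.1, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.8, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.2, 0.9, 0.9],
]
cost = [[1.0 - a for a in row] for row in attention]

path = dtw_path(cost)
start_frame = {}
for token, frame in path:
    start_frame.setdefault(token, frame)  # first frame aligned to each token
print(start_frame)
```

Each token's first aligned frame is its start; multiplying by the frame duration would turn that into a timestamp.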

geooff_
2 hours ago
I can't say enough nice things about Cohere's services. I migrated over to their embedding model a few months ago for clip-style embeddings and it's been fantastic.

It has the most crisp, steady P50 of any external service I've used in a long time.

bluegatty
2 hours ago
Can you comment on overall quality? Their models tend to be a bit smaller and less performant overall.
geooff_
25 minutes ago
My baseline was Jina, a Chinese model provider. I had major issues with their reliability. I have no comparison to offer in terms of offline metrics, as I had to do an emergency migration because their inference service had extended downtimes.

My experience with Cohere and their sales engineers has been boring, and I say that in the most flattering way possible. Embeddings are a core service at this point, like VMs and DBs. They just need to work and work well, and that's what they're selling.

stavros
49 minutes ago
To clarify, this is SOTA in its size category, right? It's not better than Parakeet, for example?
jwineinger
20 minutes ago
Looking at the ASR leaderboard (https://huggingface.co/spaces/hf-audio/open_asr_leaderboard), Parakeet (0.6B) is still near the top on speed, but about 10th on WER.
stavros
16 minutes ago
Thanks, I don't know how much to trust benchmarks so I figured I'd ask.
caminanteblanco
35 minutes ago
Well, to clarify, it is both larger than Parakeet in parameter count (Parakeet is available in 0.6B and 1.1B, while this is 2B params) and performs better than it on the benchmarks that Hugging Face publishes on the Open ASR Leaderboard.
stavros
33 minutes ago
Ahh thanks, I confused my parameter counts. I guess Parakeet is 0.6B; I was somehow thinking 6B.
teach
1 hour ago
Dumb question, but if this is "open source" is there source code somewhere? Or does that term mean something different in the world of models that must be trained to be useful?
Doman
1 hour ago
Files can be downloaded here: https://huggingface.co/CohereLabs/cohere-transcribe-03-2026/...

And someone has already converted it to ONNX format: https://huggingface.co/eschmidbauer/cohere-transcribe-03-202... - so it can be run on CPU instead of GPU.

gunalx
33 minutes ago
The most commonly used definition is just available weights.

This kind of makes sense, because "compiling" (training) the model is prohibitively expensive, and we can still benefit from the artifacts.

stronglikedan
1 hour ago
I presume it means the model itself.
ramon156
59 minutes ago
I had to set up Fireflies for our company recently. Cool tool, but I'm sending dozens of internal meetings to an American company. Our ISO inspector wouldn't be pleased to know.

This is a good option. Will check it out.

Oras
45 minutes ago
There are many open-source STT models that can run locally on a Mac with good performance, such as Whisper and Parakeet.
Void_
1 hour ago
Just today I shipped support for this in Whisper Memos: https://whispermemos.com/changelog/2026-04-cohere-transcribe

Accurate and fast model, very happy with it so far!

kalmuraee
34 minutes ago
Multimodal models are way better.
Fidelix
15 minutes ago
Can you clarify? I tested a few and they are rubbish and don't have the same features.
topazas
2 hours ago
How hard could it be to train other European language(-s)?
gunalx
1 hour ago
If you have to ask, you don't really need the answer.

It doesn't seem too difficult to find or create training code. So you'd need a pretty decent amount of high-quality training data, which should be many hours of audio, plus a few hours of high-end data center GPU compute and many iterations to get it right.

harvey9
1 hour ago
It includes several European languages.
stronglikedan
1 hour ago
hence "other" lol
simonw
2 hours ago
It's great that this is Apache 2.0 licensed - several of Cohere's other models are licensed free for non-commercial use only.