I fed it a bunch of completely made-up words (Quastral Syncing, Zarnix Meshing, HIBAX, Bilxer), used them in a spoken sentence, and the model zero-shotted perfect speech recognition!
It's so counterintuitive to me that this would work. I would have bet that you'd have to provide at least one audio sample for the model to recognize a word it was never trained on.
Providing a word to the model in the text modality and having it recognized in the audio modality must be an emergent property.
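Concretely, the setup was something like this; `transcribe` and its `keywords` parameter are hypothetical placeholders, since I'm not naming the model's actual biasing interface here:

```python
# Hypothetical sketch of the zero-shot experiment; `transcribe` and its
# `keywords` parameter are stand-ins for the model's real biasing API.
made_up_terms = ["Quastral Syncing", "Zarnix Meshing", "HIBAX", "Bilxer"]

def transcribe(audio_path: str, keywords: list[str]) -> str:
    # Stub in place of the real model call; returns a canned sentence
    # purely so the sketch runs end to end.
    return ("We finished Quastral Syncing, set up Zarnix Meshing, "
            "ran HIBAX, and shipped Bilxer.")

# Text modality: the terms go in as strings only, never as audio samples.
transcript = transcribe("sentence_with_terms.wav", keywords=made_up_terms)

# The surprising part: every invented term comes back spelled correctly,
# even though none of them can have appeared in training data.
assert all(term in transcript for term in made_up_terms)
```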
seems like it could be very useful, but it really comes down to the specifics.
you can already prompt whisper with context; how does this compare? (rough sketch of whisper prompting below)
how large a vocabulary can it work with? if it's a few dozen words it's only gonna help for niche use cases. if it can handle 100s-1000s with good performance, that could completely replace fine-tuning for many uses.
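for reference, prompting whisper with context looks like this; a minimal sketch using openai-whisper's `initial_prompt` parameter, where the audio file name and glossary string are placeholders:

```python
# Whisper's built-in contextual biasing: `initial_prompt` feeds text into
# the decoder's context window before transcription starts. This biases
# the language-model side only; it is not a dedicated keyword spotter.
import whisper

model = whisper.load_model("base")

# Placeholder audio file and a glossary of rare/invented terms.
result = model.transcribe(
    "meeting.wav",
    initial_prompt="Glossary: Quastral Syncing, Zarnix Meshing, HIBAX, Bilxer.",
)
print(result["text"])
```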
https://arxiv.org/pdf/2309.08561 https://arxiv.org/pdf/2406.02649
I haven't really dug in yet, but from a quick skim it looks promising. They show a big improvement over Whisper on a medical dataset (F1 increased from 80.5% to 96.58%).
The inference time for keyword detection is about 10ms. If that scales linearly with the number of keywords, you could potentially handle hundreds or even thousands, but it really depends on how sensitive you are to latency (rough numbers below). For real-time use with large vocabularies, my guess is you might still want to fine-tune.
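To make that concrete, a back-of-envelope sketch; treating the ~10ms figure as a per-keyword cost and 300ms as a real-time budget are both my assumptions, not numbers from the papers:

```python
# Back-of-envelope latency estimate. Assumes (1) the ~10ms detection cost
# is per keyword and scales linearly, and (2) ~300ms is the acceptable
# added latency for a real-time pipeline; both are assumptions.
MS_PER_KEYWORD = 10.0
REALTIME_BUDGET_MS = 300.0

for n_keywords in (10, 100, 1000):
    total_ms = n_keywords * MS_PER_KEYWORD
    verdict = "fits budget" if total_ms <= REALTIME_BUDGET_MS else "too slow"
    print(f"{n_keywords:>5} keywords -> {total_ms:>7.0f} ms ({verdict})")
```

Under those assumptions, a few dozen keywords fit comfortably, but hundreds blow the budget, which is where fine-tuning starts to look attractive again.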
How does keyword spotting handle complex phrases as commands?