[1]: https://www.science.org/content/article/turtle-or-rifle-hack...
But (1) older audio models typically used different architectures, such as RNNs (recurrent neural networks), which came with additional challenges compared to the CNNs (convolutional neural networks) that image models used, e.g. the exploding gradients problem. During training of RNNs, vanishing gradients are a potential problem. During advex (adversarial example) optimization the problem gets inverted, and you have to do different things to solve it.
Also (2) the human side of imperceptibility is very different with audio. Ears vs. eyes.
So, they're the same, but different.
source -- this is what my (unfinished) PhD was on. i should really write up the attack that i crafted, which never got published :(
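To make point (1) a bit more concrete, here is a minimal toy sketch, not the attack mentioned above. It assumes a scalar linear recurrence `h_t = w * h_{t-1} + x_t` (a stand-in for an RNN, chosen only because its input gradient has a closed form) and shows how the gradient of the final state with respect to early inputs either blows up or vanishes depending on `w`. The function names and the sign-based step are illustrative assumptions; taking only the sign of the gradient is one common way to keep the perturbation step bounded whatever the gradient magnitude is.

```python
def input_gradients(w, steps):
    """Gradient of the final state h_T with respect to each input x_t
    for the toy recurrence h_t = w * h_{t-1} + x_t.
    Unrolling gives dh_T/dx_t = w ** (T - t) for t = 1..T."""
    return [w ** (steps - t) for t in range(1, steps + 1)]


def sign_step(grads, eps):
    """Perturbation step that keeps only the sign of each gradient entry,
    so every entry has magnitude eps (or 0 for a zero gradient)."""
    return [eps * ((g > 0) - (g < 0)) for g in grads]


# Hypothetical recurrent weights: |w| > 1 explodes, |w| < 1 vanishes.
exploding = input_gradients(1.5, 50)
vanishing = input_gradients(0.5, 50)

print("gradient at first input, w=1.5:", exploding[0])
print("gradient at first input, w=0.5:", vanishing[0])

# The raw gradients differ by many orders of magnitude,
# but the sign-based steps are identical and bounded by eps.
eps = 0.01
print(sign_step(exploding, eps) == sign_step(vanishing, eps))
```

Real RNNs have nonlinearities and matrix-valued weights, so the actual behaviour is messier than this closed form suggests.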
Would it help to significantly lower the hearing capabilities of the AI system? At Juvoly, we always encouraged GPs to invest in a high-quality microphone like the Jabra Speak, connected through USB. A good mic results in much better audio transcriptions, but maybe that was all for the wrong reasons?
In case anyone hasn't had the displeasure of viewing these I'll link some in a comment below once I scroll through my feed and find one.
Between these two trends, I struggle to see what the future holds for the security industry.
Either way, as is always the case with the tech industry, the incumbents in this space will be getting paid the big bucks and the consumer will ultimately be left holding the bag. We absolutely need tougher data privacy / security laws, and I wonder what catastrophic event will force lawmakers and voters to take this issue seriously.
The Art Of Poison-Pilling Music Files
- Inject adversarial noise to make it transcribe what you want (https://arxiv.org/abs/2210.17316)
- Stop it from transcribing (https://arxiv.org/abs/2405.06134)
- Adversarial prompt injection to make it translate instead of transcribe (https://arxiv.org/abs/2407.04482v2).
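For readers unfamiliar with how "adversarial noise" of this kind is usually found, below is a minimal sketch of a targeted, norm-bounded attack using projected gradient steps. It is not the method of any of the linked papers: the "model" is a hypothetical random linear classifier over raw samples (`W`), standing in for a real speech model, and the sine-wave "audio", `eps`, `step`, and `iters` are all made-up values for illustration. The only point is the loop structure: follow the gradient toward a chosen target output while clipping the perturbation to a small maximum amplitude.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()


def targeted_pgd(x, W, target, eps, step, iters):
    """Find delta with max |delta| <= eps that pushes the toy model
    logits = W @ (x + delta) toward class `target`."""
    delta = np.zeros_like(x)
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(iters):
        p = softmax(W @ (x + delta))
        # Gradient of cross-entropy(target) with respect to the input.
        grad = W.T @ (p - onehot)
        # Signed descent step, then project back into the eps-ball.
        delta = np.clip(delta - step * np.sign(grad), -eps, eps)
    return delta


rng = np.random.default_rng(0)
n = 16000                                    # one second at an assumed 16 kHz
t = np.arange(n) / 16000.0
x = 0.1 * np.sin(2 * np.pi * 440.0 * t)      # stand-in "audio": a 440 Hz tone
W = rng.standard_normal((3, n)) / np.sqrt(n)  # hypothetical 3-class linear model

original = int(np.argmax(W @ x))
target = (original + 1) % 3                  # any class other than the original
eps = 0.01
delta = targeted_pgd(x, W, target, eps=eps, step=0.001, iters=100)
adversarial = int(np.argmax(W @ (x + delta)))

print("original class:", original)
print("target class:", target, "-> adversarial class:", adversarial)
print("max |delta|:", float(np.max(np.abs(delta))))
```

Attacks on real transcription models differ in the loss (over token sequences rather than a single class) and in how imperceptibility is measured, but the optimize-then-project loop is a common starting point.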
the article says
> This required full access to the model, restricting the researchers to open models with publicly available weights. They found, however, that attacks developed for open models transferred to commercial models from Microsoft and Mistral that share the same underlying architecture.
so it depends on what architecture Whisper is using (i don't think it's an LLM? or it wasn't last time i checked, about 4 years ago lol)
edit -- replaced last section, missed this bit in the article
It's insane to me how much of a nose-dive Siri or any Apple-based STT takes when there is _any_ noise in the background. I like to play music at low levels in my house just as background noise and I've noticed that if I'm playing any music my STT just goes to complete shit (often missing the last 2-3+ words and messing up things in the middle). On the other hand, in the exact same environment, Parakeet v3 (via MacWhisper) has zero issues even with background noise.