The reason? Training data bias and the "last mile" problem: demos run under ideal conditions, while real usage involves messy audio, overlapping speech, and domain-specific vocabulary the models never saw during training.
Did you end up adding any guardrails (confidence thresholds, “please repeat,” glossary/term injection, or human fallback)? Also curious: were failures mostly ASR or translation/context?
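For context, here's roughly the kind of confidence-threshold guardrail I had in mind. This is just a minimal sketch in Python, not a claim about your setup: the 0.75 cutoff, the glossary entries, and the `translate` / `human_fallback` callables are all hypothetical placeholders for whatever ASR/MT stack you're actually calling.

```python
# Minimal sketch of the guardrail idea: gate on ASR confidence, ask the user to
# repeat when it's low (or hand off to a human), and inject a domain glossary
# before the text hits the translation model. All names/thresholds are hypothetical.

from dataclasses import dataclass


@dataclass
class AsrResult:
    text: str
    confidence: float  # 0.0-1.0, e.g. mean word-level confidence reported by the ASR


# Hypothetical domain glossary: source term -> preferred/normalized term
GLOSSARY = {
    "afib": "atrial fibrillation",
    "cath lab": "catheterization lab",
}

CONFIDENCE_THRESHOLD = 0.75  # tune on real traffic, not on clean demo audio


def apply_glossary(text: str) -> str:
    """Normalize domain terms before translation so the MT model sees known vocabulary."""
    for term, preferred in GLOSSARY.items():
        text = text.replace(term, preferred)
    return text


def guarded_translate(asr: AsrResult, translate, human_fallback=None) -> str:
    """Return a translation, a 'please repeat' prompt, or escalate to a human."""
    if asr.confidence < CONFIDENCE_THRESHOLD:
        if human_fallback is not None:
            # Route low-confidence turns to a person instead of guessing.
            return human_fallback(asr.text)
        return "Sorry, I didn't catch that -- could you say it again?"
    return translate(apply_glossary(asr.text))
```

Even something this crude would tell you a lot, since the split between "low confidence, asked to repeat" and "high confidence, still mistranslated" maps pretty directly onto my ASR-vs-translation question above.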