Some technical details:
- Predicts conversational floor ownership, not speech endpoints
- Audio-native streaming model, no ASR dependency
- Human-timed responses without silence-based delays
- Zero interruptions at sub-100ms median latency
- In benchmarks, Sparrow-1 beats all existing models on real-world turn-taking baselines
I wrote more about the work here: https://www.tavus.io/post/sparrow-1-human-level-conversation...
Could Sparrow instead be used to produce high-quality transcription that incorporates non-verbal cues?
Or even use Sparrow AND another existing transcription/ASR tool to augment the transcription with non-verbal cues?
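To make the second idea concrete, here is a rough sketch of what that merge step might look like. It assumes you already have two timestamped streams: words from some ASR system, and non-verbal cue events from an audio model. Nothing in the post says Sparrow emits cue events like this, so `Word`, `Cue`, and `annotate` are all hypothetical names made up for illustration, and the sample data is invented.

```python
from dataclasses import dataclass


@dataclass
class Word:
    start: float  # start time in seconds, as reported by the ASR system
    text: str


@dataclass
class Cue:
    start: float  # start time in seconds (hypothetical cue-detector output)
    label: str    # e.g. "laugh", "sigh"


def annotate(words, cues):
    """Interleave words and cues by start time; cues are rendered in brackets."""
    # The middle tuple element breaks ties: a word sorts before a cue
    # that starts at exactly the same time.
    events = [(w.start, 0, w.text) for w in words]
    events += [(c.start, 1, f"[{c.label}]") for c in cues]
    events.sort(key=lambda e: (e[0], e[1]))
    return " ".join(text for _, _, text in events)


# Invented sample data, purely for illustration.
words = [Word(0.0, "well"), Word(0.9, "I"), Word(1.1, "guess"), Word(1.4, "so")]
cues = [Cue(0.4, "sigh"), Cue(1.8, "laugh")]
print(annotate(words, cues))  # well [sigh] I guess so [laugh]
```

The merge itself is trivial once both streams share a clock; the hard part would be getting reliable cue events and aligned timestamps out of the two models in the first place.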
Common ...