What moved the needle:
Voice is a turn-taking problem, not a transcription problem. VAD alone fails; you need semantic end-of-turn detection.
The system reduces to one loop: speaking vs listening. The two transitions - cancel instantly on barge-in, respond instantly on end-of-turn - define the experience.
STT → LLM → TTS must stream. Sequential pipelines are dead on arrival for natural conversation.
TTFT dominates everything. In voice, the first token is the critical path. Groq’s ~80ms TTFT was the single biggest win.
Geography matters more than prompts. Colocate everything or you lose before you start.
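The speaking/listening loop with its two transitions can be sketched in a few lines. This is a hypothetical skeleton, not the repo's actual code: `respond` stands in for the streaming STT → LLM → TTS path.

```python
import asyncio
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class TurnLoop:
    """One state machine: LISTENING <-> SPEAKING.
    The two transitions are end-of-turn (listen -> speak)
    and barge-in (speak -> listen, cancelling playback instantly)."""

    def __init__(self, respond):
        self.state = State.LISTENING
        self.respond = respond          # hypothetical coroutine: text -> plays audio
        self._speaking_task = None

    async def on_end_of_turn(self, transcript):
        # end-of-turn detected: start responding immediately
        self.state = State.SPEAKING
        self._speaking_task = asyncio.create_task(self.respond(transcript))
        try:
            await self._speaking_task
        except asyncio.CancelledError:
            pass                        # barge-in cut us off mid-response
        finally:
            self.state = State.LISTENING

    def on_barge_in(self):
        # user started talking over us: cancel the in-flight response
        if self._speaking_task and not self._speaking_task.done():
            self._speaking_task.cancel()
```

Everything else (warm connection pools, colocation, streaming) exists to make those two transitions feel instantaneous.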
GitHub Repo: https://github.com/NickTikhonov/shuo
Follow whatever I next tinker with: https://x.com/nick_tikhonov
An interesting fact I learned at the time: the median delay between human speakers in a conversation is 0ms (zero). In other words, in many cases the listener starts speaking before the speaker is done. You've probably experienced this; it's why we talk about people who "finish each other's sentences".
It's because your brain is predicting what they will say while they speak, and composing an answer at the same time. It's also why, when they say something you didn't expect, you say "what?" and then answer half a second later, once your brain corrects.
Fact 2: Humans expect a delay from their voice assistants, for two reasons. One is that they know it's a computer that has to think. The second is cell phones: cell phones have a built-in delay that breaks human-to-human speech, and your brain treats a voice assistant like a cell phone.
Fact 3: Almost no response from Alexa is under 500ms. Even the ones that are served locally, like "what time is it".
Semantic end-of-turn is the key here. It's something we were working on years ago, but we didn't have the compute power to do it. So at least back then, end-of-turn was just 300ms of silence.
This is pretty awesome. It's been a few years since I worked on Alexa (and everything I wrote has been talked about publicly). But I do wonder if they've made progress on semantic detection of end-of-turn.
Edit: Oh yeah, you are totally right about geography too. That was a huge unlock for Alexa. Getting the processing closer to the user.
This reminds me of a great diversity training at a previous employer, where we dug into the different expectations of when and how to take your turn in conversation, and how that can create a lot of friction just from different cultural/familial habits. In my family, we expect to talk over each other and it's not offensive at all to do so, whereas some of my friends really get upset if we don't take clear turns, a mode which would cause high levels of irritation in my family (and still does in me).
That was the most stressful, hard-to-use phone call I've ever had. The delay was nearly 10 seconds, and eventually I just said I was only going to speak yes or no; if he needed a longer answer he needed to shut up. And that worked. We no longer talked over each other.
1. Compute. It's easy to make a voice assistant for a few people. But it takes a hell of a lot of GPU to serve millions.
2. Guard Rails. All of those assistants have the ability to affect the real world. With Alexa you can close a garage or turn on the stove. It would be real bad if you told it to close the garage as you went to bed for the night and instead it turned on the stove and burned down the house while you slept. So you need some really strong guard rails for those popular assistants.
3. And a bonus reason: Money. Voice assistants aren't all that profitable. There isn't a lot of money in "what time is it" and "what's the weather". :)
- Alexa, what time is it?
- Current time is 5:35 P.M. - the perfect time to crack open a can of ice cold Budweiser! A fresh 12-pack can be delivered within one hour if you order now!
I am serious though about having it sent to me: if anyone has an Alexa they no longer want, I'm happy to take it off your hands. I have eight and have never bought one. Having worked there I actually trust the security more than before I worked there. It was basically impossible for me, even as a Principal Engineer, to get copies of the Text to Speech of a customer, and I literally never heard a customer voice recording.
Also, my Alexa does advertise stuff to me when I talk to it. It's not Budweiser, but it'll try to upsell me on Amazon services all the time.
- "Alexa, name the new unnamed outlet 'Living Room Lights', and the other unnamed one 'Stair Lights', then add them to a new group called 'Christmas Lights', and add the other three outlets as well"
- "Alexa, create a routine to turn off all the Christmas lights if there's nobody in the room and it's after 11pm"
- "Alexa, turn off all the Christmas lights except the tree in this room and the mantle"
That same fuzziness has definitely fucked up things that used to work more reliably like music playback though. Sometimes it works when I fall back to giving it more "robotic" commands in those cases but not always. They've also gone completely overboard with the cutesy responses because it's so trivial to do now ("I've set your spaghetti sauce timer for ten minutes. Happy to help with getting this evening's Italian-inspired dinner ready!")
I only use it for music, and use two commands, but apparently having this work correctly is too much to ask for these days.
Which just launched last year, about four years after ChatGPT had AI voice chat. And it costs extra money to cover the costs. And as you aptly point out, all the guardrails they had to put in made the experience less than ideal.
> Also, my Alexa does advertise stuff to me when I talk to it.
Yes, that is how they try to make money. And it's gotten worse. But how many times does it get you to buy something?
Still not boxing them up. Though I now have a Pi with a HomeAssistant setup I'm trialling, so maybe that'll change.
I mean, it's deployed now (Alexa+/Gemini), but it's expensive as hell and also kinda useless. Claude cowork/clawbot form factors are better.
Wrong form factor/use case really. People really wanna buy stuff using clawbot.
It was difficult to detrain, and that made me stop using voice chat with LLMs altogether.
If, when the speaker actually stops speaking, there is a match vs predicted, the response can be played without any latency.
Seems like an awesome approach! One could imagine doing this prediction for the K most likely threads simultaneously, subject to available compute, and pruning/branching as threads become inaccurate.
People are already trained to say a name to start. Curious why the tech has avoided a cap?
“Alexa, what’s tomorrow’s weather [dada]?”
"It will be sunny with a high of 10 degrees. Over"
"Thank you. Over and out."
Just add some noise and Push-To-Talk and it will be great for ham radio enthusiasts!
To me, the best solution would be semantic + keyword + silence.
Hey Agent, blablablabla, thank you.
Hey Agent, blablablabla, please.
Hey Agent, blablablabla, oops cancel.
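A sketch of what that combined endpointer might look like. The keyword lists, thresholds, and the `semantic_done_prob` input (a confidence score from some end-of-turn model) are all made up for illustration:

```python
import re

# hypothetical keyword lists: explicit end / cancel phrases
END_KEYWORDS = re.compile(r"\b(thank you|please|over)\s*[.!?]?\s*$", re.I)
CANCEL_KEYWORDS = re.compile(r"\b(oops,?\s*cancel|never mind)\b", re.I)

def end_of_turn(transcript: str, silence_ms: int,
                semantic_done_prob: float) -> str:
    """Combine keyword, semantic, and silence cues.
    Returns 'cancel', 'done', or 'wait'."""
    if CANCEL_KEYWORDS.search(transcript):
        return "cancel"
    if END_KEYWORDS.search(transcript):
        return "done"                 # explicit end keyword: no wait needed
    if semantic_done_prob > 0.9 and silence_ms > 150:
        return "done"                 # model confident the turn is complete
    if silence_ms > 800:
        return "done"                 # fallback: a long silence always ends
    return "wait"
```

The keyword path gives a hard guarantee for trained users; the semantic + silence path covers everyone else.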
The idea of having an LLM follow and continuously predict the speaker. It would allow a response to be continually generated. If the prediction is correct, the response can be started with zero latency.
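A toy sketch of that speculate-then-verify idea. Both callbacks are hypothetical: `predict_completion` stands in for a small fast LLM guessing the rest of the sentence, `generate_response` for the normal reply pipeline:

```python
def speculate(partial, predict_completion, generate_response):
    """While the user is still talking, guess how the sentence ends
    and pre-generate a reply for that guess."""
    guess = predict_completion(partial)
    cached = generate_response(partial + guess)
    return guess, cached

def on_end_of_turn(final, partial, guess, cached, generate_response):
    """If the speculation matched what the user actually said, the
    cached reply can start playing with zero added latency."""
    if partial + guess == final:
        return cached
    return generate_response(final)   # misprediction: fall back and regenerate
```

The fallback path costs the same as not speculating at all, so the scheme only wins, never loses (aside from the wasted speculative compute).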
(Meanwhile at OpenAI: testing out the free ChatGPT, it feels like they prompted GPT 3.5 to write at length based on the last one or maybe two prompts)
"The windows upstairs..."
"...are all closed except for the bedroom window"
The first portion of the response requires a couple of seconds to play but only a few tens of milliseconds to start streaming using a small model. Currently I just break the small model's response off at whatever point will produce about enough time to spin up the larger model.
But all responses spin up both models.
Does that mean that half of responses have a negative delay? As in, humans interrupt each other's sentences precisely half of the time?
Same the other way. If I stop talking and then 300ms later you start talking, then the delay is 300ms.
And if you start talking right when I stop, the delay is 0ms.
You can get the info by just listening to recorded conversations of two people and tagging them.
All that to say, I'd imagine people are adaptable enough to easily handle 100ms+ delay when they know they're talking to an AI.
It really feels to me like there’s some low hanging fruit with voice that no one is capitalizing on: filler words and pacing. When the llm notices a silence, it fills it with a contextually aware filler word while the real response generates. Just an “mhmm” or a “right, right”. It’d go so far to make the back and forth feel more like a conversation, and if the speaker wasn’t done speaking; there’s no talking over the user garbage. (Say the filler word, then continue listening.)
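A minimal sketch of that filler strategy: if the real reply isn't ready within a short window, play an acknowledgement first. `generate_reply` and `play` are hypothetical hooks into the LLM and TTS layers:

```python
import asyncio

async def respond_with_filler(generate_reply, play, filler="mhmm",
                              filler_after_s=0.4):
    """Play a short contextual filler if the real reply takes too long,
    then play the reply itself."""
    reply_task = asyncio.ensure_future(generate_reply())
    done, _ = await asyncio.wait({reply_task}, timeout=filler_after_s)
    if not done:
        await play(filler)            # fill the silence while we wait
    await play(await reply_task)
```

The nice property is that the filler only appears when latency would otherwise be noticeable; fast turns stay filler-free.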
Same strategy but employed by humans.
That's different from banning the computer from thinking before they speak, ain't it?
2) If end-of-turn was detected very late, we can randomly select a first phonetic syllable, and then note in the prompt that the reply should start with that syllable!
The one spot where it feels a bit off is the "2x faster than Vapi" claim. Your system is a clean straight pipe: transcript -> LLM -> TTS -> audio. No tool calls, no function execution, no webhooks, no mid-turn branching.
Production platforms like Vapi are doing way more work on every single turn. The LLM might decide to call a tool—search a knowledge base, hit an API, check a calendar—which means pausing token streaming, executing the tool, injecting the result back into context, re-prompting the LLM, and only then resuming the stream to TTS. That loop can happen multiple times in a single turn. Then layer on call recording, webhook delivery, transcript logging, multi-tenant routing, and all the reliability machinery you need for thousands of concurrent calls… and you’re comparing two pretty different workloads.
The core value of the post is that deep dive into the orchestration loop you built yourself. If it had just been "here’s what I learned rolling my own from scratch," it would’ve been an unqualified win. The 2x comparison just needs a quick footnote acknowledging that the two systems aren’t actually doing the same amount of work per turn.
Text in, audio out, so you can merge LLM+TTS into a single streamable step.
https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flas...
Also Groq is very fast, but the latency wasn't always consistent and I saw some very strange responses on a few calls that I had to attribute to quantization.
One thing I've noticed working with voice agents: latency isn't just about total response time, it's about the shape of the response. Streaming the first few tokens in <200ms while the rest generates creates a much better UX than waiting 450ms for a complete response, even if the total time is similar. Humans perceive the start of a response as "acknowledgment" and are more patient after that.
Curious what your architecture looks like for handling interruptions (barge-in). That's usually where the real complexity hides — detecting when the user starts speaking mid-response and gracefully stopping generation.
https://soniox.com/docs/stt/rt/endpoint-detection
Soniox also wins the independent benchmarks done by Daily, the company behind Pipecat.
https://www.daily.co/blog/benchmarking-stt-for-voice-agents/
You can try a demo on the home page:
Disclaimer: I used to work for Soniox
Edit: I commented too soon. I only saw VAD and immediately thought of Soniox which was the first service to implement real time endpoint detection last year.
Another good follow-up would be to try PersonaPlex, Nvidia's new model that would completely replace this architecture with a single model that does everything:
The cascading model (STT -> LLM -> TTS), is unlikely to go away anytime soon for a whole lot of reasons. A big one is observability. The people paying for voice agents are enterprises. Enterprises care about reliability and liability. The cascading model approach is much more amenable to specialization (rather than raw flexibility / generality) and auditability.
Organizations in regulated industries (e.g. healthcare, finance, education) need to be able to see what a voice agent "heard" before it tries to "act" on transcribed text, and same goes for seeing what LLM output text is going to be "said" before it's actually synthesized and played back.
Speech-to-Speech (end-to-end) models definitely have a place for more "narrative" use cases (think interviewing, conducting surveys / polls, etc.).
But from my experience working with clients, they are clamoring for systems and orchestration that actually use some good ol' fashioned engineering and that don't solely rely on the latest-and-greatest SoTA ML models.
Is it super sexy? No. But each individual type of model is developing at a different rate (TTS moves really fast, low latency STT/ASR moved slower, LLMs move at a pretty good pace).
Just use the same tricks humans are using for that.
Carmack's 2013 "Latency Mitigation Strategies" paper[0] made the same point for VR too: every millisecond hides in a different stage of the pipeline, and you only find them by tracing the full path yourself. Great find with the warm TTS websocket pool saving ~300ms, perfect example of this.
The insights about VAD and streaming pipelines in this thread are exactly what I'm looking at for v2. Moving to a WebSocket streaming pipeline with proper voice activity detection would close the latency gap significantly, even with local models.
I am using it daily to drive Claude and it works really-well for me (much better than macOS dictation mode).
The "turn-taking problem, not transcription problem" framing is exactly right. We burned weeks early on optimizing STT accuracy when the actual UX killer was the agent jumping in mid-sentence or waiting too long. Switching from fixed silence thresholds to semantic end-of-turn detection was night and day.
One dimension I'd add: geography matters even more when your callers are in a different region than your infrastructure. We serve callers in India connecting to US-East, and the Twilio edge hop alone adds 150-250ms depending on the carrier. Region-specific deployments with caller-based routing helped a lot.
The barge-in teardown is the part most people underestimate. It's not just canceling LLM + TTS — if you have downstream automation (updating booking state, triggering webhook workflows, writing to DB), you need to handle the race condition where the system already committed to a response path that's now invalid. We had a bug where a barged-in appointment confirmation was still triggering the downstream booking pipeline.
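That race condition is typically closed with a turn-generation guard: every turn gets an id, a barge-in bumps it, and any downstream commit carrying a stale id is dropped. A sketch (the booking-webhook framing is from the comment above; the guard itself is a generic pattern, not their actual code):

```python
import threading

class TurnGuard:
    """Guard downstream side effects against stale turns. A barge-in
    bumps the turn id, so commits queued for the cancelled response
    (e.g. a booking webhook) are dropped instead of firing."""

    def __init__(self):
        self._lock = threading.Lock()
        self._turn = 0

    def begin_turn(self) -> int:
        with self._lock:
            self._turn += 1
            return self._turn         # token to attach to this turn's work

    def barge_in(self):
        with self._lock:
            self._turn += 1           # invalidates all in-flight tokens

    def commit(self, token: int, side_effect) -> bool:
        with self._lock:
            if token != self._turn:
                return False          # turn was cancelled: don't fire
            side_effect()
            return True
```

Checking the token and firing the side effect under the same lock is what makes the check atomic with respect to a concurrent barge-in.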
Groq 8b instant is the fastest LLM from my tests. I used smallest.ai for TTS as it has the smallest TTFT.
My Raspberry Pi stack: Porcupine for wake-word detection + ElevenLabs for STT + Groq Scout (as it supports home automation better) + smallest.ai for ~70ms TTFB.
Call stack: Twilio + Groq Whisper for STT + Groq 8b instant + smallest.ai for TTS.
Alexa skill stack: wrote an Alexa skill to contact my stack running on a VPS server.
I also have a setting in mr_sip to use gpt-realtime via plugin ah_openai, which is very low latency speech-to-speech but quite expensive.
But my client saw the Sesame demo page, and so now I am trying to fine tune PersonaPlex.
For anyone curious: https://flux.deepgram.com/
> Chunks the audio when the model believes based on the words said by the user that they have completed their utterance.
Source: https://developers.openai.com/api/docs/guides/realtime-vad
OpenAI's Semantic mode is looking at the semantic meaning of the transcribed text to make an educated guess about where the user's end of utterance is.
According to Deepgram, Flux's end-of-turn detection is not just a semantic VAD (which inherently is a separate model from the STT model that's doing the transcribing). Deepgram describes Flux as:
> the same model that produces transcripts is also responsible for modeling conversational flow and turn detection.
[...]
> With complete semantic, acoustic, and full-turn context in a fused model, Flux is able to very accurately detect turn ends and avoid the premature interruptions common with traditional approaches.
Source: https://deepgram.com/learn/introducing-flux-conversational-s...
So according to them, end-of-turn detection isn't just based on the semantic content of the transcript (which makes sense given the latency), but rather on the characteristics of the actual audio waveform as well.
Pipecat (an open source voice AI orchestration platform) seemingly does this as well with its smart-turn native turn-detection model, minus the built-in transcription: https://github.com/pipecat-ai/smart-turn
Curious about your semantic end-of-turn detection: are you using a separate lightweight model for that, or is it baked into the main LLM inference? That seems like the hardest part to get right without adding latency.
(Raspberry Pi Voice Assistant)
Jarvis uses Porcupine for wake word detection with the built-in "jarvis" keyword. Speech input flows through ElevenLabs Scribe v2 for transcription. The LLM layer uses Groq llama-3.3-70b-versatile as primary with Groq llama-3.1-8b-instant as fallback. Text-to-speech uses Smallest.ai Lightning with Chetan voice. Audio input/output handled by ALSA (arecord/aplay). End-to-end latency is 3.8–7.3 seconds.
(Twilio + VPS)
This setup ingests audio via Twilio Media Streams in μ-law 8kHz format. Silero VAD detects speech for turn boundaries. Groq Whisper handles batch transcription. The LLM stack chains Groq llama-4-scout-17b (primary), Groq llama-3.3-70b-versatile (fallback 1), and Groq llama-3.1-8b-instant (fallback 2) with automatic failover. Text-to-speech uses Smallest.ai Lightning with Pooja voice. Audio is encoded from PCM to μ-law 8kHz before streaming back via Twilio. End-to-end latency is 0.5–1.1 seconds.
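The PCM → μ-law encode step before streaming back over Twilio Media Streams can be done in pure Python. This is a sketch of the standard G.711 μ-law companding algorithm (a real pipeline would likely use an optimized audio library instead):

```python
def linear_to_ulaw(sample: int) -> int:
    """Encode one 16-bit signed PCM sample as a G.711 mu-law byte
    (the 8kHz mu-law format Twilio Media Streams expects)."""
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0
    magnitude = min(-sample if sample < 0 else sample, CLIP) + BIAS
    # find the segment: position of the highest set bit in bits 7..14
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are stored inverted
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def pcm16_to_ulaw(samples) -> bytes:
    """Encode an iterable of PCM samples into a mu-law byte string."""
    return bytes(linear_to_ulaw(s) for s in samples)
```

Silence (sample 0) encodes to `0xFF` and full-scale positive to `0x80`, matching the G.711 tables.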
───
(Alexa Skill)
Tina receives voice input through Alexa's built-in ASR, followed by Alexa's NLU for intent detection. The LLM is Claude Haiku routed through the OpenClaw gateway. Voice output uses Alexa's native text-to-speech. End-to-end latency is 1.5–2.5 seconds.
However the naturalness of how it sounds will depend on how the TTS model works and whether two identical chunks of text will sound alike every generation.
Perhaps I'm in an older cohort, but I remember this delay, and what it felt like sustaining a conversation with this class of delay.
(it's still a remarkable advance, but do bear in mind the UX)
> "extensively" = 2 comments
Possibly GP has teenagers. Two comments is a pretty extensive discussion with teenagers ))

1. I wonder if it could be optimised more by just having a single language, and
2. How do we get around the problem of interference? Humans are good at conversation discrimination, i.e. listening while multiple conversations, TV, music, etc. are going on in the background. I've not had too much success with voice in noisy environments.
I like to listen to space content when going to sleep. Channels like History of the Universe, Astrum, PBS space time, SEA, etc.
Lately there's been a bunch of new-ish channels that produce content in that space (heh) and I'm amazed at how good the voices sound. Sometimes it takes a good few minutes to figure out they're genai voices, they're that good. If it weren't for small mistakes, I bet more than 80% of the general population wouldn't have a clue.
Curious how you handled latency and response time. Voice agents usually struggle with that.
Nice work.
You could probably improve your metrics even more with those in the mix again?
At a minimum Siri, Alexa, and Google Home should at least have a path to plugin a tool like this. Instead I’m hacking together conversation loops in iOS Shortcuts to make something like this style of interaction with significantly worse UX.
In the middle of moving though so probably have to wait before taking on hardware.
Even a minute if you need it!
And you can get the agent to crunch when you are ready.
Imagine: you speak, you need to look something up, you find it, speak some more, then "over to you!"
The agent doesn't have to behave like a human and figure out when to butt in.
After all chat rooms and Slack also have realtime 2 way but we didn't worry about emulating that in agent chat. We can be convention breaking in agentic voice chat too.