Show HN: A personalised AI tutor with < 1s voice responses
72 points
1 month ago
| 9 comments
| educationbot.cerebrium.ai
TLDR: We created a personalised Andrej Karpathy tutor that can respond to questions about his YouTube videos in under one second (voice-to-voice). We do this using a voice-enabled RAG agent. See later in the post for the demo link, GitHub repo and blog write-up.

A few weeks ago we released the world's fastest voice bot, achieving 500ms voice-to-voice response times, including a 200ms delay waiting for a user to stop speaking.

After reaching the front page of HN, we thought about how we could take this a step further based on feedback we were getting from the community. Many companies were looking for a way to implement function calling and RAG with voice interfaces while keeping latency low enough. We couldn't find many resources online about how to do this that:

1. Allowed us to achieve sub-second voice-to-voice latency

2. Were more flexible than existing solutions. Vapi, Retell and [Bland.ai](http://Bland.ai) are too opinionated, and since they just orchestrate APIs they incur network latency at every step. See requirement 1 above.

3. Had unit economics that actually work at scale

So we decided to create an implementation of our own.

Process:

As we mentioned in our previous release, if you want to achieve response times this low you need to make everything as local as possible. Below was our setup:

- Local STT: Deepgram model

- Local embedding model: Nomic v1.5

- Local vector DB: Turso

- Local LLM: Llama 3B

- Local TTS: Deepgram model
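To make the flow concrete, here is a minimal sketch of how the stages chain together. The class names and interfaces are hypothetical, not the actual Pipecat components in the repo; the point is the latency budget each local hop has to fit into:

    # Hypothetical wiring of the stages above (illustrative names only).
    class VoicePipeline:
        def __init__(self, stt, embedder, vector_db, llm, tts):
            self.stt, self.embedder = stt, embedder
            self.vector_db, self.llm, self.tts = vector_db, llm, tts

        async def on_utterance(self, audio: bytes) -> bytes:
            question = await self.stt.transcribe(audio)           # local Deepgram STT
            query_vec = await self.embedder.embed(question)       # Nomic v1.5, ~200ms
            context = await self.vector_db.top_k(query_vec, k=3)  # Turso, ~10ms reads
            answer = await self.llm.complete(question, context)   # local Llama
            return await self.tts.synthesize(answer)              # local Deepgram TTS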

From our previous example, the only new components were:

- Local embedding model: we chose the Nomic Embed Text v1.5 model, which gave a processing time of roughly 200ms.

- Local vector DB: Turso offers local embedded replicas combined with edge databases, which meant we were able to achieve 0.01-second read times (see the retrieval sketch below). Pinecone also gave us good times of 0.043 seconds.
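As a rough sketch of that retrieval path (this uses the Hugging Face nomic-ai/nomic-embed-text-v1.5 checkpoint via sentence-transformers; the table name, columns and vector SQL are assumptions rather than the repo's actual schema):

    import os
    import libsql_client
    from sentence_transformers import SentenceTransformer

    # Nomic v1.5 expects a task prefix on inputs ("search_query: " for queries).
    model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5",
                                trust_remote_code=True)

    def retrieve_chunks(question: str, k: int = 3) -> list[str]:
        vec = model.encode(f"search_query: {question}").tolist()
        client = libsql_client.create_client_sync(
            url=os.environ["TURSO_DATABASE_URL"],      # embedded replica in prod
            auth_token=os.environ["TURSO_AUTH_TOKEN"],
        )
        result = client.execute(
            # vector_distance_cos/vector() per libSQL's native vector search;
            # check the libSQL docs for the syntax your version supports.
            "SELECT chunk_text FROM lecture_chunks "
            "ORDER BY vector_distance_cos(embedding, vector(?)) LIMIT ?",
            [str(vec), k],
        )
        client.close()
        return [row[0] for row in result.rows]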

The above changes allowed us to achieve sub-1-second voice-to-voice response times.

Application:

With Andrej Karpathy's announcement of [Eureka Labs](https://eurekalabs.ai/), a new AI+education company, we thought we would create our very own personalised Andrej tutor.

Listen to any of his YouTube lectures; as soon as you start speaking, the video will pause and he will reply. Once your question has been answered, you can tell him to continue with the lecture and the video will automatically resume playing.

Demo: https://educationbot.cerebrium.ai/

Blog: https://www.cerebrium.ai/blog/creating-a-realtime-rag-voice-...

Github Repo: https://github.com/CerebriumAI/examples/tree/master/19-voice...

For demo purposes:

- We used OpenAI for GPT-4o mini and embeddings (it's cheaper to run on CPUs than GPUs when running demos at scale). These changes add about 1 second to the response time.

- We used ElevenLabs to clone his voice to make replies sound more realistic. This adds about 300ms to the response time.
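Roughly, the demo-mode swap looks like this (the class names follow Pipecat's conventions, but check the repo for the exact imports and constructor arguments):

    import os
    from pipecat.services.openai import OpenAILLMService
    from pipecat.services.elevenlabs import ElevenLabsTTSService

    # Hosted services instead of the fully local models, for cheaper demos.
    llm = OpenAILLMService(
        api_key=os.environ["OPENAI_API_KEY"],
        model="gpt-4o-mini",              # adds roughly 1s vs. local Llama
    )
    tts = ElevenLabsTTSService(
        api_key=os.environ["ELEVENLABS_API_KEY"],
        voice_id="YOUR_CLONED_VOICE_ID",  # cloned voice, adds roughly 300ms
    )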

The improvements we would like the community to contribute are:

- Embed the video frames as well, so that when you ask certain questions it can show you the relevant lecture slide for the same chunk it retrieved context from to answer.

- Insert timestamps into the vector DB so that if a question will be answered later in the lecture, he can let you know (a possible schema is sketched below).
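For the timestamp idea, a purely illustrative schema would store where each chunk starts and ends in the lecture:

    import libsql_client

    # All names and the F32_BLOB vector column are assumptions, not the
    # repo's actual schema; 768 is Nomic v1.5's default dimension.
    client = libsql_client.create_client_sync(
        url="libsql://your-db.turso.io", auth_token="...")
    client.execute("""
        CREATE TABLE IF NOT EXISTS lecture_chunks (
            id            INTEGER PRIMARY KEY,
            video_id      TEXT,
            start_seconds REAL,   -- where this chunk begins in the video
            end_seconds   REAL,
            chunk_text    TEXT,
            embedding     F32_BLOB(768)
        )""")
    client.close()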

This unlocks so many use cases in education, employee training, sales and more - it would be great to see what the community builds!

tunesmith
1 month ago
[-]
I think one of the hardest things about voice AI is being able to gracefully modulate between styles of input/output delay. In actual technical conversations sometimes it is appropriate to patiently wait for someone to ask a very difficult question with delays in speech, and sometimes it is appropriate to interrupt regularly and ask brief pointed questions that require one-word answers, and everywhere in between. I'm really looking forward to having that kind of interchange with an LLM.
reply
euroderf
1 month ago
[-]
Maybe a variable-pressure PTT button for the user would be effective. Push harder or less hard depending on whether or not a word is on the tip of your tongue. The AI could appropriately interrupt or not.
reply
anvil-on-my-toe
1 month ago
[-]
I wonder if you need to account for body language to get that figured out. A technical conversation on the phone is usually not as fluid as in-person.
reply
za_mike157
1 month ago
[-]
You are 100% correct! Here we just check if there is a pause of 200ms (you can change that in the code). I haven't seen any models that can detect whether the user has finished speaking based on the question or social cues. You also don't want to make the delay too long, since then it sounds unnatural.
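Roughly what that check looks like (the vad object here is hypothetical; the demo uses Pipecat's built-in VAD with a configurable pause threshold):

    PAUSE_MS = 200   # shorter feels snappier but interrupts slow speakers more
    FRAME_MS = 20    # duration of each audio frame

    def utterance_ended(vad, recent_frames) -> bool:
        silent_ms = 0
        for frame in recent_frames:   # oldest first
            if vad.is_speech(frame):
                silent_ms = 0         # any speech resets the silence counter
            else:
                silent_ms += FRAME_MS
        return silent_ms >= PAUSE_MS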
reply
jimmar
1 month ago
[-]
Interesting demo. I'd argue that a good tutor will ask more questions than answer. The tutor has to gauge the learner's understanding and adapt instruction. Properly formulating a question can be tricky if you're confused.
reply
za_mike157
1 month ago
[-]
That's an interesting thought. We could get it to periodically ask questions once a concept has been explained in the demo. There have actually been a few studies showing that students learn best via discussion.
reply
TINJ
1 month ago
[-]
Ending most responses with something like "Did that make sense?" or "Should I clarify anything I just talked about?" could help.

Also, I asked some math questions, and the AI started reciting equations aloud. For me, at least, this was impossible to follow. I know part of the point of this is to make it conversational, but I would think having a transcript displayed somewhere would help a lot.

For bonus points you could do this with the voice input. Only show the transcription when it's relevant or if the user asks.

reply
za_mike157
1 month ago
[-]
This is a good suggestion! I will add it to the list of things that the community can extend.
reply
rahimnathwani
1 month ago
[-]
I tried to use it, but:

1. The first instruction starts "When the bot has connected...", and

2. After waiting a couple of minutes, 'BOT STATUS' still says 'Connecting'.

I am running this in a 'Guest' profile in Chrome, that has no extensions installed.

These are the errors/warnings I see in the dev console:

    Unrecognized feature: 'web-share'. ta$2 @ index-CElq2hz-.js:33
    index-CElq2hz-.js:2067 Loading VAD
    index-CElq2hz-.js:2043 env.wasm.numThreads is set to 4, but this will not work unless you enable crossOriginIsolated mode. See https://web.dev/cross-origin-isolation-guide/ for more info. ug @ index-CElq2hz-.js:2043
    (x10) Third-party cookie will be blocked in future Chrome versions as part of Privacy Sandbox.

reply
za_mike157
1 month ago
[-]
Sorry, it's getting smashed right now - increasing capacity.
reply
dsmurrell
1 month ago
[-]
If I wanted to make something half as good that I can run on a server somewhere, where would you suggest I start with STT. I've been looking for some online (i.e. realtime services) but found nothing great. Can you point me to a low effort/cost solution I can use myself? Are there good tutorials or example projects that have this working but not quite as well as you?
reply
za_mike157
1 month ago
[-]
Can you please elaborate on what you would like help with?

I linked all the code in the post so you can experiment with it yourself and even extend it. Also, we are making use of the Pipecat framework (https://github.com/pipecat-ai/pipecat), which means you can swap in/out any STT, LLM or TTS model you would like to use :)

reply
heystefan
1 month ago
[-]
Cool idea! Been experimenting with similar stuff, still trying to solve for detecting when a question ends.

BTW at one point it got confused and said that it doesn't have the ability to continue a lecture. I even have a recording in case you need it.

reply
za_mike157
1 month ago
[-]
Awesome! Let me know if you figure it out!

Yeah, I got lazy with the "continue with lecture" part by doing fuzzy matching on the text. If you have an accent or it's noisy, it's not that great. The more robust way to do it would be function calling.
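Something like this is what I mean - expose resuming as a tool and let the LLM decide when the user asked for it (tool name and wiring made up):

    # OpenAI-style tool schema; replaces fuzzy-matching "keep going" etc.
    RESUME_TOOL = {
        "type": "function",
        "function": {
            "name": "resume_lecture",
            "description": "Resume playing the paused lecture video.",
            "parameters": {"type": "object", "properties": {}},
        },
    }

    def handle_response(message, player, tts):
        # message: an OpenAI-style chat completion message
        for call in message.tool_calls or []:
            if call.function.name == "resume_lecture":
                player.resume()       # unpause the embedded video
                return
        tts.say(message.content)      # otherwise reply by voice as usual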

reply
mmcclure
1 month ago
[-]
Super cool idea. I think it might be getting the hug of death right now because even once I got in I never seem to get a connected bot.
reply
za_mike157
1 month ago
[-]
Sorry, it was indeed receiving the hug of death - but it should be alleviated now.
reply
AdamRomyn
1 month ago
[-]
This is awesome! Nice work to you and your team!
reply
ArronYoung
1 month ago
[-]
We also want to implement similar functionality. You did a great job and inspired us.
reply
za_mike157
1 month ago
[-]
I'm glad! Keep pushing!
reply
ukuina
1 month ago
[-]
How is using a Deepgram model "local"?
reply
za_mike157
1 month ago
[-]
We partnered with them, so it's running their STT model locally in the container. You will see in the code we have an image reference which includes the Deepgram model.
reply
ukuina
1 month ago
[-]
Not sure why it requires a Deepgram API key if it is running locally?

> DeepgramClient(os.getenv("DEEPGRAM_API_KEY"))

reply
za_mike157
1 month ago
[-]
That is only for the data processing step that I run locally on my Mac to embed all his YouTube videos and upload to the vector DB. I'm not running Deepgram locally there.
reply
ukuina
1 month ago
[-]
That is awesome!
reply