This GitHub repo turns an ESP32-S3 into a realtime AI speech companion using the OpenAI Realtime API, Arduino WebSockets, Deno Edge Functions, and a full-stack web interface. You can talk to your own custom AI character, and it responds instantly.
I couldn't find a resource that helped set up a reliable, secure WebSocket (WSS) AI speech-to-speech service. While there are several useful Text-To-Speech (TTS) and Speech-To-Text (STT) repos out there, I believe none of them gets speech-to-speech right. OpenAI launched an embedded repo late last year that sets up WebRTC with ESP-IDF. However, it's not beginner-friendly and doesn't have a server-side component for business logic.
This repo is an attempt at solving the above pains and creating a great speech-to-speech experience on Arduino with secure WebSockets, using edge servers (Deno/Supabase Edge Functions) for fast global connectivity and low latency.
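As a rough illustration of the edge-server piece (not taken from the repo; the endpoint, model name, and subprotocol-based auth are assumptions), the relay could look something like this minimal Deno sketch:

```
// Hypothetical sketch (not the repo's actual code): a Deno edge relay that
// upgrades the device's WSS connection and forwards events to the Realtime API.
Deno.serve((req) => {
  if (req.headers.get("upgrade") !== "websocket") {
    return new Response("expected a websocket", { status: 426 });
  }
  const { socket: device, response } = Deno.upgradeWebSocket(req);

  // Subprotocol-based auth, as documented for clients that cannot set an
  // Authorization header; the model name here is illustrative.
  const upstream = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    [
      "realtime",
      `openai-insecure-api-key.${Deno.env.get("OPENAI_API_KEY")}`,
      "openai-beta.realtime-v1",
    ],
  );

  // Forward traffic in both directions; business logic (auth, logging,
  // character prompts) would hook in here.
  device.onmessage = (e) => {
    if (upstream.readyState === WebSocket.OPEN) upstream.send(e.data);
  };
  upstream.onmessage = (e) => {
    if (device.readyState === WebSocket.OPEN) device.send(e.data);
  };
  device.onclose = () => upstream.close();
  upstream.onclose = () => device.close();

  return response;
});
```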
----
If anyone is trying to build physical devices with the Realtime API, I would love to help. I work at OpenAI on the Realtime API, worked on [0] (which was upstreamed), and I really believe in this space. I want to see this all built with open/interoperable standards so we don't have vendor lock-in and developers can build the best thing possible :)
The offer is open to anyone. If you need help with WebRTC, the Realtime API, or embedded, I am here to help. I have an open meeting link on my website.
The OpenAI "Voice Mode" is closer, but when we can have near instantaneous and natural back and forth voice mode, that will be a big in terms of it feeling magical. Today, it is say something, awkwardly wait N seconds then listen to the reply and sometimes awkwardly interrupt it.
Even if the models were no smarter than they are today, if we could crack that "conversational" piece and the performance piece, it would make a big difference in my opinion.
```
turn_detection: {
  type: "server_vad",
  threshold: 0.4,
  prefix_padding_ms: 400,
  silence_duration_ms: 1000,
},
```
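For reference, in the WebSocket flavour of the Realtime API these settings get applied by sending a `session.update` event after connecting. A minimal sketch, assuming `ws` is an already-open WebSocket to the API:

```
// Sketch: apply the turn-detection settings above via a session.update event.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: {
      type: "server_vad",
      threshold: 0.4,            // higher = less sensitive to quiet speech
      prefix_padding_ms: 400,    // audio kept from just before speech starts
      silence_duration_ms: 1000, // silence required before the model replies
    },
  },
}));
```

Raising `silence_duration_ms` trades snappier replies for fewer accidental interruptions, which is exactly the awkward-pause problem described above.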
What would be REALLY cool is if we had something that would interrupt you during conversation like talking with a real human.
Both the Supabase API and OpenAI billing are per API call.
So the lovely talking toys can die if the company stops being profitable.
I would love to see a version with decent hardware that runs a local model, that could have a long lifespan and work offline.
This is a good point to me as a parent -- in a world where this becomes a precious toy, there would be a serious risk of emotional pain if the child experienced this scenario like the death of a pet or friend.
> version with decent hardware that runs a local model
I feel like something small and efficient enough to meet that (today) would be dumb as a post. Like Siri-level dumb.
Personally, I'd prefer a toy which was tethered to a home device. Without a cloud (and thus commercial) dependency, the toy wouldn't be 'smart' outside of Wi-Fi range, but I'd design it so that it got 'sleepy' when away from Wi-Fi, able to be "woken up" and, in that state, to respond to a few phrases with canned, Siri-like answers. Perhaps new content could be made up for it daily and downloaded to local storage while at home, so that it could still "tell me a story" offline etc.
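A toy-sized sketch of that fallback behaviour (all names, phrases, and file paths here are made up purely for illustration):

```
// Illustrative only: degrade gracefully when the home device / Wi-Fi
// (and therefore the cloud) is unreachable.
type Mode = "online" | "sleepy";

// Canned responses pre-downloaded to local storage while at home.
const cannedReplies: Record<string, string> = {
  "tell me a story": "stories/today.wav",
  "good night": "clips/goodnight.wav",
};

function respond(mode: Mode, utterance: string): string {
  if (mode === "online") return "(stream to the home device for a real answer)";
  // Sleepy mode: only a handful of Siri-like canned answers.
  return cannedReplies[utterance.toLowerCase()] ?? "clips/yawn.wav";
}
```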
We've already seen this exact scenario play out with "Moxie" a few months ago:
https://www.axios.com/2024/12/10/moxie-kids-robot-shuts-down
I noticed that it is dependent on OpenAI's Realtime API, so it got me wondering what open alternatives there are, as I would love a more realtime Alexa-like device in my home that doesn't contact the cloud. I have only played with software, but the existing solutions have never felt realtime to me.
I could only find <https://github.com/fixie-ai/ultravox> that seems to really work in realtime. It appears to be a model that wires up Llama and Whisper somehow, rather than treating them as separate steps, which is common with other projects.
What other options are available for this kind of real-time behaviour?
The design of OpenAI + WebRTC was to lean on WebRTC as much as possible to make it easier for users.
[0] https://speaches.ai/ [1] https://huggingface.co/spaces/Xenova/kokoro-web
Pretty sure you'd need to host this on something more robust than an ESP32 though.
- Why do you need a Next.js frontend for what looks like a headless use case?
- How much would the OpenAI bill be if there is 15 minutes of usage per day?
https://openai.com/index/introducing-the-realtime-api/
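Back-of-envelope on the cost question, using the preview pricing quoted in that announcement (roughly $0.06 per minute of audio input and $0.24 per minute of audio output; current rates may differ): 15 minutes a day is about 450 minutes a month. If roughly half of that is the model speaking, that's ~225 minutes of output (≈ $54) plus up to ~450 minutes of input audio (≈ $27), so very roughly $50-80 a month before any cheaper model tiers or caching discounts.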
About the Next.js site, I was thinking maybe it's difficult to have Supabase hold long connections, or route the response? I'm curious too.
https://en.wikipedia.org/wiki/AG_Bear
With regard to this project, using an ESP32 makes a lot of sense. I used an Espressif ESP32-S3 Box to build a smart speaker along with the Willow inference server, and it worked very well. The ESP speech recognition framework helps with wake word / far-field audio processing.
Maybe I'm alone? To me, this comes across as extremely creepy, the exact opposite of what we should desire from AI in products aimed at children.
Children don’t need this; they are so much more creative than an AI (and the adults that trained the AI), and their creativity is fueled by boredom.
That said, I totally agree that I wouldn't want this in a kids toy. The whole idea is super creepy in that respect, with so much scope for abuse.
I poured hours into games/programming because it was a happy place away from school etc… These toys could be the same.
This technology is neutral, but I see so much potential for projects that do good.
Bots are for doing tasks. I don't want to socialize with them and find the idea of kids being socialized by bots supremely weird. At least the AI girlfriend people are (probably unwell) adults.
IMO this is only exacerbated by the fact that little children (who are presumably the target audience for stuffed animals that talk) often don't follow "normal" patterns of conversation or topics, so it feels like it'd be hard to accurately simulate/test the ways in which unexpected and undesirable responses could come out.
Essentially, telling kids the truth before they're ready and without typical parental censorship? Or is there some other fear, like the AI will get compromised by a pedo and he'll talk your kid into who knows what? Or similarly, "fill in state actor" using mind control on your kid (which, honestly, I feel is normalized even for adults; e.g., Fox News, etc.; again, US-centric)?
https://www.npr.org/2024/12/10/nx-s1-5222574/kids-character-...
https://apnews.com/article/chatbot-ai-lawsuit-suicide-teen-a...
https://www.euronews.com/next/2023/03/31/man-ends-his-life-a...
I have a 6-year-old, FWIW; I'm not some childless ignoramus, I just do my risk calcs differently and view it as my job to oversee their use of a device like this. I wouldn't fear it outright because of what could happen; if I took that stance, my kid would never have any experiences at all.
Can't play baseball, I read a story where a kid got hit by a bat. Can't travel to Mexico, cartels are in the news again. Home school it is, because shootings. And so on.
> telling kids the truth before they're ready and without typical parental censorship
Does AI today reliably respond with "the truth"? There are countless documented incidents of even full-grown, extremely well-educated adults (e.g. lawyers) believing well-phrased hallucinations. Kids, and particularly small kids who haven't yet had much education about critical thinking and what to believe, have no chance. Conversational AI today isn't an uncensored search engine over a set of well-reasoned facts; it's an algorithm constructing a response based on what it's learned people on the internet want to hear, with no real concept of what's right or wrong, and no foundational set of knowledge about the world to contrast with and validate against.
> what exactly is the fear
Being fed reliable-sounding misinformation is one. Another is being used for emotional support (which kids do even with non-talking stuffed animals), when the AI has no real concept of how to emotionally support a kid and could just as easily do the opposite. I guess overall, the concern is having a kid spend a large amount of time talking to "someone" who sounds very convincing, has no real sense of morality or truth, and can potentially distort their world view in negative ways.
And yea, there's also exposing kids to subjects they're in no way equipped to handle yet, or encouraging them to do something that would result in harm to themselves or to others. Kids are very suggestible, and it takes a long while for them to develop a real understanding of the consequences of their actions.
I mean, that's not a silly fear. But perhaps you don't have any children? "Typical parental censorship" doesn't mean prudish pearl-clutching.
I have an autistic child who already struggles to be appropriate with things like personal space and boundaries -- giving him an early "birds and bees" talk could at minimum result in him doing and saying things that could cause severe trauma to his peers. And while he has less self-control than a typical kid, even "completely normal" kids shouldn't be robbed of their innocence and forced to confront every adult subject before they're mature enough to handle it. There's a reason why content ratings exist.
Explaining difficult subjects to children, such as the Holocaust, sexual assault, etc. is very difficult to do in a way that doesn't leave them scarred, fearful, or worse, end up warping their own moral development so that they identify with the bad actors.
I think my theory is kind of correct: people generally 'trust' a YouTube censor, but an AI censor is currently seen as untrusted boogeyman territory.
I had a similar idea that I never followed through with (even down to using an ESP).
Basically, you could make a Harry Potter-style talking painting with your device + an e-ink display that shows some 3D-modeled character.
For others, here’s a direct link to a demo video:
I do wonder if the cellphone/app argument is why we didn't see that many hardware LLM API wrappers up until now. The Rabbit R1 was basically just that.
I've seen more products in this space recently, such as Ropet [1], LOOI [2], and others, but for now it's going to be costly for companies to sell such a product at a fixed price, as I think a subscription model would be a hard sell [3] for consumers.
[1] https://www.kickstarter.com/projects/1067657324/ropet-your-n... [2] https://looirobot.com/products/looi-robot?variant=4909200762... [3] https://tech.yahoo.com/ai/articles/tragic-robot-shutdown-sho...
What kind of interesting challenges have you run into, and how has your work influenced OpenAI's Realtime API?
PS: Your GitHub README is quite well crafted; that's hard to come across nowadays.
Not the first time I ran into it, but I did not bother commenting.
I can recognize it from far away. Thankfully I am not the only one.
I think the README is still well crafted; AI couldn't do this without the author.
If he meant your reply, I do not see any reason why. :D