How OpenAI delivers low-latency voice AI at scale
339 points | 9 hours ago | 21 comments | openai.com | HN
Sean-Der
8 hours ago
[-]
Very grateful that OpenAI published the article and publicized their usage of Pion [0], a library I work on. If you aren't familiar with WebRTC, it's a super fun space. I also work on a book, WebRTC for the Curious [1], that details how it works.

[0] https://github.com/pion/webrtc

[1] https://webrtcforthecurious.com

reply
ericmcer
5 hours ago
[-]
I use Pion, thanks for making it!

Curious if you thought their approach was necessary. It seemed like a ton of complexity to speed up what is already one of the faster parts of a voice AI setup. Having a fast model and accurate VAD seems way more important than fine-tuning WebRTC transit times.

reply
Sean-Der
5 hours ago
[-]
Thanks for using it :)

I think it’s a case of "you improve what you own". The owners of the WebRTC servers were aggressively improving their part. They don’t own the inference servers.

reply
willmeyers
2 hours ago
[-]
WebRTC is great and so is Pion, thanks for helping make and maintain it! I loved learning about WebRTC from WebRTC for the Curious!!!
reply
aleda145
7 hours ago
[-]
Appreciate you putting the entire book online!

I read parts of it a while ago when I had an idea about using WebRTC data channels to pass data from databases to browser clients via a CLI. Your book made me understand that it's probably not a great fit for my use case. I just used a centralized control plane and WebSockets instead.

I still feel like there is something fun that we can do with WebRTC data channels + zero-copy Apache Arrow ArrayBuffers + DuckDB WASM, but I haven't figured it out yet.

reply
oezi
6 hours ago
[-]
What is preventing the fun is that even though we now have IPv6 widely enough available we still can't have p2p connections in the browser without a cumbersome control plane of servers. If you could join a federation in the browser from some bootstrap IPs then I think we could have some real distributed fun.
reply
Sean-Der
5 hours ago
[-]
Thanks for reading it!

You can't beat WebSockets :) Especially since you have so much tooling/existing stuff that works with HTTP.

I have been trying to get a website off the ground that does data channels + SQLite in the browser, with users syncing between each other. I have gotten distracted so many times though.

reply
dtran
7 hours ago
[-]
Thanks for WebRTC for the Curious and for Pion! Not using the latter directly, but have used both to better understand WebRTC
reply
ryanar
4 hours ago
[-]
I used Pion and it was fantastic. Most of the article seems to describe pretty standard WebRTC techniques for performant voice.
reply
thatxliner
7 hours ago
[-]
slightly unrelated but what’s with storing the entire codebase in the root directory instead of a nested src folder? It makes getting to the README a lot more difficult
reply
nemothekid
7 hours ago
[-]
That's the default for Go projects. Go imports are repository paths, e.g.:

     import ("github.com/go-sql-driver/mysql")
so it's standard to have the library files in the root directory.
reply
a456463
7 hours ago
[-]
This is valid criticism. Go fanbois don't like listening to any Go criticism. They were all like "who needs templates in Go?" and now Go has templates.

To me, Go code looks like somebody vomited stuff in the root dir and I have to wade through that every time. No namespacing, nothing.

reply
junon
7 hours ago
[-]
I don't like go as a personal preference but reducing them to "fanboys" is a bit reductive. I'm sure the same could be said about your own favorite language.
reply
altmanaltman
6 hours ago
[-]
Is it reductive when it's describing a group of people that like something and refuse to hear any ill of it? The comment wasn't shade at people using the language in general.

And you're right, fanboys are in every language. But resorting to changing the argument by whataboutism is a bit reductive.

reply
simondotau
5 hours ago
[-]
I’m not a go fanboy, but I do know from other contexts that so-called “fanboy“ behaviour is frequently associated with level-headed supporters getting defensive in the face of imprecise criticism.

There’s an oft-repeated pattern where valid specific criticisms morph into broad criticism, which morphs into judgement, which breeds defensiveness, which feeds the criticism. Once you recognise this pattern, you see it everywhere.

reply
twodave
3 hours ago
[-]
Sure, and there's the near-identical pattern where valid specific criticisms are taken as broad criticism even though they aren't, etc., etc.
reply
idiotsecant
4 hours ago
[-]
Ok... The question was why is it like that. The answer is because it's in go. Nobody was anything other than civil before you neckbearded in here. Chill. There's a sane way to say what you said.
reply
haaz
6 hours ago
[-]
Only a software dev would start their referencing at 0 lol
reply
vorticalbox
6 hours ago
[-]
I do this too; I never made the connection.
reply
legohead
8 hours ago
[-]
The low latency is more of a pain point than a good thing, the way they have it implemented. Trying to have a casual conversation with it, as humans we naturally pause, and GPT will take this as you are "done" and start blabbing away.

I also suffer from finding the appropriate word I want as I've gotten older and slower, and this fast-voice-gpt just ends up frustrating me more than helping. I have to sit there and think out the whole sentence in my head before I say anything -- not very natural.

reply
zamadatix
7 hours ago
[-]
I think these are 2 different layers of "latency". The latency in the article is referring to the transport of the audio stream itself while the latency in your scenario is about how quickly to start responding inside the audio stream.
reply
hun3
4 hours ago
[-]
They are orthogonal.

Suppose you have 100ms audio latency and no wait time. Then, natural pause will trigger response immediately but you won't notice it has started until after ~200ms (round-trip time). Twice as annoying.
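
To make that arithmetic concrete, here is a tiny Python sketch. The function and parameter names are made up for illustration, and it assumes the same one-way latency in both directions:

```python
def ms_until_user_hears_reply(one_way_ms, wait_ms=0, inference_ms=0):
    """Time from the start of the user's pause until reply audio reaches them.

    Assumes symmetric one-way latency (an illustrative simplification,
    not a figure taken from the article).
    """
    uplink = one_way_ms      # the pause has to reach the server first
    downlink = one_way_ms    # then the reply audio has to travel back
    return uplink + wait_ms + inference_ms + downlink


print(ms_until_user_hears_reply(100))               # 200
print(ms_until_user_hears_reply(100, wait_ms=500))  # 700
```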

reply
ericmcer
5 hours ago
[-]
I think he’s saying they are doing an insane level of complexity to shave ~100ms off response times in a scenario where that isn’t important and might even be a negative
reply
zamadatix
5 hours ago
[-]
When GP mentioned reducing conversational latency as a negative that made sense (and should probably be done IMO), it just wasn't the same category of latency the article talks about reducing. I.e. increasing "network latency" just makes the conversation feel more and more out of sync, it doesn't change the rate at which the AI will interrupt ("turn latency") because the latter is based on the duration of the pause in the audio stream, not the duration it took to deliver that audio stream.

If you meant there is a case where reducing the network latency at the same delivery reliability for a given audio stream is actually a negative then I'd love to hear more about it as I'm a network guy always in search of an excuse for latency :D.
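
A minimal sketch of why these are separate knobs, assuming a plain silence-timer end-of-turn detector over 20 ms frames. The function name, frame size, and thresholds are all hypothetical, not OpenAI's implementation:

```python
def end_of_turn(frames, silence_needed_ms, network_delay_ms=0, frame_ms=20):
    """Return (stream_ms, wall_ms) when the turn is declared over, or None.

    frames: list of booleans on the speaker's timeline, True = speech.
    The decision depends only on consecutive silence inside the stream;
    network delay just shifts when the server gets to make it.
    """
    silent_ms = 0
    for i, is_speech in enumerate(frames):
        silent_ms = 0 if is_speech else silent_ms + frame_ms
        if silent_ms >= silence_needed_ms:
            stream_ms = (i + 1) * frame_ms
            return stream_ms, stream_ms + network_delay_ms
    return None


# 400 ms of speech, a 400 ms thinking pause, 200 ms more speech, then silence.
frames = [True] * 20 + [False] * 20 + [True] * 10 + [False] * 40

fast = end_of_turn(frames, silence_needed_ms=300, network_delay_ms=20)
slow = end_of_turn(frames, silence_needed_ms=300, network_delay_ms=200)
patient = end_of_turn(frames, silence_needed_ms=500, network_delay_ms=20)
```

In this toy run `fast` and `slow` both cut in at the same point in the stream (700 ms, in the middle of the thinking pause); the slower network only delays the wall-clock moment. Only `patient`, with the longer silence threshold, waits for the real end of the turn.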

reply
theptip
54 minutes ago
[-]
But you want to be able to interject “hold on…” and have it immediately stop talking, when it goes off the rails.

And GP is correctly pointing out that the only negative here (silence waiting latency maybe being too low) is tunable separately from the network latency number.

reply
ButlerianJihad
48 minutes ago
[-]
I want to be able to click the "Stop" button on my earphones remote. I want to be able to interject "woah" or "stop!" or "wait!" or that it would detect that I've inhaled a breath, or that my eyes glazed over. I want the LLM to figure out that every speed setting for its voice output is in "auctioneer" territory rather than "lecturing university professor with tenure and a pension" pacing.

But we won't get any of that, because the prime directive of LLMs is to burn tokens like there's no tomorrow. Burn tokens on a naïve answer without asking clarifying questions. Burn tokens on writing, debugging, and running a Python script or accessing and parsing 10 websites without asking for consent. Burn tokens on half-baked images with misspellings and 31 fingers. Burn tokens arguing "how many 'r's in strawberry?". Burn tokens asking a followup question at the end of every single answer, begging the user to re-engage and burn more tokens.

There is a little red "Stop" control when text output is being produced, at least, but does "Stop" halt everything and throw away the context? Re-prompt from the beginning?

The "maximize tokens burnt" prime directive is not to be found in any system prompt or user documentation. It is seemingly a common feature of the training for any consumer model.

Currently, if I'm using voice for an LLM, I use the voice dictation in the keyboard feature, because then the response is in text. There is no way to prevent "responding in kind" if I query the thing with audio. Or in Swahili.

reply
janalsncm
7 hours ago
[-]
I’ve also experienced this and it’s really annoying. There is this pressure to keep talking if I’m not done with my thought that feels pretty unnatural at least for me. If I’m searching for the right word, I want the opportunity to find it.

I think the solution is to handle pauses more intelligently rather than having a higher latency protocol. With low latency you can interrupt and the bot can immediately stop rambling.

reply
wnmurphy
5 hours ago
[-]
100%. I have to hold the floor by filling the space with "ummmmmmmm.... uhhhh...." which inevitably distracts me from my point altogether. Poor user experience.
reply
Gracana
3 hours ago
[-]
Seems like there's a big risk of having that habit leak into human conversation. A lot of people try really hard to train themselves not to add those fillers.
reply
discordance
4 hours ago
[-]
Have you tried telling it to pause to let you think?

I often use it while I’m walking and tell it to not respond until I initiate a conversation.

reply
pottertheotter
4 hours ago
[-]
I’ve tried this and it says it will but just keeps cutting in. I hate this feature so much.
reply
650REDHAIR
25 minutes ago
[-]
If anyone has an alternative I’m all ears.

This would be a killer feature for me and something I’ve tried to use on cross-country road trips.

reply
taneq
3 hours ago
[-]
I find this is a problem even with human conversations. Some people just aren’t very good at telegraphing when they’ve finished ‘their turn’ talking. Or worse yet, aren’t willing to take turns in the first place.
reply
dtran
6 hours ago
[-]
This has more to do with Voice Activity Detection (VAD) than the latency described in the article
reply
lxgr
5 hours ago
[-]
That seems to be the issue: VAD is insufficient here.

Knowing when to respond requires semantic understanding, which probably only the model itself is capable of.

Maybe it’s hard for them to train it to only respond once it seems appropriate to do so?

reply
Sean-Der
5 hours ago
[-]
I am excited for VAD to go away. PersonaPlex totally seems like the future.

However, for things like a call center helpline, turn-based actually seems better! You don't want to be interrupted when giving information back and forth (I think?)

reply
wnmurphy
5 hours ago
[-]
Exactly. It's a tangent, but clearly a pain point for enough users.
reply
jameshush
3 hours ago
[-]
This is more of a VAD/turn detection issue. It's gotten a lot better over the last few years, but it's a hard problem. The extra ~100ms of latency makes a huge difference otherwise, especially when you have use cases that require tool calling that can easily add 500ms+ of latency.
reply
angry_octet
2 hours ago
[-]
It seems that tool calling shouldn't be 500ms of latency?
reply
saturdaysaint
7 hours ago
[-]
In voice conversations I tell it not to reply at all or only say “Understood” until I use some kind of code word. Not perfect, but less intrusive.
reply
jdironman
6 hours ago
[-]
Roger that, over.
reply
miki_oomiri
2 hours ago
[-]
People are migrating to the "End Of Thought" triggers. Deepgram does that wonderfully.
reply
richardw
7 hours ago
[-]
Hard problem. I find myself adding in filler to stop the thing from jabbering.

I also think it spends most of its iq on sounding good rather than thinking about the problem. “Yeah absolutely I can see why you’d like to…” etc. This is likely because it’s on a timer and maybe voice is more expensive to process? Text responses spend more time on the task.

reply
lxgr
5 hours ago
[-]
Their voice capable model is several generations behind the state of the art text-only one, as far as I know.

I don’t think it even has reasoning tokens, so it’s no surprise that it’s at most as smart as the “instant” models (i.e., not very).

reply
asdfman123
6 hours ago
[-]
Fwiw you can prompt it to respond differently to you.
reply
wnmurphy
5 hours ago
[-]
Strongly agree, some of us like to choose our words more carefully when interacting with an LLM.

I've tried to convey this to OpenAI through various available channels (dev forums, app feedback, etc.).

Grok solves this by having an optional push-to-talk mode, but this is not hands-free and thus more cumbersome than just having a user-configurable variable like seconds_delay_before_sending_voice_input.

reply
ericmcer
5 hours ago
[-]
Yeah exactly, you cannot get a strong signal that a user is done speaking without some amount of “wait for 500ms of silence”. You could kick off processing and abandon it if they continued talking, but that seems over-optimized.

1-2s replies feel natural and like you pointed out pausing for 2-3s mid sentence is super normal.
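
For what it's worth, the "kick off and abandon" idea can be sketched as a small state machine. This is only an illustration under assumed numbers: 20 ms frames, speculation starting after 200 ms of silence, and a commit at 500 ms. The function and event names are hypothetical:

```python
def simulate(frames, frame_ms=20, speculate_after_ms=200, commit_after_ms=500):
    """Walk a list of speech/silence flags and log speculative-reply events.

    Returns a list of (time_ms, event) tuples where event is one of
    "start" (begin generating early), "abandon" (user resumed talking),
    or "commit" (silence lasted long enough, reply is played).
    """
    events = []
    silent_ms = 0
    speculating = False
    for i, is_speech in enumerate(frames):
        now = (i + 1) * frame_ms
        if is_speech:
            if speculating:
                events.append((now, "abandon"))  # throw the early work away
                speculating = False
            silent_ms = 0
            continue
        silent_ms += frame_ms
        if not speculating and silent_ms >= speculate_after_ms:
            events.append((now, "start"))
            speculating = True
        if silent_ms >= commit_after_ms:
            events.append((now, "commit"))
            break
    return events


# 200 ms speech, 300 ms pause, 200 ms speech, then 600 ms of silence.
frames = [True] * 10 + [False] * 15 + [True] * 10 + [False] * 30
events = simulate(frames)
```

The cost of this approach is the wasted work on every "abandon", which is the trade-off being questioned above.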

reply
charcircuit
3 hours ago
[-]
The AI should be able to model a probability for when is a natural moment to start talking.
reply
throwuxiytayq
7 hours ago
[-]
With higher latency this would be even more of an issue. When you pause and start talking again, the model wouldn't catch that until it has already interrupted you.

The actual implementation is at fault. I had some luck with instructing the model to only respond with "Mhm" until I've explicitly finished my thought and asked it a question. Makes this much less of an issue.

But I've decided that their voice mode is completely unusable for a different reason: the model feels incredibly dumb to interact with, keeps repeating and re-phrasing what I said, ends every single answer with a "hook" making the entire interaction idiotically robotic, completely ignores instructions when you ask it to stop that, and - most importantly - doesn't feel helpful for brainstorming. I was completely surprised how bad it is in practice; this should be their killer app but the model feels incredibly badly tuned.

reply
MagicMoonlight
7 hours ago
[-]
It’s possible to change the amount of time it waits if you’re using the API
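
A sketch of what that tuning might look like. As far as I recall from the Realtime API docs, server-side VAD accepts a `silence_duration_ms` setting inside a `turn_detection` object sent with a `session.update` event, but the exact nesting has differed between API versions, so treat the payload shape below as an assumption and check the current reference first. The helper name is made up:

```python
import json


def build_turn_detection_update(silence_ms):
    """Build a session.update event asking the server to wait longer.

    ASSUMPTION: field names and nesting follow my recollection of the
    Realtime API's server VAD settings; verify against the current docs.
    """
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                # how much trailing silence counts as "the user is done"
                "silence_duration_ms": silence_ms,
            }
        },
    }


payload = build_turn_detection_update(silence_ms=1500)
message = json.dumps(payload)  # what would be sent over the open connection
```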
reply
Lucasoato
7 hours ago
[-]
Wait a minute... I’m genuinely happy that they are sharing this, but keep in mind that the realtime audio models from OpenAI are still stuck with the 4o family in terms of capabilities, sadly. I still find them so useful, and it's such a pity that there’s no real competitor in this segment; having the experience of a real conversation has helped me so much in expressing ideas and concepts.

Still, it’s worth keeping in mind that these are not frontier models, unlike when they were released.

(Please Sam, if you read this, release the new realtime audio models)

reply
modeless
4 hours ago
[-]
Grok voice is surprisingly good, actually. It's still a dumber model than the thinking modes of frontier models, but it's less dumb than the voice modes of other providers.
reply
artdigital
2 hours ago
[-]
Grok voice model is also a thinking model. I agree that it’s far better than the other voice models

Just give me an option to have a slower response but a better model…

reply
dharma1
6 hours ago
[-]
Yes the voice part of OpenAI realtime/voice mode is great but it’s pretty dumb compared to newer models and often gets stuck repeating itself.

Google’s Gemini flash live 3.1 is better, especially used via the API - it can do tool calling (including to other, even smarter LLMs if you set it up yourself), you can set the reasoning level (even high is still close enough to realtime) and it can ground answers in google search. I love bidirectional voice and right now it’s probably the best option. You can try it in AI studio

reply
Lucasoato
6 hours ago
[-]
Thanks, I’ll try it, even if my experience wasn’t that great with Google models lately (503s)
reply
dharma1
6 hours ago
[-]
Give it a shot: the 3.1 live one in AI studio/API, with reasoning maxed out. Not the one in the Gemini app, which is an older model.

Another option is to use pipecat with their VAD and separate STT and TTS and any (fast) LLM of your choice - but it’s more plumbing and not a true speech to speech model

reply
stavros
4 hours ago
[-]
Haha, wow, I never thought I'd see a voice model that was too quick, but 3.1 live felt like it responded unnaturally quickly! I'm kind of blown away, I'd want to insert a 100ms delay to make it sound more natural, wow. I never thought I'd see that.
reply
artdigital
5 hours ago
[-]
This is what makes their voice mode unusable to me. I can’t stand the way 4o replies and it’s such a big drop in quality from text mode
reply
ddp26
5 hours ago
[-]
Yeah, the question in the title can be answered: "by using gpt-4o, a model 2 years behind the frontier, to serve audio responses"
reply
thimabi
8 hours ago
[-]
> Voice AI only feels natural if conversation moves at the speed of speech […] At OpenAI’s scale, that translates into three concrete requirements: Global reach for more than 900 million weekly active users

Surely the number refers to the total users of ChatGPT overall, and the fraction of those who use voice features is considerably smaller, is it not?

That’s the kind of thing that influences business decisions like knowing how much hardware and software optimization to throw at a problem.

reply
stuartmemo
8 hours ago
[-]
Yeah, that's why they've used "reach" - the total number of users who could be exposed to the feature regardless of engagement.
reply
janalsncm
7 hours ago
[-]
To defend them a little: voice is a little rough around the edges now, so there’s a chicken and egg problem of whether to prioritize improving voice if usage isn’t high partially because it’s clunky.
reply
notfromhere
2 hours ago
[-]
I'd rather use the thinking models, so the voice mode isn't useful. I do use voice-to-text more and more just to speed things up though.
reply
Aeroi
8 hours ago
[-]
If anyone is looking to get into this, pipecat is a great open-source repo and community. https://github.com/pipecat-ai/pipecat
reply
pncnmnp
8 hours ago
[-]
I wish I had known about Pipecat a lot sooner. I found out about it a few weeks back, and since Gemma 4 launched, I've been building my own entirely local voice assistant using Gemma 4 + Kokoro TTS + Whisper from scratch - https://github.com/pncnmnp/strawberry.

Pipecat's smart turn model is really good for VAD - https://huggingface.co/pipecat-ai/smart-turn-v3

reply
zarldev
6 hours ago
[-]
Yeah Gemma4 was and is great fun to do this with - I too am building pretty much the same as yourself in Go.

https://github.com/zarldev/zarl & https://www.zarl.dev/posts/hal-by-any-other-name

reply
jaggederest
5 hours ago
[-]
Looks like everyone is building one of these, I have my own little version that's using streaming STT, it can actually be too fast in some cases, and I have a little ring buffer grabbing audio from before the wake word detection fires (so it can hear "Hey Jarvis, turn on the lights" without deliberate pause) https://github.com/jaggederest/pronghorn/
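
The pre-roll ring buffer idea is easy to sketch with a bounded deque. This is a generic illustration, not the code from the linked repo; the class name, the 20 ms frame size, and the use of integers as stand-in frames are assumptions:

```python
from collections import deque


class PreRollBuffer:
    """Keep only the most recent audio frames so that speech captured just
    before the wake word fires can be prepended to the utterance."""

    def __init__(self, pre_roll_ms, frame_ms=20):
        # deque with maxlen silently drops the oldest frame when full
        self._frames = deque(maxlen=pre_roll_ms // frame_ms)

    def push(self, frame):
        self._frames.append(frame)

    def drain(self):
        """Return buffered frames oldest-first and empty the buffer."""
        frames = list(self._frames)
        self._frames.clear()
        return frames


buf = PreRollBuffer(pre_roll_ms=100)  # room for the last 5 frames
for frame_id in range(8):             # integers stand in for audio frames
    buf.push(frame_id)
pre_roll = buf.drain()                # called when the wake word fires
```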
reply
AnthOlei
8 hours ago
[-]
What do you have going on the hardware side? I want to plug this into hass but don’t know what hardware I need for reasonable latency
reply
Sean-Der
7 hours ago
[-]
Check out [0]. You can do 'Voice AI' on small/cheap hardware. It's the most fun you can have in the space ATM :) It's been a while, but posted a demo here [1]

[0] https://github.com/pipecat-ai/pipecat-esp32

[1] https://www.youtube.com/watch?v=6f0sUEUuruw

reply
AnthOlei
7 hours ago
[-]
Beautiful demo - is it running fully locally or talking to 3rd party APIs? That box was jaw-droppingly small
reply
jameshush
3 hours ago
[-]
For the best experience, you'll still want it to communicate with 3rd party APIs to handle the speech to text, text to speech, and LLM.
reply
pncnmnp
7 hours ago
[-]
The whole setup works on my M2 MacBook Pro with 16 GB RAM. I use Gemma 4B via LiteRT-LM.

I've found that LiteRT-LM has a much lower DRAM footprint than Ollama. I've also made tons of optimizations in the code. For example, you can do quite a bit with a 16k context window for a voice assistant while managing a good footprint, so I keep track of the token usage and then perform an auto-compaction after a while. I use sub-agents and only do deep-think calls with them, so the context window is separated out. In a multi-turn conversation, if Gemma 4 directly processes audio input, the KV cache fills up within a few turns, so I channel it all via Whisper.
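
A rough sketch of budget-triggered auto-compaction, as a generic illustration rather than the code from the repo. The whitespace token counter, the function names, and the `summarize` callback (which in practice would be an LLM call) are all stand-ins:

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def compact(history, budget, keep_recent, summarize):
    """Collapse older turns into one summary once the history exceeds budget.

    keep_recent must be >= 1; the newest turns are always kept verbatim.
    """
    total = sum(count_tokens(turn) for turn in history)
    if total <= budget or len(history) <= keep_recent:
        return history  # still within budget, or nothing old enough to fold
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent


history = ["a b c d", "e f g h", "i j", "k l"]
compacted = compact(
    history,
    budget=8,
    keep_recent=2,
    summarize=lambda turns: f"[summary of {len(turns)} turns]",
)
```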

Also, by far the biggest optimization is: 3-stage producer-consumer architecture. The LiteRT-LM streams tokens and I split them into sentences. A synthesizer thread then converts each sentence to audio via Kokoro TTS - the main thread then plays audio chunks sequentially. There's a parallel barge-in monitor thread. https://github.com/pncnmnp/strawberry/blob/main/main.py#L446
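
The three stages can be sketched with two queues and two worker threads. This is a simplified illustration rather than the code at the link above: the function names are made up, `tts` and `play` are placeholder callbacks, and the barge-in monitor is left out:

```python
import queue
import threading

SENTINEL = None  # marks end of stream on each queue


def split_sentences(tokens, out_q):
    """Stage 1: group streamed tokens into sentences."""
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((".", "!", "?")):
            out_q.put(buf.strip())
            buf = ""
    if buf.strip():
        out_q.put(buf.strip())  # flush a trailing partial sentence
    out_q.put(SENTINEL)


def synthesize(in_q, out_q, tts):
    """Stage 2: turn each sentence into an audio chunk."""
    while (sentence := in_q.get()) is not SENTINEL:
        out_q.put(tts(sentence))
    out_q.put(SENTINEL)


def run_pipeline(tokens, tts, play):
    """Stage 3 runs on the calling thread and plays chunks in order."""
    sentences, audio = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=split_sentences, args=(tokens, sentences)),
        threading.Thread(target=synthesize, args=(sentences, audio, tts)),
    ]
    for w in workers:
        w.start()
    while (chunk := audio.get()) is not SENTINEL:
        play(chunk)
    for w in workers:
        w.join()


played = []
run_pipeline(
    iter(["Hi", " there", ".", " How", " are", " you", "?"]),
    tts=lambda s: f"<audio:{s}>",  # placeholder for a real TTS call
    play=played.append,            # placeholder for audio output
)
```

The point of the split is that the first sentence can start playing while later sentences are still being generated and synthesized.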

I did not want to use openWakeWord or Picovoice because they had limitations on which wake word you could choose. Alternative was to train a model of my own. So I created my own wake word detection pipeline using Whisper Tiny - works surprisingly well: https://github.com/pncnmnp/strawberry/blob/main/main.py#L143...

Also, I have VAD going with smart turn v3 (like I mentioned above) + I use browser/websocket for AEC + Barge-in (https://github.com/pncnmnp/strawberry/blob/main/audio_ws.py).

I'm using the MacBook's built-in microphones for this, though, and I haven't fully tested it with other microphones. I've been ironing out the rough edges on a daily basis. I should write a quick blog on this too.

reply
BoxedEmpathy
8 hours ago
[-]
I've been looking at this! Great project.
reply
didibus
8 hours ago
[-]
I wouldn't mind waiting longer for answers that would go through a better model with more thinking. As long as it has good support for interrupting and also it doesn't start answering as soon as I pause for 1 second and it's smart about knowing I'm done speaking.
reply
qrush
7 hours ago
[-]
Am I reading this right that OpenAI is not using Livekit for WebRTC/audio anymore?
reply
fidotron
7 hours ago
[-]
It does appear that way. The LiveKit server is not what you would want for this architecture anyway (as they basically say with the SFU discussion), although it does have a lot of useful stuff in the client SDKs.
reply
fuddle
6 hours ago
[-]
They do link to the Livekit docs in the footnotes: https://docs.livekit.io/transport/self-hosting/kubernetes/
reply
zuzululu
4 hours ago
[-]
whats wrong with livekit ?
reply
logickkk1
7 hours ago
[-]
IMO this probably isn't just about latency. keeping people in voice gives them training data text never will. is that why they were fine going transceiver over sfu and mostly ignoring multi-party?
reply
hnav
5 hours ago
[-]
RFC 9297 support can't come quick enough in browsers. Would obviate having to deal with WebRTC in a client-server scenario.
reply
charisma123
8 hours ago
[-]
If a transceiver crashes during a stream, how is the active session recovered? Does the system automatically re-establish the context in a new WebRTC session?
reply
Sean-Der
7 hours ago
[-]
It doesn't today, but you could with something like this [0]. You can save/suspend all WebRTC state and bring it back with the next process.

[0] https://github.com/pion/webrtc-zero-downtime-restart

reply
furyofantares
8 hours ago
[-]
> Global reach for more than 900 million weekly active users

lol, definitely didn't need to know there's 900M weekly users for this post. I mean yeah, there's a lot of users and they serve globally, that's relevant. But this is just pulling out your biggest stat because you can. How many voice users you have would actually be relevant and interesting but, to baselessly speculate on motivation here, might be a number that doesn't add as much fuel to an upcoming IPO as reminding people that you're almost at a billion users does.

reply
anzerarkin
8 hours ago
[-]
I hate the voice ai though, it's so much dumber
reply
brett-jackson
55 minutes ago
[-]
I used to use it all the time until about a year ago or so. Its responses are full of filler and the safeguards are really overbearing. It often will just give wrong answers in a way that GPT-5.x does not. I once asked it why a particular celebrity was canceled and it refused to tell me because it may harm me to know what they said!
reply
NikolaNovak
8 hours ago
[-]
Fwiw - I found the advanced AI voice feature to be actually detrimental. It's good if you just want a single sentence answer. I've turned it off though when I want a more detailed, structured, considered answer.
reply
drusepth
8 hours ago
[-]
Interestingly, that kind of parallels the real world too: if you want a quick and high level answer, talk to someone in person; if you want something detailed and info-dense, get them to write it down.
reply
CrzyLngPwd
7 hours ago
[-]
It's bad enough having to speed-read the waffle of its written answers; even when told to be concise, the thought of having to listen to it waffle on in its smarmy, sycophantic fashion makes me want to reach for the sick bag.
reply
doctorpangloss
8 hours ago
[-]
what i learned from making a webrtc+kubernetes game streaming product:

- openai is wrong. almost all of the issues they described are issues with libwebrtc, not with webrtc, kubernetes, network architecture, etc. the clue was when they said "the conventional one-port-per-session WebRTC model."

- there are no alternatives worth trying. everything else open source in the ecosystem, like pion, coturn, and stunner, is too immature.

- libwebrtc is the only game in town.

- they haven't discovered libwebrtc feature flags or how it works with candidates, which directly fix a bunch of the latency issues they are discovering. a correct feature flag can instantly reduce latency for free, compared to paying for twilio network traversal style solutions

- 99% of low latency voice END USERS will be in a network situation that can eliminate relays, transceivers, etc. it is totally first class on kubernetes. but you have to know something :)

this is the first time i'm experiencing gell-mann amnesia with openai! look, those guys are brilliant, but there is hardly anyone in the world who is doing this stuff correctly.

reply
Sean-Der
7 hours ago
[-]
Did you use libwebrtc on the backend? When you say `libwebrtc` is the only game in town are you talking about clients or servers?

Even for clients you have things like libpeer that libwebrtc can't hit.

reply
doctorpangloss
6 hours ago
[-]
yes - i used libwebrtc on the backend and, pre-LLM, patched it to work around a lot of the things i discovered that were directly related to low latency AV streaming. pion didn't exist then.

i think the challenge is that pion is an excellent product today. it would benefit me if its innovations were subsumed into libwebrtc, because eventually those innovations will show up in the iOS stack, which is one of the customers that matter to me. it is subjective if it is the MOST important customer, that is my belief and it is probably true of openai, at least until they get their own device out the door.

there can be many, many use cases though! not everything has to be an attempt to make the thing for 1b people that has to interact with all the most powerful and meanest businesses on the planet.

reply
chevman
7 hours ago
[-]
When you have hard problems with unclear optimal solutions, taking this approach of a public show & tell will often (always?) solicit lots of interesting ideas the team may have not yet considered :)
reply
jiggawatts
8 hours ago
[-]
Something I noticed is that companies that are vibe-coding their products miss out on the intelligence that (still) only humans can bring to bear. Just the knowledge cutoff alone puts AI at a serious disadvantage in any rapidly changing field.
reply
fragmede
7 hours ago
[-]
GPT 5.5's knowledge cutoff is August 2025. Which aspect of WebRTC has meaningfully changed since then?
reply
tedsanders
4 hours ago
[-]
Dec 2025, actually: https://developers.openai.com/api/docs/models/gpt-5.5

(though knowledge cutoffs in practice can be bit fuzzy)

reply
jiggawatts
5 hours ago
[-]
There's a difference between some piece of information being "officially published" and the AIs gaining a sufficient understanding of it.

Take any popular technology problem that has been around for a few years such as... wrangling Kubernetes with YAML config files. There's probably hundreds of thousands of discussions, source code samples from GitHub, official docs, blogs, bug reports, pull requests, etc... all discussing the nuances, pitfalls, pros/cons, etc. During pre-training the AIs internalise this and can utilise it later.

Now compare this with anything recent and (relatively) obscure, such as new .NET 10 features which were first officially published in November 2025, a month before the GPT 5.5 cutoff.

As a human developer, these new language capabilities are on the same "level" for me in my day-to-day work as the features from .NET 9 or .NET 8. Similarly, my IDE has native refactoring and code cleanup support that can take C# code from the previous years and bring it up to the idiomatic style of $currentyear.

The AIs just can't do this, because one single Microsoft release note and one learn.microsoft.com page is nowhere near enough training data! The AI hasn't seen millions of lines of code written with .NET 10, taking advantage of .NET 10 improvements, and hasn't seen thousands of discussions about it. Not yet.

This is a fundamental issue with how LLMs are (currently) trained! Simply moving the cutoff date is not enough.

Human learning is second-order. If I see even the tiniest bit of updated information that invalidates a huge pile of older information, my memory marks everything old as outdated and from that second onwards I use only the new approach.

AI learning is first-order. It has to be given the discussions/blogs/posts that say "Stop using the legacy way, it's terrible! Start using the new hotness"! That, it can learn, but it'll be perpetually behind the rest of us by at least a few years.

Not to mention that, thanks to AI, forums like StackOverflow are dying, so... where is it going to get this kind of training data from in the future!?

AI training needs to switch to "second order", but AFAIK this is an unsolved problem at this time.

reply
mschuster91
6 hours ago
[-]
The problem is the sheer amount of knowledge out there. Particularly when using niche technologies (which WebRTC and web audio still are, when measured by how many people develop using them), it is not surprising that AI doesn't have everything available in its responses, unless you specifically ask it about something you already know it should know.
reply
tom1IIIl1iIL
7 hours ago
[-]
I think it's better to join some kind of club if you want to make friends?
reply
flakiness
8 hours ago
[-]
Should I or shouldn't I be glad to see zero mention of Codex?
reply
mock-possum
8 hours ago
[-]
Shouldn’t, I think - advanced voice is a surprisingly slick feature, and if you’re someone who feels that they can think and speak more naturally than when they think and type, AI voice transcription is kind of huge.
reply
gyanchawdhary
7 hours ago
[-]
100% .. as a product designer/developer, i use it heavily for early feature ideation .. i’ll do a loose, exploratory back and forth on a long walk .. then pass the transcript to claude to validate and turn into a spec ..
reply
devopsengine
3 hours ago
[-]
Inspired
reply
AIorNot
8 hours ago
[-]
so is the answer

WebRTC + Kubernetes

reply
rvz
7 hours ago
[-]
OpenAI uses Go for the networking implementation for the relays and the services, which makes a ton of sense, instead of something immature as TypeScript / Node or whatever.

Yet another reason to not consider anything else like that for low-latency networking. Golang (or even Rust and C++) is unmatched for this use-case.

reply
bananamogul
7 hours ago
[-]
"something immature as TypeScript / Node or whatever"

Node.js's initial release was May 27, 2009

Golang 's initial release was November 10, 2009

They're different, yes, but it's not like

reply
mghackerlady
7 hours ago
[-]
okay, sure, but one is by microsoft, the other by a 25 year old, and another by rob pike. The one by rob pike is going to be infinitely more mature and thought out than a hacky type system on JS because it isn't his first rodeo
reply
nvarsj
7 hours ago
[-]
Can golang do zero copy networking nowadays? In the past golang was terrible at this kind of thing due to allocations and copies of all relayed data.
reply
fragmede
7 hours ago
[-]
And the GC!
reply
cdrnsf
8 hours ago
[-]
It's missing the part where they explain how they obtained the training data for their voice AI.
reply
jonahs197
7 hours ago
[-]
Who cares? Their company is dying.
reply