Microsoft VibeVoice: Open-Source Frontier Voice AI
194 points
by tosh
4 hours ago
| 27 comments
| github.com
| HN
steinvakt2
3 hours ago
[-]
This is not a new model. Also, it hallucinates a lot. Also, it's very heavy and slow in inference. It's also bad in multilingual.

Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.

reply
zuzululu
1 minute ago
[-]
you saved us a lot of time here.... i unstarred the repo

moving on....

reply
scotty79
1 hour ago
[-]
You just saved me an afternoon.
reply
lblock
3 hours ago
[-]
Yeah, I don't get why it is suddenly getting so much attention today, it is all over twitter too
reply
xnx
1 hour ago
[-]
Simonw (who has a bit of a Midas touch for posts here) just posted about it https://simonwillison.net/2026/Apr/27/vibevoice/
reply
realty_geek
1 hour ago
[-]
To be fair, his Midas touch is a result of consistency and a lot of hard work.

It's like the gardener at one of the Oxford colleges said - it's really easy to create these perfect lawns, just turn up every day and trim and water it - for a couple hundred years.

reply
GuinansEyebrows
1 hour ago
[-]
there is so much more subversive marketing out there than any of us can really fathom. i try not to be too paranoid but it's getting a lot harder every day.

i know someone who worked in what we might call the 'astroturfing' space within the entertainment industry. after having a few discussions with him and with things like this[0] becoming more known, it's really difficult to afford any assumption of organic intent when money is on the line - especially at the scale that microsoft works at compared to something as comparatively quaint as the music industry.

[0] https://www.wired.com/story/geese-chaotic-good-marketing-ind...

reply
ramon156
3 hours ago
[-]
well duh, they updated the news section

https://github.com/microsoft/VibeVoice/commit/e73d1e17c3754f...

which is microsoft for "we removed two dead links". AI innovation knows no limits!

reply
Vinnl
2 hours ago
[-]
Interestingly that seems to be in response to [1], which might indeed be the trigger for this.

[1] https://doublepulsar.com/microsoft-vibing-capturing-screensh...

reply
gagan2020
1 hour ago
[-]
It is not good for text to speech (TTS) either. I have been trying it for a few days. First of all, there is no documentation for the 1.5B model. The 0.5B realtime model is a shit model. I was converting text line by line, and it was randomly adding music and couldn't handle special characters like "…".

I am really disappointed with this model, to say the least.

reply
SecretDreams
3 hours ago
[-]
I think this was all covered when they said it was released by Microsoft?
reply
NobleLie
2 hours ago
[-]
The nuance is lost on LLM agentic dominant partakers.
reply
isodev
19 minutes ago
[-]
I think in this category, Voxtral by Mistral is a lot better. It also happens to be small enough to run on webGPU https://huggingface.co/spaces/mistralai/Voxtral-Realtime-Web...
reply
maxloh
3 hours ago
[-]
I think we should stop calling this type of model open source. They are indeed "open weight." The training code is proprietary and never revealed.

https://github.com/microsoft/VibeVoice/issues/102

reply
simonw
1 hour ago
[-]
I'm reserving that complaint for "open source" models which are released under non-open-source licenses.

I care that I know what I can DO with the project when I see it described as "open source".

reply
yjftsjthsd-h
1 hour ago
[-]
> I care that I know what I can DO with the project when I see it described as "open source".

Yes, the first of which is that you should be able to build it from source. Which requires the source code, and in this case data.

reply
simonw
1 hour ago
[-]
The OSI's take on this is that an open source model can be modified through fine-tuning etc, even if you can't rebuild it from scratch.

The problem with requiring "build from scratch" for open source models is that the number of interesting models with training data that can be openly licensed is close to zero.

If you trained your model on an unlicensed scrape of the web you can't release the data under an open source license!

The Open Source Initiative have a bunch of their thinking around this in their FAQ for the "Open Source AI definition": https://opensource.org/ai/faq#isn-t-training-data-required-t...

reply
riedel
56 minutes ago
[-]
I would personally disagree slightly with this take. Being freely able to use it means, IMHO, that this can be done for all applications in a legal (and ideally ethical) fashion. Regulation often requires proving the quality or provenance of data. Open source often has, IMHO, a very libertarian view of things, focusing on the rights of the user and not society in general.
reply
rogerrogerr
1 hour ago
[-]
They’ll never reveal the data, because that would reveal this is all built on stolen work.
reply
simonw
58 minutes ago
[-]
Some of the models DO reveal the data, and it's still built on "stolen work" in that it's unlicensed scrapes of the Web. Here's an example:

https://huggingface.co/allenai/OLMo-2-0325-32B

Here's one of their training mixes: https://huggingface.co/datasets/allenai/dolma3_pool - which includes 8 trillion tokens from Common Crawl.

reply
data-ottawa
1 hour ago
[-]
That would be “permissive license”

Maybe we should have a little cue card for models: vendor/name, size, open weights, open source, permissive license.

It’s simple enough an idea.

reply
jcmfernandes
2 hours ago
[-]
Indeed. We now live in a world where freeware is named open source. We are very sorry, Stallman.
reply
MarsIronPI
2 hours ago
[-]
If you're going to apologize to Stallman, you should apologize for conflating open source with software freedom. ;D
reply
jcmfernandes
1 hour ago
[-]
I totally get you, but this is yet another thick layer away.
reply
psychoslave
2 hours ago
[-]
With free libre software, where freedom and liberty are about what the end user is empowered with actually, the software is mostly metonymic. Free software, free society, because there are free people in the middle of course.
reply
jrm4
2 hours ago
[-]
Right, as I said elsewhere, maybe let's just let "open-source" have it.

"Open-source" can be "anything you can go out and grab a copy of and use" but doesn't give you much legal certainty about any of it, and reserve "free software" for the other, better thing.

reply
hedora
1 hour ago
[-]
But free software lost its way around GPLv3. From the end user's perspective, GPLv3 says that you can only use the software if it's either a cloud service, on hypothetical open firmware devices, or if you install it yourself.

AGPLv3 partially solves the issue by blocking people like Google from using it to build proprietary cloud services that take away their users' freedom. (It still doesn't solve the problem where providers use network effects to achieve the same end game.)

reply
MarsIronPI
1 hour ago
[-]
> From the end user's perspective, GPLv3 says that you can only use the software if it's either a cloud service, on hypothetical open firmware devices, or if you install it yourself.

What in the world do you mean?

reply
jrm4
29 minutes ago
[-]
I don't understand this either. The GPL doesn't address end users and their use of software at all, to be technical. It only addresses what terms of copyright redistributors of GPLed software are allowed to apply in-turn to subsequent end users.
reply
hedora
5 minutes ago
[-]
The point of the Free in free software was always to protect the users of the software, not the vendors or the redistributors.

The first sentence of the GNU manifesto says this, and a few sections later in the document elaborate on the point:

https://www.gnu.org/gnu/manifesto.html

Note, in particular, footnote [1], which explains that it's OK for distributors to ask for payment, but that it's never OK for users to have to ask for permission to use the software, and the section "Why I Must Write GNU".

Since then, software service monopolies became common, and all of the most end-user-hostile systems on earth rely heavily on the GNU system. At this point, we're paying for permission to use those services with our money, our data, our democracy, etc.

I certainly cannot give you permission to use any of the GPLed services that I have used, or that I've been paid to extend. Therefore, I say the free software movement has lost its way.

reply
WhyNotHugo
1 hour ago
[-]
Devil's advocate here: I can give you a binary of my open source MIT code and never send you the code. The code is still MIT licensed, and open source. You just have no access to it.

That said, I entirely agree that MS is misrepresenting their openness here, which isn’t in the least surprising.

reply
freedomben
48 minutes ago
[-]
In their defense, most everyone else does the same thing. They still shouldn't do it, but at least they're not the trendsetter here (though they are contributing to the ongoing problem)
reply
Otek
49 minutes ago
[-]
? Do you know what “source” means in open source? Like, what is the source of the binary? It’s the code. That’s the source in open source.
reply
freedomben
45 minutes ago
[-]
I don't disagree, but it is perfectly acceptable per the MIT license, which is an OSI approved license. MIT doesn't require source distribution with the binary (which is why from the developer perspective, it's a more "permissive" license)
reply
clickety_clack
34 minutes ago
[-]
The license describes what users are allowed to do with the source code, it doesn’t (and shouldn’t) define what a creator has to do to make the source code open.
reply
JumpCrisscross
3 hours ago
[-]
> we should stop calling this type of model open source. They are indeed "open weight”

This ship has sailed. It’s now in the same category as hacker/cracker and the pronunciation of GIF.

reply
engeljohnb
22 minutes ago
[-]
The inventor of GIF didn't begin with a document* clearly laying out what is and isn't to be called a "GIF."

I think it's right to push back whenever a huge tech corporation tries to build goodwill by falsely using terms like "open source."

*https://opensource.org/osd

reply
JumpCrisscross
4 minutes ago
[-]
> inventor of GIF didn't begin with a document clearly laying out what is and isn't to be called a "GIF”*

Neither did the inventors of AI. A third party published a document after corporations went with open weights = open source and a spoiler block in FOSS wanted all training data published.

> it's right to push back whenever a huge tech corporation tries to build goodwill by falsely using terms like "open source

I think it’s counterproductive. Most people only see a squabble, which makes any ensuing points from the open-source community seem silly. Those who care can continue using the more-precise language they choose to.

Put another way, there is a difference between using terms like cracker and fully spelling out cryptocurrency, and telling people who use hacker and crypto more loosely that they’re wrong. They aren’t wrong and that isn’t meaningful feedback. At the same time, the person using the precise language isn’t wrong either.

reply
andy_ppp
3 hours ago
[-]
I think you mean GIF.
reply
giancarlostoro
3 hours ago
[-]
It's the same as GIS, you wouldn't say jizz now would you?
reply
DoctorOW
2 hours ago
[-]
I absolutely do, every single time it comes up.
reply
kevin_thibedeau
2 hours ago
[-]
The developer of the format declared the pronunciation 30+ years ago. It has always been jif.
reply
Geezus_42
2 hours ago
[-]
Yeah, but society overruled them.
reply
ziml77
2 hours ago
[-]
I hadn't thought about how to pronounce GIS, but do you have a problem with the pronunciation of the Japanese Industrial Standards: JIS?
reply
s20n
1 hour ago
[-]
I've been pronouncing both of them as /dʒis/ like hiss and not /dʒɪz/. I, however, am not a native speaker of English. I wonder if native speakers gravitate towards the z more?
reply
ziml77
45 minutes ago
[-]
I would end both with the S sound, but I'm operating under the assumption that the person I was replying to either pronounces their Ss as Zs or can't tell the difference between the S and Z sounds.

Because the other assumption I could have gone with is the less charitable take that they know GIS with a soft G doesn't sound like jizz, but they were just looking for a crude way to mock the soft G.

reply
bronson
53 minutes ago
[-]
I think it depends on region. Related, many speakers pronounce chips and salza, Tezla, Wezley.
reply
dijksterhuis
2 hours ago
[-]
i am absolutely going to from now on
reply
notabotiswear
2 hours ago
[-]
I take it that you haven’t met the Arcgees people…
reply
pardon_me
2 hours ago
[-]
How do you pronounce giraffe?
reply
giancarlostoro
1 hour ago
[-]
Same way I pronounce my first name btw ;) but I think of "gif" as "gift" and this is probably the subconscious association people make without realizing it.
reply
WorldMaker
1 hour ago
[-]
Which is why I find it fun to bring up that in Old English "gift" hadn't yet picked up the "t" and was spelled "gif", but in Old English "g" was most commonly "HY". I like the Old English pronunciation of "gif" as "HYEEF", which is a "compromise" position that often makes some of both soft-g and hard-g "gif" pronunciation fans angry.
reply
giancarlostoro
36 minutes ago
[-]
I sometimes just pick the opposite of whatever everyone agreed to just for fun. I do the same when people cry about vim or emacs since I have used both. ;)

Some men just want to watch the world burn. At least it's mostly harmless fun anyway. It's even funnier when they bring up how my name is pronounced in defense of "jiff" and I tell them, so you're calling me the expert in "Gi" pronunciation then? :)

reply
ziml77
44 minutes ago
[-]
I have never heard this third option before but I love it!
reply
parineum
2 hours ago
[-]
How do you pronounce gift?
reply
briffle
1 hour ago
[-]
gorge = george
reply
WarmWash
2 hours ago
[-]
And "hallucination" which should have been "delusion".

Way early on (spring 2023) people tried to stop it, but no luck.

reply
MagicMoonlight
1 hour ago
[-]
Why would it be delusion? It’s making something up which isn’t there and describing it.
reply
WarmWash
1 hour ago
[-]
A hallucination is a false sensory experience.

A delusion is a false mental belief.

Basically hallucinations are false external things, and delusions false internal things. You hallucinate a pink elephant, you delude yourself into thinking trump won 2020.

reply
btown
2 hours ago
[-]
At least it's MIT licensed! As much as non-open training data irks me, restrictive licensing irks me more!
reply
cute_boi
52 minutes ago
[-]
What is the problem with restrictive licensing? Most of them only kick in if you have 1M users, etc.?
reply
bitvvip
1 hour ago
[-]
What you said makes a lot of sense. Free software should not be confused with open source
reply
giancarlostoro
3 hours ago
[-]
I mean, you have "AI" which means just about anything in marketing speak, "Agentic" is kind of becoming similar, hopefully they don't goof that one too badly, would be nice to know what you are trying to sell me. Used to be "Cloud" meant storage not just hosting (I guess it still does).

Then there's "Smart" in front of Car, Phone, TV, and so on... Meaning different things.

I do think "Open Weight" should be more commonly used. There's definitely communities that spring up that build the training infrastructure and inference infrastructure around open models on the other hand.

reply
scotty79
1 hour ago
[-]
Open weights is not exactly right either because we do get source of the software that uses those open weights.

Maybe open inference?

But we often also get source code for fine-tuning the model.

So maybe it's closer to open source than to anything else?

Isn't it a bit like not calling a game open source because the engine tooling used to make it isn't open source and they didn't publish .psd files with asset designs?

reply
jrm4
2 hours ago
[-]
I'm genuinely torn on this one; I get technically why not, but why I think I have no problem with it is the wishy-washiness of "open source" generally.

As I teach this stuff to people newer to this tech, it's probably just easier and more helpful to refer to the wide array of "stuff you can just download and use yourself" as "open-source" and then after that, go deeper and talk about why Stallman was right, how "Free Software" was first, etc.

reply
notabotiswear
2 hours ago
[-]
Openwashing is the new greenwashing, which, coincidentally, seems to have gone out of fashion a few hundred datacentres ago.
reply
dist-epoch
2 hours ago
[-]
it was replaced with abundancewashing
reply
Geezus_42
2 hours ago
[-]
What is "abundancewashing"?
reply
dist-epoch
2 hours ago
[-]
> “This means a future of abundance. A future where there is no poverty, where people can have whatever they want in terms of goods and services.” – Elon Musk

> “I think we see a path now where the world gets much more abundant and much better every year.” – Sam Altman

https://www.diamandis.com/blog/elon-sam-abundance

reply
dragonfax
3 minutes ago
[-]
Shouldn't it be called something like "Copilot Voice"?
reply
aqme28
3 hours ago
[-]
Interesting to see "vibe" enshrined by the likes of Microsoft as an AI product word.
reply
amlib
16 minutes ago
[-]
Maybe they were trying to make a pun on "Via Voice", the cursed IBM STT from the 90s?
reply
accrual
3 hours ago
[-]
Especially when "vibe coded" can have a negative connotation meaning quickly put together without understanding.
reply
ryandrake
1 hour ago
[-]
In my mind, Vibe-anything means "some slop carelessly thrown together to ship as fast as possible." Wild that it's being used in a serious product name!
reply
Barbing
2 hours ago
[-]
I’m just surprised they put the name of the e-waste slop company in their product
reply
lvncelot
29 minutes ago
[-]
I'm honestly more surprised that they could resist the temptation to call it Copilot
reply
altmanaltman
2 hours ago
[-]
Which makes it even more weird that they get offended when people use Microslop. They are the ones leaning into the marketing
reply
Vinnl
2 hours ago
[-]
"get offended" is just what the clickbait news cycle made of it. It was based on the post at [1], and this is all it said:

> We need to get beyond the arguments of slop vs sophistication and develop a new equilibrium in terms of our “theory of the mind” that accounts for humans being equipped with these new cognitive amplifier tools as we relate to each other

[1] https://snscratchpad.com/posts/looking-ahead-2026/

reply
fg137
40 minutes ago
[-]
Are you sure you have the correct reference?

I think everyone else is relating to

https://futurism.com/artificial-intelligence/microsoft-bans-...

reply
altmanaltman
1 hour ago
[-]
When a CEO says "We need to get beyond the arguments of X" it is universally a polite, PR-scrubbed way of saying, "Please stop talking about X, it is hurting our business" which is how the media interpreted it.
reply
mberg
1 hour ago
[-]
I've been using VibeVoice's ASR (speech to text) model quite intensively for the past month and have found it to be a lot more reliable and out-of-the-box functional than Whisper, Parakeet and other models. The fact that it has diarization built into the model is a huge win in my book. Without that you have to run a different model just for diarization, which adds significantly to the overall processing time, whereas VibeVoice gives you reliably great results. Big fan.
reply
embedding-shape
3 hours ago
[-]
Isn't this project the one Microsoft published but then soon after pulled it for security/safety reasons? What has changed since then?
reply
542458
3 hours ago
[-]
Look at the "News" section in the readme - The original TTS model is gone from this repo (you can still find it in other places), but the STT/ASR, long-form TTS, and streaming TTS models are newer.
reply
infecto
2 hours ago
[-]
It’s confusing (at least for me) because the project covers a number of things including what you are mentioning.
reply
Barbing
2 hours ago
[-]
[off topic]

When explanations get posted directly in HN comments, I imagine someone somewhere in the world is able to learn in spite of their Internet restrictions/firewalls

People will also post their own interpretations in response to comments, and quickly find out they missed something.

… But if you try to automate it, like include a summary under every HN post, you encourage laziness too much and are pre-chewing too heavily. Some balance here.

[on topic]

(OK I’m done making excuses, time to read the article… thanks for the encouragement!)

I thought this was not explained in the readme directly but in fact I missed it. I wasn’t going to read Microsoft’s entire changelog! But it was substantive, thanks to sibling commenter:

“2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.”

reply
xnx
1 hour ago
[-]
Still waiting for the open weights model that conclusively beats the multi-year old Whisper in accuracy, features, and performance.
reply
scotty79
1 hour ago
[-]
It's crazy that a lot is happening in open models for stt, but there's very little progress when it comes to results, esp multilingual.
reply
CubsFan1060
3 hours ago
[-]
Great post last night from Simon: https://simonwillison.net/2026/Apr/27/vibevoice/
reply
542458
3 hours ago
[-]
Note that this just covers the Speech-to-Text/Speech-Recognition aspect (à la Whisper); there are also models for long-form Text-To-Speech and streaming Text-To-Speech.
reply
JumpCrisscross
3 hours ago
[-]
“VibeVoice can only handle up to an hour of audio”

Why?

reply
podgietaru
3 hours ago
[-]
So we've really just settled on Vibe as the verb for AI then?
reply
giarc
3 hours ago
[-]
I'd be willing to bet it will be "Word of the Year" for 2026. Merriam-Webster had 'slop' for 2025, and 'polarization' for 2024. Is there a prediction market for this?
reply
internet_points
3 hours ago
[-]
it'll probably be something we're not even talking about yet - we still have 7 months in which to make the world even worse
reply
pryanshu89
3 hours ago
[-]
Why use precise technical language when you can just vibe with your AI system?
reply
ryukoposting
2 hours ago
[-]
Holy moly, a Microsoft AI product that isn't named Copilot!
reply
DoctorOW
2 hours ago
[-]
Missed opportunity to call it Vopilot
reply
silverwind
1 hour ago
[-]
Slopilot
reply
Anonyneko
3 hours ago
[-]
You have selected Microsoft Sam as the computer's default voice.
reply
accrual
2 hours ago
[-]
My friends and I had fun in the computer lab with Microsoft Sam, inputting long strings of characters to create funny sound effects. Sususususususu.
reply
pluc
3 hours ago
[-]
Interesting story about this repo/product/author by cybersecurity researcher Kevin Beaumont: https://cyberplace.social/@GossiTheDog/116454846703138243
reply
solomatov
1 hour ago
[-]
It would have been better if they provided not just weights, but also some frontend where it is usable as is.
reply
frangonf
2 hours ago
[-]
I took a look into local options for ASR and diarization some months ago, I missed that VibeVoice now has this feature.

My conclusion back then (which came only from shallow research on the topic and 0 real experience, mind you) was that Whisper + Pyannote was the "stable" approach.

Have the VibeVoice, Voxtral, Qwen or the Nemo solutions caught up in segmentation and speaker recognition?
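
For anyone unfamiliar with what a two-model "ASR + diarization" pipeline has to do after both models have run, here is a minimal sketch of the merge step. It assumes the ASR side yields timestamped text segments and the diarization side yields timestamped speaker turns; the `Segment` and `Turn` types, the `assign_speakers` helper, and the sample data are all hypothetical and are not the API of Whisper, Pyannote, or VibeVoice. Each segment is simply given the speaker whose turns overlap it the most.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A timestamped piece of transcript (hypothetical ASR output)."""
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class Turn:
    """A timestamped speaker turn (hypothetical diarization output)."""
    start: float  # seconds
    end: float    # seconds
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each segment with the speaker whose turns overlap it most."""
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        # Fall back to a placeholder when no turn overlaps the segment.
        speaker = max(totals, key=totals.get) if totals else unknown
        labeled.append((speaker, seg.text))
    return labeled


# Made-up data standing in for real model output.
segments = [
    Segment(0.0, 4.0, "Hello, thanks for joining."),
    Segment(4.2, 9.0, "Happy to be here."),
    Segment(20.0, 22.0, "(inaudible)"),
]
turns = [
    Turn(0.0, 4.1, "SPEAKER_00"),
    Turn(4.1, 9.5, "SPEAKER_01"),
]

for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

A model with diarization built in would presumably skip this alignment step, which is where a lot of the fiddly edge cases (overlapping speech, segments spanning two turns) tend to show up.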

reply
Mobius01
2 hours ago
[-]
Microsoft has historically made poor choices in product naming, but this has to be a new low.
reply
chaosprint
2 hours ago
[-]
Microsoft Store App Vibing.exe Accused of Harvesting Screens, Audio, and Clipboard Data:

https://cyberpress.org/microsoft-store-app-vibing-exe-accuse...

reply
Void_
3 hours ago
[-]
In the past month or so, I added 2 models to my app Whisper Memos (https://whispermemos.com):

- Cohere Transcribe (self hosted)

- Grok Speech To Text (they provide an API, only $0.10/hr!)

They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?

reply
olejorgenb
3 hours ago
[-]
I've had good experiences with the Mistral Voxtral models (I've used the API, but some of the model-variants are open weight)
reply
Barbing
2 hours ago
[-]
Does Cohere work with longer transcripts? Do you have to do some magic to merge recordings over 35 seconds long?
reply
2ndorderthought
3 hours ago
[-]
Have you tried qwen?
reply
SecretDreams
3 hours ago
[-]
Any non-Musk alternatives that are comparable in quality and cost?
reply
jayphen
2 hours ago
[-]
Voxtral competes on price ($0.003/min) and quality. Speechmatics has best in class accuracy but is a bit more expensive ($0.004/min)
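
To compare these per-minute prices with the $0.10/hr figure quoted upthread for Grok, it helps to put everything on a per-hour basis. The sketch below uses only the prices as stated in this thread (they may be out of date, so treat them as assumptions) and a hypothetical `per_hour` helper; `Decimal` is used to avoid floating-point noise in the small amounts.

```python
from decimal import Decimal

# Prices as quoted in this thread; assumptions, not verified current pricing.
PRICES = {
    "Grok Speech To Text": (Decimal("0.10"), "hour"),
    "Voxtral": (Decimal("0.003"), "minute"),
    "Speechmatics": (Decimal("0.004"), "minute"),
}


def per_hour(price, unit):
    """Convert a price quoted per minute or per hour to dollars per hour."""
    if unit == "hour":
        return price
    if unit == "minute":
        return price * 60
    raise ValueError(f"unsupported unit: {unit}")


hourly = {name: per_hour(price, unit) for name, (price, unit) in PRICES.items()}

# Print cheapest first.
for name, cost in sorted(hourly.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:.2f}/hr")
```

On these quoted numbers, $0.003/min works out to $0.18/hr and $0.004/min to $0.24/hr, so both are somewhat above the $0.10/hr quoted for Grok.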
reply
Void_
3 hours ago
[-]
Our default is still OpenAI Whisper. Grok is just a choice for users who might prefer it.
reply
JumpCrisscross
3 hours ago
[-]
What’s the current state of the art, for each of training locally and in the cloud, for learning my voice?
reply
yreg
2 hours ago
[-]
Locally maybe https://voicebox.sh/

Elevenlabs in the cloud.

reply
chrsw
3 hours ago
[-]
Local? No idea. Cloud? Eleven Labs, probably. But it's described as "cloning" not "training". Not sure what the distinction is or why it matters if the end result is that you can generate any TTS that sounds like you. There might very well be an important one, I just don't know it.
reply
khimaros
2 hours ago
[-]
open weights i would say S2: https://github.com/rodrigomatta/s2.cpp
reply
BlastBash192
3 hours ago
[-]
Maybe Microsoft’s real strength was never making the best model, it was knowing you don’t need to, as long as you own the platform everyone builds on.
reply
khimaros
2 hours ago
[-]
looks like this offers ASR support in GGUF https://github.com/CrispStrobe/CrispASR -- haven't tested
reply
mistic92
3 hours ago
[-]
For me it's giving very poor results
reply
Zopieux
1 hour ago
[-]
English only?
reply
ChrisArchitect
2 hours ago
[-]
reply
simonw
1 hour ago
[-]
That was about the text-to-speech model; the speech-to-text one was released in January.
reply
walthamstow
3 hours ago
[-]
Seems quite heavy for a STT model, Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's due to additional accuracy and speaker diarisation?

The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck

reply
starkeeper
2 hours ago
[-]
Microsoft is famous for choosing terrible names but how could they be this terrible.
reply
villgax
37 minutes ago
[-]
lol they rug-pulled the 7B for our own safety some months ago
reply