Show HN: Dia, an open-weights TTS model for generating realistic dialogue
600 points
1 day ago
| 56 comments
| github.com
sebstefan
12 hours ago
[-]
I inserted the non-verbal command "(pauses)" in the middle of a sentence and I think I caused it to have an aneurysm.

https://i.horizon.pics/4sEVXh8GpI (27s)

It starts with an intro, too. Really strange

reply
abrookewood
9 hours ago
[-]
That's certainly unusual ...
reply
antiraza
8 hours ago
[-]
That was... amazing.
reply
throwaway-alpha
11 hours ago
[-]
I have a hunch they're pulling data from radio shows to give it that "high quality" vibe. Tried running it through this script and hit some weird bugs too:

    [S1] It really sounds as if they've started using NPR to source TTS models
    [S2] Yeah... yeah... it's kind of disturbing (laughs dejectedly).
    [S3] I really wish, that they would just Stop with this.
https://i.horizon.pics/Tx2PrPTRM3
reply
degosuke
5 hours ago
[-]
It even added an extra f-word at the end. Still veeery impressive
reply
xdfgh1112
4 hours ago
[-]
Just noticed he says dejectedly too
reply
bt1a
8 hours ago
[-]
You hittin them balloons again, mate ?
reply
devnen
1 hour ago
[-]
This is really impressive work and the dialogue quality is fantastic.

For anyone wanting a quick way to spin this up locally with a web UI and API access, I put together a FastAPI server wrapper around the model: https://github.com/devnen/Dia-TTS-Server

The setup is just a standard pip install -r requirements.txt (works on Linux/Windows). It pulls the model from HF automatically – defaulting to the faster BF16 safetensors (ttj/dia-1.6b-safetensors), but that's configurable in the .env. You get an OpenAI-compatible API endpoint (/v1/audio/speech) for easy integration, plus a custom one (/tts) to control all the Dia parameters. The web UI gives you a simple way to type text, adjust sliders, and test voice cloning. It'll use your CUDA GPU if you have one configured, otherwise, it runs on the CPU.
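
For example, calling the OpenAI-compatible endpoint from Python looks roughly like this (an illustrative sketch; the port and payload fields here are placeholders, so check the README for the real defaults):

    import requests

    # Illustrative request against the /v1/audio/speech endpoint; adjust the
    # host/port and fields to match your server config.
    resp = requests.post(
        "http://localhost:8003/v1/audio/speech",
        json={
            "model": "dia-1.6b",
            "input": "[S1] Hello there. [S2] Hi! (laughs)",
            "response_format": "wav",
        },
        timeout=300,
    )
    with open("dialogue.wav", "wb") as f:
        f.write(resp.content)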

Might be a useful starting point or testing tool for someone. Feedback is welcome!

reply
hemloc_io
1 day ago
[-]
Very cool!

Insane how much low-hanging fruit there is for audio models right now. A team of two picking things up over a few months can build something that still competes with large players with tons of funding.

reply
miki123211
13 hours ago
[-]
Yeah, Eleven Labs must be raking it in.

You can get hours of audio out of it for free with Eleven Reader, which suggests that their inference costs aren't that high. Meanwhile, those same few hours of audio, at the exact same quality, would cost something like $100 when generated through their website or API, a lot more than any other provider out there. Their pricing (and especially API pricing) makes no sense, not unless it's just price discrimination.

Somebody with slightly deeper pockets than academics or one guy in a garage needs to start competing with them and drive costs down.

Open TTS models don't even seem to utilize audiobooks or data scraped off the internet; most are still Librivox / LJ Speech. That's like training an LLM on just Wikipedia and expecting great results. That may have worked in 2018, but even in 2020 we knew better, not to mention 2025.

TTS models never had their "Stable Diffusion moment", it's time we get one. I think all it would take is somebody doing open-weight models applying the lessons we learned from LLMs and image generation to TTS models, namely more data, more scraping, more GPUs, less qualms and less safety. Eleven Labs already did, and they're profiting from it handsomely.

reply
pzo
10 hours ago
[-]
Kokoro gives great results, especially when speaking English. The model is small enough to run even on a smartphone, ~3x faster than realtime.
reply
bavell
9 hours ago
[-]
Another +1 to Kokoro from me, great quality with good speed.
reply
toebee
22 hours ago
[-]
Thank you for the kind words <3
reply
kreelman
19 hours ago
[-]
This is amazing. Is it possible to build in a chosen voice, a bit like Eleven Labs does? ...This may be in the repo summary; being lazy and asking anyway :) Thanks for your work.
reply
JonathanFly
17 hours ago
[-]
reply
Versipelle
1 day ago
[-]
This is really impressive; we're getting close to a dream of mine: the ability to generate proper audiobooks from EPUBs. Not just a robotic single voice for everything, but different, consistent voices for each protagonist, with the LLM analyzing the text to guess which voice to use and add an appropriate tone, much like a voice actor would do.

I've tried "EPUB to audiobook" tools, but they are really miles behind what a real narrator accomplishes and make the audiobook impossible to engage with

reply
mclau157
1 day ago
[-]
Realistic voice acting for audio books, realistic images for each page, realistic videos for each page, oh wait I just created a movie, maybe I can change the plot? Oh wait I just created a video game
reply
hleszek
14 hours ago
[-]
Now do it in VR and make it fully interactive.
reply
azinman2
1 day ago
[-]
Wouldn’t it be more desirable to hear an actual human on an audiobook? Ideally the author?
reply
satvikpendem
15 hours ago
[-]
Why a human? There are many cases where I like a book but dislike the audiobook speaker, so I essentially can't listen to that book anymore. With a machine, I can tweak the voice to my heart's content.
reply
iamsaitam
14 hours ago
[-]
And get a completely wrong/bland but custom read of the book. Reading is much more than simply transforming text to audio.
reply
satvikpendem
8 hours ago
[-]
Sometimes, I don't care if it's bland, I just want to listen to the text. There are a lot of Asian light novels for example which never get English audiobooks, and I've listened to many of them with basic TTS, not even an AI model TTS like these more recent ones, and I thoroughly enjoyed these books even still.
reply
Versipelle
1 day ago
[-]
> Wouldn’t it be more desirable to hear an actual human on an audiobook? Ideally the author?

Of course, but it's not always available.

For example, I would love an audiobook for Stanisław Lem's "The Invincible," as I just finished its video game adaptation, yet it simply doesn't exist in my native language.

It's quite seldom that the author narrates the audiobooks I listen to, and sometimes the narrator does a horrible job, butchering the characters with exaggerated tones.

reply
ks2048
22 hours ago
[-]
With 1M+ new books every year, that’s not possible for all but the few most popular.
reply
senordevnyc
1 day ago
[-]
Honestly, I’d say that’s true only for the author. Anyone else is just going to be interpreting the words to understand how to best convey the character / emotion / situation / etc., just like an AI will have to do. If an AI can do that more effectively than a human, why not?

The author could be better, because they at least have other info beyond the text to rely on, they can go off-script or add little details, etc.

reply
DrSiemer
1 day ago
[-]
As somebody who has listened to hundreds of audiobooks, I can tell you authors are generally not the best choice to voice their own work. They may know every intent, but they are writers, not actors.

The most skilled readers will make you want to read books _just because they narrated them_. They add a unique quality to the story, that you do not get from reading yourself or from watching a video adaptation.

Currently I'm in The Age of Madness, read by Steven Pacey. He's fantastic. The late Roy Dotrice is worth a mention as well, for voicing Game of Thrones and claiming the Guinness world record for most distinct voices (224) in one series.

It will be awesome if we can create readings automatically, but it will be a while before TTS can compete with the best readers out there.

reply
azinman2
22 hours ago
[-]
I’d suggest even if the TTS sounded good, I’d still rather a human because:

1. It’s a job that seems worthwhile to support, especially as it’s “practice” that only adds to a lifetime of work and improves their central skill set

2. A voice actor will bring their own flare, just like any actor does to their job

3. They (should) prepare for the book, understanding what it’s about in its entirety, and bring that context to the reading

reply
cchance
18 hours ago
[-]
You really think people writing these papers actually have good speaking voices? LOL, there's a reason not everyone could be an audiobook maker or podcaster; a lot of people's voices suck for audiobooks.
reply
tyrauber
1 day ago
[-]
Hey, do yourself a favor and listen to the fun example:

> [S1] Oh fire! Oh my goodness! What's the procedure? What do we do people? The smoke could be coming through an air duct!

Seriously impressive. Wish I could direct link the audio.

Kudos to the Dia team.

reply
jinay
1 day ago
[-]
For anyone who wants to listen, it's on this page: https://yummy-fir-7a4.notion.site/dia
reply
mrandish
1 day ago
[-]
Wow. Thanks for posting the direct link to examples. Those sound incredibly good and would be impressive for a frontier lab. For two people over a few months, it's spectacular.
reply
dostick
9 hours ago
[-]
This is an instant classic. Sesame comparison examples all sound like clueless rich people from The White Lotus.
reply
DoctorOW
1 day ago
[-]
A little overacted, it reminds me of the voice acting in those flash cartoons you'd see in the early days of YouTube. That's not to say it isn't good work, it still sounds remarkably human. Just silly humans :)
reply
3by7
11 hours ago
[-]
Overacted and silly humans indeed: https://www.youtube.com/watch?v=gO8N3L_aERg
reply
Cthulhu_
12 hours ago
[-]
"flash cartoons in the early days of Youtube" Wouldn't those be straight from Newgrounds?
reply
selimthegrim
22 hours ago
[-]
Reminded me of the Fenslerfilm G.I. Joe sketch where the kids have something on the stove burning
reply
wisemang
22 hours ago
[-]
Stop all the downloading!
reply
intalentive
6 hours ago
[-]
Sounds great. One of the female examples has convincing uptalk. There must be a way to manipulate the latent space to control uptalk, vocal fry, smoker’s voice, lispiness, etc.
reply
toebee
22 hours ago
[-]
Thank you!! Indeed, the script was inspired by a scene in The Office.
reply
3abiton
23 hours ago
[-]
This is oddly reminiscent of The Office. I wonder if TV shows were part of its training data!
reply
hombre_fatal
6 hours ago
[-]
Yeah, that example is insane.

Is there some sort of system prompt or hint at how it should be voiced, or does it interpret it from the text?

Because it would be hilarious if it just derived it from the text and it did this sort of voice acting when you didn't want it to, like reading a matter-of-fact warning label.

reply
nojs
1 day ago
[-]
This is so good. Reminds me of The Office. I love how bad the other examples are.
reply
fwip
1 day ago
[-]
The text is lifted from a scene in The Office: https://youtu.be/gO8N3L_aERg?si=y7PggNrKlVQm0qyX&t=82
reply
notdian
1 day ago
[-]
Made a small change and got it running on an M2 Pro 16GB MacBook Pro; the quality is amazing.

https://github.com/nari-labs/dia/pull/4

reply
rahimnathwani
21 hours ago
[-]
Thank you for this! My desktop GPU has only 8GB VRAM, but my MacBook has plenty of unified RAM.
reply
emmelaich
20 hours ago
[-]
Thanks, works well but slowly on a MacBook Air M3 with 24GB. Will have to try it again after freeing up more RAM, as it was doing a bit of swapping with Chrome running too.

(Later) It did nicely for the default example text but just made weird sounds for a "hello all" prompt. And took longer?!

reply
toebee
21 hours ago
[-]
Thank you for the contribution! We'll be merging PRs and cleaning code up very soon :)
reply
noiv
1 day ago
[-]
Can confirm, runs straightforwardly on 15.4.1 @ M4, thanks.
reply
toebee
1 day ago
[-]
Hey HN! We’re Toby and Jay, creators of Dia. Dia is a 1.6B-parameter open-weights model that generates dialogue directly from a transcript.

Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.

It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.

Demo page comparing it to ElevenLabs and Sesame-1B: https://yummy-fir-7a4.notion.site/dia

We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast-feel with APIs but it did not sound like human conversations.

So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch — from large-scale training, to audio tokenization. It took us a bit over 3 months.

Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.

We’d love to hear what you think! We are a tiny team, so open source contributions are extra welcome. Please feel free to check out the code, and share any thoughts or suggestions with us.
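
For a feel of the workflow, here is a minimal sketch (illustrative only; the repo README has the exact API and arguments):

    import soundfile as sf
    from dia.model import Dia

    # Load the 1.6B open-weights checkpoint and generate a short two-speaker
    # dialogue from a tagged transcript in a single pass.
    model = Dia.from_pretrained("nari-labs/Dia-1.6B")
    text = "[S1] Dia generates dialogue directly from a transcript. [S2] Nice! (laughs)"
    audio = model.generate(text)
    sf.write("dialogue.wav", audio, 44100)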

reply
dangoodmanUT
22 hours ago
[-]
I know it’s taboo to ask, but I must: where’s the dataset from? Very eager to play around with audio models myself, but I find existing datasets limiting
reply
xdfgh1112
2 hours ago
[-]
I suspect podcasts, as you have a huge amount of transcribed data with good diction and mic quality. The voices sound like podcast voices to me.
reply
zelphirkalt
20 hours ago
[-]
Why would that be a taboo question to ask? It should be the question we always ask, when presented with a model and in some cases we should probably reject the model, based on that information.
reply
dangoodmanUT
20 hours ago
[-]
Because generally the person asking this question is trying to cancel the model maker
reply
tough
19 hours ago
[-]
or because, by replying, you expose yourself to handing proof of the training data's origins to the copyright owner who wants to sue you next
reply
deng
13 hours ago
[-]
No. It's about giving credit where credit is due. And yes, that includes the question of whether the people who generated the training data in the first place have given their consent for it to be used for AI training.

It's quite concerning that the community around here is usually livid about FOSS license violations, which typically use copyright law as leverage, but somehow is perfectly OK with training models on copyrighted work and just labels that as "fair use".

reply
gfaure
1 day ago
[-]
Amazing that you developed this over the course of three months! Can you drop any insight into how you pulled together the audio data?
reply
isoprophlex
1 day ago
[-]
+1 to this, amazing how you managed to deliver this, and if you're willing to share I'd be most interested in learning what you did in terms of training data..!
reply
heystefan
1 day ago
[-]
Could one use case be generating an audiobook with this from existing books? I wonder if I could fine-tune the "characters" that speak these lines, since you said it's a single pass for the whole convo. Wonder if that's a limitation for this kind of use case (where speed is not imperative).
reply
toebee
20 hours ago
[-]
Yes! But you would need to put together an LLM system that creates scripts from the book content. There is an open source project called OpenNotebookLM (https://github.com/gabrielchua/open-notebooklm) that does something similar. If you hook the Dia model up to that kind of system, it will be very possible :) Thanks for the interest!
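
A rough sketch of that kind of pipeline (illustrative only; it assumes an OpenAI-compatible LLM client plus the Dia Python API, and the file name and LLM model are placeholders):

    import soundfile as sf
    from openai import OpenAI
    from dia.model import Dia

    # 1. Ask an LLM to turn a chunk of the book into a two-speaker, tagged script.
    llm = OpenAI()
    chapter = open("chapter1.txt").read()
    script = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Rewrite this passage as a short dialogue between two "
                       "speakers. Prefix each line with [S1] or [S2].\n\n" + chapter,
        }],
    ).choices[0].message.content

    # 2. Feed the tagged script to Dia in a single pass.
    model = Dia.from_pretrained("nari-labs/Dia-1.6B")
    audio = model.generate(script)
    sf.write("chapter1.wav", audio, 44100)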
reply
satvikpendem
15 hours ago
[-]
Another project, specifically for creating audiobooks: https://github.com/prakharsr/audiobook-creator
reply
karimf
20 hours ago
[-]
This is super awesome. Several questions.

1. What GPU did you use to train the model? I'd love to train a model like this, but currently I only have a 16GB MacBook. Thinking about buying a 5090 if it's worth it.

2. Is it possible to use this for real time audio generation, similar to the demo on the Sesame website?

reply
smusamashah
23 hours ago
[-]
Hi! This is awesome for size and quality. I want to see a book reading example or try it myself.

This is a tangential point, but it would have been nicer if it weren't a Notion site. You could put the same page on GitHub Pages and it would be much lighter to open, navigate, and link (e.g., for people trying to link to some audio).

reply
toebee
20 hours ago
[-]
Thanks for the kind words! You can try it now on https://huggingface.co/spaces/nari-labs/Dia-1.6B Also, we'll try to update the Demo Page to something lighter when we have time. Thanks for the feedback :))
reply
cchance
18 hours ago
[-]
It's really amazing, can't wait to play with it some. The samples are great... but oddly they all seem... really fast, like they'd be perfect but they feel like they're playing at 1.2x speed, or is that just me?
reply
claiir
12 hours ago
[-]
It’s not just you. The speedup is an artefact of the CFG (Classifier-Free Guidance) the model uses. The other problem is the speedup isn’t constant—it actually accelerates as the generation progresses. The Parakeet paper [1] (which OP lifted their model architecture almost directly from [2]) gives a fairly robust treatment to the matter:

> When we apply CFG to Parakeet sampling, quality is significantly improved. However, on inspecting generations, there tends to be a dramatic speed-up over the duration of the sample (i.e. the rate of speaking increases significantly over time). Our intuition for this problem is as follows: Say that our model is (at some level) predicting phonemes and the ground truth distribution for the next phoneme occurring is 25% at a given timestep. Our conditional model may predict 20%, but because our unconditional model cannot see the text transcription, its prediction for the correct next phoneme will be much lower, say 5%. With a reasonable level of CFG, because [the logit delta] will be large for the correct next phoneme, we’ll obtain a much higher final probability, say 50%, which biases our generation towards faster speech. [emphasis mine]

Parakeet details a solution to this, though this was not adopted (yet?) by Dia:

> To address this, we introduce CFG-filter, a modification to CFG that mitigates the speed drift. The idea is to first apply the CFG calculation to obtain a new set of logits as before, but rather than use these logits to sample, we use these logits to obtain a top-k mask to apply to our original conditional logits. Intuitively, this serves to constrict the space of possible “phonemes” to text-aligned phonemes without heavily biasing the relative probabilities of these phonemes (or for example, start next word vs pause more). [emphasis mine]

The paper contains audio samples with ablations you can listen to.
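
For intuition, the CFG-filter they describe amounts to something like this (a paraphrase in PyTorch of the idea above, not Parakeet's or Dia's actual code):

    import torch

    def cfg_filter_sample(cond_logits, uncond_logits, cfg_scale=3.0, k=50):
        # Standard CFG: push predictions toward the text-conditioned logits.
        cfg_logits = uncond_logits + cfg_scale * (cond_logits - uncond_logits)
        # CFG-filter: use the CFG logits only to pick a top-k mask, then sample
        # from the *conditional* logits restricted to that mask, so the relative
        # probabilities (e.g. next phoneme vs. pause) aren't biased toward
        # ever-faster speech.
        topk_idx = torch.topk(cfg_logits, k, dim=-1).indices
        masked = torch.full_like(cond_logits, float("-inf"))
        masked.scatter_(-1, topk_idx, cond_logits.gather(-1, topk_idx))
        probs = torch.softmax(masked, dim=-1)
        return torch.multinomial(probs, num_samples=1)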

[1]: https://jordandarefsky.com/blog/2024/parakeet/#classifier-fr...

[2]: https://news.ycombinator.com/item?id=43758686

reply
nickthegreek
1 day ago
[-]
Are there any examples of the audio differences between this and the larger model?
reply
toebee
21 hours ago
[-]
We're still experimenting, so do not have samples yet from the larger model. All we have is Dia-1.6B at the moment.
reply
cchance
18 hours ago
[-]
I didn't see it or I missed it: are you planning to release the larger model as well?
reply
bzuker
1 day ago
[-]
hey, this looks (or rather, sounds) amazing! Does it work with different languages or is it English only?
reply
toebee
20 hours ago
[-]
Thank you!! Works for English only unfortunately :((
reply
llm_nerd
23 hours ago
[-]
This is a pretty incredible three month creation for a couple of people who had no experience with speech models.
reply
toebee
20 hours ago
[-]
Thanks for the kind words! We're just following our interests and staying upwind.
reply
new_user_final
1 day ago
[-]
Easily 10 times better than the recent OpenAI voice model. I don't like robotic voices.

The example voices seem overly loud and overexcited, like Andrew Tate, Speed, or an advertisement. What's lacking is calm, normal conversation or normal podcast-like interaction.

reply
toebee
20 hours ago
[-]
Thank you! You can add audio prompts of calm voices to make them a bit smoother. You can try it here: https://huggingface.co/spaces/nari-labs/Dia-1.6B
reply
moritonal
19 hours ago
[-]
Isn't it weird how "We don't have a full list of non-verbal [commands]"? Like, I can imagine why, but it's wild we're at a point where we don't know what our code can do.
reply
kevmo314
18 hours ago
[-]
I have a sneaking suspicion it's because they lifted the model architecture almost directly from Parakeet: https://jordandarefsky.com/blog/2024/parakeet/

Parakeet references WhisperD which is at https://huggingface.co/jordand/whisper-d-v1a and doesn't include a full list of non-speech events that it's been trained with, except "(coughs)" and "(laughs)".

Not saying the authors didn't do anything interesting here. They put in the work to reproduce the blog post and open source it, a praiseworthy achievement in itself, and they even credit Parakeet. But they might not have the list of commands for more straightforward reasons.

reply
toebee
18 hours ago
[-]
You're absolutely right. We used Jordan's Whisper-D, and he was generous enough to offer some guidance along the way.

It's also a valid criticism that we haven’t yet audited the dataset for the full list of tags. That’s something we’ll be improving soon.

As for Dia’s architecture, we largely followed existing models to build the 1.6B version. Since we only started learning about speech AI three months ago, we chose not to innovate too aggressively early on. That said, we're planning to introduce MoE and Sliding Window Attention in our larger models, so we're excited to push the frontier in future iterations.

reply
kamranjon
16 hours ago
[-]
I’m curious what differentiates it from Parakeet? I was listening to some of the demos on the parakeet announcement and they sound very similar to your examples - are they trained on the same data? Are there benefits to using Dia over Parakeet?
reply
rustc
1 day ago
[-]
Is this Apache licensed or a custom one? The README contains this:

> This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

> This project offers a high-fidelity speech generation model *intended solely for research and educational use*. The following uses are strictly forbidden:

> Identity Misuse: Do not produce audio resembling real individuals without permission.

> ...

Specifically the phrase "intended solely for research and educational use".

reply
toebee
21 hours ago
[-]
Sorry for the confusion. The license is plain Apache 2.0, and we changed the wording to "intended for research and educational use." The point was: users are free to use it for their use cases, just don't do shady stuff with it.

Thanks for the feedback :)

reply
crooked-v
20 hours ago
[-]
So is that actually part of the license (making it non-Apache 2.0), or not?
reply
toebee
18 hours ago
[-]
not part of the license!
reply
montroser
22 hours ago
[-]
Hmm, the "strictly forbidden" part seems more important than whatever are their stated intentions... Either way, it seems like it needs clarifying.
reply
strobe
1 day ago
[-]
Just in case, another open-source project using the same name: https://wiki.gnome.org/Apps/Dia/

https://gitlab.gnome.org/GNOME/dia

reply
freedomben
1 day ago
[-]
Fun, I can't get to it because I can't get past the "Making sure you're not a bot!" page. It's just stuck at "calculating...". I understand the desire to slow down AI bots, but still. If all the GNOME apps are now behind this, they just completely shut down a small-time contributor. I love to play with GNOME apps and help out with things here and there, but I'm not going to fight with this damn thing to do so.
reply
SoKamil
1 day ago
[-]
And another one, not open source but in AI sphere: https://www.diabrowser.com/
reply
toebee
1 day ago
[-]
Thanks for the heads-up! We weren’t aware of the GNOME Dia project. Since we focus on speech AI, we’ll make sure to clarify that distinction.
reply
aclark
1 day ago
[-]
Ditto this! Dia diagram tool user here just noticing the name clash. Good luck with your Dia!! Assuming both can exist in harmony. :-)
reply
mrandish
1 day ago
[-]
> Assuming both can exist in harmony.

I'm sure they can... talk it over.

I'll show myself out.

reply
Magma7404
1 day ago
[-]
I know it's a bit ridiculous to see this as some kind of conspiracy, but I have seen a very long list of AI-related projects that got the same name as a famous open-source project, as if they wanted to hijack the popularity of those projects, and Dia is yet another example. It was relatively famous a few years ago, and you cannot have forgotten it if you used Linux for more than a few weeks. It's almost as if it's done on purpose.
reply
teddyh
1 day ago
[-]
The generous interpretation is that the AI hype people just didn’t know about those other projects, i.e. that they are neither open source developers, nor users.
reply
gapan
1 day ago
[-]
Of course, how could they have known? Doing a basic web search before deciding on a name is so last year.
reply
teddyh
5 hours ago
[-]
Maybe they only asked an LLM about it?
reply
Havoc
1 day ago
[-]
Sounds really good & human! Got a fair few unexpected artifacts though, e.g. a 3-second hissing noise before the dialogue. And music in the background when I added (happy) in an attempt to control tone. Also don't understand how to control the S1 and S2 speakers... is it just random based on temp?

> TODO Docker support

Got this adapted pretty easily. Just the latest NVIDIA CUDA container, throw Python and the modules on it, and change the server to serve on 0.0.0.0. Does mean it pulls the model every time on startup though, which isn't ideal.
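
Roughly, the Dockerfile sketch looks like this (the install step and entry point are from memory, so treat them as placeholders and adjust for the repo layout):

    FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04
    RUN apt-get update && apt-get install -y python3 python3-pip git
    WORKDIR /app
    COPY . /app
    # or `pip3 install -r requirements.txt`, whichever the repo ships
    RUN pip3 install -e .
    EXPOSE 7860
    # app.py edited so the Gradio server launches with server_name="0.0.0.0"
    CMD ["python3", "app.py"]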

reply
toebee
22 hours ago
[-]
Thank you for the kind words! Dia wasn’t fine-tuned on a certain speaker, so you will get random voices every time you run it, unless you add an audio prompt / fix the seed.

The outputs are a bit unstable; we might need to add cleaner training data and run longer training sessions. Hopefully we can do something like OAI Whisper and update with better-performing checkpoints!

reply
dragonwriter
21 hours ago
[-]
> Also don't understand how to control the S1 and S2 speakers...

Do a clip with the speakers you want as the audio prompt, add the text of that clip (with speaker tags) at the beginning of your text prompt, and it clones the voices from your audio prompt for the output.
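
For example (made-up lines; the first line is the transcript of your reference clip, the second is the new content you want generated in those voices):

    [S1] This is the transcript of the reference clip. [S2] And this is the second speaker in that clip.
    [S1] This is the new line I actually want generated. [S2] It should come out in the cloned voices.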

reply
yjftsjthsd-h
1 day ago
[-]
> Does mean it pulls the model every time on startup though which isn't ideal

Surely it just downloads to a directory that can be volume mapped?

reply
Havoc
23 hours ago
[-]
Yep. I just didn't spend the time to track down the location, tbh. Plus Hugging Face usually symlinks into a cache folder whose location I don't recall.

Literally got CUDA containers working earlier today, so I haven't spent a huge amount of time figuring things out.

reply
genewitch
3 hours ago
[-]
It's in a dot folder in your home dir on Linux and in %APPDATA% on Windows.
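
So for the Docker case upthread, mounting that cache dir avoids the re-download on every startup (default Linux path shown; it moves if HF_HOME is set, and the image name and port here are placeholders):

    docker run --gpus all -p 7860:7860 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      dia-tts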
reply
xbmcuser
1 day ago
[-]
Wow, first time I have felt that this could be the end of voice acting / audiobook narration, etc. With the speed at which things are changing, how soon before you can make any book or novel into a complete audio-video production, a movie, or a TV show?
reply
dindindin
23 hours ago
[-]
Was this trained on Planet Money / NPR podcasts? The last audio (continuation of prompt) sounds eerily like Planet Money, I had to double check if my Spotify had accidentally started playing.
reply
jelling
23 hours ago
[-]
NPR voice is a thing.

It started with Ira Glass voice and now the default voice is someone that sounds like they're not certain they should be saying the very banal thing they are about to say, followed by a hand-shake protocol of nervous laughter.

reply
genewitch
3 hours ago
[-]
Thank goodness for Scott Simon!
reply
stuartjohnson12
1 day ago
[-]
Impressive project! We'd love to use something like this over at Delfa (https://delfa.ai). How does this hold up from the perspective of stability? I've spoken to various folks working on voice models, and one thing that has consistently held Eleven Labs ahead of the pack from my experience is that their models seem to mostly avoid (while albeit not being immune to) accent shifts and distortions when confronted with unfamiliar medical terminology.

A high quality, affordable TTS model that can consistently nail medical terminology while maintaining an American accent has been frustratingly elusive.

reply
toebee
1 day ago
[-]
Interesting. I haven't thought of that problem before. I'm guessing a large enough audio dataset for medical terminology does not exist publicly.

But AFAIK, even if you have just a few hours of audio containing specific terminology (and correct pronunciation), fine-tuning on that data will significantly improve performance.

reply
codingmoh
1 day ago
[-]
Hey, this is really cool! Curious how good the multi-language support is. Also - pretty wild that you trained the whole thing yourselves, especially without prior experience in speech models.

Might actually be helpful for others if you ever feel like documenting how you got started and what the process looked like. I’ve never worked with TTS models myself, and honestly wouldn’t know where to begin. Either way, awesome work. Big respect.

reply
toebee
20 hours ago
[-]
Thank you so much for the kind words :) We only support English at the moment, hopefully can do more languages in the future. We are planning to release a technical report on some of the details, so stay tuned for that!
reply
bavell
9 hours ago
[-]
I'd also love to peek behind the curtains, if only to satisfy my own curiosity. Looking forward to the technical report, well done!
reply
toebee
21 hours ago
[-]
We have a ZeroGPU Space provided by HuggingFace up and running! Test it now on https://huggingface.co/spaces/nari-labs/Dia-1.6B
reply
daemonologist
19 hours ago
[-]
The examples on your site are impressive, but I'm having trouble getting good results on HF - it's generating a lot of near-silence (often nothing but) and when it does produce speech it bears no resemblance to the audio prompt and only produces parts of the text prompt. Would you suggest any adjustments to the default parameters to improve adherence, or might I expect better results running locally? Thanks!
reply
toebee
1 day ago
[-]
It is way past bedtime here, will be getting back to comments after a few hours of sleep! Thanks for all the kind words and feedback
reply
LarsDu88
16 hours ago
[-]
Fantastic model. I'm going to write a Unity plugin for this. Have been using ElevenLabs for my VR game side project, but this appears to be better
reply
sarangzambare
1 day ago
[-]
Impressive demo! We'd love to use this at https://useponder.ai

Time to first audio is crucial for us to reduce latency - wondering if Dia works with output streaming?

The Python code snippet seems to imply that the entire audio is generated in one go?

reply
toebee
1 day ago
[-]
Sounds awesome! I think it won't be very hard to run it using output streaming, although that might require beefier GPUs. Give us an email and we can talk more - nari.ai.contact at gmail dot com.

It's way past bedtime where I live, so will be able to get back to you after a few hours. Thanks for the interest :)

reply
sarangzambare
1 day ago
[-]
No worries, I will email you.
reply
eob
1 day ago
[-]
Bravo -- this is fantastic.

I've been waiting for this ever since reading some interview with Orson Scott Card ages ago. It turns out he thinks of his novels as radio theater, not books. Which is a very different way to experience the audio.

reply
toebee
21 hours ago
[-]
Thanks for the kind words :)))
reply
IshKebab
1 day ago
[-]
Why does it say "join waitlist" if it's already available?

Also, you don't need to explicitly create and activate a venv if you're using uv - it deals with that nonsense itself. Just `uv sync`.

reply
toebee
1 day ago
[-]
We're envisioning a platform with a social aspect, so that is the biggest difference. Also, bigger models!

We are aware of the fact that you do not need to create a venv when using pre-existing uv. Just added it for people spinning up new GPUs on cloud. But I'll update the README to make that a bit clearer. Thanks for the feedback :)

reply
flakiness
1 day ago
[-]
Seek back a few tens of bytes which states "Play with a larger version of Dia"
reply
oehtXRwMkIs
22 hours ago
[-]
Any plans for AMD GPU support? Maybe I'm missing something, but it's not working out of the box on a 7900xtx.
reply
toebee
20 hours ago
[-]
We will try to make it work, but not sure if it will be an easy task. For now, you can try it with https://huggingface.co/spaces/nari-labs/Dia-1.6B
reply
basilgohar
7 hours ago
[-]
Seeing is no longer believing. Hearing isn't either. The funny thing is, it's getting to the point where LLM-generated text is more easily spotted than AI audio, video, and images.

It's going to be an interesting decade of the new equivalent of "No, Tiffany, Bill Gates will NOT be sending you $100 for forwarding that email." Except it's going to be AI celebrities making appeals for donations to help them become billionaires or something.

reply
a2128
1 day ago
[-]
What's the training process like? I have some data in my language I'd love to use to train it, seeing as it's English-only.
reply
toebee
21 hours ago
[-]
We'll try to give a high-level overview when we publish the technical report!
reply
howon92
16 hours ago
[-]
Training an audio model this good from 0 prior experience is really amazing. I would love to read a blog post about how you guys approached ramping up knowledge and getting practical quickly. Any plans?
reply
999900000999
1 day ago
[-]
Does this only support English?

I would absolutely love something like this for practicing Chinese, or even just adding Chinese dialogue to a project.

reply
buttercrab
21 hours ago
[-]
Hi! I'm Dia's developer. We currently only support English.
reply
isoprophlex
1 day ago
[-]
Incredible quality demo samples, well done. How's the performance for multilingual generation?
reply
toebee
20 hours ago
[-]
Thank you for the kind words! We only support English at the moment.. Hope to add more languages in the future.
reply
elia_42
8 hours ago
[-]
Interesting. I will definitely look into it further.
reply
enodios
22 hours ago
[-]
The audio quality is seriously impressive. Any plans to add word-level timing maps? For my usecase that is a requirement, so unfortunately I cannot use this yet, but I would very much like to.
reply
toebee
21 hours ago
[-]
Thank you for the kind words! We don't have plans for that yet, but you can always open an issue or PR on GitHub.
reply
gitroom
11 hours ago
[-]
Pretty cool - love seeing open stuff like this come together so fast. Do you think all this will ever match what a real voice actor can pull off, or will something totally new come out of it?
reply
ivape
1 day ago
[-]
Darn, don't have the appropriate hardware.

The full version of Dia requires around 10GB of VRAM to run.

If you have 16GB of VRAM, I guess you could pair this with a 3B param model alongside it, or really probably only a 1B param model with a reasonable context window.

reply
toebee
1 day ago
[-]
We will work on a quantized version of the model, so hopefully you will be able to run it soon!

We've seen Bark from Suno go from 16GB requirement -> 4GB requirement + running on CPUs. Won't be too hard, just need some time to work on it.

reply
ivape
1 day ago
[-]
No doubt, these TTS models locally are what I'm looking for because I'm so done typing and reading :)
reply
toebee
21 hours ago
[-]
reply
ivape
1 hour ago
[-]
Woah, you guys are on it!
reply
gamificationpan
7 hours ago
[-]
Thank you! Awesome resources.
reply
youssefabdelm
1 day ago
[-]
Anyone know if it's possible to fine-tune it for cloning my voice?
reply
toebee
20 hours ago
[-]
We're adding guides for Zero-shot voice cloning. You can try it using the second example on Gradio: https://huggingface.co/spaces/nari-labs/Dia-1.6B
reply
youssefabdelm
8 hours ago
[-]
Will give it a shot, but I feel like fine-tuning would be more reliable. Any way to do that?
reply
verghese
1 day ago
[-]
How does this compare with Spark TTS?

https://github.com/SparkAudio/Spark-TTS

reply
popalchemist
1 day ago
[-]
This looks excellent, thank you for releasing openly.
reply
instagary
23 hours ago
[-]
Does this use the Mimi codec from Moshi? If so, it would be straightforward to get Dia running on iOS!
reply
toebee
22 hours ago
[-]
We use the Descript Audio Codec! I’m not sure if DAC works on iOS…
reply
dangoodmanUT
22 hours ago
[-]
Has the same issue of cutting off the end of the provided text that many other models have.
reply
pzo
1 day ago
[-]
Sounds great. Hope for more language support in the future. In comparison, Sesame CSM-1B sounds like it was trained on stoned people.
reply
brumar
1 day ago
[-]
Impressive! Is it english only at the moment?
reply
toebee
1 day ago
[-]
Unfortunately yes at the moment
reply
flashblaze
17 hours ago
[-]
I'm lost for words. This is extremely impressive!
reply
noiv
1 day ago
[-]
The demo page does fancy stuff when marking text and hitting cmd-d to create a bookmark :)
reply
sroussey
19 hours ago
[-]
Can we get this working in the browser a la ONNX or similar?
reply
qwertytyyuu
20 hours ago
[-]
That Sesame CSM-1B voice sounds sooo done with life, haha.
reply
vagabund
1 day ago
[-]
The huggingface spaces link doesn't work, fyi.

Sounds awesome in the demo page though.

reply
toebee
21 hours ago
[-]
reply
toebee
21 hours ago
[-]
We are in the process of fixing it! Thanks for letting us know :)
reply
film42
1 day ago
[-]
Very very impressive.
reply
user_4028b09
18 hours ago
[-]
This is a really impressive project – looking forward to trying it out!
reply
bazlan
12 hours ago
[-]
fluxions.ai has a similar model
reply
zhyder
1 day ago
[-]
V v cool: first time I've seen such expressiveness in TTS for laughs, coughs, yelling about a fire, etc!

What're the recommended GPU cloud providers for using such open-weights models?

reply
toebee
20 hours ago
[-]
Thank you!! We personally used Quickpod and Runpod the most. But you can try it now on HF Spaces without spinning up GPUs yourself!

https://huggingface.co/spaces/nari-labs/Dia-1.6B

reply
JonathanFly
19 hours ago
[-]
> first time I've seen such expressiveness in TTS for laughs, coughs, yelling about a fire, etc!

The old Bark TTS is noisy and often unreliable, but pretty great at coughs, throat clears, and yelling. Even dialogs... sometimes. Same Dia prompt in Bark: https://vocaroo.com/12HsMlm1NGdv

Dia sounds much clearer and more reliable; wild what 2 people can do in 3 months.

reply
hiAndrewQuinn
23 hours ago
[-]
Is this English-only? I'm looking for a local model for Finnish dialogue to run.
reply
lostmsu
1 day ago
[-]
Does this only work for two voices? Can I generate an entire conversation between multiple people? Like this HN thread.
reply
toebee
20 hours ago
[-]
Only two voices at the moment... We will need to upgrade the dataset to make that happen, and are considering that as one of the next steps.
reply
benterix
13 hours ago
[-]
Whoah!
reply
jackchina
8 hours ago
[-]
test good!
reply
xhkkffbf
1 day ago
[-]
Are there different voices? Or only [S1] and [S2] in the examples?
reply
toebee
20 hours ago
[-]
We just clarified in the README, sorry for the confusion ;(

Note that the model was not fine-tuned on a specific voice. Hence, you will get different voices every time you run the model. You can keep speaker consistency by either adding an audio prompt (a guide coming VERY soon - try it with the second example on Gradio or HF Space for now), or fixing the seed.
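
For the seed route, the standard PyTorch calls are enough as a sketch (exact reproducibility still depends on the rest of the generation pipeline):

    import torch

    torch.manual_seed(42)           # fix the sampling RNG before generation
    torch.cuda.manual_seed_all(42)  # also pin the CUDA RNGs when on GPU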

reply
xienze
1 day ago
[-]
How do you declare which voice should be used for a particular speaker? And can it created a cloned speaker voice from a sample?
reply
toebee
1 day ago
[-]
You can add an audio prompt and prepend text corresponding to it in the script. You can get a feel for it by trying the second example in the Gradio interface!
reply
jokethrowaway
1 day ago
[-]
Looking forward to try. My current go-to solution is E5-F2 (great cloning, decent delivery, ok audio quality, a lot of incoherence here and there forcing you to do multiple generations).

I've just been massively disappointed by Sesame's CSM: in the Gradio demo on their website it was generating flawless dialogs with amazing voice cloning. When running it locally, the voice cloning performance is awful.

reply
toebee
20 hours ago
[-]
Thanks for the interest! We also enjoyed using E5-F2 :) You can try it now on HF Spaces: https://huggingface.co/spaces/nari-labs/Dia-1.6B
reply
mclau157
1 day ago
[-]
Will you support the other side with AI voice detection software to detect and block malicious voice snippets?
reply