Show HN: Tiny Diffusion – A character-level text diffusion model from scratch
153 points | 5 days ago | 9 comments | github.com
This is a character-level language diffusion model for text generation.

The model is a modified version of Nanochat's GPT implementation and is trained on Tiny Shakespeare!

It has only 10.7 million parameters, so you can try it out locally.

simonw
1 day ago
This is really neat.

I noticed the diffusion-process.py demo was using matplotlib in a window, but I figured it would be cute if it used a terminal UI instead - so I had Claude Code convert it to use curses. Code and demo GIF here: https://gist.github.com/simonw/9033ebd8dd17b4c0ad101ddda7a54...

mlmonkey
1 day ago
I'm curious: has there been any work done on generating embedding vectors instead of discrete tokens via diffusion? What would that look like? Please point me to some references. Thanks!
yugretcx
1 day ago
Why do these text diffusion demos always look like the number of allowed tokens is fixed for a specific unfilled region?

Is this the case?

I.e., if the region only has four tokens (here, characters) but the model calculates that the best word is “forget”, does it just abandon the best fit or truncate it to fit?

Are there text diffusion models with lax infill directives?

rand0mwalk
1 day ago
Tokens start as a special [MASK] token. Then, as the diffusion process runs, they are "unmasked", i.e., sampled.

So yes, you define a sequence of [MASK] tokens with some length ahead of time.

In practice, if a model wants to write a shorter sequence, it'll just fill the remaining tokens with empty content. If it wants to write a longer sequence, you'll have to identify this and extend the sequence with more [MASK] tokens. This is typically obvious since there's no "end of sequence" token present if the model wants to generate more.
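
A minimal sketch of the length handling described above, in plain Python. The `[EOS]` marker, the helper names, and the `extra` parameter are assumptions made for illustration; they are not taken from the project's code.

```python
MASK = "[MASK]"  # hypothetical mask marker
EOS = "[EOS]"    # hypothetical end-of-sequence / padding marker


def extend_if_unfinished(tokens, extra=4):
    """Append `extra` mask tokens when a fully decoded sequence has no EOS marker."""
    fully_decoded = MASK not in tokens
    if fully_decoded and EOS not in tokens:
        # The model used every slot and never signalled the end: give it more room.
        return tokens + [MASK] * extra
    return tokens


def trim_at_eos(tokens):
    """Drop the first EOS marker and everything after it (the 'empty content')."""
    return tokens[:tokens.index(EOS)] if EOS in tokens else tokens
```

In this sketch a caller would alternate between running the sampler and calling `extend_if_unfinished` until an EOS marker shows up, then call `trim_at_eos` on the result.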

nathan-barry
1 day ago
Yes, this is the case. During training, the model gets a sequence of text (e.g., 512 tokens long) with a percentage of the tokens masked out (replaced with a special <MASK> token). It learns how to unmask those tokens to reconstruct the original text.
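
A rough sketch of that masking step in plain Python, assuming a fixed mask ratio per example; the function name, the `mask_ratio` parameter, and the choice to return the masked positions are illustrative assumptions, not the project's actual training code.

```python
import random

MASK = "<MASK>"


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random `mask_ratio` fraction of positions with MASK.

    Returns the corrupted sequence and the sorted list of masked positions,
    which are the positions a training loss would be computed on.
    """
    n_mask = round(len(tokens) * mask_ratio)
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return corrupted, sorted(positions)


# Usage: mask half of a short character sequence with a seeded generator.
original = list("to be or not")
corrupted, masked_positions = mask_tokens(original, 0.5, random.Random(0))
```

The model would then be trained to predict `original[i]` for every `i` in `masked_positions`, given `corrupted` as input.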

In the case that you mentioned, if we had 4 <MASK> tokens in a row, all we are doing for decoding is predicting what those 4 tokens should be.

Generally, this does not seem to be a significant problem, as there are usually multiple ways to express an idea in varying lengths. Also, with confidence-aware parallel decoding, a well-trained model can usually avoid the scenario you mentioned, since it focuses on decoding the highest-confidence tokens.
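
One way to picture a single confidence-aware decoding step, sketched in plain Python. The threshold rule, the fallback to the single most confident position, and the `probs` structure (a dict mapping each masked position to a character-to-probability dict) are assumptions for illustration; a real model would recompute the predictions after every step.

```python
MASK = "<MASK>"


def decode_step(tokens, probs, threshold=0.9):
    """Unmask every masked position whose top prediction reaches `threshold`.

    If no position reaches it, unmask only the single most confident one,
    so that each step fills in at least one token.
    """
    candidates = {}
    for i, tok in enumerate(tokens):
        if tok == MASK:
            best = max(probs[i], key=probs[i].get)
            candidates[i] = (probs[i][best], best)
    if not candidates:
        return list(tokens)  # nothing left to decode
    chosen = [i for i, (p, _) in candidates.items() if p >= threshold]
    if not chosen:
        chosen = [max(candidates, key=lambda i: candidates[i][0])]
    out = list(tokens)
    for i in chosen:
        out[i] = candidates[i][1]
    return out


# Usage: position 1 is confident enough to be filled first; position 2 waits.
tokens = ["t", MASK, MASK, "t"]
probs = {1: {"h": 0.95, "a": 0.05}, 2: {"a": 0.6, "i": 0.4}}
step1 = decode_step(tokens, probs)
step2 = decode_step(step1, probs)
```

Because low-confidence positions are deferred, they get decoded later with more context around them, which is how this kind of scheme can steer away from a word that does not fit the remaining slots.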

Majromax
1 day ago
The basic MLP block in this model uses a ReLU^2 activation function (x <- ReLU(x)^2). That seems to be copied from the nanochat project, and it's not present in nanoGPT. Is there some documentation on the choice of this activation function?
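
For reference, the activation itself is tiny. A dependency-free sketch follows; the function names are illustrative, and in PyTorch the same thing would typically be written as `F.relu(x).square()`.

```python
def relu_squared(x):
    """ReLU^2: zero for negative inputs, x squared otherwise."""
    return max(x, 0.0) ** 2


def relu_squared_all(values):
    """Apply ReLU^2 elementwise to a list of floats."""
    return [relu_squared(v) for v in values]
```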
throwaway2027
21 hours ago
Isn't it because ReLU is cheap and ^2 is squared loss?
kouteiheika
14 hours ago
When it comes to compute cost, the choice of activation function makes little difference nowadays (and it can often be fused with whatever operation comes before it, which makes it effectively free).

The real reason is simple: it was inherited.

ReLU^2 was used in the nanoGPT speedrun[1] because it produced the best empirical results; then Andrej based his nanochat on the nanoGPT speedrun without changing the activation function, and then this project was based on nanochat.

[1] -- https://github.com/KellerJordan/modded-nanogpt

macleginn
12 hours ago
There has been some experimentation with the use of ReLU^2 in language models in recent years, e.g., here: https://proceedings.neurips.cc/paper_files/paper/2021/file/2...
gdiamos
21 hours ago
One year later, and there is still no inference engine for diffusion LLMs.

Students looking for a project to break into AI - please!

nathan-barry
21 hours ago
Actually, NVIDIA made one earlier this year; check out their Fast-dLLM paper.
gdiamos
17 hours ago
Thanks, I’ll check it out!
gdiamos
16 hours ago
Did I miss something? https://github.com/NVlabs/Fast-dLLM/blob/main/llada/chat.py

That’s inference code, but where is the high-performance web server?

tough
5 hours ago
Training code for diffusion models, inspired by nanochat: https://github.com/ZHZisZZ/dllm

Now someone needs to make it work with vLLM or something.

embedding-shape
1 day ago
Fun project: easy to understand, with nice-looking results. Everything one could ask for! I played around with it locally, optimized some low-hanging fruit without making it much more complicated, and was going to send over a PR. But then I noticed there is no license attached to the project. What are your plans regarding licensing?
nathan-barry
1 day ago
Hey, I’ll add the MIT license later today!
volodia
1 day ago
There is also this one that was released in October: https://github.com/kuleshov/char-mdlm
tell_me_whai
23 hours ago
Looks fun, thanks for sharing. I see you're implementing game of life sampling, what's the reasoning for using this logic?
doppelgunner
9 hours ago
This is impressive. Can it run on mobile?