The model is a modified version of Nanochat's GPT implementation and is trained on Tiny Shakespeare!
It has only 10.7 million parameters, so you can try it out locally.
I noticed the diffusion-process.py demo was using matplotlib in a window, but I figured it would be cute if it used a terminal UI instead - so I had Claude Code convert it to use curses. Code and demo GIF here: https://gist.github.com/simonw/9033ebd8dd17b4c0ad101ddda7a54...
Is this the case?
I.e., if the region has only four tokens (here, characters) but the model calculates that the best word is “forget”, does it just abandon the best fit or truncate it to fit?
Are there text diffusion models with lax infill directives?
So yes, you define a sequence of [MASK] tokens with some length ahead of time.
In practice, if a model wants to write a shorter sequence, it'll just fill the remaining tokens with empty content. If it wants to write a longer sequence, you'll have to identify this and extend the sequence with more [MASK] tokens. This is typically obvious since there's no "end of sequence" token present if the model wants to generate more.
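The extend-on-missing-EOS loop described above could look roughly like this. It's a minimal sketch, not code from the project: `fill_masks`, `toy_fill`, the `<eos>` string, and the block size are all made up for illustration, and the toy "model" just copies from a fixed target string.

```python
MASK = "[MASK]"
EOS = "<eos>"  # hypothetical end-of-sequence marker
TARGET = list("to be or not to be") + [EOS]  # what the toy "model" wants to write


def generate_with_extension(fill_masks, prompt, block_size=8, max_blocks=4):
    """Append blocks of MASK tokens until the filled sequence contains EOS.

    fill_masks is a hypothetical stand-in for the model: it takes a list of
    tokens and returns a copy with every MASK replaced.
    """
    tokens = list(prompt)
    for _ in range(max_blocks):
        tokens = fill_masks(tokens + [MASK] * block_size)
        if EOS in tokens:
            # The model signalled it is done; drop EOS and any padding after it.
            return tokens[: tokens.index(EOS)]
    return tokens  # gave up after max_blocks without seeing EOS


def toy_fill(tokens):
    """Toy denoiser: copies from TARGET and pads with EOS past its end."""
    return [
        (TARGET[i] if i < len(TARGET) else EOS) if tok == MASK else tok
        for i, tok in enumerate(tokens)
    ]


result = generate_with_extension(toy_fill, prompt=list("to "))
print("".join(result))
```

Here the first block of 8 masks isn't enough, no EOS shows up, so a second block gets appended and the output is cut at the EOS.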
In the case that you mentioned, if we had four [MASK] tokens in a row, all we are doing for decoding is predicting what those four tokens should be.
Generally, this does not seem to be a significant problem, as there are usually multiple ways to express an idea in varying lengths. Also, confidence-aware parallel decoding usually sidesteps the scenario you mentioned: a well-trained model commits to its highest-confidence tokens first, and the remaining positions are then predicted conditioned on those.
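To make the confidence-aware part concrete, here's a toy sketch for the four-[MASK] case. Everything here is an assumption for illustration: `predict` stands in for the model, `toy_predict` fakes confidences (higher when a neighbour is already unmasked), and `per_step` is an arbitrary choice. A real model would get its confidences from the softmax over its logits.

```python
MASK = "[MASK]"


def confidence_decode(predict, tokens, per_step=2):
    """Iteratively unmask, committing only the most confident predictions.

    predict is a hypothetical stand-in for the model: given the current tokens
    it returns {position: (token, confidence)} for every masked position.
    """
    tokens = list(tokens)
    while MASK in tokens:
        guesses = predict(tokens)
        # Rank masked positions by confidence, highest first.
        ranked = sorted(guesses, key=lambda pos: guesses[pos][1], reverse=True)
        for pos in ranked[:per_step]:
            tokens[pos] = guesses[pos][0]
        # Everything else stays masked and is re-predicted on the next pass,
        # now conditioned on the tokens that were just committed.
    return tokens


def toy_predict(tokens, target="love"):
    """Toy model: always guesses from target, more confident next to known tokens."""
    guesses = {}
    for pos, tok in enumerate(tokens):
        if tok == MASK:
            neighbours = tokens[max(pos - 1, 0):pos] + tokens[pos + 1:pos + 2]
            known = sum(1 for n in neighbours if n != MASK)
            guesses[pos] = (target[pos], 0.5 + 0.25 * known)
    return guesses


decoded = confidence_decode(toy_predict, [MASK] * 4)
print("".join(decoded))
```

The point is that the model never has to squeeze "forget" into four slots: it only ever predicts one token per position, and the low-confidence positions get another chance once their neighbours are fixed.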
The real reason is simple: it was inherited.
The relu^2 activation was used in the nanogpt speedrun[1] because it produced the best empirical results, then Andrej based his nanochat on the nanogpt speedrun without changing the activation function, and then this project was based on nanochat.
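For anyone unfamiliar with the notation, relu^2 just means applying ReLU and then squaring the result. A scalar sketch in plain Python (the function name is mine, not from any of those repos):

```python
def relu_squared(x: float) -> float:
    """ReLU followed by squaring: max(0, x) ** 2."""
    return max(0.0, x) ** 2


values = [relu_squared(v) for v in (-2.0, -0.5, 0.0, 0.5, 2.0)]
print(values)
```

Negative inputs still map to zero like plain ReLU, but positive inputs grow quadratically instead of linearly.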
Students looking for a project to break into AI - please!
That’s inference code, but where is the high perf web server?
now someone needs to make it work with vllm or something