You can choose which dimensions to show, pick which embeddings to display, and play with vector maths between them in a visual way.
It doesn't show the whole set of embeddings, though I am sure someone could fix that, as well as adapt it to use the gpt-oss model instead of the custom (?) mini set it uses.
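If you want to try the same vector maths outside the tool, here's a toy sketch; the 4-dimensional vectors below are made up for illustration (real token embeddings have a few thousand dimensions):

```ts
// Toy sketch of "vector maths between embeddings": cosine similarity and the
// classic analogy-style arithmetic (a - b + c). The 4-d vectors are
// placeholders, not real model weights.

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

function cosine(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

function add(a: number[], b: number[]): number[] {
  return a.map((v, i) => v + b[i]);
}

function sub(a: number[], b: number[]): number[] {
  return a.map((v, i) => v - b[i]);
}

// Made-up embeddings keyed by token text.
const emb: Record<string, number[]> = {
  king:  [0.8, 0.1, 0.7, 0.2],
  man:   [0.7, 0.0, 0.1, 0.1],
  woman: [0.6, 0.9, 0.1, 0.1],
  queen: [0.7, 0.9, 0.7, 0.2],
};

// "king - man + woman" should land closest to "queen" in a well-behaved space.
const target = add(sub(emb.king, emb.man), emb.woman);
const ranked = Object.entries(emb)
  .map(([token, v]) => ({ token, sim: cosine(target, v) }))
  .sort((a, b) => b.sim - a.sim);

console.log(ranked);
```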
"\u00e0\u00a7\u012d\u00e0\u00a6\u013e"
with some characters > 0xff (but none above 0x0143, weirdly).https://github.com/vasturiano/3d-force-graph
a try, for the text labels you can use
https://github.com/vasturiano/three-spritetext
its based on Three.js and creates great 3D graph visualisations GPU rendered (webgl). This could make it alot more interresting to watch because it could display actual depth (your gpu is gonne run hot but i guess worth it)
just a suggestion.
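Rough sketch of how those two libraries fit together, assuming the embeddings have already been boiled down to labelled nodes and nearest-neighbour links (the data below is made up):

```ts
// Sketch: render tokens as floating text labels in a GPU-rendered (WebGL)
// 3D force layout. Node/link data here is hypothetical.
import ForceGraph3D from '3d-force-graph';
import SpriteText from 'three-spritetext';

const nodes = [
  { id: 'parameter' },
  { id: ' parameter' },
  { id: 'Parameter' },
];
const links = [
  { source: 'parameter', target: ' parameter' },
  { source: 'parameter', target: 'Parameter' },
];

// Recent versions use the class-style constructor; older releases used the
// ForceGraph3D()(<element>) factory call instead.
const graph = new ForceGraph3D(document.getElementById('graph') as HTMLElement)
  .graphData({ nodes, links })
  // Swap the default sphere for a text sprite so every point carries its label.
  .nodeThreeObject(node => {
    const sprite = new SpriteText(String(node.id));
    sprite.textHeight = 4;
    sprite.color = 'lightsteelblue';
    return sprite;
  });
```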
What is the most important problem anyone has solved this way?
Speaking as somewhat of a co-defendant.
Dimensionality reduction/clustering like this may be less useful for identifying trends in token embeddings, but for other types of embeddings it's extremely useful.
I wonder if being trained on significant amounts of synthetic data gave it any unique characteristics.
Applying the embedding model to a dataset of your own, and then building a similar visualization, is where it gets cool: you can visually inspect the clusters and draw conclusions about how close items in your own dataset are.
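Something like this, assuming you have an embedding model or endpoint to call (the `embed` callback below is a stand-in, not any particular API); the output plugs straight into a force-graph viewer:

```ts
// Sketch: turn your own dataset into graph data by linking items whose
// embeddings are most similar. The `embed` callback is a stand-in for
// whatever embedding model or API you actually use.

type Item = { id: string; text: string };
type Link = { source: string; target: string; sim: number };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function buildGraph(
  items: Item[],
  embed: (text: string) => Promise<number[]>, // your embedding model/endpoint
  threshold = 0.8,
) {
  const vectors = await Promise.all(items.map(it => embed(it.text)));

  const nodes = items.map(it => ({ id: it.id }));
  const links: Link[] = [];

  // Link every pair whose cosine similarity clears the threshold.
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      const sim = cosine(vectors[i], vectors[j]);
      if (sim >= threshold) {
        links.push({ source: items[i].id, target: items[j].id, sim });
      }
    }
  }
  return { nodes, links }; // drops straight into a force-graph's graphData(...)
}
```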
https://stock.adobe.com/images/asteroid-hitting-the-earth-ai...
Embeddings derived from autoregressive language models apply full attention mechanisms to get something entirely different.
My guess is it's the 2 largest principal components of the embedding.
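If that guess is right, the projection would look roughly like this (a from-scratch sketch, not taken from the page): centre the vectors, pull out the top two principal components with power iteration, and keep two coordinates per token.

```ts
// Sketch of projecting high-dimensional embeddings onto their two largest
// principal components. Plain power iteration with deflation; fine for a
// demo, not tuned for a full vocabulary.

function meanVector(rows: number[][]): number[] {
  const d = rows[0].length;
  const mu = new Array(d).fill(0);
  for (const r of rows) for (let i = 0; i < d; i++) mu[i] += r[i] / rows.length;
  return mu;
}

function dot(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

function normalize(v: number[]): number[] {
  const n = Math.sqrt(dot(v, v)) || 1;
  return v.map(x => x / n);
}

// Leading eigenvector of X^T X (X already centred), via power iteration.
function topComponent(X: number[][], iters = 100): number[] {
  const d = X[0].length;
  let v = normalize(Array.from({ length: d }, () => Math.random() - 0.5));
  for (let k = 0; k < iters; k++) {
    const next = new Array(d).fill(0);
    for (const row of X) {
      const p = dot(row, v);                               // X v, row by row
      for (let i = 0; i < d; i++) next[i] += p * row[i];   // X^T (X v)
    }
    v = normalize(next);
  }
  return v;
}

function pca2d(embeddings: number[][]): [number, number][] {
  const mu = meanVector(embeddings);
  const X = embeddings.map(r => r.map((x, i) => x - mu[i]));

  const pc1 = topComponent(X);
  // Deflate: remove the pc1 direction, then find the next component.
  const X2 = X.map(r => {
    const p = dot(r, pc1);
    return r.map((x, i) => x - p * pc1[i]);
  });
  const pc2 = topComponent(X2);

  // Each token gets a (pc1, pc2) coordinate pair to plot.
  return X.map(r => [dot(r, pc1), dot(r, pc2)]);
}
```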
But none of the points are labelled? There isn't a writeup on the page or anything?
That they're related or connected, or is it arbitrary?
Why does it look like a fried egg?
edit: they must be related in some way, as one of the "droplets" in the bottom-left quadrant seems to consist of various versions of the word "parameter".
The density of the clusters tends to follow trends. In this case, the "yolk" has a lot of bizarre Unicode tokens.
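One way to check that kind of hunch, with made-up coordinates standing in for the real projection: pick a centre and radius that cover the dense blob and list what falls inside.

```ts
// Sketch: list the tokens that land inside a dense region ("the yolk") of a
// 2-D embedding plot. The points and the centre/radius are hypothetical; in
// practice they would come from the projection step and from eyeballing the
// plot.

type LabelledPoint = { token: string; x: number; y: number };

function tokensInDisc(points: LabelledPoint[], cx: number, cy: number, r: number): string[] {
  return points
    .filter(p => (p.x - cx) ** 2 + (p.y - cy) ** 2 <= r ** 2)
    .map(p => p.token);
}

const points: LabelledPoint[] = [
  { token: 'parameter', x: -3.1, y: -2.8 },
  { token: ' Parameter', x: -3.0, y: -2.9 },
  { token: '\u00e0\u00a7\u012d', x: 0.1, y: 0.2 }, // byte-level Unicode debris near the centre
];

console.log(tokensInDisc(points, 0, 0, 1.5)); // what's sitting in the "yolk"?
```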