You could have designed state of the art positional encoding
212 points | 6 days ago | 15 comments | fleetwood.dev | HN
rgovostes
6 days ago
[-]
Thanks to the author for clarifying something that's been a mystery to me for a few years. The positional encoding scheme in the "Attention Is All You Need" paper is only given half a page and the construction appears to come out of nowhere.
reply
FL33TW00D
6 days ago
[-]
Thank you! Seemed like voodoo to me too, hence this post!
reply
valine
6 days ago
[-]
One of the things I really love about RoPE is that it allows for a lot of interesting encoding schemes at inference time without retraining the model. I’ve had a lot of fun playing with different relative positions. You can elicit a lot of interesting behaviors from the model when you use different rotations for keys vs. queries; they don’t always have to match.

For example, exact position doesn’t matter too much when tokens are spaced out. Say you use token position 100 for your query: you can shift all the keys around position 100, and the further back in the context they are, the more freedom you have to play with their positions.
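
A minimal sketch of the kind of experiment described above (a generic RoPE rotation; the position values are made up for illustration, not the exact setup used here):

    import torch

    def rope(x, positions, base=10000.0):
        # x: (seq, dim), positions: (seq,); rotate each 2D pair by pos * theta_i
        d = x.shape[-1]
        theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
        angles = positions[:, None].float() * theta[None, :]    # (seq, d/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    q, k = torch.randn(6, 64), torch.randn(6, 64)
    q_rot = rope(q, torch.full((6,), 100))                    # every query treated as position 100
    k_rot = rope(k, torch.tensor([40, 60, 80, 98, 99, 100]))  # key positions shifted independently
    scores = q_rot @ k_rot.T   # the attention logits see whatever relative offsets you chose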

reply
zackangelo
5 days ago
[-]
I'm surprised this is the case! I've been working on a RoPE implementation for my own project (I needed to account for padding in unique situations), and even an off-by-one error usually causes the model to produce nonsensical output.
reply
valine
5 days ago
[-]
You have to be careful to keep the relative positions for adjacent and nearby tokens intact. The relative positions of distant tokens are less brittle.
reply
bhickey
5 days ago
[-]
Can you describe the behaviors that you can elicit with this technique?
reply
valine
5 days ago
[-]
One strategy I’ve been playing around with is to take an instruction I want the model to follow, squish the positional encodings for its keys down to position zero, and place the new queries slightly further out in the window. The model will still follow the instruction, but the behavior is more global. It behaves more like a fine-tune and less like the instruction is part of the conversation.
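
Roughly, the remapping looks something like this (the layout and positions are purely illustrative):

    import torch

    # Hypothetical layout: tokens 0-3 are the instruction, the rest is conversation.
    query_pos = torch.tensor([100])            # the new query
    key_pos = torch.arange(101)
    key_pos[:4] = 0                            # "squish" the instruction keys to position zero
    rel_offsets = query_pos - key_pos          # the relative offsets a rotary dot product sees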
reply
bhickey
3 days ago
[-]
> squish the positional encodings for the keys down to position zero

I might be misunderstanding, but wouldn't this turn your instructions into a bag of words?

reply
valine
2 days ago
[-]
No, and that’s because we are talking about relative positions. Every query can have its own set of keys. From the perspective of token 100, token 3 would be squished down, but from the perspective of token 3 it is still at position 3 and can see tokens 0, 1, and 2 without them being squished.
reply
espadrine
6 days ago
[-]
> Furthermore, by rotating the vector, we have absolutely zero impact on the norm of the vector, which encodes the semantic information of our token.

Doesn’t the angle encode semantic information? Cosine similarity works for embeddings after all.

reply
elieb44
6 days ago
[-]
How about context encoding more generally? Are there techniques to do that? I.e., during training, I want the string "Dubito ergo cogito, cogito ergo sum, sum ergo Deus est." to have René Descartes embedded as its main author, 1637 as the year of writing, and "Discours de la méthode" as the global context of writing.

So that when trained on another part of the same book, the model can learn they came from the same context.

reply
jmmcd
5 days ago
[-]
This is a good idea! The answer, to my knowledge, is that no one does this; we just use the simplest, stupidest possible method, which is to concatenate all the text in the world. That is during training, of course. At runtime, there is the system prompt.

The second simplest method might indeed use something like a system prompt with metadata like that, injected before the current window of text. But what would happen at runtime, when that metadata is not present? Probably performance would be much worse.

reply
alok-g
5 days ago
[-]
On a related note, one thing I still do not understand is why are positional encodings 'added' to the token embeddings as opposed to (having a smaller position encoding vector that is) 'concatenated'. It would be great if someone could explain.
reply
d3m0t3p
5 days ago
[-]
Increasing the dimension causes a lot more computation; this is one of the main reasons. You can see evidence of this in multi-head attention, where the dimension is reduced via a linear projection:

    h_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

    h = Concat(h_1, ..., h_8) W^O
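
A rough sketch of the size trade-off (the dimensions here are made up): summing keeps every projection at d_model, while concatenating a positional block grows every weight matrix that reads the residual stream.

    import torch

    d_model, d_pos, seq = 512, 64, 128
    tok = torch.randn(seq, d_model)
    pos = torch.randn(seq, d_model)

    added = tok + pos                                        # still (seq, 512)
    concat = torch.cat([tok, torch.randn(seq, d_pos)], -1)   # (seq, 576)

    W_q_add = torch.nn.Linear(d_model, d_model)              # 512 x 512 weights
    W_q_cat = torch.nn.Linear(d_model + d_pos, d_model)      # 576 x 512: every Q/K/V projection grows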

reply
bfelbo
5 days ago
[-]
How many dimensions would you need to increase by to capture positional information?

Seems to me like it’d be a quite low number compared to the dimensionality of the semantic vectors?

reply
jcims
6 days ago
[-]
I'm effectively a complete layman in this (although I do see some parallels to physical positional encoders, which is interesting) so at first read this entire thing went WAAAAY over my head. At first glance it seemed to be way overcomplicated just to encode position, so I figured I was missing something. ChatGPT was super helpful in explaining spiking neural networks to me so I just spent 20 minutes asking ChatGPT to explain this to me and I feel like I actually learned something.

Then at the end I asked ChatGPT how this all relates to how it operates and it was interesting to see things like:

>Tokens as Subword Units: I use a tokenization method called Byte Pair Encoding (BPE), which breaks text into subword units.

I don't know if it's accurate or not, but it's wild seeing it talk about how it works.

reply
gloflo
6 days ago
[-]
The context includes that "it" is ChatGPT, and the fact that ChatGPT uses Byte Pair Encoding is widely published. It is to be expected that an LLM can regurgitate this kind of information; nothing wild about that.
reply
astrange
6 days ago
[-]
Note if you don't have a good system prompt, other LLMs will also tell you they're ChatGPT or Claude.
reply
im3w1l
5 days ago
[-]
That's kind of interesting. Like they will know they are an AI? Just not which one?
reply
astrange
5 days ago
[-]
I think it's because they've been trained by copying answers from ChatGPT. They're not really very copyrighted after all.

Though the other day I saw someone demonstrate this with Google's Gemini through the API, so maybe it is just picking up conversation traces off the internet.

reply
throwaway314155
5 days ago
[-]
You think Google is above stealing outputs from OpenAI?
reply
astrange
5 days ago
[-]
I think they know how to search and replace.
reply
refulgentis
6 days ago
[-]
100% accurate
reply
throwawaymaths
6 days ago
[-]
Maybe someone could answer this for me: it seems like encoding the positional embeddings as augmentations to the "natural" activations, instead of as their own inputs (concatenated onto the activations), makes things like sliding a window much harder... I guess the obvious drawback is that you have somewhat less textually derived information.

I recall an early transformers video where they tried both, and it turned out that adding the position onto the existing vectors was no worse, so they went with it... No further discussion about motivations happened in that video.

Is it worth revisiting that maybe now that activations have a gobsmackingly large dimension?

reply
stephantul
6 days ago
[-]
They are not concatenated, but summed. I think concatenation wouldn’t work, as you indicate.

I think you mean the line in the original paper where they say they compared learned positional embeddings with the predefined encoding, and it made no difference.

reply
throwawaymaths
6 days ago
[-]
> I think concatenation wouldn’t work, as you indicate.

Why do you say that?

reply
donkeyboy
6 days ago
[-]
Concat could work too, although it's less efficient because you need to make a new tensor.

Actually, summing might learn a concat on its own. Imagine the embedding learned for a token takes up the first N-20 dimensions and leaves the last 20 dimensions as 0, while the positional encoding keeps the first N-20 dims at 0 and uses the last 20 to encode the information. Then when you sum, you are actually concatenating. So I think of them as equivalent, except that add is more efficient and preserves the dim space, while concat would grow the dim space. And for something like position, which certainly does not need to occupy 1000+ dimensions, concatenating all of that would be wasteful.
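
A toy check of that claim: if the token embedding only uses the first dimensions and the positional encoding only uses the last ones, the sum literally is a concatenation.

    import torch

    d, d_pos = 8, 2
    tok = torch.cat([torch.randn(d - d_pos), torch.zeros(d_pos)])  # last dims left empty
    pos = torch.cat([torch.zeros(d - d_pos), torch.randn(d_pos)])  # first dims left empty
    summed = tok + pos
    concat = torch.cat([tok[:d - d_pos], pos[d - d_pos:]])
    print(torch.allclose(summed, concat))  # True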

reply
throwawaymaths
5 days ago
[-]
why would you need to make a new tensor?

Suppose you had 4096-dimensional activations (Llama 2 sized). Maybe you make do with 3072 activations and concatenate 1024 positional activations onto that.

Then you pass that to Mk, Mq, Mv and generate K, Q, V.

The only thing that would change would be the Mff-out, which would now be a (big)x3072 matrix instead of (big)x4096.

In any case you would be retraining, so changing the dims of the tensors is not a big deal, I think... In fact, in this case they would be smaller (at the cost of fewer interlayer activations), but you would have the same number of tensors.

> Actually summing might learn a concat on its own.

But you see the point? You're forcing the model to learn something that maybe it didn't need to. That's like saying "well a fully connected network might learn convolution on its own". Historically breakthroughs in capability have accompanied one of: [more data | more layers | smarter constraints on activations]

Unless you have some sort of argument that forcing it to learn position has carryover value in generating activations, it seems, naively, a bad idea.

reply
imjonse
6 days ago
[-]
I don't think the first code example should work (it indeed says false here).

When given a permuted sequence, the attention output will also be permuted, not identical. The need for positional encodings is due to two tokens resulting in the same value in the final attention matrix regardless of the tokens' absolute and relative position; that is enough to miss a lot of meaning.

reply
aconz2
5 days ago
[-]
To add on, since this took me a while to understand: for a single token, self-attention is permutation invariant because we take the qK-weighted sum (one query dotted with all the other keys) over all the values; that sum is what gives the invariance, because + is commutative. But across all the tokens, the MHA output matrix will not be invariant, but rather equivariant: you apply the same permutation to the output matrix as you did to the input tokens. A more useful example might be to take one position, like the last one, and compute its MHA output for every permutation of the previous tokens; those should all be the same.
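
A quick way to see the equivariance with untrained weights and no positional encoding (permuting the inputs permutes the outputs the same way):

    import torch
    torch.manual_seed(0)

    mha = torch.nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
    x = torch.randn(1, 6, 16)
    perm = torch.tensor([3, 1, 4, 0, 5, 2])

    out, _ = mha(x, x, x)
    out_p, _ = mha(x[:, perm], x[:, perm], x[:, perm])
    print(torch.allclose(out[:, perm], out_p, atol=1e-6))  # True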
reply
FL33TW00D
6 days ago
[-]
The first code example says False because of high precision, I've updated the example.
reply
jmmcd
6 days ago
[-]
But u/imjonse's reasoning seems right. I haven't run either version of the code, but when reading it I expected that to be False. The output is still a list with an order.

the dog chased the cat: position 1 in the output is attention(dog, everything)

the cat chased the dog: position 1 in the output is attention(cat, everything)

reply
FL33TW00D
6 days ago
[-]
Run the code and look at the values!
reply
jmmcd
5 days ago
[-]
Well, yes, I deserved that reply! And yes the code is printing True. It's not that I disbelieved you... but something is wrong here. Investigation below, thanks to Claude.ai for walking me through it!

    In [10]: o1[0, :, :3]
    Out[10]:
    tensor([[ 0.0053,  0.0017, -0.0012],
        [ 0.0053,  0.0017, -0.0012],
        [ 0.0053,  0.0017, -0.0012],
        [ 0.0053,  0.0017, -0.0012],
        [ 0.0053,  0.0017, -0.0012],
        [ 0.0053,  0.0017, -0.0012]],       grad_fn=<SliceBackward0>)
Every token has the same attention values. I expect attention(cat, everything) to differ from attention(dog, everything), even without positional encoding.

Further, the attention weights are uniform and identical for both sentences:

    In [46]: o1, aw1 = mha(W_q(e1), W_k(e1), W_v(e1))
    In [47]: o2, aw2 = mha(W_q(e2), W_k(e2), W_v(e2))
    In [48]: aw1.shape
    Out[48]: torch.Size([1, 6, 6])
    In [49]: aw2.shape
    Out[49]: torch.Size([1, 6, 6])
    In [50]: aw1
    Out[50]:
    tensor([[[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]]],
       grad_fn=<MeanBackward1>)

    In [51]: aw2
    Out[51]:
    tensor([[[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]]],
       grad_fn=<MeanBackward1>)
That is not expected. It's because the Linear layers are initialised with such small values. And the softmax causes a collapse.

Trying random weights on a larger scale:

    In [52]: W_q.weight.data *= 100
         W_k.weight.data *= 100
         W_v.weight.data *= 100

    In [55]: o1, aw1 = mha(W_q(e1), W_k(e1), W_v(e1))
    In [56]: o2, aw2 = mha(W_q(e2), W_k(e2), W_v(e2))
    In [57]: aw1
    Out[57]:
    tensor([[[0.2049, 0.1606, 0.1256, 0.1095, 0.1723, 0.2270],
         [0.0883, 0.2047, 0.1544, 0.2776, 0.1405, 0.1345],
         [0.1196, 0.1719, 0.1831, 0.1541, 0.1374, 0.2339],
         [0.1413, 0.2399, 0.1617, 0.2056, 0.1634, 0.0880],
         [0.1455, 0.1432, 0.2432, 0.1239, 0.1494, 0.1948],
         [0.1897, 0.1817, 0.1920, 0.1478, 0.1618, 0.1270]]],
       grad_fn=<MeanBackward1>)

    In [58]: aw2
    Out[58]:
    tensor([[[0.2049, 0.1606, 0.2270, 0.1095, 0.1723, 0.1256],
         [0.0883, 0.2047, 0.1345, 0.2776, 0.1405, 0.1544],
         [0.1897, 0.1817, 0.1270, 0.1478, 0.1618, 0.1920],
         [0.1413, 0.2399, 0.0880, 0.2056, 0.1634, 0.1617],
         [0.1455, 0.1432, 0.1948, 0.1239, 0.1494, 0.2432],
         [0.1196, 0.1719, 0.2339, 0.1541, 0.1374, 0.1831]]],
       grad_fn=<MeanBackward1>)

    In [60]: o1[:, :, :5]
    Out[60]:
    tensor([[[ 0.0145,  0.3128, -0.3659, -0.1884,  0.1724],
         [-0.2319,  0.1407, -0.6010, -0.4064,  0.4259],
         [-0.3231,  0.1622, -0.6351, -0.1711,  0.4014],
         [-0.0596,  0.2610, -0.7388, -0.2987,  0.3214],
         [-0.2750,  0.0676, -0.4140, -0.2024,  0.3383],
         [-0.1434,  0.0871, -0.3154, -0.0755,  0.3314]]],
       grad_fn=<SliceBackward0>)

    In [61]: o2[:, :, :5]
    Out[61]:
    tensor([[[ 0.0145,  0.3128, -0.3659, -0.1884,  0.1724],
         [-0.2319,  0.1407, -0.6010, -0.4064,  0.4259],
         [-0.1434,  0.0871, -0.3154, -0.0755,  0.3314],
         [-0.0596,  0.2610, -0.7388, -0.2987,  0.3214],
         [-0.2750,  0.0676, -0.4140, -0.2024,  0.3383],
         [-0.3231,  0.1622, -0.6351, -0.1711,  0.4014]]],
       grad_fn=<SliceBackward0>)

    In [62]: print("Matches: ", torch.allclose(o1, o2, atol=1e-6))
    Matches:  False
reply
FL33TW00D
5 days ago
[-]
Hm! Very interesting! Thank you for taking the time to debug that.

I'm going to have to think hard about how to rewrite the motivating example to explain this best.

Edit: updated the post, thanks for pointing out the pernicious init values!

reply
breadislove
5 days ago
[-]
There is this really interesting blog post by the main author of the RoPE paper about making RoPE multimodal, as used by Qwen2-VL. It's in Chinese, but Google Translate does a pretty good job: https://spaces.ac.cn/archives/10040
reply
1024core
5 days ago
[-]
I didn't get the sudden leap from "position encodings" to "QKV" magic.

What is the connection between the two? Where does "Q" come from? What are "K" and "V"? (I know they stand for "Query", "Key", "Value"; but what do they have to do with position embeddings?)

reply
flebron
5 days ago
[-]
All of them are vectors of embedded representations of tokens. In a transformer, you want to compute the inner product between a query (the token who is doing the attending) and the key (the token who is being attended to). An inductive bias we have is that the neural network's performance will be better if this inner product depends on the relative distance between the query token's position, and the key token's position. We thus encode each one with positional information, in such a way that (for RoPE at least) the inner product depends only on the distance between these tokens, and not their absolute positions in the input sentence.
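
A small numerical check of that property (a simplified RoPE, not the article's exact code): shifting both positions by the same amount leaves the dot product unchanged.

    import torch

    def rope(x, pos, base=10000.0):
        d = x.shape[-1]
        theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
        ang = pos * theta
        x1, x2 = x[0::2], x[1::2]
        return torch.cat([x1 * ang.cos() - x2 * ang.sin(),
                          x1 * ang.sin() + x2 * ang.cos()])

    q, k = torch.randn(64), torch.randn(64)
    a = rope(q, 10.0) @ rope(k, 7.0)        # positions (10, 7):    offset 3
    b = rope(q, 110.0) @ rope(k, 107.0)     # positions (110, 107): offset 3
    print(torch.allclose(a, b, atol=1e-4))  # True: only the offset matters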
reply
FL33TW00D
5 days ago
[-]
"This post intends to limit the mathematical knowledge required to follow along, but some basic linear algebra, trigonometry and understanding of self attention is expected."

If you're not sure on self attention, the post will be a little unclear

reply
Scene_Cast2
6 days ago
[-]
If you're interested in positional embeddings for Transformers, check out this repo - https://github.com/gazelle93/Attention-Various-Positional-En... - it implements various popular ones.
reply
Der_Einzige
6 days ago
[-]
Similarly, "you" could have designed state of the art LLM sampling: https://openreview.net/forum?id=FBkpCyujtS&referrer=%5BTasks...
reply
cperciva
6 days ago
[-]
The binary coding example would have been much better with Gray codes.
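
For reference, the reflected binary Gray code is just n ^ (n >> 1); consecutive integers then differ in exactly one bit, which is presumably the appeal over plain binary:

    def gray(n: int) -> int:
        return n ^ (n >> 1)

    for i in range(8):
        print(i, format(gray(i), "03b"))
    # 0 000, 1 001, 2 011, 3 010, 4 110, 5 111, 6 101, 7 100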
reply
logicchains
6 days ago
[-]
Does anyone know why 2D rope implementations apply two separate 1D rotations to pairs, instead of applying a 2d rotation to triplets?
reply
rini17
5 days ago
[-]
No, they apply many rotations, one for each pair of dimensions of the embedding space.
reply
JP_DW
5 days ago
[-]
The illustrations are really good to look at. I couldn't find them on your GitHub; are they created with manim?
reply