For example, exact position doesn’t matter too much when tokens are spaced out. Say you use token position 100 for your query: you can shift all the keys around position 100, and the further back they are in the context, the more freedom you have to play with the value.
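A rough way to see that (my own illustration, using the sinusoidal encoding from the original transformer paper as a stand-in for whatever encoding is actually in play): the similarity between the encoding at position 100 and the encodings of nearby positions changes only gradually, so shifting a key by a few positions barely moves a dot-product-based score.

import numpy as np

def sinusoidal(pos, d=128, base=10000.0):
    # Classic sin/cos positional encoding for a single position.
    freqs = base ** (-2 * np.arange(d // 2) / d)
    return np.concatenate([np.sin(pos * freqs), np.cos(pos * freqs)])

q = sinusoidal(100)
for shift in [0, 1, 2, 5, 10, 50]:
    k = sinusoidal(100 - shift)
    sim = q @ k / (np.linalg.norm(q) * np.linalg.norm(k))
    print(f"key shifted back by {shift:2d}: cosine similarity {sim:.3f}")
# Similarity is 1.0 at shift 0 and falls off only gradually as the shift grows.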
I might be misunderstanding, but wouldn't this turn your instructions into a bag of words?
Doesn’t the angle encode semantic information? Cosine similarity works for embeddings after all.
So that when trained against another part of the same book, the model can learn they were from the same context.
The second simplest method might indeed use something like a system prompt with metadata like that, injected before the current window of text. But what would happen at runtime, when that metadata is not present? Probably performance would be much worse.
h_i = attention((W_i^Q Q)^T @ (W_i^K K)) (W_i^V V)
h = W_o @ concat(h_1, ..., h_8)
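In case it helps to see those two lines as code, here is a minimal sketch with made-up sizes (8 heads, toy dimensions, random tensors standing in for the learned W matrices), treating Q, K, V as (d_model, seq_len) matrices of column vectors to match the W-on-the-left ordering above:

import torch
import torch.nn.functional as F

d_model, n_heads, seq_len = 64, 8, 6
d_head = d_model // n_heads

# Random placeholders for the inputs and the learned projections.
Q = K = V = torch.randn(d_model, seq_len)
W_q = [torch.randn(d_head, d_model) for _ in range(n_heads)]
W_k = [torch.randn(d_head, d_model) for _ in range(n_heads)]
W_v = [torch.randn(d_head, d_model) for _ in range(n_heads)]
W_o = torch.randn(d_model, n_heads * d_head)

heads = []
for i in range(n_heads):
    q_i, k_i, v_i = W_q[i] @ Q, W_k[i] @ K, W_v[i] @ V   # each (d_head, seq_len)
    scores = q_i.T @ k_i / d_head ** 0.5                 # (seq_len, seq_len)
    attn = F.softmax(scores, dim=-1)                     # the "attention(...)" part
    heads.append(v_i @ attn.T)                           # h_i: (d_head, seq_len)

h = W_o @ torch.cat(heads, dim=0)                        # concat over heads, then W_o
print(h.shape)                                           # torch.Size([64, 6])

The concat-then-project at the end is what keeps each head small (d_model / n_heads dims) while the combined output stays at d_model.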
Seems to me like it’d be quite a low number compared to the dimensionality of the semantic vectors?
Then at the end I asked ChatGPT how this all relates to how it operates and it was interesting to see things like:
>Tokens as Subword Units: I use a tokenization method called Byte Pair Encoding (BPE), which breaks text into subword units.
I don't know if it's accurate or not, but it's wild seeing it talk about how it works.
Though the other day I saw someone demonstrate this with Google's Gemini through the API, so maybe it is just picking up conversation traces off the internet.
I recall an early transformers video where they tried both, and it turned out that adding the position onto the existing vectors was no worse, so they went with it... There was no further discussion of the motivation in that video.
Is it maybe worth revisiting that now that activations have a gobsmackingly large dimension?
I think you mean the line in the original paper where they say they compared learned positional embeddings with the predefined encoding, and it made no difference.
Why do you say that?
Actually, summing might learn a concat on its own. Imagine the embedding learned for a token takes up the first N-20 dimensions and leaves the last 20 dimensions as 0, and the positional encoding leaves the first N-20 dims as 0 and uses the last 20 to encode the position. Then when you sum, you are actually concatenating. So I think of them as equivalent, except that add is more efficient and preserves the dim space, while concat would grow the dim space. And for something like position, which certainly does not need to occupy 1000+ dimensions, it would not make sense to concat all of that, since it would be wasteful.
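To make that thought experiment concrete (toy sizes, nothing from a real model): if the token embedding and the positional encoding live in disjoint sets of dimensions, the sum literally is the concatenation.

import torch

d, pos_dims = 32, 4   # toy sizes: 4 dims "reserved" for position

tok = torch.zeros(d)
tok[: d - pos_dims] = torch.randn(d - pos_dims)   # token info in the first d-4 dims

pos = torch.zeros(d)
pos[d - pos_dims:] = torch.randn(pos_dims)        # position info in the last 4 dims

summed = tok + pos
concatenated = torch.cat([tok[: d - pos_dims], pos[d - pos_dims:]])

print(torch.equal(summed, concatenated))  # True: with disjoint supports, add == concat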
Suppose you had 4096-dimensional (llama-2 sized) activations. Maybe you make do with 3072 activations and concatenate 1024 positional activations onto that.
Then you pass that to Mk Mq Mv and generate K, Q, V.
The only thing that would change would be the Mff-out, which would now be a (big)x3072 matrix instead of (big)x4096.
In any case you would be retraining, so changing the dims of the tensors is, I think, not a big deal... In fact, in this case they would be smaller (at the cost of fewer interlayer activations), but you would have the same number of tensors.
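A sketch of what that would look like, reusing the Mq/Mk/Mv and Mff-out naming from above (the 3072/1024 split, the FFN width, and the module shapes are illustrative assumptions, not llama-2's actual code):

import torch
import torch.nn as nn

d_tok, d_pos = 3072, 1024          # assumed split of the 4096-dim residual stream
d_model = d_tok + d_pos            # 4096, llama-2 sized
seq_len = 8

tok_act = torch.randn(seq_len, d_tok)   # per-token "semantic" activations
pos_act = torch.randn(seq_len, d_pos)   # per-position activations (whatever scheme you like)

x = torch.cat([tok_act, pos_act], dim=-1)   # (seq_len, 4096): concat instead of sum

# Mq / Mk / Mv still see the full 4096-dim vector, exactly as in the summed case.
Mq = nn.Linear(d_model, d_model, bias=False)
Mk = nn.Linear(d_model, d_model, bias=False)
Mv = nn.Linear(d_model, d_model, bias=False)
Q, K, V = Mq(x), Mk(x), Mv(x)

# Only the matrix writing back into the residual stream (the "Mff-out" above)
# shrinks: its output is d_tok = 3072 instead of 4096.
d_ff = 11008                                 # a llama-2-ish FFN width, for illustration
Mff_out = nn.Linear(d_ff, d_tok, bias=False)

print(Q.shape, Mff_out.weight.shape)         # torch.Size([8, 4096]) torch.Size([3072, 11008])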
> Actually summing might learn a concat on its own.
But you see the point? You're forcing the model to learn something that maybe it didn't need to. That's like saying "well a fully connected network might learn convolution on its own". Historically breakthroughs in capability have accompanied one of: [more data | more layers | smarter constraints on activations]
Unless you have some sort of argument that forcing it to learn position has carryover value in generating activations, it seems, naively, a bad idea.
When given a permuted sequence, the attention output will also be permuted, not identical. The need for positional encodings comes from the fact that a given pair of tokens produces the same attention value regardless of their absolute or relative positions; that alone is enough to miss a lot of meaning.
the dog chased the cat: position 1 in the output is attention(dog, everything)
the cat chased the dog: position 1 in the output is attention(cat, everything)
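(For anyone wanting to reproduce the session below: the names e1, e2, W_q, W_k, W_v and mha aren't defined in the transcript, so here's a guessed setup. The dimensions, seed, tokenisation and small embedding init are my assumptions; the exact numbers won't match the transcript, but the qualitative behaviour should.)

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64   # assumed; the session's actual size isn't shown

# Six tokens per sentence, to match the 6x6 attention weights below.
vocab = {"the": 0, "dog": 1, "chased": 2, "cat": 3, ".": 4}
s1 = torch.tensor([[vocab[t] for t in "the dog chased the cat .".split()]])
s2 = torch.tensor([[vocab[t] for t in "the cat chased the dog .".split()]])

emb = nn.Embedding(len(vocab), d_model)
nn.init.normal_(emb.weight, std=0.02)   # small init, as pretrained embedding tables typically have
e1, e2 = emb(s1), emb(s2)               # note: no positional encoding added

W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
mha = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

# The transcript recomputes these at In [46]/[47].
o1, aw1 = mha(W_q(e1), W_k(e1), W_v(e1))
o2, aw2 = mha(W_q(e2), W_k(e2), W_v(e2))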
In [10]: o1[0, :, :3]
Out[10]:
tensor([[ 0.0053, 0.0017, -0.0012],
[ 0.0053, 0.0017, -0.0012],
[ 0.0053, 0.0017, -0.0012],
[ 0.0053, 0.0017, -0.0012],
[ 0.0053, 0.0017, -0.0012],
[ 0.0053, 0.0017, -0.0012]], grad_fn=<SliceBackward0>)
Every token has the same attention values. I expect attention(cat, everything) to differ from attention(dog, everything), even without positional encoding. Further, the attention weights are uniform and identical for both sentences:
In [46]: o1, aw1 = mha(W_q(e1), W_k(e1), W_v(e1))
In [47]: o2, aw2 = mha(W_q(e2), W_k(e2), W_v(e2))
In [48]: aw1.shape
Out[48]: torch.Size([1, 6, 6])
In [49]: aw2.shape
Out[49]: torch.Size([1, 6, 6])
In [50]: aw1
Out[50]:
tensor([[[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]]],
grad_fn=<MeanBackward1>)
In [51]: aw2
Out[51]:
tensor([[[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]]],
grad_fn=<MeanBackward1>)
That is not expected. It's because the Linear layers are initialised with such small values: every logit going into the softmax is close to zero, so the attention weights collapse to the uniform 1/6 ≈ 0.1667 seen above.
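(A tiny, separate illustration of that collapse, with made-up logits rather than anything from this session:)

import torch

tiny = torch.randn(6) * 1e-3   # the scale of logits you get from small-init projections
big = tiny * 1000              # the same pattern of scores, just larger

print(torch.softmax(tiny, dim=0))   # ~[0.1667, 0.1667, ...]: differences are wiped out
print(torch.softmax(big, dim=0))    # clearly non-uniform: the same differences survive

Trying the random weights at a larger scale: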
In [52]: W_q.weight.data *= 100
In [53]: W_k.weight.data *= 100
In [54]: W_v.weight.data *= 100
In [55]: o1, aw1 = mha(W_q(e1), W_k(e1), W_v(e1))
In [56]: o2, aw2 = mha(W_q(e2), W_k(e2), W_v(e2))
In [57]: aw1
Out[57]:
tensor([[[0.2049, 0.1606, 0.1256, 0.1095, 0.1723, 0.2270],
[0.0883, 0.2047, 0.1544, 0.2776, 0.1405, 0.1345],
[0.1196, 0.1719, 0.1831, 0.1541, 0.1374, 0.2339],
[0.1413, 0.2399, 0.1617, 0.2056, 0.1634, 0.0880],
[0.1455, 0.1432, 0.2432, 0.1239, 0.1494, 0.1948],
[0.1897, 0.1817, 0.1920, 0.1478, 0.1618, 0.1270]]],
grad_fn=<MeanBackward1>)
In [58]: aw2
Out[58]:
tensor([[[0.2049, 0.1606, 0.2270, 0.1095, 0.1723, 0.1256],
[0.0883, 0.2047, 0.1345, 0.2776, 0.1405, 0.1544],
[0.1897, 0.1817, 0.1270, 0.1478, 0.1618, 0.1920],
[0.1413, 0.2399, 0.0880, 0.2056, 0.1634, 0.1617],
[0.1455, 0.1432, 0.1948, 0.1239, 0.1494, 0.2432],
[0.1196, 0.1719, 0.2339, 0.1541, 0.1374, 0.1831]]],
grad_fn=<MeanBackward1>)
In [60]: o1[:, :, :5]
Out[60]:
tensor([[[ 0.0145, 0.3128, -0.3659, -0.1884, 0.1724],
[-0.2319, 0.1407, -0.6010, -0.4064, 0.4259],
[-0.3231, 0.1622, -0.6351, -0.1711, 0.4014],
[-0.0596, 0.2610, -0.7388, -0.2987, 0.3214],
[-0.2750, 0.0676, -0.4140, -0.2024, 0.3383],
[-0.1434, 0.0871, -0.3154, -0.0755, 0.3314]]],
grad_fn=<SliceBackward0>)
In [61]: o2[:, :, :5]
Out[61]:
tensor([[[ 0.0145, 0.3128, -0.3659, -0.1884, 0.1724],
[-0.2319, 0.1407, -0.6010, -0.4064, 0.4259],
[-0.1434, 0.0871, -0.3154, -0.0755, 0.3314],
[-0.0596, 0.2610, -0.7388, -0.2987, 0.3214],
[-0.2750, 0.0676, -0.4140, -0.2024, 0.3383],
[-0.3231, 0.1622, -0.6351, -0.1711, 0.4014]]],
grad_fn=<SliceBackward0>)
In [62]: print("Matches: ", torch.allclose(o1, o2, atol=1e-6))
Matches: False
I'm going to have to think hard about how to rewrite the motivating example to explain this best.
Edit: updated the post, thanks for pointing out the pernicious init values!
What is the connection between the two? Where does "Q" come from? What are "K" and "V"? (I know they stand for "Query", "Key", "Value"; but what do they have to do with position embeddings?)
If you're not sure about self-attention, the post will be a little unclear.