For example, I have input with shape (1, 1000, 10) (so src.shape will be (1, 1000, 10)), which means the sequence length is 1000 and the embedding dimension is 10. Then:
The layer runs with seemingly arbitrary num_heads and key_dim. For example, with num_heads=20 and key_dim=9:

import tensorflow as tf

class Model(tf.keras.Model):
    def __init__(self):
        super(Model, self).__init__()
        self.attention1 = tf.keras.layers.MultiHeadAttention(num_heads=20, key_dim=9)
        self.dense = tf.keras.layers.Dense(10, activation="softmax")

    def call(self, src):
        output = self.attention1(src, src)       # (1, 1000, 10)
        output = tf.reshape(output, [1, 10000])  # flatten to (1, 1000 * 10)
        output = self.dense(output)
        return output
Or with num_heads=123 and key_dim=17:

class Model(tf.keras.Model):
    def __init__(self):
        super(Model, self).__init__()
        self.attention1 = tf.keras.layers.MultiHeadAttention(num_heads=123, key_dim=17)
        self.dense = tf.keras.layers.Dense(10, activation="softmax")

    def call(self, src):
        output = self.attention1(src, src)       # still (1, 1000, 10)
        output = tf.reshape(output, [1, 10000])
        output = self.dense(output)
        return output
So this layer works with seemingly arbitrary num_heads and key_dim, which does not match the idea in the paper. (By "works" I mean it raises no error and is able to train.)
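As a quick, illustrative sanity check (the random data and training settings below are mine, not from any documentation), the "mismatched" model builds, trains and predicts without complaint:

import numpy as np

model = Model()  # whichever of the two definitions above is in scope
x = np.random.rand(1, 1000, 10).astype("float32")  # batch of 1, seq_len 1000, dim 10
y = np.zeros((1, 10), dtype="float32")
y[0, 3] = 1.0  # dummy one-hot target

model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(x, y, epochs=1, verbose=0)
print(model(x).shape)  # (1, 10)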
In the paper 'Attention Is All You Need', d_k is the dimension of the key for each head, not the original model dimension, and thus key_dim should equal embed_dim / num_heads. So if embed_dim is 10 and we want num_heads to be 5, key_dim has to be 2.
[screenshot from the paper]
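To make the convention concrete, here is a small illustration of my own (not part of the question's code); for reference, the paper's base model uses d_model = 512 and h = 8, so d_k = d_v = 64 there:

import tensorflow as tf

embed_dim = 10
num_heads = 5
key_dim = embed_dim // num_heads  # = 2, following d_k = d_model / h from the paper

# "paper-style" layer for the shapes in my example above
paper_style = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)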
Also, the Keras MultiHeadAttention class description says key_dim is the "Size of each attention head for query and key", which matches the idea from the paper.
[screenshot from the class description]
So why is tf.keras.layers.MultiHeadAttention able to take mismatched dimensions? And when it does, how does it work internally with the extra weight parameters?
There are two separate dimensions here, d_k and d_v. In their paper, Vaswani et al. set d_k = d_v = d_model / h, but that is a convention (chosen to keep the computational cost comparable to single-head attention with full dimensionality), not a requirement. The queries, keys and values are produced by learned projection matrices of shapes (d_model, d_k), (d_model, d_k) and (d_model, d_v), and the concatenated head outputs are mapped back by an output projection of shape (h * d_v, d_model), so the output shape never depends on the choice of num_heads, key_dim or value_dim; an "unmatched" choice simply changes the number of parameters in those projections.
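A minimal sketch of how to check this (my own code, using the shapes from the question): call the layer once so Keras builds it, then inspect the weights it created.

import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=20, key_dim=9)
src = tf.random.normal((1, 1000, 10))
out = mha(src, src)  # calling the layer builds its weights
print(out.shape)     # (1, 1000, 10) -- always the same shape as the query

for w in mha.weights:
    print(w.name, w.shape)
# As far as I can tell from the Keras implementation, this shows kernels of
# shape (10, 20, 9) for the query, key and value projections (each of the 20
# heads gets its own 10 -> 9 projection) and (20, 9, 10) for the output
# projection, which maps the 20 * 9 concatenated head outputs back to the
# query dimension of 10. That output projection is why any num_heads/key_dim
# combination still produces an output of the query's shape; a "mismatched"
# choice only changes how many parameters the projections contain.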