For example, I have input with shape (1, 1000, 10) (so src.shape will be (1, 1000, 10)), which means the sequence length is 1000 and the embedding dimension is 10. Then:
The layer runs with seemingly arbitrary num_heads and key_dim. For example, with num_heads=20 and key_dim=9:

import tensorflow as tf

class Model(tf.keras.Model):
    def __init__(self):
        super(Model, self).__init__()
        self.attention1 = tf.keras.layers.MultiHeadAttention(num_heads=20, key_dim=9)
        self.dense = tf.keras.layers.Dense(10, activation="softmax")

    def call(self, src):
        output = self.attention1(src, src)       # (1, 1000, 10)
        output = tf.reshape(output, [1, 10000])  # flatten to (1, 1000 * 10)
        output = self.dense(output)
        return output
Or with num_heads=123 and key_dim=17:

class Model(tf.keras.Model):
    def __init__(self):
        super(Model, self).__init__()
        self.attention1 = tf.keras.layers.MultiHeadAttention(num_heads=123, key_dim=17)
        self.dense = tf.keras.layers.Dense(10, activation="softmax")

    def call(self, src):
        output = self.attention1(src, src)       # still (1, 1000, 10)
        output = tf.reshape(output, [1, 10000])
        output = self.dense(output)
        return output
So this layer works with seemingly arbitrary num_heads and key_dim, which does not match the idea in the paper. (By "works" I mean it raises no error and is able to train.)
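As a quick, illustrative sanity check (the random data and training settings below are mine, not from any documentation), the "mismatched" model builds, trains and predicts without complaint:

import numpy as np

model = Model()  # whichever of the two definitions above is in scope
x = np.random.rand(1, 1000, 10).astype("float32")  # batch of 1, seq_len 1000, dim 10
y = np.zeros((1, 10), dtype="float32")
y[0, 3] = 1.0  # dummy one-hot target

model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(x, y, epochs=1, verbose=0)
print(model(x).shape)  # (1, 10)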
In the paper 'Attention Is All You Need', d_k is the dimension of the key for each head, not the original model dimension, and thus key_dim should equal embed_dim / num_heads. So if embed_dim is 10 and we want num_heads to be 5, key_dim has to be 2.
[screenshot from the paper]
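To make the convention concrete, here is a small illustration of my own (not part of the question's code); for reference, the paper's base model uses d_model = 512 and h = 8, so d_k = d_v = 64 there:

import tensorflow as tf

embed_dim = 10
num_heads = 5
key_dim = embed_dim // num_heads  # = 2, following d_k = d_model / h from the paper

# "paper-style" layer for the shapes in my example above
paper_style = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)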
Also, the Keras MultiHeadAttention class description says key_dim is the "Size of each attention head for query and key", which matches the idea from the paper.
[screenshot from the class description]
So why is tf.keras.layers.MultiHeadAttention able to take mismatched dimensions? And when it does, how does it work internally with the extra weight parameters?
There are two separate dimensions here, d_k and d_v. In their paper, Vaswani et al. set d_k = d_v = d_model / h, but that is a convention (chosen to keep the computational cost comparable to single-head attention with full dimensionality), not a requirement. The queries, keys and values are produced by learned projection matrices of shapes (d_model, d_k), (d_model, d_k) and (d_model, d_v), and the concatenated head outputs are mapped back by an output projection of shape (h * d_v, d_model), so the output shape never depends on the choice of num_heads, key_dim or value_dim; an "unmatched" choice simply changes the number of parameters in those projections.
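A minimal sketch of how to check this (my own code, using the shapes from the question): call the layer once so Keras builds it, then inspect the weights it created.

import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=20, key_dim=9)
src = tf.random.normal((1, 1000, 10))
out = mha(src, src)  # calling the layer builds its weights
print(out.shape)     # (1, 1000, 10) -- always the same shape as the query

for w in mha.weights:
    print(w.name, w.shape)
# As far as I can tell from the Keras implementation, this shows kernels of
# shape (10, 20, 9) for the query, key and value projections (each of the 20
# heads gets its own 10 -> 9 projection) and (20, 9, 10) for the output
# projection, which maps the 20 * 9 concatenated head outputs back to the
# query dimension of 10. That output projection is why any num_heads/key_dim
# combination still produces an output of the query's shape; a "mismatched"
# choice only changes how many parameters the projections contain.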