I'm currently using this code that I got from a discussion on GitHub. Here's the code of the attention mechanism:
_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_size,
    input_length=max_length,
    trainable=False,
    mask_zero=False
)(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)
Is this the correct way to do it? I was sort of expecting a TimeDistributed layer, since the attention mechanism is applied at every time step of the RNN. I need someone to confirm that this implementation (the code) is a correct implementation of the attention mechanism. Thank you.
from tensorflow.keras import layers

layers.Attention(use_scale=False, **kwargs)

Here, the attention layer provided above is a dot-product attention mechanism. The layer can also be used inside a convolutional neural network.
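As a minimal sketch (the tensor shapes below are arbitrary examples, not values from the question), the layer is called on a [query, value] pair of 3-D tensors:

```python
import tensorflow as tf
from tensorflow.keras import layers

# query: (batch, query_steps, dim), value: (batch, value_steps, dim)
query = tf.random.normal((2, 4, 8))
value = tf.random.normal((2, 6, 8))

# Dot-product attention: scores = query @ value^T, weights = softmax(scores),
# output = weights @ value
attention = layers.Attention(use_scale=False)
output = attention([query, value])

print(output.shape)  # (2, 4, 8)
```

The output keeps the query's time dimension: each query step gets a softmax-weighted mixture of the value vectors.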
In essence, when the generalized attention mechanism is presented with a sequence of words, it takes the query vector attributed to some specific word in the sequence and scores it against each key in the database. In doing so, it captures how the word under consideration relates to the others in the sequence.
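In code, that query-against-keys scoring can be sketched with plain NumPy (a toy illustration of the idea, not the Keras implementation):

```python
import numpy as np

def dot_product_attention(query, keys, values):
    """Score one query vector against every key, then mix the values."""
    scores = keys @ query                    # (seq_len,): one score per key
    weights = np.exp(scores - scores.max())  # softmax, numerically stable
    weights /= weights.sum()
    return weights @ values                  # weighted sum of value vectors

rng = np.random.default_rng(0)
query = rng.normal(size=4)        # vector for the word under consideration
keys = rng.normal(size=(5, 4))    # one key per word in the sequence
values = rng.normal(size=(5, 4))  # one value per word

context = dot_product_attention(query, keys, values)
print(context.shape)  # (4,)
```

Words whose keys align with the query get larger weights, so the returned context vector leans toward the most related words.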
If you want to have an attention along the time dimension, then this part of your code seems correct to me:
activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

sent_representation = merge([activations, attention], mode='mul')
You've worked out the attention vector of shape (batch_size, max_length) here:
attention = Activation('softmax')(attention)
I've never seen this code before, so I can't say if this one is actually correct or not:
K.sum(xin, axis=-2)
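For what it's worth, axis=-2 is the time axis at that point in the model: the sum collapses (batch_size, max_length, units) into (batch_size, units). A quick NumPy check (shapes chosen arbitrarily):

```python
import numpy as np

batch_size, max_length, units = 2, 7, 3
xin = np.ones((batch_size, max_length, units))

# Summing over axis=-2 (the second-to-last axis, i.e. the time steps)
# turns (batch_size, max_length, units) into (batch_size, units).
summed = xin.sum(axis=-2)
print(summed.shape)  # (2, 3)
```

Since the activations were already weighted by the softmax scores, this sum over time produces the attention-weighted sentence representation.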
Further reading (you might have a look):
https://github.com/philipperemy/keras-visualize-activations
https://github.com/philipperemy/keras-attention-mechanism
The attention mechanism pays attention to different parts of the sentence:
activations = LSTM(units, return_sequences=True)(embedded)
And it determines the contribution of each hidden state of that sentence by
attention = Dense(1, activation='tanh')(activations)
attention = Activation('softmax')(attention)
And finally it pays attention to the different states:
sent_representation = merge([activations, attention], mode='mul')
I don't quite understand this part:

sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)
To understand more, you can refer to this and this; this one also gives a good implementation. See if you can understand more on your own.