 

How to determine which Merge mode (add / average / multiply / dot / concat) to use?

After testing the scripts [babi_rnn.py] and [babi_memnn.py], one question came up many times: how do I determine which Merge mode (add / average / multiply / dot / concat) to use?

For example, in LSTM modeling it seems natural to use [concat] to merge, say, the time-sequence outputs of two branches.

However, it is harder for me to understand why [add] is used to merge the two branches in [babi_rnn.py]. In [babi_memnn.py], the [add], [dot] and [concat] merge modes are all used.

So, are there any guidelines for choosing which merge function to use in different scenarios?

zshtom asked Mar 08 '23 14:03



1 Answer

These Merge functions fall into 3 categories.

add and avg are linear combinations. They are used to simply combine several distinct components, because the gradient flows cleanly through addition and subtraction. A common use case is adding (+) several criteria together to obtain the loss function for a neural network that trains on multiple tasks jointly.
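As a rough illustration of why addition is gradient-friendly, the sketch below merges two per-task losses with add and avg and estimates the gradient of the merged value with respect to each term by finite differences. The loss values and the helper names (`add_merge`, `avg_merge`, `numeric_grad`) are made up for this example and are not part of Keras.

```python
# Hypothetical per-task losses for one training step (illustrative numbers only),
# e.g. a classification loss and a regression loss.
task_losses = [0.75, 0.25]

def add_merge(values):
    """'add' mode: plain sum of the components."""
    return sum(values)

def avg_merge(values):
    """'avg' mode: the same sum scaled by 1/n."""
    return sum(values) / len(values)

def numeric_grad(merge, values, i, eps=1e-6):
    """Finite-difference estimate of d merge / d values[i]."""
    bumped = list(values)
    bumped[i] += eps
    return (merge(bumped) - merge(values)) / eps

total_loss = add_merge(task_losses)
mean_loss = avg_merge(task_losses)

n = len(task_losses)
grad_add = [numeric_grad(add_merge, task_losses, i) for i in range(n)]
grad_avg = [numeric_grad(avg_merge, task_losses, i) for i in range(n)]

print(total_loss, mean_loss)
print(grad_add, grad_avg)
```

Every term receives the same gradient (about 1 for add, about 1/n for avg), whatever the other terms are doing, which is what makes these modes a safe default for combining independent components.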

Another example is L2 regularization, which is added to the loss as an extra term.

L2 regularization penalizes the squared magnitude of the weights. So the bigger the weights, the higher the loss.
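A minimal sketch of that additive penalty, assuming the usual form "data loss plus lambda times the sum of squared weights". The data loss, the weight vectors and the coefficient `lam` are hypothetical values chosen for the example.

```python
def l2_penalty(weights, lam):
    """Sum of squared weights, scaled by the regularization strength lam."""
    return lam * sum(w * w for w in weights)

data_loss = 0.5   # hypothetical loss on the training data
lam = 0.5         # hypothetical regularization strength

small_weights = [0.5, -0.5]
large_weights = [2.0, -2.0]

# The penalty is merged into the objective with a plain addition.
loss_small = data_loss + l2_penalty(small_weights, lam)
loss_large = data_loss + l2_penalty(large_weights, lam)

print(loss_small, loss_large)
```

With the same data loss, the model with larger weights ends up with the larger total loss, so the optimizer is pushed toward smaller weights.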


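multiply and dot are closely related: dot is an elementwise multiply followed by a sum along the chosen axes. In Keras, dot lets you specify those axes. The dot product is used to measure how similar two vectors are to each other. Note: the dot product is a reducing operation that collapses the contracted axis to a single number. Its magnitude is at most the product of the two input norms, and for unit-length inputs it is the cosine similarity, which lies between -1 and 1. Geometrically, it is the length of the projection of one vector onto the other, scaled by the length of the other.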
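To make the relationship concrete, here is a small sketch in plain Python (not Keras code) that computes the elementwise product, the dot product and the cosine similarity of two example vectors. The vectors and helper names are chosen purely for illustration.

```python
import math

def multiply(a, b):
    """Elementwise product: output has the same length as the inputs."""
    return [x * y for x, y in zip(a, b)]

def dot(a, b):
    """Dot product: elementwise product followed by a sum, giving a scalar."""
    return sum(multiply(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Dot product of the unit-normalized vectors."""
    return dot(a, b) / (norm(a) * norm(b))

a = [3.0, 4.0]
b = [4.0, 3.0]

prod = multiply(a, b)   # still a vector
score = dot(a, b)       # a single similarity score
cos = cosine(a, b)      # similarity independent of vector length

print(prod, score, cos)
```

The raw score is bounded by the product of the two norms, so it can exceed either norm on its own. Normalizing first gives a value in [-1, 1] that is easier to compare across pairs.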


concat does not discard any input. The concatenated vector can then be fed into a hidden layer, which applies its own learned weight to each element. concat itself computes no interaction between the elements of its inputs; any interaction has to be learned by the layer that follows. One common practice is to concatenate the hidden states and outputs of stacked RNNs and feed the result into a Dense layer, so that several RNNs can handle different tasks, much like the branches of a feedforward network.
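The following sketch shows concat followed by a hand-written dense layer in plain Python. The weight matrix `W` is a hypothetical stand-in for weights a network would learn; it is picked so that one output unit adds and the other subtracts elements of the two inputs.

```python
def concat(a, b):
    """Concatenate two vectors; every input element is kept unchanged."""
    return a + b

def dense(x, weights, bias):
    """Linear layer without activation: one weighted sum per output unit."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + bias_j
        for row, bias_j in zip(weights, bias)
    ]

a = [1.0, 2.0]
b = [3.0, 4.0]
merged = concat(a, b)

# Hypothetical weights: unit 0 computes a[0] + b[0], unit 1 computes a[1] - b[1].
W = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
]
bias = [0.0, 0.0]
out = dense(merged, W, bias)

print(merged, out)
```

The merged vector is as long as both inputs together, and the mixing of the two branches happens only inside the dense layer. The cost is a wider input and more weights than add or dot would need.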


To sum up, each Merge operation has a different use case. The Luong attention paper, for instance, proposes three scoring mechanisms (dot, general and concat). Depending on your model, you can pick the one that works best for you.
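For reference, the three Luong-style scores can be sketched in plain Python as below. The state vectors, the matrices `W_general` and `W_concat`, and the vector `v_a` are hypothetical values; in a real model the matrices and `v_a` would be learned parameters.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def score_dot(h_t, h_s):
    """dot: h_t . h_s"""
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    """general: h_t . (W_a h_s)"""
    return dot(h_t, matvec(W_a, h_s))

def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a . tanh(W_a [h_t; h_s])"""
    hidden = [math.tanh(z) for z in matvec(W_a, h_t + h_s)]
    return dot(v_a, hidden)

h_t = [1.0, 0.0]   # hypothetical decoder state
h_s = [0.5, 0.5]   # hypothetical encoder state

W_general = [[1.0, 0.0], [0.0, 1.0]]   # identity, so general reduces to dot
W_concat = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
v_a = [1.0, 1.0]

s_dot = score_dot(h_t, h_s)
s_general = score_general(h_t, h_s, W_general)
s_concat = score_concat(h_t, h_s, W_concat, v_a)

print(s_dot, s_general, s_concat)
```

The dot score has no parameters, general adds one learned matrix, and concat pushes the merged states through a small learned layer, which mirrors the trade-offs of the merge modes discussed above.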

Ricky Han answered Apr 03 '23 01:04
