I have been reading papers about LSTM and checking its implementations. There is one point that is not clear to me.
Most of the papers mention that the weight matrices from the cell to the gate vectors should be diagonal (e.g., Alex Graves, 2013, p. 5), but I haven't seen this in any implementation.
For example, this implementation [1][2]. Another example is from the MILA lab [3].
Are these people implementing it incorrectly, or am I missing something?
One popular LSTM variant, introduced by Gers & Schmidhuber (2000), adds "peephole connections." This means that we let the gate layers look at the cell state: every gate receives the cell state as an additional input alongside the current input and previous hidden state.
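For reference, the peephole gates in the notation of Graves (2013) look roughly like this (exact symbols vary between papers); the peephole weight matrices W_ci, W_cf, W_co are the ones constrained to be diagonal, so each gate unit only "peeks" at its own cell unit:

    i_t = sigmoid(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
    f_t = sigmoid(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)
    c_t = f_t * c_{t-1} + i_t * tanh(W_xc x_t + W_hc h_{t-1} + b_c)
    o_t = sigmoid(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o)
    h_t = o_t * tanh(c_t)

(Here * denotes elementwise multiplication.)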
Comparing how the two units work, a GRU uses fewer trainable parameters, and therefore less memory, and executes faster than an LSTM, whereas an LSTM tends to be more accurate on larger datasets.
The weight matrix W contains separate weights for the current input vector and the previous hidden state for each gate. Just like a plain recurrent neural network, an LSTM generates an output at each time step, and this output is used to train the network with gradient descent.
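A minimal NumPy sketch of that idea (names and shapes are illustrative, not any particular library's API): a single stacked matrix W holds both the input and recurrent weights for all four gate blocks, applied to the concatenation of x_t and h_{t-1}.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # W has shape (4 * hidden, input + hidden): one row block per gate
        # (input, forget, cell candidate, output); within each block there are
        # separate columns for x_t and for h_prev.
        z = W @ np.concatenate([x_t, h_prev]) + b   # all four pre-activations at once
        i, f, g, o = np.split(z, 4)                 # slice out the per-gate parts
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c_t = f * c_prev + i * g                    # new cell state
        h_t = o * np.tanh(c_t)                      # output at this time step
        return h_t, c_t

    # Tiny usage example with input size 10 and hidden size 20.
    x_t, h_prev, c_prev = np.zeros(10), np.zeros(20), np.zeros(20)
    W, b = np.zeros((4 * 20, 10 + 20)), np.zeros(4 * 20)
    h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, b)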
The Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) that, in certain cases, has advantages over Long Short-Term Memory (LSTM). A GRU uses less memory and is faster than an LSTM; however, an LSTM is more accurate on datasets with longer sequences.
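As a rough back-of-the-envelope comparison of the parameter counts (ignoring framework-specific extras such as separate recurrent biases), an LSTM has four gate blocks and a GRU has three, each of size hidden x (input + hidden) plus a bias:

    # Hypothetical sizes: input size k, hidden size d.
    k, d = 128, 256
    lstm_params = 4 * (d * (k + d) + d)   # 4 gate blocks -> 394,240
    gru_params  = 3 * (d * (k + d) + d)   # 3 gate blocks -> 295,680
    print(lstm_params, gru_params)        # the GRU needs roughly 25% fewer parameters here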
The TensorFlow implementation does use a diagonal matrix (see the linked source). Note that what this means in practice is that the peepholes only go from each cell unit to itself, so you end up doing elementwise vector multiplies.
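Here is a minimal NumPy sketch of that point (the variable names are illustrative, not the actual TensorFlow code): multiplying the cell state by a diagonal peephole matrix gives exactly the same result as an elementwise multiply with the vector of its diagonal entries, which is what implementations actually store and compute.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden = 5
    c = rng.standard_normal(hidden)       # cell state
    w_ci = rng.standard_normal(hidden)    # peephole weights stored as a vector

    # "Diagonal matrix" formulation from the papers ...
    W_ci = np.diag(w_ci)
    full = W_ci @ c

    # ... is exactly the elementwise multiply used in implementations.
    elementwise = w_ci * c

    assert np.allclose(full, elementwise)

So an implementation that keeps the peephole weights as a vector and does elementwise multiplies is not ignoring the papers; it is simply the efficient way to apply a diagonal matrix.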