I was trying to port an existing trained PyTorch model into Keras.
During the porting, I got stuck at LSTM layer.
Keras implementation of LSTM network seems to have three state kind of state matrices while Pytorch implementation have four.
For eg, for an Bidirectional LSTM with hidden_layers=64, input_size=512 & output size=128 state parameters where as follows
State params of Keras LSTM
[<tf.Variable 'bidirectional_1/forward_lstm_1/kernel:0' shape=(512, 256) dtype=float32_ref>,
<tf.Variable 'bidirectional_1/forward_lstm_1/recurrent_kernel:0' shape=(64, 256) dtype=float32_ref>,
<tf.Variable 'bidirectional_1/forward_lstm_1/bias:0' shape=(256,) dtype=float32_ref>,
<tf.Variable 'bidirectional_1/backward_lstm_1/kernel:0' shape=(512, 256) dtype=float32_ref>,
<tf.Variable 'bidirectional_1/backward_lstm_1/recurrent_kernel:0' shape=(64, 256) dtype=float32_ref>,
<tf.Variable 'bidirectional_1/backward_lstm_1/bias:0' shape=(256,) dtype=float32_ref>]
State params of PyTorch LSTM
['rnn.0.rnn.weight_ih_l0', torch.Size([256, 512])],
['rnn.0.rnn.weight_hh_l0', torch.Size([256, 64])],
['rnn.0.rnn.bias_ih_l0', torch.Size([256])],
['rnn.0.rnn.bias_hh_l0', torch.Size([256])],
['rnn.0.rnn.weight_ih_l0_reverse', torch.Size([256, 512])],
['rnn.0.rnn.weight_hh_l0_reverse', torch.Size([256, 64])],
['rnn.0.rnn.bias_ih_l0_reverse', torch.Size([256])],
['rnn.0.rnn.bias_hh_l0_reverse', torch.Size([256])],
I tried to look in to the code of both implementation but not able to understand much.
Can someone please help me to transform 4-set of state params from PyTorch into 3-set of state params in Keras
They are really not that different. If you sum up the two bias vectors in PyTorch, the equations will be the same as what's implemented in Keras.
This is the LSTM formula on PyTorch documentation:
PyTorch uses two separate bias vectors for the input transformation (with a subscript starts with i
) and recurrent transformation (with a subscript starts with h
).
In Keras LSTMCell
:
x_i = K.dot(inputs_i, self.kernel_i)
x_f = K.dot(inputs_f, self.kernel_f)
x_c = K.dot(inputs_c, self.kernel_c)
x_o = K.dot(inputs_o, self.kernel_o)
if self.use_bias:
x_i = K.bias_add(x_i, self.bias_i)
x_f = K.bias_add(x_f, self.bias_f)
x_c = K.bias_add(x_c, self.bias_c)
x_o = K.bias_add(x_o, self.bias_o)
if 0 < self.recurrent_dropout < 1.:
h_tm1_i = h_tm1 * rec_dp_mask[0]
h_tm1_f = h_tm1 * rec_dp_mask[1]
h_tm1_c = h_tm1 * rec_dp_mask[2]
h_tm1_o = h_tm1 * rec_dp_mask[3]
else:
h_tm1_i = h_tm1
h_tm1_f = h_tm1
h_tm1_c = h_tm1
h_tm1_o = h_tm1
i = self.recurrent_activation(x_i + K.dot(h_tm1_i,
self.recurrent_kernel_i))
f = self.recurrent_activation(x_f + K.dot(h_tm1_f,
self.recurrent_kernel_f))
c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1_c,
self.recurrent_kernel_c))
o = self.recurrent_activation(x_o + K.dot(h_tm1_o,
self.recurrent_kernel_o))
There's only one bias added in the input transformation. However, the equations would be equivalent if we sum up the two biases in PyTorch.
The two-bias LSTM is what's implemented in cuDNN (see the developer guide). I'm really not that familiar with PyTorch, but I guess that's why they use two bias parameters. In Keras, the CuDNNLSTM
layer also has two bias weight vectors.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With