TensorFlow offers a nice LSTM wrapper.
rnn_cell.BasicLSTMCell(num_units, forget_bias=1.0, input_size=None,
    state_is_tuple=False, activation=tanh)
I would like to use regularization, say L2 regularization. However, I don't have direct access to the different weight matrices used in the LSTM cell, so I cannot explicitly do something like
loss = something + beta * tf.reduce_sum(tf.nn.l2_loss(weights))
Is there a way to access the matrices or use regularization somehow with LSTM?
tf.trainable_variables() gives you a list of Variable objects that you can use to add the L2 regularization term. Note that this adds regularization for all variables in your model. If you want to restrict the L2 term to a subset of the weights, you can use name_scope to give your variables specific prefixes (for variables created with tf.get_variable, which ignores name_scope, use variable_scope instead), and later use those prefixes to filter the list returned by tf.trainable_variables().
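As a rough, framework-free sketch of the prefix filtering described above: the Var tuple below is only a stand-in for a TensorFlow variable, the scope names ("encoder/", "decoder/") and the value of beta are made up for illustration, and l2_loss mirrors what tf.nn.l2_loss computes (sum of squares divided by 2).

```python
from collections import namedtuple

# Hypothetical stand-in for tf.Variable: only `name` matters for filtering.
Var = namedtuple("Var", ["name", "values"])


def l2_loss(values):
    """Mirror of tf.nn.l2_loss: sum(x ** 2) / 2."""
    return sum(x * x for x in values) / 2.0


def filter_by_prefix(variables, prefix):
    """Keep only the variables whose name starts with the given scope prefix."""
    return [v for v in variables if v.name.startswith(prefix)]


# Made-up variable list, shaped like what tf.trainable_variables() returns.
all_vars = [
    Var("encoder/weights:0", [1.0, -2.0]),
    Var("encoder/bias:0", [0.5]),
    Var("decoder/weights:0", [3.0]),
]

beta = 0.01  # assumed regularization strength
regularized = filter_by_prefix(all_vars, "encoder/")
penalty = beta * sum(l2_loss(v.values) for v in regularized)
```

In real TensorFlow code the same list comprehension would run over tf.trainable_variables() and the penalty would be built with tf.nn.l2_loss, then added to the loss.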
I like to do the following. The only thing I know for sure is that some parameters are better left out of L2 regularization, such as batch norm parameters and biases. An LSTM contains a single Bias tensor (conceptually it has several biases, but they seem to be concatenated, presumably for performance), and for batch normalization I add "noreg" to the variables' names so that they are ignored too.
# `loss` is your regular output loss
l2 = lambda_l2_reg * sum(
    tf.nn.l2_loss(tf_var)
    for tf_var in tf.trainable_variables()
    if not ("noreg" in tf_var.name or "Bias" in tf_var.name)
)
loss += l2
Here lambda_l2_reg is a small multiplier, e.g. float(0.005).
Making this selection (the full if clause in the generator expression, which leaves some variables out of the regularization) once took me from an F1 score of 0.879 to 0.890 in a single test run, without readjusting the lambda value in the config. That change covered both the batch normalization parameters and the biases, and I had other biases in the neural network as well.
According to this paper, regularizing the recurrent weights may help with exploding gradients.
Also, according to this other paper, dropout is better applied between stacked cells rather than inside the cells, if you use dropout at all.
About the exploding gradient problem, if you use gradient clipping with the loss that has the L2 regularization already added to it, that regularization will be taken into account too during the clipping process.
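To make the clipping remark concrete, here is a small plain-Python sketch, not TensorFlow code. The weights, the output-loss gradients, and the clip norm are made-up numbers; clip_by_global_norm follows the same scaling rule as tf.clip_by_global_norm (scale by clip_norm / max(global_norm, clip_norm)).

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by clip_norm / max(global_norm, clip_norm)."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm


weights = [3.0, -4.0]
data_grads = [3.0, 4.0]   # hypothetical gradient of the output loss alone
lambda_l2_reg = 0.1       # assumed multiplier, for illustration only

# The gradient of lambda * sum(w ** 2) / 2 with respect to w is lambda * w.
l2_grads = [lambda_l2_reg * w for w in weights]

# Gradient of (output loss + L2 term): this is what gets clipped.
total_grads = [d + r for d, r in zip(data_grads, l2_grads)]

clipped, norm_before = clip_by_global_norm(total_grads, clip_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
```

Because the L2 gradient is summed in before the norm is computed, it changes both the global norm and the direction of the clipped update. If you would rather clip only the output-loss gradient, the regularization gradient has to be added after clipping.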
P.S. Here is the neural network I was working on: https://github.com/guillaume-chevalier/HAR-stacked-residual-bidir-LSTMs