
Regularization for LSTM in tensorflow

Tensorflow offers a nice LSTM wrapper.

rnn_cell.BasicLSTMCell(num_units, forget_bias=1.0, input_size=None,
                       state_is_tuple=False, activation=tanh)

I would like to use regularization, say L2 regularization. However, I don't have direct access to the different weight matrices used in the LSTM cell, so I cannot explicitly do something like

loss = something + beta * tf.reduce_sum(tf.nn.l2_loss(weights))

Is there a way to access the matrices or use regularization somehow with LSTM?

BiBi asked Jun 01 '16 14:06


2 Answers

tf.trainable_variables() gives you a list of Variable objects that you can use to add the L2 regularization term. Note that this adds regularization for all variables in your model. If you want to restrict the L2 term to a subset of the weights, you can create those variables under a scope with a specific prefix (use tf.variable_scope for this; tf.name_scope does not prefix variables created through tf.get_variable, which is how the LSTM cell creates its weights), and later use that prefix to filter the list returned by tf.trainable_variables().
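To make the prefix filtering concrete, here is a minimal sketch in plain Python, so it runs without TensorFlow. The list of (name, values) pairs is a hypothetical stand-in for what tf.trainable_variables() returns, the variable names and numbers are made up, and l2_loss mirrors the definition of tf.nn.l2_loss (sum of squares divided by 2). In a real graph you would test tf_var.name the same way.

```python
# Stand-in for tf.trainable_variables(): (name, flat list of weights) pairs.
# All names and values here are hypothetical.
variables = [
    ("encoder/lstm/Matrix:0", [0.5, -1.0, 2.0]),
    ("encoder/lstm/Bias:0", [0.1, 0.1]),
    ("classifier/weights:0", [3.0, -4.0]),
]


def l2_loss(values):
    """Same definition as tf.nn.l2_loss: sum(v ** 2) / 2."""
    return sum(v * v for v in values) / 2.0


def l2_penalty(variables, prefix, beta):
    """Sum the L2 loss of every variable whose name starts with `prefix`."""
    selected = [values for name, values in variables if name.startswith(prefix)]
    return beta * sum(l2_loss(values) for values in selected)


# Regularize only the variables that live under the "encoder/" scope.
penalty = l2_penalty(variables, prefix="encoder/", beta=0.01)
print(penalty)
```

The penalty would then be added to the output loss, as in loss = something + penalty. Filtering on a prefix keeps the choice of regularized weights in one place, at the cost of relying on naming discipline when the variables are created.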

keveman answered Sep 27 '22 22:09


I like to do the following. The one thing I know for sure is that some parameters are better left out of L2 regularization, such as batch norm parameters and biases. An LSTM cell contains a single Bias tensor (conceptually it has several biases, but they appear to be concatenated into one, presumably for performance), and for batch normalization I add "noreg" to the variables' names so that they are skipped too.

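import tensorflow as tf

# `loss` is your regular output loss (a scalar tensor), defined earlier.
# Sum the L2 loss of every trainable variable except biases and "noreg" ones.
l2 = lambda_l2_reg * sum(
    tf.nn.l2_loss(tf_var)
    for tf_var in tf.trainable_variables()
    if not ("noreg" in tf_var.name or "Bias" in tf_var.name)
)
loss += l2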

Here lambda_l2_reg is a small multiplier, e.g. 0.005.

Applying this selection (the if clause in the generator expression, which excludes some variables from regularization) once took me from an F1 score of 0.879 to 0.890 in a single test run, without retuning the config's lambda. Note that this change covered both the batch normalization parameters and the biases, and my network had other biases besides the LSTM's.

According to this paper, regularizing the recurrent weights may help with exploding gradients.

Also, according to this other paper, dropout, if you use it, is better applied between stacked cells than inside the cells.

Regarding the exploding gradient problem: if you apply gradient clipping to a loss that already has the L2 regularization term added to it, the gradient of that term is taken into account during clipping as well.
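A small numeric sketch of that interaction, in plain Python so it runs without TensorFlow. The weights, gradients, lambda, and clipping threshold are made-up values, with lambda exaggerated so the effect is visible. The helper applies the same rescaling rule as tf.clip_by_global_norm (scale by clip_norm / max(global_norm, clip_norm)); it is an illustration, not the library's implementation.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Rescale grads so their joint L2 norm is at most clip_norm.

    Returns the rescaled gradients and the norm measured before clipping.
    """
    global_norm = math.sqrt(sum(g * g for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm


# Hypothetical numbers for illustration only.
weights = [3.0, -4.0]
data_grads = [0.6, 0.8]   # gradient of the output loss alone
lambda_l2_reg = 0.5       # exaggerated so the effect is visible
clip_norm = 1.0

# The gradient of lambda * sum(w ** 2) / 2 with respect to w is lambda * w.
l2_grads = [lambda_l2_reg * w for w in weights]
total_grads = [g + r for g, r in zip(data_grads, l2_grads)]

clipped, norm_before = clip_by_global_norm(total_grads, clip_norm)
print(norm_before, clipped)
```

With these numbers the output-loss gradient alone sits at the threshold and would pass through essentially unchanged, while the combined gradient exceeds it and gets rescaled. The L2 term therefore changes both the direction of the update and whether clipping triggers.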


P.S. Here is the neural network I was working on: https://github.com/guillaume-chevalier/HAR-stacked-residual-bidir-LSTMs

Guillaume Chevalier answered Sep 27 '22 22:09