 

Should I avoid using L2 regularization in conjunction with RMSProp?

Should I avoid using L2 regularization in conjunction with RMSprop and NAG?

Does the L2 regularization term interfere with the gradient algorithm (RMSprop)?

Best regards,

asked Feb 23 '17 by Seguy

People also ask

Can I use L1 and L2 regularization together?

Yes. When L1 and L2 regularization are combined, the result is the elastic net method, which adds a hyperparameter controlling the mix between the two penalties.
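As a minimal sketch of what that combined penalty looks like (the `alpha` and `l1_ratio` names follow scikit-learn's convention; the function itself is only an illustration):

```python
import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """Elastic net penalty: a mix of the L1 and L2 terms.

    l1_ratio is the extra hyperparameter: 1.0 gives pure L1 (lasso),
    0.0 gives pure L2 (ridge), values in between blend the two.
    """
    l1 = np.sum(np.abs(w))          # L1 term: sum of absolute weights
    l2 = 0.5 * np.sum(w ** 2)       # L2 term: half the sum of squared weights
    return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)

w = np.array([0.5, -2.0, 0.0, 3.0])
print(elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5))
```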

Should I use L1 or L2 regularization?

From a practical standpoint, L1 tends to shrink coefficients to zero whereas L2 tends to shrink coefficients evenly. L1 is therefore useful for feature selection, as we can drop any variables associated with coefficients that go to zero. L2, on the other hand, is useful when you have collinear/codependent features.
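A quick, hedged illustration of that difference using scikit-learn's `Lasso` (L1) and `Ridge` (L2) on made-up data:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
# Only the first two features actually matter; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print("L1 (lasso) coefficients:", lasso.coef_)  # irrelevant features driven to ~0
print("L2 (ridge) coefficients:", ridge.coef_)  # all features merely shrunk
```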

What is the impact of using L1 instead of L2 as a loss function when training a linear model?

As a loss function, L1 is more robust to outliers than L2 for a fairly obvious reason: the L2 loss squares the residuals, so the cost contributed by outliers in the data grows quadratically, whereas the L1 loss takes their absolute value, so the cost only grows linearly.
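A tiny arithmetic check of that claim (the residual values are made up):

```python
# Hypothetical residuals: one well-fit point and one outlier.
residuals = [1.0, 10.0]

l1_costs = [abs(r) for r in residuals]   # L1 loss: |r|
l2_costs = [r ** 2 for r in residuals]   # L2 loss: r^2

print("L1 costs:", l1_costs)  # [1.0, 10.0]  -> outlier costs 10x more
print("L2 costs:", l2_costs)  # [1.0, 100.0] -> outlier costs 100x more
```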

What effect does L2 Regularisation have on the weights of the neural network?

Briefly, L2 regularization works by adding a term to the error function used by the training algorithm. The additional term penalizes large weight values. The two most common error functions used in neural network training are squared error and cross entropy error.
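A minimal sketch of that idea, assuming a squared-error loss on a linear model (all names here are illustrative):

```python
import numpy as np

def loss_with_l2(w, X, y, lam):
    """Squared-error loss plus an L2 penalty that discourages large weights."""
    residuals = X @ w - y
    data_loss = 0.5 * np.mean(residuals ** 2)
    l2_penalty = 0.5 * lam * np.sum(w ** 2)   # the extra term added to the error
    return data_loss + l2_penalty

def grad_with_l2(w, X, y, lam):
    """Gradient of the above: the penalty contributes lam * w,
    which pulls every weight toward zero at each update."""
    residuals = X @ w - y
    return X.T @ residuals / len(y) + lam * w

rng = np.random.RandomState(0)
X, y = rng.randn(50, 3), rng.randn(50)
w = rng.randn(3)
print(loss_with_l2(w, X, y, lam=0.01))
print(grad_with_l2(w, X, y, lam=0.01))
```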


1 Answer

It seems someone has since sorted out (in 2018) the question (asked in 2017).

Vanilla adaptive gradient methods (RMSProp, Adagrad, Adam, etc.) do not match well with L2 regularization.

Link to the paper (https://arxiv.org/pdf/1711.05101.pdf) and a short excerpt from its introduction:

In this paper, we show that a major factor of the poor generalization of the most popular adaptive gradient method, Adam, is due to the fact that L2 regularization is not nearly as effective for it as for SGD.

L2 regularization and weight decay are not identical. Contrary to common belief, the two techniques are not equivalent. For SGD, they can be made equivalent by a reparameterization of the weight decay factor based on the learning rate; this is not the case for Adam. In particular, when combined with adaptive gradients, L2 regularization leads to weights with large gradients being regularized less than they would be when using weight decay.
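To see the difference concretely, here is a minimal sketch (not the paper's code) of one RMSprop step done two ways: with the L2 term folded into the gradient before the adaptive scaling, and with decoupled weight decay applied after it:

```python
import numpy as np

def rmsprop_step_l2(w, grad, v, lr=1e-3, rho=0.9, eps=1e-8, lam=1e-2):
    """L2 regularization: add lam * w to the gradient BEFORE the adaptive scaling.
    The penalty is then divided by sqrt(v) along with everything else, so weights
    with large historical gradients (large v) end up being regularized less."""
    g = grad + lam * w
    v = rho * v + (1 - rho) * g ** 2
    w = w - lr * g / (np.sqrt(v) + eps)
    return w, v

def rmsprop_step_decoupled(w, grad, v, lr=1e-3, rho=0.9, eps=1e-8, lam=1e-2):
    """Decoupled weight decay: take the adaptive step on the plain gradient,
    then shrink the weights directly. The decay is NOT rescaled by sqrt(v),
    so every weight is decayed at the same relative rate."""
    v = rho * v + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(v) + eps)
    w = w - lr * lam * w
    return w, v

# Toy comparison: two weights, one seeing a much larger gradient than the other.
w = np.array([1.0, 1.0])
grad = np.array([0.01, 10.0])
v = np.zeros(2)
print("L2-in-gradient step: ", rmsprop_step_l2(w, grad, v)[0])
print("decoupled-decay step:", rmsprop_step_decoupled(w, grad, v)[0])
```

In the first variant the decay contribution `lam * w` is rescaled per weight by `1 / sqrt(v)`, which is exactly the effect the paper describes; the second variant is the decoupling that AdamW applies to Adam.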

answered Sep 27 '22 by Seguy