 

Should I avoid using L2 regularization in conjunction with RMSProp?

Should I avoid using L2 regularization in conjunction with RMSprop and NAG?

Does the L2 regularization term interfere with the gradient algorithm (RMSprop)?

Best regards,

asked Feb 23 '17 by Seguy

People also ask

Can I use L1 and L2 regularization together?

Yes. When L1 and L2 regularization are combined, the result is the elastic net method, which adds a hyperparameter controlling the mix between the two penalties.
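As a minimal sketch of what that combined penalty looks like (the `alpha` and `l1_ratio` names follow scikit-learn's convention; the function itself is only an illustration):

```python
import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """Elastic net penalty: a mix of the L1 and L2 terms.

    l1_ratio is the extra hyperparameter: 1.0 gives pure L1 (lasso),
    0.0 gives pure L2 (ridge), values in between blend the two.
    """
    l1 = np.sum(np.abs(w))          # L1 term: sum of absolute weights
    l2 = 0.5 * np.sum(w ** 2)       # L2 term: half the sum of squared weights
    return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)

w = np.array([0.5, -2.0, 0.0, 3.0])
print(elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5))
```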

Should I use L1 or L2 regularization?

From a practical standpoint, L1 tends to shrink coefficients to zero whereas L2 tends to shrink coefficients evenly. L1 is therefore useful for feature selection, as we can drop any variables associated with coefficients that go to zero. L2, on the other hand, is useful when you have collinear/codependent features.
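A quick, hedged illustration of that difference using scikit-learn's `Lasso` (L1) and `Ridge` (L2) on made-up data:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
# Only the first two features actually matter; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print("L1 (lasso) coefficients:", lasso.coef_)  # irrelevant features driven to ~0
print("L2 (ridge) coefficients:", ridge.coef_)  # all features merely shrunk
```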

What is the impact of using L1 instead of L2 as a loss function when training a linear model?

As a loss function, L1 is more robust to outliers than L2 for a fairly obvious reason: the L2 loss squares the residuals, so the cost contributed by outliers in the data grows quadratically, whereas the L1 loss takes their absolute value, so the cost only grows linearly.
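A tiny arithmetic check of that claim (the residual values are made up):

```python
# Hypothetical residuals: one well-fit point and one outlier.
residuals = [1.0, 10.0]

l1_costs = [abs(r) for r in residuals]   # L1 loss: |r|
l2_costs = [r ** 2 for r in residuals]   # L2 loss: r^2

print("L1 costs:", l1_costs)  # [1.0, 10.0]  -> outlier costs 10x more
print("L2 costs:", l2_costs)  # [1.0, 100.0] -> outlier costs 100x more
```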

What effect does L2 Regularisation have on the weights of the neural network?

Briefly, L2 regularization works by adding a term to the error function used by the training algorithm. The additional term penalizes large weight values. The two most common error functions used in neural network training are squared error and cross entropy error.
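A minimal sketch of that idea, assuming a squared-error loss on a linear model (all names here are illustrative):

```python
import numpy as np

def loss_with_l2(w, X, y, lam):
    """Squared-error loss plus an L2 penalty that discourages large weights."""
    residuals = X @ w - y
    data_loss = 0.5 * np.mean(residuals ** 2)
    l2_penalty = 0.5 * lam * np.sum(w ** 2)   # the extra term added to the error
    return data_loss + l2_penalty

def grad_with_l2(w, X, y, lam):
    """Gradient of the above: the penalty contributes lam * w,
    which pulls every weight toward zero at each update."""
    residuals = X @ w - y
    return X.T @ residuals / len(y) + lam * w

rng = np.random.RandomState(0)
X, y = rng.randn(50, 3), rng.randn(50)
w = rng.randn(3)
print(loss_with_l2(w, X, y, lam=0.01))
print(grad_with_l2(w, X, y, lam=0.01))
```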


1 Answer

It seems someone has since sorted out (in 2018) the question (asked in 2017).

Vanilla adaptive gradient methods (RMSProp, Adagrad, Adam, etc.) do not match well with L2 regularization.

Link to the paper (https://arxiv.org/pdf/1711.05101.pdf) and a short excerpt from its introduction:

In this paper, we show that a major factor of the poor generalization of the most popular adaptive gradient method, Adam, is due to the fact that L2 regularization is not nearly as effective for it as for SGD.

L2 regularization and weight decay are not identical. Contrary to common belief, the two techniques are not equivalent. For SGD, they can be made equivalent by a reparameterization of the weight decay factor based on the learning rate; this is not the case for Adam. In particular, when combined with adaptive gradients, L2 regularization leads to weights with large gradients being regularized less than they would be when using weight decay.
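To see the difference concretely, here is a minimal sketch (not the paper's code) of one RMSprop step done two ways: with the L2 term folded into the gradient before the adaptive scaling, and with decoupled weight decay applied after it:

```python
import numpy as np

def rmsprop_step_l2(w, grad, v, lr=1e-3, rho=0.9, eps=1e-8, lam=1e-2):
    """L2 regularization: add lam * w to the gradient BEFORE the adaptive scaling.
    The penalty is then divided by sqrt(v) along with everything else, so weights
    with large historical gradients (large v) end up being regularized less."""
    g = grad + lam * w
    v = rho * v + (1 - rho) * g ** 2
    w = w - lr * g / (np.sqrt(v) + eps)
    return w, v

def rmsprop_step_decoupled(w, grad, v, lr=1e-3, rho=0.9, eps=1e-8, lam=1e-2):
    """Decoupled weight decay: take the adaptive step on the plain gradient,
    then shrink the weights directly. The decay is NOT rescaled by sqrt(v),
    so every weight is decayed at the same relative rate."""
    v = rho * v + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(v) + eps)
    w = w - lr * lam * w
    return w, v

# Toy comparison: two weights, one seeing a much larger gradient than the other.
w = np.array([1.0, 1.0])
grad = np.array([0.01, 10.0])
v = np.zeros(2)
print("L2-in-gradient step: ", rmsprop_step_l2(w, grad, v)[0])
print("decoupled-decay step:", rmsprop_step_decoupled(w, grad, v)[0])
```

In the first variant the decay contribution `lam * w` is rescaled per weight by `1 / sqrt(v)`, which is exactly the effect the paper describes; the second variant is the decoupling that AdamW applies to Adam.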

answered Sep 27 '22 by Seguy