I'm training a network for image localization with the Adam optimizer, and someone suggested I use exponential decay. I don't want to try it because the Adam optimizer already adjusts the learning rate on its own. But he insists, saying he has done it before and it worked. So should I do it, and is there any theory behind the suggestion?
Yes, absolutely. From my own experience, it's very useful to use Adam with learning rate decay. Without decay, you have to set a very small learning rate so the loss doesn't start to diverge after decreasing to a point.

Even with Adam, the learning rate is a hyperparameter and needs to be tuned, and decaying it usually works better than keeping it fixed.
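For instance, here is a minimal sketch of that setup, assuming a PyTorch training loop (the model, gamma, and epoch count are placeholders to tune, not recommendations):

```python
import torch

# Placeholder model, e.g. a head predicting a bounding box (x, y, w, h).
model = torch.nn.Linear(256, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Multiply the base learning rate by gamma after every epoch (exponential decay).
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(50):
    # ... run one epoch here: forward pass, loss, backward(), optimizer.step() ...
    scheduler.step()  # decay Adam's base learning rate for the next epoch
```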
Optimal weight decay is a function (among other things) of the total number of batch passes/weight updates. Our empirical analysis of Adam suggests that the longer the runtime/number of batch passes to be performed, the smaller the optimal weight decay.
In Keras, the optimizer's `learning_rate` argument can be a `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable that takes no arguments and returns the actual value to use; it defaults to 0.001.
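So one option, assuming a Keras/TensorFlow setup, is to pass an exponential-decay schedule directly as that `learning_rate` argument (the decay values below are placeholders to tune):

```python
import tensorflow as tf

# Decay the learning rate as 1e-3 * 0.96^(step / 10_000).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=10_000,
    decay_rate=0.96)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
# model.compile(optimizer=optimizer, loss="mse")  # e.g. for a localization/regression head
```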
It depends. Adam updates each parameter with an individual learning rate, so every parameter in the network has its own effective step size.

But each per-parameter learning rate is computed using lambda (the initial learning rate you set) as an upper limit. This means that every individual update can range from 0 (no update) up to roughly lambda (maximum update).

It's true that the learning rates adapt themselves during training, but if you want to be sure that no update step exceeds lambda, you can then lower lambda itself using exponential decay or any other schedule. This can help reduce the loss during the late stages of training, when the loss computed with the previously fixed lambda has stopped decreasing.
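To make the "upper limit" idea concrete, here is a rough NumPy sketch of a single Adam update with the default betas; since |m_hat| / (sqrt(v_hat) + eps) is typically around 1 or less, each step's magnitude is approximately capped by the current learning rate, so lowering lambda over time tightens that cap:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter (a sketch, not a full optimizer)."""
    m = beta1 * m + (1 - beta1) * grad          # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)  # |step| is roughly <= lr in typical cases
    return param - step, m, v
```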