
Should we do learning rate decay for the Adam optimizer?

I'm training a network for image localization with the Adam optimizer, and someone suggested that I use exponential decay. I don't want to try it, because the Adam optimizer itself decays the learning rate. But he insists, saying he has done it before. So should I do it, and is there any theory behind the suggestion?

asked Sep 15 '16 by meng lin

People also ask

Do you need learning rate decay with Adam?

Yes, absolutely. In my own experience, using Adam with learning rate decay is very useful. Without decay, you have to set a very small learning rate so that the loss doesn't start to diverge after decreasing to a point.

Does learning rate matter for Adam Optimizer?

Yes. Even with the Adam optimization method, the learning rate is a hyperparameter and needs to be tuned, and learning rate decay usually works better than not doing it.

Does Adam Optimizer have weight decay?

Optimal weight decay is a function (among other things) of the total number of batch passes/weight updates. Our empirical analysis of Adam suggests that the longer the runtime/number of batch passes to be performed, the smaller the optimal weight decay.
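
For illustration, here is a minimal sketch (assuming PyTorch; the toy model and the decay values are arbitrary) of the two common ways weight decay is paired with Adam: the weight_decay argument of plain Adam, which acts like an L2 penalty, and the decoupled AdamW variant.

import torch

# Purely illustrative toy model; any parameter set would do.
model = torch.nn.Linear(10, 1)

# Plain Adam: weight_decay is applied as an L2-style penalty on the gradients.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# AdamW: weight decay is decoupled from the gradient-based update.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)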

What is the default learning rate for Adam Optimizer?

The default learning rate is 0.001. In Keras, the learning_rate argument can be a float, a tf.keras.optimizers.schedules.LearningRateSchedule, or a callable that takes no arguments and returns the actual value to use.
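
As a minimal sketch (assuming TensorFlow/Keras), this is what the default and an explicit learning rate look like:

import tensorflow as tf

# No argument: learning_rate defaults to 0.001.
opt_default = tf.keras.optimizers.Adam()

# A float, a LearningRateSchedule, or a zero-argument callable are also accepted.
opt_explicit = tf.keras.optimizers.Adam(learning_rate=3e-4)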


1 Answer

It depends. Adam updates each parameter with an individual learning rate. This means that every parameter in the network has a specific learning rate associated with it.

But each per-parameter learning rate is computed using lambda (the initial learning rate) as an upper limit. This means that every single learning rate can vary from 0 (no update) to lambda (maximum update).
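
For reference, a toy NumPy sketch of one Adam step (notation from the Adam paper; beta1=0.9, beta2=0.999 and eps=1e-8 are the usual defaults), showing how the effective per-parameter step lam * m_hat / (sqrt(v_hat) + eps) is roughly capped by the base rate lam:

import numpy as np

def adam_step(theta, grad, m, v, t, lam=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lam * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v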

It's true that the learning rates adapt themselves during training, but if you want to be sure that every update step does not exceed lambda, you can lower lambda using exponential decay or whatever. It can help reduce the loss during the last stage of training, when the loss computed with the previously associated lambda has stopped decreasing.
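
A minimal sketch of that idea (assuming TensorFlow/Keras; the decay numbers are arbitrary), letting lambda itself shrink exponentially as training progresses:

import tensorflow as tf

# lambda starts at 1e-3 and is multiplied by 0.96 every 10,000 steps,
# so the cap on each per-parameter update shrinks over training.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=10_000,
    decay_rate=0.96)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)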

answered Oct 16 '22 by nessuno