How is learning rate decay implemented by Adam in Keras?

I set learning rate decay for my Adam optimizer, such as

LR = 1e-3
LR_DECAY = 1e-2
OPTIMIZER = Adam(lr=LR, decay=LR_DECAY)

As the Keras documentation for Adam states, after each epoch the learning rate would be

lr = lr * (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))

If I understand correctly, the learning rate would be like this,

lr = lr * 1 / ( 1 + num_epoch * decay)

But I don't see the learning rate decay coming into effect after printing it out. Is there a problem with how I am using this?

Edit
I print out the learning rate by setting verbose on the ReduceLROnPlateau callback, such as,

ReduceLROnPlateau(monitor='val_loss', factor=0.75, patience=Config.REDUCE_LR_PATIENCE, verbose=1, mode='auto', epsilon=0.01, cooldown=0, min_lr=1e-6)

That callback monitors val_loss and reduces the learning rate by multiplying it by the factor. The printed learning rate looks like this,

Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.0007500000356230885.

I set the initial learning rate to 1e-3, so it appears the learning rate only changed from 1e-3 to 1e-3 * 0.75 via that callback. This makes me suspect that the decay I set in Adam isn't working.

asked Aug 16 '19 by yujuezhao


1 Answer

The learning rate changes with every iteration, i.e., with every batch, not every epoch. So, if you set decay = 1e-2 and each epoch has 100 batches/iterations, then after 1 epoch your learning rate will be

lr = init_lr * 1/(1 + 1e-2 * 100)
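
For a concrete check, here is a quick sketch of that calculation (using the question's 1e-3 initial rate together with the decay = 1e-2 and 100 iterations from above):

# illustrative numbers: initial learning rate, decay, and one epoch's worth of batches
init_lr = 1e-3
decay = 1e-2
iterations = 100

# effective learning rate after one epoch, following the Keras decay formula
lr = init_lr * (1. / (1. + decay * iterations))
print(lr)  # 0.0005 -- the learning rate has already halved after a single epoch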

So, if I want my learning rate to be 0.75 of the original learning rate at the end of the first epoch (note that with this 1/(1 + decay * iterations) schedule, later epochs shrink the rate more slowly than a fixed 0.75-per-epoch factor would), I would set the lr_decay to

batches_per_epoch = dataset_size/batch_size
lr_decay = (1./0.75 -1)/batches_per_epoch
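
As a sketch of how this could be wired up (the dataset_size, batch_size, and 1e-3 initial rate below are made-up illustrative values; the lr/decay argument names assume the older keras 2.x Adam used in the question):

from keras.optimizers import Adam

# hypothetical dataset/batch numbers, purely for illustration
dataset_size = 50000
batch_size = 32
batches_per_epoch = dataset_size / batch_size

# decay that brings the rate down to 0.75 * init_lr by the end of the first epoch
lr_decay = (1. / 0.75 - 1.) / batches_per_epoch

optimizer = Adam(lr=1e-3, decay=lr_decay)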

It seems to work for me. Also, since the new learning rate is recomputed at every iteration, the optimizer never changes the value of the learning rate variable itself; it always starts from the initial learning rate and derives the effective (decayed) rate on the fly. That is why callbacks such as ReduceLROnPlateau still see and print the initial 1e-3 rather than the decayed value.
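
If you want to see the decayed value that is actually applied, one option is a small custom callback that recomputes it from the optimizer's variables. This is only a sketch, assuming the older keras 2.x Adam from the question (which exposes lr, decay and iterations as backend variables); the EffectiveLRLogger name is just made up here:

import keras.backend as K
from keras.callbacks import Callback

class EffectiveLRLogger(Callback):
    def on_epoch_end(self, epoch, logs=None):
        opt = self.model.optimizer
        lr = K.get_value(opt.lr)                  # the stored lr variable (stays at its initial value)
        decay = K.get_value(opt.decay)
        iterations = K.get_value(opt.iterations)  # total batches seen so far
        # reproduce the per-iteration decay formula to get the effective rate
        effective_lr = lr * (1. / (1. + decay * iterations))
        print('Epoch %d: effective lr = %.6g' % (epoch + 1, effective_lr))

# usage (hypothetical model/data): model.fit(x, y, callbacks=[EffectiveLRLogger()])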

answered Oct 20 '22 by ssaz_5