I set learning rate decay in my Adam optimizer, like this:

from keras.optimizers import Adam

LR = 1e-3
LR_DECAY = 1e-2
OPTIMIZER = Adam(lr=LR, decay=LR_DECAY)
As the Keras documentation for Adam states, after each epoch the learning rate would be
lr = lr * (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
If I understand correctly, the learning rate should then be

lr = lr * 1 / (1 + num_epoch * decay)
But I don't see the learning rate decay taking effect when I print it out. Is there something wrong with how I'm using this?
Edit
I print out the learning rate by setting verbose=1 on the ReduceLROnPlateau callback, like this:

ReduceLROnPlateau(monitor='val_loss', factor=0.75, patience=Config.REDUCE_LR_PATIENCE, verbose=1, mode='auto', epsilon=0.01, cooldown=0, min_lr=1e-6)
That callback monitors val_loss and reduces the learning rate by multiplying it by the given factor.
The printed learning rate looks like this:
Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.0007500000356230885.
I set the initial learning rate to 1e-3, and the log shows it changing from 1e-3 to 1e-3 * 0.75, so I suspect that the decay I set in Adam isn't working.
In my experience it is usually not necessary to use learning rate decay with the Adam optimizer. The reasoning is that Adam already handles learning rate adaptation on its own (see the Adam paper): "We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement."
Step Decay: a typical approach is to drop the learning rate by half every 10 epochs. To implement this in Keras, we can define a step decay function and pass it to the LearningRateScheduler callback, which sets the updated learning rate on the optimizer at each epoch.
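Here is a minimal sketch of that approach; the initial rate (1e-3), drop factor (0.5) and drop period (10 epochs) are placeholder values, and step_decay is just an illustrative name:

import math
from keras.callbacks import LearningRateScheduler

def step_decay(epoch):
    # Halve the learning rate every 10 epochs.
    initial_lr = 1e-3
    drop = 0.5
    epochs_drop = 10
    return initial_lr * math.pow(drop, math.floor(epoch / epochs_drop))

lr_scheduler = LearningRateScheduler(step_decay, verbose=1)
# model.fit(x_train, y_train, epochs=50, callbacks=[lr_scheduler])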
For reference, Adam's moment parameters are beta_1, the exponential decay rate for the 1st moment estimates (defaults to 0.9), and beta_2, the exponential decay rate for the 2nd moment estimates (defaults to 0.999). Each can be a float value, a constant float tensor, or a callable that takes no arguments and returns the actual value to use.
Adam computes an exponential moving average of the gradients (the first moment) and of the squared gradients (the second moment), and uses both to adapt the learning rate for each parameter, whereas RMSProp adapts the learning rate using only the average of the squared gradients.
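As a rough illustration (not the Keras implementation; the function name and arguments below are my own), a single Adam update for one parameter array looks like this:

import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta_1=0.9, beta_2=0.999, eps=1e-8):
    # Moving averages of the gradients (1st moment) and squared gradients (2nd moment).
    m = beta_1 * m + (1 - beta_1) * grad
    v = beta_2 * v + (1 - beta_2) * grad ** 2
    # Bias correction for the zero-initialized averages (t is the 1-based step count).
    m_hat = m / (1 - beta_1 ** t)
    v_hat = v / (1 - beta_2 ** t)
    # Per-parameter update scaled by the second-moment estimate.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v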
The learning rate decay is applied at every iteration, i.e. with every batch, not every epoch. So, if you set decay = 1e-2 and each epoch has 100 batches/iterations, then after 1 epoch your learning rate will be

lr = init_lr * 1 / (1 + 1e-2 * 100)

which is half the initial learning rate (e.g. 1e-3 drops to 5e-4).
So, if I want my learning rate to be 0.75 of the original learning rate at the end of each epoch, I would set the lr_decay to
batches_per_epoch = dataset_size/batch_size
lr_decay = (1./0.75 -1)/batches_per_epoch
It seems to work for me. Also, since the decayed learning rate is recomputed at every iteration, the optimizer never changes the value of the learning rate variable itself; it always starts from the initial learning rate and applies the decay to compute the effective learning rate.
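If you want to verify this, one option is a small custom callback that recomputes the decayed rate from the optimizer's lr, decay and iterations variables (this sketch assumes the Keras 2.x Adam attributes; the class name EffectiveLRLogger is my own):

import keras.backend as K
from keras.callbacks import Callback

class EffectiveLRLogger(Callback):
    # Prints the decayed learning rate Adam actually uses, once per epoch.
    def on_epoch_end(self, epoch, logs=None):
        opt = self.model.optimizer
        lr = K.eval(opt.lr)
        decay = K.eval(opt.decay)
        iterations = K.eval(opt.iterations)
        effective_lr = lr / (1. + decay * iterations)
        print('Epoch %d: effective lr = %.6g' % (epoch + 1, effective_lr))

# model.fit(..., callbacks=[EffectiveLRLogger()])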