In the paper "Attention Is All You Need", section 5.3, the authors suggest increasing the learning rate linearly over a number of warmup steps and then decreasing it proportionally to the inverse square root of the step number.
How do we implement this in PyTorch with the Adam optimizer, preferably without additional packages?
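For reference, the schedule in section 5.3 is (with `warmup_steps = 4000` in the paper):

```python
lrate = d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)
```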
PyTorch provides learning-rate schedulers for adjusting the learning rate during training. Several simple LR schedulers are already implemented and can be found here: https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
In your case you can, just like the built-in LR schedulers do, subclass `_LRScheduler` to implement a variable schedule based on the number of epochs. For a bare-bones version you only need to implement the `__init__()` and `get_lr()` methods.
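For example, a bare-bones subclass implementing the warmup/inverse-square-root schedule might look like the sketch below (the class name `NoamLR` and the choice to reuse `last_epoch` as a per-step counter are my own assumptions, not something the PyTorch docs prescribe):

```python
import torch
from torch.optim.lr_scheduler import _LRScheduler


class NoamLR(_LRScheduler):
    """Linear warmup followed by inverse-square-root decay (section 5.3).

    Hypothetical sketch: scales each param group's base lr by
    d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5).
    """

    def __init__(self, optimizer, d_model, warmup_steps, last_epoch=-1):
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        # `last_epoch` is incremented on every `.step()` call; here it is
        # used as a training-step counter, so call `.step()` once per batch.
        step = max(1, self.last_epoch)
        scale = self.d_model ** -0.5 * min(step ** -0.5, step * self.warmup_steps ** -1.5)
        return [base_lr * scale for base_lr in self.base_lrs]
```

If the base learning rate passed to the optimizer is 1.0, this reproduces the paper's formula exactly; any other base lr simply rescales it.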
Just note that many of these schedulers expect you to call `.step()` once per epoch, but you can also call it more frequently (e.g. once per batch) or even pass a custom argument, just like the cosine-annealing LR scheduler does: https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#CosineAnnealingLR
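A minimal per-step usage sketch, assuming the `NoamLR` class from above and a dummy model (all names and sizes here are placeholders):

```python
import torch

model = torch.nn.Linear(512, 512)                         # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,  # lr=1.0 so the scheduler alone sets the rate
                             betas=(0.9, 0.98), eps=1e-9)  # Adam settings from the paper
scheduler = NoamLR(optimizer, d_model=512, warmup_steps=4000)

for step in range(5):                                     # stands in for iterating over batches
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()                         # dummy loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                      # advance the learning rate every batch
    print(step, scheduler.get_last_lr())
```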