Could you explain how it works during training?
```
learning_rate: {
cosine_decay_learning_rate {
learning_rate_base: 8e-2
total_steps: 300000
warmup_learning_rate: .0001
warmup_steps: 400
}
}
```
3. Learning-Rate Warmup. Learning-rate warmup (see, for example, [58]) is a recent approach that uses a relatively small step size at the beginning of training. The learning rate is increased linearly or non-linearly to a target value over the first few epochs, after which it decays toward zero according to the main schedule.
How to determine a good learning rate. You can identify a good learning rate by looking at the TensorBoard graph of loss against training step. You want to find the section where the loss is decreasing fastest, and use the learning rate that was in effect at that training step.
Edit. Linear warmup is a learning-rate schedule in which the learning rate is increased linearly from a low initial value to a target value, which is then held (or decayed) thereafter. This reduces volatility in the early stages of training.
answering my own question :) With the settings above, training starts at lr=0.0001 and reaches lr=0.08 at the end of 400 steps (warmup_steps). Up to step 400 the learning rate is increased linearly; after that, cosine decay takes over. Note that warmup_steps counts training steps, not epochs.
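To make the schedule concrete, here is a small sketch of what the config above implies: a linear ramp from warmup_learning_rate to learning_rate_base over warmup_steps, followed by cosine decay to zero at total_steps. This is an illustration of the shape of the curve, not the exact TF Object Detection API implementation (the function name `lr_at` and the post-warmup decay formula are my assumptions).

```python
import math

# Values taken from the config snippet above.
LR_BASE = 8e-2        # learning_rate_base
TOTAL_STEPS = 300_000 # total_steps
WARMUP_LR = 1e-4      # warmup_learning_rate
WARMUP_STEPS = 400    # warmup_steps

def lr_at(step):
    """Approximate learning rate at a given training step (a sketch,
    not the library's exact schedule)."""
    if step < WARMUP_STEPS:
        # Linear ramp from WARMUP_LR up to LR_BASE over the warmup phase.
        frac = step / WARMUP_STEPS
        return WARMUP_LR + frac * (LR_BASE - WARMUP_LR)
    # Cosine decay from LR_BASE down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * LR_BASE * (1 + math.cos(math.pi * progress))

print(lr_at(0))        # 0.0001 (start of warmup)
print(lr_at(400))      # 0.08   (warmup finished, peak lr)
print(lr_at(300_000))  # ~0.0   (fully decayed)
```

Plotting `lr_at` over `range(TOTAL_STEPS)` reproduces the characteristic shape you would see in TensorBoard: a short linear ramp followed by a long cosine tail.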